Participation-based student final performance prediction model through interpretable Genetic Programming: Integrating learning analytics, educational data mining and theory
Introduction
The ability to predict a student’s final performance has gained increased emphasis in education (Baker and Yacef, 2009, Romero and Ventura, 2010). One practical application of student performance prediction is enabling instructors to monitor students’ progress and identify at-risk students in order to provide timely interventions (Bienkowski, Feng & Means, 2012). Detecting at-risk students is already difficult in a regular classroom, let alone when classes are much larger and learning happens online, as in MOOCs (Gunnarsson & Alterman, 2012). It would be desirable to expand beyond at-risk students and predict the future performance of all students, allowing a feedback process that enhances learning and awareness for a greater number of students during the course (Zafra & Ventura, 2009). As an automated method, student performance prediction also has the potential to reduce teachers’ assessment workload.
The objective of performance prediction is to estimate an unknown value – the final performance of the student. To accomplish this goal, a training set of previously labeled data instances is used to guide the learning process (Espejo, Ventura, & Herrera, 2010), while another set of correctly labeled instances, named the ‘test set’, is employed to measure the quality of the prediction model obtained (Márquez-Vera, Cano, Romero, & Ventura, 2013). Previous studies that have documented student performance prediction models have focused on statistical modeling and data mining techniques (Gunnarsson and Alterman, 2012, Thomas and Galambos, 2004, Wolff et al., 2013). These traditional modeling techniques have their own limitations. From the perspective of educational data mining (EDM), which focuses on model and algorithm development to improve predictions of learning outcomes (Siemens & Baker, 2012), existing statistical and data mining methods typically lack an established paradigm for optimizing performance prediction. For example, statistical models such as linear regression or logistic regression impose requirements on the distribution of the data and assume a priori regression function structures. Poor estimation and inaccurate inferences result if the basic premises of the regression models are violated (Harrell, 2001), and it is difficult for end users to detect when such violations occur. In addition, there is a strong tradition in the domain of education of employing linear or quadratic models, limiting exploration of potentially more useful models for predicting student performance.
Learning analytics designed to support performance prediction is the type of actionable intelligence teachers and students require to improve learning, and it inherently involves the interpretation and contextualization of data (Agudo-Peregrina, Iglesias-Pradas, Conde-González, & Hernández-García, 2013). Model interpretability in performance prediction is important for two primary reasons (Henery, 1994). First, the constructed model is usually assumed to support decisions made by human users – in our context, to help teachers provide individualized suggestions to students. If the discovered model is a black box that renders predictions without explanation or justification, teachers may not have confidence in it. Second, if the model is not understandable, users may not be able to validate it, which hinders the interactive process of knowledge validation and refinement. Unfortunately, traditional prediction models (e.g., support vector machines, neural networks) require a sophisticated understanding of computation that most teachers do not possess (Romero and Ventura, 2010, Siemens and Baker, 2012). If teachers cannot interpret analytics, they cannot provide meaningful feedback to students. For instance, Campbell, DeBlois, and Oblinger (2007) employed logistic regression, neural networks and other models to identify students at risk of failing and alert instructors to potential issues. While automatic alert messages enable teachers to quickly identify struggling students, a risk signal alone cannot convey enough information to enable personalized interventions (Essa & Ayad, 2012). From an application perspective, typical data mining algorithms work as black boxes, making it difficult to identify the relationships between student performance and the various factors affecting it. These models also demand far more time and computing resources.
Moreover, most previous studies stopped at predicting the failure or success of a student in a course or program (e.g. Hämäläinen and Vinni, 2006, Romero et al., 2013, Zafra and Ventura, 2009), while few went further to predict student performance at more granular levels. With the focus placed solely on low-performing students, interventions risk becoming a tool only for punitive measures (Mintrop & Sunderman, 2009).
Furthermore, previous research on forecasting students’ performance has concentrated on methodology and the exploration of algorithms in ways that tend to overlook educational contexts, theories, and phenomena (Baker and Yacef, 2009, Romero and Ventura, 2010). Computational model results are often difficult, if not impossible, for teachers to use and explain (Ferguson, 2012). To gain a deeper understanding of the factors influencing students’ learning and to build an interpretable student performance prediction model, researchers must contextualize those data factors using educational theories and corresponding semantics. The number of factors (variables) affecting students’ performance makes this a difficult challenge. A large set of selected variables can dramatically diminish both statistical and data mining prediction power (Deegalla and Bostrom, 2006, Vanneschi and Poli, 2012). Data dimensionality can be reduced using feature selection, but in educational situations in which human judgment is key (Siemens & Baker, 2012), it is more suitable to accomplish dimensionality reduction by constructing variables according to human theories (Fancsali, 2011). The automatic processing of data generated by these environments without the additional lens of theory provides a kind of “blunt computational instrument”. Feature selection algorithms, statistical models and data mining grounded in mathematical theories lack connection to the theories of human behavior that are most relevant in a learning analytics system. In practice, approaches to variable selection and construction are usually based on ad-hoc guesswork or extensive experience in the educational field (Cetintas et al., 2009, Nasiri and Minaei, 2012, Tair and El-Halees, 2012). A principled, theory-based method for synthesizing factors from raw data will connect the input to computational prediction models more coherently than previous approaches.
This paper illustrates the potential of integrating prediction models that automate analytics around humans working in computational systems to increase the understandability and utility of learning analytics. We selected the prediction model (Genetic Programming) that, in our results, represents the best tradeoff between model understandability and prediction accuracy. To explore this aim, we synthesize prior work in learning analytics, EDM and activity theory to approach student performance prediction model construction. We draw on a theory proposed by Hrastinski (2009), which emphasizes participation in online learning as a central factor affecting performance. We then contextualize participation-related data factors against a semantic background using an operationalization of activity theory. Integrating activity theory directly into our operationalization of participation indicators allows for a systematic construction of variables and reduces data dimensionality in a CSCL environment to only six aspects.
We then use activity-theory-derived participation indicators as inputs to a Genetic Programming (GP) model to develop our student performance prediction model. The GP model builds a prediction model without assuming any a priori function structure and relies on a theoretically grounded factorization of the data. Moreover, the proposed GP model is more easily understood by users than traditional statistical and data mining algorithms, providing teachers actionable information to offer individualized suggestions to students in any performance state (at-risk, just surviving, average, or good) and increasing students’ awareness when prediction results are also presented to them. As a final product, the model defines tangible relationships between student performance and its related variables. In terms of practical application, therefore, the resulting prediction model may be easily implemented in a real-life context.
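The general idea of GP-based symbolic regression can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the six feature names stand in for the activity-theory participation dimensions only as illustrative assumptions, and the evolution loop uses mutation with truncation selection (no crossover) for brevity.

```python
import random
import operator

# Hypothetical feature names for the six participation dimensions
# (assumptions for illustration, not the paper's exact labels).
FEATURES = ["subject", "tools", "rules", "community",
            "division_of_labor", "object"]

OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    # Protected division keeps evolved expressions numerically safe.
    "div": lambda a, b: a / b if abs(b) > 1e-9 else 1.0,
}

def random_tree(depth=3):
    """Grow a random expression tree over features and constants."""
    if depth == 0 or random.random() < 0.3:
        if random.random() < 0.7:
            return random.choice(FEATURES)
        return round(random.uniform(-1, 1), 2)
    op = random.choice(list(OPS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, row):
    """Evaluate a tree on one student's participation measures."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, row), evaluate(right, row))
    if isinstance(tree, str):
        return row[tree]
    return tree  # numeric constant

def fitness(tree, data):
    """Mean squared error against known final grades (lower is better)."""
    total = 0.0
    for row, y in data:
        d = evaluate(tree, row) - y
        total += d * d
    mse = total / len(data)
    return mse if mse == mse else float("inf")  # treat NaN as worst

def mutate(tree):
    """Replace a random subtree with a freshly grown one."""
    if not isinstance(tree, tuple) or random.random() < 0.2:
        return random_tree(depth=2)
    op, left, right = tree
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

def evolve(data, pop_size=60, generations=40):
    """Truncation selection plus mutation; crossover omitted for brevity."""
    population = [random_tree() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, data))
        survivors = population[: pop_size // 2]
        population = survivors + [
            mutate(random.choice(survivors))
            for _ in range(pop_size - len(survivors))
        ]
    return min(population, key=lambda t: fitness(t, data))
```

Because the evolved individual is an explicit expression tree over named participation measures, it can be printed and read directly, which is the interpretability property the GP approach is valued for here.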
This study provides a practical and interpretable student performance prediction model that enables teachers to discern differences in performance among geometry students working in groups of three to five in a synchronous CSCL environment, Virtual Math Teams with Geogebra (VMTwG). The paper is organized as follows: Section 2 discusses related work and background information. Section 3 introduces the theoretical framework behind this study. Section 4 describes the context of the study and the data format. Section 5 describes the methodology. Section 6 presents experimental results and analysis. Section 7 discusses the results. Section 8 summarizes the study, pointing out limitations and future research directions.
Section snippets
Literature review
The development of student performance prediction models is one of the oldest and most popular practices in education (Romero & Ventura, 2010). There are many examples of the application of computational techniques to predict student performance. Several exemplary works using these techniques are described here to provide our research background.
Barber and Sharkey (2012) predicted student success in a course using a logistic regression technique that incorporated data generated from learning
Online learning as online participation
Research on technology-mediated learning is increasingly influenced by interaction and practice focused lenses pioneered by Vygotskiĭ (1978) and Wenger (1998). Knowledge is a construct that is not only recognized in individual minds but also “in the discourse among individuals, the social relationships that bind them, the physical artifacts that they use and produce, and the theories, models and methods they use to produce them” (Jonassen & Land, 2000). Most recently, Hrastinski (2009) argued
VMT with Geogebra (VMTwG)
In this study, we operationalize activity theory in order to make sense of electronic trace data from a math discourse with 122 students which took place in 2013–14. Our analysis focused on four modules of a course designed to be taught with Virtual Math Teams with Geogebra (VMTwG) software (Fig. 2). The four modules that were analyzed included teams of three to five members. The four modules included: “Constructing Dynamic-Geometry Objects,” “Exploring Triangles,” “Creating Construction
Measure construction
Since the log data are centered on event types, to facilitate measure construction we first process each event type into four participation dimensions (Individual, Group, Event Types, Module Set) for each student. The Individual category is the sum of all personal actions (frequency) in which the source and the target of the action are the same in a given event type (Fig. 3). Similarly, the Group category is the sum (frequency) of all actions the student makes in group projects in
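The frequency-based Individual/Group counting described above can be sketched as follows. The tuple-based log schema and field names are assumptions for illustration, not the actual VMTwG log format.

```python
from collections import Counter

def participation_counts(events):
    """Count Individual (source == target) and Group (source != target)
    actions per (student, event_type), following the frequency-based
    definitions described above. Each event is assumed to be a tuple
    (student, source, target, event_type)."""
    individual, group = Counter(), Counter()
    for student, source, target, event_type in events:
        key = (student, event_type)
        if source == target:
            individual[key] += 1  # personal action on one's own object
        else:
            group[key] += 1       # action directed at another member
    return individual, group
```

For example, a chat message a student sends to a teammate would increment that student's Group count for the "chat" event type, while an action on the student's own construction would increment the Individual count.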
Activity theory based measures
In order to reduce data dimensionality and contextualize data for instructors, this study built measures around students’ participation in a course derived from activity theory. As a result, each student can be represented by a six-dimensional set with a semantic background, as illustrated in Table 4. In fact, instructors may already obtain meaningful information by looking at Table 4 alone. Through simply comparing students by column, the instructor is able to discern which student performs well
Discussion
Building a practical and interpretable student performance prediction model is a shared goal of learning analytics and EDM. It is a difficult task, not only because the factors involved can be overwhelming but also because of the lack of a semantic background for teachers to interpret the developed model (Goggins et al., 2009, Gress et al., 2010). Most previously developed models identified at-risk students but were unable to predict student performance at a more granular level. As an exploratory study,
Conclusion
This paper describes a methodology which connects perspectives from learning analytics, EDM, theory and application to solve the problem of predicting students’ performance in a CSCL learning environment with small datasets. We operationalized activity theory to holistically quantify student participation in the environment. We then coded an advanced GP technique to construct the prediction model. Results show that the GP-based model is interpretable and has an optimized prediction rate as
References (76)
- An interpretable classification rule mining algorithm. Information Sciences (2013).
- Measurement and assessment in computer-supported collaborative learning. Computers in Human Behavior (2010).
- A theory of online learning as online participation. Computers & Education (2009).
- An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems (2011).
- Predicting students’ final performance from participation in on-line discussion forums. Computers & Education (2013).
- Genetic programming in classifying large-scale data: An ensemble method. Information Sciences (2004).
- Genetic programming for cross-release fault count predictions in large and complex software projects. Evolutionary Computation and Optimization Algorithms in Software Engineering (2010).
- Agudo-Peregrina, Á. F., Iglesias-Pradas, S., Conde-González, M. Á., & Hernández-García, Á. (2013). Can we predict...
- The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining (2009).
- Using activity theory to understand the systemic tensions characterizing a technology-rich introductory astronomy course. Mind, Culture, and Activity (2002).
- An activity theory perspective on student-reported contradictions in international telecollaboration. Language Learning & Technology.
- A genetic type-2 fuzzy logic based system for the generation of summarised linguistic predictive models for financial applications. Soft Computing.
- Predicting students’ marks from Moodle logs using neural network models. Current Developments in Technology-Assisted Education.
- Academic analytics: A new tool for a new era. Educause Review.
- On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning.
- Activity theory and individual and social transformation. Perspectives on Activity Theory.
- A survey on the application of genetic programming to classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews.
- Data mining and knowledge discovery with evolutionary algorithms.
- On the importance of comprehensible classification models for protein function prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB).
- An introduction to language.
- Network analytic techniques for online chat.
- Group informatics: A methodological approach and ontology for sociotechnical group research. Journal of the American Society for Information Science and Technology.
- Creating a model of the dynamics of socio-technical groups. User Modeling and User-Adapted Interaction.
- Genetic algorithms and machine learning. Machine Learning.
- Activity theory and distributed cognition: Or what does CSCW need to DO with theories? Computer Supported Cooperative Work (CSCW).