Computers in Human Behavior, Volume 47, June 2015, Pages 168-181

Participation-based student final performance prediction model through interpretable Genetic Programming: Integrating learning analytics, educational data mining and theory

https://doi.org/10.1016/j.chb.2014.09.034

Highlights

  • A theory-based method for computational student performance prediction.

  • A Genetic Programming model for grade prediction is described and tested.

  • Model evaluation suggests high success rates for predicting student grades.

Abstract

Building a student performance prediction model that is both practical and understandable for users is a challenging task, complicated by the many confounding factors that must be collected and measured. Most current prediction models are difficult for teachers to interpret. This poses significant problems for model use (e.g. personalizing education and intervention) as well as model evaluation. In this paper, we synthesize learning analytics approaches, educational data mining (EDM) and HCI theory to explore the development of more usable prediction models and prediction model representations, using data from a collaborative geometry problem solving environment: Virtual Math Teams with Geogebra (VMTwG). First, based on theory proposed by Hrastinski (2009) establishing online learning as online participation, we operationalized activity theory to holistically quantify students’ participation in the CSCL (Computer-Supported Collaborative Learning) course. As a result, six variables, Subject, Rules, Tools, Division of Labor, Community, and Object, are constructed. This analysis of variables prior to the application of a model distinguishes our approach from prior approaches (feature selection, ad hoc guesswork, etc.). The approach described diminishes data dimensionality and systematically contextualizes data in a semantic background. Second, an advanced modeling technique, Genetic Programming (GP), underlies the developed prediction model. We demonstrate how connecting the structure of VMTwG trace data to a theoretical framework and processing that data with the GP algorithmic approach outperforms traditional models in prediction rate and interpretability. Theoretical and practical implications are then discussed.

Introduction

The ability to predict a student’s final performance has gained increased emphasis in education (Baker and Yacef, 2009, Romero and Ventura, 2010). One practical application of student performance prediction is for instructors to monitor students’ progress and identify at-risk students in order to provide timely interventions (Bienkowski, Feng & Means, 2012). It is already difficult to detect at-risk students in a regular classroom, let alone when classes are much larger and learning happens online, as in MOOCs (Gunnarsson & Alterman, 2012). It would also be desirable to move beyond at-risk students and predict the future performance of all students, enabling a feedback process that enhances learning and awareness for a greater number of students during the course (Zafra & Ventura, 2009). As an automated method, student performance prediction has the potential to reduce teachers’ assessment workload.

The objective of performance prediction is to estimate an unknown value: the final performance of the student. To accomplish this goal, a training set of previously labeled data instances is used to guide the learning process (Espejo, Ventura, & Herrera, 2010), while another set of correctly labeled instances, named the ‘test set’, is employed to measure the quality of the prediction model obtained (Márquez-Vera, Cano, Romero, & Ventura, 2013). Previous studies that have documented student performance prediction models have focused on statistical modeling and data mining techniques (Gunnarsson and Alterman, 2012, Thomas and Galambos, 2004, Wolff et al., 2013). These traditional modeling techniques have their own limitations. From the perspective of educational data mining (EDM), which focuses on model and algorithm development to improve predictions of learning outcomes (Siemens & Baker, 2012), existing statistical and data mining methods typically lack an established paradigm for optimizing performance prediction. For example, statistical models such as linear regression or logistic regression impose requirements on the distribution of the data and assume a priori regression function structures. Poor estimates and inaccurate inferences result when the basic assumptions of regression models are violated (Harrell, 2001), and it is difficult for end users to detect when such violations occur. In addition, there is a strong tradition in education of employing linear or quadratic models, which limits exploration of potentially more useful models for predicting student performance.
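To make the training/test procedure concrete, the following minimal sketch (not taken from the paper) shows how a labeled dataset is split and how prediction quality is measured on the held-out test set. It uses scikit-learn, a stand-in linear model, and randomly generated placeholder data; the feature count of six mirrors the participation variables introduced later, but the values are entirely hypothetical.

```python
# Minimal sketch of the train/test evaluation described above.
# X and y are placeholders, not the paper's data; any estimator could
# stand in for the prediction model being assessed.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression  # stand-in model

rng = np.random.default_rng(0)
X = rng.random((60, 6))      # 60 students, 6 participation variables (hypothetical)
y = rng.random(60) * 100     # final grades on a 0-100 scale (hypothetical)

# Labeled instances are split into a training set (to fit the model)
# and a held-out test set (to measure the quality of the obtained model).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("MAE on test set:", mean_absolute_error(y_test, model.predict(X_test)))
```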

Learning analytics designed to support performance prediction is the kind of actionable intelligence teachers and students require to improve learning, and it inherently involves the interpretation and contextualization of data (Agudo-Peregrina, Iglesias-Pradas, Conde-González, & Hernández-García, 2013). Model interpretability in performance prediction is important for two primary reasons (Henery, 1994). First, the constructed model is usually assumed to support decisions made by human users; in our context, to help teachers provide individualized suggestions to students. If the discovered model is a black box that renders predictions without explanation or justification, teachers may not have confidence in it. Second, if the model is not understandable, users may not be able to validate it, which hinders the interactive aspect of knowledge validation and refinement. Unfortunately, traditional prediction models (e.g. support vector machines, neural networks) require a sophisticated understanding of computation that most teachers do not possess (Romero and Ventura, 2010, Siemens and Baker, 2012). If teachers cannot interpret analytics, they cannot provide meaningful feedback to students. For instance, Campbell, DeBlois, and Oblinger (2007) employed logistic regression, neural networks and other models to identify students at risk of failing and alert instructors to potential issues. While automatic alert messages enable teachers to quickly identify struggling students, a risk signal alone does not convey enough information to enable personalized interventions (Essa & Ayad, 2012). From an application perspective, typical data mining algorithms usually work as black boxes, making it difficult to identify the relationship between student performance and the various factors affecting it; they also demand far more time and computing resources. Moreover, most previous studies stopped at predicting the failure or success of a student in a course or program (e.g. Hämäläinen and Vinni, 2006, Romero et al., 2013, Zafra and Ventura, 2009), while few went further to predict student performance at more granular levels. When the focus is placed solely on low-performing students, prediction models risk becoming tools for purely punitive interventions (Mintrop & Sunderman, 2009).

Moreover, previous research in forecasting students’ performance has concentrated on methodology and the exploration of algorithms in ways that tend to overlook educational contexts, theories, and phenomena (Baker and Yacef, 2009, Romero and Ventura, 2010). Computational model results are often difficult, if not impossible, for teachers to use and explain (Ferguson, 2012). To gain a deeper understanding of the factors influencing students’ learning and to build an interpretable student performance prediction model, researchers must contextualize those data factors using educational theories and corresponding semantics. The number of factors (variables) affecting students’ performance makes this a difficult challenge. A large set of selected variables can dramatically diminish both statistical and data mining prediction power (Deegalla and Bostrom, 2006, Vanneschi and Poli, 2012). Data dimensionality can be reduced using feature selection, but in educational situations in which human judgment is key (Siemens & Baker, 2012), it is more suitable to reduce dimensionality by constructing variables according to human theories (Fancsali, 2011). The automatic processing of data generated by these environments without the additional lens of theory provides a kind of “blunt computational instrument”. Feature selection algorithms, statistical models and data mining grounded in mathematical theories lack connection to the theories of human behavior that are most relevant in a learning analytics system. In practice, approaches to variable selection and construction are usually based on ad hoc guesswork or extensive experience in the educational field (Cetintas et al., 2009, Nasiri and Minaei, 2012, Tair and El-Halees, 2012). A principled, theory-based method for synthesizing factors from raw data connects the input to computational prediction models more coherently than previous approaches.

This paper illustrates the potential for integrating prediction models, focused on automating analytics around humans working in computational systems, to increase the understandability and utility of learning analytics. We selected the prediction model (Genetic Programming) that, based on our results, represents the best tradeoff between model understandability and prediction accuracy. To explore this aim, we synthesize prior work in learning analytics, EDM and activity theory to approach student performance prediction model construction. We draw on a theory proposed by Hrastinski (2009), which emphasizes participation in online learning as a central factor affecting performance. We then contextualize participation-related data factors on a semantic background using an operationalization of activity theory. Integrating activity theory directly into our operationalization of participation indicators allows for a systematic construction of variables and reduces data dimensionality in a CSCL environment to only six aspects.

We then use the activity-theory-derived participation indicators as inputs to a Genetic Programming (GP) model to develop our student performance prediction model. GP can build a prediction model without assuming any a priori function structure, and it relies on a theoretically grounded factorization of the data. Moreover, the proposed GP model is more easily understood by users than traditional statistical and data mining algorithms, providing teachers with actionable information for offering individualized suggestions to students in any performance state (at-risk, just surviving, average or good), and increasing students’ awareness when prediction results are also presented to them. As a final product, this model defines tangible relationships between student performance and its related variables. Therefore, in terms of practical application, the resulting prediction model may be easily implemented in a real-life context.
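The GP implementation described in the paper is the authors’ own. Purely as an illustrative sketch of how a genetic-programming model can express a grade prediction as a readable formula over the six activity-theory variables, the example below uses the open-source gplearn library with made-up data; the settings, variable names and grade scale are assumptions, not the paper’s configuration.

```python
# Illustrative sketch only: symbolic regression via genetic programming
# using the third-party gplearn library, not the authors' implementation.
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(1)
feature_names = ["Subject", "Rules", "Tools", "DivisionOfLabor", "Community", "Object"]
X = rng.random((60, 6))      # hypothetical participation measures per student
y = rng.random(60) * 100     # hypothetical final grades

gp = SymbolicRegressor(
    population_size=500,
    generations=20,
    function_set=("add", "sub", "mul", "div"),  # keep evolved expressions simple
    parsimony_coefficient=0.01,                 # penalize overly long programs
    feature_names=feature_names,
    random_state=1,
)
gp.fit(X, y)

# The evolved model is an explicit symbolic expression over the six
# participation variables, which is what makes it inspectable by teachers.
print(gp._program)
```

The key design point this sketch illustrates is that the output of GP is a formula rather than a set of opaque weights, so a teacher can read which participation variables drive the predicted grade.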

This study provides a practical and interpretable student performance prediction model that enables teachers to discern differences in performance among students in a classroom of geometry learners working in small groups of three to five in a synchronous CSCL environment, Virtual Math Teams with Geogebra (VMTwG). The paper is organized as follows: Section 2 discusses related work and background information. Section 3 introduces the theoretical framework behind this study. Section 4 describes the context of the study and the data format. Section 5 describes the methodology. Section 6 presents experimental results and analysis. Section 7 discusses the results. Section 8 summarizes this study, pointing out limitations and future research directions.


Literature review

The development of student performance prediction models is one of the oldest and most popular practices in education (Romero & Ventura, 2010). There are many examples of the application of computational techniques to predict student performance. Several exemplary works using these techniques are described here to provide our research background.

Barber and Sharkey (2012) predicted student success in a course using a logistic regression technique that incorporated data generated from learning

Online learning as online participation

Research on technology-mediated learning is increasingly influenced by the interaction- and practice-focused lenses pioneered by Vygotskiĭ (1978) and Wenger (1998). Knowledge is a construct that is not only recognized in individual minds but also “in the discourse among individuals, the social relationships that bind them, the physical artifacts that they use and produce, and the theories, models and methods they use to produce them” (Jonassen & Land, 2000). Most recently, Hrastinski (2009) argued

VMT with Geogebra (VMTwG)

In this study, we operationalize activity theory in order to make sense of electronic trace data from a math discourse with 122 students that took place in 2013–14. Our analysis focused on four modules of a course designed to be taught with the Virtual Math Teams with Geogebra (VMTwG) software (Fig. 2). Each of the analyzed modules included teams of three to five members. The four modules were: “Constructing Dynamic-Geometry Objects,” “Exploring Triangles,” “Creating Construction

Measure construction

Since the log data is centered on event types, to facilitate measure construction we first process each event type into four participation dimensions (Individual, Group, Event Types, Module Set) for each student. The Individual category is the sum of all personal actions (frequency) in which the source and the target of the action are the same in a given event type (Fig. 3). Similarly, the Group category is the sum (frequency) of all actions the student makes in group projects in
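Because this snippet is truncated, the exact counting rules are not fully shown; as a rough sketch under those stated assumptions, per-student Individual and Group frequencies could be tallied from trace-log events as follows. The field names (student, source, target, event_type, scope) are hypothetical and the VMTwG log schema may differ.

```python
# Rough sketch of tallying participation frequencies from trace-log events.
# Field names are hypothetical placeholders, not the actual VMTwG schema.
from collections import defaultdict

def tally_participation(events):
    """Count Individual actions (source == target) and Group actions per student."""
    counts = defaultdict(lambda: {"Individual": 0, "Group": 0})
    for e in events:
        if e["source"] == e["target"]:      # personal action within a given event type
            counts[e["student"]]["Individual"] += 1
        if e.get("scope") == "group":       # action taken within group work
            counts[e["student"]]["Group"] += 1
    return dict(counts)

sample = [
    {"student": "s1", "source": "s1", "target": "s1", "event_type": "geogebra_edit", "scope": "individual"},
    {"student": "s1", "source": "s1", "target": "s2", "event_type": "chat", "scope": "group"},
]
print(tally_participation(sample))
```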

Activity theory based measures

In order to reduce data dimensionality and contextualize data for instructors, this study built measures around students’ participation in a course, derived from activity theory. As a result, each student can be represented by a six-dimensional set with a semantic background, as illustrated in Table 4. In fact, instructors may already obtain meaningful information by looking at Table 4 alone. Simply by comparing students column by column, the instructor is able to discern which student performs well
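Table 4 is not reproduced in this snippet view. As an assumed illustration of the six-dimensional representation it describes, each student could be stored as a record keyed by the activity-theory components, so that instructors (or a downstream model) can compare students column by column; the values and student labels below are made up.

```python
# Hypothetical illustration of the six-dimensional, activity-theory-based
# representation of each student (values and student IDs are made up).
import pandas as pd

students = pd.DataFrame(
    [
        {"student": "s1", "Subject": 34, "Rules": 12, "Tools": 58, "DivisionOfLabor": 9, "Community": 21, "Object": 15},
        {"student": "s2", "Subject": 20, "Rules": 30, "Tools": 41, "DivisionOfLabor": 14, "Community": 35, "Object": 22},
    ]
).set_index("student")

# Comparing students column by column, as the paper suggests an instructor
# can do with Table 4 (here, ranking by the Tools dimension).
print(students.sort_values("Tools", ascending=False))
```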

Discussion

Building a practical and interpretable student performance prediction model is a shared goal of learning analytics and EDM. It is a difficult task, not only because the factors involved can be overwhelming but also because of the lack of a semantic background for teachers to interpret the developed model (Goggins et al., 2009, Gress et al., 2010). Most previously developed models identified at-risk students but were unable to predict student performance at a more granular level. As an exploratory study,

Conclusion

This paper describes a methodology which connects perspectives from learning analytics, EDM, theory and application to solve the problem of predicting students’ performance in a CSCL learning environment with small datasets. We operationalized activity theory to holistically quantify student participation in the environment. We then coded an advanced GP technique to construct the prediction model. Results show that the GP-based model is interpretable and has an optimized prediction rate as

References

  • Barber, R., & Sharkey, M. (2012). Course correction: Using analytics to predict course success. In Proceedings of the...
  • Basharina, O.K. (2007). An activity theory perspective on student-reported contradictions in international telecollaboration. Language Learning & Technology.
  • Bernardo, D., et al. (2013). A genetic type-2 fuzzy logic based system for the generation of summarised linguistic predictive models for financial applications. Soft Computing.
  • Bienkowski, M., Feng, M., & Means, B. (2012). Enhancing Teaching and Learning Through Educational Data Mining and...
  • Calvo-Flores, M.D., et al. (2006). Predicting students’ marks from Moodle logs using neural network models. Current Developments in Technology-Assisted Education.
  • Campbell, J.P., et al. (2007). Academic analytics: A new tool for a new era. Educause Review.
  • Cano, A., Zafra, A., & Ventura, S. (2010). Solving classification problems using genetic programming algorithms on...
  • Cetintas, S., Si, L., Xin, Y. P., & Hord, C. (2009). Predicting correctness of problem solving from low-level log data...
  • Deegalla, S., & Bostrom, H. (2006). Reducing high-dimensional data by principal component analysis vs. random...
  • Domingos, P., et al. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning.
  • Engeström, Y. (1999). Activity theory and individual and social transformation. Perspectives on Activity Theory.
  • Espejo, P. G., Romero, C., Ventura, S., & Herrera, F. (2005). Induction of classification rules with grammar-based...
  • Espejo, P.G., et al. (2010). A survey on the application of genetic programming to classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews.
  • Essa, A., & Ayad, H. (2012). Student success system: risk analytics and data visualization using ensembles of...
  • Fancsali, S. E. (2011). Variable construction for predictive and causal modeling of online education data. In...
  • Ferguson, R. (2012). The state of learning analytics in 2012: A review and future challenges. Knowledge Media...
  • Freitas, A.A. (2002). Data mining and knowledge discovery with evolutionary algorithms.
  • Freitas, A.A., et al. (2010). On the importance of comprehensible classification models for protein function prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB).
  • Fromkin, V., et al. (2009). An introduction to language.
  • Goggins, S.P., et al. Network analytic techniques for online chat.
  • Goggins, S., Laffey, J., & Galyen, K. (2009). Social ability in online groups: Representing the quality of interactions...
  • Goggins, S. P., Laffey, J., Amelung, C., & Gallagher, M. (2010). Social intelligence in completely online groups. In...
  • Goggins, S.P., et al. (2013). Group informatics: A methodological approach and ontology for sociotechnical group research. Journal of the American Society for Information Science and Technology.
  • Goggins, S., et al. (2013). Creating a model of the dynamics of socio-technical groups. User Modeling and User-Adapted Interaction.
  • Goldberg, D.E., et al. (1988). Genetic algorithms and machine learning. Machine Learning.
  • Gunnarsson, B. L., & Alterman, R. (2012). Predicting failure: A case study in co-blogging. In Proceedings of the 2nd...
  • Halverson, C.A. (2002). Activity theory and distributed cognition: Or what does CSCW need to DO with theories? Computer Supported Cooperative Work (CSCW).
  • Hämäläinen, W., & Vinni, M. (2006). Comparison of machine learning methods for intelligent tutoring systems. In...