Predictive analytic models of student success in higher education: A review of methodology

Ying Cui and Fu Chen
Department of Educational Psychology, University of Alberta, Edmonton, Alberta, Canada

Ali Shiri
Department of Library and Information Studies, University of Alberta, Edmonton, Alberta, Canada

Yaqin Fan
Department of Educational Technology, Northeast Normal University, Changchun, Jilin, China

Information and Learning Sciences, Vol. 120 No. 3/4, 2019, pp. 208-227. DOI 10.1108/ILS-10-2018-0104

Received 11 October 2018; Revised 24 January 2019, 11 February 2019; Accepted 19 February 2019
Abstract
Purpose – Many higher education institutions are investigating the possibility of developing predictive
student success models that use different sources of data available to identify students that might be at risk of
failing a course or program. The purpose of this paper is to review the methodological components related to
the predictive models that have been developed or currently implemented in learning analytics applications in
higher education.
Design/methodology/approach – Literature review was completed in three stages. First, the authors
conducted searches and collected related full-text documents using various search terms and keywords.
Second, they developed inclusion and exclusion criteria to identify the most relevant citations for the purpose
of the current review. Third, they reviewed each document from the final compiled bibliography and focused
on identifying information that was needed to answer the research questions.
Findings – In this review, the authors identify methodological strengths and weaknesses of current
predictive learning analytics applications and provide the most up-to-date recommendations on predictive
model development, use and evaluation. The review results can inform important future areas of research that
could strengthen the development of predictive learning analytics for the purpose of generating valuable
feedback to students to help them succeed in higher education.
Originality/value – This review provides an overview of the methodological considerations for
researchers and practitioners who are planning to develop or currently in the process of developing predictive
student success models in the context of higher education.
Keywords Higher education, Machine learning, Student success, Learning analytics,
Educational data mining, Methodology review, Predictive models
Paper type Literature review
Introduction
The 2016 Horizon Report Higher Education Edition (Johnson et al., 2016) predicts that
learning analytics will be increasingly adopted by higher education institutions across the
globe in the near future to make use of student data gathered through online learning
environments to improve, support and extend teaching and learning. The 2016 Horizon
report defines learning analytics as “an educational application of web analytics aimed at
learner profiling, a process of gathering and analyzing details of individual student
interactions in online learning activities” (p. 38). It can help to “build better pedagogies,
empower active learning, target at-risk student populations, and assess factors affecting
completion and student success” (p. 38). Terms such as “educational data mining,”
“academic analytics” and the more commonly adopted “learning analytics” have been used
in the literature to refer to the methods, tools and techniques for gathering very large
volumes of online data about learners and their activities and contexts. The advantages of
learning analytics have been enumerated by Siemens et al. (2011) and Siemens and Long
(2011), and some of the important ones include: early detection of at-risk students and
generating alerts for learners and educators; personalization and adaption of learning
process and content; extension and enhancement of learner achievement, motivation and
confidence by providing learners with timely information about their performance and that
of their peers; higher quality learning design and improved curriculum development;
interactive visualizations of complex information that give learners and educators the
ability to “zoom in” or “zoom out” on data sets; and more rapid achievement of learning
goals by giving learners access to tools that help them to evaluate their progress.
Many higher education institutions are beginning to explore the use of learning analytics
for improving student learning experiences (Sclater et al., 2016). According to a recent
literature review on learning analytics in higher education (Leitner et al., 2017), the most
popular strand of research in the field is to use student data to make predictions of their
performance (36 citations out of the total of 102 found in the literature review). The primary
goal of this area of research is to develop predictive student success models that make use of
different sources of data available within a higher education institution to identify students
who might be at risk of failing a course or program and could benefit from additional help.
This type of learning analytics research and application is important as it generates
actionable information that allows students to monitor and self-regulate their own learning,
as well as allows instructors to develop and implement effective learning interventions and
ultimately help students succeed.
The purpose of the present paper is to systematically review the methodological
components of the predictive models that have been developed or currently implemented in
the learning analytics applications in higher education. Student learning is a complex
phenomenon as cognitive, social and emotional factors, together with prior experience, all
influence how students learn and perform (Illeris, 2006). As a result, to predict student
performance in a course or a program, many variables need to be considered, such as
cognitive variables associated with targeted knowledge and skills in the domain and socio-
emotional variables, such as engagement, motivation and anxiety. Student demographic
characteristics and past academic history are also often used in model building to reflect
information related to student prior experiences. Supervised machine learning techniques
such as logistic regression and neural networks are then applied to these student variables
to train and test the predictive models so as to estimate the likelihood of a student’s
successful passing of a course. Kotsiantis (2007) specified several key issues that are
consequential to the success of supervised machine learning applications, including variable
(i.e. attributes, features) selection, data preprocessing, choosing specific learning algorithms
and model validation. These issues are directly related to the steps of the typical process of
statistical modeling in quantitative research, which have guided us in terms of identifying
our research questions, as outlined below:
RQ1. What data sources and student variables were used to predict student
performance in higher education?
RQ2. How were data preprocessed and how were missing data handled prior to their
use in training, testing and validating predictive learning analytics models?

RQ3. Which machine learning techniques were used in developing predictive learning
analytics models?

RQ4. How were the accuracy and generalizability of the predictive learning analytics
models evaluated?
The main goal of this review is to provide an overview of the methodological considerations
for researchers and practitioners who are planning to develop or currently in the process of
developing predictive student success models in the context of higher education. The
answers to these four questions can provide a practical guide regarding the steps of
developing and evaluating predictive models of student success, from variable selection and
data preparation through results validation. The review also helps identify methodological
strengths and weaknesses of the current predictive learning analytics applications in higher
education so we can provide the most up-to-date recommendations on predictive model
development, use and evaluation. In this process, we also identify areas where research on
predictive learning analytics is lacking, which will inform important future areas of research
that could strengthen the development of predictive learning analytics for the purpose of
generating valuable feedback to students to help them succeed in higher education.
Method
Our literature review was completed in three stages. First, we conducted searches and
collected related full-text documents using various search terms and keywords related to
predictive learning analytics applications in higher education. The search strings include:
(student performance OR student success OR drop out OR student graduation OR at-risk student)
AND (systems OR application OR method OR process OR system OR technique OR methodology
OR procedure) AND (“educational data mining” OR “learning analytics”) AND (prediction).
We selected “learning analytics” and “educational data mining” as two widely and
interchangeably used search terms in the literature for this study. Siemens and Baker
(2012) enumerate the common research areas, interests and approaches between learning
analytics and educational data mining. Furthermore, Ferguson (2012) makes a
clear distinction between these two terms (i.e. learning analytics and educational data
mining) and academic analytics. Learning analytics and educational data mining address
technical and educational challenges to benefit and support students and faculty, whereas
academic analytics addresses political and economic challenges that benefit funders,
administrators and marketing at institutional, regional and government levels. Also, a quick
exact phrase searching in Google shows the popularity and the extent of information on
learning analytics (1,080,000 hits) and educational data mining (372,000 hits) as compared to
academic analytics (44,100 hits).
We conducted searches in four international databases of well-known academic
resources and publishers, namely, ScienceDirect, IEEE Xplore, ERIC and Springer. The
rationale for the choice of these four databases is that learning analytics, as an emerging
field of research and practice, involves the interdisciplinary area of science, social science,
education, engineering, psychology and other related fields. These four databases together
cover a broad spectrum of the interdisciplinary area involved in learning analytics. In
addition, they offer various scholarly products from conference proceedings, book chapters
and journal articles to funding agencies research reports, dissertations and policy papers.
For instance, ScienceDirect has an international coverage of physical sciences and
engineering, life sciences, health sciences and social sciences and humanities with over 12
million pieces of content from 4,051 academic journals and 28,417 books. ERIC has an
extensive coverage and collection of the literature in education and psychology with links to
more than 330,000 full-text documents. IEEE Xplore has a focus on computer science,
electrical engineering and electronics and allied fields and provides access to more than four
million documents. Springer covers a variety of topics in the sciences, social sciences and
humanities with over ten million scientific documents.
To filter the irrelevant articles, our review was narrowed down to journal articles, full-
text conference papers and book chapters that could be downloaded from the library website.
Full-text conference papers were included in our review based on the consideration that in
some fields such as computer science, conference papers are greatly valued as they are
typically peer reviewed and highly selective and considered to be more timely and with
greater novelty. According to Meyer et al. (2009), acceptance rates at selective computer
science conferences range between 10 and 20 per cent. The authors argued that “it is
important not to use journals as the only yardsticks for computer scientists” (p. 32). In
addition, given the emerging nature of learning analytics as a research and development
domain and that many new learning analytics systems and applications and their empirical
studies tend to be reported at conferences, we decided to include a broad range of scholarly
publications, including conference proceedings, to capture the recent literature of the area. A
cursory look at our reviewed papers shows a reasonable combination of journal articles and
conference papers, with conference papers constituting some of the publications after the
year 2015. We understand that some conference proceedings may not be as rigorous as
journal papers, but we wanted to ensure that the recent studies of the area are captured, even
if they are presented in conference proceedings. Our search process yielded 742 results from
all the four databases, which formed the initial list of citations. The publication time of the
selected citations spanned from 2002 to early 2018. Figure 1 displays the number of
publications reviewed over time, which shows that research on learning analytics has
gained more and more popularity in recent years.
Second, we developed inclusion and exclusion criteria to identify the most relevant
citations for the purpose of the current review. For this review, we excluded short conference
papers and abstracts because of their typical lack of detailed information about
methodologies. Because of our focus on the practical methodological considerations during
modeling process of real data applications in higher education, we excluded studies
conducted in educational settings other than higher education (e.g. high schools); we also
excluded citations that are purely theoretical or conceptual without empirical data/results. In
addition, we excluded studies that focused on clustering students into different groups
based on their academic behavior or background. Although students who are grouped
together might share similar profiles that could be linked to student success or dropout, our
review focused on explicit predictive models with specific predictor variables (i.e. variables
used to predict another variable), such as student background variables or activity data
from learning management systems, and the outcome variable (i.e. the variable whose value
depends on predictor variables) such as student course grades or last year grade point
average (GPA). As a result, of the 742 citations compiled from the first stage of the literature
review, a total of 121 citations remained after applying our exclusion criteria.
Third, we reviewed each document from the final compiled bibliography of 121 articles
and focused on identifying information that was needed to answer our research questions
regarding the four methodological components of predictive algorithms, namely:
data sources and student variables;
procedures of data handling and processing;
adopted machine learning techniques; and
evaluation of accuracy and generalizability.
We synthesize the current practice and findings from the 121 articles and conclude our
review with a number of recommendations for predictive algorithm development, analysis
and use based on the literature and our own evaluation, and in this process, we highlight
important areas for further research.
Results
Based on our review, there are two major categories of studies that focused on the prediction
of student performance in the higher education context. Of the 121 articles reviewed in our
study, the majority of studies (a total of 86 studies) focused on the prediction of student
performance and achievement at the course level in specific undergraduate or graduate
courses. These courses are delivered in a variety of formats, including traditional face-to-
face, online or blended. In these studies, student performance and achievement is typically
measured by their assignment scores or final grades on a variety of different scales,
including continuous scales (e.g. percentages), binary scales (e.g. pass or fail) and categorical
scales (e.g. fail, good, very good, or excellent). Course-level prediction of student course
performance is intended to help individual instructors monitor student progress and predict
how well a student will perform in the course so early interventions can be implemented.
Course-level predictions have been also applied to student outcome in massive open online
courses (MOOCs) in a number of studies (Al-Shabandar et al., 2017; Boyer and
Veeramachaneni, 2015; Brinton et al., 2016; Chen et al., 2016; Deeva et al., 2017; Hughes and
Dobbins, 2015; Kidziński et al., 2016; Klüsener and Fortenbacher, 2015; Liang et al., 2016;
Li et al., 2017; Liang et al., 2016; Pérez-Lemonche et al., 2017; Ruipérez-Valiente et al., 2017;
Xing et al., 2016; Yang et al., 2017; Ye and Biswas, 2014). The primary aim of predictions for
MOOCs is to identify inactive students to prevent early dropout. As a result, the outcome
variable being predicted in MOOCs is typically course completion or dropout. Another type
of course-level prediction is to estimate student performance in future courses (Elbadrawy
et al., 2016; Polyzou and Karypis, 2016; Sweeney et al., 2016), which could help students
select courses in which they are predicted to succeed and therefore create personalized
degree pathways to facilitate successful and timely graduation.
The second category of studies of predicting student performance (a total of 35 studies)
has focused on the program-level prediction of student outcome in higher education
institutions, including student overall academic performance as measured by student
cumulative GPA (CGPA) or GPA at graduation, student retention or degree completion. For
example, Dekker et al. (2009) predicted the dropout of electrical engineering students after
the first semester of their studies. This type of prediction can provide important information
to senior administrators regarding institutional accountability and strategies with the goal
to maintain and improve student retention and graduation rates.
Although the aims of the course-level and program-level predictions are generally
different, these studies share similar methodological components and considerations, with
some minor differences. We present the results of our methodological review of the 121
articles in the following four subsections, each related to one of the research questions
outlined in the Introduction.
Predictor variables
The results show that the POS distribution features yielded the best prediction performance
among these three. However, the cross validation error was considerably high, suggesting
that the predictive model was not directly generalizable to other data sets.
Variable selection. In statistical modeling, variable selection, also known as feature
selection, is the process of selecting a set of relevant variables (i.e. features, predictors) for
use in model construction. As summarized by Guyon and Elisseeff (2003), there are three
main reasons for variable selection in machine learning-related research, namely, improving
the predictive power of the models, making faster and more cost-effective predictions and
providing a better understanding of the processes underlying the data. Variable selection is
especially important when a large number of potential student variables are available but
with a limited sample size.
Among the articles we reviewed, only a few studies briefly discussed their variable
selection techniques. Hart et al. (2017) used all-subsets regression to reduce the number of
predictor variables entered into their final analysis, dominance analysis, which is
computationally intensive and limited to a maximum of ten predictor variables. Badr et al.
(2016) and Ibrahim and Rusli (2007) used the rankings of the correlation coefficients to select
variables. Xu et al. (2017) conducted principal component analysis to reduce the dimensions
of predictor variables. Daud et al. (2017) utilized information gain and gain ratio to select the
best variable subset. Finally, Chuan et al. (2016) used a chi-squared attribute evaluator and
ranker search methods to identify the best attributes/variables.
Table II. Machine learning techniques and their corresponding number of publications

Technique                   Number of publications
DT                          46
Naïve Bayes                 32
SVM                         26
Neural networks and MLP     26
RF                          23
Logistic regression         22
K-nearest neighbor          16
Other                       25
previous students in different courses. J48 showed superior performance, with a higher overall
accuracy of 83.75 per cent, compared to that of ID3, 69.27 per cent.
NBC is a simple probabilistic classifier that calculates the conditional probability of the
data (given the class membership) by applying Bayes’ theorem and assuming conditional
independence among the predictors given the class (Friedman et al., 1997). The conditional
independence assumption greatly simplifies the calculation of the conditional probability of
the data by reducing it to the product of the likelihood of each predictor. Despite the
oversimplified assumption that is often violated in practice (e.g. student academic
background and midterm grade may not be conditionally independent), the NBC has shown
excellent performance that could be comparable to more advanced methods such as SVM.
For example, Marbouti et al. (2016) compared the performance of seven different predictive
models for identifying at-risk students in an engineering course and found that NBC
exhibited superior performance compared to other models.
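As an illustration of the method only (not the setup of Marbouti et al., 2016), the short sketch below fits a Gaussian NBC to synthetic pass/fail data; the features (prior GPA, quiz average, LMS logins) are hypothetical stand-ins for the kinds of predictors discussed above.

```python
# A hedged, synthetic-data sketch of an NBC for pass/fail prediction.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 400
X = np.column_stack([rng.uniform(0, 4, n),     # prior GPA (hypothetical)
                     rng.uniform(0, 100, n),   # quiz average
                     rng.poisson(25, n)])      # LMS logins
y = (0.6 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.4, n) > 2.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
nbc = GaussianNB().fit(X_train, y_train)   # assumes conditional independence of predictors
print("Test accuracy:", nbc.score(X_test, y_test))
print("P(pass) for the first test student:", nbc.predict_proba(X_test[:1])[0, 1])
```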
SVM finds a hyperplane that classifies data into two categories (Cortes and Vapnik,
1995). SVM uses a kernel function to map the data from the original space into a new feature
space and finds an optimal decision boundary with the maximum margin from data in both
categories. SVM is suited to learning tasks with a large number of features (or predictors)
relative to the size of training sample. This property makes SVM a desirable technique for
the analysis of the learning management data in which a large number of student features
are available. For example, SVM was adopted by Corrigan et al. (2015) because with SVM,
not all of the extracted features from the log data:
Have to be actually useful in terms of discriminating different forms of student outcome [. . .] we
can be open-minded about how we represent students’ online behaviour and if a feature is not
discriminative, the SVM learns this from the training material (p. 47).
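The sketch below illustrates this property with an SVM trained on many log-derived features of which only a few are informative; the data and feature counts are synthetic assumptions, not the Corrigan et al. (2015) pipeline.

```python
# A minimal SVM sketch on synthetic, high-dimensional "log" features.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n, p = 300, 60                 # many features relative to the sample size
X = rng.normal(size=(n, p))
y = (X[:, :5].sum(axis=1) + rng.normal(0, 1, n) > 0).astype(int)  # only 5 features matter

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)
# Scaling matters for SVMs; the kernel maps data into a richer feature space
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```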
ANNs were initially developed to mimic basic principles of biological neural systems where
information processing is modeled as the interactions between numerous interconnected
nerve cells or neurons. ANNs can also serve as a highly flexible nonlinear statistical
technique for modeling complex relationships between inputs and output. MLP is perhaps
the most well-known supervised ANN. An MLP is a network of neurons (i.e. nodes) that are
arranged in a layered architecture. Typically, this type of ANNs consists of three or more
layers: one input layer, one output layer and at least one hidden layer. Statistically, the MLP
functions similar to a nonlinear multivariate regression model. The layer of input neurons is
analogous to the set of predictor variables, whereas the layer of output neurons is analogous
to the outcome variables. The relationship between the input and output layers is parallel to
the mathematical functional form in the regression model. The number of nodes in the
hidden layer is typically chosen by the user to control the degree of nonlinearity between
predictors and the outcome variables. With more nodes in the hidden layer, the relationship
between predictors and outcome variables becomes more nonlinear in the MLP model. It has
been mathematically demonstrated that the MLP, given a sufficient number of hidden
nodes, can approximate any nonlinear function to any desired level of accuracy (Dawson
and Wilby, 2004; Hornik et al., 1989). Rachburee et al. (2015) developed predictive models
with five classification techniques, namely, DT, NBC, k-nearest neighbors, SVM and MLP.
The results show that MLP generates the best prediction with 89.29 per cent accuracy.
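The following sketch mirrors the regression analogy above: an MLP with one hidden layer whose size controls the degree of nonlinearity. The data are synthetic and the hidden-layer size is an arbitrary choice; this is not the Rachburee et al. (2015) model.

```python
# An illustrative single-hidden-layer MLP on synthetic data.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8))                  # input layer: 8 predictor variables
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.3, 500) > 1).astype(int)  # nonlinear outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,),  # one hidden layer with 16 nodes
                                  max_iter=2000, random_state=3))
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))
```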
RF is an ensemble classifier built on DTs. In DT, improper constraints or regularizations
on trees may result in overfitting the training data. Models with the problem of overfitting
show low bias and high variance, which imply that they cannot be well generalized to other
external data sets. RF was proposed to deal with this overfitting problem to improve the
model prediction and generalizability. In RF, the bagging method, or bootstrap aggregating,
is used to aggregate the predictions. Specifically, a bootstrap sampling approach with
replacement is used to obtain multiple subsets of the training data. For each subset of data, a
DT is then built, which considers only a subset of features. These DTs for different subsets of
data constitute a forest (i.e. a multitude of DTs) for the whole data set. Multiple classes or
predicted values from different DTs thus can be obtained, and RF outputs the mode of
predicted classes (for classification) or the mean of predicted values (for regression) as the
final prediction. As such, by considering different subsets of samples and features, RF
introduces randomness and diversity into the model, which improves the model
generalizability. RF has been shown to be a powerful and efficient classifier in the literature. For
example, in their study on the prediction of assignment grades with student online learning
behaviors and demographic information extracted from the MOOC data, Al-Shabandar et al.
(2017) found that RF largely outperformed the other seven classifiers considered in the study.
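A brief sketch of bagging with random feature subsets is given below on synthetic data; the hyperparameters are illustrative assumptions and the example does not reproduce the Al-Shabandar et al. (2017) study.

```python
# A minimal random forest sketch: bootstrapped trees, random feature subsets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(600, 20))
y = (X[:, 0] - X[:, 3] + 0.5 * X[:, 7] + rng.normal(0, 1, 600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees in the forest
    max_features="sqrt",   # each split considers a random subset of features
    random_state=4,
).fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
# For classification, the final prediction is the majority vote (mode) over the trees
```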
Logistic regression is a classical multivariate statistical procedure used to predict a
categorical outcome variable from a set of continuous, categorical or both types of predictor
variables. When the outcome variable has only two categories, the probability of the
outcome being in one category can be modeled as a sigmoid function of the linear
combination of predictors. The model parameters can be estimated by maximizing the log
likelihood of obtaining the observed data. For example, Jayaprakash et al. (2014) used
logistic regression, among three other techniques, to predict whether students are at risk or
in good standing in a course. The predictors included student age, gender, SAT scores, full-
time or part-time status, academic standing, cumulative GPA, year of study, score computed
from partial contributions to the final grade, number of Sakai courses sessions opened by
the student and number of times a section is accessed by the student. Logistic regression
was found to outperform other techniques, with a better combination of high recall, low
percentage of false alarms and higher precision in predicting at-risk students.
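For illustration, the sketch below fits a logistic regression to made-up stand-ins for a few of the predictor types listed above (age, GPA, LMS sessions) and reports recall and precision for the at-risk class; it does not reproduce the Jayaprakash et al. (2014) model or data.

```python
# A hedged logistic regression sketch for at-risk prediction on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(5)
n = 500
age = rng.integers(18, 30, n)
gpa = rng.uniform(0, 4, n)
sessions = rng.poisson(40, n)                      # LMS sessions opened (hypothetical)
X = np.column_stack([age, gpa, sessions])
# Probability of being at risk modeled as a sigmoid of a linear combination of predictors
logit = -1.5 * gpa - 0.02 * sessions + 3 + rng.normal(0, 0.5, n)
y = (1 / (1 + np.exp(-logit)) > 0.5).astype(int)   # 1 = at risk

model = LogisticRegression(max_iter=1000).fit(X, y)
pred = model.predict(X)
print("Recall (at-risk students caught):", recall_score(y, pred).round(2))
print("Precision:", precision_score(y, pred).round(2))
```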
Conclusion
This methodology review aims to provide researchers and practitioners with a survey of the
literature on learning analytics with a particular focus on the predictive analytics in the
context of higher education. Learning analytics is still an emerging field in education (Avella
et al., 2016). The adoption and application of learning analytics in higher education is still
mostly small-scale and preliminary. Student data captured within higher education
institutions (e.g. learning management systems, student information systems and student
services) have yet to be properly integrated, analyzed and interpreted to realize its full
potential for providing valuable insight for students and instructors to facilitate and support
learning. Sound analytical methodology is the central tenet of any high-quality learning
analytics application. The aim of the current study was to help better understand the current
state of the methodology in the development of predictive learning analytic models by
systematically reviewing issues related to:
data sources and student variables;
data preprocessing and handling;
machine learning techniques; and
evaluation of accuracy and generalizability.
Summary of results and conclusions
Data sources and student variables. Most of the reviewed studies make use of multiple data
sources and student variables in the modeling process to enhance prediction accuracy. For
course-level prediction, student intermediate course performance data (e.g. marks on quizzes
and midterms), student log data from learning management systems (e.g. logins and
downloads) and student demographics and previous academic history have been the most
often used predictors of student performance. Given that student learning involves both
cognitive and socio-emotional competencies, in a few studies, data were collected through
surveys and questionnaires that measure student self-reported learning attitudes/strategies/
difficulties and their self-evaluation, which have been used to predict student performance.
Features of courses and instructors have also been used as predictors considering the
importance of contextual information for learning. For program-level prediction, student
demographic and academic backgrounds are the most typical predictors chosen. The social
networking-based variables have also been researched as possible predictors. However, the
results so far are not clear in terms of whether and to what extent the social networking-
based variables have contributed to a significant improvement of prediction accuracy.
Data preprocessing and handling. Although data preprocessing and missing data handling
are critical for successful predictive learning analytic applications, few studies we reviewed
have presented detailed information about this process. Of the few citations that provided a
documentation on data preprocessing, variable normalization, data anonymization, translation
of student records, discretization of continuous variables, removal of irrelevant information in
data and information extraction from raw log files have been reported at the stage of data
preprocessing. Regarding missing data handling, none of the studies we reviewed provided
information on the extent of missing values in the data, the patterns of the missing data and the
justification of the selected approach for handling missing data. For the few studies that
reported how they handled the missing data, simple procedures such as mean replacement and
listwise deletion (i.e. deleting cases with missing values) were often used.
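As a small, generic illustration of the simple strategies reported in the literature, the sketch below applies listwise deletion, mean replacement, normalization and discretization to a tiny made-up table; the column names and values are hypothetical.

```python
# Assumed examples of basic preprocessing and missing-data handling steps.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, KBinsDiscretizer

df = pd.DataFrame({
    "quiz_avg": [78.0, np.nan, 91.0, 64.0, 55.0],
    "logins":   [30.0, 12.0, np.nan, 45.0, 8.0],
})

listwise = df.dropna()                                        # listwise deletion
mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                            columns=df.columns)               # mean replacement

normalized = MinMaxScaler().fit_transform(mean_imputed)       # variable normalization to [0, 1]
bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
discretized = bins.fit_transform(mean_imputed)                # discretization of continuous variables
print(listwise, mean_imputed, sep="\n\n")
```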
Machine learning techniques. The most frequently used and successful techniques in the
literature of predictive learning analytics appear to be DT, NBC, SVM, ANNs, RF and
logistic regression. Of these techniques, SVM and MLP are considered “black-box”
techniques in the sense that one cannot know exactly how the prediction is derived and how
to interpret the meaning of different parameters in the model. In comparison, results of DT
are highly interpretable, as the set of developed rules is simple to understand and can describe
clearly the process of the prediction. However, the disadvantage of DT is its instability,
meaning that small changes in the data might lead to different tree structures and set of
rules. For example, Jayaprakash et al. (2014) applied DT to 25, 50, 75 and 100 per cent of the
training data and found that the method exhibited unstable performance when varying the
sample size. RF, logistic regression and NBC appear to be good options for predictive
learning analytic applications.
Evaluation of accuracy and generalizability. Measures based on the percentages of correct
predictions such as the overall prediction accuracy, precision, recall and F-measure are most
often used measures for evaluating the performance of predictive models. However, as
argued by Fawcett (2004), these measures may be problematic for unbalanced classes where
one class dominates the sample. For example, when the class distribution is highly skewed
with 90 per cent of students passing, a model can have a high overall prediction accuracy by
simply predicting everyone to the majority class. Unbalanced classes are common in the
area of predictive learning analytics, given that typically a relatively small percentage of
students fail a course or drop out of a program. Good performance measures of predictive
modeling should not be influenced by the class distributions in the sample. An example is
ROC curves, which have a desirable property of being insensitive to changes in class
distributions. Another way to evaluate the performance of predictive models is by
examining the effectiveness of interventions designed based on the model-derived
predictions of student performance. This type of result can strengthen the practical use of
predictive models in real settings.
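The short sketch below uses synthetic labels with a 10 per cent failure rate to show why overall accuracy can mislead under class imbalance, and contrasts it with ROC AUC computed from (made-up) risk scores.

```python
# Illustrating the class-imbalance problem and ROC AUC on synthetic data.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score

rng = np.random.default_rng(6)
y_true = (rng.random(1000) < 0.10).astype(int)    # 10 per cent of students fail
majority = np.zeros_like(y_true)                  # predict "pass" for everyone
print("Accuracy of majority-class model:", accuracy_score(y_true, majority))          # about 0.90
print("Recall for the failing class:", recall_score(y_true, majority, zero_division=0))  # 0.0

# A model producing continuous risk scores can instead be judged by ROC AUC,
# which is insensitive to the class distribution
scores = 0.35 * y_true + rng.random(1000)         # crude, partially informative risk scores
print("ROC AUC:", roc_auc_score(y_true, scores).round(2))
print("F1 at a 0.5 threshold:", f1_score(y_true, (scores > 0.5).astype(int)).round(2))
```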
To evaluate the generalizability of predictive models, cross validation has been routinely
utilized in the learning analytic literature. This is a good practice considering the possibility
of model overfitting with the use of machine learning techniques in learning analytics
research. Although cross validation is important, it does not provide strong evidence to
show that the model can be generalized to other contexts or settings. Another, perhaps more
rigorous, way to examine the model generalizability is to apply the generated model to data
from other academic years or from other institutions.
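To make this distinction concrete, the sketch below runs standard k-fold cross validation and then the stricter check suggested here: fitting on one (hypothetical) academic year and evaluating on the next. The cohort labels and data are invented for illustration.

```python
# Cross validation versus a cohort-based generalizability check, on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(800, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 800) > 0).astype(int)
year = np.repeat([2016, 2017], 400)               # hypothetical academic-year labels

model = LogisticRegression(max_iter=1000)
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean().round(2))

# Stricter check: train on the 2016 cohort, evaluate on the 2017 cohort
model.fit(X[year == 2016], y[year == 2016])
print("Accuracy on the next cohort:", model.score(X[year == 2017], y[year == 2017]).round(2))
```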
References
Abdous, M.H., Wu, H. and Yen, C.J. (2012), “Using data mining for predicting relationships between
online question theme and final grade”, Journal of Educational Technology and Society, Vol. 15
No. 3, pp. 77-88.
Almutairi, F.M., Sidiropoulos, N.D. and Karypis, G. (2017), “Context-aware recommendation-based
learning analytics using tensor and coupled matrix factorization”, IEEE Journal of Selected
Topics in Signal Processing, Vol. 11 No. 5, pp. 729-741.
Al-Saleem, M., Al-Kathiry, N., Al-Osimi, S. and Badr, G. (2015), “Mining educational data to predict
students’ academic performance”, International Workshop on Machine Learning and Data
Mining in Pattern Recognition, Springer, Cham, pp. 403-414.
Al-Shabandar, R., Hussain, A., Laws, A., Keight, R., Lunn, J. and Radi, N. (2017), “Machine learning
approaches to predict learning outcomes in massive open online courses”, 2017 International
Joint Conference on Neural Networks (IJCNN), IEEE, pp. 713-720.
Avella, J.T., Kebritchi, M., Nunn, S.G. and Kanai, T. (2016), “Learning analytics methods, benefits,
and challenges in higher education: a systematic literature review”, Online Learning, Vol. 20
No. 2, pp. 13-29.
Badr, G., Algobail, A., Almutairi, H. and Almutery, M. (2016), “Predicting students’ performance in
university courses: a case study and tool in KSU mathematics department”, Procedia Computer
Science, Vol. 82, pp. 80-89.
Boyer, S. and Veeramachaneni, K. (2015), “Transfer learning for predictive models in massive open online
courses”, International Conference on Artificial Intelligence in Education, Springer, Cham, pp. 54-63.
Brinton, C.G., Buccapatnam, S., Chiang, M. and Poor, H.V. (2016), “Mining MOOC clickstreams: Video-
watching behavior vs. in-video quiz performance”, IEEE Transactions on Signal Processing,
Vol. 64 No. 14, pp. 3677-3692.
Chen, Y., Chen, Q., Zhao, M., Boyer, S., Veeramachaneni, K. and Qu, H. (2016), “DropoutSeer: visualizing
learning patterns in massive open online courses for dropout reasoning and prediction”, 2016
IEEE Conference on Visual Analytics Science and Technology (VAST), IEEE, pp. 111-120.
Chen, W., Brinton, C.G., Cao, D., Mason-Singh, A., Lu, C. and Chiang, M. (2018), “Early detection
prediction of learning outcomes in online short-courses via learning behaviors”, IEEE
Transactions on Learning Technologies, doi: 10.1109/TLT.2018.2793193.
Chen, K.C. and Jang, S.J. (2010), “Motivation in online learning: Testing a model of self-determination
theory”, Computers in Human Behavior, Vol. 26 No. 4, pp. 741-752.
Chuan, Y.Y., Husain, W. and Shahiri, A.M. (2016), “An exploratory study on students’ performance
classification using hybrid of decision tree and naïve Bayes approaches”, International Conference
on Advances in Information and Communication Technology, Springer, Cham, pp. 142-152.
Corrigan, O. and Smeaton, A.F. (2017), “A course agnostic approach to predicting student success from
VLE log data using recurrent neural networks”, European Conference on Technology Enhanced
Learning, Springer, Cham, pp. 545-548.
Corrigan, O., Smeaton, A.F., Glynn, M. and Smyth, S. (2015), “Using educational analytics to improve test
performance”, Design for Teaching and Learning in a Networked World, Springer, Cham, pp. 42-55.
Cortes, C. and Vapnik, V. (1995), “Support-vector networks”, Machine Learning, Vol. 20 No. 3, pp. 273-297.
Daud, A., Aljohani, N.R., Abbasi, R.A., Lytras, M.D., Abbas, F. and Alowibdi, J.S. (2017), “Predicting
student performance using advanced learning analytics”, Proceedings of the 26th International
Conference on World Wide Web Companion, ACM, pp. 415-421.
Davies, J. and Graff, M. (2005), “Performance in e-learning: online participation and student grades”,
British Journal of Educational Technology, Vol. 36 No. 4, pp. 657-663.
Dawson, C.W. and Wilby, R.L. (2004), “Single network modelling solutions”, in Abrahart, R., Kneale, P.E.
and See, L.M. (Eds), Neural Networks for Hydrological Modeling, A.A. Balkema Publishers, Leiden,
The Netherlands, pp. 39-59.
de Barba, P.D., Kennedy, G.E. and Ainley, M.D. (2016), “The role of students’ motivation and
participation in predicting performance in a MOOC”, Journal of Computer Assisted Learning,
Vol. 32 No. 3, pp. 218-231.
Deeva, G., De Smedt, J., De Koninck, P. and De Weerdt, J. (2017), “Dropout prediction in MOOCs: a
comparison between process and sequence mining”, International Conference on Business
Process Management, Springer, Cham, pp. 243-255.
Dekker, G., Pechenizkiy, M. and Vleeshouwers, J. (2009), “Predicting students drop out: a case study”,
International Conference on Educational Data Mining (EDM), ERIC, pp. 41-50.
Elbadrawy, A., Polyzou, A., Ren, Z., Sweeney, M., Karypis, G. and Rangwala, H. (2016), “Predicting
student performance using personalized analytics”, Computer, Vol. 49 No. 4, pp. 61-69.
Evale, D. (2016), “Learning management system with prediction model and course-content recommendation
module”, Journal of Information Technology Education: Research, Vol. 16 No. 1, pp. 437-457.
Fawcett, T. (2004), “ROC graphs: notes and practical considerations for researchers”, Machine
Learning, Vol. 31 No. 1, pp. 1-38.
Ferguson, R. (2012), “Learning analytics: drivers, developments and challenges”, International Journal
of Technology Enhanced Learning, Vol. 4 Nos 5/6, pp. 304-317.
Ferreira, J.T.A., Denison, D.G. and Hand, D.J. (2001), “Data mining with products of trees”, International
Symposium on Intelligent Data Analysis, Springer, Berlin, Heidelberg, pp. 167-176.
Friedman, N., Geiger, D. and Goldszmidt, M. (1997), “Bayesian network classifiers”, Machine Learning,
Vol. 29 Nos 2/3, pp. 131-163.
Graham, J.W., Cumsille, P.E. and Elek-Fisk, E. (2003), “Methods for handling missing data”, in Schinka,
J. A. and Velicer, W. F. (Eds.). Research Methods in Psychology, John Wiley and Sons. New York,
NY, pp. 87-114. Vol 2 of Handbook of Psychology (I. B. Weiner, Editor-in-Chief).
Gray, G., McGuinness, C., Owende, P. and Hofmann, M. (2016), “Learning factor models of students at
risk of failing in the early stage of tertiary education”, Journal of Learning Analytics, Vol. 3 No. 2,
pp. 330-372.
Guarín, C.E.L., Guzmán, E.L. and González, F.A. (2015), “A model to predict low academic performance
at a specific enrollment using data mining”, IEEE Revista Iberoamericana de Tecnologias del
Aprendizaje, IEEE, pp. 119-125.
Guyon, I. and Elisseeff, A. (2003), “An introduction to variable and feature selection”, Journal of
Machine Learning Research, Vol. 3, pp. 1157-1182.
Han, M., Tong, M., Chen, M., Liu, J. and Liu, C. (2017), “Application of ensemble algorithm in students’
performance prediction”, 2017 6th IIAI International Congress on Advanced Applied
Informatics (IIAI-AAI), IEEE, pp. 735-740.
Hart, S., Daucourt, M. and Ganley, C. (2017), “Individual differences related to college students’ course
performance in calculus II”, Journal of Learning Analytics, Vol. 4 No. 2, pp. 129-153.
Hornik, K., Stinchcombe, M. and White, H. (1989), “Multilayer feedforward networks are universal
approximators”, Neural Networks, Vol. 2 No. 5, pp. 359-366.
Hughes, G. and Dobbins, C. (2015), “The utilization of data analysis techniques in predicting student
performance in massive open online courses (MOOCs)”, Research and Practice in Technology
Enhanced Learning, doi: 10.1186/s41039-015-0007-z.
Ibrahim, Z. and Rusli, D. (2007), “Predicting students’ academic performance: Comparing artificial
neural network, decision tree and linear regression”, 21st Annual SAS Malaysia Forum, SAS,
Kuala Lumpur, pp. 1-6.
Illeris, K. (2006), “Lifelong learning and the low-skilled”, International Journal of Lifelong Education,
Vol. 25 No. 1, pp. 15-28.
Jayaprakash, S.M., Moody, E.W., Lauría, E.J., Regan, J.R. and Baron, J.D. (2014), “Early alert of
academically at-risk students: an open source analytics initiative”, Journal of Learning Analytics,
Vol. 1 No. 1, pp. 6-47.
Johnson, L., Adams Becker, S., Cummins, M., Estrada, V., Freeman, A. and Hall, C. (2016), “NMC
horizon report: 2016 higher education edition”, The New Media Consortium, Austin, TX.
Kidziński, Ł., Giannakos, M., Sampson, D.G. and Dillenbourg, P. (2016), “A tutorial on machine learning in
educational science”, in Li, Y., Chang, M., Kravcik, M., Popescu, E., Huang, R. and Kinshuk Chen, N.S.
(Eds), State-of-the-Art and Future Directions of Smart Learning, Springer, pp. 453-459.
Kizilcec, R.F., Piech, C. and Schneider, E. (2013), “Deconstructing disengagement: analyzing learner
subpopulations in massive open online courses”, Proceedings of the third international
conference on learning analytics and knowledge, ACM, pp. 170-179.
Klüsener, M. and Fortenbacher, A. (2015), “Predicting students’ success based on forum activities in
MOOCs”, 2015 IEEE 8th International Conference on Intelligent Data Acquisition and Advanced
Computing Systems: Technology and Applications (IDAACS), IEEE, pp. 925-928.
Kotsiantis, S. (2007), “Supervised machine learning: a review of classification techniques”, Informatica
Journal, Vol. 31, pp. 249-268.
Leitner, P., Khalil, M. and Ebner, M. (2017), “Learning analytics in higher education – a literature
review”, in Peña-Ayala, A. (Ed.), Learning Analytics: Fundaments, Applications, and Trends,
Springer, Cham, pp. 1-23.
Li, X., Wang, T. and Wang, H. (2017), “Exploring n-gram features in clickstream data for MOOC
learning achievement prediction”, International Conference on Database Systems for Advanced
Applications, Springer, Cham, pp. 328-339.
Liang, J., Li, C. and Zheng, L. (2016), “Machine learning application in MOOCs: dropout prediction”, 2016
11th International Conference on Computer Science and Education (ICCSE), IEEE, pp. 52-57.
Liang, J., Yang, J., Wu, Y., Li, C. and Zheng, L. (2016), “Big data application in education: dropout
prediction in edx MOOCs”, 2016 IEEE Second International Conference on Multimedia Big Data
(BigMM), IEEE, pp. 440-443.
Luo, J., Sorour, S.E., Goda, K. and Mine, T. (2015), “Predicting student grade based on free-style
comments using Word2vec and ANN by considering prediction results obtained in consecutive
lessons”, International Conference on Educational Data Mining (EDM) (8th, Madrid, Spain, Jun
26-29, 2015), ERIC, pp. 396-399.
Marbouti, F., Diefes-Dux, H.A. and Madhavan, K. (2016), “Models for early prediction of at-risk students in
a course using standards-based grading”, Computers and Education, Vol. 103, pp. 1-15.
Marbouti, M.F., Diefes-Dux, H.A. and Strobel, J. (2015), “Building course-specific regression-based
models to identify at-risk students”, The American Society for Engineering Educators Annual
Conference, American Society for Engineering Education, Seattle, WA.
Meedech, P., Iam-On, N. and Boongoen, T. (2016), “Prediction of student dropout using personal profile
and data mining approach”, in Lavangnananda K., Phon-Amnuaisuk S., Engchuan W. and Chan
J. (Eds.), Intelligent and Evolutionary Systems, Springer, Cham, pp. 143-155.
Meyer, B., Choppy, C., Staunstrup, J. and van Leeuwen, J. (2009), “Research evaluation for computer
science”, Communications of the Acm, Vol. 52 No. 4, pp. 31-34.
Morris, L.V., Finnegan, C. and Wu, S.S. (2005), “Tracking student behavior, persistence, and
achievement in online courses”, The Internet and Higher Education, Vol. 8 No. 3, pp. 221-231.
Ogihara, M. and Ren, G. (2017), “Student retention pattern prediction employing linguistic features
extracted from admission application essays”, 2017 16th IEEE International Conference on
Machine Learning and Applications (ICMLA), IEEE, pp. 532-539.
Ornelas, F. and Ordonez, C. (2017), “Predicting student success: a naïve bayesian application to
community college data”, Technology, Knowledge and Learning, Vol. 22 No. 3, pp. 299-315.
Pérez-Lemonche, Á., Martínez-Muñoz, G. and Pulido-Cañabate, E. (2017), “Analysing event transitions
to discover student roles and predict grades in MOOCs”, International Conference on Artificial
Neural Networks, Springer, Cham, pp. 224-232.
Polyzou, A. and Karypis, G. (2016), “Grade prediction with models specific to students and courses”,
International Journal of Data Science and Analytics, Vol. 2 Nos 3/4, pp. 159-171.
Rachburee, N., Punlumjeak, W., Rugtanom, S., Jaithavil, D. and Pracha, M. (2015), “A prediction of
engineering students performance from core engineering course using classification”, in Kim, K.
(Ed.), Information Science and Applications, Springer, Berlin, Heidelberg, pp. 649-656.
Roy, S. and Garg, A. (2017), “Predicting academic performance of student using classification
techniques”, 2017 4th IEEE Uttar Pradesh Section International Conference on Electrical,
Computer and Electronics (UPCON), IEEE, pp. 568-572.
Rubiano, S.M.M. and Garcia, J.A.D. (2015), “Formulation of a predictive model for academic
performance based on students’ academic and demographic data”, 2015 IEEE Frontiers in
Education Conference (FIE), IEEE, pp. 1-7.
Ruipérez-Valiente, J.A., Cobos, R., Muñoz-Merino, P.J., Andujar, Á. and Kloos, C.D. (2017), “Early prediction
and variable importance of certificate accomplishment in a MOOC”, in Delgado Kloos C, Jermann P.,
Pérez-Sanagustín M., Seaton D. and White S. (Eds), Digital Education: Out to the World and Back to
the Campus, Springer, Cham, pp. 263-272.
Sclater, N., Peasgood, A. and Mullan, J. (2016), Learning Analytics in Higher Education, JISC, London,
available at: www.jisc.ac.uk/sites/default/files/learning-analytics-in-he-v3.pdf.
Siemens, G. and Long, P. (2011), “Penetrating the fog: analytics in learning and education”,
EDUCAUSE Review, Vol. 46 No. 5, pp. 30-32.
Siemens, G. and Baker, R.S. (2012), “Learning analytics and educational data mining: towards
communication and collaboration”, Proceedings of the 2nd international conference on learning
analytics and knowledge, ACM, pp. 252-254.
Siemens, G., Gasevic, D., Haythornthwaite, C., Dawson, S.P., Shum, S., Ferguson, R. and Baker, R.
(2011), “Open learning analytics: an integrated and modularized platform”, Proposal to Design,
Implement and Evaluate an Open Platform to Integrate Heterogeneous Learning Analytics
Techniques, Society for Learning Analytics Research.
Sivakumar, S. and Selvaraj, R. (2018), “Predictive modeling of students performance through the
enhanced decision tree”, in Kalam A., Das S. and Sharma K. (Eds), Advances in Electronics,
Communication and Computing, Springer, Singapore, pp. 21-36.
Sorour, S.E., El Rahman, S.A. and Mine, T. (2016), “Teacher interventions to enhance the quality of
student comments and their effect on prediction performance”, 2016 IEEE Frontiers in
Education Conference (FIE), IEEE.
Sorour, S.E., El Rahman, S.A., Kahouf, S.A. and Mine, T. (2016), “Understandable prediction models of
student performance using an attribute dictionary”, International Conference on Web-Based
Learning, Springer, pp. 161-171.
Strecht, P., Cruz, L., Soares, C., Mendes-Moreira, J. and Abreu, R. (2015), “A comparative study of classification
and regression algorithms for modelling students’ academic performance”, International Conference
on Educational Data Mining (EDM) (8th, Madrid, Spain, Jun 26-29, 2015), ERIC, pp. 392-395.
Sweeney, M., Rangwala, H., Lester, J. and Johri, A. (2016), “Next-term student performance prediction: a
recommender systems approach”, Journal of Educational Data Mining (JEDM), Vol. 8, pp. 1-27.
Tabachnick, B.G. and Fidell, L.S. (2013), Using Multivariate Statistics, 5th ed., Allyn and Bacon,
Needham Heights, MA.
Tempelaar, D.T., Rienties, B. and Giesbers, B. (2015), “In search for the most informative data for
feedback generation: learning analytics in a data-rich context”, Computers in Human Behavior,
Vol. 47, pp. 157-167.
Uddin, M.F. and Lee, J. (2017), “Proposing stochastic probability-based math model and algorithms
utilizing social networking and academic data for good fit students prediction”, Social Network
Analysis and Mining, Vol. 7 No. 29, doi: 10.1007/s13278-017-0448-z.
Valdiviezo-Díaz, P., Cordero, J., Reátegui, R. and Aguilar, J. (2015), “A business intelligence model for
online tutoring process”, 2015 IEEE Frontiers in Education Conference (FIE), IEEE, pp. 1-9.
Waddington, R.J., Nam, S., Lonn, S. and Teasley, S.D. (2016), “Improving early warning systems with
categorized course resource usage”, Journal of Learning Analytics, Vol. 3 No. 3, pp. 263-290.
Warner, R.M. (2008), Applied Statistics: From Bivariate through Multivariate Techniques, Sage,
Thousand Oaks, CA.
Xing, W., Chen, X., Stein, J. and Marcinkowski, M. (2016), “Temporal predication of dropouts in
MOOCs: reaching the low hanging fruit through stacking generalization”, Computers in Human
Behavior, Vol. 58, pp. 119-129.
Xu, M., Liang, Y. and Wu, W. (2017), “Predicting honors student performance using RBFNN and PCA
method”, International Conference on Database Systems for Advanced Applications, Springer,
pp. 364-375.
Yang, T.Y., Brinton, C.G., Joe-Wong, C. and Chiang, M. (2017), “Behavior-based grade prediction for
MOOCs via time series neural networks”, IEEE Journal of Selected Topics in Signal Processing,
Vol. 11 No. 5, pp. 716-728.
Ye, C. and Biswas, G. (2014), “Early prediction of student dropout and performance in MOOCs using
higher granularity temporal information”, Journal of Learning Analytics, Vol. 1 No. 3, pp. 169-172.
Corresponding author
Ying Cui can be contacted at: yc@ualberta.ca