Deep Knowledge Tracing
Deep Knowledge Tracing
Abstract
Knowledge tracing—where a machine models the knowledge of a student as they
interact with coursework—is a well established problem in computer supported
education. Though effectively modeling student knowledge would have high ed-
ucational impact, the task has many inherent challenges. In this paper we explore
the utility of using Recurrent Neural Networks (RNNs) to model student learning.
The RNN family of models have important advantages over previous methods
in that they do not require the explicit encoding of human domain knowledge,
and can capture more complex representations of student knowledge. Using neu-
ral networks results in substantial improvements in prediction performance on a
range of knowledge tracing datasets. Moreover the learned model can be used for
intelligent curriculum design and allows straightforward interpretation and dis-
covery of structure in student tasks. These results suggest a promising new line of
research for knowledge tracing and an exemplary application task for RNNs.
1 Introduction
Computer-assisted education promises open access to world class instruction and a reduction in the
growing cost of learning. We can develop on this promise by building models of large scale student
trace data on popular educational platforms such as Khan Academy, Coursera, and EdX.
Knowledge tracing is the task of modelling student knowledge over time so that we can accurately
predict how students will perform on future interactions. Improvement on this task means that re-
sources can be suggested to students based on their individual needs, and content which is predicted
to be too easy or too hard can be skipped or delayed. Already, hand-tuned intelligent tutoring sys-
tems that attempt to tailor content show promising results [28]. One-on-one human tutoring can
produce learning gains for the average student on the order of two standard deviations [5] and ma-
chine learning solutions could provide these benefits of high quality personalized teaching to anyone
in the world for free. The knowledge tracing problem is inherently difficult as human learning is
grounded in the complexity of both the human brain and human knowledge. Thus, the use of rich
models seems appropriate. However most previous work in education relies on first order Markov
models with restricted functional forms.
In this paper we present a formulation that we call Deep Knowledge Tracing (DKT) in which we
apply flexible recurrent neural networks that are ‘deep’ in time to the task of knowledge tracing. This
family of models represents latent knowledge state, along with its temporal dynamics, using large
vectors of artificial ‘neurons’, and allows the latent variable representation of student knowledge to
be learned from data rather than hard-coded. The main contributions of this work are:
1
Exercise attempted: correct, incorrect
Probability
Predicted
Line graph intuition 1.0
Slope of a line
Solving for y-intercept
0.5
Solving for x-intercept
Graphing linear equations
Square roots 0.0
10 20 30 40 50 E[p]
Exercise index
Figure 1: A single student and her predicted responses as she solves 50 Khan Academy exercises. She seems to
master finding x and y intercepts and then has trouble transferring knowledge to graphing linear equations.
The task of knowledge tracing can be formalized as: given observations of interactions x0 . . . xt
taken by a student on a particular learning task, predict aspects of their next interaction xt+1 [6].
In the most ubiquitous instantiation of knowledge tracing, interactions take the form of a tuple of
xt = {qt , at } that combines a tag for the exercise being answered qt with whether or not the exercise
was answered correctly at . When making a prediction, the model is provided the tag of the exercise
being answered, qt and must predict whether the student will get the exercise correct, at . Figure 1
shows a visualization of tracing knowledge for a single student learning 8th grade math. The student
first answers two square root problems correctly and then gets a single x-intercept exercise incorrect.
In the subsequent 47 interactions the student solves a series of x-intercept, y-intercept and graphing
exercises. Each time the student answers an exercise we can make a prediction as to whether or not
she would answer an exercise of each type correctly on her next interaction. In the visualization
we only show predictions over time for a relevant subset of exercise types. In most previous work,
exercise tags denote the single “concept” that human experts assign to an exercise. Our model
can leverage, but does not require, such expert annotation. We demonstrate that in the absence of
annotations the model can autonomously learn content substructure.
2 Related Work
The task of modelling and predicting how human beings learn is informed by fields as diverse
as education, psychology, neuroscience and cognitive science. From a social science perspective
learning has been understood to be influenced by complex macro level interactions including affect
[21], motivation [10] and even identity [4]. The challenges present are further exposed on the micro
level. Learning is fundamentally a reflection of human cognition which is a highly complex process.
Two themes in the field of cognitive science that are particularly relevant are theories that the human
mind, and its learning process, are recursive [12] and driven by analogy [13].
The problem of knowledge tracing was first posed, and has been heavily studied within the intelligent
tutoring community. In the face of aforementioned challenges it has been a primary goal to build
models which may not capture all cognitive processes, but are nevertheless useful.
Bayesian Knowledge Tracing (BKT) is the most popular approach for building temporal models
of student learning. BKT models a learner’s latent knowledge state as a set of binary variables,
each of which represents understanding or non-understanding of a single concept [6]. A Hidden
Markov Model (HMM) is used to update the probabilities across each of these binary variables, as a
learner answers exercises of a given concept correctly or incorrectly. The original model formulation
assumed that once a skill is learned it is never forgotten. Recent extensions to this model include
contextualization of guessing and slipping estimates [7], estimating prior knowledge for individual
learners [33], and estimating problem difficulty [23].
With or without such extensions, Knowledge Tracing suffers from several difficulties. First, the
binary representation of student understanding may be unrealistic. Second, the meaning of the
hidden variables and their mappings onto exercises can be ambiguous, rarely meeting the model’s
expectation of a single concept per exercise. Several techniques have been developed to create and
refine concept categories and concept-exercise mappings. The current gold standard, Cognitive Task
Analysis [31] is an arduous and iterative process where domain experts ask learners to talk through
2
their thought processes while solving problems. Finally, the binary response data used to model
transitions imposes a limit on the kinds of exercises that can be modeled.
Partially Observable Markov Decision Processes (POMDPs) have been used to model learner be-
havior over time, in cases where the learner follows an open-ended path to arrive at a solution [29].
Although POMDPs present an extremely flexible framework, they require exploration of an expo-
nentially large state space. Current implementations are also restricted to a discrete state space,
with hard-coded meanings for latent variables. This makes them intractable or inflexible in practice,
though they have the potential to overcome both of those limitations.
Simpler models from the Performance Factors Analysis (PFA) framework [24] and Learning Factors
Analysis (LFA) framework [3] have shown predictive power comparable to BKT [14]. To obtain
better predictive results than with any one model alone, various ensemble methods have been used
to combine BKT and PFA [8]. Model combinations supported by AdaBoost, Random Forest, linear
regression, logistic regression and a feed-forward neural network were all shown to deliver superior
results to BKT and PFA on their own. But because of the learner models they rely on, these ensemble
techniques grapple with the same limitations, including a requirement for accurate concept labeling.
Recent work has explored combining Item Response Theory (IRT) models with switched nonlinear
Kalman filters [20], as well as with Knowledge Tracing [19, 18]. Though these approaches are
promising, at present they are both more restricted in functional form and more expensive (due to
inference of latent variables) than the method we present here.
Recurrent neural networks are a family of flexible dynamic models which connect artificial neurons
over time. The propagation of information is recursive in that hidden neurons evolve based on both
the input to the system and on their previous activation [32]. In contrast to hidden Markov models
as they appear in education, which are also dynamic, RNNs have a high dimensional, continuous,
representation of latent state. A notable advantage of the richer representation of RNNs is their abil-
ity to use information from an input in a prediction at a much later point in time. This is especially
true for Long Short Term Memory (LSTM) networks—a popular type of RNN [16].
Recurrent neural networks are competitive or state-of-the-art for several time series tasks–for in-
stance, speech to text [15], translation [22], and image captioning [17]–where large amounts of
training data are available. These results suggest that we could be much more successful at tracing
student knowledge if we formulated the task as a new application of temporal neural networks.
3.1 Model
3
y1 y2 y3 yT
h0 h1 h2 h3 … hT
x1 x2 x3 xT
Figure 2: The connection between variables in a simple recurrent neural network. The inputs (xt ) to the
dynamic network are either one-hot encodings or compressed representations of a student action, and the
prediction (yt ) is a vector representing the probability of getting each of the dataset exercises correct.
where both tanh and the sigmoid function, σ (·), are applied elementwise. The model is parame-
terized by an input weight matrix Whx , recurrent weight matrix Whh , initial state h0 , and readout
weight matrix Wyh . Biases for latent and readout units are given by bh and by .
Long Short Term Memory (LSTM) networks [16] are a more complex variant of RNNs that often
prove more powerful. In LSTMs latent units retain their values until explicitly cleared by the action
of a ‘forget gate’. They thus more naturally retain information for many time steps, which is believed
to make them easier to train. Additionally, hidden units are updated using multiplicative interactions,
and they can thus perform more complicated transformations for the same number of latent units.
The update equations for an LSTM are significantly more complicated than for an RNN, and can be
found in Appendix A.
In order to train an RNN or LSTM on student interactions, it is necessary to convert those interac-
tions into a sequence of fixed length input vectors xt . We do this using two methods depending on
the nature of those interactions:
For datasets with a small number M of unique exercises, we set xt to be a one-hot encoding of
the student interaction tuple ht = {qt , at } that represents the combination of which exercise was
answered and if the exercise was answered correctly, so xt ∈ {0, 1}2M . We found that having
separate representations for qt and at degraded performance.
For large feature spaces, a one-hot encoding can quickly become impractically large. For datasets
with a large number of unique exercises, we therefore instead assign a random vector nq,a ∼
N (0, I) to each input tuple, where nq,a ∈ RN , and N M . We then set each input vector
xt to the corresponding random vector, xt = nqt ,at . This random low-dimensional representation
of a one-hot high-dimensional vector is motivated by compressed sensing. Compressed sensing
states that a k-sparse signal in d dimensions can be recovered exactly from k log d random linear
projections (up to scaling and additive constants) [2]. Since a one-hot encoding is a 1-sparse signal,
the student interaction tuple can be exactly encoded by assigning it to a fixed random Gaussian input
vector of length ∼ log 2M . Although the current paper deals only with 1-hot vectors, this technique
can be extended easily to capture aspects of more complex student interactions in a fixed length
vector.
The output yt is a vector of length equal to the number of problems, where each entry represents
the predicted probability that the student would answer that particular problem correctly. Thus the
prediction of at+1 can then be read from the entry in yt corresponding to qt+1 .
3.3 Optimization
The training objective is the negative log likelihood of the observed sequence of student responses
under the model. Let δ(qt+1 ) be the one-hot encoding of which exercise is answered at time t + 1,
and let ` be binary cross entropy. The loss for a given prediction is `(yT δ (qt+1 ) , at+1 ), and the
4
loss for a single student is:
X
L= `(yT δ (qt+1 ) , at+1 ) (3)
t
This objective was minimized using stochastic gradient descent on minibatches. To prevent over-
fitting during training, dropout was applied to ht when computing the readout yt , but not when
computing the next hidden state ht+1 . We prevent gradients from ‘exploding’ as we backpropagate
through time by truncating the length of gradients whose norm is above a threshold. For all models
in this paper we consistently used hidden dimensionality of 200 and a mini-batch size of 100. To
facilitate research in DKTs we have published our code and relevant preprocessed data1 .
4 Educational Applications
The training objective for knowledge tracing is to predict a student’s future performance based on
their past activity. This is directly useful – for instance formal testing is no longer necessary if a
student’s ability undergoes continuous assessment. As explored experimentally in Section 6, the
DKT model can also power a number of other advancements.
One of the biggest potential impacts of our model is in choosing the best sequence of learning items
to present to a student. Given a student with an estimated hidden knowledge state, we can query
our RNN to calculate what their expected knowledge state would be if we were to assign them a
particular exercise. For instance, in Figure 1 after the student has answered 50 exercises we can test
every possible next exercise we could show her and compute her expected knowledge state given that
choice. The predicted optimal next problem for this student is to revisit solving for the y-intercept.
We use a trained DKT to test two classic curricula rules from education literature: mixing where
exercises from different topics are intermixed, and blocking where students answer series of exer-
cises of the same type [30]. Since choosing the entire sequence of next exercises so as to maximize
predicted accuracy can be phrased as a Markov decision problem we can also evaluate the benefits
of using the expectimax algorithm (see Appendix) to chose an optimal sequence of problems.
The DKT model can further be applied to the task of discovering latent structure or concepts in the
data, a task that is typically performed by human experts. We approached this problem by assigning
an influence Jij to every directed pair of exercises i and j,
y (j|i)
Jij = P , (4)
k y (j|k)
where y (j|i) is the correctness probability assigned by the RNN to exercise j on the second timestep,
given that a student answered exercise i correctly on the first. We show that this characterization of
the dependencies captured by the RNN recovers the pre-requisites associated with exercises.
5 Datasets
We test the ability to predict student performance on three datasets: simulated data, Khan Academy
Data, and the Assistments benchmark dataset. On each dataset we measure area under the curve
(AUC). For the non-simulated data we evaluate our results using 5-fold cross validation and in all
cases hyper-parameters are learned on training data. We compare the results of Deep Knowledge
Tracing to standard BKT and, when possible to optimal variations of BKT. Additionally we compare
our results to predictions made by simply calculating the marginal probability of a student getting a
particular exercise correct.
1
https://github.com/chrispiech/DeepKnowledgeTracing
5
Overview AU C
Dataset Students Exercise Tags Answers Marginal BKT BKT* DKT
Simulated-5 4,000 50 200 K ? 0.54 - 0.75
Khan Math 47,495 69 1,435 K 0.63 0.68 - 0.85
Assistments 15,931 124 526 K 0.62 0.67 0.69 0.86
Table 1: AUC results for all datasets tested. BKT is the standard BKT. BKT* is the best reported result from
the literature for Assistments. DKT is the result of using LSTM Deep Knowledge Tracing.
Simulated Data: We simulate virtual students learning virtual concepts and test how well we can
predict responses in this controlled setting. For each run of this experiment we generate two thou-
sand students who answer 50 exercises drawn from k ∈ 1 . . . 5 concepts. For this dataset only, all
students answer the same sequence of 50 exercises. Each student has a latent knowledge state “skill”
for each concept, and each exercise has both a single concept and a difficulty. The probability of
a student getting a exercise with difficulty β correct if the student had concept skill α is modelled
1−c
using classic Item Response Theory [9] as: p(correct|α, β) = c + 1+e β−α where c is the probabil-
ity of a random guess (set to be 0.25). Students “learn” over time via an increase to the concept
skill which corresponded to the exercise they answered. To understand how the different models
can incorporate unlabelled data, we do not provide models with the hidden concept labels (instead
the input is simply the exercise index and whether or not the exercise was answered correctly). We
evaluate prediction performance on an additional two thousand simulated test students. For each
number of concepts we repeat the experiment 20 times with different randomly generated data to
evaluate accuracy mean and standard error.
Khan Academy Data: We used a sample of anonymized student usage interactions from the eighth
grade Common Core curriculum on Khan Academy. The dataset included 1.4 million exercises
completed by 47,495 students across 69 different exercise types. It did not contain any personal
information. Only the researchers working on this paper had access to this anonymized dataset, and
its use was governed by an agreement designed to protect student privacy in accordance with Khan
Academy’s privacy notice [1]. Khan Academy provides a particularly relevant source of learning
data, since students often interact with the site for an extended period of time and for a variety of
content, and because students are often self-directed in the topics they work on and in the trajectory
they take through material.
Benchmark Dataset: In order to understand how our model compared to other models we evaluated
models on the Assistments 2009-2010 “skill builder” public benchmark dataset2 . Assistments is an
online tutor that simultaneously teaches and assesses students in grade school mathematics. It is, to
the best of our knowledge, the largest publicly available knowledge tracing dataset [11].
6 Results
On all three datasets Deep Knowledge Tracing substantially outperformed previous methods. On
the Khan dataset using an LSTM neural network model led to an AUC of 0.85 which was a notable
improvement over the performance of a standard BKT (AUC = 0.68), especially when compared to
the small improvement BKT provided over the marginal baseline (AUC = 0.63). See Table 1 and
Figure 3(b). On the Assistments dataset DKT produced a 25% gain over the previous best reported
result (AUC = 0.86 and 0.69 respectively) [23]. The gain we report in AUC compared to the marginal
baseline (0.24) is more than triple the largest gain achieved on the dataset to date (0.07).
The prediction results from the synthetic dataset provide an interesting demonstration of the capac-
ities of deep knowledge tracing. Both the LSTM and RNN models did as well at predicting student
responses as an oracle which had perfect knowledge of all model parameters (and only had to fit the
latent student knowledge variables). See Figure 3(a). In order to get accuracy on par with an oracle
the models would have to mimic a function that incorporates: latent concepts, the difficulty of each
exercise, the prior distributions of student knowledge and the increase in concept skill that happened
2
https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data
6
0.85 1.0
0.85
0.85
0.7
0.6
Test
Oracle 0.4
0.65 Oracle LSTM
0.65 RNN
RNN MDP-8
0.75 RNN
Test Accuracy
LSTM
LSTM MDP-1
BKT
BKT
BKT 0.2 Blocking
Marginal
Mixing
0.55
0.55 0.0 0.5
11 22 33 44 55 0.0 0.2 0.4 0.6 0.8 1.0 0 10 20 30
Number
Number of Concepts
of Hidden Concepts False Positive Rate Exercise Index
7 Discussion
In this paper we apply RNNs to the problem of knowledge tracing in education, showing improve-
ment over prior state-of-the-art performance on the Assistments benchmark and Khan dataset. Two
particularly interesting novel properties of our new model are that (1) it does not need expert anno-
tations (it can learn concept patterns on its own) and (2) it can operate on any student input that can
be vectorized. One disadvantage of RNNs over simple hidden Markov methods is that they require
large amounts of training data, and so are well suited to an online education environment, but not a
small classroom environment.
7
6 3
5 22 7 11
30 31 6 3 18 20
5 22 36 7 11 25 33
4
30 31 27
13 17 18 20 45
41 43
36 4 26 32 48 25 2933
42 27
6 3 13 17 37 44 49
41 45
5 22 7 11 29 43
42 26 32 39 48
30 31 6 3 35
18 20 37 44 49 34 38
5 22 6 3
725 11 33
36 4 39
5 22 27 11 47 2 28 Hidden concept 1
30 31 187 20 46 9 14 21 35
17
41 30 31 13 18
45
43 1
34 38 Hidden concept 2
36 432 48 29
25 2033 40 24 8 10 16
42 26 27 47 23
28 Hidden concept 3
36 17444 49 25 33 46 15 2 9 14 21
41 13
37 45
27 43 12
1 Hidden concept 4
393217 48
2613 29 45 40 24 19 8 10 16
4241 43 23 Hidden concept 5
Simulated Data 26 44 32 4948 29
35 15
42 37 34 38 12
39 44 49
37 19
Khan Data 47 28 35
46 Scatter plots 2 9 3914 21
11 1 34 38 1 Linear function intercepts 24 Interpreting function graphs 47 Constructing inconsistent system
39 35
444047249 8 10 16 34Pythagorean
38 2 Recognizing irrational numbers 25 Systems of equations w. Elim. 0 48 Pythagorean theorem proofs
152 9 14 24 23 28
57 46 15 47 21 theorem 3 Linear equations 3 26 Solutions to systems of equations 49 Scientific notation intuition
12 41 2 9 14 21 28
46
40 1 27 4 Multiplication in scientific notation 27 Views of a function 50 Line graph intuition
34 Lines 19 24 8 10 16 23 55
Angles
40 24 1 5 Parallel lines 2 28 Recog func 2 51 Multistep equations w. distribution
15 8 10 16
69 12 28 59 23
62 6 Systems of equations 29 Graphing proportional relationships 52 Fractions as repeating decimals
13 1946 15
12 48 7 Equations word problems 30 Exponent rules 53 Cube roots
68 19 56
Functions
65
19 8 Slope of a line 31 Angles 2 54 Scientific notation
9 Linear models of bivariate data 32 Understand equations word problems 55 Pythagorean theorem 2
32 5 31 14 10 Systems of equations with elimination 33 Exponents 2 56 Functions 1
66 49
Exponents 33 11 Plotting the line of best fit 34 Segment addition 57 Vertical angles 2
7 12 Integer sums 35 Systems of equations w. substitution 58 Solving for the x intercept
60 53 54 30 13 Congruent angles 36 Comparing proportional relationships 59 Recognizing functions
26
6 17 42 58
14 Exponents 1 37 Solutions to linear equations 60 Square roots
67 15 Interpreting scatter plots 38 Finding intercepts of linear functions 61 Slope and triangle similarity
Systems of
12 4 16 Repeating decimals to fractions 2 39 Midpoint of a segment 62 Distance formula
43 23
Equations 22 Line graphs
1 63
17 Graphical solutions to systems 40 Volume word problems 63 Converting decimals to fractions 2
64 18 Linear non linear functions 41 Constructing scatter plots 64 Age word problems
51
10 8 2 19 Interpreting features of linear functions 42 Solving for the y intercept 65 Pythagorean theorem 1
Figure 4: Graphs of conditional influence between exercises in DKT models. Above: We observe a perfect
clustering of latent concepts in the synthetic data. Below: A convincing depiction of how 8th grade math
Common Core exercises influence one another. Arrow size indicates connection strength. Note that nodes may
be connected in both directions. Edges with a magnitude smaller than 0.1 have been thresholded. Cluster
labels are added by hand, but are fully consistent with the exercises in each cluster.
The application of RNNs to knowledge tracing provides many directions for future research. Fur-
ther investigations could incorporate other features as inputs (such as time taken), explore other
educational impacts (such as hint generation, dropout prediction), and validate hypotheses posed in
education literature (such as spaced repetition, modeling how students forget). Because DKTs take
vector input it should be possible to track knowledge over more complex learning activities. An es-
pecially interesting extension is to trace student knowledge as they solve open-ended programming
tasks [26, 27]. Using a recently developed method for vectorization of programs [25] we hope to be
able to intelligently model student knowledge over time as they learn to program.
In an ongoing collaboration with Khan Academy, we plan to test the efficacy of DKT for curriculum
planning in a controlled experiment, by using it to propose exercises on the site.
Acknowledgments
Many thanks to John Mitchell for his guidance and Khan Academy for its support. Chris Piech is
supported by NSF-GRFP grant number DGE-114747.
References
[1] Khan academy privacy notice https://www.khanacademy.org/about/privacy-policy, 2015.
[2] BARANIUK , R. Compressive sensing. IEEE signal processing magazine 24, 4 (2007).
[3] C EN , H., KOEDINGER , K., AND J UNKER , B. Learning factors analysis–a general method for cognitive
model evaluation and improvement. In Intelligent tutoring systems (2006), Springer, pp. 164–175.
[4] C OHEN , G. L., AND G ARCIA , J. Identity, belonging, and achievement a model, interventions, implica-
tions. Current Directions in Psychological Science 17, 6 (2008), 365–369.
[5] C ORBETT, A. Cognitive computer tutors: Solving the two-sigma problem. In User Modeling 2001.
Springer, 2001, pp. 137–147.
[6] C ORBETT, A. T., AND A NDERSON , J. R. Knowledge tracing: Modeling the acquisition of procedural
knowledge. User modeling and user-adapted interaction 4, 4 (1994), 253–278.
[7] D BAKER , R. S. J., C ORBETT, A. T., AND A LEVEN , V. More accurate student modeling through
contextual estimation of slip and guess probabilities in bayesian knowledge tracing. In Intelligent Tutoring
Systems (2008), Springer, pp. 406–415.
8
[8] D BAKER , R. S. J., PARDOS , Z. A., G OWDA , S. M., N OORAEI , B. B., AND H EFFERNAN , N. T.
Ensembling predictions of student knowledge within intelligent tutoring systems. In User Modeling,
Adaption and Personalization. Springer, 2011, pp. 13–24.
[9] D RASGOW, F., AND H ULIN , C. L. Item response theory. Handbook of industrial and organizational
psychology 1 (1990), 577–636.
[10] E LLIOT, A. J., AND DWECK , C. S. Handbook of competence and motivation. Guilford Publications,
2013.
[11] F ENG , M., H EFFERNAN , N., AND KOEDINGER , K. Addressing the assessment challenge with an online
system that tutors as it assesses. User Modeling and User-Adapted Interaction 19, 3 (2009), 243–266.
[12] F ITCH , W. T., H AUSER , M. D., AND C HOMSKY, N. The evolution of the language faculty: clarifications
and implications. Cognition 97, 2 (2005), 179–210.
[13] G ENTNER , D. Structure-mapping: A theoretical framework for analogy. Cognitive science 7, 2.
[14] G ONG , Y., B ECK , J. E., AND H EFFERNAN , N. T. In Intelligent Tutoring Systems, Springer.
[15] G RAVES , A., M OHAMED , A.-R., AND H INTON , G. Speech recognition with deep recurrent neural
networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference
on (2013), IEEE, pp. 6645–6649.
[16] H OCHREITER , S., AND S CHMIDHUBER , J. Long short-term memory. Neural computation 9, 8.
[17] K ARPATHY, A., AND F EI -F EI , L. Deep visual-semantic alignments for generating image descriptions.
arXiv preprint arXiv:1412.2306 (2014).
[18] K HAJAH , M., W ING , R. M., L INDSEY, R. V., AND M OZER , M. C. Incorporating latent factors into
knowledge tracing to predict individual differences in learning. Proceedings of the 7th International
Conference on Educational Data Mining (2014).
[19] K HAJAH , M. M., H UANG , Y., G ONZ ÁLEZ -B RENES , J. P., M OZER , M. C., AND B RUSILOVSKY, P.
Integrating knowledge tracing and item response theory: A tale of two frameworks. Proceedings of the
4th International Workshop on Personalization Approaches in Learning Environments (2014).
[20] L AN , A. S., S TUDER , C., AND BARANIUK , R. G. Time-varying learning and content analytics via
sparse factor analysis. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge
discovery and data mining (2014), ACM, pp. 452–461.
[21] L INNENBRINK , E. A., AND P INTRICH , P. R. Role of affect in cognitive processing in academic contexts.
Motivation, emotion, and cognition: Integrative perspectives on intellectual functioning and development
(2004), 57–87.
[22] M IKOLOV, T., K ARAFI ÁT, M., B URGET, L., C ERNOCK Ỳ , J., AND K HUDANPUR , S. Recurrent neural
network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International
Speech Communication Association, Makuhari, Chiba, Japan, 2010 (2010), pp. 1045–1048.
[23] PARDOS , Z. A., AND H EFFERNAN , N. T. Kt-idem: Introducing item difficulty to the knowledge tracing
model. In User Modeling, Adaption and Personalization. Springer, 2011, pp. 243–254.
[24] PAVLIK J R , P. I., C EN , H., AND KOEDINGER , K. R. Performance Factors Analysis–A New Alternative
to Knowledge Tracing. Online Submission (2009).
[25] P IECH , C., H UANG , J., N GUYEN , A., P HULSUKSOMBATI , M., S AHAMI , M., AND G UIBAS , L. J.
Learning program embeddings to propagate feedback on student code. CoRR abs/1505.05969 (2015).
[26] P IECH , C., S AHAMI , M., H UANG , J., AND G UIBAS , L. Autonomously generating hints by inferring
problem solving policies. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale
(New York, NY, USA, 2015), L@S ’15, ACM, pp. 195–204.
[27] P IECH , C., S AHAMI , M., KOLLER , D., C OOPER , S., AND B LIKSTEIN , P. Modeling how students learn
to program. In Proceedings of the 43rd ACM symposium on Computer Science Education.
[28] P OLSON , M. C., AND R ICHARDSON , J. J. Foundations of intelligent tutoring systems. Psychology
Press, 2013.
[29] R AFFERTY, A. N., B RUNSKILL , E., G RIFFITHS , T. L., AND S HAFTO , P. Faster teaching by POMDP
planning. In Artificial intelligence in education (2011), Springer, pp. 280–287.
[30] ROHRER , D. The effects of spacing and mixing practice problems. Journal for Research in Mathematics
Education (2009), 4–17.
[31] S CHRAAGEN , J. M., C HIPMAN , S. F., AND S HALIN , V. L. Cognitive task analysis. Psychology Press,
2000.
[32] W ILLIAMS , R. J., AND Z IPSER , D. A learning algorithm for continually running fully recurrent neural
networks. Neural computation 1, 2 (1989), 270–280.
[33] Y UDELSON , M. V., KOEDINGER , K. R., AND G ORDON , G. J. Individualized bayesian knowledge
tracing models. In Artificial Intelligence in Education (2013), Springer, pp. 171–180.
9
Appendix
A LSTM Equations
B Expectimax
Expectimax is a brute force, tree based, MDP search algorithm that calculates the expected utility
of each action under the assumption that the agent will always make a maximizing decision when
given a choice, and that after an action has been taken, the environment will produce a next state
using a stochastic process.
C Concept Clustering
Figure A.1: It is difficult to cluster concepts using model weights. Here is tSNE using the readout and reading
weights of the best RNN model trained on synthetic data with five hidden concepts (labeled).
D Model Insights
10
Figure A.2: The Khan Academy exercise labels.
34 68
55
39 48
57 65
59
13 62
31 28
19 56
35 46 69
43 5
25 49
26 14
10 6 53 33 54
7 61
60 63 30
52 4
41 32
9 20
66 40
11 21
15 44 45 64 2 16
23
38
1
8 58
50
42
22
Figure A.3: Exercise influence graph derived from student transitions between problems. Edges (a, b) represent
the probability of a student solving b after they solve a. Only transitions with probability > 0.1 are displayed.
These have less structure than the relationships derived in Figure. 4.
Figure A.4: Exercise influence graph using Equation 4, but based on the empirical conditional accuracy on
exercise j following correct performance on exercise i. Only edge weights > 0.1 are displayed. These have
less structure than the relationships derived in Figure 4.
11
0.9
0.8
0.7
Mean Predicted P
0.6
0.5
0.4
0.3
0 10 20 30 40 50
Number of Questions Answered
Figure A.5: How do the best students differ from below-average students? There seems to be much less variance
in their knowledge increase. The red curve shows the mean predicted accuracy for students closest to the 40th
percentile of the class after 50 questions, while the blue curve is for students closest to the 100th percentile of
the class after 50 questions.
Figure A.6: The parameter bz is easy to interpret. In general the ith element captures the marginal probability
of getting the ith exercise correct.
12