1.3.1 Recency Effects
Human behavior is strongly recency driven. For example, when individuals perform a choice task repeatedly, response latency can be predicted by an exponentially decaying average of recent stimuli [12]. Intuitively, one might expect to observe recency effects in student performance.

DKT has the capacity to encode inter-skill similarity. If each hidden unit represents student knowledge state for a particular skill, then the hidden-to-hidden connections encode the degree of overlap. In an extreme case, if two skills are highly similar, they can be modeled by a single hidden knowledge state. In contrast, classic BKT treats each skill as an independent modeling problem and thus can not discover or leverage inter-skill similarity.

DKT has the additional strength, as demonstrated by Piech et al., that it can accommodate the absence of skill labels. If each label simply indexes a specific exercise, DKT can discover interdependence between exercises in exactly the same manner as it discovers interdependence between skills. In contrast, BKT requires exercise labels to be skill indexed.

1.3.4 Individual Variation in Ability
Students vary in ability, as reflected in individual differences in mean accuracy across trials and skills. Individual variation might potentially be used in a predictive manner: a student's accuracy on early trials in a sequence might predict accuracy on later trials, regardless of the skills required to solve exercises. We performed a simple verification of this hypothesis using the Assistments data set. In this data set, students study one skill at a time and then move on to the next skill. We computed the correlation between the mean accuracy of all trials on the first n skills and the mean accuracy of all trials on skill n+1, for all students and for n ∈ {1, ..., N−1}, where N is the number of skills a student studied. We obtained a correlation coefficient of 0.39: students who tend to do well on the early skills learned tend to do well on later skills, regardless of the skills involved.

DKT is presented with a student's complete trial sequence. It can use a student's average accuracy up to trial t to predict trial t+1. Because BKT models each skill separately from the others, it does not have the contextual information needed to estimate a student's average accuracy or overall ability.
2. EXTENDING BKT
In the previous section, we described four regularities that appear to be present in the data and which we conjecture that DKT exploits but which the classic BKT model cannot. In this section, we describe three extensions to BKT that would bring BKT on par with DKT with regard to these regularities.

2.1 Forgetting
To better capture recency effects, BKT can be augmented to allow for forgetting of skills. Forgetting corresponds to fitting a BKT parameter F ≡ P(K_{s,i+1} = 0 | K_{s,i} = 1), the probability of transitioning from a state of knowing to not knowing a skill. In standard BKT, F = 0.

Without forgetting, once BKT infers that the student has learned, even a long run of poorly performing trials cannot alter the inferred knowledge state. However, with forgetting, the knowledge state can transition in either direction, which allows the model to be more sensitive to the recent trials: a run of unsuccessful trials is indicative of not knowing the skill, regardless of what preceded the run. Forgetting is not a new idea to BKT, and in fact was included in the original psychological theory that underlies the notion of a binary knowledge state [1]. However, it has not typically been incorporated into BKT. When it has been included in BKT [23], the motivation was to model forgetting from one day to the next, not forgetting that can occur on a much shorter time scale.

Incorporating forgetting can not only sensitize BKT to recent events but can also contextualize trial sequences. To explain, consider an exercise sequence such as A1–A2–B1–A3–B2–B3–A4, where the labels are instances of skills A and B. Ordinary BKT discards the absolute number of trials between two exercises of a given skill, but with forgetting, we can count the number of intervening trials and treat each as an independent opportunity for forgetting to occur. Consequently, the probability of forgetting between A1 and A2 is F, but the probability of forgetting between A2 and A3 is 1 − (1 − F)², and between A3 and A4 it is 1 − (1 − F)³. Using forgetting, BKT can readily incorporate some information about the absolute trial sequence, and therefore has more potential than classic BKT to be sensitive to interspersed trials in the exercise sequence.
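The arithmetic above follows from treating each intervening trial as an independent forgetting opportunity. A minimal sketch, assuming that no learning of a skill occurs while other skills are being practiced:

```python
def decay_knowledge(p_know, k, F):
    """Belief that the skill is still known after k intervening trials, each an
    independent forgetting opportunity with probability F (assuming no learning
    of this skill occurs while other skills are being practiced)."""
    return p_know * (1.0 - F) ** k

F = 0.1
# gaps for skill A in the sequence A1-A2-B1-A3-B2-B3-A4: 1, 2, and 3 trials
for k in (1, 2, 3):
    # probability of forgetting across the gap, and surviving belief if known
    print(k, 1 - (1 - F) ** k, decay_knowledge(1.0, k, F))
```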
2.2 Skill Discovery
To model interactions among skills, one might suppose that each skill has some degree of influence on the learning of other skills, not unlike the connection among hidden units in DKT. For BKT to allow for such interactions among skills, the independent BKT models would need to be interconnected, using an architecture such as a factorial hidden Markov model [6]. As an alternative to this somewhat complex approach, we explored a simpler scheme in which different exercise labels could be collapsed together to form a single skill. For example, consider an exercise sequence such as A1–B1–A2–C1–B2–C2–C3. If skills A and B are highly similar or overlapping, such that learning one predicts learning the other, it would be more sensible to treat this sequence in a manner that groups A and B into a single skill, and to train a single BKT instantiation on both A and B trials. This approach can be used whether the exercise labels are skill indices or exercise indices. (One of the data sets used by Piech et al. [22] to motivate DKT has exercise-indexed labels.)

We recently proposed an inference procedure that automatically discovers the cognitive skills needed to accurately model a given data set [18]. (A related procedure was independently proposed in [8].) The approach couples BKT with a technique that searches over partitions of the exercise labels to simultaneously (1) determine which skill is required to correctly answer each exercise, and (2) model a student's dynamical knowledge state for each skill. Formally, the technique assigns each exercise label to a latent skill such that a student's expected accuracy on a sequence of same-skill exercises improves monotonically with practice according to BKT. Rather than discarding the skills identified by experts, our technique incorporates a nonparametric prior over the exercise-skill assignments that is based on the expert-provided skills and a weighted Chinese restaurant process [11].

In the above illustration, our technique would group A and B into one skill and C into another. This procedure collapses like skills (or like exercises), yielding better fits to the data by BKT. Thus, the procedure performs a sort of skill discovery.

2.3 Incorporating Latent Student Abilities
To account for individual variation in student ability, we have extended BKT [14, 13] such that slip and guess probabilities are modulated by a latent ability parameter that is inferred from the data, much in the spirit of item-response theory [4]. As we did in [14], we assume that students with stronger abilities have lower slip and higher guess probabilities. When the model is presented with new students, the posterior predictive distribution on abilities is used initially, but as responses from the new student are observed, uncertainty in the student's ability diminishes, yielding better predictions for the student.
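One simple way to realize such modulation, offered as an illustrative sketch rather than the exact parameterization of our extension, is to shift the skill-level slip and guess probabilities on the logit scale by a per-student ability value:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def modulated_slip_guess(base_slip, base_guess, ability, scale=1.0):
    """Stronger students slip less and guess more. The logistic link and the
    single 'scale' weight are assumptions for illustration, not the
    parameterization fit in the paper."""
    logit = lambda p: math.log(p / (1.0 - p))
    slip = sigmoid(logit(base_slip) - scale * ability)
    guess = sigmoid(logit(base_guess) + scale * ability)
    return slip, guess

# a stronger student (ability = +1) versus a weaker one (ability = -1)
print(modulated_slip_guess(0.10, 0.20, +1.0))
print(modulated_slip_guess(0.10, 0.20, -1.0))
```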
3. SIMULATIONS
3.1 Data Sets
Piech et al. [22] studied three data sets. One of the data sets, from Khan Academy, is not publicly available. Despite our requests and a plea from one of the co-authors of the DKT paper, we were unable to obtain permission from the data science team at Khan Academy to use the data set. We did investigate the other two data sets in Piech et al., which are as follows.

Assistments is an electronic tutor that teaches and evaluates students in grade-school math. The 2009-2010 "skill builder" data set is a large, standard benchmark, available by searching the web for assistment-2009-2010-data. We used the train/test split provided by Piech et al., and following Piech et al., we discarded all students who had only a single trial of data.

Synthetic is a synthetic data set created by Piech et al. to model virtual students learning virtual skills. The training and test sets each consist of 2000 virtual students performing the same sequence of 50 exercises drawn from 5 skills. The exercise on trial t is assumed to have a difficulty characterized by δt and to require a skill specified by σt. The exercises are labeled by the identity of the exercise, not by the underlying skill, σt. The ability of a student, denoted αt, varies over time according to a drift-diffusion process, generally increasing with practice. The response correctness on trial t is a Bernoulli draw with probability specified by guessing-corrected item-response theory with difficulty and ability parameters δt and αt. This data set is challenging for BKT because the skill assignments, σt, are not provided and must be inferred from the data. Without the skill assignments, BKT must be used either with all exercises associated with a single skill or with each exercise associated with its own skill. Either of these assumptions will miss important structure in the data. Synthetic is an interesting data set in that the underlying generative model is neither a perfect match to DKT nor to BKT (even with the enhancements we have described). The generative model seems realistic in its assumption that knowledge state varies continuously.
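The sketch below illustrates a generative process of the kind just described; the drift step, noise level, guess rate, and logistic IRT form are placeholder assumptions, not the values or exact equations used by Piech et al.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_student(difficulty, drift=0.05, noise=0.1, guess=0.25):
    """One virtual student on a fixed exercise sequence: ability follows a
    drift-diffusion-style random walk that tends upward with practice, and
    each response is a Bernoulli draw from guessing-corrected IRT.
    All parameter values here are illustrative assumptions."""
    ability = 0.0
    responses = []
    for delta in difficulty:
        p_irt = 1.0 / (1.0 + np.exp(-(ability - delta)))
        p_correct = guess + (1.0 - guess) * p_irt        # guessing-corrected IRT
        responses.append(int(rng.random() < p_correct))
        ability += drift + noise * rng.standard_normal()  # drift upward over trials
    return responses

# the Synthetic set uses a fixed sequence of 50 exercises drawn from 5 skills
difficulty = rng.normal(size=50)
print(simulate_student(difficulty))
```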
We included two additional data sets in our simulations. Spanish is a data set of 182 middle-school students practicing 409 Spanish exercises (translations and application of simple skills such as verb conjugation) over the course of a 15-week semester, with a total of 578,726 trials [17]. Statics is from a college-level engineering statics course with 189,297 trials and 333 students and 1,223 exercises [28], available from the PSLC DataShop web site [15].

3.2 Methods
We evaluated five variants of BKT¹, each of which incorporates a different subset of the extensions described in the previous section: a base version that corresponds to the classic model and the model against which DKT was evaluated in [22], which we'll refer to simply as BKT; a version that incorporates forgetting (BKT+F); a version that incorporates skill discovery (BKT+S); a version that incorporates latent abilities (BKT+A); and a version that incorporates all three of the extensions (BKT+FSA). We also built our own implementation of DKT with LSTM recurrent units². (Piech et al. described the LSTM version as better performing, but posted only the code for the standard recurrent neural net version.) We verified that our implementation produced results comparable to those reported in [22] on Assistments and Synthetic. We then also ran the model on Spanish and Statics.

¹ https://github.com/robert-lindsey/WCRP/tree/forgetting
² https://github.com/mmkhajah/dkt

For Assistments, Spanish, and Statics, we used a single train/test split. The Assistments train/test split was identical to that used by Piech et al. For Synthetic, we used the 20 simulation sets provided by Piech et al. and averaged results across the 20 simulations.

Each model was evaluated on each domain's test data set, and the performance of the model was quantified with a discriminability score, the area under the ROC curve or AUC. AUC is a measure ranging from .5, reflecting no ability to discriminate correct from incorrect responses, to 1.0, reflecting perfect discrimination. AUC is computed by obtaining a prediction on the test set for each trial, across all skills, and then using the complete set of predictions to form the ROC curve. Although Piech et al. [22] do not describe the procedure they use to compute AUC for DKT, code they have made available implements the procedure we describe, and not the obvious alternative procedure in which ROC curves are computed on a per-skill basis and then averaged to obtain an overall AUC.

3.3 Results
Figure 2 presents the results of our comparison of five variants of BKT on the four data sets. We walk through the data sets from left to right.

On Assistments, classic BKT obtains an AUC of 0.73, better than the 0.67 reported for BKT by Piech et al. We are not sure why the scores do not match, although 0.67 is close to the AUC score we obtain if we treat all exercises as associated with a single skill or if we compute AUC on a per-skill basis and then average.³

³ Piech et al. cite Pardos and Heffernan [21] as obtaining BKT's best reported performance on Assistments—an AUC of 0.69. In [21], the overall AUC is computed by averaging the per-skill AUCs. This method yields a lower score than the method used by Piech et al., for two reasons. First, the Piech procedure weighs all trials equally, whereas the Pardos and Heffernan procedure weighs all skills equally. With the latter procedure, the overall AUC will be dinged if the model does poorly on a skill with just a few trials, as we have observed to be the case with Assistments. The latter procedure also produces a lower overall AUC because it suppresses any lift due to being able to predict the relative accuracy of different skills. In summary, it appears that inconsistent procedures may have been used to compute the performance of BKT versus DKT in [22], and the procedure for BKT is biased to yield a lower score.
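The distinction between the two procedures matters for the comparisons that follow. A sketch of both computations, assuming per-trial arrays of labels, predictions, and skill identifiers and using scikit-learn's roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pooled_auc(y_true, y_pred):
    """AUC over all test trials pooled together (the procedure we use)."""
    return roc_auc_score(y_true, y_pred)

def per_skill_auc(y_true, y_pred, skill_ids):
    """Alternative: compute an ROC curve per skill, then average the AUCs
    (the procedure of Pardos and Heffernan [21])."""
    aucs = []
    for s in np.unique(skill_ids):
        mask = skill_ids == s
        if len(np.unique(y_true[mask])) < 2:
            continue  # AUC is undefined when a skill has only one response class
        aucs.append(roc_auc_score(y_true[mask], y_pred[mask]))
    return float(np.mean(aucs))
```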
[Figure 2 appears here: four panels (Assistments, Synthetic, Statics, Spanish), each plotting test-set AUC for BKT, BKT+F, BKT+S, BKT+A, BKT+FSA, and DKT.]
Figure 2: A comparison of six models on four data sets. Model performance on the test set is quantified by AUC, a measure of how well the model discriminates (predicts) correct and incorrect student responses. The models are trained on one set of students and tested on another set. Note that the AUC scale is different for each graph, but tic marks are always spaced by .03 units in AUC. On Assistments and Synthetic, DKT results are from Piech et al. [22]; on Statics and Spanish, DKT results are from our own implementation. BKT = classic Bayesian knowledge tracing; BKT+A = BKT with inference of latent student abilities; BKT+F = BKT with forgetting; BKT+S = BKT with skill discovery; BKT+FSA = BKT with all three extensions; DKT = deep knowledge tracing.
BKT+F obtains an AUC of 0.83, not quite as good as the 0.86 value reported for DKT by Piech et al. Examining the various enhancements to BKT, AUC is boosted both by incorporating forgetting and by incorporating latent student abilities. We find it somewhat puzzling that the combination of the two enhancements, embodied in BKT+FSA, does no better than BKT+F or BKT+A, considering that the two enhancements tap different properties of the data: the student abilities help predict transfer from one skill to the next, whereas forgetting facilitates prediction within a skill.

To summarize the comparison of BKT and DKT, 31.6% of the difference in performance reported in [22] appears to be due to the use of a biased procedure for computing the AUC for BKT. Another 50.6% of the difference in performance reported vanishes if BKT is augmented to allow for forgetting. We can further improve BKT if we allow the skill discovery algorithm to operate with exercise labels that index individual exercises, as opposed to labels that index the skill associated with each exercise. With exercise-indexed labels, BKT+S and BKT+FSA both obtain an AUC of 0.90, beating DKT. However, given DKT's ability to perform skill discovery, we would not be surprised if it also achieved a similar level of performance when allowed to exploit exercise-indexed labels.

Turning to Synthetic, classic BKT obtains an AUC of 0.62, again significantly better than the 0.54 reported by Piech et al. In our simulation, we treat each exercise as having a distinct skill label, and thus BKT learns nothing more than the mean performance level for a specific exercise. (Because the exercises are presented in a fixed order, the exercise identity and the trial number are confounded. Because performance tends to improve as trials advance in the synthetic data, BKT is able to learn this relationship.) It is possible here that Piech et al. treated all exercises as associated with a single skill or that they used the biased procedure for computing AUC; either of these explanations is consistent with their reported AUC of 0.54.

Regarding the enhancements to BKT, adding student abilities (BKT+A) improves prediction of Synthetic, which is understandable given that the generative process simulates students with abilities that vary slowly over time. Adding forgetting (BKT+F) does not help, consistent with the generative process, which assumes that knowledge level is on average increasing with practice; there is no systematic forgetting in the student simulation. Critical to this simulation is skill induction: BKT+S and BKT+FSA achieve an AUC of 0.80, better than the reported 0.75 for DKT in [22].

On Statics, each BKT extension obtains an improvement over classic BKT, although the magnitudes of the improvements are small. The full model, BKT+FSA, obtains an AUC of 0.75 and our implementation of DKT obtains a nearly identical AUC of 0.76. On Spanish, the BKT extensions obtain very little benefit. The full model, BKT+FSA, obtains an AUC of 0.846 and again, DKT obtains a nearly identical AUC of 0.836. These two sets of results indicate that for at least some data sets, classic BKT has no glaring deficiencies. However, we note that BKT model accuracy can be improved if algorithms are considered that use exercise labels which are indexed by exercise and not by skill. For example, with Statics, performing skill discovery using exercise-indexed labels, [17] obtain an AUC of 0.81, much better than the score of 0.73 we report here for BKT+S based on skill-indexed labels.

In summary, enhanced BKT appears to perform as well on average as DKT across the four data sets. Enhanced BKT outperforms DKT by 20.0% (.05 AUC units) on Synthetic and by 3.0% (.01 AUC unit) on Spanish. Enhanced BKT underperforms DKT by 8.3% (.03 AUC units) on Assistments and by 3.5% (.01 AUC unit) on Statics. These percentages are based on the difference of AUCs scaled by AUC_DKT − 0.5, which takes into account the fact that an AUC of 0.5 indicates no discriminability.
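To make the scaling explicit, the improvement percentage is the AUC difference divided by how far DKT exceeds chance. A quick check against the Synthetic and Assistments values reported above (the remaining percentages depend on unrounded AUCs):

```python
def relative_gain(auc_bkt, auc_dkt):
    """Difference in AUC, scaled by how far DKT is above chance (AUC = 0.5)."""
    return (auc_bkt - auc_dkt) / (auc_dkt - 0.5)

print(relative_gain(0.80, 0.75))   #  0.20   -> enhanced BKT +20.0% on Synthetic
print(relative_gain(0.83, 0.86))   # -0.083  -> enhanced BKT -8.3% on Assistments
```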
4. DISCUSSION
Our goal in this article was to investigate the basis for the impressive predictive advantage of deep knowledge tracing over Bayesian knowledge tracing. We found some evidence that different procedures may have been used to evaluate DKT and BKT in [22], leading to a bias against BKT. When we replicated simulations of BKT reported in [22], we obtained significantly better performance: an AUC of 0.73 versus 0.67 on Assistments, and an AUC of 0.62 versus 0.54 on Synthetic.

However, even when the bias is eliminated, DKT obtains real performance gains over BKT. To understand the basis for these gains, we hypothesized various forms of regularity in the data which BKT is not able to exploit. We proposed enhancements to BKT to allow it to exploit these regularities, and we found that the enhanced BKT achieved a level of performance on average indistinguishable from that of DKT over the four data sets tested. The enhancements we explored are not novel; they have previously been proposed and evaluated in the literature. They include forgetting [23], latent student abilities [14, 13, 21], and skill induction [17, 8].

We observe that different enhancements to BKT matter for different data sets. For Assistments, incorporating forgetting is key; forgetting allows BKT to capture recency effects. For Synthetic, incorporating skill discovery yielded huge gains, as one would expect when the exercise-skill mapping is not known. And for Statics, incorporating latent student abilities was relatively most beneficial; these abilities enable the model to tease apart the capability of a student and the intrinsic difficulty of an exercise or skill. Of the three enhancements, forgetting and student abilities are computationally inexpensive to implement, whereas skill discovery adds an extra layer of computational complexity to inference.

The elegance of DKT is apparent when one considers the effort we have invested to bring BKT to par with DKT. DKT did not require its creators to analyze the domain and determine sources of structure in the data. In contrast, our approach to augmenting BKT required some domain expertise, a thoughtful analysis of BKT's limitations, and distinct solutions to each limitation. DKT is a generic recurrent neural network model [10], and it has no constructs that are specialized to modeling learning and forgetting, discovering skills, or inferring student abilities. This flexibility makes DKT robust on a variety of datasets with little prior analysis of the domains. Although training recurrent networks is computationally intensive, tools exist to exploit the parallel processing power in graphics processing units (GPUs), which means that DKT can scale to large datasets. Classic BKT is inexpensive to fit, although the variants we evaluated—particularly the model that incorporates skill discovery—require computation-intensive MCMC methods that have a distinct set of issues when it comes to parallelization.

DKT's advantages come at a price: interpretability. DKT is a massive neural network model with tens of thousands of parameters which are near-impossible to interpret. Although the creators of DKT did not have to invest much up-front time analyzing their domain, they did have to invest substantive effort to understand what the model had actually learned. Our proposed BKT extensions achieve predictive performance similar to DKT whilst remaining interpretable: the model parameters (forgetting rate, student ability, etc.) are psychologically meaningful. When skill discovery is incorporated into BKT, the result is clear: a partition of exercises into skills. Reading out such a partitioning from DKT is challenging and yields only an approximate representation of the knowledge in DKT.

Finally, we return to the question posed in the paper's title: How deep is knowledge tracing? Deep learning refers to the discovery of representations. Our results suggest that representation discovery is not at the core of DKT's success. We base this argument on the fact that our enhancements to BKT bring it to the performance level of DKT without requiring any sort of subsymbolic representation discovery.⁴ Representation discovery is clearly critical in perceptual domains such as image or speech classification. But the domain of education and student learning is high level and abstract. The input and output elements of models are psychologically meaningful. The relevant internal states of the learner have some psychological basis. The characterization of exercises and skills can—to at least a partial extent—be expressed symbolically.

⁴ Of course, the skill discovery mechanism we incorporated certainly does regroup exercises to form skills, but the form of this regrouping or partitioning is far more limited than the typical transformations in a neural network to map from one level of representation to another.

Instead of attributing DKT's success to representation discovery, we attribute DKT's success to its flexibility and generality in capturing statistical regularities directly present in the inputs and outputs. As long as there are sufficient data to constrain the model, DKT is more powerful than classic BKT. BKT arose in a simpler era, an era in which data and computation resources were precious. DKT reveals the value of relaxing these constraints in the big data era. But despite the wild popularity of deep learning, there are many ways to relax the constraints and build more powerful models other than creating a black box predictive device with a vast interconnected tangle of connections and parameters that are nearly impossible to interpret.

5. ACKNOWLEDGMENTS
This research was supported by NSF grants SES-1461535, SBE-0542013, and SMA-1041755.

6. REFERENCES
[1] R. Atkinson and J. A. Paulson. An approach to the psychology of instruction. Psychology Bulletin, 78:49–61, 1972.
[2] R. S. Baker, A. T. Corbett, and V. Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In Proceedings of the 9th International Conference on Intelligent Tutoring Systems, pages 406–415, Berlin, Heidelberg, 2008. Springer-Verlag.
[3] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modelling the acquisition of procedural knowledge. User Model. User-Adapt. Interact., 4(4):253–278, 1995.
[4] P. De Boeck and M. Wilson. Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. Springer-Verlag, New York, NY, 2004.
[5] A. Galyardt and I. Goldin. Move your lamp post: Recent data reflects learner knowledge better than older data. JEDM—Journal of Educational Data Mining, 7(2):83–108, 2015.
[6] Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 472–478. MIT Press, 1996.
[7] R. Golden. How to optimize student learning using recurrent neural networks (educational technology). Web page, 2016. http://tinyurl.com/GoldenDKT, retrieved February 29, 2016.
[8] J. P. Gonzales-Brenes. Modeling skill acquisition over time with sequence and topic modeling. In S. V. N. V. G. Lebanon, editor, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. JMLR, 2015.
[9] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, pages 1462–1471, 2015.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] H. Ishwaran and L. F. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, pages 1211–1235, 2003.
[12] M. Jones, T. Curran, M. C. Mozer, and M. H. Wilder. Sequential effects in response time reveal learning mechanisms and event representations. Psychological Review, 120:628–666, 2013.
[13] M. Khajah, Y. Huang, J. P. Gonzales-Brenes, M. C. Mozer, and P. Brusilovsky. Integrating knowledge tracing and item response theory: A tale of two frameworks. In M. Kravcik, O. C. Santos, and J. G. Boticario, editors, Proceedings of the 4th International Workshop on Personalization Approaches in Learning Environments, pages 7–15. CEUR Workshop Proceedings, 2014.
[14] M. Khajah, R. M. Wing, R. V. Lindsey, and M. C. Mozer. Incorporating latent factors into knowledge tracing to predict individual differences in learning. In J. Stamper, Z. Pardos, M. Mavrikis, and B. M. McLaren, editors, Proceedings of the 7th International Conference on Educational Data Mining, pages 99–106. Educational Data Mining Society Press, 2014.
[15] K. Koedinger, R. Baker, K. Cunningham, A. Skogsholm, B. Leber, and J. Stamper. A data repository for the EDM community: The PSLC DataShop. In C. Romero, S. Ventura, M. Pechenizkiy, and R. Baker, editors, Handbook of Educational Data Mining. http://pslcdatashop.org, 2010.
[16] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
[17] R. Lindsey, J. Shroyer, H. Pashler, and M. Mozer. Improving students' long-term knowledge retention with personalized review. Psychological Science, 25:639–647, 2014.
[18] R. V. Lindsey, M. Khajah, and M. C. Mozer. Automatic discovery of cognitive skills to improve the prediction of student learning. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1386–1394. Curran Associates, Inc., 2014.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, et al. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
[20] M. C. Mozer. Induction of multiscale temporal structure. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 275–282. Morgan-Kaufmann, 1992.
[21] Z. A. Pardos and N. T. Heffernan. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In User Modeling, Adaptation and Personalization, pages 243–254. Springer, 2011.
[22] C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein. Deep knowledge tracing. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 505–513. Curran Associates, Inc., 2015.
[23] Y. Qiu, Y. Qi, H. Lu, Z. A. Pardos, and N. T. Heffernan. Does time matter? Modeling the effect of time with Bayesian knowledge tracing. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, and J. C. Stamper, editors, Educational Data Mining 2011, pages 139–148. www.educationaldatamining.org, 2011.
[24] D. Rohrer, R. F. Dedrick, and K. Burgess. The benefit of interleaved mathematics practice is not limited to superficially similar kinds of problems. Psychonomic Bulletin and Review, 21:1323–1330, 2014.
[25] D. Rohrer, R. F. Dedrick, and S. Stershic. Interleaved practice improves mathematics learning. Journal of Educational Psychology, 107:900–908, 2015.
[26] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[27] D. Silver, A. Huang, C. J. Maddison, A. Guez, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.
[28] P. Steif and N. Bier. OLI Engineering Statics – Fall 2011. Feb. 2014.
[29] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition, 2015.
[30] H.-F. Yu and others. Feature engineering and classifier ensemble for KDD Cup 2010. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2010.
[31] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, pages 818–833. Springer International Publishing, Cham, 2014.