How Deep is Knowledge Tracing?

Mohammad Khajah, Robert V. Lindsey, and Michael C. Mozer
Dept. of Computer Science, University of Colorado, Boulder, Colorado 80309
mohammad.khajah@colorado.edu, robert.lindsey@colorado.edu, mozer@colorado.edu

ABSTRACT
In theoretical cognitive science, there is a tension between highly structured models whose parameters have a direct psychological interpretation and highly complex, general-purpose models whose parameters and representations are difficult to interpret. The former typically provide more insight into cognition but the latter often perform better. This tension has recently surfaced in the realm of educational data mining, where a deep learning approach to predicting students' performance as they work through a series of exercises—termed deep knowledge tracing or DKT—has demonstrated a stunning performance advantage over the mainstay of the field, Bayesian knowledge tracing or BKT. In this article, we attempt to understand the basis for DKT's advantage by considering the sources of statistical regularity in the data that DKT can leverage but which BKT cannot. We hypothesize four forms of regularity that BKT fails to exploit: recency effects, the contextualized trial sequence, inter-skill similarity, and individual variation in ability. We demonstrate that when BKT is extended to allow it more flexibility in modeling statistical regularities—using extensions previously proposed in the literature—BKT achieves a level of performance indistinguishable from that of DKT. We argue that while DKT is a powerful, useful, general-purpose framework for modeling student learning, its gains do not come from the discovery of novel representations—the fundamental advantage of deep learning. To answer the question posed in our title, knowledge tracing may be a domain that does not require 'depth'; shallow models like BKT can perform just as well and offer us greater interpretability and explanatory power.

1. INTRODUCTION
In the past forty years, machine learning and cognitive science have undergone many paradigm shifts, but few have been as dramatic as the recent surge of interest in deep learning [16]. Although deep learning is little more than a re-branding of neural network techniques popular around 1990, deep learning has achieved some remarkable results thanks to much faster computing resources and much larger data sets than were available in 1990. Deep learning underlies state-of-the-art systems in speech recognition, language processing, and image classification [16, 26]. Deep learning also is responsible for systems that can produce captions for images [29], create synthetic images [9], play video games [19] and even Go [27].

The 'deep' in deep learning refers to multiple levels of representation transformation that lie between model inputs and outputs. For example, an image-classification model may take pixel values as input and produce a labeling of the objects in the image as output. Between the input and output is a series of representation transformations that construct successively higher-order features—features that are less sensitive to lighting conditions and the position of objects in the image, and more sensitive to the identities of the objects and their qualitative relationships. The features discovered by deep learning exhibit a complexity and subtlety that make them difficult to analyze and understand (e.g., [31]). Furthermore, no human engineer could wire up a solution as thorough and accurate as solutions discovered by deep learning. Deep learning models are fundamentally nonparametric, in the sense that interpreting individual weights and individual unit activations in a network is pretty much impossible. This opacity is in stark contrast to parametric models, e.g., linear regression, where each of the coefficients has a clear interpretation in terms of the problem at hand and the input features.

In one domain after the next, deep learning has achieved gains over traditional approaches. Deep learning discards hand-crafted features in favor of representation learning, and deep learning often ignores domain knowledge and structure in favor of massive data sets and general architectural constraints on models (e.g., models with spatial locality to process images, and models with local temporal constraints to process time series).
It was inevitable that deep learning would be applied to student-learning data [22]. This domain has traditionally been the purview of the educational data mining community, where Bayesian knowledge tracing, or BKT, is the dominant computational approach [3]. The deep learning approach to modeling student data, termed deep knowledge tracing or DKT, created a buzz when it appeared at the Neural Information Processing Systems Conference in December 2015, including press inquiries (N. Heffernan, personal communication) and descriptions of the work in the blogosphere (e.g., [7]). Piech et al. [22] reported substantial improvements in prediction performance with DKT over BKT on two real-world data sets (Assistments, Khan Academy) and one synthetic data set which was generated under assumptions that are not tailored to either DKT or BKT. DKT achieves a reported 25% gain in AUC (a measure of prediction quality) over the best previous result on the Assistments benchmark.

In this article, we explore the success of DKT. One approach to this exploration might be to experiment with DKT, removing components of the model or modifying the input data to determine which model components and data characteristics are essential to DKT's performance. We pursue an alternative approach in which we first formulate hypotheses concerning the signals in the data that DKT is able to exploit but that BKT is not. Given these hypotheses, we propose extensions to BKT which provide it with additional flexibility, and we evaluate whether the enhanced BKT can achieve results comparable to DKT. This procedure leads not only to a better understanding of how BKT and DKT differ, but also helps us to understand the structure and statistical regularities in the data source.

1.1 Modeling Student Learning
The domain we're concerned with is electronic tutoring systems which employ cognitive models to track and assess student knowledge. Beliefs about what a student knows and doesn't know allow a tutoring system to dynamically adapt its feedback and instruction to optimize the depth and efficiency of learning.

Ultimately, the measure of learning is how well students are able to apply skills that they have been taught. Consequently, student modeling is often formulated as time series prediction: given the series of exercises a student has attempted previously and the student's success or failure on each exercise, predict how the student will fare on a new exercise. Formally, the data consist of a set of binary random variables indicating whether student s produces a correct response on trial t, {X_st}. The data also include the exercise labels, {Y_st}, which characterize the exercise. Secondary data has also been incorporated in models, including the student's utilization of hints, response time, and characteristics of the specific exercise and the student's particular history with related exercises [2, 30]. Although such data improve predictions, the bulk of research in this area has focused on the primary measure—whether a response is correct or incorrect—and a sensible research strategy is to determine the best model based on the primary data, and then to determine how to incorporate secondary data.

The exercise label, Y_st, might index the specific exercise, e.g., 3 + 4 versus 2 + 6, or it might provide a more general characterization of the exercise, e.g., single digit addition. In the latter case, exercises are grouped by the skill that must be applied to obtain a solution. Although we will use the term skill in this article, others refer to the skill as a knowledge component, and the authors of DKT also use the term concept. Regardless, the important distinction for the purpose of our work is between a label that indicates the particular exercise and a label that indicates the general skill required to perform the exercise. We refer to these two types of labels as exercise indexed and skill indexed, respectively.

1.2 Knowledge Tracing
BKT models skill-specific performance, i.e., performance on a series of exercises that all tap the same skill. A separate instantiation of BKT is made for each skill, and a student's raw trial sequence is parsed into skill-specific subsequences that preserve the relative ordering of exercises within a skill but discard the ordering relationship of exercises across skills. For a given skill σ, BKT is trained using the data from each student s, {X_st | Y_st = σ}, where the relative trial order is preserved. Because it will become important for us to distinguish between absolute trial index and the relative trial index within a skill, we use t to denote the former and use i to denote the latter.

BKT is based on a theory of all-or-none human learning [1] which postulates that the knowledge state of student s following the i'th exercise requiring a certain skill, K_si, is binary: 1 if the skill has been mastered, 0 otherwise. BKT, formalized as a hidden Markov model, infers K_si from the sequence of observed responses on trials 1...i, {X_s1, X_s2, ..., X_si}. BKT is typically specified by four parameters: P(K_s0 = 1), the probability that the student has mastered the skill prior to solving the first exercise; P(K_s,i+1 = 1 | K_si = 0), the transition probability from the not-mastered to mastered state; P(X_si = 1 | K_si = 0), the probability of correctly guessing the answer prior to skill mastery; and P(X_si = 0 | K_si = 1), the probability of answering incorrectly due to a slip following skill mastery. Because BKT is typically used in modeling practice over brief intervals, the model assumes no forgetting, i.e., K cannot transition from 1 to 0.
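To make the four-parameter model concrete, the following is a minimal sketch of the standard BKT forward update for a single skill. It is our own illustration of the update implied by the definitions above, not the implementation used in our simulations.

```python
# Minimal sketch of the classic BKT forward update for one skill.

def bkt_predict_sequence(observations, p_init, p_learn, p_guess, p_slip):
    """Return P(correct) for each trial of one student on one skill.

    observations: list of 0/1 responses, in relative trial order within the skill.
    p_init : P(K_s0 = 1), prior probability of mastery
    p_learn: P(K_s,i+1 = 1 | K_si = 0), learning transition
    p_guess: P(X_si = 1 | K_si = 0), guess probability
    p_slip : P(X_si = 0 | K_si = 1), slip probability (no forgetting in classic BKT)
    """
    p_know = p_init
    predictions = []
    for x in observations:
        # Predict the current response from the current belief about mastery.
        p_correct = p_know * (1 - p_slip) + (1 - p_know) * p_guess
        predictions.append(p_correct)
        # Condition the mastery belief on the observed response (Bayes rule).
        if x == 1:
            p_know = p_know * (1 - p_slip) / p_correct
        else:
            p_know = p_know * p_slip / (1 - p_correct)
        # Apply the learning transition; knowledge never decays in classic BKT.
        p_know = p_know + (1 - p_know) * p_learn
    return predictions

# Example (illustrative parameter values): a student who fails early, then succeeds.
print(bkt_predict_sequence([0, 0, 1, 1, 1],
                           p_init=0.2, p_learn=0.15, p_guess=0.2, p_slip=0.1))
```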
BKT is a highly constrained, structured model. It assumes that the student's knowledge state is binary, that predicting performance on an exercise requiring a given skill depends only on the student's binary knowledge state, and that the skill associated with each exercise is known in advance. If correct, these assumptions allow the model to make strong inferences. If incorrect, they limit the model's performance. The only way to determine if model assumptions are correct is to construct an alternative model that makes different assumptions and to determine whether the alternative outperforms BKT. DKT is exactly this alternative model, and its strong performance directs us to examine BKT's limitations. First, however, we briefly describe DKT.

Rather than constructing a separate model for each skill, DKT models all skills jointly. The input to the model is the complete sequence of exercise-performance pairs, {(X_s1, Y_s1) ... (X_st, Y_st) ... (X_sT, Y_sT)}, presented one trial at a time. As depicted in Figure 1, DKT is a recurrent neural net which takes (X_st, Y_st) as input and predicts X_s,t+1 for each possible exercise label. The model is trained and evaluated based on the match between the actual and predicted X_s,t+1 for the tested exercise (Y_s,t+1). In addition to the input and output layers representing the current trial and the next trial, respectively, the network has a hidden layer with fully recurrent connections (i.e., each hidden unit connects back to all other hidden units). The hidden layer thus serves to retain relevant aspects of the input history as they are useful for predicting future performance.
[Figure 1: Deep knowledge tracing (DKT) architecture. The network comprises an input layer (current exercise + accuracy), a hidden layer (student knowledge state), and an output layer (predicted accuracy for each exercise). Each rectangle depicts a set of processing units; each arrow depicts complete connectivity between each unit in the source layer and each unit in the destination layer.]

The hidden state of the network can be conceived of as embodying the student's knowledge state. Piech et al. [22] used a particular type of hidden unit, called an LSTM (long short-term memory) [10], which is interesting because these hidden units behave very much like the BKT latent knowledge state, K_si. To briefly explain LSTM, each hidden unit acts like a memory element that can hold a bit of information. The unit is triggered to turn on or off by events in the input or the state of other hidden units, but when there is no specific trigger, the unit preserves its state, very similar to the way that the latent state in BKT is sticky—once a skill is learned it stays learned. With 200 LSTM hidden units—the number used in simulations reported in [22]—and 50 skills, DKT has roughly 250,000 free parameters (connection strengths). Contrast this number with the 200 free parameters required for embodying 50 different skills in BKT.

With its thousand-fold increase in flexibility, DKT is a very general architecture. One can implement BKT-like dynamics in DKT with a particular, restricted set of connection strengths. However, DKT clearly has the capacity to encode learning dynamics that are outside the scope of BKT. This capacity is what allows DKT to discover structure in the data that BKT misses.
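The parameter counts quoted above can be verified with a back-of-the-envelope calculation. The sketch below assumes the standard LSTM parameterization and a one-hot input encoding of (skill, correctness), which is the encoding described by Piech et al. [22]; it is our accounting, not a claim about the exact architecture they trained.

```python
# Rough free-parameter count for DKT (200 LSTM units, 50 skills) versus BKT.

n_skills, n_hidden = 50, 200
n_input = 2 * n_skills                 # one-hot over (skill, correct/incorrect) pairs

# Standard LSTM: 4 gate blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((n_input + n_hidden) * n_hidden + n_hidden)
output_params = n_hidden * n_skills + n_skills   # per-skill prediction layer
print(lstm_params + output_params)               # ~250,850, i.e., "roughly 250,000"

bkt_params = 4 * n_skills                        # 4 parameters per skill
print(bkt_params)                                # 200
```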

1.3 Where Does BKT Fall Short?
In this section, we describe four regularities that we conjecture to be present in the student-performance data. DKT is flexible enough that it has the potential to discover these regularities, but the more constrained BKT model is simply not crafted to exploit the regularities. In following sections, we suggest means of extending BKT to exploit such regularities, and conduct simulation studies to determine whether the enhanced BKT achieves performance comparable to that of DKT.

1.3.1 Recency Effects
Human behavior is strongly recency driven. For example, when individuals perform a choice task repeatedly, response latency can be predicted by an exponentially decaying average of recent stimuli [12]. Intuitively, one might expect to observe recency effects in student performance. Consider, for example, a student's time-varying engagement. If the level of engagement varies slowly relative to the rate at which exercises are being solved, a correlation would be induced in performance across local spans of time. A student who performed poorly on the last trial because they were distracted is likely to perform poorly on the current trial. We conducted a simple assessment of recency using the Assistments data set (the details of this data set will be described shortly). Similarly to [5], we built an autoregressive model that predicts performance on the current trial as an exponentially weighted average of performance on past trials, with a decay half-life of about 5 steps. We found that this single-parameter model fit the Assistments data reliably better than classic BKT. (We are not presenting details of this simulation because we will evaluate a more rigorous variant of the idea in a following section. Our goal here is to convince the reader that there is likely some value to the notion of recency-weighted prediction.)
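The following sketch shows the kind of single-parameter recency model described above: the prediction for the current trial is an exponentially weighted average of past responses, with a decay half-life of roughly 5 trials. This is our reconstruction of the idea for illustration, not the exact model we fit.

```python
# Exponentially weighted recency predictor (single free parameter: the half-life).

def recency_predictions(responses, half_life=5.0):
    decay = 0.5 ** (1.0 / half_life)   # per-trial retention factor
    num, den = 0.0, 0.0
    preds = []
    for x in responses:
        # The prediction for the current trial uses only earlier trials;
        # fall back to 0.5 when no history is available yet.
        preds.append(num / den if den > 0 else 0.5)
        num = decay * num + x
        den = decay * den + 1.0
    return preds

print(recency_predictions([1, 1, 0, 0, 0, 1]))
```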

Recurrent neural networks tend to be more strongly influenced by recent events in a sequence than more distal events [20]. Consequently, DKT is well suited to exploiting recent performance in making predictions. In contrast, the generative model underlying BKT supposes that once a skill is learned, performance will remain strong, and that a slip at time t is independent of a slip at t + 1.

1.3.2 Contextualized Trial Sequence
The psychological literature on practice of multiple skills indicates that the sequence in which an exercise is embedded influences learning and retention (e.g., [24, 25]). For example, given three exercises each of skills A and B, presenting the exercises in the interleaved order A1–B1–A2–B2–A3–B3 yields superior performance relative to presenting the exercises in the blocked order A1–A2–A3–B1–B2–B3. (Performance in this situation can be based on an immediate or delayed test.)

Because DKT is fed the entire sequence of exercises a student receives in the order the student receives them, it can potentially infer the effect of exercise order on learning. In contrast, because classic BKT separates exercises by skill, preserving only the relative order of exercises within a skill, the training sequence for BKT is the same regardless of whether the trial order is blocked or interleaved.

1.3.3 Inter-Skill Similarity
Each exercise presented to a student has an associated label. In typical applications of BKT—as well as two of the three simulations reported in Piech et al. [22]—the label indicates the skill required to solve the problem. Any two such skills, S1 and S2, may vary in their degree of relatedness. The stronger the relatedness, the more highly correlated one would expect performance to be on exercises tapping the two skills, and the more likely that the two skills will be learned simultaneously.

DKT has the capacity to encode inter-skill similarity. If each hidden unit represents student knowledge state for a particular skill, then the hidden-to-hidden connections encode the degree of overlap. In an extreme case, if two skills are highly similar, they can be modeled by a single hidden knowledge state. In contrast, classic BKT treats each skill as an independent modeling problem and thus can not discover or leverage inter-skill similarity.
DKT has the additional strength, as demonstrated by Piech et al., that it can accommodate the absence of skill labels. If each label simply indexes a specific exercise, DKT can discover interdependence between exercises in exactly the same manner as it discovers interdependence between skills. In contrast, BKT requires exercise labels to be skill indexed.

1.3.4 Individual Variation in Ability
Students vary in ability, as reflected in individual differences in mean accuracy across trials and skills. Individual variation might potentially be used in a predictive manner: a student's accuracy on early trials in a sequence might predict accuracy on later trials, regardless of the skills required to solve exercises. We performed a simple verification of this hypothesis using the Assistments data set. In this data set, students study one skill at a time and then move on to the next skill. We computed the correlation between mean accuracy of all trials on the first n skills and the mean accuracy of all trials on skill n+1, for all students and for n ∈ {1, ..., N−1}, where N is the number of skills a student studied. We obtained a correlation coefficient of 0.39: students who tend to do well on the early skills learned tend to do well on later skills, regardless of the skills involved.
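The verification described above can be sketched as follows. The per-student data layout (a list of per-skill mean accuracies, in study order) is our assumption for illustration; the actual data preparation is omitted.

```python
# Correlate mean accuracy on a student's first n skills with accuracy on skill n+1.

import numpy as np

def ability_transfer_correlation(students):
    """students: list of per-student lists; each inner list holds the mean
    accuracy on each skill, in the order the skills were studied."""
    xs, ys = [], []
    for skill_accs in students:
        for n in range(1, len(skill_accs)):
            xs.append(np.mean(skill_accs[:n]))   # accuracy on the first n skills
            ys.append(skill_accs[n])             # accuracy on skill n+1
    return np.corrcoef(xs, ys)[0, 1]

# Toy example with three students and three skills each.
students = [[0.9, 0.8, 0.85], [0.4, 0.5, 0.45], [0.7, 0.75, 0.6]]
print(ability_transfer_correlation(students))
```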
DKT is presented with a student's complete trial sequence. It can use a student's average accuracy up to trial t to predict trial t + 1. Because BKT models each skill separately from the others, it does not have the contextual information needed to estimate a student's average accuracy or overall ability.

2. EXTENDING BKT
In the previous section, we described four regularities that appear to be present in the data and which we conjecture that DKT exploits but which the classic BKT model cannot. In this section, we describe three extensions to BKT that would bring BKT on par with DKT with regard to these regularities.

2.1 Forgetting
To better capture recency effects, BKT can be augmented to allow for forgetting of skills. Forgetting corresponds to fitting a BKT parameter F ≡ P(K_s,i+1 = 0 | K_si = 1), the probability of transitioning from a state of knowing to not knowing a skill. In standard BKT, F = 0.

Without forgetting, once BKT infers that the student has learned, even a long run of poorly performing trials cannot alter the inferred knowledge state. However, with forgetting, the knowledge state can transition in either direction, which allows the model to be more sensitive to the recent trials: a run of unsuccessful trials is indicative of not knowing the skill, regardless of what preceded the run. Forgetting is not a new idea to BKT, and in fact was included in the original psychological theory that underlies the notion of a binary knowledge state [1]. However, it has not typically been incorporated into BKT. When it has been included in BKT [23], the motivation was to model forgetting from one day to the next, not forgetting that can occur on a much shorter time scale.

Incorporating forgetting can not only sensitize BKT to recent events but can also contextualize trial sequences. To explain, consider an exercise sequence such as A1–A2–B1–A3–B2–B3–A4, where the labels are instances of skills A and B. Ordinary BKT discards the absolute number of trials between two exercises of a given skill, but with forgetting, we can count the number of intervening trials and treat each as an independent opportunity for forgetting to occur. Consequently, the probability of forgetting between A1 and A2 is F, but the probability of forgetting between A2 and A3 is 1 − (1 − F)^2 and between A3 and A4 is 1 − (1 − F)^3. Using forgetting, BKT can readily incorporate some information about the absolute trial sequence, and therefore has more potential than classic BKT to be sensitive to interspersed trials in the exercise sequence.
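The forgetting computation described above amounts to compounding the per-trial forgetting probability F over the trials that elapse between successive exercises of a skill. The value of F below is an illustrative choice, not a fitted parameter.

```python
# Probability that a mastered skill is forgotten across k opportunities,
# given per-trial forgetting probability F: 1 - (1 - F)^k.

def p_forget_across(k, F):
    return 1.0 - (1.0 - F) ** k

F = 0.05   # illustrative value
# For the sequence A1-A2-B1-A3-B2-B3-A4:
print(p_forget_across(1, F))   # A1 -> A2: one opportunity, probability F
print(p_forget_across(2, F))   # A2 -> A3: 1 - (1 - F)^2
print(p_forget_across(3, F))   # A3 -> A4: 1 - (1 - F)^3
```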
2.2 Skill Discovery
To model interactions among skills, one might suppose that each skill has some degree of influence on the learning of other skills, not unlike the connection among hidden units in DKT. For BKT to allow for such interactions among skills, the independent BKT models would need to be interconnected, using an architecture such as a factorial hidden Markov model [6]. As an alternative to this somewhat complex approach, we explored a simpler scheme in which different exercise labels could be collapsed together to form a single skill. For example, consider an exercise sequence such as A1–B1–A2–C1–B2–C2–C3. If skills A and B are highly similar or overlapping, such that learning one predicts learning the other, it would be more sensible to treat this sequence in a manner that groups A and B into a single skill, and to train a single BKT instantiation on both A and B trials. This approach can be used whether the exercise labels are skill indices or exercise indices. (One of the data sets used by Piech et al. [22] to motivate DKT has exercise-indexed labels).

We recently proposed an inference procedure that automatically discovers the cognitive skills needed to accurately model a given data set [18]. (A related procedure was independently proposed in [8].) The approach couples BKT with a technique that searches over partitions of the exercise labels to simultaneously (1) determine which skill is required to correctly answer each exercise, and (2) model a student's dynamical knowledge state for each skill. Formally, the technique assigns each exercise label to a latent skill such that a student's expected accuracy on a sequence of same-skill exercises improves monotonically with practice according to BKT. Rather than discarding the skills identified by experts, our technique incorporates a nonparametric prior over the exercise-skill assignments that is based on the expert-provided skills and a weighted Chinese restaurant process [11].

In the above illustration, our technique would group A and B into one skill and C into another. This procedure collapses like skills (or like exercises), yielding better fits to the data by BKT. Thus, the procedure performs a sort of skill discovery.
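The core bookkeeping behind the label-collapsing idea is sketched below: given a candidate assignment of exercise labels to latent skills, each student's trial sequence is regrouped into skill-specific subsequences, and one BKT instance is fit per group. The search over candidate partitions (with the weighted Chinese restaurant process prior of [18]) is omitted here; the partition used mirrors the A/B/C example above.

```python
# Regroup a trial sequence under a candidate label-to-skill partition.

from collections import defaultdict

def regroup_by_skill(trials, label_to_skill):
    """trials: list of (exercise_label, correct) pairs in presentation order.
    Returns one response subsequence per latent skill (relative order preserved)."""
    by_skill = defaultdict(list)
    for label, correct in trials:
        by_skill[label_to_skill[label]].append(correct)
    return dict(by_skill)

trials = [("A", 0), ("B", 0), ("A", 1), ("C", 0), ("B", 1), ("C", 1), ("C", 1)]
print(regroup_by_skill(trials, {"A": 0, "B": 0, "C": 1}))
# {0: [0, 0, 1, 1], 1: [0, 1, 1]} -- A and B collapsed into one skill, C kept separate
```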
2.3 Incorporating Latent Student-Abilities
To account for individual variation in student ability, we have extended BKT [14, 13] such that slip and guess probabilities are modulated by a latent ability parameter that is inferred from the data, much in the spirit of item-response theory [4]. As we did in [14], we assume that students with stronger abilities have lower slip and higher guess probabilities. When the model is presented with new students, the posterior predictive distribution on abilities is used initially, but as responses from the new student are observed, uncertainty in the student's ability diminishes, yielding better predictions for the student.
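One way to picture the ability modulation described above is sketched below. The logistic link and the base-rate constants are our illustrative choices; the precise parameterization in [14, 13] differs in its details.

```python
# Illustrative sketch: a latent ability shifts the guess probability up and the
# slip probability down, consistent with the qualitative assumption above.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def guess_slip_for_student(ability, base_guess_logit=-1.5, base_slip_logit=-2.0):
    # Stronger ability -> higher guess probability and lower slip probability.
    guess = sigmoid(base_guess_logit + ability)
    slip = sigmoid(base_slip_logit - ability)
    return guess, slip

print(guess_slip_for_student(ability=0.0))
print(guess_slip_for_student(ability=1.0))   # higher guess, lower slip
```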
3. SIMULATIONS
3.1 Data Sets
Piech et al. [22] studied three data sets. One of the data sets, from Khan Academy, is not publicly available. Despite our requests and a plea from one of the co-authors of the DKT paper, we were unable to obtain permission from the data science team at Khan Academy to use the data set. We did investigate the other two data sets in Piech et al., which are as follows.

Assistments is an electronic tutor that teaches and evaluates students in grade-school math. The 2009-2010 “skill builder” data set is a large, standard benchmark, available by searching the web for assistment-2009-2010-data. We used the train/test split provided by Piech et al., and following Piech et al., we discarded all students who had only a single trial of data.

Synthetic is a synthetic data set created by Piech et al. to model virtual students learning virtual skills. The training and test sets each consist of 2000 virtual students performing the same sequence of 50 exercises drawn from 5 skills. The exercise on trial t is assumed to have a difficulty characterized by δ_t and to require a skill specified by σ_t. The exercises are labeled by the identity of the exercise, not by the underlying skill, σ_t. The ability of a student, denoted α_t, varies over time according to a drift-diffusion process, generally increasing with practice. The response correctness on trial t is a Bernoulli draw with probability specified by guessing-corrected item-response theory with difficulty and ability parameters δ_t and α_t. This data set is challenging for BKT because the skill assignments, σ_t, are not provided and must be inferred from the data. Without the skill assignments, BKT must be used either with all exercises associated with a single skill or with each exercise associated with its own skill. Either of these assumptions will miss important structure in the data. Synthetic is an interesting data set in that the underlying generative model is neither a perfect match to DKT nor to BKT (even with the enhancements we have described). The generative model seems realistic in its assumption that knowledge state varies continuously.
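For readers unfamiliar with this style of generative model, the sketch below simulates one virtual student along the lines described above: ability drifts upward over trials and correctness is a Bernoulli draw from a guessing-corrected item-response model. The specific constants (guessing rate, drift and noise scales) are our assumptions for illustration, not the values used by Piech et al.

```python
# Illustrative generator for one virtual student in a Synthetic-style data set.

import numpy as np

rng = np.random.default_rng(0)

def simulate_student(difficulties, p_guess=0.25, drift=0.05, noise=0.1):
    alpha = rng.normal(0.0, 1.0)                 # initial ability
    responses = []
    for delta in difficulties:                   # one difficulty per trial, fixed order
        # Guessing-corrected item-response model for P(correct).
        p_correct = p_guess + (1 - p_guess) / (1 + np.exp(-(alpha - delta)))
        responses.append(int(rng.random() < p_correct))
        alpha += drift + noise * rng.normal()    # drift-diffusion: on average increasing
    return responses

print(simulate_student(rng.normal(0.0, 1.0, size=50)))
```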
We included two additional data sets in our simulations. Spanish is a data set of 182 middle-school students practicing 409 Spanish exercises (translations and application of simple skills such as verb conjugation) over the course of a 15-week semester, with a total of 578,726 trials [17]. Statics is from a college-level engineering statics course with 189,297 trials and 333 students and 1,223 exercises [28], available from the PSLC DataShop web site [15].

3.2 Methods
We evaluated five variants of BKT (code available at https://github.com/robert-lindsey/WCRP/tree/forgetting), each of which incorporates a different subset of the extensions described in the previous section: a base version that corresponds to the classic model and the model against which DKT was evaluated in [22], which we'll refer to simply as BKT; a version that incorporates forgetting (BKT+F), a version that incorporates skill discovery (BKT+S), a version that incorporates latent abilities (BKT+A), and a version that incorporates all three of the extensions (BKT+FSA). We also built our own implementation of DKT with LSTM recurrent units (code available at https://github.com/mmkhajah/dkt). (Piech et al. described the LSTM version as better performing, but posted only the code for the standard recurrent neural net version.) We verified that our implementation produced results comparable to those reported in [22] on Assistments and Synthetic. We then also ran the model on Spanish and Statics.

For Assistments, Spanish, and Statics, we used a single train/test split. The Assistments train/test split was identical to that used by Piech et al. For Synthetic, we used the 20 simulation sets provided by Piech et al. and averaged results across the 20 simulations.

Each model was evaluated on each domain's test data set, and the performance of the model was quantified with a discriminability score, the area under the ROC curve or AUC. AUC is a measure ranging from .5, reflecting no ability to discriminate correct from incorrect responses, to 1.0, reflecting perfect discrimination. AUC is computed by obtaining a prediction on the test set for each trial, across all skills, and then using the complete set of predictions to form the ROC curve. Although Piech et al. [22] do not describe the procedure they use to compute AUC for DKT, code they have made available implements the procedure we describe, and not the obvious alternative procedure in which ROC curves are computed on a per-skill basis and then averaged to obtain an overall AUC.
assumption that knowledge state varies continuously. 2
https://github.com/mmkhajah/dkt
3
Piech et al. cite Pardos and Heffernan [21] as obtain-
We included two additional data sets in our simulations. ing BKT’s best reported performance on Assistments—
Spanish is a data set of 182 middle-school students prac- an AUC of 0.69. In [21], the overall AUC is computed by
ticing 409 Spanish exercises (translations and application of averaging the per-skill AUCs. This method yields a lower
simple skills such as verb conjugation) over the course of a score than the method used by Piech et al., for two reasons.
First, the Piech procedure weighs all trials equally, whereas
15-week semester, with a total of 578,726 trials [17]. Statics the Pardos and Heffernan procedure weighs all skills equally.
is from a college-level engineering statics course with 189,297 With the latter procedure, the overall AUC will be dinged
trials and 333 students and 1,223 exercises [28], available if the model does poorly on a skill with just a few trials, as
from the PSLC DataShop web site [15]. we have observed to be the case with Assistments. The
latter procedure also produces a lower overall AUC because
it suppresses any lift due to being able to predict the rela-
3.2 Methods tive accuracy of different skills. In summary, it appears that
[Figure 2 appears here: four panels (Assistments, Synthetic, Statics, Spanish), each plotting test-set AUC for the models BKT, BKT+F, BKT+S, BKT+A, BKT+FSA, and DKT.]

Figure 2: A comparison of six models on four data sets. Model performance on the test set is quantified by AUC, a measure of how well the model discriminates (predicts) correct and incorrect student responses. The models are trained on one set of students and tested on another set. Note that the AUC scale is different for each graph, but tic marks are always spaced by .03 units in AUC. On Assistments and Synthetic, DKT results are from Piech et al. [22]; on Statics and Spanish, DKT results are from our own implementation. BKT = classic Bayesian knowledge tracing; BKT+A = BKT with inference of latent student abilities; BKT+F = BKT with forgetting; BKT+S = BKT with skill discovery; BKT+FSA = BKT with all three extensions; DKT = deep knowledge tracing.

On Assistments, classic BKT obtains an AUC of 0.73, better than the 0.67 reported for BKT by Piech et al. We are not sure why the scores do not match, although 0.67 is close to the AUC score we obtain if we treat all exercises as associated with a single skill or if we compute AUC on a per-skill basis and then average. BKT+F obtains an AUC of 0.83, not quite as good as the 0.86 value reported for DKT by Piech et al. Examining the various enhancements to BKT, AUC is boosted both by incorporating forgetting and by incorporating latent student abilities. We find it somewhat puzzling that the combination of the two enhancements, embodied in BKT+FSA, does no better than BKT+F or BKT+A, considering that the two enhancements tap different properties of the data: the student abilities help predict transfer from one skill to the next, whereas forgetting facilitates prediction within a skill.

(Footnote: Piech et al. cite Pardos and Heffernan [21] as obtaining BKT's best reported performance on Assistments—an AUC of 0.69. In [21], the overall AUC is computed by averaging the per-skill AUCs. This method yields a lower score than the method used by Piech et al., for two reasons. First, the Piech procedure weighs all trials equally, whereas the Pardos and Heffernan procedure weighs all skills equally. With the latter procedure, the overall AUC will be dinged if the model does poorly on a skill with just a few trials, as we have observed to be the case with Assistments. The latter procedure also produces a lower overall AUC because it suppresses any lift due to being able to predict the relative accuracy of different skills. In summary, it appears that inconsistent procedures may have been used to compute performance of BKT versus DKT in [22], and the procedure for BKT is biased to yield a lower score.)
To summarize the comparison of BKT and DKT, 31.6% of the difference in performance reported in [22] appears to be due to the use of a biased procedure for computing the AUC for BKT. Another 50.6% of the difference in performance reported vanishes if BKT is augmented to allow for forgetting. We can further improve BKT if we allow the skill discovery algorithm to operate with exercise labels that index individual exercises, as opposed to labels that index the skill associated with each exercise. With exercise-indexed labels, BKT+S and BKT+FSA both obtain an AUC of 0.90, beating DKT. However, given DKT's ability to perform skill discovery, we would not be surprised if it also achieved a similar level of performance when allowed to exploit exercise-indexed labels.

Turning to Synthetic, classic BKT obtains an AUC of 0.62, again significantly better than the 0.54 reported by Piech et al. In our simulation, we treat each exercise as having a distinct skill label, and thus BKT learns nothing more than the mean performance level for a specific exercise. (Because the exercises are presented in a fixed order, the exercise identity and the trial number are confounded. Because performance tends to improve as trials advance in the synthetic data, BKT is able to learn this relationship.) It is possible here that Piech et al. treated all exercises as associated with a single skill or that they used the biased procedure for computing AUC; either of these explanations is consistent with their reported AUC of 0.54.

Regarding the enhancements to BKT, adding student abilities (BKT+A) improves prediction of Synthetic, which is understandable given that the generative process simulates students with abilities that vary slowly over time. Adding forgetting (BKT+F) does not help, consistent with the generative process which assumes that knowledge level is on average increasing with practice; there is no systematic forgetting in the student simulation. Critical to this simulation is skill induction: BKT+S and BKT+FSA achieve an AUC of 0.80, better than the reported 0.75 for DKT in [22].

On Statics, each BKT extension obtains an improvement over classic BKT, although the magnitudes of the improvements are small. The full model, BKT+FSA, obtains an AUC of 0.75 and our implementation of DKT obtains a nearly identical AUC of 0.76. On Spanish, the BKT extensions obtain very little benefit. The full model, BKT+FSA, obtains an AUC of 0.846 and again, DKT obtains a nearly identical AUC of 0.836. These two sets of results indicate that for at least some data sets, classic BKT has no glaring deficiencies. However, we note that BKT model accuracy can be improved if algorithms are considered that use exercise labels which are indexed by exercise and not by skill. For example, with Statics, performing skill discovery using exercise-indexed labels, [17] obtain an AUC of 0.81, much better than the score of 0.73 we report here for BKT+S based on skill-indexed labels.

In summary, enhanced BKT appears to perform as well on average as DKT across the four data sets. Enhanced BKT outperforms DKT by 20.0% (.05 AUC units) on Synthetic and by 3.0% (.01 AUC unit) on Spanish. Enhanced BKT underperforms DKT by 8.3% (.03 AUC units) on Assistments and by 3.5% (.01 AUC unit) on Statics. These percentages are based on the difference of AUCs scaled by AUC_DKT − 0.5, which takes into account the fact that an AUC of 0.5 indicates no discriminability.
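The scaling described above is a one-line computation; applied to the Synthetic and Assistments numbers reported in this section, it reproduces the quoted percentages (a worked check, not a new result).

```python
# Relative gain of enhanced BKT over DKT, scaled by DKT's lift above chance (AUC 0.5).

def relative_gain(auc_enhanced_bkt, auc_dkt):
    return (auc_enhanced_bkt - auc_dkt) / (auc_dkt - 0.5)

print(relative_gain(0.80, 0.75))   # Synthetic:   +0.20  -> enhanced BKT ahead by 20.0%
print(relative_gain(0.83, 0.86))   # Assistments: -0.083 -> enhanced BKT behind by 8.3%
```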
4. DISCUSSION
Our goal in this article was to investigate the basis for the impressive predictive advantage of deep knowledge tracing over Bayesian knowledge tracing. We found some evidence that different procedures may have been used to evaluate DKT and BKT in [22], leading to a bias against BKT. When we replicated simulations of BKT reported in [22], we obtained significantly better performance: an AUC of 0.73 versus 0.67 on Assistments, and an AUC of 0.62 versus 0.54 on Synthetic.

However, even when the bias is eliminated, DKT obtains real performance gains over BKT. To understand the basis for these gains, we hypothesized various forms of regularity in the data which BKT is not able to exploit. We proposed enhancements to BKT to allow it to exploit these regularities, and we found that the enhanced BKT achieved a level of performance on average indistinguishable from that of DKT over the four data sets tested. The enhancements we explored are not novel; they have previously been proposed and evaluated in the literature. They include forgetting [23], latent student abilities [14, 13, 21], and skill induction [17, 8].

We observe that different enhancements to BKT matter for different data sets. For Assistments, incorporating forgetting is key; forgetting allows BKT to capture recency effects. For Synthetic, incorporating skill discovery yielded huge gains, as one would expect when the exercise-skill mapping is not known. And for Statics, incorporating latent student abilities was relatively most beneficial; these abilities enable the model to tease apart the capability of a student and the intrinsic difficulty of an exercise or skill. Of the three enhancements, forgetting and student abilities are computationally inexpensive to implement, whereas skill discovery adds an extra layer of computational complexity to inference.

The elegance of DKT is apparent when one considers the effort we have invested to bring BKT to par with DKT. DKT did not require its creators to analyze the domain and determine sources of structure in the data. In contrast, our approach to augmenting BKT required some domain expertise, a thoughtful analysis of BKT's limitations, and distinct solutions to each limitation. DKT is a generic recurrent neural network model [10], and it has no constructs that are specialized to modeling learning and forgetting, discovering skills, or inferring student abilities. This flexibility makes DKT robust on a variety of datasets with little prior analysis of the domains. Although training recurrent networks is computationally intensive, tools exist to exploit the parallel processing power in graphics processing units (GPUs), which means that DKT can scale to large datasets. Classic BKT is inexpensive to fit, although the variants we evaluated—particularly the model that incorporates skill discovery—require computation-intensive MCMC methods that have a distinct set of issues when it comes to parallelization.

DKT's advantages come at a price: interpretability. DKT is a massive neural network model with tens of thousands of parameters which are near-impossible to interpret. Although the creators of DKT did not have to invest much up-front time analyzing their domain, they did have to invest substantive effort to understand what the model had actually learned. Our proposed BKT extensions achieve predictive performance similar to DKT whilst remaining interpretable: the model parameters (forgetting rate, student ability, etc.) are psychologically meaningful. When skill discovery is incorporated into BKT, the result is clear: a partition of exercises into skills. Reading out such a partitioning from DKT is challenging and yields only an approximate representation of the knowledge in DKT.

Finally, we return to the question posed in the paper's title: How deep is knowledge tracing? Deep learning refers to the discovery of representations. Our results suggest that representation discovery is not at the core of DKT's success. We base this argument on the fact that our enhancements to BKT bring it to the performance level of DKT without requiring any sort of subsymbolic representation discovery. Representation discovery is clearly critical in perceptual domains such as image or speech classification. But the domain of education and student learning is high level and abstract. The input and output elements of models are psychologically meaningful. The relevant internal states of the learner have some psychological basis. The characterization of exercises and skills can—to at least a partial extent—be expressed symbolically.

(Footnote: Of course, the skill discovery mechanism we incorporated certainly does regroup exercises to form skills, but the form of this regrouping or partitioning is far more limited than the typical transformations in a neural network to map from one level of representation to another.)

Instead of attributing DKT's success to representation discovery, we attribute DKT's success to its flexibility and generality in capturing statistical regularities directly present in the inputs and outputs. As long as there are sufficient data to constrain the model, DKT is more powerful than classic BKT. BKT arose in a simpler era, an era in which data and computation resources were precious. DKT reveals the value of relaxing these constraints in the big data era. But despite the wild popularity of deep learning, there are many ways to relax the constraints and build more powerful models other than creating a black box predictive device with a vast interconnected tangle of connections and parameters that are nearly impossible to interpret.

5. ACKNOWLEDGMENTS
This research was supported by NSF grants SES-1461535, SBE-0542013, and SMA-1041755.

6. REFERENCES
[1] R. Atkinson and J. A. Paulson. An approach to the psychology of instruction. Psychological Bulletin, 78:49–61, 1972.
[2] R. S. Baker, A. T. Corbett, and V. Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In Proceedings of the 9th International Conference on Intelligent Tutoring Systems, pages 406–415, Berlin, Heidelberg, 2008. Springer-Verlag.
[3] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modelling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1995.
[4] P. De Boeck and M. Wilson. Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. Springer-Verlag, New York, NY, 2004.
[5] A. Galyardt and I. Goldin. Move your lamp post: Recent data reflects learner knowledge better than older data. JEDM - Journal of Educational Data Mining, 7(2):83–108, 2015.
[6] Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 472–478. MIT Press, 1996.
[7] R. Golden. How to optimize student learning using recurrent neural networks (educational technology). Web page, 2016. http://tinyurl.com/GoldenDKT, retrieved February 29, 2016.
[8] J. P. Gonzales-Brenes. Modeling skill acquisition over time with sequence and topic modeling. In G. Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. JMLR, 2015.
[9] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, pages 1462–1471, 2015.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] H. Ishwaran and L. F. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, pages 1211–1235, 2003.
[12] M. Jones, T. Curran, M. C. Mozer, and M. H. Wilder. Sequential effects in response time reveal learning mechanisms and event representations. Psychological Review, 120:628–666, 2013.
[13] M. Khajah, Y. Huang, J. P. Gonzales-Brenes, M. C. Mozer, and P. Brusilovsky. Integrating knowledge tracing and item response theory: A tale of two frameworks. In M. Kravcik, O. C. Santos, and J. G. Boticario, editors, Proceedings of the 4th International Workshop on Personalization Approaches in Learning Environments, pages 7–15. CEUR Workshop Proceedings, 2014.
[14] M. Khajah, R. M. Wing, R. V. Lindsey, and M. C. Mozer. Incorporating latent factors into knowledge tracing to predict individual differences in learning. In J. Stamper, Z. Pardos, M. Mavrikis, and B. M. McLaren, editors, Proceedings of the 7th International Conference on Educational Data Mining, pages 99–106. Educational Data Mining Society Press, 2014.
[15] K. Koedinger, R. Baker, K. Cunningham, A. Skogsholm, B. Leber, and J. Stamper. A data repository for the EDM community: The PSLC DataShop. In C. Romero, S. Ventura, M. Pechenizkiy, and R. Baker, editors, Handbook of Educational Data Mining, http://pslcdatashop.org, 2010.
[16] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
[17] R. Lindsey, J. Shroyer, H. Pashler, and M. Mozer. Improving students' long-term knowledge retention with personalized review. Psychological Science, 25:639–647, 2014.
[18] R. V. Lindsey, M. Khajah, and M. C. Mozer. Automatic discovery of cognitive skills to improve the prediction of student learning. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1386–1394. Curran Associates, Inc., 2014.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, et al. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
[20] M. C. Mozer. Induction of multiscale temporal structure. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 275–282. Morgan-Kaufmann, 1992.
[21] Z. A. Pardos and N. T. Heffernan. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In User Modeling, Adaption and Personalization, pages 243–254. Springer, 2011.
[22] C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein. Deep knowledge tracing. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 505–513. Curran Associates, Inc., 2015.
[23] Y. Qiu, Y. Qi, H. Lu, Z. A. Pardos, and N. T. Heffernan. Does time matter? Modeling the effect of time with Bayesian knowledge tracing. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, and J. C. Stamper, editors, Educational Data Mining 2011, pages 139–148. www.educationaldatamining.org, 2011.
[24] D. Rohrer, R. F. Dedrick, and K. Burgess. The benefit of interleaved mathematics practice is not limited to superficially similar kinds of problems. Psychonomic Bulletin and Review, 21:1323–1330, 2014.
[25] D. Rohrer, R. F. Dedrick, and S. Stershic. Interleaved practice improves mathematics learning. Journal of Educational Psychology, 107:900–908, 2015.
[26] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[27] D. Silver, A. Huang, C. J. Maddison, A. Guez, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.
[28] P. Steif and N. Bier. OLI Engineering Statics – Fall 2011. Feb. 2014.
[29] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition, 2015.
[30] H.-F. Yu and others. Feature engineering and classifier ensemble for KDD Cup 2010. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2010.
[31] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, pages 818–833. Springer International Publishing, Cham, 2014.
