Predicting Difficulties from a Piece of Code

Marco Moresi, Marcos J. Gomez, and Luciana Benotti

Abstract—Based on hundreds of thousands of hours of data about how students learn in massive open online courses, educational machine learning promises to help students who are learning to code. However, in most classrooms, students and assignments do not have enough historical data for feeding these data-hungry algorithms. Previous work on predicting dropout is data hungry and, moreover, requires the code to be syntactically correct. As we deal with beginners' code in a text-based language, our models are trained on noisy student text; almost 40% of the code in our datasets contains parsing errors. In this paper we compare two machine learning models that predict whether students need help regardless of whether their code compiles or not. That is, we compare two methods for automatically predicting whether students will be able to solve a programming exercise on their own. The first model is a heavily feature-engineered approach that implements pedagogical theories of the relation between student interaction patterns and the probability of dropout; it requires a rich history of student interaction. The second method is based on a short program (that may contain errors) written by a student, together with a few hundred attempts by their classmates on the same exercise. This second method uses natural language processing techniques; it is based on the intuition that beginners' code may be closer to a natural language than to a formal one. It is inspired by previous work on predicting people's fluency when learning a second natural language.

Index Terms—Interactive environments, computer science education, modeling and prediction, machine learning.

Manuscript received July 16, 2020; revised April 5, 2021 and June 4, 2021; accepted TBD. Date of publication TBD; date of current version June 4, 2021. (Corresponding author: Marco Moresi.)
M. Moresi is with the Computer Science Department at the National University of Córdoba, Córdoba 5000, Argentina (email: mrc.moresi@gmail.com).
M. J. Gomez and L. Benotti are with the Computer Science Department at the National University of Córdoba, Córdoba 5000, Argentina, and with the Argentinean National Scientific and Technical Research Council (CONICET), Buenos Aires 1407, Argentina (email: {marcos.gomez, luciana.benotti}@unc.edu.ar).

I. INTRODUCTION

There is a worldwide interest in promoting youth engagement in computer science (CS) by teaching them to program. This is the case in many countries, for example, the U.S. [1], New Zealand [2], and the U.K. [3]. Nowadays, all these countries share the same problem: there are too many students who want to learn to program in comparison with the number of trained teachers. As more students become interested, it is necessary to figure out how to scale up the learning opportunities. With this goal in mind, massive open online courses (MOOCs) that teach programming have been developed. Codecademy [4], [5] is a MOOC implemented as a web portal containing interactive learning materials for programming in many languages, such as Python, Ruby, Javascript, and others. Courses are completed by following step-by-step instructions and writing code in the code editor. Codecademy is widespread but, as described in [5], one of its key limitations is that it is unable to identify when students have disengaged or are struggling with a task. After a fixed number of failed attempts, it just gives the correct solution to the student. Thanks to web-based coding tools such as Codecademy and others [4], it is possible to collect large databases that students generate during their learning process. Such datasets are valuable for learning analytics [6]. Code.org [1] in the U.S. organizes the "One hour of code" campaign, a massive online challenge where students solve fixed puzzles by programming. Piech et al. [7] model student knowledge in order to predict performance on future interactions on code generated by the students of Code.org. However, their approach (called knowledge tracing) is not directly applicable to most programming classrooms for at least three reasons. First, programming assignments in Code.org use the block-based programming language Blockly [8], so programs are syntactically correct by construction. In most CS1 classrooms programming is taught using text-based languages where students can type freely. In our classrooms, almost 40% of the programs written by students do not compile. We cannot build the abstract syntax tree of the program as Piech et al. do; only the raw text is available. Second, although block-based languages (like those in Code.org) can theoretically generate an infinite number of programs, in practice the set of probable student programs is smaller than in the typical CS1 coding assignment. Finally, knowledge tracing requires a rich history from that student and massive amounts of data on that particular programming assignment in order to train the algorithms.

In this paper we compare two models that do not assume that the code written by the student compiles. The first model is a heavily feature-engineered approach that implements pedagogical theories regarding the relation between the interaction patterns of a student and their probability of dropout. It requires a rich history of interaction with the student. The second model takes as input a short program from the student. The program may have all kinds of errors, including syntax errors. The model is trained on other programs, written by different students, on the same exercise. It uses techniques from natural language processing (NLP), namely a kind of recurrent neural network able to model long dependencies, known as long short-term memory (LSTM), and word embeddings that are trained to capture the distributional semantics of program tokens. The second model is inspired by the idea that the task of predicting whether a student is going to abandon a programming exercise is an indirect measure of his or her fluency while writing code for this exercise. Our work was inspired by the area of second natural language learning
[9]. There is quite a lot of work that uses NLP techniques for assisting software engineering tasks [10] and for second language learning [11]. However, the use of NLP to support programming language learning, as we do in this paper, seems to be new.

The domain that we work with is data obtained from a web-based coding tool similar to Codecademy, which we introduce briefly here and describe in detail in the next section. We have two different datasets. One is from students learning to program in a CS1 classroom with 75 students; the other is from four thousand autodidact learners using the coding tool on the web. In Fig. 1 we show a screenshot of the web-based coding tool that we use; it is called Mumuki. The exercise belongs to the functional programming chapter in Mumuki and the programming language is Haskell. The exercise displayed in the figure is exercise 4 of the current lesson, which contains five exercises. The left-hand panel shows the exercise description. The right-hand panel shows the editor where a student wrote a program. Below the title, a horizontal bar shows with colors the evaluation of the five exercises in the lesson. The first and last exercises have not been done and are marked in gray. The current exercise is marked with the blue dot in the bar. It is assessed as correct, as can be seen at the bottom of the figure, and marked in green by Mumuki. The second exercise in the progress bar is marked in dark red: it contains syntactic errors. The third exercise is assessed as light red: it contains test case errors. The task that we address in this paper can be described as answering the question: Will the student be able to turn a light or dark red exercise into a green one without help? In this paper we make the following contributions:
• We compare two machine learning models on the task of predicting whether a student will be able to solve a programming exercise on his or her own in two different learning scenarios.
• We propose a novel way to encode a learner's code as input to a deep neural network, even when the code does not compile, based on word embeddings and LSTM recurrent neural networks.
• We evaluate expert human performance for this task with teachers with different levels of experience in the context of a programming course teaching Haskell.
• We find that the performance of our best model in both learning scenarios is similar to the expert human performance of teachers with more than ten years of experience teaching this course.
• We find that our deep learning model is easier to design without knowledge of the learning domain, relative to heavily feature-engineered approaches. However, feature-engineered approaches are more interpretable from a pedagogical perspective.

The paper is organized as follows. In Section II we review related work about dropout prediction in programming language learning. Then we compare them to current advances in difficulty detection in second language learning. Following this we present NLP work on word embeddings and describe previous work using deep neural networks and program embeddings for dropout prediction. Section III first presents the online environment that we used to collect our datasets and then describes the two datasets that were collected. In Section IV we present how we measured expert human performance as well as the machine learning models we compare. One of these models is based on NLP techniques, the other is based on laborious feature design. We present our quantitative and qualitative results in Section V. This section includes a qualitative analysis of our model performance on one programming exercise. Section VI describes some applications and implications of our work for teaching programming, and also discusses limitations of this work. Finally, Section VII presents the conclusions and discusses future work.

II. RELATED WORK

In this section we first describe previous work on scaling programming education in general. Then we present recent advances in the area of difficulty detection in second language learning which motivate our approach to dropout prediction. We also describe recent advances in the area of word embeddings as a way of capturing the distributional semantics of both natural and programming languages. Finally, we describe previous work on dropout prediction for students learning to program based on knowledge tracing as well as on student learning style. Both methods require a rich student history.

A. Programming Language Learning

Over the last few years, governments, universities, companies, and organizations around the world have joined forces to bring the teaching of CS and programming to elementary, middle, and high schools. Different strategies were chosen: development of programming environments [4], [12], pedagogical resources [13], [14], specific events organized as hackathons, summer camps, and game jams [15], and professional development for teachers [16], [17]. As more students become interested in learning CS, many countries are moving forward in their efforts to massively introduce it into the mandatory school curriculum. One topic of debate between academics, policy makers, and the whole educational community is: who is going to teach CS in schools and how are teachers going to be prepared? The lack of teachers trained in CS and programming and the low impact of professional development courses [16] make it difficult to implement the teaching of programming at scale in real school settings. For this reason, programming teaching environments have become increasingly important.
Many environments for teaching programming have been developed. We briefly characterize some of them here in terms of the formative feedback that they generate automatically; for a complete survey see [4]. Automatic formative feedback has the potential to scale programming education. Most environments for learning programming include a parser and a compiler, and they are able to generate automatic feedback about syntactic errors. Some environments also propose a predefined curriculum of programming exercises. When programming exercises are predefined, it is possible to provide automatic feedback about whether the program does what it is supposed to do. Examples of environments that do not provide such automatic feedback are Scratch [18], [19], Alice [20], and MIT App Inventor [21]. They require that a teacher gives manual and personalized feedback on the behaviour of student programs. Such manual feedback is extremely valuable for the learning process, but it is expensive and does not scale when not enough teachers are available. Mumuki.io [22], Code.org [1], and Codecademy [4] are examples of environments that provide automatic feedback not only about syntax but also about the behaviour of the program. This feedback is about whether the program does what it is supposed to do. Such environments have a predefined set of exercises created with a sequence of contents and concepts, and can be useful for courses with too many students. They also store the programs sent by the student, their evaluation, and the feedback for each student. In this way, they store information about the learning process of the students.

Like natural languages, programming languages are not learned spontaneously but are acquired and evolve through interaction. Language learning is affected by different conditions of interactivity. The formative feedback that can be generated automatically by these environments is not as good as the one manually provided by a teacher. However, it can be useful for solving simple and repetitive errors and for guiding students when a teacher is not available. In this paper we investigate how these environments can be enhanced by predicting when a student will still struggle after reading the automatic feedback, and thus will need the personalized help of a teacher.

Fig. 1. Mumuki screenshot. The exercise belongs to the Functional Programming chapter in Mumuki; it is exercise 4 in the current lesson. The horizontal bar shows the evaluation of the exercises in the lesson. The current exercise is marked with the blue dot. The left-hand panel shows an exercise description. The right-hand panel shows the editor where a student wrote a solution. It is assessed as correct and marked in green by Mumuki. The second exercise in the progress bar is marked as dark red: containing syntactic errors. The third exercise is assessed as light red: containing test case errors. The first and last exercises are not done and are marked in gray.

B. Second Language Learning

There are many computer-based educational apps for second language learning which have increased in popularity in recent years. These apps generate vast amounts of student learning data which can be harnessed to drive personalized instruction.

In this context, the shared task of second language acquisition modeling was proposed [9]. Given a history of exercises attempted by learners of a second language, the task is to predict when a learner will make an error in the future. Kaneko et al. [11] present a model that participated in the shared task for second language learning. It used a bidirectional recurrent neural network implemented as an LSTM to predict potential errors made by a particular learner at a given exercise. The model was trained on previous answers by
different learners. The authors did not engineer any additional features and trained a single model for many exercises. A key finding for the second language learning task is the observation that, for this particular formulation of the task, the choice of learning algorithm appears to be more important than clever feature engineering. In particular, the most effective teams employed recurrent neural networks that can capture the sequential nature of the language produced by the learners. Furthermore, using a unified model that leverages data from all different kinds of exercises, no matter who the student is, provides further improvements. These results suggest two key insights. First, sequential algorithms such as LSTMs that can process sequences of words are easier to implement than feature engineering for this task, and second, unified learning approaches that share information across exercises are effective.

C. Word Embeddings

Word embeddings are the state-of-the-art method for lexical semantic representations [23] in the area of NLP. They represent semantic relations between words that can be learned from large databases of raw text. For instance, they can learn that the Haskell word "if" is a synonym of the operator "|" when the training data contains enough examples of the two cases. The semantic relation between "if" and "|" is represented in the word embeddings with two vectors that are similar; that is, their difference is close to zero. Word embeddings can also learn relations between more than two words. For instance, they can learn that "0" (the number zero) is to "Int" (the data type for integer numbers in Haskell) as "True" (the boolean constant) is to "Bool" (the boolean data type in Haskell).

Word embeddings are generally used for natural languages and not for programming languages. However, there is some recent work that applies word embeddings to code in order to search for specific functions in large code datasets [24]. Such word embeddings are applied over code that is (at least) syntactically correct. In this paper, we train our own word embeddings similarly to how it is done in [24], but on beginner code which is frequently not syntactically correct. Moreover, in [24] the word embeddings are used in order to answer questions about how to program something in large code databases. In other words, they translate a natural language query into correct code that implements the queried function. For example, a query could be "how to programmatically close or hide the Android software keyboard?" and the output would be a fragment of code that implements this function. Also, our architecture is different from theirs because we combine the word embeddings using LSTM networks, whereas they use information retrieval techniques (such as TF-IDF [25]) because they are dealing with a search task rather than a binary classification as we are in this paper.

In the word embedding matrix, two vector representations are close together if the corresponding words often occur in similar contexts. Word embeddings use this distribution to help define a semantic relation: words with vectors that are similar should have related meanings. This is called the distributional hypothesis in the NLP literature [26], and our work assumes that the same holds for code.

D. Dropout Prediction

Here we describe related work on dropout prediction for students learning to program. What all these approaches have in common is that they require a rich student history for training.

Papert and Turkle [27] argue that student learning styles can be characterized as tinkerer or planner. This characterization is based on the behavior of students when trying to solve programming exercises. This behavior is registered by the programming learning environments. Students characterized as tinkerers make many incremental and small changes on their path towards a correct solution. They tend to experiment through trial and error and have an approach to building programs that has been described as bottom-up [28]. Students characterized as planners by Papert and Turkle identify a path of action and then implement it with the aim of reaching the solution they consider correct. The programming style of these students could be described as top-down. The tinkerer and planner characterization was proposed for open-ended environments where students could explore freely and learn through unstructured activities; we investigate whether it is relevant for predicting dropout in more constrained environments such as the one offered by Mumuki.

In this paper we propose a heavily feature-engineered approach to predicting exercise dropout. In order to model tinkerers and planners we consider features such as the frequency between submissions, the amount of change in the code between successive submissions, and how complete the initial submission is, among others. Blikstein et al. [29] use methods from machine learning to discover patterns in student programming history and try to predict final exam grades in a constrained programming environment similar to Mumuki. The results in this case show that students' change in programming patterns is only weakly predictive of course performance. Based on the features used by Blikstein et al., we extend our feature-based approach with features such as the student's level of expertise, the number of dropouts in the student's history, and others that we describe in the next section.

The work described in [7] trains a neural model using a full program as an item of the LSTM vocabulary. Each program submitted by a student is modeled using program embeddings. Piech et al. [30] describe how these program embeddings are created. If we consider an analogy to words, then, their words
are the successive programs that the students submit, that is, their learning trajectories. Differently from them, and more similarly to what we described above regarding second language learning, our vocabulary is made of the words of the program. For example, for us "var1 = 3" corresponds to three words in our vocabulary, while it corresponds to only one word for [7]. As a result, our model is able to capture the similarity between the previous example and "var1 = 2", while for [7] these are two different program embeddings. We say that our model is trained at the word level while the model of [7] is trained at the program level. We use one embedding for each word of the program while they use one embedding for each complete program of the student trajectory. As a result, our model is able to learn from different exercises in a unified model while [7] trains a different model for each exercise.

Summing up: in this paper, we compare a laborious feature-engineered approach based on student interaction style to a deep learning architecture trained on beginners' raw code. The models based on feature engineering that we develop require a rich student history. In contrast, the deep learning architecture does not require a history of interaction with a particular student, but learns from other students and other exercises. An advantage of the deep learning architecture is that it prevents a cold start for a new student, similar to what was done by [11] for second language acquisition. It is different from what was done for knowledge tracing of a programming language learner in [7]; this model requires a rich student history, as with our feature-engineered models. Wu et al. [31] propose a model for knowledge tracing of new students [32], [33]. Unlike our work, they use the block-based language Blockly [8]. Blockly programs are always syntactically correct by construction. The intuition behind the deep learning approach is that students learning a new programming language may make syntactic as well as other kinds of errors, similarly to students learning a new natural language.

III. STUDY DESIGN

In this section, we first describe the web-based coding tool Mumuki, which we use to collect the datasets. Then we present the datasets that we use for the experiments.

A. The Web-based Coding Tool

Mumuki is an open-source web-based coding tool for teaching programming (source code available at [34]). It supports 17 programming languages, including Haskell, Prolog, Python, JavaScript, C, and Ruby, among others. Around 16 thousand students access the platform at least once a month. Mumuki currently has 70 thousand registered users. It is used by people of different ages, genders, and educational levels ranging from primary school to postgraduate courses, and by self-taught students. An evaluation of its use in the classroom (at university level) is presented in [22].

A screenshot of the tool can be seen in Fig. 1. The courses within the platform are organized by different programming paradigms, with many programming exercises for each paradigm organized into sections. For example, the figure shows an exercise of an early section in the functional programming paradigm which focuses on function reuse and logical operators. The interface includes a progress bar that shows which exercises have been solved correctly (in green), which have been attempted but are incorrect due to a syntax error (in dark red) or due to a test case error (in light red), and which have not been attempted yet (in gray). The current exercise is marked with a blue dot. In Fig. 1, the program passed all the tests and Mumuki says so with a message in green font at the bottom of the screen. The students can solve the exercises in any order they want, but Mumuki suggests an order with the progress bar.

From the student's point of view, an exercise includes the description of the problem (a few paragraphs in natural language) with at least one example and, optionally, an extra explanation with some functions that can or must be reused. An example can be seen in the left-hand panel of Fig. 1. The exercise "Leap Year" is presented with an example and the suggestion of using the auxiliary function "multiple". The right-hand panel includes an editor where the student types the solution and another tab with an interactive console where the solution and the reusable functions can be tested. Automatic feedback for the student's solution is shown at the bottom of the screen once the student presses the "Submit" button.

From the exercise designer's point of view, an exercise has two parts. On the one hand, the designer needs to write a description of the problem the student should solve by providing a program, one which generally requires no more than a few lines of code. On the other hand, the designer provides a suite of representative unit tests that are executed against the final program. The final program can optionally include auxiliary functions programmed by the designer. The designer is responsible for providing enough test cases in order to verify the correctness of the program. Although correctness cannot be assured using this technique, it successfully identifies common errors made by the students.
Every time a solution is submitted, it is analyzed by a compiler of the programming language, with three possible outcomes from Mumuki: dark red, light red, or green. If a syntax error is found, the error reported by the compiler is shown to the student, as can be seen in Fig. 2, and the exercise is marked as dark red. If the code compiles, then the test cases are evaluated and the tool shows the user which cases were solved correctly and which were not. For example, in Fig. 3 the first two test cases are correct and the last two are incorrect: the program only verifies whether the year is a multiple of 400, but does not consider whether the year is a multiple of 4 but not of 100. As a result, two of the tests defined by the exercise designer fail, and this is reported by Mumuki. In this case, Mumuki assesses the exercise as light red.

Fig. 2. Incorrect code corresponding to the isLeapYear function introduced in Fig. 1. The code is assessed as dark red because it has a syntax error. The student can see the Haskell compiler's response.

Fig. 3. Incorrect code corresponding to the isLeapYear function introduced in Fig. 1. The code is assessed as light red because some test results are incorrect. The student can see what the expected result is and compare it to the obtained result in the test case that did not pass.

Every solution submitted by a student is stored, thus it is possible to access not only the final solution but also all the steps that lead to it. The timestamp, the tests, and the compilation results are also stored, so the full history can be reproduced. This data could be used by a teacher to see the student's progress, to explore common mistakes, to check if an exercise is particularly difficult, etc. The logged data is presented to the designer of the exercise, usually the teacher, on a webpage inside Mumuki, as shown in Fig. 4. Using the tool shown in Fig. 4, the teacher can track the submissions made by all the students and their progress. The student has submitted 13 different solutions for this exercise, as shown in the red navigation bar at the bottom. Submission number 12 is currently on display in the figure and the changes with respect to submission 11 are highlighted. If the teacher wants to analyze progress, it is possible to inspect the submissions one by one. However, this would not be possible in large classrooms or with exercises with a large number of submissions. The tool provides valuable information for understanding the interaction of the users with Mumuki, but when the data grows, it becomes intractable. In this work, we collect all this data in order to create two different datasets, which are described in the following section.
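To make the structure of this logged data concrete, a single stored attempt can be thought of as a record like the one sketched below in Python. This is only an illustration; the field names are ours and do not correspond to Mumuki's actual schema.

from dataclasses import dataclass

@dataclass
class Submission:
    """One stored attempt; Mumuki keeps every submission, not only the final one."""
    student_id: str
    exercise_id: int
    timestamp: float        # when the solution was submitted, in seconds
    code: str               # raw program text, possibly with syntax errors
    assessment: str         # "green", "light red", or "dark red"
    failed_tests: tuple     # which of the designer's test cases did not pass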
B. The Datasets

In this section, we present the two different datasets collected through Mumuki, which represent two different educational environments: one a classroom setting, the other a MOOC. We call them the CS1 and MOOC datasets, respectively. The programming language for both datasets is Haskell. The submissions of the CS1 dataset were sent by students who were enrolled in a CS1 university course in the second semester of 2018. This is the first programming course of a five-year degree in CS. The course was taught for eight hours a week divided into two days, Tuesday and Thursday, from 9am to 1pm. A university professor, two teaching assistants, and one student assistant were available for the students. The submissions in the MOOC dataset were sent by students learning to program on their own in the setup of a MOOC. These students were not part of a formal education system. We are interested in analyzing these two datasets since they are collected from different educational contexts. We describe here both datasets in order to highlight their differences.

Fig. 5 shows the usage of Mumuki distributed along the days of the week. On the one hand, the majority of the submissions performed by users in CS1 are concentrated on Tuesdays and Thursdays; 68% of the submissions are sent on these two days. On the other hand, the submissions made by students in the MOOC dataset are similarly distributed along the days of the week (around 15% each day) with a slight decline on Saturdays and Sundays.

Fig. 5. Usage statistics of Mumuki according to the day of the week in the respective datasets, CS1 and MOOC.

Fig. 6 shows the distribution of the submissions by the hour of the day. The majority of the submissions of the CS1 dataset were performed while the course was taking place, between 9am and 1pm. The submissions outside this period follow a distribution similar to that of the MOOC dataset. Submissions in the MOOC dataset are more evenly distributed along the day, with a peak between 6 and 9pm; some students might use Mumuki after their working hours.

Fig. 6. Usage of Mumuki according to the hour of the day in the respective datasets, CS1 and MOOC.

Table I shows the distribution of the submissions according to the colors described in the previous section: dark red, light red, and green. The CS1 dataset contains more parse errors (dark red) than the MOOC dataset. In the CS1 group, this is the first programming language that the students learn, so syntax errors are highly expected. In the MOOC dataset, students might be familiar with other programming languages. This could be the reason why in the MOOC dataset the proportion of correct submissions, assessed green by the tool, is higher than in the CS1 dataset and the proportion of parse errors is lower.

TABLE I
SUBMISSION DISTRIBUTION FOR THE DATASETS CS1 AND MOOC CLASSIFIED ACCORDING TO MUMUKI'S ASSESSMENT

                                        CS1                MOOC
Submission assessment                 #       %          #        %
Code with parse error (dark red)    7457    38.5      69249     29.3
Code with test error (light red)    7855    40.5      86525     36.7
Correct code (green)                4060    21.0      79927     34.0
Total number of submissions        19372   100.0     235701    100.0
Total number of students              75               3915

The table shows that the CS1 dataset is more than ten times smaller than the MOOC dataset in the total number of submissions. It also has only 75 students, while the MOOC dataset has almost four thousand. Both datasets were collected for the same 67 exercises, but due to the larger number of users in the MOOC dataset, we might expect to find a more diverse range of solutions there.

IV. METHODOLOGY

In this section, we first define the task that we tackle as a classification problem. Then, we propose a model based on feature engineering for the task. Moreover, we present the deep learning architecture that we propose. Finally, we describe the rationale and the methodology for the human annotation.

A. Task Definition

The task that we focus on is defined as follows; we formulate it as a binary classification.

Task: Given an incorrect program from a student, predict whether the student will be able to solve the programming exercise on his or her own in the future or whether he or she will abandon the exercise with an incorrect solution. We formulate this task as a binary classification problem with the classes dropout and success.
For a given student and a given exercise, a submission is considered a dropout if it belongs to a session whose last submission is assessed as incorrect by Mumuki (dark red or light red). Intuitively, we say that the student drops out of an exercise if it was abandoned without reaching a correct solution. Conversely, a submission is considered a success if it belongs to a session whose last submission is assessed as correct (green). For a given student and a given exercise, we define a session as a sequence of solutions sent within a time frame where the idle time does not exceed a certain threshold. In order to define the threshold per dataset, we calculate the distance in seconds between the submissions sent by the student. As time passes, the probability that the student makes a new submission on the same exercise decreases. We define the threshold empirically as the elapsed time that covers the 90th percentile of idle times between submissions. This corresponds to around eight minutes for our datasets. That is, after more than eight minutes of inactivity we consider that a new submission corresponds to a new session. We also consider that if the student switches to another exercise, the session on the previous exercise ends and a new session starts.

B. Feature Engineering on Student and Exercise History

After tagging the solutions of both datasets, we define features that try to characterize dropout/success sessions. As discussed in Section II, we characterize students' learning style considering the notions of tinkerer and planner. This classification depends on the behavior of students when trying to solve programming exercises. On the one hand, students considered as tinkerers make many incremental changes through trial and error in order to create a valid final solution. On the other hand, students considered as planners identify a path of action and then implement it with the aim of reaching the solution they consider correct.

We define features that try to model the learning style of the student. We consider aspects such as the student's level of expertise, the number of dropouts in their history, the frequency with which they send submissions, and the number of changes in the code between successive submissions, among other things. We now explain each of the features we include in our model.
• Expertise with failures (EWF): In order to represent the level of experience of a student, we define the EWF feature. {e1, ..., em} represents the m exercises that a student attempted (out of the 67 exercises that are available). An exercise is defined as attempted if the student sent at least one submission to try to solve it; and n represents the total number of submissions (counting both submissions with and without errors) sent by the student to try to solve the m exercises attempted. EWF is defined as EWF = m / n. Note that EWF is always a number between 0 and 1 (as n is always greater than or equal to m).
• Dropout average (DA): DA is the proportion of dropouts performed by a student during his or her whole history in Mumuki. DA is defined as DA = #dropouts / n, where n is defined as above and #dropouts is the number of dropouts that the student made during his or her whole history. With this feature we capture how prone the student is to dropping out.
• Average elapsed time (AET): In order to try to capture whether the student behaves like a tinkerer or a planner, we define the average time elapsed between consecutive solutions inside a session. We average these over all the sessions of that student.
• Levenshtein distance average (LDA): We define LDA to try to capture how much the student changes a submission, with respect to the previous submission, before submitting it. We calculate the average Levenshtein distance [35] between the pairs of successive submissions of each exercise for each session. We average all these over all the sessions of that student.
• Bag of words (W): Considering the content of the program, we train a unigram model to represent the code of the submissions. No word order features are included, so the order between the words in the code is lost (this is different from the model we propose in the following section).

All the proposed features mentioned above are calculated for each student separately and are based on pedagogical literature [27], [29], considering different characteristics of the students. After we define these features, we train our baselines using logistic regression from scikit-learn [36]. We incrementally try different feature combinations as input to the model in order to improve its F1 score. Using the grid search method provided by scikit-learn [36], we obtained the optimal parametrization of this model, which is penalty l2, C = 1, solver liblinear, a tolerance of 1e-6, and class weight balanced.

As noted before, these models need a rich history for each student in order to calculate these features. A rich history is required in order to have a reliable representation of the student to train these models.
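To make the session-based labeling of Section IV-A concrete, the sketch below groups one student's submissions into sessions using an idle-time threshold (the 90th percentile of idle times, around eight minutes in these datasets) and labels each session by the assessment of its last submission. This is an illustrative sketch, not the authors' code; it assumes submission records with the fields sketched in Section III.

import numpy as np

def idle_time_threshold(submissions):
    """Per-dataset threshold: the 90th percentile of the gaps (in seconds)
    between consecutive submissions; around eight minutes here."""
    gaps = [b.timestamp - a.timestamp for a, b in zip(submissions, submissions[1:])]
    return np.percentile(gaps, 90)

def split_sessions(submissions, threshold):
    """Group one student's submissions (sorted by timestamp) into sessions.
    A new session starts when the student switches exercise or when the idle
    time between consecutive submissions exceeds the threshold."""
    sessions, current = [], []
    for sub in submissions:
        if current and (sub.exercise_id != current[-1].exercise_id
                        or sub.timestamp - current[-1].timestamp > threshold):
            sessions.append(current)
            current = []
        current.append(sub)
    if current:
        sessions.append(current)
    return sessions

def label_session(session):
    """Success if the session's last submission is assessed as correct (green)
    by Mumuki; dropout otherwise (dark red or light red)."""
    return "success" if session[-1].assessment == "green" else "dropout"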
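Building on the sessions above, the feature-based baseline of Section IV-B could be wired up with scikit-learn as in the following sketch. The library and the selected hyperparameters (penalty l2, C = 1, solver liblinear, tolerance 1e-6, balanced class weights) come from the text; the helper names, the edit-distance package, and the exact search grid are assumptions for illustration.

import numpy as np
from Levenshtein import distance as levenshtein   # python-Levenshtein (assumed package)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def history_features(sessions):
    """EWF, DA, AET and LDA for one student, following the definitions above."""
    subs = [s for session in sessions for s in session]
    n = len(subs)                                # total submissions, with and without errors
    m = len({s.exercise_id for s in subs})       # exercises attempted at least once
    dropouts = sum(1 for sess in sessions if sess[-1].assessment != "green")
    ewf = m / n                                  # expertise with failures, in (0, 1]
    da = dropouts / n                            # dropout average
    gaps = [b.timestamp - a.timestamp
            for sess in sessions for a, b in zip(sess, sess[1:])]
    aet = float(np.mean(gaps)) if gaps else 0.0  # average elapsed time inside sessions
    edits = [levenshtein(a.code, b.code)
             for sess in sessions for a, b in zip(sess, sess[1:])]
    lda = float(np.mean(edits)) if edits else 0.0  # Levenshtein distance average
    return [ewf, da, aet, lda]

# Bag-of-words unigrams over the submitted code: word order is discarded.
code_vectorizer = CountVectorizer()

# Grid search around the parametrization reported in the text.
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", tol=1e-6),
    param_grid={"penalty": ["l2"], "C": [0.1, 1, 10], "solver": ["liblinear"]},
    scoring="f1_weighted",
)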
C. Sequential Natural Language Processing Architecture

As we argued in the introduction, we believe that processing code using sequential natural language processing techniques rather than a bag-of-words model is suitable for the task at hand. To test this, we train the recurrent neural network (RNN) depicted in Fig. 7, based on LSTM units. As shown in the figure, each submission made by a student is tokenized and encoded word by word using word embeddings trained on the 67 exercises available in each dataset. Each sequence of word embeddings that represents a program is fed to an LSTM unit. Then a sigmoid layer is in charge of making the binary prediction. In Fig. 7 we see an example of a program in Haskell and how it is tokenized.

Below we first explain briefly how RNNs work and how they make predictions. Then we describe how our network is parametrized and how it is trained. Finally, we report how much time it took us to train the models and how much time they take to make a prediction given a new program.

RNNs are a particular type of neural network. The units of an RNN are connected in such a way that they form a directed graph along a sequence. This allows the model to exhibit dynamic behavior over a sequence. Unlike feedforward networks, RNNs can use their internal state (memory) to process sequences. This makes these types of networks useful for tasks in which the order within the sequence matters, such as speech recognition [37] or handwriting recognition [38], among other applications. LSTM units are made up of a cell, an input gate, an output gate, and a forget gate. Each unit remembers values over time intervals, and the three gates regulate the flow of information that enters and leaves the unit. This type of network is used for different purposes, such as classification, processing, and prediction, always based on sequences. For each type of use there are different ways of structuring the network. In particular, for this work, it is suitable to build a network that follows the architecture known as many-to-one. This architecture is illustrated in Fig. 7.

Fig. 7. Architecture of our deep learning model. Each submission made by a student is tokenized and encoded word by word using word embeddings trained on the 67 exercises available in each dataset. Each sequence of word embeddings that represents a program is fed to the network. Then a sigmoid layer is in charge of making the binary prediction.

Here we describe how the network is parametrized and how it is trained for the sake of reproducibility. The libraries that we use can be found in our code (available at [39]). Our network consists of an embedding layer that builds word embeddings for all the words in the vocabulary, considering each word as a token. It then has an LSTM-based layer and finally a sigmoid layer, as shown in Fig. 7. Each code submitted by a student is tokenized using a vocabulary of size 35000 and padded to length 100. In this way, we cover the 99.9th percentile of vocabulary size and code length. Programs are encoded utilizing the embedding layer mentioned above. In addition, we explore different sizes of dense vector embeddings for word semantics and we find that a 256-dimensional vector obtains the best performance for both datasets on our dev set (the dev set contains 10% of each dataset). 256-dimensional word embeddings are usually used for other NLP tasks [40]. Through hyper-parameter tuning, we find that an LSTM with 100 hidden units, a dropout layer, and a rate of 0.2 achieves the best performance on the dev set. A dropout layer is added to prevent over-fitting. The program encoded in this way is fed into a densely connected layer with a sigmoid activation function that performs the binary classification. We use the Adam optimizer to help the network converge faster [41]. We trained this model for a maximum of 25 epochs due to our computing power restrictions, optimizing the binary cross-entropy loss. We keep the model that performs best on the dev set, considering the F1 score, the same metric we used for the feature-based approach. We achieve the best results after 20 epochs in both datasets with a batch size of 32. Batch size depends on the GPU hardware used to train the model.

It takes 40 hours to train the model for the MOOC dataset and 8 hours for CS1. Once the models are trained, prediction time is two milliseconds. The resulting trained models are a black box that receives a program and gives a binary classification for our task for each dataset.
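The architecture and parametrization just described could be assembled, for example, with Keras as in the sketch below. The paper does not name its deep learning library, so Keras here is an assumption, and the text does not fully disambiguate whether the 0.2 value is a dropout rate or a learning rate; the sketch reads it as the dropout rate and keeps Adam's default learning rate.

import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 35_000   # tokenizer vocabulary reported in the text (99.9th percentile coverage)
MAX_LEN = 100         # submissions are padded/truncated to 100 tokens
EMB_DIM = 256         # embedding size that performed best on the dev set

# Inputs are integer token ids, e.g.:
# x = tf.keras.preprocessing.sequence.pad_sequences(token_ids, maxlen=MAX_LEN)

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMB_DIM),     # word embeddings learned on the 67 exercises
    layers.LSTM(100),                          # many-to-one: only the final hidden state is kept
    layers.Dropout(0.2),                       # 0.2 assumed to be the dropout rate
    layers.Dense(1, activation="sigmoid"),     # probability of success vs. dropout
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Training regime reported in the text: at most 25 epochs with batch size 32,
# keeping the checkpoint with the best weighted F1 on a 10% dev split.
# model.fit(x_train, y_train, validation_data=(x_dev, y_dev), epochs=25, batch_size=32)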
difficult when the teachers are not familiar with the common W + AET 0.60 0.64 0.64 0.60
errors associated to the exercises. Human annotation helps W + LDA 0.61 0.63 0.65 0.60
contextualize a machine learning classification as done here. It
W + DA 0.65 0.76 0.63 0.61
is a common practice in the area of NLP [42], [43].
W + EWF + AET + LDA + DA 0.67 0.73 0.63 0.68

V. RESULTS Sequential NLP architecture


Known students 0.86 0.84 - -
In this section we first present a quantitative and
comparative evaluation of the models we proposed in the New students - - 0.67 0.70
previous sections. We then describe a qualitative analysis with Expert teacher performance
the goal of discussing the interpretability of the deep learning human1 - - 0.64 -
model.
human2 - - 0.58 -

while being tested on the past. Although training and test data
A. Quantitative Results
are disjoint due to random sampling, there is data leakage
We report weighted F1 scores on the test sets in Table II for because of the temporal nature of our task. We think that the
CS1 and MOOC datasets. We assess the performance of a temporal data leakage problem is not well discussed in the
dummy classifier guessing baseline (guesses based on prior educational data mining community. For notable exceptions
class distributions) and use this as a lower bound. The stratified see [45]. Duplicated records are also another problem for
random is implemented using sklearn dummy class with a applying machine learning to educational data as noted in [45].
stratified strategy. The dummy performance is better in the In our second experiment, we use time series which is the
MOOC dataset because its majority class is success. For each appropriate methodology for our type of data. The results are
dataset we implement two experiments to evaluate our shown in the two time series column in Table II, one per each
models. Each experimental evaluation methodology combined dataset. We sort the submissions for both datasets temporarily,
with one dataset constitute the four columns of the table. We thus training in the past and testing in the future avoiding the
discuss here why the first experimental methodology is not data leakage issue of the previous methodology. The reported
suitable. Then we describe the results of the second results are the average of performing k-fold cross-validation
experiment in detail. using time series [46]. We organize the rows in four parts. As
In the first experiment, shown in the two first column of the we already mentioned, the first part of the table includes a
table, the training set for each dataset is built by random random baseline.
sampling. The results reported in the table are the average of The second part of the table describes our exploration of the
performing k-fold cross-validation [44]. We show these results feature space. That is, it presents the results of the model
to illustrate the fact that this evaluation methodology, explained in Section IV-B. The machine learning algorithm used
common for machine learning tasks, is flawed for our task. In is logistic regression. Logistic regression is frequently used as a
the known students row we see an artificially high result of .86 baseline for comparing deep neural networks since it can be
F1 for CS1. Results are also artificially good for linear regression seen as a shallow neural network [47]. The model that
models but the difference is larger for deep methods such as combines all the features gets 0.68 F1 in the MOOC dataset
RNNs. Random sampling is not suitable for our task because and 0.63 on the CS1 dataset. That is, the model shows 13/100
students edit and resubmit the code and the RNNs are able to points of improvement over the stratified random baseline on
memorize future code snippets of the same student on the the MOOC dataset and 11/100 points of improvement over the
same exercise. Doing random sampling means that the MOOC dataset.
The second part of the table describes our exploration of the feature space. That is, it presents the results of the model explained in Section IV-B. The machine learning algorithm used is logistic regression. Logistic regression is frequently used as a baseline for comparing deep neural networks since it can be seen as a shallow neural network [47]. The model that combines all the features gets 0.68 F1 on the MOOC dataset and 0.63 on the CS1 dataset. That is, the model shows 11/100 points of improvement over the stratified random baseline on the MOOC dataset and 13/100 points on the CS1 dataset.
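A minimal sketch of such a feature-based model is shown below; it concatenates a bag of words over the submission text with scalar history features before fitting the logistic regression. The tiny example data and feature values are placeholders, and the snippet is not our released implementation.

# Sketch of a feature-engineered model in the spirit of Section IV-B: a bag of
# words over the raw submission text concatenated with per-student history
# features, fed to a logistic regression. Data below are hypothetical.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

submissions = ["isLeapYear n = mod n 400", "isLeapYear x = not (multiple x 4)"]
history = np.array([[3, 0.2],   # placeholder scalar history features per student
                    [1, 0.0]])
labels = np.array([1, 0])       # 1 = dropout, 0 = success

bow = CountVectorizer(token_pattern=r"\S+").fit_transform(submissions)  # W
X = hstack([bow, csr_matrix(history)])   # bag of words + history features

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))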
The third part of the table describes the performance of the model we propose in Section IV-C. Due to how the data is split, the model is making predictions for submissions of students (in the test set) it has never seen (in the training set). All the other models in the table (except for the expert teacher) have access to previous data from the same student for whom they are making their predictions. The polarization of the predictions is good, with less than 10% of the submissions in the 0.3-0.6 range. The network is able to separate submissions between dropout and success with confidence. Although this model does not have access to data by the student, except for the code fragment, it outperforms the models based on student history for both datasets.

Fourth, we report human performance on this task following the methodology described in Section IV-D. This is a hard task for humans, and in particular for those that are less familiar with the common errors in these exercises.

TABLE II
RESULTS COMPARING A STRATIFIED RANDOM BASELINE, FEATURE EXPLORATION ON STUDENT EXERCISE HISTORY, THE SEQUENTIAL NLP ARCHITECTURE, AND HUMAN PERFORMANCE

                             Random sampling        Time series
                             CS1      MOOC          CS1      MOOC
Stratified random            0.51     0.57          0.50     0.57
Feature exploration
  W                          0.59     0.69          0.62     0.60
  W + EWF                    0.62     0.68          0.63     0.65
Table III shows an ablation study of the proposed features for the model based on student history. Using again logistic regression as the classifier, we tested one feature at a time with the goal of finding out which features are more informative for this task on both datasets. Table III shows that, when considered in isolation, the most predictive features are the bag of words (W) and the history of dropouts by the student (DA). W shows that it is important not only to consider the history of the student but also what was written in the submitted program. This is in line with our previous results that show that a model that considers not only words but also their order is desirable for this task. DA measures how often the student abandoned other exercises in the past. This is the non-textual feature that we found to be most predictive for our task. Our features AET and LDA, which intend to represent the tinkerer and planner distinction introduced in Section II, do not have predictive power for our task and datasets.

TABLE III
ABLATION STUDY OF THE PROPOSED FEATURES

                              Random sampling      Time series
Model                         CS1      MOOC        CS1      MOOC
W                             0.59     0.69        0.62     0.60
EWF                           0.58     0.62        0.53     0.56
AET                           0.50     0.50        0.51     0.48
LDA                           0.53     0.55        0.56     0.47
DA                            0.62     0.71        0.57     0.64
W + EWF                       0.62     0.68        0.63     0.65
W + AET                       0.60     0.64        0.64     0.60
W + LDA                       0.61     0.63        0.65     0.60
W + DA                        0.65     0.76        0.63     0.61
W + EWF + AET + LDA + DA      0.67     0.73        0.63     0.68
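The ablation itself can be organized as a simple loop over feature blocks, in the spirit of the following sketch; the feature contents here are random placeholders, and only the overall procedure mirrors Table III.

# Sketch of the one-feature-at-a-time ablation behind Table III: train the same
# logistic regression on each feature block (and on selected combinations) and
# compare weighted F1 scores. Feature blocks here are hypothetical arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 500
blocks = {
    "W": rng.normal(size=(n, 50)),    # bag-of-words columns (placeholder)
    "EWF": rng.normal(size=(n, 5)),   # another engineered block (placeholder)
    "DA": rng.normal(size=(n, 1)),    # per-student dropout history (placeholder)
}
y = rng.integers(0, 2, size=n)
split = int(0.8 * n)                  # temporal split: past vs. future

for name, combo in [("W", ["W"]), ("EWF", ["EWF"]), ("DA", ["DA"]),
                    ("W + DA", ["W", "DA"]), ("all", ["W", "EWF", "DA"])]:
    X = np.hstack([blocks[b] for b in combo])
    clf = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
    score = f1_score(y[split:], clf.predict(X[split:]), average="weighted")
    print(f"{name}: {score:.2f}")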
B. Qualitative Analysis

We present here a qualitative analysis of the sequential natural language model for new students on the programming exercise introduced in Section I. The exercise goal is to program a function called isLeapYear that verifies whether a year is a leap year. This analysis illustrates that the deep learning method we used in our model is not interpretable beyond analyzing its output as we do here. It is not possible to perform ablation studies in the same way we did for the feature engineering models in the previous section because the model is a black box.

Table IV shows the network predictions for the exercise for 14 submissions made by different students. The first column identifies the example with a number. The second column shows the submission itself. The third column reports the certainty of the network in its prediction. The fourth shows how Mumuki assessed the submission. The fifth summarizes the feedback that Mumuki gave. And the last reports the predicted class, with 1 for dropout and 0 for success.
TABLE IV
SAMPLE PIECES OF CODE FOR THE ISLEAPYEAR EXERCISE SUBMITTED BY DIFFERENT STUDENTS
N Submission Certainty Assessment Feedback C
1 isLeapYear multiple x*4 = x*4 1.000 Dark red Error in pattern 1
2 isLeapYear x=not (multiple x 4) 1.000 Light red Test case error 1
3 isLeapYear 400=not multiple 400 100 0.999 Light red Non exhaustive 1
4 isLeapYear = mod 400 4 && not (mod 100) 0.985 Light red Type error 1
5 isLeapYear = mod n 400 4 && not (mod n 100) 0.918 Dark red Undefined variable 1
6 isLeapYear n = mod n 400 4 && not (mod n 100) 0.867 Dark red Number of arguments 1
7 isLeapYear n = mod n 400 && not (mod n 100) 0.790 Light red Type error 1
8 isLeapYear n = mod n 400 == 0 && not (mod n 100 == 0) 0.618 Light red Test case error 1
9 isLeapYear n = mod n 400 == 0 && not (mod n 100 == 0) && (mod n 4 == 0) 0.310 Light red Test case error 0
10 isLeapYear n = mod n 400 == 0 || (not (mod n 100 == 0) && (mod n 4 == 0) 0.222 Dark red Mismatched parenthesis 0
11 isLeapYear n = multiple n 400 && not (multiple n 100) 0.238 Light red Test case error 0
12 isLeapYear n = multiple n 400 && not (multiple n 100) && (multiple n 4) 0.011 Light red Test case error 0
13 isLeapYear n = multiple n 400 || not (multiple n 100) && (multiple n 4)) 0.006 Green The solution passed all tests 0
14 isLeapYear n = (multiple n 400) || ((multiple n 4) && not(multiple n 100)) 0.001 Green The solution passed all tests 0
The table illustrates that it is not possible to decide whether a submission is a dropout or not based only on syntax errors or test errors. In the table we find submissions assessed as dark red by Mumuki (submissions with syntax errors) classified as dropout and success by the model. The same happens with submissions with test case errors (light red). For example, in submission number 10 the student forgot a closing parenthesis and it is assessed as dark red by Mumuki. The model identifies that it is a minor error and assumes the student will be able to solve it on his or her own. On the other hand, the syntax error in submission number 5 is connected to two conceptual misconceptions: the student neither understands the number of arguments that are needed, nor that the variables must be defined before being used. These are common misconceptions [48], [49]. The model correctly classifies submission number 5 as a dropout. With respect to test case errors, there are light red submissions that the model predicts as dropout with confidence, while there are others that are predicted as success. For example, submission number 8 is missing one of the conditions for being a leap year (mod n 4 == 0), while submission number 9 includes all the conditions, but there is an error in the Boolean operator. Students that submit a program with an incorrect Boolean operator seem more likely to correct the exercise on their own than those that are not able to fully articulate the program.
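For reference, the condition that the exercise targets is the standard leap year rule, shown below in Python for illustration (the course itself uses Haskell): a year is a leap year if it is divisible by 400, or divisible by 4 but not by 100.

# Reference version of the leap-year condition discussed above, in Python for
# illustration only; the exercise in the paper is written in Haskell.
def is_leap_year(n: int) -> bool:
    return n % 400 == 0 or (n % 4 == 0 and not n % 100 == 0)

# Submission 8 above omits the "divisible by 4" condition; submission 9 has all
# three conditions but combines them with the wrong Boolean operator.
assert is_leap_year(2000) and is_leap_year(2020) and not is_leap_year(1900)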
Intuitively, the approach illustrated in this section models how fluent a beginner programmer is in a programming language, viewing it as if it were a new second natural language, as discussed in Section II. In what follows, we discuss the limitations of our approaches and their possible implications for teaching programming.

VI. DISCUSSION

We discuss here the contributions and implications of this paper. The reliability of our results is supported by the fact that we obtain similar results in two quite different datasets. We also discuss its limitations.

A. Contributions and Implications for Teaching Programming

As we describe in Section II, there is a worldwide interest in promoting youth engagement in CS and in teaching programming at all levels of education. But there is a problem: there are too many students that want to learn programming for the number of trained teachers. Various strategies, such as the development of programming platforms [4], pedagogical resources [13], [14], special event organization (hackathons, summer camps, game jams, and so on) [15], and professional development for teachers [16], [17], have been developed in response to the teacher shortage.

The strategies just listed focus on tools and resources that can facilitate the role of the teacher in the classroom. However, most of them do not consider that there are different learning styles and processes [50]. Also, there are not enough teachers trained in CS and programming, and professional development courses in programming have low impact. Considering that there are not enough teachers, and that the existing teachers are often beginners, asking them to recognize the different processes in a heterogeneous classroom on their own may well be too much.

Previous work on predicting dropout requires students' code to be syntactically correct. As we deal with beginners' code in a text-based language, our models are able to handle noisy student code that may not be syntactically correct. Almost 40% of the code in our datasets contains parsing errors. In Sections IV and V we propose and compare two machine learning methods on the task of predicting whether a student will be able to solve a programming exercise on his or her own in two different learning scenarios introduced in Section III. The first method is inspired by pedagogical theories related to programming. For the second method we propose a novel way to encode the learner's code as input to a deep neural network (even when the code does not compile) based on word embeddings and an LSTM. We also evaluated expert human performance for this task (with teachers with different experience) in the context of a programming course teaching Haskell. We found that the performance of our best model in both learning scenarios was similar to the expert human performance of teachers with more than ten years of experience teaching this course. We also found that the deep learning model was easier to design without knowledge of the learning domain, relative to heavily feature-engineered approaches. However, feature-engineered approaches are more interpretable.
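To make the second method more concrete, the following PyTorch sketch illustrates the general shape of such an encoder: token embeddings of the (possibly non-compiling) submission feed an LSTM whose final hidden state is mapped to a dropout probability. Layer sizes, tokenization, and training details are placeholders here and do not reproduce the configuration described in Section IV-C.

# Illustrative PyTorch sketch of an embedding + LSTM classifier over code
# tokens. Vocabulary, dimensions, and token ids below are placeholders.
import torch
import torch.nn as nn

class SubmissionClassifier(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, sequence_length) integer ids of code tokens
        embedded = self.emb(token_ids)
        _, (h_n, _) = self.lstm(embedded)         # h_n: (1, batch, hidden)
        return torch.sigmoid(self.out(h_n[-1]))  # estimated P(dropout)

# Toy usage on two tokenized submissions (ids are arbitrary here).
model = SubmissionClassifier(vocab_size=1000)
batch = torch.tensor([[5, 17, 42, 3, 0, 0], [8, 9, 2, 11, 6, 1]])
print(model(batch).shape)  # torch.Size([2, 1])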
In Section V we describe the performance of experienced teachers in trying to predict whether a solution will be a dropout or not. The task was hard and time consuming for teachers with two and ten years of experience. Being able to provide tools that automatically identify students who are at risk of abandoning an exercise could be useful for teaching effectively in a heterogeneous classroom. Our models could help teachers recognize students with difficulties. Also, early dropout predictions could be used as inputs to define different exercise trajectories for different students, both for in-person and online courses.

B. Threats to Validity and Limitations

Regarding validity, there are more features that we could have tried. We are certainly not claiming that our features are the best for this problem. Our claim in this paper is simply that designing relevant features manually is hard, particularly for a problem such as this one, in which there are features in so many possible dimensions (student, exercise, program, time, etc.). Feature engineering is difficult and problem dependent. Our contribution in this paper is to say that, for this particular problem, feature engineering may not be necessary since we have such rich sequential data that can be processed by current deep neural models to achieve better task performance. Also, by not needing student-specific data, deep learning models do not suffer a cold start for each new student. Although harder to implement, our model based on carefully designed domain-dependent features offers more insights into the kind of information that is relevant for the task. Therefore such a model is more interpretable than our deep learning model.

Our deep learning architecture ignores previous submissions made by the same student, which certainly contain information about what that student has already learned. Our contribution here is to show that, without such heavy machinery, the model has a reasonable accuracy when predicting whether a student will need help on a particular exercise. We imagine this to be similar to when an experienced teacher looks at the screen of a new student and, by seeing his or her code, can predict whether or not the student needs help. The teacher can do this if the code written is clearly on the right track or has serious conceptual errors. But there are many hard cases in between in which our system has been shown to have better than teacher performance when the teacher does not know the student. Regarding human annotation, our methodology does not give the tutor the identity of the students. It is possible that annotators would do better if they knew who the student is, but our goal was to help overcrowded courses or e-learning environments, so we tried to reproduce this setup when doing the human evaluation. Also, our annotators did 40 annotations in two hours; this is a demanding task and performance might be improved by doing fewer annotations in one go.

There are two types of errors in our model's predictions. First, predicting a student submission as success when the student is going to drop out of the exercise. Second, predicting a student submission as a dropout when it is going to be a success. Our evaluation metric penalizes type 1 and type 2 errors equally. That is, it gives the same penalty when the student is not going to drop out but the model predicts a dropout (false positive, type 1) as when the student abandons but the model did not predict it (false negative, type 2). For an application, it might be interesting to prefer avoiding type 2 errors, since not helping a student that needs help should be prioritized. However, the effects of telling a student who is not going to abandon that he or she seems to need help should not be overlooked.
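One standard way to express such a preference is to weight the dropout class more heavily during training, as in the following sketch; the specific weight value is arbitrary and this is not what we do in the experiments reported above.

# Sketch of one way to penalize type 2 errors (missed dropouts) more heavily
# than type 1 errors, using class weights in scikit-learn. The weight 3.0 is
# an arbitrary illustration, not a tuned value.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)   # 1 = dropout, 0 = success

# Misclassifying an actual dropout (class 1) now costs three times as much as
# misclassifying a success, pushing the model towards fewer false negatives.
clf = LogisticRegression(max_iter=1000, class_weight={0: 1.0, 1: 3.0})
clf.fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])   # estimated dropout probabilities

In a neural implementation, a similar effect can be obtained by weighting the positive class in the binary cross-entropy loss.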
VII. CONCLUSION

In this paper we propose and compare two approaches to predicting whether students will complete a programming exercise on their own. This is a step towards generating automated personalised formative feedback. We proposed two different models, one based on pedagogically-motivated features and the other one on NLP techniques applied to the raw and incorrect code written by students.

We find that both models are faster than human experts at this task. In particular, our neural approach showed a good performance with new students, which is usually a difficult case because other approaches rely on the history of the student. This approach is easy to implement and could help teachers detect which students need help promptly. The performance of both our approaches, the network model and the one based on features, outperforms a teacher with ten years of experience on the prediction task.

Through a qualitative analysis of an exercise we illustrated that the certainty of the neural network model can be used to classify the errors better than the symbolic feedback generation made by the usual techniques implemented by Mumuki (parsing and test case testing). Based on the level of certainty for a student in an exercise, a different path of exercises in the tool could be proposed. If certainty is low, it would be possible to use the output of our model to suggest that the student practice on an easier exercise. This might help keep the student motivated to learn, avoiding the frustration caused when the exercises cannot be solved. We leave the exploration of such extensions for future work.

There is plenty of room for future work on our models, as we mentioned in Section VI-B. Along with this paper we are releasing our source code [39]. Data will also be released for research purposes by request. We plan to introduce a penalization for type 2 errors, that is, for the cases where a student is going to abandon the exercise but the model fails to predict it. In addition, we plan to try a different neural network architecture, a bidirectional long short-term memory, in which the network is able to consider the relationship between tokens from left to right and vice versa. We believe that these two improvements may boost the usefulness of our model. Despite the fact that creating features is a time consuming task, the knowledge gained from the design of the features and the ablation studies that it makes possible make such models more interpretable. Our future work will explore ways of combining both models.

Unlike previous work, our models are able to predict dropouts on code that is not syntactically correct. This is particularly important for learning environments where the students are beginners. Moreover, the representation that we use to feed our neural network is simple and easy to implement, which makes this approach easy to extend to different programming languages and tools. This paper is a step towards automatic personalised formative feedback because it addresses the question of when a student may need it.
ACKNOWLEDGMENTS

The authors would like to thank Franco Bulgarelli and Nadia Finzi from Mumuki Project for making available the Mumuki user-generated dataset and for comments about this work. We also want to thank Cecilia Martinez and Emilia Echeveste for advising us on how to think about teaching programming in real school settings. We are thankful for the comments given by the anonymous peer reviewers. Moreover, we want to thank Patrick Blackburn for proofreading the paper and helping us improve it. We are grateful to Ana Casali, Francisco Tamarit, Laura Brandan Briones, Raul Fervari, and Pedro D'Argenio for comments related to our different models and representations. Lastly, the authors appreciate the collaboration and support of the initiative Program.ar from the Manuel Sadosky Foundation at the Argentinean Ministry of Science, Technology and Productive Innovation.
REFERENCES
[1] C. Wilson, “Hour of code: We can solve the diversity problem in computer science,” Assoc. Comput. Machinery Inroads, vol. 5, no. 4, p. 22, Dec. 2014, doi: 10.1145/2684721.2684725.
[2] T. Bell, “Establishing a nationwide CS curriculum in New Zealand high schools,” Commun. Assoc. Comput. Machinery, vol. 57, no. 2, pp. 28–30, Feb. 2014, doi: 10.1145/2556937.
[3] S. Furber, “Shut down or restart? The way forward for computing in U.K. schools,” The Royal Society, London, Tech. Rep., 2012.
[4] P. Brusilovsky et al., “Increasing adoption of smart learning content for computer science education,” in Proc. 2014 ACM Conf. Innovation and Technology in Computer Science Education Working Group Reports (ITiCSE-WGR’14), Uppsala, Sweden, Jun. 23–25, 2014, pp. 31–57, doi: 10.1145/2713609.2713611.
[5] J. H. Sharp, “Using Codecademy interactive lessons as an instructional supplement in a Python programming course,” Inf. Syst. Educ. J., vol. 17, no. 3, p. 20, Jun. 2019.
[6] P. Ihantola et al., “Educational data mining and learning analytics in programming: Literature review and case studies,” in Proc. 2015 ACM Conf. Innovation and Technology in Computer Science Education Working Group Reports (ITiCSE-WGR’15), Vilnius, Lithuania, Jul. 4–8, 2015, pp. 41–63, doi: 10.1145/2858796.2858798.
[7] L. Wang, A. Sy, L. Liu, and C. Piech, “Deep knowledge tracing on programming exercises,” in Proc. 4th ACM Conf. Learning @ Scale (L@S’17), Cambridge, MA, USA, Apr. 20–21, 2017, pp. 201–204, doi: 10.1145/3051457.3053985.
[8] N. Fraser, “Ten things we’ve learned from Blockly,” in Proc. 2015 IEEE Blocks and Beyond Workshop, Atlanta, GA, USA, Oct. 22, 2015, pp. 49–50, doi: 10.1109/blocks.2015.7369000.
[9] B. Settles, C. Brust, E. Gustafson, M. Hagiwara, and N. Madnani, “Second language acquisition modeling,” in Proc. 13th Workshop on Innovative Use of NLP for Building Educational Applications (NAACL-HLT’18), New Orleans, LA, USA, Jun. 5, 2018, pp. 56–65, doi: 10.18653/v1/W18-0506.
[10] X. Huo, M. Li, and Z.-H. Zhou, “Learning unified features from natural and programming languages for locating buggy source code,” in Proc. 26th Int. Joint Conf. Artificial Intelligence (IJCAI’16), New York, NY, USA, Jul. 9–15, 2016, pp. 1606–1612.
[11] M. Kaneko, T. Kajiwara, and M. Komachi, “TMU system for SLAM2018,” in Proc. 13th Workshop on Innovative Use of NLP for Building Educational Applications (NAACL-HLT’18), New Orleans, LA, USA, Jun. 5, 2018, pp. 365–369, doi: 10.18653/v1/W18-0544.
[12] M. C. Martínez, M. J. Gomez, and L. Benotti, “A comparison of preschool and elementary school children learning computer science concepts through a multilanguage robot programming platform,” in Proc. 2015 ACM Conf. Innovation and Technology in Computer Science Education (ITiCSE’15), Vilnius, Lithuania, Jul. 4–8, 2015, pp. 159–164, doi: 10.1145/2729094.2742599.
[13] B. Ericson, M. Guzdial, and M. Biggers, “Improving secondary CS education: Progress and problems,” in Proc. 38th ACM Tech. Symp. Computer Science Education (SIGCSE’07), Covington, KY, USA, Mar. 7–11, 2007, pp. 298–301, doi: 10.1145/1227310.1227416.
[14] D. Thompson and T. Bell, “Adoption of new computer science high school standards by New Zealand teachers,” in Proc. 8th Workshop in Primary and Secondary Computing Education (WiPSCE’13), Aarhus, Denmark, Nov. 11–13, 2013, pp. 87–90, doi: 10.1145/2532748.2532759.
[15] Tao Lin, C. Zhong, Y. John, and P. Liu, “Retrieval of Relevant Historical Data Triage Operations in Security Operations Center,” in Data and Application Security and Privacy: Status and Prospects, Springer LNCS, 2018.
[16] X. Fu, Y. Ma, and Tao Lin, “A Novel Image Matching Algorithm Based on Graph Theory,” Computer Applications and Software, vol. 33, no. 12, pp. 156–159, 2016. Shanghai Computer Society.
[17] N. Ragonis, O. Hazzan, and J. Gal-Ezer, “A survey of computer science teacher preparation programs in Israel tells us: Computer science deserves a designated high school teacher preparation!” in Proc. 41st ACM Tech. Symp. Computer Science Education (SIGCSE’10), Milwaukee, WI, USA, Mar. 10–13, 2010, pp. 401–405, doi: 10.1145/1734263.1734402.
[18] M. Resnick et al., “Scratch: Programming for all,” Commun. Assoc. for Comput. Machinery, vol. 52, no. 11, pp. 60–67, Nov. 2009, doi: 10.1145/1592761.1592779.
[19] J. Maloney, M. Resnick, N. Rusk, B. Silverman, and E. Eastmond, “The Scratch programming language and environment,” ACM Trans. Comput. Educ., vol. 10, no. 4, pp. 1–16, Nov. 2010, doi: 10.1145/1868358.1868363.
[20] Tao Lin, J. Gao, X. Fu, and Y. Lin, “A Novel Bug Report Extraction Approach,” in the 15th International Conference on Algorithms and Architectures for Parallel Processing, 2015, pp. 771–780.
[21] D. Wolber, “App inventor and real-world motivation,” in Proc. 42nd ACM Tech. Symp. Computer Science Education (SIGCSE’11), Dallas, TX, USA, Mar. 9–12, 2011, pp. 601–606, doi: 10.1145/1953163.1953329.
[22] L. Benotti, F. Aloi, F. Bulgarelli, and M. J. Gomez, “The effect of a web-based coding tool with automatic feedback on students’ performance and perceptions,” in Proc. 49th ACM Tech. Symp. Computer Science Education (SIGCSE’18), Baltimore, MD, USA, Feb. 21–24, 2018, pp. 2–7, doi: 10.1145/3159450.3159579.
[23] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. 26th Int. Conf. Neural Information Processing Systems, vol. 2, Lake Tahoe, NV, USA, Dec. 5–8, 2013, pp. 3111–3119.
[24] S. Sachdev, H. Li, S. Luan, S. Kim, K. Sen, and S. Chandra, “Retrieval on source code: A neural code search,” in Proc. 2nd ACM SIGPLAN Int. Workshop Machine Learning and Programming Languages (MAPL@PLDI’18), Philadelphia, PA, USA, Jun. 18–22, 2018, pp. 31–41, doi: 10.1145/3211346.3211353.
[25] T. Kenter, A. Borisov, C. V. Gysel, M. Dehghani, M. de Rijke, and B. Mitra, “Neural networks for information retrieval,” in Proc. 40th Int. ACM Special Interest Group in Information Retrieval Conf. Research and Development in Information Retrieval, Tokyo, Japan, Aug. 7–11, 2017, pp. 1403–1406, doi: 10.1145/3077136.3082062.
[26] P. D. Turney and P. Pantel, “From frequency to meaning: Vector space models of semantics,” J. Artif. Intell. Res., vol. 37, no. 1, pp. 141–188, Jan. 2010, doi: 10.1613/jair.2934.
[27] S. Turkle and S. Papert, “Epistemological pluralism and the revaluation of the concrete,” J. Math. Behav., vol. 11, no. 1, pp. 3–33, Mar. 1992.
[28] C. A. R. Hoare, “Theories of programming: Top-down and bottom-up and meeting in the middle,” in Formal Methods (FM’99), Toulouse, France, Sep. 20–24, 1999, pp. 1–27, doi: 10.1007/3-540-48119-2_1.
[29] P. Blikstein, M. Worsley, C. Piech, M. Sahami, S. Cooper, and D. Koller, “Programming pluralism: Using learning analytics to detect patterns in the learning of computer programming,” J. Learn. Sci., vol. 23, no. 4, pp. 561–599, 2014, doi: 10.1080/10508406.2014.954750.
[30] C. Piech, J. Huang, A. Nguyen, M. Phulsuksombati, M. Sahami, and L. Guibas, “Learning program embeddings to propagate feedback on student code,” in Proc. 32nd Int. Conf. Machine Learning (ICML’15), Lille, France, Jul. 6–11, 2015, pp. 1093–1102.
[31] M. Wu, M. Mosse, N. D. Goodman, and C. Piech, “Zero shot learning for code education: Rubric sampling with deep learning inference,” in Proc. 33rd AAAI Conf. Artificial Intelligence (AAAI’19), Honolulu, HI, USA, Jan. 27–Feb. 1, 2019, pp. 782–790, doi: 10.1609/aaai.v33i01.3301782.
[32] M. Palatucci, D. Pomerleau, G. Hinton, and T. M. Mitchell, “Zero-shot learning with semantic output codes,” in Proc. 22nd Int. Conf. Neural Information Processing Systems, Vancouver, Canada, Dec. 7–10, 2009, pp. 1410–1418.
[33] R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng, “Zero-shot learning through cross-modal transfer,” in Proc. 26th Int. Conf. Neural Information Processing Systems, vol. 1, Lake Tahoe, NV, USA, Dec. 5–8, 2013, pp. 935–943.
[34] Mumuki. (2021), IKUMI SRL. Accessed: Jun. 3, 2021. [Online]. Available: https://github.com/mumuki
[35] F. P. Miller, A. F. Vandome, and J. McBrewster, Levenshtein Distance: Information Theory, Computer Science, String (Computer Science), String Metric, Damerau–Levenshtein Distance, Spell Checker, Hamming Distance. Alpha Press, 2009, doi: 10.5555/1822502.
[36] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” The J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.
[37] X. Li and X. Wu, “Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition,” in 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP’15), South Brisbane, Australia, Apr. 19–24, 2015, pp. 4520–4524, doi: 10.1109/ICASSP.2015.7178826.
[38] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Trans. Pattern Anal. & Mach. Intell., vol. 31, no. 5, pp. 855–868, May 2009, doi: 10.1109/TPAMI.2008.137.
[39] Predicting students’ difficulties from a piece of code: Source code. (2021), DEDA-UNC. Accessed: Jun. 3, 2021. [Online]. Available: https://github.com/uncmasmas/DEDA
[40] K. Patel and P. Bhattacharyya, “Towards lower bounds on number of dimensions for word embeddings,” in Proc. 8th Int. Joint Conf. Natural Language Processing (IJCNLP’17), vol. 2, Taipei, Taiwan, Nov. 27–Dec. 1, 2017, pp. 31–36.
[41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd Int. Conf. Learning Representations (ICLR’15), San Diego, CA, USA, May 7–9, 2015, pp. 31–36.
[42] J. Amidei, P. Piwek, and A. Willis, “Rethinking the agreement in human evaluation tasks,” in Proc. 27th Int. Conf. Computational Linguistics (COLING’18), Santa Fe, NM, USA, Aug. 20–26, 2018, pp. 3318–3329.
[43] M. Teruel, C. Cardellino, F. Cardellino, L. Alonso Alemany, and S. Villata, “Increasing argument annotation reproducibility by using inter-annotator agreement to improve guidelines,” in Proc. 11th Int. Conf. Language Resources and Evaluation (LREC’18), Miyazaki, Japan, May 7–12, 2018.
[44] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in Proc. 14th Int. Joint Conf. Artificial Intelligence (IJCAI’95), Montréal, Canada, Aug. 20–25, 1995, pp. 1137–1143, doi: 10.5555/1643031.1643047.
[45] X. Xiong, S. Zhao, E. V. Inwegen, and J. Beck, “Going deeper with deep knowledge tracing,” in Proc. 9th Int. Conf. Educational Data Mining, Raleigh, NC, USA, Jun. 29–Jul. 2, 2016, pp. 545–550.
[46] D. R. Roberts et al., “Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure,” Ecography, vol. 40, no. 8, pp. 913–929, Dec. 2017, doi: 10.1111/ecog.02881.
[47] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Cambridge, MA, USA, 2016, doi: 10.5555/3086952.
[48] D. Weintrop and U. Wilensky, “Using commutative assessments to compare conceptual understanding in blocks-based and text-based programs,” in Proc. 11th Annu. Int. Conf. Int. Computing Education Research (ICER’15), Omaha, NE, USA, Aug. 9–13, 2015, pp. 101–110, doi: 10.1145/2787622.2787721.
[49] A. Kumar, D. D’Souza, and M.-J. Laakso, “Identifying novice student programming misconceptions and errors from summative assessments,” J. of Educ. Technol. Syst., vol. 45, no. 1, pp. 50–73, Sep. 2016, doi: 10.1177/0047239515627263.
[50] H. Pashler, M. McDaniel, D. Rohrer, and R. Bjork, “Learning styles: Concepts and evidence,” Psychological Sci. Public Interest, vol. 9, no. 3, pp. 105–119, Dec. 2008, doi: 10.1111/j.1539-6053.2009.01038.x.

Marco Moresi received his M.Sc. degree in computer science from the National University of Córdoba, from the province of Córdoba in Argentina, in 2019, where he has also had experience as a teaching assistant in several programming courses. His research interests lie in natural language processing, machine learning, data science, and computer science education. He has authored six papers in these areas. He completed a one year research internship at the research group dialog systems and machine learning at the University of Düsseldorf. He is currently working as a freelance data scientist.

Marcos J. Gómez received his Ph.D. degree in computer science from the National University of Córdoba, from the province of Córdoba in Argentina, in 2020. Before that, he obtained a M.Sc. degree in computer science from the same university, in 2014. He was part of the Google Trailblazer, Google RISE, and Google CS4HS projects from 2014 to 2016. His research interests include computer science education, K-12 education, and machine learning, having authored seven papers in these areas. His computer science education research focuses on learning experiences in real school settings. He is currently the coordinator of 45 technical high schools with specialization in computer programming, and is working on computer science curriculum design for elementary and high school students. As of 2021, he is a member of the initiative Program.ar from the Manuel Sadosky Foundation at the Argentinean Ministry of Science, Technology and Productive Innovation. He is also a member of the ACM Special Interest Group on Computer Science Education (SIGCSE).

Luciana Benotti received her Ph.D. degree in computational linguistics from the University of Lorraine, Nancy, France, in 2010. Before that, she obtained the M.Sc. degree in computer science through an Erasmus Mundus program jointly offered by the Free University of Bozen-Bolzano, Bolzano, Italy and the Polytechnic University of Madrid, Madrid, Spain, in 2006. She is currently a Professor at the National University of Córdoba, Córdoba, Argentina and a Researcher with the Argentinian National Scientific and Technical Research Council (CONICET). Her research interests include natural language processing, dialogue and interactive systems, and computer science education. She has authored 55 papers in these areas. Dr. Benotti has received an IBM SUR award for her work on robust user text processing and a Google RISE award for her outreach efforts in developing AI-based technology for education. She has been an invited scholar at the University of Trento (2019), Stanford University (2018), Roskilde University (2014), University of Lorraine (2013), Universidad de Costa Rica (2012), and University of Southern California (2010). As of 2021, she is a member of the executive board of the North American Association for Computational Linguistics. She is also a member of the ACM Special Interest Group on Computer Science Education (SIGCSE).