ARTICLE IN PRESS
Int. J. Human-Computer Studies 61 (2004) 169–185
Program comprehension and authentic measurement: a scheme for analysing descriptions of programs
Judith Good (a,*), Paul Brna (b)
(a) Organizational Learning and Instructional Technologies, College of Education, MSC05-3040 1, University
of New Mexico, 106 Education Office Building, Albuquerque, NM 87131-1231, USA
(b) School of Informatics, Pandon Building, Northumbria University, Newcastle upon Tyne NE1 8ST, UK
Received 13 October 2003; accepted 19 December 2003
Abstract
This paper describes an analysis scheme which was developed to probe the comprehension
of computer programming languages by students learning to program. The scheme operates
on free-form program summaries, i.e. textual descriptions of a program which are produced in
response to minimal instructions by the researcher/experimenter. The scheme has been applied
to descriptions of programs written in various languages, and it is felt that the scheme has the
potential to be applied to languages of markedly different types (e.g. procedural, object-oriented, event-driven). The paper first discusses the basis for the scheme, before describing the
scheme in detail. It then presents examples of the scheme’s application, and concludes with a
discussion of some open issues.
© 2004 Elsevier Ltd. All rights reserved.
1. Introduction
The comprehensibility of programming languages is a topic of interest in an era in
which much effort is being expended to open up programming to an increasingly
wider audience. Although much work has already been carried out in the area of
program comprehension (see Upchurch (2002) for an extensive bibliography), it is a
complex topic which deserves further study.
*Corresponding author. Tel.: +1-505-277-2028.
E-mail address: judithg@unm.edu (J. Good).
1071-5819/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ijhcs.2003.12.010
J. Good, P. Brna / Int. J. Human-Computer Studies 61 (2004) 169–185
One of the criticisms levelled at program comprehension studies, and indeed at
many studies of programming in general, is that they do not capture the richness of
the activities which occur in a naturalistic setting. Studies of programmers are often
limited to experiments involving small numbers of undergraduate students
individually studying short programs and answering multiple choice questions
during a 1-hour period. This is far removed from industrial settings where groups of
programmers work for several months or more on programs which are thousands of
lines long and which may have originally come from a number of sources.
At the same time, it is useful to distinguish between learning to program and
exercising the skill of programming. The former case is of interest to many researchers
because of the steep learning curve typically associated with the process, and because
of a desire to address the issue of the bimodal distribution of scores which often
occurs in programming courses.
In the case of novices learning to program, more naturalistic research does not
necessarily imply longitudinal, large-scale group studies involving complex
programs, as they may not be representative of novice programming tasks. On the
other hand, it is difficult to envisage program comprehension activities which are
naturalistic for novices, since most teaching of computer science is low on tasks that
support learning to understand programs as their primary goal. Nonetheless, many
of the tasks set for novice programmers do require them to understand chunks of
code, and we would argue that by opting for more open-ended methods of
measuring skills, and finer grained analyses of data, we can develop a deeper
understanding of the processes involved in novice program comprehension.
We have been working on more authentic ways to measure program comprehension. One approach is to look at how programmers
communicate their own understanding of a program when asked to do so using non-directive questioning. Other researchers are also tackling this issue: for example, von
Mayrhauser and Lang (1999) have developed a scheme for analysing verbal
protocols of the actions which a programmer makes during a software maintenance
task such as debugging. Similarly, O’Brien et al. (2001) have focused on the processes
involved in program comprehension, using verbal protocol analysis to investigate the
use of different comprehension strategies. The comprehension process has long been
a subject of interest, with a number of researchers proposing theories which describe
the steps in the process. Brooks (1983) proposed one of the first temporal models of
program comprehension, envisaging comprehension as a top-down process.
Shneiderman and Mayer (1979) and Pennington (1987a) developed models based
on a bottom-up conceptualization, with understanding occurring as the result of the
integration of lower-level segments into an overall whole. Finally, Letovsky (1986)
focused on programmers as knowledge-based understanders, using a mix of bottom-up and top-down processes in comprehension. Boehm-Davis (1988) similarly
conceives of comprehension as an iterative, opportunistic process.
In this paper, we present a scheme for measuring program comprehension which
involves coding free-form program summary statements. Rather than examining the
processes involved in program comprehension (the goal of the research described in
the preceding paragraph), this scheme focuses on the content and structure of the
information artefacts which are produced following a program comprehension
phase. In a study of this type, participants are asked to examine programs, and to
write a summary describing what the program does in their own words. We then
code the statements in the summary according to the types of information which they
represent. We feel that the scheme offers a promising way of investigating program
comprehension in-depth and that, combined with quantitative measures, it can give
us more insight into the types of information that programmers glean from a
program.
The rest of the paper is structured as follows: Section 2 discusses some of the issues
involved in analysing program summaries. Section 3 describes the background to the
program summary analysis scheme, and the origins of the idea of information types
in programs. Section 4 describes the scheme itself, while Section 5 presents some
examples of applying the scheme to different types of programming languages.
Finally, Section 6 discusses open questions and ideas for further work.
2. Program summaries: issues of analysis
Program summaries have played an important role in the information types
methodology, and data of this type has been collected by Pennington (1987a),
Corritore and Wiedenbeck (1991) and Good (1999).
A program summary is a free-form account of a program which an individual
produces after studying the program. Instructions given to participants have tended
to be relatively non-directive, leaving the content essentially up to them. The lack of
explicit guidelines for the content of the summary allows for wide scope and
variation in the responses.
By the same token, the open-endedness of the task provides a valuable source of
rich, realistic data. Furthermore, the program summary methodology neatly
circumvents the problems of ‘false positive’ results often associated with binary
choice questions, and the difficulties associated with developing sensitive and reliable
multiple choice questions and corresponding distracter items. Program summaries
allow participants to express their view of a program, using their own words, at their
chosen level of abstraction, including as much (or as little) detail as they feel is
necessary.
As always, the price to pay for rich data is the difficulty of analysing it:
quantitative statistics are not always appropriate, and qualitative methods must be
devised. There are numerous ways of analysing written texts of this type; however, it
is not simply a case of identifying the ‘correct’ method in the same way one chooses
the right statistical test, particularly when the semantic content of the text is of
interest. Analysis schemes are rarely universal, given differences in research aims
between studies. They are both content and context sensitive, and must be developed
through what is often a lengthy, iterative process.
The complexity of the analysis is related to the issue of replicability. Rich, complex
data may lead to complex analysis schemes: extra care must be taken to ensure that
these schemes are in fact understandable and usable, and are no more complex than
is necessary for the purpose of the analysis. Replicability can also be compromised
by schemes which are ill-defined. Schemes which are not fully worked out and/or
which are not accompanied by explicit instructions enabling them to be used by
persons other than the original researcher are not of much use: it is impossible to
compare results reliably.
In designing a scheme to analyse program summaries, it is necessary to ensure that
it is both complete, in the sense that all of the statements in the summary can be
classified in some way, and that it contains a number of distinct categories, so that it
allows for the detection of different patterns of statements across program
summaries. In reviewing the literature on program comprehension, Pennington’s
description of information types was felt to provide a useful starting point for
developing the scheme. Information types are described in more detail in the
following section.
3. Background to the scheme
3.1. Information types
Information types have been defined as ‘‘different kinds of information explicit ‘in
the text’ that must be detected in order to fully understand the program’’ (Green
et al., 1980; Green, 1980, cited in Pennington, 1987a, p. 299). The concept of information
types has been used extensively by Pennington (1987a, b), in studies by Wiedenbeck,
Corritore and colleagues (see, for example, (Corritore and Wiedenbeck, 1991;
Ramalingam and Wiedenbeck, 1997)) and more recently by Romero et al. (2002).
Pennington described information types in terms of internal, rather than external,
abstractions of a computer program. They are not meant to be mental
representations of the program, but are ‘‘based on formal analyses of programs
developed by computer scientists’’ (Pennington, 1987a, p. 298), and can be compared
with the abstractions which are made with respect to natural language, such as
referential or causal abstractions.
Pennington identified five types of information: function, control flow, data flow,
state and operations, defined as follows:
Function: information about the overall goal of the program, essentially, ‘‘What is
the purpose of the program? What does the program do?’’. Since function also
includes program subgoals, goals and subgoals can be represented in a goal
hierarchy. Some information about the order of events can be inferred, but not
details of how the events are implemented.
Control flow: information about the temporal sequence of events occurring in the
program, e.g. ‘‘What happens after X occurs? What has occurred just before X?’’ If
the information is represented graphically, then the links will correspond to the
direction of control, rather than to the movement of data. Data flow information can
be inferred in a program by searching for repeated occurrences of data objects, but
goal/subgoal information is harder to detect.
Data flow: essentially concerned with the transformations which data objects
undergo during execution, including data dependencies and data structure
information, e.g. ‘‘Does variable X contribute to the final value of Y?’’ Data flow
and function are linked in the sense that function information can be partially
reconstructed from a data flow abstraction. Similarly, control flow information can
also be more readily inferred from the data flow abstraction than from the function
abstraction.
Operations: information about specific actions which take place in the code,
generally corresponding to a single line of code or less, such as ‘‘does a variable
become instantiated with a particular value?’’ Although Pennington does not
describe these in great detail, operations seem most related to control flow
information, in the sense that describing the control flow of a program would lead to
‘‘stringing together’’ a series of operations.
State: time-slice descriptions of the state of objects and events in the program, e.g.
‘‘When the program is in state X, is event Y taking place, or has object Z been
created/modified?’’ This abstraction is quite distinct in the sense that other types of
information are hard to infer from it, and vice versa.
The categories are orthogonal in terms of information coverage.
Although Pennington does not address the issue of granularity explicitly, the
categories vary: function can often cover the entire program, while operations will
concern only a single line (or node, in the case of a visual programming
language).
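As an informal aid to the reader, the five information types can be written down as a small data structure. The sketch below is illustrative only and is not drawn from Pennington's work: the names InfoType, PROBES and SCOPE are hypothetical, the probe questions are taken from the definitions above, and only the two scopes explicitly mentioned in the text are recorded.

```python
from enum import Enum


class InfoType(Enum):
    """Pennington's five information types (names as used in the text)."""
    FUNCTION = "function"
    CONTROL_FLOW = "control flow"
    DATA_FLOW = "data flow"
    OPERATIONS = "operations"
    STATE = "state"


# One example probe question per type, taken from the definitions above.
PROBES = {
    InfoType.FUNCTION: "What is the purpose of the program?",
    InfoType.CONTROL_FLOW: "What happens after X occurs?",
    InfoType.DATA_FLOW: "Does variable X contribute to the final value of Y?",
    InfoType.OPERATIONS: "Does a variable become instantiated with a particular value?",
    InfoType.STATE: "When the program is in state X, is event Y taking place?",
}

# Typical scope of a type; the text states only these two extremes,
# so the other three types are deliberately left out.
SCOPE = {
    InfoType.FUNCTION: "entire program",
    InfoType.OPERATIONS: "single line (or node in a visual language)",
}

for info_type in InfoType:
    scope = SCOPE.get(info_type, "not stated")
    print(f"{info_type.value}: {PROBES[info_type]} [scope: {scope}]")
```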
Pennington was interested not so much in information types per se, but in the
relationship between information types and what she called programming knowledge
structures. She looked at two competing knowledge structures: text structure
knowledge, which is organized around control structure primes, and plan knowledge,
which, according to her, is primarily functionally oriented. Pennington carried out
two experiments to investigate these ideas, one of which involved the analysis of
participants’ written summaries of the programs they were asked to examine, as
described in the next section.
3.2. Pennington’s methodology for program summary analysis
Analysing program summaries as a way of measuring program comprehension
can be traced to an experiment carried out by Pennington (1987a). In addition to
answering binary choice questions about a program of moderate length, participants
were also asked to write a summary of the program at two points during the
experiment: firstly after a 45 min study period, and again after having carried out a
modification to the program. Although the exact wording of the request is not given,
it is likely that the instructions were brief and non-directive with respect to the type
of information the summary should contain.
Pennington performed two analyses on the program summaries, classifying each
statement by both information type and level of detail. The methods used in the
analysis are described in the following two sections.
3.2.1. Information type analysis
Pennington states that the information types investigated included procedural,
data flow, and function statements. The other categories used in the program
comprehension tests, namely operations and state, do not seem to have been used: no
results were reported for them in any case. Why they were omitted from the analysis
is not discussed. Pennington’s definition of each category is very brief, and expressed
primarily through examples. She defines the three categories as follows:
‘‘procedural statements include statements of process, ordering and conditional
program actions’’ (Pennington, 1987a, p. 332);
‘‘data flow statements also include statements about data structures’’ (Pennington,
1987a, p. 332);
functional statements are not defined by Pennington, but illustrated with an example.
The following summary excerpts are provided to illustrate each type of statement,
all from (Pennington, 1987a, p. 332):
Procedural: ‘‘after this, the program will read in the cable file, comparing against
the previous point of cable file, then on equal condition compares against the
internal table … if found, will read the tray-area-point file for matching point-area.
In this read if found, will create a type-point-index record. If not found, will read
another cable record.’’
Data flow: ‘‘the tray-point file and the tray-area file are combined to create a tray-area-point file in Phase 1 of the program. Phase 2 tables information from the type-code file in working storage. The parameter file, cables file and the tray-area-point
file are then used to create a temporary-exceed-index file and a point-index file.’’
Functional: ‘‘the program is computing area for cable accesses throughout a
building. The amount of area per hole is first determined and then a table for cables
and diameters is loaded. Next a cable file is read to accumulate the sum of the cables’
diameters going through each hole.’’
3.2.2. Level of detail analysis
Pennington defined four levels of detail for program summaries:
detailed: references to a program’s operations and variables;
program: references to a program’s ‘‘procedural blocks’’;
domain: references to real-world objects;
vague: statements with no specific referents.
Pennington uses the example summary segments above as illustrations of the level
of detail: the procedural summary is the most detailed, the data flow summary is
described at the program level, the functional summary is described at the domain
level, and an example of a vague statement is, ‘‘this program reads and writes a lot of
files’’ (Pennington, 1987a, p. 333).
4. Description of the program summary analysis scheme
Based on the description of Pennington’s schemes in Pennington (1987a), applying
the schemes to program summaries presented a number of difficulties. Firstly,
Pennington used only three categories to classify program summaries (procedural,
function and data flow). From her description, the mapping between procedural
statements and the five information types is not clear: are procedural statements
equivalent to the operations information type? The control information type? A
combination of the two, or something else entirely? Secondly, as mentioned above,
Pennington does not explain why only three categories are used, rather than five
(corresponding to the five information types). If the five information types are
defined to encode different types of program knowledge, it follows that an analysis
scheme that does not include all categories will not be sufficient to code all
statements.
The granularity of application of the categories is also unclear: were program
summaries segmented? If so, how? Or was a single category applied to an entire
summary? Parts of a summary? Without this information, it is impossible to reliably
apply the scheme. Pennington’s examples above suggest that the categories may have
been applied to several sentences at a time; however, not segmenting sentences may
mean that detail is lost. For example, a sentence describing a program may begin
with an overall description of the program, i.e. a function statement, before going on
to describe a low-level detail in the program’s functioning, which might be classified
as an operations statement. Coding the entire sentence as a function statement would
lose valuable information.
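The loss of detail described above can be made concrete with a small sketch. The example sentence, its manual split into two segments, and the function names are all hypothetical; the sketch simply contrasts coding a whole sentence with a single category against coding each segment separately.

```python
from collections import Counter

# A hypothetical summary sentence, split by a human coder into two segments.
# Each segment is paired with the category the coder assigned to it.
segments = [
    ("The program picks out the tall players", "function"),
    ("by testing whether the head of the list is greater than 180", "operations"),
]


def code_by_sentence(coded_segments):
    """Coarse coding: the whole sentence takes the code of its opening segment."""
    first_code = coded_segments[0][1]
    return Counter([first_code])


def code_by_segment(coded_segments):
    """Fine-grained coding: every segment keeps its own code."""
    return Counter(code for _, code in coded_segments)


coarse = code_by_sentence(segments)
fine = code_by_segment(segments)

# Categories present in the fine-grained coding but lost in the coarse one.
lost = set(fine) - set(coarse)
print("coarse:", dict(coarse))
print("fine:  ", dict(fine))
print("lost:  ", lost)
```

In this toy case the coarse coding records one function statement and nothing else, so the operations-level detail disappears from any subsequent count.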
Furthermore, not all of the categories used by Pennington are defined succinctly.
In some cases, she says that particular categories include statements of a given type,
but does not give an exhaustive definition. In other cases, no definition is given,
simply an example. Without clear definitions of each category, it is not possible to
apply the scheme with any certainty.
Finally, the distinction between the information types and level of detail analyses
is not clear. In the examples given above (the only examples provided in Pennington
(1987a)), there seems to be a direct mapping between the procedural/detailed, data-flow/program, and functional/domain categories, suggesting that the two schemes
are redundant. Also, does this imply that data-flow statements are always couched in
program terms? It would have been useful to see examples where this direct mapping
does not occur so as to better understand the relationship between the
two schemes.
Although the above factors meant that Pennington’s analysis schemes were not
possible to replicate, they nonetheless served as a useful basis for the development of
new schemes, which are proposed below. The classification is similar to Pennington’s
in that it depends on two passes through the summaries: one based on information
types and the other based on object descriptions.
The information types classification is a more finely grained and fully specified
refinement of Pennington’s scheme, while the object classification is essentially a
more restricted version of Pennington’s levels of detail. It was decided to focus solely
on data objects within the program, as describing program events in terms of level of
detail was felt to entail a possible unwanted overlap with the information types
classification, as discussed above. Furthermore, given that the same data object can
be described in very different ways (e.g. a basketball team, a list of heights, or a list of
numbers), focusing on object descriptions provides much insight into how
programmers choose to describe program objects.
4.1. Information types classification
The information types classification is used to code summary statements on the
basis of the information types they contain. The categories which make up the
classification are described below, followed by a short discussion of the relationships
between categories, and the way in which they fit together to form a program summary.
4.1.1. Information types categories: descriptions and examples
The information types classification comprises 11 categories, described below with
examples of each. In some cases, segments preceding or following the segment of
interest have been included to provide context and aid understanding (shown in
square brackets).
function: the overall aim of the program, described succinctly.
The program is selecting all players over a certain height and allowing them
to join the team.
The program calculates the differences between the input distances…
actions: events occurring in the program which are described at a lower level than
function (i.e. they refer to events within the program), but at a higher level than
operations (described below). An action may involve a small group of nodes rather
than one node only. Alternatively, it may be described as operating over a series of
inputs, or describe actions in non-specific ways, e.g. describing tests in general, rather
than the exact tests being carried out.
This sub-program checks each individual element of this list…
‘Sun Span’ is then worked out.
The program makes two checks…
operations: small-scale events which occur in the program, such as tests, assignment,
etc. Operations usually correspond to one node in a VPL, or one line of textual code.
… then the program sets the height to head(height)...
… then it increments the counter by 1…
A selector checks to see if the set is equal to [ ] i.e. 0…
state-high: a high-level definition of the notion of state. Describes the current state of
a program when a condition has been met (and upon which an action is dependent).
State-high differs from state-low in terms of granularity: the former describes an
event at a more abstract level than the latter (which usually describes the direct result
of a test on a single data object). The relationship between the two is akin to the
relationship between actions and operations.
Once all the elements have been processed...
[The program continues] until there are no player left unchecked in the list
state-low: a lower-level version of state-high. State-low usually relates to a test
condition being met, or not met, and upon which an operation depends.
If the head is greater than 180…
…when the test is empty is true…
…if empty distances (e.g. [ ])…
data: inputs and outputs to programs, data flow through programs, and descriptions
of data objects and data states.
The program accepted a list of numbers indicating sunhours.
…it then passes a list of heights to a sub-program…
…the heights over the height are sent to the team…
control: information having to do with program control structures and with
sequencing, e.g. recursion, calls to sub-programs, stopping conditions.
…the nested recursions begin to unwind.
It exits the program and goes back to the main program…
elaborate: further information about a process/event/data object which has already
been described. This also includes examples.
[If the current mark is above a certain pass level] (65 in this case)…
[The head(numbers) is assigned to one variable] (which I’ll call mark)…
meta: statements about the participant’s own reasoning processes, e.g. ‘‘I’m not sure
what this does’’.
Dhoo! forgot where that route went!!!
…[and then joins it to the other value it would have created if it had done
what i just said] (complicated).
unclear: statements which cannot be coded because their meaning is ambiguous or
uninterpretable.
[If the height is greater than 180, 1 is added to the counter] and the height is
recorded. It is not clear here whether ‘recorded’ means ‘printed’, ‘added to a
list’, ‘assigned to a variable’...
The program is listing how many hours of sun there was only when the sun
was High.
incomplete: statements which cannot be coded because they are incomplete.
Statements which fall into this category tend to be unfinished sentences.
Information categories are related to each other in terms of level of granularity,
which can be envisaged as follows: at the top level, the program can be described in
terms of a small number of functions (in some cases, just one, if the programs are
small). At a finer level of granularity, these functions are accomplished by a series of
actions. The actions may be dependent on certain conditions, represented by state-high nodes. At an even finer level of granularity, the actions themselves are
implemented in the program by operations which usually correspond to a single line
of code, or one node in a VPL. Likewise, state-low nodes describe the state of a single
data object, usually just after a test.
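To illustrate how coded summaries might be tallied, the following sketch lists the 11 categories and computes the proportion of statements falling into each. It is an assumption-laden illustration rather than the analysis procedure used in the studies: the names CATEGORIES, GRANULARITY and category_proportions are hypothetical, the numeric granularity levels (0 = coarsest) are merely one way of encoding the relationships described above, and the coded summary is invented.

```python
from collections import Counter

# The 11 information types categories, in the order they are described above.
CATEGORIES = [
    "function", "actions", "operations", "state-high", "state-low",
    "data", "control", "elaborate", "meta", "unclear", "incomplete",
]

# Assumed numeric encoding of the granularity relationships in the text:
# functions are realised by actions (conditioned on state-high), which are
# implemented by operations (conditioned on state-low).
GRANULARITY = {
    "function": 0,
    "actions": 1, "state-high": 1,
    "operations": 2, "state-low": 2,
}


def category_proportions(codes):
    """Return the share of coded statements that fall into each category."""
    unknown = set(codes) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(codes)
    total = len(codes)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


# An invented coded summary: one category label per segment.
coded_summary = ["function", "data", "state-low",
                 "operations", "operations", "control"]

for category, share in category_proportions(coded_summary).items():
    if share > 0:
        print(f"{category}: {share:.2f}")
```

Proportions, rather than raw counts, are used here so that summaries of different lengths can be compared; this is a design choice of the sketch, not a requirement of the scheme.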
4.2. Object descriptions
The object classification looks at the way in which objects are described. The basic
question being asked is, ‘‘How do participants, when not constrained by specific
instructions, choose to describe objects present in the program?’’.
Some objects cannot be classified at more than one level. For example, a program
is, by definition, a program specific object. Similarly, objects introduced within the
program (i.e. not inputs or outputs), and which have a raison d’être only within a
program, cannot be classified in domain terms (e.g. a counter). However, the most
interesting cases arise when there is a choice of levels at which the object can be
described. For example, an input to the program could be described as a list of
numbers, or alternatively, as a series of basketball players’ heights.
4.2.1. Object categories: descriptions and examples
The object classification comprises seven categories, which were derived
empirically. The categories are described below with examples.
program only: refers to items which occur only in the program domain, and which
would not have a meaning in another context, for example, a counter.
This program initially sets a counter to zero…
program: an object, which could be described at various levels, described in program
terms. Program terms refer to the use of any program specific data structure
(e.g. a list) or variable (indicated by the lack of an article, the word in quotes,
capitalized, etc.).
…checking first whether the list is empty or not…
If ‘Height’ is then equal to or less than 180 ‘Sub Team’ is run again.
If the current height variable is above…
program-real-world: object descriptions using terminology which is valid in both real-world and program domains, e.g. results, numbers. This category contrasts with the
domain category in that the latter is specific to the problem domain described in the
program, e.g. basketball players’ heights, exam marks, distances between cities, while
the former refers to terminology which is more abstract, and would be shared across
problem domains. For example, a reference to numbers would be classified as a
program-real-world description, while exam marks would be classified as a domain
description.
The program takes 2 numbers...
The program gives out the 5 highest values that were input to the program.
program-domain: object descriptions which contain a mixture of program and
problem domain references, e.g. a list of marks (note that care must be taken to
ensure that domain references are not in fact being used as variable names), or
a reference which is equally valid in the program and the problem domains
(e.g. differences).
This is processing a list of marks…
…it then passes a list of heights to a sub-program.
domain: an object which is described in domain terms, e.g. a mark, a distance, sunny
days rather than by its representation within the program.
This program checks a basketball players height from [the list given].
This program calculate the number of students who passed…
indirect reference: an anaphoric reference to a data object.
…they are stripped in turn out of [the list].
…if it is then the program returns to the main program.
unclear: statements which are ambiguous and cannot be coded, either because the
statement itself is unclear, or because the object which is being referred to cannot be
identified.
…is sent to the pass marker…
…[the head goes into] a folder.
Some of the categories above have links with other categories. Program and
domain categories could be referred to as ‘‘pure’’ categories in the sense that they
refer to one level of description only. Program-real-world and program-domain are
amalgamates of pure categories. Program only is a special case: unlike the categories
just mentioned, it is used for objects which are inherently linked to the program
domain and hence cannot be described at other levels. Finally, indirect reference and
unclear statements are not linked to the others in any obvious way.
Additionally, the categories, as listed above, can be viewed as occurring on a
continuum in terms of degree of ambiguity, starting with specific ‘program’
references having a low degree of ambiguity, through to ‘‘unclear’’ statements at the
other end of the ambiguity spectrum.
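The relationships between the object categories can likewise be sketched in code. The snippet below is a hypothetical encoding: the names OBJECT_CATEGORIES, PURE, AMALGAMS and ambiguity_rank are illustrative, and treating a category's position in the list as an ordinal ambiguity rank is an assumption based on the continuum described above, not a measure defined by the scheme.

```python
# The seven object categories, in the order they are listed above.
OBJECT_CATEGORIES = [
    "program only",
    "program",
    "program-real-world",
    "program-domain",
    "domain",
    "indirect reference",
    "unclear",
]

# "Pure" categories refer to one level of description only;
# the amalgamated categories combine levels.
PURE = {"program", "domain"}
AMALGAMS = {"program-real-world", "program-domain"}


def ambiguity_rank(category):
    """Position on the assumed ambiguity continuum (higher = more ambiguous)."""
    return OBJECT_CATEGORIES.index(category)


def most_ambiguous(categories):
    """Return the category furthest along the assumed continuum."""
    return max(categories, key=ambiguity_rank)


# Invented object codes for a single summary.
codes = ["domain", "program", "indirect reference", "program-domain"]
print(most_ambiguous(codes))
print(sorted(codes, key=ambiguity_rank))
```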
5. Examples of applying the scheme
The program summary analysis scheme described above has been used in
an experiment which examined program comprehension in Prolog (Good
et al., 1997),1 and also in a comparative study which looked at simple mock-ups of
visual programming languages based on the data flow and control flow paradigms.
Program summary data was analysed in conjunction with multiple choice measures
of program comprehension.
This section briefly describes the latter study, before focusing on the results which
pertain to the program summaries themselves. The aim is to show how the analysis
scheme can be used, and its particular utility in comparative studies of
programming languages. As such, this section does not give a full account of the
study, and the results should not be taken to mean that one of the language
paradigms used in the study is somehow ‘‘better’’ than the other.
The study in question examined the effect of visual programming language
paradigm (data flow or control flow) on program comprehension. The programs
used were short, recursive list processing programs. Comprehension was measured
using multiple choice questions focusing on each of the five information types, and a
short program summary. By gathering quantitative data in the form of the multiple
choice responses, and qualitative data in the form of program summaries, our aim
was to provide a fuller account of participants’ program comprehension, a technique
used previously by Pennington (1987a), and Corritore and Wiedenbeck (1991).
Twenty participants took part in the experiment. All were starting the second year
of a computer studies degree, and had been taught C++ and COBOL. Participants
were randomly divided into two groups: control flow and data flow. The experiment
included a number of pre-tests which explored graphical skill and prior programming knowledge: as the results are not reported in this paper, they will not be
described.
The experimental setup was implemented in Macromedia Director. The first
screen explained the overall structure of the experimental session. The following
screens provided an introduction to the visual programming language (data or
control flow, depending on the group), presenting the nodes used in the language
with a textual description of their function. Participants also saw a sample program,
with an explanation of how each node contributed to producing the program output.
A practice session allowed participants to answer questions similar to those they
would encounter in the experiment.
In the experiment itself, participants were asked to study a short program. The
next screen required participants to type a free-form summary of the program into a
text box (the program was not visible on this screen). The subsequent five screens
each showed a multiple choice comprehension question (presented in random order),
corresponding to one of the five information types, along with the program. Both
groups saw the same programs, but their form differed depending upon the group to
which the participant had been assigned: control flow or data flow. Participants
worked through a total of four programs. Note that only the data pertaining to the
program summaries will be discussed here: full details of the study, and an in-depth
consideration of results, can be found in Good (1999).
1 Note that Good et al. (1997) describes only participants’ responses to multiple choice questions about their
program understanding; the analysis of their program summaries was not reported.
Fig. 1 shows, on the left, a simple visual control flow version of a program
designed to count the number of passing marks in a list, and, on the right, the
corresponding data flow version.
Tables 1 and 2 show the application of the information types classification to two
program summaries of the passes program. The first summary is from a participant
who studied the control flow version of the program, while the second is from a
participant in the data flow condition.
In the study which compared data flow and control flow visual program
representations, differences between the two groups were apparent in the mean length of the
program summaries: 70.91 words for the control flow group, and 48.85 words for the
data flow group. However, the program summary analysis scheme allows for a
finer-grained analysis of the differences, as shown in Table 3, which gives the mean
proportion of statements in each information types category for the control flow and
data flow groups. Summaries from the data flow group contain higher proportions of function,
action, state-high, and data flow information types than do those from the control flow group.
The control flow group’s summaries contain higher proportions of operation, state-low, and
control flow statements. In terms of level of granularity, the
control flow group’s summaries contain many more low-level statements than those of the
data flow group.
A comparison of object description categories yielded the following results
(Table 4).
Fig. 1. The passes program. Left: control flow version. Right: data flow version.
Table 1
Detailed description of the passes program, with statement types

Statement                                           Code
it first checks to see if the mark list is empty    operation
if it is                                            state-low
then the program exits                              control
if that check is false                              state-low
it then sets mark to the first number in the list,  operation
and sets the rest of the list to some variable,     operation
it then checks to see if mark is greater than 65    operation
if it is                                            state-low
it then adds 1 to pass                              operation
and exit                                            control
if it is not                                        state-low
it recurses                                         control
Table 2
Higher level description of the passes program, with statement types

Statement                               Code
This is processing a list of marks      action
and finding which are greater than 65   action
and adding 1 to the counter             operation
to give a result of how many passed     data
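The coded statements in Tables 1 and 2 can be converted into per-summary proportions of the kind that are averaged across participants in Table 3. The following is a minimal sketch of that step, not the authors' own procedure: the function name `proportions` and the list names are illustrative, and the codes are simply transcribed from the two tables.

```python
from collections import Counter

# Information type codes transcribed from Table 1 (control flow participant).
table1_codes = [
    "operation", "state-low", "control", "state-low", "operation", "operation",
    "operation", "state-low", "operation", "control", "state-low", "control",
]

# Information type codes transcribed from Table 2 (data flow participant).
table2_codes = ["action", "action", "operation", "data"]


def proportions(codes):
    """Return the percentage of statements assigned to each category."""
    counts = Counter(codes)
    total = len(codes)
    return {category: 100.0 * n / total for category, n in counts.items()}


p1 = proportions(table1_codes)
p2 = proportions(table2_codes)

for name, result in (("Table 1", p1), ("Table 2", p2)):
    for category, pct in sorted(result.items()):
        print(f"{name}: {category:<10} {pct:6.2f}%")
```

For these two summaries, every statement in the first falls into operation, state-low or control, whereas half of the statements in the second are coded as action.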
Table 3
Mean proportion of information types statements per group

Category      Control flow (mean %)   Data flow (mean %)
Function      11.62                   20.93
Data          13.10                   24.68
State-high     6.22                    8.23
Action         7.10                    9.10
Operation     30.22                   15.67
State-low     12.93                   10.04
Control       14.10                    5.33
Elaborate      0.49                    3.61
Meta           0.15                    1.05
Unclear        4.07                    1.36
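The observation that the control flow group produced many more low-level statements can be checked by summing the relevant rows of Table 3. The sketch below assumes that "low-level" means the operation, state-low and control categories, which is the grouping the paper uses for its lowest level of abstraction in Section 6; the variable and function names are illustrative.

```python
# Mean percentages transcribed from Table 3 as (control flow, data flow).
table3 = {
    "function": (11.62, 20.93),
    "data": (13.10, 24.68),
    "state-high": (6.22, 8.23),
    "action": (7.10, 9.10),
    "operation": (30.22, 15.67),
    "state-low": (12.93, 10.04),
    "control": (14.10, 5.33),
    "elaborate": (0.49, 3.61),
    "meta": (0.15, 1.05),
    "unclear": (4.07, 1.36),
}

# Assumed grouping: the three categories treated as low-level in Section 6.
LOW_LEVEL = ("operation", "state-low", "control")


def share(categories, group_index):
    """Sum the mean percentages of the given categories for one group."""
    return sum(table3[c][group_index] for c in categories)


low_control = share(LOW_LEVEL, 0)  # control flow group
low_data = share(LOW_LEVEL, 1)     # data flow group

print(f"Low-level share, control flow: {low_control:.2f}%")
print(f"Low-level share, data flow:    {low_data:.2f}%")
```

On these figures the low-level categories account for roughly 57% of the control flow group's statements and roughly 31% of the data flow group's.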
Summaries from the data flow group contain higher proportions of program-real-world,
program-domain and indirect statements than do those from the control
flow group. The control flow group’s summaries contain higher proportions of
program only, program and domain statements. Finally, data flow subjects made
more references to objects which were judged to be unclear than did control flow
subjects.

Table 4
Mean proportion of object description statements per group

Category             Control flow (mean %)   Data flow (mean %)
Program only          4.07                    3.81
Program              46.93                   33.84
Program-real-world   11.21                   18.55
Program-domain        4.52                    5.29
Domain               22.78                   20.46
Indirect             10.09                   17.02
Unclear               0.39                    1.03
The results from the information types analysis are compelling, and point to
differences between the two groups in terms of communicating their program
comprehension. This suggests that the information types scheme is of value in
capturing fine-grained differences in program summaries. On the other hand, results
from applying the object description analysis are less clear-cut, and point to the need
for further investigation of the scheme’s utility.
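One way to see why the object description results are harder to read is to tabulate the difference between the two groups for each category of Table 4. The sketch below does only that arithmetic; it is not a significance test, and the names used are illustrative.

```python
# Mean percentages transcribed from Table 4 as (control flow, data flow).
table4 = {
    "program only": (4.07, 3.81),
    "program": (46.93, 33.84),
    "program-real-world": (11.21, 18.55),
    "program-domain": (4.52, 5.29),
    "domain": (22.78, 20.46),
    "indirect": (10.09, 17.02),
    "unclear": (0.39, 1.03),
}

# Difference in percentage points: data flow minus control flow.
differences = {
    category: round(data_flow - control_flow, 2)
    for category, (control_flow, data_flow) in table4.items()
}

higher_in_data_flow = [c for c, d in differences.items() if d > 0]

for category, diff in differences.items():
    print(f"{category:<20} {diff:+6.2f}")
```

Several of the differences (program only, program-domain, unclear) are under one percentage point, while the larger shifts are concentrated in the program, program-real-world and indirect categories.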
6. Discussion and conclusions
This paper presented a coding scheme for analysing program summaries. The
scheme aims to provide more authentic measures of program comprehension, by
allowing programmers to express their understanding in their own words. To date,
the scheme has been applied to textual descriptions of programs, but it is
hypothesized that it could be equally useful for verbal protocols gathered during
comprehension tasks. The scheme has been used on descriptions of programs written
in diverse languages, and seems to capture subtleties in participants’ reporting of
comprehension resulting from differences in program representation.
In addition, the scheme has also been used to look at levels of abstraction in
program summaries. By mapping information types onto levels of abstraction (with
function, elaborate and meta statements at the highest level of abstraction; action,
state-high and data statements at an intermediate level, and operation, state-low and
control statements at a low level of abstraction), we found that participants in the
control flow condition tended not to change level of abstraction between statements:
almost 80% of consecutive statements in their summaries represent cases where the
statements may be of differing information type, but are at the same level of
abstraction. By contrast, in the data flow group, less than half of their consecutive
statements were of this kind, showing that they were as likely to change level of
abstraction as to stay within the same level.
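The measure described above can be sketched as follows: map each coded statement to its level of abstraction, then compute the share of consecutive statement pairs that stay at the same level. This is a minimal illustration rather than the authors' analysis code. The level labels and function name are assumptions, as is the decision to skip statements coded as unclear, which the source does not address.

```python
# Mapping from information type to level of abstraction, as given in the text.
LEVEL = {
    "function": "high", "elaborate": "high", "meta": "high",
    "action": "intermediate", "state-high": "intermediate", "data": "intermediate",
    "operation": "low", "state-low": "low", "control": "low",
}


def same_level_share(codes):
    """Fraction of consecutive statement pairs at the same level of abstraction.

    Assumption: codes without a level (e.g. "unclear") are skipped.
    Returns None when fewer than two statements remain.
    """
    levels = [LEVEL[c] for c in codes if c in LEVEL]
    pairs = list(zip(levels, levels[1:]))
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)


# Codes transcribed from Table 1 (control flow) and Table 2 (data flow).
table1_codes = [
    "operation", "state-low", "control", "state-low", "operation", "operation",
    "operation", "state-low", "operation", "control", "state-low", "control",
]
table2_codes = ["action", "action", "operation", "data"]

share_table1 = same_level_share(table1_codes)
share_table2 = same_level_share(table2_codes)
print(share_table1, share_table2)
```

For the two example summaries, the Table 1 summary never leaves the low level, whereas the Table 2 summary changes level in two of its three consecutive pairs.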
Furthermore, by using the scheme in conjunction with an analysis of errors in
program summaries, where errors were classified at the highest level as either errors
of commission, or errors of omission, we discovered that although control flow
participants produced long summaries at consistently low levels of abstraction, they
tended to exhibit more errors of omission than did data flow participants, who
produced shorter summaries at varied levels of abstraction. This research on levels of
abstraction and errors in program summaries is reported in full in Good and
Oberlander (2002), but is mentioned here in order to demonstrate varying ways in
which the scheme may be put to use.
Finally, work is now underway to refine and generalize the scheme by investigating
the following issues:
Inter-rater reliability: the results described in this paper were obtained through the
application of the scheme by one researcher. Explicit coding instructions and
procedures have been developed, and the scheme is currently being applied by two
other researchers in order to measure inter-rater reliability.
Universality: as mentioned above, the scheme has been used with diverse
languages; however, two of these languages were not full-scale executable languages.
In order to judge the general applicability of the scheme, it is currently being tested
out on summaries of COBOL programs.
Scalability: the programs described in this paper were short programs, which were
studied by novice programmers. We are now looking at much longer programs
(approximately 1200 lines of code), being studied by professional programmers.
Summary modality: the scheme has been applied to written program summaries.
The scheme is now being applied to verbal protocol data which was obtained both
during and following a comprehension task.
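For the inter-rater reliability item above, the source does not say which agreement statistic is being used. As an assumption, the sketch below shows Cohen's kappa, a common chance-corrected measure of agreement between two coders. The two code sequences are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same statements."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of statements")
    n = len(rater_a)
    # Observed agreement: fraction of statements given the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal code frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical code throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical codings of four statements by two raters (illustration only).
rater_a = ["operation", "state-low", "control", "operation"]
rater_b = ["operation", "state-low", "operation", "operation"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

With more than two raters, or with a different notion of agreement, another statistic would be needed; kappa is shown here only because it is widely used for pairs of coders.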
We are confident that this ongoing work will allow us to shed more light on
the scope and applicability of the coding scheme, making it useful for other
researchers who are interested in describing the structure and content of program
comprehension artefacts. At the same time, just as the cognitive dimensions
of notations framework has been described as a potentially extensible framework
(Blackwell et al., 2001), we feel that the coding scheme described in this paper
is also a work in progress, and we welcome comments, suggestions and
refinements.
Acknowledgements
This work was carried out at the Human Communication Research Centre,
University of Edinburgh, and was supported by the UK Engineering and Physical
Sciences Research Council, through Grant GR/L36987. Our grateful thanks to Jon
Oberlander and Richard Cox, our collaborators on the project.
References
Blackwell, A.F., Britton, C., Cox, A., Green, T.R.G., Gurr, C.A., Kadoda, G.F., Kutar, M., Loomes, M.,
Nehaniv, C.L., Petre, M., Roast, C., Roes, C., Wong, A., Young, R.M., 2001. Cognitive dimensions of
notations: design tools for cognitive technology. In: Beynon, M., Nehaniv, C.L., Dautenhahn, K.
(Eds.), Cognitive Technology 2001. Springer, Berlin, pp. 325–341.
Boehm-Davis, D.A., 1988. Software comprehension. In: Helander, M. (Ed.), Handbook of Human–
Computer Interaction. Elsevier, Amsterdam, pp. 107–121.
Brooks, R., 1983. Towards a theory of the comprehension of computer programs. International Journal of
Man–Machine Studies 18 (6), 543–554.
Corritore, C.L., Wiedenbeck, S., 1991. What do novices learn during program comprehension?
International Journal of Human–Computer Interaction 3, 199–222.
Good, J., 1999. Programming paradigms, information types and graphical representations: empirical
investigations of novice program comprehension. Ph.D. Thesis, Department of Artificial Intelligence,
University of Edinburgh.
Good, J., Oberlander, J., 2002. Verbal effects of visual programs: information type, structure and error in
program summaries. Document Design 3, 120–134.
Good, J., Brna, P., Cox, R., 1997. Novices and program comprehension: does language make a difference?
In: Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society, LEA, Stanford,
California, p. 936. (A longer version appeared as Technical Report 97/10, Computer Based Learning
Unit, University of Leeds).
Green, T.R.G., 1980. Programming as a cognitive activity. In: Smith, H., Green, T.R.G. (Eds.), Human
Interaction with Computers. Academic Press, New York, pp. 271–320.
Green, T.R.G., Sime, M.E., Fitter, M.J., 1980. The problems the programmer faces. Ergonomics 23,
893–907.
Letovsky, S., 1986. Cognitive processes in program comprehension. In: Soloway, E., Iyengar, S. (Eds.),
Empirical Studies of Programmers: First Workshop. Ablex Publishing Corporation, New Jersey,
pp. 58–79.
O’Brien, M.P., Shaft, T.M., Buckley, J., 2001. An open-source analysis scheme for identifying software
comprehension processes. In: Kadoda, G. (Ed.), Proceedings of PPIG-13: 13th Annual Meeting of the
Psychology of Programming Interest Group, Bournemouth, UK, pp. 129–146.
Pennington, N., 1987a. Stimulus structures and mental representations in expert comprehension of
computer programs. Cognitive Psychology 19, 295–341.
Pennington, N., 1987b. Comprehension strategies in programming. In: Olson, G.M., Sheppard, S.,
Soloway, E. (Eds.), Empirical Studies of Programmers: Second Workshop. Ablex Publishing
Corporation, New Jersey, pp. 100–113.
Ramalingam, V., Wiedenbeck, S., 1997. An empirical study of novice program comprehension in the
imperative and object-oriented styles. In: Proceedings of Seventh Workshop on Empirical Studies of
Programmers, ACM Press, New York, pp. 124–139.
Romero, P., Cox, R., du Boulay, B., Lutz, R., 2002. Visual attention and representation switching during
Java program debugging: a study using the Restricted Focus Viewer. In: Proceedings of the Second
International Conference on the Theory and Application of Diagrams (Diagrams 2002), Callaway
Gardens, GA, pp. 221–235.
Shneiderman, B., Mayer, R., 1979. Syntactic/semantic interactions in programmer behavior: a model and
experimental results. International Journal of Computer and Information Sciences 8 (3), 219–238.
Upchurch, R., 2002. Code reading and program comprehension: annotated bibliography. Retrieved
January 6, 2003 from http://www2.umassd.edu/SWPI/ProcessBibliography/bib-codereading2.html.
von Mayrhauser, A., Lang, S., 1999. A coding scheme to support systematic analysis of software
comprehension. IEEE Transactions on Software Engineering 25, 526–540.