Int. J. Human-Computer Studies 61 (2004) 169–185

Program comprehension and authentic measurement: a scheme for analysing descriptions of programs

Judith Good a,*, Paul Brna b

a Organizational Learning and Instructional Technologies, College of Education, MSC05-3040 1, University of New Mexico, 106 Education Office Building, Albuquerque, NM 87131-1231, USA
b School of Informatics, Pandon Building, Northumbria University, Newcastle upon Tyne NE1 8ST, UK

Received 13 October 2003; accepted 19 December 2003

*Corresponding author. Tel.: +1-505-277-2028. E-mail address: judithg@unm.edu (J. Good).
doi:10.1016/j.ijhcs.2003.12.010
© 2004 Elsevier Ltd. All rights reserved.

Abstract

This paper describes an analysis scheme which was developed to probe the comprehension of computer programming languages by students learning to program. The scheme operates on free-form program summaries, i.e. textual descriptions of a program which are produced in response to minimal instructions by the researcher/experimenter. The scheme has been applied to descriptions of programs written in various languages, and it is felt that the scheme has the potential to be applied to languages of markedly different types (e.g. procedural, object-oriented, event-driven). The paper first discusses the basis for the scheme, before describing the scheme in detail. It then presents examples of the scheme's application, and concludes with a discussion of some open issues.

1. Introduction

The comprehensibility of programming languages is a topic of interest in an era in which much effort is being expended to open up programming to an increasingly wider audience. Although much work has already been carried out in the area of program comprehension (see Upchurch (2002) for an extensive bibliography), it is a complex topic which deserves further study.

One of the criticisms levelled at program comprehension studies, and indeed at many studies of programming in general, is that they do not capture the richness of the activities which occur in a naturalistic setting. Studies of programmers are often limited to experiments involving small numbers of undergraduate students individually studying short programs and answering multiple choice questions during a 1-hour period. This is far removed from industrial settings where groups of programmers work for several months or more on programs which are thousands of lines long and which may have originally come from a number of sources.

At the same time, it is useful to distinguish between learning to program and exercising the skill of programming. The former case is of interest to many researchers because of the steep learning curve typically associated with the process, and because of a desire to address the issue of the bimodal distribution of scores which often occurs in programming courses. In the case of novices learning to program, more naturalistic research does not necessarily imply longitudinal, large-scale group studies involving complex programs, as these may not be representative of novice programming tasks. On the other hand, it is difficult to envisage program comprehension activities which are naturalistic for novices, since most computer science teaching includes few tasks whose primary goal is learning to understand programs.
Nonetheless, many of the tasks set for novice programmers do require them to understand chunks of code, and we would argue that by opting for more open-ended methods of measuring skills, and finer grained analyses of data, we can develop a deeper understanding of the processes involved in novice program comprehension.

We have been working on more authentic ways to measure program comprehension. One of the ways we are investigating this is by looking at how programmers communicate their own understanding of a program when asked to do so using non-directive questioning. Other researchers are also tackling this issue: for example, von Mayrhauser and Lang (1999) have developed a scheme for analysing verbal protocols of the actions which a programmer makes during a software maintenance task such as debugging. Similarly, O'Brien et al. (2001) have focused on the processes involved in program comprehension, using verbal protocol analysis to investigate the use of different comprehension strategies.

The comprehension process has long been a subject of interest, with a number of researchers proposing theories which describe the steps in the process. Brooks (1983) proposed one of the first temporal models of program comprehension, envisaging comprehension as a top-down process. Shneiderman and Mayer (1979) and Pennington (1987a) developed models based on a bottom-up conceptualization, with understanding occurring as the result of the integration of lower-level segments into an overall whole. Letovsky (1986) focused on programmers as knowledge-based understanders, using a mix of bottom-up and top-down processes in comprehension. Boehm-Davis (1988) similarly conceives of comprehension as an iterative, opportunistic process.

In this paper, we present a scheme for measuring program comprehension which involves coding free-form program summary statements. Rather than examining the processes involved in program comprehension (the goal of the research described in the preceding paragraph), this scheme focuses on the content and structure of the information artefacts which are produced following a program comprehension phase. In a study of this type, participants are asked to examine programs, and to write a summary describing what the program does in their own words. We then code the statements in the summary according to the types of information which they represent. We feel that the scheme offers a promising way of investigating program comprehension in depth and that, combined with quantitative measures, it can give us more insight into the types of information that programmers glean from a program.

The rest of the paper is structured as follows: Section 2 discusses some of the issues involved in analysing program summaries. Section 3 describes the background to the program summary analysis scheme, and the origins of the idea of information types in programs. Section 4 describes the scheme itself, while Section 5 presents some examples of applying the scheme to different types of programming languages. Finally, Section 6 discusses open questions and ideas for further work.

2. Program summaries: issues of analysis

Program summaries have played an important role in the information types methodology, and data of this type have been collected by Pennington (1987a), Corritore and Wiedenbeck (1991) and Good (1999).
A program summary is a free-form account of a program which an individual produces after studying the program. Instructions given to participants have tended to be relatively non-directive, leaving the content essentially up to them. The lack of explicit guidelines for the content of the summary allows for wide scope and variation in the responses. By the same token, the open-endedness of the task provides a valuable source of rich, realistic data. Furthermore, the program summary methodology neatly circumvents the problems of 'false positive' results often associated with binary choice questions, and the difficulties associated with developing sensitive and reliable multiple choice questions and corresponding distracter items. Program summaries allow participants to express their view of a program, using their own words, at their chosen level of abstraction, including as much (or as little) detail as they feel is necessary.

As always, the price to pay for rich data is the difficulty of analysing it: quantitative statistics are not always appropriate, and qualitative methods must be devised. There are numerous ways of analysing written texts of this type; however, it is not simply a case of identifying the 'correct' method in the same way one chooses the right statistical test, particularly when the semantic content of the text is of interest. Analysis schemes are rarely universal, given differences in research aims between studies. They are both content and context sensitive, and must be developed through what is often a lengthy, iterative process.

The complexity of the analysis is related to the issue of replicability. Rich, complex data may lead to complex analysis schemes: extra care must be taken to ensure that these schemes are in fact understandable and usable, and are no more complex than is necessary for the purpose of the analysis. Replicability can also be compromised by schemes which are ill-defined. Schemes which are not fully worked out and/or which are not accompanied by explicit instructions enabling them to be used by persons other than the original researcher are not of much use: it is impossible to compare results reliably.

In designing a scheme to analyse program summaries, it is necessary to ensure both that it is complete, in the sense that all of the statements in the summary can be classified in some way, and that it contains a number of distinct categories, so that it allows for the detection of different patterns of statements across program summaries. In reviewing the literature on program comprehension, Pennington's description of information types was felt to provide a useful starting point for developing the scheme. Information types are described in more detail in the following section.

3. Background to the scheme

3.1. Information types

Information types have been defined as "different kinds of information explicit 'in the text' that must be detected in order to fully understand the program" (Green et al., 1980; Green, 1980), cited in Pennington (1987a, p. 299). The concept of information types has been used extensively by Pennington (1987a, b), in studies by Wiedenbeck, Corritore and colleagues (see, for example, Corritore and Wiedenbeck, 1991; Ramalingam and Wiedenbeck, 1997) and more recently by Romero et al. (2002). Pennington described information types in terms of internal, rather than external, abstractions of a computer program.
They are not meant to be mental representations of the program, but are "based on formal analyses of programs developed by computer scientists" (Pennington, 1987a, p. 298), and can be compared with the abstractions which are made with respect to natural language, such as referential or causal abstractions. Pennington identified five types of information: function, control flow, data flow, state and operations, defined as follows:

Function: information about the overall goal of the program, essentially, "What is the purpose of the program? What does the program do?". Since function also includes program subgoals, goals and subgoals can be represented in a goal hierarchy. Some information about the order of events can be inferred, but not details of how the events are implemented.

Control flow: information about the temporal sequence of events occurring in the program, e.g. "What happens after X occurs? What has occurred just before X?" If the information is represented graphically, then the links will correspond to the direction of control, rather than to the movement of data. Data flow information can be inferred in a program by searching for repeated occurrences of data objects, but goal/subgoal information is harder to detect.

Data flow: essentially concerned with the transformations which data objects undergo during execution, including data dependencies and data structure information, e.g. "Does variable X contribute to the final value of Y?" Data flow and function are linked in the sense that function information can be partially reconstructed from a data flow abstraction. Similarly, control flow information can also be more readily inferred from the data flow abstraction than from the function abstraction.

Operations: information about specific actions which take place in the code, generally corresponding to a single line of code or less, such as "Does a variable become instantiated with a particular value?" Although Pennington does not describe these in great detail, operations seem most related to control flow information, in the sense that describing the control flow of a program would lead to "stringing together" a series of operations.

State: time-slice descriptions of the state of objects and events in the program, e.g. "When the program is in state X, is event Y taking place, or has object Z been created/modified?" This abstraction is quite distinct in the sense that other types of information are hard to infer from it, and vice versa.

The categories are orthogonal in terms of information coverage. Although Pennington does not address the issue of granularity explicitly, the categories vary: function can often cover the entire program, while operations will concern only a single line (or node, in the case of a visual programming language).

Pennington was interested not so much in information types per se, but in the relationship between information types and what she called programming knowledge structures. She looked at two competing knowledge structures: text structure knowledge, which is organized around control structure primes, and plan knowledge, which, according to her, is primarily functionally oriented. Pennington carried out two experiments to investigate these ideas, one of which involved the analysis of participants' written summaries of the programs they were asked to examine, as described in the next section.
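Before turning to that methodology, the five information types can be made concrete with a small illustration. The fragment below is ours rather than Pennington's; the function name, variable names and pass level are invented, and the annotations simply show the question each information type asks of the program:

    # Illustrative fragment only; the annotations give the question that each of
    # Pennington's information types would ask about this program.

    def count_passes(marks, pass_level=65):
        # Function: what is the overall purpose of the program?
        #   ("It counts how many marks are passes.")
        passes = 0                    # Operations: what does this single line do?
        for mark in marks:            # Control flow: what happens after what?
            if mark > pass_level:     # State: when this condition holds, what is
                                      #   true of the current mark?
                passes = passes + 1   # Data flow: which values contribute to the
                                      #   final value of passes?
        return passes

A single free-form summary of even a program this small could foreground any of these abstractions, which is what makes the information types useful as an analytical lens.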
3.2. Pennington's methodology for program summary analysis

Analysing program summaries as a way of measuring program comprehension can be traced to an experiment carried out by Pennington (1987a). In addition to answering binary choice questions about a program of moderate length, participants were also asked to write a summary of the program at two points during the experiment: firstly after a 45 min study period, and again after having carried out a modification to the program. Although the exact wording of the request is not given, it is likely that the instructions were brief and non-directive with respect to the type of information the summary should contain.

Pennington performed two analyses on the program summaries, classifying each statement by both information type and level of detail. The methods used in the analysis are described in the following two sections.

3.2.1. Information type analysis

Pennington states that the information types investigated included procedural, data flow, and function statements. The other categories used in the program comprehension tests, namely operations and state, do not seem to have been used: no results were reported for them in any case. Why they were omitted from the analysis is not discussed.

Pennington's definition of each category is very brief, and expressed primarily through examples. She defines the three categories as follows: "procedural statements include statements of process, ordering and conditional program actions" (Pennington, 1987a, p. 332); "data flow statements also include statements about data structures" (Pennington, 1987a, p. 332); functional statements are not defined by Pennington, but illustrated with an example. The following summary excerpts are provided to illustrate each type of statement, all from Pennington (1987a, p. 332):

Procedural: "after this, the program will read in the cable file, comparing against the previous point of cable file, then on equal condition compares against the internal table ... if found, will read the tray-area-point file for matching point-area. In this read if found, will create a type-point-index record. If not found, will read another cable record."

Data flow: "the tray-point file and the tray-area file are combined to create a tray-area-point file in Phase 1 of the program. Phase 2 tables information from the typecode file in working storage. The parameter file, cables file and the tray-area-point file are then used to create a temporary-exceed-index file and a point-index file."

Functional: "the program is computing area for cable accesses throughout a building. The amount of area per hole is first determined and then a table for cables and diameters is loaded. Next a cable file is read to accumulate the sum of the cables' diameters going through each hole."

3.2.2. Level of detail analysis

Pennington defined four levels of detail for program summaries:

detailed: references to a program's operations and variables;
program: references to a program's "procedural blocks";
domain: references to real-world objects;
vague: statements with no specific referents.
Pennington uses the example summary segments above as illustrations of the level of detail: the procedural summary is the most detailed, the data flow summary is described at the program level, the functional summary is described at the domain level, and an example of a vague statement is, "this program reads and writes a lot of files" (Pennington, 1987a, p. 333).

4. Description of the program summary analysis scheme

Based on the description given in Pennington (1987a), applying Pennington's schemes to program summaries presented a number of difficulties. Firstly, Pennington used only three categories to classify program summaries (procedural, function and data flow). From her description, the mapping between procedural statements and the five information types is not clear: are procedural statements equivalent to the operations information type? The control information type? A combination of the two, or something else entirely?

Secondly, as mentioned above, Pennington does not explain why only three categories are used, rather than five (corresponding to the five information types). If the five information types are defined to encode different types of program knowledge, it follows that an analysis scheme which does not include all categories will not be sufficient to code all statements.

The granularity of application of the categories is also unclear: were program summaries segmented? If so, how? Or was a single category applied to an entire summary? Parts of a summary? Without this information, it is impossible to reliably apply the scheme. Pennington's examples above suggest that the categories may have been applied to several sentences at a time; however, not segmenting sentences may mean that detail is lost. For example, a sentence describing a program may begin with an overall description of the program, i.e. a function statement, before going on to describe a low-level detail in the program's functioning, which might be classified as an operations statement. Coding the entire sentence as a function statement would lose valuable information.

Furthermore, not all of the categories used by Pennington are defined succinctly. In some cases, she says that particular categories include statements of a given type, but does not give an exhaustive definition. In other cases, no definition is given, simply an example. Without clear definitions of each category, it is not possible to apply the scheme with any certainty.

Finally, the distinction between the information types and level of detail analyses is not clear. In the examples given above (the only examples provided in Pennington (1987a)), there seems to be a direct mapping between the procedural and detailed, data flow and program, and functional and domain categories, suggesting that the two schemes are redundant. Also, does this imply that data flow statements are always couched in program terms? It would have been useful to see examples where this direct mapping does not occur so as to better understand the relationship between the two schemes.

Although the above factors meant that it was not possible to replicate Pennington's analysis schemes, they nonetheless served as a useful basis for the development of new schemes, which are proposed below. The classification is similar to Pennington's in that it depends on two passes through the summaries: one based on information types and the other based on object descriptions.
The information types classification is a more finely grained and fully specified refinement of Pennington's scheme, while the object classification is essentially a more restricted version of Pennington's levels of detail. It was decided to focus solely on data objects within the program, as describing program events in terms of level of detail was felt to entail a possible unwanted overlap with the information types classification, as discussed above. Furthermore, given that the same data object can be described in very different ways (e.g. a basketball team, a list of heights, or a list of numbers), focusing on object descriptions provides much insight into how programmers choose to describe program objects.

4.1. Information types classification

The information types classification is used to code summary statements on the basis of the information types they contain. The categories which make up the classification are described below, followed by a short discussion of the relationships between categories, and the way in which they fit together to form a program summary.

4.1.1. Information types categories: descriptions and examples

The information types classification comprises 11 categories, described below with examples of each. In some cases, segments preceding or following the segment of interest have been included to provide context and aid understanding (shown in square brackets).

function: the overall aim of the program, described succinctly.
  The program is selecting all players over a certain height and allowing them to join the team.
  The program calculates the differences between the input distances...

actions: events occurring in the program which are described at a lower level than function (i.e. they refer to events within the program), but at a higher level than operations (described below). An action may involve a small group of nodes rather than one node only. Alternatively, it may be described as operating over a series of inputs, or describe actions in non-specific ways, e.g. describing tests in general, rather than the exact tests being carried out.
  This sub-program checks each individual element of this list...
  'Sun Span' is then worked out.
  The program makes two checks...

operations: small-scale events which occur in the program, such as tests, assignment, etc. Operations usually correspond to one node in a VPL, or one line of textual code.
  ... then the program sets the height to head(height)...
  ... then it increments the counter by 1...
  A selector checks to see if the set is equal to [ ] i.e. 0...

state-high: a high-level definition of the notion of state. Describes the current state of a program when a condition has been met (and upon which an action is dependent). State-high differs from state-low in terms of granularity: the former describes an event at a more abstract level than the latter (which usually describes the direct result of a test on a single data object). The relationship between the two is akin to the relationship between actions and operations.
  Once all the elements have been processed...
  [The program continues] until there are no player left unchecked in the list
state-low: a lower-level version of state-high. State-low usually relates to a test condition being met, or not met, and upon which an operation depends.
  If the head is greater than 180...
  ...when the test is empty is true...
  ...if empty distances (e.g. [ ])...

data: inputs and outputs to programs, data flow through programs, and descriptions of data objects and data states.
  The program accepted a list of numbers indicating sunhours.
  ...it then passes a list of heights to a sub-program...
  ...the heights over the height are sent to the team...

control: information having to do with program control structures and with sequencing, e.g. recursion, calls to sub-programs, stopping conditions.
  ...the nested recursions begin to unwind. It exits the program and goes back to the main program...

elaborate: further information about a process/event/data object which has already been described. This also includes examples.
  [If the current mark is above a certain pass level] (65 in this case)...
  [The head(numbers) is assigned to one variable] (which I'll call mark)...

meta: statements about the participant's own reasoning processes, e.g. "I'm not sure what this does".
  Dhoo! forgot where that route went!!!
  ...[and then joins it to the other value it would have created if it had done what i just said] (complicated).

unclear: statements which cannot be coded because their meaning is ambiguous or uninterpretable.
  [If the height is greater than 180, 1 is added to the counter] and the height is recorded.
  (It is not clear here whether 'recorded' means 'printed', 'added to a list', 'assigned to a variable'...)
  The program is listing how many hours of sun there was only when the sun was High.

incomplete: statements which cannot be coded because they are incomplete. Statements which fall into this category tend to be unfinished sentences.

Information categories are related to each other in terms of level of granularity, which can be envisaged as follows: at the top level, the program can be described in terms of a small number of functions (in some cases, just one, if the programs are small). At a finer level of granularity, these functions are accomplished by a series of actions. The actions may be dependent on certain conditions, represented by state-high nodes. At an even finer level of granularity, the actions themselves are implemented in the program by operations which usually correspond to a single line of code, or one node in a VPL. Likewise, state-low nodes describe the state of a single data object, usually just after a test.

4.2. Object descriptions

The object classification looks at the way in which objects are described. The basic question being asked is, "How do participants, when not constrained by specific instructions, choose to describe objects present in the program?". Some objects cannot be classified at more than one level. For example, a program is, by definition, a program-specific object. Similarly, objects introduced within the program (i.e. not inputs or outputs), and which have a raison d'être only within a program, cannot be classified in domain terms (e.g. a counter). However, the most interesting cases arise when there is a choice of levels at which the object can be described. For example, an input to the program could be described as a list of numbers, or alternatively, as a series of basketball players' heights.

4.2.1. Object categories: descriptions and examples

The object classification comprises seven categories, which were derived empirically. The categories are described below with examples.
program only: refers to items which occur only in the program domain, and which would not have a meaning in another context, for example, a counter.
  This program initially sets a counter to zero...

program: an object, which could be described at various levels, described in program terms. Program terms refer to the use of any program-specific data structure (e.g. a list) or variable (indicated by the lack of an article, the word in quotes, capitalized, etc.).
  ...checking first whether the list is empty or not...
  If 'Height' is then equal to or less than 180 'Sub Team' is run again.
  If the current height variable is above...

program-real-world: object descriptions using terminology which is valid in both real-world and program domains, e.g. results, numbers. This category contrasts with the domain category in that the latter is specific to the problem domain described in the program, e.g. basketball players' heights, exam marks, distances between cities, while the former refers to terminology which is more abstract, and would be shared across problem domains. For example, a reference to numbers would be classified as a program-real-world description, while exam marks would be classified as a domain description.
  The program takes 2 numbers...
  The program gives out the 5 highest values that were input to the program.

program-domain: object descriptions which contain a mixture of program and problem domain references, e.g. a list of marks (note that care must be taken to ensure that domain references are not in fact being used as variable names), or a reference which is equally valid in the program and the problem domains (e.g. differences).
  This is processing a list of marks...
  ...it then passes a list of heights to a sub-program.

domain: an object which is described in domain terms, e.g. a mark, a distance, sunny days, rather than by its representation within the program.
  This program checks a basketball players height from [the list given].
  This program calculate the number of students who passed...

indirect reference: an anaphoric reference to a data object.
  ...they are stripped in turn out of [the list].
  ...if it is then the program returns to the main program.

unclear: statements which are ambiguous and cannot be coded, either because the statement itself is unclear, or because the object which is being referred to cannot be identified.
  ...is sent to the pass marker...
  ...[the head goes into] a folder.

Some of the categories above have links with other categories. Program and domain categories could be referred to as "pure" categories in the sense that they refer to one level of description only. Program-real-world and program-domain are amalgams of pure categories. Program only is a special case: unlike the categories just mentioned, it is used for objects which are inherently linked to the program domain and hence cannot be described at other levels. Finally, indirect reference and unclear statements are not linked to the others in any obvious way. Additionally, the categories, as listed above, can be viewed as occurring on a continuum in terms of degree of ambiguity, starting with specific 'program' references having a low degree of ambiguity, through to 'unclear' statements at the other end of the ambiguity spectrum.
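Although the coding itself is carried out by hand, it may help to sketch how a coded segment can be represented for subsequent tabulation. The layout below is our own illustration (the scheme does not prescribe any particular tooling); the example segment is the program-domain example above, which also appears in the higher-level summary analysed in Section 5:

    # Hypothetical representation of one coded summary segment; the two passes
    # of the scheme supply the information-type code and the object codes.
    segment = {
        "text": "This is processing a list of marks",
        "information_type": "action",                                  # first pass
        "objects": [
            {"referent": "a list of marks", "code": "program-domain"}  # second pass
        ],
    }

    # A full summary is an ordered list of such segments, which makes it easy to
    # count categories per participant or to compare consecutive segments.
    summary = [segment]

Keeping the two codings separate, rather than collapsing them into a single label, reflects the decision described above to restrict the object classification to data objects and so avoid overlap with the information types classification.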
5. Examples of applying the scheme

The program summary analysis scheme described above has been used in an experiment which examined program comprehension in Prolog (Good et al., 1997),¹ and also in a comparative study which looked at simple mock-ups of visual programming languages based on the data flow and control flow paradigms. Program summary data was analysed in conjunction with multiple choice measures of program comprehension. This section briefly describes the latter study, before focusing on the results which pertain to the program summaries themselves. The aim is to show how the analysis scheme can be used, and its particular utility in comparative studies of programming languages. As such, this section does not give a full account of the study, and the results should not be taken to mean that one of the language paradigms used in the study is somehow "better" than the other.

¹ Note that this paper describes only participants' responses to multiple choice questions about their program understanding; the analysis of their program summaries was not reported.

The study in question examined the effect of visual programming language paradigm (data flow or control flow) on program comprehension. The programs used were short, recursive list processing programs. Comprehension was measured using multiple choice questions focusing on each of the five information types, and a short program summary. By gathering quantitative data in the form of the multiple choice responses, and qualitative data in the form of program summaries, our aim was to provide a fuller account of participants' program comprehension, a technique used previously by Pennington (1987a), and Corritore and Wiedenbeck (1991).

Twenty participants took part in the experiment. All were starting the second year of a computer studies degree, and had been taught C++ and COBOL. Participants were randomly divided into two groups: control flow and data flow. The experiment included a number of pre-tests which explored graphical skill and prior programming knowledge: as the results are not reported in this paper, they will not be described.

The experimental setup was implemented in Macromedia Director. The first screen explained the overall structure of the experimental session. The following screens provided an introduction to the visual programming language (data or control flow, depending on the group), presenting the nodes used in the language with a textual description of their function. Participants also saw a sample program, with an explanation of how each node contributed to producing the program output. A practice session allowed participants to answer questions similar to those they would encounter in the experiment.

In the experiment itself, participants were asked to study a short program. The next screen required participants to type a free-form summary of the program into a text box (the program was not visible on this screen). The subsequent five screens each showed a multiple choice comprehension question (presented in random order), corresponding to one of the five information types, along with the program. Both groups saw the same programs, but their form differed depending upon the group to which the participant had been assigned: control flow or data flow. Participants worked through a total of four programs. Note that only the data pertaining to the program summaries will be discussed here: full details of the study, and an in-depth consideration of results, can be found in Good (1999).
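The materials themselves were visual mock-ups rather than textual code. Purely for illustration, a conventional recursive rendering of the kind of program used, the "passes" program discussed next (Fig. 1), might look as follows; it counts the marks in a list that are above a pass level of 65, and the identifier names are our own:

    # Our own textual sketch of the passes program; participants studied visual
    # (control flow or data flow) versions, not code of this form.
    def passes(marks):
        if not marks:                      # is the mark list empty?
            return 0                       # if so, the program exits
        mark, rest = marks[0], marks[1:]   # take the first mark and the rest of the list
        if mark > 65:                      # is the mark greater than the pass level?
            return 1 + passes(rest)        # add 1 to the pass count and recurse
        return passes(rest)                # otherwise simply recurse on the rest

Read against Table 1 below, the individual steps of such a program correspond roughly to the operation, state-low and control statements that make up the detailed summary.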
Fig. 1 shows, on the left, a simple visual control flow version of a program designed to count the number of passing marks in a list, and, on the right, the corresponding data flow version. Tables 1 and 2 show the application of the information types classification to two program summaries of the passes program. The first summary is from a participant who studied the control flow version of the program, while the second is from a participant in the data flow condition.

Fig. 1. The passes program. Left: control flow version. Right: data flow version.

Table 1
Detailed description of the passes program, with statement types

Statement                                              Code
it first checks to see if the mark list is empty       operation
if it is                                               state-low
then the program exits                                 control
if that check is false                                 state-low
it then sets mark to the first number in the list,     operation
and sets the rest of the list to some variable,        operation
it then checks to see if mark is greater than 65       operation
if it is                                               state-low
it then adds 1 to pass                                 operation
and exit                                               control
if it is not                                           state-low
it recurses                                            control

Table 2
Higher level description of the passes program, with statement types

Statement                                    Code
This is processing a list of marks           action
and finding which are greater than 65        action
and adding 1 to the counter                  operation
to give a result of how many passed          data

In the study which compared data flow and control flow visual program representations, differences between the two groups were obvious in the length of the program summaries: 70.91 words for the control flow group, and 48.85 words for the data flow group. However, the program summary analysis scheme allows for a finer-grained analysis of the differences, as shown in Table 3, which gives the mean proportion of information types category statements for the control and data flow groups.

Table 3
Mean proportion of information types statements per group

Category        Control flow (mean %)    Data flow (mean %)
Function                11.62                  20.93
Data                    13.10                  24.68
State-high               6.22                   8.23
Action                   7.10                   9.10
Operation               30.22                  15.67
State-low               12.93                  10.04
Control                 14.10                   5.33
Elaborate                0.49                   3.61
Meta                     0.15                   1.05
Unclear                  4.07                   1.36

Summaries from the data flow group contain higher proportions of function, action, state-high, and data information types than do those from the control flow group. The control flow group's summaries contain higher proportions of operation, state-low, and control statements. In terms of level of granularity, it emerges that the control flow group's summaries contain many more low-level statements than do the data flow group's. A comparison of object description categories yielded the results shown in Table 4.

Table 4
Mean proportion of object description statements per group

Category               Control flow (mean %)    Data flow (mean %)
Program only                   4.07                    3.81
Program                       46.93                   33.84
Program-real-world            11.21                   18.55
Program-domain                 4.52                    5.29
Domain                        22.78                   20.46
Indirect                      10.09                   17.02
Unclear                        0.39                    1.03

Summaries from the data flow group contain higher proportions of program-real-world, program-domain and indirect statements than do those from the control flow group. The control flow group's summaries contain higher proportions of program only, program and domain statements. Finally, data flow subjects made more references to objects which were judged to be unclear than did control flow subjects.
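The group figures in Tables 3 and 4 appear to be means of per-participant proportions: each participant's summary is segmented and coded, the proportion of each category within that summary is computed, and these proportions are then averaged over the participants in the group. A minimal sketch of that computation, using invented participant codings, is:

    # Minimal sketch of how per-group mean proportions (as in Tables 3 and 4)
    # can be computed; the participant codings below are invented examples.
    from collections import Counter

    def proportions(codes):
        # Proportion of each category within one participant's coded summary.
        counts = Counter(codes)
        return {cat: n / len(codes) for cat, n in counts.items()}

    def group_means(participants):
        # Average the per-participant proportions over a group of participants.
        categories = {cat for p in participants for cat in p}
        return {cat: sum(p.get(cat, 0.0) for p in participants) / len(participants)
                for cat in categories}

    group = [proportions(["operation", "state-low", "control", "operation"]),
             proportions(["action", "action", "operation", "data"])]
    print(group_means(group))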
The results from the information types analysis are compelling, and point to differences between the two groups in terms of communicating their program comprehension. This suggests that the information types scheme is of value in capturing fine-grained differences in program summaries. On the other hand, results from applying the object description analysis are less clear-cut, and point to the need for further investigation of the scheme's utility.

6. Discussion and conclusions

This paper presented a coding scheme for analysing program summaries. The scheme aims to provide more authentic measures of program comprehension, by allowing programmers to express their understanding in their own words. To date, the scheme has been applied to textual descriptions of programs, but it is hypothesized that it could be equally useful for verbal protocols gathered during comprehension tasks. The scheme has been used on descriptions of programs written in diverse languages, and seems to capture subtleties in participants' reporting of comprehension resulting from differences in program representation.

In addition, the scheme has been used to look at levels of abstraction in program summaries. By mapping information types onto levels of abstraction (with function, elaborate and meta statements at the highest level of abstraction; action, state-high and data statements at an intermediate level; and operation, state-low and control statements at a low level of abstraction), we found that participants in the control flow condition tended not to change level of abstraction between statements: almost 80% of consecutive statements in their summaries represent cases where the statements may be of differing information type, but are at the same level of abstraction. By contrast, in the data flow group, less than half of their consecutive statements were of this kind, showing that they were as likely to change level of abstraction as to stay within the same level.

Furthermore, by using the scheme in conjunction with an analysis of errors in program summaries, where errors were classified at the highest level as either errors of commission or errors of omission, we discovered that although control flow participants produced long summaries at consistently low levels of abstraction, they tended to exhibit more errors of omission than did data flow participants, who produced shorter summaries at varied levels of abstraction. This research on levels of abstraction and errors in program summaries is reported in full in Good and Oberlander (2002), but is mentioned here in order to demonstrate varying ways in which the scheme may be put to use.

Finally, work is now underway to refine and generalize the scheme by investigating the following issues:

Inter-rater reliability: the results described in this paper come from the application of the scheme by a single researcher. Explicit coding instructions and procedures have been developed, and the scheme is currently being applied by two other researchers in order to measure inter-rater reliability.

Universality: as mentioned above, the scheme has been used with diverse languages; however, two of these languages were not full-scale executable languages. In order to judge the general applicability of the scheme, it is currently being tested on summaries of COBOL programs.

Scalability: the programs described in this paper were short programs, which were studied by novice programmers.
We are now looking at much longer programs (approximately 1200 lines of code), being studied by professional programmers.

Summary modality: the scheme has been applied to written program summaries. The scheme is now being applied to verbal protocol data which was obtained both during and following a comprehension task.

We are confident that this ongoing work will allow us to shed more light on the scope and applicability of the coding scheme, making it useful for other researchers who are interested in describing the structure and content of program comprehension artefacts. At the same time, just as the cognitive dimensions of notations framework has been described as a potentially extensible framework (Blackwell et al., 2001), we feel that the coding scheme described in this paper is also a work in progress, and we welcome comments, suggestions and refinements.

Acknowledgements

This work was carried out at the Human Communication Research Centre, University of Edinburgh, and was supported by the UK Engineering and Physical Sciences Research Council, through Grant GR/L36987. Our grateful thanks to Jon Oberlander and Richard Cox, our collaborators on the project.

References

Blackwell, A.F., Britton, C., Cox, A., Green, T.R.G., Gurr, C.A., Kadoda, G.F., Kutar, M., Loomes, M., Nehaniv, C.L., Petre, M., Roast, C., Roes, C., Wong, A., Young, R.M., 2001. Cognitive dimensions of notations: design tools for cognitive technology. In: Beynon, M., Nehaniv, C.L., Dautenhahn, K. (Eds.), Cognitive Technology 2001. Springer, Berlin, pp. 325–341.

Boehm-Davis, D.A., 1988. Software comprehension. In: Helander, M. (Ed.), Handbook of Human–Computer Interaction. Elsevier, Amsterdam, pp. 107–121.

Brooks, R., 1983. Towards a theory of the comprehension of computer programs. International Journal of Man–Machine Studies 18 (6), 543–554.

Corritore, C.L., Wiedenbeck, S., 1991. What do novices learn during program comprehension? International Journal of Human–Computer Interaction 3, 199–222.

Good, J., 1999. Programming paradigms, information types and graphical representations: empirical investigations of novice program comprehension. Ph.D. Thesis, Department of Artificial Intelligence, University of Edinburgh.

Good, J., Oberlander, J., 2002. Verbal effects of visual programs: information type, structure and error in program summaries. Document Design 3, 120–134.

Good, J., Brna, P., Cox, R., 1997. Novices and program comprehension: does language make a difference? In: Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society, LEA, Stanford, California, p. 936. (A longer version appeared as Technical Report 97/10, Computer Based Learning Unit, University of Leeds.)

Green, T.R.G., 1980. Programming as a cognitive activity. In: Smith, H., Green, T.R.G. (Eds.), Human Interaction with Computers. Academic Press, New York, pp. 271–320.

Green, T.R.G., Sime, M.E., Fitter, M.J., 1980. The problems the programmer faces. Ergonomics 23, 893–907.

Letovsky, S., 1986. Cognitive processes in program comprehension. In: Soloway, E., Iyengar, S. (Eds.), Empirical Studies of Programmers: First Workshop. Ablex Publishing Corporation, New Jersey, pp. 58–79.

O'Brien, M.P., Shaft, T.M., Buckley, J., 2001. An open-source analysis scheme for identifying software comprehension processes. In: Kadoda, G. (Ed.), Proceedings of PPIG-13: 13th Annual Meeting of the Psychology of Programming Interest Group, Bournemouth, UK, pp. 129–146.
Pennington, N., 1987a. Stimulus structures and mental representations in expert comprehension of computer programs. Cognitive Psychology 19, 295–341.

Pennington, N., 1987b. Comprehension strategies in programming. In: Olson, G.M., Sheppard, S., Soloway, E. (Eds.), Empirical Studies of Programmers: Second Workshop. Ablex Publishing Corporation, New Jersey, pp. 100–113.

Ramalingam, V., Wiedenbeck, S., 1997. An empirical study of novice program comprehension in the imperative and object-oriented styles. In: Proceedings of Seventh Workshop on Empirical Studies of Programmers, ACM Press, New York, pp. 124–139.

Romero, P., Cox, R., du Boulay, B., Lutz, R., 2002. Visual attention and representation switching during Java program debugging: a study using the Restricted Focus Viewer. In: Proceedings of the Second International Conference on the Theory and Application of Diagrams (Diagrams 2002), Callaway Gardens, GA, pp. 221–235.

Shneiderman, B., Mayer, R., 1979. Syntactic/semantic interactions in programmer behavior: a model and experimental results. International Journal of Computer and Information Sciences 8 (3), 219–238.

Upchurch, R., 2002. Code reading and program comprehension: annotated bibliography. Retrieved January 6, 2003 from http://www2.umassd.edu/SWPI/ProcessBibliography/bib-codereading2.html.

von Mayrhauser, A., Lang, S., 1999. A coding scheme to support systematic analysis of software comprehension. IEEE Transactions on Software Engineering 25, 526–540.