Research papers
Does observation add to the validity of the long case?
Val Wass¹ & Brian Jolly²

¹ Department of General Practice and Primary Care, Guy's, King's and St Thomas' School of Medicine, London, UK
² Department of Medical Education, The University of Sheffield, Sheffield, UK

Correspondence: Dr Val Wass, Department of General Practice and Primary Care, Guy's, King's and St Thomas' School of Medicine, Weston Education Centre, 10, Cutcombe Road, London, SE5 9RJ, UK. Tel.: 020 78485576; Fax: 020 78485711; E-mail: valerie.wass@kcl.ac.uk

Background A London medical school final MBBS examination for 155 candidates.
Objective To investigate whether observing the student-patient interaction in a history taking (HT) long case adds incremental information to the traditional presentation component.
Design A prospective study of a HT long case which included both examiner observation of the student-patient interview (Part 1) and traditional presentation to different examiners (Part 2). Checklist and global ratings of both parts were compared. Examiners were paired to estimate inter-rater reliability. The students also took a 20 station Objective Structured Clinical Examination (OSCE).
Outcome measures Correlation of (I) examiner ratings for observation and presentation of the HT long case, (II) examiner pair ratings and (III) stepwise regression analysis of scores for the HT long case with OSCE scores.
Results Seventy-five (48.4%) candidates had two examiner pairs marking their case history. Observation and presentation scores correlated poorly (checklist 0.38 and global 0.33). Checklist and global scores for each part correlated at higher levels (observation 0.64 and presentation 0.61). Inter-rater reliability correlations were higher for observation (checklist 0.72 and global 0.71) than for presentation (checklist 0.38 and global 0.60). When HT long case scores were correlated with OSCE scores, using stepwise regression, global presentation scores showed the highest correlation with the OSCE score (0.36) and the global observation score contributed a further 12% to the correlation (0.50).
Conclusion Observation of history taking in a long case appears to measure a useful and distinct component of clinical competence over and above the contribution made by the presentation.
Keywords Observation, *HT; medical history taking, *HT; *clinical competence; education, medical, undergraduate; educational measurement; reproducibility of results; professional patient relations; prospective studies; regression analysis.
Medical Education 2001;35:729-734

Introduction
The search for the ideal assessment of clinical competence for undergraduates, which is both valid and reliable, remains controversial.1 In the traditional long case, candidates are given uninterrupted and unobserved time, usually 30-45 minutes, to interview and examine a patient, selected from the wards or outpatients and untrained for examinations. Candidates then present their findings to the examiners as in an unstructured oral examination. The long case attempts to assess the integrated interaction between the doctor and a 'real' patient. It is arguably a valid and educationally valuable test.2
However, the assessment is lengthy and usually only
one long case is used. The argument against the long
case hinges on reliability. There is now indisputable
evidence that in all measurements of clinical competence, candidates perform variably across tasks.3,4 One
case is insufficient to produce a reliable measure of the
candidate's ability.
The introduction of the objective structured clinical examination (OSCE),5 which uses multiple stations to produce a more reliable test format, enables isolated components of the long case to be examined in a variety of contexts. The use of this OSCE format improves reliability.6 Consequently, many medical schools are turning towards this form of clinical competence testing. However, this move may take place at the expense of validity.1 It may be limited by its reductionist approach to clinical performance, as well as being time-consuming and costly. These last factors cannot be ignored.

Key learning points
• Traditionally students are taught and judged on presentation of case histories. The interaction with the patient is rarely observed.
• When the long case interaction is observed, marks given for the interview correlate poorly with those given by different examiners for the presentation.
• Marks given independently by two examiners show good agreement, particularly for the observed interview. One examiner is probably sufficient.
• Ratings of observation and presentation contributed significantly and independently to the correlation with clinical competence, as judged from the OSCE score.
• Observation and presentation of a history taking long case appear to measure different parameters of clinical competence.

Some UK medical schools have been reluctant to
lose the long case as an assessment tool. Our examination board originally decided, when an OSCE was
introduced in the final examination, that a long case
using real patients should be included. It was agreed
that assessment of the candidate's interaction with
randomly selected real patients was important and that
examiners should be able to cross-examine candidates
in the traditional relatively unstructured way. Two
modifications were proposed. Firstly that the physical
examination of the patient should be excluded from the
long case assessment as there were several clinical
examination stations in the OSCE. Secondly, that the
student's interview with the patient should be observed.
The latter decision was important to us. Attempts
have been made to improve reliability within a long
case. Gleeson7 developed the objective structured long
examination record (OSLER) where the presentation
is structured to increase the observations made by
examiners on the candidate's approach to the case.
Observed long cases have been recommended.8,9
However, surprisingly little has been published on the
psychometrics of the long case in any of its forms.2
The examination format enabled us to investigate the
extent to which the traditional part of the long case, i.e.
the presentation to the examiners, related to assessment
based on observing the interaction with the patient.
Two pairs of different examiners independently rated
the two parts: the observation and the presentation.
Important research questions could be asked: Does
observing the student-patient interaction add incremental information on clinical competence to the
traditional presentation component? How well do
independent ratings for the two components correlate?
Is the inter-rater reliability of the observed component
better than for the unobserved component? Is the
combined rating of both parts better than for the
traditional unobserved one alone?
Method
The clinical examinations took place over three days.
The aim was to test clinical competence. The candidates had also taken three written papers of (a) multiple
choice (b) short answer and (c) essay questions.
The format of the clinical examination was one HT
long case and a twenty station OSCE.
The history taking long case
Candidates were given 16 minutes (Part 1) with a
patient to elicit their full medical history, formulate
diagnostic issues and summarise their thoughts. They
were not required to examine the patient. This was
followed by an eight minute unstructured presentation
of the case (Part 2).
During Part 1, the candidate was silently observed taking the history by one or two examiners. No interaction was allowed. The examiner(s) marked against a comprehensive checklist. The items covered by the checklist are summarised in Table 1. The checklist was scored out of 20 for history taking. In addition, a further 20 marks could be given as global ratings, using a 1-5 Likert scale, for four parameters: medical history, psychosocial history, interviewing skills and overall fluency of the performance.
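The marking scheme for one part of the long case can be sketched as a small data structure. This is an illustration only: the class, field and parameter names are hypothetical, the example marks are made up, and the only facts taken from the text are the checklist maximum of 20 and the four global ratings on a 1-5 Likert scale.

```python
from dataclasses import dataclass

# Hypothetical names; only the maxima and scale limits come from the text.
CHECKLIST_MAX = 20.0
GLOBAL_PARAMETERS = (
    "medical_history",
    "psychosocial_history",
    "interviewing_skills",
    "overall_fluency",
)


@dataclass
class PartScore:
    """One examiner's marks for one part of the HT long case."""

    checklist: float      # checklist mark, 0-20
    global_ratings: dict  # parameter name -> Likert rating, 1-5

    def __post_init__(self):
        if not 0 <= self.checklist <= CHECKLIST_MAX:
            raise ValueError("checklist mark must lie between 0 and 20")
        if set(self.global_ratings) != set(GLOBAL_PARAMETERS):
            raise ValueError("exactly four global ratings are expected")
        for name, rating in self.global_ratings.items():
            if not 1 <= rating <= 5:
                raise ValueError(f"{name}: Likert rating must be 1-5")

    @property
    def global_total(self):
        """Sum of the four Likert ratings (maximum 4 x 5 = 20)."""
        return sum(self.global_ratings.values())

    @property
    def total(self):
        """Checklist plus global marks for this part (maximum 40)."""
        return self.checklist + self.global_total


example = PartScore(
    checklist=14.0,  # made-up mark for illustration
    global_ratings=dict.fromkeys(GLOBAL_PARAMETERS, 4),
)
print(example.global_total, example.total)
```

Keeping checklist and global marks as separate fields mirrors the analysis, which treats the two kinds of rating as distinct variables.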
For Part 2, the candidate moved on to present the case to one or two examiners in the traditional way. The examiners could ask unstructured questions, as in the usual long case format, to clarify the candidate's description and interpretation of the history and the conclusions reached. They marked independently using an identical checklist (20 marks) and global rating scheme (20 marks). Thus the assessment of the candidate's presentation could be compared with the assessment of the actual observed interaction.

Table 1 Observation of the history taking long case: items covered by the checklist

| Heading | Items | Maximum points allocated |
|---|---|---|
| Patient details | Appropriate introduction/age/occupation | 1.0 |
| Present illness | Symptoms: type, description, duration, alleviating, aggravating, precipitating factors, previous episodes, medication, psychological symptoms | 7.0 |
| Past medical history | Past illness, medications, hospitalisation, allergies | 3.0 |
| System review | Micturition, appetite, weight, bowel function, menstruation if appropriate | 4.0 |
| Social history | Family history, housing, alcohol, tobacco, nutritional | 5.0 |
| Total score | | 20.0 |

Patients
Medical, surgical and psychiatric patients were selected
from hospital wards and outpatient departments. They
were untrained for examinations and no attempt was
made to standardise them. Two cases were alternated,
wherever possible, at each of the long case stations.
Candidates were assigned to patients at random.
Examiners
Providing numbers permitted, a team of 12 examiners
per session (four per candidate) was allocated to the
long cases. The examiners were consultants or senior
lecturers, experienced in long case assessment in the
usual 'presentation only' format. Before the examination, they were briefed together for both parts of the
assessment and instructed to mark independently and
not collude or alter scores after discussion. Wherever
possible, they worked in pairs. They were allocated by
rota to examine on either Part 1 or Part 2. In the course
of a session, they had equal experience of examining
both parts. If there were insufficient examiners to make
pairs, the candidates were marked by one examiner
only.
The OSCE examination
The OSCE examination consisted of 20 stations (8 minutes each), each rated by a single examiner. The stations included 8 testing physical examination, 4 assessing communication skills, 3 on practical skills and 5 on data and imaging interpretation.
Statistical analysis
Data analysis was carried out using the Statistical Package for Social Sciences (SPSS). Mean scores on observation and presentation of the long case were calculated for all candidates. For those candidates marked by two examiners on both parts of the long case, mean examiner scores for the observation and the presentation were correlated using the standard (Pearson) product-moment correlation. The inter-rater reliability of the paired examiners was estimated on each part, using intraclass correlations for both checklist and global scores.
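The two correlation statistics named above can be sketched as follows. This is a minimal illustration, not the SPSS procedure used in the study: the function names are hypothetical, the marks are made up, and the one-way random-effects single-rater form of the intraclass correlation is an assumption, since the text does not say which form was computed.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_oneway(ratings):
    """One-way random-effects intraclass correlation for a single rater.

    `ratings` holds one row per candidate, each row containing the marks
    given by k examiners. The choice of this ICC form is an assumption.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    # Between-candidate and within-candidate mean squares.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(ratings, row_means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Made-up checklist marks from two examiners for five candidates.
pairs = [(14, 15), (10, 12), (17, 16), (12, 11), (15, 15)]
examiner_a = [a for a, _ in pairs]
examiner_b = [b for _, b in pairs]

print(round(pearson_r(examiner_a, examiner_b), 3))
print(round(icc_oneway([list(p) for p in pairs]), 3))
```

Unlike the Pearson coefficient, the intraclass correlation is lowered when one examiner marks systematically higher than the other, which is why it is the usual choice for inter-rater agreement.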
The total OSCE scores for the candidates were correlated with the checklist and global scores using stepwise regression. The traditional long case score, i.e. the global presentation score, was entered first followed by the other component scores to investigate any contribution made to explain the variance in OSCE scores.
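The idea of entering one score first and then asking what a second score adds can be sketched with the standard two-predictor formula for R² expressed in correlations. This is a simplified illustration under stated assumptions: the entry order is fixed rather than chosen by SPSS's stepwise entry criteria, the function names are hypothetical, and the scores are made up rather than taken from the study.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r2_two_predictors(r_y1, r_y2, r_12):
    """R squared for y regressed on two predictors, from correlations.

    r_y1, r_y2: correlations of each predictor with the outcome.
    r_12: correlation between the two predictors (must not be +/-1).
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Made-up scores for six candidates (not data from the study).
osce = [250, 262, 258, 270, 255, 265]
presentation_global = [12, 15, 13, 17, 14, 15]
observation_global = [13, 14, 16, 17, 12, 16]

r_y1 = pearson_r(osce, presentation_global)
r_y2 = pearson_r(osce, observation_global)
r_12 = pearson_r(presentation_global, observation_global)

step1 = r_y1 ** 2                            # presentation score alone
step2 = r2_two_predictors(r_y1, r_y2, r_12)  # plus observation score
print(f"step 1 R2 = {step1:.3f}")
print(f"step 2 R2 = {step2:.3f}, R2 change = {step2 - step1:.3f}")
```

The R² change at the second step is the quantity of interest: it is large only when the added score is related to the outcome in a way the first score is not.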
Results
One hundred and fifty-five candidates took the examination. They were marked on the HT long cases by a team of 50 examiners who rotated to cover the three day examination. All candidates were rated by a different examiner on the observation part and the presentation part of one HT long case. Seventy-five (48.4%) candidates were rated by two examiners on both parts. The performance of this 'study group' of 75 candidates was analysed in detail.
Based on the scores of one examiner only (a random selection of one was made from the study group examiner pairs; the non study group had only one examiner), the mean total HT long case score (out of 80) for the study group was 56.8 (SD 8.6) compared to 60.9 (SD 8.3) for the rest of the cohort. The study group scored statistically significantly lower (P < 0.05, 95% confidence interval (CI) -6.73 to -1.36). The mean marks for the observation part were 14.5 (global) and 13.9 (checklist) for the study group and for the remaining cohort 15.7 (global) and 14.5 (checklist). The mean marks for the presentation part were 14.5 (global) and 13.9 (checklist) for the study group and for the remaining cohort 15.5 and 14.8, respectively. The range of SDs was 2.5-3.1. All comparisons on the long case scores between the study group and the remaining cohort were statistically significant (P < 0.05). Mean OSCE scores were 259.7 (±10.0) and 260.2 (±8.4) for the study and remaining candidates, respectively. This difference was not significant. There was also no difference between the study group and the remainder on any other part of the examination.

Table 2 Correlation between mean scores (averaged across two examiners) for observation and presentation for all double marked candidates (n = 75)

| Correlations (inter-rater correlation) | Observation checklist (0.72) | Observation global (0.71) | Presentation checklist (0.38) | Presentation global (0.60) |
|---|---|---|---|---|
| Observation checklist | 1.00 | 0.64 | 0.38 | 0.18 |
| Observation global | | 1.00 | 0.25 | 0.33 |
| Presentation checklist | | | 1.00 | 0.61 |
| Presentation global | | | | 1.00 |

Values in parentheses are inter-rater (intraclass) correlations; all other values are standard Pearson correlations.

Table 3 Results of the stepwise regression of long case components on the OSCE scores and their significance

| Step | Variables entered | R | % variance explained | R² change | Significance of R² change |
|---|---|---|---|---|---|
| 1 | Presentation Global Rating (PGR) | 0.364 | 13.3% | 0.133 | 0.001 |
| 2 | PGR and Observation Global Rating (OGR) | 0.503 | 25.3% | 0.120 | 0.001 |
| 3 | PGR, OGR and Observation Checklist Rating (OCR) | 0.503 | 22.2% | 0.001 | 0.778 |
| 4 | PGR, OGR, OCR and Presentation Checklist Rating | 0.510 | 26.0% | 0.007 | 0.427 |

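The comparison of total HT long case scores between the study group and the remaining cohort (56.8 versus 60.9) can be approximately reproduced from the reported summary statistics. The sketch below is an assumption-laden check, not the authors' calculation: it uses a normal approximation with unpooled variances, the function name is hypothetical, and the size of the remaining cohort (80) is derived as 155 minus 75. It gives an interval of about -6.76 to -1.44, close to the reported -6.73 to -1.36; small differences are expected because the inputs are rounded.

```python
from math import sqrt
from statistics import NormalDist


def mean_difference_ci(m1, sd1, n1, m2, sd2, n2, level=0.95):
    """Normal-approximation confidence interval for the difference m1 - m2."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # unpooled standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    diff = m1 - m2
    return diff - z * se, diff + z * se


# Summary statistics as reported: study group (n = 75) versus the
# remaining 155 - 75 = 80 candidates.
low, high = mean_difference_ci(56.8, 8.6, 75, 60.9, 8.3, 80)
print(f"{low:.2f} to {high:.2f}")
```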
Correlation coefficients for the double marked study group are summarised in Table 2. Intraclass correlations between examiners (inter-examiner reliabilities) were higher for observation of the case (0.72 for checklist and 0.71 for global observed ratings) than for the case presentation (0.38 for checklist and 0.60 for global presentation ratings). Examiner pair scores were then combined to compare ratings across the long case components for the 75 candidates. Within the separate observation and presentation components, checklist and global ratings showed Pearson correlation coefficients of 0.64 (observation) and 0.61 (presentation). Scores given for long case observation correlated only weakly with those given for presentation, using either checklist or global scores (Pearson correlations ranged from 0.18 to 0.38).
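One way to read the weak cross-component correlations is through shared variance, the square of the correlation coefficient. The short sketch below applies that standard conversion to the four observation-versus-presentation coefficients in Table 2; the function name is hypothetical. On this reading the two components share roughly 3% to 14% of their variance.

```python
# The four observation-versus-presentation Pearson correlations in Table 2.
cross_correlations = [0.18, 0.25, 0.33, 0.38]


def shared_variance(r):
    """Proportion of variance two scores share, given their correlation."""
    return r ** 2


for r in cross_correlations:
    print(f"r = {r:.2f} -> shared variance = {shared_variance(r):.3f}")
```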
Table 3 summarises the regression analysis: the percentage of variance in the OSCE score explained by each part (variable) of the HT long case, with the significance of each part (using a t-test, produced as part of the standard SPSS output). The presentation global rating showed the highest individual correlation with the OSCE scores, 0.36 (P ≤ 0.001). Adding the globally rated observation score added significant explanatory power to the regression equation, almost doubling the explained variance from 13% to 25% and giving a correlation of 0.50 (P ≤ 0.001). Further inclusions did not increase the variance explained by these variables.
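The link between the multiple correlation R and the percentage of variance explained is simply R squared. The sketch below works this through for the first two steps of Table 3; the variable names are hypothetical. The results agree with the tabulated 13.3%, 25.3% and R² change of 0.120 to within rounding of the reported R values.

```python
# Multiple correlation R after each of the first two steps (Table 3).
r_by_step = {1: 0.364, 2: 0.503}

# Variance explained is the square of the multiple correlation.
variance_explained = {step: r ** 2 for step, r in r_by_step.items()}
r2_change = variance_explained[2] - variance_explained[1]

for step, r2 in variance_explained.items():
    print(f"step {step}: R = {r_by_step[step]:.3f}, R2 = {r2:.3f}")
print(f"R2 change at step 2 = {r2_change:.3f}")
```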
Discussion
The presentation of a history in the long case has been
accepted as a measure of the candidate's overall ability to
carry out a medical interview, appraise the findings and
decide a course of action. We have shown that examiners
observing and marking the candidate during the (usually
neglected) interaction with the patient rate the candidate
differently. When taking the OSCE as a measure of
clinical competence, direct observation of the HT long
case contributed as much again as the presentation to the
correlation of the HT long case with the OSCE results.
This is, perhaps, not surprising but has never been
demonstrated before in psychometric terms. It challenges the tradition of case presentations alone.
This study examined the history taking process of the
long case only. Real patients were used and there was
no attempt to standardise them. The candidate was not
asked to carry out a clinical examination. This was
partly for reasons of time and also because clinical
examination techniques were tested in the OSCE itself.
However, the candidates were expected to process the
information gathered from the patients to present to the
examiner in the usual way.
The presentation was conducted in the traditional
format, the examiners being free to ask the questions
they perceived relevant in an unstructured way and
then make general overall judgements. The introduction of a checklist to be completed at the presentation
was part of the research design to assess comparability
of the presentation and observation. As reported in the
literature,10 checklist and global ratings correlated
highly. Inter-rater reliability was better for the observed
part of the long case. The low examiner agreement on
the checklist scores for the presentation tends to confirm the unstructured format of the presentation. The
format was therefore as close as we could get to the
complete long case. The crucial issue of observation vs.
presentation would hold with or without the clinical
examination component.
The study cohort performed slightly worse on the
long case, but not on any other part of the examination.
This may have been because the presence of another
examiner made each more cautious in awarding marks.
However the long case score ranges and distributions
were not significantly different for the two groups. The
difference in mean scores would not affect the results of
the study, which were generated entirely from within
the study cohort.
The higher inter-rater reliability for the observed part
of the case, compared to the presentation, throws doubt
on relying on the presentation alone. In the latter, the
examiners may be trying to make inferences about what
had happened in the previous part. This may have
contributed to the lower correlation of their scores. Thus
further work is needed to assess the difference if the same
examiners are used for both parts. However the fact that
the ratings of observation and presentation correlated
poorly and contributed significantly and independently
to the correlation with clinical competence, as judged
from the OSCE score, suggests that they measure
different parameters of clinical competence.
Swanson3 points out that all performance-based
assessments within health professions confirm that
testing examinees in realistic performance-based situations is fraught with difficulty. Complex interactions
between the context (situation/task) and the construct
(skill/knowledge) are being measured. Van der Vleuten
has underlined these problems. He uses the complex
cognitive psychological processes involved in professional clinical expertise4 to explain the variability seen
across different examination components and from
context to context.
The failure to find a strong correlation between performance on the long case and the OSCE stations could be accounted for by one of two factors. The OSCE may measure a different clinical process. Alternatively, it may just reflect insufficient sampling in the use of only one long case. If extrapolations from other examinations are used, the latter may well be true. Swanson11 calculated from oral examination case histories that it would take a full day of testing with 12 to 16 cases to achieve an acceptable level of generalisability for a high stakes examination. Stillman,12 using standardised patients as both subjects and examiners, showed that 13 and 17 cases were needed to achieve reliability scores of 0.68 and 0.88, respectively. Further work is clearly necessary to estimate the sampling errors of long cases, as we can no longer assume that the long case, even if observed, is an equivalent process to a clinical viva.13
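The sampling argument can be illustrated with the Spearman-Brown prophecy formula, a standard psychometric result for how reliability grows as parallel cases are added. This is a sketch under stated assumptions and not necessarily the method used in the cited studies: the function names are hypothetical, and the single-case reliability of 0.25 is an invented value chosen only to show the shape of the relationship.

```python
from math import ceil


def spearman_brown(r_single, n_cases):
    """Predicted reliability of a test made of n_cases parallel cases."""
    return n_cases * r_single / (1 + (n_cases - 1) * r_single)


def cases_needed(r_single, target):
    """Smallest number of parallel cases predicted to reach `target`."""
    n = target * (1 - r_single) / (r_single * (1 - target))
    return ceil(n - 1e-9)  # small tolerance against floating-point noise


# Hypothetical single-case reliability, chosen only for illustration.
r_one_case = 0.25
for n in (1, 4, 8, 12):
    print(n, round(spearman_brown(r_one_case, n), 2))
print(cases_needed(r_one_case, 0.80))
```

Under this assumed starting value a single case is far from adequate and around a dozen cases would be needed to reach 0.80, which is the same order of magnitude as the case numbers quoted above.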
We conclude that the traditional long-case presentation becomes a more valid measure of the candidate's clinical competence if the interaction with the
patient is observed. This conclusion has direct relevance to the use of long cases in clinical examinations
but may also have relevance to traditional case presentations on ward rounds. In the clinical setting, students present histories on ward rounds but are not
generally observed while they interview patients. More
attention to observation during ward clerkships may
therefore be necessary.
Acknowledgements
We thank Dr John Rees and Professor Gwyn Williams
for supporting this research within the finals clinical
examination, Professor Cees van der Vleuten for his
advice on analysing the study and Professor Roger
Jones and Professor David Newble for their comments
on the manuscript.
Contributors
Both authors designed the study together. VW organised the study and collected the data. BJ performed the statistical analysis.
Funding
There was no external funding for this project.
References
1 Meadow R. The structured exam has taken over. BMJ 1998;317:1329.
2 Van der Vleuten CPM. Making the best of the 'long case'. Lancet 1996;347:704-5.
3 Swanson DB, Norman GR, Linn RL. Performance-based assessment: Lessons learnt from the health professions. Educ Res 1995;24(5):5-11.
4 Van der Vleuten CPM. The assessment of professional competence: Developments, research and practical implications. Adv Health Sci Educ 1996;1:41-67.
5 Harden RM, Gleeson FA. ASME Medical Educational Booklet no. 8 Assessment of medical competence using an objective structured clinical examination (OSCE). J Med Educ 1979;13:41-54.
6 Newble DI, Swanson DB. Psychometric characteristics of the objective structured clinical examination. Med Educ 1996;22:325-34.
7 Gleeson F. The effect of immediate feedback on clinical skills using the OSLER. In: AI Rothman, R Cohen, eds. Proc of the Sixth Ottawa Conference of Medical Education. Toronto: University of Toronto Bookstore Custom Publishing. 1994; 412-5.
8 Newble DI. The observed long case in clinical assessment. Med Educ 1994;25:369-73.
9 Price J, Byrne GJA. The direct clinical examination: an alternative method for the assessment of clinical psychiatric skills in undergraduate medical students. Med Educ 1994;28:120-5.
10 Regehr G, MacRae H, Reznick R, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998;73(9):993-7.
11 Swanson DB. A measurement framework for performance based tests. In: IR Hart, RM Harden, eds. Further Developments in Assessing Clinical Competence, pp 13-45. Montreal: Can-Heal; 1987.
12 Stillman P, Regan MB, Swanson D, Case S, McCaha J, Feinblatt J et al. An assessment of the clinical skills of fourth-year students at four New England medical schools. Acad Med 1990;65:320-6.
13 Hardy KJ, Demos LL, McNeil JJ. Undergraduate surgical examinations: an appraisal of the clinical orals. Med Educ 1998;32:582-9.
Received 31 January 2001; accepted for publication 22 February 2001
© Blackwell Science Ltd, Medical Education 2001;35:729-734