Advances in Health Sciences Education 9: 47–60, 2004.
© 2004 Kluwer Academic Publishers. Printed in the Netherlands.
Assessing the Written Communication Skills of
Medical School Graduates
JOHN R. BOULET∗, THOMAS A. REBBECCHI, ELIZABETH C. DENTON,
DANETTE W. MCKINLEY and GERALD P. WHELAN
Research and Evaluation, Educational Commission for Foreign Medical Graduates (ECFMG),
3624 Market Street, Philadelphia, PA 19104-2685, USA (∗ author for correspondence, e-mail:
jboulet@ecfmg.org)
Abstract. The ECFMG Clinical Skills Assessment (CSA) was developed to evaluate whether
graduates of international medical schools (IMGs) are ready to enter graduate training programs
in the United States. The patient note (PN) exercise, conducted after a 15-minute interview with
a standardized patient (SP), is specifically used to assess a candidate’s ability to summarize and
synthesize the data collected. On a yearly basis, approximately 75,000 patient notes are reviewed and
scored by physician raters. Recent changes to the PN scoring rubric, combined with enhancements to
quality assurance procedures, mandate that additional evidence be provided to support the intended
use of PN scores. The purpose of this study was to further investigate the psychometric adequacy of
PN scores. Generalizability analyses suggest that while variability in PN ratings can be attributed to
the choice of rater, candidate scores are reproducible over the 10-encounter CSA. The relationship of
PN scores with other related ability measures and select candidate characteristics provides additional
evidence to support the validity of the written exercise.
Key words: certification tests, reliability, standardized patient, validity, written communication
Introduction
The ability of physicians to communicate with patients and other health professionals, both verbally and in writing, is a fundamental skill. Recently, the role
of doctor-patient communication skills in patient care has been highlighted in
the literature, emphasized in training programs, and studied extensively (Rollnick
et al., 2002; Shapiro, 1999; van Dalen et al., 2002). However, while the verbal
communication between physician and patient has been linked to treatment compliance, satisfaction, and medical outcomes (Williams et al., 1998), comparatively
little research has been focused on written communication and its role in the
management of patients. Charting errors, including the provision of inaccurate
information and documentation inconsistencies, can not only have a negative
impact on health care outcomes, but can also be an important factor in any malpractice litigation. Furthermore, illegible medical records can lead to added financial
costs and inferior patient care (Weber, 2002). Since progress notes are used to
document medical history and physical examination, and eventually to implement treatment regimes, it is imperative that the recorded information is readable,
comprehensive, and intelligible.
While there are many ways for health professionals, including physicians, to
document the information gathered during patient or client encounters, use of
the SOAP (subjective, objective, assessment, plan) format is presently common
(Grace-Farfaglia and Rosow, 1995; Larimore and Jordan, 1995; Sleszynski et al.,
1999). Within this framework, the physicians document what the patient told them
(subjective: chief complaint, history of present illness, past medical history), what
they saw in the examination (objective: significant positive and negative physical
findings), the assessment (problem list, diagnoses), and the plan (treatment, further
diagnostic tests). The information can be written on paper, transcribed through
dictation, or machine-entered via keyboard or hand-held device. The strength of
the SOAP format is its ease of use and its general adoption and acceptance by
a wide variety of health care practitioners (e.g., clinical nutritionists, dietitians,
chiropractors, occupational therapists, allopathic/osteopathic physicians).
Although the documentation of clinical activities is important, especially
for physicians, the proper assessment of charting skills can be difficult, time-consuming, and expensive. Expert review of charts is problematic in that the
“true” nature of the patient complaint is often unknown, making the evaluation
of accuracy highly subjective. Unfortunately, without secondary patient interviews, the correctness and truthfulness of the physician’s documentation cannot
be assessed with great precision (Cradock et al., 2001). Standardized assessments,
using either real or simulated patients, can, however, overcome this problem. Here,
the patient complaint is fixed, and the physician must document a known medical
history and associated physical symptoms. As a result, inaccurate or spurious information can be easily spotted. Furthermore, when more than one individual is being
assessed, meaningful score comparisons can be made. These benefits, combined
with the overall importance of written communication skills in medical practice,
suggest that the development and validation of standardized assessment methods are
essential.
A number of studies have been undertaken to develop methods to evaluate the
documentation skills of medical students, residents and physicians (Boulet et al.,
1998b; Crossley et al., 2001; Howell et al., 2000). Many of these assessments have
used some form of post-encounter exercise that is part of a larger standardized
patient (SP) assessment. Typically, the physician gathers data from the patient
(e.g., medical history, physical examination results) and summarizes these findings, often including a differential diagnosis, on a structured assessment form.
Scoring these exercises, while somewhat dependent on the assessment form, is
generally straightforward. Where the patient condition is known, as would be the
case for an examination that utilizes standardized patients, one of the simplest
methods is to review patient notes for content. For example, based on SOAP
classification, experts can easily determine a list of keywords/phrases that, given
the patient history and presenting complaint, would be expected to be found on
a well-documented note. Several investigators have used some form of analytic
method for measuring achievement (e.g., keyword matching) and found that not
only physicians but also raters with other medical backgrounds (e.g., nurses, billing
clerks) could provide scores that were sufficiently reproducible (Friedman et al.,
1997). Unfortunately, while analytic scoring can produce reliable scores, it is difficult, if not impossible, to assess important written communication skills such as
organization and clarity using this method. In addition, scores are typically based
solely on the content of the note, ignoring superfluous or erroneous information and
potentially egregious actions. From a score validity perspective, the use of holistic
scoring is often preferred (Boulet et al., 2000; Slater and Boulet, 2001). Here,
trained experts read the summaries and provide ratings based not only on medical
content but also on traits such as interpretability, logic and thought processes.
Provided that the scoring rubrics are well-defined, and the raters have sufficient
knowledge of the content domain, the scores based on holistic rating can be both
reliable and valid (Regehr et al., 1998).
Since the pioneering work of Harden and Gleeson (1979), the use of Objective
Structured Clinical Examinations (OSCEs) and standardized patient assessments
has increased dramatically. These types of examinations are now commonly
used by licensing and certification bodies to assess the professional competencies of examinees. The Educational Commission for Foreign Medical Graduates
(ECFMG) administers a clinical skills assessment (CSA) as part of the certification requirements for graduates of international medical schools. Certification,
which includes the assessment of clinical skills, is meant to ensure that graduates
of medical schools outside of the United States and Canada are ready to enter
graduate medical education programs in the United States. The Medical Council
of Canada (MCC) also evaluates clinical skills as part of the licensure requirements
for physicians wishing to practice in Canada (Medical Council of Canada, 2002).
Finally, the National Board of Medical Examiners (NBME) has administered SP
cases to thousands of examinees over the past decade in preparation for inclusion of
a clinical skills component into the United States Medical Licensing Examination
(Friedman et al., 1999; Hallock, 2002). While the psychometric adequacy of these
types of assessments has been studied extensively, validation is an ongoing process,
requiring regular accumulation of evidence to support the use and interpretation of
test scores. This is especially true if adjustments or modifications are made to the
content, administration or scoring of any embedded exercises.
Purpose
The post-encounter patient note (PN) exercise has been a fundamental component
of the ECFMG Clinical Skills Assessment since its inception in 1998. This exercise allows candidates to summarize and interpret the data collected in a clinical
encounter with a standardized patient. Although analytic scoring was attempted in
earlier field trials, the initial implementation of CSA utilized holistic ratings by
trained physician raters. Originally the ratings were provided on a 1 to 4 scale,
ranging from unacceptable to superior. While valid and moderately reproducible
scores could be obtained using this rubric, the analysis of quality assurance (QA)
data (inter-, intra-rater statistical summaries) suggested that potential modifications to the scoring rubric, rater training, and feedback mechanisms could provide
for more reliable scores. Furthermore, based on focus group sessions, the raters
indicated that, at a broad level, it was easier to differentiate performance along
three dimensions (i.e., unacceptable, superior, or somewhere in between) than four.
Based on this feedback, and a number of pilot studies, the patient note scoring
rubric was modified. All patient note raters (PNRs) were retrained and the new
rubric was implemented for live scoring at the end of March 2001. The specific
purpose of this investigation was to examine the psychometric properties of patient
note scores obtained using the modified scoring anchors.
Methods
MEASUREMENT TOOLS
Clinical skills assessment (CSA). The Educational Commission for Foreign
Medical Graduates is responsible for the certification of graduates of international
medical schools (IMGs), those individuals who attended medical school outside
the United States or Canada, who wish to enter graduate training programs in the
United States. The CSA, unlike the other certification requirements, is performance
based. Candidates must demonstrate their clinical skills in a high-fidelity simulated
environment. This is accomplished by training people to portray patients with
common clinical conditions. These individuals, known as standardized patients
(SPs), can consistently and accurately model the complaints, histories, and mannerisms of real patients. Most SPs employed by ECFMG are trained to portray more
than one patient. Similarly, most cases used in the examination can be performed
by more than one SP.
Candidates must evaluate a series of 10 SPs, interacting as they would with
actual patients. They are instructed to gather relevant patient data, perform focused
physical examinations as needed, and summarize the data in the form of a clinical note. Scores for data gathering (DG) are based on case-specific checklists
completed by the SPs following the 15-minute encounter. These checklists include
questions that should be asked and physical examination maneuvers that should
be performed. Following the patient encounter the candidates are required to
summarize their findings in a patient note (PN). The DG and PN component scores
are combined to form an Integrated Clinical Encounter (ICE) composite score.
The SPs also evaluate the communication skills of the candidates. Interpersonal skills (IPS) are evaluated along four dimensions: interviewing and collecting
information, counseling and delivering information, personal manner, and rapport.
Spoken English proficiency (ENG) is also evaluated in every encounter. These
evaluations are combined to form a doctor-patient communication composite
(COM).
Patient note scoring rubric. Patient note raters are specifically trained to provide
scores for individual clinical encounters (cases). Typically, each rater is qualified
to rate several cases. To generate note scores, each PN is rated by a single rater
who is individually trained for that case. Although multiple ratings of individual
notes are often done for quality assurance reasons, only a single rating of each
note is used to produce a total PN score. The notes are distributed so that a
minimum of three different raters provides ratings for each candidate’s set of notes.
On average, between 8 and 9 different raters provide scores for each individual
candidate (mean = 8.6, min = 6, max = 10). Candidates typically produce 10 notes,
one for each encounter, over the course of the assessment.
Three performance levels (unacceptable, acceptable, and superior) are defined
on the scoring rubric. In addition, these performance levels are described for
each of the patient note components (medical history and physical examination,
differential diagnosis, diagnostic work-up) (see Appendix A). To use this modified
scoring rubric the rater must first determine, based on the written descriptions,
whether the note is unacceptable, acceptable or superior. This judgement is based
on a holistic overview of the adequacy of the candidate’s written information in
each section of the note. Once this initial performance level is determined the rater
must then settle on an appropriate gradation within the preliminary category. For
example, if a note is first judged to be acceptable, the rater can assign a score from 4
(just above unacceptable) to 6 (closer to superior). As a result, patient note scores
can range from 1 (clearly unacceptable) to 9 (clearly superior).
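The two-stage rating described above (choose a performance level, then a gradation within it) can be sketched as a simple mapping onto the 1 to 9 scale. This is an illustrative sketch only: the function and dictionary names are hypothetical and are not part of any ECFMG scoring software; the level-to-score ranges follow the text and Appendix A.

```python
# Hypothetical helper names; only the 1-9 score ranges come from the rubric.
# Offset added to a within-level gradation (1, 2, or 3) to obtain the score.
LEVEL_OFFSET = {"unacceptable": 0, "acceptable": 3, "superior": 6}


def patient_note_score(level: str, gradation: int) -> int:
    """Combine a performance level with a gradation (1-3) into a 1-9 score."""
    if level not in LEVEL_OFFSET:
        raise ValueError(f"unknown performance level: {level!r}")
    if gradation not in (1, 2, 3):
        raise ValueError("gradation must be 1, 2, or 3")
    return LEVEL_OFFSET[level] + gradation


def performance_level(score: int) -> str:
    """Recover the performance level implied by a 1-9 score."""
    if not 1 <= score <= 9:
        raise ValueError("score must be between 1 and 9")
    for level, offset in LEVEL_OFFSET.items():
        if offset < score <= offset + 3:
            return level
    raise AssertionError("unreachable for scores 1-9")


if __name__ == "__main__":
    # A note judged acceptable, just above unacceptable, receives a 4.
    print(patient_note_score("acceptable", 1))
```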
Post-CSA questionnaire. All candidates are asked to provide information regarding
the logistics of exam administration, prior medical training, and medical school
characteristics. Over 98% of the candidates complete the questionnaire. The
candidates are encouraged, but not required, to identify themselves. Over 95% of
the candidates supply their unique identifier, allowing CSA performance data to be
linked to individual survey responses.
Criterion variables. Performance on the United States Medical Licensing Examination (USMLE) Steps 1 and 2, and the Test of English as a Foreign Language
(TOEFL) were used as criterion measures. Step 1 (basic science) assesses whether
a candidate can understand and apply important concepts of the sciences basic to
the practice of medicine. Step 2 (clinical science) assesses whether a candidate
can apply medical knowledge and understanding of clinical science necessary for
patient care. The TOEFL measures the ability of nonnative speakers of English to
use and understand North American English.
SAMPLE
Candidates. There were 7,375 CSA administrations between April 2001 and March
2002. Within this cohort, there were 6,225 first-time takers. To eliminate potential
confounding due to repeat testing, only first-time takers were used in the analyses.
The majority of the candidates were male (58.3%). Based on citizenship at medical
school, the largest candidate cohorts were United States (23.5%), India (20.6%),
and Pakistan (6.7%). Asians constituted 43.6% of the first-time administrations,
followed by Whites (33.2%). Almost one quarter of the sample (n = 1,458) were US
citizens who attended medical schools outside of the United States, Canada and
Puerto Rico. For the majority of candidates (76.0%), English was not their native
language. The average age of the study population was 33.3 years (min = 22.5,
max = 67.0, SD = 5.3).
Patient notes. For first-time CSA administrations there were 61,497 PN ratings.1
Standardized patients. Approximately 56% of the encounters were with female
standardized patients. The distribution of SP encounters by self-declared ethnicity
was as follows: Asian (0.5%), Black (52.2%), Hispanic (9.5%), Indian (0.5%) and
White (37.4%).
Cases. There were 101 different cases used in the one-year time frame. The cases
were fairly evenly distributed based on categorizations of primary reason for visit
(Abdominal, 19.9%; Chest, 19.7%; Constitutional, 20.0%; Neurological, 19.7%;
Miscellaneous, 20.7%), acuity (Acute, 31.6%; Sub-Acute, 36.8%; Chronic, 31.6%),
and patient age category (18–44, 36.6%; 45–64, 39.7%; 65+, 23.8%).
Patient note raters. In the time period studied, 43 PN raters provided scores. All
patient note raters (PNRs) are physicians. They must be licensed to practice medicine, certified by a specialty board of the American Board of Medical Specialties
(ABMS) or the American Osteopathic Association (AOA), and have experience
in medical education. The majority of the raters were male (60%). Just over 50%
of the raters indicated that internal medicine was their primary specialty. The next
largest cohorts were emergency medicine (13%) and family medicine (13%).
ANALYSIS
Generalizability theory (Brennan, 2001) was used to estimate variance components
and provide measures of the reproducibility of PN scores. The design crossed persons
with raters nested in cases (p × [raters: cases]). Here, notes for each examinee are
rated individually by a set of up to 10 different raters, with each rater specifically
assigned to a particular case or cases. Although each individual’s score is obtained
from multiple raters, each note is only rated once for operational scoring purposes.
Pearson correlations were used to quantify the strength of the relationships between
PN scores and both internal (other CSA component scores) and external performance measures. Mean PN scores, stratified by various candidate background
variables, were calculated. Effect sizes (Prentice and Miller, 1998) are provided
as a measure of the degree and meaningfulness of any group-based differences.
The SAS system software was used for all analyses (SAS Institute, 1989).
Table I. Estimated variance components for patient note ratings (all first-time takers)
Source                   Estimate      % of total variance
Person                   0.24          16.1
Case                     0.05          3.4
Rater(Case)              0.21          14.3
Error                    0.97          66.2
Generalizability (ρ²)    0.71 (0.31)
Dependability (Φ)        0.65 (0.36)
Values in parentheses are standard errors of measurement.
Results
DESCRIPTIVE STATISTICS
The mean PN rating, over the 61,497 notes, was 5.4 (SD = 1.2, min = 1, max = 9).
The mean candidate PN score, averaged over all encounters taken, was 5.4 (SD =
0.6, min = 3.0, max = 7.4).
GENERALIZABILITY ANALYSES
Variance components for patient note ratings are presented in Table I. The relatively small estimate of variance attributable to the cases indicates that individual
tasks (i.e., summarizing the information from the patient encounter) tend to be of
relatively equal difficulty. The non-zero Rater(Case) [rater nested in case] variance
component suggests that case mean scores can fluctuate as a function of which
PNR provides the ratings. The generalizability coefficient (ρ²), which does not take
into account variations in case difficulty or rater stringency, was 0.71, indicating a
moderate consistency in PN ratings over cases. The dependability coefficient was
0.65 (SEM = 0.36).
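As a rough check on how these coefficients relate to the variance components, the sketch below applies the standard generalizability-theory formulas for a design with one rating per note and 10 notes per candidate. It is an illustration under stated assumptions, not the authors' SAS analysis: the function name is hypothetical, the residual "Error" component is assumed to be the only source of relative error, and the inputs are the rounded estimates from Table I, so the results match the reported values only approximately (the dependability coefficient comes out near 0.66 rather than 0.65).

```python
import math


def g_coefficients(var_person, var_case, var_rater_in_case, var_error, n_cases):
    """Generalizability and dependability for one rating per case.

    Assumes the residual term carries all person-by-case and
    person-by-rater interaction variance (they are confounded here).
    """
    relative_error = var_error / n_cases
    absolute_error = (var_case + var_rater_in_case + var_error) / n_cases
    return {
        "rho2": var_person / (var_person + relative_error),
        "phi": var_person / (var_person + absolute_error),
        "sem_relative": math.sqrt(relative_error),
        "sem_absolute": math.sqrt(absolute_error),
    }


if __name__ == "__main__":
    # Rounded variance components from Table I, 10 encounters per candidate.
    for name, value in g_coefficients(0.24, 0.05, 0.21, 0.97, n_cases=10).items():
        print(f"{name}: {value:.2f}")

    # Projection for longer assessments (more notes per candidate).
    for n in (10, 15, 20):
        projected = g_coefficients(0.24, 0.05, 0.21, 0.97, n_cases=n)
        print(n, round(projected["rho2"], 2), round(projected["phi"], 2))
```

Because every error term is divided by the number of cases, adding encounters raises both coefficients in this sketch. The effect of adding raters per note cannot be projected from Table I alone, since the rater-related interaction variance is not reported separately.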
CORRELATIONAL STUDIES
Investigating the associations between PN scores and external variables can provide information regarding the degree to which these relationships are consistent
with the construct underlying the proposed test interpretations. Similar information
can be obtained by investigating the internal structure of the test scores. Correlations of PN scores with external and internal (other CSA components) variables
are presented in Table II.
Table II. Relationship of PN scores with external and internal variables
Variable                                              Correlation
External variables
  USMLE Step 1 (basic science)                        0.22
  USMLE Step 1 (# attempts)                           –0.15
  USMLE Step 2 (clinical science)                     0.28
  USMLE Step 2 (# attempts)                           –0.19
  Test of English as a Foreign Language (TOEFL)       0.32
Internal variables
  Spoken English proficiency (ENG)                    0.29
  Interpersonal Skills (IPS)                          0.39
  Doctor-Patient Communication (COM)                  0.40
  Data Gathering (DG)                                 0.51
The correlations provided in Table II indicate that the ability to interpret and
synthesize the data gathered in the patient encounter is related to overall medical
ability. These correlations are not corrected for attenuation and therefore underestimate what the relationships would be if measurement error were not present.
Moderate correlations between the PN score and USMLE Step 1 (basic science)
and Step 2 (clinical science) scores suggest that, while somewhat different abilities
are being measured, candidates with stronger basic and clinical science backgrounds are better able to summarize the information gathered in the patient
encounter. The negative correlation with USMLE Step attempts also supports this
premise in that individuals who fail basic science or clinical science, and have
to repeat these tests, would tend to be of lower ability. The moderate correlation
between TOEFL scores and PN scores was expected in that TOEFL contains a
mandatory essay and, therefore, there should be some overlap in the constructs
being measured. Here, lower writing abilities are associated with lower PN scores.
The correlations between the other CSA component scores and the PN score
were also informative. There was 26% shared variance between data gathering
and PN scores. If relevant information cannot be solicited from the patient, the
task of synthesizing and interpreting the data would definitely be more difficult,
and average scores would be expected to be lower. Doctor-patient communication
ratings, including the spoken English and interpersonal elements, were moderately
related to PN scores.
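Two of the quantities mentioned above can be illustrated with a short sketch: the shared variance implied by a correlation (its square) and the classical correction for attenuation. The function names are hypothetical. The disattenuated value shown corrects only for the unreliability of the PN score (using the generalizability coefficient of 0.71) and assumes, purely for illustration, that the data gathering score is perfectly reliable, because its reliability is not reported in this article.

```python
import math


def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures with correlation r."""
    return r * r


def disattenuate(r_xy: float, rel_x: float, rel_y: float = 1.0) -> float:
    """Classical correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


if __name__ == "__main__":
    r_pn_dg = 0.51          # PN with data gathering, from Table II
    pn_reliability = 0.71   # generalizability coefficient from Table I

    # About 26% of the variance is shared between DG and PN scores.
    print(f"shared variance: {shared_variance(r_pn_dg):.2f}")

    # Illustrative only: the DG reliability is assumed to be 1.0 here.
    print(f"corrected for PN unreliability: {disattenuate(r_pn_dg, pn_reliability):.2f}")
```

Supplying a realistic reliability for the second measure would raise the corrected correlation further, which is why the uncorrected values in Table II are described as underestimates.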
Table III. Comparison of mean PN scores for select candidate cohorts
Cohort                                Yes       No        Effect size
English as a native language          5.53      5.34      0.32
US citizen at medical school          5.44      5.37      0.12
Pass CSA                              5.51      4.72      1.50
Table IV. Comparison of candidate PN scores by background variables
Background variable                                     Yes       No        Effect size
Patient note time was sufficient                        5.41      5.34      0.12
Introductory medical charting course                    5.39      5.39      0.00
Previous experience with standardized patients          5.54      5.36      0.31
Language of instruction at medical school (English)     5.48      5.23      0.42
Specialized clinical skills course                      5.36      5.40      –0.07
The comparison of PN scores by group membership variables can also provide
information to support the interpretation of test scores. Mean performance on the
PN exercise by select candidate cohorts is presented in Table III. As expected,
candidates with English as a native language obtained significantly higher PN
scores than those who did not. Since English speaking and writing abilities are
related, and organization and quality of information are two of the traits measured
via the PN exercise, one would anticipate that native language speakers would be
better able to summarize the data gathered in the encounter. Although candidates
cannot fail CSA based solely on PN performance, those individuals who did not
meet standards obtained significantly lower scores.
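The effect sizes in Table III can be read as standardized mean differences. The article does not state the exact formula or standardizer used, so the sketch below is an assumption: it divides the difference in group means by the overall candidate-level standard deviation of 0.6 reported in the descriptive statistics. Under that assumption the first two rows of Table III are reproduced closely; the Pass CSA row is not (about 1.3 rather than 1.50), which suggests the authors used a different, presumably pooled, standard deviation.

```python
def standardized_mean_difference(mean_yes: float, mean_no: float, sd: float) -> float:
    """Difference between two group means expressed in standard deviation units."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean_yes - mean_no) / sd


if __name__ == "__main__":
    assumed_sd = 0.6  # overall SD of candidate PN scores; an assumed standardizer

    # Group means taken from Table III.
    rows = {
        "English as a native language": (5.53, 5.34),
        "US citizen at medical school": (5.44, 5.37),
        "Pass CSA": (5.51, 4.72),
    }
    for label, (yes, no) in rows.items():
        print(label, round(standardized_mean_difference(yes, no, assumed_sd), 2))
```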
POST-CSA QUESTIONNAIRE
The candidates were specifically asked whether the 10-minute time limit was
adequate for the patient note exercise. Over 76% of the candidates indicated that
the time was sufficient. A comparison of mean PN scores by relevant background
variables is presented in Table IV. Candidates who had had previous experience
with standardized patients or who had attended medical schools where the language
of instruction was English scored significantly higher on the PN exercise. Interestingly, candidates who claimed to have taken a specialized clinical skills course
performed less well than those who did not.
Discussion
All clinical skills evaluations, regardless of purpose, structure, content or scoring,
must be designed so that reliable and valid assessment scores and/or decisions can
be made. One part of the validation process involves gathering examinee data and
investigating whether scoring patterns and data associations make sense. Although
numerous studies have been done to investigate the psychometric properties of
CSA component scores (Ayers and Boulet, 2001; Boulet et al., 1998a, 2001, 2002),
recent changes to the PN scoring rubric and rater training regimen mandate that
additional data be collected and analyzed. While changes to the scoring rubric
were primarily based on rater feedback, and were tested in pilot studies, there is
still a need to collect new evidence to support the use of the PN exercise and to
ensure that potential sources of irrelevant variance are not compromising test score
interpretations.
The generalizability analysis indicated that moderately reproducible PN scores
could be obtained using the existing training methods and 9-point rating rubric.
The values obtained were comparable to those reported elsewhere (Boulet et al.,
1998), and acceptable for this type of performance assessment. The non-zero
rater(case) component suggests that the choice of rater (for a given case) may
have some impact on candidate scores. That is, the relative stringency, or leniency,
of raters for a particular case is not equivalent. Nevertheless, the PNs (n = 10)
for a given candidate are distributed to an average of more than eight different
raters. Therefore, any so-called “hawk” and “dove” effects would tend to cancel
out. It should also be noted that CSA pass/fail decisions are not based solely on
the PN, but a composite that also includes the data gathering scores. As a result,
measurement error will be further minimized. The reproducibility of the PN scores
could be enhanced through changes in the rating design (e.g., multiple ratings
of each note), or score adjustments based on some form of rating scale analysis
(Australian Council for Educational Research, 1998). However, an inspection of
the variance components suggests that, consistent with previous research (Swanson
and Norcini, 1989; van der Vleuten et al., 1991), reproducibility gains would best
be achieved by increasing the number of tasks (PNs), not the number of raters
per given task. Currently, ECFMG is addressing potential measurement concerns
through enhanced quality assurance procedures, including inter- and intra-rater
score comparisons, additional rater training, benchmark note comparisons, and
regular rater feedback.
Obtaining evidence for the validity of PN scores can take many forms. Analyses
of the relationship of PN scores to variables external to the assessment, combined
with analyses exploring the internal structure of the test, can both be used to
make judgments concerning validity (American Educational Research Association, American Psychological Association, & National Council on Measurement
in Education, 1999). The moderate correlations between PN scores and both
basic science and clinical science examination scores were expected. For the
PN exercise, candidates are required to document the pertinent positive and
negative findings, provide a list of plausible differential diagnoses, and generate
an initial diagnostic management plan. Applying the knowledge and understanding
of key concepts of basic biomedical science (USMLE Step 1) and clinical science
considered essential for the provision of care under supervision (USMLE Step 2)
should be related, at least positively, to the aforementioned PN tasks. Likewise,
since TOEFL requires listening, structure, writing, and reading skills, one would
expect that candidates with higher scores would be better able to record, interpret
and organize information for the PN exercise. The moderate correlations between
PN scores and various other CSA component scores suggest some overlap in, or
interrelationships amongst, the traits being measured. Given that it would be very
difficult to write a proper note if there were flaws in the data gathering process, the
relatively strong correlation between PN and data gathering scores was predictable.
Overall, both the external and internal test score relationships provide additional
evidence for the validity of the PN scores.
Responses to the post-CSA questionnaire provided additional data to support
the validity of PN ratings. Candidates who perceived the time limits to be
adequate performed significantly better on the task. It would be expected that
these individuals would be more proficient at synthesizing the information from the
encounter and summarizing it on the note. Unfortunately, the perceived adequacy
of the time limit may simply reflect comfort with the task, not necessarily actual
time pressures. Therefore, additional studies are needed to investigate the relationship between actual time spent composing the clinical summaries and the resultant
scores. Candidates who completed their medical training in English outperformed
candidates who had been instructed in a language other than English. Here, it would
be expected that the writing skills of those candidates who were taught in English
would be better than those who were not educated in English, resulting in higher
PN scores. Finally, those candidates who had had previous SP experience scored
significantly higher on the PN exercise. Familiarity with the task of interviewing
simulated patients would be expected to diminish anxiety and subsequently lead
to improved data gathering skills. If this added information finds its way onto
the patient note, scores would be expected to increase. Interestingly, candidates
who claimed to have taken a specialized clinical skills training course actually
performed less well on the PN exercise than those who did not. While there are
a number of possible explanations for this seemingly contradictory result, it is
likely that those candidates who chose to take a course were initially of much
lower ability than those who did not. In fact, candidates who took a specialized
course had significantly lower USMLE Step and TOEFL scores, and significantly
more USMLE Step attempts. Overall, the direction and moderate strength of the
relationships between candidate PN scores and variables related to prior training
provide additional evidence to support the validity of the written exercise.
The PN exercise is an important part of the assessment of the readiness of
medical school graduates to enter graduate training programs. The ability to interpret and synthesize medical information, and present this data in a reasoned and
rational manner, is certainly a skill that is required for effective medical practice. As such, it is important that medical school graduates be assessed in this
domain. Furthermore, given that skill deficiencies in this area can have profound
consequences, both for the candidate in a high-stakes testing situation and for the
patient in real-life medical encounters, it is imperative that any resultant proficiency measures are reproducible and valid. Although some improvements to the
assessment process could be entertained, the ECFMG CSA PN exercise currently
provides an effective means of assessing the written communication skills of
medical school graduates.
Appendix A
Patient Note Scoring Rubric
Patient note scoring anchors, by note component: Unacceptable (scores 1, 2, 3), Acceptable (scores 4, 5, 6), Superior (scores 7, 8, 9)
Medical history and physical examination
Unacceptable (1, 2, 3)
• Note is disorganized with elements inappropriately interspersed, information is ambiguous, legibility is poor
• Information is inaccurate, spurious or rote, such that a medical reader would have difficulty grasping the nature of the case
• Significant positives or negatives are omitted or inappropriately detailed findings obscure key elements
Acceptable (4, 5, 6)
• Organization of note shows generally ordered approach to case, with little ambiguity, and acceptable legibility
• Contains some inaccurate, spurious, or rote information, but a medical reader would be able to understand the nature of the case. Provides enough information for adequate patient treatment.
• Significant positives or negatives are included but may be inappropriately detailed
Superior (7, 8, 9)
• Organization of note reflects ordered approach to case, information is clear and legible
• Information is accurate and relevant and presented in a way that makes clear to a medical reader the nature of the case
• Significant positive and negative elements of history and physical are recognized as such by inclusion in note
Differential diagnosis
Unacceptable (1, 2, 3)
• Differential diagnosis is inconsistent with findings or supporting findings are lacking
Acceptable (4, 5, 6)
• Differential diagnosis generally consistent with findings, but some supporting findings may be lacking
Superior (7, 8, 9)
• Findings are correctly interpreted as reflected in a list of reasonable differential diagnoses
Diagnostic work-up
Unacceptable (1, 2, 3)
• Proposed work-up is inconsistent with diagnoses being entertained or is rote (“shotgunning”) or is oblivious to any cost containment
Acceptable (4, 5, 6)
• Proposed work-up is consistent with diagnoses, but may be rote or oblivious to cost containment
Superior (7, 8, 9)
• Proposed work-up is consistent with diagnoses being entertained, reasonably specific to case and reflects awareness of cost containment
Descriptors of Traits Measured in the Patient Note
Organization: Clear portrayal of patient problem; order of assessment and plan are
reasonable.
Quality of information: Information presented with appropriate detail and includes
significant positive and negative elements of history and physical.
Interpretation of data: Correct interpretation of data gathered is reflected in
reasonable differential diagnoses.
Egregious/dangerous actions: Avoids diagnostic management plans that could
result in harm or expensive, non-indicated diagnostic tests.
Legibility: Easily read with little effort required.
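The three scoring anchors in the rubric can be thought of as bands on the nine-point rating scale. The sketch below shows one way a rating might be mapped to its anchor label. The band boundaries (1 to 3, 4 to 6, 7 to 9) are an assumption made for illustration, since the rubric as reproduced here does not state where each band begins; the names `BAND_STARTS`, `BAND_LABELS`, and `anchor_band` are hypothetical and are not part of the CSA scoring procedure.

```python
from bisect import bisect_right

# ASSUMPTION: equal-width bands (1-3, 4-6, 7-9). The rubric text lists the
# anchors and the 1-9 scale but does not spell out the band boundaries.
BAND_STARTS = (1, 4, 7)
BAND_LABELS = ("Unacceptable", "Acceptable", "Superior")


def anchor_band(rating: int) -> str:
    """Return the anchor label for a holistic patient note rating (1-9)."""
    if not isinstance(rating, int) or not 1 <= rating <= 9:
        raise ValueError("rating must be an integer from 1 to 9")
    # bisect_right counts how many band starts are <= rating;
    # subtracting 1 gives the index of the band containing the rating.
    return BAND_LABELS[bisect_right(BAND_STARTS, rating) - 1]


if __name__ == "__main__":
    for r in range(1, 10):
        print(r, anchor_band(r))
```

If the actual boundaries differ, only `BAND_STARTS` would need to change.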
Note
1 Although each CSA administration involves 10 patient encounters, there are a few cases (e.g.,
pre-employment physical examination) that do not require a written summary. Therefore, for 6,225
candidates, there will be fewer than 62,250 PNs.
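The arithmetic behind this note can be made explicit with a short sketch: 6,225 candidates times 10 encounters gives a ceiling of 62,250 notes, and every encounter that requires no written summary lowers the total. The function names and the `no_note_encounters` argument are hypothetical; the study does not report how many encounters lacked a note, so no actual total is computed here.

```python
def max_patient_notes(candidates: int, encounters_per_candidate: int = 10) -> int:
    """Ceiling on the number of notes: one note per candidate per encounter."""
    if candidates < 0 or encounters_per_candidate < 0:
        raise ValueError("counts must be non-negative")
    return candidates * encounters_per_candidate


def patient_notes_written(
    candidates: int,
    no_note_encounters: int,
    encounters_per_candidate: int = 10,
) -> int:
    """Notes actually written, given a total count of encounters without a note.

    no_note_encounters is a HYPOTHETICAL input (total across all candidates);
    the source gives no figure for it.
    """
    ceiling = max_patient_notes(candidates, encounters_per_candidate)
    if not 0 <= no_note_encounters <= ceiling:
        raise ValueError("no_note_encounters must lie between 0 and the ceiling")
    return ceiling - no_note_encounters


if __name__ == "__main__":
    # 6,225 candidates and 10 encounters are the figures given in the note.
    print(max_patient_notes(6225))
```

Any positive value of `no_note_encounters` yields a total below the ceiling, which is the point the note makes.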
References
American Educational Research Association, American Psychological Association, National
Council on Measurement in Education (1999). Standards for Educational and Psychological
Testing. Washington, DC: Author.
Australian Council for Educational Research (1998). ACER ConQuest: Generalized Item Response
Modelling Software. Melbourne, Australia: Author.
Ayers, W.R. & Boulet, J.R. (2001). Establishing the validity of test score inferences: Performance
of 4th-year U.S. medical students on the ECFMG Clinical Skills Assessment. Teaching and
Learning in Medicine 13(4): 214–220.
Boulet, J.R., Ben David, M.F., Ziv, A., Burdick, W.P., Curtis, M. & Peitzman, S. et al. (1998a). Using
standardized patients to assess the interpersonal skills of physicians. Academic Medicine 73(10
Suppl.): 94–96.
Boulet, J.R., Friedman Ben-David, M., Hambleton, R.K., Burdick, W.P., Ziv, A. & Gary, N.E.
(1998b). An investigation of the sources of measurement error in the post-encounter written
scores from standardized patient examinations. Advances in Health Sciences Education: Theory
and Practice 3: 89–100.
Boulet, J.R., Friedman Ben-David, M., Ziv, A., Burdick, W.P. & Gary, N.E. (2000). The use of
holistic scoring for post-encounter written exercises. In D. Melnick (ed.), Proceedings of the
Eighth Ottawa Conference on Medical Education and Assessment (pp. 254–260). Philadelphia,
PA: National Board of Medical Examiners.
Boulet, J.R., McKinley, D.W., Norcini, J.J. & Whelan, G.P. (2002). Assessing the comparability of
standardized patient and physician evaluations of clinical skills. Advances in Health Sciences
Education: Theory and Practice 7: 85–97.
Boulet, J.R., van Zanten, M., McKinley, D.W. & Gary, N.E. (2001). Evaluating the spoken English
proficiency of graduates of foreign medical schools. Medical Education 35(8): 767–773.
Brennan, R.L. (2001). Generalizability Theory. New York: Springer-Verlag.
Cradock, J., Young, A.S. & Sullivan, G. (2001). The accuracy of medical record documentation in
schizophrenia. Journal of Behavioral Health Services & Research 28(4): 456–465.
Crossley, G.M., Howe, A., Newble, D., Jolly, B. & Davies, H.A. (2001). Sheffield Assessment Instrument for Letters (SAIL): Performance assessment using outpatient letters. Medical Education
35(12): 1115–1124.
Friedman Ben-David, M.F., Boulet, J.R., Burdick, W.P., Ziv, A., Hambleton, R.K. & Gary, N.E.
(1997). Issues of validity and reliability concerning who scores the post-encounter patient-progress note. Academic Medicine 72(10 Suppl. 1): 79–81.
Friedman Ben-David, M.F., Klass, D.J., Boulet, J., De Champlain, A., King, A.M. & Pohl, H.S.
et al. (1999). The performance of foreign medical graduates on the National Board of Medical
Examiners (NBME) standardized patient examination prototype: A collaborative study of the
NBME and the Educational Commission for Foreign Medical Graduates (ECFMG). Medical
Education 33(6): 439–446.
Grace-Farfaglia, P. & Rosow, P. (1995). Automating clinical dietetics documentation. Journal of the
American Dietetic Association 95(6): 687–690.
Hallock, J.A. (2002). ECFMG and the challenges facing international medical graduates. Reporter
11(8): 2–3. Washington, DC: Association of American Medical Colleges.
Harden, R.M. & Gleeson, F.A. (1979). Assessment of clinical competence using an objective
structured clinical examination (OSCE). Medical Education 13(1): 41–54.
Howell, J., Chisholm, C., Clark, A. & Spillane, L. (2000). Emergency medicine resident documentation: Results of the 1999 American Board of Emergency Medicine in-training examination
survey. Academic Emergency Medicine 7(10): 1135–1138.
Larimore, W.L. & Jordan, E.V. (1995). SOAP to SNOCAMP: improving the medical record format.
Journal of Family Practice 41(4): 393–398.
Medical Council of Canada (2002). Qualifying Examination Part II, Information Pamphlet. Ottawa,
Canada: Author.
Prentice, D.A. & Miller, D.T. (1998). When small effects are impressive. In A.E. Kazdin (ed.), Methodological Issues and Strategies in Clinical Research (pp. 163–173). Washington, DC: American
Psychological Association.
Regehr, G., MacRae, H., Reznick, R.K. & Szalay, D. (1998). Comparing the psychometric properties
of checklists and global rating scales for assessing performance on an OSCE-format examination.
Academic Medicine 73(9): 993–997.
Rollnick, S., Kinnersley, P. & Butler, C. (2002). Context-bound communication skills training:
Development of a new method. Medical Education 36(4): 377–383.
SAS Institute, Inc. (1989). SAS/STAT User’s Guide: Version 6 (4th ed.). Cary, NC: Author.
Shapiro, J. (1999). Correlates of family-oriented physician communications. Family Practice 16(3):
294–300.
Slater, S.C. & Boulet, J.R. (2001). Predicting holistic ratings of written performance assessments
from analytic scoring. Advances in Health Sciences Education: Theory and Practice 6(2): 103–119.
Sleszynski, S.L., Glonek, T. & Kuchera, W.A. (1999). Standardized medical record: A new outpatient osteopathic SOAP note form: Validation of a standardized office form against physician’s
progress notes. Journal of the American Osteopathic Association 99(10): 516–529.
Swanson, D.B. & Norcini, J.J. (1989). Factors influencing reproducibility of tests using standardized
patients. Teaching and Learning in Medicine 1(3): 158–166.
van Dalen, J., Kerkhofs, E., Verwijnen, G.M., Knippenberg-Van Den Berg, B.W., van Den Hout,
H.A., Scherpbier, A.J. et al. (2002). Predicting communication skills with a paper-and-pencil
test. Medical Education 36(2): 148–153.
van der Vleuten, C.P., Norman, G.R. & De Graaff, E. (1991). Pitfalls in the pursuit of objectivity:
Issues of reliability. Medical Education 25(2): 110–118.
Weber, D.O. (2002). Charting a course toward legible medical records: Perfect paperwork can mean
financial savings, better patient care. Physician Executive 28(1): 8–13.
Williams, S., Weinman, J. & Dale, J. (1998). Doctor-patient communication and patient satisfaction:
A review. Family Practice 15(5): 480–492.