Research papers

Does observation add to the validity of the long case?

Val Wass¹ & Brian Jolly²

¹ Department of General Practice and Primary Care, Guy's, King's and St Thomas' School of Medicine, London, UK
² Department of Medical Education, The University of Sheffield, Sheffield, UK

Correspondence: Dr Val Wass, Department of General Practice and Primary Care, Guy's, King's and St Thomas' School of Medicine, Weston Education Centre, 10, Cutcombe Road, London, SE5 9RJ, UK. Tel.: 020 78485576; Fax: 020 78485711; E-mail: valerie.wass@kcl.ac.uk

Background A London medical school final MBBS examination for 155 candidates.

Objective To investigate whether observing the student–patient interaction in a history taking (HT) long case adds incremental information to the traditional presentation component.

Design A prospective study of a HT long case which included both examiner observation of the student–patient interview (Part 1) and traditional presentation to different examiners (Part 2). Checklist and global ratings of both parts were compared. Examiners were paired to estimate inter-rater reliability. The students also took a 20 station Objective Structured Clinical Examination (OSCE).

Outcome measures Correlation of (I) examiner ratings for observation and presentation of the HT long case, (II) examiner pair ratings and (III) stepwise regression analysis of scores for the HT long case with OSCE scores.

Results Seventy-five (48.4%) candidates had two examiner pairs marking their case history. Observation and presentation scores correlated poorly (checklist 0.38 and global 0.33). Checklist and global scores for each part correlated at higher levels (observation 0.64 and presentation 0.61). Inter-rater reliability correlations were higher for observation (checklist 0.72 and global 0.71) than for presentation (checklist 0.38 and global 0.60). When HT long case scores were correlated with OSCE scores, using stepwise regression, global presentation scores showed the highest correlation with the OSCE score (0.36), and adding the global observation score explained a further 12% of the variance, raising the correlation to 0.50.

Conclusion Observation of history taking in a long case appears to measure a useful and distinct component of clinical competence over and above the contribution made by the presentation.

Keywords Observation, *HT; medical history taking, *HT; *clinical competence; education, medical, undergraduate; educational measurement; reproducibility of results; professional patient relations; prospective studies; regression analysis.

Medical Education 2001;35:729–734. © Blackwell Science Ltd

Introduction

The search for the ideal assessment of clinical competence for undergraduates, which is both valid and reliable, remains controversial [1]. In the traditional long case, candidates are given uninterrupted and unobserved time, usually 30–45 minutes, to interview and examine a patient, selected from the wards or outpatients and untrained for examinations. Candidates then present their findings to the examiners as in an unstructured oral examination. The long case attempts to assess the integrated interaction between the doctor and a 'real' patient. It is arguably a valid and educationally valuable test [2]. However, the assessment is lengthy and usually only one long case is used.

The argument against the long case hinges on reliability. There is now indisputable evidence that in all measurements of clinical competence, candidates perform variably across tasks [3,4]. One case is insufficient to produce a reliable measure of the candidate's ability. The introduction of the objective structured clinical examination (OSCE) [5], which uses multiple stations to produce a more reliable test format, enables isolated components of the long case to be examined in a variety of contexts.
The use of this OSCE format improves reliability [6]. Consequently, many medical schools are turning towards this form of clinical competence testing. However, this move may take place at the expense of validity [1]. It may be limited by its reductionist approach to clinical performance, as well as being time-consuming and costly. These last factors cannot be ignored.

Key learning points

Traditionally students are taught and judged on presentation of case histories. The interaction with the patient is rarely observed.
When the long case interaction is observed, marks given for the interview correlate poorly with those given by different examiners for the presentation.
Marks given independently by two examiners show good agreement, particularly for the observed interview. One examiner is probably sufficient.
Ratings of observation and presentation contributed significantly and independently to the correlation with clinical competence, as judged from the OSCE score.
Observation and presentation of a history taking long case appear to measure different parameters of clinical competence.

Some UK medical schools have been reluctant to lose the long case as an assessment tool. Our examination board originally decided, when an OSCE was introduced in the final examination, that a long case using real patients should be included. It was agreed that assessment of the candidate's interaction with randomly selected real patients was important and that examiners should be able to cross-examine candidates in the traditional relatively unstructured way. Two modifications were proposed. Firstly, that the physical examination of the patient should be excluded from the long case assessment as there were several clinical examination stations in the OSCE. Secondly, that the student's interview with the patient should be observed. The latter decision was important to us.
Attempts have been made to improve reliability within a long case. Gleeson [7] developed the objective structured long examination record (OSLER) where the presentation is structured to increase the observations made by examiners on the candidate's approach to the case. Observed long cases have been recommended [8,9]. However, surprisingly little has been published on the psychometrics of the long case in any of its forms [2].

The examination format enabled us to investigate the extent to which the traditional part of the long case, i.e. the presentation to the examiners, related to assessment based on observing the interaction with the patient. Two pairs of different examiners independently rated the two parts: the observation and the presentation. Important research questions could be asked:

Does observing the student–patient interaction add incremental information on clinical competence to the traditional presentation component?
How well do independent ratings for the two components correlate?
Is the inter-rater reliability of the observed component better than for the unobserved component?
Is the combined rating of both parts better than for the traditional unobserved one alone?

Method

The clinical examinations took place over three days. The aim was to test clinical competence. The candidates had also taken three written papers of (a) multiple choice, (b) short answer and (c) essay questions. The format of the clinical examination was one HT long case and a twenty station OSCE.

The history taking long case

Candidates were given 16 minutes (Part 1) with a patient to elicit their full medical history, formulate diagnostic issues and summarise their thoughts. They were not required to examine the patient. This was followed by an eight minute unstructured presentation of the case (Part 2).

During Part 1, the candidate was silently observed taking the history by one or two examiners. No interaction was allowed. The examiner(s) marked against a comprehensive checklist.
The items covered by the checklist are summarised in Table 1. This list was scored out of 20 for history taking. In addition a further 20 marks could be given, using a 1–5 Likert scale, as global ratings allocated for four parameters: medical history, psychosocial history, interviewing skills and overall fluency of the performance.

For Part 2, the candidate moved on to present the case to one or two examiners in the traditional way. The examiners could ask unstructured questions, as in the usual long case format, to clarify the candidate's description and interpretation of the history and the conclusions reached. They marked independently using an identical checklist (20 marks) and global rating scheme (20 marks). Thus the assessment of the candidate's presentation could be compared with the assessment of the actual observed interaction.

Table 1 Observation of the history taking long case: items covered by the checklist

Heading               Items                                                      Maximum points allocated
Patient details       Appropriate introduction/age/occupation                    1.0
Present illness       Symptoms: type, description, duration, alleviating,        7.0
                      aggravating, precipitating factors, previous episodes,
                      medication, psychological symptoms
Past medical history  Past illness, medications, hospitalisation, allergies      3.0
System review         Micturition, appetite, weight, bowel function,             4.0
                      menstruation if appropriate
Social history        Family history, housing, alcohol, tobacco, nutritional     5.0
Total score                                                                      20.0

Patients

Medical, surgical and psychiatric patients were selected from hospital wards and outpatient departments. They were untrained for examinations and no attempt was made to standardise them. Two cases were alternated, wherever possible, at each of the long case stations. Candidates were assigned to patients at random.
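As an illustration only, the marking scheme described above can be sketched in Python. The sketch assumes that the Table 1 maxima are 1.0 for patient details and 7.0 for present illness (one reading of the table), and the function names and the candidate's marks are hypothetical rather than taken from the examination.

```python
# Maximum checklist points per heading, as read from Table 1 (assumed mapping).
CHECKLIST_MAX = {
    "patient_details": 1.0,
    "present_illness": 7.0,
    "past_medical_history": 3.0,
    "system_review": 4.0,
    "social_history": 5.0,
}

# The four global parameters, each rated on a 1-5 Likert scale.
GLOBAL_PARAMETERS = (
    "medical_history",
    "psychosocial_history",
    "interviewing_skills",
    "overall_fluency",
)


def part_score(checklist, global_ratings):
    """Return (checklist_total, global_total) for one part of the long case."""
    for heading, points in checklist.items():
        if not 0.0 <= points <= CHECKLIST_MAX[heading]:
            raise ValueError(f"{heading}: {points} is outside the allowed range")
    for name in GLOBAL_PARAMETERS:
        if not 1 <= global_ratings[name] <= 5:
            raise ValueError(f"{name}: rating must be between 1 and 5")
    checklist_total = sum(checklist.values())
    global_total = sum(global_ratings[name] for name in GLOBAL_PARAMETERS)
    return checklist_total, global_total


def long_case_total(observation, presentation):
    """Sum checklist and global marks over Part 1 and Part 2 (maximum 80)."""
    return sum(observation) + sum(presentation)


# Hypothetical marks for one candidate (not data from the study).
obs = part_score(
    {"patient_details": 1.0, "present_illness": 5.0, "past_medical_history": 2.0,
     "system_review": 3.0, "social_history": 3.5},
    {"medical_history": 4, "psychosocial_history": 3,
     "interviewing_skills": 4, "overall_fluency": 4},
)
pres = part_score(
    {"patient_details": 1.0, "present_illness": 4.5, "past_medical_history": 2.0,
     "system_review": 3.0, "social_history": 3.0},
    {"medical_history": 3, "psychosocial_history": 3,
     "interviewing_skills": 4, "overall_fluency": 4},
)
total = long_case_total(obs, pres)
print(obs, pres, total)
```

Each part yields a checklist mark out of 20 and a global mark out of 20, so the combined HT long case score is out of 80.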
Examiners

Provided numbers permitted, a team of 12 examiners per session (four per candidate) was allocated to the long cases. The examiners were consultants or senior lecturers, experienced in long case assessment in the usual 'presentation only' format. Before the examination, they were briefed together for both parts of the assessment and instructed to mark independently and not collude or alter scores after discussion. Wherever possible, they worked in pairs. They were allocated by rota to examine on either Part 1 or Part 2. In the course of a session, they had equal experience of examining both parts. If there were insufficient examiners to make pairs, the candidates were marked by one examiner only.

The OSCE examination

The OSCE examination consisted of 20 stations (8 minutes each), each rated by a single examiner. The stations included 8 testing physical examination, 4 assessing communication skills, 3 on practical skills and 5 on data and imaging interpretation.

Statistical analysis

Data analysis was carried out using the Statistical Package for Social Sciences (SPSS). Mean scores on observation and presentation of the long case were calculated for all candidates. Mean examiner scores for those candidates marked by two examiners on both parts of the long case were correlated with marks for the observation and presentation using the standard (Pearson) product-moment correlation. The inter-rater reliability of the paired examiners was estimated on each part, using intraclass correlations for both checklist and global scores. The total OSCE scores for the candidates were correlated with the checklist and global scores using stepwise regression. The traditional long case score, i.e. the global presentation score, was entered first, followed by the other component scores, to investigate any contribution made to explain the variance in OSCE scores.

Results

One hundred and fifty-five candidates took the examination.
They were marked on the HT long cases by a team of 50 examiners who rotated to cover the three day examination. All candidates were rated by a different examiner on the observation part and the presentation part of one HT long case. Seventy-five (48.4%) candidates were rated by two examiners on both parts. The performance of this 'study group' of 75 candidates was analysed in detail.

Based on the scores of one examiner only (a random selection of one was made from the study group examiner pairs; the non study group had only one examiner), the mean total HT long case score (out of 80) for the study group was 56.8 (SD 8.6) compared to 60.9 (SD 8.3) for the rest of the cohort. The study group scored statistically significantly lower (P < 0.05, 95% confidence interval (CI) -6.73 to -1.36). The mean marks for the observation part were 14.5 (global) and 13.9 (checklist) for the study group and for the remaining cohort 15.7 (global) and 14.5 (checklist). The mean marks for the presentation part were 14.5 (global) and 13.9 (checklist) for the study group and for the remaining cohort 15.5 and 14.8, respectively. The range of SDs was 2.5–3.1. All comparisons on the long case scores between the study group and the remaining cohort were statistically significant (P < 0.05). Mean OSCE scores were 259.7 (± 10.0) and 260.2 (± 8.4) for the study and remaining candidates, respectively. This difference was not significant. There was also no difference between the study group and the remainder on any other part of the examination.

Correlation coefficients for the double marked study group are summarised in Table 2. Intraclass correlations between examiners (inter-examiner reliabilities) were higher for observation of the case (0.72 for checklist and 0.71 for global observed ratings) than for the case presentation (0.38 for checklist and 0.60 for global presentation ratings). Examiner pair scores were then combined to compare ratings across the long case components for the 75 candidates. Within the separate observation and presentation components, checklist and global ratings showed Pearson correlation coefficients of 0.64 (observation) and 0.61 (presentation). There was a clear lack of correlation between scores given for long case observation compared to presentation using either checklist or global scores (Pearson correlations ranged from 0.18 to 0.38).

Table 2 Correlation between mean (averaged across two examiners) observation and presentation scores for all double marked candidates (n = 75)

                                        Observation               Presentation
Correlations (inter-rater correlation)  Checklist    Global       Checklist    Global
                                        (0.72)**     (0.71)**     (0.38)**     (0.60)**
Observation   Checklist                 1.00
Observation   Global                    0.64         1.00
Presentation  Checklist                 0.38         0.25         1.00
Presentation  Global                    0.18         0.33         0.61         1.00

**Intraclass or standard Pearson correlations were used as indicated

Table 3 summarises the results of the regression analysis of the percentage variance contribution made by each part (variable) of the HT long case correlated with the OSCE score, with the significance of each part (using a t-test, as produced as part of a SPSS standard output). The presentation global rating showed the highest individual correlation of 0.36 (P = 0.001) with the OSCE scores.

Table 3 Results of the stepwise regression of long case components on the OSCE scores and their significance

Step  Variables entered                                  R      % variance  R2 change  Significance
                                                                explained              R2 change
1     Presentation Global Rating (PGR)                   0.364  13.3%       0.133      0.001
2     PGR and Observation Global Rating (OGR)            0.503  25.3%       0.120      0.001
3     PGR, OGR and Observation Check list Rating (OCR)   0.503  22.2%       0.001      0.778
4     PGR, OGR, OCR and Presentation Check list Rating   0.510  26.0%       0.007      0.427
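The relationship between the R and percentage variance columns of Table 3 can be checked with a few lines of arithmetic: the proportion of variance explained is the square of the multiple correlation R, and the R2 change is the difference between successive steps. The sketch below is a cross-check on the first two steps only, using the rounded R values printed in the table; the variable and function names are illustrative.

```python
# Multiple correlation R after each of the first two steps, copied from Table 3.
R_BY_STEP = {
    "PGR": 0.364,        # presentation global rating alone
    "PGR + OGR": 0.503,  # observation global rating added
}


def variance_explained(r):
    """Proportion of OSCE score variance explained, given multiple correlation r."""
    return r * r


r2_step1 = variance_explained(R_BY_STEP["PGR"])
r2_step2 = variance_explained(R_BY_STEP["PGR + OGR"])
r2_change = r2_step2 - r2_step1

print(f"Step 1: R^2 = {r2_step1:.3f}")
print(f"Step 2: R^2 = {r2_step2:.3f}")
print(f"R^2 change from step 1 to step 2 = {r2_change:.3f}")
```

Because the inputs are already rounded to three decimals, the results agree with the tabulated 13.3%, 25.3% and 0.120 only to within rounding.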
Adding the observation score (globally rated) added significant explanatory power to the regression equation, almost doubling the explained variance from 13% to 25% and giving a correlation of 0.50 (P = 0.001). Further inclusions did not increase the variance explained by these variables.

Discussion

The presentation of a history in the long case has been accepted as a measure of the candidate's overall ability to carry out a medical interview, appraise the findings and decide a course of action. We have shown that examiners observing and marking the candidate during the (usually neglected) interaction with the patient rate the candidate differently. When taking the OSCE as a measure of clinical competence, direct observation of the HT long case contributed as much again as the presentation to the correlation of the HT long case with the OSCE results. This is, perhaps, not surprising but has never been demonstrated before in psychometric terms. It challenges the tradition of case presentations alone.

This study examined the history taking process of the long case only. Real patients were used and there was no attempt to standardise them. The candidate was not asked to carry out a clinical examination. This was partly for reasons of time and also because clinical examination techniques were tested in the OSCE itself. However, the candidates were expected to process the information gathered from the patients to present to the examiner in the usual way.

The presentation was conducted in the traditional format, the examiners being free to ask the questions they perceived relevant in an unstructured way and then make general overall judgements. The introduction of a checklist to be completed at the presentation was part of the research design to assess comparability of the presentation and observation.
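For readers less familiar with the two statistics used to assess comparability, the following Python sketch computes a Pearson product-moment correlation and an intraclass correlation for paired examiner marks. It is a minimal illustration under stated assumptions: the marks are hypothetical, and the one-way random-effects form of the intraclass correlation is assumed, since the form produced by SPSS in this study is not specified.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_one_way(ratings):
    """One-way random-effects intraclass correlation for a single rater.

    ratings holds one tuple per candidate, with one mark per examiner.
    The choice of this ICC form is an assumption made for illustration.
    """
    n = len(ratings)      # number of candidates
    k = len(ratings[0])   # number of examiners per candidate
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-candidate and within-candidate mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (mark - m) ** 2 for row, m in zip(ratings, row_means) for mark in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical checklist marks from two examiners for six candidates.
paired_marks = [(14, 15), (12, 12), (16, 17), (10, 12), (15, 14), (13, 13)]
examiner_a = [a for a, _ in paired_marks]
examiner_b = [b for _, b in paired_marks]

r = pearson(examiner_a, examiner_b)
icc = icc_one_way(paired_marks)
print(f"Pearson r = {r:.2f}, ICC = {icc:.2f}")
```

The Pearson coefficient reflects only whether two sets of marks rise and fall together, whereas the intraclass correlation also penalises systematic differences in the marks awarded, which is why it is the usual choice for examiner agreement.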
As reported in the literature [10], checklist and global ratings correlated highly. Inter-rater reliability was better for the observed part of the long case. The low examiner agreement on the checklist scores for the presentation tends to confirm the unstructured format of the presentation. The format was therefore as close as we could get to the complete long case. The crucial issue of observation vs. presentation would hold with or without the clinical examination component.

The study cohort performed slightly worse on the long case, but not on any other part of the examination. This may have been because the presence of another examiner made each more cautious in awarding marks. However, the long case score ranges and distributions were not significantly different for the two groups. The difference in mean scores would not affect the results of the study, which were generated entirely from within the study cohort.

The higher inter-rater reliability for the observed part of the case, compared to the presentation, throws doubt on relying on the presentation alone. In the latter, the examiners may be trying to make inferences about what had happened in the previous part. This may have contributed to the lower correlation of their scores. Thus further work is needed to assess the difference if the same examiners are used for both parts. However, the fact that the ratings of observation and presentation correlated poorly and contributed significantly and independently to the correlation with clinical competence, as judged from the OSCE score, suggests that they measure different parameters of clinical competence.

Swanson [3] points out that all performance-based assessments within health professions confirm that testing examinees in realistic performance-based situations is fraught with difficulty. Complex interactions between the context (situation/task) and the construct (skill/knowledge) are being measured. Van der Vleuten has underlined these problems.
He uses the complex cognitive psychological processes involved in professional clinical expertise [4] to explain the variability seen across different examination components and from context to context.

The failure to find a strong correlation between performance on the long case and the OSCE stations could be accounted for by one of two factors. The OSCE may measure a different clinical process. Alternatively, it may just reflect insufficient sampling in the use of only one long case. If extrapolations from other examinations are used, the latter may well be true. Swanson [11] calculated from oral examination case histories that it would take a full day of testing with 12 to 16 cases to achieve an acceptable level of generalisability for a high stakes examination. Stillman [12], using standardised patients as both subjects and examiners, showed that 13 and 17 cases were needed to achieve reliability scores of 0.68 and 0.88, respectively. Further work is clearly necessary to estimate the sampling errors of long cases, as we can no longer assume that the long case, even if observed, is an equivalent process to a clinical viva [13].

We conclude that the traditional long-case presentation becomes a more valid measure of the candidate's clinical competence if the interaction with the patient is observed. This conclusion has direct relevance to the use of long cases in clinical examinations but may also have relevance to traditional case presentations on ward rounds. In the clinical setting, students present histories on ward rounds but are not generally observed while they interview patients. More attention to observation during ward clerkships may therefore be necessary.

Acknowledgements

We thank Dr John Rees and Professor Gwyn Williams for supporting this research within the finals clinical examination, Professor Cees van der Vleuten for his advice on analysing the study and Professor Roger Jones and Professor David Newble for their comments on the manuscript.
Contributors

Both authors designed the study together. VW organised the study and collected the data. BJ performed the statistical analysis.

Funding

There was no external funding for this project.

References

1 Meadow R. The structured exam has taken over. BMJ 1998;317:1329.
2 Van der Vleuten CPM. Making the best of the 'long case'. Lancet 1996;347:704–5.
3 Swanson DB, Norman GR, Linn RL. Performance-based assessment: Lessons learnt from the health professions. Educ Res 1995;24(5):5–11.
4 Van der Vleuten CPM. The assessment of professional competence: Developments, research and practical implications. Adv Health Sci Educ 1996;1:41–67.
5 Harden RM, Gleeson FA. ASME Medical Educational Booklet no. 8 Assessment of medical competence using an objective structured clinical examination (OSCE). J Med Educ 1979;13:41–54.
6 Newble DI, Swanson DB. Psychometric characteristics of the objective structured clinical examination. Med Educ 1996;22:325–34.
7 Gleeson F. The effect of immediate feedback on clinical skills using the OSLER. In: AI Rothman, R Cohen, eds. Proc of the Sixth Ottawa Conference of Medical Education. Toronto: University of Toronto Bookstore Custom Publishing. 1994; 412–5.
8 Newble DI. The observed long case in clinical assessment. Med Educ 1994;25:369–73.
9 Price J, Byrne GJA. The direct clinical examination: an alternative method for the assessment of clinical psychiatric skills in undergraduate medical students. Med Educ 1994;28:120–5.
10 Regehr G, MacRae H, Reznick R, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998;73(9):993–7.
11 Swanson DB. A measurement framework for performance based tests. In: IR Hart, RM Harden, eds. Further Developments in Assessing Clinical Competence, pp 13–45. Montreal: Can-Heal; 1987.
12 Stillman P, Regan MB, Swanson D, Case S, McCaha J, Feinblatt J et al. An assessment of the clinical skills of fourth-year students at four New England medical schools. Acad Med 1990;65:320–6.
13 Hardy KJ, Demos LL, McNeil JJ. Undergraduate surgical examinations: an appraisal of the clinical orals. Med Educ 1998;32:582–9.

Received 31 January 2001; accepted for publication 22 February 2001