Psychological Assessment, 2002, Vol. 14, No. 1, 16–26
Copyright 2002 by the American Psychological Association, Inc. 1040-3590/02/$5.00 DOI: 10.1037//1040-3590.14.1.16

Underreporting of Psychopathology on the MMPI-2: A Meta-Analytic Review

Ruth A. Baer and Joshua Miller
University of Kentucky

Ruth A. Baer and Joshua Miller, Department of Psychology, University of Kentucky. Correspondence concerning this article should be addressed to Ruth A. Baer, Department of Psychology, 115 Kastle Hall, University of Kentucky, Lexington, Kentucky 40506-0044. E-mail: rbaer@uky.edu

Meta-analytic techniques were applied to studies of the MMPI-2 in which participants given standard instructions were compared with participants instructed or believed to have been underreporting. Traditional and supplementary indices of underreporting yielded a mean effect size of 1.25, suggesting that underreporting respondents differ from those responding honestly by a little more than 1 standard deviation, on the average, on these scales. Analyses of classification accuracy suggested that several scales are moderately effective in detecting underreporting, although accuracy decreases if participants have been coached about validity scales. Base rates of defensive responding in relevant populations are reviewed, and methodological issues, including research designs, coaching, and incremental validity of supplementary underreporting scales, are discussed.

Self-report inventories are more likely to yield accurate and useful information when test takers respond honestly (Graham, 2000). Response biases such as overreporting of symptoms (malingering or “faking bad”) and underreporting of symptoms (defensiveness or “faking good”) can result in invalid and misleading test results. Clinicians are most likely to encounter these response biases in settings that provide test takers with substantial incentives for distortion. For example, evaluations for competency to stand trial and lawsuits for psychological damages can provide strong incentives to exaggerate or fabricate psychological symptoms. Applicants for jobs or training programs and divorcing parents undergoing custody evaluations may have powerful reasons to present themselves in unrealistically positive terms, perhaps by attempting to conceal existing symptoms of psychopathology.

The Minnesota Multiphasic Personality Inventory (MMPI; Hathaway & McKinley, 1983) was among the first self-report inventories to include validity scales designed to detect these response biases. Its successor, the MMPI-2 (Butcher, Dahlstrom, Graham, Tellegen, & Kaemmer, 1989), is widely used in clinical, legal, and organizational settings (Graham, 2000; Lees-Haley, 1992). The efficacy of the validity scales of the MMPI and MMPI-2 in detecting fake-bad and fake-good response biases has been extensively studied. Berry, Baer, and Harris (1991), in a meta-analytic review of the detection of malingering on the original MMPI, found that the Infrequency (F) scale and Infrequency minus Correction (F − K) index (Gough, 1950) were quite effective in discriminating honest from malingering respondents. Rogers, Sewell, and Salekin (1994) obtained similar results in a meta-analysis of malingering on the MMPI-2. Baer, Wetter, and Berry (1992) conducted a meta-analysis of the detection of underreporting on the original MMPI, and they found an overall mean effect size of 1.05, indicating that participants who underreported psychopathology differed from those who responded honestly by approximately one standard deviation, on the average, on underreporting indices. For the traditional Lie (L) and Correction (K) scales, effect sizes of just under one standard deviation were noted.
Effect sizes of approximately 1.5 were noted for two supplementary scales: the Positive Malingering scale (Mp) and Wiggins’s Social Desirability scale (Wsd; described below). Widely varied cutting scores were reported across studies. In general, underreporting was noted to be more difficult to detect than overreporting. Currently, no reviews of the detection of underreporting on the MMPI-2 are available, although several studies of this question have been published. Thus, the purpose of the present review is to apply meta-analytic strategies to an evaluation of the literature on detection of underreporting on the MMPI-2. The utility of available indices of underreporting is investigated, and current methodological issues in this area of research are explored.

Underreporting Indices on the MMPI-2

The most widely used indices of underreporting on the MMPI-2 are the traditional L and K scales. The F − K index (Gough, 1950) also is frequently used. In addition, several supplementary underreporting scales are available. Some of these were developed for the original MMPI. For example, Cofer, Chance, and Judson (1949) described the use of the L + K index, in which the raw L and K scores are summed. These authors also developed the Mp scale, which consists of items that participants answered in the socially undesirable direction when responding honestly or faking bad, but in the opposite direction when faking good. Edwards’s (1957) Social Desirability scale (Esd) includes items for which 10 judges unanimously agreed on the socially desirable response. Wiggins’s (1959) Social Desirability scale includes items shown to discriminate participants instructed to respond in a socially desirable manner from those given the standard instructions. Hanley’s (1957) Test-Taking Defensiveness scale (Tt) includes items for which judges agreed on the socially desirable response, but which only about half of a normative group endorsed in the socially desirable direction. Wiener (1948) identified Obvious items, whose content clearly reflected psychological disturbance, and Subtle items, whose relationship to psychopathology was unclear. Respondents who are underreporting should obtain low scores on the Obvious items, but often obtain high scores on the Subtle items, and negative scores when the Subtle scale is subtracted from the Obvious scale (O − S).

Several underreporting scales have been developed for the MMPI-2. The Other Deception (Od) scale (Nichols & Greene, 1991) consists of items from the Mp and Wsd scales and was designed to measure the other-deception factor of Paulhus’s (1984, 1986) two-factor model of socially desirable responding. “Other deception,” or impression management, is a deliberate attempt to present an unrealistically favorable self-description, whereas “self-deception” is an overly positive self-presentation, akin to narcissism, that the respondent believes to be true (Paulhus, 1998). The Positive Mental Health (PMH4) scale (Nichols, 1992) consists of items that appear on at least 4 of 28 supplementary scales designed to measure a positive trait.
The Superlative (S) scale (Butcher & Han, 1995) consists of items shown to discriminate a large group of pilots seeking employment with a major airline from the MMPI-2 normative sample.

Methodological Issues in Underreporting Research

Research Design

Several research designs have been used in the literature on detection of response biases (Rogers, 1997). The most common is the simulation design, in which volunteers from either clinical or nonclinical populations complete the MMPI-2 in accordance with instructions provided by the experimenter. Generally, a group instructed to feign is compared to a group given standard instructions. Although this design can be useful in clarifying potential differences between honest and feigning respondents, the extent to which responses of experimental feigners resemble those of “real world” feigners is unclear. Rogers (1997) suggested several strategies for increasing the ecological validity of the simulation design. For example, participants should be similar to those with whom the test is used in clinical practice, and participants instructed to feign should be given a realistic scenario in which feigning might occur (e.g., “Imagine you are applying for a highly desirable job”). They should be instructed to “be believable” in their feigned presentation and should be provided with incentives for successful feigning. Their understanding of and compliance with their feigning instructions should be assessed.

In differential prevalence designs, participants who are believed because of their circumstances to have strong incentives for faking are compared with those who appear to have no such motives. For example, anonymous volunteers given standard instructions might be compared with a group being evaluated for child custody proceedings (also given standard instructions). On the average, the custody evaluees would be expected to obtain higher scores on underreporting scales. Unfortunately, as the underreporting status of individual participants in the custody group is unknown, this design cannot yield precise information about the classification accuracy of validity scales.

The known-groups design compares scores on validity scales from individuals known to have responded honestly with those known to have distorted their responses. In these studies, faking instructions are not provided by an experimenter. Instead, a subset of participants who have completed the MMPI-2 in a clinical or applied setting are discovered to have misrepresented themselves, and these participants are compared with those who are believed to have responded honestly. Results from such studies may be more generalizable than results from simulation designs, because participants have independently chosen to feign good or bad adjustment, rather than being instructed by an experimenter to do so. However, because the determination of whether participants have feigned must be made independently of the validity scale being investigated, the accuracy of the method used to identify feigners and nonfeigners is important. For example, Borum and Stock (1993) described a sample of police applicants who were confronted about inconsistencies in their applications and confessed to having misrepresented themselves. Their scores on validity scales were significantly different from those of a group not suspected of deception.
Unfortunately, as confessions of feigning appear to be rare, and as known feigners may not be representative of feigners who are never caught, the results of such studies may not be generalizable to unidentified feigners. Rogers (1997) suggested that confidence in the utility of validity scales increases with supporting evidence across several research designs. Initial research on a validity scale most often uses the simulation design. When the potential utility of a scale has been well supported through simulation designs, differential prevalence and known-groups designs can provide valuable information about the generalizability of findings to clinically important situations. This review summarizes the use of these designs in the current literature, and the extent to which Rogers’s (1997) methodological recommendations are followed.

Coaching

Wetter and Corrigan (1995) surveyed law students and practicing attorneys about their attitudes toward preparation of clients for psychological evaluations, and they found that the majority of law students and attorneys believed that educating clients about the tests they will complete is their professional responsibility. Half of the attorneys and 33% of the students believed that they should tell clients about the presence and purpose of validity scales on these tests. Thus, it seems likely that some proportion of test takers involved in legal proceedings may have been coached by their attorneys in how to complete the tests. Lees-Haley (1997) suggested that coaching by attorneys of clients in forensic cases is very common. Motivated clients also may provide their own coaching by studying professional materials about psychological assessment available in libraries and bookstores (Baer, Wetter, & Berry, 1995). Although researchers must avoid compromising the integrity of psychological tests by publishing information about effective feigning strategies (Ben-Porath, 1994; Berry, Lamb, Wetter, Baer, & Widiger, 1994), it seems important to investigate the effects of coaching and methods for detecting coached feigners. This review summarizes the current literature on coaching of underreporting respondents.

Base Rates

The accuracy of any validity scale depends on the sensitivity and specificity of the scale as well as the base rate of response distortion in the population (Finn & Kamphuis, 1995). In clinical practice, positive predictive power (PPP) and negative predictive power (NPP) are the most relevant classification accuracy statistics. Positive predictive power is the likelihood that an individual classified as feigning by the validity scale is truly feigning, whereas NPP is the likelihood that an individual classified as nonfeigning by the validity scale is actually responding honestly. A validity scale with high sensitivity and specificity may have low PPP in a population with a low base rate of response bias. For example, for a validity scale with sensitivity and specificity of .90, PPP in a population with a 5% base rate of response bias is only .32, meaning that of every three individuals in this population who are classified as feigning by the scale, only one of them is actually feigning. Because the consequences of mistakenly accusing a test taker of feigning may be quite serious, the PPP of the available scales should be carefully evaluated in populations with very low base rates of response bias.
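These quantities follow directly from Bayes’ rule. A minimal computational sketch (in Python; the function names are ours and not part of any published scoring procedure) reproduces the example above:

    def ppp(sens, spec, base_rate):
        """P(truly feigning | classified as feigning by the scale)."""
        true_pos = sens * base_rate
        false_pos = (1 - spec) * (1 - base_rate)
        return true_pos / (true_pos + false_pos)

    def npp(sens, spec, base_rate):
        """P(truly honest | classified as honest by the scale)."""
        true_neg = spec * (1 - base_rate)
        false_neg = (1 - sens) * base_rate
        return true_neg / (true_neg + false_neg)

    print(round(ppp(0.90, 0.90, 0.05), 2))  # 0.32, as in the example above
    print(round(npp(0.90, 0.90, 0.05), 2))  # 0.99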
This review summarizes current findings about the base rate of underreporting in clinical settings, as well as the classification accuracy of the available underreporting scales at representative base rates.

Incremental Validity

As noted above, the traditional L and K scales are the most widely used indices of underreporting on the MMPI-2, although several supplementary scales are available. Before the use of supplementary scales can be recommended, it is important to determine whether these scales are more effective than the established scales and whether they show incremental validity over these scales in discriminating valid from invalid protocols. If they do not, then little will be gained by scoring and interpreting them. In a meta-analysis of the scales available for the original MMPI, Baer et al. (1992) found promising classification accuracy for several scales, including L and K. Although two supplementary scales (Mp and Wsd) obtained somewhat higher mean classification rates, this literature did not examine whether using these scales in place of, or in addition to, the routinely scored L and K scales would result in improved classification accuracy. However, more recent studies with the MMPI-2 have explored this question. Findings are summarized in this review.

Method

Literature Search

A computer search was conducted of titles including the words MMPI-2, dissimulation, faking good, simulation, or underreporting. Tables of contents of recent issues of assessment journals were scanned. Reference lists of all articles obtained were searched for additional articles. Studies were included in the review if they (a) were published in English language journals, (b) compared a group of participants instructed or presumed to have been underreporting on the MMPI-2 with a group instructed or presumed to have followed the standard instructions, (c) included at least one scale or index of underreporting, and (d) reported effect sizes or provided enough data to calculate them. Unpublished dissertations and convention papers were excluded from the review. Fourteen studies meeting these criteria were identified.

Coding

Demographic variables coded for each study included number and type of participants (i.e., students, patients, job applicants, or community members) and age, sex, education, race, and diagnoses of participants. Methodological variables included (a) research design (i.e., simulation within groups, simulation between groups, differential prevalence, known groups), (b) type of instructions provided to underreporting participants (e.g., look healthy, normal, well adjusted; deny weaknesses, symptoms; etc.), (c) whether a scenario was provided and what type (e.g., imagine trying to get a good job, win a custody dispute, etc.), (d) presence of instructions to be believable, (e) random assignment to groups, (f) whether participants’ understanding of or compliance with instructions was assessed, (g) whether participants were screened for random responding, and (h) presence of coaching. Coaching was defined to include statements to participants about the presence and purpose of validity scales on the MMPI-2. Outcome variables included means and standard deviations for all underreporting indices; effect sizes (Cohen’s d) for underreporting indices, if reported; indicators of classification accuracy (sensitivity, specificity, overall hit rate, PPP, and NPP); and incremental validity of underreporting scales, if reported. Effect sizes were calculated for studies that did not report them.
For studies using a between-groups design, the following formula was used: d = (Mu − Ms) / SDp, in which Mu indicates the mean of the underreporting group on an index of underreporting, Ms indicates the mean of the standard group on that index, and SDp indicates the pooled standard deviation of the two groups. For studies using a within-groups design, effect size was calculated from t or F(1, df). All calculations of effect sizes used methods described by Rosenthal (1984). We calculated PPP and NPP by using sensitivity, specificity, and base-rate data provided in each study. A second rater independently coded 9 of the 14 studies. No significant disagreements were noted.
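The effect-size computations just described can be sketched as follows. The pooled-standard-deviation form of Cohen’s d matches the formula above; the t-to-d conversion shown (with t = sqrt(F) substituted when only F(1, df) is reported) is a common meta-analytic formulation and is offered as an approximation of, not a substitute for, the Rosenthal (1984) procedures the review actually used:

    import math

    def cohens_d(m_under, m_std, sd_under, sd_std, n_under, n_std):
        """d = (Mu - Ms) / SDp, where SDp is the pooled standard deviation."""
        pooled_var = ((n_under - 1) * sd_under ** 2
                      + (n_std - 1) * sd_std ** 2) / (n_under + n_std - 2)
        return (m_under - m_std) / math.sqrt(pooled_var)

    def d_from_t(t, df):
        """Approximate d from a reported t statistic: d = 2t / sqrt(df)."""
        return 2 * t / math.sqrt(df)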
Results

Characteristics of Studies

Table 1 shows demographic and methodological characteristics in 22 comparisons from 14 studies reviewed.

Table 1
Demographic and Methodological Characteristics for Studies Included in the Meta-Analysis

Study (groups, n)                                      Type      Age    Educ   % male   % min   Design  Scen   Belv  Comp  Inc  Rand scrn  FG inst
Austin, 1992 (FG 40 / Std 33)                          stu/stu   —/—    14/14  —/—      —/—     BGsim   job    Y     N     N    N          LG
Baer & Sekirnjak, 1997 (FG 20 / Std 20)                pat/pat   36/36  15/15  35/35    0/0     BGsim   cust   N     Y     Y    Y          LG
Baer & Sekirnjak, 1997 (FG 20 / Std 20)                pat/com   36/36  15/15  35/35    0/5     BGsim   cust   N     Y     Y    Y          LG
Baer & Sekirnjak, 1997 (FG-C 20 / Std 20)              pat/pat   36/36  15/15  35/35    0/0     BGsim   cust   Y     Y     Y    Y          LG
Baer & Sekirnjak, 1997 (FG-C 20 / Std 20)              pat/com   36/36  15/15  35/35    0/5     BGsim   cust   Y     Y     Y    Y          LG
Baer, Wetter, & Berry, 1995 (FG 24 / Std 23)           stu/stu   20/19  13/13  54/43    21/22   BGsim   cust   N     Y     Y    Y          LG&DW
Baer, Wetter, & Berry, 1995 (FG-LC 24 / Std 23)        stu/stu   19/19  12/13  33/43    13/22   BGsim   cust   Y     Y     Y    Y          LG&DW
Baer, Wetter, & Berry, 1995 (FG-HC 24 / Std 23)        stu/stu   19/19  13/13  42/43    0/22    BGsim   cust   Y     Y     Y    Y          LG&DW
Baer, Wetter, Nichols, et al., 1995 (FG 50 / Std 50)   c&s/c&s   24/24  14/14  52/52    5/5     BGsim   job    N     Y     Y    Y          LG
Bagby, Buis, & Nicholson, 1995 (FG 70 / Std 198)       stu/stu   22/22  14/14  20/29    —/—     BGsim   mix    Y     N     Y    Y          LG
Bagby, Rogers, & Buis, 1994 (FG 67 / Std 90)           stu/stu   23/23  14/14  35/35    —/—     BGsim   mix    Y     N     Y    Y          DW
Bagby, Rogers, Buis, & Kalemba, 1994 (FG 67 / Std 90)  stu/stu   22/22  14/14  31/31    —/—     BGsim   mix    Y     N     Y    N          LG
Bagby et al., 1997 (FG-C 49 / Std 49)                  stu/stu   24/24  14/14  31/31    —/—     WGsim   job    Y     N     N    Y          DW
Bagby et al., 1997 (FG 38 / Std 38)                    pat/pat   40/40  —/—    61/61    —/—     WGsim   none   N     N     N    Y          LG
Bagby et al., 1997 (FG 38 / Std 49)                    pat/stu   40/24  —/14   61/31    —/—     BGsim   none   N     N     N    Y          LG
Brems & Harris, 1996 (FG 40 / Std 40)                  stu/stu   31/31  14/14  28/28    17/17   BGsim   court  N     N     N    N          LG
Butcher, 1994 (FG 437 / Std 1,138)                     appl/com  —/—    16/—   100/100  —/—     DP      none   n/a   n/a   n/a  N          n/a
Cassisi & Workman, 1992 (FG 20 / Hon 20)               stu/stu   22/22  14/14  58/58    45/45   BGsim   mix    Y     Y     N    N          LG&DW
Graham, Watts, & Timbrook, 1991 (FG 56 / Std 56)       stu/stu   19/19  14/14  48/48    6/6     WGsim   job    N     N     N    Y          LG
Lim & Butcher, 1996 (FG 57 / Std 57)                   stu/stu   24/24  14/14  24/24    5/5     WGsim   job    Y     Y     N    Y          DW
Lim & Butcher, 1996 (FG 59 / Std 59)                   stu/stu   24/24  14/14  24/24    5/5     WGsim   job    Y     Y     N    Y          LG
Shores & Carstairs, 1998 (FG 18 / Std 18)              stu/stu   37/31  16/16  22/28    —/—     BGsim   none   N     N     N    N          LG

Note. Entries before and after the slash describe the fake-good and comparison groups, respectively. Age and education (Educ) are given in years. % male = percentage of participants who were male; % min = percentage of participants who were non-Caucasian; Scen = scenario provided for faking participants; Belv = faking participants instructed to respond believably; Comp = participants’ compliance with instructions assessed; Inc = incentive offered for successful faking; Rand scrn = participants screened for random responding; FG inst = fake-good instructions. Under groups: FG = fake good; Std = standard instructions; FG-C = fake good with coaching; FG-LC = fake good with low-detail coaching; FG-HC = fake good with high-detail coaching; Hon = instructed to be as honest as possible. Under Type: stu = students; pat = patients; com = community members; c&s = community members and students; appl = job applicants. Under Design: BG = between groups; sim = simulation; WG = within groups; DP = differential prevalence. Under Scen: job = job application; cust = child custody evaluation; mix = several scenarios provided; court = court case. Under FG inst: LG = look good; DW = deny weaknesses. Y = yes; N = no; n/a = not applicable. Dashes indicate data not available.

The number of participants per group ranged from 18 to 437 for underreporting groups and from 18 to 1,138 in standard groups. Excluding one study with unusually large groups (Butcher, 1994), the mean number of participants per underreporting group was 39 (SD = 18) and the mean number per standard group was 74 (SD = 41). Of the 22 comparisons reported, 15 compared students or community members instructed to underreport with students or community members given standard instructions or told to respond honestly. Three compared patients instructed to underreport with patients given standard instructions. Three compared patients instructed to underreport with students or community members given standard instructions, and one compared job applicants with the MMPI-2 normative sample. Mean age of participants ranged from 19 to 40 years, with a mean of 27.1 (SD = 7.11). Percentage of participants who were male ranged from 20% to 100%, with a mean of 42% (SD = 16.66%). Years of education ranged from 12 to 16, with an overall mean of 14.17 (SD = 0.88). Percentage of participants who were non-Caucasian ranged from 0% to 45%, with an overall mean of 10.6% (SD = 12.46%).

Of the 22 comparisons reported, 21 used simulation designs. Of these, 16 were between-group comparisons, in which a group instructed to underreport was compared with a group given standard instructions. Five were within-group comparisons, in which all participants completed the MMPI-2 under both underreporting and standard instructions. One study used a differential prevalence design, in which job applicants were compared to a normative sample. No known-groups designs were reported.
Scenarios were provided for underreporting participants in 18 of the 22 comparisons (82%). Of these, 6 involved job applications, 7 involved child custody, 1 was a nonspecified court procedure, and 4 provided a list of several scenarios. Underreporting participants were instructed to respond believably in 12 of the 22 comparisons (55%). They were coached on avoiding detection in 5 comparisons (23%). Their understanding of or compliance with instructions was assessed in 11 comparisons (50%), and incentives for successful underreporting were offered in 11 comparisons (50%). Participants were screened for random responding, with those scoring over a cutoff excluded from data analyses, in 18 comparisons (82%). In 14 comparisons (64%), underreporting instructions asked participants to appear healthy, well adjusted, normal, good, or favorable, sometimes using terms such as “very,” “completely,” “extremely,” or “unrealistically.” In three comparisons (14%), underreporting participants were told only to deny or minimize symptoms and problems. In four comparisons, they were asked both to look good and to deny problems.

Effect Sizes

Table 2 shows effect sizes for each index of underreporting and for each comparison. Their overall mean is 1.25 (SD = 0.68), suggesting that, on the average, underreporting participants scored more than one standard deviation higher on these indices than participants given standard instructions. The final column in Table 2 shows the mean effect size for each comparison of an underreporting group with a standard group, collapsed across all indices of underreporting used in the comparison. They ranged from 0.27 to 2.17, with a mean of 1.25. The last three rows of Table 2 show the means for each underreporting index, collapsed across comparisons. The first of these final rows presents overall means, followed in the last two rows by means for comparisons in which underreporting participants were coached or not coached, respectively. Overall means ranged from 0.91 for PMH4 to 1.56 for Wsd. Means for coached participants were generally lower, ranging from 0.23 (L) to 1.37 (Wsd). Means for uncoached participants ranged from 1.02 (PMH4) to 1.80 (Od).

Table 2
Effect Sizes (d) for Underreporting Indices

Study (comp)                                L      K    F−K    L+K     Mp    Wsd    Esd     Tt     Od  PMH4      S    O−S      M
Austin, 1992 (1)                         3.07   1.71   1.16      —      —      —      —      —      —      —      —   0.18   1.53
Baer & Sekirnjak, 1997 (2)               1.23   1.30   1.61   1.42   1.58   1.64   1.58   1.99   1.75   1.11   1.68      —   1.54
Baer & Sekirnjak, 1997 (3)               0.99   1.38   1.20   1.48   1.04   1.45   0.89   0.76   1.28   0.54   1.61      —   1.15
Baer & Sekirnjak, 1997 (2a)              0.02   0.47   0.97   0.40   1.19   1.38   1.30   0.88   1.25   0.94   1.15      —   0.90a
Baer & Sekirnjak, 1997 (3a)             −0.60   0.47   0.42   0.42   0.52   1.14   0.58  −0.07   0.62   0.30   1.08      —   0.44a
Baer, Wetter, & Berry, 1995 (1)          2.41   1.68   1.10   2.32   2.56   2.44   1.22   2.12   2.58   1.06   2.21      —   1.97
Baer, Wetter, & Berry, 1995 (1a)         0.58   0.30   0.35   0.41   0.92   1.06   0.67   0.79   0.90   0.73   0.84      —   0.69a
Baer, Wetter, & Berry, 1995 (1a)        −0.06  −0.04   0.16  −0.06   0.34   0.86   0.46   0.30   0.49   0.21   0.32      —   0.27a
Baer, Wetter, Nichols, et al., 1995 (1)  1.56   1.72   1.90   2.05   1.81   1.75   1.59   1.71   1.98   1.55   2.18      —   1.80
Bagby, Buis, & Nicholson, 1995 (1)       1.29      —      —      —   1.20      —      —      —      —      —      —   1.38   1.29
Bagby, Rogers, & Buis, 1994 (1)          1.26   1.39   1.27      —   1.25      —      —      —      —      —      —   1.48   1.33
Bagby, Rogers, Buis, & Kalemba (1)       1.25   1.43   1.30      —   1.43      —      —      —      —      —      —   1.54   1.39
Bagby et al., 1997 (1a)                  1.19   2.06   2.08   1.98   2.22   2.41   2.67   2.17   2.73   1.70   2.61      —   2.17a
Bagby et al., 1997 (2)                   1.59   1.40   1.87   1.61   1.86   1.66   2.08   1.50   2.00   1.76   2.00      —   1.76
Bagby et al., 1997 (3)                   1.20   0.77   0.36   1.10   1.10   1.40   0.36   0.88   1.19   0.12   0.93      —   0.86
Brems & Harris, 1996 (1)                 0.94      —      —      —      —      —      —      —      —      —      —      —   0.94
Butcher, 1994 (1)                        0.71   1.73      —      —      —      —      —      —      —      —      —      —   1.22
Cassisi & Workman, 1992 (1)              1.32   0.42      —      —      —      —      —      —      —      —      —      —   0.87
Graham, Watts, & Timbrook, 1991 (1)      1.56   1.36      —      —      —      —      —      —      —      —      —      —   1.40
Lim & Butcher, 1996 (1)                  1.20   0.84      —      —      —      —      —      —      —      —   1.12      —   1.05
Lim & Butcher, 1996 (1)                  1.46   0.74      —      —      —      —      —      —      —      —   0.96      —   1.05
Shores & Carstairs, 1998 (1)             1.97   1.53      —      —      —      —      —      —      —      —   2.34      —   1.95
N                                          22     20     14     11     14     11     11     11     11     11     14      4      —
Overall M                                1.19   1.13   1.12   1.19   1.36   1.56   1.22   1.18   1.52   0.91   1.51   1.14   1.25
M coach                                  0.23   0.65   0.80   0.63   1.04   1.37   1.14   0.81   1.20   0.78   1.20      —   0.89
M no coach                               1.47   1.29   1.31   1.66   1.54   1.72   1.29   1.49   1.80   1.02   1.66   1.14   1.38

Note. comp = comparison; L = Lie; K = Correction; F − K = Infrequency minus Correction; L + K = Lie plus Correction; Mp = Positive Malingering; Wsd = Wiggins’s Social Desirability; Esd = Edwards’s Social Desirability; Tt = Hanley’s Test-Taking Defensiveness; Od = Other Deception; PMH4 = Positive Mental Health 4; S = Superlative; O − S = Obvious − Subtle. 1 = normals faking vs. normals standard; 2 = patients faking vs. patients standard; 3 = patients faking vs. normals standard. Dashes indicate data not available. a Underreporting participants coached on avoiding detection.

Relationships of Effect Sizes to Participant and Methodological Variables

Participant variables. Correlations were computed between mean effect size obtained from each comparison and the following participant variables: number of participants, mean age, percentage male, years of education, and percentage non-Caucasian. In an effort to preserve independence of mean effect sizes, only one mean effect size from each study was used in these correlations. Comparisons in which underreporting participants were coached were eliminated, because these comparisons are relatively infrequent and the mean effect sizes generated from them appear to differ from those in which underreporting participants were not coached. Remaining effect sizes were averaged within studies to yield a single mean effect size from each study. Correlations between these 14 independent mean effect sizes and participant variables were nonsignificant, suggesting that within this literature, there are no relationships between age, sex, education, race, or number of participants and efficacy of underreporting indices in discriminating underreported from standard protocols.

Methodological variables. Relationships between effect sizes and methodological variables can be seen in Table 3. The small number of comparisons available and the nonindependence of some of the effect sizes make statistical analysis of these relationships impractical.
Thus, these findings are only suggestive. A relatively large difference in mean effect sizes can be seen for type of comparison. The most clinically relevant comparison (patients faking good vs. nonpatients given standard instructions) shows a smaller mean effect size (0.82) than the other two types of comparisons (1.33 and 1.40). This finding suggests that studies that use only students or normal volunteers as participants may overestimate the differences between faking and honest respondents in real-world situations and that the most generalizable findings are likely to be obtained from studies in which individuals attempting to conceal significant problems are compared with honest responders who are truly functioning within the normal range. Another difference was noted for the type of scenario presented to underreporting participants. Those given a job application scenario had a higher mean effect size (1.55) than those given a child custody scenario (0.99). However, as several of the groups with custody scenarios also received coaching in avoiding detection, whereas no groups with job application scenarios were coached, it seems likely that the lower effect sizes were due to coaching. Mean effect size for participants warned to respond believably (1.12) was smaller than for those not warned (1.49), supporting Rogers’s (1997) suggestion that these warnings are important. A notable difference can be seen for coaching, with comparisons in which participants were coached in avoiding detection yielding a lower mean effect size (0.89) than comparisons in which participants were not coached (1.38). Mean effect size for participants given incentives (1.16) was somewhat smaller than for those not given incentives (1.37), although this difference may not be significant.

Table 3
Mean Effect Sizes and Methodological Variables

Variable                                     N   Mean d
Type of comparison
  Normal fake good vs. normal standard      16     1.33
  Patient fake good vs. patient standard     3     1.40
  Patient fake good vs. normal standard      3     0.82
Design
  Simulation, between groups                16     1.18
  Simulation, within group                   5     1.55
Type of scenario
  Job application                            6     1.55
  Custody evaluation                         7     0.99
Believability warning
  Yes                                       12     1.12
  No                                         9     1.49
Coaching
  Yes                                        5     0.89
  No                                        17     1.38
Random assignment
  Yes                                        8     1.09
  No                                         6     1.23
Compliance check
  Yes                                       11     1.10
  No                                        10     1.46
Incentive
  Yes                                       11     1.16
  No                                        11     1.37
Random screen
  Yes                                       18     1.27
  No                                         4     1.25
Fake instructions
  Look good                                 14     1.30
  Deny problems                              3     1.57

Classification Accuracy

Sensitivity, specificity, PPP, and NPP were coded or calculated for all studies that examined classification accuracy of underreporting scales. Sensitivity (the percentage of underreporters correctly identified by the scale in question) and specificity (the percentage of nonunderreporters correctly identified) were reported in most studies. Some studies reported these figures by using the single most effective cutting score for their sample. Other studies reported sensitivity and specificity for a variety of cutting scores. In these latter cases, the cutting score with the highest combination of sensitivity and specificity was selected for inclusion in our analyses. Fewer studies reported PPP and NPP, but these values can be calculated from the sensitivity and specificity values provided. However, although sensitivity and specificity remain stable across base rates, PPP and NPP vary with the base rate of underreporting in the sample. If sensitivity and specificity are known, then PPP and NPP can be calculated for any base rate. For most of the studies included in this review, base rates were .50, because underreporting and standard groups usually had equal numbers of participants. However, applied and clinical settings may have different base rates of underreporting. Thus, the most informative PPP and NPP values will be obtained when calculated for a base rate typical of applied settings. To determine such a base rate, we reviewed literature describing the frequency of underreporting on the MMPI-2 in applied settings. Five studies providing relevant data were found.
These studies, summarized in Table 4, examined the base rate of underreporting in three samples of custody litigants and two personnel selection samples. (These studies were not included in the meta-analytic review because they did not define feigning and standard groups independently of MMPI-2 scores.) Strong, Greene, Hoppe, Johnston, and Oleson (1999) used taxometric procedures to determine the prevalence of underreporting, and they obtained a mean base rate of .36 across several analyses. The other studies used the L and K scales to identify underreporting, and Bagby, Nicholson, Buis, Radovanovic, and Fidler (1999) also used a criterion based on the sum of Wsd and S. The base rates obtained range from .20 to .74.

Table 4
Reported Base Rates of Defensiveness on the Minnesota Multiphasic Personality Inventory–2 in Naturalistic Settings

Study                                                   N    Sample                    Criterion for defensiveness   % meeting criterion
Bagby, Nicholson, Buis, Radovanovic, & Fidler, 1999   115    Custody litigants         >65T on L and/or K                   52
                                                                                       >42 on Wsd + S                       74
Bathurst, Gottfried, & Gottfried, 1997                508    Custody litigants         >65T on L                            20
                                                                                       >65T on K                            25
Butcher, Morfitt, Rouse, & Holden, 1997               271    Airline pilot applicants  >65T on L or >70T on K               27
Caldwell-Andrews, 2000                                100    Police applicants         >65T on L                            30
                                                                                       >65T on K                            43
Strong, Greene, Hoppe, Johnston, & Oleson, 1999       412    Custody litigants         Taxometric procedures                36

Note. T = T score; L = Lie; K = Correction; Wsd + S = Wiggins’s Social Desirability plus Superlative scales.
All of these base rates fall within roughly one standard deviation of the mean of these values, except the highest value (.74), which falls two standard deviations from the mean. For this reason, this value was judged to be an outlier and was dropped from consideration. The median of the remaining base rates is .30. If the base rates are separated into job applicant and custody litigant categories, then the median base rate of each category also is .30. Thus, .30 was judged to be the best available estimate of the prevalence of underreporting in relevant applied settings, and PPP and NPP values for the studies included in the meta-analysis were calculated for this base rate.

Table 5 shows the mean cutting scores, sensitivity, specificity, PPP, and NPP (at a base rate of .30) for underreporting indices. Because coaching of the underreporting group appeared to be related to smaller mean effect sizes, classification accuracy values are shown separately for comparisons with and without coaching. Number of comparisons for each underreporting index also is shown (right-hand columns), and the overall mean across underreporting indices is shown in the last row.

Table 5
Mean Cutting Score, Sensitivity, Specificity, and Positive and Negative Predictive Power for Underreporting Indices for Comparisons With and Without Coaching

        M cut score     M sensitivity   M specificity   M PPP (BR = .30)   M NPP (BR = .30)   No. of comparisons
Scale     Co−     Co+     Co−    Co+      Co−    Co+      Co−    Co+         Co−    Co+          Co−    Co+
L         64T     49T     .69    .63      .88    .48      .72    .35         .87    .74           11     4
K         56T     50T     .69    .70      .80    .45      .62    .36         .87    .78           10     4
F − K   −13.6   −8.00     .78    .66      .77    .52      .61    .39         .89    .70           10     4
L + K   24.57   17.75     .82    .63      .82    .53      .68    .37         .91    .77            7     4
Mp      13.00    9.50     .76    .70      .84    .62      .72    .44         .89    .82            6     4
Wsd     16.75   13.75     .80    .76      .88    .74      .75    .56         .91    .88            4     4
Esd     30.25   27.00     .84    .73      .67    .57      .53    .42         .91    .82            4     4
Tt      14.25   11.75     .80    .58      .80    .52      .64    .35         .90    .73            4     4
Od      16.25   12.50     .82    .68      .83    .68      .67    .48         .91    .83            4     4
PMH4    25.50   24.25     .78    .69      .70    .50      .54    .38         .88    .78            4     4
S       30.83   22.75     .84    .71      .81    .61      .65    .44         .92    .82            6     4
M          —       —      .78    .68      .70    .57      .65    .41         .90    .79            —     —

Note. Dash indicates not applicable. BR = base rate; PPP = positive predictive power; NPP = negative predictive power; Co− = underreporting group not coached; Co+ = underreporting group coached; T = T score; L = Lie; K = Correction; F − K = Infrequency minus Correction; L + K = Lie plus Correction; Mp = Positive Malingering; Wsd = Wiggins’s Social Desirability; Esd = Edwards’s Social Desirability; Tt = Hanley’s Test-Taking scale; Od = Other Deception; PMH4 = Positive Mental Health 4; S = Superlative scale.

Many of the available scales have been examined in only a few comparisons, and therefore, the results must be interpreted cautiously. In addition, all of these values are specific to the cutting scores used in each comparison and may have differed if other cutting scores had been used. In general, however, these data suggest that classification accuracy decreases when underreporting respondents have been told about the presence and purpose of validity scales. This pattern is consistent across all validity scales and all measures of classification accuracy. The scale most resistant to the effects of coaching appears to be Wsd, which had the highest values for sensitivity, specificity, PPP, and NPP for the coaching comparisons. The Wsd scale also showed high classification accuracy values for no-coaching comparisons. The traditional L and K scales showed mixed results. The L scale had high specificity, PPP, and NPP, but lower than average sensitivity in the no-coaching comparisons and lower than average accuracy in the coaching comparisons. The K scale showed average accuracy levels in most comparisons.

For clinicians who must make judgments about the veracity of an individual test taker’s responses, PPP and NPP are the most useful measures of classification accuracy. Positive predictive power for uncoached participants ranged from .53 to .75 (M = .65) in these studies. The highest values were noted for Wsd (.75), closely followed by L and Mp (.72), suggesting that, at a base rate of underreporting of .30, roughly 75% of the test takers identified by these scales as underreporting will actually be using this response set.
Negative predictive power for uncoached participants ranged from .87 to .92, suggesting that, at a base rate of .30, most test takers identified by these scales as responding honestly will have been correctly classified.

Incremental Validity

Incremental validity was examined in five of the studies included in the meta-analysis. Findings are mixed and inconclusive. Baer, Wetter, Nichols, Greene, and Berry (1995) found that Wsd and S were more effective in classifying standard and underreported profiles than were L and K. Including Wsd and S in regression analyses resulted in significant increases over L and K in prediction of group membership, and using Wsd and S in addition to L and K resulted in improved classification accuracy. However, this finding was not replicated in a follow-up study by Baer, Wetter, and Berry (1995), who found that Wsd and S yielded no improvement in classification accuracy over the use of L or K alone. Baer and Sekirnjak (1997) found small but statistically significant improvements in prediction of group membership when supplementary underreporting scales were added to regression equations that already included L and K (different scales were significant in different comparisons). However, the changes were too small to translate into improvements in classification accuracy. Bagby, Buis, and Nicholson (1995) found that Mp and O − S had incremental validity over L (and that conversely, L had incremental validity over O − S and Mp), but they did not examine whether this translated into improved classification rates. These findings differed from those of Timbrook, Graham, Keiller, and Watts (1993), who reported that O − S had no incremental validity over L (although L had incremental validity over O − S). Finally, Bagby et al. (1997) found that Od, S, Esd, and Wsd each had incremental validity over L and K in one of three comparisons, but they did not examine whether using these scales in combination resulted in improved classification rates.
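The incremental-validity question examined in these studies can be illustrated in generic form: does adding supplementary scales (here, Wsd and S) to L and K improve prediction of group membership? The sketch below uses simulated scores and ordinary logistic regression; the samples, scale scores, and analytic procedures of the original studies differed, so this shows only the general logic:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    group = np.repeat([0, 1], 100)  # 0 = standard, 1 = underreporting (simulated)
    # Underreporters score roughly 1 SD higher (cf. the mean effect sizes in Table 2).
    L, K = rng.normal(group * 1.2, 1.0), rng.normal(group * 1.1, 1.0)
    Wsd, S = rng.normal(group * 1.6, 1.0), rng.normal(group * 1.5, 1.0)

    base = np.column_stack([L, K])
    full = np.column_stack([L, K, Wsd, S])
    acc_base = cross_val_score(LogisticRegression(), base, group, cv=5).mean()
    acc_full = cross_val_score(LogisticRegression(), full, group, cv=5).mean()
    print(f"L and K alone: {acc_base:.2f}; with Wsd and S added: {acc_full:.2f}")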
Discussion

The findings of this review suggest that groups of defensive and nondefensive test takers differ by an average of 1.25 standard deviations on the underreporting indices of the MMPI-2. For uncoached participants, the mean effect size was 1.38, whereas the mean for coached participants was 0.89. The mean effect size for uncoached participants is larger than the mean of 1.05 reported by Baer et al. (1992) in their meta-analysis of underreporting on the original MMPI, which included no studies of coaching. However, the mean effect size for coached participants is somewhat smaller. Baer et al. (1992) also noted that their mean effect size was considerably smaller than the mean of 2.07 noted by Berry et al. (1991) in a meta-analysis of malingering on the original MMPI, suggesting that underreporting may be more difficult to distinguish from honest responding than is overreporting. Comparison of the current findings with those of Rogers et al. (1994), the most recent meta-analysis of overreporting on the MMPI-2, suggests that this pattern continues. Rogers et al. (1994) found effect sizes ranging from 1.08 to 3.33, with a mean of 2.03.

The findings of this review also suggest that classification accuracy varies across scales and is generally lower when faking participants have been coached on avoiding detection. Classification accuracy figures must be interpreted with caution, because they are based on the assumption that all participants in faking groups were actually faking and that all of those in standard groups refrained from defensive responding. It is possible that some participants did not follow their instructions, and, as noted above, 10 of the studies did not assess compliance with instructions. Nevertheless, the findings suggest that even the most effective underreporting scales will inaccurately label some test takers. The consequences of misclassifying test takers may vary across settings, but could be quite problematic. Thus, in an effort to minimize such errors, it may be important to consider other sources of information, such as interview data, behavioral observations, other self-report data, and collateral information when making decisions about individual protocols (Berry, Baer, Wetter, & Rinaldo, in press).

Classification accuracy figures also are specific to the cutting score used in each study. Widely applicable cutting scores have been difficult to establish for at least two reasons. First, optimal cutting scores vary across studies. In the present review, the small number of studies available did not allow examination of whether this variation is related to other characteristics of the studies (e.g., type of participant), although cutting scores were somewhat lower when faking participants had been coached on avoiding detection.
Second, for this review, optimal cutting scores were defined as those that best balanced sensitivity and specificity. However, as the consequences of Type I and Type II classification errors are likely to vary across settings, optimal cutting scores also will vary, depending on which type of error is more important to minimize in each setting. Thus, clinicians who choose to use a different cutting score in order to minimize a specific type of error may obtain different classification accuracy rates than those presented here.
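The cutting-score selection rule used in this review (the score with the highest combination of sensitivity and specificity) can be sketched as follows, using simulated rather than observed score distributions:

    import numpy as np

    def best_cutoff(honest, faking):
        """Scan candidate cutoffs; keep the one maximizing sensitivity + specificity."""
        best = None
        for c in np.unique(np.concatenate([honest, faking])):
            sens = np.mean(faking >= c)   # feigners at or above the cutoff
            spec = np.mean(honest < c)    # honest responders below it
            if best is None or sens + spec > best[1] + best[2]:
                best = (c, sens, spec)
        return best  # (cutoff, sensitivity, specificity)

    rng = np.random.default_rng(1)
    honest = rng.normal(50, 10, 200)  # hypothetical T-score-like distribution
    faking = rng.normal(62, 10, 200)  # about 1.2 SDs higher
    print(best_cutoff(honest, faking))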
Several of the methodological weaknesses noted by Baer et al. (1992) and Rogers (1997) have been addressed in some of the current literature. For example, most of the comparisons included realistic scenarios for underreporting participants to imagine, and many included warnings to respond believably. Half of the comparisons included assessment of participants’ understanding of or compliance with instructions, and half included incentives for successful faking. However, methodological problems remain common in this area. Many of the simulation designs compared students instructed to fake with students given standard instructions. The generalizability of findings from these comparisons to populations in which underreporting is likely to occur, such as personnel selection and child custody, is unclear. Although many individuals in these settings may fall within the normal range of psychological functioning, the most important discrimination for the clinician is between test takers truly functioning within normal ranges and those trying to conceal significant problems. Simulation designs might shed more light on this discrimination if they compare clinical samples instructed to appear well adjusted with nonclinical samples given the standard instructions. Only three of the comparisons reported here fit this description. Thus, it is very important that future studies include more comparisons that use clinically relevant samples.

Another problem in the current literature is that simulation designs are far more common than differential prevalence and known-groups designs. In fact, no known-groups designs, and only one differential prevalence design, were found. As noted above, Rogers (1997) has emphasized the importance of converging evidence from all designs in evaluating the efficacy of validity scales. Research on malingering has shown that tests that do well in simulation designs are not always effective in known-groups designs (Gillis, Rogers, & Bagby, 1991; Lewis, Simcox, & Berry, in press). Thus, although more difficult to conduct, differential prevalence and known-groups designs are critically important for the advancement of this area.

The current literature suggests that when test takers have been coached about the presence and purpose of validity scales, underreporting is more difficult to detect. Previous researchers have noted that professional materials containing information about validity scales are readily available to both lay persons and attorneys and that attorneys may transmit this information to their clients who will undergo psychological testing (Baer, Wetter, & Berry, 1995; Lamb, Berry, Wetter, & Baer, 1994; Rogers, Bagby, & Chakraborty, 1993; Wetter & Corrigan, 1995). Thus, it seems important to develop validity scales that remain effective even when test takers have been coached. To date, the literature suggests that the Wsd scale shows the most promising efficacy in coached respondents. However, this pattern has been reported in only two studies (Baer & Sekirnjak, 1997; Baer, Wetter, & Berry, 1995). Thus, additional research on validity scales resistant to the effects of coaching is warranted.

The extent to which supplementary underreporting scales should be used is an important question. The current literature suggests that the traditional L and K scales are reasonably accurate in detecting uncoached underreporters. Several scales showed higher mean sensitivity than L, but none showed higher specificity. Only one scale (Wsd) showed higher PPP, and the NPP values were very similar across scales. Studies of the incremental validity of supplementary scales over the traditional L and K scales produced mixed findings. The use of multiple scales to identify feigners can alter the probability of Type I and Type II errors, depending on how they are used. If the criterion for classification as a feigner is an elevation on any single scale, then use of multiple scales increases the probability of Type I error (identifying an honest responder as a faker). However, if the criterion for classification as a feigner is elevation on all available scales, then use of supplementary scales will reduce the probability of Type I error but will increase the likelihood of Type II error. Because the literature reviewed here does not provide clear support for the use of combinations of traditional and supplementary scales to identify underreporting, it may not be advisable to consider all of the supplementary scales in clinical practice. The scales Wsd and S have shown incremental validity in a few comparisons, and they warrant additional research to determine whether the routine use of either or both can be recommended. However, relying on L and K, for which much larger bodies of supporting data are available, may be the most defensible approach.
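The behavior of these two decision rules can be made concrete under the simplifying, and admittedly unrealistic, assumption that the scales err independently; because validity scales are correlated in practice, the figures produced below bracket rather than estimate the true error rates:

    def false_pos_any(alpha, k):
        """P(an honest responder elevates on at least one of k scales)."""
        return 1 - (1 - alpha) ** k

    def false_pos_all(alpha, k):
        """P(an honest responder elevates on all k scales)."""
        return alpha ** k

    for k in (1, 2, 4):  # each scale assumed to have a 10% false-positive rate
        print(k, round(false_pos_any(0.10, k), 3), round(false_pos_all(0.10, k), 3))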
Another important question is the extent to which individuals with elevations on validity scales in settings that provide incentives for underreporting, such as personnel selection and child custody evaluation, are concealing significant problems or merely “putting their best foot forward” in a manner that should be considered normal under the circumstances. Bathurst, Gottfried, and Gottfried (1997) argued that custody litigants are largely free of psychopathology and that elevations on Scales 3 and 6, which are the most common in this group, can be attributed to the stresses associated with divorce and custody proceedings in an adversarial legal system. However, they also noted, as did Bagby et al. (1999), that it cannot be determined from the MMPI-2 data whether the defensive response style observed is “an overestimate of mental health in a psychologically healthy population or an attempt by psychologically disturbed individuals to conceal symptomatology” (Bathurst et al., 1997, p. 209). This question can only be resolved with extratest data, such as behavioral observations or ratings by others (Bagby et al., 1999). Thus, because child custody and personnel selection procedures might encourage some test takers to respond defensively even if they have nothing significant to hide, it is important to develop ways of discriminating defensive responders who are concealing psychopathology from defensive responders who are functioning normally, and to clarify the proportion of defensive responders in these settings who are hiding problems. If many of the defensive responders in personnel selection and custody evaluation settings are not concealing significant psychopathology, but are responding defensively because of situational demands, then the base rates of clinically significant defensiveness may be lower than the estimates provided in Table 4.

In summary, research on the detection of underreporting has advanced in several ways since the publication of the MMPI-2. However, important problems remain to be addressed. The use of known-groups and differential prevalence designs remains very rare, and most simulation designs have used students rather than clinical or “real world” comparison groups. The detection of coached feigners and the incremental validity of supplementary scales over the traditional L and K scales require further investigation. The prevalence of significant symptomatology in personnel selection and custody evaluation settings warrants additional exploration. In the meantime, the L and K scales continue to be reasonably effective in identifying defensive responding in uncoached samples, and the Wsd scale may be useful in samples for which coaching is likely. Because underreporting remains more difficult to detect than overreporting, suspicions about underreporting that are triggered by elevations on these scales should be investigated through interview, behavioral observations, other self-report inventories, and collateral sources of information, as available and appropriate (Berry et al., in press).

References

References marked with an asterisk indicate studies included in the meta-analysis.

*Austin, J. S. (1992). The detection of fake good and fake bad on the MMPI-2. Educational and Psychological Measurement, 52, 669–674.
*Baer, R. A., & Sekirnjak, G. (1997). Detection of underreporting on the MMPI-2 in a clinical population: Effects of information about validity scales. Journal of Personality Assessment, 69, 555–567.
Baer, R. A., Wetter, M. W., & Berry, D. T. R. (1992). Detection of underreporting of psychopathology on the MMPI: A meta-analysis. Clinical Psychology Review, 12, 509–525.
*Baer, R. A., Wetter, M. W., & Berry, D. T. R. (1995). Effects of information about validity scales on underreporting of symptoms on the MMPI-2: An analogue investigation. Assessment, 2, 189–200.
*Baer, R. A., Wetter, M. W., Nichols, D. S., Greene, R. L., & Berry, D. T. R. (1995). Sensitivity of MMPI-2 validity scales to underreporting of symptoms. Psychological Assessment, 7, 419–423.
*Bagby, R. M., Buis, T., & Nicholson, R. A. (1995). Relative effectiveness of the standard validity scales in detecting fake-bad and fake-good responding: Replication and extension. Psychological Assessment, 7, 84–92.
Bagby, R. M., Nicholson, R. A., Buis, T., Radovanovic, H., & Fidler, B. J. (1999). Defensive responding on the MMPI-2 in family custody and access evaluations. Psychological Assessment, 11, 24–28.
*Bagby, R. M., Rogers, R., & Buis, T. (1994). Detecting malingered and defensive responding on the MMPI-2 in a forensic inpatient sample. Journal of Personality Assessment, 62, 191–203.
*Bagby, R. M., Rogers, R., Buis, T., & Kalemba, V. (1994). Malingered and defensive response styles on the MMPI-2: An examination of validity scales. Assessment, 1, 31–38.
*Bagby, R. M., Rogers, R., Nicholson, R. A., Buis, T., Seeman, M. V., & Rector, N. A. (1997). Effectiveness of the MMPI-2 validity indicators in the detection of defensive responding in clinical and nonclinical samples. Psychological Assessment, 9, 406–413.
Bathurst, K., Gottfried, A. W., & Gottfried, A. E. (1997). Normative data for the MMPI-2 in child custody litigation. Psychological Assessment, 9, 205–211.
Ben-Porath, Y. S. (1994). The ethical dilemma of coached malingering research. Psychological Assessment, 6, 14–15.
Berry, D. T. R., Baer, R. A., & Harris, M. J. (1991). Detection of malingering on the MMPI: A meta-analysis. Clinical Psychology Review, 11, 585–598.
Berry, D. T. R., Baer, R. A., Wetter, M. W., & Rinaldo, J. (in press). Assessment of malingering. In J. N. Butcher (Ed.), Clinical personality assessment (2nd ed.). New York: Oxford University Press.
Berry, D. T. R., Lamb, D. G., Wetter, M. W., Baer, R. A., & Widiger, T. A. (1994). Ethical considerations in research on coached malingering. Psychological Assessment, 6, 16–17.
Borum, R., & Stock, H. V. (1993). Detection of deception in law enforcement applicants. Law and Human Behavior, 17, 157–166.
*Brems, C., & Harris, K. (1996). Faking the MMPI-2: Utility of the subtle-obvious scales. Journal of Clinical Psychology, 52, 525–533.
*Butcher, J. N. (1994). Psychological assessment of airline pilot applicants with the MMPI-2. Journal of Personality Assessment, 62, 31–44.
Butcher, J., Dahlstrom, W., Graham, J., Tellegen, A., & Kaemmer, B. (1989). Manual for administering and scoring the MMPI-2. Minneapolis: University of Minnesota Press.
Butcher, J. N., & Han, K. (1995). Development of an MMPI-2 scale to assess the presentation of self in a superlative manner: The S scale. In J. N. Butcher & C. D. Spielberger (Eds.), Advances in personality assessment (Vol. 10, pp. 25–50). Hillsdale, NJ: Erlbaum.
Butcher, J. N., Morfitt, R. C., Rouse, S. V., & Holden, R. R. (1997). Reducing MMPI-2 defensiveness: The effect of specialized instructions on retest validity in a job applicant sample. Journal of Personality Assessment, 68, 385–401.
Caldwell-Andrews, A. A. (2000). Relationships between MMPI-2 validity scales and NEO PI-R experimental validity scales in police candidates. Unpublished dissertation.
*Cassisi, J. E., & Workman, D. E. (1992). The detection of malingering and deception with a short form of the MMPI-2 based on the L, F, and K scales. Journal of Clinical Psychology, 48, 54–58.
Cofer, C. N., Chance, J., & Judson, A. J. (1949). A study of malingering on the MMPI. Journal of Psychology, 27, 491–499.
Edwards, A. L. (1957). The social desirability variable in personality assessment and research. New York: Dryden.
Finn, S. E., & Kamphuis, J. H. (1995). What a clinician needs to know about base rates. In J. N. Butcher (Ed.), Clinical personality assessment (1st ed.). New York: Oxford University Press.
Gillis, J. R., Rogers, R., & Bagby, R. M. (1991). Validity of the M Test: Simulation-design and natural-group approaches. Journal of Personality Assessment, 57, 130–140.
Gough, H. G. (1950). The F minus K dissimulation index for the MMPI. Journal of Consulting Psychology, 14, 408–413.
Graham, J. R. (2000). MMPI-2: Assessing personality and psychopathology (3rd ed.). New York: Oxford University Press.
*Graham, J. R., Watts, D., & Timbrook, R. E. (1991). Detecting fake-good and fake-bad MMPI-2 profiles. Journal of Personality Assessment, 57, 264–277.
Hanley, C. (1957). Deriving a measure of test-taking defensiveness. Journal of Consulting Psychology, 21, 391–397.
Hathaway, S. R., & McKinley, J. C. (1983). The Minnesota Multiphasic Personality Inventory manual. New York: Psychological Corporation.
Lamb, D. G., Berry, D. T. R., Wetter, M. W., & Baer, R. A. (1994). Effects of two types of information on malingering of closed head injury on the MMPI-2: An analogue investigation. Psychological Assessment, 6, 8–13.
Lees-Haley, P. R. (1992). Psychodiagnostic test usage by forensic psychologists. American Journal of Forensic Psychology, 10, 25–30.
Lees-Haley, P. R. (1997). Attorneys influence expert evidence in forensic psychological and neuropsychological cases. Assessment, 4, 321–324.
Lewis, J. L., Simcox, A. J., & Berry, D. T. R. (in press). Known groups validation of MMPI-2 validity scales and the Structured Inventory of Malingered Symptoms (SIMS) for malingering screening in a forensic sample. Psychological Assessment.
*Lim, J., & Butcher, J. N. (1996). Detection of faking on the MMPI-2: Differentiation among faking-bad, denial, and claiming extreme virtue. Journal of Personality Assessment, 67, 1–25.
Nichols, D. S. (1991). Development of a global measure for positive mental health. Unpublished manuscript.
Nichols, D. S., & Greene, F. L. (1991, March). New measures for dissimulation on the MMPI/MMPI-2. Paper presented at the 26th annual symposium on Recent Developments in the Use of the MMPI, St. Petersburg Beach, FL.
Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of Personality and Social Psychology, 46, 598–609.
Paulhus, D. L. (1986). Self-deception and impression management in test responses. In A. Angleitner & J. S. Wiggins (Eds.), Personality assessment via questionnaires: Current issues in theory and measurement (pp. 143–165). Berlin: Springer-Verlag.
Paulhus, D. L. (1998). Paulhus Deception Scales (PDS): The Balanced Inventory of Desirable Responding-7. North Tonawanda, NY: Multi-Health Systems.
Rogers, R. (1997). Researching dissimulation. In R. Rogers (Ed.), Clinical assessment of malingering and deception (2nd ed., pp. 309–327). New York: Guilford Press.
Rogers, R., Bagby, R. M., & Chakraborty, D. (1993). Feigning schizophrenic disorders on the MMPI-2: Detection of coached simulators. Journal of Personality Assessment, 60, 215–226.
Rogers, R., Sewell, K. R., & Salekin, R. T. (1994). A meta-analysis of malingering on the MMPI-2. Assessment, 1, 227–237.
Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.
*Shores, E. A., & Carstairs, J. R. (1998). Accuracy of the MMPI-2 computerized Minnesota Report in identifying fake-good and fake-bad response sets. The Clinical Neuropsychologist, 12, 101–106.
Strong, D. R., Greene, R. L., Hoppe, C., Johnston, T., & Oleson, N. (1999). Taxometric analysis of impression management and self-deception on the MMPI-2 in child custody litigants. Journal of Personality Assessment, 73, 1–18.
Timbrook, R. E., Graham, J. R., Keiller, S. W., & Watts, D. (1993). Comparison of the Wiener–Harmon Subtle–Obvious scales and the standard validity scales in detecting valid and invalid MMPI-2 profiles. Psychological Assessment, 5, 53–61.
Wetter, M. W., & Corrigan, S. K. (1995). Providing information to clients about psychological tests: A survey of attorneys’ attitudes. Professional Psychology: Research and Practice, 26, 474–477.
Wiener, D. N. (1948). Subtle and obvious keys for the MMPI. Journal of Consulting Psychology, 12, 164–170.
Wiggins, J. S. (1959). Interrelationships among MMPI measures of dissimulation under standard and social desirability instructions. Journal of Consulting Psychology, 23, 419–427.

Received March 5, 2001
Revision received July 23, 2001
Accepted September 27, 2001