Psychological Assessment
2002, Vol. 14, No. 1, 16–26
Copyright 2002 by the American Psychological Association, Inc.
1040-3590/02/$5.00 DOI: 10.1037//1040-3590.14.1.16
Underreporting of Psychopathology on the MMPI-2:
A Meta-Analytic Review
Ruth A. Baer and Joshua Miller
University of Kentucky
Meta-analytic techniques were applied to studies of the MMPI-2 in which participants given standard
instructions were compared with participants instructed or believed to have been underreporting.
Traditional and supplementary indices of underreporting yielded a mean effect size of 1.25, suggesting
that underreporting respondents differ from those responding honestly by a little more than 1 standard
deviation, on the average, on these scales. Analyses of classification accuracy suggested that several
scales are moderately effective in detecting underreporting, although accuracy decreases if participants
have been coached about validity scales. Base rates of defensive responding in relevant populations are
reviewed, and methodological issues, including research designs, coaching, and incremental validity of
supplementary underreporting scales, are discussed.
Self-report inventories are more likely to yield accurate and
useful information when test takers respond honestly (Graham,
2000). Response biases such as overreporting of symptoms
(malingering or “faking bad”) and underreporting of symptoms
(defensiveness or “faking good”) can result in invalid and
misleading test results. Clinicians are most likely to encounter
these response biases in settings that provide test takers with
substantial incentives for distortion. For example, evaluations
for competency to stand trial and lawsuits for psychological
damages can provide strong incentives to exaggerate or fabricate psychological symptoms. Applicants for jobs or training
programs and divorcing parents undergoing custody evaluations
may have powerful reasons to present themselves in unrealistically positive terms, perhaps by attempting to conceal existing
symptoms of psychopathology.
The Minnesota Multiphasic Personality Inventory (MMPI;
Hathaway & McKinley, 1983) was among the first self-report
inventories to include validity scales designed to detect these
response biases. Its successor, the MMPI-2 (Butcher, Dahlstrom,
Graham, Tellegen, & Kaemmer, 1989) is widely used in clinical,
legal, and organizational settings (Graham, 2000; Lees-Haley,
1992). The efficacy of the validity scales of the MMPI and
MMPI-2 in detecting fake-bad and fake-good response biases has
been extensively studied. Berry, Baer, and Harris (1991), in a
meta-analytic review of the detection of malingering on the original MMPI, found that the Infrequency (F) scale and Infrequency
minus Correction (F − K) index (Gough, 1950) were quite effective in discriminating honest from malingering respondents. Rogers, Sewell, and Salekin (1994) obtained similar results in a meta-analysis of malingering on the MMPI-2. Baer, Wetter, and Berry
(1992) conducted a meta-analysis of the detection of underreporting on the original MMPI, and they found an overall mean effect
size of 1.05, indicating that participants who underreported psychopathology differed from those who responded honestly by
approximately one standard deviation, on the average, on underreporting indices. For the traditional Lie (L) and Correction (K)
scales, effect sizes of just under one standard deviation were noted.
Effect sizes of approximately 1.5 were noted for two supplementary scales: the Positive Malingering scale (Mp) and Wiggins’s
Social Desirability scale (Wsd; described below). Widely varied
cutting scores were reported across studies. In general, underreporting was noted to be more difficult to detect than overreporting.
Currently, no reviews of the detection of underreporting on the
MMPI-2 are available, although several studies of this question
have been published. Thus, the purpose of the present review is to
apply meta-analytic strategies to an evaluation of the literature on
detection of underreporting on the MMPI-2. The utility of available indices of underreporting is investigated, and current methodological issues in this area of research are explored.
Underreporting Indices on the MMPI-2
The most widely used indices of underreporting on the MMPI-2
are the traditional L and K scales. The F − K index (Gough, 1950)
also is frequently used. In addition, several supplementary underreporting scales are available. Some of these were developed for
the original MMPI. For example, Cofer, Chance, and Judson
(1949) described the use of the L + K index, in which the raw L
and K scores are summed. These authors also developed the Mp
scale, which consists of items that participants answered in the
socially undesirable direction when responding honestly or faking
bad, but in the opposite direction when faking good. Edwards’s
(1957) Social Desirability scale (Esd) includes items for which 10
judges unanimously agreed on the socially desirable response.
Wiggins’s (1959) Social Desirability scale includes items shown to
discriminate participants instructed to respond in a socially desirable manner from those given the standard instructions. Hanley’s
(1957) Test-Taking Defensiveness scale (Tt) includes items for
which judges agreed on the socially desirable response, but which
Ruth A. Baer and Joshua Miller, Department of Psychology, University
of Kentucky.
Correspondence concerning this article should be addressed to Ruth A.
Baer, Department of Psychology, 115 Kastle Hall, University of Kentucky,
Lexington, Kentucky 40506-0044. E-mail: rbaer@uky.edu
UNDERREPORTING ON MMPI-2
only about half of a normative group endorsed in the socially
desirable direction. Wiener (1948) identified Obvious items,
whose content clearly reflected psychological disturbance, and
Subtle items, whose relationship to psychopathology was unclear.
Respondents who are underreporting should obtain low scores on
the Obvious items, but often obtain high scores on the Subtle
items, and negative scores when the Subtle scale is subtracted from the
Obvious scale (O − S).
Several underreporting scales have been developed for the
MMPI-2. The Other Deception (Od) scale (Nichols & Greene,
1991) consists of items from the Mp and Wsd scales and was
designed to measure the other-deception factor of Paulhus’s (1984,
1986) two-factor model of socially desirable responding. “Other
deception,” or impression management, is a deliberate attempt to
present an unrealistically favorable self-description, whereas “self-deception” is an overly positive self-presentation, akin to narcissism, that the respondent believes to be true (Paulhus, 1998). The
Positive Mental Health (PMH4) scale (Nichols, 1992) consists of
items that appear on at least 4 of 28 supplementary scales designed
to measure a positive trait. The Superlative (S) scale (Butcher &
Han, 1995) consists of items shown to discriminate a large group
of pilots seeking employment with a major airline from the
MMPI-2 normative sample.
Methodological Issues in Underreporting Research
Research Design
Several research designs have been used in the literature on
detection of response biases (Rogers, 1997). The most common is
the simulation design, in which volunteers from either clinical or
nonclinical populations complete the MMPI-2 in accordance with
instructions provided by the experimenter. Generally, a group
instructed to feign is compared to a group given standard instructions. Although this design can be useful in clarifying potential
differences between honest and feigning respondents, the extent to
which responses of experimental feigners resemble those of “real
world” feigners is unclear. Rogers (1997) suggested several strategies for increasing the ecological validity of the simulation design. For example, participants should be similar to those with
whom the test is used in clinical practice, and participants instructed to feign should be given a realistic scenario in which
feigning might occur (e.g., “Imagine you are applying for a highly
desirable job”). They should be instructed to “be believable” in
their feigned presentation and should be provided with incentives
for successful feigning. Their understanding of and compliance
with their feigning instructions should be assessed.
In differential prevalence designs, participants who are believed
because of their circumstances to have strong incentives for faking
are compared with those who appear to have no such motives. For
example, anonymous volunteers given standard instructions might
be compared with a group being evaluated for child custody
proceedings (also given standard instructions). On the average, the
custody evaluees would be expected to obtain higher scores on
underreporting scales. Unfortunately, as the underreporting status
of individual participants in the custody group is unknown, this
design cannot yield precise information about the classification
accuracy of validity scales.
The known-groups design compares scores on validity scales
from individuals known to have responded honestly with those
known to have distorted their responses. In these studies, faking
instructions are not provided by an experimenter. Instead, a subset
of participants who have completed the MMPI-2 in a clinical or
applied setting are discovered to have misrepresented themselves,
and these participants are compared with those who are believed to
have responded honestly. Results from such studies may be more
generalizable than results from simulation designs, because participants have independently chosen to feign good or bad adjustment, rather than being instructed by an experimenter to do so.
However, because the determination of whether participants have
feigned must be made independently of the validity scale being
investigated, the accuracy of the method used to identify feigners
and nonfeigners is important. For example, Borum and Stock
(1993) described a sample of police applicants who were confronted about inconsistencies in their applications and confessed to
having misrepresented themselves. Their scores on validity scales
were significantly different from those of a group not suspected of deception. Unfortunately, as confessions of feigning appear to be rare,
and as known feigners may not be representative of feigners who
are never caught, the results of such studies may not be generalizable to unidentified feigners.
Rogers (1997) suggested that confidence in the utility of validity
scales increases with supporting evidence across several research
designs. Initial research on a validity scale most often uses the
simulation design. When the potential utility of a scale has been
well supported through simulation designs, differential prevalence
and known-groups designs can provide valuable information about
the generalizability of findings to clinically important situations.
This review summarizes the use of these designs in the current
literature, and the extent to which Rogers’s (1997) methodological
recommendations are followed.
Coaching
Wetter and Corrigan (1995) surveyed law students and practicing attorneys about their attitudes toward preparation of clients for
psychological evaluations, and they found that the majority of law
students and attorneys believed that educating clients about the
tests they will complete is their professional responsibility. Half of
the attorneys and 33% of the students believed that they should tell
clients about the presence and purpose of validity scales on these
tests. Thus, it seems likely that some proportion of test takers
involved in legal proceedings may have been coached by their
attorneys in how to complete the tests. Lees-Haley (1997) suggested that coaching by attorneys of clients in forensic cases is
very common. Motivated clients also may provide their own
coaching by studying professional materials about psychological
assessment available in libraries and bookstores (Baer, Wetter, &
Berry, 1995). Although researchers must avoid compromising the
integrity of psychological tests by publishing information about
effective feigning strategies (Ben-Porath, 1994; Berry, Lamb, Wetter, Baer, & Widiger, 1994), it seems important to investigate the
effects of coaching and methods for detecting coached feigners.
This review summarizes the current literature on coaching of
underreporting respondents.
Base Rates
The accuracy of any validity scale depends on the sensitivity
and specificity of the scale as well as the base rate of response
BAER AND MILLER
distortion in the population (Finn & Kamphuis, 1995). In clinical
practice, positive predictive power (PPP) and negative predictive
power (NPP) are the most relevant classification accuracy statistics. Positive predictive power is the likelihood that an individual
classified as feigning by the validity scale is truly feigning,
whereas NPP is the likelihood that an individual classified as
nonfeigning by the validity scale is actually responding honestly.
A validity scale with high sensitivity and specificity may have low
PPP in a population with a low base rate of response bias. For
example, for a validity scale with sensitivity and specificity of .90,
PPP in a population with a 5% base rate of response bias is only
.32, meaning that of every three individuals in this population who
are classified as feigning by the scale, only one of them is actually
feigning. Because the consequences of mistakenly accusing a test
taker of feigning may be quite serious, the PPP of the available
scales should be carefully evaluated in populations with very low
base rates of response bias. This review summarizes current findings about the base rate of underreporting in clinical settings, as
well as the classification accuracy of the available underreporting
scales at representative base rates.
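As an illustration, the example above can be reproduced with the standard Bayesian expressions for predictive power. This is a sketch only: the symbols Sens, Spec, and BR (base rate) are shorthand introduced here for the worked arithmetic and are not notation taken from the studies reviewed.

```latex
\begin{align*}
\mathrm{PPP} &= \frac{\mathrm{Sens}\cdot\mathrm{BR}}
                    {\mathrm{Sens}\cdot\mathrm{BR} + (1-\mathrm{Spec})(1-\mathrm{BR})}
              = \frac{(.90)(.05)}{(.90)(.05) + (.10)(.95)}
              = \frac{.045}{.140} \approx .32 \\[4pt]
\mathrm{NPP} &= \frac{\mathrm{Spec}\,(1-\mathrm{BR})}
                    {\mathrm{Spec}\,(1-\mathrm{BR}) + (1-\mathrm{Sens})\,\mathrm{BR}}
              = \frac{(.90)(.95)}{(.90)(.95) + (.10)(.05)}
              = \frac{.855}{.860} \approx .99
\end{align*}
```

At this base rate, the same scale that is wrong about roughly two of every three respondents it flags is almost never wrong about respondents it classifies as honest.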
Incremental Validity
As noted above, the traditional L and K scales are the most
widely used indices of underreporting on the MMPI-2, although
several supplementary scales are available. Before the use of
supplementary scales can be recommended, it is important to
determine whether these scales are more effective than the established scales and whether they show incremental validity over
these scales in discriminating valid from invalid protocols. If they
do not, then little will be gained by scoring and interpreting them.
In a meta-analysis of the scales available for the original MMPI,
Baer et al. (1992) found promising classification accuracy for
several scales, including L and K. Although two supplementary
scales (Mp and Wsd) obtained somewhat higher mean classification rates, this literature did not examine whether using these
scales in place of, or in addition to, the routinely scored L and K
scales would result in improved classification accuracy. However,
more recent studies with the MMPI-2 have explored this question.
Findings are summarized in this review.
Method
Literature Search
A computer search was conducted of titles including the words MMPI-2,
dissimulation, faking good, simulation, or underreporting. Tables of contents of recent issues of assessment journals were scanned. Reference lists
of all articles obtained were searched for additional articles. Studies were
included in the review if they (a) were published in English language
journals, (b) compared a group of participants instructed or presumed to
have been underreporting on the MMPI-2 with a group instructed or
presumed to have followed the standard instructions, (c) included at least
one scale or index of underreporting, and (d) reported effect sizes or
provided enough data to calculate them. Unpublished dissertations and
convention papers were excluded from the review. Fourteen studies meeting these criteria were identified.
Coding
Demographic variables coded for each study included number and type
of participants (i.e., students, patients, job applicants, or community members) and age, sex, education, race, and diagnoses of participants. Methodological variables included (a) research design (i.e., simulation within
groups, simulation between groups, differential prevalence, known
groups), (b) type of instructions provided to underreporting participants
(e.g., look healthy, normal, well adjusted, deny weaknesses, symptoms,
etc.), (c) whether a scenario was provided and what type (e.g., imagine
trying to get a good job, win a custody dispute, etc.), (d) presence of
instructions to be believable, (e) random assignment to groups, (f) whether
participants’ understanding of or compliance with instructions was assessed, (g) whether participants were screened for random responding, and
(h) presence of coaching. Coaching was defined to include statements to
participants about the presence and purpose of validity scales on the
MMPI-2. Outcome variables included means and standard deviations for
all underreporting indices; effect sizes (Cohen’s d) for underreporting
indices, if reported; indicators of classification accuracy (sensitivity, specificity, overall hit rate, PPP, and NPP); and incremental validity of underreporting scales, if reported.
Effect sizes were calculated for studies that did not report them. For
studies using a between-groups design, the following formula was used:
$d = (M_u - M_s) / SD_p$, in which $M_u$ indicates the mean of the underreporting
group on an index of underreporting, $M_s$ indicates the mean of the standard
group on that index, and $SD_p$ indicates the pooled standard deviation of the
two groups. For studies using a within-groups design, effect size was
calculated from t or F(1 df). All calculations of effect sizes used methods
described by Rosenthal (1984). We calculated PPP and NPP by using
sensitivity, specificity, and base-rate data provided in each study.
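The between-groups effect size formula above can be sketched as follows. This is a minimal illustration rather than the exact procedure used in the review: the function names and example values are hypothetical, and the pooled standard deviation is computed with the common degrees-of-freedom-weighted formula, which is an assumption because the text does not state how the pooling was done.

```python
import math


def pooled_sd(sd_u, n_u, sd_s, n_s):
    """Pooled SD of two groups, weighting each variance by its degrees of freedom.

    Assumption: this weighting is the conventional one; the source text only
    says "pooled standard deviation".
    """
    return math.sqrt(
        ((n_u - 1) * sd_u ** 2 + (n_s - 1) * sd_s ** 2) / (n_u + n_s - 2)
    )


def cohens_d(mean_u, sd_u, n_u, mean_s, sd_s, n_s):
    """d = (M_u - M_s) / SD_p for an underreporting (u) and a standard (s) group."""
    return (mean_u - mean_s) / pooled_sd(sd_u, n_u, sd_s, n_s)


# Hypothetical T scores on an underreporting index (not data from any study).
d = cohens_d(mean_u=60.0, sd_u=10.0, n_u=20, mean_s=50.0, sd_s=10.0, n_s=20)
print(f"d = {d:.2f}")
```

A positive d indicates that the underreporting group scored higher on the index than the standard group, which is the direction expected for indices such as L and K.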
A second rater independently coded 9 of the 14 studies. No significant
disagreements were noted.
Results
Characteristics of Studies
Table 1 shows demographic and methodological characteristics
for the 22 comparisons drawn from the 14 studies reviewed. The number of
participants per group ranged from 18 to 437 for underreporting
groups and from 18 to 1,138 in standard groups. Excluding one
study with unusually large groups (Butcher, 1994), the mean
number of participants per underreporting group was 39 (SD = 18)
and the mean number per standard group was 74 (SD = 41). Of
the 22 comparisons reported, 15 compared students or community
members instructed to underreport with students or community
members given standard instructions or told to respond honestly.
Three compared patients instructed to underreport with patients
given standard instructions. Three compared patients instructed to
underreport with students or community members given standard
instructions, and one compared job applicants with the MMPI-2
normative sample.
Mean age of participants ranged from 19 to 40 years, with a
mean of 27.1 (SD = 7.11). The percentage of participants who were
male ranged from 20% to 100%, with a mean of 42% (SD =
16.66%). Years of education ranged from 12 to 16, with an overall
mean of 14.17 (SD = 0.88). The percentage of participants who were
non-Caucasian ranged from 0% to 45%, with an overall mean
of 10.6% (SD = 12.46%).
Of the 22 comparisons reported, 21 used simulation designs. Of
these, 16 were between-group comparisons, in which a group
instructed to underreport was compared with a group given standard instructions. Five were within-group comparisons, in which
all participants completed the MMPI-2 under both underreporting
and standard instructions. One study used a differential prevalence
Table 1
Demographic and Methodological Characteristics for Studies Included in the Meta-Analysis

Study
  Group  N      Type  Age  Educ  % male  % min  Design  Scen   Belv  Comp  Inc  Rand scrn  FG inst

Austin, 1992
  FG     40     stu   —    14    —       —      BGsim   job    Y     N     N    N          LG
  Std    33     stu   —    14    —       —
Baer & Sekirnjak, 1997
  FG     20     pat   36   15    35      0      BGsim   cust   N     Y     Y    Y          LG
  Std    20     pat   36   15    35      0
  FG     20     pat   36   15    35      0      BGsim   cust   N     Y     Y    Y          LG
  Std    20     com   36   15    35      5
  FG-C   20     pat   36   15    35      0      BGsim   cust   Y     Y     Y    Y          LG
  Std    20     pat   36   15    35      0
  FG-C   20     pat   36   15    35      0      BGsim   cust   Y     Y     Y    Y          LG
  Std    20     com   36   15    35      5
Baer, Wetter, & Berry, 1995
  FG     24     stu   20   13    54      21     BGsim   cust   N     Y     Y    Y          LG&DW
  Std    23     stu   19   13    43      22
  FG-LC  24     stu   19   12    33      13     BGsim   cust   Y     Y     Y    Y          LG&DW
  Std    23     stu   19   13    43      22
  FG-HC  24     stu   19   13    42      0      BGsim   cust   Y     Y     Y    Y          LG&DW
  Std    23     stu   19   13    43      22
Baer, Wetter, Nichols, et al., 1995
  FG     50     c&s   24   14    52      5      BGsim   job    N     Y     Y    Y          LG
  Std    50     c&s   24   14    52      5
Bagby, Buis, & Nicholson, 1995
  FG     70     stu   22   14    20      —      BGsim   mix    Y     N     Y    Y          LG
  Std    198    stu   22   14    29      —
Bagby, Rogers, & Buis, 1994
  FG     67     stu   23   14    35      —      BGsim   mix    Y     N     Y    Y          DW
  Std    90     stu   23   14    35      —
  FG     67     stu   22   14    31      —      BGsim   mix    Y     N     Y    N          LG
  Std    90     stu   22   14    31      —
Bagby, Rogers, Buis, & Kalemba, 1994
  FG-C   49     stu   24   14    31      —      WGsim   job    Y     N     N    Y          DW
  Std    49     stu   24   14    31      —
Bagby et al., 1997
  FG     38     pat   40   —     61      —      WGsim   none   N     N     N    Y          LG
  Std    38     pat   40   —     61      —
  FG     38     pat   40   —     61      —      BGsim   none   N     N     N    Y          LG
  Std    49     stu   24   14    31      —
Brems & Harris, 1996
  FG     40     stu   31   14    28      17     BGsim   court  N     N     N    N          LG
  Std    40     stu   31   14    28      17
Butcher, 1994
  FG     437    appl  —    16    100     —      DP      none   n/a   n/a   n/a  N          n/a
  Std    1,138  com   —    —     100     —
Cassisi & Workman, 1992
  FG     20     stu   22   14    58      45     BGsim   mix    Y     Y     N    N          LG&DW
  Hon    20     stu   22   14    58      45
Graham, Watts, & Timbrook, 1991
  FG     56     stu   19   14    48      6      WGsim   job    N     N     N    Y          LG
  Std    56     stu   19   14    48      6
Lim & Butcher, 1996
  FG     57     stu   24   14    24      5      WGsim   job    Y     Y     N    Y          DW
  Std    57     stu   24   14    24      5
  FG     59     stu   24   14    24      5      WGsim   job    Y     Y     N    Y          LG
  Std    59     stu   24   14    24      5
Shores & Carstairs, 1998
  FG     18     stu   37   16    22      —      BGsim   none   N     N     N    N          LG
  Std    18     stu   31   16    28      —

Note. Age and education (Educ) are given in years. % male = percentage of participants who were male; % min = percentage of participants who were
non-Caucasian; Scen = scenario provided for faking participants; Belv = faking participants instructed to respond believably; Comp = participants’
compliance with instructions assessed; Inc = incentive offered for successful faking; Rand scrn = participants screened for random responding; FG inst =
fake-good instructions. Under Group: FG = fake good; Std = standard instructions; FG-C = fake good with coaching; FG-LC = fake good with low-detail
coaching; FG-HC = fake good with high-detail coaching; Hon = instructed to be as honest as possible. Under Type: stu = students; pat = patients; com =
community members; c&s = community members and students; appl = job applicants. Under Design: BG = between groups; sim = simulation; WG =
within groups; DP = differential prevalence. Under Scen: job = job application; cust = child custody evaluation; mix = several scenarios provided; court =
court case. Under FG inst: LG = look good; DW = deny weaknesses. Y = yes; N = no; n/a = not applicable. Dashes indicate data not available.
Comparison-level variables (Design through FG inst) are listed on the first row of each comparison.
design, in which job applicants were compared to a normative
sample. No known-groups designs were reported.
Scenarios were provided for underreporting participants in 18 of
the 22 comparisons (82%). Of these, 6 involved job applications, 7
involved child custody, 1 was a nonspecified court procedure,
and 4 provided a list of several scenarios. Underreporting participants were instructed to respond believably in 12 of the 22
comparisons (55%). They were coached on avoiding detection in 5
comparisons (23%). Their understanding of or compliance with
instructions was assessed in 11 comparisons (50%), and incentives
for successful underreporting were offered in 11 comparisons
(50%). Participants were screened for random responding, with
those scoring over a cutoff excluded from data analyses, in 18
comparisons (82%). In 14 comparisons (64%), underreporting
instructions asked participants to appear healthy, well adjusted,
normal, good, or favorable, sometimes using terms such as “very,”
“completely,” “extremely,” or “unrealistically.” In three comparisons (14%), underreporting participants were told only to deny or
minimize symptoms and problems. In four comparisons, they were
asked both to look good and to deny problems.
Effect Sizes
Table 2 shows effect sizes for each index of underreporting and
for each comparison. Their overall mean is 1.25 (SD = 0.68),
suggesting that, on the average, underreporting participants scored
more than one standard deviation higher on these indices than
participants given standard instructions. The final column in Table 2 shows the mean effect size for each comparison of an
underreporting group with a standard group, collapsed across all
indices of underreporting used in the comparison. These means ranged
from 0.27 to 2.17, with a mean of 1.25. The last three rows of
Table 2 show the means for each underreporting index, collapsed
across comparisons. The first of these final rows presents overall
means, followed in the last two rows by means for comparisons in
which underreporting participants were coached or not coached,
respectively. Overall means ranged from 0.91 for PMH4 to 1.56
for Wsd. Means for coached participants were generally lower,
ranging from 0.23 (L) to 1.37 (Wsd). Means for uncoached participants ranged from 1.02 (PMH4) to 1.80 (Od).
Relationships of Effect Sizes to Participant and
Methodological Variables
Participant variables. Correlations were computed between
mean effect size obtained from each comparison and the following
participant variables: number of participants, mean age, percentage
male, years of education, and percentage non-Caucasian. In an
effort to preserve independence of mean effect sizes, only one
mean effect size from each study was used in these correlations.
Comparisons in which underreporting participants were coached
were eliminated, because these comparisons are relatively infrequent and the mean effect sizes generated from them appear to
differ from those in which underreporting participants were not
coached. Remaining effect sizes were averaged within studies to
yield a single mean effect size from each study. Correlations
between these 14 independent mean effect sizes and participant
variables were nonsignificant, suggesting that within this literature, there are no relationships between age, sex, education, race,
or number of participants and efficacy of underreporting indices in
discriminating underreported from standard protocols.
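The aggregation step described above (drop coached comparisons, then average the remaining comparison-level means within each study) can be sketched as follows. The values are the comparison means from the final column of Table 2 for three of the studies; the function and variable names are hypothetical, and the sketch covers only the aggregation, not the correlations themselves.

```python
from collections import defaultdict
from statistics import mean

# (study, mean d for the comparison, underreporting group coached?)
comparisons = [
    ("Baer & Sekirnjak, 1997", 1.54, False),
    ("Baer & Sekirnjak, 1997", 1.15, False),
    ("Baer & Sekirnjak, 1997", 0.90, True),
    ("Baer & Sekirnjak, 1997", 0.44, True),
    ("Baer, Wetter, & Berry, 1995", 1.97, False),
    ("Baer, Wetter, & Berry, 1995", 0.69, True),
    ("Baer, Wetter, & Berry, 1995", 0.27, True),
    ("Lim & Butcher, 1996", 1.05, False),
    ("Lim & Butcher, 1996", 1.05, False),
]


def study_level_means(rows):
    """Return one mean effect size per study, using uncoached comparisons only."""
    by_study = defaultdict(list)
    for study, d, coached in rows:
        if not coached:
            by_study[study].append(d)
    return {study: mean(ds) for study, ds in by_study.items()}


for study, d in study_level_means(comparisons).items():
    print(f"{study}: {d:.2f}")
```

Collapsing to one value per study keeps any single sample from contributing more than once to a correlation, at the cost of discarding within-study variation.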
Methodological variables. Relationships between effect sizes
and methodological variables can be seen in Table 3. The small
number of comparisons available and the nonindependence of
some of the effect sizes make statistical analysis of these relationships impractical. Thus, these findings are only suggestive. A
relatively large difference in mean effect sizes can be seen for type
of comparison. The most clinically relevant comparison (patients
faking good vs. nonpatients given standard instructions) shows a
smaller mean effect size (0.82) than the other two types of comparisons (1.33 and 1.40). This finding suggests that studies that use
only students or normal volunteers as participants may overestimate the differences between faking and honest respondents in
real-world situations and that the most generalizable findings are
likely to be obtained from studies in which individuals attempting
to conceal significant problems are compared with honest responders who are truly functioning within the normal range.
Another difference was noted for the type of scenario presented
to underreporting participants. Those given a job application scenario had a higher mean effect size (1.55) than those given a child
custody scenario (0.99). However, as several of the groups with
custody scenarios also received coaching in avoiding detection,
whereas no groups with job application scenarios were coached, it
seems likely that the lower effect sizes were due to coaching. Mean
effect size for participants warned to respond believably (1.12) was
smaller than for those not warned (1.49), supporting Rogers’s
(1997) suggestion that these warnings are important. A notable
difference can be seen for coaching, with comparisons in which
participants were coached in avoiding detection yielding a lower
mean effect size (0.89) than comparisons in which participants
were not coached (1.38). Mean effect size for participants given
incentives (1.16) was somewhat smaller than for those not given
incentives (1.37), although this difference may not be significant.
Classification Accuracy
Sensitivity, specificity, PPP, and NPP were coded or calculated
for all studies that examined classification accuracy of underreporting scales. Sensitivity (the percentage of underreporters correctly identified by the scale in question) and specificity (the
percentage of nonunderreporters correctly identified) were reported in most studies. Some studies reported these figures by
using the single most effective cutting score for their sample. Other
studies reported sensitivity and specificity for a variety of cutting
scores. In these latter cases, the cutting score with the highest
combination of sensitivity and specificity was selected for inclusion in our analyses.
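The selection rule for studies reporting several cutting scores can be sketched as follows. Two assumptions are made for illustration: "highest combination" is read as the largest sum of sensitivity and specificity, and the candidate cutting scores and accuracy values are invented rather than taken from any study in the review.

```python
# Hypothetical cutting scores for one scale in one study (not real data).
hypothetical_cuts = [
    {"cut": 8, "sensitivity": 0.80, "specificity": 0.60},
    {"cut": 10, "sensitivity": 0.70, "specificity": 0.85},
    {"cut": 12, "sensitivity": 0.50, "specificity": 0.95},
]


def best_cutting_score(candidates):
    """Pick the candidate with the largest sensitivity + specificity.

    Assumption: the sum is one reasonable reading of "highest combination";
    other weightings would be equally defensible.
    """
    return max(candidates, key=lambda c: c["sensitivity"] + c["specificity"])


chosen = best_cutting_score(hypothetical_cuts)
print(f"selected cutting score: {chosen['cut']}")
```

Because the best cutting score is chosen after seeing each sample's data, accuracy figures selected this way are likely to be somewhat optimistic relative to a cutting score fixed in advance.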
Fewer studies reported PPP and NPP, but these values can be
calculated from the sensitivity and specificity values provided.
However, although sensitivity and specificity remain stable across
base rates, PPP and NPP vary with the base rate of underreporting
in the sample. If sensitivity and specificity are known, then PPP
and NPP can be calculated for any base rate. For most of the
studies included in this review, base rates were .50, because
underreporting and standard groups usually had equal numbers of
participants. However, applied and clinical settings may have
different base rates of underreporting. Thus, the most informative
PPP and NPP values will be obtained when calculated for a base
rate typical of applied settings.
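The dependence of PPP and NPP on base rate can be sketched as follows. The sensitivity and specificity of .90 are borrowed from the worked example in the Base Rates section; the base rates in the loop are illustrative values chosen to span low to high prevalence, and the function name is hypothetical.

```python
def predictive_power(sensitivity, specificity, base_rate):
    """Return (PPP, NPP) for a scale applied at a given base rate of underreporting."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    true_neg = specificity * (1 - base_rate)
    false_neg = (1 - sensitivity) * base_rate
    ppp = true_pos / (true_pos + false_pos)
    npp = true_neg / (true_neg + false_neg)
    return ppp, npp


# Sensitivity and specificity stay fixed; only the base rate changes.
for base_rate in (0.05, 0.20, 0.50, 0.75):
    ppp, npp = predictive_power(0.90, 0.90, base_rate)
    print(f"base rate {base_rate:.2f}: PPP = {ppp:.2f}, NPP = {npp:.2f}")
```

As the base rate rises, PPP increases and NPP decreases, which is why values computed at the .50 base rate typical of simulation studies may not carry over to applied settings.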
To determine a base rate typical of applied settings, we reviewed
literature describing the frequency of underreporting on the
MMPI-2 in such settings. Five studies providing relevant data were
found. These studies, summarized in Table 4, examined the base
rate of underreporting in three samples of custody litigants and two
personnel selection samples. (These studies were not included in
the meta-analytic review because they did not define feigning and
standard groups independently of MMPI-2 scores.) Strong,
Greene, Hoppe, Johnston, and Oleson (1999) used taxometric
procedures to determine prevalence of underreporting, and they
obtained a mean base rate of .36 across several analyses. The other
studies used the L and K scales to identify underreporting, and
Bagby, Nicholson, Buis, Radovanovic, and Fidler (1999) also used
a criterion based on the sum of Wsd and S. The base rates obtained
range from .20 to .74. All of these base rates fall within roughly
Table 2
Effect Sizes (d) for Underreporting Indices

Study
  Comp      L      K      F−K    L+K    Mp     Wsd    Esd    Tt     Od     PMH4   S      O−S    M

Austin, 1992
  1         3.07   1.71   1.16   —      —      —      —      —      —      —      —      0.18   1.53
Baer & Sekirnjak, 1997
  2         1.23   1.30   1.61   1.42   1.58   1.64   1.58   1.99   1.75   1.11   1.68   —      1.54
  3         0.99   1.38   1.20   1.48   1.04   1.45   0.89   0.76   1.28   0.54   1.61   —      1.15
  2a        0.02   0.47   0.97   0.40   1.19   1.38   1.30   0.88   1.25   0.94   1.15   —      0.90a
  3a        −0.60  0.47   0.42   0.42   0.52   1.14   0.58   −0.07  0.62   0.30   1.08   —      0.44a
Baer, Wetter, & Berry, 1995
  1         2.41   1.68   1.10   2.32   2.56   2.44   1.22   2.12   2.58   1.06   2.21   —      1.97
  1a        0.58   0.30   0.35   0.41   0.92   1.06   0.67   0.79   0.90   0.73   0.84   —      0.69a
  1a        −0.06  −0.04  0.16   −0.06  0.34   0.86   0.46   0.30   0.49   0.21   0.32   —      0.27a
Baer, Wetter, Nichols, et al., 1995
  1         1.56   1.72   1.90   2.05   1.81   1.75   1.59   1.71   1.98   1.55   2.18   —      1.80
Bagby, Buis, & Nicholson, 1995
  1         1.29   —      —      —      1.20   —      —      —      —      —      —      1.38   1.29
Bagby, Rogers, & Buis, 1994
  1         1.26   1.39   1.27   —      1.25   —      —      —      —      —      —      1.48   1.33
  1         1.25   1.43   1.30   —      1.43   —      —      —      —      —      —      1.54   1.39
Bagby, Rogers, Buis, & Kalemba, 1994
  1a        1.19   2.06   2.08   1.98   2.22   2.41   2.67   2.17   2.73   1.70   2.61   —      2.17a
Bagby et al., 1997
  2         1.59   1.40   1.87   1.61   1.86   1.66   2.08   1.50   2.00   1.76   2.00   —      1.76
  3         1.20   0.77   0.36   1.10   1.10   1.40   0.36   0.88   1.19   0.12   0.93   —      0.86
Brems & Harris, 1996
  1         0.94   —      —      —      —      —      —      —      —      —      —      —      0.94
Butcher, 1994
  1         0.71   1.73   —      —      —      —      —      —      —      —      —      —      1.22
Cassisi & Workman, 1992
  1         1.32   0.42   —      —      —      —      —      —      —      —      —      —      0.87
Graham, Watts, & Timbrook, 1991
  1         1.56   1.36   —      —      —      —      —      —      —      —      —      —      1.40
Lim & Butcher, 1996
  1         1.20   0.84   —      —      —      —      —      —      —      —      1.12   —      1.05
  1         1.46   0.74   —      —      —      —      —      —      —      —      0.96   —      1.05
Shores & Carstairs, 1998
  1         1.97   1.53   —      —      —      —      —      —      —      —      2.34   —      1.95

N           22     20     14     11     14     11     11     11     11     11     14     4      —
Overall M   1.19   1.13   1.12   1.19   1.36   1.56   1.22   1.18   1.52   0.91   1.51   1.14   1.25
M coach     0.23   0.65   0.80   0.63   1.04   1.37   1.14   0.81   1.20   0.78   1.20   —      0.89
M no coach  1.47   1.29   1.31   1.66   1.54   1.72   1.29   1.49   1.80   1.02   1.66   1.14   1.38

Note. Comp = comparison; L = Lie; K = Correction; F − K = Infrequency minus Correction; L + K = Lie plus Correction; Mp = Positive Malingering; Wsd = Wiggins’s Social Desirability; Esd =
Edwards’s Social Desirability; Tt = Hanley’s Test-Taking Defensiveness; Od = Other Deception; PMH4 = Positive Mental Health 4; S = Superlative; O − S = Obvious − Subtle. 1 = normals faking
vs. normals standard; 2 = patients faking vs. patients standard; 3 = patients faking vs. normals standard. Coach = faking group coached on avoiding detection.
Dashes indicate data not available.
a Underreporting participants coached.
Table 3
Mean Effect Sizes and Methodological Variables

Variable                                  N     Mean d
Type of comparison
  Normal fake good vs. normal standard    16    1.33
  Patient fake good vs. patient standard  3     1.40
  Patient fake good vs. normal standard   3     0.82
Design
  Simulation, between groups              16    1.18
  Simulation, within group                5     1.55
Type of scenario
  Job application                         6     1.55
  Custody evaluation                      7     0.99
Believability warning
  Yes                                     12    1.12
  No                                      9     1.49
Coaching
  Yes                                     5     0.89
  No                                      17    1.38
Random assignment
  Yes                                     8     1.09
  No                                      6     1.23
Compliance check
  Yes                                     11    1.10
  No                                      10    1.46
Incentive
  Yes                                     11    1.16
  No                                      11    1.37
Random screen
  Yes                                     18    1.27
  No                                      4     1.25
Fake instructions
  Look good                               14    1.30
  Deny problems                           3     1.57
one standard deviation of the mean of these values, except the
highest value (.74), which falls two standard deviations from the
mean. For this reason, this value was judged to be an outlier and
was dropped from consideration. The median of the remaining
base rates is .30. If the base rates are separated into job applicant
and custody litigant categories, then the median base rate of each
category also is .30. Thus, .30 was judged to be the best available
estimate of the prevalence of underreporting in relevant applied
settings, and PPP and NPP values for the studies included in the
meta-analysis were calculated for this base rate.
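The outlier screening and median described above can be retraced with a short sketch. The eight proportions are the percentages listed in Table 4; the use of the sample standard deviation (n − 1) is an assumption, because the text does not say which form was used, and the variable names are illustrative only.

```python
from statistics import mean, median, stdev

# Proportions meeting each defensiveness criterion, taken from Table 4.
base_rates = [0.52, 0.74, 0.20, 0.25, 0.27, 0.30, 0.43, 0.36]

center = mean(base_rates)
spread = stdev(base_rates)  # sample SD (n - 1); assumed, not stated in the text

# Distance of each base rate from the mean, in standard deviation units.
z_scores = [(rate - center) / spread for rate in base_rates]

# Treat any value at least two SDs from the mean as an outlier, as the text does for .74.
outliers = [rate for rate, z in zip(base_rates, z_scores) if abs(z) >= 2]
retained = [rate for rate in base_rates if rate not in outliers]

best_estimate = median(retained)
print(outliers, best_estimate)
```

Under these assumptions only .74 is flagged, and the median of the seven remaining values is .30.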
Table 5 shows the mean cutting scores, sensitivity, specificity,
PPP, and NPP (at a base rate of .30) for underreporting indices.
Because coaching of the underreporting group appeared to be
related to smaller mean effect sizes, classification accuracy values
are shown separately for comparisons with and without coaching.
Number of comparisons for each underreporting index also is
shown (right-hand columns), and the overall mean across underreporting indices is shown in the last row.
Many of the available scales have been examined in only a
few comparisons, and therefore, the results must be interpreted
cautiously. In addition, all of these values are specific to the
cutting scores used in each comparison and may have differed
if other cutting scores had been used. In general, however, these
data suggest that classification accuracy decreases when underreporting respondents have been told about the presence and
purpose of validity scales. This pattern holds across nearly all
validity scales and measures of classification accuracy; the one
exception in Table 5 is the sensitivity of K, which was essentially
unchanged (.69 vs. .70). The scale most resistant to the effects of
coaching appears to be Wsd, which had the highest values for
sensitivity, specificity, PPP, and NPP in the coaching comparisons. The Wsd scale
also showed high classification accuracy values for no-coaching
comparisons. The traditional L and K scales showed mixed
results. The L scale had high specificity, PPP, and NPP, but
lower than average sensitivity in the no-coaching comparisons
and lower than average accuracy in the coaching comparisons. The K scale showed average accuracy levels in most
comparisons.
For clinicians who must make judgments about the veracity of
an individual test taker’s responses, PPP and NPP are the most
useful measures of classification accuracy. Positive predictive
power for uncoached participants ranged from .53 to .75 (M = .65)
in these studies. The highest values were noted for Wsd (.75),
closely followed by L and Mp (.72), suggesting that, at a base rate
of underreporting of .30, roughly 72% to 75% of the test takers identified
by these scales as underreporting will actually be using this response set. Negative predictive power for uncoached participants
Table 4
Reported Base Rates of Defensiveness on Minnesota Multiphasic Personality Inventory–2
in Naturalistic Settings

Study                                                N     Sample                    Criterion for defensiveness    % meeting criterion
Bagby, Nicholson, Buis, Radovanovic, & Fidler, 1999  115   Custody litigants         >65T on L and/or K             52
                                                                                     >42 on Wsd + S                 74
Bathurst, Gottfried, & Gottfried, 1997               508   Custody litigants         >65T on L                      20
                                                                                     >65T on K                      25
Butcher, Morfitt, Rouse, & Holden, 1997              271   Airline pilot applicants  >65T on L or >70T on K         27
Caldwell-Andrews, 2000                               100   Police applicants         >65T on L                      30
                                                                                     >65T on K                      43
Strong, Greene, Hoppe, Johnston, & Oleson, 1999      412   Custody litigants         Taxometric procedures          36

Note. T = T score; L = Lie; K = Correction; Wsd + S = Wiggins’s Social Desirability plus Superlative scales.
Table 5
Mean Cutting Score, Sensitivity, Specificity, and Positive and Negative Predictive Power for Underreporting Indices
for Comparisons With and Without Coaching

                                                          M PPP at          M NPP at          No. of
        M cut score       M sensitivity  M specificity    base rate = .30   base rate = .30   comparisons
Scale   Co−      Co+      Co−    Co+     Co−    Co+       Co−    Co+        Co−    Co+        Co−    Co+
L       64T      49T      .69    .63     .88    .48       .72    .35        .87    .74        11     4
K       56T      50T      .69    .70     .80    .45       .62    .36        .87    .78        10     4
F−K     −13.6    −8.00    .78    .66     .77    .52       .61    .39        .89    .70        10     4
L+K     24.57    17.75    .82    .63     .82    .53       .68    .37        .91    .77        7      4
Mp      13.00    9.50     .76    .70     .84    .62       .72    .44        .89    .82        6      4
Wsd     16.75    13.75    .80    .76     .88    .74       .75    .56        .91    .88        4      4
Esd     30.25    27.00    .84    .73     .67    .57       .53    .42        .91    .82        4      4
Tt      14.25    11.75    .80    .58     .80    .52       .64    .35        .90    .73        4      4
Od      16.25    12.50    .82    .68     .83    .68       .67    .48        .91    .83        4      4
PMH4    25.50    24.25    .78    .69     .70    .50       .54    .38        .88    .78        4      4
S       30.83    22.75    .84    .71     .81    .61       .65    .44        .92    .82        6      4
M       —        —        .78    .68     .70    .57       .65    .41        .90    .79        —      —

Note. Dash indicates not applicable. PPP = positive predictive power; NPP = negative predictive power; Co− = underreporting group not coached; Co+ = underreporting group coached; T = T score; L = Lie; K = Correction; F − K = Infrequency minus Correction; L + K = Lie plus Correction; Mp = Positive Malingering; Wsd = Wiggins’s Social Desirability; Esd = Edwards’s Social Desirability; Tt = Hanley’s Test-Taking scale; Od = Other Deception; PMH4 = Positive Mental Health 4; S = Superlative scale.
ranged from .87 to .92, suggesting that, at a base rate of .30, most
test takers identified by these scales as responding honestly will
have been correctly classified.
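How PPP and NPP depend on sensitivity, specificity, and base rate can be illustrated with the standard Bayesian formulas. The sketch below is not the authors' own procedure: it plugs the mean uncoached sensitivity (.80) and specificity (.88) reported for Wsd in Table 5 into those formulas, so its output only approximates the tabled values, which are means of per-comparison PPP and NPP figures. The function name is hypothetical.

```python
def predictive_power(sensitivity, specificity, base_rate):
    """Return (PPP, NPP) for a scale applied at the given base rate of underreporting."""
    true_pos = sensitivity * base_rate                # underreporters correctly flagged
    false_pos = (1 - specificity) * (1 - base_rate)   # honest responders wrongly flagged
    true_neg = specificity * (1 - base_rate)          # honest responders correctly passed
    false_neg = (1 - sensitivity) * base_rate         # underreporters missed
    ppp = true_pos / (true_pos + false_pos)
    npp = true_neg / (true_neg + false_neg)
    return ppp, npp

# Mean uncoached values for Wsd (Table 5) at the .30 base rate used in this review.
ppp, npp = predictive_power(0.80, 0.88, 0.30)
print(round(ppp, 2), round(npp, 2))
```

With these inputs the sketch gives a PPP of about .74 and an NPP of about .91, close to the .75 and .91 shown for Wsd in Table 5. Lowering the base rate argument lowers PPP and raises NPP, which is why the choice of .30 matters for the interpretation of these figures.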
Incremental Validity
Incremental validity was examined in five of the studies included in the meta-analysis. Findings are mixed and inconclusive.
Baer, Wetter, Nichols, Greene, and Berry (1995) found that Wsd
and S were more effective in classifying standard and underreported profiles than were L and K. Including Wsd and S in
regression analyses resulted in significant increases over L and K
in prediction of group membership, and using Wsd and S in
addition to L and K resulted in improved classification accuracy.
However, this finding was not replicated in a follow-up study by
Baer, Wetter, and Berry (1995), who found that Wsd and S yielded
no improvement in classification accuracy over the use of L or K
alone. Baer and Sekirnjak (1997) found small but statistically
significant improvements in prediction of group membership when
supplementary underreporting scales were added to regression
equations that already included L and K (different scales were
significant in different comparisons). However, the changes were
too small to translate into improvements in classification accuracy.
Bagby, Buis, and Nicholson (1995) found that Mp and O − S had
incremental validity over L (and that conversely, L had incremental
validity over O − S and Mp), but they did not examine whether
this translated into improved classification rates. These findings
differed from those of Timbrook, Graham, Keiller, and Watts
(1993), who reported that O − S had no incremental validity over
L (although L had incremental validity over O − S). Finally,
Bagby et al. (1997) found that Od, S, Esd, and Wsd each had
incremental validity over L and K in one of three comparisons, but
they did not examine whether using these scales in combination
resulted in improved classification rates.
Discussion
The findings of this review suggest that groups of defensive and
nondefensive test takers differ by an average of 1.25 standard
deviations on the underreporting indices of the MMPI-2. For
uncoached participants, the mean effect size was 1.38, whereas the
mean for coached participants was 0.89. The mean effect size for
uncoached participants is larger than the mean of 1.05 reported by
Baer et al. (1992) in their meta-analysis of underreporting on the
original MMPI, which included no studies of coaching. However,
the mean effect size for coached participants is somewhat smaller.
Baer et al. (1992) also noted that their mean effect size was
considerably smaller than the mean of 2.07 noted by Berry et al.
(1991) in a meta-analysis of malingering on the original MMPI,
suggesting that underreporting may be more difficult to distinguish
from honest responding than is overreporting. Comparison of the
current findings with those of Rogers et al. (1994), the most recent
meta-analysis of overreporting on the MMPI-2, suggests that this
pattern continues. Rogers et al. (1994) found effect sizes ranging
from 1.08 to 3.33, with a mean of 2.03.
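The overall, coached, and uncoached means cited above can be retraced from the per-comparison means in the M column of Table 2. The sketch assumes simple unweighted averaging across the 22 comparisons, which the text here does not state explicitly; the list layout and names are illustrative.

```python
from statistics import mean

# Per-comparison mean effect sizes (column M of Table 2).
# The second element is True for comparisons marked "a" (underreporting group coached).
comparisons = [
    (1.53, False), (1.54, False), (1.15, False), (0.90, True), (0.44, True),
    (1.97, False), (0.69, True), (0.27, True), (1.80, False), (1.29, False),
    (1.33, False), (1.39, False), (2.17, True), (1.76, False), (0.86, False),
    (0.94, False), (1.22, False), (0.87, False), (1.40, False), (1.05, False),
    (1.05, False), (1.95, False),
]

# Unweighted means (an assumption about how the summary rows were formed).
overall = mean(d for d, _ in comparisons)
coached = mean(d for d, is_coached in comparisons if is_coached)
uncoached = mean(d for d, is_coached in comparisons if not is_coached)
print(round(overall, 2), round(coached, 2), round(uncoached, 2))
```

Under this assumption the overall and coached means match the reported 1.25 and 0.89. The uncoached mean computed from the rounded table entries comes out near 1.36, slightly below the reported 1.38, which may reflect rounding in the tabled values.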
The findings of this review also suggest that classification
accuracy varies across scales and is generally lower when faking
participants have been coached on avoiding detection. Classification accuracy figures must be interpreted with caution, because
they are based on the assumption that all participants in faking
groups were actually faking and that all of those in standard groups
refrained from defensive responding. It is possible that some
participants did not follow their instructions, and, as noted
above, 10 of the comparisons did not assess compliance with instructions. Nevertheless, the findings suggest that even the most effective underreporting scales will inaccurately label some test takers.
The consequences of misclassifying test takers may vary across
settings, but could be quite problematic. Thus, in an effort to
minimize such errors, it may be important to consider other
sources of information, such as interview data, behavioral observations,
other self-report data, and collateral information when
making decisions about individual protocols (Berry, Baer, Wetter,
& Rinaldo, in press).
Classification accuracy figures also are specific to the cutting
score used in each study. Widely applicable cutting scores have
been difficult to establish for at least two reasons. First, optimal
cutting scores vary across studies. In the present review, the small
number of studies available did not allow examination of whether
this variation is related to other characteristics of the studies (e.g.,
type of participant), although cutting scores were somewhat lower
when faking participants had been coached on avoiding detection.
Second, for this review, optimal cutting scores were defined as
those that best balanced sensitivity and specificity. However, as
the consequences of Type I and Type II classification errors are
likely to vary across settings, optimal cutting scores also will vary,
depending on which type of error is more important to minimize in
each setting. Thus, clinicians who choose to use a different cutting
score in order to minimize a specific type of error may obtain
different classification accuracy rates than those presented here.
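One way to read "best balanced sensitivity and specificity" is to pick the cutting score at which the two rates are closest to each other. The sketch below implements that reading as an assumption; the review does not spell out its exact rule. The T scores are invented for illustration, the function names are hypothetical, and the code assumes that higher scores indicate underreporting (the direction would be reversed for an index such as F − K).

```python
def classification_rates(faking_scores, standard_scores, cut):
    """Sensitivity and specificity when scores at or above `cut` are flagged."""
    sensitivity = sum(s >= cut for s in faking_scores) / len(faking_scores)
    specificity = sum(s < cut for s in standard_scores) / len(standard_scores)
    return sensitivity, specificity

def balanced_cut(faking_scores, standard_scores):
    """Cut score whose sensitivity and specificity differ the least."""
    candidates = sorted(set(faking_scores) | set(standard_scores))

    def imbalance(cut):
        sens, spec = classification_rates(faking_scores, standard_scores, cut)
        return abs(sens - spec)

    return min(candidates, key=imbalance)

# Hypothetical T scores, invented for illustration only.
faking = [58, 62, 66, 70, 71, 75, 79, 83]
standard = [41, 45, 48, 52, 55, 59, 63, 68]

cut = balanced_cut(faking, standard)
sens, spec = classification_rates(faking, standard, cut)
print(cut, sens, spec)
```

A clinician more concerned with one type of error could instead minimize a weighted sum of the two error rates, which would shift the selected score and the resulting accuracy figures.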
Several of the methodological weaknesses noted by Baer et al.
(1992) and Rogers (1997) have been addressed in some of the
current literature. For example, most of the comparisons included
realistic scenarios for underreporting participants to imagine, and
many included warnings to respond believably. Half of the comparisons included assessment of participants’ understanding of or
compliance with instructions, and half included incentives for
successful faking. However, methodological problems remain
common in this area. Many of the simulation designs compared
students instructed to fake with students given standard instructions. The generalizability of findings from these comparisons to
populations in which underreporting is likely to occur, such as
personnel selection and child custody, is unclear. Although many
individuals in these settings may fall within the normal range of
psychological functioning, the most important discrimination for
the clinician is between test takers truly functioning within normal
ranges and those trying to conceal significant problems. Simulation designs might shed more light on this discrimination if they
compare clinical samples instructed to appear well adjusted with
nonclinical samples given the standard instructions. Only three of
the comparisons reported here fit this description. Thus, it is very
important that future studies include more comparisons that use
clinically relevant samples.
Another problem in the current literature is that simulation
designs are far more common than differential prevalence and
known-groups designs. In fact, no known-groups designs, and only
one differential prevalence design, were found. As noted above,
Rogers (1997) has emphasized the importance of converging evidence from all designs in evaluating the efficacy of validity scales.
Research on malingering has shown that tests that do well in
simulation designs are not always effective in known-groups designs (Gillis, Rogers, & Bagby, 1991; Lewis, Simcox, & Berry, in
press). Thus, although more difficult to conduct, differential prevalence and known-groups designs are critically important for the
advancement of this area.
The current literature suggests that when test takers have been
coached about the presence and purpose of validity scales, underreporting is more difficult to detect. Previous researchers have
noted that professional materials containing information about
validity scales are readily available to both lay persons and attorneys
and that attorneys may transmit this information to their
clients who will undergo psychological testing (Baer, Wetter, &
Berry, 1995; Lamb, Berry, Wetter, & Baer, 1994; Rogers, Bagby,
& Chakraborty, 1993; Wetter & Corrigan, 1995). Thus, it seems
important to develop validity scales that remain effective even
when test takers have been coached. To date, the literature suggests that the Wsd scale shows the most promising efficacy in
coached respondents. However, this pattern has been reported in
only two studies (Baer & Sekirnjak, 1997; Baer, Wetter, & Berry,
1995). Thus, additional research on validity scales resistant to the
effects of coaching is warranted.
The extent to which supplementary underreporting scales should
be used is an important question. The current literature suggests
that the traditional L and K scales are reasonably accurate in
detecting uncoached underreporters. Several scales showed higher
mean sensitivity than L, but none showed higher specificity. Only
one scale (Wsd) showed higher PPP, and the NPP values were very
similar across scales. Studies of the incremental validity of supplementary scales over the traditional L and K scales produced
mixed findings. The use of multiple scales to identify feigners can
alter the probability of Type I and Type II errors, depending on
how they are used. If the criterion for classification as a feigner is
an elevation on any single scale, then use of multiple scales
increases the probability of Type I error (identifying an honest
responder as a faker). However, if the criterion for classification as
a feigner is elevation on all available scales, then use of supplementary scales will reduce the probability of Type I error but will
increase the likelihood of Type II error. Because the literature
reviewed here does not provide clear support for the use of
combinations of traditional and supplementary scales to identify
underreporting, it may not be advisable to consider all of the
supplementary scales in clinical practice. The scales Wsd and S
have shown incremental validity in a few comparisons, and they
warrant additional research to determine whether the routine use of
either or both can be recommended. However, relying on L and K,
for which much larger bodies of supporting data are available, may
be the most defensible approach.
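The trade-off between the "any single scale" and "all available scales" criteria can be made concrete with a small sketch. It assumes, unrealistically, that the scales err independently of one another (MMPI-2 validity scales are correlated, so the real changes would be smaller), and it uses the mean uncoached sensitivity and specificity of L and K from Table 5 purely as illustrative inputs. The function names are hypothetical.

```python
from math import prod

def any_rule(sensitivities, specificities):
    """Classify as underreporting if ANY scale is elevated (independence assumed)."""
    sensitivity = 1 - prod(1 - s for s in sensitivities)   # missed only if every scale misses
    specificity = prod(specificities)                      # cleared only if every scale clears
    return sensitivity, specificity

def all_rule(sensitivities, specificities):
    """Classify as underreporting only if ALL scales are elevated (independence assumed)."""
    sensitivity = prod(sensitivities)                      # detected only if every scale detects
    specificity = 1 - prod(1 - s for s in specificities)   # wrongly flagged only if every scale errs
    return sensitivity, specificity

# Mean uncoached values for L and K (Table 5).
sens = [0.69, 0.69]
spec = [0.88, 0.80]

any_sens, any_spec = any_rule(sens, spec)
all_sens, all_spec = all_rule(sens, spec)

# Type I error = honest responder flagged; Type II error = underreporter missed.
print("any:", round(1 - any_spec, 3), round(1 - any_sens, 3))
print("all:", round(1 - all_spec, 3), round(1 - all_sens, 3))
```

Under these assumptions the "any" rule raises the Type I error rate to roughly .30 while cutting Type II error to about .10, and the "all" rule does the opposite (Type I about .02, Type II about .52).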
Another important question is the extent to which individuals
with elevations on validity scales in settings that provide incentives for underreporting, such as personnel selection and child
custody evaluation, are concealing significant problems or merely
“putting their best foot forward” in a manner that should be
considered normal under the circumstances. Bathurst, Gottfried,
and Gottfried (1997) argued that custody litigants are largely free
of psychopathology and that elevations on Scales 3 and 6, which
are the most common in this group, can be attributed to the stresses
associated with divorce and custody proceedings in an adversarial
legal system. However, they also noted, as did Bagby et al. (1999),
that it cannot be determined from the MMPI-2 data whether the
defensive response style observed is “an overestimate of mental
health in a psychologically healthy population or an attempt by
psychologically disturbed individuals to conceal symptomatology”
(Bathurst et al., 1997, p. 209). This question can only be resolved
with extratest data, such as behavioral observations or ratings by
others (Bagby et al., 1999). Thus, because child custody and
personnel selection procedures might encourage some test takers
to respond defensively even if they have nothing significant to
hide, it is important to develop ways of discriminating defensive
responders who are concealing psychopathology from defensive
responders who are functioning normally, and to clarify the proportion of defensive responders in these settings who are hiding
problems. If many of the defensive responders in personnel selection and custody evaluation settings are not concealing significant
psychopathology, but are responding defensively because of situational demands, then the base rates of clinically significant defensiveness may be lower than the estimates provided in Table 4.
In summary, research on the detection of underreporting has
advanced in several ways since the publication of the MMPI-2.
However, important problems remain to be addressed. The use of
known-groups and differential prevalence designs remains very
rare, and most simulation designs have used students rather than
clinical or “real world” comparison groups. The detection of
coached feigners and the incremental validity of supplementary
scales over the traditional L and K scales require further investigation. The prevalence of significant symptomatology in personnel
selection and custody evaluation settings warrants additional exploration. In the meantime, the L and K scales continue to be
reasonably effective in identifying defensive responding in uncoached samples, and the Wsd scale may be useful in samples for
which coaching is likely. Because underreporting remains more
difficult to detect than overreporting, suspicions about underreporting that are triggered by elevations on these scales should be
investigated through interview, behavioral observations, other self-report inventories, and collateral sources of information, as available and appropriate (Berry et al., in press).
References
References marked with an asterisk indicate studies included in the
meta-analysis.
*Austin, J. S. (1992). The detection of fake good and fake bad on the
MMPI-2. Educational and Psychological Measurement, 52, 669–674.
*Baer, R. A., & Sekirnjak, G. (1997). Detection of underreporting on the
MMPI-2 in a clinical population: Effects of information about validity
scales. Journal of Personality Assessment, 69, 555–567.
Baer, R. A., Wetter, M. W., & Berry, D. T. R. (1992). Detection of
underreporting of psychopathology on the MMPI: A meta-analysis.
Clinical Psychology Review, 12, 509–525.
*Baer, R. A., Wetter, M. W., & Berry, D. T. R. (1995). Effects of
information about validity scales on underreporting of symptoms on the
MMPI-2: An analogue investigation. Assessment, 2, 189–200.
*Baer, R. A., Wetter, M. W., Nichols, D. S., Greene, R. L., & Berry,
D. T. R. (1995). Sensitivity of MMPI-2 validity scales to underreporting
of symptoms. Psychological Assessment, 7, 419–423.
*Bagby, R. M., Buis, T., & Nicholson, R. A. (1995). Relative effectiveness
of the standard validity scales in detecting fake-bad and fake-good
responding: Replication and extension. Psychological Assessment, 7,
84–92.
Bagby, R. M., Nicholson, R. A., Buis, T., Radovanovic, H., & Fidler, B. J.
(1999). Defensive responding on the MMPI-2 in family custody and
access evaluations. Psychological Assessment, 11, 24–28.
*Bagby, R. M., Rogers, R., & Buis, T. (1994). Detecting malingered and
defensive responding on the MMPI-2 in a forensic inpatient sample.
Journal of Personality Assessment, 62, 191–203.
*Bagby, R. M., Rogers, R., Buis, T., & Kalemba, V. (1994). Malingered
and defensive response styles on the MMPI-2: An examination of
validity scales. Assessment, 1, 31–38.
*Bagby, R. M., Rogers, R., Nicholson, R. A., Buis, T., Seeman, M. V., &
Rector, N. A. (1997). Effectiveness of the MMPI-2 validity indicators in
the detection of defensive responding in clinical and nonclinical samples. Psychological Assessment, 9, 406–413.
Bathurst, K., Gottfried, A. W., & Gottfried, A. E. (1997). Normative data
for the MMPI-2 in child custody litigation. Psychological Assessment, 9,
205–211.
Ben-Porath, Y. S. (1994). The ethical dilemma of coached malingering
research. Psychological Assessment, 6, 14–15.
Berry, D. T. R., Baer, R. A., & Harris, M. J. (1991). Detection of
malingering on the MMPI: A meta-analysis. Clinical Psychology Review, 11, 585–598.
Berry, D. T. R., Baer, R. A., Wetter, M. W., & Rinaldo, J. (in press).
Assessment of malingering. In J. N. Butcher (Ed.), Clinical personality
assessment (2nd ed.). New York: Oxford University Press.
Berry, D. T. R., Lamb, D. G., Wetter, M. W., Baer, R. A., & Widiger, T. A.
(1994). Ethical considerations in research on coached malingering. Psychological Assessment, 6, 16–17.
Borum, R., & Stock, H. V. (1993). Detection of deception in law enforcement applicants. Law and Human Behavior, 17, 157–166.
*Brems, C., & Harris, K. (1996). Faking the MMPI-2: Utility of the
subtle-obvious scales. Journal of Clinical Psychology, 52, 525–533.
*Butcher, J. N. (1994). Psychological assessment of airline pilot applicants
with the MMPI-2. Journal of Personality Assessment, 62, 31–44.
Butcher, J., Dahlstrom, W., Graham, J., Tellegen, A., & Kaemmer, B.
(1989). Manual for administering and scoring the MMPI-2. Minneapolis: University of Minnesota Press.
Butcher, J. N., & Han, K. (1995). Development of an MMPI-2 scale to
assess the presentation of self in a superlative manner: The S scale. In
J. N. Butcher & C. D. Spielberger (Eds.), Advances in personality
assessment (Vol. 10, pp. 25–50). Hillsdale, NJ: Erlbaum.
Butcher, J. N., Morfitt, R. C., Rouse, S. V., & Holden, R. R. (1997).
Reducing MMPI-2 defensiveness: The effect of specialized instructions
on retest validity in a job applicant sample. Journal of Personality
Assessment, 68, 385–401.
Caldwell-Andrews, A. A. (2000). Relationships between MMPI-2 validity
scales and NEO PI-R experimental validity scales in police candidates.
Unpublished dissertation.
*Cassisi, J. E., & Workman, D. E. (1992). The detection of malingering
and deception with a short form of the MMPI-2 based on the L, F, and
K scales. Journal of Clinical Psychology, 48, 54–58.
Cofer, C. N., Chance, J., & Judson, A. J. (1949). A study of malingering
on the MMPI. Journal of Psychology, 27, 491–499.
Edwards, A. L. (1957). The social desirability variable in personality
assessment and research. New York: Dryden.
Finn, S. E., & Kamphuis, J. H. (1995). What a clinician needs to know
about base rates. In J. N. Butcher (Ed.), Clinical personality assessment
(1st ed.). New York: Oxford University Press.
Gillis, J. R., Rogers, R., & Bagby, R. M. (1991). Validity of the M Test:
Simulation-design and natural-group approaches. Journal of Personality
Assessment, 57, 130–140.
Gough, H. G. (1950). The F minus K dissimulation index for the MMPI.
Journal of Consulting Psychology, 14, 408–413.
Graham, J. R. (2000). MMPI-2: Assessing personality and psychopathology (3rd ed.). New York: Oxford University Press.
*Graham, J. R., Watts, D., & Timbrook, R. E. (1991). Detecting fake-good
and fake-bad MMPI-2 profiles. Journal of Personality Assessment, 57,
264–277.
Hanley, C. (1957). Deriving a measure of test-taking defensiveness. Journal of Consulting Psychology, 21, 391–397.
Hathaway, S. R., & McKinley, J. C. (1983). The Minnesota Multiphasic
Personality Inventory manual. New York: Psychological Corporation.
Lamb, D. G., Berry, D. T. R., Wetter, M. W., & Baer, R. A. (1994). Effects
of two types of information on malingering of closed head injury on the
MMPI-2: An analogue investigation. Psychological Assessment, 6,
8–13.
Lees-Haley, P. R. (1992). Psychodiagnostic test usage by forensic psychologists. American Journal of Forensic Psychology, 10, 25–30.
Lees-Haley, P. R. (1997). Attorneys influence expert evidence in forensic
psychological and neuropsychological cases. Assessment, 4, 321–324.
Lewis, J. L., Simcox, A. J., & Berry, D. T. R. (in press). Known groups
validation of MMPI-2 validity scales and the Structured Inventory of
Malingered Symptoms (SIMS) for malingering screening in a forensic
sample. Psychological Assessment.
*Lim, J., & Butcher, J. N. (1996). Detection of faking on the MMPI-2:
Differentiation among faking-bad, denial, and claiming extreme virtue.
Journal of Personality Assessment, 67, 1–25.
Nichols, D. S. (1991). Development of a global measure for positive mental
health. Unpublished manuscript.
Nichols, D. S., & Greene, R. L. (1991, March). New measures for dissimulation on the MMPI/MMPI-2. Paper presented at the 26th annual
symposium on Recent Developments in the Use of the MMPI, St.
Petersburg Beach, FL.
Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of Personality and Social Psychology, 46, 598–609.
Paulhus, D. L. (1986). Self-deception and impression management in test
responses. In A. Angleitner & J. S. Wiggins (Eds.), Personality assessment via questionnaires: Current issues in theory and measurement (pp.
143–165). Berlin: Springer-Verlag.
Paulhus, D. L. (1998). Paulhus Deception Scales (PDS): The Balanced
Inventory of Desirable Responding-7. North Tonawanda, NY: Multi-Health Systems.
Rogers, R. (1997). Researching dissimulation. In R. Rogers (Ed.), Clinical
assessment of malingering and deception (2nd ed., pp. 309–327). New
York: Guilford Press.
Rogers, R., Bagby, R. M., & Chakraborty, D. (1993). Feigning schizophrenic
disorders on the MMPI-2: Detection of coached simulators.
Journal of Personality Assessment, 60, 215–226.
Rogers, R., Sewell, K. R., & Salekin, R. T. (1994). A meta-analysis of
malingering on the MMPI-2. Assessment, 1, 227–237.
Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.
*Shores, E. A., & Carstairs, J. R. (1998). Accuracy of the MMPI-2
computerized Minnesota Report in identifying fake-good and fake-bad
response sets. The Clinical Neuropsychologist, 12, 101–106.
Strong, D. R., Greene, R. L., Hoppe, C., Johnston, T., & Oleson, N. (1999).
Taxometric analysis of impression management and self-deception on
the MMPI-2 in child custody litigants. Journal of Personality Assessment, 73, 1–18.
Timbrook, R. E., Graham, J. R., Keiller, S. W., & Watts, D. (1993).
Comparison of the Wiener–Harmon Subtle–Obvious scales and the
standard validity scales in detecting valid and invalid MMPI-2 profiles.
Psychological Assessment, 5, 53–61.
Wetter, M. W., & Corrigan, S. K. (1995). Providing information to clients
about psychological tests: A survey of attorneys’ attitudes. Professional
Psychology: Research and Practice, 26, 474–477.
Wiener, D. N. (1948). Subtle and obvious keys for the MMPI. Journal of
Consulting Psychology, 12, 164–170.
Wiggins, J. S. (1959). Interrelationships among MMPI measures of dissimulation under standard and social desirability instructions. Journal of
Consulting Psychology, 23, 419–427.
Received March 5, 2001
Revision received July 23, 2001
Accepted September 27, 2001