Key Evaluation Checklist (KEC) - 4.18.2011
Michael Scriven, Claremont Graduate University & The Evaluation Center, Western Michigan University
• For use in professional designing, managing, and evaluating or monitoring of: programs, projects, plans, processes, and policies;
• for assessing their evaluability;
• for requesting proposals (i.e., writing RFPs) to do or evaluate them; & for evaluating proposed, ongoing, or completed evaluations of them.1
INTRODUCTION
This introduction takes the form of a number of ‘General Notes,’ more of which may be found in the body of the document, along with many keypoint-specific Notes.
General Note 1: APPLICABILITY The KEC can be used, with care, for evaluating more than the five evaluands2 listed above, just as it can be used, with considerable care, by others besides professional evaluators. For example, it can be used for some help with: (i) the evaluation of products;3 (ii) the evaluation of organizations and organizational units4 such as departments, research centers, consultancies, associations, companies, and for that matter, (iii) hotels, restaurants, and mobile food carts, (iv) services, which can be treated as if they were aspects or constituents of programs, e.g., as processes; (v) many processes, policies, practices, or procedures, which are often implicit programs (e.g., “Our practice at this school is to provide guards for children walking home after dark”), hence evaluable using the KEC; or habitual patterns of behaviour, i.e., performances (as in “In my practice as a consulting engineer, I often assist designers, not just manufacturers”), which is, strictly speaking, a slightly different subdivision of evaluation; and, with some use of the imagination and a heavy emphasis on the ethical values involved, for (vi) some tasks or major parts of tasks in the evaluation of personnel. So it is a kind
1 That is, for what is called meta-evaluation, i.e., the evaluation of one or more evaluations.
2 ‘Evaluand’ is a term used to refer to whatever is being evaluated. Note that what counts as a program is also sometimes called an initiative or intervention, sometimes even an approach or strategy, although the latter are really types of program.
3 For which it was originally designed and used, c. 1971—although it has since been completely rewritten for its present purposes, and then revised or rewritten (and circulated or re-posted) at least 60 times. The latest version can always be found at michaelscriven.info. It is an example of ‘continuous interactive publication,’ a type of project with some new significance in the field of knowledge development, although (understandably) a source of irritation to some librarians and bibliographers. It enables the author, like a garden designer (and unlike a traditional architect or composer), to steadily improve his or her specific individual creations over the years or decades, with the help of user input. It is simply a technologically-enabled extension to the limit of the stepwise process of producing successive editions in traditional publishing, and arguably a substantial improvement, in the cases where it’s appropriate.
4 There is of course a large literature on the evaluation of organizations, from Baldrige to Senge, and some of it will be useful for a serious evaluator, but much of it is confused and confusing, e.g., about the difference and links between evaluation and explanation, needs and markets, criteria and indicators, goals and duties.
5 It’s not important, but you can remember the part titles from this mnemonic: A for Approach, B for Before, C for Core (or Center), and D for Dependencies. Since these have 3+5+5+3 components, it’s a 16-point checklist.
6 Major decisions about the program include: refunding, defunding, exporting, replicating, developing further, and deciding whether it represents a proper or optimal use of funds (i.e., evaluation for accountability, as in an audit).
General Note 4: TYPE OF CHECKLIST This is an iterative checklist, not a one-shot checklist, i.e., you should expect to go through it several times when dealing with a single project, even for design purposes, since discoveries or problems that come up under later checkpoints will often require modification of what was entered under earlier ones (and no rearrangement of the order will completely avoid this).7 For more on the nature of checklists, and their use in evaluation, see the author’s paper on that topic, and a number of other papers about, and examples of, checklists for evaluation by various authors, under the listing for the Checklist Project at evaluation.wmich.edu.
General Note 5: EXPLANATIONS Since it is not entirely helpful to simply list here what (allegedly) needs to be covered in an evaluation when the reasons for the recommended coverage (or exclusions) are not obvious—especially when the issues are highly controversial (e.g., Checkpoint D2)—brief summaries of the reasons for the position taken are also provided in such cases.
General Note 6: CHECKPOINT FOCUS The determinations of merit, worth, and significance (a.k.a., respectively, quality, value, and importance), the triumvirate value foci of evaluation, each rely to different degrees on slightly different slices of the KEC, as well as on a good deal of it as common ground. These differences are marked by a comment on these distinctive elements, with the relevant term of the three underlined in the comment, e.g., worth, unlike merit (or quality, as the terms are commonly used), brings in Cost (Checkpoint C3).
General Note 7: THE COST OF EVALUATION The KEC is a list of what ought to be covered in an evaluation, but in the real world, the budget for an evaluation is often not enough to cover the whole list thoroughly. People sometimes ask what checkpoints could be skipped when one has a very small evaluation budget. The answer is, “None, but….” These are, generally speaking, necessary conditions for validity. But… (i) sometimes the client, or you, if you are the client, can show that one or two are not relevant to the information need in this case (e.g., cost may not be important in some cases); (ii) the fact that you can’t skip any checkpoint doesn’t mean you have to spend significant money on each of them. What you do have to do is think through each checkpoint’s implications for the case in hand, and consider whether an economical way of coping with it would be probably adequate for an acceptably probable conclusion, i.e., focus on robustness (see Checkpoint D5, Meta-evaluation, below). In an extreme case, you may have to rely on a subject-matter expert for an estimate based on his/her experience, maybe covering more than one checkpoint in a half-day of consulting—or on a few hours of literature + phone search by you—of the relevant facts about e.g., resources, or critical competitors. But reality sometimes means the evaluation can’t be done; that’s the cost of integrity for evaluators and, sometimes, of excessive parsimony for clients. Don’t forget that honesty on this point prevents some bad scenes later—and may lead to a change of budget.
7 An important category of these is identified in Note C2.5 below.
These preliminary checkpoints are clearly essential parts of an evaluation report, but may seem to have no relevance to the design and execution phases of the evaluation itself. That’s why they are segregated from the rest of the KEC checklist: however, it turns out to be quite useful to begin all one’s thinking about an evaluation by role-playing the situation when you will come to write a report on it. Amongst other benefits, it makes you realize the importance of describing context; of settling on a level of technical terminology and presupposition; of clearly identifying the most notable conclusions; and of starting a log on the project as well as its evaluation as soon as the latter becomes a possibility. Similarly, it’s good practice to make explicit at an early stage the clarification step and the methodology array and its justification.
A2. Clarifications
Now is the time to clearly identify and define in your notes, for assertion in the final report (and resolution of ambiguities along the way): (i) the client, if there is one: this is the person, group, or committee who officially requests, and, if it’s a paid evaluation, pays for (or authorizes payment for) the evaluation, and—you hope—the same entity to whom you first report (if not, try to arrange this, to avoid crossed wires in communications). (ii) The prospective (i.e., overt) audiences (for the report). (iii) The stakeholders in the program (those who have or will have a substantial vested interest—not just an intellectual interest—in the outcome of the evaluation, and may have important information or views about the program and its situation/history). (iv) Anyone else who (probably) will see, have the right to see, or should see, (a) the results, and/or (b) the raw data—these are the covert audiences.

Get clear in your mind your actual role or roles—internal evaluator, external evaluator, a hybrid (e.g., an outsider on the payroll for a limited time to help the staff with setting up and running evaluation processes), an evaluation trainer (sometimes described as an empowerment evaluator), a repairer/‘fixit guy’, visionary (or re-visionary), etc. Each of these roles has different risks and responsibilities, and is viewed with different expectations by your staff and colleagues, the clients, the staff of the program being evaluated, et al. You may also pick up some other roles along the way—e.g., counsellor, therapist, mediator, decision-maker, inventor, advocate—sometimes for everyone but sometimes for only part of the staff/stakeholders/others involved. It’s good to formulate and sometimes to clarify these roles, at least for your own thinking (especially about possible conflicts of role), in the project log. The project log is absolutely essential; and it’s worth considering making a standard practice of having someone else read and initial entries in it that may at some stage become very important.

And now is the time to get down to the nature and details of the job or jobs, as the client sees them—and to encourage the client to clarify their position on the details that they have not yet thought out. Get all this into a written contract if possible (essential if you’re an external evaluator, highly desirable for an internal one.)
Can you determine the source and nature of the request, need, or interest, leading to the evaluation: for example, is the request, or the need, for an evaluation of worth—which usually involves really serious attention to cost analysis—rather than of merit; or of significance, which always requires advanced knowledge of the research (or other current work) scene in the evaluand’s field; or of more than one of these? Is the evaluation to be formative, summative, or ascriptive8; or for more than one of these purposes? Exactly what are you supposed to be evaluating (the evaluand alone, or also the context and/or the infrastructure?): how much of the context is to be taken as fixed; do they just want an evaluation in general terms, or if they want details, what counts as a detail (enough to replicate the program elsewhere, or enough to recognize it anywhere, or just enough for prospective readers to know what you’re referring to); are you supposed to be simply evaluating the effects of the program as a whole (holistic evaluation); or the dimensions of its success and failure (one type of analytic evaluation); or the quality on each of those dimensions, or the quantitative contribution of each of its components to its overall m/w/s (another two types of analytic evaluation); are you required to rank the evaluand against other actual or possible programs (which ones?), or only to grade it;9 and to what extent is a conclusion that involves generalization from this context being requested or required (e.g., where are they thinking of exporting it?).
8 Formative evaluations, as mentioned earlier, are usually done to find areas needing improvement of the evaluand; summative evaluations are mainly done to support a decision about the disposition of the evaluand (e.g., to refund, defund, or replicate it); and ‘ascriptive’ evaluations are done simply for the record, for history, for benefit to the discipline, or just for interest.
9 Grading refers not only to the usual academic letter grades (A-F, Satisfactory/Unsatisfactory, etc.) but to any allocation to a category of merit, worth, or significance, e.g., grading of meat, grading of ideas and thinkers.
And, of particular importance, is the main thrust to be on ex post facto (historical) evaluation, or ex ante (predictive) evaluation, or (the most common, but don’t assume it) both? Note that predictive program evaluation is very close to being (the almost universal variety of) policy analysis, and vice versa.

Are you also being asked (or expected) either to evaluate the client’s theory of how the evaluand’s components work, or to create/improve such a ‘program theory’—keeping in mind that this is something over and above the literal evaluation of the program, and especially keeping in mind that this is sometimes impossible for even the most expert of field experts in the present state of subject-matter knowledge?10 Is the required conclusion simply to provide and justify grades, ranks, scores, profiles, or (a different level of difficulty altogether) distribution of funding? Are recommendations (for improvement or disposition), or identifications of human fault, or predictions, requested, or expected, or feasible (another level of difficulty, too—see Checkpoint D2)? Is the client really willing and anxious to learn from faults or is this just conventional rhetoric? (Your contract or, for an internal evaluator, your job, may depend on getting the answer to this question right, so you might consider trying this test: ask them to explain how they would handle the discovery of extremely serious flaws in the program—you will often get an idea from their reaction to this question whether they have ‘the right stuff’ to be a good client.) Or you may discover that you are really expected to produce a justification for the program in order to save someone’s neck; and that they have no interest in hearing about faults. Have they thought about post-report help with interpretation and utilization? (If not, offer it without extra charge—see Checkpoint D2 below.)

It’s best to complete the discussion of these issues about what’s expected and/or feasible to evaluate, and clarify your commitment (and your cost estimate, if it’s not already fixed), only after doing a quick pass through the KEC, so ask for a little time to do this, overnight if possible (see Note D2.3 near the end of the KEC). Be sure to note later any subsequently negotiated, or imposed, changes in any of the preceding. And here’s where you give acknowledgments/thanks/etc., so it probably should be the last section you revise in the final report.
10 Essentially, this is a request for decisive non-evaluative explanatory research on the evaluand and/or context. You may or may not have the skills for this, depending on the exact problem; you certainly didn’t acquire them in the course of your evaluation training. It’s one thing to determine whether (and to what degree) a program reduces delinquency; any good evaluator can do that (given the budget and time required). It’s another thing altogether to be able to explain why that program does or does not work—that often requires an adequate theory of delinquency, which so far doesn’t exist. Although program theory enthusiasts think their obligations always include or require such a theory, the standards for acceptance of any of these theories by the field as a whole are often beyond their reach; and you risk lowering the standards of the evaluation field if you claim your evaluation depends on providing such a theory, since in many of the most important areas, you will not be able to do that.
elsewhere, in both social science and evaluation texts. In this section, we just list some entry points for that slice of evaluation methodology, and provide rather more details about the evaluative slice of evaluation methodology, the neglected part; apart from a few comments here, this is mostly covered, or at least introduced, under the later checkpoints which refer to the necessary aspects of the investigation—Values, Process, Outcomes, Costs, Comparisons, Generalizability, and Synthesis checkpoints. Leaving this slice out of the methodology of evaluation is roughly the same as leaving out any discussion of inferential statistics from a discussion of statistics.

Two orienting points to start with. (i) Program evaluation is usually about a single program rather than a set of programs. Although program evaluation is not as individualistic—the technical term is idiographic rather than nomothetic—as dentistry, forensic pathology, or motorcycle maintenance, since most programs have large numbers of impactees rather than just one, it is more individualistic than most social sciences, even applied social sciences. So you’ll need to be knowledgeable about case study methodology.11 (ii) Program evaluation is nearly always a complex task, involving the investigation of a number of different aspects of program performance—even a number of different aspects of a single element in that, such as impact or cost—which means it is part of the realm of study that requires extensive use of checklists. The humble checklist has been ignored in most of the literature on research methods, but turns out to be more complex and also more important than was generally realized, so look up the online Checklists Project at http://www.wmich.edu/evalctr/checklists for some papers about the methodology and a long list of specific checklists composed by evaluators (including an earlier version of this one). You can find a few others, and the latest version of this one, at michaelscriven.info.
Now for some entry points for applying social science methodology: that is, some examples of the kind of question that you may need to answer. Do you have adequate domain (a.k.a. subject-matter, and/or local context) expertise for what you have now identified as the real tasks? If not, how will you add it to the evaluation team (via consultant(s), advisory panel, full team membership, sub-contract, or surveys/interviews)? More generally, identify, as soon as possible, all investigative procedures for which you’ll need expertise, time, equipment, and staff—and perhaps training—in this evaluation: observation, participant observation, logging, journaling, audio/photo/video recording, tests, simulating, role-playing, surveys, interviews, experimental design, focus groups, text analysis, library/online searches/search engines, etc.; and data-analytic procedures (stats, cost-analysis, modeling, topical-expert consulting, etc.), plus reporting techniques (text, stories, plays, graphics, freestyle drawings, stills, movies, etc.), and their justification. You probably need to allocate time for a lit review on some of these methods.... In particular, on the difficult causation component of the methodology (establishing that certain claimed or discovered phenomena are the effects of the interventions), can you use separate control or comparison groups to determine causation of supposed effects/outcomes? If not, look at interrupted time series designs, the GEM approach,12 and some ideas in case study design.
11 This means reading at least some books by Yin, Stake, and Brinkerhoff (check Amazon).
12 See “A Summative Evaluation of RCT methodology; and an alternative approach to causal research” in Journal of Multidisciplinary Evaluation vol. 5, no. 9, March 2008, at jmde.com.
If there is to be a control or quasi-control (i.e., comparison) group, can you and should you try to randomly allocate subjects to it (and can you get through IRB)? How will you control differential attrition; cross-group contamination; other threats to internal validity? If you can’t control these, what’s the decision-rule for declining/aborting the study? Can you double- or single-blind the study (or triple-blind if you’re very lucky)? If the job requires you to determine the separate contribution to the effects from individual components of the evaluand—how will you do that?.... If a sample is to be used at any point, how selected, and if stratified, how stratified?…. Will/should the evaluation be goal-based or goal-free?13.... To what extent participatory or collaborative; if to a considerable extent, what standards and choices will you use, and justify, for selecting partners/assistants? In considering your decision on that, keep in mind that participatory approaches improve implementation (and sometimes validity), but may cost you credibility (and possibly validity). How will you handle that threat?.... If judges are to be involved at any point, what reliability and bias controls will you need (again, for credibility as well as validity)?... How will you search for side effects and side-impacts, an essential element in almost all evaluations (see Checkpoint C2)?
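As one small illustration of the design questions above, here is a minimal sketch (not from the KEC itself) of random allocation of subjects to treatment and comparison groups, with a fixed seed so the assignment is reproducible and auditable; the roster, group sizes, and seed are invented for illustration, and consent and IRB approval are assumed.

    # Minimal, hypothetical sketch of random allocation to treatment/comparison
    # groups, one of the design options discussed above. Assumes a roster of
    # consenting subjects and IRB approval (both assumptions).
    import random

    def randomly_allocate(subject_ids, seed=2011):
        """Shuffle the roster reproducibly and split it into two groups."""
        rng = random.Random(seed)          # fixed seed -> auditable assignment
        roster = list(subject_ids)
        rng.shuffle(roster)
        midpoint = len(roster) // 2
        return {"treatment": roster[:midpoint], "comparison": roster[midpoint:]}

    if __name__ == "__main__":
        groups = randomly_allocate([f"S{n:03d}" for n in range(1, 21)])
        print(groups["treatment"])
        print(groups["comparison"])

A stratified variant would simply run the same shuffle-and-split within each stratum; the point of the fixed seed is that the assignment can later be reproduced for an audit or meta-evaluation.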
Most important of all, with respect to all (significantly) relevant values, how are you going to go through the value-side steps in the evaluation process, i.e., (i) identify, (ii) particularize, (iii) validate, (iv) measure, (v) set standards (‘cutting scores’) for, (vi) set weights for, and then (vii) incorporate (synthesize, integrate) the value-side with the empirical data-gathering side in order to generate the evaluative conclusion?... Now check the suggestions about values-specific methodology in the Values checkpoint, especially the comment on pattern-searching…. When you can handle all this, you are in a position to set out the ‘logic of the evaluation,’ i.e., a general description and justification of the total design for this project, something that—at least in outline—is a critical part of the report, under the heading of Methodology.
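The seven value-side steps above lend themselves to being tracked explicitly for each value. The following is a minimal bookkeeping sketch (not from the KEC; the field names and the example value are invented) of what a record for steps (i)-(vi) might hold, with step (vii), synthesis, left to the later "stars, bars, and steps" discussion under the Values checkpoint.

    # Hypothetical bookkeeping record for value-side steps (i)-(vi) above,
    # filled in for one invented value; step (vii), synthesis, happens when
    # weights and bars are combined across values (see the Values checkpoint).
    from dataclasses import dataclass, field

    @dataclass
    class ValueSpec:
        name: str                    # (i)  identify the value
        particularization: str       # (ii) what it means for this evaluand/context
        validation: str              # (iii) basis for treating it as relevant
        measure: str                 # (iv) how performance on it will be measured
        cut_scores: dict = field(default_factory=dict)   # (v) standards
        weight: int = 1              # (vi) relative importance (e.g., 1-3 "stars")

    safety = ValueSpec(
        name="safety of staff and patients",
        particularization="incident rate per 1,000 clinic visits",
        validation="definitional criterion of merit for a health program",
        measure="incident log audit over the evaluation period",
        cut_scores={"A": "< 1 per 1,000", "C": "< 5 per 1,000"},
        weight=3,
    )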
Note A3.1: The above process will also generate a list of needed resources for your planning and budgeting efforts—i.e., the money (and other costs) estimate. And it will also provide the basis for the crucial statement of the limitations of the evaluation that may need to be reiterated in the conclusion and perhaps in the executive summary.
13 That is, at least partly done by evaluators who are not informed of the goals of the program.
“possibility space,” i.e., the range of what could have been done, often an important element: in the assessment of achievement; in the comparisons, and in identifying directions for improvement that an evaluation considers. This means the checkpoint is crucial for Checkpoint C4 (Comparisons), Checkpoint D1 (Synthesis, for achievement), Checkpoint D2 (Recommendations), and Checkpoint D3 (Responsibility). Particularly for D1 and D2, it’s helpful to list specific resources that were not used but were available in this implementation. For example, to what extent were potential impactees, stakeholders, fund-raisers, volunteers, and possible donors not recruited or not involved as much as they could have been? (As a crosscheck, and as a complement, consider all constraints on the program, including legal, environmental, and fiscal constraints.) Some matters such as adequate insurance coverage (or, more generally, risk management) could be discussed here or under Process (Checkpoint C1 below); the latter is preferable since the status of insurance coverage is ephemeral, and good process must include a procedure for regular checking on it. This checkpoint is the one that covers individual and social capital available to the program; the evaluator must also identify social capital used by the program (enter this as part of its Costs at Checkpoint C3), and, sometimes, social capital benefits produced by the program (enter as part of the Outcomes, at Checkpoint C2).16 Remember to include the resources contributed by other stakeholders, including other organizations and clients.17
B5. Values
The values of primary interest in typical professional program evaluations are for the most part not mere personal preferences of the impactees, unless those overlap with their needs and the community/society’s needs and committed values, e.g., those in the Bill of Rights and the wider body of law. Preferences as such are not irrelevant in evaluation, especially the preferences of impactees, and on some issues, e.g., surgery options, they are often definitive; it’s just that they are generally less important—think of food preferences in children—than dietary needs and medical, legal, or ethical requirements, especially for program evaluation by contrast with product evaluation.
16 Individual human capital is the sum of the physical and intellectual abilities, skills, powers, experience, health, energy, and attitudes a person has acquired. These blur into their—and their community’s—social capital, which also includes their relationships (‘social networks’) and their share of any latent attributes that their group acquires over and above the sum of their individual human capital (i.e., those that depend on interactions with others). For example, the extent of the trust or altruism that pervades a group, be it family, army platoon, corporation, or other organization, is part of the value the group has acquired, a survival-related value that they (and perhaps others) benefit from having in reserve. (Example of non-additive social capital: the skills of football or other team members that will only provide (direct) benefits for others who are part of a group, e.g., a team, with complementary skills.) These forms of capital are, metaphorically, possessions or assets to be called on when needed, although they are not directly observable in their normal latent state. A commonly discussed major benefit resulting from the human capital of trust and civic literacy is support for democracy; a less obvious one, resulting in tangible assets, is the current set of efforts towards a Universal Digital Library containing ‘all human knowledge’. Human capital can usually be taken to include natural gifts as well as acquired ones, or those whose status is indeterminate as between these categories (e.g., creativity, patience, empathy, adaptability), but there may be contexts in which this should not be assumed. (The short term for all this might seem to be “human resources” but that term has been taken over to mean “employees,” and that is not what we are talking about here.) The above is a best effort to construct the current meaning: the 25 citations in Google for ‘human capital’ and the 10 for ‘social capital’ (at 6/06/07) include simplified and erroneous as well as other and inconsistent uses—few dictionaries have yet caught up with these terms (although the term ‘human capital’ dates from 1916).
17 Thanks to Jane Davidson for the reminder on this last point.
While there are intercultural and international differences of great importance in evaluating programs, most of the values listed below are highly regarded in all cultures; the differences are generally in their precise interpretation, the contextual parameters, the exact standards laid down for each of them, and the relative weight assigned to them; and taking those differences into account is fully allowed for in the approach here. Of course, your client won’t let you forget what they value, usually the goals of the program, and you should indeed keep them in mind and report on success in achieving them; but since you must value every unintended effect of the program just as seriously as the intended ones, and in most contexts you must take into account values other than those of the clients, e.g., those of the impactees and usually also those of other stakeholders, you need a repertoire of values to check when doing serious program evaluation, and what follows is a proposal for that.

Keep in mind that with respect to each of these (sets of) values, you will have to: (i) define and justify relevance to this program in this context; (ii) justify the relative weight (i.e., comparative importance) you will accord this value; (iii) identify any bars (i.e., minimum acceptable performance standards on each value dimension) you will require an evaluand to meet in order to be considered at all in this context; (iv) specify the empirical levels that will justify the application of each grade level above the bar on that value that you may wish to distinguish (e.g., define what will count as fair/good/excellent). And one more thing, rarely identified as part of the evaluator’s task but crucial: (v) once you have a list of impactees, however partial, you must begin to look for patterns within them, e.g., that pregnant women have greater calorific requirements (i.e., needs) than those who are not pregnant. If you don’t do this, you will miss extremely important ways to optimize the use of intervention resources.18
To get all this done, you should begin by identifying the relevant values for evaluating this evaluand in these circumstances. There are several very important groups of these. (i) Some of these follow simply from understanding the nature of the evaluand (these are sometimes called definitional criteria of merit, or dimensions of merit). For example, if it’s a health program, then the criteria of merit, simply from the meaning of the terms, include the extent (a.k.a., reach or breadth) of its impact (i.e., the size and range of the demographic (age/gender/ethnic/economic) and medical categories of the impactee population), and the impact’s depth (usually a function of magnitude, extent and duration) of beneficial effects. (ii) Other primary criteria of merit in such a case are extracted from a general or specialist understanding of the nature of a health program; they include safety of staff and patients, quality of medical care (from diagnosis to follow-up), low adverse eco-impact, physical ease of access/entry, and basic staff competencies plus basic functioning, diagnostic and minor therapeutic supplies and equipment. Knowing what these values are is one reason you need either specific evaluand-area expertise or a consultant who has it. (iii) Then look for particular, site-specific, criteria of merit—for example, the need for one or more second-language competencies in service providers; you will probably need to do or find a valid needs assessment for the targeted, and perhaps also for any other probably impacted population.
18 And if you do this, you will be doing what every scientist tries to do—find patterns in data. This is one of several ways in which good evaluation requires full-fledged traditional scientific skills; and something more as well (handling the values component).
Here you must include representatives from the impactee population as relevant experts, and you may need only their expertise for the needs assessment, but probably should do a serious needs assessment and have them help design and interpret it.
(iv) Next, list the explicit goals/values of the client if not already covered, since they will surely want to know whether and to what extent these were met. (v) Finally, turn to the list below to find other relevant values. Validate them as relevant or irrelevant for the present evaluation, and as contextually supportable.19

Now, for each of the values you are going to rely on at all heavily, there are two important steps you will usually need to take. First, you need to establish a scale or scales on which you can measure performance that is relevant to merit. On each of these scales, you need to locate levels of performance that will count as being of a certain value (these are called the ‘cut scores’ if the dimension is measurable). For example, you might measure knowledge of first aid on a certain well-validated test and set 90% as the score that marks an A grade, 75% as a C grade, etc.20 Second, you will usually need to stipulate the relative importance of each of these scales in determining the overall m/w/s of the evaluand. A useful basic toolkit for this involves doing what we call identifying the “stars, bars, and steps” for our listed values.
(i) The “stars” (usually best limited to 1–3 stars) are the weights, i.e., the relative or absolute importance of the dimensions of merit (or worth or significance) that will be used as premises to carry you from the facts about the evaluand, as you locate or determine those, to the evaluative conclusions you need. Their absolute importance might be expressed qualitatively (e.g., major/medium/minor, or by letter grades A-F); or quantitatively (e.g., points on a five, ten, or other point scale, or—often a better method of giving relative importance—by the allocation of 100 ‘weighting points’ across the set of values); or, if merely relative values are all you need, these can even be expressed in terms of an ordering of their comparative importance (rarely an adequate approach). (ii) The “bars” are absolute minimum standards for acceptability, if any: that is, they are minima on the particular scales, scores or ratings that must be ‘cleared’ (exceeded) if the candidate is to be acceptable, no matter how well s/he scores on other scales. Note that an F grade for performance on a particular scale does not mean ‘failure to clear a bar,’ e.g., an F on the GRE quantitative scale may be acceptable if offset by other virtues, for selecting students into a creative writing program.21
19 The view taken here is the commonsense one that values of the kind used by evaluators looking at programs serving the usual ‘good causes’ of health, education, social service, disaster relief, etc., are readily and objectively supportable, to a degree acceptable to essentially all stakeholders and supervisory or audit personnel, contrary to the doctrine of value-free social science which held that values are essentially matters of taste and hence lack objective verifiability. The ones in the list here are usually fully supportable to the degree needed by the evaluator for the particular case, by appeal to publicly available evidence, expertise, and careful reasoning. Bringing them into consideration is what distinguishes evaluation from plain empirical research, and only their use makes it possible for evaluators to answer the questions that mere empirical research cannot answer, e.g., Is this the best vocational high school in this city? Do we really need a new cancer clinic building? Is the new mediation training program for police officers who are working the gang beat really worth what it costs to implement? In other words, the most important practical questions, for most people—and their representatives—who are looking at programs (and the same applies to product, personnel, policy evaluation, etc.).
20 This is the tricky process of identifying ‘cutscores,’ a specialized topic in test theory—there is a whole book by that title devoted to discussing how it should be done. A good review of the main issues is in Gene Glass’ paper
21 If an F is acceptable on that scale, why is that dimension still listed at all—why is it relevant? Answer: it may be one of several on which high scores are weighted as a credit, on one of which the candidate must score high, but not on any particular one. In other words the applicant has to have some special talent, but a wide range of talents are acceptable. This might be described as a case where there is a ‘group’ bar, i.e., a ‘floating’ bar on a group of dimensions, which must be cleared by the evaluand’s performance on at least one of them. It can be exhibited in the list of dimensions of merit by bracketing the group of dimensions in the abscissa, and stating the height of the floating bar in an attached note. Example: “Candidates for admission to the psychology grad program must have passed one upper division statistics course.”
Bars and stars may be set on any relevant properties (a.k.a. dimensions of merit), or directly on dimensions of measured (valued) performance, and may additionally include holistic bars or stars.22 (iii) In serious evaluation, it is often appropriate to locate “steps,” i.e., points or zones on measured dimensions of merit where the weight changes, if the mere stars don’t provide enough precision. An example of this is the setting of several cutting scores on the GRE for different grades in the use of that scale for the two types of evaluation given above (evaluating the program and evaluating applicants to it). The grades, bars, and stars (weights) are often loosely included under what is called ‘standards.’ (Bars and steps may be fuzzy as well as precise.)
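To make the "stars, bars, and steps" machinery concrete, here is a minimal sketch (not part of the KEC) of one way to apply weights, minimum bars, and grade cut scores to per-dimension performance scores on a 0-100 scale; the dimensions, weights, bars, and cut scores are invented for illustration.

    # Minimal, hypothetical sketch of applying "stars" (weights), "bars"
    # (minimum standards), and grade cut scores to per-dimension scores.
    # All numbers below are illustrative only.

    WEIGHTS = {"reach": 30, "depth": 40, "safety": 30}      # 100 weighting points
    BARS    = {"safety": 60}                                 # must clear 60 on safety
    CUTS    = [(90, "A"), (75, "C"), (0, "F")]               # grade cut scores

    def grade(score):
        """Translate a 0-100 score into a letter grade via the cut scores."""
        for cut, letter in CUTS:
            if score >= cut:
                return letter

    def synthesize(scores):
        """Check bars first, then compute the weighted overall score and grade."""
        for dim, minimum in BARS.items():
            if scores[dim] < minimum:
                return {"acceptable": False, "failed_bar": dim}
        overall = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / sum(WEIGHTS.values())
        return {"acceptable": True, "overall": overall, "grade": grade(overall)}

    print(synthesize({"reach": 80, "depth": 92, "safety": 70}))
    # -> {'acceptable': True, 'overall': 81.8, 'grade': 'C'}

A holistic bar of the kind described in note 22 would simply be an additional check on the overall score (or on the whole profile) before a grade is issued; "steps" would be handled by letting a dimension's weight depend on which score zone it falls in.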
Three values are of such general importance that they receive full checkpoint status and are listed in the next section: cost (minimization of), superiority (to comparable alternatives), and generalizability/exportability. Their presence in the KEC brings the number of types of values considered, including the list below, up to a total of 21. At least check all these values for relevance and look for others; and for those that are relevant, set up an outline of a set of defensible standards that you will use. Since these are context-dependent (e.g., the standards for a C in evaluating free clinics in Zurich today are not the same as for a C in evaluating a free clinic in Darfur at the moment), and the client’s evaluation-needs—i.e., the questions they need to be able to answer—differ massively, there isn’t a universal dictionary for them. You’ll need to have a topical expert on your team or do a good literature search to develop a draft, and eventually run serious sessions with impactee and other stakeholder representatives to ensure defensibility for the revised draft. The final version of each of the standards, and the set of them, is often called a rubric, meaning a table translating evaluative terms into observable or testable terms and/or vice versa.23
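As a small illustration of such a rubric (invented here, not taken from the KEC), the first-aid example given earlier—90% marking an A, 75% a C—could be written as a table translating observable test-score ranges into grades, with descriptors to be filled in for the specific context:

    # Hypothetical mini-rubric for the first-aid knowledge example used earlier:
    # each row translates an observable score range into an evaluative grade and
    # a context-specific descriptor. Descriptors here are placeholders.
    FIRST_AID_RUBRIC = [
        # (minimum %, grade, observable descriptor)
        (90, "A", "answers nearly all scenario items correctly, incl. CPR sequence"),
        (75, "C", "handles common scenarios; gaps on less frequent emergencies"),
        ( 0, "F", "below the minimum acceptable level for unsupervised response"),
    ]

    def grade_score(percent):
        """Look up the grade for a test score, using the rubric's cut scores."""
        for minimum, letter, _descriptor in FIRST_AID_RUBRIC:
            if percent >= minimum:
                return letter

    assert grade_score(92) == "A" and grade_score(80) == "C"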
(i) Definitional values—those that follow from the definitions of terms in standard usage (e.g., breadth and depth of impact are, definitionally, dimensions of merit for a public health program), or that follow from the contextual implications of having an ideal or excellent evaluand of this type (e.g., an ideal shuttle bus service for night shift workers would feature increased frequency of service around shift change times). The latter draw from general knowledge and to some extent from program-area expertise.
22 Example: The candidates for admission to a graduate program—whose quality is one criterion of merit for the program—may meet all dimension-specific minimum standards in each respect for which these were specified (i.e., they ‘clear these bars’), but may be so close to missing the bars (minima) in so many respects, and so weak in respects for which no minimum was specified, that the selection committee feels they are not good enough for the program. We can describe this as a case where they failed to clear a holistic (a.k.a. overall) bar that was implicit in this example, but can often be made explicit through dialog. (The usual way to express a quantitative holistic bar is via an average grade; but that is not always the best way to specify it and often not strictly defensible.)
23 The term ‘rubric’ as used here is a technical term originating in educational testing parlance; this meaning is not in most dictionaries, or is sometimes distinguished as ‘an assessment rubric.’ A complication we need to note here is that some of the observable/measurable terms may themselves be evaluative, at least in some contexts.
(ii) Needs of the impacted population, via a needs assessment (distinguish performance needs (e.g., need for health) from treatment needs (need for specific medication or delivery system), needs that are currently met from unmet needs,24 and meetable needs from ideal but impractical or impossible-with-present-resources needs (consider the Resources checkpoint)). The needs are matters of fact, not values in themselves, but in any context that accepts the most rudimentary ethical considerations (i.e., the non-zero value of the welfare of other human beings), those facts are value-imbued. NOTE: needs may have macro as well as micro levels that must be considered; e.g., there are local community needs, regional needs within a country, national needs, global region needs, and global needs (these often overlap, e.g., in the case of building codes (illustrated by their absence in the Port-au-Prince earthquake of 2010)). Doing a needs assessment is sometimes the most important part of an evaluation, and in much of the literature it is based on invalid definitions of need, e.g., the idea that needs are the gaps between the actual level of some factor (e.g., income, calories) and the ideal level. Needs for X are the levels of X without which the subject(s) will be unable to function satisfactorily (not the same as optimally, maximally, or ideally); and of course, what functions are under study and what level will count as satisfactory will vary with the study and the context. Final note: check the Resources checkpoint, a.k.a. Strengths Assessment, for other entities valued in that context and hence of value in this evaluation.
(iii) Logical requirements (e.g., consistency, sound inferences in design of program or measurement instruments, e.g., tests).
(iv) Legal requirements (but see (v) below).
(v) Ethical requirements (overlaps with legal and overrides them when in conflict), usually including (reasonable) safety, and confidentiality (and sometimes anonymity) of all records, for all impactees. (Problems like conflict of interest and protection of human rights have federal legal status in the US, and are also regarded as scientific good procedural standards, and as having some very general ethical status.) In most circumstances, health, shelter, education, and other welfare considerations for impactees and potential impactees are other obvious values to which ethical weight must be given.
(vi) Cultural values (not the same as needs or wants, although overlapping with them) held with a high degree of respect (and thus distinguished from matters of manners, style, taste, etc.), of which an extremely important one is honor; another group, not always distinct from that one, concerns respect for ancestors, elders, tribal or totem spirits, and local deities. These, like legal requirements, are subject to override, in principle at least, by ethical values, although often taken to have the same and sometimes higher status.
24 A very common mistake—reflected in definitions of needs that are widely used—is to think that met needs are not ‘really’ needs, and should not be included in a needs assessment. That immediately leads to the ‘theft’ of resources that are meeting currently met needs, in order to serve the remaining unmet needs, a blunder that can cost lives. Identify all needs, then identify the ones that are still unmet.
(vii) Personal, group, and organizational goals/desires (unless you’re doing a goal-free evaluation) if not in conflict with ethical/legal/practical considerations. These are usually less important than the needs of the impactees, since they lack specific ethical or legal backing, but are enough by themselves to drive the inference to many evaluative conclusions about e.g., what recreational facilities to provide in community-owned parks, subject to consistency with ethical and legal constraints. They include some things that are arguably needed, as well as desired, by some, e.g., convenience, recreation, respect, earned recognition, excitement, and compatibility with aesthetic preferences of recipients.
(viii) Environmental needs, if these are regarded as not simply reducible to ‘human needs with respect to the environment,’ e.g., habitat needs of other species (fauna or flora), and perhaps Gaian ‘needs of the planet.’
(viii) Fidelity to alleged specifications (a.k.a. “authenticity,” “adherence,” “implementation,” “dosage,” or “compliance”)—this is often usefully expressed via an “index of implementation”; and—a different but related matter—consistency with the supposed program model (if you can establish this BRD—beyond reasonable doubt); crucially important in Checkpoint C1.
(ix) Sub-legal but still important legislative preferences (GAO used to determine these from an analysis of the hearings in front of the sub-committee in Congress from which the legislation emanated.)
(x) Professional standards (i.e., standards set by the profession) of quality that apply to the evaluand.25
(xi) Expert refinements of any standards lacking a formal statement, e.g., ones in (ix).
(xii) Historical/Traditional standards.
(xiii) Scientific merit (or worth or significance).
(xiv) Technological m/w/s.
(xv) Marketability, in commercial program evaluation.
(xvi) Political merit, if you can establish it BRD.
(xvii) Risk (sometimes meaning the same as chance), meaning the probability of failure (or success) or, sometimes, of the loss (or gain) that would result from failure (or success), or sometimes the product of these two. Risk in this context does not mean the probability of error about the facts or values we are using as parameters—i.e., the level of confidence we have in our data or conclusions.
25 Since one of the steps in the evaluation checklist is the meta-evaluation, in which the evaluation itself is the evaluand, you will also need, when you come to that checkpoint, to apply professional standards for evaluations to the list. Currently the best ones would be those developed by the Joint Committee (Program Evaluation Standards, 2nd ed., Sage), but there are several others of note (e.g., the GAO Yellow Book), and perhaps the KEC. And see the final checkpoint in the KEC.
Risk here is the value or disvalue of the chancy element in the enterprise in itself, as an independent positive or negative element—positive for those who are positively attracted by gambling as such (this is usually taken to be a real attraction, unlike risk-tolerance) and negative for those who are, by contrast, risk-averse. This consideration is particularly important in evaluating plans (preformative evaluation) and in formative evaluation, but is also relevant in summative and ascriptive evaluation when either is done prospectively (i.e., before all data is available). There is an option of including this under personal preferences, item (vii) above, but it is often better to consider it separately since it benefits by being explicit, can be very important, and is a matter on which evidence/expertise (in the logic of probability) should be brought to bear, not simply a matter of personal taste.26 (A brief expected-value sketch illustrating this arithmetic appears just after this list.)
(xviii) Last but not least—Resource economy (i.e., how low-impact the program is with respect to short-term and long-term limits on resources of money, space, time, labor, contacts, expertise and the eco-system). Note that ‘low-impact’ is not what we normally mean by ‘low-cost’ (covered separately in Checkpoint C3) in the normal currencies (money and non-money), but refers to absolute (usually meaning irreversible) loss of available resources (in some framework, which might range from a single person to a country). This could be included under an extended notion of (opportunity) cost or need, but has become so important in its own right that it is probably better to put it under its own heading as a value. It partly overlaps with Checkpoint C5, because a low score on resource economy undermines sustainability, so watch for double-counting. Also check for double counting against value (viii), if that is being weighted by client or audiences and is not overridden by ethical or other higher-weighted concerns.
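The footnote on risk (note 26) compares paying $1 for a 1-in-1,000 chance of $1,000 with paying $1 for a 1-in-2 chance of $2; the sketch below (illustrative only, not part of the KEC) just makes that expectancy arithmetic, and the technical "risk = probability of failure times the hazard" definition, explicit.

    # Illustrative arithmetic for item (xvii) and note 26. The gambles are the
    # ones described in note 26; the final example is an invented illustration
    # of the technical definition of risk.

    def expectancy(probability, payoff, stake):
        """Expected net gain: win payoff with the given probability, pay the stake."""
        return probability * payoff - stake

    def technical_risk(probability_of_failure, hazard):
        """Risk as often defined in the technical literature: p(failure) * loss."""
        return probability_of_failure * hazard

    # The two gambles from note 26 have the same expectancy (zero net gain):
    print(expectancy(1 / 1000, 1000, 1.0))   # 0.0
    print(expectancy(1 / 2, 2, 1.0))         # 0.0
    # ...so any preference between them reflects risk attitude, not expectancy.

    # Hypothetical use of the technical definition: a 5% chance of losing $10,000
    print(technical_risk(0.05, 10_000))      # 500.0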
Fortunately, bringing these values and their standards to bear27 is less onerous than it may appear, since many of these values will be unimportant or only marginally important in many specific cases, although each one will be crucially important in other particular cases. And doing all this values-analysis will be easy to do sometimes, although very hard on other occasions; it can often require expert advice and/or impactee/stakeholder advice. And, of course, some of these values will conflict with others (e.g., impact size with resource economy), so their relative weights may then have to be determined for the particular case, a non-trivial task by itself.
26 Note that risk is often defined in the technical literature as the product of the likelihood of failure and the magnitude of the disaster if the program, or part of it, does fail (the possible loss itself is often called ‘the hazard’); but in common parlance, the term ‘risk’ is often used to mean either the probability of disaster (“very risky”) or the disaster itself (“the risk of death”). Now the classical definition of a gambler is someone who will prefer to pay a dollar to get a 1 in 1,000 chance of making $1,000 over paying a dollar to get a 1 in 2 chance of making $2, even though the expectancy is the same in each case; the risk-averse person will reverse those preferences and in extreme cases will prefer to simply keep the dollar; and the rational risk-tolerant person will, supposedly, treat all three options as of equal merit. So, if this is correct, then one might argue that the more precise way to put the value differences here is to say that the gambler is not attracted by the element of chance in itself but by the possibility of making the larger sum despite the low probability of that outcome, i.e., that s/he is less risk-averse, not more of a risk-lover. (I think this way of putting the matter actually leads to a better analysis, viz., any of these positions can be rational depending on contextual specification of the cost of Type 1 vs. Type 2 errors.) However described, this can be a major value difference between people and organizations, e.g., venture capitalist groups vs. city planning groups.
27 ‘Bringing them to bear’ involves: (a) identifying the relevant ones, (b) specifying them (i.e., determining the dimensions for each and a method of measuring performance/achievements on all of these scales), (c) validating the relevant standards for the case, and (d) applying the standards to the case.
Hence you need to be very careful not to assume that you have to generate a ranking of evaluands from an evaluation you are asked to do, since if that’s not required, you can often avoid settling the issue of relative weights of criteria, or at least avoid any precision in settling it, by simply doing a grading of each evaluand, on a profiling display (i.e., showing the merit on all relevant dimensions of merit in a bar-graph for each evaluand). That will exhibit the various strengths and weaknesses of each evaluand, ideal for helping them to improve, and for helping clients to refine their weights for the criteria of merit, which will often make it obvious which is the best choice.
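A minimal sketch of such a profiling display (invented here for illustration; the dimension names and grades are placeholders) might simply print a per-dimension grade bar for each evaluand instead of collapsing everything into one weighted score:

    # Hypothetical text-mode "profiling display": one row of grade bars per
    # dimension of merit for each evaluand, instead of a single overall ranking.
    GRADE_POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "F": 1}

    def profile(name, grades):
        """Print a bar per dimension so strengths and weaknesses stay visible."""
        print(name)
        for dimension, grade in grades.items():
            bar = "#" * GRADE_POINTS[grade]
            print(f"  {dimension:<12} {grade} {bar}")

    profile("Program X", {"reach": "B", "depth": "A", "safety": "C", "cost": "D"})
    profile("Program Y", {"reach": "A", "depth": "C", "safety": "B", "cost": "B"})

Because nothing is summed, no weights need to be defended; the display simply leaves the weighting judgment, where appropriate, to the client.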
Note B5.1: You must cover in this checkpoint all values that you will use, including those used in evaluating the side-effects (if any), not just the intended effects (if any materialize). Some of these values may well occur to you only after you find the side-effects (Checkpoint C2), but that’s not a problem—this is an iterative checklist, and in practice that means you will often have to come back to modify findings on earlier checkpoints.
C1. Process
We start with this core checkpoint because it forces us to confront immediately the merit of the means this intervention employs, so that we are able, as soon as possible, to answer the question whether the (intended or unintentionally produced) ends—many of which we’ll cover in the next checkpoint—justify the means, in this specific case or set of cases.
or
28 Here, and commonly, this sense of the term means non-evaluative fact-finding. There are plenty of evaluative
facts that we often seek, e.g., whether the records show that an attorney we are considering has a history of malprac-
tice; whether braided nylon fishline is as good as wire for fish over 30kg.
29 Although this is generally true, there are evaluations in which one or more of the sub-evaluations are irrelevant, e.g., when cost is of no concern.
The Process checkpoint involves the assessment of the m/w/s of everything that happens or applies before true outcomes emerge, especially: (i) the vision, design, planning and operation of the program, from the justification of its goals (if you’re not operating in goal-free mode)—and note that these may have changed or be changing since the program began—through design provisions for reshaping the program under environmental or political or fiscal duress (including planning for worst-case outcomes); to the development and justification of the program’s supposed ‘logic’ a.k.a. design (but see Checkpoint D2), along with (ii) the program’s ‘implementation fidelity’ (i.e., the degree of implementation of the supposed archetype or exemplar program, if any). This index is also called “authenticity,” “adherence,” “alignment,” “fidelity,” “internal sustainability,” or “compliance”.30 You must also, under Process, check the accuracy of the official name or subtitle (whether descriptive or evaluative), or the official description of the program, e.g., “an inquiry-based science education program for middle school”—one, two, three, or even four of the components of this compound descriptive claim (it may also be contextually evaluative) can be false. (Other examples: “raises beginners to proficiency level”, “advanced critical thinking training program”.)

Also check (iii) the quality of its management (especially (a) the arrangements for getting and appropriately reporting evaluative feedback (that package is often much of what is called accountability or transparency), along with support for learning from that feedback, and from any mistakes/solutions discovered in other ways, along with meeting more obviously appropriate standards of accountability and transparency; (b) the quality of the risk-management,31 including the presence of a full suite of ‘Plan B’ options; (c) the extent to which the program planning covers issues of sustainability and not just short-term returns (this point can also be covered in C5)). You need to examine all activities and procedures, especially the program’s general learning/training process (e.g., regular ‘updating training’ to cope with changes in the operational and bio-environment, staff aging, essential skill pool, new technology32); attitudes/values; and morale. Of course, management quality is something that continues well beyond the beginning of the program, so in looking at it, you need to be clear when it had what form or you won’t be able to ascribe results—good or bad—to management features, if you are hoping to be able to do that. Organization records often lack this kind of detail, so try to improve that practice, at least for the duration of your evaluation.
30 Several recent drug studies have shown huge outcome differences between subjects filling 80% or more of their prescriptions and those filling less than 80%, in both the placebo and treatment groups, even when it’s unknown how many of those getting the drug from the pharmacy are actually taking it, and even though there is no overall difference in average outcomes between the two groups. In other words, mere aspects of the process of treatment can be more important than the nature of the treatment or the fact of treatment status. So be sure you know what the process actually comprises, and whether any comparison group is closely similar on each aspect.
31 Risk-management has emerged fairly recently as a job classification in large organizations, growing out of the
specialized task of analyzing the adequacy and justifiability of the organization’s insurance coverage, but now in-
cluding matters such as the adequacy and coordination of protocols and training for emergency response to natural
and human-caused disasters, the identification of responsibility for each risk, and the sharing of risk and insurance
with other parties.
32
See also my paper on “Evaluation of Training” at michaelscriven.info for a checklist that massively extends
Kirkpatrick’s groundbreaking effort at this task.
It is not generally appropriate to try to determine and affirm whether the model is correct in detail and in scientific fact unless you have specifically undertaken that kind of (usually ambitious and sometimes unrealistically ambitious) analytic evaluation of the program design/plan/theory. You need to judge with great care whether comments on the plausibility of the program theory are likely to be helpful, and, if so, whether you are sufficiently expert to make them. Just keep in mind that it's never been hard to evaluate aspirin for, e.g., its analgesic effects, although it is only very recently that we had any idea how/why it works. It would have been a logical error—and unhelpful to society—to make the earlier evaluations depend on solving the causal mystery. It helps to keep in mind that there's no mystery until you've done the evaluation, since you can't explain outcomes if there aren't any (or explain why there aren't any until you've shown that that's the situation). So if you can be helpful by evaluating the program theory, and you have the resources to spare, do it; but doing this is not an essential part of doing a good evaluation, will often be a diversion, and is sometimes a cause for disruptive antagonism.

Process evaluation may also include (iv) the evaluation of what are often called "outputs" (usually taken to be 'intermediate outcomes' that are developed en route to 'true outcomes,' the longer-term results that are sometimes called 'impact'). Typical outputs are knowledge, skill, or attitude changes in staff (or clients), when these changes are not major outcomes in their own right. Remember that in any program that involves learning, whether incidental or intended, the process of learning is gradual and, at any point in time, long before you can talk about outcomes/impact, there will have been substantial learning that produces a gain in individual or social capital, which must be regarded as a tangible gain for the program and for the intervention. It's not terribly important whether you call it process or output or short-term outcome, as long as you find it, estimate it, and record it—once. (Recording it under more than one heading—other than for merely annotative reasons—leads to double counting when you are aiming for an overall judgement.)

Note C1.1: Five other reasons why process is an essential element in program evaluation, despite the common tendency in much evaluation to place almost the entire emphasis on outcomes: (v) gender or racial prejudice in selection/promotion/treatment of staff is an unethical practice that must be checked for, and comes under process; (vi) in evaluating health programs that involve medication or exercise, 'adherence' or 'implementation fidelity' means following the prescribed regimen including drug dosage, and it is often vitally important to determine the degree to which this is occurring—which is also a process consideration. We now know, because researchers finally got down to triangulation (e.g., via randomly timed counts, by a nurse-observer, of the number of pills remaining in the patient's medicine containers), that adherence can be very low in many needy populations, e.g., Alzheimer's patients, a fact that completely altered evaluative conclusions about treatment efficacy; (vii) the process may be where the value lies—writing poetry in the creative writing class may be a good thing to do in itself, not because of some later outcomes (same for having fun, in kindergarten at least; painting; and marching to protest war, even if it doesn't succeed); (viii) the treatment of human subjects must meet federal, state, and other ethical standards, and an evaluator can rarely avoid the responsibility for checking that they are met; (ix) as the recent scandal in anaesthesiology underscores, many widely accepted evaluation procedures, e.g., peer review, rest on assumptions that are sometimes completely wrong (e.g., that the researcher actually did get the data reported from real patients), and the evaluator should try to do better than rely on such assumptions.
C2. Outcomes

This checkpoint is the poster-boy of many evaluations, and the one that many people mistakenly think of as covering 'the results' of an intervention. In fact, the results are everything covered in Part C. This checkpoint does cover the 'ends' at which the 'means' discussed in C1 (Process) are aimed, but (a) only to the extent they were achieved; and (b) much more than that. It requires the identification of all (good and bad) effects of the program (a.k.a. intervention) on: (i) program recipients (both targeted and untargeted—an example of the latter are thieves of aid or drug supplies); on (ii) other impactees, e.g., families and friends—and enemies—of recipients; and on (iii) the environment (biological, physical, and more remote social environments). These effects must include direct and indirect effects, intended and unintended ones, immediate33 and short-term and long-term ones (the latter being one kind of sustainability). (These are all, roughly speaking, the focus of Campbell's 'internal validity.')

Finding outcomes cannot be done by hypothesis-testing methodology, because: (i) often the most important effects are unanticipated ones (the four main ways to find such side-effects are: (a) goal-free evaluation, (b) skilled observation, (c) interviewing that is explicitly focused on finding side-effects, and (d) using previous experience, as provided in the mythical "Book of Causes"34). And (ii) because determining the m/w/s of the effects—that's the bottom line result you have to get out of this sub-evaluation—is often the hard part, not just determining whether there are any, or even what they are intrinsically, and who they affect (some of which you can get by hypothesis testing)…

Immediate outcomes (e.g., the publication of instructional leaflets for AIDS caregivers) are often called 'outputs,' especially if their role is that of an intermediate cause or intended cause of main outcomes, and they are normally covered under Checkpoint C1. But note that some true outcomes (including results that are of major significance, whether or not intended) can occur during the process but may be best considered here, especially if they are highly durable. (Long-term results are sometimes called 'effects' (or 'true effects' or 'results') and the totality of these is often referred to as the 'impact'; but you should adjust to the highly variable local usage of these terms by clients/audiences/stakeholders.)

Note that you must pick up effects on individual and social capital here (see the earlier footnote); much of this ensemble is normally not counted as outcomes, because they are gains in latent ability (capacity, potentiality), not necessarily in observable achievements or goods. Particularly in educational evaluations aimed at improving test scores, a common mistake is to forget to include the possibly life-long gain in ability as an effect.

33 The 'immediate' effects of a program are not only the first effects that occur after the program starts up, but should also include major effects that occur before the program starts. These (preformative) effects impact 'anticipators' who react to the announcement of—or have secret intelligence about—the future start of the program. For example, the award of the 2016 Olympic Games to Rio de Janeiro, made several years in advance of any implementation of the planned constructions etc. for the games, had a huge immediate effect on real estate prices, and later on favela policing for drug and violence control.

34 The Book of Causes shows, when opened at the name of a condition, factor, or event: (i) on the left (verso) side of the opening, all the things which are known to be able to cause it, in some context or other (which is specified at that point); and (ii) on the right (recto) side, all the things which it can cause: that's the side you need in order to guide the search for side-effects. Since the BofC is only a virtual book, you have to create the relevant pages, using all your resources such as accessible experts and a literature/internet search. Good forensic pathologists and good field epidemiologists, amongst other scientists, have very comprehensive 'local editions' of the BofC in their heads and as part of the social capital of their guild. Modern computer technology makes the BofC feasible, perhaps imminent.
Sometimes, not always, it's useful and feasible to provide explanations of success/failure in terms of components/context/decisions. For example, when evaluating a statewide consortium of training programs for firemen dealing with toxic fumes, it's probably fairly easy to identify the more and less successful programs, maybe even to identify the key to success as particular features—e.g., realistic simulations—that are to be found in and only in the successful programs. To do this usually does not require the identification of the whole operating logic/theory of program operation. (Remember that the operating logic is not necessarily the same as: (i) the original official program logic, (ii) the current official logic, (iii) the implicit logics or theories of field staff.) Also see Checkpoint D2 below.

Given that the most important outcomes may have been unintended (a broader class than unexpected), it's worth distinguishing between side-effects (which affect the target population and possibly others) and side-impacts (meaning impacts of any kind on non-targeted populations).

The biggest methodological problem with this checkpoint is establishing the causal connection, especially when there are many possible or actual causes, and—a separate point—if attribution of portions of the effect to each of them must be attempted.35

35 On this, consult recent literature by, or cited by, Cook or Scriven, e.g., in the 6th and the 8th issues of the Journal of MultiDisciplinary Evaluation (2008), at jmde.com, and American Journal of Evaluation (3, 2010).

Note C2.1: As Robert Brinkerhoff argues, success cases may be worth their own analysis as a separate group, regardless of the average improvement (if any) due to the program (since the benefits in those cases alone may justify the cost of the program);36 the failure cases should also be examined, for differences and toxic factors.

36 Robert Brinkerhoff in The Success Case Method (Berrett-Koehler, 2003).

Note C2.2: Keep the "triple bottom-line" approach in mind. This means that, as well as (i) conventional outcomes (e.g., learning gains by impactees), you should always be looking for (ii) community (including social capital) changes, and (iii) environmental impact… And always comment on (iv) the risk aspect of outcomes, which is likely to be valued very differently by different stakeholders… Especially, do not overlook (v) the effects on the program staff, good and bad, e.g., lessons and skills learned, and the usual effects of stress; and (vi) the pre-program effects mentioned earlier: that is, the (often major) effects of the announcement or discovery that a program will be implemented, or even may be implemented. These effects include booms in real estate and migration of various groups to/from the community, and are sometimes more serious, in at least the economic dimension, than the directly caused results of the program's implementation on this impact group, the 'anticipators.' Looking at these effects carefully is sometimes included under preformative evaluation (which also covers looking at other dimensions of the planned program, such as evaluability).

Note C2.3: It is usually true that evaluations have to be completed long before some of the main outcomes have, or indeed could have, occurred—let alone have been inspected carefully. This leads to a common practice of depending heavily on predictions of outcomes based on indications or small samples of what they will be. This is a risky activity, and needs to be carefully highlighted, along with the assumptions on which the prediction is based, and the checks that have been made on them, as far as is possible. Many very expensive evaluations of giant international aid programs have been based almost entirely on outcomes estimated by the same agency that did the evaluation and the installation of the program—estimates that, not too surprisingly, turned out to be absurdly optimistic. Pessimism can equally well be ill-based: for example, predicting the survival chances of Stage IV cancer patients is often done using the existing data on five-year survival—but that ignores the impact of research on treatment in (at least) the last five years, which has often been considerable. On the other hand, waiting for the next Force 8 earthquake to test disaster plans is stupid; simulations, if designed by a competent external agency, can do a very good job in estimating long-term effects of a new plan.

Note C2.4: Identifying the impactees is not only a matter of identifying each individual—or at least small group—that is impacted (targeted or not), hard though that is; it is also a matter of finding patterns in them, e.g., a tendency for the intervention to be more successful with women than men. Finding patterns in the data is of course a traditional scientific task, so here is one case amongst several where the task of the evaluator includes one of the core tasks of the traditional scientist.
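As a purely illustrative aside, the kind of pattern-finding this note has in mind can be sketched in a few lines of Python; the data, the subgroup labels, and the outcome measure below are invented for the illustration, not taken from any real evaluation.

# Minimal sketch of the pattern-finding Note C2.4 describes: comparing outcome
# gains across subgroups of impactees. All names and figures are hypothetical.
from statistics import mean

# Each record: one impactee, with a subgroup label and a pre/post outcome measure.
impactees = [
    {"group": "women", "pre": 41, "post": 58},
    {"group": "women", "pre": 37, "post": 55},
    {"group": "men",   "pre": 44, "post": 49},
    {"group": "men",   "pre": 40, "post": 46},
]

def mean_gain(records, group):
    """Average post-minus-pre gain for one subgroup."""
    gains = [r["post"] - r["pre"] for r in records if r["group"] == group]
    return mean(gains)

for g in ("women", "men"):
    print(f"{g}: mean gain = {mean_gain(impactees, g):.1f}")
# A large, consistent difference between subgroups is the sort of pattern worth
# reporting (and then testing properly) under the Outcomes checkpoint.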
Note C2.5: Furthermore, if you have discovered any unanticipated side-effects at all, consider that they are likely to require evaluation against some values that were not considered under the Values checkpoint, since you were not expecting them; you need to go back and expand your list of relevant values, and develop scales and rubrics for these, too.
Note C2.6: Almost without exception, the social science literature on effects identifies them as
what happened after an intervention that would not have happened without the presence of the
intervention—this is the so-called counterfactual property. This is a complete fallacy, and shows
culpable ignorance of about a century’s literature on causation in the logic of science (see refer-
ences given above on causation, e.g., in footnote 8). Many effects would have happened anyway,
due to the presence of other factors with causal potential; this is the phenomenon of ‘overdeter-
mination’ which is common in the social sciences. For example, the good that Catholic Charities
does in a disaster might well have occurred if they were not operating, since there are other
sources of help with identical target populations; this does not show they were not in fact the
causal agency nor does it show that they are redundant.
C3. Costs

This checkpoint brings in what might be called 'the other quantitative element in evaluation' besides statistics, i.e., (most of) cost analysis. But don't forget there is also such a thing as qualitative cost analysis, and it's also very important—and, done properly, it's not a feeble surrogate for quantitative cost analysis but an essentially independent effort; note that both quantitative and qualitative cost-analysis are included in the economist's definition of cost-effectiveness. Both are usually very important in determining worth (or, in one sense, value) by contrast with plain merit (a.k.a. quality). Both were almost totally ignored for many years after program evaluation became a matter of professional practice; and a recent survey of journal articles by Nadini Persaud shows they are still seriously underused in evaluation.
An impediment to progress that she points out is that today, CA (cost analysis) is done in a different way by economists and accountants,37 and you will need to make clear which approach you are using, or that you are using both—and, if you do use both, indicate when and where you use each. There are also a number of different types of quantitative CA—cost-benefit analysis, cost-effectiveness analysis, cost-utility analysis, cost-feasibility analysis, etc.—and each has a particular purpose; be sure you know which one you need and explain why in the report (the definitions in Wikipedia are good). The first two require calculation of benefits as well as costs, which usually means you have to find, and monetize if important (and possible), the benefits and damages from Checkpoint C2 as well as the more conventional (input) costs.
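For readers who want the arithmetic spelled out, here is a minimal sketch, in Python, of the difference between the first two types; the program figures are invented, and 'trainees at proficiency' is just a stand-in outcome unit.

# Toy illustration of the difference between cost-benefit and cost-effectiveness
# analysis for a single program. All numbers are invented for illustration.

total_cost = 250_000.0          # monetized input costs, in dollars
monetized_benefits = 400_000.0  # benefits that could defensibly be monetized
units_of_outcome = 800          # e.g., trainees brought to proficiency

# Cost-benefit analysis needs both sides in money terms.
net_benefit = monetized_benefits - total_cost
benefit_cost_ratio = monetized_benefits / total_cost

# Cost-effectiveness analysis leaves the outcome in natural units.
cost_per_unit = total_cost / units_of_outcome

print(f"net benefit:        ${net_benefit:,.0f}")
print(f"benefit-cost ratio: {benefit_cost_ratio:.2f}")
print(f"cost per outcome:   ${cost_per_unit:,.2f} per trainee at proficiency")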
At a superficial level, cost analysis requires attention to and distinguishing between: (i) money vs. non-money vs. non-monetizable costs; (ii) direct and indirect costs; (iii) both actual and opportunity costs;38 and (iv) sunk (already spent) vs. prospective costs. It is also often helpful, for the evaluator and/or audiences, to itemize these by developmental stage, i.e., in terms of the costs of: (a) start-up (purchase, recruiting, training, site preparation, etc.); (b) maintenance (including ongoing training and evaluating); (c) upgrades; (d) shut-down; (e) residual (e.g., environmental damage); and/or by calendar time period; and/or by cost elements (rent, equipment, personnel, etc.); and/or by payee. Include use of expended but never utilized value, if any, e.g., social capital (such as decline in workforce morale).
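One illustrative way to hold such an itemization is sketched in Python below; the stages follow the (a) to (e) list above, while the cost elements and figures are hypothetical placeholders.

# Sketch of one way to organize the itemization just described: costs keyed by
# developmental stage and cost element. Stages follow the checklist's (a)-(e);
# the elements and figures are hypothetical.

costs = {
    ("start-up",    "site preparation"): 30_000,
    ("start-up",    "training"):         12_000,
    ("maintenance", "personnel"):        90_000,
    ("maintenance", "ongoing training"):  8_000,
    ("upgrades",    "equipment"):        15_000,
    ("shut-down",   "personnel"):         5_000,
    ("residual",    "environmental"):    20_000,
}

def subtotal(by):
    """Subtotal the table by one key: by=0 for stage, by=1 for cost element."""
    out = {}
    for key, amount in costs.items():
        out[key[by]] = out.get(key[by], 0) + amount
    return out

print("By stage:  ", subtotal(0))
print("By element:", subtotal(1))
# The same table can be extended with a calendar-period or payee key when those
# breakdowns are the ones the audience needs.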
The most common significant non-money costs that are often monetizable are space, time, expertise, and common labor, to the extent that these are not available for purchase in the open market—when they are so available, they can be monetized. The less measurable but often more significant ones include: lives, health, pain, stress (and other positive or negative affects), social/political/personal capital or debts (e.g., reputation, goodwill, interpersonal skills), morale, energy reserves, content and currency of technical knowledge/skills, and immediate/long-term environmental costs. Of course, in all this, you should be analyzing the costs and benefits of unintended as well as intended outcomes; and, although unintended heavily overlaps unanticipated, both must be covered. The non-money costs are almost never trivial in large program evaluations, technology assessment, or senior staff evaluation, and very often decisive.

The fact that in rare contexts (e.g., insurance suits) some money equivalent of, e.g., a life is treated seriously is not a sign that life is a monetizable value in general, i.e., across more than that very limited context,39 let alone a sign that if we only persevere, cost analysis can be treated as really a quantitative task or even as a task for which a quantitative version will give us a useful approximation to the real truth. Both views are categorically wrong, as is apparent if you think about the difference between the value of a particular person's life to their family, vs. to their employer/employees/coworkers, vs. to their profession, and to their friends; and the difference between those values as between different people whose lost lives we are evaluating. And don't think that the way out is to allocate different money values to each specific case, i.e., to each person's life-loss for each impacted group: not only will this destroy generalizability but the cost to some of these impactees is clearly still not covered by money, e.g., when a great theoretician or musician dies.

37 Accountants do 'financial analysis,' which is oriented towards an individual's monetary situation; economists do 'economic analysis,' which takes a societal point of view.

38 Economists often define the costs of P as the value of the most valuable forsaken alternative (MVFA), i.e., as the same as opportunity costs. This risks circularity, since it's arguable that you can't determine the value of the MVFA without knowing what it required you to give up, i.e., identifying its MVFA. In general, it may be better to define ordinary costs as the tangible valued resources that were used to cause the evaluand to come into existence (money, time, expertise, effort, etc.), and opportunity costs as another dimension of cost, namely the MVFA you spurned by choosing to create the evaluand rather than the best alternative path to your goals, using about the same resources. The deeper problem is this: the 'opportunity cost of the evaluand' is ambiguous; it may mean the value of something else to do the same job, or it may mean the value of the resources if you didn't attempt this job at all. (See my "Cost in Evaluation: Concept and Practice", in The Costs of Evaluation, edited by Alkin and Solomon (Sage, 1983), and "The Economist's Fallacy" in jmde.com, 2007.)
As an evaluator you aren't doing a literal audit, since you're (usually) not an accountant, but you can surely benefit if an audit is available, or being done in parallel. Otherwise, consider hiring a good accountant as a consultant to the evaluation; or an economist, if you're going that way. But even without the accounting expertise, your cost analysis and certainly your evaluation, if you follow the lists here, will include key factors—for decision-making or simple appraisal—usually omitted from standard auditing practice. And keep in mind that there are evaluations where it is appropriate to analyze benefits (a subset of outcomes) in just the same way, i.e., by type, time of appearance, etc. This is especially useful when you are doing an evaluation with an emphasis on cost-benefit tradeoffs.

Note C3.1: This sub-evaluation (especially item (iii) in the first list) is the key element in the determination of worth.

Note C3.2: If you have not already evaluated the program's risk-management efforts under Process, consider doing so—or having it done—as part of this checkpoint.

Note C3.3: Sensitivity analysis is the cost-analysis analog of robustness analysis in statistics and testing methodology, and equally important. It is essential to do it for any quantitative results.
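A minimal sketch of what that amounts to in practice, assuming a cost-feasibility question and invented figures; the point is only that the verdict should be re-checked across the plausible range of each uncertain assumption.

# Minimal sensitivity-analysis sketch for Note C3.3: vary an uncertain cost
# assumption over a plausible range and see whether the evaluative conclusion
# (here, staying under a cost-feasibility ceiling) changes. All figures invented.

fixed_costs = 180_000
est_cost_per_recipient = 95      # the uncertain assumption
recipients = 2_000
budget_ceiling = 400_000

for factor in (0.8, 1.0, 1.2):   # -20%, point estimate, +20%
    unit_cost = est_cost_per_recipient * factor
    total = fixed_costs + unit_cost * recipients
    verdict = "feasible" if total <= budget_ceiling else "NOT feasible"
    print(f"unit cost ${unit_cost:6.2f}: total ${total:,.0f} -> {verdict}")
# If the verdict flips within the plausible range, the report must say so.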
Note C3.4: The discussion of CA in this checkpoint so far uses the concept of cost-effectiveness in the usual economic sense, but there is another sense of this concept that is of considerable importance in evaluation—in some but not all contexts—and this sense does not seem to be discussed in the economic or accounting literature. (It is the 'extended sense' mentioned in the Resources checkpoint discussion above.) In this sense, efficiency or cost-effectiveness means the ratio of benefits to resources available, not resources used. In this sense—remember, it's only appropriate in certain contexts—one would say that a program, e.g., an aid program funded to provide clean water to refugees in the Haitian tent cities in 2010, was (at least in this respect) inefficient/cost-ineffective if it did not do as much as was possible with the resources provided. There may be exigent circumstances that deflect any imputation of irresponsibility here, but the fact remains that the program needs to be categorized as unsatisfactory with respect to getting the job done, even though it was provided with adequate resources to do it. Moral: when you're doing CA in an evaluation, don't just analyze what was spent but also what was available.
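The two ratios can be laid side by side in a few lines; in this sketch the outcome units, spending, and funding figures are invented, and 'benefit units' stands for whatever outcome measure the evaluation has justified.

# Sketch of the 'extended sense' of cost-effectiveness in Note C3.4: compare the
# benefit achieved not only with the resources actually spent but with those made
# available. Figures are invented placeholders.

benefit_units = 60_000
resources_spent = 1_200_000       # dollars actually used
resources_available = 2_000_000   # dollars the program was funded to use

conventional_ce = benefit_units / resources_spent     # benefit per dollar spent
extended_ce = benefit_units / resources_available     # benefit per dollar available

print(f"per dollar spent:     {conventional_ce:.3f} units")
print(f"per dollar available: {extended_ce:.3f} units")
# A program can look respectable on the first ratio and still be unsatisfactory on
# the second, which is the point of the checkpoint's 'Moral'.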
39 The World Bank since 1966 has recommended reporting mortality data in terms of lives saved or lost, not dollars (Persaud reference).
C4. Comparisons

Comparative or relative m/w/s, which requires comparisons, is often extremely illuminating, and sometimes absolutely essential—as when a government has to decide on whether to refund a health program, go with a different one, or abandon the sector to private enterprise. Here you must look for programs or other entities that are alternative ways for getting the same or similar benefits from about the same resources, especially those that use fewer resources. Anything that comes close to this is known as a "critical competitor". Identifying the most important critical competitors is a test of high intelligence, since they are often very unlike the standard competitors, e.g., a key critical competitor for telephone and email communication in extreme disaster planning is carrier pigeons, even today. It is also often worth looking for, and reporting on, at least one other alternative—if you can find one—that is much cheaper but not much less effective ('el cheapo'); and one much stronger although costlier alternative, i.e., one that produces far more payoffs or process advantages ('el magnifico'), although still within the outer limits of the available Resources identified in Checkpoint B4; the extra cost may still be the best bet. (But be sure that you check carefully, e.g., don't assume the more expensive option is higher quality because it's higher priced.) It's also sometimes worth comparing the evaluand with a widely adopted/admired approach that is perceived by important stakeholders as an alternative, though not really in the race, e.g., a local icon. Keep in mind that looking for programs 'having the same effects' means looking at the side-effects as well as intended effects, to the extent they are known, though of course the best available critical competitor might not match on side-effects…

Treading on potentially thin ice, there are also sometimes strong reasons to compare the evaluand with a demonstrably possible alternative, a 'virtual critical competitor'—one that could be assembled from existing or easily constructed components (the next checkpoint is another place where ideas for this can emerge). The ice is thin because you're now moving into the partial role of a program designer rather than an evaluator, which creates a risk of conflict of interest (you may be ego-involved as author of a possible competitor and hence not objective about evaluating it or, therefore, the original evaluand). Also, if your ongoing role is that of formative evaluator, you need to be sure that your client can digest suggestions of virtual competitors (see also Checkpoint D2). The key comparisons should be constantly updated as you find out more from the evaluation of the primary evaluand, especially new side-effects, and should always be in the background of your thinking about the evaluand.

Note C4.1: It sometimes looks as if looking for critical competitors is a completely wrong approach, e.g., when we are doing formative evaluation of a program, i.e., with the interest of improvement: but in fact, it's important even then to be sure that the changes made or recommended really do add up, taken all together, to an improvement, so you need to compare version 2 with version 1, and also with available alternatives, since the set of critical competitors may change as you modify the evaluand.

Note C4.2: It's tempting to collapse the Cost and Comparison checkpoints into 'Comparative Cost-Effectiveness' (as Davidson does, for example) but it's better to keep them separate because for certain important purposes, e.g., fund-raising, you will need the separate results. Other examples: you often need to look at simple cost-feasibility, which does not involve comparisons (but give the critical competitors a quick look in case one of them is cost-feasible); or at relative merit when 'cost is no object' (which means 'all available alternatives are cost-feasible, and the merit gains from choosing correctly are much more important than cost savings').

Note C4.3: One often hears the question: "But won't the Comparisons Checkpoint double or triple our costs for the evaluation—after all, the comparisons needed have to be quite detailed in order to match one based on the KEC?" Some responses: (i) "But the savings on purchase costs may be much more than that;" (ii) "There may already be a decent evaluation of some or several or all critical competitors in the literature;" (iii) "Other funding sources may be interested in the broader evaluation, and able to help with the extra costs;" (iv) "Good design of the evaluations of alternatives will often eliminate potential competitors at trifling cost, by starting with the checkpoints on which they are most obviously vulnerable;" (v) "Estimates, if that's all you can afford, are much cheaper than evaluations, and better than not doing a comparison at all."
C5. Generalizability
Other names for this checkpoint (or something close to, or part of it) are: exportability, transferability, transportability—which would put it close to Campbell's "external validity"—but it also covers sustainability, longevity, durability, and resilience, since these tell you about generalizing the program's merit to other times rather than (or as well as) other places or circumstances besides the one you're at (in either direction, so the historian is involved). Note that this checkpoint concerns the sustainability of the program, not the sustainability of its effects, which is also important and covered under impact.

Although other checkpoints bear on it (because they are needed to establish that the program has non-trivial benefits), this checkpoint is frequently the most important one of the core five when attempting to determine significance. (The other highly relevant checkpoint for that is C4, where we look at how much better it is compared to whatever else is available; and the final word on that comes in Checkpoint D1, especially Note D1.1.)

Under Checkpoint C5, you must find the answers to questions like these: Can the program be used, with similar results, if we use it: (i) with other content; (ii) at other sites; (iii) with other staff; (iv) on a larger (or smaller) scale; (v) with other recipients; (vi) in other climates (social, political, physical); etc.? An affirmative answer on any of these 'dimensions of generalization' is a merit, since it adds another universe to the domains in which the evaluand can yield benefits (or adverse effects). Looking at generalizability thus makes it possible (sometimes) to benefit greatly from, instead of dismissing, programs and policies whose use at the time of the study was for a very small group of impactees—such programs may be extremely important because of their generalizability.

Generalization to (vii) later times, a.k.a. longevity, is nearly always important (under common adverse conditions, it's durability). Even more important is (viii) sustainability (this is external sustainability, not the same as the internal variety mentioned under Process). It is sometimes inadequately treated as meaning, or as equivalent to, 'resilience to risk.' Sustainability usually requires making sure the evaluand can survive at least the termination of the original funding (which is usually not a risk but a known certainty), and also some range of hazards under the headings of warfare or disasters of the natural as well as financial, social, ecological, and political varieties.
Sustainability isn't the same as resilience to risk, especially because it must cover future certainties, such as seasonal changes in temperature, humidity, water supply—and the end of the reign of the present CEO, or of present funding. But the 'resilience to risk' definition has the merit of reminding us that this checkpoint will require some effort at identifying and then estimating the likelihood of the occurrence of the more serious risks, and costing the attendant losses.

Sustainability is sometimes even more important than longevity, for example when evaluating international or cross-cultural developmental programs; longevity and durability refer primarily to the reliability of the 'machinery' of the program and its maintenance, including availability of the required labor/expertise and tech supplies, but are less connotative of external threats such as the '100-year drought' or civil war, and less concerned with 'continuing to produce the same results', which is what you really care about. Note that what you're generalizing—i.e., predicting—about these programs is the future (effects) of 'the program in context,' not the mere existence of the program, and so any context required for the effects should be specified, and include any required infrastructure. Here, as in the previous checkpoint, we are making predictions about outcomes in certain scenarios, and, although risky, this sometimes generates the greatest contribution of the evaluation to improvement of the world (see also the 'possible scenarios' of Checkpoint D4). All three show the extent to which good evaluation is a creative and not just a reactive enterprise. That's the good news way of putting the point; the bad news way is that much good evaluation involves raising questions that can only be answered definitively by doing work that you are probably not funded to do.

Note C5.1: Above all, keep in mind that the absence of generalizability has absolutely no deleterious effect on establishing that a program is meritorious, unlike the absence of a positive rating on any of the four other sub-evaluation dimensions. It only affects establishing the extent of its benefits. This can be put by saying that generalizability is a plus, but its absence is not a minus—unless you're scoring for the Ideal Program Oscars. Putting it another way, generalizability is highly desirable, but that doesn't mean that it's a requirement for m/w/s. A program may do the job of meeting needs just where it was designed to do that, and not be generalizable—and still rate an A+.

Note C5.2: Although generalizability is 'only' a plus, it needs to be explicitly defined and defended. It is still the case that good researchers make careless mistakes of inappropriate implicit generalization. For example, there is still much discussion, with good researchers on both sides, of whether the use of student ratings of college instructors and courses improves instruction, or has any useful level of validity. But any conclusion on this topic involves an illicit generalization, since the evaluand 'student ratings' is about as useful in such evaluations as 'herbal medicine' is in arguments about whether herbal medicine is beneficial or not. Since any close study shows that herbal medicines with the same label often contain completely different substances (and almost always substantially different amounts of the main element), and since most but not all student rating forms are invalid or uninterpretable for more than one reason, the essential foundation for the generalization—a common referent—is non-existent. Similarly, investigations of whether online teaching is superior to onsite instruction, or vice versa, are about absurdly variable evaluands, and generalizing about their relative merits is like generalizing about the ethical standards of 'white folk' compared to 'Asians.' Conversely, and just as importantly, evaluative studies of a nationally distributed reading program must begin by checking the fidelity of your sample (Description and Process checkpoints). This is checking instantiation (sometimes this is part of what is called 'checking dosage' in the medical/pharmaceutical context), the complementary problem to checking generalization.

Note C5.3: Checkpoint C5 is, perhaps more than any others, the residence of prediction, with all its special problems. Will the program continue to work in its present form? Will it work in some modified form? In some different context? With different personnel/clients/recipients? These, and the others listed above, are each formidable prediction tasks that will, in important cases, require separate research into their special problems. When special advice cannot be found, it is tempting to fall back on the assumption that, absent ad hoc considerations, the best prediction is extrapolation of current trends. That's the best simple choice, but it's not the best you can do; you can at least identify the most common interfering conditions and check to see if they are/will be present and require a modification or rejection of the simple extrapolation. Example: will the program continue to do as well as it has been doing? Possibly not, if the talented CEO dies/retires/leaves/burns out. So check on the evidence for each of these possibilities, thereby increasing the validity of the bet on steady-state results, or forcing a switch to another bet. See also Note D2.2.

General Note 7: Comparisons, Costs, and Generalizability are in the same category as values from the list in Checkpoint B5; they are all considerations of certain dimensions of value—comparative value, economic value, general value. Why do they get special billing with their own checkpoint in the list of sub-evaluations? Basically, because of (i) their virtually universal critical importance,40 (ii) the frequency with which one or more are omitted from evaluations when they should have been included, and (iii) because they each involve some techniques of a relatively special kind. Despite their idiosyncrasies, it's also possible to see them as potential exemplars, by analogy at least, of how to deal with some of the other relevant values from Checkpoint B5, which will come up as relevant under Process, Outcomes, and Comparisons.
poses you need a further synthesis, this time of the sub-evaluations, because you need to get a one-dimensional evaluative conclusion, i.e., an overall grade or, if you can justify a quantitative scale, an overall score. For example, you may need to assist the client in choosing the best of several evaluands, which means ranking them, and the easiest way to do this is to have each of them evaluated on a single overall summative dimension. That's easy to say, but it's not easy to justify most efforts to do that, because in order to combine those multiple dimensions into a single one, you have to have a legitimate common metric for them, which is rarely supportable. (It's easy to see why a quantitative approach is so attractive!) At the least, you'll need a supportable estimate of the relative importance of each dimension of merit, and not even that is easy to get. Details of how and when it can be done will be provided elsewhere and would take too much space to fit in here.41 The content focus (point of view) of the synthesis, on which the common metric should be based, should usually be the present and future total impact on consumer (e.g., employer, employee, patient, student) or community needs, subject to the constraints of ethics, the law, and resource-feasibility, etc…

Apart from the need for a ranking there is very often also a practical need for a concise presentation of the most crucial evaluative information. A profile showing merit on the five core dimensions of Part C can often meet that need, without going to a uni-dimensional compression into a single grade. Another possible profile for such a summary would be based on the SWOT checklist widely used in business: Strengths, Weaknesses, Opportunities, and Threats.42 Sometimes it makes sense to provide both profiles. This part of the synthesis/summary could also include referencing the results against the clients' and perhaps other stakeholders' goals, wants, or hopes (if feasible), e.g., goals met, ideals realized, created but unrealized value, when these are determinable, which can also be done with a profile. But the primary obligation of the evaluator is usually to reference the results to the needs of the impacted population, within the constraints of overarching values such as ethics, the law, the culture, etc. Programs are not made into good programs by matching someone's goals, but by doing something worthwhile, on balance. Of course, for public or philanthropic funding, the two should coincide, but you can't assume they do; in fact, they are all-too-often incompatible.

Another popular focus for the overall report is the ROI (return on investment), which is superbly concise, but it's too limited (no ethics, side-effects, goal critique, etc.). The often-suggested 3D expansion of ROI gives us the 3P dimensions—benefits to People, Planet, and Profit—often called the 'triple bottom line.' It's still a bit narrow and we can do better with the five dimensions listed here as the sub-evaluations of Part C: Process, Outcomes, Costs, Comparisons, Generalizability. A bar graph showing the merit of achievements on each of these provides a succinct and insightful profile of a program's value. To achieve it, you will need defensible definitions of the standards you are using on each column (the rubrics), e.g., "An A grade for Outcomes will require…", and there will be 'bars' (i.e., absolute minimum standards) on several of these, e.g., ethical acceptability on the Outcomes scale, cost-feasibility on the Costs scale. Since it's highly desirable that you have these for any serious program evaluation, this 5D summary should not be a dealbreaker requirement.
41 An article "The Logic of Evaluation" forthcoming by summer, 2011, on the web site michaelscriven.info does a better job on this than my previous efforts, which do not now seem adequate as references.
42 Google provides 6.2 million references for SWOT (@1/23/07), but the top two or three are good introductions.
(Another version of a 5D approach is given in the paper "Evaluation of Training" that is on-line at michaelscriven.info.)
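As a purely illustrative sketch of such a profile (not a prescribed format), the five dimensions, rubric grades, and bars can be held in a small data structure and printed as a crude bar chart; every grade, bar, and threshold below is an invented placeholder that a real evaluation would have to define and defend.

# Sketch of a five-dimension profile with rubric grades and 'bars' (absolute
# minimum standards). All grades and bars are invented placeholders.

GRADE_ORDER = ["F", "D", "C", "B", "A"]

profile = {  # dimension -> (grade awarded, minimum 'bar' if any)
    "Process":          ("B", None),
    "Outcomes":         ("A", "C"),   # e.g., bar = ethical acceptability threshold
    "Costs":            ("C", "C"),   # e.g., bar = cost-feasibility
    "Comparisons":      ("B", None),
    "Generalizability": ("D", None),  # a plus if high, but no bar (Note C5.1)
}

def clears_bar(grade, bar):
    return bar is None or GRADE_ORDER.index(grade) >= GRADE_ORDER.index(bar)

for dim, (grade, bar) in profile.items():
    column = "#" * (GRADE_ORDER.index(grade) + 1)   # crude bar-graph column
    flag = "" if clears_bar(grade, bar) else "  <-- fails its bar"
    print(f"{dim:17s} {grade}  {column}{flag}")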
Apart from the rubrics for each relevant value, if you have to come up with an overall grade of some kind, you will need to do an overall synthesis to reduce the two-dimensional profile to a 'score' on a single dimension. (Since it may be qualitative, we'll use the term 'grade' for this property.) Getting to an overall grade requires what we might call a meta-rubric—a set of rules for converting profiles—which are typically themselves a set of grades on several dimensions—to a grade on a single scale. What we call 'weighting' the dimensions is a basic kind of meta-rubric since it's an instruction to take some of the constituent grades more seriously than others for some further, 'higher-level' evaluative purpose. (A neat way to display this graphically is by using the width of a column in the profile to indicate importance.) If you are lucky enough to have developed an evaluative profile for a particular evaluand, in which each dimension of merit is of equal importance (or of some given numerical importance compared to the others), and if each grade can be expressed numerically, then you can just average the grades. BUT legitimate examples of such cases are almost unknown, although we often oversimplify and act as if we have them when we don't. For example, we average college grades to get the GPA (grade point average) and use this in many overall evaluative contexts such as selection for admission to graduate programs. Of course, this oversimplification can be, and frequently is, 'gamed' by students, e.g., by taking courses where grade inflation means that the A's do not represent excellent work by any reasonable standard. A better meta-rubric results from using a comprehensive exam, graded by a departmental committee instead of one person, and then giving the grade on this double weight, or even 80% of the weight. Another common meta-rubric in graduate schools is setting a meta-bar, i.e., an overall absolute requirement for graduation, e.g., that no single dimension (course or a named subset of crucially important courses) be graded below B-.
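A toy version of such a meta-rubric, with invented weights, grades, a grade-to-number mapping, and a meta-bar, might look like this; none of these numbers is recommended, they only show where each judgment call sits.

# Sketch of a simple meta-rubric: numerical weights over the dimension grades plus
# a meta-bar ('no dimension below C'). Weights, grades, and the grade-to-number
# mapping are all assumptions to be argued for, not outputs of the KEC itself.

POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

grades  = {"Process": "B", "Outcomes": "A", "Costs": "C",
           "Comparisons": "B", "Generalizability": "D"}
weights = {"Process": 1, "Outcomes": 3, "Costs": 2,
           "Comparisons": 2, "Generalizability": 1}
meta_bar = "C"   # overall requirement: no bar-carrying dimension may fall below this
bar_dims = {"Process", "Outcomes", "Costs", "Comparisons"}  # C5 exempt (Note C5.1)

if any(POINTS[grades[d]] < POINTS[meta_bar] for d in bar_dims):
    overall = "unsatisfactory (meta-bar violated)"
else:
    score = sum(weights[d] * POINTS[g] for d, g in grades.items()) / sum(weights.values())
    overall = f"weighted score {score:.2f} on a 0-4 scale"
print(overall)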
Note D1.1: One special conclusion to go for, often a major part of determining significance, comes from looking at what was done against what could have been done with the Resources available, including social and individual capital. This is one of several cases where imagination is needed to determine a grade on the Opportunities part of the SWOT analysis. But remember this is thin ice territory (see Note C4.1).

Note D1.2: Be sure to convey some sense of the strength of your conclusions, which means the combination of: (i) the net weight of the evidence for the premises, with (ii) the probability of the inferences from them to the conclusion(s), and (iii) the probability that there is no other relevant evidence. For example, indicate whether the performance on the various dimensions of merit was a tricky inference or directly observed; did the evaluand clear any bars or lead any competitors 'by a mile' or just scrape over (i.e., use gap-ranking, not just ranking43); were the predictions involved double-checked for invalidating indicators (see Note C5.2); was the conclusion established 'beyond any reasonable doubt,' or merely 'supported by the balance of the evidence'? This complex property of the evaluation is referred to as 'robustness.'

43 Gap-ranking is a refinement of ranking in which a qualitative or quantitative estimate of the size of intervals between evaluands is provided (modeled after the system in horse-racing—'by a head,' 'by a nose,' 'by three lengths,' etc.). This is often enormously more useful than mere ranking, e.g., because it tells a buyer that s/he can get very nearly as good a product for much less money.
Some specific aspects of the limitations also need statement here, e.g., those due to a limited time-frame (which often rules out some mid- or long-term follow-ups that are badly needed).

D2. (possible) Recommendations, Explanations, Predictions, and Redesigns

All of these possibilities are examples of the 'something more' approach to evaluation, by contrast with the more conservative 'nothing but' approach, which advocates rather careful restriction of the evaluator's activities to evaluation, 'pure and simple.' These alternatives have analogies in every profession—judges are tempted to accept directorships in companies that may come before them as defendants, counsellors consider adopting counselees, etc. The 'nothing more' approach can be expressed, with thanks to a friend of Gloria Steinem, as: 'An evaluation without recommendations (or explanations, etc.) is like a fish without a bicycle.' Still, there are more caveats about pressing for evaluation-separation than with the fish. In other words, 'lessons learned'—of whatever type—should be sought diligently, expressed cautiously, and applied even more cautiously.

Let's start with recommendations. Micro-recommendations—those concerning the internal workings of program management and the equipment or personnel choices/use—often become obvious to the evaluator during the investigation, and are demonstrable at little or no extra cost/effort (we sometimes say they "fall out" from the evaluation; as an example of how easy this can sometimes be, think of copy-editors, who often do both evaluation and recommendation to an author in one pass), or they may occur to a knowledgeable evaluator who is motivated to help the program, because of his/her expert knowledge of this or an indirectly or partially relevant field such as information or business technology, organization theory, systems concepts, or clinical psychology. These 'operational recommendations' can be very useful—it's not unusual for a client to say that these suggestions alone were worth more than the cost of the evaluation. (Naturally, these suggestions have to be within the limitations of the (program developer's) Resources checkpoint, except when doing the Generalizability checkpoint.) Generating these 'within-program' recommendations as part of formative evaluation (though they're one step away from the primary task of formative evaluation, which is straight evaluation of the present quality of the evaluand) is one of the good side-effects that may come from using an external evaluator, who often has a new view of things that everyone on the scene may have seen too often to see critically.

On the other hand, macro-recommendations—which are about the disposition or classification of the whole program (refund, cut, modify, export, etc.—which we might also call external management recommendations, or dispositional recommendations)—are usually another matter. These are important decisions serviced by, and properly dependent on, summative evaluations, but making recommendations about the evaluand is not intrinsically part of the task of evaluation as such, since it depends on other matters besides the m/w/s of the program, which is all the evaluator normally can undertake to determine. For the evaluator to make dispositional recommendations about a program will typically require two extras over and above what it takes to evaluate the program: (i) extensive knowledge of the other factors in the context-of-decision for the top-level ('about-program') decision-makers. Remember that those people are often not the clients for the evaluation—they are often further up the organization chart—and they may be unwilling or psychologically or legally unable to provide full details about the context-of-decision concerning the program (e.g., unable because implicit values are not always recognized by those who operate using them). The correct dispositional decisions often rightly depend on legal or donor constraints on the use of funds, and sometimes on legitimate political constraints not explained to the evaluator, not just m/w/s; and any of these can arise after the evaluation begins and the evaluator is briefed about then-known environmental constraints, if s/he is briefed at all. Such recommendations will also often require (ii) considerable extra effort, e.g., to evaluate each of the other macro-options. Key elements in this may be trade secrets or national security matters not available to the evaluator, e.g., the true sales figures, the best estimate of competitors' success, the extent of political vulnerability for work on family planning, the effect on share prices of withdrawing from this slice of the market. This elusiveness also often applies to the macro-decision makers' true values, with respect to this decision, which are quite often trade or management or government secrets of the board of directors, or select legislators, or perhaps personal values only known to their psychotherapists.

So it is really a quaint conceit of evaluators to suppose that the m/w/s of the evaluand are the only relevant grounds for deciding how to dispose of it; there are often entirely legitimate political, legal, public-perception, market, and ethical considerations that are at least as important, especially in toto. So it's simply presumptuous to propose macro-recommendations as if they follow directly from the evaluation: they almost never do, even when the client may suppose that they do, and encourage the evaluator to produce them. (It's a mistake I've made more than once.) If you do have the required knowledge to infer to them, then at least be very clear that you are doing a different evaluation in order to reach them, namely an evaluation of the alternative options open to the disposition decision-makers, by contrast with an evaluation of the evaluand itself. In the standard program evaluation, but not in the evaluation of various dispositions of it, you can sometimes include an evaluation of the internal choices available to the program manager, i.e., recommendations for improvements.

There are a couple of ways to 'soften' recommendations in order to take account of these hazards. The simplest way is to preface them by saying, "Assuming that the program's disposition is dependent only on its m/w/s, it is recommended that…" A more creative and often more productive approach, advocated by Jane Davidson, is to convert recommendations into options, e.g., as follows: "It would seem that program management/staff faces a choice between: (i) continuing with the status quo; (ii) abandoning this component of the program; (iii) implementing the following variant [here you insert your recommendation] or some variation of this." The program management/staff is thus invited to adopt and become a co-author of an option, a strategy that is often more likely to result in implementation than a mere recommendation from an outsider.

Many of these extra requirements for making macro-recommendations—and sometimes one other—also apply to providing explanations of success or failure. The extra requirement is possession of the correct (not just the believed) logic or theory of the program, which typically requires more than—and rarely requires less than—state-of-the-art subject-matter expertise, both practical and 'theoretical' (i.e., the scientific or engineering account), about the evaluand's inner workings (i.e., about what optional changes would lead to what results).
ject-‐matter
expertise,
both
practical
and
‘theoretical’
(i.e.,
the
scientific
or
engineering
ac-‐
count),
about
the
evaluand’s
inner
workings
(i.e.,
about
what
optional
changes
would
lead
to
what
results).
A
good
automobile
mechanic
has
the
practical
kind
of
knowledge
about
cars
that
s/he
works
on
regularly,
which
includes
knowing
how
to
identify
malfunction
and
its
possible
causes;
but
it’s
often
only
the
automobile
engineer
who
can
give
you
the
rea-‐
sons
why
these
causal
connections
work,
which
is
what
the
demand
for
explanations
will
usually
require.
The
combination
of
these
requirements
imposes
considerable,
and
some-‐
times
enormous,
extra
time
and
research
costs
which
has
too
often
meant
that
the
attempt
to
provide
recommendations
or
explanations
(by
using
the
correct
program
logic)
is
done
at
the
expense
of
doing
the
basic
evaluation
task
well
(or
even
getting
to
it
at
all),
a
poor
trade-‐off
in
most
cases.
Moreover,
getting
the
explanation
right
will
sometimes
be
abso-‐
lutely
impossible
within
the
‘state
of
the
art’
of
science
and
engineering
at
the
moment—
and
this
is
not
a
rare
event,
since
in
many
cases
where
we’re
looking
for
a
useful
social
intervention,
no-‐one
has
yet
found
a
plausible
account
of
the
underlying
phenomenon:
for
example,
in
the
cases
of
delinquency,
addiction,
autism,
serial
killing,
ADHD.
In
such
cases,
what
we
need
to
know
is
whether
we
have
found
a
cure—complete
or
partial—since
we
can
use
that
knowledge
to
save
people
immediately,
and
also,
thereafter,
to
start
work
on
finding
the
explanation.
That’s
the
‘aspirin
case’—the
situation
where
we
can
easily,
and
with
great
benefit
to
many
sufferers,
evaluate
a
claimed
medication
although
we
don’t
know
why
it
works,
and
don’t
need
to
know
that
in
order
to
evaluate
its
efficacy.
In
fact,
un-‐
til
the
evaluation
is
done,
there’s
no
success
or
failure
for
the
scientist
to
investigate,
which
vastly
reduces
the
significance
of
the
causal
inquiry,
and
hence
the
probability/value
of
its
occurrence.
It’s
also
extremely
important
to
realize
that
macro-‐recommendations
will
typically
require
the
ability
to
predict
the
results
of
the
recommended
changes
in
the
program,
at
the
very
least
in
this
specific
context,
which
is
something
that
the
program
logic
or
program
theory
(like
many
social
science
theories)
is
often
not
able
to
do
with
any
reliability.
Of
course,
procedural
recommendations
in
the
future
tense,
e.g.,
about
needed
further
research
or
data-‐gathering
or
evaluation
procedures,
are
often
possible—although
typically
much
less
useful.
‘Plain’ predictions are also often requested by clients or thought to be included in any good evaluation (e.g., Will the program work reliably in our schools? Will it work with the recommended changes, without staff changes?) and are often very hazardous.44 Now, since these are reasonable questions to answer in deciding on the value of the program for many clients, you have to try to provide the best response. So read Clinical vs. Statistical Prediction by Paul Meehl and the follow-up literature, and the following Note D2.1, and then call in the subject matter experts. In most cases, the best thing you can do, even with all that help, is not just to pick what appears to be the most likely result, but to give a range from the probability of the worst possible outcome (which you describe carefully) to that of the best possible outcome (also described), plus the probability of the most likely outcome in the middle (described even more carefully).45 On rare occasions, you may be able to estimate a confidence interval for these estimates. Then the decision-makers can apply their choice of strategy (e.g., minimax—minimizing the maximum possible loss) based on their risk-aversiveness.

44 Evaluators sometimes say, in response to such questions, ‘Well, why wouldn’t it work—the reasons for it doing so are really good?’ The answer was put rather well some years ago: "…it ought to be remembered that there is nothing more difficult to take in hand, more perilous to conduct, or more uncertain of success, than to take the lead in the introduction of a new order of things. Because the innovator has for enemies all those who have done well under the old conditions, and lukewarm defenders in those who may do well under the new.” (Niccolò Machiavelli (1513); with thanks to John Belcher and Richard Hake for bringing it up recently (PhysLrnR, 16 Apr 2006).)
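To make that kind of range-plus-probabilities report concrete, here is a minimal sketch of how a decision-maker might compare options once the evaluator has supplied worst-case, most-likely, and best-case estimates. Everything in it (the option names, payoffs, probabilities, and the two helper functions) is invented for illustration; the KEC itself prescribes no such calculation.

    # Illustrative sketch only: option names, outcomes, and probabilities are
    # hypothetical, not drawn from any real evaluation.
    # Each option is reported as a set of scenarios:
    # (scenario label, estimated probability, payoff to the client on some agreed scale).
    options = {
        "continue with the status quo": [
            ("worst case",  0.2, -30),
            ("most likely", 0.6,  10),
            ("best case",   0.2,  25),
        ],
        "implement the recommended variant": [
            ("worst case",  0.3, -10),
            ("most likely", 0.5,  20),
            ("best case",   0.2,  60),
        ],
    }

    def minimax_choice(options):
        """Pick the option whose worst possible outcome is least bad,
        i.e., minimize the maximum possible loss (a risk-averse strategy)."""
        return max(options, key=lambda name: min(pay for _, _, pay in options[name]))

    def expected_value_choice(options):
        """Pick the option with the highest probability-weighted payoff
        (a less risk-averse strategy)."""
        return max(options, key=lambda name: sum(prob * pay for _, prob, pay in options[name]))

    print("Minimax choice:        ", minimax_choice(options))
    print("Expected-value choice: ", expected_value_choice(options))

The point of the sketch is only that the evaluator supplies the carefully described scenarios and their estimated probabilities; which strategy to apply to them remains the decision-makers’ choice.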
Although it’s true that almost every evaluation is in a sense predictive, since the data it’s based on is yesterday’s data but its conclusions are put forward as true today, there’s no need to be intimidated by the need to predict; one just has to be very clear what assumptions one is making and how much evidence there is to support them.
Finally, a new twist on ‘something more’ that I first heard proposed by John Gargani and Stewart Donaldson at the 2010 AEA convention is for the evaluator to do a redesign of a program rather than giving a highly negative evaluation. This is a kind of limit case of recommendation, and of course requires an extra skill set, namely design skills. The main problem here is role conflict and the consequent improper pressure: the evaluator is offering the client loaded alternatives, a variation on ‘your money or your life.’ The advocates suggest that the world will be a better place if the program is redesigned rather than just condemned by them, which is probably true; but these are not the only alternatives. The evaluator might instead recommend the redesign, and suggest calling for bids on that, recusing his or her candidacy. Or they might just recommend changes that a new designer should incorporate or consider.
Note D2.1: Policy analysis, in the common situation when the policy is being considered for future adoption, is close to being program evaluation of future (possible) programs (a.k.a. ex ante, or prospective, program evaluation) and hence necessarily involves all the checkpoints in the KEC including, in most cases, an especially large dose of prediction. (A policy is a ‘course or principle of action’ for a certain domain of action, and implementing it typically produces a program.) Extensive knowledge of the fate of similar programs in the past is then the key resource, but not the only one. It is also essential to look specifically for the presence of indicators of future change in the record, e.g., downturns in the performance of the policy in the most recent time periods, intellectual or motivational burn-out of principal players/managers, media attention, the probability of personnel departure for better offers, the probability of epidemics, natural disasters, legislative ‘counter-revolutions’ by groups of opponents, general economic decline, technological breakthroughs, or large changes in taxes or house or market values, etc. If, on the other hand, the policy has already been implemented, then we’re doing historical (a.k.a. ex post, or retrospective) program evaluation, and policy analysis amounts to program evaluation without prediction, a much easier case.
45 In PERT charting (PERT = Program Evaluation and Review Technique), a long-established approach to program planning that emerged from the complexities of planning the first submarine-launched nuclear missile, the Polaris, the formula for calculating what you should expect from some decision is: {best possible outcome + worst possible outcome + 4 × (most likely outcome)}/6. It’s a pragmatic solution to consider seriously. My take on this approach is that it only makes sense when there are good grounds for saying the most likely outcome (MLO) is very likely; there are many cases where we can identify the best and worst cases, but have no grounds for thinking the intermediate case is more likely other than the fact that it’s intermediate. Now that fact does justify some weighting (given the usual distribution of probabilities), but the coefficient for the MLO might then be better as 2 or 3.
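As a minimal illustration of the arithmetic just described, the sketch below computes the classic PERT estimate and a more cautious variant with a lower MLO coefficient. The example numbers are invented, and the assumption that the denominator shrinks along with the coefficient (so the result remains a weighted average) is mine, not something the PERT literature or this checklist prescribes.

    def pert_estimate(best, worst, most_likely, mlo_weight=4):
        """Weighted expected outcome: (best + worst + w * most_likely) / (w + 2).
        With mlo_weight=4 this is the classic PERT formula; a weight of 2 or 3
        can be used when the grounds for treating the MLO as very likely are weak.
        (Keeping the denominator at w + 2 so this stays a weighted average is an
        assumption, not part of the original formula.)"""
        return (best + worst + mlo_weight * most_likely) / (mlo_weight + 2)

    # Invented outcome values on an arbitrary 0-100 scale:
    print(pert_estimate(best=90, worst=20, most_likely=50))                # classic weighting: ~51.7
    print(pert_estimate(best=90, worst=20, most_likely=50, mlo_weight=2))  # cautious weighting: 52.5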
Note D2.2: Evaluability assessment is a useful part of good program planning whenever it is required, hoped for, or likely that evaluation could later be used to help improve, as well as determine, the m/w/s of the program, in order to assist decision-makers and fixers. It can be done well by using the KEC to identify the questions that will have to be answered eventually, and thus to identify the data that will need to be obtained; and the difficulty of doing that will determine the evaluability of the program as designed. Those preliminary steps are, of course, exactly the ones that you have to go through to design an evaluation, so the two processes are two sides of the same coin. Since everything is evaluable, to some extent in some contexts, the issue of evaluability is a matter of degree, resources, and circumstance, not of absolute possibility. In other words, while everything is evaluable, by no means is everything evaluable to a reasonable degree of confidence, with the available resources, in every context. (For example, the atomic power plant program for Iran after 4/2006, when access was denied to the U.N. inspectors.) As this example illustrates, ‘context’ includes the date and type of evaluation: while this evaluand is not evaluable prospectively with any confidence in 4/06—since getting the data is not feasible, and predicting sustainability is highly speculative—historians will no doubt be able to evaluate it retrospectively, because we will eventually know whether that program paid off, and/or brought on an attack.
Note D2.3: Inappropriate expectations. The fact that clients often expect or request explanations of success or shortcomings, or macro-recommendations, or impossible predictions, is grounds for educating them about what we can definitely do vs. what we can hope will turn out to be possible. Although tempting, these expectations on the client’s part are not an excuse for doing, or trying for long to do, and especially not for promising to do, these extra things if you cannot meet the very substantial extra requirements for doing them, especially if that effort jeopardizes the primary task of the evaluator, viz. drawing the needed type of evaluative conclusion about the evaluand.

The merit, worth, or significance of a program is often hard to determine; it (typically) requires that you determine whether and to what degree and in what respects and for whom and under what conditions and at what cost it does (or does not) work better or worse than the available alternatives, and what all that means for all those involved. To add on the tasks of determining how to improve it, explaining why it works (or fails to work), now and in the future, and/or what one should do about supporting or exporting it, is simply to add other tasks, often of great scientific and/or managerial/social interest, but quite often beyond current scientific ability, let alone the ability of an evaluator who is perfectly competent to evaluate the program.

In other words, ‘black box evaluation’ should not be used as a term of contempt, since it is often the name for a vitally useful, feasible, and affordable approach, and frequently the only feasible one. And in fact, most evaluations are of partially blacked-out boxes (‘grey boxes’) where one can only see a little of the inner workings. This is perhaps most obviously true in pharmacological evaluation, but it is also true in every branch of the discipline of evaluation and every one of its application fields (health, education, social services, etc.). A program evaluator with some knowledge of parapsychology can easily evaluate the success of an alleged faith-healer whose program theory is that God is answering his prayers, without the slightest commitment to the truth or falsehood of that program theory.
Note D2.4: Finally, there are extreme situations in which the evaluator does have a responsibility—an ethical responsibility—to move beyond the role of the evaluator, e.g., because it becomes clear, early in a formative evaluation, either that (i) some gross improprieties are involved, or that (ii) certain actions, if taken immediately, will lead to
very large increases in benefits, and it is clear that no-one besides the evaluator is going to take the necessary steps. The evaluator is then obliged to be proactive, and we can call the resulting action whistle-blowing in the first case, and proformative evaluation in the second, a cross between formative evaluation and proactivity. While macro-recommendations by evaluators require great care, proactivity requires even greater care.
explaining the report’s significance to different groups including users, staff, funders, and other impactees, and even reacting to later program or management or media documents allegedly reporting the results or implications of the evaluation. This in turn may involve proactive creation and depiction, in the primary report, of various possible scenarios of interpretations and associated actions that are, and—the contrast is extremely helpful—are not, consistent with the findings. Essentially, this means doing some problem-solving for the clients, that is, advance handling of difficulties they are likely to encounter with various audiences. In this process, a wide range of communication skills is often useful and sometimes vital, e.g., audience ‘reading’, use and reading of body language, understanding the multicultural aspects of the situation and the cultural iconography and connotative implications of types of presentations and response.46
There should usually be an explicit effort to identify ‘lessons learned,’ failures and limitations, and costs if requested, and to explain ‘who evaluates the evaluators.’ Checkpoint D4 should also cover getting the results (and incidental knowledge findings) into the relevant databases, if any; possibly, but not necessarily, into the information ocean via journal publication (with careful consideration of the cost of subsidizing these for potential readers of the publication chosen); recommending the creation of a new database or information channel (e.g., a newsletter) where beneficial; and dissemination into wider channels if appropriate, e.g., through presentations, online postings, discussions at scholarly meetings, or in hardcopy posters, graffiti, books, blogs, wikis, tweets, and movies (yes, fans, remember—YouTube is free).
D5. Meta-evaluation
This is the evaluation of an evaluation or evaluations—including evaluations based on the use of this checklist—in order to identify their strengths/limitations/other uses. Meta-evaluation should always be done, as a separate quality-control step (or steps), as follows: (i) to the extent possible, by the primary evaluator, certainly—but not only—after completion of the final draft of any report; and (ii) whenever possible, also by an external evaluator of the evaluation (a meta-evaluator).
The primary criteria of merit for evaluations are: (i) validity, at a contextually adequate level47; (ii) utility48, including cost-feasibility and comprehensibility (usually to clients, audiences, and stakeholders) of both the main conclusions about the m/w/s of the evaluand and the recommendations, if any, and also any utility arising from generalizability, e.g., of novel methodological approaches; (iii) credibility (to select stakeholders, especially funders, regulatory agencies, and usually also to program staff); (iv) comparative cost-effectiveness, which goes beyond utility to require consideration of alternative possible evaluation approaches; (v) robustness, i.e., the extent to which the evaluation is immune to variations in context, measures used, point of view of the evaluator, etc.; and (vi) ethicality/legality, which includes such matters as avoidance of conflict of interest49 and protection of the rights of human subjects—of course, this affects credibility, but is not exactly the same, since the ethicality may be deeply flawed….

46 The ‘connotative implications’ are in the sub-explicit but supra-symbolic realm of communication, manifested in—to give a small example—the use of gendered or genderless language.

47 This means that when balance of evidence is all that’s called for (e.g., because a decision has to be made fast), it’s irrelevant that proof of the conclusion beyond any reasonable doubt was not supplied.

48 Utility is usability and not actual use, the latter—or its absence—being at best a probabilistically sufficient but not necessary condition for the former, since it may have been very hard to use the results of the evaluation, and utility/usability requires (reasonable) ease of use. Failure to use the evaluation may be due to base motives or stupidity or an act of God and hence is not a valid indicator of lack of merit.
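For record-keeping during a meta-evaluation, the six criteria of merit above can be treated as a simple checklist. The sketch below is one illustrative way of doing that; the numeric rating scale, the threshold, and the example ratings are invented, and nothing in the KEC implies that these criteria should be averaged or summed into a single score.

    # Illustrative only: a minimal way to record judgments on the six criteria
    # of merit when meta-evaluating an evaluation. Scale and threshold are
    # arbitrary choices, not prescribed by the KEC.
    CRITERIA = [
        "validity (at a contextually adequate level)",
        "utility (incl. cost-feasibility and comprehensibility)",
        "credibility (to key stakeholders)",
        "comparative cost-effectiveness",
        "robustness (to context, measures, evaluator viewpoint)",
        "ethicality/legality (conflict of interest, human subjects)",
    ]

    def flag_weak_criteria(ratings, bar=3):
        """ratings: dict mapping each criterion to a (1-5 judgment, notes) pair.
        Returns the criteria rated below the (arbitrary) bar, each of which
        needs explicit discussion in the meta-evaluation report."""
        return {c: ratings[c] for c in CRITERIA if ratings[c][0] < bar}

    # Hypothetical ratings for some evaluation under review:
    ratings = {c: (4, "") for c in CRITERIA}
    ratings["credibility (to key stakeholders)"] = (2, "funder doubts the sampling plan")

    for criterion, (score, note) in flag_weak_criteria(ratings).items():
        print(f"Needs attention: {criterion} (rated {score}): {note}")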
There are several ways to go about meta-evaluation. You and later another meta-evaluator can: (a) apply the KEC or PES or GAO list (preferably one or more that was not used to do the evaluation) to the evaluation itself (e.g., the Cost checkpoint in the KEC then addresses the cost of the evaluation, not the program, and so on for all checkpoints); and/or (b) use a special meta-evaluation checklist (there are several available, including the one sketched in the previous sentence, which is sometimes called the Meta-Evaluation Checklist or MEC50); and/or (c) if funds are available, replicate the evaluation, doing it in the same way, and compare the results; and/or (d) do the same evaluation using a different methodology and compare the results. It’s highly desirable to employ more than one of these approaches, and all are likely to require supplementation with some attention to conflict of interest/rights of subjects.
Note D5.1: ‘Literal’ or ‘direct’ use are not concepts clearly applicable to evaluations without recommendations, a category that includes many important, complete, and influential evaluations: evaluations are not in themselves recommendations. ‘Due consideration’ or ‘utilization’ is a better generic term for the ideal response to a good evaluation. Failure to use an evaluation’s results is often due to bad, perhaps venal, management, and so can never, without further evidence, be regarded as an indicator of bad utility.
Note D5.2: Evaluation impacts often occur years after completion and often occur even if the evaluation was rejected completely when submitted. Evaluators too often give up their hopes of impact too soon.
Note D5.3: Help with utilization beyond submitting the report should at least have been offered—see Checkpoint D4.
Note D5.4: Look for contributions from the evaluation to the client organization’s knowledge-management system; if the organization lacks one, recommend creating one.
Note D5.5: Since effects of the evaluation are not usually regarded as effects of the program, it follows that although an empowerment evaluation should produce substantial gains in the staff’s knowledge about and tendency to use or improve evaluations, that’s not an effect of the program in the relevant sense for an evaluator. Also, although that valuable outcome is an effect of the evaluation, it can’t compensate for low validity or low external credibility—two of the most common threats to empowerment evaluation—since training the program staff is not a primary criterion of merit for evaluations.
Note D5.6: Similarly, one common non-money cost of an evaluation—disruption of the work of program staff—is not a bad effect of the program. It is one of the items that should always be picked up in a meta-evaluation. Of course, it’s minimal in goal-free evaluation, since the (field) evaluators do not talk to program staff. Careful design (of program plus evaluation) can therefore sometimes bring these evaluation costs near to zero or ensure that there are benefits that more than offset the cost.
49 There are a number of cases of conflict of interest of particular relevance to evaluators, e.g., formative evaluators who make suggestions for improvement and then do a subsequent evaluation (formative or summative) of the same program, of which they are now co-authors—or rejected contributor-wannabes—and hence in conflict of interest.
50 Now online at michaelscriven.info.
_________________________________________________________________
GENERAL NOTE 8: The explanatory remarks here should be regarded as first approximations to the content of each checkpoint. More detail on some of them, and on items mentioned in them, can be found in one of the following: (i) the Evaluation Thesaurus, Michael Scriven (4th edition, Sage, 1991), under the checkpoint’s name; (ii) the references cited there; (iii) the online Evaluation Glossary (2006) at evaluation.wmich.edu, partly written by this author; (iv) the best expository source now, E. Jane Davidson’s Evaluation Methodology Basics (Sage, 2004; 2e, 2012 (projected)); (v) later editions of this document, at michaelscriven.info.
The above version of the KEC itself is, however, in most respects very much better than the ET one, having been substantially refined and expanded in more than 60 ‘editions’ (i.e., widely circulated or online-posted revisions) since its birth as a two-pager around 1971—16 of them since early 2009—with much-appreciated help from many students and colleagues, including: Chris Coryn, Jane Davidson, Rob Brinkerhoff, Christian Gugiu, Nadini Persaud,51 Emil Posavac, Liliana Rodriguez-Campos, Daniela Schroeter, Natasha Wilder, Lori Wingate, and Andrea Wulf; with a thought or two from Michael Quinn Patton’s work.
More suggestions and criticisms are very welcome—please send them to mjscriv1@gmail.com, with KEC as the first word in the title line. (Suggestions after 3.28.11 that require significant changes are rewarded, not only with an acknowledgment but with a little prize: usually your choice from my list of duplicate books.)

[23,679 words]
51 Dr. Persaud’s detailed comments have been especially valuable: she was a CPA before she became a professional evaluator (but there are not as many changes in the cost section as she thinks are called for, so she is not to blame for any remaining faults).