0% found this document useful (0 votes)

52 views40 pages

Key Evaluation Chechlist KEC - 4.18.2011

This document introduces the Key Evaluation Checklist (KEC) as a reference work for professionals conducting evaluations of programs, projects, plans, processes, and policies. The KEC is organized into four parts that guide the evaluator through preliminary steps, establishing the program foundations, conducting sub-evaluations of key aspects like process and outcomes, and developing conclusions. The introduction provides clarification on how the KEC can be applied to different evaluands and notes on terminology used in systematic professional evaluation.

Uploaded by

musiquedelavoiture

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

52 views40 pages

Key Evaluation Chechlist KEC - 4.18.2011

Uploaded by

musiquedelavoiture

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 40

KEY

EVALUATION CHECKLIST (KEC)

[Edition of 18.april.2011]

Michael Scriven
Claremont Graduate University
& The Evaluation Center, Western Michigan University

• For use in professional designing, managing, and evaluating or monitoring of:
programs, projects, plans, processes, and policies;
• for assessing their evaluability;
• for requesting proposals (i.e., writing RFPs) to do or evaluate them;
& for evaluating proposed, ongoing, or completed evaluations of them.1

INTRODUCTION
This introduction takes the form of a number of ‘General Notes,’ more of which may be
found in the body of the document, along with many keypoint-‐specific Notes.
General Note 1: APPLICABILITY The KEC can be used, with care, for evaluating more than
the five evaluands2 listed above, just as it can be used, with considerable care, by others besides
professional evaluators. For example, it can be used for some help with: (i) the evaluation of
products;3 (ii) the evaluation of organizations and organizational units4 such as departments, re-
search centers, consultancies, associations, companies, and for that matter, (iii) hotels, restau-
rants, and mobile food carts, (iv) services, which can be treated as if they were aspects or con-
stituents of programs e.g., as processes; (v) many processes, policies, practices, or procedures,
which are often implicit programs (e.g., “Our practice at this school is to provide guards for chil-
dren walking home after dark”), hence evaluable using the KEC; or habitual patterns of behav-
iour i.e., performances (as in “In my practice as a consulting engineer, I often assist designers,
not just manufacturers”), which is, strictly speaking, a slightly different subdivision of evalu-
ation; and, with some use of the imagination and a heavy emphasis on the ethical values in-
volved, for (vi) some tasks or major parts of tasks in the evaluation of personnel. So it is a kind

1
That is, for what is called meta-evaluation, i.e., the evaluation of one or more evaluations.
2
‘Evaluand’ is a term used to refer to whatever is being evaluated. Note that what counts as a program is often also
sometimes called an initiative or intervention, sometimes even an approach or strategy, although the latter are really
types of program
3
For which it was originally designed and used, c. 1971—although it has since been completely rewritten for its
present purposes, and then revised or rewritten (and circulated or re-posted) at least 60 times. The latest version can
always be found at michaelscriven.info. It is an example of ‘continuous interactive publication’ a type of project
with some new significance in the field of knowledge development, although (understandably) a source of irritation
to some librarians and bibliographers. It enables the author, like a garden designer (and unlike a traditional architect
or composer), to steadily improve his or her specific individual creations over the years or decades, with the help of
user input. It is simply a technologically-enabled extension to the limit of the stepwise process of producing succes-
sive editions in traditional publishing, and arguably a substantial improvement, in the cases where it’s appropriate.
4
There is of course a large literature on the evaluation of organizations, from Baldrige to Senge, and some of it will
be useful for a serious evaluator, but much of it is confused and confusing e.g., about the difference and links be-
tween evaluation and explanation, needs and markets, criteria and indicators, goals and duties.
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

of 40-page (~23,000-word) mini-textbook or reference work for a wide range of professionals

working in evaluation and management—with all the limitations of that size, and certainly more.
General Note 2: TABLE OF CONTENTS
PART A: PRELIMINARIES: A1, Executive Summary; A2, Clarifications; A3, Design
and Methods.
PART B: FOUNDATIONS: B1, Background and Context; B2, Descriptions & Defini-‐
tions; B3, Consumers (Impactees); B4, Resources (‘Strengths Assessment’); B5,
Values.
PART C: SUBEVALUATIONS: C1, Process: C2, Outcomes; C3, Costs; C4, Compari-‐
sons; C5, Generalizability.
PART D: CONCLUSIONS & IMPLICATIONS: D1, Synthesis; D2 (possible),
Recommendations, Explanations, Predictions, & Redesigns; D3 (possible),
Responsibility and Justification; D4, Report & Support; D5, Meta-‐evaluation.5
General Note 3: TERMINOLOGY: Throughout this document, “evaluation” is taken to mean
the determination of merit, worth, or significance (abbreviated m/w/s); “an evaluation” to
mean the results of such a determination; and “evaluand” to mean whatever is being evalu-‐
ated… “Dimensions of merit” (a.k.a., “criteria of merit”) refers to the characteristics of the
evaluand that definitionally bear on its m/w/s (i.e., could be included in explaining what
‘good X’ means), and “indicators of merit” refers to factors that are empirically but not defi-‐
nitionally linked to the evaluand’s m/w/s… Professional evaluation is simply evaluation
requiring specialized tools or skills that are not in the everyday repertoire; it is usually sys-‐
tematic (and inferential), but may also be simply judgmental, if the judgment skill is profes-‐
sionally trained, and maintained, or a (recently) tested advanced skill (think of livestock
judges, football referees, sawmill controllers)… The KEC is a tool for use in systematic pro-‐
fessional evaluation, so knowledge of some terms from evaluation vocabulary is assumed,
e.g., formative, goal-‐free, ranking; their definitions can be found in my Evaluation The-
saurus (4e, Sage, 1991), or in the Evaluation Glossary, online at evaluation.wmich.edu.
However, any conscientious program manager (or designer or fixer) does evaluation of
their own projects, and will benefit from using this, skipping the occasional technical de-‐
tails… The most common reasons for doing evaluation are (i) to identify needed improve-
ments to the evaluand (formative evaluation); (ii) to support decisions about the program
(summative evaluation6); and (iii) to enlarge or refine our body of evaluative knowledge
(ascriptive evaluation, as in ‘best practices’ studies and all evaluations by historians). Keep
in mind that an evaluation may serve more than one purpose, or shift from one to the other
as time passes or the context changes… Merely for simplification, we talk throughout this
document about the evaluation of ‘programs’ rather than ‘programs, plans, or policies, or
evaluations of them, etc….’ as detailed in the sub-‐heading above.

5
It’s not important, but you can remember the part titles from this mnemonic: A for Approach, B for Before, C for
Core (or Center), and D for Dependencies. Since these have 3+5+5+3 components, it’s a 16-point checklist.
6
Major decisions about the program include: refunding, defunding, exporting, replicating, developing further, and
deciding whether it represents a proper or optimal use of funds (i.e., evaluation for accountability, as in an audit).

2 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

General Note 4: TYPE OF CHECKLIST This is an iterative checklist, not a one-shot checklist,
i.e., you should expect to go through it several times when dealing with a single project,
even for design purposes, since discoveries or problems that come up under later check-‐
points will often require modification of what was entered under earlier ones (and no re-‐
arrangement of the order will completely avoid this).7 For more on the nature of checklists,
and their use in evaluation, see the author’s paper on that topic, and a number of other pa-‐
pers about, and examples of, checklists for evaluation by various authors, under the listing
for the Checklist Project at evaluation.wmich.edu.
General Note 5: EXPLANATIONS Since it is not entirely helpful to simply list here what
(allegedly) needs to be covered in an evaluation when the reasons for the recommended
coverage (or exclusions) are not obvious—especially when the issues are highly controver-‐
sial (e.g., Checkpoint D2)—brief summaries of the reasons for the position taken are also
provided in such cases.
General Note 6: CHECKPOINT FOCUS The determination of merit, or worth, or significance
(a.k.a. (respectively) quality, value, or importance), the triumvirate value foci of evaluation,
each rely to different degrees on slightly different slices of the KEC, as well as on a good
deal of it as common ground. These differences are marked by a comment on these distinc-‐
tive elements with the relevant term of the three underlined in the comment, e.g., worth,
unlike merit (or quality, as the terms are commonly used), brings in Cost (Checkpoint C3).
General Note 7: THE COST OF EVALUATION The KEC is a list of what ought to be covered in
an evaluation, but in the real world, the budget for an evaluation is often not enough to cover the
whole list thoroughly. People sometimes ask what checkpoints could be skipped when one has a
very small evaluation budget. The answer is, “None, but….” These are, generally speaking, nec-
essary conditions for validity, But… (i) sometimes the client, or you, if you are the client, can
show that one or two are not relevant to the information need in this case (e.g., cost may not be
important in some cases); (ii) the fact that you can’t skip any checkpoint doesn’t mean you
have to spend significant money on each of them. What you do have to do is think through
each checkpoint’s implications for the case in hand, and consider whether an economical
way of coping with it would be probably adequate for an acceptably probable conclusion,
i.e., focus on robustness (see Checkpoint D5, Meta-‐evaluation, below). In an extreme case,
you may have to rely on a subject-‐matter expert for an estimate based on his/her experi-‐
ence, maybe covering more than one checkpoints in a half-‐day of consulting—or on a few
hours of literature + phone search by you—of the relevant facts about e.g., resources, or
critical competitors. But reality sometimes mean the evaluation can’t be done; that’s the
cost of integrity for evaluators and, sometimes, excessive parsimony for clients. Don’t for-‐
get that honesty on this point prevents some bad scenes later—and may lead to a change of
budget.

PART A: PRELIMINARIES

7
An important category of these is identified in Note C2.5 below

3 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

These preliminary checkpoints are clearly essential parts of an evaluation report, but may
seem to have no relevance to the design and execution phases of the evaluation itself. That’s
why they are segregated from the rest of the KEC checklist: however, it turns out to be
quite useful to begin all one’s thinking about an evaluation by role-‐playing the situation
when you will come to write a report on it. Amongst other benefits, it makes you realize the
importance of describing context; of settling on a level of technical terminology and pre-‐
supposition; of clearly identifying the most notable conclusions; and of starting a log on the
project as well as its evaluation as soon as the latter becomes a possibility. Similarly, it’s
good practice to make explicit at an early stage the clarification step and the methodology
array and its justification

A1. Executive Summary

The most important element in this section is the summary of the results and not (or not
just) the investigatory process. Typically this should be done without even mentioning the
process whereby you got them, unless the methodology is especially notable. In other
words, avoid the pernicious practice of using the executive summary as a ‘teaser’ that only
describes what you looked at or how you looked at it, instead of what you found. Through-‐
out the whole process of designing or doing an evaluation, keep asking yourself what the
overall summary is going to say, based on what you have learned so far, and how directly
and adequately it relates to the client’s and stakeholders’ and (probable future) audiences’
needs, in terms of their pre-‐existing information; this helps you to focus on what still needs
to be done in order to find out what matters most. The executive summary should usually
be a selective summary of Parts B and C, and should not run more than one or at most two
pages if you expect it to be read by executives. Only rarely is the occasional practice of two
summaries (e.g., one ten-‐ pager and one one-‐pager) worth the trouble, but discuss this op-‐
tion with the client if in doubt, and the earlier the better. The summary should usually con-‐
vey some sense of the strength of the conclusions—which includes an estimate of both the
weight of the evidence for the premises and the robustness of the inference(s) to the con-‐
clusion(s)—and its limitations (see A3 below). Of course, the final version of the executive
summary will be written near the end of writing the report, but it’s worth trying the prac-‐
tice of re-‐editing an informal draft of it every couple of weeks during the evaluation be-‐
cause this forces one to keep thinking about identification and substantiation of the most
important conclusions. Append these versions to the log, for future consideration.
Note A1.1 This Note should be just for beginners, but experience has demonstrated that
others can also benefit from its advice: the executive summary is a summary of the evalu-
ation not of the program. (Checkpoint B2 is reserved for the latter.)

A2. Clarifications

Now is the time to clearly identify and define in your notes, for assertion in the final report
(and resolution of ambiguities along the way): (i) the client, if there is one: this is the per-‐
son, group, or committee who officially requests, and, if it’s a paid evaluation, pays for (or
authorizes payment for) the evaluation, and—you hope—the same entity to whom you first
report (if not, try to arrange this, to avoid crossed wires in communications). (ii) The pros-‐
pective (i.e., overt) audiences (for the report). (iii) The stakeholders in the program (those

4 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

who have or will have a substantial vested interest—not just an intellectual interest—in
the outcome of the evaluation, and may have important information or views about the
program and its situation/history). (iv) Anyone else who (probably) will see, have the right
to see, or should see, (a) the results, and/or (b) the raw data—these are the covert audi-‐
ences. Get clear in your mind your actual role or roles—internal evaluator, external evalu-‐
ator, a hybrid (e.g., an outsider on the payroll for a limited time to help the staff with setting
up and running evaluation processes), an evaluation trainer (sometimes described as an
empowerment evaluator), a repairer/‘fixit guy’, visionary (or re-‐visionary), etc. Each of
these roles has different risks and responsibilities, and is viewed with different expecta-‐
tions by your staff and colleagues, the clients, the staff of the program being evaluated, et al.
You may also pick up some other roles along the way—e.g., counsellor, therapist, mediator,
decision-‐maker, inventor, advocate—sometimes for everyone but sometimes for only part
of the staff/stakeholders/others involved. It’s good to formulate and sometimes to clarify
these roles, at least for your own thinking (especially about possible conflicts of role), in the
project log. The project log is absolutely essential; and it’s worth considering making a
standard practice of having someone else read and initial entries in it that may at some
stage become very important.
And now is the time to get down to the nature and details of the job or jobs, as the client
sees them—and to encourage the client to clarify their position on the details that they
have not yet thought out. Get all this into a written contract if possible (essential if you’re
an external evaluator, highly desirable for an internal one.) Can you determine the source
and nature of the request, need, or interest, leading to the evaluation: for example, is the
request, or the need, for an evaluation of worth—which usually involves really serious at-‐
tention to cost analysis—rather than of merit; or of significance which always requires ad-‐
vanced knowledge of the research (or other current work) scene in the evaluand’s field; or
of more than one of these? Is the evaluation to be formative, summative, or ascriptive8; or
for more than one of these purposes? Exactly what are you supposed to be evaluating (the
evaluand alone, or also the context and/or the infrastructure?): how much of the context is
to be taken as fixed; do they just want an evaluation in general terms, or if they want de-‐
tails, what counts as a detail (enough to replicate the program elsewhere, or enough to rec-
ognize it anywhere, or just enough for prospective readers to know what you’re referring
to); are you supposed to be simply evaluating the effects of the program as a whole (holistic
evaluation); or the dimensions of its success and failure (one type of analytic evaluation);
or the quality on each of those dimensions, or the quantitative contribution of each of its
components to its overall m/w/s (another two types of analytic evaluation); are you re-‐
quired to rank the evaluand against other actual or possible programs (which ones?), or
only to grade it;9 and to what extent is a conclusion that involves generalization from this
context being requested or required (e.g., where are they thinking of exporting it?). And, of
particular importance, is the main thrust to be on ex post facto (historical) evaluation, or ex

8 Formative evaluations, as mentioned earlier, are usually done to find areas needing improvement of the evaluation:
summative are mainly done to support a decision about the disposition of the evaluand (e.g., to refund, defund, or
replicate it); and ‘ascriptive’ evaluations are done simply for the record, for history, for benefit to the discipline, or
just for interest.
9
Grading refers not only to the usual academic letter grades (A-F, Satisfactory/Unsatisfactory, etc.) but to any allo-
cation to a category of merit, worth, or significance, e.g., grading of meat, grading of ideas and thinkers.

5 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

ante (predictive) evaluation, or (the most common, but don’t assume it) both? Note that
predictive program evaluation is very close to being (the almost universal variety of) policy
analysis, and vice versa.
Are you also being asked (or expected) either to evaluate the client’s theory of how the ev-‐
aluand’s components work, or to create/improve such a ‘program theory’—keeping in
mind that this is something over and above the literal evaluation of the program, and espe-‐
cially keeping in mind that this is sometimes impossible for even the most expert of field
experts in the present state of subject-‐matter knowledge? 10 Is the required conclusion sim-‐
ply to provide and justify grades, ranks, scores, profiles, or (a different level of difficulty
altogether) distribution of funding? Are recommendations (for improvement or disposi-‐
tion), or identifications of human fault, or predictions, requested, or expected, or feasible
(another level of difficulty, too—see Checkpoint D2)? Is the client really willing and anxious
to learn from faults or is this just conventional rhetoric? (Your contract or, for an internal
evaluator, your job, may depend on getting the answer to this question right, so you might
consider trying this test: ask them to explain how they would handle the discovery of ex-‐
tremely serious flaws in the program—you will often get an idea from their reaction to this
question whether they have ‘the right stuff’ to be a good client.) Or you may discover that
you are really expected to produce a justification for the program in order to save some-‐
one’s neck; and that they have no interest in hearing about faults. Have they thought about
post-‐report help with interpretation and utilization? (If not, offer it without extra charge—
see Checkpoint D2 below.)
It’s best to complete the discussion of these issues about what’s expected and/or feasible to
evaluate, and clarify your commitment (and your cost estimate, if it’s not already fixed),
only after doing a quick pass through the KEC, so ask for a little time to do this, overnight if
possible (see Note D2.3 near the end of the KEC). Be sure to note later any subsequently
negotiated, or imposed, changes in any of the preceding. And here’s where you give ac-‐
knowledgments/thanks/etc., so it probably should be the last section you revise in the final
report.

A3. Design and Methods

Now that you’ve got the questions straight, how are you going to find the answers? You
need a plan that lays out the aspects of the evaluand you need to investigate in order to ev-‐
aluate it—the design of the evaluation—and a set of investigative procedures to implement
this plan—i.e., the planned methods, based on some general account of how to investigate
each aspect of the design, in other words a methodology. To a large extent, the methodol-‐
ogy now used in evaluation originates in social science methodology, and is well covered

10 Essentially, this is a request for decisive non-evaluative explanatory research on the evaluand and/or context. You
may or may not have the skills for this, depending on the exact problem; you certainly didn’t acquire them in the
course of your evaluation training. It’s one thing to determine whether (and to what degree) a program reduces de-
linquency; any good evaluator can do that (given the budget and time required). It’s another thing altogether to be
able to explain why that program does or does not work—that often requires an adequate theory of delinquency,
which so far doesn’t exist. Although program theory enthusiasts think their obligations always include or require
such a theory, the standards for acceptance of any of these theories by the field as a whole are often beyond their
reach; and you risk lowering the standards of the evaluation field if you claim your evaluation depends on providing
such a theory, since in many of the most important areas, you will not be able to do that.

6 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

elsewhere, in both social science and evaluation texts. In this section, we just list some en-‐
try points for that slice of evaluation methodology, and provide rather more details about
the evaluative slice of evaluation methodology, the neglected part; apart from a few com-‐
ments here, this is mostly covered, or at least introduced, under the later checkpoints
which refer to the necessary aspects of the investigation—Values, Process, Outcomes,
Costs, Comparisons, Generalizability, and Synthesis checkpoints. Leaving this slice out of
the methodology of evaluation is roughly the same as leaving out any discussion of inferen-‐
tial statistics from a discussion of statistics.
Two orienting points to start with. (i) Program evaluation is usually about a single program
rather than a set of programs. Although program evaluation is not as individualistic—the
technical term is idiographic rather than nomothetic—as dentistry, forensic pathology, or
motorcycle maintenance, since most programs have large numbers of impactees rather
than just one, it is more individualistic than most social sciences, even applied social sci-‐
ences. So you’ll need to be knowledgeable about case study methodology11. (ii) Program ev-‐
aluation is nearly always a complex task, involving the investigation of a number of differ-‐
ent aspects of program performance—even a number of different aspects of a single ele-‐
ment in that such as impact or cost—which means it is part of the realm of study that re-‐
quires extensive use of checklists. The humble checklist has been ignored in most of the lit-‐
erature on research methods, but turns out to be more complex and also more important
than was generally realized, so look up the online Checklists Project at
http://www.wmich.edu/evalctr/checklists for some papers about the methodology and a
long list of specific checklists composed by evaluators (including an earlier version of this
one). You can find a few others, and the latest version of this one, at michaelscriven.info.
Now for some entry points for applying social science methodology: that is, some examples
of the kind of question that you may need to answer. Do you have adequate domain (a.k.a.
subject-‐matter, and/or local context) expertise for what you have now identified as the real
tasks? If not, how will you add it to the evaluation team (via consultant(s), advisory panel,
full team membership, sub-‐contract, or surveys/interviews)? More generally, identify, as
soon as possible, all investigative procedures for which you’ll need expertise, time, equip-‐
ment, and staff—and perhaps training—in this evaluation: observation, participant obser-‐
vation, logging, journaling, audio/photo/video recording, tests, simulating, role-‐playing,
surveys, interviews, experimental design, focus groups, text analysis, library/online
searches/search engines, etc.; and data-‐analytic procedures (stats, cost-‐analysis, modeling,
topical-‐expert consulting, etc.), plus reporting techniques (text, stories, plays, graphics,
freestyle drawings, stills, movies, etc.), and their justification. You probably need to allocate
time for a lit review on some of these methods.... In particular, on the difficult causation
component of the methodology (establishing that certain claimed or discovered phenom-‐
ena are the effects of the interventions), can you use separate control or comparison
groups to determine causation of supposed effects/outcomes? If not, look at interrupted
time series designs, the GEM approach,12 and some ideas in case study design. If there is to
be a control or quasi-‐control (i.e., comparison) group, can you and should you try to ran-‐

11
This means reading at least some books by Yin, Stake, and Brinkerhoff (check Amazon).
12
See “A Summative Evaluation of RCT methodology; and an alternative approach to causal research” in Journal
of Multidisciplinary Evaluation vol. 5, no. 9, March 2008, at jmde.com.

7 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

domly allocate subjects to it (and can you get through IRB)? How will you control differen-‐
tial attrition; cross-‐group contamination; other threats to internal validity? If you can’t con-‐
trol these, what’s the decision-‐rule for declining/aborting the study? Can you double-‐ or
single-‐blind the study (or triple-‐blind if you’re very lucky)? If the job requires you to de-‐
termine the separate contribution to the effects from individual components of the ev-‐
aluand—how will you do that?.... If a sample is to be used at any point, how selected, and if
stratified, how stratified?…. Will/should the evaluation be goal-‐based or goal-‐free?13.... To
what extent participatory or collaborative; if to a considerable extent, what standards and
choices will you use, and justify, for selecting partners/assistants? In considering your de-‐
cision on that, keep in mind that participatory approaches improve implementation (and
sometimes validity), but may cost you credibility (and possibly validity). How will you han-‐
dle that threat?.... If judges are to be involved at any point, what reliability and bias controls
will you need (again, for credibility as well as validity)?... How will you search for side ef-‐
fects and side-‐impacts, an essential element in almost all evaluations (see Checkpoint C2)?
Most important of all, with respect to all (significantly) relevant values how are you going
to go through the value-‐side steps in the evaluation process, i.e., (i) identify, (ii) particu-‐
larize, (iii) validate, (iv) measure, (v) set standards (‘cutting scores’) for, (vi) set weights
for, and then (vii) incorporate (synthesize, integrate) the value-‐side with the empirical
data-‐gathering side in order to generate the evaluative conclusion?... Now check the sug-‐
gestions about values-‐specific methodology in the Values checkpoint, especially the com-‐
ment on pattern-‐searching…. When you can handle all this, you are in a position to set out
the ‘logic of the evaluation,’ i.e., a general description and justification of the total design for
this project, something that—at least in outline—is a critical part of the report, under the
heading of Methodology.
Note A3.1: The above process will also generate a list of needed resources for your plan-‐
ning and budgeting efforts—i.e., the money (and other costs) estimate. And it will also pro-‐
vide the basis for the crucial statement of the limitations of the evaluation that may need to
be reiterated in the conclusion and perhaps in the executive summary.

PART B: FOUNDATIONS

This is the set of investigations that establishes the context and nature of the program, and
some of the empirical components of the evaluation that you’ll need in order to start spe-‐
cific work on the key dimensions of m/w/s in Part C. That is, they specify important ele-‐
ments that will end up in the actual evaluation report, by contrast with preparing for it, i.e.,
Part A, as well as providing foundations for the core elements in C. These are all part of the
content of the KEC, and each one is numbered for that purpose.

13
That is, at least partly done by evaluators who are not informed of the goals of the program.

8 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

B1. Background and Context

Identify historical, recent, concurrent, and projected settings for the program; start a list of
contextual factors that may be relevant to success/failure of the program; and put matched
labels (or metadata tags) on any that look as if they may interact. In particular, identify (i)
any ‘upstream stakeholders’—and their stakes—other than the clients (i.e., identify people
or groups or organizations that assisted in creation or implementation or support of the
program or its evaluation, e.g., with funding or advice or housing or equipment or help); (ii)
any enabling legislation/mission statements, etc.—and any other relevant legisla-‐
tion/policies—and log any legislative/executive/practice or attitude changes that occur
after start-‐up; (iii) the underlying rationale, including the official program theory, and po-‐
litical logic (if either exist or can be reliably inferred; although neither are necessary for
getting an evaluative conclusion, they are sometimes useful/required); (iv) general results
of a literature review on similar interventions, including ‘fugitive studies’ (those not pub-‐
lished in standard media), and on the Internet (consider checking the ‘invisible web,’ and
the latest group and individual blogs/wikis with the specialized search engines needed to
access these); (v) previous evaluations, if any; (vi) their impact, if any.

B2. Descriptions & Definitions

What you are going to evaluate is officially a certain program, but actually it’s the total
intervention made in the name of the program. That will usually include much more than
just the program, e.g., it may include the personalities of the field staff and support agen-‐
cies, their modus operandi in dealing with local communities, their modes of dress and
transportation, etc. So, record any official descriptions of program, its components, its con-‐
text/environment, and the client’s program logic, but don’t assume they are correct, even
as descriptions of the actual program delivered, let alone of the total intervention. Be sure
to develop a correct and complete description of the first three, which may be very differ-‐
ent from the client’s version, in enough detail to recognize the evaluand in any situation
you observe, and perhaps—depending on the purpose of the evaluation—to replicate it.
You don’t need to develop the correct program logic, only the supposed program logic, un-‐
less you have undertaken to do so and have the resources to add this—often major, and
sometimes suicidal—requirement to the basic evaluation tasks. Of course, you will some-‐
times see, or find later, some obvious flaws in the client’s effort at a program logic and you
may be able to point those out, diplomatically, at some appropriate time. Get a detailed de-‐
scription of goals/mileposts for the program (if not operating in goal-‐free mode). Explain
the meaning of any ‘technical terms,’ i.e., those that will not be in the prospective audiences’
vocabulary, e.g., ‘hands-‐on’ or ‘inquiry-‐based’ science teaching, ‘care-‐provider’. Note signifi-‐
cant patterns/analogies/metaphors that are used by (or implicit in) participants’ accounts,
or that occur to you; these are potential descriptions and may be more enlightening than
literal prose; discuss whether or not they can be justified; do the same for any graphic ma-‐
terials associated with the program. Distinguish the instigator’s efforts in trying to start up
a program from the program itself; both are interventions, only the latter is (normally) the
evaluand. Remember, you’re only going to provide a summary of the program description,
not a complete description, which might take more space than your complete evaluation.

9 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

B3. Consumers (Impactees)

Consumers, as the term is used here (impactees is a less ambiguous term), comprise (i) the
recipients/users of the services/products (i.e., the downstream direct impactees) PLUS (ii)
the downstream indirect impactees (e.g., recipient’s family or co-‐workers, and others, who
are impacted via ripple effect14). Program staff are also impactees, but we usually keep
them separate by calling them the midstream impactees, because the obligations to them,
and the effects on them, are almost always very different and much weaker in most kinds of
program evaluation (and their welfare is not the raison d’être of the program). The funding
agency, taxpayers, and political supporters, who are also impactees in some sense and
some cases, are also treated differently (and called upstream impactees, or, sometimes,
stakeholders, although that term is often used more loosely to include all impactees), ex-‐
cept when they are also direct recipients. Note that there are also upstream impactees who
are not funders or recipients of the services but react to the announcement or planning of
the program before it actually comes online (we can call them anticipators); e.g., real estate
agents and employment agencies. In identifying consumers remember that they often
won’t know the name of the program or its goals and may not know that they were im-‐
pacted or even targeted by it. (You may need to use tracer &/or modus operandi method-‐
ology.) While looking for the impacted population, you may also consider how others could
have been impacted, or protected from impact, by variations in the program: these define
alternative possible (a.k.a. virtual) impacted populations, which may suggest some ways to
expand, modify, or contract the program when/if you spend time on Checkpoint D1 (Syn-‐
thesis)15, and Checkpoint D2 (Recommendations); and hence some ways that the program
should perhaps have been redefined by now, which bears on issues of praise and blame
(Checkpoints B1 and D3). Considering possible variations is of course constrained by the
resources available—see next checkpoint.
Note B3.1: Do not use or allow the use of the term ‘beneficiaries’ for impactees, since it car-‐
ries with it the completely unacceptable assumption that all the effects of the program are
beneficial. It is also misleading, on a smaller scale, to use the term ‘recipients’ since many
impactees are not receiving anything but merely being affected, e.g., by the actions of
someone who learnt something about flu control from an educational program. The term
‘recipient’ should be used only for those who, whether as intended or not, are directly im-‐
pacted.

B4. Resources (a.k.a. “Strengths Assessment”)

This checkpoint is important for answering the questions (i) whether the program made
the best use of resources available (i.e., an extended kind of cost-‐effectiveness: see Note
C3.4 for more), and (ii) what it might realistically do to improve (i.e., within the resources
available). It refers to the financial, physical, and intellectual-‐social-‐relational assets of the
program (not the evaluation!). These include the abilities, knowledge, and goodwill of staff,
volunteers, community members, and other supporters. This checkpoint should cover what
could now (or could have been) used, not just what was used: this is what defines the
14
In rare cases, this will include members of the research and evaluation communities who read the evaluation re-
port.
15 A related issue, equally important, is: What might have been done that was not done?

10 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

“possibility space,” i.e., the range of what could have been done, often an important ele-‐
ment: in the assessment of achievement; in the comparisons, and in identifying directions
for improvement that an evaluation considers. This means the checkpoint is crucial for
Checkpoint C4 (Comparisons), Checkpoint D1 (Synthesis, for achievement), Checkpoint D2
(Recommendations), and Checkpoint D3 (Responsibility). Particularly for D1 and D2, it’s
helpful to list specific resources that were not used but were available in this implementa-‐
tion. For example, to what extent were potential impactees, stakeholders, fund-‐raisers, vol-‐
unteers, and possible donors not recruited or not involved as much as they could have
been? (As a crosscheck, and as a complement, consider all constraints on the program, in-‐
cluding legal, environmental, and fiscal constraints.) Some matters such as adequate insur-‐
ance coverage (or, more generally, risk management) could be discussed here or under
Process (Checkpoint C1 below); the latter is preferable since the status of insurance cover-‐
age is ephemeral, and good process must include a procedure for regular checking on it.
This checkpoint is the one that covers individual and social capital available to the pro-‐
gram; the evaluator must also identify social capital used by the program (enter this as part
of its Costs at Checkpoint C3), and, sometimes, social capital benefits produced by the pro-‐
gram (enter as part of the Outcomes, at Checkpoint C2).16 Remember to include the re-
sources contributed by other stakeholders, including other organizations and clients.17

B5. Values
The values of primary interest in typical professional program evaluations are for the most
part not mere personal preferences of the impactees, unless those overlap with their needs
and the community/society’s needs and committed values, e.g., those in the Bill of Rights
and the wider body of law. Preferences as such are not irrelevant in evaluation, especially
the preferences of impactees, and on some issues, e.g., surgery options, they are often de-‐
finitive; it’s just that they are generally less important—think of food preferences in chil-‐
dren—than dietary needs and medical, legal, or ethical requirements, especially for program
evaluation by contrast with product evaluation. While there are intercultural and interna-‐
16 Individual human capital is the sum of the physical and intellectual abilities, skills, powers, experience, health,
energy, and attitudes a person has acquired. These blur into their—and their community’s—social capital, which
also includes their relationships (‘social networks’) and their share of any latent attributes that their group acquires
over and above the sum of their individual human capital (i.e., those that depend on interactions with others). For
example, the extent of the trust or altruism that pervades a group, be it family, army platoon, corporation, or other
organization, is part of the value the group has acquired, a survival-related value that they (and perhaps others) bene-
fit from having in reserve. (Example of non-additive social capital: the skills of football or other team members that
will only provide (direct) benefits for others who are part of a group, e.g., a team, with complementary skills.) These
forms of capital are, metaphorically, possessions or assets to be called on when needed, although they are not di-
rectly observable in their normal latent state. A commonly discussed major benefit resulting from the human capital
of trust and civic literacy is support for democracy; a less obvious one, resulting in tangible assets, is the current set
of efforts towards a Universal Digital Library containing ‘all human knowledge’. Human capital can usually be
taken to include natural gifts as well as acquired ones, or those whose status is indeterminate as between these cate-
gories (e.g., creativity, patience, empathy, adaptability), but there may be contexts in which this should not be as-
sumed. (The short term for all this might seem to be “human resources” but that term has been taken over to mean
“employees,” and that is not what we are talking about here.) The above is a best effort to construct the current
meaning: the 25 citations in Google for ‘human capital’ and the 10 for ‘social capital’ (at 6/06/07) include simplified
and erroneous as well as other and inconsistent uses—few dictionaries have yet caught up with these terms (al-
though the term ‘human capital’ dates from 1916).
17 Thanks to Jane Davidson for the reminder on this last point.

11 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

tional differences of great importance in evaluating programs, most of the values listed be-‐
low are highly regarded in all cultures; the differences are generally in their precise inter-‐
pretation, the contextual parameters, the exact standards laid down for each of them, and
the relative weight assigned to them; and taking those differences into account is fully al-‐
lowed for in the approach here. Of course, your client won’t let you forget what they value,
usually the goals of the program, and you should indeed keep them in mind and report on
success in achieving them; but since you must value every unintended effect of the program
just as seriously as the intended ones, and in most contexts you must take into account
values other than those of the clients, e.g., those of the impactees and usually also those of
other stakeholders, you need a repertoire of values to check when doing serious program
evaluation, and what follows is a proposal for that. Keep in mind that with respect to each
of these (sets of) values, you will have to: (i) define and justify relevance to this program in
this context; (ii) justify the relative weight (i.e., comparative importance) you will accord
this value; (iii) identify any bars (i.e., minimum acceptable performance standards on each
value dimension) you will require an evaluand to meet in order to be considered at all in
this context; (iii) specify the empirical levels that will justify the application of each grade
level above the bar on that value that you may wish to distinguish (e.g., define what will
count as fair/good/excellent. And one more thing, rarely identified as part of the evalu-‐
ator’s task but crucial; (v) once you have a list of impactees, however partial, you must be-‐
gin to look for patterns within them, e.g., that pregnant women have greater calorific re-‐
quirements (i.e., needs) than those who are not pregnant. If you don’t do this, you will miss
extremely important ways to optimize the use of intervention resources.18
To get all this done, you should begin by identifying the relevant values for evaluating this
evaluand in these circumstances. There are several very important groups of these. (i)
Some of these follow simply from understanding the nature of the evaluand (these are
sometimes called definitional criteria of merit, or dimensions of merit). For example, if it’s a
health program, then the criteria of merit, simply from the meaning of the terms, include
the extent (a.k.a., reach or breadth) of its impact (i.e., the size and range of the demographic
(age/gender/ethnic/economic) and medical categories of the impactee population), and
the impact’s depth (usually a function of magnitude, extent and duration) of beneficial ef-‐
fects. (ii) Other primary criteria of merit in such a case are extracted from a general or spe-‐
cialist understanding of the nature of a health program, include safety of staff and patients,
quality of medical care (from diagnosis to follow-‐up), low adverse eco-‐impact, physical
ease of access/entry; and basic staff competencies plus basic functioning, diagnostic and
minor therapeutic supplies and equipment. Knowing what these values are is one reason
you need either specific evaluand-‐area expertise or a consultant who has it. (iii) Then look
for particular, site-‐specific, criteria of merit—for example, the need for one or more sec-‐
ond-‐language competencies in service providers; you will probably need to do or find a
valid needs assessment for the targeted, and perhaps also for any other probably impacted
population. Here you must include representatives from the impactee population as rel-‐
evant experts, and you may need only their expertise for the needs assessment, but prob-‐

18 And if you do this, you will be doing what every scientist tries to do—find patterns in data. This is one of several
ways in which good evaluation requires full-fledged traditional scientific skills; and something more as well (han-
dling the values component).

12 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

ably should do a serious needs assessment and have them help design and interpret it. (iv)
Next, list the explicit goals/values of the client if not already covered, since they will surely
want to know whether and to what extent these were met. (v) Finally, turn to the list below
to find other relevant values. Validate them as relevant or irrelevant for the present evalu-‐
ation, and as contextually supportable.19
Now, for each of the values you are going to rely on at all heavily, there are two important
steps you will usually need to take. First, you need to establish a scale or scales on which
you can measure performance that is relevant to merit. On each of these scales, you need
to locate levels of performance that will count as being of a certain value (these are called
the ‘cut scores’ if the dimension is measurable). For example, you might measure know-‐
ledge of first aid on a certain well-‐validated test and set 90% as the score that marks an A
grade, 75% as a C grade, etc.20 Second, you will usually need to stipulate the relative im-‐
portance of each of these scales in determining the overall m/w/s of the evaluand.
A useful basic toolkit for this involves doing what we call identifying the “stars, bars, and
steps” for our listed values. (i) The “stars” (usually best limited to 1–3 stars) are the
weights, i.e., the relative or absolute importance of the dimensions of merit (or worth or
significance) that will be used as premises to carry you from the facts about the evaluand,
as you locate or determine those, to the evaluative conclusions you need. Their absolute
importance might be expressed qualitatively (e.g., major/medium/minor, or by letter
grades A-‐F); or quantitatively (e.g., points on a five, ten, or other point scale, or—often a
better method of giving relative importance—by the allocation of 100 ‘weighting points’
across the set of values); or, if merely relative values are all you need, these can even be ex-‐
pressed in terms of an ordering of their comparative importance (rarely an adequate ap-‐
proach). (ii) The “bars” are absolute minimum standards for acceptability, if any: that is,
they are minima on the particular scales, scores or ratings that must be ‘cleared’ (exceeded)
if the candidate is to be acceptable, no matter how well s/he scores on other scales. Note
that an F grade for performance on a particular scale does not mean ‘failure to clear a bar,’
e.g., an F on the GRE quantitative scale may be acceptable if offset by other virtues, for se-‐
lecting students into a creative writing program21. Bars and stars may be set on any rel-‐

19 The view taken here is the commonsense one that values of the kind used by evaluators looking at programs serv-
ing the usual ‘good causes’ of health, education, social service, disaster relief, etc., are readily and objectively sup-
portable, to a degree acceptable to essentially all stakeholders and supervisory or audit personnel, contrary to the
doctrine of value-free social science which held that values are essentially matters of taste and hence lack objective
verifiability. The ones in the list here are usually fully supportable to the degree needed by the evaluator for the par-
ticular case, by appeal to publicly available evidence, expertise, and careful reasoning. Bringing them into consid-
eration is what distinguishes evaluation from plain empirical research, and only their use makes it possible for
evaluators to answer the questions that mere empirical research cannot answer, e.g., Is this the best vocational high
school in this city? Do we really need a new cancer clinic building? Is the new mediation training program for police
officers who are working the gang beat really worth what it costs to implement? In other words, the most important
practical questions, for most people—and their representatives—who are looking at programs (and the same applies
to product, personnel, policy evaluation, etc.)
20
This is the tricky process of identifying ‘cutscores’ a specialized topic in test theory—there is a whole book by
that title devoted to discussing how it should be done. A good review of the main issues is in Gene Glass’ paper
21 If an F is acceptable on that scale, why is that dimension still listed at all—why is it relevant? Answer: it may be

one of several on which high scores are weighted as a credit, on one of which the candidate must score high, but not
on any particular one. In other words the applicant has to have some special talent, but a wide range of talents are
acceptable. This might be described as a case where there is a ‘group’ bar, i.e., a ‘floating’ bar on a group of dimen-

13 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

evant properties (a.k.a. dimensions of merit), or directly on dimensions of measured
(valued) performance, and may additionally include holistic bars or stars22. (iii) In serious
evaluation, it is often appropriate to locate “steps” i.e., points or zones on measured dimen-‐
sions of merit where the weight changes, if the mere stars don’t provide enough precision.
An example of this is the setting of several cutting scores on the GRE for different grades in
the use of that scale for the two types of evaluation given above (evaluating the program
and evaluating applicants to it). The grades, bars, and stars (weights), are often loosely in-‐
cluded under what is called ‘standards.’ (Bars and steps may be fuzzy as well as precise.)
Three values are of such general importance that they receive full checkpoint status and
are listed in the next section: cost (minimization of), superiority (to comparable alterna-‐
tives), and generalizability/exportabillity. Their presence in the KEC brings the number of
types of values considered, including the list below, up to a total of 21.
At least check all these values for relevance and look for others: and for those that are rel-‐
evant, set up an outline of a set of defensible standards that you will use. Since these are
context-‐dependent (e.g., the standards for a C in evaluating free clinics in Zurich today are
not the same as for a C in evaluating a free clinic in Darfur at the moment), and the client’s
evaluation-‐needs—i.e., the questions they need to be able to answer—differ massively,
there isn’t a universal dictionary for them. You’ll need to have a topical expert on your team
or do a good literature search to develop a draft, and eventually run serious sessions with
impactee and other stakeholder representatives to ensure defensibility for the revised
draft. The final version of each of the standards, and the set of them, is often called a rubric,
meaning a table translating evaluative terms into observable or testable terms and/or vice
versa.23
(i) Definitional values—those that follow from the definitions of terms in standard
usage (e.g., breadth and depth of impact are, definitionally, dimensions of merit
for a public health program), or that follow from the contextual implications of
having an ideal or excellent evaluand of this type (e.g., an ideal shuttle bus ser-‐
vice for night shift workers would feature increased frequency of service around
shift change times). The latter draw from general knowledge and to some extent
from program-‐area expertise.

sions, which must be cleared by the evaluand’s performance on at least one of them. It can be exhibited in the list of
dimensions of merit by bracketing the group of dimensions in the abscissa, and stating the height of the floating bar
in an attached note. Example: “Candidates for admission to the psychology grad program must have passed one
upper division statistics course.”
22 Example: The candidates for admission to a graduate program—whose quality is one criterion of merit for the

program—may meet all dimension-specific minimum standards in each respect for which these were specified (i.e.,
they ‘clear these bars’), but may be so close to missing the bars (minima) in so many respects, and so weak in re-
spects for which no minimum was specified, that the selection committee feels they are not good enough for the
program. We can describe this as a case where they failed to clear a holistic (a.k.a. overall) bar that was implicit in
this example, but can often be made explicit through dialog. (The usual way to express a quantitative holistic bar is
via an average grade; but that is not always the best way to specify it and often not strictly defensible.)
23
The term ‘rubric’ as used here is a technical term originating in educational testing parlance; this meaning is not
in most dictionaries, or is sometimes distinguished as ‘an assessment rubric.’ A complication we need to note here is
that some of the observable/measurable terms may themselves be evaluative, at least in some contexts.

14 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

(ii) Needs of the impacted population, via a needs assessment (distinguish perform-‐
ance needs (e.g., need for health) from treatment needs (need for specific medi-‐
cation or delivery system), needs that are currently met from unmet needs,24 and
meetable needs from ideal but impractical or impossible-‐with-‐present-‐resources
needs (consider the Resources checkpoint)). The needs are matters of fact, not
values in themselves, but in any context that accepts the most rudimentary ethi-‐
cal considerations (i.e., the non-‐zero value of the welfare of other human beings),
those facts are value-‐imbued. NOTE: needs may have macro as well as micro lev-‐
els that must be considered; e.g., there are local community needs, regional
needs within a country, national needs, global region needs, and global needs
(these often overlap, e.g., in the case of building codes (illustrated by their ab-‐
sence in the Port-‐au-‐Prince earthquake of 2010). Doing a needs assessment is
sometimes the most important part of an evaluation, and in much of the litera-‐
ture is based on invalid definitions of need, e.g., the idea that needs are the gaps
between the actual level of some factor (e.g., income, calories) and the ideal level.
Needs for X are the levels of X without which the subject(s) will be unable to
function satisfactorily (not the same as optimally, maximally, or ideally); and of
course, what functions are under study and what level will count as satisfactory
will vary with the study and the context. Final note; check the Resources check-‐
point, a.k.a. Strengths Assessment, for other entities valued in that context and
hence of value in this evaluation.
(iii) Logical requirements (e.g., consistency, sound inferences in design of program
or measurement instruments e.g., tests).
(iv) Legal requirements (but see (v) below).
(v) Ethical requirements (overlaps with legal and overrides them when in conflict),
usually including (reasonable) safety, and confidentiality (and sometimes ano-‐
nymity) of all records, for all impactees. (Problems like conflict of interest and
protection of human rights have federal legal status in the US, and are also re-‐
garded as scientific good procedural standards, and as having some very general
ethical status.) In most circumstances, health, shelter, education, and other wel-‐
fare considerations for impactees and potential impactees are other obvious
values to which ethical weight must be given.
(vi) Cultural values (not the same as needs or wants, although overlapping with
them) held with a high degree of respect (and thus distinguished from matters of
manners, style, taste, etc.), of which an extremely important one is honor; an-‐
other group, not always distinct from that one, concerns respect for ancestors,
elders, tribal or totem spirits, and local deities. These, like legal requirements,
are subject to override, in principle at least, by ethical values, although often
taken to have the same and sometimes higher status.

24
A very common mistake—reflected in definitions of needs that are widely used—is to think that met needs are not
‘really’ needs, and should not be included in a needs assessment. That immediately leads to the ‘theft’ of resources
that are meeting currently met needs, in order to serve the remaining unmet needs, a blunder that can cost lives.
Identify all needs, then identify the ones that are still unmet

15 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

(vii) Personal, group, and organizational goals/desires (unless you’re doing a goal-‐
free evaluation) if not in conflict with ethical/legal/practical considerations.
These are usually less important than the needs of the impactees, since they lack
specific ethical or legal backing, but are enough by themselves to drive the infer-‐
ence to many evaluative conclusions about e.g., what recreational facilities to
provide in community-‐owned parks, subject to consistency with ethical and legal
constraints. They include some things that are arguable needed as well as de-‐
sired, by some, e.g., convenience, recreation, respect, earned recognition, excite-‐
ment, and compatibility with aesthetic preferences of recipients.
(viii) Environmental needs, if these are regarded as not simply reducible to ‘human
needs with respect to the environment,’ e.g., habitat needs of other species
(fauna or flora), and perhaps Gaian ‘needs of the planet.’
(viii) Fidelity to alleged specifications (a.k.a. “authenticity,” “adherence,” “implemen-‐
tation,” “dosage,” or “compliance”)—this is often usefully expressed via an “index
of implementation”; and—a different but related matter—consistency with the
supposed program model (if you can establish this BRD—beyond reasonable
doubt); crucially important in Checkpoint C1.
(ix) Sub-‐legal but still important legislative preferences (GAO used to determine
these from an analysis of the hearings in front of the sub-‐committee in Congress
from which the legislation emanated.)
(x) Professional standards (i.e., standards set by the profession) of quality that ap-‐
ply to the evaluand25.
(xi) Expert refinements of any standards lacking a formal statement, e.g., ones in (ix).
(xii) Historical/Traditional standards.
(xiii) Scientific merit (or worth or significance).
(xiv) Technological m/w/s.
(xv) Marketability, in commercial program evaluation.
(xvi) Political merit, if you can establish it BRD.
(xvii) Risk (sometimes meaning the same as chance), meaning the probability of failure
(or success) or, sometimes, of the loss (or gain) that would result from failure (or
success), or sometimes the product of these two. Risk in this ontext does not
mean the probability of error about the facts or values we are using as param-‐
eters—i.e., the level of confidence we have in our data or conclusions. Risk here
is the value or disvalue of the chancy element in the enterprise in itself, as an in-‐
dependent positive or negative element—positive for those who are positively
attracted by gambling as such (this is usually taken to be a real attraction, unlike

25Since one of the steps in the evaluation checklist is the meta-evaluation, in which the evaluation itself is the
evaluand, you will also need, when you come to t)hat checkpoint, to apply professional standards for evaluations to
the list. Currently the best ones would be those developed by the Joint Committee (Program Evaluation Standards
2e (Sage), but there are several others of note, e.g., the GAO Yellow Book), and perhaps the KEC. And see the final
checkpoint in the KEC.

16 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

risk-‐tolerance) and negative for those who are, by contrast, risk-‐averse. This
consideration is particularly important in evaluating plans (preformative evalu-‐
ation) and in formative evaluation, but is also relevant in summative and ascrip-‐
tive evaluation when either is done prospectively (i.e., before all data is avail-‐
able). There is an option of including this under personal preferences, item (vii)
above, but it is often better to consider it separately since it benefits by being ex-‐
plicit, can be very important, and is a matter on which evidence/expertise (in the
logic of probability) should be brought to bear, not simply a matter of personal
taste.26
(xviii) last but not least—Resource economy (i.e., how low-‐impact the program is with
respect to short-‐term and long-‐term limits on resources of money, space, time,
labor, contacts, expertise and the eco-‐system). Note that ‘low-‐impact’ is not what
we normally mean by ‘low-‐cost’ (covered separately in Checkpoint C3) in the
normal currencies (money and non-‐money), but refers to absolute (usually
means irreversible) loss of available resources (in some framework, which might
range from a single person to a country). This could be included under an ex-‐
tended notion of (opportunity) cost or need, but has become so important in its
own right that it is probably better to put it under its own heading as a value. It
partly overlaps with Checkpoint C5, because a low score on resource economy
undermines sustainability, so watch for double-‐counting. Also check for double
counting against value (viii), if that is being weighted by client or audiences and
is not overridden by ethical or other higher-‐weighted concerns.
Fortunately, bringing these values and their standards to bear27 is less onerous than it may
appear, since many of these values will be unimportant or only marginally important in
many specific cases, although each one will be crucially important in other particular cases.
And doing all this values-‐analysis will be easy to do sometimes, although very hard on
other occasions; it can often require expert advice and/or impactee/stakeholder advice.
And, of course, some of these values will conflict with others (e.g., impact size with resource
economy), so their relative weights may then have to be determined for the particular case,
a non-‐trivial task by itself. Hence you need to be very careful not to assume that you have to

26 Note that risk is often defined in the technical literature as the product of the likelihood of failure and the magni-
tude of the disaster if the program, or part of it, does fail (the possible loss itself is often called ‘the hazard’); but in
common parlance, the term ‘risk’ is often used to mean either the probability of disaster (“very risky”) or the disas-
ter itself (“the risk of death”). Now the classical definition of a gambler is someone who will prefer to pay a dollar to
get a 1 in 1,000 chance of making $1,000 over paying a dollar to get a 1 in 2 chance of making $2, even though the
expectancy is the same in each case; the risk-averse person will reverse those preferences and in extreme cases will
prefer to simply keep the dollar; and the rational risk-tolerant person will, supposedly, treat all three options as of
equal merit. So, if this is correct, then one might argue that the more precise way to put the value differences here is
to say that the gambler is not attracted by the element of chance in itself but by the possibility of making the larger
sum despite the low probability of that outcome, i.e., that s/he is less risk-averse, not more of a risk-lover. (I think
this way of putting the matter actually leads to a better analysis, viz., any of these positions can be rational depend-
ing on contextual specification of the cost of Type 1 vs. Type 2 errors.) However described, this can be a major
value difference between people and organizations e.g., venture capitalist groups vs. city planning groups.
27 ‘Bringing them to bear’ involves: (a) identifying the relevant ones, (b) specifying them (i.e., determining the di-
mensions for each and a method of measuring performance/achievements on all of these scales), (c) validating the
relevant standards for the case, and (d) applying the standards to the case.

17 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

generate a ranking of evaluands from an evaluation you are asked to do, since if that’s not
required, you can often avoid settling the issue of relative weights of criteria, or at least
avoid any precision in settling it, by simply doing a grading of each evaluand, on a profiling
display (i.e., showing the merit on all relevant dimensions of merit in a bar-‐graph for each
evaluand(s)). That will exhibit the various strengths and weaknesses of each evaluand,
ideal for helping them to improve, and for helping clients to refine their weights for the cri-‐
teria of merit, which will often make it obvious which is the best choice.
Note B5.1: You must cover in this checkpoint all values that you will use, including those
used in evaluating the side-‐effects (if any), not just the intended effects (if any materialize).
Some of these values may well occur to you only after you find the side-‐effects (Checkpoint
C2), but that’s not a problem—this is an iterative checklist, and in practice that means you
will often have to come back to modify findings on earlier checkpoints.

PART C: SUBEVALUATIONS

Each of the following five core dimensions of an evaluation requires both: (i) a ‘fact-‐
finding’28 phase, followed by (ii) the process of combining those facts with whatever values
from B5 that are relevant to this dimension of merit bear on those facts, which yields (iii)
the subevaluation. In other words, Part C requires29 the completion of five separate kinds of
inference from (i) plus (ii) to (iii), i.e., from What’s So? to So What?—e.g., (in the case of C2,
Outcomes), from ‘the outcomes were measured as XX,’ and ‘outcomes of this size are valu-‐
able to the degree YY’ to ‘the effects were extremely beneficial,’ or ‘insignificant in this con-‐
text,’ etc. Making that step requires, in each case, a premise of type (ii) that forms a bridge
between facts and values; these are usually some kind of ‘rubric,’ discussed further in the
D1 (Synthesis) checkpoint. The first two of the following checkpoints will, in one case or
another, use rubrics referring to nearly all the values from Checkpoint B5 and bear most of
the load in determining merit; the next three checkpoints are defined in terms of specific
values of great general importance, named in their heading, and particularly relevant to
worth (Checkpoint C3 and C4) and significance (Checkpoints C4 and C5).

C1. Process
We start with this core checkpoint because it forces us to confront immediately the merit of
the means this intervention employs, so that we are able, as soon as possible, to answer the
question whether the (intended or produced unintentionally) ends—many of which we’ll
cover in the next checkpoint—justify the means, in this specific case or set of cases. The
Process checkpoint involves the assessment of the m/w/s of everything that happens or

28 Here, and commonly, this sense of the term means non-evaluative fact-finding. There are plenty of evaluative
facts that we often seek, e.g., whether the records show that an attorney we are considering has a history of malprac-
tice; whether braided nylon fishline is as good as wire for fish over 30kg.
29
Although this is generally true, there are evaluations in which one or more of the sub-evaluations are irrelevant,
e.g., when cost is of no concern.

18 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

applies before true outcomes emerge, especially: (i) the vision, design, planning and oper-‐
ation of the program, from the justification of its goals (if you’re not operating in goal-‐free
mode)—and note that these may have changed or be changing since the program began—
through design provisions for reshaping the program under environmental or political or
fiscal duress (including planning for worst-‐case outcomes); to the development and justifi-‐
cation of the program’s supposed ‘logic’ a.k.a. design (but see Checkpoint D2), along with
(ii) the program’s ‘implementation fidelity’ (i.e., the degree of implementation of the sup-‐
posed archetype or exemplar program, if any). This index is also called “authenticity,” “ad-‐
herence,” “alignment,” “fidelity,” “internal sustainability,” or “compliance”.30 You must also,
under Process, check the accuracy of the official name or subtitle (whether descriptive or
evaluative), or the official description of the program e.g., “an inquiry-‐based science educa-‐
tion program for middle school”—one, two, three, or even four of the components of this
compound descriptive claim (it may also be contextually evaluative) can be false. (Other
examples: “raises beginners to proficiency level”, “advanced critical thinking training pro-‐
gram”). Also check (iii) the quality of its management (especially (a) the arrangements for
getting and appropriately reporting evaluative feedback (that package is often much of
what is called accountability or transparency), along with support for learning from that
feedback, and from any mistakes/solutions discovered in other ways, along with meeting
more obviously appropriate standards of accountability and transparency; (b) the quality
of the risk-‐management,31 including the presence of a full suite of ‘Plan B’ options; (c) the
extent to which the program planning covers issues of sustainability and not just short-‐
term returns (this point can also be covered in C5). You need to examine all activities and
procedures, especially the program’s general learning/training process (e.g., regular ‘up-‐
dating training’ to cope with changes in the operational and bio-‐environment, staff aging,
essential skill pool, new technology32); attitudes/values; and morale. Of course, manage-‐
ment quality is something that continues well beyond the beginning of the program, so in
looking at it, you need to be clear when it had what form or you won’t be able to ascribe re-‐
sults—good or bad—to management features, if you are hoping to be able to do that. Orga-‐
nization records often lack this kind of detail, so try to improve that practice, at least for the
duration of your evaluation.
As mentioned, under this heading you may or may not examine the quality of the original
‘logic of the program’ (the rationale for its design) and its current logic (both the current
official version and the possibly different one implicit in the operations or in staff behavior

30Several recent drug studies have shown huge outcome differences between subjects filling 80% or more of their
prescriptions and those filling less than 80%, in both the placebo and treatment groups, even when it’s unknown how
many of those getting the drug from the pharmacy are actually taking it, and even though there is no overall differ-
ence in average outcomes between the two groups. In other words, mere aspects of the process of treatment can be
more important than the nature of the treatment or the fact of treatment status. So be sure you know what the process
actually comprises, and whether any comparison group is closely similar on each aspect.
31 Risk-management has emerged fairly recently as a job classification in large organizations, growing out of the
specialized task of analyzing the adequacy and justifiability of the organization’s insurance coverage, but now in-
cluding matters such as the adequacy and coordination of protocols and training for emergency response to natural
and human-caused disasters, the identification of responsibility for each risk, and the sharing of risk and insurance
with other parties.
32
See also my paper on “Evaluation of Training” at michaelscriven.info for a checklist that massively extends
Kirkpatrick’s groundbreaking effort at this task.

19 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

rather than rhetoric). It is not generally appropriate to try to determine and affirm whether
the model is correct in detail and in scientific fact unless you have specifically undertaken
that kind of (usually ambitious and sometimes unrealistically ambitious) analytic evalu-‐
ation of the program design/plan/theory. You need to judge with great care whether com-‐
ments on the plausibility of the program theory are likely to be helpful, and, if so, whether
you are sufficiently expert to make them. Just keep in mind, that it’s never been hard to ev-‐
aluate aspirin for e.g., its analgesic effects, although it is only very recently that we had any
idea how/why it works. It would have been a logical error—and unhelpful to society—to
make the earlier evaluations depend on solving the causal mystery. It helps to keep in mind
that there’s no mystery until you’ve done the evaluation, since you can’t explain outcomes
if there aren’t any (or explain why there aren’t any until you’ve shown that that’s the situa-‐
tion). So if you can be helpful by evaluating the program theory, and you have the resources
to spare, do it; but doing this is not an essential part of doing a good evaluation, will often
be a diversion, and is sometimes a cause for disruptive antagonism.
Process evaluation may also include (iv) the evaluation of what are often called “outputs,”
(usually taken to be ‘intermediate outcomes’ that are developed en route to ‘true out-‐
comes,’ the longer-‐term results that are sometimes called ‘impact’). Typical outputs are
knowledge, skill, or attitude changes in staff (or clients), when these changes are not major
outcomes in their own right. Remember that in any program that involves learning,
whether incidental or intended, the process of learning is gradual and at any point in time,
long before you can talk about outcomes/impact, there will have been substantial learning
that produces a gain in individual or social capital, which must be regarded as a tangible
gain for the program and for the intervention. It’s not terribly important whether you call it
process or output or short-‐term outcome, as long as you find it, estimate it, and record it—
once. (Recording it under more than one heading—other than for merely annotative rea-‐
sons—leads to double counting when you are aiming for an overall judgement.)
Note C1.1: Five other reasons why process is an essential element in program evaluation,
despite the common tendency in much evaluation to place almost the entire emphasis on
outcomes: (v) gender or racial prejudice in selection/promotion/treatment of staff is an
unethical practice that must be checked for, and comes under process; (vi) in evaluating
health programs that involve medication or exercise, ‘adherence’ or ‘implementation fi-‐
delity’ means following the prescribed regimen including drug dosage, and it is often vitally
important to determine the degree to which this is occurring—which is also a process con-‐
sideration. We now know, because researchers finally got down to triangulation (e.g., via
randomly timed counts, by a nurse-‐observer, of the number of pills remaining in the pa-‐
tient’s medicine containers), that adherence can be very low in many needy populations,
e.g., Alzheimer’s patients, a fact that completely altered evaluative conclusions about
treatment efficacy; (vii) the process may be where the value lies—writing poetry in the
creative writing class may be a good thing to do in itself, not because of some later out-‐
comes (same for having fun, in kindergarten at least; painting; and marching to protest
war, even if it doesn’t succeed); (viii) the treatment of human subjects must meet federal,
state, and other ethical standards, and an evaluator can rarely avoid the responsibility for
checking that they are met; (ix) as the recent scandal in anaesthesiology underscores, many
widely accepted evaluation procedures, e.g., peer review, rest on assumptions that are
sometimes completely wrong (e.g., that the researcher actually did get the data reported

20 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

from real patients), and the evaluator should try to do better than rely on such assump-‐
tions.

C2. Outcomes
This checkpoint is the poster-‐boy of many evaluations, and the one that many people mis-‐
takenly think of as covering ‘the results’ of an intervention. In fact, the results are every-‐
thing covered in Part C. This checkpoint does cover the ‘ends’ at which the ‘means’ dis-‐
cussed in C1 (Process) are aimed, but (a) only to the extent they were achieved; and (b)
much more than that. It requires the identification of all (good and bad) effects of the pro-‐
gram (a.k.a. intervention) on: (i) program recipients (both targeted and untargeted—an
example of the latter are thieves of aid or drug supplies); on (ii) other impactees, e.g., fami-‐
lies and friends—and enemies—of recipients; and on (iii) the environment (biological,
physical, and more remote social environments). These effects must include direct and in-‐
direct effects, intended and unintended ones, immediate33 and short-‐term and long-‐term
ones (the latter being one kind of sustainability). (These are all, roughly speaking, the focus
of Campbell’s ‘internal validity.’) Finding outcomes cannot be done by hypothesis-‐testing
methodology, because: (i) often the most important effects are unanticipated ones (the four
main ways to find such side-‐effects are: (a) goal-‐free evaluation, (b) skilled observation, (c)
interviewing that is explicitly focused on finding side-‐effects, and (d) using previous ex-‐
perience (as provided in the mythical “Book of Causes”34). And (ii) because determining the
m/w/s of the effects—that’s the bottom line result you have to get out of this sub-‐
evaluation—is often the hard part, not just determining whether there are any, or even
what they are intrinsically, and who they affect (some of which you can get by hypothesis
testing)… Immediate outcomes (e.g., the publication of instructional leaflets for AIDS car-‐
egivers) are often called ‘outputs,’ especially if their role is that of an intermediate cause or
intended cause of main outcomes, and they are normally covered under Checkpoint C1. But
note that some true outcomes (including results that are of major significance, whether or
not intended) can occur during the process but may be best considered here, especially if
they are highly durable. (Long-‐term results are sometimes called ‘effects’ (or ‘true effects’
or ‘results’) and the totality of these is often referred to as the ‘impact’; but you should ad-‐
just to the highly variable local usage of these terms by clients/audiences/stakeholders.)
Note that you must pick up effects on individual and social capital here (see the earlier
footnote); much of this ensemble is normally not counted as outcomes, because they are

33 The ‘immediate’ effects of a program are not only the first effects that occur after the program starts up, but
should also include major effects that occur before the program starts. These (preformative) effects impact ‘anticipa-
tors’ who react to the announcement of—or have secret intelligence about—the future start of the program. For ex-
ample, the award of the 2012 Olympic Games to Rio de Janeiro, made several years in advance of any implementa-
tion of the planned constructions etc. for the games, had a huge immediate effect on real estate prices, and later on
favela policing for drug and violence control.
34 The Book of Causes shows, when opened at the name of a condition, factor, or event: (i) on the left (verso) side of

the opening, all the things which are known to be able to cause it, in some context or other (which is specified at that
point); and (ii) on the right (recto) side, all the things which it can cause: that’s the side you need in order to guide
the search for side-effects. Since the BofC is only a virtual book, you have to create the relevant pages, using all
your resources such as accessible experts and a literature/internet search. Good forensic pathologists and good field
epidemiologists, amongst other scientists, have very comprehensive ‘local editions’ of the BofC in their heads and
as part of the social capital of their guild. Modern computer technology makes the BofC feasible, perhaps imminent.

21 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

gains in latent ability (capacity, potentiality), not necessarily in observable achievements or
goods. Particularly in educational evaluations aimed at improving test scores, a common
mistake is to forget to include the possibly life-‐long gain in ability as an effect.
Sometimes, not always, it’s useful and feasible to provide explanations of success/failure in
terms of components/context/decisions. For example, when evaluating a statewide consor-‐
tium of training programs for firemen dealing with toxic fumes, it’s probably fairly easy to
identify the more and less successful programs, maybe even to identify the key to success
as particular features—e.g., realistic simulations—that are to be found in and only in the
successful programs. To do this usually does not require the identification of the whole op-‐
erating logic/theory of program operation. (Remember that the operating logic is not ne-‐
cessarily the same as: (i) the original official program logic, (ii) the current official logic,
(iii) the implicit, logics or theories of field staff). Also see Checkpoint D2 below.
Given that the most important outcomes may have been unintended (a broader class than
unexpected), it’s worth distinguishing between side-‐effects (which affect the target popula-‐
tion and possibly others) and side-‐impacts (meaning impacts of any kind on non-‐targeted
populations).
The biggest methodological problem with this checkpoint is establishing the causal connec-‐
tion, especially when there are many possible or actual causes, and—a separate point—if
attribution of portions of the effect to each of them must be attempted.35
Note C2.1: As Robert Brinkerhoff argues, success cases may be worth their own analysis as
a separate group, regardless of the average improvement (if any) due to the program (since
the benefits in those cases alone may justify the cost of the program)36; the failure cases
should also be examined, for differences and toxic factors.
Note C2.2: Keep the “triple bottom-‐line” approach in mind. This means that, as well as (i)
conventional outcomes (e.g., learning gains by impactees), you should always be looking for
(ii) community (include social capital) changes, and (iii) environmental impact… And al-‐
ways comment on (iv) the risk aspect of outcomes, which is likely to be valued very differ-‐
ently by different stakeholders… Especially, do not overlook (v) the effects on the program
staff, good and bad; e.g., lessons and skills learned, and the usual effects of stress; and (vi)
the pre-‐program effects mentioned earlier: that is, the (often major) effects of the an-‐
nouncement or discovery that a program will be implemented, or even may be imple-‐
mented. These effects include booms in real estate and migration of various groups
to/from the community, and are sometimes more serious in at least the economic dimen-‐
sion than the directly caused results of the program’s implementation on this impact group,
the ‘anticipators.’ Looking at these effects carefully is sometimes included under preforma-‐
tive evaluation (which also covers looking at other dimensions of the planned program,
such as evaluability).
Note C2.3: It is usually true that evaluations have to be completed long before some of the
main outcomes have, or indeed could have, occurred—let alone have been inspected care-‐

35
On this, consult recent literature by, or cited by, Cook or Scriven, e.g., in the 6th and the 8th issues of the Journal of
MultiDisciplinary Evaluation (2008), at jmde.com, and American Journal of Evaluation (3, 2010)).
36 Robert Brinkerhoff in The Success Case Method (Berrett-Koehler, 2003).

22 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

fully. This leads to a common practice of depending heavily on predictions of outcomes
based on indications or small samples of what they will be. This is a risky activity, and
needs to be carefully highlighted, along with the assumptions on which the prediction is
based, and the checks that have been made on them, as far as is possible. Many very expen-‐
sive evaluations of giant international aid programs have been based almost entirely on
outcomes estimated by the same agency that did the evaluation and the installation of the
program—estimates that, not too surprisingly, turned out to be absurdly optimistic. Pessi-‐
mism can equally well be ill-‐based, for example predicting the survival chances of Stage IV
cancer patients is often done using the existing data on five-‐year survival—but that ignores
the impact of research on treatment in (at least) the last five years, which has often been
considerable. On the other hand, waiting for the next Force 8 earthquake to test disaster
plans is stupid; simulations, if designed by a competent external agency, can do a very good
job in estimating long-‐term effects of a new plan.
Note C2.4: Identifying the impactees is not only a matter of identifying each individual—or
at least small group—that is impacted (targeted or not), hard though that is; it is also a
matter of finding patterns in them, e.g., a tendency for the intervention to be more success-‐
ful with women than men. Finding patterns in the data is of course a traditional scientific
task, so here is one case amongst several where the task of the evaluator includes one of
the core tasks of the traditional scientist.

Note C2.5: Furthermore, if you have discovered any unanticipated side-‐effects at all, con-‐
sider that they are likely to require evaluation against some values that were not con-‐
sidered under the Values checkpoint, since you were not expecting them; you need to go
back and expand your list of relevant values, and develop scales and rubrics for these, too.
Note C2.6: Almost without exception, the social science literature on effects identifies them as
what happened after an intervention that would not have happened without the presence of the
intervention—this is the so-called counterfactual property. This is a complete fallacy, and shows
culpable ignorance of about a century’s literature on causation in the logic of science (see refer-
ences given above on causation, e.g., in footnote 8). Many effects would have happened anyway,
due to the presence of other factors with causal potential; this is the phenomenon of ‘overdeter-
mination’ which is common in the social sciences. For example, the good that Catholic Charities
does in a disaster might well have occurred if they were not operating, since there are other
sources of help with identical target populations; this does not show they were not in fact the
causal agency nor does it show that they are redundant.

C3. Costs
This checkpoint brings in what might be called ‘the other quantitative element in evalu-‐
ation’ besides statistics, i.e., (most of) cost analysis. But don’t forget there is also such a
thing as qualitative cost analysis, and it’s also very important—and, done properly, it’s not
a feeble surrogate for quantitative cost analysis but an essentially independent effort; note
that both quantitative and qualitative cost-‐analysis are included in the economist’s defini-‐
tion of cost-‐effectiveness. Both are usually very important in determining worth (or, in one
sense, value) by contrast with plain merit (a.k.a. quality). Both were almost totally ignored
for many years after program evaluation became a matter of professional practice; and a
recent survey of journal articles by Nadini Persaud shows they are still seriously underused

23 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

in evaluation. An impediment to progress that she points out is that today, CA (cost analy-‐
sis) is done in a different way by economists and accountants,37 and you will need to make
clear which approach you are using, or that you are using both—and, if you do use both,
indicate when and where you use each. There are also a number of different types of quan-‐
titative CA—cost-‐benefit analysis, cost-‐effectiveness analysis, cost-‐utility analysis, cost-‐
feasibility analysis, etc., and each has a particular purpose; be sure you know which one
you need and explain why in the report (the definitions in Wikipedia are good). The first
two require calculation of benefits as well as costs, which usually means you have to find,
and monetize if important (and possible), the benefits and damages from Checkpoint C2 as
well as the more conventional (input) costs.
At a superficial level, cost analysis requires attention to and distinguishing between: (i)
money vs. non-‐money vs. non-‐monetizable costs; (ii) direct and indirect costs; (iii) both ac-‐
tual and opportunity costs;38 and (iv) sunk (already spent) vs. prospective costs. It is also
often helpful, for the evaluator and/or audiences, to itemize these by: developmental stage,
i.e., in terms of the costs of: (a) start-‐up (purchase, recruiting, training, site preparation,
etc.); (b) maintenance (including ongoing training and evaluating); (c) upgrades; (d) shut-‐
down; (e) residual (e.g., environmental damage); and/or by calendar time period; and/or
by cost elements (rent, equipment, personnel, etc.); and/or by payee. Include use of ex-‐
pended but never utilized value, if any, e.g., social capital (such as decline in workforce mo-‐
rale).
The most common significant non-‐money costs that are often monetizable are space, time,
expertise, and common labor, to the extent that these are not available for purchase in the
open market—when they are so available, they can be monetized. The less measurable but
often more significant ones include: lives, health, pain, stress (and other positive or neg-‐
ative affects), social/political/personal capital or debts (e.g., reputation, goodwill, interper-‐
sonal skills), morale, energy reserves, content and currency of technical knowledge/skills,
and immediate/long-‐term environmental costs. Of course, in all this, you should be analyz-‐
ing the costs and benefits of unintended as well as intended outcomes; and, although unin-‐
tended heavily overlaps unanticipated, both must be covered. The non-‐money costs are al-‐
most never trivial in large program evaluations, technology assessment, or senior staff ev-‐
aluation, and very often decisive. The fact that in rare contexts (e.g., insurance suits) some
money equivalent of e.g., a life, is treated seriously is not a sign that life is a monetizable

37
Accountants do ‘financial analysis’ which is oriented towards an individual’s monetary situation, econo-‐
mists do ‘economic analysis’ which is takes a societal point of view.
38 Economists often define the costs of P as the value of the most valuable forsaken alternative (MVFA), i.e., as the

same as opportunity costs. This risks circularity, since it’s arguable that you can’t determine the value of the MVFA
without knowing what it required you to give up, i.e., identifying its MVFA. In general, it may be better to define
ordinary costs as the tangible valued resources that were used to cause the evaluand to come into existence (money,
time, expertise, effort, etc.), and opportunity costs as another dimension of cost, namely the MVFA you spurned by
choosing to create the evaluand rather than the best alternative path to your goals, using about the same resources.
The deeper problem is this: the ‘opportunity cost of the evaluand’ is ambiguous; it may mean the value of something
else to do the same job, or it may mean the value of the resources if you didn’t attempt this job at all. (See my “Cost
in Evaluation: Concept and Practice”, in The Costs of Evaluation, edited by Alkin and Solomon, (Sage, 1983) and
“The Economist’s Fallacy” in jmde.com, 2007).

24 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

value in general i.e., across more than that very limited context,39 let alone a sign that if we
only persevere, cost analysis can be treated as really a quantitative task or even as a task for
which a quantitative version will give us a useful approximation to the real truth. Both
views are categorically wrong, as is apparent if you think about the difference between the
value of a particular person’s life to their family, vs. to their employer/employees/cowork-‐
ers, vs. to their profession, and to their friends; and the difference between those values as
between different people whose lost lives we are evaluating. And don’t think that the way
out is to allocate different money values to each specific case, i.e., to each person’s life-‐loss
for each impacted group: not only will this destroy generalizability but the cost to some of
these impactees is clearly still not covered by money, e.g., when a great theoretician or
musician dies.
As an evaluator you aren’t doing a literal audit, since you’re (usually) not an accountant,
but you can surely benefit if an audit is available, or being done in parallel. Otherwise, con-‐
sider hiring a good accountant as a consultant to the evaluation; or an economist, if you’re
going that way. But even without the accounting expertise, your cost analysis and certainly
your evaluation, if you follow the lists here, will include key factors—for decision-‐making
or simple appraisal—usually omitted from standard auditing practice. And keep in mind
that there are evaluations where it is appropriate to analyze benefits (a subset of out-‐
comes) in just the same way, i.e., by type, time of appearance, etc. This is especially useful
when you are doing an evaluation with an emphasis on cost-‐benefit tradeoffs.
Note C3.1: This sub-‐evaluation (especially item (iii) in the first list) is the key element in
the determination of worth.
Note C3.2: If you have not already evaluated the program’s risk-‐management efforts under
Process, consider doing—or having it done—as part of this checkpoint.
Note C3.3: Sensitivity analysis is the cost-‐analysis analog of robustness analysis in statistics
and testing methodology, and equally important. It is essential to do it for any quantitative
results.
Note C3.4: The discussion of CA in this checkpoint so far uses the concept of cost-‐effect-‐
iveness in the usual economic sense, but there is another sense of this concept that is of
considerable importance in evaluation—in some but not all contexts, and this sense does
not seem to be discussed in the economic or accounting literature. (It is the ‘extended
sense’ mentioned in the Resources checkpoint discussion above.) In this sense, efficiency or
cost-‐effectiveness means the ratio of benefits to resources available not resources used. In
this sense—remember, it’s only appropriate in certain contexts—one would say that a pro-‐
gram, e.g., an aid program funded to provide clean water to refugees in the Haitian tent cit-‐
ies in 2010, was (at least in this respect) inefficient/cost-‐ineffective if it did not do as much
as was possible with the resources provided. There may be exigent circumstances that de-‐
flect any imputation of irresponsibility here, but the fact remains that the program needs to
be categorized as unsatisfactory with respect to getting the job done, even though it was
provided with adequate resources to do it. Moral: when you’re doing CA in an evaluation,
don’t just analyze what was spent but also what was available.
39
The World Bank since 1966 has recommended reporting mortality data in terms of lives saved or lost, not dollars
(Persaud reference).

25 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

C4. Comparisons
Comparative or relative m/w/s, which requires comparisons, is often extremely illuminat-‐
ing, and sometimes absolutely essential—as when a government has to decide on whether
to refund a health program, go with a different one, or abandon the sector to private enter-‐
prise. Here you must look for programs or other entities that are alternative ways for get-‐
ting the same or similar benefits from about the same resources, especially those that use
fewer resources. Anything that comes close to this is known as a “critical competitor”. Iden-‐
tifying the most important critical competitors is a test of high intelligence, since they are
often very unlike the standard competitors, e.g., a key critical competitor for telephone and
email communication in extreme disaster planning is carrier pigeons, even today. It is also
often worth looking for, and reporting on, at least one other alternative—if you can find
one—that is much cheaper but not much less effective (‘el cheapo’); and one much stronger
although costlier alternative, i.e., one that produces far more payoffs or process advantages
(‘el magnifico’), although still within the outer limits of the available Resources identified in
Checkpoint B4; the extra cost may still be the best bet. (But be sure that you check care-‐
fully, e.g., don’t assume the more expensive option is higher quality because it’s higher
priced.) It’s also sometimes worth comparing the evaluand with a widely adopted/admired
approach that is perceived by important stakeholders as an alternative, though not really in
the race, e.g., a local icon. Keep in mind that looking for programs ‘having the same effects’
means looking at the side-‐effects as well as intended effects, to the extent they are known,
though of course the best available critical competitor might not match on side-‐effects…
Treading on potentially thin ice, there are also sometimes strong reasons to compare the
evaluand with a demonstrably possible alternative, a ‘virtual critical competitor’—one that
could be assembled from existing or easily constructed components (the next checkpoint is
another place where ideas for this can emerge). The ice is thin because you’re now moving
into the partial role of a program designer rather than an evaluator, which creates a risk of
conflict of interest (you may be ego-‐involved as author of a possible competitor and hence
not objective about evaluating it or, therefore, the original evaluand). Also, if your ongoing
role is that of formative evaluator, you need to be sure that your client can digest sugges-‐
tions of virtual competitors (see also Checkpoint D2). The key comparisons should be con-‐
stantly updated as you find out more from the evaluation of the primary evaluand, espe-‐
cially new side-‐effects, and should always be in the background of your thinking about the
evaluand.
Note C4.1: It sometimes looks as if looking for critical competitors is a completely wrong
approach, e.g., when we are doing formative evaluation of a program i.e., with the interest
of improvement: but in fact, it’s important even then to be sure that the changes made or
recommended really do add up, taken all together, to an improvement, so you need to com-‐
pare version 2 with version 1, and also with available alternatives since the set of critical
competitors may change as you modify the evaluand.
Note C4.2: It’s tempting to collapse the Cost and Comparison checkpoints into ‘Compara-‐
tive Cost-‐Effectiveness’ (as Davidson does, for example) but it’s better to keep them sepa-‐
rate because for certain important purposes, e.g., fund-‐raising, you will need the separate
results. Other examples: you often need to look at simple cost-‐feasibility, which does not

26 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

involve comparisons (but give the critical competitors a quick look in case one of them is
cost-‐feasible); or at relative merit when ‘cost is no object’ (which means ‘all available alter-‐
natives are cost-‐feasible, and the merit gains from choosing correctly are much more im-‐
portant than cost savings’).
Note C4.3: One often hears the question: “But won’t the Comparisons Checkpoint double or
triple our costs for the evaluation—after all, the comparisons needed have to be quite de-‐
tailed in order to match one based on the KEC?” Some responses: (i) “But the savings on
purchase costs may be much more than that;” (ii) “There may already be a decent evalu-‐
ation of some or several or all critical competitors in the literature;” (iii) “Other funding
sources may be interested in the broader evaluation, and able to help with the extra costs;”
(iv) “Good design of the evaluations of alternatives will often eliminate potential competi-‐
tors at trifling cost, by starting with the checkpoints on which they are most obviously vul-‐
nerable;” (v) “ Estimates, if that’s all you can afford, are much cheaper than evaluations, and
better than not doing a comparison at all.”

C5. Generalizability

Other names for this checkpoint (or something close to, or part of it) are: exportability,
transferability, transportability—which would put it close to Campbell’s “external valid-‐
ity”— but it also covers sustainability, longevity, durability, and resilience, since these tell
you about generalizing the program’s merit to other times rather than (or as well as) other
places or circumstances besides the one you’re at (in either direction, so the historian is in-‐
volved.) Note that this checkpoint concerns the sustainability of the program, not the sus-‐
tainability of its effects, which is also important and covered under impact.
Although other checkpoints bear on it (because they are needed to establish that the pro-‐
gram has non-‐trivial benefits), this checkpoint is frequently the most important one of the
core five when attempting to determine significance. (The other highly relevant checkpoint
for that is C4, where we look at how much better it is compared to whatever else is avail-‐
able; and the final word on that comes in Checkpoint D1, especially Note D1.1.) Under
Checkpoint C5, you must find the answers to questions like these: Can the program be used,
with similar results, if we use it: (i) with other content; (ii) at other sites; (iii) with other
staff; (iv) on a larger (or smaller) scale; (v) with other recipients; (vi) in other climates
(social, political, physical); etc. An affirmative answer on any of these ‘dimensions of gener-‐
alization’ is a merit, since it adds another universe to the domains in which the evaluand
can yield benefits (or adverse effects). Looking at generalizability thus makes it possible
(sometimes) to benefit greatly from, instead of dismissing, programs and policies whose
use at the time of the study was for a very small group of impactees—such programs may
be extremely important because of their generalizability.
Generalization to (vii) later times, a.k.a. longevity, is nearly always important (under com-‐
mon adverse conditions, it’s durability). Even more important is (viii) sustainability (this is
external sustainability, not the same as the internal variety mentioned under Process). It is
sometimes inadequately treated as meaning, or as equivalent to, ‘resilience to risk.’ Sus-‐
tainability usually requires making sure the evaluand can survive at least the termination

27 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

of the original funding (which is usually not a risk but a known certainty), and also some
range of hazards under the headings of warfare or disasters of the natural as well as finan-‐
cial, social, ecological, and political varieties. Sustainability isn’t the same as resilience to
risk especially because is must cover future certainties, such as seasonal changes in tem-‐
perature, humidity, water supply—and the end of the reign of the present CEO, or of pres-‐
ent funding. But the ‘resilience to risk’ definition has the merit of reminding us that this
checkpoint will require some effort at identifying and then estimating the likelihood of the
occurrence of the more serious risks, and costing the attendant losses. Sustainability is
sometimes even more important than longevity, for example when evaluating international
or cross-‐cultural developmental programs; longevity and durability refer primarily to the
reliability of the ‘machinery’ of the program and its maintenance, including availability of
the required labor/expertise and tech supplies; but are less connotative of external threats
such as the ‘100-‐year drought’ or civil war, and less concerned with ‘continuing to produce
the same results’ which is what you really care about. Note that what you’re generalizing—
i.e., predicting—about these programs is the future (effects) of ‘the program in context,’ not
the mere existence of the program, and so any context required for the effects should be
specified, and include any required infrastructure. Here, as in the previous checkpoint, we
are making predictions about outcomes in certain scenarios, and, although risky, this some-‐
times generates the greatest contribution of the evaluation to improvement of the world
(see also the ‘possible scenarios’ of Checkpoint D4). All three show the extent to which
good evaluation is a creative and not just a reactive enterprise. That’s the good news way of
putting the point; the bad news way is that much good evaluation involves raising ques-‐
tions that can only be answered definitively by doing work that you are probably not
funded to do.
Note C5.1: Above all, keep in mind that the absence of generalizability has absolutely no
deleterious effect on establishing that a program is meritorious, unlike the absence of a
positive rating on any of the four other sub-‐evaluation dimensions. It only affects establish-‐
ing the extent of its benefits. This can be put by saying that generalizability is a plus, but its
absence is not a minus—unless you’re scoring for the Ideal Program Oscars. Putting it an-‐
other way, generalizability is highly desirable, but that doesn’t mean that it’s a requirement
for m/w/s. A program may do the job of meeting needs just where it was designed to do
that, and not be generalizable—and still rate an A+.
Note C5.2: Although generalizability is ‘only’ a plus, it needs to be explicitly defined and de-‐
fended. It is still the case that good researchers make careless mistakes of inappropriate
implicit generalization. For example, there is still much discussion, with good researchers
on both sides, of whether the use of student ratings of college instructors and courses im-‐
proves instruction, or has any useful level of validity. But any conclusion on this topic in-‐
volves an illicit generalization, since the evaluand ‘student ratings’ is about as useful in
such evaluations as ‘herbal medicine’ is in arguments about whether herbal medicine is
beneficial or not. Since any close study shows that herbal medicines with the same label
often contain completely different substances (and almost always substantially different
amounts of the main element), and since most but not all student rating forms are invalid
or uninterpretable for more than one reason, the essential foundation for the generaliza-‐
tion—a common referent—is non-‐existent. Similarly, investigations of whether online
teaching is superior to onsite instruction, or vice versa, are about absurdly variable ev-‐

28 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

aluands, and generalizing about their relative merits is like generalizing about the ethical
standards of ‘white folk’ compared to ‘Asians.’ Conversely, and just as importantly, evalu-‐
ative studies of a nationally distributed reading program must begin by checking the fi-‐
delity of your sample (Description and Process checkpoints). This is checking instantiation
(sometimes this is part of what is called ‘checking dosage’ in the medical/pharmaceutical
context), the complementary problem to checking generalization.
Note C5.3: Checkpoint C5 is, perhaps more than any others, the residence of prediction,
with all its special problems. Will the program continue to work in its present form? Will it
work in some modified form? In some different context? With different person-‐
nel/clients/recipients? These, and the others listed above, are each formidable prediction
tasks that will, in important cases, require separate research into their special problems.
When special advice cannot be found, it is tempting to fall back on the assumption that, ab-‐
sent ad hoc considerations, the best prediction is extrapolation of current trends. That’s the
best simple choice, but it’s not the best you can do; you can at least identify the most com-‐
mon interfering conditions and check to see if they are/will be present and require a modi-‐
fication or rejection of the simple extrapolation. Example: will the program continue to do
as well as it has been doing? Possibly not if the talented CEO dies/retires/leaves/burns
out? So check on the evidence for each of these possibilities, thereby increasing the validity
of the bet on steady-‐state results, or forcing a switch to another bet. See also Note D2.2.
General Note 7: Comparisons, Costs, and Generalizability are in the same category as
values from the list in Checkpoint B5; they are all considerations of certain dimensions of
value—comparative value, economic value, general value. Why do they get special billing
with their own checkpoint in the list of sub-‐evaluations? Basically, because of (i) their vir-‐
tually universal critical importance40, (ii) the frequency with which one or more are omit-‐
ted from evaluations when they should have been included, and (iii) because they each in-‐
volve some techniques of a relatively special kind. Despite their idiosyncrasies, it’s also
possible to see them as potential exemplars, by analogy at least, of how to deal with some
of the other relevant values from Checkpoint B5, which will come up as relevant under Pro-‐
cess, Outcomes, and Comparisons.

PART D: CONCLUSIONS & IMPLICATIONS

D1. Synthesis
Now we’re beginning to develop the key elements of the report and the executive sum-‐
mary. You have already done a great deal of the required synthesis of facts with values
using the scales developed in Checkpoint B5, Values, in order to get the sub-‐evaluations of
Part C. This means you already have an evaluative profile of the evaluand: i.e., a bar graph,
the simplest graphical means of representing a multidimensional evaluative conclusion,
and greatly superior to a table for most clients and audiences. But for some evaluative pur-‐
40
Of course, ethics (and the law) is critically important, but only as a framework constraint that must not be vio-
lated. Outcomes are the material benefits or damage within the ethical/legal framework and their size and direction
are the most variable and antecedently uncertain, and hence highly critical findings from the evaluation. Ethics is the
greenhouse; outcomes are what grows inside it.

29 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

poses you need a further synthesis, this time of the sub-‐evaluations, because you need to
get a one-‐dimensional evaluative conclusion, i.e., an overall grade or, if you can justify a
quantitative scale, an overall score. For example, you may need to assist the client in choos-‐
ing the best of several evaluands, which means ranking them, and the easiest way to do this
is to have each of them evaluated on a single overall summative dimension. That’s easy to
say, but it’s not easy to justify most efforts to do that, because in order to combine those
multiple dimensions into a single one, you have to have a legitimate common metric for
them, which is rarely supportable. (It’s easy to see why a quantitative approach is so attrac-‐
tive!) At the least, you’ll need a supportable estimate of the relative importance of each di-‐
mension of merit, and not even that is easy to get. Details of how and when it can be done
will be provided elsewhere and would take too much space to fit in here.41 The content
focus (point of view) of the synthesis, on which the common metric should be based, should
usually be the present and future total impact on consumer (e.g., employer, employee, pa-‐
tient, student) or community needs, subject to the constraints of ethics, the law, and re-‐
source-‐feasibility, etc… Apart from the need for a ranking there is very often also a practical
need for a concise presentation of the most crucial evaluative information. A profile showing
merit on the five core dimensions of Part C can often meet that need, without going to a
uni-‐dimensional compression into a single grade. Another possible profile for such a sum-‐
mary would be based on the SWOT checklist widely used in business: Strengths, Weak-‐
nesses, Opportunities, and Threats.42 Sometimes it makes sense to provide both profiles.
This part of the synthesis/summary could also include referencing the results against the
clients’ and perhaps other stakeholders’ goals, wants, or hopes (if feasible), e.g., goals met,
ideals realized, created but unrealized value, when these are determinable, which can also
be done with a profile. But the primary obligation of the evaluator is usually to reference
the results to the needs of the impacted population, within the constraints of overarching
values such as ethics, the law, the culture, etc. Programs are not made into good programs
by matching someone’s goals, but by doing something worthwhile, on balance. Of course,
for public or philanthropic funding, the two should coincide, but you can’t assume they do;
in fact, they are all-‐too-‐often incompatible.
Another popular focus for the overall report is the ROI (return on investment), which is su-‐
perbly concise, but it’s too limited (no ethics, side-‐effects, goal critique, etc.) The often-‐
suggested 3D expansion of ROI gives us the 3P dimensions—benefits to People, Planet, and
Profit—often called the ‘triple bottom line.’ It’s still a bit narrow and we can do better with
the 5 dimensions listed here as the sub-‐evaluations listed in Part C: Process, Outcomes;
Costs; Comparisons; Generalizability. A bar graph showing the merit of achievements on
each of these provides a succinct and insightful profile of a program’s value. To achieve it,
you will need defensible definitions of the standards you are using on each column (the
rubrics), e.g., “An A grade for Outcomes will require…” and there will be ‘bars’ (i.e., absolute
minimum standards) on several of these, e.g., ethical acceptability on the Outcomes scale,
cost-‐feasibility on the Costs scale. Since it’s highly desirable that you have these for any
serious program evaluation, this 5D summary should not be a dealbreaker requirement.

41
An article “The Logic of Evaluation” forthcoming by summer, 2011, on the web site michaelscriven.info does a
better job on this than my previous efforts, which do not now seem adequate as references.
42 Google provides 6.2 million references for SWOT (@1/23/07), but the top two or three are good introductions.

30 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

(Another version of a 5D approach is given in the paper “Evaluation of Training” that is on-‐
line at michaelscriven.info. )
Apart from the rubrics for each relevant value, if you have to come up with an overall grade
of some kind, you will need to do an overall synthesis to reduce the two-‐dimensional pro-‐
file to a ‘score’ on a single dimension. (Since it may be qualitative, we’ll use the term ‘grade’
for this property.) Getting to an overall grade requires what we might call a meta-‐rubric—a
set of rules for converting profiles—which are typically themselves a set of grades on sev-‐
eral dimensions—to a grade on a single scale. What we call ‘weighting’ the dimensions is a
basic kind of meta-‐rubric since it’s an instruction to take some of the constituent grades
more seriously than others for some further, ‘higher-level’ evaluative purpose. (A neat way to
display this graphically is by using the width of a column in the profile to indicate import-‐
ance.) If you are lucky enough to have developed an evaluative profile for a particular ev-‐
aluand, in which each dimension of merit is of equal importance (or of some given numeri-
cal importance compared to the others), and if each grade can be expressed numerically,
then you can just average the grades. BUT legitimate examples of such cases are almost un-‐
known, although we often oversimplify and act as if we have them when we don’t. For ex-‐
ample, we average college grades to get the GPA (grade point average) and use this in many
overall evaluative contexts such as selection for admission to graduate programs. Of
course, this oversimplification can be, and frequently is, ‘gamed’ by students e.g., by taking
courses where grade inflation means that the A’s do not represent excellent work by any
reasonable standard. A better meta-‐rubric results from using a comprehensive exam,
graded by a departmental committee instead of one person, and then giving the grade on
this double weight, or even 80% of the weight. Another common meta-‐rubric in graduate
schools is setting a meta-‐bar, i.e., an overall absolute requirement for graduation, e.g., that
no single dimension (course or a named subset of crucially important courses) be graded
below B-‐.
Note D1.1: One special conclusion to go for, often a major part of determining significance,
comes from looking at what was done against what could have been done with the Re-‐
sources available, including social and individual capital. This is one of several cases where
imagination is needed to determine a grade on the Opportunities part of the SWOT analy-‐
sis. But remember this is thin ice territory (see Note C4.1).
Note D1.2: Be sure to convey some sense of the strength of your conclusions, which means
the combination of: (i) the net weight of the evidence for the premises, with (ii) the proba-
bility of the inferences from them to the conclusion(s), and (iii) the probability that there is
no other relevant evidence. For example, indicate whether the performance on the various
dimensions of merit was a tricky inference or directly observed; did the evaluand clear any
bars or lead any competitors ‘by a mile’ or just scrape over (i.e., use gap-‐ranking not just
ranking43); were the predictions involved double-‐checked for invalidating indicators (see
Note C5.2); was the conclusion established ‘beyond any reasonable doubt,’ or merely ‘sup-‐
ported by the balance of the evidence’? This complex property of the evaluation is referred
43
Gap-ranking is a refinement of ranking in which a qualitative or quantitative estimate of the size of intervals be-
tween evaluands is provided (modeled after the system in horse-racing—‘by a head,’ ‘by a nose,’ ‘by three lengths,’
etc. This is often enormously more useful than mere ranking e.g., because it tells a buyer that s/he can get very
nearly as good a product for much less money.

31 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

to as ‘robustness.’ Some specific aspects of the limitations also need statement here e.g.,
those due to limited time-‐frame (which often rules out some mid-‐ or long-‐term follow-‐ups
that are badly needed).

D2. (possible) Recommendations, Explanations, Predictions, and Redesigns.
All of these possibilities are examples of the ‘something more’ approach to evaluation, by
contrast with the more conservative ‘nothing but’ approach, which advocates rather careful
restriction of the evaluator’s activities to evaluation, ‘pure and simple.’ These alternatives
have analogies in every profession—judges are tempted to accept directorships in com-‐
panies who may come before them as defendants, counsellors consider adopting counse-‐
lees, etc. The ‘nothing more approach’ can be expressed, with thanks to a friend of Gloria
Steinem, as: ‘An evaluation without recommendations (or explanations, etc.) is like a fish
without a bicycle.’ Still, there are more caveats about pressing for evaluation-‐separation
than with the fish. In other words, ‘lessons learned’—of whatever type—should be sought
diligently, expressed cautiously, and applied even more cautiously.
Let’s start with recommendations. Micro-‐recommendations—those concerning the inter-‐
nal workings of program management and the equipment or personnel choices/use—often
become obvious to the evaluator during the investigation, and are demonstrable at little or
no extra cost/effort (we sometimes say they “fall out” from the evaluation; as an example of
how easy this can sometimes be, think of copy-‐editors, who often do both evaluation and
recommendation to an author in one pass), or they may occur to a knowledgeable evalu-‐
ator who is motivated to help the program, because of his/her expert knowledge of this or
an indirectly or partially relevant field such as information or business technology, organi-‐
zation theory, systems concepts, or clinical psychology. These ‘operational recommenda-‐
tions’ can be very useful—it’s not unusual for a client to say that these suggestions alone
were worth more than the cost of the evaluation. (Naturally, these suggestions have to be
within the limitations of the (program developer’s) Resources checkpoint, except when
doing the Generalizability checkpoint.) Generating these ‘within-‐program’ recommenda-‐
tions as part of formative evaluation (though they’re one step away from the primary task
of formative evaluation which is straight evaluation of the present quality of the evaluand),
is one of the good side-‐effects that may come from using an external evaluator, who often
has a new view of things that everyone on the scene may have seen too often to see criti-‐
cally.
On the other hand, macro-‐recommendations—which are about the disposition or classifica-‐
tion of the whole program (refund, cut, modify, export, etc.—which we might also call ex-‐
ternal management recommendations, or dispositional recommendations)—are usually
another matter. These are important decisions serviced by and properly dependent on,
summative evaluations, but making recommendations about the evaluand is not intrinsi-‐
cally part of the task of evaluation as such, since it depends on other matters besides the
m/w/s of the program, which is all the evaluator normally can undertake to determine.
For the evaluator to make dispositional recommendations about a program’s disposition
will typically require two extras over and above what it takes to evaluate the program: (i)
extensive knowledge of the other factors in the context-‐of-‐decision for the top-‐level

32 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

(‘about-‐program’) decision-‐makers. Remember that those people are often not the clients
for the evaluation—they are often further up the organization chart—and they may be un-‐
willing or psychologically or legally unable to provide full details about the context-‐of-‐
decision concerning the program (e.g., unable because implicit values are not always rec-‐
ognized by those who operate using them). The correct dispositional decisions often rightly
depend on legal or donor constraints on the use of funds, and sometimes on legitimate po-‐
litical constraints not explained to the evaluator, not just m/w/s; and any of these can arise
after the evaluation begins and the evaluator is briefed about then-‐known environmental
constraints, if s/he is briefed at all.
Such recommendations will also often require (ii) considerable extra effort e.g., to evaluate
each of the other macro-‐options. Key elements in this may be trade secrets or national se-‐
curity matters not available to the evaluator, e.g., the true sales figures, the best estimate of
competitors’ success, the extent of political vulnerability for work on family planning, the
effect on share prices of withdrawing from this slice of the market. This elusiveness also
often applies to the macro-‐decision makers’ true values, with respect to this decision,
which are quite often trade or management or government secrets of the board of direc-‐
tors, or select legislators, or perhaps personal values only known to their psychotherapists.
So it is really a quaint conceit of evaluators to suppose that the m/w/s of the evaluand are
the only relevant grounds for deciding how to dispose of it; there are often entirely legiti-‐
mate political, legal, public-‐perception, market, and ethical considerations that are at least
as important, especially in toto. So it’s simply presumptuous to propose macro-‐recomm-‐
endations as if they follow directly from the evaluation: they almost never do, even when
the client may suppose that they do, and encourage the evaluator to produce them. (It’s a
mistake I’ve made more than once.) If you do have the required knowledge to infer to them,
then at least be very clear that you are doing a different evaluation in order to reach them,
namely an evaluation of the alternative options open to the disposition decision-‐makers, by
contrast with an evaluation of the evaluand itself. In the standard program evaluation, but
not in the evaluation of various dispositions of it, you can sometimes include an evaluation
of the internal choices available to the program manager, i.e., recommendations for im-‐
provements.
There are a couple of ways to ‘soften’ recommendations in order to take account of these
hazards. The simplest way is to preface them by saying, “Assuming that the program’s dis-‐
position is dependent only on its m/w/s, it is recommended that…” A more creative and
often more productive approach, advocated by Jane Davidson, is to convert recommenda-‐
tions into options, e.g., as follows: “It would seem that program management/staff faces a
choice between: (i) continuing with the status quo; (ii) abandoning this component of the
program; (iii) implementing the following variant [here you insert your recommendation]
or some variation of this.” The program management/staff is thus invited to adopt and be-‐
come a co-‐author of an option, a strategy that is often more likely to result in implementa-‐
tion than a mere recommendation from an outsider.
Many of these extra requirements for making macro-‐recommendations—and sometimes
one other—also apply to providing explanations of success or failure. The extra require-‐
ment is possession of the correct (not just the believed) logic or theory of the program,
which typically requires more than—and rarely requires less than—state-‐of-‐the-‐art sub-‐

33 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

ject-‐matter expertise, both practical and ‘theoretical’ (i.e., the scientific or engineering ac-‐
count), about the evaluand’s inner workings (i.e., about what optional changes would lead
to what results). A good automobile mechanic has the practical kind of knowledge about
cars that s/he works on regularly, which includes knowing how to identify malfunction and
its possible causes; but it’s often only the automobile engineer who can give you the rea-‐
sons why these causal connections work, which is what the demand for explanations will
usually require. The combination of these requirements imposes considerable, and some-‐
times enormous, extra time and research costs which has too often meant that the attempt
to provide recommendations or explanations (by using the correct program logic) is done
at the expense of doing the basic evaluation task well (or even getting to it at all), a poor
trade-‐off in most cases. Moreover, getting the explanation right will sometimes be abso-‐
lutely impossible within the ‘state of the art’ of science and engineering at the moment—
and this is not a rare event, since in many cases where we’re looking for a useful social
intervention, no-‐one has yet found a plausible account of the underlying phenomenon: for
example, in the cases of delinquency, addiction, autism, serial killing, ADHD. In such cases,
what we need to know is whether we have found a cure—complete or partial—since we
can use that knowledge to save people immediately, and also, thereafter, to start work on
finding the explanation. That’s the ‘aspirin case’—the situation where we can easily, and
with great benefit to many sufferers, evaluate a claimed medication although we don’t
know why it works, and don’t need to know that in order to evaluate its efficacy. In fact, un-‐
til the evaluation is done, there’s no success or failure for the scientist to investigate, which
vastly reduces the significance of the causal inquiry, and hence the probability/value of its
occurrence.
It’s also extremely important to realize that macro-‐recommendations will typically require
the ability to predict the results of the recommended changes in the program, at the very
least in this specific context, which is something that the program logic or program theory
(like many social science theories) is often not able to do with any reliability. Of course,
procedural recommendations in the future tense, e.g., about needed further research or
data-‐gathering or evaluation procedures, are often possible—although typically much less
useful.
‘Plain’ predictions are also often requested by clients or thought to be included in any
good evaluation. (e.g., Will the program work reliably in our schools? Will it work with the
recommended changes, without staff changes?) and are often very hazardous.44 Now, since
these are reasonable questions to answer in deciding on the value of the program for many
clients, you have to try to provide the best response. So read Clinical vs. Statistical Predic-
tion by Paul Meehl and the follow-‐up literature, and the following Note D2.1, and then call
in the subject matter experts. In most cases, the best thing you can do, even with all that
help, is not just to pick what appears to be the most likely result, but to give a range from
the probability of the worst possible outcome (which you describe carefully) to that of the

44 Evaluators sometimes say, in response to such questions, Well, why wouldn’t it work—the reasons for it doing so
are really good? The answer was put rather well some years ago: "…it ought to be remembered that there is nothing
more difficult to take in hand, more perilous to conduct, or more uncertain of success, than to take the lead in the
introduction of a new order of things. Because the innovator has for enemies all those who have done well under the
old conditions, and lukewarm defenders in those who may do well under the new.” (Niccolo Machiavelli (1513),
with thanks to John Belcher and Richard Hake for bringing it up recently (PhysLrnR, 16 Apr 2006)

34 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

best possible outcome (also described), plus the probability of the most likely outcome in
the middle (described even more carefully).45 On rare occasions, you may be able to esti-‐
mate a confidence interval for these estimates. Then the decision-‐makers can apply their
choice of strategy (e.g., minimax—minimizing maximum possible loss) based on their risk-‐
aversiveness.
Although it’s true that almost every evaluation is in a sense predictive, since the data it’s
based on is yesterday’s data but its conclusions are put forward as true today, there’s no
need to be intimidated by the need to predict; one just has to be very clear what assump-‐
tions one is making and how much evidence there is to support them.
Finally, a new twist on ‘something more’ that I first heard proposed by John Gargani and
Stewart Donaldson at the 2010 AEA convention, is for the evaluator to do a redesign of a
program rather than giving a highly negative evaluation. This is a kind of limit case of rec-‐
ommendation, and of course requires an extra skill set, namely design skills. The main
problem here is role conflict and the consequent improper pressure: the evaluator is offer-‐
ing the client loaded alternatives, a variation on ‘your money or your life.’ The advocates
suggest that the world will be a better place if the program is redesigned rather than just
condemned by them, which is probably true; but these are not the only alternatives. The
evaluator might instead recommend the redesign, and suggest calling for bids on that,
recusing his or her candidacy. Or they might just recommend changes that a new designer
should incorporate or consider.
Note D2.1: Policy analysis, in the common situation when the policy is being considered
for future adoption, is close to being program evaluation of future (possible) programs
(a.k.a., ex ante, or prospective program evaluation) and hence necessarily involves all the
checkpoints in the KEC including, in most cases, an especially large dose of prediction. (A
policy is a ‘course or principle of action’ for a certain domain of action, and implementing it
typically produces a program.) Extensive knowledge of the fate of similar programs in the
past is then the key resource, but not the only one. It is also essential to look specifically for
the presence of indicators of future change in the record, e.g., downturns in the perform-‐
ance of the policy in the most recent time periods, intellectual or motivational burn-‐out of
principal players/managers, media attention, the probability of personnel departure for
better offers, the probability of epidemics, natural disasters, legislative ‘counter-‐
revolutions’ by groups of opponents, general economic decline, technological break-‐
throughs, or large changes in taxes or house or market values, etc. If, on the other hand, the
policy has already been implemented, then we’re doing historical (a.k.a. ex post, or retro-‐
spective program evaluation) and policy analysis amounts to program evaluation without
prediction, a much easier case.

45 In PERT charting (PERT = Program Evaluation and Review Technique), a long-established approach to program
planning that emerged from the complexities of planning the first submarine nuclear missile, the Polaris, the formula
for calculating what you should expect from some decision is: {Best possible outcome + Worst Possible outcome +
4 x (Most likely outcome)}/6. It’s a pragmatic solution to consider seriously. My take on this approach is that it only
makes sense when there are good grounds for saying the most likely outcome (MLO) is very likely; there are many
cases where we can identify the best and worst cases, but have no grounds for thinking the intermediate case is more
likely other than the fact it’s intermediate. Now that fact does justify some weighting (given the usual distribution of
probabilities), but the coefficient for the MLO might then be better as 2 or 3.

35 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

Note D2.3: Evaluability assessment is a useful part of good program planning, whenever
it is required, hoped, or likely that evaluation could later be used to help improve as well as
determine the m/w/s of the program to assist decision-‐makers and fixers. It can be done
well by using the KEC to identify the questions that will have to be answered eventually,
and thus to identify the data that will need to be obtained; and the difficulty of doing that
will determine the evaluability of the program as designed. Those preliminary steps are, of
course, exactly the ones that you have to go through to design an evaluation, so the two
processes are two sides of the same coin. Since everything is evaluable, to some extent in
some contexts, the issue of evaluability is a matter of degree, resources, and circumstance,
not of absolute possibility. In other words, while everything is evaluable, by no means is
everything evaluable to a reasonable degree of confidence, with the available resources, in
every context. (For example, the atomic power plant program for Iran after 4/2006, when
access was denied to the U.N. inspectors) As this example illustrates, ‘context’ includes the
date and type of evaluation, since, while this evaluand is not evaluable prospectively with
any confidence, in 4/06—since getting the data is not feasible, and predicting sustainability
is highly speculative—historians will no doubt be able to evaluate it retrospectively, be-‐
cause we will eventually know whether that program paid off, and/or brought on an attack.
Note D2.3: Inappropriate expectations The fact that clients often expect/request explan-‐
ations of success or shortcomings, or macro-‐recommendations, or impossible predictions,
is grounds for educating them about what we can definitely do vs. what we can hope will
turn out to be possible. Although tempting, these expectations on the client’s part are not
an excuse for doing, or trying for long to do, and especially not for promising to do, these
extra things if you lack the very substantial extra requirements for doing them, especially if
that effort jeopardizes the primary task of the evaluator, viz. drawing the needed type of
evaluative conclusion about the evaluand. The merit, worth, or significance of a program is
often hard to determine; it (typically) requires that you determine whether and to what
degree and in what respects and for whom and under what conditions and at what cost it
does (or does not) work better or worse than the available alternatives, and what all that
means for all those involved. To add on the tasks of determining how to improve it, explain-‐
ing why it works (or fails to work), now and in the future, and/or what one should do about
supporting or exporting it, is simply to add other tasks, often of great scientific and/or
managerial/social interest, but quite often beyond current scientific ability, let alone the
ability of an evaluator who is perfectly competent to evaluate the program. In other words,
‘black box evaluation’ should not be used as a term of contempt since it is often the name
for a vitally useful, feasible, and affordable approach, and frequently the only feasible one.
And in fact, most evaluations are of partially blacked-‐out boxes (‘grey boxes’) where one
can only see a little of the inner workings. This is perhaps most obviously true in pharma-‐
cological evaluation, but it is also true in every branch of the discipline of evaluation and
every one of its application fields (health, education, social services, etc.). A program evalu-‐
ator with some knowledge of parapsychology can easily evaluate the success of an alleged
faith-‐healer whose program theory is that God is answering his prayers, without the slight-‐
est commitment to the truth or falsehood of that program theory.
Note D2.4: Finally, there are extreme situations in which the evaluator does have a
responsibility—an ethical responsibility—to move beyond the role of the evaluator, e.g.,
because it becomes clear, early in a formative evaluation, either that (i) some gross
improprieties are involved, or that (ii) certain actions, if taken immediately, will lead to
36 scriven, 18 April, 2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

involved, or that (ii) certain actions, if taken immediately, will lead to very large increases
in benefits, and it is clear that no-‐one besides the evaluator is going to take the necessary
steps. The evaluator is then obliged to be proactive, and we can call the resulting action
whistle-‐blowing in the first case, and proformative evaluation in the second, a cross be-‐
tween formative evaluation and proactivity. While macro-‐recommendations by evaluators
require great care, proactivity requires even greater care.

D3. (possible) Responsibility and Justification

If either can be determined, and if it is appropriate to determine them. Some versions of
accountability that stress the accountability of people do require this—see examples be-‐
low. Allocating blame or praise requires extensive knowledge of: (i) the main players’
knowledge-‐state at the time of key decision making; (ii) their resources and responsibilities
for their knowledge-‐state as well as their actions; as well as (iii) an ethical analysis of their
options, and of the excuses or justifications they (or others, on their behalf) may propose.
Not many evaluators have the qualifications to do this kind of analysis. The “blame game” is
very different from evaluation in most cases and should not be undertaken lightly. Still,
sometimes mistakes are made, are demonstrable, have major consequences, and should be
pointed out as part of an evaluation; and sometimes justified choices, with good or bad ef-‐
fects, are made and attacked, and should be praised or defended as part of an evaluation.
The evaluation of accidents is an example: the investigations of aircraft crashes by the
National Transportation Safety Board in the US are in fact a model example of how this can
be done; they are evaluations of an event with the added requirement of identifying re-‐
sponsibility, whether it’s human or natural causes. (Operating room deaths pose similar
problems but are often not as well investigated.)
Note D3.1: The evaluation of disasters, (a misleading title) recently an area of consider-‐
able activity, typically involves one or more of the following five options: (i) an evaluation
of the extent of preparedness, (ii) an evaluation of the immediate response, (iii) an evalu-‐
ation of the totality of the relief efforts until termination, (iv) an evaluation of the lessons
learned (lessons learned should be a part of each of the evaluations done of the response),
and (v) an evaluation of subsequent corrective/preventative action. All five involve some
evaluation of responsibility and sometimes the allocation of praise/blame. Recent efforts
(c. 2005) referred to as general approaches to the ‘evaluation of disasters’ appear not to
have distinguished all of these and not to have covered all of them, although it seems plaus-‐
ible that all should have been covered in order to minimize the impact of later disasters.

D4. Report & Support

Now we come to the task of conveying the conclusions in an appropriate way, and at ap-‐
propriate times and locations. This is a very different task from—although frequently con-‐
fused with—handing over a semi-‐technical report at the end of the study, the paradigm for
typical research studies of the same phenomena. Evaluation reporting for a single evalu-‐
ation may require, or benefit from, radically different presentations to different audiences,
at different times in the evaluation: these may be oral or written, long or short, public or
private, technical or non-‐technical, graphical or textual, scientific or story-‐telling, anecdotal
and personal or barebones. And this phase of the evaluation process should include post-‐
report help, e.g., handling questions when they turn up later as well as immediately, ex-‐

37 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

plaining the report’s significance to different groups including users, staff, funders, and
other impactees, and even reacting to later program or management or media documents
allegedly reporting the results or implications of the evaluation. This in turn may involve
proactive creation and depiction in the primary report of various possible scenarios of in-‐
terpretations and associated actions that are, and—the contrast is extremely helpful—are
not, consistent with the findings. Essentially, this means doing some problem-‐solving for
the clients, that is, advance handling of difficulties they are likely to encounter with various
audiences. In this process, a wide range of communication skills is often useful and some-‐
times vital, e.g., audience ‘reading’, use and reading of body language, understanding the
multicultural aspects of the situation and the cultural iconography and connotative implica-‐
tions of types of presentations and response.46 There should usually be an explicit effort to
identify ‘lessons learned,’ failures and limitations, costs if requested, and explaining ‘who
evaluates the evaluators.’ Checkpoint D4 should also cover getting the results (and inciden-‐
tal knowledge findings) into the relevant databases, if any; possibly but not necessarily into
the information ocean via journal publication (with careful consideration of the cost of sub-‐
sidizing these for potential readers of the publication chosen); recommending creation of a
new database or information channel (e.g., a newsletter) where beneficial; and dissemina-‐
tion into wider channels if appropriate, e.g., through presentations, online posting, discus-‐
sions at scholarly meetings, or in hardcopy posters, graffiti, book, blogs, wikis, tweets, and
in movies (yes, fans, remember—UTube is free).

D5. Meta-evaluation
This is the evaluation of an evaluation or evaluations—including evaluations based on the
use of this checklist—in order to identify their strengths/limitations/other uses. Meta-‐
evaluation should always be done, as a separate quality control step(s), as follows: (i) to the
extent possible, by the primary evaluator, certainly—but not only—after completion of the
final draft of any report; and (ii) whenever possible also by an external evaluator of the
evaluation (a meta-‐evaluator). The primary criteria of merit for evaluations are: (i) validity,
at a contextually adequate level47; (ii) utility48, including cost-‐feasibility and comprehensi-‐
bility (usually to clients, audiences, and stakeholders) of both the main conclusions about
the m/w/s of the evaluand, and the recommendations, if any; and also any utility arising
from generalizability e.g., of novel methodological approaches; (iii) credibility (to select
stakeholders, especially funders, regulatory agencies, and usually also to program staff);
(iv) comparative cost-‐effectiveness, which goes beyond utility to require consideration of
alternative possible evaluation approaches; (v) robustness, i.e., the extent to which the ev-‐
aluation is immune to variations in context, measures used, point of view of the evaluator
etc; and (vi) ethicality/legality, which includes such matters as avoidance of conflict of in-‐

46The ‘connotative implications’ are in the sub-explicit but supra-symbolic realm of communication, manifested
in—to give a small example—the use of gendered or genderless language.
47 This means that when balance of evidence is all that’s called for (e.g., because a decision has to be made fast) it’s
irrelevant that proof of the conclusion beyond any reasonable doubt was not supplied.
48 Utility is usability and not actual use, the latter—or its absence—being at best a probabilistically sufficient but not

necessary condition for the former, since it may have been very hard to use the results of the evaluation, and
utility/usability requires (reasonable) ease of use. Failure to use the evaluation may be due to base motives or stu-
pidity or an act of God and hence is not a valid indicator of lack of merit.

38 scriven, 18 April, 2011

DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

terest49 and protection of the rights of human subjects—of course, this affects credibility,
but is not exactly the same since the ethicality may be deeply flawed…. There are several
ways to go about meta-‐evaluation. You and later another meta-‐evaluator can: (a) apply the
KEC or PES or GAO list (preferably one or more that was not used to do the evaluation) to
the evaluation itself (e.g., the Cost checkpoint in the KEC then addresses the cost of the ev-‐
aluation, not the program, and so on for all checkpoints); and/or (b) use a special meta-‐
evaluation checklist (there are several available, including the one sketched in the previous
sentence, which is sometimes called the Meta-‐Evaluation Checklist or MEC50); and/or (c) if
funds are available, replicate the evaluation, doing it in the same way, and compare the re-‐
sults; and/or (d) do the same evaluation using a different methodology and compare the
results. It’s highly desirable to employ more than one of these approaches, and all are likely
to require supplementation with some attention to conflict of interest/rights of subjects.
Note D5.1: Literal or direct use are not concepts clearly applicable to evaluations without
recommendations, a category that includes many important, complete, and influential ev-‐
aluations: evaluations are not in themselves recommendations. ‘Due consideration’ or
‘utilization’ is a better generic term for the ideal response to a good evaluation. Failure to
use an evaluation’s results is often due to bad, perhaps venal, management, and so can
never be regarded as an indicator of bad utility, without further evidence.
Note D5.2: Evaluation impacts often occur years after completion and often occur even if
the evaluation was rejected completely when submitted. Evaluators too often give up their
hopes of impact too soon.
Note D5.3: Help with utilization beyond submitting the report should at least have been
offered—see Checkpoint D4.
Note D5.4: Look for contributions from the evaluation to the client organization’s know-‐
ledge management system; if they lack one, recommend creating one.
Note D5.5: Since effects of the evaluation are not usually regarded as effects of the pro-‐
gram, it follows that although an empowerment evaluation should produce substantial
gains in the staff’s knowledge about and tendency to use or improve evaluations, that’s not
an effect of the program in the relevant sense for an evaluator. Also, although that valuable
outcome is an effect of the evaluation, it can’t compensate for low validity or low external
credibility—two of the most common threats to empowerment evaluation—since training
the program staff is not a primary criterion of merit for evaluations.
Note D5.6: Similarly, one common non-‐money cost of an evaluation—disruption of the
work of program staff—is not a bad effect of the program. It is one of the items that should
always be picked up in a meta-‐evaluation. Of course, it’s minimal in goal-‐free evaluation,
since the (field) evaluators do not talk to program staff. Careful design (of program plus ev-‐
aluation) can therefore sometimes bring these evaluation costs near to zero or ensure that
there are benefits that more than offset the cost.

49 There are a number of cases of conflict of interest of particular relevance to evaluators, e.g., formative evaluators
who make suggestions for improvement and then do a subsequent evaluation (formative or summative) of the same
program, of which they are now co-authors—or rejected contributor-wannabes—and hence in conflict of interest.
50
Now online at michaelscriven.info

39 scriven 18 April,.2011
DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

_________________________________________________________________
GENERAL NOTE 8: The explanatory remarks here should be regarded as first approxima-‐
tions to the content of each checkpoint. More detail on some of them and on items men-‐
tioned in them can be found in one of the following: (i) the Evaluation Thesaurus, Michael
Scriven, (4th edition, Sage, 1991), under the checkpoint’s name; (ii) in the references cited
there; (iii) in the online Evaluation Glossary (2006) at evaluation.wmich.edu, partly written
by this author; (iv) in the best expository source now, E. Jane Davidson’s Evaluation Meth-
odology Basics (Sage, 2004 and 2e, 2012 (projected)); (v) in later editions of this document,
at michaelscriven.info. The above version of the KEC itself is, however, in most respects
very much better than the ET one, having been substantially refined and expanded in more
than 60 ‘editions’ (i.e., widely circulated or online posted revisions), since its birth as a two-‐
ager around 1971—16 since early 2009—with much appreciated help from many students
and colleagues, including: Chris Coryn, Jane Davidson, Rob Brinkerhoff, Christian Gugiu,
Nadini Persaud,51 Emil Posavac, Liliana Rodriguez-‐Campos, Daniela Schroeter, Natasha
Wilder, Lori Wingate, and Andrea Wulf; with a thought or two from Michael Quinn Patton’s
work. More suggestions and criticisms are very welcome—please send to:
mjscriv1@gmail.com, with KEC as the first words in the title line. (Suggestions after 3.28.11
that require significant changes are rewarded, not only with an acknowledgment but a little
prize: usually your choice from my list of duplicate books.)
[23,679 words]

51
Dr. Persaud’s detailed comments have been especially valuable: she was a CPA before she became a profes-‐
sional evaluator (but there are not as many changes in the cost section as she thinks are called for, so she is
not to blame for any remaining faults.)

40 scriven, 18 April, 2011

Scriven.2015 KEC 8-15-15
No ratings yet
Scriven.2015 KEC 8-15-15
60 pages
Key Evaluation Checklist
No ratings yet
Key Evaluation Checklist
22 pages
1-Evaluation Methodology - JAne
No ratings yet
1-Evaluation Methodology - JAne
12 pages
Planning A Program Evaluation
No ratings yet
Planning A Program Evaluation
27 pages
Program Evaluation: A Brief Introduction ..
No ratings yet
Program Evaluation: A Brief Introduction ..
13 pages
PPA 502 - Program Evaluation
No ratings yet
PPA 502 - Program Evaluation
47 pages
Program Evaluation Essentials: Dr. Aneel SALMAN Institute of Business Administration Karachi
No ratings yet
Program Evaluation Essentials: Dr. Aneel SALMAN Institute of Business Administration Karachi
23 pages
Program Evaluation - A Step-By-S - Barrett, Nancy
No ratings yet
Program Evaluation - A Step-By-S - Barrett, Nancy
89 pages
Basic Guide Program Evaluation
No ratings yet
Basic Guide Program Evaluation
10 pages
The Meta-Evaluation Checklist: I - Evaluation Is Something We All Do, Often Many Times A Day, As We
No ratings yet
The Meta-Evaluation Checklist: I - Evaluation Is Something We All Do, Often Many Times A Day, As We
5 pages
Types of Evaluations
No ratings yet
Types of Evaluations
4 pages
Evaluation: Purpose and Standards
No ratings yet
Evaluation: Purpose and Standards
7 pages
Ch4 Project Managers Guide Opre Mar2023
No ratings yet
Ch4 Project Managers Guide Opre Mar2023
14 pages
Program Evaluation Form and Approach2
75% (4)
Program Evaluation Form and Approach2
153 pages
How To Conduct Evaluation of
No ratings yet
How To Conduct Evaluation of
78 pages
Sections of This Topic Include
No ratings yet
Sections of This Topic Include
14 pages
A Basic Guide To Program Evaluation
No ratings yet
A Basic Guide To Program Evaluation
8 pages
$PP Analysis CH 8
No ratings yet
$PP Analysis CH 8
81 pages
Evaluation: Definition and Purpose
No ratings yet
Evaluation: Definition and Purpose
2 pages
Conceptualization of Program Evaluation
No ratings yet
Conceptualization of Program Evaluation
42 pages
Ext 507
100% (1)
Ext 507
88 pages
What Is Evaluation Document
No ratings yet
What Is Evaluation Document
5 pages
Project Evaluation Guide For NGOs 1728941407
No ratings yet
Project Evaluation Guide For NGOs 1728941407
98 pages
Evaluation Terms
No ratings yet
Evaluation Terms
15 pages
Myths About Program Evaluation
No ratings yet
Myths About Program Evaluation
2 pages
BKK2016Rugh 1
No ratings yet
BKK2016Rugh 1
48 pages
5044 Davidson Chapter 3
100% (1)
5044 Davidson Chapter 3
30 pages
Edm564m-Program Evaluation-1
No ratings yet
Edm564m-Program Evaluation-1
39 pages
Stufflebeam - Checklist Evaluation
No ratings yet
Stufflebeam - Checklist Evaluation
9 pages
Evaluating The Evaluation
100% (1)
Evaluating The Evaluation
21 pages
Ext 507 (How To Conduct Evaluation)
No ratings yet
Ext 507 (How To Conduct Evaluation)
20 pages
Yambao - Guidance Program Evaluation Report
No ratings yet
Yambao - Guidance Program Evaluation Report
17 pages
Guide To Programme Evaluation in Quality Improvement v10
No ratings yet
Guide To Programme Evaluation in Quality Improvement v10
22 pages
Evaluation Research Methods Guide
No ratings yet
Evaluation Research Methods Guide
22 pages
A Checklist For Evaluating Large-Scale Assessment Programs: by Lorrie A. Shepard University of Colorado
No ratings yet
A Checklist For Evaluating Large-Scale Assessment Programs: by Lorrie A. Shepard University of Colorado
32 pages
Program Evaluation Final
No ratings yet
Program Evaluation Final
38 pages
Douglah. Concept of Program Evaluation
No ratings yet
Douglah. Concept of Program Evaluation
0 pages
Patton - A World Larger Than Formative Summative
No ratings yet
Patton - A World Larger Than Formative Summative
14 pages
Consumer-Oriented Evaluation Approach
83% (6)
Consumer-Oriented Evaluation Approach
13 pages
Performance Evaluation: The Fundamental Purpose of Evaluation
No ratings yet
Performance Evaluation: The Fundamental Purpose of Evaluation
33 pages
2 Criteria of Evaluation
No ratings yet
2 Criteria of Evaluation
3 pages
Evaluation Nimi
No ratings yet
Evaluation Nimi
14 pages
Evaluation Step by Step Guide
No ratings yet
Evaluation Step by Step Guide
36 pages
Program Evaluation: 6 Steps & 4 Standards
No ratings yet
Program Evaluation: 6 Steps & 4 Standards
26 pages
Where Program Evaluation Is Helpful
No ratings yet
Where Program Evaluation Is Helpful
2 pages
Exd114 2025
No ratings yet
Exd114 2025
1 page
Evaluation Guideline
No ratings yet
Evaluation Guideline
13 pages
Conducting An Environmental
No ratings yet
Conducting An Environmental
5 pages
Project - PPT 6 Monitoring and Evaluation
91% (11)
Project - PPT 6 Monitoring and Evaluation
41 pages
Ed Unit-V
No ratings yet
Ed Unit-V
7 pages
Project Monitoring and Evaluation
No ratings yet
Project Monitoring and Evaluation
116 pages
(Ebook PDF) Program Evaluation Alternative Approaches Practical Guidelines 4Th Install Download
No ratings yet
(Ebook PDF) Program Evaluation Alternative Approaches Practical Guidelines 4Th Install Download
54 pages
Evaluation
No ratings yet
Evaluation
2 pages
Learning and Evaluation - Efficiency, Effectiveness and Impact - Meta
No ratings yet
Learning and Evaluation - Efficiency, Effectiveness and Impact - Meta
1 page
Making Evaluation Sensitive To Gender An
No ratings yet
Making Evaluation Sensitive To Gender An
135 pages
Integrating Human Rights and - HRGE Handbook
No ratings yet
Integrating Human Rights and - HRGE Handbook
54 pages
Linking Public Health Academy To Public Health Practice Through Program Evaluation
No ratings yet
Linking Public Health Academy To Public Health Practice Through Program Evaluation
4 pages
Health - Impact - Assessment - HIA - Analyses and Challenges To Brazilian Health Surveillance
No ratings yet
Health - Impact - Assessment - HIA - Analyses and Challenges To Brazilian Health Surveillance
10 pages
Formative Evaluation For Implementation
No ratings yet
Formative Evaluation For Implementation
14 pages
Evaluating Health Promotion Programs 2012
No ratings yet
Evaluating Health Promotion Programs 2012
31 pages
Goal Free Evaluation Scriven
No ratings yet
Goal Free Evaluation Scriven
9 pages
Evaluation Theory Is Who We Are
No ratings yet
Evaluation Theory Is Who We Are
19 pages
Educational Evaluation Around The World - An Anthology
No ratings yet
Educational Evaluation Around The World - An Anthology
177 pages
Effectiveness and Efficiency of Monitoring and Evaluation Systems - Biennial Report On Operations Evaluation
No ratings yet
Effectiveness and Efficiency of Monitoring and Evaluation Systems - Biennial Report On Operations Evaluation
1 page
Brain Tumor Detection and Classification
No ratings yet
Brain Tumor Detection and Classification
14 pages
Listening Script 23
No ratings yet
Listening Script 23
3 pages
LabVIEW - Zadaci 1
No ratings yet
LabVIEW - Zadaci 1
5 pages
Naive Bays Intrusion Detection
No ratings yet
Naive Bays Intrusion Detection
5 pages
Module 1 PDF
No ratings yet
Module 1 PDF
35 pages
Sample of An Environmental Monitoring Protocol
No ratings yet
Sample of An Environmental Monitoring Protocol
3 pages
Cid Mag Managing Opportunities and Risk March08
No ratings yet
Cid Mag Managing Opportunities and Risk March08
12 pages
Aspiring Architect's Portfolio
No ratings yet
Aspiring Architect's Portfolio
2 pages
Results Focus
100% (1)
Results Focus
10 pages
Roberto Cabrales: Math Educator & Researcher
No ratings yet
Roberto Cabrales: Math Educator & Researcher
3 pages
LCD Display Driver Demo
No ratings yet
LCD Display Driver Demo
5 pages
Amcrps Gen Cat en 2011-2
No ratings yet
Amcrps Gen Cat en 2011-2
52 pages
Cataloguingintroduction
No ratings yet
Cataloguingintroduction
10 pages
Cell Selection & Power Control Guide
No ratings yet
Cell Selection & Power Control Guide
28 pages
Analysis of The Text L. P. Hartley "W.S."
100% (1)
Analysis of The Text L. P. Hartley "W.S."
1 page
Digital Imaging
No ratings yet
Digital Imaging
13 pages
St. Martin - g5 - Topic Paper
No ratings yet
St. Martin - g5 - Topic Paper
4 pages
Universidad Iberoamericana Administración de Negocios Internacionales Final Essay José Roberto Rodas Martínez
No ratings yet
Universidad Iberoamericana Administración de Negocios Internacionales Final Essay José Roberto Rodas Martínez
3 pages
Math 10: Combinations & Permutations Test
No ratings yet
Math 10: Combinations & Permutations Test
2 pages
Gcse Geography Coursework Methodology Table
100% (2)
Gcse Geography Coursework Methodology Table
6 pages
Flight Planning SFO MODULE FIVE PLAN
No ratings yet
Flight Planning SFO MODULE FIVE PLAN
3 pages
FBW1102 en
No ratings yet
FBW1102 en
603 pages
Clark 2001
No ratings yet
Clark 2001
20 pages
Scholarship Pending
No ratings yet
Scholarship Pending
3 pages
Descartes On Humans, Animals, and Machines
No ratings yet
Descartes On Humans, Animals, and Machines
3 pages
Balud Child Development Centers List
No ratings yet
Balud Child Development Centers List
2 pages
SAP SRM Consultant Profile
No ratings yet
SAP SRM Consultant Profile
5 pages
BSP 5 Section
100% (1)
BSP 5 Section
14 pages
Summative 3.1
No ratings yet
Summative 3.1
2 pages
Aicte Mandatory Disclosure 2010-11
No ratings yet
Aicte Mandatory Disclosure 2010-11
59 pages

Key Evaluation Chechlist KEC - 4.18.2011

Uploaded by

Key Evaluation Chechlist KEC - 4.18.2011

Uploaded by

KEY

EVALUATION CHECKLIST (KEC)

of 40-page (~23,000-word) mini-textbook or reference work for a wide range of professionals

2 scriven, 18 April, 2011

PART A: PRELIMINARIES

A1. Executive Summary

4 scriven, 18 April, 2011

A3. Design and Methods

6 scriven, 18 April, 2011

PART B: FOUNDATIONS

8 scriven, 18 April, 2011

B1. Background and Context

B2. Descriptions & Definitions

B3. Consumers (Impactees)

B4. Resources (a.k.a. “Strengths Assessment”)

10 scriven, 18 April, 2011

12 scriven, 18 April, 2011

14 scriven, 18 April, 2011

16 scriven, 18 April, 2011

PART C: SUBEVALUATIONS

18 scriven, 18 April, 2011

20 scriven, 18 April, 2011

22 scriven, 18 April, 2011

24 scriven, 18 April, 2011

26 scriven, 18 April, 2011

28 scriven, 18 April, 2011

PART D: CONCLUSIONS & IMPLICATIONS

30 scriven, 18 April, 2011

32 scriven, 18 April, 2011

34 scriven, 18 April, 2011

D3. (possible) Responsibility and Justification

D4. Report & Support

38 scriven, 18 April, 2011

40 scriven, 18 April, 2011

You might also like