KEY EVALUATION CHECKLIST (KEC)

[Edition of 18 April 2011]

Michael Scriven
Claremont Graduate University
& The Evaluation Center, Western Michigan University
 

• For use in professional designing, managing, and evaluating or monitoring of: programs, projects, plans, processes, and policies;
• for assessing their evaluability;
• for requesting proposals (i.e., writing RFPs) to do or evaluate them;
& for evaluating proposed, ongoing, or completed evaluations of them.1
 

INTRODUCTION  
This introduction takes the form of a number of ‘General Notes,’ more of which may be found in the body of the document, along with many keypoint-specific Notes.
General Note 1: APPLICABILITY  The KEC can be used, with care, for evaluating more than the five evaluands2 listed above, just as it can be used, with considerable care, by others besides professional evaluators. For example, it can be used for some help with: (i) the evaluation of products;3 (ii) the evaluation of organizations and organizational units4 such as departments, research centers, consultancies, associations, companies, and for that matter, (iii) hotels, restaurants, and mobile food carts; (iv) services, which can be treated as if they were aspects or constituents of programs, e.g., as processes; (v) many processes, policies, practices, or procedures, which are often implicit programs (e.g., “Our practice at this school is to provide guards for children walking home after dark”), hence evaluable using the KEC; or habitual patterns of behaviour, i.e., performances (as in “In my practice as a consulting engineer, I often assist designers, not just manufacturers”), which is, strictly speaking, a slightly different subdivision of evaluation; and, with some use of the imagination and a heavy emphasis on the ethical values involved, for (vi) some tasks or major parts of tasks in the evaluation of personnel. So it is a kind of 40-page (~23,000-word) mini-textbook or reference work for a wide range of professionals working in evaluation and management—with all the limitations of that size, and certainly more.

1. That is, for what is called meta-evaluation, i.e., the evaluation of one or more evaluations.
2. ‘Evaluand’ is a term used to refer to whatever is being evaluated. Note that what counts as a program is often also called an initiative or intervention, sometimes even an approach or strategy, although the latter are really types of program.
3. For which it was originally designed and used, c. 1971—although it has since been completely rewritten for its present purposes, and then revised or rewritten (and circulated or re-posted) at least 60 times. The latest version can always be found at michaelscriven.info. It is an example of ‘continuous interactive publication,’ a type of project with some new significance in the field of knowledge development, although (understandably) a source of irritation to some librarians and bibliographers. It enables the author, like a garden designer (and unlike a traditional architect or composer), to steadily improve his or her specific individual creations over the years or decades, with the help of user input. It is simply a technologically-enabled extension to the limit of the stepwise process of producing successive editions in traditional publishing, and arguably a substantial improvement, in the cases where it’s appropriate.
4. There is of course a large literature on the evaluation of organizations, from Baldrige to Senge, and some of it will be useful for a serious evaluator, but much of it is confused and confusing, e.g., about the difference and links between evaluation and explanation, needs and markets, criteria and indicators, goals and duties.
General Note 2: TABLE OF CONTENTS
PART A: PRELIMINARIES: A1, Executive Summary; A2, Clarifications; A3, Design and Methods.
PART B: FOUNDATIONS: B1, Background and Context; B2, Descriptions & Definitions; B3, Consumers (Impactees); B4, Resources (‘Strengths Assessment’); B5, Values.
PART C: SUBEVALUATIONS: C1, Process; C2, Outcomes; C3, Costs; C4, Comparisons; C5, Generalizability.
PART D: CONCLUSIONS & IMPLICATIONS: D1, Synthesis; D2 (possible), Recommendations, Explanations, Predictions, & Redesigns; D3 (possible), Responsibility and Justification; D4, Report & Support; D5, Meta-evaluation.5

General Note 3: TERMINOLOGY  Throughout this document, “evaluation” is taken to mean the determination of merit, worth, or significance (abbreviated m/w/s); “an evaluation” to mean the results of such a determination; and “evaluand” to mean whatever is being evaluated… “Dimensions of merit” (a.k.a. “criteria of merit”) refers to the characteristics of the evaluand that definitionally bear on its m/w/s (i.e., could be included in explaining what ‘good X’ means), and “indicators of merit” refers to factors that are empirically but not definitionally linked to the evaluand’s m/w/s… Professional evaluation is simply evaluation requiring specialized tools or skills that are not in the everyday repertoire; it is usually systematic (and inferential), but may also be simply judgmental, if the judgment skill is professionally trained and maintained, or a (recently) tested advanced skill (think of livestock judges, football referees, sawmill controllers)… The KEC is a tool for use in systematic professional evaluation, so knowledge of some terms from evaluation vocabulary is assumed, e.g., formative, goal-free, ranking; their definitions can be found in my Evaluation Thesaurus (4e, Sage, 1991), or in the Evaluation Glossary, online at evaluation.wmich.edu. However, any conscientious program manager (or designer or fixer) does evaluation of their own projects, and will benefit from using this, skipping the occasional technical details… The most common reasons for doing evaluation are (i) to identify needed improvements to the evaluand (formative evaluation); (ii) to support decisions about the program (summative evaluation6); and (iii) to enlarge or refine our body of evaluative knowledge (ascriptive evaluation, as in ‘best practices’ studies and all evaluations by historians). Keep in mind that an evaluation may serve more than one purpose, or shift from one to the other as time passes or the context changes… Merely for simplification, we talk throughout this document about the evaluation of ‘programs’ rather than ‘programs, plans, or policies, or evaluations of them, etc.…’ as detailed in the sub-heading above.

5. It’s not important, but you can remember the part titles from this mnemonic: A for Approach, B for Before, C for Core (or Center), and D for Dependencies. Since these have 3+5+5+3 components, it’s a 16-point checklist.
6. Major decisions about the program include: refunding, defunding, exporting, replicating, developing further, and deciding whether it represents a proper or optimal use of funds (i.e., evaluation for accountability, as in an audit).


General Note 4: TYPE OF CHECKLIST  This is an iterative checklist, not a one-shot checklist, i.e., you should expect to go through it several times when dealing with a single project, even for design purposes, since discoveries or problems that come up under later checkpoints will often require modification of what was entered under earlier ones (and no rearrangement of the order will completely avoid this).7 For more on the nature of checklists, and their use in evaluation, see the author’s paper on that topic, and a number of other papers about, and examples of, checklists for evaluation by various authors, under the listing for the Checklist Project at evaluation.wmich.edu.

7. An important category of these is identified in Note C2.5 below.

General Note 5: EXPLANATIONS  Since it is not entirely helpful to simply list here what (allegedly) needs to be covered in an evaluation when the reasons for the recommended coverage (or exclusions) are not obvious—especially when the issues are highly controversial (e.g., Checkpoint D2)—brief summaries of the reasons for the position taken are also provided in such cases.

General Note 6: CHECKPOINT FOCUS  The determination of merit, or worth, or significance (a.k.a., respectively, quality, value, or importance)—the triumvirate value foci of evaluation—relies in each case to different degrees on slightly different slices of the KEC, as well as on a good deal of it as common ground. These differences are marked by a comment on these distinctive elements with the relevant term of the three underlined in the comment, e.g., worth, unlike merit (or quality, as the terms are commonly used), brings in Cost (Checkpoint C3).

General Note 7: THE COST OF EVALUATION  The KEC is a list of what ought to be covered in an evaluation, but in the real world, the budget for an evaluation is often not enough to cover the whole list thoroughly. People sometimes ask what checkpoints could be skipped when one has a very small evaluation budget. The answer is, “None, but….” These are, generally speaking, necessary conditions for validity. But… (i) sometimes the client, or you, if you are the client, can show that one or two are not relevant to the information need in this case (e.g., cost may not be important in some cases); (ii) the fact that you can’t skip any checkpoint doesn’t mean you have to spend significant money on each of them. What you do have to do is think through each checkpoint’s implications for the case in hand, and consider whether an economical way of coping with it would be probably adequate for an acceptably probable conclusion, i.e., focus on robustness (see Checkpoint D5, Meta-evaluation, below). In an extreme case, you may have to rely on a subject-matter expert for an estimate, based on his/her experience, of the relevant facts about, e.g., resources or critical competitors—maybe covering more than one checkpoint in a half-day of consulting—or on a few hours of literature + phone search by you. But reality sometimes means the evaluation can’t be done; that’s the cost of integrity for evaluators and, sometimes, of excessive parsimony for clients. Don’t forget that honesty on this point prevents some bad scenes later—and may lead to a change of budget.
 

PART A: PRELIMINARIES


 


These preliminary checkpoints are clearly essential parts of an evaluation report, but may seem to have no relevance to the design and execution phases of the evaluation itself. That’s why they are segregated from the rest of the KEC checklist; however, it turns out to be quite useful to begin all one’s thinking about an evaluation by role-playing the situation when you will come to write a report on it. Amongst other benefits, it makes you realize the importance of describing context; of settling on a level of technical terminology and presupposition; of clearly identifying the most notable conclusions; and of starting a log on the project as well as its evaluation as soon as the latter becomes a possibility. Similarly, it’s good practice to make explicit at an early stage the clarification step and the methodology array and its justification.

A1. Executive Summary

The most important element in this section is the summary of the results and not (or not just) the investigatory process. Typically this should be done without even mentioning the process whereby you got them, unless the methodology is especially notable. In other words, avoid the pernicious practice of using the executive summary as a ‘teaser’ that only describes what you looked at or how you looked at it, instead of what you found. Throughout the whole process of designing or doing an evaluation, keep asking yourself what the overall summary is going to say, based on what you have learned so far, and how directly and adequately it relates to the client’s and stakeholders’ and (probable future) audiences’ needs, in terms of their pre-existing information; this helps you to focus on what still needs to be done in order to find out what matters most. The executive summary should usually be a selective summary of Parts B and C, and should not run more than one or at most two pages if you expect it to be read by executives. Only rarely is the occasional practice of two summaries (e.g., one ten-pager and one one-pager) worth the trouble, but discuss this option with the client if in doubt, and the earlier the better. The summary should usually convey some sense of the strength of the conclusions—which includes an estimate of both the weight of the evidence for the premises and the robustness of the inference(s) to the conclusion(s)—and its limitations (see A3 below). Of course, the final version of the executive summary will be written near the end of writing the report, but it’s worth trying the practice of re-editing an informal draft of it every couple of weeks during the evaluation, because this forces one to keep thinking about identification and substantiation of the most important conclusions. Append these versions to the log, for future consideration.

Note A1.1  This Note should be just for beginners, but experience has demonstrated that others can also benefit from its advice: the executive summary is a summary of the evaluation, not of the program. (Checkpoint B2 is reserved for the latter.)

A2. Clarifications

Now is the time to clearly identify and define in your notes, for assertion in the final report (and resolution of ambiguities along the way): (i) the client, if there is one: this is the person, group, or committee who officially requests, and, if it’s a paid evaluation, pays for (or authorizes payment for) the evaluation, and—you hope—the same entity to whom you first report (if not, try to arrange this, to avoid crossed wires in communications). (ii) The prospective (i.e., overt) audiences (for the report). (iii) The stakeholders in the program (those who have or will have a substantial vested interest—not just an intellectual interest—in the outcome of the evaluation, and may have important information or views about the program and its situation/history). (iv) Anyone else who (probably) will see, have the right to see, or should see, (a) the results, and/or (b) the raw data—these are the covert audiences. Get clear in your mind your actual role or roles—internal evaluator, external evaluator, a hybrid (e.g., an outsider on the payroll for a limited time to help the staff with setting up and running evaluation processes), an evaluation trainer (sometimes described as an empowerment evaluator), a repairer/‘fixit guy’, visionary (or re-visionary), etc. Each of these roles has different risks and responsibilities, and is viewed with different expectations by your staff and colleagues, the clients, the staff of the program being evaluated, et al. You may also pick up some other roles along the way—e.g., counsellor, therapist, mediator, decision-maker, inventor, advocate—sometimes for everyone but sometimes for only part of the staff/stakeholders/others involved. It’s good to formulate and sometimes to clarify these roles, at least for your own thinking (especially about possible conflicts of role), in the project log. The project log is absolutely essential; and it’s worth considering making a standard practice of having someone else read and initial entries in it that may at some stage become very important.

And now is the time to get down to the nature and details of the job or jobs, as the client sees them—and to encourage the client to clarify their position on the details that they have not yet thought out. Get all this into a written contract if possible (essential if you’re an external evaluator, highly desirable for an internal one). Can you determine the source and nature of the request, need, or interest, leading to the evaluation: for example, is the request, or the need, for an evaluation of worth—which usually involves really serious attention to cost analysis—rather than of merit; or of significance, which always requires advanced knowledge of the research (or other current work) scene in the evaluand’s field; or of more than one of these? Is the evaluation to be formative, summative, or ascriptive;8 or for more than one of these purposes? Exactly what are you supposed to be evaluating (the evaluand alone, or also the context and/or the infrastructure?): how much of the context is to be taken as fixed; do they just want an evaluation in general terms, or if they want details, what counts as a detail (enough to replicate the program elsewhere, or enough to recognize it anywhere, or just enough for prospective readers to know what you’re referring to); are you supposed to be simply evaluating the effects of the program as a whole (holistic evaluation); or the dimensions of its success and failure (one type of analytic evaluation); or the quality on each of those dimensions, or the quantitative contribution of each of its components to its overall m/w/s (another two types of analytic evaluation); are you required to rank the evaluand against other actual or possible programs (which ones?), or only to grade it;9 and to what extent is a conclusion that involves generalization from this context being requested or required (e.g., where are they thinking of exporting it?). And, of particular importance, is the main thrust to be on ex post facto (historical) evaluation, or ex ante (predictive) evaluation, or (the most common, but don’t assume it) both? Note that predictive program evaluation is very close to being (the almost universal variety of) policy analysis, and vice versa.

Are you also being asked (or expected) either to evaluate the client’s theory of how the evaluand’s components work, or to create/improve such a ‘program theory’—keeping in mind that this is something over and above the literal evaluation of the program, and especially keeping in mind that this is sometimes impossible for even the most expert of field experts in the present state of subject-matter knowledge?10 Is the required conclusion simply to provide and justify grades, ranks, scores, profiles, or (a different level of difficulty altogether) distribution of funding? Are recommendations (for improvement or disposition), or identifications of human fault, or predictions, requested, or expected, or feasible (another level of difficulty, too—see Checkpoint D2)? Is the client really willing and anxious to learn from faults or is this just conventional rhetoric? (Your contract or, for an internal evaluator, your job, may depend on getting the answer to this question right, so you might consider trying this test: ask them to explain how they would handle the discovery of extremely serious flaws in the program—you will often get an idea from their reaction to this question whether they have ‘the right stuff’ to be a good client.) Or you may discover that you are really expected to produce a justification for the program in order to save someone’s neck, and that they have no interest in hearing about faults. Have they thought about post-report help with interpretation and utilization? (If not, offer it without extra charge—see Checkpoint D2 below.)

It’s best to complete the discussion of these issues about what’s expected and/or feasible to evaluate, and clarify your commitment (and your cost estimate, if it’s not already fixed), only after doing a quick pass through the KEC, so ask for a little time to do this, overnight if possible (see Note D2.3 near the end of the KEC). Be sure to note later any subsequently negotiated, or imposed, changes in any of the preceding. And here’s where you give acknowledgments/thanks/etc., so it probably should be the last section you revise in the final report.

8. Formative evaluations, as mentioned earlier, are usually done to find areas needing improvement in the evaluand; summative evaluations are mainly done to support a decision about the disposition of the evaluand (e.g., to refund, defund, or replicate it); and ‘ascriptive’ evaluations are done simply for the record, for history, for benefit to the discipline, or just for interest.
9. Grading refers not only to the usual academic letter grades (A–F, Satisfactory/Unsatisfactory, etc.) but to any allocation to a category of merit, worth, or significance, e.g., grading of meat, grading of ideas and thinkers.
10. Essentially, this is a request for decisive non-evaluative explanatory research on the evaluand and/or context. You may or may not have the skills for this, depending on the exact problem; you certainly didn’t acquire them in the course of your evaluation training. It’s one thing to determine whether (and to what degree) a program reduces delinquency; any good evaluator can do that (given the budget and time required). It’s another thing altogether to be able to explain why that program does or does not work—that often requires an adequate theory of delinquency, which so far doesn’t exist. Although program theory enthusiasts think their obligations always include or require such a theory, the standards for acceptance of any of these theories by the field as a whole are often beyond their reach; and you risk lowering the standards of the evaluation field if you claim your evaluation depends on providing such a theory, since in many of the most important areas, you will not be able to do that.

A3. Design and Methods

Now that you’ve got the questions straight, how are you going to find the answers? You need a plan that lays out the aspects of the evaluand you need to investigate in order to evaluate it—the design of the evaluation—and a set of investigative procedures to implement this plan—i.e., the planned methods, based on some general account of how to investigate each aspect of the design, in other words a methodology. To a large extent, the methodology now used in evaluation originates in social science methodology, and is well covered elsewhere, in both social science and evaluation texts. In this section, we just list some entry points for that slice of evaluation methodology, and provide rather more details about the evaluative slice of evaluation methodology, the neglected part; apart from a few comments here, this is mostly covered, or at least introduced, under the later checkpoints which refer to the necessary aspects of the investigation—the Values, Process, Outcomes, Costs, Comparisons, Generalizability, and Synthesis checkpoints. Leaving this slice out of the methodology of evaluation is roughly the same as leaving out any discussion of inferential statistics from a discussion of statistics.
Two orienting points to start with. (i) Program evaluation is usually about a single program rather than a set of programs. Although program evaluation is not as individualistic—the technical term is idiographic rather than nomothetic—as dentistry, forensic pathology, or motorcycle maintenance, since most programs have large numbers of impactees rather than just one, it is more individualistic than most social sciences, even applied social sciences. So you’ll need to be knowledgeable about case study methodology.11 (ii) Program evaluation is nearly always a complex task, involving the investigation of a number of different aspects of program performance—even a number of different aspects of a single element in it, such as impact or cost—which means it is part of the realm of study that requires extensive use of checklists. The humble checklist has been ignored in most of the literature on research methods, but turns out to be more complex and also more important than was generally realized, so look up the online Checklists Project at http://www.wmich.edu/evalctr/checklists for some papers about the methodology and a long list of specific checklists composed by evaluators (including an earlier version of this one). You can find a few others, and the latest version of this one, at michaelscriven.info.
Now for some entry points for applying social science methodology: that is, some examples of the kind of question that you may need to answer. Do you have adequate domain (a.k.a. subject-matter, and/or local context) expertise for what you have now identified as the real tasks? If not, how will you add it to the evaluation team (via consultant(s), advisory panel, full team membership, sub-contract, or surveys/interviews)? More generally, identify, as soon as possible, all investigative procedures for which you’ll need expertise, time, equipment, and staff—and perhaps training—in this evaluation: observation, participant observation, logging, journaling, audio/photo/video recording, tests, simulating, role-playing, surveys, interviews, experimental design, focus groups, text analysis, library/online searches/search engines, etc.; and data-analytic procedures (stats, cost-analysis, modeling, topical-expert consulting, etc.), plus reporting techniques (text, stories, plays, graphics, freestyle drawings, stills, movies, etc.), and their justification. You probably need to allocate time for a lit review on some of these methods.... In particular, on the difficult causation component of the methodology (establishing that certain claimed or discovered phenomena are the effects of the interventions), can you use separate control or comparison groups to determine causation of supposed effects/outcomes? If not, look at interrupted time series designs, the GEM approach,12 and some ideas in case study design. If there is to be a control or quasi-control (i.e., comparison) group, can you and should you try to randomly allocate subjects to it (and can you get through IRB)? How will you control differential attrition; cross-group contamination; other threats to internal validity? If you can’t control these, what’s the decision-rule for declining/aborting the study? Can you double- or single-blind the study (or triple-blind if you’re very lucky)? If the job requires you to determine the separate contribution to the effects from individual components of the evaluand—how will you do that?.... If a sample is to be used at any point, how selected, and if stratified, how stratified?.... Will/should the evaluation be goal-based or goal-free?13.... To what extent participatory or collaborative; if to a considerable extent, what standards and choices will you use, and justify, for selecting partners/assistants? In considering your decision on that, keep in mind that participatory approaches improve implementation (and sometimes validity), but may cost you credibility (and possibly validity). How will you handle that threat?.... If judges are to be involved at any point, what reliability and bias controls will you need (again, for credibility as well as validity)?... How will you search for side effects and side-impacts, an essential element in almost all evaluations (see Checkpoint C2)? Most important of all, with respect to all (significantly) relevant values, how are you going to go through the value-side steps in the evaluation process, i.e., (i) identify, (ii) particularize, (iii) validate, (iv) measure, (v) set standards (‘cutting scores’) for, (vi) set weights for, and then (vii) incorporate (synthesize, integrate) the value-side with the empirical data-gathering side in order to generate the evaluative conclusion?... Now check the suggestions about values-specific methodology in the Values checkpoint, especially the comment on pattern-searching.... When you can handle all this, you are in a position to set out the ‘logic of the evaluation,’ i.e., a general description and justification of the total design for this project, something that—at least in outline—is a critical part of the report, under the heading of Methodology.

11. This means reading at least some books by Yin, Stake, and Brinkerhoff (check Amazon).
12. See “A Summative Evaluation of RCT methodology; and an alternative approach to causal research” in Journal of Multidisciplinary Evaluation, vol. 5, no. 9, March 2008, at jmde.com.
13. That is, at least partly done by evaluators who are not informed of the goals of the program.
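Purely as an illustration of the value-side steps (i) through (vii) just listed, here is a minimal sketch in Python of how the first six steps might be recorded for a single value dimension at the design stage; the class, entries, and numbers are hypothetical examples, not part of the KEC.

```python
# Illustrative sketch only: recording the value-side steps (i)-(vi) for one
# value dimension in an evaluation design. All names and entries below are
# hypothetical; a simple version of step (vii), synthesis, is sketched under
# the Values checkpoint (B5) below.

from dataclasses import dataclass, field

@dataclass
class ValueDimensionPlan:
    name: str                    # (i) identify the value
    particularization: str       # (ii) what it means for this evaluand in this context
    validation: str              # (iii) why it is a defensible, relevant value here
    measure: str                 # (iv) how performance on it will be measured
    standards: dict = field(default_factory=dict)  # (v) cut scores / grade levels
    weight: int = 0              # (vi) relative importance, e.g., weighting points

example = ValueDimensionPlan(
    name="Quality of medical care",
    particularization="Diagnosis-to-follow-up care delivered by program clinics",
    validation="Definitional criterion of merit for a health program",
    measure="Chart audit by an independent clinician panel (percent rated adequate)",
    standards={"excellent": 90, "good": 75, "fair": 60},  # illustrative percentages
    weight=30,                   # e.g., 30 of 100 weighting points
)
```

Nothing in this structure is required by the KEC; it simply makes explicit what has to be decided, and defended, for each value before the synthesis step.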
Note A3.1: The above process will also generate a list of needed resources for your planning and budgeting efforts—i.e., the money (and other costs) estimate. And it will also provide the basis for the crucial statement of the limitations of the evaluation that may need to be reiterated in the conclusion and perhaps in the executive summary.
 

PART B: FOUNDATIONS

This is the set of investigations that establishes the context and nature of the program, and some of the empirical components of the evaluation that you’ll need in order to start specific work on the key dimensions of m/w/s in Part C. That is, they specify important elements that will end up in the actual evaluation report, by contrast with preparing for it, i.e., Part A, as well as providing foundations for the core elements in C. These are all part of the content of the KEC, and each one is numbered for that purpose.

B1. Background and Context

Identify historical, recent, concurrent, and projected settings for the program; start a list of contextual factors that may be relevant to success/failure of the program; and put matched labels (or metadata tags) on any that look as if they may interact. In particular, identify (i) any ‘upstream stakeholders’—and their stakes—other than the clients (i.e., identify people or groups or organizations that assisted in creation or implementation or support of the program or its evaluation, e.g., with funding or advice or housing or equipment or help); (ii) any enabling legislation/mission statements, etc.—and any other relevant legislation/policies—and log any legislative/executive/practice or attitude changes that occur after start-up; (iii) the underlying rationale, including the official program theory, and political logic (if either exist or can be reliably inferred; although neither are necessary for getting an evaluative conclusion, they are sometimes useful/required); (iv) general results of a literature review on similar interventions, including ‘fugitive studies’ (those not published in standard media), and on the Internet (consider checking the ‘invisible web,’ and the latest group and individual blogs/wikis with the specialized search engines needed to access these); (v) previous evaluations, if any; (vi) their impact, if any.

B2. Descriptions & Definitions

What you are going to evaluate is officially a certain program, but actually it’s the total intervention made in the name of the program. That will usually include much more than just the program, e.g., it may include the personalities of the field staff and support agencies, their modus operandi in dealing with local communities, their modes of dress and transportation, etc. So, record any official descriptions of the program, its components, its context/environment, and the client’s program logic, but don’t assume they are correct, even as descriptions of the actual program delivered, let alone of the total intervention. Be sure to develop a correct and complete description of the first three, which may be very different from the client’s version, in enough detail to recognize the evaluand in any situation you observe, and perhaps—depending on the purpose of the evaluation—to replicate it. You don’t need to develop the correct program logic, only the supposed program logic, unless you have undertaken to do so and have the resources to add this—often major, and sometimes suicidal—requirement to the basic evaluation tasks. Of course, you will sometimes see, or find later, some obvious flaws in the client’s effort at a program logic, and you may be able to point those out, diplomatically, at some appropriate time. Get a detailed description of goals/mileposts for the program (if not operating in goal-free mode). Explain the meaning of any ‘technical terms,’ i.e., those that will not be in the prospective audiences’ vocabulary, e.g., ‘hands-on’ or ‘inquiry-based’ science teaching, ‘care-provider’. Note significant patterns/analogies/metaphors that are used by (or implicit in) participants’ accounts, or that occur to you; these are potential descriptions and may be more enlightening than literal prose; discuss whether or not they can be justified; do the same for any graphic materials associated with the program. Distinguish the instigator’s efforts in trying to start up a program from the program itself; both are interventions, but only the latter is (normally) the evaluand. Remember, you’re only going to provide a summary of the program description, not a complete description, which might take more space than your complete evaluation.

B3. Consumers (Impactees)

Consumers, as the term is used here (impactees is a less ambiguous term), comprise (i) the recipients/users of the services/products (i.e., the downstream direct impactees) PLUS (ii) the downstream indirect impactees (e.g., recipient’s family or co-workers, and others, who are impacted via ripple effect14). Program staff are also impactees, but we usually keep them separate by calling them the midstream impactees, because the obligations to them, and the effects on them, are almost always very different and much weaker in most kinds of program evaluation (and their welfare is not the raison d’être of the program). The funding agency, taxpayers, and political supporters, who are also impactees in some sense and some cases, are also treated differently (and called upstream impactees, or, sometimes, stakeholders, although that term is often used more loosely to include all impactees), except when they are also direct recipients. Note that there are also upstream impactees who are not funders or recipients of the services but react to the announcement or planning of the program before it actually comes online (we can call them anticipators); e.g., real estate agents and employment agencies. In identifying consumers remember that they often won’t know the name of the program or its goals and may not know that they were impacted or even targeted by it. (You may need to use tracer &/or modus operandi methodology.) While looking for the impacted population, you may also consider how others could have been impacted, or protected from impact, by variations in the program: these define alternative possible (a.k.a. virtual) impacted populations, which may suggest some ways to expand, modify, or contract the program when/if you spend time on Checkpoint D1 (Synthesis)15 and Checkpoint D2 (Recommendations); and hence some ways that the program should perhaps have been redefined by now, which bears on issues of praise and blame (Checkpoints B1 and D3). Considering possible variations is of course constrained by the resources available—see next checkpoint.

Note B3.1: Do not use or allow the use of the term ‘beneficiaries’ for impactees, since it carries with it the completely unacceptable assumption that all the effects of the program are beneficial. It is also misleading, on a smaller scale, to use the term ‘recipients’ since many impactees are not receiving anything but merely being affected, e.g., by the actions of someone who learnt something about flu control from an educational program. The term ‘recipient’ should be used only for those who, whether as intended or not, are directly impacted.

B4. Resources (a.k.a. “Strengths Assessment”)

This checkpoint is important for answering the questions (i) whether the program made the best use of resources available (i.e., an extended kind of cost-effectiveness: see Note C3.4 for more), and (ii) what it might realistically do to improve (i.e., within the resources available). It refers to the financial, physical, and intellectual-social-relational assets of the program (not the evaluation!). These include the abilities, knowledge, and goodwill of staff, volunteers, community members, and other supporters. This checkpoint should cover what could now be (or could have been) used, not just what was used: this is what defines the “possibility space,” i.e., the range of what could have been done, often an important element: in the assessment of achievement; in the comparisons; and in identifying directions for improvement that an evaluation considers. This means the checkpoint is crucial for Checkpoint C4 (Comparisons), Checkpoint D1 (Synthesis, for achievement), Checkpoint D2 (Recommendations), and Checkpoint D3 (Responsibility). Particularly for D1 and D2, it’s helpful to list specific resources that were not used but were available in this implementation. For example, to what extent were potential impactees, stakeholders, fund-raisers, volunteers, and possible donors not recruited or not involved as much as they could have been? (As a crosscheck, and as a complement, consider all constraints on the program, including legal, environmental, and fiscal constraints.) Some matters such as adequate insurance coverage (or, more generally, risk management) could be discussed here or under Process (Checkpoint C1 below); the latter is preferable, since the status of insurance coverage is ephemeral, and good process must include a procedure for regular checking on it. This checkpoint is the one that covers individual and social capital available to the program; the evaluator must also identify social capital used by the program (enter this as part of its Costs at Checkpoint C3), and, sometimes, social capital benefits produced by the program (enter as part of the Outcomes, at Checkpoint C2).16 Remember to include the resources contributed by other stakeholders, including other organizations and clients.17

14. In rare cases, this will include members of the research and evaluation communities who read the evaluation report.
15. A related issue, equally important, is: What might have been done that was not done?

B5. Values

The values of primary interest in typical professional program evaluations are for the most part not mere personal preferences of the impactees, unless those overlap with their needs and the community/society’s needs and committed values, e.g., those in the Bill of Rights and the wider body of law. Preferences as such are not irrelevant in evaluation, especially the preferences of impactees, and on some issues, e.g., surgery options, they are often definitive; it’s just that they are generally less important—think of food preferences in children—than dietary needs and medical, legal, or ethical requirements, especially for program evaluation by contrast with product evaluation. While there are intercultural and international differences of great importance in evaluating programs, most of the values listed below are highly regarded in all cultures; the differences are generally in their precise interpretation, the contextual parameters, the exact standards laid down for each of them, and the relative weight assigned to them; and taking those differences into account is fully allowed for in the approach here. Of course, your client won’t let you forget what they value, usually the goals of the program, and you should indeed keep them in mind and report on success in achieving them; but since you must value every unintended effect of the program just as seriously as the intended ones, and in most contexts you must take into account values other than those of the clients, e.g., those of the impactees and usually also those of other stakeholders, you need a repertoire of values to check when doing serious program evaluation, and what follows is a proposal for that. Keep in mind that with respect to each of these (sets of) values, you will have to: (i) define and justify relevance to this program in this context; (ii) justify the relative weight (i.e., comparative importance) you will accord this value; (iii) identify any bars (i.e., minimum acceptable performance standards on each value dimension) you will require an evaluand to meet in order to be considered at all in this context; (iv) specify the empirical levels that will justify the application of each grade level above the bar on that value that you may wish to distinguish (e.g., define what will count as fair/good/excellent). And one more thing, rarely identified as part of the evaluator’s task but crucial: (v) once you have a list of impactees, however partial, you must begin to look for patterns within them, e.g., that pregnant women have greater calorific requirements (i.e., needs) than those who are not pregnant. If you don’t do this, you will miss extremely important ways to optimize the use of intervention resources.18

16. Individual human capital is the sum of the physical and intellectual abilities, skills, powers, experience, health, energy, and attitudes a person has acquired. These blur into their—and their community’s—social capital, which also includes their relationships (‘social networks’) and their share of any latent attributes that their group acquires over and above the sum of their individual human capital (i.e., those that depend on interactions with others). For example, the extent of the trust or altruism that pervades a group, be it family, army platoon, corporation, or other organization, is part of the value the group has acquired, a survival-related value that they (and perhaps others) benefit from having in reserve. (Example of non-additive social capital: the skills of football or other team members that will only provide (direct) benefits for others who are part of a group, e.g., a team, with complementary skills.) These forms of capital are, metaphorically, possessions or assets to be called on when needed, although they are not directly observable in their normal latent state. A commonly discussed major benefit resulting from the human capital of trust and civic literacy is support for democracy; a less obvious one, resulting in tangible assets, is the current set of efforts towards a Universal Digital Library containing ‘all human knowledge’. Human capital can usually be taken to include natural gifts as well as acquired ones, or those whose status is indeterminate as between these categories (e.g., creativity, patience, empathy, adaptability), but there may be contexts in which this should not be assumed. (The short term for all this might seem to be “human resources,” but that term has been taken over to mean “employees,” and that is not what we are talking about here.) The above is a best effort to construct the current meaning: the 25 citations in Google for ‘human capital’ and the 10 for ‘social capital’ (at 6/06/07) include simplified and erroneous as well as other and inconsistent uses—few dictionaries have yet caught up with these terms (although the term ‘human capital’ dates from 1916).
17. Thanks to Jane Davidson for the reminder on this last point.
To get all this done, you should begin by identifying the relevant values for evaluating this evaluand in these circumstances. There are several very important groups of these. (i) Some of these follow simply from understanding the nature of the evaluand (these are sometimes called definitional criteria of merit, or dimensions of merit). For example, if it’s a health program, then the criteria of merit, simply from the meaning of the terms, include the extent (a.k.a. reach or breadth) of its impact (i.e., the size and range of the demographic (age/gender/ethnic/economic) and medical categories of the impactee population), and the impact’s depth (usually a function of magnitude, extent, and duration) of beneficial effects. (ii) Other primary criteria of merit in such a case are extracted from a general or specialist understanding of the nature of a health program; these include safety of staff and patients, quality of medical care (from diagnosis to follow-up), low adverse eco-impact, physical ease of access/entry; and basic staff competencies plus basic functioning, diagnostic and minor therapeutic supplies and equipment. Knowing what these values are is one reason you need either specific evaluand-area expertise or a consultant who has it. (iii) Then look for particular, site-specific, criteria of merit—for example, the need for one or more second-language competencies in service providers; you will probably need to do or find a valid needs assessment for the targeted, and perhaps also for any other probably impacted, population. Here you must include representatives from the impactee population as relevant experts, and you may need only their expertise for the needs assessment, but probably should do a serious needs assessment and have them help design and interpret it. (iv) Next, list the explicit goals/values of the client if not already covered, since they will surely want to know whether and to what extent these were met. (v) Finally, turn to the list below to find other relevant values. Validate them as relevant or irrelevant for the present evaluation, and as contextually supportable.19

18. And if you do this, you will be doing what every scientist tries to do—find patterns in data. This is one of several ways in which good evaluation requires full-fledged traditional scientific skills; and something more as well (handling the values component).
Now, for each of the values you are going to rely on at all heavily, there are two important steps you will usually need to take. First, you need to establish a scale or scales on which you can measure performance that is relevant to merit. On each of these scales, you need to locate levels of performance that will count as being of a certain value (these are called the ‘cut scores’ if the dimension is measurable). For example, you might measure knowledge of first aid on a certain well-validated test and set 90% as the score that marks an A grade, 75% as a C grade, etc.20 Second, you will usually need to stipulate the relative importance of each of these scales in determining the overall m/w/s of the evaluand.
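As a small illustration of the cut-score idea in the first-aid example above, here is a minimal sketch in Python; the helper function and grade boundaries below simply follow that example and are not part of the KEC.

```python
# Minimal sketch: mapping a measured score to a grade via cut scores.
# Boundaries follow the illustrative first-aid example in the text
# (90% marks an A grade, 75% a C grade).

def grade_from_score(score, cut_scores):
    """Return the highest grade whose cut score the measured score clears."""
    for grade, minimum in sorted(cut_scores.items(), key=lambda kv: -kv[1]):
        if score >= minimum:
            return grade
    return "F"  # below every cut score set for this dimension

first_aid_cuts = {"A": 90, "C": 75}          # percent correct on the well-validated test
print(grade_from_score(80, first_aid_cuts))  # -> "C"
```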
A  useful  basic  toolkit  for  this  involves  doing  what  we  call  identifying  the  “stars,  bars,  and  
steps”  for  our  listed  values.  (i)  The  “stars”  (usually  best  limited  to  1–3  stars)  are  the  
weights,  i.e.,  the  relative  or  absolute  importance  of  the  dimensions  of  merit  (or  worth  or  
significance)  that  will  be  used  as  premises  to  carry  you  from  the  facts  about  the  evaluand,  
as  you  locate  or  determine  those,  to  the  evaluative  conclusions  you  need.  Their  absolute  
importance  might  be  expressed  qualitatively  (e.g.,  major/medium/minor,  or  by  letter  
grades  A-­‐F);  or  quantitatively  (e.g.,  points  on  a  five,  ten,  or  other  point  scale,  or—often  a  
better  method  of  giving  relative  importance—by  the  allocation  of  100  ‘weighting  points’  
across  the  set  of  values);  or,  if  merely  relative  values  are  all  you  need,  these  can  even  be  ex-­‐
pressed  in  terms  of  an  ordering  of  their  comparative  importance  (rarely  an  adequate  ap-­‐
proach).  (ii)  The  “bars”  are  absolute  minimum  standards  for  acceptability,  if  any:  that  is,  
they  are  minima  on  the  particular  scales,  scores  or  ratings  that  must  be  ‘cleared’  (exceeded)  
if  the  candidate  is  to  be  acceptable,  no  matter  how  well  s/he  scores  on  other  scales.  Note  
that  an  F  grade  for  performance  on  a  particular  scale  does  not  mean  ‘failure  to  clear  a  bar,’  
e.g.,  an  F  on  the  GRE  quantitative  scale  may  be  acceptable  if  offset  by  other  virtues,  for  se-­‐
lecting  students  into  a  creative  writing  program21.  Bars  and  stars  may  be  set  on  any  rel-­‐

19 The view taken here is the commonsense one that values of the kind used by evaluators looking at programs serv-
ing the usual ‘good causes’ of health, education, social service, disaster relief, etc., are readily and objectively sup-
portable, to a degree acceptable to essentially all stakeholders and supervisory or audit personnel, contrary to the
doctrine of value-free social science which held that values are essentially matters of taste and hence lack objective
verifiability. The ones in the list here are usually fully supportable to the degree needed by the evaluator for the par-
ticular case, by appeal to publicly available evidence, expertise, and careful reasoning. Bringing them into consid-
eration is what distinguishes evaluation from plain empirical research, and only their use makes it possible for
evaluators to answer the questions that mere empirical research cannot answer, e.g., Is this the best vocational high
school in this city? Do we really need a new cancer clinic building? Is the new mediation training program for police
officers who are working the gang beat really worth what it costs to implement? In other words, the most important
practical questions, for most people—and their representatives—who are looking at programs (and the same applies
to product, personnel, policy evaluation, etc.).
20 This is the tricky process of identifying 'cutscores,' a specialized topic in test theory—there is a whole book by that title devoted to discussing how it should be done. A good review of the main issues is in Gene Glass' paper.
21 If an F is acceptable on that scale, why is that dimension still listed at all—why is it relevant? Answer: it may be one of several on which high scores are weighted as a credit, on one of which the candidate must score high, but not
on any particular one. In other words the applicant has to have some special talent, but a wide range of talents are
acceptable. This might be described as a case where there is a ‘group’ bar, i.e., a ‘floating’ bar on a group of dimen-


evant  properties  (a.k.a.  dimensions  of  merit),  or  directly  on  dimensions  of  measured  
(valued)  performance,  and  may  additionally  include  holistic  bars  or  stars22.  (iii)  In  serious  
evaluation,  it  is  often  appropriate  to  locate  “steps”  i.e.,  points  or  zones  on  measured  dimen-­‐
sions  of  merit  where  the  weight  changes,  if  the  mere  stars  don’t  provide  enough  precision.  
An  example  of  this  is  the  setting  of  several  cutting  scores  on  the  GRE  for  different  grades  in  
the  use  of  that  scale  for  the  two  types  of  evaluation  given  above  (evaluating  the  program  
and evaluating applicants to it). The grades, bars, and stars (weights) are often loosely in-
cluded  under  what  is  called  ‘standards.’  (Bars  and  steps  may  be  fuzzy  as  well  as  precise.)    
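As a purely illustrative sketch of how the 'stars' and 'bars' might be applied once they have been set (all dimensions, weights, and ratings below are hypothetical, and a weighted sum is only one of several defensible ways of combining the results):

```python
# Illustrative sketch only. 'Stars' are expressed here as 100 weighting points
# shared across hypothetical dimensions of merit; 'bars' are minimum ratings
# (on a 0-10 scale) that must be cleared regardless of performance elsewhere.
DIMENSIONS = {
    # name: (weighting points out of 100, bar = minimum rating or None)
    "meets assessed needs": (40, 4),
    "implementation fidelity": (25, 3),
    "resource economy": (20, None),
    "staff morale": (15, None),
}

def appraise(ratings: dict) -> tuple:
    """Return (cleared_all_bars, weighted_rating_out_of_10)."""
    cleared = all(bar is None or ratings[name] >= bar
                  for name, (_, bar) in DIMENSIONS.items())
    weighted = sum(ratings[name] * points
                   for name, (points, _) in DIMENSIONS.items()) / 100
    return cleared, weighted

print(appraise({"meets assessed needs": 7, "implementation fidelity": 6,
                "resource economy": 9, "staff morale": 5}))  # (True, 6.85)
```

A holistic bar, or a 'group' bar of the kind described in the footnotes, would require additional checks beyond this simple per-dimension test.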
Three  values  are  of  such  general  importance  that  they  receive  full  checkpoint  status  and  
are  listed  in  the  next  section:  cost  (minimization  of),  superiority  (to  comparable  alterna-­‐
tives), and generalizability/exportability. Their presence in the KEC brings the number of types of values considered, including the list below, up to a total of 22.
At least check all these values for relevance and look for others; and for those that are rel-
evant,  set  up  an  outline  of  a  set  of  defensible  standards  that  you  will  use.  Since  these  are  
context-­‐dependent  (e.g.,  the  standards  for  a  C  in  evaluating  free  clinics  in  Zurich  today  are  
not  the  same  as  for  a  C  in  evaluating  a  free  clinic  in  Darfur  at  the  moment),  and  the  client’s  
evaluation-­‐needs—i.e.,  the  questions  they  need  to  be  able  to  answer—differ  massively,  
there  isn’t  a  universal  dictionary  for  them.  You’ll  need  to  have  a  topical  expert  on  your  team  
or  do  a  good  literature  search  to  develop  a  draft,  and  eventually  run  serious  sessions  with  
impactee  and  other  stakeholder  representatives  to  ensure  defensibility  for  the  revised  
draft.  The  final  version  of  each  of  the  standards,  and  the  set  of  them,  is  often  called  a  rubric,  
meaning  a  table  translating  evaluative  terms  into  observable  or  testable  terms  and/or  vice  
versa.23  
(i) Definitional  values—those  that  follow  from  the  definitions  of  terms  in  standard  
usage  (e.g.,  breadth  and  depth  of  impact  are,  definitionally,  dimensions  of  merit  
for  a  public  health  program),  or  that  follow  from  the  contextual  implications  of  
having  an  ideal  or  excellent  evaluand  of  this  type  (e.g.,  an  ideal  shuttle  bus  ser-­‐
vice  for  night  shift  workers  would  feature  increased  frequency  of  service  around  
shift  change  times).  The  latter  draw  from  general  knowledge  and  to  some  extent  
from  program-­‐area  expertise.  

sions, which must be cleared by the evaluand’s performance on at least one of them. It can be exhibited in the list of
dimensions of merit by bracketing the group of dimensions in the abscissa, and stating the height of the floating bar
in an attached note. Example: “Candidates for admission to the psychology grad program must have passed one
upper division statistics course.”
22 Example: The candidates for admission to a graduate program—whose quality is one criterion of merit for the program—may meet all dimension-specific minimum standards in each respect for which these were specified (i.e.,
they ‘clear these bars’), but may be so close to missing the bars (minima) in so many respects, and so weak in re-
spects for which no minimum was specified, that the selection committee feels they are not good enough for the
program. We can describe this as a case where they failed to clear a holistic (a.k.a. overall) bar that was implicit in
this example, but can often be made explicit through dialog. (The usual way to express a quantitative holistic bar is
via an average grade; but that is not always the best way to specify it and often not strictly defensible.)
23 The term 'rubric' as used here is a technical term originating in educational testing parlance; this meaning is not
in most dictionaries, or is sometimes distinguished as ‘an assessment rubric.’ A complication we need to note here is
that some of the observable/measurable terms may themselves be evaluative, at least in some contexts.


(ii) Needs  of  the  impacted  population,  via  a  needs  assessment  (distinguish  perform-­‐
ance  needs  (e.g.,  need  for  health)  from  treatment  needs  (need  for  specific  medi-­‐
cation  or  delivery  system),  needs  that  are  currently  met  from  unmet  needs,24  and  
meetable  needs  from  ideal  but  impractical  or  impossible-­‐with-­‐present-­‐resources  
needs  (consider  the  Resources  checkpoint)).  The  needs  are  matters  of  fact,  not  
values  in  themselves,  but  in  any  context  that  accepts  the  most  rudimentary  ethi-­‐
cal  considerations  (i.e.,  the  non-­‐zero  value  of  the  welfare  of  other  human  beings),  
those  facts  are  value-­‐imbued.  NOTE:  needs  may  have  macro  as  well  as  micro  lev-­‐
els  that  must  be  considered;  e.g.,  there  are  local  community  needs,  regional  
needs  within  a  country,  national  needs,  global  region  needs,  and  global  needs  
(these  often  overlap,  e.g.,  in  the  case  of  building  codes  (illustrated  by  their  ab-­‐
sence in the Port-au-Prince earthquake of 2010)). Doing a needs assessment is
sometimes  the  most  important  part  of  an  evaluation,  and  in  much  of  the  litera-­‐
ture  is  based  on  invalid  definitions  of  need,  e.g.,  the  idea  that  needs  are  the  gaps  
between  the  actual  level  of  some  factor  (e.g.,  income,  calories)  and  the  ideal  level.  
Needs  for  X  are  the  levels  of  X  without  which  the  subject(s)  will  be  unable  to  
function  satisfactorily  (not  the  same  as  optimally,  maximally,  or  ideally);  and  of  
course,  what  functions  are  under  study  and  what  level  will  count  as  satisfactory  
will vary with the study and the context. Final note: check the Resources check-
point,  a.k.a.  Strengths  Assessment,  for  other  entities  valued  in  that  context  and  
hence  of  value  in  this  evaluation.  
(iii) Logical  requirements  (e.g.,  consistency,  sound  inferences  in  design  of  program  
or  measurement  instruments  e.g.,  tests).    
(iv) Legal  requirements  (but  see  (v)  below).  
(v) Ethical  requirements  (overlaps  with  legal  and  overrides  them  when  in  conflict),  
usually  including  (reasonable)  safety,  and  confidentiality  (and  sometimes  ano-­‐
nymity)  of  all  records,  for  all  impactees.  (Problems  like  conflict  of  interest  and  
protection  of  human  rights  have  federal  legal  status  in  the  US,  and  are  also  re-­‐
garded as good scientific procedural standards, and as having some very general
ethical  status.)  In  most  circumstances,  health,  shelter,  education,  and  other  wel-­‐
fare  considerations  for  impactees  and  potential  impactees  are  other  obvious  
values  to  which  ethical  weight  must  be  given.    
(vi) Cultural  values  (not  the  same  as  needs  or  wants,  although  overlapping  with  
them)  held  with  a  high  degree  of  respect  (and  thus  distinguished  from  matters  of  
manners,  style,  taste,  etc.),  of  which  an  extremely  important  one  is  honor;  an-­‐
other  group,  not  always  distinct  from  that  one,  concerns  respect  for  ancestors,  
elders,  tribal  or  totem  spirits,  and  local  deities.  These,  like  legal  requirements,  
are  subject  to  override,  in  principle  at  least,  by  ethical  values,  although  often  
taken  to  have  the  same  and  sometimes  higher  status.  

24 A very common mistake—reflected in definitions of needs that are widely used—is to think that met needs are not
‘really’ needs, and should not be included in a needs assessment. That immediately leads to the ‘theft’ of resources
that are meeting currently met needs, in order to serve the remaining unmet needs, a blunder that can cost lives.
Identify all needs, then identify the ones that are still unmet.


(vii) Personal,  group,  and  organizational  goals/desires  (unless  you’re  doing  a  goal-­‐
free  evaluation)  if  not  in  conflict  with  ethical/legal/practical  considerations.  
These  are  usually  less  important  than  the  needs  of  the  impactees,  since  they  lack  
specific  ethical  or  legal  backing,  but  are  enough  by  themselves  to  drive  the  infer-­‐
ence  to  many  evaluative  conclusions  about  e.g.,  what  recreational  facilities  to  
provide  in  community-­‐owned  parks,  subject  to  consistency  with  ethical  and  legal  
constraints. They include some things that are arguably needed as well as de-
sired,  by  some,  e.g.,  convenience,  recreation,  respect,  earned  recognition,  excite-­‐
ment,  and  compatibility  with  aesthetic  preferences  of  recipients.  
(viii) Environmental  needs,  if  these  are  regarded  as  not  simply  reducible  to  ‘human  
needs  with  respect  to  the  environment,’  e.g.,  habitat  needs  of  other  species  
(fauna  or  flora),  and  perhaps  Gaian  ‘needs  of  the  planet.’  
(ix)   Fidelity to alleged specifications (a.k.a. "authenticity," "adherence," "implemen-
tation,”  “dosage,”  or  “compliance”)—this  is  often  usefully  expressed  via  an  “index  
of  implementation”;  and—a  different  but  related  matter—consistency  with  the  
supposed  program  model  (if  you  can  establish  this  BRD—beyond  reasonable  
doubt);  crucially  important  in  Checkpoint  C1.  
(x)     Sub-legal but still important legislative preferences (GAO used to determine
these  from  an  analysis  of  the  hearings  in  front  of  the  sub-­‐committee  in  Congress  
from  which  the  legislation  emanated.)  
(xi)     Professional standards (i.e., standards set by the profession) of quality that ap-
ply  to  the  evaluand25.    
(xii)     Expert refinements of any standards lacking a formal statement, e.g., ones in (x).
(xiii)     Historical/Traditional standards.
(xiv)     Scientific merit (or worth or significance).
(xv)     Technological m/w/s.
(xvi)     Marketability, in commercial program evaluation.
(xvii)     Political merit, if you can establish it BRD.
(xviii)     Risk (sometimes meaning the same as chance), meaning the probability of failure
(or  success)  or,  sometimes,  of  the  loss  (or  gain)  that  would  result  from  failure  (or  
success), or sometimes the product of these two. Risk in this context does not
mean  the  probability  of  error  about  the  facts  or  values  we  are  using  as  param-­‐
eters—i.e.,  the  level  of  confidence  we  have  in  our  data  or  conclusions.  Risk  here  
is  the  value  or  disvalue  of  the  chancy  element  in  the  enterprise  in  itself,  as  an  in-­‐
dependent  positive  or  negative  element—positive  for  those  who  are  positively  
attracted  by  gambling  as  such  (this  is  usually  taken  to  be  a  real  attraction,  unlike  

25 Since one of the steps in the evaluation checklist is the meta-evaluation, in which the evaluation itself is the evaluand, you will also need, when you come to that checkpoint, to apply professional standards for evaluations to the list. Currently the best ones would be those developed by the Joint Committee (Program Evaluation Standards 2e, Sage), but there are several others of note, e.g., the GAO Yellow Book, and perhaps the KEC. And see the final
checkpoint in the KEC.


risk-­‐tolerance)  and  negative  for  those  who  are,  by  contrast,  risk-­‐averse.  This  
consideration  is  particularly  important  in  evaluating  plans  (preformative  evalu-­‐
ation)  and  in  formative  evaluation,  but  is  also  relevant  in  summative  and  ascrip-­‐
tive  evaluation  when  either  is  done  prospectively  (i.e.,  before  all  data  is  avail-­‐
able).  There  is  an  option  of  including  this  under  personal  preferences,  item  (vii)  
above,  but  it  is  often  better  to  consider  it  separately  since  it  benefits  by  being  ex-­‐
plicit,  can  be  very  important,  and  is  a  matter  on  which  evidence/expertise  (in  the  
logic  of  probability)  should  be  brought  to  bear,  not  simply  a  matter  of  personal  
taste.26    
(xix)     Last but not least—Resource economy (i.e., how low-impact the program is with
respect  to  short-­‐term  and  long-­‐term  limits  on  resources  of  money,  space,  time,  
labor,  contacts,  expertise  and  the  eco-­‐system).  Note  that  ‘low-­‐impact’  is  not  what  
we  normally  mean  by  ‘low-­‐cost’  (covered  separately  in  Checkpoint  C3)  in  the  
normal  currencies  (money  and  non-­‐money),  but  refers  to  absolute  (usually  
meaning irreversible) loss of available resources (in some framework, which might
range  from  a  single  person  to  a  country).  This  could  be  included  under  an  ex-­‐
tended  notion  of  (opportunity)  cost  or  need,  but  has  become  so  important  in  its  
own  right  that  it  is  probably  better  to  put  it  under  its  own  heading  as  a  value.  It  
partly  overlaps  with  Checkpoint  C5,  because  a  low  score  on  resource  economy  
undermines  sustainability,  so  watch  for  double-­‐counting.  Also  check  for  double  
counting  against  value  (viii),  if  that  is  being  weighted  by  client  or  audiences  and  
is  not  overridden  by  ethical  or  other  higher-­‐weighted  concerns.    
Fortunately,  bringing  these  values  and  their  standards  to  bear27  is  less  onerous  than  it  may  
appear,  since  many  of  these  values  will  be  unimportant  or  only  marginally  important  in  
many  specific  cases,  although  each  one  will  be  crucially  important  in  other  particular  cases.  
And  doing  all  this  values-­‐analysis  will  be  easy  to  do  sometimes,  although  very  hard  on  
other  occasions;  it  can  often  require  expert  advice  and/or  impactee/stakeholder  advice.  
And,  of  course,  some  of  these  values  will  conflict  with  others  (e.g.,  impact  size  with  resource  
economy),  so  their  relative  weights  may  then  have  to  be  determined  for  the  particular  case,  
a  non-­‐trivial  task  by  itself.  Hence  you  need  to  be  very  careful  not  to  assume  that  you  have  to  

26 Note that risk is often defined in the technical literature as the product of the likelihood of failure and the magni-
tude of the disaster if the program, or part of it, does fail (the possible loss itself is often called ‘the hazard’); but in
common parlance, the term ‘risk’ is often used to mean either the probability of disaster (“very risky”) or the disas-
ter itself (“the risk of death”). Now the classical definition of a gambler is someone who will prefer to pay a dollar to
get a 1 in 1,000 chance of making $1,000 over paying a dollar to get a 1 in 2 chance of making $2, even though the
expectancy is the same in each case; the risk-averse person will reverse those preferences and in extreme cases will
prefer to simply keep the dollar; and the rational risk-tolerant person will, supposedly, treat all three options as of
equal merit. So, if this is correct, then one might argue that the more precise way to put the value differences here is
to say that the gambler is not attracted by the element of chance in itself but by the possibility of making the larger
sum despite the low probability of that outcome, i.e., that s/he is less risk-averse, not more of a risk-lover. (I think
this way of putting the matter actually leads to a better analysis, viz., any of these positions can be rational depend-
ing on contextual specification of the cost of Type 1 vs. Type 2 errors.) However described, this can be a major
value difference between people and organizations e.g., venture capitalist groups vs. city planning groups.
27 ‘Bringing them to bear’ involves: (a) identifying the relevant ones, (b) specifying them (i.e., determining the di-
mensions for each and a method of measuring performance/achievements on all of these scales), (c) validating the
relevant standards for the case, and (d) applying the standards to the case.


generate  a  ranking  of  evaluands  from  an  evaluation  you  are  asked  to  do,  since  if  that’s  not  
required,  you  can  often  avoid  settling  the  issue  of  relative  weights  of  criteria,  or  at  least  
avoid  any  precision  in  settling  it,  by  simply  doing  a  grading  of  each  evaluand,  on  a  profiling  
display  (i.e.,  showing  the  merit  on  all  relevant  dimensions  of  merit  in  a  bar-­‐graph  for  each  
evaluand). That will exhibit the various strengths and weaknesses of each evaluand,
ideal  for  helping  them  to  improve,  and  for  helping  clients  to  refine  their  weights  for  the  cri-­‐
teria  of  merit,  which  will  often  make  it  obvious  which  is  the  best  choice.  
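A text-only sketch of such a profiling display (hypothetical evaluands, dimensions, and 0–10 ratings, offered only to show the idea):

```python
# Illustrative sketch only: a crude bar-graph profile of each evaluand on each
# dimension of merit, which exhibits strengths and weaknesses without forcing
# a premature commitment to precise weights.
PROFILES = {
    "Program A": {"needs met": 7, "fidelity": 6, "resource economy": 9, "morale": 5},
    "Program B": {"needs met": 9, "fidelity": 4, "resource economy": 5, "morale": 8},
}

for evaluand, ratings in PROFILES.items():
    print(evaluand)
    for dimension, rating in ratings.items():
        print(f"  {dimension:<17}{'#' * rating:<11}({rating}/10)")
```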
Note  B5.1:  You  must  cover  in  this  checkpoint  all  values  that  you  will  use,  including  those  
used  in  evaluating  the  side-­‐effects  (if  any),  not  just  the  intended  effects  (if  any  materialize).  
Some  of  these  values  may  well  occur  to  you  only  after  you  find  the  side-­‐effects  (Checkpoint  
C2),  but  that’s  not  a  problem—this  is  an  iterative  checklist,  and  in  practice  that  means  you  
will  often  have  to  come  back  to  modify  findings  on  earlier  checkpoints.  
 
 

PART  C:  SUBEVALUATIONS  


 
Each  of  the  following  five  core  dimensions  of  an  evaluation  requires  both:  (i)  a  ‘fact-­‐
finding'28 phase, followed by (ii) the process of combining those facts with whatever values from B5 are relevant to this dimension of merit and bear on those facts, which yields (iii)
the  subevaluation.  In  other  words,  Part  C  requires29  the  completion  of  five  separate  kinds  of  
inference  from  (i)  plus  (ii)  to  (iii),  i.e.,  from  What’s  So?  to  So  What?—e.g.,  (in  the  case  of  C2,  
Outcomes),  from  ‘the  outcomes  were  measured  as  XX,’  and  ‘outcomes  of  this  size  are  valu-­‐
able  to  the  degree  YY’  to  ‘the  effects  were  extremely  beneficial,’  or  ‘insignificant  in  this  con-­‐
text,’  etc.  Making  that  step  requires,  in  each  case,  a  premise  of  type  (ii)  that  forms  a  bridge  
between  facts  and  values;  these  are  usually  some  kind  of  ‘rubric,’  discussed  further  in  the  
D1  (Synthesis)  checkpoint.  The  first  two  of  the  following  checkpoints  will,  in  one  case  or  
another,  use  rubrics  referring  to  nearly  all  the  values  from  Checkpoint  B5  and  bear  most  of  
the  load  in  determining  merit;  the  next  three  checkpoints  are  defined  in  terms  of  specific  
values  of  great  general  importance,  named  in  their  heading,  and  particularly  relevant  to  
worth (Checkpoints C3 and C4) and significance (Checkpoints C4 and C5).
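As a toy illustration of this fact-plus-rubric inference (the effect-size thresholds and evaluative wording below are hypothetical, not part of the KEC):

```python
# Illustrative sketch only: a rubric acting as the 'bridge premise' from a
# factual finding (here, a measured effect size from Checkpoint C2) to an
# evaluative claim about the outcomes.
OUTCOME_RUBRIC = [
    (0.8, "extremely beneficial in this context"),
    (0.5, "clearly beneficial"),
    (0.2, "of modest benefit"),
    (0.0, "insignificant in this context"),
]

def appraise_outcome(effect_size: float) -> str:
    for threshold, verdict in OUTCOME_RUBRIC:
        if effect_size >= threshold:
            return verdict
    return "harmful (a negative effect)"

print(appraise_outcome(0.55))  # -> clearly beneficial
```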

C1.   Process    
We  start  with  this  core  checkpoint  because  it  forces  us  to  confront  immediately  the  merit  of  
the  means  this  intervention  employs,  so  that  we  are  able,  as  soon  as  possible,  to  answer  the  
question whether the (intended or unintentionally produced) ends—many of which we'll
cover  in  the  next  checkpoint—justify  the  means,  in  this  specific  case  or  set  of  cases.  The  
Process  checkpoint  involves  the  assessment  of  the  m/w/s  of  everything  that  happens  or  

28 Here, and commonly, this sense of the term means non-evaluative fact-finding. There are plenty of evaluative
facts that we often seek, e.g., whether the records show that an attorney we are considering has a history of malprac-
tice; whether braided nylon fishline is as good as wire for fish over 30kg.
29 Although this is generally true, there are evaluations in which one or more of the sub-evaluations are irrelevant,
e.g., when cost is of no concern.


applies  before  true  outcomes  emerge,  especially:  (i)  the  vision,  design,  planning  and  oper-­‐
ation  of  the  program,  from  the  justification  of  its  goals  (if  you’re  not  operating  in  goal-­‐free  
mode)—and  note  that  these  may  have  changed  or  be  changing  since  the  program  began—
through  design  provisions  for  reshaping  the  program  under  environmental  or  political  or  
fiscal  duress  (including  planning  for  worst-­‐case  outcomes);  to  the  development  and  justifi-­‐
cation  of  the  program’s  supposed  ‘logic’  a.k.a.  design  (but  see  Checkpoint  D2),  along  with  
(ii)  the  program’s  ‘implementation  fidelity’  (i.e.,  the  degree  of  implementation  of  the  sup-­‐
posed  archetype  or  exemplar  program,  if  any).  This  index  is  also  called  “authenticity,”  “ad-­‐
herence,”  “alignment,”  “fidelity,”  “internal  sustainability,”  or  “compliance”.30  You  must  also,  
under  Process,  check  the  accuracy  of  the  official  name  or  subtitle  (whether  descriptive  or  
evaluative),  or  the  official  description  of  the  program  e.g.,  “an  inquiry-­‐based  science  educa-­‐
tion  program  for  middle  school”—one,  two,  three,  or  even  four  of  the  components  of  this  
compound  descriptive  claim  (it  may  also  be  contextually  evaluative)  can  be  false.  (Other  
examples:  “raises  beginners  to  proficiency  level”,  “advanced  critical  thinking  training  pro-­‐
gram”).  Also  check  (iii)  the  quality  of  its  management  (especially  (a)  the  arrangements  for  
getting  and  appropriately  reporting  evaluative  feedback  (that  package  is  often  much  of  
what  is  called  accountability  or  transparency),  along  with  support  for  learning  from  that  
feedback,  and  from  any  mistakes/solutions  discovered  in  other  ways,  along  with  meeting  
more  obviously  appropriate  standards  of  accountability  and  transparency;  (b)  the  quality  
of  the  risk-­‐management,31  including  the  presence  of  a  full  suite  of  ‘Plan  B’  options;  (c)  the  
extent  to  which  the  program  planning  covers  issues  of  sustainability  and  not  just  short-­‐
term returns (this point can also be covered in C5)).
procedures,  especially  the  program’s  general  learning/training  process  (e.g.,  regular  ‘up-­‐
dating  training’  to  cope  with  changes  in  the  operational  and  bio-­‐environment,  staff  aging,  
essential  skill  pool,  new  technology32);  attitudes/values;  and  morale.  Of  course,  manage-­‐
ment  quality  is  something  that  continues  well  beyond  the  beginning  of  the  program,  so  in  
looking  at  it,  you  need  to  be  clear  when  it  had  what  form  or  you  won’t  be  able  to  ascribe  re-­‐
sults—good  or  bad—to  management  features,  if  you  are  hoping  to  be  able  to  do  that.  Orga-­‐
nization  records  often  lack  this  kind  of  detail,  so  try  to  improve  that  practice,  at  least  for  the  
duration  of  your  evaluation.  
As  mentioned,  under  this  heading  you  may  or  may  not  examine  the  quality  of  the  original  
‘logic  of  the  program’  (the  rationale  for  its  design)  and  its  current  logic  (both  the  current  
official  version  and  the  possibly  different  one  implicit  in  the  operations  or  in  staff  behavior  

30 Several recent drug studies have shown huge outcome differences between subjects filling 80% or more of their
prescriptions and those filling less than 80%, in both the placebo and treatment groups, even when it’s unknown how
many of those getting the drug from the pharmacy are actually taking it, and even though there is no overall differ-
ence in average outcomes between the two groups. In other words, mere aspects of the process of treatment can be
more important than the nature of the treatment or the fact of treatment status. So be sure you know what the process
actually comprises, and whether any comparison group is closely similar on each aspect.
31 Risk-management has emerged fairly recently as a job classification in large organizations, growing out of the
specialized task of analyzing the adequacy and justifiability of the organization’s insurance coverage, but now in-
cluding matters such as the adequacy and coordination of protocols and training for emergency response to natural
and human-caused disasters, the identification of responsibility for each risk, and the sharing of risk and insurance
with other parties.
32 See also my paper on "Evaluation of Training" at michaelscriven.info for a checklist that massively extends
Kirkpatrick’s groundbreaking effort at this task.


rather  than  rhetoric).  It  is  not  generally  appropriate  to  try  to  determine  and  affirm  whether  
the  model  is  correct  in  detail  and  in  scientific  fact  unless  you  have  specifically  undertaken  
that  kind  of  (usually  ambitious  and  sometimes  unrealistically  ambitious)  analytic  evalu-­‐
ation  of  the  program  design/plan/theory.  You  need  to  judge  with  great  care  whether  com-­‐
ments  on  the  plausibility  of  the  program  theory  are  likely  to  be  helpful,  and,  if  so,  whether  
you are sufficiently expert to make them. Just keep in mind that it's never been hard to ev-
aluate  aspirin  for  e.g.,  its  analgesic  effects,  although  it  is  only  very  recently  that  we  had  any  
idea  how/why  it  works.  It  would  have  been  a  logical  error—and  unhelpful  to  society—to  
make  the  earlier  evaluations  depend  on  solving  the  causal  mystery.  It  helps  to  keep  in  mind  
that  there’s  no  mystery  until  you’ve  done  the  evaluation,  since  you  can’t  explain  outcomes  
if  there  aren’t  any  (or  explain  why  there  aren’t  any  until  you’ve  shown  that  that’s  the  situa-­‐
tion).  So  if  you  can  be  helpful  by  evaluating  the  program  theory,  and  you  have  the  resources  
to  spare,  do  it;  but  doing  this  is  not  an  essential  part  of  doing  a  good  evaluation,  will  often  
be  a  diversion,  and  is  sometimes  a  cause  for  disruptive  antagonism.  
Process  evaluation  may  also  include  (iv)  the  evaluation  of  what  are  often  called  “outputs,”  
(usually  taken  to  be  ‘intermediate  outcomes’  that  are  developed  en  route  to  ‘true  out-­‐
comes,’  the  longer-­‐term  results  that  are  sometimes  called  ‘impact’).  Typical  outputs  are  
knowledge,  skill,  or  attitude  changes  in  staff  (or  clients),  when  these  changes  are  not  major  
outcomes  in  their  own  right.  Remember  that  in  any  program  that  involves  learning,  
whether  incidental  or  intended,  the  process  of  learning  is  gradual  and  at  any  point  in  time,  
long  before  you  can  talk  about  outcomes/impact,  there  will  have  been  substantial  learning  
that  produces  a  gain  in  individual  or  social  capital,  which  must  be  regarded  as  a  tangible  
gain  for  the  program  and  for  the  intervention.  It’s  not  terribly  important  whether  you  call  it  
process  or  output  or  short-­‐term  outcome,  as  long  as  you  find  it,  estimate  it,  and  record  it—
once.  (Recording  it  under  more  than  one  heading—other  than  for  merely  annotative  rea-­‐
sons—leads  to  double  counting  when  you  are  aiming  for  an  overall  judgement.)    
Note  C1.1:  Five  other  reasons  why  process  is  an  essential  element  in  program  evaluation,  
despite  the  common  tendency  in  much  evaluation  to  place  almost  the  entire  emphasis  on  
outcomes:  (v)  gender  or  racial  prejudice  in  selection/promotion/treatment  of  staff  is  an  
unethical  practice  that  must  be  checked  for,  and  comes  under  process;  (vi)  in  evaluating  
health  programs  that  involve  medication  or  exercise,  ‘adherence’  or  ‘implementation  fi-­‐
delity’  means  following  the  prescribed  regimen  including  drug  dosage,  and  it  is  often  vitally  
important  to  determine  the  degree  to  which  this  is  occurring—which  is  also  a  process  con-­‐
sideration.  We  now  know,  because  researchers  finally  got  down  to  triangulation  (e.g.,  via  
randomly  timed  counts,  by  a  nurse-­‐observer,  of  the  number  of  pills  remaining  in  the  pa-­‐
tient’s  medicine  containers),  that  adherence  can  be  very  low  in  many  needy  populations,  
e.g.,  Alzheimer’s  patients,  a  fact  that  completely  altered  evaluative  conclusions  about  
treatment  efficacy;  (vii)  the  process  may  be  where  the  value  lies—writing  poetry  in  the  
creative  writing  class  may  be  a  good  thing  to  do  in  itself,  not  because  of  some  later  out-­‐
comes  (same  for  having  fun,  in  kindergarten  at  least;  painting;  and  marching  to  protest  
war,  even  if  it  doesn’t  succeed);  (viii)  the  treatment  of  human  subjects  must  meet  federal,  
state,  and  other  ethical  standards,  and  an  evaluator  can  rarely  avoid  the  responsibility  for  
checking  that  they  are  met;  (ix)  as  the  recent  scandal  in  anaesthesiology  underscores,  many  
widely  accepted  evaluation  procedures,  e.g.,  peer  review,  rest  on  assumptions  that  are  
sometimes  completely  wrong  (e.g.,  that  the  researcher  actually  did  get  the  data  reported  


from  real  patients),  and  the  evaluator  should  try  to  do  better  than  rely  on  such  assump-­‐
tions.  

C2.   Outcomes    
This  checkpoint  is  the  poster-­‐boy  of  many  evaluations,  and  the  one  that  many  people  mis-­‐
takenly  think  of  as  covering  ‘the  results’  of  an  intervention.  In  fact,  the  results  are  every-­‐
thing    covered  in  Part  C.  This  checkpoint  does  cover  the  ‘ends’  at  which  the  ‘means’  dis-­‐
cussed  in  C1  (Process)  are  aimed,  but  (a)  only  to  the  extent  they  were  achieved;  and  (b)  
much  more  than  that.  It  requires  the  identification  of  all  (good  and  bad)  effects  of  the  pro-­‐
gram (a.k.a. intervention) on: (i) program recipients (both targeted and untargeted—examples of the latter are thieves of aid or drug supplies); on (ii) other impactees, e.g., fami-
lies  and  friends—and  enemies—of  recipients;  and  on  (iii)  the  environment  (biological,  
physical,  and  more  remote  social  environments).  These  effects  must  include  direct  and  in-­‐
direct  effects,  intended  and  unintended  ones,  immediate33  and  short-­‐term  and  long-­‐term  
ones  (the  latter  being  one  kind  of  sustainability).  (These  are  all,  roughly  speaking,  the  focus  
of  Campbell’s  ‘internal  validity.’)  Finding  outcomes  cannot  be  done  by  hypothesis-­‐testing  
methodology,  because:  (i)  often  the  most  important  effects  are  unanticipated  ones  (the  four  
main  ways  to  find  such  side-­‐effects  are:  (a)  goal-­‐free  evaluation,  (b)  skilled  observation,  (c)  
interviewing  that  is  explicitly  focused  on  finding  side-­‐effects,  and  (d)  using  previous  ex-­‐
perience (as provided in the mythical "Book of Causes"34)). And (ii) because determining the
m/w/s  of  the  effects—that’s  the  bottom  line  result  you  have  to  get  out  of  this  sub-­‐
evaluation—is  often  the  hard  part,  not  just  determining  whether  there  are  any,  or  even  
what  they  are  intrinsically,  and  who  they  affect  (some  of  which  you  can  get  by  hypothesis  
testing)…  Immediate  outcomes  (e.g.,  the  publication  of  instructional  leaflets  for  AIDS  car-­‐
egivers)  are  often  called  ‘outputs,’  especially  if  their  role  is  that  of  an  intermediate  cause  or  
intended  cause  of  main  outcomes,  and  they  are  normally  covered  under  Checkpoint  C1.  But  
note  that  some  true  outcomes  (including  results  that  are  of  major  significance,  whether  or  
not  intended)  can  occur  during  the  process  but  may  be  best  considered  here,  especially  if  
they  are  highly  durable.  (Long-­‐term  results  are  sometimes  called  ‘effects’  (or  ‘true  effects’  
or  ‘results’)  and  the  totality  of  these  is  often  referred  to  as  the  ‘impact’;  but  you  should  ad-­‐
just  to  the  highly  variable  local  usage  of  these  terms  by  clients/audiences/stakeholders.)  
Note  that  you  must  pick  up  effects  on  individual  and  social  capital  here  (see  the  earlier  
footnote);  much  of  this  ensemble  is  normally  not  counted  as  outcomes,  because  they  are  

33 The ‘immediate’ effects of a program are not only the first effects that occur after the program starts up, but
should also include major effects that occur before the program starts. These (preformative) effects impact ‘anticipa-
tors’ who react to the announcement of—or have secret intelligence about—the future start of the program. For ex-
ample, the award of the 2016 Olympic Games to Rio de Janeiro, made several years in advance of any implementa-
tion of the planned constructions etc. for the games, had a huge immediate effect on real estate prices, and later on
favela policing for drug and violence control.
34 The Book of Causes shows, when opened at the name of a condition, factor, or event: (i) on the left (verso) side of the opening, all the things which are known to be able to cause it, in some context or other (which is specified at that
point); and (ii) on the right (recto) side, all the things which it can cause: that’s the side you need in order to guide
the search for side-effects. Since the BofC is only a virtual book, you have to create the relevant pages, using all
your resources such as accessible experts and a literature/internet search. Good forensic pathologists and good field
epidemiologists, amongst other scientists, have very comprehensive ‘local editions’ of the BofC in their heads and
as part of the social capital of their guild. Modern computer technology makes the BofC feasible, perhaps imminent.


gains  in  latent  ability  (capacity,  potentiality),  not  necessarily  in  observable  achievements  or  
goods.  Particularly  in  educational  evaluations  aimed  at  improving  test  scores,  a  common  
mistake  is  to  forget  to  include  the  possibly  life-­‐long  gain  in  ability  as  an  effect.  
Sometimes,  not  always,  it’s  useful  and  feasible  to  provide  explanations  of  success/failure  in  
terms  of  components/context/decisions.  For  example,  when  evaluating  a  statewide  consor-­‐
tium  of  training  programs  for  firemen  dealing  with  toxic  fumes,  it’s  probably  fairly  easy  to  
identify  the  more  and  less  successful  programs,  maybe  even  to  identify  the  key  to  success  
as  particular  features—e.g.,  realistic  simulations—that  are  to  be  found  in  and  only  in  the  
successful  programs.  To  do  this  usually  does  not  require  the  identification  of  the  whole  op-­‐
erating  logic/theory  of  program  operation.  (Remember  that  the  operating  logic  is  not  ne-­‐
cessarily  the  same  as:  (i)  the  original  official  program  logic,  (ii)  the  current  official  logic,  
(iii) the implicit logics or theories of field staff.) Also see Checkpoint D2 below.
Given  that  the  most  important  outcomes  may  have  been  unintended  (a  broader  class  than  
unexpected),  it’s  worth  distinguishing  between  side-­‐effects  (which  affect  the  target  popula-­‐
tion  and  possibly  others)  and  side-­‐impacts  (meaning  impacts  of  any  kind  on  non-­‐targeted  
populations).    
The  biggest  methodological  problem  with  this  checkpoint  is  establishing  the  causal  connec-­‐
tion,  especially  when  there  are  many  possible  or  actual  causes,  and—a  separate  point—if  
attribution  of  portions  of  the  effect  to  each  of  them  must  be  attempted.35    
Note  C2.1:  As  Robert  Brinkerhoff  argues,  success  cases  may  be  worth  their  own  analysis  as  
a  separate  group,  regardless  of  the  average  improvement  (if  any)  due  to  the  program  (since  
the  benefits  in  those  cases  alone  may  justify  the  cost  of  the  program)36;  the  failure  cases  
should  also  be  examined,  for  differences  and  toxic  factors.    
Note  C2.2:  Keep  the  “triple  bottom-­‐line”  approach  in  mind.  This  means  that,  as  well  as  (i)  
conventional  outcomes  (e.g.,  learning  gains  by  impactees),  you  should  always  be  looking  for  
(ii) community (including social capital) changes, and (iii) environmental impact… And al-
ways  comment  on  (iv)  the  risk  aspect  of  outcomes,  which  is  likely  to  be  valued  very  differ-­‐
ently  by  different  stakeholders…  Especially,  do  not  overlook  (v)  the  effects  on  the  program  
staff,  good  and  bad;  e.g.,  lessons  and  skills  learned,  and  the  usual  effects  of  stress;  and  (vi)  
the  pre-­‐program  effects  mentioned  earlier:  that  is,  the  (often  major)  effects  of  the  an-­‐
nouncement  or  discovery  that  a  program  will  be  implemented,  or  even  may  be  imple-­‐
mented.  These  effects  include  booms  in  real  estate  and  migration  of  various  groups  
to/from  the  community,  and  are  sometimes  more  serious  in  at  least  the  economic  dimen-­‐
sion  than  the  directly  caused  results  of  the  program’s  implementation  on  this  impact  group,  
the  ‘anticipators.’  Looking  at  these  effects  carefully  is  sometimes  included  under  preforma-­‐
tive  evaluation  (which  also  covers  looking  at  other  dimensions  of  the  planned  program,  
such  as  evaluability).    
Note  C2.3:  It  is  usually  true  that  evaluations  have  to  be  completed  long  before  some  of  the  
main  outcomes  have,  or  indeed  could  have,  occurred—let  alone  have  been  inspected  care-­‐

35 On this, consult recent literature by, or cited by, Cook or Scriven, e.g., in the 6th and the 8th issues of the Journal of MultiDisciplinary Evaluation (2008), at jmde.com, and American Journal of Evaluation (3, 2010).
36 Robert Brinkerhoff in The Success Case Method (Berrett-Koehler, 2003).


fully.  This  leads  to  a  common  practice  of  depending  heavily  on  predictions  of  outcomes  
based  on  indications  or  small  samples  of  what  they  will  be.  This  is  a  risky  activity,  and  
needs  to  be  carefully  highlighted,  along  with  the  assumptions  on  which  the  prediction  is  
based,  and  the  checks  that  have  been  made  on  them,  as  far  as  is  possible.  Many  very  expen-­‐
sive  evaluations  of  giant  international  aid  programs  have  been  based  almost  entirely  on  
outcomes  estimated  by  the  same  agency  that  did  the  evaluation  and  the  installation  of  the  
program—estimates  that,  not  too  surprisingly,  turned  out  to  be  absurdly  optimistic.  Pessi-­‐
mism can equally well be ill-based; for example, predicting the survival chances of Stage IV
cancer  patients  is  often  done  using  the  existing  data  on  five-­‐year  survival—but  that  ignores  
the  impact  of  research  on  treatment  in  (at  least)  the  last  five  years,  which  has  often  been  
considerable. On the other hand, waiting for the next magnitude-8 earthquake to test disaster
plans  is  stupid;  simulations,  if  designed  by  a  competent  external  agency,  can  do  a  very  good  
job  in  estimating  long-­‐term  effects  of  a  new  plan.  
Note  C2.4:  Identifying  the  impactees  is  not  only  a  matter  of  identifying  each  individual—or  
at  least  small  group—that  is  impacted  (targeted  or  not),  hard  though  that  is;  it  is  also  a  
matter  of  finding  patterns  in  them,  e.g.,  a  tendency  for  the  intervention  to  be  more  success-­‐
ful  with  women  than  men.  Finding  patterns  in  the  data  is  of  course  a  traditional  scientific  
task,  so  here  is  one  case  amongst  several  where  the  task  of  the  evaluator  includes  one  of  
the  core  tasks  of  the  traditional  scientist.  

Note    C2.5:  Furthermore,  if  you  have  discovered  any  unanticipated  side-­‐effects  at  all,  con-­‐
sider  that  they  are  likely  to  require  evaluation  against  some  values  that  were  not  con-­‐
sidered  under  the  Values  checkpoint,  since  you  were  not  expecting  them;  you  need  to  go  
back  and  expand  your  list  of  relevant  values,  and  develop  scales  and  rubrics  for  these,  too.  
Note C2.6: Almost without exception, the social science literature on effects identifies them as
what happened after an intervention that would not have happened without the presence of the
intervention—this is the so-called counterfactual property. This is a complete fallacy, and shows
culpable ignorance of about a century’s literature on causation in the logic of science (see refer-
ences given above on causation, e.g., in footnote 8). Many effects would have happened anyway,
due to the presence of other factors with causal potential; this is the phenomenon of ‘overdeter-
mination’ which is common in the social sciences. For example, the good that Catholic Charities
does in a disaster might well have occurred even if they had not been operating, since there are other
sources of help with identical target populations; this does not show they were not in fact the
causal agency nor does it show that they are redundant.

C3.   Costs    
This  checkpoint  brings  in  what  might  be  called  ‘the  other  quantitative  element  in  evalu-­‐
ation’  besides  statistics,  i.e.,  (most  of)  cost  analysis.  But  don’t  forget  there  is  also  such  a  
thing  as  qualitative  cost  analysis,  and  it’s  also  very  important—and,  done  properly,  it’s  not  
a  feeble  surrogate  for  quantitative  cost  analysis  but  an  essentially  independent  effort;  note  
that  both  quantitative  and  qualitative  cost-­‐analysis  are  included  in  the  economist’s  defini-­‐
tion  of  cost-­‐effectiveness.  Both  are  usually  very  important  in  determining  worth  (or,  in  one  
sense,  value)  by  contrast  with  plain  merit  (a.k.a.  quality).  Both  were  almost  totally  ignored  
for  many  years  after  program  evaluation  became  a  matter  of  professional  practice;  and  a  
recent  survey  of  journal  articles  by  Nadini  Persaud  shows  they  are  still  seriously  underused  


in  evaluation.  An  impediment  to  progress  that  she  points  out  is  that  today,  CA  (cost  analy-­‐
sis)  is  done  in  a  different  way  by  economists  and  accountants,37  and  you  will  need  to  make  
clear  which  approach  you  are  using,  or  that  you  are  using  both—and,  if  you  do  use  both,  
indicate  when  and  where  you  use  each.  There  are  also  a  number  of  different  types  of  quan-­‐
titative  CA—cost-­‐benefit  analysis,  cost-­‐effectiveness  analysis,  cost-­‐utility  analysis,  cost-­‐
feasibility  analysis,  etc.,  and  each  has  a  particular  purpose;  be  sure  you  know  which  one  
you  need  and  explain  why  in  the  report  (the  definitions  in  Wikipedia  are  good).  The  first  
two  require  calculation  of  benefits  as  well  as  costs,  which  usually  means  you  have  to  find,  
and  monetize  if  important  (and  possible),  the  benefits  and  damages  from  Checkpoint  C2  as  
well  as  the  more  conventional  (input)  costs.    
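For concreteness, here is a minimal sketch of the difference between a cost-benefit calculation and a cost-effectiveness ratio, with entirely hypothetical figures (and ignoring the non-money and non-monetizable costs that, as discussed below, must also be covered):

```python
# Illustrative sketch only, with hypothetical figures for a single program.
# Cost-benefit analysis monetizes the benefits; cost-effectiveness analysis
# leaves the effect in natural units and reports the cost per unit of effect.
total_cost = 250_000.0           # monetized input costs only
monetized_benefits = 400_000.0   # defensible only if the benefits really can be monetized
cases_averted = 125              # effect in natural (non-monetary) units

net_benefit = monetized_benefits - total_cost    # cost-benefit analysis
benefit_cost_ratio = monetized_benefits / total_cost
cost_per_case = total_cost / cases_averted       # cost-effectiveness analysis

print(f"Net benefit: ${net_benefit:,.0f}")              # $150,000
print(f"Benefit-cost ratio: {benefit_cost_ratio:.2f}")  # 1.60
print(f"Cost per case averted: ${cost_per_case:,.0f}")  # $2,000
```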
At  a  superficial  level,  cost  analysis  requires  attention  to  and  distinguishing  between:  (i)  
money  vs.  non-­‐money  vs.  non-­‐monetizable  costs;  (ii)  direct  and  indirect  costs;  (iii)  both  ac-­‐
tual  and  opportunity  costs;38  and  (iv)  sunk  (already  spent)  vs.  prospective  costs.  It  is  also  
often  helpful,  for  the  evaluator  and/or  audiences,  to  itemize  these  by:  developmental  stage,  
i.e.,  in  terms  of  the  costs  of:  (a)  start-­‐up  (purchase,  recruiting,  training,  site  preparation,  
etc.);  (b)  maintenance  (including  ongoing  training  and  evaluating);  (c)  upgrades;  (d)  shut-­‐
down;  (e)  residual  (e.g.,  environmental  damage);  and/or  by  calendar  time  period;  and/or  
by cost elements (rent, equipment, personnel, etc.); and/or by payee. Include expended but never-utilized value, if any, e.g., social capital (such as a decline in workforce morale).
The  most  common  significant  non-­‐money  costs  that  are  often  monetizable  are  space,  time,  
expertise,  and  common  labor,  to  the  extent  that  these  are  not  available  for  purchase  in  the  
open  market—when  they  are  so  available,  they  can  be  monetized.  The  less  measurable  but  
often  more  significant  ones  include:  lives,  health,  pain,  stress  (and  other  positive  or  neg-­‐
ative  affects),  social/political/personal  capital  or  debts  (e.g.,  reputation,  goodwill,  interper-­‐
sonal  skills),  morale,  energy  reserves,  content  and  currency  of  technical  knowledge/skills,  
and  immediate/long-­‐term  environmental  costs.  Of  course,  in  all  this,  you  should  be  analyz-­‐
ing  the  costs  and  benefits  of  unintended  as  well  as  intended  outcomes;  and,  although  unin-­‐
tended  heavily  overlaps  unanticipated,  both  must  be  covered.  The  non-­‐money  costs  are  al-­‐
most  never  trivial  in  large  program  evaluations,  technology  assessment,  or  senior  staff  ev-­‐
aluation,  and  very  often  decisive.  The  fact  that  in  rare  contexts  (e.g.,  insurance  suits)  some  
money  equivalent  of  e.g.,  a  life,  is  treated  seriously  is  not  a  sign  that  life  is  a  monetizable  

37 Accountants do 'financial analysis,' which is oriented towards an individual's monetary situation; economists do 'economic analysis,' which takes a societal point of view.
38 Economists often define the costs of P as the value of the most valuable forsaken alternative (MVFA), i.e., as the same as opportunity costs. This risks circularity, since it's arguable that you can't determine the value of the MVFA
without knowing what it required you to give up, i.e., identifying its MVFA. In general, it may be better to define
ordinary costs as the tangible valued resources that were used to cause the evaluand to come into existence (money,
time, expertise, effort, etc.), and opportunity costs as another dimension of cost, namely the MVFA you spurned by
choosing to create the evaluand rather than the best alternative path to your goals, using about the same resources.
The deeper problem is this: the ‘opportunity cost of the evaluand’ is ambiguous; it may mean the value of something
else to do the same job, or it may mean the value of the resources if you didn’t attempt this job at all. (See my “Cost
in Evaluation: Concept and Practice”, in The Costs of Evaluation, edited by Alkin and Solomon, (Sage, 1983) and
“The Economist’s Fallacy” in jmde.com, 2007).


value  in  general  i.e.,  across  more  than  that  very  limited  context,39  let  alone  a  sign  that  if  we  
only  persevere,  cost  analysis  can  be  treated  as  really  a  quantitative  task  or  even  as  a  task  for  
which  a  quantitative  version  will  give  us  a  useful  approximation  to  the  real  truth.  Both  
views  are  categorically  wrong,  as  is  apparent  if  you  think  about  the  difference  between  the  
value  of  a  particular  person’s  life  to  their  family,  vs.  to  their  employer/employees/cowork-­‐
ers,  vs.  to  their  profession,  and  to  their  friends;  and  the  difference  between  those  values  as  
between  different  people  whose  lost  lives  we  are  evaluating.  And  don’t  think  that  the  way  
out  is  to  allocate  different  money  values  to  each  specific  case,  i.e.,  to  each  person’s  life-­‐loss  
for  each  impacted  group:  not  only  will  this  destroy  generalizability  but  the  cost  to  some  of  
these  impactees  is  clearly  still  not  covered  by  money,  e.g.,  when  a  great  theoretician  or  
musician  dies.          
As  an  evaluator  you  aren’t  doing  a  literal  audit,  since  you’re  (usually)  not  an  accountant,  
but  you  can  surely  benefit  if  an  audit  is  available,  or  being  done  in  parallel.  Otherwise,  con-­‐
sider  hiring  a  good  accountant  as  a  consultant  to  the  evaluation;  or  an  economist,  if  you’re  
going  that  way.  But  even  without  the  accounting  expertise,  your  cost  analysis  and  certainly  
your  evaluation,  if  you  follow  the  lists  here,  will  include  key  factors—for  decision-­‐making  
or  simple  appraisal—usually  omitted  from  standard  auditing  practice.  And  keep  in  mind  
that  there  are  evaluations  where  it  is  appropriate  to  analyze  benefits  (a  subset  of  out-­‐
comes)  in  just  the  same  way,  i.e.,  by  type,  time  of  appearance,  etc.  This  is  especially  useful  
when  you  are  doing  an  evaluation  with  an  emphasis  on  cost-­‐benefit  tradeoffs.    
Note  C3.1:  This  sub-­‐evaluation  (especially  item  (iii)  in  the  first  list)  is  the  key  element  in  
the  determination  of  worth.  
Note  C3.2:  If  you  have  not  already  evaluated  the  program’s  risk-­‐management  efforts  under  
Process, consider doing so—or having it done—as part of this checkpoint.
Note  C3.3:  Sensitivity  analysis  is  the  cost-­‐analysis  analog  of  robustness  analysis  in  statistics  
and  testing  methodology,  and  equally  important.  It  is  essential  to  do  it  for  any  quantitative  
results.  
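A minimal sketch of such a sensitivity analysis, reusing the hypothetical figures from the earlier cost sketch: vary the most uncertain inputs over plausible ranges and check whether the evaluative conclusion survives:

```python
# Illustrative sketch only: if the 'benefits exceed costs' verdict holds across
# the whole plausible range of the uncertain inputs, it is robust; if it flips,
# the report must say so and identify which assumption drives the flip.
def verdict(total_cost: float, monetized_benefits: float) -> str:
    return "benefits exceed costs" if monetized_benefits > total_cost else "costs exceed benefits"

for cost in (200_000, 250_000, 300_000):          # plausible range for total cost
    for benefits in (320_000, 400_000, 480_000):  # plausible range for benefits
        print(f"cost={cost:,}, benefits={benefits:,}: {verdict(cost, benefits)}")
```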
Note  C3.4:  The  discussion  of  CA  in  this  checkpoint  so  far  uses  the  concept  of  cost-­‐effect-­‐
iveness  in  the  usual  economic  sense,  but  there  is  another  sense  of  this  concept  that  is  of  
considerable  importance  in  evaluation—in  some  but  not  all  contexts,  and  this  sense  does  
not  seem  to  be  discussed  in  the  economic  or  accounting  literature.  (It  is  the  ‘extended  
sense’  mentioned  in  the  Resources  checkpoint  discussion  above.)  In  this  sense,  efficiency  or  
cost-­‐effectiveness  means  the  ratio  of  benefits  to  resources  available  not  resources  used.  In  
this  sense—remember,  it’s  only  appropriate  in  certain  contexts—one  would  say  that  a  pro-­‐
gram,  e.g.,  an  aid  program  funded  to  provide  clean  water  to  refugees  in  the  Haitian  tent  cit-­‐
ies  in  2010,  was  (at  least  in  this  respect)  inefficient/cost-­‐ineffective  if  it  did  not  do  as  much  
as  was  possible  with  the  resources  provided.  There  may  be  exigent  circumstances  that  de-­‐
flect  any  imputation  of  irresponsibility  here,  but  the  fact  remains  that  the  program  needs  to  
be  categorized  as  unsatisfactory  with  respect  to  getting  the  job  done,  even  though  it  was  
provided  with  adequate  resources  to  do  it.  Moral:  when  you’re  doing  CA  in  an  evaluation,  
don’t  just  analyze  what  was  spent  but  also  what  was  available.  
39 The World Bank since 1966 has recommended reporting mortality data in terms of lives saved or lost, not dollars
(Persaud reference).


 
C4.   Comparisons    
Comparative  or  relative  m/w/s,  which  requires  comparisons,  is  often  extremely  illuminat-­‐
ing, and sometimes absolutely essential—as when a government has to decide whether to re-fund a health program, go with a different one, or abandon the sector to private enter-
prise.  Here  you  must  look  for  programs  or  other  entities  that  are  alternative  ways  for  get-­‐
ting  the  same  or  similar  benefits  from  about  the  same  resources,  especially  those  that  use  
fewer  resources.  Anything  that  comes  close  to  this  is  known  as  a  “critical  competitor”.  Iden-­‐
tifying  the  most  important  critical  competitors  is  a  test  of  high  intelligence,  since  they  are  
often  very  unlike  the  standard  competitors,  e.g.,  a  key  critical  competitor  for  telephone  and  
email  communication  in  extreme  disaster  planning  is  carrier  pigeons,  even  today.  It  is  also  
often  worth  looking  for,  and  reporting  on,  at  least  one  other  alternative—if  you  can  find  
one—that  is  much  cheaper  but  not  much  less  effective  (‘el  cheapo’);  and  one  much  stronger  
although  costlier  alternative,  i.e.,  one  that  produces  far  more  payoffs  or  process  advantages  
(‘el  magnifico’),  although  still  within  the  outer  limits  of  the  available  Resources  identified  in  
Checkpoint  B4;  the  extra  cost  may  still  be  the  best  bet.  (But  be  sure  that  you  check  care-­‐
fully,  e.g.,  don’t  assume  the  more  expensive  option  is  higher  quality  because  it’s  higher  
priced.)  It’s  also  sometimes  worth  comparing  the  evaluand  with  a  widely  adopted/admired  
approach  that  is  perceived  by  important  stakeholders  as  an  alternative,  though  not  really  in  
the  race,  e.g.,  a  local  icon.  Keep  in  mind  that  looking  for  programs  ‘having  the  same  effects’  
means  looking  at  the  side-­‐effects  as  well  as  intended  effects,  to  the  extent  they  are  known,  
though  of  course  the  best  available  critical  competitor  might  not  match  on  side-­‐effects…  
Treading  on  potentially  thin  ice,  there  are  also  sometimes  strong  reasons  to  compare  the  
evaluand  with  a  demonstrably  possible  alternative,  a  ‘virtual  critical  competitor’—one  that  
could  be  assembled  from  existing  or  easily  constructed  components  (the  next  checkpoint  is  
another  place  where  ideas  for  this  can  emerge).  The  ice  is  thin  because  you’re  now  moving  
into  the  partial  role  of  a  program  designer  rather  than  an  evaluator,  which  creates  a  risk  of  
conflict  of  interest  (you  may  be  ego-­‐involved  as  author  of  a  possible  competitor  and  hence  
not  objective  about  evaluating  it  or,  therefore,  the  original  evaluand).  Also,  if  your  ongoing  
role  is  that  of  formative  evaluator,  you  need  to  be  sure  that  your  client  can  digest  sugges-­‐
tions  of  virtual  competitors  (see  also  Checkpoint  D2).  The  key  comparisons  should  be  con-­‐
stantly  updated  as  you  find  out  more  from  the  evaluation  of  the  primary  evaluand,  espe-­‐
cially  new  side-­‐effects,  and  should  always  be  in  the  background  of  your  thinking  about  the  
evaluand.  
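A rough sketch of this comparative screening is given below; the alternatives, costs, benefit scores, and thresholds are all hypothetical, and the single 'benefit' number stands in for the much richer Values and Outcomes work a real evaluation would rest on. The point is only the structure: drop anything that is not cost-feasible within the Resources ceiling, then look for an 'el cheapo' and an 'el magnifico' among what remains.

```python
# Hypothetical screening of critical competitors (illustration only).

budget_ceiling = 120000  # outer limit of available resources (Checkpoint B4)

alternatives = [
    {"name": "evaluand",     "cost": 100000, "benefit": 70},
    {"name": "competitor A", "cost": 60000,  "benefit": 65},   # possible 'el cheapo'
    {"name": "competitor B", "cost": 115000, "benefit": 95},   # possible 'el magnifico'
    {"name": "competitor C", "cost": 150000, "benefit": 99},   # not cost-feasible
]

# Keep only the alternatives that are cost-feasible at all.
feasible = [a for a in alternatives if a["cost"] <= budget_ceiling]
evaluand = next(a for a in feasible if a["name"] == "evaluand")

# Much cheaper but not much less effective (thresholds are illustrative).
el_cheapo = [a for a in feasible
             if a["cost"] < 0.75 * evaluand["cost"]
             and a["benefit"] >= 0.9 * evaluand["benefit"]]
# Much stronger although costlier, yet still within the resource ceiling.
el_magnifico = [a for a in feasible
                if a["benefit"] >= 1.25 * evaluand["benefit"]]

print("Cost-feasible alternatives:        ", [a["name"] for a in feasible])
print("Much cheaper, nearly as effective: ", [a["name"] for a in el_cheapo])
print("Much stronger, still affordable:   ", [a["name"] for a in el_magnifico])
```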
Note  C4.1:  It  sometimes  looks  as  if  looking  for  critical  competitors  is  a  completely  wrong  
approach,  e.g.,  when  we  are  doing  formative  evaluation  of  a  program  i.e.,  with  the  interest  
of  improvement:  but  in  fact,  it’s  important  even  then  to  be  sure  that  the  changes  made  or  
recommended  really  do  add  up,  taken  all  together,  to  an  improvement,  so  you  need  to  com-­‐
pare  version  2  with  version  1,  and  also  with  available  alternatives  since  the  set  of  critical  
competitors  may  change  as  you  modify  the  evaluand.    
Note  C4.2:  It’s  tempting  to  collapse  the  Cost  and  Comparison  checkpoints  into  ‘Compara-­‐
tive  Cost-­‐Effectiveness’  (as  Davidson  does,  for  example)  but  it’s  better  to  keep  them  sepa-­‐
rate  because  for  certain  important  purposes,  e.g.,  fund-­‐raising,  you  will  need  the  separate  
results.  Other  examples:  you  often  need  to  look  at  simple  cost-­‐feasibility,  which  does  not  

involve  comparisons  (but  give  the  critical  competitors  a  quick  look  in  case  one  of  them  is  
cost-­‐feasible);  or  at  relative  merit  when  ‘cost  is  no  object’  (which  means  ‘all  available  alter-­‐
natives  are  cost-­‐feasible,  and  the  merit  gains  from  choosing  correctly  are  much  more  im-­‐
portant  than  cost  savings’).  
Note  C4.3:  One  often  hears  the  question:  “But  won’t  the  Comparisons  Checkpoint  double  or  
triple  our  costs  for  the  evaluation—after  all,  the  comparisons  needed  have  to  be  quite  de-­‐
tailed  in  order  to  match  one  based  on  the  KEC?”  Some  responses:  (i)  “But  the  savings  on  
purchase  costs  may  be  much  more  than  that;”  (ii)  “There  may  already  be  a  decent  evalu-­‐
ation  of  some  or  several  or  all  critical  competitors  in  the  literature;”  (iii)  “Other  funding  
sources  may  be  interested  in  the  broader  evaluation,  and  able  to  help  with  the  extra  costs;”  
(iv)  “Good  design  of  the  evaluations  of  alternatives  will  often  eliminate  potential  competi-­‐
tors  at  trifling  cost,  by  starting  with  the  checkpoints  on  which  they  are  most  obviously  vul-­‐
nerable;”  (v)  “  Estimates,  if  that’s  all  you  can  afford,  are  much  cheaper  than  evaluations,  and  
better  than  not  doing  a  comparison  at  all.”  
 

C5.   Generalizability    

Other  names  for  this  checkpoint  (or  something  close  to,  or  part  of  it)  are:  exportability,  
transferability,  transportability—which  would  put  it  close  to  Campbell’s  “external  valid-­‐
ity”—  but  it  also  covers  sustainability,  longevity,  durability,  and  resilience,  since  these  tell  
you  about  generalizing  the  program’s  merit  to  other  times  rather  than  (or  as  well  as)  other  
places  or  circumstances  besides  the  one  you’re  at  (in  either  direction,  so  the  historian  is  in-­‐
volved.)  Note  that  this  checkpoint  concerns  the  sustainability  of  the  program,  not  the  sus-­‐
tainability  of  its  effects,  which  is  also  important  and  covered  under  impact.    
Although  other  checkpoints  bear  on  it  (because  they  are  needed  to  establish  that  the  pro-­‐
gram  has  non-­‐trivial  benefits),  this  checkpoint  is  frequently  the  most  important  one  of  the  
core  five  when  attempting  to  determine  significance.  (The  other  highly  relevant  checkpoint  
for  that  is  C4,  where  we  look  at  how  much  better  it  is  compared  to  whatever  else  is  avail-­‐
able;  and  the  final  word  on  that  comes  in  Checkpoint  D1,  especially  Note  D1.1.)  Under  
Checkpoint  C5,  you  must  find  the  answers  to  questions  like  these:  Can  the  program  be  used,  
with  similar  results,  if  we  use  it:  (i)  with  other  content;  (ii)  at  other  sites;  (iii)  with  other  
staff;  (iv)  on  a  larger  (or  smaller)  scale;  (v)  with  other  recipients;  (vi)  in  other  climates  
(social,  political,  physical);  etc.  An  affirmative  answer  on  any  of  these  ‘dimensions  of  gener-­‐
alization’  is  a  merit,  since  it  adds  another  universe  to  the  domains  in  which  the  evaluand  
can  yield  benefits  (or  adverse  effects).  Looking  at  generalizability  thus  makes  it  possible  
(sometimes)  to  benefit  greatly  from,  instead  of  dismissing,  programs  and  policies  whose  
use  at  the  time  of  the  study  was  for  a  very  small  group  of  impactees—such  programs  may  
be  extremely  important  because  of  their  generalizability.    
Generalization  to  (vii)  later  times,  a.k.a.  longevity,  is  nearly  always  important  (under  com-­‐
mon  adverse  conditions,  it’s  durability).  Even  more  important  is  (viii)  sustainability  (this  is  
external  sustainability,  not  the  same  as  the  internal  variety  mentioned  under  Process).  It  is  
sometimes  inadequately  treated  as  meaning,  or  as  equivalent  to,  ‘resilience  to  risk.’  Sus-­‐
tainability  usually  requires  making  sure  the  evaluand  can  survive  at  least  the  termination  

of  the  original  funding  (which  is  usually  not  a  risk  but  a  known  certainty),  and  also  some  
range  of  hazards  under  the  headings  of  warfare  or  disasters  of  the  natural  as  well  as  finan-­‐
cial,  social,  ecological,  and  political  varieties.  Sustainability  isn’t  the  same  as  resilience  to  
risk, especially because it must cover future certainties, such as seasonal changes in tem-
perature,  humidity,  water  supply—and  the  end  of  the  reign  of  the  present  CEO,  or  of  pres-­‐
ent  funding.  But  the  ‘resilience  to  risk’  definition  has  the  merit  of  reminding  us  that  this  
checkpoint  will  require  some  effort  at  identifying  and  then  estimating  the  likelihood  of  the  
occurrence  of  the  more  serious  risks,  and  costing  the  attendant  losses.  Sustainability  is  
sometimes  even  more  important  than  longevity,  for  example  when  evaluating  international  
or  cross-­‐cultural  developmental  programs;  longevity  and  durability  refer  primarily  to  the  
reliability  of  the  ‘machinery’  of  the  program  and  its  maintenance,  including  availability  of  
the  required  labor/expertise  and  tech  supplies;  but  are  less  connotative  of  external  threats  
such  as  the  ‘100-­‐year  drought’  or  civil  war,  and  less  concerned  with  ‘continuing  to  produce  
the  same  results’  which  is  what  you  really  care  about.  Note  that  what  you’re  generalizing—
i.e.,  predicting—about  these  programs  is  the  future  (effects)  of  ‘the  program  in  context,’  not  
the  mere  existence  of  the  program,  and  so  any  context  required  for  the  effects  should  be  
specified,  and  include  any  required  infrastructure.  Here,  as  in  the  previous  checkpoint,  we  
are  making  predictions  about  outcomes  in  certain  scenarios,  and,  although  risky,  this  some-­‐
times  generates  the  greatest  contribution  of  the  evaluation  to  improvement  of  the  world  
(see  also  the  ‘possible  scenarios’  of  Checkpoint  D4).  All  three  show  the  extent  to  which  
good  evaluation  is  a  creative  and  not  just  a  reactive  enterprise.  That’s  the  good  news  way  of  
putting  the  point;  the  bad  news  way  is  that  much  good  evaluation  involves  raising  ques-­‐
tions  that  can  only  be  answered  definitively  by  doing  work  that  you  are  probably  not  
funded  to  do.  
Note  C5.1:  Above  all,  keep  in  mind  that  the  absence  of  generalizability  has  absolutely  no  
deleterious  effect  on  establishing  that  a  program  is  meritorious,  unlike  the  absence  of  a  
positive  rating  on  any  of  the  four  other  sub-­‐evaluation  dimensions.  It  only  affects  establish-­‐
ing  the  extent  of  its  benefits.  This  can  be  put  by  saying  that  generalizability  is  a  plus,  but  its  
absence  is  not  a  minus—unless  you’re  scoring  for  the  Ideal  Program  Oscars.  Putting  it  an-­‐
other  way,  generalizability  is  highly  desirable,  but  that  doesn’t  mean  that  it’s  a  requirement  
for  m/w/s.  A  program  may  do  the  job  of  meeting  needs  just  where  it  was  designed  to  do  
that,  and  not  be  generalizable—and  still  rate  an  A+.  
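One way to keep that asymmetry straight when recording results is sketched below; the dimensions follow the list in Checkpoint C5 and the verdicts are hypothetical. Affirmative answers add to the estimate of the extent of benefits; negatives and not-yet-investigated dimensions simply leave it where it was.

```python
# Hypothetical generalizability profile (illustration only).
# Verdicts: True = generalizes, False = does not, None = not yet investigated.
dimensions = {
    "other content":            True,
    "other sites":              True,
    "other staff":              None,
    "larger or smaller scale":  False,
    "other recipients":         True,
    "other climates":           None,
    "later times (longevity)":  True,
    "sustainability":           False,
}

pluses = [d for d, verdict in dimensions.items() if verdict is True]
unknowns = [d for d, verdict in dimensions.items() if verdict is None]

# Per Note C5.1: each affirmative is a merit (it extends the domains of
# benefit); a negative is not a demerit, so nothing is subtracted for it.
print(f"Dimensions on which benefits generalize ({len(pluses)}):")
for d in pluses:
    print("  +", d)
print("Not yet investigated:", ", ".join(unknowns) if unknowns else "none")
```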
Note  C5.2:  Although  generalizability  is  ‘only’  a  plus,  it  needs  to  be  explicitly  defined  and  de-­‐
fended.  It  is  still  the  case  that  good  researchers  make  careless  mistakes  of  inappropriate  
implicit  generalization.  For  example,  there  is  still  much  discussion,  with  good  researchers  
on  both  sides,  of  whether  the  use  of  student  ratings  of  college  instructors  and  courses  im-­‐
proves  instruction,  or  has  any  useful  level  of  validity.  But  any  conclusion  on  this  topic  in-­‐
volves  an  illicit  generalization,  since  the  evaluand  ‘student  ratings’  is  about  as  useful  in  
such  evaluations  as  ‘herbal  medicine’  is  in  arguments  about  whether  herbal  medicine  is  
beneficial  or  not.  Since  any  close  study  shows  that  herbal  medicines  with  the  same  label  
often  contain  completely  different  substances  (and  almost  always  substantially  different  
amounts  of  the  main  element),  and  since  most  but  not  all  student  rating  forms  are  invalid  
or  uninterpretable  for  more  than  one  reason,  the  essential  foundation  for  the  generaliza-­‐
tion—a  common  referent—is  non-­‐existent.  Similarly,  investigations  of  whether  online  
teaching  is  superior  to  onsite  instruction,  or  vice  versa,  are  about  absurdly  variable  ev-­‐

aluands,  and  generalizing  about  their  relative  merits  is  like  generalizing  about  the  ethical  
standards  of  ‘white  folk’  compared  to  ‘Asians.’  Conversely,  and  just  as  importantly,  evalu-­‐
ative  studies  of  a  nationally  distributed  reading  program  must  begin  by  checking  the  fi-­‐
delity  of  your  sample  (Description  and  Process  checkpoints).  This  is  checking  instantiation  
(sometimes  this  is  part  of  what  is  called  ‘checking  dosage’  in  the  medical/pharmaceutical  
context),  the  complementary  problem  to  checking  generalization.  
Note  C5.3:  Checkpoint  C5  is,  perhaps  more  than  any  others,  the  residence  of  prediction,  
with  all  its  special  problems.  Will  the  program  continue  to  work  in  its  present  form?  Will  it  
work  in  some  modified  form?  In  some  different  context?  With  different  person-­‐
nel/clients/recipients?  These,  and  the  others  listed  above,  are  each  formidable  prediction  
tasks  that  will,  in  important  cases,  require  separate  research  into  their  special  problems.  
When  special  advice  cannot  be  found,  it  is  tempting  to  fall  back  on  the  assumption  that,  ab-­‐
sent  ad  hoc  considerations,  the  best  prediction  is  extrapolation  of  current  trends.  That’s  the  
best  simple  choice,  but  it’s  not  the  best  you  can  do;  you  can  at  least  identify  the  most  com-­‐
mon  interfering  conditions  and  check  to  see  if  they  are/will  be  present  and  require  a  modi-­‐
fication  or  rejection  of  the  simple  extrapolation.  Example:  will  the  program  continue  to  do  
as  well  as  it  has  been  doing?  Possibly  not  if  the  talented  CEO  dies/retires/leaves/burns  
out?  So  check  on  the  evidence  for  each  of  these  possibilities,  thereby  increasing  the  validity  
of  the  bet  on  steady-­‐state  results,  or  forcing  a  switch  to  another  bet.  See  also  Note  D2.2.  
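A minimal sketch of that 'check before you extrapolate' step follows; the trend figures and the list of interfering conditions are invented for illustration. Start from the simple extrapolation of current results, then scan the known interfering conditions and either keep the bet, flag it, or switch to another one.

```python
# Hypothetical trend extrapolation with a check for interfering conditions.

yearly_outcomes = [62, 66, 69, 71]  # e.g., % of participants meeting a benchmark

# Simple extrapolation: continue the average year-on-year change.
avg_change = (yearly_outcomes[-1] - yearly_outcomes[0]) / (len(yearly_outcomes) - 1)
naive_prediction = yearly_outcomes[-1] + avg_change

# Common interfering conditions to check before betting on the extrapolation
# (each flag would be established by evidence, not assumed).
interfering_conditions = {
    "founding CEO departing":               True,
    "original funding ends next year":      True,
    "key staff turnover above normal":      False,
    "major change in recipient population": False,
}
present = [c for c, found in interfering_conditions.items() if found]

print(f"Simple extrapolation for next period: {naive_prediction:.1f}")
if present:
    print("Extrapolation is NOT a safe bet; conditions present:")
    for c in present:
        print("  -", c)
else:
    print("No known interfering conditions; extrapolation is the best simple bet.")
```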
General  Note  7:  Comparisons,  Costs,  and  Generalizability  are  in  the  same  category  as  
values  from  the  list  in  Checkpoint  B5;  they  are  all  considerations  of  certain  dimensions  of  
value—comparative  value,  economic  value,  general  value.  Why  do  they  get  special  billing  
with  their  own  checkpoint  in  the  list  of  sub-­‐evaluations?  Basically,  because  of  (i)  their  vir-­‐
tually  universal  critical  importance40,  (ii)  the  frequency  with  which  one  or  more  are  omit-­‐
ted from evaluations when they should have been included, and (iii) the fact that they each in-
volve  some  techniques  of  a  relatively  special  kind.  Despite  their  idiosyncrasies,  it’s  also  
possible  to  see  them  as  potential  exemplars,  by  analogy  at  least,  of  how  to  deal  with  some  
of  the  other  relevant  values  from  Checkpoint  B5,  which  will  come  up  as  relevant  under  Pro-­‐
cess,  Outcomes,  and  Comparisons.  
 

PART  D:  CONCLUSIONS  &  IMPLICATIONS  


D1.   Synthesis    
Now  we’re  beginning  to  develop  the  key  elements  of  the  report  and  the  executive  sum-­‐
mary.  You  have  already  done  a  great  deal  of  the  required  synthesis  of  facts  with  values  
using  the  scales  developed  in  Checkpoint  B5,  Values,  in  order  to  get  the  sub-­‐evaluations  of  
Part  C.  This  means  you  already  have  an  evaluative  profile  of  the  evaluand:  i.e.,  a  bar  graph,  
the  simplest  graphical  means  of  representing  a  multidimensional  evaluative  conclusion,  
and  greatly  superior  to  a  table  for  most  clients  and  audiences.  But  for  some  evaluative  pur-­‐
40 Of course, ethics (and the law) is critically important, but only as a framework constraint that must not be vio-
lated. Outcomes are the material benefits or damage within the ethical/legal framework and their size and direction
are the most variable and antecedently uncertain, and hence highly critical findings from the evaluation. Ethics is the
greenhouse; outcomes are what grows inside it.

poses  you  need  a  further  synthesis,  this  time  of  the  sub-­‐evaluations,  because  you  need  to  
get  a  one-­‐dimensional  evaluative  conclusion,  i.e.,  an  overall  grade  or,  if  you  can  justify  a  
quantitative  scale,  an  overall  score.  For  example,  you  may  need  to  assist  the  client  in  choos-­‐
ing  the  best  of  several  evaluands,  which  means  ranking  them,  and  the  easiest  way  to  do  this  
is  to  have  each  of  them  evaluated  on  a  single  overall  summative  dimension.  That’s  easy  to  
say,  but  it’s  not  easy  to  justify  most  efforts  to  do  that,  because  in  order  to  combine  those  
multiple  dimensions  into  a  single  one,  you  have  to  have  a  legitimate  common  metric  for  
them,  which  is  rarely  supportable.  (It’s  easy  to  see  why  a  quantitative  approach  is  so  attrac-­‐
tive!)  At  the  least,  you’ll  need  a  supportable  estimate  of  the  relative  importance  of  each  di-­‐
mension  of  merit,  and  not  even  that  is  easy  to  get.  Details  of  how  and  when  it  can  be  done  
will  be  provided  elsewhere  and  would  take  too  much  space  to  fit  in  here.41  The  content  
focus  (point  of  view)  of  the  synthesis,  on  which  the  common  metric  should  be  based,  should  
usually  be  the  present  and  future  total  impact  on  consumer  (e.g.,  employer,  employee,  pa-­‐
tient,  student)  or  community  needs,  subject  to  the  constraints  of  ethics,  the  law,  and  re-­‐
source-­‐feasibility,  etc…  Apart  from  the  need  for  a  ranking  there  is  very  often  also  a  practical  
need  for  a  concise  presentation  of  the  most  crucial  evaluative  information.  A  profile  showing  
merit  on  the  five  core  dimensions  of  Part  C  can  often  meet  that  need,  without  going  to  a  
uni-­‐dimensional  compression  into  a  single  grade.  Another  possible  profile  for  such  a  sum-­‐
mary  would  be  based  on  the  SWOT  checklist  widely  used  in  business:  Strengths,  Weak-­‐
nesses,  Opportunities,  and  Threats.42  Sometimes  it  makes  sense  to  provide  both  profiles.  
This  part  of  the  synthesis/summary  could  also  include  referencing  the  results  against  the  
clients’  and  perhaps  other  stakeholders’  goals,  wants,  or  hopes  (if  feasible),  e.g.,  goals  met,  
ideals  realized,  created  but  unrealized  value,  when  these  are  determinable,  which  can  also  
be  done  with  a  profile.  But  the  primary  obligation  of  the  evaluator  is  usually  to  reference  
the  results  to  the  needs  of  the  impacted  population,  within  the  constraints  of  overarching  
values  such  as  ethics,  the  law,  the  culture,  etc.  Programs  are  not  made  into  good  programs  
by  matching  someone’s  goals,  but  by  doing  something  worthwhile,  on  balance.  Of  course,  
for  public  or  philanthropic  funding,  the  two  should  coincide,  but  you  can’t  assume  they  do;  
in  fact,  they  are  all-­‐too-­‐often  incompatible.    
Another  popular  focus  for  the  overall  report  is  the  ROI  (return  on  investment),  which  is  su-­‐
perbly concise, but it's too limited (no ethics, side-effects, goal critique, etc.). The often-
suggested  3D  expansion  of  ROI  gives  us  the  3P  dimensions—benefits  to  People,  Planet,  and  
Profit—often  called  the  ‘triple  bottom  line.’  It’s  still  a  bit  narrow  and  we  can  do  better  with  
the  5  dimensions  listed  here  as  the  sub-­‐evaluations  listed  in  Part  C:  Process,  Outcomes;  
Costs;  Comparisons;  Generalizability.  A  bar  graph  showing  the  merit  of  achievements  on  
each  of  these  provides  a  succinct  and  insightful  profile  of  a  program’s  value.  To  achieve  it,  
you  will  need  defensible  definitions  of  the  standards  you  are  using  on  each  column  (the  
rubrics),  e.g.,  “An  A  grade  for  Outcomes  will  require…”  and  there  will  be  ‘bars’  (i.e.,  absolute  
minimum  standards)  on  several  of  these,  e.g.,  ethical  acceptability  on  the  Outcomes  scale,  
cost-­‐feasibility  on  the  Costs  scale.  Since  it’s  highly  desirable  that  you  have  these  for  any  
serious  program  evaluation,  this  5D  summary  should  not  be  a  dealbreaker  requirement.  

41 An article “The Logic of Evaluation” forthcoming by summer, 2011, on the web site michaelscriven.info does a
better job on this than my previous efforts, which do not now seem adequate as references.
42 Google provides 6.2 million references for SWOT (@1/23/07), but the top two or three are good introductions.

(Another  version  of  a  5D  approach  is  given  in  the  paper  “Evaluation  of  Training”  that  is  on-­‐
line at michaelscriven.info.)
Apart  from  the  rubrics  for  each  relevant  value,  if  you  have  to  come  up  with  an  overall  grade  
of  some  kind,  you  will  need  to  do  an  overall  synthesis  to  reduce  the  two-­‐dimensional  pro-­‐
file  to  a  ‘score’  on  a  single  dimension.  (Since  it  may  be  qualitative,  we’ll  use  the  term  ‘grade’  
for  this  property.)  Getting  to  an  overall  grade  requires  what  we  might  call  a  meta-­‐rubric—a  
set  of  rules  for  converting  profiles—which  are  typically  themselves  a  set  of  grades  on  sev-­‐
eral  dimensions—to  a  grade  on  a  single  scale.  What  we  call  ‘weighting’  the  dimensions  is  a  
basic  kind  of  meta-­‐rubric  since  it’s  an  instruction  to  take  some  of  the  constituent  grades  
more  seriously  than  others  for  some  further,  ‘higher-­level’  evaluative  purpose.  (A  neat  way  to  
display  this  graphically  is  by  using  the  width  of  a  column  in  the  profile  to  indicate  import-­‐
ance.)  If  you  are  lucky  enough  to  have  developed  an  evaluative  profile  for  a  particular  ev-­‐
aluand,  in  which  each  dimension  of  merit  is  of  equal  importance  (or  of  some  given  numeri-­
cal  importance  compared  to  the  others),  and  if  each  grade  can  be  expressed  numerically,  
then  you  can  just  average  the  grades.  BUT  legitimate  examples  of  such  cases  are  almost  un-­‐
known,  although  we  often  oversimplify  and  act  as  if  we  have  them  when  we  don’t.  For  ex-­‐
ample,  we  average  college  grades  to  get  the  GPA  (grade  point  average)  and  use  this  in  many  
overall  evaluative  contexts  such  as  selection  for  admission  to  graduate  programs.  Of  
course,  this  oversimplification  can  be,  and  frequently  is,  ‘gamed’  by  students  e.g.,  by  taking  
courses  where  grade  inflation  means  that  the  A’s  do  not  represent  excellent  work  by  any  
reasonable  standard.  A  better  meta-­‐rubric  results  from  using  a  comprehensive  exam,  
graded  by  a  departmental  committee  instead  of  one  person,  and  then  giving  the  grade  on  
this exam double weight, or even 80% of the weight. Another common meta-rubric in graduate
schools  is  setting  a  meta-­‐bar,  i.e.,  an  overall  absolute  requirement  for  graduation,  e.g.,  that  
no  single  dimension  (course  or  a  named  subset  of  crucially  important  courses)  be  graded  
below  B-­‐.  
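For readers who want to see the mechanics, here is a minimal sketch of a weighted synthesis with bars and a meta-bar; the grades, weights, and thresholds are hypothetical, and the numeric conversion of grades is exactly the kind of step that has to be explicitly defended, not assumed, in a real evaluation.

```python
# Hypothetical synthesis of sub-evaluation grades (illustration only).
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

# Grades on the five core dimensions of Part C, with (defended) weights.
profile = {
    "Process":          ("B", 1.0),
    "Outcomes":         ("A", 2.0),   # weighted more heavily in this example
    "Costs":            ("B", 1.0),
    "Comparisons":      ("C", 1.0),
    "Generalizability": ("B", 0.5),
}

# 'Bars': absolute minimum standards on particular dimensions,
# plus a 'meta-bar' that no dimension at all may fall below.
bars = {"Outcomes": "C", "Costs": "C"}
meta_bar = "D"

def points(grade):
    return GRADE_POINTS[grade]

failed_bars = [d for d, minimum in bars.items()
               if points(profile[d][0]) < points(minimum)]
failed_meta = [d for d, (g, _) in profile.items() if points(g) < points(meta_bar)]

if failed_bars or failed_meta:
    print("Bar not cleared on:", sorted(set(failed_bars + failed_meta)))
    print("No amount of averaging can rescue the overall grade in that case.")
else:
    total_weight = sum(w for _, w in profile.values())
    weighted = sum(points(g) * w for g, w in profile.values()) / total_weight
    print(f"Weighted average: {weighted:.2f} grade points; "
          "the profile itself should still be reported alongside this figure.")
```

The structure is the point here: the bars and the meta-bar are checked before any averaging is done, and the weights need the same kind of justification as the rubrics themselves.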
Note  D1.1:  One  special  conclusion  to  go  for,  often  a  major  part  of  determining  significance,  
comes  from  looking  at  what  was  done  against  what  could  have  been  done  with  the  Re-­‐
sources  available,  including  social  and  individual  capital.  This  is  one  of  several  cases  where  
imagination  is  needed  to  determine  a  grade  on  the  Opportunities  part  of  the  SWOT  analy-­‐
sis. But remember this is thin ice territory (see the 'thin ice' caution in Checkpoint C4).
Note  D1.2:  Be  sure  to  convey  some  sense  of  the  strength  of  your  conclusions,  which  means  
the  combination  of:  (i)  the  net  weight  of  the  evidence  for  the  premises,  with  (ii)  the  proba-­
bility  of  the  inferences  from  them  to  the  conclusion(s),  and  (iii)  the  probability  that  there  is  
no  other  relevant  evidence.  For  example,  indicate  whether  the  performance  on  the  various  
dimensions  of  merit  was  a  tricky  inference  or  directly  observed;  did  the  evaluand  clear  any  
bars  or  lead  any  competitors  ‘by  a  mile’  or  just  scrape  over  (i.e.,  use  gap-­‐ranking  not  just  
ranking43);  were  the  predictions  involved  double-­‐checked  for  invalidating  indicators  (see  
Note  C5.2);  was  the  conclusion  established  ‘beyond  any  reasonable  doubt,’  or  merely  ‘sup-­‐
ported  by  the  balance  of  the  evidence’?  This  complex  property  of  the  evaluation  is  referred  
43 Gap-ranking is a refinement of ranking in which a qualitative or quantitative estimate of the size of intervals between evaluands is provided (modeled after the system in horse-racing—'by a head,' 'by a nose,' 'by three lengths,' etc.). This is often enormously more useful than mere ranking, e.g., because it tells a buyer that s/he can get very nearly as good a product for much less money.

to  as  ‘robustness.’  Some  specific  aspects  of  the  limitations  also  need  statement  here  e.g.,  
those  due  to  limited  time-­‐frame  (which  often  rules  out  some  mid-­‐  or  long-­‐term  follow-­‐ups  
that  are  badly  needed).  
 
D2.    (possible)  Recommendations,  Explanations,  Predictions,  and  Redesigns.  
All  of  these  possibilities  are  examples  of  the  ‘something  more’  approach  to  evaluation,  by  
contrast  with  the  more  conservative  ‘nothing  but’  approach,  which  advocates  rather  careful  
restriction  of  the  evaluator’s  activities  to  evaluation,  ‘pure  and  simple.’  These  alternatives  
have  analogies  in  every  profession—judges  are  tempted  to  accept  directorships  in  com-­‐
panies  who  may  come  before  them  as  defendants,  counsellors  consider  adopting  counse-­‐
lees, etc. The 'nothing but' approach can be expressed, with thanks to a friend of Gloria Steinem, as: 'An evaluation without recommendations (or explanations, etc.) is like a fish without a bicycle.' Still, there are more caveats about pressing for evaluation-separation than the fish analogy suggests. In other words, 'lessons learned'—of whatever type—should be sought
diligently,  expressed  cautiously,  and  applied  even  more  cautiously.    
Let’s  start  with  recommendations.  Micro-­‐recommendations—those  concerning  the  inter-­‐
nal  workings  of  program  management  and  the  equipment  or  personnel  choices/use—often  
become  obvious  to  the  evaluator  during  the  investigation,  and  are  demonstrable  at  little  or  
no  extra  cost/effort  (we  sometimes  say  they  “fall  out”  from  the  evaluation;  as  an  example  of  
how  easy  this  can  sometimes  be,  think  of  copy-­‐editors,  who  often  do  both  evaluation  and  
recommendation  to  an  author  in  one  pass),  or  they  may  occur  to  a  knowledgeable  evalu-­‐
ator  who  is  motivated  to  help  the  program,  because  of  his/her  expert  knowledge  of  this  or  
an  indirectly  or  partially  relevant  field  such  as  information  or  business  technology,  organi-­‐
zation  theory,  systems  concepts,  or  clinical  psychology.  These  ‘operational  recommenda-­‐
tions’  can  be  very  useful—it’s  not  unusual  for  a  client  to  say  that  these  suggestions  alone  
were  worth  more  than  the  cost  of  the  evaluation.  (Naturally,  these  suggestions  have  to  be  
within  the  limitations  of  the  (program  developer’s)  Resources  checkpoint,  except  when  
doing  the  Generalizability  checkpoint.)  Generating  these  ‘within-­‐program’  recommenda-­‐
tions  as  part  of  formative  evaluation  (though  they’re  one  step  away  from  the  primary  task  
of  formative  evaluation  which  is  straight  evaluation  of  the  present  quality  of  the  evaluand),  
is  one  of  the  good  side-­‐effects  that  may  come  from  using  an  external  evaluator,  who  often  
has  a  new  view  of  things  that  everyone  on  the  scene  may  have  seen  too  often  to  see  criti-­‐
cally.    
On  the  other  hand,  macro-­‐recommendations—which  are  about  the  disposition  or  classifica-­‐
tion  of  the  whole  program  (refund,  cut,  modify,  export,  etc.—which  we  might  also  call  ex-­‐
ternal  management  recommendations,  or  dispositional  recommendations)—are  usually  
another  matter.  These  are  important  decisions  serviced  by  and  properly  dependent  on,  
summative  evaluations,  but  making  recommendations  about  the  evaluand  is  not  intrinsi-­‐
cally  part  of  the  task  of  evaluation  as  such,  since  it  depends  on  other  matters  besides  the  
m/w/s  of  the  program,  which  is  all  the  evaluator  normally  can  undertake  to  determine.    
For the evaluator to make dispositional recommendations about a program
will  typically  require  two  extras  over  and  above  what  it  takes  to  evaluate  the  program:  (i)  
extensive  knowledge  of  the  other  factors  in  the  context-­‐of-­‐decision  for  the  top-­‐level  

(‘about-­‐program’)  decision-­‐makers.  Remember  that  those  people  are  often  not  the  clients  
for  the  evaluation—they  are  often  further  up  the  organization  chart—and  they  may  be  un-­‐
willing  or  psychologically  or  legally  unable  to  provide  full  details  about  the  context-­‐of-­‐
decision  concerning  the  program  (e.g.,  unable  because  implicit  values  are  not  always  rec-­‐
ognized  by  those  who  operate  using  them).  The  correct  dispositional  decisions  often  rightly  
depend  on  legal  or  donor  constraints  on  the  use  of  funds,  and  sometimes  on  legitimate  po-­‐
litical  constraints  not  explained  to  the  evaluator,  not  just  m/w/s;  and  any  of  these  can  arise  
after  the  evaluation  begins  and  the  evaluator  is  briefed  about  then-­‐known  environmental  
constraints,  if  s/he  is  briefed  at  all.  
Such  recommendations  will  also  often  require  (ii)  considerable  extra  effort  e.g.,  to  evaluate  
each  of  the  other  macro-­‐options.  Key  elements  in  this  may  be  trade  secrets  or  national  se-­‐
curity  matters  not  available  to  the  evaluator,  e.g.,  the  true  sales  figures,  the  best  estimate  of  
competitors’  success,  the  extent  of  political  vulnerability  for  work  on  family  planning,  the  
effect  on  share  prices  of  withdrawing  from  this  slice  of  the  market.  This  elusiveness  also  
often  applies  to  the  macro-­‐decision  makers’  true  values,  with  respect  to  this  decision,  
which  are  quite  often  trade  or  management  or  government  secrets  of  the  board  of  direc-­‐
tors,  or  select  legislators,  or  perhaps  personal  values  only  known  to  their  psychotherapists.    
So  it  is  really  a  quaint  conceit  of  evaluators  to  suppose  that  the  m/w/s  of  the  evaluand  are  
the  only  relevant  grounds  for  deciding  how  to  dispose  of  it;  there  are  often  entirely  legiti-­‐
mate  political,  legal,  public-­‐perception,  market,  and  ethical  considerations  that  are  at  least  
as  important,  especially  in  toto.  So  it’s  simply  presumptuous  to  propose  macro-­‐recomm-­‐
endations  as  if  they  follow  directly  from  the  evaluation:  they  almost  never  do,  even  when  
the  client  may  suppose  that  they  do,  and  encourage  the  evaluator  to  produce  them.  (It’s  a  
mistake  I’ve  made  more  than  once.)  If  you  do  have  the  required  knowledge  to  infer  to  them,  
then  at  least  be  very  clear  that  you  are  doing  a  different  evaluation  in  order  to  reach  them,  
namely  an  evaluation  of  the  alternative  options  open  to  the  disposition  decision-­‐makers,  by  
contrast  with  an  evaluation  of  the  evaluand  itself.  In  the  standard  program  evaluation,  but  
not  in  the  evaluation  of  various  dispositions  of  it,  you  can  sometimes  include  an  evaluation  
of  the  internal  choices  available  to  the  program  manager,  i.e.,  recommendations  for  im-­‐
provements.  
There  are  a  couple  of  ways  to  ‘soften’  recommendations  in  order  to  take  account  of  these  
hazards.  The  simplest  way  is  to  preface  them  by  saying,  “Assuming  that  the  program’s  dis-­‐
position  is  dependent  only  on  its  m/w/s,  it  is  recommended  that…”  A  more  creative  and  
often  more  productive  approach,  advocated  by  Jane  Davidson,  is  to  convert  recommenda-­‐
tions  into  options,  e.g.,  as  follows:  “It  would  seem  that  program  management/staff  faces  a  
choice  between:  (i)  continuing  with  the  status  quo;  (ii)  abandoning  this  component  of  the  
program;  (iii)  implementing  the  following  variant  [here  you  insert  your  recommendation]  
or  some  variation  of  this.”  The  program  management/staff  is  thus  invited  to  adopt  and  be-­‐
come  a  co-­‐author  of  an  option,  a  strategy  that  is  often  more  likely  to  result  in  implementa-­‐
tion  than  a  mere  recommendation  from  an  outsider.  
Many  of  these  extra  requirements  for  making  macro-­‐recommendations—and  sometimes  
one  other—also  apply  to  providing  explanations  of  success  or  failure.  The  extra  require-­‐
ment  is  possession  of  the  correct  (not  just  the  believed)  logic  or  theory  of  the  program,  
which  typically  requires  more  than—and  rarely  requires  less  than—state-­‐of-­‐the-­‐art  sub-­‐

ject-­‐matter  expertise,  both  practical  and  ‘theoretical’  (i.e.,  the  scientific  or  engineering  ac-­‐
count),  about  the  evaluand’s  inner  workings  (i.e.,  about  what  optional  changes  would  lead  
to  what  results).  A  good  automobile  mechanic  has  the  practical  kind  of  knowledge  about  
cars  that  s/he  works  on  regularly,  which  includes  knowing  how  to  identify  malfunction  and  
its  possible  causes;  but  it’s  often  only  the  automobile  engineer  who  can  give  you  the  rea-­‐
sons  why  these  causal  connections  work,  which  is  what  the  demand  for  explanations  will  
usually  require.  The  combination  of  these  requirements  imposes  considerable,  and  some-­‐
times  enormous,  extra  time  and  research  costs  which  has  too  often  meant  that  the  attempt  
to  provide  recommendations  or  explanations  (by  using  the  correct  program  logic)  is  done  
at  the  expense  of  doing  the  basic  evaluation  task  well  (or  even  getting  to  it  at  all),  a  poor  
trade-­‐off  in  most  cases.  Moreover,  getting  the  explanation  right  will  sometimes  be  abso-­‐
lutely  impossible  within  the  ‘state  of  the  art’  of  science  and  engineering  at  the  moment—
and  this  is  not  a  rare  event,  since  in  many  cases  where  we’re  looking  for  a  useful  social  
intervention,  no-­‐one  has  yet  found  a  plausible  account  of  the  underlying  phenomenon:  for  
example,  in  the  cases  of  delinquency,  addiction,  autism,  serial  killing,  ADHD.  In  such  cases,  
what  we  need  to  know  is  whether  we  have  found  a  cure—complete  or  partial—since  we  
can  use  that  knowledge  to  save  people  immediately,  and  also,  thereafter,  to  start  work  on  
finding  the  explanation.  That’s  the  ‘aspirin  case’—the  situation  where  we  can  easily,  and  
with  great  benefit  to  many  sufferers,  evaluate  a  claimed  medication  although  we  don’t  
know  why  it  works,  and  don’t  need  to  know  that  in  order  to  evaluate  its  efficacy.  In  fact,  un-­‐
til  the  evaluation  is  done,  there’s  no  success  or  failure  for  the  scientist  to  investigate,  which  
vastly  reduces  the  significance  of  the  causal  inquiry,  and  hence  the  probability/value  of  its  
occurrence.  
It’s  also  extremely  important  to  realize  that  macro-­‐recommendations  will  typically  require  
the  ability  to  predict  the  results  of  the  recommended  changes  in  the  program,  at  the  very  
least  in  this  specific  context,  which  is  something  that  the  program  logic  or  program  theory  
(like  many  social  science  theories)  is  often  not  able  to  do  with  any  reliability.  Of  course,  
procedural  recommendations  in  the  future  tense,  e.g.,  about  needed  further  research  or  
data-­‐gathering  or  evaluation  procedures,  are  often  possible—although  typically  much  less  
useful.    
‘Plain’  predictions  are  also  often  requested  by  clients  or  thought  to  be  included  in  any  
good evaluation (e.g., Will the program work reliably in our schools? Will it work with the recommended changes, without staff changes?), and they are often very hazardous.44 Now, since
these  are  reasonable  questions  to  answer  in  deciding  on  the  value  of  the  program  for  many  
clients,  you  have  to  try  to  provide  the  best  response.  So  read  Clinical  vs.  Statistical  Predic-­
tion  by  Paul  Meehl  and  the  follow-­‐up  literature,  and  the  following  Note  D2.1,  and  then  call  
in  the  subject  matter  experts.  In  most  cases,  the  best  thing  you  can  do,  even  with  all  that  
help,  is  not  just  to  pick  what  appears  to  be  the  most  likely  result,  but  to  give  a  range  from  
the  probability  of  the  worst  possible  outcome  (which  you  describe  carefully)  to  that  of  the  

44 Evaluators sometimes say, in response to such questions, Well, why wouldn’t it work—the reasons for it doing so
are really good? The answer was put rather well some years ago: "…it ought to be remembered that there is nothing
more difficult to take in hand, more perilous to conduct, or more uncertain of success, than to take the lead in the
introduction of a new order of things. Because the innovator has for enemies all those who have done well under the
old conditions, and lukewarm defenders in those who may do well under the new.” (Niccolo Machiavelli (1513),
with thanks to John Belcher and Richard Hake for bringing it up recently (PhysLrnR, 16 Apr 2006)

best  possible  outcome  (also  described),  plus  the  probability  of  the  most  likely  outcome  in  
the  middle  (described  even  more  carefully).45  On  rare  occasions,  you  may  be  able  to  esti-­‐
mate  a  confidence  interval  for  these  estimates.  Then  the  decision-­‐makers  can  apply  their  
choice  of  strategy  (e.g.,  minimax—minimizing  maximum  possible  loss)  based  on  their  risk-­‐
aversiveness.    
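A small sketch of how that range-plus-probabilities presentation can be laid out is given below, together with the PERT-style weighting discussed in footnote 45 and a simple minimax pick; the scenario figures and probabilities are hypothetical.

```python
# Hypothetical three-point outcome estimates for two options, with the
# PERT-style expected value from footnote 45 and a simple minimax pick.

def pert_expected(worst, most_likely, best, mlo_weight=4):
    """Footnote-45 formula: (best + worst + weight * most likely) / (weight + 2)."""
    return (best + worst + mlo_weight * most_likely) / (mlo_weight + 2)

# Outcomes on some agreed benefit scale; the probabilities are the evaluator's
# hedged estimates for each carefully described scenario.
options = {
    "continue program as is": {"worst": 20, "most_likely": 60, "best": 80,
                               "p": (0.15, 0.65, 0.20)},
    "adopt modified version": {"worst": 35, "most_likely": 55, "best": 90,
                               "p": (0.10, 0.70, 0.20)},
}

for name, est in options.items():
    expected = pert_expected(est["worst"], est["most_likely"], est["best"])
    p_worst, p_ml, p_best = est["p"]
    print(f"{name}: worst {est['worst']} (p={p_worst}), "
          f"most likely {est['most_likely']} (p={p_ml}), "
          f"best {est['best']} (p={p_best}) -> PERT expectation {expected:.1f}")

# Minimax, for a risk-averse decision-maker: choose the option whose worst
# possible outcome is least bad.
minimax_choice = max(options, key=lambda n: options[n]["worst"])
print("Minimax choice (best worst case):", minimax_choice)
```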
Although  it’s  true  that  almost  every  evaluation  is  in  a  sense  predictive,  since  the  data  it’s  
based  on  is  yesterday’s  data  but  its  conclusions  are  put  forward  as  true  today,  there’s  no  
need  to  be  intimidated  by  the  need  to  predict;  one  just  has  to  be  very  clear  what  assump-­‐
tions  one  is  making  and  how  much  evidence  there  is  to  support  them.  
Finally,  a  new  twist  on  ‘something  more’  that  I  first  heard  proposed  by  John  Gargani  and  
Stewart  Donaldson  at  the  2010  AEA  convention,  is  for  the  evaluator  to  do  a  redesign  of  a  
program  rather  than  giving  a  highly  negative  evaluation.  This  is  a  kind  of  limit  case  of  rec-­‐
ommendation,  and  of  course  requires  an  extra  skill  set,  namely  design  skills.  The  main  
problem  here  is  role  conflict  and  the  consequent  improper  pressure:  the  evaluator  is  offer-­‐
ing  the  client  loaded  alternatives,  a  variation  on  ‘your  money  or  your  life.’  The  advocates  
suggest  that  the  world  will  be  a  better  place  if  the  program  is  redesigned  rather  than  just  
condemned  by  them,  which  is  probably  true;  but  these  are  not  the  only  alternatives.  The  
evaluator  might  instead  recommend  the  redesign,  and  suggest  calling  for  bids  on  that,  
recusing  his  or  her  candidacy.  Or  they  might  just  recommend  changes  that  a  new  designer  
should  incorporate  or  consider.  
Note  D2.1:  Policy  analysis,  in  the  common  situation  when  the  policy  is  being  considered  
for  future  adoption,  is  close  to  being  program  evaluation  of  future  (possible)  programs  
(a.k.a.,  ex  ante,  or  prospective  program  evaluation)  and  hence  necessarily  involves  all  the  
checkpoints  in  the  KEC  including,  in  most  cases,  an  especially  large  dose  of  prediction.  (A  
policy  is  a  ‘course  or  principle  of  action’  for  a  certain  domain  of  action,  and  implementing  it  
typically  produces  a  program.)  Extensive  knowledge  of  the  fate  of  similar  programs  in  the  
past  is  then  the  key  resource,  but  not  the  only  one.  It  is  also  essential  to  look  specifically  for  
the  presence  of  indicators  of  future  change  in  the  record,  e.g.,  downturns  in  the  perform-­‐
ance  of  the  policy  in  the  most  recent  time  periods,  intellectual  or  motivational  burn-­‐out  of  
principal  players/managers,  media  attention,  the  probability  of  personnel  departure  for  
better  offers,  the  probability  of  epidemics,  natural  disasters,  legislative  ‘counter-­‐
revolutions’  by  groups  of  opponents,  general  economic  decline,  technological  break-­‐
throughs,  or  large  changes  in  taxes  or  house  or  market  values,  etc.  If,  on  the  other  hand,  the  
policy  has  already  been  implemented,  then  we’re  doing  historical  (a.k.a.  ex  post,  or  retro-­‐
spective  program  evaluation)  and  policy  analysis  amounts  to  program  evaluation  without  
prediction,  a  much  easier  case.    

45 In PERT charting (PERT = Program Evaluation and Review Technique), a long-established approach to program
planning that emerged from the complexities of planning the first submarine nuclear missile, the Polaris, the formula
for calculating what you should expect from some decision is: {Best possible outcome + Worst Possible outcome +
4 x (Most likely outcome)}/6. It’s a pragmatic solution to consider seriously. My take on this approach is that it only
makes sense when there are good grounds for saying the most likely outcome (MLO) is very likely; there are many
cases where we can identify the best and worst cases, but have no grounds for thinking the intermediate case is more
likely other than the fact it’s intermediate. Now that fact does justify some weighting (given the usual distribution of
probabilities), but the coefficient for the MLO might then be better as 2 or 3.

Note D2.2: Evaluability assessment is a useful part of good program planning, whenever
it  is  required,  hoped,  or  likely  that  evaluation  could  later  be  used  to  help  improve  as  well  as  
determine  the  m/w/s  of  the  program  to  assist  decision-­‐makers  and  fixers.  It  can  be  done  
well  by  using  the  KEC  to  identify  the  questions  that  will  have  to  be  answered  eventually,  
and  thus  to  identify  the  data  that  will  need  to  be  obtained;  and  the  difficulty  of  doing  that  
will  determine  the  evaluability  of  the  program  as  designed.  Those  preliminary  steps  are,  of  
course,  exactly  the  ones  that  you  have  to  go  through  to  design  an  evaluation,  so  the  two  
processes  are  two  sides  of  the  same  coin.  Since  everything  is  evaluable,  to  some  extent  in  
some  contexts,  the  issue  of  evaluability  is  a  matter  of  degree,  resources,  and  circumstance,  
not  of  absolute  possibility.  In  other  words,  while  everything  is  evaluable,  by  no  means  is  
everything  evaluable  to  a  reasonable  degree  of  confidence,  with  the  available  resources,  in  
every  context.  (For  example,  the  atomic  power  plant  program  for  Iran  after  4/2006,  when  
access was denied to the U.N. inspectors.) As this example illustrates, 'context' includes the date and type of evaluation: while this evaluand was not evaluable prospectively with any confidence in 4/06 (getting the data was not feasible, and predicting sustainability was highly speculative), historians will no doubt be able to evaluate it retrospectively, because we will eventually know whether that program paid off, and/or brought on an attack.
Note D2.3: Inappropriate expectations. The fact that clients often expect/request explan-
ations  of  success  or  shortcomings,  or  macro-­‐recommendations,  or  impossible  predictions,  
is  grounds  for  educating  them  about  what  we  can  definitely  do  vs.  what  we  can  hope  will  
turn  out  to  be  possible.  Although  tempting,  these  expectations  on  the  client’s  part  are  not  
an  excuse  for  doing,  or  trying  for  long  to  do,  and  especially  not  for  promising  to  do,  these  
extra  things  if  you  lack  the  very  substantial  extra  requirements  for  doing  them,  especially  if  
that  effort  jeopardizes  the  primary  task  of  the  evaluator,  viz.  drawing  the  needed  type  of  
evaluative  conclusion  about  the  evaluand.  The  merit,  worth,  or  significance  of  a  program  is  
often  hard  to  determine;  it  (typically)  requires  that  you  determine  whether  and  to  what  
degree  and  in  what  respects  and  for  whom  and  under  what  conditions  and  at  what  cost  it  
does  (or  does  not)  work  better  or  worse  than  the  available  alternatives,  and  what  all  that  
means  for  all  those  involved.  To  add  on  the  tasks  of  determining  how  to  improve  it,  explain-­‐
ing  why  it  works  (or  fails  to  work),  now  and  in  the  future,  and/or  what  one  should  do  about  
supporting  or  exporting  it,  is  simply  to  add  other  tasks,  often  of  great  scientific  and/or  
managerial/social  interest,  but  quite  often  beyond  current  scientific  ability,  let  alone  the  
ability  of  an  evaluator  who  is  perfectly  competent  to  evaluate  the  program.  In  other  words,  
‘black  box  evaluation’  should  not  be  used  as  a  term  of  contempt  since  it  is  often  the  name  
for  a  vitally  useful,  feasible,  and  affordable  approach,  and  frequently  the  only  feasible  one.  
And  in  fact,  most  evaluations  are  of  partially  blacked-­‐out  boxes  (‘grey  boxes’)  where  one  
can  only  see  a  little  of  the  inner  workings.  This  is  perhaps  most  obviously  true  in  pharma-­‐
cological  evaluation,  but  it  is  also  true  in  every  branch  of  the  discipline  of  evaluation  and  
every  one  of  its  application  fields  (health,  education,  social  services,  etc.).  A  program  evalu-­‐
ator  with  some  knowledge  of  parapsychology  can  easily  evaluate  the  success  of  an  alleged  
faith-­‐healer  whose  program  theory  is  that  God  is  answering  his  prayers,  without  the  slight-­‐
est  commitment  to  the  truth  or  falsehood  of  that  program  theory.  
Note  D2.4:  Finally,  there  are  extreme  situations  in  which  the  evaluator  does  have  a  
responsibility—an  ethical  responsibility—to  move  beyond  the  role  of  the  evaluator,  e.g.,  
because  it  becomes  clear,  early  in  a  formative  evaluation,  either  that  (i)  some  gross  
improprieties are involved, or that (ii) certain actions, if taken immediately, will lead to very large increases
in  benefits,  and  it  is  clear  that  no-­‐one  besides  the  evaluator  is  going  to  take  the  necessary  
steps.  The  evaluator  is  then  obliged  to  be  proactive,  and  we  can  call  the  resulting  action  
whistle-­‐blowing  in  the  first  case,  and  proformative  evaluation  in  the  second,  a  cross  be-­‐
tween  formative  evaluation  and  proactivity.  While  macro-­‐recommendations  by  evaluators  
require  great  care,  proactivity  requires  even  greater  care.  

D3.  (possible)  Responsibility  and  Justification  


If either can be determined, and if it is appropriate to determine it. Some versions of
accountability  that  stress  the  accountability  of  people  do  require  this—see  examples  be-­‐
low.  Allocating  blame  or  praise  requires  extensive  knowledge  of:  (i)  the  main  players’  
knowledge-­‐state  at  the  time  of  key  decision  making;  (ii)  their  resources  and  responsibilities  
for  their  knowledge-­‐state  as  well  as  their  actions;  as  well  as  (iii)  an  ethical  analysis  of  their  
options,  and  of  the  excuses  or  justifications  they  (or  others,  on  their  behalf)  may  propose.  
Not  many  evaluators  have  the  qualifications  to  do  this  kind  of  analysis.  The  “blame  game”  is  
very  different  from  evaluation  in  most  cases  and  should  not  be  undertaken  lightly.  Still,  
sometimes  mistakes  are  made,  are  demonstrable,  have  major  consequences,  and  should  be  
pointed  out  as  part  of  an  evaluation;  and  sometimes  justified  choices,  with  good  or  bad  ef-­‐
fects,  are  made  and  attacked,  and  should  be  praised  or  defended  as  part  of  an  evaluation.  
The  evaluation  of  accidents  is  an  example:  the  investigations  of  aircraft  crashes  by  the  
National  Transportation  Safety  Board  in  the  US  are  in  fact  a  model  example  of  how  this  can  
be  done;  they  are  evaluations  of  an  event  with  the  added  requirement  of  identifying  re-­‐
sponsibility,  whether  it’s  human  or  natural  causes.  (Operating  room  deaths  pose  similar  
problems  but  are  often  not  as  well  investigated.)    
Note D3.1: The evaluation of disasters (a misleading title), recently an area of consider-
able  activity,  typically  involves  one  or  more  of  the  following  five  options:  (i)  an  evaluation  
of  the  extent  of  preparedness,  (ii)  an  evaluation  of  the  immediate  response,  (iii)  an  evalu-­‐
ation  of  the  totality  of  the  relief  efforts  until  termination,  (iv)  an  evaluation  of  the  lessons  
learned  (lessons  learned  should  be  a  part  of  each  of  the  evaluations  done  of  the  response),  
and  (v)  an  evaluation  of  subsequent  corrective/preventative  action.  All  five  involve  some  
evaluation  of  responsibility  and  sometimes  the  allocation  of  praise/blame.  Recent  efforts  
(c.  2005)  referred  to  as  general  approaches  to  the  ‘evaluation  of  disasters’  appear  not  to  
have  distinguished  all  of  these  and  not  to  have  covered  all  of  them,  although  it  seems  plaus-­‐
ible  that  all  should  have  been  covered  in  order  to  minimize  the  impact  of  later  disasters.  

D4.   Report  &  Support    


Now  we  come  to  the  task  of  conveying  the  conclusions  in  an  appropriate  way,  and  at  ap-­‐
propriate  times  and  locations.  This  is  a  very  different  task  from—although  frequently  con-­‐
fused  with—handing  over  a  semi-­‐technical  report  at  the  end  of  the  study,  the  paradigm  for  
typical  research  studies  of  the  same  phenomena.  Evaluation  reporting  for  a  single  evalu-­‐
ation  may  require,  or  benefit  from,  radically  different  presentations  to  different  audiences,  
at  different  times  in  the  evaluation:  these  may  be  oral  or  written,  long  or  short,  public  or  
private,  technical  or  non-­‐technical,  graphical  or  textual,  scientific  or  story-­‐telling,  anecdotal  
and  personal  or  barebones.  And  this  phase  of  the  evaluation  process  should  include  post-­‐
report  help,  e.g.,  handling  questions  when  they  turn  up  later  as  well  as  immediately,  ex-­‐

plaining  the  report’s  significance  to  different  groups  including  users,  staff,  funders,  and  
other  impactees,  and  even  reacting  to  later  program  or  management  or  media  documents  
allegedly  reporting  the  results  or  implications  of  the  evaluation.  This  in  turn  may  involve  
proactive  creation  and  depiction  in  the  primary  report  of  various  possible  scenarios  of  in-­‐
terpretations  and  associated  actions  that  are,  and—the  contrast  is  extremely  helpful—are  
not,  consistent  with  the  findings.  Essentially,  this  means  doing  some  problem-­‐solving  for  
the  clients,  that  is,  advance  handling  of  difficulties  they  are  likely  to  encounter  with  various  
audiences.  In  this  process,  a  wide  range  of  communication  skills  is  often  useful  and  some-­‐
times  vital,  e.g.,  audience  ‘reading’,  use  and  reading  of  body  language,  understanding  the  
multicultural  aspects  of  the  situation  and  the  cultural  iconography  and  connotative  implica-­‐
tions  of  types  of  presentations  and  response.46  There  should  usually  be  an  explicit  effort  to  
identify 'lessons learned,' failures and limitations, and costs if requested, and to explain 'who
evaluates  the  evaluators.’  Checkpoint  D4  should  also  cover  getting  the  results  (and  inciden-­‐
tal  knowledge  findings)  into  the  relevant  databases,  if  any;  possibly  but  not  necessarily  into  
the  information  ocean  via  journal  publication  (with  careful  consideration  of  the  cost  of  sub-­‐
sidizing  these  for  potential  readers  of  the  publication  chosen);  recommending  creation  of  a  
new  database  or  information  channel  (e.g.,  a  newsletter)  where  beneficial;  and  dissemina-­‐
tion  into  wider  channels  if  appropriate,  e.g.,  through  presentations,  online  posting,  discus-­‐
sions at scholarly meetings, or in hardcopy posters, graffiti, books, blogs, wikis, tweets, and in movies (yes, fans, remember—YouTube is free).

D5.   Meta-­evaluation  
This  is  the  evaluation  of  an  evaluation  or  evaluations—including  evaluations  based  on  the  
use  of  this  checklist—in  order  to  identify  their  strengths/limitations/other  uses.  Meta-­‐
evaluation  should  always  be  done,  as  a  separate  quality  control  step(s),  as  follows:  (i)  to  the  
extent  possible,  by  the  primary  evaluator,  certainly—but  not  only—after  completion  of  the  
final  draft  of  any  report;  and  (ii)  whenever  possible  also  by  an  external  evaluator  of  the  
evaluation  (a  meta-­‐evaluator).  The  primary  criteria  of  merit  for  evaluations  are:  (i)  validity,  
at  a  contextually  adequate  level47;  (ii)  utility48,  including  cost-­‐feasibility  and  comprehensi-­‐
bility  (usually  to  clients,  audiences,  and  stakeholders)  of  both  the  main  conclusions  about  
the  m/w/s  of  the  evaluand,  and  the  recommendations,  if  any;  and  also  any  utility  arising  
from  generalizability  e.g.,  of  novel  methodological  approaches;  (iii)  credibility  (to  select  
stakeholders,  especially  funders,  regulatory  agencies,  and  usually  also  to  program  staff);  
(iv)  comparative  cost-­‐effectiveness,  which  goes  beyond  utility  to  require  consideration  of  
alternative  possible  evaluation  approaches;  (v)  robustness,  i.e.,  the  extent  to  which  the  ev-­‐
aluation  is  immune  to  variations  in  context,  measures  used,  point  of  view  of  the  evaluator  
etc;  and  (vi)  ethicality/legality,  which  includes  such  matters  as  avoidance  of  conflict  of  in-­‐

46 The 'connotative implications' are in the sub-explicit but supra-symbolic realm of communication, manifested
in—to give a small example—the use of gendered or genderless language.
47 This means that when balance of evidence is all that’s called for (e.g., because a decision has to be made fast) it’s
irrelevant that proof of the conclusion beyond any reasonable doubt was not supplied.
48 Utility is usability and not actual use, the latter—or its absence—being at best a probabilistically sufficient but not

necessary condition for the former, since it may have been very hard to use the results of the evaluation, and
utility/usability requires (reasonable) ease of use. Failure to use the evaluation may be due to base motives or stu-
pidity or an act of God and hence is not a valid indicator of lack of merit.

38 scriven, 18 April, 2011


DRAFT ONLY; NOT TO BE COPIED OR TRANSMITTED WITHOUT PERMISSION

terest49  and  protection  of  the  rights  of  human  subjects—of  course,  this  affects  credibility,  
but  is  not  exactly  the  same  since  the  ethicality  may  be  deeply  flawed….  There  are  several  
ways  to  go  about  meta-­‐evaluation.  You  and  later  another  meta-­‐evaluator  can:  (a)  apply  the  
KEC  or  PES  or  GAO  list  (preferably  one  or  more  that  was  not  used  to  do  the  evaluation)  to  
the  evaluation  itself  (e.g.,  the  Cost  checkpoint  in  the  KEC  then  addresses  the  cost  of  the  ev-­‐
aluation,  not  the  program,  and  so  on  for  all  checkpoints);  and/or  (b)  use  a  special  meta-­‐
evaluation  checklist  (there  are  several  available,  including  the  one  sketched  in  the  previous  
sentence,  which  is  sometimes  called  the  Meta-­‐Evaluation  Checklist  or  MEC50);  and/or  (c)  if  
funds  are  available,  replicate  the  evaluation,  doing  it  in  the  same  way,  and  compare  the  re-­‐
sults;  and/or  (d)  do  the  same  evaluation  using  a  different  methodology  and  compare  the  
results.  It’s  highly  desirable  to  employ  more  than  one  of  these  approaches,  and  all  are  likely  
to  require  supplementation  with  some  attention  to  conflict  of  interest/rights  of  subjects.    
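A small, optional illustration may help readers who keep their meta-evaluation records electronically. The Python sketch below is not part of the KEC and makes no judgments of its own: it simply lists the six primary criteria of merit and the four approaches (a)-(d) named above as data, and gives the meta-evaluator a place to record a rating and a brief justification for each criterion. The class and function names, the 1-5 scale, and the example entries are illustrative assumptions only, not prescribed by this checklist.

# Minimal bookkeeping sketch for a meta-evaluation record (illustrative only; not part of the KEC).
from dataclasses import dataclass, field

# The six primary criteria of merit for evaluations, as listed under D5.
CRITERIA = (
    "validity",
    "utility",
    "credibility",
    "comparative cost-effectiveness",
    "robustness",
    "ethicality/legality",
)

# The four ways of doing meta-evaluation sketched under D5.
APPROACHES = {
    "a": "apply the KEC, PES, or GAO list to the evaluation itself",
    "b": "use a dedicated meta-evaluation checklist (e.g., the MEC)",
    "c": "replicate the evaluation with the same methodology",
    "d": "redo the evaluation with a different methodology",
}

@dataclass
class MetaEvaluationRecord:
    """One meta-evaluator's record for a single evaluation."""
    evaluation: str                  # the evaluation being meta-evaluated
    meta_evaluator: str              # internal (primary evaluator) or external
    approaches_used: list = field(default_factory=list)  # keys from APPROACHES
    ratings: dict = field(default_factory=dict)          # criterion -> 1 (very weak) .. 5 (very strong)
    notes: dict = field(default_factory=dict)            # criterion -> brief justification

    def rate(self, criterion: str, score: int, note: str = "") -> None:
        """Record a rating (and an optional justification) for one criterion."""
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion!r}")
        if not 1 <= score <= 5:
            raise ValueError("score must be on the illustrative 1-5 scale")
        self.ratings[criterion] = score
        if note:
            self.notes[criterion] = note

    def unrated(self) -> list:
        """Criteria not yet addressed -- a reminder that all six need explicit attention."""
        return [c for c in CRITERIA if c not in self.ratings]

# Hypothetical use: an internal meta-evaluation, partly completed.
record = MetaEvaluationRecord(
    evaluation="Evaluation of a hypothetical literacy program",
    meta_evaluator="primary evaluator (internal)",
)
record.approaches_used = ["a", "b"]
record.rate("validity", 4, "balance-of-evidence standard fits the decision deadline")
record.rate("utility", 3, "conclusions clear, but report length limits comprehensibility")
print("Approaches used:", ", ".join(APPROACHES[k] for k in record.approaches_used))
print("Criteria still to be rated:", record.unrated())

Nothing in this sketch replaces the judgments described above; it is only a convenient way to make sure all six criteria, and the choice of meta-evaluation approach, get explicit attention.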
Note D5.1: 'Literal' or 'direct' use is not a concept clearly applicable to evaluations without recommendations, a category that includes many important, complete, and influential evaluations: evaluations are not in themselves recommendations. 'Due consideration' or 'utilization' is a better generic term for the ideal response to a good evaluation. Failure to use an evaluation's results is often due to bad, perhaps venal, management, and so can never be regarded as an indicator of poor utility without further evidence.
Note  D5.2:  Evaluation  impacts  often  occur  years  after  completion  and  often  occur  even  if  
the  evaluation  was  rejected  completely  when  submitted.  Evaluators  too  often  give  up  their  
hopes  of  impact  too  soon.  
Note  D5.3:  Help  with  utilization  beyond  submitting  the  report  should  at  least  have  been  
offered—see  Checkpoint  D4.    
Note D5.4: Look for contributions from the evaluation to the client organization's knowledge management system; if it lacks one, recommend creating one.
Note  D5.5:  Since  effects  of  the  evaluation  are  not  usually  regarded  as  effects  of  the  pro-­‐
gram,  it  follows  that  although  an  empowerment  evaluation  should  produce  substantial  
gains  in  the  staff’s  knowledge  about  and  tendency  to  use  or  improve  evaluations,  that’s  not  
an  effect  of  the  program  in  the  relevant  sense  for  an  evaluator.  Also,  although  that  valuable  
outcome  is  an  effect  of  the  evaluation,  it  can’t  compensate  for  low  validity  or  low  external  
credibility—two  of  the  most  common  threats  to  empowerment  evaluation—since  training  
the  program  staff  is  not  a  primary  criterion  of  merit  for  evaluations.    
Note  D5.6:  Similarly,  one  common  non-­‐money  cost  of  an  evaluation—disruption  of  the  
work  of  program  staff—is  not  a  bad  effect  of  the  program.  It  is  one  of  the  items  that  should  
always  be  picked  up  in  a  meta-­‐evaluation.  Of  course,  it’s  minimal  in  goal-­‐free  evaluation,  
since  the  (field)  evaluators  do  not  talk  to  program  staff.  Careful  design  (of  program  plus  ev-­‐
aluation)  can  therefore  sometimes  bring  these  evaluation  costs  near  to  zero  or  ensure  that  
there  are  benefits  that  more  than  offset  the  cost.  

46 The 'connotative implications' are in the sub-explicit but supra-symbolic realm of communication, manifested in—to give a small example—the use of gendered or genderless language.
47 This means that when balance of evidence is all that's called for (e.g., because a decision has to be made fast), it's irrelevant that proof of the conclusion beyond any reasonable doubt was not supplied.
48 Utility is usability and not actual use, the latter—or its absence—being at best a probabilistically sufficient but not necessary condition for the former, since it may have been very hard to use the results of the evaluation, and utility/usability requires (reasonable) ease of use. Failure to use the evaluation may be due to base motives or stupidity or an act of God, and hence is not a valid indicator of lack of merit.
49 There are a number of cases of conflict of interest of particular relevance to evaluators, e.g., formative evaluators who make suggestions for improvement and then do a subsequent evaluation (formative or summative) of the same program, of which they are now co-authors—or rejected contributor-wannabes—and hence in conflict of interest.
50 Now online at michaelscriven.info.


_________________________________________________________________  
GENERAL  NOTE    8:  The  explanatory  remarks  here  should  be  regarded  as  first  approxima-­‐
tions  to  the  content  of  each  checkpoint.  More  detail  on  some  of  them  and  on  items  men-­‐
tioned in them can be found in one of the following: (i) in the Evaluation Thesaurus, Michael Scriven (4th edition, Sage, 1991), under the checkpoint's name; (ii) in the references cited there; (iii) in the online Evaluation Glossary (2006) at evaluation.wmich.edu, partly written by this author; (iv) in the best expository source now, E. Jane Davidson's Evaluation Methodology Basics (Sage, 2004; 2nd edition projected for 2012); (v) in later editions of this document,
at  michaelscriven.info.  The  above  version  of  the  KEC  itself  is,  however,  in  most  respects  
very much better than the ET one, having been substantially refined and expanded in more than 60 'editions' (i.e., widely circulated or online-posted revisions) since its birth as a two-pager around 1971—16 of them since early 2009—with much-appreciated help from many students
and  colleagues,  including:  Chris  Coryn,  Jane  Davidson,  Rob  Brinkerhoff,  Christian  Gugiu,  
Nadini  Persaud,51  Emil  Posavac,  Liliana  Rodriguez-­‐Campos,  Daniela  Schroeter,  Natasha  
Wilder,  Lori  Wingate,  and  Andrea  Wulf;  with  a  thought  or  two  from  Michael  Quinn  Patton’s  
work.  More  suggestions  and  criticisms  are  very  welcome—please  send  to:  
mjscriv1@gmail.com, with KEC as the first word of the subject line. (Suggestions after 3.28.11 that require significant changes are rewarded not only with an acknowledgment but also with a little
prize:  usually  your  choice  from  my  list  of  duplicate  books.)  
[23,679  words]  

51 Dr. Persaud's detailed comments have been especially valuable: she was a CPA before she became a professional evaluator (but there are not as many changes in the cost section as she thinks are called for, so she is not to blame for any remaining faults).

