    Madhabi Chatterji

    Purpose We developed EmergenCSim™, a serious game (SG) with an embedded assessment, to teach and assess performing general anesthesia for cesarean delivery. We hypothesized that first-year anesthesiology trainees (CA-1) playing EmergenCSim™ would yield superior knowledge scores versus controls, and EmergenCSim™ and high-fidelity simulation (HFS) assessments would correlate. Methods This was a single-blinded, longitudinal randomized experiment. Following a lecture (week 3), trainees took a multiple-choice question (MCQ) test (week 4) and were randomized to play EmergenCSim™ (N = 26) or a non-content specific SG (N = 23). Participants repeated the MCQ test (week 8). Between month 3 and 12, all repeated the MCQ test, played EmergenCSim™ and participated in HFS of an identical scenario. HFS performance was rated using a behavior checklist. Results There was no significant change in mean MCQ scores over time between groups, F(2, 94) = 0.870, p = 0.42, and no main effect on MCQ scores, F ...
    Recognition and treatment of maternal hypotension during epidural anesthesia administration for intrapartum cesarean delivery preserves maternal-fetal perfusion. A case that required quality assurance review uncovered lapses in maternal hemodynamic monitoring during the transition to intrapartum cesarean delivery anesthesia. To address this, a practice outline was designed for trainees' education describing intrapartum epidural dosing for cesarean delivery and adequate blood pressure monitoring. The time-lapse between epidural dosing and subsequent blood pressure was evaluated before and after the introduction of our educational tool. The time-lapse between blood pressure measures decreased to <10 minutes (10.78–13.92 vs 8.8–9.76 minutes).
    This article synthesizes research on standards-based reforms and accountability, with specific attention to purposes, models, and methods of inquiry. Starting with the premise that the reforms were meant to be systemic, the article examines the extent to which studies were guided by designs that explicitly or implicitly acknowledge a system, and evaluates the utility of the designs in generating information to support large-scale systemic changes in education. The article concludes that research efforts on reforms have been largely non-systemic in design and have thereby failed to adequately help individual schools, school systems, and statewide systems to develop in directions that are consistent with the mission of the reform movement.
    Purpose – This policy brief discusses validity and fairness issues that could arise when test-based information is used for making “high stakes” decisions at an individual level, such as, for the certification of teachers or other professionals, or when admitting students into higher education programs and colleges, or for making immigration-related decisions for prospective immigrants. To assist test developers, affiliated researchers and test users enhance levels of validity and fairness with these particular types of test score interpretations and uses, this policy brief summarizes an “argument-based approach” to validation given by Kane. Design/methodology/approach – This policy brief is based on a synthesis of conference proceedings and review of selected pieces of extant literature. To that synthesis, the authors add practitioner-friendly examples with their own analysis of key issues. They conclude by offering recommendations for test developers and test users. Findings – The...
    The Health Information Technology for Economic and Clinical Health (HITECH) initiative of the 2009 American Recovery and Reinvestment Act in the United States was intended to promote meaningful use of electronic health records (EHRs). This article reports on a comprehensive, three-stage model employed to develop, validate, and facilitate regional implementation of a health care information technology curriculum for workforce development as part of that coordinated national effort. Building on needs assessed at the national level, the stages involved: (a) curriculum design, (b) assuring quality of curricular products through validation and revision, and (c) design of a systems-based, curriculum implementation and evaluation protocol. The objective of the project was to prepare health care professionals with competencies necessary to implement EHRs meaningfully, thereby improving patient care. We produced content-validated and usable versions of curriculum goal frameworks, student...
    Data from a developmental assessment comprised of 9 short answer mathematics tasks were validated using classical and three-faceted Rasch measurement methods. Field test data from a mixed-age elementary school sample (N = 280) were analyzed. Descriptive statistics on scores from the overall scale and two subdomains indicated improved performance with age. The data showed better fit with a two-factor model corresponding with the subdomain structure (Bentler's CFI = .94), than a one-factor model (CFI = .87). The inter-factor correlation was .76. Convergent validity coefficients of scores with scaled scores of the Stanford Achievement Test mathematics battery ranged from .28 to .47; internal consistency reliability of the total and subdomain scores ranged from .87 to .89, respectively; and median inter-rater reliability was .75. On average, persons, tasks and raters showed acceptable fit with the three-facet Rasch model. Rasch logit difficulties of tasks suggested an ordered scale structure, al...
    The predictive properties of the Gesell School Readiness Screening Test, which measures developmental age (DA), were examined in two independent subject samples. Analyses examined the correlational properties of DA with first-grade achievement, incremental validity of DA over chronological age, effects on long-term achievement of 2-year and 1-year kindergarten programs for students who start school with low DAs, and accuracy of classification by DA. Modest positive correlations were found between DA and achievement (.29-.39); DA showed a significant increment in prediction over chronological age. Correlations were not sufficient to enable accurate classification of mature and immature students. Students with low DAs were found to be misclassified most frequently. Inadequate support was found for the 2-year treatment of students who start school with initially low DAs.
    Abstract The purpose of this study was to describe a dropout prevention program for ninth-grade students and examine its effects on grade point average, reading and math achievement, school attendance, and dropout rate. A random sample of ninth-grade students was drawn from six high schools for each of 3 years that the program was in operation and was compared with a sample of students in the year prior to the program. Analysis of the program services across the school district revealed an emphasis on helping the students with academics and study skills. Significant positive effects were found on school attendance and dropout rate.
    This study describes psychometric investigations of a developmental assessment in mathematics composed of 19 performance tasks. Techniques from classical and the many-faceted Rasch approaches were combined to analyze field test data from a mixed-age sample (n = 110). Descriptive statistics on scores from the overall scale and subdomains indicated increased proficiency with age. Convergent validity coefficients of scores with scaled scores of the Stanford Achievement Test mathematics battery ranged from .22 to .57; internal consistency reliability of the total scores on proficiency and independence were .90 and .92, respectively; and median interrater reliability was .80. Classical item statistics and Rasch logit difficulties of tasks suggested an ordered scale structure, although eight tasks did not fit Rasch criteria. The original and calibrated task orderings were consistent at the extreme ends of the scale. Findings point to directions for future improvement of the scale and rate...
    Educational Evaluation and Policy Analysis December 2009, Vol. 31, No. 4, pp. 507-509 DOI: 10.3102/0162373709355240 © 2009 AERA. http://eepa.aera.net ... Malcolm Abbott Swinburne University of Technology, Australia Tommaso Agasisti Politecnico di Milano Karl L. Alexander Johns Hopkins University Elaine Allensworth University of Chicago Thomas Alsbury North Carolina State University Dorothea Anagnostopoulos Michigan State College, College of Education Nicole Arshan Stanford University Hanna Ayalon Tel Aviv University Marigee Bacolod University ...
    Scores from the Teacher Readiness for Educational Reforms (TRER) instrument were validated using a six-phase, iterative model. Initial conceptualization, content validation, and pilot testing yielded a 61-item instrument with seven subdomains (Phases 1-4). Exploratory work (Phase 5) using principal axis factor extraction supported a five-factor structure (Data Set 1, n = 393). Further exploration with a five-factor free-path model and a more constrained structural model yielded satisfactory fit values (Bentler’s comparative fit indexes of .94 and .93, respectively). Deletion or collapsing of items in Phase 5 yielded a refined TRER with 43 items. Confirmatory work (Phase 6) with a new data set (n = 392) showed little slippage in fit. Cronbach’s alpha values ranged from .78 to .96 in final subdomain scores.
    What is the news on Florida’s student performance since the passage of the A+ legislation? This brief presents a review of long-term data on five state and national indicators. It verifies outcomes and trends and examines the main premise of the A+ mandate that, given appropriate schooling, students will have equitable outcomes and access to opportunities. The data show positive outcomes and steady gains at the elementary level; the pattern, however, is not sustained at the secondary level. Good News at Elementary Level in all Subjects Elementary achievement trends show the strongest gains and performance in writing, and steady improvements over time in reading and mathematics. Findings concur on the FCAT-CRT, FCAT-NRT, and the NAEP. Bad News on Secondary Level Reading
    Background Much is still unknown or unclear about how and where validity issues arise in high stakes testing situations in education, and ways by which we can rectify validity problems in practice and policy contexts. Purpose This paper is the Foreword to the special issue of the Teachers College Record, When Education Measures Go Public – Stakeholder Perspectives on How and Why Validity Breaks Down. Method The paper analyzes a recent case involving an application of the SAT to highlight tensions between validity and test score use in high stakes school accountability environments driven by the No Child Left Behind (NCLB) Act of 2001. It uses the case study as a vehicle to introduce the individual papers and authors in the section. Conclusions There are information and power gaps among those who set societal priorities for using tests for high stakes purposes, those who design and conduct psychometric research on tests and testing programs, and those who could eventually face conseq...
    Date Presented Accepted for AOTA INSPIRE 2021 but unable to be presented due to online event limitations. The Hand–Object Observation Tool is a valid systematic observational tool developed to measure upper extremity functioning in children with bilateral cerebral palsy. Using a valid systematic strategy to observe hand–object interaction during daily activities provides clinicians with the much-needed information regarding upper extremity function and optimal object placement to facilitate hand use, task completion, and participation in daily activities. Primary Author and Speaker: Amanda Sarafian Contributing Authors: Katherine Dimitropoulou, Madhabi Chatterji, and Andrew Gordon
    Purpose The purpose of this study was to design and iteratively improve the quality of survey-based measures of three non-cognitive constructs for Grade 5-6 students, keeping in mind information needs of users in education reform contexts. The constructs are: Mathematics-related Self-Efficacy, Self-Concept, and Anxiety (M-SE, M-SC, and M-ANX). Design/methodology/approach The authors applied a multi-stage, iterative and user-centered approach to design and validate the measures, using several psychometric techniques and three data samples. They evaluated the utility of student-level scores and aggregated, classroom-level means. Findings At both student and classroom levels, replicated evidence supported theoretically-grounded validity arguments on information produced by four of five scales tapping M-SC, M-ANX and M-SE. The evidence confirmed a second order, two-factor structure for M-SC, representing positive math affect and perceived competence, and a one factor structure for M-ANX...
    In light of the NCLB Act of 2001, this study estimated mathematics achievement gaps in different subgroups of kindergartners and first graders, and identified child- and school-level correlates and moderators of early mathematics achievement. A subset of 2300 students nested in 182 schools from the Early Childhood Longitudinal Study K-First Grade data set was analyzed with hierarchical linear models. Relative to school mean estimates at the end of kindergarten, significant mathematics achievement gaps were found in Hispanics, African Americans and high poverty students. At the end of Grade 1, mathematics gaps were significant in African American, high poverty, and female subgroups, but not in Hispanics. School-level correlates of Grade 1 Mathematics achievement were class size (with a small negative main effect), at-home reading time by parents (with a large positive main effect) and school size (with a small positive main effect). Cross-level interactions in Grade 1 indicated that ...
    The No Child Left Behind (NCLB) Act of 2001 requires that public schools adopt research-supported programs and practices, with a strong recommendation for randomized controlled trials (RCTs) as the “gold standard” for scientific rigor in empirical research. Within that policy framework, this paper compares the relative utility of federally-recommended RCT versus the demonstrated extended term mixed-method (ETMM) designs as options for monitoring effects of novel programs in real-time field settings. Guided by the program’s theory of action, a year-long, two-phase study was conducted to monitor the context, processes and early outcomes of an after-school supplemental program in a New York elementary school. In both phases, the design combined a matched-groups, quasi-experiment with qualitative classroom observations and descriptive surveys. Early findings showed some positive, albeit “gross” program effects. Although findings are tentative, the ETMM approach enhanced interpretations ...
    Purpose – The purpose of this article is to present alternative views on the theory and practice of formative assessment (FA), or assessment to support teaching and learning in classrooms, with the purpose of highlighting its value in education and informing discussions on educational assessment policy. Methodology/approach – The method used is a “moderated policy discussion”. The six invited commentaries on the theme represent perspectives of leading scholars and measurement experts juxtaposed against voices of prominent school district leaders from two education systems in the USA. The discussion is moderated with introductory and concluding remarks from the guest editor and is excerpted from a recent blog published by Education Week. References and author biographies are presented at the end of the article. Findings – While current assessment policies in the USA push for greater accountability in schools by increasing large scale testing of students, the authors underscore the importanc...
    This study examines validity of data generated by the School Readiness for Reforms: Leader Questionnaire (SRR-LQ) using an iterative procedure that combines classical and Rasch rating scale analysis. Following content-validation and pilot-testing, principal axis factor extraction and promax rotation of factors yielded a five factor structure consistent with the content-validated subscales of the original instrument. Factors were identified based on inspection of pattern and structure coefficients. The rotated factor pattern, inter-factor correlations, convergent validity coefficients, and Cronbach's alpha reliability estimates supported the hypothesized construct properties. To further examine unidimensionality and efficacy of the rating scale structures, item-level data from each factor-defined subscale were subjected to analysis with the Rasch rating scale model. Data-to-model fit statistics and separation reliability for items and persons met acceptable criteria. Rating scale...
    To demonstrate a methodology for coding and taxonomy development and to operationally define residents' competence in systems-based practice (SBP) in terms of observable roles, actions, and behaviors. The Accreditation Council for Graduate Medical Education's (ACGME's) full-text definition of SBP and the 6 discrete expectations it contains were content analyzed. Structured interviews of 88 health care professionals using a variant of focus group interviews called nominal group processes were conducted and qualitatively analyzed to identify the key attributes of SBP. Themes obtained from these 2 procedures were conceptually matched and organized to create a taxonomy of observable SBP behaviors and the SBP domain. Six general resident roles emerged, under which 35 specific behavioral attributes were subsumed. From the SBP domain specified, sample SBP items categorized by roles were derived that reflected "in-context" representations of ACGME SBP expectations. Our...
    The complex competency labeled practice-based learning and improvement (PBLI) by the Accreditation Council for Graduate Medical Education (ACGME) incorporates core knowledge in evidence-based medicine (EBM). The purpose of this study was to operationally define a "PBLI-EBM" domain for assessing resident physician competence. The authors used an iterative design process to first content analyze and map correspondences between ACGME and EBM literature sources. The project team, including content and measurement experts and residents/fellows, parsed, classified, and hierarchically organized embedded learning outcomes using a literature-supported cognitive taxonomy. A pool of 141 items was produced from the domain and assessment specifications. The PBLI-EBM domain and resulting items were content validated through formal reviews by a national panel of experts. The final domain represents overlapping PBLI and EBM cognitive dimensions measurable through written, multiple-choice ...
    Purpose – This paper presents a moderated discussion on popular misconceptions, benefits and limitations of International Large-Scale Assessment (ILSA) programs, clarifying how ILSA results could be more appropriately interpreted and used in public policy contexts in the USA and elsewhere in the world. Design/methodology/approach – To bring key issues, points-of-view and recommendations on the theme to light, the method used is a “moderated policy discussion”. Nine commentaries were invited to represent voices of leading ILSA scholars/researchers and measurement experts, juxtaposed against views of prominent leaders of education systems in the USA that participate in ILSA programs. The discussion is excerpted from a recent blog published by Education Week. It is moderated with introductory remarks from the guest editor and concluding recommendations from an ILSA researcher who did not participate in the original blog. References and author biographies are presented at the end of the...
    Purpose – Against a backdrop of high-stakes assessment policies in the USA, this paper explores the challenges, promises and the “state of the art” with regard to designing standardized achievement tests and educational assessment systems that are instructionally useful. Authors deliberate on the consequences of using inappropriately designed tests, and in particular tests that are insensitive to instruction, for teacher and/or school evaluation purposes. Methodology/approach – The method used is a “moderated policy discussion”. The six invited commentaries represent voices of leading education scholars and measurement experts, juxtaposed against views of a prominent leader and nationally recognized teacher from two American education systems. The discussion is moderated with introductory and concluding remarks from the guest editor, and is excerpted from a recent blog published by Education Week. References and author biographies are presented at the end of the article. Findings – ...
    This case study examines the applicability of 1994 standards, offered by the Joint Committee on Standards for Educational Evaluation, to evaluations conducted in international contexts. The work is undertaken in response to an open invitation from the Joint Committee in its 1994 publication. The article addresses two purposes. First, it asks whether the standards in the four broad areas—utility, feasibility, propriety, and accuracy—can be applied as written to guide and monitor evaluation practices in developing countries when the programmatic focus and evaluation models, including relationships among sponsors, program participants, stakeholders, and evaluators, vary significantly from the assumptions underlying the 1994 standards. Second, it develops and refines methods for conducting metaevaluations of international evaluations by analyzing documentary and interview-based data from one case, represented by a series of connected studies on education and health literacy programs in Ba...
    Purpose – This policy brief, the second AERI-NEPC eBrief in the series “Understanding validity issues around the world”, focuses on validity as it applies to test-based models of evaluation employed for schools, instructional programs, and teachers around the world. It discusses validity issues that could arise when data from student achievement test administrations and other sources are used for conducting personnel appraisals, program evaluations, or for external accountability purposes, suggesting solutions and recommendations for improving validity in such applications of test-based information. Design/methodology/approach – This policy brief is based on a synthesis of conference proceedings and review of selected pieces of extant literature. It begins by summarizing perspectives of an invited expert panel on the topic. To that synthesis, the authors add their own analysis of key issues. They conclude by offering recommendations for test developers and test users. Findings – The...
    Purpose – This AERI-NEPC eBrief, the fourth in a series titled “Understanding validity issues around the world”, looks closely at issues surrounding the validity of test-based actions in educational accountability and school improvement contexts. The specific discussions here focus on testing issues in the US. However, the general principles underlying appropriate and inappropriate test use in school reform and high stakes public accountability settings are applicable in both domestic and international settings. This paper aims to present the issues. Design/methodology/approach – This policy brief is based on a synthesis of conference proceedings and review of selected pieces of extant literature. It begins by summarizing perspectives of an invited expert panel on the topic. To that synthesis, the authors add their own analysis of key issues. They conclude by offering recommendations for test developers and test users. Findings – The authors conclude that recurring validity issues arise wit...
    ... settings. We worked to help design an appropriate evaluation plan for the Accelerated School Project that contributes to the empowerment of teachers, parents, students, and administrators (Fetterman & Haertel, 1990). The ...
    Charles Achilles, Eastern Michigan University and Seton Hall University Peter Afflerbach, University of Maryland Patricia Alexander, University of Maryland Richard Allington, University of Tennessee Donna Alvermann, University of ...
    Federal policy tools for gathering evidence on “What Works” in education, such as the What Works Clearinghouse’s (WWC) standards, emphasize randomized field trials as the preferred method for generating scientific evidence on the effectiveness of educational programs. This article argues instead for extended-term mixed-method (ETMM) designs. Emphasizing the need to consider temporal factors in gaining thorough understandings of programs as they take hold in organizational or community settings, the article asserts that formal study of contextual and site-specific variables with multiple research methods is a necessary prerequisite to designing sound field experiments for making generalized causal inferences. A theoretical rationale and five guiding principles for ETMM designs are presented, with suggested revisions to the WWC’s standards.
    Profiles of 2007-2008 Entering Kindergartners on Comprehensive Indicators of School Readiness: A Baseline Study of the Chemung County
    Purpose – This policy brief, the third in the AERI-NEPC eBrief series “Understanding validity issues around the world”, discusses validity issues surrounding International Large Scale Assessment (ILSA) programs. ILSA programs, such as the well-known Programme of International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS), are rapidly expanding around the world today. In this eBrief, the authors examine what “validity” means when applied to published results and reports of programs like the PISA. Design/methodology/approach – This policy brief is based on a synthesis of conference proceedings and review of selected pieces of extant literature. It begins by summarizing perspectives of an invited expert panel on the topic. To that synthesis, the authors add their own analysis of key issues. They conclude by offering recommendations for test developers and test users. Findings – ILSA programs and tests, while offering valuable informatio...
    This article argues with a literature review that a simplistic distinction between strong and weak evidence hinged on the use of randomized controlled trials (RCTs), the federal “gold standard” for generating rigorous evidence on social programs and policies, is not tenable with evaluative studies of complex, field interventions such as those found in education. It introduces instead the concept of grades of evidence, illustrating how the choice of research designs coupled with the rigor with which they can be executed under field conditions, affects evidence quality progressively. It argues that evidence from effectiveness research should be graded on different design dimensions, accounting for conceptualization and execution aspects of a study. Well-implemented, phased designs using multiple research methods carry the highest potential to yield the best grade of evidence on effects of complex, field interventions.