Edward Wolfe
  • Iowa City, Iowa, USA


This study seeks to determine whether item features are related to observed differential item functioning (DIF) between computer- and paper-based test delivery media. Examinees responded to 60 quantitative items similar to those found on the GRE General Test in either a computer-based or paper-based medium. Thirty-eight percent of the items were flagged for cross-medium DIF, and post hoc content analyses were performed focusing on the page formatting, mathematical notation, and mathematical content of the items. Findings suggest that differences in page formatting and response processes across the delivery media contribute little to the observed cross-medium DIF; instead, differences in the mathematical notation contained in the item text and in the mathematical content of the items showed the strongest apparent relationships with cross-medium DIF.
This article addresses the problem of improving the measurement quality of a complex performance assessment through principled assessment design. We describe the characteristics and measurement impact of steps taken to improve assessment exercise design, along with modifications to assessor training materials and procedures, between the 1995-1996 and 1996-1997 administrations of the National Board for Professional Teaching Standards Early Childhood/Generalist examination. Specifically, we describe how the revision of this assessment resulted in increases in inter-assessor agreement, internal consistency, and generalizability of scores. All indices we examined improved as a result of the revisions. The results suggest that the limits previously observed on the measurement quality of performance assessments, attributable to the relatively small number of items that contribute to an assessment score, can be altered significantly through attention to assessment design and related scoring process...
Achieving levels of reliability that allow large-scale essay assessments to be used to guide educational policy is a major hurdle for test developers. Previous studies have shown that one influential source of measurement error associated with essay scores is rater idiosyncrasies (Engelhard, 1994). Although the literature dealing with scoring cognition is not conclusive about why some scorers are more consistent than others, it offers some insight into variables that may account for individual differences in scoring competence (Pula & Huot, 1993; Wolfe & Feltovich, 1994).
The Test of English as a Foreign Language (TOEFL) contains a direct writing assessment, and examinees are given the option of composing their responses at a computer terminal using a keyboard or composing their responses in handwriting. This study sought to determine whether examinees from different demographic groups choose handwriting versus word-processing composition media with equal likelihood. The relationship between several demographic characteristics of examinees and their composition medium choice on the TOEFL writing assessment is examined using logistic regression. Females, speakers of languages based on non-Roman/Cyrillic character systems, examinees from Africa and the Middle East, and examinees with less proficient English skills were more likely to choose handwriting. Although there were only small differences between age groups with respect to composition medium choice in most geographic regions, younger examinees from Europe and older examinees from Asia were more ...
This paper discusses how common patterns of rater error may be detected in a large-scale performance assessment setting. Common rater effects are identified, and a scaling method that can be used to detect them in operational data sets is presented. Simulated data sets are generated to exhibit each of these rater effects. The three continua that depict the most commonly cited rater effects are: (1) accuracy/randomness; (2) harshness/leniency; and (3) centrality/extremism. Rasch measurement theory provides one way of examining these rater effects within a normative framework. Rasch measurement places each facet of the measurement context on a common underlying linear scale, resulting in measures that can be subjected to traditional statistical analyses while allowing for unambiguous substantive interpretations of the meaning of examinee performance as it relates to rater performance and task functioning. In addition, Rasch calibrations of examinees, tasks, and raters are sample free in that...
Wolfe, Edward W., & Chiu, Chris W. T. (1997, March). Measuring Change over Time with a Rasch Rating Scale Model. Paper presented at the Annual Meeting of the American Educational Research Association (Chicago, IL, March 24-28, 1997). 48 pp. Sponsored by the American College Testing Program, Iowa City, Iowa. Descriptors: Change; Item Response Theory; Measurement Techniques; Portfolio Assessment; Portfolios (Background Materials); Probability; Rating Scales; Teachers; Time. Identifiers: Additive Models; Anchoring Devices; Calibration; FACETS Computer Program; Linear Models.
The two studies described here compare essays composed on word processors to those composed with pen and paper for a standardized writing assessment. The following questions guided these studies: (1) Are there differences in test administration and writing processes associated with handwritten versus word-processed writing assessments? and (2) Are there differences in how raters evaluate essays in handwritten versus word-processed format? Study 1 revealed some differences in the manner in which students approach writing essays when given a choice between the two formats. Study 2 revealed differences in the manner in which raters score essays in each format.
This study examined the influence of rater training and scoring context on training time, scoring time, qualifying rate, quality of ratings, and rater perceptions. One hundred twenty raters participated in the study, each experiencing one of three training contexts: (a) online training in a distributed scoring context, (b) online training in a regional scoring context, and (c) stand-up training in a regional scoring context. After training, raters assigned scores to qualification sets, scored 400 student essays, and responded to a questionnaire that measured their perceptions of the effectiveness of, and satisfaction with, the training and scoring process, materials, and staff. The results suggest that the only clear difference in outcomes for these three groups of raters concerned training time: online training was considerably faster. There were no clear differences between groups in qualification rate, rating quality, or rater perceptions.
Constructed-response items are commonly used in educational and psychological testing, and the answers to those items are typically scored by human raters. In current rater monitoring processes, validity scoring is used to ensure that the scores assigned by raters do not deviate severely from the standards of rating quality. In this article, an adaptive rater monitoring approach that may improve the efficiency of current rater monitoring practice is proposed. Based on the Rasch partial credit model and known developments in multidimensional computerized adaptive testing, two essay selection methods are proposed: the D-optimal method and the Single Fisher information method. These methods select the most appropriate essays based on what is already known about a rater's performance. Simulation studies, using a simulated essay bank and a cloned real essay bank, show that the proposed adaptive rater monitoring methods can recover rater parameters with...
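A minimal sketch of the selection idea under simplifying assumptions: the rater is characterized by a single severity parameter, so D-optimality (maximizing the determinant of the Fisher information matrix, as in the multidimensional methods the abstract names) reduces to maximizing scalar information. The essay bank, step difficulties, and helper names below are hypothetical, not the authors' implementation.

```python
import numpy as np

def essay_information(theta, thresholds):
    """Fisher information of a partial-credit-scored essay for a rater
    located at theta; thresholds are the essay's step difficulties."""
    # Partial credit model category probabilities (softmax of partial sums)
    steps = np.concatenate(([0.0], np.cumsum(theta - thresholds)))
    probs = np.exp(steps - steps.max())
    probs /= probs.sum()
    cats = np.arange(len(probs))
    mean = (cats * probs).sum()
    return ((cats - mean) ** 2 * probs).sum()  # score variance = information

def pick_next_essay(theta_hat, bank, administered):
    """Greedy step: choose the unadministered essay that is most
    informative about the rater's current severity estimate."""
    candidates = {eid: essay_information(theta_hat, th)
                  for eid, th in bank.items() if eid not in administered}
    return max(candidates, key=candidates.get)

# Hypothetical bank: essay id -> step difficulties on the expert scale
bank = {1: np.array([-1.0, 0.0, 1.0]),
        2: np.array([0.5, 1.0, 1.5]),
        3: np.array([-2.0, -1.0, 0.0])}
print(pick_next_essay(theta_hat=0.3, bank=bank, administered={2}))
```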
Previous research has investigated the influence of sample size, model misspecification, test length, ability distribution offset, and generating model on the likelihood ratio (LR) difference test in applications of item response models. This study extended that research to the evaluation of dimensionality using the multidimensional random coefficients multinomial logit model (MRCMLM). Logistic regression analysis of simulated data reveals that sample size and test length have a large effect on the capacity of the LR difference test to correctly identify unidimensionality, with shorter tests and smaller sample sizes leading to smaller Type I error rates. Higher levels of simulated misfit resulted in fewer incorrect decisions than data with no or little misfit. However, the Type I error rates indicate that the LR difference test is not suitable under any of the simulated conditions for evaluating dimensionality in applications of the MRCMLM.
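For readers unfamiliar with the statistic, the LR difference test for nested IRT models has the standard form (standard notation, not taken from the article itself):

```latex
G^2 = -2\left[\ln L(\hat{\theta}_{\text{uni}}) - \ln L(\hat{\theta}_{\text{multi}})\right]
\;\sim\; \chi^2_{df}, \qquad df = p_{\text{multi}} - p_{\text{uni}},
```

where the restricted model is the unidimensional MRCMLM, the general model is the multidimensional one, and p counts each model's free parameters; unidimensionality is rejected when G^2 exceeds the chi-square critical value.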
Cognitive radios (CRs) are recent technological developments that rely on artificial intelligence to adapt a radio's performance to suit environmental demands, such as sharing radio frequencies with other radios. Measuring the performance of the cognitive engines (CEs) that underlie a CR's performance is a challenge for those developing CR technology. This simulation study illustrates how the Rasch model can be applied to the evaluation of CRs. We simulated the responses of 50 CEs to 35 performance tasks and applied the Multidimensional Random Coefficients Multinomial Logit Model (MRCMLM) to those data. Our results indicate that CEs based on different algorithms may exhibit differential performance across manipulated performance task parameters. A multidimensional mixture model may provide the best fit to the simulated data, and the two simulated algorithms may respond differently to tasks that emphasize achieving high data throughput with little emphasis on power conservation than they do to other combinations of performance task characteristics.
This article describes a procedure for evaluating item-level non-response bias in questionnaire items. Specifically, logistic regression is used to determine whether non-responses are random or systematic in nature for one question, concerning drug use behaviors, from the National Educational Longitudinal Study of 1994. The results indicate that non-responses are indeed systematic: males and lower-achieving students were more likely to contribute to non-response, along with two-way interactions between ethnicity and SES and between ethnicity and geographic region. In addition, the magnitude of the potential bias is estimated, demonstrating that parameter estimates obtained by assuming the data are missing at random may be extremely biased. Finally, several steps are suggested for evaluating the threat of non-response bias in survey research.
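A minimal sketch of this kind of item-level non-response screen. The file name, variable names, and codings below are hypothetical stand-ins for the survey extract, not the article's actual data or code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical extract: one row per respondent, with the drug-use item
# left as NaN wherever the respondent skipped it.
df = pd.read_csv("nels_extract.csv")
df["nonresponse"] = df["drug_use_item"].isna().astype(int)

# Model the non-response indicator on demographics, including the two-way
# interactions the article reports (ethnicity x SES, ethnicity x region).
model = smf.logit(
    "nonresponse ~ C(sex) + achievement"
    " + C(ethnicity) * ses + C(ethnicity) * C(region)",
    data=df,
).fit()
print(model.summary())  # significant predictors signal systematic non-response
```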
Historically, rule-of-thumb critical values have been employed for interpreting fit statistics that depict anomalous person and item response patterns in applications of the Rasch model. Unfortunately, prior research has shown that these values are not appropriate in many contexts. This article introduces a bootstrap procedure for identifying reasonable critical values for Rasch fit statistics and compares the results of that procedure to applications of rule-of-thumb critical values for three example datasets. The results indicate that rule-of-thumb values may over- or under-identify the number of misfitting items or persons.
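One way to realize the bootstrap procedure the article names, sketched for a dichotomous Rasch model and the outfit statistic; the helper functions, replication count, and percentile choices are illustrative assumptions, not the article's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_responses(theta, b):
    """Simulate dichotomous Rasch responses for person measures theta
    and item difficulties b; returns responses and model probabilities."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(float), p

def outfit(X, p):
    """Unweighted (outfit) mean square fit statistic for each item."""
    z2 = (X - p) ** 2 / (p * (1.0 - p))
    return z2.mean(axis=0)

def bootstrap_critical_values(theta_hat, b_hat, reps=1000, alpha=0.05):
    """Build the sampling distribution of the fit statistic under the
    hypothesis that the Rasch model holds for the estimated parameters,
    then read off empirical critical values for each item."""
    stats = np.array([outfit(*simulate_responses(theta_hat, b_hat))
                      for _ in range(reps)])
    lo = np.percentile(stats, 100 * alpha / 2, axis=0)
    hi = np.percentile(stats, 100 * (1 - alpha / 2), axis=0)
    return lo, hi  # compare observed outfit to these, not to rules of thumb
```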
When latent trait models are used to measure change across time, it is difficult to disentangle changes in one facet of the measurement context from changes in other facets; hence, it is difficult to diagnose change. Wright (1999b) proposed an algorithm for disentangling change, and the authors previously applied this algorithm to measuring change across two occasions (Wolfe & Chiu, 1999). In this article, we extend Wright's algorithm to disentangle changes in measures across three occasions. We describe a standard Rasch rating scale analysis of a multi-occasion evaluation that produces confusing results when subjected to a series of "separate" calibrations. Then, we apply Wright's correction to the same data to show that the algorithm reveals changes that are more similar to ones that would be expected. Our demonstration shows that Wright's procedure can reduce misfit to the Rasch rating scale model as well as change the interpretation of change within the measurement context.
This article reports the results of an application of the Multidimensional Random Coefficients Multinomial Logit Model (MRCMLM) to the measurement of professional development activities in which community college administrators participate. The analyses focus on confirmation of the factorial structure of the instrument, evaluation of the quality of the activity calibrations, examination of the internal structure of the instrument, and comparison of groups of administrators. The dimensionality analysis results suggest a five-dimensional model that is consistent with previous literature concerning career paths of community college administrators: education and specialized training, internal professional development and mentoring, external professional development, employer support, and seniority. The indicators of the quality of the activity calibrations suggest that measures of the five dimensions are adequately reliable, that the activities in each dimension are internally consistent, and that the observed responses to each activity are consistent with the expected values of the MRCMLM. The hierarchy of administrator measure means and of activity calibrations is consistent with substantive theory relating to professional development for community college administrators. For example, readily available activities that occur at the institution were the most likely to be engaged in by administrators, while participation in selective specialized training institutes was the least likely activity. Finally, group differences with respect to age and title were consistent with substantive expectations: the greater the administrator's age and the higher the rank of the administrator's title, the greater the probability of having engaged in various types of professional development.
This paper compares the results of applications of the Multidimensional Random Coefficients Multinomial Logit Model (MRCMLM) to comparable Structural Equation Model (SEM) applications for the purpose of conducting a Confirmatory Factor Analysis (CFA). We review SEM as it is applied to CFA, identify some parallels between the MRCMLM approach to CFA and that utilized in a standard SEM CFA, and illustrate the comparability of MRCMLM and SEM CFA results for three datasets. Results indicate that the two approaches tend to identify similar dimensional models as exhibiting best fit and provide comparable depictions of latent variable correlations, but the two procedures depict the reliability of measures differently.
A recent mandate issued by the National Athletic Trainers' Association Board of Certification (NATABOC) stated that, beginning in 2004, a student must graduate from a program accredited by the Commission on Accreditation of Allied Health Education Programs (CAAHEP) in order to qualify for the NATABOC exam. The content of this exam is based on the National Athletic Trainers' Association (NATA) Athletic Training Educational Competencies. These 542 competencies in 12 different domains were developed through role delineation studies, with the most recent edition published in 1999. These competencies must therefore be included in each athletic training curriculum program across the country in order to prepare students for certification and to achieve program accreditation. Concern over the large number of competencies to be attained within the educational time frame created the need to develop an instrument to examine this issue. In response, instruments were developed to examine one domain of the NATA Educational Competencies. Specifically, the General Medical Conditions and Disabilities competencies were assessed for their perceived importance and measurability by certified athletic trainers and sports medicine physicians. This article reports the results of a validation study of an instrument designed to measure the perceived measurability and importance of the NATA Athletic Training Educational Competencies. Generally, the results are encouraging. The data support six constructs, and each of these constructs exhibits high reliability. Relative competency calibrations within and between scales were consistent with theory, and ratings assigned by different groups of trainers were comparable.
The number of performance assessments continues to increase around the world, and it is important to explore new methods for evaluating the quality of ratings obtained from raters. This study describes an unfolding model for examining rater accuracy. Accuracy is defined as the difference between observed and expert ratings. Dichotomous accuracy ratings (0 = inaccurate, 1 = accurate) are unfolded into three latent categories: inaccurate below expert ratings, accurate ratings, and inaccurate above expert ratings. The hyperbolic cosine model (HCM) is used to examine dichotomous accuracy ratings from a statewide writing assessment. This study suggests that HCM is a promising approach for examining rater accuracy, and that the HCM can provide a useful interpretive framework for evaluating the quality of ratings obtained within the context of rater-mediated assessments.
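The hyperbolic cosine model referenced here is conventionally written as follows (this is the standard Andrich & Luo, 1993, parameterization; which variant the study used is an assumption):

```latex
P(X_{ni} = 1) \;=\; \frac{\exp(\gamma_i)}{\exp(\gamma_i) + 2\cosh(\theta_n - \delta_i)},
```

where theta_n locates rater n on the accuracy continuum, delta_i locates rating opportunity i, and gamma_i is a unit parameter governing how far a rater can sit from delta_i and still be likely to produce an accurate (1) rating; inaccurate (0) ratings unfold to either side, below or above the expert rating.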
A large body of literature exists describing how rater effects may be detected in rating data. In this study, we compared the flag and agreement rates for several rater effects based on calibration of real data under two psychometric models: the Rasch rating scale model (RSM) and the Rasch testlet-based rater bundle model (RBM). The results show that the RBM provided more accurate diagnoses of rater severity and leniency than did the RSM, which is based on the local independence assumption. However, the statistical indicators associated with rater centrality and inaccuracy remained consistent between the two models.
The purpose of this two-part paper is to introduce researchers to the many-facet Rasch measurement (MFRM) approach for detecting and measuring rater effects. The researcher will learn how to use the Facets (Linacre, 2001) computer program to study five effects: leniency/severity, central tendency, randomness, halo, and differential leniency/severity. Part 1 of the paper provides critical background and context for studying MFRM. We present a catalog of rater effects, introducing effects that researchers have studied over the last three-quarters of a century in order to help readers gain a historical perspective on how those effects have been conceptualized. We define each effect and describe various ways the effect has been portrayed in the research literature. We then explain how researchers theorize that the effect impacts the quality of ratings, pinpoint various indices they have used to measure it, and describe various strategies that have been proposed to try to minimize its impact on the measurement of ratees. The second half of Part 1 provides conceptual and mathematical explanations of many-facet Rasch measurement, focusing on how researchers can use MFRM to study rater effects. First, we present the many-facet version of Andrich's (1978) rating scale model and identify questions about a rating operation that researchers can address using this model. We then introduce three hybrid MFRM models, explain the conceptual distinctions among them, describe how they differ from the rating scale model, and identify questions about a rating operation that researchers can address using these hybrid models.
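The many-facet rating scale model at the core of Part 1 is conventionally written in log-odds form (standard notation in the MFRM literature, e.g., Linacre; not copied from the paper itself):

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) \;=\; \theta_n - \delta_i - \lambda_j - \tau_k,
```

where theta_n is the ability of ratee n, delta_i the difficulty of task i, lambda_j the severity of rater j, and tau_k the threshold for scoring in category k rather than k-1. The hybrid models the paper describes extend this basic form; one common hybrid, for example, allows the threshold structure tau_k to vary by rater.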
This article summarizes a simulation study of the performance of five item quality indicators (the weighted and unweighted versions of the mean square and standardized mean square fit indices and the point-measure correlation) under relatively high and low amounts of missing data and under both random and conditional missing data patterns, for testing contexts such as operational administrations of a computerized adaptive certification or licensure examination. The results suggest that weighted fit indices, particularly the standardized mean square index, and the point-measure correlation provide the most consistent information between random and conditional missing data patterns and that these indices perform more comparably for items near the passing score than for items with extreme difficulty values.
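To make the indicators concrete, here is a sketch of how the weighted (infit) and unweighted (outfit) mean squares and the point-measure correlation are typically computed from Rasch residuals with missing data, as in adaptive testing. The formulas follow standard practice rather than the article's code, and the standardized versions of the mean squares are omitted.

```python
import numpy as np

def item_fit(X, p, theta):
    """Item fit indicators for dichotomous Rasch data. X: persons-by-items
    responses with np.nan for unadministered items; p: model probabilities;
    theta: person measures."""
    admin = ~np.isnan(X)                          # observed responses only
    resid = np.where(admin, X - p, 0.0)           # score residuals
    info = np.where(admin, p * (1.0 - p), 0.0)    # response variance

    # Outfit (unweighted mean square): mean squared standardized residual
    with np.errstate(divide="ignore", invalid="ignore"):
        z2 = np.where(info > 0, resid ** 2 / info, 0.0)
    outfit = z2.sum(axis=0) / admin.sum(axis=0)

    # Infit (weighted mean square): residuals weighted by information
    infit = (resid ** 2).sum(axis=0) / info.sum(axis=0)

    # Point-measure correlation over administered responses only
    ptm = np.array([np.corrcoef(X[admin[:, i], i], theta[admin[:, i]])[0, 1]
                    for i in range(X.shape[1])])
    return infit, outfit, ptm
```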
Diet is associated with 5 of the 10 leading causes of death in the U.S., including coronary heart disease, certain types of cancer, atherosclerosis, and type 2 diabetes. Physicians can play a pivotal role in promoting nutritional management of diabetes and other chronic diseases. Therefore, it is important that valid instruments are created so administrators can better assess the educational needs of prospective physicians, their practices, and patient outcomes. Two comparable studies, one year apart, were undertaken to create an instrument that measures nutritional competence and self-efficacy among prospective physicians. This paper: (a) describes the development of a nutrition self-efficacy scale (NSES) and (b) demonstrates reliability and validity of the NSES using Rasch modeling. It concludes with a discussion of potential contributions of this scale for assessing mastery of applied nutrition among prospective physicians.
When measures are taken on the same individual over time, it is difficult to determine whether observed differences are the result of changes in the person or changes in other facets of the measurement situation (e.g., interpretation of items or use of rating scale). This paper describes a method for disentangling changes in persons from changes in the interpretation of Likert-type questionnaire items and the use of rating scales (Wright, 1996a). The procedure relies on anchoring strategies to create a common frame of reference for interpreting measures that are taken at different times and provides a detailed illustration of how to implement these procedures using FACETS.
The Test of English as a Foreign Language (TOEFL) contains a direct writing assessment, and examinees are given the option of composing their responses at a computer terminal using a keyboard or composing their responses in handwriting. This study sought to determine whether performance on a direct writing assessment is comparable for examinees when given the choice to compose essays
