Papers by Anne Corinne Huggins-Manley, Ph.D.
Florida Journal of Educational Research, 2019
High-stakes testing in education often requires the use of cut scores to report achievement. In Florida, cut scores are used to establish different levels of proficiency. Although the Florida Standards Assessments (FSA) program reports accuracy rates for its cut scores, it does not report classification consistency, nor does it report information on the alignment between the high-stakes cut scores and variations in classification quality across a range of possible cut scores. Our purpose is to perform a case study evaluating the alignment between marginal classification accuracy and consistency rates across the ability continuum and the locations of the high-stakes cut scores, and to demonstrate the practical utility of this cut score evaluation method, which was proposed by Wyse and Babcock (2016). We achieved this purpose through the use of a large set of simulated test data samples generated from FSA item and person parameter estimates.
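As an illustration of the kind of evaluation described, here is a minimal Python sketch, assuming hypothetical 2PL item parameters rather than the operational FSA estimates: it simulates two parallel administrations and traces classification accuracy (agreement with true proficiency status) and consistency (agreement between the two forms) across a range of candidate raw cut scores.

    import numpy as np

    rng = np.random.default_rng(42)
    n_examinees, n_items = 10_000, 40
    theta = rng.normal(0, 1, n_examinees)            # true abilities
    a = rng.lognormal(0, 0.3, n_items)               # discriminations
    b = rng.normal(0, 1, n_items)                    # difficulties

    def administer(theta, a, b, rng):
        """Simulate one 2PL administration; return raw scores."""
        p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
        return (rng.random(p.shape) < p).sum(axis=1)

    x1 = administer(theta, a, b, rng)
    x2 = administer(theta, a, b, rng)                # parallel form

    theta_cut = 0.0                                  # true proficiency cut
    truly_proficient = theta >= theta_cut
    for cut in range(15, 31):                        # candidate raw cut scores
        acc = np.mean((x1 >= cut) == truly_proficient)
        con = np.mean((x1 >= cut) == (x2 >= cut))
        print(f"cut={cut:2d}  accuracy={acc:.3f}  consistency={con:.3f}")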
A Commentary on Construct Validity When Using Operational Virtual Learning Environment Data in Effectiveness Studies - Huggins-Manley, A. C., Beal, C. R., D'Mello, S. K., Leite, W. L., Cetin-Berber, D. D., Kim, D., & McNamara, D. S. Journal of Research on Educational Effectiveness, 2019
Virtual learning environments (VLEs) are increasingly used at scale in educational contexts to facilitate teaching and promote learning, and the data they produce can be used for educational research purposes. Meanwhile, the U.S. Department of Education's Office of Educational Technology has repeatedly emphasized the importance of using evidence to validate claims from VLE-based educational research. Although VLE data can provide some affordances for conducting educational research, we argue that many challenges can arise with respect to providing evidence for construct validity. The objective of this commentary is to encourage educational researchers using operational, at-scale VLE data to align their data and intended constructs to a theoretical framework of construct validity threats in order to develop a comprehensive set of actionable solutions. We use examples from our research project as a demonstration resource for performing such an alignment.
Educational and Psychological Measurement, 2019
The purpose of this study is to evaluate whether a recently developed semiordered model can be used to explore the functioning of neutral response options in rating scale data. Huggins-Manley, Algina, and Zhou developed a class of unidimensional models for semiordered data within scale items (i.e., items with both ordered response categories and an additional nominal response category) and found promising results when applying them to scale data with Not Applicable response categories. In this study, we extended the application of the semi-partial credit model (semi-PCM) to evaluate whether it can be used to calibrate potentially unordered neutral responses in rating scale data, and if so, how the approach compares with alternate methods of dealing with the neutral response option. Findings indicate that the semi-PCM can (a) assist practitioners in evaluating the ordered or unordered nature of neutral responses and (b) provide a viable alternative for θ estimation in the presence of an unordered neutral category. The process used in this study also provides a methodological framework for researchers and practitioners to use when dealing with neutral responses in their own data. Full paper can be found at https://journals.sagepub.com/doi/full/
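As a rough illustration of point (a) — and not of the semi-PCM itself — one simple descriptive check for an unordered neutral category is whether the mean rest score of respondents choosing each category increases monotonically; a neutral option chosen irrespective of trait level breaks that progression. All data and values below are simulated for the example.

    import numpy as np

    rng = np.random.default_rng(7)
    n, k_items = 2_000, 10
    trait = rng.normal(0, 1, n)

    # Ordered 1-5 responses driven by the trait, except that some
    # respondents pick the neutral middle category (3) regardless of
    # their trait level -- a deliberately unordered neutral.
    cuts = np.array([-1.5, -0.5, 0.5, 1.5])
    data = np.digitize(trait[:, None] + rng.normal(0, 1, (n, k_items)), cuts) + 1
    neutral_anyway = rng.random((n, k_items)) < 0.15
    data[neutral_anyway] = 3

    item = 0
    rest = data.sum(axis=1) - data[:, item]          # rest score for one item
    for cat in range(1, 6):
        label = " (neutral)" if cat == 3 else ""
        mask = data[:, item] == cat
        print(f"category {cat}{label}: mean rest score = {rest[mask].mean():.2f}")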
Journal of Computer Assisted Learning, 2019
Although the use of technology in the K-12 classroom has been shown to have a positive impact, research on the use of open education resources (OER) is relatively limited, especially research focusing on low-achieving students. The present study examines the relationship between usage of Algebra Nation, a self-guided system that provided instructional videos and practice problems, and the performance of students who had failed the state-administered Algebra I end-of-course (EOC) assessment the previous year. Indicators of usage of Algebra Nation consisted of logins, video views, and practice questions answered. Path analyses and logistic regressions were used to evaluate relationships between usage indicators and algebra scores, controlling for number of absences, free/reduced lunch eligibility, Hispanic/Latino origin, race, and gender. The results indicate that higher levels of logins, video views, and practice questions answered were related to higher scores when the students retook the assessment. Logins and practice questions answered, but not video views, were also related to increased odds of passing the Algebra I EOC assessment. The results suggest that there may be benefits to technology use in the form of an OER adopted by students and teachers on an informal basis, and they link self-regulated learning strategies to student achievement. Full paper can be found at https
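A minimal sketch of the logistic-regression portion of such an analysis, using simulated placeholder data and variable names (not the study's dataset), might look as follows.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 1_500
    usage = rng.poisson([20, 15, 30], size=(n, 3))   # logins, videos, practice
    absences = rng.poisson(5, n)
    frl = rng.integers(0, 2, n)                      # free/reduced lunch flag
    logit = -1 + 0.02 * usage.sum(axis=1) - 0.05 * absences
    passed = rng.random(n) < 1 / (1 + np.exp(-logit))

    X = sm.add_constant(np.column_stack([usage, absences, frl]))
    fit = sm.Logit(passed.astype(int), X).fit(disp=False)
    print(fit.summary(xname=["const", "logins", "videos", "practice",
                             "absences", "frl"]))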
Journal of Applied Statistics, 2019
Item response theory (IRT) models provide an important contribution in the analysis of polytomous items, such as Likert scale items in survey data. We propose a bifactor generalized partial credit model (bifac-GPC model) with flexible link functions (probit, logit, and complementary log-log) for use in the analysis of ordered polytomous item scale data. In order to estimate the parameters of the proposed model, we use a Bayesian approach through the NUTS algorithm and show the advantages of implementing IRT models in the Stan language. We present an application to marketing scale data. Specifically, we apply the model to a dataset of non-users of a mobile banking service in order to highlight the advantages of this model. The results show important managerial implications resulting from consumer perceptions. We provide a discussion of the methodology for this type of data and extensions. Code is available for practitioners and researchers to replicate the application. Full paper can be found at https://www.tandfonline.com/doi/abs/
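For reference, the adjacent-category (logit-link) GPCM category probabilities can be computed as below; this sketch does not reproduce the paper's bifactor structure or its probit and complementary log-log generalizations, and the parameter values are illustrative.

    import numpy as np

    def gpcm_probs(theta, a, b):
        """P(X = k | theta) for one GPCM item with step parameters b."""
        # cumulative sums of a*(theta - b_k), with 0 for the first category
        z = np.concatenate([[0.0], np.cumsum(a * (theta - b))])
        ez = np.exp(z - z.max())                     # numerically stable softmax
        return ez / ez.sum()

    # four response categories, one illustrative examinee at theta = 0.5
    print(gpcm_probs(theta=0.5, a=1.2, b=np.array([-1.0, 0.0, 1.0])))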
Journal of Science Education and Technology, 2019
Numerous studies have been undertaken to design, develop, and provide validity evidence for instruments that measure students' attitudes toward STEM (Science, Technology, Engineering, and Mathematics). This study presents validity evidence for scores produced from the S-STEM measurement tool, used to evaluate changes in attitudes during an educational intervention in a middle school robotics learning environment. All data were collected from middle school students who were involved in a district-wide effort to integrate educational robotics into the classroom. Findings from this study provide not only internal structure validity evidence, but also criterion-related validity evidence for the proposed use of the S-STEM tool. In addition, measurement invariance results revealed that items in the S-STEM had equivalent statistical measurement properties across groups (e.g., grade level). The study provides further evidence that the S-STEM survey is a powerful and useful tool for evaluating student attitude changes during STEM educational programs, offers suggestions for its future implementation, and presents ideas for future STEM instrument development.
Educational and Psychological Measurement, 2018
Multidimensional item response theory (MIRT) models use data from individual item responses to estimate multiple latent traits of interest, making them useful in educational and psychological measurement, among other areas. When MIRT models are applied in practice, it is not uncommon to see that some items are designed to measure all latent traits while other items may only measure one or two traits. In order to clearly express which items measure which traits, and to formulate such relationships as a mathematical function in MIRT models, we applied the concept of the Q-matrix, commonly used in diagnostic classification models, to MIRT models. In this study, we introduced how to incorporate a Q-matrix into an existing MIRT model, and demonstrated benefits of the proposed hybrid model through two simulation studies and an applied study. In addition, we showed the relative ease of modeling educational and psychological data through a Bayesian approach via the NUTS algorithm. Full paper can be found at https://journals.sagepub.com/
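A minimal sketch of the Q-matrix idea, with illustrative values: zeros in the Q-matrix mask an item's loadings so that only the flagged traits enter its response function in a compensatory MIRT model.

    import numpy as np

    Q = np.array([[1, 1],                            # item 1 measures both traits
                  [1, 0],                            # item 2: trait 1 only
                  [0, 1]])                           # item 3: trait 2 only
    A = np.array([[1.2, 0.8],
                  [1.0, 0.5],                        # the 0.5 is masked by Q
                  [0.4, 1.5]])
    d = np.array([0.0, -0.5, 0.5])                   # item intercepts
    theta = np.array([0.3, -0.2])                    # one examinee's traits

    # Q * A zeroes out loadings on traits an item does not measure
    p_correct = 1 / (1 + np.exp(-((Q * A) @ theta + d)))
    print(p_correct)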
Educational and Psychological Measurement, 2018
Routing examinees to modules based on their ability level is a very important aspect of computerized adaptive multistage testing. However, the presence of missing responses may complicate estimation of examinee ability, which may result in misrouting of individuals. Therefore, missing responses should be handled carefully. This study investigated multiple missing data methods in computerized adaptive multistage testing, including two imputation techniques, the use of full information maximum likelihood, and the scoring of missing responses as incorrect. These methods were examined under the missing completely at random, missing at random, and missing not at random frameworks, as well as under other testing conditions. Comparisons were made to baseline conditions in which no missing data were present. The results showed that the imputation and full information maximum likelihood methods outperformed incorrect scoring in terms of average bias, average root mean square error, and correlation between estimated and true thetas. Full paper can be found at https://journals.sagepub.com/doi/full/
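The contrast between treatments can be illustrated with a small sketch, assuming a 2PL form, EAP scoring, and simulated data: missing responses are either dropped from the likelihood or scored as incorrect before routing decisions would be made.

    import numpy as np

    rng = np.random.default_rng(3)
    a, b = np.ones(20), rng.normal(0, 1, 20)         # illustrative 2PL items
    true_theta = 0.8
    p = 1 / (1 + np.exp(-a * (true_theta - b)))
    resp = (rng.random(20) < p).astype(float)
    resp[rng.random(20) < 0.2] = np.nan              # 20% missing at random

    nodes = np.linspace(-4, 4, 81)                   # quadrature grid for EAP
    prior = np.exp(-0.5 * nodes**2)                  # standard normal prior

    def eap(responses):
        """EAP theta estimate using only non-missing responses."""
        pq = 1 / (1 + np.exp(-a * (nodes[:, None] - b)))
        obs = ~np.isnan(responses)
        like = np.prod(np.where(responses[obs] == 1,
                                pq[:, obs], 1 - pq[:, obs]), axis=1)
        post = like * prior
        return np.sum(nodes * post) / post.sum()

    print("missing ignored   :", round(eap(resp), 3))
    print("missing as wrong  :", round(eap(np.nan_to_num(resp, nan=0.0)), 3))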
Educational and Psychological Measurement, 2018
This study aimed to assess the accuracy of the empirical item characteristic curve (EICC) preequating method in the presence of test speededness. The simulation design of this study considered the proportion of speededness, speededness point, speededness rate, proportion of missingness on speeded items, sample size, and test length. After crossing all of the manipulated factors and normalizing the evaluation criteria (bias and root mean square difference [RMSD]) with regard to test length, the results revealed that (1) when test speededness was present, conversions from the EICC preequating method tended to be positively distorted; (2) no practically meaningful moderation effect associated with sample size was found on the relationship between test speededness and the accuracy of EICC preequating; and (3) the location of the speededness point was the driving factor in terms of its impact on the accuracy of EICC preequating. Implications and suggestions were discussed. Full paper can be found at https://journals.sagepub.com/doi/full/
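The way speededness is typically injected in such simulations can be sketched as follows; the cutoff, rates, and guessing behavior below are illustrative choices, not the study's design cells.

    import numpy as np

    rng = np.random.default_rng(11)
    n, j = 1_000, 50
    theta = rng.normal(0, 1, n)
    b = rng.normal(0, 1, j)
    p = 1 / (1 + np.exp(-(theta[:, None] - b)))      # Rasch response probabilities
    x = (rng.random((n, j)) < p).astype(float)

    speed_point = int(0.8 * j)                       # speededness sets in at item 40
    speeded = rng.random(n) < 0.3                    # 30% of examinees run out of time
    n_tail = j - speed_point
    # speeded examinees effectively guess on the remaining items
    x[speeded, speed_point:] = rng.random((speeded.sum(), n_tail)) < 0.25

    print("tail mean, speeded:", x[speeded, speed_point:].mean().round(3))
    print("tail mean, paced  :", x[~speeded, speed_point:].mean().round(3))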
International Journal of Testing, 2018
Score equity assessment (SEA) refers to an examination of population invariance of equating across two or more subpopulations of test examinees. Previous SEA studies have shown that score equity may be present for examinees scoring at particular test score ranges but absent for examinees scoring at other ranges. No studies to date have investigated why score equity can be inconsistent across the score range of some tests. The purpose of this study is to explore one source of uneven subpopulation score equity across the score range of a test. It is hypothesized that the difficulty of anchor items displaying differential item functioning (DIF) is directly related to the score locations at which issues of score inequity are observed. The simulation study supports the hypothesis that the difficulty of DIF items has a systematic impact on the uneven nature of conditional score equity. Full paper can be found at https://www.tandfonline.com/doi/abs/
Communications in Statistics - Simulation and Computation, 2017
Marcelo A. da Silva, Jorge L. Bazan, & Anne Corinne Huggins-Manley
Polytomous item response theory (IRT) models are used by specialists to score assessments and questionnaires that have items with multiple response categories. In this article, we study the performance of five model comparison criteria for comparing the fit of the graded response and generalized partial credit models to the same dataset when the choice between the two is unclear. A simulation study is conducted to analyze sensitivity to priors and to compare the performance of the criteria using the No-U-Turn Sampler algorithm under a Bayesian approach. The results were used to select a model for an application to mental health data.
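The abstract does not name the five criteria, but information criteria such as WAIC are commonly computed from NUTS output; a minimal sketch of WAIC from a matrix of pointwise posterior log-likelihood draws, using stand-in values rather than real sampler output, is shown below.

    import numpy as np

    def waic(log_lik):
        """WAIC from an (n_draws, n_obs) pointwise log-likelihood matrix."""
        lppd = np.sum(np.log(np.mean(np.exp(log_lik), axis=0)))
        p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))   # effective parameters
        return -2 * (lppd - p_waic)

    rng = np.random.default_rng(5)
    ll_grm = rng.normal(-1.0, 0.1, (2_000, 500))     # stand-ins for real draws
    ll_gpcm = rng.normal(-1.05, 0.1, (2_000, 500))
    print("WAIC GRM :", round(waic(ll_grm), 1))      # lower is better
    print("WAIC GPCM:", round(waic(ll_gpcm), 1))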
The purpose of this study is to develop and evaluate unidimensional models that can handle semiordered data within scale items (i.e., items with multiple ordered response categories and one additional nominal response category). We apply the models to scale data with not applicable (NA) responses to compare the model performance to conditions in which NA responses are treated as missing and ignored. We also conduct a small simulation study based on the operational study to evaluate the parameter recovery of the models under the operational conditions. Findings indicate that the proposed models show promise for (a) reducing standard errors of trait estimates for persons who select NA responses, (b) reducing nonresponse bias in trait estimates for persons who select NA responses, and (c) providing substantive information to practitioners about the nature of the relationship between NA selection and the trait of measurement.
The purpose of this article is to explore validity evidence and appropriate uses of the revised Technology Uses and Perceptions Survey (TUPS), designed to measure in-service teacher perspectives about technology integration in K-12 schools and classrooms. The revised TUPS measures 10 domains: Access and Support; Preparation of Technology Use; Perceptions of Professional Development; Perceptions of Technology Use; Confidence and Comfort Using Technology; Technology Integration; Teacher Use of Technology; Student Use of Technology; Perceived Technology Skills; and Technology Usefulness. We first provide a review of the literature supporting the design of the revised TUPS. We collected data from N = 1,376 teachers from one medium-sized school district in the state of Florida and conducted a variety of psychometric analyses, including internal structure analysis, correlation analysis, and factor analysis. The results demonstrate that data collected from the TUPS are best used as descriptive, granular information about reported behaviors and perceptions related to technology, rather than treated as a series of 10 scales. These findings have implications for the appropriate uses of the TUPS.
The purpose of this commentary is to present a systematic framework for comparing the standards for drawing causal inferences in educational research to the lack of such standards for drawing causal inferences under state accountability plans. We aim to demonstrate that this framework encompasses many of the existing critiques of previous state accountability plans for estimating school and teacher effects on achievement, and hence offers a path to improvement for state accountability plans currently being developed under the Every Student Succeeds Act (ESSA).
Given the relationships of item response theory (IRT) models to confirmatory factor analysis (CFA) models, IRT model misspecifications may be detectable through model fit indices commonly used in categorical CFA. The purpose of this study is to investigate the sensitivity of WLSMV-based RMSEA, CFI, and TLI model fit indices to IRT models that are misspecified due to local dependence (LD). It was found that WLSMV-based fit indices have some functional relationships to parameter estimate bias in 2PL models caused by violations of local independence. Continued exploration of these functional relationships, and development of LD-detection methods based on them, may hold much promise for providing IRT practitioners with global information on violations of local independence.
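One common way to induce local dependence in simulated 2PL data, sketched here with illustrative values, is to give an item pair a shared nuisance dimension beyond the target trait so that their residuals correlate.

    import numpy as np

    rng = np.random.default_rng(9)
    n, j = 5_000, 20
    theta = rng.normal(0, 1, n)                      # target trait
    nuisance = rng.normal(0, 1, n)                   # shared by items 0 and 1
    a = np.ones(j)
    b = rng.normal(0, 1, j)

    z = a * (theta[:, None] - b)
    z[:, :2] += 0.8 * nuisance[:, None]              # the LD pair loads on it
    x = (rng.random((n, j)) < 1 / (1 + np.exp(-z))).astype(int)

    # raw correlation of the LD pair vs. a locally independent pair
    print("LD pair  :", round(np.corrcoef(x[:, 0], x[:, 1])[0, 1], 3))
    print("LI pair  :", round(np.corrcoef(x[:, 2], x[:, 3])[0, 1], 3))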
This study defines subpopulation item parameter drift (SIPD) as a change in item parameters over time that is dependent on subpopulations of examinees, and hypothesizes that the presence of SIPD in anchor items is associated with bias and/or lack of invariance in three psychometric outcomes. Results show that SIPD in anchor items is associated with a lack of invariance in dimensionality structure of an anchor test, a lack of invariance in scaling coefficients across subpopulations, and a lack of invariance in ability estimates. It is demonstrated that these effects go beyond what can be understood from item parameter drift or differential item functioning.
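A minimal sketch of the definition, with illustrative magnitudes: SIPD means the drift applies only to a subpopulation at the later time point, unlike ordinary drift (all examinees) or static DIF (no time component).

    import numpy as np

    rng = np.random.default_rng(13)
    b_anchor = rng.normal(0, 1, 10)                  # time-1 anchor difficulties

    def drifted_b(subpop, time, sipd=0.4):
        """Anchor difficulties for a given subpopulation and time point."""
        b = b_anchor.copy()
        if time == 2 and subpop == "focal":          # drift only for focal group
            b[:5] += sipd                            # half the anchors drift
        return b

    print("reference, time 2:", drifted_b("reference", 2)[:5].round(2))
    print("focal,     time 2:", drifted_b("focal", 2)[:5].round(2))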
The TPACK (technological pedagogical content knowledge) framework (Mishra & Koehler, 2006) has gained tremendous momentum within the educational technology community. Specifically, much discourse has focused on how to measure this multidimensional construct in order to further define the contours of the framework and potentially make meaningful predictions. Some have proposed observation scales while others have proposed self-report measures to gauge the phenomenon. The Survey of Pre-service Teachers' Knowledge of Teaching and Technology is one popular instrument designed to measure TPACK (Schmidt et al., 2009), specifically for preservice teachers in teacher education programs. This study extends the measurement framework by providing a confirmatory factor analysis of the theoretical model proposed by Schmidt et al. (2009) on a sample of 227 preservice teachers from four public institutions of higher education in the southeastern United States. The data did not fit the theoretical 10-factor model implied by Schmidt et al. (2009); thus, an exploratory factor analysis was conducted to determine the optimal structure of the measurement tool for these data. This resulted in a nine-factor model, and there were measurement issues for several of the constructs. Additionally, the article provides evidence of external validity by correlating the instrument scores with other known technology constructs.
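A sketch of the exploratory-factor-analysis step on simulated stand-in data; the study's actual software, estimator, and rotation are not specified here, and scikit-learn's FactorAnalysis is used purely for illustration.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(17)
    n_teachers, n_items, n_factors = 227, 30, 9      # sizes echo the study
    loadings = rng.normal(0, 1, (n_items, n_factors))
    scores = rng.normal(0, 1, (n_teachers, n_factors))
    items = scores @ loadings.T + rng.normal(0, 1, (n_teachers, n_items))

    fa = FactorAnalysis(n_components=n_factors, random_state=0).fit(items)
    print(fa.components_.shape)                      # (9 factors, 30 items)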
Cognitive diagnosis models (CDMs) estimate student ability profiles using latent attributes. Model fit to the data needs to be ascertained in order to determine whether inferences from CDMs are valid. This study investigated the usefulness of some popular model fit statistics for detecting CDM misfit, including relative fit indices (AIC, BIC, and CAIC) and absolute fit indices (RMSEA2, ABS(fcor), and MAX(χ²_jj′)). These fit indices were assessed under different CDM settings with respect to Q-matrix misspecification and CDM misspecification. Results showed that the relative fit indices selected the correct DINA model most of the time and selected the correct G-DINA model well across most conditions. The absolute fit indices rejected the true DINA model if the Q-matrix was misspecified in any way, and rejected the true G-DINA model whenever the Q-matrix was under-specified. RMSEA2 could be artificially low when the Q-matrix was over-specified.
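The relative fit indices named in the study are simple functions of the maximized log-likelihood; a minimal sketch with placeholder values follows.

    import numpy as np

    def relative_fit(log_lik, n_params, n_obs):
        """AIC, BIC, and CAIC from a maximized log-likelihood."""
        aic = -2 * log_lik + 2 * n_params
        bic = -2 * log_lik + n_params * np.log(n_obs)
        caic = -2 * log_lik + n_params * (np.log(n_obs) + 1)
        return aic, bic, caic

    # e.g., DINA vs. G-DINA fits on the same data (illustrative numbers)
    print(relative_fit(log_lik=-10_500.0, n_params=60, n_obs=1_000))
    print(relative_fit(log_lik=-10_430.0, n_params=120, n_obs=1_000))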