RUNNING HEAD: A Generalizability Investigation of Cognitive Demand and Rigor
A Generalizability Investigation of Cognitive Demand and Rigor Ratings of Items and Standards
in an Alignment Study
Allison Lombardi, PhD
Mary Seburn, PhD
David Conley, PhD
Eric Snow, PhD
Educational Policy Improvement Center
720 E. 13th Ave., Suite 202
Eugene, OR 97401
541-346-6153
allison_lombardi@epiconline.org
allisonl@uoregon.edu
Presented at the annual conference of the American Educational Research Association
Denver, CO
April 2010
Abstract
In alignment studies, expert raters evaluate assessment items against standards, and the
ratings are used to compute various alignment indices. Questions about rater reliability,
however, are often ignored or inadequately addressed. This paper reports the results of a
generalizability theory study of cognitive demand and rigor ratings of assessment items and
college-readiness standards in the context of an alignment study of college admissions tests to a
set of college readiness standards. Results indicate a higher level of generalizability for Math
item and standard ratings than for English item and standard ratings, as well as a higher level of
generalizability for cognitive demand ratings than for rigor ratings. Results also suggest that the
standard of five to six raters in alignment studies may be insufficient for obtaining the desired
reliability. These findings can inform the design of more robust alignment studies that attain
higher levels of reliability across raters.
A Generalizability Investigation of Cognitive Demand and Rigor Ratings of Items and
Standards in an Alignment Study
In measurement contexts where expert raters are used to evaluate student performance,
the reliability of ratings is paramount and is routinely investigated with empirical evidence. In
alignment studies, however, where expert raters are used to evaluate assessment items against a
set of standards, basic questions about rater reliability are addressed only partially or not at all
(Herman, Webb, & Zuniga, 2007; D'Agostino et al., 2008). Most alignment studies use between
three and ten expert raters, but the extent to which the ratings are generalizable across raters and
the influence of disagreements in ratings on alignment conclusions are often not critically
examined (e.g., Achieve, Inc., 2007; Webb, 1997, 1999, 2002; Porter, 2002).
Generalizability theory is particularly useful in the context of alignment studies, as it
provides a model for disentangling and identifying the multiple sources of error that may
influence the consistency of ratings. That is, generalizability theory estimates the amount of error
attributable to raters and the extent to which the ratings generalize beyond the individual raters
to the intended construct domain. The error attributable to raters should be as small as possible;
when it is not, ratings were not made consistently across raters.
Generalizability theory is used to conduct a G-Study, in which G coefficients computed
from analyses of the objects and facets of measurement are used to evaluate reliability. G-Study
results are then used in a Decision Study (D-Study) to forecast the G coefficients that
would be obtained with varying numbers of raters (Mushquash & O'Connor, 2006).
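As a sketch of the underlying model (our notation, following standard treatments such as Shavelson & Webb, 1991, rather than a formula displayed in the original analyses), a single rating of an object of measurement o (an item or standard) by rater r decomposes as:

```latex
% Random-effects decomposition of one rating. With a single
% rating per cell, the object-by-rater interaction is
% confounded with residual error (the "or,e" subscript).
X_{or} = \mu + \nu_o + \nu_r + \nu_{or,e},
\qquad
\sigma^2(X_{or}) = \sigma^2_o + \sigma^2_r + \sigma^2_{or,e}
```

The rater component and the residual are the error terms a G-Study isolates; the D-Study then asks how quickly they shrink as raters are added.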
Aligning College Readiness Standards to College Admission and Placement Tests
The percentage of college freshmen requiring remedial education ranges from 30% to 60%
(NCES, 2004), suggesting a gap between what is taught in high school and what students are
expected to know and do in college. A student may be college-eligible—that is, able to meet
college admissions requirements—without being college ready—able to enroll and succeed in
credit-bearing general education courses at the postsecondary level without remediation (Conley,
2005, 2007, 2010). Typically, college admissions and placement assessments are used to
measure college readiness and scores from these assessments often determine whether a student
takes a remedial course prior to enrollment in a credit-bearing course. At the same time, in an
attempt to address the growing problem of high school graduates requiring remediation, some
states, such as Texas, have developed and adopted career and college readiness standards meant
to guide educators in developing curriculum and assessments with the underlying goal of college
and career readiness for all graduating students. In light of these newly developed standards, it is
important to determine the degree of alignment to widely-used college admissions and placement
tests. However, evidence from content analysis and alignment studies (e.g., Conley, 2003;
Brown & Conley, 2007; Brown & Niemi, 2007; Achieve, Inc., 2007) suggests these assessments
may not be aligned strongly enough with college-readiness standards to be useful tools for
providing feedback to high-school students and teachers concerning college readiness or
remediation needs. For this reason, it is imperative that we critically examine the alignment
between these assessments and college readiness standards (AERA, APA & NCME, 1999). Such
critical examination requires a more robust study of rater reliability in the context of alignment
studies.
The purpose of this study was to examine the generalizability of ratings used to compute
various alignment indices in the context of a broader alignment study. Raters were trained to
make expert judgments concerning the rigor and cognitive demand of a sample of items from six
college admission and placement tests, as well as a validated set of college readiness standards.
This study addressed the following research questions:
1. To what extent are cognitive demand and rigor ratings of assessment items
generalizable across raters?
2. To what extent are cognitive demand and rigor ratings of standards generalizable
across raters?
3. To what extent does the generalizability of cognitive demand and rigor ratings differ
for items and standards?
4. What is the ideal number of raters needed to maximize the generalizability of rigor
and cognitive demand ratings for items and standards?
Methods
This study employed Generalizability Theory to conduct a G-Study and D-Study in order
to address the above research questions in the context of a broader alignment study between
college admission and placement test items and a set of college readiness standards. Raters rated
test items and standards on two scales: cognitive demand and rigor.
Generalizability of Ratings
We investigated the reliability of the cognitive demand and rigor ratings by conducting a
generalizability theory analysis (Shavelson & Webb, 1991). Specifically, we used a design with
items crossed by raters (i x r design) and standards crossed by raters (s x r design) in which the
sources of variation were treated as random. Because our intent was to measure the rigor and
cognitive demand of items and standards, the items and standards were the objects of
measurement and the raters were considered the source of error (or facet, in G-theory
terminology). We report the phi coefficient, called the index of dependability (Shavelson &
Webb, 1991), which can be considered a reliability-like coefficient for absolute decisions
(Herman, Webb & Zuniga, 2007; Thompson, 2003). We focus on reporting reliability for
absolute decisions as opposed to relative decisions because the primary interest is to identify the
absolute level of cognitive demand and rigor of an item or standard rather than to rank order the
items or standards. We also report the absolute error variance associated with each phi
coefficient, as it indicates the overall consistency of item and standard ratings across raters.
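Concretely (a standard result for this crossed design, not a formula displayed in the original; see Shavelson & Webb, 1991), with n_r raters the absolute error variance and the phi coefficient are:

```latex
\sigma^2_{\mathrm{abs}} = \frac{\sigma^2_r}{n_r} + \frac{\sigma^2_{or,e}}{n_r},
\qquad
\Phi = \frac{\sigma^2_o}{\sigma^2_o + \sigma^2_{\mathrm{abs}}}
```

where \sigma^2_o is the variance component for the objects of measurement (items or standards), \sigma^2_r is the rater component, and \sigma^2_{or,e} is the residual.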
In the larger alignment study, we used the ratings of six experts as the basis for our
alignment computations and decisions. Six is the standard number of reviewers required in
similar alignment studies to obtain sufficient reliability (Herman, Webb, & Zuniga, 2007; Webb,
1997, 1999, 2002). Consistent with our treatment of rigor ratings and cognitive demand
ratings as quasi-quantitative measures (rather than as strictly categorical measures), we
conducted a generalizability analysis with items and standards crossed by raters (Shavelson &
Webb, 1991). Other measures of agreement are appropriate for categorical ratings (e.g., kappa
coefficients, percent agreement, etc.). We used the results of the generalizability theory analysis
in a decision study to determine the ideal number of raters needed to maximize the
generalizability of item and standard ratings. Coefficients and error variances were calculated
using SPSS version 16.0 (SPSS, Inc., 2007; Mushquash & O’Connor, 2006).
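For readers without the SPSS programs, a minimal Python sketch of the same computation follows. This is our own illustration (the function name and the use of NumPy are assumptions, not the Mushquash & O'Connor code), estimating variance components and phi for a fully crossed objects-by-raters design with one rating per cell:

```python
import numpy as np

def g_study_phi(X):
    """Variance components and phi (index of dependability) for a
    fully crossed objects-x-raters design with one rating per cell.
    X: 2-D array; rows are objects (items or standards), columns are raters."""
    n_o, n_r = X.shape
    grand = X.mean()

    # Sums of squares for objects, raters, and the residual
    ss_o = n_r * ((X.mean(axis=1) - grand) ** 2).sum()
    ss_r = n_o * ((X.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((X - grand) ** 2).sum() - ss_o - ss_r

    # Mean squares
    ms_o = ss_o / (n_o - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_o - 1) * (n_r - 1))

    # Expected-mean-square solutions; negative estimates truncated at zero
    var_res = ms_res
    var_o = max((ms_o - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_o, 0.0)

    # Absolute error variance and phi for the observed number of raters
    var_abs = (var_r + var_res) / n_r
    phi = var_o / (var_o + var_abs)
    return var_o, var_r, var_res, phi
```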
Rating Scales: Cognitive Demand & Rigor
We used the first four levels of Marzano's (2001) taxonomy (retrieval, comprehension,
analysis, and knowledge utilization) as the basis for the cognitive demand scale. Under this
taxonomy, cognitive demand is defined as the level of information processing and the degree of
conscious thought needed to complete a task. In our case, the object of measurement was
represented by an assessment item or standard (rather than a task). One characteristic of the
taxonomy is that each level builds on the prior one and requires a higher degree of cognitive
processing; as such, the levels form a set of ordered categories for scoring purposes. The rating
scale, which ranges from 1 (lowest) to 4
(highest), employs the following definitions:
1 = Retrieval: Recognizing, recalling, executing
2 = Comprehension: Integrating and symbolizing
3 = Analysis: Matching, classifying, analyzing errors, generalizing, specifying
4 = Knowledge utilization: Decision making, problem solving, experimenting, and
investigating
Rigor differs from cognitive demand in that it focuses not only on the mental activity
required to answer an item successfully or to perform the expectation stated in the standards, but
on the relative challenge and difficulty of doing so. Entry-level college expectations for students
serve as the point of reference. The scale, which ranges from 1 (lowest) to 3 (highest), employs
the following definitions:
1 = Below the level at which an entry-level college student should perform
2 = At the level at which an entry-level college student should perform
3 = Above the level at which an entry-level college student should perform
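For illustration only, the two ordered scales can be written as simple lookup tables (a hypothetical representation; the study collected ratings through a secure online tool, not through code like this):

```python
# Ordered categories of the two rating scales, as defined above.
COGNITIVE_DEMAND = {
    1: "Retrieval: recognizing, recalling, executing",
    2: "Comprehension: integrating and symbolizing",
    3: "Analysis: matching, classifying, analyzing errors, generalizing, specifying",
    4: "Knowledge utilization: decision making, problem solving, experimenting, investigating",
}

RIGOR = {
    1: "Below the level at which an entry-level college student should perform",
    2: "At the level at which an entry-level college student should perform",
    3: "Above the level at which an entry-level college student should perform",
}
```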
Background of Alignment Study
The current Generalizability Study was conducted in the context of a larger alignment
study. Researchers have developed several methods for evaluating alignment between
assessment items and educational standards (Rothman, 2003; Webb, 1997, 1999; Porter, 2002;
Achieve, Inc., 2007; Wixson, Fisk, Dutro, & McDaniel, 2002). In the alignment study, we
implemented a modified version of Webb’s methodology (Webb, 1997, 1999) that focused on
the use of item and standard ratings to compute three commonly used alignment metrics:
categorical concurrence, depth-of-knowledge consistency, and range of knowledge.
Assessments. This study included operational items from a combination of six
mathematics and English/Language Arts college admissions and placement assessments. Test
developers provided item sets for use in the study that were representative of the test
specifications. Some item sets consisted of intact forms, and others were sampled from item
banks, depending on whether the test was paper and pencil (fixed form) or computer adaptive,
in which case there are no identifiable test forms per se. There are other means of determining
the alignment of item pools, such as drawing a sample of items whose size is determined by the
length of the administered test, or sampling test content across multiple administrations.
However, most alignment studies comparing the
alignment of test forms and test item pools do not specifically address the lack of comparability
between the two (Achieve, 2007; Brown & Niemi, 2007).
College readiness standards. The college readiness standards were developed as part of
a larger state-wide initiative to improve the alignment between the K-12 and postsecondary
systems. These standards describe the content knowledge, thinking skills, and cognitive
strategies students need to succeed in entry-level postsecondary courses without
remediation.
Rater recruitment and training. Six English and six math content area experts were
recruited to participate in the alignment study. All raters were active college faculty members at
postsecondary institutions from around the U.S. Most had previous experience in the process
used for rating assessment items and educational standards. We recruited six raters for each
content area, as previous research in similar alignment studies indicates that six raters are
required to obtain sufficient reliability (Herman, Webb, & Zuniga, 2007; Webb, 1997, 1999,
2002).
The six English and six math raters were trained in content-specific groups during an
iterative process using sets of non-operational sample assessment items. The raters convened to
review and discuss the standards, items, and rating scale definitions. They first reviewed and
rated the standards and a set of sample items individually and then met as a group via
teleconference to discuss their ratings and judgments. Through iterative discussion and practice
applying the rating scales to multiple sets of sample items and standards, they identified and
refined decision rules for making ratings and alignment determinations consistently. As
discrepancies arose, the group discussed and reached consensus on a resolution, and then added
decision rules that would help them resolve similar discrepancies in future ratings. A team leader
was recruited from each group to facilitate consensus and address content and alignment
questions from reviewers throughout the study. This process was repeated until the raters agreed
on their ratings of the sample assessment item sets and standards.
Rating process. Following the completion of training, the math and English raters
accessed the assessment items via a secure online tool that collected their ratings of rigor and
cognitive demand. They first rated the standards and then the assessment items, providing rigor
and cognitive demand ratings for each.
Results and Discussion
Table 1 shows the generalizability results for the cognitive demand and rigor ratings of
the Math and English items across assessments. The phi coefficients for the cognitive demand
ratings are close to or above the conventional .80 criterion for reliability (Mushquash &
O’Connor, 2006). The phi coefficients for the rigor ratings were lower than those for the
cognitive demand ratings, with none of the coefficients reaching the conventional criterion for
reliability. These results indicate the six raters reached an acceptable level of dependability for
estimating Math and English items’ level of cognitive demand, but not for level of rigor. Results
also indicate higher reliability for math items in both cognitive demand and rigor ratings.
Table 1
G-Study Coefficients for Math and English Items Across Assessments

                       Cognitive Demand                  Rigor
Subject    Item N      Phi           Absolute Error      Phi           Absolute Error
                       Coefficient   Variance            Coefficient   Variance
Math       1460        0.859         0.035               0.505         0.007
English    1239        0.703         0.053               0.446         0.020
Table 2 shows the generalizability results for the cognitive demand and rigor ratings of
the English and Math standards. The phi coefficients for the cognitive demand ratings are close
to or above the conventional .80 criterion for reliability. The phi coefficients for the rigor ratings
were lower than those for the cognitive demand ratings, with none of the coefficients reaching
the conventional criterion for reliability. These results indicate that the six raters did reach an
acceptable level of dependability for estimating Math and English standards’ level of cognitive
demand, but not for level of rigor. Therefore, according to these findings, six raters appear to be
sufficient for cognitive demand ratings of both items and standards, but insufficient for rigor ratings.
Table 2
G-Study Coefficients for English and Math Standard Ratings

                         Cognitive Demand                  Rigor
Subject   Standards N    Phi           Absolute Error      Phi           Absolute Error
                         Coefficient   Variance            Coefficient   Variance
Math      115            0.855         0.100               0.566         0.038
English   119            0.724         0.095               0.556         0.060
Overall, these results indicate stronger generalizability across raters for Math item and
standard ratings than for English item and standard ratings. Additionally, the results indicate
stronger generalizability across raters for cognitive demand ratings than for rigor ratings.
Interestingly, Herman, Webb, and Zuniga (2007) also reported that ratings of cognitive demand
were more reliable than ratings of centrality (similar to rigor in that centrality evaluated the
extent to which a standard was essential to a topic).
Table 3 shows the residual effects components for cognitive demand and rigor ratings
across assessments and standards. With regard to the estimated variance components for
standards and raters, we noticed that the components for the residual effects (standard x rater) on
cognitive demand were particularly large relative to the individual (absolute) components for
standards (as presented in Table 2). Likewise, the residual effects components (standard x rater)
on rigor were larger than the absolute components, although the discrepancy was not as great.
With regard to the estimated variance components for items and raters, the residual effects
components on cognitive demand (item x rater) were larger than the individual (absolute)
components (as presented in Table 1), and the residual effects components (item x rater) on rigor
were slightly larger than the individual (absolute) components.
These findings show greater residual effects for cognitive demand than for rigor, and
greater residual effects for standards than for items. Further, these findings suggest larger
interaction effects for cognitive demand and for standards (i.e., raters rank-ordered items and
standards differently), and/or other sources of error
variability not captured in our study design. Other sources of error variability could be a result of
raters rating standards before items, each rater receiving the items in a random order, raters
conducting their ratings at different times in different settings, and variation in rater experience
writing items and conducting alignment studies.
Table 3
Residual Effects Components for Cognitive Demand and Rigor Ratings Across Items and Standards

Residual Effect     Subject    Cognitive Demand    Rigor
Item*Rater          Math       0.193               0.043
                    English    0.241               0.075
Standard*Rater      Math       0.550               0.189
                    English    0.466               0.257
Figures 1 and 2 present the results of the decision study analysis. Figure 1 shows results
for the items. These results indicate that, for cognitive demand ratings of Math items, increasing
the number of raters from 6 to between 10 and 15 would result in small gains in reliability. For
English items, the same increase in raters would result in moderate gains in reliability, and most
importantly, these moderate gains would bring the reliability to the acceptable .80 criterion level.
For rigor ratings of Math and English items, increasing the number of raters from 6 to between
15 and 20 would result in phi coefficients approaching the conventional criterion for
reliability.
Figure 1. D-Study results for English and Math items
Figure 2 shows results for the standards. For Math standards, these results indicate that
increasing the number of raters from 6 to between 10 and 15 would result in small gains in
reliability in cognitive demand. For English standards, the same increase in raters would result
in moderate to large gains in reliability, and the reliability would exceed the acceptable .80
criterion level. For rigor ratings of Math and English standards, increasing the number of raters
from 6 to between 15 and 20 would result in phi coefficients approaching the
conventional criterion for reliability.
Figure 2. D-Study results for English and Math standards
As shown in Figures 1 and 2, the reliability of rigor ratings for both items and standards
would increase greatly if the number of raters were 15 to 20, a notable difference from the
suggested six (Herman, Webb, & Zuniga, 2007; Webb, 1997, 1999, 2002). On the other hand, reliability of
the cognitive demand ratings would increase in small to moderate amounts if the number of
raters were increased from 6 to between 10 and 15. These results indicate more raters are
necessary to obtain sufficient reliability for rigor ratings than for cognitive demand ratings.
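The projection behind Figures 1 and 2 follows directly from the phi formula: the G-Study variance components are held fixed while the error variance is divided by a hypothetical number of raters. A minimal sketch, building on the hypothetical g_study_phi function above and using illustrative variance components rather than the study's estimates:

```python
def d_study_phi(var_o, var_r, var_res, n_raters):
    """Forecast phi for a hypothetical number of raters
    in a fully crossed design (the D-Study projection)."""
    var_abs = (var_r + var_res) / n_raters
    return var_o / (var_o + var_abs)

# Illustrative components only, not the study's estimates:
for n in (6, 10, 15, 20):
    print(n, round(d_study_phi(var_o=0.20, var_r=0.01, var_res=0.19, n_raters=n), 3))
```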
Conclusion and Implications
The purpose of this study was to examine the generalizability of ratings used to compute
various alignment indices in the context of a broader alignment study. We examined the extent
to which cognitive demand and rigor ratings were generalizable across raters for a sample of
college admission and placement test items and a set of validated college readiness standards. In
this examination, we determined whether the generalizability of cognitive demand and rigor
ratings differed for test items and standards, and identified the ideal number of raters needed to
maximize the generalizability of ratings
for items and standards.
Several conclusions about rater reliability in alignment studies can be drawn from our
findings. First, the content area of the items and standards rated in an alignment study impacts
the reliability of ratings. In our findings, Math items and standards showed greater levels of
reliability than English items and standards, which is important to consider in designing future
alignment studies. Given the more objective nature of Math content and the more interpretive
nature of English content, it may be easier for raters to agree on ratings of Math items and
standards than on ratings of English items and standards.
A second conclusion drawn from our findings concerns the cognitive demand and rigor
scales. These scales were used as the basis for rating items and standards, and they clearly
impacted the reliability of ratings. We found greater reliability for cognitive demand ratings in
both content areas and across items and standards. More research is needed to determine why
cognitive demand ratings tend to be more reliable, but we suspect the difference may be
related to the descriptors used to define the rating scales. Recall that the cognitive demand rating
scale was based on Marzano's (2001) taxonomy, where the descriptors retrieval, comprehension,
analysis, and knowledge utilization comprise the rating scale. Potentially, raters searched for
these descriptors within the text of items and standards in order to make their judgments. The
rigor rating scale, on the other hand, contained three descriptors (below, at, and above the level
at which an entry-level college student should perform) that were more open to interpretation
and subjectivity. Overall, these results indicate the cognitive demand scale may contain
descriptors that raters can apply quickly and objectively. The rigor scale, by contrast, may
require more precise
descriptors, or it may simply have been subject to more sources of error and disagreement in
determining adequate skill levels for entry-level college students.
Third, sources of error variability from independently conducted ratings (e.g., variation in
rating time/location) not captured in alignment study designs can have a sizable impact on the
reliability of item and/or standard ratings. In terms of error variance, there were greater
differences between the individual (absolute) and residual effects for cognitive demand than for
rigor ratings, which suggests larger interaction effects between raters and the objects of
measurement (items or standards) for the cognitive demand scale. Because each object was
rated once per rater, however, the residual confounds the rater-by-object interaction with other
error, so it is not known whether these differences are attributable to the interaction itself or to
facets not included in the design.
Finally, the D-study results suggest that the standard five or six raters typically assumed
to be sufficient for obtaining reliable ratings in alignment studies might be insufficient.
Particularly, five or six raters may be insufficient when using rating scales similar to the rigor
scale. These findings show that the number of raters required may differ according to the type of
scale used in the alignment study, which is particularly useful knowledge in the design of future
alignment studies. The number of raters needed may also differ according to content area, as our
findings show differing levels of reliability for English and Math across both items and
standards. For both standards and items, six appears to be a sufficient number of raters for
cognitive demand ratings, particularly in Math, but insufficient elsewhere.
This study was conducted in the context of a larger study addressing alignment of college
admissions and placement test items and a set of validated college readiness standards, a policy
arena where potential high-stakes decisions could be made based on the alignment study results.
As mentioned, the alignment of college admissions and placement tests to college readiness
standards is crucial, as previous studies suggest these assessments may not align strongly enough
with college readiness standards and may therefore be insufficient for providing feedback to high
school students and teachers concerning college readiness or remediation needs. Ultimately,
these findings reiterate the importance of critically attending to the design of alignment studies,
particularly elements regarding the content expertise of raters, the use of multiple rating scales to
determine alignment, and rater training and process. These findings clearly demonstrate the value
of conducting a generalizability theory analysis to evaluate rater reliability in alignment studies.
References
Achieve, Inc. (2007). Aligned expectations? A closer look at college admissions and placement
tests. Washington, DC: Author.
American Educational Research Association (AERA), American Psychological Association
(APA), National Council on Measurement in Education (NCME). (1999). Standards for
educational and psychological testing. Washington, D.C.: AERA.
Brown, R. S., & Conley, D. T. (2007). Comparing state high school assessments to standards for
success in entry-level university courses. Educational Assessment, 12(2), 137-160.
Brown, R.S., & Niemi, D.N. (2007). Investigating the alignment of high school and community
college assessments in California. National Center Report #07-3. The National Center for
Public Policy and Higher Education.
Conley, D.T. (2003). Mixed messages: What state high school tests communicate about student
readiness for college. Eugene: University of Oregon, Center for Educational Policy
Research.
Conley, D. T. (2007). Redefining college readiness. Eugene, OR: Educational Policy
Improvement Center.
Conley, D.T. (2010). College and career ready: Helping all students succeed beyond high
school. San Francisco, CA: Jossey-Bass.
D’Agostino, J.V., Welsh, M., Cimetta, A., Falco, L., Smith, S., VanWinkle, W., & Powers, S.
(2008). The rating and matching item-objective alignment methods. Applied
Measurement in Education, 21, 1-21.
Herman, J. L., Webb, N. M., & Zuniga, S. A. (2007). Measurement issues in the alignment of
standards and assessments: A case study. Applied Measurement in Education, 20, 101-126.
Herman, J. L., & Webb, N. M. (Eds.). (2007). Alignment issues [Special issue]. Applied
Measurement in Education, 20, 1-135.
Marzano, R. J. (2001). Designing a new taxonomy of educational objectives. Thousand Oaks,
CA: Corwin Press.
Mushquash, C., & O'Connor, B. (2006). SPSS and SAS programs for generalizability theory
analyses. Behavior Research Methods, 38(3), 542-547.
National Center for Education Statistics [NCES] (2004). The condition of education 2004,
indicator 18: Remediation and degree completion. Washington, DC: U.S. Department of
Education.
Porter, A. C. (2002). Measuring the content of instruction: Uses in research and practice.
Educational Researcher, 31(7), 3-14.
Rothman, R. (2003). Imperfect matches: The alignment of standards and tests. National Research
Council.
Shavelson, R.J., & Webb, N.M. (1991). Generalizability theory: A primer. Newbury Park, CA:
Sage Publications.
SPSS, Inc. (2007). SPSS 16.0 for Windows [Computer software]. Chicago, IL: Author.
Thompson, B. (2003). A brief introduction to generalizability theory. In B. Thompson, (Ed.)
Score reliability (pp. 43-58). Thousand Oaks, CA: SAGE Publications.
Webb, N. L. (1997). Criteria for alignment of expectations and assessments in mathematics and
science education (National Institute for Science Education NISE Res. Monograph No.
6). Madison: University of Wisconsin, Wisconsin Center for Education Research.
Webb, N. L. (1999). Alignment of science and mathematics standards and assessments in four
states (National Institute for Science Education NISE Res. Monograph No. 18). Madison:
University of Wisconsin, Wisconsin Center for Education Research.
Webb, N. L. (2002, April). An analysis of the alignment between mathematics standards and
assessments for three states. Paper presented at the annual meeting of the American
Educational Research Association, New Orleans.
Wixson, K. K., Fisk, M. C., Dutro, E., & McDaniel, J. (2002). The alignment of state standards
and assessments in elementary reading (CIERA Report No. 3-024). Ann Arbor, MI:
University of Michigan, Center for the Improvement of Early Reading Achievement.