Procedia - Social and Behavioral Sciences 29 (2011) 1368 – 1372
International Conference on Education and Educational Psychology (ICEEPSY 2011)
A Comparison of Classical and Item Response Theory (IRT) Equating Methods: A Case Study of the Iranian University Entrance Exam
Ali Moghadamzadeh a,*, Keyvan Salehi b, Ebrahim Khodaie a,b
a Department of Assessment and Measurement, University of Allameh Tabatabaei, Tehran, Iran
b Department of Psychometrics and Educational Research, University of Tehran, Tehran, Iran
a,b Assistant Professor in Social Statistics, National Organization for Educational Testing, Tehran, Iran
Abstract
The aim of this study is to introduce the concept of equating, its implications, and its methods in the domain of measurement and evaluation. The two approaches to equating, classical and item response theory (IRT), were investigated by collecting and analyzing data, and the advantages and disadvantages of each method were considered. Equating errors for the different cases were computed with the BILOG and SPSS software packages. Given the available data and measurement methods, the equating was fully described through the data drawn from the two tests. Consequently, the two tests were placed on the same scale, and the accuracy of the equating was estimated. Finally, the advantages and potential applications of test equating were considered under classical theory and item response theory. This study has implications for educational measurement and testing procedures, especially for adaptive testing and test construction.
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Dr. Zafer Bekirogullari of Cognitive – Counselling, Research & Conference Services C-crcs.
Keywords: equating; linking; scaling; calibration; measurement; classical theory; Item Response Theory (IRT); anchor item; difficulty; discrimination
1. Introduction
Equating is a statistical method for relating two or more tests, or placing two or more tests on the same scale; this is called 'test equating' (Hambleton & Swaminathan, 1985). Other terms, such as linking (Vale, 1986), calibration (Wright, 1968), and scaling (Hambleton, Swaminathan & Rogers, 1991), have also been used to describe methods of
* Corresponding author. Tel.: +989121944254; fax: +982188922237.
E-mail address: sanjeshali30@gmail.com.
1877-0428 © 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Dr Zafer Bekirogullari.
doi:10.1016/j.sbspro.2011.11.375
equating. Some researchers (for example, Kolen & Brennan, 1995), however, argue that a method can be called equating only when it is used to equate two forms of a test with the same content, and that other similar methods should be called scaling or linking. In this paper, with some simplification, the term equating (meaning the establishment of a relation between tests) is used, because it is very hard to determine whether the content and difficulty of two tests to be equated are exactly the same.
Equating methods go by different names, but in general they can be categorized as horizontal or vertical (Cook & Eignor, 1991). Horizontal equating is appropriate when several forms of a test are needed for test security. These forms are not identical, but they are expected to be similar in content and difficulty. When the difficulty, reliability, and content of the forms differ greatly, few equating methods work properly (Cook & Eignor, 1991). Furthermore, the examinees who take the different forms are expected to have roughly equal ability distributions. When the ability distributions of the examinee groups differ greatly, traditional equating methods (such as linear equating and equipercentile equating) do not perform well (Kolen & Brennan, 2004).
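Traditional linear equating simply matches the first two moments of the two score distributions. As a minimal illustrative sketch (not the authors' code; the synthetic score data are invented for demonstration):

```python
import numpy as np

def linear_equate(x_scores, y_scores):
    """Return a function mapping form-X scores onto the form-Y scale by
    matching the mean and standard deviation of the two distributions
    (traditional linear equating): y* = (sy/sx) * (x - mx) + my."""
    mx, sx = np.mean(x_scores), np.std(x_scores)
    my, sy = np.mean(y_scores), np.std(y_scores)
    return lambda x: (sy / sx) * (np.asarray(x) - mx) + my

# Synthetic example: form X is somewhat harder (lower mean score)
rng = np.random.default_rng(0)
x = rng.normal(20, 5, 1000)   # observed scores on form X
y = rng.normal(25, 6, 1000)   # observed scores on form Y
to_y = linear_equate(x, y)    # to_y(x) places X scores on the Y scale
```

By construction, the transformed X scores have the same mean and standard deviation as the Y scores, which is exactly the sense in which the two forms are "equated" under this method.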
On the other hand, vertical equating is an attempt to equate the scores of two tests that are deliberately set at different levels of difficulty yet measure the same general domain of knowledge or skill. Moreover, as opposed to horizontal equating, in vertical equating the ability distribution of the examinees differs from one level to another. Obviously, the problem of vertical equating is significantly more complicated than that of horizontal equating. Some measurement specialists maintain that vertical equating should not be considered test equating, because the content of the tests often differs from one level to another (Kolen, 1988). Other measurement specialists, however, argue that whether tests can be equated should depend on whether the conditions for equating can be established. It is generally accepted that four conditions are needed before tests can be equated: (a) similar ability, (b) equity, (c) population invariance, and (d) symmetry (Angoff, 1984; Kolen & Brennan, 1995; Lord, 1980; Petersen, Kolen & Hoover, 1989). Similar ability means that, to be equated, tests should measure similar abilities (or qualities, characteristics, and skills).
The basic aim of this study is to introduce the concept of equating, its implications, and its methods in the domain of measurement and evaluation. The two approaches to equating, classical and item response theory (IRT), were investigated by collecting and analyzing data, and the advantages and disadvantages of each method were considered.
2. Theoretical Background
The reasons for equating tests differ from one situation to another, but in general they fall into three categories. One important and fundamental reason for equating is to improve the cohesion and integrity of test scores so that examinees are not forced to be tested several times. To assure the fairness and equity of a test, or to neutralize practice effects, several forms are often needed when a test is constructed. For measuring the learning merit of children, for example, two forms were constructed and published in a learning merit test (Michaelides, 2004).
The second reason for equating is that it makes test scores interchangeable. When several tests are used to measure the same variables or qualities, their scores often cannot be used in place of one another, because they are based on different scales. In order to compare examinees, or to set equivalent criteria, across the tests, the scores must first be placed on the same scale through test equating (Mislevy, 1992).
The third reason is test continuity. Multiple tests are used to measure the growth or change of an ability or quality at different ability levels. In this situation, tests are often deliberately constructed at different levels of difficulty. In order to measure this growth or change in ability, we can construct different forms with different levels of difficulty. However, when the difficulty of the test changes, the ability to measure growth or change cannot be guaranteed, so these tests must again be placed on the same scale. Equating motivated by the first two reasons is called horizontal equating, and equating motivated by the last reason is called vertical equating. (For more information on the reasons for test equating, see Mislevy (1992) and Petersen et al. (1989).)
2-1- Designs of equating
Classical methods of equating were described in detail by Angoff (1971) and Kolen (1988). In general, the methods fall into two main categories: equipercentile equating and linear equating. Equipercentile equating is accomplished by considering the scores on tests X and Y to be equivalent if their respective percentile ranks in any given group are equal. Strictly speaking, in order to equate scores on two tests, the tests must be given to the same group of examinees. In practice, the process is typically carried out by giving the tests to randomly equivalent groups of examinees.
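The equipercentile rule just described can be sketched from empirical score distributions. This is an illustrative reimplementation under simplifying assumptions (no presmoothing of the distributions), not the software used in the study:

```python
import numpy as np

def percentile_rank(scores, x):
    """Empirical percentile rank of score x within a score distribution."""
    scores = np.sort(np.asarray(scores))
    return np.searchsorted(scores, x, side="right") / len(scores)

def equipercentile_equate(x_scores, y_scores, x):
    """Return the form-Y score whose percentile rank equals that of x on form X."""
    p = percentile_rank(x_scores, x)
    return np.quantile(y_scores, p)

# Illustrative data: form-Y scores are a uniform shift of form-X scores,
# so an X score of 50 should map to a Y score near 60
x_scores = np.arange(1, 101)
y_scores = x_scores + 10
```

In operational practice the raw score distributions are usually smoothed before the percentile ranks are computed, which this sketch omits.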
Like other aspects of test construction, equating begins with data collection. Generally, four designs are used to collect the data: (a) single-group design, (b) equivalent-groups design, (c) anchor-test design, and (d) common-person design. In the single-group design, two or more forms of a test are presented to the same group of examinees. The advantage of this design is that measurement error is reduced, because only one group of examinees enters the equating process; differences between the tests therefore cannot be confounded with differences between groups. Since this design requires examinees to respond to the items of several tests, fatigue is a basic problem, especially when physical and intellectual effort is involved. Practice effects are another problem that should be considered: if familiarity with one test improves test-taking performance, the test presented later will seem easier. In order to avoid fatigue and practice effects, some kind of spiraling process should be used; for example, the order in which the forms are presented can serve as a counterbalancing factor (Kolen, 1988; Kolen and Brennan, 1995).
In the equivalent-groups design, two equivalent tests are presented to two equivalent groups of examinees. The groups can be randomly selected, which is why this is sometimes called the random-groups design. The advantage of this design is that the problems of the single-group design, such as fatigue and practice effects, are reduced. Moreover, minimal time is needed, and the test can be completed in one administration. The drawback of this design, however, is that an unknown amount of bias enters the equating process, because the groups are rarely exactly the same in their ability distributions. In order to control this bias, larger samples are often needed in this design (Kolen, 1988; Kolen and Brennan, 1995).
In the anchor-test design, also called the common-item nonequivalent-groups design, the tests are presented to two different groups of examinees. As opposed to the equivalent-groups design, the groups can differ in their ability distributions. In addition, a set of common items (anchors) is included in both tests.
Finally, in the common-person design, the two tests to be linked are given to two groups of examinees, with a common group of examinees taking both tests. Because testing will be lengthy for the common group, this design has the same drawbacks as the single-group design.
3. Methodology
3-1- Sample size
The population consisted of all Iranian examinees who took the Test of Language by the Iranian Measurement Organization (TOLIMO) in 2010. With a sampling error of 0.5% and a confidence level of 95%, the sample included 1054 participants for form A and 1241 participants for form B.
3-2- Instrumentation
The Test of Language by the Iranian Measurement Organization (TOLIMO) was administered in two steps, in May and July of 2010. The test is based on three sub-scales: vocabulary, structure, and listening. In this test, eight questions were selected as anchor questions. Scores on forms A and B were compared using the classical method and IRT.
4. Results
Table 1: Descriptive statistics

Form    N      Mean   SD    Skewness   Kurtosis
A       1054   8.51   5.8   0.31       2.52
B       1241   7.95   6.1   0.27       2.31

Table 2: Summary of results

Model     Parameter   Mean (A)   Mean (B)   SD (A)   SD (B)
Classic   b           0.34       0.41       0.15     0.12
Classic   a           0.52       0.45       0.07     0.06
IRT       b           0.07       0.51       0.77     0.71
IRT       a           0.77       0.47       0.22     0.37
IRT       c           0.11       0.08       0.08     0.1
The results derived through the classical method involved some problems. If the tests are truly parallel and have adequate validity, the classical method can be used. The results showed that the classical method yielded only weak equating between forms A and B.

In the classical method, the two conditions of symmetry and invariance must be satisfied. In the regression method for determining the constants used in the equating formula, symmetry need not be satisfied. The necessity of invariance comes from the fact that the equating method should be independent of the sample group. These conditions are generally very difficult to satisfy in the classical method.
The analysis under the classical method showed that the regression coefficients changed depending on whether form A or form B was used as the reference: the results obtained in the first case were significantly different from those of the second. The results also showed that, in the classical method, equating depends on the observed sample; when the observed sample was changed, the mean and standard deviation of the two samples for a test were significantly different. IRT can resolve these two limitations of the classical method, symmetry and invariance.
In IRT, if the model fits the data, the ability parameters can be estimated. In this case, the difficulty parameter (bA) and discrimination parameter (aA) of form A can be compared with the difficulty parameter (bB) and discrimination parameter (aB) of form B. In addition, with IRT the ability parameter (theta) in form A can be compared with the ability parameter in form B. The results showed that the discrimination and difficulty parameters in the different cases mentioned were not significantly different. The estimated ability parameters in the two groups likewise showed no significant difference.
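Because separate IRT calibrations are identified only up to a linear transformation of the theta scale, the anchor items are typically used to place form B's parameters on form A's scale before such comparisons. A minimal mean-sigma linking sketch with hypothetical anchor-item difficulties (these values are invented for illustration; they are not the estimates of Table 2):

```python
import numpy as np

def mean_sigma_link(b_anchor_ref, b_anchor_new):
    """Mean-sigma linking: find slope and intercept such that
    b_ref ~ slope * b_new + intercept for the common (anchor) items,
    by matching the mean and SD of the anchor difficulty estimates."""
    slope = np.std(b_anchor_ref) / np.std(b_anchor_new)
    intercept = np.mean(b_anchor_ref) - slope * np.mean(b_anchor_new)
    return slope, intercept

# Hypothetical anchor difficulties from the two separate calibrations
bA = np.array([-1.0, -0.4, 0.0, 0.3, 0.8, 1.2, 1.5, 2.0])  # form-A scale
bB = np.array([-0.8, -0.2, 0.2, 0.5, 1.0, 1.4, 1.7, 2.2])  # form-B scale

slope, intercept = mean_sigma_link(bA, bB)
# On the form-A scale: theta* = slope*theta + intercept,
# b* = slope*b + intercept, and a* = a / slope
bB_on_A = slope * bB + intercept
```

After this transformation the anchor difficulties from the two calibrations share the same mean and standard deviation, so the forms' parameter estimates can be compared on one scale.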
Acknowledgment
I am heartily thankful to Professor Delavar, whose encouragement and guidance from the initial to the final stage enabled me to develop an understanding of the subject.
References
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff Publishing.
Vale, C. D. (1986). Linking item parameters onto a common scale. Applied Psychological Measurement, 10(4), 333-344.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Kolen, M. J., & Brennan, R. L. (1995). Test equating. New York: Springer.
Cook, L. L., & Eignor, D. R. (1991). An NCME instructional module on IRT equating methods. Educational Measurement: Issues and Practice, 10, 37-45.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking. New York: Springer.
Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). Washington, DC: American Council on Education.
Michaelides, M. P., & Haertel, E. H. (2004). Sampling of common items: An unrecognized source of error in test equating. Applied Psychological Measurement, 34(1), 66-67.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133-161.
Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Kolen, M. J. (1988). An NCME instructional module on traditional equating methodology. Educational Measurement: Issues and Practice, 8, 29-36.