Norms and Reliability
Norm and Test Standardization
•To make sense out of individual test scores
•The raw scores are converted to some form of derived score based on
 comparison to a standardization or norm group
•A norm group:
 Representative of the population
 Large, heterogeneous sampling
Raw Scores
•Most basic level of information provided by a psychological test
•To make the score meaningful, the researcher interprets it by consulting
 norms
•“Norm-referenced Test”
Essential Statistical Concepts
Approaches to organizing and summarizing quantitative data
Frequency distributions
Measures of central tendency
Measures of variability
The normal distribution
Skewness
Frequency Distributions
•Prepared by specifying a small number of usually equal-sized class intervals and
 tallying how many scores fall within each interval
•The sum of the frequencies for all intervals = N (the total number of scores in the sample)
•Histogram – graphic representation of the same information contained in the
 frequency distribution
•Frequency polygon – the frequency of each class interval is represented by
 a single point rather than a column, and the points are joined by straight lines
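A minimal sketch of the tallying step, assuming integer raw scores. The scores, the interval width, and the function name `frequency_distribution` are invented for illustration and do not come from the slides.

```python
from collections import Counter

def frequency_distribution(scores, width, start):
    """Tally integer scores into equal-sized class intervals."""
    # Each score is assigned to interval number (score - start) // width
    counts = Counter((s - start) // width for s in scores)
    return {
        (start + k * width, start + (k + 1) * width - 1): counts[k]
        for k in sorted(counts)
    }

# Hypothetical raw scores, invented for illustration only
scores = [52, 57, 61, 63, 64, 68, 70, 71, 75, 88]
table = frequency_distribution(scores, width=10, start=50)

for (low, high), freq in table.items():
    # A row of asterisks gives a rough text histogram
    print(f"{low}-{high}: {freq} {'*' * freq}")

# The interval frequencies sum to N
print(sum(table.values()) == len(scores))
```

Plotting the same counts as columns gives the histogram; plotting them as points joined by lines gives the frequency polygon.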
Measures of Central Tendency
•Mean – adding up all the scores and dividing by N
•Median – the middlemost score when all the scores have been ranked
•Mode – the most frequently occurring score
•For example, if a distribution of scores is strongly skewed, the median is a better
 index of central tendency than the mean
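The point about skewed distributions can be illustrated with a short sketch. The scores below are hypothetical; one extreme score pulls the mean upward while the median and mode stay with the bulk of the data.

```python
from statistics import mean, median, mode

# Hypothetical scores with one extreme value (positively skewed)
scores = [10, 12, 12, 13, 14, 15, 60]

m = mean(scores)      # sum of scores divided by N; pulled upward by the 60
md = median(scores)   # middlemost score after ranking
mo = mode(scores)     # most frequently occurring score

print(m, md, mo)
```

Here the mean lands above six of the seven scores, so the median is the more representative index.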
Measures of Variability
•To describe the degree of dispersion
•Standard deviation (s or SD) – reflects the degree of dispersion in a group of
 scores
[Figure: two score distributions; Distribution A has the larger standard deviation]
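As a rough illustration of dispersion, the sketch below compares two hypothetical groups of scores that share the same mean. It uses the population formula (dividing by N); `statistics.stdev` would divide by N - 1 instead.

```python
from statistics import mean, pstdev

# Two hypothetical distributions with the same mean of 50
dist_a = [20, 35, 50, 65, 80]   # widely dispersed scores
dist_b = [44, 47, 50, 53, 56]   # tightly clustered scores

sd_a = pstdev(dist_a)  # population standard deviation of A
sd_b = pstdev(dist_b)  # population standard deviation of B

print(mean(dist_a), mean(dist_b))
print(round(sd_a, 2), round(sd_b, 2))
```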
Normal Distribution
•The distribution of scores resembles a symmetrical, mathematically defined,
 bell-shaped curve
Skewness
•The symmetry or asymmetry of a frequency distribution
•Skewed distributions usually signify that the test developer has included too few
 easy items or too few hard items
Raw Scores Transformations
Transforming raw scores into more interpretable and useful forms of
information
Percentile and percentile ranks
Standard scores
T-scores and standardized scores
Normalizing standard scores
Percentile and Percentile Ranks
•A percentile expresses the percentage of persons in the standardization sample
 who scored below a specific raw score.
•For example,
   On an IQ test, an examinee obtains an IQ of 130. An IQ of 130 corresponds
   to a percentile of 98 (P98), which means his IQ exceeds that of 98% of the
   standardization sample.
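A minimal sketch of the percentile idea, using a small hypothetical norm sample. The function name `percentile_rank` and the data are assumptions for illustration; published tests often refine the handling of tied scores.

```python
def percentile_rank(raw_score, norm_scores):
    """Percentage of the norm sample scoring below raw_score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100 * below / len(norm_scores)

# Hypothetical norm sample of 20 raw scores
norm_scores = [3, 5, 6, 6, 7, 8, 8, 9, 9, 10,
               10, 11, 11, 12, 12, 13, 14, 15, 17, 19]

pr = percentile_rank(12, norm_scores)
print(pr)  # 13 of the 20 scores fall below 12
```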
Standard Scores
•A standard score expresses the distance from the mean in standard deviation units.
 Example,
 For a normative sample, M = 50, SD = 8, and Z = (X - M) / SD
 A: raw score of 35
 Z = (35 - 50) / 8 = -1.88 (below average)
 B: raw score of 50
 Z = (50 - 50) / 8 = 0 (exactly average)
 C: raw score of 70
 Z = (70 - 50) / 8 = +2.50 (above average)
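The three cases can be checked with a short sketch using the normative values from the example (M = 50, SD = 8). The function name `z_score` is an assumption for illustration.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

M, SD = 50, 8  # normative sample from the example

for label, raw in [("A", 35), ("B", 50), ("C", 70)]:
    z = z_score(raw, M, SD)
    position = "below" if z < 0 else "above" if z > 0 else "exactly"
    print(f"{label}: raw = {raw}, Z = {z:+.3f} ({position} average)")
```

Examinee A's exact value is -1.875, which the slide reports to two decimals as -1.88.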
T-Score and Other Standardized Scores
•Carry the same information as standard scores, rescaled to a new mean and
 standard deviation
•Expressed in positive whole numbers
•T-score has a mean = 50, standard deviation = 10
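A minimal sketch of the conversion, assuming the usual linear rule T = 50 + 10Z and rounding to a whole number. The function name `t_score` is illustrative.

```python
def t_score(z):
    """Convert a standard score to a T-score (mean 50, SD 10), as a whole number."""
    return round(50 + 10 * z)

# Standard scores for raw scores 35, 50, 70 with M = 50 and SD = 8
for z in (-1.875, 0.0, 2.5):
    print(z, "->", t_score(z))
```

Because the transformation is linear, each examinee keeps the same relative position; only the units change.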
Normalizing Standard Scores
•Transforming a nonnormal distribution into a normal distribution by converting
 percentiles to normalized standard scores
•The percentile of each raw score is used to determine its corresponding standard
 score
•Normalized standard scores are a nonlinear transformation of the raw scores, so
 mathematical relationships among the raw scores may not hold for the normalized scores
•Test developers are advised to adjust the difficulty of test items in order to produce a
 normal distribution
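One way to sketch the percentile-to-standard-score step is with the inverse of the normal cumulative distribution, available in the Python standard library (3.8+) as `statistics.NormalDist.inv_cdf`. The function name `normalized_z` is an assumption for illustration.

```python
from statistics import NormalDist

def normalized_z(percentile):
    """Z-score that has the given percentage of a normal distribution below it.

    The percentile must lie strictly between 0 and 100.
    """
    return NormalDist().inv_cdf(percentile / 100)

for p in (2, 16, 50, 84, 98):
    print(f"P{p} -> z = {normalized_z(p):+.2f}")
```

Equal steps in percentile do not give equal steps in z, which is why the transformation is nonlinear.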
Stanines, Sten and C Scale
•Stanine scale
 all raw scores are converted to a single-digit system of scores ranging from 1 to 9
 Mean = 5, standard deviation ≈ 2
•Sten scale
 5 units above and 5 units below the mean
•C scale
 Consists of 11 units
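A rough sketch of the three scales as linear rescalings of a z-score. This is an approximation under stated assumptions: stanines with mean 5 and SD 2 (range 1 to 9), stens with mean 5.5 and SD 2 (range 1 to 10), and the C scale with mean 5 and SD 2 (range 0 to 10). Published scales are usually defined by fixed percentage bands, so values near band boundaries may differ.

```python
def _clamp(value, low, high):
    """Keep value inside the closed range [low, high]."""
    return max(low, min(high, value))

def stanine(z):
    return _clamp(round(2 * z + 5), 1, 9)

def sten(z):
    return _clamp(round(2 * z + 5.5), 1, 10)

def c_scale(z):
    return _clamp(round(2 * z + 5), 0, 10)

for z in (-3.0, -1.875, 0.3, 2.5):
    print(z, stanine(z), sten(z), c_scale(z))
```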
Selecting a Norm Group
•Random sampling
•Stratified random sampling
 To ensure the smaller norm groups are truly representative of the population
Selecting a Norm Group (cont.)
•Age Norm – the level of test performance for each separate age group in the
 normative sample
 Facilitates same-age comparisons
•Grade Norm – the level of test performance for each separate grade in the
 normative sample
 Useful in reporting school achievement levels in schoolchildren
•Local norms – derived from representative local examinees
•Subgroup norms – scores obtained from an identified subgroup
Expectancy Tables
•Portrays the established relationship between test scores and expected outcomes
 on a relevant task
•Based on the previous performance of a large and representative sample of examinees
 whose test performances and criterion outcomes reflected existing social
 conditions and institutional policies
Criterion-Referenced Tests
Dimension           Criterion-referenced Tests           Norm-referenced Tests
Purpose             Compare examinees’ performance       Compare examinees’
                    to a standard                        performance to one
                                                         another
Item content        Narrow domain of skills with real-   Broad domain of skills
                    world relevance                      with indirect relevance
Item selection      Most items of similar difficulty     Items vary widely in
                    level                                difficulty level
Interpretation of   Scores usually expressed as a        Scores usually expressed
scores              percentage, with passing level       as a standard score,
                    predetermined                        percentile, or grade
                                                         equivalent
CONCEPTS OF RELIABILITY
By Tiw Seok Lian and Cindy Loh
Definition of Reliability
•Attribute of consistency in measurement
•Measures fall along a continuum from minimal consistency (e.g., reaction time)
 to nearly perfect repeatability (e.g., weight)
Classical Test Theory
            = Theory of True and Error Scores
                             X=T+e
 X = obtained score
 T = true score
 e = errors of measurement
Sources of Measurement Errors
 X = T + es + eu
 es = systematic measurement error
 eu = unsystematic measurement error
Important sources of measurement error:
•Item selection: question, wording
•Test administration: environment, emotion, examiner
•Test scoring: scoring criteria, subjective judgement
•Systematic measurement error
Measurement Error and Reliability
•Errors can reduce reliability
•Measurement errors are incredibly complex and varied
•Measurement errors are assumed to be random
•Mean error of measurement = 0
•True score and errors are uncorrelated, rTE = 0
•Errors on different tests are unrelated, r12 = 0
The Reliability Coefficient
•The ratio of true score variance to the total variance of the test scores
•When the measurement error is very small, the reliability coefficient, rxx, approaches
 1.0
•0 < rxx < 1.0
•As rxx approaches 1.0, the test captures minimal measurement error and produces
 consistent and reliable scores.
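The variance-ratio definition can be illustrated with a small simulation of X = T + e. All values here are assumptions chosen for illustration: true scores drawn from a normal distribution with SD 10 and random errors with SD 5, so the theoretical ratio is 100 / (100 + 25) = 0.80. In practice true scores are never observed, so this is a teaching sketch and not an estimation method.

```python
import random
from statistics import pvariance

random.seed(0)  # fixed seed so the sketch is repeatable

N = 10_000
true_scores = [random.gauss(50, 10) for _ in range(N)]  # T
errors = [random.gauss(0, 5) for _ in range(N)]         # e, mean 0, unrelated to T
observed = [t + e for t, e in zip(true_scores, errors)] # X = T + e

# Reliability = true score variance / total (observed) variance
r_xx = pvariance(true_scores) / pvariance(observed)
print(round(r_xx, 2))  # should land close to the theoretical 0.80
```

Shrinking the error SD toward zero pushes the ratio toward 1.0.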
The Correlation Coefficient
•Expresses the degree of linear relationship, r, between two sets of scores obtained
 from the same persons.
[Figure: scatterplots illustrating r ≈ 1.0, 0.4, 0, -0.4, -0.7, and -1.0]
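A minimal sketch of the Pearson correlation between two sets of scores from the same persons. The scores are hypothetical and the function name `pearson_r` is illustrative; the same computation underlies test-retest, alternate-forms, and interscorer reliability estimates.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for five examinees on two administrations
first = [10, 12, 15, 18, 20]
second = [11, 13, 14, 19, 21]

r = pearson_r(first, second)
print(round(r, 3))
```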
Measuring Reliability
•Temporal stability approaches (2 test administrations, same subjects, intervening
 time interval)
 Test-retest reliability
 Alternate-forms reliability
•Internal consistency approaches (1 test administration, same subjects)
 Split-half reliability
 Coefficient alpha
 Interscorer reliability
Test-retest Reliability
•Straightforward method
•Administer the identical test twice to the same group of examinees
•For ability/achievement tests, practice, maturation, or treatment effects can make the
 second score higher.
•Misleading as a measure of reliability for a variable that fluctuates rapidly, such as
 mood
Alternate-Forms Reliability
•Two different forms of the same test built to the same specifications
•Source of error variance: item-sampling differences
•Higher cost: a second form must also be developed, published, and marketed
Split-half Reliability
•Correlate the pairs of scores obtained from the equivalent halves of a test
•Higher reliability compared to the test-retest method
•Major challenge: splitting the test into equivalent halves. When items are ranked
 according to difficulty level, compare odd items versus even items.
•Suited to long questionnaires whose items measure the same construct
•Each half is shorter than the full test, so adjust the Pearson r between the halves
 with the Spearman-Brown formula
The Spearman-Brown formula
                rSB = 2rhh / (1 + rhh)
      rSB = estimated reliability of the full test
      rhh = half-test reliability
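A short sketch of the correction, together with one hypothetical way to form odd and even half scores from an examinee's item responses. The function names and the sample responses are assumptions for illustration.

```python
def odd_even_halves(item_scores):
    """Split one examinee's item scores into odd-item and even-item totals."""
    odd = sum(item_scores[0::2])   # items 1, 3, 5, ...
    even = sum(item_scores[1::2])  # items 2, 4, 6, ...
    return odd, even

def spearman_brown(r_hh):
    """Estimate full-test reliability from the half-test correlation."""
    return 2 * r_hh / (1 + r_hh)

print(odd_even_halves([1, 1, 0, 1, 0, 0]))
print(round(spearman_brown(0.70), 2))
```

A half-test correlation of .70 therefore implies an estimated full-test reliability of about .82.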
Coefficient Alpha
•The mean of all possible split-half coefficients
               Alpha = [n/(n - 1)] x [(Vart - ΣVari)/Vart]
               n = number of items
               Vart = variance of the whole test (standard deviation squared)
               ΣVari = sum of the variances of all n items
•It is an index of internal consistency = interrelatedness of individual items
•Cronbach (1951) derived this from KR20 (Kuder-Richardson formula 20)
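The alpha formula above can be sketched directly. The response matrix is hypothetical (four examinees, three items), and the function name `coefficient_alpha` is illustrative. Population variances are used throughout; sample variances would give the same alpha because the N - 1 factor cancels in the ratio.

```python
from statistics import pvariance

def coefficient_alpha(item_scores):
    """Coefficient alpha; item_scores has one row per examinee, one column per item."""
    n = len(item_scores[0])                       # number of items
    totals = [sum(row) for row in item_scores]    # whole-test score per examinee
    var_total = pvariance(totals)                 # Vart
    sum_item_var = sum(pvariance(col) for col in zip(*item_scores))  # sum of Vari
    return (n / (n - 1)) * (var_total - sum_item_var) / var_total

# Hypothetical ratings: 4 examinees x 3 items
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 5],
]

alpha = coefficient_alpha(responses)
print(round(alpha, 3))
```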
  Interscorer Reliability
•For projective tests, which leave judgments to the examiner in the assignment of scores
•Two or more examiners score the sample independently, then the scores are correlated
•Suitable for qualitative research
•The test manual defines the appropriate training and experience required of the examiners
Item Response Theory (IRT)
•Also called Latent Trait Theory (LTT)
•Comprises a collection of mathematical models and statistical tools
•Applications:
 analyzing items and scales,
 developing homogeneous psychological measures,
 measuring individual psychological constructs, e.g., intelligence,
 administering psychological tests by computer
Item Response Theory (IRT) (cont.)
It includes 3 fundamental elements:
Item Response Functions (IRFs) – mathematical functions
Item Information Functions (IIFs) – reliability & measurement precision
Assumptions of invariance – 2 assumptions
Item Response Theory (IRT) (cont.)
•IRT is a branch of psychometrics that provides precision over a breadth of
 scales used to measure latent constructs, or underlying traits that are
 not directly observable
•Consists of a class of statistical procedures that are used to model the association
 between an individual's responses to survey questions/items (in probabilistic
 terms) and an underlying latent trait that is measured by the items.
•Appropriate for variables such as subjective health status, treatment outcomes,
 and quality of life.
Result of IRT
•The results of IRT analysis can be used to determine
 whether scale items are appropriate for measuring a particular trait,
 how well items in a scale "hang together"
 how to characterize the continuum of the underlying construct,
 how strongly each of the items is connected to the underlying construct.
Item Response Functions (IRFs)
•Also known as an Item Characteristic Curve (ICC) – a mathematical equation that
 describes the relationship between the amount of a latent trait an individual
 possesses and the probability of giving a designated response (e.g., a correct answer) to
 a test item that is designed to measure that construct.
•The latent trait is assumed to directly influence the examinee’s responses to the
 items on the test (designed to measure the trait in question)
[Figure: item characteristic curve (ICC)]
Item Information Functions (IIFs)
•Information reduces uncertainty. More information brings you closer to the
 answer or result and leads to more precise measurement.
•Information is the capacity of a test item to differentiate among people.
•Some items differentiate among individuals at low trait levels; others
 differentiate among individuals at high trait levels.
•Item information functions can be derived from IRF.
•Item information functions can be added together to derive scale information
 function.
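A minimal sketch of how an information function follows from an IRF. It assumes the simplest one-parameter logistic model with discrimination fixed at 1, for which item information equals P(1 - P); other IRT models use different expressions. The function names and item difficulties are hypothetical.

```python
from math import exp

def irf(theta, difficulty):
    """One-parameter logistic IRF: probability of the designated response."""
    return 1 / (1 + exp(-(theta - difficulty)))

def item_information(theta, difficulty):
    """Item information under the assumed model: P * (1 - P)."""
    p = irf(theta, difficulty)
    return p * (1 - p)

def scale_information(theta, difficulties):
    """Scale information is the sum of the item information functions."""
    return sum(item_information(theta, b) for b in difficulties)

easy, hard = -1.5, 1.5  # hypothetical item difficulties
for theta in (-1.5, 0.0, 1.5):
    print(theta,
          round(item_information(theta, easy), 3),
          round(item_information(theta, hard), 3),
          round(scale_information(theta, [easy, hard]), 3))
```

Each item is most informative where the trait level matches its difficulty, so the easy item discriminates best among low-trait examinees and the hard item among high-trait examinees.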
[Figure: item information function (IIF) curves]
Invariance In IRT
•Two separate but related ideas
•First, an examinee's position on the latent-trait scale can be estimated from the
 responses to any set of test items with known IRFs.
•Second, IRFs do not depend on the characteristics of a particular population. The
 results from different samples might help to fine-tune different parts of the IRF, but
 the outcomes should fall on the same curve. The scale of the trait exists
 independently of any set of items and independently of any particular population.
Example: Data Collection
            ITEM 1   ITEM 2   ITEM 3   ITEM 4   ITEM 5   AVERAGE
PERSON 1       1        1        1        1        1       1
PERSON 2                1        1        1        1       0.8
PERSON 3                         1        1        1       0.6
PERSON 4                                  1        1       0.4
PERSON 5                                           1       0.2
AVERAGE       0.8      0.6      0.4      0.2       0
            ITEM 1   ITEM 2   ITEM 3   ITEM 4   ITEM 5   AVERAGE   ITEM 6
PERSON 1       1        1        1        1        1       1         0
PERSON 2                1        1        1        1       0.8       0
PERSON 3                         1        1        1       0.6       0
PERSON 4                                  1        1       0.4       0
PERSON 5                                           1       0.2       1
AVERAGE       0.8      0.6      0.4      0.2       0                 0.8
PERSON 6       1        1        0        0        0       0.4
Calculate Probability
Probability =
1 / (1 + exp(-(proficiency - difficulty)))
            ITEM 1   ITEM 2   ITEM 3   ITEM 4   ITEM 5   TSP
PERSON 1     0.55     0.6      0.65     0.69     0.73    1
PERSON 2     0.5      0.55     0.6      0.65     0.69    0.8
PERSON 3     0.45     0.5      0.55     0.6      0.65    0.6
PERSON 4     0.4      0.45     0.5      0.55     0.6     0.4
PERSON 5     0.35     0.4      0.45     0.5      0.55    0.2
TID          0.8      0.6      0.4      0.2      0
TSP – tentative student proficiency
TID – tentative item difficulty
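The table can be reproduced with a short sketch that applies the logistic probability formula to every person-item pair, using the tentative proficiencies (TSP) and difficulties (TID) shown. The function and variable names are illustrative.

```python
from math import exp

def probability(proficiency, difficulty):
    """Probability of a correct response under the logistic model."""
    return 1 / (1 + exp(-(proficiency - difficulty)))

tsp = [1.0, 0.8, 0.6, 0.4, 0.2]  # tentative student proficiency, persons 1-5
tid = [0.8, 0.6, 0.4, 0.2, 0.0]  # tentative item difficulty, items 1-5

# One row per person, one column per item, rounded to two decimals
table = [[round(probability(p, d), 2) for d in tid] for p in tsp]

for person, row in enumerate(table, start=1):
    print(f"PERSON {person}", row)
```

When proficiency equals difficulty the probability is exactly 0.5, which is why 0.5 runs along one diagonal of the table.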
Data Comparison
            ITEM 1   ITEM 2   ITEM 3   ITEM 4   ITEM 5   TSP
PERSON 1     0.55     0.6      0.65     0.69     0.73    1
PERSON 2     0.5      0.55     0.6      0.65     0.69    0.8
PERSON 3     0.45     0.5      0.55     0.6      0.65    0.6
PERSON 4     0.4      0.45     0.5      0.55     0.6     0.4
PERSON 5     0.35     0.4      0.45     0.5      0.55    0.2
TID          0.8      0.6      0.4      0.2      0

            ITEM 1   ITEM 2   ITEM 3   ITEM 4   ITEM 5   AVERAGE
PERSON 1       1        1        1        1        1       1
PERSON 2                1        1        1        1       0.8
PERSON 3                         1        1        1       0.6
PERSON 4                                  1        1       0.4
PERSON 5                                           1       0.2
AVERAGE       0.8      0.6      0.4      0.2       0
        Use information to further calibrate the estimation using
        a computer program
Steps
•Present the data as ICCs (IRFs)
•There is a mathematical way to compute how much information each ICC can
 tell us: the Item Information Function (IIF)
•Present the data as IIF curves
•Balance test forms using the TIF (test information function) curve, the sum of all IIFs
  The New Rules Of Measurement
Old vs new
CTT                                     IRT
Std error of measurement is assumed     Std error of measurement becomes
to be constant and applies to all       substantially larger at both
examinees                               extremes of ability
A longer test is more reliable than     A shorter test can be more
a shorter test                          reliable than a longer test
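The first contrast can be sketched numerically. The CTT side uses the standard formula SEM = SD x sqrt(1 - rxx), one value for every examinee. The IRT side assumes a one-parameter logistic model, where the standard error at a trait level is 1 / sqrt(information). The item difficulties, SD, and reliability are hypothetical values chosen for illustration.

```python
from math import exp, sqrt

def ctt_sem(sd, r_xx):
    """CTT standard error of measurement: a single value for all examinees."""
    return sd * sqrt(1 - r_xx)

def item_information(theta, difficulty):
    """Item information under an assumed one-parameter logistic model."""
    p = 1 / (1 + exp(-(theta - difficulty)))
    return p * (1 - p)

def irt_standard_error(theta, difficulties):
    """IRT standard error at trait level theta: 1 / sqrt(summed information)."""
    info = sum(item_information(theta, b) for b in difficulties)
    return 1 / sqrt(info)

difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical item difficulties

print("CTT SEM:", round(ctt_sem(10, 0.91), 2))
for theta in (-3, 0, 3):
    print("IRT SE at theta", theta, "=",
          round(irt_standard_error(theta, difficulties), 2))
```

With items clustered around average difficulty, the IRT standard error is smallest in the middle of the trait range and grows toward both extremes.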
Special Circumstances In The Estimation Of Reliability
1. Unstable characteristics
 - Emotional reactivity as measured by electrodermal or galvanic skin response
  fluctuates quickly in reaction to:
  Loud noises
  Underlying thought processes
  Stressful environments
2. Speed and power tests
  A speed test contains uniform, simple questions; scores reflect speed of performance
  A power test allows enough time, but no test taker will obtain a perfect score
Special Circumstances In The Estimation Of Reliability (cont.)
3. Restriction of Range
 Test-retest reliability will be low if it is based on a sample of homogeneous
  subjects
 It will be inappropriate to estimate the reliability of an intelligence test by
  administering it twice to a sample of college students.
4. Reliability of criterion-referenced tests
 Test items are designed to identify specific skills.
 Items tend to be of the “pass/fail” variety
 Variability of scores among examinees is quite minimal
 Classification is important
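The restriction-of-range point (item 3) can be illustrated with a small simulation. All values are assumptions: true scores with SD 15 and random errors with SD 5 on each of two administrations, then the same correlation recomputed for a homogeneous subgroup scoring 115 or higher on the first testing.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(1)  # fixed seed so the sketch is repeatable

true = [random.gauss(100, 15) for _ in range(5000)]
test1 = [t + random.gauss(0, 5) for t in true]   # first administration
test2 = [t + random.gauss(0, 5) for t in true]   # second administration

r_full = pearson_r(test1, test2)

# Homogeneous subgroup: only examinees scoring 115+ on the first testing
pairs = [(a, b) for a, b in zip(test1, test2) if a >= 115]
r_restricted = pearson_r([a for a, _ in pairs], [b for _, b in pairs])

print(round(r_full, 2), round(r_restricted, 2), len(pairs))
```

The test and its measurement error are identical in both cases; only the spread of true scores in the sample changes, and the test-retest coefficient drops with it.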
Reliability Coefficients
• What is an acceptable level of reliability?
•E.g.:
  Individual differences in characteristics – .90
  Standard tests with reliabilities of .70 can be useful
  Tests with lower reliability than that can still be useful in research
•On a more practical level, acceptable standards of reliability depend on the
 amount of measurement error the user can tolerate in the proposed application of
 the test
Thank You