Psychometric Testing Essentials

This document discusses various concepts related to norms, reliability, and interpreting psychological test scores. It covers: 1) How raw test scores are converted to standardized scores based on a norm group to allow comparison to the general population. 2) Key statistical concepts used in analyzing test scores like frequency distributions, measures of central tendency and variability, and the normal distribution. 3) Different methods of establishing reliability of test scores including test-retest, parallel forms, split-half, and internal consistency reliability. 4) Factors that can introduce errors and reduce reliability such as testing conditions, subjective scoring, and practice effects. Reliability coefficients indicate how much true score versus error is being measured.

Uploaded by

JAGATHESAN
Norms and Reliability

Norm and Test Standardization


•To make sense of individual test scores, the raw scores are converted to some form of derived score based on comparison to a standardization or norm group
•A norm group should be:
Representative of the population
A large, heterogeneous sample
Raw Scores
•The most basic level of information provided by a psychological test
•To make a raw score meaningful, the researcher interprets it by consulting the norms
•A test interpreted this way is a “norm-referenced test”
Essential Statistical Concepts
Approaches to organizing and summarizing quantitative data
Frequency distributions
Measures of central tendency
Measures of variability
The normal distribution
Skewness
Frequency Distributions
•Prepared by specifying a small number of usually equal-sized class intervals and tallying how many scores fall within each interval
•The sum of the frequencies for all intervals = N (the total number of scores in the sample)
•Histogram – graphic representation of the same information contained in the frequency distribution
•Frequency polygon – the frequency of each class interval is represented by a single point rather than a column, and the points are joined by straight lines
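The tallying step can be sketched in a few lines of Python. The scores, the interval width, and the function name `frequency_distribution` are all invented for illustration; the printed rows act as a rough text histogram.

```python
from collections import Counter

def frequency_distribution(scores, lower, width, n_intervals):
    """Tally how many scores fall in each equal-sized class interval."""
    counts = Counter()
    for s in scores:
        index = (s - lower) // width      # which interval the score falls in
        if 0 <= index < n_intervals:
            counts[index] += 1
    # One (interval_start, interval_end, frequency) triple per interval, lowest first.
    return [(lower + i * width, lower + (i + 1) * width - 1, counts[i])
            for i in range(n_intervals)]

# Hypothetical integer raw scores, invented for illustration only.
scores = [12, 15, 17, 21, 22, 22, 25, 28, 31, 34, 35, 38]
table = frequency_distribution(scores, lower=10, width=10, n_intervals=3)

for start, end, freq in table:
    print(f"{start}-{end}: {'#' * freq} ({freq})")   # rough text histogram

# The frequencies across all intervals sum to N.
assert sum(freq for _, _, freq in table) == len(scores)
```

Joining the interval midpoints instead of drawing bars would give the frequency polygon for the same tallies.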
Measures of Central Tendency
•Mean – adding up all the scores and dividing by N
•Median – the middlemost score when all the scores have been ranked
•Mode – the most frequently occurring score
•If a distribution of scores is skewed, for example by a few extreme scores, the median is a better index of central tendency than the mean
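A small sketch of the three measures, using Python's standard `statistics` module. The scores are hypothetical and include one extreme value so that the effect of skew on the mean is visible.

```python
import statistics

# Hypothetical scores; the single extreme value (40) skews the distribution.
scores = [4, 5, 5, 6, 7, 8, 40]

mean = statistics.mean(scores)      # sum of all scores divided by N
median = statistics.median(scores)  # middlemost score once the scores are ranked
mode = statistics.mode(scores)      # most frequently occurring score

print(f"mean={mean:.2f} median={median} mode={mode}")
```

Here the extreme score pulls the mean well above the median, which is why the median is preferred for skewed distributions.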
Measures of Variability
•Describe the degree of dispersion in a group of scores
•Standard deviation (s or SD) – reflects the degree of dispersion in a group of scores

[Figure: score distributions compared; Distribution A has the larger standard deviation]
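As a rough illustration of dispersion, the sketch below compares two hypothetical groups that share the same mean. The data are invented, and the use of the sample standard deviation (dividing by N - 1) is an assumption; the slides do not say which divisor is intended.

```python
import statistics

# Two hypothetical groups with the same mean (50) but different spread.
group_a = [30, 40, 50, 60, 70]   # widely dispersed, like Distribution A
group_b = [46, 48, 50, 52, 54]   # tightly clustered

# statistics.stdev is the sample standard deviation (divides by N - 1).
sd_a = statistics.stdev(group_a)
sd_b = statistics.stdev(group_b)

print(f"SD of A = {sd_a:.2f}, SD of B = {sd_b:.2f}")
```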
Normal Distribution
•The distribution of scores resembles a symmetrical, mathematically defined, bell-shaped curve
Skewness
•Refers to the symmetry or asymmetry of a frequency distribution
•Skewed distributions usually signify that the test developer has included too few easy items or too few hard items
Raw Scores Transformations
Transforming raw scores into more interpretable and useful forms of
information
Percentile and percentile ranks
Standard scores
T-scores and standardized scores
Normalizing standard scores
Percentile and Percentile Ranks
•A percentile expresses the percentage of persons in the standardization sample who scored below a specific raw score.
•For example, an examinee obtains an IQ of 130 on an IQ test. An IQ of 130 corresponds to a percentile of 98 (P98), which means that his IQ exceeds that of 98% of the standardization sample.
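The definition translates directly into a count. This is a minimal sketch: the norm sample is hypothetical (and far smaller than a real standardization sample), and `percentile_rank` is an illustrative name.

```python
def percentile_rank(raw_score, norm_sample):
    """Percentage of the standardization sample scoring below raw_score."""
    below = sum(1 for s in norm_sample if s < raw_score)
    return 100 * below / len(norm_sample)

# Hypothetical norm sample of 10 scores, invented for illustration.
norm_sample = [85, 90, 95, 100, 100, 105, 110, 115, 120, 130]

pr = percentile_rank(115, norm_sample)
print(f"A score of 115 exceeds {pr:.0f}% of this sample")
```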
Standard Scores
•Standard scores express the distance from the mean in standard deviation units: Z = (raw score - M) / SD

Example:
For a normative sample, M = 50, SD = 8
A: raw score of 35
Z = (35 - 50) / 8 = -1.88 (below average)
B: raw score of 50
Z = (50 - 50) / 8 = 0 (exactly average)
C: raw score of 70
Z = (70 - 50) / 8 = +2.50 (above average)
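The three examinees above can be checked with a short sketch. It assumes the usual definition of a standard score, Z = (raw score - M) / SD, which is what the stated results imply; the function name `z_score` is illustrative.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

M, SD = 50, 8   # normative sample from the example

for label, raw in [("A", 35), ("B", 50), ("C", 70)]:
    print(label, raw, round(z_score(raw, M, SD), 2))
```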
T-Score and Other Standardized Scores
•Convey the same information as standard scores, on a different scale
•Expressed in positive whole numbers
•T-scores have mean = 50 and standard deviation = 10
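A mean of 50 and a standard deviation of 10 imply the linear conversion T = 50 + 10z. That formula is not printed on the slide, so treat the sketch below as an assumption that follows from those two values. The z-scores reuse examinees A, B, and C from the standard score example.

```python
def t_score(z):
    """Rescale a z-score to the T-score metric (mean 50, SD 10)."""
    return 50 + 10 * z

# z-scores of examinees A, B, and C (M = 50, SD = 8; raw scores 35, 50, 70).
for label, z in [("A", -1.875), ("B", 0.0), ("C", 2.5)]:
    print(label, round(t_score(z)))   # rounded to a whole number
```

Because the conversion is linear, the shape of the distribution and the relative standing of each examinee are unchanged.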
Normalizing Standard Scores
•Transforming a nonnormal distribution into a normal distribution by converting percentiles to normalized standard scores
•The percentile of each raw score is used to determine its corresponding standard score
•Normalized standard scores are a nonlinear transformation of the raw scores, so mathematical relationships among the raw scores may not hold for the normalized scores
•Test developers are advised to adjust the difficulty of the test items in order to produce a normal distribution
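The percentile-to-standard-score step can be sketched with `statistics.NormalDist` from the Python standard library. The skewed sample is invented, and the mid-rank percentile (counting half of the tied scores) is an assumption of this sketch, used so that the proportion never reaches exactly 0 or 1 for scores present in the sample.

```python
from statistics import NormalDist

def normalized_z(raw_score, norm_sample):
    """Convert a raw score to a normalized z-score via its percentile."""
    n = len(norm_sample)
    below = sum(1 for s in norm_sample if s < raw_score)
    equal = sum(1 for s in norm_sample if s == raw_score)
    # Mid-rank proportion (an assumption): keeps p strictly inside (0, 1)
    # for any score that actually occurs in the sample.
    p = (below + 0.5 * equal) / n
    # Standard score that cuts off the same proportion of a normal curve.
    return NormalDist().inv_cdf(p)

# Hypothetical, positively skewed norm sample.
sample = [1, 1, 1, 2, 2, 3, 4, 6, 9, 15]

for raw in (1, 2, 15):
    print(raw, round(normalized_z(raw, sample), 2))
```

The raw scores 1, 2, and 15 are far from equally spaced, yet their normalized scores depend only on rank, which is the sense in which the transformation is nonlinear.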
Stanines, Sten and C Scale
•Stanine scale
All raw scores are converted to a single-digit system of scores ranging from 1 to 9
Mean = 5, standard deviation ≈ 2

•Sten scale
10 units: 5 units above and 5 units below the mean

•C scale
Consists of 11 units
Selecting a Norm Group
•Random sampling
•Stratified random sampling
Used to ensure that smaller norm groups are truly representative of the population
Selecting a Norm Group (cont.)
•Age norm – the level of test performance for each separate age group in the normative sample
Facilitates same-age comparisons

•Grade norm – the level of test performance for each separate grade in the normative sample
Useful in reporting the achievement levels of schoolchildren

•Local norms – derived from representative local examinees

•Subgroup norms – scores obtained from an identified subgroup
Expectancy Tables
•Portray the established relationship between test scores and expected outcomes on a relevant task
•Based on the previous performance of a large and representative sample of examinees whose test performances and criterion outcomes reflected existing social conditions and institutional policies
Criterion-Referenced Tests
Dimension | Criterion-referenced Tests | Norm-referenced Tests
Purpose | Compare examinees’ performance to a standard | Compare examinees’ performance to one another
Item content | Narrow domain of skills with real-world relevance | Broad domain of skills with indirect relevance
Item selection | Most items of similar difficulty level | Items vary widely in difficulty level
Interpretation of scores | Scores usually expressed as a percentage, with passing level predetermined | Scores usually expressed as a standard score, percentile, or grade equivalent
CONCEPTS OF RELIABILITY

BY,
TIW SEOK LIAN
CINDY LOH
Definition of Reliability
•Attribute of consistency in measurement
•A continuum ranging from minimal consistency (e.g., reaction time) to nearly perfect repeatability (e.g., weight)
Classical Test Theory
= Theory of True and Error Scores

X=T+e
X = obtained score
T = true score
e = errors of measurement
Sources of Measurement Errors
X = T + es + eu
es = systematic measurement error
eu = unsystematic measurement error
Important sources:
•Item selection (e.g., question wording)
•Test administration (e.g., environment, emotion)
•Test scoring (e.g., examiner, scoring criteria, subjective judgement)
•Systematic measurement error
Measurement Error and Reliability
•Errors can reduce reliability
•Measurement errors are incredibly complex and varied
•Classical test theory assumes that measurement errors are random
•Mean error of measurement = 0
•True scores and errors are uncorrelated, rTE = 0
•Errors on different tests are uncorrelated, r12 = 0
The Reliability Coefficient
•The ratio of true score variance to the total variance of the test scores
•When the measurement error is very small, the reliability coefficient, rxx, approaches 1.0
•0 ≤ rxx ≤ 1.0
•As rxx approaches 1.0, the test captures minimal measurement error and produces consistent and reliable scores.
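The ratio can be illustrated with a small simulation of X = T + e. Everything numeric here is an assumption chosen for illustration: true scores with SD 15, random errors with mean 0 and SD 5, and a fixed random seed. In real testing the true scores are never observed, so this only shows what the coefficient means.

```python
import random
import statistics

random.seed(0)   # fixed seed so the sketch is repeatable

N = 10_000
true_scores = [random.gauss(100, 15) for _ in range(N)]    # T
errors = [random.gauss(0, 5) for _ in range(N)]            # e: random, mean 0, unrelated to T
observed = [t + e for t, e in zip(true_scores, errors)]    # X = T + e

# Reliability: true score variance divided by total (observed) variance.
r_xx = statistics.variance(true_scores) / statistics.variance(observed)
print(f"estimated r_xx = {r_xx:.3f}")   # theoretical value: 225 / (225 + 25) = 0.90
```

Raising the error SD in the simulation lowers the ratio, which is the sense in which measurement error reduces reliability.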
The Correlation Coefficient
•Expresses the degree of linear relationship, r, between two sets of scores obtained from the same persons.

[Figure: scatterplots illustrating r ≈ 1.0, r ≈ 0.4, r ≈ 0, r ≈ -0.4, r ≈ -0.7, and r ≈ -1.0]
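A minimal sketch of the Pearson correlation between two sets of scores from the same persons, written out from its definition. The two score lists are hypothetical, for instance a first and second administration of the same test.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for the same five persons on two occasions.
first = [10, 12, 15, 18, 20]
second = [11, 13, 14, 19, 21]

r = pearson_r(first, second)
print(f"r = {r:.2f}")
```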
Measuring Reliability
•Temporal stability approaches (2 test administrations, same subjects, intervening time interval)
Test-retest reliability
Alternate-forms reliability
•Internal consistency approaches (1 test administration, same subjects)
Split-half reliability
Coefficient alpha
Interscorer reliability
Test-retest Reliability
•A straightforward method
•Administer the identical test twice to the same group of people and correlate the two sets of scores
•For ability and achievement tests, practice, maturation, or treatment effects can make the second score higher
•Misleading as a measure of reliability for a variable that fluctuates rapidly, such as mood
Alternate-Forms Reliability
•Two different forms of the same test, built to the same specifications
•Source of error variance: item-sampling differences
•Higher cost: a second form of the test has to be published and put on the market
Split-half Reliability
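•Correlate the pairs of scores obtained from equivalent halves of a test
•Tends to give a higher reliability estimate than the test-retest method
•Major challenge: dividing the test into equivalent halves. When items are ranked according to difficulty level, compare odd-numbered items with even-numbered items
•Suited to long questionnaires whose items measure the same construct
•Each half is shorter than the full test, so the Pearson r between the halves is adjusted with the Spearman-Brown formula to estimate the reliability of the full test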
The Spearman-Brown formula
rSB = 2rhh / (1 + rhh)

rSB = estimated reliability of the full test
rhh = half-test reliability
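An odd-even split followed by the Spearman-Brown correction, rSB = 2rhh / (1 + rhh), can be sketched as follows. The item responses (1 = correct, 0 = incorrect) are invented, and the helper names are illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_brown(r_hh):
    """Estimate full-test reliability from the half-test correlation."""
    return 2 * r_hh / (1 + r_hh)

def split_half_reliability(item_scores):
    """item_scores holds one list of item scores per examinee."""
    odd = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    r_hh = pearson_r(odd, even)
    return r_hh, spearman_brown(r_hh)

# Hypothetical responses of five examinees to six items.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]

r_hh, r_sb = split_half_reliability(responses)
print(f"half-test r = {r_hh:.2f}, Spearman-Brown estimate = {r_sb:.2f}")
```

The corrected estimate is higher than the half-test correlation because the full test is twice as long as either half.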
Coefficient Alpha
•Can be thought of as the mean of all possible split-half coefficients

Alpha = [n/(n - 1)] x [(Vart - ΣVari)/Vart]

n = number of items
Vart = variance of the whole test (standard deviation squared)
ΣVari = sum of the variances of all n items

•It is an index of internal consistency, i.e., the interrelatedness of the individual items

•Cronbach (1951) developed it as a generalization of KR-20 (Kuder-Richardson formula 20)
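The alpha formula above maps directly onto code. In this sketch the ratings are hypothetical (five examinees, four items on a 1-5 scale), and sample variances are used throughout, which is an assumption; the ratio comes out the same as long as the same kind of variance is used for the items and for the total.

```python
import statistics

def cronbach_alpha(item_scores):
    """Coefficient alpha; item_scores holds one list of item scores per examinee."""
    n = len(item_scores[0])                              # number of items
    totals = [sum(row) for row in item_scores]           # whole-test score per examinee
    var_total = statistics.variance(totals)              # Vart
    sum_item_vars = sum(
        statistics.variance([row[i] for row in item_scores])
        for i in range(n)
    )                                                    # sum of Vari over all n items
    return (n / (n - 1)) * ((var_total - sum_item_vars) / var_total)

# Hypothetical ratings of five examinees on four items.
ratings = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [2, 2, 1, 2],
    [5, 4, 5, 5],
    [1, 2, 2, 1],
]

alpha = cronbach_alpha(ratings)
print(f"alpha = {alpha:.2f}")
```

The items in this invented data set rise and fall together across examinees, so alpha comes out high.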
Interscorer Reliability
•Needed for tests, such as projective tests, that leave judgments to the examiner in the assignment of scores
•Two or more examiners score the same sample of tests independently, then their scores are correlated
•Suitable for qualitative research
•The test manual defines the training and experience required of examiners
Item Response Theory (IRT)
•Also called Latent Trait Theory (LTT)
•A collection of mathematical models and statistical tools
•Applications:
analyzing items and scales,
developing homogeneous psychological measures,
measuring individual psychological constructs, e.g. intelligence,
administering psychological tests by computer
Item Response Theory (IRT)
It includes 3 fundamental elements:
Item Response Functions (IRFs) – mathematical functions
Item Information Functions (IIFs) – reliability & measurement precision
Assumptions of invariance – 2 assumptions
Item Response Theory (IRT)
•IRT is a branch of psychometrics that provides precision across a breadth of scales used to measure latent constructs, i.e., underlying traits that are not directly observable
•It consists of a class of statistical procedures used to model, in probabilistic terms, the association between an individual's responses to survey questions/items and the underlying latent trait measured by those items
•It is appropriate for variables such as subjective health status, treatment outcomes, and quality of life
Result of IRT
•The results of IRT analysis can be used to determine
whether scale items are appropriate for measuring a particular trait,
how well items in a scale "hang together",
how well the items characterize the continuum of the underlying construct,
how strongly each of the items is connected to the underlying construct.
Item Response Functions (IRFs)
•Also known as the Item Characteristic Curve (ICC): a mathematical equation that describes the relationship between the amount of a latent trait an individual possesses and the probability of giving a designated response (e.g., the correct answer) to a test item that is designed to measure that construct.
•The latent trait is assumed to directly influence the examinee’s responses to the items on the test (which are designed to measure the trait in question)
[Figure: ICC curve]
Item Information Functions (IIFs)
•Information reduces uncertainty: the more information an item provides, the more precise the measurement.
•Item information reflects the capacity of a test item to differentiate among people.
•Some items differentiate among individuals with low trait levels, and others among individuals with high trait levels.
•Item information functions can be derived from the IRF.
•Item information functions can be added together to derive the scale information function.
[Figure: IIF curves]
Invariance In IRT
•Two separate but related ideas
•First, an examinee's position on the latent trait can be estimated from the responses to any set of test items with known IRFs.
•Second, IRFs do not depend on the characteristics of a particular population. The results from different samples might help to fine-tune different parts of the IRF, but the outcomes should fall on the same curve. The scale of the trait exists independently of any set of items and independently of any particular population.
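Example: Data Collection
PERSON | ITEM 1 | ITEM 2 | ITEM 3 | ITEM 4 | ITEM 5 | AVERAGE
PERSON 1 | 1 | 1 | 1 | 1 | 1 | 1
PERSON 2 | 0 | 1 | 1 | 1 | 1 | 0.8
PERSON 3 | 0 | 0 | 1 | 1 | 1 | 0.6
PERSON 4 | 0 | 0 | 0 | 1 | 1 | 0.4
PERSON 5 | 0 | 0 | 0 | 0 | 1 | 0.2
AVERAGE | 0.8 | 0.6 | 0.4 | 0.2 | 0 |
The AVERAGE column gives each person's proportion of correct responses. The AVERAGE row gives the proportion of incorrect responses to each item (1 minus the proportion correct).

PERSON | ITEM 1 | ITEM 2 | ITEM 3 | ITEM 4 | ITEM 5 | AVERAGE | ITEM 6
PERSON 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0
PERSON 2 | 0 | 1 | 1 | 1 | 1 | 0.8 | 0
PERSON 3 | 0 | 0 | 1 | 1 | 1 | 0.6 | 0
PERSON 4 | 0 | 0 | 0 | 1 | 1 | 0.4 | 0
PERSON 5 | 0 | 0 | 0 | 0 | 1 | 0.2 | 1
AVERAGE | 0.8 | 0.6 | 0.4 | 0.2 | 0 | | 0.8

PERSON 6 | 1 | 1 | 0 | 0 | 0 | 0.4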
Calculate Probability
Probability = 1 / (1 + exp(-(proficiency - difficulty)))
PERSON | ITEM 1 | ITEM 2 | ITEM 3 | ITEM 4 | ITEM 5 | TSP
PERSON 1 | 0.55 | 0.6 | 0.65 | 0.69 | 0.73 | 1
PERSON 2 | 0.5 | 0.55 | 0.6 | 0.65 | 0.69 | 0.8
PERSON 3 | 0.45 | 0.5 | 0.55 | 0.6 | 0.65 | 0.6
PERSON 4 | 0.4 | 0.45 | 0.5 | 0.55 | 0.6 | 0.4
PERSON 5 | 0.35 | 0.4 | 0.45 | 0.5 | 0.55 | 0.2
TID | 0.8 | 0.6 | 0.4 | 0.2 | 0 |
TSP – tentative student proficiency
TID – tentative item difficulty
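The table can be reproduced in a few lines, reading the probability formula as 1 / (1 + exp(-(proficiency - difficulty))). That reading is inferred from the table values (for example, proficiency 1 and difficulty 0.8 give 0.55); the function name `p_correct` is illustrative.

```python
import math

def p_correct(proficiency, difficulty):
    """Logistic probability of a correct response."""
    return 1 / (1 + math.exp(-(proficiency - difficulty)))

tsp = [1.0, 0.8, 0.6, 0.4, 0.2]   # tentative student proficiencies, persons 1-5
tid = [0.8, 0.6, 0.4, 0.2, 0.0]   # tentative item difficulties, items 1-5

# One row per person, one column per item, rounded as in the table.
table = [[round(p_correct(p, d), 2) for d in tid] for p in tsp]

for person, row in enumerate(table, start=1):
    print(f"PERSON {person}", row)
```

A person whose proficiency equals an item's difficulty has a probability of exactly 0.5, which is why 0.5 runs along one diagonal of the table.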
Data Comparison
PERSON | ITEM 1 | ITEM 2 | ITEM 3 | ITEM 4 | ITEM 5 | TSP
PERSON 1 | 0.55 | 0.6 | 0.65 | 0.73 | 0.69 | 1
PERSON 2 | 0.5 | 0.55 | 0.6 | 0.69 | 0.65 | 0.8
PERSON 3 | 0.45 | 0.5 | 0.55 | 0.65 | 0.6 | 0.6
PERSON 4 | 0.4 | 0.45 | 0.5 | 0.6 | 0.55 | 0.4
PERSON 5 | 0.35 | 0.4 | 0.45 | 0.55 | 0.5 | 0.2
TID | 0.8 | 0.6 | 0.4 | 0 | 0.2 |

PERSON | ITEM 1 | ITEM 2 | ITEM 3 | ITEM 4 | ITEM 5 | AVERAGE
PERSON 1 | 1 | 1 | 1 | 1 | 1 | 1
PERSON 2 | 0 | 1 | 1 | 1 | 1 | 0.8
PERSON 3 | 0 | 0 | 1 | 1 | 1 | 0.6
PERSON 4 | 0 | 0 | 0 | 1 | 1 | 0.4
PERSON 5 | 0 | 0 | 0 | 0 | 1 | 0.2
AVERAGE | 0.8 | 0.6 | 0.4 | 0.2 | 0 |

Use this information to further calibrate the estimates using a computer program
Steps
•Present the data as ICC curves (IRFs)
•The amount of information each ICC provides can be computed mathematically, using the Item Information Function (IIF)
•Present the data as IIF curves
•Balance test forms using the test information function (TIF) curve, the sum of all IIFs
The New Rules Of Measurement
Old vs new
CTT | IRT
Standard error of measurement is assumed to be constant and applies to all examinees | Standard error of measurement becomes substantially larger at both extremes of ability
A longer test is more reliable than a shorter test | A shorter test can be more reliable than a longer test
Special Circumstances In The Estimation Of Reliability
1. Unstable characteristics
Example: emotional reactivity as measured by electrodermal or galvanic skin response, which fluctuates quickly in reaction to:
loud noises
underlying thought processes
a stressful environment

2. Speed and power tests
A speed test contains uniform and simple questions, so scores reflect speed of performance
A power test allows enough time to attempt all items, but is difficult enough that no test taker will obtain a perfect score
CONT.
3. Restriction of range
Test-retest reliability will be low if it is based on a sample of homogeneous subjects
For example, it would be inappropriate to estimate the reliability of an intelligence test by administering it twice to a sample of college students
4. Reliability of criterion-referenced tests
Test items are designed to identify specific skills
Items tend to be of the “pass/fail” variety
Variability of scores among examinees is quite minimal
Classification is important
Reliability Coefficients
•What is an acceptable level of reliability?
•E.g.:
Measuring individual differences in characteristics: .90
Standard tests with reliabilities of .70 can be useful
Tests with reliability lower than that can be useful in research
•On a more practical level, the acceptable standard of reliability depends on the amount of measurement error the user can tolerate in the proposed application of the test
Thank You
