Structural Equation Modeling
DR. ARSHAD HASSAN
Structural Equation Modeling
SEM is an extension of the general linear model that
enables a researcher to test a set of regression
equations simultaneously.
SEM software can test traditional models, but it also
permits examination of more complex relationships
and models, such as confirmatory factor analysis and
path analyses.
Structural Equation Modeling
SEM: Structural Equation Modeling
CSA: Covariance Structure Analysis
Causal Models
Simultaneous Equation Modeling
Structural Equation Modeling
SEM is a combination of factor analysis and multiple
regression.
Structural Equation Modeling
The researcher first specifies a model based on
theory, then determines how to measure constructs,
collects data, and then inputs the data into the SEM
software package. The package fits the data to the
specified model and produces the results, which
include overall model fit statistics and parameter
estimates.
Theory
Theorize your model
What observed variables?
What latent variables?
Relationship between latent variables?
Relationship between latent variables and observed variables?
Correlated errors of measurement?
Structural Equation Modeling
SEM has a language all its own.
Manifest or observed variables are directly
measured by researchers, while latent or unobserved
variables are not directly measured but are inferred
by the relationships or correlations among measured
variables in the analysis.
This statistical estimation is accomplished in much
the same way that an exploratory factor analysis
infers the presence of latent factors from shared
variance among observed variables.
Structural Equation Modeling
Independent variables, which are assumed to be
measured without error, are called exogenous or
upstream variables;
dependent or mediating variables are called
endogenous or downstream variables.
Structural Equation Modeling
SEM users represent relationships among observed
and unobserved variables using path diagrams.
Ovals or circles represent latent variables, while
rectangles or squares represent measured variables.
Residuals are always unobserved, so they are
represented by ovals or circles.
Vocabulary
Measured variable
Observed variables, indicators or manifest variables in
an SEM design
Predictors and outcomes in path analysis
Squares in the diagram
Latent Variable
Unobservable variable in the model; also called a factor or construct
Construct driving measured variables in the
measurement model
Circles in the diagram
Vocabulary
Error or E
Variance left over after prediction of a measured variable
Disturbance or D
Variance left over after prediction of a factor
Exogenous Variable
Variable that predicts other variables
Endogenous Variables
A variable that is predicted by another variable
A predicted variable is endogenous even if it in turn
predicts another variable
Vocabulary
Parameters
The parameters of the model are regression coefficients for
paths between variables and variances/covariances of
independent variables. Parameters may be fixed to a
certain value (usually 0 or 1) or may be estimated.
In the diagram, an asterisk (*) represents a parameter to be
estimated. A 1 indicates that the parameter has been
fixed to the value 1. When two variables are not connected
by a path, the coefficient for that path is fixed at 0.
Why SEM?
Assumptions underlying the statistical analyses are clear
and testable, giving the investigator full control and
potentially furthering understanding of the analyses.
Graphical interface software boosts creativity and
facilitates rapid model debugging.
SEM programs provide overall tests of model fit and
individual parameter estimate tests simultaneously.
Regression coefficients, means, and variances may be
compared simultaneously, even across multiple
between-subjects groups.
Why SEM?
Measurement and confirmatory factor analysis models can
be used to purge errors, making estimated relationships
among latent variables less contaminated by measurement
error.
Ability to fit non-standard models, including flexible
handling of longitudinal data, databases with autocorrelated
error structures (time series analysis), and databases with
non-normally distributed variables and incomplete data.
This last feature of SEM is its most attractive quality. SEM
provides a unifying framework under which numerous linear
models may be fit using flexible, powerful software.
SEM Assumptions
A Reasonable Sample Size
Continuously and Normally Distributed Endogenous
Variables
Model Identification
Identification
Identification is a structural or mathematical
requirement in order for the SEM analysis to take
place.
Identification refers to the idea that there is at least
one unique solution for each parameter estimate in a
SEM model.
Identification
Models in which there is only one possible solution
for each parameter estimate are said to be just-identified.
Models for which there are an infinite number of
possible parameter estimate values are said to be
underidentified.
Finally, models that have more than one possible
solution (but one best or optimal solution) for each
parameter estimate are considered overidentified.
Model Identification
To determine whether the model is identified or not,
compare the number of data points to the number of
parameters to be estimated.
Since the input data set is the sample
variance/covariance matrix, the number of data
points is the number of unique variances and covariances in
that matrix, which can be calculated as m(m + 1)/2,
where m is the number of measured variables.
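The counting rule above is easy to check by hand; a minimal sketch (the function names are illustrative, not from any SEM package):

```python
def count_data_points(m):
    """Number of unique variances and covariances for m measured variables."""
    return m * (m + 1) // 2

def degrees_of_freedom(m, n_free_params):
    """df = data points minus free parameters; positive -> over-identified,
    zero -> just-identified, negative -> under-identified."""
    return count_data_points(m) - n_free_params

# Example: 6 measured variables give 6 * 7 / 2 = 21 data points.
print(count_data_points(6))        # 21
print(degrees_of_freedom(6, 13))   # 8 -> over-identified
```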
Structural Equation Modeling
The SEM can be divided into two parts.
The measurement model is the part which relates
measured variables to latent variables.
The structural model is the part that relates latent
variables to one another.
Structural Equation Modeling
Measurement Models
Structural Equation Modeling
Structural Models
Structural Equation Modeling
Simultaneous Models
Identification of the Measurement Model
The measurement portion of the model will probably be identified if:
There is only one latent variable, it has at least three indicators that
load on it, and the errors of these indicators are not correlated with
one another.
There are two or more latent variables, each has at least three
indicators that load on it, and the errors of these indicators are not
correlated, each indicator loads on only one factor, and the factors
are allowed to covary.
There are two or more latent variables, but there is a latent variable
on which only two indicators load; the errors of the indicators are
not correlated, each indicator loads on only one factor, and none of
the variances or covariances between factors is zero.
Identification of the Structural Model
This portion of the model may be identified if:
None of the latent dependent variables predicts
another latent dependent variable.
When a latent dependent variable does predict
another latent dependent variable, the relationship is
recursive, and the disturbances are not correlated.
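A recursive structural model has no feedback loops among the endogenous variables. Treating the structural paths as a directed graph, recursiveness can be checked with a simple cycle test; a sketch with hypothetical factor names:

```python
from collections import defaultdict

def is_recursive(paths):
    """Return True if the path model has no feedback loops (is recursive).
    `paths` is a list of (predictor, outcome) tuples."""
    graph = defaultdict(list)
    for pred, out in paths:
        graph[pred].append(out)

    visiting, done = set(), set()

    def has_cycle(node):
        if node in visiting:          # revisited on the current path -> loop
            return True
        if node in done:
            return False
        visiting.add(node)
        cycle = any(has_cycle(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return cycle

    return not any(has_cycle(v) for v in list(graph))

# Hypothetical models: F1 -> F2 -> F3 is recursive;
# adding F3 -> F1 creates a feedback loop (non-recursive).
print(is_recursive([("F1", "F2"), ("F2", "F3")]))                # True
print(is_recursive([("F1", "F2"), ("F2", "F3"), ("F3", "F1")]))  # False
```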
Handling of Incomplete Data
Typical ad hoc solutions to missing data problems include
listwise deletion of cases, where an entire case's record is
deleted if the case has one or more missing data points,
and
pairwise data deletion, where bivariate correlations are
computed only on cases with available data. Pairwise
deletion results in different Ns for each bivariate
covariance or correlation in the database.
Another commonly used ad hoc missing data handling
technique is substitution of the variable's mean for the
missing data points on that variable.
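The difference between listwise and pairwise deletion can be illustrated on a toy dataset (the values below are made up; `None` marks a missing observation):

```python
# Hypothetical data: three variables, five cases, with scattered missingness.
data = {
    "x": [1.0, 2.0, None, 4.0, 5.0],
    "y": [2.0, None, 3.0, 4.0, 6.0],
    "z": [1.0, 2.0, 3.0, None, 5.0],
}

def listwise_n(data):
    """Cases with no missing values on any variable (listwise deletion N)."""
    rows = zip(*data.values())
    return sum(all(v is not None for v in row) for row in rows)

def pairwise_n(data, a, b):
    """Cases with values present on both variables a and b (pairwise N)."""
    return sum(x is not None and y is not None
               for x, y in zip(data[a], data[b]))

print(listwise_n(data))            # 2 complete cases survive listwise deletion
print(pairwise_n(data, "x", "y"))  # 3
print(pairwise_n(data, "x", "z"))  # 3 -- each correlation gets its own N
```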
Handling of Incomplete Data
Listwise deletion can result in a substantial loss of power,
particularly if many cases each have a few data points missing
on a variety of variables, not to mention limiting statistical
inference to individuals who complete all measures in the
database.
Pairwise deletion is marginally better, but using different Ns
for each covariance or correlation can have profound
consequences for model fitting efforts, including impossible
solutions in some instances.
Finally, mean substitution will shrink the variances of the
variables where mean substitution took place, which is not
desirable.
Handling of Incomplete Data
If the proportion of cases with missing data is small, say five
percent or less, listwise deletion may be acceptable (Roth,
1994).
Of course, if the five percent (or fewer) cases are not missing
completely at random, inconsistent parameter estimates can
result.
Otherwise, missing data experts (e.g., Little and Rubin, 1987)
recommend using a maximum likelihood estimation method
for analysis, a method that makes use of all available data
points.
AMOS features maximum likelihood estimation in the
presence of missing data.
Reliability of Measured Variables
The variance in each measured variable is assumed to
stem from variance in the underlying latent variable.
Classically, the variance of a measured variable can be
partitioned into true variance (that related to the true
variable) and (random) error variance.
The reliability of a measured variable is the ratio of true
variance to total (true + error) variance.
In SEM, the reliability of a measured variable is
estimated by a squared correlation coefficient, which is
the proportion of variance in the measured variable that
is explained by variance in the latent variable(s).
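A small sketch of these two equivalent views of reliability (the numbers are hypothetical):

```python
def reliability(true_variance, error_variance):
    """Classical reliability: ratio of true variance to total variance."""
    return true_variance / (true_variance + error_variance)

def indicator_reliability(std_loading):
    """In a standardized one-factor model, the squared loading is the
    proportion of indicator variance explained by the latent variable."""
    return std_loading ** 2

# Hypothetical indicator with true variance 8 and error variance 2:
print(reliability(8.0, 2.0))        # 0.8
# Hypothetical standardized loading of 0.9:
print(indicator_reliability(0.9))   # approx. 0.81
```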
How SEM Works
Statistically, the model is evaluated by comparing
two variance/covariance matrices. From the data a
sample variance/covariance matrix is calculated.
From this matrix and the model an estimated
population variance/covariance matrix is computed.
If the estimated population variance/covariance
matrix is very similar to the known sample variance/
covariance matrix, then the model is said to fit the
data well.
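One common discrepancy measure between the sample matrix S and the model-implied matrix Sigma is the maximum-likelihood fit function, F_ML = ln|Sigma| - ln|S| + tr(S Sigma^-1) - p. A minimal numpy sketch, assuming the implied matrix has already been computed from the model parameters:

```python
import numpy as np

def ml_fit_function(S, Sigma):
    """Maximum-likelihood discrepancy between the sample covariance matrix S
    and the model-implied covariance matrix Sigma (both p x p)."""
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma - logdet_S + np.trace(S @ np.linalg.inv(Sigma)) - p

# When the implied matrix reproduces the sample matrix exactly, F_ML = 0.
S = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(ml_fit_function(S, S))  # approx. 0.0
```

The estimation routine iteratively adjusts the model parameters (and hence Sigma) to drive this discrepancy toward its minimum.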
How SEM Works
Evaluating Model Fit
The Default model contains the fit statistics for the model you
specified in your AMOS Graphics diagram.
The Saturated and Independence models are two baseline or
comparison models automatically fitted by AMOS as part of every
analysis.
The Saturated model contains as many parameter estimates as
there are available degrees of freedom or inputs into the analysis.
The Saturated model is thus the least restricted model possible
that can be fit by AMOS.
By contrast, the Independence model is one of the most restrictive
models that can be fit: it contains estimates of the variances of the
observed variables only. In other words, the Independence model
assumes all relationships between the observed variables are zero.
Tests of Fit
The chi-square test is a test of overall model fit:
when the probability value of the chi-square test is
smaller than the conventional .05 level, you
would reject the null hypothesis that the model fits
the data.
Because the chi-square test of absolute model fit is
sensitive to sample size and non-normality in the
underlying distribution of the input variables,
investigators often turn to various descriptive fit
statistics to assess the overall fit of a model to the data.
Tests of Fit
These fit statistics are similar to the adjusted R² in
multiple regression analysis: the parsimony fit
statistics penalize large models with many estimated
parameters.
The Tucker-Lewis Index (TLI) and the Comparative Fit
Index (CFI) compare the absolute fit of your
specified model to the absolute fit of the
Independence model. The greater the discrepancy
between the overall fit of the two models, the larger
the values of these descriptive statistics.
Tests of Fit
The chi-square test is an absolute test of model fit: If
the probability value (P) is below .05, the model is
rejected.
Hu and Bentler (1999) recommend RMSEA values
below .06 and Tucker-Lewis Index values of .95 or
higher.
The analysis uses an iterative procedure to minimize
the difference between the sample
variance/covariance matrix and the estimated
population variance/covariance matrix. Maximum
Likelihood (ML) estimation is the method most frequently employed.
Goodness-of-fit Statistics
Many goodness-of-fit statistics use the following notation:
Tb = chi-square test statistic for the baseline model
Tm = chi-square test statistic for the hypothesized model
dfb = degrees of freedom for the baseline model
dfm = degrees of freedom for the hypothesized model

NFI = (Tb - Tm) / Tb
IFI = (Tb - Tm) / (Tb - dfm)
RMSEA = sqrt[ (Tm - dfm) / ((N - 1) dfm) ]
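These formulas translate directly into code; a small sketch with made-up chi-square values (not from any real analysis):

```python
import math

def nfi(Tb, Tm):
    """Normed Fit Index: (Tb - Tm) / Tb."""
    return (Tb - Tm) / Tb

def ifi(Tb, Tm, dfm):
    """Incremental Fit Index: (Tb - Tm) / (Tb - dfm)."""
    return (Tb - Tm) / (Tb - dfm)

def rmsea(Tm, dfm, N):
    """Root Mean Square Error of Approximation; negative numerators
    are truncated to zero, as is conventional."""
    return math.sqrt(max(Tm - dfm, 0) / ((N - 1) * dfm))

# Hypothetical values: baseline Tb = 800 (dfb = 45),
# hypothesized model Tm = 60 (dfm = 40), sample size N = 500.
print(nfi(800, 60))        # 0.925 -> good fit by the .9 rule of thumb
print(ifi(800, 60, 40))    # approx. 0.974
print(rmsea(60, 40, 500))  # approx. 0.032 -> below the .05 cutoff
```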
Goodness-of-fit Statistics
The Normed Fit Index (NFI) is simply the difference
between the two models' chi-squares divided by the
chi-square for the independence model. Values of .9
or higher (some say .95 or higher) indicate good fit.
The Comparative Fit Index (CFI) uses a similar
approach (with a noncentral chi-square) and is said
to be a good index for use even with small samples. It
ranges from 0 to 1, like the NFI, and .95 (or .9 or
higher) indicates good fit.
Goodness-of-fit Statistics
PRATIO is the ratio of how many paths you dropped
to how many you could have dropped (all of them).
The Parsimony Normed Fit Index (PNFI) is the
product of NFI and PRATIO, and the PCFI is the product
of CFI and PRATIO. The PNFI and PCFI are
intended to reward parsimonious models (those that
contain few paths).
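A sketch of the parsimony adjustment (values hypothetical; PRATIO is computed here from degrees of freedom, since dropping a path frees a degree of freedom):

```python
def pratio(dfm, dfb):
    """Parsimony ratio: df of the hypothesized model over df of the
    baseline (independence) model."""
    return dfm / dfb

def pnfi(nfi_value, dfm, dfb):
    """Parsimony Normed Fit Index: NFI weighted by the parsimony ratio."""
    return nfi_value * pratio(dfm, dfb)

# Hypothetical values: dfm = 40, dfb = 45, NFI = 0.925.
print(pratio(40, 45))       # approx. 0.889
print(pnfi(0.925, 40, 45))  # approx. 0.822
```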
Goodness-of-fit Statistics
NPAR is the number of parameters in the model.
CMIN is a Chi-square statistic comparing the tested
model and the independence model to the saturated
model.
CMIN/DF, the relative chi-square, is an index of how
much the fit of data to model has been reduced by
dropping one or more paths.
One rule of thumb is to decide you have dropped too
many paths if this index exceeds 2 or 3.
Goodness-of-fit Statistics
RMR, the root mean square residual, is an index of
the amount by which the estimated (by your model)
variances and covariances differ from the observed
variances and covariances. Smaller is better
Goodness-of-fit Statistics
GFI, the goodness of fit index, tells you what proportion of the
variance in the sample variance-covariance matrix is accounted
for by the model. This should exceed .9 for a good model. For the
saturated model it will be a perfect 1.
AGFI (adjusted GFI) is an alternate GFI index in which the value
of the index is adjusted for the number of parameters in the
model. The fewer the number of parameters in the model relative
to the number of data points (variances and covariances in the
sample variance-covariance matrix), the closer the AGFI will be
to the GFI.
The PGFI (P is for parsimony) adjusts the GFI to reward
simple models and penalize models in which few paths have
been deleted.
Goodness-of-fit Statistics
The Root Mean Square Error of Approximation
(RMSEA) estimates lack of fit compared to the
saturated model. RMSEA of .05 or less indicates
good fit, and .08 or less adequate fit. PCLOSE is the
p value testing the null that RMSEA is no greater
than .05.
Goodness-of-fit Statistics
Component Fit
Use Substantive Experience
Are signs correct?
Any nonsensical results?
R²s for individual equations
Negative error variances?
Standard errors seem reasonable?
SEM limitations
SEM is a confirmatory approach
You need to have established theory about the relationships
Cannot be used to explore possible relationships when you
have more than a handful of variables
Exploratory methods (e.g. model modification) can be used on
top of the original theory
SEM is not causal; experimental design = cause
SEM is often thought of as strictly correlational but can be
used (like regression) with experimental data
Path Analysis
Theoretical assumptions
Causality:
X1 and Y1 correlate.
X1 precedes Y1 chronologically.
X1 and Y1 are still related after controlling other
dependencies.
Statistical assumptions
Model needs to be recursive.
It is OK to use ordinal data.
All variables are measured (and analyzed) without
measurement error (error = 0).
Path Analysis
Path Analysis estimates effects of variables in a
causal system.
It starts with a structural equation: a mathematical
equation representing the structure of the variables'
relationships to one another.