
Factor Analysis

-Applied Multivariate Analysis-

Lecturer: Darren Homrighausen, PhD

1
From PCA to factor analysis

Remember: PCA tries to estimate a transformation of the data such that:

1. The maximum amount of variance possible is explained by each component
2. The components are uncorrelated (equivalently, they are perpendicular)

We are going to motivate an algorithm for finding a different transformation of the data via two stories.

2
First story: Measurement error

Suppose the numbers we write down into our X matrix aren't accurate.
[That is, there is measurement error in the explanatory variables]

PCA seeks to reproduce all the variance in X in the fewest number of components.

This is odd, as we want to keep the signal variance, but would like to get rid of the noise variance.

3
First story: Measurement error

This situation in notation would be:

    X_j = μ_j + ε_j

where μ_j is the signal and ε_j is the noise.

Example: We are recording diameters of trees at a certain height to answer questions about the success of various tree species. Sometimes the measuring tape will not be level. Maybe the measuring tape only has inches and the experimenter just guesses at the remaining fraction. The true diameter would be μ_j and these other experimental issues would be ε_j.

4
Second story: Maintaining correlation

(Without getting into too many details)

Factor analysis (FA) finds a small number of factors that preserve as much of the correlation structure of the original data as possible, instead of the variance (which is what PCA does).

To do this, we will seek to find latent variables or latent factors.

What is a latent variable?

Definition: A latent variable (or factor) is an unmeasured, underlying feature that is actually driving your observations.

5
Example of a latent variable

Suppose I record all the driving routes you take from your house to
another location and then back to your house
(Imagine the destination location isn’t recorded)

There would be many such routes, and many of them would be similar.

However, there would probably be several different routes to the same general location.
(You won't go to exactly the same location, as you'll park in different parking spaces, etc.)

Suppose one group of routes all ended near a grocery store. Then I
could say there is a latent variable for some of your movements
that was “going to get food/supplies”

6
Roots of factor analysis: Causal discovery

Charles Spearman (1904) hypothesized a hidden structure for intelligence.

He observed that schoolchildren's grades in different subjects were all correlated with each other.

He thought he could explain this by reasoning that grades in different subjects are correlated because they are all functions of some common factor: 'general intelligence,' which he notated as g or G.
[interestingly enough, intelligence is still referred to as g or G . Let this be a lesson:
choose your notation carefully. It might be used for a very long time]

7
Roots of factor analysis: Causal discovery
Spearman's model was

    X_ij = G_i W_j + ε_ij

where
• i indexes children in the study
• j indexes school subjects
• ε_ij is uncorrelated noise
• G_i is the i-th child's intelligence value. Think of G as a function evaluated at that person, giving her some amount of intelligence
• W_j is the amount subject j is influenced by G.

The idea: Given a value of intelligence G_i, a person's grades are uncorrelated and merely a (random) function of how much intelligence affects achievement in a subject.
8
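To make the model concrete, here is a minimal simulation sketch (not from the slides; the sample sizes, loadings, and noise level are made up) showing that a single latent G induces positive correlations between every pair of subjects:

set.seed(1)
n <- 500; p <- 6                              # hypothetical: 500 children, 6 subjects
G    <- rnorm(n)                              # latent 'general intelligence' scores
W    <- runif(p, 0.4, 0.9)                    # how strongly each subject loads on G
eps  <- matrix(rnorm(n * p, sd = 0.5), n, p)  # uncorrelated noise
Xsim <- outer(G, W) + eps                     # Xsim[i, j] = G[i] * W[j] + eps[i, j]
round(cor(Xsim), 2)                           # every pair of subjects is positively correlated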
Roots of factor analysis: Causal discovery

Spearman concluded, based on some experiments, that a single factor G existed.

Later research using large batteries of tests demonstrated that this does not seem to be the case in general.

Spearman and other researchers decided this meant that one factor
wasn’t enough. Thus factor analysis was born.

9
Factor analysis (FA)

With multiple factors, Spearman's model becomes

    X_ij = G_i W_j + ε_ij   ⇒   X_ij = Σ_{k=1}^K F_ik W_kj + ε_ij

where
• the factors F_k are mean zero and unit variance
• the ε_ij are uncorrelated with each other and with the factors F_k
• the X_·j are mean zero
  (That is, the covariates have already had their means subtracted off, i.e. X − X̄)

10
Factor analysis (FA)

    X_ij = Σ_{k=1}^K F_ik W_kj + ε_ij   ⇔   X = FW + ε

Some terminology:
• The F_ik are the factor scores of the i-th observation on the k-th factor
• The W_kj are the factor loadings of the j-th explanatory variable on the k-th factor.

Compare to PCA (X = UDV^T = SV^T):

    X_ij = Σ_{k=1}^p (UD)_ik V_jk = Σ_{k=1}^p S_ik V_jk

11
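For reference, a quick sketch (toy data, all names made up) of the PCA analogue via the SVD, where the scores S = UD are also XV:

set.seed(1)
X0 <- scale(matrix(rnorm(100 * 6), 100, 6), scale = FALSE)  # centred toy data matrix
sv <- svd(X0)
S  <- sv$u %*% diag(sv$d)      # principal component scores, S = U D
max(abs(S - X0 %*% sv$v))      # ~0: the scores are also X V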
There’s a problem...
Reminder: we call a matrix O orthogonal if its inverse is its transpose:

    O^T O = I = OO^T

[think about the SVD, where X = UDV^T and V^T V = I]

So, with our model,

    X = FW + ε = FOO^T W + ε = F′W′ + ε

where F′ = FO and W′ = O^T W.

Conclusion: we changed the factor scores and loadings, but the data didn't change at all
(In statistics, we call such a situation unidentifiable)

Note: This doesn't happen with PCA:

    X = UDV^T = UDOO^T V^T = U D′ V′^T

where D′ = DO is not diagonal, so this is no longer a valid SVD.
12
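A minimal numerical sketch of this unidentifiability (all names and sizes here are made up): for any orthogonal O, FW and (FO)(O^T W) produce exactly the same data matrix.

set.seed(1)
n <- 200; K <- 2; p <- 6
Fmat <- matrix(rnorm(n * K), n, K)                 # factor scores
Wmat <- matrix(runif(K * p, -1, 1), K, p)          # factor loadings
theta <- pi / 6
O <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), 2, 2)      # a 2 x 2 rotation (orthogonal) matrix
max(abs(Fmat %*% Wmat - (Fmat %*% O) %*% (t(O) %*% Wmat)))   # ~0: same data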
Factor analysis model
Note that, for any i and each j,

    σ_j^2 = Var(X_ij) = Var( Σ_{k=1}^K F_ik W_kj + ε_ij )
                      = Σ_{k=1}^K Var(F_ik W_kj) + Var(ε_ij)
                      = Σ_{k=1}^K W_kj^2 + Var(ε_ij)
                      = Σ_{k=1}^K W_kj^2 + ψ_j

where Σ_{k=1}^K W_kj^2 is the factors' variance and ψ_j is the error variance.

13
Factor analysis model

    σ_j^2 = Var(X_ij) = Σ_{k=1}^K W_kj^2 + ψ_j

This decomposes the variance of each explanatory variable:

• Σ_{k=1}^K W_kj^2 (the factors' variance) is known as the communality
• ψ_j (the error variance) is known as the specific variance

14
Factor analysis model
Note that, additionally,

    σ_jl = Cov(X_ij, X_il) = Σ_{k=1}^K W_kj W_kl

This implies that the (j, l) entry of Σ is Σ_{k=1}^K W_kj W_kl off the diagonal and Σ_{k=1}^K W_kj^2 + ψ_j on the diagonal, i.e.

    Σ = W^T W + Ψ

(Ψ is the diagonal matrix with entries ψ_j)


15
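As a quick numerical sketch of this decomposition (the loadings below are made up, with K = 2 factors and p = 3 variables), we can build the implied matrix W^T W + Ψ directly:

W <- rbind(c(0.9, 0.8, 0.1),       # loadings of factor 1 on the 3 variables
           c(0.0, 0.1, 0.7))       # loadings of factor 2
psi <- 1 - colSums(W^2)            # specific variances so that diag(Sigma) = 1
Sigma <- t(W) %*% W + diag(psi)
round(Sigma, 2)                    # the implied correlation matrix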
Factor analysis model: Our goal

Now that we have the induced equation for the covariances,

    Σ = W^T W + Ψ

we want to estimate W such that, for a given K,

• W^T W is as large as possible
• Ψ is as small as possible

(One note: we never observe Σ. We need to estimate it with Σ̂ = (1/n) X^T X)

16
Find Ψ & W : Maximum likelihood
In order to get at W, we need a good estimate of Ψ.

One way is through maximum likelihood (ML).

This has a cost: making assumptions about the distribution of F:

    F_ik ~ N(0, 1), i.i.d.

which implies

    X_i· ~ N(0, Ψ + W^T W), i.i.d.

Hence, the closer Ψ + W^T W is to Σ̂, the higher the likelihood.
(The details of the maximization are tedious and we won’t discuss them here)

(Note that implicit in ML is a likelihood ratio test for K . We will discuss this later
when covering methods for choosing K )

17
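For reference, a minimal sketch of an ML fit using base R's factanal() (the 2-factor choice is just for illustration; X is the n × p data matrix):

fit.ml <- factanal(X, factors = 2, rotation = 'none')
fit.ml$loadings       # the estimated loadings (a p x K matrix)
fit.ml$uniquenesses   # the estimated psi_j
# factanal() also reports a likelihood-ratio test that K factors are sufficient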
Find Ψ & W : Principal factor analysis
Principal factor analysis (PA) has strong ties to PCA. The difference is:

1. Guess a Ψ̂
2. Form the reduced covariance matrix Σ̂* = Σ̂ − Ψ̂. Hence, the diagonals are the communalities we wish to estimate.
   (Remember, we are trying to solve Σ̂ = W^T W + Ψ)
3. Form the eigendecomposition Σ̂* = V′D′^2 V′^T
4. Define W_j to be the j-th row of the matrix V′ with only its first K columns
   (These would be the PC loading vectors if Σ̂* were the covariance matrix)
5. Re-estimate Ψ̂ from the diagonal of Σ̂ − Ŵ^T Ŵ
6. Keep returning to step 2 with the new Ψ̂ until Ψ̂ and Ŵ don't change much between cycles

18
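A sketch of this iteration in R (using the common convention of scaling the retained eigenvectors by the square roots of their eigenvalues to form the loadings; the function and variable names are made up):

principal_factor <- function(Sigma, K, tol = 1e-6, maxit = 500) {
  p   <- ncol(Sigma)
  psi <- rep(0.5, p)                              # step 1: initial guess for the psi_j
  for (it in 1:maxit) {
    Sstar <- Sigma - diag(psi)                    # step 2: reduced covariance matrix
    eig   <- eigen(Sstar, symmetric = TRUE)       # step 3: eigendecomposition
    L <- eig$vectors[, 1:K, drop = FALSE] %*%
         diag(sqrt(pmax(eig$values[1:K], 0)), K)  # step 4: p x K loading matrix
    psi.new <- diag(Sigma - L %*% t(L))           # step 5: re-estimate specific variances
    if (max(abs(psi.new - psi)) < tol) break      # step 6: stop when nothing changes much
    psi <- psi.new
  }
  list(W = t(L), psi = psi.new)                   # W returned as K x p, as in the notes
}

pf <- principal_factor(Xcorr, K = 2)
round(pf$W, 2)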
What about estimating F ?
In PCA, arguably the most important quantities were the principal component scores given by UD.

These are the coordinates of the data expressed in the PC basis.

Of somewhat lesser importance are the principal component loadings, given by the rows of the matrix V.

In factor analysis, the factor loadings are everything.

It is much more important in most applications to find how the covariates load on the latent factors than to find how the observations are expressed in those factors (the factor scores).

We can still estimate F (by least squares), but it isn't usually considered a useful quantity. This is application specific, however.

19
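If the scores are wanted anyway, here is a minimal least-squares sketch, F̂ = X W^T (W W^T)^{-1}, assuming a K × p loading matrix W as in the notes and a data matrix X (the function name is made up):

estimate_scores <- function(X, W) {
  Xs <- scale(X)                        # centre and standardise the covariates
  t(solve(W %*% t(W), W %*% t(Xs)))     # n x K matrix of estimated factor scores
}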
Uses of factor analysis

Factor analysis can be used for two major things:

• Exploratory factor analysis (EFA): To identify complex interrelationships among variables that are part of unified concepts. The researcher makes no a priori assumptions about relationships among factors
• Confirmatory factor analysis (CFA): To test the hypothesis that the items are associated with specific factors

Usually, you split your data in two parts, run EFA on one part, and test your conclusion using CFA on the other part.

20
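A minimal sketch of that split-half workflow (the variable names are made up; the confirmatory step would typically use a dedicated CFA tool, e.g. the lavaan package):

set.seed(1)
idx        <- sample(nrow(X), floor(nrow(X) / 2))
X.explore  <- X[idx, ]                    # half 1: run EFA here
X.confirm  <- X[-idx, ]                   # half 2: held out for the confirmatory step
fa.explore <- psych::fa(X.explore, nfactors = 2, fm = 'pa')
# the CFA on X.confirm then tests the factor structure suggested by fa.explore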
Factor analysis in R
Like usual, someone has done the heavy lifting for you and made a
nice R package called psych
library(psych)
Xcorr = cor(X)
nfactors = 2   # the number of factors we are looking for
out = fa(r=Xcorr, nfactors=nfactors,
         rotate='none', fm='pa')

Let's discuss this a bit:

• r=Xcorr: Factor analysis operates on the correlation matrix. Alternatively, we could pass the original data X
• nfactors: We need to specify the number of factors we are looking for (we'll return to this later)
• rotate='none': This controls which O we used
  (That is, the orthogonal matrix O inside of FOO^T W)
• fm='pa': Method to get the W. Can use 'ml' or 'minres'

21
method parameter in fa

The solution depends strongly on the estimation procedure.

I'm not aware of a detailed comparison of these two methods.

My recommendation:

• PA is preferred as it relies on weaker assumptions. However, it can sometimes produce ψ̂_j that are negative, which corresponds to negative variance. If this happens, use ML
• The ML solution comes with more features, but it requires much stronger assumptions that are hard to check. Additionally, the maximization doesn't always converge.
• If both PA and ML fail, use minimum residuals (known as minres in R)

22
Let’s look at an example

library(psych)
X = read.table('../data/gradesFA.txt', header=T)
Xcorr = cor(X)

> round(Xcorr,2)
BIO GEO CHEM ALG CALC STAT
BIO 1.00 0.66 0.74 -0.02 0.27 0.09
GEO 0.66 1.00 0.53 0.10 0.40 0.16
CHEM 0.74 0.53 1.00 0.00 0.10 -0.13
ALG -0.02 0.10 0.00 1.00 0.42 0.32
CALC 0.27 0.40 0.10 0.42 1.00 0.72
STAT 0.09 0.16 -0.13 0.32 0.72 1.00

23
Alternate visualizations
[Figure: left, a pairs() scatterplot matrix of the six subjects (BIO, GEO, CHEM, ALG, CALC, STAT); right, a correlation plot of Xcorr with the colour scale running from 1 down to −1]

##Left plot
pairs(X)
##Right plot
cor.plot(Xcorr) #in psych package
24
Let’s look at an example (Output)

out.fa = fa(X, nfactors=2, rotate='none', fm='pa')
> out.fa
Factor Analysis using method = pa
[omitted]
Standardized loadings
PA1 PA2 h2 u2
BIO 0.81 -0.46 0.86 0.142
GEO 0.70 -0.19 0.53 0.470
CHEM 0.61 -0.54 0.67 0.330
ALG 0.24 0.36 0.19 0.815
CALC 0.73 0.66 0.98 0.023
STAT 0.42 0.63 0.57 0.429

Here K = 2, and
• The 'Standardized loadings' are the entries in W
• h2 are the communalities (Σ_{k=1}^K W_kj^2)
• u2 are the specific variances (ψ̂_j)
25
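A quick sketch showing how h2 and u2 follow from the loadings in the output above (out.fa is the fa() fit from the previous slide):

L  <- unclass(out.fa$loadings)   # p x 2 matrix of standardized loadings
h2 <- rowSums(L^2)               # communalities
u2 <- 1 - h2                     # specific variances, since the variables are standardized
round(cbind(h2, u2), 3)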
Let’s look at an example (Output continued)

PA1 PA2
SS loadings 2.28 1.51
Proportion Var 0.38 0.25
Cumulative Var 0.38 0.63
[omitted]

These are
• SS loadings: The sum of the squared loadings on each factor
• Proportion Var: The amount of total variation explained by
each factor

26
Return to the loadings table

Standardized loadings
PA1 PA2 h2 u2
BIO 0.81 -0.46 0.86 0.142
GEO 0.70 -0.19 0.53 0.470
CHEM 0.61 -0.54 0.67 0.330
ALG 0.24 0.36 0.19 0.815
CALC 0.73 0.66 0.98 0.023
STAT 0.42 0.63 0.57 0.429

We would like this solution to be 'clean.' This means

• each explanatory variable is highly loaded on only one factor
• all loadings are either large or near zero
  ('Near zero' means something like < 0.1 in absolute value)

This solution has neither property.

(Actually, rotate='none' very commonly produces 'unclean' solutions)

27
Graphical representation

For a two-factor solution, this can be displayed graphically.

Standardized loadings
      PA1   PA2   h2    u2
BIO   0.81 -0.46  0.86  0.142
GEO   0.70 -0.19  0.53  0.470
CHEM  0.61 -0.54  0.67  0.330
ALG   0.24  0.36  0.19  0.815
CALC  0.73  0.66  0.98  0.023
STAT  0.42  0.63  0.57  0.429

[Figure: each subject plotted at its (PA1, PA2) loadings in the Factor 1 / Factor 2 plane]

28
Example rotations
Remember: the O matrix was arbitrary!

We can make any W′ = O^T W we want.

[Figure: four loading plots in the Factor 1 / Factor 2 plane]
• Black: 'none'
• Blue: a rotation I made up that is bad
• Green: a rotation I made up that is good
• Red: a non-orthogonal rotation I made up that is the best
29
Automated rotations?
Any solution we estimate is just one of an infinite number of solutions
(Given by all possible rotations W′ = O^T W)

Manually trying many rotations becomes impossible if there are more than 2 factors or if p is not small.

We need ways to choose good O automatically

There are many proposed methods for choosing the rotation


• Varimax rotation
• Quartimax rotation
• Equimax rotation
• Direct oblimin rotation
• Promax rotation

30
Varimax and Oblimin rotation
Varimax: Finds an O that maximizes the variance of the diagonals of W^T W while keeping the latent factors uncorrelated.

Advantage: This has the advantage of making the entries of W either large or small, which helps interpretability.

Keeping the factors uncorrelated is often unrealistic. This leads to...

Direct oblimin rotation: If we consider correlated factors, this leads to non-orthogonal (known as oblique) rotations. Introducing correlations complicates the estimation process. However, it often leads to more interpretable factor solutions.

Conclusion: Direct oblimin rotation is the standard method, but it leads to (some) complications relative to orthogonal rotations such as varimax.
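For reference, a sketch of the corresponding calls with psych::fa (the 2-factor choice matches the running example; the oblique rotations need the GPArotation package):

library(psych)
library(GPArotation)
fa.varimax <- fa(X, nfactors = 2, rotate = 'varimax', fm = 'pa')   # orthogonal rotation
fa.oblimin <- fa(X, nfactors = 2, rotate = 'oblimin', fm = 'pa')   # oblique rotation
fa.varimax$loadings
fa.oblimin$loadings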
Automated rotations
[Figure: loading plots in the Factor 1 / Factor 2 plane under four rotations]
• Black: 'none'
• Red: a non-orthogonal rotation I made up
• Green: varimax rotation
• Blue: oblimin (oblique) rotation
(The oblique rotations need the package GPArotation)
32
Comparison of factor solution

None Varimax Oblimin


BIO 0.80 -0.45 | 0.92 | 0.91
GEO 0.70 -0.18 | 0.68 0.23 | 0.65 0.21
CHEM 0.61 -0.54 | 0.81 -0.10 | 0.83 -0.13
ALG 0.23 0.35 | 0.43 | 0.43
CALC 0.73 0.66 | 0.23 0.95 | 0.97
STAT 0.41 0.63 | 0.75 | -0.13 0.77
[Figure: loading plots in the Factor 1 / Factor 2 plane for the 'none', varimax, and oblimin solutions]

33
Which rotation?

The choice of rotation is a highly contentious topic in factor analysis.

If it scientifically makes sense for the factors to be uncorrelated, then use an orthogonal rotation (varimax).

If this isn't reasonable, use an oblique rotation (oblimin).

(Note that, this is essentially a nonparametric vs. parametric trade-off)

34
How to choose K (the number of factors)?

When selecting the number of factors, researchers attempt to balance parsimony (only a few factors) and plausibility (enough of the original correlation structure is accounted for).

Including too few factors is known as underfactoring. This leads to many problems, essentially related to confounding.

Including too many factors is known as overfactoring. This isn't as serious, as the factor loadings on these extra factors should be small.

35
How to choose K (the number of factors)?

There are many possible ways to choose the number of factors


• Scree plot (tends to pick too many factors)
• Kaiser criterion (criticized as subjective)
• Cumulative variance explained
• Parallel analysis and non-graphical versions of the scree plot
• Using (non-statistical) theory to inform choice
• many others...

36
Scree plot

We plot the eigenvalues of the correlation matrix, descending from largest to smallest.

We choose the number of factors to be at the 'elbow' or 'kink,' that is, where the eigenvalues go from decreasing rapidly to decreasing slowly.

We can get this information easily enough:

ev = eigen(Xcorr)
Kgrid = 1:length(ev$values)
plot(Kgrid, ev$values, xlab='number of factors', ylab='eigenvalues')

37
Scree results on grades data

[Figure: scree plot of the six eigenvalues against the number of factors. Caption: Choose 3 factors]


38
Kaiser criterion
Alternatively, we can just look at the magnitude of the eigenvalues.

As this is a correlation matrix, the diagonal entries are all 1.

Also, it is a fact that if λ_1, ..., λ_p are the eigenvalues (p = 6 in our example), then

    Σ_{j=1}^p λ_j = trace(cor(X)) = p

where the trace just sums the diagonal entries.

If there is no factor structure, we might expect that all eigenvalues are about 1 (which is the average).

Therefore, we can keep the factors that correspond to eigenvalues greater than 1.

This is known as the Kaiser criterion (or Kaiser's criterion).
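In R this is a one-line check (a quick sketch, reusing the correlation matrix Xcorr from earlier):

evals = eigen(Xcorr)$values
sum(evals > 1)        # number of factors retained; 2 for the grades data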


Kaiser’s criterion

[Figure: the same scree plot; two eigenvalues exceed 1. Caption: Choose 2 factors]


40
Cumulative variance explained
A very commonly used technique, at least in psychometrics

The idea is just to pick a threshold (very often 60%), and keep all
the factors needed to explain that much variance

This is commonly done by looking at the cumulative sum of the eigenvalues:

ev = eigen(Xcorr)
p = ncol(Xcorr)
> ev$values/p
[1] 0.46445471 0.29629698 0.10507729 0.05610791 0.04368477 0.03437835
> cumsum(ev$values/p)
[1] 0.4644547 0.7607517 0.8658290 0.9219369 0.9656217 1.0000000

Pick 2 factors.
41
(Horn’s) parallel analysis
A Monte Carlo simulation method that compares the observed eigenvalues with those obtained from uncorrelated normal variables.

A factor or component is retained if the associated eigenvalue is bigger than some quantile of this null distribution.

Parallel analysis is regularly recommended (though this is far from a consensus).

It is somewhat reminiscent of the 'gap statistic' from cluster analysis.

library(nFactors)
ap = parallel(subject = nrow(X),
              var = ncol(X), rep = 100, cent = 0.05)
# parallel: simulates eigenvalues of correlation matrices from the null distribution
ns = nScree(ev$values, ap$eigen$qevpea)
plotnScree(ns)
(Horn’s) parallel analysis

If we type

ns = nScree(ev$values, ap$eigen$qevpea)
> ns
  noc naf nparallel nkaiser
    2   2         2       2

The acceleration factor (AF) corresponds to a numerical solution for the elbow of the scree plot (that is, the second derivative).

The optimal coordinates (OC) correspond to an extrapolation of the preceding eigenvalues by a regression line between the eigenvalue coordinates and the last eigenvalue coordinates.

43
Parallel analysis
[Figure: "Non Graphical Solutions to Scree Test" plot of the eigenvalues against the number of components, with legend: Eigenvalues (> mean = 2), Parallel Analysis (n = 2), Optimal Coordinates (n = 2), Acceleration Factor (n = 2); the OC and AF points are marked on the curve.]

Figure note: there is an error in this package's code for computing AF. It ignores the fact that AF is undefined for 1 factor, so the legend should say n = 3 for AF.
