Statistical Methods for Bioinformatics
II-3: Variable Selection
Today
Dimension reduction and factor analysis
Principal Component Regression
Partial Least Squares
Considerations in High Dimensions
High Dimensionality is associated with elevated Variance
Learning in high dimensions tends to produce a larger “variance”
component in the error. Solutions up to now:
Subset selection
Ridge/Lasso regularization.
From linear regression you may remember that the coefficients of
highly correlated variables have high standard errors.
Can we not combine correlated variables into a single variable?
The problem of co-linearity
Dimension Reduction Techniques and Factor Analysis
Dimension reduction and factor analysis try to describe the
variability in a dataset using a reduced set of dimensions:
correlated variables are mapped onto unobserved “factors”
A multitude of approaches: Principal Component Analysis,
Latent-variable models, Non-negative matrix factorization
Important for many fields e.g. computer vision, text mining,
psychology
Both exploratory and hypothesis driven analyses
Active field of research, new methodologies under
development.
Example: text mining
Imagine there are 10000 documents about the protein p53
You know the, say, 10000 distinct words that are used and the
frequency with which they are used
This gives a sparse 10000 × 10000 matrix representing the
literature about the gene
Using Non-negative matrix factorization the word
dimensionality is reduced by identifying words with similar
occurrence patterns.
The condensed variables can represent the topics discussed, and
can be used to:
summarize the literature
classify documents by topic
improve document retrieval (by using the topics)
find genes active in similar processes
Factor Analysis in Psychology
Factor: An underlying, but unobserved, construct or
phenomenon that ’shapes’ or ’explains’ the observed variables
(at least partially).
Used in psychology, for example, to define hypothetical
underlying components that explain observable traits,
e.g. personality traits
(Diagram: observed traits such as talkativeness and being sociable
and outgoing load onto the latent factor Extraversion)
The goal of factor analysis here is to explain the relations
between variables using an underlying component
On the following slide: a correlation matrix with data from Bernard
Malle; figure from the Psych253 Stanford online course, Statistical
Theory, Models, and Statistical Methodology
Finding underlying Personality factors
Reducing Dimensionality for Statistical Learning
1 Find correlated variables and map them to a smaller set of
new variables
2 Use the new variables in a regression
3 Use cross validation to find optimal number of variables.
Underlying ideas:
The dimension reduction may reduce the variance component
in the error
The variability in the data is assumed to be relevant for the
response
Principal Component Analysis
PCA reduces the dimensionality of a dataset of related
variables, while retaining as much of the variation as possible.
The set of variables is transformed to a new set of variables,
principal components, which are uncorrelated and sorted by
the variation they retain.
The first principal component of a set of features is the
normalized linear combination that has the largest variance
Z1 = φ11 X1 + φ21 X2 + . . . + φp1 Xp
For this we work with the covariance matrix C = X^T X/n, with n
observations of the p-variable vector Xi . The optimization problem
is to find max_φ φ^T C φ subject to Σ_{j=1}^p φj1² = 1 (normalized)
Principal Component Analysis
Procedural description:
1 Find the set of weights φj1 maximizing (1/n) Σi (Σj φj1 xi,j )²
(assuming all X are centered around 0, so the average of the
scaled X is also 0, this formula represents the variance of Z1 )
2 Repeat up to φjp , ensuring no correlation between the
weighting sets
Principal Component Analysis
Finding the principal components
You start with n observations of a p-variable vector
Xi = (Xi,1 , . . . , Xi,p )^T
Center all p variables around zero
Construct the covariance matrix C = X^T X/n. Entries in the
matrix are Cl,k = Cov(X·,l , X·,k ), which is estimated by
Cl,k = (1/n) Σi (xi,l − µl )(xi,k − µk ). µ represents the mean
(set to 0 here).
Decompose C to find its eigenvectors. It can be shown that C
can be decomposed as C = VDV^T , with V an orthonormal
matrix, i.e. all columns have length 1 and are orthogonal to
each other. The columns of V are the eigenvectors of C ; D is
a diagonal matrix with the eigenvalues.
The eigenvectors define the mapping vectors φ of the principal
components, the eigenvalues in D give the variance explained
by every component
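The steps above can be sketched in a few lines of numpy; the data here are toy values (an assumption for illustration, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)   # make two variables correlated

X = X - X.mean(axis=0)            # 1. center all p variables
C = X.T @ X / n                   # 2. covariance matrix C = X^T X / n
D, V = np.linalg.eigh(C)          # 3. eigendecomposition C = V D V^T
order = np.argsort(D)[::-1]       # sort components by variance explained
D, V = D[order], V[:, order]

Z1 = X @ V[:, 0]                  # scores on the first principal component
# The variance of Z1 equals the largest eigenvalue d1:
print(np.isclose(Z1.var(), D[0]))
```

Note that `eigh` returns eigenvalues in ascending order, hence the explicit re-sorting so that component 1 carries the most variance.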
Principal Component Regression
After performing PCA, you choose a number of components M to
use in a regression. The fitted coefficients θ relate to the
non-reduced fit as: βj = Σ_{m=1}^M θm φj,m . This puts a constraint on
the coefficients. PCR is related in effect and form to ridge regression,
but with a discrete form for the penalty.
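A minimal numeric sketch of this relation (toy data; the choice M = 2 is arbitrary): regressing on the first M score vectors and mapping θ back through the loadings gives the same fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, M = 100, 5, 2
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X @ rng.normal(size=p) + rng.normal(size=n)

C = X.T @ X / n
D, V = np.linalg.eigh(C)
V = V[:, np.argsort(D)[::-1]]     # columns phi_1..phi_p, by variance

Phi = V[:, :M]                    # keep M components
Z = X @ Phi                       # reduced predictors Z_m
theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
beta = Phi @ theta                # beta_j = sum_m theta_m phi_{j,m}
yhat = X @ beta                   # identical to the fit on Z
print(np.allclose(yhat, Z @ theta))
```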
Principal Component Regression
Principal Component Regression: Considerations
Summary
Predictors are mapped through a linear transformation to a
reduced set of predictors: Zm = Σ_{j=1}^p φj,m Xj
Normal regression with this smaller set of predictors.
The fitted coefficients relate to the non-reduced fit as
βj = Σ_{m=1}^M θm φj,m
1 PCR works best if the PCA transformation captures most of
the variance in few dimensions AND this variance is
associated with the response
2 It is a linear mapping approach, so strong non-linear relations
will not be captured well.
3 Because PCA combines variables, the scale of each variable
influences the outcome. If not directly comparable,
standardize the variables.
4 PCA works best on normally distributed variables; strong
departures can make PCA perform poorly
The connection between Ridge and PCR
If we take a N × p predictor matrix X , with zero-centered
variables, we can apply the Singular Value Decomposition
(SVD)
X = USV T
U and V are N × p and p × p orthogonal matrices, S is a diagonal
matrix with values si . Then the least squares fit for response y is:
ŷ = UU T y
Ridge is given by:
ŷ = U diag( si² / (si² + λ) ) U^T y
PCR by:
ŷ = Udiag {1, . . . , 1, 0, . . . , 0} U T y
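These identities can be checked numerically. The sketch below uses toy data and an arbitrary λ (both assumptions for illustration), and compares each expression with the usual closed-form fit:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, lam = 50, 4, 3.0
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)
y = rng.normal(size=N)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T

# Least squares: yhat = U U^T y equals the fit from the normal equations
yhat_ls = U @ (U.T @ y)
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(yhat_ls, X @ beta_ls))

# Ridge: yhat = U diag(s_i^2/(s_i^2+lam)) U^T y equals the closed form
yhat_ridge = U @ np.diag(s**2 / (s**2 + lam)) @ U.T @ y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(yhat_ridge, X @ beta_ridge))
```

Ridge shrinks each component by the factor si²/(si² + λ), while PCR keeps the first components entirely (factor 1) and drops the rest (factor 0), which is the “discrete penalty” mentioned earlier.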
The connection between Ridge and PCR
ŷ = U diag( si² / (si² + λ) ) U^T y
What is s? Remember the PCA formulation:
C = X^T X/(n − 1) = VDV^T . Together with X = USV^T we can
derive that D = S²/(n − 1). Hence the singular values s are related
to the eigenvalues of the covariance matrix as follows:
di = si² / (n − 1)
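A quick numeric check of this relation between singular values and covariance eigenvalues (toy data, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)             # singular values of X
d = np.linalg.eigvalsh(X.T @ X / (n - 1))[::-1]    # eigenvalues of C, descending
print(np.allclose(d, s**2 / (n - 1)))              # d_i = s_i^2 / (n - 1)
```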
Partial Least Squares Regression
PLSR is another linear dimension reduction technique that
fulfills
Zm = Σ_{j=1}^p φj,m Xj
It differs from PCR in that not just structure in the
explanatory variables is captured, but also the relation
between the explanatory variables and the response variables.
The decomposition is such that most variation in Y is
extracted and explained by a latent structure of X
It can also work with a multivariate response matrix
The resulting Z1 . . . ZM are used with least squares to fit a
linear model
vs PCR & ridge: PLS can reduce bias, but may increase variance
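A minimal sketch of the first PLS direction (a NIPALS-style toy implementation, an assumption for illustration, not the lecture's code): for a single centered response, the first weight vector is proportional to X^T y, so the component is driven by covariance with the response, unlike the first principal component:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X[:, 0] + 0.1 * rng.normal(size=n)   # only variable 0 is informative
y = y - y.mean()

phi1 = X.T @ y
phi1 = phi1 / np.linalg.norm(phi1)   # normalized weights phi_{j,1}
Z1 = X @ phi1                        # first PLS component
theta1 = (Z1 @ y) / (Z1 @ Z1)        # least squares on the single component
# phi1 concentrates its weight on the informative variable
```

Further components would be built the same way after deflating X, but one component already shows the contrast with PCA, whose first direction ignores y entirely.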
High dimensionality
When p>n, a situation frequently encountered in modern
science
Least squares regression is not appropriate (no remaining
degrees of freedom)
Large danger of over-fitting
Cp , AIC and BIC are not appropriate (estimating the error
variance is not possible)
PLSR, PCR, forward step-wise regression, ridge and lasso are
appropriate
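A small illustration of the degrees-of-freedom problem (toy data, assumed for illustration): with p > n, least squares can fit even pure noise exactly, so the training error is zero and cannot be used to estimate the error variance.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 10, 50                      # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)             # pure noise, unrelated to X

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimum-norm solution
print(np.allclose(X @ beta, y))                # perfect "fit" on training data
```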
Regressions in High Dimensions
The lasso with n = 100 observations and a varying number of
features p, of which 20 were associated with the response. Plots
show the test MSEs over the tuning parameter λ (reported as
degrees of freedom). The test MSE goes up as more features are
added. This is related to the “curse of dimensionality”.
Interpretation of regression in high dimensionality
In high-dimensional data sets many variables are highly
co-linear
Selected variables may not be the unique or even the best set
of predictors for the observed prediction performance
So even when we have a predictive model, we should not
overstate the results
The model found is not unique; it is one of many possible models
The problem of co-linearity
Co-linearity is a motivation for regularization
Even a small λ will stabilize the coefficient estimates in ridge
regression, also when p << n
When you have many co-linear variables, ridge and PCR will
use them all in a sensible way.
You might want “group” selection: select the whole predictive
set of correlated variables
Lasso will tend to do feature selection and select the variable
most strongly related to the response
Perhaps arbitrarily.
This can make it less robust.
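The contrast can be illustrated with ridge on two perfectly co-linear predictors (a toy sketch, assumed for illustration; the lasso would instead tend to pick just one of them):

```python
import numpy as np

rng = np.random.default_rng(6)
n, lam = 100, 1.0
x = rng.normal(size=n)
X = np.column_stack([x, x])        # two perfectly co-linear columns
y = 2 * x + 0.1 * rng.normal(size=n)

# Ridge has a unique solution despite the singular X^T X
beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(np.allclose(beta[0], beta[1]))   # the effect is split evenly
```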
High dimensionality in practice
What you should learn from this class
1 What is dimension reduction and what is its use.
2 What is the procedure and motivation for a PCR (and know
what PLSR is)
3 How does PCR compare to the LASSO and ridge
4 Considerations in high dimensions
To do:
Preparation for next week
Reading: chapter 7 through to 7.5
Send in any questions day before class
Exercises
Finish the labs of Chapter 6
and the exercise below.
Exercise
In this exercise we will analyze the gene expression data set
from Van de Vijver et al. (2002, N Engl J Med, 347). The
study analyzed gene expression in breast cancer tumors
genome-wide with DNA microarrays. The study compared the
gene expression signature of the tumors with the presence or
absence of distant metastasis (“DM” vs “NODM”). The idea
was to use the gene expression signature as a clinical tool to
decide if chemo- or hormone therapy would be beneficial.
For the exercises, load/install the glmnet library, with
library(glmnet) after install.packages(“glmnet”).
Exercise
1 Load the file “VIJVER.Rdata”.
Explore the dataset. How many variables? What do they
represent? How many samples? What do these samples
represent?
2 What challenges do you foresee in using gene expression for
the stated goal (predicting distant metastases)?
3 For a couple of genes, evaluate the association with the
phenotype. Do you see evidence of some predictive potential?
Test your intuition with a formal statistical test.
4 Demonstrate if co-linearity occurs between genes in this
dataset. Do you think this represents a challenge in the
analysis?
5 Use lasso, ridge and PCR methodology and make a predictor
based on the gene expression values. How many genes are
used for an optimal predictor? Evaluate the performance of
the predictors, and comment on what you find.
Pointers: For lasso/ridge use library(”glmnet”); in the glmnet functions alpha=1
corresponds to the lasso. Use the function predict to measure performance.