Methods of Parameter Estimation
In statistical models (e.g., linear regression, logistic regression, Cox’s
proportional hazards), a number of methods can be used to estimate
model parameters:
▶ Maximum Likelihood Estimation
▶ Estimating Equation Method
▶ Bayesian Estimation
▶ Custom Objective Functions (e.g., penalized likelihood)
Matrix Representation of Linear Regression Models
Define the response vector y, design matrix X, coefficient vector β,
and error vector ε as follows:
\[
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}
\]
Consider the linear regression model:
y = Xβ + ε,  E(ε) = 0,  Var(ε) = σ²I
Least Squares Estimation Using Matrix Algebra
For least squares estimation, we minimize the quantity:
Q = (y − Xβ)′(y − Xβ)
Expanding and simplifying, we have:
Q = y′y − 2y′Xβ + β′X′Xβ
To find the value of β that minimizes Q, we set the first derivative of Q to zero and solve:
\[
\frac{\partial Q}{\partial \beta} = -2X'y + 2X'X\beta = 0
\]
Leading to the estimate:
β̂LSE = (X′ X)−1 X′ y
Estimate of Variance
▶ The mean squared error (MSE) is an unbiased estimate of σ²:
\[
\hat{\sigma}^2 = \frac{1}{n-p-1}(y - X\hat{\beta})'(y - X\hat{\beta}) = \frac{1}{n-p-1}\, y'(I - H)y,
\]
where H = X(X′X)−1X′
▶ MSE adjusts for the loss of degrees of freedom due to estimating
the coefficients.
▶ β̂LSE and MSE are unbiased.
▶ The derivations of β̂LSE and MSE do not require a normality assumption.
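The following is a minimal NumPy sketch (not from the original slides; the simulated data and variable names are my own) that checks the formulas above: it computes β̂LSE = (X′X)−1X′y and then the MSE both from the residuals and via the hat matrix H.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2                                 # toy sample size and number of covariates
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# Least squares estimate: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# MSE computed from the residuals ...
resid = y - X @ beta_hat
mse_resid = resid @ resid / (n - p - 1)

# ... and equivalently via the hat matrix H = X(X'X)^{-1}X'
H = X @ np.linalg.solve(X.T @ X, X.T)
mse_hat = y @ (np.eye(n) - H) @ y / (n - p - 1)

print(beta_hat)
print(mse_resid, mse_hat)                    # the two MSE computations agree
```

The two MSE values coincide, illustrating the identity (y − Xβ̂)′(y − Xβ̂) = y′(I − H)y.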
Linear Models with Normal Errors
We further assume that the random error ε follows a normal
distribution:
y = Xβ + ε,  ε ∼ N(0, σ²I).
▶ Under this assumption, the sampling distributions of the
estimates have the following properties:
▶ β̂LSE follows a normal distribution.
▶ (n − p − 1) MSE/σ² follows a chi-squared distribution with n − p − 1 degrees of freedom.
▶ β̂LSE and MSE are independent.
▶ We can apply the maximum likelihood method to estimate the parameters.
Likelihood Function for Regression Models
The likelihood function for a multivariate normal distribution
N (µ, Σ) can be written as:
\[
L = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(y - \mu)'\Sigma^{-1}(y - \mu)\right\}
\]
For our regression model, given Σ = σ²I, we have:
\[
L = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\}
\]
The log-likelihood function can be written as:
\[
l = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)
\]
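As a sanity check on this closed form (a sketch with simulated data; the variable names are illustrative), the expression can be compared against the log-density of N(Xβ, σ²I) as evaluated by SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 0.5, -1.0])
sigma2 = 2.0
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# Log-likelihood from the closed-form expression above
resid = y - X @ beta
loglik_formula = (-n / 2 * np.log(2 * np.pi)
                  - n / 2 * np.log(sigma2)
                  - resid @ resid / (2 * sigma2))

# Same quantity from the N(X beta, sigma^2 I) density
loglik_mvn = multivariate_normal.logpdf(y, mean=X @ beta, cov=sigma2 * np.eye(n))

print(loglik_formula, loglik_mvn)            # the two values match
```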
Linear Regression Models using MLE
\[
l = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)
\]
The first derivatives with respect to the parameters (the score equations) are:
\[
\frac{\partial l}{\partial \beta} = \frac{1}{\sigma^2}X'(y - X\beta), \quad \text{using the fact that } \frac{\partial(\beta'\beta)}{\partial\beta} = 2\beta,
\]
\[
\frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta)
\]
Setting these two equations to zero gives:
\[
\hat{\beta} = (X'X)^{-1}X'y, \qquad \hat{\sigma}^2(\beta) = \frac{1}{n}(y - X\beta)'(y - X\beta)
\]
Substituting β̂ for β, we obtain the MLE for σ²:
\[
\hat{\sigma}^2 = \frac{1}{n}(y - X\hat{\beta})'(y - X\hat{\beta})
\]
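A short simulation sketch (my own toy example with a deliberately small n) makes the contrast with the MSE concrete: both estimators use the same residual sum of squares, but the MLE divides by n rather than n − p − 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 20, 3, 4.0                    # a small n makes the difference visible
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(scale=np.sqrt(sigma2), size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # identical under LSE and MLE
rss = np.sum((y - X @ beta_hat) ** 2)

sigma2_mle = rss / n                          # MLE: divides by n
sigma2_mse = rss / (n - p - 1)                # MSE: divides by n - p - 1

print(sigma2_mle, sigma2_mse)                 # the MLE is always the smaller of the two
```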
Linear Models for Independent Observations
▶ Reviewed two estimation methods: LSE and MLE, and derived
estimates using each method.
▶ Estimates for the regression parameter β are the same using LSE
or MLE (β̂LSE = β̂MLE ).
▶ Estimation of the variance parameter σ²:
▶ The MSE estimate of σ² is unbiased, taking into account the degrees of freedom used to estimate β.
▶ The MLE of σ² is biased; when n is small, the MLE tends to underestimate σ².
▶ When extending the methods to models for longitudinal data, we
will cover: MLE, REML, GLS and the Generalized Estimating
Equations approach (GEE).
General Linear Models for Longitudinal Data
Assume a general linear regression model,
yij = β0 + β1 xij1 + · · · + βp xijp + εij , i = 1, . . . , n, j = 1, . . . , k
Let
\[
y_i = \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ik} \end{pmatrix}, \quad
X_i = \begin{pmatrix}
1 & x_{i11} & x_{i12} & \cdots & x_{i1p} \\
1 & x_{i21} & x_{i22} & \cdots & x_{i2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{ik1} & x_{ik2} & \cdots & x_{ikp}
\end{pmatrix},
\]
denote the k × 1 vector of response variables and the k × (p + 1)
design matrix of covariates for the ith subject.
Define
\[
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}.
\]
The model can be written as:
Y = Xβ + ε, E(ε) = 0, Var(ε) = Σ
The TLC Data
There were 100 study subjects randomly assigned to treatment or
placebo and measured at baseline, 1, 4, and 6 weeks post baseline.
The data for the first 2 subjects:
Treat Week y
0 0 30.8
0 1 26.9
0 4 25.8
0 6 23.8
1 0 26.5
1 1 14.8
1 4 19.5
1 6 21.0
Construct Xi , yi , X and Y. What are the dimensions of Xi , yi , X
and Y?
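One possible construction in NumPy, assuming a model with an intercept, treatment, and week as covariates (one plausible choice; the slides do not fix the covariate list), using only the two subjects displayed:

```python
import numpy as np

# Data for the first two subjects (columns Treat, Week, y from the slide)
treat = np.array([0, 0, 0, 0, 1, 1, 1, 1])
week  = np.array([0, 1, 4, 6, 0, 1, 4, 6])
y     = np.array([30.8, 26.9, 25.8, 23.8, 26.5, 14.8, 19.5, 21.0])

k = 4                                         # repeated measures per subject
X_list, y_list = [], []
for i in range(2):                            # only the two subjects displayed
    rows = slice(i * k, (i + 1) * k)
    X_i = np.column_stack([np.ones(k), treat[rows], week[rows]])   # intercept, treat, week
    X_list.append(X_i)
    y_list.append(y[rows])

X = np.vstack(X_list)                         # stacked design matrix
Y = np.concatenate(y_list)                    # stacked response vector

print(X_list[0].shape, y_list[0].shape)       # each X_i is 4 x 3, each y_i has length 4
print(X.shape, Y.shape)                       # 8 x 3 and 8 x 1 here
```

With all n = 100 subjects, this choice of covariates gives X of dimension 400 × 3 and Y of dimension 400 × 1; each Xi is 4 × 3 and each yi is 4 × 1.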
Assumptions
▶ The individuals represent a random sample from the population
of interest.
▶ The covariates in X are fixed and measured without random error.
▶ The elements of the vector of repeated measures yi1 , ..., yik , have
a conditional Multivariate Normal (MVN) distribution, with
means
µij = E(yij |xij ) = β0 + β1 xij1 + · · · + βp xijp
▶ Observations from different individuals are independent, while
repeated measurements of the same individual are not assumed
to be independent.
Maximum Likelihood for General Linear Models
The likelihood function for a multivariate normal distribution:
\[
L(\beta, \Sigma) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(Y - X\beta)'\Sigma^{-1}(Y - X\beta)\right\}
\]
The log-likelihood:
\[
\log L(\beta, \Sigma) \propto -\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(Y - X\beta)'\Sigma^{-1}(Y - X\beta)
\]
Using the matrix derivative formula below and the chain rule,
\[
\frac{\partial}{\partial x}(x'Ax) = 2Ax,
\]
we obtain the score equation for β:
\[
\frac{\partial \log L}{\partial \beta} = X'\Sigma^{-1}(Y - X\beta)
\]
Assuming that Σ is known, solving the score equation for β gives:
\[
X'\Sigma^{-1}X\beta = X'\Sigma^{-1}Y \;\Rightarrow\; \hat{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y
\]
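A minimal sketch of this estimator (my own simulated data, with a covariance matrix assumed known), comparing the GLS/ML estimate with ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 40
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([2.0, 1.0])

# Assume Sigma is known; a simple exchangeable pattern is used here as a toy example
Sigma = 1.0 * np.eye(N) + 0.5 * np.ones((N, N))
y = rng.multivariate_normal(mean=X @ beta, cov=Sigma)

# GLS / ML estimate for known Sigma: (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
Sigma_inv = np.linalg.inv(Sigma)
XtSi = X.T @ Sigma_inv
beta_gls = np.linalg.solve(XtSi @ X, XtSi @ y)

# Ordinary least squares ignores the correlation (equivalent only when Sigma = sigma^2 I)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_gls, beta_ols)
```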
Generalized Least Squares
▶ Under the multivariate normal assumption, MLE of β is also a
generalized least squares estimate of β
▶ Note that maximizing the likelihood for a given Σ is equivalent to finding the estimate of β that minimizes
(Y − Xβ)′Σ−1(Y − Xβ)
▶ The estimates of β that minimize this function are called
generalized least squares estimates (GLS).
▶ Instead of minimizing the squared distance between Y and Xβ, a
generalized squared distance determined by Σ−1 is minimized.
Properties of GLS
For any choice of Σ, the GLS estimate of β is unbiased; that is
E(β̂) = β
Variance:
Cov(β̂) = (X′ Σ−1 X)−1
Sampling Distribution:
β̂ ∼ N (β, (X′Σ−1X)−1)
▶ If Σ = σ 2 I, the GLS estimate reduces to the ordinary least
squares estimate.
▶ The most efficient generalized least squares estimate is the one
that uses the true value of Σ.
Maximum Likelihood Estimation for Σ
Since we usually do not know Σ, we estimate it from the data. We now derive the MLE of Σ.
To take the derivative with respect to Σ−1, we use the fact that (y − Xβ)′Σ−1(y − Xβ) is a scalar and that
tr(X′AX) = tr(XX′A)
The log-likelihood can be rewritten as:
\[
\begin{aligned}
\log L(\beta, \Sigma^{-1}) &\propto -\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(Y - X\beta)'\Sigma^{-1}(Y - X\beta) \\
&= \frac{1}{2}\ln|\Sigma^{-1}| - \frac{1}{2}\,\mathrm{tr}\!\left[(Y - X\beta)'\Sigma^{-1}(Y - X\beta)\right] \\
&= \frac{1}{2}\ln|\Sigma^{-1}| - \frac{1}{2}\,\mathrm{tr}\!\left[(Y - X\beta)(Y - X\beta)'\Sigma^{-1}\right]
\end{aligned}
\]
Maximum Likelihood Estimation for Σ
Using the properties of matrix derivatives
\[
\frac{\partial\,\mathrm{tr}(CB)}{\partial B} = C' \quad \text{when } B \text{ is symmetric}, \qquad \text{and} \qquad \frac{\partial \ln|\Sigma^{-1}|}{\partial \Sigma^{-1}} = \Sigma,
\]
the derivative of the log-likelihood with respect to Σ−1 is given by:
\[
\frac{\partial \log L}{\partial \Sigma^{-1}} = \frac{1}{2}\Sigma - \frac{1}{2}(Y - X\beta)(Y - X\beta)'
\]
Setting the first derivative to 0 gives us the MLE of Σ:
Σ̂ = (Y − Xβ)(Y − Xβ)′
Numerical algorithms for finding MLE
Solving the two score equations yields:
β̂ = (X′ Σ−1 X)−1 X′ Σ−1 Y
Σ̂ = (Y − Xβ)(Y − Xβ)′
An iterative algorithm needs to be used to obtain ML estimates of β
and Σ:
1. Choose an initial estimate of Σ to obtain an estimate of β.
β̂ = (X′ Σ̂−1 X)−1 X′ Σ̂−1 Y
2. Use β̂ to obtain a new estimate for Σ.
Σ̂ = (Y − Xβ̂)(Y − Xβ̂)′
3. The process iterates until the estimates of β and Σ converge.
Asymptotic Distributions of β̂
▶ Recall: β̂ ∼ N (β, (X′ Σ−1 X)−1 ) when Σ is known.
▶ In large samples, the resulting estimator of β from the iterative
algorithm will have the same properties as when Σ is known.
▶ To test hypotheses about β we can make direct use of the ML
estimate and its estimated covariance matrix,
\[
\widehat{\mathrm{Cov}}(\hat{\beta}) = (X'\hat{\Sigma}^{-1}X)^{-1}
\]
Statistical Inference for β
Let L denote a matrix or vector of known weights (often representing
contrasts of interest) and suppose that it is of interest to test
H0 : Lβ = 0 vs HA : Lβ ̸= 0
For example, if a model has 3 regression parameters β = (β1 , β2 , β3 ),
the following composite hypotheses can be expressed using L matrices:
▶ To test H0 : β1 = β2, we can use L = (1, −1, 0), so that Lβ = β1 − β2; testing H0 is then equivalent to testing H0 : Lβ = 0.
▶ To test H0 : β1 = β2 = β3, we can use
\[
L = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}, \qquad
L\beta = \begin{pmatrix} \beta_1 - \beta_2 \\ \beta_2 - \beta_3 \end{pmatrix},
\]
and test H0 : Lβ = 0.
Wald Test
▶ A natural estimate of Lβ is Lβ̂ and the covariance matrix of Lβ̂
is given by
LCov(β̂)L′
▶ Suppose that L is a single row vector. An approximate 95% CI is given by
\[
L\hat{\beta} \pm 1.96\sqrt{L\,\mathrm{Cov}(\hat{\beta})\,L'}
\]
▶ To test H0 : Lβ = 0 versus HA : Lβ ̸= 0, we can use the Wald statistic
\[
Z = \frac{L\hat{\beta}}{\sqrt{L\,\mathrm{Cov}(\hat{\beta})\,L'}}
\]
and compare it with a standard normal distribution.
▶ The Wald test is based on asymptotic properties, meaning it
tends to perform well when sample sizes are large
Generalization of Wald Test
▶ Recall: If Z is a standard normal random variable, then Z 2 has a
χ2 distribution with 1 df. Thus, an identical test compares
W = (Lβ̂)′ (LCov(β̂)L′ )−1 (Lβ̂)
to the critical values of the χ2 distribution with 1 df.
▶ This approach readily generalizes to L having more than one row
and this allows simultaneous testing of more than one hypothesis.
▶ Suppose that L has r rows (number of contrasts). Then a
simultaneous test of the r contrasts is given by
W = (Lβ̂)′ (LCov(β̂)L′ )−1 (Lβ̂)
which has a χ2 distribution with r df if H0 is true
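A numerical sketch of both versions of the Wald test (the estimates and covariance matrix below are hypothetical numbers, not output from a fitted model), using the earlier example contrast matrices:

```python
import numpy as np
from scipy.stats import chi2, norm

# Hypothetical estimates from a fitted model with beta = (beta1, beta2, beta3)
beta_hat = np.array([1.20, 0.95, 0.40])
cov_beta = np.array([[0.040, 0.010, 0.005],
                     [0.010, 0.050, 0.008],
                     [0.005, 0.008, 0.060]])          # estimated Cov(beta_hat)

# Single contrast, H0: beta1 = beta2, i.e. L = (1, -1, 0)
L1 = np.array([[1.0, -1.0, 0.0]])
est = (L1 @ beta_hat)[0]
se = np.sqrt((L1 @ cov_beta @ L1.T)[0, 0])
z = est / se                                          # Wald Z statistic
ci = (est - 1.96 * se, est + 1.96 * se)               # approximate 95% CI
p_z = 2 * (1 - norm.cdf(abs(z)))

# Two contrasts simultaneously, H0: beta1 = beta2 = beta3 (r = 2 rows)
L2 = np.array([[1.0, -1.0, 0.0],
               [0.0,  1.0, -1.0]])
Lb = L2 @ beta_hat
W = Lb @ np.linalg.solve(L2 @ cov_beta @ L2.T, Lb)    # W = (Lb)'(L Cov L')^{-1}(Lb)
p_W = 1 - chi2.cdf(W, df=L2.shape[0])                 # chi-squared with r = 2 df

print(z, ci, p_z)
print(W, p_W)
```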
Hypothesis Tests for β in SAS PROC MIXED
By default, PROC MIXED performs these tests for coefficients:
▶ When L is a row vector, a t-test is used:
\[
T = \frac{L\hat{\beta}}{\sqrt{L\,\mathrm{Cov}(\hat{\beta})\,L'}} \sim t(\nu), \quad \nu = \text{df}
\]
▶ When the row rank of L is greater than 1 (a contrast matrix), an F-test is used:
\[
F = \frac{(L\hat{\beta})'(L\,\mathrm{Cov}(\hat{\beta})\,L')^{-1}(L\hat{\beta})}{r} \sim F(\mathrm{rank}(L), \nu),
\]
where r = rank(LCov(β̂)L′).
Likelihood Ratio Test
Suppose that we are interested in comparing two nested models, a
“full” model and a “reduced” model.
Aside: Nested Models
When one model (the “reduced” model) is a special case of the other
(the “full model”), the reduced model is said to be nested within the
full model.
We can compare two nested models by comparing their maximized log-likelihoods, say l̂full and l̂red; the former is at least as large as the latter.
The larger the difference between l̂full and l̂red, the stronger the evidence that the reduced model is inadequate.
Likelihood Ratio Test
A formal test is obtained by taking
\[
2(\hat{l}_{\mathrm{full}} - \hat{l}_{\mathrm{red}})
\]
and comparing the statistic to a chi-squared distribution with degrees of freedom equal to the difference between the number of parameters in the full and reduced models.
Formally, this test is called the likelihood ratio test (LRT).
We can use LRTs for hypotheses about models for the mean and the
covariance.
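A minimal sketch of the computation (the two maximized log-likelihood values and the degrees of freedom below are hypothetical, chosen only to illustrate the arithmetic):

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods from two nested fits
loglik_full = -612.4                      # full model
loglik_red = -618.9                       # reduced model (nested in the full model)
df = 3                                    # difference in the number of parameters

lrt = 2 * (loglik_full - loglik_red)      # likelihood ratio statistic
p_value = 1 - chi2.cdf(lrt, df=df)        # compare with a chi-squared distribution on df degrees of freedom

print(lrt, p_value)
```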
Numerical algorithms for finding MLE
▶ An iterative algorithm for obtaining maximum likelihood
estimates for longitudinal models:
β̂ = (X′ Σ̂−1 X)−1 X′ Σ̂−1 Y
Σ̂ = (Y − Xβ̂)(Y − Xβ̂)′
▶ The MLE of β depends on the MLE of Σ; different specifications of the covariance structure will lead to different estimates of β.
Covariance Matrix for Longitudinal Data
In our MLE derivation, we used the most general form of the covariance matrix, Σ, without assuming any special pattern. In practice, a completely general covariance matrix is computationally too costly because
▶ Σ is a matrix of large order (nk × nk),
▶ maximum likelihood estimation requires computing Σ−1 iteratively.
Luckily, in most longitudinal data analysis, it is reasonable to assume
that observations collected from different individuals are independent.
Covariance Matrix for Longitudinal Data
Assuming independence between individuals, the covariance matrix
has the following structure:
\[
\Sigma = \begin{pmatrix}
\Sigma_1 & 0 & \cdots & 0 \\
0 & \Sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \Sigma_n
\end{pmatrix}
\]
If we further assume Σ1 = Σ2 = · · · = Σn, this covariance pattern greatly reduces the dimensionality of ML (or REML) estimation, leading to (verify on your own):
\[
\hat{\beta} = \left(\sum_{i=1}^{n} X_i'\hat{\Sigma}_i^{-1}X_i\right)^{-1}\left(\sum_{i=1}^{n} X_i'\hat{\Sigma}_i^{-1}y_i\right)
\]
\[
\hat{\Sigma}_i = \frac{1}{n}\sum_{i=1}^{n}(y_i - X_i\hat{\beta})(y_i - X_i\hat{\beta})'
\]
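These per-subject equations can be turned into the iterative algorithm described earlier. Below is a minimal sketch for balanced, simulated data under the common-Σi assumption (the design, sample sizes, and starting value are my own choices): step 1 updates β by GLS given the current Σ̂i, step 2 updates Σ̂i from the residuals, and the two steps alternate until convergence.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 4                                          # subjects and measurements per subject

# Simulate balanced data with a common within-subject covariance matrix
beta_true = np.array([25.0, -3.0, -0.8])               # intercept, treatment, week
Sigma_true = 4.0 * np.eye(k) + 2.0 * np.ones((k, k))
X_list, y_list = [], []
for i in range(n):
    X_i = np.column_stack([np.ones(k),
                           np.repeat(rng.integers(0, 2), k),   # treatment indicator
                           np.array([0, 1, 4, 6])])            # week
    y_list.append(rng.multivariate_normal(X_i @ beta_true, Sigma_true))
    X_list.append(X_i)

# Iterative ML: alternate between beta (GLS given Sigma_i) and Sigma_i (given beta)
Sigma_hat = np.eye(k)                                  # initial estimate of Sigma_i
for _ in range(100):
    Si = np.linalg.inv(Sigma_hat)
    A = sum(X_i.T @ Si @ X_i for X_i in X_list)
    b = sum(X_i.T @ Si @ y_i for X_i, y_i in zip(X_list, y_list))
    beta_hat = np.linalg.solve(A, b)                   # step 1: update beta
    resids = [y_i - X_i @ beta_hat for X_i, y_i in zip(X_list, y_list)]
    Sigma_new = sum(np.outer(r, r) for r in resids) / n    # step 2: update Sigma_i
    if np.max(np.abs(Sigma_new - Sigma_hat)) < 1e-8:   # step 3: check convergence
        Sigma_hat = Sigma_new
        break
    Sigma_hat = Sigma_new

print(beta_hat)
print(Sigma_hat)
```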
Covariance Matrix for Longitudinal Data
▶ The assumption of independence between subjects allows for a
reduction in matrix dimension in the ML algorithm from N = nk
to k.
▶ In the TLC data (n = 100, k = 4), the outcome vector Y is a
400 × 1 vector; hence, a general covariance matrix Σ would have
dimensions 400 × 400.
▶ With the assumption of independence between subjects, we only
need to estimate Σi with dimensions of 4 × 4 in each iteration.
▶ Only 10 parameters (the distinct variances and covariances of a 4 × 4 symmetric matrix) are involved in defining Σi.
Covariance Matrix for Longitudinal Data
Since we have shown that the MLE of β depends on the MLE of Σ,
and there are usually many plausible patterns for the covariance
matrices Σi in a given dataset, fitting a longitudinal model will
require two key decisions:
▶ Which independent variables should be included in the model?
▶ What covariance matrix structure should be used to best capture
correlations among repeated measures?
Examples of Covariance Matrix Structures
One commonly used type of covariance matrix is the compound
symmetry (CS) structure:
\[
\Sigma_i = \begin{pmatrix}
\sigma^2 + \sigma_b^2 & \sigma_b^2 & \cdots & \sigma_b^2 \\
\sigma_b^2 & \sigma^2 + \sigma_b^2 & \cdots & \sigma_b^2 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_b^2 & \sigma_b^2 & \cdots & \sigma^2 + \sigma_b^2
\end{pmatrix} = \sigma^2 I + \sigma_b^2 J
\]
Or alternatively:
\[
\Sigma_i = \begin{pmatrix}
\sigma^2 & \sigma^2\rho & \cdots & \sigma^2\rho \\
\sigma^2\rho & \sigma^2 & \cdots & \sigma^2\rho \\
\vdots & \vdots & \ddots & \vdots \\
\sigma^2\rho & \sigma^2\rho & \cdots & \sigma^2
\end{pmatrix}
\]
Therefore, two parameters are involved in defining the covariance
matrix in a compound symmetry structure.
Examples of Covariance Matrix Structures
A compound symmetry covariance matrix assumes that the variances of the responses, σ² + σb², are constant across time points.
A different covariance matrix structure that relaxes this assumption is
the heterogeneous compound symmetry (HCS) structure:
\[
\Sigma_i = \begin{pmatrix}
\sigma_1^2 & \sigma_1\sigma_2\rho & \cdots & \sigma_1\sigma_k\rho \\
\sigma_1\sigma_2\rho & \sigma_2^2 & \cdots & \sigma_2\sigma_k\rho \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_1\sigma_k\rho & \sigma_2\sigma_k\rho & \cdots & \sigma_k^2
\end{pmatrix}
\]
In the HCS structure, k + 1 parameters are needed to define the covariance matrix (k standard deviations σ1, . . . , σk and one correlation ρ).
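Two small helper functions sketching how these structures might be constructed numerically (the function names and parameter values are illustrative):

```python
import numpy as np

def cs_cov(k, sigma2, sigma2_b):
    """Compound symmetry: sigma^2 I + sigma_b^2 J (2 parameters)."""
    return sigma2 * np.eye(k) + sigma2_b * np.ones((k, k))

def hcs_cov(sd, rho):
    """Heterogeneous CS: sd_j^2 on the diagonal, sd_j sd_l rho off the diagonal (k + 1 parameters)."""
    sd = np.asarray(sd, dtype=float)
    S = rho * np.outer(sd, sd)
    np.fill_diagonal(S, sd ** 2)
    return S

print(cs_cov(4, sigma2=2.0, sigma2_b=1.5))          # 4 x 4 CS matrix
print(hcs_cov([1.0, 1.5, 2.0, 2.5], rho=0.4))       # 4 x 4 HCS matrix
```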
MLE for Σ are biased
Recall that the MLE of Σ is given by:
\[
\hat{\Sigma}_i = \frac{1}{n}\sum_{i=1}^{n}(y_i - X_i\hat{\beta})(y_i - X_i\hat{\beta})'
\]
▶ Although the MLEs of the elements of Σ (the variances and covariances) have the usual large-sample properties (they are asymptotically normal and unbiased), the estimates are biased toward 0 in small samples.
▶ The situation is similar to the bias of the MLE of σ² in linear regression models, where the ML estimate does not adjust for the effect of estimating β.
▶ In ordinary least squares regression, we can eliminate the bias by
dividing by n − p − 1 instead of n.
▶ Though the correction is not always as simple in the longitudinal
setting, the same idea applies. We can correct the estimates of
variance parameters for the bias.
Restricted Maximum Likelihood Estimation (REML)
▶ REML is introduced for the purpose of finding an unbiased
estimator for the covariance matrix Σ.
▶ Two significant academic papers in the development of REML:
▶ Patterson H. D.; Thompson R. (1971). “Recovery of inter-block
information when block sizes are unequal”. Biometrika. 58 (3):
545. doi:10.1093/biomet/58.3.545.
▶ Harville D. A. (1977). “Maximum Likelihood Approaches to
Variance Component Estimation and to Related Problems”.
Journal of the American Statistical Association. 72 (358):
320–338. doi:10.2307/2286796.
REML
▶ The main idea behind REML is to use a slightly different
likelihood function (restricted ML) that eliminates the
parameters β, so that it is defined only in terms of Σ.
▶ One possible way to obtain the restricted likelihood is to
transform the data to a set of linear combinations of observations
whose distribution does not depend on β.
▶ An example of such linear combinations of observations: residuals
from OLS estimates of β
Y − Xβ̂ = (I − H)Y, where H = X(X′ X)−1 X′ .
The likelihood of these residuals depends only on Σ and not on β.
▶ When the residual likelihood is maximized, we obtain estimates
of Σ whose degrees of freedom are effectively corrected for the
reduction in degrees of freedom due to estimating β.
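A quick numerical check of the key fact (a sketch with simulated data): because (I − H)X = 0, the OLS residual vector (I − H)Y equals (I − H)ε and therefore does not depend on β.

```python
import numpy as np

rng = np.random.default_rng(5)
N, q = 12, 3                                  # observations and columns of X
X = np.column_stack([np.ones(N), rng.normal(size=(N, q - 1))])
beta = rng.normal(size=q)
eps = rng.normal(size=N)
y = X @ beta + eps

H = X @ np.linalg.solve(X.T @ X, X.T)         # hat matrix
P = np.eye(N) - H

# (I - H)X = 0, so (I - H)y = (I - H)eps: the residuals do not depend on beta
print(np.allclose(P @ X, 0))                  # True
print(np.allclose(P @ y, P @ eps))            # True
```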
REML
Observe that the vector
\[
(I - H)Y = (I - X(X'X)^{-1}X')Y \overset{\mathrm{def}}{=} PY
\]
is free of β.
Let M be a matrix whose columns are the eigenvectors of P corresponding to its non-zero eigenvalues. The log-likelihood for MY is the residual (or restricted) likelihood:
\[
\log L_R(\Sigma \mid MY) = -\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(Y - X\hat{\beta})'\Sigma^{-1}(Y - X\hat{\beta}) - \frac{1}{2}\ln|X'\Sigma^{-1}X|,
\]
where β̂ is the GLS estimate of β.
REML
▶ Thus, rather than maximizing
\[
-\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(Y - X\hat{\beta})'\Sigma^{-1}(Y - X\hat{\beta}),
\]
▶ REML maximizes the following slightly modified log-likelihood:
\[
-\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(Y - X\hat{\beta})'\Sigma^{-1}(Y - X\hat{\beta}) - \frac{1}{2}\ln|X'\Sigma^{-1}X|
\]
▶ The extra determinant term effectively makes a correction that is analogous to the degrees-of-freedom correction in the denominator of σ̂².
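A minimal sketch of how this restricted log-likelihood might be evaluated for a candidate Σ (the function and toy data are illustrative, not a full REML fitting routine); β̂ is profiled out via GLS inside the function:

```python
import numpy as np

def reml_loglik(Sigma, X, y):
    """Restricted log-likelihood of a candidate Sigma (up to an additive constant)."""
    Si = np.linalg.inv(Sigma)
    XtSiX = X.T @ Si @ X
    beta_gls = np.linalg.solve(XtSiX, X.T @ Si @ y)     # GLS estimate for this Sigma
    r = y - X @ beta_gls
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_XtSiX = np.linalg.slogdet(XtSiX)
    return -0.5 * logdet_Sigma - 0.5 * (r @ Si @ r) - 0.5 * logdet_XtSiX

# Toy usage: compare two candidate covariance matrices on simulated data
rng = np.random.default_rng(6)
N = 8
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)
print(reml_loglik(np.eye(N), X, y), reml_loglik(2 * np.eye(N), X, y))
```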
REML
Use an iterative algorithm to find REML estimates of β and Σ:
\[
\hat{\beta}_{\mathrm{REML}} = (X'\hat{\Sigma}_{\mathrm{REML}}^{-1}X)^{-1}X'\hat{\Sigma}_{\mathrm{REML}}^{-1}Y
\]
\[
\hat{\Sigma}_{\mathrm{REML}} = (Y - X\hat{\beta}_{\mathrm{REML}})(Y - X\hat{\beta}_{\mathrm{REML}})' + X\,\widehat{\mathrm{Cov}}(\hat{\beta}_{\mathrm{REML}})\,X'
\]
Asymptotic properties similar to those of the ML estimates have also been established:
\[
\hat{\beta}_{\mathrm{REML}} \sim N\!\left(\beta,\, (X'\hat{\Sigma}_{\mathrm{REML}}^{-1}X)^{-1}\right), \qquad
\widehat{\mathrm{Cov}}(\hat{\beta}_{\mathrm{REML}}) = (X'\hat{\Sigma}_{\mathrm{REML}}^{-1}X)^{-1}
\]
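For a concrete illustration, the stacked REML update above can be carried over to the common-Σi setting by taking its subject-level blocks, in which the correction term contributes Xi Ĉov(β̂)Xi′ for subject i. The per-subject form below is my own translation of the stacked formula, not taken from the slides; it compares the ML and REML iterations on the same simulated data.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 30, 4                                           # a deliberately small number of subjects
X_list = [np.column_stack([np.ones(k), np.array([0, 1, 4, 6])]) for _ in range(n)]
Sigma_true = np.eye(k) + 0.5 * np.ones((k, k))
y_list = [rng.multivariate_normal(X_i @ np.array([25.0, -1.0]), Sigma_true) for X_i in X_list]

def fit(reml, n_iter=200):
    Sigma = np.eye(k)
    for _ in range(n_iter):
        Si = np.linalg.inv(Sigma)
        A = sum(X_i.T @ Si @ X_i for X_i in X_list)    # X' Sigma^{-1} X
        b = sum(X_i.T @ Si @ y_i for X_i, y_i in zip(X_list, y_list))
        beta = np.linalg.solve(A, b)
        cov_beta = np.linalg.inv(A)                    # Cov(beta_hat)
        S = np.zeros((k, k))
        for X_i, y_i in zip(X_list, y_list):
            r = y_i - X_i @ beta
            S += np.outer(r, r)
            if reml:
                S += X_i @ cov_beta @ X_i.T            # subject-level block of X Cov(beta_hat) X'
        Sigma = S / n
    return beta, Sigma

beta_ml, Sigma_ml = fit(reml=False)
beta_reml, Sigma_reml = fit(reml=True)
print(np.diag(Sigma_ml))                               # ML variance estimates (no correction)
print(np.diag(Sigma_reml))                             # REML estimates include the correction term
```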
Hypothesis Testing on β̂REML
Given
\[
\hat{\beta}_{\mathrm{REML}} \sim N\!\left(\beta,\, (X'\hat{\Sigma}_{\mathrm{REML}}^{-1}X)^{-1}\right),
\]
we can conduct hypothesis tests in the same way as for the MLE:
\[
H_0: L\beta = 0 \quad \text{vs} \quad H_A: L\beta \neq 0
\]
▶ When L is a row vector, use a t-test:
\[
T = \frac{L\hat{\beta}_{\mathrm{REML}}}{\sqrt{L\,\widehat{\mathrm{Cov}}(\hat{\beta}_{\mathrm{REML}})\,L'}} \sim t(\nu), \quad \nu = \text{df}
\]
▶ When the row rank of L is greater than 1, use an F-test:
\[
F = \frac{(L\hat{\beta}_{\mathrm{REML}})'(L\,\widehat{\mathrm{Cov}}(\hat{\beta}_{\mathrm{REML}})\,L')^{-1}(L\hat{\beta}_{\mathrm{REML}})}{r} \sim F(\mathrm{rank}(L), \nu),
\]
where r = rank(LĈov(β̂REML)L′).
Summary
▶ Linear regression models are special cases of general linear
models when Σ = σ 2 I.
▶ For general linear models, both ML and REML estimates of the
regression parameter β depend on the estimate of the covariance
matrix Σ.
▶ The mathematical derivation assumes general forms of the
covariance matrix.
▶ Different patterns assumed for Σ will lead to different estimates
for β.
▶ Future lectures will cover how to use SAS to specify a general
linear model, including various patterns for Σ.