411 Note LDV (chapter 17)
Big Picture
The Limited Dependent Variable (LDV) model can be used when the dependent variable
is special. Examples are
1. y is binary (eg: voting for Trump or Biden)—probit or logistic (logit) regression
2. y is categorical with more than two unordered outcomes (eg: drinking Coke,
Pepsi, or water)—multinomial logistic regression
3. y is categorical with ordered outcomes (eg: evaluation is above, equal or below
average)—ordered logistic regression
4. y represents corner solution (eg: consumption of cigarettes)—Tobit model
5. y represents counts (eg: the number of car accidents)—Poisson model
6. y represents duration (eg: survival time of a patient)—Cox model
In general, an LDV model is estimated by the maximum likelihood (ML) method
Binomial Distribution
1. Suppose we flip a coin once. The probability of seeing a head is p, and the probability of seeing a tail is 1 − p
2. Suppose we flip the same coin twice. The probability of seeing one head and one tail is $p(1-p) + (1-p)p = C_2^1\, p(1-p)$; the probability of seeing two heads is $C_2^2\, p^2(1-p)^0$; the probability of seeing two tails is $C_2^0\, p^0(1-p)^2$
3. In general, if we flip the coin n times, the probability of seeing r heads is
\[ P(r \text{ heads}) = C_n^r\, p^r (1-p)^{n-r} \tag{1} \]
where
\[ C_n^r \equiv \binom{n}{r} = \frac{n!}{r!\,(n-r)!} \tag{2} \]
In particular, for a single flip write y = 1 for a head and y = 0 for a tail; the probability of observing y is
\[ f = p^y (1-p)^{1-y}, \qquad (y = 1, 0) \tag{4} \]
4. For a given iid sample $(y_1, y_2, \ldots, y_n)$, the joint probability of observing this particular sequence is
\[ L = \prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i} = p^{\sum y_i} (1-p)^{\,n - \sum y_i} \tag{5} \]
5. Compared to (1), equation (5) drops $C_n^r$ since the sample is given (only one combination). Note that $\sum y_i = r$ if there are r successes.
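As a quick numerical check of (1) and (5), here is a minimal Python sketch; the sample y and the values of n, r, p are made up for illustration:

```python
import numpy as np
from math import comb
from scipy.stats import binom

# probability of r heads in n flips: equation (1), checked against scipy
n, r, p = 5, 3, 0.6
manual = comb(n, r) * p**r * (1 - p)**(n - r)
print(manual, binom.pmf(r, n, p))          # both 0.3456

# joint probability (5) of one particular iid sequence with r successes
y = np.array([1, 0, 1, 1, 0])              # hypothetical sample with r = 3
joint = p**y.sum() * (1 - p)**(len(y) - y.sum())
print(joint, manual / comb(n, r))          # identical: (5) drops C_n^r
```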
Maximum Likelihood (ML) Method
2. When p is unknown (eg, we are not sure whether the coin is fair) the ML
method estimates p by maximizing the likelihood function (i.e., maximizing the
probability of observing the given sample)
\[ \hat{p} = \arg\max_{p} L = \arg\max_{p}\; p^{\sum y_i} (1-p)^{\,n - \sum y_i} \tag{7} \]
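In this simple case (7) has a closed-form solution (unlike the probit case below): taking logs and setting the derivative to zero gives the sample proportion of successes,
\[ \log L = \Big(\sum_i y_i\Big)\log p + \Big(n - \sum_i y_i\Big)\log(1-p), \qquad \frac{d\log L}{dp} = \frac{\sum_i y_i}{p} - \frac{n - \sum_i y_i}{1-p} = 0 \;\;\Longrightarrow\;\; \hat{p} = \frac{\sum_i y_i}{n} = \bar{y} \]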
Linear Probability Model (LPM)
1. So far we have assumed p is constant because we flip the same coin again and again. In reality, we ask different persons whether they vote for Trump or Biden, so p is no longer constant
2. We want to use covariates x, such as gender, age, and education, to explain p. LPM is based on a linear function
\[ p_i = P(y_i = 1 \mid x_i) = E(y_i \mid x_i) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} \tag{10} \]
where we use the fact that for a Bernoulli random variable, P(y = 1) = E(y)
3. (Critical thinking) Can x include choice-specific characteristics, such as the age of a presidential candidate or the price of drinks?
4. Basically equation (10) is the population regression function (conditional mean).
Thus LPM amounts to regressing y onto x via OLS
5. Heteroskedasticity-robust standard errors should be used for LPM, since the variance of a Bernoulli variable,
\[ \text{Var}(y_i \mid x_i) = p_i (1 - p_i), \tag{11} \]
changes with x_i, so the error is heteroskedastic by construction
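A minimal sketch of LPM estimation in Python with robust standard errors; the data below are simulated and the covariate choices are illustrative assumptions, not part of the original notes:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
# hypothetical covariates: gender dummy, age, years of education
gender = rng.integers(0, 2, n)
age = rng.uniform(20, 70, n)
educ = rng.uniform(8, 20, n)
X = sm.add_constant(np.column_stack([gender, age, educ]))

# simulate a binary outcome (made-up coefficients, for illustration only)
y = (0.2 * gender + 0.01 * age - 0.02 * educ + rng.normal(0, 0.3, n) > 0.2).astype(int)

# LPM: regress the binary y on x by OLS; use heteroskedasticity-robust (HC1) standard errors
lpm = sm.OLS(y, X).fit(cov_type="HC1")
print(lpm.params)   # constant marginal effects on P(y = 1)
print(lpm.bse)      # robust standard errors
# caveat: fitted values X @ lpm.params may fall outside [0, 1]
```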
Probit Model
1. LPM has low computational cost and a constant marginal effect, but it fails to account for two facts: (i) y is special (binary), and (ii) the probability should be bounded, 0 ≤ p_i ≤ 1. Those drawbacks of LPM motivate nonlinear models such as Probit and Logit models
2. Probit model uses the cumulative distribution function (cdf) of the normal distribution, Φ, to specify the probability, and 0 ≤ Φ ≤ 1 by construction
\[ p_i = \Phi(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}) = \int_{-\infty}^{\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz \tag{12} \]
3. Now the unknown parameters are the βs, and ML estimates them by maximizing log(L) with numerical methods (there is no closed-form or analytical solution)
\[ \hat{\beta} = \arg\max_{\beta} \log L, \qquad \log L = \log\Big(\prod_{i=1}^{n} f_i\Big) = \sum_{i=1}^{n} \log f_i = \sum_{i=1}^{n} \log\big(p_i^{y_i}(1-p_i)^{1-y_i}\big) = \sum_{i=1}^{n} \big[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\,\big] \tag{13} \]
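A minimal sketch of the numerical maximization in (13) for a probit model. The simulated data, starting values, and the BFGS optimizer are illustrative assumptions; any numerical optimizer would do:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])      # intercept + one covariate
beta_true = np.array([0.5, 1.0])                            # made-up "true" coefficients
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)    # simulated binary outcome

def neg_loglik(beta):
    p = norm.cdf(X @ beta)                  # probit probability, equation (12)
    p = np.clip(p, 1e-10, 1 - 1e-10)        # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # minus equation (13)

res = minimize(neg_loglik, x0=np.zeros(X.shape[1]), method="BFGS")
print(res.x)    # numerical ML estimates, close to beta_true in large samples
```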
Derivation of Probit Model using Latent Variable
1. It seems that Φ comes from nowhere. Well, we can use a latent variable to justify it (the derivation is sketched after this list)
2. Suppose a voter gives each presidential candidate a score, and the score depends on the voter's gender, age, education, etc. The score is unobserved, so it is called a latent variable. A person votes for Trump, y = 1, only if his score is above a threshold, say, 0
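A sketch of that derivation; the notation for the latent score ($y_i^*$) and the standard normal error ($e_i$) is an assumption chosen to match the probit index in (12):
\[ y_i^* = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + e_i, \qquad e_i \sim N(0, 1), \qquad y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \]
\[ P(y_i = 1 \mid x_i) = P\big(e_i > -(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})\big) = \Phi(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}), \]
where the last step uses the symmetry of the standard normal distribution, $1 - \Phi(-a) = \Phi(a)$.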
Marginal Effect of Probit Model
1. Probit model has the benefit of imposing the 0-1 boundary for the probability. The cost is that β alone does not measure the (magnitude of the) marginal effect.
2. We can apply the chain rule to show that the marginal effect of the j-th covariate on the probability is
\[ \frac{\partial p_i}{\partial x_{ji}} = \phi(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})\, \beta_j \tag{19} \]
where φ is the derivative of Φ, called the probability density function (pdf). In short, the marginal effect of the probit model is non-constant.
3. φ is non-negative, so the sign of the marginal effect is determined by β_j
4. Because x varies across observations, in practice we can compute the average marginal effect (AME):
\[ \text{AME} = \beta_j\, \frac{\sum_{i=1}^{n} \phi(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})}{n} \tag{20} \]
or the marginal effect at the average (MEAA):
\[ \text{MEAA} = \beta_j\, \phi(\beta_0 + \beta_1 \bar{x}_1 + \ldots + \beta_k \bar{x}_k) \tag{21} \]
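A minimal, self-contained sketch of (20) and (21) in Python; the design matrix X and the coefficient vector beta_hat are made-up stand-ins for a fitted probit model, and norm.pdf plays the role of φ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # intercept + one covariate
beta_hat = np.array([0.4, 0.9])     # hypothetical probit estimates
j = 1                               # covariate of interest

index = X @ beta_hat                # beta_0 + beta_1*x_1i + ... for each observation
AME  = beta_hat[j] * norm.pdf(index).mean()                  # equation (20): average phi over i
MEAA = beta_hat[j] * norm.pdf(X.mean(axis=0) @ beta_hat)     # equation (21): phi at the mean of x
print(AME, MEAA)                    # similar in size but generally not identical
```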
Logit Model (Logistic Regression)
1. Probit model has the drawback that Φ and φ are hard to compute
2. Alternatively, one may use a Logit model that specifies the success probability as
\[ p_i = \Lambda(x_i\beta) \equiv \frac{e^{x_i\beta}}{1 + e^{x_i\beta}} \tag{22} \]
where Λ denotes the cdf of a logistic distribution. We can directly verify that 0 ≤ Λ ≤ 1
3. Like the probit model, the marginal effect is the product of β and a factor (see the sketch after this list)
\[ \frac{\partial p_i}{\partial x_i} = \Lambda'(x_i\beta)\, \beta \tag{23} \]
where Λ′ denotes the derivative of Λ. The sign of the marginal effect only depends on the sign of β.
4. The logistic distribution looks similar to the normal distribution, so in general probit and logit models produce similar marginal effects
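A step worth spelling out: the logistic pdf has a simple closed form, so the factor in (23) can be written in terms of p_i itself (writing the effect per covariate j, as in (19)):
\[ \Lambda'(z) = \frac{d}{dz}\, \frac{e^{z}}{1 + e^{z}} = \frac{e^{z}}{(1 + e^{z})^{2}} = \Lambda(z)\big(1 - \Lambda(z)\big), \qquad\text{so}\qquad \frac{\partial p_i}{\partial x_{ji}} = p_i (1 - p_i)\, \beta_j \]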
Odds, Log Odds
1. In industry, the logit model is more popular than the probit model because of its simple formula for the odds and the interpretation of coefficients in terms of log odds
2. By definition
\[ \text{odds} \equiv \frac{p_i}{1 - p_i} = e^{x_i\beta} \tag{24} \]
\[ \log \text{odds} = x_i\beta \tag{25} \]
So people interpret β as the effect on the log odds when x changes by one unit (a worked example follows this list).
3. Note that even though p_i is bounded between 0 and 1, the log odds is unbounded
4. Equation (25) is an example of a generalized linear model (GLM), in which the right-hand side is an unrestricted linear function but the left-hand side is a transformation of the original data
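A small worked example of the log-odds interpretation; the coefficient value 0.1 is made up for illustration:
\[ \beta_j = 0.1 \;\Longrightarrow\; \text{a one-unit increase in } x_j \text{ raises the log odds by } 0.1, \text{ i.e. multiplies the odds } \frac{p_i}{1-p_i} \text{ by } e^{0.1} \approx 1.105, \]
roughly a 10.5% increase in the odds, holding the other covariates fixed.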
Odds Ratio