411 Note LDV (chapter 17)
Big Picture
The Limited Dependent Variable (LDV) model can be used when the dependent variable
is special. Examples are
1. y is binary (eg: voting for Trump or Biden)—probit or logistic (logit) regression
2. y is categorical with more than two unordered outcomes (eg: drinking Coke,
Pepsi, or water)—multinomial logistic regression
3. y is categorical with ordered outcomes (eg: evaluation is above, equal or below
average)—ordered logistic regression
4. y represents corner solution (eg: consumption of cigarettes)—Tobit model
5. y represents counts (eg: the number of car accidents)—Poisson model
6. y represents duration (eg: survival time of a patient)—Cox model
In general, an LDV model is estimated by the maximum likelihood (ML) method
Binomial Distribution
1. Suppose we flip a coin once. The probability of seeing a head is p, and the probability of seeing a tail is 1 − p
2. Suppose we flip the same coin twice. The probability of seeing one head and one tail is $p(1-p) + (1-p)p = C_2^1\, p(1-p)$; the probability of seeing two heads is $C_2^2\, p^2(1-p)^0$; the probability of seeing two tails is $C_2^0\, p^0(1-p)^2$
3. In general, if we flip the coin n times, the probability of seeing r heads is
\[ P(r \text{ heads}) = C_n^r\, p^r (1-p)^{n-r} \tag{1} \]
where
\[ C_n^r \equiv \binom{n}{r} = \frac{n!}{r!\,(n-r)!} \tag{2} \]
In particular, for a single flip write y = 1 for a head and y = 0 for a tail; the probability of observing y is
\[ f = p^y (1-p)^{1-y}, \qquad (y = 1, 0) \tag{4} \]
4. For a given iid sample $(y_1, y_2, \ldots, y_n)$, the joint probability of observing this particular sequence is
\[ L = \prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i} = p^{\sum y_i} (1-p)^{\,n - \sum y_i} \tag{5} \]
5. Compared to (1), equation (5) drops $C_n^r$ since the sample is given (only one combination). Note that $\sum y_i = r$ if there are r successes.
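As a quick numerical check of (1) and (5), here is a minimal Python sketch; the sample y and the values of n, r, p are made up for illustration:

```python
import numpy as np
from math import comb
from scipy.stats import binom

# probability of r heads in n flips: equation (1), checked against scipy
n, r, p = 5, 3, 0.6
manual = comb(n, r) * p**r * (1 - p)**(n - r)
print(manual, binom.pmf(r, n, p))          # both 0.3456

# joint probability (5) of one particular iid sequence with r successes
y = np.array([1, 0, 1, 1, 0])              # hypothetical sample with r = 3
joint = p**y.sum() * (1 - p)**(len(y) - y.sum())
print(joint, manual / comb(n, r))          # identical: (5) drops C_n^r
```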
Maximum Likelihood (ML) Method
2. When p is unknown (eg, we are not sure whether the coin is fair) the ML
method estimates p by maximizing the likelihood function (i.e., maximizing the
probability of observing the given sample)
\[ \hat{p} = \arg\max_{p} L = \arg\max_{p}\; p^{\sum y_i} (1-p)^{\,n - \sum y_i} \tag{7} \]
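In this simple case (7) has a closed-form solution (unlike the probit case below): taking logs and setting the derivative to zero gives the sample proportion of successes,
\[ \log L = \Big(\sum_i y_i\Big)\log p + \Big(n - \sum_i y_i\Big)\log(1-p), \qquad \frac{d\log L}{dp} = \frac{\sum_i y_i}{p} - \frac{n - \sum_i y_i}{1-p} = 0 \;\;\Longrightarrow\;\; \hat{p} = \frac{\sum_i y_i}{n} = \bar{y} \]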
Linear Probability Model (LPM)
1. So far we have assumed p is constant because we flip the same coin again and again. In reality, we ask different persons whether they vote for Trump or Biden, so p is no longer constant
2. We want to use covariates x, such as gender, age, and education, to explain p. LPM is based on a linear function
\[ p_i = P(y_i = 1 \mid x_i) = E(y_i \mid x_i) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} \tag{10} \]
where we use the fact that for a Bernoulli random variable, P(y = 1) = E(y)
3. (Critical thinking) Can x include choice-specific characteristics, such as the age of a presidential candidate or the price of drinks?
4. Basically equation (10) is the population regression function (conditional mean).
Thus LPM amounts to regressing y onto x via OLS
5. Heteroskedasticity-robust standard errors should be used for LPM, since the variance of a Bernoulli variable,
\[ \text{Var}(y_i \mid x_i) = p_i (1 - p_i), \tag{11} \]
changes with x_i, so the error is heteroskedastic by construction
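A minimal sketch of LPM estimation in Python with robust standard errors; the data below are simulated and the covariate choices are illustrative assumptions, not part of the original notes:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
# hypothetical covariates: gender dummy, age, years of education
gender = rng.integers(0, 2, n)
age = rng.uniform(20, 70, n)
educ = rng.uniform(8, 20, n)
X = sm.add_constant(np.column_stack([gender, age, educ]))

# simulate a binary outcome (made-up coefficients, for illustration only)
y = (0.2 * gender + 0.01 * age - 0.02 * educ + rng.normal(0, 0.3, n) > 0.2).astype(int)

# LPM: regress the binary y on x by OLS; use heteroskedasticity-robust (HC1) standard errors
lpm = sm.OLS(y, X).fit(cov_type="HC1")
print(lpm.params)   # constant marginal effects on P(y = 1)
print(lpm.bse)      # robust standard errors
# caveat: fitted values X @ lpm.params may fall outside [0, 1]
```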
Probit Model
1. LPM has low computational cost and a constant marginal effect, but it fails to account for two facts: (i) y is special (binary), and (ii) the probability should be bounded, 0 ≤ p_i ≤ 1. Those drawbacks of LPM motivate nonlinear models such as Probit and Logit models
2. Probit model uses the cumulative distribution function (cdf) of the normal distribution, Φ, to specify the probability, and 0 ≤ Φ ≤ 1 by construction
\[ p_i = \Phi(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}) = \int_{-\infty}^{\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz \tag{12} \]
3. Now the unknown parameters are the βs, and ML estimates them by maximizing log(L) with numerical methods (there is no closed-form or analytical solution)
\[ \hat{\beta} = \arg\max_{\beta} \log L, \qquad \log L = \log\Big(\prod_{i=1}^{n} f_i\Big) = \sum_{i=1}^{n} \log f_i = \sum_{i=1}^{n} \log\big(p_i^{y_i}(1-p_i)^{1-y_i}\big) = \sum_{i=1}^{n} \big[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\,\big] \tag{13} \]
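A minimal sketch of the numerical maximization in (13) for a probit model. The simulated data, starting values, and the BFGS optimizer are illustrative assumptions; any numerical optimizer would do:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])      # intercept + one covariate
beta_true = np.array([0.5, 1.0])                            # made-up "true" coefficients
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)    # simulated binary outcome

def neg_loglik(beta):
    p = norm.cdf(X @ beta)                  # probit probability, equation (12)
    p = np.clip(p, 1e-10, 1 - 1e-10)        # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # minus equation (13)

res = minimize(neg_loglik, x0=np.zeros(X.shape[1]), method="BFGS")
print(res.x)    # numerical ML estimates, close to beta_true in large samples
```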
Derivation of Probit Model using Latent Variable
1. It seems that Φ comes from nowhere. Well, we can use a latent variable to justify it (the derivation is sketched after this list)
2. Suppose a voter gives each presidential candidate a score, and the score depends on the voter's gender, age, education, etc. The score is unobserved, so it is called a latent variable. A person votes for Trump, y = 1, only if his score is above a threshold, say, 0
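A sketch of that derivation; the notation for the latent score ($y_i^*$) and the standard normal error ($e_i$) is an assumption chosen to match the probit index in (12):
\[ y_i^* = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + e_i, \qquad e_i \sim N(0, 1), \qquad y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \]
\[ P(y_i = 1 \mid x_i) = P\big(e_i > -(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})\big) = \Phi(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}), \]
where the last step uses the symmetry of the standard normal distribution, $1 - \Phi(-a) = \Phi(a)$.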
Marginal Effect of Probit Model
1. Probit model has the benefit of imposing the 0-1 boundary for the probability. The cost is that β alone does not measure the (magnitude of the) marginal effect.
2. We can apply the chain rule to show that the marginal effect of the j-th covariate on the probability is
\[ \frac{\partial p_i}{\partial x_{ji}} = \phi(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})\, \beta_j \tag{19} \]
where φ is the derivative of Φ, called the probability density function (pdf). In short, the marginal effect of the probit model is non-constant.
3. φ is non-negative, so the sign of the marginal effect is determined by β_j
4. Because x varies across observations, in practice we can compute the average marginal effect (AME):
\[ \text{AME} = \beta_j\, \frac{\sum_{i=1}^{n} \phi(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})}{n} \tag{20} \]
or the marginal effect at the average (MEAA):
\[ \text{MEAA} = \beta_j\, \phi(\beta_0 + \beta_1 \bar{x}_1 + \ldots + \beta_k \bar{x}_k) \tag{21} \]
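A minimal, self-contained sketch of (20) and (21) in Python; the design matrix X and the coefficient vector beta_hat are made-up stand-ins for a fitted probit model, and norm.pdf plays the role of φ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # intercept + one covariate
beta_hat = np.array([0.4, 0.9])     # hypothetical probit estimates
j = 1                               # covariate of interest

index = X @ beta_hat                # beta_0 + beta_1*x_1i + ... for each observation
AME  = beta_hat[j] * norm.pdf(index).mean()                  # equation (20): average phi over i
MEAA = beta_hat[j] * norm.pdf(X.mean(axis=0) @ beta_hat)     # equation (21): phi at the mean of x
print(AME, MEAA)                    # similar in size but generally not identical
```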
Logit Model (Logistic Regression)
1. Probit model has the drawback that Φ and φ are hard to compute
2. Alternatively, one may use a Logit model that specifies the success probability as
\[ p_i = \Lambda(x_i\beta) \equiv \frac{e^{x_i\beta}}{1 + e^{x_i\beta}} \tag{22} \]
where Λ denotes the cdf of a logistic distribution. We can directly verify that 0 ≤ Λ ≤ 1
3. Like the probit model, the marginal effect is the product of β and a factor (see the sketch after this list)
\[ \frac{\partial p_i}{\partial x_i} = \Lambda'(x_i\beta)\, \beta \tag{23} \]
where Λ′ denotes the derivative of Λ. The sign of the marginal effect only depends on the sign of β.
4. The logistic distribution looks similar to the normal distribution, so in general probit and logit models produce similar marginal effects
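A step worth spelling out: the logistic pdf has a simple closed form, so the factor in (23) can be written in terms of p_i itself (writing the effect per covariate j, as in (19)):
\[ \Lambda'(z) = \frac{d}{dz}\, \frac{e^{z}}{1 + e^{z}} = \frac{e^{z}}{(1 + e^{z})^{2}} = \Lambda(z)\big(1 - \Lambda(z)\big), \qquad\text{so}\qquad \frac{\partial p_i}{\partial x_{ji}} = p_i (1 - p_i)\, \beta_j \]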
Odds, Log Odds
1. In industry, the logit model is more popular than the probit model because of its simple formula for the odds and the interpretation of coefficients in terms of log odds
2. By definition
\[ \text{odds} \equiv \frac{p_i}{1 - p_i} = e^{x_i\beta} \tag{24} \]
\[ \log \text{odds} = x_i\beta \tag{25} \]
So people interpret β as the effect on the log odds when x changes by one unit (a worked example follows this list).
3. Note that even though p_i is bounded between 0 and 1, the log odds is unbounded
4. Equation (25) is an example of a generalized linear model (GLM), in which the right-hand side is an unrestricted linear function but the left-hand side is a transformation of the original data
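A small worked example of the log-odds interpretation; the coefficient value 0.1 is made up for illustration:
\[ \beta_j = 0.1 \;\Longrightarrow\; \text{a one-unit increase in } x_j \text{ raises the log odds by } 0.1, \text{ i.e. multiplies the odds } \frac{p_i}{1-p_i} \text{ by } e^{0.1} \approx 1.105, \]
roughly a 10.5% increase in the odds, holding the other covariates fixed.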
Odds Ratio