CH - 3 - Simple and Multiple Linear Regressions in Stata
Application to Cross Sectional Econometrics in stata
Mengistu Yismaw (MSc.)
Department of Economics
Debre Markos University (Burie Campus)
Email: menyis.2012@gmail.com
Chapter outline
Simple linear regression
Regression with only qualitative (dummy) regressors: ANOVA
o Specification
o Estimation
o Interpretation
Multiple linear regression
Regression with qualitative and quantitative regressors: ANCOVA
o Specification
o Estimation
o Interpretation
o Test of CLRM assumptions
o Violations of some of the CLRM assumptions
o Interaction effect
Qualitative response regression models: dummy as dependent variable (binary choice model)
Linear Probability Model (LPM)
o Specification
o Estimation
o Interpretation
CHAPTER THREE: CROSS SECTIONAL ECONOMETRICS
Methodology of econometrics analysis
What are the steps or procedures of econometricians in their analysis of an economic
problem?
Broadly speaking, classical econometric methodology proceeds along the following lines
(steps):
1. Develop a statement of theory or hypothesis
2. Specification of the mathematical model of the theory
3. Specification of the statistical, or econometric, model
4. Obtaining the data
5. Estimation of the parameters of the econometric model
6. Hypothesis testing
7. Forecasting or prediction
8. Using the model for control or policy purposes
2.1. Simple Linear Regression
Simple linear regression = a single regressor (independent variable)
Suppose a regression with only a qualitative (dummy) regressor = ANOVA regression:
Step 1: Develop a statement of theory or hypothesis
Suppose we want to know if there is a productivity difference between male and female headed households
> i.e. suppose Gender is our independent variable
Gender is a dummy (binary or nominal scale) variable
Nominal scale variable: a type of variable which gives qualitative information only.
i. $Gender = \begin{cases} 1 & \text{male} \\ 0 & \text{female} \end{cases}$
Note: the coding '0' or '1' is used for identification purposes only.
> The values of a nominal scale variable can't be divided, subtracted or ordered for comparison
> This type of variable is sometimes called a dummy variable
Step 2: Specification of the mathematical model of the theory
$Yield = \beta_0 + \beta_1 Gender$
Step 3: Specification of the statistical, or econometric, model
Let the simple linear regression model be:
$Yield = \beta_0 + \beta_1 Gender + \mu_i$
Step 4: Obtaining the data
The next step is going to the field and collecting the data.
> Then enter the data into the appropriate software and format.
Remember the ways of entering data into Stata:
i. Entering the data directly into Stata
ii. Entering the data into Excel and importing it into Stata
iii. Entering the data into SPSS and saving it in the appropriate Stata format, or using the Stat/Transfer software
> The next step is estimating the model
Step 5: Estimation of the parameters of the econometric model
Ordinary Least Square (OLS) estimation techniques using stata
Menu: Statistics → Linear models and related → Linear regression → Select the dependent and independent variables → Click Submit or OK
Syntax: reg depvar indepvar
Example: reg Yield Gender_n1
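Some statistical manipulations
Degrees of freedom = (k - 1, n - k), where n = sample size and k = number of parameters (intercept plus independent variables) → (2 - 1, 30 - 2) = (1, 28)
Adjusted $R^2$ = 0.4610 (label garbled in the source; the value agrees with the adjusted $R^2$ implied by t = 5.08 and n = 30)
Estimate the residuals
Estimate the fitted value: $\widehat{Yield} = \hat{\beta}_0 + \hat{\beta}_1 Gender$
$se(\hat{\beta}_1) = 1.088$
$t = \hat{\beta}_1 / se(\hat{\beta}_1) = 5.08$
CI for $\beta_1 = \hat{\beta}_1 \pm t_{\alpha/2}\, se(\hat{\beta}_1)$
where $t_{\alpha/2}$ is the t value at $(30 - 2,\ 0.05/2) \approx 2.048$
> CI for $\beta_1 = 5.5278 \pm 2.048(1.088) = (3.2986,\ 7.7568)$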
To estimate the RSS (Residual), follow the following steps
To estimate the ESS (Model), follow the following steps
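The degrees of freedom, t statistic and confidence interval quoted above can be re-derived by hand. The following Python sketch is only a calculator check, not part of the Stata workflow. It plugs in the rounded figures from the slide, takes the critical value 2.048 as given, and assumes the usual one-regressor identities F = t^2 and R^2 = F / (F + n - k). Because the inputs are rounded, the interval agrees with the reported (3.2986, 7.7568) only to about the third decimal place.

```python
# Figures read off the slide's regression output (rounded as printed there).
n = 30            # sample size
k = 2             # estimated parameters: intercept and Gender slope
beta1 = 5.5278    # estimated coefficient on Gender
se_beta1 = 1.088  # its standard error
t_crit = 2.048    # two-sided 5% critical value of t with n - k = 28 d.f. (taken from the slide)

# Degrees of freedom for the model and for the residual.
df_model, df_resid = k - 1, n - k

# t statistic and 95% confidence interval for the slope.
t_stat = beta1 / se_beta1
ci_low = beta1 - t_crit * se_beta1
ci_high = beta1 + t_crit * se_beta1

# With a single regressor F = t^2, so R^2 follows from t and the residual d.f.
f_stat = t_stat ** 2
r2 = f_stat / (f_stat + df_resid)
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

print(f"d.f. = ({df_model}, {df_resid})")
print(f"t = {t_stat:.2f}")
print(f"95% CI = ({ci_low:.4f}, {ci_high:.4f})")
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```

The small gap between this interval and the one on the slide comes from Stata using the unrounded standard error.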
Interpretation of coefficients
What does the estimate 5.527 show?
It is the coefficient for Male, showing that the average productivity of male headed households is higher than that of female headed households by 5.527 (significant at 1%): remember the t-test result in ch-2
What about the estimate 3.5?
It is the average productivity of the omitted category (female headed households)
Why did we omit one category (Female)?
To avoid falling into the dummy variable trap
Average productivity of male headed hhs = 3.5 + 5.527 * 1 = 9.027: remember the t-test result in ch-2
Or use prediction
Average productivity of female headed hhs = 3.5 + 5.527 * 0 = 3.5
Or use prediction
The productivity difference b/n male and female headed hhs:
9.027 - 3.5 = 5.527
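The interpretation above rests on a general property of a regression on a single 0/1 dummy: the intercept equals the mean of the omitted group and the slope equals the difference in group means. A minimal Python sketch of that property follows. The eight yield values are hypothetical and are NOT the chapter's data; they are chosen only so the arithmetic is easy to follow.

```python
from statistics import mean

# Hypothetical data (an assumption for illustration, not the chapter's dataset).
# gender = 1 for male headed households, 0 for female headed households.
gender = [1, 1, 1, 1, 0, 0, 0, 0]
yields = [9.0, 8.5, 9.5, 9.0, 3.0, 4.0, 3.5, 3.5]

# OLS slope and intercept for Yield = b0 + b1 * Gender.
x_bar, y_bar = mean(gender), mean(yields)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gender, yields))
sxx = sum((x - x_bar) ** 2 for x in gender)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Group means computed directly.
mean_male = mean([y for x, y in zip(gender, yields) if x == 1])
mean_female = mean([y for x, y in zip(gender, yields) if x == 0])

print("intercept b0 =", b0, "| female mean =", mean_female)
print("slope b1 =", b1, "| male mean - female mean =", mean_male - mean_female)
print("predicted male mean b0 + b1 =", b0 + b1)
```

In the chapter's own output the same logic gives 3.5 for female headed households and 3.5 + 5.527 = 9.027 for male headed households.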
Exercise
• Is there a significant difference in average productivity between households with and without access to credit?
• What is the average productivity of households with access to credit?
• How much of the productivity variation is explained by access to credit?
2.2. Multiple Linear Regression
Multiple linear regression = many regressors (independent variables)
> Suppose dummy and continuous variables are both used as independent variables = ANCOVA regression.
Suppose you are going to analyze various determinants of maize productivity
Based on your literature review, you think that maize productivity can be affected by:
✓ Age of the household head
✓ Land fragmentation
✓ Fertilizer applied per hectare
✓ Household land size
✓ Gender of the household head
Then the multiple linear regression model will be:
$Yield = \beta_0 + \beta_1 age + \beta_2 fragment + \beta_3 fertlizer + \beta_4 land + \beta_5 Gender + \mu_i$
Note: estimation techniques are the same as for the simple linear regression model.
Syntax: reg depvar indepvars
Example: reg Yield age fragment fertlizer land Gender_n1
> Based on the p-values of the five explanatory variables, only two variables (fragment and Gender) are significant.
Note: only significant variables will be analyzed.
> Then let us analyze the coefficients of the significant variables
Estimated equation of the regression model:
Yield = 5.5617 + 0.055 age … 0.676 fragment - 0.00… fertlizer … land … Gender (the remaining signs and coefficients are illegible in the source)
> However, before analyzing the result, it is important to judge the efficiency of the model using some diagnostic tests.
> In particular, inferences based on OLS results are valid only if the classical linear regression model (CLRM) assumptions hold.
> Now let us test some of the CLRM assumptions (diagnostic tests):
i. Multicollinearity Test
> The term multicollinearity means the existence of a perfect or exact linear relationship among all or some of the explanatory variables of the regression model.
> The existence of multicollinearity can be detected using various techniques, such as auxiliary regressions, pair-wise correlations among regressors, and the variance inflation factor (VIF) and/or tolerance (1/VIF).
> VIF is the most commonly used; it measures how much the variance of an estimator is inflated by the presence of multicollinearity.
Note: Multicollinearity is a matter of degree and not of kind.
> The question is not whether it is present or absent, but how severe it is (high or perfect)!
Informal test: high R² but few significant t-ratios
Formal tests:
Take auxiliary regressions
Test pair-wise correlations among regressors
Decision: best if less than 0.50
Test for variance inflation factor and tolerance
Decision:
As a rule of thumb, if VIF > 10 or if 1/VIF < 0.10 (10%, close to zero), there is multicollinearity.
> Since our result shows that the VIFs of all variables are less than 10 and the 1/VIF of all variables are greater than 10%, multicollinearity is not a problem in our model.
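The link between the auxiliary regression and the VIF rule of thumb can be made explicit. For regressor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing that regressor on all the other regressors, and the tolerance is 1/VIF_j. The Python sketch below only illustrates the formula; the R-squared values fed into it are hypothetical, not results from the chapter's data.

```python
def vif(r2_aux):
    """Variance inflation factor from an auxiliary-regression R-squared.

    r2_aux is the R-squared obtained by regressing one explanatory
    variable on all the other explanatory variables.
    """
    if not 0.0 <= r2_aux < 1.0:
        raise ValueError("auxiliary R-squared must be in [0, 1)")
    return 1.0 / (1.0 - r2_aux)


# Hypothetical auxiliary R-squared values (assumptions for illustration).
for r2_aux in (0.0, 0.5, 0.8, 0.95):
    v = vif(r2_aux)
    tolerance = 1.0 / v
    flagged = v > 10  # rule of thumb from the text: VIF > 10 (tolerance < 0.10)
    print(f"R2_aux = {r2_aux:.2f}  VIF = {v:.2f}  1/VIF = {tolerance:.2f}  problem = {flagged}")
```

An auxiliary R-squared of exactly 1 (perfect collinearity) is rejected by the function because the VIF is then undefined.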
Note:
> Multicollinearity is not a problem for nonlinear relationships between variables
> Multicollinearity is essentially a sample (regression) phenomenon, not a population one.
Remedial measures if there is multicollinearity problem
Drop one or more of the perfectly collinear variables
Take sample over wide area (increase the sample size)
Take new data
Transformation of variables (take square, natural logarithm...)
Combining cross-sectional and time series data
Do nothing: "Multicollinearity is God's will". According to Blanchard, multicollinearity is essentially a data deficiency problem, not a problem with OLS or statistical technique in general.
ii. Test of homoscedasticity
> It is the test of the variance of the error (disturbance) term.
> If the error term doesn't have a constant variance, we say there is a heteroscedasticity problem.
> The nature of the variance of the error term can be judged by the Breusch-Pagan test.
Stata command: hettest
Then you get the following result
Decision: if the P-value is sufficiently small, i.e. below the chosen significance level (usually 10%), we reject the null hypothesis (H0) of homoscedasticity (constant variance) and accept the alternative hypothesis (H1).
Since our result shows that the P-value is less than 10%, we have to reject H0.
Then there is no constant variance (there is a heteroscedasticity problem) in our model.
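To see what a test of this kind does, the Python sketch below works through one common variant by hand: the studentized (Koenker) form of the Breusch-Pagan test, in which the squared OLS residuals are regressed on the regressor and LM = n * R^2 is compared with a chi-square distribution. This is an illustrative assumption: it is not necessarily the exact statistic that hettest reports by default, and the ten data points are hypothetical, not the chapter's data.

```python
import math
from statistics import mean


def ols_simple(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residuals)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b, [yi - (a + b * xi) for xi, yi in zip(x, y)]


def r_squared(y, residuals):
    """R-squared of a fitted regression with an intercept."""
    y_bar = mean(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sum(e ** 2 for e in residuals) / tss


# Hypothetical data whose spread grows with x (assumption for illustration).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.8, 6.5, 7.2, 11.0, 10.1, 15.9, 13.0, 20.5, 16.2]

# Step 1: main regression and squared residuals.
_, _, u = ols_simple(x, y)
u2 = [e ** 2 for e in u]

# Step 2: auxiliary regression of squared residuals on x.
_, _, aux_resid = ols_simple(x, u2)

# Step 3: LM = n * R^2 of the auxiliary regression, chi-square with 1 d.f.
lm = max(0.0, len(x) * r_squared(u2, aux_resid))
p_value = math.erfc(math.sqrt(lm / 2))  # upper-tail probability of chi-square(1)

# Step 4: decision at the 10% level used in the text.
reject_h0 = p_value < 0.10
print(f"LM = {lm:.3f}, p-value = {p_value:.4f}, reject homoscedasticity: {reject_h0}")
```

The decision rule in the last step is the same one stated above: a p-value below the chosen level leads to rejecting H0 of constant variance.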
Remedial measures for Heteroscedasticity problem
Check for outliers (for the dependent variable)
Use robust regression (heteroscedasticity-robust standard errors)
Example: reg Yield age fragment fertlizer land Gender_n1, robust
Note: hettest is not appropriate after robust regression
iii. Model Specification test
Model specification tests basically deal with:
> The exclusion of relevant explanatory variables
> The inclusion of irrelevant variables
> Functional form error
Take the Ramsey RESET test
Syntax: ovtest
Decision: if the P-value is sufficiently small, that is, below the chosen significance level (usually 10%), we reject the null hypothesis (H0) that the model has no omitted variables and accept the alternative hypothesis (H1), which implies a model specification problem.
> If the P-value is above the chosen significance level, we fail to reject H0, which implies that there is no model specification problem.
iv. Normality of the disturbance term
> There are various ways of testing the normality of ui. For example:
✓ Histogram of residuals with normal curve
✓ Normal probability plot and others
Test of normality of the disturbance term using Stata
> First generate the disturbance term (ui)
Syntax: predict ui, residual
> Second test the normality of the disturbance term (ui)
a. Draw a histogram of ui with a normal curve
Syntax: histogram ui, normal
> Then you get the resulting histogram
b. Normal probability and quantile plots
Syntax: pnorm ui or qnorm ui
> Then you get the following results respectively
[Figures: normal probability plot (pnorm) and quantile-normal plot (qnorm) of ui]
> Both graphs show that the disturbance term (ui) is approximately normal.
Violations of Some of the CLRM Assumptions
The presence of multicollinearity
> We said that multicollinearity means the existence of a perfect or exact linear relationship among all or some of the explanatory variables of the regression model.
> For example, let us assume that the variable fertilizer is twice that of age.
> Then let us create a hypothetical variable called age3 which is a function of age
Syntax: gen age3=50+age
Note: we deliberately make the 10th observation of age3 95 instead of 75; otherwise Stata will drop one of the perfectly correlated variables in the regression.
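The effect of that one deliberate change can be checked numerically. In the Python sketch below the ages are hypothetical (only the 10th value, 25, is implied by the note, since 50 + 25 = 75). It shows that age3 = 50 + age is perfectly correlated with age, and that overwriting the 10th observation with 95 breaks the exact relationship. The VIF is computed as 1 / (1 - r^2), which holds when these are the only two regressors; that restriction is an assumption of the sketch.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ages (assumption); the 10th household head is 25 years old.
age = [30, 42, 55, 38, 61, 47, 33, 50, 44, 25, 58, 36]

# Exact linear function of age, as in: gen age3=50+age
age3 = [50 + a for a in age]
r_exact = pearson_r(age, age3)

# Deliberately change the 10th observation from 75 to 95.
age3_modified = list(age3)
age3_modified[9] = 95
r_modified = pearson_r(age, age3_modified)

# Two-regressor case: VIF = 1 / (1 - r^2). Undefined when r is exactly 1.
vif_modified = 1 / (1 - r_modified ** 2)

print("correlation with exact age3:   ", round(r_exact, 6))
print("correlation with modified age3:", round(r_modified, 6))
print("VIF after the modification:    ", round(vif_modified, 2))
```

With the exact version the denominator 1 - r^2 is zero, which is why the software has to drop one of the two variables.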
Then after regression with the new data, we get the following VIF result
Stata command: vif
Categorical variables as a regressor
Suppose: Educational level (EducLevel)
Syntax: reg depvar i.categoricalvar
Example: reg Yield fertlizer Gender_n1 i.EducLevel_n1
Note:
When you put i. in front of a categorical variable, the software automatically drops one category (usually the lowest category), which will be your benchmark.
Unless you put i. in front of the categorical variable, the software treats it as a continuous variable and your estimates will be wrong.
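What the i. prefix does can be pictured as expanding one categorical column into a set of 0/1 indicator columns, one per category, with the benchmark category left out. The Python sketch below mimics that expansion. The function name indicator_columns and the education codes (0 to 3) are hypothetical, introduced only for illustration; the lowest code is dropped by default, as described in the note.

```python
def indicator_columns(values, base=None):
    """Expand a categorical variable into 0/1 indicators, omitting the base level.

    By default the lowest level is used as the base (benchmark) category.
    """
    levels = sorted(set(values))
    if base is None:
        base = levels[0]
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != base
    }


# Hypothetical education codes for six households (assumption for illustration).
educ_level = [0, 2, 1, 3, 0, 2]

dummies = indicator_columns(educ_level)
for level, column in dummies.items():
    print(f"EducLevel == {level}: {column}")
```

Each coefficient on these indicators is then read as a difference from the omitted benchmark category, whereas entering the raw codes as one column would force a single slope across all levels.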
Answer the following questions based on the regression result given below
A. What does 4.612 show?
B. What is the average productivity difference b/n male and female headed hhs?
C. What is the difference in average productivity b/n hhs with illiterate and secondary educ. completed heads?
D. What is the difference in average productivity b/n hhs with secondary and post-secondary educ. completed heads?
Discussion question
> What is the average productivity of households managed by male and secondary educ. completed heads?
To know the average productivity of households managed by male and secondary educ. completed heads:
1st: we have to generate an interaction variable of Gender and educational level
2nd: run the regression using the newly generated variable
reg Yield fertlizer i.IntGenEduc
The average productivity of households managed by male and secondary educ. completed heads is 3.82
The linear probability model (LPM)
Suppose you intend to investigate the effect of gender and land size on access to credit
Model: $Credit\_Dummy\_n1 = \beta_0 + \beta_1 land + \beta_2 Gender + \mu_i$
Since the dependent variable takes only the values 0 or 1, the fitted value of the model can be interpreted as the probability of observing a 1 (access to credit) given the explanatory variables.
Though the LPM is not entirely correct, we can use OLS to estimate it.
Interpretation of coefficients
A. Interpret the intercept
B. Interpret the coefficient of land
C. Interpret the coefficient of Gender
Answer
A. The probability of access to credit for female managed HHs with no land is 0.168 or 17%
Note: if the intercept term is negative, it will be interpreted as zero (because probability can't be negative)
B. The coefficient of land shows that for a one hectare increase in a HH's land size, on average, the probability of access to credit decreases by 0.00069 (0.07 percentage points), but it is not statistically significant.
However, we can estimate the actual probability of access to credit for a particular HH land size.
Example: suppose a male managed HH with a land size of 5 hectares
$E(Y \mid land = 5,\ Gender = 1) = 0.168 - 0.00067 \times 5 + 0.499 \times 1 = 0.664$
Or use prediction
C. The coefficient of Gender shows that the probability of access to credit for male managed HHs is greater than for female managed HHs, on average, by 0.499 (about 50 percentage points)
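The worked example in answer B can be reproduced with a few lines of Python. The coefficients are the ones used in that example (0.168, -0.00067 and 0.499). The function name lpm_probability is hypothetical. Flooring negative fitted values at zero follows the note under answer A; capping values above one is an extra assumption added here for symmetry.

```python
def lpm_probability(land, gender, b0=0.168, b_land=-0.00067, b_gender=0.499):
    """Fitted probability of access to credit from the linear probability model.

    land   : household land size in hectares
    gender : 1 for male managed households, 0 for female managed households
    """
    p = b0 + b_land * land + b_gender * gender
    # Keep the fitted value inside [0, 1]; the upper cap is an added assumption.
    return min(1.0, max(0.0, p))


# Male managed household with 5 hectares (the example in the text).
print(round(lpm_probability(land=5, gender=1), 3))

# Female managed household with no land: the intercept.
print(round(lpm_probability(land=0, gender=0), 3))

# Difference between male and female managed households at the same land size.
print(round(lpm_probability(5, 1) - lpm_probability(5, 0), 3))
```

Because the model is linear, the male-female gap is the same 0.499 at every land size as long as neither fitted value hits the 0 or 1 bound.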
Importing STATA result to Microsoft word
1. Using asdoc
Syntax: add asdoc before Stata commands, except for graph commands
Examples:
asdoc sum
asdoc reg Yield age fragment fertlizer land Gender_n1 EducLevel_n1 Credit_Dummy_n1
> The software automatically saves your result in the Microsoft Word file "Myfile.doc" in your current working directory.
> Click on "Myfile.doc" in the Stata results window to open the document
2. Using outreg2
It is used for regression results
Syntax:
Note: run the regression command and the outreg2 command together (outreg2 exports the most recent estimation results)
Example:
reg Yield age fragment fertlizer land Gender_n1 EducLevel_n1 Credit_Dummy_n1
outreg2 using Table1.doc, replace
The software automatically saves your result in the Microsoft Word file "Table1.doc" in your current working directory.
Click on "Table1.doc" in the Stata results window to open the document
Note: outreg2 is usually used for publication purposes. For your senior essay please use the asdoc option.