
Econometrics

Ken Langat

Lecture 1 Notes

Contents
1 Introduction
  1.1 Definition of Econometrics
  1.2 Types of Econometrics
      1.2.1 Theoretical Econometrics
      1.2.2 Applied Econometrics
  1.3 Methodology of Econometrics
  1.4 Structure of Economic Data
      1.4.1 Cross sectional Data
      1.4.2 Time series data
      1.4.3 Panel data
2 Regression Analysis with Cross Sectional Data
  2.1 Simple Regression Model
      2.1.1 Deriving the OLS
      2.1.2 Goodness of Fit
  2.2 Gauss Markov Theorem
  2.3 Linear Estimators
      2.3.1 Unbiasedness of estimates
      2.3.2 Variances
      2.3.3 Covariances
  2.4 Inference for β0 and β1
      2.4.1 Inference for β1
      2.4.2 Inference for β0
  2.5 Practical in R
      2.5.1 Example 1: US Consumption Expenditure
      2.5.2 Example 2: Sales data
  2.6 Multiple Linear Regression Analysis
      2.6.1 Matrix Formulation of MLR
      2.6.2 OLS Estimation
      2.6.3 Fitted Values and Residuals
      2.6.4 Inference in MLR
      2.6.5 Coefficient of Multiple Determination
      2.6.6 Hypothesis testing for individual regressors
      2.6.7 Global test
      2.6.8 Assumptions of Multiple Linear Regression
      2.6.9 Example in R

1 Introduction
1.1 Definition of Econometrics
This is the social science in which the tools of economic theory, mathematics and statistical inference are
applied to the analysis of economic phenomena.

1.2 Types of Econometrics


1.2.1 Theoretical Econometrics
This is concerned with the development of appropriate methods for measuring economic relationships specified by econometric models. It relies heavily on mathematical statistics.

1.2.2 Applied Econometrics


Here we use the tools of theoretical econometrics to study special fields of economics and business, such as demand and supply.

1.3 Methodology of Econometrics


• Statement of theory or hypothesis
• Specification of mathematical model of the theory
• Specification of statistical or econometric model
• Obtaining the data
• Estimating parameters of econometric model
• Hypothesis testing
• Forecasting or prediction
• Using model for control or policy purposes

1.4 Structure of Economic Data


1.4.1 Cross sectional Data
These consist of a sample of individuals, households, firms, states or countries taken at a given point in time. An example is a data set on wages and other individual characteristics.

Obs No   wage   educ   exper   gender   married
1        3.10   11     2       1        0
2        3.24   12     32      0        0
3        5.30   4      44      1

1.4.2 Time series data


A time series data set consists of observations on a variable or several variables made sequentially over time.
Examples include:
• Stock prices
• GDP
• Automobile sales figures
• CPI

Obs   Year   GDP
1     2006   14613.8
2     2007   14873.8
3     2008   14830.4

1.4.3 Panel data


Panel data consist of a time series for each cross-sectional member in a data set.
The key aspect of panel data is that we observe each micro unit for a number of time periods.

Firm Year Area Prod Labor


1 1990 7.87 2.50 160
1 1991 7.18 2.50 151
1 1992 7.20 2.50 138
2 1990 8.92 2.50 184
2 1991 7.31 2.50 161
2 1992 8.01 2.50 151

2 Regression Analysis with Cross Sectional Data


2.1 Simple Regression Model
The simple regression model can be used to study the relationship between two variables.
A random experiment is repeated n times under identical conditions.
For each trial i = 1, 2, ..., n the value of Xi is known and the response Yi is recorded.
The simple linear regression model is given by:

Yi = β0 + β1 Xi + ϵi (1)

2.1.1 Deriving the OLS


Given

Yi = β0 + β1 Xi + ϵi

the sum of squared errors is:

Q = Σ ϵi² = Σ (Yi − β0 − β1 Xi)²

Differentiating w.r.t β0 and β1 we have:

∂Q/∂β0 = −2 Σ (Yi − β0 − β1 Xi)

∂Q/∂β1 = −2 Σ Xi (Yi − β0 − β1 Xi)

Setting these derivatives to zero gives the normal equations:

Σ Yi = n β̂0 + β̂1 Σ Xi    (2)

Σ Xi Yi = β̂0 Σ Xi + β̂1 Σ Xi²    (3)

Solving the above simultaneously we have:

β̂0 = Ȳ − β̂1 X̄

β̂1 = [n Σ Xi Yi − (Σ Xi)(Σ Yi)] / [n Σ Xi² − (Σ Xi)²]
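As a quick numerical check, the closed-form formulas can be applied directly in base R and compared with lm(). The data values below are made up purely for illustration:

```r
# Illustrative data (made up for this check)
X <- c(1, 2, 3, 4, 5)
Y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
n <- length(X)

# Slope and intercept from the closed-form OLS formulas
b1 <- (n * sum(X * Y) - sum(X) * sum(Y)) / (n * sum(X^2) - sum(X)^2)
b0 <- mean(Y) - b1 * mean(X)

c(b0 = b0, b1 = b1)   # 0.14 and 1.96
coef(lm(Y ~ X))       # lm() gives the same estimates
```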

The fitted value of Yi is:

Ŷi = β̂0 + β̂1 Xi

The residual for observation i is the difference between the actual value and its fitted value:

ei = Yi − Ŷi

The total sum of squares, denoted by SST:

SST = Σ (Yi − Ȳ)²

The sum of squares due to regression:

SSR = Σ (Ŷi − Ȳ)²

The sum of squares due to errors:

SSE = Σ ei²

This implies that:

SST = SSR + SSE

To prove this, expand Σ (Yi − Ȳ)² after adding and subtracting Ŷi inside the square.
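The decomposition can also be verified numerically on any fitted simple regression. A sketch with illustrative data:

```r
# Illustrative data
X <- c(1, 2, 3, 4, 5)
Y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(Y ~ X)

SST <- sum((Y - mean(Y))^2)           # total sum of squares
SSR <- sum((fitted(fit) - mean(Y))^2) # regression sum of squares
SSE <- sum(resid(fit)^2)              # error sum of squares

all.equal(SST, SSR + SSE)             # TRUE
```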

2.1.2 Goodness of Fit


The R-squared of the regression, sometimes called the coefficient of determination, is:

R² = SSR / SST = 1 − SSE / SST

R² is the fraction of the sample variation in Y that is explained by X.

When interpreting R² we multiply by 100: R² is then the percentage of the sample variation in Y explained by X.

2.2 Gauss Markov Theorem
Under the assumptions of the simple linear regression model, the least squares estimators β̂0 and β̂1 are unbiased and have minimum variance among all linear unbiased estimators of β0 and β1. Thus β̂0 and β̂1 are said to be BLUE (Best Linear Unbiased Estimators).

2.3 Linear Estimators


The least squares intercept and slope are linear estimators in the sense that they are linear functions of Yi.
Consider:

β̂1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²

which can be written as:

β̂1 = Σ mi Yi

where mi = (Xi − X̄) / Σ (Xi − X̄)²

and Σ mi = 0 and Σ mi Xi = 1.
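The two properties of the weights mi are easy to confirm numerically; the X values here are arbitrary and purely illustrative:

```r
X <- c(1, 2, 3, 4, 5)                       # any set of X values
m <- (X - mean(X)) / sum((X - mean(X))^2)   # the weights m_i

sum(m)       # 0 (up to floating point)
sum(m * X)   # 1
```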

2.3.1 Unbiasedness of estimates


β̂0 = Ȳ − β̂1 X̄

E[β̂0] = E[Ȳ − β̂1 X̄]
      = β0 + β1 X̄ − β1 X̄    (4)
      = β0

E[β̂1] = Σ mi E[Yi]
      = Σ mi (β0 + β1 Xi)    (5)
      = β1

2.3.2 Variances
Var(β̂1) = Var(Σ mi Yi)
        = Σ mi² Var(Yi) + Σ Σ(i≠j) mi mj Cov(Yi, Yj)
        = σ² Σ (Xi − X̄)² / [Σ (Xi − X̄)²]²    (6)
        = σ² / Σ (Xi − X̄)²

since the Yi are independent, so the covariance terms vanish.

Var(β̂0) = Var(Ȳ) + X̄² Var(β̂1) − 2 X̄ Cov(Ȳ, β̂1)
        = σ²/n + X̄² σ² / Σ (Xi − X̄)²    (7)
        = σ² (1/n + X̄² / Σ (Xi − X̄)²)

using Cov(Ȳ, β̂1) = 0.

2.3.3 Covariances
The covariance between β̂0 and β̂1 is:

Cov(β̂0, β̂1) = Cov(Ȳ, β̂1) − X̄ Var(β̂1)
            = −X̄ σ² / Σ (Xi − X̄)²    (8)

2.4 Inference for β0 and β1


2.4.1 Inference for β1
We test the hypothesis concerning β1 :

H0 : β1 = 0 vs H1 : β1 ̸= 0

The sampling distribution of β̂1 refers to the different values of β̂1 that would be obtained with repeated sampling when the levels of the predictor X are held constant from sample to sample.

E[β̂1] = β1

and

Var(β̂1) = σ² / Σ (Xi − X̄)²

An estimate of σ² is:

σ̂² = SSE / (n − 2) = MSE

thus

S²(β̂1) = MSE / Σ (Xi − X̄)²

If the Yi are normally distributed, then β̂1 is also normally distributed, since β̂1 = Σ mi Yi and a linear combination of independent normal random variables is itself normal:

β̂1 ∼ N(β1, σ² / Σ (Xi − X̄)²)

The (1 − α)100% CI for β1 is:

β̂1 ± t(1−α/2, n−2) · √( MSE / Σ (Xi − X̄)² )

To test the hypothesis H0 : β1 = c the test statistic is:

t = (β̂1 − c) / √( MSE / Σ (Xi − X̄)² )
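These formulas reproduce what summary() and confint() report for an lm fit. A sketch with illustrative data:

```r
# Illustrative data
X <- c(1, 2, 3, 4, 5)
Y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
n <- length(Y)
fit <- lm(Y ~ X)

MSE   <- sum(resid(fit)^2) / (n - 2)       # estimate of sigma^2
se_b1 <- sqrt(MSE / sum((X - mean(X))^2))  # S(beta1-hat)
t_b1  <- unname(coef(fit)[2]) / se_b1      # test of H0: beta1 = 0
ci_b1 <- unname(coef(fit)[2]) + c(-1, 1) * qt(0.975, n - 2) * se_b1

se_b1   # matches the Std. Error column of summary(fit)
ci_b1   # matches confint(fit)[2, ]
```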

2.4.2 Inference for β0
The sampling distribution of β̂0 is:

β̂0 ∼ N(β0, σ² (1/n + X̄² / Σ (Xi − X̄)²))

The (1 − α)100% CI for β0 is:

β̂0 ± t(1−α/2, n−2) · √( S²(β̂0) )

To test the hypothesis H0 : β0 = c the test statistic is:

t = (β̂0 − c) / √( S²(β̂0) )

2.5 Practical in R
2.5.1 Example 1: US Consumption Expenditure
In the fpp3 package in R, a data set named us_change gives a time series of quarterly percentage changes (growth rates) of real personal consumption expenditure, Y, and real personal disposable income, X, for the US from 1970 Q1 to 2019 Q2.
us_change %>%
pivot_longer(c(Consumption, Income), names_to = "Series") %>%
autoplot(value) + theme_bw()+
labs(y = "% Change", x = "Time")

A scatter plot of consumption changes against income changes:


us_change |>
ggplot(aes(x = Income, y = Consumption)) +
labs(y = "Consumption (quarterly % change)",
x = "Income (quarterly % change)") +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_bw()

The model can be fitted using:


model <- lm(Consumption ~ Income, data = us_change)
pander(summary(model))

Table 5: Fitting linear model: Consumption ~ Income

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   0.5445     0.05403      10.08     1.63e-19
Income        0.2718     0.04673      5.817     2.402e-08

Observations   Residual Std. Error   R2       Adjusted R2
198            0.5905                0.1472   0.1429

The fitted equation is:

Ŷ = 0.545 + 0.272X

ANOVA Table can be obtained using:


pander(anova(model))

Table 6: Analysis of Variance Table

            Df    Sum Sq   Mean Sq   F value   Pr(>F)
Income      1     11.8     11.8      33.84     2.402e-08
Residuals   196   68.35    0.3487    NA        NA

The confidence interval of the model:


pander(confint(model))

              2.5 %    97.5 %
(Intercept)   0.438    0.6511
Income        0.1797   0.364

2.5.2 Example 2: Sales data


Consider the data frame named marketing in the datarium package, containing the impact of three advertising media (youtube, facebook and newspaper) on sales. We want to fit a SLR model to see the impact of the advertising budget spent on youtube on sales.
i. Create a visualization for the two variables
ii. Fit a SLR model
iii. Obtain the 95% confidence interval and the ANOVA table for the model
iv. Interpret the results
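One possible sketch of a solution (this assumes the datarium package is installed; its marketing data frame has columns youtube, facebook, newspaper and sales):

```r
# Sketch solution for the exercise; requires the datarium package
if (requireNamespace("datarium", quietly = TRUE)) {
  data("marketing", package = "datarium")

  plot(sales ~ youtube, data = marketing)       # i.  visualization
  slr <- lm(sales ~ youtube, data = marketing)  # ii. SLR fit
  print(confint(slr))                           # iii. 95% CI
  print(anova(slr))                             #      ANOVA table
  print(summary(slr))                           # iv. interpret slope, R^2, p-values
}
```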

2.6 Multiple Linear Regression Analysis


The general multiple linear regression model can be written as:

Y = β0 + β1 X1 + ... + βk Xk + ϵ

Where β0 is the intercept and β1, ..., βk are the slope parameters associated with X1, ..., Xk.

2.6.1 Matrix Formulation of MLR


Consider the following multiple regression model:

Yi = β0 + β1 Xi1 + β2 Xi2 + ... + βp Xi,p−1 + ϵi

The model can be written using vectors and matrices as:

Y = Xβ + ϵ

Y = n × 1 vector of response values
β = p × 1 vector of regression parameters
X = n × p matrix of known constants
ϵ = n × 1 vector of iid error terms

Interpreting a slope parameter has to take into account possible changes in the other independent variables: a slope parameter, say βk, gives the change in Y when the independent variable Xk changes by one unit while all other independent variables remain constant.
The coefficients measure the marginal effects of the predictor variables.

2.6.2 OLS Estimation


Define the best estimate of β as the one that minimizes the SSE, ϵ′ϵ:

Σ ϵi² = ϵ′ϵ = (Y − Xβ)′(Y − Xβ)    (9)

Expanding and differentiating w.r.t. β:

Q = (Y − Xβ)′(Y − Xβ) = Y′Y − 2Y′Xβ + β′X′Xβ

∂Q/∂β = 2X′Xβ − 2X′Y

Equating to zero we have:

X′Xβ̂ = X′Y

Solving for β we get:

β̂ = (X ′ X)−1 X ′ Y
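A direct check of β̂ = (X′X)⁻¹X′Y against lm(), using simulated data (the coefficients and sample size are purely illustrative):

```r
set.seed(1)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(Intercept = 1, x1, x2)          # n x p design matrix
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

cbind(matrix = beta_hat, lm = coef(lm(y ~ x1 + x2)))  # columns agree
```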

2.6.3 Fitted Values and Residuals


The fitted values are given as:

Ŷi = β̂0 + β̂1 Xi1 + ... + β̂(p−1) Xi,p−1

Residuals are given by:

ei = Yi − Ŷi

In matrix formulation:

Ŷ = Xβ̂
  = X(X′X)⁻¹X′Y    (10)
  = HY

where H = X(X′X)⁻¹X′ is the hat matrix. In other words:

e = Y − Ŷ
  = Y − Xβ̂
  = Y − X(X′X)⁻¹X′Y    (11)
  = Y − HY
  = (I − H)Y
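The hat-matrix identities can be verified numerically on simulated data (illustrative only):

```r
set.seed(1)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)

H    <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
yhat <- as.numeric(H %*% y)               # fitted values H Y
e    <- as.numeric((diag(n) - H) %*% y)   # residuals (I - H) Y

all.equal(yhat, unname(fitted(lm(y ~ x1 + x2))))  # TRUE
all.equal(e,    unname(resid(lm(y ~ x1 + x2))))   # TRUE
```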

2.6.4 Inference in MLR


If the model Y = Xβ + ϵ is correct, the expectation of Y is Xβ and the expectation of β̂ is:

E[β̂] = (X′X)⁻¹X′ E[Y]
     = (X′X)⁻¹X′Xβ    (12)
     = β

The variance of β̂ is:

Var(β̂) = [(X′X)⁻¹X′] Var(Y) [(X′X)⁻¹X′]′
       = [(X′X)⁻¹X′] (σ²I) [(X′X)⁻¹X′]′    (13)
       = σ²(X′X)⁻¹
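Replacing σ² by its estimate MSE, σ̂²(X′X)⁻¹ is exactly what vcov() returns for an lm fit. A sketch on simulated data:

```r
set.seed(1)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)
p  <- ncol(X)
fit <- lm(y ~ x1 + x2)

sigma2_hat <- sum(resid(fit)^2) / (n - p)  # MSE
V <- sigma2_hat * solve(t(X) %*% X)        # estimated Var(beta-hat)

all.equal(unname(V), unname(vcov(fit)))    # TRUE
```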

The ANOVA table can be constructed as:

Source df SS MSS
Regression p-1 SSR MSR
Error n-p SSE MSE
Total n-1 SST

2.6.5 Coefficient of Multiple Determination


R² = SSR / SST

It measures the proportion of the variation in Y explained by the independent variables.
The adjusted R² is given by:

R²a = 1 − [(n − 1) SSE] / [(n − p) SST]

It adjusts R² for the number of predictors in the model.
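Both quantities can be computed from the sums of squares and checked against summary() (simulated, illustrative data):

```r
set.seed(1)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
p   <- 3                      # number of estimated parameters

SSE <- sum(resid(fit)^2)
SST <- sum((y - mean(y))^2)
R2     <- 1 - SSE / SST
R2_adj <- 1 - ((n - 1) * SSE) / ((n - p) * SST)

c(R2,     summary(fit)$r.squared)       # agree
c(R2_adj, summary(fit)$adj.r.squared)   # agree
```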

2.6.6 Hypothesis testing for individual regressors
• Determine the null and alternative hypothesis
• Specify the test statistic and its distribution if H0 is true
• Select α and determine the rejection region
• Calculate the sample value of test statistic and desired p-value
• State your conclusion
The hypothesis is
H0 : βk = 0 vs H1 : βk ̸= 0

The test statistic is:

t = β̂k / se(β̂k) ∼ t(n−p)

2.6.7 Global test


This is an overall test for the regression model. It investigates the possibility that all the regression coefficients are equal to zero.

H0 : β1 = ... = βk = 0 vs Ha : at least one βj ≠ 0

The test statistic is the F-statistic given by:

F = MSR / MSE
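The F statistic can be assembled from the ANOVA quantities and compared with the value summary() reports (simulated, illustrative data):

```r
set.seed(1)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
p   <- 3                                     # parameters (incl. intercept)

SSE <- sum(resid(fit)^2)
SSR <- sum((fitted(fit) - mean(y))^2)
F_stat <- (SSR / (p - 1)) / (SSE / (n - p))  # MSR / MSE

c(F_stat, summary(fit)$fstatistic[1])        # agree
```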

2.6.8 Assumptions of Multiple Linear Regression


2.6.8.1 Linearity There is a linear relationship between the dependent variable and each independent variable.
Linearity may be evaluated by constructing a scatter diagram for each independent variable and examining the diagrams.
It can also be assessed graphically with residual plots: plot the residuals against the fitted values; the plot should exhibit no pattern.

2.6.8.2 Homoscedasticity The variation in the residuals is the same for all fitted values of Y. The formal test for homoscedasticity is the Breusch-Pagan test, with hypotheses:
H0 : Constant variance
Ha : Heteroscedasticity

2.6.8.3 Normality of residuals Residuals are normally distributed with a mean of zero. This assumption is necessary for the validity of the inferences we make based on the global and individual hypothesis tests.
The formal test for the normality of residuals is the Shapiro-Wilk test. The hypotheses tested are:
H0 : Residuals are normally distributed
Ha : Residuals are not normally distributed

2.6.8.4 Multicollinearity This exists when the independent variables are correlated with one another. If an independent variable is highly correlated with other variables in the model, it should be removed.
To assess the degree to which independent variables are correlated we compute the VIF (variance inflation factor). A VIF greater than 10 is considered unsatisfactory.
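The VIF for a predictor Xj is 1/(1 − Rj²), where Rj² comes from regressing Xj on the other predictors. A sketch with simulated, deliberately correlated predictors (values illustrative):

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)   # x2 correlated with x1 by construction
y  <- 1 + x1 + x2 + rnorm(n)

r2_x1  <- summary(lm(x1 ~ x2))$r.squared  # R^2 from auxiliary regression
vif_x1 <- 1 / (1 - r2_x1)

vif_x1   # > 1; values above 10 would signal problematic collinearity
```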

2.6.8.5 Autocorrelation Successive residuals should be independent, implying that there is no pattern in the residuals.
When successive residuals are correlated, we refer to the condition as autocorrelation.
The formal test is the Durbin Watson test
H0 : No Autocorrelation
Ha : Autocorrelation

2.6.9 Example in R
We fit a multiple linear regression for US consumption given by:

Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4

Where:
Y is the percentage change in real personal consumption expenditure
X1 is the percentage change in real personal disposable income
X2 is the percentage change in industrial production
X3 is the percentage change in personal savings
X4 is the change in the unemployment rate
fit_model <- us_change |>
model(tslm = TSLM(Consumption ~ Income + Production +
Unemployment + Savings))
report(fit_model)

## Series: Consumption
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90555 -0.15821 -0.03608 0.13618 1.15471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.253105 0.034470 7.343 5.71e-12 ***
## Income 0.740583 0.040115 18.461 < 2e-16 ***
## Production 0.047173 0.023142 2.038 0.0429 *
## Unemployment -0.174685 0.095511 -1.829 0.0689 .
## Savings -0.052890 0.002924 -18.088 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.3102 on 193 degrees of freedom
## Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
## F-statistic: 160 on 4 and 193 DF, p-value: < 2.22e-16
The fitted values can be obtained as:
augment(fit_model) |>
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Consumption, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
labs(y = NULL,
title = "Percent change in US consumption expenditure"
) +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  guides(colour = guide_legend(title = NULL))

[Figure: Percent change in US consumption expenditure, actual data vs fitted values, 1980 Q1 to 2020 Q1.]
The residual diagnostic plots (residual time series, ACF and histogram)


fit_model |> gg_tsresiduals()

[Figure: Residual diagnostics from gg_tsresiduals(): innovation residuals over time, residual ACF, and histogram of residuals.]
We fit the model and test the assumptions


model2 <- lm(Consumption ~ Income + Production +
Unemployment + Savings, data = us_change)
pander(summary(model2))

Table 10: Fitting linear model: Consumption ~ Income + Production + Unemployment + Savings

               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)    0.2531     0.03447      7.343     5.713e-12
Income         0.7406     0.04012      18.46     1.648e-44
Production     0.04717    0.02314      2.038     0.04287
Unemployment   -0.1747    0.09551      -1.829    0.06895
Savings        -0.05289   0.002924     -18.09    2.028e-43

Observations   Residual Std. Error   R2       Adjusted R2
198            0.3102                0.7683   0.7635

The ANOVA table


pander(anova(model2))

Table 11: Analysis of Variance Table

               Df    Sum Sq   Mean Sq   F value   Pr(>F)
Income         1     11.8     11.8      122.6     2.215e-22
Production     1     15.67    15.67     162.9     1.921e-27
Unemployment   1     2.623    2.623     27.26     4.583e-07
Savings        1     31.48    31.48     327.2     2.028e-43
Residuals      193   18.57    0.09623   NA        NA

Testing the assumptions


Normality
check_normality(model2)

## Warning: Non-normality of residuals detected (p < .001).


Autocorrelation
check_autocorrelation(model2)

## OK: Residuals appear to be independent and not autocorrelated (p = 0.134).


Homoscedasticity
check_heteroscedasticity(model2)

## Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.001).


Multicollinearity
check_collinearity(model2)

## # Check for Multicollinearity


##
## Low Correlation
##
## Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
## Income 2.67 [2.18, 3.37] 1.63 0.37 [0.30, 0.46]
## Production 2.54 [2.08, 3.19] 1.59 0.39 [0.31, 0.48]
## Unemployment 2.52 [2.06, 3.17] 1.59 0.40 [0.32, 0.48]
## Savings 2.51 [2.05, 3.15] 1.58 0.40 [0.32, 0.49]
Linearity
us_change |>
ggpairs(columns = 2:6)

[Figure: Scatterplot matrix of Consumption, Income, Production, Savings and Unemployment, with pairwise correlations: Consumption-Income 0.384, Consumption-Production 0.529, Consumption-Savings -0.257, Consumption-Unemployment -0.527, Income-Production 0.269, Income-Savings 0.720, Income-Unemployment -0.224, Production-Savings -0.059, Production-Unemployment -0.768, Savings-Unemployment 0.106.]