
ECO242 Basic Econometrics

Chapter One Lecture Notes

__________________________________________________________________________________

The nature of regression: slides 4-6

In 1886, Francis Galton published an article titled ‘Regression towards mediocrity in hereditary stature’ in the Journal of the Anthropological Institute of Great Britain and Ireland. What Galton then called ‘regression towards mediocrity’ (and what we would today call ‘regression to the mean’) is best encapsulated by this quote from the article.

It is some years since I made an extensive series of experiments on the produce of seeds of different size but of the same species. They yielded results that seemed very noteworthy, and I used them as the basis of a lecture before the Royal Institution on February 9th, 1877. It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but to be always more mediocre than they – to be smaller than the parents, if the parents were large; to be larger than the parents, if the parents were very small. F. Galton, 1886

The experiment he describes above was repeated observationally with humans. Galton measured the heights of close to 1000 children and the heights of their parents. He noticed a pattern similar to the one described in the quote above: male children with exceptionally tall parents tended, on average, to be shorter than their fathers, while male children with exceptionally short parents tended, on average, to be taller than their fathers.

Today, when we talk about regression we mean something different to ‘regression to the mean’.

Specifically, regression is an approach to modelling the dependence of one variable on one or more
other variables. The interpretation of this modelling approach is that it gives us an estimate of the
mean or average response of the dependent variable following a change in the independent variable
or variables.

An example is provided on slide 5. Notice that a regression model is nothing but a linear equation, much like 𝑦 = 𝑚𝑥 + 𝑐. By convention we substitute the Greek letter 𝛽 (beta) for the slope and intercept parameters in the linear equation. In the example on this slide, 𝑌 is the final mark for a course, and 𝑋 is the number of hours spent studying per week by students enrolled for the course. This model gives us parameter estimates of 25 for the constant term and 7 for the slope term. These estimates imply that students who spend no time studying can expect, on average, to achieve a grade of 25, and that students can, on average, increase their final grade by 7 marks for each additional hour studied per week.

The three bullet points at the bottom of slide 5 show different ways of expressing the same idea.
Bullet 3 is read as “the expected value of 𝑌 given that 𝑋 takes on some specific value 𝑥”. The fourth
bullet point just adds a value for 𝑋, and is read as follows: “the expected grade for a student who
studies for 5 hours per week is 60”. The fifth bullet just shows us the calculation used to get to 60.

The graph on slide 6 illustrates the equation presented on slide 5. The table on the left shows the data used to estimate the model: a sample of 25 students, where the 𝑋 column is their hours studied per week and the 𝑌 column is their final grade for the course. The graph on the right shows a plot of the data points and the estimated regression line. Notice that the same 𝑋 value
can be associated with different 𝑌 values. There are, for instance, two students with only 5 weekly
study hours who achieved a higher grade than one student with 6 study hours. So even though the
model suggests that, on average, more study time implies a higher grade, there are individual data
points that diverge from what the model predicts. Hence the term on average!
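
If you want to see the ‘on average’ interpretation for yourself, a minimal sketch in Python is given below. The data here are invented for illustration (they are not the 25 observations from slide 6), so the estimates will not be exactly 25 and 7; the point is only that the fitted line tracks the average grade at each value of 𝑋, while individual observations scatter around it.

    import numpy as np

    # Invented (hours studied, final grade) pairs -- illustrative only,
    # not the 25 observations shown on slide 6.
    hours  = np.array([2, 3, 5, 5, 6, 8, 10, 12])
    grades = np.array([35, 50, 62, 58, 55, 80, 90, 88])

    # Fit a straight line grade = b0 + b1*hours by least squares.
    # np.polyfit returns the slope first, then the intercept.
    b1, b0 = np.polyfit(hours, grades, deg=1)
    print(f"intercept = {b0:.1f}, slope = {b1:.1f}")

    # Estimated average grade for a student who studies 5 hours per week,
    # i.e. an estimate of E(Y | X = 5).
    print("predicted grade at X = 5:", round(b0 + b1 * 5, 1))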

Statistical vs deterministic relationships: slide 7

Econometrics deals with statistical relationships and not deterministic relationships. Deterministic
relations are ones with no variation (or randomness) among the variables. Many physical
relationships are of this nature. Newton’s law of gravity, which is shown on this slide, is one such
example. The relationship between objects in space doesn’t randomly change – we don’t spontaneously start floating sometimes while walking down the street.
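
As a reminder of what such a law looks like (the formula is standard, not copied from the slide), Newton’s law states that the force of attraction between two bodies is 𝐹 = 𝑘(𝑚1𝑚2/𝑟²), where 𝑚1 and 𝑚2 are the two masses, 𝑟 is the distance between them, and 𝑘 is a constant of proportionality. Given the masses and the distance, the force is determined exactly; there is no error term.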

In economics, and social sciences more generally, we deal entirely with non-deterministic
relationships. We haven’t uncovered (and perhaps never will uncover) strict laws about our social world
that hold universally and provide precise predictions. We can predict with great accuracy what
happens when a projectile is launched into the air. We don’t know with great precision what the
absolute best policy is to bring down unemployment to 5%.

Consider the example on slides 5 and 6. Although it makes sense that more study hours improve grades, we still observe examples where this is not the case. Other things seem to matter, and we
can think of many other explanatory variables that could influence grades (intelligence,
commitment, health, wealth, quality of high school, quality of instruction by lecturer, home
environment, and on and on). Even if all the variables we can think of are included in the regression
model (of course, some of these variables cannot be observed or are hard to measure), we may still
observe randomness in the grades achieved by students. There will always be some variation that
we simply cannot explain.

This is why we include an ‘error term’ in the regression model.
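
Written out with the error term included (using 𝑢 as a conventional symbol for it; the slide may use different notation), the grades model becomes 𝑌 = 𝛽1 + 𝛽2𝑋 + 𝑢, or in the example above, final grade = 25 + 7 × (study hours) + 𝑢. The systematic part 𝛽1 + 𝛽2𝑋 is what the regression estimates; 𝑢 collects intelligence, health, luck on the day of the exam, and everything else that affects the grade but is not in the model.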

Regression and causation: slide 8

Ultimately, we develop regression models because we want to make causal claims. In other words,
we want to be able to say that “these factors (𝑋) explain these outcomes (𝑌)”. But it is rare that
regression models provide us with such causal interpretations. In the grades-study hours regression
on slide 5, is it reasonable to interpret the estimated parameter values as follows: students who
achieved higher grades did so purely because they spent more time studying? Well, not exactly: you
might immediately object to such an interpretation by raising all the other variables mentioned
above that are also important for achieving good grades (intelligence, commitment, motivation, etc).

The above paragraph is not meant to be interpreted as “all regression models are merely speculative
and can’t tell us anything meaningful about the relationships between variables or phenomena
under study”. Regression models can be very useful, and many regression models apply techniques
that lend themselves to a causal interpretation (you’ll learn more about these types of models in
honours econometrics). It is important, however, to give very careful consideration to the design of an econometric analysis using regression before making causal claims. And always remember that correlation does not imply causation: just because two variables ‘move in the same direction’ does not mean that one causes the other. Careful work needs to be done to determine whether causation is at play, and advanced regression analysis can be a useful tool in this regard.

Slide 9 has a nice light-hearted take on the confusion around correlation and causation. Notice that
the student, perhaps, takes the idea of correlation vs causation too far. Let’s return to the grades-
study hours model. Earlier I suggested that it is not accurate to state with certainty that “our model shows that more study hours improve grades for ECO242 at a rate of 7 marks per hour studied”. However, is the more general claim that “students who put in more study time performed better as a result” unreasonable?

Regression vs correlation: slides 10-11

The distinction given on this slide is quite useful. Regression and correlation are two distinct
concepts and must not be confused. Often researchers might notice a correlation and may then
wonder whether there is causation going on. Certain, more advanced, regression techniques could
aid in answering this question.

Milton Friedman famously documented a correlation between changes in the money supply and fluctuations in economic activity, in particular that monetary contractions tended to precede recessions. He then used this to argue that changes in the money supply cause fluctuations in economic activity. Econometric analysis supports this observation. Again, it is hard to use the word
‘cause’ because econometric models cannot necessarily be interpreted causally – they can provide
further evidence but are not necessarily conclusive.

The image on slide 11 shows five reasons why two variables may be correlated. It could be pure chance (sometimes this is called ‘spurious correlation’). There could also be causation from one to the other. There could be reverse causation (in the money supply example, could it be that changes in the economic environment influence the money supply?). There could also be confounding variables (when ice-cream sales increase, shark attacks increase – the confounding variable is higher temperatures). Then there’s selection (people who graduated from certain universities earn higher incomes than people who graduated from other universities; is it because the ‘high-income’ universities provide better instruction, or is it because they admit – select – high-achieving students?).

Terminology: slide 12

The dependent and independent variables are referred to in several different ways. It is useful to be
aware of the different terminology that is sometimes used as different texts may prefer different
terminology.

Nature and sources of data: slides 13-15

Data used in econometrics are often split into four ‘types’: time-series, cross-sectional, pooled, and
panel/longitudinal. Each type has its own implications for econometric analysis. Also, different types
of data will be more suited to particular research questions. In macroeconomics, for instance, time-series data are generally of interest. In development economics, on the other hand, panel or cross-sectional data are usually of interest.

It is important to understand what distinguishes the four types and this is best explained by
considering the unit of observation in each case.

Time-series data are data series that are measured over time at some regular periodicity. The unit of observation in this instance is time. E.g. if we are looking at a series of annual GDP data for the years 2000-2020, any particular data point (say, GDP of R4.8 trillion in 2015) is specific to that year. Although GDP is the variable, we observe different values of it at different times. If values of a variable are collected over time, the data are usually time series in nature.

Unlike time-series data, cross-sectional data are collected at a particular point in time and not over time. They are therefore often described as ‘snapshot’ data: they provide a ‘snapshot’ of the units of observation at the point in time when the data were collected. In the case of cross-sectional data, the unit of observation is usually individuals (these could be more specific, such as ‘students’ or ‘female students’ or ‘female students in residence’, etc.), but could also be households, municipalities, provinces, firms, or countries.

Pooled data are similar to cross-sectional data except that they combine multiple cross-sections or ‘snapshots’. The unit of observation remains the same as in cross-sectional data. The only difference is that data are collected more than once (though time is NOT the unit of observation here). There’s an important characteristic of pooled data that doesn’t allow time to be the unit of observation: each cross-section included in the pooled data set is drawn independently, as a fresh random sample. E.g., let’s say that government
wants to estimate the unemployment rate. It draws a random sample of individuals and asks them
questions related to their work status. Let’s say government wants regular updates on the
unemployment rate. It again randomly draws a sample from the population and asks individuals in
the new sample about their work status. StatsSA in fact does this every three months – it is called
the Quarterly Labour Force Survey (QLFS). Now, a key feature of this activity is that every three
months a random sample is drawn – there is no guarantee that individuals forming part of the
sample this quarter will be in the samples taken in subsequent quarters. Using pooled data can be
useful in instances where sample sizes of individual cross-sections are small. It is also useful for
considering policy effects over time. E.g. if government implements a jobs programme in Q3 2018 (quarter 3 of 2018), an economist could pool QLFS data from Q1 2018 to Q4 2019 to determine whether the programme has had an effect.

Panel data are similar to pooled cross-section data with one important distinction: the same sample is repeatedly surveyed over time. So, instead of taking the approach of the QLFS, which surveys a new random sample every three months, researchers could follow the same households and ask them about their employment status. This allows researchers to consider factors that are associated with individuals’ labour outcomes. Here, the unit of observation is the pair {individual, time}. Observations are made on each household, firm, etc., but also on different values of a
given variable over time. If an economist wants to analyse investment in research and development
at the firm level (meaning how different firms behave with respect to such investment), they are
observing individual firms as well as the value of investment at various points in time. Thus, panel
data can be thought of as a combination of cross-section data and time-series data. It is nonetheless
distinct from both of these types of data.
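
To make the structural difference concrete, here is a minimal sketch in Python using pandas. All of the identifiers and values are invented for illustration; they are not taken from the QLFS or from any real firm data.

    import pandas as pd

    # Pooled cross-sections: two independently drawn samples stacked together.
    # The person_id values in 2018 and 2019 refer to different people.
    pooled = pd.concat([
        pd.DataFrame({"year": 2018, "person_id": [1, 2, 3], "employed": [1, 0, 1]}),
        pd.DataFrame({"year": 2019, "person_id": [1, 2, 3], "employed": [0, 1, 1]}),
    ], ignore_index=True)

    # Panel data: the SAME firms are observed in every year, so the unit of
    # observation is the pair {firm, year}.
    panel = pd.DataFrame({
        "firm": ["A", "A", "B", "B"],
        "year": [2018, 2019, 2018, 2019],
        "rd_spend": [1.2, 1.5, 0.8, 0.9],  # hypothetical R&D spending
    }).set_index(["firm", "year"])

    print(pooled)
    print(panel)

In the pooled data frame the person identifiers are not comparable across years, because each year is a fresh random sample; in the panel data frame the same firms appear in every year, so each firm’s R&D spending can be followed over time.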

Slide 15 provides some sources for data. Clicking on the data source will take you to the relevant
webpage.

Measurements of scale: slide 16

It is useful to know the difference between these measurement scales as variables with different
measurement scales result in different interpretations in a regression model – you’ll encounter this
mainly in third-year econometrics.

Ratio scale variables have three properties: the ratio of two values is meaningful, the difference between two values is meaningful, and the values have a natural order.

Interval scale variables have only the last two properties: meaningful differences and meaningful order. Temperature is an example of an interval scale variable. Consider 𝑋1 = −10 and 𝑋2 = 10, where 𝑋 is temperature in degrees Celsius. The ratio of these values is meaningless.
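
To see why, note that 𝑋2/𝑋1 = 10/(−10) = −1 in degrees Celsius, but the same two temperatures expressed in Fahrenheit are 50 and 14, giving a ratio of roughly 3.6: the ratio depends entirely on where the scale puts its zero. The difference, by contrast, behaves sensibly: the 20-degree gap in Celsius is always a 36-degree gap in Fahrenheit, wherever on the scale it occurs.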

Ordinal scale variables retain only the third property of ratio scale variables: order. The most popular example of this is the Likert-scale survey that one is subjected to after a service call with some company or institution: “On a scale of 1 to 5, where 1 means strongly disagree and 5 means strongly agree, please indicate your agreement with this statement: you are likely to recommend us to a friend.” It is not quite meaningful to say a score of 2 is half the value of a score of 4.

Nominal scales have no order. There’s no meaningful order in which we can place different race
groups or gender groups.
