Unit V MULTIVARIATE AND TIME SERIES ANALYSIS
Introducing a Third Variable - Causal Explanations - Three-Variable Contingency Tables and
Beyond - Longitudinal Data - Fundamentals of TSA - Characteristics of time series data - Data
Cleaning - Time-based indexing - Visualizing - Grouping - Resampling
Data Exploration and Visualization
1. Explain a Third Variable.
A third variable problem occurs when an observed correlation between two variables can
actually be explained by a third variable that hasn't been accounted for. When this third variable
is not taken into account, the correlation between the two variables under study can be
misleading and even confusing.
Example 1: Dogs & Fire Hydrants
A researcher observes that cities with more fire hydrants tend to also have more dogs. However,
these two variables are only correlated because they both have a high correlation with a third
variable: population size.
Larger cities tend to have both more fire hydrants and more dogs. Conversely, smaller cities tend
to have fewer fire hydrants and fewer dogs.
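The fire hydrant example can be checked with a small simulation. This is only a sketch under stated assumptions: the city data are invented, hydrants and dogs are each generated from population plus independent noise, and the helper names pearson and partial_corr are hypothetical. It compares the raw correlation with the first-order partial correlation that controls for population.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs (first-order partial)."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

random.seed(0)  # fixed seed so the sketch is reproducible
# Hypothetical cities: hydrants and dogs each depend only on population.
population = [random.uniform(10_000, 1_000_000) for _ in range(500)]
hydrants = [p / 100 + random.gauss(0, 500) for p in population]
dogs = [p / 10 + random.gauss(0, 5_000) for p in population]

r_raw = pearson(hydrants, dogs)
r_partial = partial_corr(hydrants, dogs, population)
print(f"hydrants vs dogs: r = {r_raw:.2f}")
print(f"controlling for population: r = {r_partial:.2f}")
```

In this simulated setting the raw correlation is strong, while the partial correlation should sit near zero, because the only link between the two variables runs through population size.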
Example 2: Ice Cream Sales & Shark Attacks
A researcher finds that ice cream sales and shark attacks are highly positively correlated.
However, these two variables are only correlated because they both have a high correlation
with a third variable: temperature.
When it's warmer out, more people buy ice cream and more people swim in the ocean, which
explains why the values for both ice cream sales and shark attacks tend to increase during the
same times of the year.
2. What is the relationship between variables?
A survey is conducted and an interesting statistical association between X and Y is
discovered. There are two basic assumptions that have to be made if we wish to infer from
this that X may cause Y. These involve the relationship between X and Y and other variables
which might be operating. They are designed to ensure that when we compare groups which
differ on X, we are comparing like with like. Before giving an exposition of these
assumptions, we need a bit more terminology: other variables can be causally prior to both X
and Y, intervene between X and Y, or ensue from X and Y, as shown below.
Figure 11.2 Different causal relationships between variables.
3. What assumptions can be made about causal relationships between variables?
Three assumptions can be made about a causal relationship between variables.
Assumption 1 - 'X is causally prior to Y'.
There is nothing in the data to tell us whether X causes Y or Y causes X, so we have to make the
most plausible assumption we can, based on our knowledge of the subject matter and our
theoretical framework. One particular problem comes from situations in which we suspect that X
and Y influence each other in a reciprocating process. There is no way that non-experimental
data can be made to yield two independent estimates of the effect of X on Y and of Y on X.
Assumption 2 - 'Related prior variables have been controlled'.
All other variables which affect both X and Y must be held constant. In an experiment, we can
be sure that there are no third variables which give rise to both X and Y because the only way in
which the randomized control groups are allowed to vary is in terms of X.
Assumption 3 - 'All variables intervening between X and Y have been controlled'.
This assumption is not required before you can assume that there is a causal link between X and
Y, but is required if you aim to understand how X is causing Y.
4. What is a spurious relationship?
Spurious relationship - an association between two variables appears to be causal but can in fact
be explained by some third variable.
5. What is a control variable?
Control variables - the effects of potential 'third variables' are mathematically controlled in the
data analysis process to highlight the relationship between the independent and dependent
variable; used to establish the criterion of nonspuriousness in nomothetic causal relationships.
6. Explain the causal relationship in detail.
We consider ways of holding a third variable constant while assessing the relationship between
two others.
We have now developed some experience of handling batches of data, summarizing features of
their distributions, and investigating relationships between variables. We must now change gear
somewhat and ask what it would take for such relationships to be treated as satisfactory
explanations. Hume suggested that 'We may define a cause to be an object followed by another,
and where all the objects, similar to the first, are followed by objects similar to the second. Or,
in other words, where, if the first object had not been, the second never had existed'.
Direct and indirect effects
Causality should not necessarily be understood as a simple process in which one factor or
variable has an impact on another. For example, it is likely in many cases that two or more
factors will tend to work together to produce an effect. Moreover, the factors or variables
contributing to the effect may themselves be causally related. For this reason, we have to keep a
clear idea in our heads of the relationships between the variables in the whole causal process. In
investigating the causes of absenteeism from work, for example, researchers have found
different contributory factors. We will consider two possible causal factors: being female and
being in a low status job. Let us construct a causal path diagram depicting one possible set of
relationships between these variables.
Figure 11.1 Causes of absenteeism.
The diagram in figure 11.1 represents a simple system of multiple causal paths. There is an
arrow showing that those in low status jobs are more likely to go absent. Being female has a
causal effect in two ways. There is an arrow straight to absentee behaviour; this says that
women are more likely to be absent from work than men, regardless of the kind of job they are
in. This is termed a direct effect of gender on absenteeism. There is also another way in which
being female has an effect: women are more likely to be in the kind of low status, perhaps
unpleasant, jobs where absenteeism is more likely, irrespective of gender. We can say that being
female therefore also has an indirect effect on absenteeism, through the type of work
performed. Without some empirical evidence we cannot be sure that this 'model' of the
relationships between the variables is correct.
Controlling the world to learn about causes
It is one thing to say that causal chains exist in the world; it is
quite another thing to find out what they are. Causal processes are not obvious. They hide in
situations of complexity, in which effects may have been produced by several different causes
acting together. When investigated, they will reluctantly shed one layer of explanation at a time,
but only to reveal another deeper level of complexity beneath. For this reason, something that is
accepted as a satisfactory causal explanation at one point in time can become problematic at
another.
Researchers investigating the causes of psychological depression spent a long time carefully
documenting how severe, traumatizing events that happen to people, such as bereavement or
job loss, can induce it. Now that the causal effect of such life events has been established, the
research effort is turning to ask how an event such as unemployment has its effect: is it through
the loss of social esteem, through the decline of self evaluation and self-esteem, through lack of
cash or through the sheer effect of inactivity?
Do opinion polls influence people?
Let us take an example to illustrate the different inferences which can be drawn from
experiments and non-experiments.
Some people believe that hearing the results of opinion polls before an election sways
individuals towards the winning candidate. Imagine two ways in which empirical evidence
could be collected for this proposition. An experiment could be conducted by taking a largish
group of electors, splitting them into two at random, telling half that the polls indicated one
candidate would win and telling the other half that they showed a rival would win. As long as
there were a substantial number of people in each group, the groups would start the experiment
having the same political preferences on average, since the groups were formed at random. If
they differed substantially in their subsequent support for the candidates, then we could be
almost certain that the phony poll information they were fed contributed to which candidate
they supported.
Alternatively, the proposition could be researched in a non-experimental way. A survey could
be conducted to discover what individuals believed recent opinion polls showed, and to find out
which candidates the individuals themselves supported. The preferences of those who believed
that one candidate was going to win would be compared with those who believed that the rival
was going to win. The hypothesis would be that the former would be more sympathetic to the
candidate than the latter.
If the second survey did reveal a strong relationship between individuals' perception of the state
of public opinion and their own voting preference, would that show that opinion polls have
a causal effect on voters, and would it justify banning polls in
pre-election periods as a result? Anyone who tried this line of argument would be taken to task
by the pollsters, who have a commercial interest in resisting such reasoning. They would deny
that the effect in any way proves that polls influence opinion: it could, for instance, be that
supporters of a right-wing candidate are of a generally conservative predisposition, and
purchase newspapers which only report polls sympathetic to their candidate.
In short, comparing individuals in a survey who thought that candidate A would win with those
who believed that candidate B would win would not be comparing two groups similar in all
other possible respects, unlike the experiment discussed above. An experiment would have a
better chance of persuading people that the publication of opinion polls affected individual
views.
Assumptions required to infer causes
Imagine a common situation. A survey is conducted and an interesting statistical association
between X and Y is discovered. There are two basic assumptions that have to be made if we
wish to infer from this that X may cause Y. These involve the relationship between X and Y
and other variables which might be operating. They are designed to ensure that when we
compare groups which differ on X, we are comparing like with like. Before giving an
exposition of these assumptions, we need a bit more terminology: other variables can be
causally prior to both X and Y, intervene between X and Y, or ensue from X and Y, as shown
in figure 11.2. These terms are only relative to the particular causal model in hand: in a
different model we might want to explain what gave rise to the prior variable.
Figure 11.2 Different causal relationships between variables.
Let us discuss each of the two core assumptions in turn.
Assumption 1
X is causally prior to Y.
There is nothing in the data to tell us whether X causes Y or Y causes X, so we
have to make the most plausible assumption we can, based on our knowledge of the subject
matter and our theoretical framework.
Assumption 2
Related prior variables have been controlled.
All other variables which affect both X and Y must be held constant. In an experiment, we can
be sure that there are no third variables which give rise to both X and Y because the only way in
which the randomized control groups are allowed to vary is in terms of X. No such assumption
can be made with non-experimental data.
Assumption 3
All variables intervening between X and Y have been controlled.
This assumption is not required before you can assume that there is a causal link between X and
Y, but it is required if you aim to understand how X is causing Y.
Let us first consider a hypothetical example drawn from the earlier discussion of the causes of
absenteeism. Suppose previous research had shown a positive bivariate relationship between
low social status jobs and absenteeism. The question arises: is there something about such jobs
that directly causes the people who do them to go off sick more than others? Before we can
draw such a conclusion, two assumptions have to be made.
There are many possible outcomes once the relationship between all three variables is
considered at once, four of which are shown in figure 11.3.
Figure 11.3 The effect of job status on absenteeism: controlling a prior variable.
Figure 11.4 Outcome I from figure 11.3.
Figure 11.5 Outcome II from figure 11.3.
Figure 11.6 Outcome III from figure 11.3.
If the relationship between two variables disappears when a causally prior variable is
brought under control, the original effect is said to be spurious. By this we do not
mean that the bivariate effect did not really exist, but rather that any causal conclusions drawn
from it would be incorrect. We can now introduce another meaning for that verb 'to explain': in
this situation, many researchers say that the proportion of females in a job 'explains' the
relationship between the status of the job and absenteeism, in the sense that it accounts for it
entirely.
But what of the fourth situation, which is actually the most likely outcome? It was the situation
portrayed in figure 11.1.
Figure 11.7 Outcome IV from figure 11.3.
7. What is multiple and complex causality?
Many different component causes can add together to produce a particular outcome,
a process sometimes known as multiple causality or complex causality.
8. What is thick trust and thin trust?
Trust and honesty based on personal experience and on knowing people well over a number
of years is conceptualized as 'thick trust'. However, this is only possible with a relatively
small number of people and is therefore not as useful within a community as 'thin trust'.
9. Three-variable contingency table and beyond
A three-way contingency table is a cross-classification of observations by the levels of three
categorical variables.
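As a minimal sketch of the definition above, a three-way table can be built by counting how many observations fall into each combination of levels. The records below are invented for illustration, and the variable names (gender, job status, absence) are borrowed from the absenteeism discussion rather than taken from any real data set.

```python
from collections import Counter

# Hypothetical survey records: (gender, job_status, absent)
records = [
    ("female", "low", "yes"), ("female", "low", "yes"), ("female", "low", "no"),
    ("female", "high", "yes"), ("female", "high", "no"),
    ("male", "low", "yes"), ("male", "low", "no"),
    ("male", "high", "no"), ("male", "high", "no"), ("male", "high", "yes"),
]

# Each cell of the three-way table is the count for one combination of levels.
table = Counter(records)

# Print one two-way table (job status by absence) per level of the third variable.
for gender in ("female", "male"):
    print(f"gender = {gender}")
    print(f"{'status':<8}{'absent':>8}{'present':>9}")
    for status in ("low", "high"):
        absent = table[(gender, status, "yes")]
        present = table[(gender, status, "no")]
        print(f"{status:<8}{absent:>8}{present:>9}")
```

Displaying a separate two-way table for each level of the third variable is the usual way to read such a table, since it holds that variable constant within each panel.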
10. Explain about Longitudinal Data in detail.
Longitudinal data, sometimes called panel data, is data that is collected through a series of
repeated observations of the same subjects over some extended time frame, and is useful for
measuring change. In most cases longitudinal data deals with human data.
Longitudinal data effectively follows the same sample over time, which differs fundamentally
from cross-sectional data because it follows the same subjects over some time, while cross-
sectional data samples different subjects (whether individuals, firms, countries, or regions) at
each point in time. Meanwhile, a cross-sectional data set will always draw a new random
sample.
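The practical consequence of following the same subjects can be sketched in a few lines. The panel below is invented (subject labels, wave numbers and scores are all hypothetical); the point is that within-subject change can only be computed when each subject appears in more than one wave, which a cross-sectional design does not provide.

```python
# Hypothetical long-format panel: one row per (subject, wave).
panel = [
    {"subject": "A", "wave": 1, "score": 10}, {"subject": "A", "wave": 2, "score": 14},
    {"subject": "B", "wave": 1, "score": 8}, {"subject": "B", "wave": 2, "score": 7},
    {"subject": "C", "wave": 1, "score": 12}, {"subject": "C", "wave": 2, "score": 15},
]

# Index scores by subject and wave so the same person can be compared over time.
by_subject = {}
for row in panel:
    by_subject.setdefault(row["subject"], {})[row["wave"]] = row["score"]

# Within-subject change is only defined because subjects repeat across waves.
change = {s: waves[2] - waves[1] for s, waves in by_subject.items()}
print(change)
```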
Types of longitudinal studies
The three main types of longitudinal studies are
1. Panel Study
2. Retrospective Study
3. Cohort Study
These methods help researchers to study variables and account for qualitative and quantitative
data from the research sample.
1. Panel Study
In a panel study, the researcher uses data collection methods like surveys to gather information
from a fixed number of variables at regular but distant intervals, often spanning a few years.
It is primarily designed for quantitative research, although you can use this method for
qualitative data analysis.
When To Use Panel Study
If you want to have first-hand, factual information about the changes in a sample population,
then you should opt for a panel study. For example, medical researchers rely on panel studies to
identify the causes of age-related changes and their consequences.
Advantages of Panel Study
• It helps you identify the causal factors of changes in a research sample.
• It also allows you to witness the impact of these changes on the properties of the
variables and information needed at different points of their existing relationship.
• Panel studies can be used to obtain historical data from the sample population.
Disadvantages of Panel Studies
• Conducting a panel study is pretty expensive in terms of time and resources.
• It might be challenging to gather the same quality of data from respondents at
every interval.
2. Retrospective Study
In a retrospective study, the researcher depends on existing information from previous
systematic investigations to discover patterns leading to the study outcomes. In other words, a
retrospective study looks backward. It examines exposures to suspected risk or protection
factors concerning an outcome established at the start of the study.
When To Use Retrospective Study
Retrospective studies are best for research contexts where you want to quickly estimate an
exposure's effect on an outcome. It also helps you to discover preliminary measures of
association in your data.
Medical researchers adopt retrospective study methods when they need to research rare
conditions.
Advantages of Retrospective Study
• Retrospective studies happen at a relatively smaller scale and do not require much time
to complete.
• It helps you to study rare outcomes when prospective surveys are not feasible.
Disadvantages of Retrospective Study
• It is easily affected by recall bias or misclassification bias.
• It often depends on convenience sampling, which is prone to selection bias.
3. Cohort Study
A cohort study entails collecting information from a group of people who share specific traits or
have experienced a particular occurrence simultaneously. For example, a researcher might
conduct a cohort study on a group of Black school children in the UK.
During a cohort study, the researcher follows group members who have been exposed to a
specific characteristic or risk factor. Then, she records the outcome of this exposure and its
impact on the exposed group.
When To Use Cohort Study
You should conduct a cohort study if you're looking to establish a causal relationship within
your data sets. For example, in medical research, cohort studies investigate the causes of
disease and establish links between risk factors and effects.
Advantages of Cohort Studies
• It allows you to study multiple outcomes that can be associated with one risk factor.
• Cohort studies are designed to help you measure all variables of interest.
Disadvantages of Cohort Studies
• Cohort studies are expensive to conduct.
• Throughout the process, the researcher has less control over variables.
11. Explain the Three-Variable Contingency Table and Beyond.
Causal path models for three variables
The set of paths of causal influence, both direct and indirect, that we want to begin to consider
are represented in figure 12.5. In this causal model we are trying to explain social trust; the base
is therefore the belief that 'You can't be too careful'. The base categories selected for the
explanatory variables are having lower levels of qualifications and not being a member of a
voluntary organization, to try and avoid negative paths. Each arrow linking two variables in a
causal path diagram represents the direct effect of one variable upon the other, controlling all
other relevant variables. The rule for identifying the relevant variables was given in chapter 11:
when we are assessing the direct effect of one variable upon another, any third variable which is
likely to be causally connected to both variables and prior to one of them should be controlled.
Coefficient b in figure 12.5 shows the direct effect of being in a voluntary association on the
belief that most people can be trusted. To find its value, we focus attention on the proportion
who say that most people can be trusted, controlling for level of qualifications.
Figure 12.5 Social trust by membership of voluntary association and level of
qualifications: causal path diagram.
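One simple way to picture 'controlling for level of qualifications' when estimating coefficient b is to compare members with non-members separately within each qualification level and then average those differences. The sketch below assumes invented counts and assumes a size-weighted average as the combining rule; the notes do not specify the exact weighting, so treat both as illustrative rather than as the method used for figure 12.5.

```python
# Hypothetical counts: (qualification level, member?) -> (number trusting, group size)
counts = {
    ("low", False): (30, 100),
    ("low", True): (20, 50),
    ("high", False): (40, 100),
    ("high", True): (90, 150),
}

def proportion(level, member):
    """Proportion saying most people can be trusted in one subgroup."""
    trusting, total = counts[(level, member)]
    return trusting / total

# Difference in proportions (members minus non-members) within each level.
diffs = {lvl: proportion(lvl, True) - proportion(lvl, False) for lvl in ("low", "high")}

# Average the within-level differences, weighting by how many people are at each level.
sizes = {lvl: counts[(lvl, True)][1] + counts[(lvl, False)][1] for lvl in diffs}
b = sum(diffs[lvl] * sizes[lvl] for lvl in diffs) / sum(sizes.values())
print(diffs, round(b, 4))
```

Because each comparison is made among people with the same level of qualifications, differences in qualifications cannot account for the resulting estimate.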
More complex models: going beyond three variables
Clearly there are likely to be many other factors or 'variables' that will have an influence, both
on volunteering behavior and on social trust. For example, in the model discussed above we
have not considered gender or age, and both of these may have an impact on all of the variables
in our model. As can be seen from the discussion above, it becomes quite complicated even to
calculate the direct and indirect causal paths when we have a simple model with three variables.
We therefore need to go beyond these paper and pencil techniques if we are going to build more
complex models that aim to compare the impact of a number of different explanatory variables
on an outcome variable such as social trust. The following section describes the conceptual
foundations that underlie models to examine the factors influencing a simple dichotomous (two
category) variable.
Logistic regression models
Regression analysis is a method for predicting the values of a continuously distributed
dependent variable from an independent, or explanatory, variable. The principles behind
logistic regression are very similar and the approach to building models and interpreting the
models is virtually identical. However, whereas regression (more properly termed Ordinary
Least Squares regression, or OLS regression) is used when the dependent variable is
continuous, a binary logistic regression model is
used when the dependent variable can only take two values. In many examples this dependent
variable indicates whether an event occurs or not and logistic regression is used to model the
probability that the event occurs. In the example we have been discussing above, therefore,
logistic regression would be used to model the probability that an individual believes that most
people can be trusted. When we are just using a single explanatory variable, such as
volunteering, the logistic regression can be written as
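The equation itself is not preserved in these notes. For reference, the standard form of a binary logistic regression with one explanatory variable is given below. The symbols are assumptions rather than the notes' own notation: p is the probability that an individual believes most people can be trusted, X is 1 for volunteers and 0 otherwise, and a and b are the intercept and slope.

```latex
\[
\log\!\left(\frac{p}{1-p}\right) = a + bX
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(a + bX)}}
\]
```

The left-hand form models the log odds as a linear function of X; the right-hand form is the same model solved for the probability.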
12. What is the problem of attrition?
A major methodological issue for longitudinal studies in comparison with cross-sectional studies
is the problem of attrition, i.e. the drop out of participants through successive waves of a
prospective study.
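Attrition can be quantified as the share of the first-wave sample that is missing at each later wave. The sketch below assumes invented participant IDs and three waves; it is an illustration of the idea, not a prescribed procedure.

```python
# Hypothetical participant IDs observed at each wave of a prospective study.
waves = {
    1: {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"},
    2: {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"},
    3: {"p1", "p2", "p3", "p5", "p6", "p8"},
}

baseline = waves[1]  # everyone recruited at the first wave

# Attrition at a wave = fraction of the baseline sample no longer taking part.
attrition_rates = {
    wave: 1 - len(present & baseline) / len(baseline)
    for wave, present in waves.items()
}

for wave in sorted(attrition_rates):
    print(f"wave {wave}: attrition {attrition_rates[wave]:.0%}")
```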
13. What is Time Series Analysis?
Time series data includes timestamps and is often generated while monitoring an industrial
process or tracking any business metrics. An ordered sequence of timestamp values at equally
spaced intervals is referred to as a time series. Analysis of such a time series is used in many
applications such as sales forecasting, utility studies, budget analysis, economic forecasting,
inventory studies, and so on.
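The two ideas in this definition, ordered timestamps and equal spacing, can be sketched with the standard library alone. The daily values below are invented. The grouping step is a hand-rolled stand-in for the resampling named in the unit outline, which libraries such as pandas provide directly.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily metric: 60 equally spaced observations starting 2024-01-01.
start = date(2024, 1, 1)
series = [(start + timedelta(days=i), float(i)) for i in range(60)]

# Check the defining property: consecutive timestamps are equally spaced.
gaps = {later[0] - earlier[0] for earlier, later in zip(series, series[1:])}
assert gaps == {timedelta(days=1)}

# Downsample from daily to monthly by grouping on (year, month) and averaging.
buckets = defaultdict(list)
for stamp, value in series:
    buckets[(stamp.year, stamp.month)].append(value)
monthly_mean = {month: sum(vals) / len(vals) for month, vals in buckets.items()}
print(monthly_mean)
```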
14. Give an example of a Time Series Data Set.
A time series is a collection of observations made sequentially in time.
Let's take an example of time