QM3 Lecture 2 With Notes
Quantitative Methods 3 (EBS2001), 2016/2017, Lecture 2

Today
- Preparing Cases 3, 4, 5 & 6
- Sharpe Chs. 4, 15–18 (relevant parts)
- Some additional whistles and bells

Questionnaire excerpt:

gender    What is your gender?    1 Female    2 Male

Please rate the statements below on a 1 to 7 scale; 1 means "totally disagree" and 7 "totally agree".

satisfac  Overall, I am very satisfied with XYZ                          1 2 3 4 5 6 7
loyal1    For catering services at a party, XYZ will be my first choice  1 2 3 4 5 6 7
loyal2    In the future, I will make use of XYZ's services more often    1 2 3 4 5 6 7
loyal3    In the future, I will make use of XYZ's services less often    1 2 3 4 5 6 7
A) Basic example: a (small) part of Case 3)
- a survey among a sample of XYZ customers, fall 2003
- n = 471 respondents, many variables
- we consider the subset of 5 variables (i.e. questions) shown above

1) The items should not contain "too many" missing values
   here: OK for all three

2) To get a reliable construct, the items entering it should behave "similarly"
i) Informal analysis: a correlation matrix

Correlations
                                 loyal1    loyal2    loyal3
loyal1  Pearson Correlation      1.000     .843**    -.708**
        Sig. (2-tailed)          .         .000      .000
        N                        394       387       374
loyal2  Pearson Correlation      .843**    1.000     -.780**
        Sig. (2-tailed)          .000      .         .000
        N                        387       450       426
loyal3  Pearson Correlation      -.708**   -.780**   1.000
        Sig. (2-tailed)          .000      .000      .
        N                        374       426       429
** Correlation is significant at the 0.01 level (2-tailed).

Note: - correlation measures the relation between two items
      - the correlations are quite high, as it should be
      - "loyal3" is negatively correlated with the others
        ⇒ reverse scale problem
        ⇒ Transform > Compute Variable: "loyal3R" = 8 – "loyal3"

ii) Formal analysis: Cronbach's alpha

SPSS: Analyze > Scale > Reliability Analysis; choose loyal1, loyal2 and loyal3R as Items

Case Processing Summary
                     N     %
Cases  Valid         371   78.8
       Excluded(a)   100   21.2
       Total         471   100.0
a  Listwise deletion based on all variables in the procedure.

Reliability Statistics
Cronbach's Alpha   N of Items
.916               3

Note: - alpha measures the agreement between all items
      - 0 ≤ alpha ≤ 1;  0: items are totally unrelated, 1: items overlap fully
      - rule of thumb: reliable construct when alpha ≥ 0.75

Conclusion: use Transform > Compute Variable to define the construct
            "loyalty" = ( "loyal1" + "loyal2" + "loyal3R" ) / 3
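For readers working outside SPSS, a minimal Python sketch of the same reliability checks; the data are simulated, and the variable names simply follow the questionnaire above.

    import numpy as np
    import pandas as pd

    # Simulated 1-7 responses (hypothetical data, names from the questionnaire)
    rng = np.random.default_rng(0)
    base = rng.integers(1, 8, size=200)
    df = pd.DataFrame({
        "loyal1": np.clip(base + rng.integers(-1, 2, 200), 1, 7),
        "loyal2": np.clip(base + rng.integers(-1, 2, 200), 1, 7),
        "loyal3": np.clip(8 - base + rng.integers(-1, 2, 200), 1, 7),  # reverse-scaled item
    })

    # i) Informal analysis: the correlation matrix
    print(df.corr())

    # Reverse scale problem: loyal3R = 8 - loyal3
    df["loyal3R"] = 8 - df["loyal3"]

    # ii) Formal analysis: Cronbach's alpha
    def cronbach_alpha(items: pd.DataFrame) -> float:
        """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
        items = items.dropna()  # listwise deletion, as in SPSS
        k = items.shape[1]
        return k / (k - 1) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))

    print(cronbach_alpha(df[["loyal1", "loyal2", "loyal3R"]]))  # reliable if >= 0.75

    # Conclusion: define the construct as the item mean
    df["loyalty"] = df[["loyal1", "loyal2", "loyal3R"]].mean(axis=1)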
Regression analysis, Part I)
1) Regression: the basic idea
2) Inference about individual coefficients: the t-test
3) Using dummies to model qualitative explanatory variables
4) Dummy-interaction models
5) Measuring the fit of regression models

1) Regression: the basic idea

SPSS: Graphs > Legacy Dialogs > Scatter/Dot
      select Simple Scatter, click Define
      choose "loyalty" as Y, "satisfac" as X

⇒ there seems to be a positive relation, but not one-to-one
   many other factors relevant ⇒ scatter around a line

⇒ simple linear regression model:

   y = β0 + β1x + ε    (multiple regression model has k x's)

⇒ we want to estimate the regression coefficients β0 and β1
⇒ fit a line through the sample points

   ŷi = b0 + b1xi    i = 1,…,n

   hoping that the estimates b0 and b1 are close to β0 and β1

- intuition: choose estimates such that the vertical distances

   ei = yi – ŷi    ("residuals")

  become as small as possible

- formalization: choose the estimates such that the sum of squared vertical distances, SSE, becomes as small as possible

⇒ Minimize over b0, b1:   SSE = Σ ei² = Σ (yi – ŷi)² = Σ (yi – b0 – b1xi)²    (sums running over i = 1,…,n)

⇒ computers can solve such a minimization problem easily

⇒ least squares regression line or prediction equation:

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .917   .841       .840                .68255
Predictors: (Constant), satisfac

ANOVA
Model           Sum of Squares   df    Mean Square   F          Sig.
1   Regression  898.763          1     898.763       1929.193   .000
    Residual    170.044          365   .466
    Total       1068.808         366
Predictors: (Constant), satisfac
Dependent Variable: loyalty

Coefficients
                 Unstandardized Coefficients   Standardized Coefficients
Model            B      Std. Error             Beta                        t        Sig.
1   (Constant)   .918   .106                                               8.671    .000
    satisfac     .815   .019                   .917                        43.923   .000
Dependent Variable: loyalty
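As an aside, a minimal Python sketch of the same least squares fit, on simulated data (all numbers hypothetical; only the method mirrors the slide):

    import numpy as np
    import statsmodels.api as sm

    # Simulate y = beta0 + beta1*x + eps with values near the slide's estimates
    rng = np.random.default_rng(1)
    satisfac = rng.integers(1, 8, size=400).astype(float)
    loyalty = 0.9 + 0.8 * satisfac + rng.normal(0, 0.7, size=400)

    X = sm.add_constant(satisfac)                # intercept column plus satisfac
    fit = sm.OLS(loyalty, X).fit()               # minimizes SSE = sum of squared residuals
    print(fit.params)                            # b0, b1 (slide: 0.918 and 0.815)
    print(fit.rsquared, np.sum(fit.resid ** 2))  # R Square and the minimized SSE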
2) Inference about individual coefficients: the t-test

- LS regression line: loyâlty = 0.918 + 0.815satisfac

"what does this line, fitted on the basis of a sample, tell us about the relationship between 'satisfac' and 'loyalty' in the population?"

⇒ inferential statistics!

- any other sample would give us different estimates b0 and b1
⇒ we need the sampling distribution of these estimates!

- decisive feature: the properties of the error term ε in

   y = β0 + β1x + ε

- the error ε reflects the effect on y of all factors other than x
⇒ for any value of x, infinitely many distinct ε-values can occur
⇒ we can describe the behaviour of the error with a probability distribution
⇒ this distribution has to satisfy some properties, to …

If… the following regression assumptions are satisfied:

1) The linearity assumption: the true relation must be linear
   ⇒ check the Linearity condition
2) The independence assumption: the errors must be independent of each other
   ⇒ check the Randomization condition
3) The equal variance assumption: for all values of x, the errors have the same spread σε
   ⇒ check the Equal spread condition
4) The normality assumption: for all values of x, the errors follow a normal model
   ⇒ check the Nearly Normal (not so critical when the sample size n is large ⇒ CLT) and Outlier conditions

Then… the sampling distribution of the slope coefficient b1 looks as follows:

   ( b1 – β1 ) / SE(b1)  ~  t-distribution with n – k – 1 df

- SE(b1), the "standard error of b1", is calculated by SPSS (here: 0.019)
- this sampling distribution acts as the basis for inference!
- check the assumptions with residual analysis (later on)
⇒ Tool 1: Test statistic for testing H0: β1 = β1,0:

   t = ( b1 – β1,0 ) / SE(b1)

   - decision rules as in QM2, where for a standard normal test statistic (centered around 0):
     Option a) Critical values: e.g. reject when z < –zα
     Option b) The P-value: "the lower the P-value, the more evidence against the null!"

⇒ Tool 2: 100(1–α)% Confidence Interval for β1:

   point estimate ± critical value × standard error = [ b1 ± tα/2 × SE(b1) ]    (hardly used in QM3)
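A minimal sketch of both tools, plugging in the slide's numbers (b1 = 0.815, SE(b1) = 0.019, n – k – 1 = 365 df); scipy is assumed available:

    from scipy import stats

    b1, se_b1, df_resid = 0.815, 0.019, 365

    # Tool 1: test H0: beta1 = 0
    t_stat = (b1 - 0) / se_b1
    p_value = 2 * stats.t.sf(abs(t_stat), df_resid)  # two-tailed P-value

    # Tool 2: 95% confidence interval [b1 +/- t(alpha/2) * SE(b1)]
    ci = stats.t.interval(0.95, df_resid, loc=b1, scale=se_b1)

    print(t_stat, p_value, ci)  # t ~ 42.9 from the rounded inputs (SPSS: 43.923)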
3) Using dummies to model qualitative explanatory variables

Model 1): loyalty = β0 + β1satisfac + ε

- new element: qualitative (nominal) variable "gender"
   genderi = 1   person i is a woman
   genderi = 2   person i is a man

- we suspect: maybe women are more loyal than men, even when equally satisfied

⇒ how to include qualitative variables in a regression model?

- cannot include them directly: their levels are nonnumerical

- crucial tool: dummy variables, each indicating a single level
   1 if a person has a particular level
   0 if not

- general rule: if a qualitative variable has k levels, then
   - choose an arbitrary base level
   - include dummies for the remaining k–1 levels

- here: define a "female"-dummy, men as base level
  SPSS: Transform > Recode into Different Variables

Model 2): loyalty = β0 + β1satisfac + β2female + ε

- multiple regression, trivial generalization of simple regression

SPSS: Analyze > Regression > Linear
      choose "loyalty" as Dependent, "satisfac" and "female" as Independent

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
2       .921   .848       .847                .66560
Predictors: (Constant), female, satisfac

ANOVA
Model           Sum of Squares   df    Mean Square   F         Sig.
2   Regression  883.730          2     441.865       997.392   .000
    Residual    158.158          357   .443
    Total       1041.889         359
Predictors: (Constant), female, satisfac
Dependent Variable: loyalty

Coefficients
                 Unstandardized Coefficients   Standardized Coefficients
Model            B       Std. Error            Beta                        t        Sig.
2   (Constant)   .794    .113                                              7.043    .000
    satisfac     .840    .019                  .921                        44.663   .000
    female       -.046   .070                  -.013                       -.652    .515
Dependent Variable: loyalty
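A minimal sketch of the dummy recode and Model 2 on simulated data (hypothetical values; "gender" coded 1 = female, 2 = male as in the questionnaire):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 360
    d = pd.DataFrame({"satisfac": rng.integers(1, 8, n).astype(float),
                      "gender": rng.integers(1, 3, n)})
    d["female"] = (d["gender"] == 1).astype(int)  # recode: men (gender = 2) are the base level
    d["loyalty"] = 0.8 + 0.84 * d["satisfac"] - 0.05 * d["female"] + rng.normal(0, 0.66, n)

    fit = smf.ols("loyalty ~ satisfac + female", data=d).fit()
    print(fit.summary())  # compare with the B, Std. Error, t and Sig. columns above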
Fitted line: loyâlty = 0.794 + 0.840satisfac – 0.046female

- men:    female = 0 → loyâlty = 0.794 + 0.840satisfac
- women:  female = 1 → loyâlty = 0.794 + 0.840satisfac – 0.046 = 0.748 + 0.840satisfac

⇒ graphically: parallel lines for both sexes, but with different intercepts
   - the vertical distance equals the dummy coefficient, (–)0.046
   [schematic: two parallel fitted lines, "men" slightly above "women"]

- different intercept: "among women, loyalty is on average 0.046 pts. lower (!) than among men with the same satisfaction"
- same slope: "on average, an extra point of satisfaction increases loyalty by 0.840 points for both men and women"

Note: the intercept difference is not significant (P-value = 0.515)

4) Dummy-interaction models

Fitted Model 2): loyâlty = 0.794 + 0.840satisfac – 0.046female

- we suspect: maybe different slope for men and women
   [schematic: "men" and "women" lines with different slopes]
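The interaction model itself is only sketched graphically above; a minimal Python version (hypothetical data, continuing the previous sketch's names) adds a product term so the slope may differ by sex:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 360
    d = pd.DataFrame({"satisfac": rng.integers(1, 8, n).astype(float),
                      "female": rng.integers(0, 2, n)})
    d["loyalty"] = 0.8 + (0.84 - 0.10 * d["female"]) * d["satisfac"] + rng.normal(0, 0.66, n)

    # satisfac:female is the product satisfac*female, giving women their own slope:
    # loyalty = b0 + b1*satisfac + b2*female + b3*(satisfac*female)
    fit = smf.ols("loyalty ~ satisfac + female + satisfac:female", data=d).fit()
    print(fit.params)  # b3 estimates the slope difference between women and men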
Model A): bikei = β0 + β1inhabi + β2educi + εi    i = 1,…,61

SPSS: Analyze > Regression > Linear
      choose "bike" as Dependent, "inhab" and "educ" as Independent
      under Save, tick Unstandardized Residuals and Unstandardized Predicted Values (!!!)

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
A       .981   .961       .960                1195.28
Predictors: (Constant), educ, inhab
Dependent Variable: bike

Coefficients
                 Unstandardized Coefficients   Standardized Coefficients
Model            B           Std. Error        Beta                        t        Sig.
A   (Constant)   -1246.849   237.395                                       -5.252   .000
    inhab        4.793E-02   .002              .945                        26.229   .000
    educ         360.425     258.303           .050                        1.395    .168
Dependent Variable: bike

- regression coefficients have nice properties, provided the four regression assumptions (slide 10) are roughly satisfied
- if seriously violated, a reformulation is needed (e.g. transformations)
- all these assumptions involve the properties of the error
⇒ verify, using the residuals as observable point estimates:

   errors:     εi = bikei – β0 – β1inhabi – β2educi    i = 1,…,61
   residuals:  ei = bikei – b0 – b1inhabi – b2educi    i = 1,…,61

- tools: residual plots (residuals against x and ŷ) → assumptions 1) and 3)
         SPSS: Graphs > Legacy Dialogs > Scatter
         histogram of the residuals → assumption 4)
         SPSS: Graphs > Legacy Dialogs > Histogram, tick Display normal curve

- residual analysis also helps to identify outliers (observations well-separated from the others)
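A minimal sketch of these diagnostics in Python; the data are simulated with multiplicative noise so that the level regression shows the fanning-out pattern discussed next (all names hypothetical):

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    towns = pd.DataFrame({"inhab": rng.uniform(3e4, 8e5, 61),
                          "educ": rng.uniform(0.0, 1.0, 61)})
    towns["bike"] = 0.02 * towns["inhab"] ** 1.05 * np.exp(rng.normal(0, 0.15, 61))

    fit = smf.ols("bike ~ inhab + educ", data=towns).fit()  # residuals = observed - fitted

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].scatter(towns["inhab"], fit.resid)    # residuals vs x: assumptions 1) and 3)
    axes[1].scatter(fit.fittedvalues, fit.resid)  # residuals vs y-hat: same checks
    axes[2].hist(fit.resid, bins=15)              # histogram: assumption 4) and outliers
    plt.show()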
[Figure: three residual plots for Model A — residuals against the explanatory variables and against the predicted values]

1) The linearity assumption
   weak signs of nonlinearity in the "inhab" and "predicted value" plots

3) The equal variance assumption
   from left to right in the "inhab" and "predicted value" plots, the residuals seem to "fan out" ⇒ nonconstant variance?

[Figure: histogram of the unstandardized residuals, roughly bell-shaped, running from about –5000 to 5000]

4) The normality assumption
   the histogram looks reasonably bell-shaped and symmetric ⇒ OK

Outliers
   all plots, esp. the histogram, show two extreme residuals: Amsterdam: e1 = 5168, The Hague: e3 = –5242

2) The use of natural logarithms in regression models

Model A): slight nonlinearity, nonconstant variance, 2 outliers

Model B): ln(bikei) = γ0 + γ1ln(inhabi) + γ2educi + εi
   "natural logarithms": ln x = ᵉlog x (log to base e), e ≈ 2.718

⇒ run the regression in terms of these new variables

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
B       .985   .970       .969                .11060
Predictors: (Constant), educ, lninhab
Dependent Variable: lnbike

Coefficients
                 Unstandardized Coefficients   Standardized Coefficients
Model            B           Std. Error        Beta                        t        Sig.
B   (Constant)   -4.004      .404                                          -9.907   .000
    lninhab      1.059       .037              .940                        29.010   .000
    educ         4.675E-02   .025              .062                        1.903    .062
Dependent Variable: lnbike
[Figure: the same residual plots for Model B — no clear patterns remain]

Conclusion: - model B) looks OK; "educ" is now marginally significant
            - the logarithmic transformations solve the problems in model A):
               - slight nonlinearity
               - nonconstant variance
               - outliers
Fitted Model B): lnbîke = –4.004 + 1.059lninhab + 0.047educ

- rewrite Model B) by taking the antilog on both sides:

   ln(bîke) = –4.004 + 1.059ln(inhab) + 0.047educ  ⇔  bîke = e^(–4.004) · inhab^1.059 · e^(0.047educ)

cf. the CD production function in Case 6)!

Special effect ii): testing for proportionality in Model B)

- plugging the restriction γ1 = 1 into Model B) gives

   ln( bike / inhab ) = γ0 + γ2educ + ε

⇒ response variable is the ratio of stolen bikes per head

cf. item f) of Case 6)!
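A minimal sketch of Model B and a test of the proportionality restriction γ1 = 1, on the same kind of simulated data as the residual-analysis sketch (hypothetical numbers, so the outcome need not match the slides):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    towns = pd.DataFrame({"inhab": rng.uniform(3e4, 8e5, 61),
                          "educ": rng.uniform(0.0, 1.0, 61)})
    towns["bike"] = 0.02 * towns["inhab"] ** 1.05 * np.exp(rng.normal(0, 0.15, 61))

    towns["lnbike"] = np.log(towns["bike"])
    towns["lninhab"] = np.log(towns["inhab"])
    fit_b = smf.ols("lnbike ~ lninhab + educ", data=towns).fit()

    # t-test of H0: gamma1 = 1 (proportionality): t = (g1 - 1) / SE(g1)
    t_prop = (fit_b.params["lninhab"] - 1) / fit_b.bse["lninhab"]
    print(fit_b.params["lninhab"], t_prop)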
C) Basic example: a (small) part of Case 5)

- Duographic BV produces custom-printed T-shirts
- T = 48 (weekly) or T = 12 (monthly) observations on six distinct outputs and seven cost drivers
- we only consider the cost driver "sales orders" here

Model 1): SMt = β1clr1t + β2clr2t + β3clr3t + β4clr4t + β5clr5t + β6clr6t + εt    t = 1,…,48

- SM: # sales orders per week
- clr1: # 1-colour T-shirts produced per week, etc.
- no intercept (see start case for motivation)
⇒ reported R², R²adj and overall F-statistic are misleading (SPSS redefines SST)
⇒ use se for model comparison
⇒ t-tests and partial F-tests are still valid

SPSS: Analyze > Regression > Linear; choose "SM" as Dependent, "clr1" – "clr6" as Independent
      under Options, deselect "Include constant in equation"

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .987   .973       .969                10.424

ANOVA
Model           Sum of Squares   df   Mean Square   F         Sig.
1   Regression  165934.357       6    27655.726     254.520   .000
    Residual    4563.643         42   108.658
    Total       170498.000       48
Dependent Variable: sm

Coefficients
           Unstandardized Coefficients   Standardized Coefficients
Model      B      Std. Error             Beta                        t       Sig.
1   clr1   .051   .005                   .662                        9.342   .000
    clr2   .039   .014                   .150                        2.856   .007
    clr3   .032   .017                   .081                        1.910   .063
    clr4   .030   .032                   .037                        .924    .361
    clr5   .040   .048                   .039                        .834    .409
    clr6   .051   .022                   .111                        2.309   .026
Dependent Variable: sm

Regression analysis, Part III)
1) Collinearity
2) Time series data: checking for autocorrelation
3) Inference about several coefficients: the partial F-test
1) Collinearity

- if explanatory variables are strongly linearly related
⇒ they measure almost the same thing ("overlap")
⇒ their individual effects are unstable / estimated imprecisely
⇒ high standard errors of their coefficients, low t ratios
⇒ the variables seem irrelevant, even when they are relevant

Seems no problem here (high t ratios!), but check nonetheless. Two detection tools discussed in QM2:

i) Inspect the correlations among the explanatory variables
   determine the correlation matrix: potential problems if there are correlations > 0.9 (not here)

ii) Determine Variance Inflation Factors of all explanatory variables (a sketch follows below)

   VIFj = 1 / (1 – Rj²)    j = 1,…,k

   - Rj²: the R² of a regression with Xj as response against all other explanatory variables
   - high VIFj (say: 10 or higher) ⇒ high Rj², so Xj must be very similar to the other explanatory variables ⇒ high collinearity
   - the reported VIF-values are misleading in models without intercept
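A minimal sketch of the VIF computation (simulated data; statsmodels' variance_inflation_factor implements exactly VIFj = 1 / (1 – Rj²)):

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(4)
    X = pd.DataFrame({"x1": rng.normal(size=100)})
    X["x2"] = 0.95 * X["x1"] + rng.normal(scale=0.3, size=100)  # strongly related to x1
    X["x3"] = rng.normal(size=100)                              # unrelated

    Xc = np.column_stack([np.ones(100), X])  # add an intercept: reported VIFs are
    for j, name in enumerate(X.columns, 1):  # misleading in models without one
        print(name, variance_inflation_factor(Xc, j))  # x1, x2 high; x3 near 1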
2) Time series data: checking for autocorrelation

Regression assumption 2), slide 10, is very relevant for time series data:

   The independence assumption: the errors must be independent of each other
   ⇒ check the Randomization condition

Time series data often suffer from (first-order) autocorrelation:
- each error is related to (i.e. dependent on) the previous error
- formally: εt = φεt–1 + at,  with at an error without autocorrelation
- φ > 0: positive autocorrelation
   - if an error is positive, the next one tends to be positive also
   - if an error is negative, the next one tends to be negative also
   ⇒ successive errors tend to resemble each other
- φ < 0: negative autocorrelation
   - if an error is positive, the next one tends to be negative
   - if an error is negative, the next one tends to be positive
   ⇒ successive errors tend to mirror each other

- if there is autocorrelation, the usual t- and F-tests are no longer valid
⇒ try to detect it!
⇒ again take the residuals as point estimates for the errors
A graphical detection tool: plot the residuals against their own lagged values
SPSS: Transform > Compute Variable, function "LAG" → "lagres"
      Graphs > Legacy Dialogs > Scatter/Dot

Model 1): [scatter of the residuals against LAGRES: no clear pattern]
⇒ no problem (maybe slightly negative?)

Counterexample (reported for contrast):

   mortgaget = δ0 + δ1interestt + εt    t = 1,…,36 (quarterly)

   [scatter of the residuals against LAGRES: clearly upward-sloping cloud, roughly –250 to 250 on both axes]

⇒ positive autocorrelation!  (a simulated lag plot follows below)
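A minimal sketch of such a lag plot, with residuals simulated under φ = 0.7 so that the cloud slopes upward as in the counterexample:

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(5)
    e = np.zeros(48)
    for t in range(1, 48):
        e[t] = 0.7 * e[t - 1] + rng.normal()  # eps_t = phi*eps_{t-1} + a_t, phi > 0

    plt.scatter(e[:-1], e[1:])  # residuals against their own lagged values
    plt.xlabel("lagged residual")
    plt.ylabel("residual")
    plt.show()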
3) Inference about several coefficients: the partial F-test

Model 1): SMt = β1clr1t + β2clr2t + β3clr3t + β4clr4t + β5clr5t + β6clr6t + εt

Basic idea: - test a reduced model against the complete model 1)
            - the reduced model is a special case of the complete model
            - fit them both, compare their SSE with an F-statistic

QM2: the standard case, e.g. H0: β4 = 0, β5 = 0
⇒ reduced model: plug H0 into the complete model (here: drop clr4 and clr5)

QM3: more subtle cases, e.g. H0: β2 = β1, β3 = β1, β4 = β1, β5 = β1, β6 = β1
     "all cost driver factors are the same"
     looks very different, but still implies a special case!

⇒ reduced model 2): (plug H0 into the complete model)

   SMt = β1clr1t + β1clr2t + β1clr3t + β1clr4t + β1clr5t + β1clr6t + εt
       = β1( clr1t + clr2t + … + clr6t ) + εt
       = β1shirtst + εt
with shirtst = clr1t + clr2t + … + clr6t
"for this cost driver, only total output is important"

SPSS: Transform > Compute Variable, define "shirts"
      Analyze > Regression > Linear; choose "SM" as Dependent, "shirts" as Independent
      under Options, deselect "Include constant in equation"

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
2       .986   .972       .972                10.013

ANOVA
Model           Sum of Squares   df   Mean Square   F          Sig.
2   Regression  165785.673       1    165785.673    1653.520   .000
    Residual    4712.327         47   100.262
    Total       170498.000       48
Dependent Variable: sm

Coefficients
             Unstandardized Coefficients   Standardized Coefficients
Model        B      Std. Error             Beta                        t        Sig.
2   shirts   .046   .001                   .986                        40.663   .000
Dependent Variable: sm

Intuition: compare the SSE's of both models 1) and 2)

Model 1): "complete model"   SSEc = 4563.643   k = 6 explanatory variables (slide 30)
Model 2): "reduced model"    SSEr = 4712.327   g = 1 explanatory variable (slide 35)

- always SSEr > SSEc (fewer explanatory variables)
- if H0 is true, we expect SSEr – SSEc to be relatively small
- if SSEr – SSEc is relatively large, H0 is suspect ⇒ reject

Formalization: the Partial F Test

   F = [ (SSEr – SSEc) / (k – g) ] / [ SSEc / (n – k) ] = [ (4712.327 – 4563.643) / (6 – 1) ] / [ 4563.643 / (48 – 6) ] = 0.274

- if H0 is true, the F-statistic has an F-distribution with k – g = 5 numerator-df and n – k = 42 ≈ 40 denominator-df
- note: n – k – 1 denominator-df in models with intercept; n – k without
  general rule: df = # observations – # estimated coefficients

⇒ is the actual value of 0.274 "far enough" above 0?
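A minimal sketch of the computation with the slide's numbers (scipy assumed for the F-distribution):

    from scipy import stats

    sse_c, sse_r = 4563.643, 4712.327  # complete and reduced model (slides 30 and 35)
    n, k, g = 48, 6, 1                 # no intercept, so denominator df = n - k

    F = ((sse_r - sse_c) / (k - g)) / (sse_c / (n - k))
    p_value = stats.f.sf(F, k - g, n - k)
    print(F, p_value)  # F = 0.274; the P-value is far above 10% => don't reject H0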
Option a) Critical values: F > Fα?
   e.g. α = 5%  ⇒ F0.05 = 2.45 ⇒ don't reject H0
        α = 10% ⇒ F0.10 = 2.00 ⇒ don't reject H0

Option b) The P-value:
   "if the null were true, there would be much more than 10% chance to obtain an F-statistic ≥ 0.274"
   ⇒ don't reject H0

Note:
- in model 2), b1 = 0.046, similar to the estimates b1, b2, etc. in model 1)
- two even trickier null hypotheses later in Case 5)
- don't mix things up with the overall F-test (ANOVA-table)