Flexible Data Models: Dummy Variables and Interaction Effects
Questions:
What can we learn about the relationship between amount spent and the other variables?
What would be an intuitive simple regression model that we would want to investigate first?
Scatterplots
[Figure: scatterplot matrix of Salary, Children, Catalogs, and AmountSpent]
A Very Basic Model
Hypothesis: Customers receiving more catalogs are expected to spend more on purchases.
Ha: ∆AmountSpent / ∆Catalogs > 0
m1 <- lm(AmountSpent ~ Catalogs, data=d)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  209.766     65.194   3.218  0.00133 **
Catalogs      68.588      4.048  16.944  < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 847.4 on 998 degrees of freedom
Multiple R-squared: 0.2234, Adjusted R-squared: 0.2226
F-statistic: 287.1 on 1 and 998 DF, p-value: < 2.2e-16
Questions:
What do we learn from the model about the relationship between Catalogs and AmountSpent?
Is this a good model? Can we have a better model?
Controlling for Other Numeric Variables
But amount spent should also depend on customers’ salary, number of children, etc.
Higher Salary should result in higher AmountSpent: ∆AmountSpent / ∆Salary > 0
More Children should result in lower AmountSpent: ∆AmountSpent / ∆Children < 0
So we should control for Salary and Children
m2 <- lm(AmountSpent ~ Catalogs + Salary + Children, data=d)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.428e+02  5.372e+01  -8.242 5.29e-16 ***
Catalogs     4.770e+01  2.755e+00  17.310  < 2e-16 ***
Salary       2.041e-02  5.929e-04  34.417  < 2e-16 ***
Children    -1.987e+02  1.709e+01 -11.628  < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 562.5 on 996 degrees of freedom
Multiple R-squared: 0.6584, Adjusted R-squared: 0.6574
F-statistic: 640 on 3 and 996 DF, p-value: < 2.2e-16
Questions:
Is this a better model than Model 1?
How did adding Salary and Children change the effect of Catalogs compared to not having the controls?
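One way to approach the "better model" question is to compare adjusted R-squared and run an F-test on the nested models. The sketch below is only an illustration: the data frame d_sim and every number used to generate it are made up as a stand-in for the catalog data, and the names m1_sim and m2_sim are hypothetical.

```r
# Simulated stand-in for the catalog data set; all numbers here are invented.
set.seed(42)
n <- 500
d_sim <- data.frame(
  Catalogs = sample(c(6, 12, 18, 24), n, replace = TRUE),
  Salary   = runif(n, 10000, 150000),
  Children = sample(0:3, n, replace = TRUE)
)
d_sim$AmountSpent <- 50 * d_sim$Catalogs + 0.02 * d_sim$Salary -
  200 * d_sim$Children + rnorm(n, sd = 500)

m1_sim <- lm(AmountSpent ~ Catalogs, data = d_sim)
m2_sim <- lm(AmountSpent ~ Catalogs + Salary + Children, data = d_sim)

# Adjusted R-squared penalizes extra predictors, so it can be compared across models
adj_r2 <- c(m1 = summary(m1_sim)$adj.r.squared,
            m2 = summary(m2_sim)$adj.r.squared)
print(adj_r2)

# F-test of the nested models: do Salary and Children jointly add explanation?
cmp <- anova(m1_sim, m2_sim)
print(cmp)
```

In the output reported on the slides, the same comparison shows adjusted R-squared rising from 0.2226 (Model 1) to 0.6574 (Model 2).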
How Much Explanation Does Catalogs Add?
Model with Catalogs:
m2 <- lm(AmountSpent ~ Catalogs + Salary + Children, data=d)
lm(AmountSpent ~ as.factor(Age), data=d)
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          1501.69      38.42  39.085   <2e-16 ***
as.factor(Age)Old     -69.56      71.65  -0.971    0.332
as.factor(Age)Young  -943.07      63.95 -14.748   <2e-16 ***
Questions:
What does the coefficient on as.factor(Age)Old (𝛽1 = -69.56) mean?
A. The average amount spent by an old customer.
B. The average amount spent by a middle-aged customer.
C. The average amount spent by an old customer relative to a middle-aged customer.
D. The average amount spent by an old customer relative to a young customer.
Which age group spends the most: young, middle-aged, or old?
Interpreting Dummy Coefficients
m5 <- lm(AmountSpent ~ as.factor(Age) + Catalogs, data=d)
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          575.653     67.046   8.586   <2e-16 ***
as.factor(Age)Old    -53.702     63.895  -0.840    0.401
as.factor(Age)Young -798.937     57.716 -13.843   <2e-16 ***
Catalogs              60.034      3.736  16.068   <2e-16 ***
𝛽0 = 576: predicted amount spent by middle-aged customers who receive zero catalogs.
𝛽1 = -54: difference in amount spent by old customers relative to middle-aged customers who receive the same number of catalogs (576 - 54 = 522 at zero catalogs).
The same model with Young as the reference group:
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          -223.284     66.673  -3.349 0.000842 ***
as.factor(Age)Middle  798.937     57.716  13.843  < 2e-16 ***
as.factor(Age)Old     745.235     71.056  10.488  < 2e-16 ***
Catalogs               60.034      3.736  16.068  < 2e-16 ***
Questions:
How do the results change?
Which age group spends the most: young, middle-aged, or old?
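The two coefficient tables above differ only in which age group serves as the reference. The slides do not show how the reference was switched; one common way in R is relevel(). The sketch below uses simulated stand-in data (d_sim, the shift vector, and the AgeY column are all hypothetical) to show that changing the reference relabels the dummy coefficients without changing the fit.

```r
# Simulated stand-in data; the group effects and slope below are invented.
set.seed(1)
n <- 300
d_sim <- data.frame(
  Age      = sample(c("Young", "Middle", "Old"), n, replace = TRUE),
  Catalogs = sample(c(6, 12, 18, 24), n, replace = TRUE)
)
shift <- c(Young = 0, Middle = 800, Old = 750)   # hypothetical group effects
d_sim$AmountSpent <- 60 * d_sim$Catalogs +
  unname(shift[as.character(d_sim$Age)]) + rnorm(n, sd = 300)

# Default: factor levels are sorted alphabetically, so "Middle" is the reference
fit_mid <- lm(AmountSpent ~ as.factor(Age) + Catalogs, data = d_sim)

# relevel() chooses a different reference level; the fitted values are unchanged
d_sim$AgeY <- relevel(as.factor(d_sim$Age), ref = "Young")
fit_young <- lm(AmountSpent ~ AgeY + Catalogs, data = d_sim)

print(coef(fit_mid))
print(coef(fit_young))
```

The same identity is visible in the slide output: Young relative to Middle is -798.937, Middle relative to Young is +798.937, and the Catalogs coefficient (60.034) is identical in both tables.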
Numeric and Dummy Variables
m6 <- lm(AmountSpent ~ Catalogs + Salary + as.factor(Age), data=d)
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)          -6.876e+02  5.398e+01 -12.738  < 2e-16 ***
Catalogs              5.145e+01  2.880e+00  17.866  < 2e-16 ***
Salary                2.079e-02  7.840e-04  26.516  < 2e-16 ***
as.factor(Age)Middle -1.018e+02  5.575e+01  -1.826  0.06813 .
as.factor(Age)Old     1.680e+02  5.861e+01   2.866  0.00424 **
Questions:
What does the coefficient on as.factor(Age)Middle (𝛽3 = -102) mean?
A. Middle-aged customers spend the least.
B. Middle-aged customers spend less than young customers.
C. We should ignore this coefficient because the p-value is not significant.
D. Middle-aged customers spend less than young customers with the same salary level and same number of catalogs.
Why did the coefficient of Middle change sign?
Prediction with Dummy Variables
m6 <- lm(AmountSpent ~ Catalogs + Salary + as.factor(Age), data=d)
Questions:
What is the amount spent by an old person with a salary of $50,000 receiving no catalogs per year?
What is the amount spent by a young person with a salary of $30,000 receiving 1 catalog per year?
What is the amount spent by a middle-aged person with a salary of $80,000 and 2 kids?
What is the difference in amount spent by an old and a young person with the same salary and the same number of catalogs?
What is the difference in amount spent by an old person receiving 1 catalog and a middle-aged person receiving 2 catalogs, when controlled for salary?
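A minimal sketch of how the m6 coefficients turn into a prediction, using the estimates from the m6 output. The helper predict_m6 and the vector b are hypothetical names introduced for illustration; with the fitted model and the data frame d at hand, predict(m6, newdata = ...) would be the usual route.

```r
# Coefficients copied from the m6 output (Young is the reference age group)
b <- c(intercept = -687.6, catalogs = 51.45, salary = 0.02079,
       middle = -101.8, old = 168.0)

# Hypothetical helper: plug values into the fitted equation by hand.
# The age dummies are 1 for the matching group and 0 otherwise.
predict_m6 <- function(catalogs, salary, age = c("Young", "Middle", "Old")) {
  age <- match.arg(age)
  unname(b["intercept"] + b["catalogs"] * catalogs + b["salary"] * salary +
           b["middle"] * (age == "Middle") + b["old"] * (age == "Old"))
}

# First question: old customer, $50,000 salary, no catalogs
print(predict_m6(catalogs = 0, salary = 50000, age = "Old"))
```

For the first question this works out to -687.6 + 0.02079 × 50000 + 168.0, or about 519.9. The remaining questions follow by changing the arguments; differences between groups at the same Salary and Catalogs reduce to differences in the dummy coefficients.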
Multiple Numeric and Dummies
m7 <- lm(AmountSpent ~ Salary + Catalogs + Children + as.factor(Age) +
         as.factor(Gender) + as.factor(Married) + as.factor(Location), data=d)
Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)              -6.058e+02  7.501e+01  -8.076 1.92e-15 ***
Salary                    2.251e-02  9.517e-04  23.648  < 2e-16 ***
Catalogs                  4.314e+01  2.549e+00  16.922  < 2e-16 ***
Children                 -2.008e+02  1.724e+01 -11.645  < 2e-16 ***
as.factor(Age)Middle     -8.208e+01  5.062e+01  -1.622    0.105
as.factor(Age)Old        -2.038e+01  5.400e+01  -0.377    0.706
as.factor(Gender)Male    -4.197e+01  3.467e+01  -1.211    0.226
as.factor(Married)Single  6.759e+01  4.677e+01   1.445    0.149
as.factor(Location)Far    5.071e+02  3.622e+01  14.001  < 2e-16 ***
Customers living far spend more (larger intercept) than those living close to competing stores (smaller intercept).
However, customers living far spend at the same rate as customers living close (same slope with respect to Salary).
[Figure: AmtSpent vs. Salary, with lines for Far and Close]
Do you think this is reasonable/realistic?
Are other scenarios plausible? Which ones?
Is the spending rate higher for customers who live far away than for those living close?
[Figure: AmtSpent vs. Salary, with lines for Far and Close]
How to answer this question?
Add an interaction term, computed as the product of the values in the Salary and Location columns:
Salary*as.factor(Location)
Examine whether the interaction effect is significant.
Interaction Between Numeric and Dummy Variables
m8 <- lm(AmountSpent ~ Salary + Catalogs + as.factor(Location) +
         Salary*as.factor(Location), data=d)
Questions:
If the salary of a customer who lives close increases by $10,000, what is the predicted increase in amount spent?
If the salary of a customer who lives far increases by $10,000, what is the predicted increase in amount spent?
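The slides do not report the m8 estimates, so the sketch below uses simulated stand-in data to show how the two answers are read off an interaction model. The data frame d_sim and the slopes 0.02 (Close) and 0.03 (Far) are invented, and Catalogs is left out to keep the example short.

```r
# Simulated stand-in data; the slopes of 0.02 (Close) and 0.03 (Far) are invented.
set.seed(7)
n <- 400
d_sim <- data.frame(
  Salary   = runif(n, 20000, 120000),
  Location = sample(c("Close", "Far"), n, replace = TRUE)
)
true_slope <- ifelse(d_sim$Location == "Far", 0.03, 0.02)
d_sim$AmountSpent <- 100 + true_slope * d_sim$Salary + rnorm(n, sd = 100)

# Salary * Location expands to both main effects plus their interaction
fit <- lm(AmountSpent ~ Salary * as.factor(Location), data = d_sim)
b <- coef(fit)

# Salary slope for the reference group (Close), and for Far (main effect + interaction)
slope_close <- unname(b["Salary"])
slope_far   <- unname(b["Salary"] + b["Salary:as.factor(Location)Far"])

# Predicted change in AmountSpent for a $10,000 increase in Salary, by location
print(c(Close = 10000 * slope_close, Far = 10000 * slope_far))
```

For customers living close, the answer is 10,000 times the Salary coefficient. For customers living far, it is 10,000 times the sum of the Salary coefficient and the interaction coefficient.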
Interaction Between Numeric Variables
m9 <- lm(AmountSpent ~ Salary + Catalogs + Children + Salary*Catalogs, data=d)
Questions:
What is the marginal effect of children on amount spent?
What is the marginal effect of catalogs on amount spent?
Draw two graphs to show the difference between the two questions above.
Does the concept of a marginal effect apply to dummy variables?
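Writing m9 out term by term may help with the two marginal-effect questions. The coefficient labels β0 to β4 below are an assumed numbering chosen for this illustration, not taken from the slides.

```latex
\begin{aligned}
\text{AmountSpent} &= \beta_0 + \beta_1\,\text{Salary} + \beta_2\,\text{Catalogs}
  + \beta_3\,\text{Children} + \beta_4\,(\text{Salary}\times\text{Catalogs}) + \varepsilon \\
\frac{\partial\,\text{AmountSpent}}{\partial\,\text{Children}} &= \beta_3 \\
\frac{\partial\,\text{AmountSpent}}{\partial\,\text{Catalogs}} &= \beta_2 + \beta_4\,\text{Salary}
\end{aligned}
```

Children enters only as a main effect, so its marginal effect is a constant. Catalogs also appears in the interaction, so its marginal effect changes with Salary.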
How Messy Can It Get?
m8 <- lm(AmountSpent ~ Salary + Catalogs + Children + Gender + Married +
         Catalogs*Salary + Catalogs*Children + Catalogs*Salary*Children +
         Catalogs*Gender + Catalogs*Gender*Salary, data=d)
[Figure: Salary by gender]
Model 1: Main Effect of Gender
m1 <- lm(Salary ~ as.factor(Gender), data=d);
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.58 on 206 degrees of freedom
Multiple R-squared: 0.1201, Adjusted R-squared: 0.1158
F-statistic: 28.12 on 1 and 206 DF, p-value: 2.935e-07
Questions:
Based on the above model, is there gender discrimination at this bank?
If so, how much are female employees discriminated against?
Is this a good model?
Is there anything missing from this analysis? Elaborate.
Model 2: Controlling for Experience
m2 <- lm(Salary ~ Experience + as.factor(Gender), data=d);
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.454 on 205 degrees of freedom
Multiple R-squared: 0.4413, Adjusted R-squared: 0.4359
F-statistic: 80.98 on 2 and 205 DF, p-value: < 2.2e-16
Questions:
Based on this model, does gender discrimination exist?
If so, how much are female employees discriminated against?
Is this model better than the previous model? Why?
[Figure: Salary vs. Experience, by gender (Female, Male)]
Questions:
After controlling for experience, does it appear that gender discrimination still exists? Explain.
How can we investigate this analytically?
Model 3: Interaction Effect
m3 <- lm(Salary ~ Experience + as.factor(Gender) +
         Experience*as.factor(Gender), data=d)
Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                       33.1668     1.4059  23.592  < 2e-16 ***
Experience                         0.3334     0.1033   3.228  0.00145 **
as.factor(Gender)Male             -4.0171     2.0553  -1.955  0.05201 .
Experience:as.factor(Gender)Male   1.0431     0.1437   7.261 7.95e-12 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.555 on 204 degrees of freedom
Multiple R-squared: 0.5561, Adjusted R-squared: 0.5495
F-statistic: 85.18 on 3 and 204 DF, p-value: < 2.2e-16
Questions:
On average, how much more does a male employee with 10 years of experience make than a female employee with the same experience level?
Interaction effect coefficient 𝛽3 = 1.04 refers to:
A. The average rate of salary increase for each additional year of experience for male employees.
B. The difference in the average rate of salary increase for each additional year of experience for male employees relative to that of female employees.
C. The average salary of male employees with zero years of experience.
What does 𝛽2 = -4.02 mean?
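The first question above can be worked through directly from the Model 3 estimates. The sketch below copies the two gender-related coefficients from the output; the names b_male, b_interaction, and gender_gap are hypothetical, and the result is in whatever units Salary is recorded in.

```r
# Coefficients copied from the Model 3 output (Female is the reference group)
b_male        <- -4.0171   # as.factor(Gender)Male
b_interaction <-  1.0431   # Experience:as.factor(Gender)Male

# Male minus female predicted salary at a given experience level.
# The intercept and the Experience main effect cancel in the difference.
gender_gap <- function(experience) b_male + b_interaction * experience

print(gender_gap(10))

# Experience level at which the two fitted lines cross
print(-b_male / b_interaction)
```

At 10 years the gap is -4.0171 + 1.0431 × 10 = 6.4139 in favor of male employees. At zero experience the gap equals the gender coefficient alone, which is negative.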
Model 3: Wrapping Up
Questions:
Which of the 3 models best characterizes discrimination?
A. The gender-only model.
B. The gender + experience model.
C. The interaction model.
Is there discrimination in the starting salaries?
Is there discrimination in the rate of salary increase?
Are there any limitations to this analysis?
Questions to Ponder …
More on interaction effects:
Can we have interaction between two numeric variables? How will you interpret the interaction coefficient?
Can we have interaction between one numeric variable and one categorical variable with three levels (e.g., old, mid, young)?
Can you have interaction between two categorical variables (e.g., old/mid/young vs. far/near)?
Gender discrimination problem:
[Figure: Salary vs. Experience, by gender (Female, Male)]
Is this really a linear effect, i.e., is the rate of increase the same (constant) at all experience levels?
… linear models).
Key Takeaways
Incorporating dummy variables as predictors in a regression model allows us to make predictions for each group and compare predictions across groups.
Adding an interaction term x1*x2 as a predictor allows for a more flexible model: the change in y relative to a change in x1 becomes a function of the other predictor x2.
Interaction may involve two or more variables, which may be numeric and/or categorical.
In all of these models, the dependent variable (DV) is still numeric.