Flexible Data Models: Dummy Variables and Interaction Effects

The document discusses how to incorporate qualitative predictors into regression models using dummy variables and interaction effects. It explains dummy variables for two-level and three-level categorical predictors. It also explains how to model interaction effects, including interactions between numeric variables and between numeric and dummy variables. The document uses a direct marketing example to illustrate these concepts, building regression models to explain customer spending amount based on variables like salary, number of children, catalogs received, and incorporating dummy variables for factors like age, gender, and marital status.

Flexible Data Models:

Dummy Variables and Interaction Effects

ANOL BHATTACHERJEE, PH.D.


UNIVERSITY OF SOUTH FLORIDA
Outline
• How to incorporate qualitative predictors in regression models:
  • Dummy variables.
  • Two-level and three-level dummies.
• How to model situations in which the effect of one predictor on the outcome variable depends on the value of another predictor:
  • Interaction effects.
  • Interaction among numeric variables.
  • Interaction between numeric and dummy variables.
Motivation: Direct Marketing Example
• Problem:
  • A direct marketer wants to identify which customers to target for a new direct mail catalogue.
  • The marketer has a database containing information on past customer behavior.
• Goal:
  • Mine this database to extract valuable business insight about future customers’ behavior.
  • Explain why some customers spend more than others.
  • Explain which customer characteristics relate to AmountSpent, and how.

Age    Gender  OwnHome  Married  Location  Salary  Children  History  Catalogs  AmountSpent
Young  Male    Rent     Single   Close      15000         3  Low             6           38
Young  Male    Rent     Single   Close      13000         3  Low             6           43
Young  Female  Rent     Single   Close      14600         3  Low             6           47
Young  Female  Rent     Single   Close      17900         3  Low             6           62
Old    Female  Own      Single   Close      12700         2  Low             6           65
Young  Female  Rent     Single   Close      23000         3  Low             6           79
Young  Female  Rent     Single   Close      12700         3  Low            12           87
Young  Female  Rent     Single   Close      12100         1  Low             6           90
Old    Female  Own      Married  Close      10100         1  Low             6           93
Young  Female  Rent     Married  Close      42000         3  Low             6          105
Young  Male    Rent     Single   Close      11200         0  NA              6          106
Getting Started: Explore Associations
• Explore which variables relate to amount spent, and the nature of their relationship.
• How: Correlation analysis, scatterplots, etc., but most importantly, business sense.

Table of correlations
             Salary  Children  Catalogs  AmountSpent
Salary        1.000
Children      0.050     1.000
Catalogs      0.184    -0.113     1.000
AmountSpent   0.700    -0.222     0.473        1.000

• Questions:
  • What can we learn about the relationship between amount spent and the other variables?
  • What would be an intuitive simple regression model that we would want to investigate first?
Scatterplots
[Figure: scatterplot matrix of Salary, Children, Catalogs, and AmountSpent]
A Very Basic Model
• Hypothesis: Customers receiving more catalogs are expected to spend more on purchases.
  • Ha: ∆AmountSpent / ∆Catalogs > 0

m1 <- lm(AmountSpent ~ Catalogs, data=d)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  209.766     65.194   3.218  0.00133 **
Catalogs      68.588      4.048  16.944  < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 847.4 on 998 degrees of freedom
Multiple R-squared: 0.2234, Adjusted R-squared: 0.2226
F-statistic: 287.1 on 1 and 998 DF, p-value: < 2.2e-16

• Questions:
  • What do we learn from the model about the relationship between Catalogs and AmountSpent?
  • Is this a good model? Can we have a better model?
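To make the first question concrete, here is a minimal sketch that turns the m1 output into predictions. The coefficients are copied from the output above; the helper name predict_m1 is hypothetical, and the dataset d itself is not needed.

```r
# Coefficients copied from the m1 output above.
b0 <- 209.766   # intercept
b1 <- 68.588    # slope on Catalogs

# Hypothetical helper: predicted AmountSpent for a given number of catalogs.
predict_m1 <- function(catalogs) b0 + b1 * catalogs

# Predictions for customers receiving 6, 12, and 18 catalogs.
predict_m1(c(6, 12, 18))

# Each additional catalog changes the prediction by exactly b1.
predict_m1(12) - predict_m1(11)
```

With the real data loaded as d, predict(m1, newdata = data.frame(Catalogs = c(6, 12, 18))) should give the same numbers up to rounding of the printed coefficients.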
Controlling for Other Numeric Variables
• But amount spent should also depend on customers’ salary, number of children, etc.
  • Higher Salary should result in higher AmountSpent: ∆AmountSpent / ∆Salary > 0
  • More Children should result in lower AmountSpent: ∆AmountSpent / ∆Children < 0
• So we should control for Salary and Children.

m2 <- lm(AmountSpent ~ Catalogs + Salary + Children, data=d)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.428e+02  5.372e+01  -8.242 5.29e-16 ***
Catalogs     4.770e+01  2.755e+00  17.310  < 2e-16 ***
Salary       2.041e-02  5.929e-04  34.417  < 2e-16 ***
Children    -1.987e+02  1.709e+01 -11.628  < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 562.5 on 996 degrees of freedom
Multiple R-squared: 0.6584, Adjusted R-squared: 0.6574
F-statistic: 640 on 3 and 996 DF, p-value: < 2.2e-16

• Questions:
  • Is this a better model than Model 1?
  • How did adding Salary and Children change the effect of Catalogs compared to not having the controls?
How Much Explanation Does Catalogs Add?
Model with Catalogs:
m2 <- lm(AmountSpent ~ Catalogs + Salary + Children, data=d)

Residual standard error: 562.5 on 996 degrees of freedom
Multiple R-squared: 0.6584, Adjusted R-squared: 0.6574
F-statistic: 640 on 3 and 996 DF, p-value: < 2.2e-16

Model without Catalogs:
m3 <- lm(AmountSpent ~ Salary + Children, data=d)

Residual standard error: 641.3 on 997 degrees of freedom
Multiple R-squared: 0.5557, Adjusted R-squared: 0.5548
F-statistic: 623.4 on 2 and 997 DF, p-value: < 2.2e-16

Comparing nested models:
anova(m2, m3, test="Chisq")

Analysis of Variance Table

Model 1: AmountSpent ~ Catalogs + Salary + Children
Model 2: AmountSpent ~ Salary + Children
  Res.Df       RSS Df Sum of Sq  Pr(>Chi)
1    996 315171647
2    997 409993126 -1 -94821478 < 2.2e-16 ***

The increase in variance explained from m3 to m2 is significant.
The same comparison can be run as an F test: anova(m2, m3, test="F")
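As a sketch of what the F version of this comparison computes, the partial F statistic can be rebuilt by hand from the RSS and Res.Df columns of the table above. The variable names are illustrative; only the four numbers come from the slide.

```r
# Residual sums of squares and residual df copied from the anova() table above.
rss_full    <- 315171647; df_full    <- 996   # m2: with Catalogs
rss_reduced <- 409993126; df_reduced <- 997   # m3: without Catalogs

# Partial F statistic for dropping Catalogs from the model.
df_diff <- df_reduced - df_full
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)

# Upper-tail p-value from the F distribution.
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

f_stat    # close to the squared t value of Catalogs in m2 (17.310^2)
p_value
```

Because only one coefficient is dropped, this F statistic is essentially the square of the t value reported for Catalogs in m2, so the two tests agree.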
What About the Remaining Variables?
Age    Gender  OwnHome  Married  Location  Salary  Children  History  Catalogs  AmountSpent
Young  Male    Rent     Single   Close      15000         3  Low             6           38
Young  Male    Rent     Single   Close      13000         3  Low             6           43
Young  Female  Rent     Single   Close      14600         3  Low             6           47
Young  Female  Rent     Single   Close      17900         3  Low             6           62
Old    Female  Own      Single   Close      12700         2  Low             6           65
Young  Female  Rent     Single   Close      23000         3  Low             6           79
Young  Female  Rent     Single   Close      12700         3  Low            12           87
Young  Female  Rent     Single   Close      12100         1  Low             6           90
Old    Female  Own      Married  Close      10100         1  Low             6           93
Young  Female  Rent     Married  Close      42000         3  Low             6          105
Young  Male    Rent     Single   Close      11200         0  NA              6          106

• So far, we have investigated the relationship between amount of money spent and the three numeric predictors salary, children, and catalogs.
• What about the remaining variables? Are they unimportant?
  • Does age potentially have an effect on spending behavior?
  • How about gender or marital status?
  • Should we include these variables in our regression model?
Qualitative Predictors and Dummy Variables
• Let’s start with the variable Age.
• Age is a categorical variable with three possible values: Young, Middle, or Old.
• A regression cannot use the text values of Age directly. Why?
• We can solve this problem by quantifying the qualitative variable Age.
• How do we quantify categorical variables?
  • By creating dummy variables: n-1 binary variables that together represent the n possible values of a categorical variable.
• In R, wrapping a variable in as.factor( ), e.g. as.factor(Age), marks it as categorical; lm( ) then creates the dummy variables for it.

Sample values of Age: Old, Middle, Middle, Young, Middle, Young, Middle, Young, Young, Young, Young, Old, Young

Age_mid = 1 if middle-aged, 0 otherwise
Age_old = 1 if old, 0 otherwise

Middle-aged customer: Age_mid=1 & Age_old=0
Old customer: Age_mid=0 & Age_old=1
Young customer: Age_mid=0 & Age_old=0
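The coding above can be inspected directly in R. This is a minimal sketch on a made-up Age vector (not the marketing data), with Young set as the base level so the two dummy columns line up with Age_mid and Age_old as defined here.

```r
# Hypothetical sample of Age values; Young is listed first so it is the base level.
age <- factor(c("Old", "Middle", "Middle", "Young", "Middle", "Young"),
              levels = c("Young", "Middle", "Old"))

# model.matrix() builds the design matrix that lm() would use:
# an intercept column plus n-1 = 2 dummy columns for the 3 levels.
X <- model.matrix(~ age)
colnames(X)

# Show each Age value next to its two dummy columns.
data.frame(age, X[, -1])
```

Rows with age equal to Young have zeros in both dummy columns, which is what makes Young the reference group.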
Regression with Dummy Variables
m4 <- lm(AmountSpent ~ as.factor(Age), data=d)

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          1501.69      38.42  39.085   <2e-16 ***
as.factor(Age)Old     -69.56      71.65  -0.971    0.332
as.factor(Age)Young  -943.07      63.95 -14.748   <2e-16 ***

Residual standard error: 866 on 997 degrees of freedom
Multiple R-squared: 0.1897, Adjusted R-squared: 0.1881
F-statistic: 116.7 on 2 and 997 DF, p-value: < 2.2e-16

• Questions:
  • What does the coefficient on as.factor(Age)Old (= -69.56) mean?
    A. The average amount spent by an old customer.
    B. The average amount spent by a middle-aged customer.
    C. The average amount spent by an old customer relative to a middle-aged customer.
    D. The average amount spent by an old customer relative to a young customer.
  • Which age group spends the most: young, middle-aged, or old?
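One way to check an answer to the last question is to rebuild the three group averages from the m4 output. This sketch only uses the printed coefficients; the variable names are illustrative.

```r
# Coefficients copied from the m4 output above (base level: Middle).
b0      <- 1501.69   # intercept = average for middle-aged customers
b_old   <- -69.56    # Old relative to Middle
b_young <- -943.07   # Young relative to Middle

# Implied average AmountSpent per age group.
group_means <- c(Middle = b0, Old = b0 + b_old, Young = b0 + b_young)
group_means

# Group with the highest implied average.
names(which.max(group_means))
```

In a model with only a categorical predictor, these implied averages are the sample means of each group.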
Interpreting Dummy Coefficients
m5 <- lm(AmountSpent ~ as.factor(Age) + Catalogs, data=d)

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          575.653     67.046   8.586   <2e-16 ***
as.factor(Age)Old    -53.702     63.895  -0.840    0.401
as.factor(Age)Young -798.937     57.716 -13.843   <2e-16 ***
Catalogs              60.034      3.736  16.068   <2e-16 ***

Residual standard error: 772.1 on 996 degrees of freedom
Multiple R-squared: 0.3565, Adjusted R-squared: 0.3546
F-statistic: 183.9 on 3 and 996 DF, p-value: < 2.2e-16

[Figure: AmountSpent vs. Catalogs, three parallel lines with intercepts 576 (middle-aged), 576 - 54 (old), and 576 - 799 (young)]

𝛽0 = Amount spent by middle-aged people who receive zero catalogs
𝛽1 = Difference in amount spent by old people relative to middle-aged people receiving the same number of catalogs
𝛽2 – 𝛽1 = Difference in amount spent by young people relative to old people receiving the same number of catalogs
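Comparisons between two non-base groups come from differences of dummy coefficients. A small sketch, using only the two Age coefficients printed in the m5 output (variable names are illustrative):

```r
# Age coefficients copied from the m5 output (base level: Middle).
b_old   <- -53.702    # Old relative to Middle
b_young <- -798.937   # Young relative to Middle

# Old relative to Young, holding Catalogs fixed: the Middle baseline cancels out.
old_vs_young <- b_old - b_young
old_vs_young

# Young relative to Old is the same gap with the opposite sign.
young_vs_old <- b_young - b_old
young_vs_old
```

The value of old_vs_young matches the Old coefficient reported when the model is refit with Young as the base level, which is a useful consistency check.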
What If We Wish to Set Young as the Base Level?
d$Age <- relevel(as.factor(d$Age), ref="Young")
m5 <- lm(AmountSpent ~ as.factor(Age) + Catalogs, data=d)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          -223.284     66.673  -3.349 0.000842 ***
as.factor(Age)Middle  798.937     57.716  13.843  < 2e-16 ***
as.factor(Age)Old     745.235     71.056  10.488  < 2e-16 ***
Catalogs               60.034      3.736  16.068  < 2e-16 ***

Residual standard error: 772.1 on 996 degrees of freedom
Multiple R-squared: 0.3565, Adjusted R-squared: 0.3546
F-statistic: 183.9 on 3 and 996 DF, p-value: < 2.2e-16

• Questions:
  • How do the results change?
  • Which age group spends the most: young, middle-aged, or old?
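A quick way to see what releveling does and does not change is to fit both versions on the same data and compare. The sketch below uses a small simulated data frame called toy (an assumption, not the marketing data), so the numbers themselves are meaningless; only the pattern matters.

```r
# Simulated toy data: 3 age groups, 4 catalog levels each (hypothetical values).
set.seed(1)
toy <- data.frame(
  Age      = factor(rep(c("Young", "Middle", "Old"), each = 4)),
  Catalogs = rep(c(6, 12, 18, 24), times = 3)
)
toy$AmountSpent <- 100 + 50 * toy$Catalogs + rnorm(nrow(toy), sd = 20)

# Default base level is the first level alphabetically: Middle.
fit_middle <- lm(AmountSpent ~ Age + Catalogs, data = toy)

# Switch the base level to Young and refit.
toy$Age   <- relevel(toy$Age, ref = "Young")
fit_young <- lm(AmountSpent ~ Age + Catalogs, data = toy)

names(coef(fit_middle))   # dummies for Old and Young
names(coef(fit_young))    # dummies for Middle and Old

# The fitted values are the same; only the reference group changed.
all.equal(fitted(fit_middle), fitted(fit_young))
```

This is consistent with the two m5 outputs: R-squared, the residual standard error, and the Catalogs coefficient are identical, while the intercept and Age dummies are re-expressed relative to the new base level.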
Numeric and Dummy Variables
m6 <- lm(AmountSpent ~ Catalogs + Salary + as.factor(Age), data=d)

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)          -6.876e+02  5.398e+01 -12.738  < 2e-16 ***
Catalogs              5.145e+01  2.880e+00  17.866  < 2e-16 ***
Salary                2.079e-02  7.840e-04  26.516  < 2e-16 ***
as.factor(Age)Middle -1.018e+02  5.575e+01  -1.826  0.06813 .
as.factor(Age)Old     1.680e+02  5.861e+01   2.866  0.00424 **

Residual standard error: 591.3 on 995 degrees of freedom
Multiple R-squared: 0.6229, Adjusted R-squared: 0.6214
F-statistic: 411 on 4 and 995 DF, p-value: < 2.2e-16

• Questions:
  • What does the coefficient on as.factor(Age)Middle (= -102) mean?
    A. Middle-aged customers spend the least.
    B. Middle-aged customers spend less than young customers.
    C. We should ignore this coefficient because the p-value is not significant.
    D. Middle-aged customers spend less than young customers with the same salary level and same number of catalogs.
  • Why did the coefficient of Middle change sign?
Prediction with Dummy Variables
m6 <- lm(AmountSpent ~ Catalogs + Salary + as.factor(Age), data=d)

                       Estimate Std. Error t value Pr(>|t|)
(Intercept)          -6.876e+02  5.398e+01 -12.738  < 2e-16 ***
Catalogs              5.145e+01  2.880e+00  17.866  < 2e-16 ***
Salary                2.079e-02  7.840e-04  26.516  < 2e-16 ***
as.factor(Age)Middle -1.018e+02  5.575e+01  -1.826  0.06813 .
as.factor(Age)Old     1.680e+02  5.861e+01   2.866  0.00424 **

• Questions:
  • What is the amount spent by an old person with a salary of $50,000 receiving no catalogs per year?
  • What is the amount spent by a young person with a salary of $30,000 receiving 1 catalog per year?
  • What is the amount spent by a middle-aged person with a salary of $80,000 with 2 kids?
  • What is the difference in amount spent by an old and a young person with the same salary and same number of catalogs?
  • What is the difference in amount spent by an old person receiving 1 catalog and a middle-aged person receiving 2 catalogs, when controlled for salary?
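The prediction questions can be worked through with a small helper built from the printed m6 coefficients. This is a sketch, not the author's solution: predict_m6 and coef_m6 are hypothetical names, and the results carry the rounding of the printed output.

```r
# Coefficients copied from the m6 output above (base level: Young).
coef_m6 <- c(intercept = -687.6, catalogs = 51.45, salary = 0.02079,
             middle = -101.8, old = 168.0)

# Hypothetical helper: predicted AmountSpent for one customer profile.
predict_m6 <- function(age, salary, catalogs) {
  age_effect <- switch(age,
                       Young  = 0,                      # base level: no dummy is switched on
                       Middle = coef_m6[["middle"]],
                       Old    = coef_m6[["old"]],
                       stop("age must be Young, Middle, or Old"))
  coef_m6[["intercept"]] + coef_m6[["catalogs"]] * catalogs +
    coef_m6[["salary"]] * salary + age_effect
}

predict_m6("Old", salary = 50000, catalogs = 0)
predict_m6("Young", salary = 30000, catalogs = 1)

# Old with 1 catalog vs. middle-aged with 2 catalogs at the same (arbitrary) salary.
predict_m6("Old", 40000, 1) - predict_m6("Middle", 40000, 2)
```

Note that the helper needs a Catalogs value and has no Children input, because m6 does not include Children; that is worth keeping in mind for the third question.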
Multiple Numeric and Dummies
m7 <- lm(AmountSpent ~ Salary + Catalogs + Children + as.factor(Age) +
         as.factor(Gender) + as.factor(Married) + as.factor(Location), data=d)

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)              -6.058e+02  7.501e+01  -8.076 1.92e-15 ***
Salary                    2.251e-02  9.517e-04  23.648  < 2e-16 ***
Catalogs                  4.314e+01  2.549e+00  16.922  < 2e-16 ***
Children                 -2.008e+02  1.724e+01 -11.645  < 2e-16 ***
as.factor(Age)Middle     -8.208e+01  5.062e+01  -1.622    0.105
as.factor(Age)Old        -2.038e+01  5.400e+01  -0.377    0.706
as.factor(Gender)Male    -4.197e+01  3.467e+01  -1.211    0.226
as.factor(Married)Single  6.759e+01  4.677e+01   1.445    0.149
as.factor(Location)Far    5.071e+02  3.622e+01  14.001  < 2e-16 ***

Residual standard error: 513.9 on 991 degrees of freedom
Multiple R-squared: 0.7164, Adjusted R-squared: 0.7141
F-statistic: 312.9 on 8 and 991 DF, p-value: < 2.2e-16

• Question: What do you see in the above model?


Interpretation: Effect of Location Dummy
• Customers who live far from a store selling similar products spend more (larger intercept) than those living close to competing stores (smaller intercept).
• However, spending rises with salary at the same rate for customers living far as for customers living close (same slope).
• Do you think this is reasonable/realistic?
• Are other scenarios plausible? Which ones?
• How can we investigate other scenarios?
• Remember: Our answers are only as good as the questions we ask. Simple, unrealistic questions will yield simple, unrealistic answers.

[Figure: AmtSpent vs. Salary, two parallel lines labeled Far (upper) and Close (lower)]
Interaction Effects
• A more challenging question: Is the spending rate higher for customers who live far away than for those living close?
• How to answer this question?
  • Add an interaction term, computed as the product of the values in the Salary and Location (dummy) columns.
  • Salary*as.factor(Location)
  • Examine whether the interaction effect is significant.

[Figure: AmtSpent vs. Salary, separate lines for Far and Close with different slopes]
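In an R formula, * is shorthand for both main effects plus their product, while : denotes the product term alone. A short sketch that inspects the expanded terms without fitting anything (the formulas reuse the slide's variable names; no data is needed):

```r
# Two ways of writing the same model in R's formula language.
f_star  <- AmountSpent ~ Salary * Location                   # shorthand
f_colon <- AmountSpent ~ Salary + Location + Salary:Location # spelled out

# terms() expands a formula; "term.labels" lists the resulting predictors.
labels_star  <- attr(terms(f_star), "term.labels")
labels_colon <- attr(terms(f_colon), "term.labels")

labels_star
labels_colon
identical(labels_star, labels_colon)
```

So writing Salary + as.factor(Location) + Salary*as.factor(Location) is redundant but harmless: R drops the repeated main effects and keeps a single interaction term.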
Interaction Between Numeric and Dummy Variables
m8 <- lm(AmountSpent ~ Salary + Catalogs + as.factor(Location) +
         Salary*as.factor(Location), data=d)

                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                   -5.421e+02  5.268e+01 -10.289   <2e-16 ***
Salary                         1.687e-02  6.579e-04  25.636   <2e-16 ***
Catalogs                       4.564e+01  2.603e+00  17.532   <2e-16 ***
as.factor(Location)Far        -1.684e+02  7.593e+01  -2.218   0.0268 *
Salary:as.factor(Location)Far  1.215e-02  1.205e-03  10.075   <2e-16 ***

Residual standard error: 530 on 995 degrees of freedom
Multiple R-squared: 0.6971, Adjusted R-squared: 0.6959
F-statistic: 572.6 on 4 and 995 DF, p-value: < 2.2e-16

• Questions:
  • If the salary of a customer who lives close increases by $10,000, what is the predicted increase in amount spent?
  • If the salary of a customer who lives far increases by $10,000, what is the predicted increase in amount spent?
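Both questions reduce to reading off a group-specific salary slope. A minimal sketch using the two Salary coefficients printed in the m8 output (variable names are illustrative):

```r
# Salary coefficients copied from the m8 output above (base level: Close).
b_salary     <- 1.687e-02   # salary slope for customers living close
b_salary_far <- 1.215e-02   # extra salary slope for customers living far

# With an interaction, each location has its own salary slope.
slope_close <- b_salary
slope_far   <- b_salary + b_salary_far

# Predicted change in AmountSpent for a $10,000 salary increase.
increase <- 10000 * c(Close = slope_close, Far = slope_far)
increase
```

The intercept, the Catalogs coefficient, and the Location main effect drop out here because they do not change when only Salary changes.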
Interaction Between Numeric Variables
m9 <- lm(AmountSpent ~ Salary + Catalogs + Children + Salary*Catalogs, data=d)

                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      2.832e+02  8.277e+01   3.422 0.000647 ***
Salary           6.618e-03  1.363e-03   4.856 1.39e-06 ***
Catalogs        -4.296e+00  5.359e+00  -0.802 0.422963
Children        -1.984e+02  1.613e+01 -12.302  < 2e-16 ***
Salary:Catalogs  9.416e-04  8.486e-05  11.096  < 2e-16 ***

Multiple R-squared: 0.696, Adjusted R-squared: 0.6948
F-statistic: 569.6 on 4 and 995 DF, p-value: < 2.2e-16

• Questions:
  • What is the marginal effect of children on amount spent?
  • What is the marginal effect of catalogs on amount spent?
  • Draw two graphs to show the difference between the two questions above.
  • Does marginal effect apply for dummy variables?
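The contrast between the first two questions can be sketched numerically: Children enters m9 without an interaction, so its marginal effect is a constant, while the marginal effect of Catalogs depends on Salary. The coefficients are copied from the output above; the function name and the salary values are illustrative.

```r
# Coefficients copied from the m9 output above.
b_children        <- -1.984e+02
b_catalogs        <- -4.296e+00
b_salary_catalogs <-  9.416e-04

# Marginal effect of Children: the same at every salary level.
marginal_children <- b_children

# Marginal effect of Catalogs: a linear function of Salary.
marginal_catalogs <- function(salary) b_catalogs + b_salary_catalogs * salary

marginal_children
marginal_catalogs(c(20000, 50000, 100000))
```

Plotting marginal_catalogs against salary gives an upward-sloping line, whereas the marginal effect of Children is a flat line, which is the difference the third question asks you to draw.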
How Messy Can It Get?
m8 <- lm(AmountSpent ~ Salary + Catalogs + Children + Gender + Married
         + Catalogs*Salary + Catalogs*Children + Catalogs*Salary*Children +
         Catalogs*Gender + Catalogs*Gender*Salary, data=d)

                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 5.917e+01  1.427e+02   0.415 0.678458
Salary                      8.617e-03  2.552e-03   3.377 0.000762 ***
Catalogs                   -3.000e+00  8.757e+00  -0.343 0.731960
Children                   -4.185e+01  7.540e+01  -0.555 0.579053
GenderMale                  6.212e+01  1.743e+02   0.356 0.721674
MarriedSingle               1.522e+01  4.476e+01   0.340 0.733902
Salary:Catalogs             1.042e-03  1.506e-04   6.918 8.25e-12 ***
Catalogs:Children           3.004e+00  5.159e+00   0.582 0.560497
Salary:Children            -1.068e-03  1.229e-03  -0.869 0.384948
Catalogs:GenderMale        -5.511e+00  1.129e+01  -0.488 0.625570
Salary:GenderMale          -5.758e-04  2.862e-03  -0.201 0.840601
Salary:Catalogs:Children   -1.638e-04  8.004e-05  -2.047 0.040944 *
Salary:Catalogs:GenderMale  5.876e-05  1.759e-04   0.334 0.738453

Multiple R-squared: 0.7138, Adjusted R-squared: 0.7103
F-statistic: 205.2 on 12 and 987 DF, p-value: < 2.2e-16

Two-way interactions: Salary:Catalogs, Catalogs:Children, Salary:Children
Two-way interactions with dummies: Catalogs:GenderMale, Salary:GenderMale
Three-way interaction: Salary:Catalogs:Children
Three-way interaction with dummies: Salary:Catalogs:GenderMale
Example: Gender Discrimination
• The scenario:
  • Fifth National Bank is facing a gender discrimination lawsuit alleging that it pays its female employees less than male employees.
  • The bank’s database includes information on its 208 employees, with each employee’s annual salary (in $thousands), gender, and years of experience.
• Question:
  • What regression model should we run to evaluate the gender discrimination claim?

Gender  Experience  Salary
Male             4      32
Female          15    39.1
Female          12    33.2
Female          15    30.6
Male             3      29
Female           3    30.5
Female           4      30
Male            10      27
Female           4      34
Female           9    29.5
Female          11    26.8
Female          16    31.3
Exploring the Data
• Side-by-side box plots.
• Question:
  • What do these box plots tell us?

[Figure: side-by-side box plots of Salary for Male and Female employees]
Model 1: Main Effect of Gender
m1 <- lm(Salary ~ as.factor(Gender), data=d)

                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            37.2099     0.8945  41.597  < 2e-16 ***
as.factor(Gender)Male   8.2955     1.5645   5.302 2.94e-07 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.58 on 206 degrees of freedom
Multiple R-squared: 0.1201, Adjusted R-squared: 0.1158
F-statistic: 28.12 on 1 and 206 DF, p-value: 2.935e-07

• Questions:
  • Based on the above model, is there gender discrimination at this bank?
  • If so, how much are female employees discriminated against?
  • Is this a good model?
  • Is there anything missing from this analysis? Elaborate.
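With a single two-level dummy as the only predictor, the two coefficients simply encode the two group averages. A small sketch from the printed m1 output (variable names are illustrative; salaries are in $thousands as stated in the scenario):

```r
# Coefficients copied from the Model 1 output above (base level: Female).
b0     <- 37.2099   # average salary of female employees
b_male <-  8.2955   # male average minus female average

# Implied average salary by gender, in $thousands.
mean_salary <- c(Female = b0, Male = b0 + b_male)
mean_salary

# The raw gap, before controlling for anything else.
mean_salary[["Male"]] - mean_salary[["Female"]]
```

This gap is a plain difference in means, which is why the slide asks what might be missing: it does not yet account for differences in experience.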
Model 2: Controlling for Experience
m2 <- lm(Salary ~ Experience + as.factor(Gender), data=d)

                      Estimate Std. Error t value Pr(>|t|)
(Intercept)           26.63000    1.20832  22.039  < 2e-16 ***
Experience             0.87231    0.08034  10.858  < 2e-16 ***
as.factor(Gender)Male  8.51029    1.24979   6.809 1.06e-10 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.454 on 205 degrees of freedom
Multiple R-squared: 0.4413, Adjusted R-squared: 0.4359
F-statistic: 80.98 on 2 and 205 DF, p-value: < 2.2e-16

• Questions:
  • Based on this model, does gender discrimination exist?
  • If so, how much are female employees discriminated against?
  • Is this model better than the previous model? Why?
  • Is there anything still missing from this analysis? Elaborate.


Digging Further
• Scatterplots (or other visualizations) are a great way of understanding what’s happening in the data.
• Questions:
  • After controlling for experience, does it appear that gender discrimination still exists? Explain.
  • How can we investigate this analytically?

[Figure: scatterplot of Salary vs. Experience, with points marked Female and Male]
Model 3: Interaction Effect
m3 <- lm(Salary ~ Experience + as.factor(Gender) +
         Experience*as.factor(Gender), data=d)

                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                       33.1668     1.4059  23.592  < 2e-16 ***
Experience                         0.3334     0.1033   3.228  0.00145 **
as.factor(Gender)Male             -4.0171     2.0553  -1.955  0.05201 .
Experience:as.factor(Gender)Male   1.0431     0.1437   7.261 7.95e-12 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.555 on 204 degrees of freedom
Multiple R-squared: 0.5561, Adjusted R-squared: 0.5495
F-statistic: 85.18 on 3 and 204 DF, p-value: < 2.2e-16

What does the negative coefficient on as.factor(Gender)Male mean?

• Questions:
  • Based on this model, does gender discrimination exist?
  • If so, how much are female employees discriminated against?
  • Is this model better than the previous two models? Why?
Model 3: Interpretation
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                       33.1668     1.4059  23.592  < 2e-16 ***
Experience                         0.3334     0.1033   3.228  0.00145 **
as.factor(Gender)Male             -4.0171     2.0553  -1.955  0.05201 .
Experience:as.factor(Gender)Male   1.0431     0.1437   7.261 7.95e-12 ***

• Questions:
  • On average, how much more does a male employee with 10 years of experience make than a female employee with the same experience level?
  • Interaction effect coefficient 𝛽3 = 1.04 refers to
    A. The average rate of salary increase for each additional year of experience for male employees.
    B. The amount by which the average rate of salary increase for each additional year of experience for male employees exceeds that of female employees.
    C. The average salary of male employees with zero years of experience.
  • What does 𝛽2 = -4.02 mean?
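In the interaction model the male-female gap is no longer one number; it depends on experience. A minimal sketch using the two gender-related coefficients from the Model 3 output (the function name gap is hypothetical; units are $thousands):

```r
# Coefficients copied from the Model 3 output above (base level: Female).
b_male     <- -4.0171   # male minus female salary at zero experience
b_exp_male <-  1.0431   # extra salary growth per year of experience for males

# Predicted male minus female salary at a given experience level.
gap <- function(experience) b_male + b_exp_male * experience

gap(0)     # gap in starting salaries
gap(10)    # gap after 10 years of experience

# Experience level at which the predicted gap is zero.
break_even <- -b_male / b_exp_male
break_even
```

Under these estimates the predicted gap starts slightly negative and turns positive after a few years, so the answer to "how much" depends on the experience level being compared.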
Model 3: Wrapping Up
• Questions:
  • Which of the 3 models best characterizes discrimination?
    A. The gender-only model.
    B. The gender + experience model.
    C. The interaction model.
  • Is there discrimination in the starting salaries?
  • Is there discrimination in the rate of salary increase?
  • Are there any limitations to this analysis?
Questions to Ponder …
• More on interaction effects:
  • Can we have interaction between two numeric variables? How will you interpret the interaction coefficient?
  • Can we have interaction between one numeric variable and one categorical variable with three levels (e.g., old, mid, young)?
  • Can you have interaction between two categorical variables (e.g., old/mid/young vs. far/near)?
• Gender discrimination problem:
  • Is this really a linear effect, i.e., is the rate of increase the same (constant) at all experience levels?
  • If not, how can we investigate this? (Answer: non-linear models).

[Figure: scatterplot of Salary vs. Experience, with points marked Female and Male]
Key Takeaways
• Incorporating dummy variables as predictors in a regression model allows us to make predictions for each group and to compare predictions across groups.
• Adding an interaction term x1*x2 as a predictor allows for a more flexible model:
  • The change in y relative to a change in x1 is a function of the other predictor x2.
• Interactions may involve two or more variables, which may be numeric and/or categorical.
• In all of these models, the dependent variable (DV) is still numeric.
