Regression Analysis and Visualization
Name
2025-07-02
Question 1a: Visualization
library(ggplot2)
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
setwd("~/Downloads")
data <- read_excel("Elasticity.xlsx")
head(data)
## # A tibble: 6 × 4
## Demand Price Brand Ad
## <dbl> <dbl> <chr> <dbl>
## 1 20640 3.87 tropicana 0
## 2 15360 3.87 tropicana 0
## 3 9600 3.87 tropicana 0
## 4 20000 3.87 tropicana 0
## 5 22240 3.87 tropicana 0
## 6 17920 3.87 tropicana 0
ggplot(data, aes(x = Price, y = Demand)) +
  geom_point(color = "steelblue") +
  labs(title = "Scatterplot of Demand vs Price",
       x = "Price",
       y = "Demand") +
  theme_minimal()
This scatterplot shows the relationship between product price and demand.
The points suggest a negative association: demand tends to be lower at
higher prices.
ggplot(data, aes(x = as.factor(Brand), y = Price,
                 fill = as.factor(Brand))) +
  geom_boxplot() +
  labs(title = "Boxplot of Price by Brand",
       x = "Brand",
       y = "Price") +
  theme_minimal()
The boxplot shows clear price differences across the three brands. Tropicana
stands out with the highest median price and overall price range, suggesting
it is positioned as a premium brand. Minute Maid follows with moderately
high prices, while Dominick’s consistently shows the lowest prices, indicating
it is likely a budget-friendly option. The spread within each brand also
reveals pricing consistency—Tropicana’s prices are not only higher but also
more varied, possibly due to a broader product line.
ggplot(data, aes(x = as.factor(Brand), fill = as.factor(Ad))) +
  geom_bar(position = "stack") +
  labs(title = "Stacked Bar Plot: Brand and Ad",
       x = "Brand",
       fill = "Ad (0 = No, 1 = Yes)") +
  theme_minimal()
Question 1b: Simple Linear Regression Models
data <- data %>%
  mutate(
    log_Demand = log(Demand),
    log_Price = log(Price)
  )
library(caret)
## Loading required package: lattice
set.seed(123) # for reproducibility
ctrl <- trainControl(method = "cv", number = 4)
model1 <- train(Demand ~ Price, data = data, method = "lm",
                trControl = ctrl)
summary(model1$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92263 -33232 -11313 12050 1698137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130912.6 1378.1 95.0 <2e-16 ***
## Price -38393.2 580.8 -66.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64030 on 28945 degrees of freedom
## Multiple R-squared: 0.1312, Adjusted R-squared: 0.1311
## F-statistic: 4370 on 1 and 28945 DF, p-value: < 2.2e-16
rmse1 <- model1$results$RMSE
r2_1 <- summary(model1$finalModel)$r.squared
model2 <- train(log_Demand ~ Price, data = data, method = "lm",
trControl = ctrl)
summary(model2$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9223 -0.5964 -0.0327 0.5853 3.7197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.635238 0.019786 588.05 <2e-16 ***
## Price -0.679558 0.008339 -81.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9194 on 28945 degrees of freedom
## Multiple R-squared: 0.1866, Adjusted R-squared: 0.1866
## F-statistic: 6641 on 1 and 28945 DF, p-value: < 2.2e-16
rmse2 <- model2$results$RMSE
r2_2 <- summary(model2$finalModel)$r.squared
model3 <- train(Demand ~ log_Price, data = data, method = "lm",
trControl = ctrl)
summary(model3$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142631 -31835 -9094 12137 1672483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117605 1065 110.42 <2e-16 ***
## log_Price -94791 1274 -74.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62940 on 28945 degrees of freedom
## Multiple R-squared: 0.1606, Adjusted R-squared: 0.1606
## F-statistic: 5538 on 1 and 28945 DF, p-value: < 2.2e-16
rmse3 <- model3$results$RMSE
r2_3 <- summary(model3$finalModel)$r.squared
model4 <- train(log_Demand ~ log_Price, data = data, method = "lm",
trControl = ctrl)
summary(model4$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0441 -0.5853 -0.0330 0.5756 3.7264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.33971 0.01535 738.74 <2e-16 ***
## log_Price -1.60131 0.01836 -87.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9071 on 28945 degrees of freedom
## Multiple R-squared: 0.2081, Adjusted R-squared: 0.2081
## F-statistic: 7608 on 1 and 28945 DF, p-value: < 2.2e-16
rmse4 <- model4$results$RMSE
r2_4 <- summary(model4$finalModel)$r.squared
performance <- data.frame(
  Model = c("Demand ~ Price", "log(Demand) ~ Price",
            "Demand ~ log(Price)", "log(Demand) ~ log(Price)"),
  R2 = c(r2_1, r2_2, r2_3, r2_4),
  RMSE = c(rmse1, rmse2, rmse3, rmse4)
)
performance
## Model R2 RMSE
## 1 Demand ~ Price 0.1311658 6.397702e+04
## 2 log(Demand) ~ Price 0.1866093 9.193669e-01
## 3 Demand ~ log(Price) 0.1606103 6.281382e+04
## 4 log(Demand) ~ log(Price) 0.2081407 9.071317e-01
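The log-log model fits best among the four. It has the highest R² (0.208),
and of the two models with log(Demand) as the response it has the lower
cross-validated RMSE (0.907 vs. 0.919). The RMSE values for the two Demand
models are in units of Demand, so they are not directly comparable with the
RMSE values of the log(Demand) models, and R² across different response
scales should also be compared with caution. The log-log form implies that
demand responds proportionally to percentage changes in price.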
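One way to put the error measures on a common scale is to back-transform the log-model predictions with exp() and compute RMSE in units of Demand. The sketch below is only an illustration under stated assumptions: it uses a small synthetic data set (`sim`) rather than Elasticity.xlsx, and the helper `rmse()` is a hypothetical name introduced here. A plain exp() back-transform estimates the conditional median rather than the mean, so it is a rough comparison, not a definitive one.

```r
# Synthetic stand-in for the real data (assumption: NOT Elasticity.xlsx)
set.seed(123)
n <- 2000
sim <- data.frame(Price = runif(n, 1, 4))
sim$Demand <- exp(11.3 - 1.6 * log(sim$Price) + rnorm(n, sd = 0.9))

# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit_lin    <- lm(Demand ~ Price, data = sim)
fit_loglog <- lm(log(Demand) ~ log(Price), data = sim)

# Both errors are now measured in units of Demand
rmse_lin    <- rmse(sim$Demand, fitted(fit_lin))
rmse_loglog <- rmse(sim$Demand, exp(fitted(fit_loglog)))  # naive back-transform

c(linear = rmse_lin, loglog_backtransformed = rmse_loglog)
```

The same idea could be applied to held-out folds instead of fitted values to compare out-of-sample error.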
Question 1c: Regression Plot and Interpretation
ggplot(data, aes(x = log_Price, y = log_Demand)) +
  geom_point(alpha = 0.3, color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "log(Demand) vs log(Price) with Regression Line",
       x = "log(Price)",
       y = "log(Demand)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Unlike the raw scatterplot in part (a), this plot shows a clear linear trend due
to the log transformation, which helps reduce skewness and stabilize
variance. The original scatterplot of Demand vs Price showed more noise and
heteroscedasticity (uneven spread), making the relationship less obvious.
Since both variables are in log form, this is a log-log model, and the slope
represents an elasticity: A 1% increase in price is associated with an
approximate 1.60% decrease in demand. This implies demand is price elastic
—demand decreases more than proportionally as price increases.
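The "1% increase, about 1.60% decrease" reading is a first-order approximation. In a log-log model the exact multiplicative effect of raising price by a fraction r is (1 + r)^beta. A minimal sketch, using the log_Price coefficient printed above; the function name `pct_change` is an assumption introduced for illustration:

```r
beta <- -1.60131  # log_Price coefficient from the log-log model output

# Exact percentage change in demand for a given proportional price increase
pct_change <- function(beta, price_increase = 0.01) {
  ((1 + price_increase)^beta - 1) * 100
}

pct_change(beta)        # effect of a 1% price increase
pct_change(beta, 0.10)  # effect of a 10% price increase
```

For a 1% change the exact figure is close to the coefficient (roughly -1.58%), while for larger price changes the approximation becomes looser.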
Question 1d
We are evaluating whether the predictor log(Price) is statistically significant
in the model:
log(Demand) ~ log(Price)
From the regression output for log_Price:
Estimate: -1.60131
Std. Error: 0.01836
t value: -87.22
Pr(>|t|): < 2e-16
Yes, the predictor log(Price) is statistically significant at the 2.5% level,
since its p-value is far below 0.025.
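The test can be reproduced by hand from the printed estimate and standard error. The sketch below assumes a significance level of 0.025, which is the threshold used in the later questions; all other numbers are copied from the output above.

```r
estimate  <- -1.60131
std_error <- 0.01836
df_resid  <- 28945   # residual degrees of freedom from the output
alpha     <- 0.025   # assumed significance level (2.5%)

t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)  # two-sided p-value
t_crit  <- qt(1 - alpha / 2, df = df_resid)     # two-sided critical value

abs(t_stat) > t_crit  # reject H0: beta = 0 ?
p_value < alpha
```

The p-value is so small that it may print as 0 in double precision, which is why summary() reports it as < 2e-16.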
Question 2a: Multiple Regression with Brand
data$Brand <- as.factor(data$Brand)
levels(data$Brand)
## [1] "dominicks" "minute.maid" "tropicana"
model_multi <- lm(log_Demand ~ log_Price + Brand, data = data)
summary(model_multi)
##
## Call:
## lm(formula = log_Demand ~ log_Price + Brand, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3152 -0.5246 -0.0502 0.4929 3.5088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.74511 0.01453 808.08 <2e-16 ***
## log_Price -3.13869 0.02293 -136.89 <2e-16 ***
## Brandminute.maid 0.87017 0.01293 67.32 <2e-16 ***
## Brandtropicana 1.52994 0.01631 93.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7935 on 28943 degrees of freedom
## Multiple R-squared: 0.3941, Adjusted R-squared: 0.394
## F-statistic: 6275 on 3 and 28943 DF, p-value: < 2.2e-16
The multiple regression model shows that price has a strong negative
association with demand: holding brand fixed, a 1% increase in price is
associated with a 3.14% drop in demand, confirming that demand is highly
price elastic. Brand also plays a significant role. Compared to Dominicks,
demand is approximately 139% higher for Minute Maid (exp(0.870) - 1) and
362% higher for Tropicana (exp(1.530) - 1), after controlling for price.
These results indicate both price sensitivity and strong brand preference
among consumers.
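Because the response is log(Demand), a brand dummy coefficient b translates into a percentage difference of (exp(b) - 1) * 100 relative to the baseline brand. A short sketch of that conversion, using the coefficients printed above (the vector names are illustrative):

```r
# Brand coefficients from model_multi (baseline: dominicks)
brand_coef <- c(minute.maid = 0.87017, tropicana = 1.52994)

# Percentage difference in demand relative to Dominicks at the same price
pct_higher <- (exp(brand_coef) - 1) * 100
round(pct_higher)
```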
Question 2b
All predictors have p-values below 2e-16, which is far below the
2.5% (0.025) significance threshold. All predictors are statistically significant
at the 2.5% level, providing strong evidence that both price and brand have
a meaningful impact on demand. The inclusion of Brand not only improves
model fit but also helps clarify the demand differences across brands,
without diminishing the strong negative relationship between price and
demand.
Question 2c
Yes, the overall model fit has significantly improved after adding Brand as a
predictor. The R² increased from 0.208 to 0.394, and the residual standard
error decreased from 0.9071 to 0.7935. This indicates that including brand
information helps explain a much greater portion of the variation in demand
and leads to more accurate predictions, confirming a substantially better
model fit.
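Whether the improvement is statistically significant can be checked with a partial F-test for the two added Brand coefficients. The sketch below reconstructs the statistic from the rounded R² values reported above, so the result is approximate; the variable names are assumptions introduced here.

```r
r2_reduced <- 0.2081  # log(Demand) ~ log(Price)
r2_full    <- 0.3941  # log(Demand) ~ log(Price) + Brand
q          <- 2       # number of added coefficients (two brand dummies)
df_full    <- 28943   # residual degrees of freedom of the full model

# Partial F statistic for nested linear models with the same response
f_stat  <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_full)
p_value <- pf(f_stat, q, df_full, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

With the fitted objects available, anova() on the two nested lm fits performs the same comparison without rounding.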
Question 2d: Regression Visualization with Brand
ggplot(data, aes(x = log_Price, y = log_Demand, color = Brand)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "log(Demand) vs log(Price) by Brand",
    x = "log(Price)",
    y = "log(Demand)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
This plot shows how the relationship between price and demand differs across
brands. All three brands show a negative relationship between price and
demand (as expected), but the vertical separation of the lines confirms what
we saw in part (a):
• Tropicana has the highest demand (its line is highest),
• followed by Minute Maid,
• and Dominicks with the lowest demand at any given price level.
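This visualization reinforces the regression findings: brand has a large and
statistically significant effect on demand, and the price elasticity remains
consistently negative across brands. Note that geom_smooth() fits a separate
line for each brand here, so the slopes shown are brand-specific rather than
the single common slope estimated in part (a).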
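For comparison, the fitted lines implied by the additive model (one common slope, brand-specific intercepts) can be drawn directly from its coefficients. This is a minimal base-R sketch; the grid of log(Price) values is an illustrative assumption, not the observed price range.

```r
# Coefficients from model_multi: log_Demand ~ log_Price + Brand
b0    <- 11.74511
slope <- -3.13869
shift <- c(dominicks = 0, minute.maid = 0.87017, tropicana = 1.52994)

log_price <- seq(0, 1.4, by = 0.1)  # illustrative grid (assumption)

# One column of fitted log(Demand) per brand; all columns share the same slope
fitted_lines <- sapply(shift, function(s) b0 + s + slope * log_price)

matplot(log_price, fitted_lines, type = "l", lty = 1,
        xlab = "log(Price)", ylab = "Fitted log(Demand)")
legend("topright", legend = colnames(fitted_lines), col = 1:3, lty = 1)
```

The three lines are parallel by construction, which is the visual signature of a model without a price-by-brand interaction.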
Question 3a: Multiple Regression with Interaction Between Brand and log(Price)
model_interaction <- lm(log_Demand ~ log_Price * Brand, data = data)
summary(model_interaction)
##
## Call:
## lm(formula = log_Demand ~ log_Price * Brand, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4434 -0.5232 -0.0494 0.4884 3.4901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.87097 0.02070 573.395 <2e-16 ***
## log_Price -3.37753 0.03619 -93.322 <2e-16 ***
## Brandminute.maid 0.88825 0.04155 21.376 <2e-16 ***
## Brandtropicana 0.96239 0.04645 20.719 <2e-16 ***
## log_Price:Brandminute.maid 0.05679 0.05729 0.991 0.322
## log_Price:Brandtropicana 0.66576 0.05352 12.439 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7911 on 28941 degrees of freedom
## Multiple R-squared: 0.3978, Adjusted R-squared: 0.3977
## F-statistic: 3823 on 5 and 28941 DF, p-value: < 2.2e-16
This model shows that while all brands experience a drop in demand as price
increases, the sensitivity to price varies by brand. Dominicks is the most
price-sensitive (-3.38% per 1% price increase), while Tropicana is less
sensitive (-3.38 + 0.67 = -2.71%), suggesting stronger brand loyalty. Minute
Maid’s slope (-3.32) is similar to that of Dominicks and not statistically
different from it (p = 0.322). The intercepts also vary: at log(Price) = 0,
Tropicana has the highest baseline demand, followed by Minute Maid and then
Dominicks.
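The brand-specific slopes and intercepts come from adding each interaction (or brand) coefficient to the baseline term. A small sketch of that arithmetic, using the coefficients printed above; the vector names are illustrative:

```r
# Coefficients from model_interaction (baseline brand: dominicks)
base_slope     <- -3.37753
slope_shift    <- c(dominicks = 0, minute.maid = 0.05679, tropicana = 0.66576)
base_intercept <- 11.87097
level_shift    <- c(dominicks = 0, minute.maid = 0.88825, tropicana = 0.96239)

elasticity <- base_slope + slope_shift      # price elasticity per brand
intercept  <- base_intercept + level_shift  # log(Demand) at log(Price) = 0

round(rbind(intercept, elasticity), 2)
```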
Question 3b
At a 2.5% significance level, most predictors — including the base effect of
price, brand differences, and the interaction between price and Tropicana —
are statistically significant. This provides strong evidence that demand is
influenced by both price and brand, and that the relationship between price
and demand differs for Tropicana compared to the base brand (Dominicks).
However, the interaction term for Minute Maid is not significant, suggesting
that its sensitivity to price is not meaningfully different from Dominicks.
Overall, the model shows that brand modifies both the level and slope of the
demand response, especially for Tropicana.
Question 3c
Yes, the interaction model has slightly improved the overall fit compared to
the previous regression. The R² increased from 0.3941 to 0.3978, and the
residual standard error decreased slightly, indicating a small but meaningful
improvement in model accuracy. Since the adjusted R² also increased, this
improvement is not just due to added complexity. These changes suggest
that allowing price sensitivity to vary by brand — especially for Tropicana —
captures additional structure in the data, making the model more
informative. If 4-fold cross-validation RMSE is also lower, this would further
confirm that the interaction model generalizes better to new data.
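The cross-validation check mentioned above could look like the following sketch. It runs 4-fold CV in base R on synthetic data (`sim`), not on Elasticity.xlsx, and the helper `cv_rmse()` is a hypothetical function written for this illustration; the simulated coefficients are only loosely inspired by the output above.

```r
# Synthetic stand-in for the real data (assumption: NOT Elasticity.xlsx)
set.seed(123)
n <- 3000
brands <- c("dominicks", "minute.maid", "tropicana")
sim <- data.frame(
  log_Price = runif(n, 0, 1.4),
  Brand = factor(sample(brands, n, replace = TRUE))
)
slope <- unname(c(dominicks = -3.4, minute.maid = -3.3, tropicana = -2.7)[as.character(sim$Brand)])
level <- unname(c(dominicks = 11.9, minute.maid = 12.8, tropicana = 12.8)[as.character(sim$Brand)])
sim$log_Demand <- level + slope * sim$log_Price + rnorm(n, sd = 0.8)

# Hypothetical helper: mean k-fold RMSE for a model of log_Demand
cv_rmse <- function(formula, data, k = 4) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_rmse <- sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])
    pred <- predict(fit, newdata = data[folds == i, ])
    sqrt(mean((data$log_Demand[folds == i] - pred)^2))
  })
  mean(fold_rmse)
}

# Reset the seed so both models are scored on identical folds
set.seed(123); rmse_additive    <- cv_rmse(log_Demand ~ log_Price + Brand, sim)
set.seed(123); rmse_interaction <- cv_rmse(log_Demand ~ log_Price * Brand, sim)

c(additive = rmse_additive, interaction = rmse_interaction)
```

On the real data, the same comparison could reuse the caret setup from Question 1b by passing each formula to train() with the existing trainControl object.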
Question 3d: Interaction Regression Visualization
ggplot(data, aes(x = log_Price, y = log_Demand, color = Brand)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  labs(
    title = "log(Demand) vs log(Price) with Interaction by Brand",
    x = "log(Price)",
    y = "log(Demand)"
  ) +
  theme_minimal()
This plot shows that the relationship between price and demand varies by brand
both in intercept and slope. Tropicana’s line is the highest and the least
steep, indicating it has the highest baseline demand and the least price
sensitivity.
Dominicks has the steepest slope, showing it is the most price-sensitive, with
demand dropping more sharply as price increases. Minute Maid falls in
between and has a slope similar to Dominicks, consistent with the earlier
regression results where the interaction term for Minute Maid was not
statistically significant.
The plot supports the conclusion that including the interaction improves the
model by capturing brand-specific price elasticity.
Question 4a: Full interaction model with Ad
model_q4 <- lm(log_Demand ~ log_Price * Brand * Ad, data = data)
summary(model_q4)
##
## Call:
## lm(formula = log_Demand ~ log_Price * Brand * Ad, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8893 -0.4290 -0.0091 0.4125 3.2368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.32287 0.02335 484.909 < 2e-16 ***
## log_Price -2.77415 0.03883 -71.445 < 2e-16 ***
## Brandminute.maid 0.04720 0.04663 1.012 0.311
## Brandtropicana 0.70794 0.05080 13.937 < 2e-16 ***
## Ad 1.09441 0.03810 28.721 < 2e-16 ***
## log_Price:Brandminute.maid 0.78293 0.06140 12.750 < 2e-16 ***
## log_Price:Brandtropicana 0.73579 0.05684 12.946 < 2e-16 ***
## log_Price:Ad -0.47055 0.07409 -6.351 2.17e-10 ***
## Brandminute.maid:Ad 1.17294 0.08196 14.312 < 2e-16 ***
## Brandtropicana:Ad 0.78525 0.09875 7.952 1.90e-15 ***
## log_Price:Brandminute.maid:Ad -1.10922 0.12225 -9.074 < 2e-16 ***
## log_Price:Brandtropicana:Ad -0.98614 0.12411 -7.946 2.00e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.695 on 28935 degrees of freedom
## Multiple R-squared: 0.5354, Adjusted R-squared: 0.5352
## F-statistic: 3031 on 11 and 28935 DF, p-value: < 2.2e-16
The regression results reveal complex interactions between price, brand, and
advertising. Advertising raises baseline demand: the Ad coefficient of 1.094
implies roughly three times the demand for Dominicks at log(Price) = 0
(exp(1.094) ≈ 2.99), and the positive Brand:Ad terms make the intercept shift
larger still for Minute Maid and Tropicana. However, advertising also makes
demand more sensitive to price, particularly for Minute Maid and Tropicana,
so part of that lift is offset at higher prices. One possible explanation is
that ads draw attention to the product and also heighten consumer awareness
of price. Summing the relevant coefficients, the price elasticity without ads
is about -2.77 for Dominicks, -1.99 for Minute Maid, and -2.04 for Tropicana;
with ads it is about -3.24, -3.57, and -3.50, respectively. Dominicks is
therefore the most price-sensitive brand only when no ads are running.
Tropicana has the highest intercept both with and without ads. The model
highlights how brand and promotional strategy jointly shape consumer response
to pricing.
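In a three-way interaction model, the price elasticity for a given brand and advertising state is the sum of every coefficient that multiplies log_Price in that cell. The sketch below works this out from the printed coefficients, which is a useful check on any claim about which brand is most price-sensitive; the object names are illustrative.

```r
# log_Price-related coefficients from model_q4 (baseline: dominicks, Ad = 0)
base_slope     <- -2.77415                                                      # log_Price
brand_shift    <- c(dominicks = 0, minute.maid = 0.78293, tropicana = 0.73579)  # log_Price:Brand
ad_shift       <- -0.47055                                                      # log_Price:Ad
brand_ad_shift <- c(dominicks = 0, minute.maid = -1.10922, tropicana = -0.98614) # log_Price:Brand:Ad

elasticity <- rbind(
  no_ad = base_slope + brand_shift,
  ad    = base_slope + brand_shift + ad_shift + brand_ad_shift
)
round(elasticity, 2)
```

Rows correspond to advertising state and columns to brand, so each entry is the estimated percentage change in demand per 1% change in price for that cell.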
Question 4b
At a 2.5% significance level, nearly all predictors in the model are
statistically significant, except for the main effect of the Brand “Minute
Maid,” which has a p-value of 0.311. This suggests that at the reference
point (log(Price) = 0 and no advertising), Minute Maid’s demand is not
significantly different from the baseline brand, Dominicks. However, all other variables,
including the main effects of price and advertising, as well as the two-way
and three-way interaction terms, are statistically significant. This provides
strong evidence that the relationship between price and demand is
meaningfully influenced by both brand and advertising. Specifically, the
interactions show that the effect of price on demand varies by brand and
becomes even more pronounced when advertising is involved. Overall, the
significance of these predictors supports the conclusion that including
advertising and its interactions with price and brand substantially improves
the explanatory power of the model and reveals important nuances in
consumer behavior.
Question 4c
The overall fit of the regression model has clearly improved with the addition
of brand, advertising, and their interactions with price. The R² rose from
0.2081 in the simple model to 0.5354 in the final model, showing that the
expanded model explains more than twice the variation in demand. At the
same time, the residual standard error fell from 0.907 to 0.695, indicating
a closer in-sample fit. Both measures are computed on the training data;
cross-validation was not run for the full model, so its out-of-sample
performance is not shown here. Within that limit, the full model is
substantially more effective at capturing the complex relationships between
price, brand, advertising, and demand.