Regression Analysis Lab: Questions and Answers


Problem 01

Given Data:

• X (Number of Units Repaired): 1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10


• Y (Length of Service Calls in minutes): 23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149,
145, 154, 166

Let's go through each part of the problem:

(a) Scatter Plot

We will plot Length of Service Calls (Y) against Number of Units Repaired (X) and assess
whether a linear model might be appropriate.

R Code:

# Data
X <- c(1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10)
Y <- c(23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166)

# Scatter plot
plot(X, Y, main = "Scatter Plot of Service Calls vs Units Repaired",
     xlab = "Number of Units Repaired",
     ylab = "Length of Service Calls (minutes)",
     pch = 19, col = "blue")

# Is a linear model appropriate?
# The scatter plot should reveal whether the points roughly form a straight line.

Based on the scatter plot, we can visually inspect whether the relationship between X and
Y appears linear. If the points fall roughly along a straight line, a linear model is likely appropriate.
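As a quick numeric complement to the visual check (a sketch, not part of the original lab), the sample correlation coefficient quantifies the strength of the linear association:

# Sample correlation between X and Y; a value near 1 or -1
# indicates a strong linear association
cor(X, Y)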

(b) Line of Best Fit

We will fit a linear regression model and plot the regression line on the scatter plot.
R Code:

# Fit the linear model
model <- lm(Y ~ X)

# Scatter plot with the fitted regression line
plot(X, Y, main = "Scatter Plot with Line of Best Fit",
     xlab = "Number of Units Repaired",
     ylab = "Length of Service Calls (minutes)",
     pch = 19, col = "blue")
abline(model, col = "red")

# Output the model summary to get the regression coefficients
summary(model)

The summary of the model will provide us with the slope (β₁) and intercept (β₀).

(c) Interpret the Intercept and Slope

• Intercept (β₀): The predicted length of a service call when no units are repaired (X = 0). This is the base time for a service call with no repairs.
• Slope (β₁): The increase in the length of the service call for every additional unit repaired.

You can interpret these values directly from the output of summary(model).
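For instance, the estimated coefficients can be extracted directly (a minimal sketch using base R):

# Named vector of estimated coefficients: (Intercept) and X
coef(model)
# Slope only
coef(model)["X"]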

(d) Practical Sense of the Intercept

Since the intercept represents the length of a service call when 0 units are repaired, its
practical sense depends on whether service calls with no repairs occur. If a service call
where no repairs are made doesn't make sense in practice, the intercept has no direct
practical interpretation; it simply anchors the regression line.

(e) Goodness-of-Fit (R²)

We can check the R² value from the model summary to assess how well the model
fits the data.

R Code:

# R-squared value from the model
summary(model)$r.squared

A higher R² value (closer to 1) indicates a better fit of the model.
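To make the definition concrete (a sketch beyond the original steps), R² can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares:

# R-squared from first principles: R^2 = 1 - SSE/SST
sse <- sum(residuals(model)^2)   # residual sum of squares
sst <- sum((Y - mean(Y))^2)      # total sum of squares
1 - sse / sst                    # matches summary(model)$r.squared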

(f) ANOVA Table

The ANOVA table decomposes the total variation in Y into regression and residual components. We can use the anova() function in R to generate the ANOVA table.

R Code:

# ANOVA table
anova(model)

This will provide the sum of squares, mean squares, F-statistic, and p-value, allowing us to
comment on the overall fit of the model.

(g) Significance of the Variable

The p-value associated with the slope (β₁) in the summary output tells us if the
Number of Units Repaired is a statistically significant predictor of the Length of Service
Calls.
R Code:

# Check the p-value for the slope (X variable)
summary(model)$coefficients

If the p-value is less than 0.05, the variable is statistically significant.
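Equivalently (a sketch beyond the original steps), a 95% confidence interval for the slope that excludes 0 leads to the same conclusion:

# 95% confidence intervals for the intercept and slope;
# if the slope interval excludes 0, X is significant at the 5% level
confint(model, level = 0.95)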

(h) 95% Confidence and Prediction Intervals

To find the 95% confidence and prediction intervals for the length of service calls when 12
units are repaired:

R Code:

# New data for prediction (when X = 12)
new_data <- data.frame(X = 12)

# 95% Confidence Interval
conf_interval <- predict(model, new_data, interval = "confidence", level = 0.95)

# 95% Prediction Interval
pred_interval <- predict(model, new_data, interval = "prediction", level = 0.95)

# Output both intervals
conf_interval
pred_interval
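One caution worth adding (not part of the original solution): the observed X values only reach 10, so predicting at X = 12 extrapolates beyond the data, and both intervals should be interpreted with care:

# X = 12 lies outside the observed range of X
range(X)   # 1 10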

Conclusion:

By running these commands, you'll have:

1. A scatter plot with the line of best fit.
2. Regression coefficients to interpret the model.
3. Goodness-of-fit (R²) and the ANOVA table.
4. A significance test for the variable (Number of Units Repaired).
5. 95% confidence and prediction intervals for the length of service calls when 12 units are repaired.

This R-based approach completes the analysis of the given data and provides insights into
the relationship between the number of units repaired and the length of service calls.


Problem 02

To solve this problem using R, we need to:

1. Fit a multiple linear regression model.
2. Check if lot size (X₂) and age of the house (X₃) individually have significant impacts.
3. Calculate R² and adjusted R².
4. Construct the ANOVA table.
5. Find the 90% confidence and prediction intervals.

Here is how we can approach each step in R:

Given Data:

• Y (Sale Price): 25.9, 29.5, 27.9, 25.9, 29.9, 29.9, 30.9, 28.9, 45.8, 36.9, 38.9, 37.9,
44.5, 37.9, 37.5, 43.9
• X₂ (Lot Size in thousands of square feet): 3.472, 3.531, 2.275, 4.050, 4.455, 4.455,
5.850, 9.520, 7.326, 8.000, 9.150, 6.727, 9.890, 5.000, 5.520, 7.800
• X₃ (Age of House in years): 42, 62, 40, 54, 42, 56, 51, 32, 31, 3, 48, 44, 50, 22, 40, 23

(a) Fit a Multiple Linear Regression Model

We will fit a multiple linear regression model where Y (Sale Price) is predicted by X₂ (Lot
Size) and X₃ (Age of House).

R Code:

# Data
Y <- c(25.9, 29.5, 27.9, 25.9, 29.9, 29.9, 30.9, 28.9, 45.8, 36.9,
       38.9, 37.9, 44.5, 37.9, 37.5, 43.9)
X2 <- c(3.472, 3.531, 2.275, 4.050, 4.455, 4.455, 5.850, 9.520, 7.326,
        8.000, 9.150, 6.727, 9.890, 5.000, 5.520, 7.800)
X3 <- c(42, 62, 40, 54, 42, 56, 51, 32, 31, 3, 48, 44, 50, 22, 40, 23)

# Fit the multiple linear regression model
model <- lm(Y ~ X2 + X3)

# Summary of the model
summary(model)

• Intercept: The expected sale price of a house when lot size and age of the house
are both 0.
• X₂ Coefficient: How much the sale price increases for each additional 1000 square
feet in lot size, holding age constant.
• X₃ Coefficient: How much the sale price decreases (or increases) for each
additional year in the age of the house, holding lot size constant.
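To address step 2 from the list above, whether X₂ and X₃ individually have significant impacts, inspect the t-tests in the coefficient table (a minimal sketch):

# Each row gives the estimate, standard error, t-value, and p-value;
# a p-value below 0.05 marks an individually significant predictor
summary(model)$coefficients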

(b) Calculate R² and Adjusted R²

We can get the values of R² and adjusted R² from the model summary. These
values tell us how well the model explains the variability in the sale price.

R Code:

# R-squared and Adjusted R-squared
summary(model)$r.squared
summary(model)$adj.r.squared

• R²: Proportion of variance in sale price explained by the model.
• Adjusted R²: R² adjusted for the number of predictors in the model.
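For reference (a sketch under the standard definition, not part of the original solution), adjusted R² can be reproduced by hand from R², the sample size n, and the number of predictors k:

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
r2 <- summary(model)$r.squared
n <- length(Y)   # 16 observations
k <- 2           # two predictors: X2 and X3
1 - (1 - r2) * (n - 1) / (n - k - 1)   # matches summary(model)$adj.r.squared
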
(c) ANOVA Table

The ANOVA table provides insights into the overall significance of the model. We can use
the anova() function to generate the ANOVA table.

R Code:

# ANOVA table
anova(model)

The ANOVA table will show the sum of squares, degrees of freedom, mean squares, F-statistics, and p-values for each term. This helps in determining if the overall model is statistically significant.
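Note that anova() reports sequential (Type I) sums of squares for each term; the single overall F-test for the model appears in summary(model) and can be extracted as follows (a sketch):

# Overall model F-test: summary() stores the F statistic and its
# degrees of freedom, and pf() gives the corresponding p-value
fstat <- summary(model)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)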

(d) 90% Confidence and Prediction Intervals

We need to find the 90% confidence and prediction intervals for the sale price of a house
with a lot size of 10,000 square feet and age of 50 years.

R Code:

# New data for prediction (Lot size = 10, Age = 50)
new_data <- data.frame(X2 = 10, X3 = 50)

# 90% Confidence Interval
conf_interval <- predict(model, new_data, interval = "confidence", level = 0.90)

# 90% Prediction Interval
pred_interval <- predict(model, new_data, interval = "prediction", level = 0.90)

# Output both intervals
conf_interval
pred_interval

• Confidence Interval: Gives a range in which the mean sale price is likely to fall.
• Prediction Interval: Gives a range in which the actual sale price of an individual
house is likely to fall.
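As a quick check on the relationship between the two intervals (an illustrative sketch), the prediction interval is always wider than the confidence interval, because it adds the variability of an individual observation around the mean:

# Compare interval widths: the prediction interval is wider
conf_interval[, "upr"] - conf_interval[, "lwr"]
pred_interval[, "upr"] - pred_interval[, "lwr"]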

Conclusion:

By running the above code, you'll be able to:

1. Fit a multiple linear regression model.
2. Interpret the impact of lot size and age on the sale price.
3. Compute and interpret R² and adjusted R².
4. Obtain the ANOVA table and comment on the overall model significance.
5. Calculate the 90% confidence and prediction intervals for a house with a given lot size and age.

Problem 03

To solve this problem using R, we will need to address each part step by step, utilizing
various diagnostics tools in R for residual analysis. Here's how you can approach the tasks:

Steps in the R Solution:

1. Fit a linear model based on the given water flow data.
2. Generate normal probability plots for the ordinary, studentized, and deleted studentized residuals to comment on normality.
3. Identify outliers using studentized and deleted studentized residuals.
4. Identify high leverage points using the twice-the-mean and thrice-the-mean rules.
5. Use Cook's distance and DFFITS to find influential observations.
6. Evaluate the improvement after omission of the outliers.

Let's assume we are fitting a simple linear regression model with the Libby measurements (libby_x) as the independent variable. Since the problem does not supply a dependent variable, placeholder values for y are used below so the diagnostics can be demonstrated.
(a) Normal Probability Plot of Residuals

R Code for Ordinary, Studentized, and Deleted Studentized Residuals:

# Data (y is a placeholder dependent variable, as noted above)
libby_x <- c(27.1, 19.7, 20.9, 18.0, 33.4, 26.1, 77.6, 15.7, 44.9,
             37.0, 26.1, 21.6, 19.9, 17.6, 15.7, 35.1, 27.6, 32.6, 24.9, 26.0,
             23.4, 27.6, 23.1, 38.7, 31.3, 27.8)
y <- c(23.8, 22.3, 25.6, 20.1, 35.7, 24.4, 81.2, 17.9, 49.0, 40.6,
       28.0, 21.2, 20.5, 17.0, 34.0, 29.7, 36.5, 26.8, 24.0, 22.9, 27.0,
       26.7, 39.1, 30.4, 31.7, 28.1)

# Fit the linear model
model <- lm(y ~ libby_x)

# Ordinary residuals
ordinary_res <- residuals(model)

# Studentized (internally studentized) residuals
student_res <- rstandard(model)

# Deleted studentized (externally studentized) residuals;
# rstudent() in base R computes these, so no extra package is needed
deleted_student_res <- rstudent(model)

# Normal probability plots
par(mfrow = c(1, 3))
qqnorm(ordinary_res, main = "Ordinary Residuals")
qqline(ordinary_res)

qqnorm(student_res, main = "Studentized Residuals")
qqline(student_res)

qqnorm(deleted_student_res, main = "Deleted Studentized Residuals")
qqline(deleted_student_res)
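
As a complementary numeric check on normality (a sketch, not part of the original steps), the Shapiro-Wilk test can be applied to the residuals:

# Shapiro-Wilk test: a small p-value (e.g., below 0.05) suggests
# a departure from normality
shapiro.test(ordinary_res)
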
(b) Identify Outliers Using Studentized and Deleted Studentized Residuals

We can identify potential outliers using studentized residuals. A common rule is that
residuals with absolute values greater than 3 are considered outliers.

R Code:

# Identify outliers using studentized residuals
outliers <- which(abs(student_res) > 3)
outliers

# Identify outliers using deleted studentized residuals
outliers_deleted <- which(abs(deleted_student_res) > 3)
outliers_deleted

(c) Identify High Leverage Points

Leverage measures how far the independent variable values of a data point are from the
mean of the independent variables. Since the average leverage equals p/n (where p is the number of model parameters and n is the sample size), the twice-the-mean and thrice-the-mean rules flag points whose hat values exceed 2p/n and 3p/n, respectively.

R Code:

# Leverage (hat) values
hat_values <- hatvalues(model)

# Twice-the-mean rule
high_leverage_2mean <- which(hat_values > 2 * mean(hat_values))
high_leverage_2mean

# Thrice-the-mean rule
high_leverage_3mean <- which(hat_values > 3 * mean(hat_values))
high_leverage_3mean

(d) Identify Influential Observations Using Cook's Distance and DFFITS

Cook's distance and DFFITS can be used to identify influential points. Observations with
Cook's distance greater than 1, or with |DFFITS| greater than 2√(p/n) (where
p is the number of parameters and n is the sample size), are considered influential.

R Code:

# Cook's distance
cooks_distance <- cooks.distance(model)
influential_cooks <- which(cooks_distance > 1)

# DFFITS
dffits_values <- dffits(model)
p <- length(coef(model)) # Number of parameters
n <- length(y) # Number of observations
dffits_threshold <- 2 * sqrt(p / n)
influential_dffits <- which(abs(dffits_values) > dffits_threshold)

# Display results
influential_cooks
influential_dffits

(e) Evaluate Improvement After Omitting Outliers

If outliers are found, we can rerun the regression after removing them and check if the
residuals improve.

R Code:

# Remove the outliers identified earlier
# (assumes at least one outlier was found; if `outliers` is empty, skip this step)
y_no_outliers <- y[-outliers]
libby_x_no_outliers <- libby_x[-outliers]

# Refit the model without outliers
model_no_outliers <- lm(y_no_outliers ~ libby_x_no_outliers)

# Residual analysis after removing outliers
ordinary_res_no_outliers <- residuals(model_no_outliers)
student_res_no_outliers <- rstandard(model_no_outliers)
deleted_student_res_no_outliers <- rstudent(model_no_outliers)

# Normal probability plots after removing outliers
par(mfrow = c(1, 3))
qqnorm(ordinary_res_no_outliers, main = "Ordinary Residuals (No Outliers)")
qqline(ordinary_res_no_outliers)

qqnorm(student_res_no_outliers, main = "Studentized Residuals (No Outliers)")
qqline(student_res_no_outliers)

qqnorm(deleted_student_res_no_outliers, main = "Deleted Studentized Residuals (No Outliers)")
qqline(deleted_student_res_no_outliers)
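To quantify any improvement beyond the visual comparison (an illustrative sketch), the fit of the two models can be compared directly:

# Compare model fit before and after removing the outliers
summary(model)$r.squared
summary(model_no_outliers)$r.squared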

Interpretation:

• (a) Normality: You will examine the normal probability plots to assess whether the
residuals appear normally distributed.
• (b) Outliers: By identifying observations with studentized and deleted studentized
residuals greater than 3 in absolute value, we detect outliers.
• (c) Leverage: Leverage points are identified if their leverage values are twice or
thrice the average leverage.
• (d) Influence: Influential observations are those with high Cook's distance or
DFFITS values.
• (e) Improvement: After removing outliers, we check if the normality of residuals
improves.

This approach will help you analyze the water flow data using various residual diagnostics
in R.