ASSIGNMENT # 4
PROBLEM # 1
1.
1.Generating the graph using R
> Steepness <- c(0.629999995, 0.699999988, 0.819999993, 0.879999995,
1.149999976, 1.5, 4.400000095, 7.300000191, 11.30000019+ )
> GranuleDiameter <- c(0.170000002, 0.189999998, 0.219999999,
0.234999999, 0.234999999, 0.300000012, 0.349999994, 0.419999987,
0.850000024+ )
plot(Steepness,GranuleDiameter)
> plot(Steepness,GranuleDiameter, main="Steepness vs Granule
Diameter")
>plot(Steepness,GranuleDiameter, main="Steepness vs Granule
Diameter" , xlab="steepness of the beach" , ylab= "Diameter of
Granule")
abline(lm(GranuleDiameter~Steepness))
2. Genearting the model to predict the Diameter of the Granule from steepness of
the beach:
model <-lm(GranuleDiameter~Steepness)
summary(model)
Call:
lm(formula = GranuleDiameter ~ Steepness)
Residuals:
Min 1Q Median 3Q Max
-0.12826 -0.02434 0.01307 0.02739 0.08950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.160913 0.030102 5.346 0.00107 **
Steepness 0.053061 0.006288 8.438 6.48e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06739 on 7 degrees of freedom
Multiple R-squared: 0.9105, Adjusted R-squared: 0.8977
F-statistic: 71.2 on 1 and 7 DF, p-value: 6.475e-05
sqrt(0.9105)
[1] 0.9542012
Inorder to support my belief that the relationship is linear, I
am creating a residual plot where I can confirm the relationship is
linear.
plot(model)
By seeing the summary of the model, we have met certain assumptions
which are necessary to do linear regression analysis.They are
1. Correlation coefficient ‘r’ is strong because ‘r’ value is
nearest to 1.
2. The relationship is linear. This is confirmed by generating
a residual plot where we can see that residuals are normally
scattered around the 0 line.
3. The relaionship is causal.
As we met certain assumptions, we can now generate a formula where we can
predict the granule size from the steepness of the beach.
The Formula is :
Granule Diameter ^ = 0.161 + 0.054 * steepness of the beach
PROBLEM # 2
1. Generating the graph
povertyPercentage <- c(20.1, 7.1, 16.1, 14.9, 16.7, 8.8, 9.7,
10.3, 22, 16.2, 12.1, 10.3, 14.5, 12.4, 9.6, 12.2, 10.8, 14.7, 19.7,
11.2, 10.1, 11, 12.2, 9.2, 23.5, 9.4, 15.3, 9.6, 11.1, 5.3, 7.8,
25.3, 16.5, 12.6, 12, 11.5, 17.1, 11.2, 12.2, 10.6, 19.9, 14.5,
15.5, 17.4, 8.4, 10.3, 10.2, 12.5, 16.7, -1.5, 12.2 )
> BirthRate <- c(54.5, 39.5, 61.2, 59.9, 41.1, 47, 25.8, 46.3, 69.1,
44.5, 55.7, 38.2, 39.1, 42.2, 44.6, 32.5, 43, 51, 58.1, 25.4, 35.4,
23.3, 34.8, 27.5, 64.7, 44.1, 36.4, 37, 53.9, 20, 26.8, 62.4, 29.5,
52.2, 27.2, 39.5, 58, 36.8, 31.6, 35.6, 53, 38, 54.3, 64.4, 36.8,
24.2, 37.6, 33, 45.5, 32.3, 39.9+ )
> plot(povertyPercentage,BirthRate)
> plot(povertyPercentage,BirthRate, main="POVERTY VS BIRTH RATE ")
> plot(povertyPercentage,BirthRate, main="POVERTY VS BIRTH RATE ",
xlab= "percentage of people living in poverty",ylab="Teen Birth
Rate")
> abline(lm(BirthRate~povertyPercentage))
2. Generating the model:
model <- lm(BirthRate~povertyPercentage)
summary(model)
Call:
lm(formula = BirthRate ~ povertyPercentage)
Residuals:
Min 1Q Median 3Q Max
-19.0644 -7.4246 -0.4238 7.8092 15.5325
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.4172 3.7970 5.114 5.23e-06 ***
povertyPercentage 1.7665 0.2765 6.390 5.85e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.19 on 49 degrees of freedom
Multiple R-squared: 0.4545, Adjusted R-squared: 0.4434
F-statistic: 40.83 on 1 and 49 DF, p-value: 5.851e-08
> plot(model)
By examining the given dataset,graph and residual plot , I identified an outlier
which is an error because the percentage of people living in poverty in that
particular state will never be a negative value. The least value would be 0
percentage.
So, I am removing the negative percentage of people living in poverty in a
particular state(outlier) and the corresponding Teen Birth rate observation from the
Dataset and again generating a regression model.
Generating the graph after removal of the outlier:
PovertyPercentage <- c(20.1, 7.1, 16.1, 14.9, 16.7, 8.8, 9.7, 10.3,
22, 16.2, 12.1, 10.3, 14.5, 12.4, 9.6, 12.2, 10.8, 14.7, 19.7, 11.2,
10.1, 11, 12.2, 9.2, 23.5, 9.4, 15.3, 9.6, 11.1, 5.3, 7.8, 25.3,
16.5, 12.6, 12, 11.5, 17.1, 11.2, 12.2, 10.6, 19.9, 14.5, 15.5,
17.4, 8.4, 10.3, 10.2, 12.5, 16.7, 12.2 )
> BirthRate <- c(54.5, 39.5, 61.2, 59.9, 41.1, 47, 25.8, 46.3, 69.1,
44.5, 55.7, 38.2, 39.1, 42.2, 44.6, 32.5, 43, 51, 58.1, 25.4, 35.4,
23.3, 34.8, 27.5, 64.7, 44.1, 36.4, 37, 53.9, 20, 26.8, 62.4, 29.5,
52.2, 27.2, 39.5, 58, 36.8, 31.6, 35.6, 53, 38, 54.3, 64.4, 36.8,
24.2, 37.6, 33, 45.5, 39.9)
> plot(PovertyPercentage,BirthRate)
> plot(PovertyPercentage,BirthRate , main="POVERTY VS BIRTH RATE")
> plot(PovertyPercentage,BirthRate , main="POVERTY VS BIRTH RATE",
xlab="Percentage of people living in poverty", ylab="Teen Birth
Rate")
> abline(lm(BirthRate~PovertyPercentage))
Generating the model:
> model <- lm(BirthRate~PovertyPercentage)
> summary(model)
Call:
lm(formula = BirthRate ~ PovertyPercentage)
Residuals:
Min 1Q Median 3Q Max
-19.5956 -6.7355 -0.6259 7.5750 15.7252
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.7266 4.1482 3.791 0.000419 ***
PovertyPercentage 2.0224 0.2991 6.762 1.71e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.937 on 48 degrees of freedom
Multiple R-squared: 0.4879, Adjusted R-squared: 0.4772
G-statistic: 45.72 on 1 and 48 DF, p-value: 1.705e-08
> sqrt(0.4879)
[1] 0.6984984
> plot(model)
As a result of generating a regression model after removing the outlier from the
Dataset , I have seen that
1. Correlation coefficient ‘r’ is moderately strong, I.e., ‘r’ is somewhat nearest to
1.
2. The relationship is linear. This is confirmed by generating a
residual plot where we can see that residuals are normally
scattered around the 0 line.
3. In my opinion, the relationship is not causal as there is no causing factor for the
teen birth rate from the people living in poverty . For purposes of the course,
however I am assuming this relaion as causal and generating the linear regression
analysis.
Our Formula is :
Teen Birth rate ^ = 15.72 + 2.02 * percentage of people living in poverty.
Since there is no causing factor, we cannot predict the birth rate from the
percentage of people living in poverty. However, I have created the model for the
purpose of practice.