
Assignment 4 R Tutorial∗

In Assignment 4 we’ll be focusing on hypothesis testing and regression models. In this tutorial, we’ll analyze a
dataset of tweets downloaded from X (formerly known as Twitter).
Setting Things Up
Let’s read in the data, check the sample size, and see what the first few rows look like:
dataset <- read.csv("mydataset.csv") # reading in the data
dim(dataset)

## [1] 950 4
head(dataset)

## username length likes retweets
## 1 @CPHO_Canada 178 24 11
## 2 @CPHO_Canada 258 49 17
## 3 @CPHO_Canada 109 50 27
## 4 @CPHO_Canada 78 40 20
## 5 @CPHO_Canada 78 39 16
## 6 @CPHO_Canada 251 68 21
From this output, we can see that our sample size is 950 and that our dataset contains
4 variates:
• username: the username (or ’handle’) of the account that sent the tweet.
• length: the length of the tweet text in number of characters.
• likes: The number of likes a tweet received.
• retweets: The number of retweets a tweet received.
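Since our later analyses compare specific accounts, it can also help to check which usernames appear in the data and how many tweets each contributed. One quick way to do this (not part of the original output, so run it on your own copy of mydataset.csv) is the table() function:

```r
# count how many tweets each account contributed to the dataset
table(dataset$username)
```
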


Adapted from Prof. Michael Wallace

Regression Modelling
In our first analysis, we’ll explore the relationship between the number of retweets and likes tweets receive. For
many accounts, the number of likes and retweets their tweets receive can be extremely large. Thus,
it will make analysis easier if we transform the likes and retweets variates. For ease of analysis, we’ll work with
log transformed versions of these variates, and create likes.log and retweets.log as follows:
dataset$retweets.log <- log(dataset$retweets + 1)
dataset$likes.log <- log(dataset$likes + 1)
dataset$retweets.log[1:5]

## [1] 2.484907 2.890372 3.332205 3.044522 2.833213


dataset$likes.log[1:5]

## [1] 3.218876 3.912023 3.931826 3.713572 3.688879


summary(dataset$retweets.log)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.000 2.303 3.332 3.581 4.875 8.750
summary(dataset$likes.log)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.000 3.590 4.836 4.921 6.402 9.609
Whenever we create new variates, it’s a good idea to check they look as expected. Let’s explore this using a
scatterplot:
plot(dataset$retweets.log, dataset$likes.log, main = "Scatterplot of likes.log and retweets.log",
xlab = "retweets.log", ylab = "likes.log", pch = 1, cex = 0.5, col = "navy",
las = 1, lwd = 1)

[Figure: scatterplot of likes.log (vertical axis) against retweets.log (horizontal axis)]
Let’s fit a linear regression model with retweets.log as the explanatory variate and likes.log as the response
variate. Think about why we might choose our explanatory and response variates this way round. We can fit this
linear model using the lm() command:
mod <- lm(likes.log ~ retweets.log, data = dataset)

Syntax Note: The basic syntax of lm() specifies the response variate followed by a tilde, followed by the explanatory
variate(s). The above command has stored the results of our linear regression in an object called mod.

We can look at the results of fitting our model with the summary() command:
summary(mod)

##
## Call:
## lm(formula = likes.log ~ retweets.log, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4734 -0.4891 0.1070 0.6061 3.1679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69863 0.07962 21.34 <2e-16 ***
## retweets.log 0.89984 0.01982 45.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 948 degrees of freedom
## Multiple R-squared: 0.6849, Adjusted R-squared: 0.6846
## F-statistic: 2060 on 1 and 948 DF, p-value: < 2.2e-16
Syntax Note: Recall that we have used summary() to retrieve summary statistics of variates. This function can
be applied to different objects in R and, depending on what type of object it is, will return different outputs!
Wow, that’s a lot of information! Let’s start with the table labelled Coefficients. The two rows correspond to
the y-intercept of the fitted line (Intercept) and the slope of the fitted line (retweets.log). For details of
each column, see Page 211 of the Course Notes.
Below the table we have the Residual standard error (which is an estimate of σ), the degrees of freedom
(check this is consistent with the sample size!), and the Multiple R-squared (which is the square of the sample
correlation coefficient). The other outputs are more important when considering more complicated models. You
can learn more about these in STAT 331!
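As a quick check of the claim that the Multiple R-squared is the square of the sample correlation coefficient, we can compute the correlation directly. This is a sketch, assuming dataset and mod are defined as above:

```r
# sample correlation between the explanatory and response variates
r <- cor(dataset$retweets.log, dataset$likes.log)
r^2  # should match the Multiple R-squared of 0.6849 (up to rounding)
```
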
We can extract values from the output in various ways, for example:
coef(mod)

## (Intercept) retweets.log
## 1.6986281 0.8998388
coef(summary(mod))

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 1.6986281 0.07961603 21.33525 8.519859e-83
## retweets.log 0.8998388 0.01982388 45.39167 5.905140e-240
coef(summary(mod))[, 1]

## (Intercept) retweets.log
## 1.6986281 0.8998388
coef(summary(mod))[, 4]

## (Intercept) retweets.log
## 8.519859e-83 5.905140e-240

sigma(mod)

## [1] 1.111014
Syntax Note: We can extract the values of the model parameter estimates with coef(mod). In contrast,
coef(summary(mod)) returns a table of all of the values in the results table, which we can then access the
values of (such as by asking for the fourth column to access the p-values).
Our fitted line is of the form
y = 1.6986 + 0.8998x,
so let’s add it to our scatterplot of the data using the abline() function:
plot(dataset$retweets.log, dataset$likes.log, main = "Scatterplot of likes.log and retweets.log",
xlab = "retweets.log", ylab = "likes.log", pch = 1, cex = 0.5, col = "navy",
las = 1, lwd = 1)
abline(coef(mod), lwd = 2, lty = 2, col = "red")

[Figure: scatterplot of likes.log against retweets.log, with the fitted line y = 1.6986 + 0.8998x overlaid as a dashed red line]
Syntax Note: We could also have fit this line using abline(1.6986, 0.8998): the syntax abline(a, b) will
add the line y = a + bx. By using coef() we avoid needing to type out the values when fitting our line, and the
associated risk of rounding errors!

Checking Model Fit


We can check some aspects of the fit of our model using the scatterplot, but we should also consult the standardized
residuals. These can be extracted from our linear model object using rstandard():
stdres <- rstandard(mod)
mean(stdres)

## [1] -0.0001457893
sd(stdres)

## [1] 1.001522
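If the model assumptions hold, the standardized residuals should behave roughly like a G(0, 1) sample, so we would expect about 95% of them to lie between −2 and 2. A quick sketch of this check, assuming stdres is defined as above:

```r
# proportion of standardized residuals within +/- 2 standard deviations;
# under the Gaussian model assumptions this should be roughly 0.95
mean(abs(stdres) <= 2)
```
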
Now let’s generate three plots using the standardized residuals:
# set up plot window to have 1 row of 3 plots
par(mfrow = c(1, 3))
# standardized residuals vs explanatory variate
plot(dataset$retweets.log, stdres, main = "Std residuals vs. retweets.log",
xlab = "retweets.log", ylab = "Standardized Residuals", pch = 1, col = "navy",
cex = 0.5, las = 1)

abline(h = 0, lty = 2, col = "red", lwd = 2)
# standardized residuals vs fitted values
plot(fitted(mod), stdres, main = "Std residuals vs. fitted values", xlab = "Fitted values",
ylab = "Standardized Residuals", pch = 1, col = "navy", cex = 0.5)
abline(h = 0, lty = 2, col = "red", lwd = 2)
# qqplot of standardized residuals
qqnorm(stdres, main = "qqplot of std residuals", xlab = "G(0, 1) Quantiles",
ylab = "Standardized Residuals", pch = 1, col = "navy", cex = 0.5)
qqline(stdres, lty = 2, col = "red", lwd = 2)
[Figure: three panels: standardized residuals vs. retweets.log, standardized residuals vs. fitted values, and a qqplot of the standardized residuals against G(0, 1) quantiles]

What do these plots tell us about our model assumptions?

Confidence Intervals: Model Parameters
We have found that the point estimates of α and β are α̂ = 1.699 and β̂ = 0.900. Note that these are also the
maximum likelihood and least squares estimates!
Recall: α and β are unknown constants, just like the µ in a Gaussian model or a θ in a Binomial model. They
have corresponding estimators α̃ and β̃ which are random variables with probability distributions. We can use our
knowledge of these to construct confidence intervals for α and β. The following code will calculate 90% confidence
intervals for α and β for our data using the output of summary(mod):
# 90% confidence interval for alpha
coef(summary(mod))[1, 1] - qt(0.95, 948) * coef(summary(mod))[1, 2]

## [1] 1.567543
coef(summary(mod))[1, 1] + qt(0.95, 948) * coef(summary(mod))[1, 2]

## [1] 1.829713
# 90% confidence interval for beta
coef(summary(mod))[2, 1] - qt(0.95, 948) * coef(summary(mod))[2, 2]

## [1] 0.8671996
coef(summary(mod))[2, 1] + qt(0.95, 948) * coef(summary(mod))[2, 2]

## [1] 0.9324781
Important: Check that you can understand why this code generates the desired confidence interval! Note that for
α the interval is of the form α̂ ± a × Std. Error, where P(T ≤ a) = 0.95 and T ∼ t(n−2) (why?), and similarly for β.
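Since the same calculation is repeated for each coefficient, we could also wrap it in a small helper function so we don’t retype the coef(summary(mod)) indexing for every bound. This is only a sketch: the name coef.ci is our own, not part of base R:

```r
# hypothetical helper: a level% confidence interval for the i-th
# coefficient of a fitted lm object
coef.ci <- function(fit, i, level = 0.9) {
  est <- coef(summary(fit))[i, 1]                 # point estimate
  se  <- coef(summary(fit))[i, 2]                 # standard error
  a   <- qt(1 - (1 - level)/2, df.residual(fit))  # t quantile on n - 2 df
  c(est - a * se, est + a * se)
}
coef.ci(mod, 1)  # 90% interval for alpha
coef.ci(mod, 2)  # 90% interval for beta
```
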
An easier way to obtain these intervals is using the confint() function, specifying that we want a 90% confidence
interval using the level option:
confint(mod, level = 0.9)

## 5 % 95 %
## (Intercept) 1.5675433 1.8297129
## retweets.log 0.8671996 0.9324781

Estimating Mean Response
Suppose we wanted to use our model to estimate the expected number of likes received by a tweet that gets 10
retweets. In order to do this we first find the expected value of likes.log for a tweet where the retweets.log
variate equals log(10 + 1). We can calculate this using the equation of our fitted line, or by using the predict
function:
coef(mod)[1] + coef(mod)[2] * log(10 + 1) # equation of fitted line

## (Intercept)
## 3.856347
predict(mod, newdata = data.frame(retweets.log = log(10 + 1))) # predict function

## 1
## 3.856347
Syntax Note: In our first command we have taken the parameter estimates directly from coef(mod) - this ensures
we have no rounding errors.
Syntax Note: To use predict() we specify a new data frame (newdata) which contains the values of the explanatory
variate we wish to predict for.
So the expected value of likes.log for a tweet that receives 10 retweets is 3.856347. This is a point estimate, and
so we want to communicate the uncertainty inherent within it. Luckily, our predict() function is prepared for
this! We can use the interval option, which allows us to request a confidence interval or a prediction interval:
predict(mod, data.frame(retweets.log = log(10 + 1)), interval = "confidence", level = 0.9)

## fit lwr upr


## 1 3.856347 3.785542 3.927152
predict(mod, data.frame(retweets.log = log(10 + 1)), interval = "prediction", level = 0.9)

## fit lwr upr


## 1 3.856347 2.025735 5.68696
Syntax Note: Here, we have continued to ask for 90% intervals with the level option. The predict() function
will generate 95% intervals by default, so if we wanted a 95% interval we could simply omit the level option
completely!

Comparing Population Means
Having explored the relationship between likes and retweets, we’re now going to take a look at the length of
tweets (that is, the number of characters used in a tweet). We are going to compare two accounts for this:
@CPHO_Canada, the Chief Public Health Officer of Canada, and @TheTorontoZoo, described in their Twitter
profile as “Official Twitter account of the Toronto Zoo, an organization that protects wildlife and wild spaces.”
These are two pretty different accounts – but would we expect a difference in the length of their tweets?
Data Summaries
First, let’s do a quick summary of our data:
# create cpho.length object
cpho.length <- dataset$length[dataset$username == "@CPHO_Canada"]
c.mean <- mean(cpho.length) # store mean of cpho.length
summary(cpho.length) # summarize cpho.length

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.0 109.0 197.0 175.6 245.0 276.0
length(cpho.length) # sample size for cpho.length

## [1] 190
# create zoo.length object
zoo.length <- dataset$length[dataset$username == "@TheTorontoZoo"]
z.mean <- mean(zoo.length) # store mean of zoo.length
summary(zoo.length) # summarize zoo.length

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 10.00 61.25 141.00 148.37 236.75 280.00
length(zoo.length) # sample size for zoo.length

## [1] 190
Note that we have created two new objects: cpho.length and zoo.length to make our coding a bit easier.
So it looks like in our sample @CPHO_Canada’s tweets are about 27 characters longer on average. Suppose that in
reality these two accounts had tweets of a similar mean length: would a difference of 27 characters in our sample be
surprising? Let’s conduct a hypothesis test! We’ll be assuming Gaussian models for our data, so we should explore
this using the usual numerical and graphical summaries. However, as we’ve covered this in previous tutorials we’ll
omit those details here.

Comparing Standard Deviations


Before we can carry out a test of whether the mean length of tweets is the same for both accounts, let’s compare
the standard deviation in the two groups:
# standard deviation of @CPHO_Canada tweet lengths
c.sd <- sd(cpho.length)
c.sd

## [1] 82.26672
# standard deviation of @TheTorontoZoo tweet lengths
z.sd <- sd(zoo.length)
z.sd

## [1] 89.62084

So the zoo’s tweets tend to have greater variability in length, but could this difference just be due to chance? Let’s
calculate confidence intervals for these two estimates (make sure you agree with these commands!):
# 95% CI for standard deviation of @CPHO_Canada tweet lengths
c(c.sd * sqrt(189/qchisq(0.975, 189)), c.sd * sqrt(189/qchisq(0.025, 189)))

## [1] 74.74323 91.48755


# 95% CI for standard deviation of @TheTorontoZoo tweet lengths
c(z.sd * sqrt(189/qchisq(0.975, 189)), z.sd * sqrt(189/qchisq(0.025, 189)))

## [1] 81.42480 99.66595


Important: The 189 in these commands is the degrees of freedom for each set of tweets. In this case, both
accounts have 190 tweets, so the degrees of freedom for each calculation are the same and equal to 190 − 1 = 189.
Always check the number of observations to make sure you have the correct degrees of freedom!
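Since we compute the same chi-squared interval for each account, we could again wrap the calculation in a helper that reads the degrees of freedom off the data itself. This is a sketch: the name sd.ci is ours, not base R:

```r
# hypothetical helper: a level% confidence interval for a Gaussian
# standard deviation, with degrees of freedom taken from the data
sd.ci <- function(x, level = 0.95) {
  df <- length(x) - 1
  p  <- (1 - level)/2
  c(sd(x) * sqrt(df/qchisq(1 - p, df)),
    sd(x) * sqrt(df/qchisq(p, df)))
}
sd.ci(cpho.length)  # should reproduce the interval above
sd.ci(zoo.length)
```
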
We can test whether the mean tweet length for each account is the same in two ways depending on whether we are
willing to assume that the standard deviation in length is the same for each account. If we are willing to assume
the standard deviation is the same, then we can estimate the pooled standard deviation across the two accounts by
sp = √( ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) )

# calculation of pooled standard deviation estimate


pooled.sd <- sqrt((189 * (c.sd)^2 + 189 * (z.sd)^2)/(190 + 190 - 2))
pooled.sd

## [1] 86.02241
We can then calculate the test statistic
d = |ȳ1 − ȳ2| / ( sp √(1/n1 + 1/n2) )

for a test of the null hypothesis that the two means are the same, and the resulting p-value by:
# calculate test statistic
d <- abs(c.mean - z.mean)/(pooled.sd * sqrt(1/190 + 1/190))
d

## [1] 3.087868
# calculate p-value
2 * (1 - pt(d, 190 + 190 - 2))

## [1] 0.002164853
Instead of step-by-step calculations, we can use the t.test() function to do the work for us!
# test of equal population means
t.test(cpho.length, zoo.length, var.equal = TRUE)

##
## Two Sample t-test
##
## data: cpho.length and zoo.length
## t = 3.0879, df = 378, p-value = 0.002165
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 9.898989 44.606274

## sample estimates:
## mean of x mean of y
## 175.6211 148.3684
We can also extract the individual elements from this test, for example:
# store test in test.results
test.results <- t.test(cpho.length, zoo.length, var.equal = TRUE)
# test statistic
test.results$statistic

## t
## 3.087868
# p-value
test.results$p.value

## [1] 0.002164853
# pooled standard error
test.results$stderr

## [1] 8.825713
# pooled standard deviation (compare with pooled.sd above)
test.results$stderr/sqrt(1/190 + 1/190)

## [1] 86.02241
# 90% confidence interval of difference in means
t.test(cpho.length, zoo.length, var.equal = TRUE, conf.level = 0.9)$conf.int[1:2]

## [1] 12.69996 41.80530


Syntax Note: in the last command we used the conf.level option to specify we wanted a 90% confidence interval
instead of the default of 95%.

