An Introduction to R, RStudio, R Markdown
Savaş Dayanık
24 09 2019
Introduction to R, RStudio
You can type straight text and math, for example,
$$\text{tips} = X\beta + \varepsilon$$
Some tips:
• This is a bullet. You can emphasize any important information by enclosing it in single stars.
• If it is absolutely important, then double the stars.
• To insert a new R chunk, use the Insert menu or press Ctrl-Alt-I. For all keybindings, the Tools menu is your friend.
– To run a single R statement in a chunk, press Ctrl-Enter.
– To run all statements in the same chunk, press Ctrl-Shift-Enter.
– To comment out any part of code or text, highlight it and press Ctrl-Shift-C.
Analyze tips dataset
A waiter recorded the values of several variables that he thinks are important in determining the tip amount, and
wants us to analyze the relation between the tips he received and those factors.
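# Packages assumed by the code in this document (the setup chunk is not shown)
library(tidyverse)  # read_csv(), dplyr verbs, gather(), ggplot2
library(magrittr)   # extract2(), multiply_by()
library(pander)     # pander() tables

# Read the data; the unnamed first column (a row index) is dropped with [, -1]
d <- suppressWarnings(read_csv("tips.csv",
    col_types = cols(
      X1 = col_double(),
      OBS = col_double(),
      TOTBILL = col_double(),
      TIP = col_double(),
      SEX = col_character(),
      SMOKER = col_character(),
      DAY = col_character(),
      TIME = col_character(),
      SIZE = col_double()
    ))[, -1]) %>%
  select(-OBS) %>%
  mutate(
    SEX = factor(SEX),
    # Recode DAY with levels in weekday order and upper-case labels
    DAY = factor(DAY, levels = c("thurs", "fri", "sat", "sun"),
                 labels = c("THU", "FRI", "SAT", "SUN")),
    TIME = factor(TIME),
    SMOKER = factor(SMOKER))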
## New names:
## * `` -> `...1`
d %>% distinct(DAY)
## # A tibble: 4 x 1
## DAY
## <fct>
## 1 SUN
## 2 SAT
## 3 THU
## 4 FRI
d %>% count(DAY)
## # A tibble: 4 x 2
## DAY n
## <fct> <int>
## 1 THU 62
## 2 FRI 19
## 3 SAT 87
## 4 SUN 76
d %>%
head() %>%
pander(caption = "(\\#tab:data) A glimpse over the data")
Table 1: A glimpse over the data
TOTBILL TIP SEX SMOKER DAY TIME SIZE
16.99 1.01 F no SUN dinner 2
10.34 1.66 M no SUN dinner 3
21.01 3.5 M no SUN dinner 3
23.68 3.31 M no SUN dinner 2
24.59 3.61 F no SUN dinner 4
25.29 4.71 M no SUN dinner 4
# pander(another(yetanother(d)))
#
# d %>%
# yetanother() %>%
# another() %>%
# pander()
Table 1 shows the first six rows of the tips data set, which actually has 244 rows. Let us briefly describe the
variables in the table:
TOTBILL Total bill paid by the party
TIP Tip left by the party
SEX Gender of the person who paid the bill (F, M)
SMOKER Whether the bill payer smokes (yes, no)
DAY Day of the week when the party had the meal (thurs, fri, sat, sun)
TIME Time of day when the party had the meal (lunch, dinner)
SIZE Number of people in the party
Below is a summary of each variable:
d %>%
summary() %>%
pander()
Table 2: Table continues below
TOTBILL TIP SEX SMOKER DAY TIME
Min. : 3.07 Min. : 1.000 F: 87 no :151 THU:62 dinner:176
1st Qu.:13.35 1st Qu.: 2.000 M:157 yes: 93 FRI:19 lunch : 68
Median :17.80 Median : 2.900 NA NA SAT:87 NA
Mean :19.79 Mean : 2.998 NA NA SUN:76 NA
3rd Qu.:24.13 3rd Qu.: 3.562 NA NA NA NA
Max. :50.81 Max. :10.000 NA NA NA NA
SIZE
Min. :1.00
1st Qu.:2.00
Median :2.00
Mean :2.57
3rd Qu.:3.00
Max. :6.00
boxplot(d$TOTBILL, main = "TOTALBILL")
boxplot(d$TIP, main = "TIP")
boxplot(d$SIZE, main = "SIZE")
[Figure: three boxplot panels titled TOTALBILL, TIP, and SIZE]
Figure 1: Boxplots of TOTALBILL (left), TIP (middle), and SIZE (right).
The boxplots in Figure 1 show that all numerical variables have right-skewed distributions.
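A quick numerical check of this reading, offered as a rough heuristic rather than a formal test: in a right-skewed distribution the mean usually exceeds the median. The values below are copied from the summary table above; the data frame name `smry` is ours and is used only for illustration.

```r
# Medians and means copied from the summary table (Table 2)
smry <- data.frame(
  variable = c("TOTBILL", "TIP", "SIZE"),
  median   = c(17.80, 2.900, 2.00),
  mean     = c(19.79, 2.998, 2.57)
)

# Heuristic: mean > median hints at a longer right tail
smry$mean_exceeds_median <- smry$mean > smry$median
smry
```

For all three variables the mean is larger than the median, which agrees with what the boxplots suggest.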
d %>%
##select_if(is.numeric) %>%
gather(variable, value, TOTBILL, TIP, SIZE) %>%
ggplot(aes(variable, value)) +
# geom_boxplot(aes(fill = DAY)) +
geom_boxplot(aes(fill = TIME)) +
coord_flip()
[Figure: horizontal boxplots of value (0 to 50) for each variable (TOTBILL, TIP, SIZE), filled by TIME (dinner, lunch)]
Scatterplot
Modern version
g <- ggplot(d, aes(TOTBILL, TIP)) +
geom_point() +
geom_abline(intercept = 0, slope = .18, col = "red") +
geom_text(x=45, y=45*.18, label="18% tip\nline",
col="red", hjust = 0, vjust=1 )
print(g)
[Figure: scatterplot of TIP against TOTBILL with a red line labeled "18% tip line"]
plot(g)
[Figure: the same scatterplot of TIP against TOTBILL with the red "18% tip line"]
d
## # A tibble: 244 x 7
## TOTBILL TIP SEX SMOKER DAY TIME SIZE
## <dbl> <dbl> <fct> <fct> <fct> <fct> <dbl>
## 1 17.0 1.01 F no SUN dinner 2
## 2 10.3 1.66 M no SUN dinner 3
## 3 21.0 3.5 M no SUN dinner 3
## 4 23.7 3.31 M no SUN dinner 2
## 5 24.6 3.61 F no SUN dinner 4
## 6 25.3 4.71 M no SUN dinner 4
## 7 8.77 2 M no SUN dinner 2
## 8 26.9 3.12 M no SUN dinner 4
## 9 15.0 1.96 M no SUN dinner 2
## 10 14.8 3.23 M no SUN dinner 2
## # i 234 more rows
g + facet_grid(DAY+TIME~SMOKER+SEX, labeller = label_both) +
theme(strip.text.y = element_text(angle = 0))
[Figure: scatterplots of TIP against TOTBILL with the "18% tip line", faceted by DAY and TIME (rows) and by SMOKER and SEX (columns)]
Let us also calculate the correlation between TOTBILL and TIP:
cor(d$TOTBILL, d$TIP)
## [1] 0.6757341
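As a reminder of what `cor()` computes by default (Pearson's correlation), here is a minimal sketch that applies the textbook formula to the six rows shown in Table 1 and compares the result with `cor()`. The vector names `x` and `y` are ours, and the value for these six rows will differ from the full-data correlation above.

```r
# First six rows of TOTBILL and TIP, copied from Table 1
x <- c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29)
y <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)

# Pearson correlation: sum of cross-products of deviations,
# divided by the square root of the product of the sums of squares
dx <- x - mean(x)
dy <- y - mean(y)
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)  # the two computations agree
```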
d %>%
group_by(SMOKER, SEX, DAY, TIME) %>%
summarize(cor = cor(TOTBILL, TIP),
count = n())
## `summarise()` has grouped output by 'SMOKER', 'SEX', 'DAY'. You can override
## using the `.groups` argument.
## # A tibble: 20 x 6
## # Groups: SMOKER, SEX, DAY [16]
## SMOKER SEX DAY TIME cor count
## <fct> <fct> <fct> <fct> <dbl> <int>
## 1 no F THU dinner NA 1
## 2 no F THU lunch 0.881 24
## 3 no F FRI dinner NA 1
## 4 no F FRI lunch NA 1
## 5 no F SAT dinner 0.623 13
## 6 no F SUN dinner 0.849 14
## 7 no M THU lunch 0.798 20
## 8 no M FRI dinner 1 2
## 9 no M SAT dinner 0.920 32
## 10 no M SUN dinner 0.706 43
## 11 yes F THU lunch 0.869 7
## 12 yes F FRI dinner 0.949 4
## 13 yes F FRI lunch 0.374 3
## 14 yes F SAT dinner 0.448 15
## 15 yes F SUN dinner -0.665 4
## 16 yes M THU lunch 0.629 10
## 17 yes M FRI dinner 0.926 5
## 18 yes M FRI lunch -0.305 3
## 19 yes M SAT dinner 0.621 27
## 20 yes M SUN dinner -0.0835 15
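The `NA` entries belong to groups with a single observation, and the group with two observations has correlation exactly 1. A small sketch with made-up numbers (not taken from the data) shows why the correlations of very small groups carry little information:

```r
# Made-up numbers, for illustration only
r_one  <- cor(17, 1)                # one observation: correlation is undefined
r_up   <- cor(c(10, 20), c(1, 3))   # two points, y increases with x
r_down <- cor(c(10, 20), c(3, 1))   # two points, y decreases with x

c(r_one, r_up, r_down)
```

A line can always be drawn through two distinct points, so their correlation is perfect whichever direction it takes; the `count` column is worth reading together with `cor`.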
d %>%
mutate(crazy = factor(sample(floor(seq(n())/2)))) %>%
group_by(crazy) %>%
ggplot(aes(TOTBILL, TIP)) +
geom_line(aes(col=crazy)) +
geom_point(aes(col=crazy)) +
theme(legend.position = "none")
[Figure: TIP against TOTBILL, with points and connecting line segments colored by the random grouping crazy; legend hidden]
d %>%
mutate(crazy = factor(sample(floor(seq(n())/2)))) %>%
group_by(crazy) %>%
summarize(cor = cor(TOTBILL, TIP)) %>%
xtabs(~cor, .)
## Warning: There were 4 warnings in `summarize()`.
## The first warning was:
## i In argument: `cor = cor(TOTBILL, TIP)`.
## i In group 17: `crazy = 16`.
## Caused by warning in `cor()`:
## ! the standard deviation is zero
## i Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
## cor
## -1 1
## 28 89
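To see why only the values -1 and 1 appear, it helps to look at the sizes of the `crazy` groups. The sketch below reproduces the labels for 244 rows; `sample()` only shuffles which rows receive which label, so the group sizes do not depend on the random draw. The names `labels`, `sizes`, `n_pairs`, and `n_single` are ours.

```r
n <- 244                              # number of rows in the tips data
labels <- floor(seq(n) / 2)           # 0, 1, 1, 2, 2, ..., 121, 121, 122
sizes  <- as.vector(table(labels))    # number of rows carrying each label

n_pairs  <- sum(sizes == 2)           # groups made of two rows
n_single <- sum(sizes == 1)           # groups made of one row
c(pairs = n_pairs, singletons = n_single)
```

There are 121 pairs and 2 singletons. The correlation of a pair is -1 or 1 unless one of the two variables is constant within the pair, which is consistent with the 28 + 89 = 117 values tabulated above plus the 4 zero-standard-deviation warnings.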
Tables are useful to analyze categorical (qualitative, factor) data
The categorical variables in the tips data are SMOKER, SEX, DAY, and TIME.
d_tbl <- d %>%
mutate_if(is.character, factor) %>%
select_if(is.factor) %>%
xtabs(~ . , .)
d_tbl %>% ftable
## TIME dinner lunch
## SEX SMOKER DAY
## F no THU 1 24
## FRI 1 1
## SAT 13 0
## SUN 14 0
## yes THU 0 7
## FRI 4 3
## SAT 15 0
## SUN 4 0
## M no THU 0 20
## FRI 2 0
## SAT 32 0
## SUN 43 0
## yes THU 0 10
## FRI 5 3
## SAT 27 0
## SUN 15 0
library(vcd)
## Loading required package: grid
structable( DAY + TIME ~ SEX + SMOKER, d_tbl)
## DAY THU FRI SAT SUN
## TIME dinner lunch dinner lunch dinner lunch dinner lunch
## SEX SMOKER
## F no 1 24 1 1 13 0 14 0
## yes 0 7 4 3 15 0 4 0
## M no 0 20 2 0 32 0 43 0
## yes 0 10 5 3 27 0 15 0
(margin.table(d_tbl, c(1,2)) / sum(d_tbl)) %>% round(2) %>% multiply_by(100)
## SMOKER
## SEX no yes
## F 22 14
## M 40 25
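The same percentages can be reproduced in base R from the SEX by SMOKER counts, obtained here by summing the flat table above over DAY and TIME. This is a minimal sketch; the object names `tab` and `pct` are ours.

```r
# SEX x SMOKER counts, summed over DAY and TIME from the flat table
tab <- matrix(c(54, 97, 33, 60), nrow = 2,
              dimnames = list(SEX = c("F", "M"), SMOKER = c("no", "yes")))

# Joint percentages: each cell divided by the grand total
pct <- round(100 * tab / sum(tab))
pct
```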
# dimnames(d_tbl)
d_tbl %>%
margin.table(c(1,2))%>%
mosaic(type="expected")
[Figure: mosaic plot of expected counts, SEX (F, M) by SMOKER (no, yes)]
d_tbl %>%
margin.table(c(1,2))%>%
mosaic(gp = gpar(fill = rep(c("pink", "lightblue"), each=2)))
[Figure: mosaic plot of SEX (F, M) by SMOKER (no, yes), tiles filled pink and light blue]
Tiles are aligned within each block. Therefore, we tend to think that SMOKER and SEX are independent.
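Independence can be made concrete: if SEX and SMOKER were independent, the expected count in each cell would be (row total × column total) / grand total. The sketch below compares observed and expected counts, using the SEX by SMOKER counts obtained by summing the flat table above over DAY and TIME. The object names `tab` and `expected` are ours.

```r
# Observed SEX x SMOKER counts
tab <- matrix(c(54, 97, 33, 60), nrow = 2,
              dimnames = list(SEX = c("F", "M"), SMOKER = c("no", "yes")))

# Expected counts under independence: outer product of the margins over n
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

round(expected, 2)
round(tab - expected, 2)   # deviations of observed from expected
```

Every observed count is within one unit of its expected value, which matches the visual impression of aligned tiles.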
dimnames(d_tbl)
## $SEX
## [1] "F" "M"
##
## $SMOKER
## [1] "no" "yes"
##
## $DAY
## [1] "THU" "FRI" "SAT" "SUN"
##
## $TIME
## [1] "dinner" "lunch"
d_tbl %>%
margin.table(1)%>%
mosaic()
[Figure: mosaic plot of SEX (F, M)]
d_tbl %>%
margin.table(c(1,3))%>%
mosaic(gp =gpar(fill = rep(c("pink", "lightblue"), each=4)))
[Figure: mosaic plot of SEX (F, M) by DAY (THU, FRI, SAT, SUN), tiles filled pink and light blue]
library(RColorBrewer)
d_tbl %>%
margin.table(c(1,3))%>%
mosaic(gp =gpar(fill = brewer.pal(4, "PuOr"))) # picked diverging palette
[Figure: mosaic plot of SEX (F, M) by DAY (THU, FRI, SAT, SUN), tiles filled with the PuOr palette]
d_tbl %>%
margin.table(c(1,3))%>%
mosaic(type="expected")
[Figure: mosaic plot of expected counts, SEX (F, M) by DAY (THU, FRI, SAT, SUN)]
Because the DAY tiles within the SEX blocks are significantly misaligned, we cannot expect SEX and DAY to be
independent; they seem to be related. Can we measure the strength of the relation? We will return to this later.
• Tile areas are proportional to the cell counts of the corresponding table.
• Tiles within blocks that are aligned across blocks strongly suggest independence, as for SEX and SMOKER
(random variables).
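The misalignment of the DAY tiles can also be read off numerically by comparing the distribution of DAY within each SEX. The sketch below uses the SEX by DAY counts obtained by summing the flat table above over SMOKER and TIME; the object names `tab` and `row_prop` are ours.

```r
# Observed SEX x DAY counts
tab <- matrix(c(32, 30, 9, 10, 28, 59, 18, 58), nrow = 2,
              dimnames = list(SEX = c("F", "M"),
                              DAY = c("THU", "FRI", "SAT", "SUN")))

# Distribution of DAY within each SEX (each row sums to 1)
row_prop <- prop.table(tab, margin = 1)
round(row_prop, 2)
```

Under independence the two rows would be roughly equal. Here they are not: Thursday accounts for a much larger share of the bills paid by women than of those paid by men.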
How can we check the relation between every pair of categorical variables?
library(vcd)
d %>%
mutate_if(is.character, factor) %>%
select_if(is.factor) %>%
xtabs(~. , .) %>%
pairs(diag_panel = pairs_barplot(var_offset = 1.3,
rot = -30,
just_leveltext = "left",
gp_leveltext = gpar(fontsize = 8)),
shade = TRUE)
[Figure: pairs plot of shaded mosaic displays for SEX, SMOKER, DAY, and TIME, with bar plots of the level counts on the diagonal]
Independence tests
Digression: What does pipe %>% do?
5*((mean(extract2(d, "TOTBILL"), na.rm=TRUE))^2 )
## [1] 1957.418
versus
d %>%
  # select(TOTBILL) %>%
  extract2("TOTBILL") %>%
  mean(na.rm = TRUE) %>%
  `^`(2) %>%
  `*`(5)
## [1] 1957.418
Which one is more readable and easier to modify?
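In short, `x %>% f(y)` is evaluated as `f(x, y)`: the left-hand side becomes the first argument of the call on the right. The sketch below checks this on a plain vector (the six TOTBILL values from Table 1) so that it runs without the data file. It assumes the magrittr package is installed; `raise_to_power()` and `multiply_by()` are magrittr aliases for `^` and `*`, and the names `totbill`, `nested`, and `piped` are ours.

```r
library(magrittr)

# First six TOTBILL values, copied from Table 1
totbill <- c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29)

# Nested calls: read from the inside out
nested <- 5 * (mean(totbill, na.rm = TRUE)^2)

# Piped calls: read from top to bottom, one step per line
piped <- totbill %>%
  mean(na.rm = TRUE) %>%
  raise_to_power(2) %>%
  multiply_by(5)

all.equal(nested, piped)  # both forms compute the same number
```

With the pipe, adding, removing, or commenting out a step touches a single line, whereas the nested form requires rebalancing parentheses.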
Hair color, Eye color, gender
Go to the Google form and fill it in for
• yourself,
• your mother,
• your father,
• your siblings