Introduction To R, Version 2
Contents
Introduction
1 Starting out in R
1.1 Variables
1.2 Saving code in an R script
1.3 Vectors
1.4 Types of vector
1.5 Indexing vectors
1.6 Sequences
1.7 Functions
2 Data frames
2.1 Setting up
2.2 Loading data
2.3 Exploring
2.4 Indexing data frames
2.5 Columns are vectors
2.6 Logical indexing
2.7 Factors
2.8 Readability vs tidyness
2.9 Sorting
2.10 Joining data frames
2.11 Further reading
4 Summarizing data
4.1 Summary functions
4.2 Missing values
4.3 Grouped summaries
4.4 t-test
5 Thinking in R
5.1 Lists
5.2 Other types not covered here
5.3 Programming
6 Next steps
6.1 Deepen your understanding
6.2 Expand your vocabulary
6.3 Join the community
Introduction
These are course notes for the “Introduction to R” course given by the Monash
Bioinformatics Platform for the Monash Data Fluency initiative. Our teaching
style is based on the style of The Carpentries. This is a new version of the
course focussing on the modern Tidyverse set of packages. We believe this is
currently the quickest route to being productive in R.
During the workshop we will be using the RStudio Cloud to use R over the web:
• RStudio Cloud
You can also install R on your own computer. There are two things to download
and install:
• Download R
• Download RStudio
Source code
This book was created in R using the rmarkdown and bookdown packages!
• GitHub page¹⁰
This course is developed for the Monash Bioinformatics Platform by Paul Harrison.
10 https://github.com/MonashDataFluency/r-intro-2
11 http://creativecommons.org/licenses/by/4.0/
12 https://www.gapminder.org
Chapter 1
Starting out in R
Open RStudio, click on the “Console” pane, type 1+1 and press enter. R displays
the result of the calculation. In this document, we will show such an interaction
with R as below.
1+1
## [1] 2
+ is called an operator. R has the operators you would expect for basic
mathematics: + - * / ^. It also has operators that do more obscure things.
* has higher precedence than +. We can use brackets ( ) if necessary. Try
1+2*3 and (1+2)*3.
Spaces can be used to make code easier to read.
We can compare with == < > <= >=. This produces a logical value, TRUE or
FALSE. Note the double equals, ==, for equality comparison.
2 * 2 == 4
## [1] TRUE
There are also character strings such as "string". A character string must be
surrounded by either single or double quotes.
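As a small sketch of the points above (the values are arbitrary illustrations, not part of the course data), precedence, comparison and character strings can be tried out like this:

```r
# Multiplication binds more tightly than addition
1 + 2 * 3      # 7
(1 + 2) * 3    # 9

# Comparisons produce logical values
2 * 2 == 4     # TRUE
3 <= 2         # FALSE

# Single and double quotes both create character strings
'string' == "string"   # TRUE
```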
1.1 Variables
A variable is a name for a value. We can create a new variable by assigning a
value to it using <-.
width <- 5
width
## [1] 5
# Area of a square
area <- width * width
area
## [1] 25
width <- 10
width
## [1] 10
area
## [1] 25
Notice that the value of area we calculated earlier hasn’t been updated.
Assigning a new value to one variable does not change the values of other variables.
This is different to a spreadsheet, but usual for programming languages.
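A minimal sketch of what this means in practice: to bring a derived value up to date, we re-run the assignment that computed it. The variable names follow the example above.

```r
width <- 5
area <- width * width   # area is computed once, from the current width

width <- 10             # changing width does not touch area
area                    # still 25

area <- width * width   # re-run the assignment to update it
area                    # now 100
```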
Tip
Add comments to code, using lines starting with the # character. This makes it
easier for others to follow what the code is doing (and also for us the next time
we come back to it).
a <- 4*20
b <- 7
a+b
2*2+2*2+2*2
1.3 Vectors
A vector of numbers is a collection of numbers. “Vector” means different things
in different fields (mathematics, geometry, biology), but in R it is a fancy name
for a collection of numbers. We call the individual numbers elements of the
vector.
We can make vectors with c( ), for example c(1,2,3). c means “combine”. R
is obsessed with vectors: in R even single numbers are vectors of length one.
Many things that can be done with a single number can also be done with a
vector. For example arithmetic can be done on vectors as it can be on single
numbers.
myvec <- c(10,20,30,40,50)
myvec
## [1] 10 20 30 40 50
myvec + 1
## [1] 11 21 31 41 51
myvec + myvec
## [1] 20 40 60 80 100
length(myvec)
## [1] 5
c(60, myvec)
## [1] 60 10 20 30 40 50
c(myvec, myvec)
## [1] 10 20 30 40 50 10 20 30 40 50
When we talk about the length of a vector, we are talking about the number of
numbers in the vector.
Sometimes the best way to understand R is to try some examples and see what
it does.
What happens when you try to make a vector containing different types, using
c( )? Make a vector with some numbers, and some words (e.g. character strings
like "test", or "hello").
Why does the output show the numbers surrounded by quotes " " like character
strings are?
Because vectors can only contain one type of thing, R chooses a lowest common
denominator type of vector, a type that can contain everything we are trying to
put in it. A different language might stop with an error, but R tries to soldier
on as best it can. A number can be represented as a character string, but a
character string cannot be represented as a number, so when we try to put
both in the same vector R converts everything to a character string.
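The conversion can be seen directly. This is a sketch with made-up values; the name mixed is only for illustration.

```r
mixed <- c(1, 2, "test")
mixed            # "1" "2" "test"
class(mixed)     # "character"

# Logical values are converted to numbers when combined with numbers
c(TRUE, 5)       # 1 5

# Converting back fails for non-numeric strings: R gives NA (with a warning)
suppressWarnings(as.numeric(mixed))   # 1 2 NA
```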
myvec[1]
## [1] 10
myvec[2]
## [1] 20
myvec[2] <- 5
myvec
## [1] 10 5 30 40 50
myvec[c(4,3,2)]
## [1] 40 30 5
Challenge: indexing
We can create and index character vectors as well. A cafe is using R to create
their menu.
1. What does items[-3] produce? Based on what you find, use indexing to
create a version of items without "spam".
2. Use indexing to create a vector containing spam, eggs, sausage, spam, and
spam.
3. Add a new item, “lobster”, to items.
1.6 Sequences
Another way to create a vector is with the : operator:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
items[1:4]
Sequences are useful for other things, such as a starting point for calculations:
x <- 1:10
x*x
## [1] 1 4 9 16 25 36 49 64 81 100
plot(x, x*x)
[Figure: scatter plot of x*x against x]
1.7 Functions
Functions are the things that do all the work for us in R: calculate, manipulate
data, read and write to files, produce plots. R has many built in functions and
we will also be loading more specialized functions from “packages”.
We’ve already seen several functions: c( ), length( ), and plot( ). Let’s
now have a look at sum( ).
sum(myvec)
## [1] 135
We called the function sum with the argument myvec, and it returned the value
135. We can get help on how to use sum with:
?sum
Some functions take more than one argument. Let’s look at the function rep,
which means “repeat”, and which can take a variety of different arguments. In
the simplest case, it takes a value and the number of times to repeat that value.
rep(42, 10)
## [1] 42 42 42 42 42 42 42 42 42 42
rep(c(1,2,3), 10)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3), times=10)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(x=c(1,2,3), 10)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(times=10, x=c(1,2,3))
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
Arguments can have default values, and a function may have many different possible
arguments that make it do obscure things. For example, rep can also take
an argument each=. It’s typical for a function to be invoked with some number
of positional arguments, which are always given, plus some less commonly used
arguments, typically given by name.
rep(c(1,2,3), each=3)
## [1] 1 1 1 2 2 2 3 3 3
rep(c(1,2,3), each=3, times=5)
## [1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1
## [39] 1 2 2 2 3 3 3
2. Look at the documentation for the seq function. What does seq do? Give
an example of using seq with either the by or length.out argument.
Chapter 2
Data frames
Data frame is R’s name for tabular data. We generally want each row in a data
frame to represent a unit of observation, and each column to contain a different
type of information about the units of observation. Tabular data in this form
is called “tidy data”¹.
Today we will be using a collection of modern packages collectively known as
the Tidyverse². R and its predecessor S have a history dating back to 1976. The
Tidyverse fixes some dubious design decisions baked into “base R”, including
having its own slightly improved form of data frame, which is called a tibble.
Sticking to the Tidyverse where possible is generally safer, because Tidyverse
packages are more willing to generate errors rather than ignore problems.
2.1 Setting up
Our first step is to download the files we need and to install the Tidyverse. This
is the one step where we ask you to copy and paste some code:
# Install Tidyverse
install.packages("tidyverse")
If using RStudio Cloud, you might need to switch to R version 3.5.3 to successfully
install Tidyverse. Use the drop-down in the top right corner of the
page.
1 http://vita.had.co.nz/papers/tidy-data.html
2 https://www.tidyverse.org/
People also sometimes have problems installing all the packages in Tidyverse
on Windows machines. If you run into problems you may have more success
installing individual packages.
install.packages(c("dplyr","readr","tidyr","ggplot2"))
library(tidyverse)
# OR
library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
name,region,oecd,g77,lat,long,income2017
Afghanistan,asia,FALSE,TRUE,33,66,low
Albania,europe,FALSE,FALSE,41,20,upper_mid
Algeria,africa,FALSE,TRUE,28,3,upper_mid
Andorra,europe,FALSE,FALSE,42.50779,1.52109,high
Angola,africa,FALSE,TRUE,-12.5,18.5,lower_mid
## g77 = col_logical(),
## lat = col_double(),
## long = col_double(),
## income2017 = col_character()
## )
geo
## # A tibble: 196 x 7
## name region oecd g77 lat long income2017
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <chr>
## 1 Afghanistan asia FALSE TRUE 33 66 low
## 2 Albania europe FALSE FALSE 41 20 upper_mid
## 3 Algeria africa FALSE TRUE 28 3 upper_mid
## 4 Andorra europe FALSE FALSE 42.5 1.52 high
## 5 Angola africa FALSE TRUE -12.5 18.5 lower_mid
## 6 Antigua and Barbuda americas FALSE TRUE 17.0 -61.8 high
## 7 Argentina americas FALSE TRUE -34 -64 upper_mid
## 8 Armenia europe FALSE FALSE 40.2 45 lower_mid
## 9 Australia asia TRUE FALSE -25 135 high
## 10 Austria europe TRUE FALSE 47.3 13.3 high
## # ... with 186 more rows
You can also see this data frame referring to itself as “a tibble”. This is the
Tidyverse’s improved form of data frame. Tibbles present themselves more
conveniently than base R data frames. Base R data frames don’t show the type
of each column, and output every row when you try to view them.
Tip
A data frame can also be created from vectors, with the tibble function. (See
also data.frame in base R.) For example:
tibble(foo=c(10,20,30), bar=c("a","b","c"))
## # A tibble: 3 x 2
## foo bar
## <dbl> <chr>
## 1 10 a
## 2 20 b
## 3 30 c
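For comparison, here is a sketch of the base R equivalent mentioned in the tip, using the same made-up columns. The name df is only for illustration.

```r
# Base R equivalent of the tibble above
df <- data.frame(foo = c(10, 20, 30), bar = c("a", "b", "c"))
df
nrow(df)      # 3
ncol(df)      # 2
class(df)     # "data.frame"
```

Printing df shows all rows and no column types, which is the difference in presentation described above.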
Tip
2.3 Exploring
The View function gives us a spreadsheet-like view of the data frame.
View(geo)
print with the n argument can be used to show more than the first 10 rows on
the console.
print(geo, n=200)
nrow(geo)
## [1] 196
ncol(geo)
## [1] 7
colnames(geo)
summary(geo)
geo[4,2]
## # A tibble: 1 x 1
## region
## <chr>
## 1 europe
Note that while this is a single value, it is still wrapped in a data frame. (This
is a behaviour specific to Tidyverse data frames.) More on this in a moment.
Columns can be given by name.
geo[4,"region"]
## # A tibble: 1 x 1
## region
## <chr>
## 1 europe
The column or row may be omitted, thereby retrieving the entire row or column.
geo[4,]
## # A tibble: 1 x 7
## name region oecd g77 lat long income2017
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <chr>
## 1 Andorra europe FALSE FALSE 42.5 1.52 high
geo[,"region"]
## # A tibble: 196 x 1
## region
## <chr>
## 1 asia
## 2 europe
## 3 africa
## 4 europe
## 5 africa
## 6 americas
## 7 americas
## 8 europe
## 9 asia
## 10 europe
## # ... with 186 more rows
geo[c(1,3,5),]
## # A tibble: 3 x 7
## name region oecd g77 lat long income2017
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <chr>
## 1 Afghanistan asia FALSE TRUE 33 66 low
## 2 Algeria africa FALSE TRUE 28 3 upper_mid
## 3 Angola africa FALSE TRUE -12.5 18.5 lower_mid
geo[1:7,]
## # A tibble: 7 x 7
## name region oecd g77 lat long income2017
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <chr>
## 1 Afghanistan asia FALSE TRUE 33 66 low
## 2 Albania europe FALSE FALSE 41 20 upper_mid
## 3 Algeria africa FALSE TRUE 28 3 upper_mid
## 4 Andorra europe FALSE FALSE 42.5 1.52 high
## 5 Angola africa FALSE TRUE -12.5 18.5 lower_mid
## 6 Antigua and Barbuda americas FALSE TRUE 17.0 -61.8 high
## 7 Argentina americas FALSE TRUE -34 -64 upper_mid
head( geo$region )
head( geo[["region"]] )
To get the “region” value of the 4th row as above, but unwrapped, we can use:
geo$region[4]
## [1] "europe"
plot(geo$long, geo$lat)
[Figure: scatter plot of geo$lat against geo$long]
is_southern <- geo$lat < 0
head(is_southern)
sum(is_southern)
## [1] 40
geo[is_southern,]
## # A tibble: 40 x 7
## name region oecd g77 lat long income2017
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <chr>
## 1 Angola africa FALSE TRUE -12.5 18.5 lower_mid
## 2 Argentina americas FALSE TRUE -34 -64 upper_mid
## 3 Australia asia TRUE FALSE -25 135 high
## 4 Bolivia americas FALSE TRUE -17 -65 lower_mid
## 5 Botswana africa FALSE TRUE -22 24 upper_mid
## 6 Brazil americas FALSE TRUE -10 -55 upper_mid
## 7 Burundi africa FALSE TRUE -3.5 30 low
## 8 Chile americas TRUE TRUE -33.5 -70.6 high
## 9 Comoros africa FALSE TRUE -12.2 44.4 low
## 10 Congo, Dem. Rep. africa FALSE TRUE -2.5 23.5 low
## # ... with 30 more rows
• x == y – “equal to”
• x != y – “not equal to”
• x < y – “less than”
• x > y – “greater than”
• x <= y – “less than or equal to”
• x >= y – “greater than or equal to”
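Comparisons like these produce logical vectors, and logical vectors can be combined with & (and), | (or) and ! (not). The following is a sketch on made-up toy vectors, not the real geo table; lat and oecd here are hypothetical stand-ins for its columns.

```r
# Hypothetical toy data, not the real geo table
lat  <- c(33, -12.5, -25, 47.3)
oecd <- c(FALSE, FALSE, TRUE, TRUE)

is_southern <- lat < 0      # FALSE TRUE TRUE FALSE
is_southern & oecd          # both true:   FALSE FALSE TRUE FALSE
is_southern | oecd          # either true: FALSE TRUE TRUE TRUE
!is_southern                # negation:    TRUE FALSE FALSE TRUE

sum(is_southern)            # TRUE counts as 1, so this counts matches: 2
lat[is_southern & oecd]     # -25
```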
The oecd column of geo tells which countries are in the Organisation for Eco-
nomic Co-operation and Development, and the g77 column tells which countries
are in the Group of 77 (an alliance of developing nations). We could see which
OECD countries are in the southern hemisphere with:
southern_oecd <- is_southern & geo$oecd
geo[southern_oecd,]
## # A tibble: 3 x 7
## name region oecd g77 lat long income2017
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <chr>
## 1 Australia asia TRUE FALSE -25 135 high
## 2 Chile americas TRUE TRUE -33.5 -70.6 high
## 3 New Zealand asia TRUE FALSE -42 174 high
is_southern seems like it should be kept within our geo data frame for future
use. We can add it as a new column of the data frame with:
geo$southern <- is_southern
geo
## # A tibble: 196 x 8
## name region oecd g77 lat long income2017 southern
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <chr> <lgl>
## 1 Afghanistan asia FALSE TRUE 33 66 low FALSE
## 2 Albania europe FALSE FALSE 41 20 upper_mid FALSE
## 3 Algeria africa FALSE TRUE 28 3 upper_mid FALSE
## 4 Andorra europe FALSE FALSE 42.5 1.52 high FALSE
## 5 Angola africa FALSE TRUE -12.5 18.5 lower_mid TRUE
## 6 Antigua and Barbuda americas FALSE TRUE 17.0 -61.8 high FALSE
## 7 Argentina americas FALSE TRUE -34 -64 upper_mid TRUE
## 8 Armenia europe FALSE FALSE 40.2 45 lower_mid FALSE
## 9 Australia asia TRUE FALSE -25 135 high TRUE
## 10 Austria europe TRUE FALSE 47.3 13.3 high FALSE
## # ... with 186 more rows
## # A tibble: 3 x 8
## name region oecd g77 lat long income2017 southern
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <chr> <lgl>
## 1 Australia asia TRUE FALSE -25 135 high TRUE
## 2 Chile americas TRUE TRUE -33.5 -70.6 high TRUE
## 3 New Zealand asia TRUE FALSE -42 174 high TRUE
In the second argument, we are able to refer to columns of the data frame
as though they were variables. The code is beautiful, but also opaque. It’s
important to understand that under the hood we are creating and combining
logical vectors.
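To make the "under the hood" point concrete, here is a sketch that assumes the function being discussed is filter from dplyr, applied to a made-up miniature table (mini is hypothetical). Both forms select the same rows.

```r
library(dplyr)

# Hypothetical miniature of the geo table
mini <- tibble(
  name = c("Angola", "Australia", "Austria"),
  oecd = c(FALSE, TRUE, TRUE),
  lat  = c(-12.5, -25, 47.3))

# Logical indexing written out explicitly
mini[mini$oecd & mini$lat < 0, ]

# The same rows using filter(), which lets us name columns directly
filter(mini, oecd & lat < 0)
```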
2.7 Factors
The count function from dplyr can help us understand the contents of some of
the columns in geo. count is also magical: we can refer to columns of the data
frame directly in the arguments to count.
count(geo, region)
## # A tibble: 4 x 2
## region n
## <chr> <int>
## 1 africa 54
## 2 americas 35
## 3 asia 59
## 4 europe 48
count(geo, income2017)
## # A tibble: 4 x 2
## income2017 n
## <chr> <int>
## 1 high 58
## 2 low 31
## 3 lower_mid 52
## 4 upper_mid 55
We should modify the income2017 column of the geo table in order to use this:
geo$income2017 <- factor(geo$income2017,
    levels=c("low","lower_mid","upper_mid","high"))
count(geo, income2017)
## # A tibble: 4 x 2
## income2017 n
## <fct> <int>
## 1 low 31
## 2 lower_mid 52
## 3 upper_mid 55
## 4 high 58
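The effect of choosing the level order can be seen on a small made-up vector. This is a base R sketch; income and income_f are hypothetical names.

```r
income <- c("high", "low", "upper_mid", "low", "lower_mid")

# Without levels=, the levels are put in alphabetical order
levels(factor(income))
# "high" "low" "lower_mid" "upper_mid"

# Giving levels= fixes a meaningful order
income_f <- factor(income, levels = c("low", "lower_mid", "upper_mid", "high"))
levels(income_f)
table(income_f)         # counts appear in level order
as.integer(income_f)    # 4 1 3 1 2
```

Internally a factor stores integer codes that index into its levels, which is why sorting and plotting follow the level order rather than the alphabet.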
plot(geo$income2017)
[Figure: bar plot produced by plot(geo$income2017)]
plot(geo$income2017, factor(geo$oecd))
[Figure: mosaic plot of income2017 against factor(geo$oecd), with levels FALSE and TRUE]
## # A tibble: 6 x 3
## income2017 oecd n
## <fct> <lgl> <int>
## 1 low FALSE 31
## 2 lower_mid FALSE 52
## 3 upper_mid FALSE 53
## 4 upper_mid TRUE 2
## 5 high FALSE 29
## 6 high TRUE 29
## # A tibble: 2 x 5
## oecd low lower_mid upper_mid high
## <lgl> <int> <int> <int> <int>
## 1 FALSE 31 52 53 29
## 2 TRUE NA NA 2 29
Tip
Tidying is often the first step when exploring a data-set. The tidyr³ package
contains a number of useful functions that help tidy (or un-tidy!) data. We’ve
just seen pivot_wider which spreads two columns into multiple columns. The
inverse of pivot_wider is pivot_longer, which gathers multiple columns into
two columns: a column of column names, and a column of values. pivot_longer
is often the first step when tidying a dataset you have received from the wild.
(This is sometimes also called a “melt” or a “gather”.)
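A sketch of the round trip on a small made-up table of counts (counts, wide and long are hypothetical names; the numbers are illustrative only):

```r
library(tidyr)

# Hypothetical counts in tidy ("long") form
counts <- tibble::tibble(
  oecd   = c(FALSE, FALSE, TRUE),
  income = c("low", "high", "high"),
  n      = c(31, 29, 29))

wide <- pivot_wider(counts, names_from = income, values_from = n)
wide    # one row per oecd value, columns low and high; missing cells are NA

long <- pivot_longer(wide, cols = c(low, high),
                     names_to = "income", values_to = "n")
long    # back to one row per (oecd, income) pair, including the NA cell
```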
3 http://tidyr.tidyverse.org/
Challenge: counting
Investigate how many OECD and non-OECD nations come from the northern
and southern hemispheres.
1. Using count.
2. By making a mosaic plot.
Remember you may need to convert columns to factors for plot to work, and
that a southern column could be added to geo with:
geo$southern <- geo$lat < 0
2.9 Sorting
Data frames can be sorted using the arrange function in dplyr.
arrange(geo, lat)
## # A tibble: 196 x 8
## name region oecd g77 lat long income2017 southern
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <fct> <lgl>
## 1 New Zealand asia TRUE FALSE -42 174 high TRUE
## 2 Argentina americas FALSE TRUE -34 -64 upper_mid TRUE
## 3 Chile americas TRUE TRUE -33.5 -70.6 high TRUE
## 4 Uruguay americas FALSE TRUE -33 -56 high TRUE
## 5 Lesotho africa FALSE TRUE -29.5 28.2 lower_mid TRUE
## 6 South Africa africa FALSE TRUE -29 24 upper_mid TRUE
## 7 Swaziland africa FALSE TRUE -26.5 31.5 lower_mid TRUE
## 8 Australia asia TRUE FALSE -25 135 high TRUE
## 9 Paraguay americas FALSE TRUE -23.3 -58 upper_mid TRUE
## 10 Botswana africa FALSE TRUE -22 24 upper_mid TRUE
## # ... with 186 more rows
Numeric columns are sorted in numeric order. Character columns will be sorted
in alphabetical order. Factor columns are sorted in order of their levels. The
desc helper function can be used to sort in descending order.
arrange(geo, desc(name))
## # A tibble: 196 x 8
## name region oecd g77 lat long income2017 southern
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <fct> <lgl>
## 1 Zimbabwe africa FALSE TRUE -19 29.8 low TRUE
## 2 Zambia africa FALSE TRUE -14.3 28.5 lower_mid TRUE
## 3 Yemen asia FALSE TRUE 15.5 47.5 lower_mid FALSE
## # A tibble: 4,312 x 5
## name year population gdp_percap life_exp
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1800 3280000 603 28.2
## 2 Albania 1800 410445 667 35.4
## 3 Algeria 1800 2503218 715 28.8
## 4 Andorra 1800 2654 1197 NA
## 5 Angola 1800 1567028 618 27.0
## 6 Antigua and Barbuda 1800 37000 757 33.5
## 7 Argentina 1800 534000 1507 33.2
## 8 Armenia 1800 413326 514 34
## 9 Australia 1800 351014 814 34.0
## 10 Austria 1800 3205587 1847 34.4
## # ... with 4,302 more rows
Quiz
It would be useful to have general information about countries from geo available
as columns when we use this data frame. gap and geo share a column called
name which can be used to match rows from one to the other.
## # A tibble: 4,312 x 12
## name year population gdp_percap life_exp region oecd g77 lat long
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 Afgh~ 1800 3280000 603 28.2 asia FALSE TRUE 33 66
## 2 Alba~ 1800 410445 667 35.4 europe FALSE FALSE 41 20
## 3 Alge~ 1800 2503218 715 28.8 africa FALSE TRUE 28 3
## 4 Ando~ 1800 2654 1197 NA europe FALSE FALSE 42.5 1.52
## 5 Ango~ 1800 1567028 618 27.0 africa FALSE TRUE -12.5 18.5
## 6 Anti~ 1800 37000 757 33.5 ameri~ FALSE TRUE 17.0 -61.8
## 7 Arge~ 1800 534000 1507 33.2 ameri~ FALSE TRUE -34 -64
## 8 Arme~ 1800 413326 514 34 europe FALSE FALSE 40.2 45
## 9 Aust~ 1800 351014 814 34.0 asia TRUE FALSE -25 135
## 10 Aust~ 1800 3205587 1847 34.4 europe TRUE FALSE 47.3 13.3
## # ... with 4,302 more rows, and 2 more variables: income2017 <fct>,
## # southern <lgl>
The output contains all ways of pairing up rows by name. In this case each row
of geo pairs up with multiple rows of gap.
The “left” in “left join” refers to how rows that can’t be paired up are handled.
left_join keeps all rows from the first data frame, but drops rows from the
second that can’t be paired up. This is a good default when the intent is to
attach some extra information to a data frame. inner_join discards all rows
that can’t be paired up. full_join keeps all rows from both data frames.
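The difference between the three joins is easiest to see on tiny tables. This sketch uses made-up data; gap_toy and geo_toy are hypothetical stand-ins for gap and geo.

```r
library(dplyr)

# Hypothetical toy tables sharing a "name" column
gap_toy <- tibble(name = c("A", "A", "B", "C"),
                  year = c(2000, 2010, 2010, 2010))
geo_toy <- tibble(name = c("A", "B", "D"),
                  region = c("asia", "europe", "africa"))

left_join(gap_toy, geo_toy, by = "name")    # 4 rows; C gets region NA
inner_join(gap_toy, geo_toy, by = "name")   # 3 rows; C is dropped
full_join(gap_toy, geo_toy, by = "name")    # 5 rows; C and D are both kept
```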
5 http://r4ds.had.co.nz/
6 https://monashdatafluency.github.io/r-progtidy/
Chapter 3
Plotting with ggplot2
We already saw some of R’s built-in plotting facilities with the function plot.
A more recent and much more powerful plotting library is ggplot2. ggplot2
is another mini-language within R, a language for creating plots. It implements
ideas from a book called “The Grammar of Graphics”¹. The syntax can be a
little strange, but there are plenty of examples in the online documentation².
ggplot2 is part of the Tidyverse, so loading the tidyverse package will load
ggplot2.
library(tidyverse)
1 https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448
2 http://ggplot2.tidyverse.org/reference/
[Figure: plot of life_exp]
The call to ggplot and aes sets up the basics of how we are going to represent
the various columns of the data frame. aes defines the “aesthetics”, which is how
columns of the data frame map to graphical attributes such as x and y position,
color, size, etc. aes is another example of magic “non-standard evaluation”:
arguments to aes may refer to columns of the data frame directly. We then
literally add layers of graphics (“geoms”) to this.
Further aesthetics can be used. Any aesthetic can be either numeric or
categorical, and an appropriate scale will be used.
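As a sketch of how these pieces fit together, assume a small made-up data frame (toy, with columns named like those in the course data). aes maps columns to aesthetics, and a geom is added as a layer.

```r
library(ggplot2)

# Hypothetical stand-in for the course data
toy <- data.frame(
  gdp_percap = c(600, 2000, 9000, 40000),
  life_exp   = c(55, 62, 71, 81),
  population = c(3e7, 1e8, 5e7, 2e7),
  region     = c("africa", "asia", "americas", "europe"))

# aes() maps columns to x, y, size and color; geom_point() adds a layer
p <- ggplot(toy, aes(x = gdp_percap, y = life_exp,
                     size = population, color = region)) +
  geom_point()
p
```

Here population is numeric, so it gets a continuous size scale, while region is categorical and gets a discrete color scale.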
[Figure: plot of life_exp, with legends for population and region]
This R code will get the data from the year 2010:
• gdp_percap as x.
• life_exp as y.
• population as the size.
• region as the color.
[Figure: plot of life_exp, with a legend for region]
A wide variety of geoms are available. Here we show Tukey box-plots. Note
again the use of the “group” aesthetic: without this, ggplot will just show one
big box-plot.
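A sketch of the effect of the group aesthetic, assuming made-up data in which year is a numeric column (toy, p_one and p_grouped are hypothetical names):

```r
library(ggplot2)

# Hypothetical data: four countries measured in two years
toy <- data.frame(
  year     = rep(c(2000, 2010), each = 4),
  life_exp = c(50, 60, 70, 78, 55, 64, 73, 80))

# year is numeric, so without group= all rows fall into a single box
p_one <- ggplot(toy, aes(x = year, y = life_exp)) + geom_boxplot()

# group=year asks for one box per year
p_grouped <- ggplot(toy, aes(x = year, y = life_exp, group = year)) +
  geom_boxplot()
p_grouped
```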
[Figures: four plots of life_exp; the third has a legend for oecd (FALSE, TRUE)]
Notice also that the second geom_line has some further arguments controlling
its appearance. These are not aesthetics: they are not a mapping of data to
appearance, but rather a direct specification of the appearance. There isn’t an
associated scale as when color was an aesthetic.
[Figure: plot titled “Gapminder”, with the y axis labelled “Life expectancy”]
Now, the figure has proper labels and titles. However, the title is not at the
center of the figure. We can further customize it using the theme() function (for
more detail please see the docs ?theme).
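One possible way to center the title is sketched below on made-up data. The labels are illustrative, and hjust=0.5 is just one of many theme settings.

```r
library(ggplot2)

# Hypothetical data
toy <- data.frame(year = c(1800, 1900, 2000), life_exp = c(30, 35, 68))

p <- ggplot(toy, aes(x = year, y = life_exp)) +
  geom_line() +
  labs(title = "Gapminder", x = "Year", y = "Life expectancy") +
  # hjust = 0.5 centers the title; 0 is left-aligned, 1 is right-aligned
  theme(plot.title = element_text(hjust = 0.5))
p
```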
[Figure: plot titled “Gapminder”, with the y axis labelled “Life expectancy”]
[Figure: plot of life_exp against year, 1800 to 2000]
Type scale_ and press the tab key. You will see functions giving fine-grained
controls over various scales (x, y, color, etc). These allow transformations (e.g.
log10), and manually specified breaks (labelled values). Very fine-grained control
is possible over the appearance of ggplots; see the ggplot2 documentation for
details and further examples.
Continuing with your scatter-plot of the 2010 data, add axis labels to your plot.
Give your x axis a log scale by adding scale_x_log10().
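A sketch of a log-scaled x axis with manually chosen breaks, on made-up data (toy and p_log are hypothetical names; the break values are arbitrary):

```r
library(ggplot2)

# Hypothetical data
toy <- data.frame(
  gdp_percap = c(600, 2000, 9000, 40000),
  life_exp   = c(55, 62, 71, 81))

p_log <- ggplot(toy, aes(x = gdp_percap, y = life_exp)) +
  geom_point() +
  # positions are log10-transformed; breaks sets which values get tick labels
  scale_x_log10(breaks = c(1000, 10000)) +
  labs(x = "GDP per capita", y = "Life expectancy")
p_log
```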
3.5 Faceting
Faceting lets us quickly produce a collection of small plots. The plots all have
the same scales and the eye can easily compare them.
[Figure: life_exp against year, 1800 to 2000, in four panels: africa, americas, asia, europe]
Note the use of ~, which we’ve not seen before. ~ syntax is used in R to specify
dependence on some set of variables, for example when specifying a linear model.
Here the information in each plot is dependent on the continent.
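A minimal faceting sketch, assuming a made-up data frame with a region column (toy is hypothetical):

```r
library(ggplot2)

# Hypothetical data for two regions
toy <- data.frame(
  year     = rep(c(1800, 1900, 2000), times = 2),
  life_exp = c(30, 33, 55, 35, 45, 78),
  region   = rep(c("africa", "europe"), each = 3))

# ~ region produces one panel per distinct value of region
p <- ggplot(toy, aes(x = year, y = life_exp)) +
  geom_line() +
  facet_wrap(~ region)
p
```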
# To save to a file
ggsave("test.png", p)
# This is an alternative method that works with "base R" plots as well:
png("test.png")
print(p)
dev.off()
Figures in papers tend to be quite small. This means text must be proportionately
larger than we usually show on screen. Dots should also be proportionately
larger, and lines proportionately thicker. The way to achieve this using ggsave
is to specify a small width and height, given in inches. To ensure the output
also has good resolution, specify a high dots-per-inch, or use a vector-graphics
format such as PDF or SVG.
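As a sketch of this advice, the following saves a made-up plot at a small size. The 3 inch size, the 600 dpi setting and the file names are arbitrary choices, and the files go to a temporary directory.

```r
library(ggplot2)

p <- ggplot(data.frame(x = 1:10, y = (1:10)^2), aes(x, y)) + geom_point()

out_png <- file.path(tempdir(), "figure.png")
out_pdf <- file.path(tempdir(), "figure.pdf")

# A small physical size (in inches) makes text and points relatively large;
# dpi = 600 keeps the bitmap sharp at that size
ggsave(out_png, p, width = 3, height = 3, dpi = 600)

# Vector formats stay sharp at any zoom level
ggsave(out_pdf, p, width = 3, height = 3)
```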
Chapter 4
Summarizing data
Having loaded and thoroughly explored a data set, we are ready to distill it
down to concise conclusions. At its simplest, this involves calculating summary
statistics like counts, means, and standard deviations. Beyond this is the fitting
of models, and hypothesis testing and confidence interval calculation. R has a
huge number of packages devoted to these tasks and this is a large part of its
appeal, but is beyond the scope of today.
Loading the data as before, if you have not already done so:
library(tidyverse)
mean( c(1,2,3,4) )
## [1] 2.5
sum(gap2010$population)
## [1] 6949495061
mean(gap2010$life_exp)
## [1] NA
gap2010$life_exp
## [1] 56.20 76.31 76.55 82.66 60.08 76.85 75.82 73.34 81.98 80.50 69.13 73.79
## [13] 76.03 70.39 76.68 70.43 79.98 71.38 61.82 72.13 71.64 76.75 57.06 74.19
## [25] 77.08 73.86 57.89 57.73 66.12 57.25 81.29 72.45 47.48 56.49 79.12 74.59
## [37] 76.44 65.93 57.53 60.43 80.40 56.34 76.33 78.39 79.88 77.47 79.49 63.69
## [49] 73.04 74.60 76.72 70.52 74.11 60.93 61.66 76.00 61.30 65.28 80.00 81.42
## [61] 62.86 65.55 72.82 80.09 62.16 80.41 71.34 71.25 57.99 55.65 65.49 32.11
## [73] 71.58 82.61 74.52 82.03 66.20 69.90 74.45 67.24 80.38 81.42 81.69 74.66
## [85] 82.85 75.78 68.37 62.76 60.73 70.10 80.13 78.20 68.45 63.80 73.06 79.85
## [97] 46.50 60.77 76.10 NA 73.17 81.35 74.01 60.84 53.07 74.46 77.91 59.46
## [109] 80.28 63.72 68.23 73.42 75.47 65.38 69.74 NA 66.18 76.36 73.55 54.48
## [121] 66.84 58.60 NA 68.26 80.73 80.90 77.36 58.78 60.53 81.04 76.09 65.33
## [133] NA 77.85 58.70 74.07 77.92 69.03 76.30 79.84 79.52 73.66 69.24 64.59
## [145] NA 75.48 71.64 71.46 NA 68.91 75.13 64.01 74.65 73.38 55.05 82.69
## [157] 75.52 79.45 61.71 53.13 54.27 81.94 74.42 66.29 70.32 46.98 81.52 82.21
## [169] 76.15 79.19 69.61 59.30 76.57 71.10 58.74 69.86 72.56 76.89 78.21 67.94
## [181] NA 56.81 70.41 76.51 80.34 78.74 76.36 68.77 63.02 75.41 72.27 73.07
## [193] 67.51 52.02 49.57 58.13
mean(gap2010$life_exp, na.rm=TRUE)
## [1] 70.34005
Ideally we should also use weighted.mean here, to take population into account.
weighted.mean(gap2010$life_exp, gap2010$population, na.rm=TRUE)
## [1] 70.96192
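To see what the weighting changes, here is a sketch with three made-up values (not taken from the course data):

```r
life_exp   <- c(80, 60, NA)
population <- c(1e6, 3e6, 5e6)

mean(life_exp, na.rm = TRUE)                        # 70: each country counts equally
weighted.mean(life_exp, population, na.rm = TRUE)   # 65: weighted by population

# The same thing by hand, dropping the missing value and its weight
ok <- !is.na(life_exp)
sum(life_exp[ok] * population[ok]) / sum(population[ok])   # 65
```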
NA + 1
## [1] NA
is.na( c(1,2,NA,3) )
## [1] FALSE FALSE  TRUE FALSE
summarize(gap2010, mean_life_exp=weighted.mean(life_exp, population, na.rm=TRUE))
## # A tibble: 1 x 1
## mean_life_exp
## <dbl>
## 1 71.0
So far unremarkable, but summarize comes into its own when the group_by
“adjective” is used.
summarize(
group_by(gap_geo, year),
mean_life_exp=weighted.mean(life_exp, population, na.rm=TRUE))
## # A tibble: 22 x 2
## year mean_life_exp
## <dbl> <dbl>
## 1 1800 30.9
## 2 1810 31.1
## 3 1820 31.2
## 4 1830 31.4
## 5 1840 31.4
## 6 1850 31.6
## 7 1860 30.3
## 8 1870 31.5
## 9 1880 32.0
## 10 1890 32.5
## # ... with 12 more rows
Challenge: summarizing
What is the total population for each year? Plot the result.
Advanced: What is the total GDP for each year? For this you will first need to
calculate GDP per capita times the population of each country.
group_by can be used to group by multiple columns, much like count. We can
use this to see how the rest of the world is catching up to OECD nations in
terms of life expectancy.
## # A tibble: 44 x 3
## # Groups: year [22]
## year oecd mean_life_exp
## <dbl> <lgl> <dbl>
## 1 1800 FALSE 29.9
## 2 1800 TRUE 34.7
## 3 1810 FALSE 29.9
## 4 1810 TRUE 35.2
## 5 1820 FALSE 30.0
## 6 1820 TRUE 35.9
## 7 1830 FALSE 30.0
## 8 1830 TRUE 36.2
## 9 1840 FALSE 30.0
## 10 1840 TRUE 36.2
## # ... with 34 more rows
[Figure: plot of mean_life_exp, with a legend for oecd (FALSE, TRUE)]
A similar plot could be produced using geom_smooth. Differences here are that
we have full control over the summarization process so we were able to use the
exact summarization method we want (weighted.mean for each year), and we
have access to the resulting numeric data as well as the plot. We have reduced
a large data set down to a smaller one that distills out one of the stories present
in this data. However the earlier visualization and exploration activity using
ggplot2 was essential. It gave us an idea of what sort of variability was present
in the data, and any unexpected issues the data might have.
4.4 t-test
We will finish this section by demonstrating a t-test. The main point of this
section is to give a flavour of how statistical tests work in R, rather than the
details of what a t-test does.
Has life expectancy increased from 2000 to 2010?
t.test(gap2010$life_exp, gap2000$life_exp)
##
## Welch Two Sample t-test
##
## data: gap2010$life_exp and gap2000$life_exp
## t = 3.0341, df = 374.98, p-value = 0.002581
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.023455 4.792947
## sample estimates:
## mean of x mean of y
## 70.34005 67.43185
Statistical routines often have many ways to tweak the details of their operation.
These are specified by further arguments to the function call, to override the
default behaviour. By default, t.test performs an unpaired t-test, but these
are repeated observations of the same countries. We can specify paired=TRUE
to t.test to perform a paired sample t-test and gain some statistical power.
Check this by looking at the help page with ?t.test.
It’s important to first check that both data frames are in the same order.
all(gap2000$name == gap2010$name)
## [1] TRUE
t.test(gap2010$life_exp, gap2000$life_exp, paired=TRUE)
##
## Paired t-test
##
## data: gap2010$life_exp and gap2000$life_exp
## t = 13.371, df = 188, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.479153 3.337249
## sample estimates:
## mean of the differences
## 2.908201
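To see why pairing can gain power, here is a minimal sketch on simulated data. The vectors before and after are invented for illustration and are not the gap2000 and gap2010 data. Each unit has a large individual baseline and a small, consistent change.

```r
set.seed(1)

# Simulated repeated measurements on the same 50 units (illustrative only).
before <- rnorm(50, mean = 65, sd = 10)          # large spread between units
after  <- before + rnorm(50, mean = 3, sd = 2)   # small, consistent increase

# The default is an unpaired (Welch) test; paired = TRUE tests the differences.
unpaired <- t.test(after, before)
paired   <- t.test(after, before, paired = TRUE)

unpaired$p.value
paired$p.value
```

The paired test works on the per-unit differences, so the large variation between units no longer obscures the change.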
When performing a statistical test, it’s good practice to visualize the data to
make sure there is nothing funny going on.
plot(gap2000$life_exp, gap2010$life_exp)
abline(0,1)
[Figure: scatter plot of gap2010$life_exp (y axis) against gap2000$life_exp (x axis).]
Chapter 5
Thinking in R
The result of a t-test is actually a value we can manipulate further. Two
functions help us here. class gives the “public face” of a value, and typeof
gives its underlying type, the way R thinks of it internally. For example,
numbers are “numeric” and have some representation in computer memory, either
“integer” for whole numbers only, or “double”, which can hold fractional numbers
(stored in memory in a base-2 version of scientific notation).
class(42)
## [1] "numeric"
typeof(42)
## [1] "double"
result <- t.test(gap2010$life_exp, gap2000$life_exp, paired=TRUE)
class(result)
## [1] "htest"
typeof(result)
## [1] "list"
names(result)
result$p.value
## [1] 4.301261e-29
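The same inspection can be tried without the course data. This is a small sketch using simulated numbers (x and res are names chosen for this example only), showing that a test result is a list whose elements can be pulled out by name.

```r
set.seed(42)

# Simulated sample, for illustration only.
x <- rnorm(30, mean = 1)
res <- t.test(x)

class(res)     # the "public face": an htest object
typeof(res)    # the underlying type: a list
names(res)     # the elements that can be extracted with $

res$p.value    # a single number
res$conf.int   # the confidence interval, a numeric vector of length 2

# Whole numbers written with an L suffix are stored as integers.
class(42L)
typeof(42L)
```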
5.1 Lists
Lists are vectors that can hold anything as elements (even other lists!). It’s
possible to create lists with the list function. This becomes especially useful
once you get into the programming side of R. For example, if you write your own
function that needs to return multiple values, it could return them in the form
of a list.
## $hello
## [1] "Hello" "world"
##
## $numbers
## [1] 1 2 3 4
class(mylist)
## [1] "list"
typeof(mylist)
## [1] "list"
names(mylist)
mylist$hello
mylist[[2]]
## [1] 1 2 3 4
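A list like the one printed above could be built along the following lines. This is a sketch: the exact call is an assumption, and describe is a hypothetical helper written only to show a function returning several values at once.

```r
# One plausible way to create a list with two named elements.
mylist <- list(hello = c("Hello", "world"), numbers = c(1, 2, 3, 4))

names(mylist)          # element names
mylist$hello           # access by name with $
mylist[["numbers"]]    # access by name with [[ ]]
mylist[[2]]            # access by position
mylist[2]              # single brackets return a shorter list, not the element

# Hypothetical helper: returns several results bundled in a list.
describe <- function(x) {
    list(mean = mean(x), range = range(x), n = length(x))
}

stats <- describe(mylist$numbers)
stats$mean
stats$range
```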
A matrix, as compared to a data frame:
• contains only one type of data, usually numeric (rather than different
types in different columns).
• commonly has rownames as well as colnames. (Base R data frames can
have rownames too, but it is easier to have any unique identifier as a
normal column instead.)
• has individual cells as the unit of observation (rather than rows).
Matrices can be created using as.matrix from a data frame, matrix from a
single vector, or using rbind or cbind with several vectors.
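The four routes just mentioned can be sketched as follows. All of the values and names (m1 to m4, df, s1, s2) are made up for the example.

```r
# From a single vector: values fill the matrix column by column.
m1 <- matrix(1:6, nrow = 2)

# From several vectors, stacked as rows or as columns.
m2 <- rbind(a = c(1, 2, 3), b = c(4, 5, 6))
m3 <- cbind(x = c(1, 2), y = c(3, 4))

# From a data frame whose columns are all numeric.
df <- data.frame(height = c(1.5, 1.8), weight = c(50, 80),
                 row.names = c("s1", "s2"))
m4 <- as.matrix(df)

dim(m1)
rownames(m2)
colnames(m3)
m4["s2", "weight"]   # index a single cell by row name and column name
```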
You may also encounter “S4 objects”, especially if you use Bioconductor¹
packages. The syntax for using these is different again, and uses @ to access
elements.
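As a minimal sketch of what @ access looks like, the following defines a toy S4 class. The class name Person and its slots are hypothetical and are not taken from any Bioconductor package.

```r
library(methods)

# Hypothetical S4 class with two slots, for illustration only.
setClass("Person", slots = c(name = "character", age = "numeric"))

p <- new("Person", name = "Ada", age = 36)

isS4(p)   # TRUE for S4 objects
p@name    # slots are accessed with @ rather than $
p@age
```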
5.3 Programming
Once you have a useful data analysis, you may want to do it again with different
data. You may have some task that needs to be done many times over. This is
where programming comes in:
• writing your own functions²
• repeating a task with for-loops³
• making decisions with if-statements⁴
The “R for Data Science” book⁵ is an excellent source to learn more. The Monash
Data Fluency “Programming and Tidy data analysis in R” course⁶ also covers
this.
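As a small taste of what this looks like, here is a sketch combining a function, an if-statement and a for-loop. The function classify, its cutoff of 70 and the input values are all invented for the example.

```r
# Hypothetical function: label a life expectancy as "high" or "low".
classify <- function(life_exp, cutoff = 70) {
    if (life_exp >= cutoff) {
        "high"
    } else {
        "low"
    }
}

values <- c(55, 72, 81)
labels <- character(length(values))

# Apply the function to each value in turn.
for (i in seq_along(values)) {
    labels[i] <- classify(values[i])
}

labels
```

Once a task is wrapped in a function like this, repeating it on different data only needs a new call, not new code.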
1 http://bioconductor.org/
2 http://r4ds.had.co.nz/functions.html
3 http://r4ds.had.co.nz/iteration.html
4 http://r4ds.had.co.nz/functions.html#conditional-execution
5 http://r4ds.had.co.nz/
6 https://monashdatafluency.github.io/r-progtidy/
Chapter 6
Next steps
The R Manuals⁸ are the place to look if you need a precise definition of how R
behaves.
Meetups in Melbourne:
• MelbURN¹⁰
• R-Ladies¹¹
5 https://www.rstudio.com/resources/cheatsheets/
6 https://cran.r-project.org/doc/contrib/Short-refcard.pdf
7 https://github.com/mikelove/bioc-refcard/blob/master/README.Rmd
8 https://cran.r-project.org/manuals.html
9 https://www.monash.edu/data-fluency
10 https://www.meetup.com/en-AU/MelbURN-Melbourne-Users-of-R-Network/
11 https://www.meetup.com/en-AU/R-Ladies-Melbourne/
12 https://carpentries.org/
13 https://combine.org.au/