DATA VISUALIZATION WITH GGPLOT2
Case Study I
Bag Plot
ggplot2 2.0
● Write your own extensions
● Extremely flexible
● Create bag plot
● John Tukey (box plots)
● 2D box plot
data set
> dim(df)
[1] 202 2
> head(df)
  type     Value
1    1  99.43952
2    1  99.76982
3    1 101.55871
4    1 100.07051
5    1 100.12929
6    1 101.71506
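To follow along without the course files, a data frame with the same shape as df can be simulated. This is only a sketch: the name df_sketch, the seed, and the centres (100 and 150) are assumptions chosen to resemble the printed rows, not the course's own data.

```r
# Hypothetical stand-in for df: 101 paired observations in long format.
set.seed(123)                       # assumed seed, for reproducibility only
n <- 101
df_sketch <- data.frame(
  type  = rep(1:2, each = n),       # 1 = first measurement, 2 = second
  Value = c(rnorm(n, mean = 100),   # assumed centre for type 1
            rnorm(n, mean = 150))   # assumed centre for type 2
)
dim(df_sketch)                      # 202 rows, 2 columns
```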
2 box plots
> ggplot(df, aes(x = type, y = Value)) +
    geom_boxplot() +
    facet_wrap(~type, ncol = 2, scales = "free")
[Figure: box plots of Value, one facet per type (1 and 2) with free scales; x-axis: type, y-axis: Value]
slope plot
> df$ID <- seq_len(nrow(df) / 2)
> ggplot(df, aes(x = type, y = Value, group = ID)) +
    geom_line(alpha = 0.3)
[Figure: slope plot, one line per ID; x-axis: type, y-axis: Value]
Distribution of slope
[Figure: distribution of the slopes; axis: slope]
Box plot?
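Because type 1 and type 2 sit one unit apart on the x-axis, the slope of each line in the slope plot is simply the difference between the paired values. A minimal sketch of that calculation, using simulated data (df_sketch is hypothetical, not the course data):

```r
# Hypothetical paired data in the same long layout as df.
set.seed(123)
n <- 101
df_sketch <- data.frame(
  type  = rep(1:2, each = n),
  Value = c(rnorm(n, mean = 100), rnorm(n, mean = 150))
)
df_sketch$ID <- seq_len(nrow(df_sketch) / 2)  # recycled: rows i and i + n share an ID

# Slope per ID = Value at type 2 minus Value at type 1 (run is 1 unit).
first  <- df_sketch[df_sketch$type == 1, ]
second <- df_sketch[df_sketch$type == 2, ]
slope  <- second$Value[match(first$ID, second$ID)] - first$Value

# Five-number summary that a box plot of the slopes would display.
boxplot.stats(slope)$stats
```

Summarising the slopes this way loses the two original dimensions, which is one motivation for plotting both variables together.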
2 distinct variables
> head(dat)
     group1   group2
1  99.43952 149.2896
2  99.76982 150.2569
3 101.55871 149.7533
4 100.07051 149.6525
5 100.12929 149.0484
6 101.71506 149.9550
Scatter plot
> ggplot(dat, aes(x = group1, y = group2)) +
    geom_point()
[Figure: scatter plot; x-axis: group1, y-axis: group2]
2D density plot
> library(viridis)
> ggplot(dat, aes(x = group1, y = group2)) +
    stat_density_2d(geom = "tile", aes(fill = ..density..),
                    contour = FALSE) +
    scale_fill_viridis()
[Figure: 2D density plot; x-axis: group1, y-axis: group2, fill: density]
Bag plot
> library(aplpack)
> bagplot(dat[1:2])
[Figure: bag plot with the bag, loop, and hull labelled; x-axis: group1, y-axis: group2]
aplpack
> library(aplpack)
> plot_data <- compute.bagplot(x = dat$group1, y = dat$group2)
> names(plot_data)
[1] "center" "hull.center" "hull.bag" "hull.loop"
[5] "pxy.bag" "pxy.outer" "pxy.outlier" "hdepths"
[9] "is.one.dim" "prdata" "xy" "xydata"
ggplot2
> ggplot(dat, aes(x = group1, y = group2)) +
    geom_point()
[Figure: scatter plot; x-axis: group1, y-axis: group2]
ggplot2
> ggplot(dat, aes(x = group1, y = group2)) +
    stat_bag(alpha = 0.2)
[Figure: bag plot drawn with stat_bag(); x-axis: group1, y-axis: group2]
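stat_bag() is not part of ggplot2 itself; it is a custom stat. To illustrate the extension mechanism only, here is a sketch of a much simpler stat that draws the convex hull of the points (not the bag or the loop), following the ggproto pattern from ggplot2's extension documentation. The names StatChullSketch, stat_chull_sketch, and dat_sketch are hypothetical.

```r
library(ggplot2)

# A Stat that keeps only the points on the convex hull of each group.
StatChullSketch <- ggproto("StatChullSketch", Stat,
  required_aes = c("x", "y"),
  compute_group = function(data, scales) {
    data[chull(data$x, data$y), , drop = FALSE]
  }
)

# Layer function wrapping the Stat; polygon is the default geom.
stat_chull_sketch <- function(mapping = NULL, data = NULL, geom = "polygon",
                              position = "identity", na.rm = FALSE,
                              show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    stat = StatChullSketch, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend,
    inherit.aes = inherit.aes, params = list(na.rm = na.rm, ...)
  )
}

# Simulated stand-in for dat (assumed centres, not the course data).
set.seed(123)
dat_sketch <- data.frame(group1 = rnorm(101, mean = 100),
                         group2 = rnorm(101, mean = 150))

p_hull <- ggplot(dat_sketch, aes(x = group1, y = group2)) +
  stat_chull_sketch(alpha = 0.2) +
  geom_point()
```

A bag plot stat would follow the same structure, with compute_group() returning the bag or loop polygon instead of the convex hull.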
Remarks
● Useful but not popular
● Poorly understood
● Learn to use ggplot2 extensions
Case Study II
Weather (Part 1)
Weather
Source: http://www.edwardtufte.com/
present
> dim(present)
[1] 153 5
> head(present, n = 4)
  month day year temp new_day
1     1   1 2016   41       1
2     1   2 2016   37       2
3     1   3 2016   40       3
4     1   4 2016   33       4
> tail(present, n = 4)
    month day year temp new_day
148     5  28 2016   79     148
149     5  29 2016   80     149
150     5  30 2016   73     150
151     5  31 2016   76     151
Time series
> ggplot(present, aes(x = new_day, y = temp)) +
    geom_line()
[Figure: time series of the current year; x-axis: new_day, y-axis: temp]
past
> str(past)
'data.frame': 7645 obs. of 11 variables:
$ month : num 1 1 1 1 1 1 1 1 1 1 ...
$ day : num 1 2 3 4 5 6 7 8 9 10 ...
$ year : num 1995 1995 1995 1995 1995 ...
$ temp : num 44 41 28 31 21 27 42 35 34 29 ...
$ new_day : int 1 2 3 4 5 6 7 8 9 10 ...
$ upper : num 51 48 57 55 56 62 52 57 54 47 ...
$ lower : num 17 15 16 15 21 14 14 12 21 8.5 ...
$ avg : num 35.6 35.4 34.9 35.1 35.9 ...
$ se : num 2.19 1.83 2.46 2.53 1.92 ...
$ avg_upper: num 40.2 39.2 40 40.5 39.9 ...
$ avg_lower: num 31 31.5 29.7 29.8 31.9 ...
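The summary columns in past (upper, lower, avg, se, avg_upper, avg_lower) hold one value per day of the year. A sketch of how such columns might be derived is below; the exact definitions are assumptions (record high and low, mean, standard error, and a t-based 95% interval), and past_raw is simulated rather than the course data.

```r
# Hypothetical raw records: 5 days x 21 years (not the course data).
set.seed(42)
past_raw <- expand.grid(new_day = 1:5, year = 1995:2015)
past_raw$temp <- round(rnorm(nrow(past_raw), mean = 35, sd = 8))

# Summary for one day across all years.
summarise_day <- function(temp) {
  n  <- length(temp)
  se <- sd(temp) / sqrt(n)
  ci <- qt(0.975, df = n - 1) * se      # assumed 95% interval half-width
  c(upper = max(temp), lower = min(temp), avg = mean(temp), se = se,
    avg_upper = mean(temp) + ci, avg_lower = mean(temp) - ci)
}

# One row of statistics per new_day, then joined back onto every record.
stats <- do.call(rbind, lapply(split(past_raw$temp, past_raw$new_day),
                               summarise_day))
day_summary <- data.frame(new_day = as.integer(rownames(stats)), stats)
past_sketch <- merge(past_raw, day_summary, by = "new_day")
str(past_sketch)
```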
Each year separately
> ggplot(past, aes(x = new_day, y = temp, group = year)) +
    geom_line(alpha = 0.2)
[Figure: one line per past year; x-axis: new_day, y-axis: temp]
present + past
> ggplot(past, aes(x = new_day, y = temp, group = year)) +
    geom_line(alpha = 0.4) +
    geom_line(data = present, aes(group = 1), col = "red")
[Figure: past years with the current year overlaid in red; x-axis: new_day, y-axis: temp]
Linerange
[Figure: historical ranges drawn as lineranges; x-axis: new_day, y-axis: temp]
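The slide shows only the resulting plot. One way such a layer could be built is with geom_linerange(), one vertical segment per day. The sketch below uses a made-up per-day summary with the same column names as past; the data, the colours, and the name days are all assumptions.

```r
library(ggplot2)

# Hypothetical per-day summary (smooth seasonal curve, not the course data).
days <- data.frame(new_day = 1:365)
days$avg       <- 50 - 25 * cos(2 * pi * days$new_day / 365)
days$upper     <- days$avg + 15   # assumed record high
days$lower     <- days$avg - 15   # assumed record low
days$avg_upper <- days$avg + 4    # assumed upper interval bound
days$avg_lower <- days$avg - 4    # assumed lower interval bound

p_range <- ggplot(days, aes(x = new_day)) +
  # Full historical range, record low to record high.
  geom_linerange(aes(ymin = lower, ymax = upper), colour = "wheat2") +
  # Narrower interval around the daily average, drawn on top.
  geom_linerange(aes(ymin = avg_lower, ymax = avg_upper), colour = "wheat4") +
  labs(y = "temp")
```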
Records
[Figure: temperature plot with record highs and lows marked as points; x-axis: new_day, y-axis: temp]
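Marking records means comparing each day of the current year with the extremes of the same day in past years. A small base R sketch of that comparison follows; the toy data frames and the helper column names (past_high, past_low, record) are hypothetical.

```r
# Hypothetical stand-ins for past and present (not the course data).
past_sketch <- data.frame(
  new_day = rep(1:4, times = 2),
  year    = rep(c(2014, 2015), each = 4),
  temp    = c(30, 35, 28, 40, 33, 31, 29, 38)
)
present_sketch <- data.frame(new_day = 1:4, temp = c(36, 34, 20, 39))

# Record high and low per day across all past years.
past_high <- aggregate(temp ~ new_day, data = past_sketch, FUN = max)
past_low  <- aggregate(temp ~ new_day, data = past_sketch, FUN = min)
names(past_high)[2] <- "past_high"
names(past_low)[2]  <- "past_low"

# Attach both records to the current year and flag the days that beat them.
records <- merge(merge(present_sketch, past_high, by = "new_day"),
                 past_low, by = "new_day")
records$record <- ifelse(records$temp > records$past_high, "high",
                  ifelse(records$temp < records$past_low, "low", "none"))
records[records$record != "none", c("new_day", "temp", "record")]
```

The flagged rows can then be passed to geom_point() as their own layer, with colour mapped to record.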
Custom legend
[Figure: temperature plot with a custom legend: New record high, past record high, 95% CI range, Current year, past record low, New record low; x-axis: new_day, y-axis: temp]
Case Study II
Weather (Part 2)
Up to now
[Figure: temperature plot with a custom legend: New record high, past record high, 95% CI range, Current year, past record low, New record low; x-axis: new_day, y-axis: temp]
Situation
● Many data frames
● Plot summary data frame as a layer
● stat_summary()
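As a point of comparison, stat_summary() can compute a per-day summary inside the plot, so no separate summary data frame is needed. A minimal sketch, assuming simulated data and a hypothetical summary function range_summary:

```r
library(ggplot2)

# Hypothetical multi-year data: 10 days x 21 years (not the course data).
set.seed(42)
temps <- expand.grid(new_day = 1:10, year = 1995:2015)
temps$temp <- rnorm(nrow(temps), mean = 35, sd = 8)

# fun.data must return a data frame with y, ymin, and ymax.
range_summary <- function(y) {
  data.frame(y = mean(y), ymin = min(y), ymax = max(y))
}

# One linerange per new_day, spanning the lowest to highest temperature.
p_summary <- ggplot(temps, aes(x = new_day, y = temp)) +
  stat_summary(fun.data = range_summary, geom = "linerange")
```

A custom stat takes this one step further by packaging the summary logic behind its own layer function.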
stat_historical()
> ggplot(my_data, aes(x = new_day, y = temp, fill = year)) +
    stat_historical()
[Figure: historical ranges drawn by stat_historical(); x-axis: new_day, y-axis: temp]
stat_present()
> ggplot(my_data, aes(x = new_day, y = temp, fill = year)) +
    stat_historical() +
    stat_present()
[Figure: historical ranges with the current year added by stat_present(); x-axis: new_day, y-axis: temp]
stat_extremes()
> ggplot(my_data, aes(new_day, temp, fill = year)) +
    stat_historical() +
    stat_present() +
    stat_extremes(aes(colour = ..record..))
[Figure: historical ranges, current year, and record points from stat_extremes(); x-axis: new_day, y-axis: temp]
Specific layers
> ggplot(my_data, aes(new_day, temp, fill = year)) +
    stat_historical() +
    # stat_present() +
    stat_extremes(aes(colour = ..record..))
[Figure: historical ranges and record points only, with the current-year layer omitted; x-axis: new_day, y-axis: temp]
Faceting
[Figure: the weather plot faceted by city, with panels PARIS, REYKJAVIK, NEW YORK, and LONDON; x-axis: new_day, y-axis: temp]
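The slide gives no code for the faceted version. Since stats are computed separately for each panel, adding a facet specification to an existing plot is usually enough. A sketch with simulated data follows; the column name city and the data frame cities are assumptions.

```r
library(ggplot2)

# Hypothetical multi-city data (not the course data).
set.seed(7)
cities <- expand.grid(
  new_day = 1:30,
  city    = c("PARIS", "REYKJAVIK", "NEW YORK", "LONDON"),
  stringsAsFactors = FALSE
)
cities$temp <- rnorm(nrow(cities), mean = 50, sd = 10)

# One panel per city, two columns, shared axes.
p_facets <- ggplot(cities, aes(x = new_day, y = temp)) +
  geom_line() +
  facet_wrap(~ city, ncol = 2)
```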
Wrap-up
Statistics                 Design
Graphical Data Analysis    Communication & Perception
Explore                Explain
Confirm and Analyse    Inform and Persuade
Element       Description
Data          The dataset being plotted.
Aesthetics    The scales onto which we map our data.
Geometries    The visual elements used for our data.
Facets        Plotting small multiples.
Statistics    Representations of our data to aid understanding.
Coordinates   The space on which the data will be plotted.
Themes        All non-data ink.
[Figure: Total sleep time (h) by Eating habits (Carnivore, Herbivore, Insectivore, Omnivore)]
[Figure: Yield (bushels/acre) by Year (1931, 1932) for each Site (Waseca, Crookston, Morris, University Farm, Duluth, Grand Rapids)]
[Figure: proportions of weight categories (Under-weight, Healthy-weight, Over-weight, Obese) across ages 18 to 84; fill: residual]
[Figure: 2D density of eruptions against waiting; fill: density]
[Figure: plot with fill legend Unemployment (%)]
[Figure: ternary plot with axes Sand, Silt, and Clay]
[Figure: Iris Sepals; x-axis: Length, y-axis: Width, colour: Species (setosa, versicolor, virginica); caption: Anderson, 1936]
[Figure: bag plot from Case Study I; x-axis: group1, y-axis: group2]
[Figure: weather plot from Case Study II with its custom legend; x-axis: new_day, y-axis: temp]
Thank you!