When teaching examples using R, instructors often use nice datasets, but these aren't very realistic and aren't what students will later encounter in the real world. Real datasets have typos, missing values encoded in strange ways, and stray whitespace. The {messy} R package takes a clean dataset and randomly adds these problems, giving students the opportunity to practise their data cleaning and wrangling skills without instructors having to change all of their examples.
Install from GitHub using:
remotes::install_github("nrennie/messy")
Load the package, then call messy() on a data frame:
library(messy)
set.seed(1234)
messy(ToothGrowth[1:10,])
len supp dose
1 4.2 vc 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10 VC 0.5
7 11.2 VC 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7 <NA> <NA>
Increase how messy the data is with the messiness argument:
set.seed(1234)
messy(ToothGrowth[1:10,], messiness = 0.7)
len supp dose
1 <NA> <NA> 0.5
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
6 10 <NA> 0.5
7 <NA> <NA> <NA>
8 <NA> <NA> 0.5
9 5.2 VC 0.5
10 7 <NA> <NA>
add_whitespace() randomly adds whitespace to the ends of some values, which means numeric columns may be converted to character:
set.seed(1234)
add_whitespace(ToothGrowth[1:10,])
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10 VC 0.5
7 11.2 VC 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7 VC 0.5
Apply to only some columns with the cols argument:
set.seed(1234)
add_whitespace(ToothGrowth[1:10,], cols = "supp")
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 VC 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
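To show what students would then need to undo, here is a minimal base R sketch of cleaning up added whitespace. The data frame below is a hand-written, hypothetical stand-in for padded output, not something produced by the package, and the object names are illustrative only.

```r
# Hypothetical stand-in for data with added whitespace (values are illustrative).
messy_df <- data.frame(
  len  = c("4.2", "11.5 ", " 7.3", "5.8"),
  supp = c("VC", "VC ", " VC", "VC"),
  stringsAsFactors = FALSE
)

# Padded values count as distinct strings until they are trimmed.
length(unique(messy_df$supp))

# Strip leading and trailing whitespace from every character column.
is_chr <- vapply(messy_df, is.character, logical(1))
clean_df <- messy_df
clean_df[is_chr] <- lapply(clean_df[is_chr], trimws)

# Convert the numeric column back to its original type.
clean_df$len <- as.numeric(clean_df$len)

str(clean_df)
```

Trimming before converting keeps the two steps explicit, which may be easier for students to follow than relying on the conversion to tolerate padding.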
change_case() randomly switches values in character or factor columns between upper case, lower case, and no change:
set.seed(1234)
change_case(ToothGrowth[1:10,], messiness = 0.5)
len supp dose
1 4.2 vc 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 vc 0.5
8 11.2 vc 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
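The matching clean-up step can be sketched in base R as follows. The factor here is a hypothetical stand-in for a column with scrambled case, and choosing upper case as the standard is an assumption made for this example.

```r
# Hypothetical stand-in for a factor column whose case has been scrambled.
supp_messy <- factor(c("vc", "VC", "VC", "vc", "VC"))

# Mixed case splits one category into two levels.
nlevels(supp_messy)

# Convert to character, standardise to upper case, then rebuild the factor.
supp_clean <- factor(toupper(as.character(supp_messy)))
levels(supp_clean)
```

Rebuilding the factor after standardising drops the now-unused lower-case level.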
make_missing() randomly replaces some values with NA:
set.seed(1234)
make_missing(ToothGrowth[1:10,])
len supp dose
1 4.2 VC 0.5
2 11.5 VC NA
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 NA VC 0.5
8 11.2 VC NA
9 5.2 VC 0.5
10 7.0 VC 0.5
Use a different representation for missing values in some columns:
set.seed(1234)
make_missing(ToothGrowth[1:10,], cols = "supp", missing = "999")
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 999 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
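As a rough illustration of the cleaning exercise this sets up, the sketch below recodes a "999" marker back to NA in base R. The data frame is a hypothetical stand-in written by hand, and it assumes that "999" is never a genuine value in the column.

```r
# Hypothetical stand-in for a column that uses "999" as a missing-value marker.
messy_df <- data.frame(
  len  = c(4.2, 11.5, 7.3, 5.8),
  supp = c("VC", "999", "VC", "VC"),
  stringsAsFactors = FALSE
)

# Without recoding, the marker is counted as a real category.
table(messy_df$supp)

# Recode the marker as a true missing value.
clean_df <- messy_df
clean_df$supp[clean_df$supp == "999"] <- NA

# Functions such as is.na() now detect the missing entry.
sum(is.na(clean_df$supp))
```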
You can pipe together multiple functions to create custom messy transformations:
set.seed(1234)
ToothGrowth[1:10,] |>
  make_missing(cols = "supp", missing = " ") |>
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |>
  add_whitespace(cols = "supp", messiness = 0.5)
len supp dose
1 4.2 VC 0.5
2 11.5 VC NA
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 0.5
8 11.2 VC NA
9 5.2 VC 0.5
10 7.0 VC 0.5
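If students receive data like this as a file, one possible approach is to handle several of these problems at import time. The sketch below uses base R's read.csv() on a small, hand-written CSV string; the text is a hypothetical stand-in and is not output from the package. It assumes that 999 never occurs as a genuine measurement, since na.strings applies to every column.

```r
# Hypothetical CSV text mixing blank strings, 999 markers, and padded values.
csv_text <- "len,supp,dose
4.2,VC ,0.5
11.5,VC,NA
7.3, ,0.5
5.8,VC,999
"

clean_df <- read.csv(
  text = csv_text,
  # Treat NA, 999, and blank or whitespace-only fields as missing.
  na.strings = c("NA", "999", "", " "),
  # Remove leading and trailing whitespace from unquoted character fields.
  strip.white = TRUE,
  stringsAsFactors = FALSE
)

str(clean_df)
colSums(is.na(clean_df))
```

Cleaning at import keeps the analysis code unchanged, although an explicit recoding step after import makes it easier to treat each column differently.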