Tidy Data
INST 462, Spring 2025, sections 101–103
Instructor: Dr. Scott Jackson
February 17, 2025
Instructor: Dr. Scott Jackson Tidy Data February 17, 2025 1 / 22
Why are we talking about this?
• We want to visualize data.
• But what if that data is a mess?
• Clean – and often tidy – data is necessary for visualization.
• An old trope: “Data Scientists spend 50%/80%/99% of their time
cleaning and preparing data.”
• These days:
• probably still true
• unless you are lucky enough to have a dedicated and competent
data engineer
Why are we talking about this?
• In this class:
• you will have the opportunity to seek out other data sets that
interest you.
• you will sometimes need to be able to re-shape and re-arrange
data in order to visualize it effectively
• this is often true even for “clean” data.
• So you need to learn some basic things about data cleaning.
One problem with messy data
“Happy families are all alike; every unhappy family is
unhappy in its own way.”
— Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is
messy in its own way.”
— Hadley Wickham
Nevertheless...
• There are some common patterns of messiness that you can learn to
look out for.
1. Weird or improper format
2. Untidiness (will define this more)
3. Junk
4. Missing data
Formats
• (Review from last week)
• Simple (delimited) text: e.g., CSV, TSV, TXT
• Common spreadsheet files: e.g., Excel (.xls, .xlsx)
• Web data formats: e.g., JSON, XML
• Databases: e.g., SQL databases, MongoDB
• Binary/proprietary/other: it’s a big (and growing) world!
Simple, delimited text
• Files like CSV are very common.
• CSV = “Comma Separated Values”
• Simple text format data (can be read with programs like
Notepad/TextEdit, etc.)
• Rectangular:
• each line of the file corresponds to a row of data
• each value on a row is in a different “column”
• columns are separated by a delimiter or separator
Simple, delimited text
• File type names mainly differ in the delimiter character(s) that
separate columns:
• CSV: columns separated by commas
• TSV: columns separated by “tabs” (special whitespace character,
like a “big space”)
• TXT: might be separated by a single space, or maybe some other
character like a semicolon, etc.
• Fixed-width: some older files have a fixed number of characters
(“width”) for each column, like each column is 4 characters or
something like that.
• Note: the file extension (.csv, .tsv, .txt, etc.) is a hint about what
the delimiter/separator is, not a guarantee!
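As a minimal sketch of the delimiter types above, the following base R code reads the same small table four ways. The table contents and the variable names are made up for illustration, and the text is supplied inline rather than from real files.

```r
# The same two-row table written with three different delimiters.
# (Toy values, invented for this example.)
csv_text  <- "name,year,score\nAda,2024,91\nBen,2025,84"
tsv_text  <- "name\tyear\tscore\nAda\t2024\t91\nBen\t2025\t84"
semi_text <- "name;year;score\nAda;2024;91\nBen;2025;84"

# read.csv() assumes commas; read.delim() assumes tabs.
from_csv <- read.csv(text = csv_text)
from_tsv <- read.delim(text = tsv_text)

# read.table() lets you name the separator explicitly.
from_semi <- read.table(text = semi_text, sep = ";", header = TRUE)

# Fixed-width: no delimiter at all. Here the columns are
# 3, 4, and 2 characters wide, so we must supply the widths ourselves.
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("Ada202491", "Ben202584"), fwf_file)
from_fwf <- read.fwf(fwf_file,
                     widths = c(3, 4, 2),
                     col.names = c("name", "year", "score"))

str(from_csv)
str(from_fwf)
```

All four calls should produce the same three-column data frame; only the way the columns are separated in the text differs.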
Spreadsheet files
• Excel files (.xls, .xlsx) are still common, especially in government
data.
• Still essentially rectangular data (rows, columns).
• Biggest challenge of Excel: formatting!
• Spreadsheets are often made to be visually accessible, but that can
make the data harder to manage or extract!
• e.g., colors, borders, “merging” cells, adding notes or graphics
File formats summary
• Figuring out how to access the format of your data file(s) is the first
hurdle.
• There may be unexpected aspects of the format you need to deal with:
• unexpected delimiter characters (e.g., semicolons instead of
commas)
• unexpected stuff/formatting in an Excel file
• unexpected complexity that you need to “flatten” into a
rectangular shape
• etc. etc.
• Bottom line: learn tools to access different types of data, but always
remember to check the data and adjust if it has surprising formatting
issues.
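One way to "check the data" after importing might look like the sketch below. It assumes a hypothetical file that is labeled as CSV but actually uses semicolons; the contents are toy values supplied inline.

```r
# Text that looks like it came from a ".csv" file but uses semicolons.
# (Toy values, invented for this example.)
text_data <- "group;count\nAlpha;100\nBeta;250"

# Reading with the default comma separator runs without an error,
# but everything lands in a single column.
wrong <- read.csv(text = text_data)
ncol(wrong)   # one column is a warning sign for a two-variable table
str(wrong)

# Re-read with the separator the data actually uses.
right <- read.csv(text = text_data, sep = ";")
ncol(right)
str(right)
```

The point of the sketch is that a wrong delimiter often fails silently, so a quick look at `ncol()` and `str()` right after importing can catch the problem early.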
Tidiness
• Challenge: messy data sets are all messy in different ways
• Part of the solution: have an ideal target in mind, so you have
something to aim for
• That target: tidy data
Tidy Data
• What is “tidy” data?
• Term coined/championed by Hadley
Wickham
• Chief Scientist at Posit, one of the
Patron Saints of R
• Cohesive, conceptually grounded
approach to organizing rectangular data
• His Tidy Data paper is on ELMS, along
with other handy links
Sidetrack: the Tidyverse
• Wickham and co-authors have created a set of related packages,
including the ggplot2 package we will start using next week,
collectively called the “Tidyverse”.
• The Tidyverse is very popular, for good reason, and we will make use
of it.
• But it’s also not the only way to do things.
• In R, you can install the tidyverse package, which is basically just a
package for making it convenient to install/load all of the Tidyverse
packages at once.
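A minimal sketch of installing and loading the package is shown below. The `has_tidyverse` variable is just an illustrative name; the guard is there so the code does not stop with an error on a machine where the package is not installed yet.

```r
# Install once per machine (uncomment to run; needs an internet connection).
# install.packages("tidyverse")

# Check whether the package is available before trying to load it.
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

# Load in each new R session. This attaches the core Tidyverse packages,
# including ggplot2, dplyr, tidyr, and readr.
if (has_tidyverse) {
  library(tidyverse)
}
```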
Tidy Data
• Okay, but what is it?
• Three core rules:
1. Each variable is a column; each column is a variable.
2. Each observation is a row; each row is an observation.
3. Each value is a cell; each cell is a single value.
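To make the three rules concrete, here is a small sketch of a table that follows all of them. The students, exams, and scores are made up for illustration.

```r
# A tidy table: one row per (student, exam) observation.
# (Toy values, invented for this example.)
scores <- data.frame(
  student = c("Ada", "Ada", "Ben", "Ben"),
  exam    = c("midterm", "final", "midterm", "final"),
  score   = c(88, 93, 75, 81)
)

# Rule 1: each column is one variable, with one data type.
sapply(scores, class)

# Rule 2: each row is one observation (2 students x 2 exams = 4 rows).
nrow(scores)

# Rule 3: each cell holds a single value.
bens_final <- scores$score[scores$student == "Ben" & scores$exam == "final"]
bens_final  # 81
```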
Rectangular Data
• “Spreadsheet-like”
Variables
• Only one variable per column
• Only one column for a variable
• All values in a column share the same data type (i.e., each column is
a vector in R)
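A common violation of "only one column for a variable" is a table where one variable is spread across several columns. The sketch below assumes the tidyr package is installed and uses made-up exam scores; `pivot_longer()` gathers the spread-out columns into a single `score` column.

```r
library(tidyr)

# Untidy: the "score" variable is spread over two columns, and the
# column names ("midterm", "final") are really values of an "exam" variable.
# (Toy values, invented for this example.)
wide <- data.frame(
  student = c("Ada", "Ben"),
  midterm = c(88, 75),
  final   = c(93, 81)
)

# Tidy: one column per variable (student, exam, score).
long <- pivot_longer(
  wide,
  cols      = c(midterm, final),
  names_to  = "exam",
  values_to = "score"
)

long
```

After reshaping, each of the three variables has exactly one column, and the table has one row per student-exam combination.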
Observations
• Only one observation per row
• Only one row per observation (rows are distinct)
• Set of interconnected values for an “observation unit”
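One simple check for "rows are distinct" is to look for exact duplicate rows. This is a base R sketch with made-up visit records; whether a repeated row is really an error depends on the data, so treat the removal step as an assumption.

```r
# A table where one observation was accidentally recorded twice.
# (Toy values, invented for this example.)
visits <- data.frame(
  id   = c(1, 2, 2, 3),
  date = c("2025-02-10", "2025-02-11", "2025-02-11", "2025-02-12")
)

# duplicated() flags rows that repeat an earlier row exactly.
duplicated(visits)
n_dupes <- sum(duplicated(visits))

# unique() keeps only the first copy of each row.
distinct_visits <- unique(visits)
nrow(distinct_visits)
```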
Values
• Only one value per cell
• Only one cell for a value
• Each is a “data point”
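A typical violation of "only one value per cell" is a cell that packs two numbers into one string. The base R sketch below uses made-up readings written as "systolic/diastolic" and splits them into two columns; the column names are illustrative.

```r
# Untidy: each "reading" cell holds two values joined by a slash.
# (Toy values, invented for this example.)
bp <- data.frame(
  patient = c("p1", "p2"),
  reading = c("120/80", "135/85")
)

# Split each string at the slash, then pull out the two pieces.
parts <- strsplit(bp$reading, "/", fixed = TRUE)

# Tidy: one value per cell, stored as numbers rather than text.
tidy_bp <- data.frame(
  patient   = bp$patient,
  systolic  = as.numeric(sapply(parts, `[`, 1)),
  diastolic = as.numeric(sapply(parts, `[`, 2))
)

tidy_bp
```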
Tidy Data is not the end-all
• Tidy data is a useful reference point.
• It’s not always the “right” way to shape data.
• “Untidy” data can sometimes be more compact (for disk space,
storage).
• Our other tools (e.g., analysis & graphing functions in R) may
require different “shapes”.
• Sometimes what counts as “tidy” depends on the unit of analysis and
the goals of the analyst.
• It’s a good set of guiding principles, not an absolute straitjacket.
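As a sketch of a tool that wants a different "shape": some R functions expect a numeric matrix rather than a tidy table. The code below assumes the tidyr package is installed and uses made-up scores; it goes from a tidy table to a wide one and then to a matrix.

```r
library(tidyr)

# A tidy table: one row per (student, exam) observation.
# (Toy values, invented for this example.)
long <- data.frame(
  student = c("Ada", "Ada", "Ben", "Ben"),
  exam    = c("midterm", "final", "midterm", "final"),
  score   = c(88, 93, 75, 81)
)

# Spread the exam values back out into one column per exam.
wide <- pivot_wider(long, names_from = exam, values_from = score)

# Convert the score columns to a matrix with students as row names,
# the kind of input that matrix-based functions expect.
m <- as.matrix(wide[, c("midterm", "final")])
rownames(m) <- wide$student

m
```

Here the wide matrix is deliberately "untidy" because a later step needs it that way; the tidy version remains the easier starting point for most reshaping.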
More on Tidy Data
• See Tutorial R Markdown
• learn the basics of the tidyr package
• Links and references (also on ELMS):
• original academic paper (from 2014):
https://www.jstatsoft.org/article/view/v059i10
• page on the Tidyverse docs website:
https://tidyr.tidyverse.org/articles/tidy-data.html
• section of the R for Data Science book:
https://r4ds.hadley.nz/data-tidy
Next steps
• Work through the Tutorial R Markdown files for Week 4.
• more on reading/importing data
• filtering and subsetting data
• reshaping data
• Practice Exercises are for your own use, and for Friday Discussion.
• This is to prepare you to “tidy” the Visualization Judgments data next
week, in the first stage of your projects.
• Start working on Tutorials & Exercises before Friday.
• Volunteer (see the “quiz” on ELMS) to discuss your progress on Friday!