CLEANING DATA IN PYTHON
Tidy data
Cleaning Data in Python
Tidy data
● “Tidy Data” paper by Hadley Wickham, PhD
● Formalize the way we describe the shape of data
● Gives us a goal when forma!ing our data
● “Standard way to organize data values within a dataset”
Cleaning Data in Python
Motivation for tidy data
Cleaning Data in Python
Principles of tidy data
● Columns represent separate variables
● Rows represent individual observations
● Observational units form tables
Cleaning Data in Python
Converting to tidy data
● Be!er for reporting vs. be!er for analysis
● Tidy data makes it easier to fix common data
problems
Cleaning Data in Python
Converting to tidy data
● The data problem we are trying to fix:
● Columns containing values, instead of variables
● Solution: pd.melt()
Cleaning Data in Python
Melting
In [1]: pd.melt(frame=df, id_vars='name',
...: value_vars=['treatment a', 'treatment b'])
Out[1]:
name variable value
0 Daniel treatment a _
1 John treatment a 12
2 Jane treatment a 24
3 Daniel treatment b 42
4 John treatment b 31
5 Jane treatment b 27
Cleaning Data in Python
Melting
In [2]: pd.melt(frame=df, id_vars='name',
...: value_vars=['treatment a', 'treatment b’],
...: var_name='treatment', value_name='result')
Out[2]:
name treatment result
0 Daniel treatment a _
1 John treatment a 12
2 Jane treatment a 24
3 Daniel treatment b 42
4 John treatment b 31
5 Jane treatment b 27
CLEANING DATA IN PYTHON
Let’s practice!
CLEANING DATA IN PYTHON
Pivoting data
Cleaning Data in Python
Pivot: un-melting data
● Opposite of melting
● In melting, we turned columns into rows
● Pivoting: turn unique values into separate columns
● Analysis friendly shape to reporting friendly shape
● Violates tidy data principle: rows contain observations
● Multiple variables stored in the same column
Cleaning Data in Python
Pivot: un-melting data
Cleaning Data in Python
Pivot
In [1]: weather_tidy = weather.pivot(index='date',
...: columns='element',
...: values='value')
In [2]: print(weather_tidy)
element tmax tmin
date
2010-01-30 27.8 14.5
2010-02-02 27.3 14.4
Cleaning Data in Python
Pivot
Cleaning Data in Python
Using pivot when you have duplicate entries
In [3]: import numpy as np
In [4]: weather2_tidy = weather.pivot(values='value',
...: index='date',
...: columns='element')
Out[4]:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-2962bb23f5a3> in <module>()
1 weather2_tidy = weather2.pivot(values='value',
2 index='date',
----> 3 columns='element')
ValueError: Index contains duplicate entries, cannot reshape
Cleaning Data in Python
Pivot table
● Has a parameter that specifies how to deal with duplicate
values
● Example: Can aggregate the duplicate values by taking their
average
Cleaning Data in Python
Pivot table
In [5]: weather2_tidy = weather.pivot_table(values='value',
...: index='date',
...: columns='element',
...: aggfunc=np.mean)
Out[5]:
element tmax tmin
date
2010-01-30 27.8 14.5
2010-02-02 27.3 15.4
CLEANING DATA IN PYTHON
Let’s practice!
CLEANING DATA IN PYTHON
Beyond melt and
pivot
Cleaning Data in Python
Beyond melt and pivot
● Melting and pivoting are basic tools
● Another common problem:
● Columns contain multiple bits of information
Cleaning Data in Python
Beyond melt and pivot
Cleaning Data in Python
Melting and parsing
In [1]: pd.melt(frame=tb, id_vars=['country', 'year'])
Out[1]:
country year variable value
0 AD 2000 m014 0
1 AE 2000 m014 2
2 AF 2000 m014 52
3 AD 2000 m1524 0
4 AE 2000 m1524 4
5 AF 2000 m1524 228
● Nothing inherently wrong about original data shape
● Not conducive for analysis
Cleaning Data in Python
Melting and parsing
In [2]: tb_melt['sex'] = tb_melt.variable.str[0]
In [3]: tb_melt
Out[3]:
country year variable value sex
0 AD 2000 m014 0 m
1 AE 2000 m014 2 m
2 AF 2000 m014 52 m
3 AD 2000 m1524 0 m
4 AE 2000 m1524 4 m
5 AF 2000 m1524 228 m
CLEANING DATA IN PYTHON
Let’s practice!