Handling Missing Data
Sources of Missing Values
User forgot to fill in a field.
Data was lost while transferring manually from
a legacy database.
There was a programming error.
Users chose not to fill out a field tied to their
beliefs about how the results would be used or
interpreted.
Missing Data Conventions
Masking approach
A Boolean array indicates the null status of each value
Adds overhead in both storage and computation
Sentinel approach
Data-specific convention, such as indicating a
missing integer value with –9999
Common special values like NaN are not available
for all data types.
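The two conventions above can be compared in a minimal sketch. The array contents and the SENTINEL name are made up for illustration; NumPy's masked arrays stand in for the masking approach.

```python
import numpy as np

# Masking approach: a separate Boolean array marks which entries are missing.
values = np.array([1, 2, 3, 4])
mask = np.array([False, True, False, False])  # True means "missing"
masked = np.ma.masked_array(values, mask=mask)
masked_mean = masked.mean()  # the masked entry is ignored

# Sentinel approach: a reserved value (here -9999) marks a missing integer.
SENTINEL = -9999  # illustrative, data-specific choice
with_sentinel = np.array([1, SENTINEL, 3, 4])
sentinel_mean = with_sentinel[with_sentinel != SENTINEL].mean()

# NaN is a floating-point value, so it cannot live in an integer array:
# NumPy upcasts the whole array to float instead.
upcast_dtype = np.array([1, 2, np.nan]).dtype

print(masked_mean, sentinel_mean, upcast_dtype)
```

The mask costs an extra array, while the sentinel has to be filtered out by hand before every computation.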
Missing Data in Pandas
The way Pandas handles missing values is constrained by its reliance
on the NumPy package, which does not have a built-in
notion of NA values
Pandas could have built on NumPy's masked arrays, but the overhead in
storage, computation, and code maintenance makes that an
unattractive choice
Pandas chose to use sentinels for missing data, reusing two existing
Python null values: the special floating-point NaN value
and the Python None object
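A short sketch of how these sentinels behave in practice, assuming a numeric Series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Integers alone keep an integer dtype.
ints = pd.Series([1, 2])

# Adding NaN or None forces an upcast to float, and None is stored as NaN.
with_nulls = pd.Series([1, np.nan, 2, None])

print(ints.dtype)
print(with_nulls.dtype)
print(with_nulls.isnull().sum())
```

Both NaN and None are reported as missing, which is why the integer values come back as floats.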
Operating on Null Values
isnull(): Generate a Boolean mask indicating
missing values
notnull(): Opposite of isnull()
dropna(): Return a filtered version of the data
fillna(): Return a copy of the data with
missing values filled or imputed
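The four methods above can be tried on a small Series. This is a minimal sketch with made-up values; filling with 0 is only one possible choice of replacement.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.0, None])

missing_mask = data.isnull()   # True where a value is missing
present_mask = data.notnull()  # True where a value is present
dropped = data.dropna()        # new Series without the missing entries
filled = data.fillna(0)        # new Series with missing entries replaced by 0

print(missing_mask.tolist())
print(dropped.tolist())
print(filled.tolist())
```

dropna() and fillna() return new objects here, so data itself still contains its missing values.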
Example: Handling Missing Values
1. What are the features?
2. What are the expected types (int, float,
string, boolean)?
3. Is there obvious missing data (values that
Pandas can detect)?
4. Are there other types of missing data that are
not so obvious (can’t easily be detected with
Pandas)?
Small real estate dataset
What are my features?
ST_NUM: Street number
ST_NAME: Street name
OWN_OCCUPIED: Is the residence
owner-occupied?
NUM_BEDROOMS: Number of bedrooms
What are the expected types?
ST_NUM: float or int… some sort of numeric
type
ST_NAME: string
OWN_OCCUPIED: string… Y (“Yes”) or N
(“No”)
NUM_BEDROOMS: float or int, a numeric type
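The first questions can be checked directly in Pandas. The sketch below uses a tiny hand-made table with the four columns listed above; the rows are invented for illustration and are not the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, invented for illustration only.
df = pd.DataFrame({
    "ST_NUM": [11.0, 12.0, np.nan, 14.0],
    "ST_NAME": ["OAK", "ELM", "ELM", "PINE"],
    "OWN_OCCUPIED": ["Y", "N", "N", "12"],
    "NUM_BEDROOMS": ["3", "2", "n/a", "1"],
})

# What are the features, and which types did Pandas infer?
print(df.columns.tolist())
print(df.dtypes)

# Is there obvious missing data (values Pandas can detect)?
obvious_missing = df.isnull().sum()
print(obvious_missing)

# A column expected to be numeric but not stored as numeric hints at
# missing values that Pandas did not detect.
bedrooms_is_numeric = pd.api.types.is_numeric_dtype(df["NUM_BEDROOMS"])
print(bedrooms_is_numeric)
```

Here only the NaN in ST_NUM is counted as missing. The "n/a" string in NUM_BEDROOMS and the "12" in OWN_OCCUPIED are not detected.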
Look for Missing Values
Standard Missing Values
Non-Standard Missing Values
Different formats recognized as missing values
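One way to have several formats recognized as missing is the na_values parameter of pd.read_csv. The CSV text and the markers "na" and "--" below are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; "n/a", "na" and "--" all stand for "missing".
csv_text = """ST_NUM,NUM_BEDROOMS
101,3
102,n/a
103,na
104,--
105,2
"""

# With default settings, Pandas recognizes only some markers (such as "n/a").
default_df = pd.read_csv(io.StringIO(csv_text))
default_missing = int(default_df["NUM_BEDROOMS"].isnull().sum())

# Extra markers can be listed in na_values; they are added to the defaults.
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])
custom_missing = int(custom_df["NUM_BEDROOMS"].isnull().sum())

print(default_missing, custom_missing)
print(custom_df["NUM_BEDROOMS"].dtype)
```

Once every marker is recognized, the column no longer contains stray strings and can be read as a numeric type.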
Unexpected Missing Values
Solution
Loop through the OWN_OCCUPIED column
Try to convert each entry to an integer
If the entry can be converted to an integer,
replace it with a missing value
If the entry can’t be converted to an integer, we know
it’s a string, so keep going
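The steps above can be sketched as follows. The column values are made up for illustration, and the list bad_rows is a helper introduced here so that the column is not modified while it is being looped over.

```python
import numpy as np
import pandas as pd

# Hypothetical column: "12" is a number where Y or N was expected.
df = pd.DataFrame({"OWN_OCCUPIED": ["Y", "N", "12", "Y", np.nan]})

bad_rows = []
for row_label, entry in df["OWN_OCCUPIED"].items():
    try:
        int(entry)
    except (ValueError, TypeError):
        # Not an integer: a string such as "Y", or an already missing
        # value, so keep going.
        continue
    # The entry can be converted to an integer, so it is not valid here.
    bad_rows.append(row_label)

# Replace the offending entries with a missing value.
df.loc[bad_rows, "OWN_OCCUPIED"] = np.nan

print(df["OWN_OCCUPIED"].isnull().sum())
```

The try/except is what separates the two cases: int() raises an error for "Y" and "N", and succeeds only for entries that look like numbers.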