Pandas Missing Data

Uploaded by

NagaRaju

Handling Missing Data

Sources of Missing Values


• A user forgot to fill in a field.
• Data was lost while being transferred manually from a legacy database.
• There was a programming error.
• Users chose not to fill in a field because of their beliefs about how the results would be used or interpreted.
Missing Data Conventions
• Masking approach
  ◦ A separate Boolean array indicates the null status of each value.
  ◦ Adds overhead in both storage and computation.
• Sentinel approach
  ◦ Uses a data-specific convention, such as indicating a missing integer value with -9999.
  ◦ Common special values like NaN are not available for all data types.
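
The two conventions can be compared in a minimal NumPy sketch. The arrays and variable names below are illustrative assumptions, not part of the original slides.

```python
import numpy as np

# Masking approach: a separate Boolean array marks which entries are missing.
values = np.array([1, 2, 3, 4])
mask = np.array([False, True, False, False])  # True means "missing"
masked = np.ma.masked_array(values, mask=mask)
masked_mean = masked.mean()  # the masked entry is skipped

# Sentinel approach with NaN: the marker lives inside the data itself.
# NaN exists only for floating-point types, so the array must be float.
with_nan = np.array([1.0, np.nan, 3.0, 4.0])
nan_mean = np.nanmean(with_nan)  # NaN-aware aggregation skips the sentinel

# Sentinel approach with a data-specific value such as -9999: this works for
# integers, but every computation has to exclude the sentinel explicitly.
with_sentinel = np.array([1, -9999, 3, 4])
sentinel_mean = with_sentinel[with_sentinel != -9999].mean()

print(masked_mean, nan_mean, sentinel_mean)
```

All three means are computed over the same three valid values. The difference is where the "missing" information is stored.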
Missing Data in Pandas
• The way Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.
• Pandas could have been built on NumPy's masked arrays, but the overhead in storage, computation, and code maintenance makes that an unattractive choice.
• Pandas chose to use sentinels for missing data:
  ◦ Two existing Python null values: the special floating-point NaN value and the Python None object.
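
A short sketch of how the two sentinels behave in a numeric Series. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Both None and np.nan are treated as missing in a numeric Series.
# None is converted to NaN, so the integers are upcast to float64.
numbers = pd.Series([1, np.nan, 2, None])
print(numbers.dtype)  # float64

# Either sentinel is reported as missing.
missing_numbers = int(numbers.isnull().sum())

# In a Series of strings, None also counts as missing.
labels = pd.Series(["a", None, "c"])
missing_labels = int(labels.isnull().sum())
```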


Operating on Null Values
• isnull(): Generate a Boolean mask indicating missing values
• notnull(): Opposite of isnull()
• dropna(): Return a filtered version of the data
• fillna(): Return a copy of the data with missing values filled or imputed
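
The four methods above can be tried on a small Series. The data and the fill value 0 are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.0, None])

null_mask = data.isnull()    # True where a value is missing
valid_mask = data.notnull()  # True where a value is present
dropped = data.dropna()      # new Series without the missing entries
filled = data.fillna(0)      # new Series with missing entries replaced by 0

print(null_mask.tolist())
print(dropped.tolist())
print(filled.tolist())
```

dropna() and fillna() return new objects here, so `data` itself still contains its missing values afterwards.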
Example: Handling Missing Values
1. What are the features?
2. What are the expected types (int, float, string, boolean)?
3. Is there obvious missing data (values that Pandas can detect)?
4. Are there other types of missing data that are not so obvious (that Pandas can't easily detect)?
Small Real Estate Dataset
What Are My Features?

• ST_NUM: Street number
• ST_NAME: Street name
• OWN_OCCUPIED: Is the residence owner-occupied?
• NUM_BEDROOMS: Number of bedrooms


What Are the Expected Types?

• ST_NUM: float or int (some sort of numeric type)
• ST_NAME: string
• OWN_OCCUPIED: string, Y ("Yes") or N ("No")
• NUM_BEDROOMS: float or int (a numeric type)
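
One way to check features and types is to inspect the DataFrame directly. Only the column names come from the slides. The rows below are hypothetical stand-ins for the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the real dataset is not reproduced here.
df = pd.DataFrame({
    "ST_NUM": [101.0, 202.0, np.nan, 404.0],
    "ST_NAME": ["MAIN", "OAK", "OAK", "ELM"],
    "OWN_OCCUPIED": ["Y", "N", "N", "Y"],
    "NUM_BEDROOMS": [3.0, 2.0, np.nan, 1.0],
})

print(df.head())   # what are my features?
print(df.dtypes)   # what types did Pandas infer?

# Columns that Pandas treats as numeric.
numeric_columns = df.select_dtypes(include="number").columns.tolist()
```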


Look for Missing Values
Standard Missing Values
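
Standard missing values are the ones Pandas already stores as NaN, so isnull() finds them directly. A minimal sketch with hypothetical values for two of the columns:

```python
import numpy as np
import pandas as pd

# Hypothetical values: NaN marks the entries that are missing.
df = pd.DataFrame({
    "ST_NUM": [101.0, 202.0, np.nan, 404.0, np.nan],
    "NUM_BEDROOMS": [3.0, np.nan, 2.0, 1.0, 4.0],
})

print(df["ST_NUM"].isnull())       # Boolean mask for one column

missing_per_column = df.isnull().sum()        # count of missing values per column
any_missing = bool(df.isnull().values.any())  # is anything missing at all?
print(missing_per_column)
```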
Non-Standard Missing Values
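
Non-standard missing values are markers typed in by hand. With its default settings, read_csv recognizes some spellings (such as "n/a" and "NA") but not others (such as "na" and "--"). The CSV text below is a made-up example.

```python
import io
import pandas as pd

# Hypothetical column containing four different "missing" spellings.
csv_text = """NUM_BEDROOMS
3
n/a
NA
na
--
"""

df = pd.read_csv(io.StringIO(csv_text))

# "n/a" and "NA" are parsed as missing; "na" and "--" remain ordinary strings.
flags = df["NUM_BEDROOMS"].isnull().tolist()
print(flags)
```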
Different Formats Recognized as Missing Values
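
Additional spellings can be declared as missing through the na_values parameter of read_csv. The list of extra markers and the CSV text are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical column containing four different "missing" spellings.
csv_text = """NUM_BEDROOMS
3
n/a
NA
na
--
"""

# Extra markers to treat as missing, in addition to the Pandas defaults.
extra_missing = ["na", "--"]
df = pd.read_csv(io.StringIO(csv_text), na_values=extra_missing)

flags = df["NUM_BEDROOMS"].isnull().tolist()
print(flags)
print(df["NUM_BEDROOMS"].dtype)  # numeric now that every marker is NaN
```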
Unexpected Missing Values
Solution

• Loop through the OWN_OCCUPIED column.
• Try to turn each entry into an integer.
• If the entry can be converted to an integer, replace it with a missing value.
• If the entry can't be converted to an integer, we know it's a string, so keep going.
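
The steps above can be sketched as follows. The column contents are hypothetical, with "12" standing in for a number that was wrongly entered in a Y/N column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "12" is the unexpected numeric entry.
df = pd.DataFrame({"OWN_OCCUPIED": ["Y", "N", "12", "Y", np.nan]})

# Iterate over a snapshot of the column so the frame can be edited safely.
for idx, entry in list(df["OWN_OCCUPIED"].items()):
    try:
        int(entry)
    except (ValueError, TypeError):
        # Not an integer (a string such as "Y", or already missing): keep going.
        continue
    # The entry converted cleanly, so it is a number in a Y/N column.
    df.loc[idx, "OWN_OCCUPIED"] = np.nan

print(df["OWN_OCCUPIED"].isnull().tolist())
```

Entries that are already NaN also fail the int() conversion, so they are left untouched and stay missing.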
