Week 4
Handling Missing Data, Data Combining, and Aggregation in Pandas: Lecture Notes
I. Handling Missing Data: Operations on Null Values
Handling missing data is crucial to ensure data quality and the reliability of
analysis. Missing values can lead to biased results or reduce the performance of
machine learning models. Pandas offers several methods to detect, handle, and
impute missing values effectively.
1. Identifying Missing Values
isnull(): Identifies missing values and returns a DataFrame of the same shape, with True for missing values.
import pandas as pd
df = pd.read_csv('data.csv')
missing_values = df.isnull()
print(missing_values)
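The examples in these notes read from 'data.csv', which is not included. As a stand-in for experimentation, a minimal sketch of a frame with the Age, Salary, and Gender columns used below might look like this:
import numpy as np
import pandas as pd
# Hypothetical stand-in for data.csv, with a few missing values
df = pd.DataFrame({
    'Age': [25, np.nan, 35, 40],
    'Salary': [50000, 60000, np.nan, 75000],
    'Gender': ['F', 'M', np.nan, 'M']
})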
sum(): Get the count of missing values for each column.
missing_count = df.isnull().sum()
print(missing_count)
any(): Check whether any value is missing in each column.
missing_any = df.isnull().any()
print(missing_any)
Visualization: Use heatmaps to visualize missing values.
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
2. Dropping Missing Values
dropna(): Removes rows or columns with missing values. This is typically used when the amount of missing data is small and will not significantly impact the dataset.
Drop Rows with Missing Values:
df_cleaned = df.dropna()
Drop Columns with Missing Values:
df_cleaned = df.dropna(axis=1)
Drop Rows with Missing Values in Specific Columns:
df_cleaned = df.dropna(subset=['Age', 'Salary'])
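dropna() also accepts a thresh parameter as a middle ground between keeping and dropping: it retains only rows (or columns) with at least a given number of non-null values. A minimal sketch:
# Keep only rows that have at least two non-null values
df_cleaned = df.dropna(thresh=2)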
3. Imputing Missing Values
Fill with Mean, Median, or Mode: Imputing missing values is commonly
done to retain all rows in the dataset while filling the gaps.
Mean: Fill missing numerical values with the column mean.
df['Age'] = df['Age'].fillna(df['Age'].mean())
Median: Fill missing values with the median.
df['Age'] = df['Age'].fillna(df['Age'].median())
Mode: Fill missing values with the mode (most frequent value).
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
Forward Fill and Backward Fill: Useful for time series data where the
assumption is that values remain constant until a change occurs.
Forward Fill (ffill): Fill missing values using the previous row's value. (fillna(method='ffill') is deprecated in newer Pandas; use ffill() directly.)
df = df.ffill()
Backward Fill (bfill): Fill missing values using the next row's value.
df = df.bfill()
Custom Imputation: Replace missing values with a specific constant or
custom value based on domain knowledge.
df['Salary'] = df['Salary'].fillna(50000)  # Fill missing salaries with a constant value
II. Combining Datasets: Concat and Append
Combining datasets is often necessary when working with multiple data sources
or when you need to add new data to an existing dataset. Pandas provides
convenient methods such as concat() and append() to combine DataFrames.
1. Concatenation (concat())
Vertical Concatenation (stack DataFrames on top of each other). Useful
when you have multiple datasets with the same structure (i.e., same
columns).
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [4, 5, 6], 'Name': ['David', 'Ella', 'Frank']})
df_combined = pd.concat([df1, df2], axis=0)
print(df_combined)
Horizontal Concatenation (combine columns). Useful when the datasets
share an index or when you want to add additional information.
df3 = pd.DataFrame({'Age': [25, 30, 35]})
df_combined_horiz = pd.concat([df1, df3], axis=1)
print(df_combined_horiz)
Concatenating with Keys: Add hierarchical keys to identify which original
DataFrame the rows came from. This is useful when you want to maintain
information about the source of each row.
df_concat_keys = pd.concat([df1, df2], keys=['Group1', 'Group2'])
print(df_concat_keys)
Ignoring Index: Reset the index when concatenating.
df_combined_reset = pd.concat([df1, df2], ignore_index=True)
2. Appending Datasets (append())
Appending Rows: Use append() to add rows from another DataFrame or
Series. This is similar to vertical concatenation.
df_appended = df1.append(df2, ignore_index=True)
print(df_appended)
Appending Series: Append a single Series (like a new row) to a
DataFrame.
new_row = pd.Series({'ID': 7, 'Name': 'George'})
df_appended_series = df1.append(new_row, ignore_index=True)
print(df_appended_series)
Deprecated Notice: append() was deprecated in Pandas 1.4 and removed entirely in Pandas 2.0. Prefer concat() instead for future compatibility.
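A sketch of the concat() equivalents of the two append() examples above, for modern Pandas versions (note that a Series must first be wrapped in a one-row DataFrame):
# Appending rows: replaces df1.append(df2, ignore_index=True)
df_appended = pd.concat([df1, df2], ignore_index=True)
# Appending a Series as a new row: convert it to a one-row DataFrame first
new_row = pd.Series({'ID': 7, 'Name': 'George'})
df_appended_series = pd.concat([df1, new_row.to_frame().T], ignore_index=True)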
III. Aggregation and Grouping: Groupby Functions
Grouping and aggregation are essential techniques to perform operations on
subsets of data, such as computing averages, sums, or counts. The groupby()
function in Pandas provides a powerful way to split data, apply functions, and
combine results.
1. Basic Grouping
Grouping by a Column: Use groupby() to group data by specific columns,
allowing you to aggregate data for each group.
df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                   'Salary': [60000, 80000, 62000, 85000, 75000]})
grouped = df.groupby('Department')
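Note that groupby() alone only builds the grouping; nothing is computed until an aggregation is applied. Two quick ways to inspect the resulting groups:
# Number of rows in each group
print(grouped.size())
# Mapping of group labels to the row indices they contain
print(grouped.groups)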
2. Aggregation
Aggregating with Built-in Functions: Apply aggregation functions like
mean(), sum(), count(), etc., on grouped data to derive insights.
mean_salary = grouped['Salary'].mean()
print(mean_salary)
Custom Aggregation: Use agg() to apply multiple aggregation functions to
grouped data.
aggregated = grouped['Salary'].agg(['mean', 'sum', 'min', 'max'])
print(aggregated)
Renaming Aggregation Columns: Rename the aggregated columns for
clarity.
aggregated = grouped['Salary'].agg(mean_salary='mean', total_salary='sum')
print(aggregated)
3. Grouping by Multiple Columns
Multi-Level Grouping: Group by more than one column to explore detailed
breakdowns, such as by department and location.
df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                   'Location': ['NY', 'SF', 'NY', 'SF', 'LA'],
                   'Salary': [60000, 80000, 62000, 85000, 75000]})
grouped_multi = df.groupby(['Department', 'Location'])['Salary'].mean()
print(grouped_multi)
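The result of grouping by multiple columns is a Series with a MultiIndex. When a table layout is easier to read, unstack() pivots the inner index level into columns:
# Departments as rows, locations as columns
print(grouped_multi.unstack())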
4. Iterating Over Groups
Iterate over groups to process each group independently. This can be
helpful when different processing is required for each group.
for name, group in grouped:
    print(f"Department: {name}")
    print(group)
IV. Pivot Tables: Use Cases and Examples
Pivot tables are used to summarize and aggregate data in a flexible way, similar to
Excel pivot tables. They allow us to restructure data and gain insights by breaking
down numerical data into meaningful summaries.
1. Creating Pivot Tables
Basic Pivot Table: Create a pivot table using pivot_table(). You can summarize values by specifying index, columns, and aggfunc.
df = pd.DataFrame({'Region': ['East', 'West', 'East', 'West', 'East'],
                   'Product': ['A', 'A', 'B', 'B', 'A'],
                   'Sales': [100, 150, 200, 300, 120]})
pivot = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0)
print(pivot)
Multiple Aggregation Functions: Apply multiple aggregation functions to
summarize data in different ways.
pivot_multi_agg = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc=['sum', 'mean'], fill_value=0)
print(pivot_multi_agg)
2. Use Cases of Pivot Tables
Sales Analysis: Pivot tables are commonly used in sales analysis to
understand performance across different regions, products, or time
periods.
Example: Calculate total sales for each region and each product to
identify top-performing products and regions.
Human Resources: Analyze employee count or average salary by
department and location.
Example: Calculate the average salary per department to determine compensation trends or analyze the distribution of employees across locations (a sketch follows this list).
Financial Reporting: Summarize financial data by quarter, year, or product
type for reporting purposes.
Example: Calculate quarterly sales to compare seasonal performance
and track yearly growth.
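As a minimal sketch of the Human Resources use case above, reusing the Department/Location/Salary data from the grouping section:
import pandas as pd
hr_df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                      'Location': ['NY', 'SF', 'NY', 'SF', 'LA'],
                      'Salary': [60000, 80000, 62000, 85000, 75000]})
# Average salary per department, broken down by location
hr_pivot = hr_df.pivot_table(values='Salary', index='Department', columns='Location', aggfunc='mean', fill_value=0)
print(hr_pivot)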
3. Adding Margins
Margins: Use margins=True to add row and column totals to pivot tables for
a comprehensive view.
pivot_with_totals = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0, margins=True)
print(pivot_with_totals)
Adding Custom Totals: Customize the margin values by renaming them.
pivot_custom_margins = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0, margins=True, margins_name='Total Sales')
print(pivot_custom_margins)