Week 4
Handling Missing Data, Data Combining, and Aggregation in Pandas: Lecture Notes
I. Handling Missing Data: Operations on Null Values
Handling missing data is crucial to ensure data quality and the reliability of
analysis. Missing values can lead to biased results or reduce the performance of
machine learning models. Pandas offers several methods to detect, handle, and
impute missing values effectively.
1. Identifying Missing Values
isnull(): Identifies missing values and returns a DataFrame of the same shape, with True for missing values.
import pandas as pd
df = pd.read_csv('data.csv')
missing_values = df.isnull()
print(missing_values)
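The examples in these notes read from 'data.csv', which is not included. As a stand-in for experimentation, a minimal sketch of a frame with the Age, Salary, and Gender columns used below might look like this:
import numpy as np
import pandas as pd
# Hypothetical stand-in for data.csv, with a few missing values
df = pd.DataFrame({
    'Age': [25, np.nan, 35, 40],
    'Salary': [50000, 60000, np.nan, 75000],
    'Gender': ['F', 'M', np.nan, 'M']
})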
sum(): Get the count of missing values for each column.
missing_count = df.isnull().sum()
print(missing_count)
any(): Check whether any value is missing in each column.
missing_any = df.isnull().any()
print(missing_any)
Visualization: Use heatmaps to visualize missing values.
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
2. Dropping Missing Values
dropna(): Removes rows or columns with missing values. This is typically used when the amount of missing data is small and will not significantly impact the dataset.
Drop Rows with Missing Values:
df_cleaned = df.dropna()
Drop Columns with Missing Values:
df_cleaned = df.dropna(axis=1)
Drop Rows with Missing Values in Specific Columns:
df_cleaned = df.dropna(subset=['Age', 'Salary'])
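dropna() also accepts a thresh parameter as a middle ground between keeping and dropping: it retains only rows (or columns) with at least a given number of non-null values. A minimal sketch:
# Keep only rows that have at least two non-null values
df_cleaned = df.dropna(thresh=2)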
3. Imputing Missing Values
Fill with Mean, Median, or Mode: Imputing missing values is commonly
done to retain all rows in the dataset while filling the gaps.
Mean: Fill missing numerical values with the column mean.
df['Age'] = df['Age'].fillna(df['Age'].mean())
Median: Fill missing values with the median.
df['Age'] = df['Age'].fillna(df['Age'].median())
Mode: Fill missing values with the mode (most frequent value).
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
Forward Fill and Backward Fill: Useful for time series data where the
assumption is that values remain constant until a change occurs.
Forward Fill (ffill): Fill missing values using the previous row's value. (fillna(method='ffill') is deprecated in newer Pandas; use ffill() directly.)
df = df.ffill()
Backward Fill (bfill): Fill missing values using the next row's value.
df = df.bfill()
Custom Imputation: Replace missing values with a specific constant or
custom value based on domain knowledge.
df['Salary'] = df['Salary'].fillna(50000)  # Fill missing salaries with a constant value
II. Combining Datasets: Concat and Append
Combining datasets is often necessary when working with multiple data sources
or when you need to add new data to an existing dataset. Pandas provides
convenient methods such as concat() and append() to combine DataFrames.
1. Concatenation (concat())
Vertical Concatenation (stack DataFrames on top of each other). Useful
when you have multiple datasets with the same structure (i.e., same
columns).
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [4, 5, 6], 'Name': ['David', 'Ella', 'Frank']})
df_combined = pd.concat([df1, df2], axis=0)
print(df_combined)
Horizontal Concatenation (combine columns). Useful when the datasets
share an index or when you want to add additional information.
df3 = pd.DataFrame({'Age': [25, 30, 35]})
df_combined_horiz = pd.concat([df1, df3], axis=1)
print(df_combined_horiz)
Concatenating with Keys: Add hierarchical keys to identify which original
DataFrame the rows came from. This is useful when you want to maintain
information about the source of each row.
df_concat_keys = pd.concat([df1, df2], keys=['Group1', 'Group2'])
print(df_concat_keys)
Ignoring Index: Reset the index when concatenating.
df_combined_reset = pd.concat([df1, df2], ignore_index=True)
2. Appending Datasets (append())
Appending Rows: Use append() to add rows from another DataFrame or
Series. This is similar to vertical concatenation.
df_appended = df1.append(df2, ignore_index=True)
print(df_appended)
Appending Series: Append a single Series (like a new row) to a
DataFrame.
new_row = pd.Series({'ID': 7, 'Name': 'George'})
df_appended_series = df1.append(new_row, ignore_index=True)
print(df_appended_series)
Deprecated Notice: append() was deprecated in Pandas 1.4 and removed entirely in Pandas 2.0. Prefer concat() instead for future compatibility.
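A sketch of the concat() equivalents of the two append() examples above, for modern Pandas versions (note that a Series must first be wrapped in a one-row DataFrame):
# Appending rows: replaces df1.append(df2, ignore_index=True)
df_appended = pd.concat([df1, df2], ignore_index=True)
# Appending a Series as a new row: convert it to a one-row DataFrame first
new_row = pd.Series({'ID': 7, 'Name': 'George'})
df_appended_series = pd.concat([df1, new_row.to_frame().T], ignore_index=True)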
III. Aggregation and Grouping: Groupby Functions
Grouping and aggregation are essential techniques to perform operations on
subsets of data, such as computing averages, sums, or counts. The groupby()
function in Pandas provides a powerful way to split data, apply functions, and
combine results.
1. Basic Grouping
Grouping by a Column: Use groupby() to group data by specific columns,
allowing you to aggregate data for each group.
df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                   'Salary': [60000, 80000, 62000, 85000, 75000]})
grouped = df.groupby('Department')
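Note that groupby() alone only builds the grouping; nothing is computed until an aggregation is applied. Two quick ways to inspect the resulting groups:
# Number of rows in each group
print(grouped.size())
# Mapping of group labels to the row indices they contain
print(grouped.groups)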
2. Aggregation
Aggregating with Built-in Functions: Apply aggregation functions like
mean(), sum(), count(), etc., on grouped data to derive insights.
mean_salary = grouped['Salary'].mean()
print(mean_salary)
Custom Aggregation: Use agg() to apply multiple aggregation functions to
grouped data.
aggregated = grouped['Salary'].agg(['mean', 'sum', 'min', 'max'])
print(aggregated)
Renaming Aggregation Columns: Rename the aggregated columns for
clarity.
aggregated = grouped['Salary'].agg(mean_salary='mean', total_salary='sum')
print(aggregated)
3. Grouping by Multiple Columns
Multi-Level Grouping: Group by more than one column to explore detailed
breakdowns, such as by department and location.
df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                   'Location': ['NY', 'SF', 'NY', 'SF', 'LA'],
                   'Salary': [60000, 80000, 62000, 85000, 75000]})
grouped_multi = df.groupby(['Department', 'Location'])['Salary'].mean()
print(grouped_multi)
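The result of grouping by multiple columns is a Series with a MultiIndex. When a table layout is easier to read, unstack() pivots the inner index level into columns:
# Departments as rows, locations as columns
print(grouped_multi.unstack())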
4. Iterating Over Groups
Iterate over groups to process each group independently. This can be
helpful when different processing is required for each group.
for name, group in grouped:
    print(f"Department: {name}")
    print(group)
IV. Pivot Tables: Use Cases and Examples
Pivot tables are used to summarize and aggregate data in a flexible way, similar to
Excel pivot tables. They allow us to restructure data and gain insights by breaking
down numerical data into meaningful summaries.
1. Creating Pivot Tables
Basic Pivot Table: Create a pivot table using pivot_table(). You can summarize values by specifying index, columns, and aggfunc.
df = pd.DataFrame({'Region': ['East', 'West', 'East', 'West', 'East'],
                   'Product': ['A', 'A', 'B', 'B', 'A'],
                   'Sales': [100, 150, 200, 300, 120]})
pivot = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0)
print(pivot)
Multiple Aggregation Functions: Apply multiple aggregation functions to
summarize data in different ways.
pivot_multi_agg = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc=['sum', 'mean'], fill_value=0)
print(pivot_multi_agg)
2. Use Cases of Pivot Tables
Sales Analysis: Pivot tables are commonly used in sales analysis to
understand performance across different regions, products, or time
periods.
Example: Calculate total sales for each region and each product to
identify top-performing products and regions.
Human Resources: Analyze employee count or average salary by
department and location.
Example: Calculate the average salary per department to determine compensation trends or analyze the distribution of employees across locations (a sketch follows this list).
Financial Reporting: Summarize financial data by quarter, year, or product
type for reporting purposes.
Example: Calculate quarterly sales to compare seasonal performance
and track yearly growth.
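As a minimal sketch of the Human Resources use case above, reusing the Department/Location/Salary data from the grouping section:
import pandas as pd
hr_df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                      'Location': ['NY', 'SF', 'NY', 'SF', 'LA'],
                      'Salary': [60000, 80000, 62000, 85000, 75000]})
# Average salary per department, broken down by location
hr_pivot = hr_df.pivot_table(values='Salary', index='Department', columns='Location', aggfunc='mean', fill_value=0)
print(hr_pivot)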
3. Adding Margins
Margins: Use margins=True to add row and column totals to pivot tables for
a comprehensive view.
pivot_with_totals = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0, margins=True)
print(pivot_with_totals)
Adding Custom Totals: Customize the margin values by renaming them.
pivot_custom_margins = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0, margins=True, margins_name='Total Sales')
print(pivot_custom_margins)