Module_4

Pandas is an open-source Python library designed for data analysis, providing efficient data structures like Series and DataFrame for handling structured data. It simplifies data cleaning, transformation, and exploratory data analysis, making it essential for machine learning and data preprocessing. Key functionalities include reading and writing data, filtering, aggregation, and integration with other tools like NumPy and Matplotlib.

🐼 What is Pandas?

Pandas stands for "Python Data Analysis Library". It is an open-source Python
library providing fast, flexible, and expressive data structures designed to
work with structured data, such as:

 Tabular data (like Excel, SQL tables)
 Labeled data (like time series)
 Missing data
 Heterogeneous types in columns

🧠 Think of Pandas as Excel + SQL + NumPy all in one!

🔧 Why is Pandas Important?

 Makes data cleaning, transformation, and analysis easy.
 Efficient handling of large datasets.
 Simplifies exploratory data analysis (EDA).
 A must-have for Machine Learning, Data Preprocessing, Visualization, and ETL pipelines.

🧱 Core Data Structures

1. Series

A one-dimensional labeled array. Like a single column of data.


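A minimal sketch of creating a Series (values and labels chosen for illustration):

```python
import pandas as pd

# A Series holds values plus an index of labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s)
# a    10
# b    20
# c    30
# dtype: int64
```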

2. DataFrame

A two-dimensional table (rows and columns), like a spreadsheet or SQL table.

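A minimal sketch of building a DataFrame from a dictionary (the column names here are invented):

```python
import pandas as pd

# Each dictionary key becomes a column.
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
})
print(df)
#     Name  Age
# 0  Alice   25
# 1    Bob   30
```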
Essential Functionalities

📊 1. Reading Data

🔍 2. Exploring Data

🧹 3. Cleaning Data
🎯 4. Filtering and Selecting

📐 5. Aggregation and Grouping

🧮 6. Mathematical Operations

🔁 7. Merging and Joining

8. Sorting and Ranking

🕒 9. DateTime Handling
📤 10. Exporting Data
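The ten functionalities above can be sketched in one short session. All data here is invented; in practice step 1 would read a real file such as `pd.read_csv("sales.csv")`, and step 10 would write to a path:

```python
import io
import pandas as pd

# 1. Reading: an in-memory CSV stands in for a file on disk.
raw = "date,region,units,price\n2024-01-01,East,10,2.5\n2024-01-02,West,,3.0\n2024-01-03,East,7,2.5\n"
df = pd.read_csv(io.StringIO(raw))

df.head()                              # 2. Exploring: first rows; also df.info(), df.describe()
df["units"] = df["units"].fillna(0)    # 3. Cleaning: replace the missing units value
east = df[df["region"] == "East"]      # 4. Filtering: boolean mask keeps matching rows
totals = df.groupby("region")["units"].sum()   # 5. Aggregation: units per region
df["revenue"] = df["units"] * df["price"]      # 6. Math: vectorized column arithmetic
names = pd.DataFrame({"region": ["East", "West"], "manager": ["Ann", "Ben"]})
merged = df.merge(names, on="region")  # 7. Merging on a shared key column
ranked = df.sort_values("revenue", ascending=False)  # 8. Sorting
df["date"] = pd.to_datetime(df["date"])              # 9. DateTime parsing
csv_text = df.to_csv(index=False)                    # 10. Exporting (string here, a file path in practice)
```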

💼 Real-World Data Science Applications


Use Case                        How Pandas Helps
Data Cleaning                   Removing missing values, fixing formats
Exploratory Data Analysis       .describe(), .groupby(), .plot() for quick insights
Feature Engineering             Creating new variables from existing ones
Data Merging                    Combining datasets from multiple sources
ETL (Extract, Transform, Load)  Read from files, transform, and write to DB
Time Series Analysis            Easy handling of timestamps and date ranges
ML Preprocessing                Scaling, encoding, and transforming datasets before modeling

🔄 Integration with Other Tools

Tool                How It Works with Pandas
NumPy               Pandas is built on top of NumPy
Matplotlib/Seaborn  Visualize data from Pandas directly
Scikit-learn        Use DataFrames for model inputs and outputs
SQLAlchemy          Read/write SQL via pd.read_sql() and df.to_sql()
Streamlit           Display DataFrames in interactive dashboards

🧠 Performance Tips

 Use df.astype() to reduce memory by setting correct dtypes.
 Use vectorized operations instead of loops.
 For large datasets: use Dask, Vaex, or chunked loading.
 Avoid chaining too many operations in one line (e.g., df.drop(...).fillna(...).sort_values(...)) when it hurts readability.

🧪 Sample Mini Project: Analyze Sales CSV
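A hedged sketch of the mini project; the sales data below is invented and stands in for a real "sales.csv":

```python
import io
import pandas as pd

# Hypothetical sales data standing in for "sales.csv".
csv = """order_id,product,quantity,unit_price
1,Widget,3,9.99
2,Gadget,1,24.50
3,Widget,2,9.99
4,Gizmo,5,4.00
"""
sales = pd.read_csv(io.StringIO(csv))
sales["total"] = sales["quantity"] * sales["unit_price"]

# Revenue per product, best seller first.
by_product = sales.groupby("product")["total"].sum().sort_values(ascending=False)
print(by_product)
```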

🧾 Summary Table
Feature Description
Data Structures Series (1D), DataFrame (2D)
File IO Read/write CSV, Excel, JSON, SQL
Analysis Tools Groupby, aggregation, filtering
Cleaning Drop, fill, replace, rename
Integration NumPy, Matplotlib, Scikit-learn, SQL
Applications EDA, data prep, modeling, dashboards

🧱 What is a Pandas Series?

A Pandas Series is a one-dimensional labeled array that can hold any data
type (integers, strings, floats, Python objects, etc.).

 Similar to a column in a DataFrame or an array in NumPy.
 Each element has a value and an associated label called an index.
✅ Creating a Series

🔍 Querying and Indexing in Series

Pandas Series provides multiple ways to access (query) and index data efficiently.

🔹 1. Positional Indexing (.iloc or direct [ ])

Access by integer position (like lists or arrays).

🔹 2. Label-Based Indexing (.loc or direct [ ] with label)

Access by index label.
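Both access styles on one Series (labels chosen for illustration):

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=["x", "y", "z"])

first = s.iloc[0]      # positional: integer offset, like a list
by_label = s.loc["y"]  # label-based: looks up the index label
also = s["z"]          # plain [] accepts a label on a labeled Series
```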


🔹 3. Slicing

By position:

By label (inclusive of end label):

🔹 4. Boolean Indexing (Filtering)


Used for filtering data based on a condition.
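The slicing and filtering patterns above in one runnable sketch:

```python
import pandas as pd

s = pd.Series([5, 10, 15, 20], index=["a", "b", "c", "d"])

by_pos = s[1:3]        # positional slice: end excluded, keeps labels b, c
by_label = s["b":"d"]  # label slice: end INCLUDED, keeps labels b, c, d
big = s[s > 10]        # boolean mask keeps rows where the condition holds
```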


🔹 5. Fancy Indexing (Multiple indexes at once)


🔹 6. Accessing Index and Values

🔹 7. Modifying Values
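Fancy indexing, attribute access, and in-place modification in one sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

picked = s[["a", "c"]]   # fancy indexing: several labels at once
labels = list(s.index)   # the index labels
values = list(s.values)  # the underlying values (a NumPy array)
s.loc["b"] = 20          # modify one entry in place
```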

🔬 Relevance in Data Science

Task                        How Series Indexing Helps
Data Cleaning               Easily locate and update invalid or missing entries
Feature Engineering         Apply logic per row or per field
Statistical Analysis        Query specific segments of a Series (e.g., outliers, percentiles)
EDA (Exploratory Analysis)  Filter data to study trends, e.g., s[s > mean]
Time Series                 A Series with a datetime index supports time-based slicing

⏱ Example: Filter Sensor Readings > Threshold


Helps identify high readings for alerts in time-series or IoT sensor data.
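A sketch of the sensor scenario; readings and timestamps are invented:

```python
import pandas as pd

# Hypothetical hourly sensor readings.
readings = pd.Series([21.5, 22.0, 29.8, 31.2, 20.9],
                     index=["08:00", "09:00", "10:00", "11:00", "12:00"])

threshold = 25.0
alerts = readings[readings > threshold]  # keep only readings above the threshold
print(alerts)
```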

🗓 Example: Time Series Indexing


✅ Summary: Key Series Querying & Indexing Tools
Method Usage
s[index] Single element by label or position
s.loc[label] Label-based indexing
s.iloc[position] Position-based indexing
s[start:end] Slicing
s[condition] Boolean filtering
s[[i, j]] Fancy indexing (multiple elements)

📘 What Is a DataFrame?

A Pandas DataFrame is a two-dimensional, tabular data structure — think of
it like an Excel spreadsheet or SQL table.

 It has rows and columns.
 Each column can have a different data type.
 It’s built on top of NumPy and is highly optimized for performance.

🧱 Structure of a DataFrame

A DataFrame has:

 Index (row labels) — Default: 0, 1, 2...


 Columns (column labels) — Named like a dictionary key
 Data — 2D array-like, can be a mix of types
✅ Creating a DataFrame

📥 Loading Data into a DataFrame

Source Function
CSV pd.read_csv('file.csv')
Excel pd.read_excel('file.xlsx')
JSON pd.read_json('file.json')
SQL pd.read_sql(query, conn)
Clipboard pd.read_clipboard()
Dictionary pd.DataFrame(dict)
Example – Loading from CSV
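A runnable stand-in for loading a CSV; the file name in practice is up to you, so an in-memory buffer is used here:

```python
import io
import pandas as pd

# In practice: df = pd.read_csv("people.csv"); a buffer stands in for the file.
buffer = io.StringIO("name,age,city\nAsha,29,Pune\nRavi,35,Delhi\n")
df = pd.read_csv(buffer)

print(df.shape)  # (2, 3)
```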

🔍 Querying a DataFrame

1. Selecting Columns

2. Selecting Rows

By Index (position-based): .iloc[]

By Label (index-based): .loc[]
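Column and row selection together, on an invented frame with string row labels:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi", "Meena"],
                   "age": [29, 35, 41]},
                  index=["r1", "r2", "r3"])

ages = df["age"]             # a single column comes back as a Series
row_by_pos = df.iloc[0]      # row by integer position
row_by_label = df.loc["r2"]  # row by index label
```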

🔁 Filtering Rows (Boolean Indexing)


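Boolean indexing on a DataFrame, with an invented example; note the parentheses around each condition when combining with `&` or `|`:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi", "Meena"], "age": [29, 35, 41]})

over_30 = df[df["age"] > 30]                    # single condition
both = df[(df["age"] > 30) & (df["age"] < 40)]  # combined conditions
```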

🎯 Indexing in DataFrame

1. Default Index

Auto-generated: 0, 1, 2…

2. Custom Index

3. Set/Reset Index
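Setting and resetting the index, sketched on an invented frame:

```python
import pandas as pd

df = pd.DataFrame({"id": [101, 102], "score": [88, 92]})

indexed = df.set_index("id")  # promote a column to the row index
back = indexed.reset_index()  # demote it back to an ordinary column
```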
🔁 Slicing Rows and Columns

🧹 Modifying the DataFrame

Adding a Column

Removing a Column

Changing Values
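Adding, changing, and removing columns in one sketch (names invented):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [29, 35]})

df["senior"] = df["age"] > 30     # add a derived column
df.loc[0, "age"] = 30             # change a single cell
df = df.drop(columns=["senior"])  # remove a column
```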

📐 Aggregation & Grouping

🧮 Math Operations
🧪 Example: Filter and Compute

🔑 Data Exploration Methods

🧠 Relevance in Data Science


Task                             How DataFrames Help
Data Cleaning                    Drop nulls, fill missing, rename, format values
Exploratory Data Analysis (EDA)  Quick summary and statistical insights
Feature Engineering              Create new variables, transform existing columns
Machine Learning                 DataFrame as input for scikit-learn models
Visualization                    Plotting with Seaborn/Matplotlib based on DataFrame structure
ETL Pipelines                    Extract, transform, and load data efficiently
Data Aggregation                 Grouping and summarizing across categories

✅ Summary Table

Feature Command/Example
Load CSV pd.read_csv()
Access Column df['col'] or df.col
Access Row df.loc[] / df.iloc[]
Filter Rows df[df['col'] > value]
Add Column df['new'] = ...
Group & Aggregate df.groupby()['col'].mean()
Change Index df.set_index() / df.reset_index()
Describe Stats df.describe()

💡 Bonus Tip: Chaining Methods

🔗 What is Merging in Pandas?

Merging in Pandas is the process of combining two or more DataFrames based
on common columns or indices, much like SQL JOIN operations.

It allows you to build complex datasets by combining smaller pieces based on
logical relationships.

🧱 Syntax

Parameter Description
left, right DataFrames to merge
how Type of join: inner, outer, left, right
on Column(s) to join on (common column)
left_on, right_on Specify different column names if keys differ
🧪 Example DataFrames

🔀 1. Inner Join (only matching rows)


🔄 2. Left Join (all from left, match from right)

🔁 3. Right Join (all from right, match from left)


🌐 4. Outer Join (all rows from both, fill NaNs)

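All four join types on one pair of invented DataFrames, where `id` values 2 and 3 match on both sides:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [85, 90, 75]})

inner = pd.merge(left, right, on="id", how="inner")    # ids 2, 3
left_j = pd.merge(left, right, on="id", how="left")    # ids 1, 2, 3 (NaN score for 1)
right_j = pd.merge(left, right, on="id", how="right")  # ids 2, 3, 4
outer = pd.merge(left, right, on="id", how="outer")    # ids 1, 2, 3, 4
```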

🧠 Advanced Parameters

🔹 left_on, right_on

Use when column names are different:


🔹 suffixes

Add suffixes when overlapping column names (besides the key):

🔍 Merging on Index
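The advanced parameters above, sketched on invented frames: differing key names, suffixes for a clashing column, and an index-based merge.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "name": ["A", "B"], "value": [10, 20]})
orders = pd.DataFrame({"uid": [1, 2], "value": [100, 200]})

# Different key names on each side; suffixes disambiguate the clashing 'value' column.
m = pd.merge(users, orders, left_on="user_id", right_on="uid",
             suffixes=("_user", "_order"))

# Merging on the index instead of a column.
a = pd.DataFrame({"x": [1, 2]}, index=["p", "q"])
b = pd.DataFrame({"y": [3, 4]}, index=["p", "q"])
on_index = a.join(b)  # or pd.merge(a, b, left_index=True, right_index=True)
```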

📊 Real-Life Example: Sales and Customer Info

Result:

🔬 Importance in Data Science


Task                 Relevance of Merging
Data Integration     Combine data from different sources (e.g., user logs + demographics)
Feature Engineering  Merge extra info (e.g., zip code data with customer data)
Joining Results      Merge model outputs with test IDs or metadata
Master Tables        Create master data sets by combining multiple inputs
Relational Datasets  Work with multi-table formats like in SQL or NoSQL

✅ Summary: Merge Types Cheat Sheet

Merge Type  Description                         Keeps Unmatched Rows
inner       Only matched rows                   ❌ No
left        All from left + matched from right  ✅ Yes (left only)
right       All from right + matched from left  ✅ Yes (right only)
outer       All rows from both sides            ✅ Yes (both sides)
📘 What is groupby() in Pandas?

The groupby() function in Pandas is used to split a DataFrame into groups based
on a specific column (or columns), apply a function (like mean, sum, count, etc.),
and then combine the results.

Think of it like SQL's GROUP BY, allowing aggregation and summarization of data.

🧠 The GroupBy Process (SAC):

1. Split the data into groups.


2. Apply a function to each group.
3. Combine the results into a new DataFrame or Series.

🧱 Basic Syntax

To apply a function:
🧪 Example DataFrame

🔁 Group by a Column and Aggregate

1. Group and Compute Mean

2. Group and Count


3. Group and Sum

📊 Multiple Aggregations with agg()


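A runnable sketch of the aggregations above, on an invented department/salary frame:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["IT", "IT", "HR", "HR"],
    "salary": [60000, 70000, 50000, 52000],
})

mean_sal = df.groupby("dept")["salary"].mean()  # split-apply-combine: mean per dept
counts = df.groupby("dept")["salary"].count()   # rows per group
total = df.groupby("dept")["salary"].sum()      # sum per group
multi = df.groupby("dept")["salary"].agg(["mean", "sum", "count"])  # several at once
```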

🔁 Group by Multiple Columns

Gives you groupings at a more granular level (multi-indexed DataFrame).

🔄 Resetting the Index

This brings the result back to a regular DataFrame (not Series or multi-index).

💼 Custom Aggregation Functions

🔎 Filtering Groups

Using .filter():
Returns rows from groups where average salary > 55,000.
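A sketch of `.filter()` with the 55,000 threshold; the data is invented so only one group qualifies:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["IT", "IT", "HR", "HR"],
    "salary": [60000, 70000, 50000, 52000],
})

# Keep only rows belonging to groups whose mean salary exceeds 55,000.
rich = df.groupby("dept").filter(lambda g: g["salary"].mean() > 55000)
```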

📚 Use in Data Science

Use Case                   How groupby() Helps
Exploratory Data Analysis  Summarize data by category (e.g., avg sales by region)
Feature Engineering        Create group-level features (e.g., user's avg transaction)
Anomaly Detection          Compare group stats to find outliers
Data Cleaning              Check missing data per group
Model Evaluation           Group by models or labels for performance analysis

🧪 Real-World Example: Customer Purchases

This shows total purchase value per customer — great for customer
segmentation.

✅ Summary Table

Task Code Example


Group by column df.groupby('col')
Group + mean df.groupby('col')['val'].mean()
Group + multiple agg df.groupby('col').agg(['sum', 'count'])
Group by two columns df.groupby(['col1', 'col2'])
Filter groups .filter(lambda x: x['val'].mean() > 100)
Reset index after grouping .reset_index()

📘 What is a Pivot Table in Pandas?

A pivot table is a powerful data transformation tool used to summarize,
aggregate, and reorganize data in a DataFrame.

Just like Excel pivot tables, Pandas' pivot_table() is used to group data across two
axes (rows and columns), apply an aggregation function, and extract insights.
🧱 Syntax of pivot_table()

Parameter Description
data DataFrame to use
values Column to aggregate (e.g., sales, price)
index Keys to group by on the rows
columns Keys to group by on the columns
aggfunc Aggregation function (mean, sum, count, etc.)
fill_value Replace missing values (e.g., 0 instead of NaN)

🧪 Example Dataset

🔀 Creating a Simple Pivot Table

✅ This shows total Sales by Region and Product.
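A minimal sketch with invented Region/Product/Sales data:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Product": ["A", "B", "A", "B"],
    "Sales": [100, 150, 200, 50],
})

# Rows grouped by Region, columns by Product, cells hold summed Sales.
pivot = pd.pivot_table(df, values="Sales", index="Region",
                       columns="Product", aggfunc="sum")
print(pivot)
```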

📊 Aggregation Functions (aggfunc)

You can change the aggregation logic:

Other options: 'mean', 'sum', 'min', 'max', 'std'


🧱 Multiple Aggregation Columns

Gives a multi-index result with both sales and profit data summarized.

🔄 Filling Missing Values (fill_value)

Replaces NaN with 0 — useful in reporting or ML preprocessing.

📘 margins=True: Add Grand Totals

Adds an "All" row and column for total sums.
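`fill_value` and `margins` together, on an invented frame where West has no Product B:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West"],
    "Product": ["A", "B", "A"],
    "Sales": [100, 150, 200],
})

# West has no Product B, so fill_value=0 replaces the resulting NaN;
# margins=True appends an "All" row and column of grand totals.
pivot = pd.pivot_table(df, values="Sales", index="Region", columns="Product",
                       aggfunc="sum", fill_value=0, margins=True)
```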


🔁 Use Case: Multi-level Pivot

Groups by two levels of indexing — perfect for nested analysis.

📈 Visualization Idea

Pivot tables + heatmaps = beautiful EDA summaries!

📊 Use of Pivot Tables in Data Science


Application Area            Example Use of Pivot Table
EDA (Exploratory Analysis)  Quickly summarize numeric features across categories (e.g., avg price per brand)
Data Cleaning               Spot missing or inconsistent entries grouped by dimension
Feature Engineering         Aggregate data before feeding into ML (e.g., avg orders per user)
Reporting & Dashboards      Create summarized KPIs per month/region/category
Anomaly Detection           Identify outliers by comparing group stats
Time Series Analysis        Pivot over time to view trends in sales, events, transactions

✅ Summary Table

Task                  Code Example
Basic pivot           pd.pivot_table(df, values='Sales', index='Region', columns='Product')
Fill missing values   fill_value=0
Multiple agg values   values=['Sales', 'Profit']
Multi-level group by  index=['Region', 'Product']
Add totals            margins=True

🔍 Pro Tip:

Pivot tables are highly optimized and can replace many lines of groupby logic
with just one clear line of summarization.

What is Date/Time Functionality in Python?

Date/time functionality in Python allows you to:

 Work with dates and times as objects


 Format, parse, and calculate time-related data
 Perform time-series analysis, timestamp comparison, and date
arithmetic

🧰 Key Libraries for Date/Time


Library   Description
datetime  Native Python module for working with dates and times
time      Lower-level module, useful for performance timing
calendar  Functions related to calendar calculations (e.g., leap year)
dateutil  Extended date/time parsing and manipulation (external library)
pytz      Time zone support
pandas    Built-in time handling for Series/DataFrame, essential in Data Science

🧱 Basic Usage with datetime

✅ Get Current Date and Time


📅 Create Date and Time Objects

🔁 Formatting Dates (strftime)

🔄 Parsing Dates from Strings (strptime)

➕ Date Arithmetic with timedelta

Date/Time Comparison

Convert to Timestamp
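The `datetime` operations above in one sketch (dates chosen for illustration):

```python
from datetime import datetime, timedelta

now = datetime.now()                                  # current date and time
d = datetime(2024, 3, 15, 9, 30)                      # construct a specific moment
text = d.strftime("%Y-%m-%d %H:%M")                   # datetime -> string
parsed = datetime.strptime("2024-03-15", "%Y-%m-%d")  # string -> datetime
tomorrow = parsed + timedelta(days=1)                 # date arithmetic
earlier = parsed < d                                  # comparison
ts = d.timestamp()                                    # seconds since the Unix epoch
```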
🌍 Time Zones (using pytz)

🧪 Working with Dates in Pandas


📆 Convert Column to DateTime

🧮 Extract Date Components

🔁 Resample (For Time Series)

Resample Code  Meaning
'D'            Day
'W'            Week
'M'            Month
'Y'            Year
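Parsing, component extraction, and resampling together on an invented sales frame; `'MS'` (month start) is used as the monthly alias here:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-20", "2024-02-10"],
    "sales": [100, 150, 200],
})
df["date"] = pd.to_datetime(df["date"])  # string column -> datetime64

df["month"] = df["date"].dt.month        # extract components via the .dt accessor
monthly = df.set_index("date")["sales"].resample("MS").sum()  # monthly totals
```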

📈 Example: Time Series Plot

💼 Use of Date/Time in Data Science


Use Case              Description
Time Series Analysis  Stock prices, weather, IoT data trends over time
Feature Engineering   Extracting day of week, month, season for ML models
Log Analysis          Analyzing timestamps in logs for user behavior or errors
Forecasting           Building models to predict future trends
Data Cleaning         Identifying missing or out-of-sequence timestamps
Cohort Analysis       Grouping users by signup/join dates
Event Detection       Spotting anomalies or patterns in date-based sequences

✅ Summary Table
Task Code Example
Get current datetime datetime.now()
Format datetime to string dt.strftime("%Y-%m-%d")
Parse string to datetime datetime.strptime(str, format)
Add days dt + timedelta(days=1)
Extract year/month/day in Pandas df['date'].dt.year
Resample time series df.resample('M').mean()
Convert column to datetime pd.to_datetime(df['column'])

📊 What is a DataFrame?

A DataFrame is a 2-dimensional, tabular data structure from the pandas library
that:

 Is similar to an Excel spreadsheet or SQL table


 Supports labeled rows and columns
 Allows powerful data manipulation

🧪 Example Dataset: Sales Data

🔧 1. Viewing and Inspecting

🧪 2. Add a New Column

Now df includes a column Revenue for each transaction.


🔍 3. Filter Rows (Boolean Indexing)

Used frequently in data analysis and cleaning.

🔁 4. Sorting Data

Helps identify top transactions, outliers, or order results.

🔢 5. Group By and Aggregation

This is core to analytics — summarizing grouped data.

🔄 6. Replace / Map Values

Useful when standardizing or encoding categorical data for ML.


🔢 7. Renaming Columns

Clean, readable column names are important in modeling.

🔗 8. Merging DataFrames

Used for joining datasets (e.g., transactions + customer info).

📅 9. Date Filtering and Time-based Features

Time-based slicing is common in time series and forecasting.

🧼 10. Drop or Fill Missing Data

Essential step in data cleaning before model training.
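The ten steps above can be sketched end to end; all column names and values below are invented to stand in for the sales dataset:

```python
import pandas as pd

# Hypothetical sales transactions standing in for the dataset above.
df = pd.DataFrame({
    "Date": ["2024-01-05", "2024-01-12", "2024-02-03", "2024-02-20"],
    "Region": ["East", "West", "East", "West"],
    "Units": [3, 5, None, 4],
    "Price": [10.0, 8.0, 10.0, 8.0],
})

df["Date"] = pd.to_datetime(df["Date"])              # parse dates for time features
df["Units"] = df["Units"].fillna(0)                  # fill the missing value
df["Revenue"] = df["Units"] * df["Price"]            # add a derived column
east = df[df["Region"] == "East"]                    # boolean filtering
ranked = df.sort_values("Revenue", ascending=False)  # sorting
by_region = df.groupby("Region")["Revenue"].sum()    # group by and aggregate
df["Region"] = df["Region"].replace({"East": "E", "West": "W"})  # standardize values
df = df.rename(columns={"Units": "Quantity"})        # readable column names
regions = pd.DataFrame({"Region": ["E", "W"], "Manager": ["Ann", "Ben"]})
df = df.merge(regions, on="Region", how="left")      # join extra info
jan = df[df["Date"].dt.month == 1]                   # date-based filtering
```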


📊 Final DataFrame

You now have a clean, enriched, time-aware DataFrame ready for:

 Visualization
 Modeling (ML)
 Export to CSV/Excel
 Dashboarding

💡 Why It’s Important in Data Science:


Task                 How DataFrame Manipulation Helps
EDA                  Filter, group, and analyze to understand trends
Feature Engineering  Add time features, ratios, or encoded values
Model Input          A cleaned and structured DataFrame is used directly in ML pipelines
Data Cleaning        Remove nulls, fix types, handle outliers
Joins & Merging      Combine datasets for richer insights
Visualization Prep   Slice and group data to prepare for plotting

✅ Summary of Key Methods


Operation Function
Filter rows df[df['col'] == value]
Add new column df['new'] = ...
Group by df.groupby('col')['val'].sum()
Sort rows df.sort_values(by='col')
Rename columns df.rename(columns={...})
Merge tables pd.merge(df1, df2, on=..., how=...)
Date operations df['Date'].dt.day_name()
