UNIT II Notes (1)
Introduction to Pandas
Before using Pandas, make sure it is installed and working:

✅ 1. Check Python Version
python --version

✅ 2. Install Pandas
pip install pandas

✅ 3. Verify Installation
import pandas as pd
print(pd.__version__)

You can also run a small test script. Save the following as test_pandas.py (the DataFrame contents are just a placeholder):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df)

Run it:
python test_pandas.py
Pandas is one of the most popular libraries in Python for data manipulation and analysis. It
provides efficient and easy-to-use data structures for handling and analyzing structured data. The
primary data structures in Pandas are Series and DataFrame, each serving a unique purpose in
working with data.
1. Pandas Series
A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats,
etc.). It's similar to a list or an array but with additional functionality provided by Pandas, such as
labels (indices) that allow easy access to the data.
Creating a Series:

import pandas as pd

s = pd.Series([1, 2, 3, 4])
print(s)

Output:
0    1
1    2
2    3
3    4
dtype: int64

You can also supply custom index labels:

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

Output:
a    10
b    20
c    30
d    40
dtype: int64
2. Pandas DataFrame
A DataFrame is a two-dimensional labeled data structure with rows and columns, similar to a table in a database or a spreadsheet. Each column can hold a different data type.

Creating a DataFrame:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)
print(df)

Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston

You can also build the same DataFrame from a list of lists by naming the columns explicitly:

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago'],
    ['David', 40, 'Houston']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

Output: (same table as above)
Accessing Columns:
print(df['Name'])
Output:
0 Alice
1 Bob
2 Charlie
3 David
Name: Name, dtype: object
Accessing Rows (using .iloc[] or .loc[]):
print(df.iloc[1]) # Accessing the second row (index 1)
Output:
Name Bob
Age 30
City Los Angeles
Name: 1, dtype: object
print(df.loc[1]) # Accessing the second row (with label 1)
Output:
Name Bob
Age 30
City Los Angeles
Name: 1, dtype: object
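The two outputs above look identical because the default index is 0, 1, 2, …. With a custom index the difference between positional and label-based access becomes visible. A small sketch (the index labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]},
                  index=['x', 'y', 'z'])

# .iloc[] is positional: the row at integer position 1
print(df.iloc[1])

# .loc[] is label-based: the row whose index label is 'y'
print(df.loc['y'])

# With these labels, df.loc[1] would raise a KeyError
```

Both calls return the same row here, but for different reasons: position 1 happens to carry the label 'y'.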
Conclusion:
- Series is a simple, one-dimensional labeled array, useful for handling a single column of data.
- DataFrame is a powerful, two-dimensional structure that can handle multiple columns and rows, and is better suited for working with tabular data.

These data structures, Series and DataFrame, are the foundation of data manipulation and analysis with Pandas, allowing you to efficiently work with real-world datasets.
Pandas makes it very easy to read data from different file formats and save it back to various
formats. Whether your data is stored in a CSV, Excel, JSON, or other formats, Pandas provides
efficient methods to import, manipulate, and export data.
1. Reading Data from Files

1.1 Reading CSV Files
CSV (Comma Separated Values) is one of the most common formats for storing tabular data. Pandas provides the read_csv() function to load data from CSV files.

Example (the file name is illustrative):
import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())   # show the first five rows
1.2 Reading Excel Files
Excel files are another common format for data storage. Pandas provides the read_excel() function for reading Excel files. You will need the openpyxl or xlrd library installed to handle Excel files.

Example (the file and sheet names are illustrative):
import pandas as pd

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df.head())
1.3 Reading JSON Files
JSON (JavaScript Object Notation) is a lightweight data interchange format. You can use the read_json() function to load data from a JSON file.

Example (the file name is illustrative):
import pandas as pd

df = pd.read_json('data.json')
print(df.head())
1.4 Reading from SQL Databases
Pandas can read data directly from SQL databases using the read_sql() function. You'll need a connection to your database, and Pandas will execute a query and return the result as a DataFrame.

Example (using SQLite; the database and table names are illustrative):
import pandas as pd
import sqlite3

conn = sqlite3.connect('example.db')
df = pd.read_sql('SELECT * FROM users', conn)
print(df)
conn.close()
2. Writing Data to Files
Pandas also makes it easy to save your DataFrame to different file formats, such as CSV, Excel, JSON, and more.
2.1 Writing to CSV Files
To export a DataFrame to a CSV file, you can use the to_csv() method.

Example (the output file name is illustrative):
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df.to_csv('output.csv', index=False)   # index=False omits the row index
2.2 Writing to Excel Files
To export a DataFrame to an Excel file, you can use the to_excel() method. You need the openpyxl library installed for .xlsx files.

Example:
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
2.3 Writing to JSON Files
To save a DataFrame to a JSON file, you can use the to_json() method.

Example:
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df.to_json('output.json', orient='records', lines=False)

Two useful parameters:
- orient: determines the format of the JSON data (options include 'split', 'records', 'index', 'columns').
- lines: whether to write each record on a separate line (default is False).
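A quick sketch of how orient changes the result, using to_json() on a tiny in-memory DataFrame (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# 'records': a list of row objects
print(df.to_json(orient='records'))
# [{"Name":"Alice","Age":25},{"Name":"Bob","Age":30}]

# 'columns' (the default): one object per column, keyed by row index
print(df.to_json(orient='columns'))
# {"Name":{"0":"Alice","1":"Bob"},"Age":{"0":25,"1":30}}

# lines=True writes one record per line (requires orient='records')
print(df.to_json(orient='records', lines=True))
```

The same orient values apply when writing to a file instead of returning a string.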
2.4 Writing to SQL Databases
You can write data from a DataFrame to an SQL database using the to_sql() method. It requires a connection object to the database.

Example (using SQLite; the database and table names are illustrative):
import pandas as pd
import sqlite3

conn = sqlite3.connect('example.db')
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df.to_sql('users', conn, if_exists='replace', index=False)
conn.close()
Conclusion:
Pandas provides a versatile set of functions to read data from a variety of file formats (CSV,
Excel, JSON, SQL) and export data back to these formats. This ability to handle different data
sources seamlessly is one of the key strengths of Pandas in data analysis and manipulation tasks.
Data cleaning is an essential step in the data analysis process. Raw data often contains missing or
duplicate values, as well as other inconsistencies that can skew analysis. Pandas provides a
variety of functions to handle these issues, enabling effective data cleaning and preparation.
1. Handling Missing Data
Missing data can occur for various reasons (e.g., values not recorded, data entry errors). Pandas provides several methods to handle missing values (NaN), allowing you to either fill them with certain values or drop them entirely.
1.1 Detecting Missing Data
You can detect missing data in a DataFrame using the isnull() or notnull() methods. For example, with one missing Age value (the data is illustrative):

import pandas as pd
import numpy as np

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, np.nan, 35, 40]}
df = pd.DataFrame(data)
print(df.isnull())

Output:
    Name    Age
0  False  False
1  False   True
2  False  False
3  False  False
1.2 Dropping Missing Data
You can drop rows or columns with missing values using the dropna() method.
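A minimal sketch (the missing value is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, np.nan, 35]})

# Drop rows containing any missing value
print(df.dropna())        # Bob's row is removed

# Drop columns containing any missing value instead
print(df.dropna(axis=1))  # the Age column is removed
```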
1.3 Filling Missing Data
You can fill missing data with a specific value using the fillna() method.
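For example (the fill values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, np.nan, 35]})

# Fill every missing value with a constant
print(df.fillna(0))

# Or fill per column, e.g. with that column's mean
print(df.fillna({'Age': df['Age'].mean()}))  # Bob's Age becomes 30.0
```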
2. Removing Duplicates
Duplicate data can arise from data entry errors or merging data from multiple sources. Pandas
offers a simple way to identify and remove duplicates from your DataFrame.
2.1 Identifying Duplicates
You can detect duplicate rows using the duplicated() method, which returns a boolean Series indicating whether each row is a duplicate of an earlier one.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David'],
        'Age': [25, 30, 25, 40]}
df = pd.DataFrame(data)
print(df.duplicated())

Output:
0    False
1    False
2     True
3    False
dtype: bool
2.2 Removing Duplicate Rows
You can remove duplicates from your DataFrame using the drop_duplicates() method.
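For example (rows are illustrative; by default the first occurrence is kept):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

# Remove fully duplicated rows, keeping the first occurrence
df_unique = df.drop_duplicates()
print(df_unique)   # Alice and Bob remain

# Deduplicate on a subset of columns only
df_by_name = df.drop_duplicates(subset=['Name'])
```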
3. Data Filtering
3.1 Filtering by Condition
Data filtering allows you to select rows from a DataFrame based on certain conditions.
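The most common form is boolean indexing: a condition on a column produces a boolean Series, which selects the matching rows. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40]})

# Rows where Age is greater than 28
print(df[df['Age'] > 28])

# Combine conditions with & (and) / | (or); wrap each in parentheses
print(df[(df['Age'] > 28) & (df['Name'] != 'David')])
```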
3.2 Filtering by String Pattern
Pandas provides string functions to filter rows based on string patterns (such as contains(), startswith(), and endswith()), accessed through the .str accessor. Using the Name/Age/City DataFrame from earlier:

# Rows whose City contains 'New'
df_filtered = df[df['City'].str.contains('New')]
print(df_filtered)

# Rows whose Name starts with 'A'
df_filtered = df[df['Name'].str.startswith('A')]
print(df_filtered)
Task                   Function
Detect Missing Data    isnull(), notnull()
Drop Missing Data      dropna()
Fill Missing Data      fillna()
Detect Duplicates      duplicated()
Remove Duplicates      drop_duplicates()
Filter by Condition    Boolean indexing (df[condition])
Filter by String       str.contains(), str.startswith()
Conclusion:
Data cleaning in Pandas is a crucial step in preparing data for analysis. By handling missing data,
removing duplicates, and applying filters, you can ensure that your dataset is accurate,
consistent, and ready for further analysis. Pandas provides efficient and flexible tools for each of
these tasks, making data cleaning fast and straightforward.
Pandas is powerful when it comes to manipulating data. Whether you need to sort your data,
group it based on certain columns, merge data from different sources, or concatenate multiple
datasets, Pandas offers various methods that make these tasks easy and efficient.
1. Sorting Data
2. Indexing Data
3. Grouping Data
4. Merging DataFrames
5. Concatenating DataFrames
1. Sorting Data
Sorting data is essential to analyze patterns or prepare data for visualizations or reporting.
You can sort a DataFrame by its index using the sort_index() method.

import pandas as pd

# Sample DataFrame with an unordered index
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data, index=['a', 'd', 'c', 'b'])

# Sorting by index
df_sorted = df.sort_index(ascending=True)
print(df_sorted)
Output:
Name Age
a Alice 25
b David 40
c Charlie 35
d Bob 30
You can sort by one or more columns using the sort_values() method.

df_sorted = df.sort_values(by='Age')
print(df_sorted)

Output:
      Name  Age
a    Alice   25
d      Bob   30
c  Charlie   35
b    David   40

To sort in descending order, pass ascending=False:

df_sorted = df.sort_values(by='Age', ascending=False)
2. Indexing Data
Indexing refers to the ability to access rows or columns based on their labels or positions.
You can access columns directly as attributes or by using the column name in square brackets (bracket notation works for any column name, including names containing spaces).

print(df.Name)
print(df['Name'])
You can set a column as the index of the DataFrame using the set_index() method. For a DataFrame with Name, Age, and City columns:

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']})

df_indexed = df.set_index('Name')
print(df_indexed)

Output:
         Age         City
Name
Alice     25     New York
Bob       30  Los Angeles
Charlie   35      Chicago
David     40      Houston

To restore the default integer index, use reset_index():

df_reset = df_indexed.reset_index()
print(df_reset)
3. Grouping Data
Grouping is useful for performing aggregation operations like sum, mean, count, etc., based on
some criteria.
You can use the groupby() method to group data based on one or more columns. For example, with a DataFrame of five people (two of them aged 25):

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 25],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Chicago']}
df = pd.DataFrame(data)

# Mean Age within each Age group
print(df.groupby('Age')['Age'].mean())

Output:
Age
25    25.0
30    30.0
35    35.0
40    40.0
Name: Age, dtype: float64

You can also group by multiple columns and count the rows in each group with size():

print(df.groupby(['Age', 'City']).size())

Output:
Age  City
25   Chicago        1
     New York       1
30   Los Angeles    1
35   Chicago        1
40   Houston        1
dtype: int64
4. Merging DataFrames
Merging DataFrames is a common operation when you want to combine data from multiple
sources. Pandas provides the merge() function for this purpose, similar to SQL joins.
You can merge DataFrames using the merge() function by specifying a common column (key).

import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                    'Age': [25, 30, 35]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'David'],
                    'City': ['New York', 'Los Angeles', 'Houston']})

# Inner join on the shared 'Name' column
merged = pd.merge(df1, df2, on='Name')
print(merged)

Output:
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
You can merge DataFrames with different column names using left_on and right_on.
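A sketch, assuming the key column is called Name in one frame and EmployeeName in the other (the column names and data are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'EmployeeName': ['Alice', 'Bob'],
                    'City': ['New York', 'Los Angeles']})

# Join on differently named key columns
merged = pd.merge(df1, df2, left_on='Name', right_on='EmployeeName')
print(merged)   # both key columns are kept in the result
```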
You can specify the type of join using the how parameter: 'inner' (the default), 'left', 'right', or 'outer', analogous to SQL joins.
5. Concatenating DataFrames
You can concatenate multiple DataFrames along a particular axis (either row-wise or column-wise) using the concat() function.

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})

# Row-wise concatenation (axis=0, the default)
result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40

For column-wise concatenation, pass axis=1:

result = pd.concat([df1, df2], axis=1)
print(result)

Output:
    Name  Age     Name  Age
0  Alice   25  Charlie   35
1    Bob   30    David   40
Operation                  Function
Sorting                    sort_index(), sort_values()
Indexing                   loc[], iloc[], set_index()
Grouping                   groupby(), agg(), size()
Merging DataFrames         merge()
Concatenating DataFrames   concat()
Conclusion:
Pandas provides a variety of powerful tools to manipulate and transform data. Sorting, indexing,
grouping, merging, and concatenating DataFrames are common tasks that allow you to clean,
organize, and analyze data more effectively. With these tools, you can perform complex data
operations in just a few lines of code, making Pandas an essential library for data manipulation in
Python.
In Pandas, working with dates and times is made easy with the datetime functionality, which
includes converting strings to datetime objects, extracting components like year, month, day,
etc., and performing operations on date and time data.
1. Converting Strings to Datetime
To convert strings to datetime objects, you can use the pd.to_datetime() function. It will automatically recognize most common date and time formats.

import pandas as pd

date_time = pd.to_datetime('2025-03-11 14:30:00')
print(date_time)

Output:
2025-03-11 14:30:00
2. Extracting Date and Time Components
Once you have a datetime object, you can easily extract individual components like the year,
month, day, etc.
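For a single Timestamp these components are plain attributes. A quick sketch (the date is illustrative):

```python
import pandas as pd

ts = pd.to_datetime('2025-03-11 14:30:00')

print(ts.year)        # 2025
print(ts.month)       # 3
print(ts.day)         # 11
print(ts.hour)        # 14
print(ts.minute)      # 30
print(ts.day_name())  # 'Tuesday'
```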
3. Datetime Columns in a DataFrame
When working with a DataFrame or Series that contains date and time data, you can apply the same functions across the whole column. For example (the dates are illustrative):

df = pd.DataFrame({'date': ['2025-01-15', '2025-02-20', '2025-03-11']})

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
print(df)

# Component access on a column goes through the .dt accessor
print(df['date'].dt.year)
4. Adding and Subtracting Time
You can use the Timedelta class to add or subtract time from a datetime object. For example, adding one week to a date:
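A minimal sketch (the date is illustrative):

```python
import pandas as pd

date = pd.to_datetime('2025-03-11')

# Add one week
next_week = date + pd.Timedelta(weeks=1)
print(next_week)   # 2025-03-18 00:00:00

# Subtract three days
earlier = date - pd.Timedelta(days=3)
print(earlier)     # 2025-03-08 00:00:00
```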
5. Differences Between Dates
You can perform date and time arithmetic to find the difference between two datetime objects. The result will be a Timedelta object.

date1 = pd.to_datetime('2025-03-01')
date2 = pd.to_datetime('2025-03-11')

# Find the difference
difference = date2 - date1
print(difference)   # Output: 10 days 00:00:00
6. Time Zones
You can also handle time zones by using the tz_localize() and tz_convert() functions.
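A sketch: tz_localize() attaches a zone to a naive timestamp, and tz_convert() translates between zones (the timestamp and zone names are illustrative):

```python
import pandas as pd

ts = pd.to_datetime('2025-03-11 14:30:00')

# Attach a time zone to the naive timestamp
ts_utc = ts.tz_localize('UTC')
print(ts_utc)   # 2025-03-11 14:30:00+00:00

# Convert to another time zone
ts_ny = ts_utc.tz_convert('America/New_York')
print(ts_ny)    # 2025-03-11 10:30:00-04:00 (EDT on this date)
```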
7. Formatting Dates/Times
To format a datetime object as a string, use the strftime() method. This allows you to
customize the output format.
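For example (the format codes follow Python's standard strftime directives):

```python
import pandas as pd

ts = pd.to_datetime('2025-03-11 14:30:00')

print(ts.strftime('%Y-%m-%d'))        # '2025-03-11'
print(ts.strftime('%d/%m/%Y %H:%M'))  # '11/03/2025 14:30'
print(ts.strftime('%B %d, %Y'))       # 'March 11, 2025'
```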
8. Handling Missing Dates (NaT)
If a date or time value is missing, Pandas uses NaT (Not a Time) to represent it, similar to how NaN works for numerical values.
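For example, a missing entry in a date column becomes NaT after conversion:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2025-03-11', None, '2025-03-15']))
print(dates)

# NaT behaves like NaN: it is detected by isnull()/isna()
print(dates.isnull())   # False, True, False
```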
This covers the basic and more advanced operations you will typically need when working with dates and times in Pandas.