Week 4.1
Pandas
Pandas is a powerful open-source Python library that offers versatile data structures and tools for
data manipulation and analysis. It's particularly adept at handling structured data, such as tabular
data commonly found in CSV files, spreadsheets, SQL databases, and more. Pandas is built on top
of NumPy, leveraging its efficient array-based operations while providing additional functionality
for data organization and manipulation. Its two core data structures, the Series and the DataFrame, are
described below.
1. Series: One of the core data structures in Pandas is the Series. A Pandas Series is a one-
dimensional labeled array that can hold data of various types, including integers, floats,
strings, and Python objects. Each element in a Series is associated with a label or index,
allowing for easy and intuitive data access and alignment. Series are similar to one-
dimensional NumPy arrays but provide additional functionalities, such as label-based
indexing and alignment, making them more flexible and convenient for data analysis tasks.
2. DataFrame: A two-dimensional labeled data structure with columns of potentially
different types. It can be thought of as a spreadsheet or SQL table, and it is the primary
data structure used in Pandas for data manipulation and analysis.
Pandas provides a wide range of functionalities for data manipulation and analysis, including:
• Reading and writing data from/to various file formats such as CSV, Excel, SQL databases, and more.
• Handling missing data by filling or dropping NaN values.
• Selecting, filtering, and slicing rows and columns by label or position.
• Grouping, aggregating, and summarizing data.
• Merging and joining datasets through database-style operations.
Overall, Pandas is widely used in data science, machine learning, finance, economics, and other
fields for data manipulation, cleaning, exploration, and analysis tasks. Its intuitive and powerful
API makes it a popular choice among Python developers and data scientists for working with
structured data.
You can create a Pandas Series using the pd.Series() constructor by passing a list or array-like object
as data.
import pandas as pd
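For example, a minimal sketch (the values here are illustrative):

# Create a Series from a Python list; Pandas assigns a default integer index (0, 1, 2, ...)
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)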
Accessing Elements:
You can access elements in a Pandas Series using integer indexing or labels.
import pandas as pd
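For example, a minimal sketch (the values and labels are illustrative):

# Create a Series with an explicit label index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.iloc[0])   # access by integer position -> 10
print(s['b'])      # access by label            -> 20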
Operations on Series:
You can perform various operations on Pandas Series, such as arithmetic operations, boolean
indexing, and applying functions.
import pandas as pd
# Arithmetic operations
series = pd.Series([1, 2, 3, 4, 5])
print(series * 2)
# Boolean indexing
print(series[series > 3])
# Applying functions
print(series.apply(lambda x: x ** 2))
Handling Missing Data:
Pandas Series can handle missing values, represented as NaN (Not a Number), using np.nan from
NumPy. Descriptive statistics such as mean() and std() skip NaN values by default.
import pandas as pd
import numpy as np

# A Series containing a missing value (the values are illustrative)
series = pd.Series([1, 2, np.nan, 4, 5])

# Basic attributes and statistics; NaN is counted in size but ignored by the statistics
print(series.shape)
print(series.size)
print(series.dtype)
print(series.mean())
print(series.std())
print(series.min())
print(series.max())
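A brief sketch of the most common ways of dealing with the missing value itself (the fill value 0 is
illustrative):

# Detect, drop, or replace missing values in a Series
print(series.isnull())    # boolean mask marking the NaN position
print(series.dropna())    # Series with the NaN removed
print(series.fillna(0))   # Series with the NaN replaced by 0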
Various Functionalities
Here are different functionalities of Pandas Series:
import pandas as pd

# Create a Series with a labeled index (values and labels are illustrative)
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print("Original Series:")
print(s)
print()
# Slicing
print("Slicing from index 'b' to 'd':")
print(s['b':'d'])
print()
# Arithmetic operations
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([6, 7, 8, 9, 10])
print("Addition:")
print(s1 + s2)
print("Subtraction:")
print(s1 - s2)
print("Multiplication:")
print(s1 * s2)
print("Division:")
print(s1 / s2)
print()
# Aggregation functions
print("Mean:", s.mean())
print("Sum:", s.sum())
print("Min:", s.min())
print("Max:", s.max())
print()
# Boolean indexing
print("Boolean indexing for elements greater than 2:")
print(s[s > 2])
print()
# Element-wise functions
print("Applying square function element-wise:")
print(s.apply(lambda x: x ** 2))
print()
# Sorting
print("Sorted Series by index:")
print(s.sort_index())
print("Sorted Series by values:")
print(s.sort_values())
print()
# Unique values
print("Unique values in Series:", s.unique())
print()
# Value counts
print("Value counts:")
print(s.value_counts())
Pandas DataFrame:
A pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different
types. It is similar to a spreadsheet or SQL table, or a dictionary of Series objects. DataFrames are
incredibly versatile and are one of the primary data structures used in pandas for data manipulation
and analysis. Here's a detailed explanation of pandas DataFrame:
Key Characteristics:
• Two-dimensional, size-mutable structure with labeled rows (the index) and labeled columns.
• Columns can hold different data types (heterogeneous data).
• Supports both label-based and position-based indexing, with automatic data alignment on the index.
• Built on top of NumPy, so it supports fast vectorized operations and handles missing data (NaN).
A DataFrame can be created in several ways, for example from a list of rows or from a dictionary of
columns, as shown below.
import pandas as pd
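# The DataFrame creation itself is missing here; a sketch consistent with the
# description below (the row values are illustrative)
data_list = [[1, 'Alice', 25],
             [2, 'Bob', 30],
             [3, 'Charlie', 35]]
df_from_list = pd.DataFrame(data_list, columns=['ID', 'Name', 'Age'])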
print(df_from_list)
• We can specify column names explicitly using the columns parameter when creating the
DataFrame.
• Pandas automatically assigns numeric indices (0, 1, 2, ...) to the rows unless specified
otherwise.
• This DataFrame has three columns: 'ID', 'Name', and 'Age'.
import pandas as pd
# Creating a dictionary
data_dict = {
'ID': [1, 2, 3, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45]
}
# Create the DataFrame from the dictionary; keys become column names
df_from_dict = pd.DataFrame(data_dict)
print(df_from_dict)
# Assuming you have a CSV file named 'data.csv' with columns 'col1', 'col2', 'col3'
# Adjust the file name and column names according to your CSV file
file_path = 'data.csv'
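The read itself would look roughly like this (assuming 'data.csv' exists in the working directory):

# Read the whole CSV file into a Pandas DataFrame and preview the first rows
df = pd.read_csv(file_path)
print(df.head())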
Below is an example of how you can read a specific number of rows and select particular columns
from a CSV file using Pandas:
import pandas as pd
# Assuming you want to read first 10 rows and only columns 'col1' and 'col2'
# Adjust the file name, column names, and nrows according to your requirement
file_path = 'data.csv'
selected_columns = ['col1', 'col2']
nrows = 10
# Read the specific columns and rows from the CSV file into a Pandas DataFrame
df = pd.read_csv(file_path, usecols=selected_columns, nrows=nrows)
Reading a large CSV file in smaller, manageable chunks is a common practice, especially when
dealing with very large datasets. Here's an example of how you can read a CSV file in smaller
chunks using Pandas:
import pandas as pd
# Assuming you want to read the CSV file in chunks of 1000 rows at a time
# Adjust the file name and chunk size according to your requirement
file_path = 'data.csv'
chunk_size = 1000
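# The call and loop described below would look roughly like this (the loop
# body is illustrative; you would normally process or aggregate each chunk)
chunk_iter = pd.read_csv(file_path, chunksize=chunk_size)
for chunk_number, chunk in enumerate(chunk_iter):
    print(f"Chunk {chunk_number}: {len(chunk)} rows")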
When pd.read_csv() is called with the chunksize parameter, it returns an iterator of DataFrame
chunks, each representing a portion of the CSV file. Each chunk is a DataFrame containing a subset
of the data, with at most the number of rows specified by chunksize.
Therefore, in the loop for chunk_number, chunk in enumerate(chunk_iter):, the variable chunk
represents each DataFrame chunk read from the CSV file.
The insert() method adds a new column at a specified position in an existing DataFrame:
import pandas as pd

# Create a DataFrame
data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Insert a new column named 'C' at index 1 with values [10, 20, 30]
df.insert(loc=1, column='C', value=[10, 20, 30])
print("\nDataFrame after inserting column 'C':")
print(df)
Output:
Original DataFrame:
   A  B
0  1  4
1  2  5
2  3  6

DataFrame after inserting column 'C':
   A   C  B
0  1  10  4
1  2  20  5
2  3  30  6
Using pop() is useful when you need to remove a column and use its values elsewhere in your
code. Below are examples of how to use pop() to remove a column from a DataFrame in Pandas:
import pandas as pd
# Create a DataFrame
data = {'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]}
df = pd.DataFrame(data)
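# The pop() call itself is missing here; a sketch of the first example
# (popping column 'B' is illustrative)
popped_column = df.pop('B')
print("Popped column:")
print(popped_column)
print("DataFrame after pop():")
print(df)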
# Create a DataFrame
data = {'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]}
df = pd.DataFrame(data)
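# A sketch of a second example: remove a column and reuse its values
# elsewhere (the new column name 'A_doubled' is illustrative)
values = df.pop('A')
df['A_doubled'] = values * 2
print(df)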
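dropna() removes rows or columns that contain missing values. The code that produces the output
below is not shown here; a sketch consistent with that output (using np.nan from NumPy for the
missing entries) would be:

import pandas as pd
import numpy as np

# DataFrame with missing values (values match the output shown below)
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [np.nan, 6, 7, 8],
                   'C': [9, 10, 11, 12]})
print("Original DataFrame:")
print(df)

# Drop rows that contain any missing value (keeps only rows 1 and 3 here)
df_dropped_rows = df.dropna()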
Output:
Original DataFrame:
     A    B   C
0  1.0  NaN   9
1  2.0  6.0  10
2  NaN  7.0  11
3  4.0  8.0  12
You can also use dropna() to remove columns with missing values by specifying axis=1:
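# A sketch of the call that produces the output below (df as created above)
df_dropped_cols = df.dropna(axis=1)
print("DataFrame after dropping columns with missing values:")
print(df_dropped_cols)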
Output:
DataFrame after dropping columns with missing values:
    C
0   9
1  10
2  11
3  12
This removes every column that contains at least one missing value, leaving a DataFrame with only
complete columns.
The thresh parameter in the dropna() method allows you to specify a threshold for the number of
non-null values required to keep a row or column. Rows or columns with fewer non-null values
than the specified threshold will be dropped. Here's an example of how to use dropna() with the
thresh parameter:
import pandas as pd
import numpy as np
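# The DataFrame and the dropna(thresh=...) call are missing here; a sketch
# whose values match the "Original DataFrame" shown in the output further below
df = pd.DataFrame({'A': [1, np.nan, 3],
                   'B': [np.nan, 5, np.nan],
                   'C': [7, 8, np.nan]})

# Keep only rows with at least 2 non-null values (row 2 has only one and is dropped)
df_cleaned = df.dropna(thresh=2)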
# Display the DataFrame after dropping rows with fewer than 2 non-null values
print("\nDataFrame after dropping rows with fewer than 2 non-null values:")
print(df_cleaned)
The how parameter of dropna() controls when a row or column is dropped:
• 'any': Drops rows or columns if any NaNs are present in them (default behavior).
• 'all': Drops rows or columns only if all values are NaNs (see the sketch below).
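# A sketch of the how='all' call (the original call is not shown; df is
# assumed to be the DataFrame with missing values used above)
df_drop_all = df.dropna(how='all')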
# Display the DataFrame after dropping rows with all missing values
print("\nDataFrame after dropping rows with all missing values:")
print(df_drop_all)
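# The corresponding call is missing here; drop columns (axis=1) that contain
# any missing value, using the same df as above
df_drop_any = df.dropna(axis=1, how='any')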
# Display the DataFrame after dropping columns with any missing values
print("\nDataFrame after dropping columns with any missing values:")
print(df_drop_any)
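# The corresponding call is missing here; fill each column's missing values
# with that column's mean (consistent with the output shown below)
df_filled_mean = df.fillna(df.mean())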
# Display the DataFrame after filling missing values with the mean of each column
print("\nDataFrame after filling missing values with the mean of each column:")
print(df_filled_mean)
Output:
Original DataFrame:
     A    B    C
0  1.0  NaN  7.0
1  NaN  5.0  8.0
2  3.0  NaN  NaN
DataFrame after filling missing values with the mean of each column:
     A    B    C
0  1.0  5.0  7.0
1  2.0  5.0  8.0
2  3.0  5.0  7.5
When using fillna() with the axis=1 parameter and a fill method such as forward fill, missing values
are filled horizontally across each row (from one column to the next) instead of vertically down each
column. Here's an example demonstrating how to use fillna() with axis=1:
import pandas as pd
import numpy as np
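# The example code itself is missing here; a sketch that matches the
# "Original DataFrame" output shown below, forward-filling along each row
df = pd.DataFrame({'A': [1, np.nan, 3],
                   'B': [np.nan, 5, np.nan],
                   'C': [7, 8, np.nan]})
print("Original DataFrame:")
print(df)

# Forward fill across columns within each row; newer pandas versions prefer
# df.ffill(axis=1) over fillna(method='ffill', axis=1)
df_filled_rows = df.ffill(axis=1)
print("\nDataFrame after forward filling along each row:")
print(df_filled_rows)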
Output:
Original DataFrame:
     A    B    C
0  1.0  NaN  7.0
1  NaN  5.0  8.0
2  3.0  NaN  NaN
You can also use other strategies, such as backward fill (bfill), interpolation, or a constant fill value,
when filling missing values horizontally with fillna() and axis=1, depending on your specific
requirements and data analysis needs.
Attributes:
• shape: Returns a tuple representing the dimensionality of the DataFrame (rows, columns).
• index: Returns the row index labels.
• columns: Returns the column names.
• dtypes: Returns the data types of each column.
Methods:
• head(), tail(): View the first or last few rows of the DataFrame.
• info(): Provides a concise summary of the DataFrame including index dtype and column
dtypes, non-null values, and memory usage.
• describe(): Generates descriptive statistics for numerical columns.
• loc[], iloc[]: Accessing a group of rows and columns by label(s) or integer position(s).
• drop(): Drop specified labels from rows or columns.
• fillna(), dropna(): Handle missing data by filling or dropping NaN values.
• groupby(), pivot_table(): Perform grouping and aggregation operations.
• merge(), join(): Combine DataFrames through database-style join operations.
Example:
import pandas as pd
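# The DataFrame creation is missing here; an illustrative one with a 'Name'
# column (accessed below) and numeric columns for describe()
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Score': [88.5, 92.0, 79.5]})
print(df.shape)    # (rows, columns) attribute
print(df.head())   # first few rows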
# Accessing elements
print(df['Name'])
print(df.iloc[0])
print(df)
# Descriptive statistics
print(df.describe())
Pandas DataFrame is a powerful tool for data analysis and manipulation, widely used in data
science, machine learning, and statistical analysis. Its flexibility and rich set of functionalities
make it an essential component in the Python data ecosystem.