Pandas
What is Pandas
Pandas is an open source Python library for data analysis. It gives Python the ability to work with
spreadsheet-like data, enabling fast data loading, manipulation, alignment, merging, and more.
To give Python these enhanced features, Pandas introduces two new data types to Python: Series and
DataFrame.
The DataFrame will represent your entire spreadsheet or rectangular data, whereas the Series is a single
column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series.
Datasets and DataFrames
A dataset is a more general term that refers to a collection of data. It can encompass data in various formats
and structures, including but not limited to tabular data.
A pandas DataFrame, on the other hand, is a specific data structure provided by the Pandas library for Python.
It is a two-dimensional, size-mutable, and highly flexible data structure designed to work with structured data.
DataFrames are typically used to represent tabular data, where data is organized into rows and columns.
Pandas Series
A Pandas Series is a one-dimensional labeled array-like data structure provided by the Pandas library in
Python. It is a fundamental building block of Pandas and is designed to work with various data types, similar
to a column in a spreadsheet.
So in simple terms, any collection of data is called a dataset (e.g. tabular, image, or audio datasets),
while data loaded in a spreadsheet-like, tabular manner is called a DataFrame. Pandas Series are the building
blocks of DataFrames in pandas.
Creating a Series and a DataFrame
import pandas as pd

# Create a Series from a list of values (example values for illustration)
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)

# Data in separate lists, keyed by column name
data = {
    'Name': ['Alice', 'Bob', 'Cara'],
    'Age': [25, 30, 28]
}
df = pd.DataFrame(data)
print(df)
CSV, TSV FILES
CSV (Comma-Separated Values) files are a common and widely used file format for storing tabular data. CSV
files are simple text files that represent data in a structured way, with rows and columns separated by commas
(or other delimiters). Each line in a CSV file typically represents a record or row of data, and within each line,
the values for different columns are separated by commas.
TSV (Tab-Separated Values) files are like CSV files, except they use tabs instead of commas to separate values.
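For example, a quick sketch of loading both formats with pandas (the file names are made up for illustration):

import pandas as pd

# Read a comma-separated file (the default delimiter is a comma)
df_csv = pd.read_csv('data.csv')

# Read a tab-separated file by specifying the tab delimiter
df_tsv = pd.read_csv('data.tsv', sep='\t')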
Series methods
There are several methods that can be called on a Series, such as:
dtype or dtypes: Returns the data type of the elements in the Series. On a Series both return the same single
data type (a Series with mixed types shows up as object); it is on a DataFrame that .dtypes returns one data type per column.
shape: Returns a tuple representing the dimensions of the Series. Since a Series is one-dimensional, it
returns a tuple with one element, which is the number of rows.
series.shape
size: Returns the number of elements in the Series.
series.size
value_counts: Counts how many times each unique value occurs in the Series.
series.value_counts()
values: Returns the data in the Series as a NumPy array. This allows you to access and manipulate the
underlying data.
series.values
describe: Generates descriptive statistics of the Series, such as count, mean, standard deviation, minimum,
and maximum.
series.describe()
unique: Returns the unique values in the Series as an array.
series.unique()
nunique: Returns the number of unique (non-null) values in the Series.
series.nunique()
corr: Calculates the correlation between two Series. It measures how closely the values of two Series are
linearly related. For example, you can use this method to check how two variables change together.
cov: Calculates the covariance between two Series. Covariance measures how two variables change
together. It's a measure of the joint variability of two random variables.
drop_duplicates: Returns a new Series without duplicate values. It's useful for data cleaning when you want to
remove repeated values from your Series.
replace: Replaces specified values in the Series with other values. It's useful for data cleaning and
transformation.
sample: Returns a random sample of values from the Series. You can specify the number of random values to
retrieve.
sort_values: Sorts the values in the Series in either ascending or descending order, allowing you to order your
data as needed.
max: Returns the maximum value in the Series, which is the largest value among all the elements.
mean: Calculates the arithmetic mean (average) of the Series. It's the sum of all values divided by the number
of values.
median: Calculates the median value of the Series, which is the middle value when all values are sorted.
mode: Calculates the mode(s) of the Series, which is the most frequently occurring value(s).
quantile: Calculates the value at a given quantile, allowing you to find specific percentiles or quartiles of your
data.
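Pulling several of these methods together, here is a minimal sketch on a small invented Series:

import pandas as pd

s = pd.Series([3, 1, 2, 3, 3, None])

print(s.dtype)           # float64: None becomes NaN, so the integers are upcast
print(s.shape)           # (6,)
print(s.size)            # 6
print(s.value_counts())  # frequency of each value (NaN excluded by default)
print(s.unique())        # array of unique values, including NaN
print(s.nunique())       # 3 -- number of unique non-null values
print(s.describe())      # count, mean, std, min, quartiles, max
print(s.sort_values())   # values in ascending order
print(s.quantile(0.5))   # the 50th percentile, i.e. the median

other = pd.Series([6, 5, 4, 3, 2, 1])
print(s.corr(other))     # linear correlation; the pair with a NaN is ignored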
Boolean Subsetting: Series
Boolean subsetting in Pandas Series is a powerful technique for filtering and selecting data based on boolean
conditions. You can create boolean masks, which are Series of True and False values, and use them to select
specific rows that meet certain criteria.
Assuming you have a Pandas Series called data_series, here's how you can use boolean subsetting:
You can create a boolean mask by applying a condition to the Series. For example, let's say you want to
create a mask for values greater than a certain threshold:
mask = data_series > threshold
Once you have the boolean mask, you can use it to filter the Series. This will return a new Series containing
only the elements that satisfy the condition:
filtered_series = data_series[mask]
Multiple Conditions:
You can also combine multiple conditions using logical operators like & (and) and | (or) within the mask. For
example, to filter for values greater than one threshold and less than another:
mask = (data_series > lower_threshold) & (data_series < upper_threshold)
filtered_series = data_series[mask]
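Here is the whole pattern as a runnable sketch, with invented data and thresholds:

import pandas as pd

data_series = pd.Series([5, 12, 7, 20, 3])

# Single condition
mask = data_series > 10
filtered_series = data_series[mask]
print(filtered_series)   # keeps 12 and 20

# Multiple conditions (note the required parentheses around each condition)
mask = (data_series > 5) & (data_series < 15)
print(data_series[mask]) # keeps 12 and 7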
Exporting and importing data: pickle, CSV, Excel and TSV
Export to Pickle:
Pickle is a Python-specific binary format for serializing and deserializing Python objects. You can use the
to_pickle() method to export a DataFrame to a pickle file:
Export to CSV:
You can use the to_csv() method to export a DataFrame to a CSV (Comma-Separated Values) file:
Export to Excel:
You can use the to_excel() method to export a DataFrame to an Excel file. You'll need to install the openpyxl
library if it's not already installed:
Consequently, to import, we use the corresponding read functions: pd.read_pickle(), pd.read_csv(), and pd.read_excel().
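As a sketch, using a small invented DataFrame and made-up file names:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Export
df.to_pickle('data.pkl')
df.to_csv('data.csv', index=False)
df.to_csv('data.tsv', sep='\t', index=False)  # TSV: same method, tab delimiter
df.to_excel('data.xlsx', index=False)         # requires the openpyxl package

# Import back with the matching read functions
df_from_pickle = pd.read_pickle('data.pkl')
df_from_csv = pd.read_csv('data.csv')
df_from_tsv = pd.read_csv('data.tsv', sep='\t')
df_from_excel = pd.read_excel('data.xlsx')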
Variable types
In pandas, a "variable type" generally refers to the data type or dtype of a column within a DataFrame or a
Series. In pandas we have the following data types:
Numeric Types:
int: Integer values (e.g. the int64 dtype).
float: Floating-point values (e.g. float64); pandas also uses float columns to hold missing numeric data as NaN.
Categorical Types:
category: A special data type for categorical data, which can have a limited number of unique values (like
labels or categories). This is memory-efficient compared to storing text labels as strings.
Other data types
DateTime Types:
datetime64: Represents dates and times, and underpins pandas' time-series functionality.
Text/String Types:
object: The most general data type, which can hold any Python object, including strings. However, it's not the
most memory-efficient choice for large datasets with consistent data types.
Boolean Type:
bool: Holds True/False values.
Categorical vs non-categorical variables
Categorical Variables:
Definition: Categorical variables, also known as qualitative variables, represent categories or groups and can
take on a limited, discrete set of values or labels.
Examples: Gender (categories: male, female), color (categories: red, green, blue), car types (categories:
sedan, SUV, truck).
Non-Categorical Variables (Quantitative Variables):
Definition: Non-categorical variables, also known as quantitative variables, represent data that consists of
numerical values with meaningful numerical order and magnitude.
Examples: Age, income, height, temperature, and test scores are all examples of non-categorical variables.
These variables can be measured and operated on with mathematical operations.
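Since pandas stores categorical variables with the category dtype described above, here is a brief sketch of
converting a text column (the column name and values are invented):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# Strings are stored as object by default; convert to the category dtype
df['color'] = df['color'].astype('category')

print(df['color'].dtype)           # category
print(df['color'].cat.categories)  # Index(['blue', 'green', 'red'], dtype='object')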
Groupby
In pandas, the groupby operation is a powerful method used for grouping and aggregating data based on one
or more criteria.
It is a fundamental tool for data analysis and is often used in combination with other pandas functions to
perform various data manipulations.
Grouping Data:
The groupby operation is applied to a DataFrame or Series, and it allows you to group rows of data based on
the values in one or more columns.
You specify the column(s) by which you want to group the data.
Aggregation
After grouping, you can apply aggregation functions to the grouped data. Aggregation functions summarize
data within each group.
Common aggregation functions include sum, mean, median, count, min, max, and custom functions.
# Sample data: a grouping column and a numeric column (values invented)
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'],
                   'Value': [10, 20, 30, 40]})

# Group by the 'Category' column and calculate the mean for each group
grouped = df.groupby('Category')
result = grouped['Value'].mean()
print(result)
We use groupby to group the data by the 'Category' column.
We apply the mean aggregation function to calculate the mean value for each group.
The result shows the mean 'Value' for each of the categories 'A' and 'B'.
groupby is versatile and can be used for more complex operations involving multiple grouping columns,
custom aggregation functions, and chaining with other pandas operations. It's commonly used for tasks like
data summarization, pivot tables, and exploring relationships within datasets.
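As a sketch of that versatility, here is grouping by two invented columns and applying several aggregations
at once with agg():

import pandas as pd

df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'],
                   'Region': ['East', 'West', 'East', 'West'],
                   'Value': [10, 20, 30, 40]})

# Two grouping columns, several aggregation functions at once
result = df.groupby(['Category', 'Region'])['Value'].agg(['mean', 'sum', 'max'])
print(result)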
Operations between pandas Dataframe columns
In pandas, you can perform various operations between columns of a DataFrame to create new columns,
modify existing ones, or derive insights from your data. Here are some common operations between columns
in pandas:
Arithmetic Operations: +, -, *, /
Comparison Operations:
Custom Functions:
You can define custom functions using Python's lambda functions or regular functions and apply them to
columns using .apply().
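A brief sketch of these operations on an invented DataFrame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Arithmetic between columns yields a new Series
df['total'] = df['A'] + df['B']
df['ratio'] = df['B'] / df['A']

# Comparison operations yield boolean Series
df['B_is_large'] = df['B'] > 15

# A custom function applied element-wise with .apply()
df['A_squared'] = df['A'].apply(lambda x: x ** 2)
print(df)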
Detecting Missing Values
Detecting missing values in a pandas DataFrame is an essential step in data cleaning and analysis. Pandas
provides several methods to identify and handle missing values. Here are some common techniques for
detecting missing values:
isna() / isnull(): These methods return a DataFrame of the same shape as the original, with True for missing
values and False for non-missing values.
info():
The info() method provides a summary of the DataFrame, including the count of non-null values in each
column.
describe():
The describe() method can be used to get statistics for numeric columns, and it implicitly shows the count of
non-null values.
● Groupby: Pandas' groupby function is used to group and segment data in a DataFrame based on one or
more columns, allowing you to perform various aggregation and analysis operations on those grouped
data subsets.
● df.groupby("column name") ⇒ returns a groupby object to aggregate and perform analysis with
● Operations On/With Columns: e.g. df["column A"] + df["column B"] ⇒ resulting series, or A * df["column
C"] ⇒ resulting series
● Manipulating Data using Apply: df.apply(function_name) ⇒ resulting series
● Finding missing values: df["column name"].isna() ⇒ returns boolean series
Detecting missing values in columns
To detect missing values in columns using pandas, you can use the isna() or isnull() method on the
DataFrame or Series.
These methods return a DataFrame or Series of boolean values where True represents a missing value (NaN)
and False represents a non-missing value.
You can then use aggregation functions like sum() to count the missing values in each column.
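A runnable sketch on an invented DataFrame containing NaN values:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, np.nan, 6]})

print(df.isna())        # boolean DataFrame: True marks a missing value
print(df.isna().sum())  # A: 1 missing, B: 2 missing
df.info()               # also reports the non-null count per column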
Handling Missing Values
dropna(): This method removes rows or columns containing missing values. You can specify the axis
parameter to drop either rows (axis=0) or columns (axis=1) with missing values.
fillna(): This method allows you to fill missing values with a specified value or using various filling strategies
like forward fill, backward fill, or interpolation.
Fill in with default values
df.fillna(value=0)
This method fills missing values in the DataFrame df with a specific value, which is specified as the argument.
In the example provided, missing values are replaced with the value 0. This method is useful when you want
to replace missing values with a constant value that is meaningful in your context.
Forward Fill
df.fillna(method='ffill')
This method performs forward fill for missing values. It means that missing values are filled with the most
recent non-missing value that appears before them in the column. This method is suitable when you want to
propagate the last observed value forward in the dataset.
Backward Fill
df.fillna(method='bfill')
This method performs backward fill for missing values. It fills missing values with the next non-missing value
that appears after them in the column. It's the opposite of forward fill and is useful when you want to
propagate the next observed value backward in the dataset.
Interpolation
df.interpolate(method='linear', axis=0)
This method performs linear interpolation to fill missing values. Linear interpolation estimates missing values
based on the linear relationship between neighboring data points. The axis=0 argument indicates that
interpolation should be done along columns. This method is suitable when you have time-series or sequential
data and want to estimate missing values based on the trend of the data.
In general
Use fillna(value=0) when you want to replace missing values with a specific constant value.
Use fillna(method='ffill') when you want to forward fill missing values based on the previous non-missing
values.
Use fillna(method='bfill') when you want to backward fill missing values based on the next non-missing values.
Use interpolate(method='linear', axis=0) when you want to estimate missing values based on the linear
relationship between neighboring data points, typically for time-series or sequential data.
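To see the four strategies side by side, a sketch on an invented DataFrame (note that newer pandas versions
also provide df.ffill() and df.bfill() as direct equivalents of the method= form):

import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, np.nan, np.nan, 4.0, np.nan]})

print(df.fillna(value=0))                       # NaN -> 0
print(df.fillna(method='ffill'))                # carry the last seen value forward
print(df.fillna(method='bfill'))                # pull the next seen value backward
print(df.interpolate(method='linear', axis=0))  # 1, 2, 3, 4, then 4 carried forward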