Pandas

Pandas is an open-source Python library designed for data analysis, introducing data structures like Series and DataFrame for handling spreadsheet-like data. It provides functionalities for loading, manipulating, and analyzing datasets, including operations for handling missing values, grouping data, and performing aggregate calculations. Key features include support for various file formats (CSV, TSV, Excel), boolean subsetting, and custom functions for data manipulation.


Pandas

What is Pandas

What are Datasets: CSV vs TSV

Loading your first dataset with pandas

Columns, Rows and Cells

Selecting Columns, Rows and Combinations of Rows and Columns

Performing Aggregate Calculations


What is Pandas

Pandas is an open-source Python library for data analysis. It gives Python the ability to work with
spreadsheet-like data, enabling fast data loading, manipulation, alignment, merging, and more.

To provide these enhanced features, Pandas introduces two new data types: Series and DataFrame.

The DataFrame represents your entire spreadsheet or rectangular data, whereas a Series is a single
column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series.
Datasets and DataFrames

A dataset is a more general term that refers to a collection of data. It can encompass data in various formats
and structures, including but not limited to tabular data.

A pandas DataFrame, on the other hand, is a specific data structure provided by the Pandas library for Python.

It is a two-dimensional, size-mutable, and highly flexible data structure designed to work with structured data.

DataFrames are typically used to represent tabular data, where data is organized into rows and columns.
Pandas series

A Pandas Series is a one-dimensional, labeled, array-like data structure provided by the Pandas library in
Python. It is a fundamental building block of Pandas and is designed to work with various data types, similar
to a column in a spreadsheet.

So in simple terms, any collection of data is called a dataset (e.g., tabular, image, or audio datasets), while
data loaded in a tabular, spreadsheet-like structure is called a DataFrame. Pandas Series are the building
blocks of DataFrames in pandas.
Creating a Series and a DataFrame

import pandas as pd

# Create a Pandas Series from a Python list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)

# Print the Series
print(series)
# Data in a dictionary of lists
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [28, 34, 22, 45],
    'Location': ['New York', 'Los Angeles', 'Chicago', 'San Francisco']
}

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

# Print the DataFrame
print(df)
CSV and TSV Files

CSV (Comma-Separated Values) files are a common and widely used file format for storing tabular data. CSV
files are simple text files that represent data in a structured way, with rows and columns separated by commas
(or other delimiters). Each line in a CSV file typically represents a record or row of data, and within each line,
the values for different columns are separated by commas

TSV (Tab-Separated Values) files are like CSV files but use tabs instead of commas to separate values.
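As a minimal sketch (the filenames data.csv and data.tsv are hypothetical), both formats can be loaded with pd.read_csv; a TSV just needs the tab separator specified:

import pandas as pd

# Read a comma-separated file (hypothetical filename)
df_csv = pd.read_csv('data.csv')

# Read a tab-separated file by passing the tab delimiter (hypothetical filename)
df_tsv = pd.read_csv('data.tsv', sep='\t')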
Series methods

Selecting a single column from a DataFrame returns a Series.

There are several methods and attributes that can be used on a Series, such as:
Series Methods

dtype or dtypes: Returns the data type of the elements in the Series. On a Series, .dtypes is simply an alias of
.dtype and returns the same single data type; it is on a DataFrame that .dtypes returns the data type of each column.

series.dtype # Returns the data type of the Series

series.dtypes # Alias of .dtype on a Series; per-column dtypes on a DataFrame


Series methods

shape: Returns a tuple representing the dimensions of the Series. Since a Series is one-dimensional, it
returns a tuple with one element, which is the number of rows.

series.shape

size: Returns the number of elements in the Series.

series.size

value_counts: Returns a Series containing counts of unique values in the Series.

series.value_counts()
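A quick sketch of these attributes and methods on a small, made-up Series:

import pandas as pd

series = pd.Series([10, 20, 20, 30])

print(series.shape)          # (4,) - one element because a Series is one-dimensional
print(series.size)           # 4 - number of elements
print(series.value_counts()) # counts per unique value: 20 appears twice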
Series methods

values: Returns the data in the Series as a NumPy array. This allows you to access and manipulate the
underlying data.

series.values

describe: Generates descriptive statistics of the Series, such as count, mean, standard deviation, minimum,
and maximum.

series.describe()

unique: Returns an array of the unique values in the Series, while nunique returns the number of unique values.

series.unique(), series.nunique()
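A short sketch of these methods, again on a made-up Series:

import pandas as pd

series = pd.Series([1, 2, 2, 3])

print(series.values)     # underlying NumPy array: [1 2 2 3]
print(series.describe()) # count, mean, std, min, quartiles, max
print(series.unique())   # array of unique values: [1 2 3]
print(series.nunique())  # number of unique values: 3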
Series methods

corr: Calculates the correlation between two Series. It measures how closely the values of two Series are
linearly related. For example, you can use this method to check how two variables change together.

cov: Calculates the covariance between two Series. Covariance measures how two variables change
together. It's a measure of the joint variability of two random variables.

describe: Generates summary statistics for the Series. It provides information like count, mean, standard
deviation, minimum, maximum, and quartiles, giving you a quick overview of the data's distribution.

drop_duplicates: Returns a new Series without duplicate values. It's useful for data cleaning when you want to
remove repeated values from your Series.
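A sketch of corr, cov, and drop_duplicates on two made-up Series (here s2 is exactly 2 * s1, so the correlation is 1.0):

import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([2, 4, 6, 8])

print(s1.corr(s2))  # 1.0 - the Series are perfectly linearly related
print(s1.cov(s2))   # covariance - joint variability of the two Series

print(pd.Series([1, 1, 2]).drop_duplicates())  # duplicate values removed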
Series Methods

replace: Replaces specified values in the Series with new values. It's useful for data cleaning and
transformation.

sample: Returns a random sample of values from the Series. You can specify the number of random values to
retrieve.

sort_values: Sorts the values in the Series in either ascending or descending order, allowing you to order your
data as needed.
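A sketch of these three methods on a made-up Series:

import pandas as pd

series = pd.Series([1, 2, 2, 3])

print(series.replace(2, 0))                 # every 2 replaced with 0
print(series.sample(n=2))                   # 2 randomly selected values
print(series.sort_values(ascending=False))  # sorted from largest to smallest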
Series methods

max: Returns the maximum value in the Series, which is the largest value among all the elements.

mean: Calculates the arithmetic mean (average) of the Series. It's the sum of all values divided by the number
of values.

median: Calculates the median value of the Series, which is the middle value when all values are sorted.

mode: Calculates the mode(s) of the Series, which is the most frequently occurring value(s).

quantile: Calculates the value at a given quantile, allowing you to find specific percentiles or quartiles of your
data.
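A sketch of these summary statistics on a made-up Series:

import pandas as pd

series = pd.Series([1, 2, 2, 3, 10])

print(series.max())          # 10
print(series.mean())         # 3.6
print(series.median())       # 2.0 - middle value when sorted
print(series.mode())         # 2 - the most frequent value
print(series.quantile(0.75)) # value at the 75th percentile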
Boolean Subsetting: Series

Boolean subsetting in Pandas Series is a powerful technique for filtering and selecting data based on boolean
conditions. You can create boolean masks, which are Series of True and False values, and use them to select
specific rows that meet certain criteria. Here's how you can perform boolean subsetting with a Pandas Series:

Assuming you have a Pandas Series called data_series, here's how you can use boolean subsetting:
Boolean Subsetting

Creating a Boolean Mask:

You can create a boolean mask by applying a condition to the Series. For example, let's say you want to
create a mask for values greater than a certain threshold:

mask = data_series > threshold


Applying the Boolean Mask:

Once you have the boolean mask, you can use it to filter the Series. This will return a new Series containing
only the elements that satisfy the condition:

filtered_series = data_series[mask]
Multiple Conditions:

You can also combine multiple conditions using the logical operators & (and) and | (or) within the mask; each
condition must be wrapped in parentheses because & and | bind more tightly than the comparison operators. For
example, to filter for values greater than one threshold and less than another:

mask = (data_series > threshold1) & (data_series < threshold2)

filtered_series = data_series[mask]
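Putting the pieces together, a minimal end-to-end sketch (the values and thresholds are arbitrary):

import pandas as pd

data_series = pd.Series([5, 12, 7, 20, 3])

# Single condition: values greater than 10
mask = data_series > 10
print(data_series[mask])  # 12 and 20

# Multiple conditions: values strictly between 5 and 15
mask = (data_series > 5) & (data_series < 15)
print(data_series[mask])  # 12 and 7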
Exporting and importing data: pickle, CSV, Excel, TSV

Export to Pickle:

Pickle is a Python-specific binary format for serializing and deserializing Python objects. You can use the
to_pickle() method to export a DataFrame to a pickle file:

Export to CSV:

You can use the to_csv() method to export a DataFrame to a CSV (Comma-Separated Values) file:

Export to Excel:

You can use the to_excel() method to export a DataFrame to an Excel file. You'll need to install the openpyxl
library if it's not already installed:
Conversely, to import, we use

pd.read_excel(), pd.read_csv(), and pd.read_pickle().
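A round-trip sketch (the filenames are hypothetical, and to_excel/read_excel assume the openpyxl library is installed):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Export (hypothetical filenames)
df.to_csv('data.csv', index=False)
df.to_csv('data.tsv', sep='\t', index=False)  # a TSV is just a CSV with tab separators
df.to_pickle('data.pkl')
df.to_excel('data.xlsx', index=False)  # requires openpyxl

# Import back into DataFrames
df_csv = pd.read_csv('data.csv')
df_tsv = pd.read_csv('data.tsv', sep='\t')
df_pkl = pd.read_pickle('data.pkl')
df_xlsx = pd.read_excel('data.xlsx')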


Data / Variable Types

In pandas, a "variable type" generally refers to the data type (dtype) of a column within a DataFrame or a
Series. Pandas provides the following data types:

Numeric Types:

int: Integer data types for whole numbers.

float: Floating-point data types for decimal numbers.

Categorical Types:

category: A special data type for categorical data, which can have a limited number of unique values (like
labels or categories). This is memory-efficient compared to storing text labels as strings.
Other datatypes

DateTime Types:

datetime64: Represents dates and times.

timedelta64: Represents differences or durations between dates and times.

Text/String Types:

object: The most general data type, which can hold any Python object, including strings. However, it's not the
most memory-efficient choice for large datasets with consistent data types.

Boolean Type:

bool: Represents boolean values (True or False).
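A sketch inspecting column dtypes and converting a repetitive text column to category (the columns are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'age': [28, 34, 22],               # int64
    'height': [1.75, 1.62, 1.80],      # float64
    'color': ['red', 'green', 'red'],  # object
    'active': [True, False, True],     # bool
})

print(df.dtypes)

# Convert the text column to the memory-efficient category dtype
df['color'] = df['color'].astype('category')
print(df['color'].dtype)  # category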


Non-Categorical vs Categorical Variables

Categorical Variables:

Definition: Categorical variables, also known as qualitative variables, represent categories or groups and can
take on a limited, discrete set of values or labels.

Examples: Gender (categories: male, female), color (categories: red, green, blue), car types (categories:
sedan, SUV, truck).
Non-Categorical Variables (Quantitative Variables):

Definition: Non-categorical variables, also known as quantitative variables, represent data that consists of
numerical values with meaningful numerical order and magnitude.

Examples: Age, income, height, temperature, and test scores are all examples of non-categorical variables.
These variables can be measured and operated on with mathematical operations.
Groupby

In pandas, the groupby operation is a powerful method used for grouping and aggregating data based on one
or more criteria.

It is a fundamental tool for data analysis and is often used in combination with other pandas functions to
perform various data manipulations.

Here's an overview of how the groupby operation works:


Column to group by

Grouping Data:

The groupby operation is applied to a DataFrame or Series, and it allows you to group rows of data based on
the values in one or more columns.

You specify the column(s) by which you want to group the data.

For example, you might group rows based on gender.


Performing actions on a groupby object

Aggregation

After grouping, you can apply aggregation functions to the grouped data. Aggregation functions summarize
data within each group.

Common aggregation functions include sum, mean, median, count, min, max, and custom functions.
import pandas as pd

# A small example frame (the values are illustrative)
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'], 'Value': [10, 20, 30, 40]})

# Group by the 'Category' column and calculate the mean for each group
grouped = df.groupby('Category')
result = grouped.mean(numeric_only=True)  # average only the numeric columns
print(result)
We use groupby to group the data by the 'Category' column.

We apply the mean aggregation function to calculate the mean value for each group.

The resulting DataFrame shows the mean values for categories 'A' and 'B'.

groupby is versatile and can be used for more complex operations involving multiple grouping columns,
custom aggregation functions, and chaining with other pandas operations. It's commonly used for tasks like
data summarization, pivot tables, and exploring relationships within datasets.
Operations between pandas Dataframe columns

In pandas, you can perform various operations between columns of a DataFrame to create new columns,
modify existing ones, or derive insights from your data. Here are some common operations between columns
in pandas:

Arithmetic Operations: +, -, *, /

Comparison Operations:

df['IsEqual'] = df['Column1'] == df['Column2']

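A small runnable sketch of both kinds of column operations (the column names are made up):

import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': [1, 5, 3]})

# Arithmetic between columns yields a new Series
df['Sum'] = df['Column1'] + df['Column2']
df['Scaled'] = df['Column1'] * 10

# Comparison between columns yields a boolean Series
df['IsEqual'] = df['Column1'] == df['Column2']
print(df)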

Custom Functions: Using Apply to manipulate pandas data

Custom Functions:

You can define custom functions using Python's lambda functions or regular functions and apply them to
columns using .apply().
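As a sketch (the column and function names are made up), applying a lambda and a regular function to a column:

import pandas as pd

df = pd.DataFrame({'price': [100, 250, 80]})

# A lambda applied element-wise to a single column (a Series)
df['price_with_tax'] = df['price'].apply(lambda x: x * 1.1)

# A regular named function works the same way
def label(price):
    return 'expensive' if price > 150 else 'cheap'

df['label'] = df['price'].apply(label)
print(df)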
Detecting Missing Values

Detecting missing values in a pandas DataFrame is an essential step in data cleaning and analysis. Pandas
provides several methods to identify and handle missing values. Here are some common techniques for
detecting missing values:

Several ways to detect missing values:

isna() and isnull():

These methods return a DataFrame of the same shape as the original, with True for missing values and False
for non-missing values.
info():

The info() method provides a summary of the DataFrame, including the count of non-null values in each
column.

describe():

The describe() method can be used to get statistics for numeric columns, and it implicitly shows the count of
non-null values.
● Groupby: Pandas' groupby function is used to group and segment data in a DataFrame based on one or
more columns, allowing you to perform various aggregation and analysis operations on those grouped
data subsets.
● df.groupby("column name") ⇒ returns a groupby object to aggregate and perform analysis with
● Operations on/with columns: e.g., df["column A"] + df["column B"] ⇒ resulting series, or A * df["column C"] ⇒ resulting series
● Manipulating data using apply: df.apply(function_name) ⇒ resulting series
● Finding missing values: df["column name"].isna() ⇒ returns a boolean series
Detecting missing values in columns

To detect missing values in columns using pandas, you can use the isna() or isnull() method on the
DataFrame or Series.

These methods return a DataFrame or Series of boolean values where True represents a missing value (NaN)
and False represents a non-missing value.

You can then use aggregation functions like sum() to count the missing values in each column.
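A sketch counting missing values per column (np.nan stands in for missing entries):

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, np.nan, 6]})

print(df.isna())        # boolean DataFrame: True marks a missing value
print(df.isna().sum())  # missing values per column: A has 1, B has 2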
Handling Missing Values

Dropping Missing Values:

dropna(): This method removes rows or columns containing missing values. You can specify the axis
parameter to drop either rows (axis=0) or columns (axis=1) with missing values.

df.dropna(axis=0) # Drop rows with missing values

df.dropna(axis=1) # Drop columns with missing values


Filling Missing Values:

fillna(): This method allows you to fill missing values with a specified value or using various filling strategies
like forward fill, backward fill, or interpolation.
Fill in with default values

df.fillna(value=0)

This method fills missing values in the DataFrame df with a specific value, which is specified as the argument.
In the example provided, missing values are replaced with the value 0. This method is useful when you want
to replace missing values with a constant value that is meaningful in your context.
Forward Fill

df.fillna(method='ffill') # newer pandas versions prefer the equivalent df.ffill()

This method performs forward fill for missing values. It means that missing values are filled with the most
recent non-missing value that appears before them in the column. This method is suitable when you want to
propagate the last observed value forward in the dataset.
Backward Fill

df.fillna(method='bfill') # newer pandas versions prefer the equivalent df.bfill()

This method performs backward fill for missing values. It fills missing values with the next non-missing value
that appears after them in the column. It's the opposite of forward fill and is useful when you want to
propagate the next observed value backward in the dataset.
df.interpolate(method='linear', axis=0):

This method performs linear interpolation to fill missing values. Linear interpolation estimates missing values
based on the linear relationship between neighboring data points. The axis=0 argument indicates that
interpolation should be done along columns. This method is suitable when you have time-series or sequential
data and want to estimate missing values based on the trend of the data.
In general

Use fillna(value=0) when you want to replace missing values with a specific constant value.

Use fillna(method='ffill') (or the equivalent df.ffill()) when you want to forward fill missing values based on the
previous non-missing values.

Use fillna(method='bfill') (or the equivalent df.bfill()) when you want to backward fill missing values based on the
next non-missing values.

Use interpolate(method='linear', axis=0) when you want to estimate missing values based on the linear
relationship between neighboring data points, typically for time-series or sequential data.
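A side-by-side sketch of the four strategies on a made-up Series (note that 1, NaN, NaN, 4 interpolates linearly to 1, 2, 3, 4):

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.fillna(0))      # constant fill: 1, 0, 0, 4
print(s.ffill())        # forward fill:  1, 1, 1, 4
print(s.bfill())        # backward fill: 1, 4, 4, 4
print(s.interpolate())  # linear interpolation: 1, 2, 3, 4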
