Pandas
What is Pandas
Pandas is an open source Python library for data analysis. It gives Python the ability to work with
spreadsheet-like data, enabling fast data loading, manipulation, alignment, merging, and more.
To give Python these enhanced features, Pandas introduces two new data types to Python: Series and
DataFrame.
The DataFrame will represent your entire spreadsheet or rectangular data, whereas the Series is a single
column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series.
Datasets and DataFrames
A dataset is a more general term that refers to a collection of data. It can encompass data in various formats
and structures, including but not limited to tabular data.
A pandas DataFrame, on the other hand, is a specific data structure provided by the Pandas library for Python.
It is a two-dimensional, size-mutable, and highly flexible data structure designed to work with structured data.
DataFrames are typically used to represent tabular data, where data is organized into rows and columns.
Pandas Series
A Pandas Series is a one-dimensional labeled array-like data structure provided by the Pandas library in
Python. It is a fundamental building block of Pandas and is designed to work with various data types, similar
to a column in a spreadsheet.
So in simple terms, any collection of data is called a dataset (e.g. tabular, image, or audio datasets),
while data loaded in a spreadsheet-like, tabular manner is called a DataFrame. Pandas Series are the building
blocks of DataFrames in pandas.
Creating a Series and a DataFrame
import pandas as pd

# Create a Series from a list of values (example values for illustration)
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)

# Data in separate lists, keyed by column name
data = {
    'Name': ['Alice', 'Bob', 'Cara'],
    'Age': [25, 30, 28]
}
df = pd.DataFrame(data)
print(df)
CSV, TSV FILES
CSV (Comma-Separated Values) files are a common and widely used file format for storing tabular data. CSV
files are simple text files that represent data in a structured way, with rows and columns separated by commas
(or other delimiters). Each line in a CSV file typically represents a record or row of data, and within each line,
the values for different columns are separated by commas.
TSV (Tab-Separated Values) files are like CSV files, except they use tabs instead of commas to separate values.
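For example, a quick sketch of loading both formats with pandas (the file names are made up for illustration):

import pandas as pd

# Read a comma-separated file (the default delimiter is a comma)
df_csv = pd.read_csv('data.csv')

# Read a tab-separated file by specifying the tab delimiter
df_tsv = pd.read_csv('data.tsv', sep='\t')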
Series methods
There are several methods that can be called on a Series, such as:
dtype or dtypes: Returns the data type of the elements in the Series. On a Series both return the same single
data type (a Series with mixed types shows up as object); it is on a DataFrame that .dtypes returns one data type per column.
shape: Returns a tuple representing the dimensions of the Series. Since a Series is one-dimensional, it
returns a tuple with one element, which is the number of rows.
series.shape
size: Returns the number of elements in the Series.
series.size
value_counts: Counts how many times each unique value occurs in the Series.
series.value_counts()
values: Returns the data in the Series as a NumPy array. This allows you to access and manipulate the
underlying data.
series.values
describe: Generates descriptive statistics of the Series, such as count, mean, standard deviation, minimum,
and maximum.
series.describe()
unique: Returns the unique values in the Series as an array.
series.unique()
nunique: Returns the number of unique (non-null) values in the Series.
series.nunique()
corr: Calculates the correlation between two Series. It measures how closely the values of two Series are
linearly related. For example, you can use this method to check how two variables change together.
cov: Calculates the covariance between two Series. Covariance measures how two variables change
together. It's a measure of the joint variability of two random variables.
drop_duplicates: Returns a new Series without duplicate values. It's useful for data cleaning when you want to
remove repeated values from your Series.
replace: Replaces specified values in the Series with other values. It's useful for data cleaning and
transformation.
sample: Returns a random sample of values from the Series. You can specify the number of random values to
retrieve.
sort_values: Sorts the values in the Series in either ascending or descending order, allowing you to order your
data as needed.
max: Returns the maximum value in the Series, which is the largest value among all the elements.
mean: Calculates the arithmetic mean (average) of the Series. It's the sum of all values divided by the number
of values.
median: Calculates the median value of the Series, which is the middle value when all values are sorted.
mode: Calculates the mode(s) of the Series, which is the most frequently occurring value(s).
quantile: Calculates the value at a given quantile, allowing you to find specific percentiles or quartiles of your
data.
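Pulling several of these methods together, here is a minimal sketch on a small invented Series:

import pandas as pd

s = pd.Series([3, 1, 2, 3, 3, None])

print(s.dtype)           # float64: None becomes NaN, so the integers are upcast
print(s.shape)           # (6,)
print(s.size)            # 6
print(s.value_counts())  # frequency of each value (NaN excluded by default)
print(s.unique())        # array of unique values, including NaN
print(s.nunique())       # 3 -- number of unique non-null values
print(s.describe())      # count, mean, std, min, quartiles, max
print(s.sort_values())   # values in ascending order
print(s.quantile(0.5))   # the 50th percentile, i.e. the median

other = pd.Series([6, 5, 4, 3, 2, 1])
print(s.corr(other))     # linear correlation; the pair with a NaN is ignored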
Boolean Subsetting: Series
Boolean subsetting in Pandas Series is a powerful technique for filtering and selecting data based on boolean
conditions. You can create boolean masks, which are Series of True and False values, and use them to select
specific rows that meet certain criteria.
Assuming you have a Pandas Series called data_series, here's how you can use boolean subsetting:
You can create a boolean mask by applying a condition to the Series. For example, let's say you want to
create a mask for values greater than a certain threshold:
mask = data_series > threshold
Once you have the boolean mask, you can use it to filter the Series. This will return a new Series containing
only the elements that satisfy the condition:
filtered_series = data_series[mask]
Multiple Conditions:
You can also combine multiple conditions using logical operators like & (and) and | (or) within the mask. For
example, to filter for values greater than one threshold and less than another:
mask = (data_series > lower_threshold) & (data_series < upper_threshold)
filtered_series = data_series[mask]
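Here is the whole pattern as a runnable sketch, with invented data and thresholds:

import pandas as pd

data_series = pd.Series([5, 12, 7, 20, 3])

# Single condition
mask = data_series > 10
filtered_series = data_series[mask]
print(filtered_series)   # keeps 12 and 20

# Multiple conditions (note the required parentheses around each condition)
mask = (data_series > 5) & (data_series < 15)
print(data_series[mask]) # keeps 12 and 7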
Exporting and importing data: pickle, CSV, Excel and TSV
Export to Pickle:
Pickle is a Python-specific binary format for serializing and deserializing Python objects. You can use the
to_pickle() method to export a DataFrame to a pickle file:
Export to CSV:
You can use the to_csv() method to export a DataFrame to a CSV (Comma-Separated Values) file:
Export to Excel:
You can use the to_excel() method to export a DataFrame to an Excel file. You'll need to install the openpyxl
library if it's not already installed:
Consequently, to import, we use the corresponding read functions: pd.read_pickle(), pd.read_csv(), and pd.read_excel().
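As a sketch, using a small invented DataFrame and made-up file names:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Export
df.to_pickle('data.pkl')
df.to_csv('data.csv', index=False)
df.to_csv('data.tsv', sep='\t', index=False)  # TSV: same method, tab delimiter
df.to_excel('data.xlsx', index=False)         # requires the openpyxl package

# Import back with the matching read functions
df_from_pickle = pd.read_pickle('data.pkl')
df_from_csv = pd.read_csv('data.csv')
df_from_tsv = pd.read_csv('data.tsv', sep='\t')
df_from_excel = pd.read_excel('data.xlsx')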
Variable types
In pandas, a "variable type" generally refers to the data type or dtype of a column within a DataFrame or a
Series. In pandas we have the following data types:
Numeric Types:
int: Integer values (e.g. the int64 dtype).
float: Floating-point values (e.g. float64); pandas also uses float columns to hold missing numeric data as NaN.
Categorical Types:
category: A special data type for categorical data, which can have a limited number of unique values (like
labels or categories). This is memory-efficient compared to storing text labels as strings.
Other data types
DateTime Types:
datetime64: Represents dates and times, and underpins pandas' time-series functionality.
Text/String Types:
object: The most general data type, which can hold any Python object, including strings. However, it's not the
most memory-efficient choice for large datasets with consistent data types.
Boolean Type:
bool: Holds True/False values.
Categorical vs non-categorical variables
Categorical Variables:
Definition: Categorical variables, also known as qualitative variables, represent categories or groups and can
take on a limited, discrete set of values or labels.
Examples: Gender (categories: male, female), color (categories: red, green, blue), car types (categories:
sedan, SUV, truck).
Non-Categorical Variables (Quantitative Variables):
Definition: Non-categorical variables, also known as quantitative variables, represent data that consists of
numerical values with meaningful numerical order and magnitude.
Examples: Age, income, height, temperature, and test scores are all examples of non-categorical variables.
These variables can be measured and operated on with mathematical operations.
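Since pandas stores categorical variables with the category dtype described above, here is a brief sketch of
converting a text column (the column name and values are invented):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# Strings are stored as object by default; convert to the category dtype
df['color'] = df['color'].astype('category')

print(df['color'].dtype)           # category
print(df['color'].cat.categories)  # Index(['blue', 'green', 'red'], dtype='object')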
Groupby
In pandas, the groupby operation is a powerful method used for grouping and aggregating data based on one
or more criteria.
It is a fundamental tool for data analysis and is often used in combination with other pandas functions to
perform various data manipulations.
Grouping Data:
The groupby operation is applied to a DataFrame or Series, and it allows you to group rows of data based on
the values in one or more columns.
You specify the column(s) by which you want to group the data.
Aggregation
After grouping, you can apply aggregation functions to the grouped data. Aggregation functions summarize
data within each group.
Common aggregation functions include sum, mean, median, count, min, max, and custom functions.
# Sample data: a grouping column and a numeric column (values invented)
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'],
                   'Value': [10, 20, 30, 40]})

# Group by the 'Category' column and calculate the mean for each group
grouped = df.groupby('Category')
result = grouped['Value'].mean()
print(result)
We use groupby to group the data by the 'Category' column.
We apply the mean aggregation function to calculate the mean value for each group.
The result shows the mean 'Value' for each of the categories 'A' and 'B'.
groupby is versatile and can be used for more complex operations involving multiple grouping columns,
custom aggregation functions, and chaining with other pandas operations. It's commonly used for tasks like
data summarization, pivot tables, and exploring relationships within datasets.
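As a sketch of that versatility, here is grouping by two invented columns and applying several aggregations
at once with agg():

import pandas as pd

df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'],
                   'Region': ['East', 'West', 'East', 'West'],
                   'Value': [10, 20, 30, 40]})

# Two grouping columns, several aggregation functions at once
result = df.groupby(['Category', 'Region'])['Value'].agg(['mean', 'sum', 'max'])
print(result)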
Operations between pandas Dataframe columns
In pandas, you can perform various operations between columns of a DataFrame to create new columns,
modify existing ones, or derive insights from your data. Here are some common operations between columns
in pandas:
Arithmetic Operations: +, -, *, /
Comparison Operations:
Custom Functions:
You can define custom functions using Python's lambda functions or regular functions and apply them to
columns using .apply().
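A brief sketch of these operations on an invented DataFrame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Arithmetic between columns yields a new Series
df['total'] = df['A'] + df['B']
df['ratio'] = df['B'] / df['A']

# Comparison operations yield boolean Series
df['B_is_large'] = df['B'] > 15

# A custom function applied element-wise with .apply()
df['A_squared'] = df['A'].apply(lambda x: x ** 2)
print(df)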
Detecting Missing Values
Detecting missing values in a pandas DataFrame is an essential step in data cleaning and analysis. Pandas
provides several methods to identify and handle missing values. Here are some common techniques for
detecting missing values:
isna() / isnull(): These methods return a DataFrame of the same shape as the original, with True for missing
values and False for non-missing values.
info():
The info() method provides a summary of the DataFrame, including the count of non-null values in each
column.
describe():
The describe() method can be used to get statistics for numeric columns, and it implicitly shows the count of
non-null values.
● Groupby: Pandas' groupby function is used to group and segment data in a DataFrame based on one or
more columns, allowing you to perform various aggregation and analysis operations on those grouped
data subsets.
● df.groupby("column name") ⇒ returns a groupby object to aggregate and perform analysis with
● Operations On/With Columns: e.g. df["column A"] + df["column B"] ⇒ resulting series, or A * df["column
C"] ⇒ resulting series
● Manipulating Data using Apply: df.apply(function_name) ⇒ resulting series
● Finding missing values: df["column name"].isna() ⇒ returns boolean series
Detecting missing values in columns
To detect missing values in columns using pandas, you can use the isna() or isnull() method on the
DataFrame or Series.
These methods return a DataFrame or Series of boolean values where True represents a missing value (NaN)
and False represents a non-missing value.
You can then use aggregation functions like sum() to count the missing values in each column.
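A runnable sketch on an invented DataFrame containing NaN values:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, np.nan, 6]})

print(df.isna())        # boolean DataFrame: True marks a missing value
print(df.isna().sum())  # A: 1 missing, B: 2 missing
df.info()               # also reports the non-null count per column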
Handling Missing Values
dropna(): This method removes rows or columns containing missing values. You can specify the axis
parameter to drop either rows (axis=0) or columns (axis=1) with missing values.
fillna(): This method allows you to fill missing values with a specified value or using various filling strategies
like forward fill, backward fill, or interpolation.
Fill in with default values
df.fillna(value=0)
This method fills missing values in the DataFrame df with a specific value, which is specified as the argument.
In the example provided, missing values are replaced with the value 0. This method is useful when you want
to replace missing values with a constant value that is meaningful in your context.
Forward Fill
df.fillna(method='ffill')
This method performs forward fill for missing values. It means that missing values are filled with the most
recent non-missing value that appears before them in the column. This method is suitable when you want to
propagate the last observed value forward in the dataset.
Backward Fill
df.fillna(method='bfill')
This method performs backward fill for missing values. It fills missing values with the next non-missing value
that appears after them in the column. It's the opposite of forward fill and is useful when you want to
propagate the next observed value backward in the dataset.
Interpolation
df.interpolate(method='linear', axis=0)
This method performs linear interpolation to fill missing values. Linear interpolation estimates missing values
based on the linear relationship between neighboring data points. The axis=0 argument indicates that
interpolation should be done along columns. This method is suitable when you have time-series or sequential
data and want to estimate missing values based on the trend of the data.
In general
Use fillna(value=0) when you want to replace missing values with a specific constant value.
Use fillna(method='ffill') when you want to forward fill missing values based on the previous non-missing
values.
Use fillna(method='bfill') when you want to backward fill missing values based on the next non-missing values.
Use interpolate(method='linear', axis=0) when you want to estimate missing values based on the linear
relationship between neighboring data points, typically for time-series or sequential data.
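To see the four strategies side by side, a sketch on an invented DataFrame (note that newer pandas versions
also provide df.ffill() and df.bfill() as direct equivalents of the method= form):

import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, np.nan, np.nan, 4.0, np.nan]})

print(df.fillna(value=0))                       # NaN -> 0
print(df.fillna(method='ffill'))                # carry the last seen value forward
print(df.fillna(method='bfill'))                # pull the next seen value backward
print(df.interpolate(method='linear', axis=0))  # 1, 2, 3, 4, then 4 carried forward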