Module 3

Exploratory Data Analysis (EDA)

Exploration of Data using Visualization in Python
Topics covered
Exploratory Data Analysis (EDA) - Introduction, Data Exploration, Handling Duplicates, Outliers, Missing Values, Univariate Analysis, and Bivariate Analysis.
1. Importing the Python Libraries
2. Handling Duplicates
3. Handling Outliers
4. Handling Missing Values
5. Univariate Analysis
6. Bivariate Analysis
Exploratory data analysis (EDA)
Exploratory data analysis, popularly known as EDA, is the process of performing initial investigations on a dataset to discover its structure and content. It is also known as Data Profiling.

•EDA is where we get a basic understanding of the data at hand, which helps us further with data cleaning and preparation.
What is Exploratory Data Analysis (EDA)?
EDA is the process of analyzing datasets to summarize their
key characteristics, detect patterns, spot anomalies, and
check assumptions using visualization and statistical
methods.
EDA is used to:
• Understand dataset structure
• Detect missing values, outliers, and errors
• Identify patterns and relationships
• Guide feature selection for machine learning
Why is EDA Important?
Before building models, we need to understand and clean the data.
• Poor data quality can lead to wrong conclusions.
• Helps choose the right analysis techniques.

Real-Life Examples:
• Stock Market: Identifying trends before forecasting stock prices
• E-Commerce: Understanding customer purchase behavior
• Healthcare: Detecting anomalies in patient data for diagnosis

Imagine an app-based food delivery company analyzing its data. They perform EDA to:
• Check missing order details
• Find peak hours of delivery
• Analyze the relationship between delivery time & customer ratings
Key Steps in Exploratory Data Analysis (EDA)

1. Data Collection & Loading - Import the dataset from a file (CSV, Excel, Database, API). Example: df = pd.read_csv("data.csv")
2. Data Cleaning - Handle missing values, duplicates, and inconsistent formatting. Example: remove duplicates, fill missing values with mean/median.
3. Data Exploration - Understand dataset structure, summary statistics, and distributions. Example: df.describe(), df.info()
4. Outlier Detection - Identify and handle anomalies using boxplots, IQR, or Z-score. Example: detect extreme values in sales or customer spending.
5. Data Visualization - Use plots to understand distributions and relationships. Example: histograms, scatter plots, correlation heatmaps.
6. Feature Selection & Transformation - Select relevant variables, scale or encode categorical data. Example: standardization, one-hot encoding.

A minimal code sketch of these six steps appears below.
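The sketch below is a compact, hypothetical walk-through of the six steps in pandas; the file name data.csv and the column names sales and city are placeholders, not part of the Black Friday dataset used later.

# A compact, hypothetical walk-through of the six EDA steps above.
import pandas as pd

df = pd.read_csv("data.csv")                        # 1. Data collection & loading
df = df.drop_duplicates()                           # 2. Data cleaning: remove duplicates
df = df.fillna(df.median(numeric_only=True))        #    ...fill numeric gaps with medians
df.info()                                           # 3. Data exploration: structure
print(df.describe())                                #    ...summary statistics
q1, q3 = df['sales'].quantile([0.25, 0.75])         # 4. Outlier detection via the IQR rule
outliers = df[(df['sales'] < q1 - 1.5 * (q3 - q1)) |
              (df['sales'] > q3 + 1.5 * (q3 - q1))]
df['sales'].hist()                                  # 5. Data visualization
df = pd.get_dummies(df, columns=['city'])           # 6. Feature transformation (one-hot encoding)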
Types of Data in EDA

• Numerical Data - data represented in numbers; can be continuous or discrete. Examples: Revenue, Age, Temperature
• Categorical Data - data divided into categories, either nominal or ordinal. Examples: Gender (Male/Female), Customer Type (New/Returning)
• Ordinal Data - categorical data with a meaningful order/ranking. Example: Satisfaction Level (Low, Medium, High)
• Date-Time Data - data related to time, useful for trend analysis. Examples: Sales Date, Transaction Time
• Boolean Data - binary values representing True/False or Yes/No. Examples: Is_Defaulted (Yes/No), Purchased (0/1)

The short sketch below shows how these types appear as pandas dtypes.
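A minimal sketch (column names are illustrative, not from any dataset in this module) of how the data types above map onto pandas dtypes, with select_dtypes used to pull out each group:

import pandas as pd

df = pd.DataFrame({
    'Revenue': [120.5, 99.0],                                  # numerical (float)
    'Gender': pd.Categorical(['Male', 'Female']),              # categorical (nominal)
    'Satisfaction': pd.Categorical(['Low', 'High'],
                                   categories=['Low', 'Medium', 'High'],
                                   ordered=True),              # ordinal
    'Sale_Date': pd.to_datetime(['2024-01-01', '2024-01-02']), # date-time
    'Purchased': [True, False],                                # boolean
})
print(df.dtypes)
print(df.select_dtypes('number').columns)    # numerical columns
print(df.select_dtypes('category').columns)  # categorical/ordinal columns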
Python

Data Visualization
Importing the Python Libraries

•NumPy
•Pandas
•Matplotlib and
•Seaborn.
Dataset – EDA-1-Black_Friday_3

This dataset comprises sales transactions captured at a retail store. It's a classic dataset for exploring and expanding your feature-engineering skills and for building day-to-day understanding from multiple shopping experiences.
Import Python Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')
#Step-1-Importing Dataset

import pandas as pd

BF=pd.read_excel(r"D:\DataSet2024\Black_Friday_3.xlsx")

print(BF)
#Step-2-Display the Variable Names and their Data Types (Meta Data)

#Meta Data
BF.info()

Metadata is data that describes other data: structured reference data that helps to sort and identify attributes of the information it describes.
#Step-3-Count the Number of Non-Missing Values for each Variable

BF.count()
#Step-4-Descriptive Statistics

BF.describe()
#Step-5-Inclusion of categorical variables

BF.describe(include='all')
#Step-6-Handling Duplicates

This involves 2 steps: detecting duplicates and removing duplicates.

BF.duplicated()

The pandas.DataFrame.duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean Series that identifies whether a row is duplicate or unique.
#Step-7-To remove the duplicates(if any)

BF.drop_duplicates()
#BF.drop_duplicates(subset='User_ID')
This by default keeps just the first occurrence of each duplicated value in the User_ID variable and drops the rest.

We do not want to remove the duplicate values from the User_ID variable permanently, so we use inplace=False just to see the output without making any permanent change to our data frame.

BF.drop_duplicates(subset='User_ID', inplace=False)
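For reference, a minimal sketch (a hypothetical three-row frame, not part of the Black Friday data) of the keep parameter, which controls which occurrence drop_duplicates retains:

import pandas as pd

df = pd.DataFrame({'User_ID': [1, 1, 2], 'Purchase': [100, 200, 300]})

df.drop_duplicates(subset='User_ID')                # keeps the first occurrence (default)
df.drop_duplicates(subset='User_ID', keep='last')   # keeps the last occurrence instead
df.drop_duplicates(subset='User_ID', keep=False)    # drops every row that has a duplicate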
Handling Outliers

Outliers are the extreme values on the low and the high
side of the data.

Handling outliers involves 2 steps: detecting outliers and treating outliers.
Detecting Outliers
Consider any variable from the data frame and determine the upper cutoff and the lower cutoff with the help of any of these 3 methods (a hedged sketch of all three follows the list):
• Percentile Method
• IQR Method
• Standard Deviation Method
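A minimal sketch of all three methods for a numeric pandas Series s; the thresholds (1st/99th percentiles, k = 1.5, k = 3) are common conventions, not values prescribed by these slides:

import pandas as pd

def percentile_cutoffs(s: pd.Series, lower=0.01, upper=0.99):
    # Values outside the 1st/99th percentiles are treated as outliers.
    return s.quantile(lower), s.quantile(upper)

def iqr_cutoffs(s: pd.Series, k=1.5):
    # Classic Tukey fences: Q1 - k*IQR and Q3 + k*IQR.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def std_cutoffs(s: pd.Series, k=3):
    # Mean plus or minus k standard deviations.
    return s.mean() - k * s.std(), s.mean() + k * s.std()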
IQR Method of Outlier Detection

(Figure: box plot anatomy showing minimum, Q1, median, Q3, and maximum, where minimum is the smallest and maximum is the largest value in the dataset.)
Determining if there are any outliers in the dataset using the IQR (interquartile range) method.

Find the minimum (p0), maximum (p100), first quartile (q1), second quartile (q2), third quartile (q3), and the IQR (interquartile range) of the values in the Purchase variable.
#Step-8-IQR (Interquartile range) Method - Execute
p0=BF.Purchase.min()
p100=BF.Purchase.max()
q1=BF.Purchase.quantile(0.25)
q2=BF.Purchase.quantile(0.5)
q3=BF.Purchase.quantile(0.75)
iqr=q3-q1
Using the Interquartile Rule to Find Outliers

1. Calculate the interquartile range for the data.
2. Multiply the interquartile range (IQR) by 1.5 (a constant used to discern outliers).
3. Add 1.5 × IQR to the third quartile. Any number greater than this is a suspected outlier.
4. Subtract 1.5 × IQR from the first quartile. Any number less than this is a suspected outlier.
Why use 1.5 in the IQR rule?
• About 68.26% of the data lies within one standard deviation (σ) of the mean (μ), taking both sides into account (the pink region in the figure).
• About 95.44% of the data lies within two standard deviations (2σ) of the mean (μ), taking both sides into account (the pink+blue region in the figure).
• About 99.72% of the data lies within three standard deviations (3σ) of the mean (μ), taking both sides into account (the pink+blue+green region in the figure).
• The remaining 0.28% of the data lies outside three standard deviations (>3σ) of the mean (μ), taking both sides into account (the small red regions in the figure). This part of the data is considered outliers.
• The first and third quartiles, Q1 and Q3, lie at -0.675σ and +0.675σ from the mean, respectively.
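Putting these numbers together explains the 1.5 factor: for normal data the IQR spans from -0.675σ to +0.675σ, i.e. IQR = 1.35σ, so the upper fence Q3 + 1.5 × IQR = 0.675σ + 2.025σ = 2.7σ, just inside the ±3σ zone that marks outliers. A quick check (assuming scipy is available; not part of the original slides):

from scipy.stats import norm

q1, q3 = norm.ppf(0.25), norm.ppf(0.75)   # about -0.6745 and +0.6745
iqr = q3 - q1                             # about 1.349
upper_fence = q3 + 1.5 * iqr              # about +2.698 (in units of sigma)
lower_fence = q1 - 1.5 * iqr              # about -2.698
print(lower_fence, upper_fence)
print(2 * norm.sf(upper_fence))           # about 0.007: roughly 0.7% of normal data is flagged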
#Step-9-To find the lower cutoff (lc) and the upper cutoff (uc) of the values

lc = q1 - 1.5*iqr

uc = q3 + 1.5*iqr
#Step-10-Execute

lc

uc
Outlier - Condition
If lc < p0 → there are NO outliers on the lower side.
If uc > p100 → there are NO outliers on the higher side.

#Step-11-Execute

print("p0 = ", p0, ", p100 = ", p100, ", lc = ", lc, ", uc = ", uc)

Output

p0 = 12 , p100 = 23961 , lc = -3523.5 , uc = 21400.5

Clearly lc < p0, so there are no outliers on the lower side. But uc < p100, so there are outliers on the higher side.
#Step-12-Pictorial representation of the outliers by drawing the box plot

BF.Purchase.plot(kind='box')
Outlier Treatment

• Clip the values instead of removing them from the variable.
• During this process, replace values that fall outside the range with the lower or upper cutoff, as appropriate.
• Once the outliers are clipped, all of the data lies within the range.
#Step-13-Clipping all values greater than the upper cutoff to the upper cutoff:

BF.Purchase.clip(upper=uc)
#Step-14-To finally treat the outliers and make the changes permanent

BF.Purchase.clip(upper=uc, inplace=True)
BF.Purchase.plot(kind='box')
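Series.clip(inplace=True) works here, but newer pandas versions discourage inplace on a column Series; a minimal equivalent sketch that assigns the clipped result back (clipping both sides, with lc and uc from Steps 9-10):

BF['Purchase'] = BF['Purchase'].clip(lower=lc, upper=uc)  # lower clip is a no-op here since lc < p0
BF['Purchase'].plot(kind='box')
plt.show()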
#Step-15-Handling Missing Values

#Detecting the Missing Values

BF.isna()

#pandas.isna(obj) detects missing values for an array-like object. This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetime-likes).

#df.isna() returns True for the missing values and False for the non-missing values.
#Step-16-To find out the percentage of missing values in each variable

BF.isna().sum()/BF.shape[0]
Missing Value Treatment
To treat the missing values we can opt for a method from the following:

• Drop the variable
• Drop the observation(s)
• Missing Value Imputation
Missing Value Treatment
• 31.56% of the values for the variable Product Category 2 are
missing.
• We should not discard such a large number of observations or the
variable itself. Therefore, we will use imputation.
• In this process, missing data is imputed by replacing the missing
values with an appropriate value, which could be a constant, mean,
median, mode, or a predictive model output.
• Since Product Category 2 is a categorical variable, we will impute
the missing values using the mode.
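For reference, a minimal sketch (not from the slides) of the three common constant-value imputations on this column; mean and median suit numeric variables, mode suits categorical ones:

s = BF['Product_Category_2']

s.fillna(s.mean())     # mean imputation (numeric variables)
s.fillna(s.median())   # median imputation (numeric, robust to outliers)
s.fillna(s.mode()[0])  # mode imputation (categorical variables) - the choice used in Step 17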
#Step-17-Missing Value Treatment
#Execute 1st
BF.Product_Category_2.mode()[0]

BF.Product_Category_2.fillna(BF.Product_Category_2.mode()[0], inplace=True)

# Execute 2nd

BF.isna().sum()  #BF.isna().sum()/BF.shape[0]
#Step-17-Missing Value Treatment

For the variable Product_Category_3, 69.67% of the values are missing, which is a significant proportion. Therefore, we will drop this variable.

BF.dropna(axis=1, inplace=True)
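Note that dropna(axis=1) removes every column that still contains a missing value; after the Step-17 imputation only Product_Category_3 qualifies, so this works. A more explicit alternative (equivalent here) is to drop that column by name:

BF.drop(columns=['Product_Category_3'], inplace=True)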
#Step-18-Missing Value Treatment
How to check?

BF.dtypes
Univariate Analysis

In this type of analysis, charts are plotted for a single variable. These charts help visualize how the data is distributed and structured based on the variable type: categorical or numerical.
For continuous variables, we use box plots and histograms to examine the data distribution.
Syntax Examples (with declared file name = Train, variable name = Purchase):

1.
Train.Purchase.hist()
plt.show()

2.
Train.groupby('Variable_Name').Variable_Name.count().plot(kind='pie')  # or kind='barh'
plt.show()

3.
sns.countplot(x='Variable_Name', data=Train)
plt.show()
#Step-19-Distribution of Purchase
# Histogram

import matplotlib.pyplot as plt

BF.Purchase.hist()
plt.show()
For Categorical Variables
• To analyze the distribution (spread) of categorical variables, we use frequency plots such as bar charts and horizontal bar charts.
• To understand the composition (arrangement) of data, we use pie charts.
#Step-20-Composition of Gender
import matplotlib.pyplot as plt
print(BF['Gender'].value_counts())  # Print the counts

# Plot the pie chart
BF['Gender'].value_counts().plot(kind='pie', autopct='%1.1f%%')

# Save the figure as a PNG before showing it
plt.savefig("gender_pie_chart.png", dpi=300, bbox_inches='tight')

plt.show()
Gender
M 414259
F 135809
Name: count, dtype: int64
#Step-21-Composition of Gender

BF.groupby('Gender').Gender.count().plot(kind='pie')

plt.show()
#Step-22-Distribution of Marital_Status

import seaborn as sns

sns.countplot(x='Marital_Status',data=BF)

plt.show()
#Step-22A-Distribution of Marital_Status

sns.countplot(x='Marital_Status', data=BF, hue='Gender', palette='coolwarm')

plt.show()
#Step-22B-Distribution of Marital_Status
import seaborn as sns
import matplotlib.pyplot as plt

# Create count plot with hue for Gender
ax = sns.countplot(x='Marital_Status', data=BF, hue='Gender', palette='viridis')

# Display counts on top of bars
for bar in ax.containers:
    ax.bar_label(bar)

plt.show()
palette_options = [
    'viridis', 'magma', 'plasma', 'inferno',                    # Scientific color maps
    'coolwarm', 'RdYlBu', 'Spectral',                           # Diverging color maps
    'Blues', 'Reds', 'Greens', 'Purples',                       # Single-tone color maps
    'pastel', 'deep', 'muted', 'bright', 'dark', 'colorblind',  # Seaborn-themed palettes
]
print(palette_options)
#Step-23-Composition of City_Category

BF.groupby('City_Category').City_Category.count().plot(kind='pie')

plt.show()
#Step-23A-Composition of City_Category
import matplotlib.pyplot as plt

# Plot the pie chart with numbers
BF.groupby('City_Category').City_Category.count().plot(kind='pie', autopct='%1.1f%%')

plt.ylabel('')  # Removes the default y-axis label
plt.title("City Category Distribution")
plt.show()
Step-24-Distribution of Age

sns.countplot(x='Age',data=BF)

plt.show()
import seaborn as sns
import matplotlib.pyplot as plt

# Create the count plot with a color palette
ax = sns.countplot(x='Age', data=BF, palette='viridis')  # Change the palette if needed

# Add numbers on top of each bar
for bar in ax.containers:
    ax.bar_label(bar)

plt.title("Age Distribution")
plt.show()
#Step-25-Composition of Stay_In_Current_City_Years

import matplotlib.pyplot as plt

BF.groupby('Stay_In_Current_City_Years').City_Category.count().plot(kind='pie', autopct='%1.1f%%')

plt.show()
#Step-26-Distribution of Occupation

Chart = sns.countplot(x='Occupation', data=BF)
for container in Chart.containers:
    Chart.bar_label(container)
plt.show()
#Step-27-Distribution of Occupation

sns.countplot(x='Occupation',data=BF)

plt.show()
#Step-28-Distribution of Product_Category_1

BF.groupby('Product_Category_1').Product_Category_1.count().plot(kind='barh')

plt.show()
#Step-29-Histogram - Multiple
df = pd.DataFrame(BF, columns=['Purchase', 'Product_Category_1', 'Occupation'])

df.diff().hist(bins=15)  # diff() takes row-to-row differences; hist() then plots one histogram per column
#Step-30-Histogram with grid

BF.Purchase.plot(kind='hist', grid=True)

plt.show()
Bivariate Analysis
In this type of analysis, we consider two variables at a time and create charts based on them. Since there are two types of variables, categorical and numerical, bivariate analysis has three cases:

1. Numerical & Numerical
2. Numerical & Categorical
3. Categorical & Categorical
1.Numerical & Numerical

To examine the relationship between two variables, we create scatter plots and a correlation matrix with a heatmap overlay.
Scatter Plot
Our dataset contains only one truly numerical variable (Purchase), so we cannot create a scatter plot of two numerical variables.
How can we address this?
Consider a scenario where we treat all variables whose data types are int or float (including numerically coded categories such as Product_Category_1) as numerical variables.
#Step-31 # Considering 2 numerically coded categorical variables, Product_Category_1 and Product_Category_2

BF.plot(x='Product_Category_1', y='Product_Category_2', kind='scatter')

plt.show()
#Step-32 # The same scatter plot via matplotlib, with Product_Category_1 and Product_Category_2

plt.scatter(x=BF.Product_Category_1, y=BF.Product_Category_2)

plt.show()
#Step-33-Correlation Matrix: Finding the correlation among all numerical variables

BF.drop('User_ID', axis=1, inplace=True)

BF.select_dtypes(['float64', 'int64']).corr()

Output:

                     Occupation  Marital_Status  Product_Category_1  Product_Category_2  Purchase
Occupation             1.000000        0.024280           -0.007618            0.001566  0.020853
Marital_Status         0.024280        1.000000            0.019888            0.010260 -0.000599
Product_Category_1    -0.007618        0.019888            1.000000            0.279247 -0.347413
Product_Category_2     0.001566        0.010260            0.279247            1.000000 -0.131104
Purchase               0.020853       -0.000599           -0.347413           -0.131104  1.000000

#Step-34-Correlation: Finding the correlation between two specific numeric variables

BF['Marital_Status'].corr(BF['Occupation'])
#Step-35-Heatmap

Creating a heatmap using Seaborn on top of the correlation matrix helps visualize the relationships between numerical columns in the dataset.

!pip install seaborn --upgrade

sns.heatmap(BF.select_dtypes(['float64', 'int64']).corr(), annot=True)

plt.show()

#annot: If True, write the data value in each cell.
2.Numerical & Categorical

• To analyze the composition of the data, create bar and line charts.
• To compare two variables, create bar and line charts.
#Step-36 #Comparison between Purchase and Occupation: Bar Chart

BF.groupby('Occupation').Purchase.sum().plot(kind='bar')

plt.show()
#Step-37 #Comparison between Purchase and Occupation: Bar Chart (seaborn)

summary = BF.groupby('Occupation').Purchase.sum()

sns.barplot(x=summary.index, y=summary.values)

plt.show()
#Step-38-Comparison between Purchase and Age: Line Chart

BF.groupby('Age').Purchase.sum().plot(kind='line')

plt.show()
#Step-39-Comparison between Purchase and City_Category: Area Chart

BF.groupby('City_Category').Purchase.sum().plot(kind='area')

plt.show()

BF.groupby('City_Category').Purchase.sum().plot(kind='area', color='skyblue')

plt.show()

Replace 'skyblue' with any preferred color (e.g., 'red', 'green', '#FFA07A', etc.).
#Step-40-Comparison between Purchase and Marital_Status

sns.boxplot(x='Marital_Status', y='Purchase', data=BF)

plt.show()
Generate the Box Plot & Display Numerical Summary

# Load the Black Friday dataset
BF = pd.read_excel(r"D:\Mallie IV\AcademicYear2025_26\Black_Friday_3.xlsx")

# Create the box plot
plt.figure(figsize=(8, 5))
ax = sns.boxplot(x='Marital_Status', y='Purchase', data=BF)

# Calculate and display descriptive statistics
stats = BF.groupby('Marital_Status')['Purchase'].describe()
print(stats)  # Print numerical summary of the box plot

# Add median values to the plot
medians = BF.groupby('Marital_Status')['Purchase'].median()
for i, median in enumerate(medians):
    plt.text(i, median, f'{median:.0f}', horizontalalignment='center',
             fontsize=12, color='black', fontweight='bold')

plt.title("Box Plot of Marital Status vs. Purchase (Black Friday Data)")
plt.xlabel("Marital Status (0 = Single, 1 = Married)")
plt.ylabel("Purchase Amount")
plt.show()
Numerical Summary (from describe())

Marital Status   Count     Mean    Median (Q2)  Q1 (25%)  Q3 (75%)  Min   Max      Std Dev
Single (0)       324,731   ₹9,266  ₹8,044       ₹5,605    ₹12,061   ₹12   ₹23,961  ₹5,027
Married (1)      225,337   ₹9,261  ₹8,051       ₹5,843    ₹12,042   ₹12   ₹23,961  ₹5,017
How to Read the Box Plot?
• Median (Q2, 50th percentile): the horizontal line inside each box. Very close for both groups (₹8,044 vs. ₹8,051), so there is no significant difference in typical spending.
• Interquartile range (IQR: Q3 - Q1): the height of the box, from Q1 to Q3. Singles: ₹5,605 - ₹12,061; Married: ₹5,843 - ₹12,042, a similar spending range.
• Whiskers (min & max, excluding outliers): the lines extending from the box. Both groups have identical min (₹12) and max (₹23,961).
• Outliers: dots outside the whiskers. A few high-value purchases exist in both groups but are not significantly different.
• Box size (variability in data): the extent of each box is almost equal, meaning both groups have similar purchase behavior.
Key Insights
• Spending patterns between singles and married individuals are almost identical.
• The mean, median, and IQR values are very close for both groups.
• Both groups have similar spending variation (standard deviation ≈ ₹5,000).
• The max purchase amount is the same for both groups (₹23,961), which suggests that high spenders exist equally in both categories.
• The IQR range is slightly wider for singles (₹5,605 - ₹12,061) vs. married (₹5,843 - ₹12,042); this means married individuals have slightly more consistent spending behavior, but the difference is minor.
• There is no strong indication that marital status significantly influences purchase behavior.
• Since both groups have nearly identical distributions, other factors (like Age, Occupation, or City Category) may influence spending more than marital status.
Conclusion
• The box plot visually confirms that there is no major
difference in spending behavior between singles and
married individuals.
• We should focus on other variables (such as Age,
Product Category, or Income) to find stronger
patterns.
How to Interpret the Box Plot?
Each box represents the distribution of purchases for a specific marital status.
Key elements of the box plot:
• The box → shows the middle 50% of data (interquartile range, IQR).
• The line inside the box → represents the median purchase amount.
• Whiskers → indicate the range of most values, excluding outliers.
• Dots outside the whiskers → represent outliers, meaning unusually high or low purchases.
How to get the rupee symbol?
import seaborn as sns
import matplotlib.pyplot as plt

# Simple box plot
sns.boxplot(x=["Single", "Single", "Married", "Married"], y=[8000, 12000, 15000, 18000])

# Use Unicode escape \u20B9 for the ₹ symbol
plt.ylabel("\u20B9 Purchase")

plt.show()
3.Categorical & Categorical
To analyze the relationship between two variables, create a crosstab and overlay a heatmap.
Step 41: Relationship Between Age and Gender - Creating a crosstab to display data for Age and Gender

pd.crosstab(BF.Age, BF.Gender)
Gender      F       M
Age
0-17     5083   10019
18-25   24628   75032
26-35   50752  168835
36-45   27170   82843
46-50   13199   32502
51-55    9894   28607
55+      5083   16421
#Step-42-Heatmap: Creating a heat map on top of the crosstab

sns.heatmap(pd.crosstab(BF.Age, BF.Gender))

plt.show()
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create and print crosstab values


crosstab_data = pd.crosstab(BF['Age'], BF['Gender'])
print(crosstab_data)

# Create heatmap
sns.heatmap(crosstab_data, cmap="Blues")

plt.show()
Interpretation of Age vs. Gender Heatmap
• Dominant age group: 26-35 (Male: 168,835; Female: 50,752). This group has the highest number of buyers, especially males, suggesting that young working professionals are the biggest shoppers.
• Second-highest age group: 18-25 (Male: 75,032; Female: 24,628). The second most active group, likely consisting of college students and young employees; males significantly outnumber females.
• Older age groups (46-55, 55+): fewer buyers compared to younger groups. The number of shoppers decreases with age, indicating that middle-aged and senior customers shop less on Black Friday.
• Male vs. female comparison: males dominate all age groups. Across all age categories, the number of male buyers is consistently higher than female buyers, showing that males participate more in Black Friday shopping.
• Least active age group: 55+ (Male: 16,421; Female: 5,083). The senior group has the fewest shoppers, possibly due to lower tech adoption or different shopping preferences.
• Most balanced gender ratio: the 36-45 group (M: 82,843; F: 27,170). While males still outnumber females, the gender gap is slightly less pronounced in this category.
Frequency Distribution

1e8 is standard scientific notation, and here it indicates an overall scale factor for the y-axis. That is, if there's a 2 on the y-axis and a 1e8 at the top, the value at 2 actually indicates 2 * 1e8 = 2e8 = 2 × 10^8 = 200,000,000.
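If the 1e8 offset label is unwanted, matplotlib can be told to print plain tick labels; a minimal sketch (assuming the axis uses the default scalar formatter; the thousands-separator line is an optional alternative, not from the slides):

import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

ax = BF.groupby('Occupation').Purchase.sum().plot(kind='bar')
ax.ticklabel_format(axis='y', style='plain')   # disable the 1e8 offset notation
# Alternatively, show thousands separators on the y-axis ticks:
# ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.show()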
Workbook-2-Plotly Library

Dataset : World Happiness Report

Live Dashboards
Step-1
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
Step-2
WHR = pd.read_excel(r"D:\DataSet2024\World_Happiness_R22_4.xlsx")

print(WHR)

WHR.info()
Step-3 – Scatter Diagram

fig = px.scatter(WHR, x="Happiness_Score", y="GDP_Per_Capita", color='Country')

fig.show()
Step-4 – Scatter Diagram

fig = px.scatter(WHR, x="Happiness_Score", y="GDP_Per_Capita", color='Rank')

fig.show()
Step-5 – Scatter Diagram

fig = px.scatter(WHR, x="Happiness_Score", y="GDP_Per_Capita", color='PerceptionsOfCorruption')

fig.show()
Step-6 – Line Chart - Multiple

fig = px.line(WHR, x='Happiness_Score', y="Country")

fig.show()
Step-7 – Line Chart

fig = px.line(WHR, x='Happiness_Score', y=["GDP_Per_Capita", "Social_Support", 'Healthy_Life_Expectancy', 'FreedomOfChoices', "Generosity"])

fig.show()
Step-8 – Line Chart

fig = px.line(WHR, x='Happiness_Score', y='Social_Support', color='Country')

fig.show()
Step-9 – Bar Chart

fig = px.bar(WHR, x='Happiness_Score', y='Country')

fig.show()
Step-10 – Bar Chart

fig = px.bar(WHR, x='Country', y='Happiness_Score')

fig.show()
Step-11 – Bar Chart

fig = px.bar(WHR, x='Country', y='Happiness_Score', color='Rank')

fig.show()
Step-12 – Pie Chart

fig = px.pie(WHR, values='Happiness_Score', names='Rank', title='Happiness Report')

fig.show()
Step-13 – Histogram

fig = px.histogram(WHR, x="Happiness_Score", title="Happiness Report")

fig.show()
Step-14 – Histogram with Colour

fig = go.Figure(data=[go.Histogram(
    x=WHR.Happiness_Score,
    xbins=go.histogram.XBins(size=0),
    marker=go.histogram.Marker(color="green"),
)])

fig.show()
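The xbins size=0 above comes from the original slide; for explicit control of the bin width, a hedged variant (the 0.5 width is an arbitrary illustrative choice):

import plotly.graph_objects as go

fig = go.Figure(data=[go.Histogram(
    x=WHR.Happiness_Score,
    xbins=go.histogram.XBins(size=0.5),        # fixed bin width of 0.5
    marker=go.histogram.Marker(color="green"),
)])

fig.show()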
