Module 3

Exploratory Data Analysis (EDA)

Exploration of Data using Visualization in Python
Topics covered
Exploratory Data Analysis (EDA) - Introduction, Data Exploration, Handling Duplicates, Outliers, Missing Values, Univariate Analysis, and Bivariate Analysis.
1. Importing the Python Libraries
2. Handling Duplicates
3. Handling Outliers
4. Handling Missing Values
5. Univariate Analysis
6. Bivariate Analysis
Exploratory data analysis (EDA)
Exploratory data analysis, popularly known as EDA, is the process of performing initial investigations on a dataset to discover its structure and content. It is also known as Data Profiling.

•EDA is where we get a basic understanding of the data at hand, which helps us further with data cleaning and preparation.
What is Exploratory Data Analysis (EDA)?
EDA is the process of analyzing datasets to summarize their
key characteristics, detect patterns, spot anomalies, and
check assumptions using visualization and statistical
methods.
EDA is used to:
• Understand dataset structure
• Detect missing values, outliers, and errors
• Identify patterns and relationships
• Guide feature selection for machine learning
Why is EDA Important?
Before building models, we need to understand and clean the data.
• Poor data quality can lead to wrong conclusions.
• Helps choose the right analysis techniques.

Real-Life Examples:
• Stock Market: Identifying trends before forecasting stock prices
• E-Commerce: Understanding customer purchase behavior
• Healthcare: Detecting anomalies in patient data for diagnosis

Imagine an app-based food delivery company analyzing its data. They perform EDA to:
• Check missing order details
• Find peak hours of delivery
• Analyze the relationship between delivery time & customer ratings
Key Steps in Exploratory Data Analysis (EDA)

1. Data Collection & Loading - Import the dataset from a file (CSV, Excel, Database, API). Example: df = pd.read_csv("data.csv")
2. Data Cleaning - Handle missing values, duplicates, and inconsistent formatting. Example: remove duplicates, fill missing values with mean/median.
3. Data Exploration - Understand dataset structure, summary statistics, and distributions. Example: df.describe(), df.info()
4. Outlier Detection - Identify and handle anomalies using boxplots, IQR, or Z-score. Example: detect extreme values in sales or customer spending.
5. Data Visualization - Use plots to understand distributions and relationships. Example: histograms, scatter plots, correlation heatmaps.
6. Feature Selection & Transformation - Select relevant variables, scale or encode categorical data. Example: standardization, one-hot encoding.

A minimal code sketch of these six steps appears below.
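The sketch below is a compact, hypothetical walk-through of the six steps in pandas; the file name data.csv and the column names sales and city are placeholders, not part of the Black Friday dataset used later.

# A compact, hypothetical walk-through of the six EDA steps above.
import pandas as pd

df = pd.read_csv("data.csv")                        # 1. Data collection & loading
df = df.drop_duplicates()                           # 2. Data cleaning: remove duplicates
df = df.fillna(df.median(numeric_only=True))        #    ...fill numeric gaps with medians
df.info()                                           # 3. Data exploration: structure
print(df.describe())                                #    ...summary statistics
q1, q3 = df['sales'].quantile([0.25, 0.75])         # 4. Outlier detection via the IQR rule
outliers = df[(df['sales'] < q1 - 1.5 * (q3 - q1)) |
              (df['sales'] > q3 + 1.5 * (q3 - q1))]
df['sales'].hist()                                  # 5. Data visualization
df = pd.get_dummies(df, columns=['city'])           # 6. Feature transformation (one-hot encoding)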
Types of Data in EDA

• Numerical Data - data represented in numbers; can be continuous or discrete. Examples: Revenue, Age, Temperature
• Categorical Data - data divided into categories, either nominal or ordinal. Examples: Gender (Male/Female), Customer Type (New/Returning)
• Ordinal Data - categorical data with a meaningful order/ranking. Example: Satisfaction Level (Low, Medium, High)
• Date-Time Data - data related to time, useful for trend analysis. Examples: Sales Date, Transaction Time
• Boolean Data - binary values representing True/False or Yes/No. Examples: Is_Defaulted (Yes/No), Purchased (0/1)

The short sketch below shows how these types appear as pandas dtypes.
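A minimal sketch (column names are illustrative, not from any dataset in this module) of how the data types above map onto pandas dtypes, with select_dtypes used to pull out each group:

import pandas as pd

df = pd.DataFrame({
    'Revenue': [120.5, 99.0],                                  # numerical (float)
    'Gender': pd.Categorical(['Male', 'Female']),              # categorical (nominal)
    'Satisfaction': pd.Categorical(['Low', 'High'],
                                   categories=['Low', 'Medium', 'High'],
                                   ordered=True),              # ordinal
    'Sale_Date': pd.to_datetime(['2024-01-01', '2024-01-02']), # date-time
    'Purchased': [True, False],                                # boolean
})
print(df.dtypes)
print(df.select_dtypes('number').columns)    # numerical columns
print(df.select_dtypes('category').columns)  # categorical/ordinal columns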
Python

Data Visualization
Importing the Python Libraries

•NumPy
•Pandas
•Matplotlib and
•Seaborn.
Dataset – EDA-1-Black_Friday_3

This dataset comprises sales transactions captured at a retail store. It's a classic dataset for exploring and expanding your feature-engineering skills and for building day-to-day understanding from multiple shopping experiences.
Import Python Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')
#Step-1-Importing Dataset

import pandas as pd

BF=pd.read_excel(r"D:\DataSet2024\Black_Friday_3.xlsx")

print(BF)
#Step-2-Display the Variable Names and their Data Types (Meta Data)

#Meta Data
BF.info()

Metadata is data that describes other data: structured reference data that helps to sort and identify attributes of the information it describes.
#Step-3-Count the Number of Non-Missing Values for each Variable

BF.count()
#Step-4-Descriptive Statistics

BF.describe()
#Step-5-Inclusion of categorical variables

BF.describe(include='all')
#Step-6-Handling Duplicates

This involves 2 steps: detecting duplicates and removing duplicates.

BF.duplicated()

The pandas.DataFrame.duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean Series that identifies whether a row is duplicate or unique.
#Step-7-To remove the duplicates(if any)

BF.drop_duplicates()
#BF.drop_duplicates(subset='User_ID')
This by default keeps just the first occurrence of each duplicated value in the User_ID variable and drops the rest.

We do not want to remove the duplicate values from the User_ID variable permanently, so we use inplace=False just to see the output without making any permanent change to our data frame.

BF.drop_duplicates(subset='User_ID', inplace=False)
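For reference, a minimal sketch (a hypothetical three-row frame, not part of the Black Friday data) of the keep parameter, which controls which occurrence drop_duplicates retains:

import pandas as pd

df = pd.DataFrame({'User_ID': [1, 1, 2], 'Purchase': [100, 200, 300]})

df.drop_duplicates(subset='User_ID')                # keeps the first occurrence (default)
df.drop_duplicates(subset='User_ID', keep='last')   # keeps the last occurrence instead
df.drop_duplicates(subset='User_ID', keep=False)    # drops every row that has a duplicate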
Handling Outliers

Outliers are the extreme values on the low and the high
side of the data.

Handling outliers involves 2 steps: detecting outliers and treating outliers.
Detecting Outliers
Consider any variable from the data frame and determine the upper cutoff and the lower cutoff with the help of any of these 3 methods (a hedged sketch of all three follows the list):
• Percentile Method
• IQR Method
• Standard Deviation Method
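A minimal sketch of all three methods for a numeric pandas Series s; the thresholds (1st/99th percentiles, k = 1.5, k = 3) are common conventions, not values prescribed by these slides:

import pandas as pd

def percentile_cutoffs(s: pd.Series, lower=0.01, upper=0.99):
    # Values outside the 1st/99th percentiles are treated as outliers.
    return s.quantile(lower), s.quantile(upper)

def iqr_cutoffs(s: pd.Series, k=1.5):
    # Classic Tukey fences: Q1 - k*IQR and Q3 + k*IQR.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def std_cutoffs(s: pd.Series, k=3):
    # Mean plus or minus k standard deviations.
    return s.mean() - k * s.std(), s.mean() + k * s.std()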
IQR Method of Outlier Detection

(Figure: box plot anatomy showing minimum, Q1, median, Q3, and maximum, where minimum is the smallest and maximum is the largest value in the dataset.)
Determining if there are any outliers in the dataset using the IQR (interquartile range) method.

Find the minimum (p0), maximum (p100), first quartile (q1), second quartile (q2), third quartile (q3), and the IQR (interquartile range) of the values in the Purchase variable.
#Step-8-IQR (Interquartile range) Method - Execute
p0=BF.Purchase.min()
p100=BF.Purchase.max()
q1=BF.Purchase.quantile(0.25)
q2=BF.Purchase.quantile(0.5)
q3=BF.Purchase.quantile(0.75)
iqr=q3-q1
Using the Interquartile Rule to Find Outliers

1. Calculate the interquartile range for the data.
2. Multiply the interquartile range (IQR) by 1.5 (a constant used to discern outliers).
3. Add 1.5 × IQR to the third quartile. Any number greater than this is a suspected outlier.
4. Subtract 1.5 × IQR from the first quartile. Any number less than this is a suspected outlier.
Why use 1.5 in the IQR rule?
• About 68.26% of the data lies within one standard deviation (σ) of the mean (μ), taking both sides into account (the pink region in the figure).
• About 95.44% of the data lies within two standard deviations (2σ) of the mean (μ), taking both sides into account (the pink+blue region in the figure).
• About 99.72% of the data lies within three standard deviations (3σ) of the mean (μ), taking both sides into account (the pink+blue+green region in the figure).
• The remaining 0.28% of the data lies outside three standard deviations (>3σ) of the mean (μ), taking both sides into account (the small red regions in the figure). This part of the data is considered outliers.
• The first and third quartiles, Q1 and Q3, lie at -0.675σ and +0.675σ from the mean, respectively.
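Putting these numbers together explains the 1.5 factor: for normal data the IQR spans from -0.675σ to +0.675σ, i.e. IQR = 1.35σ, so the upper fence Q3 + 1.5 × IQR = 0.675σ + 2.025σ = 2.7σ, just inside the ±3σ zone that marks outliers. A quick check (assuming scipy is available; not part of the original slides):

from scipy.stats import norm

q1, q3 = norm.ppf(0.25), norm.ppf(0.75)   # about -0.6745 and +0.6745
iqr = q3 - q1                             # about 1.349
upper_fence = q3 + 1.5 * iqr              # about +2.698 (in units of sigma)
lower_fence = q1 - 1.5 * iqr              # about -2.698
print(lower_fence, upper_fence)
print(2 * norm.sf(upper_fence))           # about 0.007: roughly 0.7% of normal data is flagged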
#Step-9-To find the lower cutoff (lc) and the upper cutoff (uc) of the values

lc = q1 - 1.5*iqr

uc = q3 + 1.5*iqr
#Step-10-Execute

lc

uc
Outlier - Condition
If lc < p0 → there are NO outliers on the lower side.
If uc > p100 → there are NO outliers on the higher side.

#Step-11-Execute

print("p0 = ", p0, ", p100 = ", p100, ", lc = ", lc, ", uc = ", uc)

Output

p0 = 12 , p100 = 23961 , lc = -3523.5 , uc = 21400.5

Clearly lc < p0, so there are no outliers on the lower side. But uc < p100, so there are outliers on the higher side.
#Step-12-Pictorial representation of the outliers by drawing the box plot

BF.Purchase.plot(kind='box')
Outlier Treatment

• Clip the values instead of removing them from the variable.
• During this process, replace values that fall outside the range with the lower or upper cutoff, as appropriate.
• Once the outliers are clipped, all of the data lies within the range.
#Step-13-Clipping all values greater than the upper cutoff to the upper cutoff:

BF.Purchase.clip(upper=uc)
#Step-14-To finally treat the outliers and make the changes permanent

BF.Purchase.clip(upper=uc, inplace=True)
BF.Purchase.plot(kind='box')
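Series.clip(inplace=True) works here, but newer pandas versions discourage inplace on a column Series; a minimal equivalent sketch that assigns the clipped result back (clipping both sides, with lc and uc from Steps 9-10):

BF['Purchase'] = BF['Purchase'].clip(lower=lc, upper=uc)  # lower clip is a no-op here since lc < p0
BF['Purchase'].plot(kind='box')
plt.show()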
#Step-15-Handling Missing Values

#Detecting the Missing Values

BF.isna()

#pandas.isna(obj) detects missing values for an array-like object. This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetime-likes).

#df.isna() returns True for the missing values and False for the non-missing values.
#Step-16-To find out the percentage of missing values in each variable

BF.isna().sum()/BF.shape[0]
Missing Value Treatment
To treat the missing values we can opt for a method from the following:

• Drop the variable
• Drop the observation(s)
• Missing Value Imputation
Missing Value Treatment
• 31.56% of the values for the variable Product Category 2 are
missing.
• We should not discard such a large number of observations or the
variable itself. Therefore, we will use imputation.
• In this process, missing data is imputed by replacing the missing
values with an appropriate value, which could be a constant, mean,
median, mode, or a predictive model output.
• Since Product Category 2 is a categorical variable, we will impute
the missing values using the mode.
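For reference, a minimal sketch (not from the slides) of the three common constant-value imputations on this column; mean and median suit numeric variables, mode suits categorical ones:

s = BF['Product_Category_2']

s.fillna(s.mean())     # mean imputation (numeric variables)
s.fillna(s.median())   # median imputation (numeric, robust to outliers)
s.fillna(s.mode()[0])  # mode imputation (categorical variables) - the choice used in Step 17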
#Step-17-Missing Value Treatment
#Execute 1st
BF.Product_Category_2.mode()[0]

BF.Product_Category_2.fillna(BF.Product_Category_2.mode()[0], inplace=True)

# Execute 2nd

BF.isna().sum()  #BF.isna().sum()/BF.shape[0]
#Step-17-Missing Value Treatment

For the variable Product_Category_3, 69.67% of the values are missing, which is a significant proportion. Therefore, we will drop this variable.

BF.dropna(axis=1, inplace=True)
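Note that dropna(axis=1) removes every column that still contains a missing value; after the Step-17 imputation only Product_Category_3 qualifies, so this works. A more explicit alternative (equivalent here) is to drop that column by name:

BF.drop(columns=['Product_Category_3'], inplace=True)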
#Step-18-Missing Value Treatment
How to check?

BF.dtypes
Univariate Analysis

In this type of analysis, charts are plotted for a single variable. These charts help visualize how the data is distributed and structured based on the variable type: categorical or numerical.
For continuous variables, we use box plots and histograms to examine the data distribution.
Syntax Examples (with declared file name = Train, variable name = Purchase):

1.
Train.Purchase.hist()
plt.show()

2.
Train.groupby('Variable_Name').Variable_Name.count().plot(kind='pie')  # or kind='barh'
plt.show()

3.
sns.countplot(x='Variable_Name', data=Train)
plt.show()
#Step-19-Distribution of Purchase
# Histogram

import matplotlib.pyplot as plt

BF.Purchase.hist()
plt.show()
For Categorical Variables
• To analyze the distribution (spread) of categorical variables, we use frequency plots such as bar charts and horizontal bar charts.
• To understand the composition (arrangement) of data, we use pie charts.
#Step-20-Composition of Gender
import matplotlib.pyplot as plt
print(BF['Gender'].value_counts())  # Print the counts

# Plot the pie chart
BF['Gender'].value_counts().plot(kind='pie', autopct='%1.1f%%')

# Save the figure as a PNG before showing it
plt.savefig("gender_pie_chart.png", dpi=300, bbox_inches='tight')

plt.show()
Gender
M 414259
F 135809
Name: count, dtype: int64
#Step-21-Composition of Gender

BF.groupby('Gender').Gender.count().plot(kind='pie')

plt.show()
#Step-22-Distribution of Marital_Status

import seaborn as sns

sns.countplot(x='Marital_Status',data=BF)

plt.show()
#Step-22A-Distribution of Marital_Status

sns.countplot(x='Marital_Status', data=BF, hue='Gender', palette='coolwarm')

plt.show()
#Step-22B-Distribution of Marital_Status
import seaborn as sns
import matplotlib.pyplot as plt

# Create count plot with hue for Gender
ax = sns.countplot(x='Marital_Status', data=BF, hue='Gender', palette='viridis')

# Display counts on top of bars
for bar in ax.containers:
    ax.bar_label(bar)

plt.show()
palette_options = [
    'viridis', 'magma', 'plasma', 'inferno',                    # Scientific color maps
    'coolwarm', 'RdYlBu', 'Spectral',                           # Diverging color maps
    'Blues', 'Reds', 'Greens', 'Purples',                       # Single-tone color maps
    'pastel', 'deep', 'muted', 'bright', 'dark', 'colorblind',  # Seaborn-themed palettes
]
print(palette_options)
#Step-23-Composition of City_Category

BF.groupby('City_Category').City_Category.count().plot(kind='pie')

plt.show()
#Step-23A-Composition of City_Category
import matplotlib.pyplot as plt

# Plot the pie chart with numbers
BF.groupby('City_Category').City_Category.count().plot(kind='pie', autopct='%1.1f%%')

plt.ylabel('')  # Removes the default y-axis label
plt.title("City Category Distribution")
plt.show()
Step-24-Distribution of Age

sns.countplot(x='Age',data=BF)

plt.show()
import seaborn as sns
import matplotlib.pyplot as plt

# Create the count plot with a color palette
ax = sns.countplot(x='Age', data=BF, palette='viridis')  # Change the palette if needed

# Add numbers on top of each bar
for bar in ax.containers:
    ax.bar_label(bar)

plt.title("Age Distribution")
plt.show()
#Step-25-Composition of Stay_In_Current_City_Years

import matplotlib.pyplot as plt

BF.groupby('Stay_In_Current_City_Years').City_Category.count().plot(kind='pie', autopct='%1.1f%%')

plt.show()
#Step-26-Distribution of Occupation

Chart = sns.countplot(x='Occupation', data=BF)
for container in Chart.containers:
    Chart.bar_label(container)
plt.show()
#Step-27-Distribution of Occupation

sns.countplot(x='Occupation',data=BF)

plt.show()
#Step-28-Distribution of Product_Category_1

BF.groupby('Product_Category_1').Product_Category_1.count().plot(kind='barh')

plt.show()
#Step-29-Histogram - Multiple
df = pd.DataFrame(BF, columns=['Purchase', 'Product_Category_1', 'Occupation'])

df.diff().hist(bins=15)  # diff() takes row-to-row differences; hist() then plots one histogram per column
#Step-30-Histogram with grid

BF.Purchase.plot(kind='hist', grid=True)

plt.show()
Bivariate Analysis
In this type of analysis, we consider two variables at a time and create charts based on them. Since there are two types of variables, categorical and numerical, bivariate analysis has three cases:

1. Numerical & Numerical
2. Numerical & Categorical
3. Categorical & Categorical
1.Numerical & Numerical

To examine the relationship between two variables, we create scatter plots and a correlation matrix with a heatmap overlay.
Scatter Plot
Our dataset contains only one truly numerical variable (Purchase), so we cannot create a scatter plot of two numerical variables.
How can we address this?
Consider a scenario where we treat all variables whose data types are int or float (including numerically coded categories such as Product_Category_1) as numerical variables.
#Step-31 # Considering 2 numerically coded categorical variables, Product_Category_1 and Product_Category_2

BF.plot(x='Product_Category_1', y='Product_Category_2', kind='scatter')

plt.show()
#Step-32 # The same scatter plot via matplotlib, with Product_Category_1 and Product_Category_2

plt.scatter(x=BF.Product_Category_1, y=BF.Product_Category_2)

plt.show()
#Step-33-Correlation Matrix: Finding the correlation among all numerical variables

BF.drop('User_ID', axis=1, inplace=True)

BF.select_dtypes(['float64', 'int64']).corr()

Output:

                     Occupation  Marital_Status  Product_Category_1  Product_Category_2  Purchase
Occupation             1.000000        0.024280           -0.007618            0.001566  0.020853
Marital_Status         0.024280        1.000000            0.019888            0.010260 -0.000599
Product_Category_1    -0.007618        0.019888            1.000000            0.279247 -0.347413
Product_Category_2     0.001566        0.010260            0.279247            1.000000 -0.131104
Purchase               0.020853       -0.000599           -0.347413           -0.131104  1.000000

#Step-34-Correlation: Finding the correlation between two specific numeric variables

BF['Marital_Status'].corr(BF['Occupation'])
#Step-35-Heatmap

Creating a heatmap using Seaborn on top of the correlation matrix helps visualize the relationships between numerical columns in the dataset.

!pip install seaborn --upgrade

sns.heatmap(BF.select_dtypes(['float64', 'int64']).corr(), annot=True)

plt.show()

#annot: If True, write the data value in each cell.
2.Numerical & Categorical

• To analyze the composition of the data, create bar and line charts.
• To compare two variables, create bar and line charts.
#Step-36 #Comparison between Purchase and Occupation: Bar Chart

BF.groupby('Occupation').Purchase.sum().plot(kind='bar')

plt.show()
#Step-37 #Comparison between Purchase and Occupation: Bar Chart (seaborn)

summary = BF.groupby('Occupation').Purchase.sum()

sns.barplot(x=summary.index, y=summary.values)

plt.show()
#Step-38-Comparison between Purchase and Age: Line Chart

BF.groupby('Age').Purchase.sum().plot(kind='line')

plt.show()
#Step-39-Comparison between Purchase and City_Category: Area Chart

BF.groupby('City_Category').Purchase.sum().plot(kind='area')

plt.show()

BF.groupby('City_Category').Purchase.sum().plot(kind='area', color='skyblue')

plt.show()

Replace 'skyblue' with any preferred color (e.g., 'red', 'green', '#FFA07A', etc.).
#Step-40-Comparison between Purchase and Marital_Status

sns.boxplot(x='Marital_Status', y='Purchase', data=BF)

plt.show()
Generate the Box Plot & Display Numerical Summary

# Load the Black Friday dataset
BF = pd.read_excel(r"D:\Mallie IV\AcademicYear2025_26\Black_Friday_3.xlsx")

# Create the box plot
plt.figure(figsize=(8, 5))
ax = sns.boxplot(x='Marital_Status', y='Purchase', data=BF)

# Calculate and display descriptive statistics
stats = BF.groupby('Marital_Status')['Purchase'].describe()
print(stats)  # Print numerical summary of the box plot

# Add median values to the plot
medians = BF.groupby('Marital_Status')['Purchase'].median()
for i, median in enumerate(medians):
    plt.text(i, median, f'{median:.0f}', horizontalalignment='center',
             fontsize=12, color='black', fontweight='bold')

plt.title("Box Plot of Marital Status vs. Purchase (Black Friday Data)")
plt.xlabel("Marital Status (0 = Single, 1 = Married)")
plt.ylabel("Purchase Amount")
plt.show()
Numerical Summary (from describe())

Marital Status   Count     Mean    Median (Q2)  Q1 (25%)  Q3 (75%)  Min   Max      Std Dev
Single (0)       324,731   ₹9,266  ₹8,044       ₹5,605    ₹12,061   ₹12   ₹23,961  ₹5,027
Married (1)      225,337   ₹9,261  ₹8,051       ₹5,843    ₹12,042   ₹12   ₹23,961  ₹5,017
How to Read the Box Plot?
• Median (Q2, 50th percentile): the horizontal line inside each box. Very close for both groups (₹8,044 vs. ₹8,051), so there is no significant difference in typical spending.
• Interquartile range (IQR: Q3 - Q1): the height of the box, from Q1 to Q3. Singles: ₹5,605 - ₹12,061; Married: ₹5,843 - ₹12,042, a similar spending range.
• Whiskers (min & max, excluding outliers): the lines extending from the box. Both groups have identical min (₹12) and max (₹23,961).
• Outliers: dots outside the whiskers. A few high-value purchases exist in both groups but are not significantly different.
• Box size (variability in data): the extent of each box is almost equal, meaning both groups have similar purchase behavior.
Key Insights
• Spending patterns between singles and married individuals are almost identical.
• The mean, median, and IQR values are very close for both groups.
• Both groups have similar spending variation (standard deviation ≈ ₹5,000).
• The max purchase amount is the same for both groups (₹23,961), which suggests that high spenders exist equally in both categories.
• The IQR range is slightly wider for singles (₹5,605 - ₹12,061) vs. married (₹5,843 - ₹12,042); this means married individuals have slightly more consistent spending behavior, but the difference is minor.
• There is no strong indication that marital status significantly influences purchase behavior.
• Since both groups have nearly identical distributions, other factors (like Age, Occupation, or City Category) may influence spending more than marital status.
Conclusion
• The box plot visually confirms that there is no major
difference in spending behavior between singles and
married individuals.
• We should focus on other variables (such as Age,
Product Category, or Income) to find stronger
patterns.
How to Interpret the Box Plot?
Each box represents the distribution of purchases for a specific marital status.
Key elements of the box plot:
• The box → shows the middle 50% of data (interquartile range, IQR).
• The line inside the box → represents the median purchase amount.
• Whiskers → indicate the range of most values, excluding outliers.
• Dots outside the whiskers → represent outliers, meaning unusually high or low purchases.
How to get the rupee symbol?
import seaborn as sns
import matplotlib.pyplot as plt

# Simple box plot
sns.boxplot(x=["Single", "Single", "Married", "Married"], y=[8000, 12000, 15000, 18000])

# Use Unicode escape \u20B9 for the ₹ symbol
plt.ylabel("\u20B9 Purchase")

plt.show()
3.Categorical & Categorical
To analyze the relationship between two variables, create a crosstab and overlay a heatmap.
Step 41: Relationship Between Age and Gender - Creating a crosstab to display data for Age and Gender

pd.crosstab(BF.Age, BF.Gender)
Gender      F       M
Age
0-17     5083   10019
18-25   24628   75032
26-35   50752  168835
36-45   27170   82843
46-50   13199   32502
51-55    9894   28607
55+      5083   16421
#Step-42-Heatmap: Creating a heat map on top of the crosstab

sns.heatmap(pd.crosstab(BF.Age, BF.Gender))

plt.show()
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create and print crosstab values


crosstab_data = pd.crosstab(BF['Age'], BF['Gender'])
print(crosstab_data)

# Create heatmap
sns.heatmap(crosstab_data, cmap="Blues")

plt.show()
Interpretation of Age vs. Gender Heatmap
• Dominant age group: 26-35 (Male: 168,835; Female: 50,752). This group has the highest number of buyers, especially males, suggesting that young working professionals are the biggest shoppers.
• Second-highest age group: 18-25 (Male: 75,032; Female: 24,628). The second most active group, likely consisting of college students and young employees; males significantly outnumber females.
• Older age groups (46-55, 55+): fewer buyers compared to younger groups. The number of shoppers decreases with age, indicating that middle-aged and senior customers shop less on Black Friday.
• Male vs. female comparison: males dominate all age groups. Across all age categories, the number of male buyers is consistently higher than female buyers, showing that males participate more in Black Friday shopping.
• Least active age group: 55+ (Male: 16,421; Female: 5,083). The senior group has the fewest shoppers, possibly due to lower tech adoption or different shopping preferences.
• Most balanced gender ratio: the 36-45 group (M: 82,843; F: 27,170). While males still outnumber females, the gender gap is slightly less pronounced in this category.
Frequency Distribution

1e8 is standard scientific notation, and here it indicates an overall scale factor for the y-axis. That is, if there's a 2 on the y-axis and a 1e8 at the top, the value at 2 actually indicates 2 * 1e8 = 2e8 = 2 × 10^8 = 200,000,000.
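If the 1e8 offset label is unwanted, matplotlib can be told to print plain tick labels; a minimal sketch (assuming the axis uses the default scalar formatter; the thousands-separator line is an optional alternative, not from the slides):

import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

ax = BF.groupby('Occupation').Purchase.sum().plot(kind='bar')
ax.ticklabel_format(axis='y', style='plain')   # disable the 1e8 offset notation
# Alternatively, show thousands separators on the y-axis ticks:
# ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.show()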
Workbook-2-Plotly Library

Dataset : World Happiness Report

Live Dashboards
Step-1
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
Step-2
WHR = pd.read_excel(r"D:\DataSet2024\World_Happiness_R22_4.xlsx")

print(WHR)

WHR.info()
Step-3 – Scatter Diagram

fig = px.scatter(WHR, x="Happiness_Score", y="GDP_Per_Capita", color='Country')

fig.show()
Step-4 – Scatter Diagram

fig = px.scatter(WHR, x="Happiness_Score", y="GDP_Per_Capita", color='Rank')

fig.show()
Step-5 – Scatter Diagram

fig = px.scatter(WHR, x="Happiness_Score", y="GDP_Per_Capita", color='PerceptionsOfCorruption')

fig.show()
Step-6 – Line Chart - Multiple

fig = px.line(WHR, x='Happiness_Score', y="Country")

fig.show()
Step-7 – Line Chart

fig = px.line(WHR, x='Happiness_Score', y=["GDP_Per_Capita", "Social_Support", 'Healthy_Life_Expectancy', 'FreedomOfChoices', "Generosity"])

fig.show()
Step-8 – Line Chart

fig = px.line(WHR, x='Happiness_Score', y='Social_Support', color='Country')

fig.show()
Step-9 – Bar Chart

fig = px.bar(WHR, x='Happiness_Score', y='Country')

fig.show()
Step-10 – Bar Chart

fig = px.bar(WHR, x='Country', y='Happiness_Score')

fig.show()
Step-11 – Bar Chart

fig = px.bar(WHR, x='Country', y='Happiness_Score', color='Rank')

fig.show()
Step-12 – Pie Chart

fig = px.pie(WHR, values='Happiness_Score', names='Rank', title='Happiness Report')

fig.show()
Step-13 – Histogram

fig = px.histogram(WHR, x="Happiness_Score", title="Happiness Report")

fig.show()
Step-14 – Histogram with Colour

fig = go.Figure(data=[go.Histogram(
    x=WHR.Happiness_Score,
    xbins=go.histogram.XBins(size=0),
    marker=go.histogram.Marker(color="green"),
)])

fig.show()
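The xbins size=0 above comes from the original slide; for explicit control of the bin width, a hedged variant (the 0.5 width is an arbitrary illustrative choice):

import plotly.graph_objects as go

fig = go.Figure(data=[go.Histogram(
    x=WHR.Happiness_Score,
    xbins=go.histogram.XBins(size=0.5),        # fixed bin width of 0.5
    marker=go.histogram.Marker(color="green"),
)])

fig.show()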
