Exploratory Data Analysis Day 1

Uploaded by

ANIL

Exploratory Data Analysis -

1. Overall Data Distribution, shape


2. Target Feature - Result
   1. Count plot - Yes/No
   2. Check the distribution - imbalanced or not? - **Insights** (we will tackle imbalance in the modelling notes, Notes-2)
1. Pair plot with all the numerical features
2. Separate the categorical and numerical features
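
The opening steps above (overall shape, target balance, splitting the features by type) could be sketched in pandas roughly as follows. This is a minimal sketch rather than the notes' own code: the toy data, every column name except `Result`, and the helper name `overview` are assumptions made for illustration.

```python
import pandas as pd

def overview(df: pd.DataFrame, target: str):
    """Return shape, target class shares, and feature names split by dtype."""
    # Share of each target class; a heavily lopsided split signals imbalance.
    shares = df[target].value_counts(normalize=True)
    features = df.drop(columns=[target])
    # Numeric dtypes go to one list; everything else is treated as categorical.
    num_cols = features.select_dtypes(include="number").columns.tolist()
    cat_cols = features.select_dtypes(exclude="number").columns.tolist()
    return df.shape, shares, num_cols, cat_cols

# Hypothetical toy data; only the target name "Result" comes from the notes.
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Mumbai", "Delhi", "Pune", "Delhi"],
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30.0, 42.5, 80.0, 65.0, 55.0, 38.0],
    "Result": ["No", "No", "Yes", "No", "Yes", "No"],
})

shape, shares, num_cols, cat_cols = overview(df, "Result")
print(shape)      # rows and columns
print(shares)     # fraction of Yes/No in the target
print(num_cols, cat_cols)
```

The class shares are the numbers behind the Yes/No count plot, so they can be read before any plotting is done.
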
3. Categorical -
   1. Univariate Analysis -
      1. Hygiene checks in the data
         a) Check the categories of each column - look for more categories than expected, e.g. a column that should only hold Grad/undergrad (convert it into those two categories)
         b) Suppose city is 60% Delhi and 1% Ahmedabad - try to merge the rare categories for later use
      2. Missing values - treat them with a method - mode (max frequency)/KNN imputer from sklearn (on encoded values)/an "Unknown" category
      3. Check for "?"/special characters - run value counts on each of the categorical columns, in a loop if you can
      4. Create a few count plots to show frequencies - run a loop to get all the plots in one go - **Insights**
   2. Bi-variate Analysis -
      1. Categorical to categorical (X1 v/s X2) - stacked bar plot
      2. Categorical to numerical (X1_cat v/s X2_num) - bar/swarm/violin/box plot - **Insights**
      3. Categorical to Target Feature (X1_cat v/s Target conversion) - stacked bar/swarm/violin/box plot - **Insights**
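
The categorical steps above (value counts in a loop, "?" handling, mode imputation, merging rare categories, category v/s target) could look roughly like this in pandas. It is a sketch under stated assumptions: the toy data, the helper name `clean_categorical`, the title-casing of labels, the `rare_share` cut-off and the "Other" bucket are all illustrative choices, not something the notes prescribe.

```python
import pandas as pd

def clean_categorical(df, cat_cols, rare_share=0.05, junk=("?", "")):
    """Apply the hygiene steps to categorical columns and return a copy."""
    out = df.copy()
    for col in cat_cols:
        # Normalise whitespace/case so "Grad" and "grad " become one category.
        s = out[col].astype("string").str.strip().str.title()
        # Treat placeholder symbols such as "?" as missing values.
        s = s.mask(s.isin(list(junk)))
        # Impute missing values with the mode (most frequent category).
        s = s.fillna(s.mode().iloc[0])
        # Merge categories rarer than `rare_share` into one "Other" bucket.
        freq = s.value_counts(normalize=True)
        rare = freq[freq < rare_share].index
        out[col] = s.where(~s.isin(rare), "Other").astype(object)
    return out

# Hypothetical toy data; only the target name "Result" comes from the notes.
df = pd.DataFrame({
    "education": ["Grad", "grad ", "Undergrad", "undergrad", "Grad",
                  "?", "Undergrad", "Grad", None, "Grad"],
    "city": ["Delhi", "Delhi", "Delhi", "Mumbai", "Delhi",
             "Delhi", "Ahmedabad", "Delhi", "Mumbai", "Delhi"],
    "Result": ["No", "Yes", "No", "No", "Yes",
               "No", "Yes", "No", "No", "Yes"],
})
cat_cols = ["education", "city"]

# Hygiene check: raw value counts for every categorical column in one loop.
for col in cat_cols:
    print(df[col].value_counts(dropna=False), "\n")

clean = clean_categorical(df, cat_cols, rare_share=0.15)

# Category v/s target: row-wise shares of Yes/No per category.
table = pd.crosstab(clean["education"], clean["Result"], normalize="index")
print(table)
```

With matplotlib installed, `table.plot(kind="bar", stacked=True)` would draw the stacked bar plot for this table.
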
4. Numerical -
   1. Univariate Analysis -
      1) Hygiene checks on the data
      2) Missing values - mean/median/KNN imputer/simple imputer
      3) Distribution and box plots with a loop - **Insights**
      4) Outliers - box plot - IQR method/percentile method (99%, 95%)
      5) Distribution and box plots with a loop again - verify the outliers are removed - **Insights**
      6) Skewness in the data - right-skewed - take a log, or else a square root
   2. Bi-variate Analysis -
      1. Correlation -
         a) Correlation between (X1_num v/s X2_num) - heatmap - **Insights**
         b) Scatter plots (X1_num v/s X2_num) - regplot - **Insights**
      2. Relation with target feature (X1_num v/s Target) - box/swarm/violin - **Insights**
      3. Relation with categorical feature (X1_num v/s X1_cat) - box/swarm/violin - **Insights**
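
The numerical steps above (median imputation, IQR outlier removal, a log transform for right skew, correlation) might be sketched as below. The toy data, the helper names, the 1.5 x IQR fence multiplier, the skewness threshold of 1.0 and the use of `np.log1p` instead of a plain log (so that zeros are handled) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def prepare_numeric(df, num_cols):
    """Impute with the median, then keep rows inside every column's fences."""
    out = df.copy()
    for col in num_cols:
        # The median is less sensitive to outliers than the mean.
        out[col] = out[col].fillna(out[col].median())
    keep = pd.Series(True, index=out.index)
    for col in num_cols:
        low, high = iqr_bounds(out[col])
        keep &= out[col].between(low, high)
    return out[keep].copy()

def reduce_right_skew(s, threshold=1.0):
    """Apply log1p when the sample skewness exceeds `threshold`."""
    return np.log1p(s) if s.skew() > threshold else s

# Hypothetical toy data with one missing value and one extreme income.
df = pd.DataFrame({
    "income": [30, 35, 40, 45, 50, 55, 60, 65, np.nan, 1000],
    "age": [22, 25, 28, 31, 34, 37, 40, 43, 46, 49],
})
num_cols = ["income", "age"]

clean = prepare_numeric(df, num_cols)
print(clean.describe())         # verify the extreme value is gone

# Pairwise correlation; this matrix is what a heatmap would display.
corr = clean[num_cols].corr()
print(corr)

# Skew handling shown on the imputed but unfiltered column for contrast.
raw_income = df["income"].fillna(df["income"].median())
logged = reduce_right_skew(raw_income)
print(raw_income.skew(), logged.skew())
```

For the percentile method, the fences would instead come from something like `s.quantile(0.01)` and `s.quantile(0.99)`.
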
5. Overall pair plot - try to see the separation between the target classes - create the distribution plots with the target as hue
