03a EDA

Uploaded by

Van loi Ha

Lesson 03 - 01

Exploratory Data Analysis

Understanding Your Data
Exploratory Data Analysis: 2

Week Overview
• Exploratory Data Analysis (EDA) Introduction
- What is EDA
- Why we need EDA and what we can do with it
- How to do EDA

• EDA Techniques
- Groups of EDA Methods
- Visualisation Methods

• EDA Tools
- Python Modules for EDA

• EDA Case Studies

Thinking
• How do you choose the most suitable algorithms for your dataset?
• How do you ensure you are ready to use machine learning techniques in a new project?
• Answer: Exploratory Data Analysis (EDA) helps to answer both questions

https://alastairrushworth.github.io/exploring_eda/EDA.html#1

Exploratory Data Analysis

• Exploratory Data Analysis (EDA) is a process for summarising, visualising, and becoming intimately familiar with the important characteristics of the data
• EDA is an iterative cycle:
- Generate questions about your data, for example:
  - What type of variation occurs within my variables?
  - What type of covariation occurs between my variables?
- Search for answers by visualising, transforming, and modelling your data
- Use what you learn to refine your questions and/or generate new questions

https://duo.com/labs/research/gamifying-data-science-education

• EDA is the statistician's way of storytelling: you explore the data, find patterns, and tell the insights

Key Concepts of Exploratory Data Analysis

• Four objectives of EDA
- Discover patterns
- Spot anomalies
- Frame hypotheses
- Check assumptions
• Typical activities during EDA
- Measures of central tendency: mean, median, mode
- Measures of spread: standard deviation, variance
- Shape of the distribution: distribution, trends
- Outliers
- Correlations
- Visual exploration
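
The activities listed above can be sketched with pandas. This is a minimal illustration only: the DataFrame, its column names, and the 1.5 * IQR outlier rule are assumptions made for the example, not part of the original slides.

```python
import pandas as pd

# Hypothetical toy data: two numeric features, one with an obvious extreme value.
df = pd.DataFrame({
    "height_cm": [150, 160, 160, 165, 170, 175, 250],
    "weight_kg": [50, 58, 60, 63, 68, 74, 80],
})
h = df["height_cm"]

# Central tendency
mean, median, mode = h.mean(), h.median(), h.mode().iloc[0]

# Spread (pandas uses the sample formula, ddof=1, by default)
std, var = h.std(), h.var()

# Outliers via the 1.5 * IQR rule (one common convention, not the only one)
q1, q3 = h.quantile(0.25), h.quantile(0.75)
iqr = q3 - q1
outliers = h[(h < q1 - 1.5 * iqr) | (h > q3 + 1.5 * iqr)]

# Correlation between the two numeric columns (Pearson by default)
corr = df["height_cm"].corr(df["weight_kg"])
```

Comparing `mean` with `median` is already informative here: the single extreme value pulls the mean well above the median.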

Making Sense of Data – Distinguish Types of Attributes

• The input takes the form of instances and attributes/features
- the input to a machine learning algorithm is a set of instances
- each instance is described by a fixed, predefined set of features or attributes
- attribute types: Numerical (Discrete and Continuous) and Nominal (Categorical)

Types of EDA Methods

• EDA methods are generally classified in two ways:
- graphical or non-graphical (quantitative): summarising data visually or by calculating summary statistics
- univariate or multivariate: summary statistics for each feature/attribute, or relationships between features

Non-graphical, univariate:
- Categorical variable: tabular representation of frequency
- Quantitative variable: location (mean, median), shape and spread, modality, outliers, …

Non-graphical, multivariate:
- One categorical variable and one quantitative variable: standard univariate non-graphical statistics for the quantitative variable, computed separately for each level of the categorical variable
- Two or more quantitative variables: correlation, covariance, …

Graphical, univariate:
- Categorical variable: bar chart
- Quantitative variable: histogram, boxplot, …

Graphical, multivariate:
- One categorical variable and one quantitative variable: side-by-side boxplots
- Two or more categorical variables: grouped bar chart
- Two or more quantitative variables: scatter plot, correlation heatmap, pairplot, …
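
The non-graphical methods above map onto a few pandas calls. The sketch below uses a made-up dataset (the column names `species`, `length`, and `width` are hypothetical) to show one univariate and two multivariate summaries.

```python
import pandas as pd

# Hypothetical toy dataset: one categorical and two quantitative variables.
df = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "length": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "width": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
})

# Univariate, categorical: frequency table
freq = df["species"].value_counts()

# Multivariate, categorical x quantitative: statistics per level of the category
per_level = df.groupby("species")["length"].agg(["mean", "median", "std"])

# Multivariate, quantitative x quantitative: correlation and covariance matrices
corr = df[["length", "width"]].corr()
cov = df[["length", "width"]].cov()
```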

How to do EDA
Steps/activities involved in EDA:
• identify variables and data types
• non-graphical and graphical univariate analysis
• bivariate/multivariate analysis, correlation analysis
• detect missing values and anomalies
• detect outliers

A typical example:

https://www.researchgate.net/publication/342282008_Exploratory_Data_Analysis_and_Data_Envelopment_Analysis_of_Construction_and_Demolition_Waste_Management_in_the_European_Economic_Area/figures?lo=1
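
As a rough sketch of the first steps (variable types, a univariate summary, missing values, duplicates), assuming a small hypothetical table with columns `region` and `sales`:

```python
import numpy as np
import pandas as pd

# Hypothetical raw table with one missing value and one duplicated row.
df = pd.DataFrame({
    "region": ["north", "south", "south", "east"],
    "sales": [8.0, 0.65, 0.65, np.nan],
})

# 1. Identify variables and data types
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# 2. Non-graphical univariate summary of the numeric columns
summary = df[numeric_cols].describe()

# 3. Missing values per column, and number of fully duplicated rows
missing = df.isna().sum()
n_duplicates = int(df.duplicated().sum())
```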

Visualisation
There are four basic presentation types:
• composition
• comparison
• distribution
• relationship

To determine which is best suited, ask:

• How many variables do you want to show in a single chart?
• How many data points will you display for each variable?
• Will you display values over a period of time, or among items or groups?

https://raw.githubusercontent.com/areski/python-nvd3
https://www.tatvic.com/blog/7-visualizations-learn-r/

Visual Aids
Common charts in EDA:
• Pie chart
• Histogram
• Bar & Stacked Bar Chart
• Box Plot & Violin plot
• Area Chart
• Scatter Plot
• Correlogram
• Heatmap

Pie chart
• Pie chart: a circle divided into sectors to communicate proportions
• a common method for representing categorical variables
• Advantages: simple and easy to understand; conveys information quickly
• Disadvantages: difficult to compare pieces with one another; unhelpful when observing trends over time

Histogram
• Histogram: a plot of the frequency distribution of a numeric variable, built by splitting its values into small equal-sized bins; it provides a visual summary of central tendency, spread, and shape.

https://stackoverflow.com/questions/37911731/seaborn-histogram-with-4-panels-2-x-2-in-python
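
The binning step behind a histogram can be shown without any plotting. The sketch below uses a made-up sample and an arbitrary choice of four bins.

```python
import numpy as np

# Hypothetical sample of a numeric variable.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range [min, max] into 4 equal-width bins and count the values in each.
# Every bin is half-open [left, right) except the last, which includes its right edge.
counts, edges = np.histogram(values, bins=4)
```

Passing the same `values` and `bins` to `matplotlib.pyplot.hist` draws these counts as bars.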

Shape of Data - Skewness and Kurtosis

• a histogram is effective for showing both the skewness and the kurtosis of a data set
• skewness is a measure of the lack of symmetry
• a distribution, or data set, is symmetric if it looks the same to the left and right of the center point
• symmetric data should have a skewness near zero; negative values indicate data that are skewed left, and positive values indicate data that are skewed right
• the type of peak of the distribution is characterised by a measure called kurtosis
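
A small check of the sign convention, using `scipy.stats.skew` and `scipy.stats.kurtosis` on made-up arrays (the data are illustrative assumptions):

```python
import numpy as np
from scipy import stats

symmetric = np.array([1, 2, 3, 4, 5], dtype=float)
right_skewed = np.array([1, 1, 2, 2, 3, 10], dtype=float)  # long tail to the right
left_skewed = -right_skewed                                 # mirror image

skew_sym = stats.skew(symmetric)       # near zero for symmetric data
skew_right = stats.skew(right_skewed)  # positive
skew_left = stats.skew(left_skewed)    # negative

# By default scipy reports excess kurtosis (fisher=True), which is 0 for a normal distribution.
kurt_sym = stats.kurtosis(symmetric)
```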

Bar & Stacked Bar Chart

• Bar charts: a way of summarising a set of categorical data using bars; each bar represents a particular category, and its height is proportional to a specific aggregation (for example, a count or a sum)
• Bars can be horizontal or vertical

https://chartio.com/learn/charts/bar-chart-complete-guide/

Box Plot & Violin Plot

• Box plot: also called a box-and-whisker plot; displays a summary of a large amount of data in five numbers (minimum, first quartile, median, third quartile, maximum), and gives a good indication of how the values in the data are spread out within groups
• plots a combination of categorical and continuous variables
• Violin plot: similar to a box plot; additionally shows the kernel density estimate of the underlying distribution

https://medium.com/analytics-vidhya/exploratory-data-analysis-uni-variate-analysis-of-iris-data-set-690c87a5cd40
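
The five numbers a box plot draws can be computed directly. In this sketch the data are made up, and the 1.5 * IQR whisker rule is the common plotting convention, assumed here rather than taken from the slides.

```python
import numpy as np

values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Five-number summary: minimum, first quartile, median, third quartile, maximum
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

# Points beyond 1.5 * IQR from the box are usually drawn as individual outlier markers.
iqr = q3 - q1
flagged = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
```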

Area Chart/Stacked Chart

• based on the line chart; the area between the axis and the line is commonly emphasised with colour
• shares features with bar charts and line charts; compares two or more quantities, and works better for large differences and multiple values over time

https://levelup.gitconnected.com/data-visualization-with-pandas-in-action-part-2-2cc8674da1d0

Scatter Plot
• uses a Cartesian coordinate system to display the values of two variables for a set of data
• shows the relationship between two variables; also referred to as a correlation plot

Correlogram
• Correlogram: also known as a correlation matrix; used to analyse the relationship between each pair of numeric variables

HeatMap
• Heatmap: a two-dimensional graphical representation of data in which the individual values contained in a matrix are represented as colors
• useful for seeing which intersections of the categorical values have a higher concentration of data than the others
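
The matrix behind such a heatmap is often a cross-tabulation of two categorical variables. A minimal sketch, assuming hypothetical columns `day` and `shift`:

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Tue", "Wed"],
    "shift": ["am", "pm", "am", "am", "pm", "pm"],
})

# Count how many rows fall into each (day, shift) intersection.
counts = pd.crosstab(df["day"], df["shift"])

# The intersection with the highest concentration of data
densest = counts.stack().idxmax()
```

The `counts` matrix can then be handed to a plotting function such as `matplotlib.pyplot.imshow` or `seaborn.heatmap` to colour each cell by its value.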

Choose the Most Suitable Plots

https://www.tatvic.com/blog/7-visualizations-learn-r/

Clustering Analysis for EDA

• Clustering can be used in EDA to find new insights
- K-Means clustering: can be used to detect possible outliers
- Hierarchical clustering: can be used to find underlying connectivity properties

https://www.neuroelectrics.com/blog/2018/07/06/clustering-methods-in-exploratory-analysis/
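
One simple way to use K-Means for spotting possible outliers is to look at each point's distance to its own cluster centre. The sketch below is an illustration under assumptions: the synthetic data, the choice of two clusters, and the "largest distance" heuristic are all made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight synthetic blobs plus one far-away point (index 40).
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(20, 2))
X = np.vstack([blob_a, blob_b, [[2.5, 12.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of every point to the centre of the cluster it was assigned to
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
suspect = int(np.argmax(dist))  # index of the point farthest from its centre
```

Points flagged this way are only candidates; whether they are genuine outliers still needs to be judged against the data and the domain.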

Dimensionality Reduction for EDA

• Reducing the data to fewer dimensions helps to describe the relationships between variables
• Examples: t-distributed stochastic neighbor embedding (t-SNE) and Principal Component Analysis (PCA)

https://www.programmersought.com/article/92363395092/
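
A minimal PCA sketch with scikit-learn, assuming synthetic data in which three features are driven by a single hidden variable:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Three features that are almost linear functions of one latent variable t.
X = np.column_stack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(200, 3))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)           # 200 rows, now described by 2 components
ratio = pca.explained_variance_ratio_  # share of variance captured by each component
```

Here the first component captures almost all of the variance, which tells you the three original features are largely redundant.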

What to look for in your plots?

• Turn the information into useful questions
o Which values are the most common? Why?
o Which values are rare? Why?
o Can you see any unusual patterns? What might explain them?

• Clusters suggest that subgroups exist in your data.

o How can you explain or describe the clusters?
o How are the observations within each cluster similar to each other?
o How are the observations in separate clusters different from each other?

EDA and Data Preprocessing

EDA helps us to prepare the data for the upcoming tasks, e.g. data preprocessing and modelling

EDA question                    Data preprocessing step
Duplicate data points?          Delete duplicate data points
Missing values?                 Imputation
Outliers?                       Exclude outliers
Highly correlated features?     Handle highly correlated features
Low-variance features?          Handle low-variance features
Imbalanced data?                Oversampling, undersampling
Features vary in their scale?   Feature scaling
High-dimensional data?          Dimensionality reduction, feature selection
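
Several of the EDA questions in the left-hand column can be answered with one-line pandas checks. The DataFrame and column names below are hypothetical, and the checks only detect the issue; the matching preprocessing step is a separate decision.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0, 8.0],   # perfectly correlated with "a"
    "c": [7.0, 7.0, 7.0, 7.0, 7.0],   # zero variance
    "label": [0, 0, 0, 1, 1],
})
features = df[["a", "b", "c"]]

has_duplicates = bool(df.duplicated().any())        # duplicate data points?
missing_per_column = df.isna().sum()                # missing values?
corr_ab = df["a"].corr(df["b"])                     # highly correlated features?
low_variance = features.columns[features.var() == 0].tolist()  # low-variance features?
class_balance = df["label"].value_counts(normalize=True)       # imbalanced data?
scale_ranges = features.max() - features.min()      # do features vary in scale?
```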

EDA and Feature Selection

https://www.visual-design.net/post/feature-selection-and-eda-in-machine-learning

Be Colorblind-friendly - Make Your Charts Accessible

• Avoid using red and green, or blue and yellow, together
• If you must, leverage light vs. dark and offer alternate methods of distinguishing data
• Avoid the stoplight palette
• Use a colorblind-friendly palette:
https://colorbrewer2.org/#type=diverging&scheme=BrBG&n=5
Color wheel: https://graf1x.com/the-color-wheel-chart-poster/

https://www.tableau.com/about/blog/examining-data-viz-rules-dont-use-red-green-together
Lesson 03 - 02

Exploratory Data Analysis Tools

Python Modules

EDA Tools
Some open source tools that facilitate EDA:
• Python: an open source programming language widely used in data analysis, data mining, and data science
• R: an open source programming language widely used for statistical computation and graphical data analysis; provides packages such as ggplot2 for data visualisation
• Weka: an open source data mining package that includes several EDA tools and algorithms
• Orange: an open source, workbench-style tool for data analysis

Python Essential Modules for EDA

• Brief Overview
- Numerical Python (NumPy): matrix / numerical analysis layer, the fundamental library; its key tool is the ndarray
- Scientific Python (SciPy): scientific computing utilities/functions (linalg, mathematical approximation, etc.)
- Scikit-learn (sklearn): machine learning toolbox
- Matplotlib: plotting and visualization
- Pandas: data manipulation and analysis, with powerful data structures including DataFrame and Series
Many others:
‣ OpenCV: computer vision
‣ Statsmodels: statistics in Python
‣ Seaborn, Bokeh, Plotly: plotting and visualisation
‣ NLTK, Gensim: NLP tools
‣ Theano, Caffe, PyTorch, Lasagne: deep learning
‣ PyBrain, Pylearn2: machine learning
‣ DEAP, NEAT-Python: evolutionary computation

Software - Anaconda
• What is Anaconda (Miniconda)?
- An essential, large (~400 MB) Python installation
- It contains almost everything you need for basic machine learning and data analysis
• Why use Anaconda?
- makes it easy to install the version of Python you need
- provides high-performance computing with Anaconda Accelerate and several other components
- removes the bottlenecks involved in installing the right packages while taking their compatibility into consideration
- no risk of messing up required system libraries
- over 7500 open source packages, many of which are not in the pip repo

Software - IDEs
• Recommended IDEs:
- IPython (Jupyter Notebook)
‣ IPython stands for “Interactive Python”
‣ Lets you write Markdown alongside your code
‣ More reasons can be found at: http://pythonforengineers.com/why-ipython-is-the-best-thing-since-sliced-bread/

- PyCharm EDU
‣ A really powerful IDE
‣ Many good features (file browser, intelligent auto-completion, running code in the IDE with input and console windows for trial-and-error development)
‣ free to download here: https://www.jetbrains.com/pycharm-edu/

- RStudio: an integrated development environment for R

Using Python Modules

• Python libraries are called modules
• Each module needs to be imported before use.
• Three common alternatives:
- Import the full module:
‣ import numpy
- Import selected functions from the module:
‣ from numpy import array, sin, cos
- Import all functions from the module (not recommended):
‣ from numpy import *

• Any module can be imported under a shorter alias:
- e.g., import numpy as np
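
The import styles above can be compared side by side. This short sketch (the variable names are illustrative) shows that they all give access to the same functions; only the way you refer to them changes.

```python
import numpy                       # full module: names are accessed as numpy.<name>
from numpy import array, sin, cos  # selected names are available directly
import numpy as np                 # alias: names are accessed as np.<name>

angles = array([0.0, np.pi / 2])

# sin^2 + cos^2 = 1, mixing all three import styles in one expression
identity_holds = bool(numpy.allclose(sin(angles) ** 2 + cos(angles) ** 2, 1.0))
```

The wildcard form `from numpy import *` is left out on purpose: it silently overwrites built-in names such as `sum` and `abs`, which is one reason it is not recommended.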

Numpy
• NumPy is the foundation for scientific computing in Python
• It has many features:
- a powerful array object for large, multi-dimensional arrays
- basic linear algebra functions
- basic Fourier transforms
- sophisticated random number capabilities
- tools for integrating C/C++ and Fortran code

• More information at the NumPy web page: http://www.numpy.org/

Numpy for EDA

- Data manipulation: NumPy provides powerful array objects that let you manipulate and process data efficiently, e.g., filtering, slicing, indexing, and reshaping, making it easier to extract relevant information from your dataset.

- Mathematical operations: NumPy supports various mathematical operations, enabling you to apply element-wise arithmetic, logarithmic, trigonometric, and other mathematical functions to arrays of data.

- Descriptive statistics: NumPy offers statistical functions to compute descriptive statistics on your data, e.g., mean, median, standard deviation, variance, min, max, percentiles, etc.

- Data visualization: NumPy plays an essential role in preparing data for visualization libraries like Matplotlib and Seaborn.
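
A compact sketch of these uses on a small made-up array (the values and variable names are illustrative assumptions):

```python
import numpy as np

data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# Data manipulation: boolean filtering, slicing, reshaping
large_rows = data[data[:, 0] > 2]  # rows whose first column exceeds 2
first_col = data[:, 0]
flat = data.reshape(-1)

# Element-wise mathematical operations
log_second = np.log10(data[:, 1])

# Descriptive statistics, computed per column (axis=0)
col_mean = data.mean(axis=0)
col_median = np.median(data, axis=0)
col_std = data.std(axis=0, ddof=1)  # ddof=1 gives the sample standard deviation
p90 = np.percentile(first_col, 90)
```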

SciPy and SciPy for EDA

SciPy (Scientific Python) is a collection of mathematical algorithms and convenience functions built upon NumPy
• offers more functionality than NumPy alone
• a powerful tool for advanced scientific computing and data analysis
• widely used in various scientific and engineering disciplines, including physics, chemistry, biology, economics, and data science
SciPy for EDA
• statistical tests, e.g., t-tests, chi-square tests, ANOVA, and correlation tests, to assess relationships between variables, identify significant differences in data, and test hypotheses to validate assumptions and draw conclusions about the data
• outlier detection (scipy.stats.zscore)
• distance calculations (scipy.spatial.distance), …
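
The three uses listed above, sketched on made-up samples. The |z| > 2.5 cut-off is an assumed convention for this example, not a SciPy default.

```python
import numpy as np
from scipy import stats
from scipy.spatial import distance

# Outlier detection with z-scores
x = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0, 10.0, 40.0])
z = stats.zscore(x)
outlier_idx = np.where(np.abs(z) > 2.5)[0]

# Two-sample t-test: do the two groups have different means?
group_a = np.array([5.1, 4.9, 5.0, 5.2, 4.8])
group_b = np.array([6.1, 5.9, 6.0, 6.2, 5.8])
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Distance between two observations
d = distance.euclidean([0.0, 0.0], [3.0, 4.0])
```

A small `p_value` suggests the difference between the group means is unlikely to be due to chance alone.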

Pandas
• Pandas (Python Data Analysis Library) is a Python library that provides easy-to-use data structures and data analysis tools for data manipulation, cleaning, exploration, and preparation tasks in data science and data analysis
• built on NumPy, with optional use of SciPy and Matplotlib
• many features:
- support for CSV, Excel, JSON, SQL, HDF5, …
- powerful data structures: DataFrame and Series
- data cleansing
- reshaping, merging (joins), and pivoting data
- data visualisation
- database-like operations: filtering, sorting, grouping, aggregating, merging, and joining
• more information at the Pandas web page: https://pandas.pydata.org/

Pandas for EDA

- Data inspection and summary: Pandas provides methods like head(), tail(), info(), and describe(), plus the shape attribute, to quickly inspect the data and get a summary of the dataset, including data types, missing values, and basic statistics

- Filtering and selection: select data based on conditions using boolean indexing, with loc[], iloc[], and boolean expressions to extract relevant subsets of data

- Grouping and aggregation: groupby() groups data by one or more columns so that you can apply aggregate functions like sum(), mean(), count(), etc., to obtain insights and summaries

- Reshaping: reshape and pivot data using methods like pivot(), melt(), stack(), and unstack()

- Plotting: a simple interface for creating basic visualizations directly from DataFrames and Series using the .plot() method
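
The methods named above, applied to a tiny hypothetical DataFrame (the columns `team`, `year`, and `score` are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "year": [2020, 2021, 2020, 2021],
    "score": [1.0, 3.0, 5.0, 7.0],
})

# Inspection and summary
shape = df.shape
top = df.head(2)
summary = df.describe()

# Filtering and selection
high = df.loc[df["score"] > 2, ["team", "score"]]  # label-based, with a boolean mask
first_row = df.iloc[0]                             # position-based

# Grouping and aggregation
mean_by_team = df.groupby("team")["score"].mean()

# Reshaping: long -> wide with pivot(), wide -> long with melt()
wide = df.pivot(index="team", columns="year", values="score")
long = wide.reset_index().melt(id_vars="team", value_name="score")
```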

Matplotlib and Seaborn

• Matplotlib is a comprehensive graphics library for generating scientific figures
- high-quality output in many formats, including PNG, PDF, SVG, EPS, and PGF
- works efficiently with data frames and arrays
- provides a GUI for interactively exploring figures, and also supports figure generation without the GUI
- integrates well with Jupyter notebooks
- supports 3D plots
• More information at the Matplotlib web page: http://matplotlib.org/
• Seaborn: another Python data visualization library, based on Matplotlib
o has a polished default style that gives better-looking figures
o more convenient when using a pandas DataFrame
o 3D is not supported
• For basic plots, use Matplotlib; for more advanced statistical visualizations, use Seaborn
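
A minimal Matplotlib sketch that draws a histogram and a scatter plot side by side without opening a GUI window. The synthetic data and the titles are assumptions for illustration; the `Agg` backend is chosen only so the code runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: figures are rendered off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

ax_hist.hist(x, bins=10)            # distribution of a single variable
ax_hist.set_title("Distribution of x")

ax_scatter.scatter(x, y, s=10)      # relationship between two variables
ax_scatter.set_title("x against y")

fig.tight_layout()
```

Calling `fig.savefig("eda.png")` afterwards would write the figure to a file; the filename is arbitrary.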

Scikit-learn
• What is Scikit-learn in Python?
- the name comes from SciPy Toolkit
- Scikit-learn (sklearn): an open source library that provides a consistent API for using traditional state-of-the-art ML algorithms/methods in Python
- mostly written in Python; some of sklearn's internal algorithms are written in Cython using third-party bindings
- sklearn's major Python dependencies are SciPy, NumPy, and Matplotlib

Dependency stack (top to bottom): Scikit-learn; SciPy and Matplotlib; NumPy; Python

Scikit-learn - Overview

http://scikit-learn.org/stable/index.html

Scikit-learn – APIs
• Estimators and Meta-estimators:
- Estimators: define the instantiation mechanism of objects and expose a fit method for learning a model from training data, e.g., LinearRegression
- Meta-estimators: combine one or more estimators into a single estimator, e.g., ensembles such as random forests (RF)

• Predictors: expose a predict method that takes an array X_test and produces predictions for X_test, based on the learned parameters of the estimator

• Transformers
- modify or filter data before it is fed to a learning algorithm; the transformer interface defines a transform method, e.g., StandardScaler()
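
The three interfaces can be seen on a toy problem. The training data below follow y = 2x + 1 and are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])  # generated from y = 2x + 1
X_test = np.array([[5.0]])

# Estimator: fit() learns the parameters from training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predictor: predict() uses the learned parameters on new data
y_pred = model.predict(X_test)

# Transformer: fit() learns the mean and standard deviation, transform() applies them
scaler = StandardScaler()
X_scaled = scaler.fit(X_train).transform(X_train)
```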

Scikit-learn – APIs
• Pipelines and Feature Unions
- A distinguishing feature of the Scikit-learn API is its ability to compose new estimators from several base estimators, in two ways:
‣ Pipeline objects chain multiple estimators into a single one
‣ FeatureUnion objects combine multiple transformers into a single one that concatenates their outputs
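
A minimal Pipeline sketch that chains a transformer and a final estimator. The step names `scale` and `clf` and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, two well-separated classes.
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0],
              [8.0, 900.0], [9.0, 950.0], [10.0, 880.0]])
y = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline([
    ("scale", StandardScaler()),    # transformer step
    ("clf", LogisticRegression()),  # final estimator
])

pipe.fit(X, y)          # fits the scaler, transforms X, then fits the classifier
pred = pipe.predict(X)  # applies the same scaling before predicting
```

Because the pipeline behaves like a single estimator, it can be passed to tools such as cross-validation, which helps keep the scaling from being fitted on held-out data.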

Scikit-learn – Feature Union

• applies different transformers to the same input data in parallel and concatenates their outputs
• a FeatureUnion takes a list of transformers as input. Calling fit on the union fits each transformer independently; calling transform applies them in parallel and joins their outputs
For example:
- Join two feature transformers (i.e., PCA and univariate feature selection) to construct combined features
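
The example above can be sketched as follows. `SelectKBest` with `f_classif` is used here as one form of univariate feature selection; that choice, the synthetic data, and the step names are assumptions for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(int)  # the label depends on the first feature only

union = FeatureUnion([
    ("pca", PCA(n_components=2)),            # contributes 2 columns
    ("kbest", SelectKBest(f_classif, k=1)),  # contributes 1 column
])

# Each transformer is fitted on the same X; the outputs are concatenated column-wise.
X_combined = union.fit_transform(X, y)
```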

Scikit-learn for EDA

• Data Preprocessing: Scikit-learn provides preprocessing techniques, e.g.,
scaling, normalization, handling missing values, and encoding categorical
variables, to ensure data is in a suitable format for analysis.
• Dimensionality Reduction: Scikit-learn offers techniques like Principal
Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding
(t-SNE) for dimensionality reduction.
• Feature Selection: identifying the most relevant features for modeling or
understanding the relationships between variables.
• Unsupervised Learning: clustering algorithms (e.g., KMeans) for discovering
patterns and structures.

Scikit-learn Cheat Sheet

https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463