03a EDA

Uploaded by

Van loi Ha
Lesson 03 - 01

Exploratory Data Analysis


Understanding Your Data
Exploratory Data Analysis: 2

Week Overview
• Exploratory Data Analysis (EDA) Introduction
- What is EDA
- Why we need EDA and what we can do with it
- How to do EDA

• EDA Techniques
- Groups of EDA Methods
- Visualisation Methods

• EDA Tools
- Python Modules for EDA

• EDA Case Studies

Thinking
• How to choose the most suitable algorithms for your dataset?
• How to ensure you are ready to use machine learning techniques in a new project?
• Answer: Exploratory Data Analysis (EDA) helps to answer these questions

https://alastairrushworth.github.io/exploring_eda/EDA.html#1

Exploratory Data Analysis


• Exploratory Data Analysis (EDA) is a process for summarising, visualising, and becoming intimately familiar with the important characteristics of the data
• EDA is an iterative cycle:
- Generate questions about your data, for example:
o What type of variation occurs within my variables?
o What type of covariation occurs between my variables?
- Search for answers by visualising, transforming, and modelling your data
- Use what you learn to refine your questions and/or generate new questions
• EDA is the statistician's way of storytelling: you explore the data, find patterns, and report insights

https://duo.com/labs/research/gamifying-data-science-education

Key Concepts of Exploratory Data Analysis


• 4 Objectives of EDA
- Discover Patterns
- Spot Anomalies
- Frame Hypotheses
- Check Assumptions
• Typical activities during EDA
- Measures of central tendency: mean, median, mode
- Measures of spread: standard deviation, variance
- Shape of the distribution: skewness, kurtosis, trends
- Outliers
- Correlations
- Visual Exploration
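
The measures of central tendency and spread listed above can be computed directly with pandas. The following is a minimal sketch; the `scores` data are hypothetical and chosen so that one extreme value pulls the mean above the median.

```python
import pandas as pd

# Hypothetical sample: exam scores with one unusually high value.
scores = pd.Series([52, 55, 55, 58, 60, 61, 63, 98], name="score")

summary = {
    "mean": scores.mean(),
    "median": scores.median(),
    "mode": scores.mode().iloc[0],   # mode() returns a Series; take the first
    "std": scores.std(),             # sample standard deviation (ddof=1)
    "variance": scores.var(),        # sample variance (ddof=1)
}
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

Comparing the mean with the median is a quick first hint that an outlier or skew may be present.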

Making Sense of Data – Distinguish Types of Attributes

• Input to a machine learning learner takes the form of instances and attributes/features
- the information given to the learner is a set of instances
- each instance is described by a fixed, predefined set of features or attributes
- Types: Numerical (Discrete and Continuous) and Nominal (Categorical)
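
As a rough illustration of distinguishing attribute types, the sketch below separates numerical from nominal columns in a pandas DataFrame. The column names and values are hypothetical, and marking `district` as a `category` dtype is an assumption made for the example.

```python
import pandas as pd

# Hypothetical dataset: each row is an instance, each column an attribute.
df = pd.DataFrame({
    "rooms": [2, 3, 4, 3],                              # numerical, discrete
    "area_m2": [54.5, 71.0, 98.2, 80.3],                # numerical, continuous
    "district": ["north", "south", "north", "east"],    # nominal
})
df["district"] = df["district"].astype("category")

numeric_cols = df.select_dtypes(include="number").columns.tolist()
nominal_cols = df.select_dtypes(include="category").columns.tolist()

print(df.dtypes)
print("numerical:", numeric_cols)
print("nominal:", nominal_cols)
```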

Types of EDA Methods

• EDA methods are generally classified along two dimensions
- graphical or non-graphical (quantitative): summarising the data visually, or calculating summary statistics
- univariate or multivariate: summarising each feature/attribute on its own, or finding relationships between features

Non-graphical, univariate
- Categorical variable: tabular representation of frequency
- Quantitative variable: location (mean, median), shape and spread, modality, outliers, …

Non-graphical, multivariate
- One categorical variable and one quantitative variable: standard univariate non-graphical statistics for the quantitative variable, computed separately for each level of the categorical variable
- Two or more quantitative variables: correlation, covariance, …

Graphical, univariate
- Categorical variable: bar chart
- Quantitative variable: histogram, boxplot, …

Graphical, multivariate
- One categorical variable and one quantitative variable: side-by-side boxplots
- Two or more categorical variables: grouped bar chart
- Two or more quantitative variables: scatter plot, correlation heatmap, pairplot, …
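
The non-graphical methods in this classification map onto a few pandas calls. This is a minimal sketch on hypothetical data (the `species`, `length` and `width` columns are made up for illustration).

```python
import pandas as pd

# Hypothetical data: one categorical and two quantitative variables.
df = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "length": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "width": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
})

# Univariate, categorical: frequency table.
freq = df["species"].value_counts()

# Multivariate, categorical + quantitative: statistics per level.
per_level = df.groupby("species")["length"].agg(["mean", "median", "std"])

# Multivariate, quantitative: Pearson correlation and covariance.
corr = df[["length", "width"]].corr()
cov = df[["length", "width"]].cov()

print(freq, per_level, corr, cov, sep="\n\n")
```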

How to do EDA
Steps/activities involved in EDA:
• identification of variables and data types
• non-graphical and graphical univariate analysis
• bi-/multivariate analysis, correlation analysis
• detect missing values and anomalies
• detect outliers

A typical example:

https://www.researchgate.net/publication/342282008_Exploratory_Data_Analysis_and_Data_Envelopment_Analysis_of_Construction_and_Demolition_Waste_Management_in_the_European_Economic_Area/figures?lo=1
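
Several of the steps listed above (data types, missing values, outliers) can be sketched in a few lines of pandas. The data are hypothetical, and the 1.5 × IQR rule used here is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one extreme value.
df = pd.DataFrame({
    "age": [23, 25, 24, np.nan, 26, 25, 90],
    "city": ["A", "B", "A", "B", "A", "B", "A"],
})

print(df.dtypes)             # variables and their data types
missing = df.isna().sum()    # number of missing values per column

# Tukey's rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
outliers = df.loc[is_outlier, "age"]

print(missing, outliers, sep="\n")
```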

Visualisation
There are four basic presentation types
• composition
• comparison
• distribution
• relationship

To determine which is best suited
• How many variables will be shown in a single chart?
• How many data points will be displayed for each variable?
• Will you display values over a period of time, or among items or groups?

https://raw.githubusercontent.com/areski/python-nvd3
https://www.tatvic.com/blog/7-visualizations-learn-r/

Visual Aids
Common charts in EDA:
• Pie chart
• Histogram
• Bar & Stacked Bar Chart
• Box Plot & Violin plot
• Area Chart
• Scatter Plot
• Correlogram
• Heatmap

Pie chart
• Pie chart: a circle divided into sectors to communicate proportions
• a common method for representing categorical variables
• Advantages:
- simple and easy to understand
- conveys information quickly
• Disadvantages:
- difficult to compare individual pieces
- unhelpful when observing trends over time

Histogram
• Histogram: a plot of the frequency distribution of a numeric variable, obtained by splitting its values into small equal-sized bins; it provides a visual summary of central tendency, spread, and shape.

https://stackoverflow.com/questions/37911731/seaborn-histogram-with-4-panels-2-x-2-in-python
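
The binning behind a histogram can be illustrated without any plotting library. The sketch below uses `numpy.histogram` on hypothetical values and prints a text version of the bars; the choice of four bins is arbitrary.

```python
import numpy as np

# Hypothetical measurements.
values = np.array([1.0, 1.5, 2.0, 2.2, 2.8, 3.1, 3.3, 3.9, 4.0, 7.5])

# Four equal-width bins spanning the range of the data.
counts, edges = np.histogram(values, bins=4)

# Each bin is half-open [left, right), except the last, which includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.2f} to {right:5.2f}: {'#' * int(count)}")
```

The empty third bin and the isolated value in the last bin make the gap in the data visible at a glance.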

Shape of Data - Skewness and Kurtosis


• a histogram is effective for showing both the skewness and the kurtosis of a data set
• skewness is a measure of the lack of symmetry
• a distribution, or data set, is symmetric if it looks the same to the left and right of the center point
• symmetric data should have a skewness near zero; negative values indicate data that are skewed left, and positive values indicate data that are skewed right
• the type of peak of the distribution can be characterized by a measurement called kurtosis
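
Skewness and kurtosis can also be computed numerically. This is a minimal sketch using `scipy.stats` on synthetic samples; the sample sizes and random seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric = rng.normal(size=5000)            # roughly symmetric
right_skewed = rng.exponential(size=5000)    # long right tail

for name, sample in [("normal", symmetric), ("exponential", right_skewed)]:
    skew = stats.skew(sample)
    # kurtosis() defaults to Fisher's definition (excess kurtosis):
    # a normal distribution scores about 0.
    kurt = stats.kurtosis(sample)
    print(f"{name:>12}: skewness={skew:+.2f}, excess kurtosis={kurt:+.2f}")
```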

Bar & Stacked Bar Chart


• Bar chart: a way of summarizing a set of categorical data. It displays the data as bars, each representing a particular category, with a height proportional to a specific aggregation (e.g., a count or a sum)
• Bars can be horizontal or vertical

https://chartio.com/learn/charts/bar-chart-complete-guide/

Box Plot & Violin Plot


• Box plot: also called a box-and-whisker plot; it summarises a large amount of data in five numbers (minimum, first quartile, median, third quartile, maximum) and gives a good indication of how the values are spread out within groups
• used to plot a combination of categorical and continuous variables
• Violin plot: similar to a box plot, but additionally shows the kernel density estimate of the underlying distribution

https://medium.com/analytics-vidhya/exploratory-data-analysis-uni-variate-analysis-of-iris-data-set-690c87a5cd40
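
The five numbers that a box plot draws can be computed directly. The sketch below uses `numpy.percentile` with its default linear interpolation on hypothetical values; plotting libraries may use slightly different quartile or whisker conventions.

```python
import numpy as np

# Hypothetical measurements for a single group.
values = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
print(f"min={minimum}, Q1={q1}, median={median}, Q3={q3}, max={maximum}")
print(f"IQR={q3 - q1}")
```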

Area Chart/Stacked Chart


• based on the line chart; the areas between the axis and the lines are commonly emphasized with colors
• shares features with bar charts and line charts; compares two or more quantities; works better for large differences and multiple values over time

https://levelup.gitconnected.com/data-visualization-with-pandas-in-action-part-2-2cc8674da1d0

Scatter Plot
• uses a Cartesian coordinate system to display the values of two variables for a set of data
• shows the relationship between two variables; also referred to as a correlation plot

Correlogram
• Correlogram: also known as a correlation matrix; used to analyse the relationship between each pair of numeric variables

HeatMap
• Heatmap: a two-dimensional graphical representation of data in which the individual values contained in a matrix are represented as colors
• useful to see which intersections of the categorical values have a higher concentration of data than the others

Choose the Most Suitable Plots

https://www.tatvic.com/blog/7-visualizations-learn-r/

Clustering Analysis for EDA


• Clustering can be used in EDA to find new insights
- K-Means clustering: can be used to detect possible outliers
- Hierarchical clustering: can be used to find underlying connectivity properties

https://www.neuroelectrics.com/blog/2018/07/06/clustering-methods-in-exploratory-analysis/

Dimensionality Reduction for EDA


• Reducing the data to fewer dimensions helps to describe the relationships between variables
• Examples: t-distributed stochastic neighbor embedding (t-SNE) and Principal Component Analysis (PCA)

https://www.programmersought.com/article/92363395092/
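
As an illustration of dimensionality reduction for EDA, the sketch below projects synthetic five-dimensional data onto two principal components with scikit-learn. The data-generating process (two hidden factors plus small noise) is an assumption made so that two components capture most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 instances, 5 features driven by 2 hidden factors.
rng = np.random.default_rng(42)
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.05 * rng.normal(size=(200, 5))

# Standardise first so that no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
print(pca.explained_variance_ratio_.round(3))
```

The two columns of `X_2d` can then be drawn as a scatter plot to look for clusters or outliers.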

What to look for in your plots?


• Turn the information into useful questions
o Which values are the most common? Why?
o Which values are rare? Why?
o Can you see any unusual patterns? What might explain them?

• Clusters suggest that subgroups exist in your data.


o How can you explain or describe the clusters?
o How are the observations within each cluster similar to each other?
o How are the observations in separate clusters different from each other?

EDA and Data Preprocessing


EDA helps us to prepare the data for the upcoming tasks, e.g. data preprocessing and modelling

EDA finding → Data preprocessing step
- Duplicate data points? → Delete duplicate data points
- Missing values? → Imputation
- Outliers? → Exclude outliers
- Highly correlated features? → Handle highly correlated features
- Low-variance features? → Handle low-variance features
- Imbalanced data? → Oversampling, undersampling
- Features vary in their scale? → Feature scaling
- High-dimensional data? → Dimensionality reduction, feature selection
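
Three of the pairings above (duplicates, missing values, feature scaling) are sketched below with pandas. The data are hypothetical, and median imputation and z-score scaling are just one possible choice for each step.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing value,
# and two features on very different scales.
raw = pd.DataFrame({
    "income": [30000.0, 45000.0, 45000.0, np.nan, 60000.0],
    "children": [0, 2, 2, 1, 3],
})

# Duplicate data points -> delete them.
clean = raw.drop_duplicates().copy()

# Missing values -> impute with the column median.
clean["income"] = clean["income"].fillna(clean["income"].median())

# Features vary in scale -> z-score scaling (mean 0, standard deviation 1).
scaled = (clean - clean.mean()) / clean.std()

print(scaled.round(2))
```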

EDA and Feature Selection

https://www.visual-design.net/post/feature-selection-and-eda-in-machine-learning

Be Colorblind-friendly - Make Your Charts Accessible


• Stop using red and green, blue and yellow together
• If you must, leverage light vs. dark and offer alternate methods of distinguishing the data
• Stop using the stoplight palette
• Use a colorblind-friendly palette:
https://colorbrewer2.org/#type=diverging&scheme=BrBG&n=5
Color wheel: https://graf1x.com/the-color-wheel-chart-poster/

https://www.tableau.com/about/blog/examining-data-viz-rules-dont-use-red-green-together
Lesson 03 - 02

Exploratory Data Analysis Tools


Python Modules

EDA Tools
Some open source tools that facilitate EDA:
• Python: an open source programming language widely used in data analysis, data mining, and data science
• R: an open source programming language widely used for statistical computation and graphical data analysis; it provides packages such as ggplot2 for data visualisation
• Weka: an open source data mining package that includes several EDA tools and algorithms
• Orange: an open source, workbench-style tool for data analysis

Python Essential Modules for EDA


• Brief Overview
- Numerical Python (NumPy): matrix / numerical analysis layer; the fundamental library, built around the powerful ndarray
- Scientific Python (SciPy): scientific computing utilities/functions (linalg, mathematical approximation, etc.)
- Scikit-learn (sklearn): machine learning toolbox
- Matplotlib: plotting and visualization
- Pandas: data manipulation and analysis; powerful data structures including DataFrame and Series
Many others:
‣ OpenCV: computer vision
‣ Statsmodels: Statistics in Python
‣ Seaborn, Bokeh, Plotly: Plotting and Visualisation
‣ NLTK, Gensim: NLP tools
‣ Theano, Caffe, PyTorch, Lasagne: Deep Learning
‣ Pybrain, Pylearn2: Machine learning
‣ DEAP, NEAT-python: Evolutionary Computation

Software - Anaconda
• What is Anaconda (miniconda)?
- An essential, large (~400 MB) Python installation
- It contains almost everything you need for basic machine learning and data analysis
• Why use Anaconda?
- makes it easy to install the required version of Python
- offers high-performance computing with Anaconda Accelerate and several other components
- removes the bottlenecks involved in installing the right packages while taking their compatibility into consideration
- no risk of messing up required system libraries
- over 7500 open source packages, many of which are not in the pip repo

Software - IDEs
• Recommended IDEs:
- IPython (Jupyter Notebook)
‣ IPython stands for “Interactive Python”
‣ Markdown can be written alongside the code
‣ More reasons can be found at: http://pythonforengineers.com/why-ipython-is-the-best-thing-since-sliced-bread/

- PyCharm EDU
‣ A really powerful IDE
‣ Many good features (file browser, intelligent auto-completion, run in IDE with input & console windows for trial and error development)
‣ free to download here: https://www.jetbrains.com/pycharm-edu/

- RStudio: an integrated development environment for R

Using Python Modules


• Python libraries are called modules
• Each module needs to be imported before use.
• Three common alternatives:
- Import the full module:
‣ import numpy
- Import selected functions from the module:
‣ from numpy import array, sin, cos
- Import all functions from the module (not recommended):
‣ from numpy import *
• Any module can be imported under a shorter alias:
- e.g., import numpy as np

Numpy
• Numpy is the foundation for scientific computing in Python
• It has many features:
- a powerful object for large, multi-dimensional arrays
- basic linear algebra functions
- basic Fourier transforms
- sophisticated random number capabilities
- tools for integrating C/C++ code, Fortran code

• More information at the Numpy web page: http://www.numpy.org/



Numpy for EDA


- Data manipulation: NumPy provides powerful array objects that let you manipulate and process data efficiently, e.g., filtering, slicing, indexing, and reshaping, making it easier to extract relevant information from your dataset.

- Mathematical operations: NumPy supports various mathematical operations, enabling you to perform element-wise arithmetic, logarithmic, trigonometric, and other mathematical functions on arrays of data.

- Descriptive statistics: NumPy offers statistical functions to compute descriptive statistics on your data, e.g., mean, median, standard deviation, variance, min, max, percentiles, etc.

- Data visualization: NumPy plays an essential role in preparing data for visualization libraries like Matplotlib and Seaborn.
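
The NumPy capabilities described above are sketched below on a hypothetical array of sensor readings. The threshold of 30 used for filtering is an arbitrary assumption for the example.

```python
import numpy as np

# Hypothetical sensor readings: 3 days x 4 readings per day.
readings = np.array([12.1, 13.4, 11.8, 40.0,
                     12.9, 13.1, 12.5, 12.2,
                     11.9, 12.4, 13.0, 12.7])

daily = readings.reshape(3, 4)          # reshaping
plausible = readings[readings < 30]     # boolean filtering
log_readings = np.log(readings)         # element-wise mathematics

print("per-day mean:", daily.mean(axis=1))
print("median:", np.median(plausible))
print("sample std:", plausible.std(ddof=1))
print("5th/95th percentiles:", np.percentile(plausible, [5, 95]))
```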

SciPy and SciPy for EDA


SciPy (Scientific Python) is a collection of mathematical algorithms and convenience functions built upon NumPy
• offers more functionality than NumPy alone
• a powerful tool for advanced scientific computing and data analysis
• widely used in various scientific and engineering disciplines, including physics, chemistry, biology, economics, and data science
SciPy for EDA
• statistical tests, e.g., t-tests, chi-square tests, ANOVA, and correlation tests, to assess relationships between variables, identify significant differences in the data, and test hypotheses in order to validate assumptions and draw conclusions about the data
• outlier detection (scipy.stats.zscore)
• distance calculations (scipy.spatial.distance), …
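
The three SciPy uses named above can be sketched as follows. The two groups are synthetic, the injected value of 120 is a deliberate outlier, and the cut-off of 3 standard deviations is a common convention rather than a fixed rule.

```python
import numpy as np
from scipy import stats
from scipy.spatial import distance

rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=5, size=100)
group_b = rng.normal(loc=60, scale=5, size=100)

# Statistical test: two-sample t-test on the group means.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Outlier detection: flag values more than 3 standard deviations from the mean.
sample = np.append(group_a, 120.0)      # inject one extreme value
z = stats.zscore(sample)
flagged = sample[np.abs(z) > 3]

# Distance calculation: Euclidean distance between two instances.
d = distance.euclidean([0, 0], [3, 4])

print(f"p-value={p_value:.3g}, flagged={flagged}, distance={d}")
```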

Pandas
• Pandas (Python Data Analysis Library): a Python library that provides easy-to-use data structures and data analysis tools for data manipulation, cleaning, exploration, and preparation in data science and data analysis
• built on NumPy and, to some extent, SciPy and Matplotlib
• many features:
- support for CSV, Excel, JSON, SQL, HDF5, …
- powerful data structures: DataFrame and Series
- data cleansing
- reshaping, merging (joins), and pivoting data
- data visualisation
- database-like operations: filtering, sorting, grouping, aggregating, merging, and joining
• more information at the Pandas web page: https://pandas.pydata.org/

Pandas for EDA


- Data inspection and summary: Pandas provides methods like head(), tail(), info(), and describe(), and the shape attribute, to quickly inspect the data and get a summary of the dataset, including the data types, missing values, and basic statistics

- Filtering and selection: select data based on conditions using boolean indexing and the loc[] and iloc[] indexers to extract relevant subsets of data

- Grouping and aggregation: group data by one or more columns with groupby() and apply aggregate functions like sum(), mean(), count(), etc., to obtain insights and summaries

- Reshaping: reshape and pivot data using methods like pivot(), melt(), stack(), and unstack()

- Plotting: a simple interface for creating basic visualizations directly from DataFrames and Series using the .plot() method
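
The Pandas operations listed above can be tried on a small table. This is a minimal sketch; the sales records and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales records.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [100, 120, 80, 95],
})

# Inspection and summary.
print(df.head(2))
print(df.shape)
print(df.describe())

# Filtering with boolean indexing and loc[].
big = df.loc[df["sales"] > 90, ["region", "sales"]]

# Grouping and aggregation.
totals = df.groupby("region")["sales"].sum()

# Reshaping: long -> wide with pivot(), wide -> long with melt().
wide = df.pivot(index="region", columns="quarter", values="sales")
long = wide.reset_index().melt(id_vars="region", value_name="sales")
```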

Matplotlib and Seaborn


• Matplotlib is a comprehensive graphics library for generating scientific figures
- high-quality output in many formats, including PNG, PDF, SVG, EPS, and PGF
- works efficiently with data frames and arrays
- provides a GUI for interactively exploring figures, but also supports figure generation without the GUI
- integrates well with Jupyter notebooks
- supports 3D plots
• More information at the Matplotlib web page: http://matplotlib.org/
• Seaborn: another Python data visualization library, based on Matplotlib
o has a polished default style that gives better-looking figures
o more convenient when working with a pandas DataFrame
o 3D is not supported
• For basic plots, use Matplotlib; for more advanced statistical visualizations, use Seaborn
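
A minimal Matplotlib sketch of three of the basic EDA charts follows. The data are synthetic, the "Agg" backend is selected on the assumption that no GUI is available, and the figure is written to an in-memory buffer rather than a file.

```python
import io

import matplotlib
matplotlib.use("Agg")              # render without a GUI
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 2 * x + rng.normal(scale=0.5, size=300)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(x, bins=20)
axes[0].set_title("Histogram")

axes[1].boxplot([x, y])
axes[1].set_xticks([1, 2])
axes[1].set_xticklabels(["x", "y"])
axes[1].set_title("Box plot")

axes[2].scatter(x, y, s=8)
axes[2].set_title("Scatter plot")

fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # PNG bytes end up in the buffer
```

In a Jupyter notebook the backend line can be dropped and the figure is displayed inline.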

Scikit-learn
• What is Scikit-learn in Python?
- The name comes from "SciPy Toolkit" (scikit)
- Scikit-learn (sklearn): an open source library that provides a consistent API for using traditional, state-of-the-art ML algorithms/methods in Python.
- mostly written in Python; some of sklearn's internal algorithms are written in Cython using third-party bindings.
- Sklearn's major Python dependencies are scipy, numpy and matplotlib.

Dependency stack, top to bottom: Scikit-learn; Scipy and Matplotlib; Numpy; Python

Scikit-learn - Overview

http://scikit-learn.org/stable/index.html

Scikit-learn – APIs
• Estimators and meta-estimators:
- Estimators: define how objects are instantiated and expose a fit method for learning a model from training data, e.g., LinearRegression
- Meta-estimators: combine one or more estimators into a single estimator, e.g., ensembles such as a random forest (RF)

• Predictors: expose a predict method that takes an array “X test” and produces predictions for it, based on the learned parameters of the estimator.

• Transformers:
- modify or filter data before it is fed to a learner; the transformer interface defines a transform method, e.g., StandardScaler()
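
The estimator, predictor and transformer interfaces can be seen together in a few lines. This is a minimal sketch on hypothetical data that follow y = 3x + 1 exactly, so the predictions are easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical training data following y = 3x + 1 exactly.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 3 * X_train.ravel() + 1
X_test = np.array([[4.0], [5.0]])

# Transformer: fit() learns the mean and scale, transform() applies them.
scaler = StandardScaler().fit(X_train)

# Estimator and predictor: fit() learns the coefficients, predict() uses them.
model = LinearRegression().fit(scaler.transform(X_train), y_train)
predictions = model.predict(scaler.transform(X_test))

print(predictions)
```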

Scikit-learn – APIs
• Pipelines and Feature Unions
- A distinguishing feature of the Scikit-learn API is its ability to compose new estimators from
several base estimators. Two ways:
‣ Pipeline objects chain multiple estimators into a single one.
‣ FeatureUnion objects combine multiple transformers into a single one that concatenates
their outputs

• Pipelines:
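
A pipeline can be sketched as a scaler chained to a classifier. The two-class data below are synthetic, and the step names "scale" and "clf" are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with features on very different scales.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(50, 2)),
    rng.normal(4, 1, size=(50, 2)),
])
X[:, 1] *= 1000
y = np.array([0] * 50 + [1] * 50)

pipe = Pipeline([
    ("scale", StandardScaler()),      # step 1: transformer
    ("clf", LogisticRegression()),    # step 2: final estimator
])
pipe.fit(X, y)                        # fits the scaler, then the classifier
accuracy = pipe.score(X, y)           # accuracy on the training data

print(accuracy)
```

Because the pipeline behaves like a single estimator, the same object can be passed to cross-validation or grid search utilities.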

Scikit-learn – Feature Union


• applies different transformers to the same input data in parallel and concatenates the outputs
• a FeatureUnion takes a list of transformers as input. Fitting the union fits each transformer independently, and transforming with it joins (concatenates) their outputs
For example:
- Join two feature transformers (i.e., PCA and univariate feature selection) to construct combined features
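
The PCA plus univariate selection example can be sketched with scikit-learn's bundled iris data. Using `SelectKBest` for the univariate selection, and the choice of two components and one selected feature, are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),     # 2 principal components
    ("kbest", SelectKBest(k=1)),      # 1 best original feature
])

# Both transformers see the same input; their outputs are concatenated.
X_combined = union.fit_transform(X, y)

print(X.shape, "->", X_combined.shape)
```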

Scikit-learn for EDA


• Data Preprocessing: Scikit-learn provides preprocessing techniques, e.g.,
scaling, normalization, handling missing values, and encoding categorical
variables, to ensure data is in a suitable format for analysis.
• Dimensionality Reduction: Scikit-learn offers techniques like Principal
Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding
(t-SNE) for dimensionality reduction.
• Feature Selection: identifying the most relevant features for modeling or
understanding the relationships between variables.
• Unsupervised Learning: clustering algorithms (e.g., KMeans) for discovering
patterns and structures.

Scikit-learn Cheat Sheet

https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463