03a EDA
Week Overview
• Exploratory Data Analysis (EDA) Introduction
What is EDA
Why we need EDA and what we can do with it
How to do EDA
• EDA Techniques
Groups of EDA Methods
Visualisation Methods
• EDA Tools
Python Modules for EDA
Thinking
• How to choose the most suitable algorithms for your dataset?
https://alastairrushworth.github.io/exploring_eda/EDA.html#1
https://duo.com/labs/research/gamifying-data-science-education
• EDA is the statistician’s way of storytelling: you explore the data, find
patterns, and communicate insights
Exploratory Data Analysis: 5
Graphical
Categorical Variable: Bar Chart
Quantitative Variable:
• Histogram
• Boxplot
• …
One Categorical Variable and One Quantitative Variable:
• Side-by-side Boxplots
Two or More Categorical Variables:
• Grouped Bar Chart
Two or More Quantitative Variables:
• Scatter plot, Correlation Heatmap, Pairplot …
How to do EDA
Steps/activities involved in EDA:
• identification of variables and data types
• non-graphical and graphical univariate analysis
• bi-/multivariate analysis, correlation analysis
• detect missing values and anomalies
• detect outliers
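The steps above can be sketched with pandas as follows. This is only an illustrative sketch: the DataFrame, its column names, and the 1.5 * IQR outlier rule are assumptions chosen for the example, not part of the lecture material.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, invented for illustration only.
df = pd.DataFrame({
    "age": [23, 35, 31, np.nan, 52, 41, 29, 95],
    "income": [28_000, 52_000, 47_000, 39_000, 88_000, 61_000, np.nan, 58_000],
    "city": ["Perth", "Sydney", "Perth", "Hobart", "Sydney", "Perth", "Hobart", "Sydney"],
})

# 1. Identify variables and data types.
dtypes = df.dtypes

# 2. Non-graphical univariate analysis: summary statistics and category counts.
numeric_summary = df.describe()
city_counts = df["city"].value_counts()

# 3. Bivariate analysis: pairwise correlation of the numeric columns.
corr = df[["age", "income"]].corr()

# 4. Detect missing values per column.
missing = df.isna().sum()

# 5. Detect outliers in "age" with the 1.5 * IQR rule (one common convention).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
outliers = df.loc[is_outlier, "age"]

print(dtypes, numeric_summary, city_counts, corr, missing, outliers, sep="\n\n")
```

On this toy data the age of 95 is flagged as an outlier, and one missing value is reported for each of "age" and "income".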
A typical example:
https://www.researchgate.net/publication/342282008_Exploratory_Data_Analysis_and_Data_Envelopment_Analysis_of_Construction_and_Demolition_Waste_Management_in_the_European_Economic_Area/figures?lo=1
Visualisation
There are four basic presentation types:
• composition
• comparison
• distribution
• relationship
Visual Aids
Common charts in EDA:
• Pie chart
• Histogram
• Bar & Stacked Bar Chart
• Box Plot & Violin plot
• Area Chart
• Scatter Plot
• Correlogram
• Heatmap
Pie chart
• Pie chart: a circle divided into sectors, used to communicate proportions
• a common method for representing categorical variables
Histogram
• Histogram: a plot of the frequency distribution of a numeric
variable, made by splitting its values into small equal-sized bins;
it provides a visual summary of central tendency, spread, and shape.
https://stackoverflow.com/questions/37911731/seaborn-histogram-with-4-panels-2-x-2-in-python
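A minimal sketch of the binning idea, assuming synthetic normally distributed data and 20 bins (both arbitrary choices for illustration). NumPy computes the equal-width bins and counts; Matplotlib draws them. The output file name is hypothetical.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50.0, scale=10.0, size=1_000)  # synthetic data

# Split the value range into 20 equal-width bins and count the values in each.
counts, bin_edges = np.histogram(values, bins=20)

fig, ax = plt.subplots()
ax.hist(values, bins=bin_edges, edgecolor="black")      # same bins as above
ax.axvline(values.mean(), linestyle="--", label="mean")  # central tendency
ax.set_xlabel("value")
ax.set_ylabel("frequency")
ax.legend()
fig.savefig("histogram.png")  # hypothetical output file name
```

In a Jupyter notebook the backend line and savefig call can be dropped, since the figure is displayed inline.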
https://chartio.com/learn/charts/bar-chart-complete-guide/
https://medium.com/analytics-vidhya/exploratory-data-analysis-uni-variate-analysis-of-iris-data-set-690c87a5cd40
https://levelup.gitconnected.com/data-visualization-with-pandas-in-action-part-2-2cc8674da1d0
Scatter Plot
• Scatter plot: uses a Cartesian coordinate system to display
the values of two variables for a set of data
Correlogram
• Correlogram: a.k.a. correlation matrix, used to analyse the
relationship between each pair of numeric variables
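As a hedged illustration, the correlation matrix behind a correlogram can be computed with pandas. The column names and the synthetic relationships below are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y_pos": 2.0 * x + rng.normal(scale=0.5, size=200),   # positively related to x
    "y_neg": -1.0 * x + rng.normal(scale=0.5, size=200),  # negatively related to x
    "noise": rng.normal(size=200),                         # unrelated to x
})

# Pearson correlation between every pair of numeric columns.
corr_matrix = df.corr(method="pearson")
print(corr_matrix.round(2))
```

The matrix is symmetric with ones on the diagonal; a correlogram simply draws each cell as a colour or a small scatter plot.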
Heatmap
• Heatmap: a two-dimensional graphical representation of
data in which the individual values contained in a
matrix are represented as colors
• useful for seeing which intersections of categorical values
have a higher concentration of data than the others
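A small sketch of the "intersections of categorical values" idea, assuming two made-up categorical columns ("region" and "plan"). pandas builds the count matrix and Matplotlib colours each cell by its count; the output file name is hypothetical.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical survey records, invented for illustration.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south", "east", "east", "north"],
    "plan":   ["basic", "pro",   "basic", "basic", "pro",   "pro",  "pro",  "basic"],
})

# Count matrix: rows are regions, columns are plans.
counts = pd.crosstab(df["region"], df["plan"])

fig, ax = plt.subplots()
image = ax.imshow(counts.to_numpy(), cmap="viridis")  # one coloured cell per intersection
ax.set_xticks(range(counts.shape[1]))
ax.set_xticklabels(counts.columns)
ax.set_yticks(range(counts.shape[0]))
ax.set_yticklabels(counts.index)
fig.colorbar(image, ax=ax, label="count")
fig.savefig("heatmap.png")  # hypothetical output file name
```

The "viridis" colour map is used here because it avoids relying on a red/green contrast.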
https://www.tatvic.com/blog/7-visualizations-learn-r/
https://www.neuroelectrics.com/blog/2018/07/06/clustering-methods-in-exploratory-analysis/
t-distributed stochastic neighbor embedding (t-SNE) and Principal Component Analysis (PCA)
https://www.programmersought.com/article/92363395092/
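Both techniques reduce many numeric columns to two coordinates that can be drawn as a scatter plot. The sketch below is only an illustration: the Iris dataset bundled with scikit-learn, the standardisation step, and the perplexity value are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)            # 150 samples, 4 numeric features
X_scaled = StandardScaler().fit_transform(X)  # put all features on the same scale

# PCA: linear projection onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()
print("variance explained by 2 components:", explained)

# t-SNE: non-linear embedding that tries to keep neighbouring points close.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)
print(X_pca.shape, X_tsne.shape)
```

Either two-column result can then be passed to a scatter plot, coloured by the label y, to look for clusters.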
https://www.visual-design.net/post/feature-selection-and-eda-in-machine-learning
https://www.tableau.com/about/blog/examining-data-viz-rules-dont-use-red-green-together
Lesson 03 - 02
EDA Tools
Some open source tools that facilitate EDA:
• Python: an open source programming language widely used
in data analysis, data mining, and data science
• R: an open source programming language widely used
in statistical computation and graphical data analysis;
it provides packages such as ggplot2 for data visualisation
• Weka: an open source data mining package that includes
several EDA tools and algorithms
• Orange: an open source, workbench-style tool for data
analysis
Software - Anaconda
• What is Anaconda (Miniconda)?
- Essentially a large (~400 MB) Python installation
- It contains almost everything you need for basic machine learning and data
analysis
• Why use Anaconda?
- makes it easy to install a chosen version of Python
- offers high-performance computing through Anaconda Accelerate and several other
components
- removes the bottleneck of finding and installing the right packages with
compatible versions
- no risk of messing up required system libraries
- over 7500 open source packages, many of which are not in the pip repository (PyPI)
Software - IDEs
• Recommended IDEs:
- IPython (Jupyter Notebook)
‣ IPython stands for “Interactive Python”
‣ Lets you write Markdown alongside the code
‣ More reasons can be found at: http://pythonforengineers.com/why-ipython-is-the-best-thing-since-sliced-bread/
- PyCharm Edu
‣ A really powerful IDE
‣ Many good features (file browser, intelligent auto-completion, running code in the
IDE with input and console windows for trial-and-error development)
‣ free to download here: https://www.jetbrains.com/pycharm-edu/
NumPy
• NumPy is the foundation for scientific computing in Python
• It has many features:
- a powerful object for large, multi-dimensional arrays
- basic linear algebra functions
- basic Fourier transforms
- sophisticated random number capabilities
- tools for integrating C/C++ and Fortran code
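A brief sketch touching the first four features in the list above. All numbers are made up for illustration.

```python
import numpy as np

# Multi-dimensional array object: a 2 x 3 matrix and a per-column mean.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
column_means = a.mean(axis=0)

# Basic linear algebra: solve the system A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)

# Basic Fourier transform: locate the dominant frequency of a sampled sine wave.
t = np.arange(64) / 64.0               # 64 samples over one second
signal = np.sin(2 * np.pi * 5 * t)     # 5 Hz sine
spectrum = np.abs(np.fft.rfft(signal))
dominant_hz = int(np.argmax(spectrum))  # bin k corresponds to k Hz here

# Random number generation with a seeded generator for reproducibility.
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=5)

print(column_means, x, dominant_hz, samples)
```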
Pandas
• Pandas (Python Data Analysis Library): a Python library that provides
easy-to-use data structures and data analysis tools for data manipulation, data
cleaning, data exploration, and data preparation tasks in data science and
data analysis
• built on NumPy and, to some extent, SciPy and Matplotlib
• many features:
- support for CSV, Excel, JSON, SQL, HDF5, …
- powerful data structures: DataFrame and Series
- data cleansing
- reshaping, merging (joins), and pivoting data
- data visualisation
- database-like operations: filtering, sorting, grouping, aggregating, merging, and
joining
• more information at the Pandas web page: https://pandas.pydata.org/
- filter and select data based on conditions, using boolean expressions with
indexers such as loc[] and iloc[] to extract relevant subsets of data
- group and aggregate with groupby(): group data by one or more columns and
apply aggregate functions such as sum(), mean(), count(), etc., to obtain
insights and summaries
- reshape and pivot data using methods like pivot(), melt(), stack(), and unstack()
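The selection, grouping, and reshaping operations listed above might look like this in practice. The "sales" table and its column names are hypothetical, invented for this sketch.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "store":   ["A", "A", "B", "B", "C", "C"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95, 150, 130],
})

# Boolean indexing with loc[]: rows with revenue above 100, two columns only.
high = sales.loc[sales["revenue"] > 100, ["store", "revenue"]]

# Position-based selection with iloc[]: first two rows, first two columns.
first_rows = sales.iloc[:2, :2]

# Grouping and aggregation: total and mean revenue per store.
per_store = sales.groupby("store")["revenue"].agg(["sum", "mean"])

# Reshape long -> wide with pivot(), then wide -> long again with melt().
wide = sales.pivot(index="store", columns="quarter", values="revenue")
long_again = wide.reset_index().melt(
    id_vars="store", var_name="quarter", value_name="revenue"
)

print(high, per_store, wide, long_again, sep="\n\n")
```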
Scikit-learn
• What is Scikit-learn in Python?
- The name comes from “SciPy Toolkit”
- Scikit-learn (sklearn): an open source library that provides a consistent API for
using traditional, state-of-the-art ML algorithms/methods in Python
- mostly written in Python; some of sklearn’s internal algorithms are written in
Cython, using third-party bindings
- sklearn’s major Python dependencies are SciPy and NumPy, with Matplotlib used for plotting
Dependency stack (top to bottom): Scikit-learn; SciPy and Matplotlib; NumPy; Python
Scikit-learn - Overview
http://scikit-learn.org/stable/index.html
Scikit-learn – APIs
• Estimators and Meta-estimators:
- Estimators: define how objects are instantiated and expose a fit method for learning
a model from training data, e.g. LinearRegression
- Meta-estimators: combine one or more estimators into a single estimator, e.g. ensembles
such as random forests (RF)
• Predictors: expose a predict method that takes an array “X test” and produces predictions
for “X test”, based on the learned parameters of the estimator
• Transformers:
- modify or filter data before feeding it to a model; the transformer interface defines a
transform method, e.g. StandardScaler()
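A minimal sketch of the three interfaces, using the LinearRegression and StandardScaler classes named above. The training data is synthetic and follows y = 3x + 2 exactly, an assumption chosen so the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data following y = 3 * x + 2 exactly.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 3.0 * X_train.ravel() + 2.0
X_test = np.array([[4.0], [5.0]])

# Estimator: fit() learns parameters from the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Predictor: predict() applies the learned parameters to new data.
y_pred = model.predict(X_test)

# Transformer: fit() learns the column mean and standard deviation,
# transform() rescales the data with them.
scaler = StandardScaler()
X_train_scaled = scaler.fit(X_train).transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics

print(model.coef_, model.intercept_, y_pred)
```

Fitting the scaler on the training data only, and reusing it on the test data, keeps information from the test set out of the learned statistics.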
Scikit-learn – APIs
• Pipelines and Feature Unions
- A distinguishing feature of the Scikit-learn API is its ability to compose new estimators from
several base estimators. Two ways:
‣ Pipeline objects chain multiple estimators into a single one.
‣ FeatureUnion objects combine multiple transformers into a single one that concatenates
their outputs
• Pipelines:
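A hedged sketch of both composition mechanisms. The choice of steps (StandardScaler, PCA, SelectKBest, LogisticRegression), the step names, and the bundled Iris dataset are assumptions made for illustration, not the lecture's own example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# FeatureUnion: run two transformers side by side and concatenate their outputs
# (2 PCA components + 1 selected original feature = 3 columns).
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("best", SelectKBest(k=1)),
])

# Pipeline: chain transformers and a final estimator into a single estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("features", features),
    ("classify", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X, y)                       # fits every step in order
train_accuracy = pipeline.score(X, y)    # accuracy on the training data
n_features = pipeline[:-1].transform(X).shape[1]  # output width before the classifier
print(train_accuracy, n_features)
```

Because the whole chain behaves as one estimator, it can be passed to tools such as cross-validation in the same way as a single model.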
https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463
Scikit-Learn Cheat Sheet