MGS 626 - Week 4 - Exploring Data

The document outlines key concepts in data visualization, focusing on exploratory data analysis (EDA) and various visualization methods for big data. It emphasizes the importance of understanding data through visualization to identify patterns, outliers, and errors. The document also discusses different visualization techniques suitable for various data formats, including univariate and bivariate methods, time series, and geospatial data.

Uploaded by

Pooja Kabadi

EXPLORING DATA

MGS 626: Data Visualization

Christopher Keaton

Outline
• Exploratory Data Analyses
• Visualization of big tabular data
• Univariate visualization methods
• Bivariate visualization methods
• Small multiples
• Visualization of time series
• Geospatial visualization


Why exploration?

“Statistics is the science of learning from data” (Larry Wasserman)

We have BIG unknown data so:
- Find interesting variables
- Find interesting patterns
- Find groups in the data
- Find possible outliers
- Find possible errors

So, get to know your data!


Exploratory Data Analysis (EDA)

• Introduced by John Tukey (1977)
• EDA is an approach to analyze data, with the focus on looking
at the data directly rather than looking at indirect information
such as model fit and hypothesis tests.
• Many tools are useful for EDA, especially visualization
methods, e.g. scatter plot, histogram, box plot, and parallel
coordinates.
• EDA is one of the foundations of data mining.


Visualization of Big Data

Bottlenecks:
• Screen resolution; not enough pixels to plot each data point separately
• Restricted focus of the users regarding
  - number of variables
  - different subgroups
  - overview vs. detail
• Computational issues

Which tools are used to visualize big data? That depends on what you want to explore, the
form of the data, etc.


Variety of Big Data

• Big data comes in many different formats. That is why variety is one of the
V’s by which Big Data is commonly described.
• In general, the following formats are common for big data, especially the top three:
  - Tabular data
  - Time series data
  - Spatial data
  - Tree-structured data
  - Text data
  - Image/video/audio data
• Usually, the format of a big data source is a mixture of these formats.


Variety of Big Data (2)

• Web data is often tree-structured (e.g. the prices of a web shop by category)
and contains text and/or images.
• Social media data is often text, image, and video.
• Sensor data is usually tabular data with variables that describe space and
time. The type of measurement data depends on the sensor type; most often
numeric, but also image, video, audio, etc.
• If a tabular dataset contains a spatial component, then it can be linked to
spatial units, which can be spatial points, lines, polygons, or raster points.


Visualization of Big Tabular Data

• A major bottleneck of many EDA tools is that they do not scale very well.
• For instance, a scatter plot with more than 1,000 points will suffer from occlusion.
• Therefore, let Exploratory Big Data Analysis (EBDA) be EDA that is “big data”-proof.
• The question is: which of the tools used for EDA can also be applied to big
tabular data?

Key Factors to Consider
Data Volume:
• How many records are you handling?
• What's the refresh frequency?
• Do you need real-time visualization?

Performance Requirements:
• Browser-based vs desktop application
• Need for interactivity
• Response time expectations

Technical Considerations:
• Data source integration
• Existing technology stack
• Team expertise
• Deployment requirements

Cost Factors:
• Licensing models
• Infrastructure requirements
• Maintenance needs
• Training requirements

Different Scenarios
Real-time Dashboards:
• Grafana
• Kibana
• Apache Superset

Geographic Data:
• Kepler.gl
• deck.gl
• QGIS

Network/Graph Data:
• Neo4j Bloom
• Gephi
• Cytoscape

Scientific/Research:
• ParaView
• VisIt
• MATLAB

How many items can be plotted at once?

Not too many, because of screen resolution and occlusion.

Solution: aggregation (top down or bottom up)

EDA is Cyclical

Common Pitfalls

• Over-reliance on default visualizations

• Jumping to conclusions too quickly
• Not considering data quality issues
• Forgetting about domain context


Scope of visualizing Big Tabular Data

Recommended approach by number of observations n and number of variables p:

• Large n, large p: use variable selection or dimension reduction, then proceed with EBDA
• Large n, small p: Exploratory Big Data analysis (EBDA)
• Small n, large p: use variable selection or dimension reduction, then proceed with EDA
• Small n, small p: EDA

Exploratory (Big) Data Analysis tasks

The tasks for EBDA are the same as for EDA:

• Describe data distribution: how often does a value or a combination of values occur? What is the dependence on other variables?
• Find groups: is the distribution a mixture of distributions of different “natural” groups in the data? If so, the data can be split.
• Identify outliers: find improbable extreme values. Are they errors? Determine if they should be excluded or corrected in the analysis.
• Check data quality: how many values are invalid or missing? Determine what the impact is on the analysis.


Univariate Analysis

• Histogram / frequency polygon

• Frequency plot, pie chart, stacked bar chart
• Treemap
• Calendar plot


Histogram / frequency polygon

• A histogram plots the distribution of a numerical, ordinal, or date-time variable.
• Numerical values are discretized by cutting the variable range into intervals (bins) and counting the frequency in each bin.
• It is advisable to try several different bin sizes to extract features from the data set: a bin size that is too large results in a coarse histogram that describes the global structure but misses fine-grained details, while a bin size that is too small results in a noisy histogram.
• A histogram is well suited for describing a data distribution since it makes no assumptions about the shape of that distribution.
• A date-time variable is typically shown in a frequency polygon, which is the line variant of the histogram.
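
The effect of the bin size can be checked numerically before anything is plotted. The following is a minimal sketch using only the Python standard library; the function name `histogram_counts` and the `ages` sample are illustrative assumptions, not part of the course material.

```python
from collections import Counter
import math


def histogram_counts(values, bin_width):
    """Count values per interval [k * bin_width, (k + 1) * bin_width)."""
    # Map every value to the left edge of the bin it falls into.
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical sample: ages with two clusters and one extreme value.
ages = [21, 22, 22, 23, 35, 36, 36, 37, 38, 61]

fine = histogram_counts(ages, 1)     # one bar per distinct age: noisy
medium = histogram_counts(ages, 10)  # the two clusters become visible
coarse = histogram_counts(ages, 20)  # the two clusters merge into one bar

for label, hist in [("width 1", fine), ("width 10", medium), ("width 20", coarse)]:
    print(label, hist)
```

Comparing the three dictionaries shows why no single bin width is safe to trust: the structure that is visible depends on the width chosen.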


Calendar plot
• For date-time variables with a range between one month and a couple of years, a calendar plot is very useful.
• It can be seen as a heatmap in calendar format where the values are binned to days.
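
Binning to days and placing each day in a calendar grid could be sketched as follows. This is only an illustration with the standard library; `calendar_cells` and the `events` list are hypothetical, and the choice of ISO week as column and weekday as row is one possible layout.

```python
from collections import Counter
from datetime import datetime


def calendar_cells(timestamps):
    """Bin timestamps to days and key each day by (ISO week, ISO weekday)."""
    per_day = Counter(ts.date() for ts in timestamps)
    # One heatmap cell per day: column = ISO week number, row = weekday (1 = Monday).
    # For ranges that span several years, the ISO year should be added to the key.
    return {(day.isocalendar()[1], day.isoweekday()): n for day, n in per_day.items()}


# Hypothetical event timestamps.
events = [
    datetime(2015, 10, 1, 8, 30),
    datetime(2015, 10, 1, 17, 5),
    datetime(2015, 10, 2, 9, 0),
    datetime(2015, 10, 5, 12, 0),
]
cells = calendar_cells(events)
print(cells)
```

The resulting counts per cell are what a plotting library would map to color.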


Bivariate visualizations
• Heatmap
• Two-dimensional kernel density plot
• Surface plot
• Mosaic plot
• Tableplot
• Treemap


Heatmap
• The heatmap is a powerful workhorse that copes with the shortcomings of scatter plots.
• Moreover, it can be applied to numerical and high-cardinality ordinal, categorical, and date-time variables. Numerical variables are discretized in the same way as for histograms.
• Counting the number of occurrences for each combination of (discretized) values results in a frequency matrix, which is displayed as a heatmap.
• Alternatively, numeric data can also be binned and visualized in hexagons.
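
The frequency matrix behind a heatmap can be computed in a few lines. The sketch below assumes non-negative values and rectangular bins that start at zero; `frequency_matrix` and the sample data are hypothetical names chosen for illustration.

```python
from collections import Counter


def frequency_matrix(xs, ys, x_width, y_width):
    """Count points per rectangular cell; rows are y bins, columns are x bins."""
    # Assumption: all values are non-negative, so bin indices start at 0.
    cells = Counter((int(x // x_width), int(y // y_width)) for x, y in zip(xs, ys))
    n_cols = max(i for i, _ in cells) + 1
    n_rows = max(j for _, j in cells) + 1
    return [[cells.get((i, j), 0) for i in range(n_cols)] for j in range(n_rows)]


# Hypothetical sample: age in years, income in thousands.
ages = [23, 27, 31, 38, 45, 47]
incomes = [18, 22, 35, 41, 39, 62]

matrix = frequency_matrix(ages, incomes, x_width=10, y_width=20)
for row in matrix:
    print(row)
```

The size of the matrix depends on the number of bins, not on the number of observations, which is why this approach scales where a scatter plot does not.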


2D kernel density plot

• The two-dimensional kernel density plot is another visualization method that scales to a large number of observations. Contour lines are drawn based on the estimated kernel densities.
• It is especially suitable for numerical data, but also for date-time variables and high-cardinality ordinal variables.


Surface plot

• The surface plot, a three-dimensional plot in which the densities are expressed by height, is another useful tool to visualize bivariate relationships.
• Like the two-dimensional kernel density plot, it is useful for numerical values, and can also be used for high-cardinality ordinal and date-time variables.

Figure: surface plot with axes Age, Income, and Count.


Mosaic plot
• The mosaic plot is useful for low-cardinality variables.
• The areas of the rectangles are proportional to the counts.
• A stacked bar chart is similar to a mosaic plot, but does not show the univariate frequency distribution of the column variable.
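
How the rectangle areas end up proportional to the counts can be sketched as follows: the width of a column encodes the univariate share of the column category, and the height encodes the share of the row category within that column. The function name `mosaic_rectangles` and the counts are hypothetical.

```python
def mosaic_rectangles(table):
    """Return {(column, row): (width, height)} for a two-way table of counts."""
    total = sum(sum(cells.values()) for cells in table.values())
    rects = {}
    for column, cells in table.items():
        column_total = sum(cells.values())
        width = column_total / total       # univariate share of the column category
        for row, count in cells.items():
            height = count / column_total  # share of the row category within the column
            rects[(column, row)] = (width, height)
    return rects


# Hypothetical counts: column variable = decade, row variable = song subject.
songs = {
    "1980s": {"love": 30, "other": 10},
    "1990s": {"love": 20, "other": 40},
}
rects = mosaic_rectangles(songs)
for key, (width, height) in rects.items():
    print(key, "width", width, "height", round(height, 3), "area", round(width * height, 3))
```

A stacked bar chart corresponds to giving every column the same width, which is exactly the information about the column variable that it loses.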

Subjects of pop songs


Treemap

• The treemap is very useful to visualize the relationship between a hierarchical categorical and a numerical variable.
• The sizes of the rectangles correspond to aggregates of the numerical variable based on the hierarchy of the categorical variable.
• In addition, color can be used to encode a third, numerical, variable, typically with a diverging color scheme.

Figure: Size represents number of employees per economic sector, color represents the difference with last year.
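
The aggregation that determines the rectangle sizes could look like the sketch below. It assumes a two-level hierarchy stored as (parent, child) keys; `treemap_sizes` and the employee counts are made up for illustration.

```python
from collections import defaultdict


def treemap_sizes(leaves):
    """Sum a numerical variable over a two-level hierarchy keyed by (parent, child)."""
    parent_totals = defaultdict(int)
    for (parent, _child), value in leaves.items():
        parent_totals[parent] += value
    return dict(parent_totals)


# Hypothetical employee counts per (sector, subsector).
employees = {
    ("Industry", "Food"): 120,
    ("Industry", "Chemicals"): 80,
    ("Services", "Retail"): 300,
    ("Services", "Transport"): 100,
}
sector_totals = treemap_sizes(employees)
grand_total = sum(sector_totals.values())

# Share of the total area that each top-level rectangle would receive.
sector_share = {sector: value / grand_total for sector, value in sector_totals.items()}
print(sector_totals)
print(sector_share)
```

A treemap layout algorithm would then subdivide each sector rectangle among its subsectors in the same proportional way.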


Tableplot
• The tableplot may seem to be a multivariate plot, but it is actually a combination of bivariate plots.
• Data from two or more variables, which can be numerical, categorical, or ordinal, is binned according to the quantiles of a numerical variable.
• For each variable that is either numerical or ordinal with high cardinality, a bar chart with mean values per bin is plotted.
• For each low-cardinality categorical or ordinal variable, a stacked bar chart is plotted.
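
The binning step for a numerical column might be sketched as follows. This is a simplified illustration, not the implementation of any tableplot package: it sorts by one variable, cuts the rows into equal-count bins, and averages another column per bin. The names `tableplot_means`, `income`, and `age` are assumptions.

```python
from statistics import mean


def tableplot_means(rows, sort_by, column, n_bins):
    """Sort rows by one numerical variable, cut them into equal-count bins,
    and return the mean of `column` in each bin."""
    # Assumption: there are at least n_bins rows, so no bin is empty.
    ordered = sorted(rows, key=lambda row: row[sort_by])
    n = len(ordered)
    means = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        means.append(mean(row[column] for row in chunk))
    return means


# Hypothetical records.
incomes = [30, 10, 80, 50, 20, 70, 40, 60]
ages = [31, 22, 60, 40, 25, 52, 28, 45]
rows = [{"income": i, "age": a} for i, a in zip(incomes, ages)]

income_bars = tableplot_means(rows, "income", "income", n_bins=4)
age_bars = tableplot_means(rows, "income", "age", n_bins=4)
print(income_bars)
print(age_bars)
```

Each list holds the bar lengths of one tableplot column; because both use the same bins, the columns can be read side by side.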

BREAK

Small Multiples
• Data is split into multiple subsets.
• For each subset, a small plot is created.
• These plots are called small multiples, facets, trellis charts, or lattice
charts.
• They are usually placed on a rectangular grid.
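
Splitting the data and assigning each subset a grid cell can be sketched without any plotting library. The helper `facet_layout` and the traffic records below are hypothetical; a plotting library would draw one small plot per subset at the returned position.

```python
from collections import defaultdict


def facet_layout(records, key, n_cols):
    """Split records by `key` and assign each subset a (row, column) grid cell."""
    subsets = defaultdict(list)
    for record in records:
        subsets[record[key]].append(record)
    # Fill the grid row by row, in sorted order of the subset names.
    positions = {name: divmod(i, n_cols) for i, name in enumerate(sorted(subsets))}
    return dict(subsets), positions


# Hypothetical traffic counts per highway.
counts = [
    {"road": "A1", "vehicles": 900},
    {"road": "A2", "vehicles": 1500},
    {"road": "A1", "vehicles": 950},
    {"road": "A4", "vehicles": 700},
]
subsets, positions = facet_layout(counts, "road", n_cols=2)
print(positions)
```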


Example 1

Figure: Number of counted vehicles on Dutch highways in one month (September).

Example 2

Figure: Estimated daytime population during one week per municipality, based on mobile phone network data. Color indicates cluster.

These plots are also small multiples

Figure panels: Bar chart per subset; Tableplot

Time series data

• Line graph / dot plot
• Calendar plot / heatmap
• Streamgraph
• Horizon graph


Line graph / dot plot

• The line graph is probably the most used method for displaying time series data.
• It is less suitable when the time series are not stable, or when there is noise in the data.
• The dot plot (without lines) is a good alternative for less stable or noisy data.

Figure panels: Dot plot; Line graph with dots


Streamgraph
• The streamgraph is a stacked area
chart (which is in turn an alternative
to a line chart).
• It is used for time series for several
subsets of data.

Figure: Frequency of subjects of New York service calls (311) during a day

Horizon graph
• A horizon graph is a space efficient
alternative to a line chart.
• It is constructed as follows:
  1) Normal line chart.
  2) Line chart where the area under the curve is filled with a diverging color scheme.
  3) Horizon graph. Only the peaks (positive or negative) are shown.
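
The folding step can be sketched as follows. The sketch assumes the commonly used banded construction, in which the filled area is sliced into layers of equal height that are drawn on top of each other in increasingly dark shades; whether the figure on the slide used exactly this variant cannot be told from the text. The function `horizon_layers` and its parameters are hypothetical.

```python
def horizon_layers(values, band_height, n_bands):
    """For each value return (sign, layers), where layers[k] is the height
    drawn in band k after slicing the filled area into equal bands."""
    result = []
    for v in values:
        magnitude = abs(v)
        layers = [
            min(max(magnitude - k * band_height, 0), band_height)
            for k in range(n_bands)
        ]
        # The sign selects the side of the diverging color scheme.
        result.append((1 if v >= 0 else -1, layers))
    return result


# Hypothetical series with one small, one medium, and one large negative value.
layers = horizon_layers([5, 15, -25], band_height=10, n_bands=3)
for item in layers:
    print(item)
```

Because every layer is at most `band_height` tall, the chart needs only a fraction of the vertical space of the original line chart.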

Horizon graph (2)

The horizon graph is especially useful when there is little space, for
instance in case of small multiples:


Spatial data
• Choropleth
• Dot map
• 2D kernel density map
• Small multiples

Choropleth
A choropleth is a map type where administrative regions are filled with colors that represent a density or ratio variable.

Figure: Daytime population per municipality based on mobile phone network data

Dot map
• Data points are shown as dots.
• Useful for spatial point data, such as geo-tagged events.

Figure: Cholera outbreak in London (1854), mapped by John Snow. Dots show deaths per address; water pumps are marked as possible causes.

Dot map (2)

• Interactive dot map of the Dutch population colored by ethnic origin.
• The luminance of the dots indicates the density.
• Prototype: http://research.cbs.nl/colordotmap/
• Based on the Racial Dot Map: http://demographics.coopercenter.org/DotMap/

Dot map (3)

• Crimes registered in Greater London during October 2015. (Data available on https://data.police.uk/)
• Number of dots: 80,000
• Alpha transparency has been applied to better show the spatial data distribution.
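
Why alpha transparency reveals density can be illustrated with a small calculation. The sketch assumes standard "over" compositing of same-colored dots, in which each additional dot covers a fraction `alpha` of whatever is still visible; the function name `stacked_opacity` is hypothetical.

```python
def stacked_opacity(alpha, n_dots):
    """Resulting opacity after drawing n_dots same-colored dots on one spot."""
    # Each dot lets a fraction (1 - alpha) of the background shine through.
    return 1 - (1 - alpha) ** n_dots


for n in (1, 5, 20, 100):
    print(n, round(stacked_opacity(0.05, n), 3))
```

With a low alpha value, isolated dots stay faint while dense areas approach full opacity, so the darkness of a region reflects how many points overlap there.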


2D Kernel Density Map

• A 2D Kernel Density Estimator is a technique to show the smoothed densities.
• The bandwidth parameter determines the level of smoothness.
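
The role of the bandwidth can be illustrated with a small estimator. The sketch below assumes a Gaussian kernel with a single bandwidth shared by both dimensions, which is one common choice; `kde2d` and the sample points are hypothetical.

```python
import math


def kde2d(points, x, y, bandwidth):
    """Gaussian kernel density estimate at (x, y) with one shared bandwidth."""
    h2 = bandwidth ** 2
    total = sum(
        math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * h2))
        for px, py in points
    )
    return total / (len(points) * 2 * math.pi * h2)


# Hypothetical sample: two tight clusters, around (0, 0) and (5, 5).
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]

for h in (0.5, 3.0):
    peak = kde2d(points, 0, 0, h)     # density inside a cluster
    gap = kde2d(points, 2.5, 2.5, h)  # density between the clusters
    print("bandwidth", h, "gap/peak ratio", gap / peak)

ratio_small = kde2d(points, 2.5, 2.5, 0.5) / kde2d(points, 0, 0, 0.5)
ratio_large = kde2d(points, 2.5, 2.5, 3.0) / kde2d(points, 0, 0, 3.0)
```

With the small bandwidth the two clusters stay separate, while the large bandwidth smooths them into one broad hill, which is the trade-off to keep in mind when choosing the parameter.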


Small multiples
Small multiples are very suitable for spatial data.

Figure: Crimes in the City of London by type of crime

Small multiples

Figure: Daytime population estimates (relative to residential population totals)

Summary
• Data exploration is important to get to know the data.
• Visualization is key in data exploration; by looking at the data in different ways, patterns and anomalies become clear.
• Many visualization methods can be used for data exploration. Some of these methods can also be used for big data, since they are scalable.

Group Project 2, due in two weeks.
• As an addition to your paper submission for GP1, please submit an overview of at least 5 visualizations you will be creating.
• Each Viz should have a clear narrative that can be explained in a few sentences.
• Additionally, for each Viz you should explain how the aspects of design discussed in weeks 2 through 4 help craft the narrative.
