EXPLORING
DATA
MGS 626: Data Visualization
Christopher Keaton
Outline
• Exploratory Data Analyses
• Visualization of big tabular data
• Univariate visualization methods
• Bivariate visualization methods
• Small multiples
• Visualization of time series
• Geospatial visualization
Why exploration?
“Statistics is the science of learning from data” (Larry Wasserman)
We have BIG unknown data, so:
- Find interesting variables
- Find interesting patterns
- Find groups in the data
- Find possible outliers
- Find possible errors
So, get to know your data!
Exploratory Data Analysis (EDA)
• Introduced by John Tukey (1977)
• EDA is an approach to analyzing data, with the focus on looking
at the data directly rather than looking at indirect information
such as model fit and hypothesis tests.
• Many tools are useful for EDA, especially visualization
methods, e.g. scatter plot, histogram, box plot, and parallel
coordinates.
• EDA is one of the foundations of data mining.
Visualization of Big Data
Bottlenecks:
• Screen resolution; not enough pixels to plot each data point separately
• Restricted focus of the users regarding
  • number of variables
  • different subgroups
  • overview vs. detail
• Computational issues
Which tools are used to visualize big data? That depends on what you want to explore, the
form of the data, etc.
Variety of Big Data
• Big data comes in many different formats. That is why variety is one of the
V’s by which Big Data is commonly described.
• In general, the following formats are common for big data, especially the top
three:
  • Tabular data
  • Time series data
  • Spatial data
  • Tree-structured data
  • Text data
  • Image/video/audio data
• Usually, the format of a big data source is a mixture of these formats.
Variety of Big Data (2)
• Web data is often tree-structured (e.g. the prices of a web shop by category)
and contains text and/or images.
• Social media data is often text, image, and video.
• Sensor data is usually tabular data with variables that describe space and
time. The type of measurement data depends on the sensor type; most often
numeric, but also image, video, audio, etc.
• If a tabular dataset contains a spatial component, then it can be linked to
spatial units, which can be spatial points, lines, polygons, or raster points.
Visualization of Big Tabular Data
• A major bottleneck of many EDA tools is that they do not scale very well.
• For instance, a scatter plot with more than 1,000 points will suffer from occlusion.
• Therefore, let Exploratory Big Data Analysis (EBDA) be EDA that is “big
data”-proof.
• The question is: which of the tools used for EDA can also be applied to big
tabular data?
Key Factors to Consider
Data Volume:
• How many records are you handling?
• What's the refresh frequency?
• Do you need real-time visualization?
Performance Requirements:
• Browser-based vs desktop application
• Need for interactivity
• Response time expectations
Technical Considerations:
• Data source integration
• Existing technology stack
• Team expertise
• Deployment requirements
Cost Factors:
• Licensing models
• Infrastructure requirements
• Maintenance needs
• Training requirements
Different Scenarios
Real-time Dashboards:
• Grafana
• Kibana
• Apache Superset
Geographic Data:
• Kepler.gl
• deck.gl
• QGIS
Network/Graph Data:
• Neo4j Bloom
• Gephi
• Cytoscape
Scientific/Research:
• ParaView
• VisIt
• MATLAB
How many items can be plotted at once?
Not too many, because of screen resolution
and occlusion.
Solution: aggregation
[Figure panels: top down; bottom up]
EDA is Cyclical
Common Pitfalls
• Over-reliance on default visualizations
• Jumping to conclusions too quickly
• Not considering data quality issues
• Forgetting about domain context
Scope of visualizing Big Tabular Data
n = number of observations, p = number of variables
• Large n, large p: use variable selection or dimension reduction, then proceed with EBDA
• Large n, small p: Exploratory Big Data analysis (EBDA)
• Small n, large p: use variable selection or dimension reduction, then proceed with EDA
• Small n, small p: EDA
Exploratory (Big) Data Analysis tasks
The tasks for EBDA are the same as for EDA:
• Describe data distribution: how often does a value or a combination of values occur?
What is the dependence on other variables?
• Find groups: is the distribution a mixture of distributions of different “natural” groups
in the data? If so, the data can be split.
• Identify outliers: find improbable extreme values. Are they errors? Determine if they
should be excluded or corrected in the analysis.
• How many values are invalid or missing? Determine what the impact is on the
analysis.
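A minimal sketch of two of these tasks, counting missing values and flagging possible outliers. The 1.5 × IQR rule is one common convention and is an assumption here, not something the slides prescribe; the function name `summarize` and the example values are made up for illustration.

```python
import statistics

def summarize(values):
    """Count missing values (None) and flag outliers with the 1.5 * IQR rule."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartiles of the observed values
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "outliers": outliers,
    }

# Hypothetical column; None marks a missing value
incomes = [31, 35, 29, None, 33, 30, 250, 34, None, 32]
report = summarize(incomes)
print(report)
```

A flagged value is only a candidate: whether it is an error or a genuine extreme still needs domain knowledge.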
Univariate Analysis
• Histogram / frequency polygon
• Frequency plot, pie chart, stacked bar chart
• Treemap
• Calendar plot
Histogram / frequency polygon
• A histogram is a statistical technique to plot a numerical,
ordinal or date-time variable.
• Numerical values are discretized by cutting the variable range
into intervals (bins) and counting the frequency per interval.
• It is advised to try several different bin sizes to extract
features from the data set: a bin size that is too large results
in a coarse histogram that describes global structure but
misses fine-grained details, while a bin size that is too small
results in a noisy histogram.
• A histogram is well suited for describing a data distribution
since it makes no assumptions.
• A date-time variable is typically shown in a frequency
polygon, which is the line-variant of the histogram.
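The binning step can be sketched without any plotting library. This is an illustrative sketch with equal-width bins; the function name `histogram` and the age values are made up for the example.

```python
def histogram(values, bins):
    """Cut the range of `values` into `bins` equal-width intervals and count per interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical ages
ages = [21, 22, 22, 23, 25, 31, 32, 33, 34, 35, 36, 58, 59, 60, 61]
for bins in (2, 4, 8):
    print(bins, histogram(ages, bins))
```

With 2 bins only a rough split is visible; with 4 bins the empty interval between the middle group and the oldest group shows up.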
Calendar plot
• For date-time variables with
a range between one month
and a couple of years, a
calendar plot is very useful.
• It can be seen as a heatmap
in calendar format where the
values are binned to days.
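A sketch of the binning behind a calendar plot: timestamps are counted per day, and each day is assigned a cell by ISO week and weekday. The cell layout (week by weekday) is an assumption about how the calendar is drawn, and `calendar_cells` and the timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime

def calendar_cells(timestamps):
    """Bin timestamps to days and key each day by (ISO year, ISO week, ISO weekday)."""
    per_day = Counter(ts.date() for ts in timestamps)
    cells = {}
    for day, count in per_day.items():
        iso_year, iso_week, iso_weekday = day.isocalendar()
        cells[(iso_year, iso_week, iso_weekday)] = count
    return cells

# Hypothetical event times
events = [
    datetime(2024, 9, 2, 8, 15),   # a Monday
    datetime(2024, 9, 2, 17, 40),  # same Monday
    datetime(2024, 9, 4, 12, 0),   # Wednesday of the same week
    datetime(2024, 9, 9, 9, 30),   # Monday of the next week
]
cells = calendar_cells(events)
print(cells)
```

Each cell value would then be mapped to a color, as in a regular heatmap.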
Bivariate visualizations
• Heatmap
• Two-dimensional kernel density plot
• Surface plot
• Mosaic plot
• Tableplot
• Treemap
Heatmap
• The heatmap is a powerful
workhorse that copes with the
shortcomings of scatter plots.
• Moreover, it can be applied to
numerical and high cardinality
ordinal, categorical, and date-time
variables. Numerical variables are
discretized in the same way as for
histograms.
• Counting the number of occurrences
for each combination of (discretized)
values results in a frequency matrix,
which is displayed as a heatmap.
• Alternatively, numeric data can also
be binned and visualized in
hexagons.
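The frequency matrix behind a heatmap can be sketched as follows. This assumes equal-width rectangular bins (not hexagons); `frequency_matrix` and the simulated age and income values are illustrative only.

```python
import random

def frequency_matrix(xs, ys, x_bins, y_bins):
    """Count how many (x, y) pairs fall into each cell of an x_bins by y_bins grid."""
    def bin_index(v, lo, hi, bins):
        width = (hi - lo) / bins or 1.0
        return min(int((v - lo) / width), bins - 1)

    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    matrix = [[0] * x_bins for _ in range(y_bins)]
    for x, y in zip(xs, ys):
        row = bin_index(y, y_lo, y_hi, y_bins)
        col = bin_index(x, x_lo, x_hi, x_bins)
        matrix[row][col] += 1
    return matrix

# Simulated data: 100,000 points would heavily overplot in a scatter plot
rng = random.Random(42)
n = 100_000
xs = [rng.gauss(40, 12) for _ in range(n)]
ys = [x * 1000 + rng.gauss(0, 8000) for x in xs]
matrix = frequency_matrix(xs, ys, x_bins=20, y_bins=10)
print(sum(map(sum, matrix)))  # every point lands in exactly one cell
```

However many points there are, the plot only has to draw 20 × 10 colored cells, which is why this approach scales.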
2D kernel density plot
• The two-dimensional kernel density plot is
another visualization method that is scalable
to a large number of observations. Contour
lines are drawn based on the estimated
kernel densities.
• It is especially suitable for numerical data, but
also for date-time variables and high
cardinality ordinal variables.
Surface plot
• The surface plot, a three-dimensional plot
in which the densities are expressed by
height, is another useful tool to visualize
bivariate relationships.
• Like the two-dimensional kernel
density plot, it is useful for numerical
values, and can also be used for
high cardinality ordinal and date-time
variables.
[Figure: surface plot with axes Age and Income; height shows Count]
Mosaic plot
• The mosaic plot is useful for low
cardinality variables.
• The areas of the rectangles are
proportional to the counts.
• A stacked bar chart is similar to a
mosaic plot, but does not show the
univariate frequency distribution of the
column variable.
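A sketch of how the rectangles of a mosaic plot can be derived from a table of counts: column widths follow the column totals, and heights within a column follow the shares inside that column, so each area is proportional to its count. The function `mosaic_rectangles` and the counts are made up for illustration, and every column is assumed to have a non-zero total.

```python
def mosaic_rectangles(table):
    """table: {column: {row: count}} -> list of (column, row, x, y, width, height)
    rectangles inside the unit square."""
    total = sum(sum(col.values()) for col in table.values())
    rects = []
    x = 0.0
    for col_name, col in table.items():
        col_total = sum(col.values())
        width = col_total / total          # univariate distribution of the column variable
        y = 0.0
        for row_name, count in col.items():
            height = count / col_total     # conditional distribution within the column
            rects.append((col_name, row_name, x, y, width, height))
            y += height
        x += width
    return rects

# Hypothetical counts
counts = {
    "1980s": {"love": 30, "other": 10},
    "2010s": {"love": 20, "other": 40},
}
rects = mosaic_rectangles(counts)
for r in rects:
    print(r)
```

Fixing every `width` to the same value would give a stacked bar chart, which is exactly where the column distribution gets lost.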
Subjects of pop songs
Treemap
• The treemap is very useful to
visualize the relationship between a
hierarchical categorical and a
numerical variable.
• The sizes of the rectangles
correspond to aggregates of the
numerical variable based on the
hierarchy of the categorical variable.
• In addition, color can be used to
encode a third, numerical, variable,
typically with a diverging color
scheme.
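The aggregation step can be sketched as a recursive sum over the hierarchy. This only computes rectangle sizes, not their layout on screen; `aggregate`, the sector names, and the employee numbers are hypothetical.

```python
def aggregate(node):
    """Size of a node: its own value for a leaf, the sum of its children otherwise."""
    if isinstance(node, dict):
        return sum(aggregate(child) for child in node.values())
    return node

# Hypothetical number of employees per sector and subsector
employees = {
    "Services": {"Retail": 120, "Hospitality": 80},
    "Industry": {"Manufacturing": 150, "Construction": 50},
}
total = aggregate(employees)
shares = {sector: aggregate(sub) / total for sector, sub in employees.items()}
print(total, shares)
```

Each share is the fraction of the total area that the sector's rectangle receives; the same rule is applied again inside each sector.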
Size represents number of employees per economic sector,
color represents the difference with last year.
Tableplot
• The tableplot may seem to be a
multivariate plot, but it is actually a
combination of bivariate plots.
• Data from two or more variables, which can be
numerical, ordinal, or categorical, is
binned according to the quantiles of a
numerical variable.
• For each variable that is either numerical or
ordinal with high cardinality, a bar chart with
mean values per bin is plotted.
• For each low cardinality categorical variable
or ordinal variable, a stacked bar chart is
plotted.
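A rough sketch of the binning described above, assuming equally sized bins of the rows after sorting on one numerical variable. The function `tableplot_bins`, the column names, and the rows are made up, and the number of bins is assumed not to exceed the number of rows.

```python
from collections import Counter
from statistics import mean

def tableplot_bins(rows, sort_key, n_bins):
    """Sort rows on `sort_key` (descending) and split them into n_bins quantile bins."""
    ordered = sorted(rows, key=lambda r: r[sort_key], reverse=True)
    size = len(ordered) / n_bins
    return [ordered[round(b * size): round((b + 1) * size)] for b in range(n_bins)]

# Hypothetical records
rows = [
    {"age": 62, "income": 40, "sector": "A"}, {"age": 55, "income": 60, "sector": "A"},
    {"age": 48, "income": 70, "sector": "B"}, {"age": 41, "income": 50, "sector": "B"},
    {"age": 36, "income": 45, "sector": "A"}, {"age": 30, "income": 35, "sector": "B"},
    {"age": 25, "income": 25, "sector": "B"}, {"age": 19, "income": 10, "sector": "A"},
]
summary = []
for chunk in tableplot_bins(rows, sort_key="age", n_bins=4):
    summary.append({
        "mean_income": mean(r["income"] for r in chunk),           # bar length, numerical column
        "sector_counts": dict(Counter(r["sector"] for r in chunk)),  # stacked bar, categorical column
    })
print(summary)
```

Each entry in `summary` corresponds to one horizontal band of the tableplot, with one bar per column.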
BREAK
Small Multiples
• Data is split into multiple subsets.
• For each subset, a small plot is created.
• These plots are called small multiples, facets, trellis charts, or lattice
charts.
• They are usually placed on a rectangular grid.
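A sketch of the bookkeeping behind small multiples: split the rows by a grouping variable and give each subset a position on a roughly square grid. The near-square grid is one possible choice, and `facet_layout`, the road names, and the counts are hypothetical.

```python
import math
from collections import defaultdict

def facet_layout(rows, key):
    """Split rows into subsets by `key` and assign each subset a grid cell."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[key]].append(row)
    n_cols = math.ceil(math.sqrt(len(subsets)))  # near-square grid
    layout = {}
    for i, name in enumerate(sorted(subsets)):
        layout[name] = {"row": i // n_cols, "col": i % n_cols, "data": subsets[name]}
    return layout

# Hypothetical vehicle counts per road and hour
rows = [
    {"road": "A1", "hour": 8, "vehicles": 4200},
    {"road": "A2", "hour": 8, "vehicles": 5100},
    {"road": "A1", "hour": 17, "vehicles": 4600},
    {"road": "A4", "hour": 8, "vehicles": 6100},
    {"road": "A12", "hour": 8, "vehicles": 3900},
    {"road": "A2", "hour": 17, "vehicles": 5300},
]
layout = facet_layout(rows, key="road")
for name, cell in layout.items():
    print(name, cell["row"], cell["col"], len(cell["data"]))
```

The same small plot, with shared axes, would then be drawn in every cell so that the subsets can be compared at a glance.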
Example 1
Number of counted vehicles on Dutch highways in one month (September).
Example 2
Estimated Day Time Population during one week per municipality based on mobile phone
network data. Color indicates cluster.
These plots are also small multiples
[Figure panels: bar chart per subset; tableplot]
Time series data
• Line graph / dot plot
• Calendar plot / heatmap
• Streamgraph
• Horizon graph
Line graph / dot plot
• The line graph is probably the most used
method for displaying time series data.
• It is less suitable when the time series are
not stable, or when there is noise in the
data.
• The dot plot (without lines) is a good
alternative for less stable or noisy data.
[Figure panels: dot plot; line graph with dots]
Streamgraph
• The streamgraph is a stacked area
chart (which is in turn an alternative
to a line chart).
• It is used for time series for several
subsets of data.
Frequency of subjects of New York service calls (311) during a day
Horizon graph
• A horizon graph is a space efficient
alternative to a line chart.
• It is constructed as follows:
1) Normal line chart
2) Line chart where area under the curve is filled
with a diverging color scheme.
3) Horizon graph. Only the peaks (positive or negative)
are shown
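One common way to get from the filled line chart to the horizon graph is to slice the values into bands, mirror the negative values, and draw the bands on top of each other in increasingly dark colors. The sketch below shows that folding step only; the banding approach, the function `horizon_bands`, and the values are assumptions for illustration, and the series is assumed to contain at least one non-zero value.

```python
def horizon_bands(values, n_bands):
    """Fold a series into n_bands layers per sign; each layer is a fill level in 0..1."""
    band_height = max(abs(v) for v in values) / n_bands
    folded = []
    for v in values:
        magnitude = abs(v)  # negative values are mirrored; the sign picks the color
        layers = [
            min(max(magnitude - b * band_height, 0.0), band_height) / band_height
            for b in range(n_bands)
        ]
        folded.append({"sign": 1 if v >= 0 else -1, "layers": layers})
    return folded

# Hypothetical series
series = [0, 3, 6, -9, -1.5]
folded = horizon_bands(series, n_bands=3)
for item in folded:
    print(item)
```

With three bands the chart needs only a third of the original height, and the darkest layer marks where the peaks are.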
Horizon graph (2)
The horizon graph is especially useful when there is little space, for
instance in case of small multiples:
Spatial data
• Choropleth
• Dot map
• 2D kernel density map
• Small multiples
Choropleth
A choropleth is a map type where
administrative regions are filled with
colors that represent a density or
ratio variable.
Day Time Population per municipality based on mobile
phone network data
Dot map
• Data points are shown as dots.
• Useful for spatial point data, such as geo-tagged events.
Cholera outbreak in London (1854) by John Snow. Map labels: deaths per
address; water pumps (possible causes).
Dot map (2)
• Interactive dot map of the Dutch
population colored by ethnic origin.
• The luminance of the dots indicates
the density.
• Prototype: http://research.cbs.nl/colordotmap/
• Based on the Racial Dot Map: http://demographics.coopercenter.org/DotMap/
Dot map (3)
• Crimes registered in Greater
London during October
2015. (Data available on https://data.police.uk/)
• Number of dots: 80,000
• Alpha transparency has
been applied to better show
the spatial data distribution.
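Why alpha transparency reveals density can be seen from a small calculation: with standard "over" blending of equally colored dots, k overlapping dots with opacity alpha give a combined opacity of 1 - (1 - alpha)^k. The function name `stacked_opacity` and the alpha value are illustrative assumptions; the map on the slide may use a different setting.

```python
def stacked_opacity(alpha, k):
    """Combined opacity of k overlapping dots, each drawn with opacity `alpha`."""
    return 1 - (1 - alpha) ** k

# Assumed alpha of 0.1 per dot
for k in (1, 5, 10, 50):
    print(k, round(stacked_opacity(0.1, k), 3))
```

Isolated dots stay faint while crowded locations approach full opacity, so darkness encodes the number of overlapping events.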
2D Kernel Density Map
• A 2D Kernel Density Estimator is a
technique to show the smoothed
densities.
• The bandwidth parameter
determines the level of
smoothness.
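A minimal sketch of a 2D kernel density estimate, assuming a Gaussian kernel with a single bandwidth for both directions (other kernels and per-axis bandwidths are possible). The function `kde2d` and the point coordinates are made up for illustration.

```python
import math

def kde2d(points, x, y, bandwidth):
    """Gaussian kernel density estimate at location (x, y)."""
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2 * bandwidth ** 2))
    return norm * total

# Hypothetical events in two clusters
points = [(0, 0), (0.2, 0.1), (-0.1, 0.2), (5, 5), (5.1, 4.9)]

# Ratio of the density inside a cluster to the density in the gap between clusters
contrast = {
    h: kde2d(points, 0, 0, h) / kde2d(points, 2.5, 2.5, h)
    for h in (0.5, 3.0)
}
print(contrast)
```

A small bandwidth keeps the two clusters sharply separated, while a large bandwidth smooths them into one broad hill, which is the trade-off the bandwidth parameter controls.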
Small multiples
Small multiples are very
suitable for spatial data
Crimes in the City of London by type of crime
Small multiples
Daytime Population
estimations (relative to
residential population totals)
Summary
• Data exploration is important to get to know the data.
• Visualization is key in data exploration; by looking at the data in different ways, patterns and
anomalies become clear.
• Many visualization methods can be used for data exploration. Some of these methods are
scalable and can therefore also be used for big data.
Group Project 2, due in two weeks.
• As an addition to your paper submission for GP1, please submit an overview of at least 5
visualizations you will be creating.
• Each Viz should have a clear narrative that can be explained in a few sentences.
• Additionally, for each Viz you should explain which aspects of design discussed in weeks 2
through 4 help craft the narrative.