
MGS 626 - Week 4 - Exploring Data

The document outlines key concepts in data visualization, focusing on exploratory data analysis (EDA) and various visualization methods for big data. It emphasizes the importance of understanding data through visualization to identify patterns, outliers, and errors. The document also discusses different visualization techniques suitable for various data formats, including univariate and bivariate methods, time series, and geospatial data.

Uploaded by

Pooja Kabadi

EXPLORING DATA

MGS 626: Data Visualization

Christopher Keaton

Outline
• Exploratory Data Analyses
• Visualization of big tabular data
• Univariate visualization methods
• Bivariate visualization methods
• Small multiples
• Visualization of time series
• Geospatial visualization

Why exploration?

We have BIG unknown data, so:
- Find interesting variables
- Find interesting patterns
- Find groups in the data
- Find possible outliers
- Find possible errors

“Statistics is the science of learning from data” (Larry Wasserman)

So, get to know your data!

Exploratory Data Analysis (EDA)


• Introduced by John Tukey (1977).
• EDA is an approach to analyzing data that focuses on looking
at the data directly rather than at indirect information
such as model fit and hypothesis tests.
• Many tools are useful for EDA, especially visualization
methods, e.g. scatter plot, histogram, box plot, and parallel
coordinates.
• EDA is one of the foundations of data mining.

Visualization of Big Data


Bottlenecks:
• Screen resolution: not enough pixels to plot each data point separately
• Restricted focus of the users regarding:
  • number of variables
  • different subgroups
  • overview vs. detail
• Computational issues

Which tools are used to visualize big data? That depends on what you want to explore, the
form of the data, etc.

Variety of Big Data


• Big data comes in many different formats. That is why variety is one of the
V’s by which Big Data is commonly described.
• In general, the following formats are common for big data, especially the top
three:
  • Tabular data
  • Time series data
  • Spatial data
  • Tree-structured data
  • Text data
  • Image/video/audio data
• Usually, the format of a big data source is a mixture of these formats.

Variety of Big Data (2)


• Web data is often tree-structured (e.g. the prices of a web shop by category)
and contains text and/or images.
• Social media data is often text, image, and video.
• Sensor data is usually tabular data with variables that describe space and
time. The type of measurement data depends on the sensor type; it is most often
numeric, but can also be image, video, audio, etc.
• If a tabular dataset contains a spatial component, then it can be linked to
spatial units, which can be spatial points, lines, polygons, or raster points.

Visualization of Big Tabular Data


• A major bottleneck of many EDA tools is that they do not scale well.
• For instance, a scatter plot with more than 1,000 points will suffer from occlusion.
• Therefore, let Exploratory Big Data Analysis (EBDA) be EDA that is “big
data”-proof.
• The question is: which of the tools used for EDA can also be applied to big
tabular data?

Key Factors to Consider

Data Volume:
• How many records are you handling?
• What's the refresh frequency?
• Do you need real-time visualization?

Performance Requirements:
• Browser-based vs desktop application
• Need for interactivity
• Response time expectations

Technical Considerations:
• Data source integration
• Existing technology stack
• Team expertise
• Deployment requirements

Cost Factors:
• Licensing models
• Infrastructure requirements
• Maintenance needs
• Training requirements

Different Scenarios
Real-time Dashboards:
• Grafana
• Kibana
• Apache Superset

Geographic Data:
• Kepler.gl
• deck.gl
• QGIS

Network/Graph Data:
• Neo4j Bloom
• Gephi
• Cytoscape

Scientific/Research:
• ParaView
• VisIt
• MATLAB

How many items can be plotted at once?

Not too many, because of screen resolution and occlusion.

Solution: aggregation

[Figure: two panels, "Top down" and "Bottom up"]

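
The screen-resolution limit can be made concrete with a small sketch. It assumes a hypothetical 200 x 100 pixel plotting area and synthetic uniform random points (both are illustration choices, not from the slides), maps each point to a pixel, and aggregates bottom-up by counting the points per pixel.

```python
import random
from collections import Counter

def pixel_counts(points, width, height):
    """Aggregate (x, y) points in [0, 1) x [0, 1) into counts per pixel."""
    counts = Counter()
    for x, y in points:
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        counts[(col, row)] += 1
    return counts

random.seed(0)  # synthetic data, for illustration only
points = [(random.random(), random.random()) for _ in range(100_000)]
counts = pixel_counts(points, width=200, height=100)

# Every point beyond the first in a pixel is drawn on top of another point.
overplotted = sum(c - 1 for c in counts.values())
print(f"{len(points)} points share {len(counts)} pixels; "
      f"{overplotted} points are hidden behind others")
```

Plotting the per-pixel counts instead of the raw points is the aggregation idea that the later heatmap slides build on.
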
EDA is Cyclical

Common Pitfalls

• Over-reliance on default visualizations
• Jumping to conclusions too quickly
• Not considering data quality issues
• Forgetting about domain context

Scope of visualizing Big Tabular Data

Number of observations n (rows) by number of variables p (columns):

          | Large p                                                          | Small p
Large n   | Use variable selection or dimension reduction, proceed with EBDA | Exploratory Big Data analysis
Small n   | Use variable selection or dimension reduction, proceed with EDA  | EDA

Exploratory (Big) Data Analysis tasks


The tasks for EBDA are the same as for EDA:

• Describe data distribution: how often does a value or a combination of values occur?
What is the dependence on other variables?
• Find groups: is the distribution a mixture of distributions of different “natural” groups
in the data? If so, the data can be split.
• Identify outliers: find improbable extreme values. Are they errors? Determine if they
should be excluded or corrected in the analysis.
• How many values are invalid or missing? Determine what the impact is on the
analysis.
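
A minimal sketch of three of these tasks for a single numeric column is given below. The 1.5 x IQR rule for flagging outliers is an assumption (the slides do not prescribe a rule), `explore_column` is a hypothetical helper name, and the sample values are made up.

```python
import statistics
from collections import Counter

def explore_column(values):
    """Return basic EDA facts for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread  # assumed outlier rule
    return {
        "missing": len(values) - len(present),           # data quality
        "frequencies": Counter(present),                 # distribution
        "outliers": [v for v in present if v < low or v > high],
    }

ages = [23, 25, 25, 27, 30, 31, None, 33, 35, 240, None]  # made-up sample
report = explore_column(ages)
print(report["missing"], report["outliers"])
```

Whether a flagged value such as 240 is an error or a genuine extreme still has to be judged with domain knowledge.
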

Univariate Analysis

• Histogram / frequency polygon
• Frequency plot, pie chart, stacked bar chart
• Treemap
• Calendar plot
Histogram / frequency polygon


• A histogram is a statistical technique to plot a numerical,
ordinal or date-time variable.
• Numerical values are discretized by cutting the variable range
into intervals and counting the frequency per interval.
• It is advised to try several different bin sizes to extract
features from the data set: a bin size that is too large results in a
coarse histogram that describes the global structure but misses
fine-grained details, while a bin size that is too small results in a
noisy histogram.
• A histogram is well suited for describing a data distribution
since it makes no distributional assumptions.
• A date-time variable is typically shown in a frequency
polygon, which is the line-variant of the histogram.
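
The effect of the bin size can be tried out with a few lines of code. This is a sketch with made-up values and a hypothetical `histogram` helper; it only computes the counts that a plotting library would draw.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per interval [k * bin_width, (k + 1) * bin_width)."""
    counts = Counter(int(v // bin_width) for v in values)
    # Key each bin by its lower bound.
    return {k * bin_width: counts[k] for k in sorted(counts)}

# Made-up values with two clusters, around 12 and around 27.
values = [10, 11, 12, 12, 13, 14, 25, 26, 27, 27, 28, 29]

for width in (40, 5, 1):
    print(f"bin width {width}: {histogram(values, width)}")
```

With a width of 40 all values fall into one bin and the two clusters disappear; a width of 5 shows them clearly; a width of 1 spreads the counts over many near-empty bins.
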

Calendar plot
• For date-time variables with a range between one month and a couple of years,
a calendar plot is very useful.
• It can be seen as a heatmap in calendar format where the values are binned to days.
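
The binning to days can be sketched as follows. The layout by ISO week and weekday is one possible calendar arrangement and is an assumption, as are the `calendar_cells` name and the sample timestamps.

```python
from collections import Counter
from datetime import datetime

def calendar_cells(timestamps):
    """Bin timestamps to days and key each day by (ISO year, ISO week, weekday)."""
    per_day = Counter(ts.date() for ts in timestamps)
    cells = {}
    for day, count in per_day.items():
        iso_year, iso_week, iso_weekday = day.isocalendar()
        cells[(iso_year, iso_week, iso_weekday)] = count
    return cells

events = [
    datetime(2015, 10, 5, 9, 30),   # a Monday
    datetime(2015, 10, 5, 17, 45),  # the same Monday
    datetime(2015, 10, 7, 12, 0),   # the Wednesday after
]
print(calendar_cells(events))
```

Each cell then becomes one colored square, with weeks along one axis and weekdays (1 = Monday) along the other.
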

Bivariate visualizations
• Heatmap
• Two-dimensional kernel density plot
• Surface plot
• Mosaic plot
• Tableplot
• Treemap

Heatmap
• The heatmap is a powerful workhorse that copes with the shortcomings of
scatter plots.
• Moreover, it can be applied to numerical and high cardinality ordinal,
categorical, and date-time variables. Numerical variables are discretized in
the same way as for histograms.
• Counting the number of occurrences for each combination of (discretized)
values results in a frequency matrix, which is displayed as a heatmap.
• Alternatively, numeric data can also be binned and visualized in hexagons.
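
The frequency matrix behind a heatmap can be sketched as below, using rectangular bins. The function name, bin widths, and the (age, income) pairs are assumptions chosen for illustration.

```python
from collections import Counter

def frequency_matrix(pairs, x_bin_width, y_bin_width):
    """Count how often each combination of discretized (x, y) values occurs."""
    counts = Counter(
        (int(x // x_bin_width), int(y // y_bin_width)) for x, y in pairs
    )
    x_bins = range(min(i for i, _ in counts), max(i for i, _ in counts) + 1)
    y_bins = range(min(j for _, j in counts), max(j for _, j in counts) + 1)
    # Rows are y bins, columns are x bins; empty cells get a count of 0.
    return [[counts[(i, j)] for i in x_bins] for j in y_bins]

# Made-up (age, income in thousands) pairs.
people = [(23, 18), (27, 22), (29, 25), (34, 41), (38, 44), (39, 47), (51, 43)]
matrix = frequency_matrix(people, x_bin_width=10, y_bin_width=20)
for row in matrix:
    print(row)
```

The size of the matrix depends on the number of bins, not on the number of observations, which is why the heatmap scales where the scatter plot does not.
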

2D kernel density plot


• The two-dimensional kernel density plot is another visualization method that
scales to a large number of observations. Contour lines are drawn based on the
estimated kernel densities.
• It is especially suitable for numerical data, but also for date-time variables
and high cardinality ordinal variables.
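
A minimal sketch of the density estimate itself, assuming a Gaussian product kernel with a single bandwidth for both dimensions (the slides do not name a kernel). The data points and the `kde_2d` name are made up; a plotting library would evaluate such an estimate on a grid and draw contour lines through it.

```python
import math

def kde_2d(points, x, y, bandwidth):
    """Estimate the density at (x, y) with a Gaussian product kernel."""
    total = 0.0
    for px, py in points:
        squared_distance = ((x - px) ** 2 + (y - py) ** 2) / bandwidth ** 2
        total += math.exp(-0.5 * squared_distance)
    # Normalize so that the estimate integrates to 1 over the plane.
    return total / (len(points) * 2 * math.pi * bandwidth ** 2)

points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (4.0, 4.0)]  # made-up data
near_cluster = kde_2d(points, 1.0, 1.0, bandwidth=0.5)
far_away = kde_2d(points, 2.5, 2.5, bandwidth=0.5)
print(near_cluster > far_away)
```

A larger bandwidth spreads each observation over a wider area and gives a smoother estimate.
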

Surface plot

• The surface plot, a three-dimensional plot in which the densities are
expressed by height, is another useful tool to visualize bivariate relationships.
• Like the two-dimensional kernel density plot, it is useful for numerical
values, and can also be used for high cardinality ordinal and date-time
variables.

[Figure: surface plot with axes Age, Income, and Count]

Mosaic plot
• The mosaic plot is useful for low cardinality variables.
• The areas of the rectangles are proportional to the counts.
• A stacked bar chart is similar to a mosaic plot, but does not show the
univariate frequency distribution of the column variable.

[Figure: Subjects of pop songs]
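
The geometry of a mosaic plot can be sketched as follows: the column width encodes the univariate share of the column variable, and the height within a column encodes the conditional share of the row variable. The function name and the (genre, subject) observations are made up.

```python
from collections import Counter

def mosaic_rectangles(pairs):
    """Return {(column, row): (width, height)} on a unit square for a mosaic plot."""
    cell_counts = Counter(pairs)
    column_totals = Counter(column for column, _ in pairs)
    total = len(pairs)
    rectangles = {}
    for (column, row), count in cell_counts.items():
        width = column_totals[column] / total    # univariate share of the column
        height = count / column_totals[column]   # share of the row within the column
        rectangles[(column, row)] = (width, height)
    return rectangles

# Made-up (genre, subject) observations.
songs = [("pop", "love")] * 6 + [("pop", "party")] * 2 + [("rock", "love")] * 2
rectangles = mosaic_rectangles(songs)
for cell, (width, height) in rectangles.items():
    print(cell, "area =", width * height)
```

Because width times height equals count divided by total, each area is proportional to its count. A stacked bar chart would use the same heights but give every column the same width.
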

Treemap

• The treemap is very useful to visualize the relationship between a
hierarchical categorical variable and a numerical variable.
• The sizes of the rectangles correspond to aggregates of the numerical
variable based on the hierarchy of the categorical variable.
• In addition, color can be used to encode a third, numerical, variable,
typically with a diverging color scheme.

[Figure: Size represents the number of employees per economic sector; color
represents the difference from last year.]
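
The aggregation that determines the rectangle sizes can be sketched as below. The sector names and employee counts are made up, and `aggregate_hierarchy` is a hypothetical helper; the actual rectangle layout is left to a treemap library.

```python
from collections import defaultdict

def aggregate_hierarchy(records):
    """Sum a numerical value for every node of a category hierarchy.

    Each record is (path, value), where path is a tuple such as
    ("Services", "Retail"). Every prefix of the path gets the value added.
    """
    totals = defaultdict(float)
    for path, value in records:
        for depth in range(1, len(path) + 1):
            totals[path[:depth]] += value
    return dict(totals)

# Made-up employee counts per (sector, subsector).
employees = [
    (("Services", "Retail"), 120),
    (("Services", "Finance"), 80),
    (("Industry", "Manufacturing"), 150),
]
sizes = aggregate_hierarchy(employees)
print(sizes[("Services",)], sizes[("Industry",)])
```
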

Tableplot
• The tableplot may seem to be a multivariate plot, but it is actually a
combination of bivariate plots.
• Data from two or more variables, which can be numerical, ordinal, or
categorical, is binned according to the quantiles of a numerical variable.
• For each variable that is either numerical or ordinal with high cardinality,
a bar chart with mean values per bin is plotted.
• For each low cardinality categorical or ordinal variable, a stacked bar chart
is plotted.
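
A minimal sketch of the tableplot binning, assuming equally sized bins of the sorted rows and at most as many bins as rows. The column names, the `tableplot_bins` helper, and the records are made up for illustration.

```python
from collections import Counter

def tableplot_bins(rows, sort_key, numeric_key, categorical_key, n_bins):
    """Bin rows by quantiles of sort_key and summarize two other columns per bin."""
    ordered = sorted(rows, key=lambda row: row[sort_key])
    bins = []
    for b in range(n_bins):
        start = b * len(ordered) // n_bins
        end = (b + 1) * len(ordered) // n_bins
        chunk = ordered[start:end]
        # Numerical column: mean per bin (drawn as a bar).
        mean = sum(row[numeric_key] for row in chunk) / len(chunk)
        # Categorical column: category shares per bin (drawn as a stacked bar).
        shares = Counter(row[categorical_key] for row in chunk)
        fractions = {cat: n / len(chunk) for cat, n in shares.items()}
        bins.append({"mean": mean, "fractions": fractions})
    return bins

people = [  # made-up records
    {"age": 22, "income": 20, "sex": "f"}, {"age": 25, "income": 24, "sex": "m"},
    {"age": 31, "income": 30, "sex": "f"}, {"age": 38, "income": 34, "sex": "f"},
    {"age": 45, "income": 40, "sex": "m"}, {"age": 52, "income": 44, "sex": "m"},
    {"age": 60, "income": 38, "sex": "m"}, {"age": 67, "income": 30, "sex": "f"},
]
bins = tableplot_bins(people, "age", "income", "sex", n_bins=2)
for b in bins:
    print(b)
```
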
BREAK


Small Multiples
• Data is split into multiple subsets.
• For each subset, a small plot is created.
• These plots are called small multiples, facets, trellis charts, or lattice
charts.
• They are usually placed on a rectangular grid.
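
Splitting the data and placing the subsets on a grid can be sketched as follows. The near-square grid and the alphabetical ordering are assumptions, and the crime records are made up.

```python
import math
from collections import defaultdict

def small_multiples_layout(records, key):
    """Split records by key and assign each subset a (row, column) grid cell."""
    subsets = defaultdict(list)
    for record in records:
        subsets[record[key]].append(record)
    n_columns = math.ceil(math.sqrt(len(subsets)))  # assumed: near-square grid
    layout = {}
    for index, name in enumerate(sorted(subsets)):
        layout[name] = {"cell": divmod(index, n_columns), "data": subsets[name]}
    return layout

crimes = [  # made-up records
    {"type": "theft", "x": 0.2, "y": 0.7},
    {"type": "burglary", "x": 0.5, "y": 0.1},
    {"type": "theft", "x": 0.9, "y": 0.4},
    {"type": "robbery", "x": 0.3, "y": 0.3},
    {"type": "drugs", "x": 0.6, "y": 0.8},
    {"type": "vehicle crime", "x": 0.1, "y": 0.5},
]
layout = small_multiples_layout(crimes, key="type")
for name, panel in layout.items():
    print(panel["cell"], name, len(panel["data"]))
```

Each panel would then be drawn with the same plot type and the same axes, so that the subsets can be compared directly.
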

Example 1

[Figure: Number of counted vehicles on Dutch highways in one month (September).]

Example 2

[Figure: Estimated daytime population during one week per municipality, based on mobile phone
network data. Color indicates cluster.]

These plots are also small multiples

[Figures: bar chart per subset; tableplot]

Time series data


• Line graph / dot plot
• Calendar plot / heatmap
• Streamgraph
• Horizon graph

Line graph / dot plot


• The line graph is probably the most used method for displaying time series data.
• It is less suitable when the time series are not stable, or when there is
noise in the data.
• The dot plot (without lines) is a good alternative for less stable or noisy data.

[Figures: dot plot; line graph with dots]

Streamgraph
• The streamgraph is a stacked area chart (which is in turn an alternative
to a line chart).
• It is used to show time series for several subsets of the data.

[Figure: Frequency of subjects of New York service calls (311) during a day]
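
The stacking can be sketched as below. The baseline centered around zero is an assumption (it is a common choice for streamgraphs, but the slides do not specify one), and the series values and the `stream_layers` name are made up.

```python
def stream_layers(series):
    """Stack several time series around a centered baseline.

    series is a list of equally long lists; the result holds, per series,
    one (lower, upper) boundary pair per time step.
    """
    n_steps = len(series[0])
    layers = [[] for _ in series]
    for t in range(n_steps):
        total = sum(s[t] for s in series)
        lower = -total / 2  # assumed: symmetric baseline
        for index, s in enumerate(series):
            upper = lower + s[t]
            layers[index].append((lower, upper))
            lower = upper
    return layers

calls = [[2, 4, 6], [2, 2, 2]]  # made-up counts for two subjects
layers = stream_layers(calls)
for layer in layers:
    print(layer)
```

Setting `lower = 0` instead would give an ordinary stacked area chart.
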

Horizon graph
• A horizon graph is a space efficient alternative to a line chart.
• It is constructed as follows:
  1) Normal line chart
  2) Line chart where the area under the curve is filled with a diverging color scheme
  3) Horizon graph: only the peaks (positive or negative) are shown
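
The folding step can be sketched as follows, assuming the common construction in which the value range is cut into bands of equal height that are drawn on top of each other, with negative values mirrored and distinguished by color. The band height, the number of bands, and the series are made up.

```python
def horizon_bands(values, band_height, n_bands):
    """Fold a series into layered bands as in a horizon graph.

    Returns, per value, a (sign, heights) pair: sign is +1 or -1 (the side of
    the diverging color scheme) and heights[k] is how much of band k is filled.
    """
    folded = []
    for value in values:
        sign = 1 if value >= 0 else -1
        magnitude = abs(value)
        heights = [
            min(max(magnitude - k * band_height, 0), band_height)
            for k in range(n_bands)
        ]
        folded.append((sign, heights))
    return folded

series = [2, 7, 12, -4, -11]  # made-up values
folded = horizon_bands(series, band_height=5, n_bands=3)
for value, (sign, heights) in zip(series, folded):
    print(value, sign, heights)
```

Drawing band k in a darker shade than band k - 1 lets the chart use only `band_height` of vertical space while still showing the peaks.
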

Horizon graph (2)


The horizon graph is especially useful when there is little space, for
instance in the case of small multiples.

Spatial data
• Choropleth
• Dot map
• 2D kernel density map
• Small multiples

Choropleth
A choropleth is a map type where administrative regions are filled with
colors that represent a density or ratio variable.

[Figure: Daytime population per municipality based on mobile phone network data]
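
A minimal sketch of how region values could be turned into color classes. Equal-interval classes are an assumption (other classification schemes exist), and the municipality figures are made up.

```python
def choropleth_classes(regions, n_classes):
    """Map each region to a color class based on its density (count / area)."""
    densities = {name: count / area for name, (count, area) in regions.items()}
    lowest, highest = min(densities.values()), max(densities.values())
    step = (highest - lowest) / n_classes or 1.0  # avoid dividing by zero
    return {
        name: min(int((d - lowest) / step), n_classes - 1)
        for name, d in densities.items()
    }

# Made-up (population, area in km^2) per municipality.
municipalities = {"A": (100_000, 50.0), "B": (20_000, 100.0), "C": (44_000, 40.0)}
classes = choropleth_classes(municipalities, n_classes=3)
print(classes)
```

Dividing by area matters: mapping raw counts would make large regions look important simply because they are large.
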

Dot map
• Data points are shown as dots.
• Useful for spatial point data, such as geo-tagged events.

[Figure: Cholera outbreak in London (1854), by John Snow. Map labels: deaths per
address; water pumps (possible causes).]

Dot map (2)


• Interactive dot map of the Dutch population colored by ethnic origin.
• The luminance of the dots indicates the density.
• Prototype: http://research.cbs.nl/colordotmap/
• Based on the Racial Dot Map: http://demographics.coopercenter.org/DotMap/

Dot map (3)


• Crimes registered in Greater London during October 2015.
(Data available on https://data.police.uk/)
• Number of dots: 80,000
• Alpha transparency has been applied to better show the spatial data distribution.
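
Why alpha transparency helps can be seen from standard alpha compositing: when dots of the same color overlap, the combined opacity grows with the number of dots. The sketch below assumes every dot has the same alpha value; the value 0.1 is an arbitrary example.

```python
def stacked_opacity(alpha, n_dots):
    """Opacity of n_dots same-colored dots drawn on top of each other."""
    return 1 - (1 - alpha) ** n_dots

for n in (1, 5, 20, 50):
    print(n, round(stacked_opacity(0.1, n), 3))
```

A single dot stays faint, while dense areas approach full opacity, so the color intensity acts as a rough density estimate.
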

2D Kernel Density Map


• A 2D Kernel Density Estimator is a technique to show the smoothed densities.
• The bandwidth parameter determines the level of smoothness.

Small multiples
Small multiples are very suitable for spatial data.

[Figure: Crimes in the City of London by type of crime]


Small multiples

[Figure: Daytime population estimations (relative to residential population totals)]

Summary
• Data exploration is important to get to know the data.
• Visualization is key in data exploration; by looking at the data in different ways, patterns and
anomalies become clear.
• Many visualization methods can be used for data exploration. Some of these methods can
also be used for big data, since they are scalable.

Group Project 2, due in two weeks.
• As an addition to your paper submission for GP1, please submit an overview of at least 5
visualizations you will be creating.
• Each Viz should have a clear narrative that can be explained in a few sentences.
• Additionally, for each Viz you should explain how the aspects of design discussed in weeks 2
through 4 help craft the narrative.
