Data Visualization
Visualization of Categorical and Numerical Data
Table of Contents
Introduction
1. Visual Analysis of Statistical Data
1.1. Key Measures Computed for Statistical Data
1.2. Examining the Data
2. Data Visualization - Variation within Categorical Measures
2.1. Bar Graphs
2.2. Specialized Bar Graphs
2.3. Treemaps
3. Variation within Numerical Measures
3.1. Histograms
Summary
Introduction
This topic gives an overview of the need to use visual analysis techniques for examining
statistical data. These techniques are used to understand the variation within categorical
and numerical data measures and also the relationships between these measures.
The visualization methods used for graphically examining the variation within categorical
and numerical data measures are covered in this topic. The remaining methods, which cover
variation through time and across space and relationships between the measures, are
covered in the subsequent topic.
Learning Objectives
Upon completion of this topic, you will be able to:
• Explain the importance of visual analysis for statistical data
• Describe the techniques available for visualizing variation within categorical and numerical
data.
1. Visual Analysis of Statistical Data
In the data analysis process, it is essential to gain a thorough understanding of the data
before applying modelling to extract insights. However, the huge volume of data available in
today’s enterprises makes it difficult to gain sufficient clarity about the data.
Hence, the approach taken is to move top-down, first arriving at an initial set of
characteristics which describe the data at an overall level. This is achieved by using data
visualization methods on sample data to unearth critical facts and trends.
1.1. Key Measures Computed for Statistical Data
To analyze the datasets, the following statistical measures are computed for key variables:
a. Location: The mean, median and mode
b. Variability: Percentiles, variance and standard deviation
c. Shape: Skew and kurtosis
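As a minimal sketch, the measures listed above can be computed with Python's standard statistics module. The function name describe and the sample values are hypothetical, and the population formulas used here for variance, skew and kurtosis are one assumption among several common conventions.

```python
import statistics

def describe(values):
    """Return location, variability and shape measures for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumed convention)
    # Shape: standardized third and fourth moments (population formulas).
    skew = sum((x - mean) ** 3 for x in values) / (n * sd ** 3)
    kurtosis = sum((x - mean) ** 4 for x in values) / (n * sd ** 4)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": statistics.quantiles(values, n=4, method="inclusive"),
        "variance": statistics.pvariance(values),
        "std_dev": sd,
        "skew": skew,
        "kurtosis": kurtosis,
    }

# Hypothetical sample: daily order counts with one unusually large value.
orders = [2, 3, 3, 4, 5, 5, 5, 6, 7, 20]
summary = describe(orders)
for name, value in summary.items():
    print(name, value)
```

With this sample the mean is pulled above the median by the single large value, and the skew is positive, which is the kind of pattern a plot of the distribution makes visible at a glance.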
Visualization techniques are especially useful for identifying outliers in distributions and
checking for associations between variables. Visualizing data distributions often also
reveals very different behavior in datasets which have identical location or variability
measures, as shown in Figure 1.1.
Figure 1.1. - Anscombe quartet
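The point made by the Anscombe quartet can be checked numerically. The sketch below uses the published quartet values and shows that the four datasets share the same rounded means and variances, even though their plots look very different. The helper name summarize and the rounding precision are choices made for this illustration.

```python
import statistics

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Rounded mean and sample variance of x and y."""
    return {
        "mean_x": round(statistics.fmean(xs), 2),
        "mean_y": round(statistics.fmean(ys), 2),
        "var_x": round(statistics.variance(xs), 1),
        "var_y": round(statistics.variance(ys), 1),
    }

summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, stats)
```

All four rows print the same summary, so the differences between the datasets can only be seen by plotting them.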
1.2. Examining the Data
To understand the data in a dataset, both the categorical and numerical (or quantitative)
measures associated with it need to be examined.
Categorical measures take values that name a category. Typical examples of such data are
product, country, customer and territory. Numerical measures take values that are measured
quantitatively. Examples are profit, revenue, expense and blood pressure.
Statistical meaning is found both in the variations within the categorical and numerical
measures and in the relationships among them. The following table summarizes the
different types of variations and relationships.
Type of variation or relationship           | Question answered (type of analysis)
Variation within categorical measures       | How do items in the categories relate to each other? (Ranking, Part-to-whole)
Variation within numerical measures         | How are values in the measure distributed across the range? (Distribution)
Variation through time                      | How do values change through time? (Time-series)
Relationship between numerical measures     | How do measures relate to one another? (Correlation)
Variation across space                      | Where are values located in space relative to one another? (Spatial)
Relationship between categorical measures   | How do categories relate to each other, mediated by measures? (Inter-category)
Table 1.1 – Summary table of variations and relationships within and across measures.
2. Data Visualization - Variation within Categorical Measures
This analysis is done to evaluate how data items within a category in the dataset relate to
each other in terms of ranking (for example, highest to lowest) and proportion of the whole,
that is, part-to-whole. The visualization methods used in this analysis are explained below.
2.1. Bar Graphs
Bar graphs are most suited for displaying values subdivided into discrete instances along a
nominal or ordinal scale. The visual weight of the bars places emphasis on the individual
values in the graph and makes it easy to compare individual values to one another by
simply comparing the heights (or lengths) of the bars.
There are three main types of bar graphs:
a. Horizontal: Well suited to comparative ranking, like a top-five list. It is also preferred
in situations where the category labels are very descriptive and fitting them along
the axis of a column graph becomes difficult.
Figure 2.1. - Ranking of top five product sales
Source: datapine.com
b. Vertical/Column: Good for showing chronological data, such as growth over specific
periods, and comparing data across categories.
Figure 2.2. - Comparing product sales across channels and countries
Source: datapine.com
c. Stacked: Useful for handling part-to-whole relationships.
Figure 2.3. - Age-wise distribution of new customers across quarters
Source: datapine.com
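The ranking use of a horizontal bar graph can be sketched in plain text. This is only an illustration: the function name top_n_bars, the product names and the sales figures are hypothetical, and a charting library would normally draw the bars.

```python
def top_n_bars(sales, n=5, width=40):
    """Return text lines for a horizontal bar graph of the n largest values."""
    ranked = sorted(sales.items(), key=lambda item: item[1], reverse=True)[:n]
    largest = ranked[0][1]
    label_width = max(len(name) for name, _ in ranked)
    lines = []
    for name, value in ranked:
        # Bar length is proportional to the value; the largest value fills the width.
        bar = "#" * round(value / largest * width)
        lines.append(f"{name:<{label_width}} | {bar} {value}")
    return lines

# Hypothetical product sales figures (units sold).
sales = {"Laptop": 120, "Phone": 300, "Tablet": 90, "Monitor": 60,
         "Keyboard": 150, "Mouse": 30}
lines = top_n_bars(sales)
for line in lines:
    print(line)
```

Sorting before drawing is what turns a plain bar graph into a ranking: the reader can take in the order of the items as well as their relative sizes.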
Two other techniques used to visualize the part-to-whole distribution of items in a category
are the Pie Chart and the Doughnut Chart. The arc length of each sector, and consequently
its central angle and area, is proportional to the quantity it represents.
[Pie chart. Categories: Car, Taxi, Two wheeler, Cycle, Local bus, Metro, Walk. Slice labels: 3%, 6%, 28%, 14%, 2%, 12%, 35%.]
Figure 2.4. - Commuting means by employees in XYZ company
Figure 2.5. - Doughnut chart of age structure
Source - Devexpress
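The proportionality rule for pie and doughnut charts can be sketched as a small calculation: each category's share of the total determines the central angle of its sector. The function name pie_slices and the commuting counts below are hypothetical and are not taken from Figure 2.4.

```python
def pie_slices(counts):
    """Map each category to (share of the whole, central angle in degrees)."""
    total = sum(counts.values())
    return {name: (value / total, 360 * value / total)
            for name, value in counts.items()}

# Hypothetical survey of how 200 employees commute.
commute = {"Car": 50, "Bus": 70, "Metro": 60, "Walk": 20}
slices = pie_slices(commute)
for name, (share, angle) in slices.items():
    print(f"{name:<6} {share:6.1%} {angle:6.1f} degrees")
```

The angles always add up to 360 degrees, which is why these charts can only show parts of a single whole.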
2.2. Specialized Bar Graphs
a. Bullet graph
The bullet graph is a variation of the bar graph, which depicts a performance measure
along with a comparative value and a qualitative measure to show if performance is
good, bad or intermediate.
Figure 2.6. - Bullet Graph
Source: Wikipedia - Bullet Graph
In Figure 2.6, the dark bar represents the performance measure, the vertical marker
represents the comparative value, and the background shading represents qualitative
ranges of performance, with the lightest shade denoting the best performance.
b. Pareto Charts
Pareto charts are helpful for depicting part-to-whole relationships. The items are shown as
bars arranged in descending order of value. A line shows the cumulative value, which totals
100%, and helps to pinpoint the main contributing parts of the whole.
Figure 2.7. - Pareto chart
Source: pareto-chart.com
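The data behind a Pareto chart can be sketched as follows: the categories are sorted in descending order and a running total is expressed as a percentage of the whole. The function name pareto_table and the defect counts are hypothetical.

```python
from itertools import accumulate

def pareto_table(counts):
    """Sort categories by value (descending) and add cumulative percentages."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(value for _, value in ranked)
    running = accumulate(value for _, value in ranked)
    return [(name, value, 100 * cum / total)
            for (name, value), cum in zip(ranked, running)]

# Hypothetical defect counts by cause.
defects = {"Scratches": 15, "Misalignment": 45, "Wrong colour": 10,
           "Cracks": 25, "Other": 5}
table = pareto_table(defects)
for name, value, cum_pct in table:
    print(f"{name:<13} {value:>3} {cum_pct:6.1f}%")
```

In this made-up example the first two causes account for 70% of all defects, which is the kind of conclusion the cumulative line is meant to make obvious.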
c. Deviation Bar Graphs
These are bar graphs which directly express the change in a value between two points in
time. They are very useful in cases where the focus is exclusively on the change in a value
over time, regardless of ranking or part-to-whole relationships.
Figure 2.8. - Deviation in diseases before and after 4 weeks of sanitation drive
Source: www.cdc.gov
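A deviation bar graph plots the difference between two measurements rather than the measurements themselves. The plain-text sketch below draws decreases to the left of a centre axis and increases to the right. The function name deviation_lines and the case counts are hypothetical and are not taken from Figure 2.8.

```python
def deviation_lines(before, after, width=10):
    """Return text lines with bars left of the axis for decreases, right for increases."""
    lines = []
    for name in before:
        change = after[name] - before[name]
        bar = "#" * min(abs(change), width)  # cap the bar at the available width
        left = bar.rjust(width) if change < 0 else " " * width
        right = bar.ljust(width) if change > 0 else " " * width
        lines.append(f"{name:<10}{left}|{right} {change:+d}")
    return lines

# Hypothetical weekly case counts before and after an intervention.
before = {"Disease A": 12, "Disease B": 7, "Disease C": 4}
after = {"Disease A": 5, "Disease B": 9, "Disease C": 4}
lines = deviation_lines(before, after)
for line in lines:
    print(line)
```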
2.3. Treemaps
When the number of items to be compared in the categories exceeds what a bar graph
can handle, treemaps are used. Treemaps are designed to display part-to-whole
relationships.
They use rectangles contained within larger rectangles to represent a hierarchy of up to 3
levels. In addition to rectangle size, colour can be used to display another attribute.
Figure 2.9. - Proportion of individual country GDP to Top 15 Nation GDP 2011
Source - Satori group
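The idea that rectangle size encodes each item's share of the whole can be sketched with a very simple one-level layout, in which a rectangle is cut into side-by-side strips. This is an assumption made for illustration only: treemap tools generally use more elaborate layouts and nest rectangles for each level of the hierarchy. The function name slice_layout and the figures are hypothetical.

```python
def slice_layout(values, x, y, width, height):
    """Split a rectangle into side-by-side strips with areas proportional to values.

    Returns a dict mapping each name to (x, y, width, height).
    """
    total = sum(values.values())
    rects = {}
    offset = x
    for name, value in values.items():
        strip = width * value / total
        rects[name] = (offset, y, strip, height)
        offset += strip
    return rects

# Hypothetical GDP figures (arbitrary units) for three countries.
gdp = {"A": 50, "B": 30, "C": 20}
rects = slice_layout(gdp, 0, 0, 100, 40)
for name, rect in rects.items():
    print(name, rect)
```

Applying the same function again inside each strip, with the split direction alternated, would give the nested rectangles used for a hierarchy.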
3. Variation within Numerical Measures
When examining the variation within numerical measures, the focus is on understanding
how data is distributed across the range from the lowest to the highest value, in other
words, the data distribution.
3.1. Histograms
Histograms are the graphs most commonly used for summarising distributions and are
frequently used to understand the data. They are especially useful when there are a large
number of observations. The spread of values is divided into intervals of equal size, and
bars are used to display the number (or percentage) of values in each interval. The
histogram is very helpful for easy visual recognition of key characteristics or patterns in the
data: for example, the highest and lowest scores, where scores are centred, and whether
scores are clustered together or scattered.
Figure 3.1. - Pattern of frequency of arrivals at park gate
Source - Wikimedia
Figure 3.2. - Bimodal histogram showing 2 peaks in distribution of weights
Source – Minitab
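The binning step behind a histogram can be sketched as follows: the range from the lowest to the highest value is divided into equal-width intervals and the values falling in each interval are counted. The function name histogram_counts and the scores are hypothetical, and the sketch assumes the values are not all identical.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max.

    Assumes max(values) > min(values); otherwise the interval width would be zero.
    """
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # Each interval includes its left edge; the maximum goes in the last interval.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts

# Hypothetical test scores.
scores = [52, 55, 61, 64, 66, 68, 70, 71, 73, 75, 78, 82, 85, 91, 92]
edges, counts = histogram_counts(scores, bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f}-{right:5.1f} | {'#' * count}")
```

Changing the number of intervals changes the picture, so it is worth trying a few values before drawing conclusions about the shape of the distribution.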
a. Relative frequency histogram
In a relative frequency histogram, the vertical scale is marked with relative frequencies
instead of actual frequencies.
Relative Frequency = Class Frequency/Sum of all Frequencies
Figure 3.3. - Relative frequency percentage of occurrences over the year
Source - The National Severe Storms Laboratory
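The relative frequency formula above can be sketched directly in code. The function name relative_frequencies and the class frequencies are hypothetical.

```python
def relative_frequencies(class_frequencies):
    """Divide each class frequency by the sum of all frequencies."""
    total = sum(class_frequencies)
    return [frequency / total for frequency in class_frequencies]

# Hypothetical class frequencies for four intervals (20 observations in all).
frequencies = [4, 10, 4, 2]
relative = relative_frequencies(frequencies)
for frequency, share in zip(frequencies, relative):
    print(f"{frequency:>3} {share:6.1%}")
```

Because the relative frequencies always sum to 1, histograms built from samples of different sizes can be compared on the same vertical scale.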
b. Frequency polygons
These are similar in purpose to a histogram, but use a line to represent the values
instead of bars. Line segments are connected to points located directly above class
midpoint values. The heights of the points correspond to the class frequencies, and the
line segments are extended to the right and left so that the graph begins and ends on
the horizontal axis.
Figure 3.4. - Frequency polygon of bacterial cell lengths
Source - Kean University
Figure 3.5. - Frequency polygon with multiple distributions
Source - onlinestatbook.com
Frequency polygons have two advantages over histograms:
• The shape of the distribution is shown more clearly
• The shapes of multiple distributions can be compared in a single graph
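The construction of a frequency polygon described above can be sketched as a list of points: one point above each class midpoint, plus one extra point on the horizontal axis at each end. The function name polygon_points and the class data are hypothetical, and equal class widths are assumed.

```python
def polygon_points(edges, counts):
    """Return (x, y) vertices: class midpoints, closed on the axis at both ends.

    Assumes all classes have the same width.
    """
    width = edges[1] - edges[0]
    midpoints = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
    # Extend one class width to the left and right so the line starts and ends at zero.
    xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
    ys = [0] + list(counts) + [0]
    return list(zip(xs, ys))

# Hypothetical class edges and class frequencies.
edges = [50, 60, 70, 80, 90]
counts = [3, 5, 3, 4]
points = polygon_points(edges, counts)
print(points)
```

Joining these points with straight line segments gives the polygon. Several distributions can be overlaid by computing one list of points per distribution.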
c. Cumulative Frequency Distributions
Frequency polygons are also good choices to display cumulative frequency distributions.
Cumulative frequency for a given class is the sum of the frequencies for that class and
the preceding classes.
Figure 3.6. - Cumulative frequency distribution
Source - sychstat.missouristate.edu
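The definition of cumulative frequency given above amounts to a running sum over the class frequencies, which can be sketched with itertools.accumulate from the standard library. The class frequencies are hypothetical.

```python
from itertools import accumulate

# Hypothetical class frequencies, listed from the lowest class to the highest.
frequencies = [3, 5, 3, 4]

# Each entry is the frequency of its class plus all preceding classes.
cumulative = list(accumulate(frequencies))

# The same values expressed as a percentage of all observations.
cumulative_pct = [100 * c / cumulative[-1] for c in cumulative]

print(cumulative)
print(cumulative_pct)
```

The last cumulative value always equals the total number of observations, so the plotted line never decreases and ends at the total (or at 100% when percentages are used).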
3.2. Box Plots
Invented by John Tukey, box-plots are an excellent tool for comparing multiple distributions.
To draw a box-plot, we need to do the following:
a. Order the data
b. Obtain the minimum and maximum values
c. Obtain the median and the quartiles Q1 and Q3.
d. Draw a line from the minimum to the maximum value
e. Draw a box with its lines drawn at Q1, the median and Q3.
The IQR is the inter-quartile range, that is, the value of Q3 - Q1. Values which fall outside
the lower and upper inner fences, that is, below (Q1 - 1.5 IQR) or above (Q3 + 1.5 IQR),
are considered outliers.
Figure 3.7. - Box-plots
Source - whatissixsigma.net
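The values needed to draw a box-plot, together with the outlier rule, can be sketched as follows. The function name box_plot_summary and the data are hypothetical. Quartiles can be computed by several conventions; the "inclusive" method of statistics.quantiles is assumed here, and other tools may give slightly different Q1 and Q3 values.

```python
import statistics

def box_plot_summary(values):
    """Return the five-number summary, the IQR, the inner fences and the outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        "iqr": iqr, "lower_fence": lower_fence, "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Hypothetical delivery times in minutes, with one unusually long delivery.
times = [12, 15, 16, 18, 19, 20, 21, 22, 24, 45]
summary = box_plot_summary(times)
for name, value in summary.items():
    print(name, value)
```

Here the value 45 lies above the upper fence of 29.625 and is reported as an outlier, while all other values fall within the fences.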
The box-plot is very useful for displaying the full range of the data, that is, the center, the
spread of values from minimum to maximum, and any outlier values. It is also helpful for
checking whether values are clustered or evenly distributed.
Summary
In the data analysis process, it is important to thoroughly understand the data before further
analysis in order to obtain insights. Visualizations help tremendously in this process by
unearthing facts and trends which might not otherwise have been visible.
Visualization techniques are used to understand the shape of the data distribution. In order
to understand the data in the dataset, both the categorical and numerical (or quantitative)
measures associated with it need to be examined. Statistical meaning is found both in the
variations within categorical and numerical measures and in the relationships between them.
Variation within categorical measures is visualized primarily using bar graphs (horizontal,
column and stacked), pie and doughnut charts, specialized bar graphs such as bullet
graphs, Pareto charts and deviation bar graphs, and treemaps.
Variation within numerical measures is visualized using histograms (including relative
frequency histograms), frequency polygons and box-plots.