EXPLORATORY DATA ANALYSIS (EDA)
• Generating summary statistics for numerical data in the dataset
• Creating graphical representations to understand the data more easily
• The initial analysis of data supplied or extracted, using descriptive statistics and visualization tools, to understand:
  • the trends
  • underlying limitations
  • quality and patterns
  • relationships between various entities within the data set
EDA consists of a combination of the following methods:
1. Univariate visualization
It is the process of finding the summary statistics for each field in the raw dataset.
2. Bivariate visualization
It is the process of finding the summary statistics for assessing the relationship between each variable in the dataset and the target variable of interest.
3. Multivariate visualization
It is used to understand interactions between different fields in the data.
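The three levels above can be sketched numerically before any chart is drawn. The example below is a minimal sketch using only the Python standard library. The records, the field names and the choice of Pearson correlation as the relationship measure are assumptions made for illustration and do not come from the source.

```python
import statistics

# Hypothetical toy records (illustrative values only, not from the source).
records = [
    {"calories": 60, "totfat": 0.0, "sugar": 14.0},
    {"calories": 140, "totfat": 5.0, "sugar": 20.0},
    {"calories": 160, "totfat": 8.0, "sugar": 19.0},
    {"calories": 200, "totfat": 12.0, "sugar": 22.0},
    {"calories": 210, "totfat": 13.0, "sugar": 21.0},
]
fields = ["calories", "totfat", "sugar"]
target = "calories"  # assumed target variable of interest


def column(name):
    """Return all values of one field as a list."""
    return [row[name] for row in records]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# 1. Univariate: summary statistics for each field on its own.
univariate = {
    f: {
        "min": min(column(f)),
        "median": statistics.median(column(f)),
        "mean": statistics.mean(column(f)),
        "max": max(column(f)),
    }
    for f in fields
}

# 2. Bivariate: each remaining field against the target variable.
bivariate = {f: pearson(column(f), column(target)) for f in fields if f != target}

# 3. Multivariate: every pair of fields.
multivariate = {
    (a, b): pearson(column(a), column(b))
    for i, a in enumerate(fields)
    for b in fields[i + 1:]
}

print(univariate["calories"])
print(bivariate)
print(multivariate)
```

In practice each of these dictionaries would usually be paired with a plot (a histogram, a scatter plot, a matrix plot), but the numbers alone already show which fields move together.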
DATA VISUALIZATION
• Data visualization is the representation of data through use
of common graphics, such as charts, plots, infographics and
even animations.
• These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.
• Identify Outliers in Data
• Enhanced Collaboration
• Business Analysis Made Easy
• Improve Response Time
• Greater Simplicity
• Easier Visualization of Patterns
VISUAL ENCODING
• Encoding in data visualization is translating the data into a visual element
on a chart or map through position, shape, size, symbols and colour.
• Data is mapped into visual structures, which are used to build the images on a screen.
• A well-chosen encoding is more effective because the information conveyed by the graph is easy to perceive.
• The attribute values signify important data characteristics such as
numerical data, categorical data, or ordinal data.
• Spatiotemporal data contains special attributes such as geographical
location (spatial dimension) and/or time (temporal dimension).
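As a rough illustration of encoding, the sketch below maps each attribute of a record to one visual channel: position for the numerical value, colour for the categorical value and size for the ordinal value. The records, the colour names and the size scale are hypothetical choices made for this example. A plotting library would then draw the resulting marks.

```python
# Hypothetical records with one numerical, one categorical and one ordinal attribute.
rows = [
    {"name": "A", "sales": 120, "region": "north", "rating": "low"},
    {"name": "B", "sales": 300, "region": "south", "rating": "high"},
    {"name": "C", "sales": 210, "region": "north", "rating": "medium"},
]

# Categorical data -> colour (no implied order between categories).
REGION_COLOUR = {"north": "blue", "south": "orange"}

# Ordinal data -> size (the order low < medium < high is preserved).
RATING_SIZE = {"low": 20, "medium": 60, "high": 120}


def encode(row, index):
    """Translate one data record into the visual properties of one mark."""
    return {
        "label": row["name"],
        "x": index,                      # position along the x-axis
        "y": row["sales"],               # numerical data -> position on the y-axis
        "colour": REGION_COLOUR[row["region"]],
        "size": RATING_SIZE[row["rating"]],
    }


marks = [encode(row, i) for i, row in enumerate(rows)]
for mark in marks:
    print(mark)
```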
Data Visualization Libraries
• Matplotlib
• Seaborn
• Plotnine(ggplot)
• Bokeh
• pygal
• Plotly
• geoplotlib
• Gleam
• missingno
• Leather
• Altair
• Folium
Graphics
“A picture is worth a thousand words…”
STA6166-2-1
Example Data
Excel spreadsheet
Column chart
[Figure: column chart of calories by candy, y-axis from 0 to 250.]
Display the data table
Alternate Display
[Figure: "Calories in Common Candies", sorted column chart, y-axis from 0 to 250.]
Sorting and expanding the scale of the graph allows all labels to be seen as well as displaying a characteristic of the data.
Vertical Display of Data
[Figure: "Calories in Common Candies", horizontal bar chart with one bar per candy. Legible labels: Milk Chocolate Bar, Dark Chocolate Bar, Milk Chocolate Malted Milk Balls, Milk Chocolate Covered Raisins, Caramels, After Dinner Mint, Licorice Twists, Semi Sweet Chocolate Chips, Starlight Mints, Lollipop, Chewing Gum.]
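[Figure: pie chart of saturated fat (SatFat). Slice labels, given as value (count, percent): 0 (14, 63.6%), 1 (3, 13.6%), 3 (3, 13.6%), 4 (1, 4.5%), 6 (1, 4.5%). A further label reads SatFat (9, 40.9%).]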
10 60 60 60 60 60 70 130 140 140 160 160 160 160 160 160 180 180 200 210 210 210
Range
Extremes
• Minimum(calories) = 10
• Maximum(calories) = 210
Range = 210 - 10 = 200
Trimmed mean = mean of the data after some fraction of the smallest and largest values has been removed. Usually the smallest 5% and largest 5% of the values (counts rounded to the nearest integer) are removed for this computation.
For the candy data, trimmed mean = 136.0 (with 10% trimmed, 5% from each tail).
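The trimmed mean for the calories data can be checked with a few lines of Python. This is a minimal sketch: the function name `trimmed_mean` is a hypothetical helper, and the number of values dropped from each tail is rounded with Python's built-in `round`, which is an assumption about how "rounded to nearest integer" is applied.

```python
# Sorted calories data from the slides (n = 22).
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def trimmed_mean(values, fraction_per_tail=0.05):
    """Mean after dropping the given fraction of values from each tail."""
    ordered = sorted(values)
    # 22 * 0.05 = 1.1, which rounds to 1 value removed from each end.
    k = round(len(ordered) * fraction_per_tail)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


plain_mean = sum(calories) / len(calories)
print(round(plain_mean, 1))        # ordinary sample mean
print(trimmed_mean(calories))      # 5% trimmed from each tail
```

Dropping the single smallest value (10) and one of the largest (210) leaves 20 values that sum to 2720, which reproduces the 136.0 quoted above.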
Sample mean
n = number of observations
\bar{y} = symbolic "name" for the sample mean
$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$$
Quartiles
Here n = 22 and (n+1)/4 = 23/4 = 5.75, hence Q1 lies three quarters of the way between the 5th and 6th observations in the sorted list. The 5th value is 60 and the 6th value is 60, thus
Q1 = 60 + .75(60-60) = 60.
For Q2, (n+1)/2 = 23/2 = 11.5, i.e. halfway between the 11th and 12th observations.
Q2 = 160 + .5(160-160) = 160.
For Q3, 3(n+1)/4 = 3(23)/4 = 69/4 = 17.25, i.e. a quarter of the way between the 17th and 18th observations.
Q3 = 180 + .25(180-180) = 180.
10 60 60 60 60 60 70 130 140 140 160 160 160 160 160 160 180 180 200 210 210 210
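The position rule used above, (n+1)p with linear interpolation between neighbouring observations, can be written as a short function. This is a sketch of that one rule only. The function name is hypothetical, and statistical packages offer several other quantile conventions that can give slightly different answers.

```python
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def quantile_n_plus_1(sorted_values, p):
    """p quantile using the 1-based position (n + 1) * p with interpolation."""
    n = len(sorted_values)
    position = (n + 1) * p           # e.g. 23 * 0.25 = 5.75
    lower = int(position)            # whole part: index of the lower neighbour
    fraction = position - lower      # how far to move toward the next value
    if lower < 1:                    # position falls before the first observation
        return sorted_values[0]
    if lower >= n:                   # position falls at or after the last one
        return sorted_values[-1]
    low_value = sorted_values[lower - 1]
    high_value = sorted_values[lower]
    return low_value + fraction * (high_value - low_value)


q1 = quantile_n_plus_1(calories, 0.25)
q2 = quantile_n_plus_1(calories, 0.50)
q3 = quantile_n_plus_1(calories, 0.75)
print(q1, q2, q3)
```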
Percentiles
100p-th percentile (0 < p < 1): the value in a sorted list of the data that has approximately 100p% of the measurements below it and approximately 100(1-p)% above it. (The p quantile.)
[Figure: distribution function.]
Examples:
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile
Simplified Quartiles
A simpler way to find Q1 & Q3 is as follows:
1. Order the data from the lowest to the highest value, and find the
median.
2. Divide the ordered data into the lower half and the upper half, using
the median as the dividing value. (Always exclude the median itself
from each half.)
3. Q1 is just the median of the lower half.
4. Q3 is just the median of the upper half.
Ex: For the candy data we still get Q1=60 and Q3=180.
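The four steps can be sketched as follows, using the standard library's `statistics.median`. The function name `simple_quartiles` is a hypothetical helper. For an even number of observations the median falls between two data points, so there is no single value to exclude and the list is simply split in half.

```python
import statistics

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def simple_quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)                 # step 1: order the data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                   # step 2: lower half
    # For odd n the middle value is the median itself and is skipped.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = statistics.median(lower)            # step 3
    q3 = statistics.median(upper)            # step 4
    return q1, statistics.median(ordered), q3


print(simple_quartiles(calories))
```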
Measures of Variability
Range
Interquartile Range
Variance
Standard Deviation
Quartiles:
Q1 = 25th = 60
Q2 = 50th = median = 160
Q3 = 75th = 180
Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$
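The measures of variability listed above can be computed for the calories data as a quick check. This sketch applies the sample variance formula directly and compares it with `statistics.variance`. The standard deviation is taken as the square root of the variance, and the quartile values are the ones quoted in the slides.

```python
import statistics

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

n = len(calories)
mean = sum(calories) / n

# Range: distance between the extremes.
data_range = max(calories) - min(calories)

# Interquartile range, using Q1 = 60 and Q3 = 180 from the slides.
q1, q3 = 60, 180
iqr = q3 - q1

# Sample variance: squared deviations from the mean, divided by n - 1.
sample_variance = sum((y - mean) ** 2 for y in calories) / (n - 1)

# Standard deviation: square root of the variance, in the units of the data.
sample_sd = sample_variance ** 0.5

print(data_range, iqr)
print(sample_variance, statistics.variance(calories))
print(sample_sd, statistics.stdev(calories))
```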
Excel Data Analysis Tool
Select the Data Analysis Tool
Select Descriptive Statistics
The menu below appears.
Enter the Input Range and
check the output options
desired.
Excel Descriptive Statistics Output
Computing Descriptive Stats
[Figure: descriptive statistics output and a box plot with the minimum and maximum marked (SAS Proc Insight).]
Percentiles
100p-th percentile (0 < p < 1): the value in a sorted list of the data that has approximately 100p% of the measurements below it and approximately 100(1-p)% above it. (The p quantile.)
[Figure: smoothed histogram.]
Examples:
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile
Frequency Histogram
A graphical presentation of the frequency table where the relative
areas of the bars are in proportion to the frequencies.
[Figure: frequency histogram; y-axis: frequency, x-axis: calories, with the bin width marked.]
Density Histogram
Histograms have
important ties to
probability.
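One way to see the tie to probability is to rescale a frequency histogram so that the bar areas add up to 1. The sketch below does this for the calories data without any plotting library. The bin width of 50 and the bin edges from 0 to 250 are assumptions chosen for illustration, since the bins used in the original figure are not recoverable.

```python
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

bin_width = 50                               # assumed bin width
edges = list(range(0, 251, bin_width))       # 0, 50, 100, 150, 200, 250

# Frequency table: count the observations falling in each bin.
counts = [0] * (len(edges) - 1)
for y in calories:
    index = min((y - edges[0]) // bin_width, len(counts) - 1)
    counts[index] += 1

n = len(calories)
# Relative frequency: share of observations in each bin.
relative = [c / n for c in counts]
# Density: relative frequency divided by bin width, so that area = probability.
density = [c / (n * bin_width) for c in counts]

total_area = sum(d * bin_width for d in density)
print(counts)
print(total_area)
```

With the heights expressed as densities, the area of any group of bars estimates the probability that an observation falls in that range.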
Number of Bins for Histograms
[Figure: histograms with a smoothed histogram or density curve.]
[Figure: two scatter plots of calories against totfat, drawn with different axis lengths.]
The lengths of the axes can change how the relationship is perceived.
Matrix Plot
Brushing the plot to identify interesting points.
Three-D Views
• Basic data visualization tools
Histograms, Bar charts/graphs, Scatter plots, Line
charts, Area plots, Pie charts, Donut charts
• Specialized data visualization tools
Boxplots, Bubble plots, Heat map, Dendrogram,
Venn diagram, Treemap, 3D scatter plots
• Advanced data visualization tools
Word clouds, visualization of geospatial data
Data Visualization types
Area plots
• An area plot, also known as an area chart, is a data visualization technique used to represent data over time or across categories. When several series are layered on top of one another it is called a stacked area chart.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'x': list(range(1, 11)),
    'y': [1, 3, 2, 4, 5, 7, 6, 8, 9, 10]
})
# Create the area line plot: shade the region under the curve, then draw the line
plt.fill_between(df['x'], df['y'], color='blue', alpha=0.2)
plt.plot(df['x'], df['y'], color='red', alpha=0.5, linewidth=0.9)
plt.show()
Donut Charts
• Donut charts are the modified version of Pie Charts with the area of
center cut out.
• Donut charts rely on the length of the arcs to represent the information, whereas pie charts focus on comparing the proportional areas of the slices.
• Donut charts are more efficient in terms of space because the blank
space inside the donut charts can be used to display some additional
information about the donut chart.
import matplotlib.pyplot as plt

Employee = ['Roshni', 'Shyam', 'Priyanshi', 'Harshit', 'Anmol']
Salary = [40000, 50000, 70000, 54000, 44000]
# colors
colors = ['#FF0000', '#0000FF', '#FFFF00',
          '#ADFF2F', '#FFA500']
# pull every slice slightly away from the center
explode = (0.05, 0.05, 0.05, 0.05, 0.05)
# Pie Chart
plt.pie(Salary, colors=colors, labels=Employee,
        autopct='%1.1f%%', pctdistance=0.85,
        explode=explode)
# draw a white circle that will cover the center of the pie
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
# Adding Circle in Pie chart turns it into a donut
fig.gca().add_artist(centre_circle)
plt.show()
Box Plot
• A box plot, also known as a box-and-whisker plot, displays a summary of a set of data values: minimum, first quartile, median, third quartile and maximum.
• In the box plot, a box is drawn from the first quartile to the third quartile, and a line goes through the box at the median.
• In a vertical box plot the x-axis identifies the data set being plotted while the y-axis shows the range of its values.
import numpy as np
import matplotlib.pyplot as plt

# random_integers is not defined in the original snippet;
# a sample of 100 random integers between 1 and 100 is assumed here
random_integers = np.random.randint(1, 101, size=100)

plt.figure(figsize=(10, 6))
plt.boxplot(random_integers, patch_artist=True,
            boxprops=dict(facecolor='orange', color='black'),
            medianprops=dict(color='red'))
plt.title('Box Plot of Random Integers')
plt.ylabel('Value')
plt.grid(True)
plt.show()
Bubble plots
• In Plotly, a bubble chart is built on the scatter plot.
• It can be created using the scatter() method of plotly.express, with the size argument controlling the bubble size.
• A bubble chart displays multiple circles (bubbles) in a two-dimensional plot, as in a scatter plot, with the size of each bubble encoding an additional variable.
• A bubble chart is primarily used to depict and show relationships between numeric variables.
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length",
color="species",
size='petal_length',
hover_data=['petal_width'])
fig.show()
Heat map
• A 2-D heatmap is a data visualization tool that represents the magnitude of the values in a matrix as a colored table.
• 2-D heatmaps can be plotted using the Matplotlib and Seaborn packages.
• It is used to see the correlation between columns of a dataset, with color intensity indicating the strength of the correlation (which end of the scale is darker depends on the color map).
• It is also used for plotting time series and finance-related data, for example with the month on the Y-axis, the year on the X-axis and each cell of the heatmap holding a data value.
import numpy as np
import matplotlib.pyplot as plt

# 12 x 12 matrix of random values in [0, 1)
data = np.random.random((12, 12))
plt.imshow(data)
plt.colorbar()  # legend mapping colors to values
plt.title("2-D Heat Map")
plt.show()
Data visualization converts large and small data sets into visuals, which are easy for humans to understand and process.
Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data.
In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts of information.
Data visualizations are common in everyday life, and they often appear in the form of graphs and charts. A combination of multiple visualizations and bits of information is referred to as an infographic.
Data visualizations are used to discover unknown facts and trends. You can see visualizations in
the form of line charts to display change over time. Bar and column charts are useful for
observing relationships and making comparisons. A pie chart is a great way to show parts-of-a-
whole. And maps are the best way to share geographical data visually.
Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such as dials and gauges, geographic maps, heat maps, pie charts, and fever charts.
Effective data visualizations are created where communication, data science, and design collide. Done right, data visualizations turn key insights from complicated data sets into something meaningful and natural.
American statistician and Yale professor Edward Tufte believes useful data visualizations consist of "complex ideas communicated with clarity, precision, and efficiency."
To craft an effective data visualization, you need to start with clean data that is well-sourced and
complete. After the data is ready to visualize, you need to pick the right chart.
1. Weather reports: Maps and other plot types are commonly used in weather reports.
2. Internet websites: Social media analytics websites such as Social Blade and Google
Analytics use data visualization techniques to analyze and compare the performance of
websites.
3. Astronomy: NASA uses advanced data visualization techniques in its reports and
presentations.
4. Geography
5. Gaming industry
There are many data visualization types. The following are the commonly used data visualization charts:
1. Distribution plot
2. Box and whisker plot
3. Violin plot
4. Line plot
5. Bar plot
6. Scatter plot
7. Histogram
8. Pie chart
9. Area plot
10. Hexbin plot
11. Heatmap
2. Box and whisker plot
This plot is used to plot the variation of the values of a numerical feature. You can get the values' minimum, maximum, median, lower and upper quartiles.
3. Violin plot
Similar to the box and whisker plot, the violin plot is used to plot the variation of a numerical feature. But it contains a kernel density curve in addition to the box plot. The kernel density curve estimates the underlying distribution of data.
4. Line plot
A line plot is created by connecting a series of data points with straight lines. The number of periods is on the x-axis.
5. Bar plot
A bar plot is used to plot the frequency of occurring categorical data. Each category is represented by a bar. The bars can be created vertically or horizontally. Their heights or lengths are proportional to the values they represent.
6. Scatter plot
Scatter plots are created to see whether there is a relationship (linear or non-linear and positive or negative) between two numerical variables. They are commonly used in regression analysis.
7. Histogram
A histogram represents the distribution of numerical data. Looking at a histogram, we can decide whether the values are normally distributed (a bell-shaped curve), skewed to the right or skewed left. A histogram of residuals is useful to validate important assumptions in regression analysis.
8. Pie chart
A categorical variable pie chart includes each category's values as slices whose sizes are proportional to the quantity they represent. It is a circular graph made with slices equal to the number of categories.
9. Area plot
The area plot is based on the line chart. We get the area plot when we cover the area between the line and the x-axis.
10. Hexbin plot
Similar to the scatter plot, a hexbin plot represents the relationship between two numerical variables. It is useful when there are a lot of data points in the two variables. When you have a lot of data points, they will overlap when represented in a scatter plot.
11. Heatmap
A heatmap visualizes the correlation coefficients of numerical features with a color map. Depending on the color map, either light or dark colors mark a high correlation, so the color bar should always be checked. The heatmap is extremely useful for identifying multicollinearity, which occurs when the input features are highly correlated with one or more of the other features in the dataset.
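The numbers behind a correlation heatmap can be computed before plotting them. The sketch below builds a correlation matrix with the standard library and flags feature pairs that may indicate multicollinearity. The feature values, the feature names and the 0.9 threshold are assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical numerical features (illustrative values only).
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],   # roughly 2 * x1, so strongly correlated
    "x3": [5.0, 3.0, 4.0, 1.0, 2.0],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


names = list(features)
# Correlation matrix: the values a heatmap would colour cell by cell.
corr = {a: {b: pearson(features[a], features[b]) for b in names} for a in names}

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"
flagged = [(a, b) for a, b in combinations(names, 2)
           if abs(corr[a][b]) > THRESHOLD]

for a in names:
    print(a, [round(corr[a][b], 2) for b in names])
print("possible multicollinearity:", flagged)
```

The nested dictionary `corr` can be passed to a plotting library as a matrix. The flagged pairs are the cells that would stand out at the strong end of the color scale.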
Tools and Software for Data Visualization
There are multiple tools and software available for data
visualization.
1. Python provides open-source libraries such as
• Matplotlib
• Seaborn
• Plotly
• Bokeh
• Altair
Advantages and Disadvantages of Data Visualization
Advantages
There are many advantages of data visualization. Data visualization
is used to:
• Communicate your results or findings with your audience
• Tune hyperparameters
• Identify trends, patterns and correlations between variables
• Monitor the model’s performance
• Clean data
• Validate the model’s assumptions
Disadvantages
There are also some disadvantages of data visualization.
• We need to download, install and configure software and open-
source libraries. The process will be difficult and time-consuming
for beginners.
• Some data visualization tools are not available for free. We need to
pay for those.
• When we summarize the data, we’ll lose the exact information.
After you have decided the chart type, you need to design and customize your visualization to
your liking. Simplicity is essential - you don't want to add any elements that distract from the
data.
The concept of using pictures to understand data, in the form of maps and graphs, dates back to the 17th century, and the pie chart was invented in the early 1800s.
Several decades later, one of the most advanced examples of statistical graphics occurred when Charles Minard mapped Napoleon's invasion of Russia. The map represents the size of the army and the path of Napoleon's retreat from Moscow, and that information is tied to temperature and time scales for a more in-depth understanding of the event.
Data visualization is important because of the way the human brain processes information. Using graphs and charts to visualize large, complex data sets is easier than poring over spreadsheets and reports.
Data visualization is an easy and quick way to convey concepts universally. You can experiment with different scenarios by making slight adjustments.
Data visualization tools have been necessary for democratizing data and analytics and for making data-driven insights available to workers throughout an organization. They are easy to operate in comparison to earlier versions of BI software or traditional statistical analysis software. This has led to a rise in lines of business implementing data visualization tools on their own, without support from IT.
Why Use Data Visualization?
5. To perform competitive analysis.
6. To improve insights.
In our increasingly data-driven world, it’s more important than ever to have accessible ways to
view and understand data. After all, the demand for data skills in employees is steadily
increasing each year. Employees and business owners at every level need to have an
understanding of data and of its impact.
That’s where data visualization comes in handy. With the goal of making data more accessible
and understandable, data visualization in the form of dashboards is the go-to tool for many
businesses to analyze and share information.
In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.
Something as simple as presenting data in graphic format may seem to have no downsides. But
sometimes data can be misrepresented or misinterpreted when placed in the wrong style of
data visualization. When choosing to create a data visualization, it’s best to keep both the
advantages and disadvantages in mind.
Advantages
Our eyes are drawn to colors and patterns. We can quickly identify red from blue, and squares
from circles. Our culture is visual, including everything from art and advertisements to TV and
movies. Data visualization is another form of visual art that grabs our interest and keeps our
eyes on the message. When we see a chart, we quickly see trends and outliers. If we can see
something, we internalize it quickly. It’s storytelling with a purpose. If you’ve ever stared at a
massive spreadsheet of data and couldn’t see a trend, you know how much more effective a
visualization can be.
Disadvantages
While there are many advantages, some of the disadvantages may seem less obvious. For
example, when viewing a visualization with many different datapoints, it’s easy to make an
inaccurate assumption. Or sometimes the visualization is just designed wrong so that it’s
biased or confusing.
The importance of data visualization is simple: it helps people see, interact with, and better
understand data. Whether simple or complex, the right visualization can bring everyone on the
same page, regardless of their level of expertise.
It’s hard to think of a professional industry that doesn’t benefit from making data more
understandable. Every STEM field benefits from understanding data—and so do fields in
government, finance, marketing, history, consumer goods, service industries, education,
sports, and so on.
While we’ll always wax poetic about data visualization (you’re on the Tableau website, after
all) there are practical, real-life applications that are undeniable. And, since visualization is so
prolific, it’s also one of the most useful professional skills to develop. The better you can convey
your points visually, whether in a dashboard or a slide deck, the better you can leverage that
information. The concept of the citizen data scientist is on the rise. Skill sets are changing to
accommodate a data-driven world. It is increasingly valuable for professionals to be able to use
data to make decisions and use visuals to tell stories of when data informs the who, what,
when, where, and how.
While traditional education typically draws a distinct line between creative storytelling and
technical analysis, the modern professional world also values those who can cross between the
two: data visualization sits right in the middle of analysis and visual storytelling.
As the “age of Big Data” kicks into high gear, visualization is an increasingly key tool to make
sense of the trillions of rows of data generated every day. Data visualization helps to tell stories
by curating data into a form easier to understand, highlighting the trends and outliers. A good
visualization tells a story, removing the noise from data and highlighting useful information.
However, it’s not simply as easy as just dressing up a graph to make it look better or slapping on the “info” part of an infographic. Effective data visualization is a delicate balancing act between form and function. The plainest graph could be too boring to catch any notice or it could make a powerful point; the most stunning visualization could utterly fail at conveying the right message or it could speak volumes. The data and the visuals need to work together, and there’s an art to combining great analysis with great storytelling.
When you think of data visualization, your first thought probably immediately goes to simple bar
graphs or pie charts. While these may be an integral part of visualizing data and a common
baseline for many data graphics, the right visualization must be paired with the right set of
information. Simple graphs are only the tip of the iceberg. There’s a whole selection of
visualization methods to present data in effective and interesting ways.
General Types of Visualizations:
• Chart: Information presented in a tabular, graphical form with data displayed along two axes. Can be in the form of a graph, diagram, or map.
• Graph: A diagram of points, lines, segments, curves, or areas that represents certain variables in comparison to each other, usually along two axes at a right angle.
• Geospatial: A visualization that shows data in map form using different shapes and colors to show the relationship between pieces of data and specific locations.
• Infographic: A combination of visuals and words that represent data. Usually uses charts or diagrams.
• Dashboards: A collection of visualizations and data displayed in one place to help with analyzing and presenting data.
• Area Map: A form of geospatial visualization, area maps are used to show specific values set over a map of a country, state, county, or any other geographic location. Two common types of area maps are choropleths and isopleths.
• Bar Chart: Bar charts represent numerical values compared to each other. The length of the bar represents the value of each variable.
• Box-and-whisker Plots: These show a selection of ranges (the box) across a set measure (the bar).
• Heat Map: A type of geospatial visualization in map form which displays specific data values as different colors (this doesn’t need to be temperatures, but that is a common use).
• Highlight Table: A form of table that uses color to categorize similar data, allowing the viewer to read it more easily and intuitively.
• Histogram: A type of bar chart that splits a continuous measure into different bins to help analyze the distribution.
• Pie Chart: A circular chart with wedge-shaped segments that shows data as a percentage of a whole.
• Treemap: A type of chart that shows different, related values in the form of rectangles nested together.
There are dozens of tools for data visualization and data analysis. These range from simple to complex, from intuitive to obtuse. Not every tool is right for every person looking to learn visualization techniques, and not every tool can scale to industry or enterprise purposes. If you’d like to learn more about the options, feel free to read up on them or dive into detailed third-party analysis like the Gartner Magic Quadrant.
Also, remember that good data visualization theory and skills will transcend specific tools and
products. When you’re learning this skill, focus on best practices and explore your own personal
style when it comes to visualizations and dashboards. Data visualization isn’t going away any
time soon, so it’s important to build a foundation of analysis and storytelling and exploration
that you can carry with you regardless of the tools or software you end up using.
By
• Kate Brush
• Ed Burns
Data visualization is the practice of translating information into a visual context, such as a map
or graph, to make data easier for the human brain to understand and pull insights from. The
main goal of data visualization is to make it easier to identify patterns, trends and outliers in
large data sets. The term is often used interchangeably with information graphics, information
visualization and statistical graphics.
Data visualization is one of the steps of the data science process, which states that after data
has been collected, processed and modeled, it must be visualized for conclusions to be made.
Data visualization is also an element of the broader data presentation architecture discipline,
which aims to identify, locate, manipulate, format and deliver data in the most efficient way
possible.
Data visualization is important for almost every professional discipline. Teachers use it to
display student test results, computer scientists to explore advancements in artificial
intelligence (AI) and executives to share information with stakeholders. It also plays an
important role in big data projects. As businesses accumulated massive collections of data,
they needed a way to get an overview of their data quickly and easily. Visualization tools were a
natural fit to provide useful information.
Visualization is central to advanced analytics for similar reasons. When a data scientist is
writing advanced predictive analytics or machine learning algorithms, it's important to be able
to visualize the outputs to monitor results and ensure that the models are performing as
intended. Visualizations of complex algorithms are generally easier to interpret than numerical
outputs.
The timeline depicting the history of data visualization starts hundreds of years before the introduction of modern technology.
Data visualization provides a quick and effective way to communicate information in a universal
manner using visual information. Business professionals have different areas and levels of
expertise, but visualizations are meant to be understandable by anyone. Visualizations make it
easier for employees in an organization to make decisions and act based on insights derived
from them.
Visualizations help businesses in many ways. Some examples include the following:
• They help organizations understand when and where to place specific products.
• They can predict sales or revenue volumes.
• Compelling storytelling. Data dashboards that are visually compelling will maintain
the audience's interest with information they can understand.
While data visualization comes with many advantages, it can also pose several challenges,
including the following:
• Potential for misinterpretation. Users might have good intentions when using a
visualization platform, but they can draw incorrect conclusions from detailed
visualizations.
• Data privacy and security. Users must consider the security and privacy of the data
being visualized. A platform might be susceptible to cyberattacks, thus compromising
the security of data being used, or a data set could be used that isn't compliant with
data privacy regulations.
• Bias. Visualizations and the data they're based on should be scrutinized to ensure they
aren't intentionally or unintentionally biased. Failing to do so could compromise the
credibility of those analyses. For example, a data set that leaves out key demographics
within a population could lead to a biased visualization of that data.
The increased popularity of big data and data analysis projects has made visualization more
important than ever. Companies are increasingly using machine learning to gather massive
amounts of data that can be difficult and slow to sort through, comprehend and explain.
Visualization offers a way to speed up the process and present information to stakeholders in
ways they can understand.
Big data visualization often goes beyond the typical techniques used in normal visualization,
such as pie charts, histograms and graphs. Data visualization can provide more complex
representations, such as heat maps and fever charts.
While big data visualization projects have become increasingly useful in recent years, there are
disadvantages, including the following:
• Human intervention. To get the most out of big data visualization tools, it might be
necessary to hire a visualization specialist. These experts can identify the best data sets
and visualization styles to help an organization make good use of its data.
• Data integrity and security. The insights provided by big data visualization will only be
as accurate as the information being processed. Large volumes of data typically require
people and processes in place to govern and control the data quality and metadata.
These people and processes must also ensure trusted data sources are used and that
the data remains secure.
When computers were first applied to data visualization, one of the most common visualization
techniques was using a Microsoft Excel spreadsheet to transform the information into a table,
bar chart or pie chart. While these visualization methods are still used, more intricate
techniques are available, including infographics, bubble clouds, bullet graphs, heat maps, fever
charts and time series charts.
• Line charts. These charts are among the most basic and common techniques used.
Line charts display how variables can change over time.
• Area charts. This visualization method is a variation of a line chart. It displays multiple
values in a time series -- or a sequence of data collected at consecutive, equally spaced
points in time.
• Treemaps. This method shows hierarchical data in a nested format. The size of the
rectangles used for each category is proportional to the percentage of the whole each
represents. Treemaps are best used when multiple categories are present, and the goal
is to compare different parts of a whole.
• Population pyramids. This technique uses a stacked bar graph to display the complex
social narrative of a population. It's best used when trying to display the distribution of a
population.
• Scatter plots. This technique displays the relationship between two variables. A scatter
plot takes the form of an x- and y-axis with dots to represent data points.
• Sales and marketing. Market and consumer research firm eMarketer estimated $264
billion would be spent on U.S.-based digital advertising in 2023. That number is
expected to cross the $390 billion mark by 2027. Given the size of the investment in
advertising, marketing teams must pay close attention to their sources of web traffic
and how their web properties generate revenue. Data visualization helps to illustrate
how marketing efforts affect traffic trends over time.
• Data scientists and researchers. Data professionals typically build visualization for
their own use or to present the information to a select audience. They use visualization
libraries of the chosen programming languages and tools. Data scientists and
researchers frequently use open source programming languages -- such as Python -- or
proprietary tools designed for complex data analysis. Data scientists and researchers
use visualizations to get a greater understanding of data sets and to identify patterns or
trends that might otherwise go unnoticed.
The science of data visualization is based on an understanding of how humans gather and
process information. Daniel Kahneman and Amos Tversky collaborated on research that
defined two different methods for gathering and processing information.
The first method focuses on thought processing that is fast, automatic and unconscious. This
method is frequently used in day-to-day life and helps accomplish tasks including the following:
• Riding a bike.
The second focuses on slow, logical, calculating and infrequent thought processing, as
demonstrated by the following:
• Determining the difference in meaning between multiple signs standing side by side.
Data visualization tools can be used in a variety of ways. The most common is a business
intelligence reporting tool. Users set up visualization tools to generate automatic dashboards
that track company performance across key performance indicators and visually interpret the
results.
The generated images also include interactive capabilities, enabling users to manipulate them
or look more closely into the data for in-depth analysis. Indicators alert users when data has
been updated or when predefined conditions occur.
A business might implement data visualization software to track its own initiatives. For example,
a marketing team might use such software to monitor the performance of an email campaign,
tracking business metrics, such as the open rate, click-through rate and conversion rate.
As data visualization vendors extend the functionality of these tools, they're increasingly being
used as front ends for more sophisticated big data environments. In such a setting, the tools
help data engineers and scientists track data sources and do basic exploratory analysis of data
sets prior to or after more detailed advanced analyses.
Forbes has compiled a list of some data visualization software vendors and tools useful to small
businesses. These include Domo, Klipfolio, Looker, Microsoft Power BI, Qlik Sense, Tableau and
Zoho Analytics. While Microsoft Excel continues to be a popular tool for data visualization, other
tools have been created to provide users with more sophisticated and far-reaching capabilities.