0% found this document useful (0 votes)
16 views101 pages

Combinepdf

Uploaded by

omkar2342004
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
INTRODUCTION TO EXPLORATORY DATA ANALYSIS (EDA)
• Generating summary statistics for the numerical data in the dataset.
• Creating graphical representations to understand the data more easily.
• The initial analysis of data supplied or extracted, using descriptive statistics and visualization tools, to understand the trends, underlying limitations, quality, patterns, and relationships between various entities within the data set.
EDA consists of a combination of the following methods:
1. Univariate visualization
Finding the summary statistics for each field in the raw dataset.
2. Bivariate visualization
Assessing the relationship between each variable in the dataset and the target variable of interest.
3. Multivariate visualization
Understanding interactions between different fields in the data.
DATA VISUALIZATION
• Data visualization is the representation of data through the use of common graphics, such as charts, plots, infographics, and even animations.
• These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.
• Identify Outliers in Data
• Enhanced Collaboration
• Business Analysis Made Easy
• Improve Response Time
• Greater Simplicity
• Easier Visualization of Patterns
VISUAL ENCODING
• Encoding in data visualization is translating the data into a visual element on a chart or map through position, shape, size, symbols, and colour.
• Data is mapped into visual structures, which are used to build the images on a screen.
• A well-chosen encoding is more effective because the information conveyed by the resulting graph is easier to perceive.
• The attribute values signify important data characteristics, such as whether the data is numerical, categorical, or ordinal.
• Spatiotemporal data contains special attributes such as geographical location (spatial dimension) and/or time (temporal dimension).
Data Visualization Libraries
• Matplotlib
• Seaborn
• Plotnine(ggplot)
• Bokeh
• pygal
• Plotly
• geoplotlib
• Gleam
• missingno
• Leather
• Altair
• Folium
Library | Primary Use | Strengths
Matplotlib | General-purpose plotting | Highly customizable, produces publication-quality plots, supports various plot types (line, bar, scatter, histograms, etc.), and serves as a foundation for many other libraries.
Seaborn | Statistical data visualization | Built on top of Matplotlib, simplifies complex visualizations, aesthetically pleasing, ideal for statistical plots like heatmaps, violin plots, and pair plots.
Plotly | Interactive web-based visualizations | Creates interactive plots, supports 3D plots, maps, and animations, integrates well with web applications, and has extensive chart types (bar, scatter, pie, etc.).
Bokeh | Interactive visualizations for web browsers | Similar to Plotly but with a focus on high-performance interactivity in large datasets, supports complex layouts, integrates well with web frameworks like Flask.
Altair | Declarative statistical visualization based on the Vega and Vega-Lite grammars | Focuses on simplicity and efficiency, suited to quickly creating complex statistical visualizations with minimal code, well-suited for exploratory data analysis.
ggplot (plotnine) | Grammar of graphics, similar to ggplot2 in R | Implements a layered grammar of graphics, intuitive and consistent for building complex plots step-by-step, preferred by users familiar with R's ggplot2.
Pandas Visualization | Built-in plotting capabilities in Pandas | Convenient for quick plotting directly from Pandas DataFrames, not as customizable as other libraries, good for exploratory data analysis.
Geopandas | Geospatial data visualization | Extends Pandas to support geospatial data, integrates with Matplotlib for plotting maps, used for spatial data analysis.
Folium | Visualizing geospatial data on interactive maps, built on Leaflet.js | Ideal for creating interactive maps, works well with geographic data, easy to integrate with Jupyter notebooks.
Pygal | SVG-based interactive visualizations | Creates interactive and aesthetically pleasing SVG plots, ideal for web integration, easy to use, but less customizable than Matplotlib or Plotly.
Holoviews | High-level interface for data visualization | Simplifies the process of creating complex visualizations, integrates with Bokeh and Matplotlib, supports interactive and large dataset visualizations.
Dash | Web-based dashboard creation using Plotly | Ideal for building web-based analytical applications with interactive charts and controls, heavily integrates with Plotly for visualizations.
NetworkX | Visualizing complex networks and graphs | Specializes in the visualization of networks and graph-based data, can handle large-scale networks, integrates with Matplotlib and other libraries.
Missingno | Visualization of missing data in datasets | Provides quick and easy plots to visualize missing data in datasets, useful for data cleaning and preprocessing.
Pydeck | Large-scale geospatial data visualization | Focuses on large-scale interactive visualizations, especially for geospatial data, integrates with Mapbox and other mapping services.
NumPy | Numerical operations | "Numerical Python". Supports large matrices and multi-dimensional data. Has built-in mathematical functions for easy computation. Libraries like TensorFlow use NumPy internally to perform several operations on tensors. The array interface is one of its key features.
SciPy | Scientific computing | "Scientific Python". An open-source library used for high-level scientific computations. Built on top of NumPy and works with it to handle complex computations: NumPy provides sorting and indexing of array data, while the higher-level numerical routines live in SciPy. Widely used by application developers and engineers.
Graphics

The visual portrayal of quantitative information.

Graphics are used to:
• Display the actual data table
• Display quantities derived from the data
• Show what has been learned about the data from other analyses
• Allow one to see what may be occurring in the data over and above what has already been described

Graphical Display Objectives
• Tabulation
• Description
• Illustration
• Exploration

“A picture is worth a thousand words…”
STA6166-2-1
Example Data

Excel spreadsheet

Display the data table
[Figure: column chart, "Calories in Common Candies". Vertical axis: calories, 0 to 250. Horizontal axis: one bar per candy; several category labels are truncated.]
What are the problems with this graph?
Alternate Display
[Figure: column chart, "Calories in Common Candies", with the bars sorted and the scale expanded. Vertical axis: calories, 0 to 250. Horizontal axis: one labelled bar per candy.]
Sorting and expanding the scale of the graph allows all labels to be seen as well as displaying a characteristic of the data.
Vertical Display of Data
[Figure: bar chart, "Calories in Common Candies", with one bar per candy running along a calorie axis from 0 to 250. Labels shown include MilkChocolate Bar, DarkChocolateBar, MilkChocolateMaltedMilkBalls, MilkChocolateCoveredRaisins, Caramels, AfterDinnerMint, LicoriceTwists, SemiSweetChocolateChips, StarlightMints, Lollipop, and Chewing Gum.]
In this case, a vertical display allows better comparison of calorie amounts.
Pie Charts

Pie Chart of SatFatC: NoSatFat (13, 59.1%), SatFat (9, 40.9%)
Pie Chart of protein: 0 (14, 63.6%), 1 (3, 13.6%), 3 (3, 13.6%), 4 (1, 4.5%), 6 (1, 4.5%)

A pie chart is good for making relative comparisons among pieces of a whole.
Statistical Uses of Graphics
Describe Distributions of Measurements
• Box & Whisker plot (Boxplot)
• Histogram
Compare Distributions
• Multiple Box & Whisker plots
Associations and Bivariate Distributions
• Scatter plot
• Symbolic scatter plot
Multidimensional Data Displays
• All pairwise scatter plot
• Rotating scatter plot
Graphical Methods in Support of Statistical Inference
• Regression lines
• Residual plots
• Quantile-quantile plots
• Cumulative distribution function plots
• Confidence and prediction interval plots
• Partial leverage plots
• Smoothed curves
Most of these will be demonstrated at some point in the course.
Extremes
First, if we sort the data we can immediately identify the extremes.

10 60 60 60 60 60 70 130 140 140 160 160 160 160 160 160 180 180 200 210 210 210

• Minimum(calories) = 10
• Maximum(calories) = 210

The minimum and maximum are “statistics”.
Reminder: A statistic is a function of the data. In this case, the function is very simple.
Range

Range: the difference between the largest and smallest measurements of a variable.

Extremes
• Minimum(calories) = 10
• Maximum(calories) = 210
Range = 210 - 10 = 200

Tells us something about the spread of the data.

The middle of the range is a measure of the “center” of the data.
Midrange = minimum + (Range/2) = 10 + 200/2 = 110
Is it a “good” measure of the center of the data?
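As a minimal sketch, the extremes, range, and midrange can be reproduced in Python from the 22 sorted calorie values. The variable names are illustrative and not part of the original slides.

```python
# Calorie values for the 22 candies, copied from the sorted list in the slides.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

minimum = min(calories)
maximum = max(calories)
data_range = maximum - minimum        # largest minus smallest measurement
midrange = minimum + data_range / 2   # middle of the range

print(minimum, maximum, data_range, midrange)
```

The midrange depends only on the two extreme values, so a single unusual observation (such as the 10-calorie candy here) can move it a long way.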
Measures of Central Tendency
Estimate the value that is in the center of the “distribution” of the data.
Median = middle value in the sorted list of n numbers, at position (n+1)/2:
= the unique value at position (n+1)/2 if n is an odd number, or
= the average of the values at positions n/2 and n/2+1 if n is even.
For the candy data (n = 22): Median = (160 + 160)/2 = 160

Mean = sum of all values divided by number of values (average)
= (10 + 60 + 60 + 60 + … + 210 + 210)/22
= 133.6

Trimmed mean = mean of the data where some fraction of the smallest and largest data values are not considered. Usually the smallest 5% and largest 5% of the values (with the count rounded to the nearest integer) are removed for this computation.
= 136.0 (with 10% trimmed, 5% from each tail).

Again, these are statistics (functions of the data).
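The three measures can be checked with Python's standard library. This is a sketch under one assumption: the helper `trimmed_mean` is a hypothetical name, written here to follow the rule stated on the slide (drop 5% of the values from each tail, rounding the count to the nearest integer), which for n = 22 removes one value from each end.

```python
import statistics

# Calorie values for the 22 candies, copied from the sorted list in the slides.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def trimmed_mean(values, fraction_per_tail):
    """Mean after dropping a fraction of the values from each tail.

    The number of values dropped per tail is rounded to the nearest integer.
    """
    ordered = sorted(values)
    k = round(len(ordered) * fraction_per_tail)
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


median = statistics.median(calories)    # average of the 11th and 12th values
mean = statistics.mean(calories)        # sum of all values divided by n
tr_mean = trimmed_mean(calories, 0.05)  # 5% trimmed from each tail

print(median, round(mean, 1), tr_mean)
```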


Mathematical Notation

n: number of observations
$\bar{y}$: symbolic “name” for the sample mean

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$
Quartiles
Here n=22 and (n+1)/4 = 23/4 = 5.75, hence Q1 is three quarters of the way between the 5th and 6th observations in the sorted list. The 5th value is 60 and the 6th value is 60, thus

Q1 = 60 + .75(60-60) = 60.

For Q2, (n+1)/2 = 23/2 = 11.5, i.e. halfway between the 11th and 12th observations.
Q2 = 160 + .5(160-160) = 160.

For Q3, 3(n+1)/4 = 3(23)/4 = 69/4 = 17.25, i.e. a quarter of the way between the 17th and 18th observations.
Q3 = 180 + .25(180-180) = 180.

10 60 60 60 60 60 70 130 140 140 160 160 160 160 160 160 180 180 200 210 210 210
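The (n+1)p positioning rule used for the quartiles can be written as a small function. This is an illustrative sketch: `quantile_n_plus_1` is a hypothetical helper name, and the comparison with `statistics.quantiles(..., method="exclusive")` relies on my understanding that the standard library's "exclusive" method uses the same (n+1)p positions.

```python
import math
import statistics

# Calorie values for the 22 candies, copied from the sorted list in the slides.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def quantile_n_plus_1(sorted_values, p):
    """Value at 1-based position (n + 1) * p, interpolating linearly."""
    n = len(sorted_values)
    position = (n + 1) * p
    k = math.floor(position)       # observation just below the position
    fraction = position - k        # how far to move toward the next one
    if k < 1:
        return sorted_values[0]
    if k >= n:
        return sorted_values[-1]
    lower, upper = sorted_values[k - 1], sorted_values[k]
    return lower + fraction * (upper - lower)


q1, q2, q3 = (quantile_n_plus_1(calories, p) for p in (0.25, 0.5, 0.75))

# Cross-check against the standard library.
library_quartiles = statistics.quantiles(calories, n=4, method="exclusive")

print(q1, q2, q3, library_quartiles)
```

Other packages use slightly different interpolation rules, so quartiles reported by different tools can disagree on small datasets.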

Percentiles
100pth percentile: the value in a sorted list of the data that has approximately 100p% of the measurements below it and approximately 100(1-p)% above it, where 0<p<1. (The p quantile.)
[Figure: distribution function.]

Examples:
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile

• Ott & Longnecker suggest finding a general 100pth percentile via a complicated graphical method (pp. 87-90).
• We will relegate these elaborate calculations to software packages…
• We will however return to this later when we discuss QQ-Plots.
Simplified Quartiles
A simpler way to find Q1 & Q3 is as follows:
1. Order the data from the lowest to the highest value, and find the
median.
2. Divide the ordered data into the lower half and the upper half, using
the median as the dividing value. (Always exclude the median itself
from each half.)
3. Q1 is just the median of the lower half.
4. Q3 is just the median of the upper half.

Ex: For the candy data we still get Q1=60 and Q3=180.

Ex: {3, 4, 7, 8, 9, 11, 12, 15, 18}.


We get Q1=(4+7)/2=5.5 and Q3=(12+15)/2=13.5.
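The four steps of the simplified method translate directly into code. The function name `simplified_quartiles` is an illustrative choice, not something defined in the slides.

```python
import statistics


def simplified_quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves.

    When n is odd, the median itself is excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return statistics.median(lower), statistics.median(upper)


calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

candy_q1, candy_q3 = simplified_quartiles(calories)
small_q1, small_q3 = simplified_quartiles([3, 4, 7, 8, 9, 11, 12, 15, 18])

print(candy_q1, candy_q3, small_q1, small_q3)
```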

Measures of Variability
• Range
• Interquartile Range
• Variance
• Standard Deviation

Interquartile Range (IQR): Difference between the third quartile (Q3) and the first quartile (Q1).

Quartiles:
Q1 = 25th percentile = 60
Q2 = 50th percentile = median = 160
Q3 = 75th percentile = 180

IQR = Q3-Q1 = 180 - 60 = 120
Variance and Standard Deviation

Sample mean:
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Variance: The sum of squared deviations of measurements from their mean divided by n-1.
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

Standard Deviation: The square root of the variance.
$$s = \sqrt{s^2}$$

These measure the spread of the data.

Rough approximation for large n: $s \approx \text{range}/4$.
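A short sketch of the variance and standard deviation formulas applied to the candy data, with a cross-check against `statistics.variance`. The variable names are illustrative.

```python
import math
import statistics

# Calorie values for the 22 candies, copied from the sorted list in the slides.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

n = len(calories)
mean = sum(calories) / n

# Sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(y - mean) ** 2 for y in calories]
variance = sum(squared_deviations) / (n - 1)

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

# Rough approximation: range / 4.
rough_estimate = (max(calories) - min(calories)) / 4

print(round(variance, 1), round(std_dev, 1), rough_estimate)
```

For this small sample the range/4 rule gives 50, somewhat below the computed standard deviation, which is consistent with the slide's remark that the rule is only a rough approximation for large n.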
Using Excel Data Analysis Tool
Under the “Tools” menu in Excel there is a tool called “Data Analysis”. This tool is not normally loaded when the Excel default installation is used, so you may have to load it yourself. This will require the Excel CD. Use the Tools > Add Ins option, select the Data Analysis tool and add it to your menu.
Excel Data Analysis Tool
Select the Data Analysis Tool
Select Descriptive Statistics
The menu below appears.
Enter the Input Range and
check the output options
desired.

Excel Descriptive Statistics Output

You should be able to easily identify the basic statistics we have described so far.

Note: the variance is not in this list. This is typical of statistics packages. Since the variance is simply the square of the Standard Deviation, it is often considered redundant.

Learn to use the Excel Help files. Type “Statistic” in the Excel Help Keyword dialog for a list of helps available.
Computing Descriptive Stats

Descriptive Statistics

Variable    N    Mean    Median    TrMean    StDev    SEMean
calories    22   133.6   160.0     136.0     60.5     12.9

Variable    Min     Max      Q1      Q3
calories    10.0    210.0    60.0    180.0
Frequency Table
A tabular representation of a set of data.
A frequency table also describes the distribution of the data and facilitates the estimation of probabilities.

Mode = most abundant value.

The “Histogram” dialog in the Excel Data Analysis Tool can be used to create this table. But it is not straightforward.
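Outside Excel, a frequency table takes only a few lines. The sketch below uses Python's `collections.Counter` on the candy data; the layout of `frequency_table` (value, count, relative frequency) is an illustrative choice.

```python
from collections import Counter
import statistics

# Calorie values for the 22 candies, copied from the sorted list in the slides.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

counts = Counter(calories)

# One row per distinct value: (value, frequency, relative frequency).
frequency_table = [
    (value, count, count / len(calories))
    for value, count in sorted(counts.items())
]

# The mode is the most abundant value.
mode = statistics.mode(calories)

for value, count, relative in frequency_table:
    print(f"{value:>5} {count:>3} {relative:.3f}")
print("mode:", mode)
```

The relative frequencies are the estimates of probability that the slide refers to.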
Box Plot for Calories

A visualization of most of the basic statistics.

[Figure: box plot of calories (SAS Proc Insight), vertical axis from 0 to 200, annotated from top to bottom with Maximum, 75th percentile (Q3), Median (Q2), 25th percentile (Q1), and Minimum; the height of the box is the interquartile range.]

Is there an Excel Tool? No.
Percentiles
100pth percentile: the value in a sorted list of the data that has approximately 100p% of the measurements below it and approximately 100(1-p)% above it, where 0<p<1. (The p quantile.)
[Figure: smoothed histogram.]

Examples:
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile

A distribution is said to be symmetric if the distance from the median to the 100pth percentile is the same as the distance from the median to the 100(1-p)th percentile. Otherwise the distribution is said to be skewed.
In the case above, the distribution is skewed to the right since the right tail is longer than the left tail.
Frequency Histogram
A graphical presentation of the frequency table where the relative areas of the bars are in proportion to the frequencies.

[Figure: frequency histogram of calories. Horizontal axis: calories, 0 to 200. Vertical axis: frequency. The bin width is marked on the horizontal axis.]
Density Histogram

A density histogram (or simply a histogram) is constructed just like a frequency histogram, but now the total area of the bars sums to one. This is accomplished by rescaling the vertical axis. Instead of frequencies, the vertical axis records the rescaled value of the density.

Histograms have important ties to probability.

[Figure: density histogram; the sum of the shaded area is equal to one.]
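The rescaling can be made concrete with a small sketch in plain Python. The helper `density_histogram`, the choice of five equal-width bins, and the convention that the last bin includes its right edge are all assumptions made for illustration; the slides do not specify them.

```python
import math

# Calorie values for the 22 candies, copied from the sorted list in the slides.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def density_histogram(values, n_bins):
    """Equal-width bins from min to max; returns counts, densities, bin width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The largest value falls into the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    # Density = frequency / (n * bin width), so that bar areas sum to one.
    densities = [c / (len(values) * width) for c in counts]
    return counts, densities, width


counts, densities, width = density_histogram(calories, n_bins=5)
total_area = sum(d * width for d in densities)

print(counts, width, total_area)
```

Each bar's area (density times bin width) equals the proportion of observations in that bin, which is the tie to probability mentioned on the slide.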
Number of Bins for Histograms

[Figures: histograms of the same data drawn with five, six, and eleven bins, with a smoothed histogram or density curve.]

How we view the “distribution” of a dataset can depend on how much data we have and how it is binned.
Scatterplot
Graphics to examine relationships.

[Figure: two scatter plots of calories (vertical axis, 0 to 200) against totfat (horizontal axis, 0 to 15), drawn with different relative axis lengths.]

Is the relationship linear or non-linear?

Beware, changing the relative lengths of the axes can change how the relationship is perceived.
Matrix Plot

View multiple variables at one time.

Three-D Views

Brushing the plot to identify interesting points.
• Basic data visualization tools: Histograms, Bar charts/graphs, Scatter plots, Line charts, Area plots, Pie charts, Donut charts
• Specialized data visualization tools: Boxplots, Bubble plots, Heat map, Dendrogram, Venn diagram, Treemap, 3D scatter plots
• Advanced data visualization tools: Wordclouds, Visualization of geospatial data
Data Visualization types
Area plots
• An area plot, also known as an area chart, is a data visualization technique used to represent data over time or across categories. The region between the line and the x-axis is filled; a stacked area chart layers several such series.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'x': list(range(1, 11)),
    'y': [1, 3, 2, 4, 5, 7, 6, 8, 9, 10]
})
# Create the area line plot: fill under the curve, then draw the line on top
plt.fill_between(df['x'], df['y'], color='blue', alpha=0.2)
plt.plot(df['x'], df['y'], color='red', alpha=0.5, linewidth=0.9)
plt.show()
Donut Charts
• Donut charts are a modified version of pie charts with the center area cut out.
• A donut chart draws attention to the arcs to represent the information, whereas a pie chart focuses on comparing the proportional area of the slices.
• Donut charts are more efficient in terms of space because the blank space inside the donut can be used to display additional information about the chart.
import matplotlib.pyplot as plt

Employee = ['Roshni', 'Shyam', 'Priyanshi', 'Harshit', 'Anmol']
Salary = [40000, 50000, 70000, 54000, 44000]
# colors
colors = ['#FF0000', '#0000FF', '#FFFF00',
          '#ADFF2F', '#FFA500']
# offset each slice slightly from the center
explode = (0.05, 0.05, 0.05, 0.05, 0.05)
# Pie Chart
plt.pie(Salary, colors=colors, labels=Employee,
        autopct='%1.1f%%', pctdistance=0.85,
        explode=explode)
# draw a white circle at the center to turn the pie into a donut
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
# Adding Circle in Pie chart
fig.gca().add_artist(centre_circle)
plt.show()
Box Plot
• A box plot, also known as a box-and-whisker plot, displays a summary of a set of data values: minimum, first quartile, median, third quartile, and maximum.
• In the box plot, a box is drawn from the first quartile to the third quartile, and a line goes through the box at the median.
• In the default vertical orientation, the y-axis shows the data values; each box on the x-axis summarizes the distribution of one dataset.
import numpy as np
import matplotlib.pyplot as plt

# Sample data (an assumption; the original snippet does not define random_integers)
random_integers = np.random.randint(1, 101, size=100)

plt.figure(figsize=(10, 6))
plt.boxplot(random_integers, patch_artist=True,
            boxprops=dict(facecolor='orange', color='black'),
            medianprops=dict(color='red'))
plt.title('Box Plot of Random Integers')
plt.ylabel('Value')
plt.grid(True)
plt.show()
Bubble plots
• A bubble chart displays multiple circles (bubbles) in a two-dimensional plot, in the same way as a scatter plot, with the size of each bubble encoding an additional variable.
• A bubble chart is primarily used to depict relationships between numeric variables.
• In Plotly, a bubble chart is built on the scatter plot: it is created with the scatter() method of plotly.express, using the size argument.
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length",
color="species",
size='petal_length',
hover_data=['petal_width'])
fig.show()
Heat map
• A 2-D heatmap is a data visualization tool that represents the magnitude of the values in a matrix as a colored table.
• 2-D heatmaps can be plotted using the Matplotlib and Seaborn packages.
• It is used to see the correlation between columns of a dataset, where color intensity encodes the strength of the correlation (which end of the scale is darker depends on the color map).
• It is also used for plotting time series and finance-related data, where the Y-axis is the month, the X-axis is the year, and each cell of the heatmap holds the user's data.
import numpy as np
import matplotlib.pyplot as plt

# 12 x 12 matrix of random values in [0, 1)
data = np.random.random((12, 12))
plt.imshow(data)
plt.title("2-D Heat Map")
plt.show()

# Same kind of data drawn with a different color map
data = np.random.random((12, 12))
plt.imshow(data, cmap='autumn')
plt.title("Heatmap with different color")
plt.show()
Dendrogram
• Dendrograms are used to visualize hierarchical data in fields like biology, linguistics, and computer science.
• They represent relationships between data points hierarchically, in a tree-like structure.
• Data Preparation: Ensure the data is in a format suitable for hierarchical clustering. This involves having a distance or similarity matrix between data points.
• Hierarchical Clustering: Apply a hierarchical clustering algorithm, such as agglomerative or divisive clustering, to the data. These algorithms iteratively merge or split clusters based on a defined metric.
• Dendrogram Construction: Use libraries like scipy or matplotlib in Python, or other statistical software, to construct the dendrogram.
The following example uses R:
set.seed(123)
data <- matrix(rnorm(100), ncol = 5)
# Perform hierarchical clustering
dist_matrix <- dist(data) # Compute distance matrix
hc <- hclust(dist_matrix) # Perform hierarchical clustering
# Plot dendrogram
plot(hc, main = "Dendrogram of Hierarchical Clustering", xlab = "Samples",
     ylab = "Distance")
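Since the steps above mention scipy, here is a rough Python counterpart of the same workflow. It is a sketch, not a translation that reproduces the R output: the random numbers differ from R's, and the choice of complete linkage on Euclidean distances is made to mirror the defaults of R's `dist` and `hclust`. It assumes NumPy and SciPy are installed.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# 20 samples with 5 features each, standard normal (same shape as the R example).
rng = np.random.default_rng(123)
data = rng.normal(size=(20, 5))

# Agglomerative clustering: complete linkage on Euclidean distances.
Z = linkage(data, method="complete", metric="euclidean")

# no_plot=True computes the dendrogram layout without drawing it.
tree = dendrogram(Z, no_plot=True)

print(Z.shape)          # one row per merge
print(tree["leaves"])   # sample indices in left-to-right leaf order
```

To draw the tree, call `dendrogram(Z)` without `no_plot=True` and then `matplotlib.pyplot.show()`.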
Venn diagram
• Venn diagrams are useful for illustrating relations between two or more groups.
• They show the commonalities and differences between the groups.
Install the package first: pip install matplotlib-venn
# import modules
from matplotlib_venn import venn2
from matplotlib import pyplot as plt
# depict venn diagram: 50 items only in A, 10 only in B, 7 in both
venn2(subsets=(50, 10, 7), set_labels=('Group A', 'Group B'))
plt.show()
Treemap
• A treemap is a type of visualization used when the data set is structured hierarchically, in a tree layout with roots, branches, and nodes.
• It shows information about a large amount of data efficiently in a limited space.
• A treemap can be plotted using Squarify (pip install squarify).
import squarify
import matplotlib.pyplot as plt

# Simple treemap: five rectangles sized 1 to 5, all in one color
squarify.plot(sizes=[1, 2, 3, 4, 5], color="yellow")
plt.axis("off")
plt.show()

# Treemap with one color per rectangle
data = [300, 400, 120, 590, 600, 760]
colors = ["red", "black", "green",
          "violet", "yellow", "blue"]
squarify.plot(sizes=data, color=colors)
plt.axis("off")
plt.show()
3D scatter plots
• A 3D scatter plot is a three-dimensional plot used to display three variables of a dataset using Cartesian coordinates.
• Matplotlib's mplot3d toolkit enables three-dimensional plotting.
• A 3D scatter plot is created with the ax.scatter3D() function of the matplotlib library, which accepts X, Y, and Z data sets; the rest of its attributes are the same as those of a two-dimensional scatter plot.
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

# 50 random integer coordinates per axis
z = np.random.randint(100, size=50)
x = np.random.randint(80, size=50)
y = np.random.randint(60, size=50)
# Creating figure with a 3D axes
fig = plt.figure(figsize=(10, 7))
ax = plt.axes(projection="3d")
ax.scatter3D(x, y, z, color="green")
plt.title("simple 3D scatter plot")
plt.show()
What is Data Visualization?

Data visualization is a graphical representation of quantitative information and data by using


visual elements like graphs, charts, and maps.

Data visualization converts large and small data sets into visuals, which are easy for humans to understand and process.

Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data.

In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts of information.

Data visualizations are common in everyday life, where they most often appear in the form of graphs and charts. A combination of multiple visualizations and bits of information is referred to as an infographic.

Data visualizations are used to discover unknown facts and trends. You can see visualizations in
the form of line charts to display change over time. Bar and column charts are useful for
observing relationships and making comparisons. A pie chart is a great way to show parts-of-a-
whole. And maps are the best way to share geographical data visually.

Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such as dials and gauges, geographic maps, heat maps, pie charts, and fever charts.

What makes Data Visualization Effective?

Effective data visualizations are created where communication, data science, and design collide. Done right, data visualizations turn key insights from complicated data sets into something meaningful and natural.

American statistician and Yale professor Edward Tufte believes useful data visualizations consist of "complex ideas communicated with clarity, precision, and efficiency."
To craft an effective data visualization, you need to start with clean data that is well-sourced and complete. After the data is ready to visualize, you need to pick the right chart.

Examples of Data Visualization in Data Science

Here are some popular data visualization examples.

1. Weather reports: Maps and other plot types are commonly used in weather reports.

2. Internet websites: Social media analytics websites such as Social Blade and Google
Analytics use data visualization techniques to analyze and compare the performance of
websites.

3. Astronomy: NASA uses advanced data visualization techniques in its reports and
presentations.

4. Geography

5. Gaming industry

Different Types of Data Visualization in Data Science

There are many data visualization types. The following are the commonly used data visualization
charts:
1. Distribution plot

2. Box and whisker plot

3. Violin plot

4. Line plot

5. Bar plot

6. Scatter plot

7. Histogram

8. Pie chart

9. Area plot

10. Hexbin plot

11. Heatmap

1. Distribution plot
A distribution plot is used to visualize the distribution of the data, for example a probability distribution plot or density curve.
Source: seaborn.pydata.org

2. Box and whisker plot
This plot is used to show the variation of the values of a numerical feature. You can read off the minimum, maximum, median, and lower and upper quartiles.

3. Violin plot
Similar to the box and whisker plot, the violin plot is used to show the variation of a numerical feature. But it contains a kernel density curve in addition to the box plot. The kernel density curve estimates the underlying distribution of the data.
Source: seaborn.pydata

4. Line plot
A line plot is created by connecting a series of data points with straight lines. The number of periods is on the x-axis.

5. Bar plot
A bar plot is used to plot the frequency of occurrence of categorical data. Each category is represented by a bar. The bars can be drawn vertically or horizontally. Their heights or lengths are proportional to the values they represent.

6. Scatter plot
Scatter plots are created to see whether there is a relationship (linear or non-linear, positive or negative) between two numerical variables. They are commonly used in regression analysis.

7. Histogram
A histogram represents the distribution of numerical data. Looking at a histogram, we can decide whether the values are normally distributed (a bell-shaped curve), skewed to the right, or skewed to the left. A histogram of residuals is useful for validating important assumptions in regression analysis.

8. Pie chart
A pie chart of a categorical variable shows each category as a slice whose size is proportional to the quantity it represents. It is a circular graph with as many slices as there are categories.

9. Area plot
The area plot is based on the line chart. We get the area plot when we fill the area between the line and the x-axis.
Source: python-graph-gallery.com

10. Hexbin plot
Similar to the scatter plot, a hexbin plot represents the relationship between two numerical variables. It is useful when the two variables have a lot of data points, which would overlap if drawn in a scatter plot.
Source: python-graph-gallery.com

11. Heatmap
A heatmap visualizes the correlation coefficients of numerical features with a color map. In the example shown, light colors indicate a high correlation and dark colors a low correlation; the direction depends on the color map chosen. The heatmap is extremely useful for identifying multicollinearity, which occurs when an input feature is highly correlated with one or more of the other features in the dataset.

Tools and Software for Data Visualization
There are multiple tools and software available for data
visualization.
1. Python provides open-source libraries such as
• Matplotlib
• Seaborn
• Plotly
• Bokeh
• Altair
Advantages and Disadvantages of Data Visualization
Advantages
There are many advantages of data visualization. Data visualization
is used to:
• Communicate your results or findings with your audience
• Tune hyperparameters
• Identify trends, patterns and correlations between variables
• Monitor the model’s performance
• Clean data
• Validate the model’s assumptions
Disadvantages
There are also some disadvantages of data visualization.
• We need to download, install and configure software and open-
source libraries. The process will be difficult and time-consuming
for beginners.
• Some data visualization tools are not available for free. We need to
pay for those.
• When we summarize the data, we’ll lose the exact information.

After you have decided on the chart type, you need to design and customize your visualization to your liking. Simplicity is essential: you don't want to add any elements that distract from the data.

History of Data Visualization

The concept of using pictures, in the form of maps and graphs, to understand data dates back to
the 17th century, and the pie chart was invented in the early 1800s.

Several decades later, one of the most advanced examples of statistical graphics appeared
when Charles Minard mapped Napoleon's invasion of Russia. The map represents the size of
the army and the path of Napoleon's retreat from Moscow, and it ties that information to
temperature and time scales for a more in-depth understanding of the event.

Computers made it possible to process a large amount of data at lightning-fast speeds.

Nowadays, data visualization has become a fast-evolving blend of art and science that is certain to
change the corporate landscape over the next few years.
Importance of Data Visualization

Data visualization is important because of the way the human brain processes information. Using
graphs and charts to visualize large and complex data sets is easier than studying
spreadsheets and reports.

Data visualization is an easy and quick way to convey concepts universally. You can experiment
with different scenarios by making slight adjustments.

Data visualization has some further benefits, such as:

• Data visualization can identify areas that need improvement or modifications.

• Data visualization can clarify which factors influence customer behavior.

• Data visualization helps you to understand which products to place where.

• Data visualization can help predict sales volumes.

Data visualization tools have been important for democratizing data and analytics and for making
data-driven insights available to workers throughout an organization. They are easier to
operate than earlier versions of BI software or traditional statistical analysis
software. This has led to a rise in lines of business implementing data visualization tools on their
own, without support from IT.
Why Use Data Visualization?

1. To make data easier to understand and remember.

2. To discover unknown facts, outliers, and trends.

3. To visualize relationships and patterns quickly.

4. To ask better questions and make better decisions.

5. To perform competitive analysis.

6. To improve insights.

What Is Data Visualization? Definition, Examples, And Learning Resources

In our increasingly data-driven world, it’s more important than ever to have accessible ways to
view and understand data. After all, the demand for data skills in employees is steadily
increasing each year. Employees and business owners at every level need to have an
understanding of data and of its impact.

That’s where data visualization comes in handy. With the goal of making data more accessible
and understandable, data visualization in the form of dashboards is the go-to tool for many
businesses to analyze and share information.

In this article, we'll cover:

1. The definition of data visualization

2. Advantages and disadvantages of data visualization

3. Why data visualization is important

4. Data visualization and big data

5. Data visualization examples

6. Tools and software of data visualization

7. More about data visualization

What is data visualization?


Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data. Additionally, it provides an excellent
way for employees or business owners to present data to non-technical audiences without
confusion.

In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.

What are the advantages and disadvantages of data visualization?

Something as simple as presenting data in graphic format may seem to have no downsides. But
sometimes data can be misrepresented or misinterpreted when placed in the wrong style of
data visualization. When choosing to create a data visualization, it’s best to keep both the
advantages and disadvantages in mind.

Advantages

Our eyes are drawn to colors and patterns. We can quickly identify red from blue, and squares
from circles. Our culture is visual, including everything from art and advertisements to TV and
movies. Data visualization is another form of visual art that grabs our interest and keeps our
eyes on the message. When we see a chart, we quickly see trends and outliers. If we can see
something, we internalize it quickly. It’s storytelling with a purpose. If you’ve ever stared at a
massive spreadsheet of data and couldn’t see a trend, you know how much more effective a
visualization can be.

Some other advantages of data visualization include:

• Easily share information.

• Interactively explore opportunities.

• Visualize patterns and relationships.

Disadvantages

While there are many advantages, some of the disadvantages may seem less obvious. For
example, when viewing a visualization with many different datapoints, it’s easy to make an
inaccurate assumption. Or sometimes the visualization is just designed wrong so that it’s
biased or confusing.

Some other disadvantages include:

• Biased or inaccurate information.

• Correlation doesn’t always mean causation.

• Core messages can get lost in translation.

Why data visualization is important

The importance of data visualization is simple: it helps people see, interact with, and better
understand data. Whether simple or complex, the right visualization can bring everyone on the
same page, regardless of their level of expertise.

It’s hard to think of a professional industry that doesn’t benefit from making data more
understandable. Every STEM field benefits from understanding data—and so do fields in
government, finance, marketing, history, consumer goods, service industries, education,
sports, and so on.

While we’ll always wax poetic about data visualization (you’re on the Tableau website, after
all) there are practical, real-life applications that are undeniable. And, since visualization is so
prolific, it’s also one of the most useful professional skills to develop. The better you can convey
your points visually, whether in a dashboard or a slide deck, the better you can leverage that
information. The concept of the citizen data scientist is on the rise. Skill sets are changing to
accommodate a data-driven world. It is increasingly valuable for professionals to be able to use
data to make decisions and use visuals to tell stories of when data informs the who, what,
when, where, and how.

While traditional education typically draws a distinct line between creative storytelling and
technical analysis, the modern professional world also values those who can cross between the
two: data visualization sits right in the middle of analysis and visual storytelling.

Data visualization and big data

As the “age of Big Data” kicks into high gear, visualization is an increasingly key tool to make
sense of the trillions of rows of data generated every day. Data visualization helps to tell stories
by curating data into a form easier to understand, highlighting the trends and outliers. A good
visualization tells a story, removing the noise from data and highlighting useful information.

However, it’s not simply as easy as just dressing up a graph to make it look better or slapping on
the “info” part of an infographic. Effective data visualization is a delicate balancing act between
form and function. The plainest graph could be too boring to catch any notice or it could make a
powerful point; the most stunning visualization could utterly fail at conveying the right message
or it could speak volumes. The data and the visuals need to work together, and there’s an art to
combining great analysis with great storytelling.

Examples of data visualization


Of course, one of the best ways to understand data visualization is to see it. What a crazy
concept! With public data visualization galleries and data everywhere online, it can be
overwhelming to know where to start. Tableau’s own public gallery shows off loads of
visualizations made with the free Tableau Public tool, we feature some common starter
business dashboards as usable templates, and Viz of the Day collects some of the best
community creations. We’ve also collected 10 of the best examples of data visualization of all
time, with examples that map historical conquests, analyze film scripts, reveal hidden causes
of mortality, and more.

Different types of visualizations

When you think of data visualization, your first thought probably immediately goes to simple bar
graphs or pie charts. While these may be an integral part of visualizing data and a common
baseline for many data graphics, the right visualization must be paired with the right set of
information. Simple graphs are only the tip of the iceberg. There’s a whole selection of
visualization methods to present data in effective and interesting ways.
General Types of Visualizations:

• Chart: Information presented in a tabular, graphical form with data displayed along two
axes. Can be in the form of a graph, diagram, or map.

• Table: A set of figures displayed in rows and columns.

• Graph: A diagram of points, lines, segments, curves, or areas that represents certain
variables in comparison to each other, usually along two axes at a right angle.

• Geospatial: A visualization that shows data in map form using different shapes and
colors to show the relationship between pieces of data and specific locations.

• Infographic: A combination of visuals and words that represent data. Usually uses
charts or diagrams.

• Dashboards: A collection of visualizations and data displayed in one place to help with
analyzing and presenting data.

More specific examples

• Area Map: A form of geospatial visualization, area maps are used to show specific
values set over a map of a country, state, county, or any other geographic location. Two
common types of area maps are choropleths and isopleths.

• Bar Chart: Bar charts represent numerical values compared to each other. The length of
the bar represents the value of each variable.

• Box-and-whisker Plots: These show a selection of ranges (the box) across a set
measure (the bar).

• Bullet Graph: A bar marked against a background to show progress or performance
against a goal, denoted by a line on the graph.

• Gantt Chart: Typically used in project management, Gantt charts are a bar chart
depiction of timelines and tasks.

• Heat Map: A type of geospatial visualization in map form which displays specific data
values as different colors (this doesn’t need to be temperatures, but that is a common
use).

• Highlight Table: A form of table that uses color to categorize similar data, allowing the
viewer to read it more easily and intuitively.

• Histogram: A type of bar chart that splits a continuous measure into different bins to
help analyze the distribution.

• Pie Chart: A circular chart with wedge-shaped segments that shows data as a percentage of
a whole.

• Treemap: A type of chart that shows different, related values in the form of rectangles
nested together.
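The histogram entry in the list above describes splitting a continuous measure into bins. A minimal sketch of that step, assuming equal-width bins and a made-up list of ages (the function name `histogram` and the data are illustrative, not taken from any particular tool):

```python
def histogram(values, num_bins):
    """Split values into equal-width bins and count the values in each bin.

    Assumes the values are not all identical, so the bin width is non-zero.
    """
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_bins + 1)]
    return edges, counts

# Hypothetical continuous measure: ages of twelve customers.
ages = [22, 25, 27, 31, 34, 35, 38, 41, 45, 52, 58, 62]
edges, counts = histogram(ages, num_bins=4)

# Text rendering: one row per bin, one '#' per value in the bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} - {right:5.1f} | {'#' * count}")
```

The number of bins is a design choice: too few bins hide the shape of the distribution, while too many make it look noisy.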

Visualization tools and software

There are dozens of tools for data visualization and data analysis. These range from simple to
complex, from intuitive to obtuse. Not every tool is right for every person looking to learn
visualization techniques, and not every tool can scale to industry or enterprise purposes. If
you’d like to learn more about the options, you can dive into detailed third-party
analysis like the Gartner Magic Quadrant.

Also, remember that good data visualization theory and skills will transcend specific tools and
products. When you’re learning this skill, focus on best practices and explore your own personal
style when it comes to visualizations and dashboards. Data visualization isn’t going away any
time soon, so it’s important to build a foundation of analysis and storytelling and exploration
that you can carry with you regardless of the tools or software you end up using.

What is data visualization?

By Cameron Hashemi-Pour, Site Editor; Kate Brush; Ed Burns

Data visualization is the practice of translating information into a visual context, such as a map
or graph, to make data easier for the human brain to understand and pull insights from. The
main goal of data visualization is to make it easier to identify patterns, trends and outliers in
large data sets. The term is often used interchangeably with information graphics, information
visualization and statistical graphics.

Data visualization is one of the steps of the data science process, which states that after data
has been collected, processed and modeled, it must be visualized for conclusions to be made.
Data visualization is also an element of the broader data presentation architecture discipline,
which aims to identify, locate, manipulate, format and deliver data in the most efficient way
possible.

Data visualization is important for almost every professional discipline. Teachers use it to
display student test results, computer scientists to explore advancements in artificial
intelligence (AI) and executives to share information with stakeholders. It also plays an
important role in big data projects. As businesses accumulated massive collections of data,
they needed a way to get an overview of their data quickly and easily. Visualization tools were a
natural fit to provide useful information.

Visualization is central to advanced analytics for similar reasons. When a data scientist is
writing advanced predictive analytics or machine learning algorithms, it's important to be able
to visualize the outputs to monitor results and ensure that the models are performing as
intended. Visualizations of complex algorithms are generally easier to interpret than numerical
outputs.

The timeline depicting the history of data visualization starts hundreds of years before the
introduction of modern technology.

Why is data visualization important?

Data visualization provides a quick and effective way to communicate information in a universal
manner using visual information. Business professionals have different areas and levels of
expertise, but visualizations are meant to be understandable by anyone. Visualizations make it
easier for employees in an organization to make decisions and act based on insights derived
from them.

Visualizations help businesses in many ways. Some examples include the following:

• They help isolate factors that affect customer behavior.

• They identify products or services that need to be improved.

• They make data more memorable for stakeholders.

• They help organizations understand when and where to place specific products.

• They can predict sales or revenue volumes.

Benefits of data visualization

The benefits of data visualization include the following:

• Actionable insights. A broad spectrum of an organization's personnel can understand
visuals presented in business intelligence dashboards. This lets users absorb
information quickly, get better insights and figure out the next steps faster.

• Exploration of complex relationships. Visualization platforms with advanced
capabilities can display complex relationships among data points and metrics, allowing
an organization to make faster data-based decisions.

• Compelling storytelling. Data dashboards that are visually compelling will maintain
the audience's interest with information they can understand.

• Accessibility. Visualization tools make data more accessible and understandable, so
that laypersons or semi-technical users who aren't data scientists can interpret and
analyze it.

• Interactivity. Interactive dashboards have the functionality to allow users to click on
various aspects of data displays to get more information. This is especially useful for
those with less expertise on the subject area covered by the data. Static displays don't
allow this.

Disadvantages of data visualization

While data visualization comes with many advantages, it can also pose several challenges,
including the following:

• Complexity. A highly complicated visualization could appear cluttered or make it
difficult to glean valuable insights. More complexity also means users need training on
the tools being used or risk creating the wrong visual type for the data being used.

• Potential for misinterpretation. Users might have good intentions when using a
visualization platform, but they can draw incorrect conclusions from detailed
visualizations.

• Data privacy and security. Users must consider the security and privacy of the data
being visualized. A platform might be susceptible to cyberattacks, thus compromising
the security of data being used, or a data set could be used that isn't compliant with
data privacy regulations.

• Bias. Visualizations and the data they're based on should be scrutinized to ensure they
aren't intentionally or unintentionally biased. Failing to do so could compromise the
credibility of those analyses. For example, a data set that leaves out key demographics
within a population could lead to a biased visualization of that data.

Data visualization and big data

The increased popularity of big data and data analysis projects has made visualization more
important than ever. Companies are increasingly using machine learning to gather massive
amounts of data that can be difficult and slow to sort through, comprehend and explain.
Visualization offers a way to speed up the process and present information to stakeholders in
ways they can understand.

Big data visualization often goes beyond the typical techniques used in normal visualization,
such as pie charts, histograms and graphs. Data visualization can provide more complex
representations, such as heat maps and fever charts.

This heatmap shows data on an organization's use of cloud resources.

While big data visualization projects have become increasingly useful in recent years, there are
disadvantages, including the following:

• Human intervention. To get the most out of big data visualization tools, it might be
necessary to hire a visualization specialist. These experts can identify the best data sets
and visualization styles to help an organization make good use of its data.

• IT resources. The visualization of big data requires powerful computer hardware and
efficient storage systems, and it might entail cloud migration of data. These additional
resources mean more IT involvement as well.

• Data integrity and security. The insights provided by big data visualization will only be
as accurate as the information being processed. Large volumes of data typically require
people and processes in place to govern and control the data quality and metadata.
These people and processes must also ensure trusted data sources are used and that
the data remains secure.

Examples of data visualization

When computers were first applied to data visualization, one of the most common visualization
techniques was using a Microsoft Excel spreadsheet to transform the information into a table,
bar chart or pie chart. While these visualization methods are still used, more intricate
techniques are available, including infographics, bubble clouds, bullet graphs, heat maps, fever
charts and time series charts.

Other popular types of visualizations include the following:

• Line charts. These charts are among the most basic and common techniques used.
Line charts display how variables can change over time.

• Area charts. This visualization method is a variation of a line chart. It displays multiple
values in a time series -- or a sequence of data collected at consecutive, equally spaced
points in time.

• Treemaps. This method shows hierarchical data in a nested format. The size of the
rectangles used for each category is proportional to the percentage of the whole each
represents. Treemaps are best used when multiple categories are present, and the goal
is to compare different parts of a whole.

• Population pyramids. This technique uses a stacked bar graph to display the complex
social narrative of a population. It's best used when trying to display the distribution of a
population.

• Scatter plots. This technique displays the relationship between two variables. A scatter
plot takes the form of an x- and y-axis with dots to represent data points.

This scatter plot graph shows the relationship between the character count of emails sent out
as part of a marketing campaign and the read rate.
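The treemap entry in the list above states that each rectangle's size is proportional to its share of the whole. The sketch below illustrates only that proportionality with a simple one-level "slice" layout, which cuts a single rectangle into side-by-side strips. It is not the squarified layout that many treemap tools use, and the function name `slice_layout` and the category sizes are assumptions made for this example.

```python
def slice_layout(sizes, width, height):
    """Cut a width x height rectangle into vertical strips.

    Each strip's area is proportional to its category's share of the total.
    Returns {name: (x, y, strip_width, strip_height)}.
    """
    total = sum(sizes.values())
    rects = {}
    x = 0.0
    for name, size in sizes.items():
        strip_width = width * size / total
        rects[name] = (x, 0.0, strip_width, float(height))
        x += strip_width
    return rects

# Hypothetical categories with their sizes (e.g. sales per product line).
sizes = {"A": 50, "B": 30, "C": 20}
rects = slice_layout(sizes, width=100, height=40)

for name, (x, y, w, h) in rects.items():
    share = w * h / (100 * 40)
    print(f"{name}: x={x:.0f}, width={w:.0f}, share of area={share:.0%}")
```

For hierarchical data, the same function could be applied again inside each strip to place the child categories.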

Common data visualization use cases

Use cases for data visualization include the following:

• Sales and marketing. Market and consumer research firm eMarketer estimated $264
billion would be spent on U.S.-based digital advertising in 2023. That number is
expected to cross the $390 billion mark by 2027. Given the size of the investment in
advertising, marketing teams must pay close attention to their sources of web traffic
and how their web properties generate revenue. Data visualization helps to illustrate
how marketing efforts affect traffic trends over time.

• Politics. A common use of data visualization in politics is a geographic map that
displays which candidate each state, county or other geographic region voted for.

• Healthcare. Healthcare professionals frequently use choropleth maps to visualize
important health data. A choropleth map displays geographical areas or regions that are
assigned a certain color in relation to a numeric variable. Choropleth maps allow
professionals to see how a variable, such as the mortality rate of heart disease, changes
across specific geographic areas.

• Scientists. Scientific visualization, sometimes referred to as SciVis, allows scientists
and researchers to gain greater insight from their experimental and other collected data.

• Finance. Finance professionals must track the performance of their investment
decisions when choosing to buy or sell an asset. Candlestick charts are used as tools to
help finance professionals analyze price movements over time for assets
such as securities, derivatives, currencies, stocks, bonds and commodities. By
analyzing how prices have changed over time, data analysts and finance professionals
can detect trends.

• Logistics. Shipping companies use visualization tools to create a data-driven supply
chain and determine the most efficient shipping routes.

• Data scientists and researchers. Data professionals typically build visualization for
their own use or to present the information to a select audience. They use visualization
libraries of the chosen programming languages and tools. Data scientists and
researchers frequently use open source programming languages -- such as Python -- or
proprietary tools designed for complex data analysis. Data scientists and researchers
use visualizations to get a greater understanding of data sets and to identify patterns or
trends that might otherwise go unnoticed.
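The healthcare item above describes a choropleth map as regions colored in relation to a numeric variable. A minimal sketch of that color assignment, assuming equal-interval classes; the region names, the rates and the four-color palette are invented for illustration, and drawing the actual map is left to a mapping library.

```python
def assign_colors(values, palette):
    """Map each region's value to a palette color using equal-width classes.

    Assumes the values are not all identical, so the class width is non-zero.
    """
    low, high = min(values.values()), max(values.values())
    step = (high - low) / len(palette)
    colors = {}
    for region, value in values.items():
        # The maximum value would land one past the end, so clamp it into the last class.
        index = min(int((value - low) / step), len(palette) - 1)
        colors[region] = palette[index]
    return colors

# Hypothetical mortality rates per 100,000 people, by region.
rates = {"North": 120.0, "East": 170.0, "South": 230.0, "West": 300.0}

# Illustrative palette, ordered from light (low values) to dark (high values).
palette = ["#fee5d9", "#fcae91", "#fb6a4a", "#cb181d"]

colors = assign_colors(rates, palette)
for region, color in colors.items():
    print(region, rates[region], color)
```

Equal intervals are only one possible classification; quantile-based classes are a common alternative when the values are strongly skewed.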

The science of data visualization

The science of data visualization is based on an understanding of how humans gather and
process information. Daniel Kahneman and Amos Tversky collaborated on research that
defined two different methods for gathering and processing information.

The first method focuses on thought processing that is fast, automatic and unconscious. This
method is frequently used in day-to-day life and helps accomplish tasks including the following:

• Reading the text on a sign.

• Solving simple math problems, like 1+1.

• Identifying where a sound is coming from.

• Riding a bike.

• Determining the difference between colors.

The second focuses on slow, logical, calculating and infrequent thought processing, as
demonstrated by the following:

• Reciting a phone number.

• Solving complex math problems, like 132 x 154.

• Determining the difference in meaning between multiple signs standing side by side.

• Understanding complex social cues.

Data visualization tools and vendors

Data visualization tools can be used in a variety of ways. The most common is a business
intelligence reporting tool. Users set up visualization tools to generate automatic dashboards
that track company performance across key performance indicators and visually interpret the
results.

The generated images also include interactive capabilities, enabling users to manipulate them
or look more closely into the data for in-depth analysis. Indicators alert users when data has
been updated or when predefined conditions occur.

A business might implement data visualization software to track its own initiatives. For example,
a marketing team might use such software to monitor the performance of an email campaign,
tracking business metrics, such as the open rate, click-through rate and conversion rate.
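The email campaign metrics mentioned above are simple ratios that a dashboard would recompute as new data arrives. The sketch below assumes every rate is measured against the number of delivered emails; definitions differ between tools (click-through rate, for example, is sometimes measured against opened emails instead), and the function name and counts are hypothetical.

```python
def campaign_metrics(delivered, opened, clicked, converted):
    """Compute campaign rates as fractions of delivered emails (an assumed definition)."""
    return {
        "open_rate": opened / delivered,
        "click_through_rate": clicked / delivered,
        "conversion_rate": converted / delivered,
    }

# Hypothetical counts for one campaign.
metrics = campaign_metrics(delivered=2000, opened=500, clicked=100, converted=20)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```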

As data visualization vendors extend the functionality of these tools, they're increasingly being
used as front ends for more sophisticated big data environments. In such a setting, the tools
help data engineers and scientists track data sources and do basic exploratory analysis of data
sets prior to or after more detailed advanced analyses.

Forbes has compiled a list of some data visualization software vendors and tools useful to small
businesses. These include Domo, Klipfolio, Looker, Microsoft Power BI, Qlik Sense, Tableau and
Zoho Analytics. While Microsoft Excel continues to be a popular tool for data visualization, other
tools have been created to provide users with more sophisticated and far-reaching capabilities.
