Data Visualization UNIT II

The document focuses on visualizing distributions using various methods such as bar plots, histograms, and density plots to understand data spread, symmetry, and outliers. It explains different types of distributions including symmetric, skewed, bimodal, uniform, and multimodal, along with practical examples. Additionally, it covers techniques for visualizing multiple distributions and cumulative distribution functions.

Uploaded by

D Anveshini

UNIT II

VISUALIZING DISTRIBUTIONS

• Visualizing Amounts
  o Bar Plots
  o Grouped and Stacked Bars
  o Dot Plots and Heatmaps
• Visualizing Distributions: Histograms and Density Plots
  o Visualizing a Single Distribution
  o Visualizing Multiple Distributions at the Same Time
• Visualizing Distributions: Empirical Cumulative Distribution Functions and Q-Q Plots
  o Empirical Cumulative Distribution Functions
  o Highly Skewed Distributions
  o Quantile Plots
• Visualizing Many Distributions at Once
  o Visualizing Distributions Along the Vertical Axis
  o Visualizing Distributions Along the Horizontal Axis

VISUALIZING DISTRIBUTIONS

A distribution shows how values are spread across a range. It tells us:

• Are values clustered or spread out?
• Are they symmetrical, skewed, or bimodal?
• Are there any outliers?

For example, consider the marks of students: do most students score 70–80? Are there a few very low or very high scores?

The shape of a distribution shows how the data is spread and where most values lie. It helps you understand the pattern of your dataset.

1. Symmetric Distribution

• Looks the same on both sides of the center (like a mirror).
• Mean = median = mode (for a symmetric distribution with a single peak)
• Example: Normal Distribution (Bell Curve)

[Sketch: bell-shaped curve, symmetric about its center]

Example: Heights of students in a class, or test scores when most students perform near the average.

2. Skewed Distribution

The data is stretched further to one side than to the other.

➤ a. Right Skewed (Positive Skew)

• Long tail is on the right.
• Mean > Median

[Sketch: peak on the left with a long tail to the right]

Example: Income of people (a few very rich people pull the average up).

➤ b. Left Skewed (Negative Skew)

• Long tail is on the left.
• Mean < Median

[Sketch: peak on the right with a long tail to the left]

Example: Age at retirement (most people retire at a similar age, a few retire early).
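The link between the direction of the tail and the position of the mean can be checked numerically. The sketch below uses two small made-up samples (the numbers are purely illustrative, not real data) and compares the mean with the median using NumPy.

```python
import numpy as np

# Hypothetical incomes (in thousands): one very large value creates a right tail
incomes = np.array([20, 22, 25, 27, 30, 32, 35, 200])
# Hypothetical retirement ages: one early retiree creates a left tail
retirement_ages = np.array([45, 58, 59, 60, 60, 61, 61, 62])

for name, sample in [("incomes", incomes), ("retirement ages", retirement_ages)]:
    mean, median = np.mean(sample), np.median(sample)
    relation = ">" if mean > median else "<"
    print(f"{name}: mean = {mean:.2f} {relation} median = {median:.2f}")
```

In both samples a single extreme value drags the mean toward the tail, while the median barely moves.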

3. Bimodal Distribution

• Has two peaks.
• Often indicates two different groups or populations in the data.

[Sketch: curve with two separate peaks]

Example: Test scores from two sections, where one did well and one did poorly.

4. Uniform Distribution

• All values occur with equal frequency.
• No peak, flat distribution.

[Sketch: flat line of equal height]

Example: Rolling a fair die, where each of 1 to 6 has an equal chance.

5. Multimodal Distribution

• Has several peaks (the term is usually used for more than two; two peaks is the bimodal case).
• Can mean there are several groups within your data.

[Sketch: curve with three separate peaks]

Example: Mixed population with different behaviors or patterns.

Shape | Description | Real-life Example
Symmetric | Equal left and right | Heights, exam scores
Right Skewed | Tail on right, big mean | Income, house prices
Left Skewed | Tail on left | Retirement age, birth weights
Bimodal | Two peaks | Test scores from two sections
Uniform | All values equally frequent | Rolling dice
Multimodal | More than two peaks | Mixed customer behavior

Visualizing Amounts

Visualizing amounts means showing how much of something exists, for example sales by month, marks by subject, or students by department.

Bar Plots

A Bar Plot uses bars (usually vertical) whose heights represent the value of each category.

Used when

• You want to compare individual values
• Categories are not connected to each other

Example:

• Compare categories (e.g., Sales by product, Marks by subject)
• Best when X-axis = categories, Y-axis = values

import matplotlib.pyplot as plt


categories = ['Math', 'Science', 'English', 'History']
marks = [80, 90, 70, 85]
plt.bar(categories, marks, color='blue')
plt.title("Bar Plot - Marks by Subject")
plt.xlabel("Subjects")
plt.ylabel("Marks")
plt.show()

Grouped Bar Plot

Multiple bars are grouped side by side for each category, so that two or more groups can be compared.

Used when:

• You want to compare groups over the same categories
• For example, this year vs last year, or boys vs girls

Example:

Monthly sales for 2024 vs 2025:

• Jan: 2024 = 20, 2025 = 25
• The two bars for Jan are shown next to each other

Real-life Examples:

• Sales of 2 products across months
• Test results for boys vs girls
• Power consumption for 2 years

import matplotlib.pyplot as plt


import numpy as np
subjects = ['Math', 'Science', 'English']
classA = [85, 90, 80]
classB = [75, 85, 78]
x = np.arange(len(subjects)) # label locations
width = 0.35
plt.bar(x - width/2, classA, width, label='Class A', color='orange')
plt.bar(x + width/2, classB, width, label='Class B', color='green')
plt.xticks(x, subjects)
plt.title("Grouped Bar Plot")
plt.xlabel("Subjects")
plt.ylabel("Marks")
plt.legend()
plt.show()
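The `x - width/2` and `x + width/2` positions in the code above are a special case of a general rule for centering any number of side-by-side bars on a tick. A minimal sketch of that rule follows; `group_offsets` is a hypothetical helper written for illustration, not a matplotlib function.

```python
import numpy as np

def group_offsets(n_groups, width):
    """Return the x-offset of each group's bar so the cluster is centered on the tick."""
    return (np.arange(n_groups) - (n_groups - 1) / 2) * width

print(group_offsets(2, 0.35))  # two groups: one bar left of the tick, one right
print(group_offsets(3, 0.25))  # three groups: left, centered, right
```

Group i would then be drawn with `plt.bar(x + offsets[i], values_i, width)`, which for two groups reproduces the positions used above.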

Stacked Bar Plot

One bar per category, divided into colored segments, each showing a part of the total.

Used when:

• You want to show the total and the contribution of each part
• You want to compare category totals and their internal breakdowns

Example:

Fruit sales in Jan:

• Apple = 10
• Banana = 5
• Mango = 15

All three are stacked into one single bar for Jan, with red/yellow/orange segments.

Real-life Examples:

• Total monthly budget (Rent, Food, Travel)
• Web traffic by source (Mobile, Desktop, Tablet)
• Population by age group (Children, Adults, Seniors)

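import matplotlib.pyplot as plt

# Categories on the x-axis (same subjects as in the grouped bar example)
subjects = ['Math', 'Science', 'English']
girls = [30, 35, 40]
boys = [40, 30, 25]
# Draw the girls' segment first, then stack the boys' segment on top of it
plt.bar(subjects, girls, label='Girls', color='pink')
plt.bar(subjects, boys, bottom=girls, label='Boys', color='blue')
plt.title("Stacked Bar Plot - Students by Gender")
plt.xlabel("Subjects")
plt.ylabel("Number of Students")
plt.legend()
plt.show()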
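With more than two segments, each segment has to start where the previous ones end, so the `bottom` argument becomes a running total. The sketch below works this out for the fruit example; the Jan values come from the text, while the Feb values are made up so that there is a second bar.

```python
import numpy as np

months = ["Jan", "Feb"]
# Jan values are from the fruit example; Feb values are invented for illustration
sales = {
    "Apple":  np.array([10, 12]),
    "Banana": np.array([5, 7]),
    "Mango":  np.array([15, 9]),
}

bottoms = {}
running_total = np.zeros(len(months))
for fruit, values in sales.items():
    bottoms[fruit] = running_total.copy()   # this segment starts where the previous ones end
    running_total = running_total + values  # height of the stack so far

print(bottoms)
print("Bar totals:", running_total)
```

Each segment could then be drawn with `plt.bar(months, sales[fruit], bottom=bottoms[fruit], label=fruit)`.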

Dot Plot

A plot with dots instead of bars. In the simplest form, each dot = one data point. A common variant, used in the code below, places a single dot per category at its value.

Used when:

• The dataset is small
• You want to show exact counts
• Dots are more intuitive than bars

Example:

Fruit choices in a classroom:

• 3 students chose Apple → 3 dots in Apple row
• 1 for Banana, 6 for Mango

Real-life Examples:

• Votes for class leader
• Number of books read by students
• Rating counts (how many gave 1 star, 2 stars…)

import matplotlib.pyplot as plt


import numpy as np
values = [60, 75, 90, 65]
categories = ['Reading', 'Writing', 'Math', 'Science']
plt.plot(values, categories, 'o', color='purple')
plt.title("Dot Plot - Test Scores")
plt.xlabel("Scores")
plt.grid(True, axis='x', linestyle='--')
plt.show()
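The code above draws one dot per category. For the count-style dot plot described in the classroom fruit example (one dot per student), the dot coordinates have to be built from the counts first. A minimal sketch, using only the counts given in the text:

```python
# Counts from the classroom fruit example
counts = {"Apple": 3, "Banana": 1, "Mango": 6}

xs, ys = [], []
for fruit, n in counts.items():
    for k in range(1, n + 1):
        xs.append(k)      # position of the k-th dot within its row
        ys.append(fruit)  # row (category) the dot belongs to

print(len(xs), "dots in total")
```

Passing these lists to `plt.plot(xs, ys, 'o')` should draw one dot per student, with 3 dots in the Apple row, 1 in the Banana row, and 6 in the Mango row.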

Heatmap

A color-coded matrix where each cell’s color intensity shows its value.

Used when:

• You have 2D data (rows + columns)
• You want to show intensity, frequency, or patterns

Example:

Time vs Day traffic:

Time | Mon | Tue | Wed
Morning | 10 | 15 | 20
Afternoon | 5 | 25 | 18
Evening | 12 | 9 | 30

Darker red = more traffic.
Light yellow = low traffic.

Real-life Examples:

• Exam scores (students vs subjects)
• Traffic by time of day and day
• Correlation matrix in Machine Learning

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Rows are students, columns are subjects
data = pd.DataFrame({
    'Math': [80, 70, 90],
    'Science': [75, 85, 78],
    'English': [60, 88, 77]
}, index=['Student1', 'Student2', 'Student3'])
# annot=True writes each value inside its cell
sns.heatmap(data, annot=True, cmap='YlGnBu')
plt.title("Heatmap - Marks of Students")
plt.show()
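A heatmap is only a colored view of a 2D array, so the cells that will stand out can be found directly from the numbers. The sketch below stores the Time vs Day traffic example as a NumPy array and locates the cells that would get the darkest and lightest colors.

```python
import numpy as np

times = ["Morning", "Afternoon", "Evening"]
days = ["Mon", "Tue", "Wed"]
# Rows are times of day, columns are days (values from the traffic example)
traffic = np.array([
    [10, 15, 20],
    [5, 25, 18],
    [12, 9, 30],
])

# Locate the busiest and quietest cells, i.e. the darkest and lightest colors
busiest = np.unravel_index(np.argmax(traffic), traffic.shape)
quietest = np.unravel_index(np.argmin(traffic), traffic.shape)
print("Busiest: ", times[busiest[0]], days[busiest[1]], traffic[busiest])
print("Quietest:", times[quietest[0]], days[quietest[1]], traffic[quietest])
```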

Chart Type | Best For | Shows
Bar Plot | Comparing single values | Individual amounts
Grouped Bar | Compare multiple subgroups | Category + subgroup values
Stacked Bar | Show part-to-whole | Total + sub-group composition
Dot Plot | Clean view with many labels | Quantities as dots
Heatmap | Matrix-style comparison | Color represents magnitude

Visualizing Distributions: Histograms and Density Plots

Visualizing Distributions

Visualizing a distribution helps you understand how data points are spread, centered, and shaped. It answers:

• Is the data symmetrical or skewed?
• Are there outliers or multiple peaks?
• How do two or more groups compare?

Types of Distribution Plots:

Histogram

A histogram is a bar plot that shows how many values fall into each of a series of consecutive intervals (called bins).

• Good for discrete or continuous numeric data
• X-axis → Data ranges (bins)
• Y-axis → Frequency (count)

Density Plot

A density plot is a smoothed version of the histogram. It uses a Kernel Density Estimate (KDE) to estimate the probability distribution.

• Smooth curve that makes the overall shape of the distribution easier to see
• Area under curve = 1

I. Visualizing a Single Distribution

Using Histogram:

import matplotlib.pyplot as plt
# Sample data
data = [55, 60, 65, 70, 72, 75, 75, 78, 80, 90, 95]
# Histogram
plt.hist(data, bins=5, color='skyblue', edgecolor='black')
plt.title("Histogram of Marks")
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.show()
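To see the numbers behind the bars, the same binning can be done without drawing anything. The sketch below uses `np.histogram`, which applies the same equal-width binning rule as `plt.hist` when it is given an integer number of bins.

```python
import numpy as np

data = [55, 60, 65, 70, 72, 75, 75, 78, 80, 90, 95]

# Five equal-width bins between the minimum (55) and the maximum (95)
counts, edges = np.histogram(data, bins=5)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:.0f} to {right:.0f}: {count} value(s)")
```

Each bin is 8 marks wide, and the tallest bar (71 to 79) holds the four values 72, 75, 75 and 78.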

Using Density Plot:

import matplotlib.pyplot as plt
import seaborn as sns

data = [55, 60, 65, 70, 72, 75, 75, 78, 80, 90, 95]
# Density plot: fill=True shades the area under the KDE curve
sns.kdeplot(data, fill=True, color='purple')
plt.title("Density Plot of Marks")
plt.xlabel("Marks")
plt.ylabel("Density")
plt.show()
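The claim that the area under a density curve equals 1 can be checked numerically. The sketch below uses `scipy.stats.gaussian_kde` with its default bandwidth as a stand-in for the curve; seaborn's own bandwidth settings may differ slightly, so treat this as an illustration of the idea rather than a reproduction of the plot.

```python
import numpy as np
from scipy.stats import gaussian_kde

marks = np.array([55, 60, 65, 70, 72, 75, 75, 78, 80, 90, 95], dtype=float)
kde = gaussian_kde(marks)  # Gaussian kernels with the default (Scott's rule) bandwidth

# Evaluate the density on a grid that comfortably covers the data
grid = np.linspace(0, 150, 3001)
density = kde(grid)

# Approximate the area under the curve with a Riemann sum
dx = grid[1] - grid[0]
area = float(np.sum(density) * dx)
print(f"Area under the KDE curve: {area:.4f}")
```

Because the total area is fixed at 1, the y-axis of a density plot shows density, not counts.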

II. Visualizing Multiple Distributions at the Same Time

Let’s compare two classes’ marks using both plots.

Histogram for Multiple Distributions:

import matplotlib.pyplot as plt


import seaborn as sns
# Sample data
class_A = [75, 70, 85, 90, 75, 80, 95]
class_B = [55, 68, 70, 72, 75, 78, 92]
# Plot both histograms
plt.hist(class_A, bins=5, alpha=0.5, label='Class A', color='red')
plt.hist(class_B, bins=5, alpha=0.5, label='Class B', color='pink')
plt.title("Histogram - Class A vs B")
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.legend()
plt.show()
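One caveat with the code above: when `bins=5` is passed separately for each class, every histogram gets its own bin edges, so the overlaid bars do not line up. A possible fix, sketched below, is to compute a single set of edges from the combined data and reuse it for both classes.

```python
import numpy as np

class_A = [75, 70, 85, 90, 75, 80, 95]
class_B = [55, 68, 70, 72, 75, 78, 92]

# Compute one set of bin edges from the combined data
shared_edges = np.histogram_bin_edges(class_A + class_B, bins=5)

counts_A, _ = np.histogram(class_A, bins=shared_edges)
counts_B, _ = np.histogram(class_B, bins=shared_edges)
print("Edges:  ", shared_edges)
print("Class A:", counts_A)
print("Class B:", counts_B)
```

Passing `bins=shared_edges` to both `plt.hist` calls would make the two sets of bars directly comparable.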

Density Plot for Multiple Distributions:

import matplotlib.pyplot as plt


import seaborn as sns
class_A = [55, 60, 65, 70, 75, 80, 85]
class_B = [65, 68, 70, 72, 75, 78, 90]
sns.kdeplot(class_A, fill=True, label='Class A', color='blue')
sns.kdeplot(class_B, fill=True, label='Class B', color='orange')
plt.title("Density Plot - Class A vs B")
plt.xlabel("Marks")
plt.legend()
plt.show()

Visualizing Distributions: Empirical Cumulative Distribution Functions and Q-Q Plots

1. Empirical Cumulative Distribution Function (ECDF)

An ECDF is a step function that estimates the cumulative distribution of a sample dataset. It shows the proportion (or percent) of data points less than or equal to a given value.

• X-axis: Data values
• Y-axis: Cumulative proportion (0 to 1)
• Often more informative than a histogram for small datasets, because no binning is needed
• Medians and other percentiles can be read off directly

How is it computed?

Given a dataset with n values x1, x2, ..., xn, sorted in ascending order:

• For any value x, the ECDF is calculated as:

$$\hat{F}_n(x) = \frac{\text{number of values } x_i \le x}{n}$$

Why use ECDF?

• It gives a non-parametric view of the data distribution.
• Easy to compare sample distributions.
• Helpful for visualizing spread, central tendency, and skewness.

Example:

Suppose we have a sample:

[2, 4, 6, 8, 10]

The ECDF would be:

• x = 2 ⇒ ECDF = 1/5 = 0.2
• x = 4 ⇒ ECDF = 2/5 = 0.4
• x = 6 ⇒ ECDF = 3/5 = 0.6
• x = 8 ⇒ ECDF = 4/5 = 0.8
• x = 10 ⇒ ECDF = 5/5 = 1.0

Plotting these gives a step-like graph.
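The ECDF is defined for every x, not only at the sample values. The sketch below implements the "proportion of values less than or equal to x" rule for the sample above; `ecdf` is a small helper written for illustration, not a library function.

```python
import numpy as np

def ecdf(sample, x):
    """Proportion of values in `sample` that are less than or equal to x."""
    sorted_sample = np.sort(sample)
    # side="right" counts every value <= x
    return np.searchsorted(sorted_sample, x, side="right") / len(sorted_sample)

sample = [2, 4, 6, 8, 10]
for x in [1, 2, 5, 6, 10, 12]:
    print(f"ECDF({x}) = {ecdf(sample, x)}")
```

Between two sample values the ECDF stays flat (for example, the value at 5 is the same as at 4), which is what produces the step shape.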

import numpy as np
import matplotlib.pyplot as plt
# Sample data: exam scores
data = np.array([45, 50, 55, 60, 65, 70, 72, 75, 80, 85])
# Sort the data
x = np.sort(data)
# Calculate ECDF
y = np.arange(1, len(x)+1) / len(x)
# Plot ECDF
plt.plot(x, y, marker='.', linestyle='none')
plt.title("ECDF - Exam Scores")
plt.xlabel("Score")
plt.ylabel("Proportion")
plt.grid(True)
plt.show()

data = np.array([45, 50, 55, 60, 65, 70, 72, 75, 80, 85])

This creates a NumPy array containing 10 exam scores.

x = np.sort(data)

• Sorts the data in ascending order.
• The ECDF requires sorted data so that the cumulative values are plotted correctly.
• x becomes: [45, 50, 55, 60, 65, 70, 72, 75, 80, 85]

y = np.arange(1, len(x)+1) / len(x)

• Creates the ECDF values:
  o np.arange(1, len(x)+1) gives: [1, 2, ..., 10]
  o Dividing by len(x) (which is 10) gives the proportions: [0.1, 0.2, ..., 1.0]

plt.plot(x, y, marker='.', linestyle='none')

• Plots the ECDF using dots only (marker='.', linestyle='none').
• X-axis = sorted scores, Y-axis = cumulative proportion up to that score.
2. Highly Skewed Distributions

Skewness refers to the asymmetry in the distribution of data:

Type | Shape
Left Skewed | Tail on the left, mean < median
Right Skewed | Tail on the right, mean > median
Symmetric | Bell curve, mean ≈ median

Example: Visualizing a Right Skewed Distribution

• The tail is on the right
• Most values are concentrated on the left
• A few very large values stretch the tail to the right

Real-life Examples:

• Income (most people earn modest amounts, a few earn a lot)
• Hospital stay duration
• Number of items in online shopping cart

import matplotlib.pyplot as plt


import numpy as np
data = np.random.exponential(scale=2, size=1000)
plt.hist(data, bins=30, color='salmon', edgecolor='black')
plt.title("Right-Skewed Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

The exponential distribution is a right-skewed distribution commonly used to model the time between events, such as how long until the next customer arrives or until a light bulb burns out.

Example: Visualizing a Left Skewed Distribution

• The tail is on the left
• Most values are concentrated on the right
• A few very small values stretch the tail to the left

Real-life Examples:

• Retirement age (most retire at 58–60, few earlier)
• Exam scores (if the test is too easy)
• Lifespan of very healthy people

import matplotlib.pyplot as plt


import numpy as np
data = np.random.beta(a=5, b=2, size=1000) * 10
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title("Left-Skewed Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

• It generates 1000 random values from a Beta distribution with:
  o Shape parameters: a = 5, b = 2
• The Beta distribution always returns values in the range [0, 1].
• Multiplying by 10 scales the data to the range [0, 10].

Why is it left-skewed?

• In the Beta distribution:
  o When a > b, the distribution is left-skewed (more values near the upper end).
  o When a < b, it is right-skewed.
• So here, a = 5, b = 2 → skewed to the left.

👉 This means most values are closer to 10, with fewer values near 0.

Example: Visualizing a Symmetric Distribution

• Data is spread evenly on both sides of the center
• Left and right sides look like a mirror
• Mean = Median = Mode (approximately, for sampled data)

Example:

• Heights of people
• IQ scores
• Measurement errors

import matplotlib.pyplot as plt


import numpy as np
# Generate symmetric data (loc or mean=0, scale or std=1)
data = np.random.normal(loc=0, scale=1, size=1000)
plt.hist(data, bins=30, color='plum', edgecolor='black')
plt.title("Symmetric Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
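The three shapes in this section can also be told apart numerically. The sketch below draws the same three kinds of samples (with a fixed seed, an assumption added so the run is repeatable) and reports the sample skewness from `scipy.stats.skew` together with the mean and the median.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
samples = {
    "right-skewed (exponential)": rng.exponential(scale=2, size=1000),
    "left-skewed (beta, a=5, b=2)": rng.beta(a=5, b=2, size=1000) * 10,
    "symmetric (normal)": rng.normal(loc=0, scale=1, size=1000),
}

for name, data in samples.items():
    print(f"{name}: skewness = {skew(data):+.2f}, "
          f"mean = {np.mean(data):.2f}, median = {np.median(data):.2f}")
```

Positive skewness goes with a right tail, negative skewness with a left tail, and a value near zero with a roughly symmetric sample.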

3. Quantile-Quantile Plot (Q–Q Plot)

A Q-Q plot compares the quantiles of your data with the quantiles of a theoretical distribution, most commonly the normal distribution.

• If points lie on a straight line → data is approximately normally distributed
• Deviations from the line → non-normality, such as skewness or heavy tails

Quantiles divide sorted data into parts that each contain an equal share of the observations (equal counts, not equal value ranges).

In short: A quantile is a cut-off point that splits your sorted data into chunks.

Example: Think of a Cake

Imagine you have a cake 🍰 and you want to cut it equally:

• 2 pieces → 1 cut point: the median
• 4 pieces → 3 cut points: the quartiles
• 10 pieces → 9 cut points: the deciles
• 100 pieces → 99 cut points: the percentiles

Each slice will contain the same amount of data, not necessarily the same value range.
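Cut points like these can be computed with `np.quantile`. The sketch below uses hypothetical scores 1, 2, ..., 100 (made up so the result is easy to check by eye) and asks for the three quartile cut points.

```python
import numpy as np

scores = np.arange(1, 101)  # hypothetical scores 1, 2, ..., 100 for 100 students

q25, q50, q75 = np.quantile(scores, [0.25, 0.50, 0.75])
print("0.25 quantile (1st quartile):", q25)
print("0.50 quantile (median):      ", q50)
print("0.75 quantile (3rd quartile):", q75)
```

The results fall between two neighboring scores because NumPy's default method interpolates linearly between data points; other quantile conventions can give slightly different cut points.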

Real-Life Analogy: Exam Scores

Let’s say 100 students wrote an exam. After sorting their scores:

• 25th percentile (1st quartile) → 25% scored below this
• 50th percentile (median) → 50% scored below this
• 75th percentile (3rd quartile) → 75% scored below this

These are quantiles:

• 0.25 quantile → 25th percentile
• 0.50 quantile → 50th percentile (Median)
• 0.75 quantile → 75th percentile

Use case: To check if your data is normally distributed

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Normally distributed data
data = np.random.normal(0, 1, 100)
# Create Q–Q Plot
stats.probplot(data, dist="norm", plot=plt)
plt.title("Q–Q Plot - Normal Data")
plt.show()

import scipy.stats as stats

This imports the stats module from the SciPy library. It contains many statistical functions, including tools to create Q–Q plots and to test for normality.

data = np.random.normal(0, 1, 100)

This generates 100 random values from a normal distribution:

• Mean (loc) = 0
• Standard deviation (scale) = 1

So, data should follow a bell-shaped, symmetric distribution.

stats.probplot(data, dist="norm", plot=plt)

This creates the Q–Q (Quantile-Quantile) Plot.

Here's what it does:

• Compares quantiles of your sample data to quantiles of a theoretical normal distribution (dist="norm").
• If your data is normally distributed, the points will fall on or close to a straight diagonal line.
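How closely the points follow the line can also be read as a number. When `probplot` is called without `plot=...`, it returns the Q-Q points together with a least-squares line fit, including a correlation coefficient r. The sketch below compares a normal sample with a skewed one; the seed and sample sizes are arbitrary choices for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
normal_data = rng.normal(0, 1, 200)
skewed_data = rng.exponential(scale=2, size=200)

# Without plot=..., probplot returns the Q-Q points and a least-squares line fit
(theoretical_q, ordered_values), (slope, intercept, r_normal) = stats.probplot(
    normal_data, dist="norm"
)
_, (_, _, r_skewed) = stats.probplot(skewed_data, dist="norm")

print(f"Correlation along the Q-Q line, normal sample: {r_normal:.3f}")
print(f"Correlation along the Q-Q line, skewed sample: {r_skewed:.3f}")
```

An r very close to 1 corresponds to points hugging the diagonal; the skewed sample bends away from the line and so gives a lower r.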

Summary Table:

Plot Type | Use For | Tells You…
ECDF | Any numeric data | Distribution step-by-step
Histogram | Frequency in bins | General shape, skewness
KDE | Smoothed histogram | Peaks and spread
Q-Q Plot | Compare to normal distribution | Normality or skewness detection

Visualizing Many Distributions at Once

Used when you want to compare how different groups (like different classes, categories, etc.) are distributed.

Why Visualize Many Distributions Together?

• To compare groups
• To observe variation, spread, and central tendency
• To detect differences in patterns

Techniques:

1. Visualizing Distributions Along the Vertical Axis

This means:

• X-axis = category/group
• Y-axis = distribution values (vertical plots)

Examples:

• Box Plot
• Violin Plot
• Strip Plot
• Swarm Plot

Boxplot (Vertical):

import seaborn as sns
import matplotlib.pyplot as plt
# Load example dataset (tips data)
data = sns.load_dataset("tips")
# Boxplot: Bill amount grouped by day
sns.boxplot(x='day', y='total_bill', data=data)
plt.title("Boxplot - Vertical Distribution by Day")
plt.show()

Violin Plot (Vertical):

A Violin Plot combines a boxplot and a KDE (density curve). It shows the
distribution shape, median, and spread.

import matplotlib.pyplot as plt


import seaborn as sns
import numpy as np
# Load sample data
data = sns.load_dataset("tips")
sns.violinplot(x='day', y='total_bill', data=data)
plt.title("Violin Plot - Vertical")
plt.show()

In the violin plot, we can find the same information as in the box plots (the exact styling of these marks can vary between seaborn versions):

• Median (a white dot on the violin plot)
• Interquartile range (the black bar in the center of the violin)
• The lower/upper adjacent values (the black lines stretched from the bar): the smallest and largest observations that still lie within the fences at first quartile - 1.5 IQR and third quartile + 1.5 IQR. These fences can be used in a simple outlier detection technique (Tukey’s fences): observations lying outside them can be considered outliers.
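Tukey's fences are easy to compute by hand. The sketch below uses a small made-up sample with one extreme value and finds the fences, the adjacent values (where the whiskers end), and the outliers.

```python
import numpy as np

values = np.array([10, 12, 13, 14, 15, 16, 18, 40])  # made-up sample with one extreme value

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Fences:", lower_fence, upper_fence)
print("Adjacent values (whisker ends):", inside.min(), inside.max())
print("Outliers:", outliers)
```

The whiskers stop at the most extreme observations inside the fences (10 and 18 here), not at the fences themselves, and the value 40 is flagged as an outlier.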

Strip Plot (Vertical)

Both strip plots and swarm plots are types of categorical scatter plots. They show the distribution of data points across categories using dots (instead of bars or boxes). They are very useful when you want to see individual observations rather than summaries.

A strip plot is a scatter plot in which individual data points are plotted along one axis (usually a categorical variable), so that each category forms a line or narrow strip of points. Points may overlap, so some jitter (random nudging) is often added to reduce overlapping.

Characteristics:

• X-axis (or Y-axis): Categorical variable (e.g., day, gender)
• Y-axis (or X-axis): Continuous variable (e.g., height, price)
• Points may overlap if there are many similar values
• Optional jitter can be added to spread out overlapping points

Visualization:

Each data point is a dot placed on the category’s line.

Example Use Case:

• Showing student marks across different classes
• Displaying total restaurant bills across different days

import matplotlib.pyplot as plt
import seaborn as sns

# Load sample data
data = sns.load_dataset("tips")
sns.stripplot(x='day', y='total_bill', data=data, jitter=True, color='orange')
plt.title("Strip Plot - Total Bill by Day")
plt.show()

jitter=True spreads the dots slightly so they don't sit on top of each other.

Swarm Plot (Vertical)

A swarm plot is a more advanced strip plot that automatically adjusts the position of the dots so that data points do not overlap, which makes it more readable. It arranges them so that they "swarm" around the category line.

Characteristics:

• Same as strip plot, but smarter dot positioning
• Points are placed so that they do not overlap (with very large datasets there may not be room for every point)
• Automatically adjusted spacing

Visualization:

Each dot is carefully placed next to others, forming a neatly packed swarm.

Example Use Case:

• Visualizing patient weights by gender
• Displaying scores in exams by section

import matplotlib.pyplot as plt
import seaborn as sns

# Load sample data
data = sns.load_dataset("tips")
sns.swarmplot(x='day', y='total_bill', data=data, color='green')
plt.title("Swarm Plot - Total Bill by Day")
plt.show()

2. Visualizing Distributions Along the Horizontal Axis

This means:

• Y-axis = category/group
• X-axis = distribution values (horizontal plots)

To get these, just swap the x and y arguments of the vertical plots.

Axis Type | X-axis | Y-axis | Use When...
Vertical Plot | Category (e.g., Day) | Values (e.g., Bill) | Default view; easy for many groups
Horizontal Plot | Values | Category | Better when category names are long

Boxplot (Horizontal):

import matplotlib.pyplot as plt


import seaborn as sns
import numpy as np
# Load sample data
data = sns.load_dataset("tips")
sns.boxplot(y='day', x='total_bill', data=data)
plt.title("Boxplot - Horizontal Distribution by Day")
plt.show()

Violin Plot Example (Horizontal):

import matplotlib.pyplot as plt


import seaborn as sns
import numpy as np
# Load sample data
data = sns.load_dataset("tips")
sns.violinplot(y='day', x='total_bill', data=data)
plt.title("Violin Plot - Horizontal")
plt.show()

Strip Plot (Horizontal)

import seaborn as sns


import matplotlib.pyplot as plt
# Sample data
tips = sns.load_dataset("tips")
# Horizontal strip plot
sns.stripplot(x="total_bill", y="day", data=tips, jitter=True, color="orange")
plt.title("Horizontal Strip Plot")
plt.xlabel("Total Bill")
plt.ylabel("Day")
plt.show()

Swarm Plot (Horizontal)

import seaborn as sns


import matplotlib.pyplot as plt
# Sample data
tips = sns.load_dataset("tips")
# Horizontal swarm plot
sns.swarmplot(x="total_bill", y="day", data=tips, color="skyblue")
plt.title("Horizontal Swarm Plot")
plt.xlabel("Total Bill")
plt.ylabel("Day")
plt.show()

Other Plot Types (you can do both horizontal/vertical):

Plot Type | Purpose
Box Plot | Shows median, IQR, outliers
Violin Plot | Like boxplot + shows density
Strip Plot | Individual data points (scattered)
Swarm Plot | Like strip plot but avoids overlapping

Summary Table:

Plot Type | Shows What? | Best For
Violin Plot | Shape + Median + Spread (KDE view) | Comparing distribution shapes
Strip Plot | Raw data points | Small datasets (quick view)
Swarm Plot | Raw data points (non-overlapping) | Small/medium datasets (clean)
