UNIT II
VISUALIZING DISTRIBUTIONS
Visualizing Amounts-
o Bar Plots,
o Grouped and Stacked Bars,
o Dot Plots and Heatmaps,
Visualizing Distributions: Histograms and Density Plots-
o Visualizing a Single Distribution,
o Visualizing Multiple Distributions at the Same Time,
Visualizing Distributions: Empirical Cumulative Distribution Functions and
Q-Q Plots-
o Empirical Cumulative Distribution Functions,
o Highly Skewed Distributions,
o Quantile Plots,
Visualizing Many Distributions at Once-
o Visualizing Distributions Along the Vertical Axis,
o Visualizing Distributions Along the Horizontal Axis
VISUALIZING DISTRIBUTIONS
A distribution shows how values are spread across a range. It tells us:
Are values clustered or spread out?
Are they symmetrical, skewed, or bimodal?
Are there any outliers?
For example, take the marks of students: are most students getting 70–80?
Are there a few very low or very high scores?
The shape tells how your data is spread and where most values lie. It helps
you understand the pattern of your dataset.
1. Symmetric Distribution
Looks the same on both sides of the center (like a mirror).
For a symmetric distribution with a single peak, mean = median = mode
Example: Normal Distribution (Bell Curve)
[Sketch: bell-shaped curve with a single peak in the center]
Example: Heights of students in a class, test scores when most perform
average.
2. Skewed Distribution
When the data is stretched more to one side.
➤ a. Right Skewed (Positive Skew)
Long tail is on the right.
Mean > Median
[Sketch: peak on the left with a long tail extending to the right]
Example: Income of people (few rich people make the average high).
➤ b. Left Skewed (Negative Skew)
Long tail is on the left.
Mean < Median
[Sketch: peak on the right with a long tail extending to the left]
Example: Age at retirement (most people retire at similar age, few retire
early).
3. Bimodal Distribution
Has two peaks.
Indicates two different groups or populations in the data.
[Sketch: curve with two separate peaks]
Example: Test scores from two sections – one did well, one did poorly.
4. Uniform Distribution
All values occur with equal frequency.
No peak, flat distribution.
********
Example: Rolling a fair die – 1 to 6 has equal chance.
5. Multimodal Distribution
Has more than two peaks.
Can mean there are several groups within your data.
[Sketch: curve with three separate peaks]
Example: Mixed population with different behaviors or patterns.
Shape          Description                       Real-life Example
Symmetric      Equal left and right              Heights, exam scores
Right Skewed   Tail on right, mean pulled up     Income, house prices
Left Skewed    Tail on left                      Retirement age, birth weights
Bimodal        Two peaks                         Test scores from two sections
Uniform        All values equally frequent       Rolling dice
Multimodal     More than two peaks               Mixed customer behavior
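The link between skewness and the mean/median relation can be checked with a few lines of Python. This is a minimal sketch: both samples below are hypothetical values made up for illustration, not real data.

```python
import statistics

# Hypothetical right-skewed sample: one very large income pulls the mean upward.
incomes = [20, 22, 25, 27, 30, 32, 35, 200]
# Hypothetical left-skewed sample: one early retirement pulls the mean downward.
retirement_ages = [40, 55, 58, 59, 60, 60, 61, 62]

for name, sample in [("incomes", incomes), ("retirement ages", retirement_ages)]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    relation = ">" if mean > median else "<"
    print(f"{name}: mean = {mean:.2f} {relation} median = {median:.2f}")
```

The median barely moves when an extreme value is added, while the mean is dragged toward the tail. That is why mean > median suggests a right skew and mean < median suggests a left skew.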
Visualizing Amounts
Visualizing amounts means showing how much of something exists, for example:
sales by month, marks by subject, or students by department.
Bar Plots
A Bar Plot uses vertical bars to represent the value of each category.
Used when
You want to compare individual values
Categories are not connected to each other
Example:
Compare categories (e.g., Sales by product, Marks by subject)
Best when X-axis = categories, Y-axis = values
import matplotlib.pyplot as plt
categories = ['Math', 'Science', 'English', 'History']
marks = [80, 90, 70, 85]
plt.bar(categories, marks, color='blue')
plt.title("Bar Plot - Marks by Subject")
plt.xlabel("Subjects")
plt.ylabel("Marks")
plt.show()
Grouped Bar Plot
Multiple bars grouped side by side for each category, comparing two or more
groups.
Used when:
You want to compare groups over same categories
Like this year vs last year, or boys vs girls
Example:
Monthly sales for 2024 vs 2025:
Jan: 2024 = 20, 2025 = 25
Two bars for Jan: shown next to each other
Real-life Examples:
Sales of 2 products across months
Test results for boys vs girls
Power consumption for 2 years
import matplotlib.pyplot as plt
import numpy as np
subjects = ['Math', 'Science', 'English']
classA = [85, 90, 80]
classB = [75, 85, 78]
x = np.arange(len(subjects)) # label locations
width = 0.35
plt.bar(x - width/2, classA, width, label='Class A', color='orange')
plt.bar(x + width/2, classB, width, label='Class B', color='green')
plt.xticks(x, subjects)
plt.title("Grouped Bar Plot")
plt.xlabel("Subjects")
plt.ylabel("Marks")
plt.legend()
plt.show()
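The x - width/2 and x + width/2 positions used above work for two groups. For more groups, the same idea can be generalized. The helper below is a hypothetical sketch (group_offsets is not a matplotlib function) that centers any number of bars around each tick.

```python
import numpy as np

def group_offsets(n_groups, width):
    """Return each group's bar offset relative to the category tick."""
    # Offsets are symmetric around 0, so the bars stay centered on the tick.
    return (np.arange(n_groups) - (n_groups - 1) / 2) * width

x = np.arange(3)                    # one tick per subject
width = 0.25
offsets = group_offsets(3, width)   # three hypothetical classes
for i, offset in enumerate(offsets):
    print(f"group {i}: bar centers at {x + offset}")
```

Each offset can be passed to plt.bar as x + offset. With two groups and width = 0.35 the helper gives -0.175 and +0.175, the same positions as in the example above.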
Stacked Bar Plot
One bar per category, but it's divided into colored segments, each showing a
part of the total.
Used when:
You want to show total + contribution of each part
Compare category totals and internal breakdowns
Example:
Fruit sales in Jan:
Apple = 10
Banana = 5
Mango = 15
All stacked into one single bar for Jan, with red/yellow/orange
segments.
Real-life Examples:
Total monthly budget (Rent, Food, Travel)
Web traffic by source (Mobile, Desktop, Tablet)
Population by age group (Children, Adults, Seniors)
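import matplotlib.pyplot as plt
# Categories on the x-axis (defined here so the example runs on its own)
subjects = ['Math', 'Science', 'English']
girls = [30, 35, 40]
boys = [40, 30, 25]
plt.bar(subjects, girls, label='Girls', color='pink')
# bottom=girls starts each boys segment on top of the girls segment
plt.bar(subjects, boys, bottom=girls, label='Boys', color='blue')
plt.title("Stacked Bar Plot - Students by Gender")
plt.ylabel("Number of Students")
plt.legend()
plt.show()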
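With more than two segments, each segment has to start where the previous ones end. A minimal sketch of that bookkeeping, using the January fruit sales from the example above (the variable names are only illustrative):

```python
import numpy as np

# January fruit sales from the example above.
fruits = ["Apple", "Banana", "Mango"]
sales = np.array([10, 5, 15])

# Each segment starts at the running total of the segments below it.
bottoms = np.cumsum(sales) - sales
total = sales.sum()

for fruit, height, bottom in zip(fruits, sales, bottoms):
    print(f"{fruit}: drawn from {bottom} to {bottom + height}")
print("Total bar height:", total)
```

Each (height, bottom) pair can then be passed to plt.bar through its bottom argument, one call per fruit.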
Dot Plot
A plot with dots instead of bars. Each dot = one data point.
Used when:
The dataset is small
You want to show exact counts
Dots are more intuitive than bars
Example:
Fruit choices in a classroom:
3 students chose Apple → 3 dots in Apple row
1 for Banana, 6 for Mango
Real-life Examples:
Votes for class leader
Number of books read by students
Rating counts (how many gave 1 star, 2 stars…)
import matplotlib.pyplot as plt
import numpy as np
values = [60, 75, 90, 65]
categories = ['Reading', 'Writing', 'Math', 'Science']
plt.plot(values, categories, 'o', color='purple')
plt.title("Dot Plot - Test Scores")
plt.xlabel("Scores")
plt.grid(True, axis='x', linestyle='--')
plt.show()
Heatmap
A color-coded matrix where each cell’s color intensity shows value.
Used when:
You have 2D data (rows + columns)
You want to show intensity, frequency, or patterns
Example:
Time vs Day traffic:
Time        Mon   Tue   Wed
Morning     10    15    20
Afternoon   5     25    18
Evening     12    9     30
Darker red = more traffic.
Light yellow = low traffic.
Real-life Examples:
Exam scores (students vs subjects)
Traffic by time of day and day
Correlation matrix in Machine Learning
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Rows = students, columns = subjects
data = pd.DataFrame({
'Math': [80, 70, 90],
'Science': [75, 85, 78],
'English': [60, 88, 77]
}, index=['Student1', 'Student2', 'Student3'])
# annot=True writes the value inside each cell
sns.heatmap(data, annot=True, cmap='YlGnBu')
plt.title("Heatmap - Marks of Students")
plt.show()
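The correlation-matrix use case listed above follows the same pattern: compute a square table of values, then color it. A minimal sketch, reusing the same marks table (with only three students the correlations are purely illustrative):

```python
import pandas as pd

marks = pd.DataFrame({
    'Math': [80, 70, 90],
    'Science': [75, 85, 78],
    'English': [60, 88, 77]
}, index=['Student1', 'Student2', 'Student3'])

# Pearson correlation between every pair of subjects (values from -1 to 1).
corr = marks.corr()
print(corr.round(2))
```

The resulting table can be passed to sns.heatmap(corr, annot=True, vmin=-1, vmax=1) so that the color scale covers the full correlation range.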
Chart Type    Best For                      Shows
Bar Plot      Comparing single values       Individual amounts
Grouped Bar   Compare multiple subgroups    Category + subgroup values
Stacked Bar   Show part-to-whole            Total + sub-group composition
Dot Plot      Clean view with many labels   Quantities as dots
Heatmap       Matrix-style comparison       Color represents magnitude
Visualizing Distributions: Histograms and Density Plots
Visualizing Distributions
Visualizing a distribution helps understand how data points are spread,
centered, and shaped. It answers:
Is the data symmetrical or skewed?
Are there outliers or multiple peaks?
How do two or more groups compare?
Types of Distribution Plots:
Histogram
A histogram is a bar plot that shows the frequency of values falling in
continuous intervals (called bins).
Good for discrete or continuous numeric data
X-axis → Data ranges (bins)
Y-axis → Frequency (count)
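To see what the bins actually contain, the counts can be computed without drawing anything. This sketch uses np.histogram on the marks sample from this section's histogram example, with 5 equal-width bins:

```python
import numpy as np

data = [55, 60, 65, 70, 72, 75, 75, 78, 80, 90, 95]

# np.histogram returns the count in each bin and the bin edges.
counts, edges = np.histogram(data, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {count}")
```

The bar heights in a histogram are exactly these counts. Changing bins changes the edges and can change the apparent shape, so it is worth trying a few values.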
Density Plot
A density plot shows the smoothed version of the histogram.
It uses a Kernel Density Estimate (KDE) to estimate the probability
distribution.
Smooth curve, better for understanding the data distribution shape
Area under curve = 1
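The "area under curve = 1" property can be checked numerically. The sketch below assumes SciPy is available and uses scipy.stats.gaussian_kde with its default bandwidth, which may differ slightly from the curve seaborn draws. The grid range 0 to 150 is an arbitrary choice, wide enough to cover both tails.

```python
import numpy as np
from scipy.stats import gaussian_kde

data = np.array([55, 60, 65, 70, 72, 75, 75, 78, 80, 90, 95], dtype=float)
kde = gaussian_kde(data)              # default bandwidth (Scott's rule)

grid = np.linspace(0, 150, 3001)      # evaluation points covering both tails
density = kde(grid)
dx = grid[1] - grid[0]

# Riemann-sum approximation of the integral of the density curve.
area = density.sum() * dx
print(f"Approximate area under the KDE curve: {area:.3f}")
```

Because the area is fixed at 1, the Y-axis of a density plot shows density, not counts. Individual values on it are not probabilities.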
I. Visualizing a Single Distribution
Using Histogram:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Sample data
data = [55, 60, 65, 70, 72, 75, 75, 78, 80, 90, 95]
# Histogram
plt.hist(data, bins=5, color='skyblue', edgecolor='black')
plt.title("Histogram of Marks")
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.show()
Using Density Plot:
import matplotlib.pyplot as plt
import seaborn as sns
data = [55, 60, 65, 70, 72, 75, 75, 78, 80, 90, 95]
# Density plot; fill=True shades the area under the curve
sns.kdeplot(data, fill=True, color='purple')
plt.title("Density Plot of Marks")
plt.xlabel("Marks")
plt.show()
II. Visualizing Multiple Distributions at the Same Time
Let’s compare two classes’ marks using both plots.
Histogram for Multiple Distributions:
import matplotlib.pyplot as plt
import seaborn as sns
# Sample data
class_A = [75, 70, 85, 90, 75, 80, 95]
class_B = [55, 68, 70, 72, 75, 78, 92]
# Plot both histograms
plt.hist(class_A, bins=5, alpha=0.5, label='Class A', color='red')
plt.hist(class_B, bins=5, alpha=0.5, label='Class B', color='pink')
plt.title("Histogram - Class A vs B")
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.legend()
plt.show()
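When two histograms are overlaid with bins=5 each, every dataset gets its own bin edges, which can make the bars hard to compare. One possible remedy, sketched below, is to compute a single set of edges from the combined data with np.histogram_bin_edges and reuse it for both classes.

```python
import numpy as np

class_A = [75, 70, 85, 90, 75, 80, 95]
class_B = [55, 68, 70, 72, 75, 78, 92]

# One set of bin edges computed from both classes together.
shared_edges = np.histogram_bin_edges(class_A + class_B, bins=5)

counts_A, _ = np.histogram(class_A, bins=shared_edges)
counts_B, _ = np.histogram(class_B, bins=shared_edges)

print("Edges:  ", shared_edges)
print("Class A:", counts_A)
print("Class B:", counts_B)
```

The same array can be passed to plt.hist as bins=shared_edges in both calls, so that the bars of the two classes line up.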
Density Plot for Multiple Distributions:
import matplotlib.pyplot as plt
import seaborn as sns
class_A = [55, 60, 65, 70, 75, 80, 85]
class_B = [65, 68, 70, 72, 75, 78, 90]
sns.kdeplot(class_A, fill=True, label='Class A', color='blue')
sns.kdeplot(class_B, fill=True, label='Class B', color='orange')
plt.title("Density Plot - Class A vs B")
plt.xlabel("Marks")
plt.legend()
plt.show()
Visualizing Distributions: Empirical Cumulative Distribution
Functions and Q-Q Plots
1. Empirical Cumulative Distribution Function (ECDF)
An ECDF is a step function that estimates the cumulative distribution of a
sample dataset. It shows the proportion (or percent) of data points less than
or equal to a given value.
X-axis: Data values
Y-axis: Cumulative proportion (0 to 1)
Better than histogram for small datasets
You can see medians, percentiles clearly
How is it computed?
Given a dataset with n values x1, x2, ..., xn, sorted in ascending order:
For any value x, the ECDF is calculated as:
ECDF(x) = (number of values xi ≤ x) / n
Why use ECDF?
It gives a non-parametric view of the data distribution.
Easy to compare sample distributions.
Helpful for visualizing spread, central tendency, and skewness.
Example:
Suppose we have a sample:
[2, 4, 6, 8, 10]
The ECDF would be:
x=2⇒ECDF=1/5=0.2
x=4⇒ECDF=2/5=0.4
x=6⇒ECDF=3/5=0.6
x=8⇒ECDF=4/5=0.8
x=10⇒ECDF=5/5=1.0
Plotting these gives a step-like graph.
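The ECDF is defined for every x, not only for the sample values. The hypothetical helper ecdf_at below is one way to sketch this. It counts the values less than or equal to x using np.searchsorted.

```python
import numpy as np

def ecdf_at(sample, x):
    """Proportion of values in sample that are less than or equal to x."""
    sorted_sample = np.sort(sample)
    # side="right" returns the number of elements <= x
    return np.searchsorted(sorted_sample, x, side="right") / len(sorted_sample)

sample = [2, 4, 6, 8, 10]
for x in [1, 2, 5, 6, 10]:
    print(f"ECDF({x}) = {ecdf_at(sample, x)}")
```

Between two sample values the ECDF stays flat (for example, ECDF(5) equals ECDF(4)). This is what produces the step shape.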
import numpy as np
import matplotlib.pyplot as plt
# Sample data: exam scores
data = np.array([45, 50, 55, 60, 65, 70, 72, 75, 80, 85])
# Sort the data
x = np.sort(data)
# Calculate ECDF
y = np.arange(1, len(x)+1) / len(x)
# Plot ECDF
plt.plot(x, y, marker='.', linestyle='none')
plt.title("ECDF - Exam Scores")
plt.xlabel("Score")
plt.ylabel("Proportion")
plt.grid(True)
plt.show()
data = np.array([45, 50, 55, 60, 65, 70, 72, 75, 80, 85])
This creates a NumPy array containing 10 exam scores.
x = np.sort(data)
Sorts the data in ascending order.
ECDF requires sorted data to correctly plot cumulative values.
x becomes: [45, 50, 55, 60, 65, 70, 72, 75, 80, 85]
y = np.arange(1, len(x)+1) / len(x)
Creates the ECDF values:
o np.arange(1, len(x)+1) gives: [1, 2, ..., 10]
o Dividing by len(x) (which is 10) gives proportions: [0.1, 0.2, ..., 1.0]
plt.plot(x, y, marker='.', linestyle='none')
Plots the ECDF using dots only (marker='.', linestyle='none').
X-axis = sorted scores, Y-axis = cumulative proportion up to that
score.
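Reading a median or percentile off the ECDF means finding the first score whose cumulative proportion reaches the target. A minimal sketch on the same exam scores (ecdf_quantile is a hypothetical helper name):

```python
import numpy as np

data = np.array([45, 50, 55, 60, 65, 70, 72, 75, 80, 85])
x = np.sort(data)
y = np.arange(1, len(x) + 1) / len(x)

def ecdf_quantile(x_sorted, y_ecdf, p):
    """Smallest data value whose cumulative proportion is at least p."""
    # np.argmax on a boolean array returns the index of the first True.
    return x_sorted[np.argmax(y_ecdf >= p)]

print("Median read from the ECDF:", ecdf_quantile(x, y, 0.5))
print("90th percentile read from the ECDF:", ecdf_quantile(x, y, 0.9))
```

This rule always returns an actual data value. Functions such as np.percentile interpolate between neighbors by default, so their result can differ slightly.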
2. Highly Skewed Distributions
Skewness refers to the asymmetry in the distribution of data:
Type           Shape
Left Skewed    Tail on the left, mean < median
Right Skewed   Tail on the right, mean > median
Symmetric      Bell curve, mean ≈ median
Example: Visualizing a Right Skewed Distribution
The tail is on the right
Most values are concentrated on the left
Few very large values stretch the tail to the right
Real-life Examples:
Income (most people earn average, few earn a lot)
Hospital stay duration
Number of items in online shopping cart
import matplotlib.pyplot as plt
import numpy as np
data = np.random.exponential(scale=2, size=1000)
plt.hist(data, bins=30, color='salmon', edgecolor='black')
plt.title("Right-Skewed Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
Exponential Distribution is a right-skewed distribution used commonly in
modeling time between events, like how long until the next customer
arrives or a light bulb burns out.
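The right skew can also be confirmed numerically. The sketch below assumes SciPy is installed and uses scipy.stats.skew. The seed is fixed only so that the run is repeatable.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)             # fixed seed for repeatability
data = rng.exponential(scale=2, size=1000)

print(f"mean     = {data.mean():.2f}")
print(f"median   = {np.median(data):.2f}")
print(f"skewness = {skew(data):.2f}")      # positive => right-skewed
```

A positive skewness together with mean > median matches the long right tail seen in the histogram.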
Example: Visualizing a Left Skewed Distribution
The tail is on the left
Most values are concentrated on the right
Few very small values stretch the tail to the left
Real-life Examples:
Retirement age (most retire at 58–60, few earlier)
Exam scores (if the test is too easy)
Lifespan of very healthy people
import matplotlib.pyplot as plt
import numpy as np
data = np.random.beta(a=5, b=2, size=1000) * 10
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title("Left-Skewed Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
It generates 1000 random values from a Beta distribution with:
o Shape parameters: a = 5, b = 2
Beta distribution always returns values in the range [0, 1].
Multiplying by 10 scales the data to the range [0, 10].
Why is it left-skewed?
In the Beta distribution:
o When a > b, the distribution is left-skewed (more values near the
upper end).
o When a < b, it is right-skewed.
So here, a = 5, b = 2 → skewed to the left.
👉 This means most values are closer to 10, with fewer values near 0.
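As a rough numerical check of the left skew, the sample mean can be compared with the known Beta mean a / (a + b), scaled by 10. The sketch below uses a larger sample than the plot (100,000 values, an arbitrary choice) to keep the random noise small.

```python
import numpy as np
from scipy.stats import skew

a, b = 5, 2
rng = np.random.default_rng(0)                 # fixed seed for repeatability
data = rng.beta(a, b, size=100_000) * 10

theoretical_mean = 10 * a / (a + b)            # about 7.14
print(f"sample mean      = {data.mean():.2f}")
print(f"theoretical mean = {theoretical_mean:.2f}")
print(f"median           = {np.median(data):.2f}")
print(f"skewness         = {skew(data):.2f}")  # negative => left-skewed
```

Here the mean falls below the median and the skewness is negative, the mirror image of the exponential example.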
Example: Visualizing a Symmetric Distribution
Data is spread evenly on both sides of the center
Left and right sides look like a mirror
Mean ≈ Median ≈ Mode (for a single-peaked distribution)
Example:
Heights of people
IQ scores
Measurement errors
import matplotlib.pyplot as plt
import numpy as np
# Generate symmetric data (loc or mean=0, scale or std=1)
data = np.random.normal(loc=0, scale=1, size=1000)
plt.hist(data, bins=30, color='plum', edgecolor='black')
plt.title("Symmetric Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
3. Quantile-Quantile Plot (Q–Q Plot)
A Q-Q plot compares the quantiles of your data with the quantiles of a
theoretical distribution, most commonly the normal distribution.
If points lie close to a straight line → data is consistent with a normal
distribution
Systematic deviations from the line → non-normality, such as skewness or
heavy tails
A quantile is a cut-off value that divides your sorted data into parts that
contain equal numbers of observations (not equal ranges of values).
In short: Quantiles are the cut points that split sorted data into equal-sized
chunks.
Example: Think of a Cake
Imagine you have a cake 🍰 and you want to cut it equally:
2 pieces → 1 cut point: the median
4 pieces → 3 cut points: the quartiles
10 pieces → 9 cut points: the deciles
100 pieces → 99 cut points: the percentiles
Each slice will contain the same amount of data, not necessarily the same
value range.
Real-Life Analogy: Exam Scores
Let’s say 100 students wrote an exam. After sorting their scores:
25th percentile (1st quartile) → 25% scored below this
50th percentile (median) → 50% scored below this
75th percentile (3rd quartile) → 75% scored below this
These are quantiles:
0.25 quantile → 25th percentile
0.50 quantile → 50th percentile (Median)
0.75 quantile → 75th percentile
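These quantiles can be computed directly with np.quantile. The sketch below reuses the ten exam scores from the ECDF example. NumPy interpolates linearly between neighboring scores by default, so the results need not be actual data values.

```python
import numpy as np

scores = np.array([45, 50, 55, 60, 65, 70, 72, 75, 80, 85])

# 0.25, 0.50 and 0.75 quantiles = 25th, 50th and 75th percentiles.
q25, q50, q75 = np.quantile(scores, [0.25, 0.50, 0.75])

print("0.25 quantile (1st quartile):", q25)
print("0.50 quantile (median):      ", q50)
print("0.75 quantile (3rd quartile):", q75)
```

np.percentile(scores, [25, 50, 75]) gives the same numbers. The two functions differ only in whether the cut points are written as fractions or as percentages.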
Use case: To check if your data is normally distributed
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# Normally distributed data
data = np.random.normal(0, 1, 100)
# Create Q–Q Plot
stats.probplot(data, dist="norm", plot=plt)
plt.title("Q–Q Plot - Normal Data")
plt.show()
import scipy.stats as stats
This imports the stats module from the SciPy library.
It contains many statistical functions, including tools to create Q–Q plots and
test for normality, etc.
data = np.random.normal(0, 1, 100)
This generates 100 random values from a normal distribution:
Mean (loc) = 0
Standard deviation (scale) = 1
So, data should follow a bell-shaped, symmetric distribution.
stats.probplot(data, dist="norm", plot=plt)
This creates the Q–Q (Quantile-Quantile) Plot.
Here's what it does:
Compares quantiles of your sample data to quantiles of a theoretical
normal distribution (dist="norm").
If your data is normally distributed, the points will fall on or close to a
straight diagonal line.
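How close the points are to a straight line can also be expressed as a number. When called without plot=..., stats.probplot returns the plotted quantiles and a least-squares fit, including the correlation coefficient r. The sketch below compares normal data with skewed (exponential) data. The sample size and seed are arbitrary choices.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
normal_data = rng.normal(0, 1, 500)
skewed_data = rng.exponential(scale=2, size=500)

# probplot returns ((theoretical, ordered), (slope, intercept, r)).
(_, _), (_, _, r_normal) = stats.probplot(normal_data, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_data, dist="norm")

print(f"r for normal data: {r_normal:.3f}")   # close to 1 => near-straight line
print(f"r for skewed data: {r_skewed:.3f}")   # noticeably lower => curved pattern
```

A value of r close to 1 is consistent with normality but does not prove it. The plot itself remains the main diagnostic.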
Summary Table:
Plot Type   Use For                          Tells You…
ECDF        Any numeric data                 Distribution step-by-step
Histogram   Frequency in bins                General shape, skewness
KDE         Smoothed histogram               Peaks and spread
Q-Q Plot    Compare to normal distribution   Normality or skewness detection
Visualizing Many Distributions at Once
Used when you want to compare how different groups (like different classes,
categories, etc.) are distributed.
Why Visualize Many Distributions Together?
To compare groups
To observe variation, spread, and central tendency
To detect differences in patterns
Techniques:
1. Visualizing Distributions Along the Vertical Axis
This means:
X-axis = category/group
Y-axis = distribution values (vertical plots)
Examples:
Box Plot
Violin Plot
Strip Plot
Swarm Plot
Boxplot (Vertical):
import seaborn as sns
import matplotlib.pyplot as plt
# Load example dataset (tips data)
data = sns.load_dataset("tips")
# Boxplot: Bill amount grouped by day
sns.boxplot(x='day', y='total_bill', data=data)
plt.title("Boxplot - Vertical Distribution by Day")
plt.show()
Violin Plot (Vertical):
A Violin Plot combines a boxplot and a KDE (density curve). It shows the
distribution shape, median, and spread.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Load sample data
data = sns.load_dataset("tips")
sns.violinplot(x='day', y='total_bill', data=data)
plt.title("Violin Plot - Vertical")
plt.show()
In the violin plot, we can find the same information as in the box plots:
Median (a white dot on the violin plot)
Interquartile range (the black bar in the center of the violin)
The lower/upper adjacent values (the ends of the thin black lines extending
from the bar): the smallest and largest observations that still lie within the
fences at first quartile - 1.5 IQR and third quartile + 1.5 IQR.
These fences give a simple outlier detection technique (Tukey's fences):
observations lying outside the fences can be considered outliers.
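Tukey's fences are easy to compute by hand. The sketch below uses a small hypothetical sample with one unusually large value. np.percentile interpolates between data points, so other software may give slightly different quartiles.

```python
import numpy as np

data = np.array([10, 12, 13, 14, 15, 16, 18, 40])   # hypothetical sample

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the fences are flagged as possible outliers.
outliers = data[(data < lower_fence) | (data > upper_fence)]
# Adjacent values: the most extreme observations still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
lower_adjacent, upper_adjacent = inside.min(), inside.max()

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Adjacent values:", lower_adjacent, upper_adjacent)
print("Outliers:", outliers)
```

The whiskers of a box plot end at the adjacent values, and the flagged observations are drawn as separate points.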
Strip Plot (Vertical)
Both strip plots and swarm plots are types of categorical scatter plots. They
show the distribution of data points across categories using dots (instead of
bars or boxes). They are very useful when you want to see individual
observations rather than summaries.
A strip plot is a scatter plot in which individual data points are plotted
along a categorical axis, so that each category forms a narrow strip of dots.
Points with similar values may overlap, so a small amount of jitter (random
nudging) is often added to reduce overlapping.
Characteristics:
X-axis (or Y-axis): Categorical variable (e.g., day, gender)
Y-axis (or X-axis): Continuous variable (e.g., height, price)
Points may overlap if there are many similar values
Optional jitter can be added to spread out overlapping points
Visualization:
Each data point is a dot placed on the category’s line.
Example Use Case:
Showing student marks across different classes
Displaying total restaurant bills across different days
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Load sample data
data = sns.load_dataset("tips")
sns.stripplot(x='day', y='total_bill', data=data, jitter=True, color='orange')
plt.title("Strip Plot - Total Bill by Day")
plt.show()
jitter=True spreads the dots slightly so they don't sit on top of each other.
Swarm Plot (Vertical)
A swarm plot is a variant of the strip plot that automatically adjusts the
position of the dots so that data points do not overlap, which makes it more
readable. The dots are arranged so that they "swarm" around the category
line.
Characteristics:
Same as strip plot, but smarter dot positioning
Points do not overlap (with very many points, some may not fit and seaborn
issues a warning)
Automatically adjusted spacing
Visualization:
Each dot is carefully placed next to others, forming a neatly packed swarm.
Example Use Case:
Visualizing patient weights by gender
Displaying scores in exams by section
import matplotlib.pyplot as plt
import seaborn as sns
# Load sample data
data = sns.load_dataset("tips")
sns.swarmplot(x='day', y='total_bill', data=data, color='green')
plt.title("Swarm Plot - Total Bill by Day")
plt.show()
2. Visualizing Distributions Along the Horizontal Axis
This means:
Y-axis = category/group
X-axis = distribution values (horizontal plots)
Just flip the axes of the previous plot
Axis Type         X-axis                 Y-axis                Use When...
Vertical Plot     Category (e.g., Day)   Values (e.g., Bill)   Default view; easy for many groups
Horizontal Plot   Values                 Category              Better when category names are long
Boxplot (Horizontal):
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Load sample data
data = sns.load_dataset("tips")
sns.boxplot(y='day', x='total_bill', data=data)
plt.title("Boxplot - Horizontal Distribution by Day")
plt.show()
Violin Plot Example (Horizontal):
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Load sample data
data = sns.load_dataset("tips")
sns.violinplot(y='day', x='total_bill', data=data)
plt.title("Violin Plot - Horizontal")
plt.show()
Strip Plot (Horizontal)
import seaborn as sns
import matplotlib.pyplot as plt
# Sample data
tips = sns.load_dataset("tips")
# Horizontal strip plot
sns.stripplot(x="total_bill", y="day", data=tips, jitter=True, color="orange")
plt.title("Horizontal Strip Plot")
plt.xlabel("Total Bill")
plt.ylabel("Day")
plt.show()
Swarm Plot (Horizontal)
import seaborn as sns
import matplotlib.pyplot as plt
# Sample data
tips = sns.load_dataset("tips")
# Horizontal swarm plot
sns.swarmplot(x="total_bill", y="day", data=tips, color="skyblue")
plt.title("Horizontal Swarm Plot")
plt.xlabel("Total Bill")
plt.ylabel("Day")
plt.show()
Other Plot Types (you can do both horizontal/vertical):
Plot Type     Purpose
Box Plot      Shows median, IQR, outliers
Violin Plot   Like boxplot + shows density
Strip Plot    Individual data points (scattered)
Swarm Plot    Like strip plot but avoids overlapping
Summary Table:
Plot Type     Shows What?                          Best For
Violin Plot   Shape + Median + Spread (KDE view)   Comparing distribution shapes
Strip Plot    Raw data points                      Small datasets (quick view)
Swarm Plot    Raw data points (non-overlapping)    Small/medium datasets (clean)