Numerical Summaries


In this article, we will go over the theory and application (with examples in Python and R) of the following numerical summaries:

1. Mean

2. Median

3. Mode

4. Percentile

5. Quartiles (five-number summary)

6. Standard Deviation

7. Variance

8. Range

9. Proportion

10. Correlation

Mean
This is the point of balance, describing the most typical value for
normally distributed data. I say “normally distributed” data because
the mean is highly influenced by outliers.
The mean adds up all the data values and divides by the total number
of values, as follows:

The formula for the mean:

x̄ = (x₁ + x₂ + … + xₙ) / n

The ‘x-bar’ is used to represent the sample mean (the mean of a sample
of data). ‘∑’ (sigma) denotes the addition of all values from ‘i=1’ to
‘i=n’ (’n’ is the number of data values). The result is then divided by ‘n’.

Python: np.mean([1,2,3,4,5]) The result is 3.

R: mean(c(2,2,4,4)) The result is 3.

Effect of outliers:

Effect of an outlier on the mean

The first plot ranges from 1 to 10. The mean is 5.5. When we replace 10
with 20, the mean increases to 6.5. In the next section, we will go
over the median, which is a better choice when outliers are present.
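The outlier effect described above can be reproduced in a few lines (a minimal sketch using NumPy):

```python
import numpy as np

x = list(range(1, 11))   # 1 through 10
print(np.mean(x))        # 5.5

y = x[:-1] + [20]        # replace 10 with the outlier 20
print(np.mean(y))        # a single outlier pulls the mean up to 6.5
```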

Median
This is the “middle data point”, where half of the data is below the
median and half is above the median. It’s the 50th percentile of the
data (we will cover percentile later in this article). It’s also mostly used
with skewed data because outliers won’t have a big effect on the
median.

There are two formulas to compute the median. Which one to use depends
on whether n (the number of data points in the sample, or sample size)
is even or odd.

The formula for the median when n is even (with the data sorted):

median = (x[n/2] + x[n/2 + 1]) / 2

When n is even, there is no “middle” data point, so the middle two
values are averaged.

The formula for the median when n is odd:

median = x[(n + 1)/2]

When n is odd, the middle data point is the median.

Python: np.median([1,2,3,4,5,6]) (n is even). The result is 3.5, the
average of 3 and 4 (the two middle points).

R: median(c(1,2,3,4,5,6,7)) (n is odd). The result is 4, the middle
point.

Effect of outliers:
The effect of outliers on the median is low (none in this case).

In the graph above, we are using the same data used to calculate the
mean. Notice how the median stays the same in the second graph when
we replace 10 with 20. This doesn’t mean that the median will always
ignore outliers. With more extreme values and/or more outliers, the
median can shift, but the influence of any single outlier is low.
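Using the same data as in the mean example, a quick sketch shows the median shrugging off the outlier:

```python
import numpy as np

x = list(range(1, 11))   # 1 through 10
print(np.median(x))      # 5.5

y = x[:-1] + [20]        # replace 10 with the outlier 20
print(np.median(y))      # still 5.5: the middle values are unchanged
```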

Mode
The mode returns the most commonly occurring data value.

Python: statistics.mode([1,2,2,2,3,3,4,5,6]) The result is 2.

R doesn’t give you the mode directly, but you can do the following
to get the frequency of each data
value: R: table(c('apple','banana','banana','tomato','orange','orange','banana')) The result is apple: 1, banana: 3, orange: 2, tomato: 1.

‘Banana’ has the highest frequency with 3 occurrences. Below is a
histogram plot of this fruit vector.

Example of a mode using a histogram.
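In Python, a frequency table similar to R’s table() can be sketched with collections.Counter:

```python
from collections import Counter

fruits = ['apple', 'banana', 'banana', 'tomato', 'orange', 'orange', 'banana']
counts = Counter(fruits)

print(counts)                        # frequency of each value
print(counts.most_common(1)[0][0])   # the most frequent value (the mode)
```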

Percentile
The percent of data that is equal to or less than a given data point. It’s
useful for describing where a data point stands within the data set. If
the percentile is close to zero, then the observation is one of the
smallest. If the percentile is close to 100, then the data point is one of
the largest in the data set.

Python:
from scipy import stats

x = [10, 12, 15, 17, 20, 25, 30]

## In what percentile lies the number 25?
stats.percentileofscore(x, 25)
# result: 85.7

R:
x <- c(10, 12, 15, 17, 20, 25, 30)

## In what percentile lies the number 25?
ecdf(x)(25)
# result: 0.857 (i.e., the 85.7th percentile)

## In what percentile lies the number 12?
ecdf(x)(12)
# result: 0.29
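The percentile of a value is just the proportion of data points at or below it, which can be sketched directly (mirroring what ecdf does):

```python
x = [10, 12, 15, 17, 20, 25, 30]

def percentile_of(data, value):
    """Fraction of data points less than or equal to `value`."""
    return sum(v <= value for v in data) / len(data)

print(round(percentile_of(x, 25), 3))  # 6 of 7 values are <= 25
print(round(percentile_of(x, 12), 3))  # 2 of 7 values are <= 12
```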

Quartiles (five-number summary)


Quartiles measure the center and are also great for describing the spread
of the data. They are highly useful for skewed data. The three quartiles
split the data into four parts, and together with the minimum and
maximum they compose the five-number summary:

1. Minimum

2. 25th percentile (lower quartile)

3. 50th percentile (median)

4. 75th percentile (upper quartile)

5. 100th percentile (maximum)

Python:
import numpy as np

x = [10, 12, 15, 17, 20, 25, 30]

minimum = np.min(x)
q1 = np.quantile(x, .25)
median = np.median(x)
q3 = np.quantile(x, .75)
maximum = np.max(x)

print(minimum, q1, median, q3, maximum)

R:
x <- c(10, 12, 15, 17, 20, 25, 30)

min = min(x)
q1 = quantile(x, .25)
median = median(x)
q3 = quantile(x, .75)
max = max(x)

paste(min, q1, median, q3, max)

## You can also use the function favstats from the mosaic package.
## It will give you the five-number summary, mean, standard
## deviation, sample size and number of missing values.
library(mosaic)
favstats(x)
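As an alternative to the piecewise calls above, NumPy can return all five numbers in a single call (a small sketch):

```python
import numpy as np

x = [10, 12, 15, 17, 20, 25, 30]

# 0th, 25th, 50th, 75th and 100th percentiles at once
five_numbers = np.quantile(x, [0, .25, .5, .75, 1])
print(five_numbers)
```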

A boxplot is one good way to plot the five-number summary and
explore the data set.

A boxplot of the ‘mtcars’ data set (mpg x gear).

The bottom end of the boxplot represents the minimum; the first
horizontal line represents the lower quartile; the line inside the box
is the median; the next line is the upper quartile; and the top is the
maximum.

Standard Deviation
Standard deviation is extensively used in statistics and data science. It
measures the amount of variation or dispersion of a data set,
calculating how spread out the data are from the mean. Small values
mean the data is consistent and close to the mean. Larger values
indicate the data is highly variable.

Deviation: The idea is to use the mean as a reference point from
which everything varies. A deviation is defined as the distance an
observation lies from the reference point. This distance is obtained by
subtracting the mean (x-bar) from the data point (xi).

The formula to calculate the standard deviation:

s = √[ ∑(xᵢ − x̄)² / (n − 1) ]

Calculating the standard deviation: The average of all the
deviations will always turn out to be zero, so we square each deviation
and sum up the results. Then, we divide by ‘n − 1’ (called the degrees of
freedom). We take the square root of the final result to undo the squaring
of the deviations.

The standard deviation is a representation of all deviations in the data.
It’s never negative, and it’s zero only if all the values are the same.

Density plot of the Sepal.Width from the Iris data set.

This graph shows the density of Sepal.Width from the Iris data set. The
standard deviation is 0.436. The blue line represents the mean, and
the red lines mark one and two standard deviations away from the mean.
For example, a Sepal.Width value of 3.5 lies 1 standard deviation
from the mean.

Python: np.std(x, ddof=1) (NumPy defaults to the population standard
deviation; ddof=1 matches the sample formula above and R’s sd.)
R: sd(x)

Effect of outliers: The standard deviation, like the mean, is highly
influenced by outliers. The code below will use R to compare the
standard deviation of two vectors, one without outliers and a second
with an outlier.

x <- c(1,2,3,4,5,6,7,8,9,10)
sd(x)
# result: 3.02765

# Replacing 10 by 20:
y <- c(1,2,3,4,5,6,7,8,9,20)
sd(y)
# result: 5.400617
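The same comparison can be sketched in Python; note the ddof=1 argument so NumPy matches R’s sample standard deviation:

```python
import numpy as np

x = list(range(1, 11))        # 1 through 10
print(np.std(x, ddof=1))      # about 3.02765 (sample standard deviation)

y = x[:-1] + [20]             # replace 10 with the outlier 20
print(np.std(y, ddof=1))      # about 5.400617: one outlier nearly doubles it
```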

Variance
Variance is almost the same calculation as the standard deviation, but
it stays in squared units. So, if you take the square root of the variance,
you get the standard deviation.

The formula for the variance:

s² = ∑(xᵢ − x̄)² / (n − 1)

Note that it’s represented by ‘s-squared’, while the standard deviation
is represented by ‘s’.

Python: np.var(x, ddof=1) (as with np.std, ddof=1 gives the sample variance)

R: var(x)
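A quick sketch confirming the relationship between variance and standard deviation:

```python
import math
import numpy as np

x = [10, 12, 15, 17, 20, 25, 30]

variance = np.var(x, ddof=1)
std_dev = np.std(x, ddof=1)

# the standard deviation is the square root of the variance
print(math.isclose(math.sqrt(variance), std_dev))  # True
```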

Range
The difference between the maximum and minimum values. Useful for
some basic exploratory analysis, but not as powerful as the standard
deviation.

The formula for the range:

range = maximum − minimum

Python: np.max(x) - np.min(x) (or np.ptp(x))

R: max(x) - min(x)

Proportion
It’s often reported as a “percentage”. A proportion is the fraction of
observations in the data set that satisfy some condition; multiplying it
by 100 gives the percentage.
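For example, the proportion of data points above some threshold can be sketched as the mean of a boolean array (a common NumPy idiom):

```python
import numpy as np

x = np.array([10, 12, 15, 17, 20, 25, 30])

# proportion of observations greater than 15
proportion = np.mean(x > 15)
print(proportion)   # 4 of 7 values
```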

Correlation
Defines the strength and direction of the linear association between two
quantitative variables. It ranges between -1 and 1. Positive correlations
mean that one variable increases as the other variable increases.
Negative correlations mean that one variable decreases as the other
increases. When the correlation is zero, there is no linear association.
The closer the result is to -1 or 1, the stronger the association
between the two variables.

The formula to compute the correlation:

r = ∑(xᵢ − x̄)(yᵢ − ȳ) / √[ ∑(xᵢ − x̄)² × ∑(yᵢ − ȳ)² ]


Python: stats.pearsonr(x, y) (from scipy.stats; returns the correlation coefficient and a p-value)

R: cor(x,y)

Correlation between MPG and Weight.

The graph shows the correlation between MPG and Weight in the
mtcars data set (-0.87). This is a strong negative
correlation, meaning that as the weight increases, the MPG decreases.
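The formula above can be sketched directly and checked against NumPy’s built-in np.corrcoef:

```python
import math
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from the deviations formula."""
    x_bar, y_bar = np.mean(x), np.mean(y)
    dx, dy = np.array(x) - x_bar, np.array(y) - y_bar
    return np.sum(dx * dy) / math.sqrt(np.sum(dx**2) * np.sum(dy**2))

x = [1, 2, 3, 4, 5]
y = [10, 8, 6, 4, 2]            # decreases exactly as x increases

print(pearson_r(x, y))          # perfect negative correlation
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy
```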

These basic summaries are essential as you explore and analyze your
data. They are the foundation for diving deeper into statistics and
working on advanced analytics. If there is any summary you would like me
to include, feel free to reach out in the responses below.
