Numerical Summaries


In this article, we will go over the theory and application (with examples in Python and R) of the following numerical summaries:

1. Mean

2. Median

3. Mode

4. Percentile

5. Quartiles (five-number summary)

6. Standard Deviation

7. Variance

8. Range

9. Proportion

10. Correlation

Mean
This is the point of balance, describing the most typical value for
normally distributed data. I say “normally distributed” data because
the mean is highly influenced by outliers.
The mean adds up all the data values and divides by the total number
of values, as follows:

The formula for the mean:

x̄ = (x₁ + x₂ + … + xₙ) / n

The ‘x-bar’ is used to represent the sample mean (the mean of a sample
of data). ‘∑’ (sigma) denotes the addition of all values from ‘i=1’ to
‘i=n’ (’n’ is the number of data values). The result is then divided by ‘n’.

Python: np.mean([1,2,3,4,5]) The result is 3.

R: mean(c(2,2,4,4)) The result is 3.

Effect of outliers:

Effect of an outlier on the mean

The first plot ranges from 1 to 10. The mean is 5.5. When we replace 10
with 20, the mean increases to 6.5. In the next section, we will go
over the median, which is a better choice when outliers are present.
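The outlier effect described above can be reproduced in a few lines (a minimal sketch using NumPy):

```python
import numpy as np

x = list(range(1, 11))   # 1 through 10
print(np.mean(x))        # 5.5

y = x[:-1] + [20]        # replace 10 with the outlier 20
print(np.mean(y))        # a single outlier pulls the mean up to 6.5
```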

Median
This is the “middle data point”, where half of the data is below the
median and half is above the median. It’s the 50th percentile of the
data (we will cover percentile later in this article). It’s also mostly used
with skewed data because outliers won’t have a big effect on the
median.

There are two formulas to compute the median. Which one to use depends
on whether n (the number of data points in the sample, or sample size)
is even or odd.

The formula for the median when n is even (with the data sorted):

median = (x[n/2] + x[n/2 + 1]) / 2

When n is even, there is no “middle” data point, so the middle two
values are averaged.

The formula for the median when n is odd:

median = x[(n + 1)/2]

When n is odd, the middle data point is the median.

Python: np.median([1,2,3,4,5,6]) (n is even). The result is 3.5, the
average of 3 and 4 (the two middle points).

R: median(c(1,2,3,4,5,6,7)) (n is odd). The result is 4, the middle
point.

Effect of outliers:
The effect of outliers on the median is low (none in this case).

In the graph above, we are using the same data used to calculate the
mean. Notice how the median stays the same in the second graph when
we replace 10 with 20. This doesn’t mean that the median will always
ignore outliers. With more extreme values and/or more outliers, the
median can shift, but the influence of any single outlier is low.
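Using the same data as in the mean example, a quick sketch shows the median shrugging off the outlier:

```python
import numpy as np

x = list(range(1, 11))   # 1 through 10
print(np.median(x))      # 5.5

y = x[:-1] + [20]        # replace 10 with the outlier 20
print(np.median(y))      # still 5.5: the middle values are unchanged
```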

Mode
The mode returns the most commonly occurring data value.

Python: statistics.mode([1,2,2,2,3,3,4,5,6]) The result is 2.

R doesn’t give you the mode directly, but you can do the following
to get the frequency of each data
value: R: table(c('apple','banana','banana','tomato','orange','orange','banana')) The result is apple: 1, banana: 3, orange: 2, tomato: 1.

‘Banana’ has the highest frequency with 3 occurrences. Below is a
histogram plot of this fruit vector.

Example of a mode using a histogram.
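In Python, a frequency table similar to R’s table() can be sketched with collections.Counter:

```python
from collections import Counter

fruits = ['apple', 'banana', 'banana', 'tomato', 'orange', 'orange', 'banana']
counts = Counter(fruits)

print(counts)                        # frequency of each value
print(counts.most_common(1)[0][0])   # the most frequent value (the mode)
```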

Percentile
The percent of data that is equal to or less than a given data point. It’s
useful for describing where a data point stands within the data set. If
the percentile is close to zero, then the observation is one of the
smallest. If the percentile is close to 100, then the data point is one of
the largest in the data set.

Python:
from scipy import stats

x = [10, 12, 15, 17, 20, 25, 30]

## In what percentile lies the number 25?
stats.percentileofscore(x, 25)
# result: 85.7

R:
x <- c(10, 12, 15, 17, 20, 25, 30)

## In what percentile lies the number 25?
ecdf(x)(25)
# result: 0.857 (i.e., the 85.7th percentile)

## In what percentile lies the number 12?
ecdf(x)(12)
# result: 0.29
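The percentile of a value is just the proportion of data points at or below it, which can be sketched directly (mirroring what ecdf does):

```python
x = [10, 12, 15, 17, 20, 25, 30]

def percentile_of(data, value):
    """Fraction of data points less than or equal to `value`."""
    return sum(v <= value for v in data) / len(data)

print(round(percentile_of(x, 25), 3))  # 6 of 7 values are <= 25
print(round(percentile_of(x, 12), 3))  # 2 of 7 values are <= 12
```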

Quartiles (five-number summary)


Quartiles measure the center and are also great for describing the spread
of the data. They are highly useful for skewed data. The three quartiles
split the data into four parts, and together with the minimum and
maximum they compose the five-number summary:

1. Minimum

2. 25th percentile (lower quartile)

3. 50th percentile (median)

4. 75th percentile (upper quartile)

5. 100th percentile (maximum)

Python:
import numpy as np

x = [10, 12, 15, 17, 20, 25, 30]

minimum = np.min(x)
q1 = np.quantile(x, .25)
median = np.median(x)
q3 = np.quantile(x, .75)
maximum = np.max(x)

print(minimum, q1, median, q3, maximum)

R:
x <- c(10, 12, 15, 17, 20, 25, 30)

min = min(x)
q1 = quantile(x, .25)
median = median(x)
q3 = quantile(x, .75)
max = max(x)

paste(min, q1, median, q3, max)

## You can also use the function favstats from the mosaic package.
## It will give you the five-number summary, mean, standard
## deviation, sample size and number of missing values.
library(mosaic)
favstats(x)
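As an alternative to the piecewise calls above, NumPy can return all five numbers in a single call (a small sketch):

```python
import numpy as np

x = [10, 12, 15, 17, 20, 25, 30]

# 0th, 25th, 50th, 75th and 100th percentiles at once
five_numbers = np.quantile(x, [0, .25, .5, .75, 1])
print(five_numbers)
```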

A boxplot is one good way to plot the five-number summary and
explore the data set.

A boxplot of the ‘mtcars’ data set (mpg x gear).

The bottom end of the boxplot represents the minimum; the first
horizontal line represents the lower quartile; the line inside the box
is the median; the next line is the upper quartile; and the top is the
maximum.

Standard Deviation
Standard deviation is extensively used in statistics and data science. It
measures the amount of variation or dispersion of a data set,
calculating how spread out the data are from the mean. Small values
mean the data is consistent and close to the mean. Larger values
indicate the data is highly variable.

Deviation: The idea is to use the mean as a reference point from
which everything varies. A deviation is defined as the distance an
observation lies from the reference point. This distance is obtained by
subtracting the mean (x-bar) from the data point (xi).

The formula to calculate the standard deviation:

s = √[ ∑(xᵢ − x̄)² / (n − 1) ]

Calculating the standard deviation: The average of all the
deviations will always turn out to be zero, so we square each deviation
and sum up the results. Then, we divide by ‘n − 1’ (called the degrees of
freedom). We take the square root of the final result to undo the squaring
of the deviations.

The standard deviation is a representation of all deviations in the data.
It’s never negative, and it’s zero only if all the values are the same.

Density plot of the Sepal.Width from the Iris data set.

This graph shows the density of Sepal.Width from the Iris data set. The
standard deviation is 0.436. The blue line represents the mean, and
the red lines mark one and two standard deviations away from the mean.
For example, a Sepal.Width value of 3.5 lies 1 standard deviation
from the mean.

Python: np.std(x, ddof=1) (NumPy defaults to the population standard
deviation; ddof=1 matches the sample formula above and R’s sd.)
R: sd(x)

Effect of outliers: The standard deviation, like the mean, is highly
influenced by outliers. The code below will use R to compare the
standard deviation of two vectors, one without outliers and a second
with an outlier.

x <- c(1,2,3,4,5,6,7,8,9,10)
sd(x)
# result: 3.02765

# Replacing 10 by 20:
y <- c(1,2,3,4,5,6,7,8,9,20)
sd(y)
# result: 5.400617
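The same comparison can be sketched in Python; note the ddof=1 argument so NumPy matches R’s sample standard deviation:

```python
import numpy as np

x = list(range(1, 11))        # 1 through 10
print(np.std(x, ddof=1))      # about 3.02765 (sample standard deviation)

y = x[:-1] + [20]             # replace 10 with the outlier 20
print(np.std(y, ddof=1))      # about 5.400617: one outlier nearly doubles it
```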

Variance
Variance is almost the same calculation as the standard deviation, but
it stays in squared units. So, if you take the square root of the variance,
you get the standard deviation.

The formula for the variance:

s² = ∑(xᵢ − x̄)² / (n − 1)

Note that it’s represented by ‘s-squared’, while the standard deviation
is represented by ‘s’.

Python: np.var(x, ddof=1) (as with np.std, ddof=1 gives the sample variance)

R: var(x)
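A quick sketch confirming the relationship between variance and standard deviation:

```python
import math
import numpy as np

x = [10, 12, 15, 17, 20, 25, 30]

variance = np.var(x, ddof=1)
std_dev = np.std(x, ddof=1)

# the standard deviation is the square root of the variance
print(math.isclose(math.sqrt(variance), std_dev))  # True
```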

Range
The difference between the maximum and minimum values. Useful for
some basic exploratory analysis, but not as powerful as the standard
deviation.

The formula for the range:

range = maximum − minimum

Python: np.max(x) - np.min(x) (or np.ptp(x))

R: max(x) - min(x)

Proportion
It’s often reported as a “percentage”. A proportion is the fraction of
observations in the data set that satisfy some condition; multiplying it
by 100 gives the percentage.
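For example, the proportion of data points above some threshold can be sketched as the mean of a boolean array (a common NumPy idiom):

```python
import numpy as np

x = np.array([10, 12, 15, 17, 20, 25, 30])

# proportion of observations greater than 15
proportion = np.mean(x > 15)
print(proportion)   # 4 of 7 values
```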

Correlation
Defines the strength and direction of the linear association between two
quantitative variables. It ranges between -1 and 1. Positive correlations
mean that one variable increases as the other variable increases.
Negative correlations mean that one variable decreases as the other
increases. When the correlation is zero, there is no linear association.
The closer the result is to -1 or 1, the stronger the association
between the two variables.

The formula to compute the correlation:

r = ∑(xᵢ − x̄)(yᵢ − ȳ) / √[ ∑(xᵢ − x̄)² × ∑(yᵢ − ȳ)² ]


Python: stats.pearsonr(x, y) (from scipy.stats; returns the correlation coefficient and a p-value)

R: cor(x,y)

Correlation between MPG and Weight.

The graph shows the correlation between MPG and Weight in the
mtcars data set (-0.87). This is a strong negative
correlation, meaning that as the weight increases, the MPG decreases.
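The formula above can be sketched directly and checked against NumPy’s built-in np.corrcoef:

```python
import math
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from the deviations formula."""
    x_bar, y_bar = np.mean(x), np.mean(y)
    dx, dy = np.array(x) - x_bar, np.array(y) - y_bar
    return np.sum(dx * dy) / math.sqrt(np.sum(dx**2) * np.sum(dy**2))

x = [1, 2, 3, 4, 5]
y = [10, 8, 6, 4, 2]            # decreases exactly as x increases

print(pearson_r(x, y))          # perfect negative correlation
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy
```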

These basic summaries are essential as you explore and analyze your
data. They are the foundation for diving deeper into statistics and
working on advanced analytics. If there is any summary you would like me
to include, feel free to reach out in the responses below.
