
Descriptive Statistics

The larger the n-count, the less influential an extreme value will be on x̄. As we
will learn in chapter “Statistical Inference”, sample size is fundamental to our ability
to achieve precise estimates of population parameters based on sample statistics.
While the focus of this section is central tendency, it is important to recognize that
outlying values are often the more actionable data points in an analysis since these
cases may represent those with significantly different experiences relative to the
average employee. Understanding the distribution of data is critical, and the spread
of data around measures of central tendency will receive considerable attention
throughout this book.
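The dampening effect of a larger n on an extreme value can be illustrated with a quick sketch (the vectors below are hypothetical, not drawn from the employees data):

```r
# Two hypothetical samples sharing one extreme value (1000)
small_n <- c(10, 12, 11, 1000)              # n = 4
large_n <- c(rep(c(10, 12, 11), 33), 1000)  # n = 100

# The outlier pulls the small-sample mean far more than the large-sample mean
mean(small_n)   # 258.25
mean(large_n)   # 20.89

# The median is robust to the outlier in both cases
median(small_n) # 11.5
median(large_n) # 11
```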

Mode

The mode is the most frequent number in a set of values.


While mean() and median() are standard functions in R, mode() returns the
internal storage mode of the object rather than the statistical mode of the data. We
can easily create a function to return the statistical mode(s):

# Fill vector x3 with integers
x3 <- c(1, 2, 3, 3, 100, 200, 300, 300)

# Create function to calculate statistical mode(s)
stat.mode <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}

# Return mode(s) of vector x3
stat.mode(x3)

## [1] 3 300
In this case, we have a bimodal distribution since both 3 and 300 occur most
frequently.

Range

The range is the difference between the maximum and minimum values in a set of
numbers.
The range() function in R returns the minimum and maximum numbers:

# Return lowest and highest values of vector x3
range(x3)

## [1] 1 300

We can leverage the max() and min() functions to calculate the difference
between these values:

# Calculate range of vector x3
max(x3, na.rm = TRUE) - min(x3, na.rm = TRUE)

## [1] 299
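Equivalently, the range as a single number can be computed by applying diff() to the output of range():

```r
# Vector from the mode example above
x3 <- c(1, 2, 3, 3, 100, 200, 300, 300)

# diff() subtracts the minimum from the maximum returned by range()
diff(range(x3, na.rm = TRUE))

## [1] 299
```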

In people analytics, there are many conventional descriptive metrics—largely counts, percentages, and averages cut by various time (e.g., day, month, quarter, year) and categorical (e.g., department, job, location, tenure band) dimensions. Here is a sample of common measures:
• Time to Fill: average days between job requisition posting and offer accep-
tance
• Offer Acceptance Rate: percent of offers extended to candidates that are
accepted
• Pass-Through Rate: percent of candidates in a particular stage of the recruiting
process who passed through to the next stage
• Progress to Goal: percent of approved positions that have been filled
• cNPS/eNPS: candidate and employee NPS (−100 to 100)
• Headcount: counts and percent of workforce across worker types (employee,
intern, contingent)
• Diversity: counts and percent of workforce across gender, ethnicity, and
generational cohorts
• Positions: count and percent of open, committed, and filled seats
• Hires: counts and rates
• Career Moves: counts and rates
• Turnover: counts and rates (usually terms/average headcount over the period)
• Workforce Growth: net changes over time, accounting for hires, internal
transfers, and exits
• Span of Control: ratio of people leaders to individual contributors
• Layers/Tiers: average and median number of layers removed from CEO
• Engagement: average score or top-box favorability score
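As a sketch of how a couple of these metrics might be computed, consider a small hypothetical requisition-level data set (the data frame and column names below are illustrative; they are not part of the peopleanalytics package):

```r
# Hypothetical requisition-level data
reqs <- data.frame(
  posted_date    = as.Date(c("2024-01-02", "2024-01-10", "2024-02-01")),
  accept_date    = as.Date(c("2024-02-15", "2024-03-01", NA)),
  offer_extended = c(TRUE, TRUE, TRUE),
  offer_accepted = c(TRUE, TRUE, FALSE)
)

# Time to Fill: average days between posting and offer acceptance
mean(as.numeric(reqs$accept_date - reqs$posted_date), na.rm = TRUE)  # 47.5

# Offer Acceptance Rate: percent of extended offers accepted
sum(reqs$offer_accepted) / sum(reqs$offer_extended) * 100  # ~66.7
```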

Measures of Spread
Variance

Variance is a measure of variability in the data. Variance is calculated using the average of squared differences—or deviations—from the mean.
Variance of a population is defined by:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Variance of a sample is defined by:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

It is important to note that since differences are squared, the variance is always non-negative. In addition, we cannot compare these squared differences to the arithmetic mean since the units are different. For example, if we calculate the variance of annual compensation measured in USD, variance is expressed in USD² while the mean exists in the original USD unit of measurement.
In R, the sample variance can be calculated using the var() function:

# Load library
library(peopleanalytics)

# Load data
data("employees")

# Calculate sample variance for annual compensation
var(employees$annual_comp)

## [1] 1788038934

Sample statistics are the default in R. Since the population variance differs from the sample variance by a factor of (n − 1)/n, it is simple to convert output from var() to the population variance:

# Store number of observations
n = length(employees$annual_comp)

# Calculate population variance for annual compensation
var(employees$annual_comp) * (n - 1) / n

## [1] 1786822581

Standard Deviation

The standard deviation is simply the square root of the variance.
The standard deviation of a population is defined by:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

The standard deviation of a sample is defined by:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Since a squared value can be converted back to its original units by taking its
square root, the standard deviation expresses variability around the mean in the
variable’s original units.
In R, the sample standard deviation can be calculated using the sd() function:

# Calculate sample standard deviation for annual compensation
sd(employees$annual_comp)

## [1] 42285.21
Since the population standard deviation differs from the sample standard deviation by a factor of √((n − 1)/n), it is simple to convert output from sd() to the population standard deviation:

# Calculate population standard deviation for annual compensation
sd(employees$annual_comp) * sqrt((n - 1) / n)

## [1] 42270.82

Quartiles

A quartile is a type of quantile that partitions data into four equally sized parts after
ordering the data. Each quartile is equally sized with respect to the number of data
points—not the range of values in each. Quartiles are also related to percentiles.
For example, Q1 is the 25th percentile—the value at or below which 25% of values
lie. Percentiles are likely more familiar than quartiles, as percentiles show up in the
height and weight measurements of babies, performance on standardized tests like
the SAT and GRE, among other things.
The Interquartile Range (IQR) represents the difference between Q3 and Q1 cut point values (the middle two quartiles). The IQR is sometimes used to detect extreme values in a distribution; values less than Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR are generally considered outliers.
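These fences are simple to compute in R. The vector below is hypothetical; in practice you would substitute a field such as employees$annual_comp:

```r
# Hypothetical vector containing one extreme value
v <- c(2, 4, 4, 5, 6, 7, 8, 9, 50)

q1 <- quantile(v, probs = 0.25)  # 25th percentile
q3 <- quantile(v, probs = 0.75)  # 75th percentile
iqr <- q3 - q1                   # equivalently, IQR(v)

# Values outside Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged as outliers
v[v < q1 - 1.5 * iqr | v > q3 + 1.5 * iqr]

## [1] 50
```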

In R, the quantile() function returns the values that bookend each quartile:

# Return quartiles for annual compensation
quantile(employees$annual_comp)

##     0%    25%    50%    75%   100%
##  62400  99840 137280 174200 208000

Based on this output, we know that 25% of people in our data earn annual compensation of 99,840 USD or less, 137,280 USD is the median annual compensation, and 75% of people earn annual compensation of 174,200 USD or less.
We can also return a specific percentile value using the probs argument in the
quantile() function. For example, if we want to know the 80th percentile annual
compensation value, we can execute the following:

# Return 80th percentile annual compensation value
quantile(employees$annual_comp, probs = .8)

## 80%
## 180960
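The probs argument also accepts a vector of probabilities, so multiple percentiles can be returned in a single call; for example, deciles of a simple illustrative sequence:

```r
# Return deciles (10th through 90th percentiles) of a 1-100 sequence
quantile(1:100, probs = seq(0.1, 0.9, by = 0.1))
```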

In addition, the summary() function returns several common descriptive statistics for an object:

# Return common descriptives
summary(employees$annual_comp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   62400   99840  137280  137054  174200  208000

Box plots are a common way to visualize the distribution of data. Box plots are not
usually found in presentations to stakeholders, since they are a bit more technical
and often require explanation, but these are very useful to analysts for understanding
data distributions during the EDA phase.
Let us visualize the spread of annual compensation by education level and gender
using the geom_boxplot() function from the ggplot2 library:

# Load library
library(ggplot2)

# Produce box plots to visualize compensation distribution by education level and gender
ggplot2::ggplot(employees, aes(x = as.factor(ed_lvl), y = annual_comp, color = gender)) +
  ggplot2::geom_boxplot() +
  ggplot2::labs(x = "Education Level", y = "Annual Compensation") +
  ggplot2::guides(col = guide_legend("Gender")) +
  ggplot2::theme_bw()

(Figure: box plots of Annual Compensation by Education Level (1–5), colored by Gender)

Box plots can be interpreted as follows:
• Horizontal lines represent median compensation values.
• The box in the middle of each distribution represents the IQR.
• The end of the line above the IQR represents the threshold for outliers in the upper range: Q3 + 1.5 × IQR.
• The end of the line below the IQR represents the threshold for outliers in the lower range: Q1 − 1.5 × IQR.
• Data points represent outliers: x > Q3 + 1.5 × IQR or x < Q1 − 1.5 × IQR.
While box plots are pervasive in statistically oriented disciplines, they can be misleading. Figure 1 illustrates how information about the shape of a distribution can be lost on a box plot. The range with the highest frequency (0–9) is not as obvious in the box plot relative to the bar chart.

Fig. 1 The number range with the highest frequency (0–9) is not as apparent with a box plot (left) relative to the bar chart (right)
Box plot alternatives such as violin plots, jittered strip plots, and raincloud plots
are often more helpful in understanding data distributions. Figure 2 shows the
juxtaposition of a raincloud plot against a box plot. While it may seem like an
oxymoron, in this case the spread of data is clearer in the rain.
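As a sketch of one such alternative, the compensation distributions above can be redrawn as violin plots by swapping geom_boxplot() for geom_violin(), reusing the same aesthetics as the earlier box plot code. The block assumes the employees data loaded earlier; a hypothetical stand-in data frame is generated if it is unavailable:

```r
# Load library
library(ggplot2)

# Hypothetical stand-in if the peopleanalytics employees data is not loaded
if (!exists("employees")) {
  employees <- data.frame(
    ed_lvl      = rep(1:5, each = 20),
    annual_comp = runif(100, 62400, 208000),
    gender      = rep(c("Female", "Male"), 50)
  )
}

# Violin plots show the full density of compensation at each education level
ggplot2::ggplot(employees, aes(x = as.factor(ed_lvl), y = annual_comp,
                               color = gender)) +
  ggplot2::geom_violin() +
  ggplot2::labs(x = "Education Level", y = "Annual Compensation") +
  ggplot2::guides(col = guide_legend("Gender")) +
  ggplot2::theme_bw()
```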

Skewness

Skewness is a measure of the horizontal distance between the mode and mean—
a representation of symmetric distortion. In most practical settings, data are not
normally distributed. That is, the data are skewed either positively (right-tailed
distribution) or negatively (left-tailed distribution). The coefficient of skewness is
one of many ways in which we can ascertain the degree of skew in the data. The
skewness of sample data is defined as:

$$Sk = \frac{1}{n} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}$$
Fig. 2 Raincloud plot superimposed on a box plot to illustrate the data distribution

A positive skewness coefficient indicates positive skew, while a negative coefficient indicates negative skew. The order of descriptive statistics can also be leveraged to ascertain the direction of skew in the data:
• Positive skewness: mode < median < mean
• Negative skewness: mode > median > mean
• Symmetrical distribution: mode = median = mean
Figure 3 illustrates the placement of these descriptive statistics in each of the
three types of distributions. The magnitude of skewness can be determined by
measuring the distance between the mode and mean relative to the variable’s scale.
Alternatively, we can simply evaluate this using the coefficient of skewness:
• If skewness is between −0.5 and 0.5, the data are considered symmetrical.
• If skewness is between −0.5 and −1 or 0.5 and 1, the data are moderately skewed.
• If skewness is < −1 or > 1, the data are highly skewed.
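These rules of thumb can be wrapped in a small helper function (the function name and inclusive boundary handling below are illustrative choices, not from the text):

```r
# Classify a skewness coefficient using the rules of thumb above
skew_label <- function(sk) {
  if (abs(sk) <= 0.5) {
    "approximately symmetrical"
  } else if (abs(sk) <= 1) {
    "moderately skewed"
  } else {
    "highly skewed"
  }
}

skew_label(0.3)   # "approximately symmetrical"
skew_label(-0.8)  # "moderately skewed"
skew_label(2.27)  # "highly skewed"
```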
Since there is not a base R function for skewness, we can leverage the moments
library to calculate skewness:

Fig. 3 Skewness

# Load library
library(moments)

# Calculate skewness for org tenure, rounded to two decimal places via the round() function
round(moments::skewness(employees$org_tenure), 2)

## [1] 2.27

Statistical moments, after which this library was named, play an important role in
specifying the appropriate probability distribution for a set of data. Moments are a
set of statistical parameters used to describe the characteristics of a distribution.
Skewness is the third statistical moment in the set; hence the sum of cubed
differences and cubic polynomial in the denominator of the formula above. The
first four moments are: (1) expected value or mean, (2) variance and
standard deviation, (3) skewness, and (4) kurtosis.
We can verify that the skewness() function from the moments library returns
the expected value (per the aforementioned formula) by validating against a manual
calculation:

# Store components of skewness calculation
n = length(employees$org_tenure)
x = employees$org_tenure
x_bar = mean(employees$org_tenure)
s = sd(employees$org_tenure)

# Calculate skewness manually, rounded to two decimal places via the round() function
round(1/n * (sum((x - x_bar)^3) / s^3), 2)

## [1] 2.27

A skewness coefficient of 2.27 indicates that organization tenure is positively
skewed. We can visualize the data to confirm the expected right-tailed distribution
(Fig. 4):

# Produce histogram to visualize sample distribution
ggplot2::ggplot() +
  ggplot2::aes(employees$org_tenure) +
  ggplot2::labs(x = "Organization Tenure", y = "Density") +
  ggplot2::geom_histogram(aes(y = ..density..), fill = "#414141") +
  ggplot2::geom_density(fill = "#ADD8E6", alpha = 0.6) +
  ggplot2::theme_bw()

Kurtosis

While skewness provides information on the symmetry of a distribution, kurtosis provides information on the heaviness of a distribution's tails ("tailedness"). Kurtosis is the fourth statistical moment, defined by:

$$K = \frac{1}{n} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{s^4}$$
Note that the quartic functions characteristic of the fourth statistical moment are
the only differences from the skewness formula we reviewed in the prior section
(which featured cubic functions).
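Mirroring the manual skewness calculation above, kurtosis can be computed by hand; the moments library also provides a kurtosis() function for this purpose. The vector below is hypothetical; in practice you would use a field such as employees$org_tenure:

```r
# Hypothetical sample
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
x_bar <- mean(x)
s <- sd(x)

# Manual kurtosis per the fourth-moment formula above
1/n * sum((x - x_bar)^4) / s^4  # ~2.13
```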
The terms leptokurtic and platykurtic are often used to describe distributions
with heavy and light tails, respectively. "Platy-" in platykurtic is the same root as
“platypus,” and I have found it helpful to recall the characteristics of the flat platypus
when characterizing frequency distributions as platykurtic (wide and flat) vs. its
antithesis, leptokurtic (tall and skinny). The normal (or Gaussian) distribution is
referred to as a mesokurtic distribution in the context of kurtosis.
Figure 5 illustrates the three kurtosis categorizations.
