Basics: Data Description
12 June 2024 16:20
Describing and Summarising Data
When we acquire a set of data, we should begin by asking some important questions:
• Where do the data come from?
• How were they collected?
• How can we help the data tell their story?
Variability
The mean, median and mode give you a sense of the center of the data, but none of these indicate how far the data are spread around the center.
Standard Deviation
The standard deviation tells us how far the data are spread out. A large standard deviation indicates that the data are widely dispersed. A smaller standard deviation tells us that the data points are more tightly clustered together.
Outlier- First, we must investigate why an outlier exists. Is it just an unusual but valid value? Could it be a data entry error? Was it collected in a different way than the rest of the data? At a different time? The following actions can be taken:
• Leave it alone
• Remove it- if not relevant for the analysis, eg- a data entry error
• Replace it
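The three actions above can be sketched in Python. This is only an illustration under stated assumptions: the values are made up, the suspect value is picked by inspection, and replacing it with the median of the remaining values is one possible choice that the notes do not specify.

```python
import statistics

# Hypothetical values; 950 stands in for a suspected data entry error.
values = [95, 102, 98, 950, 101, 97]
suspect = 950  # chosen by inspection, not by any rule from the notes

# Option 1: leave it alone (the data are used unchanged).
left_alone = list(values)

# Option 2: remove it.
removed = [v for v in values if v != suspect]

# Option 3: replace it. Using the median of the remaining values
# is an assumption made for this sketch.
replacement = statistics.median(removed)
replaced = [replacement if v == suspect else v for v in values]

# Compare how each choice moves the mean.
for name, data in [("left alone", left_alone),
                   ("removed", removed),
                   ("replaced", replaced)]:
    print(f"{name:>10}: mean = {statistics.mean(data):.1f}")
```

Whichever option is chosen, the investigation of why the outlier exists comes first; the code only shows the mechanics once that decision has been made.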
Histogram
Excel- Data Analysis > Histogram > Input values > Bin range > Tick on chart output
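For readers who want to see what the binning step does, here is a rough Python sketch. It assumes a value is counted in the first bin whose upper bound is at least that value, with anything larger going into a final "More" row; the values and bin bounds are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical input values and bin upper bounds (the "bin range").
values = [5, 10, 12, 18, 20, 25, 31, 40]
bin_upper_bounds = [10, 20, 30]

# One counter per bin, plus a final "More" bucket for larger values.
counts = [0] * (len(bin_upper_bounds) + 1)
for v in values:
    # Index of the first upper bound that is >= v.
    counts[bisect_left(bin_upper_bounds, v)] += 1

# Text version of the chart output.
labels = [f"<= {b}" for b in bin_upper_bounds] + ["More"]
for label, count in zip(labels, counts):
    print(f"{label:>6} | {'*' * count}")
```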
Measures of central tendency- Sense of the centre of the data
Mean- AVERAGE function
Median- MEDIAN function, an unordered list works as well
Mode- MODE function
Variance
1. Each Saturday's number of requests lies a certain distance from 172, the mean number of requests. To find the variance, we first sum the squares of these differences. Why square the differences?
2. A hotel manager would want information about the magnitude of each difference, which can be positive, negative, or zero. If we simply summed the differences between each Saturday's requests and the mean, positive and negative differences would cancel each other out.
3. The formula for variance adds up the squared differences and divides by n-1 to get a type of "average" squared difference as a measure of variability. (The reason we divide by n-1 to get an average here is a technicality beyond the scope of this course.)
SD (square root of the variance)= 25.2 requests
Interpretation
The larger the SD, the larger the spread, and vice versa
Excel-
scuba_price
Even without calculation, we can figure out which dataset has the lower SD by comparing how close its values are to the mean.
Eg- Dataset A looks less spread out, as its values are closer to the mean, so A has the lower SD, which is the correct answer.
Coefficient of Variation
To get a sense of the relative magnitude of the variation in a data set, we want to compare the standard deviation of the data to the data's mean (coefficient of variation = standard deviation / mean).
Applying Data Analysis
Skewness of Histogram
In a right-skewed distribution, the tail extends towards the higher values. The peak of the distribution is on the left side, and the mean is greater than the median. This skewness suggests that while most data points cluster towards the lower end, a few significantly higher values stretch the tail towards the right.
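The summary measures in these notes (mean, median, mode, variance with n-1, standard deviation, coefficient of variation, and the mean sitting above the median in right-skewed data) can be sketched in Python. The request counts below are made up for illustration and are not the hotel data; only the figures 172 and 25.2 are taken from the notes.

```python
import statistics

# Hypothetical request counts, made up for illustration.
# These are NOT the hotel's Saturday data from the notes.
requests = [10, 12, 12, 13, 14, 15, 50]
n = len(requests)

# Measures of central tendency (Excel: AVERAGE, MEDIAN, MODE).
mean = statistics.mean(requests)
median = statistics.median(requests)  # input does not need to be sorted
mode = statistics.mode(requests)

# Plain differences from the mean cancel out, which is why they are squared.
differences = [x - mean for x in requests]
sum_of_differences = sum(differences)

# Variance: sum of squared differences divided by n - 1.
variance = sum(d ** 2 for d in differences) / (n - 1)

# Standard deviation: square root of the variance.
sd = variance ** 0.5

# Coefficient of variation: standard deviation relative to the mean.
cv = sd / mean

# The same ratio using the two figures quoted in the notes.
cv_notes = 25.2 / 172

# One sign of right skew: the high tail pulls the mean above the median.
mean_above_median = mean > median

print(f"mean={mean}, median={median}, mode={mode}")
print(f"sum of plain differences={sum_of_differences}")
print(f"variance={variance:.2f}, sd={sd:.2f}, cv={cv:.2f}")
print(f"cv from the notes' figures={cv_notes:.3f}")
print(f"mean above median: {mean_above_median}")
```

The hand-computed variance and standard deviation should agree with `statistics.variance` and `statistics.stdev`, which also divide by n-1.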
Relationship between variables
Hidden Variables
Even when two data sets seem to be directly related, we may need to investigate further to understand the reason for the relationship. We may find that the reason is not due to any fundamental connection between the two variables themselves, but that they are instead mutually related to another underlying factor. Eg- A plot might show that as baseball sales increase, hockey puck sales decrease, but the actual reason is that baseball is played in summer and hockey in winter.
False relationships
Variable and Time
Assuming we have price data collected over time, we can plot a scatter diagram for memory price, in the same way we plotted height and weight. Because time is one of the variables, we call this graph a time series.
Time series will help us recognize seasonal patterns and yearly trends. But we must be careful: we shouldn't rely only on visual analysis when looking for relationships and patterns.
Correlation
Correlation quantifies the extent to which there is a linear relationship between two variables. To describe the strength of a linear relationship, the correlation coefficient takes on values between -1 and +1. If every point falls exactly on a line with a negative slope, the correlation coefficient is exactly -1.
Even when the correlation coefficient is 0, a relationship might exist, just not a linear one. As we've seen, scatter plots can reveal patterns and help us better understand the business context the data describe.
Influence of outliers
Eg- Suppose a manager suspects that his employees skip work to enjoy the good life more often as the temperature rises. After pairing absences with daily temperature data, he finds the correlation coefficient to be 0.466.
But look at the data: besides a few outliers, there isn't a clear relationship. Seeing the scatter plot, the manager might realize that the three outliers correspond to a late-summer, three-day transportation strike that kept some workers homebound the previous year.
Without looking at the data, the correlation coefficient can lead us down false paths. If we exclude the outliers, the relationship disappears, and the correlation essentially drops to zero, quieting any suspicion of weather.
As a summary statistic for the data, the correlation coefficient is calculated numerically, incorporating the value of every data point. Because measures like correlation give more weight to points distant from the center of the data, outliers can strongly influence the correlation coefficient of the entire set. In these situations, our intuition and the measure we use to quantify our intuition can be quite different. We should always attempt to reconcile those differences by returning to the data.
Excel- CORREL function
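The behaviour of the correlation coefficient described above can be sketched in Python. The helper `pearson` is written out by hand for this illustration, and all of the data are made up: they are not the manager's absence data and will not reproduce the 0.466 figure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(dev_x * dev_y)

# Every point on a line with negative slope: coefficient of -1.
r_line = pearson([1, 2, 3, 4], [8, 6, 4, 2])

# A clear but non-linear (U-shaped) relationship: coefficient of 0.
r_curve = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

# Hypothetical temperatures and absences with no linear relationship...
temps = [60, 65, 70, 75, 80, 85]
absences = [3, 5, 4, 4, 5, 3]
r_without_outliers = pearson(temps, absences)

# ...plus three hypothetical outlier days (hot days with many absences).
temps_all = temps + [88, 89, 90]
absences_all = absences + [15, 16, 14]
r_with_outliers = pearson(temps_all, absences_all)

print(f"line with negative slope: {r_line:.3f}")
print(f"U-shaped relationship:    {r_curve:.3f}")
print(f"without outliers:         {r_without_outliers:.3f}")
print(f"with three outliers:      {r_with_outliers:.3f}")
```

Three points far from the center of the data are enough to turn a coefficient near zero into a clearly positive one, which is why the scatter plot should be checked alongside the number.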