
Basics of Data Analysis
Range
• The range is the difference between the maximum and minimum values of a certain variable.

• It gives you an idea of the spread or dispersion of the data.

• Example: If you consider the Sales_Volume variable, the range would be the difference between
the highest number of units sold at any store and the lowest number of units sold.

• For instance, if the maximum Sales_Volume is 500 units and the minimum is 50 units, the range
would be 500−50=450 units.

• This indicates that the sales volume across different stores and products varies by 450 units.
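A minimal Python sketch of this calculation (the sales figures below are invented for illustration):

# Hypothetical per-store sales volumes (units sold)
sales_volume = [50, 120, 300, 275, 500, 90]

data_range = max(sales_volume) - min(sales_volume)
print(data_range)  # 450 -> sales volume varies by 450 units across stores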
Variance
• Variance measures how much the data points in a dataset differ from the mean (average) value.

• It quantifies the degree of spread in the data.

• A higher variance indicates that the data points are more spread out from the mean, while a
lower variance indicates that they are closer to the mean.

• Example: Consider the Revenue variable. The variance in Revenue would tell us how much the
revenues generated by different stores vary from the average revenue.

• A high variance might suggest that some stores are performing exceptionally well while others are not, indicating possible inconsistencies in store performance.
Formula for calculating Variance

Population variance: σ² = Σ(xᵢ − μ)² / N
Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
where μ is the population mean, x̄ is the sample mean, and N and n are the number of data points in the population and sample respectively.
Variance

• Why is 1/(n−1) used in the formula instead of 1/n?


Understanding the Difference: Population vs.
Sample

 Population Variance: If you have data for the entire population, you can calculate the variance using 1/N, where N is the total number of data points in the population. This gives you the true variance.

 Sample Variance: When you only have a sample (a subset) of the entire population, using 1/n tends to underestimate the true variance of the population. This is because the sample mean x̄ is not necessarily the population's true mean; it is an estimate based on the sample.
1/(n−1) Is Used To…
Correcting for Bias: Using 1/n assumes that the sample mean x̄ is exactly the same as the population mean, which is usually not the case.
The sample mean is typically closer to the sample data points than the true population mean is to the population data points.
This results in smaller deviations (squared differences) when using the sample mean, leading to an underestimation of the variance.

Degrees of Freedom: The term n−1 represents the "degrees of freedom" in the dataset.
When calculating variance, one degree of freedom is "lost" because the mean of the sample is itself an estimate
derived from the sample.
By dividing by n−1, we account for this loss of a degree of freedom, making the sample variance an unbiased
estimate of the population variance.
Degrees of Freedom refer to the number of independent values or observations in the dataset that are free to
vary when estimating statistical parameters.
In the case of variance, only n−1 data points are free to vary after calculating the mean.
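As a rough illustration of the 1/n versus 1/(n−1) choice (not part of the original slides), NumPy exposes it through the ddof argument:

import numpy as np

# Hypothetical sample of store revenues (in thousands)
revenue = np.array([120, 135, 150, 160, 180], dtype=float)

pop_var = revenue.var(ddof=0)     # divides by n   (population formula)
sample_var = revenue.var(ddof=1)  # divides by n-1 (unbiased sample estimate)
print(pop_var, sample_var)        # the ddof=1 value is slightly larger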
Standard Deviation
Standard deviation is the square root of variance and measures the average distance of each data point from the
mean.
It is expressed in the same units as the data, making it more interpretable.

Example: If we take the Customer_Satisfaction scores, the standard deviation would indicate how much the
satisfaction scores for different stores vary from the average satisfaction score.
A low standard deviation means most stores have customer satisfaction scores close to the average, suggesting a
consistent customer experience across the stores.
A high standard deviation might indicate that while some stores have very high satisfaction scores, others may
have much lower scores, pointing to inconsistencies in service or product quality.
These measures help the store understand the consistency and variability in key business metrics like sales volume,
revenue, and customer satisfaction, which can inform strategic decisions on inventory management, marketing
efforts, and overall store performance.
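A brief sketch of the same idea in Python, using made-up Customer_Satisfaction scores:

import numpy as np

satisfaction = np.array([4.1, 4.3, 3.9, 4.2, 4.0])  # hypothetical per-store scores
std = satisfaction.std(ddof=1)                       # same units as the scores
print(round(std, 2))  # about 0.16 -> scores cluster near the mean, i.e. a consistent experience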
Calculation of Outliers

Upper fence = 75th percentile (Q3) + 1.5 × IQR; values above this are treated as outliers.

Lower fence = 25th percentile (Q1) − 1.5 × IQR; values below this are treated as outliers.

IQR (interquartile range) = Q3 − Q1.

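A minimal sketch of the fence calculation (the revenue figures are illustrative, not from the slides):

import numpy as np

revenue = np.array([45, 50, 52, 55, 58, 60, 61, 63, 65, 200], dtype=float)

q1, q3 = np.percentile(revenue, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

outliers = revenue[(revenue > upper_fence) | (revenue < lower_fence)]
print(outliers)  # the value 200 is flagged as an outlier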

Normal Distribution curve
A normal distribution curve, often referred to as the "bell curve" due to its shape, is a graphical representation of a
normal distribution.
It's characterized by its symmetrical, bell-shaped appearance, where the mean, median, and mode of the data are
all equal and located at the center of the distribution.

Key Features of an Ideal Normal Distribution Curve:


1. Symmetry: The curve is perfectly symmetrical around the mean, meaning the left and right halves are mirror
images.

2. Mean, Median, and Mode: All three measures of central tendency are the same and located at the peak of the
curve.
Normal Distribution curve
3. 68-95-99.7 Rule:
- About 68% of the data falls within one standard deviation of the mean.
- About 95% falls within two standard deviations.
- About 99.7% falls within three standard deviations.

4. Tails: The tails of the curve extend infinitely in both directions, approaching but never touching the horizontal
axis, indicating that extreme values are possible but less likely.

5. Area Under the Curve: The total area under the curve equals 1 (or 100%), representing the entire dataset.
What does One Standard Deviation (±1σ) mean?

 Definition: This is the range that covers all data points that are within one standard deviation from the mean.

 In a Normal Distribution:
o Approximately 68% of the data falls within this range.
o Mathematically, if the mean (μ) of the dataset is 50 and the standard deviation (σ) is 5, then one standard
deviation from the mean would cover the range from 45 to 55.
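The 68-95-99.7 percentages can be checked numerically; a small SciPy sketch (an illustration, not part of the original slides), using the μ = 50, σ = 5 example above:

from scipy.stats import norm

mu, sigma = 50, 5

# Probability mass within ±1, ±2, ±3 standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(k, round(p, 4))  # ~0.6827, ~0.9545, ~0.9973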
Test for Normality: K-S test
The Kolmogorov-Smirnov (K-S) test is a non-parametric test used in SPSS to determine if a sample of data differs
significantly from a specified distribution, often the normal distribution.
When used as a normality test, it compares your sample data with a normally distributed data set with the same
mean and standard deviation as your sample.

Understanding the K-S Test Output in SPSS


When you run the Kolmogorov-Smirnov test in SPSS, you will see several key pieces of information in the output:

Test for Normality: K-S test
Interpreting the Results:

1. Null Hypothesis (H₀):


The null hypothesis for the K-S test is that the sample data comes from a normal distribution.

2. p-value:
 p > 0.05: If the p-value is greater than 0.05, you fail to reject the null hypothesis. This suggests that there is no
significant difference between the sample distribution and the normal distribution, meaning the data is
consistent with normality.

 p ≤ 0.05: If the p-value is less than or equal to 0.05, you reject the null hypothesis. This suggests that the
sample data significantly differs from the normal distribution, indicating that the data is not normally
distributed.
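The slides describe the SPSS output; as a hedged illustration, the same comparison can be sketched with SciPy (simulated data, parameters estimated from the sample):

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=200)  # hypothetical, roughly normal data

# Compare the sample with a normal distribution having the sample's mean and sd
stat, p = kstest(sample, 'norm', args=(sample.mean(), sample.std(ddof=1)))
print(stat, p)  # p > 0.05 here -> fail to reject H0, consistent with normality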
Test for Normality: Shapiro-Wilk test
The Shapiro-Wilk test is a statistical test used to assess whether a given dataset is normally distributed.

It is particularly useful for small to medium-sized samples.

Key Points About the Shapiro-Wilk Test:

1. Purpose:
o The test evaluates the null hypothesis that the data is drawn from a normally distributed population.

2. Interpretation:
o P-value: The primary output of the Shapiro-Wilk test is a p-value.
o If the p-value is greater than a chosen significance level (e.g., 0.05), we fail to reject the null hypothesis, suggesting that the data is normally distributed.
o If the p-value is less than the significance level, we reject the null hypothesis, indicating that the data significantly deviates from a normal distribution.
Test for Normality: Shapiro-Wilk test
3. When to Use:
o The Shapiro-Wilk test is recommended for datasets with small to moderate sample sizes (typically fewer than 2,000 observations).
o For larger samples, other normality tests like the Kolmogorov-Smirnov test might be more appropriate.

4. Limitations:
o While the Shapiro-Wilk test is sensitive to deviations from normality, it can be too sensitive for large datasets, where even minor deviations from normality can result in a significant p-value.
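A minimal SciPy sketch of the test (illustrative; the slides discuss the SPSS output):

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
sample = rng.normal(loc=100, scale=15, size=50)  # hypothetical small sample

stat, p = shapiro(sample)
print(stat, p)  # p > 0.05 -> no evidence against normality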
Skewness
Skewness measures the asymmetry of the data distribution around the mean.

 Positive Skewness (Right-Skewed):


o If the skewness value is positive, the distribution has a longer tail on the right side.
o Most of the data points are concentrated on the left side of the distribution, with fewer values extending into the right
tail.
o Example: Income distributions in a society.

 Negative Skewness (Left-Skewed):


o If the skewness value is negative, the distribution has a longer tail on the left side.
o Most of the data points are concentrated on the right side, with fewer values extending into the left tail.
o Example: Age of death in developed nations.
Skewness
 Skewness Close to Zero:
o A skewness value near zero indicates that the data distribution is fairly symmetrical.
o A perfectly symmetrical distribution (like a normal distribution) has a skewness of zero.

Interpretation in SPSS:
 Normal Distribution: Skewness close to 0 (between -0.5 and 0.5) suggests the data is approximately normally distributed.
 Mild Skewness: Skewness between -1 and -0.5 or 0.5 and 1 indicates mild skewness.
 Severe Skewness: Skewness less than -1 or greater than 1 indicates a highly skewed distribution.
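A short sketch of how a skewness value might be computed outside SPSS (SciPy shown purely for illustration):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
income = rng.exponential(scale=40000, size=1000)  # right-skewed, like income data

print(skew(income))  # clearly positive -> longer right tail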
Kurtosis
Kurtosis measures the "tailedness" of the distribution, or how much of the data is in the tails compared to a normal
distribution.

 Positive Kurtosis (Leptokurtic):


o A positive kurtosis value indicates that the distribution has heavier tails than a normal distribution.
o There are more outliers (extreme values) than you would expect in a normal distribution.
o The peak of the distribution is sharper and higher than a normal distribution.

 Negative Kurtosis (Platykurtic):


o A negative kurtosis value indicates that the distribution has lighter tails than a normal distribution.
o The distribution is flatter, with fewer outliers than a normal distribution.
o The peak is lower and broader compared to a normal distribution.
Kurtosis

 Kurtosis Close to Zero:


o A kurtosis value close to zero suggests that the data's tails and peak are similar to those of a normal distribution (mesokurtic).

Interpretation in SPSS:
 Normal Distribution: Kurtosis close to 0 (within the range of -2 to 2) suggests that the data is approximately normally
distributed.
 High Kurtosis (Leptokurtic): Indicates more extreme outliers than a normal distribution.
 Low Kurtosis (Platykurtic): Indicates fewer extreme outliers than a normal distribution.
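A brief Python illustration (SciPy's kurtosis function reports excess kurtosis, so a normal distribution scores near 0, matching the interpretation above):

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(3)
normal_data = rng.normal(size=5000)
heavy_tailed = rng.standard_t(df=3, size=5000)  # produces more extreme values

print(kurtosis(normal_data))   # close to 0 (mesokurtic)
print(kurtosis(heavy_tailed))  # clearly positive (leptokurtic)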
Correlation
Correlation is a statistical measure that describes the strength and direction of the relationship between two variables.
When two variables are correlated, changes in one variable are associated with changes in the other.

Key Concepts:
1. Direction of Correlation:
o Positive Correlation: When one variable increases, the other variable tends to increase as well. For example, as the
number of hours studied increases, exam scores might also increase.
o Negative Correlation: When one variable increases, the other variable tends to decrease. For example, as the
temperature increases, the sales of winter coats might decrease.

2. Strength of Correlation:
o The strength of the correlation indicates how closely the two variables are related.
o Strong Correlation: The data points lie close to a straight line when plotted on a scatterplot.
o Weak Correlation: The data points are more scattered, indicating a looser relationship.
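A minimal sketch of computing a correlation coefficient for the hours-studied example (the numbers are invented):

import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
exam_score = np.array([52, 55, 61, 58, 70, 72, 78, 85], dtype=float)

r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(round(r, 2))  # close to +1 -> strong positive correlation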
Limitations of Correlation
1. Correlation Does Not Imply Causation:
o Just because two variables are correlated does not mean that one causes the other.
o For example, ice cream sales and drowning incidents might be positively correlated, but eating ice cream does not
cause drowning—both are related to a third variable, such as hot weather.

2. Linear Relationship Assumption:


o Correlation measures linear relationships.
o If the relationship between the variables is non-linear, the correlation coefficient may not capture the strength of the
relationship accurately.

3. Outliers:
o Outliers can significantly affect the correlation coefficient, either inflating or deflating the perceived strength of the
relationship.
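As a hedged illustration of points 2 and 3 above, a quadratic relationship can show a near-zero linear correlation, and a single extreme point can inflate an otherwise weak one:

import numpy as np

# Non-linear relationship: y depends strongly on x, yet Pearson r is ~0
x = np.linspace(-5, 5, 101)
y = x ** 2
print(round(np.corrcoef(x, y)[0, 1], 3))  # approximately 0

# Outlier effect: one joint outlier inflates the correlation of unrelated data
rng = np.random.default_rng(4)
a = rng.normal(size=30)
b = rng.normal(size=30)          # unrelated to a
a2, b2 = np.append(a, 10.0), np.append(b, 10.0)
print(round(np.corrcoef(a, b)[0, 1], 2), round(np.corrcoef(a2, b2)[0, 1], 2))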
