
Basics of Data Analysis
Range
• The range is the difference between the maximum and minimum values of a certain variable.

• It gives you an idea of the spread or dispersion of the data.

• Example: If you consider the Sales_Volume variable, the range would be the difference between
the highest number of units sold at any store and the lowest number of units sold.

• For instance, if the maximum Sales_Volume is 500 units and the minimum is 50 units, the range
would be 500−50=450 units.

• This indicates that the sales volume across different stores and products varies by 450 units.
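A minimal Python sketch of this calculation (the sales figures below are invented for illustration):

# Hypothetical per-store sales volumes (units sold)
sales_volume = [50, 120, 300, 275, 500, 90]

data_range = max(sales_volume) - min(sales_volume)
print(data_range)  # 450 -> sales volume varies by 450 units across stores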
Variance
• Variance measures how much the data points in a dataset differ from the mean (average) value.

• It quantifies the degree of spread in the data.

• A higher variance indicates that the data points are more spread out from the mean, while a
lower variance indicates that they are closer to the mean.

• Example: Consider the Revenue variable. The variance in Revenue would tell us how much the
revenues generated by different stores vary from the average revenue.

• A high variance might suggest that some stores are performing exceptionally well while others are not, indicating possible inconsistencies in store performance.
Formula for calculating Variance

Population variance: σ² = Σ(xᵢ − μ)² / N
Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
where μ is the population mean, x̄ is the sample mean, and N and n are the number of data points in the population and sample respectively.
Variance

• Why is 1/(n−1) used in the formula instead of 1/n?


Understanding the Difference: Population vs.
Sample

 Population Variance: If you have data for the entire population, you can calculate the variance using 1/N, where N is the total number of data points in the population. This gives you the true variance.

 Sample Variance: When you only have a sample (a subset) of the entire population, using 1/n tends to underestimate the true variance of the population. This is because the sample mean x̄ is not necessarily the population's true mean; it is an estimate based on the sample.
1/(n−1) Is Used To…
Correcting for Bias: Using 1/n assumes that the sample mean x̄ is exactly the same as the population mean, which is usually not the case.
The sample mean is typically closer to the sample data points than the true population mean is to the population data points.
This results in smaller deviations (squared differences) when using the sample mean, leading to an underestimation of the variance.

Degrees of Freedom: The term n−1 represents the "degrees of freedom" in the dataset.
When calculating variance, one degree of freedom is "lost" because the mean of the sample is itself an estimate
derived from the sample.
By dividing by n−1, we account for this loss of a degree of freedom, making the sample variance an unbiased
estimate of the population variance.
Degrees of Freedom refer to the number of independent values or observations in the dataset that are free to
vary when estimating statistical parameters.
In the case of variance, only n−1 data points are free to vary after calculating the mean.
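As a rough illustration of the 1/n versus 1/(n−1) choice (not part of the original slides), NumPy exposes it through the ddof argument:

import numpy as np

# Hypothetical sample of store revenues (in thousands)
revenue = np.array([120, 135, 150, 160, 180], dtype=float)

pop_var = revenue.var(ddof=0)     # divides by n   (population formula)
sample_var = revenue.var(ddof=1)  # divides by n-1 (unbiased sample estimate)
print(pop_var, sample_var)        # the ddof=1 value is slightly larger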
Standard Deviation
Standard deviation is the square root of variance and measures the average distance of each data point from the
mean.
It is expressed in the same units as the data, making it more interpretable.

Example: If we take the Customer_Satisfaction scores, the standard deviation would indicate how much the
satisfaction scores for different stores vary from the average satisfaction score.
A low standard deviation means most stores have customer satisfaction scores close to the average, suggesting a
consistent customer experience across the stores.
A high standard deviation might indicate that while some stores have very high satisfaction scores, others may
have much lower scores, pointing to inconsistencies in service or product quality.
These measures help the store understand the consistency and variability in key business metrics like sales volume,
revenue, and customer satisfaction, which can inform strategic decisions on inventory management, marketing
efforts, and overall store performance.
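A brief sketch of the same idea in Python, using made-up Customer_Satisfaction scores:

import numpy as np

satisfaction = np.array([4.1, 4.3, 3.9, 4.2, 4.0])  # hypothetical per-store scores
std = satisfaction.std(ddof=1)                       # same units as the scores
print(round(std, 2))  # about 0.16 -> scores cluster near the mean, i.e. a consistent experience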
Calculation of Outliers

Upper fence = 75th percentile (Q3) + 1.5 × IQR; values above this are treated as outliers.

Lower fence = 25th percentile (Q1) − 1.5 × IQR; values below this are treated as outliers.

IQR (interquartile range) = Q3 − Q1.

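A minimal sketch of the fence calculation (the revenue figures are illustrative, not from the slides):

import numpy as np

revenue = np.array([45, 50, 52, 55, 58, 60, 61, 63, 65, 200], dtype=float)

q1, q3 = np.percentile(revenue, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

outliers = revenue[(revenue > upper_fence) | (revenue < lower_fence)]
print(outliers)  # the value 200 is flagged as an outlier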

Normal Distribution curve
A normal distribution curve, often referred to as the "bell curve" due to its shape, is a graphical representation of a
normal distribution.
It's characterized by its symmetrical, bell-shaped appearance, where the mean, median, and mode of the data are
all equal and located at the center of the distribution.

Key Features of an Ideal Normal Distribution Curve:


1. Symmetry: The curve is perfectly symmetrical around the mean, meaning the left and right halves are mirror
images.

2. Mean, Median, and Mode: All three measures of central tendency are the same and located at the peak of the
curve.
Normal Distribution curve
3. 68-95-99.7 Rule:
- About 68% of the data falls within one standard deviation of the mean.
- About 95% falls within two standard deviations.
- About 99.7% falls within three standard deviations.

4. Tails: The tails of the curve extend infinitely in both directions, approaching but never touching the horizontal
axis, indicating that extreme values are possible but less likely.

5. Area Under the Curve: The total area under the curve equals 1 (or 100%), representing the entire dataset.
What does One Standard Deviation (±1σ) mean?

 Definition: This is the range that covers all data points that are within one standard deviation from the mean.

 In a Normal Distribution:
o Approximately 68% of the data falls within this range.
o Mathematically, if the mean (μ) of the dataset is 50 and the standard deviation (σ) is 5, then one standard
deviation from the mean would cover the range from 45 to 55.
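The 68-95-99.7 percentages can be checked numerically; a small SciPy sketch (an illustration, not part of the original slides), using the μ = 50, σ = 5 example above:

from scipy.stats import norm

mu, sigma = 50, 5

# Probability mass within ±1, ±2, ±3 standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(k, round(p, 4))  # ~0.6827, ~0.9545, ~0.9973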
Test for Normality: K-S test
The Kolmogorov-Smirnov (K-S) test is a non-parametric test used in SPSS to determine if a sample of data differs
significantly from a specified distribution, often the normal distribution.
When used as a normality test, it compares your sample data with a normally distributed data set with the same
mean and standard deviation as your sample.

Understanding the K-S Test Output in SPSS


When you run the Kolmogorov-Smirnov test in SPSS, you will see several key pieces of information in the output:

Test for Normality: K-S test
Interpreting the Results:

1. Null Hypothesis (H₀):


The null hypothesis for the K-S test is that the sample data comes from a normal distribution.

2. p-value:
 p > 0.05: If the p-value is greater than 0.05, you fail to reject the null hypothesis. This suggests that there is no
significant difference between the sample distribution and the normal distribution, meaning the data is
consistent with normality.

 p ≤ 0.05: If the p-value is less than or equal to 0.05, you reject the null hypothesis. This suggests that the
sample data significantly differs from the normal distribution, indicating that the data is not normally
distributed.
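The slides describe the SPSS output; as a hedged illustration, the same comparison can be sketched with SciPy (simulated data, parameters estimated from the sample):

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=200)  # hypothetical, roughly normal data

# Compare the sample with a normal distribution having the sample's mean and sd
stat, p = kstest(sample, 'norm', args=(sample.mean(), sample.std(ddof=1)))
print(stat, p)  # p > 0.05 here -> fail to reject H0, consistent with normality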
Test for Normality: Shapiro-Wilk test
The Shapiro-Wilk test is a statistical test used to assess whether a given dataset is normally distributed.

It is particularly useful for small to medium-sized samples.

Key Points About the Shapiro-Wilk Test:

1. Purpose:
o The test evaluates the null hypothesis that the data is drawn from a normally distributed population.

2. Interpretation:
o P-value: The primary output of the Shapiro-Wilk test is a p-value.
o If the p-value is greater than a chosen significance level (e.g., 0.05), we fail to reject the null hypothesis, suggesting that the data is normally distributed.
o If the p-value is less than the significance level, we reject the null hypothesis, indicating that the data significantly deviates from a normal distribution.
Test for Normality: Shapiro-Wilk test
3. When to Use:
o The Shapiro-Wilk test is recommended for datasets with small to moderate sample sizes (typically fewer than 2,000 observations).
o For larger samples, other normality tests like the Kolmogorov-Smirnov test might be more appropriate.

4. Limitations:
o While the Shapiro-Wilk test is sensitive to deviations from normality, it can be too sensitive for large datasets, where even minor deviations from normality can result in a significant p-value.
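A minimal SciPy sketch of the test (illustrative; the slides discuss the SPSS output):

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
sample = rng.normal(loc=100, scale=15, size=50)  # hypothetical small sample

stat, p = shapiro(sample)
print(stat, p)  # p > 0.05 -> no evidence against normality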
Skewness
Skewness measures the asymmetry of the data distribution around the mean.

 Positive Skewness (Right-Skewed):


o If the skewness value is positive, the distribution has a longer tail on the right side.
o Most of the data points are concentrated on the left side of the distribution, with fewer values extending into the right
tail.
o Example: Income distributions in a society.

 Negative Skewness (Left-Skewed):


o If the skewness value is negative, the distribution has a longer tail on the left side.
o Most of the data points are concentrated on the right side, with fewer values extending into the left tail.
o Example: Age of death in developed nations.
Skewness
 Skewness Close to Zero:
o A skewness value near zero indicates that the data distribution is fairly symmetrical.
o A perfectly symmetrical distribution (like a normal distribution) has a skewness of zero.

Interpretation in SPSS:
 Normal Distribution: Skewness close to 0 (between -0.5 and 0.5) suggests the data is approximately normally distributed.
 Mild Skewness: Skewness between -1 and -0.5 or 0.5 and 1 indicates mild skewness.
 Severe Skewness: Skewness less than -1 or greater than 1 indicates a highly skewed distribution.
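A short sketch of how a skewness value might be computed outside SPSS (SciPy shown purely for illustration):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
income = rng.exponential(scale=40000, size=1000)  # right-skewed, like income data

print(skew(income))  # clearly positive -> longer right tail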
Kurtosis
Kurtosis measures the "tailedness" of the distribution, or how much of the data is in the tails compared to a normal
distribution.

 Positive Kurtosis (Leptokurtic):


o A positive kurtosis value indicates that the distribution has heavier tails than a normal distribution.
o There are more outliers (extreme values) than you would expect in a normal distribution.
o The peak of the distribution is sharper and higher than a normal distribution.

 Negative Kurtosis (Platykurtic):


o A negative kurtosis value indicates that the distribution has lighter tails than a normal distribution.
o The distribution is flatter, with fewer outliers than a normal distribution.
o The peak is lower and broader compared to a normal distribution.
Kurtosis

 Kurtosis Close to Zero:


o A kurtosis value close to zero suggests that the data's tails and peak are similar to those of a normal distribution (mesokurtic).

Interpretation in SPSS:
 Normal Distribution: Kurtosis close to 0 (within the range of -2 to 2) suggests that the data is approximately normally
distributed.
 High Kurtosis (Leptokurtic): Indicates more extreme outliers than a normal distribution.
 Low Kurtosis (Platykurtic): Indicates fewer extreme outliers than a normal distribution.
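A brief Python illustration (SciPy's kurtosis function reports excess kurtosis, so a normal distribution scores near 0, matching the interpretation above):

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(3)
normal_data = rng.normal(size=5000)
heavy_tailed = rng.standard_t(df=3, size=5000)  # produces more extreme values

print(kurtosis(normal_data))   # close to 0 (mesokurtic)
print(kurtosis(heavy_tailed))  # clearly positive (leptokurtic)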
Correlation
Correlation is a statistical measure that describes the strength and direction of the relationship between two variables.
When two variables are correlated, changes in one variable are associated with changes in the other.

Key Concepts:
1. Direction of Correlation:
o Positive Correlation: When one variable increases, the other variable tends to increase as well. For example, as the
number of hours studied increases, exam scores might also increase.
o Negative Correlation: When one variable increases, the other variable tends to decrease. For example, as the
temperature increases, the sales of winter coats might decrease.

2. Strength of Correlation:
o The strength of the correlation indicates how closely the two variables are related.
o Strong Correlation: The data points lie close to a straight line when plotted on a scatterplot.
o Weak Correlation: The data points are more scattered, indicating a looser relationship.
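A minimal sketch of computing a correlation coefficient for the hours-studied example (the numbers are invented):

import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
exam_score = np.array([52, 55, 61, 58, 70, 72, 78, 85], dtype=float)

r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(round(r, 2))  # close to +1 -> strong positive correlation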
Limitations of Correlation
1. Correlation Does Not Imply Causation:
o Just because two variables are correlated does not mean that one causes the other.
o For example, ice cream sales and drowning incidents might be positively correlated, but eating ice cream does not
cause drowning—both are related to a third variable, such as hot weather.

2. Linear Relationship Assumption:


o Correlation measures linear relationships.
o If the relationship between the variables is non-linear, the correlation coefficient may not capture the strength of the
relationship accurately.

3. Outliers:
o Outliers can significantly affect the correlation coefficient, either inflating or deflating the perceived strength of the
relationship.
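As a hedged illustration of points 2 and 3 above, a quadratic relationship can show a near-zero linear correlation, and a single extreme point can inflate an otherwise weak one:

import numpy as np

# Non-linear relationship: y depends strongly on x, yet Pearson r is ~0
x = np.linspace(-5, 5, 101)
y = x ** 2
print(round(np.corrcoef(x, y)[0, 1], 3))  # approximately 0

# Outlier effect: one joint outlier inflates the correlation of unrelated data
rng = np.random.default_rng(4)
a = rng.normal(size=30)
b = rng.normal(size=30)          # unrelated to a
a2, b2 = np.append(a, 10.0), np.append(b, 10.0)
print(round(np.corrcoef(a, b)[0, 1], 2), round(np.corrcoef(a2, b2)[0, 1], 2))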
