BBA 304 SLM
BABASAHEB AMBEDKAR
OPEN UNIVERSITY
BBA
BACHELOR OF BUSINESS ADMINISTRATION
BBAR-304
Business Analytics
Business Analytics
ISBN 978-93-91071-43-1
Editorial Panel
Authors
Mr. Ankur Sharma
Senior Manager, TCS Limited,
Gandhinagar
Editor
Dr Ashish Joshi
Associate Professor,
Pandit Deendayal Petroleum University,
Gandhinagar
Language Editor
Dr. Jagdish Anerao
Associate Professor,
Smt AP Patel Arts & NP Patel Commerce College, Naroda, Ahmedabad
Edition : 2021
Acknowledgment
Every attempt has been made to trace the copyright holders of material reproduced in this book. Should an infringement have occurred, we apologize for the same and will be pleased to make the necessary correction/amendment in a future edition of this book.
The content is developed by taking reference of online and print publications
that are mentioned in Bibliography. The content developed represents the
breadth of research excellence in this multidisciplinary academic field. Some
of the information, illustrations and examples are taken "as is" and as available
in the references mentioned in Bibliography for academic purpose and better
understanding by learner.
ROLE OF SELF INSTRUCTIONAL MATERIAL IN DISTANCE
LEARNING
Business Analytics
UNIT 1
INTRODUCTION TO BUSINESS ANALYTICS
UNIT 2
DESCRIPTIVE ANALYTICS
UNIT 3
VISUALIZATION TECHNIQUES FOR BUSINESS ANALYTICS
BLOCK 1 : BUSINESS ANALYTICS FUNDAMENTALS
Block Introduction
• Advanced hardware is now available which can store huge volumes of data in such a way that the data is readily available for analysis without any time lag, and at quite a reasonable cost, e.g. GPU/TPU processors, distributed networks, etc.
: UNIT STRUCTURE :
1.0 Learning Objectives
1.1 Introduction
1.2 Importance of Data and its Sources
1.2.1 Why has Data Suddenly become so Important
1.2.2 Different Sources of Data Accumulation in the Personal
and Business World
1.3 Life Cycle of Business Analytics Process
1.3.1 Scope of Business Analytics – Where Does it Fit on Business
Canvas
1.4 Classification of Business Analytics
1.4.1 Descriptive Analytics
1.4.2 Diagnostic Analytics
1.4.3 Predictive Analytics
1.4.4 Prescriptive Analytics
1.5 Challenges in Business Analytics
1.6 Let Us Sum Up
1.7 Answers for Check Your Progress
1.8 Glossary
1.9 Assignment
1.10 Activities
1.11 Case Study
1.12 Further Readings
1.1 Introduction :
In this unit we will study the basic intuition behind business analytics and how it is changing the overall canvas of the business world. At the end of the unit, you will understand how different components of business analytics influence decision-making capabilities. We will also see the different types of business analytics and the important tools and techniques used in each type. We will also touch upon the various challenges organizations face in the current era and the different approaches to overcome these challenges.
4. Based on the prior steps, we take appropriate decisions keeping budget, time, and resources in mind, and also predict the output in the short and long run
5. To get sustainable solutions, we put suitable controls in the system
which results in optimizing outputs and helps the organization in
creating a competitive strategy
1.3.1 Scope of Business Analytics – Where Does it Fit on Business
Canvas :
Business Analytics works like a magical crystal ball that can solve problems ranging from tiny day-to-day issues, like reducing packaging errors, to enormous and complex business problems, such as designing a space shuttle or setting up a nuclear power plant.
Let's try to understand the applications of Business Analytics at a
different level with the help of fig 1.6
At the bottom, more and more of the workforce gets involved in Business Analytics initiatives; they identify smaller problems in their day-to-day operational activities with the help of quick applications of analytics, and analysts help them find solutions and incorporate these solutions into their way of working. In business, we call it Kaizen, a Japanese word that means "change for good". At the next level, specialized analysts generally get involved in solving critical business problems; typically, they tackle these as business projects. The next level is Decision Analytics, where departmental heads get engaged in executing more significant business initiatives; here the analytics team uses complex tools and massive data for impactful and detailed analysis. At the top level, leadership gets involved in using comprehensive analytics for strategy formation.
Fig 1.6 Scope of Business Analytics at Different Levels
Check Your Progress – 1 :
1. Which of the below tasks can be handled by business analytics ?
a. Predicting business results
b. Finding patterns in the data
c. Validating business assumptions
d. All of the above
2. The business analysis process starts with :
a. Analysis of data
b. Collecting data
c. Determine the need of the process
d. Predict the business outcomes
3. The product return rate was very high during the last month, and it was found that more than 60% of the returned items were supplied by just two vendors, who had provided wrong specifications about the products.
Essential tools used in Descriptive Analytics :
1. Correlation Analysis : It is a statistical measure that indicates the strength of the relationship between two variables. It is a critical technique in causal analysis that helps in identifying possible reasons in terms of relationships with other metrics, though correlation alone does not prove causation.
2. 5 Why Analysis : It is a very structured approach where we try
to dig into a problem and peel it layer by layer to reach the root
cause of the problem. Solutions to root cause provide us with
sustainable solutions.
3. Cause and Effect Analysis : Here, we identify all possible reasons for one problem, then pick up each reason as a problem in itself and try to find further causes for it. In this way, we create a diagram that looks like the skeleton of a fish; because of its looks, it is also known as the fishbone diagram.
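As an illustrative sketch, the strength of the relationship mentioned under Correlation Analysis can be measured with the Pearson correlation coefficient. The monthly ad-spend and sales figures below are hypothetical, chosen only to demonstrate the calculation:

```python
# Pearson correlation between two business metrics (hypothetical data).
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

ad_spend = [10, 12, 15, 18, 20, 25]     # hypothetical monthly ad spend
sales = [110, 118, 130, 142, 150, 170]  # hypothetical monthly sales
r = pearson(ad_spend, sales)
print(round(r, 3))  # close to +1: a strong positive relationship
```

A value of r near +1 or −1 indicates a strong linear relationship; a value near 0 indicates a weak one.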
1.4.3 Predictive Analytics :
"The future belongs to those who see possibilities before they
become obvious". – John Scully
Predictive analytics is the heart of business analytics; it aims to help the organization by predicting the probability of occurrence of a future event or the future values of essential business metrics, e.g. sales in the next month/quarter, employee attrition, product return rate, etc.
Once an organization has a stable setup for descriptive analytics, meaning data sources have been identified and are continuously supplying data about important metrics into leadership dashboards, predictive analytics combines this historical data with business protocols (policies and rules) to forecast future values of business events. Predictive analytics allows organizations to become forward-looking, providing an appetite to take calculated risks by anticipating customer behaviour and business outcomes.
Fig 1.11 Predictive Analytics Cycle
Below is a List of Predictive Analytics Examples :
• Financial analysts predict the share value/ gold prices/crude oil
prices in the next few days or weeks with the help of predictive
modelling.
• Airline companies predict competitive airfares for extraordinary and ordinary days; they also indicate how much airfare should be increased as per the increased customer traffic on their websites.
• Netflix predicts the next movie customers want to watch; more than 80% of customers select their next movie from their recommendation list. In this way, Netflix earns more rental income from regular customers by suggesting the next film or programme.
• IRCTC predicts the probability of seat confirmation, which provides assurance to customers about their seats; this helps attract more customers to their portal.
• Taxi services predict the demand during different time slots and
change their tariff accordingly.
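The simplest predictive model behind examples like these is a straight-line trend fitted to historical data. The sketch below (with hypothetical monthly sales figures) fits a least-squares line and extrapolates one month ahead:

```python
# Least-squares line fit over past months, extrapolated to month 7.
# The sales figures are hypothetical, for illustration only.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx  # slope and intercept

months = [1, 2, 3, 4, 5, 6]
sales = [100, 104, 109, 113, 118, 121]  # hypothetical, in Rs lakh
m, c = fit_line(months, sales)
forecast = m * 7 + c  # predicted sales for month 7
print(round(forecast, 1))
```

Real predictive models add many more inputs (seasonality, promotions, competitor actions), but the forward-looking idea is the same.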
1.6 Let Us Sum Up :
1. In the current era of Business 4.0, clean and comprehensive data is available in all types of organizations; therefore the scope of business analytics has increased manyfold.
2. Business analytics is a structured approach that brings value to the
business in a very systematic way. It is always great to start with
the basics and build analytical capabilities step by step. Analytics
can be classified into four levels which help the organizations to
become mature in terms of analytical proficiency.
3. Business Analytics projects start with a correctly framed business
problem; analysts convert this business problem into an analytical
problem (writing in terms of business metrics and quantitatively
impact on business), analysts figure out relevant data and tools to
solve the statistical problem. In the end, they again summarise their
statistical findings in terms of Business Solutions which can be
easily interpreted and converted into a sustainable and replicable
solution.
4. Descriptive analytics is the most simplistic form of analytics, it
digs deeper into past data and tells us "what has happened and
when did it happen".
5. Diagnostic analytics provides "Why did it happen in my business".
It is a bit advanced where analysts examine data in order to find
reasons for business problems or opportunities.
6. Predictive analytics aims to help the organization by predicting
probabilities of occurrence of a future event or future values of
any essential business metrics, e.g. sales in the next month/ quarter,
employee attrition, and product return rate, etc.
7. Prescriptive analytics solves the complex business problem as it
is the most advanced form of analytics, where we have to choose
the most optimal way to increase important business metrics.
1.8 Glossary :
Descriptive Analytics : This is the simplest form of analytics; it summarises the current business status through narrative and innovative visualization. It emphasizes "what is going on in the business."
Diagnostic Analytics : It provides the reasons for descriptive
analytics; generally, analysts provide visual reasoning in terms of interactive
dashboards. It emphasizes on "why did it happen."
Predictive Analytics : It predicts the business metrics in the short
and long run, here we use advanced machine learning algorithms to
analyze massive data from different sources to predict the future values
of essential business metrics. It emphasizes on "what will happen in the
future".
Prescriptive Analytics : It suggests all favourable business outputs
for any specified course of action, it also offers the pros and cons for
each course of action. It optimizes the business results suggested by
descriptive and predictive analytics. It emphasizes on "how can we make
it happen."
Z–Score : Z Score tells us how far (in terms of standard deviation)
is a particular value of x from its mean.
Coefficient of Variance : It is the ratio of the standard deviation to the mean. The mean or standard deviation alone is not an appropriate measure for benchmarking performance metrics across companies; it is important to consider both the centrality and the spread of the data to make the comparison comprehensive.
Regression Analysis : It establishes the mathematical relationship between input variables and output variables, which means we can calculate the future value of the output for any given input, e.g. a sales forecast for next month.
Logistic Regression : It is a classification predictive analytics
technique that can predict the output class for any given set of inputs.
E.g. given customer demographics, logistic regression can indicate whether the customer will default on a bank loan in the future or not.
Decision Tree : Most of the time, we use a decision tree as a classification technique; it tells us the probability of each output class for various permutations of our input variables, although it can be used for continuous output variables also.
Linear Programming : In linear programming, we optimize the
objective functions like revenue, market share, customer feedback ratings
by also keeping constraints in the model like budget, no. of people
deployed, etc. as linear functions.
1.9 Assignments :
1. Write down important tools used in descriptive analytics.
2. Write down important components of business analytics.
3. Write down important data sources in the personal and business
world.
4. Mention the scope of business analytics at different levels in an
organization.
1.10 Activities :
Suppose you have been hired as an analyst at SDFG Bank, your
manager has provided you home loan data for the last 5 years. You have
to build up an application with the help of the Data Science team to
predict potential loan defaulters.
Write down various analyses you can do on this data from a
Descriptive, diagnostic, predictive, and prescriptive business analytics
perspective.
Questions :
1. Write down the various types of descriptive, diagnostic, predictive, and prescriptive business analytics you can imagine for Nirav Soft Drinks.
2. Which important business metrics will improve through this
technology intervention ?
3. How will this change bring value for store owners, Nirav Soft Drinks, and end customers ?
Unit 2 : DESCRIPTIVE ANALYTICS
: UNIT STRUCTURE :
2.0 Learning Objectives
2.1 Introduction
2.2 Introduction to Descriptive Statistics
2.3 Different Type of Data Measurement Scales
2.3.1 Categorical Data
2.3.2 Continuous Data
2.4 Population and Sample Size
2.5 Components of Descriptive Statistics
2.5.1 The Measures of Central Tendency
2.5.2 Measures of Variation
2.6 Let Us Sum Up
2.7 Answers for Check Your Progress
2.8 Glossary
2.9 Assignment
2.10 Activities
2.11 Case Study
2.12 Further Readings
2.1 Introduction :
In this unit we will study descriptive statistics and the different types of measurement scales used in descriptive statistics, along with basic concepts of sampling theory and how it helps businesses save huge amounts of money and time in conducting a study/analysis. We will also understand various techniques to measure the centrality and variability of data and how these help us take better business decisions. At the end we will touch upon a few data visualization techniques which help us understand the data and the analysis better.
2.2 Introduction to Descriptive Statistics :
"Data are just summaries of thousands of stories – tell a few of
those stories to help make the data meaningful" – Chip & Dan Heath
Descriptive analytics is the foundation of any analytical project. It
focuses on "What has happened" by visualizing the current business
performance.
We develop dashboards to showcase important business metrics. By
putting together historical performances, data helps us to see hidden
inferences which lead to better business decisions. Innovative visualization
plays an important role in displaying a comprehensive picture of
organizational progress. We can also include competitor's information
which will help management take appropriate strategic actions.
[Figure : Types of data – Categorical Data and Continuous Data]
Check Your Progress – 1 :
1. Final grades (A, A+, B, etc.) in university exams is an example
of :
a. Nominal data b. Ordinal data
c. Ratio data d. None of above
2. Generally, we prefer to analyze the ___________ data as it is not
advisable to analyze the entire data even if we have access to it
as it leads to consuming more time, resources and efforts.
[Figure : Components of Descriptive Statistics – central tendency (e.g. Mode) and Dispersion (Range, Variance, Standard Deviation)]
1. Mean (Average) Value :
The arithmetic mean is the most frequently used measure of central tendency. For simplicity, we refer to the arithmetic mean as the "mean". It is the summation of all numbers divided by the count of numbers. The population mean is represented by the Greek letter µ while the sample mean is represented by x̄; N is the total number of data points in the population while n is the total number of data points in the sample.

µ = Σx / N = (x1 + x2 + x3 + x4 + ... + xN) / N
MS Excel formula for the mean is AVERAGE (array of numbers).
The array represents all numbers in sequence.
Example 2.1 : Calculate the mean for a student who has scored
90, 76, 62, 91 and 56.
Solution : Step 1 : Add all numbers, which gives 375
Step 2 : Divide by the total number of data points, which is 5 here

µ = (90 + 76 + 62 + 91 + 56) / 5 = 75
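The same calculation can be reproduced with Python's statistics module:

```python
# Mean of the scores from Example 2.1: 375 / 5 = 75.
from statistics import mean

scores = [90, 76, 62, 91, 56]
print(mean(scores))
```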
Example 2.2 : Data is available in the form of a frequency table
Age 16 18 20 25 28
Total Students 2 1 5 2 10
Solution : Mean = (16 × 2 + 18 × 1 + 20 × 5 + 25 × 2 + 28 × 10) / (2 + 1 + 5 + 2 + 10) = 480 / 20 = 24
Weighted Average : If a few data points amongst the data set are
to be given more importance than others, the weighted average method
can be applied. In the below example, few subjects are more important.
In that scenario, we calculate the weighted average; it's calculated the
same as we have calculated the mean (average) from the frequency table
in the above example.
Example 2.3 : Data is available in the form of frequency
Sr. No. Subject Name Credit (Weight) Score (out of 10)
1 Statistical Analysis 5 7
2 Data Mining 4 8
3 Logical Reasoning 4 9
4 English 3 8
Weighted Average = (5 × 7 + 4 × 8 + 4 × 9 + 3 × 8) / (5 + 4 + 4 + 3) = 127 / 16 = 7.94
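The weighted average of Example 2.3 can be computed directly, with the credits acting as weights:

```python
# Weighted average: sum of (weight x score) divided by the sum of weights.
credits = [5, 4, 4, 3]
scores = [7, 8, 9, 8]
weighted_avg = sum(w * s for w, s in zip(credits, scores)) / sum(credits)
print(round(weighted_avg, 2))  # 127 / 16 = 7.94
```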
2. Median Value :
The median value is the midpoint of the data set when the data is sorted in ascending/descending order. For an odd number of data points, the median is the (n + 1)/2–th observation, while for an even number of data points, it is the average of the middle two observations after sorting the data in ascending order.
The MS Excel formula for the median is MEDIAN (array of numbers).
Example 2.4 : Calculate median for following income data of 11
executives in company ABC Limited.
Annual Income
62000
64000
49000
324000
1264000
54330
64000
51000
55000
48000
53000
Solution :
Step 1 : Arrange all data in ascending or descending order
Step 2 : Check if the total number of observations is odd or even
Step 3 : As there is an odd number of data points, the median will be the (n + 1)/2–th observation.
In this case, n = 11, hence the (11 + 1)/2 = 6th observation will be the median.
Therefore the 6th observation, which is 55000, is the median.
Example 2.5 : Calculate median for following income data of 12
executives in company ABC Limited.
Annual Income
62000
64000
49000
324000
1264000
54330
64000
51000
55000
48000
49000
53000
Solution : Step 1 : Arrange all data in ascending or descending order
Step 2 : Check if the total number of observations is odd or even
Step 3 : As there is an even number of data points, the median will be the average of the middle two (6th and 7th) observations, i.e. (54330 + 55000) / 2 = 54665
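Both examples can be verified with Python's statistics.median, which sorts the data and applies the odd/even rule described above:

```python
# Medians for Examples 2.4 (11 values) and 2.5 (12 values).
from statistics import median

incomes_11 = [62000, 64000, 49000, 324000, 1264000, 54330,
              64000, 51000, 55000, 48000, 53000]
incomes_12 = incomes_11 + [49000]  # Example 2.5 adds one more executive

print(median(incomes_11))  # 6th of 11 sorted values
print(median(incomes_12))  # average of the 6th and 7th sorted values
```

Note how little the median moves despite the two very large incomes (324000 and 1264000); this robustness to outliers is why the median is often preferred over the mean for income data.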
The variance of a population is given by :

σ² = Σ i=1..N (Xi − µ)² / N

While the variance of a sample is given by :

s² = Σ i=1..n (Xi − x̄)² / (n − 1)
Solution :
No. of Toys (x)    Xi − x̄           (Xi − x̄)²
1                  1 − 3.2 = −2.2    4.84
1                  1 − 3.2 = −2.2    4.84
1                  1 − 3.2 = −2.2    4.84
2                  2 − 3.2 = −1.2    1.44
3                  3 − 3.2 = −0.2    0.04
4                  4 − 3.2 =  0.8    0.64
5                  5 − 3.2 =  1.8    3.24
5                  5 − 3.2 =  1.8    3.24
5                  5 − 3.2 =  1.8    3.24
5                  5 − 3.2 =  1.8    3.24
Σx = 32                              Σ(Xi − x̄)² = 29.6
Mean x̄ = 32 / 10 = 3.2

The formula for sample variance is s² = Σ(Xi − x̄)² / (n − 1)

Sample variance = 29.6 / 9 = 3.29

While if it had been a population, then the variance would be 29.6 / 10 = 2.96
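These two variance calculations can be reproduced in Python; pvariance divides by N (population) while variance divides by n − 1 (sample):

```python
# Variance of the toy data from Example 2.6, both conventions.
from statistics import pvariance, variance

toys = [1, 1, 1, 2, 3, 4, 5, 5, 5, 5]
print(round(pvariance(toys), 2))  # population: 29.6 / 10 = 2.96
print(round(variance(toys), 2))   # sample:     29.6 / 9  = 3.29
```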
3. Standard Deviation :
Standard deviation is the square root of variance. The formulas for the population standard deviation (σ) and the sample standard deviation (s) are as below :

The standard deviation for a population :

σ = √( Σ i=1..N (Xi − µ)² / N )

The standard deviation for a sample :

s = √( Σ i=1..n (Xi − x̄)² / (n − 1) )

The standard deviation of the number of toys given in Example 2.6 is as below :
σ = 1.72, s = 1.81
Variance is a squared quantity; hence it is quite large compared to the data points. Another challenge is that it squares the unit of measurement. Because of this, most of the time we use standard deviation instead of variance.
One challenge with the standard deviation and the mean is that we can't compare these measures across different datasets, as they depend on the absolute values in the dataset. For example, we can't compare the share price variability of one company with another, as one company may have a share price of less than 100 while the other may be above 20,000. To counter this, we use another measure known as the coefficient of variance.
4. Coefficient of Variance :
It is the ratio of the standard deviation to the mean. It is unitless; hence we can compare the coefficients of variance of two completely different datasets. Similar to other measures, we have different formulas for the coefficient of variance for a population and a sample.

CV = (σ / µ) × 100% ;   CV = (s / x̄) × 100%
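A small sketch with two hypothetical share-price series (deliberately chosen so both have the same relative spread) shows why the coefficient of variance permits such comparisons even when the absolute scales differ enormously:

```python
# Coefficient of variance across two very different price scales.
# Both series are hypothetical illustrations.
from statistics import mean, pstdev

def cv(data):
    return pstdev(data) / mean(data) * 100  # as a percentage

low_priced = [95, 100, 105]            # trades near 100
high_priced = [19000, 20000, 21000]    # trades near 20,000
print(round(cv(low_priced), 2), round(cv(high_priced), 2))
```

Despite standard deviations of roughly 4 versus roughly 800, both series have the same coefficient of variance, i.e. the same relative variability.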
2. The sum of deviations of all data points from their mean, Σ i=1..n (Xi − x̄), is always equal to :
a. Zero b. Mean
c. Variance d. Standard deviation
3. Which one of the following measures is appropriate for Nominal scale variables ?
a. The average age of students in a class
b. Average marks of students
c. Gender of students
d. Height of students
4. Calculate the variance of the below numbers; consider it as population
data :
6, 8, 2, 9, 7, 3, 1, 4
a. 5 b. 40 c. 6 d. 7.5
5. Calculate the standard deviation of the below numbers; consider
it as sample data :
6, 8, 2, 9, 7, 3, 1, 4
a. 2.4 b. 2.93 c. 3 d. 4.5
6. Which of the following is NOT a true characteristic of the mean ?
a. It includes all data points
b. It gets impacted with extreme values
c. It depends only on the central value of an ordered list
d. It is equivalent to the arithmetic mean
7. Mean, Median and Mode are best to represent :
a. Shape of dataset b. Location of dataset
c. Both d. None of the above
8. If most frequently occurring data is an outlier, then in that case :
a. Mode is a poor measure
b. Mode is a good measure
c. We can't say
d. The mode can be calculated in that case
9. Which of the following is best to compare the variability of two
completely different data–set ?
a. Standard deviation b. Variance
c. Coefficient of variance d. Range
10. We have extracted a sample of 200 people from a city, which one
of below is correct :
a. The standard deviation of the sample is always greater than the
standard deviation of the population
b. The standard deviation of the sample is always less than the
standard deviation of the population
c. We can't say whether the Standard deviation of the sample is
lesser, or the Standard deviation of the population is lesser
d. 200 people sample is too less to calculate standard deviation
2.6 Let Us Sum Up :
1. Business analytics projects start with descriptive analytics, which
includes data summarization, aggregation, descriptive statistics, and
visualization
2. Analysts run queries to fetch appropriate data and consolidate
required data from different sources
3. Descriptive analytics focuses on gathering information from raw data, which tells us what happened in the past
4. Visualization helps in storytelling through data and appropriate
graphs as per the data types
5. Data scientists calculate central tendency, variation, and shape of
data to understand the characteristics such as variance, central point,
and skewness
6. Descriptive analytics prepares the field for further in-depth analysis like diagnostic and predictive analytics
2.7 Answers to Check Your Progress :
Check Your Progress – 1 :
1. b 2. Sample
Check Your Progress – 2 :
1. c 2. Centrality, variation
Check Your Progress – 3 :
1. b 2. a 3. c 4. d 5. b
6. c 7. b 8. a 9. c 10. c
2.8 Glossary :
Descriptive Analytics is the simplest form of analytics; it summarises
current business status in the way of narrative and innovative visualization.
It emphasizes "what is going on in the business."
The sample is a logical subset of the population, which mimics
the population.
Categorical Data : This type of data is known as discrete data.
Categorical variables have a finite number of groups, e.g., payment
method, gender, income group, property type, etc.
Continuous Data : This type of data can be measured on a
continuous scale–like height, weight, money, time, etc. It can be divided
into halves any number of times.
Nominal Data : In this type of categorical variable, there is no
logical sequence among categories, which means one category is not
superior or inferior to others. Categorization is just a type of segregation
of data into different groups.
Ordinal Data : It is better than nominal data in terms of usage
potential for any analysis. Ranked or ordered data generally come into
this category.
Ratio Data : In terms of statistical relevance, this is the highest level of data, which is desirable for the application of statistical tools and techniques.
Mean (Average) Value : It is the summation of all numbers divided
by the total numbers.
Median Value : The median value is the midpoint of the data set when the data is sorted in ascending order. For an odd number of data points, the median is the (n + 1)/2–th observation, while for an even number of data points, it is the average of the middle two observations after sorting the data in ascending order.
Mode : It is the value that most often occurs in the data; it can
be applicable for both numeric and categorical data.
Variance : Variance is a measure of the distance of data points
from its mean.
Standard Deviation : Standard deviation is the square root of
variance.
Coefficient of Variance : It is the ratio of the standard deviation to the mean. It is unitless; hence we can compare the coefficients of variance of two completely different datasets.
2.9 Assignments :
1. Define various ways to calculate the dispersion of data, write their
strengths and limitations.
2. Write down one scenario where mean is better than median and
vice-versa.
3. Explain how the coefficient of variance can help in ranking important
mutual funds for your customers.
2.10 Activities :
Supertronic motors have published month–wise sales data in Gujarat.
Month Number of Units Sold
1949–01 112
1949–02 118
1949–03 132
1949–04 129
1949–05 121
1949–06 135
1949–07 148
1949–08 148
1949–09 136
1949–10 119
Calculate the mean, median, mode, range, variance, standard deviation, and coefficient of variance of the above data.
MEAN 129.8
Median 130.5
Mode 148.0
Variance 153.7
Standard Deviation 12.4
Range 36.0
Coefficient of Variance 0.1
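The table above can be reproduced in Python; note that the variance and standard deviation reported are the sample (n − 1) versions:

```python
# Reproducing the summary statistics for the Supertronic sales data.
from statistics import mean, median, mode, stdev, variance

units = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
print(mean(units))                           # 129.8
print(median(units))                         # 130.5
print(mode(units))                           # 148
print(round(variance(units), 1))             # 153.7 (sample)
print(round(stdev(units), 1))                # 12.4 (sample)
print(max(units) - min(units))               # 36
print(round(stdev(units) / mean(units), 1))  # coefficient of variance, ~0.1
```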
Unit 3 : VISUALIZATION TECHNIQUES FOR BUSINESS ANALYTICS
: UNIT STRUCTURE :
3.0 Learning Objectives
3.1 Introduction
3.2 Introduction to Data Visualization
3.3 Histogram
3.4 Bar Chart
3.5 Scatter Plot
3.6 Box Plot
3.7 Control Chart
3.8 Tree Map
3.9 Let Us Sum Up
3.10 Answers for Check Your Progress
3.11 Glossary
3.12 Assignment
3.13 Activities
3.14 Case Study
3.15 Further Readings
3.1 Introduction :
In this unit we will learn various types of visualization techniques and the scenarios in which we use each specific type. At the end we will also look into the benefits and limitations of each type of graph. We have also included a few case studies so that the interpretation part can be explained in detail.
3.3 Histogram :
A histogram is applicable for continuous data (ratio and interval data types); it consists of consecutive but non-overlapping bars where each bar shows the frequency of data in an interval. A histogram helps in assessing the probability distribution of the data.
In MS-Excel, the histogram is not part of the standard charts; we have to activate the "Data Analysis" pack.
Below is an example of sales data from different customers in the last financial year. In total there are 33 sales figures from these customers, and the sales amount varies from ₹ 14,786 to ₹ 29,29,278.
The famous statistician Sturges (1926) suggested the below equation in his research paper for choosing the number of bins :

Number of bins, k = 1 + 3.322 log10(n)

Here, n is the total number of observations in the data set. The width of each bin is then :

Width of the bins, W = (Xmax − Xmin) / k

where Xmax is the maximum value in the dataset and Xmin is the minimum value in the dataset.

In the above example, we had n = 33, hence the number of bins suggested by the above formula is 6 (rounded off). Therefore,

W = (2929278 − 14786) / 6 = 485749 (approx.)

Now all data points in the range 0–485749 will be in the first bin (interval); similarly, we will have six bins of width 485749, with the last bin also holding all remaining data points.
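Sturges' rule (in its standard form, k = 1 + 3.322 log10 n bins) can be checked numerically for this example:

```python
# Sturges' rule: number of bins from n, bin width from the data range.
import math

n = 33
x_min, x_max = 14786, 2929278
k = round(1 + 3.322 * math.log10(n))  # suggested number of bins
width = (x_max - x_min) / k           # width of each bin
print(k, round(width))                # 6 bins of width ~485749
```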
The histogram is one of the most important charts in the statistical visualization toolkit. Statisticians and business analysts use the histogram for the following reasons :
(a) It helps in identifying outliers in the data set (if a few bins lie quite far from the remaining bins, the values in them can be considered outliers)
(b) It helps in assessing the probability distribution of the data by
judging the shape of the distribution
(c) It suggests about central tendency of the datasets like mean, median,
and mode
(d) It helps in assessing the variability of the data set
(e) It also helps in assessing other important statistical measures like
skewness
Student Roll No   Month_10  Month_1  Month_2  Month_3  Month_4  Month_5  Month_6  Month_7  Month_8  Month_9
1                 86        99       97       84       88       84       87       94       100      83
2                 62        73       50       50       52       50       58       57       53       51
3                 64        62       61       60       73       73       74       79       75       64
... (rows 4–24 omitted) ...
25                91        84       83       84       80       97       91       89       99       89
3.12 Assignments :
1. Which graph should be used if you want to show the relationship between two continuous variables ? Explain with an example.
2. How does the interquartile range help in taking decisions ? Explain it with an example.
3. Write down the importance of tree diagram, mention two scenarios
where you can implement a tree diagram.
3.13 Activities :
Online training firms collected the training hours for their 16 trainers
and their feedback. Draw a scatter plot and see where there is the
relationship between training hours and feedback scores. Also, draw a
quadrant graph like provided in the chapter (fig 3.6)
Training Hours Feedback Training Hours Feedback
75 8 94 5
69 5 81 5
104 5 86 9
106 4 99 3
92 8 110 7
111 2 76 4
64 6 101 4
92 4 125 4
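Since plotting tools vary, here is a text-only sketch of the quadrant analysis: split the trainers at the median training hours and median feedback score, then count trainers in each quadrant (the quadrant labelling scheme is our own, not from the chapter):

```python
# Median-split quadrant counts for the trainer data above.
from statistics import median

hours = [75, 69, 104, 106, 92, 111, 64, 92,
         94, 81, 86, 99, 110, 76, 101, 125]
feedback = [8, 5, 5, 4, 8, 2, 6, 4,
            5, 5, 9, 3, 7, 4, 4, 4]

mh, mf = median(hours), median(feedback)
quadrants = {}
for h, f in zip(hours, feedback):
    key = ("high" if h > mh else "low",   # hours vs. median hours
           "high" if f > mf else "low")   # feedback vs. median feedback
    quadrants[key] = quadrants.get(key, 0) + 1
print(mh, mf, quadrants)
```

Trainers in the (high hours, low feedback) quadrant are the ones whose long training sessions are not translating into better feedback, which is exactly what the scatter/quadrant plot is meant to reveal.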
BLOCK ASSIGNMENT
Short Answer Questions :
1. Explain how business analytics is a systematic approach to be
established in an organization.
2. Write down few important challenges in establishing business analytics
in an organization
3. Write down the important difference between descriptive and diagnostic
analytics
4. Explain the importance of visualization in the business analytics
5. Why do we say that "central tendency alone does not show us the complete picture; we should also consider variation" ? Justify the statement with one example.
6. Write down important components of business analytics
7. Explain the life cycle of business analytics, mention one example
8. Why should we study only a sample even if we have access to the entire population data ?
9. Write one scenario for the usage of mean and median as a measure
of central tendency
10. Why do we call mode a weak measure of central tendency ?
11. Explain briefly the difference between bar chart and histogram
12. Write a short note on the degree of freedom
Long Answer Questions :
1. Write a note on how predictive analytics is different from prescriptive
analytics. Mention important tools/ techniques under each type of
analytics ?
2. Why has data become so important in the last 10 years ? Write down a few success stories from the business world about data-driven decision making.
3. Write short note on different data measurement scales, give two
examples for each data type ?
4. How control charts and boxplot help in the analysis of variance in
the data. Give suitable examples ?
Dr. Babasaheb Ambedkar Open University
BBAR-304
Business Analytics
UNIT 1
DISCRETE PROBABILITY DISTRIBUTIONS
UNIT 2
CONTINUOUS PROBABILITY DISTRIBUTIONS
UNIT 3
SAMPLING AND CONFIDENCE INTERVALS
UNIT 4
INTRODUCTION TO HYPOTHESIS TESTING
BLOCK 2 : STATISTICAL CONCEPTS AND HYPOTHESIS
TESTING
Block Introduction
Business analytics exists so that executives can make better decisions backed by data. Descriptive and diagnostic analytics summarize the business in terms of important metrics and lay down the foundation blocks for the next level of analytics. From here onwards, probability theory starts playing a role: it helps analysts evaluate alternatives that have fewer risks and enhances the decision-making process on which predictive and prescriptive analytics are built. Working with probability distributions also reduces uncertainty; for example, instead of saying next month's sales will be Rs 856 crores, analysts say sales will be in the interval of Rs 800 crores to Rs 912 crores with a stated confidence. Sampling and hypothesis testing let us draw inferences about the entire population instead of just the sample, which results in more robust decision making and helps uncover new opportunities.
Block Objectives
: UNIT STRUCTURE :
1.0 Learning Objectives
1.1 Introduction
1.2 Random Experiments and Probability Distributions
1.3 Discrete Probability Distributions
1.3.1 Binomial Distributions
1.3.2 Poisson Distribution
1.4 Let Us Sum Up
1.5 Answers for Check Your Progress
1.6 Glossary
1.7 Assignment
1.8 Activities
1.9 Case Study
1.10 Further Readings
1.1 Introduction :
Unit Introduction : In this unit, we will study the basic concepts of probability distributions. We will see various examples of how we apply the theory of these distributions in the business world to take important decisions. Further, we will see how each outcome of a random experiment is mapped to a probability through a function known as the probability mass function, while the cumulative distribution function gives the probability that a random variable takes values less than or equal to a given value x. As an application, we will compute the expected values and variances of these discrete probability distributions.
The Probability Mass Function (PMF) gives the probability of each possible value of a discrete random variable and is often shown as a graph of these probabilities. It is also known as a frequency function.

If we take the example of a fair die, the possible outcomes are x = [1, 2, 3, 4, 5, 6], and each outcome has equal probability :

P(X = x) = 1/6 ≈ .167

There is an equal probability for each possible value of the random variable x.
The Cumulative Distribution Function (CDF) is another way to represent the distribution of a random variable; unlike the PMF, it is not limited to individual discrete values. At any point, it represents the probability accumulated up to that point.
P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)
Y–axis of CDF goes up to 1 as probability cannot be more than 1.
Let's consider one more example to understand this concept further. Suppose I have a special die on which I cannot roll 3 or 4, so x = 1, 2, 5, 6. Therefore, there is blank space in the PMF graph (there is no "mass" at those points in the probability mass function). In the CDF, we can see that the probability of having 2 or less is the same as that of having 4 or less. Once we again have mass at 5 in the PMF graph, the CDF value changes at 5.
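The behaviour described above can be sketched in a few lines of Python (an illustrative sketch; the die faces and probabilities come from the example, everything else is assumed):

```python
from fractions import Fraction

# A special die that cannot roll 3 or 4, so only faces 1, 2, 5, 6 carry mass
faces = [1, 2, 5, 6]
pmf = {x: (Fraction(1, 4) if x in faces else Fraction(0)) for x in range(1, 7)}

def cdf(x):
    # P(X <= x): accumulate the mass of every outcome up to x
    return sum(p for value, p in pmf.items() if value <= x)

print(pmf[3])            # no mass at 3
print(cdf(2) == cdf(4))  # True: no mass is added between 2 and 4
print(cdf(5))            # the CDF jumps once mass appears at 5
```

Printing the CDF at successive points shows exactly the flat step between 2 and 4 described in the text.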
P(k out of n) = n! / (k! (n − k)!) × p^k (1 − p)^(n − k)

• Expected Value (Mean) of the Binomial Distribution : np
Here n is the number of trials while p is the probability of success.
• Variance of the Binomial Distribution : np(1 − p); its Standard Deviation is √(np(1 − p))
The probability that the first student tried paragliding is .08. The probability that the second student also tried paragliding is .08 × .08 = .0064. Similarly, the probability that all 10 students tried paragliding is (.08)^10.
P(X = 10) = (.08)^10 = 1.07 × 10^–11
(b) The Excel formula is BINOM.DIST(2, 10, .08, FALSE). Here the first argument is the required individual discrete outcome (2 in this example), the second argument is the total number of trials, the third argument is the probability of success, and the last argument is FALSE for the non-cumulative (PMF) value and TRUE for the cumulative (CDF) value.
Similarly, the answer to example 2.1(d) would be 1 – BINOM.DIST(1, 10, .08, TRUE) = 0.188.
Please Note : Cumulative functions always work in one direction (equal to or lower). Therefore, if we want to calculate the probability of a greater-than region, we subtract the result from 1, as we did above.
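The same calculations can be sketched in Python (an illustrative sketch of what BINOM.DIST computes, not part of the original text):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(k successes out of n) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    # P(X <= k): the cumulative counterpart, like BINOM.DIST(..., TRUE)
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# 10 students, each with a .08 chance of having tried paragliding
print(round(binom_pmf(2, 10, 0.08), 4))      # like BINOM.DIST(2, 10, .08, FALSE)
print(round(1 - binom_cdf(1, 10, 0.08), 3))  # like 1 - BINOM.DIST(1, 10, .08, TRUE)
```

The second print reproduces the 0.188 figure quoted above.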
1.3.2 Discrete Probability Distributions – Poisson Distribution :
The Poisson distribution is named after the French mathematician Siméon Denis Poisson, who published it in 1837. (In French, his surname is pronounced approximately "Pwah-son".) One of the critical assumptions of the binomial distribution is that the number of trials must be fixed; when that is not true, we use the Poisson distribution.
When we cannot calculate the probabilities of both success and failure, we use the Poisson distribution. Take, for example, the number of potholes in a 10 km long road. Here we cannot calculate the probabilities of both success and failure; we can count only one of the two. In a 10 km road stretch, we can count the potholes (which can be a measure of failure), but we cannot say how many potholes were possible in this stretch (the total number of trials).

Another example is the number of calls received at a call centre in an hour. Theoretically, numerous calls of small duration are possible, but we cannot estimate the total number of possible calls; we can only calculate the average number of calls received in an hour (by analysing the historical call data).
So the Poisson distribution describes the number of events occurring in a fixed interval of time or in any continuum region of opportunity (which can be measured on a continuous scale, like length, money, distance, weight, etc.). It requires only one parameter, λ (lambda), which gives the average number of events in the given interval. The Poisson distribution is bounded by 0 and ∞. For example, there can be 0 calls in a day, or there can be an infinite number of calls in a day at a call centre.

The expected value (mean) E(X) and the variance V(X) of the Poisson distribution are the same, λ.
Example : On average, three customers need a wheelchair per day (λ = 3) at a local grocery store. In the graph below, the bars go on to infinity, although the probabilities become minuscule, so for convenience only up to 10 customers are shown.
Important Assumptions of the Poisson Distribution :
1. Events occur at a constant rate, which means there should be equal chances of a number of events happening in one time interval as in any other interval of time.
2. The occurrence of one event must be independent of any other event (i.e. events are independent).
Please Note : These assumptions may not hold good in reality or may not align with the business problem we are trying to solve; hence we need to be sure whether it is appropriate to apply the Poisson distribution.
PMF for Poisson Distribution :
It is the probability of each discrete outcome (the height of each bar in the above graph) :

P(X = x) = e^(−λ) λ^x / x!
Here e is Euler's number, whose value is approximately 2.718.
CDF for Poisson Distribution :

P(X ≤ x) = Γ(x + 1, λ) / x!

Here Γ is the (upper incomplete) gamma function. We do not need to work through this formula, as it may be difficult for undergraduate students of this course. Instead, we will use the Excel formula to calculate the CDF.
Excel Formula to Calculate PMF and CDF for the Poisson Distribution :
To calculate the PMF, use POISSON.DIST(x, λ, FALSE), while to calculate the CDF use POISSON.DIST(x, λ, TRUE).
Example 1.2 :
(a) Calculate the probability of the arrival of exactly 5 customers who need a wheelchair in a day, while λ = 3 is given.
(b) Calculate the probability of the arrival of at most 5 customers who need a wheelchair in a day.
Solution :
(a) P(X = 5) = e^–3 3^5 / 5! = 0.101
(b) POISSON.DIST(5, 3, TRUE) = 0.916
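Example 1.2 can be checked with a short Python sketch (illustrative; it mirrors the PMF formula above and the POISSON.DIST cumulative call):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) = e^(-lam) * lam^x / x!
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(x, lam):
    # P(X <= x), like POISSON.DIST(x, lam, TRUE)
    return sum(poisson_pmf(i, lam) for i in range(x + 1))

# Example 1.2: on average 3 wheelchair customers per day (lam = 3)
print(round(poisson_pmf(5, 3), 3))  # part (a), about 0.101
print(round(poisson_cdf(5, 3), 3))  # part (b), about 0.916
```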
Graphical Representation of Poisson Distributions for Different λ Values :
The desired bar is highlighted in white colour in the above graph.

P(X = 10) = e^–12 12^10 / 10! = 0.105

POISSON.DIST(10, 12, FALSE) = 0.105
So there is an approx. 10.5% probability of selling exactly 10 watches in a day.
(b) Probability of selling at least 10 watches in a day, i.e. 10 or more in a day.
The desired bars are highlighted in white colour in the above graph.
P(X ≥ 10) = 1 – P(X ≤ 9) = 1 – POISSON.DIST(9, 12, TRUE) = 0.758
So there is an approx. 76% probability of selling at least 10 watches in a day.
(c) Probability of selling more than 1 watch in the first hour of the day.
We have the value of λ in terms of sales per day, while we need it in terms of sales per hour. Therefore we need to change our λ (mean) accordingly :

λ = 12 / 24 = 0.5 sales per hour

The desired bars are highlighted in white colour in the above graph.
1 – POISSON.DIST(1, 0.5, TRUE) = 0.090. So there is approx. a 9% probability of selling more than 1 watch during the first hour.
Please Note : We solved the above example mathematically, and that is correct. But recall the assumption of the Poisson distribution that events occur at a constant rate through all intervals; in the case of online watch sales, it is very unlikely that somebody will purchase a watch between late night and early morning. So, in a way, the example breaches an assumption of the Poisson distribution. Therefore, before applying statistical tools or techniques, we have to be very careful about whether our use case fulfils all the required assumptions. Otherwise we will not get any error theoretically, but our inferences could be very misleading.
The Poisson distribution can be an excellent approximation to the binomial distribution, but only when the probability of success (p) is significantly small compared to the number of trials (n). One rule of thumb says the Poisson distribution is an excellent approximation to the binomial distribution when n > 20 and np < 10. We will cover the normal distribution in detail in the next unit.
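The rule of thumb can be sketched numerically (an assumed illustration with n = 100 and p = 0.03, so n > 20 and np = 3 < 10):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

# With small p and lam = np, Poisson probabilities track binomial ones
n, p = 100, 0.03
lam = n * p
for k in range(6):
    print(k, round(binom_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))
```

The two printed columns agree to within a few thousandths for every k, which is why the simpler one-parameter Poisson is often used in place of the binomial in this regime.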
Check Your Progress – 2 :
1. The Poisson distribution can be an excellent approximation to the binomial distribution, but only when the probability of success (p) is ___________ compared to the number of trials (n)
2. Cumulative distribution function (CDF) can be used to define a
discrete probability function only
a. True b. False
Check Your Progress – 3 :
1. Which of the following is the formula for variance for a discrete
probability distribution ?
a. Np b. x P(x)
6. Which of the following statement(s) are correct about the Poisson
distribution ?
a. Events occur at a constant rate, which means there should be
equal chances of the happening number of events in a one–time
interval to any other interval of time
b. The occurrence of one event must be independent of any other
event
c. Both (a) and (b) are correct
d. Only (a) is correct
7. Calculate the probability of the arrival of exactly 7 customers who need a wheelchair in a day, while λ = 3 is given.
a. 0.988 b. 0.082 c. 0.022 d. 0.052
8. Which of the following is the formula for the standard deviation of the Poisson distribution ?
a. √λ b. np c. np(1 – p) d. λ
9. One car showroom sells 60 cars per month (historical average), which follows a Poisson distribution. An analyst wants to calculate the probability of selling 3 cars on the first day of the month. What can λ be in this case ?
a. 30
b. 2
c. Not sufficient information to calculate the expected value
d. 3
10. In an exam, there are 25 multiple choice questions; each question has four probable answers, but only one answer is correct. If a student guesses every question, calculate the probability of getting exactly 10 questions correct.
a. .042 b. .625
c. 2.17 d. None of the above
1.6 Glossary :
Random Probability Distribution : A random probability
distribution is a statistical function that explains all the possible values
a random variable can take within a given minimum and maximum range.
Discrete Probability Distribution : If a random variable can obtain
only discrete (countable) outcomes, for example, 0, 1, 2, 3 etc. then we
say it is a discrete random variable that follows discrete probability
distribution.
Probability Mass Function (PMF) : It gives the probability of each possible value of a discrete random variable and is often shown as a graph of these probabilities. This is also known as a frequency function.
Cumulative Distribution Function (CDF) : It is another way to
represent the distribution of a random variable, but unlike PMF, it is
not limited to discrete variables only. At any point, it represents probability
up to that point.
Binomial Distribution : In the binomial distribution, there are only
two possible outcomes. Here prefix "bi" indicates two. For example the
probability of "yes" or "no", "pass" or "fail", "heads" or "tails" etc.
Poisson Distribution : When we cannot calculate the probability
of success and failure both, we use Poisson distribution–for example,
the number of potholes in a 10 km long road.
1.7 Assignments :
1. What do you mean by a probability distribution, what are different
types of probability distributions ?
2. What is the generic formula of variance and standard deviation for
a discrete probability distribution ?
3. What is the relationship between the probability mass function and the cumulative distribution function? Explain it with an example.
4. Explain the MS Excel formula for Binomial distribution and its
important arguments.
1.8 Activities :
1. Suppose emails arriving at a customer care inbox follow a Poisson distribution and the average number of emails per hour is 20.
(a) What is the probability that exactly 10 emails will arrive in an hour ?
(b) What is the probability that more than 15 emails arrive in an hour ?
(c) What is the probability that fewer than 5 emails arrive in an hour ?
Ans. (a) 0.006
(b) 0.84
(c) 0.000017
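The three answers can be verified with a short Python sketch (illustrative, using the Poisson PMF and CDF from this unit):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(k, lam):
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 20  # average emails per hour, from the activity above
print(round(poisson_pmf(10, lam), 3))      # (a) exactly 10
print(round(1 - poisson_cdf(15, lam), 2))  # (b) more than 15
print(round(poisson_cdf(4, lam), 6))       # (c) fewer than 5, i.e. P(X <= 4)
```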
1.10 Further Readings :
• "Mathematical Methods of Statistics," Princeton University Press, Cramér H. (1946)
• "SuperFreakonomics," Penguin Press, Levitt S. D. and Dubner S. J. (2009)
• "An Explanation of the Persistent Doctor Mortality Association," Journal of Epidemiology and Community Health, Young F. W. (2001)
• "Data Strategy : How to Profit from a World of Big Data, Analytics and the Internet of Things," O'Reilly Media, Bernard Marr
• "Predictive Analytics : The Power to Predict Who Will Click, Buy, Lie, or Die," Wiley, Eric Siegel
Unit 2 : CONTINUOUS PROBABILITY DISTRIBUTIONS
: UNIT STRUCTURE :
2.0 Learning Objectives
2.1 Introduction
2.2 Probability Density Function
2.3 The Normal Distribution
2.3.1 Standard Normal Distribution
2.3.2 Properties of Normal Distribution
2.4 Student's t–Distribution
2.4.1 PDF and CDF for t–Distribution
2.4.2 Properties of t–Distribution
2.5 Let Us Sum Up
2.6 Answers for Check Your Progress
2.7 Glossary
2.8 Assignment
2.9 Activities
2.10 Case Study
2.11 Further Readings
2.1 Introduction :
In this unit, we will study the theory and application of continuous
probability distributions. We will see how these distributions help us to
take important business decisions about continuous business metrics like
time, money, weight etc. Through various examples from the business
world, we will understand the concepts of the probability density function
and cumulative density functions. In the end, we will understand the
assumptions and decision criteria to select the most appropriate distribution
as per the experimental data.
2.2 Probability Density Function :
"I believe that we don't know anything for certain, but everything
probably." – Christiaan Huygens
A continuous probability distribution can take any value in a given range. Unlike a discrete variable, the probability of a particular point in a continuous distribution is always 0, because it is one point out of infinitely many possible outcomes. Hence a continuous probability distribution can't be represented in tabular form; there is always a formula or equation to express it. This equation describing a continuous probability distribution is known as the probability density function. It is the counterpart of the probability mass function for a discrete probability distribution. Examples of continuous variables are height, length, money, weight, distance, time, temperature, blood pressure, etc., as we can split these quantities into halves infinitely many times.
Here we can see that the mean of this height data is approx. 165 cm. The bell shape in front of the histogram is a pictorial representation of the probability density function. However, in the probability density function graph the Y-axis is not frequency as in the histogram; instead, the Y-axis represents probability density. A normal distribution can be defined by two parameters, µ (mean) and σ (standard deviation) of the data.

Just as we draw the cumulative distribution function for a discrete probability distribution, we can draw the CDF for a continuous probability distribution as well.
The PDF, represented by f(x), and the CDF, represented by F(x), for the normal distribution are as follows :

f(x) = (1 / (σ√(2π))) e^(−(1/2)((x − µ)/σ)²),  −∞ < x < +∞

F(x) = ∫ from −∞ to x of (1 / (σ√(2π))) e^(−(1/2)((t − µ)/σ)²) dt,  −∞ < x < +∞

To standardize a normal random variable, we use the Z score :

Z = (X − µ) / σ

This gives the standard normal distribution, with mean = 0 and standard deviation = 1. The value of Z tells us how many standard deviations the calculated point is away from the mean.

The standard normal equation above can be rewritten in terms of the random variable X :

X = µ + Zσ
Important Properties of the Normal Distribution :
1. The total area under a normal distribution is always 1, as it is a continuous probability distribution.
2. It is always symmetric and bell-shaped around its mean; it has a mirror image about its mean.
3. Theoretically, a normal distribution never touches the x-axis; it is defined from −∞ to +∞. This property of the probability distribution is known as being asymptotic.
4. For any normal distribution, the area between specific values (expressed in terms of µ and σ) always remains constant.
5. A linear transformation of any normal distribution is also a normal distribution. If X is a normal random variable, then its linear transformation AX + B (where A and B are constants) is also normally distributed.
6. For two independent normal distributions X1 and X2 with means µ1 and µ2 and variances σ1² and σ2² respectively, the new transformed distribution X1 + X2 will also follow the normal distribution, with mean µ1 + µ2 and variance σ1² + σ2².
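Property 6 can be sketched with a quick simulation (illustrative; the means and standard deviations below are assumed, not from the text):

```python
import random
import statistics

# X1 ~ N(10, 2^2) and X2 ~ N(5, 1^2), independent, so X1 + X2 should be
# approximately normal with mean 10 + 5 = 15 and variance 4 + 1 = 5.
random.seed(42)
n = 100_000
x1 = [random.gauss(10, 2) for _ in range(n)]
x2 = [random.gauss(5, 1) for _ in range(n)]
s = [a + b for a, b in zip(x1, x2)]

print(round(statistics.mean(s), 2))      # close to 15
print(round(statistics.variance(s), 2))  # close to 5
```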
(a) How many files can be processed under 125 mins ?
(b) How many files can be processed under 115 mins ?
(c) How many file processes took more than 140 mins ?
(d) How many files took between 120 mins and 125 mins ?
(e) What should the target be so that 95% of files are completed within it ?
Solution : Here mean = 125 and standard deviation = 9

(a) Z = (125 − 125) / 9 = 0
The Z-table value for 0 is .5; hence 50% of files are processed within 125 mins.

(c) Z = (140 − 125) / 9 = 1.67
P(X > 140) = 1 – .9525 = 0.048
Hence 4.8% of file processes took more than 140 minutes.

(e) 1.65 = (X − 125) / 9
X = 1.65 × 9 + 125 = 139.85 minutes
Therefore, if the bank sets a target of 139.85 mins, almost 95% of the files will be processed within the set target.
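The same example can be sketched with Python's standard library (illustrative; `statistics.NormalDist` plays the role of the Z table):

```python
from statistics import NormalDist

# File-processing times assumed N(mean = 125, sd = 9), as in the example
files = NormalDist(mu=125, sigma=9)

print(round(files.cdf(125), 2))       # (a) P(X < 125) = 0.5
print(round(1 - files.cdf(140), 3))   # (c) P(X > 140), about 4.8%
print(round(files.inv_cdf(0.95), 1))  # (e) the 95% target, about 139.8 mins
```

Note that `inv_cdf(0.95)` uses the exact critical value 1.6449 rather than the rounded 1.65 from the Z table, so it returns 139.8 instead of 139.85.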
Check Your Progress – 1 :
1. For the standard normal distribution, the mean is always ___________ and the standard deviation is ___________.
2. Two independent normal distributions X1 and X2 have means µ1 and µ2 and variances σ1² and σ2² respectively. The new transformed distribution X1 + X2 will also follow the normal distribution with mean ___________ and variance ___________.
t = (X̄ − µ) / (s / √n)
2.4.1 PDF and CDF for t–Distribution :
The PDF and CDF formulas for the t-distribution are very complex; hence we will use only the Microsoft Excel function to calculate them.
The MS Excel formula is T.DIST(x, degrees of freedom, cumulative), where TRUE gives the CDF and FALSE the PDF. Here x is the t value we have calculated with the above formula.
2.4.2 Properties of t–Distribution :
1. The mean of a t-distribution with 2 or more degrees of freedom is always 0.
2. The variance of the t-distribution is n / (n − 2) for more than 2 degrees of freedom.
3. As we increase the degrees of freedom, the PDF of the t-distribution approaches the PDF of the standard normal distribution.
4. For a small sample, the t-distribution has a bell-curve shape, but it is fatter (heavier in the tails) than the normal distribution. If the sample grows beyond 30 observations, the t-distribution starts mimicking the standard normal distribution.
5. The t-distribution plays an important role in hypothesis testing of the mean of a population and in comparing the means of two populations.
We will study important applications of the t-distribution in the next unit, "Hypothesis Testing", and in the next two blocks, "Regression Analysis" and "Time Series Analysis".
Check Your Progress – 2 :
1. We apply t–distribution when population ___________ is unknown
and sample size is ___________.
2. Shape of t–distribution depends on ___________.
Check Your Progress – 3 :
1. The weights of bags in a box are normally distributed with mean µ and a standard deviation of 6.5. Given that 20% of bags weigh less than 250 gms, the value of µ is :
a. 255.46 b. 250.49
c. 244.54 d. None of the above
2. A normal distribution can be defined by :
a. Mean
b. Standard deviation
c. Both
d. Either mean or standard deviation
77
3. The best scenarios to use the t–distribution :
a. Population standard deviation is not known
b. The population is normally distributed
c. The sample size is small
d. All of the above
4. t–distribution is defined by :
a. Mean b. Standard deviation
c. Degree of freedom d. None of the above
5. The total area under normal distribution :
a. It depends on observations b. 1
c. .5 d. None of the above
6. % of data points between the mean ± 2 standard deviations :
a. Approx. 68% b. Approx. 95%
c. Approx. 99.7% d. None of the above
7. Two independent normal distributions X1 and X2 have means µ1 and µ2 and variances σ1² and σ2² respectively. What can we say about the new distribution X1 + X2 ?
a. X1 + X2 will also be normally distributed
b. The mean of X1 + X2 will be µ1 + µ2
c. The variance of X1 + X2 will be σ1² + σ2²
d. All of the above
8. Mean of t–distribution with more than two degrees of freedom :
a. Always 0
b. Depends on the population mean
c. Depends on the sample mean
d. Always 1
9. The variance of a t-distribution with more than two degrees of freedom :
a. n/2 b. n/(n + 2) c. n/(n − 2) d. Always 1
2.5 Let Us Sum Up :
1. Unlike a discrete variable, the probability of a particular point in a continuous distribution is always 0, because it is one point out of infinitely many possible outcomes.
2. A continuous probability distribution is defined by an equation called the probability density function.
3. A normal distribution can be defined by two parameters, µ (mean) and σ (standard deviation) of the data.
4. The standard normal distribution has a mean of 0 and a standard deviation of 1.
5. The standard normal distribution is always symmetric and bell-shaped around its mean; it has a mirror image about its mean. Theoretically, a normal distribution never touches the x-axis; it is defined from −∞ to +∞. This property of the probability distribution is known as being asymptotic.
6. The degrees of freedom can be calculated as the number of observations in the sample minus the number of estimates made using the data (sample).
7. The mean of a t-distribution with 2 or more degrees of freedom is always 0, while the variance of the t-distribution is n / (n − 2) for more than 2 degrees of freedom.
8. As we increase the degrees of freedom, the PDF of the t-distribution approaches the PDF of the standard normal distribution.
2.7 Glossary :
Probability Density Function (PDF) : The PDF describes the relative likelihood of a continuous variable falling within a given range of values. It tells us the height of the distribution at a given point.
Cumulative Density Function (CDF) : The CDF is obtained by integrating the probability density function (the PDF is the derivative of the CDF). It provides us with the area from −∞ up to a particular point.
Standard Normal Distribution : It is a special normal distribution
whose mean is always 0 and the standard deviation is always 1
Degree of Freedom : The degree of freedom can be calculated as the number of observations in the sample minus the number of estimates made using the data (sample).
t–Distribution : The t–distribution has a bell-curve shape, but it is fatter than the normal distribution. It is defined by its degree of freedom; for a large sample size, its shape starts mimicking the normal distribution.
2.8 Assignments :
1. What is the probability of a particular point in a continuous probability
distribution ?
2. How many data points occur between mean ± 1 standard deviation,
mean ± 2 standard deviation and mean ± 3 standard deviations
for a normally distributed sample data.
3. TRYC bank takes an admission test (TCAT) to hire freshers. Scores
on the TCAT are normally distributed with a mean of 353 and a
standard deviation of 80. What is the probability of an individual
scoring above 250 in the TCAT exam ?
2.9 Activities :
A teacher surveyed 200 students to know how many minutes students study Business Analytics per day. The study shows a sample mean of 68 minutes and a standard deviation of 12 minutes. Assume study time follows the normal distribution.
(a) How many students study Business Analytics less than 90 minutes per day ?
(b) How many students study less than 60 minutes ?
(c) How many students study between 50 minutes and 80 minutes ?
Ans. (a) 96.64% – approx. 193 students
(b) 25.46% – approx. 51 students
(c) 77.45% – approx. 155 students
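The answers can be checked with a short sketch (illustrative, using Python's `statistics.NormalDist` in place of a Z table):

```python
from statistics import NormalDist

# Study time assumed N(68, 12) minutes, with 200 students surveyed
study = NormalDist(mu=68, sigma=12)
n = 200

p_a = study.cdf(90)                  # less than 90 minutes
p_b = study.cdf(60)                  # less than 60 minutes
p_c = study.cdf(80) - study.cdf(50)  # between 50 and 80 minutes

print(round(p_a * n))  # about 193 students
print(round(p_b * n))  # about 50 students (Z-table rounding gives 51)
print(round(p_c * n))  # about 155 students
```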
2.10 Case Study :
Approx. 5000 tickets are raised every month for the technical helpdesk at Akshay software company. The technical helpdesk team resolves a ticket in 25 hours on average, with a standard deviation of 3 hours. Suppose the technical helpdesk manager says that he will award ₹ 100 to associates whose tickets take more than 28 hours. Calculate the monthly technical helpdesk expense due to this penalty.
Question :
Now the manager calculates the expense and realizes that it is going over budget. Suggest to him the penalty threshold so that he does not pay the penalty to more than 2% of associates.
2.11 Further Readings :
• "Mathematical Methods of Statistics," Princeton University Press, Cramér H. (1946)
• "SuperFreakonomics," Penguin Press, Levitt S. D. and Dubner S. J. (2009)
• "An Explanation of the Persistent Doctor Mortality Association," Journal of Epidemiology and Community Health, Young F. W. (2001)
• "Data Strategy : How to Profit from a World of Big Data, Analytics and the Internet of Things," O'Reilly Media, Bernard Marr
• "Predictive Analytics : The Power to Predict Who Will Click, Buy, Lie, or Die," Wiley, Eric Siegel
Unit 3 : SAMPLING AND CONFIDENCE INTERVALS
: UNIT STRUCTURE :
3.0 Learning Objectives
3.1 Introduction
3.2 Introduction to Sampling Process
3.2.1 Important Steps in Designing a Sampling Strategy
3.3 Sampling Methods
3.3.1 Probabilistic Sampling Methods
3.3.2 Non-Probabilistic Sampling Methods
3.4 Central Limit Theorem
3.5 Confidence Interval
3.6 Let Us Sum Up
3.7 Answers for Check Your Progress
3.8 Glossary
3.9 Assignment
3.10 Activities
3.11 Case Study
3.12 Further Readings
3.1 Introduction :
In this unit, we will study samples and their importance in the business world. Sampling helps in making decisions faster and also reduces the cost of an experiment. There are different types of sampling methodologies for different business scenarios. We will also see how estimating a range of values, known as a confidence interval, is better than point estimation (average, variance, etc.). In the end, we will see the importance of the confidence level of a study and how it influences the study's effectiveness and efficiency.
3.2 Introduction to Sampling Process :
"Statistical analysis in cases involving small numbers can be
particularly helpful because on many occasions intuition can be highly
misleading." – Sandy Zabell
Sampling is one of the important tools in the business world, as it is directly aligned with the three most important business metrics : cost, effort and time. In the real business world, it is very difficult to analyze the entire data even if we have access to it, because accessing the entire data is very expensive and time-consuming. A correct sampling methodology, with all its assumptions fulfilled, can help us complete endeavours within time and budget.

Sampling is a technique by which we collect only a few data points from the population to reveal information and insights about population parameters like the average, standard deviation, variance, proportion, etc. Most of the time, sampling works as a double-edged sword : the right sampling can save huge money and time, but wrong sampling can lead to disastrous results. Sampling involves various steps; hence we have to be very cautious at each step to get the desired results.
3.2.1 Important Steps in Designing a Sampling Strategy :
3. Select a Suitable Sampling Technique : The right sampling technique
plays an important role in achieving the research objective. Various
sampling techniques broadly can be distinguished between
probabilistic and non–probabilistic techniques. We will study these
techniques in the next section.
4. Determine the Right Sample Size : Data collection can be expensive
and time–consuming; hence it is very important to calculate the right
sample size, which is sufficient for achieving the research objective
within budget and time frame. Sample size depends on the required
level of confidence, effect size, variation and margin of error.
5. Execute the Sampling Process : All the above steps must be executed as a proper process, and we must stay agile to keep it aligned with the business problem for which we are drafting this sampling strategy. Proper execution requires a correctly identified population, correct sampling frames, an appropriate sampling technique and a sufficient sample size.
Check Your Progress – 1 :
1. In which sampling technique are samples picked after a regular interval ?
a. Quota sampling b. Systematic sampling
c. Cluster sampling d. Snowball sampling
2. Bootstrap sampling is a type of ___________ sampling techniques
while quota sampling is a type of ___________ sampling technique.
1. A bigger sample leads to a smaller spread. We saw that when we had only 10 means in the sample, our bars varied from 1 to 6, while in the case of the sample of 150 means, most of the data points varied from 2.3 to 4.9. Irrespective of the type of population distribution, a relatively large sampling distribution (n > 30) will follow the normal distribution. This distribution will have the same mean as the population, while the standard error (the standard deviation of the sampling distribution) will be σ/√n.
2. The spread of the sampling distribution will be less than the spread of the population distribution (from which the sample has been drawn).
3. The variable (X̄ − µ)/σ will always have mean 0 and standard error 1. This distribution, with mean 0 and standard deviation 1, is known as the standard normal distribution. (Fig 3.8 : Standardization)
This calculation is true for any dataset. If we subtract the mean from the entire data set and divide by the standard deviation, then for the transformed data the mean will always be 0 and the standard deviation will always be 1. This is known as the Standard Score.
This helps us to compare different datasets which are measured in different units, for example one dataset in kgs while another is in currency. Comparing two datasets with the help of the standard score is known as standardization, as we standardize all data points. Each data point is now expressed as a number of standard deviations away from its mean. For example, −.05 means .05 standard deviations less than the mean.
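The standard-score transformation can be sketched in Python (illustrative; the weights below are made up):

```python
import statistics

def standard_scores(data):
    # z = (x - mean) / standard deviation for every data point
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [(x - mu) / sigma for x in data]

weights_kg = [52, 60, 68, 76, 84]  # made-up sample for illustration
z = standard_scores(weights_kg)
print([round(v, 2) for v in z])        # each value in "SDs from the mean"
print(round(statistics.mean(z), 10))   # transformed mean is 0
print(round(statistics.stdev(z), 10))  # transformed standard deviation is 1
```

Because every dataset is rescaled to mean 0 and standard deviation 1, scores from datasets in different units become directly comparable.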
Confidence Interval : X̄ − t(α/2, n−1) × S/√n ≤ µ ≤ X̄ + t(α/2, n−1) × S/√n

Here the value of t depends on the sample size and the confidence level; bigger values of t give a wider confidence interval. We use t(α/2, n−1) because of the two-tailed test; we will study one-tailed and two-tailed tests in detail in the next unit. Below are some important t values for different confidence levels and sample sizes.
• In the above formula, X̄ is the sample mean
• t × s/√n is called the margin of error
• The confidence interval provides us with the range we are quite sure of (as per the confidence level) for the mean of our population.
• At a 95% confidence level, the confidence interval does not tell us that 95% of mangoes will weigh within this range; rather, it tells us that there is a 95% chance that the average weight of mangoes will lie in this range.
• Therefore if we take lots of samples and create 95% confidence
intervals for them, then we can be assured that 95% of them will
contain the true population mean although 5% of these confidence
intervals will not contain the true population mean
• The confidence interval can be calculated for any population measure
(parameter) like median, standard deviation etc
• The Excel formula for calculating the confidence interval is CONFIDENCE.T(alpha, standard deviation, size). The value of t(α/2, n−1) can be calculated using the Excel formula T.INV(α/2, n−1) or another Excel formula T.INV.2T(α, n−1). There is also a legacy Excel formula TINV(α, n−1), while TINV(2α, n−1) provides the value for a one–tailed t–test.
Example 3.1 : A sample of 100 students from the Business Analytics
class was taken to estimate their daily study hours. The sample mean
is 4.8 hours and the standard deviation of the population is given as
1.4 hours. Solve the below questions :
(a) 90% confidence interval for the population mean
(b) 95% confidence interval for the population mean
Solution :
(a) Confidence interval at 90% confidence level
Confidence interval = X̄ ± t(α/2, n−1) × s/√n
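Since the population standard deviation is given and n = 100 is large, the intervals can also be computed with the normal critical value (t with 99 degrees of freedom is nearly identical). A sketch using only the Python standard library:

```python
# Example 3.1: sample mean 4.8 hours, population sigma 1.4, n = 100.
from math import sqrt
from statistics import NormalDist

def z_interval(mean, sigma, n, confidence):
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-tailed critical value
    margin = z * sigma / sqrt(n)
    return mean - margin, mean + margin

lo90, hi90 = z_interval(4.8, 1.4, 100, 0.90)  # (a) 90% interval
lo95, hi95 = z_interval(4.8, 1.4, 100, 0.95)  # (b) 95% interval
print(round(lo90, 2), round(hi90, 2))  # 4.57 5.03
print(round(lo95, 2), round(hi95, 2))  # 4.53 5.07
```

As expected, the 95% interval is wider than the 90% one.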
6. When we increase the confidence level, how does it impact the confidence interval ?
a. Confidence interval gets narrow
b. Confidence interval gets wider
c. Does not impact the confidence interval
d. None of the above is a correct statement
7. For a standard normal distribution
a. The sample mean is always higher than the population and the
standard deviation is lesser than the population
b. Mean is always 50 and the standard deviation is always 10
c. Mean and standard deviation always depend on the population from which the sample is extracted
d. Mean is always ZERO while standard deviation is always ONE
8. The relation between significance (alpha) and confidence level is :
a. Both are valid only for a two–tailed test
b. The summation of both is always 100%
c. Both are valid only for a one–tailed test
d. None of the above
9. One of the below factors does not impact the width of the confidence
interval :
a. Population standard deviation b. Sample size
c. Sample mean d. Confidence level
10. How will the width of the confidence interval change if we increase the sample size and also increase the confidence level at the same time ?
a. We cannot conclude anything about the width of the confidence
interval
b. The length of the confidence interval will be increased
c. The length of the confidence interval will be decreased
d. The length of the confidence interval will remain the same
3.8 Glossary :
Sampling : Sampling is a technique by which we collect only a
few data points from the population to reveal information and insights
about the population parameters like average, standard deviation, variance,
proportion etc.
Probabilistic Sampling Methods : When the researcher picks a sample from the population based on probability theory, e.g. random sampling, stratified sampling, bootstrapping, systematic sampling, cluster sampling etc.
Non–Probabilistic Sampling Methods : When the researcher picks a sample without any probability–based method, e.g. snowball sampling, convenience sampling, judgement sampling etc.
Central Limit Theorem : It states that, irrespective of the shape of the population distribution, if we take samples of a sufficiently large size n, then the means of these samples will be approximately normally distributed.
Standardization : When we compare two datasets with the help
of a standard score, it is known as standardization
Confidence Interval : It is a measure of certainty in terms of a
probability value that population parameter will fall within a range of
values around the mean
3.9 Assignments :
1. Write down important steps involved in designing a sampling
strategy.
2. In an organization, there is a total of 3,000 employees. A random
sample of 90 engineers reveals that the average sample age is 29
years. Historically, the population standard deviation of the age of the company's
engineers is approx. 7 years. Construct a 95% confidence interval
to estimate the average age of all the employees in this company.
3. What are the important differences between probabilistic and non-
probabilistic sampling techniques ? Write important techniques
under each of these categories.
3.10 Activities :
A sample of 70 customers from an online shopping company was
taken. The sample mean is 24 transactions as per their transaction history
and the standard deviation of the population is given as 35. Calculate
a 95% confidence interval for the population mean.
Ans. : (23.09, 24.91)
3.12 Further Readings :
• "Mathematical Methods of Statistics," Princeton University Press, Cramér H (1946)
• "Super Freakonomics," Penguin Press, Levitt S D and Dubner
S J (2009)
• "An Explanation of the Persistent Doctor Mortality Association," Journal of Epidemiology and Community Health, Young F W (2001)
• "Data Strategy : How To Profit From A World Of Big Data,
Analytics And The Internet Of Things", O'Reilly Media,
Bernard Marr
• "Predictive Analytics : The Power to Predict Who Will Click, Buy,
Lie, or Die", Wiley, Eric Siegel
Unit 4 : INTRODUCTION TO HYPOTHESIS TESTING
: UNIT STRUCTURE :
4.0 Learning Objectives
4.1 Introduction
4.2 Life Cycle of Hypothesis Testing
4.2.1 Hypothesis Testing Process Steps
4.3 Hypothesis Test Statistics
4.4 Two–Tailed and One–Tailed Hypothesis Test
4.5 Concept of p–Value
4.6 Type I, Type II Error and Power of the Hypothesis Test
4.7 Hypothesis Testing for a Population Mean with Known Population
Variance : Z–Test
4.8 Hypothesis Testing for a Population Mean with Unknown Population Variance : t–Test
4.9 Let Us Sum Up
4.10 Answers for Check Your Progress
4.11 Glossary
4.12 Assignment
4.13 Activities
4.14 Case Study
4.15 Further Readings
4.1 Introduction :
In this unit, we will study the concept of hypothesis testing and
how it helps us to make robust decisions about the future based on a
study conducted on sample data. We will see the various types of errors
associated with hypothesis testing techniques. We will also understand
the concept of p–value and its application in decision–making with the
help of various examples.
4.2.1 Hypothesis Testing Process Steps :
There are eight important steps in the hypothesis testing process :
In scenarios where the rejection region lies only on one side of the distribution, we call it a one–tailed test. If the rejection region lies on the left side, then we call it a left–tailed test; if the rejection region lies only on the right side, then we call it a right–tailed test. Below are the examples :
For a right–tailed test, the average salary of a business analyst in an organization is at least ₹ 65,000
H0 : salary <= 65000
H1 : salary > 65000
Let's try to understand it with the help of an example. The average salary of young analysts in the Analytics and Insight department of company Circa Ltd. is at least ₹ 50,000.
Here, H0 : µ <= 50000
He has collected sample salary information from the Human Resource department and found that the sample mean (X̄) is ₹ 55,000. Suppose
the standard deviation of the population is known, and the standard error of the sampling distribution is σ/√n = 2500.
The standardized difference between the estimated sample mean value and the hypothesized salary is :
Z = (55000 − 50000) / 2500 = 2
Now we will find the probability of obtaining this sample mean if the null hypothesis is true. Remember, a large standardized distance between the hypothesized mean and the sample mean will result in a low p–value. This is a right–tailed test; therefore, we look up Z = 2 in the Z table (.9772), which is the area up to Z = 2 from the leftmost point. But we are interested in the area beyond Z = 2, hence our p–value is 1 − .9772 = .0228
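The same right–tailed p–value can be reproduced with the Python standard library:

```python
# Right-tailed test: z = (55000 - 50000) / 2500 = 2,
# p = P(Z > 2) = 1 - Phi(2) ~ .0228.
from statistics import NormalDist

z = (55000 - 50000) / 2500
p_value = 1 - NormalDist().cdf(z)
print(z, round(p_value, 4))  # 2.0 0.0228
```

Since .0228 is small, a sample mean this far above the hypothesized salary would be unlikely if the null hypothesis were true.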
2. As per hypothesis testing theory, a significance test based on a small sample may not provide a significant output even if the correct value differs significantly from the null hypothesis statement value. This type of incorrect result is known as :
a. Alpha value (the significance level of the test)
b. 1 – β (the power of the hypothesis test)
c. a Type 1 error
d. a Type 2 error
Z–Statistic = (X̄ − µ) / (σ/√n)
From the above data points, we calculated the sample mean, X̄ = 27.05 days
Standard deviation of the sampling distribution = σ/√n = 12.5/√40 = 1.9764
Z–Statistic = (X̄ − µ) / (σ/√n) = (27.05 − 30) / (12.5/√40) = −1.4926
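This calculation can be checked directly in Python:

```python
# Z-statistic for the example above: sample mean 27.05 days,
# hypothesized mean 30, population sigma 12.5, n = 40.
from math import sqrt

x_bar, mu0, sigma, n = 27.05, 30, 12.5, 40
se = sigma / sqrt(n)           # standard error of the sampling distribution
z_stat = (x_bar - mu0) / se
print(round(se, 4), round(z_stat, 4))  # 1.9764 -1.4926
```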
t–Statistic = (X̄ − µ) / (S/√n)
632 457 335 252 667 636 286 444 636 292
601 627 330 364 562 353 583 254 528 470
762 439 599 708 530 402 729 593 601 408
125 60 101 110 60 252 281 227 484 402
Solution :
t–Statistic = (X̄ − µ) / (S/√n) = (429.55 − 500) / (195.0337/√40) = −2.2845
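The figures in the solution can be verified directly from the 40 observations:

```python
# t-statistic for the sample above against the hypothesized mean of 500.
from math import sqrt
from statistics import mean, stdev

data = [632, 457, 335, 252, 667, 636, 286, 444, 636, 292,
        601, 627, 330, 364, 562, 353, 583, 254, 528, 470,
        762, 439, 599, 708, 530, 402, 729, 593, 601, 408,
        125, 60, 101, 110, 60, 252, 281, 227, 484, 402]

x_bar = mean(data)                              # 429.55
s = stdev(data)                                 # sample SD, about 195.03
t_stat = (x_bar - 500) / (s / sqrt(len(data)))  # about -2.28
print(x_bar, round(s, 4), round(t_stat, 4))
```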
a. Only sentence 1 is correct b. Only sentence 2 is correct
c. Both sentences are correct d. None of the sentences is correct
Check Your Progress – 3 :
1. A claim made about the population parameter for analysis purpose
is called
a. Test–Statistic b. Statistic
c. Hypothesis d. Level of Significance
2. A researcher is conducting hypothesis testing at α = 0.10. The p–value is 0.06; then what is the most appropriate answer :
a. Reject the null hypothesis
b. Fail to reject the null hypothesis
c. Information is incomplete to make a decision
d. The α value is too high
3. The null hypothesis gets rejected beyond a point. What do we call this point ?
a. Significant Value b. Critical Value
c. Rejection Value d. Acceptance Value
4. If the critical region is split into two parts then which option is
correct
a. The Test is two–tailed b. The Test is one–tailed
c. The Test is zero–tailed d. The Test is the three–tailed Test
5. When will a type I error occur ?
a. We accept H0 if it is True b. We reject H0 if it is False
c. We accept H0 if it is False d. We reject H0 if it is True
6. The power of a test can be calculated as
a. α b. β c. 1 – α d. 1 – β
7. Confidence level can be calculated as
a. α × 100% b. β × 100%
c. (1 – α) × 100% d. (1 – β) × 100%
8. What can be another name of the alternate hypothesis
a. Composite hypothesis b. Simple Hypothesis
c. Null Hypothesis d. Research Hypothesis
9. Another name for type II error
a. Producer's risk b. P–value
c. Consumer's risk d. Confidence interval
10. µ = 30, this statement can be considered as
a. Alternate hypothesis
b. Null hypothesis
c. None of the above
d. It depends on the scenario; it can be either a null or alternate
hypothesis
Check Your Progress – 3 :
1. c 2. a 3. b 4. a 5. d
6. d 7. c 8. d 9. c 10. b
4.11 Glossary :
Hypothesis Testing : It is a process of validating a claim (hypothesis)
based on analysis of sample data
Null Hypothesis Statement : It is a claim about research that is
assumed to be true unless it is proved incorrect with the help of collected
sample data
Alternate Hypothesis Statement : It is opposite to the null hypothesis
statement. It is the interested claim of a researcher which he wants to
prove correct with the help of collected data
P–Value : The p–value is the probability value that indicates how likely it is that a result occurred by chance alone. The p–value is the conditional probability of observing the sample statistic value when the null hypothesis is true. The p–value measures the evidence against the null hypothesis statement : the smaller the p–value, the stronger the evidence against it.
Type–I Error : When we reject a null hypothesis when in fact
it is true
Type–II Error : When we fail to reject (accept) null hypothesis
when in fact it is false
Power of Test : The value of 1 – β is known as the power of the test, which means how sensitive our hypothesis test is in rejecting the null hypothesis when it is false
4.12 Assignments :
1. Define Hypothesis testing, risk, null and alternate hypothesis
testing statements and p–value with example.
2. What is the difference between one–tail and two-tail hypothesis
tests, explain with an example.
3. Why is a type–I error in hypothesis testing known as producer's risk ? Explain with an example.
4. Write down null and alternate hypothesis testing statements for the
below scenarios :
a. A company claims that they complete the full and final
settlement of an employee within 30 days of quitting the
company.
b. The passport office claimed that they issue passports within
30 days after submission of all required documents.
4.13 Activities :
Peacock training institute invented new technological ways to teach
programming to their students. They picked up 20 students and checked
their monthly study hours before and after training as per new technology.
Below is the table that represents study hours before and after. Conduct
a t–test to see whether new technology is motivating students to study
more hours. Assume α = 0.05
Before New Technology After New Technology
349 335
449 344
378 318
359 492
469 531
329 417
389 358
497 391
493 398
268 394
445 508
287 399
338 345
271 341
412 326
335 467
470 408
354 439
496 321
351 437
Ans. : The value of the test statistic is 0.5375 and the critical value of the t–test when α = 0.05 and degrees of freedom 19 is 1.7291. Since the t–statistic value is less than the critical value, we retain the null hypothesis; the difference in study hours
is not significant.
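The answer can be checked with a short paired t-test in Python (test statistic = mean difference divided by its standard error):

```python
# Paired t-test on the study-hours data: d = after - before.
from math import sqrt
from statistics import mean, stdev

before = [349, 449, 378, 359, 469, 329, 389, 497, 493, 268,
          445, 287, 338, 271, 412, 335, 470, 354, 496, 351]
after = [335, 344, 318, 492, 531, 417, 358, 391, 398, 394,
         508, 399, 345, 341, 326, 467, 408, 439, 321, 437]

d = [a - b for a, b in zip(after, before)]
t_stat = mean(d) / (stdev(d) / sqrt(len(d)))
print(round(t_stat, 4))  # about 0.5375, below the critical value 1.7291
```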
4.14 Case Study :
ABC limited company has appointed a new analytics and insight department head to strengthen its presence in India and the Asian market.
He wants to analyze whether the marketing costs for units A and B are significantly different. He asked the MIS team to provide him with the marketing costs of both units for the last 10 years :
Year Marketing Cost for Unit A Marketing Cost for Unit B
2009 41 31
2010 56 36
2011 28 37
2012 42 35
2013 31 38
2014 38 41
2015 39 44
2016 39 45
2017 51 47
2018 43 49
Answer the below Questions :
a. Draw scatter plot of above data
b. Conduct hypothesis testing at a 95% confidence level and conclude
your result
c. Will there be any change in your recommendations if you run the hypothesis testing at a 90% confidence level
BLOCK SUMMARY
In the business world, we have various business metrics which tell us about the health of our organization. We should track these metrics over time so that we can see the trends. Businesses want to predict values with the help of probability theory; one such tool is the probability distribution. Based on the nature of business metrics, we classify them as continuous or discrete. We saw how we apply the goodness–of–fit test to check whether data follows a specific distribution or not; this is done with the help of hypothesis testing. On the other hand, most of the time we do not want to predict a point estimate (e.g. the future share price of our organization); instead, we want to estimate the interval in which this value can fall. The study of confidence intervals helps us to calculate it more precisely and also to advise businesses on how they can be ready for extreme situations. Businesses can plan continuity measures in case things turn out extremely good or not so good.
In the business world, time always remains a critical resource. Most of the time we want to conclude our research fast so that the results can benefit businesses and their end customers. Therefore, we conduct a sample study and, at the end of the research, check whether it meets the business objective and has no adverse effect. This can be done with the help of hypothesis testing techniques, which help us conclude whether the sample result will also hold in the future. Hypothesis testing is the soul of inferential statistics; it plays an important role in concluding research outputs.
BLOCK ASSIGNMENT
Short Answer Questions :
1. Write short notes on PDF, PMF and CDF
2. Explain the difference between the binomial and Poisson probability
distribution
3. Explain the difference between the normal and student–t probability
distribution
4. Write important assumption of Poisson distribution
5. Why do we say that we need to be very cautious about applying
Poisson distribution to solve a real–life business problem ?
6. Write a short note on "How to check Z score in Standard Normal
Distribution Table (Z Table)"
7. Why confidence interval is better than point estimation, explain with
an example
8. Write a short note on the margin of error in the calculation of the
confidence interval
9. Is there any relationship between sample size and confidence level,
explain with an example ?
10. Why do we need to define null and alternate hypothesis statements ?
11. Write short notes on critical value and rejection region in the
hypothesis testing process
12. Write differences between type I and type II error, draw a decision
matrix to explain the concept
Long Answer Questions :
1. Write down the basic difference between discrete and continuous
probability distributions. Write examples with their formulae
2. Explain important properties of normal distribution, draw diagram
wherever possible to highlight the important characteristic of each
property
3. Write important properties of t–distribution
4. Explain important steps in designing a sampling strategy
5. Explain two important types of sampling techniques, write down
important sampling techniques under each category
6. Write down Hypothesis Testing Process Steps, explain these steps
briefly
Enrolment No. :
1. How many hours did you need for studying the units ?
Unit No. 1 2 3 4
No. of Hrs.
2. Please give your reactions to the following items based on your reading
of the block :
Dr. Babasaheb Ambedkar Open University
BBAR-304
Business Analytics
UNIT 1
COVARIANCE AND CORRELATION ANALYSIS
UNIT 2
SIMPLE LINEAR REGRESSION
UNIT 3
MULTIPLE LINEAR REGRESSION
BLOCK 3 : CORRELATION AND REGRESSION
Block Introduction
• If the marketing budget is X, then what can be sales in the near future
• Which age group will be the right customer segment for the upcoming
product
: UNIT STRUCTURE :
1.0 Learning Objectives
1.1 Introduction
1.2 Covariance : Statistical Relationship between Variables
1.2.1 Mathematical Interpretation of the Covariance
1.2.2 Relationship between Covariance and Variance
1.3 Covariance Matrix
1.4 Relationship between Covariance and Correlation
1.5 Spearman Rank Correlation
1.6 Let Us Sum Up
1.7 Answers for Check Your Progress
1.8 Glossary
1.9 Assignment
1.10 Activities
1.11 Case Study
1.12 Further Readings
1.1 Introduction :
In this unit, we study the statistical relationship between continuous
variables in terms of covariance and correlation. We will discuss the
different scenarios in which one of these techniques will be more
appropriate than another. We will see the limitation of these measures
and their visualization techniques. In the end, we will also touch upon
the correlation of a ranked data.
1.2 Covariance : Statistical Relationship between Variables :
Covariance is one of the techniques we use in the business world
to measure the linear relationship between two variables. Other measures
from the same family are correlation and linear regression, which we will
study in this block later. Covariance means – "Co–Vary," variables that
behave in pairs.
COV(x, y) = σxy = Σ((xi − x̄)(yi − ȳ)) / (n − 1)
Please Note : In the above formula, we have n – 1 in the denominator as we lose one degree of freedom by estimating the mean from the sample.
COV(x, y) = σxy = 35/4 = 8.75
So both our variables, "total study hours" and "study hours for business analytics," are positively related. Here the vital point to note is that a covariance score of 8.75 does not tell us the strength of the relationship; it depends on the range of the variables. If both variables were in lakhs, then the value of the covariance coefficient would also be in lakhs. So we cannot compare the covariance coefficients of two different studies. Covariance just tells us whether there is positive covariance, negative covariance, or no covariance at all.
Suppose there had been lots of negative values in column F of the above spreadsheet. In that case, the covariance value might also be negative, which indicates a negative relationship between the variables.
VAR(x) = σx² = Σ(xi − x̄)² / (n − 1)
We can notice that the formulae for covariance and variance are quite similar; we can also rewrite the variance formula as below :
VAR(x) = σx² = Σ((xi − x̄)(xi − x̄)) / (n − 1)
If we replace the second xi with yi and the second x̄ with ȳ, then the variance formula turns into the covariance formula. In other words, covariance is nothing but the variance formula applied to two variables x and y.
COV(x, y) = σxy = Σ((xi − x̄)(yi − ȳ)) / (n − 1)
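This identity is easy to confirm in Python; the two lists below are hypothetical, purely for illustration:

```python
# Covariance as "variance with two variables": cov(x, x) equals var(x).

def cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

total_hours = [10, 12, 14, 16, 18]  # hypothetical total study hours
ba_hours = [2, 3, 5, 6, 9]          # hypothetical business-analytics hours

print(cov(total_hours, ba_hours))     # 8.5  -> positive covariance
print(cov(total_hours, total_hours))  # 10.0 -> the variance of x
```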
Check Your Progress – 1 :
1. Which statement about covariance is NOT correct
COV(x, y) = σxy = 962.4/9 = 106.93
Here there is a positive covariance between the number of employees and the cardboard manufactured, so if the company wants to increase productivity, then they should increase the workforce.
The covariance matrix shows us a comprehensive picture of all variables included in the study. The lower triangle and upper triangle of the covariance matrix are always identical, as the covariance of variables x and y is the same as the covariance of variables y and x.
Check Your Progress – 2 :
1. Which statement about the covariance matrix is NOT correct
a. Covariance is advisable if we have more than two variables in
the study
b. Off diagonal elements provides covariance between each pair
of variables
c. In the covariance matrix, lower triangular and upper triangular
elements are the same
d. Values always vary between 1 and 5
2. Diagonal elements of the covariance matrix show the ___________ of each variable
r = σxy / (σx σy)
r = 8.75 / (5.70 × 1.79) = 0.86
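The standardization step, computed in Python from the covariance (8.75) and the standard deviations (5.70 and 1.79) used above:

```python
# Correlation = covariance standardized by both standard deviations.
cov_xy = 8.75
sd_x, sd_y = 5.70, 1.79
r = cov_xy / (sd_x * sd_y)
print(round(r, 2))  # 0.86
```

Unlike the raw covariance of 8.75, this 0.86 is comparable across studies because it is unit-free.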
Please Note : Covariance tells only the direction (no covariance,
positive covariance, and negative covariance) of the linear relationship
between two variables. In contrast, correlation tells us both direction and
strength. The sign of the correlation coefficient is the same as the sign
of covariance.
The above correlation formula can be rewritten as below :
r = Σ(xi − x̄)(yi − ȳ) / ( √Σ(xi − x̄)² × √Σ(yi − ȳ)² )
⇒ r = (nΣxiyi − ΣxiΣyi) / ( √(nΣxi² − (Σxi)²) × √(nΣyi² − (Σyi)²) )
Machine life in
months (y) 63 68 72 62 65 46 51 60 55
Check whether there is a correlation between the number of batches and machine life.
Solution : We can create a table to calculate all components of the coefficient of correlation formula :
Hence the correlation is −0.61, indicating a fairly strong negative relationship : as the number of batches increases, the life of the machine before breakdown decreases.
rxy = (NΣxiyi − ΣxiΣyi) / ( √(NΣxi² − (Σxi)²) × √(NΣyi² − (Σyi)²) )
= (9 × 24640 − 425 × 542) / ( √(9 × 24525 − 425²) × √(9 × 33188 − 542²) )
= −8590 / ( √40100 × √4928 ) = −0.61
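The same arithmetic in Python, from the summary sums used in the calculation above (n = 9, Σx = 425, Σy = 542, Σxy = 24640, Σx² = 24525, Σy² = 33188):

```python
# Pearson r from summary sums, using the computational formula.
from math import sqrt

n, sx, sy, sxy, sxx, syy = 9, 425, 542, 24640, 24525, 33188
num = n * sxy - sx * sy                                   # -8590
den = sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2)
r = num / den
print(num, round(r, 2))  # -8590 -0.61
```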
rs = 1 − 6Σdi² / (n(n² − 1))
Here, di is the difference between the ranks of xi and yi
Example – 1.3 : An advertisement company collected data for 'Number of hours TV watched' and 'IQ level.' Below is the data. Calculate the Spearman rank correlation to check whether there is a relationship between TV watching hours and IQ level.
IQ 106 100 86 101 99 103 97 113 112 110
TV Watching
hours / Week 7 27 2 50 28 29 20 12 6 17
First, we have to calculate ranks for both the x and y variables.
ρ = 1 − 6Σdi² / (n(n² − 1))
to give
ρ = 1 − (6 × 194) / (10(10² − 1))
Finally, ρ = −29/165 = −0.17576
The value is very close to zero, which indicates no correlation
between "TV–watching hours" and "IQ level."
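The whole computation, ranks included, fits in a short Python sketch (the data has no tied values, so simple ranks suffice):

```python
# Spearman rank correlation: rs = 1 - 6*sum(d^2) / (n*(n^2 - 1)).

def ranks(values):
    order = sorted(values)            # rank 1 = smallest (no ties here)
    return [order.index(v) + 1 for v in values]

iq = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
tv = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]

d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(iq), ranks(tv)))
n = len(iq)
rs = 1 - 6 * d2 / (n * (n * n - 1))
print(d2, round(rs, 5))  # 194 -0.17576
```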
Check Your Progress – 3 :
1. The covariance of two numeric variables varies:
a. –1 to +1
b. –50 to + 50
c. It depends on the median of data
d. It depends on the distribution of data
2. Which statement is NOT correct regarding covariance of data
a. Covariance depends on the sample size
b. It is applicable for numeric data only
c. It always varies between –1 and +1
d. All of the above
3. Which of the following is NOT correct about the correlation
a. Correlation value always depends on the scale of variables
b. The correlation value is independent of the scale of the variables
c. It always varies between –1 and +1
d. All of the above
4. If the correlation between two variables is .8, then how much variance is explained by these two variables
a. 16%
b. 40%
c. 64%
d. Can't calculate variance if the correlation is provided between
two variables
5. Rajesh found a covariance value of 21.98 between two variables. What can we say about this :
a. It means there is a strong negative correlation between the two variables
b. It means there is a strong positive correlation between the two
variables
c. It means there is no correlation between two variables
d. It means there is a weak positive correlation between two
variables
6. Mahesh created a scatter plot diagram for two variables for which
we want to assess the relationship; he could see a few outliers in
his graph. What would you like to suggest as the most appropriate
technique to analyze the data:
a. He should calculate the covariance
b. He should calculate spearman rank correlation
c. He should calculate the variance of both variables individually
and then compare
d. He should calculate the correlation coefficient for both the variables
7. Which one is the most appropriate option regarding the covariance
matrix
a. We should fill all the cells of the covariance matrix, including
the upper triangular and lower triangular matrix
b. We should create a covariance matrix only in case when we have
more than two variables in our study
c. The covariance matrix is vital for the non–linear relationship
d. We should always calculate the variance of each variable
individually before creating the covariance matrix
8. If the correlation coefficient between study hours and the final score
is .9, how much variance in the final score is not accounted for
by study hours
a. 19% b. 18% c. 45% d. 9%
9. Which statement about covariance is NOT correct:
a. Covariance can be defined as an unstandardized version of the
correlation coefficient
b. It is a measure of the linear relationship between two variables
c. It is a synonym of the correlation coefficient
d. It is dependent on units of measurement of the variables
10. Statement 1 : Covariance tells about the direction of the linear
relationship between two variables
Statement 2 : Correlation tells about the direction and strength of
the linear relationship between two variables
a. Only statement 1 is true b. Only statement 2 is true
c. Both statements are true d. None of the statements is true
4. Covariance scores depend on the units of the variables; therefore, we cannot compare the covariance scores of two different studies
5. Correlation is a standardized relationship score (covariance standardized by the standard deviations of both variables); therefore, its value always lies between –1 and +1. We can compare the correlation scores of two different studies
6. Correlation explains only the association between variables; it does not guarantee a causal relationship between them
7. Pearson correlation is appropriate if both variables are continuous (either interval or ratio scale). If the variables are on an ordinal scale, or if there are outliers, or the relationship is non–linear (but monotonic), then Spearman rank correlation would be a better choice
1.8 Glossary :
Covariance : It is a linear relationship between continuous variables;
it explains only direction, not the strength of the relationship. Covariance
scores always depend on unit of variables; hence we can not compare
covariance scores of two independent studies
Pearson Correlation Coefficient : It is a standardized linear
relationship between two variables; therefore, values always come in
between –1 and +1. Correlation does not guarantee causation
Covariance Matrix : In case there are more than two variables
in the study, then it is always advisable to see the linear relationship
between all possible pairs of variables. This combination of the relationship
among various variables is known as a covariance matrix.
Spearman Rank Correlation : If the variables are on an ordinal scale, or there are some outliers, or the relationship is non–linear (but monotonic), Spearman rank correlation is more appropriate than Pearson's correlation. Here we calculate the correlation between the ranks of the variables instead of their raw form.
1.9 Assignment :
1. Why do we say covariance means "co–vary" ? Explain with an example.
2. What is the significance of covariance matrix visualization ?
3. If two variables are measured on different scales, why is correlation a better measure of the relationship than covariance ?
1.10 Activities :
1. In the below table, there are dividends provided by two banks for
the last five years. Calculate covariance and correlation for the
dividends of these two banks. Describe the relationship between
the shares.
Year 1 Year 2 Year 3 Year 4 Year 5
Bank A 5% 7% 2% –5% 3%
Bank B –1% 0% 5% –1% 3%
2. Use the following tabular data to answer the below questions
Variable A Variable B Variable C Variable D
10 110 92 930
11 120 94 900
12 115 97 1020
13 128 98 990
11 137 100 1100
10 145 102 1050
9 150 104 1150
10 130 105 1120
11 120 105 1130
14 115 107 1200
a. Make covariance matrix of the above data
b. Calculate the correlation between variable A and variable B
Ans. : 1. Covariance = 1.075 Correlation = 0.0904
2. (b) –0.51
Year Sales Marketing Cost
2009 2064 31
2010 2389 36
2011 2418 37
2012 2509 35
2013 2608 38
2014 2706 41
2015 2802 44
2016 2905 45
2017 3056 47
2018 3189 49
Answer the below Questions :
a. Draw scatter plot of above data
b. Calculate the variance of Sales and Marketing Cost data
c. Calculate covariance and correlation of sales and marketing cost
d. Suggest whether Pearson correlation or Pearson rank correlation
would be better in this case, justify your answer
Unit 2 : SIMPLE LINEAR REGRESSION
: UNIT STRUCTURE :
2.0 Learning Objectives
2.1 Introduction
2.2 Essence of Simple Linear Regression
2.2.1 Introduction to Simple Linear Regression
2.2.2 Determining the Equation of the Linear Regression Line
2.3 Baseline Prediction Model
2.4 Simple Linear Regression Model Building
2.5 Ordinary Least Square Method to Estimate Parameters
2.5.1 Calculation of Regression Parameters
2.5.2 Interpretation of Regression Equation
2.6 Measures of Variation
2.6.1 Comparison of Two Models
2.6.2 Coefficient of Determination
2.6.3 Mean Square Error and Root Mean Square Error (Standard
Error)
2.7 Simple Linear Regression in MS Excel
2.7.1 Residual Analysis to Test The Regression Assumptions
2.8 Let Us Sum Up
2.9 Answers for Check Your Progress
2.10 Glossary
2.11 Assignment
2.12 Activities
2.13 Case Study
2.14 Further Readings
2.1 Introduction :
In this unit, we will study simple linear regression, which explains the relationship between two continuous variables, and see how it is more effective than correlation and covariance. There are various ways to calculate linear relationships; we will touch upon the most fundamental technique, known as least–squares optimization. We will also touch upon various validation techniques for linear regression. In the end, we will see various examples where linear regression helps us make better decisions and empowers us with predictive capabilities.
So to get rid of these negative differences, we can square all these residuals. It helps in two ways:
1. It makes all differences positive
2. It emphasizes larger deviations
If we sum all these squared residuals, then this term is known as
the sum of squared errors (SSE), which is 120 in this case.
So we have established a basic prediction model and calculated the error in that model. Now the goal of simple linear regression is to create a model that can minimize this sum of squared errors (SSE). The regression model will try to minimize this error by introducing an independent variable in the model and drawing a best–fit line, so that the difference between the dependent variable value and the best–fit line is minimum; the error explained by the regression model is known as Regression error.
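The baseline model and its SSE can be reproduced in a few lines of Python. The feedback counts below are hypothetical stand-ins, not the document's survey data; the point is that the baseline prediction is the mean, the raw residuals cancel out, and squaring keeps every deviation positive:

```python
# Hypothetical dependent-variable observations (no independent variable yet)
y = [5.0, 9.0, 13.0, 17.0, 21.0]

# Baseline model: predict the mean of y for every observation
baseline = sum(y) / len(y)            # 13.0 here

residuals = [yi - baseline for yi in y]
sse = sum(e ** 2 for e in residuals)  # sum of squared errors of the baseline

print(sum(residuals))  # ~0: positive and negative residuals cancel
print(sse)             # 160.0: squaring makes every deviation count
```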
Check Your Progress – 2 :
1. If we have only the dependent variable, then we can predict the next value by taking ___________ of all the observations.
2. Difference between the predicting line and each observation is known as ___________.
3. If we add all residuals, then its sum will be very close to _________.
Where
yᵢ = observed value of dependent variable
b₁ = 615 / 4200 = 0.1462 → slope of the line
yᵢ = 0.1462x – 0.8188
Please Note : The best–fit line always passes through the centroid
(intersection point of the mean of x and mean of y)
2.5.2 Interpretation of Regression Equation :
yi = 0.1462x – 0.8188
The above regression equation tells us that for every new addition
in feedback, there will be a .1462 addition to the "Very satisfied" category.
In other words, for any new feedback, there are 15% chances of having
it in the "Very Satisfied" category. If there is zero feedback, then there
are –.8188 "Very satisfied" feedback, which does not make any sense;
therefore, intercept may or may not make sense in terms of its business
value. So the important point is how the dependent variable changes in
relation to one unit change in the independent variable.
In this way, the regression model tells us the relationship between important business metrics and the factors that influence those metrics. An important point to remember is that the regression model can predict the output variable only within the range of input variables supplied. For example, in the above example, the input variable ranges from 34 to 108; hence, technically, we can predict the number of "Very Satisfied" feedbacks only if the total feedback count is in the range of 34 to 108. If the total number of feedbacks is significantly higher or lower than this range, then the regression model may give us misleading predictions.
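The slope and intercept above come from the ordinary least squares formulas b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄. Since the survey's raw feedback table is not reproduced here, this Python sketch uses made-up data:

```python
# Hypothetical paired observations (x = input, y = output)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates: slope from the deviation sums, intercept from the centroid
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx           # slope: 6 / 10 = 0.6 for this data
b0 = y_bar - b1 * x_bar  # intercept: 4 - 0.6 * 3 = 2.2

print(b1, b0)
```

Note that the fitted line passes through the centroid (x̄, ȳ), which is exactly the "Please Note" remark above.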
Total sum of squares (SST) = Regression sum of squares (SSR) + Error sum of squares (SSE)
Total Sum of Squares = SST = Σ(yᵢ − ȳ)²
Regression Sum of Squares = SSR = Σ(ŷᵢ − ȳ)²
Error Sum of Squares = SSE = Σ(yᵢ − ŷᵢ)²
Coefficient of determination = r² = SSR / SST
Coefficient of determination for the above feedback example:
Coefficient of determination = r² = 89.925 / 120 = 0.7493 or 74.93%
We can conclude that 74.93% of the total sum of squares can be explained by using the estimated regression equation to predict the "Very Satisfied" feedback. The remainder is error. For simple linear regression, r² is the square of the correlation coefficient r. In the above survey feedback example, the correlation coefficient is .866; if we square it, then it will be equal to the coefficient of determination r², which we calculated through the sum of squares method.
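These relationships can be verified numerically from the document's own figures (SST = 120, SSR = 89.925, r = .866); a short Python check:

```python
sst = 120.0      # total sum of squares (baseline model error)
ssr = 89.925     # variation explained by the regression
sse = sst - ssr  # unexplained variation, about 30.075

r_squared = ssr / sst  # coefficient of determination, about 0.7494

# For simple linear regression, r^2 equals the squared correlation coefficient
r = 0.866
print(r_squared, r ** 2)  # the two agree up to the rounding of r
```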
2.6.3 Mean Square Error and Root Mean Square Error (Standard
Error) :
MSE, s², is an estimate of σ², the variance of the error term. In other words, it tells us how spread out the data points are from the regression line. MSE
is SSE divided by its degree of freedom; in the above example, it will
be n–2 as we lost two degrees of freedom as we estimated slope and
intercept. In simple linear regression, it is always n–2, while in the case
of multiple linear regression, it will be different; we will study that in
the next unit.
MSE = s² = SSE / (n − 2) = 7.5187
s = √7.5187 = 2.742
So the average distance of the data points from the fitted line is
about 2.74 feedbacks. We can think of s as a measure of how well the
regression model makes the predictions.
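As a numerical check, the standard error follows from the sums of squares. The sample size n = 6 below is inferred from the document's MSE (SSE / (n − 2) = 30.075 / 4 ≈ 7.5187), not a stated figure:

```python
import math

sst = 120.0
ssr = 89.925
sse = sst - ssr  # 30.075
n = 6            # assumed: inferred from MSE = SSE / (n - 2) = 7.5187

mse = sse / (n - 2)  # mean square error, s^2, about 7.519
s = math.sqrt(mse)   # standard error: typical distance from the fitted line

print(mse, s)  # s is about 2.742, matching the text
```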
Check Your Progress – 3 :
1. A total sum of squares is the summation of sum of squares
regression and ___________.
2. Coefficient of determination is the ratio of _________ and _________.
3. If SSE is 100 and there are 27 observations in the sample then
standard error would be _________.
1 The correlation coefficient for both the variables
2 The coefficient of determination (R²) = SSR / SST = .749 or 74.9% (variation in output variable due to input variable)
3 The standard error = √(SSE / (n − 2)) = 2.742
10 The P–value for each variable; if it is less than .05 at 95% CL, then the input variable is significant in the model
2.7.1 Residual Analysis to Test The Regression Assumptions :
Residual analysis is mainly used to test the assumptions of the
regression model. Below two are important residual analyses provided
by MS Excel:
1. The Linearity of the Regression Model : The linearity of the regression model can be checked by plotting the residuals on the vertical axis against the corresponding xᵢ values of the independent variable on the horizontal axis. There should not be any apparent pattern in the plot for a well–fitted regression model.
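A numerical companion to this plot: for an OLS fit, the residuals are uncorrelated with x by construction, so a straight-line trend can never appear in the residual plot; what the plot reveals is curvature or other non-linear patterns. A Python sketch with hypothetical data:

```python
# Hypothetical data: fit OLS, then inspect the residuals against x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# OLS forces sum(residuals) = 0 and zero covariance between x and residuals,
# so only non-linear patterns (e.g. a U-shape) survive into the plot
cov_xe = sum((xi - x_bar) * ei for xi, ei in zip(x, residuals))
print(cov_xe)  # ~0 up to floating-point rounding
```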
6. How can we explain the standard error
a. It is defined as a variation around the regression line
b. Total error in regression model divided by its standard deviation
c. Total error in the regression model
d. Total variation in output and input variable together
7. How can we define the coefficient of determination in simple linear
regression
a. SSR/SST
b. Square of the coefficient of correlation
c. The proportion of variation in the output variable explained by
the input variable
d. All of the above
8. How can we define residual in a regression model
a. Difference between the variation of input and output variable
b. Difference between the predicted value and observed value
c. Difference between the input and output variable
d. All of the above
9. Which of the following statements are correct
a. Residual in a regression model must be distributed normally
b. Residuals must have constant variance
c. There should not be any significant relation between the input
variable and residuals
d. All of the above
10. Which of the below option is correct about the below two statements:
Statement 1 : Constant variance among residuals is known as
homoscedasticity
Statement 2 : Relationship between the input variable and residuals
is known as autocorrelation
a. Only 1 is correct b. Only 2 is correct
c. Both are correct d. Both are incorrect
2.10 Glossary :
Simple Linear Regression : It is an important statistical technique
for finding the linear (straight line) relationship between an input and
output variable. Regression gives us a mathematical expression by which
we can predict the value of the output variable by supplying a value
of the input variable.
Total Sum of Squares (SST) : Total sum of squares in a model
can be measured as the summation of total variation explained by the
regression model (SSR) and unexplained variation due to random error
(SSE)
Coefficient of Determination (R2) : It is a measure of fit of the
regression model. It is a ratio of SSR to SST. The value of the coefficient
of determination varies between 0 and 1.
Standard Error : Standard error can be understood as a standard
deviation around the regression line. A large standard error indicates a
large amount of variation or scatters around the regression line
Residual Analysis of Regression Model : Residual or error analysis
is a vital step to check whether the assumptions of regression models
have been satisfied
Homoscedasticity : The assumption of homoscedasticity means
constant error variance across the range of input variable
Autocorrelation : The input variable must be independent of error
terms. The Durbin–Watson statistic can measure the autocorrelation effect.
2.11 Assignment :
1. What is a residual in simple linear regression ?
2. What do we mean by the baseline prediction model in simple linear
regression ?
3. Write down important steps in simple linear regression model
calculation.
4. In simple linear regression equation, Yi = 23x – 7. Explain the
relationship between the x and y variable.
2.12 Activities :
Company Nirja Plc has provided its expenses (input variable) and revenue (output variable) for the last 11 years:
Year Expenses (In million $) Revenue (In million $)
2009 3506 4108
2010 3518 4190
2011 3609 4278
2012 3689 4295
2013 3712 4307
2014 3754 4389
2015 3798 4401
2016 3803 4456
2017 3874 4497
2018 3967 4501
2019 3994 4508
a. Make a scatter plot and check whether it is showing the linear
relationship, and are there a few outliers ?
b. Calculate the Pearson's correlation coefficient and coefficient of
determination (r2)
c. Calculate the regression equation
d. Calculate standard error, the total sum of squares, the sum of square
error, and a regression sum of squares
e. Validate the regression equation through residual analysis
Ans. :
• Yes, the scatter plot is showing a linear relationship. There are not
any outliers visible in the scatter plot
Year Sales Marketing Cost
(in a million rupees) (in a million rupees)
2009 2064 31
2010 2389 36
2011 2418 37
2012 2509 35
2013 2608 38
2014 2706 41
2015 2802 44
2016 2905 45
2017 3056 47
2018 3189 49
Answer the below Questions :
a. Make a scatter plot and check whether it is showing the linear
relationship, and are there a few outliers ?
b. Calculate the Pearson's correlation coefficient and coefficient of
determination (r2)
c. Calculate the regression equation
d. Calculate standard error, the total sum of squares, the sum of square
error, and a regression sum of squares
e. Validate the regression equation through residual analysis
Unit 3 : MULTIPLE LINEAR REGRESSION
: UNIT STRUCTURE :
3.0 Learning Objectives
3.1 Introduction
3.2 Essence of Multiple Linear Regression
3.2.1 Introduction to Multiple Linear Regression
3.3 Understanding the Concept of Multiple Linear Regression with
a Worked Example
3.4 The Correlation Coefficient for Multiple Linear Regression
3.5 Coefficient of Determination (R2), Adjusted R2, and Standard Error
3.6 Multiple Linear Regression in MS Excel
3.6.1 The Modified Regression Model in Excel
3.6.2 Residual Analysis to Test the Regression Assumptions
3.7 Let Us Sum Up
3.8 Answers for Check Your Progress
3.9 Glossary
3.10 Assignment
3.11 Activities
3.12 Case Study
3.13 Further Readings
3.3 Understanding the Concept of Multiple Linear Regression with
a Worked Example :
Let's consider an example of a supply chain of an online retail
company; their logistic arm tries to deliver the packages on the same
day. They try to optimize their delivery associate's trips by strategizing
trips with the help of city maps to reduce time and fuel costs. Below is the data for 10 random trips; each trip has four important pieces of information – total miles travelled, number of deliveries, daily gas price and total time of travel in hours.
As an analyst, you want to estimate how long a trip will take (dependent variable) based on three inputs – total distance, number of deliveries and daily gas (fuel) price (independent variables).
MileTraveled numDeliveries gasPrice travelTime(hrs)
89 4 3.84 7
66 1 3.19 5.4
78 3 3.78 6.6
111 6 3.89 7.4
44 1 3.57 4.8
77 3 3.57 6.4
80 3 3.03 7
66 2 3.51 5.6
109 5 3.54 7.3
76 3 3.25 6.4
Below is the pictorial view of the relationship between the dependent
and independent variables.
Total Sum of Squares = SST = Σ(yᵢ − ȳ)²
Regression Sum of Squares = SSR = Σ(ŷᵢ − ȳ)²
Error Sum of Squares = SSE = Σ(yᵢ − ŷᵢ)²
MileTraveled numDeliveries gasPrice travelTime (y) (yᵢ − ȳ) (yᵢ − ȳ)² (ŷᵢ − ȳ) (ŷᵢ − ȳ)² (yᵢ − ŷᵢ) (yᵢ − ŷᵢ)²
89 4 3.84 7 0.61 0.3721 0.2817 0.0793 0.3283 0.1078
66 1 3.19 5.4 -0.99 0.9801 -0.7983 0.6373 -0.1917 0.0367
78 3 3.78 6.6 0.21 0.0441 -0.2204 0.0486 0.4304 0.1853
111 6 3.89 7.4 1.01 1.0201 1.3283 1.7644 -0.3183 0.1013
44 1 3.57 4.8 -1.59 2.5281 -1.3395 1.7943 -0.2505 0.0627
77 3 3.57 6.4 0.01 0.0001 -0.1072 0.0115 0.1172 0.0137
80 3 3.03 7 0.61 0.3721 0.2627 0.0690 0.3473 0.1206
66 2 3.51 5.6 -0.79 0.6241 -0.6093 0.3712 -0.1807 0.0327
109 5 3.54 7.3 0.91 0.8281 1.1292 1.2751 -0.2192 0.0481
76 3 3.25 6.4 0.01 0.0001 0.0728 0.0053 -0.0628 0.0039
Average 6.39 Sum 6.769 6.0561 0.7129
• Total Sum of Squares = SST = Σ(yᵢ − ȳ)² = 6.769
• Regression Sum of Squares = SSR = Σ(ŷᵢ − ȳ)² = 6.0561
• Error Sum of Squares = SSE = Σ(yᵢ − ŷᵢ)² = 0.7129
R² = SSR / SST = 1 − SSE / SST
Adjusted R² = 1 − [SSE / (n − k − 1)] / [SST / (n − 1)]
            = 1 − [.7129 / (10 − 3 − 1)] / [6.769 / (10 − 1)] = 0.8420
Standard error = √(SSE / (n − k − 1)) = √(.7129 / (10 − 3 − 1)) = 0.3447
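The three results above are plain arithmetic once SSE, SST, n and k are known; a quick Python check with the worked example's figures:

```python
import math

sse, sst = 0.7129, 6.769  # error and total sums of squares from the example
n, k = 10, 3              # 10 trips, 3 independent variables

r2 = 1 - sse / sst                                  # about 0.8947
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))  # about 0.8420
std_err = math.sqrt(sse / (n - k - 1))              # about 0.3447

print(r2, adj_r2, std_err)
```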
Check Your Progress – 3 :
1. In multiple linear regression, correlation coefficient is calculated
between dependent variable and ___________.
2. Adjusted R2 is always ___________ than R2 value.
3. Standard error in multiple linear regression is adjusted by
___________ and ___________.
2 The coefficient of determination (R²) = SSR / SST = .8947 or 89.47% (variation in output variable due to input variables)
6 Error sum of squares (SSE) 0.7129
7 Total sum of squares (SST) 6.769
8 The overall significance of the model; if the value is less than .05 at 95% confidence level, then the overall model shows a significant relationship between input and output variables
9 y–intercept 6.21
1. The value of R2 for the above data is:
a. 0.9515 b. 0.9053 c. 0.8782 d. 3.8751
2. The value of SSR and SST for the above data is:
a. 1004.484 and 1109.6 b. 105.1158 and 1109.6
c. 502.2421 and 1004.484 d. 33.4459 and 1109.6
3. Value of Standard error for the above data is:
a. 0.9515 b. 0.9053 c. 0.8782 d. 3.8751
4. Value of Adjusted R2 for the above data is:
a. 0.9515 b. 0.9053 c. 0.8782 d. 3.8751
5. By visualizing scatter plot and regression output, we can conclude
that:
a. Data have multicollinearity issue
b. Data has overfitting issue
c. Data has both multicollinearity and overfitting issues
d. Data has neither multicollinearity nor overfitting issue
6. If there is multicollinearity in the model, it may result in:
a. The regression coefficient may be in the opposite sign
b. May add a new variable that is not significant
c. Removing a significant variable from the data
d. All of the above
7. If we have added a new variable in the model then:
Statement 1 : R2 will always increase
Statement 2: Adjusted R2 will always increase
a. Statement 1 is always true
b. Statement 2 is always true
c. a and b both statements are correct
d. a and b both statements are incorrect
8. In the case of multiple linear regression, the correlation coefficient
is:
a. Correlation between the dependent variable and predicted variable
b. The square root of the coefficient of determination
c. a and b both options are correct
d. a and b both options are incorrect
9. In the below formula of standard error, n and k represent:
Standard error = √(SSE / (n − k − 1))
3.9 Glossary :
Multiple Linear Regression : Multiple linear regression is an
important statistical technique for finding the relationship between one
dependent variable and multiple independent variables.
Multiple R (Multiple Correlation Coefficient) : In multiple linear
regression, correlation coefficient represents the relationship between the
dependent variable and predicted dependent variable.
Overfitting : In multiple linear regression, if there are too many variables in the model, then there would be a higher, fictitious coefficient of determination. Also, if we have not split the data and validated the model on test data, then the model may be overfitted on the training data.
Multicollinearity : If independent variables show a strong
relationship among themselves, then it is difficult to measure the actual
impact of an independent variable on the dependent variable as an
independent variable also has an impact on another independent variable.
Residual Analysis : Residual analysis validates the regression
assumptions like residuals are random in nature, and there is not any
hidden trend. Another assumption is the residuals do not show any
relation with the independent variable.
Adjusted R2 : The coefficient of determination (R2) increases whenever we add a new variable to the model, irrespective of its significance; hence R2 sometimes reports fictitious model accuracy. Adjusted R2 is a measure where R2 gets adjusted by the degrees of freedom (sample size and the number of independent variables).
Regression Equation : It is a mathematical expression that represents
the mathematical relationship between the dependent variable and
independent variables. If we have values of all independent variables then
with the help of the regression equation, we can predict the value of
the dependent variable.
3.10 Assignment :
1. Why is it always recommended to see a scatter plot before applying multiple linear regression ? Explain with an example.
2. What do we mean by non-linear regression, explain with an example ?
3. What do we mean by y-intercept, explain its role in the interpretation
of regression equation ?
3.11 Activities :
Below is the data for car sales. Here "Price" is the dependent
variable, while all others are independent variables. Calculate SST, SSR,
SSE, standard error, R2 and adjusted R2, regression equation for the below
data. Also, justify the difference between R2 and adjusted R2.
engine_s horsepower wheelbase width length curb_wgt fuel_cap mpg price
1.8 140 101.2 67.3 172.4 2.639 13.2 28 21.5
3.2 225 106.9 70.6 192 3.47 17.2 26 27.3
3.2 225 108.1 70.3 192.9 3.517 17.2 25 28.4
3.5 210 114.6 71.4 196.6 3.85 18 22 42
1.8 150 102.6 68.2 178 2.998 16.4 27 23.99
2.8 200 108.7 76.1 192 3.561 18.5 22 33.95
4.2 310 113 74 198.2 3.902 23.7 21 62
2.5 170 107.3 68.4 176 3.179 16.6 26.1 26.99
2.8 193 107.3 68.5 176 3.197 16.6 24 33.4
2.8 193 111.4 70.9 188 3.472 18.5 24.8 38.9
3.1 175 109 72.7 194.6 3.368 17.5 25 21.975
3.8 240 109 72.7 196.2 3.543 17.5 23 25.3
3.8 205 112.2 73.5 200 3.591 17.5 25 27.885
3.8 205 113.8 74.7 206.8 3.778 18.5 24 31.965
3 200 107.4 70.3 194.8 3.77 18 22 31.01
4.6 275 108 75.5 200.6 3.843 19 22 39.665
4.6 275 115.3 74.5 207.2 3.978 18.5 22 39.895
5.7 255 117.5 77 201.2 5.572 30 15 46.225
1 55 93.1 62.6 149.4 1.895 10.3 45 9.235
1.8 120 97.1 66.7 174.3 2.398 13.2 33 13.96
Ans. :
Regression Statistics
Multiple R 0.9460
R Square 0.8950
Adjusted R Square 0.8187
Standard Error 4.9892
Observations 20
ANOVA
df SS MS F Significance F
Regression 8 2334.10 291.76 11.72 0.00
Residual 11 273.81 24.89
Total 19 2607.91
Standard
Coefficients Error t Stat P-value Lower 95%
Intercept -116.159 78.407 -1.481 0.167 -288.731
engine_s -6.055 4.886 -1.239 0.241 -16.809
horsepow 0.203 0.065 3.133 0.010 0.060
wheelbas 1.310 0.520 2.521 0.028 0.166
width 0.261 0.954 0.274 0.789 -1.840
length -0.359 0.287 -1.250 0.237 -0.990
curb_wgt -3.136 10.581 -0.296 0.772 -26.425
fuel_cap 1.831 1.203 1.521 0.156 -0.818
mpg 0.475 0.661 0.718 0.488 -0.980
Consider mpg (mileage per gallon) as the predicted variable and answer the below questions:
1. Calculate the correlation coefficient between the dependent variable
(mpg) and independent variables
2. Calculate R2 and adjusted R2
3. Calculate SSR, SSE, SST and standard error
4. Calculate the regression equation
5. Draw a scatter plot between the dependent variable and the top 3 independent variables
BLOCK SUMMARY
Correlation plays a vital role in the corporate world, as decision making about any business metric always depends on various factors. Therefore, we cannot take decisions in silos; we need a robust strategy that considers all positive and negative consequences of our decision. On one side, correlation validates all these relationships, while on the other, we take the help of simple and multiple linear regression to establish mathematical relations between these business metrics (dependent and independent variables). Regression models help us to predict the value of output metrics like sales, revenue, productivity, time etc. based on various factors (independent variables) like manpower requirement, marketing budget, number of sales pitches etc.
We have to be cautious while doing regression analysis, as model accuracy and validity depend on selecting the right independent variables. We should not unnecessarily add independent variables: on one side, it raises the problem of overfitting; on the other side, it is practically difficult to control too many input variables in the model. Regression also has two critical assumptions about independent variables. Firstly, there should be a linear relationship between the variables (the relationship should not be curvilinear or zig–zag); secondly, independent variables should not have a significant relationship among themselves (the multicollinearity problem). Regression is also very sensitive to residual behaviour; therefore, it is utterly crucial to observe and analyze residuals, as the model may suffer from autocorrelation (a relationship between residuals and independent variables) or heteroscedasticity (error variance is not constant).
We should not fit the regression model on the entire data, as it leads to overfitting: the model works very well on training data but behaves poorly on unobserved data. Therefore, it is vital to split the entire data into training and test sets (80% training and 20% test). Before exposing the model to the external world, it should be validated on the test dataset. Regression is also an important foundation for various tools and techniques in predictive and prescriptive analytics. Currently, many machine learning algorithms use regression as a base methodology.
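The 80/20 split mentioned above amounts to shuffling the observations once and cutting the list. The helper below is a plain-Python sketch with a name of our own; libraries such as scikit-learn provide a ready-made train_test_split based on the same idea:

```python
import random

def split_train_test(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows once, then cut them into training and test sets."""
    shuffled = rows[:]  # copy so the caller's list stays untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))  # stand-in for 10 observations
train, test = split_train_test(data)
print(len(train), len(test))  # 8 2
```

Fixing the seed makes the split reproducible, which matters when comparing candidate models on the same held-out data.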
BLOCK ASSIGNMENT
Short Questions
1. Explain the fundamental difference between covariance and correlation
2. Write down three scenarios where Spearman rank correlation can be
better than the Pearson correlation coefficient
3. Regression analysis plays a vital role in decision making. Explain this
statement
4. Explain the concept of SST, SSE and SSR in a regression model
5. Explain the importance of the coefficient of multiple determination
(R2) in interpreting the multiple regression output
6. Why is adjusted R2 better than the coefficient of multiple determination (R2) ? Explain with an example
Long Questions
1. Explain the relationship between variance and covariance also derive
the Pearson correlation formula from the covariance formula
2. Write down the assumptions of simple linear regression analysis, and write down the consequences if these assumptions are not validated by the model
3. Explain the concept of standard error and coefficient of determination
and their significance in the regression model
4. Explain the importance of residual analysis in the regression model,
define different types of residual assumptions
5. Explain the steps of carrying out multiple linear regression and the
significance of each step in decision making
Enrolment No. :
1. How many hours did you need for studying the units ?
Unit No. 1 2 3
No. of Hrs.
2. Please give your reactions to the following items based on your reading
of the block :
Dr. Babasaheb Ambedkar Open University
BBAR-304
Business Analytics
UNIT 1
INTRODUCTION TO FORECASTING TECHNIQUES
UNIT 2
MOVING AVERAGE AND SINGLE EXPONENTIAL SMOOTHING
TECHNIQUES
UNIT 3
REGRESSION METHODS FOR FORECASTING
UNIT 4
AUTO–REGRESSION (AR) AND MOVING AVERAGE (MA)
FORECASTING MODELS
BLOCK 4 : TIME SERIES ANALYSIS
Block Introduction
All organizations, irrespective of their size and industry, prepare short–term (weekly, monthly or quarterly) and long–term plans (half–yearly, yearly or five–yearly etc.), and forecasting remains the most important integral part of all these planning activities. It directly impacts the revenue and cost of the organization.
Block Objectives
: UNIT STRUCTURE :
1.0 Learning Objectives
1.1 Introduction
1.2 Forecasting : Magical Crystal Ball of Statisticians
1.3 Time–Series Data and Components of Time–Series Data
1.4 Time–Series Data Modelling Techniques
1.4.1 Additive Model of Time–Series Modelling
1.4.2 Multiplicative Model of Time–Series Modelling
1.5 Measuring Forecasting Accuracy Techniques
1.5.1 Mean Absolute Error (MAE) / Mean Absolute Deviation
(MAD)
1.5.2 Mean Absolute Percentage Error (MAPE)
1.5.3 Mean Square Error (MSE)
1.5.4 Root Mean Square Error (RMSE)
1.6 Factors Affecting Forecasting Accuracy
1.7 Let Us Sum Up
1.8 Answers for Check Your Progress
1.9 Glossary
1.10 Assignment
1.11 Activities
1.12 Case Study
1.13 Further Readings
Most statisticians believe that time–series data, from a forecasting perspective, can be categorized into four components: trend component, seasonal component, cyclic component, and irregular component. Most of the time, not all of these components are present in time–series data.
and customs that depend on society, the end of season sale etc.
Another big difference between cyclical and seasonal components is that cyclical movement does not have a fixed time between fluctuations; it is random in nature, while the seasonal component generally has a fixed time period within a year. In other words, the periodicity of cyclical fluctuations is not constant, while it is constant for the seasonal component.
4. Irregular Component (It) : The irregular component consists of random movements in the time series and follows a normal distribution with mean 0 and constant variance. A distribution with mean zero and constant variance is known as white noise. If we remove the trend, seasonal and cyclic components from time–series data, the residual time–series data is the irregular component.
MAE = Σₜ₌₁ⁿ |xₜ − Fₜ| / n
MAD = 410 / 9 = 45.56
Here n = 9 (forecasting is available only for 9 observations).
This example shows that some of the forecast errors are positive while others are negative. If we add all these error values, then most of the time the outcome will be either 0 or very close to 0, as the negative and positive errors cancel each other. By taking the absolute value we can resolve this problem. In the above example, the summation of all error terms is zero; it is not guaranteed that the error summation will always be exactly zero, but it will remain close to zero unless there are lots of outliers in our data.
One important limitation of mean absolute error is that we cannot compare the MAD scores of two different studies, as MAD directly depends on the absolute values of the time–series data. If the share price of one organization is in the range of ₹600 to ₹850 while the share price of another organization is in the range of ₹45,000 to ₹56,000, then MAD in the case of the second organization will be higher simply because its absolute values are higher. Another accuracy measure, mean absolute percentage error, resolves this problem.
1.5.2 Mean Absolute Percentage Error (MAPE) :
Mean absolute percentage error considers the average of absolute percentage errors instead of the average of just the absolute values of the error terms. The formula is as follows:
MAPE = (1/n) Σₜ₌₁ⁿ (|xₜ − Fₜ| / xₜ) × 100
As MAPE is dimensionless, it can be used to compare two different time–series irrespective of the magnitude of their values.
Year Sales Forecast Error Value |Error| / Actual
1 1336 -
2 1392 1477 -85 0.061
3 1487 1532 -45 0.030
4 1547 1460 87 0.056
5 1610 1626 -16 0.010
6 1689 1729 -40 0.024
7 1741 1669 72 0.041
8 1798 1760 38 0.021
9 1760 1752 8 0.005
10 1714 1733 -19 0.011
Sum 0 0.259
MAPE = (.259 / 9) × 100 = 2.878
Here n = 9 (forecasting is available only for 9 observations).
1.5.3 Mean Square Error (MSE) :
When we calculate error terms in time–series forecasting, we have already seen the method of taking the absolute value to get rid of negative values. Another way to do the same thing is to calculate the square of the error terms.
MSE = Σₜ₌₁ⁿ (xₜ − Fₜ)² / n
MSE = 25728 / 9 = 2859
Here n = 9 (forecasting is available only for 9 observations).
1.5.4 Root Mean Square Error (RMSE) :
Square root of the mean square error is known as root mean square error (RMSE). It is the standard deviation of errors.
RMSE = √( Σₜ₌₁ⁿ (xₜ − Fₜ)² / n )
RMSE = √(25728 / 9) = √2859 = 53.47
RMSE and MAPE are the two most popular forecasting accuracy measures. Organizations generally set forecasting targets in terms of these measures.
Worked Example : An organization has below sales and forecasting
figures. Calculate MAD, MAPE, MSE and RMSE.
Solution :
In the earlier section, we have seen the formula and significance of different forecasting accuracy measures. Here time–series data is available for 12 periods. Generally, we do not have a forecast available for the first period, or sometimes we take the actual sales itself as the forecast number. Generally, a line chart is preferred to visualise time–series data.
visualise time–series data.
Year Actual Forecast Error (xₜ − Fₜ) |xₜ − Fₜ| (xₜ − Fₜ)² |xₜ − Fₜ| / xₜ
1 27.580 27.580 0.000 0 0.000 0.000
2 25.950 26.765 -0.815 0.815 0.664 0.031
3 26.080 26.015 0.065 0.065 0.004 0.002
4 26.360 26.220 0.140 0.14 0.020 0.005
5 27.990 27.175 0.815 0.815 0.664 0.029
6 29.610 28.800 0.810 0.81 0.656 0.027
7 28.850 29.230 -0.380 0.38 0.144 0.013
8 29.430 29.140 0.290 0.29 0.084 0.010
9 29.670 29.550 0.120 0.12 0.014 0.004
10 30.190 29.930 0.260 0.26 0.068 0.009
11 31.790 30.990 0.800 0.8 0.640 0.025
12 31.980 31.885 0.095 0.095 0.009 0.003
Sum 2.2 4.59 2.9679 0.160
Here n = 12
MAD/MAE = Σₜ₌₁ⁿ |xₜ − Fₜ| / n = 4.59 / 12 = 0.383
MSE = Σₜ₌₁ⁿ (xₜ − Fₜ)² / n = 2.968 / 12 = 0.247
RMSE = √( Σₜ₌₁ⁿ (xₜ − Fₜ)² / n ) = √(2.968 / 12) = 0.497
MAPE = (1/n) Σₜ₌₁ⁿ (|xₜ − Fₜ| / xₜ) × 100 = (.160 / 12) × 100 = 1.329
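All four measures for this worked example can be recomputed directly from the actual and forecast columns of the table; a short Python check:

```python
import math

actual = [27.580, 25.950, 26.080, 26.360, 27.990, 29.610,
          28.850, 29.430, 29.670, 30.190, 31.790, 31.980]
forecast = [27.580, 26.765, 26.015, 26.220, 27.175, 28.800,
            29.230, 29.140, 29.550, 29.930, 30.990, 31.885]

n = len(actual)
errors = [x - f for x, f in zip(actual, forecast)]

mad = sum(abs(e) for e in errors) / n  # mean absolute deviation, ~0.383
mse = sum(e ** 2 for e in errors) / n  # mean square error, ~0.247
rmse = math.sqrt(mse)                  # root mean square error, ~0.497
mape = sum(abs(e) / x for e, x in zip(errors, actual)) / n * 100  # ~1.33%

print(mad, mse, rmse, mape)
```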
1.6 Factors Affecting Forecasting Accuracy :
Forecasting always remains challenging, as there are many factors that influence it; we apply various fixes to improve forecasting accuracy. Below are a few important factors:
A. A high volume of data helps in better forecasting accuracy : High volumes always help in attaining better forecasting accuracy, as they cover various possible business scenarios and smooth out the impact of natural variance in the data. Therefore, forecasting for a superstore is always easier than for a local confectionery or grocery store. High volume is a prerequisite for most data science techniques, as large data explains its variation well.
B. Aggregation helps in improving forecasting accuracy : Forecasting
at an individual product/ service level is difficult but if we aggregate
data into a few logical groups then it becomes easy. For example,
forecasting at each course in a university may be difficult but if
we try to forecast graduate students in science streams by aggregating
individual specialization in the science department then it would
be relatively easy.
C. Short period forecasting works better than long term : A long–term period consists of various business scenarios which are not so regular; therefore, longer–term forecasting has relatively lower accuracy.
D. Forecasting is relatively easy for a stable business : A start–up or new business unit has unstable demand, which makes forecasting challenging, while on the other hand, a mature business has regular customers and a stronger brand, which helps in superior accuracy.
Check Your Progress :
1. In Time–series data y–axis denotes the business measures while
x–axis denotes ___________.
2. Seasonality component is relatively easier to estimate while cyclical
component is ___________ to estimate.
3. Root mean square error and ___________ forecasting accuracy
measures can be compared across the industries.
4. Mean absolute error is also known as ___________.
Multiple Choice Questions :
1. Cyclic components in a Time–series is caused due to:
a. Random events occur in local geography
b. Festivals in society
c. Macroeconomic changes
d. Changes in customer behaviour
2. Which option is NOT a valid component of Time–Series data ?
a. Irregular component b. Seasonal component
c. Trend component d. Noise
3. White noise is:
a. Errors with mean and the standard deviation both 1
b. Uncorrelated errors with a mean value of 0 and constant standard
deviation
c. Errors with constant mean and standard deviation
d. Completely random errors
4. Which forecasting accuracy measures can be measured across different
industries
a. MSE and RMSE b. MAD and MAPE
c. Only MAPE d. RMSE and MAPE
5. Which forecasting model will be more appropriate if seasonality
is completely independent of the trend
a. Additive model
b. Multiplicative model
c. Both additive and multiplicative can be used
d. Neither additive nor multiplicative can be used
6. Which Time–Series component depicts long term upward and
downward movement of the data
a. Seasonality b. Cyclic c. Irregular d. Trend
7. Which component depends on local festivals and events in the
society
a. Trend b. Seasonality c. Cyclic d. Irregular
8. Which forecasting accuracy measure is the standard deviation of
errors
a. MAPE b. MAD c. RMSE d. MSE
9. Which of the below factors help in better forecasting ?
a. Large data set
b. Predicting for a shorter period
c. Forecasting for a stable business
d. All the above
10. What is modelling in statistics ?
a. Predicting output variable based on input variables
b. Predicting input variables
c. Predicting error terms
d. To check errors following a normal distribution or not
1.7 Let Us Sum Up :
1. Time–series analysis plays an important role in descriptive analytics
as it shows timely movement (fluctuations) of important KPIs over
the period
2. Time–series forecasting is the backbone of predictive analytics as
all organizations need to forecast their important business KPIs (key
performance indicators)
3. Time–series data always have time on the x–axis while the y–axis
is the output variable, for which we want to show the movement
over time
4. Time–series has four important components, trend, seasonality,
cyclical and irregular
5. Trends showcase overall upward or downward movement of data
throughout the time period
6. When time–series data shows repetitive downward or upward
movement / fluctuations from the trend then we say there is a
seasonal influence in our data. These repetitions are generally
shorter (less than a year) for example, daily, weekly, monthly,
quarterly etc.
7. A cyclical component is a movement over a longer period (generally
more than a year).
8. An irregular component consists of random movements in the time
series and follows a normal distribution with mean 0 and constant
variance.
9. One of the major factors in choosing the right forecasting technique
is the accuracy measure; the technique which shows the
least error generally becomes the choice for forecasting, but it is not
the only factor. Businesses consider several other factors also to
identify the right forecasting technique
10. There are four important forecasting accuracy measures used
frequently, these are Mean Absolute Error (MAE) / Mean Absolute
Deviation (MAD), Mean Absolute Percentage Error (MAPE), Mean
Square Error (MSE), Root Mean Square Error (RMSE)
11. The values of MAD and MSE depend on the magnitude of the data,
hence these cannot be used for benchmarking across departments
or businesses
12. MAPE and RMSE can be used as targets for benchmarking as
these are represented as a percentage and a ratio respectively
13. Forecasting is relatively easy for large data, stable businesses,
aggregated data and shorter period
1.8 Answers for Check Your Progress :
Check Your Progress :
1. Time 2. Difficult
3. MAPE 4. Mean absolute deviation
Multiple Choice Questions :
1. c 2. c 3. b 4. d
5. a 6. d 7. b 8. c
9. d 10. a
1.9 Glossary :
Time–Series : Time–series analysis represents the output variable
on the y–axis while the x–axis always remains time (generally time
interval remain constant).
Forecasting Technique : Predicting future value of output variable
based on historic data is known as a forecasting technique.
Component of Time–Series : Time series can be decomposed into
four important parts namely, trend, seasonality, cyclical and irregular.
These parts of time–series are known as components of time series.
Forecasting Error : The difference between the actual value and
forecasting value is known as a forecasting error.
Forecasting Accuracy Measure : The approach for calculating the
forecasting error in a time series is known as the forecasting accuracy
measure.
Mean Absolute Error (MAE) / Mean Absolute Deviation (MAD) :
This is the average error in the forecast without considering the sign of the
error.
Mean Absolute Percentage Error (MAPE) : Mean absolute
percentage error considers the average of the absolute percentage errors instead
of the average of just the absolute values of the error terms.
Mean Square Error (MSE) : It is the average of the square of
the error term.
Root Mean Square Error (RMSE) : Square root of the mean
square error is known as root mean square error (RMSE).
1.10 Assignments :
1. Why is forecasting known as one of the most important functions
of business analytics ?
2. Write down a few examples of the application of forecasting
techniques in personal life.
3. Write down various components of time series analysis and their
importance in brief.
4. Write important scenarios where additive and multiplicative
forecasting models can be used. Explain with a few examples.
1.11 Activities :
One of the leading retail organizations provided its west India
stores' revenue (in crore rupees) for the last 12 months. Calculate
the MAD, MSE, RMSE and MAPE for this data.
Month Actual Forecast
1 26.580 26.580
2 24.950 25.765
3 25.080 25.015
4 25.360 25.220
5 26.990 26.175
6 28.610 27.800
7 27.850 28.230
8 28.430 28.140
9 28.670 28.550
10 29.190 28.930
11 30.790 29.990
12 30.980 30.885
Unit 2 : MOVING AVERAGE AND SINGLE EXPONENTIAL SMOOTHING TECHNIQUES
: UNIT STRUCTURE :
2.0 Learning Objectives
2.1 Introduction
2.2 Parts of Forecasting Techniques
2.3 Naïve Forecasting Models
2.4 Averaging Models
2.4.1 Simple Averages
2.4.2 Moving Averages
2.4.3 Weighted Moving Averages
2.5 Single Exponential Smoothing Forecasting Technique
2.6 Single Exponential Smoothing Forecasting Technique in
MS Excel
2.7 Let Us Sum Up
2.8 Answers for Check Your Progress
2.9 Glossary
2.10 Assignment
2.11 Activities
2.12 Case Study
2.13 Further Readings
2.1 Introduction :
In this unit, we will study various forecasting techniques also known
as smoothing techniques as these techniques smooth out the irregular
fluctuation effects in the time–series data. We will analyse the graphical
representation of these smoothing techniques and will see the ways to
optimize smoothing constants. In the end, we will see the relationship
between smoothing constant and accuracy measures.
2.2 Parts of Forecasting Techniques :
Forecasting techniques for stationary time series data (no significant
level of seasonal, trend or cyclical component) are also known as smoothing
techniques, as they smooth (minimize) the fluctuations due to irregular effects
in the time–series. Below are important forecasting techniques:
Month Incoming calls
January 1052
February 979
March 936
April 927
May 1004
June 1125
July 1241
August 1029
September 1053
October 965
November 996
December 1150
January ???
Ft = (xt−1 + xt−2 + ..... + xt−n) / n
Below is the data for a year, given in the form of weekly sales
data.
Week Sales (in Thousands) Week Sales (in Thousands)
1 297 27 374
2 437 28 315
3 317 29 306
4 421 30 452
5 297 31 378
6 434 32 438
7 276 33 366
8 427 34 379
9 292 35 365
10 359 36 369
11 324 37 470
12 281 38 318
13 441 39 342
14 364 40 540
15 344 41 421
16 369 42 387
17 380 43 318
18 385 44 419
19 384 45 418
20 364 46 359
21 332 47 513
22 333 48 459
23 364 49 431
24 309 50 459
25 460 51 456
26 382 52 354
Here, if we want to calculate the forecast for the 53rd week, then we
can calculate the average of all 52 weeks.
F53 = (x52 + x51 + ...... + x1) / 52 = 380
In case we have data for several months, for example, 30 months,
then instead of calculating the average of all 30 months we can also
calculate the average of the last 12 months. Generally, statisticians take such
decisions by consulting with domain experts.
2.4.2 Moving Averages :
Moving average is another simple and interesting way to forecast
time series data. It forecasts the future value of a time series using
an average of the last 'N' observations. Below is the formula for calculating
a simple moving average:
Ft+1 = (1/N) ∑ Yk, where the sum runs over k = t+1−N to t
Solution :
We can calculate the moving average by calculating the average
of 4 consecutive terms. The first forecasting term will be
(297 + 437 + 317 + 421) / 4 = 368; similarly, the second forecasting term
will be (437 + 317 + 421 + 297) / 4 = 368, and so on.
Month  Sales (in Thousands)  Moving Average  Error  Absolute Value of Error  Square Value of Error  Absolute Error Divided by Actual Value
Jan’20 297
Feb’20 437
Mar’20 317
Apr’20 421
May’20 297 368 -71 71 5041 0.239
Jun’20 434 368 66 66 4356 0.152
Jul’20 276 367 -91 91 8327 0.331
Aug’20 427 357 70 70 4900 0.164
Sep’20 292 359 -67 67 4422 0.228
Oct’20 359 357 2 2 3 0.005
Nov’20 324 339 -15 15 210 0.045
Dec’20 281 351 -70 70 4830 0.247
Jan’21 314
Average 56 4011 0.176
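The moving–average forecasts in the table above can be reproduced with a short Python sketch. This illustrative helper is not from the text; it reuses the first twelve sales values of the example.

```python
def moving_average_forecast(series, n):
    """Forecast each period as the simple average of the previous n actuals.

    The first n positions are None because a forecast needs n prior values.
    """
    forecasts = [None] * len(series)
    for t in range(n, len(series)):
        forecasts[t] = sum(series[t - n:t]) / n   # average of the last n actuals
    return forecasts

# Sales values from the worked example (in thousands)
sales = [297, 437, 317, 421, 297, 434, 276, 427, 292, 359, 324, 281]
f = moving_average_forecast(sales, 4)
# f[4] = (297 + 437 + 317 + 421) / 4 = 368.0, matching the table's first forecast
```

The table's error columns then follow by subtracting each forecast from the corresponding actual value.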
Ft+1 = ∑ Wk × Yk, where the sum runs over k = t+1−N to t
For example, with weights 5, 3, 2 and 1 on progressively older actual values:
Ft+1 = (5 × Yt + 3 × Yt−1 + 2 × Yt−2 + 1 × Yt−3) / 11
Worked Example : Calculate a 3–month weighted moving average
for a retail firm, using weights of 3 for the most recent month, 2 for the month
prior and 1 for the month before that. Calculate the forecast for Jan 2021 and
also calculate MAD, MSE, RMSE and MAPE. 12 months of data are below:
Month Sales (in Thousands)
Jan’20 496
Feb’20 339
Mar’20 427
Apr’20 421
May’20 398
Jun’20 434
Jul’20 276
Aug’20 427
Sep’20 397
Oct’20 359
Nov’20 424
Dec’20 361
Solution :
First moving average will be calculated on the first three values
(Jan, Feb and Mar data).
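That first weighted average can be sketched in Python as below. This illustrative snippet, not from the text, reproduces the April forecast (3 × 427 + 2 × 339 + 1 × 496) / 6 from the Jan–Mar data.

```python
def weighted_moving_average(window, weights):
    """One-step forecast: weighted average of the last len(weights) actuals.

    `weights` are ordered oldest-to-newest, so the last weight applies to
    the most recent observation.
    """
    return sum(w * y for w, y in zip(weights, window)) / sum(weights)

jan_feb_mar = [496, 339, 427]                  # first three months of the data
forecast_apr = weighted_moving_average(jan_feb_mar, [1, 2, 3])
# (1*496 + 2*339 + 3*427) / 6 = 2455 / 6, roughly 409.2
```

Sliding the three–month window forward one month at a time produces the rest of the forecasts.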
Solution :
We can create a table with forecast and error values for each alpha
value. Generally, the forecast is not available for the very first data point
hence we cannot calculate a forecasted value for the second time period.
Therefore, we use the actual value of the first time period as the
forecasted value for the second period and start the forecasting process.
Applying the exponential smoothing formula to calculate the forecasted
value for the third and subsequent data points (for α = .3) :
F3 = .3(339) + .7(496) = 449
F4 = .3(427) + .7(449) = 442
F5 = .3(421) + .7(442) = 436
F6 = .3(398) + .7(436) = 425
Similarly, we can calculate forecasted values with other α values
also. The below table has all calculated values.
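The recursion behind these hand calculations, Ft+1 = αYt + (1 − α)Ft with the first actual value used to seed the forecast, can be sketched in Python. This is an illustrative snippet that reproduces F3 through F5 of the example.

```python
def single_exponential_smoothing(series, alpha):
    """Single exponential smoothing: F[t+1] = alpha*Y[t] + (1 - alpha)*F[t].

    As in the text, the first actual value seeds the forecast for period 2;
    period 1 has no forecast (None).
    """
    forecasts = [None, series[0]]                 # F2 = Y1
    for y in series[1:-1]:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts

sales = [496, 339, 427, 421, 398]                 # first five actuals of the example
f = single_exponential_smoothing(sales, 0.3)
# f[2] = 0.3*339 + 0.7*496 = 448.9, which the text rounds to 449
```

Re-running the function with different alpha values fills in the other columns of the table.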
Conclusion :
The smaller the value of α (the larger the damping factor), the smoother
the entire time–series. On the other hand, the larger the value of α (the smaller
the damping factor), the more closely the forecasted values follow the
time–series data.
Benefits and Drawbacks of Single Smoothing Forecasting
Techniques:
Benefits :
The single exponential smoothing technique performs very well in
most scenarios. Its important benefits are as follows:
1. It is simple to calculate and works well with the data if there
are no strong trend or seasonal patterns in the time series
2. This technique uses all historic data unlike moving average forecasting
techniques
3. It assigns progressively decreasing weights which ensure more
importance to recent data
Drawbacks :
1. For very large data sets, the forecast becomes less sensitive to
changes in the data
2. It always lags the trend as it is based on past observations. The
longer the time period n, the greater the lag, as it is slow to recognize
shifts in the level of the data points
3. It doesn't work well if data has trend and seasonality influence
Step – 2 : Select the "Data Analysis" pack under the Data tab
8. What is the main motive behind the progressively decreasing weights
in the single exponential smoothing technique ?
a. Recent actual values will have higher weights
b. Recent actual values will have lower weights
c. Time series will smooth all the fluctuations
d. Forecasting accuracy measure will be better
9. One of the parameters to select the appropriate forecasting technique
is
a. Forecasting accuracy measure
b. Business domain
c. Knowledge of statistician about the forecasting technique
d. None of the above
10. One of the drawbacks of moving average is
a. It is computationally very expensive
b. It provides the same importance to all data points
c. It is difficult to choose the optimal length of moving average
d. None of the above
8. In the single exponential smoothing technique, there is always a lag
because forecasts depend on past observations. The lower the alpha
value, the smoother the forecast.
2.9 Glossary :
Forecasting Technique : Predicting future value of output variable
based on historic data is known as a forecasting technique
Moving Average Forecasting Techniques : Moving average is a
forecasting technique that forecasts the future value of a time series data
using average (or weighted average) of past 'N' observations
Single Exponential Smoothing Forecasting Technique : This
technique uses exponentially reducing weights with the help of a smoothing
constant alpha. Unlike moving average methods, it uses all historic data
Damping Factor : It is used to smooth the time series and
progressively assign weights to historic data. Its value is 1 − α
2.10 Assignment :
1. What is the basic difference between naïve, averaging models and
exponential smoothing forecasting techniques ? Write down the
important types of forecasting techniques under these categories.
2. "It is not necessary that more complex mathematical forecasting
technique will always give better forecasting accuracy". Explain this
statement with appropriate examples.
3. What is the basic difference between moving average and weighted
moving average forecasting techniques ?
2.11 Activities :
One of the leading retail organizations provided its west India
stores' revenue (in crore rupees) for the last 12 months. Calculate
the forecast using a 3–month moving average and the single exponential
smoothing technique (use α = .3). Also calculate MAD, MSE, RMSE
and MAPE for this data.
Month Revenue
1 26.580
2 24.950
3 25.080
4 25.360
5 26.990
6 28.610
7 27.850
8 28.430
9 28.670
10 29.190
11 30.790
12 30.980
Questions :
1. Develop a forecasting model using 5–day and 7–day moving
averages and the single exponential smoothing technique for alpha
values .3 and .8
2. Visualise the forecasted values
3. Calculate the MAD, MSE, RMSE and MAPE also suggest which
measure is most appropriate for their restaurant chain
4. Which forecasted model will you recommend for their business
2.13 Further Readings :
• "Time Series Analysis: Forecasting and Control", Holden–Day,
Box and Jenkins (1970)
• "How to Get a Better Forecast", Harvard Business Review,
Parker G, Segura E (1971)
• "Time Series Based Predictive Analytics Modelling: Using
MS Excel", Glyn Davis, Branko Pecar; 1st edition (2016)
• "An Introduction to Time Series Analysis and Forecasting with
Applications of SAS and SPSS", Academic Press,
Yaffee R, McGee M (2000)
Unit 3 : REGRESSION METHODS FOR FORECASTING
: UNIT STRUCTURE :
3.0 Learning Objectives
3.1 Introduction
3.2 Forecasting Techniques with a Trend
3.3 How to Draw Trendline and Regression Equation in Time–Series
Graph
3.4 Double Exponential Smoothing Constant Technique for
Forecasting
3.5 Let Us Sum Up
3.6 Answers for Check Your Progress
3.7 Glossary
3.8 Assignment
3.9 Activities
3.10 Case Study
3.11 Further Readings
3.1 Introduction :
In this unit, we will study the forecasting methods when we can
have extra information about time series. For example, besides sales data,
we also have information about marketing expenses, competition
information, consumer's demographic information etc. Here we can also
include information about seasonality variation. In the end, we will study
a few worked examples to understand how regression–based forecasting
methods help in robust decision making in various industries.
regression analysis. Here we consider the output variable as the metric
we are determining on a time scale or the metric we want to forecast,
while the x–axis represents time.
Regression is a more powerful forecasting technique when the time–
series also has values of various independent variables besides the dependent
variable Yt. The forecasting techniques we discussed in the last units use
only the output variable; that is why their application in the real world is
comparatively limited, as most of the time we have various independent
variables which influence the output variable. Therefore, forecasting based
only on the output variable is less effective. For example, if we want to
predict the sales price alone then it may not be so effective, but if we
also include information like marketing cost, whether a promotion was
running, nearby festivals, etc., then our forecast will be more accurate.
The general form of a regression equation is as follows:
Ft = β0 + β1X1t + β2X2t + ..... + βnXnt + εt
Where
Ft = The forecasted value at time t
X1t, X2t etc. are the predictor variables measured at time t
εt = Error or irregular component at time t
Worked Example : Below are crude oil per barrel prices in the
Indian market (in USD). See if a trend is visible; can we predict the
price for 17th August 2021 ?
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.76187804
R Square 0.580458147
Adjusted R Square 0.542317979
Standard Error 1.528988275
Observations 13
ANOVA
df SS MS F Sig F
Regression 1 35.58 35.58 15.22 0.00
Residual 11 25.72 2.34
Total 12 61.30
Ft = 72.6142 – 0.4421 Xt
Here, Ft is forecasted per barrel price of crude oil and Xt is the
time period
So, if we want to forecast the price for the next time period then
the value will be:
F14 = 72.6142 – 0.4421 × 14 = $66.42
Here we can interpret it as: for every unit increase in the time period
Xt, the per barrel price of crude oil decreases by 0.442 USD. The Y–intercept
72.6142 indicates that in the time period before the very first time period
(at Day 0), the per barrel price of crude oil was $72.61.
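The slope and intercept Excel reports come from an ordinary least squares fit of price against the period number. The crude–oil values themselves are not reproduced here, so the sketch below fits a hypothetical, perfectly linear series (y = 72 − 0.4t) just to show the mechanics.

```python
def linear_trend(series):
    """Ordinary least squares fit of y = b0 + b1*t with t = 1, 2, ..., n."""
    n = len(series)
    ts = list(range(1, n + 1))
    t_mean = sum(ts) / n
    y_mean = sum(series) / n
    b1 = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, series))
          / sum((t - t_mean) ** 2 for t in ts))
    b0 = y_mean - b1 * t_mean
    return b0, b1

# Hypothetical 13-day series with a known downward trend: y = 72 - 0.4*t
prices = [72 - 0.4 * t for t in range(1, 14)]
b0, b1 = linear_trend(prices)
forecast_day_14 = b0 + b1 * 14          # extrapolate one period ahead
```

On real, noisy data the same formulas yield the best–fitting line that Excel's regression output and trendline show.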
The P–value for the independent variable "Day" is 0.0025, which represents
a significant relationship between the input and output variables. Below is
the visual representation of the trend.
3.3 How to Draw Trendline and Regression Equation in Time–Series
Graph :
4. Select the second last option, "Trendline", which will create a trend
line as per the data. There are various types of trendlines, like linear
(straight line), exponential, linear forecast, etc. In this course, we
have included only linear, but the others are also quite intuitive; please
refer to the Microsoft Excel help website to learn about these
different trendlines. By default, Excel will create a linear trendline
only (the very first option).
Month Sales Month Sales
Apr-17 9.75 Jan-19 11.46
May-17 10.09 Feb-19 11.06
Jun-17 10.27 Mar-19 10.76
Jul-17 8.98 Apr-19 10.32
Aug-17 8.79 May-19 9.34
Sep-17 8.23 Jun-19 8.39
Oct-17 8.24 Jul-19 7.01
Nov-17 8.97 Aug-19 6.71
Dec-17 9.03 Sep-19 5.93
Jan-18 9.31 Oct-19 5.07
Feb-18 10.25 Nov-19 5.21
Mar-18 9.5 Dec-19 6.46
Apr-18 10.05 Jan-20 6.35
May-18 9.73 Feb-20 5.36
Jun-18 9.48 Mar-20 5.25
Jul-18 9.38 Apr-20 4.94
Aug-18 9.44 May-20 5.13
Sep-18 10.02 Jun-20 5.63
Oct-18 10.15 Jul-20 5.35
Nov-18 10.55 Aug-20 5.74
Dec-18 11.13 Sep-20 5.26
Solution :
Step 1 : Let's assume values for the smoothing constants, α = 0.7 and
β = 0.6. Later we will see how we can optimize these values with the
help of the "Solver" functionality of MS Excel.
Step 2 : We must initialize the very first values of the level, trend
and forecast. One of the popular ways to initialize the level value is
to use the actual value for that period, while the trend is the
difference between the current and last actual values.
The initial forecast value is the summation of the level and
trend, while the error is the difference between the actual value and forecasted
value.
Step 3 : Now we must calculate the level value for the row "Jun–
17" using Lt = αYt + (1 − α)Ft :
LevelJun–17 = 0.7 × 10.27 + (1 − 0.7) × (10.09 + 0.34) = 10.318
Step 4 : Now we must calculate the trend value for the row "Jun–
17" using Tt = β (Lt − Lt−1) + (1 − β) Tt−1 :
TrendJun–17 = 0.6 × (10.318 − 10.090) + (1 − 0.6) × (0.340) =
0.273
Step 5 : Copy the same formula down the entire "Level" and
"Trend" columns (up to the cells where actual values are available). Also
copy the formula for the "Forecast" and "Error" columns.
Month Sales Level Trend Forecast Error
Apr-17 9.75
May-17 10.09 10.090 0.340
Jun-17 10.27 10.318 0.273 10.43 -0.16
Jul-17 8.98 9.463 -0.404 10.59 -1.61
Aug-17 8.79 8.871 -0.517 9.06 -0.27
Sep-17 8.23 8.267 -0.569 8.35 -0.12
Oct-17 8.24 8.077 -0.341 7.70 0.54
Nov-17 8.97 8.600 0.177 7.74 1.23
Dec-17 9.03 8.954 0.283 8.78 0.25
Jan-18 9.31 9.288 0.314 9.24 0.07
Feb-18 10.25 10.056 0.586 9.60 0.65
Mar-18 9.5 9.842 0.107 10.64 -1.14
Apr-18 10.05 10.020 0.149 9.95 0.10
May-18 9.73 9.862 -0.035 10.17 -0.44
Jun-18 9.48 9.584 -0.181 9.83 -0.35
Jul-18 9.38 9.387 -0.190 9.40 -0.02
Aug-18 9.44 9.367 -0.088 9.20 0.24
Sep-18 10.02 9.798 0.223 9.28 0.74
Oct-18 10.15 10.111 0.277 10.02 0.13
Nov-18 10.55 10.502 0.345 10.39 0.16
Dec-18 11.13 11.045 0.464 10.85 0.28
Jan-19 11.46 11.475 0.443 11.51 -0.05
Feb-19 11.06 11.317 0.083 11.92 -0.86
Mar-19 10.76 10.952 -0.186 11.40 -0.64
Apr-19 10.32 10.454 -0.373 10.77 -0.45
May-19 9.34 9.562 -0.684 10.08 -0.74
Jun-19 8.39 8.536 -0.889 8.88 -0.49
Jul-19 7.01 7.201 -1.157 7.65 -0.64
Aug-19 6.71 6.510 -0.877 6.04 0.67
Sep-19 5.93 5.841 -0.753 5.63 0.30
Oct-19 5.07 5.076 -0.760 5.09 -0.02
Nov-19 5.21 4.942 -0.384 4.32 0.89
Dec-19 6.46 5.889 0.415 4.56 1.90
Jan-20 6.35 6.336 0.434 6.30 0.05
Feb-20 5.36 5.783 -0.158 6.77 -1.41
Mar-20 5.25 5.362 -0.316 5.62 -0.37
Apr-20 4.94 4.972 -0.361 5.05 -0.11
May-20 5.13 4.974 -0.143 4.61 0.52
Jun-20 5.63 5.391 0.193 4.83 0.80
Jul-20 5.35 5.420 0.095 5.58 -0.23
Aug-20 5.74 5.672 0.189 5.51 0.23
Sep-20 5.26 5.441 -0.063 5.86 -0.60
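Steps 2–5 above can be sketched as a short Python loop. This illustrative snippet, not part of the Excel workflow, reproduces the Jun–17 hand calculations: forecast 10.43, level 10.318 and trend close to 0.273.

```python
def double_exponential_smoothing(series, alpha, beta):
    """Holt's double exponential smoothing following the steps above.

    Level is seeded with the second actual value, trend with the first
    difference, and each forecast is F[t] = L[t-1] + T[t-1].
    """
    level = series[1]                     # initial level (May-17 in the example)
    trend = series[1] - series[0]         # initial trend = first difference
    forecasts = [None, None]              # no forecasts for the first two periods
    for y in series[2:]:
        forecast = level + trend
        forecasts.append(forecast)
        new_level = alpha * y + (1 - alpha) * forecast
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts, level, trend

sales = [9.75, 10.09, 10.27]              # Apr-17, May-17, Jun-17
forecasts, level, trend = double_exponential_smoothing(sales, 0.7, 0.6)
# forecasts[2] = 10.09 + 0.34 = 10.43; level = 10.318, trend close to 0.273
```

Feeding the full monthly series through the same loop reproduces the table above.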
MAPE 6.332
Here we can see MAPE has been reduced to 6.332 by adjusting
α = .906461 and β = .360152.
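Outside Excel, the same idea as Solver — searching for the α and β that minimize MAPE — can be approximated with a plain grid search. The snippet below is an illustrative sketch (Solver itself uses a gradient–based method, not a grid), reusing the Holt recursion from earlier in the unit on the first few sales values.

```python
def holt_mape(series, alpha, beta):
    """MAPE (%) of Holt's double exponential smoothing for given constants."""
    level, trend = series[1], series[1] - series[0]
    pct_errors = []
    for y in series[2:]:
        forecast = level + trend
        pct_errors.append(abs(y - forecast) / y)
        new_level = alpha * y + (1 - alpha) * forecast
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return 100 * sum(pct_errors) / len(pct_errors)

def best_constants(series, step=0.05):
    """Grid-search analogue of Excel's Solver: try every (alpha, beta) pair
    on a coarse grid and keep the pair with the lowest MAPE."""
    grid = [round(i * step, 10) for i in range(1, round(1 / step))]
    return min(((a, b) for a in grid for b in grid),
               key=lambda ab: holt_mape(series, ab[0], ab[1]))

# First nine monthly sales values from the worked example
sales = [9.75, 10.09, 10.27, 8.98, 8.79, 8.23, 8.24, 8.97, 9.03]
alpha, beta = best_constants(sales)
```

A finer grid, or a numerical optimizer, would get closer to the Solver values reported above.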
Check Your Progress :
1. Regression based forecasting techniques are advisable if clear
___________ is visible in the time series data
2. In the regression equation Ft = β0 + β1X1t; β0 is known as ___________
while β1 is known as ___________.
3. If the significance value under the ANOVA table in a regression output
is 0 then ___________ shows a significant relationship between input
and output variables.
4. Double smoothing forecasting techniques have two smoothing
constants, one for ___________ and the other for ___________.
5. Smoothing constants have values between ___________ and
___________.
Multiple Choice Questions :
1. Double smoothing forecasting techniques are applicable if the below
effect is NOT visible:
a. Trend b. Irregular or random
c. Seasonality d. Level
2. Which of the below statements are true ?
I. Both smoothing constants must have the same value
II. Both smoothing constants can have a value between 0 and .5
a. Only statement 1 is correct b. Only statement 2 is correct
c. Both statements are correct d. Both statements are incorrect
3. Complete the Level (or Intercept) equation (Lt) : Lt = αYt +
_________ from the below options:
a. (1–α) Ft b. α Ft
c. (1–β) Ft d. β Ft
4. If the R–Square value is .72 in an MS Excel output of regression
analysis, then it means:
a. Input variables justify 72% variation in an output variable
b. The researcher needs more input variables in the study
c. Regression output cannot be trusted
d. The researcher can use a regression model at a 72% confidence
level only
5. Complete the trend equation (Tt) : Tt = β (Lt – Lt–1) + _________
from the below options:
a. (1–α) Ft b. α Ft
c. (1–β) Tt–1 d. β Ft
6. As per unit content, MAPE/ RMSE value can be minimized by using
the below functionality in MS Excel :
a. Pivot table b. Solver
c. Both the functionality d. None of the above
7. Solver optimises MAPE/ RMSE to :
a. Maximum b. More than 1
c. Minimum d. Less than 0
3.7 Glossary :
Regression Model for Forecasting : When information about input
and output variables are available then regression–based forecasting
techniques work better than other techniques which are based on output
only.
Double Exponential Smoothing Forecasting Techniques : When
a clear trend is visible in the time–series data then double exponential
smoothing techniques generally work better; here we use two equations,
one for the level (intercept) and the other for the trend. The final forecast
comes as a summation of both equations.
Optimization of Smoothing Constants : We use optimization
techniques to adjust the values of the smoothing constants α and β and to
minimize the forecasting error. In MS Excel this optimization is
available in the form of Solver.
Decomposition of Time–Series : Separating the impact of trend,
seasonality and irregular components from a time series is known as
decomposition of a time series.
3.8 Assignment :
1. What is the basic difference between regression–based and other
forecasting techniques ? Why do regression–based forecasting
techniques usually work better in terms of forecast accuracy ?
2. Write down the important steps to draw a trend line in a forecasting
graph in MS Excel.
3. What is the basic difference between single and double exponential
smoothing forecasting techniques ? Write down the important scenarios
where these techniques work better than others.
3.9 Activities :
Below are the sales (in ` Crore) data of Contesa Ice cream. Use
double exponential smoothing technique to forecast sales for the next 5
days.
Date Sales Date Sales
January 1, 2012 8.75 January 22, 2012 10.26
January 2, 2012 9.09 January 23, 2012 9.86
January 3, 2012 9.27 January 24, 2012 9.56
January 4, 2012 7.98 January 25, 2012 9.12
January 5, 2012 7.79 January 26, 2012 8.14
January 6, 2012 7.23 January 27, 2012 7.19
January 7, 2012 7.24 January 28, 2012 5.81
January 8, 2012 7.97 January 29, 2012 5.51
January 9, 2012 8.03 January 30, 2012 4.73
January 10, 2012 8.31 January 31, 2012 3.87
January 11, 2012 9.25 February 1, 2012 4.01
January 12, 2012 8.5 February 2, 2012 5.26
January 13, 2012 9.05 February 3, 2012 5.15
January 14, 2012 8.73 February 4, 2012 4.16
January 15, 2012 8.48 February 5, 2012 4.05
January 16, 2012 8.38 February 6, 2012 3.74
January 17, 2012 8.44 February 7, 2012 3.93
January 18, 2012 9.02 February 8, 2012 4.43
January 19, 2012 9.15 February 9, 2012 4.15
January 20, 2012 9.55 February 10, 2012 4.54
January 21, 2012 10.13 February 11, 2012 4.06
Questions :
1. Is there any clear trend visible in the data ? Use MS Excel to visualize
the trend; also show the R–Square value and regression equation in the
same graph.
2. Which technique do you think will be more appropriate between the
double exponential smoothing technique and regression ? Justify your
selection
3. Use the initial values of both the α and β constants as .6, but optimize
them using the Solver functionality of MS Excel
Unit 4 : AUTO–REGRESSION (AR) AND MOVING AVERAGE (MA) FORECASTING MODELS
: UNIT STRUCTURE :
4.0 Learning Objectives
4.1 Introduction
4.2 Introduction to Autocorrelation
4.2.1 Reasons for Autocorrelation
4.2.2 Impact of Autocorrelation on a Regression Model
4.2.3 Ways to Detect Autocorrelation : Durbin Watson Test
4.3 Autoregression : Remedy to Resolve Autocorrelation
4.4 Moving Average Model MA(q)
4.5 Let Us Sum Up
4.6 Answers for Check Your Progress
4.7 Glossary
4.8 Assignment
4.9 Activities
4.10 Case Study
4.11 Further Readings
(Figure : Share Price plotted over Time, and the corresponding Residuals plotted over Time)
4.2.3 Ways to Detect Autocorrelation : Durbin Watson Test :
There are two important ways to detect autocorrelation: the first
is the "Durbin Watson test", which detects only first–order autocorrelation
(only consecutive error terms are correlated), and the second is the
"Breusch–Godfrey test", which can detect autocorrelation of any order
(error terms are correlated with error terms of any past order). In
this text, we will consider only the Durbin–Watson test, as the second
one is more complex mathematically and rarely used in day–to–day
business analytics.
Durbin Watson Test : This test was developed by the statisticians
James Durbin and Geoffrey Watson around 1950. Here we capture the error
terms in a regression model and examine whether successive error terms are
related.
So, if a positive error term is correlated with the very next positive
error term then we call it positive autocorrelation. Similarly, if a positive
error term is related to the next negative error term then it is known
as negative autocorrelation. There may be cases when the error term at
the kth position is correlated with the (k+2)th error term or any other error
of higher order, but the Durbin–Watson test is only effective for the first
order of autocorrelation.
In a simple and generic form, the above equation can be rewritten as:
dw statistic = [ ∑ (et − et−1)² for t = 2 to n ] / [ ∑ (et)² for t = 1 to n ]
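The dw statistic can be computed directly from the residuals. The Python sketch below is illustrative, with made–up residual series showing the usual interpretation: values near 0 indicate positive autocorrelation, values near 2 no autocorrelation, and values near 4 negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: the sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    num = sum((e2 - e1) ** 2 for e1, e2 in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den

# Made-up residual series illustrating the interpretation of dw:
dw_pos = durbin_watson([1, 1.1, 1.2, 1.1, 1.0, 0.9])   # slowly drifting -> near 0
dw_neg = durbin_watson([1, -1, 1, -1, 1, -1])          # alternating signs -> toward 4
```

Note the numerator starts at t = 2, since the first residual has no predecessor.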
Number of Special ICs Manufactured = 23350.34888 +
(–0.436671535 × No. of Laptops)
For the first month, an error term can be calculated as:
Actual – Predicted (Regressed) = 18005 – 15842 = 2163
Difference of the first pair of consecutive error terms:
eSecond month – eFirst month = –258
Similarly, we can calculate the error terms for the rest of the data
points
Month Predicted et et² (et − et−1) (et − et−1)²
1 15842 2163 4677624 - -
2 15864 1905 3628827 -258 66478
3 15797 444 197309 -1461 2133798
4 15763 -1178 1387082 -1622 2630688
5 15596 349 121846 1527 2331144
6 15516 -549 300949 -898 805780
7 15770 -1184 1401219 -635 403406
8 16059 -820 672084 364 132440
9 16241 -342 117194 477 227979
10 16702 -561 314259 -218 47634
11 16927 -441 194402 120 14323
12 16873 -423 178729 18 329
13 17088 -491 240694 -68 4603
14 17371 -248 61290 243 59067
15 17532 251 62932 498 248434
16 17621 93 8608 -158 24990
17 17704 465 216052 372 138408
18 17716 560 314116 96 9148
19 17890 271 73328 -290 83908
20 18214 -192 36948 -463 214379
21 18266 -108 11609 84 7136
22 18284 277 76682 385 147963
23 18332 -105 11050 -382 145950
24 18389 -138 19013 -33 1074
Sum 14323847 9879058
dw statistic = [ ∑ (et − et−1)² for t = 2 to n ] / [ ∑ (et)² for t = 1 to n ]
= 9879058 / 14323847 = 0.6897
Month  Number of Special ICs Manufactured  One Period Lagged Yt-1 (X1)  Two Period Lagged Yt-2 (X2)  Three Period Lagged Yt-3 (X3)
1 13541 - - -
2 13305 13541 - -
3 11777 13305 13541 -
4 10121 11777 13305 13541
5 11481 10121 11777 13305
6 10503 11481 10121 11777
7 10122 10503 11481 10121
8 10775 10122 10503 11481
9 11435 10775 10122 10503
10 11677 11435 10775 10122
11 12022 11677 11435 10775
12 11986 12022 11677 11435
13 12133 11986 12022 11677
14 12659 12133 11986 12022
15 13319 12659 12133 11986
16 13250 13319 12659 12133
17 13705 13250 13319 12659
18 13812 13705 13250 13319
19 13697 13812 13705 13250
20 13558 13697 13812 13705
21 13694 13558 13697 13812
22 14097 13694 13558 13697
23 13763 14097 13694 13558
24 13787 13763 14097 13694
25 - 13787 13763 14097
26 - - 13787 13763
27 - - - 13787
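The lagged columns in the table above can be constructed programmatically. Below is a minimal sketch in plain Python; `make_lags` is a hypothetical helper name, not a standard library function:

```python
def make_lags(series, max_lag):
    """Build (Yt, Yt-1, ..., Yt-max_lag) rows, keeping only the rows
    where every lagged value exists, i.e. trimming the first max_lag
    observations as done in the text."""
    rows = []
    for t in range(max_lag, len(series)):
        # Each row holds the current value followed by its lags.
        rows.append(tuple(series[t - k] for k in range(max_lag + 1)))
    return rows
```

Applied to the 24 monthly observations with `max_lag = 3`, this yields the 21 usable rows reported in the regression output below.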
A regression model can be developed only for the rows where data is
available for all the input variables X1, X2 and X3. Hence, we remove
the first three rows and the last three rows. We take the Number of
Special ICs Manufactured (Yt) column as the output variable and the three
lagged columns as input variables. Below is the regression output:
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8896
R Square 0.7914
Adjusted R Square 0.7545
Standard Error 664.0296
Observations 21.0000
ANOVA
df SS MS F Significance F
Regression 3 28432295 9477432 21 0
Residual 17 7495900 440935
Total 20 35928195
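The R Square and Adjusted R Square values in the summary can be recovered directly from the ANOVA sums of squares. A quick sanity check in Python, using the figures from the table:

```python
# Sums of squares from the ANOVA table above.
ss_regression = 28432295
ss_residual = 7495900
ss_total = 35928195

n, k = 21, 3  # observations and number of input variables

# R Square: share of total variation explained by the regression.
r_square = ss_regression / ss_total

# Adjusted R Square: penalises for the number of input variables.
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

print(round(r_square, 4))      # 0.7914
print(round(adj_r_square, 4))  # 0.7545
```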
2. What are the important reasons for autocorrelation ?
a. Missing of the important variable(s) from the regression model
b. The incorrect functional form of a regression model
c. None of the above
d. Both options a and b are correct
3. Due to autocorrelation, we cannot estimate the regression coefficients
because of:
a. Minimum variance property
b. Non–randomness of the error term
c. Seasonality component in the time series
d. None of the above
4. AR models can be used when:
a. There is no autocorrelation effect in the data
b. Data is stationary
c. When there is more than one variable
d. None of the above
5. Auto–regressive models are regression models where:
a. Yt is the output variable
b. Yt–1, Yt–2 etc. are independent variables
c. The error term is not an independent variable
d. All the above are correct
6. If there is no autocorrelation, the Durbin–Watson statistic should be :
a. Statistics value should be around 2
b. Statistics value should be around 0
c. Statistics value should be around 4
d. Statistics value should be less than 1
7. In the Durbin–Watson test, if the value of the statistic lies between
dL and dU :
a. There is no autocorrelation
b. Result is inconclusive
c. There is a strong autocorrelation
d. There is weak autocorrelation
8. In the Durbin–Watson test, which of the below statements are true :
a. If the test statistic value is around zero, there is positive autocorrelation
b. If the test statistic value is around four, there is negative autocorrelation
c. If the test statistic value is around two, there is no autocorrelation
d. All the above are true
9. For a moving average of lag 1, i.e. an MA (1) model :
a. Past residual is the independent variable
b. Error terms correlate with one error term before
c. Both a and b statements are wrong
d. Both a and b options are correct
10. If the data shows a third–order autoregressive model, then which
test can be used to validate it ?
a. Durbin–Watson test
b. Breusch–Godfrey test
c. Both above test
d. None of the tests given in options a and b
4.6 Answers for Check Your Progress :
Check Your Progress :
1. Lags 2. First–order 3. Errors
4. 1 and 4 5. Autocorrelation
Multiple Choice Questions :
1. c 2. d 3. a 4. b
5. d 6. a 7. b 8. d
9. d 10. b
4.7 Glossary :
Autocorrelation : Autocorrelation is a mathematical representation
where a time series shows correlation with a lagged version of itself.
Durbin–Watson Test : This test helps to detect autocorrelation of
the first order, i.e. when a time series correlates with its lag 1 version.
Breusch–Godfrey Test : This test helps to detect autocorrelation
of any order.
Autoregressive (AR) Models : AR Models forecast time–series data
using a linear combination of its lagged values. Here the output variable
is Yt while input variables are Yt–1, Yt–2 etc.
Moving Average MA(q) Models : Moving average models use past
forecast errors as input variables. MA (q) models are also known as the
moving average process.
4.8 Assignments :
1. What could be the important consequences of ignoring the autocorrelation
problem in a forecasting technique ?
2. Write down two important reasons for autocorrelation in a forecasting
technique.
3. What are two important ways to detect autocorrelation ? Write down
the appropriate scenarios in which these techniques can be utilized.
4. What is the basic difference between positive and negative
autocorrelation ? Explain with an example.
4.9 Activities :
Laburnum Chemicals Limited has issued production data for one of
its agricultural pesticide products for the last 26 years.
Year  Unit Produced    Year  Unit Produced
1     1196             14    2330
2     1536             15    3792
3     1726             16    2756
4     1636             17    3152
5     1682             18    2920
6     2280             19    3178
7     2430             20    3512
8     2818             21    3446
9     2526             22    3780
10    2468             23    3166
11    2892             24    3350
12    2672             25    3744
13    2486             26    3690
Check for first–order autocorrelation in the above data with the
help of the Durbin–Watson test. Also develop a one–period (lag 1) and a
two–period (lag 2) AR model and check whether there is significant
autocorrelation.
Questions :
1. Is there any clear trend visible in the data ? Use MS Excel to visualize
the trend, and also show the R–Square value and regression equation
on the same graph.
2. Calculate the Durbin–Watson statistic to check if there is first–order
autocorrelation at α = 0.05.
3. Develop an autoregressive model with a one–period lag and then
develop a model with a two–period lag. Compare the results and write
down your observations.
4.11 Further Readings :
• "Time Series Analysis: Forecasting and Control", Box, G.E.P. and
Jenkins, G.M., Holden-Day (1970)
• "How to Get a Better Forecast", Harvard Business Review,
Parker, G. and Segura, E. (1971)
• "Time Series Based Predictive Analytics Modelling: Using
MS Excel", Glyn Davis and Branko Pecar, 1st edition (2016)
• "An Introduction to Time Series Analysis and Forecasting with
Applications of SAS and SPSS", Yaffee, R. and McGee, M. (2000)
BLOCK SUMMARY
If business analytics is the brain of the organization, then forecasting
techniques are its spinal cord. Forecasting was one of the fundamental
requirements that established business analytics as a separate department in
leading organizations. Better forecasting is a key difference between
successful and struggling organizations. In the last two decades, supply
chain management has emerged as one of the most important departments,
and time–series forecasting is the heart of successful supply chain
management. Forecast accuracy measures like MAPE and RMSE can be used
to benchmark different industries; these parameters have gained huge
importance in most quality standard certifications like ISO, CMMI etc.
Most of the fundamental features for forecasting are available in MS Excel,
while a few advanced software packages, for example E-Views, Minitab and
SAS, make forecasting quite easy. Forecasting techniques are important for
both short–term and long–term planning in the organization. Good forecasting
helps to reduce warehouse cost, additional manpower, additional
bandwidth and server capacity. It also helps in optimizing manpower cost,
budgeting, revenue management etc.
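The MAPE and RMSE accuracy measures mentioned above can be expressed in a few lines of Python. This is a generic sketch; the function names are our own:

```python
import math

def mape(actual, forecast):
    """Mean Absolute Percentage Error, expressed in percent."""
    return 100 * sum(abs(a - f) / abs(a)
                     for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root Mean Squared Error, in the same units as the data."""
    return math.sqrt(sum((a - f) ** 2
                         for a, f in zip(actual, forecast)) / len(actual))
```

For example, actual demand [100, 200] against forecasts [110, 190] gives a MAPE of 7.5% and an RMSE of 10.0. Because MAPE is unit-free, it is the more natural of the two for benchmarking across industries.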
BLOCK ASSIGNMENT
Short Answer Questions :
1. What are the different components of a time series ?
2. How do forecasting techniques impact both the top and bottom line of
an organization ?
3. Write a short note on additive and multiplicative forecasting models
and write down the ideal scenarios in which each of these models can
be applied.
4. Write down the difference between weighted average and moving
average forecasting techniques.
5. Write a short note on autocorrelation and the important reasons
for it.
6. Describe the basic concept of the Durbin–Watson test to detect
first–order autocorrelation.
7. Write a short note on the moving average MA(q) forecasting model.
8. Write down the different factors affecting the forecasting accuracy of
a time–series.
Long Answer Questions :
1. Explain the benefits of forecasting in the business world with the help
of a few examples from the banking, retail and textile industries.
2. Write down different forecasting accuracy techniques. Which
techniques are suitable for benchmarking across different industries ?
3. Explain the intuition behind single and double exponential smoothing
techniques. Take the data from the assignment question and use MS
Excel Solver to optimize the smoothing constants.
4. Explain the difference between autoregressive models and the moving
average MA (q) forecasting model. Write down the similarity
between moving average techniques and the moving average forecasting
model MA (q).
5. Write down the impacts of autocorrelation.
Enrolment No. :
1. How many hours did you need for studying the units ?
Unit No. 1 2 3 4
No. of Hrs.
2. Please give your reactions to the following items based on your reading
of the block :
Dr. Babasaheb Ambedkar Open University
'Jyotirmay' Parisar,
Sarkhej-Gandhinagar Highway, Chharodi, Ahmedabad-382 481.
Website : www.baou.edu.in