
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Forecasting Daily Supermarkets Sales with Machine Learning

DANIEL FREDÉN
HAMPUS LARSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES
Forecasting Daily Supermarkets Sales with Machine Learning

DANIEL FREDÉN

HAMPUS LARSSON

Degree Projects in Optimization and Systems Theory (30 ECTS credits)


Master’s Programme in Industrial Engineering and Management
KTH Royal Institute of Technology year 2020
Supervisor at ELVENITE AB: Erik Karlström
Supervisor at KTH: Xiaoming Hu
Examiner at KTH: Xiaoming Hu
TRITA-SCI-GRU 2020:218
MAT-E 2020:061

Royal Institute of Technology


School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci
Abstract

Improved sales forecasts for individual products in retail stores can have a
positive effect both environmentally and economically. Historically these fore-
casts have been done through a combination of statistical measurements and
experience. However, with the increased computational power available in mod-
ern computers, there has been an interest in applying machine learning for this
problem. The aim of this thesis was to utilize two years of sales data, yearly cal-
endar events, and weather data to investigate which machine learning method
could forecast sales the best. The investigated methods were XGBoost, ARI-
MAX, LSTM, and Facebook Prophet. Overall the XGBoost and LSTM models
performed the best and had a lower mean absolute value and symmetric mean
percentage absolute error compared to the other models. However, Facebook
Prophet performed the best in regards to root mean squared error and mean
absolute error during the holiday season, indicating that Facebook Prophet was
the best model for the holidays. The LSTM model could however quickly adapt
during the holiday season improved the performance. Furthermore, the inclu-
sion of weather did not improve the models significantly, and in some cases, the
results were worsened. Thus, the results are inconclusive but indicate that the
best model is dependent on the time period and goal of the forecast.

i
Sammanfattning

Improved sales forecasts for individual products in retail can lead to both environmental and economic improvements. Historically, these forecasts have been made through a combination of statistical methods and experience. With the increased computational power of modern computers, the interest in applying machine learning to these problems has grown. The aim of this degree project was therefore to investigate which machine learning method could forecast sales the best. The investigated methods were XGBoost, ARIMAX, LSTM, and Facebook Prophet. Overall, the XGBoost and LSTM models performed the best, as they had a lower mean absolute error and symmetric mean absolute percentage error than the other models. However, with regard to root mean squared error, Facebook Prophet had better results during holidays, indicating that Facebook Prophet was the model best suited to predicting sales during holidays. The LSTM model could, however, adapt quickly and improve its estimates. The inclusion of weather data in the models did not result in any notable improvements and in some cases even led to worse results. Overall, the results are ambiguous but indicate that the best model depends on the time period and goal of the forecast.

ii
Acknowledgements

We would like to express our sincere gratitude towards Erik Karlström and
Elvenite for this opportunity and also Coop Värmland for the exciting material.
We would also like to thank our supervisor at KTH, Xiaoming Hu, for the
guidance and feedback throughout the project.

iii
Contents

Abstract i

Sammanfattning ii

Acknowledgements iii

Table of Contents iv

List of Figures vii

List of Tables viii

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Programming Language . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Literature Review 4
2.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Algorithms Used in Previous Work . . . . . . . . . . . . . . . . . . . 5
2.3 Variables Used in Previous Work . . . . . . . . . . . . . . . . . . . . 5
2.4 Evaluation Metrics Used in Previous Work . . . . . . . . . . . . . . . 6

3 Theory 7
3.1 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Auto-Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.4 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 8
3.5 Selected Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5.1 Naive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5.2 ARIMAX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5.3 Facebook Prophet . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5.4 XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

iv
3.5.5 LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.6 Evaluation of Model Performance . . . . . . . . . . . . . . . . . . . . 19
3.6.1 Cross-validation for Time-Series . . . . . . . . . . . . . . . . . 19
3.6.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 20
3.7 Weather as a Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Data 22
4.1 Included Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1.1 Coop Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1.2 SMHI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.3 Additional Data . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . 26
4.2.2 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.3 Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.4 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.5 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.6 One Hot Encoding . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.7 Data Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5 Method 32
5.1 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.1 ARIMAX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.2 Facebook Prophet . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.3 LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1.4 XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6 Result 36
6.1 Performance of Models . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.1.1 Naive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.2 ARIMAX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.1.3 Facebook Prophet . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1.4 LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.1.5 XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

7 Discussion 46
7.1 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

v
7.2 The Effect of Adding Weather as an Input . . . . . . . . . . . . . . . 48
7.3 Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

8 Further Studies 52

9 Appendices 53
9.1 Appendix 1: Available Coop Data . . . . . . . . . . . . . . . . . . . . 53
9.2 Appendix 2: Holiday Data . . . . . . . . . . . . . . . . . . . . . . . . 54

Bibliography 55

vi
List of Figures
1 A neural network with one hidden layer . . . . . . . . . . . . . . . . . 9
2 Example of a regression tree . . . . . . . . . . . . . . . . . . . . . . . 15
3 Unrolled form RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Single module RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5 Single module of an LSTM network . . . . . . . . . . . . . . . . . . . 17
6 Cross-validation for time series . . . . . . . . . . . . . . . . . . . . . . 19
7 % of mean sales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
8 Explanation of holiday variable . . . . . . . . . . . . . . . . . . . . . 25
9 % of mean aggregated weekly sales . . . . . . . . . . . . . . . . . . . 28
10 One hot encoding example . . . . . . . . . . . . . . . . . . . . . . . . 30
11 MAE for each week and model . . . . . . . . . . . . . . . . . . . . . . 37
12 SMAPE for each week and model . . . . . . . . . . . . . . . . . . . . 38
13 RMSE for each week and model . . . . . . . . . . . . . . . . . . . . . 38
14 % of mean sales for the naive model and one specific product and store 39
15 % of mean aggregated sales for the naive model . . . . . . . . . . . . 39
16 % of mean sales of the ARIMA and one specific product and store . . 40
17 % of mean sales aggregated ARIMA . . . . . . . . . . . . . . . . . . . 41
18 % of mean sales for Prophet and one specific product and store . . . 42
19 % of mean aggregated sales Prophet . . . . . . . . . . . . . . . . . . . 42
20 % of mean sales for LSTM Model and one specific product and store 43
21 % of mean sales aggregated LSTM . . . . . . . . . . . . . . . . . . . 44
22 % of mean sales for XGBoost Model and one specific product and store 45
23 % of mean sales aggregated XGBoost . . . . . . . . . . . . . . . . . . 45

vii
List of Tables
1 RMSE, MAE and SMAPE . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Available SMHI data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Example of lags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 The evaluated hyperparameters for ARIMAX . . . . . . . . . . . . . 33
5 The evaluated hyperparameters for Prophet . . . . . . . . . . . . . . 33
6 The evaluated hyperparameters for LSTM . . . . . . . . . . . . . . . 34
7 The evaluated hyperparameters for XGBoost . . . . . . . . . . . . . . 35
8 Model results (mean) . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
9 Model results (median) . . . . . . . . . . . . . . . . . . . . . . . . . . 37
10 The final hyperparameters for ARIMAX . . . . . . . . . . . . . . . . 40
11 The final hyperparameters for Prophet . . . . . . . . . . . . . . . . . 41
12 The final hyperparameters for LSTM . . . . . . . . . . . . . . . . . . 43
13 The final hyperparameters for XGBoost . . . . . . . . . . . . . . . . . 44

viii

1 Introduction
Developments in the field of machine learning and the increase of computational
power have led to the implementation of machine learning in various industries [1].
The retail industry is no exception. One of the applications of machine learning in
the retail industry is the use of advanced forecasting algorithms to better predict up-
coming sales and thus improve the ordering processes and the allocation of products.

Improved forecasting models for the retail industry can provide many benefits. For the end customer, product availability increases and the stores become a more reliable source of goods. For the stores, improved forecasting performance provides the ability to minimize waste due to overstocking, which has negative economic consequences; to maximize sales, as understocking could decrease sales due to a lack of product availability; and to improve the allocation of personnel. Thus, increased forecast precision could benefit the stores in multiple economic respects. Furthermore, in the case of grocery stores in Sweden, 30 000 tons of groceries were wasted during 2016 [2], so improved forecasts could also be beneficial environmentally. It is therefore evident that improved forecasts are desirable from multiple perspectives and for multiple stakeholders within the whole retail industry, but especially within grocery stores.

1.1 Background

Historically, forecasting has relied on the experience-based knowledge of the personnel. However, as grocery stores grow larger and contain a high number of products with different characteristics, knowledge-based forecasting becomes an increasingly difficult task. With the increased ability to gather data, it is possible to utilize data for the forecasts. Statistical models are often used to calculate how sales have behaved historically, and these statistics are then used in combination with experience to predict future sales. Now, with the increased computational power available, it could be possible to apply sophisticated machine learning models and rely on the data to a larger extent when predicting sales.

1.2 Research Objective

The primary objective of this thesis was to investigate which machine learning model
yields the best performance when forecasting sales for a given set of products and
stores. Utilizing data provided by Coop Värmland, forecasts were implemented for multiple stores and products to predict the quantity sold of each product in each store over a seven-day period. The overarching goal of this thesis was thereby to lower the amount of waste and increase product availability through improved forecasting models.

1.3 Problem Setting

Coop Värmland is one of the largest grocery store chains in the county of Värmland,
Sweden, and consists of over 60 stores of various sizes [3]. Currently, they utilize their
data to automate orders for a large set of products. However, for a set of products with a short expiration date, orders are placed manually, and since these orders could potentially be made more accurate, these products were the focus of this thesis. For these products, it was assumed to be optimal if all products were sold the same day as they were displayed in stores, since products become less desirable to the customer if stored longer. Thus, sales for these products should be forecast for each day between consecutive deliveries. In this case, the time between consecutive deliveries was assumed to be seven days.

As the project was performed based on data from Coop Värmland, the data was
biased towards this market. Other counties could have other characteristics and thus
other variables that would be necessary to include to fully understand why sales
increase or decrease in a general retail setting. As the scope of this thesis was limited
to Coop Värmland and grocery products with a short expiration date, the results are
not guaranteed to be viable for other sorts of products or other sorts of retail stores.

1.4 Programming Language

Python was used for data preparation, data analysis, and implementation of forecast
models. During these stages, multiple libraries were used, including Pandas, Numpy,
Scikit-learn, Keras, and TensorFlow.

1.5 Outline

The thesis is structured as follows: Chapter 2 presents a literature review of related


works, including the algorithms, features, and evaluation metrics used. Chapter 3 de-
scribes the theory behind the models that were chosen and used. In chapter 4 the
data is described in detail, including the pre-processing of the data. In chapter 5
the methodology of implementing the models is discussed. The results are then pre-
sented in chapter 6, followed by a discussion of the results and a conclusion in chapter
7. Chapter 8 proposes possible ideas for future work and how this project could be
continued and improved.

2 Literature Review
This chapter contains a brief overview of previous work related to the problem this
thesis was investigating. The main objective of this section was to understand the
current depth in this field, the amount of academic research existing, how that research
has been executed, and where there exist possible gaps in the literature. Furthermore,
a secondary objective was to dig deeper into the existing research to conclude which
algorithms, features, and evaluation metrics appear most frequently throughout the
literature.

2.1 Previous Work

A simple approach to understanding the current breadth of the academic literature within the area of this thesis was to utilize academic literature databases such as Web of
Science and Google Scholar. Introducing the keywords "food", "waste" and "ma-
chine learning" resulted in only seven hits within Web of Science and 23,000 hits at
Google Scholar. Comparing this to 146,365 and 3,100,000 hits when searching only
for "Machine Learning", and 20,780 and 255,000 hits for "Food Waste" at Web of
Science and Google Scholar respectively, it was evident that much research has been
done in related areas. Although many articles include the terms, few approached a problem equivalent to the one in this thesis. Thus, it is evident that a large amount of research has been done on limiting food waste and on machine learning individually. However, there has been a limited amount of research on how to limit food waste using machine learning with day-to-day forecasts. As seen from the individual searches, the limited number of hits does not correspond to a lack of knowledge. Instead, it indicates that the knowledge has not been thoroughly applied to this specific use case.

By broadening the search and focusing on the knowledge instead of the application,
there existed more academic research on algorithms and methods for the problem
at hand [4, 5, 6, 7, 8, 9, 10, 11]. Thus, the academic literature on time series and forecasting is thorough, and new algorithms and approaches are continuously developed to handle new problems. As this field expands continuously with new implementations and new algorithms, there is a need to compare newer algorithms to older ones to conclude whether improvements are occurring.

2.2 Algorithms Used in Previous Work

To forecast sales, several types of algorithms have been proposed with neural networks
and auto-regression being the most prominent. This is expected as the problem was,
in essence, a time series problem. ARIMA and Long Short Term Memory (LSTM)
have yielded much discussion and promising results in the academic literature [4, 5, 6].
However, regression models such as Lasso, support vector regression (SVR), and Random Forest have shown promising results as well, indicating that this approach could also yield prominent results [4, 7, 8]. Extreme gradient boosting (XGBoost) was published in 2016 by Tianqi Chen and Carlos Guestrin from the University of Washington [12], and since then it has proven to be a successful model for forecasting in data science competitions and recent literature [13]. Furthermore, Facebook Prophet, an additive time-series model, was published on GitHub in 2017 [14]. While Prophet lacks academic research, it has been actively used in the online communities with promising results [15].

2.3 Variables Used in Previous Work

Besides analyzing which models have shown the most promise, it is important to
analyze variables that could explain customer shopping habits and thus correlate with
the number of products sold. The relevance of weather as a predictor for sales has been shown throughout the literature [7, 16, 9, 17]. Different aspects of weather have been utilized, such as temperature, the amount of sunlight, and the amount of rain, and there is no consensus on which aspect is the most relevant. Furthermore, multiple studies have shown that calendar events and public holidays such as Christmas and Easter have a high correlation with sales [9, 10, 11]. A third variable that has been shown in the literature to correlate with the sales of a product is whether or not there is an ongoing promotion on the product in question [4, 11], presumably due to the lowered price and thus an increased demand. Lastly, previous work also suggests that the specific weekday correlates with the number of products sold and can thus be used to improve the performance of the models [4, 10, 11].

2.4 Evaluation Metrics Used in Previous Work

To evaluate and compare the different models fairly, the choice of evaluation metrics was important, as each metric has different characteristics. It was also important to include several metrics, since different metrics could expose different flaws or benefits in the models. Root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) have been used extensively in the academic literature and could therefore be deemed the most useful [6, 7, 18]. In addition, when analyzing the online communities, data science competitions, and sources outside the academic literature, it was clear that symmetric mean absolute percentage error (SMAPE) can be beneficial when comparing the models [19, 20]. By utilizing multiple performance metrics with different characteristics, as specified above, the chances of finding the best algorithm for a specific outcome are increased.

3 Theory
This chapter introduces the relevant theory and forms the foundation for subsequent
chapters. Firstly, the basic theory of time series, auto-regression, supervised learning,
and neural networks is introduced. Secondly, the models that were selected for this
thesis are presented and discussed. Thirdly, the evaluation methods and metrics are
discussed.

3.1 Time Series

When data is collected over time and time is an aspect of the data containing impor-
tant information, it is a time series. The order of the data is important as succeeding
data points can be correlated. Therefore, it is possible that previous values in the
time series can be a great predictor of the following ones. There are several examples
of time series, for example, sales data and weather data. [21]

3.2 Auto-Regression

In an auto-regressive model the predictions, ŷt, are based on a linear combination of past values yt. Thus, this is a regression model where previous values of the variable in question are used to predict the subsequent values. The model can be altered to include a pre-defined number of previous values. If the model utilizes p previous values, it can be written as:

ŷt = a1 yt−1 + · · · + ap yt−p + et . (3.1)

where ai are the coefficients, yt−i are the previous values of the variable, and et is Gaussian distributed white noise. The goal is to determine the coefficients ai, i = 1, . . . , p, such that the errors of the auto-regressive model are minimized. [22]
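As an illustration of (3.1), the coefficients of an AR(p) model can be estimated by ordinary least squares. The sketch below uses NumPy; the function and variable names are illustrative and not taken from this thesis, which relied on the library implementations of the models described later in this chapter.

import numpy as np

def fit_ar(y, p):
    # Design matrix: row t holds the lagged values y_{t-1}, ..., y_{t-p}
    X = np.column_stack([y[p - i: len(y) - i] for i in range(1, p + 1)])
    coeffs, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coeffs  # a_1, ..., a_p in equation (3.1)

def predict_next(y, coeffs):
    # One-step-ahead forecast: y_hat_t = a_1*y_{t-1} + ... + a_p*y_{t-p}
    p = len(coeffs)
    return float(np.dot(coeffs, y[-1:-p - 1:-1]))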

3.3 Supervised Learning

Supervised learning maps a set of inputs, often referred to as features, X, to a set


of outputs, often referred to as the target variable, Y. In this problem setting, the
target variable corresponds to a non-negative real value, thus the applied supervised
learning is a regression task. The model is constructed utilizing training data, which is a subset of the data containing prior observations. Each prior observation is a pair of inputs, xi ∈ X, and the observed target variable, yi ∈ Y. The goal is to construct a model that can utilize previously unseen inputs, x∗i, to predict an estimate of the target variable, ŷ, with minimal error. [23]

With time series modeling, the data used to train the models, the training data, has
to be data of prior dates compared to the test data due to the time dependency
of the data [24]. However, most machine learning models do not consider the time
of the observations when predicting the target variable as they are not explicitly
developed for time series. Observations of earlier dates are a powerful predictor, and by incorporating them as inputs for subsequent data points, the time series forecasting problem can be analyzed as a supervised machine learning problem [25].

3.4 Artificial Neural Networks

Artificial neural networks, often simply called neural networks, are a supervised machine learning method developed to mimic the network of neurons in the brain. A neural network is structured in several layers, where each layer contains a set of neurons. In a neural network, there is an input layer, one or several hidden layers, and an output layer. However, the configuration of how data is transported from layer to layer can differ depending on which neural network model is used. In Figure 1, a simple feedforward neural network is displayed, where the output from each layer is the input to the subsequent layer. [26]

Figure 1: A neural network with one hidden layer

Figure 1 displays a neural network with one hidden layer. In the displayed neural
network, the inputs go through the input layer and are given individual weights,
wi,j . The weighted outputs from the input layer are then combined as inputs to the
subsequent layer, the hidden layer. Within the hidden layer, the values in each neuron are transformed by an activation function, for example to a value between zero and one. The choice of activation function can differ depending on the task at hand but is most commonly sigmoid, ReLU, or tanh. The output from these neurons is then transported
to the next layer, which in this example is the output layer. The predicted target
value is then calculated based on the weights (and biases) within this output layer
and outputted as ŷ. If this small example were to be expanded with several hidden
layers, the process of weighting the inputs and combining them would be replicated
through each added layer. Regardless of the number of hidden layers, the goal is to
minimize the chosen error metric by tuning the weights and biases [27].
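As a minimal illustration of the forward pass described above, the sketch below computes the output of a network with one hidden layer and sigmoid activations; the weight matrices, biases, and shapes are hypothetical and not taken from the models used in this thesis.

import numpy as np

def forward(x, W1, b1, W2, b2):
    # Hidden layer: weighted sum of the inputs followed by a sigmoid activation
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
    # Output layer: weighted sum of the hidden activations gives the prediction y_hat
    return W2 @ h + b2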

To optimize the performance of the neural network, backward propagation of errors,


often denoted simply as backpropagation can be used. Given an error function, the
gradient of the error function is calculated based on the weights of the neural net-
work. The gradients are calculated backward through the network, with the gradients
of the first layer being calculated last. As the error of the model flows backward in
the model, instead of each layer being calculated independently, backpropagation is
computationally more efficient. [28]

Depending on the problem, different types of neural networks might be suitable.


When dealing with time-series data, a neural network configuration which can utilize
previously seen data points is presumably the best.

3.5 Selected Models

The models that this thesis utilized were ARIMAX, LSTM, XGBoost, and Facebook
Prophet as well as a naive model based on mean values of sales for each day of the
week, product, and store. Each model is described in detail in the following sections.

3.5.1 Naive Model

The naive model in this thesis was based on the assumption that each day of the
week has the same quantity of sold products, independent of the week for each com-
bination of store and product. As all other variables are, in this model, assumed to
have no effect and the future is assumed to have sales equivalent to the past, this model is a
naive approach to forecasting. The estimate, ŷ, for an individual product, store, and
weekday was calculated as the mean value of the previously observed values, yi .

Thus for a given product and store, the prediction, ŷj , for each day of the week was
given by

\[ \hat{y}_j = \frac{1}{k}\sum_{i=1}^{k} y_{j,i}, \qquad j = 1, 2, \ldots, 7. \tag{3.2} \]

where yj,i is the i:th observed target value for day of the week j, and k is the number of prior observations of that day of the week in the training data.

Naturally, this model cannot predict when sales increase or decrease over time as the
model is not dependent on time. However, the model can serve as a baseline for other
models to be evaluated against.
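A sketch of how such a baseline can be computed with pandas is shown below; the column names and the split date are assumptions, not taken from the thesis.

import pandas as pd

# sales: one row per (date, store, product) with the aggregated quantity sold that day
train = sales[sales['date'] < '2019-12-01'].copy()            # illustrative training cut-off
train['weekday'] = pd.to_datetime(train['date']).dt.dayofweek

# Mean quantity per store, product and day of week, i.e. equation (3.2)
naive_forecast = (train
                  .groupby(['store', 'product', 'weekday'])['quantity']
                  .mean()
                  .rename('y_hat')
                  .reset_index())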

3.5.2 ARIMAX

ARIMAX is an auto-regressive model based on the auto-regressive integrated moving


average (ARIMA) model. ARIMAX is an extension of the ARIMA model, adding
exogenous variables as inputs. Furthermore, the ARIMA model extends the ARMA model with the ability to handle non-stationary time series [29]. Thus, to understand ARIMAX it is important to understand the underlying ARMA model.

The ARMA model is denoted ARMA(p, q), where p denotes the number of previous time series observations that the estimate, ŷt, depends on, and q + 1 denotes the number of error terms that the model includes, et, et−1, . . . , et−q. Here et is Gaussian distributed white noise, ai, i = 1, . . . , p, are the auto-regressive (AR) coefficients, and bj, j = 1, . . . , q, are the moving average (MA) coefficients. [30]

The prediction for ŷ using the ARMA model is therefore

ŷt = a1 yt−1 + · · · + ap yt−p + et + b1 et−1 + · · · + bq et−q . (3.3)

The ARIMA model extends the ARMA model by adding a component to handle
non-stationary time series. The ARIMA model is denoted as ARIMA(p,d,q), where
d denotes the number of times the time series is differentiated until made stationary.
When the time series is made stationary, the ARMA(p,q) model is used for predic-
tions. [30]

The ARIMAX model adds additional exogenous variables, Xt , for each time step to
the ARIMA model.

\[ X_t = [x_t^1, x_t^2, \ldots, x_t^m]^T \tag{3.4} \]

Where m is the number of exogenous variables for each time step.

Multiplying the exogenous variables with a row vector, β, containing the coefficients for each exogenous variable, and adding this to the ARIMA prediction, we get

ŷt = βXt + a1 yt−1 + · · · + ap yt−p + et + b1 et−1 + · · · + bq et−q . (3.5)

By incorporating additional explanatory variables it is possible to increase the pre-


dictive power of the model as more complex behaviors of customer shopping habits
can be modeled.
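The thesis does not state which ARIMAX implementation was used; one common option is the SARIMAX class in statsmodels, which reduces to ARIMAX when the seasonal terms are left at their defaults. A sketch, where y, X, and X_future are assumed to be prepared beforehand:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# y: daily sales for one product and store, X: exogenous variables aligned with y,
# X_future: exogenous variables for the seven days to forecast (all assumed inputs)
model = SARIMAX(y, exog=X, order=(p, d, q))     # (p, d, q) chosen via hyperparameter search
result = model.fit(disp=False)
forecast = result.forecast(steps=7, exog=X_future)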

3.5.3 Facebook Prophet

Facebook Prophet is an additive and decomposable model with three main compo-
nents. The trend, denoted g(t), models non-periodic changes in the time series, for example a linear growth over time. The seasonality, denoted s(t), models the periodic changes in the time series, for example weekly, monthly, or yearly changes in sales. The holiday component, denoted h(t), models the effects of irregular events such as holidays [31]. Combining the components with a Gaussian distributed white noise, et, the following equation is obtained

y(t) = g(t) + s(t) + h(t) + et . (3.6)

Trend can be modeled in two different ways in Prophet, either by a piece-wise linear
model or a saturating growth model.

The piece-wise linear model is given by

g(t) = (k + a(t)T δ)t + (m + a(t)T γ). (3.7)

where the growth rate is denoted by k, the rate adjustments are denoted by δ, γ is set to make the function continuous, and m is an offset parameter.

The saturating growth model is given by

\[ g(t) = \frac{C}{1 + \exp(-k(t - m))}. \tag{3.8} \]

Where C is the carrying capacity, k is the growth rate, and m is an offset parameter.

Seasonality is modelled with Fourier series. Smooth seasonal effects are approximated
by
\[ s(t) = \sum_{n=1}^{N} \left( a_n \cos\left(\frac{2\pi n t}{P}\right) + b_n \sin\left(\frac{2\pi n t}{P}\right) \right). \tag{3.9} \]

Where P is a regular period expected in the data.

Fitting the seasonal components requires estimation of a1 , . . . , aN and b1 , . . . , bN .


Therefore a matrix consisting of seasonal vectors is constructed for each historic and
future time value in the data. For yearly seasonality and N = 10, this becomes

\[ X(t) = \left[ \cos\left(\frac{2\pi (1) t}{365.25}\right), \ldots, \sin\left(\frac{2\pi (10) t}{365.25}\right) \right]. \tag{3.10} \]

An increased N results in the ability to model faster-changing seasonality effects.


However, it also increases the risk of overfitting.

The seasonal component is then

s(t) = X(t)β (3.11)

Where β is normally distributed N (0, σ 2 ) to impose a smoothing prior on the season-


ality.

Holidays are modeled by an indicator function. Assume that L is the number of


holidays included, then

Z(t) = [1(t ∈ D1 ), . . . , 1(t ∈ DL )]. (3.12)

Holidays are assumed to not only affect the explicit day but also surrounding days.

Therefore, a prior is used, such that

h(t) = Z(t)k. (3.13)

where k is normally distributed N(0, σ2). It is important to note that the holiday function does not need to contain explicit holidays only, but can also include other events affecting sales, such as sports events.
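A minimal sketch of how Prophet is typically applied to this kind of problem is given below. The data frame layout follows Prophet's convention of a 'ds' date column and a 'y' target column; the holidays data frame is assumed to contain the Easter, Midsummer, and Christmas dates together with their surrounding days.

from fbprophet import Prophet    # newer releases are imported as: from prophet import Prophet

# df: columns 'ds' (date) and 'y' (quantity sold); holidays: columns 'holiday', 'ds',
# 'lower_window', 'upper_window' covering the events and their surrounding days
m = Prophet(growth='linear', holidays=holidays,
            yearly_seasonality=True, weekly_seasonality=True)
m.fit(df)
future = m.make_future_dataframe(periods=7)     # forecast seven days ahead
forecast = m.predict(future)[['ds', 'yhat']]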

3.5.4 XGBoost

XGBoost is an abbreviation of extreme gradient boosting and is based on the gradient


tree boosting methods [12]. Thus it is important to introduce gradient boosting to
understand XGBoost. Gradient boosting is an ensemble machine learning technique
used to combine weak learners into a strong learner through an iterative approach.
Typically, weak learners are decision trees or regression trees. For a dataset with m
features and N number of samples we have

D = {(xi , yi )}(|D| = N, xi ∈ Rm , yi ∈ R). (3.14)

A tree ensemble model uses K additive functions to predict the output

\[ \hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}, \tag{3.15} \]
\[ \text{where } \mathcal{F} = \{ f(x) = w_{q(x)} \} \quad (q : \mathbb{R}^m \to T,\ w \in \mathbb{R}^T). \tag{3.16} \]

Here F denotes the space of regression trees, and within F, q represents the structure of each tree. T is the number of leaves, and each fk corresponds to an independent tree structure q with leaf weights w. The weight of each leaf can be understood as a score for that leaf; thus, wi is the score of the i:th leaf.

In Figure 2 an example of a possible regression tree is displayed.



Figure 2: Example of a regression tree

As functions are used as parameters, this model cannot be optimized using traditional
methods, instead, it has to be trained additively.

The prediction of the i:th instance at the t:th iteration is denoted as ŷit . ft is added to minimize the equation below and is chosen in a greedy manner such that the improvement of the model is maximized.

\[ \mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k). \tag{3.17} \]

Here l is the loss on the training data, a differentiable convex function measuring the difference between ŷi and yi. The loss function is most commonly a squared or logistic loss and depends on the problem. Ω is the regularization term and measures the model complexity. It is added to avoid overfitting by smoothing the final weights. When the regularization is set to zero, the model defaults to regular gradient boosting.

XGBoost improved on regular gradient boosting by utilizing second-order derivatives


of the loss function to gain information about the gradient descent direction. In con-
trast, regular gradient boosting uses the loss function of the base model for minimizing
the error of the model. As presented, L1 and L2 regularization are implemented to
improve model generalization. Furthermore, hardware optimization and parallelization lower the model training time significantly [12]. The increased computational
efficiency is what extreme gradient boosting refers to, however, given the nature of
the model it has also been referred to as regularized gradient boosting [32].
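In practice, XGBoost is usually applied through its scikit-learn style interface. The sketch below is illustrative only; the hyperparameter values are placeholders and not the ones selected in this thesis (those are given in Table 13).

import xgboost as xgb

# X_train, y_train: engineered features (lags, weekday, holiday and payday flags, weather)
# and the corresponding daily quantities; all values below are placeholders
model = xgb.XGBRegressor(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    reg_lambda=1.0,                  # L2 regularization, part of the Omega term in (3.17)
    objective='reg:squarederror',
)
model.fit(X_train, y_train)
y_hat = model.predict(X_test)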

3.5.5 LSTM

LSTM is an acronym for long short-term memory and is an artificial neural network
that is based on a recurrent neural network (RNN) architecture. Unlike other com-
mon neural network architectures, RNNs are capable of keeping information from
previous events. This architecture makes RNNs suitable for problems with sequences
of data such as time series as it can store information from previous time steps. How-
ever, when the dependencies span a long period of time this information can be lost. Although RNNs are, in theory, capable of learning long time dependencies, in practice it can be difficult due to either vanishing or exploding gradients [33]. The
unrolled form of RNN can be seen in Figure 3, where xt is the input, and ht is the
output for each time step. Each module, A, can be viewed independently as seen in
Figure 4.

Figure 3: Unrolled form RNN

Figure 4: Single module RNN



LSTM was developed to better store information for a longer period of time or when
the time dependencies are of unknown duration [34]. In each repeating module, there
are four interacting neural network layers, instead of one, as in a regular RNN. LSTM
contains a cell state which can maintain information over time. The cell state consists
of a cell state vector and a gating unit which regulates the information held in this
memory over longer periods of time. The gates control which information should be
kept and which should be removed by utilizing a sigmoid neural net layer and a point-
wise multiplication operation. The information is then scaled, based on the relevancy
of the information, to a value between zero and one [35]. A descriptive picture of a
single module can be seen in Figure 5.

Figure 5: Single module of an LSTM network

The first step in LSTMs is the "forget gate layer". This gate is controlled by a sigmoid
layer which decides which information should be kept. For each component in Ct−1
the sigmoid layer outputs a value between zero and one based on the input xt and
ht−1 . The activation vector of the forget gate is given by

ft = σ(Wf ∗ [ht−1 , xt ] + bf ). (3.18)



The following step is to decide which information should be kept in the cell state. The
"input gate layer" consists of a sigmoid layer and decides which information should
be updated. It is followed by a tanh layer that determines candidate values, C̃t , which
can be added to the cell state. The activation vectors are given by

it = σ(Wi ∗ [ht−1 , xt ] + bi ), (3.19)

C̃t = tanh(WC ∗ [ht−1 , xt ] + bC ). (3.20)

These steps are followed by an update in the cell state, Ct . The old state, Ct−1 is
multiplied with the forget gate’s activation vector, and the new candidate values, C̃t ,
are multiplied with the input gate's activation vector, it . Thus, both the old cell state
and the new candidate values are scaled by their importance.

Ct = ft ∗ Ct−1 + it ∗ C̃t . (3.21)

Lastly, the output is decided. The cell state information is put through an activation
function, commonly tanh, and a sigmoid layer filters this information such that it can
be outputted.

ot = σ(Wo ∗ [ht−1 , xt ] + bo ), (3.22)

ht = ot ∗ tanh(Ct ). (3.23)

This output and the current cell state are then transferred to the next module in the LSTM model and are used for subsequent predictions.
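A minimal Keras sketch of an LSTM regressor of the kind described above is shown below; the layer size, loss, and training settings are placeholders rather than the values used in this thesis (see Table 12 for those).

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# X_train has shape (samples, timesteps, features): each sample is a window of past
# observations; y_train holds the quantity to predict for the following day
model = Sequential([
    LSTM(64, input_shape=(X_train.shape[1], X_train.shape[2])),
    Dense(1),
])
model.compile(optimizer='adam', loss='mae')
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
y_hat = model.predict(X_test)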

3.6 Evaluation of Model Performance

In this section, the theory of cross-validation for time-series is presented. This method
of evaluation lays the foundation for how the results were obtained. Subsequently,
the evaluation metrics used in combination with these methods of evaluation are
presented and discussed.

3.6.1 Cross-validation for Time-Series

For a time-series problem, it is important that the unseen data, the testing data,
are of later dates than the training data, due to the time-dependency of the data.
Furthermore, as the purpose is to predict one week into the future, and predictions over a longer period can degrade the performance, the evaluation method has to be adapted.

A method of adapting cross-validation for time series is by dividing the test data into
several subsets in chronological order, each with a size corresponding to the real-life
scenario, in this case, seven days. The training data is then used to predict the first
subset of seven days. This subset is then added into the training data to predict the
consecutive subset of seven days and the model is updated and trained again. See
Figure 6 for an overview of this methodology. [36]

Figure 6: Cross-validation for time series

The blue cells denote training data, the red cells denote test data, and the grey cells denote data that is not used in that iteration. Thus, the models are tested on the test data in a way that resembles the real-life scenario, where each week would yield new data used for the subsequent week. Note that in this case each cell denotes a seven-day period, and the sales for each individual day of a test period are predicted.

The overall performance of the models is then calculated as the mean and median
values of the chosen performance metrics over all iterations. Assume that the testing data is divided into n subsets with performance pi for each subset i = 1, 2, . . . , n. The overall performance of the model is then given by:

\[ P_{\mathrm{mean}} = \frac{1}{n}\sum_{i=1}^{n} p_i, \tag{3.24} \]
\[ P_{\mathrm{median}} = \mathrm{Median}(p_i). \tag{3.25} \]

It is important to utilize both the median and the mean value of the performance
metrics since the results of one, or several, weeks of the testing data could skew
the results of the mean. However, the median of the performance metrics does not
consider the results during all weeks and could, therefore, depict a glorified version
of the results.
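A sketch of this rolling evaluation is given below. The fit_and_forecast function is a placeholder for training any of the models on the data available before a test block and forecasting that block; the column names are assumptions.

import numpy as np

scores = []
for block in np.array_split(np.asarray(test_dates), len(test_dates) // 7):
    train_df = data[data['date'] < block[0]]                  # only earlier dates are used
    actual = data[data['date'].isin(block)]['quantity'].to_numpy()
    predicted = fit_and_forecast(train_df, forecast_dates=block)
    scores.append(np.mean(np.abs(actual - predicted)))         # MAE for this 7-day block

p_mean, p_median = np.mean(scores), np.median(scores)          # equations (3.24) and (3.25)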

3.6.2 Evaluation Metrics

Three evaluation metrics were used, MAE, RMSE, and SMAPE, each with its own characteristics. In this setting, MAE is the most basic metric as it corresponds to the
actual wasted goods or sale opportunities. RMSE will enlarge the effect of larger
absolute errors and SMAPE is better protected against outliers. Large absolute er-
rors can be seen as detrimental due to the large economic effect for a retail store.
However, the large absolute errors can also be seen as coincidental events that were
not preventable. Furthermore, it is important to utilize several evaluation metrics
since the target variable is of various sizes. For products with a low average number
of products sold, an absolute error will yield a larger percentage error compared to
the same absolute error for a product with a large average quantity sold. Thus, MAE, RMSE, and SMAPE cover multiple aspects of evaluation for the
forecast. For a complete overview of these metrics, see Table 1, where et is the error,
yt is the target value, and n is the number of observations in the test data.

Table 1: RMSE, MAE and SMAPE

Root mean squared error: \( \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} e_t^2} \)

Mean absolute error: \( \mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} |e_t| \)

Symmetric mean absolute percentage error: \( \mathrm{SMAPE} = \frac{1}{n}\sum_{t=1}^{n} s_t \), where \( s_t = \frac{|e_t|}{|y_t| + |\hat{y}_t|} \) if \( |y_t| + |\hat{y}_t| \neq 0 \) and \( s_t = 0 \) otherwise.
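The three metrics in Table 1 can be implemented directly, for example with NumPy as sketched below (the SMAPE variant follows the table, i.e. without the conventional factor of two in the denominator).

import numpy as np

def rmse(y, y_hat):
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean(e ** 2))

def mae(y, y_hat):
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(e))

def smape(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    denom = np.abs(y) + np.abs(y_hat)
    ratio = np.zeros_like(denom)
    mask = denom != 0                      # terms with y_t = y_hat_t = 0 contribute zero
    ratio[mask] = np.abs(y - y_hat)[mask] / denom[mask]
    return np.mean(ratio)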

3.7 Weather as a Predictor

Utilizing weather as a predictor in machine learning models can increase the predictive performance and thus improve the results, according to the academic literature
[7, 16, 9, 17]. As forecasting involves predicting future behaviors and events, the uti-
lization of weather as a feature introduces uncertainty in the data due to the weather
features being based on forecast rather than actual weather data. The inclusion of
weather as a feature should, therefore, be done carefully. If the weather forecast is not
accurate for a given day, the sales on that day could potentially increase or decrease
leading to increased waste or missed sales opportunities. However, weather forecasts are relatively accurate when forecasting only a few days into the future. A five-day prediction is correct about 90% of the time, while a seven-day prediction is around 80% accurate [37]. This indicates that weather could presumably, with quite
a high certainty, be utilized to improve the predictive performance of the models.

4 Data
This chapter describes in detail what data was collected and how it was utilized. This
includes a description of the data used, the data aggregation procedure, missing values
handling, feature engineering, and One Hot Encoding.

4.1 Included Data

As this project, like machine learning projects in general, utilizes large sets of data, the data has to be carefully explained in order to simplify reproducibility and thus improve the validity of the project. The available raw data used within this project could be divided into three separate categories: sales data available from Coop Värm-
land, weather data gathered from SMHI, and additional data such as calendar events.

4.1.1 Coop Data

The raw data that was made available to this project included Point of Sale (POS)
data. This corresponded to detailed receipt information for each sale made in stores
belonging to Coop Värmland, according to a specific selection. This included two
years of data from four different stores, all of similar size with regard to their total amount of sales and all in proximity to each other. To capture the long
term effects within the sales patterns, additional years of data would have been favor-
able. As there was not an initial selection concerning the specific products or product
categories, some assumptions had to be made to reduce the number of products to
a manageable amount with regard to the total data size. To achieve this, five different products from each of the categories bread, charcuterie, dairy, vegetables, and cheese that satisfied the two additional criteria below were chosen to be included in this project.

• Criteria 1: The product should have a relatively short shelf life, which means that Coop Värmland places and plans those orders manually, without any external forecasting models. It is preferable that these particular products are sold within the same day as they are displayed; thus, accurate forecasts on a daily basis are a necessity.

• Criteria 2: The products should have continuous, or close to continuous, sales in


every store throughout the two years. One challenge with forecasting product
sales is that products are often altered or changed, and thus become another
article in the data.

Within the POS data there existed several columns of information: which product was sold, at which store the product was sold, on which date the corresponding sale occurred, how much of the product was sold, and whether there was any type of discount or not. The specific products that were used in this thesis are presented in Appendix 1.

4.1.2 SMHI

The utilized weather data was gathered from the organization SMHI [38]. There existed a large number of different aspects of weather within this database, including amount of rain, mean temperature, minimum temperature, maximum temperature, amount of sun hours, and mean wind speed for all separate days within the two-year time interval used within this project. As the POS data included sales from four different stores, all of these weather variables had to be collected from their respective closest weather stations, which should also be reasonably close to the stores. This resulted in rain and temperature information from Arvika, Karlstad, and Kristinehamn, and sun-hour information from Karlstad. While heavy wind could potentially affect the sales, wind information had to be disregarded, as the only weather station that collected that information in proximity to the four stores was closed during a longer time period. In Table 2 the stations for each weather variable are displayed. Two of the stores were in proximity to each other and therefore used the same stations for all the variables.

Table 2: Available SMHI data

Weather Variable    Station Location(s)


Rain Arvika, Karlstad and Kristinehamn
Mean Temperature Arvika, Karlstad and Kristinehamn
Minimum Temperature Arvika, Karlstad and Kristinehamn
Maximum Temperature Arvika, Karlstad and Kristinehamn
Sun Hours Karlstad

4.1.3 Additional Data

As the academic literature showed that including holidays and other special calendar
events could be beneficial when forecasting sales, this data needed to be gathered
separately as well. Based on an overview of the aggregated sales for all products and
stores during the two-year time period in question, sales spikes were observed during Easter, Midsummer, and Christmas. These holidays were therefore included as a variable to be able to predict the increased sales during these holidays. See Figure 7 for an overview of the aggregated sales.

Figure 7: % of mean sales



As people do not necessarily make their purchases for Christmas on Christmas Day,
the sales were not only affected on the day in question but also on all the surrounding
days. This phenomenon was present for all holidays, as during the holiday itself peo-
ple do not traditionally shop. For Easter, Midsummer, and Christmas this was solved
by including the dates two days before the day in question, the day in question, and
two days after the day in question as different variables. Thus, there existed several
categorical holiday variables denoting whether a date was within two days of a holiday. For Christmas and Easter this meant that six and eight dates, respectively, were
used as there existed several Swedish public holidays within those events, and for
Midsummer which only had one Swedish public holiday, only five dates were included
[39]. In Figure 8 the red block denotes the public holiday. The orange blocks sur-
rounding the holiday are also modeled as separate variables.

Figure 8: Explanation of holiday variable

As there was not a sufficient amount of data to model the holidays as different vari-
ables, all holidays were modeled within the same holiday variable. Thus, a binary
variable was introduced that denotes whether or not each date was a holiday. The variables denoting whether a date was one or two days before or after a holiday were modeled equivalently with binary variables. See Appendix 2 for an overview of all holiday
dates, including surrounding days, utilized in the holiday feature.

By analyzing the data, it was possible to determine the possible effects and patterns
of certain events. For multiple months, one of the days with the largest amount of
sold products was on the payday. The number of products sold during a payday
was on average 20 percent larger than the average number of products sold during
non-paydays. When comparing the number of products sold during a payday and the
number of products sold on dates later than the 25th, the difference was 17 percent. Thus, it was probable that the payday could have a large effect on sales. In Sweden the most common payday is the 25th or, if the 25th falls on a holiday or weekend, the closest working day before the 25th. As the payday can contribute to an increase in sales, it was included as a binary variable. It should, however, be noted that the effects can differ depending on the weekday on which the payday occurs.
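A sketch of how these binary calendar features can be constructed with pandas is shown below; the lists of payday and holiday dates are assumed to have been prepared separately (the holiday dates being those listed in Appendix 2), and the column names are illustrative.

import pandas as pd

calendar['date'] = pd.to_datetime(calendar['date'])
calendar['payday'] = calendar['date'].isin(pd.to_datetime(paydays)).astype(int)
calendar['holiday'] = calendar['date'].isin(pd.to_datetime(holidays)).astype(int)

# One binary column per offset around a holiday (two days before up to two days after)
for offset in (-2, -1, 1, 2):
    shifted = pd.to_datetime(holidays) + pd.Timedelta(days=offset)
    calendar[f'holiday_{offset:+d}d'] = calendar['date'].isin(shifted).astype(int)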

4.2 Data Processing

As the raw data could not be directly used for the different models it had to be pre-
processed. Transforming all the previously mentioned raw data into usable datasets
involved multiple steps: exploratory data analysis; aggregating the daily sales; han-
dling possible missing values; transforming the features to become more useful; se-
lecting the features that actually help explain the quantity sold; making the features
usable within the models by One Hot Encoding; and lastly splitting the dataset into
a train, a validation, and a test dataset.

4.2.1 Exploratory Data Analysis

In order to fully understand the available dataset, an initial data analysis had to be
completed. The objective of this data analysis was to get a deeper understanding of how the different variables behaved throughout the two-year time period, how the sales of different products compared to each other, what trends and patterns existed, for example weekly and monthly trends, and lastly to detect possible missing
values and outliers. By performing an exploratory data analysis it could be possible
to find underlying patterns that are not apparent when looking at the data. Thus,
this is a critical step to determine if the data has to be manipulated before being
applied and if it was possible to extract more information that could be utilized. [40]

4.2.2 Aggregation

The second step was to aggregate the daily sales. As there could occur hundreds of
individual purchases of one particular product at one particular store each day, the
data needed to be aggregated so that it showed the total sold quantity every day for
each product and store combination. As there were four stores with 25 products each,
this yielded 100 rows for each date within the two-year time period. Some products
could, presumably, have been sold both in units of weight and as individual items. However, when aggregating, all products were assumed to be of the same unit. Thus, the aggregated quantity for each product was either a weight or a number of items. In the aggregated data, there existed rows where the information regarding Type of Discount could have multiple values. This occurred if the product was sold both with and without a promotion during that day. Multiple values could be recorded when the sale applied only to a specific group or to special deals depending on how much the person bought. However, in this project, the promotion of a product in an individual store
was modeled as a binary variable. Therefore, if the product was on a promotion in a


store, for any customer, it was determined to be on promotion. The POS data and
additional data were then merged based on dates while the weather data were merged
based on both dates and locations.
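A sketch of this aggregation step in pandas is shown below; the column names of the raw POS data are assumptions.

import pandas as pd

# pos: one row per receipt line with columns date, store, product, quantity, discount_type
daily = (pos
         .groupby(['date', 'store', 'product'], as_index=False)
         .agg(quantity=('quantity', 'sum'),
              on_promotion=('discount_type', lambda s: int(s.notna().any()))))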

4.2.3 Missing Values

After the aggregation was completed, the next step was to handle missing values in
each corresponding feature. There existed a few missing values in two variables. Ei-
ther there was missing weather data due to the weather station not being operational
or there were no recorded quantities sold for an individual product. The missing
values within the weather data were imputed using the mean values of that feature.
Missing values regarding sold quantity could have two underlying reasons: either there was no sale of that product at that store on that particular date, or the information was missing. Furthermore, it could also be due to the supplier not producing a sufficient amount of products, in which case no products could have been sold although the demand was presumably similar to other dates. To handle this, all periods of one or two days in a row with missing values were imputed as zero. Periods of three or more missing days in a row were imputed using a rolling mean function that calculated the mean value of that particular product at that particular store during the last 30-day window, starting from one week before the missing day in question. [41]
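A simplified sketch of this imputation rule for a single product and store series is given below, using pandas; the exact window handling in the thesis may differ.

import pandas as pd

# s: daily quantity for one product and store, indexed by date, with NaN on missing days
gap_length = s.isna().groupby(s.notna().cumsum()).transform('sum')   # length of each gap
s = s.mask(s.isna() & (gap_length <= 2), 0)                          # short gaps become zero
rolling_mean = s.shift(7).rolling(window=30, min_periods=1).mean()   # 30-day mean, one week back
s = s.fillna(rolling_mean)                                           # longer gaps get the mean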

4.2.4 Feature Engineering

A feature that showed promise in the academic literature was the weekday on which the
sales occurred. As this feature was not explicitly part of the raw data, it had to be
engineered from the dates and then merged into the dataset. According to previous
research, this feature can have high explanatory power since sales often follow weekly
trends; the sales occurring on a Monday could therefore be correlated with the
previous Monday and the Monday before that. Furthermore, sales tend not to be
uniformly distributed over the weekdays, so the feature could potentially be used as a
predictor to increase the performance of the models. [4, 10, 11]

In order to verify that these observations applied to this specific data, an
aggregation over the weekdays was conducted, which clearly showed that sales were not
uniformly distributed over the week. More precisely, there were two spikes, on Fridays
and Tuesdays, with Friday being the largest. This may correspond to the fact that
people often prepare for the weekend by shopping on Fridays. Sunday was the day with
the lowest quantity sold, which may be related to the fact that stores often receive
new inventory on weekdays. These aggregated weekly sales can be seen in Figure 9.

Figure 9: % of mean aggregated weekly sales

As sales forecasting revolves around time series analysis, another feature that could
presumably be beneficial for some machine learning models is previous values of the
quantity itself, in the form of lags [4, 42, 43]. These lags represent previous values
and were derived by copying the quantity and shifting it the desired number of days
forward; see Table 3 for an example. The example also makes clear that if one chooses
a lag of -3, the first three rows would get missing values and thus have to be removed
or imputed. In total, 16 lag features of three types were created. Firstly, ten daily
lags following the same logic as the example in Table 3. Secondly, four weekly rolling
sums: the total quantity sold during each of the four preceding weeks. Thirdly, two
rolling means: the mean quantity sold during the last 4 and 8 weeks. As 8 weeks is 56
days, the first 56 days were removed from the dataset. These features were created to
incorporate and capture the seasonality effects; a code sketch of the lag construction
follows Table 3.

Table 3: Example of lags

Quantity   Lag -1   Lag -2   Lag -3
5          -        -        -
6          5        -        -
8          6        5        -
1          8        6        5
6          1        8        6
9          6        1        8
8          9        6        1
4          8        9        6
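
Below is a minimal sketch, assuming a pandas frame with one row per day for a single product-store pair, of how the daily lags, weekly rolling sums, and rolling means described above could be derived; the column names and synthetic data are illustrative.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=120, freq="D"),
    "quantity": rng.integers(0, 20, 120),
})

# Ten daily lags: the quantity sold 1 to 10 days earlier.
for lag in range(1, 11):
    df[f"lag_{lag}"] = df["quantity"].shift(lag)

# Four weekly rolling sums: the total quantity sold in each of the four
# preceding (non-overlapping) weeks.
weekly_sum = df["quantity"].rolling(window=7).sum()
for week in range(1, 5):
    df[f"sum_week_{week}"] = weekly_sum.shift(7 * (week - 1) + 1)

# Rolling means of the quantity sold over the last 4 and 8 weeks.
df["mean_4w"] = df["quantity"].shift(1).rolling(window=28).mean()
df["mean_8w"] = df["quantity"].shift(1).rolling(window=56).mean()

# The longest window leaves NaNs in the first 56 rows, which are dropped,
# mirroring the preprocessing described above.
df = df.iloc[56:].reset_index(drop=True)
print(df.head())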

4.2.5 Feature Selection

Including a large number of features in a machine learning model can be beneficial or
detrimental depending on the model and the computational power available. More
features can improve model performance since more information is available to the
model. However, if the included features do not contribute sufficient predictive
power, this can instead result in overfitting and an unnecessary increase in
computation. Overfitting can also occur if the number of features is large compared to
the number of data points, since this results in a large variance; the increased
errors then stem from the model's sensitivity to the training data. This thesis aimed
to build forecasting models that could be beneficial for multiple stores, not only the
four chosen ones. The models therefore needed to be tested both with and without the
weather data, as not all stores would have access to nearby weather stations. It was
also of interest to analyze whether the weather data would increase the variance of
the models or aid their predictive power. To analyze these effects, two separate
feature sets were tested and evaluated in this project, differing only in the
inclusion of the weather data gathered from SMHI. These are denoted FSW and FS,
respectively, throughout the rest of this report.

4.2.6 One Hot Encoding

Some machine learning models cannot handle categorical data directly. Instead, they
would interpret it as numerical data, leading to misinterpretations and worse
predictions. Consider, for example, the weekday feature, ranging from 0 to 6, where 0
corresponds to Monday and 6 to Sunday. If a model used this single variable directly,
it would interpret Tuesday as greater than Monday, Wednesday as greater than Tuesday,
and so on. This is meaningless since the values cannot be compared in that way, and
such data therefore had to be modeled as categorical variables.

One method to transform a categorical variable with multiple categories is one-hot
encoding (OHE). This transformation converts each categorical feature with r
categories into r new binary features. An observation belonging to category j then
gets a 1 in feature column j and a 0 in every other column. For two categorical
features with three and six categories, respectively, OHE thus creates nine new
features. The original, non one-hot encoded variable is then removed from the dataset.
See Figure 10 for an example. [44]

Figure 10: One hot encoding example
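
As a small illustration, the transformation can be done with pandas' get_dummies, although the project code may have used a different implementation; the example frame below is synthetic.

import pandas as pd

df = pd.DataFrame({
    "weekday": [0, 1, 6, 4],                       # 0 = Monday, ..., 6 = Sunday
    "holiday": ["None", "Easter", "None", "Christmas"],
    "quantity": [12, 7, 3, 25],
})

# Each categorical feature with r categories becomes r binary columns,
# and the original columns are dropped.
encoded = pd.get_dummies(df, columns=["weekday", "holiday"])
print(encoded)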



4.2.7 Data Split

In order to use the prepared data in the machine learning models and be able to
evaluate their performance, the data had to be split into three separate datasets: a
training, a validation, and a testing set. The training data covered 16 months, the
validation data 2 months, and the test data 4 months. The training data was used to
fit each model, the validation data was used to tune its hyperparameters, and the
testing data was used to evaluate its final performance. The performance on the
validation and testing data was measured following the logic in section 3.3.1.
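
Since the split is chronological, it can be expressed as simple date-based filtering, as in the sketch below; the boundary dates are placeholders chosen only to illustrate the 16/2/4-month proportions and are not the exact dates used in the project.

import pandas as pd

def chronological_split(df, train_end, val_end):
    """Split a frame with a 'date' column into train, validation and test sets."""
    train = df[df["date"] <= train_end]
    val = df[(df["date"] > train_end) & (df["date"] <= val_end)]
    test = df[df["date"] > val_end]
    return train, val, test

# Illustrative frame covering 22 months of daily data.
df = pd.DataFrame({"date": pd.date_range("2018-03-01", "2019-12-31", freq="D")})

# Roughly 16 months of training, 2 of validation and 4 of testing.
train, val, test = chronological_split(df, "2019-06-30", "2019-08-31")
print(len(train), len(val), len(test))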

5 Method
In this chapter, each model and its corresponding implementation is presented in
detail. This includes descriptions of hyperparameters, packages used and a general
overview of how each model was utilized.

5.1 Model Implementation

All models were trained on the training dataset and their hyperparameters were tuned
using the validation data. Lastly, the models were evaluated on the test dataset. The
process of evaluating the models and their hyperparameters followed the
cross-validation approach presented in chapter 3. As the naive model followed the
basic approach described in section 3.2.1 and has no hyperparameters, it is not
presented further in this chapter.

5.1.1 ARIMAX

Statsmodels was used to create the ARIMAX model. Each product and store combination
was trained and evaluated separately. As the ARIMAX model incorporates lags
automatically, the manually created lags in the dataset were not strictly necessary
[45]. However, the model was tested both with and without the manually created lags to
evaluate whether they added predictive power. Autocorrelation and partial
autocorrelation plots were used to determine candidate values for the automatic lags.
Furthermore, multiple models were evaluated based on the presented evaluation metrics
to determine the best hyperparameter values. The hyperparameters that were tested are
displayed in Table 4.

Table 4: The evaluated hyperparameters for ARIMAX

Hyperparameter   Possible Values
p                7, 14, 21
q                7, 14, 21
d                0, 1, 2
trend            None, c, t, ct

Where c indicates a constant trend, t indicates a linear trend, and ct is both. Fur-
thermore, "None" indicates no trend variable. [45]
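
A sketch of fitting a single product-store series with the SARIMAX class in statsmodels, which supports exogenous regressors, is shown below. The synthetic series, the exogenous columns, and the particular order (7, 0, 7), one candidate combination from Table 4, are illustrative assumptions.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
n = 200
dates = pd.date_range("2018-01-01", periods=n, freq="D")

# Synthetic daily sales with a weekly pattern, plus exogenous variables.
y = pd.Series(20 + 5 * np.sin(np.arange(n) * 2 * np.pi / 7) + rng.normal(0, 2, n),
              index=dates)
exog = pd.DataFrame({"on_promotion": rng.integers(0, 2, n),
                     "holiday": rng.integers(0, 2, n)}, index=dates)

# ARIMAX(p, d, q) with exogenous regressors; trend=None means no trend term.
model = SARIMAX(y, exog=exog, order=(7, 0, 7), trend=None)
fitted = model.fit(disp=False)

# Forecast 14 days ahead, supplying future values of the exogenous variables.
future_dates = pd.date_range(dates[-1] + pd.Timedelta(days=1), periods=14, freq="D")
future_exog = pd.DataFrame({"on_promotion": np.zeros(14, dtype=int),
                            "holiday": np.zeros(14, dtype=int)}, index=future_dates)
print(fitted.forecast(steps=14, exog=future_exog).head())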

5.1.2 Facebook Prophet

Prophet was implemented using Facebook's own fbprophet library. As this model handles
trends, seasonality, and holidays internally, the manually created lags and holiday
data could be discarded; the model was nevertheless tested both with and without the
manually created lags. Besides the main steps of creating the model, fitting it to the
data, and predicting new values, some model-specific steps had to be taken: the
holiday-related dates had to be specified, the additional features had to be added as
extra regressors, and the hyperparameters had to be tuned. See Table 5 for a complete
list of evaluated hyperparameters for the Prophet model.

Table 5: The evaluated hyperparameters for Prophet

Hyperparameter            Possible Values
Yearly Seasonality        True, False
Weekly Seasonality        True, False
Daily Seasonality         True, False
Changepoint Prior Scale   0.05, 0.5, 2, 10
Changepoint Range         0.65, 0.9, 0.05
Seasonality Prior Scale   0.01, 0.1, 1, 10, 100, 1000
Holidays Prior Scale      0.1, 1, 10, 100, 1000
Growth                    Linear, Logistic
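
These model-specific steps can be sketched with the fbprophet library as below; the synthetic data, the holiday frame, the extra regressor, and the chosen hyperparameter values (one candidate combination from Table 5) are illustrative.

import pandas as pd
from fbprophet import Prophet

# Prophet expects a frame with columns 'ds' (date) and 'y' (target).
df = pd.DataFrame({"ds": pd.date_range("2018-01-01", periods=400, freq="D")})
df["y"] = 20 + 10 * (df["ds"].dt.dayofweek == 4)   # a simple Friday effect
df["on_promotion"] = 0

# Holiday-related dates are supplied explicitly.
holidays = pd.DataFrame({
    "holiday": "christmas",
    "ds": pd.to_datetime(["2018-12-24", "2018-12-25", "2018-12-26"]),
    "lower_window": -1,
    "upper_window": 1,
})

model = Prophet(growth="linear",
                yearly_seasonality=True,
                weekly_seasonality=True,
                daily_seasonality=False,
                changepoint_prior_scale=0.5,
                changepoint_range=0.65,
                seasonality_prior_scale=10,
                holidays_prior_scale=10,
                holidays=holidays)
model.add_regressor("on_promotion")   # additional features as extra regressors
model.fit(df)

future = model.make_future_dataframe(periods=30)
future["on_promotion"] = 0
forecast = model.predict(future)
print(forecast[["ds", "yhat"]].tail())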

5.1.3 LSTM

The LSTM model was created using the Keras, TensorFlow, and NumPy libraries in Python.
As the LSTM model contains a cell state that stores memory of previous days, the
manually created lags were not strictly necessary. The model was therefore evaluated
both with and without them. The implementation consists of creating the model,
compiling it, fitting it to the data, and predicting new values. Besides these major
steps, the data had to be normalized, as the target variable of some products was of a
larger magnitude than that of others. Multiple hyperparameters were tuned in order to
reach the best performance; see Table 6.

Table 6: The evaluated hyperparameters for LSTM

Hyperparameter        Possible Values
Batch Size            100, 700, 1400, 2100, 2800
Number of Epochs      100, 200, 300, 400
Dropout Level         0, 0.1, 0.2
Number of Neurons     8, 16, 32, 64
Activation Function   Tanh, ReLU
Number of Layers      1, 2, 3
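
A compact Keras sketch of such a network is shown below. The synthetic data, the look-back window, and the chosen values (32 neurons, tanh activation, dropout 0.1, two layers, a few epochs for brevity) are illustrative candidates from Table 6 rather than the project's exact configuration.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Synthetic, already normalized data: 1000 samples, a 14-day look-back
# window and 5 features per day.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14, 5)).astype("float32")
y = rng.normal(size=(1000, 1)).astype("float32")

# Two stacked LSTM layers with dropout, followed by a single output neuron.
model = Sequential([
    LSTM(32, activation="tanh", return_sequences=True, input_shape=(14, 5)),
    Dropout(0.1),
    LSTM(32, activation="tanh"),
    Dropout(0.1),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, batch_size=100, epochs=5, validation_split=0.1, verbose=0)

print(model.predict(X[:5]).ravel())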

5.1.4 XGBoost

XGBoost was implemented using the XGBoost library, which is based on the model
authored by Tianqi Chen [46]. The hyperparameters that were tested in a grid search
approach are displayed in Table 7. The manually created lags were used for this model
so that previous data points could serve as predictive variables; without them, the
model would not have benefited from the time dependency of the data.

Table 7: The evaluated hyperparameters for XGBoost

Hyperparameter                                                  Possible Values
Subsample ratio of columns when constructing each tree          0.5, 0.6, 0.7, 0.8, 0.9, 1
Learning rate                                                   1, 0.5, 0.25, 0.1, 0.05, 0.01, 0.005
Max depth                                                       6, 7, 8, 9, 10
Minimum child weight                                            4, 5, 6, 7, 8, 9
Fraction of observations to be randomly sampled for each tree   0.5, 0.6, 0.7, 0.8, 0.9, 1
Number of estimators                                            100, 250, 500, 750, 1000
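
A sketch of the grid search with the XGBoost library and scikit-learn is shown below; the synthetic data and the reduced grid are illustrative, and a time-ordered cross-validation split stands in for the validation procedure described in chapter 3.

import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

# Synthetic feature matrix (e.g. lags, weekday dummies, promotion flags).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = 3 * X[:, 0] + rng.normal(size=500)

# A reduced grid for illustration; the full search covered the values in Table 7.
param_grid = {
    "learning_rate": [0.1, 0.01],
    "max_depth": [6, 9],
    "n_estimators": [250, 750],
    "subsample": [0.9],
    "colsample_bytree": [0.9],
    "min_child_weight": [6],
}

search = GridSearchCV(estimator=XGBRegressor(objective="reg:squarederror"),
                      param_grid=param_grid,
                      scoring="neg_mean_absolute_error",
                      cv=TimeSeriesSplit(n_splits=3))
search.fit(X, y)
print(search.best_params_)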

6 Result
In this chapter, the final selection of hyperparameters for each model and the results
for these models are presented. The results are based on the median and mean MAE,
RMSE, and SMAPE for each model. Furthermore, as the performance of the models can
depend on the time period, graphs displaying the evaluation metrics for each week in
the testing data are also presented. The naive model follows the basic functionality
described in section 3.2.1 and does not have any hyperparameters; thus, no final set
of hyperparameters is presented for it.

6.1 Performance of Models

The performance of the final models was evaluated using the mean and median values of
RMSE, MAE, and SMAPE over all iterations; these results are presented in Tables 8 and
9. Furthermore, Figures 11, 12, and 13 display the results for each individual week.
Two feature sets, FSW and FS, were evaluated for all models except the naive model;
they denote whether the feature set included the weather data or not. As the naive
model was based only on the mean values of the quantities sold, it used neither FSW
nor FS.
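
For reference, the three metrics can be computed as in the sketch below. The exact SMAPE variant used in the thesis is defined in section 3.3.1; the version shown here, which divides by the sum of the absolute values and therefore lies on a 0-1 scale, is one common formulation and is an assumption.

import numpy as np

def mae(actual, forecast):
    """Mean absolute error."""
    return float(np.mean(np.abs(actual - forecast)))

def rmse(actual, forecast):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def smape(actual, forecast):
    """Symmetric mean absolute percentage error on a 0-1 scale."""
    denom = np.abs(actual) + np.abs(forecast)
    denom = np.where(denom == 0, 1.0, denom)   # guard against zero division
    return float(np.mean(np.abs(actual - forecast) / denom))

actual = np.array([10.0, 12.0, 8.0, 15.0])
forecast = np.array([11.0, 10.0, 9.0, 14.0])
print(mae(actual, forecast), rmse(actual, forecast), smape(actual, forecast))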

Table 8: Model results (mean)

Model     Feature Set   MAE    SMAPE   RMSE
ARIMAX    FSW           6.64   0.25    13.95
LSTM      FSW           6.63   0.24    14.68
Prophet   FSW           6.65   0.28    13.44
XGBoost   FSW           6.43   0.24    13.75
ARIMAX    FS            6.58   0.25    13.78
LSTM      FS            6.72   0.25    14.46
Prophet   FS            6.62   0.27    13.38
XGBoost   FS            6.39   0.24    13.75
Naive     -             7.45   0.27    15.08



Table 9: Model results (median)

Model     Feature Set   MAE    SMAPE   RMSE
ARIMAX    FSW           5.40   0.23    9.86
LSTM      FSW           5.23   0.23    9.63
Prophet   FSW           5.66   0.26    9.99
XGBoost   FSW           5.07   0.23    9.20
ARIMAX    FS            5.23   0.24    9.27
LSTM      FS            5.39   0.24    9.69
Prophet   FS            5.65   0.26    9.93
XGBoost   FS            5.09   0.23    9.27
Naive     -             6.20   0.26    10.04

Figure 11: MAE for each week and model



Figure 12: SMAPE for each week and model

Figure 13: RMSE for each week and model



6.1.1 Naive Model

In Figure 14, the sales for one particular product are displayed in relation to its
mean value. In Figure 15, the aggregated sales for all products are displayed in
relation to their mean value.

Figure 14: % of mean sales for the naive model and one specific product and store

Figure 15: % of mean aggregated sales for the naive model



6.1.2 ARIMAX

The final set of hyperparameters for the ARIMAX model is displayed in Table 10.

Table 10: The final hyperparameters for ARIMAX

Hyperparameter   Final Value
p                7
q                7
d                0
trend            None

Utilizing the manually created lags resulted in no added performance on the validation
data; therefore, the final model did not use them.

In Figure 16, the sales of an individual product and store are displayed to showcase
the model prediction. Figure 17 displays the total amount of sales compared to the
total predicted sales. Both figures show the results in relation to the mean value of
sales.

Figure 16: % of mean sales of the ARIMA and one specific product and store

Figure 17: % of mean sales aggregated ARIMA

6.1.3 Facebook Prophet

The final set of hyperparameters for the Prophet model are displayed in Table 11.

Table 11: The final hyperparameters for Prophet

Hyperparameter            Final Value
Yearly Seasonality        True
Weekly Seasonality        True
Daily Seasonality         False
Changepoint Prior Scale   0.5
Changepoint Range         0.65
Seasonality Prior Scale   1000
Holidays Prior Scale      1000
Growth                    Linear

Utilizing the manually created lags resulted in no additional performance on the
validation data; thus, the final model did not use them. Figure 19 displays the
aggregated sales for all products in relation to their mean value, and Figure 18
displays the sales for one particular product in relation to its mean value.

Figure 18: % of mean sales for Prophet and one specific product and store

Figure 19: % of mean aggregated sales Prophet



6.1.4 LSTM

The final hyperparameters for the LSTM model are presented in Table 12. Utilizing the
manually created lags did not improve performance on the validation data; thus, the
final model was evaluated without these lags.

Table 12: The final hyperparameters for LSTM

Hyperparameter        Final Value
Batch Size            1400
Number of Epochs      300
Dropout Level         0.1
Number of Neurons     32
Activation Function   Tanh
Number of Layers      2

Figure 20 displays the predicted amount of sales for an individual product and store
in comparison to the mean value of sales. Figure 21 displays the total amount of sales
and predicted sales for all products and stores.

Figure 20: % of mean sales for LSTM Model and one specific product and store

Figure 21: % of mean sales aggregated LSTM

6.1.5 XGBoost

Multiple versions of XGBoost with different sets of hyperparameters were evaluated.
The hyperparameters of the final and best-performing model are displayed in Table 13.

Table 13: The final hyperparameters for XGBoost

Hyperparameter                                                  Final Value
Subsample ratio of columns when constructing each tree          0.9
Learning rate                                                   0.01
Max depth                                                       9
Minimum child weight                                            6
Fraction of observations to be randomly sampled for each tree   0.9
Number of estimators                                            750

Figure 22 displays the true and predicted quantities sold for a specific product and
store pair. Figure 23 displays the total amount of sold products and predicted prod-
ucts sold for all stores.

Figure 22: % of mean sales for XGBoost Model and one specific product and store

Figure 23: % of mean sales aggregated XGBoost



7 Discussion
In this chapter, the results are thoroughly discussed and analyzed. This includes
comparisons of the results based on the different evaluation metrics for each model,
what those results indicate, and the underlying reasons for them. Furthermore, the
implications of these results and the limitations of the thesis are discussed.

7.1 Model Comparison

The results are inconclusive; excluding the naive model, the models show similar
predictive power. However, the results in Tables 8 and 9 indicate that the XGBoost and
LSTM models performed slightly better in regards to mean and median MAE and SMAPE. It
was clear that the models were least successful during the holiday season, which can
be attributed to the larger fluctuations in sales during this period, the changed
customer behavior, and the small amount of data representing it. It should be noted
that the naive model was relatively successful during the non-holiday period,
indicating that the sales for this set of products followed a weekly pattern for which
the mean value of products sold was a successful predictor. Since the naive model
broke down during the holiday season, this is where more advanced models can provide
the greatest benefit.

In regards to mean RMSE, Facebook Prophet performed the best. As the largest absolute
errors occurred over the holiday season, the mean RMSE was heavily dependent on the
performance over this time period; that Facebook Prophet performed the best regarding
RMSE therefore indicates that it might handle holiday effects better. In contrast,
Facebook Prophet performed worse in regards to MAE and was the worst machine learning
model in regards to mean and median SMAPE, indicating that its overall predictive
power was worse than that of XGBoost, ARIMAX, and LSTM. Furthermore, Figure 19 shows
that Facebook Prophet modeled the holiday effects over a longer period of time than
the other models, and the same effect is visible in Figure 11, where the Prophet model
successfully predicted the increase in sales over a longer period. Facebook Prophet
also correctly modeled the existence of larger spikes in sales, although the magnitude
of each spike was not accurately predicted. With more data, the magnitude of each
spike could potentially be better predicted.

LSTM successfully predicted the spikes over the Christmas holidays, although on one
occasion it overestimated the spike, as seen in Figure 21. This indicates that the
LSTM model was a good predictor even over the holiday period. With more data, it is
possible that this overestimation could have been avoided.

As Facebook Prophet predicted the existence of the holiday effect better than LSTM,
XGBoost, and ARIMAX, it is clear that, in this case, the modeling of holidays was
handled more effectively by Facebook Prophet. While this did not help the predictive
power of XGBoost, LSTM, and ARIMAX here, it can be seen as an indication that holidays
and other special events can be handled more accurately. If a similar treatment were
adapted to the LSTM model, it could potentially be improved substantially.

A possible reason why Facebook Prophet performed better than the other models with
regards to RMSE is that Facebook Prophet internally incorporates seasonality effects,
in this case weekly and yearly seasonality. In Facebook Prophet, these effects
correspond to one of its three main components and are modeled by an indicator
function, seen in equation 3.12, which is then used additively to model the increase
in sales. This component is not limited to holiday data; it can include any public
dates, such as sporting events or special celebrations, which implies that other
calendar events can easily be incorporated as well.

As mentioned, there are some indications that XGBoost and LSTM performed slightly
better overall, which could be due to their non-linearity. These models can capture
non-linear relationships in the data, which may have been relevant in this specific
case as retail sales fluctuate heavily and have multiple complex dependencies. Most
notably, the sales fluctuated weekly, monthly, and yearly, largely due to calendar
events such as holidays, paydays, and certain weekdays. For LSTM, these complex
relationships, as well as the cell state in the neural network, could also explain the
accurate predictions during Christmas, although it is possible that the LSTM model
overestimated the prediction after Christmas as a reaction to the days prior. With
more data, especially data covering holidays, the LSTM model could presumably be
improved. Similarly, the XGBoost model could be improved with more holiday data. Since
the model has a regression tree architecture and only a small number of holidays
appeared in the data, the holidays were not considered a successful predictor
(overall) in the trees. This was evident from the consistency of the forecast even
through the holiday season. Thus, an increased amount of data, and perhaps an
oversampling of holidays, could improve the model.

Furthermore, the LSTM and XGBoost models could predict all products and stores
simultaneously, allowing them to capture complex relationships between different
products and to learn patterns from all products, thus presumably improving the
performance across products. There is presumably a possibility of complex
relationships between products within the same product category: if one product
suddenly increases in sales, that could indicate that similar products will increase
in sales due to higher demand, or, conversely, it could decrease sales of other
products through a cannibalization effect. It is therefore clear that complex
relationships could exist between products in cases such as this one. This also
implies that as more products are added to the models, such relationships could be
established more easily, further increasing the performance relative to ARIMAX,
Facebook Prophet, and the naive model.

7.2 The Effect of Adding Weather as an Input

Based on the previous literature discussed in chapter 2, weather data was expected to
increase the predictive power of the models. However, including weather as a feature
showed a minuscule increase in model performance, and in some cases a decrease. Thus,
including weather showed no clear benefits. That does not necessarily mean that it
would not improve predictions for another set of products or over another time period;
for example, weather could have a greater effect during summer.

If weather were to be used, the input weather data would consist of weather forecasts
which have an inherent inaccuracy. Thus, the model would predict sales based on the
forecast of the weather and not the actual weather. Furthermore, for the customer,
it is possible that it is not the actual weather but the weather forecast that affects
sales or a combination of both. More specifically, if people plan their shopping based
on the predicted weather and will shop regardless of the actual weather, or if they
shop based on the actual weather on that day. As this uncertainty was not modeled
into the algorithms it could present a possible problem. However, given that the
forecasted weather was similar to the actual weather there is no indication that either
would aid in the prediction.

Furthermore, it is possible that the weather features used were not predictive of
grocery sales while other weather variables could be. For example, wind, a variable
that was not used in this thesis, could have predictive power. The weather features
could also be aggregated into a single feature indicating whether the weather was
abnormal. The amount of rain, the amount of sun, or the temperature may individually
have low predictive power, but very heavy rain or unusually low or high temperatures
could still affect sales. Thus, it might not be the weather features themselves, but
whether the weather is abnormal, that affects sales.

As this thesis utilized actual weather data instead of forecasted weather data, it is
difficult to draw any clear conclusions about the validity of adding weather as an
input. However, the actual weather on a given day did not seem to predict sales.
Furthermore, as including the weather features increased the data complexity, the
recommendation is to not utilize weather as a feature as long as it does not
significantly improve performance.

7.3 Implications

The products used within this project were specifically chosen from the manually
ordered items. A decrease in mean absolute error of 0.7-1.0 and in root mean squared
error of 1.0-1.5, when comparing the naive model with the other models, could carry
high significance in the long term. A decrease of one unit in MAE represents roughly
25 fewer wasted or missed sales units each day for each store (for this set of
products) and would presumably scale with the number of products added to this
forecasting methodology. Furthermore, staff allocation could benefit, as only the
necessary products would be put up for display: staff could stock the necessary
products before customers arrive and, once customers have arrived, focus on helping
them instead of stocking shelves.

7.4 Limitations

One limitation of this project was the mixed units of quantity used by the products.
This meant that a mean absolute error of k could imply that the predicted value was
off by either k items or k kilograms, which may skew the results of the evaluation
metrics. A second limitation was that the number of products sold varied widely
between products; a product that sells more could therefore also increase the absolute
errors more than a product that sells less, making the results harder to interpret.
Given these limitations, the results might have benefited from separating the products
into several subsets and using different models depending on the performance on each
subset. A third limitation was the simplification that all products need to be sold on
the same day as they are displayed. Some of the analyzed products could be sold during
the following days before being considered waste. This could be incorporated into the
models but was deemed out of scope for this thesis.

7.5 Conclusion

This thesis implemented four supervised machine learning models on a specific case
involving sales data from Coop Värmland. Given this case and the available data, there
were no major differences in predictive power among LSTM, Prophet, XGBoost, and
ARIMAX. However, LSTM and XGBoost showed the most promise overall and could
potentially be further improved with more data; most notably, XGBoost and LSTM
performed better than the other models in regards to SMAPE and MAE. Prophet, on the
other hand, showed the most promise over the Christmas holiday season, as evidenced by
the results in regards to RMSE as well as the MAE before Christmas. Incorporating
holiday modeling similar to Prophet's could be beneficial for all models. Furthermore,
the LSTM model adapted quickly to the sales increase before Christmas and could
predict the spikes accurately after Christmas. Thus, the best-performing model
depended on the evaluation metric and the time period in focus, but non-linear models
show the most promise.

Compared to the naive model, the supervised machine learning models performed better
in terms of RMSE, MAE, and SMAPE. This result can be seen as an indication that it
would be possible for Coop Värmland to incorporate supervised machine learning models
to lower the amount of waste. However, it should be noted that the naive model was not
dependent on time, while an experience-based forecast would be. During the holiday
season, such experience-based forecasting would presumably predict sales more
accurately, and incorporating it could therefore benefit the models over the holiday
season.

Furthermore, there was no indication that variables containing information about the
weather benefited the models' performance. However, this could have been biased by the
evaluated products or the time period in which the models were evaluated. The
recommendation is therefore to further investigate whether weather as a feature could
be beneficial for other products or seasons.

8 Further Studies
As seen in the sales prediction graphs in chapter 6, the errors of the models
increased during the holiday season. This implies that further studies should focus
particularly on holidays and other special occasions, which could potentially be done
by utilizing different models for different seasons.

The predictive models could potentially be improved by implementing cluster analysis,
such as KNN (K-Nearest Neighbors), and developing individual models for each cluster
of products. As products within a cluster might be related, cannibalization effects
could then potentially be incorporated. Doing this would also make it possible to
analyze whether specific models yield better results for specific clusters of
products. The models used within this project might be better suited for only a subset
of the utilized products, which could be determined by testing different subsets of
products with different models.

9 Appendices

9.1 Appendix 1: Available Coop Data

Data Type
HUSHÅLLSOST 26% Cheese
RIVEN OST TEX MEX Cheese
HALLOUMI Cheese
HUSHÅLL 26% 500 G Cheese
OST PHILADELPHIA Cheese
GULLÖK Vegetable
AVOCADO PRERIPENED Vegetable
ISBERG PÅSE Vegetable
RÖDLÖK Vegetable
TOMATER KVIST RÖDA Vegetable
POLARPÄRLAN VETE Bread
KORVBRÖD 8P Bread
PÅGENLIMPAN Bread
ROAST N TOAST Bread
KÄRGÅRDSKAKA HÖNÖ Bread
KOKKORV Charcuterie
BACON Charcuterie
GRILLKORV SKINNFRI Charcuterie
GRILLKORV BARNENS Charcuterie
SKINKA RÖKT Charcuterie
MJÖLK MELLAN Dairy
STANDARDMJÖLK 3% Dairy
VISPGRÄDDE 40% Dairy
LÄTTMJÖLK 0.5% Dairy
CREMÉ FRAICHE 34% Dairy
Karlstad Store Location
Karlstad Store Location
Kristinehamn Store Location
Arvika Store Location

9.2 Appendix 2: Holiday Data

Holiday Date Holiday Description


3/28/2018 Easter
3/29/2018 Easter
3/30/2018 Easter
3/31/2018 Easter
4/1/2018 Easter
4/2/2018 Easter
4/3/2018 Easter
4/4/2018 Easter
6/21/2018 Midsummer
6/22/2018 Midsummer
6/23/2018 Midsummer
6/24/2018 Midsummer
6/25/2018 Midsummer
12/23/2018 Christmas
12/24/2018 Christmas
12/25/2018 Christmas
12/26/2018 Christmas
12/27/2018 Christmas
12/28/2018 Christmas
4/17/2019 Easter
4/18/2019 Easter
4/19/2019 Easter
4/20/2019 Easter
4/21/2019 Easter
4/22/2019 Easter
4/23/2019 Easter
4/24/2019 Easter
6/20/2019 Midsummer
6/21/2019 Midsummer
6/22/2019 Midsummer
6/23/2019 Midsummer
6/24/2019 Midsummer
12/23/2019 Christmas
12/24/2019 Christmas
12/25/2019 Christmas
12/26/2019 Christmas
12/27/2019 Christmas
12/28/2019 Christmas

References

[1] Tom M. Mitchell. “Machine Learning and Data Mining”. In: Communications of the
ACM (1999). url: http://www.cs.cmu.edu/~tom/pubs/cacm99_final.pdf.
[2] Matavfall i Sverige – Uppkomst och behandling 2016. Naturvårdsverket. 2018.
url: https://www.naturvardsverket.se/Documents/publikationer6400/
978-91-620-8811-8.pdf?pid=22466.
[3] Coop Värmland. 2020. url: https://coopvarmland.se/vara-butiker/.
[4] Bohdan M Pavlyshenko. “Machine-Learning Models for Sales Time Series Fore-
casting”. In: Data 4.1 (2019), p. 15. doi: 10.3390/data4010015.
[5] Kasun Bandara et al. “Sales Demand Forecast in E-commerce Using a Long
Short-Term Memory Neural Network Methodology”. In: Neural Information
Processing Lecture Notes in Computer Science (2019), pp. 462–474. doi: 10.
1007/978-3-030-36718-3_39.
[6] Ching-Wu Chu and Guoqiang Peter Zhang. “A comparative study of linear and
nonlinear models for aggregate retail sales forecasting”. In: International Journal
of Production Economics 86.3 (2003), pp. 217–231. doi: 10 . 1016 / s0925 -
5273(03)00068-9.
[7] Zeynep Hilal Kilimci et al. “An Improved Demand Forecasting Model Using
Deep Learning Approach and Proposed Decision Integration Strategy for Supply
Chain”. In: Complexity 2019 (2019), pp. 1–15. doi: 10.1155/2019/9067367.
[8] Real Carbonneau, Kevin Laframboise, and Rustam Vahidov. “Application of
machine learning techniques for supply chain demand forecasting”. In: European
Journal of Operational Research 184.3 (2008), pp. 1140–1154. doi: 10.1016/
j.ejor.2006.12.004.
[9] Indrė Žliobaitė, Jorn Bakker, and Mykola Pechenizkiy. “Beating the baseline
prediction in food sales: How intelligent an intelligent predictor is?” In: Expert
Systems with Applications 39.1 (2012), pp. 806–815. doi: 10.1016/j.eswa.
2011.07.078.

[10] Robert Siwerz and Christopher Dahlén. “Predicting sales in a food store de-
partment using machine learning”. PhD thesis. SCHOOL OF COMPUTER
SCIENCE and COMMUNICATION, 2017. url: http://www.diva-portal.
org/smash/get/diva2:1108597/FULLTEXT01.pdf.
[11] Mithileysh Sathiyanarayanan. Data Analysis and Forecasting of Grocery Sales
in Ecuador. url: https://www.academia.edu/37542292/Data_Analysis_
and_Forecasting_of_Grocery_Sales_in_Ecuador.
[12] Tianqi Chen and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting Sys-
tem”. In: Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining - KDD 16 (2016). doi: 10 . 1145 /
2939672.2939785.
[13] Shouwen Ji et al. “An Application of a Three-Stage XGBoost-Based Model to
Sales Forecasting of a Cross-Border E-Commerce Enterprise”. In: Mathematical
Problems in Engineering 2019 (2019), pp. 1–15. doi: 10.1155/2019/8503252.
[14] Ben Letham and Sean J Taylor. Prophet: forecasting at scale. Sept. 2018. url:
https://peerj.com/preprints/3190.pdf.
[15] Caner Dabakoglu. Time Series Forecasting-ARIMA, LSTM, Prophet with Python.
June 2019. url: https://medium.com/@cdabakoglu/time-series-forecasting-
arima-lstm-prophet-with-python-e73a750a9887.
[16] Johan Krylstedt and Andreas Weidlertz. “A Study of Weather’s Impact on
Consumption of Goods For Certain Weather-Dependent Products at a Small
Grocery Store”. PhD thesis. KTH ROYAL INSTITUTE OF TECHNOLOGY,
SCI School of Engineering Sciences, 2016. url: http://www.diva- portal.
org/smash/get/diva2:942546/FULLTEXT01.pdf.
[17] Kyle B. Murray et al. “The effect of weather on consumer spending”. In: Journal
of Retailing and Consumer Services 17.6 (2010), pp. 512–520. doi: 10.1016/
j.jretconser.2010.08.006.
[18] Gopal Behera and Neeta Nain. “A Comparative Study of Big Mart Sales Pre-
diction”. In: Communications in Computer and Information Science Computer
Vision and Image Processing (2020), pp. 421–432. doi: 10.1007/978-981-15-
4015-8_37.
[19] Nikolay Laptev. Engineering Extreme Event Forecasting at Uber with Recurrent
Neural Networks. Jan. 2019. url: https://eng.uber.com/neural-networks/.

[20] Store Item Demand Forecasting Challenge. url: https://www.kaggle.com/c/demand-
forecasting-kernels-only/overview/evaluation.
[21] “Time Series Analysis Springer Texts in Statistics”. In: (2008), pp. 1–10. doi:
10.1007/978-0-387-75959-3_1.
[22] Hira L. Koul. “Autoregression”. In: Weighted Empirical Processes in Dynamic
Nonlinear Models Lecture Notes in Statistics (2002), pp. 294–357. doi: 10 .
1007/978-1-4613-0055-7_7.
[23] Michael W. Berry, Azlinah Mohamed, and Bee Wah. Yap. Supervised and Un-
supervised Learning for Data Science. Springer International Publishing, 2020.
[24] Ratnadip Adhikari and R. K. Agrawal. An Introductory Study on Time Series
Modeling and Forecasting. 2013. arXiv: 1302.6613 [cs.LG].
[25] Gábor Petneházi. Recurrent Neural Networks for Time Series Forecasting. 2019.
arXiv: 1901.00069 [cs.LG].
[26] A. K. Jain, Jianchang Mao, and K. M. Mohiuddin. “Artificial neural networks:
a tutorial”. In: Computer 29.3 (1996), pp. 31–44.
[27] Raúl Rojas. “Neural Networks”. In: (1996). doi: 10.1007/978-3-642-61068-4.
[28] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http:
//www.deeplearningbook.org. MIT Press, 2016.
[29] Yasmine Rashed et al. “Short-term forecast of container throughout: An ARIMA-
intervention model for the port of Antwerp”. In: Maritime Economics Logistics
19.4 (Oct. 2017), pp. 749–764. doi: 10.1057/mel.2016.8.
[30] Peter J. Brockwell and Richard A. Davis. Introduction to time series and fore-
casting. Springer International Publishing Switzerland, 2016.
[31] Sean J Taylor and Benjamin Letham. “Forecasting at scale”. In: (2017). doi:
10.7287/peerj.preprints.3190v2.
[32] What is the difference between the R gbm (gradient boosting machine) and xg-
boost (extreme gradient boosting). Qoura. 2015. url: https : / / www . quora .
com/What-is-the-difference-between-the-R-gbm-gradient-boosting-
machine-and-xgboost-extreme-gradient-boosting.
[33] Y. Bengio, P. Simard, and P. Frasconi. “Learning long-term dependencies with
gradient descent is difficult”. In: IEEE Transactions on Neural Networks 5.2
(1994), pp. 157–166. doi: 10.1109/72.279181.

[34] Schmidhuber Hochreiter. “LSTM CAN SOLVE HARD LONG TIME LAG
PROBLEMS”. In: (1997). url: https : / / papers . nips . cc / paper / 1215 -
lstm-can-solve-hard-long-time-lag-problems.pdf.
[35] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In:
Neural Computation 9.8 (1997), pp. 1735–1780. doi: 10 . 1162 / neco . 1997 .
9.8.1735.
[36] R J Hyndman. Forecasting: Principles and Practice. url: https://otexts.
com/fpp2/accuracy.html.
[37] How Reliable Are Weather Forecasts? url: https://scijinks.gov/forecast-
reliability/.
[38] Ladda ner meteorologiska observationer. url: https://www.smhi.se/data/
meteorologi/ladda-ner-meteorologiska-observationer.
[39] Sweden Public Holidays 2019. url: https : / / publicholidays . se / 2019 -
dates/.
[40] Howard J. Seltman. “Experimental Design and Analysis”. In: (2018). url: https:
//www.stat.cmu.edu/~hseltman/309/Book/Book.pdf.
[41] Will Badr. 6 Different Ways to Compensate for Missing Data (Data Imputa-
tion with examples). Jan. 2019. url: https://towardsdatascience.com/6-
different-ways-to-compensate-for-missing-values-data-imputation-
with-examples-6022d9ca0779.
[42] Yves R. Sagaert et al. “Tactical sales forecasting using a very large set of macroe-
conomic indicators”. In: European Journal of Operational Research 264.2 (2018),
pp. 558–569. doi: 10.1016/j.ejor.2017.06.054.
[43] Nilesh Acharya. Build more accurate forecasts with new capabilities in automated
machine learning. June 2019. url: https : / / azure . microsoft . com / en -
us/blog/build-more-accurate-forecasts-with-new-capabilities-in-
automated-machine-learning/.
[44] Kedar Potdar, Taher S., and Chinmay D. “A Comparative Study of Categori-
cal Variable Encoding Techniques for Neural Network Classifiers”. In: Interna-
tional Journal of Computer Applications 175.4 (2017), pp. 7–9. doi: 10.5120/
ijca2017915495.
[45] SARIMAX. statsmodels. 2020. url: https://www.statsmodels.org/dev/
generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html.

[46] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. Aug.
2016. url: https://dl.acm.org/doi/10.1145/2939672.2939785.