2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS)

Grid Search Optimization (GSO) Based Future Sales Prediction For Big Mart

Gopal Behera and Neeta Nain
Department of Computer Science and Engineering
Malaviya National Institute of Technology Jaipur, India
2019rcp9002@mnit.ac.in, nnain.cse@mnit.ac.in

Abstract—In the retail domain, predicting sales before they occur plays a vital role for any retail company, such as Big Mart or a mall, in maintaining a successful business. Traditional statistical forecasting models are commonly used for future sales prediction, but these techniques take considerable time to estimate sales and cannot handle non-linear data. Machine Learning (ML) techniques are therefore employed to handle both non-linear and linear data. ML techniques can also efficiently handle large volumes of data, such as the Big Mart dataset, which contains a large number of customer records and individual item attributes. A retail company wants a model that can predict sales accurately, so that it can keep track of customers' future demand and update its sales inventory in advance. In this work, we propose a Grid Search Optimization (GSO) technique to optimize the parameters and select the best hyper-parameters, combined with the Xgboost algorithm, for forecasting the future sales of a retail company such as Big Mart, and we find that our model produces better results.

Index Terms—Sales Forecasting, Xgboost, Grid Search, Machine Learning, Accuracy.

I. INTRODUCTION

The quality of sales prediction, or forecasting, determines the success and performance of a retail company: inaccurate predictions lead to stock-outs or over-stocked inventories, through which companies face losses. Accurate predictions are particularly essential for consumer-oriented industries like Big Mart, where the retail industry faces several challenges; retailers need prior knowledge of customers' future demand so that products can be stocked ahead of time. In addition, other factors, such as location, changing weather conditions, public events, holidays, and festivals, can have an impact on future demand [1], and there is often a severe lack of historical sales data. For sales forecasting, statistical techniques such as exponential smoothing, ARIMA, the Box-Jenkins model, regression models, or the Holt-Winters model are often applied. To increase forecasting performance, hybrid models are frequently developed to combine the advantages of different models into a new approach [2]; hybrid models in particular appear to be accurate in sales forecasting [3], [4]. Xia and Wong [5] described the differences between classical methods (based on mathematical and statistical models) and modern heuristic methods. In the first group, they named exponential smoothing, regression, Box-Jenkins, autoregressive integrated moving average (ARIMA), and generalized autoregressive conditionally heteroskedastic (GARCH) methods. Most of these models are linear and are not able to deal with the asymmetric behaviour present in most real-world sales data [6]. In contrast, modern heuristic methods can usually handle these challenges, and more advanced models have been developed in the literature for this purpose. Different accuracy metrics are used for evaluation, such as the Root Mean Squared Error (RMSE) [7] and the Mean Absolute Error (MAE) [8].

We perform hyper-parameter tuning (HPT) through a grid search optimization (GSO) technique [9] for predicting future customer demand in a retail company. The rest of the paper is organized as follows: the literature review in Section II, data preprocessing in Section III, the proposed model in Section IV, and the experimental results and conclusions in Sections V and VI respectively.

978-1-7281-5686-6/19/$31.00 ©2019 IEEE


DOI 10.1109/SITIS.2019.00038

Authorized licensed use limited to: University of Exeter. Downloaded on May 06,2020 at 22:17:24 UTC from IEEE Xplore. Restrictions apply.
II. LITERATURE REVIEW

As sales are the life of any retail organization, sales forecasting plays a vital role in the retail domain, and accurate forecasting and analysis of sales have been studied by many researchers, summarized as follows. A mathematical model for robust production planning for apparel suppliers was suggested by Ait-Alla et al. [10]. The authors focused on supporting decision-making on the distribution of articles across different production plants, and claim their model performs robustly and can successfully deal with the constraints of uncertain consumer demand. Various machine learning (ML) techniques and their applications in different sectors are presented in [11]. Langley and Simon [12] pointed out Rule Induction (RI) as the most widely used data mining technique in the field of business. Sales prediction for a pharmaceutical distribution company is described in [13], focusing on two issues: (i) the stock should not run out, and (ii) customer dissatisfaction should be avoided by predicting sales so as to manage the stock level of medicines. K. Punam et al. [14] designed a two-level statistical model for future sales prediction in a retail store. The handling of footwear sales fluctuation over time is addressed in [15], which uses neural networks to predict weekly retail sales, reducing the unpredictability present in short-term sales planning. A comparative analysis of linear and non-linear models for sales forecasting in the retail sector is proposed in [16], and sales prediction in the fashion market is performed in [17]. Recently, many authors have examined the relationship between online chatter and real-world outcomes, and the predictive power of user-generated content; the micro-blogging service Twitter has served as the data source for most of these works. For instance, Asur and Huberman [18] focus on movie box-office revenues and Twitter data, demonstrating high correlations between online data and the actual rank of a movie. Dhar and Chang [19] suggest that user-generated content is a good indicator of future online music sales. Further research explores sentiment in Twitter data, examining potential correlations with the value of the Dow Jones Industrial Average [20] and with stock markets in general [21], [22]. Likewise, Twitter posts have been used to examine the effect of social networking sites in predicting the outcome of elections [23], [24]. Challenging factors such as a lack of historical data, consumer-oriented markets, uncertain demand, and the short life cycles of prediction methods result in inaccurate predictions.

III. DATA PREPROCESSING

The dataset plays an important role in enabling a model to accurately predict future sales demand in a retail environment. In this work we use the Big Mart sales dataset [25], summarized in Table I. The dataset contains 12 features with 8523 tuples per feature, giving approximately 102276 data values in total. The assumptions, or factors affecting the sales of a store, are shown in Figure 1.

TABLE I
FEATURES OF THE DATASET

Name of the Attribute       Type     Total Count
Item Identifier             object   8523
Item Weight                 float64  7060
Item Fat Content            object   8523
Item Visibility             float64  8523
Item Type                   object   8523
Item MRP                    float64  8523
Outlet Identifier           object   8523
Outlet Establishment Year   int64    8523
Outlet Size                 object   6113
Outlet Location Type        object   8523
Outlet Type                 object   8523
Item Outlet Sales           float64  8523

Fig. 1. Factors affecting retail sales.

A. Store Level Hypotheses:
City type: Stores situated in Tier 1 cities or urban areas generally have higher sales, as the income level of the population is higher.
Population Density: Stores in densely populated areas have higher sales because of greater demand.
Store Capacity: Large stores have higher sales because they act as one-stop shops; customers prefer getting everything from one place.
Competitors: The establishment year affects sales volume due to legacy-effect competition.
Marketing: Marketing and advertising strongly affect sales by increasing a store's visibility; catchy slogans stay with customers for a long time, boosting sales.
Location: Stores in popular marketplaces, or better-located stores, have higher sales due to easy access.
Customer Behavior: Stores carrying the right set of products to meet local needs will have higher sales.

B. Product Level Hypotheses:
Brand: Customers trust branded products, so these have higher sales.
Packaging: Stores with well-packaged products can attract customers and sell more.
Utility: Products used daily sell more readily than special-purpose products.
Display Area: Products kept on larger shelves inside the store attract attention first and sell more.
Visibility in Store: The location of a product within the store affects its sales; products right at the entrance catch the customer's eye before those at the back.
Display: Better product display in the store leads to higher sales in most cases.
Promotional Offers: Products accompanied by attractive offers and discounts sell more.

C. Exploratory Data Analysis (EDA):
Based on these assumptions the data is analyzed, and duplicated or irregular records are corrected during this phase. The working procedure of the proposed model is illustrated in Figure 2.
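The first step of the EDA phase, checking which attributes carry missing values, can be sketched with pandas. The snippet below uses a tiny hypothetical sample shaped like the Big Mart data (column names follow Table I; the values are invented for illustration, while the real dataset [25] has 8523 rows):

```python
import pandas as pd

# A small synthetic sample in the shape of the Big Mart data (hypothetical
# values; in the real dataset Item Weight and Outlet Size have missing entries,
# as the non-null counts in Table I show).
df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19"],
    "Item_Weight": [9.30, 5.92, None],        # missing, as in the real data
    "Item_Visibility": [0.016, 0.0, 0.017],   # zero visibility is suspect
    "Outlet_Size": ["Medium", None, "Small"], # missing, as in the real data
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3],
})

# Count missing values per attribute, the check that drives the later
# data-cleaning step.
missing = df.isnull().sum()
print(missing[missing > 0])
```

On the real dataset the same check reports 1463 missing Item Weight values and 2410 missing Outlet Size values (8523 minus the non-null counts of Table I).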
TABLE II
CORRELATION AMONG ATTRIBUTES OF THE DATASET

                            Item        Item        Item       Outlet est.  Item outlet
                            weight      visibility  MRP        year         sales
Item weight                 1           -0.014047   0.0271411  -0.0115882   0.0141227
Item visibility             -0.014047   1           -0.001314  -0.0748335   -0.1286246
Item MRP                    0.0271411   -0.001314   1          0.005019     0.0567574
Outlet establishment year   -0.0115882  -0.074833   0.005019   1            -0.0491349
Item outlet sales           0.0141227   -0.128624   0.0567574  -0.0491349   1

Fig. 2. Workflow of GSO-based future sales prediction: EDA analyzes and processes the data, feature engineering transforms the data into the correct format, the model receives the processed data from the feature engineering stage, the hyper-parameters are tuned using the Grid Search Optimization (GSO) technique, and ensemble techniques are then used to predict the result.

1) Data Exploration: In the data exploration phase, important data are further explored from the raw dataset. In the data analysis stage we identify attributes with missing values, recorded as NaN or as a zero minimum value, which affect prediction accuracy. Such fields need to be corrected before being fed to the model, so a data cleaning mechanism is employed to handle these attributes in the next section. A univariate distribution of the target variable is shown in Figure 3, which shows that the target attribute is skewed towards the right, as defined in Equation 1, with a skew coefficient of 1.177, and has a kurtosis, as defined in Equation 2, of 1.165. Both coefficients indicate that the target attribute is skewed towards higher sales, with a high concentration of low-priced, daily-use products.

Fig. 3. Univariate distribution of the target variable, depicting a skew towards higher sales of low-priced products.

skewness = \frac{\sum_i (X_i - \bar{X})^3}{n s^3}    (1)

Kurtosis = \frac{\sum_i (X_i - \bar{X})^4}{n s^4}    (2)

where n is the sample size, X_i is the i-th value of X, \bar{X} is the sample mean, and s is the standard deviation of the sample. Similarly, the correlations among attributes are presented in Table II, where it is observed that the attribute Item MRP is the most strongly correlated with the target attribute Item Outlet Sales. The correlation is defined in Equation 3:

r_{xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n \sum x_i^2 - (\sum x_i)^2\right)\left(n \sum y_i^2 - (\sum y_i)^2\right)}}    (3)

where r_{xy} is the Pearson correlation coefficient between x and y, n is the number of observations, and x_i and y_i are the values of x and y for the i-th observation.

It is also observed that the lowest sales were produced at the smallest locations. However, in some cases a medium-size outlet produced the highest sales, even though it was a Tier 2 supermarket rather than the largest location, as depicted in Figure 4. There are three types of supermarket: Tier 1, Tier 2 and Tier 3.

Fig. 4. Outlet Location vs. Item Outlet Sales: shows that Tier 2 outlets produced the highest sales relative to Tier 1 and Tier 3.
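The statistics of Equations 1–3 are straightforward to compute directly. The following plain-Python sketch implements them as written (it uses the population standard deviation for s, an assumption, since the paper does not state which convention it follows):

```python
from math import sqrt

def skewness(xs):
    # Equation (1): sum((x_i - mean)^3) / (n * s^3)
    n = len(xs)
    mean = sum(xs) / n
    s = sqrt(sum((x - mean) ** 2 for x in xs) / n)  # population std (assumption)
    return sum((x - mean) ** 3 for x in xs) / (n * s ** 3)

def kurtosis(xs):
    # Equation (2): sum((x_i - mean)^4) / (n * s^4)
    n = len(xs)
    mean = sum(xs) / n
    s = sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum((x - mean) ** 4 for x in xs) / (n * s ** 4)

def pearson_r(xs, ys):
    # Equation (3): (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
```

Applied column-wise to the dataset, `pearson_r` reproduces entries of Table II; for example, a perfectly linear pair of columns yields r = 1, while a symmetric target distribution would give a skewness of 0 rather than the observed 1.177.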
2) Data Cleaning: In the data exploration phase it was found that the attributes with missing values are Outlet Size and Item Weight. We replace all missing values with the mode or mean of the corresponding attribute, according to its type, which diminishes the distortion of the correlation among input attributes.

3) Feature Engineering: Some nuances were observed in the dataset during the data exploration phase. This phase resolves all such inconsistencies, readying the dataset for building the predictive model. We observed that the Item Visibility attribute contains zero values, which make no practical sense; the mean item visibility of the corresponding product is used to replace these zero values, which makes all products potentially sellable. All categorical attribute discrepancies are resolved by mapping them to appropriate values. In some cases it was noticed that the fat-content property is not applicable to non-consumable items; to handle such cases we create a third Item Fat Content category, "none". In the Item Identifier attribute, the unique IDs were found to start with DR, FD, or NC, so we create a new attribute, Item Type New, with three corresponding categories: Foods, Drinks, and Non-consumables. Finally, to capture how old a particular outlet is, we add an additional attribute, Year, to the dataset.

4) Feature Transformation: Our hypothesis is that "the more visible a product, the more likely it is to sell; less visible items are less likely to be sold." Accordingly, we apply a feature transformation that replaces an item's zero visibility with its mean visibility. Other hypotheses can be applied similarly in this phase. Once all this background work is completed, the data is ready for model building.

IV. PROPOSED MODEL FOR SALES PREDICTION

After completion of the previous phases, the dataset is ready for building the predictive model to forecast the sales of Big Mart. In this work, we present a model using the Grid Search Optimization (GSO) technique [9] combined with the Xgboost algorithm [26]. The objective function of GSO is defined as follows. Let f be a function that returns the MAPE (Mean Absolute Percentage Error) for parameters a and b with a, b ∈ [0, 1], and let S be the set of candidate values S = {0.1, 0.2, ..., 0.9}. The objective function f is then defined in Equation 4:

\arg\min_{a, b \in S} f(a, b) = \{(a, b) \in S \times S \mid \forall (x, y) \in S \times S : f(a, b) \leq f(x, y)\}    (4)

In this work GSO is used to select the best parameters for the predictive model after tuning the different parameters of the Xgboost algorithm. Xgboost (Extreme Gradient Boosting) is an extended version of Gradient Boosting Machines (GBM) [27], which not only enhances performance but also optimizes the system, and it works with any differentiable loss function. The GBM algorithm is illustrated in Algorithm 1.

Step 1: Initialize the model with a constant value:
    F_0 = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
Step 2: for m = 1 to M do
    a. Compute the pseudo-residuals:
       r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}  for all i = 1, 2, ..., n
    b. Fit a base learner h_m(x) to the pseudo-residuals, i.e. train the learner on the training set.
    c. Compute \gamma_m:
       \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))
    d. Update the model:
       F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
end
Step 3: Output F_M
Algorithm 1: Gradient Boosting Machine (GBM) algorithm

Key features of the Xgboost technique are:
1) It is sparsity-aware, that is, missing data values are handled automatically.
2) It supports parallel tree construction.
3) It supports continued training, so that a fitted model can be boosted further with new data.

Some of the hyper-parameters used in the grid search optimization technique are:
1) Loss function: the default is least squares.
2) Learning rate: the default is 0.1.
3) Number of estimators (n_estimators): the number of boosting stages; the default is 100.
4) Subsample: the fraction of samples used for fitting each individual learner; the default is 1.0.

Finally, the model receives the input features after preprocessing, split into training and test sets in the ratio 70:30.

V. EXPERIMENTAL RESULT

In our work we use 10-fold cross-validation to test the accuracy of the models with six different learning rates, along with other parameters. Each learning-rate variant is evaluated using 10-fold cross-validation; thus 6 × 10 = 60 Xgboost models are trained and evaluated. The log loss, as defined in Equation 5, of each learning rate is recorded
along with the best score, best parameters, and learning rate. Figure 5 shows that, after tuning the learning-rate parameter, the best learning rate with respect to the log-loss values is 0.1.

L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]    (5)

Fig. 5. Learning rate vs. log loss.

It is also observed from Figure 6 that the F-score of the feature Item MRP is much higher than that of the other features; that is, the feature importance score [27] indicates that Item MRP is more informative than the other features. The feature importance score [27] is defined in Equation 6:

f_i = \sum_{j : \text{node } j \text{ splits on feature } i} n_{ij}    (6)

where f_i is the importance of feature i and n_{ij} is the importance of node j, computed as in Equation 7:

n_{ij} = w_j C_j - w_{\text{left}(j)} C_{\text{left}(j)} - w_{\text{right}(j)} C_{\text{right}(j)}    (7)

where w_j is the weighted number of samples reaching node j, C_j is the impurity value of node j, and left(j) and right(j) are the child nodes on the left and right splits of node j.

Fig. 6. F-score of features: the Item MRP feature is more informative than the other features.

Figure 7 represents the behavior of the different learning rates with respect to the number of estimators, after tuning both the estimator and learning-rate parameters; again, 0.1 is observed to be the best learning rate. We also found that performance is poor for smaller learning rates, which would require much larger numbers of trees to overcome; however, increasing the tree count into the thousands is computationally expensive. The performance at learning rate 0.1 against the number of estimators is shown in Figure 8, which indicates that the model performs better after 300 estimators.

Fig. 7. Estimators vs. learning rate: the performance of learning rate 0.1 is better than that of the other learning rates.

In our work the performance of the models is measured in terms of RMSE and MAE, which are defined in Equation 8 and Equation 9 respectively:

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (8)

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (9)

where y_i is the actual value and \hat{y}_i is the predicted value. The model performance on the training and test datasets is presented in Table III and Table IV respectively. We infer that the best parameters are obtained after applying grid search optimization in the grid-search cross-validation stage of the model.
Fig. 8. Performance of learning rate = 0.1 vs. estimators.

TABLE III
COMPARISON OF THE XGBOOST MODEL WITH AND WITHOUT PARAMETER TUNING ON THE TRAINING SET

Models                               RMSE    MAE
Xgboost (before parameter tuning)    1066    749.78
Xgboost (after parameter tuning)     1052    739.03

TABLE IV
COMPARISON OF THE XGBOOST MODEL WITH AND WITHOUT PARAMETER TUNING ON THE TEST SET

Models                               RMSE    MAE
Xgboost (before parameter tuning)    180.2   134.08
Xgboost (after parameter tuning)     178.7   129.90

VI. CONCLUSIONS

In the present era of a digitally connected world, every retail company desires to know customer demand beforehand, to avoid a shortfall of sale items in any season. Day by day, companies and retailers are predicting the sales demand for a product more accurately, so that they achieve a higher return on investment (ROI), and many researchers are working in this area to obtain accurate sales predictions. Since the profit made by a company is directly proportional to the accuracy of its sales predictions, big marts desire ever more accurate prediction techniques to avoid losses on their investment. In this research work, we designed a predictive model using ensemble techniques with the Xgboost algorithm on the Big Mart dataset, for forecasting the future sales of a particular store or outlet of Big Mart. Experimental analysis found that our technique produces more accurate predictions, with the lowest RMSE and MAE for both the training and test sets, as shown in Table III and Table IV respectively. Table III also shows that the model performs better when the hyper-parameters are tuned. As future work, we aim to enhance accuracy further using advanced hyper-parameter optimization techniques.

REFERENCES

[1] S. Thomassey, "Sales forecasts in clothing industry: The key success factor of the supply chain management," International Journal of Production Economics, vol. 128, no. 2, pp. 470-483, 2010.
[2] N. Liu, S. Ren, T.-M. Choi, C.-L. Hui, and S.-F. Ng, "Sales forecasting for fashion retailing service industry: a review," Mathematical Problems in Engineering, vol. 2013, 2013.
[3] L. Aburto and R. Weber, "Improved supply chain management based on hybrid demand forecasts," Applied Soft Computing, vol. 7, no. 1, pp. 136-144, 2007.
[4] W.-I. Lee, B.-Y. Shih, and C.-Y. Chen, "Retracted: A hybrid artificial intelligence sales-forecasting system in the convenience store industry," Human Factors and Ergonomics in Manufacturing & Service Industries, vol. 22, no. 3, pp. 188-196, 2012.
[5] M. Xia and W. K. Wong, "A seasonal discrete grey forecasting model for fashion retailing," Knowledge-Based Systems, vol. 57, pp. 119-126, 2014.
[6] S. Wheelwright, S. Makridakis, and R. J. Hyndman, Forecasting: Methods and Applications. John Wiley & Sons, 1998.
[7] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, "Combining content-based and collaborative filters in an online newspaper," 1999.
[8] B. Smyth and P. Cotter, "Personalized electronic program guides for digital TV," AI Magazine, vol. 22, no. 2, pp. 89-89, 2001.
[9] M. Claesen and B. De Moor, "Hyperparameter search in machine learning," arXiv preprint arXiv:1502.02127, 2015.
[10] A. Ait-Alla, M. Teucke, M. Lütjen, S. Beheshti-Kashi, and H. R. Karimi, "Robust production planning in fashion apparel industry under demand uncertainty via conditional value at risk," Mathematical Problems in Engineering, vol. 2014, 2014.
[11] I. Bose and R. K. Mahapatra, "Business data mining: a machine learning perspective," Information & Management, vol. 39, no. 3, pp. 211-225, 2001.
[12] P. Langley and H. A. Simon, "Applications of machine learning and rule induction," Communications of the ACM, vol. 38, no. 11, pp. 54-64, 1995.
[13] A. Ribeiro, I. Seruca, and N. Durão, "Improving organizational decision support: Detection of outliers and sales prediction for a pharmaceutical distribution company," Procedia Computer Science, vol. 121, pp. 282-290, 2017.
[14] K. Punam, R. Pamula, and P. K. Jain, "A two-level statistical model for big mart sales prediction," in 2018 International Conference on Computing, Power and Communication Technologies (GUCON). IEEE, 2018, pp. 617-620.
[15] P. Das and S. Chaudhury, "Prediction of retail sales of footwear using feedforward and recurrent neural networks," Neural Computing and Applications, vol. 16, no. 4-5, pp. 491-502, 2007.
[16] C.-W. Chu and G. P. Zhang, "A comparative study of linear and nonlinear models for aggregate retail sales forecasting," International Journal of Production Economics, vol. 86, no. 3, pp. 217-231, 2003.
[17] S. Beheshti-Kashi, H. R. Karimi, K.-D. Thoben, M. Lütjen, and M. Teucke, "A survey on retail sales forecasting and prediction in fashion markets," Systems Science & Control Engineering, vol. 3, no. 1, pp. 154-161, 2015.
[18] S. Asur and B. A. Huberman, "Predicting the future with social media," in Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Volume 01. IEEE Computer Society, 2010, pp. 492-499.
[19] V. Dhar and E. A. Chang, "Does chatter matter? The impact of user-generated content on music sales," Journal of Interactive Marketing, vol. 23, no. 4, pp. 300-307, 2009.
[20] J. Bollen, H. Mao, and X. Zeng, "Twitter mood predicts the stock market," Journal of Computational Science, vol. 2, no. 1, pp. 1-8, 2011.
[21] E. Gilbert and K. Karahalios, "Widespread worry and the stock market," in Fourth International AAAI Conference on Weblogs and Social Media, 2010.
[22] X. Zhang, H. Fuehres, and P. A. Gloor, "Predicting stock market indicators through Twitter 'I hope it is not as bad as I fear'," Procedia - Social and Behavioral Sciences, vol. 26, pp. 55-62, 2011.
[23] A. Bermingham and A. Smeaton, "On using Twitter to monitor political sentiment and predict election results," in Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011), 2011, pp. 2-10.
[24] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith, "From tweets to polls: Linking text sentiment to public opinion time series," in Fourth International AAAI Conference on Weblogs and Social Media, 2010.
[25] T. Shrivas, "Big Mart dataset," Jun. 2013. [Online]. Available: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
[26] G. Behera and N. Nain, "A comparative study of big mart sales prediction," in Proceedings of International Conference on Computer Vision and Image Processing. Springer, 2019.
[27] T. Hastie, R. Tibshirani, and J. Friedman, "Boosting and additive trees," in The Elements of Statistical Learning, pp. 337-387, 2009.