AutoML Tool for Big Industrial Data
SoftwareX
journal homepage: www.elsevier.com/locate/softx
Article info
Article history: Received 15 June 2021; received in revised form 22 November 2021; accepted 22 November 2021
Keywords: Machine learning; AutoML; Meta-learning; Decision-support systems; Explainable AI; Big industrial data

Abstract
The Machine Learning (ML) based solutions in manufacturing industrial contexts often require skilled resources. More practical non-expert software solutions are then desired to enhance the usability of ML algorithms. The algorithm selection and configuration is one of the most difficult tasks for users like manufacturing specialists. The identification of the most appropriate algorithm in an automatic manner is among the major research challenges to achieve optimal performance of ML tools. In this paper, we present an auto-explained Automated Machine Learning tool for Big Industrial Data (AMLBID) to better cope with the prominent challenges posed by the evolution of Big Industrial Data. It is a meta-learning based decision support system for the automated selection and tuning of implied hyperparameters for ML algorithms. Moreover, the framework is equipped with an explainer module that makes the outcomes transparent and interpretable for well-performing ML systems.
© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Code metadata
1. Motivation and significance

Industrial Big Data refers to the large amount of diversified data that is generated continuously, in real time, by the network of industrial equipment [1]. The continuous digital transformation of the manufacturing industry has led to the widespread adoption of ML solutions [2]. Although, in many industrial areas, ML sufficiently assists large-scale data analysis for decision-making purposes, human intervention is often required. Domain experts have a better command of the application area. For example, they are able to provide characteristics of the application that can help to improve the performance of the algorithms. However, they are not necessarily ML experts. Moreover, the large number of available ML algorithms and hyperparameter configurations can make an exhaustive search infeasible. Therefore, in this context, the expertise of data scientists is highly desired for the identification of the most appropriate algorithm configurations [3,4].

∗ Corresponding author at: Univ. Littoral Cote d’Opale, UR 4491, LISIC, Laboratoire d’Informatique Signal et Image de la Cote d’Opale, F-62100 Calais, France.
E-mail address: moncef.garouani@etu.univ-littoral.fr (Moncef Garouani).
https://doi.org/10.1016/j.softx.2021.100919
2352-7110/© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Moncef Garouani, Adeel Ahmad, Mourad Bouneffa et al. SoftwareX 17 (2022) 100919

The selection of an algorithm or a family of algorithms that is more likely to perform better on a given combination of
datasets and their evaluation measures is a critical task [3]. ML algorithms generally have two kinds of parameters:

• the ordinary parameters, which the model learns and optimizes automatically during the learning phase,
• the hyperparameters (categorical and continuous), which are usually set manually before the model training begins.

In the context of the manufacturing industry, a major challenge is the selection of a feasible ML algorithm and the tuning of its hyperparameters. Algorithm selection and configuration (tuning of hyperparameters) is a complex process because ML algorithms are used as a “black-box”. The performance is affected by the characteristics of the datasets and the configuration of the algorithms' hyperparameters [5]. Selecting and configuring appropriate algorithm(s) is an error-prone and time-consuming process, owing to the flaws that commonly arise while establishing multiple configurations. This emphasizes the need to automate the process.

Automated Machine Learning (AutoML) [6] is a decision support system that partially or totally automates the ML pipeline. The major goal of this research field is to enable non-expert ML developers to effectively use “off-the-shelf” solutions, which saves time and effort for practitioners [7,8]. At its core, AutoML strives to meet performance criteria (e.g. accuracy, recall, F1 score) in order to solve ML tasks such as classification, regression, or clustering. In doing so, AutoML optimizes a given performance criterion [9] for the particular task and dataset.

Multiple approaches have been proposed to tackle the above problem [9–13], owing to the immense potential of AutoML. Several tools are available in the research community, such as Auto-sklearn [5], TPOT [12], and AutoWEKA [14]. There are also several commercial tools, such as RapidMiner [15], H2O Driverless AI [16], DataRobot [17], and the MATLAB ML toolbox [18]. We observe that many industrial actors are competing around the goal of automating machine learning [19]. They mostly focus on budget-limited supervised learning tasks. However, they typically deliver black-box solutions and lack effective explanations of the factors behind the predicted performance. It is also worth noting that the cost of these solutions tends to be higher due to the computational complexity involved and the time required to generate recommendations [3].

Generally, in most existing AutoML systems, visibility is limited to the input and output parameters, while the inherent associations among them remain concealed. User confidence can instead be increased by making the automatic results of AutoML systems transparent. This confidence is important because AutoML systems are conventionally used as Decision Support Systems (DSS). Therefore, the acceptability of, and trust in, an AutoML support system depend highly on the transparency of the recommendation generation process [20].

In this paper, we present AMLBID, a transparent, interpretable and auto-explainable meta-learning based tool [21] that identifies the optimal or near-optimal ML configuration for a given problem. It also explains the rationale behind a recommendation and makes it traceable. As a decision support system, the tool is able to simulate the role of the ML expert because it is based on a meta-learning approach.

2. Software description

It is often difficult to build an accurate ML-based predictive model for an industrial problem that is also easy for non-expert ML developers to interpret [6,22]. The key idea behind the transparent and auto-explainable AutoML vision is to separate the recommendations from the explanations by using two modules simultaneously, as shown in Fig. 1: the Recommender module (AMLBID) for the recommendations and the Explanatory module (AMLExplainer) for the explanations. The first module provides the most appropriate ML configuration(s) for a given problem. It aims at maximizing the requested predictive metric (e.g. Accuracy, Recall, Precision). The second module provides the rationale behind the recommended ML configuration(s), as well as auto-generated explanations that help users understand the inner workings of the model in an interpretable manner through an interactive multi-view tool.

2.1. Software architecture

The workflow of the proposed self-explanatory AutoML system consists of two major components:

• the AutoML component, which presents the AutoML process at the different levels of abstraction, from the recommendation of ML configurations to their refinement,
• and the explanatory one, which allows users to inspect both the process of decision generation and the inner workings of the recommended ML model.

In the following sections, we briefly discuss these modules.

2.1.1. Recommendation module

The AutoML tool for Big Industrial Data (AMLBID) is a meta-learning based system that automates algorithm selection and configuration. It uses a recommendation system that is bootstrapped with a knowledge base. The current knowledge base is derived from a large set of experiments conducted on 400 real-world manufacturing classification datasets collected from popular repositories such as OpenML (https://www.openml.org/), UCI (https://archive.ics.uci.edu/) and Kaggle (https://www.kaggle.com/). It holds more than 4 million evaluated ML configurations (pipelines). Each pipeline consists of a choice of ML model and the configuration of its hyperparameters. The system is able to identify effective pipelines without performing expensive computational analysis. For this purpose, it explores the interactions between the meta-features (characteristics) of the datasets and the topologies of the pipelines.

The recommendation phase is initiated when a new dataset is supplied as input to the AutoML process. At this stage, the user selects a predictive metric (e.g. Precision, Accuracy, Recall) to be used for the analysis. AMLBID then automatically provides a set of ML algorithms and recommended configurations of their hyperparameters, chosen to achieve first-rate predictive performance.

2.1.2. Explainer module

AMLExplainer and AMLBID are implemented following a client–server architecture. The server coordinates the interactions between AMLExplainer and the AutoML recommendation tool. The client-side scripts manage the visual user interfaces, including the visualization of data summaries on multiple levels of the recommended models. Meanwhile, AMLExplainer guides the
end-users to improve the predictive performance in case of unsatisfying results. Hence, it can increase the transparency, controllability, and reliability of the AutoML DSS.

2.2. Software functionalities

The current version of AMLBID is available on the PyPI package index as a Python package (https://pypi.org/project/AMLBID/) to facilitate its distribution and use. It presents a meta-learning based framework whose major objective is to automate algorithm selection and hyperparameter tuning in supervised ML, along with rational traceability. The available literature shows that the majority of state-of-the-art tools evaluate a set of pipelines by actually executing them on a given dataset prior to the recommendation. Such executions may require considerable computing time and consume precious resources, subject to their availability [22]. The proposed system (AMLBID) immediately produces a list of potential top-ranked pipelines from its knowledge base in an imperceptible computational time; hence it notably reduces the cost of resources and the dependence on their availability.

The available version of AMLBID supports eight classification algorithms from the popular Python-based ML library Scikit-learn. Table 1 gives a detailed description of the supported algorithms and the tuned hyperparameters.

Table 1
The configuration of ML algorithms and hyperparameters as tuned in the current experiments.

ML algorithm                            Tuned hyperparameters
Logistic Regression (LR)                C, Penalty, Fit_intercept, Dual
Stochastic Gradient Descent (SGD)       Loss, Penalty, Alpha, Learning_rate, Fit_intercept, L1_ratio, Eta0, Power_t
Support Vector Classifier (SVM)         Kernel, C, Gamma, Degree, Coef0
Decision Tree (DT)                      Min_samples_leaf, Min_samples_split, Criterion, Max_features
Random Forest (RF) & Extra Trees (ET)   N_estimators, Min_samples_split, Max_features, Min_samples_leaf, Min_weight_fraction_leaf
AdaBoost (AB)                           N_estimators, Learning_rate, Algorithm, Base_estimator_max_depth
Gradient Boosting (GB)                  N_estimators, Learning_rate, Criterion, Loss, Max_depth, Min_samples_leaf, Min_samples_split

Broadly, AMLBID is an interactive tool that guides end-users in improving the utility and usability of the AutoML process, with the following salient features:

• It automatically (and accurately) selects the most appropriate ML algorithm(s) with the related hyperparameter configuration through the use of a collaborative knowledge-base. The knowledge-base is continuously improved by running more tasks, which makes AMLBID smarter as it gains experience from the growing knowledge-base.
• It provides assistance when AutoML returns unsatisfying results, in order to improve the predictive performance. This is achieved by assessing the importance of, and the correlation among, the algorithm hyperparameters.
• The framework is equipped with an explanation module, which allows the end-user to understand the diagnostic design of the returned ML models using various explanation techniques. In particular, the explanation artifact allows the end-user to:
    – investigate the reasoning behind the AutoML recommendation generation process,
    – and explore the predictions of a recommendation in a trustful manner, through linked visual summaries in the form of graphical, tabular, or textual information.
  Therefore, AMLBID enables end users to ask a series of what-if scenarios while probing the opportunities to use predictive models. It can improve outcomes and reduce costs for various tasks, such as those that depend on classical collaborations between domain experts and data scientists.

3. Illustrative examples

AMLBID broadly has two major modules: the AMLBID_Recommendation module and the AMLBID_Explainer module. The AMLBID_Recommendation module recommends and builds highly tuned ML pipelines, whereas the AMLBID_Explainer module is used to interpret the inner workings of the generated pipeline(s). In the following sections, we discuss the functionality of these modules in further detail.
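To make the meta-learning idea behind the recommendation module more concrete, the following minimal sketch shows how a knowledge base of past experiments could be queried without executing any pipeline. It is an illustration under stated assumptions and not the AMLBID API: the function names (meta_features, recommend), the three meta-features, the nearest-neighbour lookup, and every knowledge-base entry are hypothetical and made up for the example.

```python
import math
from collections import Counter


def meta_features(X, y):
    """Describe a dataset by three simple meta-features (an assumed, tiny set)."""
    n_instances = len(X)
    n_features = len(X[0])
    class_counts = Counter(y)
    # Shannon entropy (in bits) of the class distribution.
    entropy = -sum(
        (count / n_instances) * math.log2(count / n_instances)
        for count in class_counts.values()
    )
    # Log-scale the sizes so that large datasets do not dominate the distance.
    return (math.log10(n_instances), math.log10(n_features), entropy)


# Hypothetical knowledge base: meta-features of previously seen datasets and the
# best configuration recorded for each metric. All values are invented.
KNOWLEDGE_BASE = [
    {"meta": (2.0, 0.6, 1.0),
     "best": {"accuracy": ("RandomForest", {"n_estimators": 200, "max_features": "sqrt"}),
              "recall": ("GradientBoosting", {"n_estimators": 100, "learning_rate": 0.1})}},
    {"meta": (4.0, 1.5, 0.5),
     "best": {"accuracy": ("GradientBoosting", {"n_estimators": 300, "max_depth": 3}),
              "recall": ("LogisticRegression", {"C": 1.0, "penalty": "l2"})}},
    {"meta": (3.0, 2.0, 1.6),
     "best": {"accuracy": ("SVC", {"kernel": "rbf", "C": 10.0, "gamma": "scale"})}},
]


def recommend(X, y, metric="accuracy", k=1):
    """Return the stored configurations of the k most similar past datasets."""
    target = meta_features(X, y)
    candidates = [entry for entry in KNOWLEDGE_BASE if metric in entry["best"]]
    # Rank past datasets by Euclidean distance in meta-feature space.
    ranked = sorted(candidates, key=lambda entry: math.dist(entry["meta"], target))
    return [entry["best"][metric] for entry in ranked[:k]]


# Toy dataset: 100 instances, 4 features, two balanced classes.
X_new = [[float(i), float(i % 3), float(i % 5), float(i % 7)] for i in range(100)]
y_new = [i % 2 for i in range(100)]

for algorithm, hyperparameters in recommend(X_new, y_new, metric="accuracy", k=2):
    print(algorithm, hyperparameters)
```

Because the lookup only computes meta-features and distances, its cost does not depend on training any model, which is the property the text attributes to a knowledge-base driven recommendation. A real system would rely on a far richer set of meta-features and a learned ranking rather than this plain nearest-neighbour rule.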
The AMLBID_Explainer module allows users to inspect the insights of the recommendation module and the decision generation process. Its use is illustrated in Listing 3. It provides explanations on several levels of abstraction, such as the importance of features and the contribution of features to individual predictions (with the help of the SHAP tool [23], which computes the Shapley value of each feature's contribution to a prediction), “what-if” analysis, visualization of individual decision paths, the weight of hyperparameters, and correlations. A partial view of the AMLBID_Explainer component is shown in Fig. 2.

Machine learning based applications are increasingly desired due to their robustness for large-scale data analysis. Also, they can rapidly integrate “off-the-shelf” solutions in multiple areas. However, non-expert data analysts are more inclined to adopt ML based solutions whose choice, among diverse algorithms, can be more easily justified thanks to their rational traceability. We argue that the adoption of powerful decision support systems based on ML solutions can be further enhanced with the help of comprehensive instructions regarding the recommended pipelines and their insights. Thus,
Fig. 2. Functional dashboard of AMLBID_Explainer showing the decision path of a predicted instance.
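The feature contributions that the explainer reports through SHAP [23] rest on Shapley values. As a hedged illustration of that concept only, the sketch below computes exact Shapley values for one prediction of a toy model by enumerating every feature ordering. It does not use the SHAP library or the AMLBID_Explainer interface; the model, the feature names (temperature, pressure, vibration), and the all-zero baseline are assumptions chosen for the example.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of each feature for a single prediction.

    A feature that is absent from a coalition takes its baseline value.
    All orderings are enumerated, so this is only practical for a few features.
    """
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            # Switch feature i from its baseline to its actual value and
            # credit it with the resulting change in the prediction.
            current[i] = instance[i]
            value = predict(current)
            contributions[i] += value - previous
            previous = value
    return [c / len(orderings) for c in contributions]


def toy_model(x):
    """Hypothetical scoring function with one interaction term."""
    temperature, pressure, vibration = x
    return 2.0 * temperature + 1.0 * pressure + 0.5 * temperature * vibration


instance = [3.0, 1.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)  # [9.0, 1.0, 3.0]
```

The three values add up to the difference between the prediction for the instance (13.0) and the prediction for the baseline (0.0), and the interaction term is split equally between temperature and vibration. Practical tools approximate these quantities, since exact enumeration grows factorially with the number of features.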
[14] Kotthoff L, Thornton C, Hoos HH, Hutter F, Leyton-Brown K. Auto-WEKA: automatic model selection and hyperparameter optimization in WEKA. In: Hutter F, Kotthoff L, Vanschoren J, editors. Automated machine learning: methods, systems, challenges. The springer series on challenges in machine learning, Cham; 2019, p. 81–95. http://dx.doi.org/10.1007/978-3-030-05318-5.
[15] RapidMiner, data science & machine learning platform. https://rapidminer.com.
[16] H2O.ai, AI cloud platform. https://www.h2o.ai/.
[17] DataRobot, AI cloud - the next generation of AI. https://www.datarobot.com/.
[18] Machine Learning avec MATLAB. https://fr.mathworks.com/solutions/machine-learning.html.
[19] Guyon I, Sun-Hosoya L, Boullé M, Escalante HJ, Escalera S, Liu Z, Jajetic D, Ray B, Saeed M, Sebag M, Statnikov A, Tu W-W, Viegas E. Analysis of the automl challenge series 2015–2018. In: Automated machine learning: methods, systems, challenges. 2019, p. 177–219. http://dx.doi.org/10.1007/978-3-030-05318-5.
[20] Samek W, Müller K-R. Towards explainable artificial intelligence. In: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. 2019, p. 5–22. http://dx.doi.org/10.1007/978-3-030-28954-6.
[21] Lemke C, Budka M, Gabrys B. Metalearning: A survey of trends and technologies. Artif Intell Rev 2015;44(1):117–30. http://dx.doi.org/10.1007/s10462-013-9406-y.
[22] Garouani M, Ahmad A, Bouneffa M, Hamlich M, Bourguin G, Lewandowski A. Towards big industrial data mining through explainable automated machine learning. 2021, http://dx.doi.org/10.21203/rs.3.rs-755783/v1.
[23] Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Proceedings of the 31st international conference on neural information processing systems. 2017, p. 4768–77, arXiv:1705.07874.
[24] Tao F, Qi Q, Liu A, Kusiak A. Data-driven smart manufacturing. J Manuf Syst 2018;48:157–69 (Special Issue on Smart Manufacturing). http://dx.doi.org/10.1016/j.jmsy.2018.01.006.
[25] Garouani M, Ahmad A, Bouneffa M, Lewandowski A, Bourguin G, Hamlich M. Towards the automation of industrial data science: a meta-learning based approach. In: 23rd international conference on enterprise information systems. 2021, p. 709–16. http://dx.doi.org/10.5220/0010457107090716.
[26] Garouani M, Hamlich M, Ahmad A, Bouneffa M, Bourguin G, Lewandowski A. Towards an automatic assistance framework for the selection and configuration of machine-learning-based data analytics solutions in industry 4.0. In: The fifth international conference on big data and internet of things. [in press].
[27] Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I. Apache spark: A unified engine for big data processing. Commun ACM 2016;59(11):56–65. http://dx.doi.org/10.1145/2934664.