Interpreting Black-Box Machine Learning Models
Abstract—Many datasets are of increasingly high dimensionality, where a large number of features could be irrelevant to the learning task. The inclusion of such features would not only introduce unwanted noise but also increase computational complexity. Deep neural networks (DNNs) outperform classical machine learning (ML) algorithms in a variety of applications due to their effectiveness in modelling complex problems and handling high-dimensional datasets. However, due to non-linearity and higher-order feature interactions, DNN models are unavoidably opaque, making them black-box methods. In contrast, an interpretable model can identify statistically significant features and explain the way they affect the model's outcome. In this paper¹, we propose a novel method to improve the interpretability of black-box models in the case of high-dimensional datasets. First, a black-box model is trained on the full feature space and learns useful embeddings on which the classification is performed. To decompose the inner principles of the black-box and to identify the top-k important features (global explainability), probing and perturbing techniques are applied. An interpretable surrogate model is then trained on the top-k feature space to approximate the black-box. Finally, decision rules and counterfactuals are derived from the surrogate to provide local explanations. Our approach outperforms tabular learners, e.g., TabNet and XGBoost, and SHAP-based interpretability techniques when tested on a number of datasets having dimensionality between 54 and 20,531².

Index Terms—Curse of dimensionality, Black-box models, Interpretability, Attention mechanism, Model surrogation.

¹This paper is accepted and included in the proceedings of the 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA'2023).
²GitHub: https://github.com/rezacsedu/DeepExplainHidim

I. INTRODUCTION

High availability and easy access to large datasets, AI accelerators, and state-of-the-art machine learning (ML) and deep learning (DL) algorithms paved the way for performing predictive modelling at scale. However, in the case of high-dimensional datasets (e.g., omics), the feature space increases exponentially. Principal component analysis (PCA) and isometric feature mapping (Isomap) are widely used to tackle the curse of dimensionality [1]. Although they preserve inter-point distances, they are fundamentally limited to linear embeddings and tend to lose useful information, which makes them less effective for dimensionality reduction [2]. The inclusion of a large number of irrelevant features not only introduces unwanted noise but also increases computational complexity as the data becomes sparser. With increased modelling complexity involving hundreds of features and their interactions, making a general conclusion or interpreting the black-box model's outcome becomes increasingly difficult, whereas many approaches do not take the inner structure of opaque models into account.

In contrast, DNNs benefit from higher pattern recognition capabilities when learning useful representations from such datasets. With multiple hidden layers and non-linear activation functions within layers, autoencoders (AEs) can model complex and higher-order feature interactions. Learning non-linear mappings allows embedding the input feature space into a lower-dimensional latent space. Such representations can be used for both supervised and unsupervised downstream tasks [3], and the embedding can capture contextual information of the data [3]. However, predictions from such a black-box model can neither be traced back to the input, nor is it clear why outputs are transformed in a certain way. This exposes even the most accurate model's inability to answer questions like "how and why are inputs ultimately mapped to certain decisions". In sensitive areas like banking and healthcare, explainability and accountability are not only desirable properties of AI but also legal requirements, especially where AI would have a significant impact on human lives [4]. Therefore, legal landscapes are fast-moving in European and North American countries; e.g., the EU GDPR enforces that processing based on automated decision-making tools should be subject to suitable safeguards, including the "right to obtain an explanation of the decision reached after such assessment and to challenge the decision". How decisions are made should therefore be as transparent as possible, in a faithful and interpretable manner.
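To make the representation-learning step mentioned above concrete, the following is a minimal PyTorch sketch of a fully connected autoencoder that compresses a high-dimensional input into a low-dimensional latent embedding. The layer sizes, latent dimension, and training loop are illustrative assumptions and not the exact architecture used in this paper.

```python
import torch
import torch.nn as nn

class SimpleAutoencoder(nn.Module):
    """A small fully connected autoencoder: the encoder compresses the
    input features into a latent embedding, the decoder reconstructs them."""
    def __init__(self, n_features: int, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, n_features),
        )

    def forward(self, x):
        z = self.encoder(x)           # lower-dimensional embedding
        return self.decoder(z), z     # reconstruction and embedding

if __name__ == "__main__":
    # Random data standing in for a real high-dimensional (e.g., gene expression) dataset.
    X = torch.randn(128, 20531)
    model = SimpleAutoencoder(n_features=X.shape[1], latent_dim=64)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for epoch in range(5):
        recon, z = model(X)
        loss = loss_fn(recon, X)      # reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("embedding shape:", z.shape)  # (128, 64) embedding for downstream classification
```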
Explainable AI (XAI), which gains a lot of attention from both academia and industry, aims to overcome the opaqueness of black-boxes and bring transparency to AI systems. Model-specific and model-agnostic approaches covering local and global interpretability have emerged [5]. While local explanations focus on explaining individual predictions, global explanations describe the entire model behaviour using plots or decision sets. Although an interpretable model can explain how it makes a prediction by exposing the important factors that influence its outcomes, interpretability comes at the cost of efficiency. Research has therefore suggested learning an interpretable model that approximates a black-box globally in order to provide local explanations [6]. A surrogate model's input-output behaviour can be represented in a more human-interpretable form using decision rules (DRs). DRs, containing antecedents (IF) and a consequent (THEN), provide more intuitive explanations³ than graph- or plot-based explications [6].

Further, humans tend to think in a counterfactual way by asking questions like "How would the prediction have been if input x had been different?"⁴. By using a set of rules and counterfactuals, it is possible to explain decisions directly to humans with the ability to comprehend the underlying reason, so that users can focus on the learned knowledge⁵ without emphasising the underlying data representations. Keeping in mind the practical and legal consequences of using black-box models, we propose a novel method to improve the interpretability of black-box models for classification tasks. We hypothesize that: i) by decomposing the inner logic (e.g., the most important features), the opaqueness of a black-box can be mitigated by outlining the most (e.g., top-k feature space) and least important features; ii) finding a sub-domain of the full feature space would allow us to train a surrogate model that is able to sufficiently approximate the black-box model; and iii) a representative decision rule set can be generated with the surrogate, which can be used to sufficiently explain individual decisions in a human-interpretable way.

³An example rule for a loan application denial could be "IF monthly income = 3,000 AND credit rating history = BAD AND employment status = YES AND married = YES, THEN decision = DENY".
⁴"What would have been the decision if my monthly income were higher?"
⁵"Although you're employed, given your monthly income of 2,000 EUR and your bad credit rating history, our model has denied your application, as we think you're unlikely to repay. Even though you have had a bad credit rating history, an increase in your monthly income of 1,000 EUR will definitely end up with acceptance, as you're already employed."
II. RELATED WORK

Existing interpretable ML methods can be categorized as either model-specific or model-agnostic, with a focus on local interpretability, global interpretability, or both. Local interpretable model-agnostic explanations (LIME) [7], model understanding through subspace explanations (MUSE) [8], SHapley Additive exPlanations (SHAP) [9], partial dependence plots (PDP), individual conditional expectation (ICE), permutation feature importance (PFI), and counterfactual explanations (CE) [5] are among them. These methods operate by approximating the outputs of an opaque model via tractable logic, such as game-theoretic Shapley values (SVs), or by locally approximating a complex or black-box model with a linear model [10]. Since these approaches do not take into account the inner structure of an opaque black-box model, probing, perturbing, attention mechanisms, sensitivity analysis (SA), saliency maps, and gradient-based attribution methods have been proposed to understand the underlying logic of complex models.
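As a small illustration of the attribution style described above, the sketch below uses the shap library to compute Shapley-value attributions for an XGBoost classifier and derives a simple global ranking from them. The synthetic data, model settings, and the choice of ranking by mean absolute attribution are assumptions for illustration only.

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in for a high-dimensional tabular dataset.
X, y = make_classification(n_samples=500, n_features=100, n_informative=10, random_state=42)

model = xgb.XGBClassifier(n_estimators=100, max_depth=4).fit(X, y)

# TreeExplainer computes Shapley-value attributions for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:50])

# A simple global importance proxy: rank features by mean absolute attribution.
global_importance = np.abs(shap_values).mean(axis=0)
top_k = np.argsort(global_importance)[::-1][:10]
print("Top-10 features by mean |SHAP|:", top_k)
```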
Saliency-map and gradient-based methods can identify relevant regions and assign an importance to each feature (e.g., image pixels), where the first-order gradient information of a black-box model is used to produce heatmaps indicating their relative importance. Gradient-weighted class activation mapping (Grad-CAM++) [11] and layer-wise relevance propagation (LRP) [12] are examples of this category; they highlight the relevant parts of the input (e.g., image regions) that caused a DNN's decision. Attention mechanisms are used in a variety of supervised and language modelling tasks, as they can detect larger subsets of features. The self-attention network (SAN) [13] is proposed to identify important features from tabular data. TabNet [14] uses sequential attention to choose a subset of semantically meaningful features to process at each decision step. It also visualizes the importance of features and how they are combined to quantify the contribution of each feature to the model, enabling local and global interpretability. SAN is found to be effective on datasets having a large number of features, while its performance degrades on smaller datasets, indicating that without enough data the relevant parts of the feature space cannot be distilled [13]. Model interpretation strategies have also been proposed that involve training an inherently interpretable surrogate model to learn a locally faithful approximation of a black-box model [6]. Since an explanation relates the feature values of a sample to its prediction, rule-based explanations are easier for humans to understand. Anchor [15] is a rule-based method that extends LIME and provides explanations in the form of decision rules. Anchor computes rules by incrementally adding equality conditions to the antecedent while an estimate of the rule precision stays above a threshold [16].

A drawback of rule-based explanations is overlapping and contradictory rules. Sequential covering (SC) and scalable Bayesian rule lists (SBRL) have been proposed to deal with these issues. SC iteratively learns rules that cover the training data rule-by-rule, removing the data points already covered by new rules, while SBRL combines pre-mined frequent patterns into a decision list using Bayesian statistics [6]. Local rule-based explanations (LORE) [16] is proposed to overcome these issues. LORE learns an interpretable model on a neighbourhood generated with genetic algorithms. It derives explanations from the interpretable model and provides local explanations in the form of a decision rule and counterfactuals that signify which changes in feature values may lead to a different outcome. LIME indicates where to look for a decision based on feature values, while the counterfactual rules of LORE signify minimal-change contexts for reversing the predictions.
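To make the notion of a local explanation tangible, the snippet below queries LIME for the feature conditions that drive one prediction of a black-box classifier. The dataset, the random-forest stand-in, and the parameter choices are assumptions for illustration, not the setup used in this paper.

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# A random forest stands in for an arbitrary black-box classifier.
data = load_breast_cancer()
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    discretize_continuous=True,
)

# Explain a single instance with its 5 most influential feature conditions.
exp = explainer.explain_instance(data.data[0], clf.predict_proba, num_features=5)
print(exp.as_list())  # [(feature condition, signed local weight), ...]
```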
III. METHODS

Each high-dimensional dataset has a large feature space. Therefore, we first train a black-box model to learn representations. Then, we classify the data points in their embedding space instead of the original feature space. To decompose the inner structure of the black-box, probing and perturbing techniques are applied to identify the top-k features that contribute most to the model's overall decision-making. An interpretable surrogate model is then built on the top-k features to approximate the black-box. Finally, decision rules and counterfactuals are generated from the surrogate to explain individual decisions.
Fig. 1: Workflow of our proposed approach (recreated based on Karim et al. [17])
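The following is a heavily simplified sketch of this workflow, under illustrative assumptions: a gradient-boosting classifier stands in for the black-box trained on learned embeddings, permutation-style perturbation is used to rank features, and a depth-limited decision tree serves as the surrogate. It is not the authors' exact implementation (which, among other things, trains the black-box on autoencoder embeddings), but it shows the probe-and-perturb, surrogation, and rule-extraction steps end to end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier   # stand-in for the black-box
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier, export_text

# 1) Train a (stand-in) black-box model on the full feature space.
X, y = make_classification(n_samples=1000, n_features=200, n_informative=15, random_state=1)
black_box = GradientBoostingClassifier().fit(X, y)

# 2) Probe/perturb: rank features by how much shuffling them hurts the black-box.
perm = permutation_importance(black_box, X, y, n_repeats=5, random_state=1)
k = 10
top_k = np.argsort(perm.importances_mean)[::-1][:k]

# 3) Train an interpretable surrogate on the top-k feature space,
#    using the black-box predictions as targets (model surrogation).
y_bb = black_box.predict(X)
surrogate = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X[:, top_k], y_bb)

# 4) Derive human-readable IF-THEN decision rules from the surrogate.
feature_names = [f"f{i}" for i in top_k]
print(export_text(surrogate, feature_names=feature_names))

# Fidelity check: how closely does the surrogate reproduce the black-box?
print("fidelity:", (surrogate.predict(X[:, top_k]) == y_bb).mean())
```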
COAD), these variables will play a crucial role in maintaining this prediction. Conversely, TP53, CDS1, PCOLCE2, MGP, MTCO1P53, TFF3, AC026403-1, BRCA1, LAPTM5, SULT4A1, EN1, EFNB1, and GABRP have negative impacts on the prediction. It means that if the prediction is COAD and the values of these variables are increased, the final prediction is likely to end up flipping to another cancer type.

Fig. 7: Lift curve for the SANCAE model trained on the GE dataset

TABLE III: Percentage of variance (R²) of surrogates

Dataset           DT            RF            XGBoost
UJIndoorLoc       86.2 ± 1.7    89.3 ± 1.5    91.4 ± 1.5
Health advice     89.4 ± 1.5    92.1 ± 1.8    94.2 ± 1.7
Forest cover      90.3 ± 1.4    91.2 ± 1.4    94.3 ± 1.3
Gene expression   88.3 ± 1.4    90.2 ± 1.3    93.3 ± 1.5
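Table III reports how much of the black-box's behaviour each surrogate captures, expressed as a percentage of variance (R²). A minimal sketch of how such a fidelity score can be computed is shown below; computing it between the black-box's and the surrogate's positive-class probabilities is an assumption about the exact quantities used.

```python
from sklearn.metrics import r2_score

def surrogate_fidelity_r2(black_box, surrogate, X_full, X_topk):
    """R^2 between black-box and surrogate positive-class probabilities,
    i.e., how much of the black-box's variance the surrogate explains."""
    p_black_box = black_box.predict_proba(X_full)[:, 1]
    p_surrogate = surrogate.predict_proba(X_topk)[:, 1]
    return r2_score(p_black_box, p_surrogate)

# Usage, e.g., with the models from the earlier pipeline sketch:
# print(surrogate_fidelity_r2(black_box, surrogate, X, X[:, top_k]))
```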
E. Local interpretability

First, we randomly pick a sample from the test set. Assuming XGBoost predicts the instance to be of the COAD cancer type, the contribution plot (fig. 7 in the supplementary) outlines how much contribution individual features had to this prediction. Features (genes) DNMT3A, SLC22A18, RB1, CDKN18, and MYB are among the top-k features w.r.t. impact values, while features CASP8 and MAP2K4 had negative contributions. Further, to quantitatively validate the impact of the top-k features and to assess feature-level relevance, we carry out what-if analysis. As shown, the observation is of COAD type with a probability of 55% and of BRCA type with a probability of 29%. Features on the right side (i.e., TFAP2A, VPS9D1-AS1, MTND2P28, ADCY3, and FOXP4 are positive for the COAD class, where feature TFAP2A has the highest positive impact of 0.29) positively impact the prediction, while features on the left impact it negatively. Genes TFAP2A, VPS9D1-AS1, MTND2P28, ADCY3, FOXP4, GPRIN1, EFNB1, FABP4, MGP, AC020916-1, CDC7, CHADL, RPL10P6, OASL, and PRSS16 are most sensitive to making changes, while features SEMA4C, CWH43, HAGLROS, SEMA3E, and IVL are less sensitive to making changes.

If we remove feature TFAP2A from the profile, we would expect the model to predict the observation to be of COAD cancer type with a probability of 26% (i.e., 55% − 29%). This would recourse the actual prediction to BRCA, for which features IVL, PRSS16, EFNB1, and CWH43 are most important, having impacts of 0.23, 0.17, 0.123, and 0.07, respectively. These features not only reveal their relevance for this decision but also signify that removing them is likely to impact the final prediction. Further, we focus on local explanations for this prediction by connecting decision rules and counterfactuals with additive feature attributions (AFA) in fig. 8. While Anchor provides a single rule outlining which features mattered in arriving at this decision, LIME generates AFA stating which features had positive and negative impacts. However, using decision rules and a set of counterfactuals, we show how the classifier could arrive at the same decision in multiple ways due to different negative or positive feature impacts.
Fig. 8: Example of explaining a single prediction using rules, counterfactuals, and additive feature attributions
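The what-if reasoning illustrated in fig. 8 is essentially additive: the class probability is a baseline plus the signed contributions of individual features, so dropping the largest positive contribution (TFAP2A, +0.29) lowers the COAD probability from 55% to roughly 26%. The short sketch below reproduces that arithmetic; only the 0.29 value for TFAP2A and the 55%/26% probabilities come from the text above, while the baseline and the remaining contributions are made-up values for illustration.

```python
# Hypothetical additive attribution for one sample:
# probability = baseline + sum(signed feature contributions).
baseline = 0.20                       # assumed base rate for the COAD class
contributions = {
    "TFAP2A": 0.29,                   # value reported in the text
    "VPS9D1-AS1": 0.08,               # remaining values are illustrative assumptions
    "MTND2P28": 0.05,
    "ADCY3": 0.02,
    "CASP8": -0.05,
    "MAP2K4": -0.04,
}

p_coad = baseline + sum(contributions.values())
print(f"P(COAD) with all features: {p_coad:.2f}")            # 0.55

# What-if: remove the single largest positive contributor.
p_without_tfap2a = p_coad - contributions["TFAP2A"]
print(f"P(COAD) without TFAP2A:   {p_without_tfap2a:.2f}")    # 0.26
```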
V. CONCLUSION

In this paper, we proposed an efficient technique to improve the interpretability of complex black-box models trained on high-dimensional datasets. Our model surrogation strategy is equivalent to the knowledge distillation process for creating a simpler model. However, instead of training the student model on the teacher's predictions, we transferred learned knowledge (e.g., the top-k, or globally most and least important, features) to a student and optimized an objective function. Further, the more trainable parameters a black-box model has, the bigger the model becomes. This makes the deployment of such a large model infeasible on resource-constrained devices¹⁵. The inference time of large models also increases and ends up with poor response times due to network latency even when deployed in a cloud infrastructure, which is unacceptable in many real-time applications. We hope our model surrogation strategy would help create simpler and lighter models and improve interpretability in such situations.

¹⁵e.g., IoT devices having limited memory and low computing power.

Depending on the complexity of the modelling task, a surrogate model may not be able to fully capture a complex black-box model. Consequently, it may lead users to wrong conclusions (e.g., in healthcare), especially if the knowledge distillation process is not properly evaluated and validated. In the future, we want to focus on other model compression techniques such as quantization (i.e., reducing the numerical precision of model parameters or weights) and pruning (e.g., removing less important parameters or weights).
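The two compression techniques mentioned above can be sketched in PyTorch as follows; the toy model, the pruning amount, and the choice of dynamic int8 quantization are illustrative assumptions rather than a recommended configuration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a larger trained model.
model = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 10))

# Pruning: zero out the 30% smallest-magnitude weights of the first layer.
prune.l1_unstructured(model[0], name="weight", amount=0.3)
prune.remove(model[0], "weight")   # make the pruning permanent

# Dynamic quantization: store Linear weights as int8 instead of float32.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```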
ACKNOWLEDGEMENT

This paper is a collaborative effort and is based on the PhD thesis [17] by the first author and the second author's work as part of the Marie Skłodowska-Curie project funded by the Horizon Europe 2020 research and innovation program of the European Union under grant agreement no. 955422.

REFERENCES

[1] Q. Fournier and D. Aloise, "Empirical comparison between autoencoders and traditional dimensionality reduction methods," in 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE). IEEE, 2019, pp. 211–214.
[2] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications. CRC Press, 2014.
[3] M. R. Karim, T. Islam, M. Cochez, D. Rebholz-Schuhmann, and S. Decker, "Explainable AI for Bioinformatics: Methods, Tools, and Applications," Briefings in Bioinformatics, 2023.
[4] M. E. Kaminski, "The right to explanation, explained," Berkeley Tech. LJ, vol. 34, p. 189, 2019.
[5] S. Wachter, B. Mittelstadt, and C. Russell, "Counterfactual explanations without opening the black box: Automated decisions and the GDPR," Harv. JL & Tech., vol. 31, p. 841, 2017.
[6] C. Molnar, Interpretable Machine Learning. Lulu.com, 2020.
[7] M. Ribeiro, S. Singh, and C. Guestrin, "Local interpretable model-agnostic explanations (LIME): An introduction," 2019.
[8] H. Lakkaraju, E. Kamar, R. Caruana, and J. Leskovec, "Faithful and customizable explanations of black box models," in Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 131–138.
[9] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.
[10] T. Miller, "Explanation in artificial intelligence: Insights from the social sciences," Artificial Intelligence, 2018.
[11] A. Chattopadhay and A. Sarkar, "Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks," in Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 839–847.
[12] S. Bach, A. Binder, G. Montavon, K.-R. Müller, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS ONE, vol. 10, no. 7, 2015.
[13] B. Škrlj, S. Džeroski, N. Lavrač, and M. Petkovič, "Feature importance estimation with self-attention networks," arXiv preprint arXiv:2002.04464, 2020.
[14] S. O. Arık and T. Pfister, "TabNet: Attentive interpretable tabular learning," in AAAI, vol. 35, 2021, pp. 6679–6687.
[15] M. T. Ribeiro, S. Singh, and C. Guestrin, "Anchors: High-precision model-agnostic explanations," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[16] R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, and F. Giannotti, "Local rule-based explanations of black box decision systems," arXiv preprint arXiv:1805.10820, 2018.
[17] M. R. Karim, D. Rebholz-Schuhmann, and S. Decker, "Interpreting black-box machine learning models with decision rules and knowledge graph reasoning," Aachen, Germany, June 2022. [Online]. Available: https://publications.rwth-aachen.de/record/850613
[18] M. D. Zeiler, G. W. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2018–2025.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008, 2017.
[20] S. M. Lundberg and S.-I. Lee, "Consistent feature attribution for tree ensembles," arXiv preprint arXiv:1706.06060, 2017.
[21] R. M. Grath, L. Costabello, C. L. Van, P. Sweeney, F. Kamiab, Z. Shen, and F. Lecue, "Interpretable credit application predictions with counterfactual explanations," arXiv preprint arXiv:1811.05245, 2018.
[22] J. Torres-Sospedra, R. Montoliu, A. Martínez-Usó, T. J. Arnau, M. Benedito-Bordonau, and J. Huerta, "UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems," in 2014 International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE, 2014, pp. 261–270.
[23] J. A. Blackard and D. J. Dean, "Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables," Computers and Electronics in Agriculture, vol. 24, no. 3, pp. 131–151, 1999.