LocalGLMnet Interpretable Deep Learning For Tabular Data
LocalGLMnet Interpretable Deep Learning For Tabular Data
To cite this article: Ronald Richman & Mario V. Wüthrich (2023) LocalGLMnet: interpretable
deep learning for tabular data, Scandinavian Actuarial Journal, 2023:1, 71-95, DOI:
10.1080/03461238.2022.2081816
1. Introduction
Deep learning models celebrate great success in statistical modeling because they often provide supe-
rior predictive power over classical regression models. This success is based on the fact that deep
learning models perform representation learning of features, which means that they bring features
into the right structure to be able to extract maximal information for the prediction task at hand.
This feature engineering is done internally in a nontransparent way by the deep learning model. For
this reason, deep learning solutions are often criticized to be non-explainable and interpretable, in
particular, because this process of representation learning is performed in high-dimensional spaces
analyzing bits and pieces of the feature information. Recent research has been focusing on interpret-
ing machine learning predictions in retrospect, see, e.g. Friedman’s partial dependence plot (PDP)
(Friedman 2001), the accumulated local effects (ALE) method of Apley & Zhu (2020), the locally
interpretable model-agnostic explanation (LIME) introduced by Ribeiro et al. (2016), the SHapley
Additive exPlanations (SHAP) of Lundberg & Lee (2017) or the marginal attribution by conditioning
on quantiles (MACQ) method proposed by Merz et al. (2022). PDP, ALE, LIME and SHAP can be
used for any machine learning method such as random forests, boosting or neural networks, whereas
MACQ requires differentiability of the regression function which is the case for neural networks
under differentiable activation functions; for a review of more gradient-based methods, we refer to
Merz et al. (2022).
We follow a different approach here, namely, we propose a new network architecture that has an
internal structure that directly allows for interpreting and explaining. Moreover, this internal struc-
ture also allows for variable selection of tabular feature data and to extract interactions between
feature components. The starting point of our proposal is the framework of generalized linear mod-
els (GLMs) introduced by Nelder & Wedderburn (1972) and McCullagh & Nelder (1983). GLMs are
characterized by the choice of a link function that maps the regression function to a linear predictor,
and, thus, leading to a linear functional form that directly describes the influence of each predictor
variable on the response variable. Of course, this (generalized) linear form is both transparent and
interpretable. To some extent, our architecture preserves this linear structure of GLMs, but we make
the coefficients of the linear predictors feature dependent. Such an approach follows a similar strategy
as the ResNet proposal of He et al. (2016) that considers a linear term and then builds the network
around this linear term. The LassoNet of Lemhadri et al. (2021) follows a similar philosophy, too,
by performing Lasso regularization on network features. Both proposals have in common that they
use a so-called skip connection in the network architecture that gives a linear model part around
which the non-linear network model is built. Our proposal uses such a skip connection, too, which
provides the linear model part, and we weight these linear terms with potentially non-linear weights.
This allows us to generate non-linear regression functions with arbitrary interactions. In spirit, our
non-linear weights are similar to the attention layers recently introduced by Bahdanau et al. (2014)
and Vaswani et al. (2017). Attention layers are a rather successful new way of building powerful net-
works by extracting more important feature components from embeddings by giving more weight
(attention) to them. From this viewpoint we construct (network regression) attention weights that
provide us with a local GLM for tabular data, and we therefore call our proposal LocalGLMnet. These
regression attention weights also provide us with a possibility of explicit variable selection, which is
a novel and unique property within network regression models. Moreover, we can explicitly explore
interactions between feature components. A similar approach is considered in Ahn et al. (2021) on
Bayesian credibility theory, where the credibility weights are modeled by attention weights.
There is another stream of literature that tries to choose regression structures that have inter-
pretable features, e.g. the explainable neural networks (xNN) and the neural additive models (NAM)
make restrictions to the structure of the network regression function by running different subsets of
feature components through (separated) parallel networks, see Vaughan et al. (2018) and Agarwal
et al. (2020). The drawback of these proposals is that representation learning is restricted within the
parallel networks. Our proposal overcomes this issue as it allows for general interactions. We also
mention (Richman 2021) who extends the xNN approach by explicitly including linear features to a
combined model called CAXNN. Another interpretable network approach is the TabNet proposal of
Arik & Pfister (2019). TabNet uses networks to create attention to important features for the regres-
sion task at hand, however, this proposal has the drawback that it may lead to heavy computational
burden.
Our LocalGLMnet approach overcomes the limitations of these explainable network approaches.
As we will see below, our proposal is computationally efficient and it leads to nice explanations, in the
sense that we can interpret the predictions made by our model in terms of an additive decomposition
of the features. This will be further explored in Section 2.4.
Our proposal of the LocalGLMnet, which estimates the coefficients of a (local) GLM, and, there-
fore, maintains a directly interpretable structure in terms of the original features, is, to the best of
our knowledge, unique within the literature. In this key respect, the LocalGLMnet is different from
the DeepGLM and the DeepGLMM of Tran et al. (2020), who study GLMs and generalized linear
mixed models (GLMMs) that rely on learned feature representations from a deep neural network, i.e.
the original features are not used directly to make predictions. One similarity between these model
approaches is that the DeepGLMM allows for observation specific random effects in the last layer of
that model, thus, the coefficients may vary for each observation, which is similar to our LocalGLMnet.
Whereas the LocalGLMnet complies with the interpretable machine learning requirements discussed
in the manifesto of Rudin (2019), the interpretable models discussed in that work and its references
rely on significantly different classes of algorithms from GLMs, which we focus on here due to the
importance of GLMs within actuarial work. Among the models mentioned in Rudin (2019), perhaps
the most similar is the prototype network of Chen et al. (2019), which performs classification based
SCANDINAVIAN ACTUARIAL JOURNAL 73
on the similarity of an image to interpretable learned prototypes, which are used as interpretable fea-
tures within a deep neural network. In the case of the LocalGLMnet, the original feature values are
used directly to make predictions, thus, maintaining interpretability. Within the machine learning
literature, the most similar work to the LocalGLMnet is the hypernetwork concept of Ha et al. (2016),
where one neural network is used to estimate the weights of a second network. In the case of the
LocalGLMnet, a neural network is being used to estimate the weights of a second model, which is
a (local) GLM. Similar types of models have been considered within the actuarial literature in the
context of mortality forecasting by, e.g. Perla et al. (2021), who derive the coefficients of a Lee–Carter
type of model using convolutional and recurrent neural networks.
Using the LocalGLMnet architecture, we propose an empirically based Wald-type method for
variable selection, which relies on adding synthetic variables generated by a uniform or Gaussian dis-
tribution to the data before the LocalGLMnet is fitted. The estimated non-linear regression weights
for these synthetic variables are then used to estimate a confidence interval to determine the sig-
nificance of the regression weights for the rest of the dataset. The method proposed here is closely
connected to the Boruta system of variable selection proposed by Kursa & Rudnicki (2010), which
functions by adding permuted copies of the features to a random forest algorithm, and assessing the
significance of each feature with reference to the level of significance assigned to these permuted fea-
tures by the variable importance scores within the random forest algorithm. The approach proposed
here is similar in that synthetic variables are added to the dataset, but differs in terms of how the sig-
nificance of the features is evaluated: as opposed to using variable importance scores, the structure
of the LocalGLMnet allows for the significance of variables to be evaluated directly in terms of the
regression weights.
2. Model architecture
2.1. Generalized linear model
The starting point of our proposal is a GLM which typically is based on the exponential disper-
sion family (EDF). GLMs have been introduced by Nelder & Wedderburn (1972) and McCullagh
& Nelder (1983), and the EDF has been analyzed in detail by Jørgensen (1997) and Barndorff-
Nielsen (2014); we use the notation and terminology of Wüthrich & Merz (2021), and for a detailed
treatment of GLMs and the EDF we refer to Chapters 2 and 5 of that latter reference.
Assume we have a datum (Y, x, v) with a given exposure v > 0, a vector-valued feature x ∈ Rq ,
and a response variable Y following a member of the (single-parameter linear) EDF having density
74 R. RICHMAN AND M. V. WÜTHRICH
with GLM regression parameter β = (β1 , . . . , βq ) ∈ Rq , bias (intercept) β0 ∈ R, and where ·, ·
denotes the scalar product in Rq . This GLM assumption implies that the canonical parameter takes
the following form (on the canonical scale of the EDF)
θ = (κ )−1 g −1 (β0 + β, x ) ,
where (κ )−1 is called canonical link of the chosen EDF (1).
The GLM regression function (2) is very appealing because it leads to a linear predictor η(x) =
β0 + β, x (after applying the link function g to the mean μ(x)), and the regression parameter βj
directly explains how the individual feature component xj influences the linear predictor η(x) and
the expected value μ(x) of Y, respectively. Our goal is to benefit from this transparent structure as far
as possible.
qm−1
(m)
(m) (m) (m) (m)
zj (x) = φm w0,j + wj , x = φm w0,j + wl,j xl ,
l=1
A FFN network of depth d ∈ N is obtained by composing d FFN layers (3) to provide a deep learned
representation, we set input dimension q0 = q,
z (d:1) : Rq → Rqd
(4)
x → z (d:1) (x) = z (d) ◦ · · · ◦ z (1) (x).
This qd -dimensional learned representation z (d:1) (x) ∈ Rqd then enters a GLM of type (2) providing
FFN network regression function
with GLM regression (output) parameter β = (β1 , . . . , βqd ) ∈ Rqd and bias β0 ∈ R. From (5), we
see that the ‘raw’ feature x is first suitably transformed before entering the GLM structure. The stan-
dard reference for neural networks is Goodfellow et al. (2016), for more insight, interpretation and
model fitting we refer to Section 7.2 of Wüthrich & Merz (2021).
Assumption 2.1 (LocalGLMnet): Choose an FFN network architecture of depth d ∈ N with input and
output dimensions being equal to q0 = qd = q to model attention weights
β : Rq → Rq
(6)
x → β(x) = z (d:1) (x) = z (d) ◦ · · · ◦ z (1) (x).
The LocalGLMnet is defined by the additive decomposition after applying a strictly monotone and
continuous link function g : R → R to the mean μ
x → g(μ) = g (μ(x)) = β0 + β(x), x . (7)
Network architecture (7) is an FFN network architecture with a skip connection: first, feature x is
processed through the deep FFN network providing us with learned representation β(x) ∈ Rq , and
second, x has a direct link to the output layer (skipping all FFN layers) providing an (untransformed)
linear term x. The LocalGLMnet then scalar multiplies these two different components, see (7). In
the next section, we give extended remarks and interpretation.
This model can be fitted to data by state-of-the-art stochastic gradient descent (SGD) methods
using training and validation data for performing early stopping to not over-fit to the training data,
for details we refer to Goodfellow et al. (2016) and Section 7.2 in Wüthrich & Merz (2021).
76 R. RICHMAN AND M. V. WÜTHRICH
Remarks 2.2: • Formula (7) defines the LocalGLMnet. If we replace the scalar product in (7) by
a Hadamard product ⊗ (component-wise product) we receive a LocalGLM layer
x → β (1) (x) ⊗ x = β1(1) (x)x1 , . . . , βq(1) (x)xq ∈ Rq .
A deep LocalGLMnet can be received by composing such LocalGLM layers, for instance, if we
compose two such layers we receive a regression function
x → g(μ) = g (μ(x)) = β0(2) + β (2) β (1) (x) ⊗ x , x .
Such an architecture may lead to increased predictive power and interpretable intermediate
steps.
• Above we have emphasized that the LocalGLMnet will lead to an interpretable regression
model, and the verification of this statement will be done in the examples, below. Alternatively,
if one wants to rely on a plain-vanilla deep FFN network, one can still fit a LocalGLMnet to the
deep FFN network as an interpretable surrogate model.
• The LocalGLMnet has been introduced for tabular data as we try to mimic a GLM that acts on a
design matrix which naturally is in tabular form. If we extend the LocalGLMnet to unstructured
data, time series or image recognition it requires that this data is first encoded into tabular form,
e.g. by using a convolutional module that extracts from spatial data relevant feature information
and transforms this into tabular structure. That is, it requires the paradigm of representation
learning by first bringing raw inputs into a suitable form before encoding this information by
the LocalGLMnet for prediction.
Before we study the implementation of the LocalGLMnet, we would like to get the right interpre-
tation and intuition for regression function (7). We select one component 1 ≤ j ≤ q which provides
us (after applying the link g) with terms
βj (x)xj . (8)
We mention specific cases in the following remarks and how they should be interpreted.
Remarks 2.3: (1) A GLM term is obtained in component xj if βj (x) ≡ βj = 0 is not feature
dependent, providing βj xj , we refer to GLM (2). If this appears to be the case, e.g. using
SCANDINAVIAN ACTUARIAL JOURNAL 77
the model diagnostic tools presented in the next sections, one could replace this term in the
LocalGLMnet with a constant to reduce unnecessary model complexity.
(2) Condition βj (x) ≡ 0 proposes that the term xj should not be included. In Section 3.2, we are
going to present an empirical test for the null hypothesis of dropping a term.
(3) Property βj (x) = βj (xj ) says that we have a term βj (xj )xj that does not interact with other
terms. In general, we can analyze βj (x) over different features x with the jth component xj
being fixed. If this βj (x) does not show any sensitivity in the components different from j,
then we do not have interactions and otherwise we do. Below, we extract this information by
considering the gradients
∂ ∂
∇βj (x) = βj (x), . . . , βj (x) ∈ Rq . (9)
∂x1 ∂xq
The jth component of this gradient ∇βj (x) explores whether we have a linear term in xj , and
the components different from j quantify the interaction strengths.
(4) There is one caveat with these interpretations as we do not have identifiability in model
calibration, as, e.g. we could receive the following structure:
by learning a regression attention βj (x) = xj /xj . However, our tests on different configura-
tions have not manifested any such issues as SGD fitting seems rather pre-determined by the
LocalGLMnet functional form (7) if we initialize SGD in the optimal GLM.
3. Examples
3.1. Synthetic data example
We start with a synthetic data example because this has the advantage of knowing the true data gen-
erating model. This allows us to verify that we draw the right conclusions. We choose q = 8 feature
components x = (x1 , . . . , x8 ) ∈ R8 . We generate two data sets, learning data L and test data T .
The learning data will be used for model fitting and the test data for an out-of-sample generaliza-
tion analysis. We choose for both data sets n = 100, 000 randomly generated independent features
x ∼ N (0, ) being centered and having unit variance. Moreover, we assume that all components of
¯
x are independent, except between x2 and x8 we assume a correlation of 50%. Based on these features
we choose regression function
1 1 1 1 1
x = (x1 , . . . , x8 ) ∈ R8 → μ(x) = x1 − x22 + |x3 | sin(2x3 ) + x4 x5 + x52 x6 , (11)
2 4 2 2 8
thus, neither x7 nor x8 run into the regression function, x7 is independent from the remaining compo-
nents, and x8 has a 50% correlation with x2 . Observe that all components of x have unit variance and,
hence, live on the same scale. Based on this regression function we generate independent Gaussian
observations
Y|x ∼ N (μ(x), 1). (12)
This gives us the two data sets with all instances being independent
Gaussian model (12) is an EDF with cumulant function κ(θ ) = θ 2 /2, effective domain = R,
exposure v = 1 and dispersion parameter ϕ = 1, and we can apply the theory of Section 2.
78 R. RICHMAN AND M. V. WÜTHRICH
We start with the GLM. As link function g we choose the identity function which is the canonical
link of the Gaussian model. This provides us with linear predictor in the GLM case
with regression parameter β ∈ R8 and bias β0 ∈ R. This model is fit to the learning data L using
maximum likelihood estimation (MLE) which, in the Gaussian case, is equivalent to minimizing the
mean squared error (MSE) loss function for regression parameter (β0 , β)
1
n
0MLE ,
(β β
MLE
) = arg min (Yi − μ(xi ))2 . (14)
(β0 ,β) n i=1
For this fitted model, we calculate the in-sample MSE on L and the out-of-sample MSE on T . We
compare these losses to the MSEs of the true model μ(xi ), which is available here, and the corre-
sponding MSEs of the null model which only includes a bias β0 . These figures are given in Table 1 on
lines (a)–(c).
The MSEs under the true regression function are roughly equal to 1 which exactly corresponds to
the fact that the responses Y have been simulated with unit variance, see (12), the small differences
to 1 correspond to the randomness implied by simulating. The loss figures of the GLM are between
the null model (homogeneous model) and the correct model. Still these loss figures are comparably
large because the true model (11) has a rather different functional form compared to what we can
capture by the linear function (13). This is also verified by Figure 1 (lhs) which compares the GLM
estimated means μ(xt ) to the true means μ(xt ) for different instances xt . A perfect model would
provide points on the diagonal orange line, and we see rather big differences between the GLM and
the true regression function μ.
We could now start to improve (13), e.g. by including a quadratic term. We refrain from doing so,
but we fit the LocalGLMnet architecture (7) using the identity link for g, a network of depth d = 4
having (q0 , q1 , q2 , q3 , q4 ) = (8, 20, 15, 10, 8) neurons, and as activation functions φm we choose the
hyperbolic tangent function for layers m = 1, 2, 3 and the linear function for m = 4. This architecture
is illustrated in Listing 1 in the appendix. This LocalGLMnet is fitted using the nadam version of SGD;
early stopping is tracked by using 20% of the learning data L as validation data V and the remaining
80% as training data U , and the network calibration with the smallest MSE validation loss on V is
selected; note that L = U ∪ V is disjoint from the test data T . This fitting approach is state-of-the-art,
for more details we refer to Section 7.2.3 in Wüthrich & Merz (2021). Typically, the SGD algorithm
starts in a random initialization, e.g. using the Glorot initializer. We use a special initialization in
(d)
the last hidden layer z (d) , here. Namely, we set all weights to zero wj = 0, 1 ≤ j ≤ q, and the biases
(d) (d)
in this last hidden layer are initialized with the GLM parameter estimate (w0,1 , . . . , w0,q ) =
MLE
β ,
see (14). This choice implies that we start the SGD algorithm directly in the GLM that we try to
improve by learning potentially non-constant attention weights β(x). This initialization also mitigates
identifiability issue (10), because the LocalGLMnet has already rather pre-determined (linear) terms
by this initialization.
SCANDINAVIAN ACTUARIAL JOURNAL 79
Figure 1. Estimated means μ(xt ) vs. true means μ(xt ): (lhs) fitted GLM (13) and (rhs) fitted LocalGLMnet of 5000 randomly selected
out-of-sample instances xt from T .
The fitting results are given on line (d) of Table 1. The MSEs (in-sample and out-of-sample) are
very close to 1 which indicates that the LocalGLMnet is able to find the true regression structure (13).
This is verified by Figure 1 (rhs) which plots the estimated means
μ(xt ) against the true means μ(xt )
(out-of-sample) for 5000 randomly selected instances xt from T . The fitted LocalGLMnet estimators
lie on the diagonal which says that we have good accuracy.
We are now ready to benefit from the LocalGLMnet architecture. We study the estimated regres-
sion attentions and the resulting terms in the LocalGLMnet regression function
j (x)
x → β j (x)xj .
and x → β
The interpretation to these terms has been given in Remarks 2.3. We use the code of Listing 2 to
extract β(x) from the fitted LocalGLMnet. These estimated regression attentions β j (x) are illustrated
in Figure 2 by the black dots for all components 1 ≤ j ≤ q = 8.
We interpret Figure 2. Regression attention β 1 (x) is concentrated around 1/2 which describes the
first term in (11). Regression attentions β 2 (x), . . . , β
6 (x) are quite different from 0 (red horizontal
line) which indicates that x2 , . . . , x6 are important in the description of the true regression function μ.
Finally, β7 (x) and β8 (x) are concentrated around zero (light cyan stripe), which indicates that feature
information x7 and x8 may not be important for our regression function; the construction of the light
cyan stripe is explained in the next section on variable selection. Thus the last two variables could
be dropped from the model, unless they play an important role in β(x). This can be checked by just
refitting the model without these variables. Figure 2 also shows that some regression attentions vary
smoothly, β 2 (x) and β 3 (x), this highlights that the terms x2 and x3 do not exhibit interactions with
other variables, in fact, β 2 (x) suggest a linear term with slope −1/4 which indeed is the case, see (11),
and β 3 (x) has a sine behavior. Other attention weights β j (x), 4 ≤ j ≤ 6, look rather non-smooth (are
dispersed). This indicates that these terms have significant interactions with other variables.
Figure 2. Regression attentions β j (xt ), 1 ≤ j ≤ q = 8, of 5000 randomly selected out-of-sample instances xt from T ; the y-scale
is identical in all plots and on the x-scale we have xj .
H0 of setting β7 (x) = 0, or how much fluctuation reveals real regression structure? In GLMs, this
question is answered by either computing the Wald test or the likelihood ratio test (LRT) which use
asymptotic normality of MLEs, see Section 2.2.2 of Fahrmeir & Tutz (1994). Here, we cannot rely on
an asymptotic theory for MLEs because early stopping implies that we do not consider the MLE. An
7 (x) is bigger than the magnitudes used in the
analysis of the results also shows that the volatility in β
Wald test and the LRT. For this reason, we propose an empirical way of determining the rejection
region of the null hypothesis H0 : β7 (x) = 0.
If we are given a statistical problem with features x = (x1 , . . . , xq ) ∈ Rq , we propose to extend
these features by an additional variable xq+1 which is completely random, independent of x and
which, of course, does not enter the true (unknown) regression function μ(x), but is only included in
the network. This additional random component xq+1 will quantify the magnitudes of the fluctuations
q+1 (x) of an independent component that does not enter the regression function. To successfully
in β
apply this empirical test we need to normalize all feature components xj , 1 ≤ j ≤ q + 1, to have zero
empirical mean and unit variance, i.e. they should all live on the same scale. Note that such a nor-
malization should already have been done for successful SGD fitting, thus, this does not impose an
additional step, here. We then fit a LocalGLMnet (6)–(7) to the learning data L with extended features
x+ = (x , xq+1 ) ∈ Rq+1 which gives us the estimated regression attentions β 1 (x+ +
i ), . . . , βq+1 (xi ).
SCANDINAVIAN ACTUARIAL JOURNAL 81
Using these estimated regression attentions, we receive empirical mean and standard deviation for
the additional component
1 1
n n
2
b̄q+1 = q+1 (x+
β i ) and
sq+1 = q+1 (x+
β i ) − b̄q+1 . (15)
n i=1 n − 1 i=1
Since this additional component xq+1 does not enter the true regression function we expect b̄q+1 ≈ 0,
and sq+1 quantifies the expected magnitude of fluctuations around zero.
The null hypothesis H0 : βj (x) = 0 for component j on significance level α ∈ (0, 1/2) can then be
j (x+
rejected if the coverage ratio of the following centered interval Iα for β i ), 1 ≤ i ≤ n,
Iα = qN (α/2) ·
sq+1 , qN (1 − α/2) ·
sq+1 = qN (α/2) ·
sq+1 , −qN (α/2) ·
sq+1 (16)
is substantially smaller than 1-α, where qN (p) denotes the quantile of the standard Gaussian
distribution on quantile level p ∈ (0, 1).
We come back to our Figure 2. We do not add an additional component xq+1 , but we directly use
component x7 to test the null hypotheses H0 : βj (x) = 0 for the remaining components j = 7. In our
synthetic data example we have
We choose significance level α = 0.1% which provides us with qN (0.05%) = 3.2905. The light cyan
stripe in Figure 2 shows the resulting interval Iα , given in (16), for our example. Only for x8 the black
dots β8 (xt ) are within these confidence bounds Iα which implies that we should drop this component
and keep components x1 , . . . , x6 in the regression model. Note that x8 and x2 are correlated, and still
our empirical Wald test does not reject the null hypothesis of setting β8 (x) = 0. That is, we do the
right decision, and there is no information leakage, here, through the correlated feature components.
In a final step, the model with dropped components should be re-fitted and the out-of-sample loss
should not substantially change, this re-fitting step verifies that the dropped components also do not
play a significant role in the regression attentions β j (x) of the remaining feature components j, i.e.
contribute by interacting with other variables.
Figure 3 gives the resulting feature contributions β j (xt )xt,j , 1 ≤ j ≤ q = 8, to the LocalGLMnet
estimated regression function μ(xt ). We clearly see the linear term in x1 , the quadratic term in x2
and the sine term in x3 (first line of Figure 3), see also (11) for the true regression function μ. The
second line of Figure 3 shows the interacting feature components x4 , x5 and x6 , and the last line those
that should be dropped.
3.3. Interactions
In the next and final step, we explore interactions between different feature components. This is based
on analyzing the gradients ∇βj (x) for 1 ≤ j ≤ q, see (9). The jth component of this gradient ∇βj (x)
explores whether we have a linear term in xj or not. If there are no interactions of the jth compo-
nent with other components, i.e. βj (x)xj = βj (xj )xj , it will exactly provide us with the right functional
form of βj (x) since in that case ∂βj (x)/∂xj = 0 for all j = j. In relation to Figure 2 this means that the
scatter plot resembles a line that does not have any lateral dilation. In Figure 2, this is the case for com-
ponents x1 , x2 , x7 , x8 , for component x3 this is not completely clear from the scatter plot, and x4 , x5 , x6
show lateral extensions indicating interactions. In this latter general case, we have ∂βj (x)/∂xj = 0 for
some j = j.
We calculate the gradients ∇ β j (x), 1 ≤ j ≤ q, of the fitted model. These can be obtained by the
R code of Listing 3. This provides us for fixed j with vectors ∇ β j (xi ) for all instances i = 1, . . . , n.
82 R. RICHMAN AND M. V. WÜTHRICH
Figure 3. Feature contributions β j (xt )xt,j , 1 ≤ j ≤ q = 8, to the LocalGLMnet estimated regression function μ(xt ) of 5000 ran-
domly selected out-of-sample instances xt from T ; the y-scale is identical in all plots and on the x-scale we have xj ; the horizontal
lines indicate the magnitude of the feature contributions in increments of 0.5.
∂
j (xi ) =
∂xk β j (xi ) ∼ xi,j .
β (17)
∂xk
This studies the derivative of regression attention β j (x) w.r.t. xk as a function of the corresponding
feature component xj that is considered in the feature contribution βj (x)xj , see (7). The code for these
spline fits is also provided in Listing 3 on lines 12–15, and it gives us the results illustrated in Figure 4.
Figure 4 studies the spline regressions (17) only of the significant components x1 , . . . , x6 that enter
regression function μ, see (11). We interpret these plots.
• Plot x1 shows that all gradients are roughly zero, which means that β1 (x) = const, which,
indeed, is the case in the true regression function μ.
2 (x),
• Plots x2 and x3 show one term that is significantly different from 0. For the x2 plot it is ∂x2 β
and for the x3 plot it is ∂x3 β3 (x). This says that these two terms do not have any interactions with
other variables, and that the right functional form is not the linear one, but β j (x)xj = βj (xj )xj
SCANDINAVIAN ACTUARIAL JOURNAL 83
We give concluding remarks on this synthetic data example. In general, interpretation is simpler if
we have βj (x) = βj (xj ), i.e. if the components do not have interactions, because this results in rather
smooth curves in Figure 3. This may be achieved by pre-processing features, e.g. we may consider
the body mass index, instead of using the weight and the height of a person (if there is corresponding
interaction in the regression function). Additionally, we can also use the functional forms found by
the LocalGLMnet, e.g. we have seen that component x2 enters the regression function as a quadratic
term, and we could fit another LocalGLMnet using feature component x22 , instead. This may provide
even clearer and better explanations.
Table 2. In-sample and out-of-sample losses on the real MTPL data example.
Poisson deviance losses in 10−2
Out-of-
In-sample sample on
on L T
(a) null model (bias β0 only) 25.213 25.445
(b) FFN network 23.764 23.873
(c) LocalGLMnet 23.728 23.945
(d) reduced LocalGLMnet 23.714 23.912
(e) Poisson GLM3 24.084 24.102
(f) Categorical Embedding network 23.690 23.824
(g) Nagging network 23.691 23.783
terms of predictive power. We choose a network of depth d = 3 having (q1 , q2 , q3 ) = (20, 15, 10) FFN
neurons; the R code for this FFN architecture is given in Listing 7.1 of Wüthrich & Merz (2021), but
we replace input dimension 40 by 42 on line 3 of that listing to account for the two additional feature
components ‘RandU’ and ‘RandN’. To do a proper out-of-sample generalization analysis we parti-
tion the data randomly into a learning data set L and a test data set T . The learning data L contains
n = 610, 206 instances and the test data set T contains 67, 801 instances; we use exactly the same
split as in Table 5.2 of Wüthrich & Merz (2021). The learning data L will be used to learn the network
parameters and the test data T is used to perform an out-of-sample generalization analysis. As loss
function for parameter fitting and generalization analysis we choose the Poisson deviance loss, which
is a distribution adapted and strictly consistent loss function for the mean within the Poisson model,
for details we refer to Section 4.1.3 in Wüthrich & Merz (2021).
We fit this FFN network using the nadam version of SGD on batches of size 5000 over 100 epochs,
and we retrieve the network calibration that provides the smallest validation loss on a training-
validation partition U and V of the learning data L. The results are presented on line (b) of Table 2.
The FFN network provides clearly better results than the null model only using a bias β0 . This justifies
regression modeling, here.
Next we fit the LocalGLMnet architecture using exactly the same set up as in the FFN network, but
having a depth of d = 4 with numbers of neurons (q0 , q1 , q2 , q3 , q4 ) = (42, 20, 15, 10, 42). The results
on line (c) of Table 2 show that we sacrifice a bit of predictive power (out-of-sample) for receiving our
interpretable network architecture. Moreover, we complement the table with results of Wüthrich &
Merz (2021) fitting several other models to the same data, namely, a GLM on line (e), an FFN which
encodes the categorical features using embedding layers on line (f), and an ensemble of networks
using the nagging predictor on line (g). Compared to the GLM, which is currently the industry stan-
dard for non-life claim frequency prediction, the LocalGLMnet model performs significantly better
(out-of-sample). The FFN with embedding layers and the ensemble of networks outperform both the
LocalGLMnet and the simple FFN on line (b). Thus for the LocalGLMnet we sacrifice a little bit of
predictive performance for having an explainable predictive model. On the other hand, we can per-
form variable selection with the LocalGLMnet, and if the ultimate goal is predictive performance we
can still fit another network architecture to the LocalGLMnet-selected variables.
We now analyze the resulting estimated regression attentions β j (x). We start by studying the regres-
sion attentions β j (x) of the continuous and binary feature components Area Code, Bonus-Malus
Level, Density, Driver’s Age, Vehicle Age, Vehicle Gas, Vehicle Power, RandU and RandN. First, we
calculate the empirical standard deviations sj that we receive from RandU and RandN, see (15),
sRandU = 0.052 and
sRandN = 0.048.
Thus these standard deviation estimates are rather similar, and in this case the specific distribu-
tional choice of the control variable xq+1 does not influence the results. We discuss further these
distributional choices and alternative choices together with an analysis of impact of sampling error
86 R. RICHMAN AND M. V. WÜTHRICH
Figure 5. Attention weights βj (xt ) of the continuous and binary feature components Area Code, Bonus-Malus Level, Density,
Driver’s Age, Vehicle Age, Vehicle Gas, Vehicle Power, RandU and RandN for 5000 randomly selected instances xt of T ; the light
cyan stripe shows the confidence area Iα on significance level α = 0.1%.
in Appendix 2. We calculate interval Iα for significance level α = 0.1%, see (16). The resulting interval
Iα is illustrated by the light cyan stripes in Figure 5. We observe that for the variables Bonus–Malus
Level, Density, Driver’s Age, Vehicle Age and Vehicle Gas we clearly reject the null hypothesis
H0 : βj (x) = 0 on the chosen significance level α = 0.1%, and Area Code and Vehicle Power need
further analysis. For these two variables, Iα provides a coverage ratio of 97.1% and 98.1%, thus, strictly
speaking these numbers are below 1 − α = 99.9% and we should keep these variables in the model.
Nevertheless, since there is some sampling error in deriving sRandU and sRandN , e.g. the appendix
shows that different samples for these variables could produce slightly wider intervals, we further
analyze these two variables. From the empirical analysis in Noll et al. (2018) we know that Area Code
and Density are highly correlated. Figure 6 (lhs) shows the boxplot of Density vs. Area Code, and this
plot highlights that Density almost fully explains Area Code. Therefore, it is sufficient to include the
Density variable and we drop Area Code.
Thus, we drop the variables Area Code and VehPower, and we also drop the control variables
RandU and RandN, because these are no longer needed. This gives us a reduced input dimension
of q0 = q = 38, and we run the same LocalGLMnet SGD fitting again. Line (d) of Table 2 gives the
SCANDINAVIAN ACTUARIAL JOURNAL 87
Figure 6. (lhs) Boxplots of Density vs. Area Code and (rhs) Bonus-Malus vs. Driver’s Age.
in-sample and out-of-sample results of this reduced LocalGLMnet model. We observe a small out-of-
sample improvement compared to line (c) which confirms that we can drop Area Code and VehPower
without losing predictive performance. In fact, the small improvement indicates that a smaller model
less likely over-fits, and we can consider more SGD steps before over-fitting, i.e. we receive a later
early stopping point.
Having performed variable selection and dropping unimportant variables, another approach
would be to then fit a different model to the reduced set of features, for example an FFN, which
would provide better out-of-sample performance than the LocalGLMnet. This approach is advisable
if the emphasis is on predictive performance instead of interpretability.
Figure 7 shows the feature contributions β j (xt )xj,t of the selected continuous and binary feature
components Bonus–Malus Level, Density, Driver’s Age, Vehicle Age and Vehicle Gas of 5000 ran-
domly selected instances xt , and the magenta line shows a spline fit to these feature contributions; the
y-scale is the same in all plots. We observe that all these components contribute substantially to the
regression function, the Bonus–Malus variable being the most important one, and Vehicle Gas being
the least important one. Bonus–Malus and Density have in average an increasing trend, and Vehicle
Age has in average a decreasing trend, Density being close to a linear function, and the remaining
continuous variables are clearly non-linear. The explanation of Driver’s Age is more difficult as can
be seen from the spline fit (magenta color). Figure 6 (rhs) shows the boxplot of Bonus–Malus Level
vs. Driver’s Age. We observe that new (young) drivers enter the bonus-malus system at 100, and every
year of driving without an accident decreases the bonus-malus level. Therefore, the lowest bonus-
malus level can only be reached after multiple years of accident-free driving. This can be seen from
Figure 6 (rhs), and it implies that the Bonus–Malus Level and the Driver’s Age variables interact. We
are going to verify this by studying the gradients ∇ β j (x) of the regression attributions. The smoothed
feature contributions of Figure 7 are compared in Appendix 2 to PDPs produced using an FFN fit to
the same reduced set of variables. The comparison shows that the smoothed feature contributions are
generally similar to PDPs, but differ for values of the variables where there are not many observations
(note that PDPs do not consider dependence within feature vectors appropriately).
Figure 8 shows the spline fits to the gradients ∂xk β j (x) of the continuous variables Bonus–Malus
Level, Density, Driver’s Age and Vehicle Age. First, we observe that Bonus–Malus Level and Driver’s
Age have substantial non-linear terms ∂xj β j (x)xj , Vehicle Age shows some non-linearity and Density
seems to be pretty linear since ∂xj βj (x)xj ≈ 0. This verifies the findings of Figure 7 of the magenta
spline fits.
88 R. RICHMAN AND M. V. WÜTHRICH
Figure 7. Feature contributions β j (xt )xj,t of the continuous and binary feature components Bonus–Malus Level, Density, Driver’s
Age, Vehicle Age and Vehicle Gas of 5000 randomly selected instances xt of T ; the y-scale is the same in all plots and the magenta
color gives a spline fit to the feature contributions.
Next we focus on interactions which requires the study of ∂xk β j (x) for k = j in Figure 8. The most
significant interactions can clearly be observed between Bonus–Malus Level and Driver’s Age, but
also between the Bonus–Malus Level and Density we encounter an interaction term saying that a
higher Bonus–Malus Level at a lower Density leads to a higher prediction, which intuitively makes
sense as in less densely populated areas we expect less claims. For Vehicle Age, we do not find sub-
stantial interactions, it only weakly interacts with Bonus–Malus Level and Driver’s Age by entering
the corresponding regression attributions β j (x) of these two feature components.
There remains the discussion of the categorical feature components Vehicle Brand and Region.
This is going to be done in Section 3.6.
Remark 3.1: Networks fitted with early stopping in SGD do not fulfill the so-called balance property
and, henceforth, may have a bias. Since we work under the canonical link choice, here, we can apply
the bias regularization method proposed in Wüthrich (2020). For this we define the learned features
zi,j = β j (xi )xi,j , 1 ≤ j ≤ q, that is, for each instance 1 ≤ i ≤ n we have a new learned feature
zi =
(
zi,1 , . . . ,
zi,q ) ∈ Rq . These learned features can then be used in a classical GLM, see (2),
q
z i → g(μ) = α0 + α,
z i = α0 + αj
zi,j ,
j=1
with regression parameter (α0 , α) ∈ Rq+1 . Since an MLE estimated GLM under canonical link choice
satisfies the balance property, so will the LocalGLMnet
q
α0MLE +
xi → g(μ) = j (xi ) xi,j .
αjMLE β
j=1
SCANDINAVIAN ACTUARIAL JOURNAL 89
j (xi ) of the continuous variables Bonus–Malus Level, Density, Driver’s Age and Vehicle
Figure 8. Spline fits to the gradients ∂xk β
Age over all instances i = 1, . . . , n.
That is, we just slightly rescale the regression attentions to receive the balance property. For more on
this topic, we refer to Section 7.4.2 in Wüthrich & Merz (2021).
1
n
VIj = j (xi ) ,
β
n i=1
for 1 ≤ j ≤ q and where we aggregate over all instances 1 ≤ i ≤ n. Typically, the bigger these values
VIj the more component xj influences the regression function. Note that all feature components xj
have been centered and normalized to unit variance, i.e. they live on the same scale, otherwise such
a comparison of VIj across different j would not make sense. Figure 9 gives the variable importance
results, which emphasizes that Area Code and Vehicle Power are the least important variables which,
in fact, have been dropped in a second step above.
90 R. RICHMAN AND M. V. WÜTHRICH
j (x) of the categorical feature components Vehicle Brand and French Regions; the
Figure 10. Boxplot of the feature contributions β
y-scale is the same as in Figure 7.
4. Conclusion
We have introduced the LocalGLMnet which is inspired by classical GLMs. Making the regression
parameters of a generalized linear model feature dependent allows us to receive a flexible regression
model that shares representation learning and the predictive performance of classical fully-connected
feed-forward neural networks, and at the same time it remains interpretable. This appealing structure
allows us to perform variable selection, it allows us to study variable importance and it also allows
us to determine interactions. Thus it provides us with a fully transparent network model that brings
out the internal structure of the data. To the best of our knowledge, this is rather unique in network
regression modeling, since our proposal does not share the shortcomings of similar proposals like
computational burden or a loss of predictive power.
Above we have mentioned possible extensions, e.g. LocalGLM layers can be composed to receive
deeper interpretable networks, and the LocalGLMnet can serve as a surrogate model to shed more
light into many other deep learning models. A different method of utilizing the LocalGLMnet
would be to iteratively fit the model and then use the smoothed feature contribution and inter-
action plots (i.e. those shown in Figures 7 and 8) to aid the fitting of a GLM by adding new
variables corresponding to these plots to the GLM. Once these are added, the LocalGLMnet could
be refit to the expanded variables and the process is repeated. Another extension would be a
method of constraining the regression attention coefficients to produce monotone effects which
would potentially be useful in practice. However, this would require substantial modifications to
the LocalGLMnet fitting process, so we leave this modification for monotonic effects (which applies
equally to more general neural networks, such as FFNs) for further research. Whereas our archi-
tecture is most suitable for tabular input data, the question about optimal consideration of non-
tabular data or of categorical variables with many levels is one point that should also be further
explored.
Acknowledgments
We thank Christopher Blier-Wong, Friedrich Loser, Andreas Tsanakas and the two anonymous reviewers for providing
very useful remarks to improve earlier versions of this manuscript.
92 R. RICHMAN AND M. V. WÜTHRICH
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
Agarwal R., Frosst N., Zhang X., Caruana R. & Hinton G. E. (2020). Neural additive models: interpretable machine
learning with neural nets. arXiv:2004.13912v1.
Ahn J., Lu Y., Oh R., Park K. & Zhu D. (2021). Neural credibility [conference presentation]. In Virtual 24th International
Congress on Insurance: Mathematics and Economics, July 5–10. Urbana-Champaign, USA: University of Illinois.
Apley D. W. & Zhu J. (2020). Visualizing the effects of predictor variables in black box supervised learning models.
Journal of the Royal Statistical Society: Series B 82(4), 1059–1086.
Arik S. Ö & Pfister T. (2019). TabNet: attentive interpretable tabular learning. arXiv:1908.07442v5.
Bahdanau D., Cho K. & Bengio Y. (2014). Neural machine translation by jointly learning to align and translate.
arXiv:1409.0473.
Barndorff-Nielsen O. (2014). Information and exponential families: in statistical theory. Chichester, West Sussex, UK:
John Wiley & Sons.
Chen C., Li O., Tao C., Barnett A., Su J. & Rudin C. (2019). This looks like that: deep learning for interpretable image
recognition. Advances in Neural Information Processing Systems 32, Vancouver, Canada.
Delong Ł. & Kozak A. (2021). The use of autoencoders for training neural networks with mixed categorical and
numerical features. SSRN Manuscript ID 3952470.
Dutang C. & Charpentier A. (2018). CASdatasets R Package Vignette. Reference Manual. Version 1.0-8, packaged 2018-
05-20.
Fahrmeir L. & Tutz G. (1994). Multivariate statistical modelling based on generalized linear models. New York, USA:
Springer.
Friedman J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics29(5),
1189–1232.
Goodfellow I., Bengio Y. & Courville A. (2016). Deep learning. MIT Press.
Ha D., Dai A. & Le Q. (2016). Hypernetworks. arXiv:1609.09106.
He K., Zhang X., Ren S. & Sun J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on
Computer Vision and Pattern Recognition, Vol. 1, P. 770–778. Las Vegas, USA.
Huang X., Khetan A., Cvitkovic M. & Karnin Z. (2020). TabTransformer: tabular data modeling using contextual
embeddings. arXiv:2012.00678.
Jørgensen B. (1997). The theory of dispersion models. London, UK: Chapman & Hall.
Kuo K. (2019). Generative synthesis of insurance datasets. arXiv:1912.02423v2.
Kuo K. & Richman R. (2021). Embeddings and attention in predictive modeling. arXiv:2104.03545v1.
Kursa M. & Rudnicki W. (2010). Feature selection with the Boruta package. Journal of Statistical Software 36, 1–13.
Lemhadri I., Ruan F., Abraham L. & Tibshirani R. (2021). LassoNet: a neural network with feature sparsity. Journal of
Machine Learning Research 22, 1–29.
Lorentzen C. & Mayer M. (2020). Peeking into the black box: an actuarial case study for interpretable machine learning.
SSRN Manuscript ID 3595944. Version May 7, 2020.
Lundberg S. M. & Lee S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Infor-
mation Processing Systems, Vol. 30, Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S.,
Garnett, R. (eds.), Montreal: Curran Associates. P. 4765–4774.
McCullagh P. & Nelder J. A. (1983). Generalized linear models. London, UK: Chapman & Hall.
Merz M., Richman R., Tsanakas A. & Wüthrich M. V. (2022). Interpreting deep learning models with marginal
attribution by conditioning on quantiles, Data Mining and Knowledge Discovery, in press.
Nelder J. A. & Wedderburn R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series
A 135(3), 370–384.
Noll A., Salzmann R. & Wüthrich M. V. (2018). Case study: french motor third-party liability claims. SSRN Manuscript
ID 3164764. Version March 4, 2020.
Perla F., Richman R., Scognamiglio S. & Wüthrich M. (2021). Time-series forecasting of mortality rates using deep
learning. Scandinavian Actuarial Journal 2021(7), 572–598.
Ribeiro M. T., Singh S. & Guestrin C. (2016). ‘Why should I trust you?’: explaining the predictions of any classifier. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD
’16. New York: Association for Computing Machinery. P. 1135–1144.
Richman R. (2020). AI in actuarial science – a review of recent advances – part 2. Annals of Actuarial Science 15(2),
230–258.
Richman R. (2021). Mind the gap – safely incorporating deep learning models into the actuarial toolkit. SSRN
Manuscript ID 3857693.
Richman R. & Wüthrich M. (2020). Nagging predictors. Risks 8, 83.
SCANDINAVIAN ACTUARIAL JOURNAL 93
Rudin C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable
models instead. Nature Machine Intelligence 1, 206–215.
Sundararajan M., Taly A. & Yan Q. (2017). Axiomatic attribution for deep networks. In Proceedings of the 34th Inter-
national Conference on Machine Learning, Proceedings of Machine Learning Research, PMLR, Vol. 70, Sydney,
Australia: International Convention Centre. P. 3319–3328.
Tran M., Nguyen N., Nott D. & Kohn R. (2020). Bayesian deep net GLM and GLMM. Journal of Computational and
Graphical Statistics 29, 97–113.
Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł. & Polosukhin I. (2017). Attention is
all you need. arXiv:1706.03762v5.
Vaughan J., Sudjianto A., Brahimi E., Chen J. & Nair V. N. (2018). Explainable neural networks based on additive index
models. arXiv:1806.01933v1.
Wang Q., Li B., Xiao T., Zhu J., Li C., Wong D. F. & Chao L. S. (2019). Learning deep transformer models for machine
translation. arXiv:1906.01787.
Wüthrich M. V. (2020). Bias regularization in neural network models for general insurance pricing. European Actuarial
Journal 10(1), 179–202.
Wüthrich M. V. & Merz M. (2021). Statistical foundations of actuarial learning and its applications. SSRN Manuscript
ID 3822407.
Appendices
R code
Listing 1. Code to implement LocalGLMnet of depth d = 4 for the synthetic Gaussian case.
1 library(keras)
2 #
3 Design = layer_input(shape = c(8), dtype = ’float32’)
4 #
5 Attention = Design %>%
6 layer_dense(units=20, activation=’tanh’) %>%
7 layer_dense(units=15, activation=’tanh’) %>%
8 layer_dense(units=10, activation=’tanh’) %>%
9 layer_dense(units=8, activation=’linear’, name=’Attention’)
10
11 Response = list(Design, Attention) %>% layer_dot(axes=1) %>%
12 layer_dense(units=1, activation=’linear’, name=’Response’)
13 #
14 model <- keras_model(inputs = c(Design), outputs = c(Response))
15 model %>% compile(loss = ’mse’, optimizer = ’nadam’)
same distributions. Moreover, in a different set of 100 runs, we also replaced these variables with permuted copies of
the Bonus-Malus and Vehicle Age variables. The average value of the standard deviations of the regression attentions
relating to each of these variables (i.e. the average value of
s) that is used to estimate the confidence intervals (16) and
the standard deviation of these are shown in Table B1. Over 100 runs, the average magnitude of s varies somewhat
depending on the distributional assumption underlying the control variables and, on the other hand, the standard
deviations of these are roughly constant, showing little dependence on the distributional assumptions. Since there is
some sampling variability in deriving the confidence bands (16), as it is usually the case in statistics on finite samples,
these confidence bands should be treated as indicative of the ‘true’ values, and when performing variable selection,
those variables that produce regression attention values that are close in magnitude to the confidence bands should
be investigated further, as is done in Section 3.4. On the other hand, adding more variables to the LocalGLMnet, for
example, several permuted variables as well as randomly sampled variables, would likely increase the potential for over
fitting, thus, we prefer to rely on ‘RandU’ and ‘RandN’ in the real data example.
Figure B11. Spline fit to the feature contributions βj (xt )xj,t and PDPs of the continuous variables Bonus-Malus Level, Density,
Driver’s Age and Vehicle Age of 5,000 randomly selected instances xt of T ; the y-scale is the same in all plots.