0% found this document useful (0 votes)

23 views26 pages

LocalGLMnet Interpretable Deep Learning For Tabular Data

Uploaded by

david teacher

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views26 pages

LocalGLMnet Interpretable Deep Learning For Tabular Data

Uploaded by

david teacher

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 26

Scandinavian Actuarial Journal

ISSN: (Print) (Online) Journal homepage: www.tandfonline.com/journals/sact20

LocalGLMnet: interpretable deep learning for

tabular data

Ronald Richman & Mario V. Wüthrich

To cite this article: Ronald Richman & Mario V. Wüthrich (2023) LocalGLMnet: interpretable
deep learning for tabular data, Scandinavian Actuarial Journal, 2023:1, 71-95, DOI:
10.1080/03461238.2022.2081816

To link to this article: https://doi.org/10.1080/03461238.2022.2081816

© 2022 The Author(s). Published by Informa

UK Limited, trading as Taylor & Francis
Group.

Published online: 08 Jun 2022.

Submit your article to this journal

Article views: 5739

View related articles

View Crossmark data

Citing articles: 7 View citing articles

Full Terms & Conditions of access and use can be found at

https://www.tandfonline.com/action/journalInformation?journalCode=sact20
SCANDINAVIAN ACTUARIAL JOURNAL
2023, VOL. 2023, NO. 1, 71–95
https://doi.org/10.1080/03461238.2022.2081816

LocalGLMnet: interpretable deep learning for tabular data

Ronald Richmana and Mario V. Wüthrichb
a Old Mutual Insure and University of the Witwatersrand, Johannesburg, South Africa; b Department of Mathematics,
RiskLab, ETH Zurich, Zurich, Switzerland

ABSTRACT ARTICLE HISTORY

Deep learning models have gained great popularity in statistical modeling Received 23 July 2021
because they lead to very competitive regression models, often outper- Accepted 20 May 2022
forming classical statistical models such as generalized linear models. The KEYWORDS
disadvantage of deep learning models is that their solutions are difficult to Deep learning; neural
interpret and explain, and variable selection is not easily possible because network; generalized linear
deep learning models solve feature engineering and variable selection model; regression model;
internally in a nontransparent way. Inspired by the appealing structure of variable selection;
generalized linear models, we propose a new network architecture that explainable deep learning;
shares similar features as generalized linear models but provides supe- attention layer; tabular data;
rior predictive power benefiting from the art of representation learning. exponential dispersion
This new architecture allows for variable selection of tabular data and for family; model interpretability
interpretation of the calibrated deep learning model, in fact, our approach
provides an additive decomposition that can be related to other model
interpretability techniques.

1. Introduction
Deep learning models celebrate great success in statistical modeling because they often provide supe-
rior predictive power over classical regression models. This success is based on the fact that deep
learning models perform representation learning of features, which means that they bring features
into the right structure to be able to extract maximal information for the prediction task at hand.
This feature engineering is done internally in a nontransparent way by the deep learning model. For
this reason, deep learning solutions are often criticized to be non-explainable and interpretable, in
particular, because this process of representation learning is performed in high-dimensional spaces
analyzing bits and pieces of the feature information. Recent research has been focusing on interpret-
ing machine learning predictions in retrospect, see, e.g. Friedman’s partial dependence plot (PDP)
(Friedman 2001), the accumulated local effects (ALE) method of Apley & Zhu (2020), the locally
interpretable model-agnostic explanation (LIME) introduced by Ribeiro et al. (2016), the SHapley
Additive exPlanations (SHAP) of Lundberg & Lee (2017) or the marginal attribution by conditioning
on quantiles (MACQ) method proposed by Merz et al. (2022). PDP, ALE, LIME and SHAP can be
used for any machine learning method such as random forests, boosting or neural networks, whereas
MACQ requires differentiability of the regression function which is the case for neural networks
under differentiable activation functions; for a review of more gradient-based methods, we refer to
Merz et al. (2022).
We follow a different approach here, namely, we propose a new network architecture that has an
internal structure that directly allows for interpreting and explaining. Moreover, this internal struc-
ture also allows for variable selection of tabular feature data and to extract interactions between

CONTACT Mario V. Wüthrich mario.wuethrich@math.ethz.ch

© 2022 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License
(http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided
the original work is properly cited, and is not altered, transformed, or built upon in any way.
72 R. RICHMAN AND M. V. WÜTHRICH

feature components. The starting point of our proposal is the framework of generalized linear mod-
els (GLMs) introduced by Nelder & Wedderburn (1972) and McCullagh & Nelder (1983). GLMs are
characterized by the choice of a link function that maps the regression function to a linear predictor,
and, thus, leading to a linear functional form that directly describes the influence of each predictor
variable on the response variable. Of course, this (generalized) linear form is both transparent and
interpretable. To some extent, our architecture preserves this linear structure of GLMs, but we make
the coefficients of the linear predictors feature dependent. Such an approach follows a similar strategy
as the ResNet proposal of He et al. (2016) that considers a linear term and then builds the network
around this linear term. The LassoNet of Lemhadri et al. (2021) follows a similar philosophy, too,
by performing Lasso regularization on network features. Both proposals have in common that they
use a so-called skip connection in the network architecture that gives a linear model part around
which the non-linear network model is built. Our proposal uses such a skip connection, too, which
provides the linear model part, and we weight these linear terms with potentially non-linear weights.
This allows us to generate non-linear regression functions with arbitrary interactions. In spirit, our
non-linear weights are similar to the attention layers recently introduced by Bahdanau et al. (2014)
and Vaswani et al. (2017). Attention layers are a rather successful new way of building powerful net-
works by extracting more important feature components from embeddings by giving more weight
(attention) to them. From this viewpoint we construct (network regression) attention weights that
provide us with a local GLM for tabular data, and we therefore call our proposal LocalGLMnet. These
regression attention weights also provide us with a possibility of explicit variable selection, which is
a novel and unique property within network regression models. Moreover, we can explicitly explore
interactions between feature components. A similar approach is considered in Ahn et al. (2021) on
Bayesian credibility theory, where the credibility weights are modeled by attention weights.
There is another stream of literature that tries to choose regression structures that have inter-
pretable features, e.g. the explainable neural networks (xNN) and the neural additive models (NAM)
make restrictions to the structure of the network regression function by running different subsets of
feature components through (separated) parallel networks, see Vaughan et al. (2018) and Agarwal
et al. (2020). The drawback of these proposals is that representation learning is restricted within the
parallel networks. Our proposal overcomes this issue as it allows for general interactions. We also
mention (Richman 2021) who extends the xNN approach by explicitly including linear features to a
combined model called CAXNN. Another interpretable network approach is the TabNet proposal of
Arik & Pfister (2019). TabNet uses networks to create attention to important features for the regres-
sion task at hand, however, this proposal has the drawback that it may lead to heavy computational
burden.
Our LocalGLMnet approach overcomes the limitations of these explainable network approaches.
As we will see below, our proposal is computationally efficient and it leads to nice explanations, in the
sense that we can interpret the predictions made by our model in terms of an additive decomposition
of the features. This will be further explored in Section 2.4.
Our proposal of the LocalGLMnet, which estimates the coefficients of a (local) GLM, and, there-
fore, maintains a directly interpretable structure in terms of the original features, is, to the best of
our knowledge, unique within the literature. In this key respect, the LocalGLMnet is different from
the DeepGLM and the DeepGLMM of Tran et al. (2020), who study GLMs and generalized linear
mixed models (GLMMs) that rely on learned feature representations from a deep neural network, i.e.
the original features are not used directly to make predictions. One similarity between these model
approaches is that the DeepGLMM allows for observation specific random effects in the last layer of
that model, thus, the coefficients may vary for each observation, which is similar to our LocalGLMnet.
Whereas the LocalGLMnet complies with the interpretable machine learning requirements discussed
in the manifesto of Rudin (2019), the interpretable models discussed in that work and its references
rely on significantly different classes of algorithms from GLMs, which we focus on here due to the
importance of GLMs within actuarial work. Among the models mentioned in Rudin (2019), perhaps
the most similar is the prototype network of Chen et al. (2019), which performs classification based
SCANDINAVIAN ACTUARIAL JOURNAL 73

on the similarity of an image to interpretable learned prototypes, which are used as interpretable fea-
tures within a deep neural network. In the case of the LocalGLMnet, the original feature values are
used directly to make predictions, thus, maintaining interpretability. Within the machine learning
literature, the most similar work to the LocalGLMnet is the hypernetwork concept of Ha et al. (2016),
where one neural network is used to estimate the weights of a second network. In the case of the
LocalGLMnet, a neural network is being used to estimate the weights of a second model, which is
a (local) GLM. Similar types of models have been considered within the actuarial literature in the
context of mortality forecasting by, e.g. Perla et al. (2021), who derive the coefficients of a Lee–Carter
type of model using convolutional and recurrent neural networks.
Using the LocalGLMnet architecture, we propose an empirically based Wald-type method for
variable selection, which relies on adding synthetic variables generated by a uniform or Gaussian dis-
tribution to the data before the LocalGLMnet is fitted. The estimated non-linear regression weights
for these synthetic variables are then used to estimate a confidence interval to determine the sig-
nificance of the regression weights for the rest of the dataset. The method proposed here is closely
connected to the Boruta system of variable selection proposed by Kursa & Rudnicki (2010), which
functions by adding permuted copies of the features to a random forest algorithm, and assessing the
significance of each feature with reference to the level of significance assigned to these permuted fea-
tures by the variable importance scores within the random forest algorithm. The approach proposed
here is similar in that synthetic variables are added to the dataset, but differs in terms of how the sig-
nificance of the features is evaluated: as opposed to using variable importance scores, the structure
of the LocalGLMnet allows for the significance of variables to be evaluated directly in terms of the
regression weights.

1.1. Organization of this manuscript

In the next section, we introduce and discuss the LocalGLMnet. We therefore first recall the GLM
framework which will give us the right starting point and intuition for the LocalGLMnet. In
Section 2.2, we extend GLMs to feed-forward neural networks that form the basis of our regression
attention weight construction, and Section 2.3 presents our LocalGLMnet proposal that combines
GLMs with regression attention weights. Section 2.4 discusses the LocalGLMnet, and it relates our
proposal to other methods. Section 3 presents two examples, a toy synthetic data example and a real
data example. The former will give us a proof of concept, a verification is obtained because we know
the true data generating mechanism in this toy synthetic data example. Moreover, in Section 3.2 we
discuss how the LocalGLMnet allows for variable selection, which is a novel and unique property
within network regression modeling. Section 3.3 explains how we can find interactions. In Section 3.4,
we present the main demonstration of the LocalGLMnet on a real data example which has been stud-
ied extensively in recent actuarial work. In the context of this example, Section 3.5 discusses variable
importance, and in Section 3.6 we discuss how categorical feature components can be treated within
our proposal. Finally, in Section 4 we conclude, and in the appendix we provide further analysis.

2. Model architecture
2.1. Generalized linear model
The starting point of our proposal is a GLM which typically is based on the exponential disper-
sion family (EDF). GLMs have been introduced by Nelder & Wedderburn (1972) and McCullagh
& Nelder (1983), and the EDF has been analyzed in detail by Jørgensen (1997) and Barndorff-
Nielsen (2014); we use the notation and terminology of Wüthrich & Merz (2021), and for a detailed
treatment of GLMs and the EDF we refer to Chapters 2 and 5 of that latter reference.
Assume we have a datum (Y, x, v) with a given exposure v > 0, a vector-valued feature x ∈ Rq ,
and a response variable Y following a member of the (single-parameter linear) EDF having density
74 R. RICHMAN AND M. V. WÜTHRICH

(w.r.t. to a σ -finite measure on R)

yθ − κ(θ )
Y ∼ f (y; θ, v/ϕ) = exp + a(y; v/ϕ) , (1)
ϕ/v
with dispersion parameter ϕ > 0, canonical parameter θ ∈ , where the effective domain ⊆ R is
a non-empty interval, with cumulant function κ : → R and with normalizing function a(·; ·). By
construction of the EDF, the cumulant function κ is a smooth and convex function on the interior of
the effective domain . This implies that Y has first and second moments, we refer to Chapter 2 in
Wüthrich & Merz (2021),
ϕ
μ = E [Y] = κ (θ) and Var (Y) = κ (θ ) > 0.
v
A GLM is obtained by making a specific regression assumption on the mean μ = κ (θ ) of Y. Namely,
choose a strictly monotone and continuous link function g : R → R and assume that the mean of Y,
given x, satisfies regression assumption
q

x → g(μ) = g (μ(x)) = β0 + β, x = β0 + βj x j , (2)
j=1

with GLM regression parameter β = (β1 , . . . , βq ) ∈ Rq , bias (intercept) β0 ∈ R, and where ·, ·
denotes the scalar product in Rq . This GLM assumption implies that the canonical parameter takes
the following form (on the canonical scale of the EDF)

θ = (κ )−1 g −1 (β0 + β, x ) ,
where (κ )−1 is called canonical link of the chosen EDF (1).
The GLM regression function (2) is very appealing because it leads to a linear predictor η(x) =
β0 + β, x (after applying the link function g to the mean μ(x)), and the regression parameter βj
directly explains how the individual feature component xj influences the linear predictor η(x) and
the expected value μ(x) of Y, respectively. Our goal is to benefit from this transparent structure as far
as possible.

2.2. Fully-connected feed-forward neural network

A neural network extension of a GLM is obtained rather easily by exploring feature engineering before
considering the scalar product in the linear predictor (2). A fully-connected feed-forward neural
(FFN) network builds upon engineering feature information x through non-linear transformations
before entering the scalar product. A composition of FFN layers performs these non-linear transfor-
mations. Choose a non-linear activation function φm : R → R and dimensions qm−1 , qm ∈ N. The
mth FFN layer of a deep FFN network is defined by the mapping

z (m) : Rqm−1 → Rqm

(3)
(m)
x → z (m) (x) = z1 (x), . . . , zq(m)
m
(x) ,

having neurons zj(m) (x), 1 ≤ j ≤ qm , for variables x = (x1 , . . . , xqm−1 ) ∈ Rqm−1 ,

qm−1
(m)
(m) (m) (m) (m)
zj (x) = φm w0,j + wj , x = φm w0,j + wl,j xl ,
l=1

for given network weights w(m)

j
(m)
= (wl,j (m)
)1≤l≤qm−1 ∈ Rqm−1 and bias w0,j ∈ R.
SCANDINAVIAN ACTUARIAL JOURNAL 75

A FFN network of depth d ∈ N is obtained by composing d FFN layers (3) to provide a deep learned
representation, we set input dimension q0 = q,

z (d:1) : Rq → Rqd
(4)
x → z (d:1) (x) = z (d) ◦ · · · ◦ z (1) (x).

This qd -dimensional learned representation z (d:1) (x) ∈ Rqd then enters a GLM of type (2) providing
FFN network regression function

x → g(μ) = g (μ(x)) = β0 + β, z (d:1) (x) , (5)

with GLM regression (output) parameter β = (β1 , . . . , βqd ) ∈ Rqd and bias β0 ∈ R. From (5), we
see that the ‘raw’ feature x is first suitably transformed before entering the GLM structure. The stan-
dard reference for neural networks is Goodfellow et al. (2016), for more insight, interpretation and
model fitting we refer to Section 7.2 of Wüthrich & Merz (2021).

2.3. Local generalized linear model network

The disadvantage of deep representation learning (4) is that we can no longer track how individual
feature components xj of x influence regression function (5) because the composition of FFN layers
acts rather as a black box, e.g. in general, it is not clear how each feature component xj influences
the response μ(x), whether a certain component xj needs to be included in the regression function
or whether it could be dropped because it does not contribute. This is neither clear for an individual
example μ(x) (locally) nor at a global level.
The key idea of our LocalGLMnet proposal is to retain the GLM structure (2) as far as possible, but
to let the regression parameters βj = βj (x) become feature x dependent; we call β regression parameter
if it does not depend on x, and we call β(x) regression attention if it is x-dependent. Our proposal
of the LocalGLMnet architecture can be interpreted as a local GLM with network learned regression
attention, as we are going to model the regression attentions βj (x) ∈ R by networks. Strictly speaking,
we typically lose the linearity if we let βj (x) be feature dependent, however, if this dependence is
smooth, we have a sort of a local GLM parameter which justifies our terminology, see also Section 2.4.

Assumption 2.1 (LocalGLMnet): Choose an FFN network architecture of depth d ∈ N with input and
output dimensions being equal to q0 = qd = q to model attention weights
β : Rq → Rq
(6)
x → β(x) = z (d:1) (x) = z (d) ◦ · · · ◦ z (1) (x).

The LocalGLMnet is defined by the additive decomposition after applying a strictly monotone and
continuous link function g : R → R to the mean μ
x → g(μ) = g (μ(x)) = β0 + β(x), x . (7)

Network architecture (7) is an FFN network architecture with a skip connection: first, feature x is
processed through the deep FFN network providing us with learned representation β(x) ∈ Rq , and
second, x has a direct link to the output layer (skipping all FFN layers) providing an (untransformed)
linear term x. The LocalGLMnet then scalar multiplies these two different components, see (7). In
the next section, we give extended remarks and interpretation.
This model can be fitted to data by state-of-the-art stochastic gradient descent (SGD) methods
using training and validation data for performing early stopping to not over-fit to the training data,
for details we refer to Goodfellow et al. (2016) and Section 7.2 in Wüthrich & Merz (2021).
76 R. RICHMAN AND M. V. WÜTHRICH

2.4. Interpretation and extension of LocalGLMnet

We call (7) a LocalGLMnet because in a small environment B (x) around x we may approximate
regression attention β(x ), x ∈ B (x), by a constant regression parameter giving us the interpretation
of a local GLM. In particular, the key assumption of the LocalGLMnet is that the predictions have
an additive composition of the features on the canonical scale (7), which lends itself to an easy inter-
pretation for each prediction made, in the same manner as a GLM might be interpreted. There is a
connection to the LIME technique of Ribeiro et al. (2016), which aims to approximate the predictions
of a complex model using an interpretable model fit in a local neighborhood around the predictions.
This is often a regularized linear model. The coefficients of the local models used within LIME have a
similar interpretation to the regression attentions β(x). An advantage of the LocalGLMnet approach
is that the predictions are produced directly from the additive composition of features, whereas the
explanations produced by LIME might not capture correctly the mechanism used by the complex
model to produce predictions.
Network architecture (7) can also be interpreted as an attention mechanism because βj (x) decides
how much attention is given to feature value xj . In Appendix 2, we also discuss similarities and
differences between smoothed versions of the regression attentions β(x) and PDPs.

Remarks 2.2: • Formula (7) defines the LocalGLMnet. If we replace the scalar product in (7) by
a Hadamard product ⊗ (component-wise product) we receive a LocalGLM layer

x → β (1) (x) ⊗ x = β1(1) (x)x1 , . . . , βq(1) (x)xq ∈ Rq .

A deep LocalGLMnet can be received by composing such LocalGLM layers, for instance, if we
compose two such layers we receive a regression function

x → g(μ) = g (μ(x)) = β0(2) + β (2) β (1) (x) ⊗ x , x .

Such an architecture may lead to increased predictive power and interpretable intermediate
steps.
• Above we have emphasized that the LocalGLMnet will lead to an interpretable regression
model, and the verification of this statement will be done in the examples, below. Alternatively,
if one wants to rely on a plain-vanilla deep FFN network, one can still fit a LocalGLMnet to the
deep FFN network as an interpretable surrogate model.
• The LocalGLMnet has been introduced for tabular data as we try to mimic a GLM that acts on a
design matrix which naturally is in tabular form. If we extend the LocalGLMnet to unstructured
data, time series or image recognition it requires that this data is first encoded into tabular form,
e.g. by using a convolutional module that extracts from spatial data relevant feature information
and transforms this into tabular structure. That is, it requires the paradigm of representation
learning by first bringing raw inputs into a suitable form before encoding this information by
the LocalGLMnet for prediction.

Before we study the implementation of the LocalGLMnet, we would like to get the right interpre-
tation and intuition for regression function (7). We select one component 1 ≤ j ≤ q which provides
us (after applying the link g) with terms
βj (x)xj . (8)
We mention specific cases in the following remarks and how they should be interpreted.

Remarks 2.3: (1) A GLM term is obtained in component xj if βj (x) ≡ βj = 0 is not feature
dependent, providing βj xj , we refer to GLM (2). If this appears to be the case, e.g. using
SCANDINAVIAN ACTUARIAL JOURNAL 77

the model diagnostic tools presented in the next sections, one could replace this term in the
LocalGLMnet with a constant to reduce unnecessary model complexity.
(2) Condition βj (x) ≡ 0 proposes that the term xj should not be included. In Section 3.2, we are
going to present an empirical test for the null hypothesis of dropping a term.
(3) Property βj (x) = βj (xj ) says that we have a term βj (xj )xj that does not interact with other
terms. In general, we can analyze βj (x) over different features x with the jth component xj
being fixed. If this βj (x) does not show any sensitivity in the components different from j,
then we do not have interactions and otherwise we do. Below, we extract this information by
considering the gradients

∂ ∂
∇βj (x) = βj (x), . . . , βj (x) ∈ Rq . (9)
∂x1 ∂xq

The jth component of this gradient ∇βj (x) explores whether we have a linear term in xj , and
the components different from j quantify the interaction strengths.
(4) There is one caveat with these interpretations as we do not have identifiability in model
calibration, as, e.g. we could receive the following structure:

βj (x)xj = xj , (10)

by learning a regression attention βj (x) = xj /xj . However, our tests on different configura-
tions have not manifested any such issues as SGD fitting seems rather pre-determined by the
LocalGLMnet functional form (7) if we initialize SGD in the optimal GLM.

3. Examples
3.1. Synthetic data example
We start with a synthetic data example because this has the advantage of knowing the true data gen-
erating model. This allows us to verify that we draw the right conclusions. We choose q = 8 feature
components x = (x1 , . . . , x8 ) ∈ R8 . We generate two data sets, learning data L and test data T .
The learning data will be used for model fitting and the test data for an out-of-sample generaliza-
tion analysis. We choose for both data sets n = 100, 000 randomly generated independent features
x ∼ N (0, ) being centered and having unit variance. Moreover, we assume that all components of
¯
x are independent, except between x2 and x8 we assume a correlation of 50%. Based on these features
we choose regression function

1 1 1 1 1
x = (x1 , . . . , x8 ) ∈ R8 → μ(x) = x1 − x22 + |x3 | sin(2x3 ) + x4 x5 + x52 x6 , (11)
2 4 2 2 8
thus, neither x7 nor x8 run into the regression function, x7 is independent from the remaining compo-
nents, and x8 has a 50% correlation with x2 . Observe that all components of x have unit variance and,
hence, live on the same scale. Based on this regression function we generate independent Gaussian
observations
Y|x ∼ N (μ(x), 1). (12)
This gives us the two data sets with all instances being independent

L = {(Yi , xi ); 1 ≤ i ≤ n} and T = {(Yt , xt ); n + 1 ≤ t ≤ 2n}.

Gaussian model (12) is an EDF with cumulant function κ(θ ) = θ 2 /2, effective domain = R,
exposure v = 1 and dispersion parameter ϕ = 1, and we can apply the theory of Section 2.
78 R. RICHMAN AND M. V. WÜTHRICH

Table 1. In-sample and out-of-sample MSEs in the synthetic data example.

MSE losses
Out-of-
In-sample sample on
on L T
(a) True regression function μ(·) 1.0023 0.9955
(b) Null model (bias β0 only) 1.7907 1.7916
(c) GLM case 1.5241 1.5274
(d) LocalGLMnet 1.0023 1.0047

We start with the GLM. As link function g we choose the identity function which is the canonical
link of the Gaussian model. This provides us with linear predictor in the GLM case

x → μ(x) = η(x) = β0 + β, x , (13)

with regression parameter β ∈ R8 and bias β0 ∈ R. This model is fit to the learning data L using
maximum likelihood estimation (MLE) which, in the Gaussian case, is equivalent to minimizing the
mean squared error (MSE) loss function for regression parameter (β0 , β)

1
n
0MLE ,
(β β
MLE
) = arg min (Yi − μ(xi ))2 . (14)
(β0 ,β) n i=1

For this fitted model, we calculate the in-sample MSE on L and the out-of-sample MSE on T . We
compare these losses to the MSEs of the true model μ(xi ), which is available here, and the corre-
sponding MSEs of the null model which only includes a bias β0 . These figures are given in Table 1 on
lines (a)–(c).
The MSEs under the true regression function are roughly equal to 1 which exactly corresponds to
the fact that the responses Y have been simulated with unit variance, see (12), the small differences
to 1 correspond to the randomness implied by simulating. The loss figures of the GLM are between
the null model (homogeneous model) and the correct model. Still these loss figures are comparably
large because the true model (11) has a rather different functional form compared to what we can
capture by the linear function (13). This is also verified by Figure 1 (lhs) which compares the GLM
estimated means μ(xt ) to the true means μ(xt ) for different instances xt . A perfect model would
provide points on the diagonal orange line, and we see rather big differences between the GLM and
the true regression function μ.
We could now start to improve (13), e.g. by including a quadratic term. We refrain from doing so,
but we fit the LocalGLMnet architecture (7) using the identity link for g, a network of depth d = 4
having (q0 , q1 , q2 , q3 , q4 ) = (8, 20, 15, 10, 8) neurons, and as activation functions φm we choose the
hyperbolic tangent function for layers m = 1, 2, 3 and the linear function for m = 4. This architecture
is illustrated in Listing 1 in the appendix. This LocalGLMnet is fitted using the nadam version of SGD;
early stopping is tracked by using 20% of the learning data L as validation data V and the remaining
80% as training data U , and the network calibration with the smallest MSE validation loss on V is
selected; note that L = U ∪ V is disjoint from the test data T . This fitting approach is state-of-the-art,
for more details we refer to Section 7.2.3 in Wüthrich & Merz (2021). Typically, the SGD algorithm
starts in a random initialization, e.g. using the Glorot initializer. We use a special initialization in
(d)
the last hidden layer z (d) , here. Namely, we set all weights to zero wj = 0, 1 ≤ j ≤ q, and the biases
(d) (d)
in this last hidden layer are initialized with the GLM parameter estimate (w0,1 , . . . , w0,q ) =
MLE
β ,
see (14). This choice implies that we start the SGD algorithm directly in the GLM that we try to
improve by learning potentially non-constant attention weights β(x). This initialization also mitigates
identifiability issue (10), because the LocalGLMnet has already rather pre-determined (linear) terms
by this initialization.
SCANDINAVIAN ACTUARIAL JOURNAL 79

Figure 1. Estimated means μ(xt ) vs. true means μ(xt ): (lhs) ﬁtted GLM (13) and (rhs) ﬁtted LocalGLMnet of 5000 randomly selected
out-of-sample instances xt from T .

The fitting results are given on line (d) of Table 1. The MSEs (in-sample and out-of-sample) are
very close to 1 which indicates that the LocalGLMnet is able to find the true regression structure (13).
This is verified by Figure 1 (rhs) which plots the estimated means
μ(xt ) against the true means μ(xt )
(out-of-sample) for 5000 randomly selected instances xt from T . The fitted LocalGLMnet estimators
lie on the diagonal which says that we have good accuracy.
We are now ready to benefit from the LocalGLMnet architecture. We study the estimated regres-
sion attentions and the resulting terms in the LocalGLMnet regression function
j (x)
x → β j (x)xj .
and x → β

The interpretation to these terms has been given in Remarks 2.3. We use the code of Listing 2 to
extract β(x) from the fitted LocalGLMnet. These estimated regression attentions β j (x) are illustrated
in Figure 2 by the black dots for all components 1 ≤ j ≤ q = 8.
We interpret Figure 2. Regression attention β 1 (x) is concentrated around 1/2 which describes the
first term in (11). Regression attentions β 2 (x), . . . , β
6 (x) are quite different from 0 (red horizontal
line) which indicates that x2 , . . . , x6 are important in the description of the true regression function μ.
Finally, β7 (x) and β8 (x) are concentrated around zero (light cyan stripe), which indicates that feature
information x7 and x8 may not be important for our regression function; the construction of the light
cyan stripe is explained in the next section on variable selection. Thus the last two variables could
be dropped from the model, unless they play an important role in β(x). This can be checked by just
refitting the model without these variables. Figure 2 also shows that some regression attentions vary
smoothly, β 2 (x) and β 3 (x), this highlights that the terms x2 and x3 do not exhibit interactions with
other variables, in fact, β 2 (x) suggest a linear term with slope −1/4 which indeed is the case, see (11),
and β 3 (x) has a sine behavior. Other attention weights β j (x), 4 ≤ j ≤ 6, look rather non-smooth (are
dispersed). This indicates that these terms have significant interactions with other variables.

3.2. Variable selection

In the previous example, we have just said that we should drop variables x7 and x8 from the Local-
GLMnet regression because regression attentions β 7 (x) and β
8 (x) spread around zero (are in the light
cyan stripe). Obviously, these two estimators should be identically equal to zero because they do not
appear in the true regression function μ(·), but the noise in the data Yi is letting their estimators
fluctuate around zero. This fluctuation is of comparable size for both x7 (which is independent of all
other variables) and x8 (which is correlated with x2 ). The main question is: how much fluctuation
around zero is still acceptable for allowing to drop a variable, i.e. does not reject the null hypothesis
80 R. RICHMAN AND M. V. WÜTHRICH

Figure 2. Regression attentions β j (xt ), 1 ≤ j ≤ q = 8, of 5000 randomly selected out-of-sample instances xt from T ; the y-scale
is identical in all plots and on the x-scale we have xj .

H0 of setting β7 (x) = 0, or how much fluctuation reveals real regression structure? In GLMs, this
question is answered by either computing the Wald test or the likelihood ratio test (LRT) which use
asymptotic normality of MLEs, see Section 2.2.2 of Fahrmeir & Tutz (1994). Here, we cannot rely on
an asymptotic theory for MLEs because early stopping implies that we do not consider the MLE. An
7 (x) is bigger than the magnitudes used in the
analysis of the results also shows that the volatility in β
Wald test and the LRT. For this reason, we propose an empirical way of determining the rejection
region of the null hypothesis H0 : β7 (x) = 0.
If we are given a statistical problem with features x = (x1 , . . . , xq ) ∈ Rq , we propose to extend
these features by an additional variable xq+1 which is completely random, independent of x and
which, of course, does not enter the true (unknown) regression function μ(x), but is only included in
the network. This additional random component xq+1 will quantify the magnitudes of the fluctuations
q+1 (x) of an independent component that does not enter the regression function. To successfully
in β
apply this empirical test we need to normalize all feature components xj , 1 ≤ j ≤ q + 1, to have zero
empirical mean and unit variance, i.e. they should all live on the same scale. Note that such a nor-
malization should already have been done for successful SGD fitting, thus, this does not impose an
additional step, here. We then fit a LocalGLMnet (6)–(7) to the learning data L with extended features
x+ = (x , xq+1 ) ∈ Rq+1 which gives us the estimated regression attentions β 1 (x+ +
i ), . . . , βq+1 (xi ).
SCANDINAVIAN ACTUARIAL JOURNAL 81

Using these estimated regression attentions, we receive empirical mean and standard deviation for
the additional component

1 1
n n
2
b̄q+1 = q+1 (x+
β i ) and
sq+1 = q+1 (x+
β i ) − b̄q+1 . (15)
n i=1 n − 1 i=1

Since this additional component xq+1 does not enter the true regression function we expect b̄q+1 ≈ 0,
and sq+1 quantifies the expected magnitude of fluctuations around zero.
The null hypothesis H0 : βj (x) = 0 for component j on significance level α ∈ (0, 1/2) can then be
j (x+
rejected if the coverage ratio of the following centered interval Iα for β i ), 1 ≤ i ≤ n,

Iα = qN (α/2) ·
sq+1 , qN (1 − α/2) ·
sq+1 = qN (α/2) ·
sq+1 , −qN (α/2) ·
sq+1 (16)

is substantially smaller than 1-α, where qN (p) denotes the quantile of the standard Gaussian
distribution on quantile level p ∈ (0, 1).
We come back to our Figure 2. We do not add an additional component xq+1 , but we directly use
component x7 to test the null hypotheses H0 : βj (x) = 0 for the remaining components j = 7. In our
synthetic data example we have

b̄7 = −0.0068 ≈ 0 and

s7 = 0.0461.

We choose significance level α = 0.1% which provides us with qN (0.05%) = 3.2905. The light cyan
stripe in Figure 2 shows the resulting interval Iα , given in (16), for our example. Only for x8 the black
dots β8 (xt ) are within these confidence bounds Iα which implies that we should drop this component
and keep components x1 , . . . , x6 in the regression model. Note that x8 and x2 are correlated, and still
our empirical Wald test does not reject the null hypothesis of setting β8 (x) = 0. That is, we do the
right decision, and there is no information leakage, here, through the correlated feature components.
In a final step, the model with dropped components should be re-fitted and the out-of-sample loss
should not substantially change, this re-fitting step verifies that the dropped components also do not
play a significant role in the regression attentions β j (x) of the remaining feature components j, i.e.
contribute by interacting with other variables.
Figure 3 gives the resulting feature contributions β j (xt )xt,j , 1 ≤ j ≤ q = 8, to the LocalGLMnet
estimated regression function μ(xt ). We clearly see the linear term in x1 , the quadratic term in x2
and the sine term in x3 (first line of Figure 3), see also (11) for the true regression function μ. The
second line of Figure 3 shows the interacting feature components x4 , x5 and x6 , and the last line those
that should be dropped.

3.3. Interactions
In the next and final step, we explore interactions between different feature components. This is based
on analyzing the gradients ∇βj (x) for 1 ≤ j ≤ q, see (9). The jth component of this gradient ∇βj (x)
explores whether we have a linear term in xj or not. If there are no interactions of the jth compo-
nent with other components, i.e. βj (x)xj = βj (xj )xj , it will exactly provide us with the right functional
form of βj (x) since in that case ∂βj (x)/∂xj = 0 for all j = j. In relation to Figure 2 this means that the
scatter plot resembles a line that does not have any lateral dilation. In Figure 2, this is the case for com-
ponents x1 , x2 , x7 , x8 , for component x3 this is not completely clear from the scatter plot, and x4 , x5 , x6
show lateral extensions indicating interactions. In this latter general case, we have ∂βj (x)/∂xj = 0 for
some j = j.
We calculate the gradients ∇ β j (x), 1 ≤ j ≤ q, of the fitted model. These can be obtained by the
R code of Listing 3. This provides us for fixed j with vectors ∇ β j (xi ) for all instances i = 1, . . . , n.
82 R. RICHMAN AND M. V. WÜTHRICH

Figure 3. Feature contributions β j (xt )xt,j , 1 ≤ j ≤ q = 8, to the LocalGLMnet estimated regression function μ(xt ) of 5000 ran-
domly selected out-of-sample instances xt from T ; the y-scale is identical in all plots and on the x-scale we have xj ; the horizontal
lines indicate the magnitude of the feature contributions in increments of 0.5.

To analyze these gradients, we fit a spline to these observations by regressing

∂
j (xi ) =
∂xk β j (xi ) ∼ xi,j .
β (17)
∂xk

This studies the derivative of regression attention β j (x) w.r.t. xk as a function of the corresponding
feature component xj that is considered in the feature contribution βj (x)xj , see (7). The code for these
spline fits is also provided in Listing 3 on lines 12–15, and it gives us the results illustrated in Figure 4.
Figure 4 studies the spline regressions (17) only of the significant components x1 , . . . , x6 that enter
regression function μ, see (11). We interpret these plots.

• Plot x1 shows that all gradients are roughly zero, which means that β1 (x) = const, which,
indeed, is the case in the true regression function μ.
2 (x),
• Plots x2 and x3 show one term that is significantly different from 0. For the x2 plot it is ∂x2 β

and for the x3 plot it is ∂x3 β3 (x). This says that these two terms do not have any interactions with
other variables, and that the right functional form is not the linear one, but β j (x)xj = βj (xj )xj
SCANDINAVIAN ACTUARIAL JOURNAL 83

j (xi ), 1 ≤ j, k ≤ 6, over all instances i = 1, . . . , n.

Figure 4. Spline ﬁts to the sensitivities ∂xk β
84 R. RICHMAN AND M. V. WÜTHRICH

2 (x) ≈ const which says that we have a quadratic

is non-linear for j = 2, 3. In fact, we have ∂x2 β

term in x2 , and ∂x3 β3 (x) shows a sine like behavior.
• Plot x4 shows a linear interaction with x5 because ∂x5 β 4 (x) ≈ const.

• Plot x5 shows a linear interaction with x4 because ∂x4 β5 (x) ≈ const, and it shows an interaction
with x6 .
• Plot x6 does not show any significant terms, as they have already been captured by the previ-
ous plots. Note that this comes from the fact that terms x52 x6 /8 do not lead to an identifiable
decomposition, but this can be allocated either to x5 or to x6 , or could even be split among the
two.

We give concluding remarks on this synthetic data example. In general, interpretation is simpler if
we have βj (x) = βj (xj ), i.e. if the components do not have interactions, because this results in rather
smooth curves in Figure 3. This may be achieved by pre-processing features, e.g. we may consider
the body mass index, instead of using the weight and the height of a person (if there is corresponding
interaction in the regression function). Additionally, we can also use the functional forms found by
the LocalGLMnet, e.g. we have seen that component x2 enters the regression function as a quadratic
term, and we could fit another LocalGLMnet using feature component x22 , instead. This may provide
even clearer and better explanations.

3.4. Real data example

As a second example we consider a real data example. In this real data example, we also discuss how
categorical feature components should be pre-processed for our LocalGLMnet approach. We con-
sider the French motor third party liability (MTPL) claims frequency data FreMTPL2freq which
is available through the R package CASdatasets of Dutang & Charpentier (2018). This dataset
has been used within the actuarial literature as a standard benchmark for the application of machine
learning methods to the problem of non-life pricing and is described in Appendix A of Wüthrich &
Merz (2021) and in the tutorials of Noll et al. (2018) and Lorentzen & Mayer (2020). Some other works
focusing on this dataset are Kuo (2019), Richman (2020), Richman & Wüthrich (2020), and Delong
& Kozak (2021), among others. We apply the data cleaning of Listing B.1 of Wüthrich & Merz (2021)
to this data. After data cleaning, we have observations (Yi , vi , xi )i with claim counts Yi ∈ N0 , time
exposures vi ∈ (0, 1] and feature information xi . We have six continuous feature components (called
‘Area Code’, ‘Bonus-Malus Level’, ‘Density’, ‘Driver’s Age’, ‘Vehicle Age’, ‘Vehicle Power’), 1 binary
component (called ‘Vehicle Gas’) and two categorical components with more than two levels (called
‘Vehicle Brand’ and ‘Region’). We pre-process these components as follows: we center and normalize
to unit variance the six continuous and the binary components. We apply one-hot encoding to the
two categorical variables, we emphasize that we do not use dummy coding as it is usually done in
GLMs. Below, in Section 3.6, we are going to motivate this one-hot encoding choice (which does not
lead to full rank design matrices); for one-hot encoding vs. dummy coding we refer to formulas (5.21)
and (7.29) in Wüthrich & Merz (2021).
As a control variable, we add two random feature components that are i.i.d., centered and with unit
variance, the first one having a uniform distribution and the second one having a standard normal
distribution, we call these two additional feature components ‘RandU’ and ‘RandN’. We consider
two additional independent components to understand whether the distributional choice influences
the results of hypothesis testing using the empirical interval Iα , see (16). Altogether (and using one-
hot encoding) we receive q = 42 dimensional tabular feature variables xi ∈ Rq ; this includes the two
additional components RandU and RandN.
We fit a Poisson network regression model to this French MTPL data. Note that the Poisson distri-
bution belongs to the EDF with cumulant function κ(θ ) = exp{θ } on the effective domain = R.
The canonical link is given by the log-link, this motivates link choice g(·) = log(·). We then start by
fitting a plain-vanilla FFN network (5) to this data. This FFN network will give us the benchmark in
SCANDINAVIAN ACTUARIAL JOURNAL 85

Table 2. In-sample and out-of-sample losses on the real MTPL data example.
Poisson deviance losses in 10−2
Out-of-
In-sample sample on
on L T
(a) null model (bias β0 only) 25.213 25.445
(b) FFN network 23.764 23.873
(c) LocalGLMnet 23.728 23.945
(d) reduced LocalGLMnet 23.714 23.912
(e) Poisson GLM3 24.084 24.102
(f) Categorical Embedding network 23.690 23.824
(g) Nagging network 23.691 23.783

terms of predictive power. We choose a network of depth d = 3 having (q1 , q2 , q3 ) = (20, 15, 10) FFN
neurons; the R code for this FFN architecture is given in Listing 7.1 of Wüthrich & Merz (2021), but
we replace input dimension 40 by 42 on line 3 of that listing to account for the two additional feature
components ‘RandU’ and ‘RandN’. To do a proper out-of-sample generalization analysis we parti-
tion the data randomly into a learning data set L and a test data set T . The learning data L contains
n = 610, 206 instances and the test data set T contains 67, 801 instances; we use exactly the same
split as in Table 5.2 of Wüthrich & Merz (2021). The learning data L will be used to learn the network
parameters and the test data T is used to perform an out-of-sample generalization analysis. As loss
function for parameter fitting and generalization analysis we choose the Poisson deviance loss, which
is a distribution adapted and strictly consistent loss function for the mean within the Poisson model,
for details we refer to Section 4.1.3 in Wüthrich & Merz (2021).
We fit this FFN network using the nadam version of SGD on batches of size 5000 over 100 epochs,
and we retrieve the network calibration that provides the smallest validation loss on a training-
validation partition U and V of the learning data L. The results are presented on line (b) of Table 2.
The FFN network provides clearly better results than the null model only using a bias β0 . This justifies
regression modeling, here.
Next we fit the LocalGLMnet architecture using exactly the same set up as in the FFN network, but
having a depth of d = 4 with numbers of neurons (q0 , q1 , q2 , q3 , q4 ) = (42, 20, 15, 10, 42). The results
on line (c) of Table 2 show that we sacrifice a bit of predictive power (out-of-sample) for receiving our
interpretable network architecture. Moreover, we complement the table with results of Wüthrich &
Merz (2021) fitting several other models to the same data, namely, a GLM on line (e), an FFN which
encodes the categorical features using embedding layers on line (f), and an ensemble of networks
using the nagging predictor on line (g). Compared to the GLM, which is currently the industry stan-
dard for non-life claim frequency prediction, the LocalGLMnet model performs significantly better
(out-of-sample). The FFN with embedding layers and the ensemble of networks outperform both the
LocalGLMnet and the simple FFN on line (b). Thus for the LocalGLMnet we sacrifice a little bit of
predictive performance for having an explainable predictive model. On the other hand, we can per-
form variable selection with the LocalGLMnet, and if the ultimate goal is predictive performance we
can still fit another network architecture to the LocalGLMnet-selected variables.
We now analyze the resulting estimated regression attentions β j (x). We start by studying the regres-
sion attentions β j (x) of the continuous and binary feature components Area Code, Bonus-Malus
Level, Density, Driver’s Age, Vehicle Age, Vehicle Gas, Vehicle Power, RandU and RandN. First, we
calculate the empirical standard deviations sj that we receive from RandU and RandN, see (15),

sRandU = 0.052 and
sRandN = 0.048.

Thus these standard deviation estimates are rather similar, and in this case the specific distribu-
tional choice of the control variable xq+1 does not influence the results. We discuss further these
distributional choices and alternative choices together with an analysis of impact of sampling error
86 R. RICHMAN AND M. V. WÜTHRICH

Figure 5. Attention weights βj (xt ) of the continuous and binary feature components Area Code, Bonus-Malus Level, Density,
Driver’s Age, Vehicle Age, Vehicle Gas, Vehicle Power, RandU and RandN for 5000 randomly selected instances xt of T ; the light
cyan stripe shows the conﬁdence area Iα on signiﬁcance level α = 0.1%.

in Appendix 2. We calculate interval Iα for significance level α = 0.1%, see (16). The resulting interval
Iα is illustrated by the light cyan stripes in Figure 5. We observe that for the variables Bonus–Malus
Level, Density, Driver’s Age, Vehicle Age and Vehicle Gas we clearly reject the null hypothesis
H0 : βj (x) = 0 on the chosen significance level α = 0.1%, and Area Code and Vehicle Power need
further analysis. For these two variables, Iα provides a coverage ratio of 97.1% and 98.1%, thus, strictly
speaking these numbers are below 1 − α = 99.9% and we should keep these variables in the model.
Nevertheless, since there is some sampling error in deriving sRandU and sRandN , e.g. the appendix
shows that different samples for these variables could produce slightly wider intervals, we further
analyze these two variables. From the empirical analysis in Noll et al. (2018) we know that Area Code
and Density are highly correlated. Figure 6 (lhs) shows the boxplot of Density vs. Area Code, and this
plot highlights that Density almost fully explains Area Code. Therefore, it is sufficient to include the
Density variable and we drop Area Code.
Thus, we drop the variables Area Code and VehPower, and we also drop the control variables
RandU and RandN, because these are no longer needed. This gives us a reduced input dimension
of q0 = q = 38, and we run the same LocalGLMnet SGD fitting again. Line (d) of Table 2 gives the
SCANDINAVIAN ACTUARIAL JOURNAL 87

Figure 6. (lhs) Boxplots of Density vs. Area Code and (rhs) Bonus-Malus vs. Driver’s Age.

in-sample and out-of-sample results of this reduced LocalGLMnet model. We observe a small out-of-
sample improvement compared to line (c) which confirms that we can drop Area Code and VehPower
without losing predictive performance. In fact, the small improvement indicates that a smaller model
less likely over-fits, and we can consider more SGD steps before over-fitting, i.e. we receive a later
early stopping point.
Having performed variable selection and dropping unimportant variables, another approach
would be to then fit a different model to the reduced set of features, for example an FFN, which
would provide better out-of-sample performance than the LocalGLMnet. This approach is advisable
if the emphasis is on predictive performance instead of interpretability.
Figure 7 shows the feature contributions β j (xt )xj,t of the selected continuous and binary feature
components Bonus–Malus Level, Density, Driver’s Age, Vehicle Age and Vehicle Gas of 5000 ran-
domly selected instances xt , and the magenta line shows a spline fit to these feature contributions; the
y-scale is the same in all plots. We observe that all these components contribute substantially to the
regression function, the Bonus–Malus variable being the most important one, and Vehicle Gas being
the least important one. Bonus–Malus and Density have in average an increasing trend, and Vehicle
Age has in average a decreasing trend, Density being close to a linear function, and the remaining
continuous variables are clearly non-linear. The explanation of Driver’s Age is more difficult as can
be seen from the spline fit (magenta color). Figure 6 (rhs) shows the boxplot of Bonus–Malus Level
vs. Driver’s Age. We observe that new (young) drivers enter the bonus-malus system at 100, and every
year of driving without an accident decreases the bonus-malus level. Therefore, the lowest bonus-
malus level can only be reached after multiple years of accident-free driving. This can be seen from
Figure 6 (rhs), and it implies that the Bonus–Malus Level and the Driver’s Age variables interact. We
are going to verify this by studying the gradients ∇ β j (x) of the regression attributions. The smoothed
feature contributions of Figure 7 are compared in Appendix 2 to PDPs produced using an FFN fit to
the same reduced set of variables. The comparison shows that the smoothed feature contributions are
generally similar to PDPs, but differ for values of the variables where there are not many observations
(note that PDPs do not consider dependence within feature vectors appropriately).
Figure 8 shows the spline fits to the gradients ∂xk β j (x) of the continuous variables Bonus–Malus
Level, Density, Driver’s Age and Vehicle Age. First, we observe that Bonus–Malus Level and Driver’s
Age have substantial non-linear terms ∂xj β j (x)xj , Vehicle Age shows some non-linearity and Density

seems to be pretty linear since ∂xj βj (x)xj ≈ 0. This verifies the findings of Figure 7 of the magenta
spline fits.
88 R. RICHMAN AND M. V. WÜTHRICH

Figure 7. Feature contributions β j (xt )xj,t of the continuous and binary feature components Bonus–Malus Level, Density, Driver’s
Age, Vehicle Age and Vehicle Gas of 5000 randomly selected instances xt of T ; the y-scale is the same in all plots and the magenta
color gives a spline ﬁt to the feature contributions.

Next we focus on interactions which requires the study of ∂xk β j (x) for k = j in Figure 8. The most
significant interactions can clearly be observed between Bonus–Malus Level and Driver’s Age, but
also between the Bonus–Malus Level and Density we encounter an interaction term saying that a
higher Bonus–Malus Level at a lower Density leads to a higher prediction, which intuitively makes
sense as in less densely populated areas we expect less claims. For Vehicle Age, we do not find sub-
stantial interactions, it only weakly interacts with Bonus–Malus Level and Driver’s Age by entering
the corresponding regression attributions β j (x) of these two feature components.
There remains the discussion of the categorical feature components Vehicle Brand and Region.
This is going to be done in Section 3.6.

Remark 3.1: Networks fitted with early stopping in SGD do not fulfill the so-called balance property
and, henceforth, may have a bias. Since we work under the canonical link choice, here, we can apply
the bias regularization method proposed in Wüthrich (2020). For this we define the learned features

zi,j = β j (xi )xi,j , 1 ≤ j ≤ q, that is, for each instance 1 ≤ i ≤ n we have a new learned feature
zi =
(
zi,1 , . . . ,
zi,q ) ∈ Rq . These learned features can then be used in a classical GLM, see (2),

q

z i → g(μ) = α0 + α,
z i = α0 + αj
zi,j ,
j=1

with regression parameter (α0 , α) ∈ Rq+1 . Since an MLE estimated GLM under canonical link choice
satisfies the balance property, so will the LocalGLMnet

q

α0MLE +
xi → g(μ) = j (xi ) xi,j .
αjMLE β

j=1
SCANDINAVIAN ACTUARIAL JOURNAL 89

j (xi ) of the continuous variables Bonus–Malus Level, Density, Driver’s Age and Vehicle
Figure 8. Spline ﬁts to the gradients ∂xk β
Age over all instances i = 1, . . . , n.

That is, we just slightly rescale the regression attentions to receive the balance property. For more on
this topic, we refer to Section 7.4.2 in Wüthrich & Merz (2021).

3.5. Variable importance

j (x), 1 ≤ j ≤ q, allow us to quantify variable importance. A
The estimated regression attentions β
simple measure of variable importance can be defined by

1
n

VIj = j (xi ) ,
β
n i=1

for 1 ≤ j ≤ q and where we aggregate over all instances 1 ≤ i ≤ n. Typically, the bigger these values
VIj the more component xj influences the regression function. Note that all feature components xj
have been centered and normalized to unit variance, i.e. they live on the same scale, otherwise such
a comparison of VIj across different j would not make sense. Figure 9 gives the variable importance
results, which emphasizes that Area Code and Vehicle Power are the least important variables which,
in fact, have been dropped in a second step above.
90 R. RICHMAN AND M. V. WÜTHRICH

Figure 9. Variable importance VIj , 1 ≤ j ≤ q.

3.6. Categorical feature components

In this section, we discuss how categorical feature components can be considered. Note that at the
beginning of Section 3.4, we have emphasized that we use one-hot encoding and not dummy coding
for categorical variables. The reason for this choice can be seen in Figure 7. Namely, if the feature value
is xj = 0, then the corresponding feature contribution gives βj (x)xj = 0. This provides the calibration
of the regression model, i.e. it gives the reference level which is determined by the bias β0 ∈ R. Since in
GLMs we do not allow the components to interact in the linear predictor η(x), all instances that have
the same level xj receive the same contribution βj xj , see (2). In the LocalGLMnet, we allow the same
level xj to have different contributions βj (x)xj through interactions in the regression attention β(x).
If we want to carry this forward to categorical feature components it requires that these components
receive an encoding that is not identically equal to zero. This is the case for one-hot encoding, but
not for dummy coding where the reference level is just identical to the bias β0 . For this reason, we
recommend to use one-hot encoding for the LocalGLMnet.
Figure 10 shows the feature contributions β j (x) of the categorical feature components, note that
each box corresponds to one level of the chosen categorical variable. First, different medians between
the boxes indicate different parameter sizes β j (x) for different levels j; note that in one-hot encoding j
describes the different levels of the categorical variable. Second, the larger the box the more interaction
this level has with other feature components, because β j (xi ) is more volatile over different instances i.
From Figure 10, we observe that Vehicle Brands B11 and B14 are the two most extreme Vehi-
cle Brands w.r.t. claims frequency, B11 having the highest expected frequency and B14 the lowest.
From the French Regions R74 (Limousin) and R82 (Rhône-Alpes) seem outstanding having a higher
frequency than elsewhere. This finishes our example.
In fact, Figure 10 provides contextualized embeddings for the different levels of the categorical
features. Attention-based embedding models exactly try to do such a contextualized embedding, we
refer to Kuo & Richman (2021) in an actuarial context and to Huang et al. (2020) for general tabular
data.
We could add the learned categorical regression attentions β j (x) to the variable importance plot
of Figure 9. This requires some care. First, if categorical variables have many levels, then showing
individual levels will not result in clear plots. Second, one-hot encoding is not normalized and cen-
tered to unit variance, thus, these one-hot encoded variables live on a different scale compared to the
standardized ones, and a direct comparison is not sensible.
SCANDINAVIAN ACTUARIAL JOURNAL 91

j (x) of the categorical feature components Vehicle Brand and French Regions; the
Figure 10. Boxplot of the feature contributions β
y-scale is the same as in Figure 7.

4. Conclusion
We have introduced the LocalGLMnet which is inspired by classical GLMs. Making the regression
parameters of a generalized linear model feature dependent allows us to receive a flexible regression
model that shares representation learning and the predictive performance of classical fully-connected
feed-forward neural networks, and at the same time it remains interpretable. This appealing structure
allows us to perform variable selection, it allows us to study variable importance and it also allows
us to determine interactions. Thus it provides us with a fully transparent network model that brings
out the internal structure of the data. To the best of our knowledge, this is rather unique in network
regression modeling, since our proposal does not share the shortcomings of similar proposals like
computational burden or a loss of predictive power.
Above we have mentioned possible extensions, e.g. LocalGLM layers can be composed to receive
deeper interpretable networks, and the LocalGLMnet can serve as a surrogate model to shed more
light into many other deep learning models. A different method of utilizing the LocalGLMnet
would be to iteratively fit the model and then use the smoothed feature contribution and inter-
action plots (i.e. those shown in Figures 7 and 8) to aid the fitting of a GLM by adding new
variables corresponding to these plots to the GLM. Once these are added, the LocalGLMnet could
be refit to the expanded variables and the process is repeated. Another extension would be a
method of constraining the regression attention coefficients to produce monotone effects which
would potentially be useful in practice. However, this would require substantial modifications to
the LocalGLMnet fitting process, so we leave this modification for monotonic effects (which applies
equally to more general neural networks, such as FFNs) for further research. Whereas our archi-
tecture is most suitable for tabular input data, the question about optimal consideration of non-
tabular data or of categorical variables with many levels is one point that should also be further
explored.

Acknowledgments
We thank Christopher Blier-Wong, Friedrich Loser, Andreas Tsanakas and the two anonymous reviewers for providing
very useful remarks to improve earlier versions of this manuscript.
92 R. RICHMAN AND M. V. WÜTHRICH

Disclosure statement
No potential conflict of interest was reported by the author(s).

References
Agarwal R., Frosst N., Zhang X., Caruana R. & Hinton G. E. (2020). Neural additive models: interpretable machine
learning with neural nets. arXiv:2004.13912v1.
Ahn J., Lu Y., Oh R., Park K. & Zhu D. (2021). Neural credibility [conference presentation]. In Virtual 24th International
Congress on Insurance: Mathematics and Economics, July 5–10. Urbana-Champaign, USA: University of Illinois.
Apley D. W. & Zhu J. (2020). Visualizing the effects of predictor variables in black box supervised learning models.
Journal of the Royal Statistical Society: Series B 82(4), 1059–1086.
Arik S. Ö & Pfister T. (2019). TabNet: attentive interpretable tabular learning. arXiv:1908.07442v5.
Bahdanau D., Cho K. & Bengio Y. (2014). Neural machine translation by jointly learning to align and translate.
arXiv:1409.0473.
Barndorff-Nielsen O. (2014). Information and exponential families: in statistical theory. Chichester, West Sussex, UK:
John Wiley & Sons.
Chen C., Li O., Tao C., Barnett A., Su J. & Rudin C. (2019). This looks like that: deep learning for interpretable image
recognition. Advances in Neural Information Processing Systems 32, Vancouver, Canada.
Delong Ł. & Kozak A. (2021). The use of autoencoders for training neural networks with mixed categorical and
numerical features. SSRN Manuscript ID 3952470.
Dutang C. & Charpentier A. (2018). CASdatasets R Package Vignette. Reference Manual. Version 1.0-8, packaged 2018-
05-20.
Fahrmeir L. & Tutz G. (1994). Multivariate statistical modelling based on generalized linear models. New York, USA:
Springer.
Friedman J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics29(5),
1189–1232.
Goodfellow I., Bengio Y. & Courville A. (2016). Deep learning. MIT Press.
Ha D., Dai A. & Le Q. (2016). Hypernetworks. arXiv:1609.09106.
He K., Zhang X., Ren S. & Sun J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on
Computer Vision and Pattern Recognition, Vol. 1, P. 770–778. Las Vegas, USA.
Huang X., Khetan A., Cvitkovic M. & Karnin Z. (2020). TabTransformer: tabular data modeling using contextual
embeddings. arXiv:2012.00678.
Jørgensen B. (1997). The theory of dispersion models. London, UK: Chapman & Hall.
Kuo K. (2019). Generative synthesis of insurance datasets. arXiv:1912.02423v2.
Kuo K. & Richman R. (2021). Embeddings and attention in predictive modeling. arXiv:2104.03545v1.
Kursa M. & Rudnicki W. (2010). Feature selection with the Boruta package. Journal of Statistical Software 36, 1–13.
Lemhadri I., Ruan F., Abraham L. & Tibshirani R. (2021). LassoNet: a neural network with feature sparsity. Journal of
Machine Learning Research 22, 1–29.
Lorentzen C. & Mayer M. (2020). Peeking into the black box: an actuarial case study for interpretable machine learning.
SSRN Manuscript ID 3595944. Version May 7, 2020.
Lundberg S. M. & Lee S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Infor-
mation Processing Systems, Vol. 30, Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S.,
Garnett, R. (eds.), Montreal: Curran Associates. P. 4765–4774.
McCullagh P. & Nelder J. A. (1983). Generalized linear models. London, UK: Chapman & Hall.
Merz M., Richman R., Tsanakas A. & Wüthrich M. V. (2022). Interpreting deep learning models with marginal
attribution by conditioning on quantiles, Data Mining and Knowledge Discovery, in press.
Nelder J. A. & Wedderburn R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series
A 135(3), 370–384.
Noll A., Salzmann R. & Wüthrich M. V. (2018). Case study: french motor third-party liability claims. SSRN Manuscript
ID 3164764. Version March 4, 2020.
Perla F., Richman R., Scognamiglio S. & Wüthrich M. (2021). Time-series forecasting of mortality rates using deep
learning. Scandinavian Actuarial Journal 2021(7), 572–598.
Ribeiro M. T., Singh S. & Guestrin C. (2016). ‘Why should I trust you?’: explaining the predictions of any classifier. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD
’16. New York: Association for Computing Machinery. P. 1135–1144.
Richman R. (2020). AI in actuarial science – a review of recent advances – part 2. Annals of Actuarial Science 15(2),
230–258.
Richman R. (2021). Mind the gap – safely incorporating deep learning models into the actuarial toolkit. SSRN
Manuscript ID 3857693.
Richman R. & Wüthrich M. (2020). Nagging predictors. Risks 8, 83.
SCANDINAVIAN ACTUARIAL JOURNAL 93

Rudin C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable
models instead. Nature Machine Intelligence 1, 206–215.
Sundararajan M., Taly A. & Yan Q. (2017). Axiomatic attribution for deep networks. In Proceedings of the 34th Inter-
national Conference on Machine Learning, Proceedings of Machine Learning Research, PMLR, Vol. 70, Sydney,
Australia: International Convention Centre. P. 3319–3328.
Tran M., Nguyen N., Nott D. & Kohn R. (2020). Bayesian deep net GLM and GLMM. Journal of Computational and
Graphical Statistics 29, 97–113.
Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł. & Polosukhin I. (2017). Attention is
all you need. arXiv:1706.03762v5.
Vaughan J., Sudjianto A., Brahimi E., Chen J. & Nair V. N. (2018). Explainable neural networks based on additive index
models. arXiv:1806.01933v1.
Wang Q., Li B., Xiao T., Zhu J., Li C., Wong D. F. & Chao L. S. (2019). Learning deep transformer models for machine
translation. arXiv:1906.01787.
Wüthrich M. V. (2020). Bias regularization in neural network models for general insurance pricing. European Actuarial
Journal 10(1), 179–202.
Wüthrich M. V. & Merz M. (2021). Statistical foundations of actuarial learning and its applications. SSRN Manuscript
ID 3822407.

Appendices
R code

Listing 1. Code to implement LocalGLMnet of depth d = 4 for the synthetic Gaussian case.
1 library(keras)
2 #
3 Design = layer_input(shape = c(8), dtype = ’float32’)
4 #
5 Attention = Design %>%
6 layer_dense(units=20, activation=’tanh’) %>%
7 layer_dense(units=15, activation=’tanh’) %>%
8 layer_dense(units=10, activation=’tanh’) %>%
9 layer_dense(units=8, activation=’linear’, name=’Attention’)
10
11 Response = list(Design, Attention) %>% layer_dot(axes=1) %>%
12 layer_dense(units=1, activation=’linear’, name=’Response’)
13 #
14 model <- keras_model(inputs = c(Design), outputs = c(Response))
15 model %>% compile(loss = ’mse’, optimizer = ’nadam’)

Listing 2. Extraction of the estimated weights

β(x).
1 zz <- keras_model(inputs=model$input, outputs=get_layer(model, ’Attention’)$output)
2 #
3 beta.x <- data.frame(zz %>% predict(list(XX))) # XX denotes design/feature matrix
4 #
5 # our architecture still requires a scaling coming from dense layer ’Response’ which
6 # we add to have intercept beta_0 in our LocalGLMnet architecture
7 beta.x <- beta.x * as.numeric(get_weights(model)[[9]])

Further analysis of the real data example

Here we explore further the empirical variable selection method demonstrated on the real data example in Section 3.4,
and compare the average feature contributions within the LocalGLMnet to another popular model interpretability
technique, the partial dependence plot (PDP) of Friedman (2001).

B.1 Variability of variable selection method

To determine the potential variability in variable selection induced by the sampling and fitting error of the ‘RandU’
and ‘RandN’ variables, the model fitting described in Section 3.4 was repeated 100 times using exactly the same data
splits and random seed for the model training, and varying only these variables by drawing new samples from these
94 R. RICHMAN AND M. V. WÜTHRICH

j (x) and code for spline ﬁt.

Listing 3. Extraction of the gradients ∇ β
1 j <- 1 # select the feature component
2 #
3 beta.j <- Attention %>% layer_lambda(function(x) x[,j])
4 model.j <- keras_model(inputs = c(Design), outputs = c(beta.j))
5 #
6 grad <- beta.j %>% layer_lambda(function(x) k_gradients(model.j$outputs, model.j$inputs))
7 model.grad <- keras_model(inputs = c(Design), outputs = c(grad))
8 #
9 grad.beta <- data.frame(model.grad %>% predict(as.matrix(XX)))
10 #
11 #
12 k <- 4 # select component for interaction (j,k)
13 #
14 library(locfit)
15 predict(locfit(grad.beta[,k]~ beta.x[,j], alpha=0.1, deg=2), newdata=c(-400:400)/100)

Table B1. Sample averages and standard deviations of

s over 100 diﬀerent training runs.
RandU RandN permute permute
Bonus-Malus Vehicle Age
Sample average of
s 0.047 0.043 0.051 0.042
Sample standard deviation of
s 0.010 0.009 0.009 0.009

same distributions. Moreover, in a different set of 100 runs, we also replaced these variables with permuted copies of
the Bonus-Malus and Vehicle Age variables. The average value of the standard deviations of the regression attentions
relating to each of these variables (i.e. the average value of
s) that is used to estimate the confidence intervals (16) and
the standard deviation of these are shown in Table B1. Over 100 runs, the average magnitude of s varies somewhat
depending on the distributional assumption underlying the control variables and, on the other hand, the standard
deviations of these are roughly constant, showing little dependence on the distributional assumptions. Since there is
some sampling variability in deriving the confidence bands (16), as it is usually the case in statistics on finite samples,
these confidence bands should be treated as indicative of the ‘true’ values, and when performing variable selection,
those variables that produce regression attention values that are close in magnitude to the confidence bands should
be investigated further, as is done in Section 3.4. On the other hand, adding more variables to the LocalGLMnet, for
example, several permuted variables as well as randomly sampled variables, would likely increase the potential for over
fitting, thus, we prefer to rely on ‘RandU’ and ‘RandN’ in the real data example.

B.2 Comparison to partial dependence plots

Figure 7 shows the feature contributions of each variable to the LocalGLMnet and smoothed values of the feature con-
tributions estimated using a local spline fit. In this section, we consider whether these smoothed values can be compared
to results of estimating PDPs on the reduced dataset. To make this comparison, we fit an FFN to the reduced dataset
(i.e. excluding Area Code and VehPower), and then calculate PDPs for 5,000 randomly selected instances xt of T .
These are compared to the smoothed feature contributions of the remaining continuous variables in Figure B11. The
PDPs generally follow a similar shape to the smoothed feature contributions for the Bonus-Malus, Density and Vehicle
Age variables, but the shape is quite different for the Driver Age variable. Moreover, the magnitudes of the values of
the PDPs are roughly similar to those of the smoothed feature contributions. The most significant departures of the
PDPs from the smoothed feature contributions seem to occur at values of the variables with the least data (the oldest
drivers and those drivers with the lowest bonus-malus scores). Whereas PDPs suffer from the well-known deficiency
of assuming independence between variables, potentially leading to unlikely combinations of variables, the smoothed
feature contributions show the average effects of each variable within the LocalGLMnet, likely leading to more robust
and reasonable estimates of feature importance at these values of the variables, i.e. at the oldest driver ages, and lowest
levels of bonus-malus, we should attribute greater credibility to the smoothed feature contributions than the PDPs. Fur-
thermore, PDPs require significantly more computation than the smoothed feature contributions which are available
directly from the LocalGLMnet.
SCANDINAVIAN ACTUARIAL JOURNAL 95

Figure B11. Spline ﬁt to the feature contributions βj (xt )xj,t and PDPs of the continuous variables Bonus-Malus Level, Density,
Driver’s Age and Vehicle Age of 5,000 randomly selected instances xt of T ; the y-scale is the same in all plots.

3032 2023-IEEE TFS-Fuzzy Rule-Based Explainer Systems For Deep Neural Networks From Local Explainability To Global Understanding
No ratings yet
3032 2023-IEEE TFS-Fuzzy Rule-Based Explainer Systems For Deep Neural Networks From Local Explainability To Global Understanding
12 pages
Shapley Value: From Cooperative Game To Explainable Artificial Intelligence
No ratings yet
Shapley Value: From Cooperative Game To Explainable Artificial Intelligence
12 pages
Categorical Deep Learning Is An Algebraic Theory of All Architectures
No ratings yet
Categorical Deep Learning Is An Algebraic Theory of All Architectures
32 pages
Gocheva-Ilieva S. Statistical Data Modeling... Applications 2021
No ratings yet
Gocheva-Ilieva S. Statistical Data Modeling... Applications 2021
186 pages
CPD-Structured Multivariate Polynomial Optimization
No ratings yet
CPD-Structured Multivariate Polynomial Optimization
24 pages
Interpretable Neural Additive Models
No ratings yet
Interpretable Neural Additive Models
23 pages
Neural Logic Networks For Interpretable Classification: Vincent Perreault
No ratings yet
Neural Logic Networks For Interpretable Classification: Vincent Perreault
49 pages
Explainable AI for Fault Diagnosis
No ratings yet
Explainable AI for Fault Diagnosis
23 pages
Book Main 2
No ratings yet
Book Main 2
304 pages
Resnet
No ratings yet
Resnet
25 pages
Interpreting Black-Box Machine Learning Models
No ratings yet
Interpreting Black-Box Machine Learning Models
10 pages
GrADE A Graph Based Data Driven Solver F
No ratings yet
GrADE A Graph Based Data Driven Solver F
20 pages
T M: A T D L P - E E: AB Dvancing Abular EEP Earning With Arameter Fficient Nsembling
No ratings yet
T M: A T D L P - E E: AB Dvancing Abular EEP Earning With Arameter Fficient Nsembling
39 pages
Diffusion Based Causal Representation Learning
No ratings yet
Diffusion Based Causal Representation Learning
17 pages
2021 Deep Polynomial Neural Networks
No ratings yet
2021 Deep Polynomial Neural Networks
14 pages
TDSC Choo 221
No ratings yet
TDSC Choo 221
12 pages
Revisiting Deep Learning Models For Tabular Data
No ratings yet
Revisiting Deep Learning Models For Tabular Data
12 pages
Transformer for Time Series Data
No ratings yet
Transformer for Time Series Data
22 pages
DL Tabular
No ratings yet
DL Tabular
43 pages
Som-Vae: I D R L T S: Nterpretable Iscrete Epresentation Earning On IME Eries
No ratings yet
Som-Vae: I D R L T S: Nterpretable Iscrete Epresentation Earning On IME Eries
18 pages
Bringing Comparative Cognition To Computers
No ratings yet
Bringing Comparative Cognition To Computers
38 pages
Deep Neural Networks and Tabular Data: A Survey
No ratings yet
Deep Neural Networks and Tabular Data: A Survey
22 pages
Deep Learning of Contagion Dynamics On Complex Networks: Article
No ratings yet
Deep Learning of Contagion Dynamics On Complex Networks: Article
11 pages
Using Image Conversion Techniques To Detect Adversarial Exa 2025 Applied Sof
No ratings yet
Using Image Conversion Techniques To Detect Adversarial Exa 2025 Applied Sof
13 pages
The Modern Mathematics of Deep Learning
No ratings yet
The Modern Mathematics of Deep Learning
78 pages
Categorical Deep Learning - An Algebraic Theory of Architectures
No ratings yet
Categorical Deep Learning - An Algebraic Theory of Architectures
29 pages
Semiparametric M-Estimation With Overparameterized Neural Networks
No ratings yet
Semiparametric M-Estimation With Overparameterized Neural Networks
33 pages
A Time Series Approach To Explainability For Neural Nets R5q9zkij
No ratings yet
A Time Series Approach To Explainability For Neural Nets R5q9zkij
28 pages
A Case Study in Interpretability
No ratings yet
A Case Study in Interpretability
23 pages
Entropy 23 00018 v2 39
No ratings yet
Entropy 23 00018 v2 39
1 page
Tesi
No ratings yet
Tesi
106 pages
Ergodic Network Stochastic Differential Equations
No ratings yet
Ergodic Network Stochastic Differential Equations
22 pages
Attention-Augmented LSTM-Autoencoder For Explainable Smart Grid Anomaly Detection
No ratings yet
Attention-Augmented LSTM-Autoencoder For Explainable Smart Grid Anomaly Detection
11 pages
Mambular: A Sequential Model For Tabular Deep Learning
No ratings yet
Mambular: A Sequential Model For Tabular Deep Learning
21 pages
UC Merced Previously Published Works
No ratings yet
UC Merced Previously Published Works
40 pages
Deep Learning Applications Review
No ratings yet
Deep Learning Applications Review
18 pages
Entropy 23 00018 v2 41
No ratings yet
Entropy 23 00018 v2 41
1 page
Scalable Rule-Based Classification Model
No ratings yet
Scalable Rule-Based Classification Model
17 pages
Cheatsheets For Deep Learning 1650192034
No ratings yet
Cheatsheets For Deep Learning 1650192034
95 pages
Deep Neural Networks and Tabular Data A Survey
No ratings yet
Deep Neural Networks and Tabular Data A Survey
21 pages
Journal of Computational Physics: Zichao Long, Yiping Lu, Bin Dong
No ratings yet
Journal of Computational Physics: Zichao Long, Yiping Lu, Bin Dong
17 pages
A Universal Framework
No ratings yet
A Universal Framework
13 pages
TabNet: Interpretable Deep Learning for Tabular Data
No ratings yet
TabNet: Interpretable Deep Learning for Tabular Data
12 pages
Sensors 21 05888 v2
No ratings yet
Sensors 21 05888 v2
30 pages
Graph Deep Learning For Time Series Forecasting: Andrea Cini, Ivan Marisca, Daniele Zambon, and Cesare Alippi
No ratings yet
Graph Deep Learning For Time Series Forecasting: Andrea Cini, Ivan Marisca, Daniele Zambon, and Cesare Alippi
42 pages
Deep Learning A Tutorial
No ratings yet
Deep Learning A Tutorial
16 pages
pde 微分方程与神经网络
No ratings yet
pde 微分方程与神经网络
16 pages
LocalGLMnet: A Deep Learning Architecture For Actuaries
No ratings yet
LocalGLMnet: A Deep Learning Architecture For Actuaries
35 pages
Deep Learning
No ratings yet
Deep Learning
50 pages
New Bridges Between Deep Learning and Partial Differential Equations
No ratings yet
New Bridges Between Deep Learning and Partial Differential Equations
5 pages
A Selective Overview of Deep Learning: Jianqing Fan Cong Ma Yiqiao Zhong April 16, 2019
No ratings yet
A Selective Overview of Deep Learning: Jianqing Fan Cong Ma Yiqiao Zhong April 16, 2019
37 pages
Accurate Predictions On Small Data With A Tabular Foundation Model
No ratings yet
Accurate Predictions On Small Data With A Tabular Foundation Model
23 pages
OMINewsLetter November 2024
No ratings yet
OMINewsLetter November 2024
5 pages
2023 Review A Brief Tour of Deep Learning From A Statistical Perspective
No ratings yet
2023 Review A Brief Tour of Deep Learning From A Statistical Perspective
28 pages
Analysis of Conventional Feature Learning
No ratings yet
Analysis of Conventional Feature Learning
12 pages
Deconstructing Models and Methods in Deep Learning - Pavel Izmailov Thesis
No ratings yet
Deconstructing Models and Methods in Deep Learning - Pavel Izmailov Thesis
395 pages
Method: Research
No ratings yet
Method: Research
28 pages
Pde-N 2.0: L PDE D A N - S H D N: ET Earning S From Ata With Umeric Ymbolic Ybrid EEP Etwork
No ratings yet
Pde-N 2.0: L PDE D A N - S H D N: ET Earning S From Ata With Umeric Ymbolic Ybrid EEP Etwork
18 pages
Mathematics of Deep Learning Lecture Notes
No ratings yet
Mathematics of Deep Learning Lecture Notes
58 pages
Interpretability Needs A New Paradigm
No ratings yet
Interpretability Needs A New Paradigm
16 pages
A Theoretical Framework For AI Models
No ratings yet
A Theoretical Framework For AI Models
9 pages
A Multimodal Automated Interpretability Agent
No ratings yet
A Multimodal Automated Interpretability Agent
29 pages
Gradient Based Feature Attribution in Explainable AI
No ratings yet
Gradient Based Feature Attribution in Explainable AI
25 pages
Deep Learned Recurrent Type-3 Fuzzy System
No ratings yet
Deep Learned Recurrent Type-3 Fuzzy System
13 pages
An Interval Type-3 Fuzzy System and A New Online Fractional-Order Learning Algorithm Theory and Practice
100% (1)
An Interval Type-3 Fuzzy System and A New Online Fractional-Order Learning Algorithm Theory and Practice
11 pages
Fuzzy Subsethood For Fuzzy Sets of Type-2 and Generalized Type - N
No ratings yet
Fuzzy Subsethood For Fuzzy Sets of Type-2 and Generalized Type - N
11 pages
A Novel Fractional Order Multiple Model
No ratings yet
A Novel Fractional Order Multiple Model
19 pages
Reconstructing Odes From Time Series Data
No ratings yet
Reconstructing Odes From Time Series Data
15 pages
Lecture 10-Print
No ratings yet
Lecture 10-Print
50 pages
Recurrence Relations Guide
No ratings yet
Recurrence Relations Guide
14 pages
Regression and Correlation
No ratings yet
Regression and Correlation
22 pages
ANN MINI Project
No ratings yet
ANN MINI Project
8 pages
Comment On "The One-Dimensional Harmonic Oscillator PDF
No ratings yet
Comment On "The One-Dimensional Harmonic Oscillator PDF
2 pages
9.3 Worksheet.
No ratings yet
9.3 Worksheet.
3 pages
R09 ADVANCED CONTROL SYSTEMSfr 8736 PDF
No ratings yet
R09 ADVANCED CONTROL SYSTEMSfr 8736 PDF
2 pages
How Skewed Are Simultaneously Long-Short Trading Gains?: Michael Heinrich Baumann
No ratings yet
How Skewed Are Simultaneously Long-Short Trading Gains?: Michael Heinrich Baumann
6 pages
Eng M 540
No ratings yet
Eng M 540
60 pages
Graphical Method Practice Problems: Subject To
No ratings yet
Graphical Method Practice Problems: Subject To
5 pages
Fuzzy Logic Control For A Ball and Beam
No ratings yet
Fuzzy Logic Control For A Ball and Beam
10 pages
Euler's Method Simpson's Rule: X X DX
No ratings yet
Euler's Method Simpson's Rule: X X DX
1 page
Modeling of Discrete-Time Control Systems: 1 Transfer Functions
No ratings yet
Modeling of Discrete-Time Control Systems: 1 Transfer Functions
6 pages
Data Structure & Algorithm Lab Syllabus
No ratings yet
Data Structure & Algorithm Lab Syllabus
6 pages
Gauss Legendre 2 & 3 Point
No ratings yet
Gauss Legendre 2 & 3 Point
12 pages
Tabu Search-Assignment IM-49202
No ratings yet
Tabu Search-Assignment IM-49202
3 pages
Deep Learning for LD50 Prediction
No ratings yet
Deep Learning for LD50 Prediction
1 page
Cryptography Techniques for BCA Students
No ratings yet
Cryptography Techniques for BCA Students
18 pages
CPU Scheduling & Algorithms Guide
No ratings yet
CPU Scheduling & Algorithms Guide
39 pages
19 Accounting For Variance and HPO in Machine Learning Benchmarks
No ratings yet
19 Accounting For Variance and HPO in Machine Learning Benchmarks
131 pages
INTRODUCTION TO SOFT COMPUTING NEURO-FUZZY AND GENETIC ALGORITHMS by Samir Roy & Udit Chakraborthy PDF
100% (1)
INTRODUCTION TO SOFT COMPUTING NEURO-FUZZY AND GENETIC ALGORITHMS by Samir Roy & Udit Chakraborthy PDF
564 pages
Math for Chemistry Students
No ratings yet
Math for Chemistry Students
2 pages
Numerical Methods for Undergrads
No ratings yet
Numerical Methods for Undergrads
40 pages
Comparative Analysis of Deep Learning Based Afaan Oromo Hate Speech Detection
No ratings yet
Comparative Analysis of Deep Learning Based Afaan Oromo Hate Speech Detection
13 pages
Assignment5 Cse330 Summer 2023
No ratings yet
Assignment5 Cse330 Summer 2023
1 page
NM Lab789
No ratings yet
NM Lab789
3 pages
VGG Architecture: Deep CNN Models
No ratings yet
VGG Architecture: Deep CNN Models
3 pages
Machine Learning For Algorithmic Trading
36% (11)
Machine Learning For Algorithmic Trading
13 pages
Data Warehousing Notes Aktu
50% (2)
Data Warehousing Notes Aktu
10 pages

LocalGLMnet Interpretable Deep Learning For Tabular Data

Uploaded by

LocalGLMnet Interpretable Deep Learning For Tabular Data

Uploaded by

Scandinavian Actuarial Journal

ISSN: (Print) (Online) Journal homepage: www.tandfonline.com/journals/sact20

LocalGLMnet: interpretable deep learning for

Ronald Richman & Mario V. Wüthrich

To link to this article: https://doi.org/10.1080/03461238.2022.2081816

© 2022 The Author(s). Published by Informa

Published online: 08 Jun 2022.

Submit your article to this journal

Article views: 5739

View related articles

View Crossmark data

Citing articles: 7 View citing articles

Full Terms & Conditions of access and use can be found at

LocalGLMnet: interpretable deep learning for tabular data

ABSTRACT ARTICLE HISTORY

CONTACT Mario V. Wüthrich mario.wuethrich@math.ethz.ch

1.1. Organization of this manuscript

(w.r.t. to a σ -finite measure on R)

2.2. Fully-connected feed-forward neural network

z (m) : Rqm−1 → Rqm

having neurons zj(m) (x), 1 ≤ j ≤ qm , for variables x = (x1 , . . . , xqm−1 ) ∈ Rqm−1 ,

for given network weights w(m)

x → g(μ) = g (μ(x)) = β0 + β, z (d:1) (x) , (5)

2.3. Local generalized linear model network

2.4. Interpretation and extension of LocalGLMnet

βj (x)xj = xj , (10)

L = {(Yi , xi ); 1 ≤ i ≤ n} and T = {(Yt , xt ); n + 1 ≤ t ≤ 2n}.

Table 1. In-sample and out-of-sample MSEs in the synthetic data example.

x → μ(x) = η(x) = β0 + β, x , (13)

3.2. Variable selection

b̄7 = −0.0068 ≈ 0 and 

To analyze these gradients, we fit a spline to these observations by regressing

j (xi ), 1 ≤ j, k ≤ 6, over all instances i = 1, . . . , n.

2 (x) ≈ const which says that we have a quadratic

3.4. Real data example

3.5. Variable importance

Figure 9. Variable importance VIj , 1 ≤ j ≤ q.

3.6. Categorical feature components

Listing 2. Extraction of the estimated weights 

Further analysis of the real data example

B.1 Variability of variable selection method

j (x) and code for spline ﬁt.

Table B1. Sample averages and standard deviations of

B.2 Comparison to partial dependence plots

You might also like

x → g(μ) = g (μ(x)) = β0 + β, z (d:1) (x) , (5)

βj (x)xj = xj , (10)

x → μ(x) = η(x) = β0 + β, x , (13)

b̄7 = −0.0068 ≈ 0 and

j (xi ), 1 ≤ j, k ≤ 6, over all instances i = 1, . . . , n.

2 (x) ≈ const which says that we have a quadratic

Listing 2. Extraction of the estimated weights

j (x) and code for spline ﬁt.

Table B1. Sample averages and standard deviations of