Book Pytorch Scikit Learn Numpy
List of Figures v
Preface vii
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Notion of sample and log-likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Random sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Loglikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.4 Bias and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Numerical example with a nonlinear curve . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Polynomial regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Numerical illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Glm as neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Why learn pytorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.6 Indicators with sklearn . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.3 Multinomial regression (3 or more classes) . . . . . . . . . . . . . . . . . . . . . . 53
2.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.5.1 How to improve the learning . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.5.2 Grid search for optimal hyperparameters . . . . . . . . . . . . . . . . . . 118
5.5.3 Bayesian search for optimal hyperparameters . . . . . . . . . . . . . . . . 129
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8 Autoencoder compared to ipca and t-sne 187
8.1 Three pca methods and t-sne for dimension reduction . . . . . . . . . . . . . . . . 188
8.1.1 The pca method in brief . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.1.2 Implementations of pca with sklearn . . . . . . . . . . . . . . . . . . . . . 197
8.1.3 Implementation of t-sne with sklearn . . . . . . . . . . . . . . . . . . . . 203
8.1.4 Quality indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.2 Autoencoders with pytorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
8.2.1 Definition and link with pca . . . . . . . . . . . . . . . . . . . . . . . . . 205
8.2.2 Visualization of artificial data . . . . . . . . . . . . . . . . . . . . . . . . 207
8.2.3 Autoencoder for 60000 images . . . . . . . . . . . . . . . . . . . . . . . . 214
8.3 Low dimensional reduction via t-sne . . . . . . . . . . . . . . . . . . . . . . . . . 225
8.3.1 Reduction with a random projection . . . . . . . . . . . . . . . . . . . . . 225
8.3.2 Projection t-sne after pca . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
8.3.3 Projection t-sne after rp+pca . . . . . . . . . . . . . . . . . . . . . . . . . 232
8.3.4 Comparison of the visualizations . . . . . . . . . . . . . . . . . . . . . . . 234
8.4 Autoencoder associated with a glm . . . . . . . . . . . . . . . . . . . . . . . . . . 235
8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
List of Figures
7.1 Barplot for the target variable from Poisson regression . . . . . . . . . . . . . . . 164
7.2 Losses per epoch for the first/second-order algorithms . . . . . . . . . . . . . . . . 174
8.7 Visualization from nonlinear autoencoder for 9 classes . . . . . . . . . . . . . . . 213
8.8 Visualization of two first ipca components for 60000 images . . . . . . . . . . . . 218
8.9 Visualization from nonlinear autoencoder for 60000 images . . . . . . . . . . . . . 224
8.10 Visualization from t-sne after ipca for 60000 images . . . . . . . . . . . . . . . . . 231
8.11 Visualization from t-sne after rp+ipca for 60000 images . . . . . . . . . . . . . . . 234
Preface
This book is an introduction, with examples, to the python module called "pytorch" for models of
regression and classification. The other existing modules for machine learning considered herein
are "Scikit-Learn" (also called "sklearn") in a dedicated chapter and "statsmodels" for variance
estimation in another dedicated chapter. The module "sklearn" is often involved in machine
learning projects because of its intuitive and powerful processing of the data, with many methods
implemented, often with multicore parallelism. The python module for the graphical outputs is
"matplotlib", which allows drawing curves, lines and scatterplots, plus legends and named axes.
When the models are not kept linear, the extensions of the models with nonlinearities come from
hidden layer(s) within a neural network (nn) framework. "Deep learning" makes it possible to train
these nonlinear models despite a high dimensionality and numerous parameters.
NN and GLM
Herein, we are interested in the neural networks called "feedforward neural networks" (ffnn) or
"multi-layer perceptrons" (mlp). The name mlp has different meanings from one author to another
in the literature, because historically the perceptron is for binary classification and not regression.
These networks are fully connected neural networks, hence with fully-connected layers. All the
possible connections between the nodes of one layer and the nodes of the next layer enter the models.
This also means that the network receives an input "x" at the first layer and produces an output -a
prediction approximating the true "y"- at the last layer, after a cascade of nonlinear transformations.
The glm are a generalization of many models of classification and regression. They aim at
predicting a variable "y" from the knowledge of variables "x", with a unique function with a
high-level parameterization. Herein, we are interested only in particular cases with python, because
dealing with this very general unique function is not practical; nevertheless, the general case is also
discussed. These neural networks allow extending glm in order to improve the training and the
modeling. The questions are how to define the network and how to train the parameters in order to
improve the inference for linear models and reach a relevant prediction.
Training algorithms
The interest in implementing the training algorithms is to understand the mechanisms involved
and to improve some current implementations and models. Thus, the book provides fully
working python code built on top of the module pytorch and which is completely flexible. Contrary
to the usual python modules for machine learning, where the models are already fully
implemented, the process is different with pytorch: one needs to define and train the parameters for
each model, or build a general implementation which is able to handle several models, as proposed
herein. One can use existing programs or existing high-level libraries, but learning asks for coding.
An introduction to, and an implementation of, several algorithms for the estimation of the
parameters of deep glm is proposed. The linear versions are particular cases and the foundation of these
models; they are involved in most of the chapters herein. The algorithms are implemented with
numpy and pytorch for several of them, while some numerical results come directly from the
existing modules that we use for the computations.
The number of datasets processed with python herein is near 20, with a complementary list of some
other datasets from repositories. The smallest dataset has 10 rows, for pedagogical reasons, while the
larger ones are about 500000 observations from forestry for one, and 11000000 from physics for a
second one. The usual dataset "mnist" is included, with 60000 images, as a usual benchmark in
deep learning.
Herein, image datasets are most of the time converted into this same "tabular form", with a unique
data file where each row corresponds to an image. The file formats for large tables are also discussed
with several examples and exercises. One of the purposes of the document is to be able to handle
those large datasets with "pytorch" (and "sklearn") with a limited available quantity of computer
memory for training a linear or deep glm. Preprocessing should take only a few seconds, or at most
a few minutes for millions of rows herein. Obviously, having as much computer memory as possible
is probably a wise advice, but this is not always possible for cost reasons or for the largest datasets.
Several datasets and python codes are expected to be available for download at the url
"github.com/rpriam/book1" a few days after the book appears at the publishing platform
website. Otherwise, the reader may consult the page of the publishing platform for the book for
any information, or may directly send a message to "rpriam@gmail.com" for any request
about the datasets and the companion programs.
It has been decided not to present the backpropagation in this volume, but first-order and second-
order derivatives are discussed for the linear glm. The purpose is to get rid of most of the algebra
and be happy with the automatic computation of the values of the derivatives of any loss function,
thanks to pytorch. The bayesian models have also not been included and are left for a second
volume with more advanced models. Neural networks specialized for text and image processing,
called recurrent, convolution or transformer networks, are not discussed herein either, but the last
exercise of the book gives an opening to learn about them as a perspective.
The document is divided into 9 chapters, including a chapter for the introduction and a chapter for
the full correction of selected exercises (among the nearly 60 proposed at the ends of the chapters)
with python codes and several additional large datasets.
The exercises allow checking whether the chapter is well understood and allow going beyond the
contents of the course. Complementary content to this book is expected to be proposed as a second
volume with more advanced methods and possibly more analytical content.
The computer used for writing the document has a quite old cpu with only 4 cores and a small
gpu (2 GB), but with some supplementary memory (8 GB). In the unlikely case of a memory problem,
if freeing the computer memory is not enough, the reader may reduce the size of the larger datasets by
working only on a sample or by adding the missing loop with chunks. But this should not be required
most of the time, as this was already handled and taken care of. Linux was the operating system
for running the python code, and jupyter notebooks, one per chapter, were initially converted
into the text format ".tex" before final updates. This supposes installing all the python packages
and libraries (called herein "modules") required in order to access their functionalities. Another
solution is jupyter notebooks with google colab, for cloud computing or for local computations.
Chapter 1
Introduction
1.1 Overview
In the following chapters we are interested in the estimation of statistical models related to neural
networks. The purpose is to predict a variable yi called "target", "outcome", "of interest" or
"dependent" with the knowledge of a vector xi aggregating a set of p variables {xi1 , xi2 , . . . , xip },
possibly large, called "predictive", "independent" or "explicative", and otherwise "predictors". One
supposes that there exists a relation between x and y, say a function g(·), such that:
yi = g(xi1 , xi2 , . . . , xip ) .
The most usual models for this problem are divided into two different cases, one for the prediction
of a numeric value and one for the prediction of a categorical value. They are named:
- "regression models" when y is a scalar (or sometimes a vector) with its values continuous
(real) or discrete (integer ordered). They are called "linear regression" for a continuous
outcome, and "simple" for one predictor or "multiple" for several ones.
- "classification models" when y is a category or class label. The category is usually coded
with an integer, not ordered, just as an index. Thus, they are called¹ "logistic regression" or
"multinomial regression" for two classes or for more than two classes, respectively.
These models depend on unknown parameters θ , because they need to adapt to a sample. Each
dataset and related population is different, such that the prediction changes. Instead of changing the
function, only the parameters need to change. The function is well chosen and general enough to
be relevant for many different datasets or populations, each one with a different parameter.
This is written as follows:
yi = f (xi1 , xi2 , . . . , xip ; θ ) .
For each dataset or population, the parameters in the vector θ need to be found, or say "inferred",
"estimated", or "learned", while the model is said to be "fitted" or "trained",
θ =? .
¹ Thus, both modeling cases lead to the prediction of a numerical value with related models. Herein all the models
have the same foundation, as (extended) members of the "generalized linear models" most of the time.
1.2 Notion of sample and log-likelihood
1.2.1 Random sample
The population is supposed infinite, or large enough to be supposed infinite. Thus a datum is here
of the form:
(yi , xi ) ,
with, for continuous (q ≥ 1), binary (q = 2), multinomial (q > 2) or discrete (q = ∞) regressions, yi respectively in R^q , {1, 2}, {1, . . . , q} or N, and xi ∈ R^p .
Because only a sample is available, and not the whole population or even a large enough sample,
there exists a (possibly small or large) error in θ̂ = θ̂ (sn ), the value of θ found from the sample.
This is not the true parameter θ ∗ for the whole population, which is unknown, hence there is an
algebraic and numerical difference between the available value and the true value.
1.2.2 Loglikelihood
The "probabilistic distribution" of the sample comes from a member of the family of "generalized
linear models" such that there exists a known parametric form for:
f (yi ; xi , θ ) .
Thus, the "loglikelihood" is defined as the log of the product of the n density functions or
probability mass functions above, one per observation, such that,
$$\log L(\theta) = \sum_{i=1}^{n} \log f(y_i; x_i, \theta) \,.$$
The derivatives of the loglikelihood w.r.t. θ allow writing an algorithm in order to find an
estimation θ̂ of the parameters, given a sample of n observations.
1.2.3 Optimization
The search for the parameters θ is summarized as follows:
$$\hat{\theta} = \arg\max_{\theta} \log L(\theta) \,.$$
This notation says that the vector θ̂ on the left is the solution which maximizes the function on the
right. This function depends on the vector θ , which can usually take an uncountably infinite number
of different values, and we are looking for the one which makes the function maximal.
In neural networks, this is a loss function, thus the optimization looks for a minimum, for
instance of " − log L(θ̂ ) " here. Sometimes, the loss function is instead just a distance between the
true target values yi and their approximations ŷi from the network, as explained later.
There are many algorithms for this task because the problem is different if the dataset is large or
small, if the variable space is high or low dimensional, and for other computational
constraints. It is not rare to have different algorithms for different models, as in computational
statistics for bayesian settings for instance. But for neural networks, the algorithms are often more
generic, because the derivatives are not written in closed form and are often estimated numerically.
These derivatives are the workhorse of most of the training algorithms, because the nonlinear
function is approximated locally with a linear or quadratic function before the optimization. The
approximation comes from these derivatives, as in any usual Taylor series. For some rare cases, the
solution θ̂ is a closed-form expression, exact without an algorithm, such as for the linear regression.
For the simple linear regression with one predictor, the loss is the mean of the squared errors:
$$\ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \beta_0 - \beta_1 x_{i1} \big)^2 \,.$$
The difference with a distribution -and its related loglikelihood function- is that the loss function
is just a measure of the difference between the true targets and the approximated targets as found
from a model, while the distribution has a probabilistic interpretation.
Here, yi is the true target variable, while the prediction from the model is ŷi = β0 + β1 xi1 . In the
literature this notation is also sometimes used after training, such that ŷi = β̂0 + β̂1 xi1 , thus it has to
be understood according to the context. A way to be clear would be to write ŷi (θ ) and ŷi (θ̂ )
respectively, without an additional symbol for the target variable.
The derivatives are found with usual algebra as follows.
$$\frac{\partial \ell(\theta)}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \Big[ \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \beta_0 - \beta_1 x_{i1} \big)^2 \Big]
\qquad
\frac{\partial \ell(\theta)}{\partial \beta_1} = \frac{\partial}{\partial \beta_1} \Big[ \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \beta_0 - \beta_1 x_{i1} \big)^2 \Big] \,.$$
Setting these derivatives to zero at the optimum gives:
$$0 = \frac{1}{n} \sum_{i=1}^{n} -\big( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} \big)
\qquad
0 = \frac{1}{n} \sum_{i=1}^{n} -\big( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} \big) x_{i1} \,.$$
Instead of the more elementary way of expressing one coefficient as a function of the other from
the first equation and substituting it into the second equation, in order to find the expression of one
coefficient and then of the second one, the problem is simply written in matrix form for its
resolution:
$$\begin{pmatrix} 1 & \frac{1}{n}\sum_i x_{i1} \\ \frac{1}{n}\sum_i x_{i1} & \frac{1}{n}\sum_i x_{i1}^2 \end{pmatrix}
\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
=
\begin{pmatrix} \frac{1}{n}\sum_i y_i \\ \frac{1}{n}\sum_i y_i x_{i1} \end{pmatrix}$$
$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
=
\begin{pmatrix} 1 & \frac{1}{n}\sum_i x_{i1} \\ \frac{1}{n}\sum_i x_{i1} & \frac{1}{n}\sum_i x_{i1}^2 \end{pmatrix}^{-1}
\begin{pmatrix} \frac{1}{n}\sum_i y_i \\ \frac{1}{n}\sum_i y_i x_{i1} \end{pmatrix}$$
$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
=
\frac{1}{\frac{1}{n}\sum_i x_{i1}^2 - \big(\frac{1}{n}\sum_i x_{i1}\big)^2}
\begin{pmatrix} \frac{1}{n}\sum_i x_{i1}^2 & -\frac{1}{n}\sum_i x_{i1} \\ -\frac{1}{n}\sum_i x_{i1} & 1 \end{pmatrix}
\begin{pmatrix} \frac{1}{n}\sum_i y_i \\ \frac{1}{n}\sum_i y_i x_{i1} \end{pmatrix}$$
$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
=
\frac{1}{\mathrm{var}(x_1)}
\begin{pmatrix} \overline{x_1^2} & -\overline{x_1} \\ -\overline{x_1} & 1 \end{pmatrix}
\begin{pmatrix} \overline{y} \\ \overline{y x_1} \end{pmatrix} \,.$$
The usual linear algebra leads to the known analytical solution, with an inverse of the matrix
which is either analytical for p = 2 (as just above) or p = 3, or numerical (via a computer
program) for larger values of p. When the function is nonlinear, the problem is different because a
closed-form solution is not available anymore. Another issue is for large p and large n, where the
matrix involved is so large that the numerical algorithm is too slow or, worse, the matrix does not
fit in the computer memory. In both cases, alternative algorithms allow getting a solution directly
from the expression of the derivatives above in order to find the unknown parameters (see next
chapters).
$$\begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{pmatrix}
=
\begin{pmatrix}
1 & x_1 & \cdots & x_1^{\ell} \\
1 & x_2 & \cdots & x_2^{\ell} \\
\vdots & \vdots & & \vdots \\
1 & x_n & \cdots & x_n^{\ell}
\end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_{\ell} \end{pmatrix} \,.$$
Note that the simple linear regression is also written with this form, with ℓ = 1, because the
previous result is retrieved from the product with the matrix which contains the variable x and is
called the design matrix.
The estimation ŷi approximates yi with the help of the knowledge of xi , and depends on the
parameters wk , such that a comparison between the vector of the yi and the vector of the ŷi is the
Euclidean distance, or its mean. The following criterion is thus optimized.
• Cost function: with θ = (wk ), it is obtained by comparing the true target with the approximated
output.
$$\ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big[ y_i - w_0 - w^T g_{\ell}(x_i) \big]^2 \,.$$
$$\hat{\theta} = \arg\min_{\theta} \ell(\theta)
= \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \big[ y_i - \hat{y}_i(\theta) \big]^2
= \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \big[ y_i - w_0 - w^T g_{\ell}(x_i) \big]^2 \,.$$
The solution is known via a least squares or regression problem and has an analytical solution
(when there are no problems such as collinearities). Note that a usual trick is to extend the vector
xi by adding the component 1 as its first component, so that the "intercept" (also called "bias" in
the computing literature: understood as a bias term for the model without this additive correction)
w0 enters the vector w for a lighter notation, with "[w0 , wT ][1, gℓ (xi )T ]T ".
$$\ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big[ y_i - g(x_i, \theta) \big]^2 \,.$$
The nonlinear model is approximated with a model which is linear in the polynomial terms, but
this has some limits for more complex functions which are unknown and not polynomial.
In a neural network, the nonlinearity is modeled with layers which are linear transformations asso-
ciated to a nonlinear function.
The neural network with one hidden layer, instead of a polynomial transformation, applies a linear
transformation plus a nonlinear one. The nonlinear transformation is called the "activation function",
in the sense that each component in the hidden layer is a neuron, and it is activated or not in the sum
thanks to the activation function, which often has the shape of an "S curve": a smooth switch between
zero and one. This is not discussed further herein; the idea is biological, with signals which are
propagated in the brain via neurons which are activated or not according to the received amount of
signal. Formally, this is just written as follows:
• From unidimensional input xi to multidimensional layer:
$$x_i \;\rightarrow\; x_i^T W^{(1)} = \begin{pmatrix} x_i w_{11} \\ x_i w_{12} \\ x_i w_{13} \\ \vdots \\ x_i w_{1\ell} \end{pmatrix} \in \mathbb{R}^{\ell} \,.$$
A nonlinear function σ (·), such as the sigmoid or tanh, is applied to each component,
$$\sigma(x_i^T W^{(1)}) = \begin{pmatrix} \sigma(x_i w_{11}) \\ \sigma(x_i w_{12}) \\ \sigma(x_i w_{13}) \\ \vdots \\ \sigma(x_i w_{1\ell}) \end{pmatrix} \in \mathbb{R}^{\ell} \,.$$
• From multidimensional layer to unidimensional output ŷi . Another set of weights is required
in order to retrieve the dimensionality of yi , such that one just writes:
$$\hat{y}_i = g(x_i; \theta) = w_{20} + \sigma(x_i^T W^{(1)})^T W^{(2)} \,,$$
or directly,
$$\hat{y}_i = g(x_i; \theta) = \sigma\big( w_{20} + \sigma(x_i^T W^{(1)})^T W^{(2)} \big) \,.$$
The latter case would be for a target variable in the range [0, 1], because the additional term w20
would make the inference more cumbersome if left out from the second activation function, while
the former case would be for a target variable on the real line with continuous values. The
case without a second activation function is just more linear and may be enough for fitting some
simple curves. There exist theorems which prove that neural networks are able to approximate any
nonlinear function, but in practice it is not always easy to find relevant values for their weights.
Here w20 is not called "intercept" but "bias" in the literature for neural networks, while,
$$\sigma(x_i^T W^{(1)})^T W^{(2)} = w_{21}\,\sigma(x_i w_{11}) + w_{22}\,\sigma(x_i w_{12}) + w_{23}\,\sigma(x_i w_{13}) + \cdots + w_{2\ell}\,\sigma(x_i w_{1\ell}) \,.$$
• Cost function:
$$\ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big[ y_i - w_{20} - \sigma(x_i^T W^{(1)})^T W^{(2)} \big]^2 \,.$$
• Minimization w.r.t. θ :
$$\hat{\theta} = \arg\min_{\theta} \ell(\theta)
= \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \big[ y_i - \hat{y}_i(\theta) \big]^2
= \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \big[ y_i - w_{20} - \sigma(x_i^T W^{(1)})^T W^{(2)} \big]^2 \,.$$
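As a complement, the forward pass of this one-hidden-layer network can be sketched in a few lines of numpy; the weight values, the hidden size of 5 and the helper names sigmoid and g are illustrative assumptions, not values from the book.

import numpy as np

def sigmoid(z):
    # "S curve" activation, a smooth switch between zero and one
    return 1.0 / (1.0 + np.exp(-z))

def g(x, W1, W2, w20):
    # x: scalar input, W1 and W2: weight vectors of length l, w20: output bias
    hidden = sigmoid(x * W1)      # hidden layer, one activated neuron per component
    return w20 + hidden @ W2      # linear combination of the hidden neurons

rng = np.random.default_rng(0)
W1, W2, w20 = rng.normal(size=5), rng.normal(size=5), 0.0
print(g(1.5, W1, W2, w20))        # prediction for an arbitrary input value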
Generally, xi is not one dimensional but p dimensional with ℓ ≪ p, instead of having ℓ larger
than the dimensionality of xi as here, because the neural network reduces the dimensionality in order
to add nonlinearities, as the real space for the data is generally smaller than the given one. There is a
first transformation associated to the layer, linear and nonlinear, followed by a linear transformation,
and possibly a second nonlinear transformation, not considered here. Note that on the contrary,
for a small dimensional xi , a hidden layer has more neurons than p in order to increase the
dimensionality of the nonlinear hidden space, ℓ > p, and get a good prediction for the target variable.
This replaces the polynomial transformation with an order which is not explicit, as the order is
automatically chosen from the found weights. In theory, one hidden layer is probably enough, but
two or three are not rare in practice.
In the literature, it was shown that such functions are able to retrieve the true function g() for a
well-chosen set of weights and activation transformations: this is more general than the polynomial
regression, again with weights but also with nonlinear transformations. There is a difference between
knowing the function, as in inference for the nonlinear regression, and estimating its parameters,
versus not knowing the form of the function in neural networks and estimating the weights in order to
retrieve the function: g(·, θ̂ ) versus ĝ(·). A consequence is that more parameters are to be fitted in
neural networks; this induces more problems of generalization, not only because the real function
is unknown but also because of the numerous unknown parameters. This is more an estimation of
a function than an estimation of parameters, even if the weights are estimated like usual parameters.
Let us check the state of the computer memory. Some space should remain available, otherwise the
operating system risks freezing. For neural networks, it is a good habit to check the computer memory.
The code for this document was written on a laptop with more than 6 GB of memory, thus
it should run very well on most recent architectures and even on older computers with
enough computer memory.
For instance, psutil may be considered here.
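A minimal sketch of such a check, assuming the psutil module is installed, could be:

import psutil

# query the current state of the system memory
mem = psutil.virtual_memory()
print("available memory (GB):", round(mem.available / 1024**3, 2))
print("used memory (%):", mem.percent)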
The dataset was stored in txt files in order to get the same data at each run and to make the
sample available.
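A possible way to reload such a stored sample is sketched below; the file names are illustrative assumptions.

import numpy as np

# reload the train and test subsamples saved as plain text files
x_train = np.loadtxt("x_train.txt")
y_train = np.loadtxt("y_train.txt")
x_test = np.loadtxt("x_test.txt")
y_test = np.loadtxt("y_test.txt")
n_train, n_test = len(x_train), len(x_test)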
The data sample was split into two subsamples, one for fitting the model and one for checking the
obtained model with new data. This is what happens with real data when one wants to predict the
unknown target variables from new values of the explanatory variables; this is the ultimate purpose
of the modeling. Actually, in deep learning the purpose is more often "prediction" than "explanation",
because the nonlinearities make the explanation cumbersome, such that neural networks were often
called "black boxes" in the past, even if now they are more often called "artificial intelligence" or just
related to "deep learning", which is actually a big change. There is even some direction of current
research aiming to make them explainable, which is completely the opposite of the former naming.
For the moment, we stay in the linear case; let us draw the real curve and the available sample points.
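A sketch of such a plot is given below, reusing the arrays loaded above; the function f_true standing for the real curve is a hypothetical placeholder, since the exact generating curve is defined with the stored data.

import numpy as np
import matplotlib.pyplot as plt

def f_true(x):
    # hypothetical stand-in for the real nonlinear curve used to generate the data
    return np.cos(2 * x) + 0.5 * x

x_grid = np.linspace(x_train.min(), x_train.max(), 200)
plt.plot(x_grid, f_true(x_grid), "k-", label="true curve")
plt.scatter(x_train, y_train, c="b", label="train sample")
plt.scatter(x_test, y_test, c="r", label="test sample")
plt.xlabel("x"); plt.ylabel("y"); plt.legend()
plt.show()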
Note that here the distribution of the xi was not taken uniform, hence the reader may use their own
example of data for further understanding the polynomial fitting and the associated sample error.
The fitting for the simple regression leads to the following result. The solution is written with the
available values in the sample.
With a precision of four digits, the means $\overline{x}_{\text{train}}$, $\overline{x^2}_{\text{train}}$, $\overline{y}_{\text{train}}$ and $\overline{y x}_{\text{train}}$ are respectively −0.5628, 1.1940, 8.0168 and −11.7599, which gives:
$$\begin{pmatrix} 1 & -0.5628 \\ -0.5628 & 1.1940 \end{pmatrix}
\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
=
\begin{pmatrix} 8.0168 \\ -11.7599 \end{pmatrix}$$
$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
=
\begin{pmatrix} 1 & -0.5628 \\ -0.5628 & 1.1940 \end{pmatrix}^{-1}
\begin{pmatrix} 8.0168 \\ -11.7599 \end{pmatrix}
=
\begin{pmatrix} 1.3610 & 0.6415 \\ 0.6415 & 1.1399 \end{pmatrix}
\begin{pmatrix} 8.0168 \\ -11.7599 \end{pmatrix}
=
\begin{pmatrix} 3.3672 \\ -8.2620 \end{pmatrix} \,.$$
It is a good exercise to find the linear system (see also the next chapter) with paper and pencil,
before writing this solution. With python code, first the matrix is defined, followed by its inverse
and its product with the column vector, as follows.
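A sketch of this computation with the sample means reported above could be (the variable names are illustrative):

import numpy as np

A = np.array([[1.0, -0.5628],
              [-0.5628, 1.1940]])
b = np.array([8.0168, -11.7599])

A_inv = np.linalg.inv(A)          # analytical inverse of the 2x2 matrix
beta_hat = A_inv @ b
print(beta_hat)                   # approximately [ 3.3672, -8.2620 ]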
Note that for a larger number of variables, a more robust solution is to solve the linear system
Aβ = b instead of taking the inverse of A; this is also available in the module numpy with a native
function.
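For instance, with the same A and b as above, a sketch of this alternative is:

# solving the linear system directly, generally more stable than forming the inverse
beta_hat = np.linalg.solve(A, b)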
Thus, β̂ is found to be approximately (3.3672, −8.2620)ᵀ, in agreement with the analytical
computation above.
The estimations from the simple linear model are then obtained as ŷi = β̂0 + β̂1 xi1 .
Thus with python, one gets the prediction for the test sample as follows,
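A sketch of this prediction step for the test sample, reusing the names above, could be:

# predictions of the fitted simple linear model on the test sample
y_test_hat = beta_hat[0] + beta_hat[1] * x_test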
A measure of the error from the model is the value of the loss function, here the mean square error
(mse), which is computed as follows,
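A sketch of this computation, with a hypothetical helper name f_mse and numpy imported as above, could be:

def f_mse(y, y_hat):
    # mean squared error between true and predicted values
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

print("mse (test):", f_mse(y_test, y_test_hat))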
Let us solve this problem directly with a numerical algorithm implemented in the module
"scipy.optimize" in order to find the unknown parameters, as follows.
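A sketch with scipy.optimize.curve_fit, assuming the linear model yi = b0 + b1 xi and reusing the arrays and the helper f_mse above, could be:

from scipy.optimize import curve_fit

def f_linear(x, b0, b1):
    # linear model to be fitted numerically
    return b0 + b1 * x

params, _ = curve_fit(f_linear, x_train, y_train)
y_test_hat_scipy = f_linear(x_test, *params)
print("coefficients:", params)
print("mse (test):", f_mse(y_test, y_test_hat_scipy))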
Nearly the same value is obtained, as expected for a simple linear model. Anyway, the mse is
large because the linear model is not relevant for this dataset, and a nonlinear regression should be
fitted. Next, the curve is approximately retrieved with a better fit than just a linear regression, by
changing the function fed to the numerical optimization module.
Polynomial fitting with scipy
The fitting of a curve is illustrated in this paragraph. We are going to compare several polynomials
with different degrees, because usually the polynomial degree is unknown when only the data
sample is available. The purpose is to find the best polynomial, say the relevant degree and a relevant
estimation of the coefficients.
Note that an analytical expression already exists in the linear case, such as for the simple regression
and the polynomial regression. Thus, this example illustrates the fitting of any function with a
numerical algorithm and allows a comparison with the exact solution. For the optimization of a
function with the python module scipy, one writes the nonlinear function and the fitting as follows.
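A sketch of such a comparison over several degrees is given below; the list of degrees, the helper names, the reuse of f_mse, and the use of np.polyval are illustrative assumptions.

from scipy.optimize import curve_fit
import numpy as np

def f_polynomial(x, *coeffs):
    # polynomial model; np.polyval expects the highest-degree coefficient first
    return np.polyval(coeffs, x)

degree_s = list(range(1, 9))
mse_train_s, mse_test_s = [], []
for degree in degree_s:
    p0 = np.ones(degree + 1)                     # one initial value per coefficient
    coeffs, _ = curve_fit(f_polynomial, x_train, y_train, p0=p0)
    mse_train_s.append(f_mse(y_train, f_polynomial(x_train, *coeffs)))
    mse_test_s.append(f_mse(y_test, f_polynomial(x_test, *coeffs)))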
Figure 1.4: Several fitted polynomial curves and test sample
The error here is the usual average quadratic distance between the estimated and the true values.
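The per-degree errors can then be gathered in a small table; a sketch, assuming pandas is available and reusing the lists computed above, could be:

import pandas as pd

# train and test mean squared errors for each candidate degree
df_mse = pd.DataFrame({"degree": degree_s,
                       "mse_train": mse_train_s,
                       "mse_test": mse_test_s})
print(df_mse)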
It is clear that for the train sample, increasing the degree of the polynomial induces a better fit,
while for new data with the test sample this is not true: the error increases and even explodes in
some cases. The curve learns all the noise of the data instead of learning the underlying smooth real
curve. Note that the optimization with scipy allows bounds to be introduced on the parameters, so
that a parameter is no greater than its bounds. In a later chapter, another approach also related to
curve fitting is considered, with an explicit penalty term added to the objective function.
With the smallest mse for the test sample, hence for new data, the model kept here is the polynomial
regression with degree four. The estimated coefficients of this model are then obtained.
This model is globally not far from the true one, according to the parameter values, but for some
datasets the mse on the test sample may not be minimal exactly at the right degree of the
polynomial, because this way of choosing is not optimal or the sample is too small. It makes sense
that looking at just one new sample is not enough if the test sample does not reflect the general case,
and better approaches are implemented in a later chapter with several test samples instead of just
one. This is related to the problem of generalization, which concerns the quality of the
model for a novel data sample, or how to make a model general enough for any new similar dataset
from the unique available sample.
In the next chapters, pytorch is presented for such cases, together with some main algorithms for
deep learning. Instead of relying on a numerical library for the optimization, the numerical training
must be written for each general architecture of neural network. This allows avoiding the traps
coming from the nonlinearities and the associated useless local minima of the loss function. On the
other hand, neural network models are relevant for many current datasets and still able to predict an
outcome for new data. Less flexible models are more limited and perform worse, hence extending
linear statistical models within the neural framework is appealing when enough data are available
for training.
In the generalized linear models, the mean µ = E(Y |x) is related to the usual inner product of the
parameter vector β with the vector of explicative variables (plus a one for the intercept) xi via a
strictly increasing "link function" g(·) with inverse g⁻¹(·), such that:
$$g(\mu) = x^T \beta \,, \qquad \text{or equivalently} \qquad \mu = g^{-1}(x^T \beta) \,.$$
In neural networks, the functional relation between yi and xi is more complex, and the notation of
the statistical nonlinear regression is retrieved. Instead of µ(xiT β ) in generalized linear models,
this is written with weights w,
yi ≈ µ(xi , w) .
To be more precise, often at least for classification and regression which are met in deep learning
for images and biology, the expression is:
yi ≈ µ(φ (xi , w)T β ) .
In this case, yi is most of the time univariate and continuous as a curve fitting problem like the one
in this chapter, or categorical for image classification with advanced layers in the neural network
in order to process the spatial information hidden in the images.
Deep regressions can often be interpreted as nonlinear regressions with a particular expression
for the nonlinearities: the observed vector of explanatory independent variables is transformed via a
nonlinear space before applying the usual final weighting and sum towards the target variable.
Historically, the Gauss-Newton method was introduced for the inference of the parameters of such
models, together with back-propagation, which is related to the so-called "chain rule" for the
derivatives. With a growing number of variables and large datasets, first-order algorithms are usually
more practicable than exact second-order methods, even today.
1.6 Exercises
1. (python) Compare side by side in an array, for ten observations, the true values and the estimated
values of the target variable when the polynomial of degree four is fitted. Hint: the parameters
are found as the second array returned by the function f_polynomial_regression()
and stored in the variable y_test_polyn_hat_s.
2. (sklearn) Consider a neural network with nonlinear functions instead of the polynomial
regression and find suitable values of the weights for this dataset. Compare the
predicted/estimated target values and the error with the results above in the chapter.
3. (python) Implement a "cross-validation" with scipy and propose a method for regularizing
the model. Hint: a way to see this problem of regularization is that when the polynomial
degree increases, the regression coefficients increase, and the norm of the vector of coefficients
should be kept small enough.
4. (stat) Explain how to choose the training and test samples well in order to get the best results.
5. (stat) Retrieve the solution of the linear regression with one variable in a matricial notation,
from the initial matricial expression of y as a function of x (as in the polynomial
regression above). The solution should be the same as the one written after the derivatives
proposed here. The linear system with the analytical expression comes from the derivatives
w.r.t. β0 and β1 , before rewriting into a matrix and two vectors by isolating the vector of
regression coefficients at each row.
6. (stat) For the polynomial regression (and more generally the multiple regression), an
analytical exact expression is available for the mean squared error. Retrieve this expression, and
its value for the polynomial regression. Statistical or machine learning books treat these
questions with a more elaborate development, out of the scope herein.
Chapter 2
In this chapter we are interested in finding a linear function for explaining observations yi
depending on p independent variables xi j aggregated in the vector xi = (xi1 , xi2 , · · · , xip )T . This is
written with a noise εi as,
yi = f (xi ) + εi
= f (xi1 , xi2 , · · · , xip ) + εi .
The noise comes from the error of the linear approximation. The error may be induced by an
inexact measure (for prediction) or by unknown variables (for explaining). This model is usually
called a regression model, while f (xi ) is the regression line. Here in the linear case we have just:
$$f(x_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \,.$$
With real data the noise is often Gaussian, but other noises are actually possible, depending on
the distribution of yi . Note that with pytorch, one just needs to choose a loss function from
a list of loss functions already available in the python module, or possibly define the loss by
writing a python function, before launching the optimization, which does not ask for more analytical
work. On the other hand, knowing the underlying mechanism allows a better understanding of the
algorithms and their limits, and a better extension or improvement of the models if one is not happy
with the currently available contents.
The main purpose is thus to find the coefficients β j from a sample of n couples (yi , xi ), composed
of the variable yi called target, outcome or dependent variable and the vector xi of predictor,
explanatory or independent (non dependent) variables. The sample is a set of n couples,
$$s_n = \{ (y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n) \} \,.$$
The method is called supervised because the variable yi is available for each vector xi , hence the
purpose is to "explain" or "predict" yi as a function of xi . These two purposes are well separated:
• "to explain" means that the coefficients β j can be understood as positive or negative weights
for the dependent variable yi ,
• "to predict" means that given new values of xi , a value for yi is computed and should
be near the true value.
In practical applied statistics, one can explain with a non-predictive model. Hence, whatever the
size of the variance of the noise, its distribution typically just needs to be Gaussian. Possibly,
one may not even look at the noise, but this becomes mandatory for validating the model and the
coefficients of the regression. Herein, we want a relevant model for prediction, and the noise should
not be too large, otherwise the prediction may not be efficient enough.
Linear algebra provides a solution for the vector of regression coefficients β =
(β0 , β1 , β2 , · · · , β j , · · · , β p )T ∈ R p+1 in closed form, but the next chapters will present alternative
solutions via diverse algorithms. Finding a value for β from a sample has diverse namings in the
literature: inferring, fitting, learning or training. They may not have exactly the same meaning from
one author to another, but one always gets β̂ given the sample, and this approximated value should
be near the true value.
As the sample comes from a whole population, often infinite, a consequence is that the obtained
solution depends on the sample: β̂ may be denoted β̂ (sn ), β̂s or β̂n . This estimation comes with
some variability in the sense that changing the sample leads to a different estimation β̂ . Thus when
possible, the expectation and the variance of β̂ are checked, at least for simpler models such as
the linear one.
The other quantities which are measured are related to the sample, and compute how good the
estimations ŷi are in comparison to the true values yi of the target variable.
Next, a more formal presentation leads to the expression of the coefficients and of the diverse
criteria discussed for the quality. A probabilistic interpretation of the regression is written with the
help of the Gaussian noise. An illustrative example follows then.
$$\hat{\beta} = \arg\min_{\beta} \ell(\beta) = \big( X^T X \big)^{-1} X^T y \,,$$
where,
$$\ell(\beta) = \frac{1}{n} \, \| y - X\beta \|^2 \,.$$
Here the loss function is written directly with a quadratic distance in order to have a notation
related to a Gaussian distribution.
The two last rows come from a property of the Gaussian probability function, as follows. The
probability distribution of the dependent variable yi conditionally to the independent variables is
normally distributed because of this equality just before. Even without this hypothesis of distribution,
the result holds. This comes from the expression of the underlying distribution as a Gaussian
distribution:
$$y_i \sim \mathcal{N}(x_i^T \beta, \sigma^2)
\quad \text{is equivalent to} \quad
p(y_i; x_i^T \beta, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( - \frac{(y_i - x_i^T \beta)^2}{2\sigma^2} \right) .$$
After considering such an expression, we just need to take the logarithm of the distribution of the
i.i.d. variables yi conditionally to xi . Conditionally just means that the variable xi appears in the
mean, hence can be supposed constant, while yi is the random one. This makes sense because
yi depends on xi and is observed after xi is given a value. Thus, coming with this distribution,
we retrieve the criterion which is optimized for the usual linear regression, even if there was
possibly no distributional hypothesis originally (with just the simple sum of squares for the loss).
Indeed:
Hence one can assume that the variables with values yi are identically and independently distributed
according to the above Gaussian distribution. This leads to define the likelihood of an observation
yi with the input variables xi given the vector of coefficients β which is unknown. Since these
observations are assumed to be independent and identically distributed, we can build the probability
density function (p.d.f.) of all possible observations (y1 , y2 , · · · , yn ) as the product of the single
densities of the observations yi .
To find a value for the unknown parameter β , the most frequent method is maximum likelihood
estimation (MLE). This method estimates the parameters of the parametric probability distribution,
given a data sample, by retaining the parameter value which maximizes the likelihood of the sample.
For analytical and practical reasons, it is the logarithm of the likelihood, also called the loglikelihood,
which is maximized; this is equivalent because the logarithm is a strictly increasing function, and it
allows avoiding the maximization of products, which is a tricky numerical problem. Note that the
loss in a neural network is minimized, thus it is the negative
of the loglikelihood which is optimized by modules such as pytorch, but they are not limited to
loglikelihood and other functions not related to parametric distribution are possible such as for
robust losses or more advanced structures of neural networks (recurrent for sequence processing,
or convolution for image processing for instance, out of the scope herein). This induces that there
is a strong distributional hypothesis in this statistical inference, which is not required in neural
networks: they are able to fit any unknown function with a relevant optimization and a suitable
architecture (hidden layer(s) and number of neurons). Only more flexible statistical models such
as tree-based for instance are able to compete with neural networks.
The consequence of this result is that the solution for β is a maximum likelihood estimator, and
the estimation of the variance parameter is also required:
$$\hat{\beta} = \arg\max_{\beta} \tilde{\ell}(\beta, \sigma)
\qquad
\hat{\sigma}^2 = \arg\max_{\sigma} \tilde{\ell}(\hat{\beta}, \sigma) \,.$$
The first line comes from the fact that σ cancels out for the first term and remains proportional for
the second term when computing the derivative of the function ℓ̃(·, ·) w.r.t. β , while the solution
for σ is obtained after the solution for the regression coefficients.
In this example, the explanatory variable is univariate, such that xi = xi1 , hence:
yi = β0 + β1 xi + εi .
In python, we generate the noise εi and the explanatory variable xi , compute yi , and show their
linear relation with the line.
The data and true parameters are stored on the computer disk, thus they are loaded just below from text files.
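A sketch of this loading and plotting step is given below; the file names and the content of the parameter file are illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt("sample_linear.txt")     # one row per observation: x_i, y_i
x, y = data[:, 0], data[:, 1]
beta = np.loadtxt("beta_linear.txt")       # true parameters (beta_0, beta_1)

x_grid = np.linspace(x.min(), x.max(), 100)
plt.plot(x_grid, beta[0] + beta[1] * x_grid, "k-", label="true regression line")
plt.scatter(x, y, c="b", label="sample")
plt.xlabel("x"); plt.ylabel("y"); plt.legend()
plt.show()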
Figure 2.1: True regression line and sample with 10 observations.
• The independent variables xi = (xi0 , xi1 , xi2 , . . . , xip )T for i = 1, 2, . . . , n, multivariate in R^{p+1}
with xi0 = 1 for the intercept β0 .
For notational reasons, the intercept β0 is included in β while xi has an additional component
xi0 in order to deal with this definition of β . The function f (·) which allows writing yi as a
function of xi is linear here. For instance, the number of apples produced by an apple tree may be
in a linear relation with the season, the average temperature and the height of the tree.
Below, we rewrite the regression as a linear system in order to find an analytical solution.
Let us rewrite the n linear equations, one per line, via three different notations. The regression
line holds for each observation couple (yi , xi ), such that:
$$\begin{array}{lll}
y_1 = \beta_0 x_{10} + \beta_1 x_{11} + \cdots + \beta_p x_{1p} + \varepsilon_1 & \quad y_1 = \sum_{j=0}^{p} \beta_j x_{1j} + \varepsilon_1 & \quad y_1 = \beta^T x_1 + \varepsilon_1 \\
y_2 = \beta_0 x_{20} + \beta_1 x_{21} + \cdots + \beta_p x_{2p} + \varepsilon_2 & \quad y_2 = \sum_{j=0}^{p} \beta_j x_{2j} + \varepsilon_2 & \quad y_2 = \beta^T x_2 + \varepsilon_2 \\
\qquad\qquad \vdots & \qquad \vdots & \qquad \vdots \\
y_i = \beta_0 x_{i0} + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i & \quad y_i = \sum_{j=0}^{p} \beta_j x_{ij} + \varepsilon_i & \quad y_i = \beta^T x_i + \varepsilon_i \\
\qquad\qquad \vdots & \qquad \vdots & \qquad \vdots \\
y_n = \beta_0 x_{n0} + \beta_1 x_{n1} + \cdots + \beta_p x_{np} + \varepsilon_n & \quad y_n = \sum_{j=0}^{p} \beta_j x_{nj} + \varepsilon_n & \quad y_n = \beta^T x_n + \varepsilon_n
\end{array}$$
In the system above, one recognizes a column vector on the left for the yi , a column vector on
the right for the noises εi , and a scalar product between the vectors xi and β . A more condensed
expression of the system with the summation sign is given in the second column. In a vectorial
notation, this linear system is written in the third column. One must notice that the new observed
variable xi0 was introduced for notational reasons, its value being one.
One easily recognizes a product of a matrix with a vector just after the equal sign, which allows a
matricial notation as explained in the next paragraph. The notation for the different vectors and
matrices is below, followed by the more condensed notation with a matrix for the independent
variables instead of vectors.
Let us define the matrix X as follows. The matrix X ∈ R^{n×(p+1)} is called the design matrix. It allows
a matricial expression of the linear system involving the regression coefficients:
$$X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
= \begin{pmatrix}
1 & x_{1;1:p}^T \\
1 & x_{2;1:p}^T \\
\vdots & \vdots \\
1 & x_{n;1:p}^T
\end{pmatrix},
\qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},
\quad \text{and} \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
Here, x_{i;1:p} keeps the p last components of xi without the first one equal to 1; this is the more
usual notation for xi , not preferred here for the matricial computation, otherwise β0 is separated,
leading to lengthier analytical expressions. Another approach is a separate name, x̃i , when
adding the first constant component, which is not considered here.
After the design matrix, three column vectors are defined: one for the target variable, one for the
regression coefficients and one for the noise.
Here xi = (xi0 , xi1 , xi2 , . . . , xip )T ∈ R^{(p+1)×1} , y ∈ R^{n×1} , β = (β0 , β1 , β2 , . . . , β p )T ∈ R^{(p+1)×1} and
ε ∈ R^{n×1} . For the three vectors, the dimensionality is such that the n
equations can be rewritten as a unique matricial equation:
y = Xβ + ε .
This means that when the data is available, from physical or biological measures for instance, the
data are written with this matricial expression, with y and X, in order to find a solution for β . For
the example just before, this leads to the following numerical matrix and vector:
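A sketch of this construction for the example just before (p = 1), reusing the arrays x and y loaded in the sketch above, could be:

import numpy as np

# design matrix with a first column of ones for the intercept
X = np.column_stack([np.ones(len(x)), x])
y_col = y.reshape(-1, 1)
print(X.shape, y_col.shape)    # (n, p+1) and (n, 1)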
Here, recall that p = 1. In summary, from the sample, the information from the data y is available,
and it has a role different from the other available variables. The other variables x are
aggregated in the matrix X which is called the design matrix. The noise is latent and included in
the observation yi , so that it is not dealt with in a first approach: with yi given and ŷi estimated,
this is just εi = yi − ŷi = yi − β T xi . Note that it is estimated with ε̂i = yi − β̂ T xi when β̂ denotes
the vector of regression coefficients found according to the available sample. The noises are the
quantities minimized with a square, as is recognized in:
$$\ell(\beta) = \frac{1}{n} \sum_{i} \varepsilon_i(\beta, X)^2 \,.$$
This is the gap between the observations yi and their linear approximations ŷi , such that the
regression line must minimize these gaps as a solution.
The linear regression model has for purpose to predict or to explain y in terms of X via a linear
relation. Other forms of relation may be polynomial for instance, or even more nonlinear for a
function of X involving another analytical approach than in this chapter. The linear relation between
X and y is relevant when it is acceptable, thus it will be considered how to check such a relation. The
linear hypothesis leads to the fitting of the linear regression model and to finding regression
parameters which are estimates of β = (β0 , . . . , β p )T from the available sample.
The linear regression provides a mathematical expression of the relation of yi as a function of the
p variables xi j and the parameters β j . This allows explaining the variable yi from the predictors xi j
by the sign of the coefficients β j , for a positive or negative influence from the independent variable
on the dependent variable. This also allows predicting the variable y_i^{new} from new predictors
x_{ij}^{new} by using the linear equation and ignoring the noise, such that ŷ_i^{new} is a guess for
the true target variable y_i^{new} which is unknown and not available.
This shows the difference between explanation and prediction. In the first one, which "explains",
the noises may be large, as only the linearity and the shape of the noise are checked, hence a large
variance σ is possible and frequent. In the second one, which "predicts", the noises and so the
variance must be small, such that the values ŷi found by the model are expected to be near the true
values yi . This is why in clinical research for instance, the analysts may focus on small sets of
independent variables which allow explaining rather than predicting. This generates a large amount
of concurrent results which are compared afterwards, with the idea that if the biases are different
and the conclusions are identical, an effect is detected. The predictive models may require several
tens of variables which are costly to gather each time. They may be difficult to improve because this
supposes looking for new relevant variables at the beginning of the study.
After this introduction to the notations and the purpose with practical illustrations, the expressions
for the solution are found next.
Optimization for β - mse to be minimized
For the fitting of the model, say finding a value for β , it is required to write a criterion and optimize
it; the criterion will be a function of β in order to proceed with the optimization (minimization or
maximization). The solution is written as a function of the sample, because only the sample is
available.
A relevant criterion is one able to minimize the error between the true yi and the predicted ŷi from
the model. This is available from the εi , and for mathematical ease but also in order to remove the
sign of εi , the square of the noise is usually minimized. This allows nice closed-form solutions
with nice statistical properties, even if this can be improved as will be seen later.
The loss, also called cost function because it is minimized, is thus written as follows:
$$\begin{aligned}
\ell(\beta) &= \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i(\beta, X)^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
&= \frac{1}{n} (y - \hat{y})^T (y - \hat{y}) \\
&= \frac{1}{n} (y - X\beta)^T (y - X\beta) \\
&= \frac{1}{n} \big( y^T y - 2 y^T X \beta + \beta^T X^T X \beta \big) \,.
\end{aligned}$$
The vectorial notation in the criterion at the third line comes from the squared norm of the vector
y − ŷ , which is the usual trick here. The fourth line comes from rewriting ŷ with its expression as a
function of the design matrix and the vector of the regression coefficients.
Having a cost function, our purpose is now to minimize this loss w.r.t. the parameter vector β . For
the estimation of β , which leads to a solution β̂ depending on the sample, we just minimize
the cost function via the optimization problem:
$$\hat{\beta} = \arg\min_{\beta} \ell(\beta) \,.$$
The minimization is performed via the derivative of the cost function with respect to β which
is equal to zero at the optimum. This gives a mathematical expression as follows in a vectorial
approach:
$$\frac{\partial}{\partial \beta} \big[ \ell(\beta) \big]
= \frac{1}{n} \big( -2 X^T y + 2 X^T X \beta \big)
= -\frac{2}{n} X^T (y - X\beta)
= 0 \,.$$
This leads to
$$X^T (y - X\beta) = 0
\;\;\Longleftrightarrow\;\;
X^T y = X^T X \beta
\;\;\Longleftrightarrow\;\;
\hat{\beta} = \big( X^T X \big)^{-1} X^T y \,.$$
Here the expression given is available only if the matrix X T X is invertible; otherwise a workaround
is required, such as making the matrix invertible by inflating its diagonal, preferring a pseudo-
inverse, or using a numerical algorithm for the optimization which does not involve the matrix
inversion. Note that nowadays, for large matrices, such algorithms are more and more wanted and
used in practice.
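A sketch of the closed-form solution in numpy, reusing the arrays X and y_col built in the sketch above, could be; lstsq (or solve) avoids forming an explicit inverse:

# least squares solution of y = X beta, numerically safer than inv(X.T @ X)
beta_hat, *_ = np.linalg.lstsq(X, y_col, rcond=None)
print(beta_hat.ravel())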
One may also notice that the optimization may be performed component-wise, which avoids any
knowledge of matrix derivatives. This translates into mathematical terms as follows: proceed with
the computation of the derivatives component by component for each coefficient β j with j in
0, 1, . . . , p. More precisely, one may write the expression for each ∂ℓ(β)/∂β j , then one may
rewrite the expression from the p + 1 univariate derivatives via vectors and a matrix: the same
matricial solution is obtained as with the direct matricial derivative. Some algorithms also proceed
with an optimization component by component, or by sets of components instead of all the
components at once; both approaches are out of the scope herein.
As the design matrix is defined with X ∈ R^{n×(p+1)} , its product with the transposed version,
X T X ∈ R^{(p+1)×(p+1)} , is a square matrix of side p + 1. If for a small example the matrix may be
invertible, for larger values of p + 1 this becomes quickly intractable, because the square matrix is
dense and asks for too much computer memory, or because there are just too many useless variables
that one needs to remove. The solution above for the regression is not really able to lead to a
relevant solution in this case, and other algorithms are required. Improved algorithms and improved
regression models are thus available in the literature and python modules, such as online training
and variable selection. These kinds of numerical variants are introduced in the next chapters.
When a solution of the optimization is obtained, it is usual to check whether the optimum is a minimum or a maximum; this is possible by looking at the signs of the eigenvalues of the second-order derivative. For this regression model, the second-order derivative of the criterion is:
\[
H = \frac{\partial^2 \left[\ell(\beta)\right]}{\partial\beta^T \partial\beta} = \frac{2}{n}\, X^T X.
\]
This second-order derivative is usually called the Hessian matrix and denoted H. As the product of a matrix with its transpose, it is always positive (positive definite when X has full rank, as supposed here), hence the optimization is a minimization and the problem is said to be convex. The Hessian matrix and the gradient, which is the first derivative, are involved in an approximation of the criterion around the optimal solution, such as:
\[
\ell(\beta) = \ell(\hat{\beta})
+ (\beta - \hat{\beta})^T \frac{\partial\left[\ell(\beta)\right]}{\partial\beta}\bigg|_{\hat{\beta}}
+ \frac{1}{2}\,(\beta - \hat{\beta})^T \frac{\partial^2\left[\ell(\beta)\right]}{\partial\beta^T\partial\beta}\bigg|_{\hat{\beta}} (\beta - \hat{\beta})
+ R(\beta).
\]
This is the same approximation as the one available from Taylor series for univariate functions. Here, for a quadratic problem, the remainder term R(β) is equal to zero and the expression is exact. A consequence is that if a solution exists, it is unique, which ensures the convergence of the algorithms under few conditions. For nonlinear functions, the remainder term is not zero in general and there exist local solutions which are not global (the best ones), which makes the optimization trickier: for neural networks, solutions and workarounds have been proposed recently, hence this problem is partially solved for some architectures.
Optimization for σ
\[
\frac{\partial}{\partial\sigma}\,\tilde{\ell}(\hat{\beta}, \sigma) = 0.
\]
Hence, it comes:
\[
\frac{\partial}{\partial\sigma}\left[ \frac{n}{2}\log\left(2\pi\sigma^2\right) + \frac{\|y - X\hat{\beta}\|^2}{2\sigma^2} \right] = 0
\]
\[
\frac{\partial}{\partial\sigma}\left[ n\log\sigma + \frac{\|y - X\hat{\beta}\|^2}{2\sigma^2} \right] = 0
\]
\[
\frac{n}{\sigma} - \frac{\|y - X\hat{\beta}\|^2}{\sigma^3} = 0
\]
\[
\hat{\sigma}^2 = \frac{1}{n}\,\|y - X\hat{\beta}\|^2.
\]
Note that this estimator is biased, because its expectation is equal to ((n − p)/n) σ², a result found in any statistics book on the linear model, so that the unbiased estimator may be preferred for more advanced purposes. This also shows that minimizing the loss also minimizes this expression for the variance term, while the loglikelihood is maximized.
In order to check whether the fitting of the linear model is successful, diverse methodologies have been proposed in the statistical literature, based on different indicators which tell us how relevant the fit is. One of the most used methods is to check the residuals yi − ŷi, which should be Gaussian distributed, centered at zero and without any structure hiding some correlation among the data.
Thus, one can have a look at the (studentized) residuals with a plot, at their distribution with a histogram, or at the comparison of their distribution with the Gaussian one with a qq-plot. These plots and the related statistical tests for the linear regression are not considered herein, but are wise complements to the usual numerical indicators. Instead, a plot of yi against ŷi will illustrate variable selection in the chapter dedicated to regression with a continuous target variable, while a matrix aggregating the counts comparing yi with ŷi for a discrete target variable is presented at the end of the current chapter with an example of classification.
For the regressions with a continuous outcome, numerical indicators are as follows.
This indicator is formally an empirical expectation over a sample. It is written as the mean of the
square of the noise per datum as:
\[
\mathrm{MSE}(y,\hat{y}) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.
\]
This sum is just the loss or objective function that was minimized in order to find the solution for the regression. When comparing models with different choices of variables, the better fit is the one with the smaller MSE, but one must also care about the error for new data. The MSE is never equal to zero because there is always some random noise with real data. The MSE is written as the following python function,
def f_mse(y, yhat):
    # mean of the squared errors; assumes numpy imported as np (function name assumed)
    return np.mean((np.asarray(y) - np.asarray(yhat))**2)
The mean absolute error (MAE) is related to the mean squared error by replacing the quadratic distance with an absolute distance; it is not often reported, contrary to the next indicator, which is without any doubt the preferred indicator in many domains despite its limits.
Coefficient of determination or R2
This indicator is related to the prediction from the independent variables. The larger the better, with a maximum at 1.0, more negative values in the worse scenarios of bad fitting, and a value of 0.0 for a model with only an intercept β0 and no covariates. The score R² is defined as
\[
R^2(y,\hat{y}) = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2},
\]
where we have defined the mean value of y as
\[
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.
\]
It is also known that this statistic is unable to check the linearity of the data and supposes that the hypothesis of linearity is true, hence it may provide an erroneously high value for real functions which are nonlinear: the residuals should always be checked too. The R² can be written as a python function.
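A minimal sketch of such a function, under the assumption that numpy is imported as np and with an assumed name f_r2 (the original listing is not readable here), could be:

import numpy as np

def f_r2(y, yhat):
    # coefficient of determination: 1 - residual sum of squares / total sum of squares
    y, yhat = np.asarray(y), np.asarray(yhat)
    return 1.0 - np.sum((y - yhat)**2) / np.sum((y - np.mean(y))**2)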
Note that these indicators are available with sklearn for some methods such as LinearRegression
in sklearn.linear_model, from the sub module metrics, with the names "mean_squared_error",
"r2_score", "mean_absolute_error" while "mean_squared_log_error" is an alternative one.
The direct implementation involves a matrix inversion and its multiplication with y in order to get the regression coefficients. The python operator "@" performs matrix multiplication in its usual definition, summing along the columns of the left operand and along the rows of the right operand. The method ".dot()" provides an equivalent matrix multiplication. The function "linalg.inv()" from the module numpy inverts a square matrix.
���������� � �����������������������������������������
��������������������
������������������������������
�����������
��������������
� ������������
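A readable sketch of this computation, using the rounded values of the small dataset displayed just below (the variable names are assumptions, since the original listing is unreadable here), could be:

import numpy as np

x = np.array([0.35, 0.45, 0.02, 0.18, 0.23, 0.93, 0.53, 0.48, 0.73, 0.04])
y = np.array([0.07, 0.94, -1.06, -0.21, -0.01, 2.26, 1.55, 1.50, 2.09, -0.23])
X = np.column_stack([np.ones(len(x)), x])                  # design matrix with a column of ones

betahat = np.linalg.inv(X.T @ X) @ X.T @ y                 # closed-form least-squares solution
betahat_dot = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)    # equivalent with .dot()
print(betahat)                                             # close to (-0.76, 3.67) up to rounding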
For this small dataset, this result comes after a few steps as follows:
\[
\hat{\beta} =
\left(
\begin{bmatrix}
1 & 0.35\\ 1 & 0.45\\ 1 & 0.02\\ 1 & 0.18\\ 1 & 0.23\\ 1 & 0.93\\ 1 & 0.53\\ 1 & 0.48\\ 1 & 0.73\\ 1 & 0.04
\end{bmatrix}^T
\begin{bmatrix}
1 & 0.35\\ 1 & 0.45\\ 1 & 0.02\\ 1 & 0.18\\ 1 & 0.23\\ 1 & 0.93\\ 1 & 0.53\\ 1 & 0.48\\ 1 & 0.73\\ 1 & 0.04
\end{bmatrix}
\right)^{-1}
\begin{bmatrix}
1 & 0.35\\ 1 & 0.45\\ 1 & 0.02\\ 1 & 0.18\\ 1 & 0.23\\ 1 & 0.93\\ 1 & 0.53\\ 1 & 0.48\\ 1 & 0.73\\ 1 & 0.04
\end{bmatrix}^T
\begin{bmatrix}
0.07\\ 0.94\\ -1.06\\ -0.21\\ -0.01\\ 2.26\\ 1.55\\ 1.50\\ 2.09\\ -0.23
\end{bmatrix}.
\]
From the products X T X with the matrix X and X T y with the vector y, this leads to:
\[
\hat{\beta} =
\begin{bmatrix} 10.000 & 3.950\\ 3.950 & 2.334 \end{bmatrix}^{-1}
\begin{bmatrix} 6.913\\ 5.567 \end{bmatrix}
=
\begin{bmatrix} -0.757\\ 3.666 \end{bmatrix}.
\]
There is a small difference with the previous direct numerical solution because of the rounding of the numbers in the matrix, which recalls that rounding is suitable only for showing the results, while intermediate expressions should be computed without approximation in order to avoid any propagation of the rounding error.
In the module numpy, the regression is implemented with the function "linalg.lstsq()" such that,
������������
��������������
� ������������
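As a short sketch, reusing the arrays X and y built above, the same coefficients are obtained without an explicit inversion:

betahat_lstsq, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(betahat_lstsq)        # same coefficients, without forming the inverse explicitly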
In the module sklearn, the linear regression is implemented in the function lin-
ear_model.LinearRegression(r). We get the fitting, the regression coefficients from the fitting and
the prediction for the small dataset as follows.
� �����������������
���������������������
������������������
������������
��������������
� ������������
Here, xi has 1 as its first component, hence there is no need to fit the intercept and the option "fit_intercept=False" can be used; otherwise, by default, the intercept is fitted and returned in the attribute "intercept_", while the vector of coefficients is returned in the attribute "coef_", which in the present case already contains the intercept.
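A minimal sketch of this fit, reusing the arrays X and y above (with the intercept column already included in X), could be:

from sklearn.linear_model import LinearRegression

reg = LinearRegression(fit_intercept=False)   # the column of ones already plays this role
reg.fit(X, y)
print(reg.coef_)                              # both coefficients, intercept included here
yhat = reg.predict(X)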
The R2 and the MSE from the two functions defined above are equal to:
���� � � � �����������
�������� ��� � ����������������������������������� �
�� ��� � ��������������������������������������
��� � ����
��� � ����
Solution for σ
��������� �������
Let us plot, for the small example, the cost function for σ before solving the equation. The loglikelihood is computed for several values of the standard deviation. Note that the design matrix could be an argument of the function instead of a global python variable. Then we plot the cost as a function of the variance quantity.
������ ����� �� ��
��� �����������������������������
� � ������
� � ����������������������������� � �
�������������������������������� �������������
������ ��
��������� � ������������� � ����
��������� � ������������� � ����
���������� � ����������������������������������
����������� � ����������������������������������������
���������� � ��������������������������������������
��� �����������������������������������������������
������������ ������������
���������������������
���������������������
�������������������
������ ����������������� �� ���
���� �� � ��������������
����������������������������������������������������������
��������� ���������� �� � �������� �� ��� �������� �����������
����������������� ����������� �����
����������
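A minimal sketch of this computation, under the assumption that X, y and betahat hold the design matrix, targets and estimated coefficients from above, could be:

import numpy as np
import matplotlib.pyplot as plt

def f_negloglik_sigma(sigma, X, y, betahat):
    # negative Gaussian loglikelihood as a function of sigma, with beta fixed at betahat
    n = len(y)
    rss = np.sum((y - X @ betahat)**2)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + rss / (2 * sigma**2)

sigmas = np.linspace(0.1, 2.0, 200)
losses = [f_negloglik_sigma(s, X, y, betahat) for s in sigmas]
plt.plot(sigmas, losses)
plt.xlabel("sigma")
plt.ylabel("negative loglikelihood")
plt.show()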
Most of the functions defined in the chapters are written in a python file in order to be available later. Note that the functions in the file "deepglmlib.utils.py" do not need to refer to "utils", because the functions are already defined locally; thus, this prefix is removed in every function of the file.
For the regression, the indicators are computed in a function with optional printing of the indicators
and drawing of the scatterplot between true and predicted targets.
��� ������������������������������������������������������
������������������������������������
� ��������������������������������
����������������������������������
�� ��������
���������� � �� ��������������������� ����
��� � �� ���������������������
����������� � �� ���������������������������
�� ������������
��������������������� ����� ��������������� ���������� ���������
���������������� ��� ���������� ���� �������������� ���
One recognizes the mean squared error (for linear and nonlinear regression) and the coefficient of determination (for linear regression).
������������������ � ��������������������������������������������
���������������������������
��� � �����
�� � �����
2.1.6 Remarks
A main purpose of the next chapters is to consider nonlinear models by changing the linear processing of the explaining/predicting variables from a sum into some nonlinear function, i.e. β^T xi is changed into a more general function g(xi, β), where g(·,·) is built from linear transformations composed with nonlinear ones which follow each other. The target variables yi may also eventually be non-continuous, such as binary or integer-valued, and associated with different loss functions. The training of such deep models will be handled via pytorch, where g(·,·) comes from a particular expression explained in the next chapters: one gets powerful and accurate regression and classification models when the dataset is large enough. While for usual expressions of g(·) python modules for machine learning such as sklearn are able to fit the related model, when the dataset grows and the model evolves, more flexible approaches are required, such as those offered by the module pytorch for running the optimization.
This concludes the introduction to linear regression. The reader may have noticed that linear classification is also a regression with a different distribution for the outcome, within the framework called generalized linear models. Next, instead of a continuous outcome, a binary one is involved.
\[
y_i = \begin{cases} 0 & \text{for the class of failure}\\ 1 & \text{for the class of success.} \end{cases}
\]
As before y is a vector aggregating all the observations for the outcomes, X is the n × p design
matrix while β is the p × 1 vector of unknown parameters or regression coefficients.
The usual linear regression predicts values in R while we are interested in predicted values in the set {0, 1}. We have seen that the linear regression involves a Gaussian distribution; changing the distribution into a Bernoulli one thus makes sense, but its parameter needs to be a bounded probability. Recall that the parameter of the Bernoulli distribution is the probability of success, say the probability that the observation is equal to one. For this reason, the logistic regression is based on this distribution, but instead of implementing the inner product xi^T β directly as the parameter, it supposes a nonlinear transformation of this inner product:
\[
y_i \sim \mathcal{B}\!\left(\sigma(x_i^T\beta)\right)
\quad\text{is equivalent to}\quad
p\!\left(y_i;\, x_i^T\beta\right) = \left[\sigma(x_i^T\beta)\right]^{y_i}\left[1 - \sigma(x_i^T\beta)\right]^{1-y_i}.
\]
This model is the one we want because σ(xi^T β) ∈ ]0; 1[ is a valid probability and the distribution is defined for binary observations as expected. Note that the usual notation σ(·) is just a coincidence and is not related to the variance quantity with the scalar σ from the Gaussian likelihood.
Sigmoid function
The sigmoid function (or logistic curve) is useful in neural networks for assigning weights on a relative scale. The value u is the weighted sum of the inputs involved in the learning algorithm, and the function satisfies 1 − σ(u) = σ(−u). It takes any real number u and outputs a number in ]0; 1[.
\[
\sigma(u) = \frac{\exp u}{1 + \exp u} = \frac{1}{1 + \exp(-u)}.
\]
Note that the logistic model leads to a smooth version of a neural network called the "perceptron", not studied herein as it performs less well. There are some theoretical concerns (such as the number of steps before convergence of the algorithm) with this simple network which are out of the scope.
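A minimal numpy sketch of the sigmoid, illustrating the two properties above:

import numpy as np

def sigmoid(u):
    # logistic function, maps any real number into ]0, 1[
    return 1.0 / (1.0 + np.exp(-u))

print(sigmoid(0.0))                      # 0.5
print(sigmoid(3.0) + sigmoid(-3.0))      # 1.0, i.e. sigma(-u) = 1 - sigma(u)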
Recall that two classes are assumed, with yi either 0 or 1. Furthermore, an unknown vector of parameters β is assumed in our fitting with the sigmoid function. With that, the probabilities are defined as
\[
p(y_i = 1 \mid x_i, \beta) = \sigma(\beta^T x_i),
\]
where β are the weights we want to guess from the data, in our case β0, β1 and β2 in the example below, for instance.
Note that, as {yi = 0} and {yi = 1} are complementary events, p(yi = 0 | xi, β) = 1 − σ(β^T xi).
Likelihood function
The total likelihood is defined for all possible outcomes from the available sample or dataset {(yi, xi), 1 ≤ i ≤ n}, with the binary labels yi ∈ {0, 1} and data points drawn independently.
The purpose is thus to maximize this probability: the probability of the observed data under this model. The likelihood is expressed as the product of the individual probabilities of each outcome yi, because of the hypothesis of independence.
This allows implementing the Maximum Likelihood Estimation (MLE) principle, where the optimization leads to solutions for the unknown vector of parameters. This solution has known properties, which are the same for any form of likelihood from Gaussian, Bernoulli, etc. distributions. In neural networks, it is a loss which is minimized: the opposite of the loglikelihood.
Loss function
After that, one applies the logarithm to the mass distribution of the i.i.d. variables yi. Since the events for the observations were assumed independent and identically distributed, the probability mass function for all possible events y is a product over the single events, such that after the logarithm the loglikelihood is obtained:
\[
\ell(\beta) = \sum_{i=1}^{n}\left[\, y_i \log \sigma(\beta^T x_i) + (1-y_i)\log\left(1 - \sigma(\beta^T x_i)\right) \right].
\]
Reordering the logarithms, one can rewrite the sum as, with an example for two independent variables,
\[
\ell(\beta) = \sum_{i=1}^{n}\left[\, y_i\,(\beta^T x_i) - \log\left(1 + \exp(\beta^T x_i)\right) \right]
= \sum_{i=1}^{n}\left(\, y_i\,(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}) - \log\left(1 + \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})\right) \right).
\]
The maximum likelihood estimator is defined as the set of parameters that maximizes the loglikelihood, say maximize ℓ(β) with respect to β.
In neural networks, since the cost (error) function is just the negative loglikelihood, for the logistic regression the sign is changed for a minimization (of a loss). The corresponding loss is named the "cross-entropy".
2.2.2 Derivatives
The cross-entropy is a convex function of the weights β and, therefore, any local minimizer is a global minimizer.
Minimizing this cost function with respect to the three parameters β0, β1 and β2, we obtain,
\[
\frac{\partial \ell(\beta)}{\partial\beta_0} = -\sum_{i=1}^{n}\left( y_i - \frac{\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})} \right)
\]
\[
\frac{\partial \ell(\beta)}{\partial\beta_1} = -\sum_{i=1}^{n}\left( y_i x_{i1} - x_{i1}\,\frac{\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})} \right)
\]
\[
\frac{\partial \ell(\beta)}{\partial\beta_2} = -\sum_{i=1}^{n}\left( y_i x_{i2} - x_{i2}\,\frac{\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})} \right).
\]
Let us now define a vector y with n elements yi , an n × p matrix X which contains the xi values
and a vector p of fitted probabilities p(yi |xi , β ). We can rewrite in a more compact form the first
derivative of the cost function as,
\[
\frac{\partial \ell(\beta)}{\partial\beta} = -X^T(y - p)
\qquad\text{and}\qquad
\frac{\partial^2 \ell(\beta)}{\partial\beta\,\partial\beta^T} = X^T \Omega X.
\]
Here Ω denotes a diagonal matrix with elements ωi = p(yi | xi, β)(1 − p(yi | xi, β)), which leads to a compact expression of the second derivative too.
The difference with the linear regression is that there the output of the regression was directly used as ŷi, while for the logistic regression the output is a probability which needs to be converted into a binary value, via a threshold s such that ŷi = 1 if σ(xi^T β̂) > s and ŷi = 0 otherwise. The choice of s follows diverse criteria, in order to balance or unbalance the errors for the two possible values of the outcome. Here, an error happens when yi ≠ ŷi, as explained next.
Because the problem is a classification and the variable y is not continuous, the purpose is to
retrieve exactly the same class for each datum. Indeed, yi ∈ {0, 1} and ŷi ∈ {0, 1}, then the errors
happen whenever yi is not equal to ŷi . The four different cases are summarized just below:
• yi = 1 and ŷi = 1 is True Positive or TP
• yi = 0 and ŷi = 0 is True Negative or TN
• yi = 1 and ŷi = 0 is False Negative or FN
• yi = 0 and ŷi = 1 is False Positive or FP
Confusion matrix
A matrix allows a visual representation of the four cases listed above. They are summarized with
the cells of the "confusion matrix" which shows the counts for the correspondence or non corre-
spondence between the true yi and the predicted ŷi . The matrix is thus defined as follows:
        ŷ = 1   ŷ = 0
y = 1   #TP     #FN
y = 0   #FP     #TN
The cells #TP, #FN, #FP and #TN denote the sizes of the corresponding sets, obtained by counting the numbers of TP, FN, FP and TN in the sample when comparing the true target yi with the predicted target ŷi. We are happy when the diagonal of the matrix is large and the off-diagonal is small, but alternative indicators exist (also for unbalanced classification, not considered here). The value of the indicators depends on the cut s chosen for ŷi, which explains why the cut may be decided in order to maximize an indicator of interest, even if the threshold 0.5 is often selected by default.
Accuracy
The accuracy is the proportion of correctly classified units, ACC(y, ŷ) = (#TP + #TN)/n, written as a python function (name assumed):
def f_acc(y, yhat):
    return np.mean(np.asarray(y) == np.asarray(yhat))   # proportion of exact matches; assumes numpy as np
Error rate
This indicator is, on the contrary to the accuracy, the percentage of misclassified units in the sample, such that:
ERR(y, ŷ) = 1 − ACC(y, ŷ) .
Precision and recall
The "precision" is the percentage from the number of true positive against the number of true
positive and false positive, while the "recall" is the percentage from the number of true positive
against the number of true positive and false negative
#TP
PREC(y, ŷ) = .
#TP+#FP
#TP
RECALL(y, ŷ) = .
#TP+#FN
These are measures of success relative to the positive class only: what is measured is the count of true positives as a proportion of the predicted positives (precision) or of the actual positives (recall). This reflects the purpose of retrieving only the value 1 of the target variable; otherwise the true negatives would appear in the numerator.
ROC curve
This curve is a representation of the true positive rate against the false positive rate when the threshold varies within the interval [0; 1], in order to find the best threshold.
This curve is not so much considered here because the accuracy is more often the concern with a symmetric error, such that one may eventually plot the following curve as a complement, named here the "ACC curve",
\[
s \mapsto \left( \frac{\#TP_s}{\#TP_s + \#FN_s + \#FP_s + \#TN_s},\ \frac{\#TN_s}{\#TP_s + \#FN_s + \#FP_s + \#TN_s} \right).
\]
Here the index s indicates that the counts come from a value s used as the cut level for deriving the target variable ŷi from the continuous quantity σ(xi^T β̂), instead of just selecting s = 0.5 as the usual, not necessarily optimal, first choice; see "sklearn.metrics.roc_curve" for instance.
AUC
The quantity from the "Area Under the Curve" ROC or "AUC" allows to judge the quality of a
classification. With 0.5 the random choice for the predicted labels, a larger value towards 1 leads
to the best model. A visual representation associated to this area or just the value of area is a
complement to the other indicators. Note that in image classification, a wide domain of research
in neural networks, authors focus on the accuracy most of the time, while this area appears more
frequently in machine learning for tabular data for instance.
F1 score
The F1 score is a particular case of F-measure (with parameter equal to 1), an harmonic mean:
\[
F_1(y,\hat{y}) = \frac{2\,\#TP}{2\,\#TP + \#FP + \#FN}
= \frac{2}{\dfrac{1}{\mathrm{PREC}(y,\hat{y})} + \dfrac{1}{\mathrm{RECALL}(y,\hat{y})}}\,.
\]
MCC
The "Matthews Correlation Coefficient" is related to the Pearson correlation for a contingency
matrix, such as a phi coefficient or mean square contingency coefficient,
\[
\mathrm{MCC}(y,\hat{y}) = \frac{\#TP\,\#TN - \#FP\,\#FN}{\sqrt{(\#TP+\#FP)(\#TP+\#FN)(\#TN+\#FP)(\#TN+\#FN)}}\,.
\]
All these indicators for two classes are generalized to a multi-class model, with a confusion matrix whose rows and columns cover the K classes instead of just two, while the other indicators are also extended by counting the errors per class and summing.
Herein, the implementation of the indicators comes from the module sklearn. These indicators are
available from the sub module metrics, with the names "accuracy_score", "precision_score" and
"recall_score" and "confusion_matrix", with the first argument for the true target variable and the
second one for the predicted one.
��� �������������
������ �����������������������
� ������ ����� �� ��
� �� � �� � ��
� � � � � ��
� ���� � ���������������������������������������
� � � ����������� ��������������������������������������� �
� ���������������������������������������� ��
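A readable sketch of these calls, on a small hypothetical pair of label vectors, could be:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 1, 0, 0, 1, 0, 1, 0]        # hypothetical true labels
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]        # hypothetical predicted labels
print(accuracy_score(y_true, y_pred))    # proportion of correct predictions
print(precision_score(y_true, y_pred))   # #TP / (#TP + #FP)
print(recall_score(y_true, y_pred))      # #TP / (#TP + #FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))  # rows: true classes, columns: predicted classes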
Let us save the dataset with the python module numpy in text format.
� ������������������������������������������������������
� ��������������������������������������������
������ ����� �� ��
�� � ����������������������������������������������
���� � �����������������������������������������������
� � �����������
� � �������
� � ������
�������������������������������
�� � ������������
�� � ������������
�����
���� ���
A one-dimensional format for y is often preferred by some python modules; otherwise an unrelated exception may be raised without further explanation, so checking beforehand whether it is a flat (one-dimensional) array or a list-like format is a good habit, in order to avoid cryptic error messages from python.
� � ���������
������ ����������������� �� ���
���� �� � ��������������
����������
Figure 2.4: Data sample for two classes with a linear frontier
From the graphic, we get two classes that are almost separated, as expected, because the independent variables are generated separately for each class with a distance between the means larger than the standard deviations. The logistic regression will be able to find the frontier between the two classes, again with a line as in linear regression but obtained in a different way. Actually, this is the reason why these two models are called "linear models". Note that there is no closed-form solution, on the contrary to the linear regression, hence an iterative algorithm is required whatever the size of the dataset. This is common to many methods in statistics, machine learning and even neural networks, which will be studied in a next chapter. Note that the algorithm will ask for a computation of the first-order derivative of the cost function, while the second-order derivative is not required by all numerical algorithms. Even if it is used by default in the implementations of this chapter, second-order methods are less used in deep learning because of the size of the networks and the large number of weights or unknown parameters to find in order to fit the model to the sample.
������ ����� �� ��
���� ����������������������� ������ ����������������
���� ���������������� ������ ������������������
���� �������������������� ������ ������������������
������ � ����������������������������������
������ � ������������� ����������
Here the algorithm called 'lbfgs' by sklearn has been chosen, but other solvers are available, such as 'newton-cg' for instance; both are second-order methods and thus better suited to small datasets.
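A minimal sketch of such a fit could be as follows (the array names X2 and y2 for the two simulated covariates and the binary labels are assumptions):

from sklearn.linear_model import LogisticRegression

# Assumed arrays: X2 of shape (n, 2) with the two covariates, y2 the binary labels.
clf = LogisticRegression(solver="lbfgs")
clf.fit(X2, y2)
print(clf.intercept_, clf.coef_)    # beta0 and (beta1, beta2)
print(clf.score(X2, y2))            # training accuracy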
The parameters and the accuracy after training follow.
����������������������������������������������������
������������������������������������
���������������������
� ������������
� �������������
���� � �����������������
������������������������������������
���� ����
A hyperplane is the frontier between the two classes, such that σ(β̂^T x_fr) = 0.5; with two dimensions, where p = 2, this leads to the equation:
\[
\hat{\beta}_0 + x_{fr1}\hat{\beta}_1 + x_{fr2}\hat{\beta}_2 = \ln\!\left(\frac{0.5}{1-0.5}\right) = 0.0\,.
\]
This line is added to the graphic with the two classes as colored points, in order to illustrate the resulting classification. Such a function is presented for drawing several lines, which will be useful in the next chapter, while only the single solution from sklearn appears below.
��� �����������������������������������������������������
���������� ������������ ���������� �����������
�� � ����������������������������
�� � ����������������������������
��������������
��������������
��� � ���������� � �����������������������������
��� � ������������������������������
������������������������
���������������������� ���������������
��������������������� ������ ��� ������������� ���������
������������������������������������������������������
��������������������������������������������������������
��������������������������������������������������������
�→������
The number of errors can be recognized visually (when points do not overlap) by counting, on each side of the line, the points which do not belong to the class corresponding to that side. This also underlines that for a nonlinear frontier a line is not enough, which explains why nonlinear models such as neural networks are required. Some errors can be removed by improving the frontier between the two classes.
2.2.6 Indicators with sklearn
For the classification, the indicators are computed in a function with optional printing of the indi-
cators and drawing of the scatterplot (not implemented here) between true and predicted outcome.
��� ����������������������������������������������������������
������������������������������������
�������� � ��������������������������� �����
��������� � ������������������������� �����
��������� � �������������������������� �����
��������� � ����������������������� �����
�� ��������
���������������� ��������
���������������
��������������� � �� ���������������������
���������������� � �� ���������������������
������������� � �� ���������������������
�� ������
����
One recognizes the accuracy, precision and recall, plus the confusion matrix, which were described previously in the chapter. For the bivariate example just before, the resulting error is as follows.
��������� ������
���� ��
� � ����
�������� � ����
��������� � �����
������ � �����
Here, we are mainly interested in the accuracy, which may also be computed during the training if a loop is implemented. As expected from the visual inspection of the classes, the classification error is small, 0.06, because the classes are almost linearly separable.
2.3 Multinomial regression (3 or more classes)
When the classification involves more than two classes, either a rule is built from binary classifiers, or the model is updated for multiple classes. The first way asks for combinations of binary classifiers, while the second way asks for a new distribution for the classes. Following the modeling approach of the chapter, this results in:
\[
y_i \in \{1, 2, \cdots, K\}\,.
\]
Here the number of classes may be equal to 3 or more. The distribution is no longer Gaussian or Bernoulli; it is a multinomial law, which allows a label with more than two values.
The sigmoid function is replaced with a softmax function for the probabilities, which generalizes the sum-to-one constraint to K probabilities instead of just 2, with the expression:
\[
p(y_i = k \mid x_i, \beta) = \frac{\exp(\beta_k^T x_i)}{\sum_{k'=1}^{K} \exp(\beta_{k'}^T x_i)}\,, \qquad k = 1, \ldots, K.
\]
The method is now named "multinomial regression" or "softmax regression". Instead of one linear regression transformed by a nonlinear function, there are K linear regressions, each with its own coefficients, embedded in the softmax function. Thus, this model remains linear, as the two previous models.
The target variable yi is rewritten with binary vectors with components yik , here the value is one if
yi = k and zero otherwise.
The new likelihood is now:
\[
P(y \mid X, \beta) = \prod_{i=1}^{n}\prod_{k=1}^{K}\left[ p(y_i = k \mid x_i, \beta_k) \right]^{y_{ik}}.
\]
The usual training algorithm remains similar to the one for the logistic regression, via gradient descent or Newton-type methods. Note that in the literature, an advice is to subtract the maximum of the K regression scores within each class probability separately, in order to avoid numerical issues. The code with the python module sklearn is as previously for two classes, with the same function "LogisticRegression()", which is able to handle a multicategory target variable.
2.4 Exercices
1. (sklearn) Instead of the logistic regression, test the linear regression for the binary target variables and compare the resulting classification of this (bad) model. We will choose 0 and 1 as values for yi. This is convenient because the linear regression gives a continuous value for the prediction ŷi, hence a first choice is simply to compare it with 0.5 for deciding the class: yi = 1 if xi^T β̂ > 0.5, and yi = 0 if xi^T β̂ ≤ 0.5. Compare with the logistic regression for the (very) small dataset of the chapter, or other data. Which model performs better for predicting the target variable?
2. (sklearn) Test a linear regression or classification on two or three small datasets from the uci machine learning repository, kaggle or sklearn. These datasets are generally pre-processed, so that mostly missing values may need to be removed or imputed for some of them, while categorical independent variables may need to be binarized in order to enter the models. For instance, the following datasets are usual benchmarks.
Dataset #rows #vars #classes
Iris 150 4 3
Breast cancer wisconsin 569 30 2
Wine Quality 4898 11 NA
Credit Card Default 30000 22 NA
Pima Indians Diabetes 768 20 2
Ionosphere 351 33 2
Glass Identification 214 9 6
Ecoli 336 7 8
Abalone 4177 8 NA
KDD Cup 1998 191779 481 NA
3. (sklearn) Test a regression or a classification for one large dataset, with for the training: a) the full sample for a batch algorithm and b) subsamples for a minibatch algorithm, via cycling chunks from the datafile. For instance, the "Higgs" dataset from the communication "Searching for Exotic Particles in High-energy Physics with Deep Learning" (2014, Nature), with 11000000 (11M) rows and 28 variables while y is binary, available at the uci in csv format, or online in hdf5 format. Another, larger dataset is available at "https://developer.ibm.com/data/airline/" in gzip format from a csv file, with 200 million (200M) "domestic US flights" to "predict the likelihood of a flight arriving on time", or its subsample of 2000000 (2M) rows.
4. (sklearn) Test a regression to predict the forest cover types from the dataset available from the uci machine learning repository at the entry "Covertype Data Set", with 581012 data vectors and 54 variables (attributes or features). This dataset is from "Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables" (1999, Elsevier).
Chapter 3
Pytorch allows inferring the weights of neural networks by automatically computing the gradient of a cost function and implementing the optimization loop, without requiring anything else than the definition of the network and the definition of the cost function. Almost all the burden of the optimization is avoided, except some settings meant to help the optimization perform well with the data. This means that for the linear regression or the logistic regression, where the vector β is unknown, one just needs to define a neural network corresponding to the linear or logistic model, add the cost function, choose an option for the optimization and launch the learning procedure which finds a numerical value for β̂. The weights w of the network are identified with the regression coefficients, while the structure of the network is very simple in this case, one of the first to appear in the history of neural networks.
For minimizing the cost function ℓ(w), the idea for learning the parameters of the network is an iterative algorithm called gradient descent: from an initial value w^(0), and with a choice of learning rate αt, often constant and equal to a small value α, it is iterated until stabilization:
\[
w^{(t+1)} = w^{(t)} - \alpha_t\, \frac{\partial}{\partial w}\,\ell_b(w)\bigg|_{w^{(t)}}.
\]
Here b is a (mini)batch of indexes i, with s_b a subset of {1, · · · , n}. Choosing the whole set or only one single index i is possible, but the latter often leads to a slower convergence. Here we will first choose just one datum and iterate over the whole sample. This approach adds randomness to the learning algorithm, which makes it more efficient, and also avoids the heavy computation of the gradient on the whole dataset at each step, which is quicker. When s is the set of indexes {1, 2, · · · , n} with ∪_b s_b = s, the main variants are defined as follows for each iteration of the algorithm:
• when a singleton b = i ∈ s from the sample is chosen not randomly but by cycling, the algorithm is called "gradient descent"
• when a subset s_b ⊂ s from the sample is chosen not randomly but by cycling, the algorithm is called "minibatch gradient descent"
• when a singleton is chosen randomly from the whole sample, the algorithm is called "stochastic gradient descent" (the Robbins–Monro procedure)
• when a subset is chosen randomly from the whole sample, the algorithm is called "stochastic
minibatch gradient descent"
• when the full sample sb = s is chosen, the algorithm is a "batch gradient descent".
This also supposes a good choice for the learning rate αt, which must change according to the size of the minibatch. The stochastic (minibatch) gradient descent approaches are able to reach outstanding solutions for deep neural networks as the size of the dataset rises, which ensures generalization and nearly optimal solutions. For a stochastic procedure, one just needs to shuffle, that is randomly change the order of, the units in the dataset before each epoch, where an epoch is a full cycle over the dataset. Despite huge volumes of data and huge numbers of weights, the training has become possible nowadays. Full-batch iterations are generally very slow to compute and eventually prone to bad solutions for nonlinear functions, where they get trapped in local minima.
There exist some hypotheses required for the convergence of the gradient descent which are not discussed here. Convexity is one which ensures reaching the optimal solution, but relaxing this condition becomes possible with the stochastic version. These algorithms are relevant even when the cost function is nonlinear with several local minima instead of just one minimum. They are also suitable for large datasets, or for non-invertible matrices in the case of the linear regression for instance. The loss functions of linear regression and logistic regression are convex, which allows non-stochastic gradient descent to converge fast to the optimum. The choice of αt is important for ensuring the convergence, where a stable value is obtained at the final steps, but also a good convergence, where the solution at the last step is relevant. These two conditions are required in order to get a useful numerical value for β̂; otherwise another function or value for αt is needed until a suitable result is obtained. Changing the initialization is also important in order to change the local minimum of the nonlinear function when the optimum is not reached at the end of the iterations. Think of the example of a polynomial curve: there are several minima and the algorithm is able to go down to the minimum of each valley, but not always to change valley, because this would suppose big enough changes during the first iterations, which is only possible when the gradient is not already small, near zero, as at convergence. The gradient of the function needs to become smaller and smaller during the algorithm in order to ensure the convergence and, according to the update formula above, a large change of value needs either a large learning rate or a large gradient.
Next, such an algorithm is implemented with numpy before using the power of pytorch. The optimization is first performed with numpy, by writing the closed-form expression for the derivative from linear algebra, before the automatic derivative with pytorch: with numpy and pytorch for the linear regression, and directly with pytorch for the logistic regression.
rows, as this leads to a nearly similar result here.
The set of observations or data sample is (xi, yi)_{i∈{1,...,n}}. We want to estimate the regression coefficients with a vector β̂. The loss function called "mean squared error" is minimized such that,
\[
\hat{\beta} = \underset{\beta}{\arg\min}\ \ell(\beta),
\qquad
\ell(\beta) = \sum_{i=1}^{n} \ell_i(\beta) = \sum_{i\in s} \left( y_i - \beta_0 - \beta_1 x_{i1} \right)^2.
\]
Here n may be very large, hence it is better to have an algorithm which does not need the full sample at each iteration.
We have seen previously that the gradient of the cost function with respect to the weights is written as a vector which enters the iterative procedure. As explained before, we add the component one to the vector in order to get xi = (1, xi1)^T, while aggregating all the vectors in the design matrix X, so that in the end the gradient is written as follows,
\[
\frac{\partial}{\partial\beta}\,\ell_i(\beta)
= \begin{bmatrix} \dfrac{\partial}{\partial\beta_0}\ell_i(\beta) \\[2mm] \dfrac{\partial}{\partial\beta_1}\ell_i(\beta) \end{bmatrix}
= -2\begin{bmatrix} \left(y_i - \beta_0 - \beta_1 x_{i1}\right) \\[1mm] \left(y_i - \beta_0 - \beta_1 x_{i1}\right) x_{i1} \end{bmatrix}
= -2\left( y_i - x_i^T\beta \right)\begin{bmatrix} 1 \\ x_{i1} \end{bmatrix}.
\]
Now we just need to choose a value or a function for αt = α(t), which we take constant and equal to αt = α = .01 here, for instance. This is implemented just below. Next we follow the update rule for β after a random initialization of the vector, hence we implement:
\[
\begin{bmatrix} \beta_0^{(t+1)} \\[1mm] \beta_1^{(t+1)} \end{bmatrix}
= \begin{bmatrix} \beta_0^{(t)} \\[1mm] \beta_1^{(t)} \end{bmatrix}
- \alpha_t \times (-2)\left( y_i - x_i^T\beta^{(t)} \right)\begin{bmatrix} 1 \\ x_{i1} \end{bmatrix}.
\]
The algorithm stops at the final step t = T. In general several cycles over the whole data sample are needed; each cycle is called an epoch. This translates into python as follows, in a vectorized way, because vectors are handled faster than scalar components in loops.
Dataset from the files
��� ����������
������ ���������������������������
������ ���������
�����������������������
������ ����� �� ��
���� � ������������������������������������������������
�� � ����������������������������������������������
���� � �������������������
� � ����������������������������
� � ����������������������������
� � ������
� � �������������������������� ���
�������������������
�����������
����������
�������
� �����
The regression coefficients are estimated from the sample with algebra as:
���������� � �����������������������������������������
���������� � �������������������������
�������������������
�����������������
�������
�����������������������������������������������������������
����������
��������������
� ������������
�������� �������
Loop with data one by one and parameters update via numpy
A first implementation uses a simple loop; a second implementation is more neural-network oriented; a third implementation uses pytorch for the derivative; then comes a pure pytorch implementation and, finally, an object-oriented version with a parent class from pytorch, which will be the way to proceed in the next chapters.
������ ����
���������� � �������������������� � �
�������� � ������������������������
������� � ��� � ��������������
������� � ����
������� � ��
��������� � ��
���������� � ��
��������� � ��
����������������� �������������������������������� �
� � ������
��� ����� �� �����������������
��� � �� �����������
�� � ���������������������
�� � ����
��������� � �����������������������
������� �� ������� � ���������
����������������� �������������������������������� �
������������������������������������
����������������� �������������������������� �
������������������ ���������������������� �
�������������
The results are the final estimates of the regression coefficients and the loss. It is wise to show the shape of the loss after training, in order to check the convergence, as a not-too-fast and not-too-slow decrease generally indicates a relevant choice of the learning rate; otherwise another choice is needed. In any case, the gradient should be near zero in order to confirm that an optimum (here unique for the linear regression) was reached at the end of the loop. The resulting curve validates the learning rate.
The algorithm performs well even if there was no advanced tuning of the learning rate, just a few tries (not presented, but the reader may try other learning rates and compare the obtained curves, see the exercices). Note that some stochastic algorithms would sample the data vectors in some cases; this is not considered here. Values of αt among 10^u, with u an integer between −6 and 3, are a usual try, as there does not yet exist a formula to find it automatically.
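A readable numpy sketch of this one-datum-at-a-time loop (the arrays X and y are assumed to hold the design matrix and the targets, as in the previous listings) could be:

import numpy as np

rng = np.random.default_rng(0)
n = X.shape[0]
lr = 0.01                                    # constant learning rate alpha_t
beta = rng.standard_normal(2)                # random initialization
losses = []
for epoch in range(50):
    for i in range(n):                       # cycle over the sample, one datum at a time
        xi, yi = X[i], y[i]
        grad = -2.0 * (yi - xi @ beta) * xi  # gradient of (yi - xi^T beta)^2
        beta = beta - lr * grad              # descent step
    losses.append(np.mean((y - X @ beta)**2))
print(beta)                                  # close to the closed-form solution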
\[
\ell(\beta) = \sum_{b=1}^{B} \ell_b(\beta)
= \sum_{b=1}^{B}\sum_{i\in s_b} \left( y_i - \beta_0 - \beta_1 x_{i1} \right)^2.
\]
The new update formula for β (t) is thus written not from a single data vector but from the whole
subset in the minibatch, roughly n/B, as:
\[
\frac{\partial}{\partial\beta}\,\ell_b(\beta)
= \begin{bmatrix} \dfrac{\partial}{\partial\beta_0}\ell_b(\beta) \\[2mm] \dfrac{\partial}{\partial\beta_1}\ell_b(\beta) \end{bmatrix}
= \sum_{i\in s_b} -2\left( y_i - x_i^T\beta \right)\begin{bmatrix} 1 \\ x_{i1} \end{bmatrix}
= -2\, X_b^T \left( y_b - X_b\beta \right).
\]
This is implemented just below, with a size |s_b| of the batch b equal to 10. For the training, the same batches are cycled at each epoch, but shuffling is also possible (by reordering constant minibatches, or by sampling new minibatches with or without replacement) at each epoch, with
\[
s = s_1 \cup s_2 \cup \cdots \cup s_B\,.
\]
This is instead of changing their order or sampling batches at each step. It is convenient for fast access to datafiles, for example, as the aim is often to process large datasets which cannot fit in the computer memory. The new update formula is thus as follows,
\[
\begin{bmatrix} \beta_0^{(t+1)} \\[1mm] \beta_1^{(t+1)} \end{bmatrix}
= \begin{bmatrix} \beta_0^{(t)} \\[1mm] \beta_1^{(t)} \end{bmatrix}
- \alpha_t \times (-2)\, X_b^T \left( y_b - X_b\beta^{(t)} \right).
\]
Thus,
�������� � ������������������������
������� � ���� � ��������������
������� � ����
������� � ��
�������������
��� � ����������������������
�→�������������������������������������������
��������� � ��
���������� � ��
��������� � ��
����������������� �������������������������������� �
� � ������
��� ����� �� �����������������
��� � �� ��������������������
� � ������������������������������
�� � ������
�� � ����
��������� � �������������������������� � ��������
������� �� ������� � ���������
����������������� �������������������������������� �
����������������� �������������������������� �
������������������ ���������������������� �
�������������
���������������������������
�������
��������������������������������������������������
��������
��������������
� ������������
�������� �������
There is a nice decreasing curve for the loss during the training, because the cost function is a smooth convex parabola and the learning rate was chosen neither too large nor too small, even if probably not optimally. A better setting would lead to a solution nearer to the optimum β̂. With this setting, the convergence may be slower than with the stochastic algorithm.
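A readable numpy sketch of the minibatch variant (again with assumed X and y, and a learning rate that may need retuning for the batch size) could be:

import numpy as np

rng = np.random.default_rng(0)
n, batch_size, lr = X.shape[0], 10, 0.01
beta = rng.standard_normal(2)
for epoch in range(50):
    for start in range(0, n, batch_size):            # constant minibatches cycled each epoch
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        grad = -2.0 * Xb.T @ (yb - Xb @ beta)         # minibatch gradient
        beta = beta - lr * grad
print(beta)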
"https://pytorch.org/tutorials/beginner/former_torchies/tensor_tutorial.html"
dataset, the cpu is enough and the gpu requires installing cuda before being available. In this chapter we stay in the computer memory and with the cpu. For small datasets, an implementation with numpy may be faster because it is more direct, but it asks for explicit expressions.
������ �����
�����������������
��������������
In order to get the data in a format relevant for pytorch, it is usual to use a class called "DataLoader", which is optional, as one may write one's own loop for cycling through a source of data. This python class gives an iterator in order to cycle the dataset with minibatches, wherever the data live: computer memory, gpu memory, hard disk, or even a remote server over the network, because it is possible to write one's own class. Hence, the dataset is first imported into a dataloader object, and then the autograd mechanism, which computes the value of the gradient of a function automatically, is used.
Dataloader
There are several ways to implement this class, for instance by creating our own class, but here we just create new objects of the class torch.Tensor from the whole dataset, because the numpy array is very small and fits in the computer memory. After that, the iterator is created by instantiating a new object of class DataLoader, while the operator function "iter()" allows getting the X and y of each minibatch (of size 10).
��������������������������������������������������
��������������������������������������������
����� �� ����� ��
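A minimal sketch of this construction, with assumed numpy arrays X and y as before, could be:

import torch
from torch.utils.data import TensorDataset, DataLoader

X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)
dataset = TensorDataset(X_t, y_t)                  # pointer to the in-memory tensors
dataloader = DataLoader(dataset, batch_size=10, shuffle=False)
Xb, yb = next(iter(dataloader))                    # first minibatch of size 10
print(Xb.shape, yb.shape)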
The main interest of these classes appears in the next chapters, to load a datafile sequentially and partially from disk, directly through the classes Dataset and DataLoader, without loading the full data in memory, for any format implemented in the class (compressed binary formats or multimedia files in several directories, for instance), while applying some transformation (converting to a tensor format or to a different shape, for instance).
The automatic computation from pytorch with tensors via the cpu (or the gpu if added) is then written as a function here, but this is for later use; otherwise an implementation without a function is easier to debug if required.
First, an object of class Dataset representing a pointer to the data in memory or on the computer disk is created and, for cycling through the data sample, an object of class DataLoader which acts like a python "iterator"; they are named respectively "dataset" and "dataloader", as explained in the paragraph above.
Then, the function for the optimization with pytorch is defined as follows. The iterator from the dataloader provides the minibatches, of size 10 here, leading to the objects Xb and yb which are parts of the whole design matrix X and of the whole vector y aggregating the target variables yi.
The vector of regression coefficients is declared as requiring a gradient, with the command "requires_grad_(True)", so that the automatic derivative is computed at the call of the function "backward()". In short, a separate function was implemented for the updates, with the gradient retrieved via the attribute "grad" of the Tensor object which contains the regression coefficients, "betat_ag", and it is reset to zero before new data. The context "with torch.no_grad()" avoids tracking the gradient during computations other than the gradient or parameter updates.
��� ���������������������������������������������������
������� �� ������� � ������������
��� ���������������������������������������������������
�������������������������
���������������
��������� � ��
��������� � ��
� � ����������������������������������������������
� � ����������������������������������������������
The call to this function for updating the parameters after one step is as follows:
���� � �������������������
������� � ����
�������� � � ���
�������� � ���������������������������
���������������������������
�������
����������������������������������������
��������
�����������������
���������� �������������������
������������
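A minimal sketch of this training step, reusing the dataloader built above (one epoch shown; the tensor name betat_ag follows the text, the rest is an assumption), could be:

import torch

betat_ag = torch.randn(2, requires_grad=True)      # coefficients tracked by autograd
lr = 0.01

def f_update(beta, lr):
    # parameter update performed outside of the autograd graph
    with torch.no_grad():
        beta -= lr * beta.grad
        beta.grad.zero_()                          # reset before the next minibatch

for Xb, yb in dataloader:                          # one epoch over the minibatches
    loss = torch.mean((yb - Xb @ betat_ag)**2)     # minibatch mse
    loss.backward()                                # gradient stored in betat_ag.grad
    f_update(betat_ag, lr)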
It must be noticed that changing the loss obviously changes the gradient but, more importantly, there is a difference between the mean and the sum, with a factor 1/n; this must be kept in mind when choosing the learning rate, which is proportional to this factor by linearity and by definition of the gradient. One can compare the value of the numerical gradient in "betat_ag.grad" with the analytical expression written with numpy: they are exactly the same (up to a possibly very small numerical difference with real data, often negligible).
������� � ������������������������
To compare the numerical and analytical gradients, let us rewrite the update by computing the difference between both, as follows.
������������ � ��
��� ���������������������������������������������������
�������� � �������������������������� � ��������
��������� � �������� � ������������
�������������������� ���������� ��������� ���� �
������� �� ������� � ������������
��������������������
�������� � ���������������������������
��������������������� ��������������������
����� ����
This avoids having to find the mathematical expression for the derivative, which may not even be possible to compute exactly in closed form with paper and pencil for more complicated loss functions. A verification of the gradient is advisable at least at the end of the training, in order to check that the numerical procedure has performed well (which should be the case most of the time), but also that the gradient is small enough, as required.
3.1.5 Visualization of the parameters trajectory
Let us draw the intermediate solutions for β^(t) at each update of each epoch, in order to further understand what the algorithms do. This supposes having run "f_minimizeloss_loop_torch" with "n_epochs=20". There are three different implementations of the training algorithm, hence three curves for three trajectories appear on the graphic below.
The graphic just above shows the cost function as a function of its parameters, as follows:
\[
\beta \mapsto f(\beta) = \mathrm{MSE}(y, X\beta)\,.
\]
This allows showing the shape of the loss, which is convex as a 3d parabola, but also the intermediate solutions from each epoch until the convergence. The surface is quadratic, as expected from the expression of the loss, hence the algorithms need to perform a descent along the surface down to the minimum. The best algorithm with the best settings would be the one with the fastest descent. The initial positions are different for the algorithms, hence a fair comparison is not at stake. If the three trajectories were shown per epoch instead of per update, they would be smoother and more similar, considering the fast convergence near the solution.
It is observed that the algorithms converge towards almost the same solution, near the exact one from linear algebra. This is without inverting a matrix, only with the gradient of the loss function, which is computed only partially, with one data vector or a few of them at each iteration.
The python function for this graphic is as below.
��� ���������������
���� � � � �������
������ ������������������� � ���������
��� ��������������������������������������������������
������������ �������������
��� �� � ������������������������ ���������������
�������
��� � �� �������������
��� � �� �������������
������ � �����������������
������� � ����������
���������������� �������������������
������������� ����� ����� ������� ����������������
��������������� ��� ��� ��������������������
������������������������
������������������������
���������������� �������� ��� �������� ���������� ��� ���
��������������������� �� ���������
��� ������ �� ������������������
����������������������������������
������������������������� ����� �������������
��� ������� ������ �� �����������������������������������
��������������������������������� ������������������������
������� ���������������
�������������������� ����������� ����� �������������
����������
��������������������������
��������������������������������������� ������ �����������
����������� ���������� �����������
������ ������ �������
Next, these python functions are further illustrated with an example for classification.
Here σ is usually the sigmoidal function, otherwise the regression is called with another name.
The set of observations or data sample is s = (xi , yi )i∈{1,...,n} , with yi binary. We want to estimate
the coefficients with a vector β̂ . The loss function called cross-entropy loss is minimized such that,
\[
\hat{\beta} = \underset{\beta}{\arg\min}\ \ell(\beta)\,.
\]
We are directly interested in the minibatches; thus, with B minibatches, the loss function is now written as B sums over subsets of s, such that,
\[
\ell(\beta) = \sum_{b=1}^{B} \ell_b(\beta)
= \sum_{b=1}^{B}\left( -\sum_{i\in s_b}\left[\, y_i\,(\beta^T x_i) - \log\left(1 + \exp(\beta^T x_i)\right) \right] \right).
\]
The new update formula for β (t) is thus written not from a single data vector but the whole subset
in the minibatch, roughly n/B, and from the formula of the previous chapter just for the subset,
\[
\frac{\partial}{\partial\beta}\,\ell_b(\beta) = -X_b^T\left( y_b - p_b \right).
\]
The algorithm stops at the final step t = T . The new update formula at each iteration is thus as for
the linear regression but with a different gradient function because the loss function has changed.
This is a first-order algorithm,
\[
\begin{bmatrix} \beta_0^{(t+1)} \\[1mm] \beta_1^{(t+1)} \end{bmatrix}
= \begin{bmatrix} \beta_0^{(t)} \\[1mm] \beta_1^{(t)} \end{bmatrix}
- \alpha_t\, \frac{\partial}{\partial\beta}\,\ell_b(\beta)\bigg|_{\beta^{(t)}}.
\]
This translates into python as follows, in a vectorized way, because vectors are manipulated instead of scalar components.
� � �����������
� � �������
� � ������
�������������������������������
���������� ���������
���� ���
� � �������������������������� ���
�������
����� ��
��������� � �������������������� � �
������������ ��������������
The training is implemented with pytorch, from the previous function, by just changing the loss
and the parameters for the inference.
������ ����� �� ��
������ ����� � ���� ����� ������ ������
���� ���������������� ������ ����������� �������������
���������� � ��
���� � ����������������������������
������� � ����
�������� � ���
�������� � ���������������������������
���������������������������������
�������
���������������������������������������������������
�����������
������������������
� ��������
� ��������� �������������������
�������������
���������� � ���������������������������
����������
��������������������
� �����������
� ������������ ��������������
������� � �����������������������������
������� � �����������������������������
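A minimal pytorch sketch of this training (the tensor names X2_t and y2_t for the simulated two-class sample, with a leading column of ones, are assumptions since the original listing is unreadable) could be:

import torch
from torch.utils.data import TensorDataset, DataLoader

loader2 = DataLoader(TensorDataset(X2_t, y2_t), batch_size=10, shuffle=True)
beta = torch.zeros(3, requires_grad=True)
lr = 0.1
for epoch in range(100):
    for Xb, yb in loader2:
        scores = Xb @ beta                                        # linear part beta^T x_i
        # cross-entropy of the logistic model; torch.nn.functional.softplus(scores)
        # is a numerically stable equivalent of log(1 + exp(scores))
        loss = torch.mean(-yb * scores + torch.log(1 + torch.exp(scores)))
        loss.backward()
        with torch.no_grad():
            beta -= lr * beta.grad
            beta.grad.zero_()
print(beta)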
The losses from the initialization and the solution with pytorch are compared.
� �������� � �������������������������������������������
�→��������������������
� ��������������������������������������������������������
������������� � ���������������������������������������������������
�→������������������
������������������������������������������������������������������
�������������� ��������
���������� � ���������������������������������������������������
�→�������������������
����������������������������������������������������������������
��������������� ��������
Accuracy
��������� � ���������������������������������������������������������
���������� � ����������������������������������������������������������
� ������� � �����������������
����������������������������������������
� ������������������������������������
������������������������������������������
��������� ����
���������� ����
The solution with pytorch performs slightly less well for the loss and the accuracy in comparison to sklearn, for this dataset and the chosen hyperparameters (see the exercices). Note that for a large dataset, the design matrix (the xi and also the yi) is stored on the computer disk, thus a loop is required in order to retrieve the rows of the matrix (with minibatches) and compute the predicted target incrementally, in order to compare it with the true one and compute the accuracy, as explained in the next chapter.
Just below, the frontier is visualized in a graphical representation.
Frontiers
Let us plot again the frontier corresponding to the cut on the probability that a data vector belongs to one class or the other, with the threshold 0.5 as in the previous chapter, but here with several frontiers (from the initial and intermediate values with pytorch, plus the true value).
� ����������� ������
������ ���������������� �� �����
������ ����������������� �� ���
���� ����� ���� � ��������������� �� �����������������
Figure 3.3: Intermediate frontiers for the two classes during convergence
The solution is slightly different from the one given by the machine learning module, but the frontier looks almost the same, with nearly the same error rate. Another difference is that we have a custom algorithm and will be able to modify it as much as required, contrary to existing modules, which are often cumbersome for the end user to update.
3.3 Exercices
1. (sklearn) Retrieve the regression coefficients with sklearn. There is a penalty term added to the loglikelihood by default, proportional to (1/C)(β1² + β2²) with C=1.0, which is removed here with the option penalty='none', but this term should be kept and tuned for better results with new data (see a later chapter about this).
2. (sklearn) Test the algorithms with real data, as in the exercice of the previous chapter, for the same datasets, and compare the results with sklearn.
3. (python) Test other values of the learning rate in the presented set of candidate values, and
around the best one in order to improve the estimation for β .
4. (python) Test also other minibatch sizes and change the learning rate accordingly; how do the results change?
5. (python) Test other settings for the learning rate as a decreasing function of the epoch instead
of a constant.
6. (python) Draw the trajectory of the training for beta in the logistic regression, with two graphics: one for the coefficients β0 and β1, one for the coefficients β0 and β2. After a vertical arrangement of the two graphics, propose a way to get both graphics side by side, in a horizontal arrangement. It is possible to improve these graphics by showing the mse and the real beta from the true distribution instead of an estimated one.
7. (stat) How to get a nonlinear frontier from the usual logistic classification instead of the
linear one, in any available python implementation.
8. (stat+python) Propose a better strategy for choosing the minibatches, or for initializing the
weights for the simple linear regression, and check graphically by comparing the learning
curves.
9. (python) Rewrite the python code by selecting random batches (a. one random single unit
from the sample at each epoch instead of a cycle or, b. again n/B subsamples after shuffling
the index of the rows in the data table at the beginning of each epoch) instead of the constant
non stochastic subsamples, and compare the two learning curves.
10. (stat+python) Test the algorithms in "variance reduction for stochastic gradient optimization", "SAGA" and also the "multi-point bandit feedback" approach (among others) by implementing the algorithms with numpy or pytorch. Is the inference improved? How and why?
11. (stat) Justify analytically why the gradient descent converges to the optimum for a convex
function.
Chapter 4
In the first three chapters, the foundations of linear regression and linear classification were recalled, then a first approach was introduced to perform parameter learning with pytorch. In this chapter, pytorch is used to model and train neural networks via its implemented python classes. The target variable yi is the output of the network while the vectors xi feeding the network are the input variables. The unknown nonlinear transformation which takes xi and provides yi must be captured by the neural network during the training of its weights in order to offer an accurate prediction for current and also future inputs.
\[ y_i = f(x_i). \]
Hence, the true and unknown function $f(\cdot)$ is approximated by a neural network, $y_i \approx g(x_i|W)$ with weights $W$, so that after training
\[ \hat{f}(x_i) = g(x_i|\hat{W}) \approx f(x_i). \]
Thus, the function $f(\cdot)$ may be very general; it has been shown in the literature that any functional relationship between xi and yi can be learnt by some neural network of the kind presented herein, called an MLP, which is a deep neural network with so-called hidden layers. The first layer is for the input variables and the last layer is for the output variables, hence these two are not hidden since their values are read from the sample, while between them lie latent representations in spaces of higher or smaller dimension than the input, called the hidden layers. The components are the nodes and they are linked by weights, in analogy with signal propagation in the brain, from the input towards the output.
For the linear cases with no hidden layers, this is depicted simply as the input nodes $1, x_1, \dots, x_p$ all feeding a single output node which computes $\hat{y} = \sigma(w_0 + \sum_j w_j x_j)$.
For the linear regression and the linear classification we have two layers, the input for xi and the output for yi, while the weights are defined on the edges between the nodes of the input and the nodes of the output. When some layers are defined between the input and the output, the neural network becomes a deep neural network able to model a nonlinear function. The loss function compares yi and ŷi for training purposes.
\[ \hat{y}_{ik} = \frac{\exp(w_{k0} + w_{k1} x_{i1} + w_{k2} x_{i2} + \cdots + w_{kp} x_{ip})}{1 + \sum_{l=1}^{K-1} \exp(w_{l0} + w_{l1} x_{i1} + w_{l2} x_{i2} + \cdots + w_{lp} x_{ip})}. \]
Here, for the third case, w is a matrix of size K × (1 + p) in order to obtain K real values.
Thus, ŷi = E(yi ), with:
• Linear regression,
\[ \hat{y}_i = w^T x_i. \]
• Logistic regression,
\[ \hat{y}_i = \frac{\exp(w^T x_i)}{1 + \exp(w^T x_i)} = p(y_i = 1). \]
• Softmax regression,
\[ \hat{y}_{ik} = \frac{\exp(w_k^T x_i)}{1 + \sum_{l=1}^{K-1} \exp(w_l^T x_i)} = p(y_{ik} = 1 \mid x_i, w_k). \]
A simplified way to understand how the network performs is that the outputs are crude estimations of the dependent variable in the three cases. An objective function is computed between the real observations $y_i$ and the predictions from the network $\hat{y}_i$, such that:
• Linear regression (one real-valued output),
\[ \hat{y}_i \leftarrow NN(x_i) \in \mathbb{R}. \]
• Logistic regression (one output in $]0;1[$ from the network, compared to a threshold $s$) with predicted output $\hat{y}_i \in \{0,1\}$ after "cutting",
\[ \hat{y}_i \leftarrow NN(x_i) \in\, ]0;1[ \]
\[ \hat{y}_i \leftarrow \begin{cases} 1 & \text{if } \mathrm{sigmoid}(\hat{y}_i) > s \\ 0 & \text{if } \mathrm{sigmoid}(\hat{y}_i) \le s \end{cases} \]
• Softmax regression (several outputs in $]0;1[$ compared among themselves to choose the predicted $\hat{y}_i$ in $\{1, 2, \cdots, K\}$) after "argmax",
\[ \hat{y}_i = (\hat{y}_{i1}, \hat{y}_{i2}, \cdots, \hat{y}_{iK})^T \leftarrow NN(x_i) \in\, ]0;1[^K \text{ with } \textstyle\sum_k \hat{y}_{ik} = 1, \]
\[ \hat{y}_i \leftarrow \mathrm{argmax}_k\, \hat{y}_{ik}. \]
Here, argmax finds the index of the largest probability among the K ones found from the
softmax transformation.
If this is not clear to the reader, an external explanation of glm may be helpful. One may also consult the python code coming next, which provides complementary information, before coming back to this paragraph if required.
Note also that in some implementations, for classification, the outputs ŷi or ŷik are in R, hence the sigmoid or softmax function may still be required. This is because the loss may treat continuous values directly when these functions are already included in it; see the table in the next subsection.
The postprocessing is implemented in "f_test_glmr()" while the training is implemented in
"f_train_glmr()". Next, an implementation via pytorch is presented before going deeper with hid-
den layers just after.
Here, the optional argument "reduction" of the loss is set to 'sum', and only a minibatch of size B is at stake, not the full sample, while ŷi = ŷi(xi) is the output of the network after feeding it with the minibatch of independent variables xi, such that:
Loss | Distribution(*) | Expression
MSELoss | Gaussian | $\sum_{i=1}^B (y_i - \hat{y}_i)^2$
GaussianNLLLoss | Gaussian | $\sum_{i=1}^B \frac{1}{2}\left( \log(\max(\mathrm{var},\mathrm{eps})) + \frac{(y_i - \hat{y}_i)^2}{\max(\mathrm{var},\mathrm{eps})} \right) + \mathrm{const}$
L1Loss | Laplace | $\sum_{i=1}^B |y_i - \hat{y}_i|$
BCELoss | Bernoulli | $\sum_{i=1}^B -w_i\,[\,y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\,]$
BCEWithLogitsLoss | Logistic | $\sum_{i=1}^B -w_i\,[\,y_i \log \sigma(\hat{y}_i) + (1-y_i)\log(1-\sigma(\hat{y}_i))\,]$
CrossEntropyLoss | Multinomial | $\sum_{i=1}^B -w_k \sum_{k=1}^K y_{ik} \log\!\left( \frac{\exp(\hat{y}_{ik})}{\sum_{l}^{K} \exp(\hat{y}_{il})} \right)$
PoissonNLLLoss | Poisson | $\sum_{i=1}^B \exp(\hat{y}_i) - y_i \,\hat{y}_i + 0.5 \log(2\pi y_i)$
KLDivLoss | KL divergence | $\sum_{i=1}^B y_i\,(\log y_i - \log \hat{y}_i)$
Here KL is for Kullback-Leibler, and for the Poisson loss, the options are log_input at True (as by
default) and full at True (instead of False by default). Note that this is the negative loglikelihood as
for the Gaussian distribution and the Bernoulli one, but some losses do not have a correspondence
with a loglikelihood.
Some loss functions accept weights for the observations; these are equal to one in our case herein. Some loss functions are robust variants of the Euclidean one, such as HuberLoss and SmoothL1Loss. Other losses, such as HingeEmbeddingLoss, SoftMarginLoss, MultiLabelMarginLoss and MultiMarginLoss, are variants for classification and are described in the documentation. The loss to be chosen may be the one giving the better results.
In the previous chapter, it has been explained how to define a python function for such loss, with
two arguments, one for the vector of true target yi and one for the vector of estimated target ŷi ,
thus the same idea is required here for other losses. It is possible to consider losses not defined in
pytorch such as for a variant of the Poisson distribution for counts.
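For instance, a minimal sketch of a custom loss written as a plain python function, here a Poisson negative loglikelihood taking the log of the mean as input (the name and the exact variant are illustrative):
import torch

def my_poisson_nll(y, yhat_log):
    # yhat_log is the log of the predicted mean, as with log_input=True
    return torch.sum(torch.exp(yhat_log) - y * yhat_log)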
In this chapter, we are interested in implementing the neural network for regression or classification with the class nn from pytorch. At the same time, we will use the existing loss functions:
• For the linear regression, this is the mean square error, with the function
"torch.nn.MSELoss()" which computes the usual sum of squares.
• For the logistic regression, this is the cross-entropy with the function
"torch.nn.BCEWithLogitsLoss". According to the documentation, this loss combines
BCELoss with a sigmoid layer, this is "more numerically stable" than writing its own loss
by combining a sigmoid at the output of the network with a BCELoss just after: the final
nonlinear transformation is already included in the loss.
• For the softmax regression, this is the cross-entropy with the function "torch.nn.CrossEntropyLoss()". As for the two-class problem, this loss is more stable, taking as input just the non-normalized output of the network and adding the softmax transformation itself.
For the BCEWithLogitsLoss, a sigmoid layer is applied internally, thus there is no need to add it in the neural network. Similarly for the CrossEntropyLoss, the softmax layer is included in the loss. There is no need to redefine them, except for better understanding or for creating new losses if needed.
The new python code uses a main module of torch, "torch.nn", which allows one to define a function of the weights. For the logistic regression, the sigmoid is already handled at the level of the loss for stability reasons; hence, in both cases, regression and classification, we just have to define a linear function of the weights for these linear models.
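A minimal sketch of what such a class may look like, with a single linear layer (the attribute name is illustrative):
import torch.nn as nn

class GLMRegression(nn.Module):
    def __init__(self, nb_nodes_input, nb_nodes_output):
        super().__init__()
        # one linear layer from the input nodes to the output node(s), bias included
        self.linear = nn.Linear(nb_nodes_input, nb_nodes_output)

    def forward(self, x):
        return self.linear(x)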
As explained before, we have a layer for the input and one for the output, hence with the bias, there
are p nodes at the input, and:
• one node for the output for the linear and logistic regressions,
nb_nodes_input = p
nb_nodes_output = 1
• more than one for the softmax regression, with K>2,
nb_nodes_input = p
nb_nodes_output = K
The bias (intercept) is automatically added (otherwise, we would add a one to each independent variable vector and a node to the input layer). Note that in the binary case we just need one node and not two, because we can decide the class with only a threshold on one output.
The parameters are retrieved from the python object model by calling the method "model.parameters()"; this object is required by the optimizer. Finally, with the weights defined implicitly by pytorch, this leads to the call of the method "model.train()" in order to put the neural network in "train mode" with gradient computations.
From now on, we adopt a more generic approach with python classes and functions when possible, in order to avoid repeating similar python code. This is justified because the foundations have been studied in the previous chapters up to here. This implies that most of the time we just need the next python code for proceeding with regression or classification, the only change being at the level of the layers. We also introduce a class called "Monitor" which will be useful later in the chapter for processing the indicators from the test sample.
The working directory is set from a function.
Let us randomly shuffle the rows to avoid any row structure before cycling through the minibatches. Note that for some files the classes are not randomly ordered but stored separately; for the training algorithm it seems better to have some diversity among the elements of the minibatches. The shuffling may also be performed with the dataloader if preferred.
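A minimal sketch of such a shuffling with numpy (variable names are illustrative):
import numpy as np

idx = np.random.permutation(len(y))   # random permutation of the row indexes
X, y = X[idx], y[idx]                 # same permutation applied to inputs and targets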
The vector y is kept flat (one-dimensional) for python; this reduces the risk of unwanted automatic "broadcasting" in the algebra with numpy or sklearn for instance.
Figure 4.2: Data sample for two classes with non linear frontier
This is the visualization of the artificial dataset at stake, with the true frontier between the two classes, a circle in this example. The training of a neural network aims at guessing this unknown frontier from the available data. An evaluation of the quality of the estimated frontier is available with the error rate.
During training, a model may overfit: it becomes ideally good for the dataset by learning all of its particularities without being general enough. The problem comes from the fact that a (not large enough) sample is never a true image of the whole population, but only a view of it, with defects and bias. Training a neural network with this misleading information is a source of error, and this is even worse when the number of parameters in the model increases, as the model becomes even more flexible and better able to learn too many of the details and particularities of the train sample.
A real-life example is someone who learns to tell car brands apart, but only from flawed examples (sunken body, dirt, rust, worn tires, older ones repaired with non-original parts): this person may not be able to recognize new vehicles because of this misleading information 2. Another example is anyone who wants to learn a new activity and is not good at it at the beginning of the learning, hence more training is needed. To train the biological neural network of the brain, the human has to capture enough data and over a long enough amount of time. An approach to check whether there is a good understanding is exercices with new content: they are the new data here, where it becomes possible to validate the knowledge of the activity. Without new challenges, it is not certain that the activity is mastered as expected.
A way to check the quality is to validate the model with a new dataset. This is the "validation set", while the "test set" is used during the training for tuning. The idea of the test set is to have an independent sample, different from the training one, and to adapt the (hyper)parameter updates in consequence. For a real-life example, the test set corresponds to the exercices which do not count for the final mark and are given during the course, while the validation set corresponds to the exercices from the exam, with a mark.
This means that we need to separate the train sample into two parts, one for the training again and the other one for the test. For large enough samples, a validation sample may also be obtained at this stage. Note that other, more elaborate methods exist for ensuring a good model or for dividing the sample (see the next chapters). This approach is the most usual one before possibly implementing more advanced ones.
In the case when two datasets are already available, we just need to declare two different objects from the class DataLoader, dataloader_train and dataloader_test, wherever the data is located: in memory, on the computer disk or even on a remote server.
The subsamples are drawn randomly without replacement: the test set is sampled first, because it is smaller, and then subtracted from the whole sample to get the train set. This is equivalent to a function in sklearn with more available options, but written in two lines here. Note that this is "with shuffle", hence the former order of the indexes is not kept. The indexes are the working backbone when processing a large dataset, because it is not possible to replicate or load the full dataset in the computer memory. The problem with some python modules is that they do not work with indexes but directly with the whole dataset loaded in memory, which is not always possible.
2 Note that this can also be seen as additional variables which are useless and prone to induce errors even on the train sample; this is discussed in another chapter.
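A minimal sketch of such a split on the indexes (the function name and the default ratio are illustrative):
import numpy as np

def f_split_train_test(n, ratio_test=0.25, seed=0):
    rng = np.random.default_rng(seed)
    idx_test = rng.choice(n, size=int(ratio_test * n), replace=False)  # smaller set first
    idx_train = np.setdiff1d(np.arange(n), idx_test)                   # remaining indexes
    return idx_train, idx_test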
As explained in the first chapter, we need two samples seen by pytorch through a "Dataset" class, one for training and one for testing, in order to check how the model behaves with new data. We create our two (or three) subsamples, for "train" and "test" (plus possibly "validation" if the test sample is used to tune the training and the sample size is large enough). Then come the iterators from the class "DataLoader", possibly with some shuffling.
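A minimal sketch of a Dataset class and the two DataLoader iterators (the class name, the key sizes and the batch size are illustrative):
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.float32)
    def __len__(self):
        return len(self.y)
    def __getitem__(self, i):
        return self.X[i], self.y[i]

dataset_train = MyDataset(X[idx_train], y[idx_train])
dataset_test = MyDataset(X[idx_test], y[idx_test])
dataloader_train = DataLoader(dataset_train, batch_size=32, shuffle=True)
dataloader_test = DataLoader(dataset_test, batch_size=32, shuffle=False)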
The three neural networks for linear regression, logistic regression and multinomial regression have the same architecture; only the loss function is changed, because the link function is included there for more stability. This is similar to glm, with a generic definition of the model where only the distribution changes, hence the loss changes too, as minus the loglikelihood.
From an instance of the class above, it is possible to extract the weights of the corresponding neural network in order to retrieve the values of the parameters before, during or after training. In the case of only two layers (input and output), without hidden layers, this is done as follows for instance.
The names of the keys may change from one model to another, thus one must be cautious with this naming. An alternative is to ask for the weight and the bias separately, or only for the weights which enter the derivative if some links are kept constant (see the documentation of pytorch).
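For instance, a minimal sketch (the key names depend on the attribute names of the class, as warned just above):
state = model.state_dict()
print(state['linear.weight'], state['linear.bias'])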
The weights of the neural network, here the initial values, are combined into a vector of regression coefficients with the following function. In this case the naming is used, but avoiding the naming may be better for generality. This is written:
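A minimal sketch of such a helper, concatenating the bias and the weights of the single linear layer into one coefficient vector (the function name is illustrative):
import numpy as np

def f_get_beta(model):
    w, b = None, None
    for name, param in model.named_parameters():
        if name.endswith('weight'):
            w = param.detach().numpy().ravel()
        elif name.endswith('bias'):
            b = param.detach().numpy().ravel()
    return np.concatenate([b, w])    # intercept first, then the slopes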
Several methods exist for initializing the neural network. As a bad initialization is likely to induce a bad convergence, this step must not be forgotten for best results and to avoid losing time tuning a model with a bad initialization. In this part, the standard default initialization from pytorch is used, as it is suitable.
4.4.2 Preparation for the optimization
In this subsection, not only the dataloader is needed: other instances of python classes (from pytorch) are required for the optimization procedure.
Early stopping
A possible way to improve the training is to compute the accuracy at each epoch and stop when, for the test sample, it begins to drop (or when the loss begins to rise). Indeed there is a moment during the training process when the model is well fitted, before it begins to learn too many particular details of the train sample.
Stopping at this moment is the decision we may want to make, called "early stopping". The loss is generally preferred here for the checking, as the accuracy is less smooth. This is not the only way to monitor the generalization quality of the model. This motivates the proposed class called "MyMonitor", which is able to signal the training loop to break its cycle via a stopping rule at the end of each epoch, just below.
In order to monitor the convergence and compute indicators, we first create a python class which is involved in a passive way during the training. This is optional and is our choice for the implementation; alternative ways could require more advanced python code, because we define just one generic python function afterwards for training the models.
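A minimal sketch of such a monitor class; it computes the test loss at the end of each epoch and exposes a simple stopping rule (the class name follows the text, the exact rule and attribute names are illustrative):
import torch

class MyMonitorTest:
    def __init__(self, model, loss_fn, dataloader_test, every=1):
        self.model = model
        self.loss_fn = loss_fn
        self.dataloader_test = dataloader_test
        self.every = every
        self.losses_test = []

    def update(self, epoch):
        # called at the end of each epoch: loss on the test sample, without gradients
        if epoch % self.every != 0:
            return
        self.model.eval()
        total = 0.0
        with torch.no_grad():
            for Xb, yb in self.dataloader_test:
                yhatb = self.model(Xb).squeeze()
                total += self.loss_fn(yhatb, yb).item()
        self.losses_test.append(total)
        self.model.train()

    def stop(self):
        # naive early stopping: the test loss increased twice in a row
        L = self.losses_test
        return len(L) > 2 and L[-1] > L[-2] > L[-3]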
Here, this version computes the value of the loss for the test sample; this computation is requested inside the training loop at the end of each epoch.
Let us create all the objects which are involved during the training of a neural network:
• the "dataset" for accessing the sample in the computer memory or on the computer disk
• the "dataloader" for cycling the minibatches through the dataset object
• the "model" of the neural network itself, which contains the unknown weights to find
• the "optimizer" for updating the weights with the values of the gradients, while possibly also updating the learning rate for an optimal minimization
• the "monitor" for computing the indicators from the test sample and checking the convergence for the stopping rule
• the "loss" function to minimize.
We also define the "number of epochs" and the "learning rate" (constant or initial, possibly with a function for its decrease during training). This learning rate is not automatic and must be chosen by the human most of the time, except for some optimizers available in pytorch which sometimes perform well and are not considered here (see the chapter on "autoencoder").
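A minimal sketch of these objects, reusing the classes sketched earlier (the numerical values are illustrative):
import torch

nb_epochs = 200
lr_init = 0.05
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')    # logistic regression here
# loss_fn = torch.nn.MSELoss(reduction='sum')            # linear regression
# loss_fn = torch.nn.CrossEntropyLoss(reduction='sum')   # softmax regression
model = GLMRegression(nb_nodes_input=2, nb_nodes_output=1)
optimizer = torch.optim.SGD(model.parameters(), lr=lr_init, momentum=0.9)
monitor = MyMonitorTest(model, loss_fn, dataloader_test)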
This optimizer is equivalent to the one used with numpy in the previous chapter, with a constant learning rate. The argument "momentum" can improve and speed up the training when well chosen.
For the regressions where the output is a simple scalar, the target variable from the minibatch is a simple vector with one unique dimension. When the output is a vector, such as in the softmax regression, some losses ask for a binarized version of the categorical values, the integers between 1 and K. Ideally, this could be treated beforehand at the level of the dataset or dataloader, but if not, some transformation is required here, as sketched below.
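A minimal sketch of such a target transformation for the present binary case (the function name and the rules are illustrative):
import torch

def f_transform_y(yb):
    # here the targets are 0/1 labels: BCEWithLogitsLoss expects a flat float vector;
    # CrossEntropyLoss would instead expect long class indices (yb.long()), and some
    # other losses a binarized version, e.g. torch.nn.functional.one_hot(yb.long(), K).float()
    return yb.float()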
The main loop is defined as follows; it requires the python objects defined above.
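A minimal sketch of such a generic training function, simplified to its core loop (it omits the device handling, the printing levels and the optional transformations of Xb and yhatb discussed below; argument names are partly illustrative):
import torch

def f_train_glmr(dataloader_train, model, loss_fn, optimizer, nb_epochs,
                 monitor=None, f_transform_y=None):
    model.train()
    losses_train = []
    epoch = 0
    while True:                                   # "infinite" loop, broken by the rules below
        epoch += 1
        loss_epoch = 0.0
        for Xb, yb in dataloader_train:           # cycle the minibatches
            if f_transform_y is not None:
                yb = f_transform_y(yb)
            yhatb = model(Xb).squeeze()           # forward pass
            loss = loss_fn(yhatb, yb)
            optimizer.zero_grad()
            loss.backward()                       # automatic gradients
            optimizer.step()                      # update of the weights
            loss_epoch += loss.item()
        losses_train.append(loss_epoch)
        if monitor is not None:
            monitor.update(epoch)                 # test loss at the end of the epoch
            if monitor.stop():
                break                             # stopping rule from the monitor
        if epoch >= nb_epochs:
            break                                 # maximum number of epochs reached
    return losses_train, epoch, monitor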
This function is generic in order to avoid copying the same code into several functions, so some time should be spent reading it and isolating the important parts:
• The "loop" is infinite with a growing number of epochs, but stops whenever the maximum number of epochs is reached or some stopping rule was defined and became true in the function monitor.stop().
• Three "transformations" are defined for yb, Xb and yhatb, for instance insuring that yb and
yhatb have same shape for the loss. The transformation for Xb may be a reduction such as a
principal compnent analysis or a random projection for instance. But it may be kept in mind
that any transformation is costly during the training, and for yb or Xb this could be included
in the dataloader in order to speed up the training, while ideally pre-computed and stored in
computer disk for loading with the dataset, this avoids any costly transformation during the
running time.
• The optional function "update_model()" for updating the weights of the model differently than via the optimizer; this is a shortcut here.
• The "device" is for computing the gradient and for updating the weights within the cpu or
the gpu. The data and the neural network must be put in the same device, otherwise the
computation is not possible with the usual current computer architectures, and this would
actually decrease the interest for having a fast gpu.
• There are three levels of "printing" the loss during the training, more or less informative: no loss, first and last losses, or all losses during training.
• The "monitor" allows to compute the loss for the test sample while the main function just
above computes the loss for the train sample. The monitor contains already the variable for
keeping the values of the loss during the training. The losses for the test and the train samples
are available in the object monitor at the end of the training for graphics and checking the
convergence.
The same idea as the algorithm from the previous chapter is recognized, except that the gradient is always automatic, the gpu is available natively for this python module, and a few options are added here for later use. If required, a way to better understand this function is to rewrite it in a simpler way without all the options. This is left as an exercice to the reader: practicing the language is the best way to master it. After a simplified version, the optional features can be added one by one, possibly in a different way. This proposed implementation looks flexible enough for the purpose of almost the whole document.
Training
We have already created an instance of the neural network class, one of the optimizer and one of the monitor. We have defined the loss function, the number of epochs and the learning rate. Now we train the model with the function "f_train_glmr()" defined above, which returns the loss, the number of iterations and the state of the monitor object. This leads to:
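A minimal usage sketch with the objects and the function sketched above:
losses_train, nb_iter, monitor = f_train_glmr(dataloader_train, model, loss_fn,
                                              optimizer, nb_epochs, monitor=monitor)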
4.4.4 Post-processing
Loss curves after training the nn for test and train samples
We have considered the test step, which is actually more of a validation step to be precise, even if the literature sometimes misnames them. In the class MyMonitorTest, a loop cycles through the test (or validation) sample without updating the weights of the neural network which were found during the training step.
The output is shown.
Figure 4.3: Loss function for train and test samples, per epoch
Comparing the two curves allows one to decide whether the neural network suffers from a generalization problem, with overfitting or underfitting, but also whether the learning rate is (at least approximately) adequate. Typically, the training loss is lower than the test loss because the model was learnt on the train sample, so it adapts better to this sample. Ideally the test loss curve should not be too far away, at least at the end of the procedure. This is not the case here, because a linear model is not enough for nonlinear frontiers between the classes.
The neural network has one scalar output for the linear regression and the logistic regression, and a vector of K outputs for the softmax/multinomial regression. These outputs are obtained by feeding the neural network with the independent variables xi from the dataset and then, for classification, processing them into probabilities in order to compute the estimated label ŷi with a threshold. These final values allow one to compute the accuracy by comparison with the true values if they exist; otherwise they are our prediction for the classes when this information is unknown.
The accuracy is computed in this part after finding the labels; an accuracy per class is also discussed and the frontiers are visualized, because a 2d representation is available for this dataset. To compute the quality of the regression or classification, we need the estimated labels or predicted target variables, which are obtained in the function below by looping over the dataset with minibatches or chunks.
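A minimal sketch of such a function for the binary classification, looping over the dataloader and thresholding the outputs (names are illustrative):
import torch

def f_predict_class(model, dataloader, threshold=0.5):
    model.eval()
    ys, yhats = [], []
    with torch.no_grad():
        for Xb, yb in dataloader:
            out = model(Xb).squeeze()
            pred = (torch.sigmoid(out) > threshold).long()   # estimated labels
            ys.append(yb)
            yhats.append(pred)
    y = torch.cat(ys)
    yhat = torch.cat(yhats)
    return y, yhat, (y == yhat).float().mean().item()        # global accuracy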
The aggregated indicator is not as informative as the conditional one per class, but it is generally preferred because the overall error is of first interest. For medical questions, for instance, or for asymmetric costs related to the classification, the error should be computed w.r.t. each class for a better understanding of the performance of the classifier. An error does not have the same consequence if it is a "FP" or a "FN", and unbalanced classes also induce an unfair accuracy, for which the larger class biases the rate.
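A minimal sketch of the accuracy conditional to each class (variable names are illustrative):
acc_class0 = ((yhat == y) & (y == 0)).float().sum() / (y == 0).sum()   # accuracy for class 0
acc_class1 = ((yhat == y) & (y == 1)).float().sum() / (y == 1).sum()   # accuracy for class 1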
The errors are not symmetric for this dataset and the chosen implementation of the training. The two accuracies are not both near 0.5, the expected value for the worst prediction by chance (one out of two). This result teaches several points:
• This underlines that looking at the accuracy is probably not enough for a deeper evaluation
of the model, and other indicators are to be checked anyway. The error from one class may
be preferred to be minimized for instance, but the threshold may be also chosen differently
for a different final error.
• This underlines that a linear model is not always enough and a nonlinear model is sometimes mandatory in classification (and regression). The usual statistical approach of explaining with the model is limited for usual machine learning questions such as prediction.
• The usual way in statistics is to add some interaction and power terms (of order two most often), which is related to polynomial regression, but the frontier here is not just a simple nonlinear version of the line (or hyperplane), such that a polynomial transformation may not always be enough.
Let us check the solution graphically in order to confirm the conclusion about the linear model for this dataset. The frontier is checked visually by computing the predicted label ŷi over a discretization of the whole plane, such that we get:
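A minimal sketch of the discretization of the plane and of the predicted labels on this grid (the ranges and the resolution are illustrative):
import numpy as np
import torch

xx1, xx2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
grid = np.column_stack([xx1.ravel(), xx2.ravel()]).astype(np.float32)
with torch.no_grad():
    out = model(torch.from_numpy(grid)).squeeze()
    zz = (torch.sigmoid(out) > 0.5).numpy().reshape(xx1.shape)   # predicted label per grid point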
Thus,
Figure 4.4: Linear frontier from neural network without hidden layers
The resulting frontier is linear as expected, hence the model is mistaken. The frontier becomes nonlinear by introducing hidden layers and nonlinear activation functions in the neural network. Such a neural network is implemented next in order to solve this issue. This allows one to extend glm; otherwise there exist other methods for nonlinear classification, with trees and ensembles for instance, but they are out of the scope herein.
The output is obtained by a sequential computation from the input layer to the first hidden layer,
then from the first hidden layer to the second hidden layer, and so on until the last layer. Hence,
each transformation leads to the next layer, until the last layer for the output.
Formal definition of the model
Recall that the input layer is for the features or independent variables xi j plus the bias or intercept 1,
with p + 1 nodes, while the output layer is for the target or dependent variable yi with 1 or K nodes.
This leads to defining intermediate functions $g_\ell(\cdot)$ such that, counting the last transformation for the final layer $L+1$, we get:
\[ \hat{y}_i = g_{L+1}(g_L(\cdots g_2(g_1(x_i)) \cdots)). \]
Thus each function transforms the previous representation by applying the same function $p_\ell$ times (= the size of the current layer) for the hidden layer $\ell$, hence once for each of its $p_\ell$ nodes. This is combined with a linear mapping given by a matrix of weights, such that each transformation (function applied to each node plus mapping to the next layer) is summarized as a multidimensional transformation from a space of $p_{\ell-1}$ dimensions to a space of $p_\ell$ dimensions, denoting $p_0$ for the first layer:
\[ g_\ell : u \rightarrow g_\ell(u), \]
with
\[ u = (u_1, \cdots, u_{p_{\ell-1}})^T, \qquad g_\ell(u) = (g_{\ell,1}(u), \cdots, g_{\ell,p_\ell}(u)). \]
There are several possible functions implemented in deep learning. Three have already been seen: linear for linear regression, sigmoid for binary linear classification, and softmax for multicategorical classification. The hidden layers are able to transform the vectors xi nonlinearly, before the regression or classification which happens at the last layer.
The linear transformation for hidden layers remains exactly the same as before. Thus, let us have a look at the types of activation functions which can be considered in the class we define next. They are taken from the list available in "torch.nn": the functions "Sigmoid", "Softmax", "ReLU", "CELU", "LeakyReLU", "PReLU" and "Tanh" each have a different shape, while "Dropout" and "AlphaDropout" randomly drop a percentage of nodes.
Implementation from module "torch.nn" # Parameters
Sigmoid() 0
Softmax(dim: Union[int, NoneType] = None) 0 (always dim=1)
ReLU(inplace: bool = False) 1
CELU(alpha: float = 1.0, inplace: bool = False) 2
LeakyReLU(negative_slope: float = 0.01, inplace: bool = False) 2
PReLU(init: float = 0.25) 1
Tanh() 0
Dropout(p: float = 0.5, inplace: bool = False) 2
AlphaDropout(p: float = 0.5, inplace: bool = False) 2
Many other layers or activation functions have been proposed, even recently. Some layers are useful for regularization by dropping some weights, i.e. canceling out a percentage of neurons. Some layers are for normalization, such as batch normalization and layer normalization, in order to standardize each minibatch, which allows a better convergence in some cases. More advanced layers, such as those for image classification (i.e. the neighborhood-based layer called convolution layer) or clustering, are not considered herein either; they may be tested as an exercice.
According to the current implementation, we just need to change the class "GLMRegression" into a more general class which is able to handle our new settings. Note that the layers and activation functions can be added in several ways in the class with pytorch: one after the other in the forward function, or all at the same time with the container "Sequential()". For more generality we use lists, and leave as an exercice the definition of particular models, mostly required for advanced variants of networks such as recurrent ones with time-varying variables: they are out of the scope and often defined without such a list. Typically in sklearn, for instance, for such a network one just has to define a list with the sizes of the hidden layers and the list of the activation functions. Pytorch is more flexible and hence asks for more details, in exchange for additional functionalities if required.
From here on, a class constructed from a list of layers and functions, as proposed in pytorch, will be preferred. This avoids rewriting the full class each time the structure of the network changes. Recall that the last activation function may be included in the loss function in some cases, such as for sigmoid and softmax, but the sigmoid may for instance also appear in another (hidden) layer in order to induce nonlinearities. The last activation function is then linear here.
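A minimal sketch of such a class built from a list of layers; the attribute net follows the text just after, the rest is illustrative:
import torch.nn as nn

class MLPRegression(nn.Module):
    def __init__(self, list_layers):
        super().__init__()
        self.net = nn.Sequential(*list_layers)   # chains the layers and activations in order

    def forward(self, x):
        return self.net(x)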
The main interest of the parameterization of this class is a higher-level access to the pytorch library without dealing with python classes, but this is mostly for documenting the layers that are used in this chapter. Otherwise, writing this class directly, and reusing our already defined classes and functions, remains possible, at least for a different situation such as advanced image processing.
Anyway, let us check how this new class, added to our implementation of deep learning with pytorch, behaves.
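A minimal instantiation sketch with one hidden layer and a ReLU activation (the sizes are illustrative):
import torch.nn as nn

nb_nodes_input = 2
nb_nodes_hidden = 20
nb_nodes_output = 1
list_layers = [nn.Linear(nb_nodes_input, nb_nodes_hidden),
               nn.ReLU(),
               nn.Linear(nb_nodes_hidden, nb_nodes_output)]
model_mlp = MLPRegression(list_layers)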
The object net, which is required by the forward function, has been added in the class.
Note that the class named "ModuleList" is also available for dealing with a list of layers; it is left to the reader as an exercice to consult the documentation of pytorch in order to learn more about it.
Thus, from the module named "copy", we use "deepcopy()" to create our object, and then run the training: the resulting classification should be better than before.
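A minimal sketch of this training run, reusing the generic function and the monitor sketched before (the values are illustrative):
import copy
import torch

model = copy.deepcopy(model_mlp)
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
monitor = MyMonitorTest(model, loss_fn, dataloader_test)
losses_train, nb_iter, monitor = f_train_glmr(dataloader_train, model, loss_fn,
                                              optimizer, nb_epochs, monitor=monitor)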
The model is very accurate for this sample, but the frontier is not perfect compared with the true one. Some regularization may further improve the model, but a model without error on the population is difficult to achieve with a small training sample.
The two losses have the same shape and are almost equal; there is no "underfitting" or "overfitting", and the "generalization" looks satisfactory here. This is the opposite of the curves without nonlinearities, where dramatic and pathological differences are detected via a visual check of the two losses.
Figure 4.5: Loss function for train and test samples, per epoch
Let us check the frontier; the circle is not expected to be retrieved exactly, because the separation between the two classes does not have exactly this shape in these samples.
It appears that small samples can give optimistic results even on the test sample, but this is not always the case. Anyway, only one train sample and one test sample is not the best approach, as explained in the next chapter. This also underlines why more data is what one wants for training a neural network: it would fill the empty space here, which induces the ellipsoidal shape of the frontier instead of a perfect circle.
Figure 4.6: Nonlinear frontier from neural network with hidden layer
Even for such a small dataset there is evidence of overfitting, for some initializations, because the curve may grow again after the decrease. Early stopping could be implemented here, or reducing the number of nodes could even be required. Note also that with the ReLU the overfitting seems reduced, as this layer is presented in the literature for this purpose by canceling some nodes: a large part of the activation function is null. The number of hidden nodes is chosen high for demonstration purposes, and may thus be reduced, possibly with an additional hidden layer.
This small example underlines a pitfall of deep neural networks, where a bad training can hurt the results for new incoming data, because the model may learn too many particular details of the empirical distribution without remaining general enough. In statistics, the related idea is Occam's razor, which advises preferring simpler models to more complex ones, with the usual "bic criterion" in regression, but also the selection of hyperparameters via "cross-validation" for more complex models, with an averaged decision and a sampling of new data.
In the next chapter, the model is considered in a more general situation with more complex datasets. Regularization and training will be discussed further.
4.6 Exercices
1. For the dataset of the chapter,
a) (pytorch) Change the learning rate and the batch size in order to accelerate the training,
while keeping a decreasing loss during the loop.
b) (pytorch) Check the accuracy for a range of numbers of nodes by drawing a curve of the accuracy versus their number, with and without the ReLU transformation, separately. Is the ReLU transformation useful, and how?
c) (pytorch) Idem for other layers (see the documentation).
d) (pytorch) Do the same for the regression model with a hidden layer and check the overfitting visually. What is observed when the number of nodes grows? Is it related to polynomial regression, and why?
e) (pytorch) Test the accuracy and the overfitting with two hidden layers and several numbers of nodes.
f) (pytorch) Test other approaches in order to avoid overfitting.
g) (sklearn) Fit the logistic regression with sklearn for the dataset in this chapter. How does this model compare to the neural network with pytorch?
2. (pytorch) Find a neural network for the classification of the following dataset with XOR structure from the datafile "foursquares_xor.txt", with labels in the last column, after visualizing the points with their classes.
3. (pytorch) Propose a neural network for the classification of the following datasets, generated with the datasets submodule of sklearn for the first three of them,
a) x, y = datasets.make_classification(n_samples=n,n_features=d, n_redundant=0,
n_informative=2,random_state=1, n_clusters_per_class=1)
b) x, y = datasets.make_gaussian_quantiles(n_samples=n, n_features=2, n_classes=g,
random_state=42)
c) x, y = datasets.make_moons(n,noise=.1)
d) x, y = a two dimensional dataset with three spirals.
The data samples are stored in the files "2gauss_xy_2d.txt", "gaussq_xy_2d.txt",
"2moon_xy_2d.txt", and "threespiral2d.npy" respectively.
4. (pytorch) Test linear and nonlinear regressions or classifications via neural networks and pytorch with the datasets from the exercices of the previous chapter, and compare with the logistic regression. Some datasets are more difficult than others: the next chapters will explain how to improve the choice of the hyperparameters (learning rate, number of neurons in the hidden layers). The iris dataset is a good candidate here, with a small size and three classes, for a linear classification.
5. (pytorch) Test the code for the 20 newsgroups text dataset (see the datasets from sklearn) with only four classes for instance.
6. (pytorch) Test the code for the 60k handwritten digit images of mnist, with ten classes. (stat) How can the number of weights be reduced in a model for image classification?
7. (stat+pytorch) Propose a method in order to improve the training and reduce the classification
error for the logistic regression and its nonlinear version.
8. (nn) Propose a graphical representation for the architectures of the neural networks in the
chapter, including the exercices.
Chapter 5
In the previous chapters, training linear and deep models with pytorch has been presented, together with the loss, the related maximum likelihood and the algorithms. To improve the models, the number of parameters may be reduced, for instance by penalization as explained below. When some variables are useless or the model is over-parameterized, this improves dramatically the accuracy or the mse for the test sample, as illustrated with an example. A hyperparameter is set for the strength of the regularization; its value is chosen within a set or an interval with a cross-validation method.
More formally, the approach is described as follows for the linear case, with different possible expressions for $\ell$ and $R$, and with $\hat{y}_i(\beta) = \beta^T x_i$, or $\hat{y}_i(\beta, W) = \beta^T g_L(\cdots g_\ell(\cdots g_2(g_1(x_i))))$ for a deep model for instance. Instead of directly minimizing the objective function or loss $\ell()$, a penalization $R()$ is added in order to avoid inflated values of the components of the parameter vector during the training; see the documentation of "sklearn" for instance.
\[ \ell_\lambda(\beta) = \ell(\beta) + \lambda R(\beta) = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{y}_i(\beta)) + \lambda R(\beta). \]
The first term is the usual loss function $\ell()$; for regression these are the least-squares or the robust least-squares (Huber) losses for instance, and other ones are defined for classification, the logistic loss for instance.
The second term decreases the overfitting of the regression and improves the generalization. For the regularization term $R = R(\beta)$, with $\rho \in [0;1]$ the parameter of the linear combination, the usual choices are:
the L2 norm (ridge): $R_2 = \frac{1}{p} \sum_j |\beta_j|^2$
the L1 norm (lasso): $R_1 = \frac{1}{p} \sum_j |\beta_j|$
the Elastic Net: $R_e = \frac{\rho}{2} R_2 + (1-\rho) R_1$
This is visualized below: the lasso penalty induces smaller values than the ridge one, for two components β1 and β2 for which the constraint region is a circle (ridge) or a square (lasso) respectively, while the solution for the elastic net is a combination of both.
Figure 5.1: Solutions for each penalty function alone with p = 2.
Note also that a probabilistic interpretation of these penalizations is to add a prior on the parameters: a Gaussian density for the ridge penalty, and a Laplace density for the lasso one, say
\[ \pi(\beta|\sigma^2) = \prod_{j=1}^{p} \frac{\lambda}{2\sqrt{v}} \exp\!\left( -\frac{\lambda |\beta_j|}{2\sqrt{v}} \right), \quad v > 0, \]
but this is out of the scope herein. Bayesian inference remains possible with pytorch1 but asks for more advanced implementations. Note that this kind of model is often presented as able to find a relevant value for the regularization parameters with only the train sample, but involving a test sample may actually perform better.
Finally, this amounts to adding to the previous derivative $\nabla\ell$ the term $\lambda \nabla R$, the derivative of $R(\beta)$ multiplied by $\lambda$. With "sklearn", some functions are available to perform this new optimization and even to find the best value of the new unknown parameter $\lambda$, called a "hyperparameter" in general, in order to get the best generalization possible from the available sample. When the hyperparameter is very large, only the penalization is minimized, hence the weights are all zero, while when it decreases, the former loss is also minimized, such that the previous solution is regularized by this additional term.
1 See the external python module "pyro" with such models for instance.
This is implemented with pytorch throughout this chapter, starting just below.
The functions from the previous chapters, all saved in a python file as a custom module, are imported. An artificial dataset is generated in order to perform the comparisons. Some useless columns, for variables which are not meaningful, are added to the design matrix in order to test the model selection. The columns are normalized. The datasets for the train and test samples are saved for later use into separate files in a compressed numpy format.
The generated dataset is loaded in order to check its contents.
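A minimal sketch of such a check, assuming the arrays were saved in a compressed numpy file (the file and key names are illustrative):
import numpy as np

with np.load("data_lasso_train_test.npz") as data:
    x_train, y_train = data["x_train"], data["y_train"]
    x_test, y_test = data["x_test"], data["y_test"]
print(x_train.shape, x_test.shape)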
With pytorch, one may implement the penalization at the level of the update of the weights of the neural network. This is performed next with the module torch, after training the regular regression.
For the dataloader, the previous functions are already available. The train and test samples are already separated here and available from the computer files.
The argument "shuffle" is set to False; this supposes that the shuffling was performed once beforehand and is not repeated during the training.
Note that "num_workers" for parallel processing with several cpu cores and "pin_memory" for
better management of the memory buffer are two options worth to try for some computers.
The training function is now implemented with the gpu device for faster training, when available.
5.3 Multiple regression without lasso
Let us train the model without regularization for this new dataset, even if some independent variables are just noise. This is likely to be the case with real data too. Useless variables detected by the lasso model may also just be correlated with useful ones. In both cases, the usual regression is expected not to perform well because of its lack of robustness.
5.3.1 Training
5.3.2 Post-processing
According to the graphic of the losses, the model looks fine, with convergence of the algorithm and even no apparent issue for the generalization to new data. But from the generation process it is clear that this model is likely to be mistaken and to need improvements. This is confirmed below via numerical indicators based on the predictions ŷi. For computing these predictions from the model, a function is defined below.
��� �����������������������������������������������������
���� ����������������
���� � ��
���
111
��� ����� ���� ��� �� ���������������
�� ������ �� ��� �����
�� � �������������
�� ������������� �� ��� �����
�� � �����������������
������ � ������������������������
�����
�� ������������� �� ��� �����
�� � �����������������
������ � ��������������������������
�� � �� ��
���� � ������
� � �������������������
�����
���� � ��������������� ������� �������
� � ��������������������������������
����
������ ��������������� �����������
Here, "device" is an argument in order to be able to proceed at the level of the gpu. In pytorch,
there are several ways to use the gpu: load all the dataset in the gpu when this is possible or load
each minibatch during the training loop in the gpu. Here the model must be also in the gpu in
order to access the data. The command is just ".to(device)" where device is an object depending
on the computer hosting pytorch. By the way, if required, an optional argument for a function
"f_transform_x()" was added but it may be preferred the dataloader for such transformation like in
image processing. This leads to.
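A minimal sketch of such a prediction helper, assuming a fitted model and a dataloader, with the optional argument f_transform_x kept as above (exact names may differ):

import torch

def f_predict(model, dataloader, device="cpu", f_transform_x=None):
    # move the model to the device and switch off training-specific layers
    model = model.to(device)
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for xb, yb in dataloader:
            xb = xb.to(device)
            if f_transform_x is not None:
                xb = f_transform_x(xb)
            y_pred.append(model(xb).cpu())
            y_true.append(yb)
    return torch.cat(y_true), torch.cat(y_pred)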
��������������� � �� ���������������������
�������������� � �� ��������������������
��������������� � �� ���������������������
�������� � �����
������� � �����
�������� � ����
112
���������� �������� � ����������������������������������������������������
��������� � ������������������������������������
���������������� � �� ����������������������
��������������� � �� ���������������������
���������������� � �� ����������������������
��������� � �����
�������� � �����
��������� � ����
The model performs very well on the train sample, because this is a regression fitted on data generated from a nested model, but it is less convincing on the test sample, which indicates a problem of overfitting. The poor generalization of the model is corrected by adding some regularization (removing nodes and weights of the neural network which are not needed), as explained next.
The loss is altered by adding the absolute values of the weights to the previous loss function, the mse. In this way the penalization appears in the update rule at each optimization step during the training, as if this were a new loss function. This is the role of the following function, which is passed as an argument to the generic training procedure. Thus, the training loop stays the same; only a wrapping function which adds the regularization to the previous loss is introduced.
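A minimal sketch of such a wrapping loss, assuming the mse as base criterion and a hyperparameter lambda_l1 (names and value are illustrative):

import torch

lambda_l1 = 0.1          # regularization hyperparameter (illustrative value)
mse = torch.nn.MSELoss()

def loss_mse_l1(y_pred, y_true, model):
    # base loss plus the sum of the absolute values of all the weights
    penalty = sum(p.abs().sum() for p in model.parameters())
    return mse(y_pred, y_true) + lambda_l1 * penalty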
��� ���������������������������
���������� � �����
���������� � �������������������������������������������������������
������ � ���������� � ��������� � ����������
������ ������
����������� � ���
������� � �����
����� � ����������������������������������������
������������������������������
���� � ����������������������������������
113
��������� � ����������������������������������� ����������� �������������
������� � ������������������������������������������������
�����������������������������
The computer code is nearly the same as without regularization; only the function for the loss is implemented instead of remaining at "None". This allows training without any deep knowledge of pytorch, but also the possibility for the reader to adapt a generic code, for instance with other regularization functions, loss functions, optimizer functions, etc., without reinventing the wheel. Note that an additional generic python function with fewer arguments is written in a later chapter, together with an additional member of the generalized linear models for discrete regressions. The training is launched.
��������������������������� � �
�������������������������������������������������
����������������������
��������������������������������������
���������� � ��������������������
��������� � �������������������
������������� � �����������������������
������������ � ����������������������
���� � �������������������������������������������������������
���� � ���������������������������������������������������������
��� � ����
�� � ����
��� � ����
�� � ����
The new model with the regularization performs better than the one without. This confirms the need for the penalizing term, or at least some regularization, in order to improve the usual model.
114
5.4.2 Interpretation of the obtained model
The following graphic with the predicted target variable against the true target variable for both
samples, test and train, confirms a higher error for the test sample when the penalization is re-
moved.
������ ����������������� �� ���
This leads to the two following graphics: without lasso on the left and with lasso on the right. There is a huge improvement for the test sample, which makes this kind of procedure almost mandatory for prediction.
Figure 5.2: Scatterplot of yi and ŷi : without L1 (left) and with L1 (right)
The weights or regression coefficients β̂j come with some variance and variability, which explains why the true target variable is not better retrieved from the model. The penalization improves the generalization to new data by reducing the quality of the prediction on the training set. Adding the regularization performs quite well here, and without this term the model appears not useful for new data. Retrieving similar values of the indicators for the train and test samples supports the idea of a model able to generalize and to perform the same for any data, as if it had been estimated from the whole population, but this is not always the case actually.
The training algorithm above may not be optimal for the optimization with a nonlinear regularization term: for the lasso regression, more advanced inferential procedures may be preferred for faster training. But they require specialized python programs (see the literature on lasso for such implementations), while the presented solution relies on numerical derivatives.
116
consuming and suppose repeating the training 5 or 10 times for every possible combination of candidate values of the hyperparameters. The number of runs is thus the number of folds times the product of the numbers of candidate values, which quickly becomes a huge integer when the number of hyperparameters #hyper grows. If there are 15 candidate values for each hyperparameter, this is around $k \times 15^{\#\mathrm{hyper}}$ training runs, for instance $5 \times 15^2 = 1125$ with two hyperparameters and $k = 5$ folds.
A time-saving approach is to apply these searches only around a good value found beforehand, either by a human search through several empirical tries or by a first naive choice of a few values. Otherwise, the train and test samples need to be selected very cautiously in order to be similar, via stratified sampling with enough strata and possibly a fine quantization or clustering of the data space. Note that in the documentation of sklearn, several cross-validation procedures (and lasso algorithms) are available. Here only the most usual procedure is considered.
117
understanding that the training of neural network models asks for some manual settings which may be cumbersome for non-experts, as there seems to be no universal recipe which works all the time. Recent automatic implementations are based on bayesian inference but are time consuming; see "automl" in the literature for instance, and the last section of this chapter for examples.
The cross-validation (cv) provides nearly unbiased indicators, while the same code as before is used, except that the train and test samples are not unique anymore. For the "k-fold cv", k train samples and k test samples are required, with often k = 5 or k = 10. The average of the k resulting quality indicators replaces the previous unique value. The k samples are found by partitioning the whole sample into k equal parts. With these k subsamples, k pairs of train and test samples are obtained by removing one of the k parts from the sample, one after the other. At each removal, the removed part is the test sample while the remaining (k − 1) parts become the train sample (a sketch is given just below).
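A minimal sketch of building such fold indexes with numpy (a hypothetical helper; the book's own function, described next, may differ in its details):

import numpy as np

def f_kfold_indexes(n, k, seed=0):
    # shuffle the row indexes once, then cut them into k parts
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    # for each fold, the removed part is the test sample, the rest is the train sample
    return {j: {"test": folds[j],
                "train": np.concatenate([folds[m] for m in range(k) if m != j])}
            for j in range(k)}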
118
First, the k samples of the cross-validation for testing and training are found from the available whole sample. They are generated via the new python function just below, which extends the previous one with the indexes for each fold and each subsample (test and train). These indexes are then used to feed two pytorch dataloaders in order to cycle through the minibatches of the two subsamples.
��� ���
The number of folds is 5 here.
������ � �
Thus, five pairs of train and test samples are put in a dictionary as follows:
������ ������������ �� ��
����� � �������������������������������������������
Note that some functions are available in python modules such as sklearn (with the function KFold for instance) for a similar result, with more options and alternative sampling strategies for the selection, while the one presented here is the most common. In the proposed function just above, the result is coded as a python variable of type dictionary (dict) with the fold number minus one as the key, and two corresponding values as two lists for the two samples, named "train" and "test". For this dataset, the sizes of each fold from the dictionary are given just after, with a verification of the sets:
The sizes for each fold are as follows:
� ��� ��� �� �������������
� ������� � ������������������������
� ������ � �����������������������
� ������������� ����� ��� ������� ��� �������� � ������ ��� �������
119
�������������� � ��
��� ��� �� �������������
�������������� ����
� ������������������������������
� �����������������������������
������������������ ��
����������������� � ������������������������ � �
����������������������� �
�����������������������������������������
For a number of folds which does not divide the sample size exactly, all the folds have the same size except the last one. This differs from sklearn, which produces more balanced subsample sizes, hence it is better to check these sizes anyway (see the sketch just below). The folds are then used by simply repeating the previous usual training strategy several times: one sample for model fitting and one for the evaluation of the quality, with an averaging at the end.
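For comparison, a minimal sketch with sklearn, whose KFold yields the more balanced fold sizes mentioned above (the array x_all is illustrative):

import numpy as np
from sklearn.model_selection import KFold

x_all = np.random.default_rng(0).normal(size=(103, 4))
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for j, (idx_train, idx_test) in enumerate(kf.split(x_all)):
    print(j, len(idx_train), len(idx_test))   # balanced sizes: 82/21 or 83/20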
The training is a loop over the folds in order to gather the metrics and aggregate them. The function is defined in the file "utils.py" and has the following header and return list.
The function takes as argument either the sets in "idx_s" for the cross-validation, plus the dataset, or the filenames in hdf5 format in "namefile_s". After training, the output is the list from each fold returned by the function "utils.f_train_glmr", as a dataloader is created for each pair of train and test samples. This allows afterwards to compute the averages and compare the results from each fold.
120
The objects called model, loss, optimizer and monitor are copied for each fold by this function; thus they are created first, and the initial python objects are not altered.
The parameters for the learning are as follows.
������� � �����
��������� � ����
����������� � ���
���������� � ��
The call to the function comes after creating the required python objects:
������ �������� �� ��
������ ����
� �� ������������ ��� �������������� ���� ��� ��� �����
�������������� � ��
������������������������������������������������
� ������ �� ���� ���� ������ ��� �������� ��� ����������������
������ � ����������������������������������������
������������������������������
����� � ���������������������������������
���������� ���� �� � � �
����� ��������� �� � � ��� � ��� ��
����� ������� �� ��� � ��� � ����� ��
��������������� ���� �������� ����� ���������� ����� ��������� �����
���������� ���� �� � � �
����� ��������� �� � � ��� � ��� ��
����� ������� �� ��� � ��� � ����� ��
��������������� ���� �������� ����� ���������� ����� ��������� �����
���������� ���� �� � � �
����� ��������� �� � � ��� � ��� ��
����� ������� �� ��� � ��� � ����� ��
��������������� ���� �������� ����� ���������� ����� ��������� �����
���������� ���� �� � � �
����� ��������� �� � � ��� � ��� ��
����� ������� �� ��� � ��� � ����� ��
121
��������������� ���� �������� ����� ���������� ����� ��������� �����
���������� ���� �� � � �
����� ��������� �� � � ��� � ��� ��
����� ������� �� ��� � ��� � ����� ��
��������������� ���� �������� ����� ���������� ���� ��������� �����
���������� ���� �� � � �
����� ��������� �� � � ��� � ��� ��
����� �������� �� ��� � ��� � ����� ��
��������������� �������������� �������� ����� ���������� ����� ����������
�→�����
���������� ���� �� � � �
����� ��������� �� � � ��� � ��� ��
����� �������� �� ��� � ��� � ����� ��
��������������� �������������� �������� ����� ���������� ���� ��������� ��
�→���
���������� ���� �� � � �
����� ��������� �� � � ��� � ��� ��
����� �������� �� ��� � ��� � ����� ��
��������������� �������������� �������� ����� ���������� ����� ����������
�→�����
���������� ���� �� � � �
����� ��������� �� � � ��� � ��� ��
����� �������� �� ��� � ��� � ����� ��
��������������� �������������� �������� ����� ���������� ����� ����������
�→�����
���������� ���� �� � � �
����� �������� �� � � ��� � ��� ��
����� �������� �� ��� � ��� � ����� ��
��������������� �������������� �������� ����� ���������� ����� ����������
�→�����
This leads to an improvement of the results, but less than expected in comparison with the previous two samples (one train and one test).
A way to improve further would be to better set the learning rate by running the training for each candidate value and keeping the best one. Examples of such values are as follows.
122
��������� � �������������������������������������������
�����������������������������������������
Instead, let us check whether lambda_l1 is really optimal, considering this increase of the mse for the cross-validation. First the datasets are stored on the computer disk.
������ ������
������ ����
��� �����������������������������������������������������������������
������������
��� ������������������������������������������������������
����������������������������������
��������
���������� � ������
��� �� ����� �� ������������������������
�� ����� ����������������� ���� �� �������������������������������
���������������� � �����������������������������������������������
��������������� � ����������������������������������������������
�������� � ������������������������������������
123
�����������������������������������������������
������� � ������������������������������������
����������������������������������������������
�������������� � �������������������������������������������������
������������� � ������������������������������������������������
��������������������������������������������������
�����������������������������
������������������������������������������������
�����������������������������
������������������ � �����������������������������
����������������������
������ ����������
���������� � ����������������������������������������
����������������������������
����������� �����������������������
����������������
Thus this is the input format for the dictionary with the file names for the cross-validation: instead of indexes, there is a character string with the name (and path).
����������
The call is almost the same as before, except that the argument "dataset" in the function call is set to "None", as is "idx_s", while "namefile_s" is now given a value. Note that for a small dataset, it is better to keep the data in the computer memory for faster training.
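A minimal sketch of writing one fold to an hdf5 file with h5py (the file and dataset names are illustrative, not the ones of the book):

import h5py
import numpy as np

x_fold = np.random.default_rng(0).normal(size=(80, 4))
y_fold = np.random.default_rng(1).normal(size=80)

# one file per fold and per subsample, read back later by the training function
with h5py.File("fold0_train.h5", "w") as f:
    f.create_dataset("x", data=x_fold)
    f.create_dataset("y", data=y_fold)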
124
The best regularization parameter and the best learning rate are found via cross-validation among the following candidate values.
��������� � ��������������������������������������������������
����������� � ��������������������������������������������
���������������������������� �������
����������� � ���
������ � ��
������ � ��
����������� � ��
��� ��������� ����������� �� ����������������������������������
�������� � ������������������������������������
������������������
������������������������
���������� � ��
��� �������� ���������� �� ��������������������������������
�������� � �����������������������������������
�����������������
�����������������������
���������������������������������������������������
��������������������������������
����������������������������������
����� � �����������������������������
����� � ����������������������������
125
������� �������������� ������ �������
������� �������������� ������ ���������
126
�������������� ��������������� ������������ ������ ������������ ������
�������������� ������������� ������������ ������ ������������ ������
�������������� ��������������� ������������ ������ ������������ ������
�������������� �������������� ������������ ������ ������������ ������
������������� ������������� ������������ ������ ������������ ������
������������� ��������������� ������������ ������ ������������ ������
������������� ������������� ������������ ������ ������������ ������
������������� ��������������� ������������ ����� ������������ ������
������������� �������������� ������������ ������ ������������ ������
������������� ������������� ������������ ������ ������������ �����
������������� ��������������� ������������ ������ ������������ ������
������������� ������������� ������������ ������ ������������ ������
������������� ��������������� ������������ ������ ������������ ������
������������� �������������� ������������ ������ ������������ ������
������������� ������������� ������������ ������ ������������ ������
������������� ��������������� ������������ ������ ������������ ������
������������� ������������� ������������ ������ ������������ ������
������������� ��������������� ������������ ������ ������������ ������
������������� �������������� ������������ ����� ������������ ������
Note that the function "loss_yy_model" relies on a global variable, hence a more robust way to proceed could be preferred, such as a method of a class where lambda_l1 is an attribute (see the sketch after this paragraph). The training may be slow on some computers because the cross-validation asks for five training runs, repeated for each combination of learning rate candidates and regularization parameter candidates: 5 × 5 × 12 = 300 runs, thus fortunately the gpu was used here.
The number of epochs may be limited, otherwise the running time may be too high for some computers: here 100 instead of 350, because the test loss was converging faster according to the previous curves. This is a way to proceed for any neural network actually, even larger ones and larger datasets. Another strategy is to randomly sample candidate values for alpha_t and lambda_l1 in an interval or according to some distribution, but with no better guarantee of getting the best parameters. It may also be possible to fine-tune the search by looking around a good candidate value. These strategies are costly and may be feasible only with serious computing resources. This underlines the need for gpu and parallel processing even to search for good values of the hyperparameters. This is true anyway for running any numerical python code: a better computer gets the result faster, but for deep learning a slow computer may not be able to proceed in a practical time, or might at least induce a delay for the results.
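A minimal sketch of the class-based alternative mentioned above, where lambda_l1 becomes an attribute instead of a global variable (names are illustrative):

import torch

class LossMseL1:
    def __init__(self, lambda_l1):
        self.lambda_l1 = lambda_l1
        self.mse = torch.nn.MSELoss()

    def __call__(self, y_pred, y_true, model):
        # mse plus the l1 penalty on all the parameters of the model
        penalty = sum(p.abs().sum() for p in model.parameters())
        return self.mse(y_pred, y_true) + self.lambda_l1 * penalty

# one instance per candidate value during the grid search
loss_yy_model = LossMseL1(lambda_l1=0.1)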
From the numerical output just above, the resulting statistics for the computed metrics are summarized in a table in order to compare the quality of the results, even if convergence was possibly not reached.
������ ������ �� ��
127
����������� � ����������������������������� ��� ����� �� ��������
���������� � ���������������������������� ��� ����� �� ��������
The plot is made with seaborn here, a module which makes matplotlib more powerful to use with fewer command lines; see the figure and code just below.
Figure 5.3: Mse averages on test set after CV+L1 and grid search
128
� �������������������
�������������������������������������
������������������������������������������
�����������������������
�������������������������������� ��� ���������������� ����� � ����������
������������������������� �� ���������������
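A minimal sketch of such a seaborn summary plot, assuming a pandas dataframe df_cv with columns "lambda_l1", "alpha_t" and "mse_te_mean" (hypothetical names and values):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df_cv = pd.DataFrame({"lambda_l1": [0.01, 0.1, 1.0] * 2,
                      "alpha_t": [0.001] * 3 + [0.01] * 3,
                      "mse_te_mean": [0.06, 0.045, 0.08, 0.07, 0.05, 0.09]})

# one curve of the averaged test mse per learning rate, against lambda_l1
sns.lineplot(data=df_cv, x="lambda_l1", y="mse_te_mean", hue="alpha_t", marker="o")
plt.xscale("log")
plt.show()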
Here, lambda_l1 equal to about 0.1 performs well, but a different alpha_t looks better than the one from before, 0.001; this is the corresponding row of the results:
lambda_l1=0.1 alpha_t=0.001 mse_tr_mean=0.0354 mse_te_mean=0.0451
The difference with the manually chosen parameters above may appear nearly negligible. Anyway, checking whether the choice of the parameters is well suited seems required in order to ensure the best possible training and settings. For more parameters, parallel plots are suggested in the literature, while a bayesian search for the optimal parameters looks like a better solution in order to minimize the number of runs.
129
This is as follows for our cross-validation function. First the embedding function is defined.
����������� � ���
��� ���������������������������������
������������������������� ������������������������� ��
������� � �����������������
��������� � �������������������
���� ��������������������������
���������������������������������
��������������� �������������� ������������� �
���������� ������������ �������� � �
������������������������������������������������������������������
������������������������������������
��������������������������������������
�����������������������������������������������������
�
���������� � ��
��� �������� ���������� �� ��������������������������������
�������� � �����������������������������������
�����������������
�����������������������
����� � ����������������������������
������ �����
The optimization proceeds with the python module "smac" in order to find the best hyperparame-
ters automatically by defining their intervals (or range) and their distributions.
������ ����� �� ��
������ ��������
������ �������
���������������������������������������
130
���������������������������������������������������������������������� �
������� ����� ������������������� �
�������������������������������������������������������������������� �
�������� �������� �������������������� �
������������� � ����������� ���������� ���������������� �����
������������������� � �����������
������������������������������� � �� � �� �������������� ���������
�������������������������������� � ��� � ���� ����� ��� ��� �����������
�������� � �����������������������
� ���� � ��������������������������
���� � ���������������������������
������������������������������������
���������������������������������������������� ��
���������������������������������������
���������������������������������������������������� � ��������������
�→��������
��� ������� ������� � ����� ������� � ���� ���� ���� ������������� ��� ����
�������� ��� ������������� ��� ����������
��������������������������������������������������������� ������� �������
�→��� ��
��������������
������������������������������������������������
���������������������������������������
After the definition of the python objects for "smac", the search for the optimal solution is launched with the following call to "optimize()", a dedicated method from the module.
����������������� � ���������������
After running the call, the python variable "best_found_config" has received the optimal parameter
found. For debugging, one may call:
� ���
In this example of configuration, there was no improvement over the grid search. Choosing 10 folds for the cross-validation would lead to more precise estimations of the hyperparameters with only double the running time. Note that with this example of implementation, if the loss becomes nan or inf, the module may stop with an exception, hence this should be prevented if required by cautious bounds for the intervals in the configuration space. Early stopping would be interesting to implement here, instead of an incomplete training with a constant number of epochs kept the same for every choice of hyperparameter candidates. Other implementations of such an automatic search for the hyperparameters exist in python modules and are left as an exercise for comparison: such a search is implemented in auto-pytorch, an auto-ml member which allows finding the architecture of neural networks automatically, even if some setting by a human expert is still required for the choice of the intervals, but also for the final validation and the data preparation, including the recoding for instance.
In this chapter, we have studied how to implement the lasso regression with pytorch and compared
several approaches for finding the hyperparameters for the penalization and the learning rate. When
well chosen, the regularization (lasso, dropout, ...) allows reducing the overfitting in order to improve the generalization of the predictions to new data.
5.6 Exercices
1. (sklearn) Retrieve the value lambda_l1 with sklearn and compare the resulting mse and r2 with the result from pytorch. Note that the regularization parameter may differ by a factor of two or one half because, in sklearn, it is usual to have a factor two in the mse; see the documentation of sklearn. The regression model in sklearn often performs well by default because a regularization with a quadratic norm of the parameter vector is included, which is a ridge, hence regularized, regression.
2. (pytorch) Implement a lasso for the classification with a real or artificial dataset from one of
the previous chapters herein.
3. (sklearn) Test the stochastic regressor 2 from sklearn with different hyperparameters for the
number of steps and the batch sizes. Similar behaviour may be observed in comparison to
the pytorch implementation, when a regularizing term is added.
4. (pytorch) Add a stopping rule in the class MyMonitor when the training loss is stable enough, in order to reduce the running time, and also when the test loss increases during several steps, in order to reduce overfitting.
5. (pytorch) Test the lasso for the deep regression model from chapter 1, for fitting the data from a polynomial regression. Count the number of parameters which are zero or near zero and the number of parameters which are kept. Check whether this regularization reduces the mean squared error on the test sample. Test other regularizations.
2 "https://scikit-learn.org/stable/modules/sgd.html"
132
Chapter 6
In the previous chapter we have studied neural networks and pytorch for regression and classification. In this chapter, we are interested in numerical "variance estimation", by computing values for Var[β̂j]. As a statistical model, neural networks have parameters which are estimated from a sample. This induces some variability of the resulting estimates. Hence the usual approach is to find the "hessian matrix", say ∇²ℓ = ∇∇ᵀℓ, containing the "second-order derivatives" of the loss function ℓ(), in order to deduce the variance. For a large number of parameters and small samples, this approach is incomplete and asks for a more advanced method (when no numerical "bootstrap" is tractable), hence out of the scope here. Otherwise, such as in the linear case, this is the classical way to proceed, which is presented next with the modules numpy, statsmodels and pytorch.
��� ����������
������ ���������������������������
������ ���������
�����������������������
Let us generate a dataset from a linear model in order to perform the comparisons between the implementations. This is a toy dataset with thirty rows and five variables, without useless columns.
The resulting two samples are standardized.
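A minimal sketch of such a generation, with illustrative coefficients and noise level that may differ from the book's values:

import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 5
beta = np.array([1.0, -2.0, 0.5, 3.0, -1.5])

x = rng.normal(size=(n, p))
y = x @ beta + 0.3 * rng.normal(size=n)

# standardize the columns of the design matrix
x = (x - x.mean(axis=0)) / x.std(axis=0)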
������ ����� �� ��
���� � ���������������������������������������������������
��������� � ����
� � ��
� � �����������������������������������������
� � �������������������������� ���
� � �����������������������������������
� � � � ���� � �
� ������� � �������������������������������
� ������ � ������������������������������
� ������ �������� ������
�������� ������ � �������������������������������������
���� �� �� ��
134
������������������������������������������������������������������������������
���� ��� ��� � ����� ������ ������
������������������������������������������������������������������������������
����� ������ ����� ������ ����� ����� �����
�� ������ ����� ����� ����� ����� �����
�� ������ ����� ����� ����� ����� �����
�� ������ ����� ����� ����� ����� �����
�� ������ ����� ����� ����� ������ �����
�� ������ ����� ����� ����� ������ �����
������������������������������������������������������������������������������
The module offers a more advanced view of the table than the usual function "print()" for a numpy
array.
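A minimal sketch of producing such a summary with statsmodels, on illustrative data of the same shape as above:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 5))
y = x @ np.array([1.0, -2.0, 0.5, 3.0, -1.5]) + 0.3 * rng.normal(size=30)

# add the column of ones for the intercept, then fit the ordinary least squares
result = sm.OLS(y, sm.add_constant(x)).fit()
print(result.summary())   # coef, std err, t, P>|t| and confidence bounds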
This is the usual content for statistics with data, as follows:
- The regression coefficients, in the column named "coef", are the values β̂j.
- The "standard errors", in the column "std err", are the square roots of the variances. These values (Var[β̂j])^0.5 are useful because they enter the definition of an interval instead of just a unique value for each component.
- The bounds of the confidence intervals (left at 0.025 and right at 0.975) for the true parameters, say βj ∈ β̂j ± z0.975 × std, where z0.975 is recognized as the percentile of the Gaussian distribution for a two-sided test.
- A Student test of nullity of the coefficients (value "t" and p-value).
Without the information from the variance, the interpretation of the estimates β̂j for small samples is not reliable: with another small sample from the same population, the values of the estimates may change dramatically, such that the conclusion would change completely too. For larger samples, the variance is small enough to be neglected in practice, as is common in the machine learning literature.
Let us retrieve the results for the variance via the algebra by writing a new python function, after a brief recall of the classical analytical expressions from the statistical literature.
135
It is recalled that the criterion above comes without a distributional hypothesis, even if one is hidden here: a Gaussian noise can be supposed and is implicit above. This leads to the matricial expression:
$$ \hat{V}(\hat\theta) = \frac{\hat\sigma^2}{n}\,(\tilde{X}_n^T \tilde{X}_n)^{-1} \quad\text{where}\quad \hat\sigma^2 = \frac{1}{n-p-1}\,\ell(\hat\theta) . $$
The estimator of σ² is here corrected at the denominator because the one from maximum likelihood is biased (with a denominator n instead). As expected, when n becomes large the variance vanishes to zero, but for small n it may be large.
• For the logistic regression:
$$ \ell(\theta) = \sum_{i=1}^{n} y_i \log \sigma(\alpha + x_i^T\beta) + (1-y_i)\log\!\big(1-\sigma(\alpha + x_i^T\beta)\big) = \sum_{i=1}^{n} y_i \log \sigma(\tilde{x}_i^T\theta) + (1-y_i)\log\!\big(1-\sigma(\tilde{x}_i^T\theta)\big) . $$
2. With ∂ℓ/∂θ = Σi ∂ℓi/∂θ, the more general way to consider the variance estimation is via the hessian (second-order derivative) or its approximation:
$$ \ell(\theta) = \sum_i \ell(x_i,\theta) \quad\text{such that}\quad \hat{V}_\theta = \Big(-\sum_i \frac{\partial^2 \ell_i}{\partial\theta\,\partial\theta^T}\Big)^{-1} \approx \Big(\sum_i \frac{\partial \ell_i}{\partial\theta}\,\frac{\partial \ell_i}{\partial\theta}^{T}\Big)^{-1} . $$
The logistic regression is retrieved as a special case. More advanced estimators, such as the sandwich one, are not considered herein.
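Following the linear case above, a minimal numpy sketch of the variance computation, written with the classical unnormalized cross-product σ̂²(X̃ᵀX̃)⁻¹ (x and y are the standardized arrays; names are illustrative):

import numpy as np

def f_variance_ols(x, y):
    n, p = x.shape
    x_tilde = np.hstack([np.ones((n, 1)), x])       # add the intercept column
    xtx_inv = np.linalg.inv(x_tilde.T @ x_tilde)
    theta_hat = xtx_inv @ x_tilde.T @ y             # least-squares estimate
    resid = y - x_tilde @ theta_hat
    sigma2_hat = resid @ resid / (n - p - 1)        # corrected noise variance
    v_theta = sigma2_hat * xtx_inv                  # covariance matrix of theta_hat
    return theta_hat, np.sqrt(np.diag(v_theta))     # estimates and standard errors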
��� ���������������������������������
� � ����������
� � ���������� ������ ���� ����������
�������� � ����������������� � �� � ��� � �
136
����� � � � ��������
�������� � � � �����
���������� � ������������������� � �� � ��
������������ � ���������� � ����������������� � ��
�� ��������
��� �� �� ���������
�������������� � ���������������� ��� �� ���
��������������������������� ������������������
������ ����������������� ������������� ����������
����������� � ������������������
����������������������� ������������������������
������������
�� ������ ��� ��� ��� �� �� �
���� ������ ������ ������� ������ �������
���� ������ ������ ������ ������� ��������
���� ������� ������ ������ ��� �� �
� �� ������ ������� ��� ������ �������
� �� ������ ������� �� ������ ��������
The resulting table with the regression coefficients and the standard-deviations is shown just below
in a pretty view very similar to the output from the above module, with several columns.
��� �������������������������������������������������������������������
� � ������������������������������������������������������������ ���
��� � �������������
����������� � ���������
��� � �� �����������������������������
���� � ������������� � ������
������������������������
��������������������������������������
���������������������������������������������������������
������������������� �����
���������������������������������������������� �������
������������������������������������������������������ �
137
������������������������������������� �������
���������������������������������
�����������������������������������������������������������������
���������������������������������������
��������������������� � ���
����������������� � ���
�������������� ����� � ���
����������������� � ���
�������������� � ���
������������������ � ���
����������
������������������������������������������������������������������
�����������������������������������������������������������
� �������� � ���� � ��� ��� � ���� � � � ����� �
�����������������������������������������������������������
� ����� � ����� � ���� � ������ � ������� � ��� �
� ���������� � ����� � ����� � ����� � ������ � ��� �
� ���������� � ����� � ����� � ����� � ������ � ��� �
� ���������� � ����� � ����� � ����� � ������ � ��� �
� ���������� � ����� � ����� � ����� � ������ � ����� �
� ���������� � ���� � ����� � ����� � ������ � ����� �
�����������������������������������������������������������
The variation coefficient, c.v., was added in the last column to the right, together with the t-test and the related probability. Here, the Student statistic tj for testing the nullity of a regression coefficient is equal to the variation coefficient, because tj = (θ̂j − 0)/√Var(θ̂j) = cvj, while the probability comes from a Student distribution. This is not discussed further herein, as the variance is what is at stake.
When the number of variables increases, it becomes nearly "impossible" to invert the hessian matrix with the currently available software, thus solutions have been proposed to solve this issue for neural networks, mainly by considering only the diagonal part of the hessian matrix, because it is not the variance which is of first interest but the training (see the next chapter and the exercises). Anyway, this brief introduction to numerical variance estimation is not always relevant, as some models may require more advanced analytical expressions.
138
6.2.1 First-order training with pytorch
������ �����
���� ���������������� ������ �������������� ����������
�������������������������� � �� ��������������������������
������������������������������ � ��������������������������������
������������������� � ����
����������������������� � ������ ������� �����
������������������� ��������
������ �������� �� ��
������ ����
�� � ����������������
����������� � ���
������� � ����
��������� � ��
�������������� � � ������������������������� �
����� � ����������������������������������������
������������������������������
���� � ���������������������������������
��������� � ����������������������������������� ����������� �������������
������� � ������������������������������������������������������������
������������������������
��������������������������� � �
�������������������������������������������������
��������������������������������
139
����� �������� �� � � ��� � ��� ��
����� �������� �� �� � ��� � ���� ��
����� �������� �� �� � ��� � ����� ��
�����
����� ������� �� ��� � ��� � ����� ��
��� ���������������������������������������������������������������������
������� � ����� � ����
��� ������������ �� �������������������
�� ��������������������������
�������� ��������������������������������
�� ������������������������
������ ��������������������������������
������ ������������������������
140
������������������� � �������������������������������������
�������������������
The most direct way to retrieve the hessian (for the variance) is to define the likelihood and implement the call to the corresponding pytorch function. In the linear case, without using a native loss function from pytorch, this is written as follows.
������ �����
������������ � �
�������������������������������������������������������������
�������������������������������
������������ � ������������������������ �����������������������������
�� � ���������������������
�� � �����������������������������������������������
��� ���������������
������ �������������������� � ������������� � �������
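A minimal sketch of such a call with torch.autograd.functional.hessian, on illustrative tensors x_t (with an intercept column) and y_t:

import torch
from torch.autograd.functional import hessian

x_t = torch.randn(30, 6)          # design matrix with an intercept column (illustrative)
y_t = torch.randn(30)

def f_neg_loglik(theta):
    # Gaussian case up to constants: half the sum of squared residuals
    resid = y_t - x_t @ theta
    return 0.5 * (resid * resid).sum()

theta0 = torch.zeros(6)
h = hessian(f_neg_loglik, theta0)  # equals x_t.T @ x_t in this linear case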
Here the analytical hessian, divided by n_train, is simply retrieved numerically for this simple example, because the solution is known in exact closed form in the linear case:
141
��������� ����������� ������������ ������������ ������������ �����������
������������
������������� ����������� ������������ ����������� �����������
�������������
������������� ������������ ����������� ������������ �����������
������������
������������� ����������� ������������ ����������� �����������
�������������
� ����������� ����������� ����������� ����������� �����������
�������������
� ����������� ������������ ����������� ������������ ������������
�������������
The estimator from the algebra has the same value as expected, which validates the illustration with the dedicated pytorch numerical function here. These solutions are also identical to the earlier one from the python module statsmodels (see a few paragraphs above).
�������������������������������
Actually, it is required to retrieve the same solution for the regression coefficients, because unbiasedness is assumed here when computing the term σ̂. In the nonlinear case, there is no need for this proportional variance.
Anyway, nowadays the hessian matrix for neural networks is more useful for training a small network with a small dataset. But ill-defined hessians and very high dimensional variable spaces seem to be a problem not yet solved in the literature, even currently, considering the increasing sizes of the models used today for images and texts. For moderately larger matrices, an approximation of the hessian matrix (such as from quasi-Newton approaches) may be preferred, as explained in the next chapter.
142
This ends the introduction to variance estimation with pytorch through an example for the linear regression model. It must be added that a resampling algorithm, say the "bootstrap", is an alternative not considered here. The next chapter will consider further the hessian matrix for training (deep) neural networks (with a few hundreds or thousands of variables or weights).
- For the expression to the left, the gradient function is applied to each component of ∂ℓ/∂β = (∂ℓ/∂βj)j. It leads to the wanted matrix, but may ask for a lot of computer memory and may be slow to compute for large values of p and n. Together with a related second-order optimization prone to local minima, this explains why first-order training algorithms are often preferred.
- For the expression to the right, recall that ∂ℓ/∂β = ∂(Σi ℓi)/∂β = Σi ∂ℓi/∂β. The approximate expression just sums the cross-products of the gradients computed for each element of the sample, say the vectors ∂ℓi/∂β = (∂ℓi/∂βj)j. Recall that this alternative expression approximates the exact hessian because the two expectations from each expression are exactly equal according to maximum likelihood theory.
First, let us have a short recall of the formulas for the logistic regression: the expression of the log-likelihood and the derivatives (gradient, hessian), followed by the expression of the information matrix. Then, the variance is computed with pytorch after fitting the regression coefficients of the model on real data, the "abalone dataset".
143
The (inverse) logit link function ensures a linear regression after transformation. Let us denote the diagonal matrix Ω = Diag(p1(1 − p1), . . . , pn(1 − pn)), the dataset being available as the couples (xi, yi), with the loglikelihood written:
$$ \ell(\beta) = \sum_{i=1}^{n} \Big[ y_i\, x_i^T\beta - \ln\!\big(1 + e^{x_i^T\beta}\big) \Big] . $$
Thus,
$$ \nabla\ell(\beta) = \sum_{i=1}^{n} (y_i - p_i)\,x_i, \qquad \nabla^2\ell(\beta) = -\sum_{i=1}^{n} p_i(1-p_i)\,x_i x_i^T, \qquad I_n(\beta) = E[-\nabla^2\ell(\beta)] . $$
Note that, as a result, the Fisher information matrix is equal to minus the hessian, because the hessian is not random and does not depend on the yi:
$$ \nabla\ell(\beta) = X^T(y-p), \qquad \nabla^2\ell(\beta) = -X^T\Omega X, \qquad \hat{I}_n(\beta) = X^T\Omega X , $$
thus,
$$ I_n(\beta) = E[\hat{I}_n(\beta)] = \hat{I}_n(\beta) . $$
With the gradient per unit equal to ∇ℓi = (yi − pi)xi, this also induces that the approximation of the hessian matrix satisfies
$$ E[\tilde{I}_n(\beta)] = I_n(\beta) . $$
As an exercise, the reader may show the equality of the expressions from the first-order derivatives and from the second-order derivatives. Thus the general result is retrieved for this example of a member of the generalized linear models. This is an expectation, thus by the law of large numbers the equality holds for the sum only for large enough samples. This expression from the gradients was implemented for neural networks in the past, before the rise of first-order approaches.
144
6.3.2 Real dataset with two classes
First the dataset is loaded with the module "pandas" because it is in the "csv" format. This module also allows a nice processing of the data table, for naming and accessing the columns with character strings, plus, for the rows and columns, the selection of subsets or the computation of descriptive statistics (sum, mean, variance) for instance. The chosen target variable is the "count of rings", which takes integer values; it is binarized in order to be suitable for a binary classification.
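A minimal sketch of this loading step, assuming a local file "abalone.csv" with a column named "rings" and a threshold chosen for the binarization (file name, column name and threshold are illustrative assumptions):

import pandas as pd

df = pd.read_csv("abalone.csv")
print(df.describe())                       # descriptive statistics per column

# binarize the integer target: 1 when the count of rings is above the threshold
y = (df["rings"] > 9).astype(int).to_numpy()
x = df.select_dtypes("number").drop(columns=["rings"]).to_numpy()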
������ ������ �� ��
������� � �����������������������������������������
� � �����������������������������
� � ����������������������� � �
� � ����������������������������
�������� �������
Let us select two subsamples, the train and test samples.
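A hedged sketch of such a split with sklearn, continuing from the arrays x and y of the sketch above (the proportion is illustrative):

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0, stratify=y)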
145
������� � ���������������������������� ���� ���������
��� � ������������������ ��������
������������� � ��������������������
������� � �����������������������
������������������������
������ ����� �� ��
���� ������� ������ ������������
����� � �����������������������������������������������
���������������������������������
�������� � ������������������ ��������
��������� � �������������������������������
����� � ��������������������������������� ��������
��������� � �������������������������������������� ������� ���������
����������� � ����������������������
������������������� �� ������������������������
������������������ �� ����������������������������������������
������ �����
������� ������������ �� �����
146
���� ���������������� ������ �������������� ����������
������������� � �������������� ����������������������
��������������������� �
������������ � �������������� ���������������������
�������������������� �
�������� � �����������������������������������������������������
������� � ����������������������������������������������������
������������������� ��������
������ �������� �� ��
������ ����
�� � ����������������
����������� � ���
������� � ����
��������� � ��
�������������� � � ������������������������� �
����� � ������������������������������������������
������������������������������
���� � �������������������������������������������
��������� � ����������������������������������� ����������� �������������
������� � ������������������������������������������������
������������������������������������
��������������������������� � ����������������������������������
��������������� ����������������������
��������������������������������
��������������������������������������
����������
For this training, the test loss has converged below the train loss; this suggests a problem of underfitting according to the usual interpretation from the literature, hence the model is not flexible enough, not well trained, or informative variables are missing.
The purpose here is to illustrate the computation of the variance, without tuning the parameters much. For this logistic regression model,
147
�������������� � �����������������������������������������������������
�������������������������
������ �������������������������� �
��������������
���� ����������������
��� �� ������� �� ��������������������
����� � ���������
������������ �� ����������������������������� ���
������������������������������������
������������������
This is near the solution from the dedicated python modules. A more careful tuning of the parameters would lead to a closer solution. For example, sklearn leads to the following solution.
�����������������������������������������
�������������������
������������ � ����������������������������
������� � �
��� � �� �������������������
�� ���������������
������� �� ����������
148
�����
������� �� �
�������
��������� � �����������������������������������
������� �������������
���������������������
� � ������������������������������
�������� � ���������������������������������
��������� � ������������
���
��� �����������������
��� � �������������������
��� � �����������
������ ����������������� �� � ������������������ � ������
���
� � ������������������
�������
��������������������
�������������������������������������������
149
�������� ������� � ������� � ������� � ������� � �������
� ������� � �������� �������������
������������������������������������������������������
The difference comes from the fact that the final solutions for the regression coefficients are only close to each other, as the training concerns a nonlinear function with different initial values and different inferential procedures. The first solution, with statsmodels, comes from a second-order training procedure (see the next chapter), while the second one, with pytorch, comes from a first-order procedure which asks for more careful settings. This is the price to pay in order to get a more scalable algorithm which does not require the full hessian at each step of the training, avoiding an unnecessary and costly numerical burden. An automatic or better setting of the learning rate, with a stochastic approach for an optimal training, would be required here. This is an almost mandatory option for convex objective functions, because the global solution is reachable here.
�� � ������������
������������������������������ ��������������������������
����������������������������� �������������������������
������������������� ��������������
150
�������������� �������� ������� ������� �������� ������� ������� ��
�→������
The hessian is written as the derivative of the components of the gradient vector. It is advised to keep the full set of parameters in a unique list, in order to avoid post-treatment; otherwise, a direct use may lead to one hessian per layer.
������ � ������������������������������
���� � �������������������������������������������
��������� � ����������������������������������� ������� �������������
���������������������
���� � ����
����� � ���������
������ � ����������� ������������������������
������������� � ��
��� ������� �� �������������
�������� � ���������������������������� ������������������� �
������������������ �
������������������ �
������������������
���
This solution is the most direct one 1: by iterating over the components of the gradient, each row of the hessian is computed numerically. The "intercept" or "bias" appears at the last row (and last column).
1 Thanks to "https://discuss.pytorch.org/t/computing-hessian-for-loss-function/67216" for an example of implementation, which was altered for the wanted purpose.
With q the number of parameters (bias and components of the weights) and θ = (θ1, θ2, · · · , θq)T, this corresponds to the definition of the hessian H = Σi Hi, where Hi = ∂/∂θT [gi] from the derivatives. The sum over the whole train sample is performed here with minibatches, H = ΣBb Σi∈sb Hi, which is exactly equivalent to summing directly over the whole sample:
$$ g_i = \frac{\partial}{\partial\theta}\big[\mathrm{loss}(\hat{y}_i(\theta), y_i)\big] = \begin{pmatrix} \frac{\partial}{\partial\theta_1}\big[\mathrm{loss}(\hat{y}_i(\theta), y_i)\big] \\ \frac{\partial}{\partial\theta_2}\big[\mathrm{loss}(\hat{y}_i(\theta), y_i)\big] \\ \vdots \\ \frac{\partial}{\partial\theta_q}\big[\mathrm{loss}(\hat{y}_i(\theta), y_i)\big] \end{pmatrix}, $$
and,
$$ H_i = \frac{\partial^2}{\partial\theta\,\partial\theta^T}\big[\mathrm{loss}(\hat{y}_i(\theta), y_i)\big] = \begin{pmatrix} \frac{\partial}{\partial\theta_1}\big[g_i^T(\theta)\big] \\ \frac{\partial}{\partial\theta_2}\big[g_i^T(\theta)\big] \\ \vdots \\ \frac{\partial}{\partial\theta_q}\big[g_i^T(\theta)\big] \end{pmatrix} . $$
������ � �����������������������
�������
�������������������������
������������
For deep neural networks with hidden layers, the weights are likely to be stored in a matrix for each layer, thus each matrix parameter has to be vectorized by reshaping and by appending the rows or columns, in order to make a unique vector from all these vectorized parameters. Note that recently a submodule of pytorch has been released in version 1.12 for computing the hessian matrix, see "functorch" at "https://pytorch.org/functorch", associated with "vmap" for vectorization, "vjp" and "jvp"; this is out of the scope herein. Note also that one may have to reduce the list of parameters to those which are marked for derivation, when not all the parameters are involved in the gradient. This happens for instance when training only the classifier, while leaving the first layers for the transformation of the images unchanged, in the "transfer learning" setting which uses a pre-trained "convolutional neural network" for image classification. An alternative approximation of the hessian uses the gradients for each data point, as follows.
$$ g_i = \frac{\partial}{\partial\theta}\big[\mathrm{loss}(\hat{y}_i(\theta), y_i)\big] = \begin{pmatrix} \frac{\partial}{\partial\theta_1}\big[\mathrm{loss}(\hat{y}_i(\theta), y_i)\big] \\ \frac{\partial}{\partial\theta_2}\big[\mathrm{loss}(\hat{y}_i(\theta), y_i)\big] \\ \vdots \\ \frac{\partial}{\partial\theta_q}\big[\mathrm{loss}(\hat{y}_i(\theta), y_i)\big] \end{pmatrix}, $$
and,
$$ H_i \approx \tilde{H}_i = g_i\, g_i^T . $$
The new estimation is first initialized to zero. The loop computes the gradients and accumulates
the products. Here the gradients are required for each data sample vector, but the parameters are
not updated because they were found just before.
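A minimal sketch of such an accumulation with pytorch, assuming a fitted model, a loss function and a dataloader over the train sample (names are illustrative):

import torch

def f_hessian_approx(model, loss_fn, dataloader):
    params = [p for p in model.parameters() if p.requires_grad]
    q = sum(p.numel() for p in params)
    h_tilde = torch.zeros(q, q)
    for xb, yb in dataloader:
        for x_i, y_i in zip(xb, yb):
            # per-sample gradient g_i, flattened into a single vector
            pred_i = model(x_i.unsqueeze(0)).squeeze()
            loss_i = loss_fn(pred_i, y_i)
            g_i = torch.cat([g.reshape(-1) for g in
                             torch.autograd.grad(loss_i, params)])
            h_tilde += torch.outer(g_i, g_i)   # accumulate the cross-products
    return h_tilde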
��� ���������������������
������� � �
��� � �� �������������������
�� ��������������� � ������ � ����� � ����
������� �� ���������� � ����������
�����
������� �� ���������� � ������ �� ������� ����� � ���
������ �������
For the case considered here, no matrix of weights appears because only the vector of regression coefficients is modeled. For deep networks, matrices are required for modeling the weights in order to compute the sums from one layer to the next, where the dimensions or numbers of nodes are larger than one. The loss uses a sum instead of a mean for each minibatch, in order to directly retrieve a whole average over the considered sample.
������� � ��������������������
������� � ������������������������������
��������� � ����������������������������������� ������� �������������
���������������������
�������� ������� � �
���������������������������������������������������������� �����
��������� ��������
������� � ������������������������
���
Let us finally compare the resulting standard errors from the two estimators.
����� � ������������������������������������������
����� � ������������������������������������������������
����� � ����������������
������������������������
������������������������
154
������ ����� ����� ����� ����� ����� ������
������ ����� ����� ����� ����� ����� ������
�������������������������������������������
This alternative estimator of the Fisher information matrix may be less accurate for the estimation of the variance than the more usual one (which is exact for the logistic regression), with this sample size and with smaller resulting standard deviations, but also for some runs of the algorithm with different initializations. On the other hand, it cannot suffer from numerical problems coming from the algorithmic evaluation of the second-order derivatives. The matrix H̃ is also relevant for a second-order inferential procedure, as studied in the next chapter.
6.4 Exercices
1. (stat+numpy) Write the function for the variance of the regression coefficient vector for the multinomial regression and compare the result with the output of statsmodels for the "iris dataset".
2. (stat+numpy) Compute the hessian matrix and the corresponding estimation of the variance matrix of the parameters when the dataset is the one from the example of chapter 4, from the data file "xy_2d_diskandnoise_reglogistic.txt". Is the estimation of the variance relevant? Explain why and propose an alternative method (with pytorch or possibly "scipy.optimize", by reimplementing the network model, as for the polynomial regression of chapter 1).
3. (stat+numpy) In order to better understand the issue with the dataset in the exercise just before, generate several datasets of the same size with a circle within a circle, or a square within a square, or any other convex regular shape, with an increasing surface for the outer geometrical object (circle, square or other). Compute the hessian for the different datasets and different dataset sizes, and discuss the results.
4. (stat+numpy) Compute a candidate numerical value for the standard deviation of the hidden latent layers with an update of the functions above (for large enough datasets, in order to get positive diagonal terms after inversion of the hessian matrix). Find also a candidate numerical value for the variance of the predicted target variables ŷi from a deep regression with one hidden layer.
5. (stat+numpy) a) After testing the lasso for the deep regression model from the introduction chapter, which allows fitting the data from a polynomial regression, find the variance of the parameters with the hessian presented here. Compare with a resampling method such as the bootstrap. If the hessian is not well defined, explain why. b) Retry this by cancelling the zero terms and removing them from the gradients and the hessian. How does this change the results? c) Propose an experiment with artificial data for this same regression model in order to find which dataset size is required to get a well-defined and invertible hessian matrix, and also whether the distribution of the xi or their range has an influence on this result. Could this result change with a different polynomial function, and how does it change with the degree of the polynomial?
6. (stat+numpy) a) Propose a method for the variance of the frontier with the example of classification from the dataset "xy_2d_reglogistic.txt". Check whether the true frontier lies in the corresponding interval at the 95% confidence level. b) Idem for the dataset "abalone".
7. (stat+numpy) Retrieve for the logistic regression the result about the alternative expression of the information matrix, by showing that E[(yi − pi)²] = pi(1 − pi). Here, a clue is that yi² = yi for a binary quantity, while E[yi] = pi. Check the result by a numerical simulation.
8. (stat) Retrieve the expression of the Fisher scoring for the logistic regression.
156
Chapter 7
In this chapter, we are interested in the second-order estimation for nonlinear models such as (deep) generalized linear models, where the distribution modeling the dependence of the target variable yi w.r.t. the variables xi belongs to the exponential family. For datasets of moderate size, the classical method for the optimization is the "Newton-Raphson procedure" (or NR), which uses the hessian matrix for rescaling the gradient vector at each iteration. For larger datasets, the methods named "quasi-Newton" (or QN) prefer to approximate the hessian and its product with the gradient, in order to avoid numerical issues coming from high dimensional spaces.
Contrary to the first-order methods, the Newton-Raphson procedures are not suitable if the neural network is too large (too many parameters) for the current computer architecture, because the hessian does not fit in memory or is too slow to compute. Another limit is when the model itself is too flexible and the dataset too small, such that the variance is not reliable: the hessian matrix is ill defined, not invertible, or gives negative values for the variance. It is supposed that all goes well here; otherwise a variable selection, a weights selection, a tuned regularization or, if required, a simpler model are often able to solve this issue.
In the first chapters, we have studied the linear models, their nonlinear extensions within the frame-
work of the neural networks, and several iterative algorithms which update the parameters in order
to minimize the loss functions (or maximize a loglikelihood). The chapter just before has discussed
the second-order derivatives aggregated in the hessian matrix and the related variance for (linear)
regression and classification when modeled with pytorch.
The hessian and the approximations introduced previously are involved in the first part of the current chapter, via an implementation with numpy for the second-order training of a regression model. A procedure from the quasi-Newton family concludes the chapter in a second part with pytorch, via its dedicated built-in function.
7.1.1 Definitions
Let θ denote the natural parameter and φ the dispersion parameter, with positive values. The target variable y_i belongs to an exponential family with a density of the very general form
$$p(y \mid \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right).$$
In generalized linear models, the mean µ = E(Y|x) is written via a strictly increasing link function g(.), applied to the usual inner product of the parameter vector β with the vector of explicative variables (with a leading one for the intercept) x_i. The link function has an inverse g^{-1}(.), and with the linear predictor η = x^T β,
$$g(\mu) = \eta = x^T \beta, \qquad \mu = g^{-1}(\eta).$$
For instance, these link functions are defined just below for two distributions met every day in regression problems:
• Gaussian: η = µ, with 1 for the variance function.
• Poisson: η = log µ, with µ for the variance function.
Other link functions are available for the Inverse Gaussian, Gamma, Multinomial, Binomial and Poisson regressions, among others.
The Fisher information matrix is the expectation of (minus) the hessian of the loglikelihood, with E[y_i] = µ_i, while Ω = Diag([µ'_i(η_i)]² / σ_i² : 1 ≤ i ≤ n), thus
$$I_n(\beta) = E[-\nabla^2 \ell(\beta)] = \sum_{i=1}^{n} \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2}\, x_i x_i^T = X^T \Omega X.$$
When inverted, the hessian or the Fisher information matrix provides the covariance matrix of the estimated parameters, as explained in the previous chapter. This matrix is also suitable for decreasing the loss instead of just considering the first-order derivative, with the additional burden of its computation cost.
7.2 Algorithms for second-order training
A deep linear regression can be interpreted as a statistical nonlinear regression with a particular expression for the nonlinearities. Historically, the Gauss-Newton method was introduced for the inference of the parameters of such a model, but with a growing number of variables and large datasets, first-order algorithms are usually more practicable.
Thus the following algorithms are for networks of moderate size, except when fast computers, which are not available to everyone, can be used. The second-order approximation, which is the foundation, is presented just before the algorithms:
$$\ell(\beta) \approx \ell(\beta_{(t)}) + \nabla \ell(\beta_{(t)})^T (\beta - \beta_{(t)}) + \tfrac{1}{2}\, (\beta - \beta_{(t)})^T \left[ \nabla^2 \ell(\beta_{(t)}) \right] (\beta - \beta_{(t)}).$$
Except for the Gaussian case with canonical link, the problem to solve is nonlinear and needs an iterative gradient procedure. The corresponding updates include a ridge term with parameter τ ensuring that the hessian matrix is invertible; the update formula is exactly the same as the one given for the algorithm in the next paragraph. The estimator is
$$\hat\beta = \operatorname*{argmin}_{\beta}\ \ell(\beta).$$
We are still interested in minibatches, but possibly also in the full dataset if it is not too large. Thus, with B minibatches, where possibly B = 1, the loss function is still written as
$$\ell(\beta) = \sum_{b=1}^{B} \ell_b(\beta) = \sum_{b=1}^{B} \sum_{i \in s_b} -\log f(y_i; x_i, \theta) = \sum_{b=1}^{B} \sum_{i \in s_b} \ell_i(\beta).$$
- The second-order update formula for β_(t) is now written from the formula of the paragraph above, by setting the gradient to zero,
$$\frac{d}{d\beta}\,\ell_b(\beta) = \begin{pmatrix} \frac{\partial}{\partial \beta_0}\,\ell_b(\beta) \\ \frac{\partial}{\partial \beta_1}\,\ell_b(\beta) \end{pmatrix} = 0.$$
The new update formula for the parameters at each iteration still involves the gradient, but with the inverse hessian appearing as a multiplicative corrective factor, such that for τ = 0 the second-order update is:
$$\begin{pmatrix} \beta_0^{(t+1)} \\ \beta_1^{(t+1)} \end{pmatrix} = \begin{pmatrix} \beta_0^{(t)} \\ \beta_1^{(t)} \end{pmatrix} - \alpha_t \begin{pmatrix} \frac{\partial^2}{\partial \beta_0 \partial \beta_0^T}\,\ell_b(\beta) & \frac{\partial^2}{\partial \beta_0 \partial \beta_1^T}\,\ell_b(\beta) \\ \frac{\partial^2}{\partial \beta_1 \partial \beta_0^T}\,\ell_b(\beta) & \frac{\partial^2}{\partial \beta_1 \partial \beta_1^T}\,\ell_b(\beta) \end{pmatrix}^{-1} \frac{d}{d\beta}\,\ell_b(\beta).$$
A learning rate may be required because the second-order series around the current solution may not be accurate enough, so the update is reduced in order to ensure convergence despite the gap with the true function.
- This is implemented next with numpy for the Poisson regression (see the next section for the expression of the loss and loglikelihood), where the target variable takes positive integer values or counts, so the distribution involved changes. With D_µ = Diag(µ_(k)), the update for the NR algorithm is simply:
$$\begin{aligned} \beta_{(k+1)} &= \beta_{(k)} - \alpha_t \left[ \frac{d^2 \ell(\beta_{(k)})}{d\beta\, d\beta^T} + \tau I_p \right]^{-1} \frac{d\ell(\beta_{(k)})}{d\beta} \\ &= \beta_{(k)} - \alpha_t \left[ \sum_i \mu_i x_i x_i^T + \tau I_p \right]^{-1} \left[ \sum_i (y_i - \mu_i)\, x_i \right] \\ &= \beta_{(k)} - \alpha_t \left[ X^T D_\mu X + \tau I_p \right]^{-1} \left[ X^T (y - \mu) \right] \end{aligned}$$
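As an illustration only, here is a minimal numpy sketch of such a batch update, assuming a design matrix X with a leading column of ones and counts y; the function name, the learning rate and the sign convention (maximizing the loglikelihood) are choices made for this sketch, not the book's own listing.

```python
import numpy as np

def nr_step_poisson(beta, X, y, lr=1.0, tau=0.0):
    """One batch Newton-Raphson (Fisher scoring) step for Poisson regression.
    X carries a leading column of ones; lr plays the role of alpha_t and
    tau is the ridge term ensuring invertibility."""
    mu = np.exp(X @ beta)                                      # current Poisson means
    grad = X.T @ (y - mu)                                      # gradient of the loglikelihood
    hess = X.T @ (mu[:, None] * X) + tau * np.eye(X.shape[1])  # X^T D_mu X + tau I_p
    return beta + lr * np.linalg.solve(hess, grad)             # ascent step on the loglikelihood
```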
This is written for the Poisson regression in the implementation below. The inversion is then in closed form from the previous step. When there are several data vectors coming from the new minibatches, this formula is applied m times, m being the size of the minibatches. Let a_1 = mb / (m(b+1)), with a_1 = 1 at the first minibatch of each epoch, and a_2 = 1 / (m(b+1)). In the python code, the following update was tried and kept because the average is taken over a minibatch, and this seems to lead to a stable convergence here for different dataset sizes, learning rates or minibatch sizes,
$$\tilde H_{bm+1}^{-1} = a_1\, \tilde H_{bm}^{-1} - a_2\, \frac{1}{1 + a_2\, g_{bm}^T \tilde H_{bm}^{-1} g_{bm}}\; \tilde H_{bm}^{-1} g_{bm} g_{bm}^T \tilde H_{bm}^{-1}.$$
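A minimal numpy sketch of this rank-one update of the inverse, assuming g is the averaged gradient of the current minibatch and a1, a2 the weights defined just above (the function name is illustrative):

```python
import numpy as np

def update_inv_hessian(H_inv, g, a1, a2):
    """Rank-one update of the running approximation of the inverse hessian,
    following the formula above."""
    Hg = H_inv @ g                          # H^{-1} g
    denom = 1.0 + a2 * float(g @ Hg)        # 1 + a2 g^T H^{-1} g
    return a1 * H_inv - (a2 / denom) * np.outer(Hg, Hg)
```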
For this dataset, the final result was relevant according to the numerical experiments with numpy (and also when changing the parameters generating variants of the dataset for a Poisson regression). The simple expression just above seems slightly different from the approaches proposed in the literature. The expression is fully matricial and might be faster than repeating the Sherman-Morrison formula n times, at least for languages such as python, here with numpy for processing the algebra. A perspective is to add some sketching with a random projection or pca (see the next chapter about reductions), not discussed here.
Next, the different formulas for updating the parameters and decreasing the loss function are compared with an artificial dataset for the Poisson regression.
[Unreadable code listing in the source.]
Let us clean and check the computer memory state. There should be some space remaining, otherwise the operating system risks freezing.
[Unreadable code listing in the source: memory check and cleaning.]
A data sample of 5000 observations from a Poisson regression with 6 independent variables is generated and stored in files on the computer. The sample is split into train and test subsamples in order to check the model on new data. They are stored as files on the disk, together with the true parameters. The datasets and the true parameters are loaded as follows.
[Unreadable code listing in the source: loading of the train and test datasets and of the true parameters.]
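Since that listing is unreadable here, the following hedged numpy sketch shows how such a dataset could be generated, split and saved; the file names, the split proportion and the range of the true coefficients are assumptions, not the book's choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 6                                     # sample size and number of variables
beta_true = rng.uniform(-0.5, 0.5, size=p + 1)     # true coefficients, intercept first
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.poisson(np.exp(X @ beta_true))             # Poisson counts

idx = rng.permutation(n)                           # train/test split (80/20 here)
train, test = idx[:4000], idx[4000:]
np.savetxt("poisson_train.txt", np.column_stack([X[train], y[train]]))
np.savetxt("poisson_test.txt", np.column_stack([X[test], y[test]]))
np.savetxt("poisson_beta_true.txt", beta_true)
```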
The true regression coefficients are supposed unknown during the training; the purpose of the model fitting is to retrieve this vector. This implies that, usually, the loglikelihood (and the related bic, aic or similar criteria) is essentially the only value (plus possibly the residuals and related quantities) at stake when comparing several models. For generated data, the true β and the true expectations µ_i = exp(β^T x_i) are known, thus they can be compared after training with the estimated ones, β̂ and µ̂_i = exp(β̂^T x_i).
The discrete distribution of the target variable y is visualized with a "barplot", after counting the number of observations per possible value of the variable. This uses the python class "Counter" from the module "collections", which asks for a list as input sequence. The graphic is from the module "matplotlib", while "seaborn" allows more advanced outputs not used here.
Figure 7.1: Barplot for the target variable from Poisson regression
[Unreadable code listing in the source: counting the target values and drawing the barplot of Figure 7.1.]
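A minimal sketch of such a barplot with "Counter" and "matplotlib", assuming the train counts are stored in a numpy array y_train (the variable name and labels are illustrative):

```python
from collections import Counter
import matplotlib.pyplot as plt

counts = Counter(list(y_train.astype(int)))        # observations per count value
values = sorted(counts)
plt.bar(values, [counts[v] for v in values])
plt.xlabel("y (count)")
plt.ylabel("number of observations")
plt.title("Barplot of the target variable")
plt.show()
```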
This barplot might suggest an excess of zeros, but checking the mean and the variance confirms that the usual Poisson distribution is a good candidate:
[Unreadable program output in the source: empirical mean and variance of the counts.]
The first component in X_i contains only the value one in order to include the intercept in the vector of regression coefficients β; otherwise the usual notation x_i without it would be used. This avoids separating the intercept from the vectors, so that all the unknown quantities are aggregated in a unique vector β. In python modules such as sklearn it is often possible to separate or include both, but for computation with numpy the algebra is lighter with a unique vector instead of a vector plus a scalar.
With the normalization of the columns, the true β is not retrieved even approximately, but the Poisson parameters µ_i should be retrieved with the estimation µ̂_i = exp(x_i^T β̂), because the expectations remain identical.
[Unreadable code listing in the source.]
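As a hedged example of a reference fit for this model, the following sketch uses statsmodels (which is quoted later in the chapter for the comparison); the variable names X_train and y_train are assumptions.

```python
import statsmodels.api as sm

# X_train already carries the leading column of ones, so no add_constant here
res = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
print(res.summary())                  # coefficients, standard errors, loglikelihood
mu_hat = res.predict(X_train)         # estimated means exp(x_i^T beta_hat)
```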
Thus, the loglikelihood is defined as the log of the product of the n mass functions above, one per sample observation, such that
$$\ell(\beta) = \log L(\beta) = \sum_{i=1}^{n} \log f(y_i; \beta) = \sum_{i=1}^{n} \log\left( \frac{\mu_i^{y_i}}{y_i!}\, e^{-\mu_i} \right) = \sum_{i=1}^{n} y_i \log \mu_i - \sum_{i=1}^{n} \mu_i - \sum_{i=1}^{n} \log y_i!\ .$$
The maximum likelihood solution is
$$\hat\beta = \operatorname*{argmax}_{\beta}\ \sum_{i=1}^{n} \left( y_i\, \beta^T x_i - e^{\beta^T x_i} \right).$$
For the optimization with $G_k = \frac{d\ell(\beta)}{d\beta}\Big|_{\beta_{(k)}}$ and $H_k = \frac{d^2\ell(\beta)}{d\beta\, d\beta^T}\Big|_{\beta_{(k)}}$, this leads to computing the derivatives given in the previous section just above: it is a good exercise to retrieve these expressions.
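As a possible answer to that small exercise, a hedged numpy sketch of these quantities for the Poisson case (scipy is assumed available for the log(y_i!) term):

```python
import numpy as np
from scipy.special import gammaln     # log(y!) term

def poisson_loglik_grad_hess(beta, X, y):
    """Loglikelihood l(beta), gradient G_k and hessian H_k for Poisson regression."""
    eta = X @ beta
    mu = np.exp(eta)
    loglik = float(np.sum(y * eta - mu - gammaln(y + 1.0)))
    G = X.T @ (y - mu)                      # first derivative of l(beta)
    H = -X.T @ (mu[:, None] * X)            # second derivative: -X^T D_mu X
    return loglik, G, H
```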
[Unreadable code listing in the source: python class for the sequential inverse of the hessian approximation used with minibatches.]
The previous class is for the sequential inverse of the matrix in the natural gradient procedure with minibatches, while the following function is for the fitting of the Poisson model with the four approaches.
[Unreadable code listing in the source: function fitting the Poisson model with the four approaches described below.]
The final loglikelihoods from each algorithm are compared at the end of the inference, when the values of β_(k)^algo for k = 1, 2, ... become stable at the final k = k_end^algo. Indeed, each algorithm leads to its own maximum likelihood solution β̂^algo, hence a different value for the loglikelihood ℓ(β̂^algo). This does not require any information other than the available data sample, the usual situation in practice. As the true parameters are known here, the mean squared error and the correlation coefficient between the true parameters µ_i and the estimations µ̂_i^algo are also computed for each algorithm "algo", which allows a further analysis.
The results are checked with the correlation and the mse. The final result depends on the chosen test sample, hence a cross-validation could be preferred here. It is also better to run the fitting several times with different random initializations and keep the best result.
[Unreadable code listing in the source.]
The four algorithms are run one after the other as follows:
• For the batch methods, the Newton-Raphson updates with exact hessian.
[Unreadable code listing in the source: run of the Newton-Raphson updates with exact hessian.]
The diagonal version "nr_diag" just inverses the diagonal matrix by keeping only the diago-
nal elements from the hessian matrix. A better way would be to compute this diagonal only,
instead of zeroing the non diagonal cells.
• For the batch methods, the natural gradient updates with approximated hessian.
[Unreadable code listing in the source: run of the natural gradient updates with approximated hessian.]
The diagonal version "ef_diag" inverses only the diagonal matrix by keeping the diagonal
elements from the approximated hessian matrix.
• For the sequential methods, the updates with the gradient vectors from minibatches.
[Unreadable code listing in the source: run of the sequential updates with the gradient vectors from minibatches.]
• For the sequential methods, the updates "efseq" with the inverses of an approximated hessian
from minibatches. Here, "efseq1" is for B = n (m = 1) with the Sherman-Morrison formula
only while "efseq" is for B < n (m > 1) with the Woodbury identity.
[Unreadable code listing in the source: run of the "efseq" updates with the approximated inverse hessian from minibatches.]
The algorithms are compared visually via their loglikelihood in the next graphic. For the sequential procedures with the numpy implementation above, a "gradient clipping" was required, with component-by-component bounds on the gradient during training: this avoided a "divergence" where the consecutive values of the parameters do not stabilize over time. An alternative is a strict bound on the norm of all the components together, which is sometimes also required in deep learning for some models; there the norm is capped instead of truncating the components.
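A minimal numpy sketch of the two clipping strategies mentioned here; the bound value and the function name are arbitrary choices for the illustration:

```python
import numpy as np

def clip_gradient(g, bound=1.0, mode="component"):
    """Component-wise clipping of the gradient, or rescaling of the whole
    vector when its norm exceeds the bound (the bound value is arbitrary)."""
    if mode == "component":
        return np.clip(g, -bound, bound)
    norm = np.linalg.norm(g)
    return g if norm <= bound else g * (bound / norm)
```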
Which algorithm behaves best depends on the dataset and the settings, hence, when possible, several algorithms should be tested and compared. Some improvement of the settings for the learning rate and the minibatch size is possible via "cross-validation" for instance. More advanced second-order sequential methods (for high dimensional models) exist in the literature and some are implemented for pytorch; one is tested in the next section.
Retrieving the exact counts is not at stake (because of the sampling generating the counts), thus the Poisson parameters are compared instead. Comparing the loglikelihood (or the related bic and aic) is also more usual than the mse, because the true parameters are often unknown. The numerical results are summarized in the following table.
[Unreadable code listing and results table in the source: summary of the loglikelihood, mse and correlation for each algorithm.]
Here, with s = s_train ∪ s_test and n_test = |s_test|, "msemu_test" is computed on the test sample:
$$\mathrm{mse}(\hat\mu^{test}, \mu^{test}) = \frac{1}{n_{test}} \sum_{i \in s_{test}} \left( \hat\mu_i^{test} - \mu_i^{test} \right)^2.$$
With a factor of four, the batch procedures have led to a smaller mse than the minibatch procedures here, despite a small difference in the loglikelihood; this may underline the superiority of such approaches for small datasets. Note that, concerning these implementations with numpy, the python code with the minibatches has some issues here: the first-order procedure does not have the best settings, while the second-order one could prefer (for some other datasets too) a full computation of the inverse of the hessian at each epoch. Testing such variants and improvements is a possible exercise left to the reader in order to further practice these algorithms; it is advised to separate each algorithm into its own dedicated function before updating the python code. The next section presents a more robust and advanced approach from the "quasi-Newton" methods, plus a better setting for the gradient approach, both with the module pytorch and the gpu when available.
[Unreadable code listing in the source.]
The loss function for the Poisson regression is already defined in the pytorch module, otherwise, it
may be defined as:
[Unreadable code listing in the source: commented-out manual definition of the Poisson loss.]
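Since that listing is unreadable here, the following hedged pytorch sketch shows the built-in loss and an equivalent manual definition (up to the constant log(y!) term); it is an illustration, not the book's own code.

```python
import torch

# Built-in loss: with log_input=True the model outputs eta = log(mu)
loss_fn = torch.nn.PoissonNLLLoss(log_input=True, full=False)

# Equivalent manual definition (up to the constant log(y!) term)
def poisson_nll(eta, y):
    return torch.mean(torch.exp(eta) - y * eta)

eta = torch.randn(8)                      # toy linear predictors
y = torch.poisson(torch.exp(eta))         # toy counts
assert torch.allclose(loss_fn(eta, y), poisson_nll(eta, y))
```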
The new function provides simplified access to the class for the neural networks, and returns a dictionary instead of separate variables at the end of the call.
[Unreadable code listing in the source: generic training function for the (deep) glm, returning a dictionary of results.]
The transformations are as before: a standardization is possible for the columns of the design matrix X, and possibly an initialization of the model is required.
[Unreadable code listing in the source: data transformations and preparation.]
Here a gradient clipping is implemented, which prevents the gradient from growing too much during the training. This seems mandatory for first-order algorithms, which does not seem much discussed in the literature. The bounds for the clipping were chosen arbitrarily after testing several values. Similarly, in some cases, the initialization seems to require values bounded around 0 from a uniform distribution, otherwise the convergence was towards a mistaken solution. This may be related to some initializations used in deep learning with gradient procedures. For the proposed generic implementation, one needs to declare a new function and pass it at the call, while the initialization is the default one for this version of pytorch.
[Unreadable code listing in the source: custom initialization function passed to the training call.]
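A hedged pytorch sketch of such a bounded uniform initialization together with component-wise gradient clipping; the function name, the bound and the clip value are assumptions, not the book's own listing.

```python
import torch
from torch import nn

def f_init_uniform(module, bound=0.1):
    """Hypothetical counterpart of the book's initialization function:
    uniform weights around zero for every linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.uniform_(module.weight, -bound, bound)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(f_init_uniform) before training, and inside the loop, after backward():
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
```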
The model is trained with the new function above for (deep) glm; this is more convenient because one just needs to define the structure of the network with the hidden layers and the name of the model.
[Unreadable code listing in the source: definition of the network structure and call to the training function.]
Let us compute the statistics measuring the quality of the final model according to the true parameters. Remember that this is not possible with real data, except for the loglikelihood. Here the dataset is artificial and the additional indicators are informative for checking the implementation and for further comparisons.
[Unreadable code listing in the source: computation of the quality indicators (function "f_mu_mse_cor_poisson").]
The likelihoods and indicators are obtained above, but one may keep in mind that proceeding in "f_mu_mse_cor_poisson" with chunks would be better for larger datasets; this can be seen as an exercise left to the reader (otherwise, there is a risk of the computer system freezing when python uses a large virtual memory).
[Unreadable code listing in the source.]
This is not very far from the solution obtained above. Actually some installations of pytorch may perform slower than numpy (which could be even faster with some gpu directives, as with "numba"). But the automatic derivatives are very appealing for (any) more complex models. The parameters are not much tuned for the sequential methods with minibatches and the updates are not stochastic, which explains the lesser performance here.
7.5.2 Example of L-BFGS for training at second-order
In this section, we are interested in quasi-Newton methods. At step k, they aim at approximating the usual update for the parameters, θ_(k+1) = θ_(k) + δ_k. Here δ_k is unknown and solves B_k δ_k + g_k = 0, where B_k ≈ H_k. The approximation B_k of the exact hessian H_k generally needs to be positive definite in order to ensure an inverse with good properties.
In particular, the method called limited-memory BFGS is an improved algorithm derived from the original Broyden–Fletcher–Goldfarb–Shanno (BFGS) formula. This formula proposes an approximation of the hessian and of its inverse from two consecutive values of the gradient, say g_k and g_{k+1}, plus the two last consecutive parameter vectors, say x_k and x_{k+1}. The differences s_k = x_{k+1} − x_k and y_k = g_{k+1} − g_k are involved. By equating B_{k+1} s_k = y_k, and after some algebra, this leads to the update of the approximating hessian:
$$B_{k+1} = B_k - \frac{1}{s_k^T B_k s_k}\, B_k s_k s_k^T B_k + \rho_k\, y_k y_k^T.$$
Here, ρ_k denotes the factor 1/(y_k^T s_k). The inverse is available in closed form, via an algorithm with a truncated sum for the limited-memory version.
The implementation is tricky and out of scope, thus the existing optimizer from pytorch is used. It is named "optim.LBFGS" and currently asks for a different implementation of the backpropagation (in the loop with the minibatches), exceptionally, so this part must be rewritten in python by following the documentation. Many variants exist, but they are not implemented in pytorch; the available one is stable and performs well. Note also that the sequential process is managed by the algorithm itself, hence in this case it is better to include the whole dataset at once in the dataloader. In some cases minibatches may still lead to convergence, but this is not guaranteed at all, and it is not wise to avoid the batch version (except for the recent variants called stochastic).
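A minimal pytorch sketch of the closure required by "optim.LBFGS", assuming a model, a loss function and full-batch tensors are available; the hyperparameter values are illustrative.

```python
import torch

# Assuming a model, a loss_fn and a single full-batch pair of tensors (X_full, y_full)
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.5, max_iter=20,
                              history_size=10, line_search_fn="strong_wolfe")

def closure():
    optimizer.zero_grad()
    loss = loss_fn(model(X_full), y_full)
    loss.backward()
    return loss

for epoch in range(30):
    loss = optimizer.step(closure)    # LBFGS re-evaluates the closure internally
```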
For this optimizer, a test is added in order to check which optimizer is involved in the function "f_train_glmr()" from the module deepglmlib and the file "utils.py". Hence, this is already included in the training function in the companion file.
The call to the function is thus as before, by just changing the name of the optimizer and the dataloaders. For instance, some optional regularization is added as follows.
[Unreadable code listing in the source: definition of the regularization term added to the loss.]
[Unreadable code listing in the source: call to the training function with the l-bfgs optimizer and computation of the indicators.]
The results are summarized in the following table for comparison purposes. The correlation is not given, as it is less informative than the mse. The second-order method clearly performs well here, as expected for a small dataset. The table is thus:
[Unreadable code listing and results table in the source: comparison of the final loglikelihood and mse per optimizer.]
The implementation with pytorch leads to the same loglikelihood (and even slightly better) as the implementation from statsmodels, while the gradient approach is almost identical, with a small difference. For second-order training with pytorch, the hessian has recently been approximated via external modules, with for instance a diagonal hessian in the case of very large models: fully diagonal or by blocks per layer. External modules such as "backpack" are found in repositories; this gives access to other approximations, but sometimes only for some architectures of neural networks. Second-order methods are more prone to local minima, hence for non-convex optimization they should not be considered without further checking. Some mixed method, combining with another optimizer at the beginning or at the end of the training, looks like the best way to proceed here.
7.6 Exercices
1. (stat+numpy) Rewrite the function for the four second-order procedures into separate functions, one per method. Test the algorithms for the linear regression and the logistic regression by changing the expression of the gradient vector and the hessian matrix, and compare the output with statsmodels for the datasets from the previous chapters herein.
2. (stat+pytorch) Same as just before, with pytorch instead of numpy. Compare with the result from the l-bfgs algorithm.
3. (numpy) Test a cross-validation of the newton-raphson algorithm and the approximated ver-
sion with numpy and find out if five or ten folds are relevant for approaching the expected
error.
4. (numpy) Test the deep Poisson regression with an artificial dataset with only one independent variable and a nonlinear function in µ. Compute the mean squared error for diverse architectures (varying the number of hidden layers and the number of neurons). See "https://link.springer.com/article/10.1007/s00521-009-0277-8" for instance for examples of nonlinear regression.
5. (stat+pytorch) Implement a zero inflated model for deep learning with count data. The dat-
aloader with the id from each row may be required here if shuffling is true, in order to keep
the ordering of the rows for the vector of mixing probabilities.
6. (stat+pytorch) In some not so rare cases for deep neural networks, the hessian is expected to be ill defined, with an inverse which leads to negative variances on the diagonal. Propose solutions in order to solve this issue. For instance, test a regularization, or test a Gan for tabular data in order to augment the number of observations from the dataset, and test a variable selection method in order to keep only the useful variables in the input layer. Is there a difference between variable selection and weight removal from the neural network in order to improve the generalization? An example of network is the one for the artificial data "diskandnoise", see the previous chapters.
7. (pytorch) Check the convergence by plotting locally the loss function as curves for each variable, and also deduce a numerical value for the gradient.
8. (pytorch) After reading the documentation, test the module1 "backpack" with more advanced second-order derivatives. Compare with l-bfgs for a small dataset. Compare with a first-order training for a large dataset.
1 "https://pypi.org/project/backpack-for-pytorch/"
Chapter 8
In the previous chapters, neural networks for regression and classification have been studied when the dataset is a data table: nonlinearities are added to the family of generalized linear models. Their definition, their training and their implementation have been presented for several datasets with several modules available in the computer language "python". These approaches are called "supervised" because they aim at predicting a variable y_i from a variable x_i. When only the variables x_i are available, the set of rows or the set of columns of the data table can be described by methods called "unsupervised". Typically, two main families of unsupervised methods exist: a) "clustering" (out of the scope here), which proposes a descriptive "partitioning" without examples, hence according to a criterion such as minimum intra-cluster variance, and b) "reduction" (this chapter), which proposes a new set of rows or columns according to a criterion such as maximum variance.
• When these methods provide new low dimensional vectors x̂_i from the high dimensional ones x_i, they need to preserve some information from the former space: the whole variance between data points, the distances between data points, their nearest neighbors, or even the structure with the furthest ones. To be clear, the method named "pca" preserves mainly the variance while the one called "t-sne" preserves neighbors, thus they do not have exactly the same utility. Note that t-sne is the culminating method which came after several others (Isomap, LLE, SNE, Laplacian Eigenmaps and a few more) with lower performance.
• Several criteria from unsupervised methods are summarized in the table below.

  Name     Type        Criterion
  K-means  Clustering  ({ẑ_ik}, {µ̂_k}) = argmin_z Σ_{i=1}^n Σ_{k=1}^g z_ik ‖x_i − µ_k‖²
  GMM      Clustering  ({ẑ_ik}, {µ̂_k}, σ̂) = argmax_z Σ_{i,k} z_ik log N(x_i − µ_k, σ²)
  PCA      Reduction   x̂ = x ŵ^T, with ŵ = argmax_{‖w‖=1} (1/n) Σ_i w^T x_i^T x_i w
  ISOMAP   Reduction   x̂ = {x̂_1, ..., x̂_n} = argmin_{x̂} Σ_{i,j} v_{i,j} (δ_{i,j} − d_{i,j})²
  t-SNE    Reduction   x̂ = {x̂_1, ..., x̂_n} = argmin_{x̂} Σ_{i,j} p_{ij} log(p_{ij}/q_{ij})
Here, a simplified version of principal component analysis (pca with the data columns supposed standardized, with only one latent component, and with the constraint ŵ^T ŵ = 1) and of Gaussian mixture models (gmm, here the non-fuzzy version) are presented; see the literature and the documentation of sklearn for more information. Note also that d_{i,j} is the Euclidean distance between x_i and x_j, while δ_{i,j} is the Euclidean distance between the two lower dimensional vectors x̂_i and x̂_j, and v_{i,j} is the graph of the nearest neighbors. For SNE and t-SNE, the probabilities p_{ij} and q_{ij} are respectively the transformations, via an exponential function or a t-distribution density function, of d_{i,j} and δ_{i,j}, plus a normalization: this allows to improve on Isomap by increasing the frontiers between the clusters in the data space or the projection space.
• With neural networks, it becomes possible to model the nonlinearities with hidden layers instead of the vicinity graph. Thus several architectures of neural networks have been invented for explaining the contents of a data table. Among the existing reduction methods, autoencoders are able to extract a linear or nonlinear subspace from the set of vectors x_i. A reduction of the dimension may serve one of the following purposes: exploration, visualization, feature extraction (new variables before regression or classification), regularization and noise reduction, less computer memory and more speed, outlier detection. More generally, this addresses the curse of dimensionality by retrieving the lower dimensional subspace where the data really lies, instead of a high dimensional space with many useless variables. This comes from the numerous variables which share the same information and are correlated, or can be explained by other variables. Removing the noisy and useless information allows keeping a summary which is informative for the human and the machine: for interpretation or computing.
• In an autoencoder, the reduced vectors denoted x̂_i come from a hidden layer of a neural network. It is defined very similarly to the networks of the previous chapters, except that the output is no longer a variable y_i but just x_i, while the input remains x_i. Autoencoders are quite old neural networks which are currently modernized with the rise of computer modules such as pytorch, and they are able to lead to a better reduction than a usual linear method such as pca, and even sometimes better than current state-of-the-art methods such as t-sne. An autoencoder (ae) in the simplest linear case behaves like pca, but when nonlinearities are added and when it is combined with other neural architectures, the ae is expected to perform better than other methods. However, the training of an autoencoder is currently often difficult, because of the intensive search for the optimal hyperparameters and the unsupervised setting. Thus, one more often finds autoencoders specialized in a family of datasets, for instance medical images of a body part, clinical data for a given disease, or biological data related to dna: each architecture of ae needs a careful training in order to be able to reduce the dimensionality of the available data and of future nearly identical data.
The chapter is divided into three parts: first pca, then ae and t-sne, with an artificial dataset and the 60k digit images dataset named "mnist", which is currently the most usual dataset for model comparisons.
The data are gathered in a matrix which aggregates the data vectors x_i as rows. Such a matrix may even be included within a model in order to regularize an unknown latent matrix.
A data matrix in numerical algebra comes with a Euclidean space on the space of the columns, for the features or variables. The matrix may not be of full rank, with some correlation between columns, which can make life difficult for some algorithms such as regression or even classification. These statistical duplicates induce a large number of columns, a numerical burden during the matricial computations, or even an erroneous solution from the optimization. Hence, pca is very appealing because it aims at defining new variables which are a linear combination of the available variables and less numerous, while optimizing a cost function, the variance among these latent variables.
The columns are first standardized,
$$y_{ij} = \frac{x_{ij} - \bar x_j}{\sigma_j}.$$
The two methods for the matricial decomposition lead to the same result, but numerically they do not have the same complexity or the same demand on computer memory. If the number of rows or columns is reduced, the square matrix is a good choice, but for a sparse matrix the rectangular matrix keeps the sparsity. Here, in both cases, Δ_r denotes a diagonal matrix of the singular values λ_ℓ, 1 ≤ ℓ ≤ r, and U_r and V_r are the two orthogonal matrices, where
$$\Delta_r = \mathrm{Diag}(\lambda_1, \lambda_2, \cdots, \lambda_r), \qquad U_r^T U_r = I_r \ \text{ and } \ V_r^T V_r = I_r.$$
Note that the squares λ_ℓ² are called the eigenvalues; they come from the eigendecomposition of a symmetric matrix. For instance the covariance and correlation matrices are positive definite and symmetric.
d) This leads to the projection of Y onto k (with k ≤ r and often k ≪ r) principal components:
$$C_Y^{\mathrm{pca}} = Y V_k = \sqrt{n-1}\, U_k \Delta_k.$$
Here, Δ_k keeps only the k largest singular values of Δ_r, while U_k and V_k are the similarly truncated versions of U_r and V_r with the corresponding eigenvectors. This solution also gives the principal components of X, with C_X^{pca} = C_Y^{pca}. This approach also allows an approximation of X instead of a projection, see next.
Some authors prefer to proceed directly with the covariance matrix, and others would add some additional scaling if an underlying metric is involved; this is discussed in the literature on exploratory data analysis. This explains why different implementations lead to different results, because these options are not always fully documented. Some indicators from pca are not discussed here. Next, the algebra for pca is explained further.
Singular value decomposition (svd)
The definition is not always clear in the literature from one author to another, because the decomposition is written in three different ways. This paragraph allows a better understanding of this important part of the algorithm. The svd has a matricial notation for the following result, for any real matrix of rank r,
$$Y = \sum_{\ell=1}^{r} \lambda_\ell\, u_\ell v_\ell^T.$$
Here, u_ℓ ∈ R^{n×1} and v_ℓ ∈ R^{p×1} are basis vectors for R^n and R^p respectively. The quantities λ_ℓ are strictly positive (non-zero) and are called the singular values. By keeping only the k largest singular values in the sum above, one gets the best rank-k approximation.
The three corresponding matricial notations are the following ones:
- Let U_r and V_r aggregate the vectors above, with U_r = [u_1|u_2|···|u_r] ∈ R^{n×r} and V_r = [v_1|v_2|···|v_r] ∈ R^{p×r}, with U_r^T U_r = I_r and V_r^T V_r = I_r, while Σ_r = Diag(λ_1,···,λ_r) ∈ R^{r×r}; one gets the matricial version of the svd in brief notation:
$$Y = U_r \Sigma_r V_r^T.$$
This first clear and concise notation is expanded into two further notations, because it is possible to add basis vectors to U_r and V_r when zeros are added to the matrix Σ. Remember that for this notation the matrix is diagonal with non-null diagonal elements, and thus the non-diagonal part of the matrix contains only null cells (equal to zero).
- The second notation is a direct extension of the one above, but less natural: compute q = min(n, p) and partially complete the bases of R^n and R^p, aggregated in U_*^{q−r} for U and V_*^{q−r} for V, thus U_q = [U_r|U_*^{q−r}] ∈ R^{n×q} and V_q = [V_r|V_*^{q−r}] ∈ R^{p×q} with U_q^T U_q = I_q and V_q^T V_q = I_q, while Σ_q is the diagonal matrix Σ_r completed with q − r zeros on the diagonal, Σ_q ∈ R^{q×q}; thus Y = U_q Σ_q V_q^T is read as:
$$Y = \left[ U_r \,\middle|\, U_*^{q-r} \right] \begin{pmatrix} \Sigma_r & 0_r^{q-r} \\ 0_r^{q-r} & 0_{q-r}^{q-r} \end{pmatrix} \begin{pmatrix} V_r^T \\ (V_*^{q-r})^T \end{pmatrix}.$$
- The third notation is not the most natural because Σ is not always diagonal. There are two complete bases, obtained by fully completing the bases of R^n and R^p, aggregated in U_*^{n−r} for U and V_*^{p−r} for V, thus U_n = [U_r|U_*^{n−r}] ∈ R^{n×n} and V_p = [V_r|V_*^{p−r}] ∈ R^{p×p} with U_n^T U_n = I_n and V_p^T V_p = I_p, while Σ_{np} is the matrix Σ_r completed with zeros, Σ_{np} ∈ R^{n×p}; thus Y = U_n Σ_{np} V_p^T is read as:
$$Y = \left[ U_r \,\middle|\, U_*^{n-r} \right] \begin{pmatrix} \Sigma_r & 0_r^{p-r} \\ 0_r^{n-r} & 0_{n-r}^{p-r} \end{pmatrix} \begin{pmatrix} V_r^T \\ (V_*^{p-r})^T \end{pmatrix}.$$
In the general case, n ≠ p, so that Σ_{np} is not even square anymore.
Note that the algorithms for finding the eigenvalues and eigenvectors do not order the values λ_ℓ by default, hence one must reorder the singular values and eigenvectors, because only the largest ones are kept for pca or for a low rank approximation. Ideally one looks for an implementation able to find only the largest or the smallest eigen components, for faster computation.
Let us find an approximation ŷ_i of y_i with an orthogonal matrix V_k ∈ R^{p×k} (hence V_k^T V_k = I_k) and vectors c_i,
$$\hat y_i = \begin{pmatrix} \hat y_{i1} \\ \vdots \\ \hat y_{ip} \end{pmatrix} = \sum_{\ell=1}^{k} c_{i\ell}\, v_\ell = \left[ v_1 | v_2 | \cdots | v_k \right] \begin{pmatrix} c_{i1} \\ \vdots \\ c_{ik} \end{pmatrix} = V_k c_i.$$
$$\begin{aligned} \frac{\partial S_n}{\partial c_i} &= \frac{\partial}{\partial c_i} \sum_i \| y_i - V_k c_i \|^2 \\ &= \frac{\partial}{\partial c_i} \left[ (y_i - V_k c_i)^T (y_i - V_k c_i) \right] \\ &= \frac{\partial}{\partial c_i} \left[ y_i^T y_i - 2\, y_i^T V_k c_i + c_i^T V_k^T V_k c_i \right] \\ &= -2 V_k^T y_i + 2 V_k^T V_k c_i \\ &= 0 \end{aligned}$$
Thus, both approximations ci and ŷi are linear projections for the data vector into two different
spaces:
- The low dimensional one reduces the size of the vector or length from p to k,
ci = VkT yi .
- The low rank one reduces the size of the linear basis with only k orthogonal vectors,
ŷi = VkVkT yi .
Hence,
$$\begin{aligned} \operatorname*{argmin}_{V_k} S_n &= \operatorname*{argmin}_{V_k}\ \mathrm{Tr}\left[ (Y - Y V_k V_k^T)(Y - Y V_k V_k^T)^T \right] \\ &= \operatorname*{argmin}_{V_k}\ \mathrm{Tr}(Y Y^T) - 2\,\mathrm{Tr}(Y V_k V_k^T Y^T) + \mathrm{Tr}(Y V_k V_k^T V_k V_k^T Y^T) \\ &= \operatorname*{argmin}_{V_k}\ -\mathrm{Tr}(Y V_k V_k^T Y^T) \\ &= \operatorname*{argmax}_{V_k}\ \mathrm{Tr}(V_k^T Y^T Y V_k)\ . \end{aligned}$$
Hence the columns of V_k are recognized as the first eigenvectors of Y^T Y, by definition of the eigendecomposition of a symmetric matrix. This also induces that, from these eigenvectors, one gets the low rank approximation of y_i with ŷ_i = V_k c_i = V_k V_k^T y_i, which is sometimes called denoised. The approximation of x_i is retrieved by inverting the standardization of the columns (mean and standard deviation), which concludes the method here. See a textbook on principal component analysis for additional results about the method.
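As a small illustration of this algebra, a hedged numpy sketch of pca via the svd (X is assumed to be the data matrix; the scaling convention, with the biased standard deviation, is an assumption):

```python
import numpy as np

Y = (X - X.mean(axis=0)) / X.std(axis=0)           # standardized columns
U, s, Vt = np.linalg.svd(Y, full_matrices=False)   # Y = U diag(s) V^T
k = 2
C = Y @ Vt[:k].T                 # principal components (scores), equal to U[:, :k] * s[:k]
Y_hat = C @ Vt[:k]               # rank-k (denoised) approximation V_k V_k^T y_i
X_hat = Y_hat * X.std(axis=0) + X.mean(axis=0)     # back to the original scale
```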
Note that "ipca" and "kpca" are variants, with ipca for "incremental pca" and kpca for "kernel pca". They come with alternative algorithms, out of the scope here. The result from ipca should lead to nearly the same reduced space as pca, because it aims at solving the same problem with chunks for scalability. As kpca is a nonlinear version of pca, it is able to improve the resulting reduction. Next, these three algorithms are compared to an autoencoder and to t-sne with several indicators of the quality of the visualization.
Illustration
The method pca is first tested with a small dataset for a better understanding. A toy dataset is generated: 80 points from a circle in a 3d space.
[Unreadable code listing in the source: generation of the toy circle dataset.]
[Unreadable code listings and outputs in the source: pca of the toy dataset with two implementations, including the check that Σ v₃ − λ₃ v₃ = 0.]
The eigenvectors remain the same for the two implementations, but their sign may change. This comes from the eigendecomposition, which is not unique. The ellipsoidal shape of the circle is exactly retrieved. Because the points belong to a plane, the third pca component is zero.
[Unreadable output in the source.]
Below, other expressions from the svd are also checked with three components.
[Unreadable code listings and outputs in the source: checks of the svd expressions with three components.]
[Unreadable code listing in the source: preparation of the dataset with 10 variables and 9 classes.]
In a three dimensional view, the data are as follows when showing the three first components.
[Unreadable code listing in the source: 3d scatter plot of the three first variables (Figure 8.1).]
Figure 8.1: Visualization of 3 variables among 10 for 9 classes.
The classes are not fully apparent when a different color or a different marker is not added to the points of each class. This information is generally unknown, except if some clustering was performed beforehand in order to help the data analysis and the visualization.
With real data, the variables y_i should be found in an unsupervised way (clustering); otherwise, if they are known, a supervised method for visualization may actually be better in order to use the additional information (see "fda", "discriminant analysis" in the literature), which is out of the scope herein.
Next, the projections are from the high dimensional space, 10 here, onto the bidimensional plane, hence 2 instead of 3 among the usual choices. The labels y_i are unknown during the computations; the corresponding colors are just added to each point of the visualization after processing the projection. This allows a better visual comparison of the differences between the methods.
The algorithm for pca is already implemented in sklearn. The call to the function available in the module is written as follows.
[Unreadable code listing in the source: pca projection with sklearn.]
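A minimal sketch of this call with sklearn, assuming X is the data matrix used above:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)              # scores on the first principal plane
print(pca.explained_variance_ratio_)      # share of variance kept by each component
```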
Thus, the pca projection leads to the first principal plane in 2d.
[Unreadable code listing in the source: plotting function drawing the 2d projection with one ellipse per class (Figure 8.2).]
Figure 8.2: Visualization of two first pca components for 9 small classes
For each group, an ellipse is added on top of its scatterplot to help the visualization. Some data points may appear further from their cluster, as allowed by the Gaussian distribution by definition.
There is a small overlap because some clusters are close in the data space. For this example, we will check how an autoencoder behaves.
For larger datasets, an incremental algorithm operating with chunks from a data file is preferred. With sklearn it is possible to enter the minibatches manually one after another, as in our loops with pytorch, via the function "partial_fit()". Here, it is sklearn which cycles through the dataset with minibatches until convergence.
[Unreadable code listing in the source: incremental pca with sklearn.]
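A hedged sketch of the two usages of "IncrementalPCA" with sklearn, either with automatic minibatches or with "partial_fit()" on manual chunks (the batch and chunk sizes are arbitrary):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2, batch_size=500)
X_ipca = ipca.fit_transform(X)            # sklearn cycles through the minibatches itself

ipca2 = IncrementalPCA(n_components=2)    # or feed the chunks manually
for chunk in np.array_split(X, 10):
    ipca2.partial_fit(chunk)
X_ipca2 = ipca2.transform(X)
```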
The representation is not exactly the same because the sign of the eigenvectors may be inverted and the iterative algorithm depends on hyperparameters such as the size of the minibatches, but this is nearly the same visualization as pca. Note that a pca algorithm is also implemented natively in pytorch, see "torch.pca_lowrank()". For large matrices, approximations of pca via randomization or random projection are appealing alternatives to the exact solution.
For the nonlinear choice, a kernel pca is considered, which proposes a trick (related to "svm", "support vector machines") in order to change the inner products into nonlinear functions. Some options are:
• kernel, with the choices 'linear', 'poly', 'rbf', 'sigmoid', 'cosine'; the default 'linear' corresponds to the usual pca.
• gamma, a float for the kernel coefficient of the rbf, poly and sigmoid kernels; by default it is equal to 1/n_features.
• degree, by default 3, only for the poly kernel.
[Unreadable code listing in the source: kernel pca with sklearn and its 2d projection.]
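A minimal sketch of such a kernel pca call with sklearn; the rbf kernel and the gamma value are arbitrary choices for the illustration:

```python
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)   # gamma chosen arbitrarily
X_kpca = kpca.fit_transform(X)
```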
This does not work so well here, with clusters very close to each other, but some clusters are more separated or smaller than with a linear kernel.
[Unreadable code listing and output in the source: the nonlinear projection discussed below.]
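Assuming the projection discussed here is the t-sne of sklearn, a minimal hedged sketch of the corresponding call (the perplexity and initialization are arbitrary choices):

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X)            # 2d embedding used for the visualization
```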
Here, for most samples tested from the considered distribution, as for this draw, the clusters are visually well separated because the overlap is not so high. Otherwise, a small or large overlap may appear, with a cluster behind another one on the visualization map, as in the orthogonal projection from pca above. This method is today widely used in theory and practice for visualization, even though it is not yet perfect: the output from this nonlinear method often replaces the previous maps from pca in many research domains (machine learning, biology, physics, ...).
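The numerical quality indicators are computed for each projection; a minimal sketch with the trustworthiness criterion from sklearn, reusing the projections sketched above (names are illustrative), is:

from sklearn.manifold import trustworthiness

# A trustworthiness near 1 means that the local neighborhoods are preserved in the projection
for name, z in [("ipca", z_ipca), ("kpca", z_kpca), ("tsne", z_tsne)]:
    print(name, round(trustworthiness(x, z, n_neighbors=10), 4))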
As expected, t-sne performs better than the pca methods, while pca and ipca are equivalent. The resulting values are given below in a table, after adding the autoencoders to the list.
Linear case
One popular interpretation of the autoencoder in the literature is its close relation to pca in the linear case. The authors of a recent communication 1 have recalled their result from 1988: the linear autoencoder is written formally as X̂ = W2 H, where the matrix H aggregates the values hi of the hidden layer for the inputs xi.
1 See "https://doi.org/10.1007/s00422-022-00937-6" (Biological Cybernetics, 2022) for a discussion on autoencoders and pca, plus some related recent architectures of neural networks.
Then the optimal matrices Ŵ2 and Ĥ are the solution of the minimization of

‖X − W2 H‖² .

The solution comes from standard linear algebra with the "best rank k approximation of X": with the svd X = Un Σn Vn^T, the solutions are respectively

Ŵ2 = Uk S^{-1} ,
Ĥ = S Σk Vk^T ,
Ŵ1 = Ŵ2^T .

Here, Uk, Σk, Vk are the truncated matrices obtained from Un, Σn, Vn by keeping the part corresponding to the number of nodes k of the unique hidden layer, while S is an arbitrary non-singular matrix. The authors concluded that an autoencoder with one hidden layer and a linear output activation function learns "weights that span the same subspace as the one spanned by the principal component vectors" when the loss function is the MSE.
Without orthogonalization, linear autoencoders are expected to be less effective than pca for a linear projection. Actually, orthogonality is not an issue and may be introduced via some kind of explicit penalization in more advanced linear settings, according to recent architectures from the literature. With nonlinearities, autoencoders become able to improve the reduction dramatically. They are also flexible, with the possibility of being associated with other architectures of neural networks.
General case
The idea is to define, after the input layer, a set of hidden layers with a decreasing number of nodes down to the dimension chosen for the reduction. This layer is the "bottleneck layer", where the reduction is obtained. After this layer, the hidden layers have an increasing number of nodes up to the size of xi. The first set of layers is the encoder and the second set of layers is the decoder. The autoencoder is able to encode a vector xi into a reduced vector ĥi and then to decode the reduction into an approximation x̂i of the input vector, with possibly some small error:

x̂i ≈ xi ,

where ĥi = encoder(xi) and x̂i = decoder(ĥi). After training, the autoencoder provides:
- the reductions {ĥi}, which allow a visualization and any post-processing;
- the decoder(), which allows getting new vectors from the continuous latent reduced space;
- the reconstructions {x̂i}, which are possibly an improved version of the input vectors;
- the encoder(), for constructing the reduction of new input vectors if required.
Note that the dimension of the latent space may be two or three for visualization, but it may be larger when an additional method is used afterwards for visualizing the latent space.
The class allows defining the part of the network for the encoder and the part for the decoder. The forward pass encodes to the reduced space and then decodes back to the data space. This also suggests defining two functions, one for each of these two parts.
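A minimal sketch of such a class, taking the encoder and decoder networks as arguments (layer sizes and names are illustrative; the original code also initializes the weights explicitly, see "https://pytorch.org/docs/stable/nn.init.html"):

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    # the encoder and decoder networks are built outside and given to the constructor
    def __init__(self, encoder_net, decoder_net):
        super().__init__()
        self.encoder_net = encoder_net
        self.decoder_net = decoder_net

    def encoder(self, x):
        return self.encoder_net(x)

    def decoder(self, h):
        return self.decoder_net(h)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Example for the small artificial dataset: p -> 16 -> 2 (bottleneck) -> 16 -> p
p = x.shape[1]
encoder_net = nn.Sequential(nn.Linear(p, 16), nn.Tanh(), nn.Linear(16, 2))
decoder_net = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, p))
model = Autoencoder(encoder_net, decoder_net)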
Training function
The parameters are optimized with the following function for a Euclidean loss. For the python code, the only differences with the previous chapters are the output of the network (xi instead of yi) and the optimizer Adam.
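A minimal sketch of such a training function, assuming the dataloader yields pairs (x, y) as in the previous chapters; the defaults nbmax_epoqs=1500 and lr=0.001 are the values mentioned later in this section, the rest is illustrative:

def f_train_autoencoder(model, dataloader, nbmax_epoqs=1500, lr=0.001):
    # Euclidean (mse) reconstruction loss with the Adam optimizer
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for epoch in range(nbmax_epoqs):
        total = 0.0
        for xb, yb in dataloader:           # the labels yb are not used here
            optimizer.zero_grad()
            xb_hat = model(xb)
            loss = criterion(xb_hat, xb)    # the target is the input xb itself
            loss.backward()
            optimizer.step()
            total += loss.item()
        losses.append(total)
    return losses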
Several optimizers are available natively with pytorch (see also the available modules from repositories), in the module "torch.optim", for instance:
• "SGD": stochastic gradient descent with minibatches and an optional momentum; it extends the Robbins–Monro method (1951).
• "Adam": the Adam algorithm (2014), a recent popular stochastic gradient descent based on "adaptive estimates of lower-order moments"; it combines AdaGrad and RMSProp.
• "LBFGS": the L-BFGS algorithm (1989), see the previous chapter for an illustration.
• "RPROP": the "resilient backpropagation algorithm" (1992), a pioneer approach for adaptive estimates which can perform well (improved by RMSProp).
Switching the optimizer only changes one line, as in the sketch below.
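A minimal illustration (the learning rates are illustrative):

# Either optimizer takes the model parameters and a learning rate
optimizer_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer_adam = torch.optim.Adam(model.parameters(), lr=0.001)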
The reduction ĥi in the latent space from the bottleneck layer is retrieved for each data vector xi of
the dataset with "encoder()", the function implemented for this purpose:
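A minimal sketch (x is the numpy data matrix as above):

# Latent coordinates of the whole dataset, computed without gradient tracking
with torch.no_grad():
    h = model.encoder(torch.as_tensor(x, dtype=torch.float32)).numpy()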
The latent space is visualized as for pca.
The output just before was obtained with the trained autoencoder, as described below. The resulting projection is, for this example, very similar to pca, but one of the pca dimensions is compressed. A subspace similar to the one from pca is retrieved, up to some linear transformation. Note that the chosen structure of the network does not correspond exactly to pca, also because there is an intermediate hidden layer before and after the bottleneck layer.
This is a known observation from the literature: the autoencoder is able to retrieve pca even though it does not include the hypothesis of orthogonality. The numerical indicators allow checking whether the two projections are indeed similar.
An example of training for this autoencoder (with nbmax_epoqs=1500 and lr=0.001) was stored in a file, which is loaded here.
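A minimal sketch of the save and load steps (the file name is illustrative):

# Save the trained weights, then reload them into a freshly created model
torch.save(model.state_dict(), "autoencoder_small.pt")
model.load_state_dict(torch.load("autoencoder_small.pt"))
model.eval()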
The resulting model separates the projected points of some classes better, but there is also some distortion among the clusters, with ellipsoidal shapes and diverse orientations of their main axes. With high dimensional data, the autoencoder is also able to perform better than pca when associated with a post-processing via t-sne, for biological data for instance.
The binary hdf5 compression is still near the state of the art today for tabular data, at least for a sequential access to its local contents.
Let us download the dataset ("http://yann.lecun.com/exdb/mnist/") with pytorch in a tensor format, in order to save each image into a row of a compressed data file. The first call to "MNIST()" fetches the files from a remote server into the chosen directory on the computer disk. All subsequent calls access the local files directly, until they are eventually removed by the user.
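A minimal sketch (directory and file names are illustrative; the datasets "x" and "y" inside the hdf5 file are the ones used later in this section):

import h5py
import numpy as np
from torchvision import datasets

# First call downloads the files into ./data, later calls reuse the local copy
train = datasets.MNIST("./data", train=True, download=True)

# One flattened image per row, rescaled to [0, 1], stored with gzip compression
x_train = train.data.numpy().reshape(len(train), -1).astype("float32") / 255.0
y_train = train.targets.numpy()
with h5py.File("mnist_train.h5", "w") as f:
    f.create_dataset("x", data=x_train, compression="gzip")
    f.create_dataset("y", data=y_train, compression="gzip")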
An incremental pca is then obtained by cycling through the file and calling a function from sklearn, as follows.
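A minimal sketch with "partial_fit()" over chunks read from the hdf5 file (50 components are kept, as they are reused later before t-sne; the chunk size is illustrative):

import h5py
from sklearn.decomposition import IncrementalPCA

ipca60000 = IncrementalPCA(n_components=50)
chunk = 1000
with h5py.File("mnist_train.h5", "r") as f:
    n = f["x"].shape[0]
    for i in range(0, n, chunk):                    # one partial_fit() call per chunk
        ipca60000.partial_fit(f["x"][i:i + chunk])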
We get a python object for the projection; hence we now just project our dataset.
Here the size of the chunks divides the number of rows of the dataset exactly; otherwise the remaining rows must be projected too. The projection is saved directly in a file on the disk, via a virtual array created with "memmap()" from numpy (see further below for more about this binary file format).
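A minimal sketch, continuing the objects defined just above (the file name is illustrative):

import numpy as np

# Project chunk by chunk and write the 50 coordinates into a memmap file on disk
z_ipca60000 = np.memmap("mnist_ipca.npy", dtype="float32", mode="w+", shape=(n, 50))
with h5py.File("mnist_train.h5", "r") as f:
    for i in range(0, n, chunk):
        z_ipca60000[i:i + chunk] = ipca60000.transform(f["x"][i:i + chunk])
z_ipca60000.flush()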
The resulting projection is below, from the two first principal components.
Figure 8.8: Visualization of two first ipca components for 60000 images
The overlap of the ten classes of digits on the projection is not informative, and is even prone to errors of interpretation for a human who wants to understand the structure of the data cloud. This is expected from images of digits, which are often very similar. For instance, the digits 3 and 9 may both be handwritten with open or closed curves, and with only small differences from one writer to another. Too many similar classes are not well separated by a linear projection. This is a limit of pca, which is only linear and not able to process complex data clouds well: the method needs numerous latent variables instead of just a few.
The projected points are all shown, without filtering, for the first ipca plane. Note that in pca there exist additional indicators called "contributions", "cosines" and "qualities", which allow removing points from the projection plane. The pca plane is the scatter plot of two new variables from pca, one per axis. When pca is used as a statistical exploratory method instead of just an algorithm, some data points are known to be poorly represented on the plane. When they are too far from the projection plane, with the worst indicator values, these points could be better shown by other new variables from pca, the ones corresponding to eigenvalues smaller than the two leading ones. The limit of this older approach is that, for data with numerous variables, an unknown number of diverse reduced dimensions is required for each unknown class in order to retrieve the whole underlying classification.
Current machine learning approaches generally prefer a nonlinear transformation of the variables, a non-Euclidean space or a neighbors graph. Pca is able to improve these methods by helping to remove the useless dimensions of the data table and the noise in its rows and columns, which correspond to the smallest eigenvalues. Autoencoders are able to provide such an improved nonlinear view of the data in a flexible way. Their latent space may separate the unknown classes (or "clusters") better, even in an unsupervised approach without knowledge of the true classification.
An autoencoder is trained for this dataset as follows. We propose to feed the dataloader with the hdf5 file (extension ".h5"), which is suitable for large datasets (or small computers). This is performed by defining a new class extending "Dataset" from pytorch. The images are the rows of the matrix 'x' and the labels are the components of the vector 'y' in the hdf5 file.
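A minimal sketch of such a class (the class name is illustrative):

import h5py
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    # reads one row of the hdf5 file per __getitem__() call
    def __init__(self, filename):
        self.f = h5py.File(filename, "r")
        self.x = self.f["x"]
        self.y = self.f["y"]

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, idx):
        xi = torch.as_tensor(self.x[idx], dtype=torch.float32)
        yi = int(self.y[idx])
        return xi, yi, idx        # the row id is also returned (see the remark below)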
The dataloader is created as usual, but with this new dataset object pointing to the rows of the file.
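A minimal sketch of the dataloader and of an architecture ending with a sigmoid, reusing the Autoencoder class sketched earlier (the layer sizes are illustrative):

import torch.nn as nn
from torch.utils.data import DataLoader

dataloader_mnist = DataLoader(H5Dataset("mnist_train.h5"), batch_size=256, shuffle=True)

# Encoder 784 -> ... -> 2 and symmetric decoder 2 -> ... -> 784 with a final sigmoid
encoder60000 = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 2))
decoder60000 = nn.Sequential(
    nn.Linear(2, 128), nn.ReLU(),
    nn.Linear(128, 512), nn.ReLU(),
    nn.Linear(512, 784), nn.Sigmoid())
model60000 = Autoencoder(encoder60000, decoder60000)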
The last activation function is sigmoidal, hence the output of the neural network belongs to [0; 1]. The components of the data vectors should belong to the same range.
This architecture looks too complex for this dataset, with only 50000 data vectors for training so many weights, but it has led to a relevant low dimensional projection as a first test. Note that the number of parameters is obtained as follows:
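A minimal sketch:

# Total number of trainable parameters of the autoencoder
nb_params = sum(p.numel() for p in model60000.parameters() if p.requires_grad)
print(nb_params)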
The main difference with the training function used for the small artificial dataset, "f_train_autoencoder()", is the penalization added to the loss function. The id of the row is also possibly loaded because it is used in some cases, like debugging: when shuffling is enabled in the dataloader, the order of the rows is lost. For adding other optional computations, see for instance the function "f_train_glmr()" and the class "Monitor" as examples.
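A minimal sketch of a penalized loss, assuming for illustration an L1 penalty on the latent codes added to the reconstruction loss (the exact penalization used in the book's listing is not reproduced here):

def f_loss_penalized(model, xb, criterion, lambda_pen=1e-4):
    # reconstruction loss plus an (assumed) L1 penalty on the latent codes
    h = model.encoder(xb)
    xb_hat = model.decoder(h)
    return criterion(xb_hat, xb) + lambda_pen * h.abs().mean()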
The optimizer here is Adam because it performs very well for large models and was able to reduce the loss function without an additional advanced schedule for the learning rate. As seen previously, this is not required for linear models, which just need an SGD optimizer because of the convexity of their loss function. The documentation of pytorch discusses this more advanced optimizer, as does a large part of the literature on image processing, where Adam has become one of the best optimizers currently. This illustrates that a clever combination of different approaches, here two optimizers with "Adam ≈ AdaGrad+RMSProp", can achieve great results.
The autoencoder is trained and its weights are stored in a computer file.
The trained model is loaded by retrieving the weights from the computer file; the weights of the python class are then filled with these values.
For a large dataset, one must be careful when computing the coordinates from the latent space by
cycling the dataset with the dataloader. This is the purpose of the following function.
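A minimal sketch of such a function, reusing the dataset class sketched above:

import numpy as np

def f_latent_coords(model, dataloader):
    # cycles through the dataloader and stacks the latent codes and the labels
    model.eval()
    hs, ys = [], []
    with torch.no_grad():
        for xb, yb, idb in dataloader:     # the dataset returns (x, y, id)
            hs.append(model.encoder(xb).numpy())
            ys.append(np.asarray(yb))
    return np.concatenate(hs), np.concatenate(ys)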
The full dataset is projected here with the model trained on the train sample; otherwise, the dataloader of the train sample or of the test sample may be preferred, possibly for validation purposes. The resulting projection, after training the nonlinear autoencoder, is as follows.
Let us then compare the quality of the projection for the full sample after training. A subsample of the 60000 rows is drawn in order to get a faster computation of the numerical indicators.
Visually, this trained autoencoder does not seem to perform better than pca for this dataset, but the setting is also not optimal. According to the literature, a more usual way to proceed is to choose a higher dimension for the latent space and to perform afterwards a nonlinear reduction such as t-sne. Note also that better autoencoder architectures exist, but they often ask for far more intensive computations. Convolutional autoencoders, which model the neighborhoods between pixels, are a first step towards a better reduction; this is out of the scope of tabular data.
Random projection (rp) is a linear reduction method different from pca. It approximately preserves the Euclidean distance between two vectors. Each data vector xi ∈ R^p is reduced to x̂i ∈ R^q with a matrix R ∈ R^{q×p} where q < p. The new vector is written as follows:

x̂i = R xi .

The reducing matrix may be sparse or dense. The cells of this matrix are random values generated from a Gaussian distribution (or possibly another one). With the right choice for the reduced dimension q, and a small ε depending on the value of q, it holds with high probability that, for any two vectors xi and xj,

(1 − ε) ‖xi − xj‖² ≤ ‖R xi − R xj‖² ≤ (1 + ε) ‖xi − xj‖² .

This is the same inequality as in the Johnson-Lindenstrauss lemma. A smaller ε asks for a larger dimension q, but for a large p one easily gets q ≪ p with a small enough ε, such that the gain is relevant. When the matrix R is sparse, the reductions are fast to compute for dense vectors xi, and this version is said to be "database friendly", underlining that it is relevant for huge datasets. Contrary to pca, rp does not require any matrix decomposition, only the generation of a matrix for the reduction.
The random projection is implemented with sklearn, which generates the matrix R for a chosen reduced dimension q. A warning also appears if the corresponding approximated distances are inaccurate in the lower dimension. For datasets which cannot fit in the computer memory, the transformation is applied in this paragraph with chunks read from a data file.
The dataset which comes from a dataloader is stored on the computer disk in a data file, a binary file which can be read as a numpy array. This is a virtual array on the computer disk instead of in the computer memory, see "memmap()" from numpy. With this format there is no need for loops and chunks to process the rows of the array, as long as the resulting arrays fit in the computer memory: it allows linear algebra with other python objects of type numpy array and with the same operators. But this may be slower than hdf5 with loops for some computations. Other formats of compressed files exist for python, even with distributed architectures or better parallel processing, but they are not always compatible with numpy.
In order to save the result of the transformation directly on the computer disk, a loop is implemented. The linear algebra from numpy remains available, but applied only to subsets of rows until the full transformation of the data matrix is completed. The optional shuffling is set to false in the dataloader for a sequential processing.
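A minimal sketch that writes the rows delivered by an unshuffled dataloader into two memmap files (names and sizes are illustrative):

import numpy as np
from torch.utils.data import DataLoader

n, p = 60000, 784
x_mm = np.memmap("mnist_x.npy", dtype="float32", mode="w+", shape=(n, p))
y_mm = np.memmap("mnist_y.npy", dtype="float32", mode="w+", shape=(n,))
row = 0
for xb, yb, idb in DataLoader(H5Dataset("mnist_train.h5"), batch_size=1000, shuffle=False):
    nb = xb.shape[0]
    x_mm[row:row + nb] = xb.numpy()
    y_mm[row:row + nb] = np.asarray(yb, dtype="float32")
    row += nb
x_mm.flush()
y_mm.flush()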
The dependent and independent variables are stored in two separate files.
From the data file, a virtual numpy array is created via a function as below.
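A minimal sketch of such reader functions (file names and sizes are the illustrative ones used above):

import numpy as np

def f_read_memmap_x(filename, n, p):
    # read-only virtual array on the disk: nothing is loaded in memory at this point
    return np.memmap(filename, dtype="float32", mode="r", shape=(n, p))

def f_read_memmap_y(filename, n):
    return np.memmap(filename, dtype="float32", mode="r", shape=(n,))

x_mnist = f_read_memmap_x("mnist_x.npy", 60000, 784)
y_mnist = f_read_memmap_y("mnist_y.npy", 60000)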
Let us project the column space with a random projection into a smaller dimension, here 120, which remains larger than 50, the usual value for pca before t-sne. A random projection is able to keep the distances between rows accurate after the linear transformation when the reduced space is large enough.
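A minimal sketch with sklearn (only the shape of the input matters for generating R):

from sklearn.random_projection import GaussianRandomProjection

# Generate the Gaussian random projection matrix R for 120 reduced dimensions
rp = GaussianRandomProjection(n_components=120, random_state=0)
rp.fit(x_mnist[:1000])      # only the number of columns is used here
R = rp.components_          # shape (120, 784)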
The random projection keeps the distances while reducing the size of the vectors. Thus the dataset has its number of columns reduced (from 784 to 120) by a product on the right with this matrix (after transposition). This is done here with chunks/minibatches in order to work directly with the data on the computer disk.
The reduced matrix is stored in a new data memmap file with 120 variables instead of the previous
784 ones.
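A minimal sketch, continuing the objects above (file and chunk sizes are illustrative):

# Apply the reduction chunk by chunk and write the result into a new memmap file
z_rp_mnist = np.memmap("mnist_rp.npy", dtype="float32", mode="w+", shape=(60000, 120))
for i in range(0, 60000, 1000):
    z_rp_mnist[i:i + 1000] = x_mnist[i:i + 1000] @ R.T
z_rp_mnist.flush()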
The new reduced matrix is accessed via a python object, z_rp_mnist, without loading its whole contents from the disk, while the python object y_mnist remains available for the labels.
Note that with this format, the size of the matrix is kept in a separate text file when the data is stored, in order to provide this information when the data file is read later.
This reduced matrix is used next, just after the projection onto the 50 components from the incremental pca, in order to compare the results.
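For the map of the previous subsection (t-sne after pca alone), a minimal sketch assuming the openTSNE package and the 50 ipca coordinates z_ipca60000 computed above is:

import numpy as np
from openTSNE import TSNE

# t-sne applied to the 50 incremental-pca coordinates of the 60000 images
tsne60000 = TSNE(n_components=2, perplexity=30, n_jobs=8, random_state=0)
emb_ipca = tsne60000.fit(np.asarray(z_ipca60000))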
The output of t-sne for mnist, computed from the 50 first ipca components, is as follows.
Figure 8.10: Visualization from t-sne after ipca for 60000 images
The python code for the graphic just before is below; the ten classes are recognized, almost perfectly separated. The method t-sne allows an unsupervised visualization of the natural classes, contrary to pca for instance.
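A minimal sketch of such a plot, with one color per digit class (names reused from the sketches above):

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.scatter(emb_ipca[:, 0], emb_ipca[:, 1], c=np.asarray(y_mnist), s=1, cmap="tab10")
plt.colorbar()
plt.title("t-sne after ipca for 60000 images")
plt.show()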
8.3.3 Projection t-sne after rp+pca
Here rp stands for random projection. The reduction before t-sne is a random projection (120 reduced dimensions) followed by an incremental pca (50 reduced dimensions). Hence rp is just a preprocessing step before ipca, while ipca is followed by t-sne at the end.
It is better to reduce and standardize the columns before the partial fit, because the whole dataset is not available at the beginning. The random projection allows accelerating the pca computations, but with a small difference in the quality of the projection because of the approximate distances.
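A minimal sketch of the pipeline, reusing z_rp_mnist from above (chunk sizes are illustrative):

import numpy as np
from openTSNE import TSNE
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

# Standardize the 120 columns incrementally, then incremental pca with 50 components
scaler = StandardScaler()
for i in range(0, 60000, 1000):
    scaler.partial_fit(z_rp_mnist[i:i + 1000])

ipca_rp = IncrementalPCA(n_components=50)
for i in range(0, 60000, 1000):
    ipca_rp.partial_fit(scaler.transform(z_rp_mnist[i:i + 1000]))

z_rp_ipca = np.vstack([ipca_rp.transform(scaler.transform(z_rp_mnist[i:i + 1000]))
                       for i in range(0, 60000, 1000)])

# Final t-sne on the 50 rp+ipca coordinates
emb_rp = TSNE(n_components=2, perplexity=30, n_jobs=8, random_state=0).fit(z_rp_ipca)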
Figure 8.11: Visualization from t-sne after rp+ipca for 60000 images
The random projection is, by definition and as its name says, "random". Hence several random projections may be tried in order to get the best final result. Some noise was added during the reduction, resulting in some data points lying between the clusters. Several clusters have changed their relative positions, but the ten classes of digits are retrieved. This is informative: if two points keep the same vicinity for two different random projections, this means that they belong to the same cluster. Next, the two t-sne projections are compared numerically with the previous ones.
Here t-sne, as a method dedicated to reduction and visualization, outperforms this trained autoencoder, but it uses more information through the vicinity graph. Note that another recent method, such as u-map, could be used here instead of t-sne for visualization. Tuning the hyperparameters or the training further should lead to better results, but is time consuming for the computer. More advanced autoencoder architectures may do better than t-sne, see the literature. Currently, autoencoders are combined with other neural networks with different architectures, and have also been defined for other kinds of data such as time series, text, graphs, or rgb images. Their number of parameters is high or even huge, which makes them prone to overfitting, a problem mostly solved with very large datasets and big workstations with powerful gpus.
There exist alternative approaches to a glm within a neural network, which remains nevertheless the more usual way to proceed currently. In all cases, a penalization may be added; it is worth trying, as explained in the chapter on the lasso regression, and other regularizations are also available.
Example with one hidden layer in the autoencoder for linear regression
The model for the basic autoencoder with one hidden layer is as follows:
Combined:
(Ŵx, Ŵh, b̂h, b̂x, β̂, b̂y) = argmin over Wx, Wh, bh, bx, β, by of
    ∑i ‖xi − x̂(h(xi))‖²  +  λ ∑i ‖yi − β^T h(xi) − by‖²
These models are not implemented here; another bias (or intercept) term may be added in some cases for xi. The plugin and combined approaches each count nearly twice the number of parameters of the direct approach because of the decoding step, hence they may ask for more data to train. They seem to be tested in the literature with more advanced autoencoders, for instance in image processing for medical images with clinical data. This is because one may have additional data for the combined or plugin approach, by extending the vector xi with a part not reduced by the autoencoder. For instance, for images with additional descriptive variables or even text, another network may process this other information and the result may be merged with the reductions of the images. This idea is very convenient, but the resulting loss is not obvious to minimize because several networks are at stake. They ask for different optimizations -with different learning rates and other hyperparameters- while being linked. See the next section for related exercices, not solved herein.
8.5 Exercices
1. (sklearn) Implement the autoencoder with sklearn for the small dataset and improve the
model by cross-validation.
2. (sklearn) For the dataset mnist of the chapter, look for a better setting for the autoencoders (number of hidden layers, number of neurons for each hidden layer, activation functions and even type of hidden layers such as linear or dropout). See the documentation of sklearn (or torch.nn) for the available layers and activation functions.
3. (pytorch) Test a sparse autoencoder further, and test some other autoencoder architectures available in the literature and in code repositories. Check the generalization with two samples via the two loss functions, and via the quality of the reduction for the test sample.
4. (pytorch+other) Compare the running time of alternative ways to cycle through the mnist dataset with minibatches; consider several among the following examples.
• Direct access from the image files in directories.
• Access to the images stored in one unique hdf5 file.
• Access to other file formats (numpy memmap, numpy npz, dask, TFRecord, pandas
csv, spark).
• Access without a dataloader, or with minibatches stored in separate files.
Check whether the access can be made multi-core in parallel, and validate by looking at the cpu core statistics while it runs (command "top" with ubuntu).
Check, possibly by profiling the code, the running time for loading a minibatch from the computer hard disk to the cpu (and the gpu), with a profiler for python or pytorch.
5. (opentsne) Propose a way to improve the projection from t-sne for mnist with the random projection plus pca, while keeping the data on the computer disk. For instance, several parameters need to be set, such as the dimension of the random projection.
6. (pytorch) Test the visualization of other datasets of tabular or image data, such as:
• Many datasets natively in pytorch: "https://pytorch.org/vision/0.8/datasets.html"
• 102 Category Flower: torchvision.datasets.Flowers102
• CIFAR: torchvision.datasets.CIFAR10
• CelebFaces: torchvision.datasets.CelebA
• Traffic Sign Recognition: torchvision.datasets.GTSRB
• MedMNIST v2 - Biomedical: "https://medmnist.com/"
• MNIST-like fashion product: "https://github.com/zalandoresearch/fashion-mnist"
• Stanford Dogs Dataset: "http://vision.stanford.edu/aditya86/ImageNetDogs/"
Because pca does well at reduction, the purpose here is to achieve at least a result equivalent to pca as pre-processing, for a k-dimensional reduction with k = 2 (or possibly more) for the autoencoder. Images from some datasets need to be standardized in order to compare the pixels and their colors accurately, hence testing a classification allows checking this.
7. (pytorch) Test a "Fisher" approach as in parametric likelihood models, by reducing the gradient vectors with an incremental pca at each epoch. Check the change of the first eigenvectors and the first eigenvalues during training and for different choices of the hyperparameters.
8. (pytorch) Test the combination of an autoencoder with a glm for a dataset of images completed with descriptive variables. Examples of datasets are images of objects (cars, houses, etc.) plus a few descriptive variables, with the price to be predicted, at "kaggle.com". Another