0% found this document useful (0 votes)

35 views247 pages

Book Pytorch Scikit Learn Numpy

Uploaded by

Farhad Dubey

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

35 views247 pages

Book Pytorch Scikit Learn Numpy

Uploaded by

Farhad Dubey

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 247

Linear and Deep Models Basics

with Pytorch, Numpy, and Scikit-Learn

Copyright © 2022 Rodolphe Priam. All rights reserved.

This book’s purpose is to inform, instruct and educate. Many efforts have been produced for writ-
ing this book and be sure that the contents is accurate, this does not insure a complete objectivity
of the contents. Neither the author nor the publisher nor the publishing platform will be held liable
or responsible for any actual or perceived loss or damage to any person or entity, caused or alleged
to have been caused, directly or indirectly, by anything in this book or its companion programs.
Back cover image design with "Stable Diffusion".
Contents

List of Figures v

Preface vii

1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Notion of sample and log-likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Random sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Loglikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.4 Bias and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Numerical example with a nonlinear curve . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Polynomial regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Numerical illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Glm as neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Why learn pytorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Exercices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 Linear models with numpy and scikit-learn 23

2.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 Probabilistic interpretation and model . . . . . . . . . . . . . . . . . . . . 25
2.1.2 Example of simple linear regression (data) . . . . . . . . . . . . . . . . . 26
2.1.3 Analytical solution for the linear multiple regression . . . . . . . . . . . . 28
2.1.4 How to check the quality of the linear regression . . . . . . . . . . . . . . 33
2.1.5 Example of simple linear regression (continued) . . . . . . . . . . . . . . 35
2.1.6 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2 Logistic Regression (2 classes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2.1 Probabilistic interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2.2 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.2.3 Predicted labels ŷi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2.4 How to check the quality . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2.5 Example of bivariate classes . . . . . . . . . . . . . . . . . . . . . . . . . 47

i
2.2.6 Indicators with sklearn . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.3 Multinomial regression (3 or more classes) . . . . . . . . . . . . . . . . . . . . . . 53
2.4 Exercices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3 First-order training of linear models 55

3.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.1 Gradient descent with one vector and numpy . . . . . . . . . . . . . . . . 57
3.1.2 Gradient descent with a minibatch and numpy . . . . . . . . . . . . . . . . 60
3.1.3 Gradient descent with a minibatch and pytorch . . . . . . . . . . . . . . . 62
3.1.4 Dataloader, autograd and training loop . . . . . . . . . . . . . . . . . . . . 63
3.1.5 Visualization of the parameters trajectory . . . . . . . . . . . . . . . . . . 67
3.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.1 Dataset with bivariate classes (continued) . . . . . . . . . . . . . . . . . . 69
3.2.2 Loss minimization with pytorch . . . . . . . . . . . . . . . . . . . . . . . 70
3.2.3 Comparison of the solutions . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Exercices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4 Neural networks for (deep) glm 75

4.1 From linear models to neural networks . . . . . . . . . . . . . . . . . . . . . . . . 75
4.1.1 Loss functions from glm . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.1.2 Post-processing of the output . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1.3 Loss functions in pytorch . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2 Dataset with nonlinear frontiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3 Train, test and validation subsamples . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 Training traditional linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.1 Neural network deﬁnition and weights . . . . . . . . . . . . . . . . . . . . 85
4.4.2 Preparation for the optimization . . . . . . . . . . . . . . . . . . . . . . . 87
4.4.3 Parameters training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.4 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5 Training deep nonlinear models . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.5.1 Hidden layers in deep models . . . . . . . . . . . . . . . . . . . . . . . . 98
4.5.2 Deﬁnition and model training . . . . . . . . . . . . . . . . . . . . . . . . 100
4.6 Exercices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5 Lasso selection for (deep) glm 107

5.1 Brief recall on regularized regression . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.2 Dataset with uninformative variables . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3 Multiple regression without lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.2 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4 Multiple regression with lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.1 Post-processing and mean square error . . . . . . . . . . . . . . . . . . . . 114
5.4.2 Interpretation of the obtained model . . . . . . . . . . . . . . . . . . . . . 115
5.4.3 Other regularization methods . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5 Setting of the hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

ii
5.5.1 How to improve the learning . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.5.2 Grid search for optimal hyperparameters . . . . . . . . . . . . . . . . . . 118
5.5.3 Bayesian search for optimal hyperparameters . . . . . . . . . . . . . . . . 129
5.6 Exercices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

6 Hessian and covariance for (deep) glm 133

6.1 Parameters variance without pytorch . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.1.1 Dataset from linear regression . . . . . . . . . . . . . . . . . . . . . . . . 133
6.1.2 Variance estimates from statsmodels . . . . . . . . . . . . . . . . . . . . . 134
6.1.3 Algebra for the parameters variance . . . . . . . . . . . . . . . . . . . . . 135
6.1.4 Variance estimates with numpy . . . . . . . . . . . . . . . . . . . . . . . 136
6.2 Hessian and variance for the linear regression . . . . . . . . . . . . . . . . . . . . 138
6.2.1 First-order training with pytorch . . . . . . . . . . . . . . . . . . . . . . . 139
6.2.2 First computation of hessian with pytorch . . . . . . . . . . . . . . . . . . 140
6.2.3 Variance estimates with pytorch . . . . . . . . . . . . . . . . . . . . . . . 142
6.3 Hessian computation for (small) neural networks . . . . . . . . . . . . . . . . . . 143
6.3.1 Hessian expression for logistic regression . . . . . . . . . . . . . . . . . . 143
6.3.2 Real dataset with two classes . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.3.3 Results from the module statsmodels . . . . . . . . . . . . . . . . . . . . 145
6.3.4 Results from sklearn and numpy . . . . . . . . . . . . . . . . . . . . . . . 146
6.3.5 First-order training with pytorch . . . . . . . . . . . . . . . . . . . . . . . 146
6.3.6 Hessian from second-order derivatives . . . . . . . . . . . . . . . . . . . . 148
6.3.7 Hessian from second-order derivatives (bis) . . . . . . . . . . . . . . . . . 150
6.3.8 Hessian from ﬁrst-order derivatives . . . . . . . . . . . . . . . . . . . . . 153
6.4 Exercices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

7 Second-order training of (deep) glm 157

7.1 Introduction to GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.1.1 Deﬁnitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.1.2 Derivatives and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.2 Algorithms for second-order training . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.2.1 Taylor serie at second-order . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.2.2 Newton-Raphson procedure . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.2.3 Natural gradient procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.2.4 First-order gradient update with minibatches . . . . . . . . . . . . . . . . 160
7.2.5 Natural gradient procedure with minibatches . . . . . . . . . . . . . . . . 161
7.3 Fitting a Poisson regression with statsmodels . . . . . . . . . . . . . . . . . . . . 162
7.4 Fitting a Poisson regression with numpy . . . . . . . . . . . . . . . . . . . . . . . 166
7.4.1 Expressions of the gradient and hessian . . . . . . . . . . . . . . . . . . . 166
7.4.2 Implementation of the four algorithms . . . . . . . . . . . . . . . . . . . . 167
7.5 Fitting a Poisson regression with pytorch . . . . . . . . . . . . . . . . . . . . . . . 176
7.5.1 Example of training at ﬁrst-order . . . . . . . . . . . . . . . . . . . . . . . 176
7.5.2 Example of L-BFGS for training at second-order . . . . . . . . . . . . . . 183
7.6 Exercices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

iii
8 Autoencoder compared to ipca and t-sne 187
8.1 Three pca methods and t-sne for dimension reduction . . . . . . . . . . . . . . . . 188
8.1.1 The pca method in brief . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.1.2 Implementations of pca with sklearn . . . . . . . . . . . . . . . . . . . . . 197
8.1.3 Implementation of t-sne with sklearn . . . . . . . . . . . . . . . . . . . . 203
8.1.4 Quality indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.2 Autoencoders with pytorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
8.2.1 Deﬁnition and link with pca . . . . . . . . . . . . . . . . . . . . . . . . . 205
8.2.2 Visualization of artiﬁcial data . . . . . . . . . . . . . . . . . . . . . . . . 207
8.2.3 Autoencoder for 60000 images . . . . . . . . . . . . . . . . . . . . . . . . 214
8.3 Low dimensional reduction via t-sne . . . . . . . . . . . . . . . . . . . . . . . . . 225
8.3.1 Reduction with a random projection . . . . . . . . . . . . . . . . . . . . . 225
8.3.2 Projection t-sne after pca . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
8.3.3 Projection t-sne after rp+pca . . . . . . . . . . . . . . . . . . . . . . . . . 232
8.3.4 Comparison of the visualizations . . . . . . . . . . . . . . . . . . . . . . . 234
8.4 Autoencoder associated with a glm . . . . . . . . . . . . . . . . . . . . . . . . . . 235
8.5 Exercices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

9 Solution to selected exercices 239

9.1 Exercice 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
9.2 Exercice 1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
9.3 Exercice 1.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
9.4 Exercice 2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
9.5 Exercice 2.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
9.6 Exercice 2.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
9.7 Exercice 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
9.8 Exercice 4.3abc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
9.9 Exercice 4.3d . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
9.10 Exercice 4.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
9.11 Exercice 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

iv
List of Figures

1.1 True polynomial curve and train sample . . . . . . . . . . . . . . . . . . . . . . . 12

1.2 True polynomial curve and test sample . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Several ﬁtted polynomial curves and train sample . . . . . . . . . . . . . . . . . . 16
1.4 Several ﬁtted polynomial curves and test sample . . . . . . . . . . . . . . . . . . . 17
1.5 Error on the train and test samples . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.1 True regression line and sample with 10 observations. . . . . . . . . . . . . . . . . 28

2.2 Loglikelihood and variance parameter . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Sigmoid function u → σ (u) with "S shape" . . . . . . . . . . . . . . . . . . . . . 41
2.4 Data sample for two classes with a linear frontier . . . . . . . . . . . . . . . . . . 49
2.5 True frontier between the two classes . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.1 Decreasing loss function during training, per epochs . . . . . . . . . . . . . . . . . 60

3.2 Loss and intermediate values of beta during convergence . . . . . . . . . . . . . . 67
3.3 Intermediate frontiers for the two classes during convergence . . . . . . . . . . . . 73

4.1 Simple neural network for a linear model. . . . . . . . . . . . . . . . . . . . . . . 76

4.2 Data sample for two classes with non linear frontier . . . . . . . . . . . . . . . . . 82
4.3 Loss function for train and set samples, per epochs . . . . . . . . . . . . . . . . . 94
4.4 Linear frontier from neural network without hidden layers . . . . . . . . . . . . . 98
4.5 Loss function for train and set samples, per epochs . . . . . . . . . . . . . . . . . 103
4.6 Nonlinear frontier from neural network wit hidden layer . . . . . . . . . . . . . . 104

5.1 Solutions for each penalty function alone with p = 2. . . . . . . . . . . . . . . . . 108

5.2 Scatterplot of yi and ŷi : without L1 (left) and with L1 (right) . . . . . . . . . . . . . 115
5.3 Mse averages on test set after CV+L1 and grid search . . . . . . . . . . . . . . . . 128

7.1 Barplot for the target variable from Poisson regression . . . . . . . . . . . . . . . 164
7.2 Losses per epoch for the ﬁrst/second-order algorithms . . . . . . . . . . . . . . . . 174

8.1 Visualization of 3 variables among 10 for 9 classes. . . . . . . . . . . . . . . . . . 199

8.2 Visualization of two first pca components for 9 small classes . . . . . . . . . . . . 201
8.3 Visualization of two first ipca components for 9 classes . . . . . . . . . . . . . . . 202
8.4 Visualization of two first kpca components for 9 classes . . . . . . . . . . . . . . . 203
8.5 Visualization from t-sne after pca for 9 classes . . . . . . . . . . . . . . . . . . . . 204
8.6 Visualization from linear autoencoder for 9 classes . . . . . . . . . . . . . . . . . 211

v
8.7 Visualization from nonlinear autoencoder for 9 classes . . . . . . . . . . . . . . . 213
8.8 Visualization of two ﬁrst ipca components for 60000 images . . . . . . . . . . . . 218
8.9 Visualization from nonlinear autoencoder for 60000 images . . . . . . . . . . . . . 224
8.10 Visualization from t-sne after ipca for 60000 images . . . . . . . . . . . . . . . . . 231
8.11 Visualization from t-sne after rp+ipca for 60000 images . . . . . . . . . . . . . . . 234

9.1 Loss function from sequential training with sklearn . . . . . . . . . . . . . . . . . 254

9.2 Barplot for the class sizes from the 581012 observations . . . . . . . . . . . . . . 257
9.3 From left to right: true and estimated labels, frontiers, losses . . . . . . . . . . . . 266
9.4 From left to right: true and estimated labels, frontiers, losses . . . . . . . . . . . . 267
9.5 From left to right: true and estimated labels, frontiers, losses . . . . . . . . . . . . 267
9.6 From left to right: true and estimated labels, frontiers, losses . . . . . . . . . . . . 268
9.7 From left to right: true and estimated labels, frontiers, losses . . . . . . . . . . . . 269
9.8 Loss function for train and set samples, per epochs . . . . . . . . . . . . . . . . . 272

vi
Preface

This book is an introduction with examples to the python module called "pytorch" for models of
regression and classiﬁcation. The other existing modules for machine learning considered herein
are "Scikit-Learn" (also called "sklearn") in a dedicated chapter and "statsmodel" for variance
estimation in another dedicated chapter. The module "sklearn" is often involved in projects of
machine learning because of its intuitive and powerful processing of the data with many methods
implemented often with multicore parallelism. The python module for the graphical outputs is
"matplotlib" which allows to draw curves, lines and scatterplots plus the legends and the axis with
names.
When the models are not kept linear, the extensions with nonlinearities of the models come from
hidden layer(s) within a neural network (nn) framework. "Deep learning" allows to train these
nonlinear models despite a high dimensionality and numerous parameters.

NN and GLM

Herein, we are interested on the neural networks called "feedforward neural network" (ffnn) or
"multi-layer perceptrons" (mlp). The name mlp has different meanings from an author to another
in the literature, because historically the perceptron is for binary classification and not regression.
These networks are fully connected neural networks hence with fully-connected layers. All the
possible connexions between the nodes of a layer to the nodes of the next layer enter the models.
This also induces that the network receives an input "x" at the first layer and produces an output -a
prediction approximating the true "y"- at the last layer after a cascade of nonlinear transformations.
The glm are a generalization or many models of classification and regression. They aim at pre-
dicting a variable "y" from the knowledge of variables "x", with an unique function with a high
level parameterization. Herein, we are interested on only particular cases with python because
dealing with this very general unique fonction is not practical, nevertheless the general case is also
discussed. These neural networks allow to extend glm in order to improve the training and the
modeling. How to define the network and how to train the parameters in order to improve the
inference for linear models and reach a relevant prediction.

Training algorithms

The interest in implementing the training algorithms is to understand the mechanisms involved

vii
and to improve some current implementations and models. Thus, the book provides a fully
working python code built on the top of the module pytorch and which is completely ﬂexible. On
the contrary to the usual python modules for machine learning where the models are already fully
implemented the process is different with pytorch: one needs to deﬁne and train the parameters for
each model or build a general implementation which is able to handle several models as proposed
herein. One can use existing programs or existing high level libraries but learning asks for coding.

It is proposed an introduction and implementation to several algorithms for the estimation of the
parameters of deep glm. The linear versions are particular cases and the foundation of these mod-
els, they are often involved in most of the chapters herein. The algorithms are implemented with
numpy and pytorch for several of them while some numerical results come directly from the exist-
ing modules that we use for the computations.

Datasets and python codes available for download

The number of datasets processed with python herein is near 20, with a complementary list of some
other datasets from repositories. The smaller dataset is 10 rows for pedagogical reasons while the
larger ones are about 500000 observation from forestry for one and 11000000 for a second one
from physics. The usual dataset "mnist" is included with 60000 images as an usual benchmark in
deep learning.
Herein, images datasets are most of the time converted into this same "tabular form" with an unique
data ﬁle where each row corresponds to an image. The ﬁle formats for large tables is also discussed
with several examples and exercices. One of the purposes of the document is to be able to handle
those large datasets with "pytorch" (and "sklearn") with a limited available quantity of computer
memory for training a linear or deep glm. Preprocessing should take only a few seconds or at most
a few minutes for millions of rows herein. Obviously, the more as possible of computer memory is
probably a wise advice but this is not always possible for a cost reason or for the largest datasets.
Several datasets and python codes are expected to be available for download at the url
"github.com/rpriam/book1" from a few days after that the book will appear at the publishing plat-
form website. Otherwise, the reader may consult for any information, the page of the publishing
platform for the book, or may directly send a message to "rpriam@gmail.com" for any request
about the datasets and the companion programs.

Included (chapters, exercices, solutions) and not included (everything else)

It has been decided to not present the backpropagation in this volume but ﬁrst-order and second-
order derivatives are discussed for the linear glm. The purpose is to get rid of most of the algebra
and be happy with the automatic computation of the values of the derivatives of any loss function,
thanks to pytorch. The bayesian models have also not been included and are left for a second
volume with more advanced models. Neural networks specialized for texts and images processing,
called recurrent, convolution or transformer networks are not discussed herein neither but the last

viii
exercice of the book allows to learn about as a perspective.
The document is divided into 9 chapters including a chapter for the introduction and a chapter for
the full correction of selected exercices (among the nearly 60 proposed at the end of each chapter)
with python codes and several additional large datasets.
The exercices allow to check if the chapter is well understood and allow to go behond the contents
of the course. A complementary contents to this book is expected to be proposed as a second
volume with more advanced methods and eventually more analytical contents.

How to run the algorithms

The computer used for writing the document is with a quite old cpu with only 4 cores and a small
gpu (2 Go) but with some supplementary memory (8 Go). In case of unlikelily memory problem, if
freeing the computer memory is not enough the reader may reduce the size of the larger datasets by
working only on a sample or adding the missing loop with chunks. But this should not be required,
most of the time, as this was handled already and took care of. Linux was the operating system
for running the python code, and jupyter notebooks, one per chapter, were initially converted
into the text format ".tex" before ﬁnal updates. This supposes to install all the python packages
and libraries (called herein "modules") required in order to access their functionalities. Another
solution is jupyter notebooks with google colab for cloud computing or for local computations.

ix
Chapter 1

Introduction

1.1 Overview
In the following chapters we are interested on the estimation of statistical models related to neural
networks. The purpose is to predict a variable yi called "target", "outcome", or "of interest" or
"dependent" with the knowledge of a vector xi , aggregating a set of p variables {xi1 , xi2 , xip } even-
tually large, called "predictive", "independent" or "explicative", and otherwise "predictor". One
supposes that there exists a relation between x and y, say a function g(.), such that:
yi = g(xi1 , xi2 , xip ) .
The most usual models for this problem are divided into two different cases, one for the prediction
of a numeric value and one for the prediction of a categorical value. They are named after the:
- "regression models" when y is a scalar (or sometimes a vector) with its values continuous
(real) or discrete (integer ordered). They are called "linear regression" for a continuous
outcome, and "simple" for one predictor or "multiple" for several ones.
- "classiﬁcation models" when y is a category or class label. The category is usually coded
with an integer not ordered, just as an index. Thus, they are called1 "logistic regression" or
"multinomial regression" for respectively two classes or more than two classes.
These models depend on unknown parameters θ , because they need to adapt to a sample. Each
dataset and related population is different such that the prediction changes. Instead of changing the
function, only the parameters need to change. The function is well chosen and enough general in
order to be relevant for many different datasets or populations, each one with a different parameter.
This is written as follows:
yi = f (xi1 , xi2 , xip ; θ ) .
For each dataset or population, the parameters in the vector θ needs to be found, or say "inferred",
"estimated", or "learned" while the model is said to be "ﬁtted" or "trained",
θ =? .
1 Thus,both modeling cases lead to the prediction of a numerical value with related models. Herein all the models
have same foundation, as the (extended) members of the "generalized linear models" most of the time.

1
1.2 Notion of sample and log-likelihood
1.2.1 Random sample
The population is supposed inﬁnite or enough large to be supposed inﬁnite. Thus a datum is here
of the form:
(yi , xi ) ,
with for continuous (q ≥ 1) or binary (q = 2), multinomial (q > 2), discrete (q = ∞) regressions:

xi = (xi1 , xi2 , · · · , xip )T ∈ R p and yi ∈ Rq (with often q = 1) or yi ∈ {0, 1, · · · , q} (with q ∈ N+

∗ ).

The dataset is a data sample with n observations,

sn = {(y1 , x1 ), (y2 , x2 ), · · · , (yi , xi ), · · · , (yn , xn )} .

Because only a sample is available and not the whole population or even an enough large sample,
there exists an (eventually small or big) error in θ̂ = θ̂ (sn ), the value of θ found from the sample.
This is not the true parameter θ ∗ for the whole population which is unknown, hence there is an
algebrical and numerical difference between the available value and the true value.

1.2.2 Loglikelihood
The "probabilistic distribution" of the sample comes from a member of the family of "generalized
linear models" such that there exists a known parametric form for:

f (yi ; xi , θ ) .

Thus, the "loglikelihood" is deﬁned as the log of the n products of the density function or proba-
bility mass function above, one per sample, such that,

log L(θ ) = log f (y1 , y2 , · · · , yi , · · · , yn ; x1 , x2 , · · · , xi , · · · , xn , θ )

= log { f (y1 ; x1 , θ ) × f (y2 ; x2 , θ ) × · · · × f (yi ; xi , θ ) × · · · × f (yn ; xn , θ )}
= log f (y1 ; x1 , θ ) + log f (y2 ; x2 , θ ) · · · + log f (yi ; xi , θ ) · · · + log f (yn ; xn , θ ) .

And, with the summation sign sigma or sign sum,

n
log L(θ ) = ∑ log f (yi; xi, θ ) .
i=1

The derivatives of the loglikelihood w.r.t. θ allows to write an algorithm in order to ﬁnd an esti-
mation θ̂ of the parameters, given a sample of n observations.

2
1.2.3 Optimization
The search for the parameters θ is summarized as follows:

θ̂ = argmaxθ log L(θ ) .

This notation says that the vector θ̂ to the left is the solution which maximizes the function to the
right. This function depends on the vector θ which can take usually a not countable inﬁnite number
of possible different values, and we are looking for the one which makes the function maximum.
In neural network, this is a loss function, thus, the optimization is the look for a minimum, for
instance of " - log L(θ̂ ) " here. Sometimes, the loss function is instead just a distance between the
true target values yi and their approximations from the network ŷi , as explained later.
There are many algorithms for this task because, the problem is different if the dataset is large or
small, the variable space is high dimensional or small dimensional, and for other computational
constraints. This is not rare to have different algorithms for differents models, as in computational
statistic for bayesian settings for instance. But for neural networks, the algorithms are often more
generic because the derivative are not written in closed-form, and often estimated numerically.
These derivatives are the horse power of most of the training algorithms because the nonlinear
function is approximated locally with a linear or quadratic function before the optimization. The
approximation comes from these derivatives, as in any usual Taylor serie. For some rare cases, the
solution θ̂ is a closed-formed expression, exact without algorithm, such as for the linear regression.

1.2.4 Bias and variance

As the parameters θ̂ = θ̂ (sn ) comes from a sample sn of size n, there are several questions about: is
it accurate enough for the current or new data. Generally, in a statistical setting for machine learn-
ing with smooth functions, the estimator is such that it converges to the true parameter for large
samples, θ̂ → θ . It is relevant if it is unbiased, which means that the average of the estimator θ̂
from many samples is equal to the true parameter unknown θ , E(θ̂ ) = θ . Another properties which
comes with the estimator is its variance, say V (θ̂ ) > 0, which tells how near the true parameter is
to its approximation via a conﬁdence interval, this is discussed in a next chapter.

1.3 Numerical example with a nonlinear curve

1.3.1 Simple linear regression
This is the simplest case actually, there is no need for a distribution, just a loss function:

1 n
�(θ ) = ∑
n i
(yi − β0 + β1 xi1 )2 .

3
The difference with a distribution -and its related loglikelihood function- is that the loss function
is just a measure of the difference between the true targets and the approximated targets as found
from a model, while the distribution has a probabilistic interpretation.
Here, yi is the true target variable while the prediction from the model is ŷi = β0 + β1 xi1 . In the
literature this notation is also sometimes after training such that ŷi = β̂0 + β̂1 xi1 , thus this has to
be understood according to the context. A way to be clear would be to write ŷi (θ ) and ŷi (θ̂ )
respectively without additional symbol for the target variable.
The derivatives are found with usual algebra as follows.

∂ �(θ ) ∂
�1 n 2
�
β0 = β0 n ∑i (yi − β0 + β1 xi1 )
∂ �(θ ) ∂
�1 n 2
�
β1 = β1 n ∑i (yi − β0 + β1 xi1 ) .

Which leads to:

1 n
0 = n ∑i −(yi − β̂0 + β̂1 xi1 )
1 n
0 = n ∑i −(yi − β̂0 + β̂1 xi1 )xi1 .

If not resolved by the more elementary way to write one coefficient as a function from the first
equation and replace in the second equation, in order to find the expression of one coefficient
before the expression of the second from first equation, the problem is simply written matricially
for resolution:

� 1 n
�� 1 n
�
1 n ∑i xi1 β̂0 n ∑i yi
1 n 1 n 2 = 1 n .
n ∑i xi1 n ∑i xi1 β̂1 n ∑i yi xi1

This leads to:

� � � 1 n
�−1 � 1 n
�
β̂0 1 n ∑i xi1 n ∑i yi
= 1 n 1 n 2 1 n
β̂1 ∑
n i xi1 n ∑i xi1 n ∑i yi xi1
� � � ��
β̂0 1 1 − 1n ∑ni xi1 1 n
n ∑i yi
= 1 n 2
β̂1 1 n
n ∑i xi1 − ( n ∑i xi1 )
2 − 1n ∑ni xi1 1n ∑ni xi12 1 n
n ∑i yi xi1
� � � ��
β̂0 1 1 −x1 y
= .
β̂1 var(x1 ) −x1 x12 yx1

The usual linear algebra leads to the known analytical usual solution, with an inverse of the matrix
which is either analytical for p = 2 (as just above) or p = 3, either numerical (via a computer

4
program) for larger values of p. When the function is nonlinear, the problem is different because a
closed-form solution is not available yet. Another issue is for large p and large n where the matrix
involved is large such as the numerical algorithm is too slow or worse, the matrix does not ﬁt in the
computer memory. In both cases, alternative algorithms allow to get a solution directly from the
expression of the derivatives above in order to ﬁnd the unknown parameters (see next chapters).

1.3.2 Polynomial regression

For regression, a polynom can be seen as a neural network, but with an heuristic as a ﬁrst approach.
A simple way to consider at ﬁrst an hidden layer in a neural network with a nonlinear transforma-
tion is with polynom:
• From unidimensional input xi = xi1 to multidimensional layer
 
xi
 x2 
 i 
 3 
xi → g� (xi ) =  xi  ∈ R� .
 .. 
 . 
xi�
This is not the traditional layer with a linear transformation, but nowadays, a layer has a
generic meaning wich could include such transformation. Because neural networks are most
often useful for high-dimensional input vectors instead of a simple scalar, such transforma-
tion is most often not considered, and replaced by a nonlinear function as in a next subsection
for an example.
• From multidimensional layer to unidimensional output ŷi
 
xi
 x2 
 i 
 3 
g� (xi ) =  xi  → ŷi ,
 .. 
 . 
xi�
where the approximation of the true target variable is written as a weighted sum, of the
powers of xi plus a constant term,
�
ŷi = g(xi , θ ) = w0 + wT g� (xi ) = w0 + ∑ wk xik .
k=1

In a matrix format, one gets that:

    
ŷ1 1 x1 · · · · · · x1� w0
 ŷ2   1 x2 · · · · · · x2�   
     w1 
 .. = .. .. .. .. ..   ..  .
 .   . . . . .  . 
ŷn 1 xn · · · · · · xn� w�

5
Note that the simple linear regression is also written with this form, with � = 1, because the result
before is retrieved by the product to the left of the transpose matrix which contains the variable x
and called design matrix.
The estimation ŷi approximates yi with the help of the knowledge of xi , and depends on parameters
wk , such that a comparison between the vector of yi and the vector of ŷi is the Euclidean distance,
or the mean. It is thus optimized the following criterion.
• Cost function with θ = (wk ), it is obtained by comparing the true input with the approxi-
mated output.
1 n
�(θ ) = ∑[yi − w0 − wT g� (xi )]2 .
n i

• Optimization, the optimal solution searches for the best parameters.

θ̂ = argminθ �(θ )
= argminθ n1 ∑ni [yi − ŷi (θ )]2
= argminθ n1 ∑ni [yi − w0 − wT g� (xi )]2 .

The solution is known via a least square or regression problem and with an analytical solution
(when there is no problems such as collinearities). Note that an usual trick is to consider the vector
xi by adding the component 1 as its ﬁrst component, such as the "intercept" (also called "bias" in
the computing literature: understood as a bias term for the model without this additive correction)
w0 comes as a component into the vector w for a lighter notation, with "[w0 , wT ][1, g� (xi )T ]T ".

1.3.3 Neural network

For the nonlinear regression, the loss is with g(), and,

1 n
n∑
�(θ ) = [yi − g(xi , θ )]2 .
i

The nonlinear model is approximated with a linear model in the polynomial function but this has
some limits for more complex functions which are unknown and not polynomial.
In a neural network, the nonlinearity is modeled with layers which are linear transformations asso-
ciated to a nonlinear function.
The neural network with one hidden layer, instead of a polynomial transformation, is a linear
transformation plus a nonlinear one. The nonlinear transformation is called "activation function",
in the sense that each component in the hidden layer is a neuron, and it is activated or not in the sum
thanks to the activation function which has often a shape as a "S curve": a smooth switch zero or
one. This is not discussed further herein, the idea is biological with signals which are propagated
in the brain via neurons which are or not activated according to the received amount of signal.
Formally, this is just written as follows:

6
• From unidimensional input yi to multidimensional layer
 
xi w11
 xi w12 
 
 
xi → xiT W (1) =  xi w13  ∈ R� .
 .. 
 . 
xi w1�
A non linear function σ (), such as sigmoid or thanh, is applied to each component,
 
σ (xi w11 )
 σ (xi w12 ) 
 
T (1)  σ (xi w13 ) 
σ (xi W ) =   ∈ R� .
 .. 
 . 
σ (xi w1� )

• From multidimensional layer to unidimensional output ŷi Another set of weights are required
in order to retrieve the dimensionaly of yi , such that one just writes that:

σ (xiT W (1) ) → ŷi ,

where,
ŷi = g(xi ; θ ) = w20 + σ (xiT W (1) )T W (2) ,
or eventually,
� �
ŷi = g(xi ; θ ) = w20 + σ σ (xiT W (1) )T W (2) ,

or directly,
� �
ŷi T
= g(xi ; θ ) = σ w20 + σ (xi W ) W
(1) T (2) .

The later case would be for a target variable in the range [0; 1] because the additional term w20
would make the infence more cumbersome if left out from the second activation function. while
the just before case would be for a target variable in the real line with continuous values. The
case without a second activation function is just more linear and may be enough for ﬁtting some
simple curves. There exists theorem which prove that neural networks are able to approximate any
nonlinear function, but in practice this is not always easy to ﬁnd a relevant value for their weights.
Here w20 is not called "intercept" but "bias" in the literature for neural networks, while,

σ (xiT W (1) )T W (2) = w21 σ (xi w11 ) + w22 σ (xi w12 ) + w23 σ (xi w13 ) + · · · + w2� σ (xi w1� ) .

Thus ﬁnally, this is another optimization problem:

7
• Cost function
1 n
n∑
�(θ ) = [yi − w20 − σ (xiT W (1) )T W (2) ]2 .
i

• Minimization w.r. θ

θ̂ = argminθ �(θ )
= argminθ 1n ∑ni [yi − ŷi (θ )]2
= argminθ 1n ∑ni [yi − w20 − σ (xiT W (1) )T W (2) ]2 .

Generally, xi is not one dimensional but p dimensional and � << p, instead of here having � larger
than the dimensionaly of xi , because the neural network reduces the dimensionaly in order to add
nonlinearities, as the real space for the data is generally smaller than the one given. There is a ﬁrst
transformation associated to the layer, linear and nonlinear, followed by a linear transformation,
and eventually a second nonlinear transformation, not considered here. Note that on the contrary,
for small dimensional xi , an hidden layer has more neurons than p in order to increase the dimen-
sionality of the nonlinear hidden space, � > p, and get a good prediction for the target variable.
This replaces the polynomial transformation with order not explicit, as the order is automatically
chosen from the found weights. In theory, one hidden layer is probably enough, but two or three
are not rare in practice.
In the literature, it was shown that such functions are able to retrieve the true function g() for a
well chosen set of weights and activation transformations: this is more general than the polynomial
regression, again with weights but also nonlinear transformation. There is a difference between
knowing the function as in inference for the nonlinear regression and estimate its parameters,
while not knowing the form of the function in neural networks and estimate the weights in order to
retrieve the function: g(., θ̂ ) versus ĝ(.). A consequence is that more parameters are to be ﬁtted in
neural networks, this induces more problems of generalization not only because the real function
is unknown but also because of the numerous unknown parameters. This is more an estimation of
function than an estimation of parameters, even if weights are estimated like usual parameters.

1.3.4 Numerical illustration

Let have a work directory where data ﬁles, data directories, and temporary ﬁles are stored.

��
��

Let check the computer memory state. There should be remaining some space otherwise the oper-
ating system risks to freeze. For neural network, it is a good habit to check the computer memory.
The code for this document was written with a laptop with more than 6 GB of memory, thus
it might be running in most of recent architecture very well and even for older computers with
enough computer memory.
For instance, psutil may be considered here.

8
��
��
��
��

��

��
For the current chapter, the computer memory is not a concern as the data samples are very small.
A sample from a nonlinear function is generated and the purpose is to retrieve this function, or at
least retrieve the target variable in order to predict their values for new data. Think for instance,
to a function in physics or in biology which is unknown. By the past scientists looked for a way
to find an expression for the function according to some theory, and then used to fit its parameters
according to some observation available, while nowadays they may use directly the computer in
order to find the function. In this paragraph, the function is simple, but it may be more complicated
and more difficult to find for real problem with an unknown form for the function.
In this chapter we do not need pytorch which comes after in the next chapters. A sample from a
polynomial curve is generated with python via a new function and stored in the computer memory
as python objects "x_all" and "y_all", with 80 points. Hence "x_all" and "y_all" are numpy arrays
of one dimension, their length is n = 80. For instance yi = xi2 with xi the values at equal distance
for the inverval [−1.5; 1.5]. But here a polynom with a higher degree was chosen. The data sample
(or dataset) is defined from the two data "x_all" and "y_all". It may be stored as a data table which
is a matrix with two columns, one for "x_all" and one for "y_all". From these data, we want to find
the link from "x_all" to "y_all", hence the polynom (coefficients and degree) is unknown. In next
chapters, even the information that the function is polynomial may be unknown because neural
networks can infer the function whatever its shape: this is where pytorch becomes useful and in
particular when the problem becomes high dimensional. See also the exercices at the end of the
chapter for a first implementation with the python module "sklearn" for these dataset and function.
Usually, one wants to know how behave the model on a new dataset, thus, it is sampled a part of
the dataset to simulate this. Otherwise, as the generating function is available here, it could be
sampled directly new observations instead.

��
��

��

��
��
��
��
��

� � ��
��

9
��
��

��

The dataset was stored in txt ﬁles in order to get the same data at each run and made available the
sample.

� ��
� ��
� ��
� ��
� ��
� ��

Thus, the ﬁles are loaded.

��
��
��
��
��
��

��
��

��

The data sample was splited into two subsamples, one for ﬁtting the model and one for checking the
obtained model with new data. This is what happens with real data when one wants to predict the
unknown target variables from new values for the explaining variables, this is the ultimate purpose
of the modeling. Actually, in deep learning the purpose is most often "prediction" than "explaining"
because the nonlinearities make the explaination cumbersome, such as neural networks were often
called "black-box" by the past even if now they are more called "artiﬁciel intelligence" or just
related to "deep learning" which is a big change actually. There is even some direction of current
research in order to make them explainable which is completely the opposit of the former naming.

For the moment, we stay in the linear case, let draw the real curve and the available sample points.

10
��
��
��
��
��

��
��
��
��
��
��
�
��
��
��
��
��
��
��
��
��
�
��
��
��
��
��
�→��

��
��
��

� ��
��
��
��

��
��
��
��
��

11
��
��
��

Figure 1.1: True polynomial curve and train sample

Figure 1.2: True polynomial curve and test sample

Note that here the distribution for the xi was not taken uniform, hence the reader may use its own
example of data for further understanding the polynomial ﬁtting and the sample error associated.

Simple regression ﬁtting

12
The ﬁtting for the simple regression leads to the following result. The solution is written with the
available values in the sample.
2 , y , and y x , are respectively:
With a precision of four digits, the means xtrain , xtrain train train train

��

The derivative of the loss function, 1n ∑ni [yi − g(xi , θ )]2 , w.r.t. β0 and β1 leads to the resolution via
a linear system:

� ��
1 −0.5628 β̂0 8.0168
=
−0.5628 1.1940 β̂1 −11.7599
� � � �−1 � �
β̂0 1 −0.5628 8.0168
=
β̂ −0.5628 1.1940 −11.7599
� 1 � � ��
β̂0 1.3610 0.6415 8.0168
=
β̂ 0.6415 1.1399 −11.7599
� 1 � � �
β̂0 3.3672
=
β̂1 −8.2620

This is a good exercice to find the linear system (see also next chapter) with paper and pencil,
before writing this solution. With python code, first the matrix is defined, followed by its inverse
and its product with the column vector, as follows.

� � ��

��

� � ��
��

��

Note that for a larger number of variables, a more robust solution is to solve for the linear system
Aβ = b instead of taking the inverse of A, this is also available in the module numpy with a native
function.
Thus, β̂ is found equal to:

��

��
��
The estimations from the simple linear model are obtained as,

ŷi = β̂0 + β̂1 xi

= 3.36718 − 8.26200 xi .

13
Thus with python, one gets the prediction for the test sample as follows,

��

� ��
� ��
� ��
� ��
� ��

A measure of the error from the model is the value of the loss function, here the mean square error
(mse), which is computed as follows,

��
��

��

Let solves this problem directly with a numerical algorithm implemented in the module
"scipy.optimize" in order to ﬁnd the unknown parameters as follows.

��

��
��

��

� ��
� ��
� ��
��

��

The same value is slightly obtained as expected for a simple linear model. Anyway, the mse is
large because the linear model is not relevant for this dataset, and a nonlinear regression should be
ﬁtted. Next, the curve is approximatively retrieved with a better ﬁt than just a linear regression, by
changing the function to feed the numerical optimization module.

14
Polynomial ﬁtting with scipy

The fitting of a curve is illustrated in this paragraph. We are going to compare several polynoms
with different degrees because usually the polynomial degree is unknown when only the data sam-
ple is available. The purpose is to find the best polynom, say the relevant degree and a relevant
estimation for the coefficients.
��

Note that an analytical expression exists already in the linear case such as the simple regression
and the polynomial regression. Thus, this example illustrates the ﬁtting of any function with a
numerical algorithm and allows to compare with the exact solution. For an optimization of a
function with the python module scipy, one writes the nonlinear function and the ﬁtting as follows:

��

��
��
��
��
��

��

�→��

��

�→��

��

��
��
��
��

The resulting estimations are visualized and compared numerically.

��
��
��
��
��
��
��
��

15
��
��
�→��

��
��
��
��
��
��
��
��
��
��
��

��

��
��
��
��
��
��
��
��
��

Figure 1.3: Several ﬁtted polynomial curves and train sample

16
Figure 1.4: Several ﬁtted polynomial curves and test sample

The error here is the usual average quadratic distance between the estimated and the true values.

��
��
��
��
��
��

��

� � ��
� � ��
� � ��
� � ��
� � ��
� � ��
� � ��
� � ��
� � ��
� ��
��
��

��

��
��
��
��
��
��

17
��
��
��
��

Figure 1.5: Error on the train and test samples

It is clear that for the train sample, increasing the degree of the polynom induces a better ﬁtting,
while for new data with the test sample, this is not true, the error increases and even explodes in
some cases. The curve learns all the noise of the data instead of learning the underlying smooth real
curve. Note that the optimization with scipy allows to introduce bounds on the parameters such as
the parameter is no greater than its bounds. In a next chapter, another approach is considered also
related to curve ﬁtting with an explicit penalty term added to the objective function.
With the smaller mse for the test sample, hence for new data, the model kept here is the polynomial
regression with four degrees. The estimations from this model are obtained as,

ŷi = β̂0 + β̂1 xi + β̂2 xi2 + β̂3 xi3 + β̂4 xi4

= −0.97154 − 0.42205 xi + 2.55021 xi2 − 2.98778 xi3 + 1.13000 xi4 .

This model is globally not far of the true one, according to the parameter values, but for some
datasets, the mse on the test sample may be not minimum exactly for the right degree of the
polynom because this way to choose is not optimal or the sample too small. It makes sense that
looking just to one new sample is not enough if the test sample is not reﬂecting the general case,
and better approaches are implemented in a next chapter with several test samples instead of just
only one. This is related to the problem of generalization which is interested on the quality of the
model for a novel data sample, or how make a model enough general for any new similar dataset
from the unique available sample.

18
Next chapters, pytorch is presented for such cases with also some main algorithms for deep learn-
ing. Instead of relying on a numerical library for the optimization, the numerical training must
be written for each general architecture of neural network. This allows to avoid the traps coming
from the nonlinearities and the associated useless local minima of the loss function. In the other
hand, neural network models are relevant for many current data and still able to predict an out-
come for new data. Less ﬂexible models are more limited and perform lesser, hence extending
linear statistical models within the neural framework is appealing when enough data are avaible
for training.

1.4 Glm as neural networks

The linear and polynomial regression are not enough because they are not relevant for non contin-
uous target variables and unknown nonlinearities. This is why was deﬁned the family of models
called glm or linear generalized model where the distribution functions and the underlying loss
functions allows to extend the Gaussian and Euclidean cases, in the statistical literature. Some
members of this family are considered in the next chapters.
As a recall to generalized linear models, let denote θ for the natural parameter and φ for te positive
dispersion parameter. The target variable yi belongs to an exponential family with density or mass
of speciﬁc form, � �
yθ − b(θ )
p(y|θ , φ ) = exp + c(y, φ ) .
a(φ )

In the generalized linear models the mean µ = E(Y |x) is written via a function called link function
which is a strictly increasing function, and which is applied to the usual inner product of the
parameter vector β with the vectors of explicative variables (plus bias one for the intercept) xi ,
such that the link function η(.) has an inverse η −1 (.) and,
g(µ) = η(xβ ) .

This also induces:

yi ≈ µ(xiT β ) .

In neural network, the functional relation between yi and xi is more complex and it is retrieved the
notation of the statistical nonlinear regression. Instead of µ(xiT β ) in linear generalized models,
this is written with weights w,
yi ≈ µ(xi , w) .

To be more precise, often at least for classiﬁcation and regression which are met in deep learning
for images and biology, the expression is:
yi ≈ µ(φ (xi , w)T β ) .

In this case, yi is most of the time univariate and continuous as a curve ﬁtting problem like the one
in this chapter, or categorical for image classiﬁcation with advanced layers in the neural network
in order to process the spatial information hidden in the images.

19
Deep regressions can often be interpreted as nonlinear regressions with a particular expression
for the nonlinearities: the observed vector of explaining independent variables is transformed via a
nonlinear space before applying the usual ﬁnal weighting and sum towards the target variable. His-
torically, the Gauss-Newton method was introduced for inference of the parameters of such model
and back-propagation which is related to the so-called "chain rule" for the derivatives. With a
growing number of variables and large datasets, ﬁrst-order algorithms are usually more practicable
than exact second-order methods even today.

1.5 Why learn pytorch

Pytorch is a module for python which is a powerful interpreted computing langage with hundred of
modules available. This module is dedicated to the definition of architectures of neural networks
and their training. A way to maximize the likelihood with this module is to minimize a loss as for
the three cases of distribution often met: Bernoulli, Multinomial, and Gaussian for binary, discrete
and continuous target variables. In the general case, one has to write her/his own loss function,
or just consider one existing in the python module. For neural networks, the loss measures the
difference between the true target variable yi and the one found by the network ŷi , by extending the
usual Euclidean distance to other measures of discripancy more relevant for robust comparison or
binary/multi-category/integer comparisons.
It is possible to consider losses not defined in the python module such as for survival analysis or
any likelihood or even function to be optimized. Another example is for a nonlinear regression
when the nonlinearities are not polynomial anymore, a main reason to consider neural networks
when the nonlinear function is unknown. With this module, many neural networks are already
implemented, and available online, thus this makes possible to use and adapt them for our datasets,
but also to improve, combine or evaluate. The module allows also to optimize in a fast and reliable
way the architectures while finding values of the weights automatically for very large datasets,
with diverse computers (laptop, desktop, distributed servers) and diverses ships (cpu, gpu, multi
cpu, multi gpu). This module is available with computer clusters or supercomputers for going to
scale after prototyping with a simple home computer or workstation.

1.6 Exercices
1. (python) Compare side by side in an array, for ten observations, the true values and estimated
values for the target variable when the polynom is ﬁtted with four degrees. Hint, the parame-
ters are found as the second array in the return from the function f_polynomial_regression()
and stored in the variable y_test_polyn_hat_s.
2. (sklearn) Consider a neural network with non linear functions instead of the polynomial
regression and ﬁnd suitable values of the weights for this dataset. Compare the pre-
dicted/estimated target values and the error with the results from the ones above in the chap-
ter.
3. (python) Implement a "cross-validation" with scipy and propose a method for regularizing
the model. Hint, a way to see this problem of regularization is that when the polynomial

20
degree increases, the regression coefficient increases and the norm of the vector of coefficient
should be kept enough small.
4. (stat) How to choose well the training and test samples for best results.
5. (stat) Retrieve the solution for one variable of the linear regression in a matricial notation
from the initial matricial expression of the y as a function of the x (as in the polynomial
regression above). The solution should be the same than the one written after the derivatives
proposed here. The linear system with the analytical expression comes from the derivative
w.r.t. β0 and β1 before rewriting into a matrix and two vectors by isolating the vector of
regression coefficient at each row.
6. (stat) For the polynomial regression (and more generally multiple regression), an analyti-
cal exact expression is available for the mean squared error. Retrieve this expression, and
its value for the polynomial regression. Statistical or machine learning books treats these
questions with a more elaborate developement, out of the scope herein.

21
22
Chapter 2

Linear models with numpy and scikit-learn

In this chapter we are interested on ﬁnding a linear function for explaining observations yi depend-
ing on p independent variables xi j aggregated in the vector xi = (xi1 , xi2 , · · · , xip )T . This is written
with a noise εi as,

yi = f (xi ) + εi
= f (xi1 , xi2 , · · · , xip ) + εi .

The noise comes from the error with the linear approximation. The error may be induced by an
inexact measure (for prediction) or unknown variables (for explaining). This model is usually
called a regression model while f (xi ) is the regression line. Here in the linear case we have just:

yi = β0 + β1 xi1 + β2 xi2 + · · · + β j xi j + · · · + β p xip + εi .

With real data the noise is often Gaussian, and other noises are possible actually, depending on
the distribution of yi . Note that with pytorch, there is just need to choose a loss function among
a list of loss functions already available in the python module, or eventually define the loss by
writing a python function, before launching the optimization which does not ask for more analytical
work. On the other hand, knowing the underlying mechanism allows a better understanding of the
algorithms, their limits and a better extension or improvement the models if one is not happy with
the current available contents.
The main purpose is thus to find the coefficients β j from a sample of n couples (yi , xi ), compound
of the variable yi called target, outcome or dependent and the variable xi called the predictor or
explanatory or independent (non dependent) variable. The sample is a set of n couples,

sn = {(y1 , x1 ), (y2 , x2 ), · · · , (yn , xn )} .

The method is called supervised because the variables yi is available for each vector xi , hence, the
purpose is to "explain" or "predict" yi as a function of xi . These two purposes are well separated:

23
• "to explain" has for meaning that the coefficients β j can be understood as weights positive
or negative for the dependant variable yi ,
• "to predict" has for meaning that given new values of xi a value for yi is computed and should
be near the true value.
In practical applied statistics, one can explain with a non predictive model. Hence, whatever is the
size of the variance of the noise, its distribution just needs to be typically Gaussian. Eventually,
one may not even look at the noise, but this becomes mandatory for validating the model and the
coefficients of the regression. Herein, we want a relevant model for prediction, and the noise might
be not too large otherwise the prediction could be not enough efficient.
The linear algebra provides a solution for the vector of regression coefficients β =
(β0 , β1 , β2 , · · · , β j , · · · , β p )T ∈ R p+1 in closed-form, but next chapters will present alternative so-
lutions via diverse algorithms. Finding a value for β from a sample has diverse namings in the
literature: infering, fitting, learning or training. They may not have exactly the same meaning from
an author to another, but one always gets β̂ given the sample, and this approximated value shoud
be near the true value.
As the sample comes from a whole population often infinite, a consequence is that the obtained
solution depends on the sample: β̂ may be denoted β̂ (sn ), β̂s or β̂n . This estimation comes with
some variability in the sense that changing the sample leads to a different estimation β̂ . Thus when
possible, it is checked the expectation and the variance of β̂ , at least for simpler models such as
linear.
The other quantities which are measured are related to the sample, and compute how good are the
estimation ŷi in comparison to the true values yi of the target variable.
Next, a more formal presentation leads to the expression for the coefficients and of the diverses
criteria discussed for the quality. A probabilistic interpretation of the regression is written with the
help of the Gaussian noise. An illustrative example follows then.

2.1 Linear regression

The basic assumption when one derives the equations for linear regression is to assume that the
output is determined by a given continuous function f (x) and a random noise ε given by the
Gaussian distribution with mean value equal to zero and a spherical variance with undetermined
parameter σ 2 .
Recall that the derivative of the cost function of the linear regression with respect to the parameters
β is equal to zero: X T (y − X β̂ ) = 0. This leads to the solution for β̂ such that:

β̂ = argminβ �(β )
� �−1 T
= XT X X y.
where,
1 2
�(β ) = n �y − Xβ � .

24
Here the loss function is written directly with a quadratic distance in order to have a notation
related to a Gaussian distribution.

2.1.1 Probabilistic interpretation and model

It is introduced a term related to a variance such that, the criterion leads to an equivalent criterion
for the optimization problem:

˜ , σ ) = n log 2πσ 2 + n �(β )

�(β
2 2σ 2
n ||(y − Xβ )||22
= log 2πσ 2 +
2 2σ 2
n 2 ∑i (yi − xiT β )2
= log 2πσ +
2 2σ 2
1 (yi − xiT β )2
= ∑i log 2πσ 2 + ∑i
2n 2σ 2
T
= − ∑i=1 log p(yi ; xi β , σ )
= − log ∏ni=1 p(yi ; xiT β , σ )
= − log p(y; X, β , σ ) .

The two last rows comes from a property of a Gaussian probability function as follows. The prob-
ability distribution of the dependent variable yi conditionally to the independent variables is nor-
mally distributed because of this equality just before. Even without this hypothesis of distribution,
this results holds. This comes from the expression of this underlying distribution as a Gaussian
distribution:

� �
1 (yi − xiT β )2
yi ∼ N(xiT β , σ 2 ) is equivalent to p(yi ; xiT β , σ ) = √ exp − .
2πσ 2 2σ 2

After considering such expression, we just need to take the logarithm from the distribution of the
i.i.d. variables yi conditionally to xi . Conditionally just mean that the variable xi appears in the
mean, hence can be supposed constant, while this is yi which is random. This makes sense because
yi is depending on xi and comes after that xi is given a value. Thus, coming with this distribution,
we retrieve our criterion which is optimized for the usual linear regression, despite that there was
eventually no distributional hypothesis originally (with just the simple sum of squares for the loss).
Indeed:

log p(y; X, β , σ ) = log ∏ni=1 p(yi ; xiT β , σ

�) �
(y −xT β )2
= log ∏ni=1 √ 1 2 exp − i 2σi 2
� 2πσ � ��
n 1 (yi −xiT β )2
= ∑i=1 log √2πσ 2
exp − 2σ 2 �
�
1 (y −xT β )2
= ∑ni=1 − log 2πσ 2 − i 2σi 2
2

25
Hence one can assume that the variables with values yi are identically and independently distributed
according to the above Gaussian distribution. This leads to define the likelihood of an observation
yi with the input variables xi given the vector of coefficients β which is unknown. Since these
events are assumed to be independent and identically distributed we can build the probability dis-
tribution function (p.d.f) for all possible observations (y1 , y2 , · · · , yn ) as the product of the single
events for the observations yi .
To find a value for the unknown parameter β , the more frequent method is the maximum likelihood
estimation (MLE). This method estimates the parameters of the parametric probability distribution,
given a data sample. At the end, the function is maximized such as the sample maximizes the like-
lihood function. For analytical reasons and practical reasons, this is the logarithm of the likelihood
which is maximized, also called the loglikelihood, this is equivalent because the logarithm is a
strictly increasing function: this allows to avoid the maximization of products which is a tricky
numerical problem. Note that the loss in neural network is minimized, thus this is the negative
of the loglikelihood which is optimized by modules such as pytorch, but they are not limited to
loglikelihood and other functions not related to parametric distribution are possible such as for
robust losses or more advanced structures of neural networks (recurrent for sequence processing,
or convolution for image processing for instance, out of the scope herein). This induces that there
is a strong distributional hypothesis in this statistical inference, which is not required in neural
networks: they are able to fit any unknown function with a relevant optimization and a suitable
architecture (hidden layer(s) and number of neurons). Only more flexible statistical models such
as tree-based for instance are able to compete with neural networks.
The consequence of this result is that the solution for β is a maximum likelihood estimator, and
there is also required the estimation for the variance parameter:

˜ ,σ)
β̂ = argmaxβ �(β
˜ β̂ , σ ) .
σ̂ 2 = argmaxσ �(

The first line comes from the fact that σ cancels out for the first term and remains proportional for
the second term, when computing the derivative of the function �(.,˜ .) w.r.t. β , while the solution
for σ is after the solution for the regression coefficients.

2.1.2 Example of simple linear regression (data)

The work directory is given from the function.

��
��

��

26
��
��

��

In this example, the explaining variable is univariate such that xi = xi1 , hence:

yi = β0 + β1 xi + εi .

In python, we generate the noise εi , the explaining variable xi , compute yi and show their linear
relation with the line.

The data and true parameters are stored in the computer disk, thus loaded just below from text ﬁles.

��
��
� � ��
� � ��
��

��

��
��
��
��

��

��
�→��

��
��
��
��
��

��

27
Figure 2.1: True regression line and sample with 10 observations.

2.1.3 Analytical solution for the linear multiple regression

More formally, a regression model is able to model the conditional distribution of the variable y
with the given variable x, denoted p(y|x).

As seen just before, a sample of n couples (xi , yi ) is available with :

• The dependent variables yi for i = 1, 2, . . . , n

• The independent variables xi = (xi0 , xi1 , xi2 , . . . , xip )T for i = 1, 2, . . . , n multivariate in R p+1
with xi0 = 1 for the intercept β0 .

For notational reasons, the intercept β0 is included in β while xi is also with an additional compo-
nent xi0 in order to deal with this deﬁnition of beta. The function f (.) which allows to write yi as a
function of xi is linear here. For instance, the number of apples produced in an apple tree may be
in a linear relation with the season, the average temperature and the height of the tree.

Below, we rewrite the regression with a linear system in order to ﬁnd an analytical solution.

Linear system of n equations and matricial notation

Let rewrite the n linear equations, one per line as follows, via three different notations. The regres-
sion line is true for each observation couple (yi , xi ), such that:

28
y1 = β0 x10 + β1 x11 + · · · + β p x1p + ε1 y1 = ∑ pj=0 β j x1 j + ε1 y1 = β T x1 + ε1
y2 = β0 x20 + β1 x21 + · · · + β p x2p + ε2 y2 = ∑ pj=0 β j x2 j + ε2 y2 = β T x2 + ε2
.. .. ..
. . .
yi = β0 xi0 + β1 xi1 + · · · + β p xip + εi yi = ∑ pj=0 β j xi j + εi yi = β T xi + εi
.. .. ..
. . .
yn = β0 xn0 + β1 xn1 + · · · + β p xnp + εn yn = ∑ pj=0 β j xn j + εn yn = β T xn + εn

In the system above, it is recognized a column vector to the left for the yi , a column vector to
the right for the noises εi , and a scalar product between the vectors xi and β . A more condensed
expression of the system with the sign sum is given at the second column. In a vectorial notation,
this linear system is written at the third column. One must notice that the new observed variable
xi0 was introduced for notational reason, its value is one.
It is easily recognized a product of a matrix with a vector just after the sign equal, which allows a
matricial notation as explained next paragraph. The notation for the different vectors and matrices
is below, followed by the more condensed notation with a matrix for the independent variables
instead of vectors.
Let deﬁne the matrix X as follows. The matrix X ∈ Rn×(p+1) is called the design matrix. It allows
to have a matricial expression of the linear system involving the regression coefﬁcients:

   T     
1 x11 x12 ... ... x1p 1 x1;1:p y1 ε1
1   T  
 y2  ,and ε =  ε2 
x21 x22 ... ... x2p   1 x2;1:p    
X =
. . . = , y =
... ... ... ... . . .  . . .  . . . . . .
1 xn1 xn2 ... ... xnp T
1 xn;1:p yn εn

Here, x1;1:p is when keeping the p last components of xi without the first equal to 1, this is the more
usual notation for xi , not preferred here for the matricial computation, otherwise β0 is separated,
leading to more lenghty analytical expressions. Another approach is a separated name, x̃i when
adding the first constant component, which is not considered here.
After the design matrix, three column vectors are defined, one for the target variable, one for the
regression coefficients and one for the noise.
Here xi = (xi0 , xi1 , xi2 , . . . , xip )T ∈ R p×1 , y ∈ Rn×1 , β = (β0 , β1 , β2 , . . . , β p )T ∈ R(p+1)×1 and ε ∈
Rn×1 . For the three vectors, the dimensionality is obtained such as this leads to rewrite the n
equations with an unique matricial equation:

y = Xβ + ε .

This means that when the data is available, from physical or biological measures for instance, the
data are written with this matricial expression, with y and X, in order to ﬁnd a solution for β . For
the example just before, this leads to the following numerical matrix and vector:

29
� � ��
� � ��
� ��
� ��

Here recall that p = 1. In summary, from the sample, it is available the information from the
data y, which has a role different from the other variables available. The other variables x are
aggregated in the matrix X which is called the design matrix. The noise is latent and included in
the observation yi , such as it is not dealt with in a ﬁrst approach: with yi given, and ŷi estimated,
this is just εi = yi − ŷi = yi − β T xi . Note that it is estimated with ε̂i = yi − β̂ T xi when β̂ denotes
the vector of regression coefﬁcients found according to the available sample. The noises are the
quantities minimized with a square as it it recognized:

1 2
�(β ) = n ∑i εi (β , X) .

This is the gap between the observations yi and their linear approximations ŷi , such that the regres-
sion line must minimize these gaps as a solution.

The linear regression model has for purpose to predict or to explain y in terms of X via a linear
relation. Other forms of relation may be polynomial for instance, or even more non linear for a
function of X involving another analytical approach than this chapter. The linear relation between
X and y is relevant when it is acceptable, thus it will be consider how to check such relation. The
linear hypothesis leads to the ﬁtting of the linear regression model and to ﬁnd regression parameters
which are estimation of β = (β0 , . . . , β p )T from the available sample.

The linear regression provides a mathematical expression of the relation of yi as a function of the
p variables xi j and the parameters β j . This leads to explain the variable yi from the predictors xi j
by the sign of the coefﬁcients β j for a positive or negative inﬂuence from the independent variable
to the dependent variable. This leads also to predict the variable ynew
i from new predictors xinew
j
by using the linear equation and ignoring the noise such that ŷi new is a guess for the true target
variable ynew
i which is unknown and not available.

This shows the difference between explaination and prediction. In the ﬁrst one which "explains",
the noises may be large as only the linearity and the shape of the noise are checked hence a large
variance σ is possible and frequent. In the second one which "predicts", the noises and so the
variance must be small such that the values ŷi found by the model are expected to be near the true
values yi . This is why in clinical research for instance, the analysts may focus on small sets of
independent variables which allow to explain and not to predict. This generates a large amount of
concurrent results which are compared afterwards with the idea that if the bias are different and the
conclusions are identical, an effect is detected. The predictive models may required several tens
of variables which are costly to gather each time. They may be difﬁcult to improve because this
supposes to look for new relevant variables at the beginning of the study.

After this introduction to the notations and the purpose with practical illustrations, the expressions
for the solution are found next.

30
Optimization for β - mse to be minimized

For the ﬁtting of the model, say ﬁnd a value for β , it is required to write a criterion and optimize
it, the criterion will be a function of β in order to proceed at the optimization (minimization or
maximization). The solution is written as a function of the sample, because only the sample is
available.
A relevant criterion is one able to minimize the error between the true yi and the predicted ŷi from
the model. This is available from the εi , and for mathematical ease but also in order to remove the
sign of εi , the square of the noise is usually minimized. This allows nice closed-form solutions
with nice statistical properties even if this can be improved as will be seen later.
The loss, also called cost function because it is minimized, is thus written as follows:

1 n 2
�(β ) = n ∑i=1 εi (β , X)
1 n 2
= n∑� i=1 (yi − ŷi ) �
1 T
= n� (y − ŷ) (y − ŷ)
�
1 T
= n (y − Xβ ) (y − Xβ )
1
� T T T T
�
= n y y − 2y Xβ + β X Xβ .

The vectorial notation in the criterion at the third line comes from just the norms of the vector
y − ŷ which is the usual trick here. The fourth line comes from rewriting ŷ with its expression as a
function of the design matrix and the vector of the regression coefﬁcients.

Optimization for β - analytical solution

Having a cost function, our purpose is now to minimize this loss w.r.t. the parameter vector β . For
the estimation of β which leads to a solution β̂ which depends on the sample, we just minimize
the cost function via the optimization problem:

β̂ = argminβ ∈R p+1 �(β ) .

The minimization is performed via the derivative of the cost function with respect to β which
is equal to zero at the optimum. This gives a mathematical expression as follows in a vectorial
approach:

∂ � �
[�(β )] = 2n −2X T y + 2X T Xβ
∂β
= −2 T
n X (y − Xβ )
= 0.

31
This leads to

X T (y − Xβ ) = 0
XT y = X T Xβ
� �−1 T
β̂ = XT X X y.

Here the expression given is available only if the matrix X T X is invertible, otherwise a way around
is required, such as making the matrix inverting by inflating the diagonal or prefering a pseudo-
inverse, or using a numerical algorithm for the optimization which does not involve the matrix
inversion. Note that nowadays for large matrices, such algorithms are more and more wanted and
used in practice.
One may also notice that the optimization may be performed component-wise which avoids any
knowledge on matrix derivative. This translates in mathematical terms as follows: proceed at
the computation of the derivatives component by component for each coefficient β j with j in
0, 1, . . . , p + 1. More precisely one may write the expression for each ∂∂β j [�(β )], then one may
rewrite the expression from the p + 1 univariate derivatives via vectors and a matrix: the same
matricial solution is obtained as with the direct matricial derivative. Some algorithms proceed also
at an optimization component by component or by set of components instead of all the components
once, both approaches are out of the scope herein.
As the design matrix is defined with X ∈ Rn×p+1 , its product with the transposed version,
X T X ∈ R p+1×p+1 , is a squared matrix of side p + 1. If for a small example, the matrix may be
invertible, for larger values of p + 1 this becomes quickly intractable because the squared matrix is
dense and ask for too much computer memory or because there are just too much useless variables
that one needs to remove. The solution above for the regression is not really able to lead to a so-
lution relevant in this case, and other algorithms are required. Improved algorithms and improved
regression models are thus available in the literature and python modules such as online training
and variables selection. This kind of numerical variants are introduced in the next chapters.

Optimization for β - checking that the cost function is minimized

When a solution from the optimization is obtained, it is usual to check if the optimum is a minimum
or a maximum, this is possible by looking at the sign of the eigen values of the second-order
derivative. For this regression model, one gets that the second-order derivative of the criterion is:

∂ 2 [�(β )] 2 T
H= = X X.
∂βT∂β n

This second-order derivative is usually called the Hessian matrix and denote H. As a product of
a matrix with its transpose, it is always positive, hence the optimization is a minimization and the
problem is said convex, with X supposed of full rank here. The Hessian matrix and the gradient

32
which is the ﬁrst derivative are involved in an approximation of the criterion around the optimal
solution, such as:

∂ [�(β )] 1 ∂ 2 [�(β )]
�(β ) = �(β̂ ) + (β − β̂ )T + (β − β̂ )T (β − β̂ ) + R(β )
∂β 2 ∂βT∂β

This is the same approximation which is available from Taylor series for univariate functions. Here
for a quadratic problem we must have the remaining term R(β ) equal to zero while the expression
is exact. A consequence is that if it exists, there is only one unique solution, which insures the
convergence of the algorithms with few conditions. For nonlinear functions, the remaining term
is not zero in general and there exist local solutions which are not global (the best), and make
the optimization more tricky: today for neural networks, solutions and workarounds have been
recently proposed hence this problem is partially solved for some architectures.

Optimization for σ

Let see when we solve for an exact solution,

� �
∂ ˜ β̂ , σ ) = 0
�(
∂σ

Hence, it comes:

� �
2
∂ n
log 2πσ 2 + ||(y − X β̂ )||2 = 0
∂σ 2 2σ 2
� �
||(y − X β̂ )||2
∂ 2
∂ σ n log σ + 2σ 2
= 0
� �
||(y − X β̂ )||2
∂ 1 2
∂σ nσ − σ3
= 0
1
σ̂ 2 = 2
n ||(y − X β̂ )||2

Note that this estimation is biased because the expectation is equal to n−p
n σ , a result which is found
in any statistical book for the linear model, such that the unbiased estimation may be preferred for a
more advanced concern. This also induces that minimizing the loss minimizes also this expression
for the variance term, while the loglikelihood is maximized.

2.1.4 How to check the quality of the linear regression

The residuals are orthogonal to the vectorial space span from the columns of X because the solution
is able only to model this space. The remaining space is kept in the noise which is in a pythagorian

33
triangle. In order to check if the fitting of the linear model is successful diverse methodologies have
been proposed in the statistical literature by computing different indicators which tell us how the fit
is relevant. One of the most used method is to check the residuals yi − ŷi which might be Gaussian
distributed, centered at zero and without any structure hidding some correlation among the data.
Thus, one can have a look at the (studentized) residuals with a plot, at their distribution with an
histogram or at the comparison of their distribution with the Gaussian one with a qq-plot. These
plots and the related statistical tests for the linear regression are not considered herein, but are
wise complements to the usual numerical indicators. Instead, a plot of yi against ŷi will illustrate
variables selection in the dedicated chapter for the regression with a continuous target variable
while a matrix aggregating the counts for comparing the values of yi against ŷi for a discrete target
variable is presented at the end of the current chapter with an example of classification.
For the regressions with a continuous outcome, numerical indicators are as follows.

Mean squared error (MSE)

This indicator is formally an empirical expectation over a sample. It is written as the mean of the
square of the noise per datum as:

1 n
MSE(y, ŷ) = ∑ (yi − ŷi)2 .
n i=1

This sum is just the loss or objective function that was minimized in order to ﬁnd the solution for
the regression. When comparing models with different choices of variables, the better ﬁt may be
with the smaller MSE but actually one must also care about the error for new data. The MSE is
never equal to zero because there is always some random noise with real data. The MSE is written
as the following python function,

��
��

The mean absolute error (MAE) is related to the mean squared error by replacing the quadratic
distance by an absolute distance, and not often given, on the contrary to the next indicator, without
any doubt the preferred indicator in many domains despite its limits.

Coefﬁcient of determination or R2

This indicator is related to the prediction from new independent variables. The larger is the better
with a maximum at 1.0, a more negative value in the worser scenario of bad ﬁtting, and a value at
0.0 for only an intercept β0 without covariates. The score R2 is deﬁned as

∑ni=1 (yi − ŷi )2

R2 (y, ŷ) = 1 − ,
∑ni=1 (yi − ȳ)2

34
where we have deﬁned the mean value of y as

1 n
ȳ = ∑ yi.
n i=1

It is also known that this statistics is unable to check the linearity of the data, and supposes that the
hypothesis of linearity is true, hence may provide erroneous high value for real function which are
nonlinear: the residuals should always be checked too. The R2 is written with a python function,

��

Note that these indicators are available with sklearn for some methods such as LinearRegression
in sklearn.linear_model, from the sub module metrics, with the names "mean_squared_error",
"r2_score", "mean_absolute_error" while "mean_squared_log_error" is an alternative one.

2.1.5 Example of simple linear regression (continued)

In this paragraph, we explain how to compute the regression coefﬁcients and the different scores
directly from the formula without using any python module except for basic algebra and then
compare with two modules available in python and able to solve the regression problem via an
implemented function. The values for the design matrix X and the vector of dependent variables y
have been given before in this chapter.

Implementation from python with the algebra

The direct implementation involves a matrix inversion and its multiplication with y in order to get
the regression coefﬁcients. The python operator "@" is for a matricial multiplication in its usual
deﬁnition by summing along the columns for the element coming at the left and along the rows for
the element coming at the right. The function ".dot()" allows an equivalent matrix multiplication.
The function "linalg.inv()" from the module numpy is for the inversion of a squared matrix.

The solution for the regression is written as follows with python:

��
��
��

��
��
� ��
For this small dataset, this result comes after a few steps as follows:

35
 T  −1  T  
1 0.35 1 0.35 1 0.35 0.07
       
 1 0.45   1 0.45   1 0.45   0.94 
   
    
 1 0.02   1 0.02   1 0.02   −1.06 
       
 1 0.18   1 0.18   1 0.18   −0.21 
       
 1 0.23 1 0.23 1 0.23 −0.01
β̂ =        .
 1 0.93   1 0.93   1 0.93   2.26 
       
 1 0.53   1 0.53   1 0.53   1.55 
       
 1 0.48   1 0.48   1 0.48   1.50 
       
 1 0.73   1 0.73   1 0.73   2.09 
1 0.04 1 0.04 1 0.04 −0.23

From the products X T X with the matrix X and X T y with the vector y, this leads to:

� �−1 � �
10.000 3.950 6.913
β̂ =
� 3.950 �2.334 5.567
−0.757
= .
3.666

There is a small difference with the previous direct numerical solution because of the rounding
of the numbers in the matrix, which recalls that rounding numbers is suitable only for showing
the results, while intermediate expression should be computed without approximation in order to
avoid any propagation of the rounding error.

Implementation from numpy

In the module numpy, the regression is implemented with the function "linalg.lstsq()" such that,

��

��
��

��
��
� ��

Implementation from sklearn

In the module sklearn, the linear regression is implemented in the function lin-
ear_model.LinearRegression(r). We get the fitting, the regression coefficients from the fitting and
the prediction for the small dataset as follows.

36
� ��

��

��
��
��
��
��

��
��

��
��
� ��
Here, xi has for first component 1, hence there is no need to fit the intercept, with the option
"fit_intercept=True", otherwise by default it is fitted and return in the variable "intercept_", while
the vector of coefficients is returned in the variable "coef_", which in the present case contents
already the intercept.

Indicators with numpy

The R2 and the MSE from the two functions deﬁned above are equal to:

��
��
��

��
��

Solution for σ

The solution for the (biased) standard-deviation is computed as,

��

��
Let plot for the small example, the cost function for σ before having the equation solve. The
loglikelihood is computed for several values of the standard-deviation. Note that the design matrix

37
could be an entry for the function instead of a global python variable. Then we plot the cost error
for the variance quantity.

��
��
� � ��
� � ��
��
��
��
��
��
��
��

Figure 2.2: Loglikelihood and variance parameter

We� see that there

� is a clear maximum, which can be checked with the second-order derivative,
2 ˜ 2
∂ −�(β̂ , σ ) /∂ σ , which is positive. The estimation of the regression coefﬁcients does not
depend on this variance hence the later is generally not involved for neural networks.
The functions for drawing scatterplots and curves in the graphical output above are just below.

��
��
��
��
��

38
��
��
��
��
��
��

Most of the functions defined in the chapters are written in a python file in order to be available
after. Note that the functions in the file "deepglmlib.utils.py" does not need to refer "utils" because
the functions are already defined locally, thus, this is removed in every functions in the file.

Indicators with sklearn

For the regression, the indicators are computed in a function with optional printing of the indicators
and drawing of the scatterplot between true and predicted targets.

��

��
��

��
��
� ��
��

��

��
��

��
��
��
��

��
��
��

��

It it recognized the mean square error (for linear and nonlinear regression) and the coefﬁcient of
determination (for linear regression).

39
��
��

��
��

2.1.6 Remarks
A main purpose of the next chapters is to consider nonlinear models by changing the linear pro-
cessing of the explaining/predicting variables from a sum into some nonlinear function, i.e. β T xi is
changed into a more general function g(xi , β ) where g(., .) also depends on linear transformations
associated to non linear ones which follow each other. The target variables yi shall also be eventu-
ally not continuous, such as binary or integer, associated to different loss functions. The training
of such deep models will be handled via pytorch where g(., .) comes from a particular expression
explained in the next chapter after: one gets powerful and accurate regression and classification
models when the dataset is enough large. If for usual expressions of g(.), python modules for
machine learning such as sklearn are able to process the related model fitting, when the dataset
increases and the model evolves, more flexible approaches are required such as offered with the
module pytorch for running the optimization.
This concludes the introduction to the linear regression. The reader may have noticed that the linear
classification is also a regression with a different distribution for the outcome, in the framework
called generalized linear models. Next, instead of a continuous variable for the outcome, this is a
binary one which is involved.

2.2 Logistic Regression (2 classes)

In this chapter, a classiﬁcation model called logistic regression for a binary dependent variable yi is
presented with its analytical formulation and its implementation available from a python module.
The case of more than two classes is discussed at the end of the chapter.
With binary values, the previous Gaussian distribution in the linear regression is changed into the
Bernoulli one. This is an usual approach in research, when one changes something and observes
what happens. In an experiment, this would be a condition which is altered. For instance, in the
medical world, there would be "new medecine" or "no medecine" for ill people, and the analyst
would count after how many people (from a sample) survived to each hypothesis (with or without
medecine) in order to decide if the drug can be given to more people when it performs better than a
previous one. In statistical modeling this may be the data distribution as here which is changed in
order to check if the new model is more suitable for the prediction, otherwise the previous model
is kept, similarly.

2.2.1 Probabilistic interpretation

Let present the new modeling involved in a logistic regression. The dependent variables yi is
binary,

40
�
0 for the class of failure
yi = .
1 for the class of success

As before y is a vector aggregating all the observations for the outcomes, X is the n × p design
matrix while β is the p × 1 vector of unknown parameters or regression coefﬁcients.
The usual linear regression predicts values on R while we are interested on predicted values in the
set {0, 1}. We have seen that the linear regression involves a Gaussian distribution, then changing
the distribution into a Bernoulli one makes sense but its parameter needs a bounded probability.
Recall that the parameter of the Bernoulli distribution is the probability of success, say the proba-
bility that the observation is equal to one. For this reason, the logistic regression is based on this
distribution but instead of just implementing the inner product xiT β directly as the parameter, it
supposes a nonlinear transformation of this inner product:

� �y � �1−yi
yi ∼ B(σ (xiT β )) is equivalent to p(yi ; xiT β ) = σ (xiT β ) i 1 − σ (xiT β ) .

This model is the one we want because σ (xiT β ) ∈ {0, 1} and it is deﬁned for binary observations
as expected. Note that the usual notation σ () is just a coincidence, and not related to any variance
quantity with the scalar σ from the Gaussian likelihood.

Figure 2.3: Sigmoid function u → σ (u) with "S shape"

Here, σ () is an increasing smooth function with bounds in order to model probabilities.

41
Sigmoid function

The sigmoid function (or the logistic curve) is useful in neural networks for assigning weights on
a relative scale. The value u is the weighted sum of parameters involved in the learning algorithm,
with 1 − p(u) = p(−u). This function takes any real number, u, and outputs a number in ]0; 1[.

exp u 1
σ (u) = = .
1 + exp u 1 + exp −u

Note that the logistic model leads to a smooth version of a neural network called "perceptron", not
studied herein, as it performs less well. There are some theoritical concerns (such as number of
steps before convergence of the algorithm) with this simple network which are out of the scope.
Recall that it is assumed to have two classes with yi either 0 or 1. Furthermore it is assumed also
to have an unknown vector of parameters β in our ﬁtting with the sigmoid function. With that, it
is deﬁned probabilities:

exp (β0 + β1 xi1 + β2 xi2 )

p(yi = 1|xi , β ) = = σ (xiT β ),
1 + exp (β0 + β1 xi1 + β2 xi2 )
1
p(yi = 0|xi , β ) = = 1 − σ (xiT β ) ,
1 + exp (β0 + β1 xi1 + β2 xi2 )

where β are the weights we want to guess from data, in our case β0 , β1 , and β2 , in the example
below, for instance.
Note that, as {yi = 0} and {yi = 1} are complementary events,

p(yi = 0|xi , β ) + p(yi = 1|xi , β ) = 1 .

Likelihood function

It is deﬁned the total likelihood for all possible outcomes from the available sample or dataset
{(yi , xi ), 1 ≥ i ≥ n}, with the binary labels yi ∈ {0, 1} and where the data points are drawn inde-
pendently.

log p(y; X, β ) = log ∏ni=1 p(yi ; xiT β , σ )

= ∏ni=1 (p(yi = 1|xi , β ))yi (1 − p(yi = 1|xi , β ))1−yi

The purpose is thus to maximize this probability: the probability of the observed data for this
model. The likelihood is approximated in terms of the product of the individual probabilities of
each outcome yi , because of the hypothesis of independence.

42
This allows to implement the Maximum Likelihood Estimation (MLE) principle where the opti-
mization leads to solutions for the unknown vector of parameters. This solution has known prop-
erties, all exactly the same for any form of likelihood from Gaussian, Bernoulli, etc distributions.
In neural network, this is a loss which is minimized, the opposit of the loglikelihood.

Loss function

After that, one applies the logarithm from the mass distribution of the i.i.d. variables yi , such that:

log p(y; X, β ) = log ∏ni=1 p(yi ; xiT β )

� �y � �1−yi
= log ∏ni=1 σ (xiT β ) i 1 − σ (xiT β ) .

Since the events for the observations were assumed to be independent and identically distributed
the probability mass function for all possible event y is a product of the single events, such that
after the logarithm, the loglikelihood is obtained:

�(β ) = log p(y; Xβ )

= ∑ni=1 (yi log p(yi = 1|xi , β ) + (1 − yi ) log [1 − p(yi = 1|xi , β ))]) .

Reordering the logarithms, one can rewrite the sum as, with an example for two independent
variables,
� �
�(β ) = ∑ni=1 yi (β T xi ) − log (1 + exp (β T xi ))
= ∑ni=1 (yi (β0 + β1 xi1 + β2 xi2 ) − log (1 + exp (β0 + β1 xi1 + β2 xi2 ))) .

The maximum likelihood estimator is deﬁned as follows: this is the set of parameters that maximize
the loglikelihood, say maximize �(β ) with respect to β .
In neural networks, since the cost (error) function is just the negative loglikelihood, for logistic
regression the sign would be changed for a minimization (of a loss). The correponding loss is
named the "cross-entropy".

2.2.2 Derivatives
The cross-entropy is a convex function of the weights β and, therefore, any local minimizer is a
global minimizer.
Minimizing this cost function with respect to the two parameters β0 , β1 and β2 we obtain,

∂ �(β ) � �
exp (β0 +β1 xi1 +β2 xi2 )
= − ∑ni=1 yi − 1+exp
∂ β0 (β0 +β1 xi1 +β2 xi2 )
∂ �(β ) � �
n exp (β0 +β1 xi1 +β2 xi2 )
= − ∑i=1 yi xi1 − xi1 1+exp (β +β x +β x )
∂ β1 0 1 i1 2 i2
∂ �(β ) � �
exp (β0 +β1 xi1 +β2 xi2 )
= − ∑ni=1 yi xi2 − xi2 1+exp (β0 +β1 xi1 +β2 xi2 ) .
∂ β2

43
Let us now define a vector y with n elements yi , an n × p matrix X which contains the xi values
and a vector p of fitted probabilities p(yi |xi , β ). We can rewrite in a more compact form the first
derivative of the cost function as,

∂ �(β ) ∂ 2 �(β )
= −X T (y − p) and = X T ΩX .
∂β ∂β∂βT
Here it is denoted a diagonal matrix Ω with elements ωi = p(yi |xi , β )(1 − p(yi |xi , β ), which leads
to a compact expression of the second derivative too.

2.2.3 Predicted labels ŷi

Let see how translate the modeling with the sigmoidal function in order to have the new predicted
values from β . This is now for a threshold s in ]0, 1[ such that we get that:
�
0 if σ (xiT β̂ ) > s
ŷi = .
1 if σ (xiT β̂ ) ≤ s

The difference with the linear regression is that the output from the regression was directly used for
ŷi while for the logistic regression this is a probability which needs to mutate into a binary value.
The choice for s is according to diverse criteria in order to insure errors balanced or unbalanced
for the two possible values of outcomes. Here, an error happens when yi �= ŷi as explained next.

2.2.4 How to check the quality

In order to check if the fitting of the linear model is successful the same methodology than for
the regression is not anymore relevant because the target variable is discrete. Hence the indicators
which tell us how the classification is relevant are different. Note that residuals are defined for
the logistic regression but rarely considered in practice. Actually, the mse from the regression is
retrieved nextafter but the theory is different here. The linear regression can be relevant and not
exact for all the data while the linear classification can only be relevant if it is exact for enough
data. The main indicators are as follows.

Classiﬁcation errors - how ŷi compares with yi

Because the problem is a classiﬁcation and the variable y is not continuous, the purpose is to
retrieve exactly the same class for each datum. Indeed, yi ∈ {0, 1} and ŷi ∈ {0, 1}, then the errors
happen whenever yi is not equal to ŷi . The four different cases are summarized just below:
• yi = 1 and ŷi = 1 is True Positive or TP
• yi = 0 and ŷi = 0 is True Negative or TN
• yi = 1 and ŷi = 0 is False Negative or FN
• yi = 0 and ŷi = 1 is False Positive or FP

44
Confusion matrix

A matrix allows a visual representation of the four cases listed above. They are summarized with
the cells of the "confusion matrix" which shows the counts for the correspondence or non corre-
spondence between the true yi and the predicted ŷi . The matrix is thus deﬁned as follows:

ŷ = 1 ŷ = 0
y = 1 #TP #FN
y = 0 #FP #TN

The cells with #TP, #FN, #FP and #TN denote the sizes of the set by couting the number of TP,
FN, FP and TN in the sample when comparing the true target yi with the predicted target ŷi . We
are happy if the diagonal of the matrix is large, and the non diagonal is small, but other alternative
indicators exist (and also for unbalanced classiﬁcation not considered here). The value of the
indicators depends on the cut with s for choosing ŷi , this explains why the cut may be decided for
maximizing an indicator of interest, even if often the threshold 0.5 is selected by default.
�

Accuracy

This indicator is written as follows,

#TP+#TN
ACC(y, ŷ) = .
#TP+#FN+#FP+#TN
The value of the indicator is related to the criterion or loss function that was minimized in order
to ﬁnd the solution for the logistic regression. This quantity is not directly optimized because it is
not smooth on the contrary to the loss of the regression, and non smooth functions are difﬁcult to
optimize. The expression is exactly equal to the previous mean square error, because (yi − ŷi )2 is
equal to zero (with (1 − 1)2 = 0 and (0 − 0)2 = 0) if there is no error and equal to one if there is an
error (with (0 − 1)2 = 1 and (1 − 0)2 = 1), while the denominator is the size of the sample.

��
��

Error rate

This indicator is on the contrary to the accuracy the percentage of missclassiﬁed units in the sample,
such that:
ERR(y, ŷ) = 1 − ACC(y, ŷ) .
�

45
Precision and recall

The "precision" is the percentage from the number of true positive against the number of true
positive and false positive, while the "recall" is the percentage from the number of true positive
against the number of true positive and false negative
#TP
PREC(y, ŷ) = .
#TP+#FP
#TP
RECALL(y, ŷ) = .
#TP+#FN
This is only a measure for the the success, conditionally to the true target equal to one, which is
mesured here, as a percentage from the count of predicted positive w.r.t. the count of true positive.
This reﬂects the purpose to only retrieve the value 1 of the target variable, otherwise the true
negative would be in the numerator.
�

ROC curve

This curve is a representation of the points (precision by recall) when the threshold changes within
the interval [0; 1] in order to ﬁnd the best threshold which maximizes both:

s → (PRECs (y, ŷ), RECALLs (y, ŷ)) .

This curve is not so much considered because the accuracy is more often a concern with a sym-
metric error, such that, one may plot eventually this curve as a complement, named here the "ACC
curve",
� �
#TPs #TNs
s→ , .
#TPs + #FNs + #FPs + #TNs #TPs + #FNs + #FPs + #TNs
Here it is denoted the index s when the counts come from a value s as the level of cut for ﬁnding
the target variable ŷi from the the continuous quantity σ (xiT β̂ ), instead of just selecting s = 0.5
from the usual and not optimal ﬁrst choice, see "sklearn.metrics.roc_curve" for instance.
�

AUC

The quantity from the "Area Under the Curve" ROC or "AUC" allows to judge the quality of a
classiﬁcation. With 0.5 the random choice for the predicted labels, a larger value towards 1 leads
to the best model. A visual representation associated to this area or just the value of area is a
complement to the other indicators. Note that in image classiﬁcation, a wide domain of research
in neural networks, authors focus on the accuracy most of the time, while this area appears more
frequently in machine learning for tabular data for instance.
�

46
F1 score

The F1 score is a particular case of F-measure (with parameter equal to 1), an harmonic mean:
2#TP 2
F1 (y, ŷ) = = 1 1
.
2#TP+#FP+#FN PREC(y,ŷ) + RECALL(y, ŷ)

MCC

The "Matthews Correlation Coefficient" is related to the Pearson correlation for a contingency
matrix, such as a phi coefficient or mean square contingency coefficient,
#TP #TN - #FP #FN
MCC(y, ŷ) = .
(#TP+#FP)(#TP+#FN) + (#TN+#FP)(#TN+#FN)
�
All these indicators for two classes are generalized for a muli-classes model, with the confusion
matrix which required the K classes instead of just two at the level of the rows and columns, while
the other indicators are also extended by counting the errors per class and summing.
Herein, the implementation of the indicators comes from the module sklearn. These indicators are
available from the sub module metrics, with the names "accuracy_score", "precision_score" and
"recall_score" and "confusion_matrix", with the first argument for the true target variable and the
second one for the predicted one.

2.2.5 Example of bivariate classes

For an artiﬁcial dataset, the Gaussian distribution from the beginning of the chapter is replaced with
a Bernoulli one, let generate a sample and have a look to the obtained data. Actually, if we have
two classes with the covariates xi , this means that these variables themselves have a partitioning.
Hence, the independent variables are generated in order to come from a mixture of distributions
with two modes, say two different groups more or less separated. Without such underlying classes,
there is no reason that yi codes two classes. Because of the supposed dependence, this must be
caused by the independent variables.

��
��

� ��

� ��
� � � � � ��

47
� ��

� � � ��
� ��

� � � ��

Let save the dataset with the python module numpy in text format.

� ��
� ��

��
��
��

� � ��
� � ��
� � ��

��

From the data sample, the generation process in unknown, and only the observed x and observed y
from the sampling process are available. To be clear, the "true class labels" will be the "observed
class labels" herein and available in the dataset, not the class labels of the independent variables x
which are unknown, not modeled, and out of the scope for our analysis.

��
��
��

��

The format for y as a list with only one dimension is often preferred in some python modules,
otherwise a not related exception may be received without further explaination, such that checking
if it is a (python) vector or a list-like format is a good way to do, before all, in order to avoid tricky
computer messages from python.

� � ��

Let plot the dataset with the known labels.

48
��

��

��
��
��
��

��

Figure 2.4: Data sample for two classes with a linear frontier

From the graphic, we get two classes almost separated as expected because the independent vari-
ables are generated separately for each class with a distance between the means larger than the
standard-deviations. The logistic regression will be able to find the frontier between the two classes
with again a line as in linear regression but in a different way. Actually, this is the reason why these
two models are called "linear models". Note that there is not a closed-form solution on the con-
trary to the linear regression, hence, an iterative algorithm is required whatever the size of the
dataset. This is common to many methods in statistics, machine learning and even neural network
which will be studied in a next chapter. Note that the algorithm will ask for a computation of the
first-order derivative of the cost function. While the second-order will be not required for all the
numerical algorithms. If this is the case in this chapter by default for the used implementations,
second-order ones are less used in deep learning because of the size of the networks and the large
number of weights or unknown parameters to find in order to fit the model to the sample.

49
��
��
��
��

��
��

Here it has been chosen the algorithm called ’lbfgs’ by sklearn, but other algorithms are available
such as ’newton’ for instance, they are both second-order methods, and thus better to use for small
datasets.
The parameters and the accuracy after training follow.

��
��

��
� ��
� ��

��
��

��
An hyperplane is the frontier between the two classes such as p(β̂ T xfr ) = 0.5, thus with two
dimensions where p = 2 this leads to the equation:
� 0.5 �
β̂0 + xfr1 β̂1 + xfr2 β̂2 = ln 1−0.5
= 0.0
This line is added to the graphic with the two classes as colored points in order to illustrate the
resulting classiﬁcation. Such function is presented for drawing several lines useful next chapter
while only one solution from sklearn appears below.

��
��
��
��
��
��

��

50
��
��
��

��

��
��

��

��
��
��
�→��

Figure 2.5: True frontier between the two classes

The number of errors is recognized visually (when some points does not overlap) by counting
the points from one side and the other side of the line, which does not belong to the correspond-
ing class from each side. This also underlines that for a nonlinear frontier, a line is not enough,
which explains why nonlinear models are required such as with neural networks. Some errors are
removable by improving the frontier between the two classes.

51
2.2.6 Indicators with sklearn
For the classiﬁcation, the indicators are computed in a function with optional printing of the indi-
cators and drawing of the scatterplot (not implemented here) between true and predicted outcome.

��

��
��

��
��
��
��
��
��
��
��
��
��
��
��
��
��

��

It it recognized the accuracy, precision and recall, plus the confusion matrix, which are described
previously in the chapter. For the bivariate example just before, the resulting error is as follows.

��

��
��
� � ��
��
��
��

Here, we are mainly interested on the accuracy which may be also computed during the training
if a loop is implemented. As expected from the visual inspection of the classes, the classiﬁcation
error is small, 0.06, because the classes are linearly separable.

52
2.3 Multinomial regression (3 or more classes)
When the classification is for more than two classes, in this case, either a rule is set from binary
classifiers, either the model is updated for multi-classes. The first way asks for combinations of
binary classifiers, while the second way asks for a new distribution for the classes. Following the
modeling approach of the chapter, this results to:

yi ∈ {1, 2, · · · , K} .

Here the number of classes may be equal to 3 or more. The distribution is not anymore Gaussian
or bernoullian, this is a multinomial law which allows a label with more than two values.
The sigmoid function is replaced with a softmax function for the probabilities which generalized
the sum to one for K probabilities instead of just 2 probabilities, with the expression:

exp (βk0 + βk1 x1i + βk2 x2i + · · · + βkp x pi )

p(yi = k|xi , βk ) =
1 + ∑K−1
l=1 exp (βl0 + βl1 x1i + βl2 x2i + · · · + βl p x pi )
exp (βkT xi )
= .
1 + ∑K−1 T
l=1 exp (βl xi )

The method is now named "multinomial regression" or "softmax regression". Instead of one linear
regression transformed by a non linear function, there are K linear regressions with each its own
coefﬁcients and they are embedded in the softmax function. Thus, this model remains linear as the
two previous models.
The target variable yi is rewritten with binary vectors with components yik , here the value is one if
yi = k and zero otherwise.
The new likelihood is now:
yik
P(y|X, β ) = ∏ni=1 ∏K
k=1 [p(yi = k|xi , βk )] .

The usual training algorithm remains sligthly similar than for the logistic regression, via gradient
descent (Newton’s method for instance). Note that in the literature, an advice is to substract the
max of the K regressive parts in each probability of class separately in order to avoid numerical
issues. The code from the python module sklearn is as previously for two classes with the same
function "LogisticRegression()" which is able to handle a multicategory target variable.

2.4 Exercices
1. (sklearn) Instead of the logistic regression, let test the linear regression for the binary target
variables, and compare the resulting classiﬁcation on this (bad) model. We will choose 0 and
1 as values for yi . This is convenient because the linear regression will give us a continuous
value for the prediction ŷi , hence a ﬁrst choice is just greater or lower to 0.5 for deciding of

53
the class, yi = 1 if xiT β̂ > 0.5, and yi = 0 if xiT β̂ ≤ 0.5. Compare with the logistic regression
for the (very) small dataset of the chapter, or other data. Which model is performing better
for predicting the target variable.
2. (sklearn) Test a linear regression or classification for two or three small datasets from the uci
machine learning repository, kaggle or sklearn. These datasets are generally pre-processed.
Such that mostly missing values may be removed or imputed for some of them. While cat-
egorical independent variables may be binarized in order to enter the models. For instance,
the following datasets are usual benchmarks.
Dataset #rows #vars #classes
Iris 150 4 3
Breast cancer wisconsin 569 30 2
Wine Quality 4898 11 NA
Credit Card Default 30000 22 NA
Pima Indians Diabetes 768 20 2
Ionosphere 351 33 2
Glass Identification 214 9 6
Ecoli 336 7 8
Abalone 4177 8 NA
KDD Cup 1998 191779 481 NA
3. (sklearn) Test a regression or a classification for one large dataset, with for the training: a)
the full sample for a batch algorithm and b) the subsamples for a minibatches algorithm via
cycling chunks from the datafile. For instance, the "Higgs dataset from the communica-
tion "Searching for Exotic Particles in High-energy Physics with Deep Learning" (2014,
Nature) with 11000000 (11M) rows and 28 variables while y is binary, available at the
uci in csv format, or in hdf5 format also online. Another dataset larger, is available at
"https://developer.ibm.com/data/airline/" in a gzip format from a csv file with 200 million
(200M) "domestic US flights" to "predict the likelihood of a flight arriving on time", or its
subsample of 2000000 (2M) row samples.
4. (sklearn) Test a regression to predict the cover types in forest from the dataset available
from the uci machine learning repository at the entry "Covertype Data Set", with 581012
data vectors and 54 variables (attributes or features). This dataset is from "Comparative
Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest
Cover Types from Cartographic Variables" (1999, Elservier).

54
Chapter 3

First-order training of linear models

Pytorch allows to infer the weights of neural networks by automatically computing the gradient
of a cost function and implement the optimization loop without requiring something else than the
definition of the network and the definition of the cost function. All the burden for the optimization
is almost avoided except some settings in order to help the optimization to perform well with the
data. This means that for the linear regression or the logistic regression where the vector β is
unknown, one just needs to define a neural network which is corresponding to one of the linear or
logistic model, add the cost function, choose an option for the optimization and launch the learning
procedure which finds a numerical value for β̂ . The weights of the network w are identified to the
coefficients of regression while the structure of the network is very simple in this case, one of the
first to appear in the history of neural networks.
For minimizing the cost function �(w) the idea of learning parameters for the network is found
with an iterative algorithm called gradient descent, from an initial value w(0) , and a choice for the
learning rate αt often constant and equal to a small value α such that it is iterated until stabilization:

∂ � ��
w(t+1)
=w (t)
+ αt �b (w) �� .
∂w w(t)

Here b is a (mini)batch of indexes i, with sb a subset of {1, · · · , n}. Choosing the whole set or
only one single index i is possible but the later often leads to a slower converge. Here we will
first choose just one datum and iterate over the whole sample. This approach adds randomness to
the learning algorithm which is more efficient, and also avoid an heavy computation on the whole
dataset of the gradient at each step which is quicker. When s is the set of indexes {1, 2, · · · , n} with
∪b sb = s the main variants are defined as follows for each iteration of the algorithm:
• when a singleton b = i ∈ s from the sample is chosen not randomly by cyle, the algorithm is
called "gradient descent"
• when a subset sb ∈ s from the sample is chosen not randomly by cyle, the algorithm is called
"minibatch gradient descent"
• when a singleton is chosen randomly from the whole sample, the algorithm is called1
1 Robbins–Monro procedure.

55
"stochastic gradient descent"
• when a subset is chosen randomly from the whole sample, the algorithm is called "stochastic
minibatch gradient descent"
• when the full sample sb = s is chosen, the algorithm is a "batch gradient descent".
This also supposes a good choice for the learning rate αt which must change according to the
size of the minibatch. The stochastic (minibatch) gradient descent approaches are able to reach
outstanding solutions for deep neural networks with the rise of the size of the dataset which insures
generalization and nearly optimal solutions. For a stochastic procedure, one just need to shuffle,
say randomly change the order of, the units in the dataset before each epoch, where an epoch is a
full cycle on the dataset. Despite huge volumes of data and huge number of weights, the trainind
has become possible nowadays. The batches iterations are generally very slow to compute and
eventually prone to bad solutions for nonlinear functions where they are trap in local minima.
There exists some hypothesis required for the convergence of the gradient descent not discussed
here. Convexity is one which insures to reach the optimal solution, but relaxing this condition
becomes possible in the case of the stochastic version. These algorithms are relevant event when
the cost function is nonlinear with several local minima instead of just one minimum. It is also
suitable for large datasets or some non invertible matrices for instance for the linear regression. The
loss functions of linear regression and logistic regression are convex which allows non stochastic
gradient descents to converge fast to the optimum. The choice for αt is important for insuring the
convergence where a stable value is obtained at the final steps but also a good convergence where
the solution is relevant at the last step. These two conditions are required in order to get a useful
numerical value for β̂ otherwise another function or value for αt is wanted until a suitable result
is obtained. Changing the initialization is also important in order to change the local minimum
of the nonlinear function when the optimum is not reach at the end of the iterations. Think about
an example of a polynomial curve, there are several minima and the algorithm is able to go to the
minimum at each valley but not always change of valley because this would suppose enough big
changes during the first iterations, which is only possible when the gradient is not already small
near zero as at the convergence. The gradient of the function needs to be smaller and smaller
during the algorithm in order to insure the convergence, and according to the formula above for
the update, a large change of value needs either a large learning rate either a large gradient.
Next, such algorithm is implemented with numpy before using the power of pytorch. The opti-
mization is performed with numpy by writing the closed-form expression for the derivative from
linear algebra first, before the automatic derivative with pytorch after: with numpy and pytorch for
the linear regression and directly with pytorch for the logistic regression.

3.1 Linear regression

The algorithm with iterative updates for the inference of a linear regression is explained in this
subsection. Some derivatives are presented partly with only one dependent variable but the last
matricial expression is relevant for several variables (by eventually replacing the design matrix by
the more complete one). The algorithms are not completely stochastic because it is preferred to
cycle the dataset in the same order and several times instead of drawing randomly the selected

56
rows, as this lead to nearly similar result here.

3.1.1 Gradient descent with one vector and numpy

For this chapter, the simple linear model is defined with the coefficients β0 = −0.5 and β1 = 3.5
and the noises εi , such as the regression function is defined with xi = xi1 as:
yi = β0 + β1 xi1 + εi .

The set of observations or data sample is (xi , yi )i∈{1,...,n} . We want to estimate the regressions
coefﬁcients with a vector β̂ . The loss function called "mean squared error" is minimized such that,
β̂ = argmin �(β ) ,
β

where for only one independent variable,

�(β ) = ∑ni=1 �i (β )
= ∑i∈s (yi − β0 − β1 xi1 )2 .

Here n may be very large, hence it is better to have an algorithm which does not need the full
sample at each iteration.
We have seen previously that the gradient of the cost function with respect to the weights is written
as a vector which will enter the iterative procedure. As explained before, we add the component
one to the vector in order to get xi = (1, xi1 )T while aggregating all the vectors in the design matrix
X, such as at the end, the gradient is written as follows,

 � � 
� � ∂ � � � �
∂ ∂ β0 ��i (β )� (yi − β0 − β1 xi1 ) 1 � �
�i (β ) =   = −2 = −2 yi − xiT β .
∂w ∂ xi1 (yi − β0 − β1 xi1 ) xi1
∂ β1 �i (β )

Now we just need to choose a value or a function for αt = α(t) that we choose constant equal to
αt = α = .01 here for instance. This is implemented just belows, recall that:

s = {(x1 , y1 ), (x2 , y2 ), · · · , (xn , yn )} .

Next we follow the update rule of β after a random initialization of the vector, hence we implement:
� � � � � ��
(t+1)
β0
(t)
β0 1 �
= + αt × (−2) yi − xiT β (t) .
(t+1)
β1
(t)
β1 xi1

The algorithm stop at the ﬁnal step t = T . In general it must be several cycles over all the data
sample, each cycle is called an epoch. This translates in python as follow below, in a vectorized
way because vectors are handled faster than scalar components with loops in pytorch.

57
Dataset from the ﬁles

The work directory is given from the function.

��

��
��

��

��
��

��
� � ��
� � ��
� � ��
� � ��

��

��
��

��
��
� ��

Solution with the sample from algebra

The regression coefﬁcients are estimated from the sample with algebra as:

58
��
��
��
��
��
��

��
��
� ��

��

Loop with data one by one and parameters update via numpy

A ﬁrst implementation is with a simple loop then a second implementation is more oriented neural
network before a third implementation using pytorch for the derivative and a pure pytorch imple-
mentation before a ﬁnal version oriented object with a parent class from pytorch which will be the
way to do in the next chapters.

��
��

��
��
��
��
��
��
��
��

� � ��
��
��
��
��
��
��
��
��
��
��

59
��

The results are the final estimation for the regression coefficients and the loss. It is wise to show
the form of the loss after training, in order to check the convergence, as a not too fast and not too
slow decrease generally inform for a relevant choice of the learning rate, otherwise another choice
is wanted. Anyway, the gradient should be near zero in order to confirm that an optimum (here
unique for the linear regression) was reached at the end of the loop. The resulting curve validates
the learning rate.

��

��
��
��

Figure 3.1: Decreasing loss function during training, per epochs

The algorithm performs well even if there was no an advanced tuning of the learning rate, just a
few tries (not presented, but the reader might try other learning rates and compare the obtained
curves, see exercices). Note that some stochastic algorithms would sample the data vector in some
case, this is not considered here. The values for αt among 10u with u integer between −6 to 3 are
an usual try, as there does not exist any formula yet in order to ﬁnd it automatically.

3.1.2 Gradient descent with a minibatch and numpy

Next the learning with a minibatch is presented for the linear regression. As the loss function is less
noisy during the training, this will allow visual comparisons by inspection of the learning curves.
With a number of B minibatches, the loss function is now written as B sums,

60
�(β ) = ∑Bi=b ��b (β ) �
= ∑Bb=1 ∑i∈sb (yi − β0 − β1 xi1 )2 .

The new update formula for β (t) is thus written not from a single data vector but from the whole
subset in the minibatch, roughly n/B, as:
 � � 
� � ∂ � �
∂ ∂ β0 ��b (β )� 1 � �
∂w
�b (β ) = 
∂
= ∑ −2 xi1
yi − xiT β = −2XbT (yb − Xb βb ) .
∂ β1 �b (β ) i∈sb

This is implemented just below, with a size |sb | for the batch b equal to 10. For the training,
the same batches are cycled at each epoch, but shufﬂing is also possible (by reordering constant
minibatches, by sampling new minibatches with or without replacement) at each epoch:

s = s1Us2U · · ·UsB .

This is instead of changing their order or sampling batches at each step. This is convenient for a
fast access to dataﬁles for example, as the aim is often to process large datasets which cannot ﬁt in
the computer memory. The new update formula is thus as follows,
� � � �
(t+1) (t)
β0 β0
(t+1) = (t) + αt × (−2)XbT (yb − Xb βb ) .
β1 β1

Thus,

��
��
��
��

��
��
�→��

��
��
��
��

� � ��
��
��

61
� � ��
��
��
��
��
��
��
��
��

��
��
��

��
��
� ��

��
There is a nice decreasing curve for the loss during the training because the cost function is a
smooth convex parabola and the learning rate was chosen not too lage or not too small even if
probably not optimally. A better setting would lead to a solution nearer to the optimum for β̂ .
With this setting, the convergence may be slower than the stochastic algorithm.

3.1.3 Gradient descent with a minibatch and pytorch

According to the tutorial2 of pytorch, tensors are used to encode the data to process, but also the
internal states and parameters of models. Basically, tensors extend the usual two dimension matrix
or one dimensional vector into more dimensions such as cubes with three dimensions. These
mathematical objects are also called multi-array or just array in numpy. This is with a more general
deﬁnition from the point of view of the storage as they may be stored in the computer memory near
the cpu or directly in the gpu memory near its highly parallel processors. Thus arrays are involved
herein, in a similar way than with numpy with similar computations even if the storage and the
syntax may be somewhat different in some case or extended. It is possible to change the size of an
array, append arrays, get the shape and apply usual mathematical operations (addition, substraction
and multiplication).
With the module pytorch, we are not using anymore arrays from numpy but tensors from pytorch.
This allows to implement the power of the gpu by moving the arrays/tensors in the memory of the
gpu and perform in parallel the computations of the derivatives for nonlinear functions automati-
cally. This means that the gradient is directly computed by the module whatever the complexity
of the network. We just need to follow the logic implemented in the module. Note that for small
2 "https://pytorch.org/tutorials/index.html",

"https://pytorch.org/tutorials/beginner/former_torchies/tensor_tutorial.html"

62
dataset, the cpu is enough and the gpu supposes to install cuda before being available. In this
chapter we stay in the computer memory and with the cpu. For small datasets, an implementation
with numpy may be faster because more direct but asks for explicit expressions.

Loading the python module

��

In order to get the data in a format relevant for pytorch, it is usual to use a class called "Dat-
aLoader", which is optional as one may have its own loop for cycling a source of data. This python
class gives an iterator in order to cycle the dataset with minibatches, whatever are the data places:
computer memory, gpu memory, hard disk or even from a remote server via the network because
it is possible to write its own class. Hence, it is ﬁrst imported the dataset in a dataloader object,
and then implemented the autograd function with calculates a value of the gradient for a function
automatically.

3.1.4 Dataloader, autograd and training loop

Here we need ﬁrst to create an object of class "Datataset" (see documentation of pytorch) which
takes a simple numpy array in the present case. This class is very general and when implemented,
it is able to deal also with multimedia ﬁles for instance while performing parallel processing.The
Dataloader provides the iterator for reading the Dataset and loading the minibatches with the vec-
tors yb and the matrices Xb or xb in order to train the model.

Dataloader

There are several ways to implement this class, for instance by creating our own class, but here we
just create new objects of the class torch.Tensor from the whole dataset because the numpy array
is very small and ﬁt in the computer memory. After that, the iterator is created by instanciating a
new object of class Dataloader, while the operator function "iter()" leads to get the X and y from
each minibatch (of size 10).

��

We have still access to the dataset in a numpy format.

63
��
��

��
The main interest of these classes appears in next chapters to load a dataﬁle sequentially and par-
tially from the disk, directly from the classes Dataset and Dataloader, without loading the full data
in memory for any format implemented in the class like compressed binary format or multimedia
ﬁles in several directories while applying some transformation (mutate into a tensor format or into
a different shape for instance).

Autograd and loop for the updates

The automatic computation from pytorch with tensors via cpu (or gpu if added) is then written as
a function here, but this is for later use, otherwise an implementation with no function is easier to
debug if required.
First an object of class Dataset reprenting a pointer to the data in memory or computer disk is
created, and for cycling the data sample, an object of class DataLoader which is like a python
"iterator", they are named respectively "dataset" and "dataloader", as explained in the pararaph
above.
Then, the function for the optimization with pytorch is defined as follows. The iterator from
dataloader provides the minibatches of size 10 here, leading to the objects Xb and yb which are
part of the whole design matrix X and the whole vector y aggregating the target variables yi .
The vector for the regression coefficients is declared requiring gradient, with the command "re-
quires_grad_(True)" such as the automatic derivative is processed at the call from the function
"backward()". In short, a separated function here was implemented for the implementation of the
updates, with the gradient retrieved with the variable "grad" from the Tensor object which con-
tains the regression coefficients, "betat_ag", and it is reinitialised before new data. The call "with
torch.no_grad()" allows to avoid any update to the gradient for any other computations than gradi-
ent or parameters updates.

��

��
��

��
��
��
��
��
� � ��

64
� � ��

��

��
��
��

��

��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��

The call to this function for updating the parameters after one step is as follows:

��
��
��
��

��

��
��
��

��
��

65
��

��

� ��

It must be noticed that changing the loss obviouly change the gradient, but more importantly, there
is a difference between the mean and the sum, with a factor 1/n, thus this must be keep in mind
when choosing the learning rate which is proportional to this factor by linearity and deﬁnition of
the gradient. One can compare the value of the numerical gradient in "betat_ag.grad" with the
analytical expression written with numpy, they are exactly the same (with an eventual very small
numerical difference actually, often neglectable with real data).

��

To compare the numerical and analytical gradient, let rewrite the update by computing the differ-
ence between both, as follows.

��
��
��
��
��
��
��

��

This avoids to ﬁnd the mathematical expression for the derivative which is eventually not even
possible to compute exactly in closed-form with paper and pencil for more complicated loss func-
tions. A veriﬁcation of the gradient is required at least at the end of the training in order to check if
the numerical procedure has performed well (which should be the case most of the time) but also
if the gradient is enough small as required.

66
3.1.5 Visualization of the parameters trajectory
Let draw the intermediate solutions for β (t) at each update from each epoch in order to further
understand what the algorithms do. This supposes to have run "f_minimizeloss_loop_torch" with
"n_epochs=20". There are three different implementations for the training algorithm, hence three
curves for three trajectories appear on the graphic below.

Figure 3.2: Loss and intermediate values of beta during convergence

The graphs just above shows the cost function as a function of its parameters, as follows:

β → f (β ) = MSE(y, Xβ )

This allows to show the shape of the loss which is convex as a 3d parabola, but also the interme-
diate solutions from each epoch until the convergence. The surface is quadratic as expected from
the expression of the loss, hence the algorithms need to do a descent along the surface until the
minimum. The better algorithm with the better settings would be the one with fastest descent. The
initial positions are different for the algorithms hence a fair comparison is not in stake. If the three
trajectories were shown per epoch instead of per update, they would be more smooth and more
identical considering the fast convergence near the solution.
It is observed that the algorithms converge towards almost the same solution near the exact one
from linear algebra. This is without inverting a matrix, only with the gradient of the loss function
which is computed only partially with only one data vector or a few ones at each iteration.
The python function for this graphic is as below.

��

67
��
��
��

��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��

Next, these python functions are further illustrated with an example for classiﬁcation.

3.2 Logistic regression

Recall in the second chapter, the small example of classification model is defined with the coeffi-
cients β0 = −0.5 and β1 = 3.5 and β1 = −2.5 such as,
E(yi |xi ) = σ (β0 + β1 xi1 + β2 xi2 ) .

Here σ is usually the sigmoidal function, otherwise the regression is called with another name.
The set of observations or data sample is s = (xi , yi )i∈{1,...,n} , with yi binary. We want to estimate
the coefﬁcients with a vector β̂ . The loss function called cross-entropy loss is minimized such that,

68
β̂ = argmin �(β ) .
β

We are direcly interested with the minibatches, thus, with a number of B minibatches, the loss
function is now written as B sums from subsets of s such that,

�(β ) = ∑Bi=b �b (β
�) � ��
B
= ∑b=1 − ∑i∈sb yi (β T xi ) − log (1 + exp (β T xi )) .

The new update formula for β (t) is thus written not from a single data vector but the whole subset
in the minibatch, roughly n/B, and from the formula of the previous chapter just for the subset,
 � � 
∂
∂ � �
∂β �b (β )
�  = −XbT (yb − pb ) .
�b (β ) =  0 �
∂β ∂
∂ β Cb (β ) 1

The algorithm stops at the ﬁnal step t = T . The new update formula at each iteration is thus as for
the linear regression but with a different gradient function because the loss function has changed.
This is a ﬁrst-order algorithm,
� � � �
(t+1)
β0 β0
(t)
∂ � ��
= − αt �b (β ) �� .
(t+1)
β1
(t)
β1 ∂β β (t)

This translates in python as follow below, in a vectorized way because vectors are manipulated
instead of scalar components.

3.2.1 Dataset with bivariate classes (continued)

Let ﬁrst load the dataset.
��
��
��

� � ��
� � ��
� � ��

��

69
��

3.2.2 Loss minimization with pytorch

The implementation for minimizing the cost function is thus as follows.

� � ��
��

��

�→��

��

� ��

�→��

The training is implemented with pytorch, from the previous function, by just changing the loss
and the parameters for the inference.

��
��
��

��

��
��
��
� � ��
��

��

��
��
��

70
��

��

��
��

��
��
��

��
��
� ��
� ��

��

��
��

��
� ��
� ��

� ��

�→��

� ��

3.2.3 Comparison of the solutions

Let compare the different losses (minus the loglikelihood) from the algorithm above, plus also the
classiﬁcation error. It is remembered that the classiﬁcation error or error rate is the quantity to be
minimized, but as a non continuous quantity, the optimization is for a smooth function instead and
with a similar behaviour than the wanted minimization of the error.

Loss and loglikelihood

��
��

The losses from the initialization and the solution with pytorch are compared.

71
� ��
�→��

� ��

��
�→��

��

��
�→��

��

Accuracy

About the accurracy, one gets:

��

��
��
� ��

��
� ��
��

��
��
The solution with pytorch is slighly performing less well for the loss and the accuracy in compari-
son to sklearn for this dataset and the chosen hyperparameters (see exercices). Note that for a large
dataset, the design matrix (the xi and also the yi ) are stored on the computer disk, thus a loop is
required in order to retrieve the rows of the matrix (with minibatches), and compute incrementally
the predicted target in order to compare with the true one and compute the accuracy, as explained
in the next chapter.
Just below, the frontier is visualized in a graphical representation.

Frontiers

72
Let plot again the frontier corresponding to the cut from the probability that a data vector belongs
to one class or another, with the threshold 0.5 as in the previous chapter but here with several
frontiers (from initial and intermediate values with pytorch plus true value).

� ��
��
��
��

� ��

��
��
��
��

� ��

��
��
��
��
��
��
��

Figure 3.3: Intermediate frontiers for the two classes during convergence

The solution is slightly different to the one from the the module of machine learning but the frontier
looks almost the same with nearly a same error rate. Another difference is that we have a custom

73
algorithm and will be able to update it as most as required on the contrary to existing modules
which are often cumbersome to update by the end user.

3.3 Exercices
1. (sklearn) Retrieve the regression coefficients with sklearn. There is a penalty term added to
the loglikelihood by defaut, C1 (β12 + β22 ) with C=10, which is removed here with the option
penalty=‘none’, but this term should be kept and tuned for better results with new data (see
a next chapter about).
2. (sklearn) Test the algorithms with real data as the exercice in previous chapter, for the same
datasets, and compare the results with sklearn.
3. (python) Test other values of the learning rate in the presented set of candidate values, and
around the best one in order to improve the estimation for β .
4. (python) Test for also other sizes of minibatches and change accordingly the learning rate,
how is the change.
5. (python) Test other settings for the learning rate as a decreasing function of the epoch instead
of a constant.
6. (python) Draw the trajectory of the training for beta in the logistic regression, with two
graphics, one for the coefficients β0 and β1 , one for the coefficients β0 and β2 . After a
vectical disposition of the two graphics, propose a way to get both graphics side by side, one
to the left and one with an horizontal disposition. It is possible to improve these graphics, by
showing the mse and the real beta from the true distribution instead that an estimated one.
7. (stat) How to get a nonlinear frontier from the usual logistic classification instead of the
linear one, in any available python implementation.
8. (stat+python) Propose a better strategy for choosing the minibatches, or for initializing the
weights for the simple linear regression, and check graphically by comparing the learning
curves.
9. (python) Rewrite the python code by selecting random batches (a. one random single unit
from the sample at each epoch instead of a cycle or, b. again n/B subsamples after shuffling
the index of the rows in the data table at the beginning of each epoch) instead of the constant
non stochastic subsamples, and compare the two learning curves.
10. (stat+python) Test the algorithms in "variance reduction for stochastic gradient optimiza-
tion", "SAGA" and also the "multi-point bandit feedback" approach (among other) by im-
plementing the algorithms with numpy or pytorch. Is the inference improved, how and why.
11. (stat) Justify analytically why the gradient descent converges to the optimum for a convex
function.

74
Chapter 4

Neural networks for (deep) glm

In the three first chapters it was recalled the fondations for the linear regression and the linear
classification, then a first approach was introduced in order to perform the parameters learning with
pytorch. In this chapter, pytorch is now able to model and train 1 neural networks via implemented
python classes. The target variable yi is the output of the network while the vectors xi feeding
the network are the input variables. The unknown nonlinear transformation which takes xi and
provides yi must be captured by the neural network during the training of its weights in order to
offer an accurate prediction for current and also future inputs.

4.1 From linear models to neural networks

A neural network in a simplest definition can be seen as taking an input vector xi and providing
an output yi via nested weighted nonlinear functions. For more generality, we will denote the
output as a vector even if a scalar is also possible. A neural network uses weights denoted W or
w which needs to be found during what is called the training of the neural network. Like fitting
a regression or classification model for computing an estimate of β , training the neural network
leads to compute an estimate of W from a sample. In the case of the linear regression and linear
classification, W = β , but of course neural networks are more general. This is written as follows:

yi = f (xi )

Hence, the true and unknown function f (.) is approximated by a neural network:

yi = g(xi |W )

We train the model in order to get Ŵ such that,

1 See also "https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html".

75
fˆ(xi ) = g(xi |Ŵ )
≈ f (xi ) .

Thus, the fonction f (.) may be very general and anyway, it was shown in the literature that any
functional relationship between xi and yi can be learnt by some neural network, presented herein,
called MLP which are deep neural network with what are called hidden layers. The ﬁrst layer is for
the input variable, the last layer is for the output variables, hence they are not hidden as the values
are read in the sample, while, between both, latent virtual observations in higher or smaller spaces
than the input, are called the hidden layers. The components are the nodes and they are linked with
weights, in the idea of the brain for signal propagation, from the input towards the output.
For the linear cases with no hidden layers, this is depicted just as,

1
x1
.. Σ ŷ = σ (w0 + ∑ j w j x j )
.
xp

Figure 4.1: Simple neural network for a linear model.

For the linear regression and the linear classification we have two layers, the input for xi and the
output for the yi , while the weights are defined from the edges between the nodes of the input to
the nodes of the output. When some layers are defined between the input and the ouput, the neural
network becomes a deep neural network able to model a nonlinear function.
The loss function compares yi and ŷi for training purpose.

4.1.1 Loss functions from glm

The output ŷi is as follows for yi continuous, binary or categorical:
• Linear regression
ŷi = w0 + w1 xi1 + w1 xi2 + · · · + w p xip = identity(wT xi ) .
• Logistic regression,
exp(w0 + w1 xi1 + w1 xi2 + · · · + w p xip )
ŷi = = sigmoid(wT xi ) .
1 + exp(w0 + w1 xi1 + w1 xi2 + · · · + w p xip )
• Softmax regression,  
ŷi1
 ŷi2 
 
ŷi =  ..  = so f tmax(wT xi ) .
 . 
ŷiK

76
exp (wk0 + wk1 xi1 + wk2 xi2 + · · · + wkp xip )
ŷik = .
1 + ∑K−1
l=1 exp (wl0 + wl1 xi1 + wl2 xi2 + · · · + wl p xip )
Here, for the third case, w is a matrix of size (K × 1 + p) in order to obtain K real values.
Thus, ŷi = E(yi ), with:
• Linear regression

ŷi = wT xi .

• Logistic regression,

exp(wT xi )
ŷi = = p(yi = 1) .
1 + exp(wT xi )
• Softmax regression,

exp (wTk xi )
ŷik = = p(yik = 1|xi , βk ) .
1 + ∑K−1 T
l=1 exp (wl xi )

A simpliﬁed way to understand how the network performs is that the outputs are crude estimations
of the dependent variable in the three cases. An objective function is computed between the real
observations yi and the prediction from the network ŷi . Such that:

�(W ) = ∑ LOSS(yi , NN(xi )) .

Where

NN(xi ) = "the output from the neural netwok"

= ŷi .

4.1.2 Post-processing of the output

We have seen that for the classiﬁcation problem, the output of the network are probabilities which
are approximations of expectations in a probabilistic model, hence, the true ŷi is obtained after
comparing the expectations with a threshold in order to decide which class the input vector belongs.
Another interpretation is that the network aims at approaching the binary values of yi or yik with
smooth values inside the interval [0; 1] in order to construct the approximation of the true function
f (.).
It must be noticed that in the python code here, yhat has not the same meaning in the training loop
and outside, after for processing the results, thus, this is to understand as follows (with ← as an
affectation to the value of a variable, here yhat, in the computer meaning). After training, the actual
predictions are obtained:
• Linear regression (one output in ] − ∞; ∞[ from nn, keep the same after the loop) with pre-
dicted output ŷi in the real line.

77
ŷi ← NN(xi ) ∈ R .
• Logistic regression (one output in ]0; 1[ from nn compared to s a threshold) with predicted
output ŷik ∈ {0, 1} after "cuting",

ŷi ← NN(x
� i ) ∈]0; 1[
1 if sigmoid(ŷi ) >s .
ŷi ←
0 if sigmoid(ŷi ) ≤s

• Softmax regression (several output in ]0; 1[ compared between them to choose the predicted
ŷi in {1, 2, · · · , K}) after "argmax",
 
ŷi1
 ŷi2 
 
ŷi =  ..  ← NN(xi ) ∈]0; 1[K and ∑k ŷik = 1
 . 
ŷiK
ŷi ← argmaxk ŷik .

Here, argmax ﬁnds the index of the largest probability among the K ones found from the
softmax transformation.
If this is not clear to the reader, an external explaination on glm may be required. Also one may
consult the python code coming nextafter which proposes a complement of information before
coming back to this paragraph if required.
Note also that in some implementations, for classiﬁcation, the outputs ŷi or ŷik are in R, hence the
function sigmoid or softmax may be required eventually. This is because the loss may treat directly
continous values if these functions are already included, see the table next subsection.
The postprocessing is implemented in "f_test_glmr()" while the training is implemented in
"f_train_glmr()". Next, an implementation via pytorch is presented before going deeper with hid-
den layers just after.

4.1.3 Loss functions in pytorch

A way to maximize the likelihood in pytorch is to minimize a loss as seen before for three cases
of distribution, Bernoulli, Multinomial and Gaussian. In the general case, one has to write its own
loss function, or just consider another one existing in the python module.
In pytorch, the loss measures the difference between the true target variable yi and the one found
by the network ŷi , by extending the usual Euclidean distance to another measure of comparison
more relevant for robust comparison or binary/multi-category/integer comparisons.
Some are extracted from the documentation, "https://pytorch.org/docs/stable/nn.html", at the Loss
section. There are equivalent deﬁnitions in "torch.nn.functional" too.

78
Here, the optional argument for the loss "reduction" is put at ’sum’, and only a minibatch is in stake,
with size B, not the full sample, while ŷi = ŷi (xi ) is the output from the network after feeding it
with the minibatch of independent variables xi , such that:
Loss Distribution(*) Expression
MSELoss Gaussian ∑Bi=1 (y
� i − ŷi )
2
�
B (yi −ŷi )2
GaussianNLLLoss Gaussian ∑i=1 log(max(var, eps)) + max(var,eps) + const
L1Loss Laplace ∑Bi=1 |yi − ŷi |
BCELoss Bernoulli ∑Bi=1 −wi [yi log ŷi + (1 − yi ) log(1 − ŷi )]
BCEWithLogitsLoss Logistic ∑Bi=1 −wi [yi log σ (ŷi ) + (1 − yi ) log(1 − σ (ŷi ))]
exp(ŷik )
CrossEntropyLoss Multinomial ∑Bi=1 −wk ∑K k=1 yik log( K exp(ŷ ) )
∑l il
PoissonNLLLoss Poisson ∑Bi=1 exp(ŷi ) − yi × ŷi + 0.5 log(2πyi )
KLDivLoss KL divergence ∑Bi=1 yi (log ŷi − log ŷi )
Here KL is for Kullback-Leibler, and for the Poisson loss, the options are log_input at True (as by
default) and full at True (instead of False by default). Note that this is the negative loglikelihood as
for the Gaussian distribution and the Bernoulli one, but some losses do not have a correspondence
with a loglikelihood.
Some loss functions accept weights for the observations which are equal to one in our case herein.
Some loss functions are for robust Euclidean one, with HuberLoss and SmoothL1Loss. Some
other losses are HingeEmbeddingLoss, SoftMarginLoss, MultiLabelMarginLoss, MultiMargin-
Loss which are variants for classification, and described in the documentation. The loss to be
chosen may be the one with better results.
In the previous chapter, it has been explained how to define a python function for such loss, with
two arguments, one for the vector of true target yi and one for the vector of estimated target ŷi ,
thus the same idea is required here for other losses. It is possible to consider losses not defined in
pytorch such as for a variant of the Poisson distribution for counts.
In this chapter, we are interested on implementing the neural network for regression or classifica-
tion with the class nn from pytorch. At the same time, we will use the existing loss function:
• For the linear regression, this is the mean square error, with the function
"torch.nn.MSELoss()" which computes the usual sum of squares.
• For the logistic regression, this is the cross-entropy with the function
"torch.nn.BCEWithLogitsLoss". According to the documentation, this loss combines
BCELoss with a sigmoid layer, this is "more numerically stable" than writing its own loss
by combining a sigmoid at the output of the network with a BCELoss just after: the final
nonlinear transformation is already included in the loss.
• For the softmax regressions, this is the cross-entropy with the function
"torch.nn.CrossEntropyLoss()". As for the two classes problem, this loss is more sta-
ble by taking as an input just the non normalized output of the network and adding the
softmax transformation.
For the BCEWithLogitsLoss, a sigmoidal layer is added automatically, thus there is no need to

79
add it in the neural network. Similarly for the CrossEntropyLoss, the softmax layer is included in
the loss. There is no need to redefine them except for better understanding or creating new losses
eventually.
The new python code uses a main class of torch, "torch.nn" which allows to define a function on
the weights. For the logistic regression, the sigmoid is already defined at the level of the loss for
stability reasons, hence, in both case, regression and classification, we just have to define for the
linear models, a linear function for the weights!
��
��
��
�
As explained before, we have a layer for the input and one for the output, hence with the bias, there
are p nodes at the input, and:
• one node for the output for the linear and logistic regressions,
nb_nodes_input = p
nb_nodes_output = 1
• more than one for the softmax regression, with K>2,
nb_nodes_input = p
nb_nodes_output = K
The bias (intercept) is automatically added (otherwise, we would add one to each independent
variable and add a node to the input layer). Note that in the binary case, we just need one node and
not two because we can decide of the class with only a threshold from one output.
The parameters are retrieved from the python object model by calling the function
"model.parameters()", this object is required by the optimizer.
This leads to finally, with the weights defined implicitely from pytorch, to the call to the function
"model.train()" in order to put in "train mode" the neural network with gradient computations.
From now we adopt an approch more generic with python classes and function if possible, in order
to avoid to repeat similar python code. This is justified because the fondations have been studied
in the previous chapters and until here. This implies that most of the time we just need the next
python code for proceeding with regression or classification, with the only change at the level of
the layers. We introduce also a class called "Monitor" which will be useful later in the chapter for
processing the indicators from the test sample.

4.2 Dataset with nonlinear frontiers

As a baseline, a linear logistic regression is fitted with a new data sample despite that the problem
of classification asks for a more complex frontier between the classes. A deep binary regression
model is trained in order to improve the classification just after fitting this first model.

80
The work directory is given from the function.

��
��

��

��
��

��

Let load the data ﬁle.

��

��
� � ��
� � ��
� � ��
� � ��

Let randomly shuffle the rows for avoiding any row structure before cycling the minibatches. Note
that for some files, the classes are not randomly found in the file, they are organized separately,
but for the training algorithm, it seems better to have some diversity between the elements of the
minibatches. The shuffling may be performed also with the dataloader eventually.

��
��
��
��
��

��

The vector y is kept ﬂat for python, this reduces any risk of automatic "broadcasting" non wanted
for algebra with numpy or sklearn for instance.

� � ��

��

��
� ��

81
��
��
��
��
��
��
��
��
��
��
��
��

Figure 4.2: Data sample for two classes with non linear frontier

This is the visualization of the artﬁcial dataset in stake with the true frontier between the two
classes, a circle for this example. The training of a neural network aims at guessing this unknown
frontier from the available data. An evaluation of the quality of the estimated frontier is available
with the error rate.

4.3 Train, test and validation subsamples

For prediction purpose, it becomes required to check how would behave the estimated model with
new data. A new data sample is a new set of couples (yi , xi ) different from the given data sample
used for the training of the parameters. It is usually observed that checking the loss on the train-
ing data is optimistic because the model may overﬁt for this sample. This means that the model

82
becomes ideally good for the dataset by learning any of its particularities without being enough
general. The problem comes from the fact that a (not enough large) sample is never a true image
of the whole population, but only a view of it, with defaults and bias. Training a neural network
with these mistaken informations is source of error, and this is even worse when the number of
parameters in the model increases as it becomes even more ﬂexible and better able to learn too
much all the details and particularities of the train sample.
A real life example is when someone learn to how to make the differences between cars brands
but with all their defaults (sunken body, dirt, rust, tires ribs, repaired with parts not from origin for
older ones): the human may not be able to recognize new vehicules because of these informations
2 . Another example is for anyone who wants to learn a new activity and is not good at it at the

beginning of the learning, hence there is needed more training. To train the biological neural
network from the brain, the human has to capture enough data and during an enough amount of
time. An approach to check if there is a good understanding is exercices with new contents: they
are the new data here where it becomes possible to validate the knowledge for the activity. Without
new challenges, this is not sure if the activity is known as expected.
A way to check the quality is to validate the model with a new dataset. This is the "validation
set", while the "test set" is used during the training for tuning. The idea of the test set is to have
an independent sample different to the training one and to change the (hyper)parameters update in
consequence. For a real example, the test set is the exercices which does not count for the final
mark and are given during the course for the activity, while the validation set is the exercices from
the exam with a mark.
This means that we need to separate the train sample into two parts, one for the training again, and
the other one for the test. For enough large samples, a validation sample may be also obtained at
this stage. Note that other more elaborated methods exist for insuring a good model, or even divid-
ing the sample (see next chapters). This approach is the more usual way to do before implementing
eventually more advanced ones.
In the case when two datasets are already available, we just need to declare two different objects
from the class DataLoader, dataloader_train and dataloader_test, from whatever is the place of the
data, in memory, on the computer disk or even on a remote server.
The subsamples are drawn randomly without replacement: the test set is found first via sampling
because smaller and then substracted from the whole sample to get the train set. This is equivalent
to a function in sklearn with more available options, but written in two rows here. Note this is "with
suffle" hence the former order of the indexes is not kept. The indexes are the working backbone
of processing with a large dataset because this is not possible to replicate or load the full dataset
in the computer memory. The problem of some python modules is to not work with indexes but
directly with the whole dataset loaded in computer memory, which is not always possible.

� ��

��

2 Note that this can also been seen as additional variables which are useless and prone to induce errors even on the
train sample, this is discussed in another chapter.

83
��
��
��

��
��
��
��

��

� ��
� ��

��

��
��

��
��
As explained in ﬁrst chapter, we need two samples seen by pytorch through a class "Dataset", one
for the training and one for the testing in order to check how behaves the model with new data. We
create our two (or three) subsamples, for "train" and "test" (plus eventually "validation" if the test
sample is used to tune the training and the sample size is enough large). Then, the iterators with
the class "DataLoader", with eventually some shufﬂing,

��

��
� � ��
��
��
��
��

��

��
��
��
��

��

84
��

��

4.4 Training traditional linear models

In this section, we propose to consider functionalities of pytorch which bring a more advanced
training for the parameters of the model. The training loop needs the following python objects:
• a python instance named "model" implementing a pytorch class,
• a pytorch function named "optimizer()" for the gradient step with a learning rate,
• a pytorch function named "loss()" for the objective function to minimize.
We require the module torch.

��

4.4.1 Neural network deﬁnition and weights

The neural networks are defined with the class "torch.nn.Module", see "torch.nn.Linear" more
specifically. In a generalized model, the linear part is a weighted sum of the independent variables
with weights equal to the regression coefficients. This linear transformation is embedded in the
class just below which inherits from "torch.nn.Module" in order to define a neural network (nn).
Here the net is with fully connected nodes from each xi j for the first layer as the entry of the nn
to one unique node (or eventually several nodes for multinomial regression) for the output or last
layer. This is written ∑ j xi j w jk for k varying with the index of the component of the target variable.
The sigmoid or softmax are included in the loss function such that, only the part before the last
transformation coming from the glm is required in this class.

��
��
��
��
��

��

The three neural networks for linear regression, logistic regression and multinomial regression
have same architecture, as only the loss function is changed because the link function is included
there for more stability. This is similar to glm with a generic deﬁnition of the model except that
the distribution changes, hence the loss changes too as minus the loglikelihood.
From an instance of the class above, it is possible to extract the weights of the corresponding neural
network in order to retrieve the values of the parameters before, during or after training. In the case

85
of only two layers (input and output), without hidden layers, this is as follows for instance.

The names for the key may change from one modeling to another, thus one must be cautious with
this naming. An alternative is to ask for the weight and the bias separately but also only the weights
which enter the derivative if some links are chosen constant (see the documentation of pytorch).

��

The weights of the neural network, here the initial values, are combined as a vector of regression
coefﬁcients with the following function. For this case, the naming is used but no naming may be
better for more generality. This is written:

� ��
� ��
� ��
� ��
� ��
� ��
� ��
� ��

��
��
��
��
��
��
��
��

��
��

� ��

Several methods exist for initializing the neural network, as a bad initialization may likely induce
a bad convergence, this step must not be forgotten for best results and to avoid to loose time on
tuning a model with a bad initialization. In this part, the standard and by default way from pytorch
is implemented as it is suitable.

86
4.4.2 Preparation for the optimization
In this subsection, the dataloader is not only needed, other instances of python classes (from py-
torch) are required for the optimization procedure.

Early stopping

An eventual way to improve the training is to compute the accuracy at each epoch, and stop when
for the test sample, it begins to drop (or the loss also begins to raise) . Because there is a moment
during the training process that the model is well ﬁtted before it begins to learn too much particular
details from the train sample.
Stopping at this moment is the decision that we may want called "early stopping". This is the loss
which is generally preferred here for checking, the accuracy is less smooth. This is not the only
way to monitor the quality of generalization of the model. This explains the proposed class called
"MyMonitor" which is able to give a signal to the training loop in order to break its cycle via a
stopping rule at the end of each epoch, just below.

Monitoring during the training of the nn

In order to monitor the convergence and compute indicators, ﬁrst we create a python class which is
involved in a passive way during the training. This is optional and our choice for the implementa-
tion, alternative ways to do could ask for more advanced python code, because we deﬁne just one
unique generic python function after that for training the models.

��
��
�→��

��
��
��
��
��
��

��
��
��
��
��
��
��
��
��

87
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��

Here, this version ﬁnds the value of the loss for the test sample, this computation is asked inside
the training loop at the end of each epoch.

List of the required python objects

Let create all the objects which are involved during the training of a neural network:

• the "dataset" for accessing to the sample from the computer memory or computer disk
• the "dataloader" for cycling the minibatches through the dataset object
• the "model" of neural network itself which contains the unknown weights to ﬁnd
• the "optimizer" for updating the weights with the values of the gradients while eventually
updating also the learning rate for an optimal minimization
• the "monitor" for computing the indicators from the test sample and check the convergence
for the stopping rule
• the "loss" function to minimize.

We also deﬁne the "number of epochs" and the "learning rate" (constant or initial, and eventually
a function for its decrease during training). This learning rate remains non automatic and must
be choosen by the human most of the time, except for some optimizers available for pytorch and
sometimes performing well, not considered here (see the chapter on "autoencoder").

88
��
��
��
��
� ��
� ��

��
��
��
��

��
��
��
��
��

This optimizer is equivalent to the one used with numpy in the chapter just before, a constant
learning rate. The argument "momentum" is able to improve and speed-up the training when well
chosen.

4.4.3 Parameters training

The training is still a loop with the minibatches, except that for training the neural network with
the loss function, a "backpropagation" is implemented within pytorch which allows faster com-
putations for the gradient when there are hidden layers, this is the automatic computation of the
gradient, as seen in the previous chaper.
When deﬁning the function for the main loop, the target variable from the sample yb or the neural
network ŷb for a current minibatch, may be transformed in order to get the same shape of array for
both.
��
��
��
��
��
��
��
��
��
��
��
��

89
��

��
��
��
��
��
��

For the regressions where the output is a simple scalar, the target variable from minibatch is a
simple vector with one unique dimension. When the output is a vector such as in the softmax
regression, some losses ask for a binarized version of the categorical values, the integers between
1 and K. Ideally, this could be treated before at the level of the dataset or dataloader, but if not,
some transformation is required here.
The main loop is deﬁned as follows, it required the python objects deﬁned above.

��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
� ��
��
��
� ��
��
��
��
��
��
��
��
��
��

90
��
��
��
� ��
��
�
��
��
��
� ��
��
�
��
��
��
� ��
��
��
��
��
��
� ��
� ��
��
��
� ��
� ��
� ��
��
� ��
��
��
��
� ��
� ��
��

��
��

This function is generic in order to avoid copying the same code into several functions, thus there
must be some time to read it and isolate the important parts:

• The "loop" is inﬁnite with a growing number of epochs, but stops whenever the maximum
number of epochs is reached or some stopping rule was deﬁned and became true in the

91
function monitor.stop() .
• Three "transformations" are defined for yb, Xb and yhatb, for instance insuring that yb and
yhatb have same shape for the loss. The transformation for Xb may be a reduction such as a
principal compnent analysis or a random projection for instance. But it may be kept in mind
that any transformation is costly during the training, and for yb or Xb this could be included
in the dataloader in order to speed up the training, while ideally pre-computed and stored in
computer disk for loading with the dataset, this avoids any costly transformation during the
running time.
• The eventual function "update_model()" for updating the weights from the model differently
than from the optimizer, this is a short-cut here.
• The "device" is for computing the gradient and for updating the weights within the cpu or
the gpu. The data and the neural network must be put in the same device, otherwise the
computation is not possible with the usual current computer architectures, and this would
actually decrease the interest for having a fast gpu.
• There is three levels of "printing" the loss during the training, more or less informative: no
loss, first and last losses or all losses during training.
• The "monitor" allows to compute the loss for the test sample while the main function just
above computes the loss for the train sample. The monitor contains already the variable for
keeping the values of the loss during the training. The losses for the test and the train samples
are available in the object monitor at the end of the training for graphics and checking the
convergence.
The same idea of the algorithm than from the previous chapter is recognized, except that the gra-
dient is always automatic, the gpu is available natively for this python module and a few options
are added here for later use. If required, a way to better understand this function is to rewrite it
in a simpler way without all the options. This is left as an exercice to the reader: practicing the
language is the best way to master the language. After a simplified version, the optional features
can be added one by one, in a different way eventually. This proposed example of implementation
looks enough flexible for the purpose of almost the whole document.

Training

We have already created an instance from the class of neural network, one for the optimization,
one for the monitor. We have deﬁned the loss function, the number of epochs and the learning rate.
Now we train the model with the function deﬁned above "f_train_glmr()" which returns the loss,
the number of iterations and the state of the object monitor. This leads to:

��

��
��
��
��

92
��
��
��
��

��

��
��
��
��
��
��
��
��
��

��

4.4.4 Post-processing
Loss curves after training the nn for test and train samples

We have considered the test step which is actually more a validation step to be more precise even if
the literature can missname them sometimes. In the class MyMonitorTest, a loop cycles the test (or
validation) sample without updating the weights of the neural network which were found during
the training step.
The output is shown.

��
��

��

�→ ��
��

��
��
� ��
��

��

93
��
��
� ��
��
��

Figure 4.3: Loss function for train and set samples, per epochs

Comparing the two curves allows to decide if the neural network suffers from a generalization
problem, with overﬁtting or underﬁtting, but also if the learning rate is (at least approximatively)
accurate. Typically, the training loss is lower than the test loss because the model was learnt on
the train sample such that it better adapts to this sample. Ideally the test loss curve should be no
too far at least at the end of the procedure. This is not the case here because a linear model is not
enough for nonlinear frontiers between the classes.

Accuracy with pytorch

The neural network has one scalar output for the linear regression and logistic regression while a
vector of K outputs for the softmax/multinomial regression. These outputs are obtained from the
dataset by feeding the neural network with the independent variables xi and then processing into
probabilities for the classification in order to compute the estimated label ŷi with a threshold. These
final values allow to find the accuracy by comparing with the true values if they exist, otherwise,
this is our prediction for the classes when the information is unknown.
The accuracy is computed in this part after finding the labels, it is also discussed an accuracy
per class and visualized the frontiers because for this dataset a 2d representation is available. To
compute the quality of the regression or classification, we need the estimated labels or predicted

94
target variables which are obtained in the function below by looping the dataset with minibatches
or chunks.

��
��
��
� � ��
��
��
��
� ��
��
��
��
��
�→��

��
�→��

��
��
��
�→��

��
��
��
��
��
��
� � ��
��
��
� � ��
��

��

��
��
��

��

��
��

95
��

��
��

��

Accuracy per class instead of the usual aggregated one

The aggregated indicator is not so much informative as the conditional one per class, but generally,
it is preferred because this is the overall error which is of first interest. For medical questions for
instance or asymetric costs related to the classification, the error should be computed w.r.t. each
class for a better understanding of the performance of the classifier. An error has not the same
consequence if this is a "FP" or a "FN", and also unbalanced classes induce an unfair accuracy for
which the larger class biases the rate.

��

��
�→��

��
��

��
��
The errors are not symmetrical for this dataset and for the implementation chosen for the train-
ing. The two accuracies are not both near 0.5, the half or the expected value which is the worse
prediction by chance (one out two). This result teaches several points:
• This underlines that looking at the accuracy is probably not enough for a deeper evaluation
of the model, and other indicators are to be checked anyway. The error from one class may
be preferred to be minimized for instance, but the threshold may be also chosen differently
for a different ﬁnal error.
• This underlines that a linear model is not always enough and a nonlinear model is sometimes
mandatory in classiﬁcation (and regression). The usual statistical approach for explaining
with the model is limited for usual machine learning question suchs as prediction.
• The usual way in statistics is to add some interaction and power terms (two most often)
which is related to polynomial regression, but the frontier here is not just a simple nonlinear

96
version of the line (or hyperplan) such that a polynomial transformation may be not always
enough.

Visualization of the frontier between the classes

Let check the solution graphically in order to retrieve the conclusion about the linear model for this
dataset. The frontier is checked visually by computing the predicted label ŷi for the whole plane
with discretization, such that we get:

��
��
��
��

��
��
��
��
��
��

��
��
��
��
��
��
��

��

Thus,

� ��
��
��

��

��
��
��

97
Figure 4.4: Linear frontier from neural network without hidden layers

The resulting frontier is linear as expected, hence the model is misktaken. The frontier becomes
nonlinear by introducing in the neural networks hidden layers and nonlinear activation functions.

Such neural network is implemented next in order to solve this issue. This allows to extend glm,
otherwise there exists other methods for nonlinear classiﬁcation with trees and ensembles for in-
stance but out of the scope herein.

4.5 Training deep nonlinear models

In this section, after discovering functionalities of pytorch for a more advanced training of linear
models, additional latent layers and nonlinear functions are introduced in the python class for the
deﬁnition of neural networks. The new resulting model is trained with the same loop embedded in
our python function from the section just before and the accuracies are compared.

4.5.1 Hidden layers in deep models

The deep neural model is deﬁned by adding the "activation functions", which are nonlinear func-
tions to apply to the output of a node, and by adding "hidden layers" between the input layer and
the output layer. The hidden layers are like latent observations, but they result of a transformation
from the layer just before in the neural network.

The output is obtained by a sequential computation from the input layer to the ﬁrst hidden layer,
then from the ﬁrst hidden layer to the second hidden layer, and so on until the last layer. Hence,
each transformation leads to the next layer, until the last layer for the output.

98
Formal deﬁnition of the model

Recall that the input layer is for the features or independent variables xi j plus the bias or intercept 1,
with p + 1 nodes, while the output layer is for the target or dependent variable yi with 1 or K nodes.
This induces to deﬁned intermediate functions g� () such that when counting the last transformation
for the ﬁnal layer L + 1, we get:

f (xi ) = gL+1 (gL (...g� (...g2 (g1 (xi )))

Thus each function tranforms the previous current transformation by applying p� (= the size of the
current layer) times the same function for the hidden layer �, hence to each of its p� nodes. This
is associated to a linear mapping with a linear matrix or weights, such that each transformation
(function to each node+mapping to next layer) is summarized as a multidimensional transformation
from a space of p�−1 dimensions to a space of p� dimensions, when denoting p0 for the ﬁrst layer:

g� : u → g� (u)

with,
u = (u1 , · · · , u p�−1 )T
g� (u) = (g�,1 (u), · · · , g�,p� (u)) .

There are several possible functions implemented in deep learning. Three have been already seen,
thus they are for instance, - linear for linear regression - sigmoid for binary linear classification -
softmax for multicategorical classification. The hidden layers are able to transform nonlinearly the
vectors xi , before the regression or classification which happens at the last layer.

Example of hidden layers with pytorch

The linear transformation for hidden layers remains exactly the same than before. Thus, let have a
look at the type of activation functions which can be considered in the class we define next. They
are from the list available from "torch.nn" such that, the functions "Sigmoid", "Softmax", "ReLU",
"CELU", "LeakyReLU", "PReLU", "Tanh", have each a different shape, while "Dropout" and
"AlphaDropout" select a percentage of nodes randomly.
Implementation from module "torch.nn" # Parameters
Sigmoid() 0
Softmax(dim: Union[int, NoneType] = None) 0 (always dim=1)
ReLU(inplace: bool = False) 1
CELU(alpha: float = 1.0, inplace: bool = False) 2
LeakyReLU(negative_slope: float = 0.01, inplace: bool = False) 2
PReLU(init: float = 0.25) 1
Tanh() 0
Dropout(p: float = 0.5, inplace: bool = False) 2
AlphaDropout(p: float = 0.5, inplace: bool = False) 2

99
Many other layers or activation functions have been proposed until even recently. Some layers
are useful for regularization by dropping some weights i.e. canceling out a percentage of neurons.
Some layers are for normalization such as batch normalization and layer normalization in order
to standardize each minibatch which allows a better convergence in some cases. Such advanced
layers like for image classiﬁcation (i.e. neighboors layer called convolution layer) or clustering are
not considered herein neither, they may be tested as an exercice.

4.5.2 Deﬁnition and model training

In this subsection, deep models are illustrated by adding an hidden layer. We are interested on
improving the results from linear classication or linear regression when these models are not per-
forming enough well on a dataset. This is the case when the frontier is nonlinear, such as a circle
instead of a simple line.
There are several ways to deﬁne a network with pytorch, and the one above may be less convenient
because one needs to write by hand all the nested transformations coming from all the hidden
layers in the "forward" function. A more powerful notation uses a list of layers as an input variable
for the constructor of the class.

Generic python class with hidden layers and activation functions

According to the current implementation, we just need to change the class "GLMRegression" into
a more general class which is able to handle our new settings. Note that the layers and activation
functions can be added in several ways in the class with pytorch, one after the other in the forward
function or all at the same time with a function called "sequential()". For more generality we use
lists and leave as an exercice the definition of particular models, mostly required for advanced
variants of networks such as recurrent ones with time varying variables: they are out of the scope
and often defined without such list. Typically in sklearn for instance, for such network, one just
has just to define a list with the sizes of the hidden layers, and the list of the activation functions.
Pytorch is more flexible hence asks for more details thanks to the additional functionalities, if
required.
From here, it will be preferred a class constructor from a list of layers and functions as proposed in
pytorch. This avoids to rewrite the full class each time when the structure of the network changes.
Recall that the last activation function may be included in the loss function in some cases, such as
for sigmoid and softmax, but for instance the sigmoid may appear from an other (hidden) layer in
order to induce nonlinearities. The last activation function might be linear here.

��
��
��
��
��
��
��

100
��
��
��

The main interest of the parameterization of this class is a higher level access to the pytorch library
without dealing with python classes, but this is mostly for documentation of the layers that are in
used in this chapter. Otherwise, writing directly this class, and considering our already deﬁned
other classes or functions remains possible, at least for a different situation such that advanced
image processing.
Anyway, let check how behaves this new class added to our implementation of deep learning with
pytorch.

��
��
��

��
��
��
��

It has been added in the class, the object net which is required for the function forward.

��

��
��

��
��
��
��
�
Note that there is also available the class named "ModuleList" for dealing with a list of layers, this
is left to the reader as an exercice to access the documentation of pytorch in order to learn more
about.
��

��

��
Next we copy the object "layers", otherwise, each time the training is performed from this object,
it will update the same weights. The python variable "layers" is not just a description of the model:
it contains the weights. For large models, a shallow copy may be preferred, depending on the
computer memory.

101
Thus, from the module named "copy", we use "deepcopy()" for creating our object, and then run
the training: the resulting classiﬁcation should be better than before.

��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��

��

The model is saved for later.

��
��

Finally the metrics are computed.

��

��
�→��

��

The model is very accurate for this sample, but the frontier is not perfect according to the true one.
Some regularization may improve further the model but a model without error on the population is
difﬁcult to achieve with a small sample for the training.

The two losses have same shape and are almost equal, there is no "underﬁtting" or "overﬁtting",
the "generalization" looks relevant here. This is the opposite to the curves without nonlinearities
where dramatic and pathological differences are detected via a visual checking of the two losses.

102
��
��
��
��
� ��
��
��

Figure 4.5: Loss function for train and set samples, per epochs

Let check the frontier, the circle is expected to not be retrieved exactly because the separation
between the two classes has not this shape for these samples.

��

��
��
��
��
��

��

It appears that small samples can be optimistic even on the test sample, but this is not always the
case. Anyway, only one train sample and one test sample is not the best approach as explained
next chapter. This underlines also why more data is that one wants for training the neural network.
This would ﬁll the empty space here which induces the ellipsoidal shape for the frontier instead of
a perfect circle.

103
Figure 4.6: Nonlinear frontier from neural network wit hidden layer

There is even for such a small dataset, the evidence of overfit -for some initialization- because the
curve may grow after the decrease. Early stopping could be implemented here, or reducing the
number of nodes could be even required. Note also with the ReLU, the overfiting seems reduced
as this layer is defined for this purpose in the literature by canceling some nodes: a large part of the
activation function is null. The number of hidden nodes is chosen high for demonstration purpose,
thus may be reduced with an additional hidden layer.
This small example underlines a pitfall of the deep neural networks where a bad training can hurt
the results for new incoming data, because the model may learn too much particular details of
the empirical distribution without remaining enough general. In statistics, the idea is the occam
razor which advises to prefer simpler models to more complex ones, with the usual "bic criterion"
in regression, but also the selection of hyperparameters via "cross-validation" for more complex
models with an average decision and a sampling of new data.
Next chapter, the model is considered in a more general situation for more complex datasets.
Regularization and training will be discussed further.

4.6 Exercices
1. For the dataset of the chapter,
a) (pytorch) Change the learning rate and the batch size in order to accelerate the training,
while keeping a decreasing loss during the loop.
b) (pytorch) Chech the accuracy for a range of numbers of nodes by drawing a curve of the
accuracy versus their number, with and without the ReLU transformation. separately.
Is the ReLU transformation useful and how ?

104
c) (pytorch) Idem for other layers (see the documentation)
d) (pytorch) Do the same for the regression model with an hidden layer, check visually
the overfitting. What is observed when the number of nodes grows? Is it related to
polynomial regression, why.
e) (pytorch) Test the accuracy and the overfitting with two layers with several number of
nodes.
f) (pytorch) Test another approaches in order to avoid overfitting.
g) (sklearn) Fit the logistic regression with sklearn for the dataset in this chapter. How
this model compares to the neural network with pytorch.
2. (pytorch) Find a neural network for the classification of the following dataset with XOR
structure from the datafile "foursquares_xor.txt" with labels at last column, after vizualizing
the points with the classes.
3. (pytorch) Propose a neural network for the classification from these datasets generated from
a submodule for datasets of sklearn for the three first ones,
a) x, y = datasets.make_classification(n_samples=n,n_features=d, n_redundant=0,
n_informative=2,random_state=1, n_clusters_per_class=1)
b) x, y = datasets.make_gaussian_quantiles(n_samples=n, n_features=2, n_classes=g,
random_state=42)
c) x, y = datasets.make_moons(n,noise=.1)
d) x, y = a two dimensional dataset with three spirals.
The data samples are stored in the files "2gauss_xy_2d.txt", "gaussq_xy_2d.txt",
"2moon_xy_2d.txt", and "threespiral2d.npy" respectively.
4. (pytorch) Test linear and nonlinear regressions or classifications via neural networks and
pytorch with the datasets from the exercices in the previous chapter and compare with the
logistic regression. Some datasets are more difficult than other: next chapters will explain
how improve the choice for the hyperparameters (learning rate, number of neurons in the
hidden layers). The iris dataset is a good candidate here with a small size and three classes,
for a linear classification.
5. (pytorch) Test the code for the 20000 newgroups (see dataset from sklearn) with only four
classes for instance.
6. (pytorch) Test the code for the 60k digits handwritten images mnist with ten classes. (stat)
How to reduce the number of weights in the model for images classification.
7. (stat+pytorch) Propose a method in order to improve the training and reduce the classification
error for the logistic regression and its nonlinear version.
8. (nn) Propose a graphical representation for the architectures of the neural networks in the
chapter, including the exercices.

105
106
Chapter 5

Lasso selection for (deep) glm

In the previous chapters training linear and deep models with pytorch has been presented with the
loss, the related maximum likelihood and algorithms. For improving the models, the number of
parameters may be reduced for instance by penalization as explained below. When some variables
are useless or the model over-parameterized, it improves dramatically the accuracy or the mse for
the test sample as illustrated with an example. An hyperparameter is set for the strength of the
regularization, its value is found with a method of cross-validation in order to choose the value
among a set or an interval.

5.1 Brief recall on regularized regression

Instead of constructing new variables in a new reduced space, some parameters are just removed
from the equation because they are useless. More precisely, the approch considered here is called
"lasso", and the penalization is the "L1 distance", the sum of the absolute values of the weights.
This is very competitive in practice, the added penalty term to the loss is also even able to reduce
the values of the weights when they are not just cancel out with very small values for the less
relevant. Note that using the "L2 distance" instead leads to a "ridge model" with also a reduction
but not a removal of the weights, thus often less efficient. This is implemented by default in sklearn
for regression models as a first step against overfitting. Associating the two distances is called the
"elastic net model", which generalizes both with a linear combination of the lasso model and the
ridge model. This also means that one just needs to minimize a new loss function by adding a
simple term to the former one, which explains why so much applications of this idea has been
published by scientists. But it has been also invented analytical theories in order to explain and
train in a better way the resulting loss.

More formally, the approach is described as follows for the linear case, with different possi-
ble expressions for � and R with ŷi (β ) = β T xi , otherwise for a deep model with ŷi (β ,W ) =
β T gL (...g� (...g2 (g1 (xi )) for instance. Instead of directly minimizing the objective function or loss
�(), a penalization R() is added in order to avoid inﬂating values during the training of the compo-
nents of the parameters vector, see the documentation of "sklearn" for instance.

107
�λ (β ) = �(β ) + λ R(β )
1 n
= ∑ L(yi, ŷi(β )) + λ R(β ) .
n i=1

The first term is the usual loss function �(), they are defined for regression, the least-squares or the
robust least-squares (Huber) for instance. Otherwise there is also defined other ones for classifica-
tion, the logistic regression for instance.
The second term allows to decrease the overfitting from the regression and improve the general-
ization. For the regularization term R = R(β ), they are, with ρ in [0; 1] for the parameter of linear
combination.
the L2 norm (ridge) R2 = 1p ∑ j |β j |2
the L1 norm (lasso) R1 = 1p ∑ j |β j |
the Elastic Net Re = ρ2 R2 + (1 − ρ)R1
This is visualized below, the lasso penalty induces smaller values than the ridge one, for two
components β1 and β2 for which the solution lies on a circle or a square respectively, while the
solution for elastic net is a combination of both.

β2 β2

β1 β1
ridge lasso
Figure 5.1: Solutions for each penalty function alone with p = 2.

Note also a probabilistic interpretation of these penalizations is to add a prior on the parameters,
a Gaussian density for the ridge penalty, and a Laplace density for the lasso one, say π(β |σ 2 ) =
p λ λ |β |
∏ j=1 2√ v
exp(− 2√vj ), with v > 0, but this is out of the scope herein. Bayesian inference remains
possible with pytorch1 but ask for more advanced implementations. Note that kind of model is
often presented as able to ﬁnd a relevant value for the regularization parameters with only the train
sample, but involving a test sample may perform better actually.
Finally, this induces to add to the previous derivative ∇�, λ as a product with the derivative of
R(β ), say λ ∇R. With "sklearn" some functions are available to perform this new optimization and
even ﬁnd the best value for the new unknown parameter λ , called "hyperparameter" in general,
1 See the external python module "pyro" with such models for instance.

108
in order to get the best generalization as possible from the available sample. When the hyperpa-
rameter is very large, only the penalization is minimized, hence the weights are all zero, while
when it becomes to decrease, the former loss is also minimized such that the previous solution is
regularized with this additional term.
This is implemented with pytorch in this whole chapter, just nextafter.

5.2 Dataset with uninformative variables

The work directory is given from the function.

��
��

��

��
��

It is imported the functions from previous chapters, all saved in a python file as a custom module.
An artificial dataset is generated in order to perform the comparisons. It is added to a design matrix
some useless columns for the variables which are not meaningful, for testing the model selection.
The columns are normalized. The datasets for train and test samples are saved for later use into
separated files in a compressed numpy format.
The dataset generated is loaded in order to check the contents.

��
��
��
��
��
��
��
��

� ��

��
��
Let train the lasso model with pytorch as a neural network and compare with the regression without
penalization. For lasso, the computer program is very similar from the previous chapter, except
that the loss function is altered, and the optimization becomes more difﬁcult. This supposes to add

109
the penalization at the level of the update of the weights of the neural network. This is performed
next with the module torch after training the regular regression.

For the dataloader, the previous functions are already available. The train and set samples are
already separated here and available from the computer ﬁles.

��

��
��
��
��
��
��
��
��
��
��
��
��
��

��

The argument "shufﬂe" is at False, this supposes that the sufﬂing was performed before once, and
is not repeated during the training.

��
��
��
��

Note that "num_workers" for parallel processing with several cpu cores and "pin_memory" for
better management of the memory buffer are two options worth to try for some computers.

��
��

The training function is now implemented with the gpu device for faster training, when available.

��

110
5.3 Multiple regression without lasso
Let train the model without regularization for this new dataset, even if some independent variables
are just noise. This is likely to be the case with real data too. Useless variables detected by the lasso
model may be also just correlated to useful ones. In both cases, the usual regression is expected to
not perform well because its lack of robustness.

5.3.1 Training
��
��
��
��
��

��
��
��
��
��
��
��
��
��
��

��
��

��

5.3.2 Post-processing
According to the graphic for the losses, the model looks fine with a convergence of the algorithm
and even no issue for the generalization with new data. But from the generation process it is clear
that this model is likely to be mistaken and to need improvements. This is confirmed below via
numerical indicators based on the predictions ŷi . For computing these predictions from the model,
a function is defined below.
��
��
��
��

111
��
��
��
��
��
��
��
��
��
��
��
��
� � ��
��
��
� � ��
��
��

Here, "device" is an argument in order to be able to proceed at the level of the gpu. In pytorch,
there are several ways to use the gpu: load all the dataset in the gpu when this is possible or load
each minibatch during the training loop in the gpu. Here the model must be also in the gpu in
order to access the data. The command is just ".to(device)" where device is an object depending
on the computer hosting pytorch. By the way, if required, an optional argument for a function
"f_transform_x()" was added but it may be preferred the dataloader for such transformation like in
image processing. This leads to.

��

��
��
��

112
��
��

��
��
��

��
��
��
The model performs very well on the train sample because this is a regression line ﬁtted with
data generated after a nested model but the model is less convincing for the test sample which
induces a problem of overﬁtting. The poor generalization of the model is corrected by adding
some regularization (removing nodes and weights in the neural network which are not needed), as
explained next after.

5.4 Multiple regression with lasso

Let add the regularization. A ﬁrst approximation of the optimal value of the hyperparameter λ was
found by trying several values. This is not shown here, see next part for an automatic setting, the
reader may try its own candidate values.

��

The loss is altered by adding the absolute value of the weights to the previous loss function, the
mse. By this way the penalization appears in the update rule at each optimization step during the
training, like if this was a new loss function. This is the role of the following function which is
an argument of the generic training procedure. Thus, the training loop keeps the same, it is just
introduced a wrapping function which adds the regularization to the previous one.

��
��
��
��
��

This leads to new weights or regression coefﬁcients after training as follows.

��
��
��
��
��

113
��
��
��

The computer code is nearly the same than without regularization, only the function for the loss is
implemented instead of remaining at "None". This allows the training without any deep knowledge
of pytorch, but also the perspective to change a generic code for the purpose of the reader, such as
for instance, any other regularization functions, loss functions, optimizer functions, etc. without
requiring to reinvent the wheel. Note that an additional generic python function with less arguments
is written in a next chapter with also an additional member from the generalized linear models for
discrete regressions. The training is launched.

��
��
��
��

��

5.4.1 Post-processing and mean square error

For the usual multiple regression, the mse on the test sample was found not enough small in
comparison with the one from training: this suggests that the ﬁrst model cannot perform very
well with new data. Let check the indicators for the new model with a penalization. The new
model leads to new mse for the train and the test samples as follows.

��

��
��
��
��

��
��

��
��
��
��
The new model with the regularization performs better than the one without. This conﬁrms the
needs for the penalizing term or at least some regularization in order to improve the usual model.

114
5.4.2 Interpretation of the obtained model
The following graphic with the predicted target variable against the true target variable for both
samples, test and train, conﬁrms a higher error for the test sample when the penalization is re-
moved.
��

��

��
��
��
��
��
��

��

� ��
� ��
��

��

� ��
� ��
��

This leads to the two following graphics. This is on the left without lasso and on the right with
lasso, there is a huge improvement for the test sample which makes almost mandatory such kind
of procedure for prediction.

Figure 5.2: Scatterplot of yi and ŷi : without L1 (left) and with L1 (right)

The weights or regression coefﬁcients β̂ j come with some variance and variability, this explains
why it is not better retrieved the true target variable from the model. The penalization improves the

115
generalization for new data by reducing the quality of the prediction for the training set. Adding
the regularization performs quite well here, and without this term the model appears not useful for
new data. Retrieving same values for the indicators for train and test sample conﬁrms the idea of a
model able to generalize and to perform exactly the same for any data, as if it was estimated from
the whole population, but this is not always the case actually.
The training algorithm may be not optimal above for the optimization with a nonlinear regular-
ization term: for the lasso regression more advanced inferential procedures may be preferred for
faster training. But they ask for specialized python programs (see the literature on lasso for such
implementations) while the presented solution is via numerical derivatives.

5.4.3 Other regularization methods

The lasso regression may be improved for real data by alternative or more advanced models, in
particular the ridge regression assumes another penalization with the term ∑ j |β j |2 . Elastic net,
with the term ∑ j ρ2 |β j |2 + (1 − ρ)|β j |, is expected to share the good properties of both: it is worth
to try, this depends on the data. There is also the power from the q-norm with the term ∑ j ρ j |β j |q ,
or the clustered one with the term ∑k ∑ j∈Jk ρk |β j |q with some structure on the regression coefﬁ-
cients or the neural network parameters, for instance per layer. Here the quantities ρ j are either
for an adaptative penalization or to just focus on some parameters more than other ones. Some
of these different ways to regularize are considered in the exercices, but one may keep in mind
that for neural networks, specialized hidden layers are often now available for this purpose and
more advanced methods were proposed recently. For instance the "dropout layer", the activation
function "relu" (and their smooth variants) and the "batch normalization" provide some kind of
regularizations. Some most recent ways are in real competition against the lasso method in order
to remove the weights which are not wanted, for a parsimonious and lighter model.
Next after, it is explained how to ﬁnd in an automatic way the hyperparameters such as for the
regularization. It is also retrieved if the proposed value for lambda_l1 in the lasso, 0.02, was
actually suitable for this dataset via "cross-validation" with pytorch.

5.5 Setting of the hyperparameters

Hyperparameters are the parameters which are not trained via the gradient, they do not enter the
set of weights of the neural network but are related to its structure or its training: "learning rate",
"batch size", "number of hidden layers", "number of neurons per layers", "regularizing constant",
etc. For the setting of the values of the hyperparameters, the best common advice according to
the literature (and any serious practical experience) would be to search for these values before any
automatic procedure in order to reduce the "search space". But in the other hand, this could reduce
the capability to reach the optimal one in the case of a bad ﬁrst choice, due to human error.
The cross-validation next after should be considered for this task, because only one test sample is
not enough. Using only one test sample is prone to optimistic or pessimistic decisions and a biased
result when comparing between two candidate values for an hyperparameter. But cross-validation
should be used only when the hyperparameters were not ﬁnd manually because this is very time

116
consuming and suppose to repeat the training 5 or 10 times for all the possible candidate values for
each hyperparameters. This is times the product of the number of eventual values for each, which
becomes quickly a huge integer value when the number of hyperparameters #hyper grows. If there
are 15 candidate values for each hyperparameter, this is around:

"Total number of models ﬁtted for cross-validation" ≈ 7.5 × 15#hyper .

A saving time would be to prefer these approaches around a good value found before with a human
search via several empirical tries or a first naive choice of a few values. Otherwise the train and
test samples need to be selected very cautiously in order to be similar, via stratified sampling with
enough strata and eventually fine quantization or clustering of the data space. Note that in the
documentation of sklearn several cross-validation procedures (and lasso algorithms) are available.
Here only the more usual procedure is considered.

5.5.1 How to improve the learning

The final model which generalizes better is also with an optimal learning rate because minimizing
the loss, even with conditions, is the first purpose before all. There exists now many variant
ways for updating the learning rate during the training loop with minibatches. Herein, we have
considered a constant learning rate because for convex problems or with few and sometimes many
nonlinearities, such approach is often enough. When the objective function is more complicated
another learning rate schedule may be preferred. Hence, two questions are risen here: how to select
the better function for updating the learning rate and how to guess its better initial value.
• For the varying learning rate, αt = α(t) as an alternative to a simple constant, this is a de-
creasing function (linear or exponential) which needs the setting of some parameters (speed
of the decrease, initial value), see "torch.optim.lr_scheduler". Because the training updates
are with αt ∇�, the risk with a bad setting for the varying learning rate is to cancel out the
gradient update before that the gradient ∇� is null with αt null before that the minimum is
reached. This is another good reason to check the gradient at the end of the training. An
alternative approach is a specialized optimizer, such as the "Adam optimizer", which is a
powerful candidate to replace the previous gradient descent. With this kind of optimizer, the
learning rate is updated automatically according to the objective function and its gradient,
see the documentation of pytorch for the list of such optimizers.
• For the initial learning rate, α0 , trying several values (among powers of 10), say 10s for
s ∈ {−5, −4, −3, −2, −1, 0, 1, 2, 3} and their between values, remains possible but for very
large datasets, the training might be stopped after a few steps for making it possible. Another
way to do is choose a first high value for the learning rate, and then try a smaller one while
checking if this improves the final results. By dividing at each time by two or ten, this should
provide some good value at the end. But there is since a few years ago, a more automatic
approch called "Cyclical Learning Rates" which is implemented for pytorch, sometimes able
to provide the wanted value with a graphical proof.
A too fast training may also suppose a too large learning rate while a too slow one may suppose
a too small learning rate: for constant values or varying ones. At this moment, there is a clear

117
understanding that the training of neural network models ask for some manual settings which may
be cumbersome for non experts, as there seems to be missing the real receip which works all the
time. Recent automatic implementations are based on bayesian inference, but are time consuming,
see "automl" in the literature for instance, and the last section of the chapter as examples.

5.5.2 Grid search for optimal hyperparameters

Several hyperparameters require an a priori value: learning rate α, learning rate decay, momentum
β , other hyperparameters for some learning rate function (in Adam, etc), number of hidden layers
and their number of nodes, and also the size of the minibatch for which the learning rate may need
to change.
Two main approaches are available (beside bayesian and genetic ones not considered here), the
grid one and the sampling one. Note that if the sampling is not adaptative by looking better regions
in some intervals, the two approaches are almost equal, once the sample values are obtained before
in some array. They are called in sklearn tuning by "grid-search" and tuning by "randomization-
search" (see also for an implementation "Ray Tune" available with pytorch). A set of values is
avaible in order to try the model on each value of the set, and compare a resulting indicator of
quality in order to keep the best value. The grid seach is also eventually replaced by some chosen
set of candidate value before training, instead of dividing an interval into values with equal distance
for two neiboors as the grid supposes. Some python libraries are dedicated ("optuna") or have
functionalities (sklearn) for such settings.
As an illutration, let revise the example with the regression and useless variables. The cross-
validation via grid search is implemented for pytorch. This is the whole data sample which is
loaded (instead of the test and train samples from two ﬁles, because the cross-validation takes its
own train and test subsamples, several times, from the available data in order to propose an average
and robust decision: the resulting indicator is not depending anymore of some biased choice for
the test and train samples. This is as follows.

��

��
��
� � ��
� � ��
��
��

The cross-validation (cv) provides nearly unbiased indicators, while the same code than before is
used except that the train and test samples are not unique anymore. For the "k-fold cv", k train
samples and k test samples are required, with often k = 5 or k = 10. The average of the k resulting
indicators of quality replaces the previous unique value. The k samples are found by partitioning
the whole sample into k equal parts. With these k subsamples, k pairs of training and test samples
are obtained by removing one of the k parts from the sample, each one after the other one. At each
removal, the removed part is the test sample while the remaining (k − 1) parts become the train
sample.

118
First the k samples of the cross-validation for testing and training are found from the available
whole sample. They are generated via the new python function just below which extends the
previous one, with the indexes for each fold and each subsample (test and train). These indexes
are then used to fed two dataloader of pytorch in order to cycle the minibatches from the two
subsamples.

��

��
��

��
The number of folds is 5 here.
��

Thus, ﬁve pairs of train and test samples are put in a dictionary as follows:

��

��
��
��
��
��
��
��
��

��

Note that some functions are available in python modules such that sklearn (with the function
KFold for instance) for a similar result but with more options and alternative sampling strategies
for the selection, while the one presented is the most common. In the proposed function just above,
the result is coded with a python variable of type dictionnary (dict) with the fold number minus
one as the key, and two corresponding values as two lists for the two samples, and names "train"
and "test". For this dataset, the sizes for each fold from the dictionnary is given just after, with a
veriﬁcation of the sets:
The sizes for each fold are as follows:
� ��
� ��
� ��
� ��

119
��
��
��
� ��
� ��
��
��
��
��

��

��
��
��
��

��

��
��
��
��
��

For a number of folds which does not divide exactly the sample size, all the folds have same size,
except the last one. This is different of sklearn with more balanced subsample sizes, hence this is
better to chek these sizes anyway. Thus, the folds are in motion next by just repeating several times
the previous usual strategy of training: one sample for model ﬁtting and one for the evaluation of
the quality. And, at the end averaging.

The training is a loop over the folds in order to gather the metrics and aggregate them. The function
is deﬁned in the ﬁle "utils.py" and has the following header and return list.

��

��
��
��
��
��
��
��

The function takes as argument either the sets in "idx_s" for the cross-validation plus the dataset
either the ﬁlenames in hdf5 format in "nameﬁle_s". After training, the output is the list from each
fold from the function "utils.f_train_glmr", as a dataloader is created for each couple of train and
set samples. This allows after to compute the averages and compare the results from each fold.

120
The objects called model, loss, optimizer and monitor are copied for each fold with this function,
thus they are created ﬁrst but the initial python objects are not altered.
The parameters for the learning are as follows.

��
��
��
��

The call to the function comes after creating the required python objects:

��
��
� ��
��
��
� ��
��
��
��

��

��
��
��
��
��
�

��
��
��
��
��
��
��
��
��
��
��
��
��
��
��

121
��
��
��
��
��

��

��
��
��
��
��
�

��
��
��
��
�→��

This leads to an improvement of the results but less than expected in comparison to the previous
two samples (one train and one test).

A way to improve further would be to set better the learning rate by running the training for each
of candidate values and keep the best. Example of such values are as follows.

122
��
��

Instead, let check if the lambda_l1 is really optimal, considering this increase of the mse for the
cross-validation. First the datasets are stored on the computer disk.

��
��

��

��
��
��
��

��

��
��
��
��
��
��
��
��

��

��
��
��

��

��
��

��

123
��

��
��

��
��
��
��

��
��

��

The compressed ﬁles are stored.

��
��
��
��

Thus this is the input format for the dictionnary with the ﬁle names for the cross-validation: instead
of indexes, there is a characters string with the name (and path).

��

��
��
��
��
��
��
��
��
��

The call is almost similar than before except that the argument "dataset" in the function call is set
to "None", such as also "idx_s", while "nameﬁle_s" is now given a value. Note that for a small
dataset, this is better to keep the data in the computer memory for faster training.

124
The better regularization parameter and the better learning rate are found via cross-validation,
among the following candidate values.

��
��
��

��

��
��

��

��
��
��
��
��
�

��
��
��
��
��

��
��
��

��
��

125
��
��

��

��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��

126
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
Note that the function "loss_yy_model" performs with a global variable, hence a more secure
way to do could be preferred, such as a method from a class where lambda_l1 is a variable. The
training may be slow for some computer machine because the cross-validation asks for ﬁve training
runs which are repeated by the number of learning rate candidates multiply with the number of
regularizing parameters candidates: 5 × 5 × 12 = 300, thus fortunately the gpu was used here.
The number of epochs may be limited otherwise the running time may be too high for some com-
puters, here 100 instead of 350 because the test loss was converging faster according to the previous
curves. This is a way to do for any neural network actually, even larger ones and larger datasets.
Another strategy is to randomly sample candidate values for alpha_t and lambda_l1 in an interval
or according to some distribution, but with no better insurance to get the best parameters. Here, it
may be also implemented some strategy to ﬁne tune the search by looking around a good candidate
value. These strategies are costly and may be possible only for serious computing availability. This
underlines the need for gpu and parallel processing in order to even search the good values for the
hyperparameters. This is anyway true for running any numerical python code: a better computer
gets the faster result but for deep learning a slow computer may not be able to proceed in a practical
time or might at least induce a delay for the results.
From the numerical output just above, the resulting statistics for the computed metrics are thus
summarized in a table in order to compare the quality of the results, even if the convergence was
eventually not reached.

��

127
��
��

��

��
��
��
��
��
��
��

The plot is with seaborn here, a module which make matplotlib more powerful to use with less
command lines, see the ﬁgure and code just below.

Figure 5.3: Mse averages on test set after CV+L1 and grid search

��

� � ��

�→��

128
� ��
��
��
��
��

��

Here, lambda_l1 equal to about 0.1 performs well, but a different alpha_t looks better than the one
from before, 0.001, this is the row from the results:
lambda_l1=0.1 alpha_t=0.001 mse_tr_mean= 0.0354 mse_te_mean=0.0451
The difference with the manually chosen parameters above may appear nearly neglectable. Any-
way, checking if the choice of the parameters is well suited looks like required in order to insure
the best training and settings as possible. For more parameters, parallel plots are suggested in the
literature, while a bayesian search for the optimal parameters looks like a better solution in order
to minimise the number of runs.

5.5.3 Bayesian search for optimal hyperparameters

The method just before looks like slow for larger networks and more parameters because it needs
to compute the solution for every candidate values from all the hyperparameters. A method which
is able to search in a more clever way is able to choose better the area to search for the optimal
parameters by sampling the best place from the interval where the hyperparameters are suggested
to belong.
This is the "bayesian search" which uses probabilitic distributions on the hyperparameters in order
to improve the understanding of the search space. Several methods have been proposed in the
literature, the following one is tested, it is called "smac". Note that it is also implemented in "auto-
torch", a module to choose automatically the architecture for a neural network as herein, a feed
forward neural network. There exists also modules for the "grid search" or "random search", but
this module "smac" tries to look for an optimal path towards the best solution given a space of
search.
To proceed with the module, we need:
- The definition of the function to minimize with an unique argument named "config", this
variable is a dictionary for the parameters in order to be callable from the module.
- The definition of the "configuration space" with an interval for the sampling of each param-
eter. A configuration is a unique value of the hyperparameters while the space is the set
of interval where the parameters are to be found plus their distribution, such as uniform,
gaussian, etc. This may be with a logarithmic transformation if required.
- The definition of a "scenario" for the search, for instance, the total number of configurations
tested, total running time, etc plus the configuration space.
- The call to the module for the search, with for instance the function "SMAC4HPO()" which
takes as arguments: the function to minimize and also the scenario.

129
This is as follows for our cross-validation function. First the embedding function is deﬁned.

��
��
��
��
��
��
��
��
��
��
��
��
��
�
��
��
��
��
��
��
��

The optimization proceeds with the python module "smac" in order to ﬁnd the best hyperparame-
ters automatically by deﬁning their intervals (or range) and their distributions.

��

��
��
��

��

��
��

��

� ��

��

130
��
��

��
��
��
��
��
��

��

� ��
��
��

��
��
��
�→��

��
��
��
After the deﬁnition of the python objects for "smac", the search for the optimal solution is lauched
with the following call to "optimize()", a dedicated method from the module.

��

After running the call, the python variable "best_found_conﬁg" has received the optimal parameter
found. For debugging, one may call:

� ��

In this example of configuration, there was no improvement from the grid search. Choosing 10
folds for the cross-validation would lead to more precise estimations of the hyperparameters with
only double running time. Note that with this example of implementation if the loss becomes nan
or infty, the module may stop with an exception, hence this should be fixed if required by cautious
bounds for the intervals in the configuration space. Early stopping would be interesting here to
implement instead of a not full training with a constant number of epochs kept the same for every

131
choices of hyperparameters candidate. Other implementations of such automatic search for the
hyperparameters exist in python modules and are left as an exercice for comparison: this one is
implemented in auto-pytorch an auto-ml member which allows to find the architecture of neural
networks automatically, even if some setting by an human expert is required for the choice of the
intervals but also the final validation and the data preparation with also the recoding for instance.
In this chapter, we have studied how to implement the lasso regression with pytorch and compared
several approaches for finding the hyperparameters for the penalization and the learning rate. When
well chosen, the regularization (lasso, dropout, ...) allows to reduce the overfitting in order to
improve the generalization for the prediction with new data.

5.6 Exercices
1. (sklearn) Retrieve the value lambda_l1 with sklearn and compare the resulting mse and r2
from the result with pytorch. It is noticed here that the parameter for the regularization may
be a product by two or an half because in sklearn, it is usual to have a factor two in the mse,
see the documentation of sklearn. The regression model in sklearn performs often well by
default because a regularization is included with a quadratic norm for the parameters vector,
this is a ridge hence regularized regression.
2. (pytorch) Implement a lasso for the classification with a real or artificial dataset from one of
the previous chapters herein.
3. (sklearn) Test the stochastic regressor 2 from sklearn with different hyperparameters for the
number of steps and the batch sizes. Similar behaviour may be observed in comparison to
the pytorch implementation, when a regularizing term is added.
4. (pytorch) Add a stopping rule in the class MyMonitor when the training loss is enough stable
in order to reduce the running time, and also when the test loss increases during several steps
for reducing overfitting.
5. (pytorch) Test the lasso for the deep regression model from the chapter 1, for fitting the data
from a polynomial regression. Count the number of parameters which are near or zero and
the number of parameters which are kept. Check if this regularization reduces the mean
squared error on the test sample. Test other regularizations.

2 "https://scikit-learn.org/stable/modules/sgd.html"

132
Chapter 6

Hessian and covariance for (deep) glm

In the previous chapter we have studied neural networks and pytorch for regression and classiﬁca-
tion. In this chapter, we are interested on numerical "variance estimation" by computing values for
� β̂ j ]. As a statistical model, neural networks have parameters which are estimated from a sam-
Var[
ple. This induces some variability of the resulting estimates. Hence the usual approach is to ﬁnd
the "hessian matrix", say ∇2 � = ∇∇T �, with the "second-order derivatives" of the loss function �()
in order to deduce the variance. For a large number of parameters and small samples, this approch
is incomplete and asks for a more advanced method (in case of no tractable numerical "boostrap"),
hence out of the scope here. Otherwise such as in the linear case, this is the classical way to do
which is presented next after with the modules numpy, statsmodels and pytorch.

6.1 Parameters variance without pytorch

An usual linear regression is ﬁtted and the parameters variance is estimated with python before
retrieving the results with numpy and then pythorch.

6.1.1 Dataset from linear regression

The work directory is given from the function.

��
��

��

��
��

Let generate a dataset from a linear model in order to perform the comparisons between the imple-
mentations. This is a toy dataset with thirty rows and ﬁve variables, the dataset is without useless

133
columns.
The resulting two samples are standardized.

��
��
��
� � ��
� � ��
� � ��
� � ��
� � � � ��

��

��
� ��
� ��
� ��

� ��
� ��
� ��
��

��

��
��
��

��

6.1.2 Variance estimates from statsmodels

Because the sample is not the population, the regression coefﬁcients have a variance. This is
available from the module "statsmodels" which computes the coefﬁcients vector for the linear
regression, plus diverses statistical indicators, including the standard-deviation of each component.

��

��
��
��
��

134
��
��
��
��
��
��
��
��
��
��

The module offers a more advanced view of the table than the usual function "print()" for a numpy
array.
This is the usual contents for statistics with data, as follows:
- The regression coefficients at the column named "coef" are the values β̂ j .
- The "standard-errors" (std) at the column "std err" are the "square root of the variance".
� β̂ j ])0.5 are useful because they enter the definition of an interval instead
These values (Var[
of just an unique value for each component.
- The bounds of the confidence intervals (left at 0.025 and right at 0.975) for the true pa-
rameters, say "β j ∈ β̂ j ± z0.975 × std" where it is recognized z0.975 as the percentile of the
Gaussian distribution for a two side test.
- A Student test of nullity of the coefficients (value "t" and p-value).
Without the information from the variance, the interpretation of the estimations β̂ j for small sam-
ples are not sure: with another small sample from the same population, the value of the estimations
may change dramatically, such that the conclusion would change completely too. For larger sam-
ples, the variance is enough small to be neglected in practice, this is common in machine learning
literature.
Let retrieve the results for the variance via the algebra by writing a new python function. This is
after a brief recall about the classical analytical expressions from the statistical literature next after.

6.1.3 Algebra for the parameters variance

The variance of the parameters is obtained from the statistical theory as follows for two separated
cases, the linear one and the nonlinear one. An underlying hypothesis is a solution from maximum
likelihood estimation.
1. For linear models (regression and classiﬁcation)
• For linear regression.
n
�(θ ) = ∑(yi − α − xiT β )2
i
n
= ∑(yi − x̃iT θ )2 .
i

135
It is recalled that the criterion above comes without distributional hypothesis, even if one is
hidden here. This is a Gaussian noise which can be supposed and implicit above. This leads
to the matricial system:

y = X̃n θ + ε where θ = (α, β T )T and ε = (εi )ni=1 with εi ∼ N(0, σ ) .

The solution is written:

θ̂ = (X̃nT X̃n )−1 X̃nT y .
This estimator is unbiased, E(θ̂ ) = θ where θ is the true parameter for an inﬁnite sample,
say the whole population. An estimator of the variance of θ̂ is thus,

σ̂ 2 T 1
V̂ (θ̂ ) = (X̃n X̃n )−1 where σ̂ 2 = �(θ̂ ) .
n n− p−1

The estimator for the σ 2 is here corrected at the denominator because the one from maximum
likelihood is biased (and with fraction n−1 instead). As expected when n becomes large, the
variance vanished to zero, but for small n it may be large.
• For logistic regression.
n
�(θ ) = ∑ yiσ (α + xiT β ) + (1 − yi)(1 − σ (α + xiT β ))
i
n
= ∑ yiσ (x̃iT θ ) + (1 − yi)(1 − σ (x̃iT θ )) .
i

With ωi = σ (x̃2T θ̂ )(1 − σ (x̃2T θ̂ )), an estimator of the variance of θ̂ is,

V̂θ = −(X̃nT Ωθ̂ X̃n )−1 where Ωθ̂ = Diag(ω1 , ω2 , · · · , ωn ) .

2. With ∂∂θ� = ∑i ∂∂ �θi , the more general way to consider the variance estimation is via the hessian
(second-order derivative) or its approximation:
� �−1 � �−1
n
∂ 2 �i ∂ �i ∂ �i
�(θ ) = ∑ �(xi , θ ) such that, V̂θ = ∑ ≈− ∑
i ∂θ∂θ i ∂θ ∂θ
T T
i

It is retrieved the logistic regression as a special case. More advanced estimators such as the
sandwich one are not considered herein.

6.1.4 Variance estimates with numpy

The variance of the parameters is obtained as follows with python for the linear model.

��
� � ��
� � ��
��

136
��
��
��
��
��
��
��
��
��

��

The estimate of the covariance matrix is equal to:

��

��
��
��
��
��
� ��
� ��
The resulting table with the regression coefﬁcients and the standard-deviations is shown just below
in a pretty view very similar to the output from the above module, with several columns.

��

��
� � ��
��
��
��
��
��
��
��
��
��
��

137
��
��
��
��
��
��
��
��
��
��
��

��

��
� ��
��
� ��
� ��
� ��
� ��
� ��
� ��
��

It was added the variation coefficient, c.v., at the last column to the right, and also the t-test with
the related probability. Here, the Student test of equality t j of the regression coefficient is equal
θ̂ −0
j
to the variation coefficient because t j = � (θ̂ j ) = cv j , while the probability comes from a Student
var
distribution. This is not discussed further herein as this is the variance which is in stake.

When the number of variables increases, this becomes nearly "impossible" to invert the hessian
matrix with the current available software, thus it has been proposed solutions to solve this issue
for neural networks, mainly by only considering the diagonal part of the hessian matrix because
this is not the variance which is of ﬁrst interest, just the training (see next chapter and the exercices
part). Anyway, this brief introduction to the numerical variance estimation is not always relevant
for some models where more advanced analytical expressions may be required.

6.2 Hessian and variance for the linear regression

We want to ﬁnd the second-order derivatives instead of just the ﬁrst-order derivatives for the loss
and w.r.t. the weights of the neural network.

138
6.2.1 First-order training with pytorch

��
��

��

��
��
��
��
��

��
��

��

��
��

��
��
��
��

��

��
��
��
��
��
��

��
��
��

139
��
��
��
��
��

6.2.2 First computation of hessian with pytorch

Here it is not considered the quality of the regression. Instead, this is the variance of the parameters
which is in stake. An implementation is recently available with pytorch for this purpose, with the
function:
• torch.autograd.functional.jacobian(func, x)
• torch.autograd.functional.hessian(func, x)
This is written as follows. First it is retrieved the numerical solution for the parameters. The
function "extract_weights_lin" in the module "deepglmlib" has the name of the layers as arguments
in order to insure more ﬂexibility as the keys may change.
Thus, the vector of regression coefﬁcients is written,

��
��
��
��
��
��
��
��

� ��

��
��
��

� ��

��
For the ﬁrst and second-order derivatives, say gradient and hessian, it must be deﬁned the function
and the value where the derivative is computed. Here it is built a new fonction based on the loss.
This is very near the training or testing, but the argument is now the parameter. First, let retrieve
the values of the standard-deviation from the algebra.

��

140
��
��

��

The more direct way to retrieve the hessian (for the variance) is to deﬁne the likelihood and im-
plement the call to the corresponding pytorch function. In the linear case, without a native loss
function from pytorch, this is written as follows.

��
��
��
��
��
��
��

��
��

��

� � ��
� � ��
�

��

��
��
��
� ��
��
��
��
��
��
� ��
��

This is just that the analytical hessian, divided by n_train, is retrieved numerically for this simple
example, because the solution is known in exact closed-form for the linear case:

��

141
��
��
��
��
��
��
��
��
� ��
��
� ��
��

This comes from that:

0.5 ∂ 2 (Xθ − y)T (Xθ − y) 1 T

= X X.
n ∂θ∂θT n

6.2.3 Variance estimates with pytorch

After applying the pytorch function "torch.autograd.functional.hessian", the obtained value of the
hessian matrix leads to the numerical estimation of the variance as follows:
��
��
��

��

The estimator from algebra for this solution has the same value as expected, which validates the
illustration with the dedicated pytorch numerical function here. These solutions are also identical
to the one before from the python module stamodels (see a few paragraphes above).

��

Actually it is required to retrieve the same solution for the regression coefﬁcients because unbias-
ness is supposed here for computing the term σ̂ . In the nonlinear case, there is no need for this
proportional variance.
Anyway the hessian matrix for neural networks is more useful for training a small network with a
small dataset nowadays. But ill deﬁned hessians and very large dimensional variable spaces seem
a problem not solved in the literature even currently considering the increasing sizes of the models
in used today for images and texts. For moderated larger matrices, an approximation of the hessian
matrix (such as from quasi-newton approaches) may be preferred as explained in the next chapter.

142
This ends the introduction on variance estimation with pytorch through an example for the linear
regression model. It must be added that a resampling algorithm, say "boostrap", is an alternative
not considered here. The next chapter will consider further the hessian matrix for training (deep)
neural networks (with a few hundreds or thousands of variables or weights).

6.3 Hessian computation for (small) neural networks

A nonlinear loglikelihood with the logistic regression is presented as a more advanced example
where the hessian is computed from minibatches (for larger datasets) instead of the full matrix
before, thus the data may come from the computer disk for these new python functions.
In summary from above, the derivatives are computed with the module torch via the automatic
numerical function for the derivatives, in this section. The main idea is to apply the gradient
computation to each component of the gradient vector, following the component-wise deﬁnition of
the hessian, H = H jk , while for the variance V̂ = H −1 . The cells of the matrix H are thus,
� �
∂ 2 ∑i �i ∂ ∂� ∂ �i ∂ �i
H jk = = or approximatively H jk ≈ − ∑ × .
∂ β j ∂ βk ∂ βk ∂ β j i ∂ β j ∂ βk

∂�
- For the expression to the left, the gradient function is applied to each component of ∂β =
( ∂∂β�j ) j .
It leads to the wanted matrix, but may ask for many computer memory and may be
slow to compute for large values of p and n. With also a related second-order optimization
prone to local minima, this explains why first-order training algorithms are often preferred.
- For the expression to the right, recall that ∂∂β� = ∂ ∂∑βi �i = ∑i ∂∂ �βi . The approximate expression
just sums the cross-products of the gradients computed for each element of the sample,
say the vectors ∂β�i = ( ∂∂β�ij ) j . Recall that this alternative expression approximates the exact
hessian because the two expectations from each expression are exactly equal according to
maximum likelihood estimation.
First let have a short recall of the formula for logistic regresion with: the expression of the log-
likelihood, the derivatives (gradient, hessian) followed by the expression of the information matrix.
Then, the variance is computed with pytorch after fitting the regression coefficients from the model
for real data, the "abalone dataset".

6.3.1 Hessian expression for logistic regression

In the logistic regression, let rembember that the target or dependent variable is binary, yi = 1 or
yi = 0, associated to the explicative variables which enter the model with weights called coefﬁcients
regression in the vector θ including the intercept. This leads to the model and the expectations as
follows:
T
exi β
f (yi |pi ) = pyi i (1 − pi )1−yi where E(yi ) = pi = T .
1 + exi β

143
The (inverse) logit link function insures a linear regression after transformation. Let denote the
diagonal matrix Ω = Diag(p1 (1 − p1 ), . . . , pn (1 − pn )), and the dataset available as the couples
(xi , yi ) with the loglikelihood written:

�(β ) = ∑ni=1 yi ln pi + (1 − yi ) ln(1 − pi ) .

Thus,
� T
�
�(β ) = ∑ni=1 yi xiT β − ln(1 + exi β )
∇�(β ) = ∑ni=1 (yi − pi )xi
∇2 �(β ) = ∑ni=1 −pi (1 − pi )xi xiT

In (β ) = E[−∇2 �(β )] .

Note that as a result, the Fisher information matrix is equal to the hessian, because the hessian is
not random, and do not depend on the yi ,

Iˆn (β ) = −∇2 �(β ) = In (β ) .

In matricial notation, this is rewritten:

∇�(β ) = X T (y − p)
∇2 �(β ) = −X T ΩX

Iˆn (β ) = X T ΩX .

Thus,

In (β ) = E[Iˆn (β )]
= Iˆn (β )

With the gradient per unit equal to ∇�i = (yi − pi )xi , this also induces that the approximation for
the hessian matrix is as follows,

I˜n (β ) = ∑i ∇�Ti ∇�i

= ∑i (yi − pi )2 xiT xi

E[I˜n (β )] = In (β ) .

As an exercice the reader may show the equality of the expressions from the ﬁrst-order derivatives
and from the second-order derivatives. Thus it is retrieved the general result for this example of
a member of the generalized linear model. This is an expectation thus by the law of the large
numbers, the equality holds for the sum only for enough large samples. This expression from
the gradients has been implemented for neural networks by the past before the rise of ﬁrst-order
approaches.

144
6.3.2 Real dataset with two classes
First the dataset is loaded with the module "pandas" because it is in format "csv". This module
allows also a nice processing of the data table for naming and accessing the columns with character
strings, plus also for the rows and columns the selection of subsets or the computation of descriptive
statistics (sum, mean, variance) for instance. The chosen target variable is the "count of rings"
which takes integer values, it is binarized in order to be suitable for a binary classiﬁcation.

��
��
� � ��
� � ��
� � ��

��

Let select two subsamples for the training, the train and set samples.

��

��
��

6.3.3 Results from the module statsmodels

The model is ﬁrst trained with statsmodels as a baseline result in order to compare with the pytorch
implementation. The output and code follow below.
��
��
��
��
��
��
��
��
��
��
��
��
��
��

��

145
��
��
��
��

��

6.3.4 Results from sklearn and numpy

The model is also ﬁtted with sklearn in this paragraph. The standard deviations for the regres-
sion coefﬁcients are computed with numpy arrays by implementing the expressions obtained from
algebra at the beginning of the subsection, as follows:

��
��
��
��
��
��
��
��

��
��
��

��

��
Thus a same result is obtained with the formula and with the python package for the variances
of the regression coefﬁcients. Note that the model here does not perform well with the variables
which are correlated, hence adding a regularization could be required for a more useful model.
Note that currently the python variable "y" should be a one dimensional vector,
"y.shape=(n_samples,)", thus obtained after the call "y=y.ravel()", which is the reason why this
was added above at the deﬁnition of y when x and y were both created. Another shape of vector
may not work well, hence checking the shape of y (and x) is a wise advice whenever pytorch or
sklearn complains, with the current version of the modules.

6.3.5 First-order training with pytorch

With pytorch, the variance of the parameters is also retrieved after an iterative inference as follows.

��
��

146
��
��
��
��
��
��
��

��

��
��
��
��
��
��

��
��
��
��
��
��
��

��
��
��
��
��

��

For this training of the dataset, the test loss has converged under the train loss, this supposes a
problem of underﬁtting according to the usual interpretation from the literature, hence the model
is not enough ﬂexible, not well trained, or informative variables are missing.

The purpose here is to illustrate the computation of the variance, without even tuning much the
parameters. For this logistic regression model,

147
��
��
��

� ��

��

��
��
��
��
��

��

This is near the solution from the dedicated python modules. A more cautious tuning of the pa-
rameters would lead to a nearer solution. For example sklearn leads to the solution.

��

� ��

��

��
��

��

6.3.6 Hessian from second-order derivatives

The hessian is computed with minibatches because it is a sum of second-order derivatives over
the whole sample. A loop cycles the dataset and sums the second-order derivatives from each
minibatch as follows.
The number of weights is computed for the sizes before the calls to "hessian()" from pytorch.

��

��
��
��
��

148
��
��
��

��
��
��
� � ��
��
��
��

��

��
��

��
��
��
��

��

� � � � ��

��

� � ��
��
��

��

� ��
� ��
� ��
��
� ��
� ��
The resulting values of the standard-deviations are almost equal to the output from the previous
estimations with a dedicated python module.

��

149
��
� ��

��

The difference comes from that the final solutions for the regression coefficients were only nearby
as the training is for a nonlinear function with different initial values and different inferential
procedures. The first solution with statsmodels comes from a second-order training procedure,
see next chapter, while the second one with pytorch comes from a first-order one which asks for
more careful settings. This is the price to pay in order to get a more scalable algorithm without
requiring the full hessian at each step of the training, in order to avoid a non necessary costly
numerical burden. An automatic or better setting of the learning rate with a stochastic approach
for an optimal training would be required here. This is an almost mandatory option for convex
objective functions because the global solution is reachable here.

6.3.7 Hessian from second-order derivatives (bis)

Let revisit the python code for more advanced models with hidden layers for instance. First, one
has to remember that the parameters or weights of each layer are structured as follows in pytorch,
the matrix for the linear transformation or "weight" and the vector for the "bias".

��

��
The weights are also available from the dictionary.

��
��

��

��
There is a more direct access to the weights which allows to update their values too.

��

150
��
�→��

The hessian is written as the derivative of the components of the gradient vector. It is adviced to
keep the full list of parameters in an unique list, in order to avoid post-traitment, otherwise, a direct
use may lead to an hessian per layer.

��
��
��
��
��

��

��
��

��

��
��
��

��

��
��
��
��
��
��

��

�→��
�→��

��

��
This solution here is the most direct 1 , by iterating the components of the gradient, each row of
1 Thanks to "https://discuss.pytorch.org/t/computing-hessian-for-loss-function/67216" , for an example of imple-

151
the hessian is computed numerically. The "intercept" or "bias" appears at the last row (and last
column).
With q the number of parameters (bias and components of the weights) and θ = (θ1 , θ2 · · · , θq )T ,
this corresponds to the deﬁnion of the hessian H = ∑i Hi where Hi = ∂ ∂θ T [gi ] from the derivatives.
The sum here on the whole train sample is performed with minibatches, H = ∑Bb ∑i∈sb Hi which is
exactly equivalent to summing directly to the whole sample:

 ∂

∂ θ1 [loss(ŷi (θ ), yi )]
 ∂ 
∂  ∂ θ2 [loss(ŷi (θ ), yi )] 
gi = [loss(ŷi (θ ), yi )] = 
 .. ,

∂θ  . 
∂
∂ θq [loss(ŷi (θ ), yi )]

and,

 ∂
� T � 
∂ θ1 g1 (θ )
 ∂
� T � 
∂2  ∂ θ2 g2 (θ ) 
Hi = [loss( ŷi (θ ), y i )] =  .. .
∂θ∂θT  
 . 
∂
� T (θ )
�
∂ θq g q

��
��
��

��

� ��
� ��
� ��
��
� ��
� ��
There is actually a very small difference which may comes from regularization in the loss or just
numerical differences.
��

��
For deep neural networks with hidden layers, the weights are likely to be in a matrix for each
layer, thus each matrix parameter has to be vectorized by reshaping and by appending the rows
mentation which was altered for the wanted purpose.

152
or columns in order to make an unique vector from all these vectorized parameters. Note that
recently a submodule for pytorch has been released in the version 1.12 for computing the hessian
matrix, see "functorch", at "https://pytorch.org/functorch", associated to "vmap" for vectorization,
"vjp" and "jvp", this is out of the scope herein. Note also that one may have to reduce the list
of parameters to those which are marked for derivation, when all the parameters are not involved
for the gradient. This happens for instance when performing the training of the classifier only,
while leaving the first layers for the transformation of the images unchanged in "transfer learning"
which used a pre-trained "convolution neural network" for images classification. An alternative
approximation of the hessian is with the gradients for each data, as follows.

6.3.8 Hessian from ﬁrst-order derivatives

An estimation of the hessian is available via the gradients only, let compare such solution with the
previous one. In pytorch, the gradient is computed with a numerical procedure for minibatches for
the training algorithm, thus, the same approach with minibatches of size one leads to the wanted
alternative.
With q the number of parameters (bias and components of the weights) and θ = (θ1 , θ2 · · · , θq )T ,
this corresponds to the deﬁnion H̃ = ∑Bb ∑i∈sb H̃i for the approximate hessian,

∂
gi = 
∂ θ [loss(ŷi (θ ), yi )] 
∂
∂ θ1 [loss(ŷi (θ ), yi )]
 ∂ 
 ∂ θ2 [loss(ŷi (θ ), yi )] 
= 
 .. ,

 . 
∂
∂ θq [loss(ŷi (θ ), yi )]

and,
Hi ≈ H̃i = gi gTi .
The new estimation is ﬁrst initialized to zero. The loop computes the gradients and accumulates
the products. Here the gradients are required for each data sample vector, but the parameters are
not updated because they were found just before.

��
��
��
��
��
��
��
��

For the case considered, a matrix of weights is not met because only the vector of regression
coefﬁcients is modeled. For deep networks, matrices are required for modeling the weights in
order to compute the sum from a layer to another layer where the dimensions or numbers of nodes

153
are more than one. The loss is with a sum instead of a mean for each minibatch in order to retrieve
directly a whole average for the considered sample.

��

�→��

��
��
��
��

��

��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��

��
��
��

��

Let ﬁnally compare the resulting standard errors from the two estimators.

��
��
��
��
��

154
��
��

��

� ��

This alternative estimator of the Fisher matrix information may be less accurate (with this sample
size, and with smaller resulting standard-deviations, but also some runs of the algorithm with dif-
ferent initializations) for the estimation of the variance than the more usual one (which is exact for
the logistic regression). It cannot suffer of numerical problems coming from the algorithmic eval-
uation of the second-order derivative. The matrix H̃ is also relevant for a second-order inferential
procedure as studied next chapter.

6.4 Exercices
1. (stat+numpy) Write the function for the variance of the coefficient regression vector for the
multinomial regression and compare the result with the output of statsmodels for the "iris
dataset".
2. (stat+numpy) Compute the hessian matrix and the corresponding estimation of the variance
matrix for the parameters when the dataset is from the example of chapter 4, from the data
file "xy_2d_diskandnoise_reglogistic.txt". Does the estimation of the variance is relevant,
explain why and propose an alternative method (with pytorch or eventually "scipy.optimize"
by reimplementing the network model, like for the polynomial regression at chapter 1).
3. (stat+numpy) In order to better understand the issue with the dataset in the exercice just
before, generate several datasets of same size with a circle within a cicle or a square within a
square or any other convex regular shape, with an increasing surface for the outer geometrical
objet (circle or square or other). Compute the hessian for the different datasets and different
sizes of datasets, discuss the results.
4. (stat+numpy) Compute a candidate numerical value for the standard-deviation of the hidden
latent layers with an update from the functions above (for enough large datasets in order
to get positive diagonal terms after inversion of the hessian matrix). Find also a candidate
numerical value for the variance for the predicted target variables ŷi from a deep regression
with one hidden layer.
5. (stat+numpy) a) After testing the lasso for the deep regression model from the introduction
chapter which allows to fit the data from a polynomial regression, find the variance of the
parameters with the hessian here. Compare with a resampling method such that boostrap. If
the hessian is not well-defined explain why. b) Retry this by cancelling the zero terms and by
removing them from the gradients and hessian. How this changes the results. c) Propose an
experiment with artificial data for this same regression model in order to find which size of
the dataset is required in order to get a well-defined and invertible hessian matrix, and also

155
if the distribution of the xi or the range has an influence on this result. Does this result could
change with a different polynomial function and how this could changes with the degree of
the polynomial function.
6. (stat+numpy) a) Propose a method for the variance of the frontier with the example of classi-
fication from dataset "xy_2d_reglogistic.txt". Check if the true frontier is in the correspond-
ing interval at 95% of confidence. b) Idem for the dataset "abalone".
7. (stat+numpy) Retrieve for the logistic regression the result about the alternative expression
of the information matrix, by showing that E[(yi − pi )2 ] = pi (1 − pi ). Here, a clue is that
y2i = yi is a binary quantity, while E[yi ] = pi . Check the result by a numerical simulation.
8. (stat) Retrieve the expression of the Fisher scoring for the logistic regression.

156
Chapter 7

Second-order training of (deep) glm

In this chapter, we are interested on the second-order estimation for nonlinear models such as
(deep) generalized linear models where the distributions for modeling the dependence of the target
variable yi w.r.t the variables xi belongs to the exponential family. For datasets of moderated size,
the classical method for the optimization is the "Newton-Raphson procedure" (or NR) which uses
the hessian matrix for rescaling the gradient vectors at each iteration. For larger datasets, the
methods named "Quasi-Newton" (or NQ) prefer to approximate the hessian and the product with
the gradient in order to avoid numerical issues coming from high dimensional spaces.
On the contrary to the first-order methods, the Newton-Raphson procedures are not suitable if the
neural network is too large (too many parameters) with the current computer architecture because
the hessian does not fit in memory or is too slow to compute. Another limit is when the model itself
is too flexible and the dataset too small such that the variance is not reliable: the hessian matrix is
ill defined, not invertible or gives negative values for the variance. It is supposed that all is going
well here, otherwise a variable selection, a weights selection, a tuned regularization or if required
a simpler model are often able to solve this issue.
In the first chapters, we have studied the linear models, their nonlinear extensions within the frame-
work of the neural networks, and several iterative algorithms which update the parameters in order
to minimize the loss functions (or maximize a loglikelihood). The chapter just before has discussed
the second-order derivatives aggregated in the hessian matrix and the related variance for (linear)
regression and classification when modeled with pytorch.
The hessian and its previous approximations are involved in the first part of the current chapter via
an implementation with numpy for the second-order training of a regression model. A procedure
from quasi-Newton methods concludes the chapter in a second part with pytorch via its available
dedicated function.

7.1 Introduction to GLM

The family of models named generalized linear models (GLM), for regression and classiﬁcation,
is deﬁned via the distribution named "exponential family" associated to a transformation.

157
7.1.1 Deﬁnitions
let denote θ for the natural parameter and φ for the dispersion parameter with positive values. The
target variable yi belongs to an exponential family with density of a very general form,
� �
yθ − b(θ )
p(y|θ , φ ) = exp + c(y, φ ) .
a(φ )

In the generalized linear models the mean µ = E(Y |x) is written via a function called link function
which is a strictly increasing function, which is applied to the usual inner product of the parameter
vector β with the vectors of explicative variables (plus bias one for the intercept) xi . The link
function η(.) has an inverse η −1 (.) and,
g(µ) = η(xT β )
µ = η −1 (η)

For instance, these link functions are deﬁned just below for two distributions in the regression
problems met every days when dealing with datasets
• Gaussian: η = µ and 1 for the variance function.
• Poisson : η = log µ and µ for the variance function.
Other link functions are available for the Inverse Gaussian, Gamma, Multinomial, Binomial and
Poisson regressions among other ones.

7.1.2 Derivatives and variance

From any statistical textbook, the score or ﬁrst-order derivative of the loglikelihood is obtained as
follows:
n
(yi − µi )µi� (ηi )
∇�(β ) = ∑ σi2
xi .
i=1

The Hessian or second-order derivative of the loglikelihood is thus written:

� �
n � (η )]2 n
[µ (y − µ )µ �� (η )
∇2 �(β ) = − ∑ −∑
i i i i
i
2 2
xi xiT .
i=1 σ i i=1 σ i

The Fisher information matrix is the expectation of the hessian, with E[yi ] = µi , while Ω =
[µi� (ηi )]2
Diag( σi2
: 1 ≤ i ≤ n), thus,
n
[µi� (ηi )]2 T
In (β ) = E[−∇2 �(β )] = ∑ 2
xi xi = X T ΩX .
i=1 σi
The hessian or Fisher information matrix are the covariance matrix when inverted and divided
by the sample size after, as explained in the previous chapter. This matrix is also suitable for
decreasing the loss instead of just considering the ﬁrst-order derivative, with the additional burden
of its computation cost.

158
7.2 Algorithms for second-order training
A deep linear regression can be interpreted as a statistical nonlinear regression with a particular
expression for the nonlinearities. Historically, the Gauss-Newton-method was introduced for in-
ference of the parameters of such model, but with growing number of variables and large datasets,
ﬁrst-order algorithms are usually more practicable.
Thus the following algorithms are for a moderated size network, except in case of fast computers
which are not available for every ones. The second-order approximation is presented just before
the algorithms, this is the foundation.

7.2.1 Taylor serie at second-order

For the training procedure, a second-order approximation around a current solution denoted β (t)
or here β(t) is written:

1
�(β ) ≈ �(β(t) ) + ∇�(β(t) )T (β − β(t) ) + (β − β(t) )T [∇2 �(β(t) )](β − β(t) ) .
2

The optimization solves for the derivative of this approximation:

∇�(β(t) ) + [∇2 �(β(t) )](β − β(t) ) = 0 p .

Except for the Gaussian case with canonical link, the problem to solve is nonlinear and needs a
gradient ascent.
The corresponding updates are thus, with a ridge term with parameter τ for insuring the inverse of
the hessian matrix, as follows:

β(t+1) = β(t) − αt [Ht + τI p ]−1 ∇�(β(t) ) .

It is recognized exactly the same update formula than for the algorithm next paragraph.

7.2.2 Newton-Raphson procedure

Let remember from the previous chapters, we want to estimate regression or classiﬁcation coefﬁ-
cients with a vector β̂ . The cost function or loss is minimized such that,

β̂ = argmin �(β ) .
β

We are still interested with minibatches but eventually also the full dataset if it is not too large,
thus, with a number of B minibatches where eventually B = 1, the loss function is still written as,

�(β ) = ∑Bi=b �b (β )
= ∑Bb=1 ∑i∈sb − log f (yi ; xi , θ )
= ∑Bb=1 ∑i∈sb �i (β ) .

159
- The second-order update formula for β(t) is now written from the formula of the paragraph
above,  � � 
� � ∂
d ∂β �b (β )
�b (β ) =  0 � � .
dβ ∂
�b (β )
∂ β1
The new update formula of the parameters at each iteration remains with the gradient, but
with the hessian which appears here as a multiplicative corrective factor such that if τ = 0
the second-order update is:
� � � �  � � � � −1
∂2 ∂2
(t+1)
β0 β0
(t) T
∂ β0 ∂ β0 �
�b (β ) T
∂ β0 ∂ β1 �
�b (β ) d � �
= − αt  � �  �b (β ) .
(t+1)
β1 β1
(t) ∂2 ∂2 dβ
T �b (β )
∂ β1 ∂ β0 T �b (β )
∂ β1 ∂ β1

A learning rate may be required because the second-order serie around the current solu-
tion may be not enough accurate, such as the update may be reduced in order to insure the
convergence despite the gap with the real function.
- This is implemented next after with numpy for the Poisson regression (see next section for
the expression of the loss and loglikelihood) where the target variable takes positive integer
values or counts, such that the new distribution involved is changed. With Dµ = Diag(µ(k) ),
the update is just for the NR algorithm:
� �−1
d 2 �(β(k) ) d�(β(k) )
β(k+1) = β(k) − αt dβ dβ T + τI p dβ
� � −1
= β(k) − αt ∑i µi xi xiT + τI p [∑ (yi − µi )xi ]
� T �−1 � Ti �
= β(k) − αt X Dµ X + τI p X (y − µ)

7.2.3 Natural gradient procedure

When denoting Dy−µ = Diag(y − µ(k) ) and with the alternative expression for the hessian matrix,
this leads for the Poisson regression to the approximated NR algorithm:
� �−1
d�i (β(k) ) T d�i (β(k) ) d�(β(k) )
β(k+1) = β(k) − αt ∑i dβ dβ + τI p dβ
� �−1
= β(k) − αt ∑i (yi − µi )2 xi xiT + τI p [∑i (yi − µi )xi ]
� �−1 � �
= β(k) − αt X T D2y−µ X + τI p X T (y − µ)
According to the experiments, this version has converged to a solution near the optimal one but
more slowly: a correcting factor is required like a learning rate. Also the convergence may be not
insured on the contrary to the usual NR algorithm. In the other hand, this allows a sequential ap-
proach with minibatches for consecutive computations of the inverse of the (approximated) hessian
matrix as explained next.

7.2.4 First-order gradient update with minibatches

Following the ﬁrst chapters, with sb the data subset for the minibatch at iteration k from one epoch,
d�(β(k) )
β(k+1) = β(k) − αt dβ .

160
This is written for the Poisson regression:

β(k+1) = β(k) + αk ∑i∈sb (yi − µi )xi .

7.2.5 Natural gradient procedure with minibatches

When adding the gradients from each minibatch, the approximated hessian matrix is inverted se-
quentially thanks to linear algebra without explicit inversion but only with the Sherman-Morrison
formula and also for faster processing the Woodbury identity. The mean is preferred to the sum,
otherwise, the learning rate may need to change at each minibatch, as the sum is growing, while
here it is kept constant.
d�i (β )
Let denote gi = dβ(k) the gradient vector for xi . For simpler notation, the data are reordonned
according to the shufﬂing if there is any. The size of a minibatch is m and the minibatch index is
b, with next b + 1. Note that it may happen that the last minibatch is smaller than m for Euclidean
division reason (n divided by B), but this is not considered here.
- Adding one gradient vector one after the other for each minibatch (until m times)
First, let consider the case when a data vector is added with, m = mb = |sb | the size of
the minibatches, at epoch k. Such as the current minibatch is number bk with a total of B
minibatches, supposed of equal sizes for lighter notation, thus, bk m = ∑bb=1 k
|sb |, and, when
adding a new gradient vector to the approximation of the hessian knowing that m should be
added for a full minibatch,
� � � �
1 1 bk T T
H̃bk m+1 = bk b k m bk ∑b=1 ∑i∈sb gi gi + g� g�
∑b=1 |sb |+1 � � ∑b=1 |sb | � �
= bk 1 bk m bk1m ∑bi=1
km T
gi gi + gT� g�
∑b=1 mb�+1 �
1
= bk m+1 bk mH̃bk m + gT� g� .
From the Sherman-Morrison formula, this leads to repeat m times, with � = �sk such that for
1 ≤ s ≤ m, and for example s = 1,
� T g −1
�
H̃b−1m+1 = (bk m + 1) b k mH̃b m + g �
k � k � �
1 −1 1 1 −1 T 1 −1
= (bk m + 1) bk m H̃bk m − H̃ g g H̃
1+gT� b 1m H̃b−1m gT� bk m bk m � � bk m bk m
� k k �
= bkbm+1
km
H̃b−1
km
− bk1m T 1
1 −1 T −1
−1 T H̃bk m g� g� H̃bk m .
1+g� bk m H̃bk m g�

This induces that the inversion is in closed-form from the previous step. When there are
several data vectors coming from the new minibatches, this formula is involved m times,
mb
as the size of the minibatches. Let denote, a1 = m(b+1) with a1 = 1 at ﬁrst minibatch of
1
each epoch, and a2 = m(b+1) . In the python code, the following update was tried and kept
because the average is for a minibatch, and this seems to lead to a stable convergence here
for different sizes of dataset, learning rate or minibatch sizes,
−1 −1 1 −1 T −1
H̃bm+1 = a1 H̃bm − a2 1 −1 T H̃bm g� g� H̃bm .
1+a2 gT� bm H̃bm g�

161
For this dataset, the ﬁnal result was relevant according to the numerical experiments with
numpy (and also when changing the parameters for generating variants of the dataset for a
poisson regression).

- Adding m gradient vectors for the whole minibatch (only 1 time)

The solution from the previous paragraph is slow for large m or B because a sum is in-
volved, hence, a matricial version is preferred as follows with Gb aggregating the m vectors
of gradients gi from the minibatch bk . This allows the Woodbury identity, by adding the
whole minibatch of gradient vectors. It follows when denoting with just the index k from the
minibatch bk in the approximated hessian and nk = bk m, for a lighter notation:
1
� �
H̃k+1 = nk+1 nk H̃k + ∑�∈sk+1 gT� g�
1
� �
= nk+1 nk H̃k + Gk+1 Ik+1 GTk+1
� � �−1 �
−1 1 −1 1 −1 T 1 −1 T 1 −1
H̃k+1 = nk+1 nk H̃k − nk H̃k Gk+1 Ik+1 + Gk+1 nk H̃k Gk+1 Gk+1 nk H̃k
� � �−1 �
nk+1 −1 1 −1 T 1 −1 T −1
= nk H̃k − nk H̃k Gk+1 Ik+1 + Gk+1 nk H̃k Gk+1 Gk+1 H̃k
� �−1
≈ H̃k−1 − n1k H̃k−1 Gk+1 Ik+1 + GTk+1 n1k H̃k−1 Gk+1 GTk+1 H̃k−1 .

The simple expression just above seems slightly different from the proposed approaches
from the literature. The expression is fully matricial and might be faster than repeating
n times the Sherman-Morrison formula, at least for languages such as python with here
numpy for processing the algebra. A perspective is to add some sketching with a random
projection or pca (see next chapter about reductions), not discussed here.

Next, the different formula for updating the parameters and decreasing the loss function are com-
pared with an artiﬁcial dataset for the Poisson regression.

7.3 Fitting a Poisson regression with statsmodels

The work directory is given from the function.

��
��

��

��
��

Let clean and check the computer memory state. There should be remaining some space otherwise
the operating system risks to freeze.

162
��
��

��

��
��
��
��

��

� ��
� ��

� ��

Dataset from the ﬁles

A data sample of 5000 observations from a Poisson regression with 6 independent variables is
generated and stored in the computer ﬁles. The train and test subsamples are sampled from the
whole sample. The sample is split into the train and test subsamples in order to check the model on
new data. They are stored as ﬁles in the computer disk with also the true parameters. The datasets
and the true parameters are loaded as follows.

� � ��
��
� � ��

��

��
��
��
��
��
��
��

163
��
��
��
��
��

��

The true regression coefﬁcients are supposed unknown during the training, this is the purpose of
the model ﬁtting to retrieve this vector. This induces that usually, the loglikelihood (and related bic
aic or other ones) is mostly the only value (plus eventually the residuals and related) which is in
stake in order to compare several models. For generated data, the true β and the true expectations
T T
µi = eβ xi are known thus they can be compared with the estimated ones, β̂ and µ̂i = eβ xi after
training.

Checking the target variable

The discrete distribution of the target variable y is visualized with a "barplot", after counting the
number of observations per possible value of the variable. This is with the python class "Counter"
from the module "collections" which asks for a list for the input sequence. The graphic is from the
module "matplotlib" while "seaborn" allows more advanced outputs not used here.

Figure 7.1: Barplot for the target variable from Poisson regression

164
��
��
��

��
��
��
��
��
��
��
��
��
��
��
��
��

��

This graphic with a barplot would suggest that there may be an excess of zeros but checking the
mean and the variance conﬁrms that the regular usual Poisson distribution is a good candidate:

��
��
��
��

��

The ﬁrst component in Xi contains only the value one in order to include the intercept in the vector
of regression coefﬁcients β , otherwise this would be xi for usual notation. This avoids to separate
the intercept from the vectors, in order to get all the unknown quantities aggregated in an unique
vector β . In the python modules such as sklearn, this is often possible to separate or include both,
but for computation with numpy the algebra is lighter with an unique vector instead of a vector
plus a scalar.

With the normalization of the columns, the true beta is not retrieved even approximately, but the
T
parameters in the Poisson µi should be retrieved with the estimation µ̂i = exi β̂ because the expec-
tations remains identical.

Training of the regression model

This model is ﬁtted via a second-order procedure with a python module.

165
��
��
��
��

��

��
��
The training was performed only in four iterations, which makes very appealing the second-order
procedures when the sample is not too large. The regression coefﬁcients are retrieved as follows:

��
��

7.4 Fitting a Poisson regression with numpy

In this subsection, we are interested on inference procedures with the full dataset at each iteration
(also called batch algorithms) and on incremental or sequential procedures with parts of the dataset
at each iteration (also called minibatch algorithms). The four algorithms are in an unique python
function in order to make easier the access to their contents and their call:
• "nr" for Newton-Raphson procedure on the whole data sample, with the exact hessian matrix
from second-order derivatives.
• "ef" for natural gradient procedure on the whole data sample, with the approximated hessian
from first-order derivatives.
• "gd" for gradient descent with minibatches, as in the first chapters, with the gradient vectors
or first-order derivatives.
• "efseq" for natural gradient with minibatches with a sequential inverse of the matrix, with
the approximated hessian.
Note that "ef" stands for "estimated fisher matrix", which is chosen in our case equal to the one pre-
viously called "approximated hessian", hence another naming could be "ng" instead for instance.
The natural gradient procedure is like the newton-raphson procedure except that the hessian is the
approximated one. The ridge correction is not discussed anymore because it has begun usual in
any implementation. There exists precise differences between the numerous algorithms from the
literature hence the reader might look for a review on the history of neural networks or statistical
inference for more details about, this is out of the scope here.

7.4.1 Expressions of the gradient and hessian

Tx
With µi = µi (β ) = eβ i , the Poisson regression is a member of the generalized linear model such
that,
y
µi i −µi
f (yi ; β ) = yi ! e .

166
Thus, the loglikelihood is deﬁned as the log of the n products from the mass functions above, one
per sample observation, such that,
�(β ) = log L(β )
= ∑ni=1 log f (yi ; β )
� yi �
µ
= ∑ni=1 log yii ! e−µi
= ∑ni=1 yi log µi − ∑ni=1 µi − ∑ni=1 log yi ! .
The maximum likelihood solution is:
� n �
β̂ = argmax ∑ yi β xi − e
T β T xi
.
β i=1
� �
)� d 2 �(β ) �
For the optimization with Gk = d�(β
dβ � and Hk = dβ dβ T �β
, this leads to compute the derivatives
β(k) (k)
as given in the previous section just above: this is a good exercice to retrieve these expressions.

7.4.2 Implementation of the four algorithms

In this subsection, the algorithms presented just before are tested with numpy by writing a loop
and by updating the parameters at each iteration. On the contrary to the ﬁrst-order procedures, the
second-order procedures needs the hessian, hence this matrix is computed just after the gradient at
each iteration. The python functions are deﬁned as follows.

��

��
��

��

��
��
��
��

��
��

167
��

��
��
��

��
��

��

The previous class is for the sequential inverse of the matrix in the natural gradient procedure with
minibatches while the following function is for the ﬁtting of the Poisson model with the fours
approaches.

��

��
��
��

��
��
��
��

��

��
��

��

��
��
��
��
��

168
��
��

��
��

��

��
��

��
��
��
� � ��
� � ��
� � � � ��

��
��
��
��
��
��
��
��

��

��
��
��
��
��
��
��
��
��
��
��

169
��
��
��
��
��
��
��
��
��
��
��
��
� � ��
��
��
��
��
��
��
��
��
��
��
��

��
��
��
��
��
��
��
��

��

��
��
��
��

��

��
��
��

170
��
��
��

��

��
��
��
��

��
��
��
��
��
��
��
��
�→��

��

��
��
��

The final loglikelihoods from each algorithm are compared at the end of the inference when the
values of β(k)
algo
for k = 1, 2, · · · , become stable at the final k = kend
algo
. Indeed, each algorithm leads to its
solution β̂ algo for the maximum likelihood hence a different value for the loglikelihood �(β̂ algo ). This
does not ask for other informations than the available data sample, the usual situation in practice.
As the unknown parameters are known here, it is also computed the mean squared error and the
correlation coefficient from the true parameters µi and the estimations µ̂ialgo for each algorithm
"algo", this allows a further analysis.

The results are checked with the correlation and the mse. The ﬁnal result depends on the chosen
test sample hence a cross-validation could to be preferred here. It is also better to run the ﬁtting
several times, with different random initializations and keep the best result.

��

��
��
� � ��
��
��
��
��

171
��
��
��
��
��
��
��
��
��
��
��
��
��
��

The four algorithms are run one after the other as follows:
• For the batch methods, the Newton-Raphson updates with exact hessian.

��
��

��

��
��
��
��
��

��
The diagonal version "nr_diag" just inverses the diagonal matrix by keeping only the diago-
nal elements from the hessian matrix. A better way would be to compute this diagonal only,
instead of zeroing the non diagonal cells.
• For the batch methods, the natural gradient updates with approximated hessian.

��
��
��
��
��
� � ��
��
��
��

172
��
��
��
��
� ��
��
��

��
��
��
��
��
��
��

��
The diagonal version "ef_diag" inverses only the diagonal matrix by keeping the diagonal
elements from the approximated hessian matrix.
• For the sequential methods, the updates with the gradient vectors from minibatches.

��

��
��
��
��
��
��
��
��

��
• For the sequential methods, the updates "efseq" with the inverses of an approximated hessian
from minibatches. Here, "efseq1" is for B = n (m = 1) with the Sherman-Morrison formula
only while "efseq" is for B < n (m > 1) with the Woodbury identity.

��
��
��
��
��
��

173
��
��

��
The algorithms are compared visually via their loglikelihood in the next graphic.

��

��
��
��
��
��
��
��
��
��
��

Figure 7.2: Losses per epoch for the ﬁrst/second-order algorithms

For the sequential procedures with the numpy implementation above, there was required a "gra-
dient clipping", with bounds component by component on the gradient during training: this has
avoided a "divergence" when the consecutive values of the parameters do not stabilize with time.
An alternative is a strict bound on the norm for all the components together which sometimes is
required also in deep learning for some models. There the norm is kept constant instead of these
truncated components.

174
Which algorithm behaves the best depends on the dataset and the settings hence, when possible,
several algorithms should be tested and compared. Some improvement of the settings for the learn-
ing rate and the minibatches size is possible via "cross-validation" for instance. More advanced
second-order sequential methods (for high dimensional models) exist in the literature and some are
implemented on pytorch, one is tested at the next section.
Retrieving the exact counts is not in stake (because of the sampling generating the counts), thus the
Poisson parameters are compared instead. Comparing the loglikelihood (or related bic and aic) is
also more usual than the mse because the true parameters are often unknown. The numeral results
are summarized in the following table.

��

��
��
��

��
��

��

��
��
��
��
��
��
��

��
��
��

��

��
��
��
��
��

��

��
��

175
��
��
��
��
��
Here, with s = strain ∪ stest , and ntest = |stest |, "msemu_test" is for the test sample:

1
mse(µ̂test , µtest ) = ∑ (µ̂itest − µitest)2 .
ntest i∈stest

With a factor of four, the batch procedures have lead to smaller mse than the minibatches pro-
cedures here despite a small difference for the loglikelihood, this may underline the superiority
of such approaches for small datasets. Note that concerning these implementations with numpy,
the python code with the minibatches have some issues here: the ﬁrst-order procedure has not the
best setting while the second-order could prefer (for some other datasets too) a full computation
of the inverse of the hessian at each epoch. Testing such variants and improvements is an eventual
exercice left to the reader in order to further practice these algorithms, it is adviced to separate
each algorithm in a separated dedicated function before updating the python code. The next sec-
tion presents a more robust and advanced approach from the "quasi-newton" methods, plus also a
better setting for the gradient approach, both with the module pytorch and the gpu when available.

7.5 Fitting a Poisson regression with pytorch

The inference for the Poisson regression is implemented with pytorch, with the 5000 data vectors
in order to use the numerical derivatives for deep neural networks and compare with the imple-
mentation with numpy for the linear case. First the usual ﬁrst-order method is implemented for the
new loss function, then a new optimizer class from pytorch with a more advanced approximation
of the hessian is tested for a second-order method. Note that this part of the chapter is relevant
for deep models while the part just before is for linear models, this would ask for implementing
backpropagation with numpy, which is out of the scope herein and fully automatic with pytorch.

7.5.1 Example of training at ﬁrst-order

Let deﬁne a more general python function for the training, with linear generalized models such as
Gaussian, Bernoulli, Multinomial and Poisson. Note that other distributions are allowed by giving
a loss function as input and choosing "MLP" for the architecture.

��

��
��

The loss function for the Poisson regression is already deﬁned in the pytorch module, otherwise, it
may be deﬁned as:

176
� ��
� ��
� ��

The new function presents a simpliﬁed access to the class for the neural networks, and returns a
dictionary instead of separated variables at the end of the call.

��
� ��

��

��
��
��
��
��
��
��
��

� ��
��
��
��
��

��
��
��

��
��
��
��
��

��

177
��
��
��
��
��
��
��
��
��
��
��

��

��
��
��
��
��

��

�→ ��

��
��
��

��
��
��
��
� ��
��

��
��
��
��
��

178
��
��
��
��
��

��

The transformations are as before, a standardization is possible for the columns of the design
matrix X, and eventually an initialization of the model is required.

� ��
� ��
� ��
� � ��
� � ��
� ��
� ��
�→��

� ��

�→��

� ��

� ��
� ��
� �
� � ��
� � ��

� ��
� ��

� ��

There is not an involved transformation for this dataset.

A main difference with before is that the loss is automatically chosen. Let deﬁne the dataloader
for loading the dataset, and train the model.

� ��

179
��

��

�→��

��

�→��

��
��
��
��
��
��
��

��

��
Here it is implemented a gradient clipping which avoids the gradient to increase too much dur-
ing the training. This seems mandatory for ﬁrst-order algorithms, which seems not discussed in
the literature much. The bounds for the clipping was chosen arbitrary after testing several values.
Similarly, in some cases, the initialization seems to require a bounded value around 0 from an
uniform distribution, otherwise the convergence was towards a mistaken solution. This may be re-
lated to some initializations for deep learning using gradient procedures. For the proposed generic
implementation, it is required to declare a new function and add the function at the call while the
initilization is by default for this version of pytorch.

��
��
��
��
��
��

The model is trained with the new function above for (deep) glm, this is more convenient because
there is just need to deﬁne the structure of the network with the hidden layers and the name of the
model.
��

��
��
��

180
��

��
��

��

��
��
��
��

��

��
��
��
��

Let compute the statistics for measuring the quality of the ﬁnal model according to the true pa-
rameters. Remember that this is not possible with real data except for the loglikelihood. Here the
dataset is artiﬁcial and the additional indicators are informative for checking the implementation
and further comparisons.

��
��
��
��
��
��
��

��
�→��

��

��
��
��
��
��

��
��

181
��
��

��
��

The likelihoods and indicators are obtained above, but one may keep in mind that proceeding in
"f_mu_mse_cor_poisson" with chunks would be better for larger datasets, this can be seen as an
exercice left to the reader (otherwise, there is the risk of computer system freezing when python
uses a large virtual memory).

In comparison, the module statsmodels leads to:

��

��
��
��
��

��
��
��
��
��

��
��
��
��

��

��
��
��
��

This is not very far to the solution above with pytorch. Actually some installations of pytorch
may perform slower than numpy (which could be even faster with some gpu directive like from
"numba"). But, the automatic derivatives are very appealing for (any) more complex models. The
parameters are not so much tuned for the sequential methods with minibatches and the updates are
not stochastic, which explain the lesser performance here.

Next, the second-order is implemented with pytorch for neural networks.

182
7.5.2 Example of L-BFGS for training at second-order
In this section, we are interested on quasi-newton methods. They aim at step k at approximating the
usual update for the parameters, θ(k+1) = θ(k) +δk . Here δk is unknown and solves for Bk δk +gk = 0
where Bk ≈ Hk . The approximation Bk of the exact hessian Hk needs generally to be positive
definite for insuring an inverse with good properties.
In particular the method called limited memory BFGS is an improved algorithm from the original
research from the Broyden–Fletcher–Goldbarb–Shanno (BFGS) formula. This formula proposes
an approximation of the hessian and its inverse from two consecutive values of the gradient, say gk
and gk+1 plus the two last consecutive data vectors, say xk and xk+1 . The differences are involved,
sk = xk+1 − xk and yk = gk+1 − gk . By equating Bk+1 sk = yk , and after advanced algebra, this leads
to the update of the approximating hessian:
1
Bk+1 = Bk − T Bk sk sTk Bk + ρk yk yTk .
sk Bk sk
Here, ρk denotes the factor yT1s . The inverse is in closed-form, via an algorithm with a truncated
k k
sum for the limited memory version.
The implementation is tricky and out of the scope, thus, the existing optimizer from pytorch is used.
It is named "optim.LBFGS" and asks currently for a different implementation of the backpropa-
gation (in the loop with the minibatches), exceptionnally, thus this must be rewritten in python
by following the documentation. There exists many variants, but they are not implemented with
pytorch, the available one is stable and performs well. Note also that the sequential process is
managed by the algorithm, hence in this case, it is better to include all the dataset once in the
dataloader. In some cases minibatches may induce a convergence but this is not sure at all and not
wise to avoid the batch version (except for the recent variants called stochastics).
For this optimizer, it is added a test in order to check which optimizer is involved in the function
"f_train_glmr()" from the module deepglmlib and file "utils.py". Hence, this is included in the
function for training already in the companion file.
The call to the function is thus as previously by just changing the name of the optimizer and the
dataloaders. For instance, some eventual regularization is added as follows.

��
��
��
��
��
��

��

��
��
��

183
��

��
��
��
��

��
��

��

��
��
��
��
��
��
��

��

For some run, the algorithm may not converge, there are some hyperparameters to set. In par-
ticular "nbmax_iter_lbgs" for the "maximal number of iterations per optimization step", see the
documentation of the optimizer for this one and the other ones.
The resulting metrics are as follows.

��
��
��

��
��
��
��
��

��
��
��
��

��
��
The results are summarized in the following table for comparison purpose. The correlation is not

184
given as it is less informative than the mse. The second-order method is clearly performing well
here as expected for a small dataset. The table is thus:

��

��
��
��

��
��

��
��
��

��

��
��
��
��

��

��
��

��

��
��
��

The implementation with pytorch leads to a same loglikelihood (and even slighly better) than the
implementation from statsmodels, while the gradient approach is almost identical with a small
difference. For the second-order training with pytorch, the hessian is approximated recently for
pytorch with external modules with the diagonal hessian for instance in the case of very large

185
models: fully diagonal or by block per layer. External modules such as "backpack" are found in
repositories, this allows the access to other approximation but sometimes available only for some
architectures of neural networks. Second-order are more prone to minimum local, hence for non
convex optimization they should not be considered without further checking. Some mixed method
by combining with another optimizer at the beginning or at the end of the training looks like the
best way to do here.

7.6 Exercices
1. (stat+numpy) Rewrite the function for the four second-order procedures into separated func-
tions for each method. Test the algorithms for the linear regression and the logistic regression
by changing the expression of the gradient vector and the hessian matrix, compare the output
with statsmodels for the datasets from the previous chapters herein.
2. (stat+pytorch) Idem than just before, with pytorch instead of numpy. Compare with the
result from the algorithm l-bfgs.
3. (numpy) Test a cross-validation of the newton-raphson algorithm and the approximated ver-
sion with numpy and find out if five or ten folds are relevant for approaching the expected
error.
4. (numpy) Test the deep Poisson regression with an artificial dataset with only one inde-
pendent variable and a nonlinear function in µ. Compute the mean squared error for
diverses architectures (varying number of hidden layers and number of neurons). See
"https://link.springer.com/article/10.1007/s00521-009-0277-8" for instance for examples of
nonlinear regression.
5. (stat+pytorch) Implement a zero inflated model for deep learning with count data. The dat-
aloader with the id from each row may be required here if shuffling is true, in order to keep
the ordering of the rows for the vector of mixing probabilities.
6. (stat+pytorch) In some not rare cases for deep neural networks, the hessian is expected to
be ill-defined, with the inverse which leads to negative variances on the diagonal. Propose
solutions in order to solve this issue. For instance test a regularization or test a Gan for
tabular data in order to augment the number of observations from the dataset and test a
variable selection method in order to keep only the useful variables in the input layer. Is
there a difference between variable selection and weights removal from the neural network
in order to improve the generalization. An example of network is from the artificial data
"diskandnoise", see previous chapters.
7. (pytorch) Check the convergence by ploting locally the loss function as curves for each
variable, and also deduce a numerical value for the gradient.
8. (pytorch) After a read at the documentation, test the module1 "backpack" with more ad-
vanced second-order derivatives. Compare with l-bfgs for a small dataset. Compare with a
first-order training for a large dataset.
1 "https://pypi.org/project/backpack-for-pytorch/"

186
Chapter 8

Autoencoder compared to ipca and t-sne

In the previous chapters it has been studied the neural networks for regression and classiﬁcation
when the dataset is a data table: nonlinearities are added to the family of the generalized linear
models. Thus, their deﬁnition, their training and their implementation have been presented for
several datasets with several modules available with the computer language "python". These ap-
proaches are called "supervised" because they aim at predicting a variable yi from a variable xi .
When only the variables xi are available, the set of rows or the set of columns from the data table
can be described by methods called "unsupervised". Typically, two main families of unsupervised
methods exist: a) "clustering" (out of the scope) which proposes a descriptive "partitionning" with-
out examples, hence according to a criterion such as minimum intra variance, and b) "reduction"
(see this chapter) which proposes a new set of rows or columns according to a criterion such as
maximum variance.
• The methods when they provide new low dimensional vectors x̂i from the high dimensional
ones xi needs to preserve some information from the former space: the whole variance be-
tween data points, the distances between data points, their nearest neighbors or even the
structure with the furthest ones. To be clear, the method named "pca" preserves mainly the
variance while the one called "t-sne" preserves neighboors, thus they do not have exactly
the same utility. Note that t-sne is the culminating method which came after several ones
(Isomap, LLE, SNE, Laplacien Eigenmap and a few other ones) with lower performances.
• Several criteria from unsupervised methods are summarized in the table below.
Name Type Criterion
K-means Clustering ({ẑik } , {µ̂k }) = argminz ∑ni=1 ∑gk=1 zik �xi − µk �2
GMM Clustering ({ẑik } , {µ̂k } , σ ) = argminz ∑i,k zik log N(xi − µk , σ 2 )
PCA Reduction x̂ = xŵT , with ŵ = argmaxw� ∑i wT n1 xiT xi w
� �2
ISOMAP Reduction x̂ = {x̂1 , · · · , x̂n } = argminx̂ ∑i, j vi, j δi, j − di, j
p
t-SNE Reduction x̂ = {x̂1 , · · · , x̂n } = argminx̂ ∑i, j pi j log qii jj

Here, a simpliﬁed version for principal component analysis (pca with the data columns
supposed standardized and with only one latent component only, with also the constraint

187
ŵT ŵ = 1), and for gaussian mixtures models (gmm with here the non fuzzy version) are
presented, see the literature and the documentation of sklearn for more information. Note
also that di, j is the Euclidean distance between xi and x j while δi, j is the Euclidean distance
between two lower dimensional vectors x̂i and x̂ j , while vi, j is the graph of the nearest neigh-
boors. For SNE and t-SNE, the probabilities pi j and qi j are respectively the transformations
via an exponential function or t-distribution density function of di, j and δi, j plus a normal-
ization: this allows to improve Isomap by increasing the frontiers between the clusters from
the data space or the projection space.
• With neural networks, it becomes possible to model the nonlinearities with hidden layers
instead of the vicinity graph. Thus several architectures of neural networks has been invented
for explaining the contents of a data table. For reducing the number of variables, among
the existing methods of reduction, autoencoders are able to extract a linear or nonlinear
subspace from the set of vectors xi . A reduction of the dimension may be concerned by one of
the following purposes: exploration, visualization, feature extraction (new variables before
regression or classification), regularization and noise reduction, less computer memory and
more speed, outlier detection. More generally this solves for the curse of dimensionality by
retrieving the lower subspace where the data lies really instead of an high dimensional space
with many useless variables. This comes from the numerous variables which share the same
information and are correlated or can be explained by other variables. Removing the noisy
and useless information allows to keep a summary which is informative for the human and
the machine: for interpretation or computing.
• In an autoencoder, the reduced vectors denoted x̂i come from an hidden layer of a neural
network. It is defined very similarly than in the previous chapters, except that the output is
not anymore a variable yi but instead just xi , while the input remains again xi . Autoencoders
are quite old neural networks which are modernized currently with the rise of computer
modules such as pytorch, and they are able to leads to a better reduction than an usual
linear method such as pca and even sometimes a better reduction than the current state of
art methods such as t-sne. Autoencoder (ae) in the simplest linear case behaves like pca,
but when nonlinearities are added and when combined with other neural architectures, ae
is expected to perform better than other methods. But the training of an autoencoder is
currently often difficult: because of the intensive search for the optimal hyperparameters and
the unsupervised side. Thus, it is more often found autoencoders specialized in a family
of datasets, for instance, medical images of a body part, clinical data for a given disease or
biological data related to dna: each architecture of ae needs a careful training in order to be
able to reduce the dimensionality of the available data and future nearly identical data.
The chapter is divided into three parts, first pca, ae and t-sne with an artificial dataset and the 60k
digits dataset named "mnist" which is currently the main usual dataset for model comparisons.

8.1 Three pca methods and t-sne for dimension reduction

The method of pca is widely used and studied until today because in many situations, it is able
to improve many models as a pre-processing for reducing the column space of the available data

188
matrix which aggregates the data vectors xi as rows. It may be even included within a model in
order to regularize an unknown latent matrix.
A data matrix in numerical algebra comes with an Euclidean space from the space of the columns,
for the features or variables. The matrix may be not of full rank with some correlation between
colums which can make the life difficult for some algorithms such as regression or even classifica-
tion. These statistical duplicates induce a large number of columns, a numerical burden during the
matricial computations or even an erroneous solution from the optimization. Hence, pca is very
appealing because it aims at defining new variables which are a linear combination of the available
variables and less numerous, while optimizing a cost function, the variance among these latent
variables.

8.1.1 The pca method in brief

First, let recall the method for computing the new variables from principal component analysis and
its projection, just before an analytical justiﬁcation.
Algorithm

For correlation pca, one proceeds as follows,

a) The data matrix X = (xi j )i j ∈ Rn×p has its ﬁrst columns standardized and centered, with zero
means and unit variances. This results into the new matrix Y = (yi j )i j ∈ Rn×p with cells:

xi j − x̄ j
yi j = .
σj

b) The correlation matrix is found with this matrix as,

1
CorY = Y TY .
n−1
Note that some authors prefer the version with n instead of n − 1.
c) Two different and equivalent approaches allow to ﬁnd the solution for pca:
- A singular value decomposition (svd, a method from linear algebra for any rectangular
matrice) of this matrix Y of rank r, with:
1
√ Y = Ur ΔrVrT .
n−1
- An eigen decomposition (eg, a method of linear algebra for any diagonalisable sym-
metric matrices), with:
1 T
n−1 Y Y = Vr Δ2r VrT
1 T
n−1 YY = Ur Δ2r UrT .

The two methods for a matricial decomposition lead to the same result, but numerically they
have not the same numerical complexity or the same demand for the computer memory. If

189
the number of rows or columns is reduced, the square matrix is a good choice, but for a
sparse matrix the rectangular matrix keeps the sparsity. Here in both cases, it is denoted Δr a
diagonal matrix for the singular values λ� where 1 ≤ � ≤ r, and the two orthogonal matrices
Ur and Vr where:
Δ = Diag (λ1 , λ2 , · · · , λr ) .
UrT Ur = Ir and VrT Vr = Ir .
Note that the squares λ�2 are called the eigenvalues, they come from the eigen decomposition
of a symmetric matrix. For instance the covariance and correlation matrices are positive
deﬁnite and symmetric.
d) This leads to the projection of Y for k (with k ≤ r and often k � r) principal components:

CYpca = YV
√k
= n − 1Uk Δ2k .

Here, Δk keeps only the singular values corresponding to the k largest ones in Δr while Uk
and Vk are similarly truncated versions of Ur and Vr to the corresponding eigenvectors. This
solution is also the principal components for X with: CXpca = CYpca . This approach allows also
an approximation of X instead of a projection, see next.
Some authors prefer to proceed directly with the covariance matrix, and other ones would add
some additional scaling if an underlying metric is involved, this is discussed in the literature of
exploratory data analysis. This explains why different implementations lead to different results
because these options are not always fully documented. Some indicators from pca are not discussed
here. Next, it is explained further the algebra for pca.
Singular value decomposition (svd)

The deﬁnition is not always clear in the literature from one author to another one because the
decomposition is written in three different ways. This paragraph allows a better understanding of
this important part of the algorithm. The svd has a matricial notation for the following result for a
any real matrix of rank k,
r
Y= ∑ λ�u�vT� .
�=1

Here, u� ∈ Rn×1 and v� ∈ R p×1 are basis vectors for respectively Rn and R p , hence,

uT� u� = 1 and uT�1 u�2 = 0 if �1 �= �2

vT� v� = 1 and vT�1 v�2 = 0 if �1 �= �2 .

The quantities λ� are strictly positive (non equal to zero) and called the singular values. By keeping,
only the ﬁrst k largest singular values in the sum above, one get the best rank k approximation.
The three corresponding matricial notations are the following ones:
- Let denote Ur and Vr aggregating the vectors above, with Ur = [u1 |u2 | · · · |ur ] ∈ Rn×r and
Vr = [v1 |v2 | · · · |vr ] ∈ R p×r with UrT Ur = Ir and VrT Vr = Ir while Σr = Diag(λ1 , · · · , λr ) ∈

190
Rr×r , one gets the matrial version of the svd in brief notation:

Y = Ur ΣrVrT .

This ﬁrst clear and concise notation is expanded into two notation because it is possible to
add basis vectors to Ur and Vr when zeros are added to the matrix Σ. Remember that for this
notation, the matrix is diagonal with not null diagonal elements, and thus, on the nondiagonal
part of the matrix there are only null cells (equal to zeros).

- The second notation is a direct extension of the one above but the less natural, by computing
q−r
q = min(n, p), and by partially completing the basis for Rn and R p , aggregated in U∗ for U
q−r q−r q−r
and V∗ for V , thus Uq = [Ur |U∗ ] ∈ Rn×q and Vq = [Vr |V∗ ] ∈ R p×q with UqT Uq = Iq and
VqT Vq = Iq , while Σq is the diagonal matrix Σr completed with q − r zeros on the diagonal,
Σq ∈ Rq×q , thus Uq ΣqVqT is read as:

� q−r ��
q−r Σr 0r VrT
Y = [Ur |U∗ ] q−r q−r T .
0rq−r 0q−r (V∗ )

- The third notation is not the more natural because Σ is not always diagonal. There are
two complete basis, by fully completing the basis for Rn and R p , aggregated in U∗n−r for U
and V∗p−r for V , thus Un = [Ur |U∗n−r ] ∈ Rn×n and Vp = [Vr |V∗p−r ] ∈ R p×p with UnT Un = In
and VpT Vp = I p , while Σnp is the diagonal matrix Σr completed with zeros, Σnp ∈ Rn×p , thus
Un ΣnpVpT is read as:

� ��
Σr 0rp−r VrT
Y = [Ur |U∗n−r ] .
p−r
0rn−r 0n−r (V∗p−r )T

In the general case, n �= p such that Σnp is not even square anymore.

Note that the algorithms for ﬁnding the eigenvalues and eigen vectors do not order the values of λ�
by default, hence, one must reorder the singular values and eigen vectors because only the largest
ones are kept for pca or low rank approximation. Ideally one looks for an implemention able to
ﬁnd only the largest or the smallest eigen components, for faster computation.

From the criterion to the algorithm

Let ﬁnd an approximation ŷi of yi with a matrix Vk ∈ R p×k orthogonal (hence VkT Vk = Ik ) and

191
vectors ci ,
   
ŷi1 ci1
ŷi =  ...  = ∑k�=1 ci� v� = [v1 |v2 | · · · |vk ]  ...  = Vk ci
   
ŷip cik

(ŷ1 , ŷ2 , · · · , ŷn ) = argmin Sn

= argmin ∑i �yi − ŷi �2
= argmin ∑i �yi −Vk ci �2

∂
� �
∂ ci Sn = ∂∂ci ∑i �yi −Vk ci �2
� �
= ∂∂ci (yi −Vk ci )T (yi −Vk ci )
� �
= ∂∂ci yTi yi − 2yTi Vk ci + cTi VkT Vk ci
� �
= −2VkT yi + 2VkT Vk ci
= 0

Thus, both approximations ci and ŷi are linear projections for the data vector into two different
spaces:
- The low dimensional one reduces the size of the vector or length from p to k,

ci = VkT yi .

- The low rank one reduces the size of the linear basis with only k orthogonal vectors,

ŷi = VkVkT yi .

When this solution ci is inserted in the expression of the inertia, it is obtained:

Sn = Tr(Y −YVkVkT )(Y −YVkVkT )T .

Hence,
argminVk Sn � �
= argminVk Tr �(Y −YVkVkT )(Y −YVkVkT )T �
= argminVk Tr� (Y −YVkVkT )(Y T −VkVkT Y T ) �
= argminVk Tr −YVkVkT Y T +YVkVkT Y T −YVkVkT VkVkT Y T + Tr(YY T )
= argminVk − Tr(YVkVkT VkVkT Y T )
= argmaxVk Tr(VkT Y T YVk ) .

Hence it is recognized for the columns of Vk the ﬁrst eigenvectors of Y T Y by deﬁnition of an eigen
decomposition of a symmetric matrix. This also induced that from these eigen vectors, one gets
the low rank approximation of yi with just ŷi = Vk ci = VkVkT yi which is sometimes called denoised.
It is retrieved the approximation for xi by inverting the standardization of the columns (mean and
standard deviation), which concludes the method here. See a textbook on principal component
analysis for the additional results for the method.

192
Note that "ipca" and "kpca" are variants, with ipca for "incremental pca" and kpca for "kernel pca".
They come with alternative algorithms out of the scope here. The result from ipca should lead to
nearly the same reduced space than pca because its aims at solving the same problem with chunks
for scalability. As kpca is a nonlinear version of pca, it is able to improve the resulting reduction.
Next, these three algorithms are compared to an autoencoder and t-sne with several indicators of
quality of the visualization.
Illustration

The method pca is tested with a small dataset for better understanding. A toy dataset is generated,
this is 80 points from a circle in a 3d space.
��
��

��

��
��
��
��
��
��
��
��
��
��
��
��
��
��

It is compared the pca projections from sklearn and numpy.

��
� � ��
��
��
��
� � ��
� � ��
� � � � �
��

��

193
��

��

��
��
��
��
��

��

��
��
��
��
��
��
��
��
��
��
��
��
��
��

��

Check the orthogonality and the norm: VV T = I3 ?

VV T =
��
� ��
��

��

For k ∈ (1, 2, 3), Σvk = λk vk ?
Σv1 − λ1 v1 =
��
��
� ��
Σv2 − λ2 v2 =

194
��
��
� ��

Σv3 − λ3 v3 =

��
��
��

The pca from sklearn and numpy are compared.

��

��
��

��

The eigen vectors remain the same for the two implementations but their sign may change. This
comes from the eigen decomposition which is not unique. The ellipsoidal shape of the circle is
exactly retrieved. Because the points belong to a plane, the third pca component is zero.

��

� ��

Below, other expressions from the svd are also checked with three components.

��

��
��
��
��
��

195
��

��

��
� ��

��

� ��
� ��

��

� ��
� ��

��

� � ��

��

196
��
��
��

��

��
� ��

� � ��
��
��

� ��
��

��
� ��

��

��
��
��

� ��
��

8.1.2 Implementations of pca with sklearn

We consider an illustration with a dataset from Gaussian classes in 10 dimensions. The data can be
considered generated from a mixture model, a probabilistic model for clustering which supposes g
classes or clusters with centers µk , here mostly well separated with clear frontiers between them.
Each data vectors xi belongs to one of the g groups which can be seen as spheres or ellipsoids in
the space of the columns. For a two-dimensional reduced space, this induces that the data from
the group k should appear visually in a circle or in an ellipse around the projection of µk for this
artiﬁcial example. This is like looking the data by displaying only two components of the vectors
xi but instead with new variables or components which are able to better show the data.

��
��

197
��
��

��
��

��

Example of a dataset with 9 clusters and a 3d view

��
��
� � ��
� � ��
��

��

In a three dimensional view, the data are as follows when showing the three ﬁrst components.

��
��

��

��
��

��

��
��
��
��
��
��
��
��

��

198
Figure 8.1: Visualization of 3 variables among 10 for 9 classes.

The classes are not fully apparent when a different color or a different marker is not added to
the points for each class. This information is generally unknown, except if some clustering was
performed before in order to help the data analysis and the visualization.
With real data, the variables yi should be found in an unsupervised way (clustering), otherwise if
they are known a supervised method for visualization may be better actually in order to use the
additional information (see "fda", "discriminant analysis" in the literature), this is out of the scope
herein.
Next, the projections are from the high dimensional space, 10 here, into the bidimensional plane,
hence 2 instead of 3 among the usual choices. The labels yi are unknown during the computations,
the corresponding colors are just added to each points from the visualization after processing the
projection. This allows to better compare visually the differences between the methods.

PCA with sklearn

The algorithm for pca is already implemented in sklearn. The call to the function available with
the module is written as follows.

��
��
��
��

Thus, the pca projection leads for the ﬁrst principal plane in 2d.

199
��
��

��
� � ��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��

200
��
��

Figure 8.2: Visualization of two ﬁrst pca components for 9 small classes

For each group an ellipse is added on top of its scatterplot for helping the visualization. Some
data points may appear less near to the cluster as this is allowed by the Gaussian distribution by
deﬁnition.

There is a small overlapping because some clusters are near in the data space. For this example,
we will check how behave an autoencoder.

Incremental PCA with sklearn

For larger datasets, it is preferred an incremental algorithm which operates with chunks from a
data ﬁle. With sklearn it is possible to enter the minibatches manually one after another as for our
loops with pytorch, via the function "partial_ﬁt()". Here, this is sklearn which cycles the dataset
with minibatches until convergence.

��

��
��
��
��
��

201
��
��
�→��

��

Figure 8.3: Visualization of two ﬁrst ipca components for 9 classes

The representation is not exactly the same because the sign from the eigen vectors may be inverted
and the iterative algorithm depends on hyperparameters such as the size of the minibatches, but this
is nearly the same visualization than pca. Note that a pca algorithm is also implemented in pytorch
natively, see "torch.pca_lowrank()". For large matrices, approximations of pca via randomization
or random projection are appealing alternatives to the exact solution.

Kernel PCA with sklearn

In the nonlinear choice, it is considered a kernel pca which proposes a trick (related to "svm",
"support vector machines") in order to change the inner products into nonlinear functions. Some
options are:
• kernel is with the choices ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘cosine’, with by default=‘linear’
with is for the usual pca.
• gamma is a ﬂoat for kernel coefﬁcient in rbf, poly and sigmoid kernels. By default it is equal
to 1/n.
• degree is by default=3 and only for poly kernels.

202
��
��
��
��
��

Figure 8.4: Visualization of two ﬁrst kpca components for 9 classes

This is not working so well here with clusters very near, but some clusters are more separated or
smaller than with a linear kernel.

8.1.3 Implementation of t-sne with sklearn

Let try also t-sne, a recent nonlinear method for reduction and visualization.

��

��
��
��

��
��
��
��
��
��
�→��

203
��
�→��

��

Figure 8.5: Visualization from t-sne after pca for 9 classes

Here for most samples tested from the distribution considered, as for this draw, the clusters are vi-
sually well separated because the overlap is not so high. Otherwise small or large overlapping may
appear with a cluster behind another one on the visualization map like in the orthogonal projection
from pca above. This method is today widely used in theory and practice for visualization, despite
that it is not yet perfect: the output from this nonlinear method often replaces the previous maps
from pca in many research domains (machine learning, biology, physics, ...).

8.1.4 Quality indicators

Numerical indicators for judging the quality of a reduction or a visualization are available from the
literature in python modules. Each indicator depends on the property that one wants to underline.
For instance, the "Davies-Bouldin" and the "Silhouette" focus on the separation of the clusters
projected from existing label {yi } which are not used by the unsupervised reduction method.
The Davies-Bouldin index is the average similarity between each class and its most similar one.
This indicator is preferred minimal because it decreases if the classes are more separated. The
average of the Silhouettes is the mean value from average dissimilarities. This indicator is belongs
to the interval [−1, +1] and is preferred maximal because it increases for more compact classes.
In python, these numerical indicators are implemented in sklearn, they are obtained from the call
to the following function.

204
��
��
��
��
� � ��
��
��
��
��
��
��

��

��
��
��

As expected, t-sne perfoms better than the pca methods, while pca and ipca are equivalent. The
resulting values are given below in a table after adding autoencoders to the list.

8.2 Autoencoders with pytorch

In this section, autoencoders are trained with pytorch for two datasets, the small dataset from above
and a large dataset from handwritten digit images. We are mainly interested on the deﬁnition of the
autoencoders, their training for a linear or nonlinear model with hidden layers and how to combine
them with glm in order to eventually improve the resulting regression or classiﬁcation.

8.2.1 Deﬁnition and link with pca

The neural networks for autoencoder are deﬁned with the same output and input, the multidi-
mensional vector xi . They are unsupervised because the label variables yi are not involved in the
deﬁnition and in the training of the model. They have been called "auto-associative" neural net-
works by the past, a larger family which includes other networks. There are two cases, the linear
reduction and the nonlinear reduction.

Linear case

One popular interpretation in the literature of the autoencoder is its close relation to pca in the
linear case. The authors of a recent communication 1 have recalled their result from 1988: writing
the linear autoencoder formally as with the matrix H which aggregates the values hi of the hidden
1 See "https://doi.org/10.1007/s00422-022-00937-6, Biological Cybernetics", 2022) for a discussion on autoen-
coders and pca plus some related recent architectures of neural networks.

205
layer for the input xi ,

H = W1 X encoder (reduces the input, dimension p to k)

X̂ = W2 H decoder (retrieves the input, dimension k to p) ,

then the solution for the optimal matrices Ŵ2 and Ĥ needs to solve for the minimum of:

�X − X̂�2 = �X −W2 H�2 .

The solution comes from standard linear algebra2 with the "best rank k approximation of X", such
that with the svd X = Un ΣnVnT , the solutions are respectively:

Ŵ2 = Uk S−1
Ĥ = SΣkVkT
Ŵ1 = Ŵ2T .

Here, Un , Σn Vn are the respective truncated matrices from Un , Σn Vn by keeping the part corre-
sponding to the number of nodes k of the unique hidden layer while S is an arbitrary non-singular
matrix. The author had concluded that an autoencoder with an hidden layer and a linear output
activation function learns the "weights that span the same subspace as the one spanned by the
principal component vectors" when the loss function is the MSE.
Without the orthogonalization, linear autoencoders are expected to be less effective than pca for
linear projection. Actually, orthogonality is not an issue and may be introduced via some kind
of explicit penalization for more advanced linear settings according to recent architectures of the
literature. With nonlinearities they become able to improve dramatically the reduction. They are
also ﬂexible with the possibility to be associated with other architectures of neural networks.

General case

The idea is to deﬁne after the input layer a set of hidden layers with a decreasing number of nodes
until a dimension chosen for the reduction. This layer is the "bottleneck layer" where the reduction
is obtained. After this layer, the hidden layers have a increasing number of nodes until the size
of xi . The ﬁrst set of layers are the encoder and the second set of layers is the decoder. The
autoencoder is able to encode a vector xi into a reduced vector ĥi and then to decode the reduction
into an approximation x̂i of the input vector with eventually some small error:

x̂i ≈ xi ,

where,

ĥi = encoder(xi ) is the reduction

x̂i = decoder(ĥi ) is the reconstruction .

There are several results here after training:

2 See Introduction to matrix computation, Stewart G (1973)

206
- the reductions {ĥi } which allows a visualization and any post-processing.
- the decoder() which allows to get new vectors from the continuous latent reduced space.
- the reconstructions {x̂i } which is eventually an improved version of the input vector.
- the encoder() for contructing the reduction for new input vectors if required.
Note that the dimension for the latent space may be two or three for visualization, but this may be
more because an additional method may be used for visualizing the latent space.

8.2.2 Visualization of artiﬁcial data

Roughly speaking yi is replaced by xi in the models of neural network from the previous chapters.
We focus here on unsupervised encoders in pytorch: we deﬁne a class for an autoencoder and train
this new network with the numerical derivatives.

Model as a python class

A general class with hidden layers is deﬁned as previously but with two type of hidden layers.

��

��
��
��
��
��
��
��
��
��
��
��
��
��
��

��

��
��
��

��

207
��
��

The class allows to deﬁne the part of the network for the encoder and the part for the decoder.
The forward pass needs to encode to the reduced space and then to decode to the data space. This
suggests also to deﬁne two new functions for each of these two parts.

Initialization of the network

The initialization is by default or the "Xavier" one in our implementation. For nonlinearities, it has
to be careful otherwise the activation functions may saturate such as the gradient cancels out too
early for some nodes, or on the contrary the gradient may increase too much. This is the role of
the initialization to avoid or limit these events for the activation functions having this issue. There
are diverse approaches implemented in pytorch, the fully random one and the careful random one,
with diverse versions in the module3 "torch.nn.init" as summarized below for a matrix of weights,
say a pytorch tensor.
Initialization Values drawn from Parameters
with a = 0 and b = 1 by
"uniform_" Uniform distribution U(a, b)
default
with mean = 0 and
"normal_" Normal distribution N(mean, std2 )
std =�1 by default
"xavier_uniform_" Uniform distribution U(−a, a) 6×gain2
a= fan_in+fan_out
�
"xavier_normal_" Normal distribution N(0, std2 ) 2×gain2
std = fan_in+fan_out
�
"kaiming_uniform_" Uniform distribution U(−a, a) 3×gain2
a = fan_mode
�
"kaiming_normal_" Normal distribution N(0, std2 ) gain2
std = fan_mode
The two intermediate rows and the two last rows are respectively for the Glorot and He estimation.
The values for gain, fan_in, fan_out or fan_mode are the input size, output size and for the mode
one of them. This is different to the previous bounds on the gradient in the previous chapter:
by shriking the initial weights the inialization remains random and small. This helps a better
convergence by avoiding an early vanishing of the gradient when the weight value is near the
saturation of the activation function because of the "S shape" of most of them.

Dataloader for input and output

The data is loaded as pytorch tensors. Note that the class labels are optional and not used during
training, they could be dummy variables here.

��
��
��
��
3 "https://pytorch.org/docs/stable/nn.init.html"

208
��
��

Structure of the layers

An example of autoencoder is as follows for a non orthogonal projection related to pca.

��
��

��
��
��
��

Training function
The parameters are optimized with the following function for an Euclidean loss. For the python
code, the only difference with previously is the output (xi instead of yi ) of the network and the
optimizer Adam.

��
��

��

��
��
��
��

��
��
��
��
��
��
��
��
��

209
��
��
��
��
��
��
��
��
��
��
��
��

��
��

��

Optimizers for deep networks

The optimizer Adam is an alternative to the optimizer SGD presented in the previous chapters,
which performs well in practice when the network has many hidden layers in an high dimensional
space. This is currently the state of art for deep networks.

Several optimizers are available with pytorch natively (see also the available modules from repos-
itories), in the module4 "torch.optim", with for instance:
Optimizer Description
a stochastic gradient descent with minibatches and an op-
"SGD" tional momentum, extends the Robbins–Monro method
(1951)
the Adam algorithm (2014) is a recent popular stochas-
"Adam" tic gradient descent based on "adaptive estimates of lower-
order moments", it combines AdaGrad and RMSProp
the L-BFGS algorithm (1989), see the previous chapter for
"LBFGS"
an illustration
the "resilient backpropagation algorithm" (1992), it is a pi-
"RPROP" oneer approach for adaptive estimates and can perform well
(improved by RMSProp)

Results of the training

An exemple of training with the Adam optimizer for this autoencoder was stored in a ﬁle, it is
loaded as follows.
4 "https://pytorch.org/docs/stable/optim.html"

210
��
��

��

�→��

� ��

�→��

� ��
� ��
�→��

��

The reduction ĥi in the latent space from the bottleneck layer is retrieved for each data vector xi of
the dataset with "encoder()", the function implemented for this purpose:

��

��
The latent space is visualized as for pca.

Figure 8.6: Visualization from linear autoencoder for 9 classes

211
The output just before comes from the following python program.

��

The resulting projection is for this example very similar to pca, but there is one dimension of
pca which is compressed. A similar subspace than from pca is retrieved but with some linear
transformation. Note that the choice of structure for the network is not exactly corresponding to
pca also because there is an intermediate hidden layer before and after the bottleneck layer.

This is a known observation from the literature, that the autoencoder is able to retrieve pca despite
that it does not include the hypothesis of orthogonality. The numerical indicators allow to retrieve
these similarities if they are identical.

��

Nonlinear autoencoder and training

In this paragraph, we are interested on a non linear projection, hence let test another structure of
autoencoder with nonlinear activation functions. The resulting latent space should perform better
for the separation between the clusters than a linear model such that pca.

��
��
��
��
��
��

��
�→��

An exemple of training for this autoencoder (with nbmax_epoqs=1500 and lr=0.001) was stored
in a ﬁle, it is loaded.

��
�→��

212
��

��
�→��

��

Figure 8.7: Visualization from nonlinear autoencoder for 9 classes

��

The resulting model is able to separate better the projection points from some classes but there is
also some distortion among the clusters with ellipsoidal shapes and diverse orientations for their
main axis. With high dimensional data, the autoencoder is also able to perform better than pca
when associated with a post-processing via t-sne for biological data for instance.

Comparison of different visualizations with two indicators

The numerical results for the artiﬁcial dataset are summarized just below with the indicators Davis-
Bouldin and Silhouettes. Alternative quality indicators related to the nearest neighboors instead
are also of ﬁrst interest but not presented herein. For the two indicators considered, the best visu-
alization separates the best the true classes on its graphic and gets the best indicator values.

��

213
��

��

��
��
��
��
��
��
��

��

� ��
� ��
� ��
� ��
� ��
� ��
In summary, the autoencoders are able to outperform pca and even t-sne as expected, for some
datasets and some settings of the algorithms. Visually, if the indicators are better for this trained
autoencoder, it appears clearly on the graphics that some relative distances between the classes
are very different between the approaches. Hence some improvement is still required here for the
autoencoder (choice of number of layers, nodes and the activation functions plus the training). This
trained nonlinear autoencoder does not always keep the shape of the distribution of the classes from
the data space when its performs the projection with the encoder. On the contrary, pca is linear
thus the Gaussian shape of the classes remains with less varied distortions but also less separation
for the frontiers. A cross-validation may lead to a better training but is costly, see sklearn and its
module for mlp. Note that autoencoders and pca allow to project easily new data such as a test
sample. For t-sne this is less natural as one needs to freeze the former projection points or build an
external model such as a neural network for learning the former projection.
Next a real dataset is processed with the presented python class for autoencoders, this is a classical
dataset with grey images from handwritten digits (between 0 to 9) which is worth to consider as it
appears the most often in the deep learning literature for the comparison of algorithms.

8.2.3 Autoencoder for 60000 images

In this section, a large dataset of images which is currently often presented as an example is pro-
jected on a nonlinear latent space with an autoencoder. It is deﬁned with three hidden layers where
the middle one is the latent space for the projection. Before an incremental pca is implemented
with minibatches because the dataset is kept on the computer disk. The format of the data ﬁle is

214
the binary hdf5 compression which is still near the state of art until today for tabular data, at least
for a sequential access to its local contents.

Dataset (from the image ﬁles to one binary hdf5 ﬁle)

��

��
��
��
��
��
��
��

� ��

Let download the dataset5 with pytorch in a tensor format in order to save each image into a row
of a compressed data file. The first call to "MNIST()" takes the files from a remote server into the
chosen directory of the computer disk. All the next calls can access directly the local files until
they are eventually removed by the user.

��

��
��

��

��
��

We then save the images in a ﬁle hdf5.

��
��

5 "http://yann.lecun.com/exdb/mnist/"

215
��
��
��
��

��

��
��
��
��
��
��

��

��
��

Incremental PCA with sklearn and an hdf5 ﬁle

An incremental pca is obtained by cycling the ﬁle, and then calling a function from sklearn as
follows.
��
��
��
��
��
��
��
� � ��
� � ��
��
��

��

��
��
��
� � ��
��
� � ��

216
��
��
��
� � ��

��
��

��

��
��
��

��
��

��
��
We get a python object for the projection, hence we just project our dataset now.

��
��
��
� � ��
� � ��
��
� � ��

��

��
��
��

��

Here the size of the chunks divides with an integer the number of rows of the dataset, otherwise the
remaining rows must be projected too. The projection is directly saved in a ﬁle on the disk, with a
virtual array created from "memmap()" with numpy (see next after for more about this binary ﬁle
format).

217
��

��
��
��
��

��

��
��
��

��
��

��
��
The resulting projection is below, from the two ﬁrst principal components.

Figure 8.8: Visualization of two ﬁrst ipca components for 60000 images

This graphic is obtained from the following python program.

��

218
The overlap of the ten classes of digits on the projection is not informative and even prone to errors
of interpretation for the human who wants to understand the structure of the data cloud. This is
expected from images of digits which are often very similar. For instance, the digits 3 and 9 may
be both handwritten with open or close curves, and with only small differences from a writer to
another. Too many similar classes are not well separated with a linear projection. This is a limit
of pca which is just linear and not able to process well complex data clouds, the method needs
numerous latent variables instead of just a few ones.
The obtained projection points are all shown without filtering for the first ipca plane. Note that
in pca, there exists additional indicators called "contributions", "cosinus" and "qualities" which
allow to remove points from the projection plane. The pca plane is the scatter plot from two new
variables from pca, one by axis. When pca becomes a statistical exploratory method instead of just
an algorithm, some data are known to be not well represented on the plane. When too far from the
projection plane and with the worse indicator values, these points could be better shown by other
new variables from pca, the ones corresponding to smaller eigenvalues than the two leading ones.
The limit of this old approach is that for data with numerous variables an unknown number of
diverses reduced dimensions is required for each unknown class in order to retrieve the underlying
whole classification.
Current machine learning approaches prefer generally a nonlinear transformation of the variables,
a non Euclidean space or a neighboors graph. Pca is able to improve these methods by helping the
removal of the useless dimensions from the data table and the noises from its rows and its columns
which correspond to the smallest eigenvalues. Autoencoders are able to provide such improved
nonlinear view of the data in a flexible way. Its latent space may be able to separate better the
unkown classes (or "clusters") even in an unsupervised approach without the knowledge of the
true classification.

Nonlinear autoencoder with pytorch for a dataset with 60k rows

An autoencoder is trained for this dataset as follows. We propose to handle the dataloader with
the hdf5 file with extension ".h5" which is suitable for large datasets (or small computers). This is
performed by defining a new class extending "Dataset" from pytorch. The images are rows in the
matrix ’x’ and the labels are components in the vector ’y’ from the hdf5 file.

��
��
��
��
��
��
��
��
��
��
��

219
��
��
��

The dataloader is created as usually, but with this new object dataset for pointing to the rows in the
ﬁle.
��
��

��

We consider the function for splitting the sample into two sets, the one for training and the one for
testing.

� ��

��
��

��

With the data available via a dataloader, let define an example of autoencoder and train it. The en-
coder reduces the original size of the data vectors into a latent space which is then re-transformed
into the original data space with the decoder. This supposes the choice of the hyperparameters:
number of hidden layers, number of nodes for each layer and activation functions. For large
datasets and high dimensions, this supposes a powerful computer in order to train several neu-
ral networks and keep the best one.
For instance for mnist, from 784 dimensions (for an image with heigth 28 pixels and with width
28 pixels), the first layer induces a new reduced dimension equal to 512 before another layer for
a reduction to only 2 dimensions for visualization. This is the encoding phase, which is followed
by the decoding phase which is exactly the inverse. If the dimensions of the bottleneck are just
two or three the autoencoder is suitable for a direct visualization. With more than three nodes, the
purpose is a reduction hence another method for post-processing the latent space is required.
Here the autoencoder is implemented for a direct visualization as follows. We define the encoding
layers and the decoding layers with two python lists.

��

��
��
��
��
��

220
��
��
��
��
��
��
��

��
��
��
��
��
��
��
��
��
��
��
��
��

The last activation function is sigmoidal hence the output of the neural network belongs to [0; 1].
The components of the data vectors should belong to the same range.
This architecture looks too complex for this dataset with only 50000 data vectors for training so
many weights but has lead to a relevant low dimensional projection as a ﬁrst test. Note that the
number of parameters is:

��
��
��

��

The training is with the same minibatches than for training the neural networks for nonlinear gen-
eralized linear models, except that the variable yi is not used here. No monitor class is implemented
here and no loss for the test sample neither. The two lists of layers for encoding and decoding are
processed by the forward function in the python class modeling the autoencoder. The main differ-
ence for this new function "f_train_fromh5_autoencoder()" with the previous function tested with

221
the small artiﬁcial dataset, "f_train_autoencoder()", is the penalization for the loss function. The
id of the row is also eventually loaded because used in some cases like debugging: when shufﬂing
is true in the dataloader the order of the rows is lost. For adding other optional computations, see
for instance the function "f_train_glmr()" and the class "Monitor" as examples.

��

��
��
��
��
��
��
��
��

��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��
��

The optimizer here is Adam because it performs very well for large models and was able to reduce
the loss function automatically without additional advanced function for the learning rate. As seen
previously, this is not required for linear models which just need an optimizer SGD because of
the convexity of their loss function. The documentation of pytorch discusses this more advanced

222
optimizer and a large part of the literature on images processing where the recent optimizer Adam
has become one of the best optimizers currently. This illustrates that a clever combination of
different approaches, here two optimizers with "Adam ≈ AdaGrad+RMSProp", can achieve great
results.
The autoencoder is trained and its weights are stored in a computer ﬁle.

��
��

� ��
� ��
� ��

��
��
��

The trained model is loaded by retrieving the weights from the computer ﬁle and the weights of
the python class are ﬁlled with these values.

��
��
��

��

� ��
��

For a large dataset, one must be careful when computing the coordinates from the latent space by
cycling the dataset with the dataloader. This is the purpose of the following function.

��
��
��
��
��
��
��
��
��
��
��
��

223
�
��
��
��
��
��
��
��
��
��
��
��
��
��

The full dataset is projected here from the trained model with the train sample, otherwise, the dat-
aloader from the train sample or the test sample can be preferred eventually for validation purpose.
The resulting projection is as follows, after training the nonlinear autoencoder.

Figure 8.9: Visualization from nonlinear autoencoder for 60000 images

The python programs for this output is just below.

��

��
��

224
��
��
��
��

��

Then let compare the quality of the projection for the full sample after training.
A sample of the 60000 rows is drawn in order to get faster computation of the numerical indicators.

��

��
��
��

Visually, this trained autoencoder seems not performing better than pca for this dataset but also the
setting is not optimal. According to the literature, a more usual way to do is to choose a higher
dimension for the latent space and perform after a nonlinear reduction such as t-sne. Note also
that better architectures of autoencoder exist but often ask for far more intensive computations.
Convolution autoencoders which model the neighboors between the pixels are a ﬁrst step towards
a better reduction, this is out of the scope of tabular data.

8.3 Low dimensional reduction via t-sne

The criterion optimized by t-sne involves the Euclidean distances between the data vectors, thus,
a linear reduction is often better for these vectors in order to remove the noises and to reduce the
computer time for computing the distances. If pca is the usual method for the pre-processing,
random projection is a possible alternative method or an interesting complement.

8.3.1 Reduction with a random projection

For a large number of columns, a reduction is performed with a random projection.

The random projection method in brief

225
Random projection (rp) is a linear reduction method different from pca. It allows to approximate
the Euclidean distance between two vectors. Each data vector xi ∈ R p is reduced to x̂i ∈ Rq with a
matrix R ∈ Rq×p where q < p. The new vector is written as follows:

x̂i = R xi .

The reducing matrix may be sparse or dense. The cells of this matrix are random values generated
from a Gaussian distribution (or eventually another one). With the right choice for the dimension
of the reduction q, and a small ε depending on the value q, it is written that with high probability,

(1 − ε)�x̂i − x̂ j � ≤ �xi − x j �2 ≤ (1 + ε)�x̂i − x̂ j � .

It is recognized the same inequality than in the Johnson-Lindenstrauss lemma. A smaller ε asks
for a large dimension q, but for a large p one gets easily q � p with an enough small ε, such as the
gain is relevant. When the matrix R is sparse, the reductions are fast to compute for dense vectors xi
such as this version is said to be "database friendly", underlying that it is relevant for huge datasets.
On the contrary to pca, rp does not ask for any matrix decomposition, just to generate a matrix for
the reduction.

Implementation with memmap

The random projection is implemented with sklearn which generates the matrix R for a chosen
reduced dimension q. Also a warning appears if the corresponding approximated distances are
inaccurate in the lower dimension. For datasets which can’t fit in the computer memory, the trans-
formation is applied with chunks from a data file, in this paragraph.
The dataset which comes from a dataloader is stored in the computer disk in a data file, a binary
compressed file which can be read as a numpy array. This is a virtual array from the computer disk
instead of the computer memory, see "memmap()" from numpy. This format is such that there is no
need for loops and chunks to process the rows of the array if the resulting arrays fit the computer
memory: it allows linear algebra with other python objects of type numpy arrays and with the same
operators. But this may be slower than hdf5 and loops for some computations. Other formats of
compressed file exists for python, even with distributed architecture or better parallel processing,
but not always compatible with numpy.
In order to save the result from the transformation directly on the computer disk, a loop is imple-
mented. The linear algebra from numpy remains available, but applied to only subsets of rows
until completing the full transformation of the data matrix. The optional shuffling is set to false in
the dataloader for a sequential processing.

� ��

� � ��

� � ��
� � ��
� ��

226
� ��
� ��
� ��
� ��
� ��
� ��

� ��
� ��

� ��
� ��
� ��
� ��
� ��
� ��
� ��
� ��
� ��
� ��
� ��
� ��
� ��

� ��
� ��
� ��

The dependent and independent variables are stored in two separated ﬁles.

��
��

� � ��
� � ��

��

From the data ﬁle, a virtual numpy array is created via a function as below.

��

� ��

227
��
��
��
��

��
��
��
��

��

��
��

��

Let project with a random projection the space of the columns in a smaller dimension, 120 > 50,
where 50 is the usual value for pca before t-sne. A random projection is able to keep accurate the
distances between rows after the linear transformation when the reduced space is enough large.

��

��
� ��

� ��

��

The random projection allows to keep the distances while reducing the size of the vectors. Thus,
the dataset has it number of columns reduced (from 784 to 120) by the product to the right with
this matrix (after transpose). This is done with chunks/minibatches in order to proceed directly
with the data on the computer disk here.

228
��

��
��
��
��

��

��
��
��

The reduced matrix is stored in a new data memmap ﬁle with 120 variables instead of the previous
784 ones.

��
��

��
��
��
� � ��
��

The new reduced matrix is accessed via a python object, z_rp_mnist, without loading its whole
contents from the disk while the python object y_mnist remains available for the labels.

��

� � ��
� � ��

229
��
��

��

Note that with this format, we keep the size of the matrix in a separated textual ﬁle when the data
is stored, in order to provide the information when the data ﬁle is read later.
This reduced matrix is involved next, just after the projection with the 50 components from the
incremental pca, in order to compare the results.

8.3.2 Projection t-sne after pca

In this subsection, the dataset mnist is visualized with t-sne thanks to a fast implementation,
"opentsne". Other implementations are available with python. This one includes "barnes-hut t-sne"
for instance which is able to approximate t-sne in order to make t-sne scalable for large datasets.
Remember that the method is based on a sparse nearest neighboors graph which is mandatory for
a large dataset and a limited computer memory. By following the documentation, and accessing
the data directly from the computer disk thanks to the memmap ﬁle, one gets the following result:

��
��
��

��
��
��
��
��
�

��

��
��
��
��
��

��

The output from t-sne for mnist after the initialization from the 50 ﬁrst ipca components is as
follows.

230
� ��

� ��

� ��
� ��
� ��
� ��

��
��

Figure 8.10: Visualization from t-sne after ipca for 60000 images

The python code for the graphic just before is below, it is recognized the ten classes almost per-
fectly separated. The method t-sne allows an unsupervised visualization of the natural classes on
the contrary to pca for instance.

��
��
��
��
��

231
8.3.3 Projection t-sne after rp+pca
Here rp is for random projection. The reduction before t-sne is a random projection (120 reduced
dimensions) followed by an incremental pca (50 reduced dimensions). Hence, rp is just a prepro-
cessing step before ipca, while ipca is followed by t-sne at the end.

It is better to reduce and standardize the columns before the partial ﬁt because the whole dataset
is not available at the beginning. The random projection allows to accelerate pca computations but
with a small difference for the quality of the projection because of the approximate distances.

��
��

��

��
��
��
��

��

��
��
��

��

��
��
��
��
��
��
��
��

��

��
��
��
��
��
��
��
��

232
��
��
��
��

��

��
��
��

� ��

� ��
� ��
� ��
� ��

��
��

��
��
��
��

233
Figure 8.11: Visualization from t-sne after rp+ipca for 60000 images

The random projection is by deﬁnition and as its names says: "random". Hence several random
projections may be tried in order to get the best ﬁnal result. Some noise was added during the
reduction, resulting to some data points between the clusters. Several clusters have changed their
relative positions but the ten classes of digits are retrieved. This is informative because if two
points keep the same vicinity for two different random projections, that means that they belong to
the same cluster. Next the two t-sne projections are compared numerically with the previous ones.

8.3.4 Comparison of the visualizations

In summary, the four methods have performed as follows according to the numerical indicators.

��

��
��
��

��

��
��
��

234
��
��
��

��

� ��
� ��
� ��
� ��

Here, t-sne as a dedicated method for reduction and visualization outperforms this trained autoen-
coder but uses more information with the vicinity graph. Note that another recent method could be
used here instead of t-sne for visualization, such as u-map. Tuning further the hyperparameters or
the training should lead to better results but is time consuming for the computer. More advanced
autoencoder architectures may do better than t-sne, see the literature. Currently, autoencoders are
combinated with other neural networks with different architectures and have also been deﬁned for
other kinds of data such as time series, text, graph, or rgb images. The number of parameters is
high or even huge which makes them prone to overﬁtting, a problem mostly solved with very large
datasets and big workstations with powerful gpus.

8.4 Autoencoder associated with a glm

In this section, we are interested on autoencoder in the generalized regression models via alterna-
tive models. When the space of the variables is large it is usual to select the best subset via aic,
bic or lasso. In machine learning data are often complex and numerous, hence neural networks are
able to lead to new approaches with better results for these data.

Examples of extensions for regression/classiﬁcation with reduction

There exists alternative approaches to a glm within a neural network which remains nevertheless
the more usual way to do currently. In all cases, a penalization may be added, it is worth to try as
explained in the chapter on the lasso regression, otherwise other regularizations are available.

Approach Description Criterion

Neural network embedding the glm
Direct argminw F(x; w)
with no regularization and reduction
Neural network embedding the glm
argminw F(x; w) + αR(w)
with a penalty term for regularization
Autoencoder/reduction followed by glm
Plugin argminw F(argminh G(x; h); w)
by consecutive minimization of the losses
Autoencoder/reduction associated to glm
Combined argminw,h G(x; h) + λ F(h; w)
by minimization of a combination of losses

235
Example with one hidden layer in the autoencoder for linear regression
The model for the basic autoencoder with one hidden layer is as follows:

fθ (xi ) = gx (WxT gh (Wh xi , bh ), bx ) .

This is with two stages:

hi = h(xi ) = gh (Wh xi , bh )
x̂i = x̂(hi ) = gx (Wx hi , bx ) .
This leads to the three architectures from autoencoder to regression, with same expression for the
predictions, ŷi = β̂ T ĥi + b̂y = β̂ T gh (Ŵh xi , b̂h ) + b̂y , but three different loss functions to optimize.
Approach Criterion

Direct (β̂ , b̂y , Ŵh , b̂h ) = argmin ∑i �yi − β T h(xi ) − by �2

β ,by ,Wh ,bh

(Ŵx , Ŵh , b̂h , b̂x ) = argmin ∑i �xi − x̂(h(xi ))�2

Plugin Wx ,Wh ,bh ,bx
β̂ , b̂y = argmin ∑i �yi − β̂ T ĥ(xi ) − by �2
β ,by

Combined (Ŵx , Ŵh , b̂h , b̂x , β̂ , b̂y ) = argmin ∑i �xi − x̂(h(xi ))�2
Wx ,Wh ,bh ,bx ,β ,by
+λ ∑i �yi − β T h(xi ) − by �2
These models are not implemented here, another bias (or intercept) term may be added in some
cases for xi . The plugin and combined approaches count nearly each one two times the number
of parameters of the direct approach because of the decoding step, hence they may ask for more
data to train. They seem to be tested in the literature with more advanced autoencoders such as
for image processing for medical images and clinical data for instance. This is because, one may
have even additional data for the combination or plugin approach, by extending the vector xi but
with a part not reduced by the autoencoder. For instance, for images with additional descriptive
variables or even text, another network may process this other information and the result may be
merged with the reductions for the images. This idea is very convenient but the resulting loss is not
obvious to minimize because several networks are in stake. They ask for different optimizations
-with different learning rates and other hyperparameters- while being linked. See next section for
related exercices, not solved herein.

8.5 Exercices
1. (sklearn) Implement the autoencoder with sklearn for the small dataset and improve the
model by cross-validation.
2. (sklearn) For dataset mnist of the chapter, look for a better setting for the autoencoders
(number of hidden layers, number of neurons for each hidden layer, activation functions and

236
even type of hidden layers like linear or dropout). See the documentation of sklearn (or
torch.nn) for the available layers and activation functions.
3. (pytorch) Test further a sparse autoencoder and test some other architectures of autoencoder
available in the literature and code repositories. Check the generalization with two samples
via the two loss functions, and via the quality of the reduction for the test sample.
4. (pytorch+other) Compare the running time for alternative ways to cycle the dataset with the
minibatches from mnist, consider several ones among the following examples.
• Direct access from the image files in directories.
• Access to the images stored in one unique hdf5 file.
• Access to other file formats (numpy memmap, numpy npz, dask, TFRecord, pandas
csv, spark).
• Access without dataloader or with minibatches stored in separated files.
Check: if the access can be made multi-core in parallel, and validate by looking at the the
cpu cores stats during the running (command "top" with ubuntu).
Check: eventually by profiling the code and measuring the running time for loading the
minibatch from the computer hard disk to the cpu (and the gpu), with a profiler for python
or pytorch.
5. (opentsne) Propose a way to improve the projection from t-sne for mnist with the random
projection plus pca while keeping the data on the computer disk. For instance, several pa-
rameters need to be set, such that the dimension of the random projection.
6. (pytorch) Test the visualization of other datasets of tabular or images data, such that:
Many natively in pytorch "https://pytorch.org/vision/0.8/datasets.html"
102 Category Flower torchvision.datasets.Flowers102
CIFAR torchvision.datasets.CIFAR10
CelebFaces torchvision.datasets.CelebA
Traffic Sign Recognition torchvision.datasets.GTSRB
MedMNIST v2 - Biomedical "https://medmnist.com/"
MNIST-like fashion product "https://github.com/zalandoresearch/fashion-mnist"
Stanford Dogs Dataset "http://vision.stanford.edu/aditya86/ImageNetDogs/"
Because pca is doing well at reduction, the purpose is here to achieve at least an equivalent
result as pca at pre-processing for a k-dimensional reduction with k = 2 (or eventually more)
for the autoencoder. Images from some datasets need to be standardized in order to compare
accurately the pixels and their colors, hence testing a classification allows to check this.
7. (pytorch) Test a "Fisher" approach as in parametric likelihood models, by reducing the gradi-
ent vectors with an incremental pca at each epoch. Check the change of the first eigen vectors
and the first eigenvalues during training and for different choices of the hyperparameters.
8. (pytorch) Test the combination of an autoencoder with a glm for a dataset of images com-
pleted with descriptive variables. Examples of datasets are images of objects (cars, houses,
etc) plus a few descriptive variables with the price to be predicted at "kaggle.com". Another

237

d2l en PDF
No ratings yet
d2l en PDF
996 pages
Deep Learning
No ratings yet
Deep Learning
26 pages
Dive Into Deep Learning
No ratings yet
Dive Into Deep Learning
1,021 pages
Deep Learning for Developers
100% (1)
Deep Learning for Developers
1,029 pages
d2l en
No ratings yet
d2l en
1,027 pages
d2l en Pytorch
No ratings yet
d2l en Pytorch
979 pages
d2l en Pytorch
No ratings yet
d2l en Pytorch
977 pages
R3 - Dive Into Deep Learning - Zhang Lipton Li Smola
100% (1)
R3 - Dive Into Deep Learning - Zhang Lipton Li Smola
1,025 pages
Dive Into Deep Learning
No ratings yet
Dive Into Deep Learning
837 pages
DeepL OK 2022 Dive Into DL With PyTorch d2l Nov.2022
No ratings yet
DeepL OK 2022 Dive Into DL With PyTorch d2l Nov.2022
975 pages
d2l en Pytorch
No ratings yet
d2l en Pytorch
988 pages
Dive Into Deep Learning
No ratings yet
Dive Into Deep Learning
997 pages
Deep Learning Guide: Installation to MLPs
No ratings yet
Deep Learning Guide: Installation to MLPs
986 pages
d2l en PDF
No ratings yet
d2l en PDF
995 pages
d2l en
No ratings yet
d2l en
981 pages
Textbook
No ratings yet
Textbook
161 pages
Dive Into Deep Learning
100% (3)
Dive Into Deep Learning
291 pages
Deep Learning for Beginners
No ratings yet
Deep Learning for Beginners
894 pages
Cs181 Textbook
No ratings yet
Cs181 Textbook
163 pages
6 390 Lecture Notes Fall24
No ratings yet
6 390 Lecture Notes Fall24
146 pages
Machine Learning Course Notes
No ratings yet
Machine Learning Course Notes
112 pages
Deep Learning Guide: Installation & Basics
No ratings yet
Deep Learning Guide: Installation & Basics
651 pages
6 390 Lecture Notes Spring24
No ratings yet
6 390 Lecture Notes Spring24
144 pages
Deep Learning Guide: Concepts & Implementation
100% (3)
Deep Learning Guide: Concepts & Implementation
661 pages
Undergraduate Fundamentals of Machine Learning Author William J. Deuschle
No ratings yet
Undergraduate Fundamentals of Machine Learning Author William J. Deuschle
143 pages
SML Book Draft Latest
No ratings yet
SML Book Draft Latest
275 pages
Dive Into Deep Learning
No ratings yet
Dive Into Deep Learning
883 pages
D2l-En Deep Learning PDF
No ratings yet
D2l-En Deep Learning PDF
639 pages
Undergraduate Fundamentals of Machine Learning
No ratings yet
Undergraduate Fundamentals of Machine Learning
163 pages
Dive Into Deep Learning
No ratings yet
Dive Into Deep Learning
972 pages
Dive Into Deep Learning: Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola
No ratings yet
Dive Into Deep Learning: Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola
987 pages
Dive Into Deep Learning
No ratings yet
Dive Into Deep Learning
60 pages
Deep Down About d2l
No ratings yet
Deep Down About d2l
922 pages
6036 Lecture Notes
No ratings yet
6036 Lecture Notes
56 pages
Machine Learning Simplified
100% (1)
Machine Learning Simplified
109 pages
Deep Learning Cours
No ratings yet
Deep Learning Cours
165 pages
d2l en
No ratings yet
d2l en
1,022 pages
d2l en PDF
100% (1)
d2l en PDF
1,024 pages
SML Book Draft Latest
No ratings yet
SML Book Draft Latest
194 pages
CS230
No ratings yet
CS230
101 pages
Machine Learning The Basics
No ratings yet
Machine Learning The Basics
158 pages
Machine Learning Basics Guide
100% (1)
Machine Learning Basics Guide
124 pages
6.036 Notes
No ratings yet
6.036 Notes
99 pages
SML Book Draft Latest (001 046)
No ratings yet
SML Book Draft Latest (001 046)
46 pages
Machine Learning - A First Course For Engineers and Scientists
No ratings yet
Machine Learning - A First Course For Engineers and Scientists
348 pages
Summary FS24
No ratings yet
Summary FS24
63 pages
Machine Learning Summary
No ratings yet
Machine Learning Summary
38 pages
Main Notes
No ratings yet
Main Notes
227 pages
Coursera Machine Learning Specialization
No ratings yet
Coursera Machine Learning Specialization
46 pages
The Hundred-Page Machine Learning Book-Andriy Burkov (2019) - Removed
No ratings yet
The Hundred-Page Machine Learning Book-Andriy Burkov (2019) - Removed
145 pages
Machine Learning Mathematics in Python - Jamie Flux - 2024
No ratings yet
Machine Learning Mathematics in Python - Jamie Flux - 2024
238 pages
CS229: Machine Learning Notes
No ratings yet
CS229: Machine Learning Notes
241 pages
Drew 3 Hematology Analyzer
No ratings yet
Drew 3 Hematology Analyzer
2 pages
Google Takeout
No ratings yet
Google Takeout
13 pages
Active Duty & Reserve "A" School Guide
No ratings yet
Active Duty & Reserve "A" School Guide
31 pages
Chapter 3 Network File System
No ratings yet
Chapter 3 Network File System
41 pages
Mathematics As A Tool: at The End of The Fourth Module, You Must Have
No ratings yet
Mathematics As A Tool: at The End of The Fourth Module, You Must Have
5 pages
Process Domain 2 - Procurement
100% (1)
Process Domain 2 - Procurement
56 pages
SignedBirthCertificate 8318846 335
No ratings yet
SignedBirthCertificate 8318846 335
1 page
Zinc Web Vulnerabilities L1 Scan
No ratings yet
Zinc Web Vulnerabilities L1 Scan
134 pages
Interface Register
100% (5)
Interface Register
9 pages
ISO 27001 2022 Information Security Management System ISMS Policy Template
No ratings yet
ISO 27001 2022 Information Security Management System ISMS Policy Template
18 pages
Capstone Project Circular
No ratings yet
Capstone Project Circular
4 pages
OM Golive FINGate2.0
No ratings yet
OM Golive FINGate2.0
1 page
Batchrpc
No ratings yet
Batchrpc
56 pages
YCT 2024 25 RRB Knowledge TCS Pattern Theory PYQs English Medium
No ratings yet
YCT 2024 25 RRB Knowledge TCS Pattern Theory PYQs English Medium
222 pages
ADR111E
No ratings yet
ADR111E
14 pages
Theroadtosuccsess Chapterand2
No ratings yet
Theroadtosuccsess Chapterand2
62 pages
CBNST 1
No ratings yet
CBNST 1
11 pages
Criminology Refresher Reviewer
No ratings yet
Criminology Refresher Reviewer
99 pages
Robot Control Part 1: Forward Transformation Matrices: Learning The Inverse Kinematic Transformation
No ratings yet
Robot Control Part 1: Forward Transformation Matrices: Learning The Inverse Kinematic Transformation
8 pages
Mimu JD Procurement Cum Fleet Officer WWF-MM
No ratings yet
Mimu JD Procurement Cum Fleet Officer WWF-MM
4 pages
7 Day Electronic Timersheet
100% (1)
7 Day Electronic Timersheet
3 pages
SEO Content Writer & Editor Profile
No ratings yet
SEO Content Writer & Editor Profile
1 page
Ifr Com120b Training
No ratings yet
Ifr Com120b Training
78 pages
Module 5
No ratings yet
Module 5
67 pages
Vaibhav
No ratings yet
Vaibhav
1 page
OCIP E&I Estimation Guide Line
No ratings yet
OCIP E&I Estimation Guide Line
10 pages
TOC - Sep - 2024
No ratings yet
TOC - Sep - 2024
2 pages
CMP A2f
No ratings yet
CMP A2f
2 pages
Computer Science Exam Paper SET 4
0% (1)
Computer Science Exam Paper SET 4
13 pages
MEASURES OF CENTRAL TENDENCY Grouped Data
No ratings yet
MEASURES OF CENTRAL TENDENCY Grouped Data
20 pages

Book Pytorch Scikit Learn Numpy

Uploaded by

Book Pytorch Scikit Learn Numpy

Uploaded by

Linear and Deep Models Basics

with Pytorch, Numpy, and Scikit-Learn

Copyright © 2022 Rodolphe Priam. All rights reserved.

2 Linear models with numpy and scikit-learn 23

3 First-order training of linear models 55

4 Neural networks for (deep) glm 75

5 Lasso selection for (deep) glm 107

6 Hessian and covariance for (deep) glm 133

7 Second-order training of (deep) glm 157

9 Solution to selected exercices 239

1.1 True polynomial curve and train sample . . . . . . . . . . . . . . . . . . . . . . . 12

2.1 True regression line and sample with 10 observations. . . . . . . . . . . . . . . . . 28

3.1 Decreasing loss function during training, per epochs . . . . . . . . . . . . . . . . . 60

4.1 Simple neural network for a linear model. . . . . . . . . . . . . . . . . . . . . . . 76

5.1 Solutions for each penalty function alone with p = 2. . . . . . . . . . . . . . . . . 108

8.1 Visualization of 3 variables among 10 for 9 classes. . . . . . . . . . . . . . . . . . 199

9.1 Loss function from sequential training with sklearn . . . . . . . . . . . . . . . . . 254

Datasets and python codes available for download

Included (chapters, exercices, solutions) and not included (everything else)

How to run the algorithms

xi = (xi1 , xi2 , · · · , xip )T ∈ R p and yi ∈ Rq (with often q = 1) or yi ∈ {0, 1, · · · , q} (with q ∈ N+

The dataset is a data sample with n observations,

sn = {(y1 , x1 ), (y2 , x2 ), · · · , (yi , xi ), · · · , (yn , xn )} .

log L(θ ) = log f (y1 , y2 , · · · , yi , · · · , yn ; x1 , x2 , · · · , xi , · · · , xn , θ )

And, with the summation sign sigma or sign sum,

θ̂ = argmaxθ log L(θ ) .

1.2.4 Bias and variance

1.3 Numerical example with a nonlinear curve

Which leads to:

This leads to:

1.3.2 Polynomial regression

In a matrix format, one gets that:

• Optimization, the optimal solution searches for the best parameters.

1.3.3 Neural network

σ (xiT W (1) ) → ŷi ,

Thus ﬁnally, this is another optimization problem:

1.3.4 Numerical illustration

������ �� ���� � ���� �

��� ������������������ ���������� � ���� ������� � ������

�������������������� �������������� ������������� �������������

���� �� ���� �� ���� �� ���� ��

Thus, the ﬁles are loaded.

�������������������� �������������� ������������� �������������

����� ����� ����� �����

Figure 1.1: True polynomial curve and train sample

Figure 1.2: True polynomial curve and test sample

Simple regression ﬁtting

������� ����� ������ ��������

� � ���������� ��� ������������������

ŷi = β̂0 + β̂1 xi

���������� � ����������� � ����������� � ������

���� �������������� ������ ���������

��� ���������������������� �������

����� ���� � ������������������������������ ����������������

���� �������������� ������ ���������

������������������� � � ������������������������������� ���������

������������������ � � ������������������������������� �������� ���������

��� �� ������ �� ��������������������

The resulting estimations are visualized and compared numerically.

������ ����������������� �� ���

Figure 1.3: Several ﬁtted polynomial curves and train sample

������� ��������� ���������

������ ����������������� �� ���

Figure 1.5: Error on the train and test samples

ŷi = β̂0 + β̂1 xi + β̂2 xi2 + β̂3 xi3 + β̂4 xi4

1.4 Glm as neural networks

This also induces:

1.5 Why learn pytorch

Linear models with numpy and scikit-learn

yi = β0 + β1 xi1 + β2 xi2 + · · · + β j xi j + · · · + β p xip + εi .

sn = {(y1 , x1 ), (y2 , x2 ), · · · , (yn , xn )} .

2.1 Linear regression

2.1.1 Probabilistic interpretation and model

˜ , σ ) = n log 2πσ 2 + n �(β )

log p(y; X, β , σ ) = log ∏ni=1 p(yi ; xiT β , σ

2.1.2 Example of simple linear regression (data)

��

��

��

��

��

��

��

� � ��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

� � � ��

��

��

��

��

��

��

��

��

��

��

��

��

��

��