Machine Learning Notes
Andrea Passerini
a.y. 2022/2023
Stefano Genetti
Vittoria Ossanna
Giovanni Valer (LaTeX revision)
These notes are for the course "Machine Learning" held by professor Andrea Passerini. They do not directly cover the laboratory part; still, while discussing the theory, many concepts applicable to the laboratory part are covered.
We tried to include most of the material used during the lectures or given as reference. Our aim was to create notes that allow the reader to get a clear overview of all the topics and to arrange them in a structured manner, while also being fairly exhaustive.
We tried our best to avoid mistakes while compiling these notes. That being said, a very broad range of topics had to be covered, hence we do not take responsibility for any mistake or imperfection that might be present; still, it would be much appreciated if you reported any of those to us so that we can fix the notes accordingly.
In this GitHub repository you can find the LaTeX source code: you are free to clone the project and modify it yourself, or just create an issue to suggest topics and corrections.
We hope that this document will help in the study of this beautiful and exciting
subject.
If you find this material valuable in preparing for the exam, consider offering us a coffee (-:
Paypal: vittossanna@gmail.com
Contents

1 Introduction
  1.1 Design of a machine learning system
  1.2 Learning settings
      1.2.1 Types of learning methods
      1.2.2 Types of supervised learning tasks
      1.2.3 Types of unsupervised learning tasks
      1.2.4 Types of learning algorithms
      1.2.5 Discriminative and Generative Models

2 Decision Trees
  2.1 Learning decision trees
      2.1.1 Choosing the best attribute
  2.2 Issues in decision tree learning
      2.2.1 Overfitting avoidance
      2.2.2 Dealing with continuous-valued attributes
      2.2.3 Alternative attribute test measures
      2.2.4 Handling attributes with missing values
  2.3 Random forest
      2.3.1 Bootstrap sampling and bagging
      2.3.2 Training
      2.3.3 Testing

3 K-nearest neighbors
  3.1 Characteristic of kNN learning

4 Linear algebra
  4.1 Matrix derivatives
  4.2 Metric structure
  4.3 Dot product
  4.4 Eigenvalues and eigenvectors
  4.5 Eigen-decomposition
  4.6 Understanding eigendecomposition
      4.6.1 Scaling transformation in standard basis
      4.6.2 Scaling transformation in eigenbasis

5 Probability theory
  5.1 Discrete random variables
      5.1.1 Probability distributions for discrete variables
      5.1.2 Pairs of discrete random variables
  5.2 Conditional probabilities
  5.3 Continuous random variables
      5.3.1 Probability distributions for continuous variables
  5.4 Probability laws
  5.5 Information theory

6 Evaluation
  6.1 Binary classification
  6.2 Multiclass classification
  6.3 Regression
  6.4 Performance estimation
      6.4.1 Hold-out procedure
      6.4.2 k-fold cross validation
  6.5 Hypothesis testing
      6.5.1 Glossary
      6.5.2 A useful digression to better understand hypothesis testing
      6.5.3 Hypothesis testing when the variance is unknown
      6.5.4 Some properties of the t distribution
  6.6 Comparing learning algorithms
      6.6.1 t-test example: 10-fold cross validation

7 Parameter estimation
  7.1 Maximum-likelihood estimation
      7.1.1 Maximizing the log-likelihood
  7.2 Bayesian estimation
      7.2.1 Univariate normal case: unknown µ and known σ
      7.2.2 Generalization of univariate case
  7.3 Sufficient statistics
  7.4 Conjugate priors
      7.4.1 Example of a Bernoulli distribution
      7.4.2 Example with a multinomial distribution

8 Bayesian Networks
  8.1 Example of Bayesian Network: toy regulatory network
  8.2 Conditional independence
      8.2.1 d-separation
      8.2.2 BN independences revisited
      8.2.3 BN equivalence classes
  8.3 I-maps vs distributions

9 Inference in BN
  9.1 Inference in chain-like structure
      9.1.1 Inference without evidence
      9.1.2 Inference adding evidence
  9.2 Inference in trees-like structure
      9.2.1 The sum-product algorithm
      9.2.2 Most probable configuration, sum-product algorithm
      9.2.3 Exact inference on general graphs

10 Learning BN
  10.1 Maximum likelihood estimation, complete data
      10.1.1 Adding priors
  10.2 Maximum likelihood estimation, incomplete data
      10.2.1 Expectation-Maximization
      10.2.2 Expectation-Maximization algorithm: e-step
      10.2.3 Expectation-Maximization algorithm: m-step
  10.3 Learning structure of graphical models
Introduction
2. collect data
3. extract features
4. choose the class of learning models
5. train the model
Collect data
A machine learning system requires data, collected in a machine-readable format, on which the algorithm will rely for training. This is a delicate and difficult process because it may require manual intervention for labelling the data.
Not every machine learning technique requires annotated data (e.g. semi-supervised and unsupervised learning).
Extract features
The process of feature extraction consists in representing the data in a form a computer can work with, i.e. encoding the raw data into representative features. Prior knowledge is often needed in order to extract significant features.
The number of features chosen has a relevant impact on performance: too many features may require a much larger amount of training data, while too few may miss relevant information and decrease performance. Moreover, by taking into account too many features, there is the risk of including noisy features which make the learning problem harder.
Overfitting
Overfitting is a modeling error that occurs when a function is too closely aligned to a limited set of data points. It typically shows up when performance on the training set is very high while performance on a validation set is noticeably lower: this means that training has also fit the noise present in the training data.
Supervised learning
The learner is provided with a set of input and output pairs (labelled data)
(xi , yi ) ∈ X × Y.
The learned model f : X → Y should map input samples into the proper
output. A domain expert is often required in order to get labelled inputs.
Unsupervised learning
The learner is provided with a set of inputs (xi) ∈ X but no annotation.
The learner models the training examples, e.g. by grouping them into clusters according to their similarity. These algorithms discover hidden patterns or data groupings without the need for human intervention.
Semi-Supervised learning
The learner is provided with a set of input and output pairs (labelled data)
(xi, yi) ∈ X × Y and a typically much bigger set of unlabelled data (xi) ∈ X.
The learned model f : X → Y should map input samples into the proper
output like in supervised learning while unlabelled data can be exploited
to improve performances (e.g. improve boundaries).
Reinforcement learning
The learner is provided with a set of possible states S and, for each of them, a set of possible actions A available to move to the next state. Performing an action a from a state s provides a reward r(s, a) that can be either immediate or delayed.
The goal is to learn a policy that allows the agent to choose actions maximizing the overall reward. The model often needs to trade off exploitation (performing moves that are known to be rewarding) and exploration (performing new moves) in order to avoid getting stuck in a local optimum.
Binary classification
The goal of binary classification is to separate samples into two subsets (often referred to as positives and negatives).
Multiclass classification
Multiclass classification assigns a sample to one class chosen among n > 2 different classes.

f : R^d → {1, 2, . . . , n}    (1.1)
Multilabel classification
The multilabel classification assigns a sample to a subset of possible classes
(not a unique assignment). E.g. predict the topics of a text.
Regression
This type of problem requires assigning a real value to a sample.

f : R^d → R    (1.2)
Clustering
Clustering means dividing data into homogeneous groups according to some chosen similarity measure.
Novelty detection
Tasks in which we are trying to detect anomalies. It consists in studying the standard behaviour of a system in order to detect novel or unusual events.
Decision Trees
Decision tree
A decision tree encodes a logical formula expressed in disjunctive normal form. It consists of a disjunction of conjunctions of constraints over attribute values. Each path from the root to a leaf is a conjunction of the constraints specified in the nodes along it. The leaf contains the label to be assigned to instances reaching it.

Given a decision tree, such as the one reported in Figure 2.1, the disjunction of all paths is the logical formula represented by the tree.
Figure 2.2: Split node training set into children according to the value of the chosen
attribute
3. Split node training set into children according to the value of the chosen
attribute
4. Stop splitting a node when a stopping condition is met. For example, the
node which has been reached contains examples from a single class (all
examples have the same label, they are homogeneous), or there are no
more attributes to test.
Remark: if the current set of examples is not homogeneous but we stop
splitting for other reasons, we assign to the leaf the majority class of the set.
In essence, the learning algorithm recursively partitions the training set and
decides whether to grow leaves or non-terminal nodes.
Information gain
Information gain is the expected reduction of entropy obtained by parti-
tioning a set S according to the value of a certain attribute A.
IG(S, A) = H(S) − \sum_{v ∈ Values(A)} (|S_v| / |S|) H(S_v)    (2.2)
Information gain is maximized by attributes that split the dataset into many small, homogeneous subsets (e.g. a unique identifier). The aim is to discourage the choice of this kind of attributes to perform the splits. To do this, the gain ratio is introduced as an alternative to the information gain measure introduced above.

IGR(S, A) = IG(S, A) / H_A(S)    (2.4)
In this case, splitting according to unique identifiers would result in high values of the entropy with respect to the attribute values (H_A(S)), which leads to low values of the gain ratio (IGR(S, A)).
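A minimal Python sketch of these measures (the toy attribute names and data below are made up for illustration) might look as follows:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy H(S) of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, attribute):
    """IG(S, A): expected reduction of entropy when splitting on `attribute`.
    `examples` is a list of dicts mapping attribute names to values."""
    total = len(labels)
    ig = entropy(labels)
    for v in set(ex[attribute] for ex in examples):
        subset = [y for ex, y in zip(examples, labels) if ex[attribute] == v]
        ig -= (len(subset) / total) * entropy(subset)
    return ig

def gain_ratio(examples, labels, attribute):
    """IGR(S, A) = IG(S, A) / H_A(S), penalizing many-valued attributes."""
    split_entropy = entropy([ex[attribute] for ex in examples])
    if split_entropy == 0.0:
        return 0.0
    return information_gain(examples, labels, attribute) / split_entropy

# toy usage: 'outlook' gets a higher gain ratio than the unique 'id' attribute
data = [{"outlook": "sunny", "id": 1}, {"outlook": "sunny", "id": 2},
        {"outlook": "rain", "id": 3}, {"outlook": "rain", "id": 4}]
y = ["no", "no", "yes", "yes"]
print(information_gain(data, y, "outlook"), gain_ratio(data, y, "outlook"))
print(information_gain(data, y, "id"), gain_ratio(data, y, "id"))
```

Note how the identifier attribute obtains the same information gain as "outlook" but a lower gain ratio, which is exactly the behaviour motivated above.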
Figure 2.3: Dealing with missing attributes at test time. Complex solution.
Bagging:
1. Create k bootstrap datasets Di
2.3.2 Training
Random forests are ensembles of decision trees. Each tree is typically trained on a bootstrapped version of the training set (sampled with replacement, i.e. independently drawn samples). Each decision tree is independently trained on its own bootstrapped version of the training set. At each node, the splitting function is optimized on m randomly sampled features: this helps obtaining decorrelated decision trees. At the end of training, a forest of T trees is generated.
2.3.3 Testing
1. Test the example with each tree in the forest.
2. Return the majority class among the predictions.
More formally, given a set of trees Q = {t_1, ..., t_T}, the final prediction of the forest is obtained by averaging the predictions of the trees in the ensemble:

f_Q(x) = (1/T) \sum_{j=1}^{T} f_{t_j}(x)    (2.5)
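A compact sketch of bootstrap sampling and majority voting at test time; the `_StubTree` class below is a hypothetical stand-in for an already-trained decision tree exposing a `predict` method (an assumption made only to keep the example runnable):

```python
import random
from collections import Counter

def bootstrap(dataset, rng=random):
    """Sample |dataset| examples with replacement (one bootstrap replicate)."""
    return [rng.choice(dataset) for _ in range(len(dataset))]

def forest_predict(trees, x):
    """Majority vote among the predictions of the trees in the forest."""
    votes = Counter(tree.predict(x) for tree in trees)
    return votes.most_common(1)[0][0]

class _StubTree:
    """Hypothetical stand-in for a trained decision tree (illustration only)."""
    def __init__(self, label):
        self.label = label
    def predict(self, x):
        return self.label

forest = [_StubTree("A"), _StubTree("A"), _StubTree("B")]
print(forest_predict(forest, x={"f1": 0.3}))  # -> "A"
```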
Chapter 3
K-nearest neighbors
Remark: kNN is a lazy algorithm because it does not use the training data
points to make any generalisation. This implies:
• You expect little to no explicit training phase,
• The training phase is pretty fast,
• kNN keeps all the training data since they are needed during the testing
phase.
Given a training set, each point has its own area of influence, i.e. the region in which new samples are most likely to receive the same label as that point. Training data can be positioned in a multidimensional space, so the concept of distance we use is not necessarily the "trivial" one: the Euclidean distance is not always the best option. The k in kNN is the number of closest neighbours we consider (e.g. if k = 3 and two of them belong to a class A while only one belongs to a class B, then the point we are trying to classify will be assigned to class A, regardless of the position of the three points in the training set).
Euclidean distance between two points x and y:

d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i − y_i)^2 }

For classification, the predicted label is the majority class among the k nearest neighbours:

argmax_y \sum_{i=1}^{k} δ(y, y_i)    (3.1)

For regression, the prediction is the average of the targets of the k nearest neighbours:

(1/k) \sum_{i=1}^{k} y_i
• instance-based learning: the model is calibrated on the test example to be processed.
Instance-based learning (sometimes called memory-based learning) is a
family of learning algorithms that, instead of performing explicit general-
ization, compare new problem instances with instances seen in training,
which have been stored in memory. In other words, these methods construct
hypotheses directly from the training instances themselves. This means
that the hypothesis complexity can grow with the data: in the worst case,
a hypothesis is a list of n training items and the computational complexity
of classifying a single new instance is O(n). One advantage that instance-
based learning has over other methods of machine learning is its ability to
adapt its model to previously unseen data. Instance-based learners may
simply store a new instance or throw an old instance away.
Because computation is postponed until a new instance is observed, these
algorithms are sometimes referred to as lazy.
• lazy learning: computation is mostly deferred to the classification phase (the training phase is thus very fast)
• local learner: assumes prediction should be mainly influenced by nearby
instances (an area is affected by a class)
• uniform feature weighting: all features are uniformly weighted when computing distances; moreover, in plain kNN every neighbour contributes equally to the prediction, regardless of its distance from the query point.
Regarding the last point, a refinement of the kNN algorithm is available: each neighbour is given a weight proportional to the inverse of its distance from the query point.
For the classification problem, the selection of the majority vote is:
argmax_y \sum_{i=1}^{k} w_i δ(y, y_i)    (3.2)
where
w_i = 1 / d(x, x_i)    (3.4)
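A minimal distance-weighted kNN classifier in Python (a sketch of the scheme above; the toy 2D points and names are illustrative):

```python
import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train, query, k=3, weighted=True):
    """train: list of (x, label) pairs; query: feature vector to classify."""
    # sort the training points by distance from the query and keep the closest k
    neighbours = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
    votes = defaultdict(float)
    for x, label in neighbours:
        d = euclidean(x, query)
        votes[label] += 1.0 / d if (weighted and d > 0) else 1.0
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((1.0, 1.0), "B")]
print(knn_predict(train, (0.2, 0.1), k=3))  # -> "A"
```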
Chapter 4
Linear algebra
Group
In abstract algebra, a group is a set G on which an operation ∗ is defined. Moreover, ∗ satisfies the following properties:
1. associativity: (a ∗ b) ∗ c = a ∗ (b ∗ c) ∀a, b, c ∈ G
2. existence of an identity element e ∈ G such that a ∗ e = e ∗ a = a ∀a ∈ G
3. existence of an inverse: for every a ∈ G there exists a^{−1} ∈ G such that a ∗ a^{−1} = a^{−1} ∗ a = e
Commutative group
A group (G, ∗) is commutative iff a ∗ b = b ∗ a ∀a, b ∈ G.
Vector space
A vector space over the field R is an algebraic structure composed by
a commutative group (V, +), whose elements are called vectors, and a
function f : R × V → V called scalar multiplication which satisfies the
properties listed below. It is common to use kv, instead of f (k, v) to
represent the scalar multiplication between a scalar k ∈ R and a vector
v ∈V.
• distributivity over vector sum: k(v_1 + v_2) = k v_1 + k v_2
• distributivity over scalar sum: (k_1 + k_2)v = k_1 v + k_2 v
• associativity of scalar multiplication: (k_1 k_2)v = k_1(k_2 v)
• identity: 1 · v = v
Vector space examples are: the set of two-dimensional vectors V 2 , the set of
three-dimensional vectors V 3 , the matrix space Mm,n (R).
Subspace
Given a vector space V defined over the field R, a non-empty set W ⊆ V is a subspace of V if it is closed w.r.t. the sum of vectors and the scalar product:

∀w, w′ ∈ W, ∀k ∈ R: w + w′ ∈ W ∧ kw ∈ W
Each subspace is in turn a vector space with respect to the operations
defined in V .
Linear combination
Given a vector v ∈ V , v is a linear combination of the vectors v1 , ..., vk ∈ V
with coefficients ci ∈ R if:
v = c_1 v_1 + ... + c_k v_k = \sum_{i=1}^{k} c_i v_i
Span
The span of vectors v1 , ..., vk is defined as the set of their linear combina-
tions:
{ \sum_{i=1}^{k} c_i v_i , c_i ∈ R } = ⟨v_1, ..., v_k⟩
Linear independence
A set of vectors v_1, ..., v_k is linearly independent if

a_1 v_1 + a_2 v_2 + ... + a_k v_k = 0

if and only if all the coefficients a_1, ..., a_k are null. Intuitively, a set of vectors v_1, ..., v_k is linearly independent if none of them can be written as a linear combination of the others.
A set of vectors v_1, ..., v_k is linearly dependent if it is not linearly independent. As a consequence, a set of vectors v_1, ..., v_k is linearly dependent if there exist some non-null scalars a_1, ..., a_k such that:

a_1 v_1 + a_2 v_2 + ... + a_k v_k = 0
Basis
A set of vectors B = {v1 , ..., vk } is a basis for V if any element in V can
be uniquely written as a linear combination of vectors vi . A necessary
condition is that vectors vi are linearly independent. All bases of V have
the same number of elements, called the dimension of the vector space.
Linear maps
Given two vector spaces V and V′, a function f : V → V′ is a linear map if ∀a_1, a_2 ∈ R, v_1, v_2 ∈ V:

f(a_1 v_1 + a_2 v_2) = a_1 f(v_1) + a_2 f(v_2)
• Given that the n columns of A are the vectors TC (T (uj )) obtained from
the coordinates of the images T (u1 ), ..., T (un ), with respect to basis C.
f(u_i) = \sum_{j=1}^{m} a_{ji} v_j
• Putting all together we notice that actually the image of f (x) can be
written as a linear combination of the elements of the basis C. As a
consequence, f (x) belongs to the vector space V ′ .
f(x) = \sum_{i=1}^{n} λ_i f(u_i) = \sum_{i=1}^{n} \sum_{j=1}^{m} λ_i a_{ji} v_j = \sum_{j=1}^{m} ( \sum_{i=1}^{n} λ_i a_{ji} ) v_j = \sum_{j=1}^{m} µ_j v_j
The matrix A ∈ Mmn (R) defined above is called the matrix associated with
the linear map T with respect to the basis B, C. A is typically indicated with
the symbol MBC (T ). If V = V ′ and B = C, then we write MB (T ). If V = Rn ,
V ′ = Rm and we consider the canonical basis of the two vector spaces, then we
simply write M (T ).
Figure 4.1: Example of a computation of the matrix associated with a linear function
Remark: The matrix P is the matrix associated with the identity function with respect to the bases B and B′: P = M_B^{B′}(id_V). From this observation
we can derive the following consequence. If the vector v ∈ V has coordinates
x1 , ..., xn with respect to basis B, and coordinates x′1 , ..., x′n with respect to basis
B ′ , then:
x′ = P x (4.1)
where P is the matrix of basis transformation from B to B ′ . We propose an
example of basis transformation in R2 in Figure 4.2.
Transpose
Matrix obtained exchanging rows with columns (indicated with M T ).
Given two matrices M ,N and their transposes M T , N T , the following
relevant property holds:
(M N )T = N T M T
Trace
Sum of diagonal elements of a matrix.
tr(M) = \sum_{i=1}^{n} M_{ii}
Inverse
The matrix which multiplied with the original matrix gives the identity.
M M −1 = I
Rank
The rank of a n × m matrix is the dimension of the space spanned by
its columns. Consider a matrix A m × n. In the matrix there are n
columns A1 , ..., An . It is possible to demonstrate that the column space
< A1 , ..., An > has dimension equal to the rank of the matrix A.
Note: Results are column vectors. Transpose them if row vectors are needed
instead.
Norm
A norm is a function || · || : X → R_0^+ such that, for all x, y ∈ X, λ ∈ R:
• ||x + y|| ≤ ||x|| + ||y||
• ||λx|| = |λ| ||x||
• ||x|| > 0 if x ≠ 0
Metric
A norm defines a metric d : X × X → R_0^+, given by d(x, y) = ||x − y||, such that:
• d(x, z) ≤ d(x, y) + d(y, z)
• d(x, y) = d(y, x)
• d(x, y) = 0 if and only if x = y
Remark: The concept of norm is stronger than that of metric: not any
metric gives rise to a norm.
Unit vector
A unit vector is a vector v ∈ Rn such that:
||v|| = 1
A bilinear form Q is symmetric if Q(x, y) = Q(y, x).
Dot product
The dot product on R^n is a function ⟨·, ·⟩ : X × X → R which associates to a pair of vectors x, y ∈ R^n the real number:

⟨x, y⟩ = x · y = \sum_{i=1}^{n} x_i y_i = x^T y    (4.4)
• A set of vectors {x1 , ..., xn } is orthonormal if all vectors xi in the set are
mutually orthogonal and are all unit vectors:
< xi , xj >= δij
where δij = 1 if i = j, 0 otherwise.
M x = λx (4.6)
Properties:
• an n × n matrix has n eigenvalues (counted with their multiplicity)
• an n × n matrix can have less than n distinct eigenvalues
• an n × n matrix can have less than n linearly independent eigenvectors (but never fewer than the number of distinct eigenvalues)
Singular matrices
A matrix is singular if it has a zero eigenvalue.
M x = 0x = 0
Determinant
The determinant |M | of a n×n matrix M is the product of its eigenvalues.
Symmetric matrix
A symmetric matrix is a square matrix that is equal to its transpose.
A = AT
Because equal matrices have equal dimensions, only square matrices can
be symmetric. The entries of a symmetric matrix are symmetric with
respect to the main diagonal. So if aij denotes the entry in the ith row
and jth column then:
aji = aij
for all indices i and j.
4.5 Eigen-decomposition
The purpose of this section is to describe an iterative procedure to find eigen-
values and eigenvectors for a matrix A. Given A we want to find λ and x values
such that:
Ax = λx
We multiply both sides of the equation by xT and we divide both sides of
the equation by xT x. We are not interested in the 0 vector, so the dot product
xT x is not 0.
x^T A x / x^T x = x^T λ x / x^T x

By linearity we can move λ out of the product:

x^T A x / x^T x = λ x^T x / x^T x = λ

The fraction x^T A x / x^T x is called the Rayleigh quotient.
At this point we want to find the eigenvector which maximizes the Rayleigh quotient. In this way we get the eigenvector x which corresponds to the maximal eigenvalue:

x = argmax_v ( v^T A v / v^T v )

Now we normalize the obtained eigenvector x:

x ← x / ||x||
à = A − λxxT
Note that xxT is not xT x which is a scalar (dot product). Indeed xxT is a
matrix.
Deflation turns x into an eigenvector for à which has zero-eigenvalue:
Ãx = Ax − λx
Ãx = Ax − λx = 0
Ãx = 0x
Ãz = Az − λxxT z = Az
v T Ãv
x′ = argmaxv
vT v
where x′ is the eigenvector with the second largest eigenvalue. The procedure is iteratively repeated on the deflated matrix until the solution is zero, which means that there are no more positive eigenvalues. If the matrix has negative eigenvalues, we can recover them by minimizing the Rayleigh quotient instead of maximizing it; at some point the solution of the iterative procedure will be 0 again. If the obtained set of eigenvectors does not yet span the full space, we have to take into consideration the eigenvectors with zero eigenvalue. These are obtained by extending the obtained set to an orthonormal basis.
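A small numerical sketch of this procedure, using power iteration to maximize the Rayleigh quotient and deflation between steps (this sketch assumes a symmetric matrix with non-negative eigenvalues; the example matrix is made up):

```python
import numpy as np

def dominant_eigenpair(A, iters=1000):
    """Power iteration: the eigenvector maximizing the Rayleigh quotient
    and the corresponding (largest) eigenvalue."""
    x = np.random.default_rng(0).normal(size=A.shape[0])
    for _ in range(iters):
        x = A @ x
        x = x / np.linalg.norm(x)
    lam = x @ A @ x  # Rayleigh quotient of the converged vector
    return lam, x

def eigendecompose(A, k):
    """Find the top-k eigenpairs of a symmetric PSD matrix by deflation."""
    A = A.copy().astype(float)
    pairs = []
    for _ in range(k):
        lam, x = dominant_eigenpair(A)
        pairs.append((lam, x))
        A = A - lam * np.outer(x, x)  # deflation: remove the found component
    return pairs

M = np.array([[2.0, 1.0], [1.0, 2.0]])  # eigenvalues 3 and 1
for lam, vec in eigendecompose(M, 2):
    print(round(lam, 4), np.round(vec, 4))
```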
V T AV = Λ (4.7)
Remark: a diagonalized matrix is much simpler to manage and has the
same properties as the original one (e.g. same eigen-decomposition).
Proof:
From the eigenvalue-eigenvector definition (Definition 4.4):

A [v_1 . . . v_n] = [v_1 . . . v_n] diag(λ_1, . . . , λ_n)

AV = V Λ

We multiply both sides by V^{−1}:

V^{−1} A V = V^{−1} V Λ = Λ

Since the eigenvectors are chosen orthonormal, V is an orthogonal matrix, so V^{−1} = V^T and therefore:

V^T A V = Λ
A symmetric matrix M is positive semi-definite if for every vector x:

x^T M x ≥ 0

• equivalently, there exists a real matrix B such that:

M = B^T B
1. Compute the mean of the data (X_i is the i-th row vector of X):

x̄ = (1/n) \sum_{i=1}^{n} X_i

2. Compute the covariance matrix C of the (centered) data and its eigendecomposition:

V^T C V = Λ

3. Project the data onto the basis of the eigenvectors of C (the principal components):

x′ = V^{−1} x = V^T x

For dimensionality reduction we follow the same procedure described above for PCA, but this time we take into consideration only the first k eigenvectors of the covariance matrix decomposition (those with the largest eigenvalues):

W = [v_1, . . . , v_k]
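A compact PCA sketch along these lines (assuming the data are stored as rows of a NumPy array; variable names are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components.
    Returns the projected data and the projection matrix W = [v1, ..., vk]."""
    x_bar = X.mean(axis=0)                 # 1. mean of the data
    Xc = X - x_bar                         # center the data
    C = (Xc.T @ Xc) / X.shape[0]           # 2. covariance matrix
    eigvals, V = np.linalg.eigh(C)         # eigendecomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1]      # sort eigenvectors by decreasing eigenvalue
    W = V[:, order[:k]]                    # 3. keep the first k eigenvectors
    return Xc @ W, W

X = np.array([[2.0, 0.1], [4.0, -0.2], [6.0, 0.3], [8.0, 0.0]])
Z, W = pca(X, k=1)
print(Z.ravel())  # 1-D representation capturing most of the variance
```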
Probability theory
P(v_i) = Pr(X = v_i)

• P(x) ≥ 0
• \sum_{x ∈ X} P(x) = 1

Expected value:
The expected value (mean or average) of a random variable is:

E[X] = µ = \sum_{x ∈ X} x P(x) = \sum_{i=1}^{m} v_i P(v_i)
Variance:
The variance of a random variable is the moment of inertia of its proba-
bility mass function:
Var[X] = σ^2 = E[(x − µ)^2] = \sum_{x ∈ X} (x − µ)^2 P(x)
Binomial distribution:
The binomial distribution is a Bernoulli distribution extended to a certain
number n of trials. Its probability mass function is:
P(x; p, n) = \binom{n}{x} p^x (1 − p)^{n−x}
Multinomial distribution
The multinomial distribution models the probability of an event that can
have m different outcomes, each with (possibly) a different probability. Its
probability mass function is:
P(x_1, . . . , x_m; p_1, . . . , p_m) = \prod_{i=1}^{m} p_i^{x_i}
where
• x1 , . . . , xm is a vector that represents the outcome,
• E[xi ] = pi
• V ar[xi ] = pi (1 − pi )
• Cov[xi , xj ] = −pi pj
Also this last case is a generalization of the cases described above; for our purposes it will not be used and is reported for completeness.
P (vi , wj ) = P r[X = vi , Y = wj ]
• P(x, y) ≥ 0
• \sum_{x ∈ X} \sum_{y ∈ Y} P(x, y) = 1

µ_y = E[y] = \sum_{x ∈ X} \sum_{y ∈ Y} y P(x, y)

σ_y^2 = Var[y] = E[(y − µ_y)^2] = \sum_{x ∈ X} \sum_{y ∈ Y} (y − µ_y)^2 P(x, y)

and

P(y) = \sum_{x ∈ X} P(x, y)
P (B) = P (A) + P (W )
Cumulative distribution
Given a continuous variable X we define
F (q) = P (X ≤ q)
Mean and variance are computed in the same way as in the discrete case (Section 5.1), but taking into account the infinite number of small intervals, therefore computing an integral:

E[x] = µ = \int_{−∞}^{∞} x p(x) dx

Var[x] = σ^2 = \int_{−∞}^{∞} (x − µ)^2 p(x) dx
Beta distribution
The beta distribution is a distribution over probabilities (a "probability of a probability"), defined in the interval [0, 1] with parameters α and β. Its probability density function is

p(x; α, β) = ( Γ(α + β) / (Γ(α) Γ(β)) ) x^{α−1} (1 − x)^{β−1}

where
• E[x] = α / (α + β)
• Var[x] = αβ / ((α + β)^2 (α + β + 1))
• Γ(x + 1) = x Γ(x)
• Γ(1) = 1
Dirichlet distribution
This distribution is defined for x ∈ [0, 1]^m with \sum_{i=1}^{m} x_i = 1; it is parameterized by α = α_1, . . . , α_m and its probability density function is

p(x_1, . . . , x_m; α) = ( Γ(α_0) / \prod_{i=1}^{m} Γ(α_i) ) \prod_{i=1}^{m} x_i^{α_i − 1}

where α_0 = \sum_{j=1}^{m} α_j.
Characterizing values of this distribution are:
• E[x_i] = α_i / α_0
• Var[x_i] = α_i (α_0 − α_i) / (α_0^2 (α_0 + 1))
• Cov[x_i, x_j] = −α_i α_j / (α_0^2 (α_0 + 1))
E[X̄_n] = µ

Var[X̄_n] = (1/n^2) (Var[X_1] + · · · + Var[X_n]) = σ^2 / n

which means that the variance of the average decreases as the number of observations increases (the higher the number of observations, the higher the precision of the estimate).
Chebyshev's inequality
Given a random variable X with mean µ and variance σ^2, for all a > 0:

Pr[|X − µ| ≥ a] ≤ σ^2 / a^2

i.e. the probability that X deviates from its mean by at least a is bounded by σ^2 / a^2.
Also, replacing a = kσ for k > 0:

Pr[|X − µ| ≥ kσ] ≤ 1 / k^2

This inequality shows that most of the probability mass of a random variable stays within a few standard deviations from its mean.
Applying Chebyshev's inequality to the sample average X̄_n:

Pr[|X̄_n − E[X̄_n]| > ϵ] ≤ σ^2 / (n ϵ^2)

so the sample average concentrates around the true mean as n grows (law of large numbers). The standardized sample average is

z = (X̄_n − µ) / (σ / \sqrt{n})

which, by the central limit theorem, tends to a standard normal distribution as n grows.
Entropy
The entropy of a set of symbols V is the expected length of a message encoding a symbol, assuming an optimal coding:

H[V] = E[− log P(v)] = − \sum_{i=1}^{n} P(v_i) log P(v_i)
Cross entropy
Considering two distributions P and Q over a variable X, the cross en-
tropy between the two distributions measures the expected number of bits
needed to code a symbol sampled from P using Q instead:
H(P; Q) = E_P[− log Q(v)] = − \sum_{i=1}^{n} P(v_i) log Q(v_i)

The Kullback-Leibler divergence measures the expected number of extra bits needed when coding symbols sampled from P using the code for Q instead of the optimal one, therefore:

D_{KL}(P||Q) = H(P; Q) − H(P) = \sum_{i=1}^{n} P(v_i) log ( P(v_i) / Q(v_i) )
Conditional entropy
Consider two variables V, W with (possibly different) distributions P; the conditional entropy is the entropy remaining for variable W once V is known:

H(W|V) = \sum_{v} P(v) H(W|V = v)

H(W|V) = − \sum_{v} P(v) \sum_{w} P(w|v) log P(w|v)

The conditional entropy allows one to quantify the information gain (the reduction of entropy for W).
Mutual information
Given two variables V, W with (possibly different) distributions P, the mutual information (or information gain) is the reduction in entropy for W once V is known:

I(W; V) = H(W) − H(W|V)
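A small Python sketch of these information-theoretic quantities over discrete distributions (the example probability vectors are made up):

```python
import math

def entropy(p):
    """H(P) in bits for a probability vector p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P; Q): expected bits when coding symbols from P with the code for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = H(P; Q) - H(P)."""
    return cross_entropy(p, q) - entropy(p)

P = [0.5, 0.25, 0.25]
Q = [1 / 3, 1 / 3, 1 / 3]
print(entropy(P), cross_entropy(P, Q), kl_divergence(P, Q))
```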
Evaluation
                         prediction
                         P        N
ground truth      P      TP       FN
                  N      FP       TN
Accuracy
Accuracy is the fraction of correctly labelled examples among all predic-
tions.
Acc = (TP + TN) / (TP + TN + FP + FN)    (6.1)
Remark: accuracy is one minus the misclassification cost.
Since these last two measures are complementary, i.e. when precision increases,
the recall typically decreases, it is not usually appropriate to consider them
separately. Nevertheless, in some contexts it could be reasonable to consider
recall and precision separately. Consider an information retrieval task like:
”Give me the pages about machine learning from DISI website”. In this case it is
crucial to achieve a good precision in the first results of the searching algorithm
output. Vice versa, if we aim to develop a cancer detection application, recall
has to be prioritized.
For standard classification tasks, there is a measure which combines recall and
precision balancing the two aspects.
F-measure
F-measure combines precision and recall, balancing the two aspects. β is a parameter trading off precision and recall.

F_β = (1 + β^2)(Pre · Rec) / (β^2 Pre + Rec)    (6.4)
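A quick sketch computing these binary-classification metrics from predicted and true labels (the toy label vectors are for illustration only):

```python
def binary_metrics(y_true, y_pred, beta=1.0):
    """Precision, recall and F_beta from two lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = ((1 + beta**2) * precision * recall / (beta**2 * precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))  # (0.666..., 0.666..., 0.666...)
```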
Figure 6.2: Precision-recall curve. At one end of the curve precision is very high, since the model predicts positive only when it has very high confidence; at the other end recall is high, since with such a configuration the model almost always predicts positive.
With this tool, we can compare different algorithms by plotting them on the
same graphic. Then, we can decide the algorithm to use according to what we
are interested in.
A single aggregate value can be obtained taking the area under the curve (see
Figure 6.3). It combines the performance of the algorithm for all possible thresh-
olds (without preference for a specific value of threshold).
                            prediction
                      y1       y2       y3
                y1    n11      n12      n13
ground truth    y2    n21      n22      n23
                y3    n31      n32      n33
• the sum of off-diagonal elements along a row is the number of false negatives for the row label:

FN_i = \sum_{j ≠ i} n_{ij}
Multiclass accuracy
Multiclass accuracy is the overall fraction of correctly classified examples.
MAcc = \sum_{i} n_{ii} / \sum_{i} \sum_{j} n_{ij}    (6.6)
In the same manner, we can compute F1 for class i and the average F1 across
classes.
From the confusion matrix we can infer the pair of classes which are often
confused by the classifier. This kind of observations could for example highlight
the need of considering additional features in order to disambiguate classes.
6.3 Regression
Root mean squared error is the typical performance measure in the regression
domain. Intuitively, it represents the distance between the predicted value and
the real value.

RMSE = \sqrt{ (1/n) \sum_{i=1}^{n} (f(x_i) − y_i)^2 }    (6.7)
The random variable X represents the predicted value, the random vari-
able Y represents the ground truth label. However, the expectation values can
S_i = S_{D_i}[L(T_i)]

i.e. S_i is the score obtained on fold D_i by the model learned on the remaining folds T_i. After this iterative procedure, return the average score across folds:

S̄ = (1/k) \sum_{i=1}^{k} S_i
In this way, the performance doesn’t depend on the specific choice of the split,
because we perform multiple splits. In real application a typical value of k is
k ≥ 10.
Notice that the last part of the equation is not an exact equality: the training sets T_1, . . . , T_k share a lot of examples, so the scores S_1, . . . , S_k are not independent and the variance of the performance score on a specific fold cannot be directly computed. Since we cannot exactly compute Var[S_j], we approximate it with the unbiased variance across folds:

Var[S_j] = Var[S_h] ≈ (1/(k − 1)) \sum_{i=1}^{k} (S_i − S̄)^2    (6.10)

Dividing by k − 1 is necessary to obtain an unbiased estimate, i.e. an estimate that converges to the true value as the number of observations grows. If we plug Equation 6.10 into Equation 6.9, we get:

Var[S̄] ≈ (1/k^2) \sum_{j=1}^{k} (1/(k − 1)) \sum_{i=1}^{k} (S_i − S̄)^2 = (1/k^2) (k/(k − 1)) \sum_{i=1}^{k} (S_i − S̄)^2 = (1/(k(k − 1))) \sum_{i=1}^{k} (S_i − S̄)^2    (6.11)
6.5.1 Glossary
• Tail probability: probability that T is at least as great (right tail) or at
least as small (left tail) as the observed value t.
• p-value: the probability of obtaining a value T at least as extreme as the
one observed t, under the assumption that the null hypothesis is correct.
A very small p-value means that such an extreme observed outcome would
be very unlikely under the null hypothesis. (Figure 6.4)
• Type I error : reject the null hypothesis when it’s true. This is the most
serious error. In order to minimize the type I it is common to put a
threshold on the p-value.
• Type II error : accept the null hypothesis when it’s false.
• Critical region: the set of random variables X_1, . . . , X_k taken into account in the test statistic can be visualized as a vector in an n-dimensional space. In this space we define a region C such that, if the vector lies in C, then the null hypothesis is rejected. In other words, the critical region represents the set of values of T for which we reject the null hypothesis.
• Critical values: values on the boundary of the critical region.
Example:
A common null hypothesis states that a Gaussian distribution N(θ, 1) with variance 1 has mean value equal to one (θ = 1). The critical region adopted for this test statistic is the following:

C = { (X_1, X_2, . . . , X_n) : |1 − (1/n) \sum_{i=1}^{n} X_i| > 1.96/\sqrt{n} }
• Significance level: the type 1 error and the type 2 error are not symmetric. Actually, the purpose of the test statistic is not to state that the null hypothesis is true or false, but to understand whether the sampled data are compatible with the null hypothesis. As a consequence, there is high tolerance for accepting H0, which is rejected only if the sampled data are really improbable assuming H0 holds. This balance is regulated by a parameter α, referred to as the significance level. The test must fulfil the property that the probability of rejecting the null hypothesis when it is true is lower than α. In other words, the significance level α is the largest acceptable probability of committing a type 1 error. Typical values of α are 0.1, 0.05, 0.005.
Example:
Suppose we want to verify the following null hypothesis:

H0 : θ ∈ w

where w is a set of possible values for the parameter θ. For this purpose, a point estimator d(X) is considered. The null hypothesis H0 is rejected when d(X) is "far" from the region w. In order to understand how "far" w and d(X) should be in order to reject H0 with significance level α, it is necessary to know the distribution of the estimator d(X) when H0 is true. This notion allows us to use the property that the probability of a type 1 error is lower than α to understand when the estimator is "far" enough from w to reject the null hypothesis, and so to define the critical region of the test.
If our intention is to build a test with significance level α, we need to find the value of c in Equation 6.12 such that the probability of committing a type 1 error is α. This means that c has to satisfy the following relation.
We write P_{µ0} to indicate the value of the probability computed under the assumption that µ = µ0. Indeed, by definition, a type 1 error occurs when the sampled data lead us to reject H0 (i.e. (X_1, X_2, . . . , X_n) ∈ C) when the null hypothesis is in fact correct (i.e. µ = µ0).
When µ = µ0, we know that X̄ follows a normal distribution with mean µ0 and variance σ^2/n. Considering an N(0, 1) random variable Z, we obtain:

(X̄ − µ0) / (σ/\sqrt{n}) ∼_{µ0} Z    (6.14)
• if |(X̄ − µ0) / (σ/\sqrt{n})| ≤ z_{α/2} we accept H0
• we reject H0 if:

|(X̄ − µ0) / (σ/\sqrt{n})| > z_{α/2}
(X̄ − µ) / (S/\sqrt{n}) ∼ t_{n−1}    (6.16)

• H0 is accepted if:

|(X̄ − µ0) / (S/\sqrt{n})| ≤ t_{α/2, n−1}
• The shape is wider and shorter (fatter tails): this reflects the greater variance due to using the estimate Ṽar[X̄] instead of the true unknown variance of the distribution.
• k − 1 is the number of degrees of freedom of the distribution (related to
the number of independent events observed)
• tk−1 tends to the standardized normal Z for k → ∞
The null hypothesis is that the mean difference is zero (i.e. there is no difference between the two algorithms):

H0 : µ = 0

H0 is rejected if:

δ̄ / \sqrt{ Ṽar[δ̄] } ≤ −t_{k−1, α/2}    or    δ̄ / \sqrt{ Ṽar[δ̄] } ≥ t_{k−1, α/2}
Notice that the null hypothesis assumes zero mean, which is why δ̄ appears alone in the numerator.
We perform a two-tailed test if no prior knowledge can tell the direction of the difference. Otherwise it is reasonable to use a one-tailed test.
Figure 6.6: 10-fold cross validation used to implement a t-test example on the per-
formance difference between two algorithms A and B.
With this value we can calculate the standardized mean error difference:

δ̄ / \sqrt{ Ṽar[δ̄] } = 0.046 / 0.0154344 ≈ 2.98

We look up the t distribution for α = 0.05 (i.e. we accept a type 1 error probability of 5%; since it is a two-tailed test we use α/2 in the following formula) and k = 10:

t_{k−1, α/2} = t_{9, 0.025} = 2.262

Finally, since 2.98 > 2.262, the null hypothesis is rejected: we can conclude that the two classifiers are different. Indeed, our observation (i.e. our test statistic) lies further to the right than t_{k−1, α/2}. In order to obtain the value of t_{9, 0.025} we looked at the entry [df = 9, t.025] of the table in Figure 6.7.
Remarks:
• The test we have described in this section is a paired test since both the
algorithms are evaluated over identical samples.
• The test we have described is a two-tailed test, since we cannot tell a priori whether algorithm A is better than B or vice versa. If we had prior knowledge about the direction of the difference, we would use a one-tailed test (and dividing α by two would not be required). A small numerical sketch of this paired t-test is given below.
Chapter 7
Parameter estimation
• for any new example x not in the training set, we compute the posterior probability of the class given the example and the full training set D:

P(y_i | x, D) = p(x | y_i, D) p(y_i | D) / p(x | D)    (7.1)

Repeating this computation for every class, we take as predicted label the class with maximum posterior probability.
We are trying to find the parameters that describe the actual distribution p from the training data D, i.e. p(x | D_i, y_i).
From now on, whenever we refer to a vector of features, we will use the bold notation x.
By identically distributed we mean that all examples are sampled from the same distribution p(x, y).

p(y_i | D) = #y_i / #total = |D_i| / |D|
At this point, the factor of Equation 7.1 we are still missing is p(x | y_i, D_i). In order to compute it, we need to estimate the parameters θ_i of the assumed distribution. There are two main options to obtain an estimate of the parameters θ_i:
1. Maximum likelihood: we assume that the Θ_i have fixed but unknown values, and we estimate them as the values that maximize the probability of the observed examples D_i (coming from the training set). The values Θ_i are then used to compute the probability of new unseen examples:

p(x | y_i, D_i) ≈ p(x | Θ_i)

2. Bayesian estimation: we assume the parameters Θ_i are themselves random variables with a known prior distribution; the observed examples are used to turn the prior into a posterior distribution over the parameters (see Section 7.2).
which maximizes the likelihood of the parameters with respect to the training samples, without assuming any prior distribution over the parameters.
Since every class is treated independently, we will replace the redundant notation D_i, y_i simply with D, since it is not ambiguous.

Θ* = argmax_Θ p(D|Θ) = argmax_Θ \prod_{j=1}^{n} p(x_j|Θ)

Remark: p(D|Θ) = \prod_{j=1}^{n} p(x_j|Θ) since the elements x_1, . . . , x_n in the training data are i.i.d.
In order to find the maximum, we can take the derivative and look for the points where it is null. Since there are several parameters, we apply the gradient to the log-likelihood:

∇_Θ \sum_{j=1}^{n} log p(x_j|Θ) = 0    (7.3)
p(x_j|Θ) = (1 / (\sqrt{2π} σ)) exp( −(x_j − µ)^2 / (2σ^2) )
Let's begin with the derivative of the log-likelihood with respect to µ:

∂/∂µ \sum_{i=1}^{n} [ log (1/(\sqrt{2π} σ)) − (x_i − µ)^2/(2σ^2) ] = \sum_{i=1}^{n} (x_i − µ)/σ^2

Setting the derivative to zero:

\sum_{i=1}^{n} (x_i − µ)/σ^2 = 0

\sum_{i=1}^{n} (x_i − µ) = 0

\sum_{i=1}^{n} x_i = nµ

µ = (1/n) \sum_{i=1}^{n} x_i
The derivative of the log-likelihood with respect to σ is:

∂/∂σ \sum_{i=1}^{n} [ log (1/(\sqrt{2π} σ)) − (x_i − µ)^2/(2σ^2) ]
= \sum_{i=1}^{n} [ \sqrt{2π} σ · (1/\sqrt{2π}) (−σ^{−2}) − ((x_i − µ)^2/2)(−2σ^{−3}) ]
= \sum_{i=1}^{n} [ −1/σ + (x_i − µ)^2/σ^3 ]

Setting it to zero and multiplying both sides by σ^3:

\sum_{i=1}^{n} [ −σ^2 + (x_i − µ)^2 ] = 0

\sum_{i=1}^{n} (x_i − µ)^2 = nσ^2

σ^2 = (1/n) \sum_{i=1}^{n} (x_i − µ)^2
Remark: this is the sample variance.
We can conclude that, assuming a Gaussian distribution, the parameters that maximize the likelihood of the training data of each class are the sample mean and the sample variance.
• the maximum likelihood estimates are the vector of the sample means:

µ = (1/n) \sum_{j=1}^{n} x_j
where D is the dataset for a certain class y and Θ the parameters of its distri-
bution.
Let's proceed also in this case considering a Gaussian distribution (Section 5.3.1) with unknown parameters Θ.
In computing p(x|Θ, D), we do not need D since the probability of an example depends only on the parameters Θ; as a result, hereafter we write p(x|Θ). Since the prediction is conditioned on the parameters Θ:

p(x|D) = \int p(x|Θ) p(Θ|D) dΘ

The factor p(x|Θ) can be easily computed since we know the type of distribution and its parameters. Therefore we need to estimate the parameter posterior density given the training set:

p(Θ|D) = p(D|Θ) p(Θ) / p(D)
p(µ|D) = p(D|µ) p(µ) / p(D) = α \prod_{j=1}^{n} p(x_j|µ) p(µ)
where α = 1/p(D) is independent of µ, and p(D|µ) = \prod_{j=1}^{n} p(x_j|µ) since the examples are i.i.d.
From here we compute p(µ|D) taking into account two Gaussian distributions: the distribution of the data, p(x_j|µ) ∼ N(x_j; µ, σ^2), and the prior over the mean, p(µ) ∼ N(µ; µ_0, σ_0^2).
Substituting the definition of the normal distribution, we get that p(µ|D) is equal to:

p(µ|D) = α \prod_{i=1}^{n} [ (1/(\sqrt{2π} σ)) exp( −(x_i − µ)^2/(2σ^2) ) ] · (1/(\sqrt{2π} σ_0)) exp( −(µ − µ_0)^2/(2σ_0^2) )
From the last step, we can see that this composition behaves like a normal distribution itself:

p(µ|D) = (1/(\sqrt{2π} σ_n)) exp( −(1/2) ((µ − µ_n)/σ_n)^2 )
so we can equate the last two expressions (the step-by-step derivation is omitted). Solving for µ_n and σ_n^2 we get:

µ_n = ( nσ_0^2 / (nσ_0^2 + σ^2) ) µ̂_n + ( σ^2 / (nσ_0^2 + σ^2) ) µ_0

and

σ_n^2 = σ_0^2 σ^2 / (nσ_0^2 + σ^2)

where µ̂_n is the sample mean:

µ̂_n = (1/n) \sum_{i=1}^{n} x_i
• the mean µ_n is a linear combination of the prior mean µ_0 and the sample mean µ̂_n;
• the higher the number of training samples n, the more weight the sample mean µ̂_n gets, while µ_0 loses importance;
• the more training samples n, the more the variance σ_n^2 decreases, making the distribution sharper and less uncertain.
Figure 7.1: 2D and 3D graphics of increasing number of n samples seen for a Bayesian
estimation of the mean µ
Therefore the Gaussian distribution used to compute the probability of new samples has parameters Θ given by the posterior mean µ_n and by the sum of the known variance σ^2 and the variance of the mean σ_n^2 (i.e. the uncertainty over the mean).
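A small numerical sketch of this posterior update (known σ, Gaussian prior over µ; the prior values and data are made up):

```python
def gaussian_mean_posterior(samples, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) over the mean of a Gaussian with known
    variance sigma2, given a prior N(mu0, sigma0_2)."""
    n = len(samples)
    mu_hat = sum(samples) / n  # sample mean
    mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_hat \
         + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
    sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

mu_n, sigma_n2 = gaussian_mean_posterior([2.1, 1.9, 2.4], sigma2=0.5,
                                         mu0=0.0, sigma0_2=1.0)
print(mu_n, sigma_n2)  # predictive variance for a new sample: sigma2 + sigma_n2
```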
A sufficient statistic s = Φ(D) is a function of the data that summarizes all the information they carry about the parameters Θ:

P(D|s, Θ) = P(D|s)

As a consequence, the posterior over the parameters depends on the data only through s:

p(Θ|D, s) = p(D|Θ, s) p(Θ|s) / p(D|s) = p(Θ|s)

Sufficient statistics for a Gaussian distribution are the sample mean and the sample covariance.
A prior p(Θ) is conjugate for a likelihood if the resulting posterior has the same functional form as the prior. For a Bernoulli distribution

P(x|Θ) = Θ^x (1 − Θ)^{1−x}

the conjugate prior is a Beta distribution (Section 5.3.1) that depends on parameters α_h, α_t:

P(Θ|ψ) = P(Θ|α_h, α_t) = ( Γ(α_h + α_t) / (Γ(α_h) Γ(α_t)) ) Θ^{α_h − 1} (1 − Θ)^{α_t − 1}
Given a dataset D containing h heads and t tails, the likelihood is

p(D|Θ) = Θ^h (1 − Θ)^t

and the posterior is again a Beta distribution, with parameters α′_h = h + α_h and α′_t = t + α_t, derived from the sum of the real counts (h, t) and the imaginary counts (α_h, α_t).
Then the prediction for a new event given the data is the expected value of the posterior Beta:

P(x|D) = \int_Θ P(x, Θ|D) dΘ = \int_Θ P(x|Θ) P(Θ|D) dΘ

Since the output is binary (success, fail), let us focus on the success event (P(x = 1|Θ) = Θ), keeping in mind that Θ follows a Beta distribution, whose expected value is E[Θ] = α′_h / (α′_h + α′_t):

P(x = 1|D) = \int_Θ Θ P(Θ|D, ψ) dΘ = E[Θ] = (h + α_h) / (h + t + α_h + α_t)
From this we see that:
• with large observed counts t, h the parameters α_t, α_h become irrelevant, so the estimate gets closer to the maximum-likelihood estimate;
• with small observed counts t, h the prior counts matter more.
A small numerical sketch of this update is given below.
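A minimal Beta-Bernoulli posterior-predictive update in Python (the pseudocount values below are illustrative):

```python
def beta_bernoulli_predict(heads, tails, alpha_h=2.0, alpha_t=2.0):
    """Posterior predictive P(x = 1 | D) with a Beta(alpha_h, alpha_t) prior."""
    return (heads + alpha_h) / (heads + tails + alpha_h + alpha_t)

print(beta_bernoulli_predict(heads=1, tails=0))    # few data: prior dominates
print(beta_bernoulli_predict(heads=80, tails=20))  # many data: close to the MLE 0.8
```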
P(Θ|ψ) = P(Θ|α_1, . . . , α_r) = ( Γ(α) / \prod_{k=1}^{r} Γ(α_k) ) \prod_{k=1}^{r} Θ_k^{α_k − 1}
which stands for the fraction of the outcomes in D. Also, the counts of the different outcomes in D are a sufficient statistic (Section 7.3).
The posterior P(Θ|D) is obtained by combining P(Θ), the prior (a Dirichlet distribution, Section 5.3.1), with P(D|Θ), the likelihood.
Then we have that

\prod_{i=1}^{r} Θ_i^{N_i} \prod_{i=1}^{r} Θ_i^{α_i − 1} = \prod_{i=1}^{r} Θ_i^{N_i + α_i − 1} = \prod_{i=1}^{r} Θ_i^{α′_i − 1}

fixing α′_i = N_i + α_i.
The prediction for outcome R is then the expected value of the posterior Dirichlet:

E[Θ_R] = α′_R / \sum_{i=1}^{r} α′_i = (N_R + α_R) / \sum_{i=1}^{r} (N_i + α_i)

Also in this case, the prediction is influenced by the actually observed counts N_R but is also partially driven by the imaginary counts α_R used for the parameter estimation.
Chapter 8
Bayesian Networks
The non-descendants of a variable are the nodes which cannot be reached by following the arrows starting from that variable. The subscript l in I_l(G) stands for local because, as discussed in the following, the model can also encode some global independences.
where Pa xi stands for the parents of xi . Consider for example the Bayesian
Network illustrated in Figure 8.1:
p(x1 , . . . , x7 ) = p(x1 )p(x2 )p(x3 )p(x4 |x1 , x2 , x3 )p(x5 |x1 , x3 )p(x6 |x4 )p(x7 |x4 , x5 )
I-map ⇒ factorization
xi → xj ⇒ i < j
The variables in the Bayesian Network in Figure 8.1 are already ordered
topologically.
3. Let us decompose the joint probability using the chain rule as:
factorization ⇒ I-map
1. If p factorizes according to G, the joint probability can be written as:
m
Y
p(x1 , . . . , xm ) = p(xi |Pa xi )
i=1
2. Let us consider the last variable xm (repeat steps for the other variables).
By the product (chain) and sum rules:
p(x1 , . . . , xm ) p(x1 , . . . , xm )
p(xm |x1 , . . . , xm−1 ) = =P
p(x1 , . . . , xm−1 ) xm p(x1 , . . . , xm ))
Qm
p(x1 , . . . , xm ) p(x |Pa xi )
P Qm i
= P i=1
xm p(x 1 , . . . , x m )) xm i=1 p(xi |Pa xi )
Qm−1 ((((
p(xm |Pa x ) ( (p(x( i |Pa xi )
= Qm−1 ((m(( (i=1P (
i=1(p(xi |Pa xi ) ((
(( ( xm( p(x
(m (|Pa
(( xm )
Remark: \sum_{x_m} p(x_m|Pa_{x_m}) = 1 since we are summing over all possible values of x_m.
Remark: In p(xm |x1 , . . . , xm−1 ) all the variables in {x1 , . . . , xm−1 } are
necessarily non-descendants of xm since we have defined a topological
order.
At the end of the day, we can conclude that the probability of xm given
the non-descendants is equal to the probability of xm given the parents.
4. If this property holds for xm , for the same reasoning the property must
holds for all the other variables. In other words, following this procedure
we recover the set of independences that define an I-map.
At this point we can conclude: factorization ⇔ I-map.
Bayesian Network
A Bayesian Network is a pair (G, p) where p factorizes over G (which
is a Bayesian network structure) and it is represented as a set of con-
ditional probability distributions (cpd) associated with the nodes of G
(p(x1 |Pa x1 ), p(x2 |Pa x2 ), . . . , p(xm |Pa xm )).
p(x_1, . . . , x_m) = \prod_{i=1}^{m} p(x_i|Pa_{x_i})
Since A and B have no parents, we simply build the two tables as follows:
gene    value       P(value)
A       active      0.3
A       inactive    0.7
On the other hand, mapping P (C|A, B) is slightly more difficult. The condi-
tional probability table has indeed 3 variables. The table is illustrated in Figure
8.3.
Remark: of course the columns of the conditional probability table proposed
in Figure 8.3 sum to one.
Figure 8.3: Conditional probability table of gene C in the toy regulatory network
proposed in Figure 8.2.
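As a small illustration of how such conditional probability tables are used, the sketch below computes the joint probability P(A, B, C) = P(A)P(B)P(C|A, B) for the toy regulatory network; the numbers for B and for the table of C are hypothetical, since only the table of A is reported above:

```python
# prior probabilities of A and B (values for B are hypothetical)
P_A = {"active": 0.3, "inactive": 0.7}
P_B = {"active": 0.6, "inactive": 0.4}

# hypothetical conditional probability table P(C | A, B)
P_C_given_AB = {
    ("active", "active"):     {"active": 0.9, "inactive": 0.1},
    ("active", "inactive"):   {"active": 0.6, "inactive": 0.4},
    ("inactive", "active"):   {"active": 0.5, "inactive": 0.5},
    ("inactive", "inactive"): {"active": 0.1, "inactive": 0.9},
}

def joint(a, b, c):
    """P(A=a, B=b, C=c) according to the factorization P(A)P(B)P(C|A,B)."""
    return P_A[a] * P_B[b] * P_C_given_AB[(a, b)][c]

print(joint("active", "inactive", "active"))  # 0.3 * 0.4 * 0.6 = 0.072
```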
8.2.1 d-separation
In order to introduce the process of d-separation, we start analyzing the base
cases. The first relevant base case involves three variables. Indeed, the two
variables case is trivial, since either the two variables are linked together and
so they are related or there is no connection between the two variables and so
they are not related. In a similar manner, also the case with three variables
illustrated in Figure 8.4 is trivial since each variable is related (i.e. there is a
link) to the others. As a consequence everybody is dependent with respect to
everybody else. There is nothing non-trivial to find out. The non-trivial case is
when not all the three variables are connected to all the others (e.g. one link is
missing). An example is illustrated in Figure 8.5.
Figure 8.4: In this case although three variables are involved, verifying independency
assumptions is trivial since each variable is related to the others.
Figure 8.5: In this case the independences among the three variables are not trivial.
8.2.1.1 Tail-to-tail
An example of this configuration is illustrated in Figure 8.6.
The pattern is called tail-to-tail because the central node c is connected to the
other variables by means of the tails of the two arrows.
Looking at the structure in Figure 8.6, the joint probability distribution is:

p(a, b, c) = p(a|c) p(b|c) p(c)

Marginalizing over c gives p(a, b) = \sum_c p(a|c) p(b|c) p(c), which in general is not equal to p(a) p(b): a and b are not independent. However, conditioning on c:

p(a, b|c) = p(a, b, c) / p(c) = p(a|c) p(b|c)

so a and b are conditionally independent given c.
Example 1:
• a =Bike
• b =Bus delay
• c =Rain
Example 2:
• a =Cough
• b =Temperature
• c =Covid
(a) a and b are not independent. (b) a and b are conditionally independent
given c.
8.2.1.2 Head-to-tail
In this case node c is in the middle of a chain, Figure 8.7. Similarly to the previ-
ous case, this configuration is called head-to-tail because variable c is connected
to a and b by means of the head of an arrow and a tail of another arrow.
There is no way, applying probability rules, to reduce p(a) \sum_c p(b|c) p(c|a) to p(a) p(b). Given this, we can conclude that a and b are not independent:

p(a, b) = p(a) \sum_c p(b|c) p(c|a) ≠ p(a) p(b)
p(a, b|c) = p(a, b, c) / p(c) = p(b|c) p(c|a) p(a) / p(c)

By Bayes' rule, p(c|a) p(a) / p(c) = p(a|c), so:

p(a, b|c) = p(b|c) p(c|a) p(a) / p(c) = p(b|c) p(a|c)
Intuitively, if you have already observed the effects of a certain cause, you don’t
really care about the cause anymore in order to update the probability of the
consequence.
Example:
• a =Cloudy
• c =Rainy
• b =Get wet
Assume that a student of the DISI department of the University of Trento named
Giovanni Valer wakes up and notices that it’s a cloudy day. He realizes that, if
he goes to the university by bike, he could get wet. Indeed, since it is cloudy,
it could rain. The fact that it is cloudy influences the chance of getting wet
(causality reasoning).
Suppose now that Giovanni wakes up and it is raining, but at the same time the sun is out. Even if it is not cloudy, Giovanni gets wet if he travels by bike to the university. Evidently, given that it is raining, the fact that it is not cloudy does not influence the probability of getting wet.
(a) a and b are not independent. (b) a and b are conditionally independent
given c.
8.2.1.3 Head-to-Head
For the same reasons as in the previous cases, the structure exemplified in Figure 8.8 is called head-to-head. In this case the joint distribution decomposes as:

p(a, b, c) = p(a) p(b) p(c|a, b)

Marginalizing over c gives p(a, b) = p(a) p(b), so a and b are (marginally) independent.
In this case we are not able to simplify p(a, b|c) in order to verify the equality
p(a, b|c) = p(a|c)p(b|c). Actually, a and b are not conditionally independent
given c:
p(a, b|c) = p(c|a, b) p(a) p(b) / p(c) ≠ p(a|c) p(b|c)
Intuitively, in this case a and b are competing causes of the effect c. A priori a and b are independent causes, but once the effect c is observed they compete for its explanation. As a result, observing one of them reduces the probability of the other. This situation is usually referred to as the explaining away effect.
Example:
• a =Burglar
• b =Earthquake
• c =Alarm
Burglar and earthquake can be both possible causes for the alarm to ring. Of
course, the burglar can decide to steal something from the house of Giovanni
Valer regardless of the probability of an earthquake.
However, if the alarm rings, Giovanni immediately thinks about a burglar. But if he learns that there is an earthquake in that moment, then the probability of a burglar in his mind drops drastically.
(a) a and b are independent. (b) a and b are not conditionally independent
given c.
• Electric fuel gauge[G]: it indicates either full tank (G=1) or empty tank
(G=0)
P (B = 1) = 0.9
P (F = 1) = 0.9
P (G = 1|B = 1, F = 1) = 0.8
P (G = 1|B = 0, F = 1) = 0.2
P (G = 1|B = 1, F = 0) = 0.2
P (G = 1|B = 0, F = 0) = 0.1
Probability reasoning
P (F = 0) = 1 − P (F = 1) = 0.1
• The posterior after observing an empty fuel gauge (Figure 8.10b) is different:

P(F = 0|G = 0) = P(G = 0|F = 0) P(F = 0) / P(G = 0)
= [ \sum_B P(G = 0, B|F = 0) ] P(F = 0) / P(G = 0)
= [ \sum_B P(G = 0|F = 0, B) P(B|F = 0) ] P(F = 0) / P(G = 0)
= [ \sum_B P(G = 0|F = 0, B) P(B) ] P(F = 0) / P(G = 0)

where the last step uses the fact that B and F are a priori independent, so P(B|F = 0) = P(B).
P(G = 0) = \sum_{B ∈ {0,1}} \sum_{F ∈ {0,1}} P(G = 0, B, F)
= \sum_{B ∈ {0,1}} \sum_{F ∈ {0,1}} P(G = 0|B, F) P(B, F)
= \sum_{B ∈ {0,1}} \sum_{F ∈ {0,1}} P(G = 0|B, F) P(F|B) P(B)
= \sum_{B ∈ {0,1}} \sum_{F ∈ {0,1}} P(G = 0|B, F) P(B) P(F)
P(F = 0|G = 0) = [ \sum_B P(G = 0|F = 0, B) P(B) ] P(F = 0) / P(G = 0)
= [ \sum_B P(G = 0|F = 0, B) P(B) ] P(F = 0) / ( \sum_{B ∈ {0,1}} \sum_{F ∈ {0,1}} P(G = 0|B, F) P(B) P(F) ) ≃ 0.257
The probability that the tank is empty increases after observing that the fuel gauge reads empty (though not as much as one might expect, because of the strong prior and the unreliable gauge).
Figure 8.10: (a) Example of head-to-head connection. (b) The probability that the tank is empty increases from observing that the fuel gauge reads empty. (c) The probability that the tank is empty decreases after observing that the battery is also flat.
Suppose now that we are interested in the posterior probability after observing that the battery is also flat (Figure 8.10c):

P(F = 0|G = 0, B = 0) = P(G = 0|F = 0, B = 0) P(F = 0|B = 0) / P(G = 0|B = 0) ≃ 0.111
From this we understand that F and B become related once G is known: if they were unrelated, knowing something about B would not change the probability value. On the contrary, the probability that the tank is empty decreases after observing that the battery is also flat: the battery condition explains away the observation that the fuel gauge reads empty. The probability is still greater than the prior one, because the fuel gauge observation still gives some evidence in favour of an empty tank.
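These posteriors can be reproduced by brute-force enumeration over the small network (a sketch using the CPT values listed above; the variable encoding is 1 = charged/full, 0 = flat/empty):

```python
P_B = {1: 0.9, 0: 0.1}                 # battery charged / flat
P_F = {1: 0.9, 0: 0.1}                 # tank full / empty
P_G1 = {(1, 1): 0.8, (0, 1): 0.2,      # P(G = 1 | B, F)
        (1, 0): 0.2, (0, 0): 0.1}

def joint(b, f, g):
    p_g = P_G1[(b, f)] if g == 1 else 1 - P_G1[(b, f)]
    return P_B[b] * P_F[f] * p_g

def posterior_F0(evidence):
    """P(F = 0 | evidence) by summing the joint over the unobserved variables."""
    num = sum(joint(b, 0, evidence["G"]) for b in (0, 1)
              if "B" not in evidence or b == evidence["B"])
    den = sum(joint(b, f, evidence["G"]) for b in (0, 1) for f in (0, 1)
              if "B" not in evidence or b == evidence["B"])
    return num / den

print(posterior_F0({"G": 0}))           # ~0.257
print(posterior_F0({"G": 0, "B": 0}))   # ~0.111
```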
General Head-to-head
• A = burglar
• B = earthquake
• C = alarm
• D = phone call from the alarm company
If Giovanni Valer is not at home but he receives a phone call from the alarm
company he immediately thinks about a burglar even if he does not hear the
alarm. Evidently, burglar and earthquake both compete for an indirect effect.
This rule is valid for every descendant, though the further away the descendant, the weaker the indirect connection is likely to be.
(a) Head-to-head connection among three random variables. (b) Head-to-head connection among more than three random variables.
• the arrows on the path meet tail-to-tail (Figure 8.6b) or head-to-tail (Figure 8.7b) at the node and it is in C, or
• the arrows on the path meet head-to-head at the node and neither it nor
any of its descendants is in C (Figure 8.8b).
In a sense, we are using the basic rules in Figure 8.9 as building blocks to define the more general d-separation principle. The basic rules allow us to check whether individual paths are blocked somewhere: it is sufficient for a path to be blocked in one place for the whole path to be blocked.
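The d-separation criterion can also be checked programmatically. A minimal sketch, assuming a recent NetworkX version that exposes a d-separation test on DAGs (the graph below is an arbitrary toy example, not one of the figures):

import networkx as nx

# Toy DAG: a -> c <- b (head-to-head at c), c -> d.
G = nx.DiGraph([("a", "c"), ("b", "c"), ("c", "d")])

# With nothing observed, the head-to-head node blocks the path: a and b are independent.
print(nx.d_separated(G, {"a"}, {"b"}, set()))     # True
# Observing c (or its descendant d) unblocks the path: explaining away.
print(nx.d_separated(G, {"a"}, {"b"}, {"c"}))     # False
print(nx.d_separated(G, {"a"}, {"b"}, {"d"}))     # False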
Figure 8.12: Example of general d-separation between a and b given c.
To answer this question we have to examine all the possible paths between A and B. In correspondence of α there is a head-to-head connection and node α is observed. Hence, we can conclude that information flows from A to B and the statement dsep(A, B|C) is not true.
Figure 8.15: Example of general d-separation between A and B given C.
Figure 8.16: Example of general d-separation between A and B given C.
Figure 8.18: (a) The three BNs constitute an equivalence class whose components encode the same independence assumptions. (b) A second equivalence class which encodes the opposite (A and B are not independent, and they become independent when C is observed) of the structures in Figure 8.18a.
Minimal I-map
A minimal I-map for p is an I-map G which cannot be ”reduced” to a G′ ⊂ G (by removing edges) that is also an I-map for p. In other words, it is no longer possible to remove edges from G without introducing independences which do not hold in p.
Remark: a minimal I-map for p does not necessarily capture all the independences in p.
Remark: not all distributions have a P-map. Some cannot be modelled ex-
actly by the BN formalism. In particular there are some independences which
cannot be modelled by directed edges.
• Define variables for entities that can be observed (e.g. symptoms, age,
weight) or that you can be interested in predicting (e.g. pathology). Latent
variables can also be sometimes useful. Latent variables are variables
which we do not observe and we do not want to predict. However, they are
mediators for what we need to predict. For instance, a certain pathology
could depend on some groups of behaviours which could facilitate the
prediction of the pathology.
• Try following causality considerations in adding edges (edges which en-
code causalities are more interpretable and they tend to produce sparser
networks).
Inference in BN

We want to compute the probability for a certain node given some evidence:

P(X|E = e) = P(X, E = e) / P(E = e)
But we notice that the sum over X_n has only one term in the decomposition which depends on it, P(X_n|X_{n−1}); the other probabilities are constant with respect to X_n. Therefore we represent this value as a function of X_{n−1}:

μ_β(X_{n−1}) = Σ_{X_n} P(X_n|X_{n−1})

So we get

P(X_N) = Σ_{X_1} Σ_{X_2} ··· Σ_{X_{n−1}} P(X_1) P(X_2|X_1) P(X_3|X_2) ··· P(X_{n−1}|X_{n−2}) μ_β(X_{n−1})

We can proceed with the same intuition for all the previous terms:

μ_β(X_i) = Σ_{X_{i+1}} P(X_{i+1}|X_i) μ_β(X_{i+1})
We can compute a similar procedure from the top to the bottom of the chain. Given

P(X_n) = Σ_{X_1} Σ_{X_2} ··· Σ_{X_{n−1}} P(X_1) P(X_2|X_1) P(X_3|X_2) ··· P(X_n|X_{n−1})

notice that the only terms that depend on the first summation over X_1 are P(X_1) P(X_2|X_1), so we can carry on the same reasoning:

μ_α(X_2) = Σ_{X_1} P(X_1) P(X_2|X_1)

In general:

μ_α(X_i) = Σ_{X_{i−1}} μ_α(X_{i−1}) P(X_i|X_{i−1})
The only term left from the two chains of messages is P(X_n), which can be obtained by combining the incoming α and β messages. This means that we compute the local probabilities and sum them out independently, using (for a binary problem) a number of operations proportional to 2(n − 1) instead of 2^n. μ_α and μ_β are messages sent forward or backward from node to node. A message is always a function of the destination node: to compute the probability of the i-th node, I collect the messages coming from the adjacent nodes.
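A minimal sketch of this forward/backward computation on a chain of binary variables (numpy; the prior and transition tables are arbitrary toy values, and without evidence the backward messages are trivially all ones):

import numpy as np

# Chain X1 -> X2 -> ... -> Xn of binary variables.
n = 5
p_x1 = np.array([0.6, 0.4])                       # P(X1)
trans = np.array([[0.7, 0.3], [0.2, 0.8]])        # trans[a, b] = P(X_{i+1}=b | X_i=a)

alpha = [p_x1]                                    # forward messages
for _ in range(n - 1):
    alpha.append(alpha[-1] @ trans)               # sum over the previous variable

beta = [np.ones(2)]                               # backward messages
for _ in range(n - 1):
    beta.append(trans @ beta[-1])                 # sum over the next variable
beta = beta[::-1]                                 # beta[i] now belongs to X_{i+1}

for i in range(n):
    marginal = alpha[i] * beta[i]                 # P(X_{i+1}) up to normalization
    print(i + 1, marginal / marginal.sum())

Evidence would be handled by restricting the summed values (plugging in the observed value instead of summing), exactly as described below.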
Figure 9.1: Inference for the ith node based on the messages from the previous and
next node passing
message received from the previous (or next) nodes: this is a standard way to
send messages also for non-chain structures.
and for the latter I do have the value observed. The inference procedure is actu-
ally the same, having evidence implies plugging in values instead of computing
a summation.
(a) undirected trees: undirected graphs with a single path between each pair of nodes
(b) directed trees: directed graphs with a single node (the root) with no parents, and all other nodes with a single parent
(c) directed poly-trees: directed graphs with possibly multiple parents per node and multiple roots, but still a single (undirected) path between each pair of nodes.
The procedure that we will describe works for both directed models (BN) and undirected models (Markov models, MM).
Factor graph
It is a graphical representation of a graphical model highlighting its factorization. The factor graph is an undirected graph that has one node for each node in the original graph, plus an additional node for each factor (a factor node has undirected links to each of the variable nodes appearing in the factor).
Figure 9.3: Example of two different factor graphs for the first directed graph.
This version is the same but generalized to all the (incoming) messages reaching the node; variable nodes are only connected to factors (not to other variable nodes), so the marginal is the product of the messages from all factors in the neighbourhood of X:

P(X) = Π_{f_s ∈ adj(X)} μ_{f_s→X}(X)
Figure 9.4: Collecting messages from the adjacent nodes of X (fwd message passing)
where x_1, ..., x_m are the nodes from which f_s receives messages (excluding the destination X).
Computing the message from a node to a factor is simple: a node that needs to send a message to a factor is itself connected only to factor nodes f_1, ..., f_n. The message from a factor f to a node X is always a function of the variable node, so we:
• collect the messages from the adjacent factors, i.e. multiply the incoming messages over all factors that are not the destination:

μ_{X→f}(X) = Π_{f_i ∈ adj(X)\{f}} μ_{f_i→X}(X)

• if the extreme of the graph is a factor, it has no incoming messages; in general the message from a factor to a node is

μ_{f→X}(X) = Σ_{X_1} ··· Σ_{X_m} f(X, X_1, ..., X_m) Π_{X_i ∈ adj(f)\{X}} μ_{X_i→f}(X_i)
We can start sending messages like this, and then things come together as sums and products.
The joint probability encoded in this factor graph is the product of the three factors f_a(x_1, x_2), f_b(x_2, x_3), f_c(x_4, x_2):

P(x) = f_a(x_1, x_2) f_b(x_2, x_3) f_c(x_4, x_2)
and

μ_{f_c→x_2}(x_2) = Σ_{x_4} μ_{x_4→f_c}(x_4) f_c(x_4, x_2) = Σ_{x_4} f_c(x_4, x_2)

At this point x_2 has received two messages and can compute its own message to send to f_b:

μ_{x_2→f_b}(x_2) = μ_{f_a→x_2}(x_2) μ_{f_c→x_2}(x_2)

And now f_b sends its message to x_3, multiplying by its internal factor:

μ_{f_b→x_3}(x_3) = Σ_{x_2} μ_{x_2→f_b}(x_2) f_b(x_2, x_3)

This is the message reaching x_3, which has only one neighbour, so its probability is just the incoming message μ_{f_b→x_3}(x_3). Replacing the pieces of the equations we get

P(x_3) = Σ_{x_1} Σ_{x_2} Σ_{x_4} f_a(x_1, x_2) f_b(x_2, x_3) f_c(x_4, x_2)
Please notice that the result is as expected the sum over all the variables except
the destination. At this point we can propagate backwards until every node
received messages from all neighbours.
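A minimal sketch of this message-passing schedule for the toy factor graph above, with arbitrary binary factor tables (numpy); it checks that the message reaching x3 reproduces the brute-force sum:

import numpy as np

rng = np.random.default_rng(0)
fa = rng.random((2, 2))   # fa[x1, x2]
fb = rng.random((2, 2))   # fb[x2, x3]
fc = rng.random((2, 2))   # fc[x4, x2]

# Leaf variable nodes send unit messages, so:
mu_fa_x2 = fa.sum(axis=0)                 # sum over x1
mu_fc_x2 = fc.sum(axis=0)                 # sum over x4
mu_x2_fb = mu_fa_x2 * mu_fc_x2            # product of incoming factor messages
mu_fb_x3 = mu_x2_fb @ fb                  # sum over x2 of mu(x2) * fb[x2, x3]

# Brute-force (unnormalized) marginal of x3, for comparison.
brute = np.einsum("ab,bc,db->c", fa, fb, fc)
print(np.allclose(mu_fb_x3, brute))       # True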
This is the most probable explanation, the configuration that jointly maximizes the probability; therefore we need to maximize over all possible causes.
This means jointly maximizing over all variables. Depending on how p(x) gets decomposed, some local computation can be done independently from the rest: we can apply the very same algorithm where we simply replace Σ with max.
For the chain-like structure, we end up with the very same reasoning proposed in the previous sections. The result will not be equal to one, but it is the (maximal) probability. In the general message passing algorithm (sum-product algorithm), we replace the sum with a max:

μ_{f→x}(x) = max_{x_1,...,x_m} f(x, x_1, ..., x_m) Π_{x_i ∈ adj(f)\{x}} μ_{x_i→f}(x_i)

μ_{x→f}(x) = Π_{f_j ∈ adj(x)\{f}} μ_{f_j→x}(x)
I also need the configuration of the other nodes. When I compute the messages, in principle I also know which configuration of the variables I am maximizing over gives a certain value for x. I need to store this information in an argmax message, i.e. the configuration of the other variables that gave me that value. I therefore store another message Φ_{f→X}(X): at each step the max message is stored in μ and the corresponding configuration in the argmax message Φ.
9.2.2.1 Example
When I get to the end of the chain, I have a message at node x_n coming from the last factor. If I maximize over x_n I get the maximum probability, while if I compute the argmax I get the maximal configuration for x_n (f_{n−1,n} → x_n denotes the factor connecting nodes n − 1 and n, sending its message to node x_n):

x_n^max = argmax_{x_n} [μ_{f_{n−1,n}→x_n}(x_n)]
x_{n−2}^max = Φ_{f_{n−2,n−1}→x_{n−1}}(x_{n−1}^max)
...
x_1^max = Φ_{f_{1,2}→x_2}(x_2^max)
After backtracking through all the variables, I get the most probable configuration over all of them: back-propagating the last value gives the most probable configuration of the previous nodes.
For a better understanding, we now consider an example with three states per variable (a trellis for a linear chain).
A trellis diagram shows the k possible states of a variable (in this case k = 3); the states of each node are represented in n columns, and time goes left to right. For each state of a variable X_n, Φ_{f_{n−1,n}→X_n}(X_n) defines a unique state of the previous variable which is maximal given it (in the diagram it is linked by an edge). Once the maximal state for the last variable X_n is chosen, I back-propagate through the links in order to get the most probable configuration of the previous nodes.
This procedure works also for trees with the exception that we need to back-
propagate multiple times. The reasoning is still the same.
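A minimal max-product (Viterbi-style) sketch for a linear chain with k = 3 states, storing the max messages and the argmax (Φ) back-pointers used for back-propagation; the tables are arbitrary toy values:

import numpy as np

k, n = 3, 4
rng = np.random.default_rng(1)
p1 = rng.random(k)                          # unnormalized P(X1)
trans = rng.random((k, k))                  # trans[a, b] ~ P(X_{i+1}=b | X_i=a)

mu = p1.copy()                              # max message reaching the current node
phi = []                                    # argmax (back-pointer) messages
for _ in range(n - 1):
    scores = mu[:, None] * trans            # scores[a, b] = mu(a) * P(b | a)
    phi.append(scores.argmax(axis=0))       # best previous state for each state b
    mu = scores.max(axis=0)

# Backtrack from the maximal state of the last variable.
best = [int(mu.argmax())]
for back in reversed(phi):
    best.append(int(back[best[-1]]))
best.reverse()
print("max probability:", mu.max(), "configuration:", best)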
Learning BN
At this point, our aim is to write down an expression for p(D|θ). Since the data are i.i.d. given the parameters θ, we can write:

p(D|θ) = Π_{i=1}^{N} p(x(i)|θ)

Now, we can go through the factorization of the BN to rewrite p(x(i)|θ):

p(D|θ) = Π_{i=1}^{N} Π_{j=1}^{m} p(x_j(i) | pa_j(i), θ)
Notice that in Figure 10.2 we replicate the model N times; each instance of the BN is related to a different example.
When computing p(x_j(i)|pa_j(i)), this probability depends only on the subset of parameters θ_{X_j|pa_j} which are relevant (i.e. part of the corresponding table) for that particular conditional probability distribution. So p(x_j(i)|pa_j(i)) depends only on θ_{X_j|pa_j} and not on the other parameters in the network.

p(D|θ) = Π_{i=1}^{N} Π_{j=1}^{m} p(x_j(i) | pa_j(i), θ_{X_j|pa_j})    (10.1)
       = p(X_1(1)|θ_{X_1}) p(X_2(1)|X_1(1), θ_{X_2|X_1}) p(X_3(1)|X_1(1), θ_{X_3|X_1})
         p(X_1(2)|θ_{X_1}) p(X_2(2)|X_1(2), θ_{X_2|X_1}) p(X_3(2)|X_1(2), θ_{X_3|X_1}) ···

Notice that the parameters θ_{X_1}, θ_{X_2|X_1}, θ_{X_3|X_1} do not depend on the specific example.
If we try to draw the corresponding network we get the model illustrated in Figure 10.3.
θ_{x|u}^{max} = N_{u,x} / Σ_x N_{u,x}

At the end of the day, the maximum likelihood parameters are simply the fraction of times in which the specific configuration was observed in the data.
Example
In this example we consider the situation illustrated in Figure 10.4. Suppose
that our dataset is populated as follows:
X1 X2 X3
T R F
T R F
T G T
F R F
T B T
T B T
T B T
In this case the likelihood function which we want to maximize becomes the one given in Equation 10.3.
Remember that the columns in the table must sum to one. As a consequence, the purpose of the problem is to maximize Equation 10.3 subject to the fact that the columns in the table sum to one. As a result, there are no connections among columns. So, in 10.3 we consider θ_{T|R}² and θ_{F|R} together because they have to sum to one. On the other hand, θ_{T|G} and θ_{T|B}³ have no constraints with respect to the other variables in the formula, so we can maximize them independently. In essence, in the maximization, each column (i.e. a configuration of the parents) can be treated independently. Maximizing a single column means maximizing over the single distribution which corresponds to that column. So, we face the maximization problem by dividing it into pieces, until we end up maximizing individual distributions.
In this case:

θ_{x|u}^{max} = N_{u,x} / Σ_x N_{u,x}    (10.4)

• θ_{T|R} = 2/3
• θ_{T|G} = 1/1 = 1
• θ_{T|B} = 3/3 = 1
• θ_{F|R} = 1/3
• θ_{F|G} = 0
• θ_{F|B} = 0
Remark: this is a toy example. In real world scenarios it is not ideal to have probability 1 or probability 0. For example, if the estimate of a probability is zero, it nullifies the joint probability of the whole network. In order to deal with this problem, we introduce priors.
In this case, our aim is to compute the configuration θ^max which maximizes not the likelihood but the posterior:

p(θ|D) = p(D|θ) p(θ) / p(D)

p(D) does not depend on θ. So, maximizing p(θ|D) is the same as maximizing p(D|θ) p(θ), since p(D) is constant (p(θ|D) ∝ p(D|θ) p(θ)).
θ_{x|u}^{max} = (N_{u,x} + α_{x|u}) / Σ_x (N_{u,x} + α_{x|u})    (10.6)

The prior is like having observed α_{x|u} imaginary samples with configuration X = x, U = u.
For example, we could assume to have seen each configuration once (or a fraction of a time) by coherently setting each α_{x|u}. In this way we avoid getting zero probabilities.
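A minimal sketch of these count-based estimates on the toy dataset above (plain Python; the orientation of the conditional table, the T/F variable given the R/G/B parent value, follows the values listed in the example, and the pseudocounts illustrate Equation 10.6):

from collections import Counter

# Toy dataset from the example: (T/F value, R/G/B parent value).
data = [("T", "R"), ("T", "R"), ("T", "G"), ("F", "R"),
        ("T", "B"), ("T", "B"), ("T", "B")]

counts = Counter(data)                       # N_{u,x}
parents = {"R", "G", "B"}
values = {"T", "F"}

def theta(x, u, alpha=0.0):
    """theta_{x|u}: relative frequency, optionally smoothed with pseudocount alpha."""
    num = counts[(x, u)] + alpha
    den = sum(counts[(v, u)] + alpha for v in values)
    return num / den

for u in sorted(parents):
    for x in sorted(values):
        print(f"theta_{x}|{u}: ML={theta(x, u):.3f}  MAP(alpha=1)={theta(x, u, 1.0):.3f}")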
With incomplete data, some of the examples miss evidence on some of the variables. Counts of occurrences of the different configurations cannot be computed if not all data are observed. The full Bayesian approach of integrating (in the discrete case, summing) over the missing variables is often intractable in practice, so we need approximate methods to deal with the problem.
Example: we could have question marks for some configurations:
X1 X2 X3
T R F
? R F
T G T
F ? F
T B T
T ? T
T B T
This is the case of the medical domain where it is unlikely that a person does
all the possible exams. This of course results in a lot of missing data.
10.2.1 Expectation-Maximization
Sufficient statistics (counts) cannot be computed due to missing data. Remember that the counts are the sufficient statistics, i.e. all we need from the data. At the beginning we initialize the parameters in some way, as discussed in the following. After that, we fill in the missing data by inferring them with the current parameters, i.e. by solving an inference problem to get expected counts. In other words, we compute the probability of a certain configuration given the current parameters and what we already know (the non-missing data). Using these probabilities we compute the so-called expected counts; they are named ”expected” since we do not have certain values for some variables. Then, we compute the parameters maximizing the likelihood (or the posterior) of such expected counts (using the expected counts as if they were exact counts). Finally, we iterate the procedure to improve the quality of the parameters: with the updated parameters we can re-compute the expected counts. The procedure is iterated until nothing changes anymore (i.e. there is no parameter update). This procedure is guaranteed to converge to a local maximum of the likelihood. The quality of this local optimum depends significantly on the initial point, i.e. the initial configuration of the parameters.
The expected counts take the form

E[N_ijk] = Σ_{l=1}^{n} p(X_i(l) = x_k, Pa_i(l) = pa_j | X(l), θ)

• In the formula, N_ijk is the count for the configuration where variable X_i takes value x_k and its parents Pa_i take value pa_j. So i is the node in the network, k is the value for that node, j is the value of the parents: i identifies the table, j and k identify the entry of the table.
• If X_i(l) and Pa_i(l) are observed for example l, the term is either zero or one. (In the case of complete data, for each example we check whether the particular configuration holds and we sum up, obtaining an integer which is the real count.)
• Otherwise, we run Bayesian inference to compute the probability from the observed variables. (Σ_{l=1}^{n} is the sum over the set of examples; X_i(l) represents node i in the l-th example; Pa_i(l) represents the parents of node i in the l-th example.)
This step is called the expectation step (E-step).
In the formula, D_c represents the dataset filled in with the expected counts. Each multinomial parameter θ*_ijk evaluates to:

θ*_ijk = E_{p(X|D,θ)}[N_ijk] / Σ_{k=1}^{r_i} E_{p(X|D,θ)}[N_ijk]    (10.9)

At the end of this step, we get a new estimate of the parameters θ. We then iteratively execute the E-step with the updated parameter estimate. The iterative procedure proceeds in this manner until convergence.
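A minimal EM sketch for a tiny two-node network X1 → X2 with some missing X1 values (numpy; the dataset and the initialization are arbitrary, and the E-step fills missing values with their posterior under the current parameters):

import numpy as np

# Toy two-node network X1 -> X2, both binary; some X1 values are missing (None).
data = [(1, 1), (1, 1), (0, 0), (None, 1), (None, 0), (1, 0), (None, 1)]

theta1 = 0.5                       # P(X1 = 1), initial guess
theta2 = np.array([0.5, 0.5])      # theta2[a] = P(X2 = 1 | X1 = a), initial guess

for _ in range(50):
    # E-step: expected counts, filling in missing X1 with its posterior.
    n1 = np.zeros(2)               # expected counts of X1 = a
    n21 = np.zeros(2)              # expected counts of (X1 = a, X2 = 1)
    for x1, x2 in data:
        if x1 is None:
            like = np.array([1 - theta1, theta1]) * np.where(
                x2 == 1, theta2, 1 - theta2)      # P(X1=a) P(x2 | X1=a)
            post = like / like.sum()              # P(X1 = a | x2)
        else:
            post = np.eye(2)[x1]
        n1 += post
        n21 += post * x2
    # M-step: maximum likelihood on the expected counts.
    theta1 = n1[1] / n1.sum()
    theta2 = n21 / n1

print("P(X1=1) =", round(theta1, 3), " P(X2=1|X1) =", np.round(theta2, 3))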
Naive Bayes
Let's now consider a simple classifier that can be built by exploiting a BN:

y* = argmax_{y∈Y} P(y|x⃗) = argmax_{y∈Y} P(x⃗|y) P(y) / P(x⃗)

The problem is that we are generally not able to model P(x⃗|y), the joint probability of all the features given the class, when x⃗ grows dramatically in dimension.
The naive Bayes assumption is that the features are independent given the class. This means that we can simplify the probability into:

P(y|x⃗) ∝ P(y) Π_{i=1}^{n} P(x_i|y)
Figure 11.2: Example of recurrences of each word in the vocabulary for each class
(topic) as independent variables
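A minimal multinomial Naive Bayes sketch in the spirit of Figure 11.2, counting word occurrences per class with Laplace smoothing (the tiny corpus is invented for illustration):

import math
from collections import Counter, defaultdict

docs = [("buy cheap pills now", "spam"), ("cheap pills cheap", "spam"),
        ("meeting agenda today", "ham"), ("project meeting now", "ham")]

word_counts = defaultdict(Counter)   # per-class word counts
class_counts = Counter()
for text, y in docs:
    class_counts[y] += 1
    word_counts[y].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text, alpha=1.0):
    scores = {}
    for y in class_counts:
        total = sum(word_counts[y].values())
        score = math.log(class_counts[y] / len(docs))          # log P(y)
        for w in text.split():                                 # sum of log P(x_i | y)
            score += math.log((word_counts[y][w] + alpha) /
                              (total + alpha * len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

print(predict("cheap pills today"))   # spam
print(predict("project meeting"))     # ham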
Linear discriminant functions
With a discriminative approach there is no need to model the full joint probability: we can focus only on learning the boundaries. In essence, we learn a function (discriminative function) f : X → Y which directly models P(Y|X).
Pros:
• When data are complex, modeling their underlying distribution can be
very difficult (e.g. P (X|Y ) where X represents high resolution images)
• If data discrimination is the goal, data distribution modeling is not needed
• As a consequence we can focus only on learning (using the available train-
ing examples) the parameters which are interesting to understand the
mapping between input and output
Cons:
• The learned model is less flexible in its usage. Indeed, discriminative
models have fixed input and fixed output.
• It does not allow to perform arbitrary inference tasks. Really often in these
scenarios there is not a precise distinction between inputs and outputs.
• It is not possible to efficiently generate new data from a certain class.
The discriminant function is a linear combination of the example features x:

f(x) = w^T x + w_0    (12.1)

The decision boundary is the set of points with f(x) = 0. For any two points x, x′ on the boundary:

f(x) = f(x′) = 0
w^T x + w_0 = w^T x′ + w_0 = 0
w^T x + w_0 − w^T x′ − w_0 = 0
w^T (x − x′) = 0

This means that the vector w and the vector (x − x′) are orthogonal.
Functional margin
The value f(x) of the function for a certain point x is called the functional margin. It can be seen as a confidence in the prediction.
Geometric margin
The distance from x to the hyperplane is called the geometric margin:

r^x = f(x) / ||w||    (12.3)

In particular, the distance of the hyperplane from the origin is

r^0 = f(0) / ||w|| = w_0 / ||w||    (12.4)

A point x can be expressed as its projection x_p on the hyperplane plus its distance to the hyperplane times the unit vector in that direction:

x = x_p + r^x (w / ||w||)

Then, since x_p lies on the hyperplane (w^T x_p + w_0 = 0):

f(x) = w^T x + w_0
     = w^T (x_p + r^x w / ||w||) + w_0
     = w^T x_p + w_0 + r^x (w^T w / ||w||)
     = r^x ||w||

where we used

w^T w / ||w|| = w^T w / √(w^T w) = √(w^T w) = ||w||

Given these, we can write:

f(x) = r^x ||w||   ⟹   r^x = f(x) / ||w||

Figure 12.3: A point x can be expressed by its projection on H plus its distance to H times the unit vector in that direction: x = x_p + r^x (w / ||w||).
12.3 Perceptron
In order to introduce the perceptron, in the following we write about the bio-
logical motivation behind this concept.
A single linear classifier can represent linearly separable sets of examples, such as the AND and OR boolean functions (Figures 12.6 and 12.7).
Figure 12.6: A single linear classifier can represent linearly separable sets of examples, e.g. the AND boolean function.
Figure 12.7: A single linear classifier can represent linearly separable sets of examples, e.g. the OR boolean function.
A linear classifier cannot learn more complex boolean formulas like the XOR (Figure 12.8).
Figure 12.8: A linear classifier cannot learn more complex boolean formulas like the XOR.
Problems
• non-linearly separable sets of examples cannot be separated;
• representing complex logic formulas can require a number of perceptrons exponential in the size of the input. Having an exponential number of perceptrons is definitely prohibitive for learning. These models are usually referred to as shallow models; the solution is to build deeper models with several perceptron layers.
It is useful to write Equation 12.5 in a shorter way. To do this, augmented vectors are used: the idea is to incorporate the bias into the augmented weight vector.

f(x) = sign(ŵ^T x̂)    (12.6)

ŵ = (w_0, w)^T    (12.7)

x̂ = (1, x)^T    (12.8)

For the sake of brevity, we omit the hat over x and w in the following, replacing the search for a weight vector plus bias with the search for the augmented weight vector.
Finding the parameters to minimize the error function can easily lead to over-
fitting. However, keep in mind that linear classifier are simple models less prone
to overfitting with respect to deep neural networks.
w = w − η ∇E(w; D)    (12.10)
In the case of the linear classifier, the misclassification loss is piecewise con-
stant and so it is a poor candidate for gradient descent. Indeed, applying gradi-
ent descent with a function like the one represented in Figure 12.9 the gradient
is either 0 or it does not exist. In order to apply gradient descent approach, we
need functions with a smoother behaviour.
yf(x) ≤ 0

Indeed, we do not consider the correct predictions: we take into consideration only the wrong ones.
With this approach we are taking into consideration the confidence of our
predictions. For this reason, we typically refer to this kind of loss functions as
confidence loss. We compare the confidence of the prediction with respect to
the actual class.
∇E(w; D) = Σ_{(x,y)∈D_E} ∇(−y f(x))
         = Σ_{(x,y)∈D_E} ∇(−y (w^T x))
         = Σ_{(x,y)∈D_E} −y x
Remark: the update formula w ← w + η y x is the same as before, but without the summation over all the training data: we update the weight vector considering only one example.
With this method, we make a gradient step for each training error, rather
than on the sum of them in batch learning. In this way, each gradient step is
very fast. Moreover, stochasticity can sometimes help to avoid local minima (i.e.
it could be more robust), being guided by various gradients for each training
example (which won’t have the same local minima in general). At each iteration
we compute the gradient on a different error function. Following this approach,
we stochastically optimize the error function.
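A minimal stochastic perceptron sketch with augmented vectors (numpy; the toy dataset is the OR function with ±1 labels, which is linearly separable):

import numpy as np

# OR function with labels in {-1, +1}; first column is the constant 1 (augmented vector).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1], dtype=float)

w = np.zeros(3)
eta = 0.1
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:        # misclassified (or zero confidence)
            w += eta * yi * xi        # stochastic gradient step on -y w^T x
            errors += 1
    if errors == 0:                   # stop when all examples are correctly classified
        break

print(w, np.sign(X @ w))              # learned weights and predictions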
In regression, one could try to solve the system of equations exactly:

X w = y    (12.12)

giving as solution:

w = X^{−1} y    (12.13)

Unfortunately, this approach does not work in general: X is typically not square (there are more examples than features), and even when the matrix is square it is not typically invertible. However, our aim is not necessarily to solve the system of equations exactly: it is enough to do the best we can. Again, we are going to minimize an error function. When dealing with regression and loss minimization, the most common error function is the mean squared error (MSE):

E(w; D) = Σ_{(x,y)∈D} (y − f(x))² = (y − Xw)^T (y − Xw)    (12.14)
For this equation, there is a closed form solution. Moreover, it can also be al-
ways solved by gradient descent. This latter approach can be faster.
∇E(w; D) = ∇ (y − Xw)^T (y − Xw)
         = 2 (y − Xw)^T (−X) = 0
−2 y^T X + 2 w^T X^T X = 0
w^T X^T X = y^T X
(w^T X^T X)^T = (y^T X)^T
X^T X w = X^T y
w = (X^T X)^{−1} X^T y
Remark:
• The left-inverse exists provided (X^T X) ∈ R^{d×d} is full rank, otherwise we cannot invert it. If (X^T X) is not full rank, it is because some features are linearly dependent; in order to make it full rank, we just remove the redundant features. In this way all the features are linearly independent and (X^T X) is full rank.
• If X is square and nonsingular (i.e. invertible), inverse and left-inverse coincide ((X^T X)^{−1} X^T = X^{−1}) and the MSE solution corresponds to the exact one:

w = X^{−1} y

So, if there is an exact solution, MSE finds it; otherwise it provides the approximation which minimizes the squared error.
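A minimal sketch of the closed-form MSE solution (numpy; the data are synthetic, and np.linalg.lstsq is shown as the numerically preferable alternative to forming (X^T X)^{-1} explicitly):

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]     # augmented design matrix
true_w = np.array([0.5, 2.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=50)           # noisy linear targets

w_closed = np.linalg.inv(X.T @ X) @ X.T @ y          # w = (X^T X)^-1 X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)      # same solution, more stable

print(w_closed, w_lstsq)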
In the multiclass case, the boundary between class i and class j is given by the points where the two discriminant functions take the same value:

w_i^T x = w_j^T x
(w_i − w_j)^T x = 0

From this last equation we observe that the boundary is actually a hyperplane (Figure 12.10).
A Gaussian classifier becomes linear when the covariance is shared among classes (Σ_i = Σ). This defines a general spread in the space which does not depend on the class; only under this assumption does the Gaussian classifier become a linear classifier.
Also Naive Bayes can be seen as a linear classifier. Its discriminant function is

f_i(x) = P(x|y_i) P(y_i)
       = Π_{j=1}^{|x|} P(x_j|y_i) P(y_i)
       = Π_{j=1}^{|x|} Π_{k=1}^{K} θ_{k y_i}^{z_k(x[j])} · |D_i| / |D|
       = Π_{k=1}^{K} θ_{k y_i}^{N_{kx}} · |D_i| / |D|
Remark:
• Π_{j=1}^{|x|} is the product over all the features;
• Π_{k=1}^{K}: x_j is a discrete variable with K possible values;
• z_k(x[j]) is the one-hot encoding of the categorical variable x_j: z_k(x[j]) = 1 if the variable takes the k-th value and zero otherwise;
• θ_{k y_i} is the parameter of the k-th value for class y_i, i.e. the probability that feature x_j takes the k-th value given y_i. In essence θ_{k y_i} is raised to the power of 1 when x_j takes the k-th value, and to the power of 0 otherwise (Figure 12.11);
• N_{kx} is the number of times value k appears in x;
• |D_i| / |D| is equal to P(y_i).
Figure 12.11: Parameter of the kth value for yi class. In this figure we assume yi is
a binary class.
In the formula there are too many expensive exponents, so we take the log:

log f_i(x) = Σ_{k=1}^{K} N_{kx} log θ_{k y_i} + log (|D_i| / |D|)

which is a linear function of the counts N_{kx}:

log f_i(x) = w^T x′ + w_0
Support Vector Machines (SVMs) are a very popular method for linear (and non-linear) classification, with several appealing properties.
13.1 Hyperplanes
All the hyperplanes that can separate the two subsets in Figure 13.1 classify correctly every example. If I use a perceptron I could end up with any of those planes, depending on the starting position.
SVMs learn the central hyperplane, the one that has the largest margin. The margin is the distance of the closest point to the hyperplane, and the SVM tries to maximize this distance from both classes. This is intuitively a good choice: it leaves the most space between the samples of the two classes.
Classifier margin
Given a training set D, the classifier confidence margin is

ρ = min_{(x,y)∈D} y f(x)

and the corresponding geometric margin is

ρ / ||w|| = min_{(x,y)∈D} y f(x) / ||w||
Canonical hyperplane
There is an infinite number of equivalent formulations encoding the same hyperplane:

α (w^T x + w_0) = 0

for any α ≠ 0. The canonical hyperplane is the one having confidence (classifier margin, 13.1) equal to 1:

ρ = min_{(x,y)∈D} y f(x) = 1

As Figure 13.2 shows, the dashed lines are the hyperplanes for which the confidence margin is smallest and equal to 1 (and −1): they are the hyperplanes with confidence 1 for the red class (labelled −1) and the blue class (labelled +1). Taking the labels of the classes into account, it always means y f(x) = 1. The geometric margin becomes 2 / ||w||.
1. the sum of margin errors ν, which accounts for all margin errors (training errors or examples with not enough confidence);
2. a second term which depends on the number of training examples m (the greater m, the smaller this factor) and on ρ², related to the margin (the larger the margin, the smaller the error).
Remark: if ρ is fixed to 1 (canonical hyperplane), maximizing the margin corresponds to minimizing ||w||.
The hard margin constraint for each training example is

y_i (w^T x_i + w_0) ≥ 1

To handle constrained problems of the form

min_z f(z)
subject to: g_i(z) ≥ 0   ∀i

the Karush-Kuhn-Tucker (KKT) approach builds the Lagrangian

L(z, α) = f(z) − Σ_i α_i g_i(z)

where the i index the constraints and α_i ≥ 0 is the multiplier of the corresponding constraint. Basically we remove the constraints by adding them into the objective. The problem has the parameters z, which we want to minimize, and the αs: we minimize with respect to z and maximize with respect to α, which has to be non-negative.
The optimal solutions z* of this problem are also optimal solutions of the original constrained problem: z* minimizes the original objective and satisfies the constraints.
Suppose now that z′ satisfies all the constraints. If we set all the αs to zero, the summation vanishes, all the constraints are satisfied, and z′ is evaluated only on the objective, so z′ will also be a solution of min_z f(z) (which is what we were looking for at the beginning).
Applying the KKT approach to SVMs, we can formalize our learning objective as:

min_{w,w_0}  (1/2) ||w||²

subject to:

y_i (w^T x_i + w_0) ≥ 1   ∀(x_i, y_i) ∈ D

We now build the Lagrangian, bringing the 1 to the other side:

L(w, w_0, α) = (1/2) ||w||² − Σ_{i=1}^{m} α_i (y_i (w^T x_i + w_0) − 1)    (13.2)
We now try to solve the Lagrangian equivalent form. Let's take the gradient with respect to w and w_0 and set it to zero.

L = (1/2) ||w||² − Σ_{i=1}^{m} α_i (y_i (w^T x_i + w_0) − 1)

∇_w L = ∇_w (w^T w / 2) − ∇_w Σ_{i=1}^{m} α_i y_i w^T x_i
      = (2w / 2) − Σ_{i=1}^{m} α_i y_i x_i
      = w − Σ_{i=1}^{m} α_i y_i x_i = 0

Remark: only the terms containing w contribute to the gradient:

∇_w Σ_{i=1}^{m} α_i (y_i (w^T x_i + w_0) − 1) = ∇_w Σ_{i=1}^{m} α_i y_i w^T x_i

Now we get

w = Σ_{i=1}^{m} α_i y_i x_i    (13.3)
This is not yet a solution, since w is defined in terms of the alphas; but it is a way to write the primal variables in terms of the secondary (dual) variables (the αs). Let's also take the derivative of the Lagrangian formulation with respect to w_0:

L = (1/2) ||w||² − Σ_{i=1}^{m} α_i (y_i (w^T x_i + w_0) − 1)

∂L/∂w_0 = ∂(−Σ_{i=1}^{m} α_i y_i w_0) / ∂w_0 = −Σ_{i=1}^{m} α_i y_i = 0   ⟹   Σ_{i=1}^{m} α_i y_i = 0    (13.4)

Remark: also in this case we take into account only the terms where w_0 appears.
What we can do is replace w with its formulation in terms of the alphas (w = Σ_{i=1}^{m} α_i y_i x_i), getting

L = (w^T w) / 2 − Σ_{i=1}^{m} α_i (y_i (w^T x_i + w_0) − 1)
  = (1/2) (Σ_{i=1}^{m} α_i y_i x_i)^T (Σ_{j=1}^{m} α_j y_j x_j) − Σ_{i=1}^{m} α_i y_i (Σ_{j=1}^{m} α_j y_j x_j)^T x_i − Σ_{i=1}^{m} α_i y_i w_0 + Σ_{i=1}^{m} α_i

Considering the first term, the α and y are scalars (their transpose is still a scalar); only x_i and x_j are vectors. Computing the product and pulling the scalars out, the first term is

(1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j

The second term, −Σ_{i=1}^{m} α_i y_i (Σ_{j=1}^{m} α_j y_j x_j)^T x_i, is actually twice the quantity described by the first term. In the third term, w_0 does not depend on i, so we can take it outside the sum. At this point we get:

L = (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j − Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j − w_0 Σ_{i=1}^{m} α_i y_i + Σ_{i=1}^{m} α_i

Since we defined w = Σ_{i=1}^{m} α_i y_i x_i (13.3), and since Σ_{i=1}^{m} α_i y_i = 0 (13.4), our objective becomes:
L(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
This is the final formulation of the Lagrangian. It has to be minimized with respect to w, w_0 and maximized with respect to α; having got rid of the minimization, we are now left with a maximization with respect to α.
This is a dual formulation (the αs are the dual variables, since they did not appear in the original problem but were introduced in the Lagrangian), while the original formulation was only in terms of the primal variables. So we can formulate the constrained optimization problem in two ways (primal or dual); the two forms are mediated via the Lagrangian. The dual is still a quadratic optimization problem, but its constraints are simpler than in the primal problem: there we had linear constraints which were not easy to keep satisfied, while in the dual we have easier constraints (non-negativity and, in the soft margin case, box constraints), so the dual formulation is often easier to deal with. Overall there are many ways to solve it: depending on the number of features and the number of examples (the αs depend on the number of examples m, while w depends on the number of features d) we can address the problem more easily from the primal or from the dual.
At the saddle point, for each training example either:
1. α_i is equal to zero, which means that the example x_i does not contribute to the final solution, or
2. y_i (w^T x_i + w_0) − 1 = 0 (the other factor of the product is zero), which means that y_i (w^T x_i + w_0) = 1: the confidence for the example is exactly one. There are two hyperplanes with confidence equal to one (positive and negative, Figure 13.2), and both satisfy this condition (positive examples with confidence +1 and negative examples with confidence −1). These are the only points (with confidence exactly +1 or −1) that can have α_i > 0, and they are the support vectors (SV).
The decision hyperplane f(x) = 0 is defined only in terms of the support vectors: all other points do not contribute to the decision function. This means that if I removed every other example from the training set (leaving only the support vectors) I would get the very same classifier. SVMs are sparse: only a few of the training examples end up being support vectors, and in many cases the number of SVs is much smaller than the number of training examples.
Given any support vector, w_0 can be obtained from the confidence-one condition:

y_i (w^T x_i + w_0) = 1
y_i w^T x_i + y_i w_0 = 1
w_0 = (1 − y_i w^T x_i) / y_i

For each of these examples we get a (numerically slightly different) w_0, so the final value is their average.
We want to maximize the margin but with a soft constraint: some exceptions (falsely classified or low-confidence samples) are allowed. This intuition can be formalized as minimizing the inverse of the margin subject to all examples being predicted in the correct class with a confidence of at least one, adding a slack variable ξ_i for each example in the dataset; the slacks are summed up and weighted by a multiplicative factor C, which is a hyperparameter:

min_{w∈X, w_0∈R, ξ∈R^m}  (||w||² / 2) + C Σ_{i=1}^{m} ξ_i

subject to:

y_i (w^T x_i + w_0) ≥ 1 − ξ_i    i = 1, ..., m
ξ_i ≥ 0    i = 1, ..., m
The slack ξ_i measures how far we are from satisfying the constraint:
• if it is equal to zero we are classifying the example correctly with enough confidence;
• if it is greater than zero we are violating the margin for that example (and, if ξ_i > 1, misclassifying it);
• the larger Σ_{i=1}^{m} ξ_i, the larger the amount of margin violations.
By adding the sum of penalties C Σ_{i=1}^{m} ξ_i we are trying to combine margin maximization and penalty minimization.
The idea is that in this objective we are combining a large margin with having few exceptions to the rule of large margin separation. Of course these are conflicting objectives. There is a more general theory behind this concept: regularization theory. In regularization theory the objective to be learned combines a complexity term (in this case the norm of the weight vector, related to the margin, but in general a measure of the complexity of the solution) and the training errors (measured in terms of a loss function, the hinge loss in the case of SVMs):

min_{w∈X, w_0∈R}  (||w||² / 2) + C Σ_{i=1}^{m} loss(y_i, f(x_i))
In soft margin SVMs we need to specify the loss function loss(y_i, f(x_i)). Basically, ξ_i is the loss paid for a prediction that does not reach confidence one. Considering the constraint

y_i (w^T x_i + w_0) ≥ 1 − ξ_i
y_i f(x_i) ≥ 1 − ξ_i

we get

ξ_i ≥ 1 − y_i (w^T x_i + w_0)

Remembering that ξ_i is non-negative, and combining these facts, we obtain

loss(y_i, f(x_i)) = |1 − y_i f(x_i)|_+ = |1 − y_i (w^T x_i + w_0)|_+    (13.5)

where | · |_+ stands for the value itself if it is positive and zero otherwise.
If we look at this loss function graphically, plotting it as a function of y f(x), the loss is zero if the confidence in the correct prediction is at least one; if the confidence is less than one, it is |1 − y_i f(x_i)|_+.
This loss function is called the hinge loss; it is not a smooth loss function. It also shows why SVMs are sparse: a loss function which is zero in a part of the space helps in obtaining a sparse solution (it means that in that region you do not care about differences, so the complexity term dominates there). Still, this loss function can be minimized.
Then we compute the derivative with respect to w_0; again nothing changes, since there is only one term containing w_0, and zeroing it we get

Σ_{i=1}^{m} α_i y_i = 0    (13.7)

Then we also need to compute the gradient with respect to the slack variables, i.e. with respect to each ξ_i:

∂L/∂ξ_i = ∂(C ξ_i − α_i ξ_i − β_i ξ_i) / ∂ξ_i

Zeroing it we get

C − α_i − β_i = 0    (13.8)
Substituting these conditions back, we obtain an objective that corresponds exactly to the hard margin case. Therefore the dual formulation is:

max_{α∈R^m}  Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j x_i^T x_j    (13.9)

subject to

0 ≤ α_i ≤ C    i = 1, ..., m
Σ_{i=1}^{m} α_i y_i = 0
The interesting fact concerns the support vectors: from the KKT conditions, at the saddle point of the Lagrangian it holds that

α_i (y_i (w^T x_i + w_0) − 1 + ξ_i) = 0   ∀i

and

β_i ξ_i = 0   ∀i

This characterizes the support vectors: if α_i = 0, the example does not contribute to the decision. If α_i > 0 then necessarily y_i (w^T x_i + w_0) − 1 + ξ_i = 0. We also have the additional condition (13.8), which is not a KKT condition but comes directly from the derivative. Suppose that α_i < C; then

C − α_i − β_i = 0   ⟹   β_i > 0

If β_i > 0, then to satisfy β_i ξ_i = 0 we need ξ_i = 0. Therefore

y_i (w^T x_i + w_0) = 1

which is the same condition as in the hard margin SVM: the example stays on the hyperplane with confidence one. So any example with α_i greater than zero but smaller than C stays on the confidence-one hyperplane.
For the examples with α_i = C, instead, β_i = 0, so ξ_i does not have to be equal to zero (and typically it is not). Therefore the example will not be on the confidence-one hyperplane, but closer to the decision hyperplane (a margin error) or even on the wrong side of it (a training error).
Figure 13.5: Soft margin SVM with the decision hyperplane, the hyperplanes with confidence equal to one, and bound and unbound support vectors (those on the confidence-one hyperplanes, and the margin or training errors closer to the decision hyperplane or mislabelled).
All the examples with α_i > 0 are the so-called support vectors, because they contribute to the definition of the separating hyperplane (bound and unbound support vectors). All the others, having α_i = 0, do not contribute to the decision hyperplane.
The soft margin problem can also be attacked directly with stochastic (sub)gradient descent on the regularized per-example objective, setting λ = 1/C:

∇_w E(w; (x_i, y_i)) = λ w − 1[y_i ⟨w, x_i⟩ < 1] y_i x_i

At this point we cannot take an ordinary gradient everywhere, so we compute a subgradient: whenever we do not have a single gradient at a point, we can use a subgradient. The indicator function is

1[y_i ⟨w, x_i⟩ < 1] = 1 if y_i ⟨w, x_i⟩ < 1, and 0 otherwise
The subgradient of a function f at a point x_0 is any vector v such that for any x:

f(x) − f(x_0) ≥ v^T (x − x_0)

This means that where the derivative (gradient) has a discontinuity, we can still find subgradients. We could use any of the (red) vectors that satisfy the condition. The selection of v is driven by the indicator function, which tells us whether the confidence is smaller than one. We can then get a valid (sub)gradient at each point and perform subgradient descent.
This means that we update the weights along the error direction. The only additional piece is that the learning rate is not constant, but decreases with λ = 1/C and with the iteration t, which is pretty common in gradient descent. This scheme has some theoretical guarantees, but we do not need the theoretical details here.
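A minimal sketch of this stochastic subgradient scheme (often called Pegasos-style training; numpy, synthetic data, bias term omitted for brevity, decreasing learning rate η_t = 1/(λ t) as one common choice):

import numpy as np

rng = np.random.default_rng(0)
X = np.r_[rng.normal(1.5, 1, (50, 2)), rng.normal(-1.5, 1, (50, 2))]
y = np.r_[np.ones(50), -np.ones(50)]

lam = 0.01                       # lambda = 1 / C
w = np.zeros(2)
t = 0
for epoch in range(20):
    for i in rng.permutation(len(y)):
        t += 1
        eta = 1.0 / (lam * t)    # decreasing learning rate
        margin = y[i] * (w @ X[i])
        grad = lam * w - (y[i] * X[i] if margin < 1 else 0)
        w -= eta * grad          # subgradient step on the per-example objective

print(w, np.mean(np.sign(X @ w) == y))   # learned weights and training accuracy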
Chapter 14
Non-linear SVMs
Feature map

Φ : X → H    (14.1)

Φ is a function mapping each example to a higher dimensional space H (potentially infinite dimensional).
Examples x are replaced with their feature mapping Φ(x). The feature mapping should increase the expressive power of the representation (e.g. introducing features which are combinations of the input features), so that the examples become (approximately) linearly separable in the mapped space.
Remark: the higher the degree, the larger the feature space, and the higher the degree of feature interactions that we are able to model with the mapping.
The SVM algorithm is applied simply replacing x with Φ(x):

f(x) = w^T Φ(x) + w_0    (14.2)

For instance, with a degree-two polynomial mapping of (x_1, x_2):

f((x_1, x_2)) = sign(w_1 x_1² + w_2 x_1 x_2 + w_3 x_2² + w_0)

Once the data are mapped into the higher dimensional space, we can apply the same SVM solvers that we saw in the linearly separable case.
Also in regression we would like to have few support vectors. In the previous chapter, the fact that support vectors were sparse was a consequence of the hinge loss; for regression we introduce an analogous loss.

ε-insensitive loss

ℓ(f(x), y) = |y − f(x)|_ε = 0 if |y − f(x)| ≤ ε, and |y − f(x)| − ε otherwise    (14.3)

Figure 14.3: The idea is to adopt a loss function which is more tolerant, in order to achieve the sparsity property: we introduce a regression version of the hinge loss, called the ε-insensitive loss.
The ε-insensitive loss depends on the difference between y and f(x) because we are dealing with regression and not classification.
The result is that we are able to tolerate small (up to ε) deviations from the true value, i.e. in this tolerance region we pay no penalty. This is reasonable: remember that in regression the data are typically the result of measurements.
The larger ε, the more tolerant our model is. As for C in soft margin SVMs, ε is something that we have to configure beforehand (a hyperparameter). Hyperparameters are configured in the model selection stage (basically, try out different values and evaluate the performance on a validation set). Playing with the value of ε allows us to trade off function complexity with data fitting.
With this formulation we are assuming that all the training examples have to lie inside the ε-tube. However, as in the case of hard margin SVMs, this could require really large values of ε in order to be satisfiable. As a result, as in soft margin SVMs, we can introduce slack variables and penalties for unsatisfied constraints. If we want to relax the condition y_i − f(x_i) ≤ ε, we can write:

y_i − f(x_i) ≤ ε + ξ

and symmetrically:

f(x_i) − y_i ≤ ε + ξ*

Remark: we write ξ* in this latter formula to highlight the fact that ξ ≠ ξ*.
subject to

w^T Φ(x_i) + w_0 − y_i ≤ ε + ξ_i
y_i − (w^T Φ(x_i) + w_0) ≤ ε + ξ_i*
ξ_i, ξ_i* ≥ 0

Remark: we get two constraints for each example, for the upper and lower sides of the tube. The slack variables ξ_i, ξ_i* penalize predictions out of the ε-insensitive tube.

• constraint 1:
f(x_i) − y_i ≤ ε + ξ_i   ⟹   ε + ξ_i − f(x_i) + y_i ≥ 0
• constraint 2:
y_i − f(x_i) ≤ ε + ξ_i*   ⟹   ε + ξ_i* − y_i + f(x_i) ≥ 0
• constraint 3:
ξ_i ≥ 0
• constraint 4:
ξ_i* ≥ 0
L = (||w||² / 2) + C Σ_i (ξ_i + ξ_i*)
  − Σ_i α_i (ε + ξ_i − y_i + f(x_i))
  − Σ_i α_i* (ε + ξ_i* − f(x_i) + y_i)
  − Σ_i β_i ξ_i
  − Σ_i β_i* ξ_i*

Taking the gradient with respect to w:

∇_w L = w − ∇_w Σ_i α_i (w^T Φ(x_i) + w_0) + ∇_w Σ_i α_i* (w^T Φ(x_i) + w_0)
      = w − Σ_i α_i Φ(x_i) + Σ_i α_i* Φ(x_i)

Setting it to zero gives w = Σ_i (α_i − α_i*) Φ(x_i). Substituting it into the part of the Lagrangian containing w,

w^T w / 2 − Σ_i α_i w^T Φ(x_i) + Σ_i α_i* w^T Φ(x_i)

we obtain two pieces which are the same, one twice the other:

(1/2) (Σ_i (α_i − α_i*) Φ(x_i))^T (Σ_j (α_j − α_j*) Φ(x_j)) − Σ_i (α_i − α_i*) (Σ_j (α_j − α_j*) Φ(x_j))^T Φ(x_i)
  = −(1/2) Σ_i Σ_j (α_i − α_i*)(α_j − α_j*) Φ(x_i)^T Φ(x_j)
The obtained result is really similar to the one obtained in the classification
case.
Now we do the same thing for w_0. First of all we highlight where it appears:

−Σ_i α_i w_0 + Σ_i α_i* w_0 = w_0 Σ_i (α_i* − α_i)

Since Σ_i (α_i − α_i*) = 0 we get:

w_0 Σ_i (α_i* − α_i) = 0

Putting everything together, the dual problem is:

max_{α∈R^m}  −(1/2) Σ_{i,j=1}^{m} (α_i* − α_i)(α_j* − α_j) Φ(x_i)^T Φ(x_j) − ε Σ_{i=1}^{m} (α_i* + α_i) + Σ_{i=1}^{m} y_i (α_i* − α_i)

subject to  Σ_{i=1}^{m} (α_i − α_i*) = 0
If we replace w in the decision function with its dual formulation w = Σ_i (α_i − α_i*) Φ(x_i) we get:

f(x) = w^T Φ(x) + w_0 = Σ_{i=1}^{m} (α_i − α_i*) Φ(x_i)^T Φ(x) + w_0
At the saddle point, the KKT conditions state that

α_i (ε + ξ_i + y_i − w^T Φ(x_i) − w_0) = 0

Notice that in f(x) = Σ_{i=1}^{m} (α_i − α_i*) Φ(x_i)^T Φ(x) + w_0, in order for an example not to be a support vector both α_i and α_i* must be zero. Indeed, an example could be above the tube or below the tube and not satisfy the constraints; to satisfy the constraints the example should be inside the tube. If this is not the case, only one of the two alphas can be non-zero, because the example cannot be above and below the tube at the same time.
If for example α_i ≠ 0, then (ε + ξ_i + y_i − w^T Φ(x_i) − w_0) = 0.
If also ξ_i = 0, then the difference between f(x_i) and y_i is exactly ε: in this case there is no penalty, because we are on the boundary of the tube.
If instead ξ_i > 0, the difference between y_i and f(x_i) is larger than ε: in this case β_i = 0 and so α_i = C.
The same reasoning works symmetrically for the * case.
• Patterns for which either 0 < αi < C or 0 < αi∗ < C (they cannot be
both non-zero at the same time) are on the border of the ϵ-tube, that is
|f (x
xi ) − yi | = ϵ. They are the unbound support vectors. (It is like staying
exactly on the confidence one hyperplane).
• The remaining training patterns are margin errors (either ξi > 0 or ξi∗ >
0), and reside out of the ϵ-insensitive region. They are bound support
vectors, with corresponding αi = C or αi∗ = C.
For the sake of clarity, Figure 14.4 illustrates the concept graphically.
Figure 14.4: The solid curve corresponds to the regions where y − w T Φxx − w0 = 0.
The two parallel curves are where y − w T Φx
x − w0 = ϵ and w T Φx
x − w0 − y = −ϵ.
CHAPTER 14. NON-LINEAR SVMS 175
Looking at Figure 14.5 we notice that reducing ϵ the tube becomes smaller.
The function can accommodate more bumps if needed (it has more flexibility).
As a result the number of support vector increases. Indeed, if the tube becomes
smaller it is difficult that examples stay inside the tube. In the third case the
tube is really small, as a result almost all the points are support vectors.
In conclusion we understood that the complexity of the function is regulated by
the minimization of the weights and by the configuration of the hyperparameter
ϵ which corresponds to the size of the tube.
Remark: in the last case of Figure 14.5 it is clear that we are allowing too
much flexibility. In this case we tend to interpolate all the examples including
their noise.
Kernel machines

SVMs can address both linear and non-linear problems. The solution we saw for non-linear SVMs consists in mapping the examples to non-linear features built as combinations of the input features. This solution is valid also for perceptrons, so it is not specific to SVMs, but it becomes computationally unfeasible in high dimensional cases; moreover, finding the correct map is a non-trivial procedure. In the dual formulation of SVMs (both classification and regression), however, the feature map only appears inside dot products, which are the expensive part.
For SVMs we can therefore do non-linear learning in a different way, which even allows working with infinite dimensional spaces: the kernel trick.

Kernel trick
The kernel trick consists in replacing the dot products that appear in the dual formulation of non-linear SVMs with an equivalent kernel function:

k(x, x′) = Φ(x)^T Φ(x′)

The kernel function takes examples in input space (not in feature space): we can build a function that computes this value without explicitly building the map.
in which Φ(x_i)^T Φ(x′) is in fact a dot product in feature space. We can replace this dot product with k(x_i, x′) = Φ(x_i)^T Φ(x′); since x does not appear anywhere else, we are done.
- Homogeneous polynomial kernel:

k(x, x′) = (x^T x′)^d

- E.g. (d = 2):

k((x_1, x_2), (x′_1, x′_2)) = (x_1 x′_1 + x_2 x′_2)²
                            = (x_1 x′_1)² + (x_2 x′_2)² + 2 x_1 x′_1 x_2 x′_2
                            = (x_1², √2 x_1 x_2, x_2²) (x′_1², √2 x′_1 x′_2, x′_2²)^T
                            = Φ(x)^T Φ(x′)

and we recover two Φ feature vectors, which contain terms of degree up to d; here we wrote them explicitly at the feature-space level, but the kernel itself is computed in input space.
- Inhomogeneous polynomial kernel:

k(x, x′) = (1 + x^T x′)^d

- E.g. (d = 2):

k((x_1, x_2), (x′_1, x′_2)) = (1 + x_1 x′_1 + x_2 x′_2)²
                            = 1 + (x_1 x′_1)² + (x_2 x′_2)² + 2 x_1 x′_1 + 2 x_2 x′_2 + 2 x_1 x′_1 x_2 x′_2
                            = (1, √2 x_1, √2 x_2, x_1², √2 x_1 x_2, x_2²) (1, √2 x′_1, √2 x′_2, x′_1², √2 x′_1 x′_2, x′_2²)^T
                            = Φ(x)^T Φ(x′)
The fact that the feature space has a higher dimension does not change the computational complexity, since we are only raising scalars to a power. Usually we do not use extremely high-degree feature spaces because we would get too many features and we would risk overfitting.
This is possible only because we are working in the dual form, where x appears only inside dot products.
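A minimal numerical check of the d = 2 homogeneous case above, comparing the kernel computed in input space with the explicit dot product in feature space (numpy):

import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D input."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(v, w):
    """Homogeneous polynomial kernel of degree 2, computed in input space."""
    return (v @ w) ** 2

x = np.array([1.0, 2.0])
xp = np.array([-0.5, 3.0])
print(k(x, xp), phi(x) @ phi(xp))   # the two values coincide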
Kernel
A valid kernel is a function defined over the cartesian product of the input space with itself,

k : X × X → R

corresponding to a dot product in some feature space:

k(x, x′) = Φ(x)^T Φ(x′)

Indeed, you can apply kernels to inputs that are not vectors at all (such as sequences or graphs): we can define kernels on arbitrary structures. In practice, we can apply several algorithms to these objects via a kernel that implicitly maps the objects to some vectors.
Gram matrix
Given examples {x_1, ..., x_m} and a kernel function k, the Gram matrix K is the symmetric matrix of pairwise kernels between examples:

K_ij = k(x_i, x_j)   ∀i, j

It is not enough to look at what happens on the training set matrix: we would like to show that the function is positive definite, i.e. that it is indeed a valid kernel. There is a direct way to do it, an eigen-decomposition for functions, which will not be discussed here and can be tricky.
The easier way to show validity is checking the satisfaction of at least one of the following conditions and requisites:
• prove its positive definiteness (difficult)
- Dual problem:

max_{α∈R^m}  −(1/2) Σ_{i,j=1}^{m} (α_i* − α_i)(α_j* − α_j) Φ(x_i)^T Φ(x_j) − ε Σ_{i=1}^{m} (α_i* + α_i) + Σ_{i=1}^{m} y_i (α_i* − α_i)

subject to  Σ_{i=1}^{m} (α_i − α_i*) = 0,   α_i, α_i* ∈ [0, C]   ∀i ∈ [1, m]

where Φ(x_i)^T Φ(x_j) can be replaced by k(x_i, x_j).

- Regression function:

f(x) = w^T Φ(x) + w_0 = Σ_{i=1}^{m} (α_i − α_i*) Φ(x_i)^T Φ(x) + w_0 = Σ_{i=1}^{m} (α_i − α_i*) k(x_i, x) + w_0
Kernelizing a perceptron
The kernel procedure can also be applied to perceptrons: the linear function of a kernelized perceptron becomes a non-linear function.
We can take the dual formulation of the perceptron. Remember that in kernel machines (both regression and classification), f(x) is a combination, over the training examples, of a coefficient times the kernel function. For a classic stochastic perceptron the procedure is the following: at first we set

w = 0

then we iterate until all examples are correctly classified (stochastic perceptron), updating on each incorrectly classified example:

w ← w + η y_i x_i

In the dual (kernelized) version we instead initialize

α_i = 0   ∀i

then we iterate through the examples and for each misclassified example we perform the update:

α_i ← α_i + η y_i

The kernel perceptron classification function becomes

f(x) = Σ_{i=1}^{m} α_i k(x_i, x)
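A minimal sketch of the kernel perceptron update above, using the Gaussian kernel introduced below (numpy; the toy XOR data are not linearly separable in input space):

import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# XOR-like problem: not linearly separable in input space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)

alpha = np.zeros(len(y))
eta = 1.0

def f(x):
    return sum(alpha[i] * rbf(X[i], x) for i in range(len(y)))

for epoch in range(100):
    mistakes = 0
    for i in range(len(y)):
        if y[i] * f(X[i]) <= 0:          # misclassified
            alpha[i] += eta * y[i]       # dual update: alpha_i <- alpha_i + eta * y_i
            mistakes += 1
    if mistakes == 0:
        break

print([np.sign(f(x)) for x in X])        # matches y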
Polynomial kernel:

k_{c,d}(x, x′) = (x^T x′ + c)^d

Gaussian kernel:

k_σ(x, x′) = exp(−||x − x′||² / (2σ²)) = exp(−(x^T x − 2 x^T x′ + x′^T x′) / (2σ²))
If we unroll the square norm, we see that the Gaussian kernel can be computed through dot products in input space followed by scalar operations: once again, the complexity is bounded by the input space.
A Gaussian kernel has an infinite dimensional feature space, and this gives a useful property called universality: it can uniformly approximate any continuous function. In principle, with a Gaussian kernel, we can approximate any (continuous) function.
The problem with this kernel is finding the correct value of σ: the larger the value, the more tolerance we allow; conversely, the smaller σ, the stricter the kernel, so only the closest training samples will impact the prediction (more similar to kNN).
Match (delta) kernel:

k_δ(x, x′) = δ(x, x′) = 1 if x = x′, and 0 otherwise    (15.1)

This does not make much sense directly on vectors (it gives very low generalization); it makes sense when used in combination, e.g. on the parts of structured objects.
• string kernel 3-gram spectrum kernel which basically looks at fre-
quencies.
where Φ12 (x) = Φ1 (x) × Φ2 (x) is the cartesian product between the spaces
of the two kernels. We have to compute over every combination of features. The
feature space that comes out of it is more complex. This already explodes the
feature spaces.
Still, this is a valid combination.
Since understanding which kernel to use for a specific problem is non trivial, we
can use a linear combination of different kernels and then learn parameters of
kernels β along with learning α. We can learn both coefficient jointly to select
which kernel is more useful (kernel learning).
A decomposition relation R decomposes a structured object x into parts:

R(x) = (x_1, x_2, ..., x_D)

A decomposition (convolution) kernel is

(k_1 ⋆ ··· ⋆ k_D)(x, x′) = Σ_{(x_1,...,x_D)∈R(x)} Σ_{(x′_1,...,x′_D)∈R(x′)} Π_{d=1}^{D} k_d(x_d, x′_d)

To compute the kernel, we sum over every possible decomposition of the two objects according to their decomposition relation and compare the two decompositions (in the convolution case) as the product over the single pieces. Each k_d is a kernel on pieces, which could be hierarchically defined (or simply the match kernel).
Set kernel
This is a very simple example: the decomposition relation is set membership. The overall kernel is the sum, over all pairs of elements, of a membership kernel between them:

k_set(X, X′) = Σ_{ξ∈X} Σ_{ξ′∈X′} k_member(ξ, ξ′)
2. for each i ∈ [1 ... |V|]:
(a) for each node v in G, G′:
i. construct a sorted list M_i(v) of the labels of the nodes adjacent to v at the previous step;
ii. concatenate the label of node v at the previous step with the M_i(v) list (Figure 15.2), and use a function label that assigns unique (compressed) labels to the resulting strings (Figure 15.3); it could be for example a hash function;
(b) if l_i(G) ≠ l_i(G′):
i. return FALSE
3. return TRUE
Figure 15.2: WL-isomorphism test, we are assigning to each node a string that
contains the labels of all adjacent nodes
Figure 15.3: WL-isomorphism test, we using the label function to convert the string
into unique labels
The corresponding Weisfeiler-Lehman graph kernel over h iterations is defined as

k_WL^h(G, G′) = Σ_{i=0}^{h} k(G_i, G_i′)
Figure 15.4: Execution of the graph kernel. This kernel will use a string in which it
will store 1 or 0 respectively if it has found the i-th label in the graph or not.
Chapter 16
Deep learning
A crucial limitation of the perceptron is that it can only model linear functions.
Hence, as we have discussed, we cannot for example provide a linear separation
for the XOR. We introduced non-linearity by means of feature mapping. More-
over, we studied how to implicitly provide non-linearity via kernels. To do this
we need to design appropriate kernels (possibly selecting from a set, i.e. kernel
learning) able to produce reasonable similarity measures. The most common
solution (i.e. the decision function) is linear combination of kernels.
f(x) = Σ_i α_i y_i K(x_i, x)
i
Remark: the neuron in the input layer are not exactly neuron. Indeed,
they represent the features of the given input.
Remark: the connections are always from one layer to the following layer.
Each neuron has a connection with each of the neurons of the following layer.
This structure is called densely connected architecture. This is in contrast to
the case of convolutional neural networks for example, where the structure of
connections is template based.
Remark: on the right of the structure there are the formulas which describe
the behaviour of the neural network.
• Note that each Wi is a matrix of weights. For example W1 represents the
weights of the first layer. Each row of the matrix represents the weights of
one neuron in the layer. The weights values are the weights of each input
of the given neuron. The product W1x returns a vector which is a value
for each of the hidden neurons.
• σ is a non linear function called activation function which is applied to
the weighted sum of the inputs. Overall, the operation in the first layer is
represented by ϕ1 (x).
Remark: a beautiful aspect of this structure is the following.
• In non-linear kernel machines the decision function is described by:

f(x) = w^T ϕ(x)

However, in the case of the multilayer perceptron the function ϕ_3(x) is not fixed but is learned: all the weights of the structure are learned. We do not have to design a kernel; we delegate to the neural network the task of defining a proper data representation (more flexibility). The downside is that, in order to learn this representation, a lot of data is needed.
Remark: without σ the decision function would reduce to a linear function:

f(x) = w^T W_3 W_2 W_1 x = W_all^T x
Remark: this function is differentiable and can be plugged into the deep network architecture.

f_i(x) = exp(o_i(x)) / Σ_{j=1}^{c} exp(o_j(x))    (16.2)

y* = argmax_{i∈[1,c]} f_i(x)    (16.3)
16.2.3 Regression
Typically, in the regression case, we take the output of the last layer as it is,
using the identity as activation function.
We use one output neuron o(x x). The activation function is linear, we dele-
gate to the previous hidden layers the non-linearity of the prediction.
Overall, the decision is the value of the output neuron:

f(x) = o(x)    (16.4)
Figure 16.4: Neural network with input layer (AND), hidden layer (OR) with 3
neurons and output layer with 3 neurons.
Note that the formula compares the expected output with the actual output
of the network. If y = 1 we take into account the term log f(x), otherwise if
y = 0 we take into account the term log(1 − f(x)). In some sense, the cross
entropy tells us the confidence of the given prediction. There is a minus sign
in front because the higher the value, the better the confidence (and we want
to minimize). We can also extend cross entropy to the case of multiclass
classification, as explained in the following point.
• Cross entropy for multiclass classification (y ∈ [1, c]):

ℓ(y, f(x)) = − log f_y(x)     (16.6)

• Squared loss for regression:

ℓ(y, f(x)) = (y − f(x))^2     (16.7)
So, in order to train the neural network we define a training error for an
example (x, y), e.g. for regression:

E(W) = \frac{1}{2}(y − f(x))^2     (16.8)
and use stochastic gradient descent. Given a learning rate η, the gradient
update is:

w_{lj} = w_{lj} − η \frac{∂E(W)}{∂w_{lj}}     (16.9)
We denote a generic weight in the network by w_{lj}, representing the weight
of the edge which connects node j of one layer with node l of the following layer
(Figure 16.5).
The difficult part is computing the partial derivative, because the weights
could be really far from the output layer where we actually compute the error.
This problem is addressed with backpropagation. The idea is to use the chain
rule for derivatives: each weight of the network is updated according to the
error computed at the output layer. The notation used in the following formulas
refers to Figure 16.6.
Figure 16.6: The idea of backpropagation is to use the chain rule for derivation.
Remark: the big circle in Figure 16.6 is the node l and ϕl is the output of
node l.
Looking inside node l we find the weighted sum of the inputs of the node,
including ϕ_j w_{lj}. The result of the summation, a_l, is the input of an arbitrary
activation function. The output of node l is ϕ_l.
In Equation 16.10 the aim is to compute the derivative of the error with respect
to w_{lj}. According to the chain rule this is equal to (∂E(W)/∂a_l)(∂a_l/∂w_{lj}).
Since a_l is the sum of each input of the node times its weight, the derivative of
a_l with respect to the weight w_{lj} is the input ϕ_j.
We denote the factor ∂E(W)/∂a_l by δ_l. The computation of δ_l is different depending on whether
l is a hidden neuron or an output neuron. In what follows, we examine both
these cases:
• Output units. Delta is easy to compute on output units. E.g. for regression
with linear outputs:

δ_o = \frac{∂E(W)}{∂a_o} = \frac{∂ \frac{1}{2}(y − f(x))^2}{∂a_o}

Actually f(x) is the result of applying the non-linearity to a_o. However, as
we said above, in regression we typically do not apply a non-linearity at the
output layer, so in this case f(x) = a_o and

δ_o = \frac{∂ \frac{1}{2}(y − a_o)^2}{∂a_o} = −(y − a_o)
This value allows us to compute the derivative of the error for the weights
of the last layer:

\frac{∂E(W)}{∂w_{oj}} = δ_o ϕ_j
• Hidden units. For a hidden node l, the error reaches it only through the
nodes k ∈ ch[l] that take ϕ_l as input. We assign weight w_{kl} to the edge
(k, l); as a consequence ∂a_k/∂ϕ_l = w_{kl}. Moreover, ∂ϕ_l/∂a_l is simply the
derivative of the activation function with respect to its input. Hence:

δ_l = \sum_{k ∈ ch[l]} δ_k \frac{∂a_k}{∂ϕ_l} \frac{∂ϕ_l}{∂a_l} = \sum_{k ∈ ch[l]} δ_k w_{kl} \frac{∂σ(a_l)}{∂a_l}
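A compact sketch of these rules for a network with one sigmoid hidden layer and a linear output neuron, trained on the squared error of Equation 16.8 (variable names and shapes are illustrative):

import numpy as np

def backprop_step(x, y, W1, w_out, eta=0.1):
    # forward pass
    a1 = W1 @ x                          # weighted sums of the hidden layer
    phi1 = 1.0 / (1.0 + np.exp(-a1))     # hidden outputs
    a_o = w_out @ phi1                   # linear output: f(x) = a_o
    # delta on the output unit: delta_o = -(y - a_o)
    delta_o = -(y - a_o)
    # deltas on the hidden units: delta_l = delta_o * w_ol * sigma'(a_l)
    delta_h = delta_o * w_out * phi1 * (1 - phi1)
    # derivative of E w.r.t. each weight = delta of the node * input of that weight
    w_out = w_out - eta * delta_o * phi1
    W1 = W1 - eta * np.outer(delta_h, x)
    return W1, w_out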
In deep learning, neural architectures are really seen as combinations of
modules (Figure 16.8). Each of these modules has an interface to interact with
the outside, and the aim is to combine modules in order to construct a deep
architecture. Figure 16.8 shows how this architecture looks and how it is
updated.
• We have a first layer F1 (x, W1 ) which takes input x and weight matrix W1
and computes ϕ1 .
• The second layer F2 (ϕ1 , W2 ) takes as input ϕ1 and weight matrix W2 and
computes the new representation of the input ϕ2 .
• In the output layer we apply a loss function Loss(ϕ_3, y). With this module
we compute the error for having predicted a certain value instead of y.
With this error value we can compute ∂E/∂ϕ_3.
• Each layer can compute ∂E/∂ϕ_{j−1}. For example, layer three in Figure 16.8 is
able to compute ∂E/∂ϕ_2.
• Each module should be able to compute the derivative of its output with
respect to its weights, ∂F_j(ϕ_{j−1}, W_j)/∂W_j. Moreover, each module should
be able to compute the derivative of its output with respect to its inputs,
∂F_j(ϕ_{j−1}, W_j)/∂ϕ_{j−1}. The former is used by the module to update its own
weights, while the latter is used to backpropagate the error, which allows the
layers further down in the hierarchy to update their own weights.
Figure 16.8: In deep learning, neural architectures are seen as combinations of
modules.
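A sketch of this modular interface for a single linear+sigmoid module (the class and its methods are hypothetical, just to illustrate the two derivatives each module must expose):

import numpy as np

class LinearSigmoidModule:
    # F_j(phi_{j-1}, W_j) = sigma(W_j phi_{j-1})
    def __init__(self, W):
        self.W = W
    def forward(self, phi_prev):
        self.phi_prev = phi_prev
        self.out = 1.0 / (1.0 + np.exp(-self.W @ phi_prev))
        return self.out
    def backward(self, dE_dout, eta=0.1):
        da = dE_dout * self.out * (1 - self.out)   # through the sigmoid
        dE_dW = np.outer(da, self.phi_prev)        # derivative w.r.t. own weights
        dE_dprev = self.W.T @ da                   # derivative w.r.t. inputs, to backpropagate
        self.W -= eta * dE_dW                      # update own weights
        return dE_dprev                            # passed to the previous module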
• many more...
Remark: training kernel machines requires solving a quadratic optimization
problem; as a result, finding the global optimum is guaranteed (convex problem).
Deep networks, on the other hand, are more expressive in principle, but
harder to train.
Figure 16.9: Training error, validation error, test error in deep learning.
Remark: essentially, the ReLU returns the maximum between zero and the
input of the activation function (w^T x), where x is the input of the neuron.
As we can see in Figure 16.10, this function brings everything to zero if the
input is negative, while on the right it is linear. The function saturates only on
the left; as a consequence we have a better chance of avoiding the vanishing
gradient problem.
To sum up:
• Linearity is nice for learning
Another trick of the trade to face the problems related to the training of a
deep neural network is regularization. As discussed above in the text, the aim of
regularization theory is to minimize both model complexity and training error.
So the objective function J(W) is composed of two terms: the error E(W) on
the training set and the complexity of the model λ||W||^2, with trade-off
parameter λ:
J(W ) = E(W ) + λ||W ||2
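In terms of the gradient update, the L2 penalty simply adds a weight-decay term; a minimal sketch (eta and lam are hypothetical hyperparameters):

def regularized_update(w, grad_E, eta=0.01, lam=1e-4):
    # gradient of J(W) = E(W) + lambda * ||W||^2 is grad_E + 2 * lambda * w
    return w - eta * (grad_E + 2 * lam * w)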
For the sake of simplicity, in Figure 16.11 we assume to have two weights
(W1 and W2 ). The Euclidean norm (or 2-norm) of the weights is something that
is minimal at zero and grows in circles. This means that points on the same
circle have the same norm, they are indistinguishable with respect to the min-
imization of the regularization term ||W ||2 . Then, suppose the error function
is an ellipsoid which is minimal in the center. Depending on the value of λ we
decide a tradeoff between the two objectives to minimize. With this procedure
we end up choosing a point where the regularization circle touches the error
ellipsoid.
If we instead use the 1-norm ||W||_1 as regularizer, the set of weights with equal
norm looks something like a square. Of course, points on the same square have
the same norm. This encourages less relevant weights to be exactly zero
(sparsity-inducing norm). Indeed, in the previous case the circle touches the
ellipsoid at an arbitrary point, while in this case the square is more likely to
touch the ellipsoid at one of its vertices; and if we touch the square at a vertex,
some of the weights are zero. This is called a sparsifying norm: a norm which
encourages sparse solutions. This kind of regularization can be useful when we
want to minimize the number of weights in the solution, for instance for
interpretability or for feature selection.
Another trick of the trade to face the problems related to the training of a
deep neural network is initialization. First of all, it is important to randomly
initialize the weights: in this way we break symmetries between neurons. Indeed,
if we initialized all weights to zero they would all learn the same thing, becoming
redundant; this would reduce the overall expressiveness of the network. In
order to carefully set the initialization range so as to preserve forward and
backward variance, we can use the following formula:

W_{ij} \sim U\left(−\frac{\sqrt{6}}{\sqrt{n+m}}, \frac{\sqrt{6}}{\sqrt{n+m}}\right)
where n and m are the number of inputs and outputs of the layer. Another
option is sparse initialization, which enforces a fraction of the weights to be
non-zero (this encourages diversity between neurons).
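A sketch of this initialization for one weight matrix, where n and m are the fan-in and fan-out of the layer (the example sizes are arbitrary):

import numpy as np

def glorot_uniform(n, m, rng):
    # W_ij ~ U(-sqrt(6)/sqrt(n+m), +sqrt(6)/sqrt(n+m))
    limit = np.sqrt(6.0 / (n + m))
    return rng.uniform(-limit, limit, size=(m, n))

W1 = glorot_uniform(784, 256, np.random.default_rng(0))  # e.g. 784 inputs, 256 hidden units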
Another trick of the trade to face the problems related to the training of
a deep neural network is minibatch gradient descent. Batch gradient descent
updates parameters after seeing all examples (of the training set). This proce-
dure is too slow for large datasets. On the other hand, full stochastic gradient
descent updates parameters after seeing each example. In a large network, the
error that we compute for a single example can be quite different from the er-
ror that is averaged over many examples. Overall, the objective would be too
different from the true one. A valid solution is an intermediate technique be-
tween batch and fully stochastic gradient descent. Minibatch gradient descent
divides the training set into minibatches and updates the parameters after seeing a
minibatch of m examples. The proper value of m depends on many factors, e.g.
the size of the GPU memory: we do not want a minibatch so large that it does
not fit into GPU memory.
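A sketch of the resulting training loop, assuming arrays X and y, a function grad computing the minibatch gradient, and a batch size m chosen to fit in memory (all names are hypothetical):

import numpy as np

def minibatch_sgd(X, y, w, grad, eta=0.01, m=32, epochs=10):
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)          # shuffle the examples
        for start in range(0, n, m):
            batch = idx[start:start + m]
            # one parameter update per minibatch of m examples
            w = w - eta * grad(w, X[batch], y[batch])
    return w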
Another trick of the trade to face the problems related to the training of a
deep neural network is a proper configuration of the learning rate. With this
aim, a popular approach is the usage of the momentum concept. The idea is to
update the weights not only depending on the current direction of the gradient
but also partially according to the directions of the previous iterations. The
tradeoff is regulated by a coefficient α called momentum.
v_{ji} = α v_{ji} − η \frac{∂E(W)}{∂w_{ji}}

w_{ji} = w_{ji} + v_{ji}
where 0 ≤ α < 1. The adoption of the momentum tends to keep the updating
of the weights in the same direction. Think of a ball rolling on an error surface.
The possible effects could be:
Another trick of the trade to face the problems related to the training of a
deep neural network is the Adagrad approach:

r_{ji} = r_{ji} + \left(\frac{∂E(W)}{∂w_{ji}}\right)^2

w_{ji} = w_{ji} − \frac{η}{\sqrt{r_{ji}}} \frac{∂E(W)}{∂w_{ji}}
Using momentum we vary the learning rate only with respect to time, but
the learning rate value is the same for all the directions. In high-dimensional
problems, the behaviour of the function can be very different in each direction.
From this, we understand that using the same learning rate for all directions is
clearly suboptimal. The idea of the adagrad approach is to develop an adaptive
gradient such that:
• Reduce learning rate in steep directions
• Increase learning rate in gentler directions
Intuitively, r_{ji} accumulates the squares of the partial derivatives along a certain
dimension. This means that if a certain dimension has a high value of the
derivative for many iterations, the function is steep along that dimension, and
the learning rate is divided by this accumulated size of the gradient (η/√r_{ji}).
(The square root is needed to scale the value to the order of the weights,
avoiding the squared term.)
Overall, adagrad is actually a good solution to speed up convergence in high
dimensional spaces. However, we can highlight a problem of this procedure:
• The square gradient is accumulated over all iterations. This could lead
to non-local decisions. This means that the reduction of the learning rate
could become too large at some point. For non-convex problems, learning
rate reduction can be excessive/premature.
To solve this problem, the literature proposes an alternative called RMSProp:
r_{ji} = ρ r_{ji} + (1 − ρ)\left(\frac{∂E(W)}{∂w_{ji}}\right)^2

w_{ji} = w_{ji} − \frac{η}{\sqrt{r_{ji}}} \frac{∂E(W)}{∂w_{ji}}
The most relevant difference with respect to the previous approach is that we
compute a linear combination ρ r_{ji} + (1 − ρ)(∂E(W)/∂w_{ji})^2 of the current
accumulator and the new squared gradient. In this way we obtain an exponentially
decaying accumulation of squared gradients (0 < ρ < 1): the contribution of a
squared gradient decays exponentially with how far back it was computed, so old
values of the gradient no longer count when deciding whether to slow down or
not. In this way the update of the learning rate becomes more local, and RMSProp
avoids the premature reduction of Adagrad and the corresponding slowing down.
The slides also mention an Adagrad-like behaviour when reaching a convex bowl,
where the exponentially decaying accumulator stabilizes.
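The three update rules discussed above can be summarized in a few lines; g denotes the current gradient ∂E(W)/∂w, and the hyperparameters (plus the small eps added for numerical stability) are illustrative:

import numpy as np

def momentum_step(w, v, g, eta=0.01, alpha=0.9):
    v = alpha * v - eta * g            # keep part of the previous direction
    return w + v, v

def adagrad_step(w, r, g, eta=0.01, eps=1e-8):
    r = r + g ** 2                     # accumulate squared gradients over all iterations
    return w - eta * g / (np.sqrt(r) + eps), r

def rmsprop_step(w, r, g, eta=0.01, rho=0.9, eps=1e-8):
    r = rho * r + (1 - rho) * g ** 2   # exponentially decaying accumulation
    return w - eta * g / (np.sqrt(r) + eps), r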
The update rule which is the de facto standard for neural networks is a variant
of the methods we have discussed, called ADAM.
Another trick of the trade to face the problems related to the training of
a deep neural network is batch normalization. This concept is related to the
so-called covariate shift problem, which occurs when the input distribution to the
model changes over time and the model does not adapt to the change. The fact
that the input distribution changes over time is problematic: when we train a
model we typically assume that examples are independent and identically
distributed (i.i.d.), but if the distribution of the examples changes over time they
are no longer identically distributed, and the learning model should adapt
continuously to the change in the data. In (very)
deep networks, internal covariate shift takes place among layers when they get
updated by backpropagation. Basically, the update of a layer refers to a rep-
resentation of the data which is old with respect to the current collection of
examples. This leads to a slower convergence: we need to keep adjusting the
weights according to the changes in the network.
Keeping the activations always in the same range is in contrast with the aim of
learning a suitable configuration of the deep neural network: we could end up
concentrating the activations too much in the same range. To
avoid this we combine the normalization with scale and shift parameters (γ, β).
yi = γ x̂i + β
In this manner, we scale and shift each activation with adjustable parameters.
In this way the standardized activation is adjusted before being processed by
the activation function.
Remark: the scale γ and shift β coefficients become part of the network
parameters and they have to be learnt.
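A sketch of the batch-normalization step for the activations of one layer over a minibatch, with learnable γ and β (eps added for numerical stability; names are illustrative):

import numpy as np

def batch_norm(A, gamma, beta, eps=1e-5):
    # A: (batch_size, n_units) activations of one layer over a minibatch
    mu = A.mean(axis=0)
    var = A.var(axis=0)
    A_hat = (A - mu) / np.sqrt(var + eps)   # standardize each unit over the batch
    return gamma * A_hat + beta             # y_i = gamma * x_hat_i + beta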
16.5.1 Autoencoder
The autoencoder is a deep neural architecure whose purpose is to train shallow
network to reproduce input in the output. In other words, the autoencoder is
trained to recunstruct its intput. As we can see in Figure 16.13 the structure
has the same number of inputs and outputs. Moreover, we can notice that there
is only one internal layer, called coding layer.
Once the autoencoder is trained, its purpose is to process x as input and provide
x as output. This process can be done with unlabelled examples in a unsuper-
vised learning fashion.
As an example, one can structure the code layer in such a way that handwritten
digits are encoded in the hidden layer with a binary representation. In general,
the aim of the autoencoder is to learn to map inputs into a sensible hidden
representation (representation learning). From this representation it should be
possible to reconstruct the original input. As a consequence, the autoencoder
is often used to map data in a different dimensional space. Another typical ap-
plication of autoencoders is to generate data from the representation that has
been learned. Autoencoders were also adopted to perform layerwise pre-training
(Figure 16.14, Figure 16.15):
1. train the autoencoder
2. discard the output layer
3. use the learned hidden representation as input for a new autoencoder, and repeat
In this way we learn a representation of the input which is more abstract than
the first one, but that can still reconstruct the original input.
We can use the output of this approach (as deep as we want) in supervised
learning task:
1. discard autoencoder output layer
2. add appropriate output layer for supervised task (e.g. one-hot encoding
for multiclass classification)
3. learn output layer weights and refine all internal weights by backpropaga-
tion algorithm
Intuitively, the hidden layers of the final network are designed to conserve as
much information of the input as possible.
The convolutional layers learn a hierarchy of increasingly abstract features (e.g.
from pixels to edges to shapes). All these matrices are learnt in an end-to-end
fashion. At some point the convolutional neural network ends with a multilayer
perceptron architecture: all the features are placed in a vector and processed with
densely (fully) connected layers. Finally, at the output level we reach a
representation which allows us to perform the task at hand (e.g. classification).
• σ denotes parametric layers which end with the sigmoid. The sigmoid function
restricts its input to a range between zero and one, so we can read this value
as a probability.
• Somewhere inside the unit, products are computed. In Figure 16.17 they
are represented with a ×. These are element-wise products.
• These operations allow the unit to learn a state and to propagate it
along the network.
• During the multiplications, if the sigmoid value for a certain dimension is
zero, that dimension of the state is reset. If the value is one, that dimension
is kept as it is. If the value is between zero and one, the dimension is
down-weighted.
• A forget gate selectively forgets parts of the cell state; implicitly we also
select which part of the state to remember and pass on to the next state.
• An input gate selectively chooses parts of the candidate (i.e. the input) for
the cell update (after the multiplication there is a sum which actually updates
the state). In essence, the same concept of the forget gate is applied to the
input itself: indeed, in Figure 16.17 we can notice a σ layer which is
multiplied with the input.
• Finally, an output gate selectively chooses parts of the cell state for the output.
Inside a cell, the overall state is a combination of the part of the previous
state which has been not forgotten (this is represented by the first X inside the
second network unit in Figure 16.17) and a selected part of the information that
we get from the input (this is represented by the plus after the first X in Figure
16.17). This combination produces the updated state which is propagated to
the next unit. Before being propagated, this state is combined with the output
gate (that depends also on the input, look at the arrow from σ in the top right of
the figure to output in Figure 16.17) to decide what parts of the cell state have
to be sent as the output of the given unit. Moreover, the output of the current
unit will become part of the combination of the local input and the propagated
state of the following unit.
In Figure 16.17, letter h at each unit is the representation at each state that
has been learnt. According to the task at hand, we can add (for example) a
classification or a regression layer on top of h (basically everything we need to
perform the final prediction).
The most relevant aspect of this gate architecture is that all the gates are
parametric and are learnt. In this way the network decides, at each unit, what to
remember, what to use for the update, and what to output. This kind of
architecture is commonly referred to as long short-term memory because it
adaptively decides what to remember. We can use this structure, for example, to
avoid forgetting a signal which has to be propagated for a long time: the selective
memory can decide what to remember, simplifying the task of remembering
something for a long time. In this sense, this architecture works well in
sequential prediction tasks.
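A minimal sketch of one LSTM unit along these lines, with the gate weight matrices as hypothetical parameters; c is the cell state and h the output representation:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo):
    z = np.concatenate([h_prev, x])     # previous output combined with current input
    f = sigmoid(Wf @ z)                 # forget gate
    i = sigmoid(Wi @ z)                 # input gate
    c_tilde = np.tanh(Wc @ z)           # candidate for the cell update
    c = f * c_prev + i * c_tilde        # forget part of the old state, add selected input
    o = sigmoid(Wo @ z)                 # output gate
    h = o * np.tanh(c)                  # selected part of the cell state as output
    return h, c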
16.5.5 Transformers
In this subsection we write about a particular neural network model which has
been proven to be especially effective for common natural language processing
tasks. The model is called Transformer and it makes use of several methods
and mechanisms that we introduce here (Figure 16.19).
Imagine the Encoder and Decoder as human translators who can speak only
two languages. Their first language is their mother tongue, which differs be-
tween both of them (e.g. German and French) and their second language an
imaginary one they have in common. To translate German into French, the En-
coder converts the German sentence into the other language it knows, namely
the imaginary language. Since the Decoder is able to read the imaginary lan-
guage, it can now translate from that language into French. Together, the
model (consisting of Encoder and Decoder) can translate German into French!
Suppose that, initially, neither the Encoder nor the Decoder of the Seq2Seq
model is very fluent in the imaginary language. To learn it, we train them (the
model) on a lot of examples.
A very basic choice for Encoder and the Decoder of the Seq2Seq model is a
single LSTM for each of them.
We need one more technical detail to make the Transformers easier to un-
derstand: Attention. The attention-mechanism looks at an input sequence and
decides at each step which other parts of the sequence are important. It sounds
abstract, but let us clarify with an easy example: when reading this text, you
always focus on the word you read but at the same time your mind still holds
the important keywords of the text in memory in order to provide context. An
attention-mechanism works similarly for a given sequence. For our example with
the human Encoder and Decoder, imagine that instead of only writing down the
translation of the sentence in the imaginary language, the Encoder also writes
down keywords that are important to the semantics of the sentence, and gives
them to the Decoder in addition to the regular translation. Those new keywords
make the translation much easier for the Decoder because it knows what parts
of the sentence are important and which key terms give the sentence context.
In other words, for each input that the LSTM (Encoder) reads, the attention-
mechanism takes into account several other inputs at the same time and decides
which ones are important by attributing different weights to those inputs. The
Decoder will then take as input the encoded sentence and the weights provided
by the attention-mechanism.
The Encoder is on the left and the Decoder is on the right. Both Encoder
and Decoder are composed of modules that can be stacked on top of each other
multiple times, which is indicated by Nx in the figure. We see that the modules
consist mainly of Multi-Head Attention and Feed Forward layers. The inputs
and outputs (target sentences) are first embedded into an n-dimensional space,
since we cannot use strings directly.
One slight but important part of the model is the positional encoding of the
different words. Since we have no recurrent networks that can remember how
sequences are fed into a model, we need to somehow give every word/part in our
sequence a relative position since a sequence depends on the order of its elements.
These positions are added to the embedded representation (n-dimensional
vector) of each word.
Those matrices Q, K and V are different for each position of the attention
modules in the structure depending on whether they are in the encoder, decoder
or in-between encoder and decoder. The reason is that we want to attend on
either the whole encoder input sequence or a part of the decoder input sequence.
The multi-head attention module that connects the encoder and decoder will
make sure that the encoder input-sequence is taken into account together with
the decoder input-sequence up to a given position.
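The attention modules compute a weighted combination of the value vectors, with weights derived from the queries and the keys. A minimal sketch of the standard scaled dot-product attention (assumed here, since the exact formula is not spelled out in the text):

import numpy as np

def attention(Q, K, V):
    # Q, K, V: (sequence_length, d) matrices
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # how much each position attends to the others
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                            # weighted combination of the values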
After the multi-attention heads in both the encoder and decoder, we have
a pointwise feed-forward layer. This little feed-forward network has identical
parameters for each position, which can be described as a separate, identical
linear transformation of each element from the given sequence.
16.5.5.3 Training
The purpose of this section is to understand how to train such a structure.
We know that to train a model for translation tasks we need two sentences in
different languages that are translations of each other. Once we have a lot of
sentence pairs, we can start training our model. Let’s say we want to translate
French to German. Our encoder input will be a French sentence and the input
for the decoder will be a German sentence. However, the decoder input will be
shifted to the right by one position. One reason is that we do not want our
model to learn how to copy the decoder input during training; rather, we want
the model to learn that, given the encoder sequence and a particular decoder
sequence which has already been seen, it should predict the next word/character. If we
don’t shift the decoder sequence, the model learns to simply ‘copy’ the decoder
input, since the target word/character for position i would be the word/charac-
ter i in the decoder input. Thus, by shifting the decoder input by one position,
our model needs to predict the target word/character for position i having only
seen the word/characters 1, . . . , i − 1 in the decoder sequence. This prevents
our model from learning the copy/paste task. We fill the first position of the
decoder input with a start-of-sentence token, since that place would otherwise
be empty because of the right-shift. Similarly, we append an end-of-sentence
token to the decoder input sequence to mark the end of that sequence and it is
also appended to the target output sentence. In a moment, we’ll see how that
is useful for inferring the results.
This is true for Seq2Seq models and for the Transformer. In addition to the
right-shifting, the Transformer applies a mask to the input in the first multi-head
attention module to avoid seeing potential ‘future’ sequence elements. This is
specific to the Transformer architecture because we do not have RNNs where we
can input our sequence sequentially. Here, we input everything together and if
there were no mask, the multi-head attention would consider the whole decoder
input sequence at each position.
The target sequence we want for our loss calculations is simply the decoder
input (German sentence) without shifting it and with an end-of-sequence token
at the end.
16.5.5.4 Inference
Inferring with those models is different from the training, which makes sense
because in the end we want to translate a French sentence without having the
German sentence. The trick here is to re-feed our model for each position of the
output sequence until we come across an end-of-sentence token.
Sometimes the nodes of the represented graph have a set of features (for
example, a user profile). If each node has f features, then the node feature
matrix X has dimension (n × f).
Graph data is so complex that it has created a lot of challenges for existing
machine learning algorithms (Figure 16.23).
The reason is that conventional machine learning and deep learning tools are
specialized in simple data types: images with the same structure and size, which
we can think of as fixed-size grid graphs, or text and speech, which are sequences
and can therefore be thought of as line graphs.
But there are more complex graphs, without a fixed form, with a variable
size of unordered nodes, where nodes can have different numbers of neighbors.
It also doesn’t help that existing machine learning algorithms have a core
assumption that instances are independent of each other. This is false for graph
data, because each node is related to others by links of various types.
Graph Neural Networks (GNNs) are a class of deep learning methods de-
signed to perform inference on data described by graphs.
GNNs are neural networks that can be directly applied to graphs, and pro-
vide an easy way to do node-level, edge-level, and graph-level prediction tasks.
CNNs can be used to make machines visualize things, and perform tasks
like image classification, image recognition, or object detection. This is where
CNNs are the most popular.
The core concept behind CNNs introduces hidden convolution and pooling
layers to identify spatially localized features via a set of receptive fields in kernel
form (Figure 16.24).
How does convolution operate on images that are regular grids? We slide the
convolutional operator window across a two-dimensional image, and we compute
some function over that sliding window. Then, we pass it through many layers.
Our goal is to generalize the notion of convolution beyond these simple two-
dimensional lattices.
The insight allowing us to reach our goal is that convolution takes a little
sub-patch of the image (a little rectangular part of the image), applies a func-
tion to it, and produces a new part (a new pixel).
What happens is that the center node of that center pixel aggregates infor-
mation from its neighbors, as well as from itself, to produce a new value.
It’s very difficult to perform CNN on graphs because of the arbitrary size
of the graph, and the complex topology, which means there is no spatial locality.
Our goal is to map nodes so that similarity in the embedding space approx-
imates similarity in the network.
Now we define the encoder functions Enc(u) and Enc(v), which convert the
feature vectors to zu and zv (Figure 16.25).
• Aggregate information
• Stacking multiple layers (computation)
By doing this, we’re capturing the structure, and also borrowing feature in-
formation at the same time.
The inputs are those feature vectors, and the box will take the two feature
vectors (XA and XC), aggregate them, and then pass the result on to the next layer.
Notice that, for example, the input at node C are the features of node C, but
the representation of node C in layer 1 will be a hidden, latent representation
of the node, and in layer 2 it’ll be another latent representation.
Now, to train the model we need to define a loss function on the embed-
dings. We can feed the embeddings into any loss function and run stochastic
gradient descent to train the weight parameters. Training can be unsupervised
or supervised.
Figure 16.23: Conventional Machine Learning and Deep Learning tools are special-
ized in simple data types. Like images with the same structure and size, which we can
think of as fixed-size grid graphs. Text and speech are sequences, so we can think of
them as line graphs.
Figure 16.24: The core concept behind CNNs introduces hidden convolution and
pooling layers to identify spatially localized features via a set of receptive fields in
kernel form.
Figure 16.25: The encoder function Enc(u) and Enc(v), converts the feature vectors
to zu and zv .
Figure 16.27: Once the locality information is captured by the computational
graph, we start aggregating; this is basically done using neural networks.
Figure 16.28: The forward propagation rule in GNNs. It determines how the infor-
mation from the input will go to the output side of the neural network.
• Nonlinear activation
The operations are usually done in this order. Together, they make up one
network layer. We can combine one or more layers to form a complete GCN.
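One widely used instantiation of such a layer (not necessarily the exact rule in the figure) is the GCN propagation rule, which aggregates each node's neighbourhood through a normalized adjacency matrix, applies a linear transformation, and then a nonlinear activation; a sketch:

import numpy as np

def gcn_layer(A, H, W):
    # A: (n, n) adjacency matrix, H: (n, f) node features, W: (f, f') weights
    A_hat = A + np.eye(A.shape[0])              # add self-loops so a node keeps its own information
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric degree normalization
    return np.maximum(0, A_norm @ H @ W)        # aggregate, transform, nonlinear activation (ReLU)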
Chapter 17
Ensemble Methods
Diversity of Datasets:
• Each example has probability (1−1/N ) of not being selected at each draw
• Each example has probability (1 − 1/N)^N of not being selected after N draws
• For large enough N, this probability approaches 1/e ≈ 37%, i.e., about 37% of
the dataset is not part of a given training set (out-of-bag instances, oob)
• oob instances can be used to estimate test performance of the base model
Soft voting predicts the class with the largest sum of predicted probabilities
(classification). It assumes that each base classifier outputs a confidence/probability:

ŷ = argmax_y \sum_{i=1}^{m} f_y^{(i)}(x)
Mean predicts the mean (or median) of the base model outputs (regression):

ŷ = \frac{1}{m} \sum_{i=1}^{m} f^{(i)}(x)
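A sketch of both combination rules, assuming (hypothetically) a scikit-learn-style interface where each base classifier exposes predict_proba and each base regressor exposes predict:

import numpy as np

def soft_voting(models, x):
    # sum the predicted class probabilities and take the argmax
    probs = sum(m.predict_proba(x) for m in models)
    return np.argmax(probs)

def mean_prediction(models, x):
    # average of the base regressors' outputs
    return np.mean([m.predict(x) for m in models])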
17.4 Boosting
17.4.1 In a nutshell
1. Take a learner A and train it on D
2. Reweight examples in D based on their accuracy according to the trained
model (harder examples get larger weights)
3. Train A again on the reweighted dataset
4. Repeat the procedure m times
5. Combine the learned models into the final model
A weak learner learns models with accuracy slightly better than random and is
easy to implement and train. Applying boosting with weak learners as the base
learning algorithm allows to turn them into strong learners. This can be easier
than designing a strong learner directly.
d_n^{(i)} ← \frac{1}{Z} d_n^{(i−1)} \exp(−α^{(i)} y_n ŷ_n)
Example re-weighting:
• correctly classified examples yn ŷn = +1 are multiplicatively downweighted
• incorrectly classified examples yn ŷn = −1 are multiplicatively upweighted
α^{(1)} = \frac{1}{2} \log \frac{1 − ϵ̂^{(1)}}{ϵ̂^{(1)}} = \frac{1}{2} \log(4)

• The multiplicative weight for positive (correct) examples is exp(−α^{(1)} y_n ŷ_n) =
exp(−1/2 log(4)) = 1/2, while for negative (incorrect) examples it is
exp(1/2 log(4)) = 2.
• The normalization Z (ignoring d_n^{(i−1)} for simplicity) is 80 · 1/2 + 20 · 2 = 80
• The normalized weights for positive and negative examples are 1/2 · 1/80 = 1/160
and 2 · 1/80 = 1/40
• The overall weight of positive and negative examples is now 1/160 · 80 = 1/2
and 1/40 · 20 = 1/2
• The reweighted dataset is thus fully balanced, and a majority class predictor
is not a weak learner any more (accuracy exactly 50%).
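The arithmetic of this example can be checked in a few lines (100 examples, 80 of which are correctly classified by the majority-class weak learner, error 0.2):

import numpy as np

alpha = 0.5 * np.log((1 - 0.2) / 0.2)                # = 1/2 log(4)
w_correct, w_wrong = np.exp(-alpha), np.exp(alpha)   # 1/2 and 2
Z = 80 * w_correct + 20 * w_wrong                    # normalization, equals 80
print(80 * w_correct / Z, 20 * w_wrong / Z)          # 0.5 0.5: fully balanced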
Chapter 18
Unsupervised learning
18.1 Clustering
18.1.1 K-means clustering
This is a quite popular, naive approach to clustering. The method assumes that
the number of clusters k is decided a priori. Each cluster is represented by its
mean µ_i, and the idea is to assign each example to the cluster with the closest
mean.
Examples can be compared, for instance, with the Minkowski distance

d(x, x') = \left( \sum_{i=1}^{d} |x_i − x'_i|^p \right)^{1/p}

or with the cosine similarity

s(x, x') = \frac{x^T x'}{||x|| \, ||x'||}
For each cluster we take the mean µ_i (this is a general measure, not specific to
k-means). We then compare all the examples in a cluster (n_i samples) with the
mean of the cluster they belong to, and the error is computed as the squared
error between each sample and the mean:

µ_i = \frac{1}{n_i} \sum_{x \in D_i} x
The sum-of-squared errors is defined as the sum over all clusters:

E = \sum_{i=1}^{k} \sum_{x \in D_i} ||x − µ_i||^2
This basically tells us how bad the approximation of the clusters by their means is.
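A minimal k-means sketch implementing exactly this loop (random initialization, assignment to the closest mean, mean update, sum-of-squared errors); it assumes no cluster becomes empty and all names are illustrative:

import numpy as np

def kmeans(X, k, iters=100, rng=np.random.default_rng(0)):
    mu = X[rng.choice(len(X), k, replace=False)]             # initial means
    for _ in range(iters):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances to each mean
        assign = d.argmin(axis=1)                            # closest mean
        mu = np.array([X[assign == i].mean(axis=0) for i in range(k)])
    sse = sum(((X[assign == i] - mu[i]) ** 2).sum() for i in range(k))
    return mu, assign, sse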
Figure 18.1: GMM clustering for the same data with different parameters
This is an example of different ways of fitting the same data, so based on the
approach we select we can get different solutions.
3. using this expected value, we maximize the likelihood with respect to the
parameters
4. this gives a new value of the parameters; we then iterate until convergence
Settings:
• A dataset of n examples
• for each example xi , a cluster assignment (the latent variable) is modelled
as zi1 , . . . , zik (for each sample, one value per cluster) with a one-hot
encoding (a vector of binary variables, with 1 only for the correct cluster and
0 otherwise)
• we assume for simplicity that we are only estimating the mean of the
gaussian, the variance is assumed to be known
General Algorithm:
(b) M-step: replace the current hypothesis with the one maximizing Q(h', h):

h ← argmax_{h'} Q(h', h)
This procedure is guaranteed to converge but it is likely to get to a local op-
timum. We can start with a good initialization coming from different methods.
This is just the log of the product over the examples of the likelihood (properties
of the logarithm applied to the previous formula).
The expectation that example x_i was generated by the j-th cluster, given the
current hypothesis h and the observed data X, is computed as:

E[z_{ij}] = \frac{p(x_i|µ_j)}{\sum_{l=1}^{k} p(x_i|µ_l)} = \frac{e^{−\frac{1}{2σ^2}(x_i − µ_j)^2}}{\sum_{l=1}^{k} e^{−\frac{1}{2σ^2}(x_i − µ_l)^2}}
Given the formula of the likelihood, we can maximize it: we take this expectation
and perform the maximization over the new hypothesis.
The likelihood maximization gives the first result below (since there is a minus
sign, we can minimize instead of maximizing); we then set the derivative with
respect to µ'_j equal to zero in order to find the maximum (here we consider just
one specific µ_j instead of summing over every component).
argmax_{h'} Q(h'; h) = argmax_{h'} \sum_{i=1}^{n} \sum_{j=1}^{k} E[z_{ij}] \left( \ln \frac{1}{\sqrt{2π}σ} − \frac{(x_i − µ'_j)^2}{2σ^2} \right)

= argmin_{h'} \sum_{i=1}^{n} \sum_{j=1}^{k} E[z_{ij}] (x_i − µ'_j)^2

Setting the derivative with respect to µ'_j to zero:

\frac{∂}{∂µ'_j} \sum_{i=1}^{n} E[z_{ij}] (x_i − µ'_j)^2 = −2 \sum_{i=1}^{n} E[z_{ij}] (x_i − µ'_j) = 0

µ'_j = \frac{\sum_{i=1}^{n} E[z_{ij}] x_i}{\sum_{i=1}^{n} E[z_{ij}]}
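A sketch of this EM loop for one-dimensional data with k Gaussians of known shared variance σ², estimating only the means (initialization and hyperparameters are illustrative):

import numpy as np

def em_means(x, k, sigma=1.0, iters=50, rng=np.random.default_rng(0)):
    mu = rng.choice(x, k, replace=False)           # initial hypothesis for the means
    for _ in range(iters):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        logp = -((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2)
        E_z = np.exp(logp)
        E_z = E_z / E_z.sum(axis=1, keepdims=True)
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]
        mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)
    return mu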
Figure 18.2: Plot of the error rate against an increasing number of clusters
The silhouette coefficient s_i is then averaged over all samples. We plot this
average coefficient against the number of clusters (Figure 18.3); it represents
the trade-off between inter- and intra-cluster similarity. Eventually, we choose
the number of clusters achieving the highest coefficient.
Figure 18.3: Plot the average silhouette coefficient according to an increasing number
of clusters
The similarity measure is computed between clusters, there are several meth-
ods that we can use:
• nearest neighbours, which computes the similarity between sets as the minimal
distance between the elements in the two sets (sensitive to outliers and
expensive)
Chapter 19
Reinforcement learning
The typical challenges faced by the learner during this process are:
• delayed reward coming from future moves
• trade-off between exploitation and exploration
19.1 Applications
Reinforcement learning has several real world applications. The main research
fields where reinforcement learning solutions are adopted are:
• videogames:
– objective = complete the game with the highest score
– state = raw pixel inputs of the game state
– action = game controls (left, right, up, down)
– reward = score increase/decrease
• robotics
• board games (e.g. Go):
– objective = win the game
– state = position of all pieces
– action = where to put the next piece down
– reward = 1 if win at the end of the game, 0 otherwise
• discounted rewards
Remark: in the more general case each reward should be written as R(st , at , st+1 )
Remark: the lower the discount factor γ is, the less important future re-
wards are, and the agent will tend to focus on actions which will yield immediate
rewards only.
Utilities of states are solutions of the set of Bellman equations. The solutions
to the set of Bellman equations are unique. However, directly solving the set
of equations is hard. As a consequence, we rely on a dynamic programming
algorithmic approach (value iteration):
• step 1: initialize U0 (s) to zero for all s
• step 2: repeat until max utility difference is below a threshold
– step 2.1: do Bellman update for each state s:
U_{i+1}(s) ← R(s) + γ \max_{a∈A} \sum_{s'∈S} p(s'|s, a) U_i(s')
– step 2.2: i ← i + 1
• step 3: return U
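A sketch of value iteration for a small discrete MDP, with the transition model given as an array p[s, a, s'] and the rewards R as a vector over states (inputs and thresholds are hypothetical):

import numpy as np

def value_iteration(p, R, gamma=0.9, eps=1e-6):
    U = np.zeros(p.shape[0])                       # step 1: U_0(s) = 0 for all s
    while True:
        # step 2.1: Bellman update for every state
        U_new = R + gamma * np.max(p @ U, axis=1)  # max over a of sum_s' p(s'|s,a) U(s')
        if np.max(np.abs(U_new - U)) < eps:        # step 2: stop when the max difference is small
            return U_new
        U = U_new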
This latter algorithm is used as a subroutine of the following algorithm
(policy iteration) which is used to learn the optimal policy:
• step 1: initialize π0 randomly
• step 2: repeat until no policy improvement
– step 2.1: policy evaluation: solve the set of linear equations (as explained
in the previous algorithm):

U_i(s) = R(s) + γ \sum_{s'∈S} p(s'|s, π_i(s)) U_i(s')   ∀s ∈ S

– step 2.2: policy improvement: π_{i+1}(s) ← argmax_{a∈A} \sum_{s'∈S} p(s'|s, a) U_i(s')
– step 2.3: i ← i + 1
• step 3: return π
In addition we keep track of the number of times we are in state s and take
action a (N_{sa}), and of the number of times that, being in state s and taking
action a, the agent reaches state s' (N_{s'|sa}).
At this point we have all the information we need to update the transition
model by maximum likelihood estimation: the probability of reaching state s''
from state s taking action a is given by the fraction of times in which the agent,
being in state s and taking action a, reached state s'':

p(s''|s, a) = \frac{N_{s''|sa}}{N_{sa}}   ∀s'' ∈ S     (19.10)
The last step is to run policy-evaluation to obtain the utility estimate U (U
is initially empty). Each step is expensive as it runs policy evaluation.
• step 1: initialize s
• step 2: repeat until s is terminal
– step 2.1: receive reward r, set R(s) = r
– step 2.2: choose next action a ← π(s)
– step 2.3: take action a, reach step s′
– step 2.4: update counts N_{sa} ← N_{sa} + 1 and N_{s'|sa} ← N_{s'|sa} + 1, update p
from the counts, and run U ← policyEvaluation(π, U, p, R, γ)
Note that this is a rough approximation, not the exact utility of the state s.
Now the idea is to try to make the utility closer to the above approximation.
Making the current U(s) closer to R(s) + γU(s') is possible by updating the
current U(s) so as to reduce the difference between what we want (R(s) +
γU(s')) and what we have (the current U(s)):

U(s) ← U(s) + α(R(s) + γU(s') − U(s))

where α is a learning rate.
Remark: One of the main benefits of this algorithm is that it does not
need a transition model to update the utility. As a consequence, each step is
much faster than ADP. However, depending on the learning rate α, TD policy
evaluation algorithm takes longer to converge. In a sense, this algorithm can be
seen as a rough efficient approximation of ADP.
• step 1: initialize s
U^+(s) = R(s) + γ \max_{a∈A} f\left(\sum_{s'∈S} p(s'|s, a) U^+(s'), N_{sa}\right)     (19.14)
with f increasing over the first argument and decreasing over the second.
19.4.4 Q-learning
As an alternative to what we have learnt so far, it is convenient to define the
utility of a state-action (s, a) pair and not only of a state s. The value of a
state-action pair is written Q(s, a). Given policy π, the utility of a state action
pair (s, a) is computed as follows:
Q^π(s, a) = E\left[\sum_{t=0}^{∞} γ^t r(s_t) \mid s_0 = s, a_0 = a, π\right]     (19.15)
The true utility of a state-action pair (s, a) is its utility under an optimal
policy:
Q(s, a) = Q^{π^*}(s, a) = \max_π E\left[\sum_{t=0}^{∞} γ^t r(s_t) \mid s_0 = s, a_0 = a, π\right]     (19.16)
Once Q(s, a) has been computed for each state-action pair, the optimal
policy corresponds to:

π^*(s) = argmax_{a∈A} Q(s, a)     (19.17)
The task is to progressively learn a table where we have the maximum ex-
pected future reward, for each action at each state. As we are going to see in
the next algorithmic procedure, the update of the utility of a state action pair
(s, a) is again computed by means of the Bellman equation:
Q(s, a) = R(s) + γ \sum_{s'∈S} p(s'|s, a) U(s')     (19.18)

where U(s') = \max_{a'∈A} Q(s', a').
∗ step 3.1.1: select a random possible action for the current state
∗ step 3.1.2: using this action, consider going to the next state s'
∗ step 3.1.3: get the maximum Q value for this next state s' (taking
into account all actions from this next state); this is computed
solving the Bellman equation
After enough episodes, the Q-table encodes the optimal policy:
• step 1: set current state = initial state
• step 2: repeat until current state = goal state
– step 2.1: from the current state, find the action with the highest Q value
– step 2.2: set current state = next state (the state reached by the action
chosen in step 2.1)
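A sketch of the tabular Q-learning update with epsilon-greedy exploration, assuming a hypothetical environment function step(s, a) returning the next state, the reward and a termination flag; the learning-rate update used below is the standard temporal-difference form:

import numpy as np

def q_learning_episode(Q, step, s, gamma=0.9, alpha=0.1, eps=0.1,
                       rng=np.random.default_rng(0)):
    done = False
    while not done:
        # epsilon-greedy action selection
        a = rng.integers(Q.shape[1]) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # move Q(s, a) towards r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q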
• step 1: initialize s
• step 2: repeat until s is terminal
– step 2.1: receive reward r
– step 2.2: choose next action a ← π ϵ (s)
– step 2.3: take action a, reach step s′
– step 2.4: choose action a′ ← π ϵ (s′ )
– step 2.5: update local utility estimate
• step 1: initialize s
• step 2: repeat until s is terminal
– step 2.1: receive reward r
– step 2.2: choose next action a ← π ϵ (s)
– step 2.3: take action a, reach step s′
– step 2.5: update local utility estimate
The main difference between SARSA and Q-learning is that SARSA is on-policy,
meaning that it updates Q using the action chosen by the current policy.
Q-learning, on the other hand, is off-policy, meaning that it updates Q using the
greedy policy's action (which is NOT the policy it uses to explore).
Off-policy methods are more flexible; on the other hand, on-policy methods tend
to converge faster, and are easier to use for continuous state spaces and linear
function approximators, which we discuss in the remainder of this chapter.
E((s, a), s') = \frac{1}{2}\left(R(s) + γ \max_{a'∈A} Q_θ(s', a') − Q_θ(s, a)\right)^2     (19.24)
This formulation allows us to learn by means of a stochastic gradient descent
procedure:

∇_θ E((s, a), s') = \left(R(s) + γ \max_{a'∈A} Q_θ(s', a') − Q_θ(s, a)\right)\left(−∇_θ Q_θ(s, a)\right)     (19.25)
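A sketch of one such stochastic gradient step for a linear parametrization Q_θ(s, a) = θ^T φ(s, a) (the feature map φ and all names are illustrative assumptions, not taken from the text):

import numpy as np

def q_gradient_step(theta, phi, s, a, r, s_next, actions, gamma=0.9, eta=0.01):
    q_sa = theta @ phi(s, a)
    target = r + gamma * max(theta @ phi(s_next, b) for b in actions)
    # for linear Q, grad_theta Q_theta(s, a) = phi(s, a), so Equation 19.25 gives:
    grad_E = (target - q_sa) * (-phi(s, a))
    return theta - eta * grad_E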