Cheatsheet Deep Learning
edu/~shervine
VIP Cheatsheet: Deep Learning

Neural networks are a class of models that are built with layers. Commonly used types of neural
networks include convolutional and recurrent neural networks.

Architecture – The vocabulary around neural network architectures is described in the
figure below:

[Figure: neural network architecture]

Throughout, w, b and z denote the weight, bias and output respectively.

Activation function – Activation functions are used at the end of a hidden unit to introduce
non-linear complexities to the model. Here are the most common ones:

[Table: common activation functions]

Learning rate – The learning rate, often noted η, indicates the pace at which the weights get
updated. It can be fixed or adaptively changed. A widely used method is Adam, which adapts
the learning rate.

Updating weights – In a neural network, weights are updated as follows:
\[ w \longleftarrow w - \eta \frac{\partial L(z,y)}{\partial w} \]

Dropout – Dropout is a technique meant to prevent overfitting the training data by
dropping out units in a neural network. In practice, neurons are either dropped with probability
p or kept with probability 1 − p.
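As a rough illustration of the update rule w ← w − η ∂L(z,y)/∂w, the sketch below trains a single unit on one toy example. It assumes a sigmoid activation and the binary cross-entropy loss L(z,y) = −[y log(z) + (1 − y) log(1 − z)], for which the gradient with respect to w simplifies to (z − y)x. The function names, the toy data and the value of η are all hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(a):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(z, y):
    """Binary cross-entropy L(z, y) for a prediction z and a label y."""
    return -(y * np.log(z) + (1 - y) * np.log(1 - z))

def gradient_step(w, b, x, y, eta):
    """Apply one update w <- w - eta * dL/dw (and the same for the bias b)."""
    z = sigmoid(np.dot(w, x) + b)  # output of the unit
    grad_w = (z - y) * x           # dL/dw for sigmoid + cross-entropy
    grad_b = z - y                 # dL/db
    return w - eta * grad_w, b - eta * grad_b

# Hypothetical toy example: a single input vector with label 1.
x = np.array([1.0, 2.0])
y = 1.0
w, b = np.zeros(2), 0.0
eta = 0.1  # learning rate

loss_before = cross_entropy(sigmoid(np.dot(w, x) + b), y)
for _ in range(20):
    w, b = gradient_step(w, b, x, y, eta)
loss_after = cross_entropy(sigmoid(np.dot(w, x) + b), y)
```

With a fixed η the loss decreases steadily on this example. An adaptive method such as Adam would change how large each step is, not the overall structure of the loop.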
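Convolutional layer output size – By noting W the input size, F the filter size, P the amount
of zero padding and S the stride, the number of neurons N that fit along a given dimension is:
\[ N = \frac{W - F + 2P}{S} + 1 \]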
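The formula N = (W − F + 2P)/S + 1 is easy to check numerically. The helper below is a minimal sketch: its name and its choice to raise an error when the result is not a whole number are assumptions made for this example, not something the cheatsheet specifies.

```python
def conv_output_size(W, F, P, S):
    """Return N = (W - F + 2P) / S + 1, raising if the filter does not tile the input."""
    span = W - F + 2 * P
    if span < 0 or span % S != 0:
        raise ValueError("filter size, padding and stride do not fit the input")
    return span // S + 1

# A 5-wide filter with padding 2 and stride 1 preserves a 32-wide input.
same_padding = conv_output_size(W=32, F=5, P=2, S=1)

# An 11-wide filter with stride 4 and no padding on a 227-wide input.
strided = conv_output_size(W=227, F=11, P=0, S=4)
```

Raising on a non-integer result makes incompatible hyperparameters visible early. Many frameworks instead round the division down, so the strict check here is a design choice rather than a rule.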
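Cross-entropy loss – In the context of neural networks, the cross-entropy loss L(z,y) is
commonly used and is defined as follows:
\[ L(z,y) = -\Big[ y \log(z) + (1 - y) \log(1 - z) \Big] \]

Batch normalization – Batch normalization is a step with parameters γ, β that normalizes a
batch $\{x_i\}$. By noting $\mu_B$ and $\sigma_B^2$ the mean and variance of the batch, it is done as
follows:
\[ x_i \longleftarrow \gamma \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \]
where $\epsilon$ is a small constant added for numerical stability. It is usually done after a fully
connected/convolutional layer and before a non-linearity layer and aims at allowing higher
learning rates and reducing the strong dependence on initialization.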
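The batch normalization step x_i ← γ (x_i − µ_B)/√(σ_B² + ε) + β can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the function name, the default ε = 1e-5 and the (batch size, features) layout are choices made for this example, and only the training-time computation on a single batch is covered.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch x of shape (batch_size, features), then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature batch mean (mu_B)
    var = x.var(axis=0)                    # per-feature batch variance (sigma_B^2)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, roughly unit variance
    return gamma * x_hat + beta

# Hypothetical batch of 4 examples with 2 features on very different scales.
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
out = batch_norm(x, gamma=1.0, beta=0.0)
```

With γ = 1 and β = 0 both features come out on the same scale, which is the effect that reduces sensitivity to initialization. In a network, γ and β would be learned along with the weights.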
Recurrent Neural Networks

Types of gates – Here are the different types of gates that we encounter in a typical recurrent
neural network:

Input gate              Forget gate             Output gate             Gate
Write to cell or not?   Erase a cell or not?    Reveal a cell or not?   How much writing?

LSTM – A long short-term memory (LSTM) network is a type of RNN model that avoids
the vanishing gradient problem by adding 'forget' gates.

Reinforcement Learning

Bellman equation – The optimal Bellman equation characterizes the value function $V^{\pi^*}$
of the optimal policy $\pi^*$:
\[ V^{\pi^*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s') \]
Remark: we note that the optimal policy $\pi^*$ for a given state s is such that:
\[ \pi^*(s) = \underset{a \in A}{\operatorname{argmax}} \sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s') \]

Maximum likelihood estimate – The maximum likelihood estimates for the state transition
probabilities are as follows:
\[ P_{sa}(s') = \frac{\#\text{times took action } a \text{ in state } s \text{ and got to } s'}{\#\text{times took action } a \text{ in state } s} \]

Value iteration – The value function can be computed in two steps:
• We initialize the value:
\[ V_0(s) = 0 \]
• We iterate the value based on the values before:
\[ V_{i+1}(s) = R(s) + \max_{a \in A} \left[ \sum_{s' \in S} \gamma P_{sa}(s') V_i(s') \right] \]
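The two steps of value iteration (start from V_0(s) = 0, then repeatedly apply V_{i+1}(s) = R(s) + max_a Σ_{s'} γ P_sa(s') V_i(s')) can be sketched as follows. The array layout P[a, s, s'], the function name, the fixed iteration count and the two-state toy problem are all assumptions made for this illustration.

```python
import numpy as np

def value_iteration(P, R, gamma, n_iters=200):
    """Iterate the value update; P[a, s, s2] holds P_sa(s2), R[s] the reward."""
    V = np.zeros(R.shape[0])                 # V_0(s) = 0
    for _ in range(n_iters):
        expected = P @ V                     # expected[a, s] = sum_s' P_sa(s') V_i(s')
        V = R + gamma * expected.max(axis=0) # V_{i+1}(s)
    policy = (P @ V).argmax(axis=0)          # greedy action in each state
    return V, policy

# Hypothetical toy problem: 2 states, 2 deterministic actions.
# Action 0 stays in place, action 1 moves to the other state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
])
R = np.array([0.0, 1.0])       # only state 1 gives a reward
V, policy = value_iteration(P, R, gamma=0.9)
```

In this toy problem the greedy policy moves from state 0 to state 1 and then stays there. When the transition probabilities are unknown, P could be filled in from observed counts using the maximum likelihood estimate before running the same loop.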