GD: the batch size is a parameter to choose
Normalizing inputs: faster convergence => all features move at a similar scale
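A minimal sketch of input normalization, assuming numpy and an X of shape (n_features, m):

    import numpy as np

    def normalize_inputs(X):
        # Per-feature zero mean and unit variance, so all features
        # move at a similar scale during gradient descent.
        mu = X.mean(axis=1, keepdims=True)
        sigma = X.std(axis=1, keepdims=True)
        return (X - mu) / (sigma + 1e-8)  # epsilon guards against constant features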
Mini-batch GD:
- Batch size > 1 (and < m)
Stochastic GD:
- Batch size = 1, one training example at a time
- Extremely noisy
- No convergence: it keeps oscillating around the minimum
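A sketch of the mini-batch training loop; compute_grads and the params dict are hypothetical stand-ins for your model. batch_size = 1 gives stochastic GD, batch_size = m gives batch GD:

    import numpy as np

    def train(X, Y, params, compute_grads, alpha=0.01, batch_size=64, epochs=10):
        m = X.shape[1]
        for _ in range(epochs):
            perm = np.random.permutation(m)      # reshuffle every epoch
            for t in range(0, m, batch_size):
                idx = perm[t:t + batch_size]     # one mini-batch
                grads = compute_grads(X[:, idx], Y[:, idx], params)
                for k in params:                 # plain GD step on this mini-batch
                    params[k] -= alpha * grads[k]
        return params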
Exponentially weighted moving averages
Moving average:
V_t = beta*V_{t-1} + (1-beta)*theta_t
This averages over roughly the last 1/(1-beta) data points.
Time-series example: to average quarterly data, plot (1/4)*(sum of the first four points), then the average of the next window, and so on.
Each V_t is about the average of the last 1/(1-beta) days plus some weight on the current day => the number of days averaged grows with beta.
Unrolling the recursion: V_t = sum over n of (1-beta)*beta^n*theta_{t-n}, so the weights decay exponentially with n.
What happens if we raise beta? The curve gets smoother and lags the trend; lowering beta keeps it closer to the individual points.
Bias correction:
V_t starts off really low => divide by (1-beta^t) to correct the early estimates.
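The moving average plus bias correction as a short sketch over a plain list of values:

    def ewma(values, beta=0.9):
        v, out = 0.0, []
        for t, theta in enumerate(values, start=1):
            v = beta * v + (1 - beta) * theta
            out.append(v / (1 - beta ** t))  # bias correction fixes the low start
        return out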
Gradient descent with momentum:
Like a moving average over the derivatives, instead of over a time series:
V_dw = beta*V_dw + (1-beta)*dw
V_db = beta*V_db + (1-beta)*db
w = w - alpha*V_dw
b = b - alpha*V_db
What happens if we raise beta? It acts a bit like a learning rate:
High beta => more horizontal movement: smoother along the short axis, depends more on the general trend, less sensitive to noise.
Low beta => more vertical movement: noisier descent, depends more on the current gradient.
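One momentum step as code, directly from the update rules above (dw, db are the current gradients):

    def momentum_step(w, b, dw, db, v_dw, v_db, alpha=0.01, beta=0.9):
        v_dw = beta * v_dw + (1 - beta) * dw   # moving average of the gradients
        v_db = beta * v_db + (1 - beta) * db
        w = w - alpha * v_dw                   # step along the smoothed direction
        b = b - alpha * v_db
        return w, b, v_dw, v_db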
RMSProp:
S_dw = beta*S_dw + (1-beta)*dw^2
S_db = beta*S_db + (1-beta)*db^2
w = w - alpha*(dw / (sqrt(S_dw) + epsilon))
b = b - alpha*(db / (sqrt(S_db) + epsilon))
What happens if we raise beta? Same intuition as with momentum: smoother updates, less sensitive to the current gradient.
Adam:
Combines RMSProp and gradient descent with momentum:
V_dw = beta_1*V_dw + (1-beta_1)*dw
S_dw = beta_2*S_dw + (1-beta_2)*dw^2
V_dw^corrected = V_dw / (1-beta_1^t)
S_dw^corrected = S_dw / (1-beta_2^t)
w = w - alpha*V_dw^corrected / (sqrt(S_dw^corrected) + epsilon)
Hyperparameter choice: use the default values for beta_1, beta_2, and epsilon, but alpha, the learning rate, needs to be tuned.
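One Adam step as a sketch, using the usual defaults (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8); t is the 1-based step count:

    import numpy as np

    def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        v = beta1 * v + (1 - beta1) * dw          # momentum term
        s = beta2 * s + (1 - beta2) * dw ** 2     # RMSProp term
        v_corr = v / (1 - beta1 ** t)             # bias correction
        s_corr = s / (1 - beta2 ** t)
        w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
        return w, v, s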
Learning rate decay:
Reduce the learning rate over time so the updates oscillate over a tighter region near the minimum:
alpha = (1 / (1 + decayRate*epoch)) * alpha_0
Other schedules: exponential decay, step decay.
Local optima: in high-dimensional spaces most zero-gradient points are saddle points rather than local optima; plateaus are what really slow learning down.
Computational resources:
- tune: learning rate, mini-batch size
- whether to try Panda (babysit one model) or Caviar (train many models in parallel)
"Trying new hyperparameter values should only be done if new hardware or computational power is acquired" => false
Batch normalization:
Are beta and gamma learned? Yes, both are trainable parameters.
Deep learning programming frameworks don't require cloud-based machines to run.
A framework lets you implement models in fewer lines of code.
"If searching among a large number of hyperparameters, you should try values in a grid rather than random values, so that you can carry out the search more systematically and not rely on chance" => false; use random search.
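A sketch of one random draw for the search, sampling the learning rate on a log scale; the ranges here are illustrative, not prescribed:

    import numpy as np

    def sample_hyperparams():
        r = np.random.uniform(-4, -1)    # sample the exponent, not alpha itself
        alpha = 10 ** r                  # alpha in [1e-4, 1e-1], log-uniform
        batch_size = int(np.random.choice([32, 64, 128, 256]))
        return alpha, batch_size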
After training a neural network with batch norm, at test time, to evaluate it on a new example you still perform the same normalizations, but don't use the most recent mini-batch's mean and sigma; use a mean and sigma estimated with an exponentially weighted average across the mini-batches seen during training.
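A sketch of that idea, with hypothetical running statistics kept alongside training:

    def update_running_stats(batch_mean, batch_var, run_mean, run_var, beta=0.9):
        # Exponentially weighted average over the mini-batches seen so far.
        run_mean = beta * run_mean + (1 - beta) * batch_mean
        run_var = beta * run_var + (1 - beta) * batch_var
        return run_mean, run_var

    def batchnorm_test(x, run_mean, run_var, gamma, beta_shift, eps=1e-8):
        # At test time, normalize with the running statistics, not the current batch.
        x_norm = (x - run_mean) / (run_var + eps) ** 0.5
        return gamma * x_norm + beta_shift   # beta_shift is batch norm's learned beta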
SEQUENCE MODELS
Many-to-one RNN model architecture: reads in a whole sequence and outputs a single result; tasks it can address include, e.g., sentiment classification of a sentence.
If you are training an RNN and find that your LSTM weights and activations are all taking on the value NaN ("Not a Number") => exploding gradients.
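The usual remedy is gradient clipping; a minimal sketch, assuming a dict of numpy gradient arrays and an illustrative max_norm:

    import numpy as np

    def clip_gradients(grads, max_norm=5.0):
        # Rescale all gradients when their global norm exceeds max_norm,
        # preventing the exploding updates that produce NaNs.
        norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
        if norm > max_norm:
            for k in grads:
                grads[k] = grads[k] * (max_norm / norm)
        return grads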
GRU:
Gu (update gate) has the same dimension as the number of hidden units.
When sampling, choose the r-th training sample first, then the s-th word within it.
If we want c<t> to be highly dependent on c<t-1>, we want Gu to be very low.
Gr => relevance gate: how much the previous state matters for computing the candidate.
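A sketch of the GRU memory update, which shows why a low Gu keeps c<t> close to c<t-1> (weight shapes follow the course's notation):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def gru_step(c_prev, x, Wu, Wr, Wc, bu, br, bc):
        concat = np.concatenate([c_prev, x])
        gu = sigmoid(Wu @ concat + bu)            # update gate
        gr = sigmoid(Wr @ concat + br)            # relevance gate
        c_tilde = np.tanh(Wc @ np.concatenate([gr * c_prev, x]) + bc)
        return gu * c_tilde + (1 - gu) * c_prev   # low gu => c stays near c_prev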
Dimensionality in word embedding:
"The sparsity of connections and weight sharing are mechanisms that allow us to use fewer parameters in a convolutional layer, making it possible to train a network with smaller training sets." => true
Number of weights per filter: f * f * n_c_prev (filter height x width x input channels)
Total number of weights for all filters: f * f * n_c_prev * n_filters
Bias parameters: one per filter => n_filters
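A worked count for an illustrative layer with ten 3x3 filters over a 3-channel input:

    def conv_param_count(f, n_c_prev, n_filters):
        weights_per_filter = f * f * n_c_prev           # 3*3*3 = 27
        total_weights = weights_per_filter * n_filters  # 27*10 = 270
        biases = n_filters                              # one bias per filter
        return total_weights + biases                   # 270 + 10 = 280

    print(conv_param_count(3, 3, 10))  # 280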
LSTM
Gu => update gate
Gf => forget gate
Go => output gate
Each gate (Gu, Gf, Go) has dimension = # hidden units in the LSTM.
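The full LSTM step as a sketch in the course's notation; the W and b dicts holding per-gate parameters are an assumed layout:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def lstm_step(a_prev, c_prev, x, W, b):
        concat = np.concatenate([a_prev, x])
        gu = sigmoid(W['u'] @ concat + b['u'])   # update gate
        gf = sigmoid(W['f'] @ concat + b['f'])   # forget gate
        go = sigmoid(W['o'] @ concat + b['o'])   # output gate
        c_tilde = np.tanh(W['c'] @ concat + b['c'])
        c = gu * c_tilde + gf * c_prev           # mix new candidate and old memory
        a = go * np.tanh(c)                      # every gate has the hidden-state dim
        return a, c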