Recurrent Neurons and Layers
1. The simplest RNN has just one neuron that:
● Receives inputs at each time step t
● Receives its own previous output from time step t-1
● At the first time step, there is no previous output, so it is taken to be 0
2. When expanded to a full RNN layer:
○ Every neuron receives both the input vector x(t)
○ Every neuron receives the output vector from the previous time step, ŷ(t-1)
○ The inputs and outputs become vectors instead of scalars
3. Each recurrent neuron has two weight sets:
● wx: weights for current input x(t)
● wŷ: weights for previous outputs ŷ(t-1)
● For a full layer, these become matrices Wx and Wŷ
4. The output calculation for a single instance is:
ŷ(t) = ϕ(Wx⊺x(t) + Wŷ⊺ŷ(t-1) + b)
5. For a mini-batch, the output calculation becomes:
Ŷ(t) = ϕ(X(t)Wx + Ŷ(t-1)Wŷ + b) = ϕ([X(t) Ŷ(t-1)]W + b), with W = [Wx; Wŷ] (Wx and Wŷ concatenated vertically)
Where:
● Ŷ(t) is an m × n_neurons matrix containing the layer's outputs at time step t for all m instances
● X(t) is an m × n_inputs matrix containing the inputs for all instances
● Wx is an n_inputs × n_neurons matrix holding the weights for the current inputs
● Wŷ is an n_neurons × n_neurons matrix holding the weights for the previous outputs
● b is the bias vector of size n_neurons
6. Key characteristics:
● The output Ŷ(t) depends on both the current input X(t) and the previous output Ŷ(t-1) (a NumPy sketch of this computation follows this list)
● This creates a chain of dependencies going back to the first time step
● At t=0, previous outputs are initialized to zeros
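A minimal NumPy sketch of the mini-batch computation Ŷ(t) = ϕ([X(t) Ŷ(t-1)]W + b) at t = 0; the sizes (m = 4 instances, 3 inputs, 2 neurons) are illustrative, not from the text:
import numpy as np

m, n_inputs, n_neurons = 4, 3, 2
X_t = np.random.rand(m, n_inputs)          # inputs at time step t
Y_prev = np.zeros((m, n_neurons))          # previous outputs, zeros at t = 0
Wx = np.random.rand(n_inputs, n_neurons)   # weights for the current inputs
Wy = np.random.rand(n_neurons, n_neurons)  # weights for the previous outputs
b = np.zeros(n_neurons)                    # bias vector

W = np.vstack([Wx, Wy])                    # W = [Wx; Wy], vertical concatenation
Y_t = np.tanh(np.hstack([X_t, Y_prev]) @ W + b)
# Same result as the unconcatenated form:
# Y_t = np.tanh(X_t @ Wx + Y_prev @ Wy + b)
print(Y_t.shape)                           # (4, 2), i.e. m × n_neurons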
Memory Cells
1. Memory in RNNs:
● A recurrent neuron's output at time t depends on all previous inputs
● This creates a form of memory in the network
● Any part of a neural network that maintains state across time steps
is called a memory cell
2. Basic Memory Cells:
○ A single recurrent neuron is a basic memory cell
○ A layer of recurrent neurons is also a basic memory cell
○ These basic cells can typically learn patterns about 10 steps long
○ The pattern length capability varies depending on the task
3. More Complex Cells:
● Later chapters cover more sophisticated cell types
● These can learn patterns roughly 10 times longer
● Pattern length still varies based on the task
4. Cell State Characteristics:
● Cell state at time t is denoted as h(t) (h stands for "hidden")
● State is a function of:
○ Current inputs x(t)
○ Previous state h(t-1)
● Written as: h(t) = f(x(t), h(t-1))
5. Cell Output:
● Output at time t is denoted as ŷ(t)
● Output is a function of:
○ Previous state
○ Current inputs
● In basic cells, the output equals the state (see the sketch after this list)
● In more complex cells, the output may differ from the state
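A minimal sketch (using Keras's SimpleRNNCell; not code from the text) showing that a basic cell's output at each step is the same tensor as its new state:
import tensorflow as tf

cell = tf.keras.layers.SimpleRNNCell(units=4)
x_t = tf.random.normal([1, 3])      # current inputs x(t): 1 instance, 3 features
h_prev = [tf.zeros([1, 4])]         # previous state h(t-1), zeros at the first step
y_t, h_t = cell(x_t, h_prev)        # h(t) = f(x(t), h(t-1)); the cell also returns ŷ(t)
print(bool(tf.reduce_all(y_t == h_t[0])))  # True: for a basic cell, output == state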
Input and Output Sequences
1. Sequence-to-Sequence (top-left):
● Takes a sequence and outputs a sequence
● Example: Power consumption forecasting where you input N days of data and
output predictions shifted by one day
● Best for tasks where input and output are naturally sequential and aligned
2. Sequence-to-Vector (top-right):
● Takes a sequence but only uses final output
● Example: Sentiment analysis of movie reviews, where words are the input
sequence and the output is a single sentiment score
● Good for classification/scoring of sequential data (see the return_sequences sketch after this list)
3. Vector-to-Sequence (bottom-left):
● Takes a single vector repeatedly as input and produces a sequence
● Example: Image captioning, where a CNN-processed image is input and
the output is a sequence of words describing it
● Useful when generating sequential content from a fixed input
4. Encoder-Decoder (bottom-right):
● Combines sequence-to-vector (encoder) with vector-to-sequence
(decoder)
● Example: Language translation, where input sentence is encoded to a
vector, then decoded to target language
● Better than direct sequence-to-sequence for translation because it can
consider entire input context before generating output
● More complex implementation than the diagram suggests (covered in
Chapter 16)
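A short sketch (not from the text) of how these modes map onto Keras: return_sequences=True yields an output at every time step (sequence-to-sequence), while the default keeps only the last output (sequence-to-vector):
import tensorflow as tf

seq_in = tf.random.normal([2, 7, 1])   # batch of 2 sequences, 7 time steps, 1 feature

seq2seq = tf.keras.layers.SimpleRNN(4, return_sequences=True)
print(seq2seq(seq_in).shape)           # (2, 7, 4): one output vector per time step

seq2vec = tf.keras.layers.SimpleRNN(4)
print(seq2vec(seq_in).shape)           # (2, 4): only the final time step's output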
Training RNNs
1. Basic Concept:
● BPTT involves unrolling the RNN through time
● Uses regular backpropagation principles on the unrolled network
● Consists of forward pass followed by backward pass
2. Forward Pass:
● Network processes the input sequence from start to finish
● Represented by dashed arrows in Figure 15-5
● Generates predictions Ŷ(0) through Ŷ(T) for each timestep
3. Loss Function:
● Evaluates output sequence against target sequence
● Format: ℒ(Y(0), Y(1), ..., Y(T); Ŷ(0), Ŷ(1), ..., Ŷ(T))
● Can selectively ignore certain outputs depending on the task
● Example: Sequence-to-vector RNNs only use the final output
4. Backward Pass:
● Gradients flow backward through the unrolled network
● Only flows through outputs used in loss calculation
● In the example, only flows through Ŷ(2), Ŷ(3), and Ŷ(4)
5. Parameter Updates:
● The same parameters (W and b) are used at every time step
● During the backward pass, the gradients for these shared parameters accumulate contributions from every time step (see the sketch below)
● A single gradient descent step then updates the parameters, just as in regular backpropagation
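A minimal BPTT sketch (illustrative names, not code from the text): the cell is unrolled by hand in a loop, only the last three outputs enter the loss, and automatic differentiation runs the backward pass through the unrolled graph, accumulating gradients for the shared weights:
import tensorflow as tf

cell = tf.keras.layers.SimpleRNNCell(4)
out_layer = tf.keras.layers.Dense(1)
xs = tf.random.normal([8, 5, 1])        # batch of 8, 5 time steps, 1 feature
ys = tf.random.normal([8, 5, 1])        # targets at every time step

with tf.GradientTape() as tape:
    state = [tf.zeros([8, 4])]
    loss = 0.0
    for t in range(5):                  # forward pass: unroll through time
        output, state = cell(xs[:, t], state)
        y_pred = out_layer(output)
        if t >= 2:                      # e.g. only Ŷ(2), Ŷ(3), Ŷ(4) are used
            loss += tf.reduce_mean((y_pred - ys[:, t]) ** 2)

# Backward pass: gradients flow only through the outputs used in the loss,
# and the shared parameters accumulate gradients from every contributing step
grads = tape.gradient(loss, cell.trainable_variables + out_layer.trainable_variables)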
Preparing Data for ML Models
● The text describes preparing time series data for machine learning
models, with the goal of forecasting tomorrow's ridership based on
8 weeks (56 days) of past data.
● The concept of using sliding windows: Every 56-day window from
the past serves as training data, with the target being the value
immediately following each window.
● There are two main ways to build such datasets: Keras's timeseries_dataset_from_array() utility and tf.data's window() method.
First method using timeseries_dataset_from_array():
import tensorflow as tf
my_series = [0, 1, 2, 3, 4, 5]
my_dataset = tf.keras.utils.timeseries_dataset_from_array(
    my_series,
    targets=my_series[3:],  # targets are 3 steps into the future
    sequence_length=3,
    batch_size=2
)
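Iterating over this dataset shows each window of 3 values paired with the value that follows it (expected output, given the series above):
for window, target in my_dataset:
    print(window.numpy(), "=>", target.numpy())
# first batch:  [[0 1 2], [1 2 3]] => [3 4]
# second batch: [[2 3 4]]          => [5]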
Alternative method using window():
dataset = tf.data.Dataset.range(6).window(4, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window_dataset: window_dataset.batch(4))

# Helper function for extracting windows
def to_windows(dataset, length):
    dataset = dataset.window(length, shift=1, drop_remainder=True)
    return dataset.flat_map(lambda window_ds: window_ds.batch(length))
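A short usage sketch of this helper (not from the text), splitting each window into an input/target pair, mirroring what timeseries_dataset_from_array() produced above:
dataset = to_windows(tf.data.Dataset.range(6), length=4)
dataset = dataset.map(lambda window: (window[:-1], window[-1]))
for inputs, target in dataset:
    print(inputs.numpy(), "=>", target.numpy())
# [0 1 2] => 3
# [1 2 3] => 4
# [2 3 4] => 5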
Final data preparation steps for the rail ridership example:
rail_train = df["rail"]["2016-01":"2018-12"] / 1e6
rail_valid = df["rail"]["2019-01":"2019-05"] / 1e6
rail_test = df["rail"]["2019-06":] / 1e6
seq_length = 56
train_ds = tf.keras.utils.timeseries_dataset_from_array(
rail_train.to_numpy(),
targets=rail_train[seq_length:],
sequence_length=seq_length,
batch_size=32,
shuffle=True,
seed=42
valid_ds = tf.keras.utils.timeseries_dataset_from_array(
rail_valid.to_numpy(),
targets=rail_valid[seq_length:],
sequence_length=seq_length,
batch_size=32
)
Forecasting Using a Linear Model
Performance Results:
● The model achieved a validation MAE of approximately 37,866
● This performance is:
○ Better than naive forecasting
○ Worse than the SARIMA model
Key Model Characteristics:
● Uses Huber loss instead of MAE directly for better performance
● Implements early stopping to prevent overfitting
● Uses SGD optimizer with momentum
● Monitors validation MAE for early stopping
Code Snippet:
tf.random.set_seed(42)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=[seq_length])
])
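The snippet above only defines the model. Below is a sketch of a compile/fit setup matching the characteristics listed earlier (Huber loss, SGD with momentum, early stopping on validation MAE); the learning rate, momentum, patience, and epoch count are illustrative assumptions, not necessarily the values used in the text:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    monitor="val_mae", patience=50, restore_best_weights=True)  # patience is an assumption
opt = tf.keras.optimizers.SGD(learning_rate=0.02, momentum=0.9)  # values are assumptions
model.compile(loss=tf.keras.losses.Huber(), optimizer=opt, metrics=["mae"])
history = model.fit(train_ds, validation_data=valid_ds, epochs=500,
                    callbacks=[early_stopping_cb])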
Forecasting Using a Simple RNN
1. Initial Simple RNN Implementation:
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(1, input_shape=[None, 1])
])
2. Input Shape Requirements:
● RNN layers expect 3D inputs: [batch size, time steps, dimensionality]
● input_shape does not include the first dimension (the batch size)
● Time steps can be None (any size)
● Dimensionality is 1 for univariate time series
3. How the Simple RNN Works:
● Initial state h(init) starts at 0
● Each step processes current input and previous state
● Uses hyperbolic tangent (tanh) activation by default
● Outputs only the final time step's value unless return_sequences=True (see the verification sketch after this list)
4. Problems with Initial Model:
● Validation MAE > 100,000 (poor performance)
● Only 3 parameters total (2 weights + 1 bias)
● Limited by tanh activation range (-1 to +1)
● Too simple for the complexity of the data
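A verification sketch (not from the text) that replays the recurrence ŷ(t) = tanh(wx·x(t) + wŷ·ŷ(t-1) + b) in NumPy and matches it against Keras's SimpleRNN(1):
import numpy as np
import tensorflow as tf

rnn = tf.keras.layers.SimpleRNN(1)
x = np.random.rand(1, 4, 1).astype(np.float32)   # 1 series, 4 time steps, 1 feature
keras_out = rnn(x).numpy()                       # only the final output is returned

wx, wy, b = [w.numpy() for w in rnn.weights]     # 2 weights + 1 bias = 3 parameters
y = np.zeros([1, 1], dtype=np.float32)           # the initial state is zero
for t in range(4):
    y = np.tanh(x[:, t] @ wx + y @ wy + b)       # one step of the recurrence
print(np.allclose(y, keras_out))                 # True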
Forecasting Using a Deep RNN
Code Snippet:
deep_model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1]),
    tf.keras.layers.SimpleRNN(32, return_sequences=True),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1)
])
Fighting the Unstable Gradients Problem
1. Common Deep Learning Techniques That Help:
● Good parameter initialization
● Faster optimizers
● Dropout
2. ReLU and Non-saturating Activation Functions:
● May not help as much with RNNs
● Can actually increase instability
● Risk of exploding outputs due to weight reuse across time steps
● Saturating functions like tanh are preferred (hence being the default)
3. Gradient Issues:
● Gradients can explode
● Solutions include:
○ Using smaller learning rates
○ Monitoring gradient size (via TensorBoard)
○ Using gradient clipping
4. Batch Normalization (BN) Limitations:
● Less effective with RNNs than with feedforward networks
● Cannot be used effectively between time steps
● When used in memory cells:
○ Same BN layer used at each time step
○ Same parameters regardless of input scale
○ Only slightly beneficial when applied to layer inputs
○ Not helpful when applied to hidden states
○ Can slow down training
5. Layer Normalization Benefits:
● Better suited for RNNs than batch normalization
● Normalizes across features dimension instead of batch dimension
● Advantages:
○ Can compute statistics on the fly at each time step
○ Works independently for each instance
○ Consistent behavior during training and testing
○ Doesn't need exponential moving averages
○ Learns scale and offset parameters for each input
6. Implementation:
● Layer normalization is applied just after the linear combination of the inputs and hidden states
● Requires defining a custom memory cell in Keras (see the sketch after this list)
● The cell's call() method needs to handle both:
○ The current time step's inputs
○ The previous time step's hidden states
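A sketch of such a custom cell, close to the approach described (a SimpleRNNCell with its activation removed so that layer normalization can be applied right after the linear combination, followed by the activation); the class name and the model at the end are illustrative:
import tensorflow as tf

class LNSimpleRNNCell(tf.keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        # Linear combination of inputs and hidden states (no activation yet)
        self.simple_rnn_cell = tf.keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = tf.keras.layers.LayerNormalization()
        self.activation = tf.keras.activations.get(activation)

    def call(self, inputs, states):
        # inputs: current time step's inputs; states: previous hidden states
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]   # output equals the new state

# The custom cell is wrapped in a generic tf.keras.layers.RNN layer:
ln_model = tf.keras.Sequential([
    tf.keras.layers.RNN(LNSimpleRNNCell(32), return_sequences=True,
                        input_shape=[None, 5]),
    tf.keras.layers.Dense(14)
])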
Tackling the Short-Term Memory Problem
Two cell types help address this problem:
1. LSTM
2. GRU
LSTM
Code Snippet:
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=[None, 5]),
    tf.keras.layers.Dense(14)
])
GRU
Code Snippet:
model = tf.keras.Sequential([
    tf.keras.layers.GRU(32, return_sequences=True, input_shape=[None, 5]),
    tf.keras.layers.Dense(14)
])