CS445: Neural Networks and Deep Learning
Lecture 4: Backpropagation and Gradient Descent
Professor Chen - Fall 2024
I. Understanding Backpropagation
Today's lecture focused on the mathematics behind neural network training. The
backpropagation algorithm is fundamental to how neural networks learn from data.
Key Concepts:
1. Forward Propagation
- Input signals flow through the network
- Each neuron computes: output = activation_function(weighted_sum + bias)
- Final layer produces prediction
2. Computing the Loss
- Measures difference between prediction and actual target
- Common loss functions:
● Mean Squared Error (MSE): L = (1/n) Σᵢ (yᵢ - ŷᵢ)²
● Cross-Entropy: L = -Σᵢ yᵢ log(ŷᵢ)
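To make these two steps concrete, here is a minimal NumPy sketch of one forward pass through a single dense layer followed by an MSE loss. The inputs, weights, and sigmoid activation are illustrative choices, not part of any assigned network.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative single layer: 3 inputs -> 2 outputs
x = np.array([0.5, -1.0, 2.0])            # input signals
W = np.array([[0.1, -0.2, 0.3],
              [0.4,  0.0, -0.1]])         # weights (2 x 3)
b = np.array([0.01, -0.02])               # biases

z = W @ x + b                             # weighted sum + bias
y_hat = sigmoid(z)                        # activation -> prediction
y = np.array([1.0, 0.0])                  # actual target

loss = np.mean((y - y_hat) ** 2)          # Mean Squared Error
print(loss)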
II. The Chain Rule in Neural Networks
The chain rule is crucial for computing gradients through multiple layers:
∂L/∂w = ∂L/∂a × ∂a/∂z × ∂z/∂w
Where:
- L is the loss
- a is the activation
- z is the weighted sum
- w is the weight
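For a single sigmoid neuron with a squared-error loss on one example, the three factors of the chain rule can be written out directly. The 1/2 in the loss just keeps the derivative clean, and all names are illustrative.

import numpy as np

x, y = 1.5, 1.0                   # input and target
w, b = 0.8, 0.1                   # weight and bias

z = w * x + b                     # weighted sum
a = 1.0 / (1.0 + np.exp(-z))      # sigmoid activation
L = 0.5 * (a - y) ** 2            # squared-error loss

dL_da = a - y                     # ∂L/∂a
da_dz = a * (1 - a)               # ∂a/∂z (sigmoid derivative)
dz_dw = x                         # ∂z/∂w
dL_dw = dL_da * da_dz * dz_dw     # chain rule: ∂L/∂w
print(dL_dw)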
III. Gradient Descent Implementation
def backward_pass(network, loss_grad, learning_rate=0.01):
    # Propagate the gradient of the loss backwards through the layers
    grad = loss_grad
    for layer in reversed(network.layers):
        # compute_gradients is assumed to return this layer's parameter
        # gradients and the gradient to pass back to the previous layer
        layer.gradients, grad = compute_gradients(layer, grad)
        # Update weights and biases (gradient descent step)
        layer.weights -= learning_rate * layer.gradients['weights']
        layer.biases -= learning_rate * layer.gradients['biases']
Types of Gradient Descent:
1. Batch Gradient Descent
- Uses entire dataset for each update
- Very stable but slow
- High memory requirements
2. Stochastic Gradient Descent (SGD)
- Uses single sample for each update
- Faster but noisier
- Lower memory requirements
3. Mini-batch Gradient Descent
- Combines the stability of batch GD with the speed of SGD
- Typically 32-256 samples per batch
- Most commonly used in practice
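A rough sketch of a mini-batch training loop, assuming hypothetical X and y arrays, a network object with a forward method, and the backward_pass function from section III; the batch size and the simple loss gradient are illustrative.

import numpy as np

def train(network, X, y, batch_size=64, epochs=10, lr=0.01):
    n = X.shape[0]
    for epoch in range(epochs):
        # Shuffle each epoch so the mini-batches differ between passes
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            y_hat = network.forward(X_batch)       # forward propagation
            loss_grad = y_hat - y_batch            # e.g. softmax + cross-entropy gradient
            backward_pass(network, loss_grad, lr)  # backprop + parameter update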
IV. Activation Functions
We covered several activation functions and their derivatives (a short code sketch of all three follows the list):
1. Sigmoid
- σ(x) = 1/(1 + e^(-x))
- Derivative: σ(x)(1 - σ(x))
- Issues with vanishing gradients
2. ReLU
- f(x) = max(0, x)
- Derivative: 1 if x > 0, 0 otherwise
- Most commonly used today
3. Tanh
- tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)), range: [-1, 1]
- Derivative: 1 - tanh²(x)
- Zero-centered, so often preferable to sigmoid
- Still has vanishing gradient issues
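A minimal NumPy sketch of the three activations and the derivatives listed above; the helper names are just illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)            # σ(x)(1 - σ(x))

def relu(x):
    return np.maximum(0, x)

def d_relu(x):
    return (x > 0).astype(float)  # 1 if x > 0, 0 otherwise

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1 - np.tanh(x) ** 2    # 1 - tanh²(x)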
V. Common Challenges and Solutions
1. Vanishing Gradients
Solutions:
- Use ReLU activation
- Implement residual connections
- Proper initialization
2. Exploding Gradients
Solutions:
- Gradient clipping (sketched in code below)
- Batch normalization
- L2 regularization
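One common way to implement gradient clipping is to rescale a gradient whenever its L2 norm exceeds a threshold. A minimal sketch, assuming the gradient is a NumPy array; the max_norm value is an arbitrary illustrative choice.

import numpy as np

def clip_gradient(grad, max_norm=5.0):
    # Rescale the gradient if its L2 norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad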
VI. Practical Implementation Tips
1. Weight Initialization:
import numpy as np

# He initialization for ReLU networks (weight matrix of shape n_inputs x n_outputs)
weights = np.random.randn(n_inputs, n_outputs) * np.sqrt(2 / n_inputs)

# Xavier initialization for tanh networks
weights = np.random.randn(n_inputs, n_outputs) * np.sqrt(1 / n_inputs)
2. Learning Rate Selection:
- Start with 0.01
- Use learning rate schedules (a simple decay sketch follows this list)
- Consider adaptive methods (Adam, RMSprop)
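As one example of a learning rate schedule, a step decay multiplies the rate by a fixed factor every few epochs. The drop factor and interval below are illustrative, not recommended settings.

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every `epochs_per_drop` epochs
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Starting at 0.01: epoch 5 -> 0.01, epoch 15 -> 0.005, epoch 25 -> 0.0025
lr = step_decay(0.01, epoch=25)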
VII. Today's Lab Exercise
Implement a simple neural network with:
1. One hidden layer (64 units)
2. ReLU activation
3. Softmax output layer
4. Cross-entropy loss
5. Mini-batch gradient descent
Homework Assignment
Due next Tuesday:
1. Implement backpropagation from scratch
2. Train a network on MNIST dataset
3. Experiment with different:
- Learning rates
- Batch sizes
- Network architectures
Important Formulas to Remember
1. Softmax: σ(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ
2. Cross-Entropy Loss: L = -Σ yᵢ log(ŷᵢ)
3. Weight Update Rule: w = w - α∇L
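These formulas map almost directly onto NumPy. The sketch below subtracts the maximum logit before exponentiating, a standard trick that leaves the softmax value unchanged but avoids overflow; variable names are illustrative.

import numpy as np

def softmax(z):
    # σ(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ, with the max subtracted for numerical stability
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def cross_entropy(y, y_hat, eps=1e-12):
    # L = -Σᵢ yᵢ log(ŷᵢ); eps guards against log(0)
    return -np.sum(y * np.log(y_hat + eps))

logits = np.array([2.0, 1.0, 0.1])
y_true = np.array([1.0, 0.0, 0.0])      # one-hot target
y_hat = softmax(logits)
print(cross_entropy(y_true, y_hat))     # ≈ 0.417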
Additional reading: "Deep Learning" by Goodfellow, Bengio, and Courville - Section 6.5
Next Week's Preview
- Convolutional Neural Networks
- Feature Maps
- Pooling Layers
- CNN Architectures
Recommended Resources
- TensorFlow Documentation
- PyTorch Tutorials
- Stanford CS231n Course Notes
- Andrew Ng's Deep Learning Specialization
Note: Office hours this week are Wednesday 2-4pm and Thursday 3-5pm in Room 405.