Optimization Techniques: Comprehensive Notes
Introduction to Optimization
Optimization is the process of finding the best possible solution to a problem under given circumstances. In mathematical
terms, it involves finding the minimum or maximum of an objective function subject to constraints.
Types of Optimization Problems
1. Based on Nature of Variables
Continuous Optimization
Discrete Optimization (Integer Programming)
Mixed-Integer Optimization
2. Based on Nature of Objective Function
Linear Programming (LP)
Nonlinear Programming (NLP)
Quadratic Programming (QP)
Convex Optimization
Non-convex Optimization
3. Based on Constraints
Unconstrained Optimization
Constrained Optimization
Equality Constrained
Inequality Constrained
First-Order Optimization Methods
1. Gradient Descent
Basic Algorithm
θ(t+1) = θ(t) - η∇f(θ(t))
where:
- θ: Parameters
- η: Learning rate
- ∇f: Gradient of objective function
Variants
1. Batch Gradient Descent
Uses entire dataset
More stable
Computationally expensive
2. Stochastic Gradient Descent (SGD)
Updates parameters using single sample
Faster but noisier
Formula: θ = θ - η∇f(θ; x(i), y(i))
3. Mini-batch Gradient Descent
Compromise between batch and SGD
Typical batch size: 32-512
Better convergence stability
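A minimal sketch of the mini-batch variant above, assuming the gradient is supplied by a user-defined grad_fn (a hypothetical callable, not something from the notes) and using a fixed learning rate:
# Mini-batch gradient descent sketch (grad_fn(theta, X_batch, y_batch) is assumed)
import numpy as np

def minibatch_gd(grad_fn, theta, X, y, lr=0.01, batch_size=64, epochs=10):
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)              # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta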
2. Momentum-based Methods
Classical Momentum
v(t) = γv(t-1) + η∇f(θ(t))
θ(t+1) = θ(t) - v(t)
where:
- γ: Momentum coefficient
- v: Velocity vector
Nesterov Accelerated Gradient (NAG)
v(t) = γv(t-1) + η∇f(θ(t) - γv(t-1))
θ(t+1) = θ(t) - v(t)
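The two updates above differ only in where the gradient is evaluated; a minimal sketch, again assuming a user-supplied grad_fn and carrying the velocity as extra state:
# Classical momentum vs. Nesterov: gradient at theta vs. at the look-ahead point
def momentum_step(theta, v, grad_fn, lr=0.01, gamma=0.9, nesterov=False):
    lookahead = theta - gamma * v if nesterov else theta
    v = gamma * v + lr * grad_fn(lookahead)     # v(t) = γ v(t-1) + η ∇f
    theta = theta - v                           # θ(t+1) = θ(t) - v(t)
    return theta, v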
3. Adaptive Learning Rate Methods
AdaGrad
Adapts learning rate per parameter
Accumulates squared gradients
g(t) = ∇f(θ(t))
s(t) = s(t-1) + g(t)²
θ(t+1) = θ(t) - η/√(s(t) + ε) * g(t)
RMSprop
Exponentially decaying average
s(t) = βs(t-1) + (1-β)g(t)²
θ(t+1) = θ(t) - η/√(s(t) + ε) * g(t)
Adam (Adaptive Moment Estimation)
Combines momentum and RMSprop
m(t) = β₁m(t-1) + (1-β₁)g(t)
v(t) = β₂v(t-1) + (1-β₂)g(t)²
m̂(t) = m(t)/(1-β₁ᵗ)
v̂(t) = v(t)/(1-β₂ᵗ)
θ(t+1) = θ(t) - η * m̂(t)/√(v̂(t) + ε)
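A sketch of a single Adam update following the equations above; the moment estimates are kept as state, t starts at 1, and the defaults are the commonly cited values:
# One Adam update with bias-corrected first and second moment estimates
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g             # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * g**2          # second moment (RMSprop term)
    m_hat = m / (1 - beta1**t)                  # bias correction, t = 1, 2, ...
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / np.sqrt(v_hat + eps)
    return theta, m, v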
Second-Order Optimization Methods
1. Newton's Method
Uses second derivatives (Hessian)
θ(t+1) = θ(t) - [H(θ(t))]⁻¹∇f(θ(t))
where H is the Hessian matrix
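A sketch of one Newton step, assuming callables grad_fn and hess_fn that return the gradient vector and Hessian matrix; the linear system is solved rather than the Hessian explicitly inverted:
# One Newton step: solve H d = ∇f instead of forming H⁻¹
import numpy as np

def newton_step(theta, grad_fn, hess_fn):
    g = grad_fn(theta)
    H = hess_fn(theta)
    d = np.linalg.solve(H, g)      # Newton direction [H]⁻¹∇f
    return theta - d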
2. Quasi-Newton Methods
BFGS (Broyden-Fletcher-Goldfarb-Shanno)
Builds an approximation of the (inverse) Hessian from gradient differences
Stores a dense n×n matrix
Avoids computing and inverting the exact Hessian, so each iteration is cheaper than Newton's method
L-BFGS (Limited-memory BFGS)
Stores only a few recent gradient and update vector pairs
Far more memory efficient than full BFGS
Suitable for large-scale problems
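In practice L-BFGS is usually called through a library rather than hand-written; a sketch using SciPy's minimize (assuming SciPy is available), with SciPy's built-in Rosenbrock function and gradient standing in for a real objective:
# L-BFGS via SciPy; rosen / rosen_der are SciPy's Rosenbrock test objective and gradient
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(10)
result = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(result.x, result.fun)        # minimizer and minimum value found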
Constrained Optimization Techniques
1. Lagrange Multipliers
For equality constraints
L(x,λ) = f(x) + Σᵢλᵢgᵢ(x)
where:
- f(x): Objective function
- gᵢ(x): Constraint functions
- λᵢ: Lagrange multipliers
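A small worked example of the stationarity conditions of L, assuming SymPy is available; the illustrative problem is to minimize x² + y² subject to x + y = 1:
# Stationary point of L(x, y, λ) = f + λ g for f = x² + y², g = x + y - 1
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
L = x**2 + y**2 + lam * (x + y - 1)
solution = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)
print(solution)    # expected: x = y = 1/2, lam = -1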
2. KKT Conditions
For inequality constraints gᵢ(x) ≤ 0
∇f(x*) + Σᵢλᵢ∇gᵢ(x*) = 0 (stationarity)
gᵢ(x*) ≤ 0 (primal feasibility)
λᵢ ≥ 0 (dual feasibility)
λᵢgᵢ(x*) = 0 (complementary slackness)
3. Penalty Methods
Convert constrained to unconstrained
Add penalty term for constraint violation
P(x) = f(x) + c*Σᵢmax(0,gᵢ(x))²
where c is penalty parameter
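A sketch of the quadratic penalty above for constraints gᵢ(x) ≤ 0, assuming the objective f and the constraints are supplied as callables; in practice c is increased over successive unconstrained solves:
# Quadratic penalty: P(x) = f(x) + c * Σ max(0, gᵢ(x))²
def penalized_objective(x, f, constraints, c=10.0):
    violation = sum(max(0.0, g(x))**2 for g in constraints)
    return f(x) + c * violation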
Global Optimization Methods
1. Simulated Annealing
Inspired by annealing in metallurgy
Probabilistic technique
Steps:
1. Generate a neighbor solution
2. Accept it if it improves the objective
3. Otherwise accept it with probability exp(-Δ/T) (Metropolis criterion)
4. Decrease the temperature T according to a cooling schedule
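A compact sketch of this loop, assuming a user-supplied neighbor function and an objective f to minimize, with geometric cooling as an illustrative schedule:
# Simulated annealing with Metropolis acceptance and geometric cooling
import math
import random

def simulated_annealing(f, x0, neighbor, T=1.0, cooling=0.95, steps=1000):
    x, best = x0, x0
    for _ in range(steps):
        cand = neighbor(x)
        delta = f(cand) - f(x)
        # accept improvements always, worse moves with probability exp(-delta / T)
        if delta < 0 or random.random() < math.exp(-delta / T):
            x = cand
            if f(x) < f(best):
                best = x
        T *= cooling       # decrease temperature
    return best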
2. Genetic Algorithms
Population-based search
Components:
1. Selection
2. Crossover
3. Mutation
4. Evaluation
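A minimal real-valued GA combining these components; tournament selection, one-point crossover, and Gaussian mutation are illustrative choices, and fitness is a user-supplied function to maximize:
# Minimal genetic algorithm: selection, crossover, mutation, evaluation
import random

def genetic_algorithm(fitness, dim, pop_size=50, generations=100, mut_rate=0.1):
    pop = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]

    def select():
        # tournament selection: the fitter of two random individuals survives
        a, b = random.sample(pop, 2)
        return a if fitness(a) > fitness(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = select(), select()
            cut = random.randrange(1, dim) if dim > 1 else 0    # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [g + random.gauss(0, 0.1) if random.random() < mut_rate else g
                     for g in child]                            # Gaussian mutation
            children.append(child)
        pop = children                       # next generation replaces the old one
    return max(pop, key=fitness)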
3. Particle Swarm Optimization
Population-based
Inspired by social behavior
Updates:
1. Position
2. Velocity
3. Personal best
4. Global best
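A compact PSO sketch following these four updates, assuming an objective f to minimize over a box [-bound, bound]^dim; the inertia and acceleration coefficients are typical illustrative values:
# Particle swarm optimization: velocity and position updates with personal/global bests
import numpy as np

def pso(f, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5, bound=5.0):
    rng = np.random.default_rng()
    x = rng.uniform(-bound, bound, (n_particles, dim))   # positions
    v = np.zeros((n_particles, dim))                     # velocities
    pbest = x.copy()                                     # personal bests
    pbest_val = np.array([f(p) for p in x])
    gbest = pbest[np.argmin(pbest_val)].copy()           # global best
    for _ in range(iters):
        r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, -bound, bound)
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest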
Special Optimization Techniques
1. Linear Programming
Simplex Method
Interior Point Methods
Dual Problems
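Small LPs are usually handed to a solver; a sketch using SciPy's linprog (assuming SciPy is available), on a made-up problem: maximize 3x + 2y subject to x + y ≤ 4, x + 3y ≤ 6, x, y ≥ 0, posed as minimization of the negated objective:
# linprog minimizes c @ x subject to A_ub @ x <= b_ub, so negate c to maximize
from scipy.optimize import linprog

c = [-3, -2]                          # maximize 3x + 2y  ->  minimize -3x - 2y
A_ub = [[1, 1], [1, 3]]
b_ub = [4, 6]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print(res.x, -res.fun)                # optimal point and maximum objective value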
2. Dynamic Programming
Principle of optimality
Overlapping subproblems
Memoization
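A small illustration of overlapping subproblems and memoization, caching each subproblem so it is solved only once (the classic Fibonacci recurrence is used here purely as an example):
# Memoization: cache subproblem results so overlapping subproblems are solved once
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)    # exponential without the cache, linear with it

print(fib(50))    # 12586269025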
3. Convex Optimization
Interior Point Methods
Cutting Plane Methods
Ellipsoid Method
Practical Considerations
1. Learning Rate Selection
Fixed learning rate
Learning rate schedules
Adaptive methods
Grid search
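Two common learning rate schedules, sketched as functions of the epoch index; the decay constants are illustrative, not recommended values:
# Step decay and exponential decay of the learning rate
import math

def step_decay(epoch, lr0=0.1, drop=0.5, every=10):
    return lr0 * (drop ** (epoch // every))      # halve the rate every 10 epochs

def exponential_decay(epoch, lr0=0.1, k=0.05):
    return lr0 * math.exp(-k * epoch)            # smooth exponential decay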
2. Batch Size Selection
Memory constraints
Computational efficiency
Convergence stability
Parallelization
3. Initialization
Xavier/Glorot initialization
He initialization
Random initialization
Zero initialization
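A sketch of Xavier/Glorot and He initialization for a dense layer with fan_in inputs and fan_out outputs, using the standard variance-scaling factors (√(6/(fan_in+fan_out)) uniform bound for Glorot, √(2/fan_in) standard deviation for He):
# Variance-scaled weight initialization for a layer of shape (fan_in, fan_out)
import numpy as np

def xavier_init(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))        # uniform Glorot bound
    return np.random.uniform(-limit, limit, (fan_in, fan_out))

def he_init(fan_in, fan_out):
    std = np.sqrt(2.0 / fan_in)                      # He normal std, suited to ReLU
    return np.random.normal(0.0, std, (fan_in, fan_out))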
4. Regularization
L1 regularization
L2 regularization
Elastic net
Early stopping
Common Challenges and Solutions
1. Local Minima
Multiple restarts
Momentum methods
Stochastic methods
Population-based methods
2. Saddle Points
Second-order methods
Adding noise
Momentum methods
3. Ill-conditioning
Preconditioning
Adaptive methods
Quasi-Newton methods
4. Vanishing/Exploding Gradients
Gradient clipping
Layer normalization
Residual connections
Proper initialization
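A sketch of gradient clipping by global norm, which rescales the gradient whenever its L2 norm exceeds a threshold:
# Clip the gradient by its global L2 norm to bound the update size
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad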
Implementation Tips
1. Code Optimization
# Example of efficient gradient computation
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradient(X, y, w):
    m = len(y)
    h = sigmoid(np.dot(X, w))                    # predicted probabilities
    gradient = (1 / m) * np.dot(X.T, (h - y))    # average gradient of the log-loss
    return gradient
2. Monitoring Convergence
# Example of convergence monitoring
def check_convergence(loss_history, tol=1e-6):
    # converged when the latest improvement in the loss falls below tol
    if len(loss_history) < 2:
        return False
    return abs(loss_history[-1] - loss_history[-2]) < tol
Advanced Topics
1. Multi-objective Optimization
Pareto optimality
Weighted sum method
ε-constraint method
Goal programming
2. Online Optimization
Online learning
Regret minimization
Bandit algorithms
3. Distributed Optimization
Parameter server
AllReduce
Asynchronous SGD
Model averaging
Best Practices
1. Problem Analysis
Understand problem structure
Identify constraints
Choose appropriate method
2. Implementation
Start simple
Monitor convergence
Use proper validation
Implement early stopping
3. Tuning
Grid/random search
Bayesian optimization
Cross-validation
Ensemble methods
Conclusion
Success in optimization requires:
1. Understanding of problem structure
2. Proper method selection
3. Careful implementation
4. Proper monitoring and tuning
5. Consideration of computational resources