Reinforcement Learning Basics
A six-slide crash course for engineers & researchers
What you’ll learn
• What RL is: sequential decision-making by learning to act via rewards.
• Key ingredients: agent, environment, state, action, reward, policy, value.
• Mathematical model: Markov Decision Processes (MDPs).
• Core learning paradigms: Dynamic Programming, Monte Carlo, Temporal-Difference.
• Algorithms you should know: Q-learning, SARSA, Policy Gradient, Actor-Critic, DQN.
• Practical tips: exploration, stability, sample efficiency, evaluation, resources.
The Reinforcement Learning Problem
Goal: learn a policy π(a|s) that maximizes expected return.
Return (discounted cumulative reward):
G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· = Σ_{k=0}^{∞} γ^k r_{t+k+1}
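A quick worked check in Python: G_t can be computed from a reward list by accumulating from the back (the rewards and γ below are made-up illustrative values).

# Discounted return G_t = r_{t+1} + gamma*r_{t+2} + ..., accumulated from the back.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29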
Agent–Environment Loop
• At time t: observe state s_t, pick action a_t ~ π(·|s_t).
• Environment returns reward r_{t+1} and next state s_{t+1}.
• Objective: maximize J(π) = 𝔼_π[G_0].
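The loop above as a minimal Python sketch using Gymnasium (listed under Resources), with a random policy standing in for π; the environment name and seed are arbitrary choices.

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_return, done = 0.0, False
while not done:
    action = env.action_space.sample()                    # placeholder for a_t ~ pi(.|s_t)
    obs, reward, terminated, truncated, info = env.step(action)
    total_return += reward                                # undiscounted episode return
    done = terminated or truncated
env.close()
print("episode return:", total_return)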
On-policy vs Off-policy
• On■policy: learn about the policy being used to generate data (e.g., SARSA).
• Off-policy: learn about a target policy from data generated by a (possibly different) behavior policy (e.g., Q-learning).
MDPs & Value Functions
We model tasks as an MDP: ℳ = (𝒮, 𝒜, P, R, γ) with the Markov property (the next state and reward depend only on the current state and action).
Value functions:
State-value: V_π(s) = 𝔼_π[G_t | s_t = s]
Action-value: Q_π(s,a) = 𝔼_π[G_t | s_t = s, a_t = a]
Bellman equations:
V_π(s) = 𝔼_{a∼π, s′∼P}[ r(s,a) + γ V_π(s′) ]
Q_π(s,a) = 𝔼_{s′∼P}[ r(s,a) + γ 𝔼_{a′∼π}[ Q_π(s′,a′) ] ]
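For a finite MDP and a fixed policy, the Bellman expectation equation is linear in V_π and can be solved exactly. A small NumPy sketch on a hypothetical two-state MDP (the P_pi and r_pi numbers are purely illustrative):

import numpy as np

gamma = 0.9
# Transition matrix and expected rewards under a fixed policy pi (illustrative values).
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
r_pi = np.array([1.0, 0.0])
# V_pi(s) = E[ r + gamma * V_pi(s') ]  <=>  (I - gamma * P_pi) V = r_pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V_pi)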
Optimality & Improvement:
• Optimal value: V*(s) = max_a Q*(s,a).
• Bellman optimality: Q*(s,a) = 𝔼[ r + γ max_{a′} Q*(s′,a′) ].
• Policy improvement: greedify w.r.t. current value ⇒ monotonic improvement.
Core Learning Paradigms
Dynamic Programming (DP)
• Requires known model (P, R); sweeps over state space.
• Alternate policy evaluation and policy improvement (Policy Iteration).
• Value Iteration: one■step lookahead + max backup.
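A minimal value-iteration sketch for a small known MDP; the tensor layout (P[s, a, s'] transition probabilities, R[s, a] expected rewards) is an assumption chosen for illustration.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # P: (S, A, S') transition probabilities; R: (S, A) expected rewards.
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V                 # one-step lookahead: Q[s, a]
        V_new = Q.max(axis=1)                 # max backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)    # V* and a greedy (optimal) policy
        V = V_new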
Monte Carlo (MC)
• No model; learn from complete episodes.
• Target is empirical return; unbiased but high variance.
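First-visit MC prediction in a few lines; the (state, reward) trajectory format below is an assumed convention for the sketch, not a library API.

from collections import defaultdict

def mc_first_visit(episodes, gamma=0.99):
    # episodes: list of trajectories, each a list of (state, reward) pairs,
    # where reward is the reward received after leaving that state.
    returns = defaultdict(list)
    for episode in episodes:
        # Backward pass: compute G_t for every time step.
        g, gs = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            gs[t] = g
        # First-visit: record a return only at the first occurrence of each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(gs[t])
    # Value estimate = empirical mean return per state (unbiased, high variance).
    return {s: sum(v) / len(v) for s, v in returns.items()}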
Temporal-Difference (TD)
• Bootstraps from estimates; learn online, from partial episodes.
• TD(0) update: V(s) ← V(s) + α [ r + γ V(s′) − V(s) ].
• Model-free targets that bootstrap from current estimates: biased, but typically lower variance than MC.
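The TD(0) update above, applied online to a stream of transitions; the (s, r, s_next, done) tuple format is an assumption for the sketch.

from collections import defaultdict

def td0(transitions, alpha=0.1, gamma=0.99):
    V = defaultdict(float)
    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * V[s_next])  # bootstrap from current estimate
        V[s] += alpha * (target - V[s])                    # TD(0) update
    return V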
Algorithms You Should Know
Value-based
• Q-learning (off-policy): Q(s,a) ← Q(s,a) + α [ r + γ max_{a′} Q(s′,a′) − Q(s,a) ].
• SARSA (on-policy): Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ].
• Deep Q-Network (DQN): experience replay + target network for stability.
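A tabular Q-learning sketch against a discrete Gymnasium environment; the environment name and hyperparameters are arbitrary, and swapping the max target for Q[s2, a2] with a2 drawn from the behavior policy turns it into SARSA.

import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy behavior policy
        a = env.action_space.sample() if np.random.rand() < eps else int(Q[s].argmax())
        s2, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Off-policy target: greedy max over next actions (zero if the episode truly terminated).
        target = r + (0.0 if terminated else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2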
Policy-based
• REINFORCE: ∇_θ J = 𝔼[ Σ_t ∇_θ log π_θ(a_t|s_t) (G_t − b) ].
• Baselines/advantages reduce variance (e.g., A2C/A3C, GAE).
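The REINFORCE gradient written out in NumPy for a linear softmax policy over discrete actions; the feature-vector episode format, the constant baseline, and the parameterization are all illustrative assumptions.

import numpy as np

def softmax(h):
    z = np.exp(h - h.max())
    return z / z.sum()

def reinforce_gradient(episode, theta, gamma=0.99):
    # episode: list of (x, a, r) with x a state feature vector of length d,
    # a the chosen action index, r the following reward.
    # Policy: pi_theta(a|x) = softmax(theta.T @ x), theta of shape (d, n_actions).
    g, gs = 0.0, []
    for _, _, r in reversed(episode):
        g = r + gamma * g
        gs.append(g)
    gs.reverse()
    baseline = np.mean(gs)                  # crude constant baseline; a learned V(s) works better
    grad = np.zeros_like(theta)
    for (x, a, _), g_t in zip(episode, gs):
        pi = softmax(theta.T @ x)
        score = -pi
        score[a] += 1.0                     # grad of log pi(a|x) w.r.t. the softmax preferences
        grad += np.outer(x, score) * (g_t - baseline)
    return grad                             # gradient ascent step: theta += lr * grad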
Actor-Critic
• Actor updates policy; Critic learns value (V or Q).
• Modern stable methods: PPO/TRPO (clipping or KL constraints).
• Exploration: ε-greedy, entropy regularization, softmax over Q.
• Function approximation: linear → deep NNs; beware divergence off-policy.
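Two of the exploration rules above over a single row of Q-values, as a short sketch (ε and the temperature are arbitrary defaults).

import numpy as np

def epsilon_greedy(q_row, eps=0.1):
    # With probability eps take a uniformly random action, otherwise the greedy one.
    if np.random.rand() < eps:
        return np.random.randint(len(q_row))
    return int(np.argmax(q_row))

def softmax_action(q_row, temperature=1.0):
    # Boltzmann exploration: sample actions in proportion to exp(Q / T).
    prefs = np.asarray(q_row, dtype=float) / temperature
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return int(np.random.choice(len(q_row), p=probs))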
Practice, Pitfalls, & Resources
Practical Tips
• Reward design: align incentives; avoid unintended shortcuts.
• Stability: target networks, normalization, gradient clipping, proper γ.
• Sample efficiency: replay buffers, off-policy learners, model-based rollouts.
• Partial observability: use history, RNNs, or belief states (POMDPs).
• Safety & evaluation: offline metrics, ablations, confidence intervals.
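A minimal replay buffer of the kind referenced under sample efficiency; capacity and batch size are arbitrary, and production implementations typically add prioritization and more efficient storage.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks temporal correlation within training batches.
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # (states, actions, rewards, next_states, dones)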
Common Failure Modes
• Function approximation + off-policy learning + bootstrapping ⇒ divergence (the "deadly triad").
• Sparse rewards ⇒ exploration collapse (consider shaping or intrinsic bonuses).
• Distribution shift from replay/behavior policy.
Resources
• Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.)
• OpenAI Spinning Up; Gymnasium; CleanRL; RLlib; Dopamine.
• Classic benchmarks: CartPole, MountainCar, Atari, MuJoCo.