
Reinforcement Learning Basics

A six-slide crash course for engineers & researchers

What you’ll learn


• What RL is: sequential decision-making by learning to act via rewards.
• Key ingredients: agent, environment, state, action, reward, policy, value.
• Mathematical model: Markov Decision Processes (MDPs).
• Core learning paradigms: Dynamic Programming, Monte Carlo, Temporal-Difference.
• Algorithms you should know: Q-learning, SARSA, Policy Gradient, Actor-Critic, DQN.
• Practical tips: exploration, stability, sample efficiency, evaluation, resources.



The Reinforcement Learning Problem
Goal: learn a policy π(a|s) that maximizes expected return.
Return (discounted cumulative reward):
G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· = ∑_{k=0}^{∞} γ^k r_{t+k+1}

Agent–Environment Loop
• At time t: observe state s_t, pick action a_t ∼ π(·|s_t).
• Environment returns reward r_{t+1} and next state s_{t+1}.
• Objective: maximize J(π) = E_π[G_0] (see the sketch below).
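A minimal sketch of this loop in Python, assuming the Gymnasium package (listed under Resources) with CartPole-v1 and a random policy standing in for π(a|s):

import gymnasium as gym

env = gym.make("CartPole-v1")
gamma = 0.99

state, _ = env.reset(seed=0)
G, discount, done = 0.0, 1.0, False
while not done:
    action = env.action_space.sample()                           # a_t ~ pi(.|s_t), here a random policy
    state, reward, terminated, truncated, _ = env.step(action)   # environment returns r_{t+1}, s_{t+1}
    G += discount * reward                                       # accumulate G_0 = sum_k gamma^k r_{k+1}
    discount *= gamma
    done = terminated or truncated
print("discounted return G_0:", G)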

On-policy vs Off-policy
• On-policy: learn about the policy being used to generate data (e.g., SARSA).
• Off-policy: learn about a target policy from data generated by a (possibly different) behavior policy (e.g., Q-learning).



MDPs & Value Functions
We model tasks as an MDP: M = (S, A, P, R, γ), with the Markov property.

Value functions:
State-value: V_π(s) = E_π[G_t | s_t = s]
Action-value: Q_π(s,a) = E_π[G_t | s_t = s, a_t = a]

Bellman equations:
V_π(s) = E_{a∼π, s′∼P}[ r(s,a) + γ V_π(s′) ]
Q_π(s,a) = E_{s′∼P}[ r(s,a) + γ E_{a′∼π}[ Q_π(s′,a′) ] ]
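For a fixed policy the Bellman equation is linear in V_π, so on a small known MDP it can be solved directly. A sketch with a made-up two-state MDP, where P_pi and r_pi are already averaged over π:

import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],          # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
                 [0.3, 0.7]])
r_pi = np.array([1.0, 0.0])           # r_pi[s]     = expected one-step reward under pi

V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)   # solves V = r_pi + gamma * P_pi @ V exactly
print(V)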

Optimality & Improvement:


• Optimal value: V*(s) = max_a Q*(s,a).
• Bellman optimality: Q*(s,a) = E[ r + γ max_{a′} Q*(s′,a′) ].
• Policy improvement: greedify w.r.t. current value ⇒ monotonic improvement.



Core Learning Paradigms
Dynamic Programming (DP)
• Requires known model (P, R); sweeps over state space.
• Policy evaluation ⇄ policy improvement (Policy Iteration).
• Value Iteration: one-step lookahead + max backup (sketch below).
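A Value Iteration sketch on a made-up tabular model (random toy P and R; the shapes are hypothetical), ending with greedification:

import numpy as np

n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # toy dynamics P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                         # toy rewards R[s, a]

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V              # one-step lookahead: Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] V[s']
    V_new = Q.max(axis=1)              # max backup (Bellman optimality)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
greedy_policy = Q.argmax(axis=1)       # greedify w.r.t. the converged values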

Monte Carlo (MC)
• No model; learn from complete episodes.
• Target is the empirical return; unbiased but high variance (sketch below).
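A first-visit MC prediction sketch; `episode` is a hypothetical list of (state, reward) pairs from one complete rollout, and how episodes are generated is left abstract:

from collections import defaultdict

def mc_update(V, counts, episode, gamma=0.99):
    G, returns = 0.0, []
    for state, reward in reversed(episode):      # backward pass: G_t = r_{t+1} + gamma * G_{t+1}
        G = reward + gamma * G
        returns.append((state, G))
    seen = set()
    for state, G in reversed(returns):           # forward pass: keep only the first visit to each state
        if state in seen:
            continue
        seen.add(state)
        counts[state] += 1
        V[state] += (G - V[state]) / counts[state]   # running average of observed returns
    return V

V, counts = defaultdict(float), defaultdict(int)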

Temporal-Difference (TD)
• Bootstraps from estimates; learn online, from partial episodes.
• TD(0) update: V(s) ← V(s) + α [ r + γ V(s′) − V(s) ].
• Model-free targets that are biased (they bootstrap) but have lower variance than MC (sketch below).
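The TD(0) update as code, for a tabular value function stored in a dict (a sketch, not tied to any particular environment):

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    v_s = V.get(s, 0.0)
    v_next = 0.0 if done else V.get(s_next, 0.0)     # bootstrap from the current estimate of V(s')
    V[s] = v_s + alpha * (r + gamma * v_next - v_s)  # V(s) <- V(s) + alpha [ r + gamma V(s') - V(s) ]
    return V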



Algorithms You Should Know
Value-based
• Q-learning (off-policy): Q(s,a) ← Q(s,a) + α [ r + γ max_{a′} Q(s′,a′) − Q(s,a) ].
• SARSA (on-policy): Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ].
• Deep Q-Network (DQN): experience replay + target network for stability (tabular updates sketched below).
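Both tabular updates as code (a sketch; Q is a NumPy array of shape (n_states, n_actions)). Note that only the TD target differs:

import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    target = r + (0.0 if done else gamma * Q[s_next].max())     # off-policy: greedy (max) target
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    target = r + (0.0 if done else gamma * Q[s_next, a_next])   # on-policy: the action actually taken next
    Q[s, a] += alpha * (target - Q[s, a])

DQN keeps the Q-learning target but swaps the table for a neural network, adding experience replay and a target network for stability (a minimal replay buffer is sketched on the last slide).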

Policy-based
• REINFORCE: ∇_θ J = E_π[ ∑_t ∇_θ log π_θ(a_t|s_t) (G_t − b) ].
• Baselines/advantages reduce variance (e.g., A2C/A3C, GAE); see the sketch below.
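A REINFORCE loss sketch, assuming PyTorch (any autodiff framework works); `logits` are a policy network's outputs for the visited states, `returns` holds G_t per step, and `baseline` is e.g. a learned V(s_t) or a running mean:

import torch

def reinforce_loss(logits, actions, returns, baseline):
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                 # log pi_theta(a_t | s_t)
    advantages = returns - baseline                    # (G_t - b) reduces variance without adding bias
    return -(log_probs * advantages.detach()).mean()   # minimizing this follows the gradient estimator above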

Actor-Critic
• Actor updates policy; Critic learns value (V or Q).
• Modern stable methods: PPO/TRPO (clipping or KL constraints).
• Exploration: ε-greedy, entropy regularization, softmax over Q (ε-greedy sketch below).
• Function approximation: linear → deep NNs; beware divergence off-policy.
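An ε-greedy selection sketch over the action values Q(s,·) of the current state:

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q_s, epsilon):
    # Q_s: 1-D array of action values Q(s, .) for the current state
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_s)))   # explore: uniform random action
    return int(np.argmax(Q_s))               # exploit: greedy action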



Practice, Pitfalls, & Resources
Practical Tips
• Reward design: align incentives; avoid unintended shortcuts.
• Stability: target networks, normalization, gradient clipping, proper γ.
• Sample efficiency: replay buffers, off-policy learners, model-based rollouts (minimal replay buffer sketched below).
• Partial observability: use history, RNNs, or belief states (POMDPs).
• Safety & evaluation: offline metrics, ablations, confidence intervals.
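A minimal experience-replay buffer sketch (fixed capacity, uniform sampling), the kind of component replay-based off-policy learners build on:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # tuples: (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)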

Common Failure Modes


• Function approximation + off-policy learning + bootstrapping (the "deadly triad") ⇒ divergence.
• Sparse rewards ⇒ exploration collapse (consider shaping or intrinsic bonuses).
• Distribution shift from replay/behavior policy.

Resources
• Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.)
• OpenAI Spinning Up; Gymnasium; CleanRL; RLlib; Dopamine.
• Classic benchmarks: CartPole, MountainCar, Atari, MuJoCo.

