Introduction - Week 1

Introduction to RL

Reinforcement Learning is exactly this:
“Learning by interacting with the environment.”
The Two Core Components: Agent and Environment
1. Environment
•In this example, the grid is the environment.
•Each cell s1, s2, …, s12 is a state.
•The environment provides feedback to the agent after
every move in the form of a reward.

2. Agent
•The little robot is the agent.
•It can take actions: up, down, left, right.
•Its goal is to learn which sequence of moves leads it to
good outcomes (i.e., the diamond) and avoids bad
ones (i.e., the fire).

The agent's goal is to learn a policy — a mapping from each state to the best action —
that maximizes total cumulative reward over time.
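To make this concrete, a policy over a small, discrete state space like the 12-cell grid can be stored as a simple lookup table from states to actions. The sketch below is a minimal illustration; the state names and the actions chosen for them are hypothetical placeholders, not the actual optimal policy for the grid on the slide.

```python
# Minimal sketch of a tabular policy: each state maps to one action.
# State names (s1..s12) and the actions chosen here are purely illustrative.
ACTIONS = ["up", "down", "left", "right"]

policy = {
    "s1": "right", "s2": "right", "s3": "down",
    "s4": "up",    "s5": "right", "s6": "down",
    # ... remaining states would be filled in as the agent learns
}

def act(state):
    """Return the action the current policy prescribes for the given state."""
    return policy.get(state, "up")  # fall back to a default action for unseen states
```

Learning then amounts to improving the entries of this table so that following it maximizes cumulative reward.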
“Just like a child learning to walk, this robot learns balance,
coordination, and control through repeated trial and error.”
No rules. No instructions. Just learning through feedback: fall, try again, adjust.
https://www.youtube.com/shorts/Ll7l2yya-bU?feature=share

Robots opening a door: https://www.youtube.com/watch?v=M-QUkgk3HyE


Stanford: Andrew Ng teaching a helicopter to fly stunts autonomously (2000) – Inverse Reinforcement Learning (IRL)
https://www.youtube.com/shorts/aPthvhfAVio?feature=share

Boston Dynamics is a robotics company known for building highly dynamic, agile, and lifelike robots.

Boston Dynamics – Atlas: https://www.youtube.com/watch?v=LikxFZZO2sk

Their most famous robots include:
Atlas – a humanoid robot that runs, jumps, and does parkour
Spot – a four-legged robot dog
Stretch – a box-handling warehouse robot
What is Reinforcement Learning?
Learn to make a good sequence of decisions
Reinforcement Learning is a paradigm of machine learning where an agent learns to take actions in an environment to
maximize a long-term cumulative reward.
At its core, RL is about:
•Trial-and-error learning
•Sequential decision-making
•Learning from interaction, not from direct supervision

Example analogy: A Child Learning to Crawl


Just like the child:
•Tries different movements (actions)
•Receives feedback (falls down, reaches toy, etc.)
•Learns over time what sequence of movements helps them move forward
(policy)
This captures the essence of RL:
The agent learns from its own experience — not from labeled data, but from
consequences of its actions.

Sequential Decision Making
▪ Series of decisions over time
▪ Decision outcomes may depend on environmental factors
▪ Final goal depends on many interactive decisions and their random consequences
▪ Examples:
§ Traffic signal control
• Objective: Minimize waiting time and congestion.
• RL can learn optimal green/red timing based on live traffic patterns.
• Each signal decision affects the next traffic state → sequential
§ Communication Network Packet Routing
• Objective: Deliver packets with minimal delay.
• Routing decisions affect future network load and delay.
• RL can learn adaptive routing policies depending on congestion and failures.
▪ Autonomous Vehicles
▪ Objective: Safe navigation with time and fuel efficiency
▪ Driving involves constant decision-making:
▪ Speed up, slow down, change lanes, avoid obstacles
▪ These decisions are dependent and sequential.
▪ RL agents learn to optimize safety + speed + fuel efficiency over the long term.
What is RL?
▪ The science of decision making
▪ Discover the sequence of actions through trial and error
▪ Learn optimal behavior through interactions with the environment
▪ Actions receive a reward or penalty from the environment
▪ After many attempts, the robot learns the best path
Reinforcement Learning – How Learning happens
• In Reinforcement Learning (RL), the learning
happens as the agent interacts with the environment
and updates its policy or value estimates based on
the feedback (rewards or penalties) it receives.

• Unlike supervised learning, the agent is not given the correct answer;
instead, it must learn through experience what actions are best.

How Does It Learn Better Over Time?
•In early episodes, the agent tries many actions (exploration).
•It remembers what worked well (via Q-values or policy gradients).
•Gradually it favors better actions (exploitation).
•This leads to an improved policy that achieves higher cumulative reward.

“Reinforcement Learning is like a student trying a maze, hitting dead-ends, and learning which path leads to the exit – that’s learning through experience.”
https://www.youtube.com/shorts/0opKhTKxXX0?feature=share
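One common way to balance the exploration and exploitation described above is an epsilon-greedy rule over learned Q-values. The sketch below is a minimal illustration, assuming a hypothetical Q-table keyed by (state, action) pairs; it is not tied to any particular environment from these slides.

```python
import random

# Minimal epsilon-greedy sketch (illustrative only).
# Q is a hypothetical table: Q[(state, action)] -> estimated value.
ACTIONS = ["up", "down", "left", "right"]

def epsilon_greedy(Q, state, epsilon=0.1):
    """With probability epsilon explore a random action,
    otherwise exploit the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)                               # exploration
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))       # exploitation
```

Annealing epsilon from a large value toward a small one reproduces the pattern described on the slide: lots of exploration in early episodes, mostly exploitation later.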
Type of Learning Problems
1. Supervised Learning
   A. Regression
   B. Classification
2. Unsupervised Learning
   A. Clustering
   B. Association (e.g., if a customer purchased item "A", then recommend item "B")
3. Reinforcement Learning
   A. Reward Based
1. Supervised Learning: learning from labelled data.
2. Unsupervised Learning: discovering patterns in unlabelled data.
3. Reinforcement Learning: learning through interactions with an environment.

Note the feedback arrows:
• Supervised: the error is calculated from known targets.
• RL: the reward is given based on actions and the environment's response.
How is RL different from SL and USL?
▪ There is no supervisor to guide the training
▪ Not required to train with a large (labeled or unlabeled) dataset
▪ Data is provided dynamically, via feedback from the environment with which the agent is interacting
▪ Makes decisions over a sequence of time-steps
▪ Works in dynamic and uncertain environments

• Not learning from labels (as in supervised learning)
• Not learning patterns or clusters (as in unsupervised learning)
• Instead, it is learning through experience and delayed feedback

Comparison of SL, USL and RL

Supervised Learning:
• Labelled data with a target
• External supervision
• Learns patterns in the data and its labels
• Maps input data to known labels
• Model training prior to testing
• E.g., regression and classification problems

Unsupervised Learning:
• Unlabelled data without a target
• No supervision
• Learns to group data
• Finds similar features in the data and understands patterns
• Model training prior to testing
• E.g., association mining and clustering

Reinforcement Learning:
• Input data not predefined: learns from the environment using rewards and penalties
• Computes the best reward to reach the goal from the start state
• Maximizes rewards following a trial-and-error approach
• Model training and testing happen simultaneously
• E.g., reward-based problems, planning, control
A full interaction between Agent and Environment
• S0: the current state observed by the agent.
• A0: the action chosen by the agent in that state.
• R1: the reward received from the environment for taking action A0.
• S1: the resulting next state of the environment.

The tuple (S0, A0, R1, S1) represents one full interaction between the agent and the environment. It continues into a trajectory: S0, A0, R1, S1, A1, R2, S2, ...

Through many such interactions, the agent learns an optimal policy.

Convergence: the agent learns a stable policy that no longer changes significantly with more interaction, and that maximizes expected cumulative reward over time.
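The S, A, R loop above can be written as a few lines of code. The following sketch uses a stub environment; the class name ToyEnv, its methods, and the random placeholder policy are hypothetical and only illustrate the shape of the interaction, not a real environment.

```python
import random

class ToyEnv:
    """Hypothetical stub environment: step() returns (next_state, reward, done)."""
    def reset(self):
        return "s1"
    def step(self, state, action):
        next_state = random.choice(["s1", "s2", "s3"])
        reward = 1.0 if next_state == "s3" else 0.0     # made-up reward signal
        done = next_state == "s3"
        return next_state, reward, done

env = ToyEnv()
state = env.reset()
trajectory = []                  # will hold (S_t, A_t, R_{t+1}, S_{t+1}) tuples
done = False
while not done:
    action = random.choice(["up", "down", "left", "right"])   # placeholder policy
    next_state, reward, done = env.step(state, action)
    trajectory.append((state, action, reward, next_state))
    state = next_state
```

Replacing the placeholder policy with one that is improved from the collected trajectories is exactly what an RL algorithm does.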
Real World Applications of RL
Smart Vehicles
Application: Self-Driving Cars
Problem: Learn to drive safely by observing surroundings and taking
real-time actions (steering, acceleration, braking)
How RL is used:
1.State: Camera/LiDAR inputs, current speed, position
2.Actions: Turn left/right, accelerate, decelerate, brake
3.Reward:
1. +1 for staying in lane
2. –1 for lane deviation or getting too close to a collision
3. –100 for crash
4.Goal: Learn driving policy that avoids crashes and follows traffic
rules
Example:
•Wayve and Tesla Autopilot use variants of deep reinforcement
learning to improve navigation and decision-making.
Games
Application: AlphaGo, Atari Games
Problem: Play games better than humans using RL agents
How RL is used:
1.State: Game screen pixels or board configuration
2.Actions: Legal moves (left, right, place stone, fire)
3.Reward:
1. +1 for winning
2. –1 for losing
3. Intermediate rewards based on game points
4.Goal: Maximize expected game score or win probability
Example:
•AlphaGo (by DeepMind): Defeated world champion in Go using Monte Carlo Tree Search + Policy
Gradients
•Atari (by Deep Q-Network, DQN): RL agents learn to play from pixel input by trial-and-error.
AlphaGo movie (award-winning documentary):
https://www.youtube.com/watch?v=WXuK6gekU1Y
Healthcare
Application: Adaptive Treatment Plans
Problem: Tailor treatment dynamically based on patient response
How RL is used:
1.State: Patient vitals, current condition, previous treatments
2.Actions: Prescribe treatment A, B, or dosage level
3.Reward:
1. +10 for improvement
2. –10 for adverse effect
3. 0 for neutral outcome
4.Goal: Maximize long-term patient outcome
Example:
•RL for HIV therapy: Learn optimal drug scheduling
•Sepsis treatment (Deep RL used to recommend ICU strategies)
Robotics
Application: Navigation & Surveillance
Problem: Teach a robot to move through complex environments or monitor
spaces efficiently
How RL is used:
1.State: Robot’s position, orientation, obstacles nearby (from sensors)
2.Actions: Move forward/backward, rotate, scan
3.Reward:
1. +1 for reaching goal
2. –10 for bumping into walls
3. +0.1 per step towards goal
4.Goal: Learn a navigation strategy (policy) to reach goal with minimal
collisions
Example:
•Boston Dynamics robots can be trained to adjust walking patterns using RL
•Surveillance drones use RL for path planning and persistent monitoring
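As one way such a reward scheme might be encoded, here is a minimal sketch; the function name and arguments are hypothetical, and the numeric values simply mirror the bullets on this slide.

```python
def navigation_reward(reached_goal, hit_wall, moved_towards_goal):
    """Hedged sketch of the navigation reward scheme listed above (illustrative only)."""
    reward = 0.0
    if reached_goal:
        reward += 1.0        # +1 for reaching the goal
    if hit_wall:
        reward -= 10.0       # -10 for bumping into walls
    if moved_towards_goal:
        reward += 0.1        # +0.1 per step towards the goal
    return reward
```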
SmartAd
Application: Personalized Ads
Problem: Show the right ad to the right user at the right
time
How RL is used:
1.State: User profile, browsing behavior, time of day
2.Actions: Select one ad from many
3.Reward:
1. +1 if user clicks
2. 0 if ignored
4.Goal: Learn to display ads that maximize click-
through rate (CTR)
Example:
•YouTube ad recommendation
•Meta/Facebook uses RL to optimize ad ranking and
selection
Finance
Application: Portfolio Optimization
Problem: Decide which stocks to invest in, and how
much
How RL is used:
1.State: Current stock prices, portfolio weights (how much is invested in each asset), technical indicators (e.g., moving average, RSI), macroeconomic variables (e.g., interest rates)
2.Actions: Buy/Sell/Hold decision for each asset
3.Reward:
1. + for portfolio growth
2. – for loss or risk exposure
4.Goal: Maximize return over time, minimize risk
Example:
•Deep RL in trading bots that learn market timing
•JP Morgan and other banks use RL for investment
automation
Chatbots
Application: Dialogue Management in Voice
Assistants

Problem: Engage in helpful, multi-turn conversations


How RL is used:
1.State: Dialogue history, user’s query
2.Actions: Respond with an appropriate sentence or
intent
3.Reward:
1. +1 if user is satisfied or completes task
2. –1 for unhelpful/off-topic response
4.Goal: Learn dialogue policy that keeps the user
engaged and satisfied
Example:
•Siri, Alexa, Google Assistant optimize conversational
strategies using RL
•Conversational agents like Replika use RL to improve
engagement
Elements of
Reinforcement
Learning
Information state
Rat Example – State Representation is critical
In RL, the Markov property says that the state should contain all the information necessary to predict the future given the action.

Choosing the state representation is critical:
• Too small → the agent cannot learn optimal behavior (information loss).
• Too large → computationally infeasible.
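In standard notation, the Markov property states that the next state and reward depend only on the current state and action, not on the earlier history:

```latex
P(S_{t+1}, R_{t+1} \mid S_t, A_t) \;=\; P(S_{t+1}, R_{t+1} \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}, R_t, S_t, A_t)
```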
Fully Observable Environment
Partially Observable Environment
Major Components of an RL agent
Maze example: Value function

Why a discount factor?
• Mathematical stability
• Preference for sooner rewards
• Uncertainty about the future
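Concretely, with a discount factor γ in [0, 1), the return the agent maximizes is the discounted sum of future rewards; values of γ near 0 emphasize sooner rewards, while values near 1 weight the long term:

```latex
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```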
Maze example: Policy
Model Based

What is a "model" in RL?


•In reinforcement learning, a model is the agent’s internal representation of the environment’s dynamics.
•If the agent has a model, it can simulate what will happen next without actually interacting with the real
environment.
•This is the essence of model-based RL:
• Learn or have access to a model.
• Use the model to plan actions and improve the policy.
Two components of the model
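Conventionally these two components are a state-transition model and a reward model. The sketch below shows what a small learned tabular model might look like; the dictionary layout, state names, and probabilities are hypothetical, purely to illustrate "simulating a step without touching the real environment."

```python
import random

# Hedged sketch of a tabular model of the environment's dynamics.
# transition_model[(s, a)] is a distribution over next states;
# reward_model[(s, a)] is the expected immediate reward. All values are made up.
transition_model = {
    ("s1", "right"): {"s2": 0.9, "s1": 0.1},
    ("s2", "down"):  {"s6": 1.0},
}
reward_model = {
    ("s1", "right"): 0.0,
    ("s2", "down"): -1.0,
}

def simulate(state, action):
    """Sample one imagined step from the model (planning without the real environment)."""
    next_states = transition_model[(state, action)]
    s_next = random.choices(list(next_states), weights=list(next_states.values()))[0]
    return s_next, reward_model[(state, action)]
```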
Categorisation of RL
Value Based
• No Policy (implicit)
• Value Function
Policy Based
• Policy
• No Value Function
Actor Critic
• Policy
• Value Function
Model Free
• Policy and/or Value Function
• No Model
Model Based
• Policy and/or Value Function
• Model
RL Agent Taxonomy
Evolution of
Reinforcement
Learning
• Markov Decision Process (MDP)
• Value-based methods: estimating value functions (such as state-values or action-values).
• 1988: Temporal-Difference Learning, the first value-based method that combined the strengths of Monte Carlo and Dynamic Programming; the first online learning method.
• Q-learning learns the quality (Q-value) of state–action pairs: how good it is to take a certain action in a certain state.
• Policy-based methods directly optimize the policy, finding the optimal policy that maximizes cumulative rewards over time.
• The Actor-Critic method is a hybrid of value-based methods and policy-based methods.
• Deep Q-Network (DQN) is a deep reinforcement learning algorithm that combines Q-learning (to learn action values) and deep neural networks (to approximate the Q-function).
• Proximal Policy Optimization (PPO) is a policy gradient algorithm in reinforcement learning.

Courtesy: Medium.com
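To make the Q-learning idea above concrete, its standard one-step update can be written in a few lines. The learning-rate and discount values in this sketch are arbitrary defaults, and only the update rule is shown, not a full training loop.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One-step Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next                  # bootstrapped target
    td_error = td_target - Q.get((s, a), 0.0)          # temporal-difference error
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```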
Evolution of Reinforcement Learning
Pre-1980 – Foundations
1950s–1960s: Development of dynamic programming, Bellman equations, and Markov
Decision Processes — the mathematical backbone of RL
1982: Stevo Bozinovski presents delayed reinforcement learning in the Crossbar
Adaptive Array, an early precursor to later RL algorithms

Late 1980s – Early Algorithms


• 1989: Chris Watkins introduces Q-learning, a model-free, off-policy algorithm that became foundational in RL.
• Early 1990s: Richard Sutton and Andrew Barto advance temporal-difference learning and actor-critic methods, formalizing RL theory.
• 1992–1993: Gerald Tesauro's TD-Gammon, a neural net trained via TD learning, demonstrates superhuman backgammon play using self-play.
Evolution of Reinforcement Learning
Pre-2010 – Foundations of Deep RL
1996: Kaelbling, Littman & Moore publish a major survey, solidifying practical RL approaches and exploration–exploitation theory

Mid-2010s – Deep RL Breakthrough


2013: Mnih et al. publish “Playing Atari with Deep Reinforcement Learning,” introducing
Deep Q-Network (DQN) that learns directly from raw pixel input
2015: Distributed DQN scales via experience replay and parallel learners, achieving
state-of-the-art results across Atari games
2015: Development of Double DQN and improvements in stability and learning
efficiency
Evolution of Reinforcement Learning
Late 2010s – AlphaGo Era
2015–2016: DeepMind’s AlphaGo defeats professional human Go players;
policy-network-guided Monte Carlo Tree Search combines deep learning and planning
2017: AlphaGo Zero and AlphaZero emerge—learning via self-play without human
data and generalizing across board games

2019 Onwards – Model-based and General RL(Deep Mind)


2019: MuZero combines model-based planning and model-free learning, achieving high
performance without knowing game rules explicitly
2020s: RL is adopted in real-world domains, from robotics and energy optimization to alignment of large language models (e.g., RL from Human Feedback).
Evolution of Reinforcement Learning

2025: GRPO (DeepSeek) – GRPO introduced critic-free RL for training LLMs in reasoning. It compared output groups to optimize generation without value networks, opening the path for RL in large-scale language models.

TIC-GRPO, AGPO, Dr. GRPO – Variants like TIC-GRPO added unbiased learning with trajectory correction. AGPO improved sample efficiency and reduced reward hacking. Dr. GRPO reduced bias and improved stability.

Turing Award & Industry Boom – Sutton & Barto received the Turing Award in 2025 for foundational RL work. The RL market grew to $120B+, with use in robotics, supply chains, and language models. RL is now central to AI progress.
Evolution of Reinforcement Learning
2025 – Recognition and Perspective
2025: Andrew Barto and Richard Sutton receive the A.M. Turing Award for their
pioneering contributions in RL theory and practical methods, reinforcing the
importance of RL’s conceptual roots
Lecture Plan
1. Introduction to Reinforcement Learning – Goals, examples, elements (policy,
reward, value function, model), Tic-Tac-Toe example, limitations, scope.
2. Multi-armed Bandits – n-armed bandit problem, action-value methods,
incremental implementation, nonstationary problems, gradient bandits,
associative search.
3. Agent–Environment Interaction – Agent, environment, state, action, goals,
rewards, pole-balancing, Markov property, Markov decision processes.
4. Value Functions & Bellman Equations – Optimal value functions, action-value
functions, Bellman equations.
5. Dynamic Programming – Policy evaluation, policy improvement, policy
iteration, value iteration, asynchronous dynamic programming.
6. Monte Carlo Methods – Monte Carlo prediction, Monte Carlo control.
7. Temporal-Difference Learning – SARSA, Q-learning, applications in RL
problems.
Course outcomes
Textbook(s)
'Reinforcement Learning: An Introduction', Richard S. Sutton and Andrew G. Barto, Second Edition, MIT Press, 2018

Reference(s)
Powell, Warren B. Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions. United States, Wiley, 2022.
Namah Shivaya
