Introduction - Week 1

Introduction to RL

Reinforcement Learning is exactly this:
“Learning by interacting with the environment.”
The Two Core Components: Agent and Environment
1. Environment
•In this example, the grid is the environment.
•Each cell s1, s2, …, s12 is a state.
•The environment provides feedback to the agent after
every move in the form of a reward.

2. Agent
•The little robot is the agent.
•It can take actions: up, down, left, right.
•Its goal is to learn which sequence of moves leads it to
good outcomes (i.e., the diamond) and avoids bad
ones (i.e., the fire).

The agent's goal is to learn a policy — a mapping from each state to the best action —
that maximizes total cumulative reward over time.
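To make this concrete, a policy over a small, discrete state space like the 12-cell grid can be stored as a simple lookup table from states to actions. The sketch below is a minimal illustration; the state names and the actions chosen for them are hypothetical placeholders, not the actual optimal policy for the grid on the slide.

```python
# Minimal sketch of a tabular policy: each state maps to one action.
# State names (s1..s12) and the actions chosen here are purely illustrative.
ACTIONS = ["up", "down", "left", "right"]

policy = {
    "s1": "right", "s2": "right", "s3": "down",
    "s4": "up",    "s5": "right", "s6": "down",
    # ... remaining states would be filled in as the agent learns
}

def act(state):
    """Return the action the current policy prescribes for the given state."""
    return policy.get(state, "up")  # fall back to a default action for unseen states
```

Learning then amounts to improving the entries of this table so that following it maximizes cumulative reward.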
“Just like a child learning to walk, this robot learns balance,
coordination, and control through repeated trial and error.”
No rules. No instructions. Just learning through feedback: fall, try again, adjust.
https://www.youtube.com/shorts/Ll7l2yya-bU?feature=share

Robots opening a door: https://www.youtube.com/watch?v=M-QUkgk3HyE


Stanford: Andrew Ng teaching a helicopter to fly stunts autonomously (2000) – Inverse Reinforcement Learning (IRL)
https://www.youtube.com/shorts/aPthvhfAVio?feature=share

Boston Dynamics is a robotics company known for building highly dynamic, agile, and lifelike robots.

Boston Dynamics – Atlas: https://www.youtube.com/watch?v=LikxFZZO2sk

Their most famous robots include:
Atlas – a humanoid robot that runs, jumps, and does parkour
Spot – a four-legged robot dog
Stretch – a box-handling warehouse robot
What is Reinforcement Learning?
Learn to make a good sequence of decisions
Reinforcement Learning is a paradigm of machine learning where an agent learns to take actions in an environment to
maximize a long-term cumulative reward.
At its core, RL is about:
•Trial-and-error learning
•Sequential decision-making
•Learning from interaction, not from direct supervision

Example analogy: A Child Learning to Crawl


Just like the child:
•Tries different movements (actions)
•Receives feedback (falls down, reaches toy, etc.)
•Learns over time what sequence of movements helps them move forward
(policy)
This captures the essence of RL:
The agent learns from its own experience — not from labeled data, but from
consequences of its actions.

Sequential Decision Making
▪ Series of decisions over time
▪ Decision outcomes may depend on environmental factors
▪ Final goal depends on many interactive decisions and their random consequences
▪ Examples:
§ Traffic signal control
• Objective: Minimize waiting time and congestion.
• RL can learn optimal green/red timing based on live traffic patterns.
• Each signal decision affects the next traffic state → sequential
§ Communication Network Packet Routing
• Objective: Deliver packets with minimal delay.
• Routing decisions affect future network load and delay.
• RL can learn adaptive routing policies depending on congestion and failures.
▪ Autonomous Vehicles
▪ Objective: Safe navigation with time and fuel efficiency
▪ Driving involves constant decision-making:
▪ Speed up, slow down, change lanes, avoid obstacles
▪ These decisions are dependent and sequential.
▪ RL agents learn to optimize safety + speed + fuel efficiency over the long term.
What is RL?
▪ The science of decision making
▪ Discover the sequence of actions through trial and error
▪ Learn optimal behavior through interactions with the environment
▪ Actions receive a reward or penalty from the environment
▪ After many attempts, the robot learns the best path
Reinforcement Learning – How Learning happens
• In Reinforcement Learning (RL), the learning
happens as the agent interacts with the environment
and updates its policy or value estimates based on
the feedback (rewards or penalties) it receives.

• Unlike supervised learning, the agent is not given the correct answer;
instead, it must learn through experience what actions are best.

How Does It Learn Better Over Time?
•In early episodes, the agent tries many actions (exploration).
•It remembers what worked well (via Q-values or policy gradients).
•Gradually it favors better actions (exploitation).
•This leads to an improved policy that achieves higher cumulative reward.

“Reinforcement Learning is like a student trying a maze, hitting dead-ends, and learning which path leads to the exit – that’s learning through experience.”
https://www.youtube.com/shorts/0opKhTKxXX0?feature=share
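One common way to balance the exploration and exploitation described above is an epsilon-greedy rule over learned Q-values. The sketch below is a minimal illustration, assuming a hypothetical Q-table keyed by (state, action) pairs; it is not tied to any particular environment from these slides.

```python
import random

# Minimal epsilon-greedy sketch (illustrative only).
# Q is a hypothetical table: Q[(state, action)] -> estimated value.
ACTIONS = ["up", "down", "left", "right"]

def epsilon_greedy(Q, state, epsilon=0.1):
    """With probability epsilon explore a random action,
    otherwise exploit the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)                               # exploration
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))       # exploitation
```

Annealing epsilon from a large value toward a small one reproduces the pattern described on the slide: lots of exploration in early episodes, mostly exploitation later.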
Type of Learning Problems
1. Supervised Learning
   A. Regression
   B. Classification
2. Unsupervised Learning
   A. Clustering
   B. Association (e.g., if a customer purchased item "A", then recommend item "B")
3. Reinforcement Learning
   A. Reward Based
1. Supervised Learning: learning from labelled data.
2. Unsupervised Learning: discovering patterns in unlabelled data.
3. Reinforcement Learning: learning through interactions with an environment.

Note the feedback arrows:
• Supervised: the error is calculated from known targets.
• RL: the reward is given based on actions and the environment's response.
How is RL different from SL and USL?
▪ There is no supervisor to guide the training
▪ Not required to train with a large (labeled or unlabeled) dataset
▪ Data is provided dynamically, via feedback from the environment with which the agent is interacting
▪ Makes decisions over a sequence of time-steps
▪ Works in dynamic and uncertain environments

• Not learning from labels (as in supervised learning)
• Not learning patterns or clusters (as in unsupervised learning)
• Instead, it is learning through experience and delayed feedback

Comparison of SL, USL and RL

Supervised Learning:
• Labelled data with a target
• External supervision
• Learns patterns in the data and its labels
• Maps input data to known labels
• Model training prior to testing
• E.g., regression and classification problems

Unsupervised Learning:
• Unlabelled data without a target
• No supervision
• Learns to group data
• Finds similar features in the data and understands patterns
• Model training prior to testing
• E.g., association mining and clustering

Reinforcement Learning:
• Input data not predefined: learns from the environment using rewards and penalties
• Computes the best reward to reach the goal from the start state
• Maximizes rewards following a trial-and-error approach
• Model training and testing happen simultaneously
• E.g., reward-based problems, planning, control
A full interaction between Agent and Environment
• S0: the current state observed by the agent.
• A0: the action chosen by the agent in that state.
• R1: the reward received from the environment for taking action A0.
• S1: the resulting next state of the environment.

The tuple (S0, A0, R1, S1) represents one full interaction between the agent and the environment. It continues into a trajectory: S0, A0, R1, S1, A1, R2, S2, ...

Through many such interactions, the agent learns an optimal policy.

Convergence: the agent learns a stable policy that no longer changes significantly with more interaction, and that maximizes expected cumulative reward over time.
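The S, A, R loop above can be written as a few lines of code. The following sketch uses a stub environment; the class name ToyEnv, its methods, and the random placeholder policy are hypothetical and only illustrate the shape of the interaction, not a real environment.

```python
import random

class ToyEnv:
    """Hypothetical stub environment: step() returns (next_state, reward, done)."""
    def reset(self):
        return "s1"
    def step(self, state, action):
        next_state = random.choice(["s1", "s2", "s3"])
        reward = 1.0 if next_state == "s3" else 0.0     # made-up reward signal
        done = next_state == "s3"
        return next_state, reward, done

env = ToyEnv()
state = env.reset()
trajectory = []                  # will hold (S_t, A_t, R_{t+1}, S_{t+1}) tuples
done = False
while not done:
    action = random.choice(["up", "down", "left", "right"])   # placeholder policy
    next_state, reward, done = env.step(state, action)
    trajectory.append((state, action, reward, next_state))
    state = next_state
```

Replacing the placeholder policy with one that is improved from the collected trajectories is exactly what an RL algorithm does.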
Real World Applications of RL
Smart Vehicles
Application: Self-Driving Cars
Problem: Learn to drive safely by observing surroundings and taking
real-time actions (steering, acceleration, braking)
How RL is used:
1.State: Camera/LiDAR inputs, current speed, position
2.Actions: Turn left/right, accelerate, decelerate, brake
3.Reward:
1. +1 for staying in lane
2. –1 for lane deviation or getting too close to a collision
3. –100 for crash
4.Goal: Learn driving policy that avoids crashes and follows traffic
rules
Example:
•Wayve and Tesla Autopilot use variants of deep reinforcement
learning to improve navigation and decision-making.
Games
Application: AlphaGo, Atari Games
Problem: Play games better than humans using RL agents
How RL is used:
1.State: Game screen pixels or board configuration
2.Actions: Legal moves (left, right, place stone, fire)
3.Reward:
1. +1 for winning
2. –1 for losing
3. Intermediate rewards based on game points
4.Goal: Maximize expected game score or win probability
Example:
•AlphaGo (by DeepMind): Defeated world champion in Go using Monte Carlo Tree Search + Policy
Gradients
•Atari (by Deep Q-Network, DQN): RL agents learn to play from pixel input by trial-and-error.
AlphaGo movie (award-winning documentary):
https://www.youtube.com/watch?v=WXuK6gekU1Y
Healthcare
Application: Adaptive Treatment Plans
Problem: Tailor treatment dynamically based on patient response
How RL is used:
1.State: Patient vitals, current condition, previous treatments
2.Actions: Prescribe treatment A, B, or dosage level
3.Reward:
1. +10 for improvement
2. –10 for adverse effect
3. 0 for neutral outcome
4.Goal: Maximize long-term patient outcome
Example:
•RL for HIV therapy: Learn optimal drug scheduling
•Sepsis treatment (Deep RL used to recommend ICU strategies)
Robotics
Application: Navigation & Surveillance
Problem: Teach a robot to move through complex environments or monitor
spaces efficiently
How RL is used:
1.State: Robot’s position, orientation, obstacles nearby (from sensors)
2.Actions: Move forward/backward, rotate, scan
3.Reward:
1. +1 for reaching goal
2. –10 for bumping into walls
3. +0.1 per step towards goal
4.Goal: Learn a navigation strategy (policy) to reach goal with minimal
collisions
Example:
•Boston Dynamics robots can be trained to adjust walking patterns using RL
•Surveillance drones use RL for path planning and persistent monitoring
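As one way such a reward scheme might be encoded, here is a minimal sketch; the function name and arguments are hypothetical, and the numeric values simply mirror the bullets on this slide.

```python
def navigation_reward(reached_goal, hit_wall, moved_towards_goal):
    """Hedged sketch of the navigation reward scheme listed above (illustrative only)."""
    reward = 0.0
    if reached_goal:
        reward += 1.0        # +1 for reaching the goal
    if hit_wall:
        reward -= 10.0       # -10 for bumping into walls
    if moved_towards_goal:
        reward += 0.1        # +0.1 per step towards the goal
    return reward
```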
SmartAd
Application: Personalized Ads
Problem: Show the right ad to the right user at the right
time
How RL is used:
1.State: User profile, browsing behavior, time of day
2.Actions: Select one ad from many
3.Reward:
1. +1 if user clicks
2. 0 if ignored
4.Goal: Learn to display ads that maximize click-
through rate (CTR)
Example:
•YouTube ad recommendation
•Meta/Facebook uses RL to optimize ad ranking and
selection
Finance
Application: Portfolio Optimization
Problem: Decide which stocks to invest in, and how
much
How RL is used:
1.State: Current stock prices, portfolio weights (how much is invested in each asset), technical indicators (e.g., moving average, RSI), macroeconomic variables (e.g., interest rates)
2.Actions: Buy/Sell/Hold decision for each asset
3.Reward:
1. + for portfolio growth
2. – for loss or risk exposure
4.Goal: Maximize return over time, minimize risk
Example:
•Deep RL in trading bots that learn market timing
•JP Morgan and other banks use RL for investment
automation
Chatbots
Application: Dialogue Management in Voice
Assistants

Problem: Engage in helpful, multi-turn conversations


How RL is used:
1.State: Dialogue history, user’s query
2.Actions: Respond with an appropriate sentence or
intent
3.Reward:
1. +1 if user is satisfied or completes task
2. –1 for unhelpful/off-topic response
4.Goal: Learn dialogue policy that keeps the user
engaged and satisfied
Example:
•Siri, Alexa, Google Assistant optimize conversational
strategies using RL
•Conversational agents like Replika use RL to improve
engagement
Elements of
Reinforcement
Learning
Information state
Rat Example – State Representation is critical
In RL, the Markov property says that the state should contain all the information necessary to predict the future given the action.

Choosing the state representation is critical:
• Too small → the agent cannot learn optimal behavior (information loss).
• Too large → computationally infeasible.
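In standard notation, the Markov property states that the next state and reward depend only on the current state and action, not on the earlier history:

```latex
P(S_{t+1}, R_{t+1} \mid S_t, A_t) \;=\; P(S_{t+1}, R_{t+1} \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}, R_t, S_t, A_t)
```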
Fully Observable Environment
Partially Observable Environment
Major Components of an RL agent
Maze example: Value function

Why a discount factor?
• Mathematical stability
• Preference for sooner rewards
• Uncertainty about the future
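Concretely, with a discount factor γ in [0, 1), the return the agent maximizes is the discounted sum of future rewards; values of γ near 0 emphasize sooner rewards, while values near 1 weight the long term:

```latex
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```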
Maze example: Policy
Model Based

What is a "model" in RL?


•In reinforcement learning, a model is the agent’s internal representation of the environment’s dynamics.
•If the agent has a model, it can simulate what will happen next without actually interacting with the real
environment.
•This is the essence of model-based RL:
• Learn or have access to a model.
• Use the model to plan actions and improve the policy.
Two components of the model
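Conventionally these two components are a state-transition model and a reward model. The sketch below shows what a small learned tabular model might look like; the dictionary layout, state names, and probabilities are hypothetical, purely to illustrate "simulating a step without touching the real environment."

```python
import random

# Hedged sketch of a tabular model of the environment's dynamics.
# transition_model[(s, a)] is a distribution over next states;
# reward_model[(s, a)] is the expected immediate reward. All values are made up.
transition_model = {
    ("s1", "right"): {"s2": 0.9, "s1": 0.1},
    ("s2", "down"):  {"s6": 1.0},
}
reward_model = {
    ("s1", "right"): 0.0,
    ("s2", "down"): -1.0,
}

def simulate(state, action):
    """Sample one imagined step from the model (planning without the real environment)."""
    next_states = transition_model[(state, action)]
    s_next = random.choices(list(next_states), weights=list(next_states.values()))[0]
    return s_next, reward_model[(state, action)]
```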
Categorisation of RL
Value Based
• No Policy (implicit)
• Value Function
Policy Based
• Policy
• No Value Function
Actor Critic
• Policy
• Value Function
Model Free
• Policy and/or Value Function
• No Model
Model Based
• Policy and/or Value Function
• Model
RL Agent Taxonomy
Evolution of
Reinforcement
Learning
• Markov Decision Process (MDP)
• Value-based methods: estimating value functions (such as state-values or action-values).
• 1988: Temporal-Difference Learning, the first value-based method that combined the strengths of Monte Carlo and Dynamic Programming; the first online learning method.
• Q-learning learns the quality (Q-value) of state–action pairs: how good it is to take a certain action in a certain state.
• Policy-based methods directly optimize the policy, finding the optimal policy that maximizes cumulative rewards over time.
• The Actor-Critic method is a hybrid of value-based methods and policy-based methods.
• Deep Q-Network (DQN) is a deep reinforcement learning algorithm that combines Q-learning (to learn action values) and deep neural networks (to approximate the Q-function).
• Proximal Policy Optimization (PPO) is a policy gradient algorithm in reinforcement learning.

Courtesy: Medium.com
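To make the Q-learning idea above concrete, its standard one-step update can be written in a few lines. The learning-rate and discount values in this sketch are arbitrary defaults, and only the update rule is shown, not a full training loop.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One-step Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next                  # bootstrapped target
    td_error = td_target - Q.get((s, a), 0.0)          # temporal-difference error
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```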
Evolution of Reinforcement Learning
Pre-1980 – Foundations
1950s–1960s: Development of dynamic programming, Bellman equations, and Markov
Decision Processes — the mathematical backbone of RL
1982: Stevo Bozinovski presents delayed reinforcement learning in the Crossbar
Adaptive Array, an early precursor to later RL algorithms

Late 1980s – Early Algorithms


• 1989: Chris Watkins introduces Q-learning, a model-free, off-policy algorithm that became foundational in RL.
• Early 1990s: Richard Sutton and Andrew Barto advance temporal-difference learning and actor-critic methods, formalizing RL theory.
• 1992–1993: Gerald Tesauro's TD-Gammon, a neural net trained via TD learning, demonstrates superhuman backgammon play using self-play.
Evolution of Reinforcement Learning
Pre-2010 – Foundations of Deep RL
1996: Kaelbling, Littman & Moore publish a major survey, solidifying practical RL approaches and exploration–exploitation theory

Mid-2010s – Deep RL Breakthrough


2013: Mnih et al. publish “Playing Atari with Deep Reinforcement Learning,” introducing
Deep Q-Network (DQN) that learns directly from raw pixel input
2015: Distributed DQN scales via experience replay and parallel learners, achieving
state-of-the-art results across Atari games
2015: Development of Double DQN and improvements in stability and learning
efficiency
Evolution of Reinforcement Learning
Late 2010s – AlphaGo Era
2015–2016: DeepMind’s AlphaGo defeats professional human Go players;
policy-network-guided Monte Carlo Tree Search combines deep learning and planning
2017: AlphaGo Zero and AlphaZero emerge—learning via self-play without human
data and generalizing across board games

2019 Onwards – Model-based and General RL(Deep Mind)


2019: MuZero combines model-based planning and model-free learning, achieving high
performance without knowing game rules explicitly
2020s: RL is adopted in real-world domains, from robotics and energy optimization to alignment of large language models (e.g., RL from Human Feedback).
Evolution of Reinforcement Learning

2025: GRPO (DeepSeek) – GRPO introduced critic-free RL for training LLMs in reasoning. It compared output groups to optimize generation without value networks, opening the path for RL in large-scale language models.

TIC-GRPO, AGPO, Dr. GRPO – Variants like TIC-GRPO added unbiased learning with trajectory correction. AGPO improved sample efficiency and reduced reward hacking. Dr. GRPO reduced bias and improved stability.

Turing Award & Industry Boom – Sutton & Barto received the Turing Award in 2025 for foundational RL work. The RL market grew to $120B+, with use in robotics, supply chains, and language models. RL is now central to AI progress.
Evolution of Reinforcement Learning
2025 – Recognition and Perspective
2025: Andrew Barto and Richard Sutton receive the A.M. Turing Award for their
pioneering contributions in RL theory and practical methods, reinforcing the
importance of RL’s conceptual roots
Lecture Plan
1. Introduction to Reinforcement Learning – Goals, examples, elements (policy,
reward, value function, model), Tic-Tac-Toe example, limitations, scope.
2. Multi-armed Bandits – n-armed bandit problem, action-value methods,
incremental implementation, nonstationary problems, gradient bandits,
associative search.
3. Agent–Environment Interaction – Agent, environment, state, action, goals,
rewards, pole-balancing, Markov property, Markov decision processes.
4. Value Functions & Bellman Equations – Optimal value functions, action-value
functions, Bellman equations.
5. Dynamic Programming – Policy evaluation, policy improvement, policy
iteration, value iteration, asynchronous dynamic programming.
6. Monte Carlo Methods – Monte Carlo prediction, Monte Carlo control.
7. Temporal-Difference Learning – SARSA, Q-learning, applications in RL
problems.
Course outcomes
Textbook(s)
'Reinforcement Learning: An Introduction', Richard S. Sutton and Andrew G. Barto, Second Edition, MIT Press, 2018

Reference(s)
Powell, Warren B. Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions. United States, Wiley, 2022.
Namah Shivaya
