Lecture 1 - Introduction

Reinforcement Learning

Introduction

Reinforcement learning

Natnael Argaw

1
Can Machines Outsmart Humans?
► In 2016, AlphaGo, an AI trained with reinforcement learning,
stunned the world by defeating Lee Sedol, a Go world champion.
► Imagine:
► A machine teaching itself to play a game with roughly 10^170 possible board positions, more than the number of atoms in the observable universe!
► How?
► By trial and error, learning from rewards, like a child
mastering a puzzle.
► Question: What if machines could learn any complex task this
way?

2
How did we get to RL?
► First, automation of repeated physical solutions
► Industrial revolution (1750 - 1850) and Machine Age (1870 - 1940)
► Second, automation of repeated mental solutions
► Digital revolution (1950 - now) and Information Age
► Next step: allow machines to find solutions themselves
► Artificial Intelligence
► We then only need to specify a problem and/or goal
► This requires learning autonomously how to make decisions

3
What is reinforcement learning?
► Definition: Reinforcement Learning (RL) is a type
of AI where an agent learns to make decisions by
interacting with an environment to achieve a goal.
► Key Idea: The agent tries actions, receives
rewards (or penalties), and improves over time.
► Example: A robot vacuum learning to clean a
room—gets +1 for picking up dirt, -1 for bumping
into walls.
► Core Loop:
► Observe → Act → Receive Reward →
Learn → Repeat

4
The interaction loop

Goal: optimise sum of rewards, through repeated interaction
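
This loop translates directly into code. Below is a minimal sketch of one episode, assuming a hypothetical environment object with reset/step methods and an agent with act/learn methods (illustrative names, not tied to any particular library):

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of the interaction loop: observe, act, receive reward, learn, repeat."""
    obs = env.reset()                               # initial observation O_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(obs)                     # agent selects A_t from its policy
        next_obs, reward, done = env.step(action)   # environment returns R_{t+1}, O_{t+1}
        agent.learn(obs, action, reward, next_obs)  # agent improves from the feedback
        total_reward += reward                      # accumulate the sum of rewards
        obs = next_obs
        if done:
            break
    return total_reward
```

Running many such episodes and driving total_reward up is the whole objective.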


5
Motivation

6
Lecture by Natnael Argaw (PhD). Slides © Hado van Hasselt.
Can machines think?

– Alan Turing, 1950

7
In the process of trying to imitate an adult human mind we are bound to think a good deal about the
process which has brought it to the state that it is in. We may notice three components,
a. The initial state of the mind, say at birth,
b. The education to which it has been subjected,
c. Other experience, not to be described as education, to which it has been subjected.
Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce
one which simulates the child’s? If this were then subjected to an appropriate course of education one
would obtain the adult brain. Presumably the child-brain is something like a note-book as one buys it
from the stationers. Rather little mechanism, and lots of blank sheets. (Mechanism and writing are
from our point of view almost synonymous.) Our hope is that there is so little mechanism in the
child-brain that something like it can be easily programmed.

– Alan Turing, 1950

8
Intelligence is Reinforcement Learning

► All intelligent agents, humans and animals alike, learn by interacting with their environment.
► Why RL Defines Us:
► Active Learning: Unlike passive methods, we explore and act to
discover what works.
► Sequential Decisions: Our choices shape future possibilities, like
steps in a lifelong journey.
► Goal-Driven: We pursue objectives, from survival to mastering
skills, without needing perfect examples.
► Reward Optimization: We learn by chasing rewards—food, joy,
success—guided by trial and error.
► RL isn’t just AI—it’s the blueprint of intelligence itself.

9
The reward hypothesis

Reinforcement learning is based on the reward hypothesis:

Any goal can be formalized as the outcome of maximizing a cumulative reward

10
Examples of RL problems
► Fly a helicopter → Reward: air time, inverse distance, ...
► Manage an investment portfolio → Reward: gains, gains minus risk, ...
► Control a power station → Reward: efficiency, ...

► Make a robot walk → Reward: distance, speed, ...


► Play video or board games → Reward: win, maximise score, ...

If the goal is to learn via interaction, these are all reinforcement learning problems
(Irrespective of which solution you use)

11
How RL Stands Out
► RL differs from supervised and unsupervised learning:
○ Supervised Learning:
■ Uses labeled data (e.g., photos tagged as “cat” or “dog”).
■ Goal: Predict correct labels. Example: email spam filtering.
○ Unsupervised Learning:
■ Finds patterns in unlabeled data (e.g., grouping customers by
behavior).
■ Goal: Discover hidden structures/patterns. Example: movie recommender systems.
○ Reinforcement Learning:
■ Learns from trial-and-error with rewards, no labeled data.
■ Goal: Maximize cumulative reward.
■ Example: A game-playing AI learning to win by experimenting.
12
Example: Atari
Early Paper: Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013)

13
Formalising the RL Problem

14
Agent and Environment

► At each step t the agent:


► Receives observation Ot (and reward Rt )
► Executes action At
► The environment:
► Receives action At
► Emits observation Ot+1 (and reward Rt+1)
15
Rewards
► A reward Rt is a scalar feedback signal
► Indicates how well agent is doing at step t — defines the goal
► The agent’s job is to maximize cumulative reward

Gt = Rt+1 + Rt+2 + Rt+3 + ...

► We call this the return


Reinforcement learning is based on the reward hypothesis:

Any goal can be formalized as the outcome of maximizing a cumulative reward
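
As a small worked illustration of the return Gt defined above, here is a sketch that sums a recorded list of future rewards, with an optional discount γ (γ = 1 gives the undiscounted sum on this slide):

```python
def compute_return(rewards, gamma=1.0):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a list of future rewards."""
    g = 0.0
    for reward in reversed(rewards):   # work backwards, using G_t = R_{t+1} + gamma * G_{t+1}
        g = reward + gamma * g
    return g

# Example: three steps costing -1 each, then a final +10
print(compute_return([-1, -1, -1, 10]))        # 7.0 (undiscounted)
print(compute_return([-1, -1, -1, 10], 0.9))   # 4.58 with discount 0.9
```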

16
Values
► We call the expected cumulative reward, from a state s, the value

v(s) = E[Gt | St = s]
     = E[Rt+1 + Rt+2 + Rt+3 + ... | St = s]

► The value depends on the actions the agent takes


► Goal is to maximize value, by picking suitable actions
► Rewards and values define the utility of states and actions (no supervised feedback)
► Returns and values can be defined recursively

Gt = Rt+1 + Gt+1

v(s) = E[Rt+1 + v(St+1) | St = s]   (Bellman equation)


17
Maximising value by taking actions
► Goal: select actions to maximise value
► Actions may have long term consequences
► Reward may be delayed
► It may be better to sacrifice immediate reward to gain more long-term reward
► Examples:
► Refueling a helicopter (might prevent a crash in several hours)
► Defensive moves in a game (may help chances of winning later)
► Learning a new skill (can be costly & time-consuming at first)
► A mapping from states to actions is called a policy

18
Action values

► It is also possible to condition the value on actions:

q(s, a) = E[Gt | St = s, At = a]
        = E[Rt+1 + Rt+2 + Rt+3 + ... | St = s, At = a]

► We will talk in depth about state and action values later
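
One simple (Monte Carlo) way to estimate q(s, a) is to average the returns observed after taking action a in state s. A minimal tabular sketch, assuming episodes are recorded as lists of (state, action, reward) triples; this is only an illustration, not the sole way to learn action values:

```python
from collections import defaultdict

def estimate_action_values(episodes, gamma=1.0):
    """Estimate q(s, a) by averaging the sampled returns that followed each (s, a) pair."""
    totals = defaultdict(float)   # sum of returns observed for each (s, a)
    counts = defaultdict(int)     # number of returns observed for each (s, a)
    for episode in episodes:      # episode = [(s0, a0, r1), (s1, a1, r2), ...]
        g = 0.0
        for state, action, reward in reversed(episode):
            g = reward + gamma * g          # return that followed this (state, action)
            totals[(state, action)] += g
            counts[(state, action)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}
```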

19
Core concepts
The reinforcement learning formalism includes
► Environment (dynamics of the problem)
► Reward signal (specifies the goal)
► Agent, containing:
► Agent state
► Policy
► Value function estimate?
► Model?
► We will now go into the agent

20
Inside the Agent: the Agent State

21
Agent components

Agent components
► Agent state
► Policy
► Value functions
► Model

22
Environment State

► The environment state is the environment’s internal state
► It is usually invisible to the agent
► Even if it is visible, it may contain lots of
irrelevant information

23
Agent State

► The history is the full sequence of observations, actions, rewards

Ht = O0, A0, R1, O1, ..., Ot−1, At−1, Rt , Ot

► For instance, the sensorimotor stream of a robot


► This history is used to construct the agent state St

24
Fully Observable Environments
Full observability
Suppose the agent sees the full environment state
► observation = environment state
► The agent state could just be this observation:

St = Ot = environment state

25
Markov decision processes
Markov decision processes (MDPs) are a useful mathematical framework
Definition
A decision process is Markov if

p(r, s | St, At) = p(r, s | Ht, At)

► This means that the state contains all we need to know from the history
► Doesn’t mean it contains everything, just that adding more history doesn’t help

⇒ Once the state is known, the history may be thrown away
► The full environment + agent state is Markov (but large)
► The full history Ht is Markov (but keeps growing)
► Typically, the agent state St is some compression of Ht
► Note: we use St to denote the agent state, not the environment state
26
Partially Observable Environments

► Partial observability: The observations are not Markovian


► A robot with camera vision isn’t told its absolute location
► A poker playing agent only observes public cards
► Now using the observation as state would not be Markovian
► This is called a partially observable Markov decision process (POMDP)
► The environment state can still be Markov, but the agent does not know it
► We might still be able to construct a Markov agent state

27
Agent State
► The agent’s actions depend on its state
► The agent state is a function of the history
► For instance, St = Ot
► More generally:

St+1 = u(St, At, Rt+1, Ot+1)

where u is a 'state update function' (a short code sketch follows this slide)


► The agent state is often much smaller than the
environment state
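
As a concrete illustration of a state update function u, the sketch below keeps only the k most recent observations as the agent state (a common, simple choice; the function and parameter names are made up for this example):

```python
def update_state(state, action, reward, observation, k=4):
    """A simple state update u: the agent state is the tuple of the k most recent observations."""
    # This particular u ignores action and reward; richer updates could use them as well.
    return (state + (observation,))[-k:]

# Usage: fold each new observation into the agent state
state = ()
for obs in ["o1", "o2", "o3", "o4", "o5"]:
    state = update_state(state, action=None, reward=0.0, observation=obs)
print(state)   # ('o2', 'o3', 'o4', 'o5')
```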

28
Agent State
The full environment state of a maze

29
Agent State
A potential observation

30
Agent State
An observation in a different location

31
Agent State
The two observations are indistinguishable

32
Agent State
These two states are not Markov

How could you construct a Markov agent state in this maze (for any reward signal)?
33
Partially Observable Environments

► To deal with partial observability, agent can construct suitable state representations
► Examples of agent states:
► Last observation: St = Ot (might not be enough)
► Complete history: St = Ht (might be too large)
► A generic update: St = u(St−1, At−1, Rt , Ot ) (but how to pick/learn u?)
► Constructing a fully Markovian agent state is often not feasible
► More importantly, the state should allow good policies and value predictions

34
Inside the Agent: the Policy

35
Agent components

Agent components
► Agent state
► Policy
► Value function
► Model

36
Policy

► A policy defines the agent’s behaviour


► It is a map from agent state to action
► Deterministic policy: A = π(S)
► Stochastic policy: π(A|S) = p(A|S)
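
In code, a deterministic policy is a plain mapping from state to action, while a stochastic policy defines a distribution over actions for each state and samples from it. A small tabular sketch (the states, actions, and probabilities are placeholders):

```python
import random

# Deterministic policy: A = π(S), a direct lookup from state to action
deterministic_policy = {"s1": "north", "s2": "east"}
action = deterministic_policy["s1"]

# Stochastic policy: π(A|S), a probability distribution over actions in each state
stochastic_policy = {
    "s1": {"north": 0.7, "east": 0.3},
    "s2": {"north": 0.1, "east": 0.9},
}

def sample_action(policy, state):
    """Draw an action A ~ π(·|S = state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action(stochastic_policy, "s1")
```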

37
Inside the Agent: Value
Estimates

38
Agent components

Agent components
► Agent state
► Policy
► Value function
► Model

39
Value Function
► The actual value function is the expected return

vπ(s) = E[Gt | St = s, π]
      = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s, π]

► We introduced a discount factor γ ∈ [0, 1]


► Trades off importance of immediate vs long-term rewards
► The value depends on a policy
► Can be used to evaluate the desirability of states
► Can be used to select between actions

40
Value Functions
► The return has a recursive form: Gt = Rt+1 + γGt+1
► Therefore, the value does as well:

vπ(s) = E[Rt+1 + γGt+1 | St = s, At ∼ π(s)]
      = E[Rt+1 + γvπ(St+1) | St = s, At ∼ π(s)]

► Here a ∼ π(s) means a is chosen by policy π in state s (even if π is deterministic)
► This is known as a Bellman equation (Bellman, 1957)
► A similar equation holds for the optimal (= highest possible) value:

v∗(s) = max_a E[Rt+1 + γv∗(St+1) | St = s, At = a]

► This does not depend on a policy: it considers all available policies.
► We heavily exploit such equalities, and use them to create algorithms
41
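
To show how such an equality becomes an algorithm, here is a small sketch of iterative policy evaluation on a tiny made-up MDP: each sweep replaces v(s) by the right-hand side of the Bellman equation for a fixed policy (all states, actions, transition probabilities, and rewards below are illustrative):

```python
# Tiny illustrative MDP: transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(0.9, "s1", 1.0), (0.1, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
policy = {"s0": "go", "s1": "stay"}   # the fixed policy π we want to evaluate
gamma = 0.9

def evaluate_policy(transitions, policy, gamma, sweeps=100):
    """Iterative policy evaluation: v(s) <- E[R_{t+1} + gamma * v(S_{t+1}) | S_t = s, A_t = π(s)]."""
    v = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        v = {
            s: sum(p * (r + gamma * v[s_next])
                   for p, s_next, r in transitions[s][policy[s]])
            for s in transitions
        }
    return v

print(evaluate_policy(transitions, policy, gamma))   # converges to vπ for this MDP
```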
Why Value Functions Matter

► Why Value Functions Matter:


○ Agents estimate value to predict future rewards—like a treasure map for
decisions.
○ In big worlds (games, robotics), agents approximate values rather than computing them exactly.
○ Smart algorithms refine these approximations efficiently through experience.
○ Impact:
■ Accurate values → Optimal actions (best path to the goal).
■ Good approximations → Strong performance, even in massive domains
like video games.
○ Value functions guide agents to success, no matter the challenge’s size.

42
Inside the Agent: Models

43
Agent components

Agent components
► Agent state
► Policy
► Value function
► Model

44
Model
► A model predicts what the environment will do next
► E.g., P predicts the next state

P(s, a, s′) ≈ p(St+1 = s′ | St = s, At = a)

► E.g., R predicts the next (immediate) reward

R(s, a) ≈ E[Rt+1 | St = s, At = a]

► A model does not immediately give us a good policy - we would still need to plan
► We could also consider stochastic (generative) models
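
A simple way to learn such a model is to count what the environment did after each (state, action) pair. A minimal tabular sketch (class and method names are chosen here for illustration):

```python
from collections import defaultdict

class TabularModel:
    """Count-based estimates of P(s, a, s') and R(s, a) from observed transitions."""
    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                      # (s, a) -> summed reward
        self.visits = defaultdict(int)                            # (s, a) -> visit count

    def update(self, s, a, r, s_next):
        """Record one observed transition (s, a, r, s')."""
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def P(self, s, a, s_next):
        """Estimated probability of reaching s_next after taking a in s."""
        n = self.visits[(s, a)]
        return self.next_counts[(s, a)][s_next] / n if n else 0.0

    def R(self, s, a):
        """Estimated expected immediate reward for taking a in s."""
        n = self.visits[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0
```

As the slide notes, having such a model is not a policy by itself: the agent would still need to plan inside it.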

45
An Example

46
Maze Example

Start
► Rewards: -1 per time-step
► Actions: N, E, S, W
► States: Agent’s location

Goal

47
Maze Example: Policy

Start

Goal

► Arrows represent policy π (s) for each state s

48
Maze Example: Model
[Figure: the maze grid between Start and Goal, with each cell labelled −1]

► Numbers represent the value vπ(s) of each state s
► Grid layout represents a partial transition model P^a_{ss′}
► Numbers represent the immediate reward R^a_{ss′} from each state s (same for all a and s′ in this case)
49
Agent Categories

50
Agent Categories: How Agents Learn to Win
► Value-Based:
► Score each choice to pick the best (value function).
► No explicit plan—just choose actions with highest value.
► Ex: Q-learning finds the maze’s shortest path (Page 50).
► Policy-Based:
► Act confidently with a direct plan (policy).
► No value scores—learn the best moves outright.
► Ex: A robot learns smooth walking steps (Page 37).
► Actor-Critic:
► Balance instincts and insight.
► Actor: Crafts a policy to act.
► Critic: Scores actions to refine the plan.
► Ex: Atari AI juggles paddle moves and point predictions (Page 16).
► Takeaway: Each path maximizes rewards—choose based on the challenge!
51
Agent Categories

► Model-Free:
► Dive in and learn by doing.
► Uses policies (plans) or value functions (scores) to chase rewards.
► No map of the world—just trial and error.
► Ex: Atari AI learns paddle moves through gameplay.
► Model-Based:
► Plan ahead with a world model.
► Builds a map of how actions lead to states and rewards.
► May use policies or values to decide, or plan directly.
► Ex: Robot predicts maze paths before moving (Page 50).
► Takeaway: Model-Free reacts fast; Model-Based thinks ahead—both aim for reward success!
52
Subproblems of the RL Problem

53
Prediction and Control

► Prediction: evaluate the future (for a given policy)


► Control: optimise the future (find the best policy)
► These can be strongly related:
► Control involves finding the optimal policy π*(s), which maximizes the action-value function:
π*(s) = argmax_a Q(s, a)
► If we could predict everything do we need anything else?
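
To make the relation concrete: if prediction gives us action values, control can be as simple as acting greedily with respect to them. A small sketch, assuming a dictionary q that maps (state, action) pairs to estimated values (the table below is hypothetical):

```python
def greedy_policy(q, actions):
    """Extract π*(s) = argmax_a q(s, a) from a table of action-value estimates."""
    policy = {}
    for state in {s for (s, _) in q}:
        policy[state] = max(actions, key=lambda a: q.get((state, a), float("-inf")))
    return policy

# Hypothetical action-value estimates
q = {("s0", "left"): 1.2, ("s0", "right"): 3.4, ("s1", "left"): 0.0, ("s1", "right"): -1.0}
print(greedy_policy(q, ["left", "right"]))   # s0 -> 'right', s1 -> 'left'
```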

54
Learning and Planning
Two fundamental problems in reinforcement learning
► Learning:
► The environment is initially unknown
► The agent interacts with the environment
► Planning:
► A model of the environment is given (or learnt)
► The agent plans in this model (without external interaction)
► a.k.a. reasoning, pondering, thought, search, planning

55
Learning Agent Components
► All components are functions
► Policies: π : S → A (or to probabilities over A)
► Value functions: v : S → R
► Models: m : S → S and/or r : S → R
► State update: u : S × O → S

► E.g., we can use neural networks and deep learning techniques to learn these functions
► Take care: we often violate assumptions from supervised learning (i.i.d. data, stationarity)
► Deep learning is an important tool
► Deep reinforcement learning is a rich and active research field
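
For example, a value function v : S → R can be represented by a parametric function and nudged towards a Bellman-style target from experience. The sketch below uses a linear function of state features as a stand-in for a neural network; the temporal-difference update it performs anticipates later lectures, and all names and step sizes are illustrative:

```python
class LinearValueFunction:
    """v(s) ≈ w · φ(s), where φ(s) is a numeric feature vector for state s."""
    def __init__(self, num_features, step_size=0.1, gamma=0.9):
        self.w = [0.0] * num_features
        self.alpha, self.gamma = step_size, gamma

    def value(self, phi):
        return sum(wi * xi for wi, xi in zip(self.w, phi))

    def td_update(self, phi, reward, phi_next):
        """Move v(s) towards the bootstrapped target R + gamma * v(s')."""
        target = reward + self.gamma * self.value(phi_next)
        error = target - self.value(phi)               # temporal-difference error
        self.w = [wi + self.alpha * error * xi for wi, xi in zip(self.w, phi)]

# Usage: states are represented by (hypothetical) feature vectors φ(s)
v = LinearValueFunction(num_features=2)
v.td_update(phi=[1.0, 0.0], reward=1.0, phi_next=[0.0, 1.0])
```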

56
Examples

57
Atari Example: Reinforcement Learning

[Figure: the agent-environment loop for Atari: the agent picks action at, and receives observation ot (screen pixels) and reward rt (the score)]

► Rules of the game are unknown
► Learn directly from interactive game-play
► Pick actions on joystick, see pixels and scores
Gridworld Example: Prediction

[Figure: (a) a 5×5 gridworld with special states A and B: every action from A jumps to A′ with reward +10, and every action from B jumps to B′ with reward +5; (b) the state-value function vπ for the uniform random policy, with top-row values 3.3, 8.8, 4.4, 5.3, 1.5]
Reward is −1 when bumping into a wall, γ = 0.9

What is the value function for the uniform random policy?

59
Gridworld Example: Control

[Figure: (a) the same 5×5 gridworld; (b) the optimal state-value function v∗, with top-row values 22.0, 24.4, 22.0, 19.4, 17.5; (c) the optimal policy π∗]

What is the optimal value function v∗ over all possible policies?
What is the optimal policy π∗?
60
End of Lecture

61
Background material

Background material
Reinforcement Learning: An Introduction, Sutton & Barto 2018
http://incompleteideas.net/book/the-book-2nd.html

62
Assignment
Project 1

63
