Reinforcement Learning: Teaching Machines to Learn from Experience
Understanding the Decision-Making Power of AI
Presented by:
RA2311027010051 - DIYA SHARMA
RA2311027010063 - SHRUTI RAJ
RA2311027010065 - HARSH KUMAR
What is Reinforcement Learning?
Definition
Reinforcement learning (RL) is a branch of machine learning in which an agent learns by interacting with an environment.
Key Idea
Learn by trial and error, much as humans and animals do.
Real-Life Examples of RL
AlphaGo defeating human champions
Robots learning to walk or grasp objects
AI mastering video games like Atari, Dota, Minecraft
Self-driving cars navigating roads
Components of Reinforcement Learning
Agent: The learner or decision maker
Environment: The world the agent interacts with
State (s): The current situation of the agent
Action (a): The possible moves the agent can make
Reward (r): Feedback from the environment
Policy (π): The strategy the agent follows
Value (V/Q): The long-term value of a state or action
The Reinforcement Learning Loop
Observe State
Agent perceives current environment state.
Choose Action
Agent selects and performs an action.
Receive Reward
Agent gets feedback from environment.
Update Knowledge
Agent adjusts policy or value estimates.
Repeat
Goal: maximize cumulative future rewards.
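To make the loop concrete, the following minimal Python sketch runs one episode; GridEnv, its one-dimensional grid, and its reward scheme are invented for illustration and do not come from any standard library.

import random

class GridEnv:
    """Toy 1-D corridor: the agent starts at cell 0 and earns reward 1 for reaching cell 4."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (move left) or +1 (move right), clipped to the grid edges
        self.pos = max(0, min(4, self.pos + action))
        done = self.pos == 4
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

env = GridEnv()
state = env.reset()                         # 1. Observe state
done = False
while not done:
    action = random.choice([-1, 1])         # 2. Choose action (a random policy here)
    state, reward, done = env.step(action)  # 3. Receive reward and next state
    # 4. Update knowledge would happen here (see the Q-Learning slide)
    # 5. Repeat until the episode ends

A learning agent would replace the random choice with a policy that improves as rewards accumulate.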
Mathematical Foundation: Markov Decision Processes
A Markov Decision Process (MDP) is a mathematical framework used to describe decision-making in situations where outcomes are partly random and partly under the control of a decision-maker (agent).
It provides a formal model for environments in reinforcement learning.
🧱 MDP Components (5-Tuple)
An MDP is defined as a 5-tuple:
MDP = ⟨S, A, P, R, γ⟩
States (S): Possible situations the agent can be in.
Actions (A): Choices available to the agent.
Transition Probabilities (P): Likelihood of moving between states.
Reward Function (R): Feedback received after actions.
Discount Factor (γ): Importance of future rewards.
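As a sketch, the 5-tuple can be written down directly as plain Python data; the two-state weather example below is invented purely to illustrate the shape of each component.

# States, actions, transitions P(s' | s, a), rewards R(s, a), and discount γ
S = ["sunny", "rainy"]
A = ["walk", "drive"]
P = {
    ("sunny", "walk"):  {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "drive"): {"sunny": 0.9, "rainy": 0.1},
    ("rainy", "walk"):  {"sunny": 0.3, "rainy": 0.7},
    ("rainy", "drive"): {"sunny": 0.5, "rainy": 0.5},
}
R = {
    ("sunny", "walk"): 2.0, ("sunny", "drive"): 1.0,
    ("rainy", "walk"): -1.0, ("rainy", "drive"): 0.5,
}
gamma = 0.9
mdp = (S, A, P, R, gamma)

Each row of P sums to 1, reflecting that the next state is drawn from a probability distribution conditioned on the current state and action.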
Types of Reinforcement Learning
Model-Free vs Model-Based: Different approaches to learning environment dynamics.
Value-Based: Example: Q-Learning.
Policy-Based: Example: REINFORCE.
Actor-Critic: Combines value and policy methods. Examples: PPO, A3C.
Q-Learning Algorithm
Q-Learning is a model-free reinforcement learning algorithm that learns the optimal action-selection policy for an agent interacting with an environment. Its goal is to learn a policy that tells the agent the best action to take in each state in order to maximize the total cumulative reward over time.
Key Concepts:
State (s): The condition or configuration of the environment at any given time.
Action (a): The decision or move made by the agent.
Reward (r): The feedback received after taking an action in a state.
Q-value (Q(s, a)): The expected future reward for taking action a in state s. It represents the quality of the action taken in a given state.
Update Rule
Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
Parameters
α: learning rate
γ: discount factor
r: reward
s', a': next state and next action
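The update rule translates directly into a few lines of Python. Below is a minimal tabular Q-Learning sketch on an invented 1-D grid task; the hyperparameter values (α = 0.1, γ = 0.9, ε = 0.1) are illustrative choices, not prescribed by the algorithm.

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
actions = [-1, 1]                       # move left / move right on a 1-D grid
Q = defaultdict(float)                  # Q-table: Q[(state, action)], default 0.0

def step(state, action):
    """Toy environment: reach cell 4 on a 5-cell grid to earn reward 1."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

for episode in range(500):
    state, done = 0, False
    while not done:
        # ε-greedy selection: explore with probability ε, otherwise act greedily
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

After training, the greedy policy max_a Q(s, a) consistently moves right toward the rewarding cell.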
Summary and Next Steps
RL teaches agents to learn from experience.
Key components include the agent, environment, rewards, and policies.
Mathematical foundations guide algorithm design.
Explore advanced RL methods and applications next.