RL Viva
Definition:
Reinforcement Learning is a type of machine learning where an agent learns to make
decisions by interacting with an environment and receiving rewards or penalties.
Key idea:
Learn by trial and error to maximize cumulative reward over time.
Key Features of RL
1. Trial-and-Error Learning
Agent tries actions → gets feedback (reward) → improves future choices.
2. Delayed Reward
The effect of an action might be seen after many steps.
3. Exploration vs Exploitation
The agent must balance trying new actions to gather information (exploration) with choosing actions it already knows to be rewarding (exploitation).
Types of RL Methods
1. Model-Based RL: The agent learns or is given a model of the environment (transitions and rewards) and uses it to plan.
2. Model-Free RL: The agent learns values or a policy directly from experience, without a model of the environment.
3. On-Policy: Learns the value of the policy it is actually following.
o Example: SARSA
4. Off-Policy: Learns about a different (e.g., greedy) policy from the one used to act.
o Example: Q-Learning
Rewards
Formula:
Return (G) = r₁ + γ*r₂ + γ²*r₃ + ...
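To make the return concrete, here is a minimal Python sketch (not part of the original notes) that computes the discounted return for a list of rewards; the reward values in the example are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G = r1 + gamma*r2 + gamma^2*r3 + ... for a list of rewards."""
    g = 0.0
    # Work backwards so each step needs one multiply: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example with made-up rewards: three steps, each giving reward 1.
print(discounted_return([1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```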
1. Q-Learning
• Off-policy
• Learns based on the best possible next action (greedy), regardless of the action actually taken.
• Update Rule:
Q(s,a) ← Q(s,a) + α [r + γ max Q(s',a') - Q(s,a)]
2. SARSA (State-Action-Reward-State-Action)
• On-policy
• Learns based on the action actually taken.
• Update Rule:
Q(s,a) ← Q(s,a) + α [r + γ * Q(s’,a’) - Q(s,a)]
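The two update rules differ only in which next-state value they bootstrap from. Below is a minimal sketch of both tabular updates; the dictionary-based Q table and the variable names are assumptions for illustration, not part of the notes.

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated value, defaults to 0.0

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best possible action in the next state."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action actually taken in the next state."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```

Note that in SARSA the next action a_next must come from the same (e.g., ε-greedy) policy the agent is following, while Q-Learning ignores how the next action is actually chosen.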
Elements of RL
• Agent: The learner and decision maker.
• Environment: Everything the agent interacts with and that responds to its actions.
• State (s): The current situation or position of the agent in the environment.
• Action (a): A move or decision taken by the agent in a particular state.
• Reward (r): Numeric feedback from the environment for the agent's action.
• Policy (π): The agent's strategy – a mapping from states to actions.
• Q-value (Q): The expected return for taking action a in state s.
Viva Questions & Answers (Module 1)
6. What is a state?
A state is the current situation or position of the agent in the environment.
7. What is an action?
An action is a move or decision taken by the agent in a particular state.
• On-policy (SARSA): Learns from the action actually taken by the current policy.
• Off-policy (Q-Learning): Learns from the best possible action, not necessarily the one taken.
High γ → future rewards matter more. Low γ → short-term rewards matter more.
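A quick numeric illustration of this, with a made-up reward sequence:

```python
rewards = [1, 1, 1, 1, 1]  # hypothetical reward sequence, one reward per step

for gamma in (0.9, 0.1):
    g = sum(gamma**t * r for t, r in enumerate(rewards))
    print(f"gamma={gamma}: return = {g:.3f}")
# gamma=0.9 keeps later rewards relevant (return ≈ 4.095);
# gamma=0.1 is dominated by the first reward (return ≈ 1.111).
```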
Module 2: Bandit Problems and Online Learning – Summary
Topics Covered:
2. Action-Value Methods
6. Gradient Bandits
2. Action-Value Methods:
Estimate the value Q(a) of each action from observed rewards and select actions based on these estimates (see the code sketch after the flashcards below).
6. Gradient Bandits:
Learn a numerical preference H(a) for each action and choose actions using softmax probabilities over the preferences.
Viva Questions & Answers (Module 2)
25. Which method is better for non-stationary problems – sample average or constant α?
Constant α is better because it gives more weight to recent data.
26. What are some practical applications of bandit problems?
Online ad selection, recommendation systems, clinical trials.
Key Concepts:
Concept Description
N-armed bandit Choosing repeatedly among N actions, each with an unknown reward distribution.
Action value Q(a) Estimated average reward of action a.
Exploration vs exploitation Trying new actions vs choosing the action currently estimated best.
UCB Adds an exploration bonus that shrinks as an action is tried more often (via Nt(a)).
Gradient bandit Learns preferences H(a) and selects actions with softmax probabilities.
Formulas:
• Sample-average update: Q(a) ← Q(a) + (1/n) [R - Q(a)]
• Constant step-size update (non-stationary): Q(a) ← Q(a) + α [R - Q(a)]
• UCB action selection: A = argmax_a [Q(a) + c √(ln t / Nt(a))]
• Gradient bandit (softmax): π(a) = e^(H(a)) / Σ_b e^(H(b))
Flashcard-Style Q&A
Question Answer
What is the N-armed bandit problem? Choosing the best action among N options for max reward.
What is the goal of bandit problems? Maximize cumulative reward over time.
What does Nt(a) mean in UCB? How many times action a has been chosen.
How are actions chosen in gradient bandits? Using softmax probability from preferences.
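A minimal sketch of an action-value method for the N-armed bandit, using ε-greedy selection with a constant step size α; the Gaussian reward model and all parameter values are made up for illustration.

```python
import random

def run_bandit(true_means, steps=1000, epsilon=0.1, alpha=0.1):
    """Epsilon-greedy action-value bandit with a constant step size (suits non-stationary problems)."""
    n = len(true_means)
    Q = [0.0] * n                                      # action-value estimates
    for _ in range(steps):
        if random.random() < epsilon:                  # explore: random action
            a = random.randrange(n)
        else:                                          # exploit: current best estimate
            a = max(range(n), key=lambda i: Q[i])
        r = random.gauss(true_means[a], 1.0)           # sampled reward from a toy Gaussian bandit
        Q[a] += alpha * (r - Q[a])                     # constant-alpha incremental update
    return Q

print(run_bandit([0.2, 0.8, 0.5]))  # the estimates should roughly track the true means
```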
Module 3 – Markov Decision Processes (MDP)
Topics Covered:
4. Markov Property
5. Markov Decision Process (MDP)
6. Value Functions
1. Agent–Environment Interface
• At each time step, the agent observes a state (S), takes an action (A), and receives a
reward (R).
• Rewards are numeric values given by the environment to evaluate the agent’s actions.
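A sketch of that interaction loop in Python; env and agent are hypothetical objects with reset/step and act methods (an assumed interface for illustration, not a specific library API).

```python
def run_episode(env, agent, gamma=0.9):
    """One agent-environment episode: observe state, act, receive reward and next state."""
    state = env.reset()                               # initial state S_0
    g, discount, done = 0.0, 1.0, False
    while not done:
        action = agent.act(state)                     # A_t chosen by the agent's policy
        next_state, reward, done = env.step(action)   # environment returns R_{t+1}, S_{t+1}
        g += discount * reward                        # accumulate the discounted return G
        discount *= gamma
        state = next_state
    return g
```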
3. Returns (Gₜ)
• Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + ... – the total discounted reward accumulated from time step t.
4. Markov Property
• The future state depends only on the current state and action, not on the past.
5. Markov Decision Process (MDP)
An MDP is defined by the tuple (S, A, P, R, γ):
• S: Set of states
• A: Set of actions
• P: Transition probability function P(s' | s, a)
• R: Reward function
• γ: Discount factor
6. Value Functions
• State-Value Function (Vπ(s)): Expected return starting from state s, following policy
π.
• Action-Value Function (Qπ(s, a)): Expected return from state s, taking action a, and
following policy π thereafter.
• V* and Q* represent the maximum expected return achievable from any state or
action.
Term Meaning
Markov Property The future depends only on the current state, not on the past.
Action-Value Function (Q) Expected return from a state-action pair following a policy.
Optimal Value Function The maximum possible value from a state or state-action.
Important Formulas:
• Return: Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + ...
• State-value function: Vπ(s) = Eπ[Gₜ | Sₜ = s]
• Action-value function: Qπ(s, a) = Eπ[Gₜ | Sₜ = s, Aₜ = a]
• Bellman expectation equation: Vπ(s) = Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γ Vπ(s')]
• Bellman optimality equation: V*(s) = max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V*(s')]
30 Viva Questions & Answers (Module 3)
Question Answer
2. What are the components of an MDP? States (S), Actions (A), Transition Probabilities (P), Rewards (R), Discount Factor (γ).
3. Define the Markov property. Future state depends only on the current state, not on past states.
5. What is the agent-environment interface? The loop where the agent acts, and the environment returns a new state and reward.
8. What is a discount factor (γ)? It reduces the weight of future rewards; γ ∈ [0,1].
11. Define value function V(s). Expected return from state s following a policy π.
12. Define Q-value or action-value function. Expected return from (state, action) pair following π.
13. What is V*? Optimal value function – max possible return from each state.
15. What is the Bellman expectation equation? Recursive equation for Vπ(s) using the expected next state.
16. What is the Bellman optimality equation? Defines optimal V* using best actions at each step.
17. What is the goal of RL in MDP? To find the optimal policy that maximizes expected return.
18. Can MDPs be stochastic? Yes – transitions and rewards can be probabilistic.
19. What is a deterministic policy? Same action is chosen every time from a state.
20. What is a stochastic policy? Actions are chosen with probabilities from a state.
21. What’s the difference between V(s) and Q(s, a)? V(s) is value of a state; Q(s, a) includes specific actions.
23. How does RL differ from supervised learning? RL learns from interaction with environment, not labeled data.
25. What is finite horizon? The agent acts for a limited number of steps.
26. What is infinite horizon? Agent continues acting indefinitely with discounting.
27. Why is discounting used? To ensure convergence and prefer immediate rewards.
28. What is the purpose of value functions? To guide the agent in choosing better actions.
29. Why are Bellman equations important? They help compute value functions recursively.
30. What is policy improvement? Updating the policy to increase expected return.
Module 4 – Dynamic Programming (DP)
Key Topics Covered:
Topic Description
Policy Evaluation (Prediction) Computes the value function Vπ(s) for a given policy π.
Asynchronous DP Updates some states rather than all – more flexible and faster.
1. Policy Evaluation
Iteratively compute Vπ(s) using the Bellman Expectation Equation until convergence.
2. Policy Improvement
Make the policy greedy with respect to the current value function.
3. Policy Iteration
Alternate policy evaluation and policy improvement until the policy stops changing (it is then optimal).
4. Value Iteration
Combine evaluation and improvement into a single sweep using the Bellman Optimality Equation (see the code sketch after this list).
5. Asynchronous DP
Update states in any order, a few at a time, instead of sweeping all states – useful for large state spaces.
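A minimal sketch of value iteration with greedy policy extraction; the two-state MDP, its transition probabilities and rewards are invented purely for illustration.

```python
# Toy MDP: P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
gamma, theta = 0.9, 1e-6

V = {s: 0.0 for s in P}                      # initial value function (any values work)
while True:                                  # sweep until the largest change is below theta
    delta = 0.0
    for s in P:
        # Bellman optimality backup: max over actions of the expected one-step return.
        new_v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:
        break

# Greedy (argmax) policy extracted from the converged value function.
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(V, policy)
```

Policy iteration would instead alternate a full policy-evaluation sweep with a greedy improvement step, and asynchronous DP would update only a subset of states per sweep.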
Question Answer
2. What’s the role of DP in RL? It is used for computing value functions and finding optimal policies when a full model is known.
10. What is greedy policy? A policy that chooses the action with the highest value.
11. What does it mean for a policy to be stable? No further improvements are possible – it's optimal.
12. What is asynchronous DP? Updates only some states per iteration – useful for large state spaces.
13. What is the key idea behind policy improvement theorem? A better policy can be formed by choosing actions with higher expected returns.
15. How is convergence ensured in policy iteration? Policy evaluation and improvement steps eventually stabilize.
16. Is value iteration faster than policy iteration? Often yes, because it avoids full evaluation steps.
17. What is the stopping condition for value iteration? When the value change between iterations is below a small threshold (ε).
19. How does DP differ from Monte Carlo methods? DP uses a model; Monte Carlo methods learn from sampled episodes.
20. What is the computational cost of DP methods? High in large state/action spaces – that's why approximations or sampling are used.
21. Why is DP important for RL? It forms the theoretical foundation for RL algorithms.
22. What is the significance of the discount factor γ in DP? It determines how much future rewards are considered.
23. What’s the main challenge with DP in real-world problems? Large or continuous state/action spaces make exact DP infeasible.
24. In value iteration, what is being iteratively updated? The value function, using Bellman optimality.
25. How do you derive an optimal policy from a value function? Choose the action that maximizes expected value at each state.
27. When does GPI reach convergence? When both value and policy no longer change.
28. What does “argmax” do in policy improvement? Selects the action with the highest expected value.
29. Why is initialization of value function important? It can affect speed of convergence but not the final result.
30. How is DP related to Bellman equations? Bellman equations form the basis of all DP updates.
Module 5 – Monte Carlo (MC) & Temporal-Difference (TD) Learning
Key Topics Covered:
Subtopic Description
Monte Carlo Prediction Estimate value functions using sample returns from episodes.
TD Control using Q-Learning Off-policy TD control that learns Q(s, a) from action-value estimates.
Bootstrapping: MC – No, TD – Yes.
Algorithm Overviews
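As one concrete overview, here is a minimal sketch of first-visit Monte Carlo prediction and TD(0) prediction. Episodes are assumed to be given as lists of (state, reward) pairs, where the reward is the one received after leaving that state; this data format and all names are assumptions for illustration.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit MC: average the full return G observed after each state's first visit."""
    V, returns = defaultdict(float), defaultdict(list)
    for episode in episodes:                   # episode = [(state, reward), ...]
        # Returns computed backwards: G_t = r_{t+1} + gamma * G_{t+1}
        G = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            G[t] = episode[t][1] + gamma * G[t + 1]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)       # remember the first time each state appears
        for s, t in first_visit.items():
            returns[s].append(G[t])
            V[s] = sum(returns[s]) / len(returns[s])
    return V

def td0_prediction(episodes, alpha=0.1, gamma=0.9):
    """TD(0): update after every step, bootstrapping from the next state's current estimate."""
    V = defaultdict(float)
    for episode in episodes:
        for t, (s, r) in enumerate(episode):
            v_next = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
            V[s] += alpha * (r + gamma * v_next - V[s])
    return V

# Toy usage with one made-up episode: states "a" -> "b" -> terminal.
episodes = [[("a", 0.0), ("b", 1.0)]]
print(first_visit_mc(episodes), td0_prediction(episodes))
```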
Question Answer
1. What is Monte Carlo method in RL? A method that learns from complete episodes using sample returns.
2. What is the difference between MC and TD? MC waits for episodes to end; TD updates at each step using bootstrapping.
4. Define return (G) in MC. The cumulative discounted reward from a state.
6. What is every-visit MC? Updates value every time the state appears in an episode.
7. How does MC estimate action-value Q(s, a)? By averaging returns from all episodes starting with (s, a).
12. Write the TD(0) update formula. V(s) ← V(s) + α [r + γ V(s') - V(s)]
15. What is the Q-learning update rule? Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') – Q(s,a)]
16. Is Q-learning on-policy or off-policy? Off-policy – learns about greedy policy while following a different one.
17. What is an ε-greedy policy? A policy that chooses the best action most of the time but explores randomly with probability ε.
18. What’s the role of learning rate α? It determines how much new info overrides old value estimates.
19. What are the pros of TD over MC? TD can learn faster, online, and doesn’t need episode termination.
20. Can TD and MC be combined? Yes – in algorithms like TD(λ), combining both ideas.
21. What is the exploration-exploitation tradeoff? Balancing trying new actions vs exploiting known best ones.
22. What is off-policy learning? Learning about one policy while behaving according to another.
23. What is on-policy learning? Learning about the same policy being followed.
24. Why is Q-learning called off-policy? It learns about the greedy policy while acting ε-greedy.
25. What’s the terminal state? A state where the episode ends.
26. Which methods can handle continuing tasks? TD methods, especially TD(0), can handle them.
27. Which method has higher variance – MC or TD? Monte Carlo has higher variance.
28. Why is MC method not always suitable for real-time learning? It requires the episode to finish to update values.
29. Name a real-world example using MC methods. Blackjack value estimation based on full games.
30. Name a use case of Q-learning. Game AI, elevator control, traffic light control.
Module 6 – Applications and Case Studies
Overview
This module explores how Reinforcement Learning (RL) can be applied in real-world
decision-making problems where the environment is dynamic, and outcomes depend on
sequential actions.
Elevator Dispatching Optimize elevator movements to reduce wait time and energy consumption.
Case Studies:
• Elevator Dispatching
• Channel Allocation
• Job-Shop Scheduling
Algorithms used:
• Q-Learning
• SARSA
Question Answer
2. What is the objective in elevator dispatching? Minimize waiting time and energy usage.
4. What are actions in the elevator dispatching problem? Move up, move down, stay idle, or stop at a floor.
5. What kind of reward function is used in elevator dispatching? Negative rewards for wait time and idle movement, positive rewards for serving passengers (see the toy sketch after this Q&A).
10. What does the state include in channel allocation? Current usage, available channels, interference levels.
11. What are the actions in channel allocation? Assigning a specific channel to a particular user.
12. What is the reward in channel allocation? High data rate with minimal interference.
14. Define job-shop scheduling. A scheduling problem where jobs are assigned to different machines in an optimal sequence.
15. What is the goal in job-shop scheduling? Minimize total production time (makespan) and delays.
16. What is a state in job-shop scheduling? Current status of machines and pending jobs.
18. How does RL benefit job-shop scheduling? It adapts to dynamic job arrivals and machine conditions.
19. What is the reward in job-shop scheduling? Completion bonus or penalty based on job timing.
20. Can deep reinforcement learning be used for job scheduling? Yes, DQN and Actor-Critic methods can be used.
21. What are the advantages of using RL in real-time applications? Real-time decision-making, adaptability, and optimization.
22. What is the main challenge in applying RL to real-world problems? Large state/action spaces, data scarcity, and delayed rewards.
23. Which is more suitable for continuous problems: MC or TD? TD methods, because they don’t require episodes to end.
25. Which algorithm would you choose for job-shop scheduling and why? Q-Learning or DQN for scalability and adaptiveness.
26. Can RL work without a simulator in such applications? Yes, using real-time interaction or historical data logs.
27. What is the role of exploration in these applications? It allows the agent to try new strategies for better outcomes.
28. How does RL handle uncertainty in dynamic environments? By updating value functions based on new observations.
29. How is reward shaped in industrial RL use cases? Using task-specific metrics like cost, time, energy, etc.
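To make the reward-shaping idea concrete, here is a toy sketch of an elevator-dispatching reward; the state fields and weights are entirely hypothetical and only mirror the description in the Q&A above.

```python
def elevator_reward(total_wait_time, idle_moves, passengers_served,
                    wait_penalty=1.0, idle_penalty=0.5, serve_bonus=10.0):
    """Toy reward: penalise accumulated waiting and idle movement, reward served passengers."""
    return (serve_bonus * passengers_served
            - wait_penalty * total_wait_time
            - idle_penalty * idle_moves)

# Example: 2 passengers served, 30 seconds of accumulated waiting, 1 idle move.
print(elevator_reward(total_wait_time=30, idle_moves=1, passengers_served=2))  # -10.5
```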