RL Viva

Module 1: Introduction to Reinforcement Learning

What is Reinforcement Learning (RL)?

Definition:
Reinforcement Learning is a type of machine learning where an agent learns to make
decisions by interacting with an environment and receiving rewards or penalties.

Key idea:
Learn by trial and error to maximize cumulative reward over time.

Key Features of RL

1. Trial-and-Error Learning
Agent tries actions → gets feedback (reward) → improves future choices.

2. Delayed Reward
The effect of an action might be seen after many steps.

3. Exploration vs Exploitation

o Exploration: Try new actions to learn more.

o Exploitation: Use known actions for the best reward.

4. Sequential Decision Making


The next state depends on the current action.

Elements of RL

Agent: the learner/decision maker

Environment: everything the agent interacts with

State (s): the current situation

Action (a): the choice made by the agent

Reward (r): feedback received after an action

Policy (π): the agent’s strategy (a mapping from states to actions)

Value Function (V): expected long-term reward

Q-value (Q): value of taking a specific action in a state


Example:
In a video game, the agent is the player, actions are moves, rewards are scores, and policy is
the strategy.

Types of Reinforcement Learning

1. Model-Based RL

o Agent learns the model (transition probabilities).


o Example: Chess AI using tree search.

2. Model-Free RL

o Agent learns directly from experience.

o Example: Playing Atari games using only screen pixels.

3. On-Policy

o Agent learns from the actions it actually takes.

o Example: SARSA
4. Off-Policy

o Agent learns from optimal actions, not necessarily taken.

o Example: Q-Learning

Rewards

• Immediate Reward (r): Received right after an action.

• Cumulative Reward (Return): Sum of all future rewards.

• Discounted Reward: Future rewards reduced by a factor (γ) to prioritize short-term gains.

Formula:
Return = r₁ + 𝛾*r₂ + 𝛾²*r₃ + ...
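
To make the formula concrete, here is a minimal sketch (not part of the original notes) that computes a discounted return for a short list of rewards; the reward values and γ are illustrative.

```python
# Minimal sketch: discounted return G = r1 + γ*r2 + γ^2*r3 + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1, 0, 2], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```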

Reinforcement Learning Algorithms

1. Q-Learning

• Off-policy

• Learns the best action-value function (Q-table).


• Update Rule:
Q(s,a) ← Q(s,a) + α [r + γ * max(Q(s’,a’)) - Q(s,a)]

Uses max future Q-value

2. SARSA (State-Action-Reward-State-Action)

• On-policy
• Learns based on the action actually taken.

• Update Rule:
Q(s,a) ← Q(s,a) + α [r + γ * Q(s’,a’) - Q(s,a)]

Uses Q-value of the next action taken
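
The two update rules differ only in their target. A minimal tabular sketch of both updates (assuming small discrete state/action spaces and a NumPy Q-table; the step size and discount factor are illustrative):

```python
import numpy as np

alpha, gamma = 0.1, 0.9   # illustrative step size and discount factor

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy target: best next action, max_a' Q(s', a')
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy target: the action actually taken next, Q(s', a')
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```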

Quick Comparison: Q-Learning vs SARSA

Type: Q-Learning is off-policy; SARSA is on-policy.

Exploration: Q-Learning's update target is greedy regardless of how the agent explores; SARSA's target depends on the behaviour policy.

Target: Q-Learning uses max Q(s', a'); SARSA uses Q(s', a') for the action actually taken.

Summary for Viva

• RL helps agents learn by interacting with the environment.

• Key components: Agent, Environment, State, Action, Reward.

• Two main algorithms: Q-Learning (off-policy) and SARSA (on-policy).

• Use Q-table to store and update action values.

• Focus on maximizing long-term rewards

Viva Questions & Answers – Module 1: Introduction to RL

1. What is Reinforcement Learning?


Reinforcement Learning is a type of machine learning where an agent learns to make
decisions by performing actions in an environment to maximize cumulative reward.
2. What are the key elements of Reinforcement Learning?

• Agent

• Environment

• State (s)
• Action (a)

• Reward (r)

• Policy (π)

• Value Function (V)

• Q-value (Q)

3. What is the role of the agent in RL?


The agent is the decision-maker that interacts with the environment to learn the best actions.

4. What is an environment in RL?


The environment is everything that surrounds the agent and responds to its actions.

5. What is a reward in RL?


A reward is a numeric value that tells the agent how good or bad its action was.

6. What is a state?
A state is the current situation or position of the agent in the environment.

7. What is an action?
An action is a move or decision taken by the agent in a particular state.

8. What is a policy (π)?


A policy is a strategy or a mapping from states to actions.

9. What is the value function?


The value function gives the expected cumulative reward from a state following a policy.
10. What is Q-value (action-value function)?
Q-value represents the expected cumulative reward of taking a specific action in a specific
state.

11. What is the difference between value function and Q-function?

• Value function (V): Expected reward from a state.

• Q-function (Q): Expected reward from a state-action pair.

12. What is exploration in RL?


Exploration is when the agent tries new actions to gather more knowledge about the
environment.

13. What is exploitation in RL?


Exploitation is when the agent uses its current knowledge to maximize the reward.

14. What is the exploration vs exploitation trade-off?


It’s the balance between trying new actions (exploration) and using known actions for the
best reward (exploitation).

15. What is Q-Learning?


Q-Learning is an off-policy RL algorithm that learns the optimal action-value function.

16. What is the Q-Learning update rule?


Q(s,a) ← Q(s,a) + α [r + γ * max(Q(s’,a’)) - Q(s,a)]

17. What is SARSA?


SARSA is an on-policy RL algorithm that learns the Q-value using the action actually taken
by the agent.

18. What is the SARSA update rule?


Q(s,a) ← Q(s,a) + α [r + γ * Q(s’,a’) - Q(s,a)]

19. What is the difference between on-policy and off-policy learning?


• On-policy (SARSA): Learns from actions taken using the current policy.

• Off-policy (Q-Learning): Learns from the best possible action, not necessarily the
one taken.

20. What is a discount factor (γ) in RL?


It is a value between 0 and 1 that reduces the importance of future rewards.

High γ → future rewards matter more. Low γ → short-term rewards matter more.
Module 2: Bandit Problems and Online Learning – Summary
Topics Covered:

1. N-Armed Bandit Problem

2. Action-Value Methods

3. Tracking a Non-stationary Problem


4. Optimistic Initial Values

5. Upper Confidence Bound (UCB) Action Selection

6. Gradient Bandits

Summary Notes (Easy to Remember)

1. N-Armed Bandit Problem:


• Imagine a slot machine with N levers.

• Each lever (action) gives different reward probabilities.

• Goal: Find the best lever to pull (i.e., best action).

2. Action-Value Methods:

• Estimate the value Q(a) of each action as the average of the rewards received for it.

• Incremental update: Q(a) ← Q(a) + α [R − Q(a)] (see the sketch after item 4 below).

3. Tracking Non-Stationary Problems:

• In real-world, reward distributions may change over time.

• Use constant step-size (α) instead of sample average to adapt to changes.


4. Optimistic Initial Values:

• Start with high Q values to encourage exploration early.

• Agent tries all actions before settling.
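
A minimal sketch (not from the notes) tying items 2–4 together: ε-greedy action selection with the constant step-size update and optimistic initial values. The reward function, step size, ε, and initial value are assumptions.

```python
import random

def run_bandit(reward_fn, k=10, steps=1000, alpha=0.1, epsilon=0.1, q_init=5.0):
    Q = [q_init] * k                      # optimistic initial estimates
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:     # explore
            a = random.randrange(k)
        else:                             # exploit the current best estimate
            a = max(range(k), key=lambda i: Q[i])
        r = reward_fn(a)                  # environment supplies the reward
        Q[a] += alpha * (r - Q[a])        # constant-α update (tracks drift)
        total += r
    return Q, total
```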


5. Upper Confidence Bound (UCB):

• Select the action that maximizes Q(a) + c √( ln t / Nₜ(a) ), i.e. the value estimate plus an uncertainty bonus.

• Rarely tried actions get a larger bonus, so exploration happens automatically (see the sketch below).
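
A minimal sketch of that selection rule (not from the notes; the constant c is an assumption):

```python
import math

def ucb_select(Q, N, t, c=2.0):
    # Untried actions are selected first (their bonus is effectively infinite).
    for a, n in enumerate(N):
        if n == 0:
            return a
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```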

6. Gradient Bandits:

• Use preferences instead of Q-values.

• Compute probability of actions using softmax:

• Update preferences to improve reward.
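
A minimal sketch of one gradient-bandit step (not from the notes): preferences H(a) are turned into probabilities with softmax, and the chosen preference is nudged up or down depending on whether the reward beats the baseline (typically the running average reward).

```python
import numpy as np

def softmax(h):
    e = np.exp(h - np.max(h))        # subtract max for numerical stability
    return e / e.sum()

def gradient_bandit_step(H, baseline, reward_fn, alpha=0.1):
    probs = softmax(H)
    a = np.random.choice(len(H), p=probs)
    r = reward_fn(a)
    one_hot = np.eye(len(H))[a]
    # Raise H(a) if r beats the baseline, lower the other preferences.
    H += alpha * (r - baseline) * (one_hot - probs)
    return H, a, r
```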

30 Viva Questions with Answers

1. What is the N-Armed Bandit problem?


It is a problem where an agent chooses between N actions, each giving random rewards, to
maximize the total reward.

2. Why is it called a “bandit” problem?


Like a slot machine (one-armed bandit), the agent decides which lever (action) to pull to get
the best reward.

3. What is the goal in a bandit problem?


To find and exploit the action with the highest average reward.

4. What is an action-value method?


Method to estimate the value of each action using the average of received rewards.

5. What is the formula for updating Q-value incrementally?
Q(a) ← Q(a) + α [R − Q(a)], where α is the step size (equal to 1/n for a sample average).


6. What does the step-size α represent?
It controls how much new rewards affect the current value.

7. Why are sample averages not good for non-stationary problems?


Because they give equal weight to old data, even if the environment has changed.

8. How do we track non-stationary problems?


Use constant step-size (α) updates to give more weight to recent rewards.

9. What are optimistic initial values?


Starting Q-values are set high to encourage the agent to explore all actions initially.

10. What is the benefit of optimistic initial values?


They prevent the agent from settling on suboptimal actions too early.

11. What is the exploration-exploitation dilemma?


Choosing between exploring new actions or exploiting known best actions.

12. What is UCB in bandits?


Upper Confidence Bound is a method that adds a confidence term to Q-values to balance
exploration and exploitation.

13. Write the UCB action selection formula.
Aₜ = argmaxₐ [ Qₜ(a) + c √( ln t / Nₜ(a) ) ]

14. What does the constant ‘c’ do in UCB?


Controls the level of exploration.

15. What is Nt(a) in UCB?


The number of times action a has been selected.
16. Why does UCB use log(t)?
The log(t) term grows more and more slowly, so exploration tapers off over time but never stops completely; every action is eventually reconsidered.

17. What are gradient bandits?


Methods that learn preferences for actions instead of values.

18. What is the softmax function used for in gradient bandits?


To convert preferences into a probability distribution over actions.

19. Write the softmax formula for action selection.
π(a) = e^H(a) / Σ_b e^H(b)

20. What is H(a) in gradient bandits?


H(a) is the preference value for action a.

21. Do gradient bandits need Q-values?


No, they directly work with preferences and probabilities.

22. What is the role of a baseline in gradient bandits?


To reduce variance and stabilize learning during updates.

23. What does a high preference H(a) mean?


That action a is more likely to be selected.

24. What is a non-stationary problem in RL?


A problem where reward probabilities change over time.

25. Which method is better for non-stationary problems – sample average or constant
α?
Constant α is better because it gives more weight to recent data.
26. What are some practical applications of bandit problems?
Online ad selection, recommendation systems, clinical trials.

27. Why is exploration necessary in bandits?


To discover which actions yield better rewards.

28. Can UCB be used in non-stationary settings?


Not ideal. UCB assumes stationary distributions.

29. What happens if we set α too high?


Q-values may fluctuate too much – unstable learning.

30. What happens if we always exploit the best-known action?


We may miss better actions and get stuck with suboptimal performance.


One-Pager Summary + Formula Sheet

Key Concepts:

N-Armed Bandit: A decision-making problem where the agent chooses between N actions (arms).

Action-Value Methods: Estimate the value of actions using reward averages.

Exploration vs Exploitation: Explore new actions vs use the best-known action.

Non-stationary Problems: When reward distributions change over time.

Optimistic Initial Values: Set high initial Q-values to encourage exploration.

UCB (Upper Confidence Bound): Select actions using both value and uncertainty.

Gradient Bandits: Use preferences and softmax for action probabilities.

Formulas:

• Sample-average update: Q(a) ← Q(a) + (1/n) [R − Q(a)]
• Constant step-size update: Q(a) ← Q(a) + α [R − Q(a)]
• UCB action selection: Aₜ = argmaxₐ [ Qₜ(a) + c √( ln t / Nₜ(a) ) ]
• Softmax (gradient bandits): π(a) = e^H(a) / Σ_b e^H(b)

Flashcard-Style Q&A

Just flip (mentally!) and revise:

• What is the N-armed bandit problem? → Choosing the best action among N options for max reward.
• What is Q-value? → Estimate of an action’s average reward.
• What is the goal of bandit problems? → Maximize cumulative reward over time.
• What is exploration? → Trying new actions to learn more about them.
• What is exploitation? → Using the best-known action for reward.
• Why is α (alpha) used? → Controls the learning rate in the Q-value update.
• What’s better for non-stationary environments – sample average or constant α? → Constant α.
• Why use optimistic initial values? → Encourage exploration of all actions.
• What does UCB do? → Balances exploration and exploitation using confidence bounds.
• What does c in UCB control? → Amount of exploration.
• What does Nt(a) mean in UCB? → How many times action a has been chosen.
• What happens as Nt(a) increases? → The UCB confidence term decreases – less exploration.
• What is a gradient bandit? → A bandit method using preferences, not Q-values.
• How are actions chosen in gradient bandits? → Using softmax probabilities from preferences.
• Why use a baseline in gradient bandits? → Reduce variance, improve stability.
• Softmax ensures what? → All probabilities are positive and sum to 1.
• High H(a) leads to? → Higher probability of choosing action a.
• Where are bandit problems used? → Ads, recommendations, A/B testing, clinical trials.
• Problem with only exploiting? → Might miss better options.
• What if α is too high? → Unstable learning; overreacting to the latest reward.

Module 3 – Markov Decision Processes
Key Topics:

1. The Agent–Environment Interface

2. Goals and Rewards

3. Returns (Cumulative Reward)

4. Markov Property
5. Markov Decision Process (MDP)

6. Value Functions

7. Optimal Value Functions

1. Agent–Environment Interface

• Defines how the agent interacts with the environment.

• At each time step, the agent observes a state (S), takes an action (A), and receives a
reward (R).

• The environment transitions to a new state (S') based on the action.

2. Goals and Rewards


• The agent’s goal is to maximize cumulative rewards over time.

• Rewards are numeric values given by the environment to evaluate the agent’s actions.

3. Returns (Gₜ)

• Return is the total cumulative reward from time step t.

• If future rewards are discounted:
Gₜ = Rₜ₊₁ + γ Rₜ₊₂ + γ² Rₜ₊₃ + ...
where γ ∈ [0,1] is the discount factor.

4. Markov Property

• The future state depends only on the current state and action, not on the past.

• This makes decision-making memoryless and efficient.


5. Markov Decision Process (MDP)

An MDP is defined by a 5-tuple:

(S,A,P,R,γ)
• S: Set of states

• A: Set of actions

• P: State transition probability: P(s′∣s,a)

• R: Reward function: R(s,a)

• γ: Discount factor
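
To make the 5-tuple concrete, here is a minimal sketch (a hypothetical 2-state, 2-action MDP, not from the notes) written as plain Python dictionaries; all numbers are illustrative.

```python
states = ["s0", "s1"]
actions = ["a0", "a1"]
gamma = 0.9

# P[(s, a)] maps each next state s' to its transition probability P(s' | s, a).
P = {
    ("s0", "a0"): {"s0": 0.8, "s1": 0.2},
    ("s0", "a1"): {"s0": 0.1, "s1": 0.9},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s1": 1.0},
}

# R[(s, a)] is the expected immediate reward R(s, a).
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0,
     ("s1", "a0"): 2.0, ("s1", "a1"): 0.0}
```

This is the kind of fully known model (P and R) that the dynamic-programming methods in Module 4 assume.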

6. Value Functions

• State-Value Function (Vπ(s)): Expected return starting from state s, following policy
π.

• Action-Value Function (Qπ(s, a)): Expected return from state s, taking action a, and
following policy π thereafter.

7. Optimal Value Functions

• V* and Q* represent the maximum expected return achievable from any state or
action.

• Used to derive the optimal policy π*: π*(s) = argmaxₐ Q*(s, a)

γ (gamma): Discount factor (0 ≤ γ ≤ 1), the weight given to future rewards.

Markov Property: The future depends only on the current state, not on the past.

MDP (Markov Decision Process): Defined by (S, A, P, R, γ) → states, actions, transition probabilities, rewards, discount.

Value Function (V): Expected return from a state following a policy.

Action-Value Function (Q): Expected return from a state-action pair following a policy.

Optimal Value Function: The maximum possible value from a state or state-action pair.

Important Formulas:

• Return: Gₜ = Rₜ₊₁ + γ Rₜ₊₂ + γ² Rₜ₊₃ + ...
• State-value: Vπ(s) = Eπ[Gₜ | Sₜ = s]
• Action-value: Qπ(s, a) = Eπ[Gₜ | Sₜ = s, Aₜ = a]
• Bellman expectation: Vπ(s) = Σₐ π(a|s) Σ_s′ P(s′|s,a) [R(s,a) + γ Vπ(s′)]
• Optimal policy: π*(s) = argmaxₐ Q*(s, a)
30 Viva Questions & Answers (Module 3)

1. What is an MDP? A framework to model decision-making with states, actions, rewards, and probabilities.
2. What are the components of an MDP? States (S), Actions (A), Transition Probabilities (P), Rewards (R), Discount Factor (γ).
3. Define the Markov property. Future state depends only on the current state, not on past states.
4. What is a policy? A mapping from states to actions.
5. What is the agent-environment interface? The loop where the agent acts, and the environment returns a new state and reward.
6. What is a reward? A signal from the environment indicating the result of an action.
7. What is return? The total accumulated reward (can be discounted).
8. What is a discount factor (γ)? It reduces the weight of future rewards; γ ∈ [0,1].
9. What happens if γ = 0? Agent considers only immediate reward.
10. What if γ = 1? Agent considers future rewards equally (infinite horizon).
11. Define value function V(s). Expected return from state s following a policy π.
12. Define Q-value or action-value function. Expected return from a (state, action) pair following π.
13. What is V*? Optimal value function – max possible return from each state.
14. What is Q*? Optimal action-value function.
15. What is the Bellman expectation equation? Recursive equation for Vπ(s) using the expected next state.
16. What is the Bellman optimality equation? Defines the optimal V* using the best actions at each step.
17. What is the goal of RL in an MDP? To find the optimal policy that maximizes expected return.
18. Can MDPs be stochastic? Yes – transitions and rewards can be probabilistic.
19. What is a deterministic policy? The same action is chosen every time from a state.
20. What is a stochastic policy? Actions are chosen with probabilities from a state.
21. What’s the difference between V(s) and Q(s, a)? V(s) is the value of a state; Q(s, a) includes specific actions.
22. What does π(a|s) mean? The probability of taking action a in state s under policy π.
23. How does RL differ from supervised learning? RL learns from interaction with the environment, not labeled data.
24. What is a trajectory or episode? A sequence of (state, action, reward) over time.
25. What is finite horizon? The agent acts for a limited number of steps.
26. What is infinite horizon? The agent continues acting indefinitely with discounting.
27. Why is discounting used? To ensure convergence and prefer immediate rewards.
28. What is the purpose of value functions? To guide the agent in choosing better actions.
29. Why are Bellman equations important? They help compute value functions recursively.
30. What is policy improvement? Updating the policy to increase expected return.

Module 4 – Dynamic Programming (DP)
Key Topics Covered:

Policy Evaluation (Prediction): Computes the value function Vπ(s) for a given policy π.

Policy Improvement: Generates a better policy by acting greedily with respect to Vπ(s).

Policy Iteration: Alternates between evaluation and improvement until the policy stabilizes.

Value Iteration: Combines evaluation and improvement in one update step.

Asynchronous DP: Updates some states rather than all – more flexible and faster.

Generalized Policy Iteration (GPI): Unified view where evaluation and improvement happen in parallel, often seen in RL.

Important Equations & Ideas:

1. Bellman Expectation Equation for Vπ(s):
Vπ(s) = Σₐ π(a|s) Σ_s′ P(s′|s,a) [R(s,a) + γ Vπ(s′)]

2. Bellman Optimality Equation for V*(s):
V*(s) = maxₐ Σ_s′ P(s′|s,a) [R(s,a) + γ V*(s′)]


Algorithm Overviews:

1. Policy Evaluation

Iteratively compute Vπ(s) using the Bellman Expectation Equation until convergence.
2. Policy Improvement

Update policy by choosing best actions w.r.t. Vπ(s).

3. Policy Iteration

• Step 1: Evaluate current policy.

• Step 2: Improve policy.

• Repeat until policy stabilizes.

4. Value Iteration

Performs only one update step per iteration, using the Bellman optimality backup:
V(s) ← maxₐ Σ_s′ P(s′|s,a) [R(s,a) + γ V(s′)]
(A code sketch follows the GPI overview below.)

5. Asynchronous DP

Update value functions in random order or selectively.


6. Generalized Policy Iteration (GPI)

Evaluation and improvement steps interact continuously until optimality.
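
A minimal value-iteration sketch (not from the notes), using the same dictionary-based model (P, R) as the hypothetical MDP example in Module 3; the threshold ε and discount γ are illustrative. The final loop extracts the greedy policy, which is the policy-improvement half of GPI.

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: best one-step lookahead value.
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:          # stop when values change less than ε
            break
    # Greedy policy extraction (policy improvement).
    policy = {
        s: max(actions, key=lambda a: R[(s, a)] + gamma *
               sum(p * V[s2] for s2, p in P[(s, a)].items()))
        for s in states
    }
    return V, policy
```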

30 Viva Questions & Answers – Module 4 (Dynamic Programming)

1. What is dynamic programming? A method that solves problems by breaking them into subproblems and reusing solutions.
2. What’s the role of DP in RL? It is used for computing value functions and finding optimal policies when a full model is known.
3. Define policy evaluation. Computing the value function for a given policy.
4. What is policy improvement? Improving a policy by acting greedily with respect to current value estimates.
5. What is policy iteration? Alternating policy evaluation and improvement until the policy becomes stable.
6. What is value iteration? A combination of policy evaluation and improvement done in one step.
7. Difference between policy iteration and value iteration? Policy iteration separates evaluation and improvement; value iteration combines them.
8. What is the Bellman Expectation Equation used for? Evaluating a policy.
9. What is the Bellman Optimality Equation used for? Finding the optimal value function.
10. What is a greedy policy? A policy that chooses the action with the highest value.
11. What does it mean for a policy to be stable? No further improvements are possible – it's optimal.
12. What is asynchronous DP? It updates only some states per iteration – useful for large state spaces.
13. What is the key idea behind the policy improvement theorem? A better policy can be formed by choosing actions with higher expected returns.
14. What is generalized policy iteration? Continuous interaction between policy evaluation and improvement.
15. How is convergence ensured in policy iteration? Policy evaluation and improvement steps eventually stabilize.
16. Is value iteration faster than policy iteration? Often yes, because it avoids full evaluation steps.
17. What is the stopping condition for value iteration? When the value change between iterations is below a small threshold (ε).
18. Can DP be used without knowing the model? No, it requires a known model (transition probabilities and rewards).
19. How does DP differ from Monte Carlo methods? DP uses a model; Monte Carlo methods learn from sampled episodes.
20. What is the computational cost of DP methods? High in large state/action spaces – that's why approximations or sampling are used.
21. Why is DP important for RL? It forms the theoretical foundation for RL algorithms.
22. What is the significance of the discount factor γ in DP? It determines how much future rewards are considered.
23. What’s the main challenge with DP in real-world problems? Large or continuous state/action spaces make exact DP infeasible.
24. In value iteration, what is being iteratively updated? The value function, using the Bellman optimality equation.
25. How do you derive an optimal policy from a value function? Choose the action that maximizes expected value at each state.
26. Can asynchronous updates be parallelized? Yes, this is often used in multi-threaded implementations.
27. When does GPI reach convergence? When both the value function and the policy no longer change.
28. What does “argmax” do in policy improvement? It selects the action with the highest expected value.
29. Why is initialization of the value function important? It can affect the speed of convergence but not the final result.
30. How is DP related to Bellman equations? Bellman equations form the basis of all DP updates.

Module 5 – Monte Carlo (MC) & Temporal-Difference (TD) Learning
Key Topics Covered:

Monte Carlo Prediction: Estimate value functions using sample returns from complete episodes.

Monte Carlo Estimation of Action Values: Learn the action-value function Q(s, a) without a model.

Monte Carlo Control: Improves the policy using estimates from episodes.

TD Prediction: Learns the value function using bootstrapping from current estimates.

TD Control using Q-Learning: Off-policy TD control that learns Q(s, a) from action-value estimates.

Difference Between MC & TD

Updates: MC updates at the end of an episode; TD updates step-by-step.

Bootstrapping: MC does not bootstrap; TD does.

Sampling: both learn from sampled experience.

Efficiency: MC is less efficient; TD is more efficient.

Variance: MC has high variance; TD has low variance.

Bias: MC estimates are unbiased; TD estimates are biased.

Algorithm Overviews

Monte Carlo Prediction

• Used to estimate V(s) by averaging returns from complete episodes.

Monte Carlo Estimation of Q(s, a)

• Similar to V(s), but tracks (state, action) returns.


Monte Carlo Control

• Uses ε-greedy policy improvement.

• Learns optimal policy through episodic sampling.
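
A minimal first-visit Monte Carlo prediction sketch (not from the notes). It assumes each episode is supplied as a list of (state, reward) pairs, where the reward is the one received after leaving that state:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=0.9):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = {}
    for episode in episodes:
        states = [s for s, _ in episode]
        G = 0.0
        # Walk backwards so G accumulates the discounted return from each step.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = r + gamma * G
            if s not in states[:t]:          # first-visit check
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```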

30 Viva Questions & Answers – Module 5 (MC & TD Learning)

1. What is the Monte Carlo method in RL? A method that learns from complete episodes using sample returns.
2. What is the difference between MC and TD? MC waits for episodes to end; TD updates at each step using bootstrapping.
3. What is MC prediction? Estimating the value function by averaging observed returns.
4. Define return (G) in MC. The cumulative discounted reward from a state.
5. What is the first-visit MC method? It updates the value only on the first occurrence of a state in an episode.
6. What is every-visit MC? It updates the value every time the state appears in an episode.
7. How does MC estimate the action-value Q(s, a)? By averaging returns from all episodes starting with (s, a).
8. Is MC model-free or model-based? Model-free – it doesn’t require transition probabilities.
9. What is policy improvement in MC control? Using an ε-greedy policy to improve based on value estimates.
10. What is TD learning? A method that updates value functions using estimates from the current step.
11. Define TD(0). A simple form of TD learning where the update is made from a single step.
12. Write the TD(0) update formula. V(s) ← V(s) + α [r + γ V(s') - V(s)]
13. What is bootstrapping? Updating estimates based on other estimates rather than actual returns.
14. What is Q-learning? A TD control method that learns optimal Q-values off-policy.
15. What is the Q-learning update rule? Q(s,a) ← Q(s,a) + α [r + γ max Q(s’,a’) – Q(s,a)]
16. Is Q-learning on-policy or off-policy? Off-policy – it learns about the greedy policy while following a different one.
17. What is an ε-greedy policy? A policy that chooses the best action most of the time but explores randomly with probability ε.
18. What’s the role of the learning rate α? It determines how much new information overrides old value estimates.
19. What are the pros of TD over MC? TD can learn faster, online, and doesn’t need episode termination.
20. Can TD and MC be combined? Yes – in algorithms like TD(λ), which combine both ideas.
21. What is the exploration-exploitation tradeoff? Balancing trying new actions vs exploiting the known best ones.
22. What is off-policy learning? Learning about one policy while behaving according to another.
23. What is on-policy learning? Learning about the same policy that is being followed.
24. Why is Q-learning called off-policy? It learns about the greedy policy while acting ε-greedy.
25. What’s the terminal state? A state where the episode ends.
26. Which methods can handle continuing tasks? TD methods, especially TD(0), can handle them.
27. Which method has higher variance – MC or TD? Monte Carlo has higher variance.
28. Why is the MC method not always suitable for real-time learning? It requires the episode to finish before updating values.
29. Name a real-world example using MC methods. Blackjack value estimation based on full games.
30. Name a use case of Q-learning. Game AI, elevator control, traffic light control.

Module 6 – Applications and Case Studies
Overview

This module explores how Reinforcement Learning (RL) can be applied in real-world
decision-making problems where the environment is dynamic, and outcomes depend on
sequential actions.

Key Applications Covered:

Elevator Dispatching: Optimize elevator movements to reduce wait time and energy consumption.

Dynamic Channel Allocation: Allocate frequency channels in wireless networks to maximize bandwidth and reduce interference.

Job-Shop Scheduling: Schedule machines and jobs in a factory to minimize production time and costs.

Detailed Application Insights

Elevator Dispatching

• Objective: Minimize passenger wait time and energy cost.

• State: Elevator position, direction, number of people waiting.

• Action: Move up, move down, stop.

• Reward: Negative for waiting time, positive for servicing requests.
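
A minimal, purely illustrative sketch (not from the notes) of how this state/action/reward description could be framed as an environment for a tabular agent; the class name, reward weights, and dynamics are all assumptions and ignore details such as travel direction and new arrivals.

```python
import random

class ElevatorEnv:
    ACTIONS = ["up", "down", "stop"]

    def __init__(self, floors=5):
        self.floors = floors
        self.reset()

    def reset(self):
        self.position = 0
        self.waiting = [random.randint(0, 2) for _ in range(self.floors)]
        return self._state()

    def _state(self):
        # State: elevator position plus the number of people waiting per floor.
        return (self.position, tuple(self.waiting))

    def step(self, action):
        if action == "up":
            self.position = min(self.position + 1, self.floors - 1)
        elif action == "down":
            self.position = max(self.position - 1, 0)
        # Reward: +1 per passenger served when stopping, -0.1 per passenger
        # still waiting anywhere (assumed weights).
        served = self.waiting[self.position] if action == "stop" else 0
        self.waiting[self.position] -= served
        reward = served - 0.1 * sum(self.waiting)
        done = sum(self.waiting) == 0
        return self._state(), reward, done
```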

Dynamic Channel Allocation

• Objective: Efficient use of wireless spectrum with minimal interference.

• State: Channel usage, user demands, interference levels.

• Action: Assign a channel to a user.

• Reward: High throughput and minimal collisions.

Job-Shop Scheduling

• Objective: Schedule jobs across machines to minimize makespan (total time).

• State: Machine status, job queue, time elapsed.

• Action: Assign job to machine.


• Reward: Negative time penalties, bonuses for early finishes.

Common RL Techniques Used

• Q-Learning

• SARSA

• Deep Q-Networks (DQN)

• Policy Gradient Methods

30 Viva Questions & Answers – Module 6

1. What is elevator dispatching in RL? A control problem where RL is used to optimize elevator movements based on requests.
2. What is the objective in elevator dispatching? Minimize waiting time and energy usage.
3. How is state defined in elevator dispatching? Elevator location, direction, number of requests, and floor status.
4. What are actions in the elevator dispatching problem? Move up, move down, stay idle, or stop at a floor.
5. What kind of reward function is used in elevator dispatching? Negative rewards for wait time and idle movement, positive rewards for serving passengers.
6. How can RL help in multi-elevator systems? It can learn coordination strategies to reduce overall system wait time.
7. What is dynamic channel allocation? Assigning communication channels in a wireless network dynamically based on current demands.
8. Why is dynamic channel allocation needed? To manage spectrum efficiently and avoid interference.
9. How is RL applied in channel allocation? It learns which channels to assign to which users based on past interference and usage.
10. What does the state include in channel allocation? Current usage, available channels, interference levels.
11. What are the actions in channel allocation? Assigning a specific channel to a particular user.
12. What is the reward in channel allocation? High data rate with minimal interference.
13. Which RL algorithms are commonly used in channel allocation? Q-learning and Deep Q-Networks (DQN).
14. Define job-shop scheduling. A scheduling problem where jobs are assigned to different machines in an optimal sequence.
15. What is the goal in job-shop scheduling? Minimize total production time (makespan) and delays.
16. What is a state in job-shop scheduling? The current status of machines and pending jobs.
17. What are actions in job-shop scheduling? Assigning a job to an available machine.
18. How does RL benefit job-shop scheduling? It adapts to dynamic job arrivals and machine conditions.
19. What is the reward in job-shop scheduling? A completion bonus or penalty based on job timing.
20. Can deep reinforcement learning be used for job scheduling? Yes, DQN and Actor-Critic methods can be used.
21. What are the advantages of using RL in real-time applications? Real-time decision-making, adaptability, and optimization.
22. What is the main challenge in applying RL to real-world problems? Large state/action spaces, data scarcity, and delayed rewards.
23. Which is more suitable for continuous problems: MC or TD? TD methods, because they don’t require episodes to end.
24. What is a practical example of elevator dispatching using RL? Smart elevators in office buildings.
25. Which algorithm would you choose for job-shop scheduling and why? Q-Learning or DQN for scalability and adaptiveness.
26. Can RL work without a simulator in such applications? Yes, using real-time interaction or historical data logs.
27. What is the role of exploration in these applications? It allows the agent to try new strategies for better outcomes.
28. How does RL handle uncertainty in dynamic environments? By updating value functions based on new observations.
29. How is reward shaped in industrial RL use cases? Using task-specific metrics like cost, time, energy, etc.
30. What’s a major benefit of RL in automation? It provides adaptive control without manually programmed rules.
