Course Code: 21CSE417T
Course Name: REINFORCEMENT LEARNING TECHNIQUES
Course Category: E (Professional Elective)
L-T-P-C: 2-1-0-3
Pre-requisite Courses: Nil    Co-requisite Courses: Nil    Progressive Courses: Nil
Course Offering Department: School of Computing    Data Book / Codes / Standards: Nil
Course Learning Rationale (CLR): The purpose of learning this course is to:
CLR-1: introduce the fundamentals of Reinforcement Learning
CLR-2: illustrate model-based prediction and control using dynamic programming
CLR-3: illustrate model-free prediction and control
CLR-4: introduce planning and learning with tabular methods
CLR-5: explain approximation of a value function

Program Outcomes (PO): PO-1 Engineering Knowledge, PO-2 Problem Analysis, PO-3 Design/Development of Solutions, PO-4 Conduct Investigations of Complex Problems, PO-5 Modern Tool Usage, PO-6 The Engineer and Society, PO-7 Environment & Sustainability, PO-8 Ethics, PO-9 Individual & Team Work, PO-10 Communication, PO-11 Project Mgt. & Finance, PO-12 Life Long Learning
Program Specific Outcomes (PSO): PSO-1, PSO-2, PSO-3
Course Outcomes (CO): At the end of this course, learners will be able to:
CO-1: understand basic concepts of reinforcement learning (PO-1: 3, PO-2: 2, PO-4: 2, PSO-3: 2)
CO-2: perform model-based prediction and control using dynamic programming (PO-1: 3, PO-2: 3, PO-4: 3, PSO-3: 2)
CO-3: apply model-free prediction and control (PO-1: 3, PO-2: 3, PO-4: 3, PSO-3: 3)
CO-4: comprehend the use of tabular methods (PO-1: 3, PO-2: 3, PO-4: 3, PSO-3: 3)
CO-5: understand how a value function can be approximated (PO-1: 3, PO-2: 3, PO-4: 3, PSO-3: 3)
Unit-1 - Introduction 9 Hour
Introduction to Reinforcement Learning, examples - Elements of reinforcement learning - Limitations and Scope - An extended example - multi-armed bandits - k-armed bandit problem - action-value methods - the
10-armed testbed - incremental implementation - tracking a nonstationary problem - optimistic initial values - upper-confidence-bound action selection - associative search (contextual bandits)
T1: Implementing the 10-armed testbed
T2: Comparing performance for different values of ε
T3: Upper-confidence-bound action selection performance comparison with ε-greedy
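A minimal Python sketch (NumPy assumed; not part of the prescribed syllabus) of the kind of experiment T1 and T2 call for: an ε-greedy agent with incremental sample-average updates on a 10-armed testbed, compared across a few illustrative ε values.

```python
# Hypothetical sketch of T1/T2: epsilon-greedy action-value estimation on a
# 10-armed testbed with incremental sample-average updates.
import numpy as np

def run_bandit(k=10, steps=1000, epsilon=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    q_true = rng.normal(0.0, 1.0, k)          # true action values q*(a)
    q_est = np.zeros(k)                       # estimates Q(a)
    counts = np.zeros(k)                      # N(a)
    rewards = np.empty(steps)
    for t in range(steps):
        if rng.random() < epsilon:            # explore
            a = int(rng.integers(k))
        else:                                 # exploit
            a = int(np.argmax(q_est))
        r = rng.normal(q_true[a], 1.0)        # reward ~ N(q*(a), 1)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # incremental implementation
        rewards[t] = r
    return rewards

# T2-style comparison: average reward over 200 independent bandit problems
for eps in (0.0, 0.01, 0.1):
    avg = np.mean([run_bandit(epsilon=eps, rng=np.random.default_rng(s)).mean()
                   for s in range(200)])
    print(f"epsilon={eps}: mean reward {avg:.3f}")
```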
Unit-2 - Markov Decision Process and Model-Based Prediction and Control 9 Hour
Finite Markov Decision Process - The Agent–Environment Interface - Goals and Rewards - Returns and Episodes - Unified Notation for Episodic and Continuing Tasks - Policies and Value Functions - Optimal
Policies and Optimal Value Functions - Optimality and Approximation - Dynamic Programming - Policy Evaluation (Prediction) - Policy Improvement - Policy Iteration - Value Iteration - Generalized Policy Iteration -
Efficiency of Dynamic Programming - Asynchronous Dynamic Programming
T4: MDP for Recycling Robot
T5: Policies and value functions for Gridworld example
T6: Policy evaluation for Gridworld example
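As one possible illustration of T5/T6 (a sketch, not the prescribed solution), the following Python/NumPy code runs iterative policy evaluation for the equiprobable random policy on a 4x4 gridworld with reward -1 per step and terminal states in two opposite corners; the grid size and stopping threshold are assumptions.

```python
# Hypothetical sketch of T5/T6: iterative policy evaluation on a 4x4 gridworld.
import numpy as np

N = 4
TERMINAL = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    if s in TERMINAL:
        return s, 0.0
    r, c = s[0] + a[0], s[1] + a[1]
    if not (0 <= r < N and 0 <= c < N):       # bumping a wall leaves the state unchanged
        r, c = s
    return (r, c), -1.0

def policy_evaluation(theta=1e-4, gamma=1.0):
    V = np.zeros((N, N))
    while True:
        delta = 0.0
        for row in range(N):
            for col in range(N):
                s = (row, col)
                if s in TERMINAL:
                    continue
                v_new = sum(0.25 * (rew + gamma * V[s2])      # equiprobable random policy
                            for s2, rew in (step(s, a) for a in ACTIONS))
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new                   # in-place sweep
        if delta < theta:
            return V

print(np.round(policy_evaluation(), 1))
```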
Unit-3 - Model-Free Prediction and Control 9 Hour
Model-free learning - Model-free prediction - Monte Carlo methods - Monte Carlo Prediction - Monte Carlo Estimation of Action Values - Temporal-Difference Learning - TD Prediction - Advantages of TD Prediction
Methods - Optimality of TD(0) - n-step Bootstrapping - n-step TD Prediction - n-step Sarsa - Model-free control - Monte Carlo Control - Monte Carlo Control without Exploring Starts - Off-policy learning - Importance
sampling - Off-policy Monte Carlo Control - Sarsa: On-policy TD Control - Q-learning: Off-policy TD control
T7: Monte Carlo Policy Evaluation for Blackjack
T8: TD Prediction for Driving Home example
T9: Sarsa vs Q-learning using Cliff Walking example
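A minimal sketch of T9 (Python/NumPy assumed; the grid dimensions, step size, and ε are illustrative choices, not specified by the syllabus): tabular Sarsa and Q-learning on a cliff-walking gridworld, differing only in the bootstrap target.

```python
# Hypothetical sketch of T9: Sarsa vs Q-learning on 4x12 cliff walking.
# Stepping onto the cliff gives -100 and resets the agent to the start.
import numpy as np

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    r = min(max(s[0] + a[0], 0), ROWS - 1)
    c = min(max(s[1] + a[1], 0), COLS - 1)
    if r == 3 and 0 < c < 11:                 # fell off the cliff
        return START, -100.0, False
    return (r, c), -1.0, (r, c) == GOAL

def eps_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(Q[s]))

def train(method="q_learning", episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, len(ACTIONS)))
    for _ in range(episodes):
        s = START
        a = eps_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, done = step(s, ACTIONS[a])
            a2 = eps_greedy(Q, s2, eps, rng)
            if method == "q_learning":        # off-policy target: greedy over Q(s', .)
                target = r + gamma * np.max(Q[s2]) * (not done)
            else:                             # Sarsa: on-policy target uses the action taken next
                target = r + gamma * Q[s2][a2] * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q

for m in ("q_learning", "sarsa"):
    Q = train(m)
    print(m, "greedy value at start:", round(float(np.max(Q[START])), 1))
```

Sarsa bootstraps on the action actually taken next (on-policy), while Q-learning bootstraps on the greedy action (off-policy), which is why the two methods learn different routes along the cliff.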
Unit-4 - Planning and Learning with Tabular Methods 9 Hour
Models and planning - Dyna: Integrated Planning, Acting and Learning - When the model is wrong - Prioritized Sweeping - Real-time Dynamic Programming - Monte Carlo Tree Search
T10: Simple maze using Dyna-Q
T11: Prioritized sweeping on Maze example
T12: Real-time Dynamic Programming for Racetrack example
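A minimal sketch of T10 (Python/NumPy assumed; the maze layout, +1 goal reward, and number of planning updates are illustrative assumptions): Dyna-Q interleaves direct Q-learning updates from real experience with planning updates drawn from a learned deterministic model.

```python
# Hypothetical sketch of T10: Dyna-Q on a small deterministic maze.
import numpy as np

ROWS, COLS = 6, 9
START, GOAL = (2, 0), (0, 8)
WALLS = {(1, 2), (2, 2), (3, 2), (4, 5), (0, 7), (1, 7), (2, 7)}   # assumed layout
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    r = min(max(s[0] + a[0], 0), ROWS - 1)
    c = min(max(s[1] + a[1], 0), COLS - 1)
    if (r, c) in WALLS:
        r, c = s
    return (r, c), (1.0 if (r, c) == GOAL else 0.0), (r, c) == GOAL

def dyna_q(episodes=50, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, len(ACTIONS)))
    model = {}                                   # (state, action) -> (reward, next state)
    for _ in range(episodes):
        s, done = START, False
        while not done:
            a = (int(rng.integers(len(ACTIONS))) if rng.random() < eps
                 else int(np.argmax(Q[s])))
            s2, r, done = step(s, ACTIONS[a])
            Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s][a])
            model[(s, a)] = (r, s2)              # deterministic model update
            for _ in range(n_planning):          # planning with simulated experience
                (ps, pa), (pr, ps2) = list(model.items())[rng.integers(len(model))]
                # Q at the goal stays zero, so no terminal flag is needed here.
                Q[ps][pa] += alpha * (pr + gamma * np.max(Q[ps2]) - Q[ps][pa])
            s = s2
    return Q

Q = dyna_q()
print("Greedy value at start:", round(float(np.max(Q[START])), 3))
```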
Unit-5 - Value Function Approximation 9 Hour
On-policy Prediction with Approximation - Value Function Approximation - The Prediction Objective (VE) - Stochastic-gradient and Semi-gradient Methods - Linear Methods - Least-Squares TD
T13: State aggregation on the 1000-state Random Walk
T14: Bootstrapping on the 1000-state Random Walk
T15: Least squares TD example
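A minimal sketch of T13 (Python/NumPy assumed; step size and episode count are illustrative): semi-gradient TD(0) with state aggregation on the 1000-state random walk, where each group of 100 states shares one weight.

```python
# Hypothetical sketch of T13: semi-gradient TD(0) with state aggregation on
# the 1000-state random walk (start at 500; each step jumps up to 100 states
# left or right; terminating off the left end gives -1, off the right end +1).
import numpy as np

N_STATES, START, GROUP = 1000, 500, 100       # 10 groups of 100 states each

def group(s):
    return (s - 1) // GROUP                   # aggregated feature index

def semi_gradient_td0(episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES // GROUP)           # one weight per group
    for _ in range(episodes):
        s = START
        while True:
            jump = int(rng.integers(1, GROUP + 1)) * (1 if rng.random() < 0.5 else -1)
            s2 = s + jump
            if s2 < 1:                        # terminated off the left end
                r, v_next, done = -1.0, 0.0, True
            elif s2 > N_STATES:               # terminated off the right end
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, w[group(s2)], False
            w[group(s)] += alpha * (r + gamma * v_next - w[group(s)])   # semi-gradient TD(0)
            if done:
                break
            s = s2
    return w

print(np.round(semi_gradient_td0(), 2))       # approximate value of each group
```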
Learning Resources
1. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd edition, The MIT Press, 2015.
2. Martijn van Otterlo, Marco Wiering, Reinforcement Learning: State-of-the-Art, Springer-Verlag Berlin Heidelberg, 2012.
3. Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd edition, Pearson, 2015.
4. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press Ltd., 2016.
5. https://deepmind.com/learning-resources/-introduction-reinforcement-learning-david-silver
6. Reinforcement Learning with MATLAB, MathWorks Inc., 2020.
Learning Assessment
                          Continuous Learning Assessment (CLA)                         Summative
Bloom's                   Formative                      Life-Long Learning            Final Examination
Level of Thinking         CLA-1 Average of unit test     CLA-2                         (40% weightage)
                          (50%)                          (10%)
                          Theory      Practice           Theory      Practice          Theory      Practice
Level 1  Remember         40%         -                  40%         -                 40%         -
Level 2  Understand       40%         -                  40%         -                 40%         -
Level 3  Apply            20%         -                  20%         -                 20%         -
Level 4  Analyze          -           -                  -           -                 -           -
Level 5  Evaluate         -           -                  -           -                 -           -
Level 6  Create           -           -                  -           -                 -           -
Total                     100%                           100%                          100%
Course Designers
Experts from Industry: Mr. Ghulam Ahmed Ansari, Applied Research Engineer, LinkedIn
Experts from Higher Technical Institutions: Dr. Manikantan Srinivasan, Adjunct Faculty, CSE, IIT Madras
Internal Experts: Dr. Saad Y. Sait, SRMIST