

Dlvu Lecture09

Uploaded by

Awatef Messaoudi
Copyright
© All Rights Reserved

Lecture 9: Reinforcement Learning

Emile van Krieken


Deep Learning 2020

dlvu.github.io

REINFORCEMENT LEARNING

Reinforcement learning
• Train agent to act in an environment by maximizing reward
Comparison
• Supervised learning: Exact action given
• Unsupervised learning: No reward given
Deep Reinforcement Learning
• Agents use neural network policies

In Reinforcement Learning, we try to learn a policy of an agent that acts in an
environment. The agent should act ‘optimally’, in the sense that it should
maximize rewards it receives from the environment.

This is different from supervised learning, where agents are told exactly what
is the best action to perform. The agent just receives a reward from the
environment, but no feedback as to what action leads to good rewards.

Furthermore, it’s also not like unsupervised learning, where no explicit
feedback on how to act is given. We do receive rewards!

REINFORCEMENT LEARNING LOOP

[Figure: loop diagram with boxes "environment", "policy \pi_\theta" and "learner", connected by arrows labelled "action a_t", "state s_t" and "reward r_t".]

This is the standard reinforcement loop that most approaches follow. We
have an environment, like a game, or the real world in which a robot has to
act. We assume it is black-box, which means that we have no information
about what the environment looks like. The environment receives actions
from the agent, which follows a policy. A policy is a probability distribution
that suggests which action to take in some state. Of course, it receives the
state again from the environment.

This loop happens for every timestep t. The environment presents a state
s_t to the policy \pi_\theta, which then chooses an action a_t. The
environment does some magic, and chooses the next state s_{t+1}. It also
returns a reward r_{t+1} for that time step, which could be received
because the agent achieves some goal in the environment. This reward is
then used in the learner to update the policy parameters.
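The loop described in these notes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the lecture: the `CorridorEnv` environment, its `reset` and `step` methods, and the fixed `policy_probs` function are hypothetical names invented here. The learner is left as a comment, because the update rule has not been introduced yet.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk along positions 0..4.

    Reaching position 4 ends the episode and yields reward 1.
    """

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state s_0

    def step(self, action):
        # action 0 = left, action 1 = right; the left wall blocks movement
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done  # s_{t+1}, r_{t+1}, terminal flag


def policy_probs(state):
    """Hypothetical fixed policy: a distribution over (left, right)."""
    return [0.3, 0.7]


def run_episode(env, rng, max_steps=1000):
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        probs = policy_probs(state)                     # pi_theta(. | s_t)
        action = rng.choices([0, 1], weights=probs)[0]  # sample a_t
        state, reward, done = env.step(action)          # environment responds
        total_reward += reward
        # A learner would use `reward` here to update the policy parameters.
        if done:
            break
    return total_reward, t + 1


rng = random.Random(0)
total_reward, steps = run_episode(CorridorEnv(), rng)
```

Note that the agent only ever sees states and rewards; the inside of `CorridorEnv.step` plays the role of the black box.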
CART POLE

[Figure: cart-pole version of the loop, with boxes "physics engine", "policy \pi_\theta" and "learner", and arrows labelled "left or right", "angle of pole" and "reward: 1 if pole upright, 0 otherwise".]
https://medium.com/@tuzzer/cart-pole-balancing-with-q-learning-b54c6068d947

Here, we present an example of an RL environment: Cart pole balancing. Our
agent, the funny object on the left of the screen, is a cart that has to balance
the wooden stick, the pole, so that it remains upright.

To encode the state, it is sufficient to pass the angle of the pole, though
additional features can be thought of! Our agent uses its policy to decide
whether the cart should be moved to the left or the right, so the pole remains
upright.

It receives a positive reward if the pole is upright, and otherwise receives no
reward. The reward is used in the learner to update the parameters.
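To make the cart-pole setup concrete, here is a small sketch of the reward and of a policy that only looks at the pole angle. Everything specific in it is an assumption made for illustration: the threshold that decides when the pole counts as "upright", the sign convention of the angle, and the one-parameter logistic policy `prob_right` are not taken from the lecture.

```python
import math
import random

UPRIGHT_THRESHOLD = 0.2  # radians; an assumed cut-off, not from the lecture


def reward(angle):
    """Return 1 if the pole counts as upright, 0 otherwise."""
    return 1.0 if abs(angle) < UPRIGHT_THRESHOLD else 0.0


def prob_right(angle, theta):
    """Hypothetical one-parameter policy: P(move right | angle)."""
    return 1.0 / (1.0 + math.exp(-theta * angle))


def choose_action(angle, theta, rng):
    """Sample 'right' with probability prob_right, otherwise 'left'."""
    return "right" if rng.random() < prob_right(angle, theta) else "left"


rng = random.Random(0)
angle = 0.05  # pole leaning slightly to the right (sign convention assumed)
action = choose_action(angle, theta=10.0, rng=rng)
```

With a positive `theta`, this policy tends to push the cart in the direction the pole is leaning. The learner's job would be to find a good value for `theta` from the rewards alone.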

THIS LECTURE

Simplest (Deep) RL setting
• Only neural network policies
• Episodic
• Online
• Model-free
Gradient estimation focus

In this lecture, we will focus on the simplest of Deep Reinforcement Learning
settings. We will only use neural networks to model our policies, and not look
at the tabular reinforcement learning setting or different machine learning
models.

We will assume our environment is episodic. This means that our agent acts
in the environment for a finite number of timesteps. There is some state that
is the terminal state: From this point, nothing happens anymore! There are
also non-episodic environments in which there is never an end to the agent
acting and receiving rewards. This is however much less common in practice.

Secondly, we will be assuming an online RL setting: Here, we train the agent
while it interacts with the environment. Other settings, like offline RL, exist,
where instead we receive a batch of data of another agent acting in the
environment, and we should find an agent that acts optimally just from
inspecting this batch of data. This is a very challenging setting, and an active
research area!

We will also only look at model-free methods for now. There are also model-
based methods to RL, where in addition to learning the policy, we also learn a
model of the environment! This is also an exciting and active research area.

In this lecture, we will attempt to explain Deep RL with a focus on gradient
estimation. We will go into more detail later on what exactly this is, but we have
seen an example of gradient estimation before in our course: The
reparameterization in VAEs!

This lecture will be introducing RL and explaining the basic Deep RL algorithm,
while the second lecture on RL (lecture 12) will focus on more recent popular
methods for Deep RL.
DEEPMIND ATARI

https://www.youtube.com/watch?v=VCdxqn0fcnE
source: https://www.youtube.com/watch?v=V1eYniJ0Rnk

One benefit of RL is that a single system can be developed for
many different tasks, so long as the interface between the world
and the learner stays the same. Here is a famous experiment by
DeepMind, the company behind AlphaGo. The environment is an
Atari simulator. The state is a single image, containing everything
that can be seen on the screen. The actions are the four possible
movements of the joystick and the pressing of the fire button. The
reward is determined by the score shown on the screen.

The amazing thing here is that the system was not pre-programmed
with any knowledge of any of the games. For several of the games
the system learned to play the game better than the top human
performance.

ATARI POLICY

[Figure: CNN policy network \pi_\theta(a|s), taking a game frame s as input and producing an action a (here "Left"). Source: Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).]

Here, we see an example of what a Deep RL policy might look like. Like we
mentioned, the states in our Atari game are simple images, so, a sensible idea
is to use a CNN! This CNN policy takes the current state of the game, does a
lot of hard neural network computation, and computes a probability
distribution over actions! Since we have a finite set of actions, we can use a
softmax output layer to create a categorical distribution over actions. Then,
we sample an action to perform from this distribution.

We won’t discuss policy network architectures any further in this lecture. For
choosing network architecture, generally the same recommendations apply
as for normal deep learning: Use CNNs for states represented as images,
Graph Neural Networks for states represented as graphs, RNNs or
transformers for text, etc.

The figure is from the original DQN paper in Nature (Mnih et al., 2015).
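The last step of such a policy, turning the network outputs into a categorical distribution with a softmax and sampling an action from it, can be sketched as follows. This is an illustration only: the CNN itself is omitted, the `logits` are made-up numbers standing in for its outputs, and the action labels in `ACTIONS` are assumed names for the four joystick movements and the fire button.

```python
import math
import random

ACTIONS = ["left", "right", "up", "down", "fire"]  # assumed labels


def softmax(logits):
    """Turn a list of real-valued scores into probabilities."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Sample one action from the categorical distribution softmax(logits)."""
    probs = softmax(logits)
    index = rng.choices(list(range(len(probs))), weights=probs)[0]
    return ACTIONS[index], probs


logits = [2.0, 0.5, 0.1, 0.1, -1.0]  # made-up stand-in for the CNN outputs
rng = random.Random(0)
action, probs = sample_action(logits, rng)
```

Because the action is sampled rather than chosen by taking the largest probability, the same state can lead to different actions on different runs.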

TRAJECTORIES

• Components of RL:
  • Actions a_t
  • States s_t
  • Rewards r_t
• These are random variables!

Trajectories: \tau = s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots, a_{T-1}, r_T, s_T
Initial state s_0
Terminal state s_T

As we have seen before, reinforcement learning settings are modelled using
actions, states and rewards. Note that RL settings are stochastic: This means
these three components are random variables.

Furthermore, we have trajectories, which are a full rollout of the agent
interacting with the environment and receiving rewards. It starts at the initial
state s_0, and ends in the terminal state s_T, from which the agent stops
interacting with the environment.
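To see the ordering s_0, a_0, r_1, s_1, ..., a_{T-1}, r_T, s_T in practice, the sketch below samples one trajectory and stores it as a flat list. The dynamics in `env_step`, the uniform random policy and the horizon `HORIZON` are hypothetical choices made only for this example.

```python
import random

HORIZON = 3  # assumed episode length T for this toy example


def env_step(state, action, rng):
    """Hypothetical stochastic dynamics: returns (next_state, reward)."""
    next_state = state + 1  # here the state simply counts timesteps
    reward = 1.0 if action == 1 and rng.random() < 0.5 else 0.0
    return next_state, reward


def sample_trajectory(rng):
    """Roll out one episode and return tau as a flat list."""
    state = 0                           # initial state s_0
    tau = [state]
    while state < HORIZON:              # stop once the terminal state s_T is reached
        action = rng.choice([0, 1])     # a_t from a uniform random policy
        state, reward = env_step(state, action, rng)
        tau += [action, reward, state]  # append a_t, r_{t+1}, s_{t+1}
    return tau


tau = sample_trajectory(random.Random(0))
states = tau[0::3]   # s_0, ..., s_T
actions = tau[1::3]  # a_0, ..., a_{T-1}
rewards = tau[2::3]  # r_1, ..., r_T
```

Calling `sample_trajectory` with a different seed generally gives a different trajectory, which is what it means for the actions and rewards to be random variables.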
ATARI TRAJECTORY

[Figure: three consecutive game frames, annotated with rewards r_{t+1} = 0 and r_{t+2} = 1.]

To show this setup, let’s again look at the Atari example. We see three states,
represented using images. Each of these goes into our CNN policy, and we
sample an action from its output, in this case left (we see the bat go to the
left). The ball hits a block on the third picture, which increases the score by
one! That is a positive reward.
<latexit sha1_base64="gfi6SL5j681Wde1MvJUNdF/DmQ4=">AAAOW3icfZdNb9s2HMbV7q3LmizdsNMOExYU6IYgkBLHL4cCtWQbPaxtFuRti4KAomlZMCUSFOXY1XTcp9l1+zAD9mFGyY4sUZR1yT98Hj7+6S9RIF2K/Ygbxr9Pnn7y6Weff/Hsy52vnu/ufb3/4puriMQMoktIMGE3LogQ9kN0yX2O0Q1lCAQuRtfuzM706zlikU/CC76k6C4AXuhPfAi4GLrf/8Gh/r3j8ini4JUDYD7K/3AiDji65z/d7x8YR0Z+6fXCXBcH2vo6u3+xe+CMCYwDFHKIQRTdmgbldwlg3IcYpTtOHCEK4Ax46Dbmk+5d4oc05iiEqf5SaJMY65zoGaw+9hmCHC9FASDzRYIOp4AJTHFLO9WoCIUgQNHheO7TaFVGc29VcCD6cZcs8n6llYmJxwCd+nBRIUtAEAWAT2uD0TJwq4MoxojNg+pgRikYJecCMehHWQ/ORGM+0KzZ0QU5W+vTJZ2iMEqTmOG0PFEIiDE0ERPzMkI8pkl+M+K5z6LXnMXoMCvzsdcDwGbnaHwocioDVZwJJoBXh9wga06IHiAJAhCOE4emicPRgifO4VEqxJe6i4XZJYCNq87zNEmcrGeuq58La0V8XxLfy+KwJA7XP0LwWJ8Qps/F8ycs0oVRFxbmQxRVZ18Wsyf6pRx9VRKvZPG6JF7LohuX1LimzkvqvKY+lNQHWV2UxIUsLkviUhY/lsSPsnizLfa3bbG/S7Gi/2JRLHecMZqIz0r+CiUzmCZvL979kia9/Fq/CzHSzaoRuo/Gk1G70+uksowf9daoa1qDul4YOlbftFWGwtHv2MZguGE5lrwFtGG0h/22HAXxRu/27FFd38Aa/c7AUhg2tCO7Neys24dQKFm9wmf32q1aW7wip2dZ1mmvrhcGq2Xb3WOFoXDYg8Ggb+coNGYUI8lLH43t9qlRj6JFUNdot/oKffMAjK5l1WBpicUaWebAzFk4Alhy8uJlsbv93lAO4pv+W33brj1AXmp/1zYHLYVhw3o6OB2e5CSEgdCTu0KK9rU73ZOuHEWKoFFHPMEaC9n80qhnGZ0aCymxjCzbyhpbXYlikd2ad8nqk7tZd/qBqaepbBYLTTZna+/RLHmxwoyb3Ur7Nr96Am6Er98phE3xUBEOG2GgigU2w0MlPNwG79X9XlO8pwj3GmE8FYvXDO8p4b1t8LTup03xVBFOG2GoioU2w1MlPN0Gz+t+3hTPFeG8EYarWHgzPFfC823wpO4nTfFEEU4aYYiKhTTDEyU82QLP0IPY8mU7BfF2JawWuTo65DrECajp+YEilwlOopq8OoFkuhsIpvyfugfEhUOUsj72I0jikK8hMU4cDwgtXe1paHYEAVjPtvAE63643uVUPsA0mzyDYuO7cq8aMUDiKMPQO7FH+iD230BsSX8Wt8y8wBe3LP46h1m1zQgWj0ZR7YhjlSkfourF9fGR2ToyzV9bB2/a6xPWM+177UftlWZqHe2N9lY70y41qP2p/aX9rf2z+9/e072dvecr69Mn6znfapVr77v/ARAATx0=</latexit> <latexit 
sha1_base64="LQ65ew/VbtFCIOFTc/0FK50zRhc=">AAAOY3icfZfdbts2GIbV7q/LliztdjYMEBYU67YgkBLHPwcFask2erC2WZA02aIgoGhaFkyJBEU5djVdwq5mp9uF7HwXMkp2ZImUrBN/5vvy9aNPokG6FPsRN4x/Hz3+6ONPPv3syec7X3y5u/fV/tNn7yMSM4guIcGEXbsgQtgP0SX3OUbXlCEQuBhduTM706/miEU+CS/4kqLbAHihP/Eh4GLobv8Hh/p3jsuniIMXDoD5aMJ/NtM/nIgDjlZffrzbPzCOjPzS1cJcFwfa+jq7e7p74IwJjAMUcohBFN2YBuW3CWDchxilO04cIQrgDHjoJuaT7m3ihzTmKISp/lxokxjrnOgZtD72GYIcL0UBIPNFgg6ngAlccWs71agIhSBA0eF47tNoVUZzb1VwIPpymyzyvqWViYnHAJ36cFEhS0AQBYBPlcFoGbjVQRRjxOZBdTCjFIySc4EY9KOsB2eiMe9o1vTogpyt9emSTlEYpUnMcFqeKATEGJqIiXkZIR7TJL8Z8fxn0UvOYnSYlfnYywFgs3M0PhQ5lYEqzgQTwKtDbpA1J0T3kAQBCMeJQ9PE4WjBE+fwKBXic93FwuwSwMZV53maJE7WM9fVz4W1Ir4tiW9lcVgSh+sfIXisTwjT5+L5ExbpwqgLC/MhiqqzL4vZE/1Sjn5fEt/L4lVJvJJFNy6psaLOS+pcUe9L6r2sLkriQhaXJXEpix9K4gdZvN4W+9u22N+lWNF/sSiWO84YTcTfS/4KJTOYJq8v3vySJr38Wr8LMdLNqhG6D8aTUbvT66SyjB/01qhrWgNVLwwdq2/adYbC0e/YxmC4YTmWvAW0YbSH/bYcBfFG7/bskapvYI1+Z2DVGDa0I7s17Kzbh1AoWb3CZ/faLaUtXpHTsyzrtKfqhcFq2Xb3uMZQOOzBYNC3cxQaM4qR5KUPxnb71FCjaBHUNdqtfo2+eQBG17IUWFpisUaWOTBzFo4Alpy8eFnsbr83lIP4pv9W37aVB8hL7e/a5qBVY9iwng5Ohyc5CWEg9OSukKJ97U73pCtHkSJo1BFPUGEhm18a9Syjo7CQEsvIsq2ssdWVKBbZjXmbrP5yN+tOPzD1NJXNYqHJ5mztPZglL64x42Z3rX2bv34CboRX7xTCpnhYEw4bYWAdC2yGh7XwcBu8p/q9pnivJtxrhPHqWLxmeK8W3tsGT1U/bYqnNeG0EYbWsdBmeFoLT7fBc9XPm+J5TThvhOF1LLwZntfC823wRPWTpnhSE04aYUgdC2mGJ7XwZAs8Q/diy5ftFMTblTAlcnWEyHWIE6Do+aEilwlOIkVenUQy3Q0EU/5F9YC4cIhS1sd+BEkc8jUkxonjAaGlqz0NzY4gAOvZFp5g3Q/Xu5zKHzDNJs+g2Piu3KtGDJA4yjD0RuyR3on9NxBb0p/ELTMv8MUti0/nMKu2GcHiwSiqHXGsMuVDlFpcHR+ZrSPT/LV18Kq9PmE90b7VvtdeaKbW0V5pr7Uz7VKD2p/aX9rf2j+7/+3t7D3b+2ZlffxoPedrrXLtffc/yxxSFQ==</latexit>

⇡✓ (at |st ) ⇡✓ (at+1 |st+1 ) This brings into question the credit assignment problem: what was it that
caused this positive reward? It probably was not bat moving left, but rather,
at = Left
<latexit sha1_base64="+LzeGFZxpTjAKZV/HNGMwJ8Fw9M=">AAAOQ3icfZdNb9s2HMbV7q3L5rXdjrsICwoMQxBIieOXQ4Faso0e1jYL8tbVQUDRtCyEEgmKcuwK+hS7bt9mX2JfYcdh1wGjZEeWSMq6+G8+Dx//9JdokB7FQcwt669Hjz/59LPPv3jy5d5XX7e+efrs+beXMUkYRBeQYMKuPRAjHEToggcco2vKEAg9jK68OzfXrxaIxQGJzvmKopsQ+FEwCyDgYuj9BMD885bfPtu3Dq3iMtXC3hT7xuY6vX3e2p9MCUxCFHGIQRx/sC3Kb1LAeAAxyvYmSYwogHfARx8SPuvdpEFEE44imJkvhDZLsMmJmUOZ04AhyPFKFACyQCSYcA6YgBPoe/WoGEUgRPHBdBHQeF3GC39dcCDu+yZdFn3JahNTnwE6D+CyRpaCMA4BnyuD8Sr06oMowYgtwvpgTikYJecSMRjEeQ9ORWPe0bzF8Tk53ejzFZ2jKM7ShOGsOlEIiDE0ExOLMkY8oWlxM+L53sUvOUvQQV4WYy+HgN2doemByKkN1HFmmABeH/LCvDkRuockDEE0TSc0SyccLXk6OTjMhPjC9LAwewSwad15lqXpJO+Z55lnwloT31bEt7I4qoijzY8QPDVnhJkL8fwJi01hNIWFBRDF9dkX5eyZeSFHX1bES1m8qohXsuglFTVR1EVFXSjqfUW9l9VlRVzK4qoirmTxY0X8KIvXu2Lf74r9VYoV/ReLYrU3maKZ+PsoXqH0Dmbp6/M3P2dpv7g270KCTLtuhN6D8Xjc6fa7mSzjB7097tnOUNVLQ9cZ2K7OUDoGXdcajrYsR5K3hLaszmjQkaMg3uq9vjtW9S2sNegOHY1hSzt226Pupn0IRZLVL31uv9NW2uKXOX3HcU76ql4anLbr9o40htLhDofDgVug0IRRjCQvfTB2OieWGkXLoJ7VaQ80+vYBWD3HUWBphcUZO/bQLlg4Alhy8vJlcXuD/kgO4tv+OwPXVR4gr7S/59rDtsawZT0ZnoyOCxLCQOTLXSFl+zrd3nFPjiJl0LgrnqDCQra/NO47VldhIRWWseM6eWPrK1Essg/2Tbr+y92uO3PfNrNMNouFJpvztfdglrxYY8bNbq19l18/ATfCq3cKYVM81ITDRhioY4HN8FALD3fB+6rfb4r3NeF+I4yvY/Gb4X0tvL8Lnqp+2hRPNeG0EYbqWGgzPNXC013wXPXzpniuCeeNMFzHwpvhuRae74Inqp80xRNNOGmEIToW0gxPtPBkBzxD92LLl+8UxNuVMiVyfWAodIhToOgxBxwVMsFprMgenyMOct0LBVPxRfWApHSIUtanQQxJEvENJMbpxAdCy9Z7GpofQQA28y08wWYQbXY5tT9gmk++g2Lju3avGzFE4ijD0BuxR3on9t9AbEl/ErfM/DAQtyw+Jwd5tcsIlg9GUe2JY5UtH6LU4uro0G4f2vYv7f1Xnc0J64nxvfGD8aNhG13jlfHaODUuDGiExm/G78YfrT9bf7f+af27tj5+tJnznVG7Wv/9D6soSYc=</latexit>

at+1 = Left
<latexit sha1_base64="5c7YI5DF4pFOIUZakFvff7LcCFY=">AAAOR3icfZfLbuM2GIWV6W2a1tOZdtmN0GCAog0CKXF8WQwwlmxjFp2ZNMitjY2AomlZMCWyFOXYI+g5um3fpq/Ql+iy6LKUbMsSKVkb/+Y5PP70SzRIh2Iv5Ibx98GTjz7+5NPPnn5++MWXjWdfPX/x9U1IIgbRNSSYsDsHhAh7AbrmHsfojjIEfAejW2dup/rtArHQI8EVX1E09oEbeFMPAi6GxiMA08+HmP9oJg/Pj4wTI7t0tTA3xZG2uS4eXjSORhMCIx8FHGIQhvemQfk4Box7EKPkcBSFiAI4By66j/i0M469gEYcBTDRXwptGmGdEz0F0yceQ5DjlSgAZJ5I0OEMMAEo8A/LUSEKgI/C48nCo+G6DBfuuuBA3Ps4Xma9SUoTY5cBOvPgskQWAz/0AZ8pg+HKd8qDKMKILfzyYEopGCXnEjHohWkPLkRj3tO0zeEVudjosxWdoSBM4ojhpDhRCIgxNBUTszJEPKJxdjPiGc/DV5xF6Dgts7FXfcDml2hyLHJKA2WcKSaAl4ccP21OgB4h8X0QTOIRTeIRR0sej45PEiG+1B0szA4BbFJ2XiZxPEp75jj6pbCWxHcF8Z0sDgriYPMjBE/0KWH6Qjx/wkJdGHVhYR5EYXn2dT57ql/L0TcF8UYWbwvirSw6UUGNFHVRUBeK+lhQH2V1WRCXsrgqiCtZ/FAQP8ji3b7YX/bF/irFiv6LRbE6HE3QVPyFZK9QPIdJ/Obq7U9J3M2uzbsQId0sG6GzNZ4NW+1uO5FlvNWbw45p9VU9N7StnmlXGXJHr20b/cGO5VTy5tCG0Rr0WnIUxDu907WHqr6DNXrtvlVh2NEO7eagvWkfQoFkdXOf3W01lba4eU7XsqzzrqrnBqtp253TCkPusPv9fs/OUGjEKEaSl26Nrda5oUbRPKhjtJq9Cn33AIyOZSmwtMBiDS2zb2YsHAEsOXn+stidXncgB/Fd/62ebSsPkBfa37HNfrPCsGM9758PzjISwkDgyl0hefta7c5ZR44iedCwLZ6gwkJ2vzTsWkZbYSEFlqFlW2ljyytRLLJ7cxyv/3J3604/MvUkkc1iocnmdO1tzZIXV5hxvbvSvs9fPQHXwqt3CmFdPKwIh7UwsIoF1sPDSni4D95V/W5dvFsR7tbCuFUsbj28Wwnv7oOnqp/WxdOKcFoLQ6tYaD08rYSn++C56ud18bwinNfC8CoWXg/PK+H5Pnii+kldPKkIJ7UwpIqF1MOTSniyB56hR7HlS3cK4u2KmRK5PjRkOsQxUPSQA44ymeA4VGSHzxAHqe74gin7onpAlDtEKesTL4QkCvgGEuN45AKhJes9DU2PIADr6RaeYN0LNruc0h8wTSfPodj4rt3rRvSROMow9Fbskd6L/TcQW9IfxC0z1/fELYvP0XFa7TOC5dYoqkNxrDLlQ5Ra3J6emM0T0/y5efS6tTlhPdW+1b7TvtdMra291t5oF9q1BrXftN+1P7Q/G381/mn82/hvbX1ysJnzjVa6nh38D+KuSgQ=</latexit>

several actions that were taken quite a few timesteps back, where the bat
reflected the ball. We need to assign credit to exactly the action(s) that
caused the positive reward because it allows us to ‘reinforce' these actions:
These were good choices!
st
<latexit sha1_base64="/pzW3Mv4fuxhZwPDsc7BaKD9MQQ=">AAAOQnicfZfLbuM2GIU109s0rduZdtmN0GCAoggCKXF8WQwwlmxjFp2ZNMitjYOAomlZCCUSJOXYI+glum3fpk/RR+iy6LaLUrIjSxRlbcLwHB5/+iUKPz2KAy4s668nTz/6+JNPP3v2+d4XX7a++vr5i28uOYkZRBeQYMKuPcARDiJ0IQKB0TVlCIQeRlfevZvpVwvEeECic7Gi6DYEfhTMAgiEnLqecAEEuhN3z/etQyu/zPrA3gz2jc11eveitT+ZEhiHKBIQA85vbIuK2wQwEUCM0r1JzBEF8B746CYWs95tEkQ0FiiCqflSarMYm4KYGZM5DRiCAq/kAEAWyAQTzgEDUEjyvWoURxEIET+YLgLK10O+8NcDAeRt3ybLvCxpZWHiM0DnAVxWyBIQ8hCIeW2Sr0KvOolijNgirE5mlJJRcS4RgwHPanAqC/OeZpXm5+R0o89XdI4iniYxw2l5oRQQY2gmF+ZDjkRMk/xm5OO9568Ei9FBNsznXg0Buz9D0wOZU5mo4swwAaI65YVZcSL0AEkYgmiaTGiaTARaimRycJhK8aXpYWn2CGDTqvMsTZJJVjPPM8+ktSK+K4nvVHFUEkebHyF4as4IMxfy+RPGTWk0pYUFEPHq6oti9cy8UKMvS+KlKl6VxCtV9OKSGtfURUld1NSHkvqgqsuSuFTFVUlcqeKHkvhBFa93xf6yK/ZXJVbWX26K1d5kimby65G/Qsk9TJM3529/SpN+fm3ehRiZdtUIvUfj8bjT7XdTVcaPenvcs51hXS8MXWdguzpD4Rh0XWs42rIcKd4C2rI6o0FHjYJ4q/f67riub2GtQXfoaAxb2rHbHnU35UMoUqx+4XP7nXatLH6R03cc56Rf1wuD03bd3pHGUDjc4XA4cHMUGjOKkeKlj8ZO58SqR9EiqGd12gONvn0AVs9xarC0xOKMHXto5ywCAaw4RfGyuL1Bf6QGiW39nYHr1h6gKJW/59rDtsawZT0ZnoyOcxLCQOSrVSFF+Trd3nFPjSJF0Lgrn2CNhWx/adx3rG6NhZRYxo7rZIWt7kS5yW7s22T9yd3uO3PfNtNUNcuNppqzvfdoVrxYY8bNbq19l1+/ADfC1+8UwqZ4qAmHjTBQxwKb4aEWHu6C9+t+vyne14T7jTC+jsVvhve18P4ueFr306Z4qgmnjTBUx0Kb4akWnu6CF3W/aIoXmnDRCCN0LKIZXmjhxS54UveTpniiCSeNMETHQprhiRae7IBn6EG2fFmnIN+uhNUiZU8uu9lchzgBNT0/T+QywQmvyZ6YIwEy3QslU/5P3QPiwiGHqj4NOCRxJDaQGCcTH0gtXfc0NDuCAGxmLTzBZhBtupzKB5hmi++hbHzX7nUhhkgeZRh6K3uk97L/BrIl/VHeMvPDQN6y/Ds5yEa7jGD5aJSjPXmsstVDVH1wdXRotw9t++f2/uvO5oT1zPjO+N74wbCNrvHaeGOcGhcGNLDxm/G78Ufrz9bfrX9a/66tT59s1nxrVK7Wf/8Db7dJIA==</latexit>

st+1
<latexit sha1_base64="PlwOklHMGJBk/GATo+aTkqmNU+g=">AAAORnicfZfLbuM2GIU109s0bebSLrsRGgxQtEEgJY4viwHGkm3MojOTBrlMGwcBRdOyYEokSMqxR9BrdNu+TZ+hD9Fl0W0p2ZElirI2YXgOjz/9EoWfHsUBF5b196PHn3z62edfPPly76uv958+e/7imytOYgbRJSSYsA8e4AgHEboUgcDoA2UIhB5G197czfTrBWI8INGFWFF0GwI/CqYBBEJOjcdcAIHuEvGTnd49P7COrPwy6wN7MzgwNtfZ3Yv9g/GEwDhEkYAYcH5jW1TcJoCJAGKU7o1jjiiAc+Cjm1hMu7dJENFYoAim5kupTWNsCmJmXOYkYAgKvJIDAFkgE0w4AwxAIen3qlEcRSBE/HCyCChfD/nCXw8EkLd+myzz0qSVhYnPAJ0FcFkhS0DIQyBmtUm+Cr3qJIoxYouwOplRSkbFuUQMBjyrwZkszHuaVZtfkLONPlvRGYp4msQMp+WFUkCMoalcmA85EjFN8puRj3jOXwkWo8NsmM+9GgA2P0eTQ5lTmajiTDEBojrlhVlxInQPSRiCaJKMaZqMBVqKZHx4lErxpelhafYIYJOq8zxNknFWM88zz6W1Ir4rie9UcVgSh5sfIXhiTgkzF/L5E8ZNaTSlhQUQ8erqy2L11LxUo69K4pUqXpfEa1X04pIa19RFSV3U1PuSeq+qy5K4VMVVSVyp4seS+FEVP+yK/XVX7G9KrKy/3BSrvfEETeUXJH+FkjlMkzcXb39Ok15+bd6FGJl21Qi9B+PJqN3pdVJVxg96a9S1nUFdLwwdp2+7OkPh6HdcazDcshwr3gLastrDfluNgnird3vuqK5vYa1+Z+BoDFvakdsadjblQyhSrH7hc3vtVq0sfpHTcxzntFfXC4PTct3uscZQONzBYNB3cxQaM4qR4qUPxnb71KpH0SKoa7VbfY2+fQBW13FqsLTE4owce2DnLAIBrDhF8bK43X5vqAaJbf2dvuvWHqAolb/r2oOWxrBlPR2cDk9yEsJA5KtVIUX52p3uSVeNIkXQqCOfYI2FbH9p1HOsTo2FlFhGjutkha3uRLnJbuzbZP3J3e4788A201Q1y42mmrO992BWvFhjxs1urX2XX78AN8LX7xTCpnioCYeNMFDHApvhoRYe7oL3636/Kd7XhPuNML6OxW+G97Xw/i54WvfTpniqCaeNMFTHQpvhqRae7oIXdb9oiheacNEII3QsohleaOHFLnhS95OmeKIJJ40wRMdCmuGJFp7sgGfoXrZ8Wacg366E1SJlTy672VyHOAE1PT9T5DLBCa/JnpghATLdCyVT/k/dA+LCIYeqPgk4JHEkNpAYJ2MfSC1d9zQ0O4IAbGYtPMFmEG26nMoHmGaL51A2vmv3uhADJI8yDL2VPdJ72X8D2ZL+KG+Z+WEgb1n+HR9mo11GsHwwytGePFbZ6iGqPrg+PrJbR7b9S+vgdXtzwnpifGd8b/xg2EbHeG28Mc6MSwMa1Pjd+MP4c/+v/X/2/93/b219/Giz5lujcj01/gelwUmd</latexit>

st+2
<latexit sha1_base64="9Fg4jXwGbDbUAdJrMoo6ifvIsd8=">AAAORnicfZfLbuM2GIU109s0bebSLrsRGgxQtEEgJY4viwHGkm3MojOTBrlMGwcBRdOyYEokSMqxR9BrdNu+TZ+hD9Fl0W0p2ZElirI2YXgOjz/9EoWfHsUBF5b196PHn3z62edfPPly76uv958+e/7imytOYgbRJSSYsA8e4AgHEboUgcDoA2UIhB5G197czfTrBWI8INGFWFF0GwI/CqYBBEJOjcdcAIHuEvHTcXr3/MA6svLLrA/szeDA2Fxndy/2D8YTAuMQRQJiwPmNbVFxmwAmAohRujeOOaIAzoGPbmIx7d4mQURjgSKYmi+lNo2xKYiZcZmTgCEo8EoOAGSBTDDhDDAAhaTfq0ZxFIEQ8cPJIqB8PeQLfz0QQN76bbLMS5NWFiY+A3QWwGWFLAEhD4GY1Sb5KvSqkyjGiC3C6mRGKRkV5xIxGPCsBmeyMO9pVm1+Qc42+mxFZyjiaRIznJYXSgExhqZyYT7kSMQ0yW9GPuI5fyVYjA6zYT73agDY/BxNDmVOZaKKM8UEiOqUF2bFidA9JGEIokkypmkyFmgpkvHhUSrFl6aHpdkjgE2qzvM0ScZZzTzPPJfWiviuJL5TxWFJHG5+hOCJOSXMXMjnTxg3pdGUFhZAxKurL4vVU/NSjb4qiVeqeF0Sr1XRi0tqXFMXJXVRU+9L6r2qLkviUhVXJXGlih9L4kdV/LAr9tddsb8psbL+clOs9sYTNJVfkPwVSuYwTd5cvP05TXr5tXkXYmTaVSP0Howno3an10lVGT/orVHXdgZ1vTB0nL7t6gyFo99xrcFwy3KseAtoy2oP+201CuKt3u25o7q+hbX6nYGjMWxpR25r2NmUD6FIsfqFz+21W7Wy+EVOz3Gc015dLwxOy3W7xxpD4XAHg0HfzVFozChGipc+GNvtU6seRYugrtVu9TX69gFYXcepwdISizNy7IGdswgEsOIUxcvidvu9oRoktvV3+q5be4CiVP6uaw9aGsOW9XRwOjzJSQgDka9WhRTla3e6J101ihRBo458gjUWsv2lUc+xOjUWUmIZOa6TFba6E+Umu7Fvk/Und7vvzAPbTFPVLDeaas723oNZ8WKNGTe7tfZdfv0C3Ahfv1MIm+KhJhw2wkAdC2yGh1p4uAver/v9pnhfE+43wvg6Fr8Z3tfC+7vgad1Pm+KpJpw2wlAdC22Gp1p4ugte1P2iKV5owkUjjNCxiGZ4oYUXu+BJ3U+a4okmnDTCEB0LaYYnWniyA56he9nyZZ2CfLsSVouUPbnsZnMd4gTU9PxMkcsEJ7wme2KGBMh0L5RM+T91D4gLhxyq+iTgkMSR2EBinIx9ILV03dPQ7AgCsJm18ASbQbTpciofYJotnkPZ+K7d60IMkDzKMPRW9kjvZf8NZEv6o7xl5oeBvGX5d3yYjXYZwfLBKEd78lhlq4eo+uD6+MhuHdn2L62D1+3NCeuJ8Z3xvfGDYRsd47XxxjgzLg1oUON34w/jz/2/9v/Z/3f/v7X18aPNmm+NyvXU+B+zmUme</latexit>

https://becominghuman.ai/
Credit assignment: lets-build-an-atari-ai-part-0-
intro-to-rl-9b2c5336e0ec
What action causes reward?
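The credit assignment question can be made concrete with a small sketch. One common heuristic, which is an assumption here and not something this part of the lecture states, is to credit each action with the discounted sum of all rewards that follow it, so that earlier actions still receive part of a later reward. The function name discounted_returns and the discount factor gamma are illustrative choices.

```python
def discounted_returns(rewards, gamma=0.99):
    """Credit each timestep with the discounted sum of later rewards.

    rewards[t] is the reward received after the action taken at step t.
    This is one common heuristic for credit assignment (an assumption,
    not the method defined in the lecture notes).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# The bat hits the ball early on; the block is only destroyed at the last step.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

Every earlier action receives some credit for the final reward, with less credit the further back it lies. Whether that weighting matches the true cause of the reward is exactly what makes credit assignment hard.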

MARKOV DECISION PROCESSES

Actions a_t, states s_t and rewards r_t

Trajectories \tau = s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots, a_{T-1}, r_T, s_T

Markov decision processes (MDPs) model p(\tau | \theta):

p(\tau | \theta) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t | s_t) \, p(s_{t+1}, r_{t+1} | s_t, a_t)

• Policy \pi_\theta(a_t | s_t)
• State transition distribution p(s_{t+1}, r_{t+1} | s_t, a_t)
• Initial state distribution p(s_0)

Markov decision processes model the RL loop. An MDP is defined as a distribution over trajectories: which trajectories are most likely, given that our agent acts in the environment according to the policy \pi_\theta?

An MDP defines the probability of a trajectory when the agent interacts with the environment using policy \pi_\theta. This probability is given by the long equation on the slide. We will explain in detail in the coming slides what exactly this means.

The probability of a trajectory is built up from three probability distributions. The first is the policy, which we have seen before. It represents our agent, and defines a distribution over actions given states.

The second is the state transition distribution, which represents the environment. This one is more complicated: it defines a distribution over the next state of the environment, given the current state of the environment and the action performed by the agent. The next state is jointly distributed with the reward, which also depends on both the previous state and the action.
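As a hedged illustration of how the trajectory probability factorizes, the sketch below evaluates log p(\tau | \theta) for a tiny tabular MDP. The two-state MDP, its probabilities, and the names p_s0, policy, dynamics and trajectory_log_prob are all made up for this example; in deep RL the policy would be a neural network and the environment distributions would usually be unknown.

```python
import math

# Hypothetical toy MDP: two states, two actions, rewards 0 or 1.
p_s0 = {"A": 0.8, "B": 0.2}                      # p(s_0)
policy = {                                       # pi_theta(a_t | s_t)
    "A": {"left": 0.7, "right": 0.3},
    "B": {"left": 0.4, "right": 0.6},
}
dynamics = {                                     # p(s_{t+1}, r_{t+1} | s_t, a_t)
    ("A", "left"): {("A", 0): 0.9, ("B", 1): 0.1},
    ("A", "right"): {("A", 0): 0.5, ("B", 1): 0.5},
    ("B", "left"): {("A", 0): 0.6, ("B", 0): 0.4},
    ("B", "right"): {("A", 1): 0.2, ("B", 0): 0.8},
}


def trajectory_log_prob(states, actions, rewards):
    """Return log p(tau | theta) for tau = s_0, a_0, r_1, ..., a_{T-1}, r_T, s_T.

    states has length T + 1; actions and rewards have length T,
    and rewards[t] holds r_{t+1}.
    """
    log_p = math.log(p_s0[states[0]])            # initial state term
    for t, action in enumerate(actions):
        state, next_state, reward = states[t], states[t + 1], rewards[t]
        log_p += math.log(policy[state][action])                       # agent
        log_p += math.log(dynamics[(state, action)][(next_state, reward)])  # environment
    return log_p


log_p = trajectory_log_prob(["A", "A", "B"], ["left", "right"], [0, 1])
print(math.exp(log_p))  # 0.8 * 0.7 * 0.9 * 0.3 * 0.5, approximately 0.0756
```

Summing log-probabilities instead of multiplying probabilities is a design choice in this sketch: it avoids numerical underflow for long trajectories.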

MARKOV DECISION PROCESSES

[Figure: computation graph of an MDP; the environment nodes are labelled "Environment".]

MDPs can be illustrated using a variant of computation graphs. These represent the dependencies between how the different components of the MDP are generated step by step. This is done by following the lines in the graph. Dotted lines denote non-differentiable computation. A node with the tilde ~ operation is a sampling step.
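The step-by-step generation described by the graph can be sketched as a rollout in which every sampling node becomes one random draw. This is a minimal sketch under assumptions: the toy MDP, its probabilities, and the helper names sample and rollout are hypothetical and only serve to show the order of the sampling steps.

```python
import random

# Hypothetical toy MDP; states, actions, rewards and probabilities are made up.
p_s0 = {"A": 0.8, "B": 0.2}
policy = {
    "A": {"left": 0.7, "right": 0.3},
    "B": {"left": 0.4, "right": 0.6},
}
dynamics = {
    ("A", "left"): {("A", 0): 0.9, ("B", 1): 0.1},
    ("A", "right"): {("A", 0): 0.5, ("B", 1): 0.5},
    ("B", "left"): {("A", 0): 0.6, ("B", 0): 0.4},
    ("B", "right"): {("A", 1): 0.2, ("B", 0): 0.8},
}


def sample(dist, rng):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    outcomes = list(dist)
    weights = [dist[outcome] for outcome in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def rollout(num_steps, rng):
    """Generate s_0, a_0, r_1, s_1, ..., a_{T-1}, r_T, s_T with T = num_steps."""
    state = sample(p_s0, rng)                             # s_0 ~ p(s_0)
    trajectory = [state]
    for _ in range(num_steps):
        action = sample(policy[state], rng)               # a_t ~ pi_theta(a_t | s_t)
        state, reward = sample(dynamics[(state, action)], rng)  # s_{t+1}, r_{t+1} ~ p(. | s_t, a_t)
        trajectory += [action, reward, state]
    return trajectory


print(rollout(3, random.Random(0)))
```

Each call to sample corresponds to a ~ node: the draw itself is not differentiable, which is why gradients cannot simply be backpropagated through a rollout like this one.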

p(s0 ) p(s1 , r1 |a0 , s0 ) p(s2 , r2 |a1 , s1 ) p(s3 , r3 |a2 , s2 )


~ ~ r1 <latexit sha1_base64="9xsJmqaI6QVO7aNPEsbsg9BoZ44=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4bQwwlL7837Z4fGsbG89GphrotDbX2d3z/fHwxHBMYBCjnEIIpuTYPyuwQw7kOM0r1hHCEK4BR46Dbm4/Zd4oc05iiEqf5SaOMY65zoGZQ+8hmCHC9EASDzRYIOJ4AByAX6XjkqQiEIUHQ0mvk0WpXRzFsVHIj7vkvmy76kpYmJxwCd+HBeIktAEAWATyqD0SJwy4MoxojNgvJgRikYJeccMehHWQ/ORWPe06zV0SU5X+uTBZ2gMEqTmOG0OFEIiDE0FhOXZYR4TJPlzYj1nUavOIvRUVYux145gE0v0OhI5JQGyjhjTAAvD7lB1pwQPUASBCAcJUOaJkOO5jwZHh2nQnypu1iYXQLYqOy8SJNkmPXMdfULYS2J7wriO1nsFcTe+kcIHuljwvSZWH/CIl0YdWFhPkRRefYgnz3WB3L0VUG8ksXrgngti25cUOOKOiuos4r6UFAfZHVeEOeyuCiIC1n8VBA/yeLNrtgPu2I/SrGi/2JTLPaGIzQWr4/lI5RMYZq8uXz7R5p0ltf6WYiRbpaN0N0YT/vNVqeVyjLe6I1+27Scqp4bWlbXtFWG3NFt2YbT27KcSN4c2jCavW5TjoJ4q7c7dr+qb2GNbsuxFIYtbd9u9Frr9iEUSlYv99mdZqPSFi/P6ViWddap6rnBath2+0RhyB224zhde4lCY0Yxkrx0Y2w2z4xqFM2D2kaz0VXo2wUw2pZVgaUFFqtvmY65ZOEIYMnJ84fFbnc7PTmIb/tvdW27soC80P62bToNhWHLeuac9U6XJISB0JO7QvL2NVvt07YcRfKgfkusYIWFbH+p37GMVoWFFFj6lm1ljS3vRLHJbs27ZPXK3e47/dDU01Q2i40mm7O9tzFLXqww43q30r7Lr56Aa+GrdwphXTxUhMNaGKhigfXwUAkPd8F7Vb9XF+8pwr1aGE/F4tXDe0p4bxc8rfppXTxVhNNaGKpiofXwVAlPd8Hzqp/XxXNFOK+F4SoWXg/PlfB8Fzyp+kldPFGEk1oYomIh9fBECU/K8OLPIzu1A6xnp16CdT9cHwxK7yyaHR+mUJwVV+7VjTtInP4ZeiuOFe/FkRWIU9yvyRAwL/DDVHwNeMOjrNplBPONUVTiQ8SUPzuqxfXJsdk4Ns0/G4evf19/kzzVvtd+1H7RTK2lvdbeaOfaQINaoP2l/a39s//fwQ8HPx38vLI+frSe80IrXQe//Q8UNPHI</latexit>

~ r2 <latexit sha1_base64="pd3L+wPVTqb/6ceddXPUL+1bOrE=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4bQwwlL70/unx0ax8by0quFuS4OtfV1fv98fzAcERgHKOQQgyi6NQ3K7xLAuA8xSveGcYQogFPgoduYj9t3iR/SmKMQpvpLoY1jrHOiZ1D6yGcIcrwQBYDMFwk6nAAGIBfoe+WoCIUgQNHRaObTaFVGM29VcCDu+y6ZL/uSliYmHgN04sN5iSwBQRQAPqkMRovALQ+iGCM2C8qDGaVglJxzxKAfZT04F415T7NWR5fkfK1PFnSCwihNYobT4kQhIMbQWExclhHiMU2WNyPWdxq94ixGR1m5HHvlADa9QKMjkVMaKOOMMQG8POQGWXNC9ABJEIBwlAxpmgw5mvNkeHScCvGl7mJhdglgo7LzIk2SYdYz19UvhLUkviuI72SxVxB76x8heKSPCdNnYv0Ji3Rh1IWF+RBF5dmDfPZYH8jRVwXxShavC+K1LLpxQY0r6qygzirqQ0F9kNV5QZzL4qIgLmTxU0H8JIs3u2I/7Ir9KMWK/otNsdgbjtBYvD6Wj1AyhWny5vLtH2nSWV7rZyFGulk2QndjPO03W51WKst4ozf6bdNyqnpuaFld01YZcke3ZRtOb8tyInlzaMNo9rpNOQrird7u2P2qvoU1ui3HUhi2tH270Wut24dQKFm93Gd3mo1KW7w8p2NZ1lmnqucGq2Hb7ROFIXfYjuN07SUKjRnFSPLSjbHZPDOqUTQPahvNRlehbxfAaFtWBZYWWKy+ZTrmkoUjgCUnzx8Wu93t9OQgvu2/1bXtygLyQvvbtuk0FIYt65lz1jtdkhAGQk/uCsnb12y1T9tyFMmD+i2xghUWsv2lfscyWhUWUmDpW7aVNba8E8UmuzXvktUrd7vv9ENTT1PZLDaabM723sYsebHCjOvdSvsuv3oCroWv3imEdfFQEQ5rYaCKBdbDQyU83AXvVf1eXbynCPdqYTwVi1cP7ynhvV3wtOqndfFUEU5rYaiKhdbDUyU83QXPq35eF88V4bwWhqtYeD08V8LzXfCk6id18UQRTmphiIqF1MMTJTwpw4s/j+zUDrCenXoJ1v1wfTAovb

~ r3 F we gene a e he n a a e _0 om he n a a e d bu on Th
s1 <latexit sha1_base64="OzSwNdQHk8BIaE6qZCRw39GSBas=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4YiIYnSe/P+2aFxbCwvvVqY6+JQW1/n98/3B8MRgXGAQg4xiKJb06D8LgGM+xCjdG8YR4gCOAUeuo35uH2X+CGNOQphqr8U2jjGOid6BqWPfIYgxwtRAMh8kaDDCWAAcoG+V46KUAgCFB2NZj6NVmU081YFB+K+75L5si9paWLiMUAnPpyXyBIQRAHgk8pgtAjc8iCKMWKzoDyYUQpGyTlHDPpR1oNz0Zj3NGt1dEnO1/pkQScojNIkZjgtThQCYgyNxcRlGSEe02R5M2J9p9ErzmJ0lJXLsVcOYNMLNDoSOaWBMs4YE8DLQ26QNSdED5AEAQhHyZCmyZCjOU+GR8epEF/qLhZmlwA2Kjsv0iQZZj1zXf1CWEviu4L4ThZ7BbG3/hGCR/qYMH0m1p+wSBdGXViYD1FUnj3IZ4/1gRx9VRCvZPG6IF7LohsX1LiizgrqrKI+FNQHWZ0XxLksLgriQhY/FcRPsnizK/bDrtiPUqzov9gUi73hCI3F62P5CCVTmCZvLt/+kSad5bV+FmKkm2UjdDfG036z1Wmlsow3eqPfNi2nqueGltU1bZUhd3RbtuH0tiwnkjeHNoxmr9uUoyDe6u2O3a/qW1ij23IshWFL27cbvda6fQiFktXLfXan2ai0xctzOpZlnXWqem6wGrbdPlEYcoftOE7XXqLQmFGMJC/dGJvNM6MaRfOgttFsdBX6dgGMtmVVYGmBxepbpmMuWTgCWHLy/GGx291OTw7i2/5bXduuLCAvtL9tm05DYdiynjlnvdMlCWEg9OSukLx9zVb7tC1HkTyo3xIrWGEh21/qdyyjVWEhBZa+ZVtZY8s7UWyyW/MuWb1yt/tOPzT1NJXNYqPJ5mzvbcySFyvMuN6ttO/yqyfgWvjqnUJYFw8V4bAWBqpYYD08VMLDXfBe1e/VxXuKcK8WxlOxePXwnhLe2wVPq35aF08V4bQWhqpYaD08VcLTXfC86ud18VwRzmthuIqF18NzJTzfBU+qflIXTxThpBaGqFhIPTxRwpMyvPjzyE7tAOvZqZdg3Q/XB4PSO4tmx4cpFGfFlXt14w4Sp3+G3opjxXtxZAXiFPdrMgTMC/wwFV8D3vAoq3YZwXxjFJX4EDHlz45qcX1ybDaOTfPPxuHr39ffJE+177UftV80U2tpr7U32rk20KAWaH9pf2v/7P938MPBTwc/r6yPH63nvNBK18Fv/wOJuPHR</latexit>

s2 s3 no ond oned on any npu wh h ep e en ed by he a ha he e a e


s0
<latexit sha1_base64="KIzHNcLYbod1+x/KccCLlzNOPGI=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4YiIYnSe+P+2aFxbCwvvVqY6+JQW1/n98/3B8MRgXGAQg4xiKJb06D8LgGM+xCjdG8YR4gCOAUeuo35uH2X+CGNOQphqr8U2jjGOid6BqWPfIYgxwtRAMh8kaDDCWAAcoG+V46KUAgCFB2NZj6NVmU081YFB+K+75L5si9paWLiMUAnPpyXyBIQRAHgk8pgtAjc8iCKMWKzoDyYUQpGyTlHDPpR1oNz0Zj3NGt1dEnO1/pkQScojNIkZjgtThQCYgyNxcRlGSEe02R5M2J9p9ErzmJ0lJXLsVcOYNMLNDoSOaWBMs4YE8DLQ26QNSdED5AEAQhHyZCmyZCjOU+GR8epEF/qLhZmlwA2Kjsv0iQZZj1zXf1CWEviu4L4ThZ7BbG3/hGCR/qYMH0m1p+wSBdGXViYD1FUnj3IZ4/1gRx9VRCvZPG6IF7LohsX1LiizgrqrKI+FNQHWZ0XxLksLgriQhY/FcRPsnizK/bDrtiPUqzov9gUi73hCI3F62P5CCVTmCZvLt/+kSad5bV+FmKkm2UjdDfG036z1Wmlsow3eqPfNi2nqueGltU1bZUhd3RbtuH0tiwnkjeHNoxmr9uUoyDe6u2O3a/qW1ij23IshWFL27cbvda6fQiFktXLfXan2ai0xctzOpZlnXWqem6wGrbdPlEYcoftOE7XXqLQmFGMJC/dGJvNM6MaRfOgttFsdBX6dgGMtmVVYGmBxepbpmMuWTgCWHLy/GGx291OTw7i2/5bXduuLCAvtL9tm05DYdiynjlnvdMlCWEg9OSukLx9zVb7tC1HkTyo3xIrWGEh21/qdyyjVWEhBZa+ZVtZY8s7UWyyW/MuWb1yt/tOPzT1NJXNYqPJ5mzvbcySFyvMuN6ttO/yqyfgWvjqnUJYFw8V4bAWBqpYYD08VMLDXfBe1e/VxXuKcK8WxlOxePXwnhLe2wVPq35aF08V4bQWhqpYaD08VcLTXfC86ud18VwRzmthuIqF18NzJTzfBU+qflIXTxThpBaGqFhIPTxRwpMyvPjzyE7tAOvZqZdg3Q/XB4PSO4tmx4cpFGfFlXt14w4Sp3+G3opjxXtxZAXiFPdrMgTMC/wwFV8D3vAoq3YZwXxjFJX4EDHlz45qcX1ybDaOTfPPxuHr39ffJE+177UftV80U2tpr7U32rk20KAWaH9pf2v/7P938MPBTwc/r6yPH63nvNBK18Fv/wN8r/HQ</latexit>

no n om ng a ow Nex we pa he gene a ed n a a e h ough he


<latexit sha1_base64="v7EdJ+ThHXgJX0gdR//dSzJOt1g=">AAAOEnicfZdNb9s2HMbV7q3Lmi3djjtMWFCgG4JAShy/YChQS7bRw9pmQV66RUFA0bQsmBIJinLsajruE+xj7Loddhx23RfYPs0o2ZElirIu+YfPw8c//SUKpEuxH3HD+PfBw/fe/+DDjx59vPPJ491PP9t78vllRGIG0QUkmLC3LogQ9kN0wX2O0VvKEAhcjK7cmZ3pV3PEIp+E53xJ0U0AvNCf+BBwMXS795VD/VvH5VPEwTMHwHzU+NmJOODo1vjmdm/fODTyS68X5rrY19bX6e2Tx/85YwLjAIUcYhBF16ZB+U0CGPchRumOE0eIAjgDHrqO+aR7k/ghjTkKYao/Fdokxjonegarj32GIMdLUQDIfJGgwylgAlPc0k41KkIhCFB0MJ77NFqV0dxbFRyIftwki7xfaWVi4jFApz5cVMgSEEQB4NPaYLQM3OogijFi86A6mFEKRsm5QAz6UdaDU9GYNzRrdnROTtf6dEmnKIzSJGY4LU8UAmIMTcTEvIwQj2mS34x47rPoOWcxOsjKfOz5ALDZGRofiJzKQBVnggng1SE3yJoTojtIggCE48ShaeJwtOCJc3CYCvGp7mJhdglg46rzLE0SJ+uZ6+pnwloRX5fE17I4LInD9Y8QPNYnhOlz8fwJi3Rh1IWF+RBF1dkXxeyJfiFHX5bES1m8KolXsujGJTWuqfOSOq+pdyX1TlYXJXEhi8uSuJTFdyXxnSy+3Rb747bYn6RY0X+xKJY7zhhNxGclf4WSGUyTl+evvk+TXn6t34UY6WbVCN174/Go3el1UlnG93pr1DWtQV0vDB2rb9oqQ+Hod2xjMNywHEneAtow2sN+W46CeKN3e/aorm9gjX5nYCkMG9qR3Rp21u1DKJSsXuGze+1WrS1ekdOzLOukV9cLg9Wy7e6RwlA47MFg0LdzFBozipHkpffGdvvEqEfRIqhrtFt9hb55AEbXsmqwtMRijSxzYOYsHAEsOXnxstjdfm8oB/FN/62+bdceIC+1v2ubg5bCsGE9GZwMj3MSwkDoyV0hRfvane5xV44iRdCoI55gjYVsfmnUs4xOjYWUWEaWbWWNra5EsciuzZtk9cndrDt939TTVDaLhSabs7V3b5a8WGHGzW6lfZtfPQE3wtfvFMKmeKgIh40wUMUCm+GhEh5ug/fqfq8p3lOEe40wnorFa4b3lPDeNnha99OmeKoIp40wVMVCm+GpEp5ug+d1P2+K54pw3gjDVSy8GZ4r4fk2eFL3k6Z4oggnjTBExUKa4YkSnmyBZ+hObPmynYJ4uxJWi1wdHXId4gTU9PxAkcsEJ1FNXp1AMt0NBFP+z2orQrOTA8B6tvMmWPfD9eak8t2k2cwZFPvVlXvFP0DiBMLQK7G1eSO2zUDsJL8VpMwLfEEq/joHWbXNCBb3RlHtiNOQKZ996sXV0aHZOjTNH1r7L75bH4weaV9qX2vPNFPraC+0l9qpdqFB7RftN+137Y/dX3f/3P1r9++V9eGD9ZwvtMq1+8//ia4zmg==</latexit> <latexit 
sha1_base64="4j85n+ya1RERqomaZORIgNwngbY=">AAAOEnicfZdNb9s2HMbV7q3Lmi3djjtMWFCgG4JAShy/YChQS7bRw9pmQV66RUFA0bQsmBIJinLsajruE+xj7Loddhx23RfYPs0o2ZElirIu+YfPw8c//SUKpEuxH3HD+PfBw/fe/+DDjx59vPPJ491PP9t78vllRGIG0QUkmLC3LogQ9kN0wX2O0VvKEAhcjK7cmZ3pV3PEIp+E53xJ0U0AvNCf+BBwMXS795VD/VvH5VPEwTMHwHzU/NmJOODo1vzmdm/fODTyS68X5rrY19bX6e2Tx/85YwLjAIUcYhBF16ZB+U0CGPchRumOE0eIAjgDHrqO+aR7k/ghjTkKYao/Fdokxjonegarj32GIMdLUQDIfJGgwylgAlPc0k41KkIhCFB0MJ77NFqV0dxbFRyIftwki7xfaWVi4jFApz5cVMgSEEQB4NPaYLQM3OogijFi86A6mFEKRsm5QAz6UdaDU9GYNzRrdnROTtf6dEmnKIzSJGY4LU8UAmIMTcTEvIwQj2mS34x47rPoOWcxOsjKfOz5ALDZGRofiJzKQBVnggng1SE3yJoTojtIggCE48ShaeJwtOCJc3CYCvGp7mJhdglg46rzLE0SJ+uZ6+pnwloRX5fE17I4LInD9Y8QPNYnhOlz8fwJi3Rh1IWF+RBF1dkXxeyJfiFHX5bES1m8KolXsujGJTWuqfOSOq+pdyX1TlYXJXEhi8uSuJTFdyXxnSy+3Rb747bYn6RY0X+xKJY7zhhNxGclf4WSGUyTl+evvk+TXn6t34UY6WbVCN174/Go3el1UlnG93pr1DWtQV0vDB2rb9oqQ+Hod2xjMNywHEneAtow2sN+W46CeKN3e/aorm9gjX5nYCkMG9qR3Rp21u1DKJSsXuGze+1WrS1ekdOzLOukV9cLg9Wy7e6RwlA47MFg0LdzFBozipHkpffGdvvEqEfRIqhrtFt9hb55AEbXsmqwtMRijSxzYOYsHAEsOXnxstjdfm8oB/FN/62+bdceIC+1v2ubg5bCsGE9GZwMj3MSwkDoyV0hRfvane5xV44iRdCoI55gjYVsfmnUs4xOjYWUWEaWbWWNra5EsciuzZtk9cndrDt939TTVDaLhSabs7V3b5a8WGHGzW6lfZtfPQE3wtfvFMKmeKgIh40wUMUCm+GhEh5ug/fqfq8p3lOEe40wnorFa4b3lPDeNnha99OmeKoIp40wVMVCm+GpEp5ug+d1P2+K54pw3gjDVSy8GZ4r4fk2eFL3k6Z4oggnjTBExUKa4YkSnmyBZ+hObPmynYJ4uxJWi1wdHXId4gTU9PxAkcsEJ1FNXp1AMt0NBFP+z2orQrOTA8B6tvMmWPfD9eak8t2k2cwZFPvVlXvFP0DiBMLQK7G1eSO2zUDsJL8VpMwLfEEq/joHWbXNCBb3RlHtiNOQKZ996sXV0aHZOjTNH1r7L75bH4weaV9qX2vPNFPraC+0l9qpdqFB7RftN+137Y/dX3f/3P1r9++V9eGD9ZwvtMq1+8//pNUznA==</latexit> <latexit 
sha1_base64="y2/EBqPf66V+UZmCJQzpbqka44c=">AAAOEnicfZdNb9s2HMbV7q3Lmi3djjtMWFCgG4JAShy/YChQS7bRw9pmQV66RUFA0bQsmBIJinLsajruE+xj7Loddhx23RfYPs0o2ZElirIu+YfPw8c//SUKpEuxH3HD+PfBw/fe/+DDjx59vPPJ491PP9t78vllRGIG0QUkmLC3LogQ9kN0wX2O0VvKEAhcjK7cmZ3pV3PEIp+E53xJ0U0AvNCf+BBwMXS795VD/VvH5VPEwTMHwHz06Gcn4oCj26Nvbvf2jUMjv/R6Ya6LfW19nd4+efyfMyYwDlDIIQZRdG0alN8kgHEfYpTuOHGEKIAz4KHrmE+6N4kf0pijEKb6U6FNYqxzomew+thnCHK8FAWAzBcJOpwCJjDFLe1UoyIUggBFB+O5T6NVGc29VcGB6MdNssj7lVYmJh4DdOrDRYUsAUEUAD6tDUbLwK0OohgjNg+qgxmlYJScC8SgH2U9OBWNeUOzZkfn5HStT5d0isIoTWKG0/JEISDG0ERMzMsI8Zgm+c2I5z6LnnMWo4OszMeeDwCbnaHxgcipDFRxJpgAXh1yg6w5IbqDJAhAOE4cmiYORwueOAeHqRCf6i4WZpcANq46z9IkcbKeua5+JqwV8XVJfC2Lw5I4XP8IwWN9Qpg+F8+fsEgXRl1YmA9RVJ19Ucye6Bdy9GVJvJTFq5J4JYtuXFLjmjovqfOaeldS72R1URIXsrgsiUtZfFcS38ni222xP26L/UmKFf0Xi2K544zRRHxW8lcomcE0eXn+6vs06eXX+l2IkW5WjdC9Nx6P2p1eJ5VlfK+3Rl3TGtT1wtCx+qatMhSOfsc2BsMNy5HkLaANoz3st+UoiDd6t2eP6voG1uh3BpbCsKEd2a1hZ90+hELJ6hU+u9du1driFTk9y7JOenW9MFgt2+4eKQyFwx4MBn07R6ExoxhJXnpvbLdPjHoULYK6RrvVV+ibB2B0LasGS0ss1sgyB2bOwhHAkpMXL4vd7feGchDf9N/q23btAfJS+7u2OWgpDBvWk8HJ8DgnIQyEntwVUrSv3eked+UoUgSNOuIJ1ljI5pdGPcvo1FhIiWVk2VbW2OpKFIvs2rxJVp/czbrT9009TWWzWGiyOVt792bJixVm3OxW2rf51RNwI3z9TiFsioeKcNgIA1UssBkeKuHhNniv7vea4j1FuNcI46lYvGZ4TwnvbYOndT9tiqeKcNoIQ1UstBmeKuHpNnhe9/OmeK4I540wXMXCm+G5Ep5vgyd1P2mKJ4pw0ghDVCykGZ4o4ckWeIbuxJYv2ymItythtcjV0SHXIU5ATc8PFLlMcBLV5NUJJNPdQDDl/6y2IjQ7OQCsZztvgnU/XG9OKt9Nms2cQbFfXblX/AMkTiAMvRJbmzdi2wzETvJbQcq8wBek4q9zkFXbjGBxbxTVjjgNmfLZp15cHR2arUPT/KG1/+K79cHokfal9rX2TDO1jvZCe6mdahca1H7RftN+1/7Y/XX3z92/dv9eWR8+WM/5Qqtcu//8D7/8M54=</latexit>

po y ne wo k p _ he a o ge a p obab y d bu on ove po b e
⇡✓ (a0 |s0 ) ⇡✓ (a1 |s1 ) ⇡✓ (a2 |s2 ) ⇡✓ (a2 s2 ) a on F om h d bu on we gene a e he a on a_0
~ a0 <latexit sha1_base64="M8sqRzeDFh3Mi1iBGtqcTbyBm/Y=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4YQ4gSk98b9s0Pj2FheerUw18Whtr7O75/vD4YjAuMAhRxiEEW3pkH5XQIY9yFG6d4wjhAFcAo8dBvzcfsu8UMacxTCVH8ptHGMdU70DEof+QxBjheiAJD5IkGHE8AA5AJ9rxwVoRAEKDoazXwarcpo5q0KDsR93yXzZV/S0sTEY4BOfDgvkSUgiALAJ5XBaBG45UEUY8RmQXkwoxSMknOOGPSjrAfnojHvadbq6JKcr/XJgk5QGKVJzHBanCgExBgai4nLMkI8psnyZsT6TqNXnMXoKCuXY68cwKYXaHQkckoDZZwxJoCXh9wga06IHiAJAhCOkiFNkyFHc54Mj45TIb7UXSzMLgFsVHZepEkyzHrmuvqFsJbEdwXxnSz2CmJv/SMEj/QxYfpMrD9hkS6MurAwH6KoPHuQzx7rAzn6qiBeyeJ1QbyWRTcuqHFFnRXUWUV9KKgPsjoviHNZXBTEhSx+KoifZPFmV+yHXbEfpVjRf7EpFnvDERqL18fyEUqmME3eXL79I006y2v9LMRIN8tG6G6Mp/1mq9NKZRlv9Ea/bVpOVc8NLatr2ipD7ui2bMPpbVlOJG8ObRjNXrcpR0G81dsdu1/Vt7BGt+VYCsOWtm83eq11+xAKJauX++xOs1Fpi5fndCzLOutU9dxgNWy7faIw5A7bcZyuvUShMaMYSV66MTabZ0Y1iuZBbaPZ6Cr07QIYbcuqwNICi9W3TMdcsnAEsOTk+cNit7udnhzEt/23urZdWUBeaH/bNp2GwrBlPXPOeqdLEsJA6MldIXn7mq32aVuOInlQvyVWsMJCtr/U71hGq8JCCix9y7ayxpZ3othkt+ZdsnrlbvedfmjqaSqbxUaTzdne25glL1aYcb1bad/lV0/AtfDVO4WwLh4qwmEtDFSxwHp4qISHu+C9qt+ri/cU4V4tjKdi8erhPSW8twueVv20Lp4qwmktDFWx0Hp4qoSnu+B51c/r4rkinNfCcBULr4fnSni+C55U/aQunijCSS0MUbGQeniihCdlePHnkZ3aAdazUy/Buh+uDwaldxbNjg9TKM6KK/fqxh0kTv8MvRXHivfiyArEKe7XZAiYF/hhKr4GvOFRVu0ygvnGKCrxIWLKnx3V4vrk2Gwcm+afjcPXv6+/SZ5q32s/ar9optbSXmtvtHNtoEEt0P7S/tb+2f/v4IeDnw5+XlkfP1rPeaGVroPf/gf1BfGy</latexit>

~ a1
<latexit sha1_base64="faCoi3suTLeizhRJ44TVzLkE1Ww=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4YQ4gSk9+b9s0Pj2FheerUw18Whtr7O75/vD4YjAuMAhRxiEEW3pkH5XQIY9yFG6d4wjhAFcAo8dBvzcfsu8UMacxTCVH8ptHGMdU70DEof+QxBjheiAJD5IkGHE8AA5AJ9rxwVoRAEKDoazXwarcpo5q0KDsR93yXzZV/S0sTEY4BOfDgvkSUgiALAJ5XBaBG45UEUY8RmQXkwoxSMknOOGPSjrAfnojHvadbq6JKcr/XJgk5QGKVJzHBanCgExBgai4nLMkI8psnyZsT6TqNXnMXoKCuXY68cwKYXaHQkckoDZZwxJoCXh9wga06IHiAJAhCOkiFNkyFHc54Mj45TIb7UXSzMLgFsVHZepEkyzHrmuvqFsJbEdwXxnSz2CmJv/SMEj/QxYfpMrD9hkS6MurAwH6KoPHuQzx7rAzn6qiBeyeJ1QbyWRTcuqHFFnRXUWUV9KKgPsjoviHNZXBTEhSx+KoifZPFmV+yHXbEfpVjRf7EpFnvDERqL18fyEUqmME3eXL79I006y2v9LMRIN8tG6G6Mp/1mq9NKZRlv9Ea/bVpOVc8NLatr2ipD7ui2bMPpbVlOJG8ObRjNXrcpR0G81dsdu1/Vt7BGt+VYCsOWtm83eq11+xAKJauX++xOs1Fpi5fndCzLOutU9dxgNWy7faIw5A7bcZyuvUShMaMYSV66MTabZ0Y1iuZBbaPZ6Cr07QIYbcuqwNICi9W3TMdcsnAEsOTk+cNit7udnhzEt/23urZdWUBeaH/bNp2GwrBlPXPOeqdLEsJA6MldIXn7mq32aVuOInlQvyVWsMJCtr/U71hGq8JCCix9y7ayxpZ3othkt+ZdsnrlbvedfmjqaSqbxUaTzdne25glL1aYcb1bad/lV0/AtfDVO4WwLh4qwmEtDFSxwHp4qISHu+C9qt+ri/cU4V4tjKdi8erhPSW8twueVv20Lp4qwmktDFWx0Hp4qoSnu+B51c/r4rkinNfCcBULr4fnSni+C55U/aQunijCSS0MUbGQeniihCdlePHnkZ3aAdazUy/Buh+uDwaldxbNjg9TKM6KK/fqxh0kTv8MvRXHivfiyArEKe7XZAiYF/hhKr4GvOFRVu0ygvnGKCrxIWLKnx3V4vrk2Gwcm+afjcPXv6+/SZ5q32s/ar9optbSXmtvtHNtoEEt0P7S/tb+2f/v4IeDnw5+XlkfP1rPeaGVroPf/gcCHfGz</latexit>

~ a2
Fo ow ng he a ow we pa he n a a e _0 and he a on a_0 o
he env onmen and gene a e he nex a e _1 and he a o a ed ewa d
_1 o h ae
Agen
Agen ✓
The p o e epea om h po n Gene a e an a on om he po y and
11
u e oge he w h he a e o gene a e he nex a e and ewa d
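
The sampling loop described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the three-state chain,
the two actions, the 10% slip probability, and all function names
(initial_state, policy, environment, sample_trajectory) are hypothetical and
do not come from the lecture. The "policy network" is replaced by a plain
lookup table theta to keep the sketch self-contained.

```python
import random


def initial_state(rng):
    """Sample s_0 ~ p(s_0). Takes no state or action as input,
    like the node without incoming arrows in the diagram."""
    return rng.choice([0, 1, 2])


def policy(state, theta):
    """Return pi_theta(a | s) as probabilities over two actions.
    theta[state] is the (assumed) probability of action 1."""
    p_right = theta[state]
    return [1.0 - p_right, p_right]


def environment(state, action, rng):
    """Sample (next state, reward) ~ p(., . | a_t, s_t) for a toy chain.
    Action 1 moves right, action 0 moves left."""
    step = 1 if action == 1 else -1
    if rng.random() < 0.1:  # assumed: small chance the move is reversed
        step = -step
    next_state = min(2, max(0, state + step))
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward


def sample_trajectory(theta, horizon, seed=0):
    """Alternate between sampling an action and sampling the next state."""
    rng = random.Random(seed)
    state = initial_state(rng)
    trajectory = []
    for _ in range(horizon):
        probs = policy(state, theta)
        action = rng.choices([0, 1], weights=probs)[0]
        next_state, reward = environment(state, action, rng)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory
```

Calling sample_trajectory([0.5, 0.5, 0.5], horizon=5) returns five
(state, action, reward, next state) tuples. Note that the agent only ever
calls environment as a black box: it sees sampled states and rewards, never
the transition probabilities themselves.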
MDPs contain two important assumptions. These assumptions are used to
OBSERVABILITY develop efficient algorithms, but are often not a good model of the real
world! The first one is full observability of states. We assume that we receive
• Full observability of state from the environment all there is to know about the current state of the
environment. An example of an environment with full observability is chess:
<latexit sha1_base64="9xsJmqaI6QVO7aNPEsbsg9BoZ44=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4bQwwlL7837Z4fGsbG89GphrotDbX2d3z/fHwxHBMYBCjnEIIpuTYPyuwQw7kOM0r1hHCEK4BR46Dbm4/Zd4oc05iiEqf5SaOMY65zoGZQ+8hmCHC9EASDzRYIOJ4AByAX6XjkqQiEIUHQ0mvk0WpXRzFsVHIj7vkvmy76kpYmJxwCd+HBeIktAEAWATyqD0SJwy4MoxojNgvJgRikYJeccMehHWQ/ORWPe06zV0SU5X+uTBZ2gMEqTmOG0OFEIiDE0FhOXZYR4TJPlzYj1nUavOIvRUVYux145gE0v0OhI5JQGyjhjTAAvD7lB1pwQPUASBCAcJUOaJkOO5jwZHh2nQnypu1iYXQLYqOy8SJNkmPXMdfULYS2J7wriO1nsFcTe+kcIHuljwvSZWH/CIl0YdWFhPkRRefYgnz3WB3L0VUG8ksXrgngti25cUOOKOiuos4r6UFAfZHVeEOeyuCiIC1n8VBA/yeLNrtgPu2I/SrGi/2JTLPaGIzQWr4/lI5RMYZq8uXz7R5p0ltf6WYiRbpaN0N0YT/vNVqeVyjLe6I1+27Scqp4bWlbXtFWG3NFt2YbT27KcSN4c2jCavW5TjoJ4q7c7dr+qb2GNbsuxFIYtbd9u9Frr9iEUSlYv99mdZqPSFi/P6ViWddap6rnBath2+0RhyB224zhde4lCY0Yxkrx0Y2w2z4xqFM2D2kaz0VXo2wUw2pZVgaUFFqtvmY65ZOEIYMnJ84fFbnc7PTmIb/tvdW27soC80P62bToNhWHLeuac9U6XJISB0JO7QvL2NVvt07YcRfKgfkusYIWFbH+p37GMVoWFFFj6lm1ljS3vRLHJbs27ZPXK3e47/dDU01Q2i40mm7O9tzFLXqww43q30r7Lr56Aa+GrdwphXTxUhMNaGKhigfXwUAkPd8F7Vb9XF+8pwr1aGE/F4tXDe0p4bxc8rfppXTxVhNNaGKpiofXwVAlPd8Hzqp/XxXNFOK+F4SoWXg/PlfB8Fzyp+kldPFGEk1oYomIh9fBECU/K8OLPIzu1A6xnp16CdT9cHwxK7yyaHR+mUJwVV+7VjTtInP4ZeiuOFe/FkRWIU9yvyRAwL/DDVHwNeMOjrNplBPONUVTiQ8SUPzuqxfXJsdk4Ns0/G4evf19/kzzVvtd+1H7RTK2lvdbeaOfaQINaoP2l/a39s//fwQ8HPx38vLI+frSe80IrXQe//Q8UNPHI</latexit>

r1 <latexit sha1_base64="pd3L+wPVTqb/6ceddXPUL+1bOrE=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4bQwwlL70/unx0ax8by0quFuS4OtfV1fv98fzAcERgHKOQQgyi6NQ3K7xLAuA8xSveGcYQogFPgoduYj9t3iR/SmKMQpvpLoY1jrHOiZ1D6yGcIcrwQBYDMFwk6nAAGIBfoe+WoCIUgQNHRaObTaFVGM29VcCDu+y6ZL/uSliYmHgN04sN5iSwBQRQAPqkMRovALQ+iGCM2C8qDGaVglJxzxKAfZT04F415T7NWR5fkfK1PFnSCwihNYobT4kQhIMbQWExclhHiMU2WNyPWdxq94ixGR1m5HHvlADa9QKMjkVMaKOOMMQG8POQGWXNC9ABJEIBwlAxpmgw5mvNkeHScCvGl7mJhdglgo7LzIk2SYdYz19UvhLUkviuI72SxVxB76x8heKSPCdNnYv0Ji3Rh1IWF+RBF5dmDfPZYH8jRVwXxShavC+K1LLpxQY0r6qygzirqQ0F9kNV5QZzL4qIgLmTxU0H8JIs3u2I/7Ir9KMWK/otNsdgbjtBYvD6Wj1AyhWny5vLtH2nSWV7rZyFGulk2QndjPO03W51WKst4ozf6bdNyqnpuaFld01YZcke3ZRtOb8tyInlzaMNo9rpNOQrird7u2P2qvoU1ui3HUhi2tH270Wut24dQKFm93Gd3mo1KW7w8p2NZ1lmnqucGq2Hb7ROFIXfYjuN07SUKjRnFSPLSjbHZPDOqUTQPahvNRlehbxfAaFtWBZYWWKy+ZTrmkoUjgCUnzx8Wu93t9OQgvu2/1bXtygLyQvvbtuk0FIYt65lz1jtdkhAGQk/uCsnb12y1T9tyFMmD+i2xghUWsv2lfscyWhUWUmDpW7aVNba8E8UmuzXvktUrd7vv9ENTT1PZLDaabM723sYsebHCjOvdSvsuv3oCroWv3imEdfFQEQ5rYaCKBdbDQyU83AXvVf1eXbynCPdqYTwVi1cP7ynhvV3wtOqndfFUEU5rYaiKhdbDUyU83QXPq35eF88V4bwWhqtYeD08V8LzXfCk6id18UQRTmphiIqF1MMTJTwpw4s/j+zUDrCenXoJ1v1wfTAovbNodnyYQnFWXLlXN+4gcfpn6K04VrwXR1YgTnG/JkPAvMAPU/E14A2PsmqXEcw3RlGJDxFT/uyoFtcnx2bj2DT/bBy+/n39TfJU+177UftFM7WW9lp7o51rAw1qgfaX9rf2z/5/Bz8c/HTw88r6+NF6zgutdB389j8hPfHJ</latexit>

r2 Both our agent and their opponent know everything there is to know about
~ ~ ~ ~
the current state of the game by just observing the game board. This is often
s1
<latexit sha1_base64="OzSwNdQHk8BIaE6qZCRw39GSBas=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4YiIYnSe/P+2aFxbCwvvVqY6+JQW1/n98/3B8MRgXGAQg4xiKJb06D8LgGM+xCjdG8YR4gCOAUeuo35uH2X+CGNOQphqr8U2jjGOid6BqWPfIYgxwtRAMh8kaDDCWAAcoG+V46KUAgCFB2NZj6NVmU081YFB+K+75L5si9paWLiMUAnPpyXyBIQRAHgk8pgtAjc8iCKMWKzoDyYUQpGyTlHDPpR1oNz0Zj3NGt1dEnO1/pkQScojNIkZjgtThQCYgyNxcRlGSEe02R5M2J9p9ErzmJ0lJXLsVcOYNMLNDoSOaWBMs4YE8DLQ26QNSdED5AEAQhHyZCmyZCjOU+GR8epEF/qLhZmlwA2Kjsv0iQZZj1zXf1CWEviu4L4ThZ7BbG3/hGCR/qYMH0m1p+wSBdGXViYD1FUnj3IZ4/1gRx9VRCvZPG6IF7LohsX1LiizgrqrKI+FNQHWZ0XxLksLgriQhY/FcRPsnizK/bDrtiPUqzov9gUi73hCI3F62P5CCVTmCZvLt/+kSad5bV+FmKkm2UjdDfG036z1Wmlsow3eqPfNi2nqueGltU1bZUhd3RbtuH0tiwnkjeHNoxmr9uUoyDe6u2O3a/qW1ij23IshWFL27cbvda6fQiFktXLfXan2ai0xctzOpZlnXWqem6wGrbdPlEYcoftOE7XXqLQmFGMJC/dGJvNM6MaRfOgttFsdBX6dgGMtmVVYGmBxepbpmMuWTgCWHLy/GGx291OTw7i2/5bXduuLCAvtL9tm05DYdiynjlnvdMlCWEg9OSukLx9zVb7tC1HkTyo3xIrWGEh21/qdyyjVWEhBZa+ZVtZY8s7UWyyW/MuWb1yt/tOPzT1NJXNYqPJ5mzvbcySFyvMuN6ttO/yqyfgWvjqnUJYFw8V4bAWBqpYYD08VMLDXfBe1e/VxXuKcK8WxlOxePXwnhLe2wVPq35aF08V4bQWhqpYaD08VcLTXfC86ud18VwRzmthuIqF18NzJTzfBU+qflIXTxThpBaGqFhIPTxRwpMyvPjzyE7tAOvZqZdg3Q/XB4PSO4tmx4cpFGfFlXt14w4Sp3+G3opjxXtxZAXiFPdrMgTMC/wwFV8D3vAoq3YZwXxjFJX4EDHlz45qcX1ybDaOTfPPxuHr39ffJE+177UftV80U2tpr7U32rk20KAWaH9pf2v/7P938MPBTwc/r6yPH63nvNBK18Fv/wOJuPHR</latexit>

s2
<latexit sha1_base64="1RcFirdzOoS5tyEl3SzBljfwnaY=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4YiIYnS+5P7Z4fGsbG89GphrotDbX2d3z/fHwxHBMYBCjnEIIpuTYPyuwQw7kOM0r1hHCEK4BR46Dbm4/Zd4oc05iiEqf5SaOMY65zoGZQ+8hmCHC9EASDzRYIOJ4AByAX6XjkqQiEIUHQ0mvk0WpXRzFsVHIj7vkvmy76kpYmJxwCd+HBeIktAEAWATyqD0SJwy4MoxojNgvJgRikYJeccMehHWQ/ORWPe06zV0SU5X+uTBZ2gMEqTmOG0OFEIiDE0FhOXZYR4TJPlzYj1nUavOIvRUVYux145gE0v0OhI5JQGyjhjTAAvD7lB1pwQPUASBCAcJUOaJkOO5jwZHh2nQnypu1iYXQLYqOy8SJNkmPXMdfULYS2J7wriO1nsFcTe+kcIHuljwvSZWH/CIl0YdWFhPkRRefYgnz3WB3L0VUG8ksXrgngti25cUOOKOiuos4r6UFAfZHVeEOeyuCiIC1n8VBA/yeLNrtgPu2I/SrGi/2JTLPaGIzQWr4/lI5RMYZq8uXz7R5p0ltf6WYiRbpaN0N0YT/vNVqeVyjLe6I1+27Scqp4bWlbXtFWG3NFt2YbT27KcSN4c2jCavW5TjoJ4q7c7dr+qb2GNbsuxFIYtbd9u9Frr9iEUSlYv99mdZqPSFi/P6ViWddap6rnBath2+0RhyB224zhde4lCY0Yxkrx0Y2w2z4xqFM2D2kaz0VXo2wUw2pZVgaUFFqtvmY65ZOEIYMnJ84fFbnc7PTmIb/tvdW27soC80P62bToNhWHLeuac9U6XJISB0JO7QvL2NVvt07YcRfKgfkusYIWFbH+p37GMVoWFFFj6lm1ljS3vRLHJbs27ZPXK3e47/dDU01Q2i40mm7O9tzFLXqww43q30r7Lr56Aa+GrdwphXTxUhMNaGKhigfXwUAkPd8F7Vb9XF+8pwr1aGE/F4tXDe0p4bxc8rfppXTxVhNNaGKpiofXwVAlPd8Hzqp/XxXNFOK+F4SoWXg/PlfB8Fzyp+kldPFGEk1oYomIh9fBECU/K8OLPIzu1A6xnp16CdT9cHwxK7yyaHR+mUJwVV+7VjTtInP4ZeiuOFe/FkRWIU9yvyRAwL/DDVHwNeMOjrNplBPONUVTiQ8SUPzuqxfXJsdk4Ns0/G4evf19/kzzVvtd+1H7RTK2lvdbeaOfaQINaoP2l/a39s//fwQ8HPx38vLI+frSe80IrXQe//Q+WwfHS</latexit>

a wrong assumption: Consider the game of poker, where our agent only
s0
<latexit sha1_base64="KIzHNcLYbod1+x/KccCLlzNOPGI=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4YiIYnSe+P+2aFxbCwvvVqY6+JQW1/n98/3B8MRgXGAQg4xiKJb06D8LgGM+xCjdG8YR4gCOAUeuo35uH2X+CGNOQphqr8U2jjGOid6BqWPfIYgxwtRAMh8kaDDCWAAcoG+V46KUAgCFB2NZj6NVmU081YFB+K+75L5si9paWLiMUAnPpyXyBIQRAHgk8pgtAjc8iCKMWKzoDyYUQpGyTlHDPpR1oNz0Zj3NGt1dEnO1/pkQScojNIkZjgtThQCYgyNxcRlGSEe02R5M2J9p9ErzmJ0lJXLsVcOYNMLNDoSOaWBMs4YE8DLQ26QNSdED5AEAQhHyZCmyZCjOU+GR8epEF/qLhZmlwA2Kjsv0iQZZj1zXf1CWEviu4L4ThZ7BbG3/hGCR/qYMH0m1p+wSBdGXViYD1FUnj3IZ4/1gRx9VRCvZPG6IF7LohsX1LiizgrqrKI+FNQHWZ0XxLksLgriQhY/FcRPsnizK/bDrtiPUqzov9gUi73hCI3F62P5CCVTmCZvLt/+kSad5bV+FmKkm2UjdDfG036z1Wmlsow3eqPfNi2nqueGltU1bZUhd3RbtuH0tiwnkjeHNoxmr9uUoyDe6u2O3a/qW1ij23IshWFL27cbvda6fQiFktXLfXan2ai0xctzOpZlnXWqem6wGrbdPlEYcoftOE7XXqLQmFGMJC/dGJvNM6MaRfOgttFsdBX6dgGMtmVVYGmBxepbpmMuWTgCWHLy/GGx291OTw7i2/5bXduuLCAvtL9tm05DYdiynjlnvdMlCWEg9OSukLx9zVb7tC1HkTyo3xIrWGEh21/qdyyjVWEhBZa+ZVtZY8s7UWyyW/MuWb1yt/tOPzT1NJXNYqPJ5mzvbcySFyvMuN6ttO/yqyfgWvjqnUJYFw8V4bAWBqpYYD08VMLDXfBe1e/VxXuKcK8WxlOxePXwnhLe2wVPq35aF08V4bQWhqpYaD08VcLTXfC86ud18VwRzmthuIqF18NzJTzfBU+qflIXTxThpBaGqFhIPTxRwpMyvPjzyE7tAOvZqZdg3Q/XB4PSO4tmx4cpFGfFlXt14w4Sp3+G3opjxXtxZAXiFPdrMgTMC/wwFV8D3vAoq3YZwXxjFJX4EDHlz45qcX1ybDaOTfPPxuHr39ffJE+177UftV80U2tpr7U32rk20KAWaH9pf2v/7P938MPBTwc/r6yPH63nvNBK18Fv/wN8r/HQ</latexit>

a0
<latexit sha1_base64="M8sqRzeDFh3Mi1iBGtqcTbyBm/Y=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4YQ4gSk98b9s0Pj2FheerUw18Whtr7O75/vD4YjAuMAhRxiEEW3pkH5XQIY9yFG6d4wjhAFcAo8dBvzcfsu8UMacxTCVH8ptHGMdU70DEof+QxBjheiAJD5IkGHE8AA5AJ9rxwVoRAEKDoazXwarcpo5q0KDsR93yXzZV/S0sTEY4BOfDgvkSUgiALAJ5XBaBG45UEUY8RmQXkwoxSMknOOGPSjrAfnojHvadbq6JKcr/XJgk5QGKVJzHBanCgExBgai4nLMkI8psnyZsT6TqNXnMXoKCuXY68cwKYXaHQkckoDZZwxJoCXh9wga06IHiAJAhCOkiFNkyFHc54Mj45TIb7UXSzMLgFsVHZepEkyzHrmuvqFsJbEdwXxnSz2CmJv/SMEj/QxYfpMrD9hkS6MurAwH6KoPHuQzx7rAzn6qiBeyeJ1QbyWRTcuqHFFnRXUWUV9KKgPsjoviHNZXBTEhSx+KoifZPFmV+yHXbEfpVjRf7EpFnvDERqL18fyEUqmME3eXL79I006y2v9LMRIN8tG6G6Mp/1mq9NKZRlv9Ea/bVpOVc8NLatr2ipD7ui2bMPpbVlOJG8ObRjNXrcpR0G81dsdu1/Vt7BGt+VYCsOWtm83eq11+xAKJauX++xOs1Fpi5fndCzLOutU9dxgNWy7faIw5A7bcZyuvUShMaMYSV66MTabZ0Y1iuZBbaPZ6Cr07QIYbcuqwNICi9W3TMdcsnAEsOTk+cNit7udnhzEt/23urZdWUBeaH/bNp2GwrBlPXPOeqdLEsJA6MldIXn7mq32aVuOInlQvyVWsMJCtr/U71hGq8JCCix9y7ayxpZ3othkt+ZdsnrlbvedfmjqaSqbxUaTzdne25glL1aYcb1bad/lV0/AtfDVO4WwLh4qwmEtDFSxwHp4qISHu+C9qt+ri/cU4V4tjKdi8erhPSW8twueVv20Lp4qwmktDFWx0Hp4qoSnu+B51c/r4rkinNfCcBULr4fnSni+C55U/aQunijCSS0MUbGQeniihCdlePHnkZ3aAdazUy/Buh+uDwaldxbNjg9TKM6KK/fqxh0kTv8MvRXHivfiyArEKe7XZAiYF/hhKr4GvOFRVu0ygvnGKCrxIWLKnx3V4vrk2Gwcm+afjcPXv6+/SZ5q32s/ar9optbSXmtvtHNtoEEt0P7S/tb+2f/v4IeDnw5+XlkfP1rPeaGVroPf/gf1BfGy</latexit>

a1
<latexit sha1_base64="faCoi3suTLeizhRJ44TVzLkE1Ww=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4YQ4gSk9+b9s0Pj2FheerUw18Whtr7O75/vD4YjAuMAhRxiEEW3pkH5XQIY9yFG6d4wjhAFcAo8dBvzcfsu8UMacxTCVH8ptHGMdU70DEof+QxBjheiAJD5IkGHE8AA5AJ9rxwVoRAEKDoazXwarcpo5q0KDsR93yXzZV/S0sTEY4BOfDgvkSUgiALAJ5XBaBG45UEUY8RmQXkwoxSMknOOGPSjrAfnojHvadbq6JKcr/XJgk5QGKVJzHBanCgExBgai4nLMkI8psnyZsT6TqNXnMXoKCuXY68cwKYXaHQkckoDZZwxJoCXh9wga06IHiAJAhCOkiFNkyFHc54Mj45TIb7UXSzMLgFsVHZepEkyzHrmuvqFsJbEdwXxnSz2CmJv/SMEj/QxYfpMrD9hkS6MurAwH6KoPHuQzx7rAzn6qiBeyeJ1QbyWRTcuqHFFnRXUWUV9KKgPsjoviHNZXBTEhSx+KoifZPFmV+yHXbEfpVjRf7EpFnvDERqL18fyEUqmME3eXL79I006y2v9LMRIN8tG6G6Mp/1mq9NKZRlv9Ea/bVpOVc8NLatr2ipD7ui2bMPpbVlOJG8ObRjNXrcpR0G81dsdu1/Vt7BGt+VYCsOWtm83eq11+xAKJauX++xOs1Fpi5fndCzLOutU9dxgNWy7faIw5A7bcZyuvUShMaMYSV66MTabZ0Y1iuZBbaPZ6Cr07QIYbcuqwNICi9W3TMdcsnAEsOTk+cNit7udnhzEt/23urZdWUBeaH/bNp2GwrBlPXPOeqdLEsJA6MldIXn7mq32aVuOInlQvyVWsMJCtr/U71hGq8JCCix9y7ayxpZ3othkt+ZdsnrlbvedfmjqaSqbxUaTzdne25glL1aYcb1bad/lV0/AtfDVO4WwLh4qwmEtDFSxwHp4qISHu+C9qt+ri/cU4V4tjKdi8erhPSW8twueVv20Lp4qwmktDFWx0Hp4qoSnu+B51c/r4rkinNfCcBULr4fnSni+C55U/aQunijCSS0MUbGQeniihCdlePHnkZ3aAdazUy/Buh+uDwaldxbNjg9TKM6KK/fqxh0kTv8MvRXHivfiyArEKe7XZAiYF/hhKr4GvOFRVu0ygvnGKCrxIWLKnx3V4vrk2Gwcm+afjcPXv6+/SZ5q32s/ar9optbSXmtvtHNtoEEt0P7S/tb+2f/v4IeDnw5+XlkfP1rPeaGVroPf/gcCHfGz</latexit>

a2
<latexit sha1_base64="MJSdrcnRCn4KhDzKoML3O8YiGsk=">AAANdXicfZdbb9s2GIbVdocuS7Z2vdwuhGUdhiHIpMTxAUOBWpKNXqxtFsRJ2jgIKJqWBVMiQVGOXUG/YrfbD9sf2fUoH2SJoqyrD3xfvn70UZQpl2I/4obx76PHTz77/Isvn3619/X+wTffPnv+3VVEYgbRABJM2I0LIoT9EA24zzG6oQyBwMXo2p3amX49QyzySXjJFxTdBcAL/bEPARdDH4YQ4gSk9yf3zw6NY2N56dXCXBeH2vo6v3++PxiOCIwDFHKIQRTdmgbldwlg3IcYpXvDOEIUwCnw0G3Mx+27xA9pzFEIU/2l0MYx1jnRMyh95DMEOV6IAkDmiwQdTgADkAv0vXJUhEIQoOhoNPNptCqjmbcqOBD3fZfMl31JSxMTjwE68eG8RJaAIAoAn1QGo0XglgdRjBGbBeXBjFIwSs45YtCPsh6ci8a8p1mro0tyvtYnCzpBYZQmMcNpcaIQEGNoLCYuywjxmCbLmxHrO41ecRajo6xcjr1yAJteoNGRyCkNlHHGmABeHnKDrDkheoAkCEA4SoY0TYYczXkyPDpOhfhSd7EwuwSwUdl5kSbJMOuZ6+oXwloS3xXEd7LYK4i99Y8QPNLHhOkzsf6ERbow6sLCfIii8uxBPnusD+Toq4J4JYvXBfFaFt24oMYVdVZQZxX1oaA+yOq8IM5lcVEQF7L4qSB+ksWbXbEfdsV+lGJF/8WmWOwNR2gsXh/LRyiZwjR5c/n2jzTpLK/1sxAj3SwbobsxnvabrU4rlWW80Rv9tmk5VT03tKyuaasMuaPbsg2nt2U5kbw5tGE0e92mHAXxVm937H5V38Ia3ZZjKQxb2r7d6LXW7UMolKxe7rM7zUalLV6e07Es66xT1XOD1bDt9onCkDtsx3G69hKFxoxiJHnpxthsnhnVKJoHtY1mo6vQtwtgtC2rAksLLFbfMh1zycIRwJKT5w+L3e52enIQ3/bf6tp2ZQF5of1t23QaCsOW9cw5650uSQgDoSd3heTta7bap205iuRB/ZZYwQoL2f5Sv2MZrQoLKbD0LdvKGlveiWKT3Zp3yeqVu913+qGpp6lsFhtNNmd7b2OWvFhhxvVupX2XXz0B18JX7xTCunioCIe1MFDFAuvhoRIe7oL3qn6vLt5ThHu1MJ6KxauH95Tw3i54WvXTuniqCKe1MFTFQuvhqRKe7oLnVT+vi+eKcF4Lw1UsvB6eK+H5LnhS9ZO6eKIIJ7UwRMVC6uGJEp6U4cWfR3ZqB1jPTr0E6364PhiU3lk0Oz5MoTgrrtyrG3eQOP0z9FYcK96LIysQp7hfkyFgXuCHqfga8IZHWbXLCOYbo6jEh4gpf3ZUi+uTY7NxbJp/Ng5f/77+Jnmqfa/9qP2imVpLe6290c61gQa1QPtL+1v7Z/+/gx8Ofjr4eWV9/Gg954VWug5++x8PJvG0</latexit>

knows their hand and the open cards, and not the hands of other players.
~ ~ Poker is an example of a partially observable MDP (POMDP), where our
~
agent can only observe a small part of the environment, or its observations
are noisy. POMDPs are much more complex to work with than MDPs, but there is a large body of literature on them. For this lecture, however, they are out of scope.
• Partial observability: POMDP
• Out of scope!


MARKOV ASSUMPTION
Markov assumption: s_t independent of history given s_{t-1}:
p(s_t | a_{t-1}, s_1, ..., s_{t-1}) = p(s_t | a_{t-1}, s_{t-1})
• Used to derive strong algorithms!
• Fundamental assumption behind RL
[Figure: trajectory s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, with each action sampled from the policy π_θ]

The second key assumption, which is maybe even more important than the previous one, is the Markov assumption. It says that the distribution over the next state s_t is independent of the history if we have complete information about the current state s_{t-1} and the chosen action a_{t-1}. In other words, the current state and the chosen action tell us everything the history could tell us about the next state. This condition is very easy to violate: for example, a single static image doesn't contain the velocity of the objects in it. Recovering velocity requires multiple images or a different feature space. Often, we can solve such issues by expanding the features of the state, for example by taking the last few observations as part of the state.
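As a minimal sketch of the idea of making the last few observations part of the state, the snippet below keeps a sliding window of observations. The class name `FrameStack`, the window size `k` and the toy "ball position" observations are assumptions made for illustration, not part of the lecture.

```python
from collections import deque


class FrameStack:
    """Builds an augmented state out of the last k observations (hypothetical helper)."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Pad with copies of the first observation so the state always has k entries.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_obs)
        return tuple(self.frames)

    def step(self, obs):
        # Appending drops the oldest observation automatically (maxlen=k).
        self.frames.append(obs)
        return tuple(self.frames)


# Toy example: each observation is only the position of a ball moving at constant velocity.
positions = [0.0, 1.5, 3.0, 4.5]
stack = FrameStack(k=2)
state = stack.reset(positions[0])
for pos in positions[1:]:
    state = stack.step(pos)

# With two stacked positions, the velocity can be recovered by a finite difference.
velocity = state[1] - state[0]
print(state, velocity)
```

A single position says nothing about where the ball goes next, but the pair of positions does, so the stacked state is closer to satisfying the Markov assumption. The price is a larger state space.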

The Markov assumption is what distinguishes RL from black-box optimization algorithms, such as evolutionary algorithms. Those assume nothing about the environment, so they cannot make use of this structure.
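To make the contrast concrete, here is a minimal sketch of a black-box method: a (1+1) evolution strategy that only ever looks at the total return of an episode. The toy environment, the one-parameter policy `theta` and the function names are all assumptions chosen for illustration.

```python
import random


def episode_return(theta, horizon=20):
    """Toy environment (an illustrative assumption): the agent sits at position x,
    moves by a = -theta * x each step, and then receives reward -x**2."""
    x, total = 1.0, 0.0
    for _ in range(horizon):
        x = x + (-theta * x)   # deterministic transition
        total += -x * x        # reward: stay close to the origin
    return total


def hill_climb(n_iters=200, sigma=0.1, seed=0):
    """(1+1) evolution strategy: only the total return of an episode is used.
    States, actions and per-step rewards are never inspected."""
    rng = random.Random(seed)
    theta = 0.0
    best = episode_return(theta)
    for _ in range(n_iters):
        candidate = theta + rng.gauss(0.0, sigma)
        score = episode_return(candidate)
        if score > best:       # keep the mutation only if it improves the return
            theta, best = candidate, score
    return theta, best


theta, best = hill_climb()
print(theta, best)
```

The optimizer treats `episode_return` as an opaque function of `theta`. An RL method would instead use the individual states, actions and rewards inside the episode, which is only meaningful because of the Markov structure.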


REWARDS
We introduced RL with MDPs
The goal is to maximize reward!
What does this mean in Deep RL?

So far, we have introduced the structure of RL using MDPs. In RL, we want to maximize reward. What exactly does this mean? And particularly, what does this mean in the context of Deep RL?

EXPECTED RETURN
The total reward is simply the sum of all rewards the agent receives during
