Some Thoughts on Reinforcement Learning
Maher Said
February 3, 2006
1 Motivation
Why do we care about learning? One important reason is rationality (or the lack thereof): it is commonly accepted that rationality, as defined by economists (i.e., optimization), is far too demanding a requirement. In particular, the Bayesian approach requires full mental models, with state spaces, partitions, priors, and the like; in games, we need to know our payoffs, our opponents' payoffs, how the interaction of actions determines payoffs, the distributions over possible payoff functions of our opponents, and so on. While this may be a good approximation in some cases, it entirely rules out situations in which such a mental model is unavailable or infeasible. Furthermore, assuming the existence of such a mental model begs the question of where it comes from. One natural source is repeated experience combined with inductive and deductive reasoning. Specifically, simple learning algorithms or rules of thumb may lead to rational behavior if there is a great enough wealth of experience to learn from. Thus, it is useful to understand the performance of these learning rules in individual (as well as interactive) decision-making problems.
2 Overview of Reinforcement Learning
Reinforcement learning (RL) is a powerful tool used frequently by computer scientists to train essentially independent agents (generally computers or machines) to choose optimal actions to achieve their goals in a Markovian environment. It has been successfully used, for example, to train a computer to learn the game of backgammon and frequently defeat the best human players in the world, as well as to control elevator dispatching in complicated environments (see Sutton and Barto (1998) for case studies on these and other real-world implementations of reinforcement learning). A basic reinforcement learning system consists of a decision maker and an environment, as well as a policy, reward function, and value function. The policy defines the decision maker's behavior at a given time; in particular, it is a mapping from observed states to available actions. The reward function maps the observed state of the environment to a real-valued reward (which may be interpreted as utility in our setting). The decision maker is assumed to maximize the sum total of rewards received. Finally, the value function serves as the agent's estimate of the best possible stream of rewards that may be achieved by following her policy. The key to an RL system, however, is the manner in which the policy evolves in response to the interaction of the agent with the environment: as actions are taken and rewards are received, the value function is updated and the policy is refined.
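To make the pieces concrete, the following is a minimal Python sketch of the agent-environment loop just described; it is purely illustrative (the two-state environment, the Gaussian rewards, and all names are my own assumptions, not taken from Sutton and Barto (1998)).

import random

class Environment:
    """A toy Markovian environment with two states; rewards and transitions are hypothetical."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Reward function: the observed state and the chosen action determine a real-valued reward.
        reward = random.gauss(1.0 if action == self.state else 0.0, 0.1)
        # State transition: the environment moves (here, randomly) to a new state.
        self.state = random.choice([0, 1])
        return reward, self.state

class Agent:
    """A policy derived from a table of value estimates, one per (state, action) pair."""
    def __init__(self, n_states=2, n_actions=2):
        self.n_actions = n_actions
        self.values = {(s, a): 0.0 for s in range(n_states) for a in range(n_actions)}
        self.counts = {(s, a): 0 for s in range(n_states) for a in range(n_actions)}

    def policy(self, state):
        # The policy maps the observed state to an action; here, greedily with respect to the values.
        return max(range(self.n_actions), key=lambda a: self.values[(state, a)])

    def update(self, state, action, reward):
        # Refine the value estimate as rewards are received.
        self.counts[(state, action)] += 1
        self.values[(state, action)] += (reward - self.values[(state, action)]) / self.counts[(state, action)]

env, agent = Environment(), Agent()
state = env.state
for t in range(1000):
    # Occasional random actions stand in for exploration.
    action = agent.policy(state) if random.random() > 0.1 else random.randrange(agent.n_actions)
    reward, next_state = env.step(action)
    agent.update(state, action, reward)
    state = next_state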
2.1 Q-Learning
There exists a wide variety of reinforcement learning techniques, some of which are better suited to certain learning environments and tasks than others. The most fundamental of these techniques is that of Q-learning. Define the function Q(s, a) to be the maximum payoff achievable when starting in state s and taking action a in that period; that is, in a setting with no discounting, we would define

    Q(s, a) := E[r(s, a)] + E[V(δ(s, a))],    (1)

where r(s, a) is the reward function evaluated at state s and action a, δ(s, a) is a state transition mapping, V is the optimal value function, and E is the expectations operator. The optimal policy in state s is then

    π(s) := arg max_{a ∈ A} {Q(s, a)};    (2)

thus, if the decision maker is able to learn Q through experience, it is possible to determine an optimal policy without knowledge of the reward function r, the transition function δ, or the optimal value function V. In fact, knowledge of Q implies knowledge of V, as V(s) = Q(s, π(s)) for all states s. So, how does a decision maker go about learning Q? The first step is noting that

    Q(s, a) = E[r(s, a)] + E[max_{a′} Q(δ(s, a), a′)].    (3)
So, let Q̂_t denote the estimate of Q at time t, and initialize Q̂_0(s, a) = 0 for all s and a. The decision maker observes the current state s_t and chooses an action a_t. After receiving a reward r_t and observing the new state s_{t+1}, the estimate of Q is updated to Q̂_t:

    Q̂_t(s_t, a_t) = (1 − α_t) Q̂_{t−1}(s_t, a_t) + α_t [r_t + max_{a′} Q̂_{t−1}(s_{t+1}, a′)],    (4)

where α_t is the inverse of the number of times the state-action pair (s_t, a_t) has been observed. It can be shown that, for an arbitrary initialization of Q̂_0 and any action selection rule such that each state-action pair (s, a) is observed infinitely often, the update rule above is such that Q̂_t(s, a) → Q(s, a) for all s and all a as t → ∞.¹
¹ For a detailed statement and discussion of this convergence theorem, see Theorem 13.2 in Mitchell (1997).
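As an illustration, a minimal Python sketch of the update rule in (4) follows; it assumes a generic environment object with reset() and step() methods (hypothetical names, not an interface from Mitchell (1997) or Sutton and Barto (1998)), and uses ε-greedy selection merely as one rule that visits every state-action pair infinitely often.

from collections import defaultdict
import random

def q_learning(env, n_actions, n_periods=10_000, epsilon=0.1):
    """Tabular Q-learning with alpha_t equal to 1 / (number of visits to (s, a))."""
    Q = defaultdict(float)       # Q_0(s, a) = 0 for all s and a
    visits = defaultdict(int)    # visit counts for each state-action pair

    state = env.reset()
    for t in range(n_periods):
        # Any selection rule that tries every (s, a) infinitely often will do;
        # epsilon-greedy is one simple choice.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[(state, a)])

        reward, next_state = env.step(action)

        # Update rule (4): a convex combination of the old estimate and the new
        # sample r_t + max_a' Q_{t-1}(s_{t+1}, a'), with no discounting.
        visits[(state, action)] += 1
        alpha = 1.0 / visits[(state, action)]
        target = reward + max(Q[(next_state, a)] for a in range(n_actions))
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target

        state = next_state
    return Q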
2.2 The n-Armed Bandit Problem
An n-armed bandit, also called a multi-armed bandit, is similar to a traditional slot machine (a one-armed bandit), but has n levers. When pulled, each lever yields a reward drawn from some fixed distribution associated with that lever. Initially, the decision maker (or gambler) has no information about any of the levers, but with repeated trials begins to learn which levers are associated with the highest payoffs. The n-armed bandit is a particularly good setting in which to study such methods due to its simplicity. One can easily compare the average rewards yielded by different decision-making approaches. Furthermore, there is always a well-defined optimal lever, so one may compare the time required by each strategy before consistently choosing the correct action. Finally, the stationarity and independence of the basic n-armed bandit problem allows one to examine the optimality properties of each method in a setting that does not require assumptions of significant probabilistic sophistication on the part of the decision maker.
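For concreteness, a simple simulated bandit of this kind might look as follows (a sketch of my own, with arbitrary Gaussian reward distributions); it serves as a test bed for the selection rules discussed below.

import random

class NArmedBandit:
    """Each lever pays a draw from a fixed distribution; here, Gaussian with unit variance."""
    def __init__(self, n, seed=None):
        rng = random.Random(seed)
        self.means = [rng.gauss(0.0, 1.0) for _ in range(n)]   # true (unknown) mean payoffs
        self.rng = rng

    def pull(self, arm):
        return self.rng.gauss(self.means[arm], 1.0)

    def best_arm(self):
        # The well-defined optimal lever: the one with the highest expected payoff.
        return max(range(len(self.means)), key=lambda a: self.means[a])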
2.3 Q-Learning and the n-Armed Bandit Problem
In the n-armed bandit problem, the Q-learning algorithm described above becomes even simpler. Since the problem is completely stationary, s_t = s_{t+1} for all t ∈ N, allowing us to drop the dependence on s; thus, we may rewrite equation (3) above as

    Q(a) = E[r(a)] + E[max_{a′} Q(a′)].    (5)
Since an optimal policy maximizes Q and the n-armed bandit (since n is finite) has a well-defined best action (the one with the highest expected payoff), we may define, for all a,

    Q̄(a) := Q(a) − E[max_{a′} Q(a′)] = E[r(a)].    (6)
Thus, a policy that is optimal with respect to Q̄ is also optimal with respect to Q; knowledge of Q̄ immediately implies knowledge of Q. Therefore, we can concentrate on learning Q̄. Notice, however, that the algorithm above for learning Q allows us, via a simple law of large numbers argument, to learn the value of Q̄ as long as each arm is chosen infinitely often.²

² Note that the update rule will be modified slightly, becoming Q̂_t(a_t) = (1 − α_t) Q̂_{t−1}(a_t) + α_t r_t, where α_t is the reciprocal of the number of times action a_t has been chosen. If Q̂_0(a) = 0 for all a, this corresponds to tracking the average performance over time of each action.
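In code, the modified update rule in the footnote is simply an incremental sample average; the following short sketch (my own illustration) makes this explicit.

def update_estimate(Q_hat, counts, arm, reward):
    """One step of Q_hat_t(a_t) = (1 - alpha_t) Q_hat_{t-1}(a_t) + alpha_t r_t,
    with alpha_t the reciprocal of the number of times arm a_t has been chosen."""
    counts[arm] += 1
    alpha = 1.0 / counts[arm]
    Q_hat[arm] = (1 - alpha) * Q_hat[arm] + alpha * reward
    return Q_hat[arm]

# With Q_hat initialized to zero, Q_hat[arm] is exactly the average payoff
# received so far from that arm.
n = 5
Q_hat, counts = [0.0] * n, [0] * n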
2.4 Action Selection
Given that the only requirement for convergence of Q̂_t(a) → Q̄(a) for all a is that each arm be chosen infinitely often, what then is the appropriate method for choosing which action to take? Especially given that the decision maker cares about the stream of rewards received while learning Q̄, it is unclear how best to balance the explorative and exploitative motives. However, two main methods dominate the literature: ε-greedy action selection and softmax action selection.

The ε-greedy selection method is extremely simple: for some fixed ε > 0, in each period t, the decision maker chooses a*_t = arg max_a Q̂_{t−1}(a) with probability 1 − ε, and randomizes uniformly between all arms with probability ε. With this method, for any ε > 0, each action will be sampled an infinite number of times as t → ∞, ensuring convergence of Q̂_t → Q̄. Furthermore, this implies that the best action will be chosen (in the limit) with probability 1 − ε(n − 1)/n.
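An ε-greedy chooser is only a few lines; the sketch below is a generic illustration (function and parameter names are mine).

import random

def epsilon_greedy(Q_hat, epsilon, rng=random):
    """With probability 1 - epsilon exploit the current best estimate;
    with probability epsilon randomize uniformly over all arms."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q_hat))
    return max(range(len(Q_hat)), key=lambda a: Q_hat[a])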
While simple, ε-greedy selection suffers from the criticism that it ignores much of the information learned over time: it is just as likely to choose the worst-performing action as it is to choose the second-best action. Since the decision maker facing the bandit problem cares about payoffs as well as learning the optimal policy, such a method may be quite unsatisfactory. Therefore, softmax action selection procedures have been suggested, which attempt to smooth the relationship between the estimates of Q̄ and the probability of selecting various actions. The most common softmax method uses the Boltzmann distribution, choosing action a at period t with probability

    π_t(a) = e^{Q̂_t(a)/τ} / Σ_{b=1}^{n} e^{Q̂_t(b)/τ},    (7)

where τ > 0 is a parameter called the temperature. Note that π_t(a) > 0 for all arms a and all t ∈ N. Furthermore, as τ → ∞, π_t approaches a discrete uniform distribution, while as τ → 0, π_t approaches the ε-greedy method with ε = 0. Finally, if a* = arg max_a Q̄(a), then π_t(a*) → e^{Q̄(a*)/τ} / Σ_{b=1}^{n} e^{Q̄(b)/τ} as t → ∞, a much less intuitive result than that obtained by ε-greedy methods. Furthermore, choosing a value of τ for the softmax method is a much more opaque task than choosing a level of ε for the ε-greedy method.

Notice that modifying each of these methods slightly by decreasing ε or τ over time allows one to decrease the amount of exploration over time (by converging to a 0-greedy selection rule) while ensuring that each action will still be chosen an infinite number of times in the limit (since all actions will still be chosen with positive probability for any ε > 0 or any τ > 0). The literature often suggests decreasing these parameters at a rate of either 1/t or log(t)/t; it is also possible to modify these methods so that ε or τ decline at a rate of 1/k_t or log(k_t)/k_t, where k_t is the number of times that a*_t = arg max_a Q̂_t(a) has been chosen.
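The Boltzmann rule in (7), together with a declining temperature of the sort just described, might be implemented as in the following sketch (again my own illustration; the 1/t schedule is one of those mentioned above, and the max-subtraction is only a numerical safeguard that leaves the probabilities unchanged).

import math
import random

def softmax_choice(Q_hat, tau, rng=random):
    """Choose arm a with probability exp(Q_hat[a] / tau) / sum_b exp(Q_hat[b] / tau)."""
    m = max(Q_hat)  # subtracting the maximum does not change the probabilities but avoids overflow
    weights = [math.exp((q - m) / tau) for q in Q_hat]
    return rng.choices(range(len(Q_hat)), weights=weights, k=1)[0]

def tau_schedule(t, tau_0=1.0):
    """One possible schedule: tau_t = tau_0 / t, so exploration declines over time
    while every arm keeps strictly positive probability in every period."""
    return tau_0 / max(t, 1)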
3 Related Literature
Economists have only relatively recently begun to examine reinforcement learning, with the vast majority of authors examining the performance of reinforcement learning agents in the context of a game. Roth and Erev (1995) and Erev and Roth (1998) were seminal papers in this regard: they took as given players' initial propensities to choose each action available to them in a game, with propensities updated additively by the payoffs received in each period and actions chosen with probabilities proportional to their propensities. The authors experimentally tested this model with human subjects, finding that observed behavior seemed to be relatively well-explained by their reinforcement model.

Rustichini (1999) is the only paper that I am aware of that examines the performance of RL-based algorithms in individual decision-making problems. In particular, he examines the optimality properties of two different reinforcement learning procedures in the case where there is a finite set of states that change according to a Markov process, with a finite number of actions with deterministic payoffs. The first rule he examines is that of Roth and Erev (1995) and Erev and Roth (1998):
take as given some initial propensities S^i_0 > 0, where i indexes actions. Then, if action j is chosen in period t − 1, yielding a payoff of r_{t−1}, then S^j_t = S^j_{t−1} + r_{t−1}. The probability of choosing action i at time t is then

    π^i_t = S^i_t / Σ_j S^j_t;    (8)
effectively, the probability of choosing an action is proportional to the cumulative payoff from that action (plus the initial propensity weights). Rustichini shows that this procedure converges almost surely to the action that maximizes expected payoff, where the expectation is taken relative to the invariant distribution of the Markov process governing state transitions. He also examines action selection rules of the form
    π^i_t = e^{θ S^i_t} / Σ_j e^{θ S^j_t},    (9)

where θ > 0 and the S^i_t are defined as before. This procedure is shown to converge to its limit
much faster than the cumulative linear rule in (8); however, this convergence need not be to the payoff-maximizing choice.³

³ Bell (2001) examines the sensitivity of the Roth-Erev rule to the initial conditions S^i_0. Her simulation results show that early behavior is extremely sensitive to these initial conditions.

The remainder of the literature focuses on the performance of RL agents in games. Hopkins (2002) compares the Roth-Erev rule in (8) to fictitious play in two-player games, and finds that an appropriate perturbation of a dynamic system generated by the interaction of two RL agents has the same stationary points as a stochastic fictitious play model, implying that if the game has sufficient structure to determine the asymptotic behavior of stochastic fictitious play, then the Roth-Erev reinforcement learning model will yield similar behavior. Beggs (2005) shows that the play of Roth-Erev agents leads to the elimination of iteratively dominated strategies, and that Roth-Erev agents cannot be forced permanently below their minmax payoffs, even when playing against sophisticated players. Laslier, Topol, and Walliser (2001) show that the play of agents using the Roth-Erev rule will converge with positive probability to any strict Nash equilibrium. Laslier and Walliser (2005) show that this rule converges to the (generically unique) subgame perfect equilibrium in finite extensive form games of perfect information, and Jehiel and Samet (2005) prove a similar result for a closely related updating rule. All of the above-mentioned authors note a link between the dynamic system generated by reinforcement learning and the evolutionary replicator dynamic.

Börgers and Sarin (1997) and (2000) are the only two papers in this literature to examine a rule other than the Roth-Erev cumulative payoff rule. Suppose that action i is taken in period t, yielding a payoff r_t ∈ (0, 1). Then actions are chosen in the following period with probabilities given by
    π^j_{t+1} = r_t + (1 − r_t) π^j_t    if j = i,
    π^j_{t+1} = (1 − r_t) π^j_t          if j ≠ i.    (10)
Thus, payoffs directly affect selection probabilities, with no intermediaries such as propensities or Q. The authors then note that the continuous time limit of this stochastic process satisfies the differential equation which characterizes the continuous time replicator dynamic. In the second of the two papers, they add to the previous specification an endogenous aspiration level a_{t+1} = λ r_t + (1 − λ) a_t, and the reinforcement in period t is based on |r_t − a_t| instead of on r_t alone. In this case, the stochastic system is the sum of a replicator dynamic and a probability matching component.⁴

Unfortunately, the established literature suffers from a few shortcomings. Firstly, all of the papers based on the Roth-Erev framework assume that payoffs are always positive.⁵ Therefore,

    S^i_{t+1} = S^i_t + r_t > S^i_t,

implying that π^i_{t+1} > π^i_t; that is, there is always positive reinforcement and no negative reinforcement, implying that the probability of taking an action increases merely due to that action having been taken in the past. Meanwhile, the framework used by Börgers and Sarin (1997) and (2000) has no intermediary between payoffs and action probabilities, preventing any sort of cognitive or behavioral interpretation of their model. Finally, the only paper that explores nonstationary environments is Rustichini (1999); however, he makes use of a Markov process governing the state transition process and looks for optimality in terms of the stationary distribution, thereby introducing stationarity through the back door.

⁴ Suppose there are two actions s_1 and s_2 with (respective) payoffs of (1, 0) with probability α and (0, 1) with probability 1 − α. Then probability matching implies that the long-run frequency of choosing s_1 is α.

⁵ Negative payoffs can cause Σ_j S^j_t to be small enough that π^i_t is either negative or greater than 1 for some i. This is not an issue with exponential weighting rules.
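For reference, both families of updating rules discussed in this section are easy to simulate; the sketch below (my own construction, with hypothetical Bernoulli payoff parameters) implements the cumulative-propensity rule in (8) and the Börgers-Sarin rule in (10), assuming positive payoffs for the former and payoffs strictly inside (0, 1) for the latter.

import random

def roth_erev_probs(S):
    """Rule (8): choice probabilities proportional to cumulative propensities."""
    total = sum(S)
    return [s / total for s in S]

def roth_erev_update(S, chosen, payoff):
    """Propensities updated additively by the payoff received (assumed non-negative)."""
    S[chosen] += payoff
    return S

def borgers_sarin_update(probs, chosen, payoff):
    """Rule (10): a payoff in (0, 1) shifts probability directly toward the chosen action."""
    return [payoff + (1 - payoff) * p if j == chosen else (1 - payoff) * p
            for j, p in enumerate(probs)]

# Example: two actions with Bernoulli payoff parameters (hypothetical values).
rng = random.Random(0)
payoff_prob = [0.8, 0.4]
S = [1.0, 1.0]           # initial propensities S_0^i > 0
probs_bs = [0.5, 0.5]    # initial choice probabilities for the Borgers-Sarin rule
for t in range(5000):
    # Roth-Erev agent
    a = rng.choices([0, 1], weights=roth_erev_probs(S), k=1)[0]
    roth_erev_update(S, a, 1.0 if rng.random() < payoff_prob[a] else 0.0)
    # Borgers-Sarin agent (payoffs kept strictly inside (0, 1))
    b = rng.choices([0, 1], weights=probs_bs, k=1)[0]
    r = 0.9 if rng.random() < payoff_prob[b] else 0.1
    probs_bs = borgers_sarin_update(probs_bs, b, r)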
4 Possible Directions for Future Work
What can we conclude from the overview above? The main point is that, at least for the basic n-armed bandit problem, reinforcement learning algorithms do quite well in terms of picking out the best arm. Furthermore, there are many different parameterizations that yield this same result. Moreover, the economics literature has explored only a small portion of the universe of reinforcement learning algorithms. So, where do I want to go from here? There are several potential directions for further investigation.
4.1 Reinforcement Learning in a Nonstationary Environment
One limitation of the literature on reinforcement learning (at least in economics) is the assumption of stationarity. While multi-armed bandit problems are interesting and crop up in many strands of economic theory, what is potentially more interesting is when the world changes. Suppose the distributions governing the payoffs of our bandit problem unexpectedly change: does a naive agent making use of a reinforcement learning approach notice the change? And if so, how long does it take for there to be a response? Can the agent eventually learn what the new optimal action is? Moreover, if we take things a step further and look at the performance of RL agents playing games against one another, what happens if the underlying game changes? How will these rules of thumb perform?

One result that seems rather intuitive is that a single change in the environment is not sufficient to bring about permanently sub-optimal behavior. To see this, note that the action selection rules described above all have positive probabilities of choosing any given action in any given period. Therefore, each action will be selected infinitely often after the change in the environment. While the agent's initial (pre-change) experience will bias the Q̂ estimates, a law of large numbers argument implies that these estimates will be consistent in the limit: they will converge to the true means of the distributions, and the agent will have sufficient information about the new environment to choose the optimal action.

Thus, for the situation to be interesting, we need a setting in which the environment will change infinitely often. So, let there be M actions {a_m}_{m=1}^{M} and N states of the world {s_n}_{n=1}^{N}. Taking action m in state n yields a stochastic payoff from a distribution F_{mn} with mean μ_{mn}; thus, we have N different M-armed bandit problems, with payoff distributions governed by the state of the world. State transitions are governed by, for example, a Poisson distribution; in each period in which there is a switch in states, the state of the world is chosen from a uniform distribution over {s_n}_{n=1}^{N}.

One of the most difficult questions is perhaps that of quantifying optimality. If the state of the world is changing infinitely often, there is no action that may be considered to be ex ante optimal, so we cannot look for situations in which the probability of choosing a given arm goes to one as time goes to infinity. One possible metric of optimality would be performance within a regime: the reinforcement learning algorithm would be considered to be performing well if it adjusts quickly after the regime change and approaches the state-specific optimal action. Intuition, supported by simulation evidence, suggests that with such a metric, performance should be inversely related to the frequency of regime change: if the state is constantly in flux, the best we can hope to do is to pick actions at random, whereas if there are long periods of stability, we can expect a significant amount of learning (or re-learning) to occur.

Another approach to this question would be to consider the problem from a Bayesian perspective. What would a fully rational, forward-looking Bayesian agent do in each period? This behavior may be used as a benchmark with which to compare the performance of agents who act according to reinforcement learning. In particular, we can examine the sequence of actions, as well as the history of payoffs, and determine whether RL behavior approximates Bayesian behavior, and how much of a payoff loss, if any, is incurred by acting in accordance with reinforcement learning as opposed to Bayesian optimality.
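A simulation along these lines is straightforward to set up; the sketch below is entirely my own construction, with a per-period switching probability standing in for the Poisson arrival of regime changes and arbitrary Gaussian payoff distributions, and it measures performance as the overall frequency with which the currently optimal (state-specific) arm is chosen.

import random

def simulate(M=5, N=3, switch_prob=0.01, epsilon=0.1, T=50_000, seed=0):
    """M-armed bandit whose payoff means depend on an unobserved state that switches
    with probability switch_prob each period (a geometric stand-in for Poisson-timed
    regime changes); the agent never observes the state."""
    rng = random.Random(seed)
    means = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(N)]  # mu_mn
    state = rng.randrange(N)

    Q_hat, counts = [0.0] * M, [0] * M
    optimal_pulls = 0

    for t in range(T):
        # Regime change: redraw the state uniformly over {s_1, ..., s_N}.
        if rng.random() < switch_prob:
            state = rng.randrange(N)

        # epsilon-greedy action selection on the agent's running estimates.
        if rng.random() < epsilon:
            arm = rng.randrange(M)
        else:
            arm = max(range(M), key=lambda a: Q_hat[a])

        reward = rng.gauss(means[state][arm], 1.0)
        counts[arm] += 1
        Q_hat[arm] += (reward - Q_hat[arm]) / counts[arm]   # sample-average update

        # Performance metric: how often the state-specific best arm is chosen.
        if arm == max(range(M), key=lambda a: means[state][a]):
            optimal_pulls += 1

    return optimal_pulls / T

print(simulate())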
4.2 An Axiomatic Approach
A major difficulty with making use of reinforcement learning techniques is the determination of which specific parameterization one should use. It would be useful to have an axiomatic foundation for reinforcement learning. Restricting attention to the stationary environment setting for the moment, it seems plausible and worthwhile to develop reasonable axioms which would narrow down the set of possible dynamic decision-making rules to those that fall under the umbrella of reinforcement learning. In light of the large variety of possible RL models, axioms that characterize the softmax parameterization, for example, would be extremely useful (both theoretically and for practical application).
4.3 Reinforcement Learning in Other Settings
Recently, there have been a number of papers published about the performance of various RL algorithms in interactive settings: for example, several authors have noted links between the behavior of RL agents playing games against one another and the continuous time replicator dynamic, while others have demonstrated that payoffs will be bounded below by minmax levels, or that RL behavior (in specific parameterizations) is consistent with iterated deletion of strictly dominated strategies. In this vein, it might be interesting to examine the performance of reinforcement learning agents in networks: will reinforcement learning lead to convergence to optimal or efficient networks?
References
[1] Beggs, Alan (2005). "On the Convergence of Reinforcement Learning." Journal of Economic Theory 122: 1-36.
[2] Bell, Ann Maria (2001). "Reinforcement Learning Rules in a Repeated Game." Computational Economics 18: 89-111.
[3] Börgers, Tilman and Rajiv Sarin (1997). "Learning through Reinforcement and Replicator Dynamics." Journal of Economic Theory 77: 1-14.
[4] Börgers, Tilman and Rajiv Sarin (2000). "Naive Reinforcement Learning with Endogenous Aspirations." International Economic Review 41: 921-950.
[5] Erev, Ido and Alvin Roth (1998). "Predicting How People Play Games: Reinforcement Learning in Experimental Games with Unique Mixed Strategy Equilibria." American Economic Review 88: 848-881.
[6] Hopkins, Ed (2002). "Two Competing Models of How People Learn in Games." Econometrica 70: 2141-2166.
[7] Jehiel, Philippe and Dov Samet (2005). "Learning to Play Games in Extensive Form by Valuation." Journal of Economic Theory 124: 129-148.
[8] Laslier, Jean-François, Richard Topol, and Bernard Walliser (2001). "A Behavioral Learning Process in Games." Games and Economic Behavior 37: 340-366.
[9] Laslier, Jean-François and Bernard Walliser (2005). "A Reinforcement Learning Process in Extensive Form Games." International Journal of Game Theory 33: 219-227.
[10] Mitchell, Tom (1997). Machine Learning. Boston: McGraw-Hill.
[11] Roth, Alvin and Ido Erev (1995). "Learning in Extensive-Form Games: Experimental Data and Simple Dynamic Models in the Intermediate Term." Games and Economic Behavior 8: 164-212.
[12] Rustichini, Aldo (1999). "Optimal Properties of Stimulus-Response Learning Models." Games and Economic Behavior 29: 244-273.
[13] Sutton, Richard and Andrew Barto (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.