1997
Q-learning can greatly improve its convergence speed when helped by immediate reinforcements provided by a trainer who is able to judge the usefulness of actions as stage-setting steps toward the agent's goal.
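A minimal sketch of how such trainer feedback might enter the standard Q-learning backup; the shaping term r_trainer, the table sizes, and the learning parameters are illustrative assumptions rather than the paper's exact formulation:

    import numpy as np

    n_states, n_actions = 25, 4
    alpha, gamma = 0.1, 0.9
    Q = np.zeros((n_states, n_actions))

    def update(s, a, r_env, r_trainer, s_next):
        # The trainer's immediate judgement is simply added to the
        # environment reward before the usual Watkins backup.
        r = r_env + r_trainer
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])

A trainer that rewards useful stage-setting actions effectively densifies an otherwise sparse reward signal, which is why convergence can speed up.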
Q-learning is a simple, powerful algorithm for behavior learning. It was derived in the context of single-agent decision making in Markov decision process environments, but its applicability is much broader: in experiments in multiagent environments, Q-learning has also performed well. Our preliminary analysis using dynamical systems finds that Q-learning's indirect control of behavior, via estimates of value, contributes to its beneficial performance in general-sum 2-player games like the Prisoner's Dilemma.
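A minimal sketch of the kind of setup such an analysis concerns: two independent tabular Q-learners repeatedly playing the Prisoner's Dilemma. The payoffs and parameters below are illustrative assumptions:

    import random

    # Row player's payoff for (own_action, opponent_action);
    # 0 = cooperate, 1 = defect. The game is symmetric.
    PAYOFF = {(0, 0): 3, (0, 1): 0, (1, 0): 5, (1, 1): 1}
    alpha, gamma, eps = 0.1, 0.9, 0.1
    Q1, Q2 = [0.0, 0.0], [0.0, 0.0]   # stateless values over the two actions

    def choose(Q):
        if random.random() < eps:
            return random.randrange(2)
        return max(range(2), key=lambda a: Q[a])

    for _ in range(10000):
        a1, a2 = choose(Q1), choose(Q2)
        r1, r2 = PAYOFF[(a1, a2)], PAYOFF[(a2, a1)]
        # Each agent steers its behavior only indirectly, through its
        # evolving value estimates rather than a directly adjusted policy.
        Q1[a1] += alpha * (r1 + gamma * max(Q1) - Q1[a1])
        Q2[a2] += alpha * (r2 + gamma * max(Q2) - Q2[a2])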
Learning from reinforcements is a promising approach for creating intelligent agents. However, reinforcement learning usually requires a large number of training episodes. We present a system called ratle that addresses this shortcoming by allowing a connectionist Q-learner to accept advice given, at any time and in a natural manner, by an external observer. In ratle, the advice-giver watches the learner and occasionally makes suggestions, expressed as instructions in a simple programming language. Based on techniques from knowledge-based neural networks, ratle inserts these programs directly into the agent's utility function. Subsequent reinforcement learning further integrates and refines the advice. We present empirical evidence that shows our approach leads to statistically significant gains in expected reward. Importantly, the advice improves the expected reward regardless of the stage of training at which it is given. A shorter version of this paper appears in the Pr...
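A rough sketch of the knowledge-based insertion idea: an IF-THEN piece of advice is compiled into a new hidden unit whose initial weights encode the rule, after which ordinary learning can refine or even overrule it. The rule, the network shapes, and the weight magnitudes here are illustrative assumptions, not ratle's actual advice language:

    import numpy as np

    # Tiny utility network: 4 input features -> hidden layer -> 2 action values.
    rng = np.random.default_rng(0)
    W_in = rng.normal(0.0, 0.1, (3, 4))    # 3 existing hidden units
    W_out = rng.normal(0.0, 0.1, (2, 3))
    b = np.zeros(3)

    # Advice: "IF feature 0 AND feature 2 THEN prefer action 1".
    # Compile it into one extra hidden unit with large weights (KBANN-style).
    W_in = np.vstack([W_in, [[5.0, 0.0, 5.0, 0.0]]])
    W_out = np.hstack([W_out, [[0.0], [5.0]]])   # unit boosts action 1's utility
    b = np.append(b, -7.5)                       # threshold makes the unit an AND

    def q_values(x):
        h = np.tanh(W_in @ x + b)   # advice unit fires only when both features are on
        return W_out @ h            # subsequent learning keeps refining all weights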
2001 •
In reinforcement learning an autonomous agent learns an optimal policy while interacting with the environment. In particular, in one-step Q-learning, with each action an agent updates its Q-values considering immediate rewards. In this paper a new strategy for updating Q-values is proposed. The strategy, implemented in an algorithm called DQL, uses a set of agents all searching for the same goal in the same space to obtain the same optimal policy. Each agent leaves traces over a copy of the environment (copies of Q-values) while searching for a goal. These copies are used by the agents to decide which actions to take. Once all the agents reach a goal, the original Q-values of the best solution found by all the agents are updated using Watkins’ Q-learning formula. DQL has some similarities with Gambardella’s Ant-Q algorithm [4]; however, it does not require the definition of a domain-dependent heuristic and, consequently, the tuning of additional parameters. DQL also does not update the original Q-values with zero reward while the agents are searching, as Ant-Q does. It is shown how DQL’s guided exploration by several agents with selected exploitation (updating only the best solution) produces faster convergence times than Q-learning and Ant-Q on several test-bed problems under similar conditions.
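A condensed sketch of the scheme as described: each of m agents explores epsilon-greedily over its own copy of the Q-table, and only the best episode found updates the original table with the Watkins backup. The toy gridworld, the parameter values, and "shortest trajectory" as the measure of the best solution are assumptions:

    import random

    n_states, n_actions, goal = 25, 4, 24
    alpha, gamma, eps, m = 0.5, 0.9, 0.2, 10
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def step(s, a):                     # toy 5x5 gridworld dynamics
        r, c = divmod(s, 5)
        r = max(0, min(4, r + (a == 1) - (a == 0)))
        c = max(0, min(4, c + (a == 3) - (a == 2)))
        s2 = r * 5 + c
        return s2, (1.0 if s2 == goal else 0.0)

    for episode in range(200):
        trajectories = []
        for _ in range(m):              # each agent works on its own copy of Q
            Qc = [row[:] for row in Q]
            s, traj = 0, []
            while s != goal and len(traj) < 1000:
                a = (random.randrange(n_actions) if random.random() < eps
                     else max(range(n_actions), key=lambda k: Qc[s][k]))
                s2, r = step(s, a)
                traj.append((s, a, r, s2))
                # Traces are left on the copy only; the original Q is untouched.
                Qc[s][a] += alpha * (r + gamma * max(Qc[s2]) - Qc[s][a])
                s = s2
            trajectories.append(traj)
        best = min(trajectories, key=len)   # best solution found this round
        for s, a, r, s2 in best:            # Watkins' update on the original Q
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])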
International Journal of Intelligent Systems
Concurrent Q-learning: Reinforcement learning for dynamic goals and environments
2005 •
This article presents a powerful new algorithm for reinforcement learning in problems where the goals and also the environment may change. The algorithm is completely goal independent, allowing the mechanics of the environment to be learned independently of the task that is being undertaken. Conventional reinforcement learning techniques, such as Q-learning, are goal dependent. When the goal or reward conditions change, previous learning interferes with the new task that is being learned, resulting in very poor performance. Previously, the Concurrent Q-Learning algorithm was developed, based on Watkins' Q-learning, which learns the relative proximity of all states simultaneously. This learning is completely independent of the reward experienced at those states and, through a simple action selection strategy, may be applied to any given reward structure. Here it is shown that the extra information obtained may be used to replace the eligibility traces of Watkins' Q-learning, allowing many more value updates to be made at each time step. The new algorithm is compared to the previous version and also to DG-learning in tasks involving changing goals and environments. The new algorithm is shown to perform significantly better than these alternatives, especially in situations involving novel obstructions. The algorithm adapts quickly and intelligently to changes in both the environment and reward structure, and does not suffer interference from training undertaken prior to those changes. © 2005 Wiley Periodicals, Inc. Int J Int Syst 20: 1037–1052, 2005.
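A compressed sketch of the goal-independent idea, written in the DG-learning style the article compares against: discounted proximity values toward every state are learned concurrently from each transition, and the reward structure is consulted only at action-selection time. The table shapes, parameters, and reward_of interface are illustrative assumptions; the exact set of updates Concurrent Q-Learning performs per step differs:

    import random

    n_states, n_actions = 25, 4
    alpha, gamma = 0.5, 0.95
    # P[g][s][a]: estimated discounted proximity of (s, a) to candidate goal g.
    P = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(n_states)]

    def learn(s, a, s2):
        # One transition refines proximity estimates toward *every* state,
        # so the learned model of the environment is entirely goal independent.
        for g in range(n_states):
            target = 1.0 if s2 == g else gamma * max(P[g][s2])
            P[g][s][a] += alpha * (target - P[g][s][a])

    def act(s, reward_of):
        # Greedy action under the *current* reward structure: weight each
        # candidate goal's proximity by the reward now offered there.
        return max(range(n_actions),
                   key=lambda a: max(reward_of(g) * P[g][s][a]
                                     for g in range(n_states)))

When the goal or environment changes, only reward_of changes or new transitions arrive; the accumulated proximity knowledge is reused rather than unlearned, which is the source of the robustness the article reports.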
Oceanic Linguistics
The Greater West Bomberai Language Family
2022 •
Revista Metafísica y Persona
Philosophy and Neuroscience: Relation between Mirror Neurons and Empathy
2018 •
Review of Biblical Literature
Book Review: Christ's First Theologian
2018 •
Journal of Science and Medicine in Sport
Athletic performance and training characteristics in Junior Tennis Davis-Cup Players
2015 •
Archiv für Diplomatik
Sigillum Petri plebani de Glathovia. A Late Medieval Parish Priest's Seal from Klattau (Bohemia)
2004 •
Journal of Organic Chemistry
Photoinduced Skeletal Rearrangement of N-Substituted Colchicine Derivatives
2020 •
Academia Medicine
Racial and gender disparities in the effect of new drug approvals on U.S. cancer mortality
2024 •
1987 •
Information engineering express
Quantitative Measurement and Analysis to Computational Thinking for Elementary Schools in Japan
2022 •
Afyon Kocatepe Üniversitesi Sosyal Bilimler Dergisi
National Will in the State Founded by the Prophet (pbuh)