Modeling Theory of Mind in Dyadic Games Using Adaptive Feedback Control
Figure 1. Representation of the Control-based Reinforcement Learning (CRL) model. The top red box depicts the Adaptive layer of the original CRL model, composed of an Actor-Critic Temporal-Difference (TD) learning algorithm: an Actor module that generates the action-selection policy (P) and a Critic module that estimates the value (V) of a given state. Both the Actor and the Critic are updated by the TD-error signal (e) between the estimated state value and the actual reward obtained in that state. The bottom green box represents the Reactive layer, with its three sets of sensors, one for the 'cooperate' location (s^C), one for the 'defect' location (s^D), and one for the other agent (s^A); the two reactive behaviors, 'approach cooperate location' (f^C) and 'approach defect location' (f^D); and the two motors, one for the left wheel (m_l) and one for the right wheel (m_r). Between the two layers, the inhibitory function (i) regulates which reactive behaviors are active depending on the action received from the Adaptive layer, while the error-monitoring function (pe) tracks, in real time, the mismatch between the opponent's predicted behavior and the actual observation.
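To ground the caption above, here is a minimal sketch of one control step of such a two-layer agent; the class and function names (ReactiveLayer, approach, error_monitor) and the crossed-connection details are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-layer CRL control step described above.
# All names and the gating scheme are illustrative assumptions.

class ReactiveLayer:
    """Braitenberg-like sensorimotor loop gated by the Adaptive layer."""

    def motor_command(self, sC, sD, action):
        # Inhibitory function i: only the behavior that matches the action
        # selected by the Adaptive layer may drive the wheels.
        fC = self.approach(sC) if action == "cooperate" else (0.0, 0.0)
        fD = self.approach(sD) if action == "defect" else (0.0, 0.0)
        m_l = fC[0] + fD[0]   # left wheel
        m_r = fC[1] + fD[1]   # right wheel
        return m_l, m_r

    def approach(self, sensor_pair):
        # Crossed excitatory connections: a stronger reading on one side
        # speeds up the opposite wheel, turning the agent toward the target.
        left_sensor, right_sensor = sensor_pair
        return right_sensor, left_sensor


def error_monitor(predicted_opp_action, observed_opp_action):
    # pe: real-time mismatch between the opponent's predicted and observed behavior.
    return float(predicted_opp_action != observed_opp_action)
```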
Figure 2. Representations of the TD-learning model. (A) A detailed representation of the algorithm's components, as implemented in the Adaptive layer of the CRL model [24]. (B) A simplified representation showing only the inputs and outputs of the same model.
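A compact tabular Actor-Critic TD(0) sketch of the kind of update loop the caption describes; the number of states, learning rates, discount factor, and softmax policy are assumptions rather than the paper's exact settings.

```python
import numpy as np

# Minimal tabular Actor-Critic TD(0) sketch; parameters are illustrative.
N_STATES, N_ACTIONS = 4, 2
V = np.zeros(N_STATES)                # Critic: state-value estimates
H = np.zeros((N_STATES, N_ACTIONS))   # Actor: action preferences
alpha_critic, alpha_actor, gamma = 0.1, 0.1, 0.9

def policy(s):
    # Action-selection policy P generated by the Actor (softmax over preferences).
    prefs = np.exp(H[s] - H[s].max())
    return prefs / prefs.sum()

def td_update(s, a, r, s_next):
    # TD error e: obtained reward plus discounted value of the next state,
    # minus the Critic's current estimate of the present state.
    e = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * e      # Critic update
    H[s, a] += alpha_actor * e    # Actor update
    return e
```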
Figure 3. Representation of the Rational model. This model is composed of a predictive module (RL) that learns to predict the opponent's future action, and a utility-maximization function (U) that computes the action yielding the highest reward given the opponent's predicted action. At the end of each round, the RL module is updated based on its prediction error.
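As a rough illustration of this two-stage scheme, the sketch below pairs a simple learned estimate of the opponent's next action with a best-response choice over a payoff matrix; the frequency-based predictor (standing in for the RL predictive module) and the payoff layout are assumptions.

```python
import numpy as np

# Illustrative sketch of the Rational model: prediction plus utility maximization.
# Rows = own action, columns = opponent's action (Prisoner's Dilemma values).
payoff = np.array([[2.0, 0.0],    # own reward when cooperating vs. opp. C / D
                   [3.0, 1.0]])   # own reward when defecting  vs. opp. C / D

opp_counts = np.ones(2)           # running counts of the opponent's past actions

def predict_opponent():
    return opp_counts / opp_counts.sum()

def choose_action():
    # Utility maximization U: pick the action with the highest expected reward
    # under the predicted distribution over the opponent's next action.
    expected_utility = payoff @ predict_opponent()
    return int(np.argmax(expected_utility))

def end_of_round(observed_opp_action):
    # The predictive module is corrected by its error on the observed action.
    opp_counts[observed_opp_action] += 1
```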
Figure 4. Representation of the Predictive model. This model is composed of a predictive module (RL) that learns to predict the opponent's future action, and a TD-learning module (RL) that uses that prediction, together with the previous state, to learn the optimal policy. At the end of each round, the predictive module (green) is updated according to its prediction error.
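A minimal sketch of this arrangement, assuming the predicted opponent action is appended to the previous joint outcome to form the state of a tabular Q-learner; the state encoding, the frequency-based stand-in for the predictive module, and all parameters are assumptions.

```python
import numpy as np

# Illustrative sketch of the Predictive model: a prediction-augmented TD learner.
class PredictiveModel:
    def __init__(self, n_outcomes=4, n_actions=2, alpha=0.1, gamma=0.9, eps=0.1):
        self.n_actions = n_actions
        self.Q = np.zeros((n_outcomes * n_actions, n_actions))
        self.opp_counts = np.ones(n_actions)     # predictive module (stand-in)
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def _state(self, prev_outcome):
        pred = int(np.argmax(self.opp_counts))   # predicted opponent action
        return prev_outcome * self.n_actions + pred

    def act(self, prev_outcome):
        s = self._state(prev_outcome)
        if np.random.rand() < self.eps:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[s]))

    def learn(self, prev_outcome, action, reward, new_outcome, opp_action):
        s, s_next = self._state(prev_outcome), self._state(new_outcome)
        # TD update of the policy using the agent's own reward.
        td_error = reward + self.gamma * self.Q[s_next].max() - self.Q[s, action]
        self.Q[s, action] += self.alpha * td_error
        # Predictive module corrected at the end of the round by its prediction error.
        self.opp_counts[opp_action] += 1
```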
Figure 5. Representation of the Internal model. This model is composed of two TD-learning algorithms. The first one (left, blue) learns to predict the opponent's future action, while the second (right, red) uses that prediction to learn the optimal policy. The first algorithm is updated with the opponent's reward and the second with the agent's own reward.
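A minimal sketch of this dual-learner arrangement, assuming both modules are tabular Q-learners: one trained on the opponent's reward (an internal simulation whose greedy action serves as the prediction) and one trained on the agent's own reward over a prediction-augmented state. The encoding and parameters are assumptions.

```python
import numpy as np

# Illustrative sketch of the Internal model: two coupled tabular TD learners.
class InternalModel:
    def __init__(self, n_outcomes=4, n_actions=2, alpha=0.1, gamma=0.9):
        self.n_actions = n_actions
        self.Q_opp = np.zeros((n_outcomes, n_actions))              # opponent simulator
        self.Q_self = np.zeros((n_outcomes * n_actions, n_actions)) # own policy
        self.alpha, self.gamma = alpha, gamma

    def predict_opponent(self, prev_outcome):
        return int(np.argmax(self.Q_opp[prev_outcome]))

    def act(self, prev_outcome):
        s = prev_outcome * self.n_actions + self.predict_opponent(prev_outcome)
        return int(np.argmax(self.Q_self[s]))

    def learn(self, prev_outcome, new_outcome, own_action, own_reward,
              opp_action, opp_reward):
        # State used by the policy learner, built from the pre-update prediction.
        s = prev_outcome * self.n_actions + self.predict_opponent(prev_outcome)
        # First learner: updated with the opponent's reward for the opponent's action.
        e_opp = (opp_reward + self.gamma * self.Q_opp[new_outcome].max()
                 - self.Q_opp[prev_outcome, opp_action])
        self.Q_opp[prev_outcome, opp_action] += self.alpha * e_opp
        # Second learner: updated with the agent's own reward over the
        # prediction-augmented state.
        s_next = new_outcome * self.n_actions + self.predict_opponent(new_outcome)
        e_self = (own_reward + self.gamma * self.Q_self[s_next].max()
                  - self.Q_self[s, own_action])
        self.Q_self[s, own_action] += self.alpha * e_self
```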
Figure 6. (A) Top-down visualization of the agents used in the continuous version of the games. The green circles represent the location-specific sensors. The green lines connecting the location sensors to the wheels represent the Braitenberg-like excitatory and inhibitory connections. (B) Initial conditions of the dyadic games. The blue circles are the two agents facing each other (representing two ePuck robots viewed from the top). The big green circle represents the 'cooperate' reward location; the small green circle represents the 'defect' reward location. The white circles around each reward spot mark the threshold of the detection area.
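For intuition, a minimal Braitenberg-style approach controller of the kind the caption suggests: crossed excitatory connections from a location's sensor pair to the wheels steer the agent toward that location. The function name, gains, and base speed are illustrative assumptions.

```python
# Minimal Braitenberg-style approach controller for the continuous-time setup.
def approach_command(s_left, s_right, base=0.3, gain=1.0):
    # A stronger reading on the left sensor accelerates the right wheel,
    # turning the agent toward the sensed reward spot (and vice versa).
    m_left = base + gain * s_right
    m_right = base + gain * s_left
    return m_left, m_right

# Example: the target is sensed more strongly on the left, so the right wheel
# spins faster and the agent turns left, toward it.
print(approach_command(0.8, 0.2))   # -> (0.5, 1.1)
```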
Figure 7. Main results of the four behavioral models (TD-learning, Rational, Predictive, and Internal) against a Greedy agent across five games. Panel rows show the results of the models in each game, from the top row: Prisoner's Dilemma, Hawk-Dove, Stag Hunt, Harmony Game, and Battle of the Exes. Panel columns show the results across three metrics: Efficacy (left), Stability (center), and Prediction Accuracy (right).
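The exact metric definitions are not reproduced here, so the following is only a hedged sketch of how such per-dyad measures might be computed: Efficacy as mean collected reward relative to the maximum available, Stability as one minus the normalized Shannon entropy of the joint outcomes (the paper cites Shannon, 1948), and Prediction Accuracy as the fraction of rounds in which the model's prediction matched the opponent's action. All three functions are assumptions, not the authors' code.

```python
import numpy as np

# Hedged sketch of the three metrics under the assumptions stated above.
def efficacy(rewards, max_reward):
    return float(np.mean(rewards)) / max_reward

def stability(joint_outcomes, n_outcomes=4):
    counts = np.bincount(joint_outcomes, minlength=n_outcomes)
    p = counts / counts.sum()
    entropy = -sum(pi * np.log2(pi) for pi in p if pi > 0)
    return 1.0 - entropy / np.log2(n_outcomes)   # 1.0 = a fully settled convention

def prediction_accuracy(predictions, opponent_actions):
    return float(np.mean(np.asarray(predictions) == np.asarray(opponent_actions)))
```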
Figure 8. Main results of the four behavioral models (TD-learning, Rational, Predictive, and Internal) against a Nice agent across five games. Panel rows show the results of the models in each game, from the top row: Prisoner's Dilemma, Hawk-Dove, Stag Hunt, Harmony Game, and Battle of the Exes. Panel columns show the results across three metrics: Efficacy (left), Stability (center), and Prediction Accuracy (right).

Figure 9. Main results of the four behavioral models (TD-learning, Rational, Predictive, and Internal) against a TFT agent across five games. Panel rows show the results of the models in each game, from the top row: Prisoner's Dilemma, Hawk-Dove, Stag Hunt, Harmony Game, and Battle of the Exes. Panel columns show the results across three metrics: Efficacy (left), Stability (center), and Prediction Accuracy (right).

Figure 10. Mean results of the four behavioral models (TD-learning, Rational, Predictive, and Internal) against the TD-learning agent across five games. Panel rows show the results of the models in each game, from the top row: Prisoner's Dilemma, Hawk-Dove, Stag Hunt, Harmony Game, and Battle of the Exes. Panel columns show the results across three metrics: Efficacy (left), Stability (center), and Prediction Accuracy (right).
Figure 11. Comparative results of the Hawk-Dove game played in the continuous-time version versus the classical discrete-time version.
Figure 12. Results of the predictive models against human data. Model results showed the best fit to human data when tested against a greedy deterministic agent as the opponent. Human data reproduced from [35].
Abstract
1. Introduction
2. Methods
2.1. Game Theoretic Tasks
General payoff matrix shared by the games, with the row player's payoff listed first in each cell (a sketch instantiating the five games from this template follows below):

|           | Cooperate | Defect |
|-----------|-----------|--------|
| Cooperate | R, R      | S, T   |
| Defect    | T, S      | P, P   |
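A sketch of how the five games can be instantiated from this R/S/T/P template; the numeric values follow the payoff tables reproduced further down this page, and the payoff_matrix helper and GAMES dictionary are illustrative names, not the authors' code.

```python
# Sketch of the five 2x2 games instantiated from the general R/S/T/P template.
def payoff_matrix(R, S, T, P):
    # Rows: own action (0 = cooperate, 1 = defect); columns: opponent's action.
    return {"own":   [[R, S], [T, P]],
            "other": [[R, T], [S, P]]}

GAMES = {
    "Prisoner's Dilemma": payoff_matrix(R=2, S=0, T=3, P=1),   # T > R > P > S
    "Stag Hunt":          payoff_matrix(R=3, S=0, T=2, P=1),   # R > T > P > S
    "Hawk-Dove":          payoff_matrix(R=2, S=1, T=3, P=0),   # T > R > S > P
    "Harmony Game":       payoff_matrix(R=3, S=1, T=2, P=0),   # R > T > S > P
}
# Battle of the Exes does not fit the R/S/T/P template: it is an anti-coordination
# game in which both players get 0 when they choose the same option and 4 vs. 1
# otherwise (see its table below).
```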
2.2. Control-Based Reinforcement Learning
2.3. Agent Models
2.3.1. TD-Learning Model
2.3.2. Rational Model
2.3.3. Predictive Model
2.3.4. Internal Model
2.3.5. Deterministic Agent Models
Greedy
Cooperative/Nice
Tit-for-Tat
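A hedged sketch of the three fixed-strategy opponents, with behaviors inferred from the strategy names: Greedy is assumed to always pick the individually best-paying action (defect), Nice to always cooperate, and Tit-for-Tat to open with cooperation and then copy the opponent's previous move (Axelrod).

```python
# Hedged sketch of the deterministic opponents; the readings of "Greedy" and
# "Nice" are inferred from the names, not taken from the paper's code.
COOPERATE, DEFECT = 0, 1

def greedy(opponent_last_action=None):
    return DEFECT

def nice(opponent_last_action=None):
    return COOPERATE

def tit_for_tat(opponent_last_action=None):
    return COOPERATE if opponent_last_action is None else opponent_last_action
```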
2.4. Experimental Setup
3. Results
3.1. Experiment 1: Versus a Deterministic-Greedy Agent
3.2. Experiment 2: Versus a Deterministic-Nice Agent
3.3. Experiment 3: Versus a Tit-for-Tat Agent
3.4. Experiment 4: Versus the TD-Learning Agent
3.5. Experiment 5: Continuous-Time Effects on Prediction Accuracy
3.6. Comparison against Human Data
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| ToM | Theory of Mind |
| DAC | Distributed Adaptive Control |
| CRL | Control-based Reinforcement Learning |
| TD-Learning | Temporal-Difference Learning |
| TFT | Tit-for-Tat |
References
1. Premack, D.; Woodruff, G. Does the chimpanzee have a theory of mind? Behav. Brain Sci. 1978, 1, 515–526.
2. Baron-Cohen, S.; Leslie, A.M.; Frith, U. Does the autistic child have a “theory of mind”? Cognition 1985, 21, 37–46.
3. Premack, D. The infant’s theory of self-propelled objects. Cognition 1990, 36, 1–16.
4. Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; Graepel, T. A unified game-theoretic approach to multiagent reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4190–4203.
5. Lerer, A.; Peysakhovich, A. Learning social conventions in Markov games. arXiv 2018, arXiv:1806.10071.
6. Zhao, Z.; Zhao, F.; Zhao, Y.; Zeng, Y.; Sun, Y. A brain-inspired theory of mind spiking neural network improves multi-agent cooperation and competition. Patterns 2023, 100775.
7. Rabinowitz, N.C.; Perbet, F.; Song, H.F.; Zhang, C.; Eslami, S.; Botvinick, M. Machine Theory of Mind. arXiv 2018, arXiv:1802.07740.
8. Sclar, M.; Neubig, G.; Bisk, Y. Symmetric machine theory of mind. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 19450–19466.
9. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
10. Yoshida, W.; Dolan, R.J.; Friston, K.J. Game theory of mind. PLoS Comput. Biol. 2008, 4, e1000254.
11. Baker, C.; Saxe, R.; Tenenbaum, J. Bayesian Theory of Mind: Modeling Joint Belief-Desire Attribution. In Proceedings of the Annual Meeting of the Cognitive Science Society, Boston, MA, USA, 20–23 July 2011; Volume 33.
12. Baker, C.L.; Jara-Ettinger, J.; Saxe, R.; Tenenbaum, J.B. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nat. Hum. Behav. 2017, 1, 0064.
13. Lake, B.M.; Ullman, T.D.; Tenenbaum, J.B.; Gershman, S.J. Building machines that learn and think like people. Behav. Brain Sci. 2017, 40, e253.
14. Berke, M.; Jara-Ettinger, J. Integrating Experience into Bayesian Theory of Mind. In Proceedings of the Annual Meeting of the Cognitive Science Society, Toronto, ON, Canada, 27–30 July 2022; Volume 44.
15. Abbeel, P.; Ng, A.Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; p. 1.
16. Jara-Ettinger, J. Theory of mind as inverse reinforcement learning. Curr. Opin. Behav. Sci. 2019, 29, 105–110.
17. Wu, H.; Sequeira, P.; Pynadath, D.V. Multiagent Inverse Reinforcement Learning via Theory of Mind Reasoning. arXiv 2023, arXiv:2302.10238.
18. Ruiz-Serra, J.; Harré, M.S. Inverse Reinforcement Learning as the Algorithmic Basis for Theory of Mind: Current Methods and Open Problems. Algorithms 2023, 16, 68.
19. Kahneman, D.; Slovic, P.; Tversky, A. Judgment under Uncertainty: Heuristics and Biases; Cambridge University Press: Cambridge, UK, 1982.
20. Cuzzolin, F.; Morelli, A.; Cirstea, B.; Sahakian, B.J. Knowing me, knowing you: Theory of mind in AI. Psychol. Med. 2020, 50, 1057–1061.
21. Albrecht, S.V.; Stone, P. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artif. Intell. 2018, 258, 66–95.
22. Wang, Y.; Zhong, F.; Xu, J.; Wang, Y. Tom2c: Target-oriented multi-agent communication and cooperation with theory of mind. arXiv 2021, arXiv:2111.09189.
23. Yuan, L.; Fu, Z.; Zhou, L.; Yang, K.; Zhu, S.C. Emergence of theory of mind collaboration in multiagent systems. arXiv 2021, arXiv:2110.00121.
24. Freire, I.T.; Moulin-Frier, C.; Sanchez-Fibla, M.; Arsiwalla, X.D.; Verschure, P.F. Modeling the formation of social conventions from embodied real-time interactions. PLoS ONE 2020, 15, e0234434.
25. Freire, I.T.; Puigbò, J.Y.; Arsiwalla, X.D.; Verschure, P.F. Limits of Multi-Agent Predictive Models in the Formation of Social Conventions. Artif. Intell. Res. Dev. Curr. Chall. New Trends Appl. 2018, 308, 297.
26. Köster, R.; McKee, K.R.; Everett, R.; Weidinger, L.; Isaac, W.S.; Hughes, E.; Duéñez-Guzmán, E.A.; Graepel, T.; Botvinick, M.; Leibo, J.Z. Model-free conventions in multi-agent reinforcement learning with heterogeneous preferences. arXiv 2020, arXiv:2010.09054.
27. Kleiman-Weiner, M.; Ho, M.K.; Austerweil, J.L.; Littman, M.L.; Tenenbaum, J.B. Coordinate to cooperate or compete: Abstract goals and joint intentions in social interaction. In Proceedings of the CogSci, Philadelphia, PA, USA, 10–13 August 2016.
28. Perolat, J.; Leibo, J.Z.; Zambaldi, V.; Beattie, C.; Tuyls, K.; Graepel, T. A multi-agent reinforcement learning model of common-pool resource appropriation. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3643–3652.
29. Peysakhovich, A.; Lerer, A. Prosocial learning agents solve generalized stag hunts better than selfish ones. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, Stockholm, Sweden, 10–15 July 2018; International Foundation for Autonomous Agents and Multiagent Systems: London, UK, 2018; pp. 2043–2044.
30. Freire, I.T.; Puigbò, J.Y.; Arsiwalla, X.D.; Verschure, P.F. Modeling the Opponent’s Action Using Control-Based Reinforcement Learning. In Proceedings of the Conference on Biomimetic and Biohybrid Systems; Springer: Berlin/Heidelberg, Germany, 2018; pp. 179–186.
31. Gaparrini, M.J.; Sánchez-Fibla, M. Loss Aversion Fosters Coordination in Independent Reinforcement Learners. Artif. Intell. Res. Dev. Curr. Chall. New Trends Appl. 2018, 308, 307.
32. Leibo, J.Z.; Zambaldi, V.; Lanctot, M.; Marecki, J.; Graepel, T. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, Sao Paulo, Brazil, 8–12 May 2017; International Foundation for Autonomous Agents and Multiagent Systems: London, UK, 2017; pp. 464–473.
33. Peysakhovich, A.; Lerer, A. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv 2017, arXiv:1710.06975.
34. Nash, J.F. Equilibrium points in n-person games. Proc. Natl. Acad. Sci. USA 1950, 36, 48–49.
35. Hawkins, R.X.; Goldstone, R.L. The formation of social conventions in real-time environments. PLoS ONE 2016, 11, e0151670.
36. Hawkins, R.X.; Goodman, N.D.; Goldstone, R.L. The emergence of social norms and conventions. Trends Cogn. Sci. 2018.
37. Poncela-Casasnovas, J.; Gutiérrez-Roig, M.; Gracia-Lázaro, C.; Vicens, J.; Gómez-Gardeñes, J.; Perelló, J.; Moreno, Y.; Duch, J.; Sánchez, A. Humans display a reduced set of consistent behavioral phenotypes in dyadic games. Sci. Adv. 2016, 2, e1600451.
38. Sanfey, A.G. Social decision-making: Insights from game theory and neuroscience. Science 2007, 318, 598–602.
39. Verschure, P.F.; Voegtlin, T.; Douglas, R.J. Environmentally mediated synergy between perception and behaviour in mobile robots. Nature 2003, 425, 620.
40. Moulin-Frier, C.; Arsiwalla, X.D.; Puigbò, J.Y.; Sanchez-Fibla, M.; Duff, A.; Verschure, P.F. Top-Down and Bottom-Up Interactions between Low-Level Reactive Control and Symbolic Rule Learning in Embodied Agents. In Proceedings of the CoCo@NIPS, Barcelona, Spain, 9 December 2016.
41. Braitenberg, V. Vehicles: Experiments in Synthetic Psychology; MIT Press: Cambridge, MA, USA, 1986.
42. Corbetta, M.; Shulman, G.L. Control of goal-directed and stimulus-driven attention in the brain. Nat. Rev. Neurosci. 2002, 3, 201.
43. Koechlin, E.; Ody, C.; Kouneiher, F. The architecture of cognitive control in the human prefrontal cortex. Science 2003, 302, 1181–1185.
44. Munakata, Y.; Herd, S.A.; Chatham, C.H.; Depue, B.E.; Banich, M.T.; O’Reilly, R.C. A unified framework for inhibitory control. Trends Cogn. Sci. 2011, 15, 453–459.
45. Den Ouden, H.E.; Kok, P.; De Lange, F.P. How prediction errors shape perception, attention, and motivation. Front. Psychol. 2012, 3, 548.
46. Wacongne, C.; Labyt, E.; van Wassenhove, V.; Bekinschtein, T.; Naccache, L.; Dehaene, S. Evidence for a hierarchy of predictions and prediction errors in human cortex. Proc. Natl. Acad. Sci. USA 2011, 108, 20754–20759.
47. Sutton, R.S. Learning to predict by the methods of temporal differences. Mach. Learn. 1988, 3, 9–44.
48. Axelrod, R.; Hamilton, W.D. The evolution of cooperation. Science 1981, 211, 1390–1396.
49. Axelrod, R. Effective choice in the prisoner’s dilemma. J. Confl. Resolut. 1980, 24, 3–25.
50. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
51. Lengyel, M.; Dayan, P. Hippocampal contributions to control: The third way. Adv. Neural Inf. Process. Syst. 2007, 20.
52. Freire, I.T.; Amil, A.F.; Verschure, P.F. Sequential Episodic Control. arXiv 2021, arXiv:2112.14734.
53. Rosado, O.G.; Amil, A.F.; Freire, I.T.; Verschure, P.F. Drive competition underlies effective allostatic orchestration. Front. Robot. AI 2022, 9, 1052998.
54. Sweis, B.M.; Abram, S.V.; Schmidt, B.J.; Seeland, K.D.; MacDonald, A.W., III; Thomas, M.J.; Redish, A.D. Sensitivity to “sunk costs” in mice, rats, and humans. Science 2018, 361, 178–181.
55. Tutić, A.; Voss, T. Trust and game theory. In The Routledge Handbook of Trust and Philosophy; Routledge: London, UK, 2020; pp. 175–188.
56. Moulin-Frier, C.; Puigbo, J.Y.; Arsiwalla, X.D.; Sanchez-Fibla, M.; Verschure, P. Embodied artificial intelligence through distributed adaptive control: An integrated framework. arXiv 2017, arXiv:1704.01407.
57. Freire, I.T.; Urikh, D.; Arsiwalla, X.D.; Verschure, P.F. Machine Morality: From Harm-Avoidance to Human-Robot Cooperation. In Proceedings of the Conference on Biomimetic and Biohybrid Systems; Springer: Berlin/Heidelberg, Germany, 2020; pp. 116–127.
58. Arsiwalla, X.D.; Herreros, I.; Moulin-Frier, C.; Sánchez-Fibla, M.; Verschure, P.F. Is Consciousness a Control Process? In Proceedings of the CCIA, Catalonia, Spain, 19–21 October 2016; pp. 233–238.
59. Arsiwalla, X.D.; Sole, R.; Moulin-Frier, C.; Herreros, I.; Sanchez-Fibla, M.; Verschure, P. The Morphospace of Consciousness. arXiv 2017, arXiv:1705.11190.
60. Gopnik, A.; Meltzoff, A. Imitation, cultural learning and the origins of “theory of mind”. Behav. Brain Sci. 1993, 16, 521–523.
61. Gavrilets, S. Coevolution of actions, personal norms and beliefs about others in social dilemmas. Evol. Hum. Sci. 2021, 3, e44.
Payoff matrices of the five games (row player's payoff listed first; game names inferred from the standard payoff orderings):

Prisoner's Dilemma (T > R > P > S):

|           | Cooperate | Defect |
|-----------|-----------|--------|
| Cooperate | 2, 2      | 0, 3   |
| Defect    | 3, 0      | 1, 1   |

Stag Hunt (R > T > P > S):

|           | Cooperate | Defect |
|-----------|-----------|--------|
| Cooperate | 3, 3      | 0, 2   |
| Defect    | 2, 0      | 1, 1   |

Hawk-Dove (T > R > S > P):

|           | Cooperate | Defect |
|-----------|-----------|--------|
| Cooperate | 2, 2      | 1, 3   |
| Defect    | 3, 1      | 0, 0   |

Harmony Game (R > T > S > P):

|           | Cooperate | Defect |
|-----------|-----------|--------|
| Cooperate | 3, 3      | 1, 2   |
| Defect    | 2, 1      | 0, 0   |

Battle of the Exes:

|   | A    | B    |
|---|------|------|
| A | 0, 0 | 1, 4 |
| B | 4, 1 | 0, 0 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cite as: Freire, I.T.; Arsiwalla, X.D.; Puigbò, J.-Y.; Verschure, P. Modeling Theory of Mind in Dyadic Games Using Adaptive Feedback Control. Information 2023, 14, 441. https://doi.org/10.3390/info14080441