Cooperative Multi-Agent Reinforcement Learning and QMIX at NeurIPS 2021

This post introduces cooperative MARL and goes through innovations by the S. Whiteson Lab, from QMIX (2018) to their current contributions for NeurIPS 2021.

Authors: Gema Parreño, David Suarez (Apiumhub), with thanks to Alberto Hernandez (BBVA Innovation Labs). Going through this article assumes some fundamentals of Reinforcement Learning.

A multi-agent system describes multiple distributed entities, so-called "agents", that take decisions autonomously and interact within a shared environment (Weiss 1999). MARL (Multi-Agent Reinforcement Learning) can be understood as the field of RL in which a system of agents interacts within an environment to achieve a goal. Each of these agents, or learnable units, aims to learn a policy that maximizes its long-term reward: every agent discovers a strategy alongside the other entities in a common environment and adapts its policy in response to the behavioral changes of the others.
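The setting above can be illustrated with a toy, entirely hypothetical shared environment: several agents move on a line, each acting only on its own position, while a single team reward depends on all of them (here, how close together they end up). This is a minimal sketch of "many autonomous decision-makers, one shared environment", not an implementation of any particular MARL algorithm.

```python
import random

class SharedLineEnv:
    """Toy shared environment (hypothetical): each agent moves -1/0/+1 on a
    line, and the whole team receives one common reward that is higher when
    the agents end up closer together."""

    def __init__(self, n_agents, seed=0):
        self.n_agents = n_agents
        random.seed(seed)
        self.positions = [random.randint(-5, 5) for _ in range(n_agents)]

    def step(self, actions):
        # Each agent moves independently; all agents share one team reward.
        self.positions = [p + a for p, a in zip(self.positions, actions)]
        spread = max(self.positions) - min(self.positions)
        return list(self.positions), -spread  # common goal: minimize spread

def greedy_toward_centre(positions, i):
    """Decentralized policy: agent i only inspects its own position and
    heads for a fixed rendezvous point at 0."""
    p = positions[i]
    return -1 if p > 0 else (1 if p < 0 else 0)

env = SharedLineEnv(n_agents=3)
for _ in range(10):
    actions = [greedy_toward_centre(env.positions, i) for i in range(env.n_agents)]
    obs, reward = env.step(actions)
print(obs, reward)
```

Even in this trivial setup the cooperative structure is visible: no agent observes the reward function's dependence on the others, yet the shared reward only improves when all of them coordinate.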
The properties of a MARL system are key to modeling it: depending on these properties, we branch into different areas of research. The taxonomic schema in Table 1 (Weiss 1999) frames the kind of MARL exploration we talk about today. In cooperative MARL, agents cooperate to achieve a common goal: when we branch from MARL into cooperative MARL, we reformulate the challenge as a system of agents that interact within an environment to achieve a shared objective. Several challenges can be stated from the environment perspective, and further conceptual challenges derive from agent interaction and agent performance inside the cooperation; which of them matters most depends on the type of behavior and environment. The images above are a visual representation of MARL properties together with some challenges from this taxonomy; the zoomed-in area covers areas inside Cooperative AI posted in Open Problems in Cooperative AI and the QMIX paper. From now on, we focus on centralized cooperative MARL and on the definition, notation, and description of QMIX.

Qtot(τ, u) = Σi Qi(τi, ui)

Above, the global action-value function is written as a sum of individual action-value functions, one for each agent. The QMIX paper (published in 2018 by T. Rashid et al.) explores a hybrid value-based multi-agent reinforcement learning method, adding a constraint and a mixing-network structure in order to make learning stable, faster, and ultimately better in a controlled setup. A key conceptual idea in QMIX is centralized training (Qtot) with decentralized execution (Qi), also known as CTDE: agents are trained in a centralized way with access to the overall action-observation history (τ) and the global state, but during execution each agent only has access to its own local action-observation history (τi).
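The additive decomposition Qtot(τ, u) = Σi Qi(τi, ui) can be made concrete with two agents and hand-picked (hypothetical) value tables: because Qtot is a sum, the joint action that maximizes Qtot is exactly the tuple of each agent's own greedy action, which is what makes decentralized execution possible.

```python
import itertools

# Per-agent action-value tables Q_i(τ_i, u_i); the histories τ_i are collapsed
# to a single fixed observation for brevity, and the numbers are hypothetical.
q1 = {"a": 1.0, "b": 3.0}   # agent 1's values for its two actions
q2 = {"a": 2.0, "b": 0.5}   # agent 2's values

def q_tot(u1, u2):
    # Additive decomposition: Q_tot(τ, u) = Q_1(τ_1, u_1) + Q_2(τ_2, u_2)
    return q1[u1] + q2[u2]

# Centralized argmax over the joint action space ...
joint_best = max(itertools.product(q1, q2), key=lambda u: q_tot(*u))
# ... matches each agent's decentralized greedy choice.
decentralised = (max(q1, key=q1.get), max(q2, key=q2.get))
print(joint_best, decentralised)  # both ('b', 'a')
```

Note how the joint action space already has 2 × 2 entries for two binary-choice agents; the decomposition lets each agent search only its own two actions.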
The first main idea is to enforce a constraint on the monotonicity of the relationship between the global action-value function Qtot and each agent's action-value function Qi:

∂Qtot / ∂Qi ≥ 0, ∀a

In this notation, the partial derivative of the global action-value function with respect to each agent's action-value function is zero or higher, for every action. This constraint makes a global argmax over Qtot consistent with the set of individual argmax operations over each Qi, so each agent can participate in decentralized execution simply by choosing greedy actions with respect to its own action-value function. The overall QMIX architecture has two main differentiated parts, depicted in the image above from the QMIX paper: the mixing network, whose weights are produced by a hypernetwork and constrained to be non-negative to force monotonicity, and the agent networks.

Overestimation is an important challenge because it can accumulate and become counterproductive for the performance of value-based algorithms. Moreover, with multiple agents in a MARL scenario the joint-action space grows exponentially with the number of agents, which compounds the issue. In the case of QMIX, the overestimation phenomenon can come not only from the calculation of each Qi but also from the mixing network. The paper first presents key experimental results for two approaches that did not show the desired outcomes: a gradient regularization of the mixing network, and a baseline b(s, u) for Qtot obtained by adding a regularization term λ(Qtot(s, u) − b(s, u))² to the mean squared error loss, where λ is the regularization coefficient.
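How non-negative hypernetwork weights yield the monotonicity constraint can be sketched with a single linear mixing layer. Everything here is a hand-rolled stand-in (no learning, made-up state-to-weight mapping); the point is only that taking the absolute value of the state-conditioned weights guarantees ∂Qtot/∂Qi ≥ 0.

```python
# Minimal QMIX-style mixer sketch (hypothetical, one linear layer, no training):
# a "hypernetwork" maps the global state s to mixing weights, and taking abs()
# of those weights enforces dQ_tot/dQ_i >= 0, i.e. monotonicity in each Q_i.

def hypernetwork(state):
    # Stand-in for a learned network: produces one (possibly negative) raw
    # weight per agent plus a bias, all conditioned on the global state.
    raw_weights = [state[0] - 1.0, state[1] + 0.5]
    bias = 0.1 * sum(state)
    return raw_weights, bias

def mix(agent_qs, state):
    raw_weights, bias = hypernetwork(state)
    weights = [abs(w) for w in raw_weights]  # non-negativity => monotone mixer
    return sum(w * q for w, q in zip(weights, agent_qs)) + bias

s = [0.2, 0.7]
base = mix([1.0, 2.0], s)
higher = mix([1.5, 2.0], s)  # raising any single Q_i can never lower Q_tot
print(base <= higher)  # True
```

Because the bias is unconstrained and the weights depend freely on the state, the mixer can still represent rich state-dependent combinations of the agent values while staying monotone in each Qi.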
As the final proposal, which showed better empirical results, they apply a softmax to the joint action-value function, softmax(Qtot(s, u)), following principles from Deep Q-Learning and using the state s rather than the action-observation history τ as in the QMIX and Value Decomposition Networks approaches. To learn more about this contribution, don't hesitate to read their paper here.

Published at DZone with permission of David Suarez. See the original article here.
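The intuition behind softmax-weighting the joint action values can be sketched in isolation: instead of backing up the hard max over Qtot(s, u), which systematically picks up positive noise and drives overestimation, a softmax-weighted average stays strictly below the max. This is only an illustration of the operator on made-up numbers (the inverse-temperature beta is a hypothetical choice), not the full learning update from the paper.

```python
import math

def softmax_value(q_values, beta=5.0):
    """Softmax operator over joint action values: a smooth alternative to the
    hard max used in Q-learning targets (sketch; beta is a hypothetical
    inverse-temperature controlling how close it gets to the max)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_values]
    return sum(w * q for w, q in zip(weights, q_values)) / sum(weights)

q_tot = [1.0, 1.1, 0.9]  # hypothetical Q_tot(s, u) over three joint actions u
print(max(q_tot))            # 1.1 -- hard max, prone to overestimation
print(softmax_value(q_tot))  # a value strictly below the hard max
```

As beta grows the operator approaches the hard max, so the degree of softening is a tunable trade-off between bias and overestimation.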