Multi-Agent Reinforcement Learning
When multiple learning agents share an environment, a single agent's reward depends on everyone else's behavior. This chapter explores the game-theoretic foundations, cooperative and competitive architectures, emergent communication, and the paradigm of centralized training with decentralized execution.
Prerequisites
This chapter assumes familiarity with single-agent RL: MDPs, Q-learning, and policy gradient methods (Chapters 01–04). Basic game theory concepts are introduced from scratch. Exposure to deep learning is helpful for the neural network–based methods in later sections.
The Multi-Agent Setting
Single-agent RL assumes a stationary environment. Introduce multiple learners and that assumption evaporates — each agent's policy changes the world that every other agent is trying to learn in.
In a standard MDP there is one agent whose actions affect the environment but not any other learner. A Markov game (also called a stochastic game) extends this to $n$ agents. Formally, a Markov game is a tuple $\langle \mathcal{S},\, \{\mathcal{A}^i\}_{i=1}^n,\, \mathcal{T},\, \{R^i\}_{i=1}^n,\, \gamma \rangle$ where each agent $i$ chooses action $a^i \in \mathcal{A}^i$, the transition $\mathcal{T}(s' \mid s, \mathbf{a})$ depends on the joint action $\mathbf{a} = (a^1, \ldots, a^n)$, and each agent has its own reward function $R^i$.
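A minimal sketch makes the tuple concrete. The two-state "corridor" game below is entirely hypothetical, invented only to show that both $\mathcal{T}$ and the $R^i$ take the joint action as input:

```python
from typing import Dict, Tuple

# A minimal deterministic two-agent Markov game illustrating the tuple
# <S, {A^i}, T, {R^i}, gamma>. The "corridor" game itself is hypothetical.
State = str
JointAction = Tuple[int, int]  # (a^1, a^2), each in {0: stay, 1: push}

T: Dict[Tuple[State, JointAction], State] = {
    ("left", (1, 1)): "right",   # only a coordinated push moves the box
    ("left", (1, 0)): "left",
    ("left", (0, 1)): "left",
    ("left", (0, 0)): "left",
}

def R(state: State, a: JointAction) -> Tuple[float, float]:
    # Per-agent rewards; here both agents are paid on reaching "right",
    # which makes this instance fully cooperative.
    nxt = T.get((state, a), state)
    r = 1.0 if nxt == "right" else 0.0
    return (r, r)

def step(state: State, a: JointAction):
    # Both the transition and the rewards depend on the JOINT action.
    return T.get((state, a), state), R(state, a)
```

Note that neither agent can reach "right" alone: the transition is a function of the joint action, which is exactly what makes the problem multi-agent.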
The key difference from single-agent RL is that the environment is now non-stationary from any individual agent's perspective. Even if the world's physics are fixed, the effective environment an agent faces changes as its teammates and opponents update their policies. From any single learner's point of view, the process is no longer a stationary MDP.
Partial Observability
Most real multi-agent settings are also partially observable: each agent receives a local observation $o^i$ rather than the full state $s$. A robot sees its own sensors; a game player sees its own screen. Formally, this extends the Markov game to a decentralized partially observable MDP (Dec-POMDP). Solving Dec-POMDPs optimally is NEXP-hard in general, so practical algorithms rely on approximation.
Solution Concepts
In single-agent RL the goal is to find the optimal policy. In multi-agent settings the notion of "optimal" becomes entangled with other agents' behaviors. Three main paradigms:
- Cooperative: agents share a single team reward and jointly try to maximize it. The goal is a joint policy that maximizes $\sum_t \gamma^t R(s_t, \mathbf{a}_t)$.
- Competitive (zero-sum): one agent's gain is another's loss. The standard solution concept is a Nash equilibrium.
- Mixed (general-sum): agents have their own rewards that may partially align. Most real-world problems fall here.
The fundamental challenge of MARL is that each agent is simultaneously a learner and a part of the environment that other agents are learning about. Standard convergence proofs for single-agent RL break down because the environment is non-stationary.
Game Theory Foundations
Game theory provides the vocabulary and solution concepts that MARL inherits. Understanding Nash equilibria, dominant strategies, and the taxonomy of games is essential before designing learning algorithms.
Normal Form Games
A normal form game captures a single simultaneous interaction. Each player $i$ chooses an action; payoffs are defined by a matrix. In the classic Prisoner's Dilemma, each player can cooperate (C) or defect (D). If both cooperate they each get 3; if both defect they each get 1; if one defects while the other cooperates the defector gets 5 and the cooperator gets 0. Defecting dominates cooperation for both players individually, yet mutual defection (1,1) is worse for both than mutual cooperation (3,3). This is the canonical illustration of how individually rational choices can be collectively irrational.
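The payoffs above are small enough to check mechanically. A sketch using exactly those numbers:

```python
# Prisoner's Dilemma payoffs from the text: (C,C) -> (3,3), (D,D) -> (1,1),
# defector vs. cooperator -> (5,0). Check dominance and the dilemma itself.
C, D = 0, 1
payoff = {  # (a1, a2) -> (r1, r2)
    (C, C): (3, 3), (C, D): (0, 5),
    (D, C): (5, 0), (D, D): (1, 1),
}

def dominates(a, b):
    """True if player 1's action a strictly beats b against every opponent action."""
    return all(payoff[(a, o)][0] > payoff[(b, o)][0] for o in (C, D))

assert dominates(D, C)                        # D strictly dominates C
assert payoff[(C, C)][0] > payoff[(D, D)][0]  # yet (C,C) beats (D,D) for both
```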
Nash Equilibrium
A Nash equilibrium is a joint policy $(\pi^{1*}, \ldots, \pi^{n*})$ such that no agent can improve its expected return by unilaterally changing its own policy. Formally: for every agent $i$ and every alternative policy $\pi^i$,
$$V^i(\pi^{i*}, \boldsymbol{\pi}^{-i*}) \geq V^i(\pi^i, \boldsymbol{\pi}^{-i*})$$
where $\boldsymbol{\pi}^{-i}$ denotes all other agents' policies. Nash proved that every finite game has at least one Nash equilibrium in mixed (randomized) strategies. However, Nash equilibria may be multiple, hard to find, and may not correspond to the socially optimal outcome.
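For pure strategies the unilateral-deviation condition can be checked by brute force. A minimal sketch over the Prisoner's Dilemma payoffs:

```python
# Brute-force pure-strategy Nash check: a joint action is an equilibrium
# iff no single player can profit from a unilateral deviation.
def is_pure_nash(payoff, joint, n_actions, n_players=2):
    for i in range(n_players):
        for dev in range(n_actions):
            alt = list(joint)
            alt[i] = dev                    # agent i deviates unilaterally
            if payoff[tuple(alt)][i] > payoff[joint][i]:
                return False
    return True

C, D = 0, 1
pd = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}
assert is_pure_nash(pd, (D, D), 2)       # mutual defection is the unique pure NE
assert not is_pure_nash(pd, (C, C), 2)   # each player would deviate to D
```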
Zero-Sum vs. General-Sum
In a zero-sum game the rewards sum to zero: $\sum_i R^i = 0$. Chess, Go, and poker (heads-up) are zero-sum. In these games minimax is the solution concept: one agent minimizes while the other maximizes. A Nash equilibrium in a two-player zero-sum game is also a minimax solution, and it is unique in value (though not necessarily in strategy). General-sum games admit richer dynamics including coordination equilibria, communication, and side payments.
Common Social Dilemmas
Several canonical games appear repeatedly in MARL research: the Prisoner's Dilemma (defection dominates but mutual cooperation is better), the Stag Hunt (coordination game with two equilibria), the Battle of the Sexes (mixed incentives requiring coordination), and the public goods game (each agent can free-ride on collective effort). These templates reveal the fundamental tensions between individual and collective rationality that MARL must navigate.
Independent Learners
The simplest approach: give each agent its own RL algorithm and let them learn simultaneously. Surprisingly effective in practice, yet theoretically problematic — and understanding why reveals the core difficulty of MARL.
Independent Q-Learning
Independent Q-Learning (IQL) treats each agent as if it were the only learner. Agent $i$ maintains its own Q-function $Q^i(o^i, a^i)$ and updates it using standard Q-learning, treating other agents' actions as part of an effectively stochastic environment. The simplicity is appealing: existing single-agent algorithms can be deployed without modification.
The problem is that the environment is non-stationary. As agent $j$ updates its policy, the transition dynamics and reward signal seen by agent $i$ change — but agent $i$'s Q-function does not explicitly account for this. Convergence guarantees that rely on the environment being a stationary MDP no longer apply. In practice, IQL can cycle, diverge, or converge to suboptimal equilibria.
Why It Sometimes Works
Despite the theoretical issues, IQL performs surprisingly well in many benchmarks. The intuition is that if all agents converge quickly relative to each other's rate of change, each agent's local landscape looks approximately stationary. When agents are roughly symmetric and the game has a unique Nash equilibrium that is also a coordination equilibrium, independent learning tends to find it. The gap between theory and practice here motivates much of the more sophisticated MARL research.
Lenient Learners and Hysteretic Q-Learning
Hysteretic Q-Learning addresses IQL's instability by using two learning rates: a large $\alpha$ for positive TD errors (good news) and a small $\beta$ for negative ones (bad news). The asymmetry makes each agent optimistic: it emphasizes improvements and discounts deteriorations that may be due to other agents' transient exploratory behavior. This simple modification significantly improves convergence in cooperative settings.
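The asymmetric update fits in a few lines; the specific rates $\alpha = 0.5$ and $\beta = 0.05$ below are illustrative, not prescribed:

```python
# Hysteretic Q-learning: a large rate for positive TD errors, a small one
# for negative errors. The rates alpha=0.5, beta=0.05 are illustrative.
def hysteretic_update(q, r, q_next_max, gamma=0.99, alpha=0.5, beta=0.05):
    delta = r + gamma * q_next_max - q     # standard TD error
    rate = alpha if delta >= 0 else beta   # optimism: shrink only bad news
    return q + rate * delta

up = hysteretic_update(0.0, r=1.0, q_next_max=0.0)     # good news: big step
down = hysteretic_update(0.0, r=-1.0, q_next_max=0.0)  # bad news: small step
assert abs(up) > abs(down)
```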
Cooperative MARL
When agents share a reward, the challenge shifts from competition to coordination. The joint action space grows exponentially with the number of agents, making centralized computation intractable at scale — but structure in the value function can be exploited.
The Team Problem
In fully cooperative settings all agents share a single team reward $R(s, \mathbf{a})$. The optimal joint policy maximizes $\mathbb{E}[\sum_t \gamma^t R(s_t, \mathbf{a}_t)]$. If agents could communicate freely and compute jointly, this reduces to a standard MDP over the joint action space $\mathcal{A} = \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$. But with $n$ agents each having $|\mathcal{A}|$ actions, the joint space has $|\mathcal{A}|^n$ elements — exponential in the number of agents. Coordination at scale requires exploiting structure.
Credit Assignment
A fundamental difficulty in cooperative MARL is credit assignment: when the team receives a reward, how much credit should each agent receive? If agent 1 scored the goal but agents 2–10 set it up, naive uniform attribution may not train any of them well. Difference rewards ($D^i = R(\mathbf{a}) - R(\mathbf{a}^{-i}, c^i)$, the team reward minus what it would have been with agent $i$'s action replaced by a default action $c^i$) provide theoretically principled attribution but require estimating what would have happened without each agent — expensive in large teams.
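A sketch of the counterfactual computation, with a hypothetical team-reward function standing in for $R$:

```python
# Difference rewards: D^i = R(a) - R(a with agent i's action replaced by a
# default/no-op). The team-reward function below is hypothetical.
def team_reward(actions):
    # Team scores 10 if anyone acts (action 1), minus a small per-action cost.
    return (10.0 if any(actions) else 0.0) - 0.1 * sum(actions)

def difference_reward(actions, i, default=0):
    counterfactual = list(actions)
    counterfactual[i] = default            # remove agent i's contribution
    return team_reward(actions) - team_reward(counterfactual)

acts = [1, 0, 0]                           # only agent 0 acted
assert difference_reward(acts, 0) > 0      # agent 0 was pivotal
assert difference_reward(acts, 1) == 0.0   # agent 1 contributed nothing
```

The attribution is sharper than handing each agent the raw team reward: an agent whose counterfactual removal changes nothing receives zero, which is exactly the free-rider signal a naive split misses.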
Cooperative MARL benchmark: StarCraft Multi-Agent Challenge (SMAC) is the dominant cooperative benchmark, featuring groups of units (marines, zerglings, stalkers) that must be micromanaged cooperatively to defeat a scripted enemy. It has driven much of the algorithmic progress in value decomposition methods.
Competitive MARL & Self-Play
In zero-sum games, the best teacher is a peer who is exactly as good as you are — and the only way to get that teacher is to play against yourself. Self-play is the engine behind AlphaGo, AlphaZero, and OpenAI Five.
Minimax Q-Learning
Minimax-Q extends Q-learning to two-player zero-sum games. The agent learns a joint Q-function $Q(s, a^1, a^2)$ and at each state solves a linear program to find the minimax policy: the strategy that maximizes the agent's return against a worst-case opponent. Minimax-Q converges to the Nash equilibrium value in two-player zero-sum games. The catch is that it requires the full joint action space and a linear program at every step — tractable only for small discrete games.
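As an illustration, a coarse grid search over the row player's mixed strategy can stand in for the per-state linear program (a real Minimax-Q implementation would call an LP solver):

```python
# Minimax value of a small zero-sum matrix game by grid search over the
# row player's mixed strategy. This is an illustrative stand-in for the
# linear program Minimax-Q solves at each state.
def minimax_value(M, grid=1001):
    # M[i][j]: row player's payoff when playing i against column j.
    best_v, best_p = float("-inf"), 0.0
    for k in range(grid):
        p = k / (grid - 1)   # probability of playing row 0
        # expected payoff against each pure column response; keep the worst
        worst = min(p * M[0][j] + (1 - p) * M[1][j] for j in range(len(M[0])))
        if worst > best_v:
            best_v, best_p = worst, p
    return best_v, best_p

# Matching pennies: game value 0, optimal mixed strategy (0.5, 0.5).
pennies = [[1, -1], [-1, 1]]
v, p = minimax_value(pennies)
assert abs(v) < 1e-6 and abs(p - 0.5) < 1e-3
```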
Self-Play
In self-play, an agent trains against copies of itself. The key insight is that a minimax-optimal policy beats any opponent, including future versions of itself. Self-play provides an automatic curriculum: as the agent improves, so does its opponent. When implemented naively, self-play can cycle — policy A beats B, B beats C, C beats A — because the game is not transitive. More sophisticated variants use a population of past selves (fictitious self-play) or a diverse league of agents to promote more robust strategies.
League Training
DeepMind's AlphaStar used league training to train StarCraft II agents. A diverse league of agent types (main agents, main exploiters, and league exploiters), each with its own matchmaking strategy, created a rich multi-agent training environment. Main agents maximize win rate against the entire league; exploiters specifically target weaknesses in current main agents. The result is a policy that is robust to many different opponents simultaneously, achieving grandmaster-level play.
Elo and Skill Rating
Measuring progress in competitive MARL requires rating agents against each other. The Elo rating system assigns scalar scores based on win/loss outcomes; the expected win probability between two players with Elo ratings $R_A$ and $R_B$ is $1/(1 + 10^{(R_B - R_A)/400})$. More advanced metrics like Trueskill handle partial orderings and account for variance. Neither metric fully captures non-transitive relationships (A beats B beats C beats A), motivating approaches like Nash averaging for more robust skill assessment.
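The expected-score formula from the text and the standard rating update can be sketched directly; the K-factor of 32 is a conventional but arbitrary choice:

```python
# Elo expected score and rating update. K=32 is a common default, not a law.
def elo_expected(r_a, r_b):
    # Probability that A beats B, as given in the text.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """score_a: 1 for a win, 0.5 for a draw, 0 for a loss. New rating for A."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))

assert abs(elo_expected(1200, 1200) - 0.5) < 1e-9     # equal ratings: 50%
assert abs(elo_expected(1600, 1200) - 10 / 11) < 1e-9 # 400-point gap: ~91%
assert elo_update(1200, 1200, 1.0) == 1216.0          # winner of an even match gains k/2
```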
Centralized Training, Decentralized Execution
The CTDE paradigm is the dominant organizing principle in modern cooperative MARL. Agents execute independently using only local information, but a shared critic or coordinator uses global information during training to guide them.
The central tension in multi-agent RL is between what is available during training (often all agents' observations and actions) and what is available at execution time (often only each agent's local observation). CTDE resolves this by allowing a centralized critic or mixing network to access global information during training while the policies that are actually deployed use only local observations.
Why CTDE Works
The central critic can condition on global information that is available at training time (from the environment or simulator) even though it is unavailable at execution. This breaks the non-stationarity problem for critic training: the critic sees the full joint action and can properly attribute returns. The individual policies still only condition on local observations, keeping execution practical and decentralized. Most state-of-the-art cooperative MARL algorithms — MADDPG, COMA, MAPPO, and the value decomposition family — are instances of the CTDE paradigm.
Value Decomposition Networks
If the joint Q-function can be decomposed into per-agent components, each agent can be trained with its own Q-network while still optimizing for the team. The key constraint is monotonicity: greedy action selection must remain consistent between the joint and individual value functions.
VDN: Additive Decomposition
Value Decomposition Networks (VDN) propose the simplest possible decomposition: the joint Q-value is the sum of individual utilities, $Q_{tot}(\mathbf{a}, s) = \sum_i Q^i(o^i, a^i)$. Each agent $i$ has its own Q-network that takes only its local observation $o^i$. The key property is that $\arg\max_{\mathbf{a}} Q_{tot} = (\arg\max_{a^1} Q^1, \ldots, \arg\max_{a^n} Q^n)$ — the joint greedy action decomposes into individual greedy actions. This makes decentralized execution trivial. The limitation is that the additive constraint is too restrictive: many cooperative tasks have value functions with interaction effects that cannot be captured by a sum.
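A small numeric check of the argmax-decomposition property, with made-up utility tables:

```python
import itertools

# With an additive decomposition Q_tot = sum_i Q^i, the joint greedy action
# is just each agent's individual greedy action. Utilities are toy numbers.
Q1 = [0.1, 0.9, 0.3]   # agent 1's per-action utilities
Q2 = [0.5, 0.2, 0.8]   # agent 2's per-action utilities

def q_tot(a1, a2):
    return Q1[a1] + Q2[a2]   # VDN's additive mixing

joint_greedy = max(itertools.product(range(3), range(3)),
                   key=lambda a: q_tot(*a))
individual_greedy = (max(range(3), key=Q1.__getitem__),
                     max(range(3), key=Q2.__getitem__))
assert joint_greedy == individual_greedy == (1, 2)
```

Each agent can therefore act greedily on its own utility at execution time without ever evaluating the $|\mathcal{A}|^n$ joint space.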
QMIX: Monotone Mixing
QMIX relaxes VDN's additivity to monotonicity. The joint Q-value is produced by a mixing network that combines per-agent utilities in a way where all weights are constrained to be non-negative. This ensures:
$$\frac{\partial Q_{tot}}{\partial Q^i} \geq 0 \quad \forall\, i$$
Whenever $Q_{tot}$ is monotone in each $Q^i$, the global greedy action still decomposes: $\arg\max_{\mathbf{a}} Q_{tot} = (\arg\max_{a^i} Q^i)_i$. The mixing network's weights are generated by a hypernetwork that takes the global state $s$ as input — so the mixing can depend on state without violating decentralized execution. During training the mixing network and hypernetwork are learned end-to-end via backpropagation through the joint loss.
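A toy check of why non-negative mixing weights preserve the decomposable argmax. The weights here are fixed constants rather than the output of a learned, state-conditioned hypernetwork:

```python
import itertools

# A monotone mixer: per-agent utilities combined with NON-NEGATIVE weights
# (QMIX enforces this with an abs() on hypernetwork outputs). The weights
# below are toy constants, not hypernetwork outputs.
Q1 = [0.1, 0.9]
Q2 = [0.7, 0.4]
w1, w2, b = 2.0, 0.5, -1.0   # w1, w2 >= 0  =>  dQ_tot/dQ^i >= 0

def q_tot(a1, a2):
    return w1 * Q1[a1] + w2 * Q2[a2] + b

joint = max(itertools.product(range(2), range(2)), key=lambda a: q_tot(*a))
indiv = (max(range(2), key=Q1.__getitem__), max(range(2), key=Q2.__getitem__))
assert joint == indiv == (1, 0)   # IGM: joint greedy = individual greedies
```

A negative weight would break this: it would reward lowering some $Q^i$, so an agent acting greedily on its own utility could disagree with the joint argmax.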
QTRAN and Beyond
QTRAN drops the monotonicity constraint entirely, instead enforcing a softer consistency condition using an auxiliary loss. This allows it to represent any factorizable joint Q-function. However, the auxiliary objective can be difficult to optimize stably. More recent work (QPLEX, Weighted QMIX) finds intermediate points with better empirical stability, or conditions the monotonicity on the optimal action to allow richer representations without losing the IGM (Individual-Global-Max) property.
The IGM condition (Individual-Global-Max) is the formal requirement that individual greedy actions constitute a joint greedy action: $\arg\max_{\mathbf{a}} Q_{tot}(\mathbf{a}) = (\arg\max_{a^i} Q^i(a^i))_i$. This is what makes decentralized greedy execution consistent with centralized training. VDN and QMIX both satisfy IGM; QTRAN enforces it approximately.
Multi-Agent Actor-Critic Methods
Policy gradient approaches extend to multi-agent settings by giving each agent an actor while sharing or centralizing the critic. MADDPG introduced CTDE for continuous actions; MAPPO showed that well-tuned PPO scales surprisingly well to large cooperative teams.
MADDPG
Multi-Agent DDPG (MADDPG) extends DDPG to multi-agent settings under CTDE. Each agent $i$ has an actor $\mu^i(o^i;\theta^i)$ that maps local observations to actions, and a centralized critic $Q^i(o^1,\ldots,o^n, a^1,\ldots,a^n;\phi^i)$ that conditions on all observations and actions. During training, the critic uses full information to compute TD targets. At execution, only the actor runs. Each agent has its own critic, allowing general-sum games where agents have different reward functions.
The policy gradient for agent $i$ is:
$$\nabla_{\theta^i} J(\mu^i) = \mathbb{E}\left[\nabla_{\theta^i} \mu^i(o^i) \cdot \nabla_{a^i} Q^i(\mathbf{o}, \mathbf{a})\Big|_{a^i = \mu^i(o^i)}\right]$$
MADDPG also introduced the idea of training with inferred policies: each agent maintains estimates of other agents' policies and conditions its critic on these estimates, improving stability when other agents' policies are changing.
MAPPO
Multi-Agent PPO (MAPPO) applies PPO to cooperative settings with a shared centralized critic. Despite its simplicity, Yu et al. (2021) showed that MAPPO with careful implementation choices (centralized value normalization, mini-batch training, input normalization) matches or outperforms more specialized algorithms on the SMAC and MPE benchmarks. The key insight is that many apparent failures of on-policy methods in MARL trace to implementation details rather than fundamental algorithmic limitations.
HAPPO and Heterogeneous Agents
HAPPO (Heterogeneous-Agent PPO) addresses a subtle issue in multi-agent policy optimization: standard PPO's clipping provides monotonic improvement guarantees for single agents, but these do not compose — improving each agent's policy individually does not guarantee improving the joint policy. HAPPO derives a factored sequential policy update scheme where each agent is updated conditioned on the updated policies of agents earlier in a fixed ordering, restoring monotonic improvement in the joint policy.
Communication & Coordination
When agents can exchange messages, cooperation becomes dramatically easier — but only if the communication protocol is meaningful. End-to-end differentiable communication allows agents to learn what to say to each other, giving rise to emergent protocols.
Differentiable Communication
In differentiable communication architectures, agents' messages are continuous vectors and the entire system — observations to messages to actions — is trained end-to-end with backpropagation. CommNet (Sukhbaatar et al., 2016) is the prototype: agents broadcast a continuous message vector at each step; each agent's action is computed from its observation plus the mean of all received messages. CommNet showed that learned communication could improve cooperative performance on tasks like traffic coordination and resource management.
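A toy sketch of the aggregation step: each agent combines its own hidden state with the mean of the other agents' messages. The scalar weights stand in for CommNet's learned layers:

```python
# CommNet-style aggregation sketch: each agent's next hidden value is a
# function of its own state and the MEAN of the other agents' messages.
# The scalar weights are toy stand-ins for learned linear layers.
def commnet_step(hidden, w_self=1.0, w_comm=0.5):
    n = len(hidden)
    out = []
    for i, h in enumerate(hidden):
        others = [hidden[j] for j in range(n) if j != i]
        c = sum(others) / len(others)          # mean of received messages
        out.append(w_self * h + w_comm * c)    # a learned layer in real CommNet
    return out

h = commnet_step([1.0, 2.0, 3.0])
# agent 0 receives mean(2, 3) = 2.5, so 1.0 + 0.5 * 2.5 = 2.25
assert h[0] == 2.25
```

Because the mean is differentiable, gradients from every agent's action loss flow back through the channel into the senders, which is what "learning what to say" means mechanically.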
DIAL and Discrete Communication
DIAL (Differentiable Inter-Agent Learning) extends this to discrete communication. During training, messages are relaxed to continuous values (allowing gradients to flow through the communication channel). At test time, messages are discretized to simulate realistic bandwidth-limited communication. DIAL agents learn to use their limited discrete channel efficiently — naturally compressing task-relevant information.
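A rough sketch of DIAL's discretise/regularise unit: during training the message passes through a noisy sigmoid so gradients can flow, while at test time it is hard-thresholded to a single bit. The noise scale used here is illustrative:

```python
import math
import random

# Sketch of DIAL's discretise/regularise unit (DRU). Training: a noisy
# sigmoid keeps the channel differentiable. Test: hard 1-bit threshold.
# The noise scale sigma=2.0 is illustrative, not DIAL's exact setting.
def dru(message, train=True, sigma=2.0, rng=random.Random(0)):
    if train:
        # continuous relaxation; noise pushes messages toward saturation
        return 1.0 / (1.0 + math.exp(-(message + rng.gauss(0.0, sigma))))
    return 1.0 if message > 0 else 0.0   # discrete 1-bit channel at test time

assert dru(3.0, train=False) == 1.0
assert dru(-3.0, train=False) == 0.0
assert 0.0 < dru(3.0, train=True) < 1.0   # differentiable during training
```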
TarMAC and Attention
TarMAC introduces targeted, attention-based communication. Rather than broadcasting to all agents, each agent produces a key-value pair; receivers compute attention over senders to selectively weight messages. This allows agents to learn who is worth listening to and what to communicate — routing information efficiently in large multi-agent systems.
Emergent Language
A deeper line of research studies whether agents can develop emergent language — a compositional communication protocol that generalizes to novel situations. Lewis signaling games provide the simplest test: a sender observes an object and sends a discrete symbol; a receiver must reconstruct the object. Work from Mordatch, Abbeel, and others has shown that referential games with neural agents can produce protocols with compositional structure — different symbols for different attributes — although whether these exhibit the recursive, hierarchical structure of natural language remains contested.
Practical note: In many deployed multi-agent systems, the communication channel is constrained — bandwidth-limited, noisy, or subject to latency. Algorithms that learn discrete communication (DIAL, RIAL) or that work without any explicit channel (CTDE) tend to be more deployable than those requiring high-bandwidth continuous message passing.
Emergent Behavior
Multi-agent systems can produce behaviors far more complex than any designer specified — from tool use and deception to social norms and hierarchical coordination. These emergent phenomena are among the most striking results in modern AI.
OpenAI Hide-and-Seek
OpenAI's 2019 hide-and-seek experiment is the canonical demonstration of open-ended emergent behavior. Hiders (rewarded for hiding) and seekers (rewarded for finding) trained in a physics-based environment with moveable boxes and ramps. Over six phases of training, spontaneous tool use emerged: seekers learned to use ramps to jump over walls; hiders responded by locking the ramps; seekers then exploited a physics glitch to surf on boxes. None of this behavior was programmed — it emerged from the adversarial training signal alone. The progression of strategies demonstrated an autocurriculum: each side's innovations provided a new challenge for the other.
Emergent Social Norms
When large populations of agents interact repeatedly in social dilemmas, cooperative norms can emerge through learning. Agents may evolve reciprocity (tit-for-tat), reputation systems, and punishment of defectors — not because these were hard-coded but because they are evolutionarily stable strategies in the repeated game. Research in multi-agent social learning uses these dynamics to study the origins of human cooperation, fairness, and punishment.
Division of Labor and Role Emergence
In cooperative tasks that reward specialization, agents spontaneously adopt different roles. In logistics scenarios some agents become scouts while others become carriers. In team sports simulations agents develop positional roles analogous to offense and defense. These emergent roles are not assigned — they arise because differentiation increases collective performance. Role differentiation can be encouraged algorithmically through diversity-promoting rewards or explicitly by conditioning each agent on a role embedding, but it often appears even without explicit incentives.
Autocurricula
The term autocurriculum captures the key mechanism behind many emergent MARL results: agents generate training challenges for each other automatically. As one side improves, the other faces a harder problem, pushing continued learning. Unlike hand-crafted curricula, autocurricula scale naturally — the difficulty always matches the current agent capability. The theoretical conditions under which autocurricula lead to open-ended learning (rather than cycling or stagnating) are an active area of research.
Opponent Modeling
To cooperate or compete effectively, it helps to know what other agents will do. Opponent modeling — inferring beliefs, policies, or intentions from observed behavior — is the multi-agent analog of theory of mind.
Behavioral Cloning of Opponents
The simplest approach is to train a supervised model to predict an opponent's next action from its observed history. This opponent model can then be used to plan more effectively: instead of treating other agents as stationary noise, the agent can anticipate responses. In cooperative settings, teammate modeling helps coordinate without explicit communication; in competitive settings, an accurate opponent model enables exploitation.
Recursive Reasoning
Level-k reasoning models agents as reasoning recursively about each other. A Level-0 agent acts randomly. A Level-1 agent best-responds to a Level-0 model of its opponents. A Level-2 agent best-responds to a Level-1 model. Behavioral economics experiments suggest humans often reason at Level-1 or Level-2. In practice, training agents with explicit level labels can produce robust strategies even at test time when opponent level is unknown — simply averaging over levels gives a policy that performs well against a range of opponents.
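Level-k best responses can be sketched for a matrix game. With ties broken toward the lowest-index action, a level-0 rock-paper-scissors opponent (uniform) is met with rock, a level-2 reasoner plays paper, and so on:

```python
# Level-k reasoning in a matrix game: level-0 plays uniformly at random;
# level-k best-responds to a level-(k-1) model of the opponent.
def best_response(payoff, opponent_dist):
    """payoff[a][b]: my reward when I play a and the opponent plays b."""
    n = len(payoff)
    expected = [sum(opponent_dist[b] * payoff[a][b] for b in range(n))
                for a in range(n)]
    return max(range(n), key=expected.__getitem__)  # ties -> lowest index

def level_k_action(payoff, k):
    n = len(payoff)
    dist = [1.0 / n] * n                  # level-0 opponent model: uniform
    action = None
    for _ in range(k):
        action = best_response(payoff, dist)
        dist = [0.0] * n
        dist[action] = 1.0                # next level models this response
    return action

# Rock-paper-scissors payoffs for the row player (0=R, 1=P, 2=S).
rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
assert level_k_action(rps, 1) == 0   # vs uniform, all actions tie; picks rock
assert level_k_action(rps, 2) == 1   # paper beats a level-1 rock player
assert level_k_action(rps, 3) == 2   # scissors beats a level-2 paper player
```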
Theory of Mind in RL
Humans attribute mental states — beliefs, goals, intentions — to others and reason about them. Theory of mind (ToM) in agents means modeling not just what opponents will do but why. Machine ToM approaches train agents to maintain beliefs over other agents' hidden states (their goals or types) and update those beliefs as new information arrives, using Bayesian inference or a learned belief encoder. ToM-enabled agents can adapt to novel opponents more quickly because they generalize across different policies with the same underlying goal.
Ad Hoc Cooperation
Ad hoc teamwork is the problem of cooperating with previously unseen teammates whose policies are unknown. A capable ad hoc agent must rapidly infer teammates' capabilities and intentions from observation and adapt its own strategy accordingly. This requires online opponent modeling combined with a flexible policy representation. Benchmarks in ad hoc teamwork evaluate agents against populations of diverse partners, including humans, measuring adaptation speed and final cooperative performance.
Applications & Frontiers
Multi-agent RL has moved from toy games to real-world systems: StarCraft grandmaster play, autonomous vehicle fleets, and large-scale trading systems. Mean field methods scale cooperative MARL to thousands of agents.
AlphaStar and StarCraft II
DeepMind's AlphaStar trained on StarCraft II using a combination of supervised learning from human replays, self-play, and league training. The agent achieved grandmaster level (rated above 99.8% of officially ranked human players) using the game's raw interface (lists of units and their attributes rather than screen pixels) processed through a Transformer-based architecture. AlphaStar demonstrated that MARL combined with league training could solve a real-time strategy game with partial observability, combinatorial action spaces, and long time horizons, while remaining robust against diverse human opponents rather than brittle against a single strategy.
Autonomous Driving
Fleets of autonomous vehicles constitute a natural multi-agent system: each car is an agent, other cars are simultaneously adversaries (competing for lanes) and cooperators (avoiding collisions). MARL approaches to autonomous driving must handle continuous action spaces, partial observability, varying numbers of agents, and safety constraints. Mixed cooperative-competitive formulations are common: vehicles cooperate to maintain traffic flow and avoid accidents while competing for limited road space and time.
Mean Field Reinforcement Learning
As the number of agents grows toward hundreds or thousands, tracking each agent individually becomes intractable. Mean field RL approximates the collective behavior of all other agents by their empirical mean field — the distribution of actions in the population. Each agent then treats the mean field as a single aggregate opponent rather than tracking individual competitors. Under the mean field approximation, the multi-agent game decomposes into a sequence of single-agent problems, each interacting with a fixed field. Mean field Q-learning (MFQ) and mean field actor-critic algorithms scale cooperative and mixed-motive MARL to populations of thousands of agents — with applications in ride-sharing optimization, epidemic control, and financial market simulation.
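A sketch of the tabular mean field Q-update: the Q-table is keyed by the agent's own action and a discretized mean of its neighbors' actions, rather than by the full joint action. All numbers are illustrative:

```python
# Tabular mean field Q-learning sketch: agent i's Q-table is keyed by its
# own action and the (discretized) mean action of its neighbors, instead
# of by every neighbor's individual action. Hyperparameters are illustrative.
def mfq_update(Q, a, mf, r, v_next=0.0, alpha=0.1, gamma=0.9):
    """One TD update on Q(a, mf); mf is the discretized mean field."""
    q = Q.get((a, mf), 0.0)
    Q[(a, mf)] = q + alpha * (r + gamma * v_next - q)
    return Q[(a, mf)]

# Three neighbors with binary actions; two play 1, so the mean field is ~0.67.
neighbors = [1, 1, 0]
mf = round(sum(neighbors) / len(neighbors), 2)   # 0.67
Q = {}
mfq_update(Q, a=1, mf=mf, r=1.0)                 # first update moves Q toward r
assert abs(Q[(1, mf)] - 0.1) < 1e-9
```

The table's size now grows with the number of mean-field bins, not with the number of neighbors, which is the source of the method's scalability.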
MARL Algorithm Landscape
| Algorithm | Setting | Paradigm | Key Idea |
|---|---|---|---|
| IQL | Cooperative / general-sum | Independent | Simplest baseline; ignores non-stationarity |
| VDN | Cooperative | CTDE (value decomp.) | Additive Q-value factorization |
| QMIX | Cooperative | CTDE (value decomp.) | Monotone mixing network via hypernetwork |
| MADDPG | General-sum (continuous) | CTDE (actor-critic) | Centralized critic per agent; DDPG actors |
| MAPPO | Cooperative | CTDE (actor-critic) | Shared centralized critic; PPO clipping |
| HAPPO | Cooperative | CTDE (actor-critic) | Sequential policy updates with monotonic guarantees |
| Minimax-Q | Competitive (zero-sum) | Centralized | Minimax value with LP at each state |
| League Training | Competitive | Self-play + population | Diverse agent league for robust strategies |
| MFQ | Large-population | Mean field approx. | Replace individual interactions with aggregate field |
Open Problems
MARL remains a frontier with many open questions.
- Sample efficiency: multi-agent environments require exponentially more samples than single-agent ones; combining model-based MARL with the approaches from Chapter 05 is an active area.
- Scalability: most algorithms are validated on two to ten agents; scaling to hundreds requires mean field or other approximations that may introduce significant bias.
- Robustness and safety: competitive MARL systems can learn deceptive or unsafe strategies; constraining agents to be safe in deployment while still training adversarially is unsolved.
- Human-AI teaming: agents trained via self-play often develop coordination protocols that are opaque or incompatible with human partners — building agents that cooperate naturally with people remains a key challenge for the next generation of cooperative AI.
Further Reading
Foundational Papers
- Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms. Comprehensive theoretical survey covering convergence, equilibrium concepts, and algorithm design across cooperative, competitive, and mixed settings. Best single-paper overview of the field's theoretical landscape.
- Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (MADDPG). Introduced CTDE for continuous action spaces, with centralized critics and decentralized actors. Became the standard baseline for general-sum multi-agent scenarios. Foundational CTDE paper; required reading.
- QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. Proposed the monotone mixing network approach to value decomposition, enabling tractable cooperative MARL via the IGM property. State-of-the-art on SMAC for years. Essential cooperative MARL paper.
Key Results
- Emergent Tool Use From Multi-Agent Autocurricula. The hide-and-seek paper demonstrating six phases of emergent strategy, including tool use, ramp exploitation, and box surfing — all arising from competitive pressure alone. Best demonstration of emergent behavior via autocurriculum.
- The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games (MAPPO). Showed that carefully implemented PPO with a centralized value function matches or outperforms more complex MARL algorithms, emphasizing implementation hygiene over algorithmic complexity. Crucial corrective to overcomplicated MARL algorithms.
- Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning (AlphaStar). AlphaStar's league training approach, achieving top human-level play in StarCraft II using a combination of supervised pretraining, self-play, and a diverse agent population. The benchmark for competitive MARL at scale.
Courses & Textbooks
- Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. The first dedicated MARL textbook, covering game theory, cooperative and competitive algorithms, emergent behavior, and applications. Comprehensive and up-to-date. The definitive reference — draft freely available online.
- PyMARL / EPyMARL. PyMARL is the standard implementation framework for MARL algorithms on SMAC, including QMIX, VDN, COMA, and IQL. EPyMARL extends it with MAPPO, MADDPG, and additional environments. The starting point for any practical MARL implementation.