Multi-Agent Reinforcement Learning

When multiple learning agents share an environment, a single agent's reward depends on everyone else's behavior. This chapter explores the game-theoretic foundations, cooperative and competitive architectures, emergent communication, and the paradigm of centralized training with decentralized execution.

Prerequisites

This chapter assumes familiarity with single-agent RL: MDPs, Q-learning, and policy gradient methods (Chapters 01–04). Basic game theory concepts are introduced from scratch. Exposure to deep learning is helpful for the neural network–based methods in later sections.

01

The Multi-Agent Setting

Single-agent RL assumes a stationary environment. Introduce multiple learners and that assumption evaporates — each agent's policy changes the world that every other agent is trying to learn in.

In a standard MDP there is a single agent: its actions affect the environment, and no other learner is present. A Markov game (also called a stochastic game) extends this to $n$ agents. Formally, a Markov game is a tuple $\langle \mathcal{S},\, \{\mathcal{A}^i\}_{i=1}^n,\, \mathcal{T},\, \{R^i\}_{i=1}^n,\, \gamma \rangle$ where each agent $i$ chooses action $a^i \in \mathcal{A}^i$, the transition $\mathcal{T}(s' \mid s, \mathbf{a})$ depends on the joint action $\mathbf{a} = (a^1, \ldots, a^n)$, and each agent has its own reward function $R^i$.
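As a concrete toy, a two-state, two-agent Markov game can be sketched in a few lines of Python. The transition probabilities and reward numbers below are invented for illustration, but they exhibit the defining feature: both $\mathcal{T}$ and each $R^i$ depend on the joint action.

```python
import random

_rng = random.Random(0)  # fixed seed so the sketch is reproducible

def step(state, a1, a2):
    """One step of a toy two-state Markov game (all numbers invented):
    both the transition and each agent's reward depend on the JOINT action."""
    p_stay = 0.9 if a1 == a2 else 0.3        # T(s' | s, a) depends on (a1, a2)
    next_state = state if _rng.random() < p_stay else 1 - state
    r1 = 1.0 if (a1, a2) == (0, 0) else 0.0  # agent 1's own reward R^1
    r2 = 1.0 if (a1, a2) == (1, 1) else 0.0  # agent 2's own reward R^2
    return next_state, r1, r2
```

Because the agents disagree about which joint action is rewarded, this tiny game is general-sum: neither agent can evaluate an action without knowing what the other will do.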

The key difference from single-agent RL is that the environment is now non-stationary from any individual agent's perspective. Even if the world's physics are fixed, the effective environment an agent faces changes as its teammates and opponents update their policies. From a single learner's viewpoint this breaks the stationarity (and effectively the Markov property) that single-agent convergence results rely on.

Partial Observability

Most real multi-agent settings are also partially observable: each agent receives a local observation $o^i$ rather than the full state $s$. A robot sees its own sensors; a game player sees its own screen. Formally, this extends the Markov game to a decentralized partially observable MDP (Dec-POMDP). Solving Dec-POMDPs optimally is NEXP-hard in general, so practical algorithms rely on approximation.

Solution Concepts

In single-agent RL the goal is to find the optimal policy. In multi-agent settings the notion of "optimal" becomes entangled with other agents' behaviors. Three main paradigms organize the field: fully cooperative settings, where all agents share a single team reward; fully competitive (zero-sum) settings; and mixed or general-sum settings, where each agent has its own reward that is only partially aligned with the others'.

The fundamental challenge of MARL is that each agent is simultaneously a learner and a part of the environment that other agents are learning about. Standard convergence proofs for single-agent RL break down because the environment is non-stationary.

02

Game Theory Foundations

Game theory provides the vocabulary and solution concepts that MARL inherits. Understanding Nash equilibria, dominant strategies, and the taxonomy of games is essential before designing learning algorithms.

Normal Form Games

A normal form game captures a single simultaneous interaction. Each player $i$ chooses an action; payoffs are defined by a matrix. In the classic Prisoner's Dilemma, each player can cooperate (C) or defect (D). If both cooperate they each get 3; if both defect they each get 1; if one defects while the other cooperates the defector gets 5 and the cooperator gets 0. Defecting dominates cooperation for both players individually, yet mutual defection (1,1) is worse for both than mutual cooperation (3,3). This is the canonical illustration of how individually rational choices can be collectively irrational.
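The payoff structure just described can be written down and the dominance claim checked directly. A minimal sketch, representing the bimatrix as a dict keyed by joint action:

```python
# Prisoner's Dilemma payoffs from the text: payoffs[(a1, a2)] = (r1, r2).
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def strictly_dominates(a, b, player, payoffs, actions=("C", "D")):
    """True if action a strictly dominates action b for player (0 or 1),
    i.e. a pays more than b against every opponent action."""
    def pay(own, other):
        key = (own, other) if player == 0 else (other, own)
        return payoffs[key][player]
    return all(pay(a, other) > pay(b, other) for other in actions)
```

For both players, `strictly_dominates("D", "C", player, payoffs)` is true, even though the dominated joint outcome (C, C) pays each player more than (D, D).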

Nash Equilibrium

A Nash equilibrium is a joint policy $(\pi^{1*}, \ldots, \pi^{n*})$ such that no agent can improve its expected return by unilaterally changing its own policy. Formally: for every agent $i$ and every alternative policy $\pi^i$,

$$V^i(\pi^{i*}, \boldsymbol{\pi}^{-i*}) \geq V^i(\pi^i, \boldsymbol{\pi}^{-i*})$$

where $\boldsymbol{\pi}^{-i}$ denotes all other agents' policies. Nash proved that every finite game has at least one Nash equilibrium in mixed (randomized) strategies. However, a game may have multiple equilibria, they can be hard to compute, and none of them need coincide with the socially optimal outcome.
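For small normal-form games the definition can be checked by brute force over pure strategies. A minimal sketch (payoffs keyed by joint action as (r1, r2) tuples; mixed equilibria are out of scope here):

```python
def pure_nash(payoffs, actions):
    """Enumerate pure-strategy Nash equilibria of a two-player game:
    joint actions from which neither player gains by deviating alone."""
    eqs = []
    for a1 in actions:
        for a2 in actions:
            r1, r2 = payoffs[(a1, a2)]
            best1 = all(r1 >= payoffs[(b, a2)][0] for b in actions)
            best2 = all(r2 >= payoffs[(a1, b)][1] for b in actions)
            if best1 and best2:
                eqs.append((a1, a2))
    return eqs

pd = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
      ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
# pure_nash(pd, ("C", "D")) returns [("D", "D")]: mutual defection is the
# unique pure equilibrium, despite being Pareto-dominated by (C, C).
```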

Zero-Sum vs. General-Sum

In a zero-sum game the rewards sum to zero: $\sum_i R^i = 0$. Chess, Go, and poker (heads-up) are zero-sum. In these games minimax is the solution concept: one agent minimizes while the other maximizes. A Nash equilibrium in a two-player zero-sum game is also a minimax solution, and it is unique in value (though not necessarily in strategy). General-sum games admit richer dynamics including coordination equilibria, communication, and side payments.

Common Social Dilemmas

Several canonical games appear repeatedly in MARL research: the Prisoner's Dilemma (defection dominates but mutual cooperation is better), the Stag Hunt (coordination game with two equilibria), the Battle of the Sexes (mixed incentives requiring coordination), and the public goods game (each agent can free-ride on collective effort). These templates reveal the fundamental tensions between individual and collective rationality that MARL must navigate.

03

Independent Learners

The simplest approach: give each agent its own RL algorithm and let them learn simultaneously. Surprisingly effective in practice, yet theoretically problematic — and understanding why reveals the core difficulty of MARL.

Independent Q-Learning

Independent Q-Learning (IQL) treats each agent as if it were the only learner. Agent $i$ maintains its own Q-function $Q^i(o^i, a^i)$ and updates it using standard Q-learning, treating other agents' actions as part of an effectively stochastic environment. The simplicity is appealing: existing single-agent algorithms can be deployed without modification.
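A minimal tabular sketch of the IQL update, with hypothetical integer observations; note that nothing in the update refers to the other agents, whose influence is folded into the observed reward and next observation:

```python
from collections import defaultdict

def iql_update(Q, obs, action, reward, next_obs, actions, alpha=0.1, gamma=0.99):
    """One independent Q-learning step for a single agent. Other agents are
    not modeled, which is exactly what makes the target non-stationary."""
    target = reward + gamma * max(Q[(next_obs, a)] for a in actions)
    Q[(obs, action)] += alpha * (target - Q[(obs, action)])

# Each agent keeps its own table and runs the same update independently.
Q_agent = defaultdict(float)
iql_update(Q_agent, obs=0, action=1, reward=1.0, next_obs=0, actions=[0, 1])
```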

The problem is that the environment is non-stationary. As agent $j$ updates its policy, the transition dynamics and reward signal seen by agent $i$ change — but agent $i$'s Q-function does not explicitly account for this. Convergence guarantees that rely on the environment being a stationary MDP no longer apply. In practice, IQL can cycle, diverge, or converge to suboptimal equilibria.

Why It Sometimes Works

Despite the theoretical issues, IQL performs surprisingly well in many benchmarks. The intuition is that if all agents converge quickly relative to each other's rate of change, each agent's local landscape looks approximately stationary. When agents are roughly symmetric and the game has a unique Nash equilibrium that is also a coordination equilibrium, independent learning tends to find it. The gap between theory and practice here motivates much of the more sophisticated MARL research.

Lenient Learners and Hysteretic Q-Learning

Hysteretic Q-Learning addresses IQL's instability by using two learning rates: a large $\alpha$ for positive TD errors (good news) and a small $\beta$ for negative ones (bad news). The asymmetry makes each agent optimistic: it emphasizes improvements and discounts deteriorations that may be due to other agents' transient exploratory behavior. This simple modification significantly improves convergence in cooperative settings.
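The two-rate update is a one-line change to the standard rule; the $\alpha$ and $\beta$ values below are illustrative, not canonical:

```python
def hysteretic_update(q, target, alpha=0.5, beta=0.05):
    """Hysteretic Q-learning step: learn fast from positive TD errors
    (alpha) and slowly from negative ones (beta <= alpha), staying
    optimistic about teammates' transient exploratory mistakes."""
    delta = target - q
    rate = alpha if delta >= 0 else beta
    return q + rate * delta
```

Good news (target above the current estimate) moves the value halfway there; equally large bad news moves it only a twentieth of the way.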

04

Cooperative MARL

When agents share a reward, the challenge shifts from competition to coordination. The joint action space grows exponentially with the number of agents, making centralized computation intractable at scale — but structure in the value function can be exploited.

The Team Problem

In fully cooperative settings all agents share a single team reward $R(s, \mathbf{a})$. The optimal joint policy maximizes $\mathbb{E}[\sum_t \gamma^t R(s_t, \mathbf{a}_t)]$. If agents could communicate freely and compute jointly, this reduces to a standard MDP over the joint action space $\mathcal{A} = \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$. But with $n$ agents each having $|\mathcal{A}|$ actions, the joint space has $|\mathcal{A}|^n$ elements — exponential in the number of agents. Coordination at scale requires exploiting structure.

Credit Assignment

A fundamental difficulty in cooperative MARL is credit assignment: when the team receives a reward, how much credit should each agent receive? If agent 1 scored the goal but agents 2–10 set it up, a naive uniform attribution may not train any of them well. Difference rewards ($D^i = R(\mathbf{a}) - R(\mathbf{a}_{-i})$, the counterfactual reward from removing agent $i$) provide theoretically principled attribution but require estimating what would have happened without each agent — expensive in large teams.
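A sketch of the difference reward. The $R(\mathbf{a}_{-i})$ term in the text denotes removing agent $i$; the common practical variant shown here instead replaces its action with a fixed default (such as a no-op):

```python
def difference_reward(team_reward, joint_action, agent, default_action):
    """Counterfactual credit D^i = R(a) - R(a with agent i's action
    replaced by a fixed default). team_reward maps a joint action tuple
    to a scalar team reward."""
    counterfactual = list(joint_action)
    counterfactual[agent] = default_action
    return team_reward(joint_action) - team_reward(tuple(counterfactual))
```

With a team reward that simply counts contributing agents, an agent's difference reward is 1 exactly when its own action contributed, independent of what the rest of the team did.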

Cooperative MARL benchmark: StarCraft Multi-Agent Challenge (SMAC) is the dominant cooperative benchmark, featuring groups of units (marines, zerglings, stalkers) that must be micromanaged cooperatively to defeat a scripted enemy. It has driven much of the algorithmic progress in value decomposition methods.

05

Competitive MARL & Self-Play

In zero-sum games, the best teacher is a peer who is exactly as good as you are — and the only way to get that teacher is to play against yourself. Self-play is the engine behind AlphaGo, AlphaZero, and OpenAI Five.

Minimax Q-Learning

Minimax-Q extends Q-learning to two-player zero-sum games. The agent learns a joint Q-function $Q(s, a^1, a^2)$ and at each state solves a linear program to find the minimax policy: the strategy that maximizes the agent's return against a worst-case opponent. Minimax-Q converges to the Nash equilibrium value in two-player zero-sum games. The catch is that it requires the full joint action space and a linear program at every step — tractable only for small discrete games.
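For a 2x2 zero-sum game the per-state linear program has a closed-form solution, which makes the idea easy to see; this sketch covers only that special case (row player maximizes M[i][j]), and larger games genuinely need the LP:

```python
def minimax_2x2(M):
    """Game value of a 2x2 zero-sum game for the row (maximizing) player.
    Handles both saddle points and mixed equilibria."""
    (a, b), (c, d) = M
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:          # pure saddle point exists
        return maximin
    # Otherwise mix: play row 0 with the indifference probability p.
    p = (d - c) / (a - b - c + d)
    return p * a + (1 - p) * c      # equals (a*d - b*c) / (a - b - c + d)
```

Matching pennies ([[1, -1], [-1, 1]]) has no saddle point; the formula gives p = 0.5 and game value 0, the familiar coin-flip equilibrium.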

Self-Play

In self-play, an agent trains against copies of itself. The key insight is that a minimax-optimal policy achieves at least the game's value against any opponent, including future versions of itself. Self-play provides an automatic curriculum: as the agent improves, so does its opponent. When implemented naively, self-play can cycle (policy A beats B, B beats C, C beats A) because the "beats" relation between policies need not be transitive. More sophisticated variants train against a population of past selves (fictitious self-play) or a diverse league of agents to promote more robust strategies.

League Training

DeepMind's AlphaStar used league training to train StarCraft II agents. A diverse league of agent types (main agents, main exploiters, and league exploiters), each with its own matchmaking strategy, created a rich multi-agent training environment. Main agents maximize win rate against the entire league; exploiters specifically target weaknesses in current main agents. The result is a policy that is robust to many different opponents simultaneously, achieving grandmaster-level play.

Elo and Skill Rating

Measuring progress in competitive MARL requires rating agents against each other. The Elo rating system assigns scalar scores based on win/loss outcomes; the expected win probability between two players with Elo ratings $R_A$ and $R_B$ is $1/(1 + 10^{(R_B - R_A)/400})$. More advanced systems such as TrueSkill handle partial orderings and account for rating uncertainty. Neither fully captures non-transitive relationships (A beats B beats C beats A), motivating approaches like Nash averaging for more robust skill assessment.
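The Elo formulas translate directly to code; K = 32 below is a conventional update step size, not a requirement of the model:

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32):
    """New (r_a, r_b) after one game; score_a is 1 (win), 0.5, or 0."""
    e_a = elo_expected(r_a, r_b)
    return (r_a + k * (score_a - e_a),
            r_b + k * ((1 - score_a) - (1 - e_a)))
```

A 400-point gap corresponds to roughly a 10:1 expected win ratio, and the update transfers points symmetrically, so the population's total rating is conserved.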

06

Centralized Training, Decentralized Execution

The CTDE paradigm is the dominant organizing principle in modern cooperative MARL. Agents execute independently using only local information, but a shared critic or coordinator uses global information during training to guide them.

The central tension in multi-agent RL is between what is available during training (often all agents' observations and actions) and what is available at execution time (often only each agent's local observation). CTDE resolves this by allowing a centralized critic or mixing network to access global information during training while the policies that are actually deployed use only local observations.

Centralized Training, Decentralized Execution. During training a central critic accesses the full joint state and all actions, computing gradients that update each agent's policy. At execution, only local observations are available — the critic is discarded and policies act independently.

Why CTDE Works

The central critic can condition on global information that is available at training time (from the environment or simulator) even though it is unavailable at execution. This breaks the non-stationarity problem for critic training: the critic sees the full joint action and can properly attribute returns. The individual policies still only condition on local observations, keeping execution practical and decentralized. Most state-of-the-art cooperative MARL algorithms — MADDPG, COMA, MAPPO, and the value decomposition family — are instances of the CTDE paradigm.

07

Value Decomposition Networks

If the joint Q-function can be decomposed into per-agent components, each agent can be trained with its own Q-network while still optimizing for the team. The key constraint is monotonicity: greedy action selection must remain consistent between the joint and individual value functions.

VDN: Additive Decomposition

Value Decomposition Networks (VDN) propose the simplest possible decomposition: the joint Q-value is the sum of individual utilities, $Q_{tot}(\mathbf{a}, s) = \sum_i Q^i(o^i, a^i)$. Each agent $i$ has its own Q-network that takes only its local observation $o^i$. The key property is that $\arg\max_{\mathbf{a}} Q_{tot} = (\arg\max_{a^1} Q^1, \ldots, \arg\max_{a^n} Q^n)$ — the joint greedy action decomposes into individual greedy actions. This makes decentralized execution trivial. The limitation is that the additive constraint is too restrictive: many cooperative tasks have value functions with interaction effects that cannot be captured by a sum.
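The decomposition property is easy to verify numerically; the per-agent utility values below are invented for the sketch:

```python
from itertools import product

# Per-agent utilities for a toy 2-agent, 3-action problem (made-up numbers).
Q1 = [0.1, 0.7, 0.2]
Q2 = [0.4, 0.0, 0.9]

# VDN: Q_tot(a1, a2) = Q1[a1] + Q2[a2]. The joint greedy action...
joint_best = max(product(range(3), range(3)),
                 key=lambda a: Q1[a[0]] + Q2[a[1]])
# ...equals the tuple of individual greedy actions:
local_best = (max(range(3), key=Q1.__getitem__),
              max(range(3), key=Q2.__getitem__))
```

Both computations select (1, 2), so each agent can act greedily on its own utility with no joint search, which is the whole point of the additive factorization.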

QMIX: Monotone Mixing

QMIX relaxes VDN's additivity to monotonicity. The joint Q-value is produced by a mixing network that combines the per-agent utilities using weights constrained to be non-negative. This ensures:

$$\frac{\partial Q_{tot}}{\partial Q^i} \geq 0 \quad \forall\, i$$

Whenever $Q_{tot}$ is monotone in each $Q^i$, the global greedy action still decomposes: $\arg\max_{\mathbf{a}} Q_{tot} = (\arg\max_{a^i} Q^i)_i$. The mixing network's weights are generated by a hypernetwork that takes the global state $s$ as input — so the mixing can depend on state without violating decentralized execution. During training the mixing network and hypernetwork are learned end-to-end via backpropagation through the joint loss.
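A toy scalar version of the monotone mixing step, with a stand-in hypernetwork; the real QMIX uses multi-layer mixing with weight matrices produced by learned hypernetworks of the state:

```python
def hypernet(state_embedding):
    """Stand-in for QMIX's hypernetwork: any function of the global state
    may produce the mixing weights (the values here are arbitrary)."""
    return [s - 0.5 for s in state_embedding]

def qmix_mix(per_agent_q, state_embedding):
    """One-layer monotone mixing, QMIX-style: the hypernetwork output is
    passed through abs() so every weight is non-negative, guaranteeing
    dQ_tot / dQ^i >= 0 for every agent i."""
    weights = [abs(w) for w in hypernet(state_embedding)]
    bias = sum(state_embedding)  # the state-dependent bias is unconstrained
    return sum(w * q for w, q in zip(weights, per_agent_q)) + bias
```

Raising any agent's utility can never lower $Q_{tot}$, which is exactly the monotonicity condition that keeps decentralized greedy execution consistent.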

QTRAN and Beyond

QTRAN drops the monotonicity constraint entirely, instead enforcing a softer consistency condition using an auxiliary loss. This allows it to represent any factorizable joint Q-function. However, the auxiliary objective can be difficult to optimize stably. More recent work (QPLEX, Weighted QMIX) finds intermediate points with better empirical stability, or conditions the monotonicity on the optimal action to allow richer representations without losing the IGM (Individual-Global-Max) property.

The IGM condition (Individual-Global-Max) is the formal requirement that individual greedy actions constitute a joint greedy action: $\arg\max_{\mathbf{a}} Q_{tot}(\mathbf{a}) = (\arg\max_{a^i} Q^i(a^i))_i$. This is what makes decentralized greedy execution consistent with centralized training. VDN and QMIX both satisfy IGM; QTRAN enforces it approximately.

08

Multi-Agent Actor-Critic Methods

Policy gradient approaches extend to multi-agent settings by giving each agent an actor while sharing or centralizing the critic. MADDPG introduced CTDE for continuous actions; MAPPO showed that well-tuned PPO scales surprisingly well to large cooperative teams.

MADDPG

Multi-Agent DDPG (MADDPG) extends DDPG to multi-agent settings under CTDE. Each agent $i$ has an actor $\mu^i(o^i;\theta^i)$ that maps local observations to actions, and a centralized critic $Q^i(o^1,\ldots,o^n, a^1,\ldots,a^n;\phi^i)$ that conditions on all observations and actions. During training, the critic uses full information to compute TD targets. At execution, only the actor runs. Each agent has its own critic, allowing general-sum games where agents have different reward functions.

The policy gradient for agent $i$ is:

$$\nabla_{\theta^i} J(\mu^i) = \mathbb{E}\left[\nabla_{\theta^i} \mu^i(o^i) \cdot \nabla_{a^i} Q^i(\mathbf{o}, \mathbf{a})\Big|_{a^i = \mu^i(o^i)}\right]$$

MADDPG also introduced the idea of training with inferred policies: each agent maintains estimates of other agents' policies and conditions its critic on these estimates, improving stability when other agents' policies are changing.

MAPPO

Multi-Agent PPO (MAPPO) applies PPO to cooperative settings with a shared centralized critic. Despite its simplicity, Yu et al. (2021) showed that MAPPO with careful implementation choices (centralized value normalization, mini-batch training, input normalization) matches or outperforms more specialized algorithms on SMAC and MPE benchmarks. The key insight is that many apparent failures of on-policy methods in MARL trace to implementation details rather than fundamental algorithmic limitations.

HAPPO and Heterogeneous Agents

HAPPO (Heterogeneous-Agent PPO) addresses a subtle issue in multi-agent policy optimization: standard PPO's clipping provides monotonic improvement guarantees for single agents, but these do not compose — improving each agent's policy individually does not guarantee improving the joint policy. HAPPO derives a factored sequential policy update scheme where each agent is updated conditioned on the updated policies of agents earlier in a fixed ordering, restoring monotonic improvement in the joint policy.

09

Communication & Coordination

When agents can exchange messages, cooperation becomes dramatically easier — but only if the communication protocol is meaningful. End-to-end differentiable communication allows agents to learn what to say to each other, giving rise to emergent protocols.

Differentiable Communication

In differentiable communication architectures, agents' messages are continuous vectors and the entire system — observations to messages to actions — is trained end-to-end with backpropagation. CommNet (Sukhbaatar et al., 2016) is the prototype: agents broadcast a continuous message vector at each step; each agent's action is computed from its observation plus the mean of all received messages. CommNet showed that learned communication could improve cooperative performance on tasks like traffic coordination and resource management.
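One CommNet-style communication round, reduced to scalar hidden states with hypothetical weights, shows the broadcast-and-mean structure (the real model uses learned weight matrices and vector messages):

```python
def commnet_step(hiddens, W_h=0.8, W_c=0.5):
    """One CommNet-style round (scalar sketch): each agent's new hidden
    state combines its own state with the mean of the OTHER agents'
    broadcast messages. Here each message is just the hidden state."""
    n = len(hiddens)
    out = []
    for i, h in enumerate(hiddens):
        c = sum(hiddens[j] for j in range(n) if j != i) / (n - 1)
        out.append(W_h * h + W_c * c)
    return out
```

Because the mean is differentiable, gradients from any agent's loss flow back through the channel into every sender, which is what lets the protocol be learned end-to-end.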

DIAL and Discrete Communication

DIAL (Differentiable Inter-Agent Learning) extends this to discrete communication. During training, messages are relaxed to continuous values (allowing gradients to flow through the communication channel). At test time, messages are discretized to simulate realistic bandwidth-limited communication. DIAL agents learn to use their limited discrete channel efficiently — naturally compressing task-relevant information.

TarMAC and Attention

TarMAC introduces targeted, attention-based communication. Rather than broadcasting to all agents, each agent produces a key-value pair; receivers compute attention over senders to selectively weight messages. This allows agents to learn who is worth listening to and what to communicate — routing information efficiently in large multi-agent systems.

Emergent Language

A deeper line of research studies whether agents can develop emergent language — a compositional communication protocol that generalizes to novel situations. Lewis signaling games provide the simplest test: a sender observes an object and sends a discrete symbol; a receiver must reconstruct the object. Work from Mordatch, Abbeel, and others has shown that referential games with neural agents can produce protocols with compositional structure — different symbols for different attributes — although whether these exhibit the recursive, hierarchical structure of natural language remains contested.

Practical note: In many deployed multi-agent systems, the communication channel is constrained — bandwidth-limited, noisy, or subject to latency. Algorithms that learn discrete communication (DIAL, RIAL) or that work without any explicit channel (CTDE) tend to be more deployable than those requiring high-bandwidth continuous message passing.

10

Emergent Behavior

Multi-agent systems can produce behaviors far more complex than any designer specified — from tool use and deception to social norms and hierarchical coordination. These emergent phenomena are among the most striking results in modern AI.

OpenAI Hide-and-Seek

OpenAI's 2019 hide-and-seek experiment is the canonical demonstration of open-ended emergent behavior. Hiders (rewarded for hiding) and seekers (rewarded for finding) trained in a physics-based environment with moveable boxes and ramps. Over six phases of training, spontaneous tool use emerged: seekers learned to use ramps to jump over walls; hiders responded by locking the ramps; seekers then exploited a physics glitch to surf on boxes. None of this behavior was programmed — it emerged from the adversarial training signal alone. The progression of strategies demonstrated an autocurriculum: each side's innovations provided a new challenge for the other.

Emergent Social Norms

When large populations of agents interact repeatedly in social dilemmas, cooperative norms can emerge through learning. Agents may evolve reciprocity (tit-for-tat), reputation systems, and punishment of defectors — not because these were hard-coded but because they are evolutionarily stable strategies in the repeated game. Research in multi-agent social learning uses these dynamics to study the origins of human cooperation, fairness, and punishment.

Division of Labor and Role Emergence

In cooperative tasks that reward specialization, agents spontaneously adopt different roles. In logistics scenarios some agents become scouts while others become carriers. In team sports simulations agents develop positional roles analogous to offense and defense. These emergent roles are not assigned — they arise because differentiation increases collective performance. Role differentiation can be encouraged algorithmically through diversity-promoting rewards or explicitly by conditioning each agent on a role embedding, but it often appears even without explicit incentives.

Autocurricula

The term autocurriculum captures the key mechanism behind many emergent MARL results: agents generate training challenges for each other automatically. As one side improves, the other faces a harder problem, pushing continued learning. Unlike hand-crafted curricula, autocurricula scale naturally — the difficulty always matches the current agent capability. The theoretical conditions under which autocurricula lead to open-ended learning (rather than cycling or stagnating) are an active area of research.

11

Opponent Modeling

To cooperate or compete effectively, it helps to know what other agents will do. Opponent modeling — inferring beliefs, policies, or intentions from observed behavior — is the multi-agent analog of theory of mind.

Behavioral Cloning of Opponents

The simplest approach is to train a supervised model to predict an opponent's next action from its observed history. This opponent model can then be used to plan more effectively: instead of treating other agents as stationary noise, the agent can anticipate responses. In cooperative settings, teammate modeling helps coordinate without explicit communication; in competitive settings, an accurate opponent model enables exploitation.

Recursive Reasoning

Level-k reasoning models agents as reasoning recursively about each other. A Level-0 agent acts randomly. A Level-1 agent best-responds to a Level-0 model of its opponents. A Level-2 agent best-responds to a Level-1 model. Behavioral economics experiments suggest humans often reason at Level-1 or Level-2. In practice, training agents with explicit level labels can produce robust strategies even at test time when opponent level is unknown — simply averaging over levels gives a policy that performs well against a range of opponents.
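Level-k best responses can be computed recursively on any small normal-form game (payoffs keyed by joint action as (r1, r2) tuples; Level 0 plays a fixed given action rather than randomizing, a simplification for the sketch):

```python
def best_response(payoffs, player, opponent_action, actions):
    """Best response for player (0 or 1) to a fixed opponent action."""
    def pay(own):
        key = (own, opponent_action) if player == 0 else (opponent_action, own)
        return payoffs[key][player]
    return max(actions, key=pay)

def level_k(payoffs, player, k, actions, level0):
    """Level-k reasoning: Level 0 plays `level0`; Level k best-responds
    to a Level k-1 model of the opponent."""
    if k == 0:
        return level0
    opponent = level_k(payoffs, 1 - player, k - 1, actions, level0)
    return best_response(payoffs, player, opponent, actions)
```

On the Prisoner's Dilemma with a Level-0 cooperator, every level k >= 1 defects: the recursion bottoms out at naive cooperation and each layer exploits the one below.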

Theory of Mind in RL

Humans attribute mental states — beliefs, goals, intentions — to others and reason about them. Theory of mind (ToM) in agents means modeling not just what opponents will do but why. Machine ToM approaches train agents to maintain beliefs over other agents' hidden states (their goals or types) and update those beliefs as new information arrives, using Bayesian inference or a learned belief encoder. ToM-enabled agents can adapt to novel opponents more quickly because they generalize across different policies with the same underlying goal.

Ad Hoc Cooperation

Ad hoc teamwork is the problem of cooperating with previously unseen teammates whose policies are unknown. A capable ad hoc agent must rapidly infer teammates' capabilities and intentions from observation and adapt its own strategy accordingly. This requires online opponent modeling combined with flexible policy representation. The Open Agent benchmark and related work in ad hoc teamwork test agents on populations of diverse partners, including humans, measuring adaptation speed and final cooperative performance.

12

Applications & Frontiers

Multi-agent RL has moved from toy games to real-world systems: StarCraft grandmaster play, autonomous vehicle fleets, and large-scale trading systems. Mean field methods scale cooperative MARL to thousands of agents.

AlphaStar and StarCraft II

DeepMind's AlphaStar trained on StarCraft II using a combination of supervised learning from human replays, self-play, and league training. The agent reached grandmaster level, placing above roughly 99.8% of ranked human players, operating from the raw game interface (lists of unit attributes and spatial feature layers) with a Transformer over units as a core component of its encoder. AlphaStar demonstrated that MARL combined with league training could solve a real-time strategy game with partial observability, combinatorial action spaces, and long time horizons, producing play that is robust against diverse human opponents rather than brittle against a single strategy.

Autonomous Driving

Fleets of autonomous vehicles constitute a natural multi-agent system: each car is an agent, other cars are simultaneously adversaries (competing for lanes) and cooperators (avoiding collisions). MARL approaches to autonomous driving must handle continuous action spaces, partial observability, varying numbers of agents, and safety constraints. Mixed cooperative-competitive formulations are common: vehicles cooperate to maintain traffic flow and avoid accidents while competing for limited road space and time.

Mean Field Reinforcement Learning

As the number of agents grows toward hundreds or thousands, tracking each agent individually becomes intractable. Mean field RL approximates the collective behavior of all other agents by their empirical mean field — the distribution of actions in the population. Each agent then treats the mean field as a single aggregate opponent rather than tracking individual competitors. Under the mean field approximation, the multi-agent game decomposes into a sequence of single-agent problems, each interacting with a fixed field. Mean field Q-learning (MFQ) and mean field actor-critic algorithms scale cooperative and mixed-motive MARL to populations of thousands of agents — with applications in ride-sharing optimization, epidemic control, and financial market simulation.
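The mean field idea in miniature: summarize the population by its empirical action distribution and best-respond to that distribution. The payoff function and population below are hypothetical:

```python
from collections import Counter

def mean_field(actions_taken, action_space):
    """Empirical mean field: the population's current action distribution."""
    counts = Counter(actions_taken)
    n = len(actions_taken)
    return {a: counts[a] / n for a in action_space}

def mf_best_response(payoff, mu, action_space):
    """Best response to the aggregate field: the own action with the
    highest expected payoff when the 'opponent' action is drawn from mu."""
    return max(action_space,
               key=lambda a: sum(p * payoff(a, b) for b, p in mu.items()))
```

In a coordination game where matching the crowd pays 1, an agent facing a population that mostly plays action 0 best-responds with action 0, without ever modeling any individual competitor.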

MARL Algorithm Landscape

| Algorithm | Setting | Paradigm | Key Idea |
| --- | --- | --- | --- |
| IQL | Cooperative / general-sum | Independent | Simplest baseline; ignores non-stationarity |
| VDN | Cooperative | CTDE (value decomp.) | Additive Q-value factorization |
| QMIX | Cooperative | CTDE (value decomp.) | Monotone mixing network via hypernetwork |
| MADDPG | General-sum (continuous) | CTDE (actor-critic) | Centralized critic per agent; DDPG actors |
| MAPPO | Cooperative | CTDE (actor-critic) | Shared centralized critic; PPO clipping |
| HAPPO | Cooperative | CTDE (actor-critic) | Sequential policy updates with monotonic guarantees |
| Minimax-Q | Competitive (zero-sum) | Centralized | Minimax value via LP at each state |
| League Training | Competitive | Self-play + population | Diverse agent league for robust strategies |
| MFQ | Large-population | Mean field approx. | Replace individual interactions with aggregate field |

Open Problems

MARL remains a frontier with many open questions. Sample efficiency: multi-agent environments require exponentially more samples than single-agent ones; combining model-based MARL with the approaches from Chapter 05 is an active area. Scalability: most algorithms are validated on two to ten agents; scaling to hundreds requires mean field or other approximations that may introduce significant bias. Robustness and safety: competitive MARL systems can learn deceptive or unsafe strategies; constraining agents to be safe in deployment while still training adversarially is unsolved. Human-AI teaming: agents trained via self-play often develop coordination protocols that are opaque or incompatible with human partners — building agents that cooperate naturally with people remains a key challenge for the next generation of cooperative AI.

Further Reading

Foundational Papers

Key Results

Courses & Textbooks