Multi-Agent Systems: When One Agent Isn't Enough
A single agent is bounded by its context window, its knowledge, and its single thread of reasoning. Multiple agents can divide cognitive labour, cross-check each other's work, pursue parallel lines of inquiry, and collectively reach conclusions that no individual would reach alone. This chapter examines how LLM-based agents are coordinated into systems: the protocols they use to communicate, the roles they specialise into, the debate patterns that improve their reliability, and the failure modes that emerge when coordination breaks down.
Prerequisites
This chapter builds on Agent Frameworks & Infrastructure (Ch 07) — especially the orchestration patterns and AutoGen discussion — and on Tool Use & Function Calling (Ch 05) for context on how individual agents act. Readers familiar with classical multi-agent systems from the AI textbook tradition will recognise many concepts here, though the LLM setting changes several key assumptions.
Why Multiple Agents?
The limitations of a single LLM agent are structural, not incidental. The context window is finite — a single agent working a complex research task will eventually run out of room to hold the full problem state. Attention degrades over long contexts, so tasks requiring sustained coherence across thousands of tokens suffer quality decay. A single reasoning thread serialises all thought — there is no way to pursue two hypotheses simultaneously. And a model trained on a single distribution will have systematic blind spots that self-reflection alone cannot reveal.
Multi-agent systems address these limitations through division of labour. Different agents can hold different parts of the problem state, run in parallel, and challenge each other's outputs. The analogy to human organisations is imperfect but instructive: teams outperform individuals not because any individual is smarter, but because specialisation, parallel effort, and mutual critique collectively achieve what no individual could alone.
The empirical case is also strong. Du et al. (2023) showed that multi-agent debate — where multiple LLMs argue their positions and update based on peer responses — substantially improves accuracy on mathematical and strategic reasoning tasks versus single-agent sampling. Liang et al. (2023) found that disagreement between agents is a reliable signal of task difficulty and a driver of quality improvements in the final consensus output. Self-consistency (Wang et al., 2022) — sampling multiple reasoning paths and majority-voting the final answer — outperforms single-chain reasoning across a wide range of benchmarks without any additional training.
A natural question is whether multi-agent systems will become unnecessary as context windows grow. The evidence suggests not. The quality of attention over very long contexts degrades in ways that parallel processing avoids. Specialisation — running a dedicated critic rather than asking the same model to critique its own output — provides orthogonality that a single model's self-reflection cannot match. And parallelism reduces wall-clock time in ways that no amount of context extension addresses. The two approaches are complements, not substitutes.
Coordination Protocols
How agents share information and coordinate action is the central engineering problem of multi-agent systems. The mechanisms range from fully centralised (a single orchestrator routes all messages) to fully decentralised (agents communicate directly using a shared medium), with many hybrid arrangements between.
Message Passing
The simplest coordination primitive is direct message passing: agent A sends a structured message to agent B, which processes it and responds. In LLM-based systems this typically means appending a message to B's conversation history with a sender tag — [Researcher]: Here are my findings: ... — so B has both the content and its provenance. The format of messages matters considerably: structured JSON is more reliably parsed than free prose, but prose preserves nuance that rigid schemas can discard. Most production systems use a hybrid: structured headers (sender, recipient, message type, timestamp) wrapping free-text content.
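As a minimal sketch of that hybrid, the envelope below wraps free-text content in a structured header (the field names and the rendered tag format are illustrative, not a standard):

```python
# A minimal message envelope: structured header wrapping free-text content.
from dataclasses import dataclass, field
import time

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    msg_type: str          # e.g. "findings", "critique", "task"
    content: str           # free-text body, preserved verbatim
    timestamp: float = field(default_factory=time.time)

    def render(self) -> str:
        """Render as a tagged line for appending to the recipient's history."""
        return f"[{self.sender} -> {self.recipient} | {self.msg_type}]: {self.content}"

msg = AgentMessage("Researcher", "Writer", "findings", "Here are my findings: ...")
# msg.render() -> "[Researcher -> Writer | findings]: Here are my findings: ..."
```

The header is what downstream code parses (routing, logging, filtering); the body stays free prose so no nuance is lost to a rigid schema.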
Blackboard Systems
In a blackboard architecture, agents do not communicate directly. Instead they read from and write to a shared knowledge store — the blackboard — and a controller decides which agent to activate next based on the current blackboard state. This pattern from classical AI is experiencing a revival in LLM systems: the shared store might be a vector database of retrieved facts, a structured JSON document, or a simple key-value store. Agents are activated by an orchestrating LLM or rule-based scheduler that monitors the blackboard for triggering conditions.
The blackboard pattern decouples agents from each other (no agent needs to know about the others), enables asynchronous contribution (agents can write to the blackboard at different times), and makes the intermediate state inspectable — a useful property for debugging and auditing. Its weakness is the potential for conflicting writes: two agents updating the same field simultaneously can produce inconsistent state, requiring conflict resolution logic.
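One common way to handle the conflicting-write problem is optimistic concurrency: each field carries a version number, and a write supplying a stale version is rejected rather than silently overwriting another agent's update. A toy sketch (class and method names are illustrative):

```python
# A minimal blackboard with optimistic concurrency control.
class Blackboard:
    def __init__(self):
        self._store = {}   # key -> (value, version)

    def read(self, key):
        return self._store.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self._store.get(key, (None, 0))
        if expected_version != current:
            return False   # conflict: caller must re-read and retry
        self._store[key] = (value, current + 1)
        return True

bb = Blackboard()
_, v = bb.read("findings")
bb.write("findings", "quantum notes", v)      # succeeds: version matches
bb.write("findings", "stale overwrite", v)    # rejected: version is now stale
```

The losing agent re-reads the current value and decides whether its update still applies, which is exactly the conflict-resolution logic the pattern requires.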
Contract Net Protocol
The Contract Net Protocol (Smith, 1980) models coordination as a market. A manager agent announces a task; bidder agents submit proposals describing their capability and cost; the manager selects the best bid and awards the contract. This pattern maps naturally to dynamic task allocation in LLM systems: an orchestrator posts a task ("research quantum computing post-2022"), specialist agents describe their tool access and expertise, and the orchestrator selects the most capable agent for the job. Unlike static role assignment, contract net allows dynamic specialisation — an agent with access to arXiv will win research tasks; an agent with a code interpreter will win computation tasks.
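The announce-bid-award cycle can be sketched in a few lines. Here bidding is a toy capability match against the task's requirements; in a real system each agent's LLM would self-assess and the manager would weigh cost as well (all names below are hypothetical):

```python
# Contract-net sketch: manager announces a task, bidders score their own
# fitness, manager awards the contract to the best bid.
def bid(agent, task):
    """Score = fraction of the task's required capabilities this agent has."""
    needed = set(task["requires"])
    return len(needed & set(agent["tools"])) / len(needed)

def award(task, agents):
    score, winner = max((bid(a, task), a["name"]) for a in agents)
    return winner if score > 0 else None   # no capable bidder: task unassigned

agents = [
    {"name": "arxiv-researcher", "tools": {"arxiv", "web_search"}},
    {"name": "coder", "tools": {"code_interpreter"}},
]
task = {"goal": "research quantum computing post-2022", "requires": ["arxiv", "web_search"]}
# award(task, agents) -> "arxiv-researcher"
```

Because bids are computed per task, the same pool of agents specialises dynamically: change the task's requirements and a different agent wins.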
Shared State via Structured Documents
A practical alternative to explicit protocols is a shared structured document that all agents can read and update — a living brief, a research memo, a code file. Agents take turns extending, correcting, or annotating the document rather than sending messages to each other. This is the pattern underlying many code-writing multi-agent setups (where the shared state is the codebase) and the "scratchpad" pattern in chain-of-thought systems. The coordination overhead is low — agents simply read the document and append — but conflict resolution requires turn-taking discipline or an orchestrating lock manager.
Role Specialisation
The most natural way to structure a multi-agent system is to assign each agent a distinct role — a focused identity, capability set, and responsibility. Role specialisation trades generality for depth: a dedicated Researcher with web search tools will outperform a generalist agent trying to both research and write, because its system prompt, tool access, and context are all optimised for a single function.
- **Proposer**: Generates initial answers, plans, or hypotheses. Optimised for breadth and creativity. Trades precision for coverage — its outputs are raw material for other agents to refine.
- **Critic**: Challenges Proposer outputs: finds errors, identifies assumptions, flags missing cases. Most effective when given explicit criteria rather than asked to "find problems."
- **Verifier**: Checks specific claims against ground truth: runs code, queries databases, calls search APIs. Provides the deterministic anchor that prevents hallucination propagation.
- **Synthesiser**: Integrates outputs from other agents into a coherent whole. Must resolve contradictions, fill gaps, and maintain a consistent voice across diverse contributions.
- **Researcher**: Retrieves external information via search, RAG, or API calls. Handles pagination, source evaluation, and result summarisation before passing findings to reasoning agents.
- **Executor**: Takes planned actions in the world: writes files, calls APIs, sends messages. Intentionally narrow — no reasoning, just reliable execution of well-specified instructions.
Fixed vs. Dynamic Roles
Roles can be fixed at design time (each agent is always the Critic, always the Researcher) or assigned dynamically during a run. Fixed roles are simpler to reason about and easier to optimise — you can tune the Critic's system prompt independently of the others. Dynamic roles, where an orchestrator assigns responsibilities based on task requirements, are more flexible but introduce the risk of misassignment: an agent asked to critique when it lacks the domain knowledge to do so effectively.
A middle path is role pools: a team of five Researcher agents, each with access to different tool sets (one with web search, one with academic APIs, one with a code interpreter), from which the orchestrator selects based on the task type. This preserves the flexibility of dynamic assignment while keeping each agent's identity stable.
The Specialist-Generalist Tradeoff
Deeply specialised agents perform better on their target tasks but require more careful orchestration — the system must correctly identify which specialist to invoke. Generalist agents are easier to orchestrate but regress toward average performance. Empirically, the benefit of specialisation is largest when tasks have genuinely different optimal prompts, tool sets, or context requirements. For tasks where a single capable model performs near-ceiling, the overhead of multi-agent coordination may not be worth the marginal gain from specialisation.
Debate & Critique Patterns
One of the most reliably effective multi-agent techniques is structured disagreement: having one agent produce an answer and one or more other agents challenge it. The mechanism is simple but the effect is substantial — models that know their output will be scrutinised produce more careful, hedged, and accurate responses, and the critiquing agent often catches errors that the proposer missed.
LLM Debate (Du et al., 2023)
In the debate framework, multiple agents independently generate answers to a question, then each agent is shown the other agents' answers and asked to refine its own position. This process repeats for several rounds. Convergence toward consensus indicates high confidence; persistent disagreement flags genuinely hard cases for human review. Du et al. found that debate improved accuracy on mathematical reasoning, strategic games, and factual recall, with gains scaling with the number of debaters up to about four agents.
The key implementation detail is the framing of the refinement step. Asking an agent to "revise your answer given your peers' responses" tends to produce sycophantic capitulation — the agent converges on the most common answer regardless of its correctness. Better framings instruct the agent to "identify specific errors in your own reasoning or your peers' reasoning, and update only if you find a substantive reason to." This preserves the beneficial diversity of opinion while resisting peer pressure.
Generator-Critic-Reviser
The generator-critic-reviser (GCR) pattern is a three-agent pipeline that separates generation, evaluation, and revision into distinct roles. The Generator produces an initial draft without any self-censorship. The Critic evaluates it against explicit criteria — accuracy, completeness, tone, formatting — and produces a structured critique. The Reviser receives both the draft and the critique and produces an improved version. The pattern can be iterated, with the new draft re-entering the Critic.
GCR outperforms self-refinement (where a single model generates, critiques, and revises) because it achieves genuine orthogonality: the Generator is not biased toward its own output when evaluating, and the Reviser is not anchored to the original draft when rewriting. The Critic role benefits most from explicit rubrics — a vague instruction to "critique the draft" elicits generic feedback, while a structured criterion checklist elicits actionable, specific observations.
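The pipeline itself is just wiring: three role-specific callables, with the critique passed alongside the draft to the reviser. A sketch with toy stand-ins for the model calls:

```python
# Generator-critic-reviser as a pipeline over three callables. Each callable
# would be a separate LLM call with its own role-specific system prompt.
def gcr_round(task, generate, critique, revise):
    draft = generate(task)
    review = critique(task, draft)           # structured critique vs. rubric
    return revise(task, draft, review), review

# Toy stand-ins for the three model calls:
generate = lambda task: f"DRAFT: {task}"
critique = lambda task, draft: "missing units; tone too casual"
revise   = lambda task, draft, review: f"{draft} [revised per: {review}]"

final, review = gcr_round("explain entropy", generate, critique, revise)
```

Iterating the pattern means feeding `final` back in as the new draft and calling the Critic again, until the critique is empty or a round limit is hit.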
Devil's Advocate and Steelman
Two related patterns address the tendency of multi-agent systems to converge prematurely on consensus. A Devil's Advocate agent is instructed to argue against the current leading answer, regardless of its private assessment — its role is to surface objections that other agents might suppress to maintain social harmony. A Steelman agent is given the weakest current hypothesis and instructed to construct the strongest possible case for it. Both patterns are most useful when a multi-agent system has converged quickly and the orchestrator suspects groupthink rather than genuine agreement.
Hierarchical Orchestration
In a flat multi-agent system, all agents communicate as peers. In a hierarchical system, an orchestrator sits above the workers: it receives the high-level task, decomposes it into subtasks, delegates each to a worker agent, monitors progress, and synthesises the results. This mirrors how human organisations operate — a project manager doesn't do the individual work, but is responsible for the overall outcome.
Orchestrator Design
The orchestrator's role demands a different set of capabilities than the workers it manages. It must be good at decomposition (breaking complex goals into independent subtasks), delegation (selecting the right worker for each subtask, with the right instructions), monitoring (recognising when a worker has failed or produced unusable output), and synthesis (combining heterogeneous outputs into a coherent result). These are largely planning and reasoning tasks — which is why orchestrators are typically assigned to the most capable model in the system, even if that makes them the most expensive per-call.
A common failure mode is the orchestrator becoming a bottleneck. If every worker result must pass through the orchestrator before the next step can begin, parallelism is limited to the fan-out phase. The solution is to design the workflow so the orchestrator is only activated at coordination boundaries — after the parallel phase, before the synthesis phase — while workers run independently in between.
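The fan-out/fan-in shape can be sketched directly with a thread pool: the orchestrator is active only at the coordination boundaries, before and after the parallel phase, while workers run concurrently in between (the worker body here is a stub for a full agent loop):

```python
# Fan-out/fan-in: decompose, run workers in parallel, synthesise at the end.
from concurrent.futures import ThreadPoolExecutor

def run_parallel_phase(subtasks, worker, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, subtasks))   # preserves subtask order

def worker(subtask):
    return f"result({subtask})"                   # stand-in for an agent run

subtasks = ["research A", "research B", "research C"]
results = run_parallel_phase(subtasks, worker)
summary = " | ".join(results)   # orchestrator synthesises only after fan-in
```

Since LLM worker calls are I/O-bound, threads (or async calls) are enough to get real wall-clock parallelism without multiprocessing.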
Nested Hierarchies
Complex tasks may warrant hierarchies more than two levels deep. A top-level orchestrator decomposes the task into research, coding, and writing subtasks. Each subtask has its own sub-orchestrator that further decomposes: the research sub-orchestrator assigns individual topics to search agents; the coding sub-orchestrator assigns modules to coding agents. Nesting enables fine-grained specialisation, but each additional layer adds latency, cost (orchestrator calls at each level), and coordination complexity. In practice, two levels — one orchestrator, N workers — is sufficient for the vast majority of tasks. Three or more levels are reserved for genuinely complex, long-running workflows.
Worker Autonomy
How much should workers deviate from instructions? A fully compliant worker that executes instructions literally will fail when instructions are ambiguous or subtly wrong. A fully autonomous worker that reinterprets instructions freely introduces unpredictability. The right calibration depends on the task: for execution tasks (file writes, API calls) high compliance is essential; for reasoning tasks (analysis, planning) some interpretive latitude produces better results. Many frameworks expose this calibration as max_retries and allow_replanning settings at the worker level.
Swarm Behaviour
Swarm systems are the opposite of hierarchical orchestration. There is no central coordinator; agents follow simple local rules and interact only with their immediate neighbours or a shared medium. Global behaviour — exploration, convergence, task completion — emerges from these local interactions without any agent having a global view of the system state.
The biological inspiration (ant colonies, bird flocking, bee foraging) is well-studied, but the translation to LLM agents is recent and still exploratory. Several principles from swarm intelligence map usefully onto agent systems: stigmergy — indirect coordination through environmental modification (one agent writes search results to a shared store; subsequent agents are attracted to the same area of the search space); positive feedback — agents reinforce successful paths (a sub-topic that multiple agents independently find useful gets more exploration); and negative feedback — agents avoid redundant work (a completed subtask is marked done on the blackboard, preventing re-execution).
LLM Swarms in Practice
The most practical application of swarm principles to LLM agents is massively parallel exploration. Rather than having a single agent explore a search space sequentially (suffering from path dependence and local optima), launch N agents simultaneously with diverse starting points and let them independently explore. Each agent deposits its findings to a shared store. A lightweight aggregator periodically surveys the store and identifies which areas have produced useful findings, steering subsequent agents toward promising regions while discarding barren ones.
This pattern has been applied to hyperparameter search, literature review (spawning agents to explore different topic areas simultaneously), and code refactoring (separate agents working on independent modules in parallel). The key engineering challenge is designing the shared medium so that agents can benefit from each other's discoveries without being distracted by irrelevant noise — in practice, this usually means the aggregator maintains a filtered summary rather than exposing the raw deposit store to agents.
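A minimal version of the deposit store plus filtering aggregator looks like this (the usefulness scores are assumed to be supplied by the depositing agents; the threshold is illustrative):

```python
# Swarm exploration sketch: agents deposit (finding, score) pairs per region;
# the aggregator exposes only regions whose mean usefulness clears a bar.
from collections import defaultdict

class DepositStore:
    def __init__(self):
        self.deposits = defaultdict(list)   # region -> [(finding, score)]

    def deposit(self, region, finding, score):
        self.deposits[region].append((finding, score))

    def promising_regions(self, min_mean_score=0.5):
        """Filtered aggregator view: steer agents here, not at the raw store."""
        out = {}
        for region, items in self.deposits.items():
            mean = sum(s for _, s in items) / len(items)
            if mean >= min_mean_score:
                out[region] = mean
        return out

store = DepositStore()
store.deposit("topic-A", "key survey found", 0.9)
store.deposit("topic-A", "useful dataset", 0.7)
store.deposit("topic-B", "dead end", 0.1)
# promising_regions() keeps topic-A and discards topic-B
```

The positive-feedback loop of stigmergy comes from routing new agents toward `promising_regions()`; the filtering is what keeps them from drowning in raw deposits.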
Emergent Collaboration
Emergent collaboration refers to productive coordination that was not explicitly programmed — behaviours that arise from the interaction of agents following simple rules in a shared environment. The concept comes from complexity theory, where it describes how macro-level patterns arise from micro-level interactions without central design.
In LLM multi-agent systems, emergence has been observed in several forms. Park et al.'s Generative Agents (2023) — a simulation of 25 LLM-powered characters in a virtual town — produced spontaneous social behaviours including gossip propagation, emergent party planning across agents who had never directly communicated, and the formation of relationships that neither the researchers nor the agents had explicitly coordinated. These behaviours arose purely from agents following simple rules: perceive, plan, act, remember.
Cooperation Without Explicit Coordination
A particularly striking finding from the Generative Agents work is that agents develop implicit social norms — consistent patterns of behaviour in repeated interactions — without any mechanism for explicit norm agreement. Agents who interact frequently develop stable relationship patterns; agents who share common acquaintances develop shared mental models of the social environment. This mirrors findings from multi-agent reinforcement learning, where agents learn to coordinate through experience rather than communication protocols.
For practical agent system designers, the implication is that some coordination can be achieved through environment design rather than explicit protocol specification. If agents share a structured environment with clear affordances — a shared document, a task queue with visible state, a common memory store — useful coordination patterns may emerge without requiring the designer to specify every interaction. This is not reliable enough to substitute for explicit coordination in high-stakes tasks, but it can reduce the engineering burden in exploratory or low-stakes systems.
Game-Theoretic Perspective
Multi-agent interactions can be analysed through game theory. When agents have aligned objectives (all trying to complete the task correctly), they are in a cooperative game — the challenge is efficient coordination rather than strategic behaviour. When agents have divergent objectives (e.g., different agents instructed to advocate for different positions), they are in a mixed-motive game. In debate settings, the Nash equilibrium of the debate game — where no agent benefits from changing its position given the others' positions — corresponds to genuine consensus rather than sycophantic convergence, making game-theoretic analysis a useful tool for understanding when debate protocols will work.
Communication Topology
The structure of the communication network — who can talk to whom — shapes what a multi-agent system can compute, how efficiently it coordinates, and what failure modes it is vulnerable to. Not all topologies are equally suited to all tasks.
Topology and Failure Propagation
Topology determines how errors propagate. In a pipeline, a hallucination in step 2 is injected into step 3's input — subsequent agents may correct it, but they may also elaborate on it, amplifying the error. In a fully connected system, an agent's error is visible to all peers immediately, enabling rapid correction — but also rapid spread if peers do not critically evaluate what they receive. In a hierarchical system, errors in worker agents are filtered by the orchestrator, which acts as a quality gate — effective if the orchestrator is competent, catastrophic if it is not.
A useful heuristic: topologies that provide short feedback cycles (where an agent's output is evaluated quickly by another agent) are more robust to error propagation than topologies with long linear chains. The price of short feedback cycles is coordination overhead and potentially slower forward progress.
Consensus & Voting
When multiple agents produce answers to the same question, the system needs a mechanism to aggregate those answers into a single output. The choice of aggregation mechanism significantly affects both accuracy and calibration.
Self-Consistency (Majority Vote)
Self-consistency (Wang et al., 2022) samples multiple independent reasoning chains from the same model (or different models) and selects the most common final answer by majority vote. It is the simplest multi-agent aggregation scheme and one of the most effective: across arithmetic, commonsense, and symbolic reasoning benchmarks, self-consistency over 10–40 samples improves accuracy by 5–20 percentage points over greedy decoding. The mechanism works because reasoning errors are typically inconsistent across samples, while correct reasoning converges.
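The aggregation step is a few lines: sample k chains, keep only the final answers, and majority-vote (`sample_answer` below stands in for one full chain-of-thought sample from the model):

```python
# Self-consistency: majority vote over k independently sampled final answers.
from collections import Counter

def self_consistency(sample_answer, k=10):
    answers = [sample_answer() for _ in range(k)]
    [(winner, count)] = Counter(answers).most_common(1)
    return winner, count / k        # answer plus its vote share

# Toy model: answers 42 in seven of ten samples, 41 otherwise.
samples = iter([42, 42, 41, 42, 42, 41, 42, 42, 41, 42])
answer, share = self_consistency(lambda: next(samples), k=10)
# answer == 42, share == 0.7
```

The vote share is a crude but useful confidence signal: a 7/10 majority warrants more trust than a 4/10 plurality, even though both produce an answer.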
Weighted Voting
Majority vote treats all agents equally. Weighted voting assigns higher influence to agents based on their estimated reliability — historical accuracy on similar tasks, confidence scores, or the coherence of their reasoning chain. A Verifier agent that has confirmed its answer against a database should receive higher weight than a Proposer whose output has not been checked. In practice, reliable per-agent reliability estimates are difficult to obtain, and weighted voting requires careful calibration to avoid amplifying the biases of whichever agent is overweighted.
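Mechanically, weighted voting is a one-function change from majority vote; the hard part, as noted, is obtaining the weights. A sketch with hand-supplied reliability estimates:

```python
# Weighted vote: each ballot counts proportionally to its agent's estimated
# reliability. Weights here are illustrative, not derived.
from collections import defaultdict

def weighted_vote(ballots):
    """ballots: list of (answer, weight). Returns the highest-weight answer."""
    totals = defaultdict(float)
    for answer, weight in ballots:
        totals[answer] += weight
    return max(totals, key=totals.get)

ballots = [("A", 0.9),   # Verifier: answer checked against a database
           ("B", 0.4),   # Proposer: unchecked
           ("B", 0.4)]   # Proposer: unchecked
# weighted_vote(ballots) -> "A"  (total 0.9 beats 0.8 despite fewer votes)
```

Note the overweighting risk in miniature: a single miscalibrated 0.9 weight lets one agent outvote two others.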
LLM-as-Judge
Rather than aggregating by vote, an LLM judge evaluates all candidate answers and selects the best — or synthesises them into a combined answer. LLM-as-judge (Zheng et al., 2023) is well-suited to tasks where correctness is nuanced and hard to express as a majority: long-form writing, complex plans, subjective evaluations. The failure mode is positional bias — LLM judges systematically prefer answers that appear earlier in the input, or answers that are longer, regardless of quality. Mitigations include randomising the order of candidates across repeated evaluations and using multiple judges.
Confidence Calibration and Abstention
A well-designed multi-agent system should know when it does not know. If agents consistently disagree and no consensus emerges after multiple rounds, this is a signal that the task is genuinely hard — the system should abstain or escalate to human review rather than forcing a low-confidence consensus. Calibrated abstention is a form of epistemic humility that single-agent systems rarely exhibit naturally; the disagreement signal from a multi-agent debate provides a principled trigger for it.
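The abstention trigger can be as simple as an agreement threshold over the debaters' final answers (the 0.6 threshold below is illustrative and should be tuned per task):

```python
# Calibrated abstention: escalate when no answer reaches an agreement
# threshold, instead of forcing a low-confidence consensus.
from collections import Counter

def decide_or_abstain(final_answers, threshold=0.6):
    [(winner, count)] = Counter(final_answers).most_common(1)
    agreement = count / len(final_answers)
    if agreement >= threshold:
        return ("answer", winner)
    return ("abstain", agreement)    # escalate to human review

decide_or_abstain(["A", "A", "A", "B"])   # -> ("answer", "A")
decide_or_abstain(["A", "B", "C", "D"])   # -> ("abstain", 0.25)
```

The same check applied after each debate round also serves as an early-exit condition: stop debating once agreement clears the bar.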
Failure Modes
Multi-agent systems introduce failure modes that do not exist in single-agent settings. Some are amplifications of single-agent problems; others are genuinely novel properties of interacting systems.
- **Sycophantic consensus**: Agents agree with each other not because the answer is correct but because disagreement is socially costly. LLMs trained on human feedback learn that agreement is rewarded; in a multi-agent debate, this manifests as premature consensus around the first confident-sounding answer. Mitigated by explicit adversarial role assignments (Critic, Devil's Advocate) and critique rubrics that reward disagreement.
- **Echo chambers**: A false belief introduced by one agent is repeated by others who read its output, each repetition lending it more apparent authority. The downstream synthesis agent sees the claim appearing in multiple "independent" sources and treats it as well-established. Mitigated by ensuring agents have genuinely independent information sources, not just independent reasoning over shared context.
- **Coordination overhead**: As the number of agents grows, the fraction of compute spent on coordination (routing messages, summarising outputs, resolving conflicts) can exceed the fraction spent on actual task work. A system with 20 agents producing 10% more accuracy than a 5-agent system at 4× the coordination cost is often not worth the trade. Mitigated by careful topology design and regularly benchmarking system efficiency.
- **Hallucination cascades**: An early-stage agent hallucinates a fact; downstream agents incorporate it into their reasoning without verification; the synthesiser treats the repeated (false) claim as corroborated. The pipeline amplifies rather than corrects the error. Mitigated by inserting Verifier agents at boundaries and by never allowing agents to cite other agents' outputs as ground truth.
- **Diffuse responsibility**: When a multi-agent system produces a harmful or incorrect output, no single agent is clearly responsible — each followed its instructions, and the failure was a property of the interaction. This makes debugging harder and raises accountability questions for high-stakes applications. Mitigated by keeping detailed execution traces and designing clear ownership: one agent signs off on the final output.
- **Injection propagation**: An adversarial instruction injected into one agent's context (via a malicious tool result or document) can propagate through the system if other agents are instructed to incorporate the output of the compromised agent. A single injection can corrupt an entire pipeline. Mitigated by sandboxing agents' ability to pass arbitrary instructions to peers and by validating tool results before incorporation.
Termination and Infinite Loops
Multi-agent systems with cyclic communication topologies — ring networks, debate loops, critic-reviser cycles — can fail to terminate. An agent convinced its output is wrong will revise; the revised output may also be critiqued; the critic may itself be critiqued. Without explicit termination conditions (maximum rounds, convergence threshold, token budget), the system runs indefinitely. Termination logic must be designed at the system level, not delegated to individual agents — no agent has enough global context to reliably judge when the system as a whole should stop.
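A system-level termination wrapper for a critic-reviser cycle combines all three conditions, whichever fires first (the agent callables are stubs; the budgets are illustrative):

```python
# System-level termination: stop on approval, on convergence (revision
# changed nothing), on a token budget, or on a hard round cap.
def run_until_done(draft, critic, reviser, max_rounds=5, token_budget=10_000):
    tokens = 0
    for round_no in range(1, max_rounds + 1):
        critique = critic(draft)
        tokens += len(draft.split()) + len(critique.split())  # crude accounting
        if critique == "APPROVED" or tokens > token_budget:
            return draft, round_no
        new_draft = reviser(draft, critique)
        if new_draft == draft:          # converged: revision changed nothing
            return draft, round_no
        draft = new_draft
    return draft, max_rounds            # hard cap reached

# Toy agents: the critic approves once the draft mentions "sources".
critic  = lambda d: "APPROVED" if "sources" in d else "cite your sources"
reviser = lambda d, c: d + " (with sources)"
final, rounds = run_until_done("initial claim", critic, reviser)
# terminates in round 2 with "initial claim (with sources)"
```

Crucially the loop, budgets, and convergence check live in the harness, not in any agent's prompt, matching the point above that no individual agent can judge global termination.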
Benchmarks & Evaluation
Evaluating multi-agent systems is harder than evaluating single agents. The system's behaviour depends on the interaction between agents, not just the capability of each individual — a system of weaker models may outperform a single stronger model on tasks that benefit from diverse perspectives, making it misleading to benchmark components in isolation.
| Benchmark | Focus | Key Metric | MAS-Specific |
|---|---|---|---|
| MMLU (multi-agent) | Knowledge QA with debate | Accuracy vs. single-agent baseline | Measures debate gain |
| GSM8K + self-consistency | Mathematical reasoning | Accuracy at k samples | Sample efficiency curve |
| ChatEval | Open-ended dialogue evaluation | Human preference vs. single evaluator | Multi-evaluator agreement |
| AgentBench | Agentic task completion | Success rate across 8 environments | Coordination overhead |
| Smallville (Generative Agents) | Social simulation | Emergent behaviour fidelity | Inter-agent coherence |
| CoEval | Collaborative code writing | Functional correctness, test pass rate | Role division quality |
What Single-Agent Benchmarks Miss
Standard benchmarks evaluate terminal accuracy — whether the final answer is correct. They do not measure coordination efficiency (tokens spent per correct answer across the whole system), diversity of reasoning (whether agents genuinely explored different paths or just paraphrased the same approach), or graceful failure (whether the system correctly identifies when it should abstain). Purpose-built multi-agent evaluation suites are emerging but remain less standardised than single-agent benchmarks — a gap that makes it difficult to compare systems across papers.
Frontier & Open Problems
Multi-agent systems with LLMs are one of the most active areas of agent research in 2025, with several fundamental questions still open.
Optimal Team Size and Composition
How many agents are enough? The empirical answer varies dramatically by task. Du et al. found diminishing returns in debate beyond four agents; other work finds gains continuing to ten or more for certain search tasks. A theory of optimal team composition — predicting which combination of roles and capabilities maximises performance on a given task class — does not yet exist. Current practice is empirical: try small teams, measure gain, expand if the cost is justified.
Heterogeneous Model Teams
Most multi-agent research uses identical models for all agents. Teams of heterogeneous models — a frontier model as orchestrator, smaller specialist models as workers — are increasingly practical as APIs become cheaper and latency requirements vary by role. The open question is how to optimally assign models to roles: not just "capable model for hard tasks" (which is obvious) but the fine-grained matching of model strengths (code-tuned, instruction-following, retrieval-augmented) to specific agent functions.
Trust and Verification Between Agents
In a human organisation, trust is built over repeated interactions and backed by institutional accountability. In LLM agent systems, an agent has no reliable way to verify the trustworthiness of a message from another agent — the message could be hallucinated, prompt-injected, or from a compromised agent. Designing trust mechanisms for inter-agent communication — lightweight cryptographic attestation, reputation systems based on verified past performance, sandboxed execution of agent outputs before incorporation — is an active research area without a clear solution.
Emergent Deception and Misalignment
As multi-agent systems become more capable and interact over longer horizons, the possibility of emergent deception — agents developing strategies to mislead other agents in pursuit of their assigned objectives — becomes more than theoretical. Hagendorff (2024) documented cases where LLMs spontaneously use deceptive strategies in multi-agent game settings. Understanding when and why this occurs, and how to design multi-agent systems that are robust to it, is one of the deeper open problems at the intersection of AI safety and multi-agent systems research.
The most ambitious vision for multi-agent systems is not teams of five or ten agents, but communities of thousands — each specialised, each persistent, each maintaining relationships with other agents over time. Whether such systems exhibit qualitatively different capabilities or merely quantitative scaling of existing ones is unknown. What is clear is that the infrastructure, coordination protocols, and safety mechanisms needed to operate at that scale do not yet exist. The next few years will determine which parts of that vision are realistically achievable.
Further Reading
- *Improving Factuality and Reasoning in Language Models through Multiagent Debate* (Du et al., 2023). The foundational paper on LLM debate as a technique for improving reasoning accuracy. Demonstrates that multi-round debate between multiple LLMs consistently outperforms single-model sampling on mathematical and strategic reasoning tasks. Essential reading for anyone implementing debate or critique patterns.
- *Generative Agents: Interactive Simulacra of Human Behavior* (Park et al., 2023). The Smallville paper: 25 LLM-powered agents in a virtual town produce emergent social behaviours including relationship formation, gossip propagation, and spontaneous event coordination. The most compelling demonstration of emergent collaboration in LLM multi-agent systems to date.
- *Self-Consistency Improves Chain of Thought Reasoning in Language Models* (Wang et al., 2022). Introduces self-consistency: sample diverse reasoning paths and majority-vote the answer. Shows consistent, large accuracy gains across arithmetic, commonsense, and symbolic reasoning benchmarks. The simplest and most reliable multi-agent aggregation technique, worth understanding before anything more complex.
- *ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate* (Chan et al., 2023). Demonstrates that multi-agent debate produces better evaluation of open-ended text than single LLM judges, with higher inter-evaluator agreement and stronger correlation to human judgement. Important for teams using LLM-based evaluation in their agent pipelines.
- *AgentBench: Evaluating LLMs as Agents* (Liu et al., 2023). A comprehensive benchmark across eight environments (operating systems, databases, games, web browsing) for evaluating agentic task completion. The most widely used baseline for comparing multi-agent against single-agent architectures on practical, diverse tasks.
- *Theory of Mind May Have Spontaneously Emerged in Large Language Models* (Kosinski, 2023). Documents evidence that large LLMs exhibit theory-of-mind capabilities — modelling other agents' beliefs and intentions — a prerequisite for sophisticated multi-agent coordination and essential background for the emergent deception concerns discussed in the frontier section. Important for understanding what individual agents can and cannot infer about their peers.