Agent Frameworks & Infrastructure: the plumbing that keeps agents running reliably.

Writing a ReAct loop from scratch takes an afternoon. Keeping it reliable, observable, resumable, and affordable across months of production traffic takes an infrastructure stack. This chapter surveys the orchestration frameworks — LangGraph, AutoGen, CrewAI, the Claude Agent SDK — that handle the plumbing, then digs into the architectural patterns, state management strategies, observability tooling, and cost controls that separate a demo agent from a deployed one.

Prerequisites

This chapter assumes familiarity with the ReAct loop and tool-calling mechanics covered in LLM-Based Agents (Ch 02) and Tool Use & Function Calling (Ch 05). Basic Python literacy is assumed for the code examples. The multi-agent patterns here foreshadow the deeper treatment in Multi-Agent Systems (Ch 08).

Why Frameworks?

Motivation · Abstraction Layers

The minimal implementation of an agent is a while-loop: call the LLM, check whether it wants to use a tool, execute the tool, append the result, repeat until the model signals completion. This fits in forty lines of Python and runs correctly for simple, fast, single-turn tasks. The problems emerge when you leave the happy path.
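That loop can be sketched in a few lines; `call_llm` and `run_tool` are stand-ins for a real model client and tool dispatcher (both names are illustrative, not any particular SDK's API):

```python
def run_agent(task, call_llm, run_tool, max_turns=10):
    """Minimal tool-calling loop: call the model, execute any requested
    tool, append the result, repeat until the model signals completion."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_llm(messages)          # {"content": str, "tool_call": dict | None}
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply.get("tool_call") is None:  # no tool requested: done
            return reply["content"]
        result = run_tool(reply["tool_call"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("max turns exceeded")
```

Everything the rest of this chapter discusses — resumption, budgets, retries, tracing — is what this sketch lacks.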

Production agents face a different class of concerns. A task that runs for minutes or hours must be resumable — if the process crashes halfway through, replaying from the start is often unacceptable. Long-running workflows accumulate costs; without budgeting primitives the bill is unpredictable. Tool calls fail, APIs time out, and LLM outputs don't always parse; robust retry and fallback logic is tedious to write correctly from scratch. Debugging a misbehaving agent is nearly impossible without structured logs of every LLM call, tool invocation, and state transition. And when you want to run several agents in parallel or have them collaborate, the coordination logic dwarfs the agent logic itself.

Frameworks solve these problems at the abstraction level rather than the application level. They provide the scaffolding — state containers, execution engines, persistence backends, tracing hooks — so application code can focus on what the agent should do, not on the mechanics of keeping it running reliably.

The Cost of Abstraction

Frameworks impose opinions. LangGraph's graph model is powerful for cyclic workflows but overkill for a linear three-step pipeline. AutoGen's conversation metaphor fits collaborative tasks but adds overhead for solo agents. The right question is not "which framework is best" but "which abstraction matches my workflow's shape." Many production teams end up using a framework for orchestration while keeping the core agent logic as thin, testable plain Python.

The Infrastructure Stack

A mature agent system has layers: the model layer (LLM API calls, prompt construction, response parsing), the agent layer (the reasoning loop, tool dispatch, memory access), the orchestration layer (how agents are composed, sequenced, or run in parallel), the persistence layer (state storage, checkpoints, conversation history), and the observability layer (traces, metrics, cost accounting). Frameworks vary in how many of these layers they address — some provide all five, others focus on orchestration and leave the rest to the developer.

Framework Landscape

Survey · Comparison

The agent framework ecosystem evolved rapidly from 2023 onward. Early entrants like LangChain established conventions (chains, agents, tools) that later frameworks either adopted or deliberately rejected. The mature landscape as of 2025 features several distinct philosophical positions.

LangGraph
Graph-based · Stateful · Cyclic workflows

Models agent execution as a directed graph where nodes are Python functions and edges are transitions. Supports cycles, conditional branching, and human-in-the-loop interrupts. Built-in checkpointing with SQLite and Redis backends. The lowest-level of the mainstream frameworks — you define the graph explicitly.

AutoGen
Conversation-first · Multi-agent · Microsoft Research

Abstracts agents as conversational participants who exchange messages. GroupChat coordinates multiple ConversableAgents with configurable speaker-selection policies. Strong support for code execution, human proxy agents, and nested conversations. v0.4 introduced an async event-driven runtime.

CrewAI
Role-based · Task delegation · High-level API

Organises agents into Crews with explicit roles, goals, and backstories. Tasks are assigned to agents; the framework handles sequential and hierarchical execution, inter-agent delegation, and shared memory. Designed for readability — a Crew definition reads almost like a job description document.

Claude Agent SDK
Subagent model · MCP-native · Anthropic

Provides primitives for building agents that spawn subagents, use tools via the Model Context Protocol, and participate in multi-agent orchestration. Emphasises safety constraints and controlled delegation. Integrates natively with Claude's extended thinking and tool-use capabilities.

Dimension      | LangGraph           | AutoGen             | CrewAI             | Claude SDK
Abstraction    | State graph         | Conversation        | Crew/Role          | Subagent tree
Checkpointing  | Built-in            | Partial             | Limited            | Via SDK hooks
Multi-agent    | Explicit graph      | GroupChat           | Crew hierarchy     | Subagent spawn
Human-in-loop  | Interrupt nodes     | Human proxy         | Callback hooks     | Pause/resume
Observability  | LangSmith           | AutoGen Studio      | AgentOps / custom  | Anthropic console
Model support  | Any (via LangChain) | Any (OpenAI compat) | Any                | Claude-only
Learning curve | Moderate–High       | Moderate            | Low–Moderate       | Low (Claude-native)

LangGraph: State Machines for Agents

Framework · Graph Execution

LangGraph, released by LangChain Inc. in early 2024, reframes agent execution as a finite-state machine encoded as a directed graph. The core insight is that many agent failure modes — infinite loops, lost context, unrecoverable errors — stem from the implicit, unstructured nature of while-loop agents. Making the state machine explicit forces the developer to think about every possible transition, and gives the runtime enough information to checkpoint, resume, and debug reliably.

Core Primitives

A LangGraph application has three components. The state schema is a typed dictionary (using Python's TypedDict or Pydantic) defining the shared data that flows through the graph — messages, intermediate results, flags, counters. Every node reads from and writes to this state. Nodes are ordinary Python functions (or async coroutines) that take the current state and return a partial update. Edges connect nodes; conditional edges inspect the state to decide which node to visit next, enabling branching and looping.

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.sqlite import SqliteSaver
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # reducer: append, don't replace
    tool_calls_remaining: int
    final_answer: str | None

def call_model(state: AgentState) -> dict:
    response = llm.invoke(state["messages"])  # `llm` is a configured chat model
    return {"messages": [response]}

def should_continue(state: AgentState) -> str:
    last = state["messages"][-1]
    if last.tool_calls and state["tool_calls_remaining"] > 0:
        return "tools"
    return END

graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", tool_executor)  # `tool_executor` runs requested tools, decrements the counter
graph.add_conditional_edges("agent", should_continue)
graph.add_edge("tools", "agent")  # loop back after tool execution
graph.set_entry_point("agent")

app = graph.compile(checkpointer=SqliteSaver.from_conn_string(":memory:"))

Checkpointing and Resumption

The checkpointer argument serialises the full state after every node execution. If a run is interrupted — by a process crash, a timeout, or a deliberate human-in-the-loop pause — it can be resumed from the last checkpoint by passing the same thread_id. This is the feature that most distinguishes LangGraph from homegrown loops: durable execution without requiring an external workflow engine like Temporal or Celery.

LangGraph ships SQLite and Redis checkpointers out of the box. For long-running or distributed agents, Redis is preferred: it supports concurrent readers, TTL-based expiry, and horizontal scaling. The checkpoint format is a JSON-serialisable snapshot of the state dict plus the current node pointer and the pending edge queue.

Human-in-the-Loop

Interrupt nodes pause execution before or after a specified node and surface the current state to an external caller — typically a UI or approval workflow. The application resumes when the caller invokes app.update_state() with the modified state. This pattern is used for high-stakes actions (file deletions, financial transactions, customer communications) where an automated step needs human sign-off before proceeding.

AutoGen: Conversational Multi-Agent

Framework · Message Passing

AutoGen, from Microsoft Research (Wu et al., 2023), takes the position that agent collaboration is most naturally modelled as conversation. Each agent is a participant that sends and receives messages; the framework handles routing, turn-taking, and termination. This metaphor maps naturally onto tasks that involve negotiation, review cycles, or the kind of back-and-forth that humans use to refine ideas.

ConversableAgent

The central class is ConversableAgent, which wraps an LLM (or a human, or a code executor) and exposes a unified messaging interface. A UserProxyAgent represents a human or automated tester; an AssistantAgent wraps an LLM with optional tool access. The simplest pattern is a two-agent conversation where the UserProxy initiates a task and the Assistant iterates until the UserProxy's is_termination_msg function returns true.

from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="coder",
    llm_config={"model": "claude-opus-4-6", "temperature": 0},
    system_message="Write Python code to solve the task. Reply TERMINATE when done.",
)
user_proxy = UserProxyAgent(
    name="executor",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding", "use_docker": True},
    is_termination_msg=lambda m: "TERMINATE" in m.get("content", ""),
)
user_proxy.initiate_chat(assistant, message="Plot the S&P 500 return distribution.")

GroupChat

GroupChat extends the two-agent pattern to N agents with a configurable speaker-selection policy. The auto policy uses an LLM to decide which agent should speak next based on the conversation history — effectively an LLM-as-moderator. The round-robin policy cycles through agents deterministically. The random policy is useful for simulating markets or social dynamics. Custom policies are plain Python functions that take the chat history and return the next speaker's name.
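In the spirit the text describes, a custom policy is just a function over the history; the sketch below is framework-free, and the agent names and message shape are illustrative rather than AutoGen's actual API:

```python
def alternate_with_critic(history, agents=("planner", "critic", "summariser")):
    """Toy speaker-selection policy: the critic always follows the planner;
    otherwise fall back to deterministic round-robin over the agent list."""
    if not history:
        return agents[0]
    last = history[-1]["name"]
    if last == "planner":
        return "critic"                     # every plan gets challenged
    idx = agents.index(last) if last in agents else -1
    return agents[(idx + 1) % len(agents)]  # rotate to the next agent
```

Encoding the review cycle in the policy, rather than hoping the LLM moderator enforces it, makes the turn order testable.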

GroupChat is well-suited to workflows that benefit from diverse perspectives: a Critic that challenges the Planner's proposals, a Fact-Checker that queries external sources, a Summariser that condenses long threads before passing them on. The failure mode is verbose, circular conversations where agents agree with each other rather than making progress — mitigated by explicit termination conditions and turn limits.

AutoGen v0.4 and the Event-Driven Runtime

AutoGen's 0.4 release introduced a significantly redesigned runtime based on asynchronous message passing. Agents become actors in an actor model: they register message handlers and communicate exclusively via typed messages, with no shared mutable state. This enables genuinely concurrent multi-agent execution and makes the system easier to test and distribute. The older "conversational" API remains available but is increasingly positioned as a high-level convenience wrapper over the new runtime.

CrewAI: Role-Based Orchestration

Framework · Declarative Agent Definition

CrewAI (Moura, 2023) prioritises legibility. The core bet is that the most important thing about a multi-agent system is understanding who does what and why — and that this should be readable without tracing through code. A CrewAI application is a declaration of agents (with roles, goals, and backstories), tasks (with descriptions and expected outputs), and the crew that binds them together.

from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, WebsiteSearchTool

researcher = Agent(
    role="Senior Research Analyst",
    goal="Find accurate, up-to-date information on {topic}",
    backstory="Expert at sifting signal from noise in academic literature.",
    tools=[SerperDevTool(), WebsiteSearchTool()],
    verbose=True,
)
writer = Agent(
    role="Technical Writer",
    goal="Distil research into clear, accurate prose for a general audience",
    backstory="Turns jargon into insight without losing precision.",
)
research_task = Task(
    description="Research the current state of {topic} and list key findings.",
    expected_output="Bullet-point summary with citations",
    agent=researcher,
)
write_task = Task(
    description="Write a 500-word explainer based on the research.",
    expected_output="Polished article draft",
    agent=writer,
    context=[research_task],  # receives the researcher's output automatically
)
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,
    verbose=2,
)
result = crew.kickoff(inputs={"topic": "protein structure prediction"})

Execution Processes

CrewAI supports three process modes. Sequential executes tasks in order, passing each task's output as context to the next — the default and the simplest to reason about. Hierarchical adds a manager agent that decomposes the overall goal, delegates sub-tasks to crew members, and synthesises results; the manager is typically a more capable model than the workers. Consensual (experimental) requires agents to agree on outputs before proceeding — useful when correctness matters more than speed.

Memory and Tool Sharing

CrewAI's memory system has three tiers: short-term (the current run's conversation), long-term (a SQLite database persisting facts across runs), and entity memory (a structured store of named entities encountered during tasks). Tools can be shared across agents in a crew or assigned to specific roles. The framework handles tool invocation, result passing, and retry automatically, with configurable max-delegation limits to prevent runaway agent loops.

Claude Agent SDK

Framework · Subagent Model · MCP Integration

Anthropic's Claude Agent SDK approaches the problem from a different angle than LangGraph or AutoGen. Rather than providing a general-purpose orchestration engine, it provides the primitives needed to build agents that are native to Claude's capabilities: extended thinking, the Model Context Protocol, subagent delegation, and Anthropic's safety infrastructure.

The Subagent Model

The SDK's primary abstraction is the subagent: an agent invocation that runs as a tool call within a parent agent's context. A parent agent can spawn multiple subagents for parallel subtasks, each with its own tool set, system prompt, and context window. Subagent results are returned to the parent as tool results, which the parent uses to synthesise a final answer. This creates a tree of agent invocations that maps naturally onto hierarchical task decomposition.

# Parent agent spawns a subagent via the SDK's Task tool
"""
Parent system prompt:
You are a research coordinator. For large research tasks, use the Task tool
to spawn a subagent that will handle deep-dive research while you coordinate.
"""
# The SDK wires up the Task tool automatically; the parent calls:
{
    "name": "Task",
    "input": {
        "description": "Research transformer attention variants (2021–2025)",
        "prompt": "Find and summarise the five most-cited attention mechanism papers..."
    }
}
# Subagent runs with WebSearch + WebFetch tools; returns a structured result

MCP Integration

The Model Context Protocol (MCP) is an open standard, co-developed by Anthropic, for connecting LLMs to external tools and data sources. The SDK makes MCP a first-class citizen: any MCP server can be attached to an agent as a tool provider, and the agent receives the server's tool manifest automatically without manual schema writing. This enables a plug-in architecture where capabilities (Slack, GitHub, databases, custom APIs) are attached to agents via MCP servers, rather than being hardcoded into the agent's tool list.

Safety Constraints

The SDK exposes Claude's native safety mechanisms at the framework level. Operators can set computer use constraints (which domains a browser agent may visit), file system boundaries (which directories an agent may read and write), and tool permission levels (read-only vs. read-write vs. destructive). These constraints are enforced at the model level — they cannot be circumvented by prompt injection — and propagate automatically to subagents spawned during a run.

Orchestration Patterns

Architecture · Composition

Independent of which framework you use, agent workflows fall into a small set of structural patterns. Recognising these patterns is more useful than knowing any particular framework's API: frameworks come and go, but the patterns describe what your code is actually doing.

SEQUENTIAL
A → B → C → done. Each step receives the previous step's output as input. Simple to reason about, easy to debug, but any step's latency adds directly to total latency. Use when steps have data dependencies and the total step count is small. Typical in research-then-write pipelines, multi-stage data transformation, and linear approval chains.
PARALLEL
A fires B, C, D simultaneously; waits for all; synthesises. Eliminates latency for independent subtasks. The fan-out step distributes work; the fan-in step aggregates. Effective latency is the slowest parallel branch, not the sum. Use for independent research threads, multi-source data fetching, and ensemble generation. Requires that subtasks genuinely have no data dependencies between them.
CONDITIONAL
After A, inspect state: if X go to B, else go to C. Implements branching — routing, classification-based dispatch, escalation logic. The condition can be a deterministic rule (e.g. "error count > 3 → escalate") or an LLM judgement ("is this task within scope?"). LangGraph's conditional edges express this natively. Prone to subtle bugs when conditions overlap or are mutually exclusive only in theory.
MAP-REDUCE
Split input into N chunks, process each independently, merge outputs. The agent equivalent of distributed data processing. Used for long-document summarisation (split by section), bulk entity extraction (split by record), and multi-file code review (split by file). The merge step is the critical design challenge: naive concatenation loses coherence, so the reducer is usually itself an LLM call that synthesises across the mapped results.
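The map-reduce pattern above can be sketched with a thread pool; `summarise` and `synthesise` stand in for LLM calls, and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def map_reduce(document, summarise, synthesise, chunk_size=1000):
    """Split the input, process chunks in parallel, then merge with a final
    synthesis step rather than naive concatenation."""
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(summarise, chunks))  # map: independent per chunk
    return synthesise(partials)                       # reduce: usually an LLM call
```

The parallelism only pays off because the mapped calls have no data dependencies; the reducer is the single point where coherence is restored.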

Nested and Recursive Patterns

Real workflows combine these primitives. A hierarchical decompose-and-conquer pattern has an orchestrator that breaks a task into subtasks (conditional), runs each in parallel (parallel fan-out), and then synthesises results sequentially (sequential fold). Recursion appears when a subtask is itself complex enough to warrant the same decomposition — a fact that LangGraph handles elegantly via graph cycles and that the Claude SDK handles via subagent spawning.

[Figure: an Orchestrator decomposes the task and fans out to Subagent A (research thread 1), Subagent B (research thread 2), and Subagent C (data gathering); a Synthesiser fans their results back in and produces the final output.]
Parallel fan-out / fan-in: the orchestrator decomposes the task into independent subagent threads, waits for all results, then synthesises.

The Evaluator-Optimizer Loop

A particularly useful pattern — described in Anthropic's agent design guide — adds an evaluator node after the primary agent output. The evaluator checks whether the output meets quality criteria and routes back to the generator if not, up to a maximum iteration budget. This turns a single-pass generator into a self-correcting system without requiring the generator to self-critique (which is less reliable). The evaluator can be a lightweight model, a deterministic check, or a separate LLM call with a focused prompt.
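A minimal sketch of that loop, with `generate` and `evaluate` as stand-ins for the generator and evaluator calls (all names illustrative):

```python
def generate_with_evaluator(task, generate, evaluate, max_iters=3):
    """Evaluator-optimiser loop: regenerate until the evaluator accepts the
    output or the iteration budget is exhausted. `generate` takes
    (task, feedback); `evaluate` returns (ok, feedback)."""
    feedback = None
    for _ in range(max_iters):
        draft = generate(task, feedback)   # feedback steers the next attempt
        ok, feedback = evaluate(draft)
        if ok:
            return draft
    return draft  # best effort after the budget runs out
```

Keeping the evaluator separate from the generator is the point: a focused pass/fail prompt is more reliable than asking the generator to critique itself.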

State Management & Checkpointing

Reliability · Persistence

Agent state is everything the system needs to resume a run from an arbitrary point: the conversation history, intermediate results, tool call records, counters, flags, and any application-specific data. Getting state management right is the difference between an agent that fails gracefully and one that loses work on every error.

What Needs to Be Persisted

A minimal checkpoint contains four things: the full message history (so the LLM can continue the conversation coherently), the current node or step identifier (so execution resumes at the right place), any out-of-band state variables (counters, accumulated results, flags), and the run metadata (start time, input parameters, configuration). In practice, serialising the entire state dict is easier and more robust than selective persistence, as long as the state is designed to be serialisable — no live network connections, no open file handles.
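Those four pieces can be bundled into one JSON round-trippable dict; a minimal sketch (function names are illustrative):

```python
import json

def make_checkpoint(messages, step, state_vars, run_meta):
    """The four checkpoint components listed above, in one serialisable dict."""
    return {
        "messages": messages,   # full history, so the LLM continues coherently
        "step": step,           # node/step identifier to resume at
        "state": state_vars,    # counters, flags, accumulated results
        "meta": run_meta,       # start time, input parameters, configuration
    }

def save_checkpoint(cp):
    return json.dumps(cp)       # fails loudly if state isn't serialisable

def load_checkpoint(blob):
    return json.loads(blob)
```

Serialising through JSON early in development surfaces non-serialisable state (open connections, file handles) before it becomes a production bug.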

Storage Backends

Backend | Best For | Tradeoffs
SQLite | Development, single-process agents | No concurrency; fast local reads; zero ops overhead
Redis | Production, low-latency, short-lived state | In-memory; TTL expiry; supports pub/sub for event-driven agents
PostgreSQL | Long-running tasks, audit trails, ACID requirements | Durable; queryable; slower writes than Redis
Object store (S3/GCS) | Large state blobs, archival, cross-region | High latency; ideal for checkpoints rather than frequent reads
Temporal / durable execution | Mission-critical, complex retry semantics | Full workflow engine; significant operational complexity

Idempotency and Exactly-Once Semantics

A resumed agent must not re-execute side effects that already happened. If a tool call sent an email or submitted an API request, re-running from a pre-tool checkpoint would duplicate the action. The standard solution is to record completed tool calls in the state with their results, and to skip re-execution of already-completed calls on resume. This is at-least-once execution with idempotency guards at the tool layer — not true exactly-once, which requires distributed transactions. For non-idempotent tools (anything that mutates external state), the tool implementation itself should check whether the action was already performed before executing.
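A sketch of that guard at the tool layer, assuming each call carries a stable `call_id` (all names illustrative):

```python
def execute_tool(state, call_id, tool_fn, args):
    """Idempotency guard: completed calls are recorded in the checkpointed
    state and skipped on resume, so a replayed run never repeats a side effect."""
    done = state.setdefault("completed_calls", {})
    if call_id in done:
        return done[call_id]      # replay: return the recorded result
    result = tool_fn(*args)
    done[call_id] = result        # record before the next checkpoint is taken
    return result
```

The guard is only as good as the `call_id`: it must be deterministic across retries (derived from the step and arguments), not a fresh UUID per attempt.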

Context Window Management

Long-running agents accumulate conversation history until it fills the context window. Frameworks handle this through three strategies: truncation (drop oldest messages, preserving system prompt and recent history), summarisation (call an LLM to compress old history into a summary message, then discard the raw messages), and rolling window (keep the last k turns verbatim, with a summary prefix for earlier history). Summarisation is more expensive but preserves information; truncation is free but risks losing critical earlier context. The right choice depends on whether the task requires long-range coherence.

Context Budget
\[ \text{tokens\_remaining}(t) = W - \bigl(\text{system} + \sum_{i=1}^{t} |\text{msg}_i|\bigr) \]
where \(W\) is the context window size. When tokens\_remaining falls below a threshold \(\tau\), trigger compression. Typical values: \(W = 200{,}000\) (Claude 3.x), \(\tau = 20{,}000\) (leave headroom for the next LLM response).
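The budget check transcribes directly into code (function names are illustrative):

```python
def tokens_remaining(window, system_tokens, message_tokens):
    """tokens_remaining(t) = W - (system + sum of message lengths)."""
    return window - (system_tokens + sum(message_tokens))

def needs_compression(window, system_tokens, message_tokens, tau=20_000):
    """Trigger history compression when headroom falls below the threshold."""
    return tokens_remaining(window, system_tokens, message_tokens) < tau
```

Checking before each LLM call, rather than reacting to a context-overflow error, lets the agent compress proactively while the full history is still available to summarise.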

Observability & Tracing

Operations · Debugging

An agent that produces wrong output is not obviously broken — it may appear to run correctly while making subtle reasoning errors, calling tools in the wrong order, or silently discarding relevant context. Traditional software observability (uptime, error rates, p99 latency) is necessary but not sufficient. Agent observability requires understanding the reasoning process, not just the inputs and outputs.

The Span Hierarchy

The standard model for agent traces borrows from distributed tracing. A root span represents the top-level agent invocation. Each LLM call, tool invocation, and subagent spawn creates a child span. The span hierarchy mirrors the agent's execution tree and can be visualised as a waterfall or flame graph. Each span captures: start and end time, input (prompt or tool arguments), output (completion or tool result), token counts, model identifier, and any error information.
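A framework-free sketch of that span model; a real system would emit OpenTelemetry or LangSmith spans, so the structure here is illustrative:

```python
import time
from contextlib import contextmanager

class Tracer:
    """Toy trace recorder: spans nest via a stack, mirroring the agent's
    execution tree (root invocation -> LLM calls, tool calls, subagent spawns)."""
    def __init__(self):
        self.spans, self._stack = [], []

    @contextmanager
    def span(self, name, **attrs):
        s = {"name": name, "attrs": attrs, "children": [],
             "start": time.monotonic()}
        # attach to the current parent, or to the root list
        (self._stack[-1]["children"] if self._stack else self.spans).append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s["end"] = time.monotonic()
            self._stack.pop()
```

Usage mirrors the hierarchy described above: wrap the run in a root span, then open a child span around each LLM call and tool invocation, attaching token counts and model identifiers as attributes.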

LangSmith

LangSmith is LangChain's observability platform, tightly integrated with LangGraph. It captures every LLM call and tool invocation in a structured trace, annotates each with latency and token cost, and provides a UI for replaying, comparing, and annotating traces. For LangGraph users it's the fastest path to observability; for other frameworks it can be instrumented via the OpenAI-compatible tracing API or LangSmith's SDK.

OpenTelemetry for Agents

The OpenTelemetry (OTel) ecosystem is emerging as the vendor-neutral standard for agent tracing. The GenAI semantic conventions (under active standardisation as of 2025) define span attributes for LLM calls: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and so on. Instrumenting an agent with OTel lets you route traces to any compatible backend — Jaeger, Datadog, Honeycomb, Grafana Tempo — without vendor lock-in.

Key metrics to track at the agent level: end-to-end latency per task type, tool call frequency (which tools are called most, and which fail most), loop count (how many LLM calls per task — high counts often signal confusion or inefficiency), context utilisation (fraction of window used at completion — high values risk truncation), and cost per task (covered in the next section).

Structured Logging for Agent Reasoning

Beyond span data, structured logs of the agent's reasoning are invaluable. If the agent uses extended thinking, capturing the thinking block alongside the final response makes debugging failures tractable. If it uses ReAct-style scratchpads, logging the thought-action-observation triples creates a human-readable execution trace. The key discipline: log at the semantic level (what the agent decided and why) rather than the syntactic level (what tokens were generated), because semantic logs remain interpretable as models change.

Evaluation Integration

Observability systems are most valuable when they feed back into evaluation. LangSmith, Braintrust, and similar platforms let you define evaluators — LLM-based or rule-based — that score past traces automatically. This creates a feedback loop: you can detect regressions when you change a prompt or update a model version, and you can identify systematic failure modes (e.g., "the agent consistently fails on tasks that require more than three sequential tool calls") that would be invisible from aggregate metrics alone.

Cost Management

Economics · Optimisation

LLM API costs are a function of token throughput. For a simple chatbot, this is easy to reason about. For an agent that may make dozens of LLM calls per task, spawn subagents, and process long tool results, costs can compound quickly and unpredictably. Cost management is not just an operational concern — it shapes architectural choices from the ground up.

Cost Anatomy

A multi-step agent task has costs along several dimensions. Each LLM call costs input tokens (the full conversation history + system prompt + tool definitions, which grow with each turn) plus output tokens (the model's response). In a conversation of n turns, input tokens grow as O(n²) if history is not compressed — the most common source of unexpectedly large bills. Tool calls add latency but not direct LLM cost, unless the tool result is large (e.g. fetching a long web page) and gets appended to the context. Subagent spawns have the full cost structure of a nested conversation.

Cumulative Input Token Growth (Uncompressed)
\[ \text{input\_tokens}_n = \sum_{k=1}^{n} \left(S + \sum_{j=1}^{k-1} |\text{msg}_j|\right) \approx S \cdot n + \frac{n(n-1)}{2} \cdot \bar{m} \]
where \(S\) = system prompt tokens, \(\bar{m}\) = average tokens per message, \(n\) = number of turns. The quadratic term dominates for long runs — a 50-turn conversation with 200 tokens/turn incurs ~250,000 input tokens from history accumulation alone.
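The quadratic term can be checked directly; for uniform integer message lengths the exact sum and the closed form agree (helper name illustrative):

```python
def cumulative_input_tokens(n, system_tokens, avg_msg_tokens):
    """Total input tokens over n turns: turn k re-sends the system prompt
    plus the k-1 messages accumulated so far. Returns (exact, closed_form)."""
    exact = sum(system_tokens + k * avg_msg_tokens for k in range(n))
    closed_form = system_tokens * n + n * (n - 1) // 2 * avg_msg_tokens
    return exact, closed_form
```

With n = 50 and 200 tokens per message, the history term alone contributes 50·49/2 · 200 = 245,000 input tokens, matching the figure quoted above.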

Cost Controls

Several techniques keep costs bounded. Prompt caching is the highest-leverage single technique: Claude and other major APIs offer cached prompt pricing (typically 10–20% of full input cost) for prompt prefixes that are stable across calls — system prompts, tool definitions, and long context documents. For agents that make many calls with the same system prompt, caching can reduce input costs by 80% or more. Model routing uses a cheaper model for steps that don't require frontier capability — classification, formatting, simple lookup — and reserves the expensive model for complex reasoning steps. History compression (summarisation or truncation) bounds the quadratic token growth. Tool result trimming extracts only the relevant portion of large tool outputs before appending to context.

Technique | Cost Reduction | Latency Impact | Quality Risk
Prompt caching | 50–85% on input | Neutral or faster | None (same tokens)
Model routing (cheap for easy steps) | 30–70% overall | Faster on routed steps | Low if routing is accurate
History compression | 40–80% on input (long runs) | Adds compression call | Medium — info loss possible
Tool result trimming | Variable (0–60%) | Neutral | Low if trimming is targeted
Parallelism (fewer sequential turns) | Reduces O(n²) to O(k·m²) | Reduces wall time | None

Budget Enforcement

Token budgets should be enforced in code, not just monitored after the fact. A common pattern is a BudgetGuard that wraps every LLM call, accumulates token usage, and raises an exception (or triggers a graceful termination branch) when a per-task budget is exceeded. Frameworks like LangGraph make this straightforward to add as a node in the graph; the guard inspects state before each call and routes to an over_budget terminal node if the budget is breached. This prevents a runaway agent from generating a surprise four-figure API bill.
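A minimal sketch of such a guard (class and exception names are illustrative):

```python
class BudgetExceeded(Exception):
    pass

class BudgetGuard:
    """Wraps every LLM call, accumulates token usage, and halts the task
    when the per-task budget is breached."""
    def __init__(self, max_tokens):
        self.max_tokens, self.used = max_tokens, 0

    def call(self, llm_fn, messages):
        if self.used >= self.max_tokens:
            raise BudgetExceeded(f"used {self.used} of {self.max_tokens} tokens")
        response, usage = llm_fn(messages)  # usage = input + output tokens
        self.used += usage
        return response
```

In a graph framework the same check becomes a node or pre-call hook that routes to a graceful-termination branch instead of raising, so the task ends with a partial result rather than an exception.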

Deployment Patterns

Infrastructure · Production

Running an agent in development — where it completes in seconds on a local machine — reveals almost nothing about its production behaviour. Production agents are long-running, concurrent, interrupted by infrastructure failures, and subject to latency requirements that don't exist in a notebook.

Synchronous vs. Asynchronous Execution

Short tasks (under ~30 seconds) can be executed synchronously: the client waits for the response. Longer tasks require asynchronous patterns. The standard approach is a job queue: the client submits a task and receives a job ID; the agent runs in a worker process and writes results to a database; the client polls or subscribes to completion. Redis Queues, Celery, and cloud-native services (AWS SQS + Lambda, GCP Tasks + Cloud Run) are common implementations. For streaming responses — where partial results are valuable as they're produced — Server-Sent Events or WebSockets allow the agent to push incremental output to the client without waiting for full completion.

1. Client submits task via REST API. Server validates, assigns a task_id, enqueues the job, and returns 202 Accepted with the task ID.
2. Worker picks up the job from the queue. Initialises the agent with the task parameters and a checkpointer keyed on task_id.
3. Agent runs, checkpointing state after each node. On error, the worker marks the job as failed with the error and checkpoint reference; the job can be retried from the checkpoint.
4. Client polls GET /tasks/{task_id} or subscribes to a completion webhook. On success, the result is available at a result endpoint; on failure, the error and last state are accessible for debugging.

Concurrency and Rate Limiting

LLM APIs have rate limits on tokens per minute and requests per minute. Multiple concurrent agents can easily saturate these limits, causing cascading 429 errors that are hard to recover from gracefully. The solution is a shared rate-limiter (token bucket or leaky bucket) in front of all LLM calls. Redis-based implementations (e.g. slowapi, redis-rate-limit) provide distributed rate limiting across multiple worker processes. Exponential backoff with jitter handles transient 429s that slip through.
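A minimal single-process token-bucket sketch; a production version would live in Redis so the state is shared across workers (names illustrative):

```python
import time

class TokenBucket:
    """Limiter placed in front of all LLM calls: the bucket refills at
    `rate` tokens per second up to `capacity`; a call proceeds only when
    enough tokens remain, otherwise the caller backs off and retries."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def try_acquire(self, n=1):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller sleeps with exponential backoff + jitter
```

Acquiring `n` proportional to the request's estimated token count, rather than one unit per request, matches the tokens-per-minute limits that LLM APIs actually enforce.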

Containerisation and Isolation

Agents that execute code (Python interpreters, shell commands) or manipulate files must run in isolated environments. Docker containers with restricted network access, read-only filesystem mounts, and resource limits (CPU, memory, wall time) are the standard approach. For high-security deployments, gVisor or Firecracker microVMs provide stronger isolation than standard Docker. The container image should be minimal — no unnecessary tools that an agent could misuse — and should run as a non-root user.

Frontier & Open Problems

Research · Roadmap

The agent framework ecosystem is young and moving fast. Several important problems remain unsolved or underserved as of 2025.

Workflow Portability

Agent workflows are currently deeply coupled to their framework. A LangGraph workflow cannot be migrated to CrewAI without substantial rewriting; a Claude SDK application assumes Claude throughout. There is no standard intermediate representation for agentic workflows analogous to ONNX for models. This portability problem is not merely an inconvenience — it creates vendor lock-in, makes benchmarking across frameworks difficult, and slows the development of shared tooling. Early proposals exist (e.g. AgentFlow YAML), but no standard has achieved traction.

Standardised Evaluation Integration

Observability captures what the agent did; evaluation captures whether it did it correctly. These two concerns are currently served by separate tooling that rarely integrates cleanly. A production monitoring system that could automatically detect anomalous traces, route them to evaluators, and generate regression alerts without manual intervention would close the feedback loop that makes agent systems genuinely improvable over time.

Adaptive Cost Optimisation

Current cost controls are static — you configure them in advance and they apply uniformly. A more sophisticated system would learn the cost-quality frontier for a given task distribution: automatically routing easy tasks to cheaper models, increasing context compression on low-importance conversations, and reserving expensive compute for tasks where quality has been empirically shown to matter. This requires online learning over the agent's own traces — a form of meta-optimisation that no production framework currently provides.

Long-Horizon Reliability

The compounding-error problem from computer use agents applies equally to all long-horizon orchestration. A workflow with 50 steps and 95% per-step reliability completes correctly only 8% of the time. Improving this requires either (a) dramatically higher per-step reliability through better models, (b) checkpointing plus retry at a granularity fine enough that no step is too costly to re-do, or (c) planning architectures that commit to fewer, more reversible actions before verifying intermediate results. Current frameworks address (b); (a) and (c) remain active research directions.

The Convergence Thesis

A recurring prediction in the field is that agent frameworks will converge toward a small set of primitives: a durable execution engine (probably built on something like Temporal), a standardised tool interface (MCP is the leading candidate), and a common observability layer (OTel). The diversity of current frameworks is a sign of an ecosystem exploring the design space; consolidation typically follows as best practices emerge and switching costs become salient. Whether convergence happens around an open standard or a single dominant framework is the open question.

Further Reading