Agent Frameworks & Infrastructure: the plumbing that keeps agents running reliably.
Writing a ReAct loop from scratch takes an afternoon. Keeping it reliable, observable, resumable, and affordable across months of production traffic takes an infrastructure stack. This chapter surveys the orchestration frameworks — LangGraph, AutoGen, CrewAI, the Claude Agent SDK — that handle the plumbing, then digs into the architectural patterns, state management strategies, observability tooling, and cost controls that separate a demo agent from a deployed one.
Prerequisites
This chapter assumes familiarity with the ReAct loop and tool-calling mechanics covered in LLM-Based Agents (Ch 02) and Tool Use & Function Calling (Ch 05). Basic Python literacy is assumed for the code examples. The multi-agent patterns here foreshadow the deeper treatment in Multi-Agent Systems (Ch 08).
Why Frameworks?
The minimal implementation of an agent is a while-loop: call the LLM, check whether it wants to use a tool, execute the tool, append the result, repeat until the model signals completion. This fits in forty lines of Python and runs correctly for simple, fast, single-turn tasks. The problems emerge when you leave the happy path.
Production agents face a different class of concerns. A task that runs for minutes or hours must be resumable — if the process crashes halfway through, replaying from the start is often unacceptable. Long-running workflows accumulate costs; without budgeting primitives the bill is unpredictable. Tool calls fail, APIs time out, and LLM outputs don't always parse; robust retry and fallback logic is tedious to write correctly from scratch. Debugging a misbehaving agent is nearly impossible without structured logs of every LLM call, tool invocation, and state transition. And when you want to run several agents in parallel or have them collaborate, the coordination logic dwarfs the agent logic itself.
Frameworks solve these problems at the abstraction level rather than the application level. They provide the scaffolding — state containers, execution engines, persistence backends, tracing hooks — so application code can focus on what the agent should do, not on the mechanics of keeping it running reliably.
Frameworks impose opinions. LangGraph's graph model is powerful for cyclic workflows but overkill for a linear three-step pipeline. AutoGen's conversation metaphor fits collaborative tasks but adds overhead for solo agents. The right question is not "which framework is best" but "which abstraction matches my workflow's shape." Many production teams end up using a framework for orchestration while keeping the core agent logic as thin, testable plain Python.
The Infrastructure Stack
A mature agent system has layers: the model layer (LLM API calls, prompt construction, response parsing), the agent layer (the reasoning loop, tool dispatch, memory access), the orchestration layer (how agents are composed, sequenced, or run in parallel), the persistence layer (state storage, checkpoints, conversation history), and the observability layer (traces, metrics, cost accounting). Frameworks vary in how many of these layers they address — some provide all five, others focus on orchestration and leave the rest to the developer.
Framework Landscape
The agent framework ecosystem evolved rapidly from 2023 onward. Early entrants like LangChain established conventions (chains, agents, tools) that later frameworks either adopted or deliberately rejected. The mature landscape as of 2025 features several distinct philosophical positions.
LangGraph: Models agent execution as a directed graph where nodes are Python functions and edges are transitions. Supports cycles, conditional branching, and human-in-the-loop interrupts. Built-in checkpointing with SQLite and Redis backends. The lowest-level of the mainstream frameworks — you define the graph explicitly.
AutoGen: Abstracts agents as conversational participants who exchange messages. GroupChat coordinates multiple ConversableAgents with configurable speaker-selection policies. Strong support for code execution, human proxy agents, and nested conversations. v0.4 introduced an async event-driven runtime.
CrewAI: Organises agents into Crews with explicit roles, goals, and backstories. Tasks are assigned to agents; the framework handles sequential and hierarchical execution, inter-agent delegation, and shared memory. Designed for readability — a Crew definition reads almost like a job description document.
Claude Agent SDK: Provides primitives for building agents that spawn subagents, use tools via the Model Context Protocol, and participate in multi-agent orchestration. Emphasises safety constraints and controlled delegation. Integrates natively with Claude's extended thinking and tool-use capabilities.
| Dimension | LangGraph | AutoGen | CrewAI | Claude SDK |
|---|---|---|---|---|
| Abstraction | State graph | Conversation | Crew/Role | Subagent tree |
| Checkpointing | Built-in | Partial | Limited | Via SDK hooks |
| Multi-agent | Explicit graph | GroupChat | Crew hierarchy | Subagent spawn |
| Human-in-loop | Interrupt nodes | Human proxy | Callback hooks | Pause/resume |
| Observability | LangSmith | AutoGen Studio | AgentOps / custom | Anthropic console |
| Model support | Any (via LangChain) | Any (OpenAI compat) | Any | Claude-only |
| Learning curve | Moderate–High | Moderate | Low–Moderate | Low (Claude-native) |
LangGraph: State Machines for Agents
LangGraph, released by LangChain Inc. in early 2024, reframes agent execution as a finite-state machine encoded as a directed graph. The core insight is that many agent failure modes — infinite loops, lost context, unrecoverable errors — stem from the implicit, unstructured nature of while-loop agents. Making the state machine explicit forces the developer to think about every possible transition, and gives the runtime enough information to checkpoint, resume, and debug reliably.
Core Primitives
A LangGraph application has three components. The state schema is a typed dictionary (using Python's TypedDict or Pydantic) defining the shared data that flows through the graph — messages, intermediate results, flags, counters. Every node reads from and writes to this state. Nodes are ordinary Python functions (or async coroutines) that take the current state and return a partial update. Edges connect nodes; conditional edges inspect the state to decide which node to visit next, enabling branching and looping.
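The three primitives can be illustrated without the library itself. The following is a framework-agnostic toy — not LangGraph's actual API — showing a typed state schema, nodes that return partial updates, a static edge, and a conditional edge that closes a loop:

```python
from typing import Callable, TypedDict, Union

class State(TypedDict):
    question: str
    draft: str
    attempts: int

def generate(state: State) -> dict:
    # Node: reads state, returns a partial update; never mutates in place.
    return {"draft": f"answer to {state['question']}",
            "attempts": state["attempts"] + 1}

def review(state: State) -> dict:
    return {}  # a real node might score or annotate the draft

def route(state: State) -> str:
    # Conditional edge: inspect state, name the next node (or terminate).
    return "END" if state["attempts"] >= 1 else "generate"

NODES: dict = {"generate": generate, "review": review}
EDGES: dict = {"generate": "review",   # static edge
               "review": route}        # conditional edge

def run(state: State, entry: str = "generate") -> State:
    node = entry
    while node != "END":
        state = {**state, **NODES[node](state)}   # apply partial update
        nxt = EDGES[node]
        node = nxt(state) if callable(nxt) else nxt
    return state

final = run({"question": "what is a checkpoint?", "draft": "", "attempts": 0})
```

LangGraph's real API (`StateGraph`, `add_node`, `add_conditional_edges`, `compile`) adds persistence and interrupts on top of exactly this execution model.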
Checkpointing and Resumption
The checkpointer argument serialises the full state after every node execution. If a run is interrupted — by a process crash, a timeout, or a deliberate human-in-the-loop pause — it can be resumed from the last checkpoint by passing the same thread_id. This is the feature that most distinguishes LangGraph from homegrown loops: durable execution without requiring an external workflow engine like Temporal or Celery.
LangGraph ships SQLite and Redis checkpointers out of the box. For long-running or distributed agents, Redis is preferred: it supports concurrent readers, TTL-based expiry, and horizontal scaling. The checkpoint format is a JSON-serialisable snapshot of the state dict plus the current node pointer and the pending edge queue.
Human-in-the-Loop
Interrupt nodes pause execution before or after a specified node and surface the current state to an external caller — typically a UI or approval workflow. The application resumes when the caller invokes app.update_state() with the modified state. This pattern is used for high-stakes actions (file deletions, financial transactions, customer communications) where an automated step needs human sign-off before proceeding.
AutoGen: Conversational Multi-Agent
AutoGen, from Microsoft Research (Wu et al., 2023), takes the position that agent collaboration is most naturally modelled as conversation. Each agent is a participant that sends and receives messages; the framework handles routing, turn-taking, and termination. This metaphor maps naturally onto tasks that involve negotiation, review cycles, or the kind of back-and-forth that humans use to refine ideas.
ConversableAgent
The central class is ConversableAgent, which wraps an LLM (or a human, or a code executor) and exposes a unified messaging interface. A UserProxyAgent represents a human or automated tester; an AssistantAgent wraps an LLM with optional tool access. The simplest pattern is a two-agent conversation where the UserProxy initiates a task and the Assistant iterates until the UserProxy's is_termination_msg function returns true.
GroupChat
GroupChat extends the two-agent pattern to N agents with a configurable speaker-selection policy. The auto policy uses an LLM to decide which agent should speak next based on the conversation history — effectively an LLM-as-moderator. The round-robin policy cycles through agents deterministically. The random policy is useful for simulating markets or social dynamics. Custom policies are plain Python functions that take the chat history and return the next speaker's name.
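A custom policy really is just a function. The sketch below assumes a simplified message shape (`{"name": ..., "content": ...}`) and an ordered list of agent names rather than any framework's actual types; it forces the Critic to speak after the Planner and falls back to round-robin otherwise:

```python
def critic_after_planner(history: list, agents: list) -> str:
    # Custom speaker-selection policy: the Critic always follows the
    # Planner; any other speaker hands off round-robin style.
    if history and history[-1]["name"] == "Planner":
        return "Critic"
    last = history[-1]["name"] if history else agents[-1]
    return agents[(agents.index(last) + 1) % len(agents)]

agents = ["Planner", "Critic", "Summariser"]
history = [{"name": "Planner", "content": "Proposal: use a job queue."}]
next_speaker = critic_after_planner(history, agents)
```

Deterministic policies like this are also a cheap way to suppress the circular-agreement failure mode: the moderator LLM can no longer choose to let two agents compliment each other indefinitely.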
GroupChat is well-suited to workflows that benefit from diverse perspectives: a Critic that challenges the Planner's proposals, a Fact-Checker that queries external sources, a Summariser that condenses long threads before passing them on. The failure mode is verbose, circular conversations where agents agree with each other rather than making progress — mitigated by explicit termination conditions and turn limits.
AutoGen v0.4 and the Event-Driven Runtime
AutoGen's 0.4 release introduced a significantly redesigned runtime based on asynchronous message passing. Agents become actors in an actor model: they register message handlers and communicate exclusively via typed messages, with no shared mutable state. This enables genuinely concurrent multi-agent execution and makes the system easier to test and distribute. The older "conversational" API remains available but is increasingly positioned as a high-level convenience wrapper over the new runtime.
CrewAI: Role-Based Orchestration
CrewAI (Moura, 2023) prioritises legibility. The core bet is that the most important thing about a multi-agent system is understanding who does what and why — and that this should be readable without tracing through code. A CrewAI application is a declaration of agents (with roles, goals, and backstories), tasks (with descriptions and expected outputs), and the crew that binds them together.
Execution Processes
CrewAI supports three process modes. Sequential executes tasks in order, passing each task's output as context to the next — the default and the simplest to reason about. Hierarchical adds a manager agent that decomposes the overall goal, delegates sub-tasks to crew members, and synthesises results; the manager is typically a more capable model than the workers. Consensual (experimental) requires agents to agree on outputs before proceeding — useful when correctness matters more than speed.
Memory and Tool Sharing
CrewAI's memory system has three tiers: short-term (the current run's conversation), long-term (a SQLite database persisting facts across runs), and entity memory (a structured store of named entities encountered during tasks). Tools can be shared across agents in a crew or assigned to specific roles. The framework handles tool invocation, result passing, and retry automatically, with configurable max-delegation limits to prevent runaway agent loops.
Claude Agent SDK
Anthropic's Claude Agent SDK approaches the problem from a different angle than LangGraph or AutoGen. Rather than providing a general-purpose orchestration engine, it provides the primitives needed to build agents that are native to Claude's capabilities: extended thinking, the Model Context Protocol, subagent delegation, and Anthropic's safety infrastructure.
The Subagent Model
The SDK's primary abstraction is the subagent: an agent invocation that runs as a tool call within a parent agent's context. A parent agent can spawn multiple subagents for parallel subtasks, each with its own tool set, system prompt, and context window. Subagent results are returned to the parent as tool results, which the parent uses to synthesise a final answer. This creates a tree of agent invocations that maps naturally onto hierarchical task decomposition.
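The shape of the pattern — decompose, fan out, synthesise — can be sketched with plain threads standing in for subagent invocations. `run_subagent` here is a hypothetical stub, not the SDK's API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(subtask: str) -> str:
    # Stand-in for a nested agent invocation: in the real SDK a subagent
    # runs as a tool call with its own tools, prompt, and context window.
    return f"result({subtask})"

def parent_agent(task: str) -> str:
    # Decompose the task, fan out to subagents in parallel, then
    # synthesise the results in the parent's own context.
    subtasks = [f"{task}::part{i}" for i in range(3)]
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = list(pool.map(run_subagent, subtasks))
    return " | ".join(results)
```

The key property the sketch preserves: subagent results come back as ordinary tool results, so the parent's context only grows by the synthesised outputs, not by each subagent's full working history.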
MCP Integration
The Model Context Protocol (MCP) is an open standard, co-developed by Anthropic, for connecting LLMs to external tools and data sources. The SDK makes MCP a first-class citizen: any MCP server can be attached to an agent as a tool provider, and the agent receives the server's tool manifest automatically without manual schema writing. This enables a plug-in architecture where capabilities (Slack, GitHub, databases, custom APIs) are attached to agents via MCP servers, rather than being hardcoded into the agent's tool list.
Safety Constraints
The SDK exposes Claude's native safety mechanisms at the framework level. Operators can set computer use constraints (which domains a browser agent may visit), file system boundaries (which directories an agent may read and write), and tool permission levels (read-only vs. read-write vs. destructive). These constraints are enforced at the model level — they cannot be circumvented by prompt injection — and propagate automatically to subagents spawned during a run.
Orchestration Patterns
Independent of which framework you use, agent workflows fall into a small set of structural patterns: sequential pipelines (each step's output feeds the next), parallel fan-out/fan-in (independent subtasks run concurrently and their results are merged), and conditional routing (the next step depends on an intermediate result). Recognising these patterns is more useful than knowing any particular framework's API, because frameworks are libraries but patterns are the things your code is actually doing.
Nested and Recursive Patterns
Real workflows combine these primitives. A hierarchical decompose-and-conquer pattern has an orchestrator that breaks a task into subtasks (conditional), runs each in parallel (parallel fan-out), and then synthesises results sequentially (sequential fold). Recursion appears when a subtask is itself complex enough to warrant the same decomposition — a fact that LangGraph handles elegantly via graph cycles and that the Claude SDK handles via subagent spawning.
The Evaluator-Optimizer Loop
A particularly useful pattern — described in Anthropic's agent design guide — adds an evaluator node after the primary agent output. The evaluator checks whether the output meets quality criteria and routes back to the generator if not, up to a maximum iteration budget. This turns a single-pass generator into a self-correcting system without requiring the generator to self-critique (which is less reliable). The evaluator can be a lightweight model, a deterministic check, or a separate LLM call with a focused prompt.
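A minimal sketch of the loop, using stub functions in place of the generator and evaluator LLM calls (the `endswith` check is an arbitrary placeholder for a real quality criterion):

```python
def generate(prompt: str, feedback) -> str:
    # Stub generator; a real one would be an LLM call that incorporates
    # the evaluator's feedback into its prompt.
    return (prompt + "!") if feedback else prompt

def evaluate(output: str):
    # Deterministic check standing in for an evaluator model or rule.
    return output.endswith("!"), "output must end with '!'"

def evaluator_optimizer(prompt: str, max_iters: int = 3) -> str:
    feedback = None
    for _ in range(max_iters):          # bounded iteration budget
        output = generate(prompt, feedback)
        ok, feedback = evaluate(output)
        if ok:
            return output
    return output                       # best effort once budget exhausted

result = evaluator_optimizer("draft")
```

Note that the iteration budget is load-bearing: without it, a generator that can never satisfy the evaluator loops (and bills) forever.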
State Management & Checkpointing
Agent state is everything the system needs to resume a run from an arbitrary point: the conversation history, intermediate results, tool call records, counters, flags, and any application-specific data. Getting state management right is the difference between an agent that fails gracefully and one that loses work on every error.
What Needs to Be Persisted
A minimal checkpoint contains four things: the full message history (so the LLM can continue the conversation coherently), the current node or step identifier (so execution resumes at the right place), any out-of-band state variables (counters, accumulated results, flags), and the run metadata (start time, input parameters, configuration). In practice, serialising the entire state dict is easier and more robust than selective persistence, as long as the state is designed to be serialisable — no live network connections, no open file handles.
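A whole-state checkpointer is small enough to sketch directly. This version uses SQLite (in-memory here, a file in practice) and stores the four items above as one JSON blob keyed by thread and step; the schema and field names are illustrative, not any framework's format:

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE checkpoints (
    thread_id TEXT, step INTEGER, state TEXT,
    PRIMARY KEY (thread_id, step))""")

def save_checkpoint(thread_id: str, step: int, state: dict) -> None:
    # Serialise the entire state dict -- messages, node pointer,
    # variables, run metadata -- rather than cherry-picking fields.
    conn.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                 (thread_id, step, json.dumps(state)))
    conn.commit()

def load_latest(thread_id: str):
    # Resume point: the highest-numbered checkpoint for this thread.
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ? "
        "ORDER BY step DESC LIMIT 1", (thread_id,)).fetchone()
    return json.loads(row[0]) if row else None

state = {"messages": [{"role": "user", "content": "hi"}],
         "current_node": "plan",
         "counters": {"llm_calls": 1},
         "metadata": {"started_at": time.time(), "input": "hi"}}
save_checkpoint("run-42", 0, state)
```

The JSON round-trip doubles as a design constraint: if `json.dumps` fails, something non-serialisable (a connection, a file handle) has leaked into state and would break resumption anyway.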
Storage Backends
| Backend | Best For | Tradeoffs |
|---|---|---|
| SQLite | Development, single-process agents | No concurrency; fast local reads; zero ops overhead |
| Redis | Production, low-latency, short-lived state | In-memory; TTL expiry; supports pub/sub for event-driven agents |
| PostgreSQL | Long-running tasks, audit trails, ACID requirements | Durable; queryable; slower writes than Redis |
| Object store (S3/GCS) | Large state blobs, archival, cross-region | High latency; ideal for checkpoints rather than frequent reads |
| Temporal / Durable Execution | Mission-critical, complex retry semantics | Full workflow engine; significant operational complexity |
Idempotency and Exactly-Once Semantics
A resumed agent must not re-execute side effects that already happened. If a tool call sent an email or submitted an API request, re-running from a pre-tool checkpoint would duplicate the action. The standard solution is to record completed tool calls in the state with their results, and to skip re-execution of already-completed calls on resume. This is at-least-once execution with idempotency guards at the tool layer — not true exactly-once, which requires distributed transactions. For non-idempotent tools (anything that mutates external state), the tool implementation itself should check whether the action was already performed before executing.
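The guard is a thin wrapper around tool dispatch. In this sketch, completed calls are recorded in the checkpointed state under a caller-supplied `call_id` (a hypothetical convention, e.g. `"email-{step}"`), so a resumed run returns the recorded result instead of re-firing the side effect:

```python
def execute_tool_once(state: dict, call_id: str, tool, *args):
    # Idempotency guard: record completed calls (and their results) in
    # the checkpointed state; on resume, skip re-execution entirely.
    done = state.setdefault("completed_calls", {})
    if call_id in done:
        return done[call_id]        # already ran before the crash
    result = tool(*args)            # side effect fires only once per id
    done[call_id] = result
    return result

sent = []
def send_email(to: str) -> str:    # stand-in for a non-idempotent tool
    sent.append(to)
    return f"sent to {to}"

state = {}
execute_tool_once(state, "email-1", send_email, "a@example.com")
execute_tool_once(state, "email-1", send_email, "a@example.com")  # skipped
```

The guard only holds if `state` is checkpointed *after* the result is recorded; checkpointing before the tool call reintroduces the duplicate-send window.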
Context Window Management
Long-running agents accumulate conversation history until it fills the context window. Frameworks handle this through three strategies: truncation (drop oldest messages, preserving system prompt and recent history), summarisation (call an LLM to compress old history into a summary message, then discard the raw messages), and rolling window (keep the last k turns verbatim, with a summary prefix for earlier history). Summarisation is more expensive but preserves information; truncation is free but risks losing critical earlier context. The right choice depends on whether the task requires long-range coherence.
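The rolling-window strategy can be sketched as a pure function over the message list. The message shape and the `summarise` callback (standing in for an LLM summarisation call) are assumptions:

```python
def rolling_window(messages: list, k: int, summarise) -> list:
    # Keep the system prompt and the last k messages verbatim; compress
    # everything in between into a single summary message.
    system, rest = messages[:1], messages[1:]
    if len(rest) <= k:
        return messages                  # nothing to compress yet
    old, recent = rest[:-k], rest[-k:]
    summary = {"role": "system",
               "content": "Summary of earlier turns: " + summarise(old)}
    return system + [summary] + recent

msgs = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": f"turn {i}"} for i in range(10)]
compact = rolling_window(msgs, k=3, summarise=lambda ms: f"{len(ms)} messages")
```

Because the function is pure, it composes with checkpointing: compress before each LLM call, but checkpoint the uncompressed history so no information is irreversibly lost.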
Observability & Tracing
An agent that produces wrong output is not obviously broken — it may appear to run correctly while making subtle reasoning errors, calling tools in the wrong order, or silently discarding relevant context. Traditional software observability (uptime, error rates, p99 latency) is necessary but not sufficient. Agent observability requires understanding the reasoning process, not just the inputs and outputs.
The Span Hierarchy
The standard model for agent traces borrows from distributed tracing. A root span represents the top-level agent invocation. Each LLM call, tool invocation, and subagent spawn creates a child span. The span hierarchy mirrors the agent's execution tree and can be visualised as a waterfall or flame graph. Each span captures: start and end time, input (prompt or tool arguments), output (completion or tool result), token counts, model identifier, and any error information.
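A span tree needs surprisingly little machinery: a stack to track the current parent and a context manager per span. This sketch keeps spans in a list where a real system would export them to a tracing backend; the attribute names are illustrative:

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []      # a real system exports these to a tracing backend
_stack = []     # current ancestry; top of stack is the active span

@contextmanager
def span(name: str, **attrs):
    # Child spans record their parent's id, so the flat SPANS list can
    # be reassembled into the agent's execution tree.
    s = {"id": uuid.uuid4().hex, "name": name,
         "parent": _stack[-1]["id"] if _stack else None,
         "start": time.time(), **attrs}
    _stack.append(s)
    try:
        yield s
    finally:
        s["end"] = time.time()
        _stack.pop()
        SPANS.append(s)

with span("agent.run", task="summarise"):
    with span("llm.call", model="example-model", input_tokens=812):
        pass
    with span("tool.search", query="agent spans"):
        pass
```

The waterfall view mentioned above is just this tree sorted by `start` time, indented by depth.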
LangSmith is LangChain's observability platform, tightly integrated with LangGraph. It captures every LLM call and tool invocation in a structured trace, annotates each with latency and token cost, and provides a UI for replaying, comparing, and annotating traces. For LangGraph users it's the fastest path to observability; for other frameworks it can be instrumented via the OpenAI-compatible tracing API or LangSmith's SDK.
OpenTelemetry for Agents
The OpenTelemetry (OTel) ecosystem is emerging as the vendor-neutral standard for agent tracing. The GenAI semantic conventions (under active standardisation as of 2025) define span attributes for LLM calls: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and so on. Instrumenting an agent with OTel lets you route traces to any compatible backend — Jaeger, Datadog, Honeycomb, Grafana Tempo — without vendor lock-in.
Key metrics to track at the agent level: end-to-end latency per task type, tool call frequency (which tools are called most, and which fail most), loop count (how many LLM calls per task — high counts often signal confusion or inefficiency), context utilisation (fraction of window used at completion — high values risk truncation), and cost per task (covered in the next section).
Structured Logging for Agent Reasoning
Beyond span data, structured logs of the agent's reasoning are invaluable. If the agent uses extended thinking, capturing the thinking block alongside the final response makes debugging failures tractable. If it uses ReAct-style scratchpads, logging the thought-action-observation triples creates a human-readable execution trace. The key discipline: log at the semantic level (what the agent decided and why) rather than the syntactic level (what tokens were generated), because semantic logs remain interpretable as models change.
Evaluation Integration
Observability systems are most valuable when they feed back into evaluation. LangSmith, Braintrust, and similar platforms let you define evaluators — LLM-based or rule-based — that score past traces automatically. This creates a feedback loop: you can detect regressions when you change a prompt or update a model version, and you can identify systematic failure modes (e.g., "the agent consistently fails on tasks that require more than three sequential tool calls") that would be invisible from aggregate metrics alone.
Cost Management
LLM API costs are a function of token throughput. For a simple chatbot, this is easy to reason about. For an agent that may make dozens of LLM calls per task, spawn subagents, and process long tool results, costs can compound quickly and unpredictably. Cost management is not just an operational concern — it shapes architectural choices from the ground up.
Cost Anatomy
A multi-step agent task has costs along several dimensions. Each LLM call costs input tokens (the full conversation history + system prompt + tool definitions, which grow with each turn) plus output tokens (the model's response). In a conversation of n turns, input tokens grow as O(n²) if history is not compressed — the most common source of unexpectedly large bills. Tool calls add latency but not direct LLM cost, unless the tool result is large (e.g. fetching a long web page) and gets appended to the context. Subagent spawns have the full cost structure of a nested conversation.
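The quadratic growth is worth seeing concretely. With illustrative figures (a 1,000-token fixed prefix and 500 tokens added per turn), cumulative input tokens over a run look like this:

```python
def total_input_tokens(n_turns: int, turn_tokens: int = 500,
                       system_tokens: int = 1000) -> int:
    # Each call resends the system prompt plus the whole history so far,
    # so cumulative input tokens grow quadratically with turn count.
    total = 0
    for t in range(1, n_turns + 1):
        total += system_tokens + t * turn_tokens
    return total

short = total_input_tokens(10)   # 37,500 cumulative input tokens
long = total_input_tokens(20)    # 125,000 -- more than 3x for 2x turns
```

Doubling the turn count more than triples cumulative input cost here; without history compression, the ratio approaches 4x as the history term dominates the fixed prefix.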
Cost Controls
Several techniques keep costs bounded. Prompt caching is the highest-leverage single technique: Claude and other major APIs offer cached prompt pricing (typically 10–20% of full input cost) for prompt prefixes that are stable across calls — system prompts, tool definitions, and long context documents. For agents that make many calls with the same system prompt, caching can reduce input costs by 80% or more. Model routing uses a cheaper model for steps that don't require frontier capability — classification, formatting, simple lookup — and reserves the expensive model for complex reasoning steps. History compression (summarisation or truncation) bounds the quadratic token growth. Tool result trimming extracts only the relevant portion of large tool outputs before appending to context.
| Technique | Cost Reduction | Latency Impact | Quality Risk |
|---|---|---|---|
| Prompt caching | 50–85% on input | Neutral or faster | None (same tokens) |
| Model routing (cheap for easy steps) | 30–70% overall | Faster on routed steps | Low if routing is accurate |
| History compression | 40–80% on input (long runs) | Adds compression call | Medium — info loss possible |
| Tool result trimming | Variable (0–60%) | Neutral | Low if trimming is targeted |
| Parallelism (fewer sequential turns) | Reduces O(n²) to O(k·m²) | Reduces wall time | None |
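The model-routing row can be made concrete with a heuristic router. The model names and the `kind` taxonomy below are illustrative assumptions, not any provider's API; production routers often replace the static set with a learned classifier:

```python
def route_model(step: dict) -> str:
    # Send mechanical step types to a cheap model; reserve the frontier
    # model for open-ended reasoning.
    CHEAP_KINDS = {"classify", "format", "extract", "lookup"}
    return "small-model" if step["kind"] in CHEAP_KINDS else "frontier-model"

plan = [{"kind": "classify"}, {"kind": "reason"}, {"kind": "format"}]
models = [route_model(s) for s in plan]
```

The "low if routing is accurate" caveat in the table lives entirely in this function: a mis-routed reasoning step fails quietly, so routing decisions should be logged as span attributes.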
Budget Enforcement
Token budgets should be enforced in code, not just monitored after the fact. A common pattern is a BudgetGuard that wraps every LLM call, accumulates token usage, and raises an exception (or triggers a graceful termination branch) when a per-task budget is exceeded. Frameworks like LangGraph make this straightforward to add as a node in the graph; the guard inspects state before each call and routes to an over_budget terminal node if the budget is breached. This prevents a runaway agent from generating a surprise four-figure API bill.
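A minimal version of the guard described above — the class name follows the text; everything else is a sketch:

```python
class BudgetExceeded(Exception):
    pass

class BudgetGuard:
    # Wraps every LLM call: accumulate token usage, fail fast when the
    # per-task budget is breached, before the next call is made.
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(
                f"budget {self.max_tokens} exceeded ({self.used} used)")

guard = BudgetGuard(max_tokens=10_000)
guard.charge(4_000, 1_000)   # fine: 5,000 of 10,000 used
```

In a graph framework the exception handler becomes a routed transition: catching `BudgetExceeded` maps to entering the `over_budget` terminal node with the last checkpoint intact.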
Deployment Patterns
Running an agent in development — where it completes in seconds on a local machine — reveals almost nothing about its production behaviour. Production agents are long-running, concurrent, interrupted by infrastructure failures, and subject to latency requirements that don't exist in a notebook.
Synchronous vs. Asynchronous Execution
Short tasks (under ~30 seconds) can be executed synchronously: the client waits for the response. Longer tasks require asynchronous patterns. The standard approach is a job queue: the client submits a task and receives a job ID; the agent runs in a worker process and writes results to a database; the client polls or subscribes to completion. Redis Queues, Celery, and cloud-native services (AWS SQS + Lambda, GCP Tasks + Cloud Run) are common implementations. For streaming responses — where partial results are valuable as they're produced — Server-Sent Events or WebSockets allow the agent to push incremental output to the client without waiting for full completion.

1. The client submits a task; the API generates a task_id, enqueues the job, and returns 202 Accepted with the task ID.
2. A worker picks up the job, runs the agent, and checkpoints state under the task_id.
3. If the run errors, the job is marked failed with the error and checkpoint reference; the job can be retried from the checkpoint.
4. The client polls GET /tasks/{task_id} or subscribes to a completion webhook. On success, the result is available at a result endpoint; on failure, the error and last state are accessible for debugging.

Concurrency and Rate Limiting
LLM APIs have rate limits on tokens per minute and requests per minute. Multiple concurrent agents can easily saturate these limits, causing cascading 429 errors that are hard to recover from gracefully. The solution is a shared rate-limiter (token bucket or leaky bucket) in front of all LLM calls. Redis-based implementations (e.g. slowapi, redis-rate-limit) provide distributed rate limiting across multiple worker processes. Exponential backoff with jitter handles transient 429s that slip through.
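A single-process token bucket shows the core mechanics (the distributed version moves the bucket state into Redis). `acquire` returns the number of seconds the caller should sleep before proceeding, rather than sleeping itself, so it is easy to test and to compose with async code:

```python
import time

class TokenBucket:
    # Shared limiter in front of all LLM calls: `capacity` tokens,
    # refilled continuously at `rate` tokens per second.
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, n: float = 1.0) -> float:
        """Return seconds to sleep before the call may proceed."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return 0.0
        wait = (n - self.tokens) / self.rate
        self.tokens -= n   # goes negative; the debt is repaid by refill
        return wait

bucket = TokenBucket(rate=10.0, capacity=5.0)   # illustrative limits
```

For token-per-minute limits, `n` is the estimated token count of the upcoming call rather than 1; overestimating slightly is safer than saturating the API and eating 429s.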
Containerisation and Isolation
Agents that execute code (Python interpreters, shell commands) or manipulate files must run in isolated environments. Docker containers with restricted network access, read-only filesystem mounts, and resource limits (CPU, memory, wall time) are the standard approach. For high-security deployments, gVisor or Firecracker microVMs provide stronger isolation than standard Docker. The container image should be minimal — no unnecessary tools that an agent could misuse — and should run as a non-root user.
Frontier & Open Problems
The agent framework ecosystem is young and moving fast. Several important problems remain unsolved or underserved as of 2025.
Workflow Portability
Agent workflows are currently deeply coupled to their framework. A LangGraph workflow cannot be migrated to CrewAI without substantial rewriting; a Claude SDK application assumes Claude throughout. There is no standard intermediate representation for agentic workflows analogous to ONNX for models. This portability problem is not merely an inconvenience — it creates vendor lock-in, makes benchmarking across frameworks difficult, and slows the development of shared tooling. Early proposals exist (e.g. AgentFlow YAML), but no standard has achieved traction.
Standardised Evaluation Integration
Observability captures what the agent did; evaluation captures whether it did it correctly. These two concerns are currently served by separate tooling that rarely integrates cleanly. A production monitoring system that could automatically detect anomalous traces, route them to evaluators, and generate regression alerts without manual intervention would close the feedback loop that makes agent systems genuinely improvable over time.
Adaptive Cost Optimisation
Current cost controls are static — you configure them in advance and they apply uniformly. A more sophisticated system would learn the cost-quality frontier for a given task distribution: automatically routing easy tasks to cheaper models, increasing context compression on low-importance conversations, and reserving expensive compute for tasks where quality has been empirically shown to matter. This requires online learning over the agent's own traces — a form of meta-optimisation that no production framework currently provides.
Long-Horizon Reliability
The compounding-error problem from computer use agents applies equally to all long-horizon orchestration. A workflow with 50 steps and 95% per-step reliability completes correctly only 8% of the time. Improving this requires either (a) dramatically higher per-step reliability through better models, (b) checkpointing plus retry at a granularity fine enough that no step is too costly to re-do, or (c) planning architectures that commit to fewer, more reversible actions before verifying intermediate results. Current frameworks address (b); (a) and (c) remain active research directions.
A recurring prediction in the field is that agent frameworks will converge toward a small set of primitives: a durable execution engine (probably built on something like Temporal), a standardised tool interface (MCP is the leading candidate), and a common observability layer (OTel). The diversity of current frameworks is a sign of an ecosystem exploring the design space; consolidation typically follows as best practices emerge and switching costs become salient. Whether convergence happens around an open standard or a single dominant framework is the open question.
Further Reading
- LangGraph: Building Stateful, Multi-Actor Applications with LLMs. The canonical reference for LangGraph's state machine model, checkpointing API, and human-in-the-loop patterns. Essential reading before building any production LangGraph workflow.
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. Introduces the ConversableAgent abstraction and demonstrates how multi-agent conversation enables tasks that single agents cannot reliably complete. The foundational paper for the conversational multi-agent paradigm.
- Building Effective Agents. Anthropic's guide to agent design patterns including sequential, parallel, evaluator-optimizer, and orchestrator-subagent patterns. Directly informs the Claude SDK design philosophy. The clearest articulation of when and how to compose orchestration patterns.
- Model Context Protocol (MCP) Specification. The open standard for connecting LLM applications to external tools and data sources via a standardised JSON-RPC protocol. Supported natively by Claude and an expanding ecosystem of third-party servers. Understanding MCP is increasingly essential for any production agent architecture.
- Prompt Caching — Anthropic API Documentation. Technical reference for Claude's prompt caching feature, including cache breakpoint syntax, pricing, TTL, and practical guidance on which content to cache. Prompt caching is the single highest-leverage cost reduction available for agent workloads on Claude.
- OpenTelemetry Semantic Conventions for Generative AI. The emerging standard for instrumenting LLM calls and agent workflows with vendor-neutral tracing. Defines span attributes for model calls, token usage, and tool invocations. Following this spec now prevents observability lock-in as the standard matures.