"Sometimes you gotta run before you can walk."
— Tony Stark, Iron Man (2008)
The Iron Man metaphor is closer than it looks. Stark's suit is not
intelligent; J.A.R.V.I.S. is. The suit is a tightly-coupled set of
tools — sensors, propulsion, weapons — that the AI orchestrates
through a control loop with explicit safety interlocks. That is the
same architecture this chapter develops: the LLM is J.A.R.V.I.S., the
forecasters and policies of Chapters 4–6 are the suit, and Section
8-02's tool-calling layer is the mechanism that wires them together.
A working agent is rarely just "an LLM with some tools." It is a small
control flow — retrieve, plan, call, check, retry, summarise — that
needs the same engineering discipline as any other piece of production
code. This chapter operationalises the LLMs of Chapter 7 by wiring them
to tools, structured state, and approval gates.
What this chapter does
Two through-lines:
Tools are the substrate. A small, well-designed tool catalogue
with a mediocre control loop usually beats a fancy control loop over
a poor tool set. Section 8-02 spends most of its time on what makes
a tool catalogue maintainable in production.
The graph is the program. Agents that loop, branch, or share
state across many steps are best modelled as graphs (LangGraph) over
a typed state object. Section 8-01 develops this with the patterns
that survive past prototype-stage demos.
The chapter ends by handing the agentic layer the synthetic data
of Chapter 10: agents author scenarios, the synthetic generator
produces matching paths, and the rest of the stack (forecasters,
policies) is evaluated on those scenarios.
This Chapter Covers
Composing prompt + tool chains with LangChain's typed runnables
and structured outputs.
Graph-based orchestration with LangGraph: typed shared state,
conditional edges, checkpointing, and the agent topologies that
recur in finance (ReAct, plan-and-execute, supervisor + workers).
Function calling, schema design, and guardrails — the
boring-but-load-bearing layer that turns LLM tool calls into a
reliable operator pattern.
The Model Context Protocol (MCP) for portable, auditable tool
integrations across providers.
Multi-agent topologies — when to multiply agents and which
patterns (supervisor + workers, plan-and-execute, debate)
actually pay off.
Production deployment — versioning, observability, approval
gates, and audit logs.
Evaluation and regression — golden tasks, process fidelity,
adversarial probes.
Contents
LangChain and LangGraph —
composable building blocks, graph-based control flow,
observability, and the version/approval discipline that production
agents need.
Tool Calling and MCP — function
calling, the Model Context Protocol, a representative tool
catalogue, and the guardrails that make agents safe to ship.
Multi-Agent Topologies —
supervisor + workers, plan-and-execute, debate, reflection, and
the communication patterns that hold them together.
Production Deployment — the
versioning, observability, approval-gate, and audit-log layer
every shippable agent has to sit inside.
Evaluation and Benchmarks —
golden tasks, rubric-based scoring, process fidelity, adversarial
probes, and the regression discipline that catches drift.
LangChain and LangGraph
A working agent is rarely just "an LLM with some tools." It is a small
control flow — retrieve, plan, call, check, retry, summarise — that
needs the same engineering discipline as any other piece of production code.
LangChain provides the composable building blocks (prompts, models, tools,
memory) and LangGraph adds the graph-based orchestration that makes
non-trivial flows tractable. This section covers the abstractions you
actually use day-to-day for finance applications, the patterns that have
held up across teams, and the observability hooks that make agentic
systems debuggable.
When you actually need a framework
A common warning before reaching for a framework: if your application is a
single LLM call with one or two tools and no looping, do not introduce
LangChain or LangGraph. A Python function with explicit if/else and a
plain SDK call from anthropic or openai is clearer and faster to debug.
Frameworks earn their place when one or more of the following is true:
Multi-step planning — the agent decides what to do next based on
intermediate results, possibly looping until a stop condition.
Branching workflows — different paths depending on tool outputs,
user role, or compliance flags.
Stateful sessions — the agent retains memory across calls and the
state is non-trivial to serialise.
Multiple agents — supervisor + worker patterns, debate, or
parallel research that needs to be merged.
Anything below that bar is a function call.
LangChain core concepts
LangChain centres on a small set of composable abstractions:
PromptTemplate — parameterise instructions with named variables.
ChatModel — a uniform wrapper around providers (Anthropic, OpenAI,
open-source local models) so swapping is a config change.
Tool — a callable with a typed schema that the model can invoke.
Chain / Runnable — composition operator (the | pipe) that
threads prompts → models → parsers → side-effect-free post-processing.
Always pin to typed outputs.PydanticOutputParser (or LangChain's
newer with_structured_output) raises on schema violations rather than
returning silently malformed text. Production agents need this.
Compose with |. Each stage is a pure function except the LLM.
Side effects (tool calls, writes to a database) belong in tools, not in
the chain.
LangGraph: graphs over chains
A linear chain falls apart as soon as you need a loop ("retry on tool
error", "ask again if confidence is low") or a branch ("if the result
mentions a security, run compliance check; else summarise"). LangGraph
generalises chains to directed graphs of nodes with shared state:
Typed shared state. Every node receives the same AgentState and
returns partial updates that are merged. Type errors at graph
compile time instead of runtime.
Conditional edges and loops. Real agents need to decide what to do
next and stop conditions need to be explicit. The condition function
returns a node name; LangGraph handles the dispatch.
Checkpointing. Persist AgentState to a backend (SQLite, Postgres,
Redis) so a session resumes across processes. This is what
production-ready trading agent systems
\citep{langgraphserver2025} exploit for restartable runs.
Agent topologies
A handful of graph topologies cover most finance applications:
ReAct loop — single agent alternating think → tool call → observe
until done. The default for analysis-heavy tasks (market summary,
filings extraction).
Plan-and-execute — a planner writes a multi-step plan, then a worker
executes each step. Useful when the task decomposes naturally and you
want each step auditable.
Supervisor + workers — a central supervisor delegates to specialised
agents (researcher, quant, compliance reviewer) and synthesises their
outputs. The pattern in FINCON \citep{fincon2026} for financial
decision-making and similar multi-agent systems.
Debate / critic — two agents argue alternative analyses; a third
selects or merges. Trades latency for higher quality on judgement
tasks.
The right topology is mostly about how independent the sub-tasks are. If
they are independent, parallelise with workers; if they depend on each
other, sequence them in a plan.
State and memory
Agent state is bigger than "previous messages." For finance specifically:
Working memory. The current AgentState — question, retrieved
documents, partial plan, intermediate tool outputs.
Episodic memory. Past sessions with the same user, persisted across
conversations. LangGraph's checkpointer plus a small embedding store
keeps this manageable.
Long-term memory. Domain facts, style guides, firm-specific rules
that should not be in every prompt but should surface when relevant.
Live in a vector store and are retrieved on demand.
A frequent mistake is to push everything into the prompt context. Treat
context as a scarce, latency-sensitive resource and put structural
information into typed state, not the system message.
Observability
Agentic systems are much harder to debug than synchronous APIs because
control flow is partly model-driven. The minimum observability stack:
Tracing. LangSmith, Phoenix, or OpenTelemetry-based tools capture
the full graph trace per run: node inputs, outputs, latencies, token
counts. A trace is the agentic equivalent of a stack trace.
Edge-level metrics. Latency and token count per node. Which node
drove the cost? Which one slowed the run? Without this, optimisation
is guesswork.
Failure-mode tagging. Errors fall into a small set of categories
(tool timeout, schema violation, model refusal, infinite loop). Tag
them at capture time so you can monitor rates over time.
Replay. Saved traces should be replayable so you can debug a
past failure with the same inputs and (ideally) the same model
version.
Versioning and shipping
Three discipline points that matter once an agent is in production:
Version the graph. A change to nodes, edges, or prompts gets a
version number. Production traffic pins to a version; new versions
go through staged rollout.
Separate the model version from the graph version. Both can change;
log both. A model upgrade should trigger the regression suite even if
no graph code changed.
Approval gates for actions with side effects. Anything that
consumes resources outside the agent (sending email, executing a
trade, modifying records) goes through an explicit approval node —
human or rule-based — rather than being a tool the agent calls
silently.
LangChain + LangGraph give the scaffolding; the discipline above is what
makes the scaffolding hold up under production load. Section 8-02 builds
on this with the tool calling patterns and the Model Context
Protocol (MCP) that standardises how agents discover and invoke tools
across providers.
Tool Calling and MCP
Tool calling is what turns an LLM from a text generator into an operator:
the model emits a structured request to invoke a function, the runtime
executes it, and the result feeds back into the conversation. Done well,
this is the single biggest lever against hallucination — anything that has
to be exact (a price, a P&L, a position) is delivered by a deterministic
tool, not by the model's parametric memory. The Model Context Protocol
(MCP) standardises how tools advertise themselves to models, how
permissions are negotiated, and how observability data flows back. This
section covers both the function-calling primitive and the MCP layer that
makes agents portable across providers.
Tools as the agent's substrate
An agent's behaviour is largely a function of the set of tools it can
call. A small, well-designed tool set with a mediocre control loop usually
beats a fancy control loop over a poor tool set. Three classes of tool
recur in finance applications:
Read tools. Look up a price, fetch a position, retrieve filings,
query a vector store. Idempotent and safe; can be cached aggressively.
Compute tools. Price options, run a forecaster, evaluate a policy.
Deterministic and side-effect-free given the same inputs; cache by hash
of arguments.
Action tools. Send an email, file a trade, update a database. Have
side effects, so they need explicit approval gates and audit logging.
A working rule: read and compute tools can be wired up directly; action
tools never run without an approval node in between (Section 8-01).
Function-calling basics
Modern providers (Anthropic, OpenAI, Google) expose a uniform interface:
declare tools as JSON Schemas, the model emits typed calls, the runtime
executes them and returns the result. With LangChain:
A complete tool definition that an Anthropic agent can call directly:
from anthropic import Anthropicimport polars as plPRICES = pl.read_csv("data/prices_daily.csv", try_parse_dates=True)# 1) declarative schema the model seesTOOLS = [{ "name": "latest_price", "description": "Return the last close (or as-of close) for an asset.", "input_schema": { "type": "object", "properties": { "ticker": {"type": "string", "description": "Asset ticker, e.g. SPY"}, "asof": {"type": "string", "description": "ISO date; default = latest"}, }, "required": ["ticker"], },}]# 2) deterministic local implementationdef latest_price(ticker: str, asof: str | None = None) -> dict: rows = PRICES.filter(pl.col("Ticker") == ticker).sort("Date") if asof is not None: rows = rows.filter(pl.col("Date") <= asof) last = rows.tail(1).to_dicts()[0] return {"ticker": ticker, "date": str(last["Date"]), "close": last["AdjClose"]}# 3) agent loop that calls the tool when the model asks for itclient = Anthropic()messages = [{"role": "user", "content": "What was SPY's close on 2024-12-31?"}]while True: resp = client.messages.create( model="claude-opus-4-7", max_tokens=512, tools=TOOLS, messages=messages ) if resp.stop_reason != "tool_use": print(resp.content[-1].text) break tu = next(b for b in resp.content if b.type == "tool_use") out = latest_price(**tu.input) messages += [ {"role": "assistant", "content": resp.content}, {"role": "user", "content": [{"type": "tool_result", "tool_use_id": tu.id, "content": str(out)}]}, ]
The schema declares what the model can request; the implementation
controls what actually runs; the agent loop alternates the two. MCP
generalises the schema and the transport so the same tool can be
exposed to any compatible host without rewriting the agent loop.
The four ingredients are always the same:
Schema. A typed input schema (Pydantic, JSON Schema). The model
sees the schema; bad inputs raise locally before any external call.
Implementation. A pure function (or a function with audited side
effects). Behaviour must be deterministic given the same inputs.
Description. Plain English that tells the model when to use the
tool. Models pick tools by description; sparse descriptions cause
underuse.
Output type. Structured (dict, Pydantic model). The model is
better at composing structured outputs than free text.
A common but subtle bug is over-permissive descriptions: tools whose
docstrings hint at capabilities they don't have. Models will call them
optimistically and fail downstream. Be precise about scope.
The Model Context Protocol (MCP)
MCP is an open protocol for connecting language models to external tools
and data sources \citep{mcp2024spec}. Conceptually it sits between the
model and the tools, replacing ad-hoc per-provider integrations with a
single contract:
Tool servers advertise their capabilities (name, schema, auth)
over a standard transport (stdio, websockets).
Hosts (Claude Code, ChatGPT Desktop, custom agents) discover
tool servers, negotiate which tools to expose per session, and route
the model's calls to them.
Clients are the per-session connections that handle tool
invocation, auth, and observability.
Three concrete benefits:
Portability. The same agent runs against Anthropic, OpenAI, or any
MCP-compatible runtime — pick the provider that fits the workload, not
the provider's lock-in.
Least-privilege security. A session exposes a subset of tools
with explicit allow/deny. Sensitive tools (file write, trade execution)
go behind an explicit grant, not a default-allow flag.
Observability. Standard tracing fields (latency, status, request
ID) flow back from every tool call so the host can build a uniform
trace.
In practice, MCP is most useful when you build one tool server that
exposes firm capabilities (positions, prices, forecasters) and let
multiple agents — research, trading, compliance — connect to it with
different permission scopes.
A representative tool catalogue
A reasonable starter catalogue for a portfolio-research agent:
Category
Tool
Purpose
Data
prices.history(ticker, range)
Historical OHLCV
Data
filings.search(query, range)
Filings RAG
Data
news.recent(query, since)
News RAG with time bound
Compute
forecaster.predict(panel, horizon)
Call Chapter 4 model
Compute
optimizer.solve(mu, sigma, constraints)
Mean-variance / RL policy
Compute
risk.var(portfolio, horizon, alpha)
Risk metric
Action
report.draft(template, payload)
Prepare a draft (no send)
Action
notify.slack(channel, msg)
Send notification (approval gated)
Two design choices worth noting:
Compute tools wrap, not duplicate, your existing code. The
forecaster called by forecaster.predict is the same one your humans
call in notebooks. The agent does not get a parallel implementation.
Action tools are the ones that need policy. Read and compute tools
can be exposed by default; action tools require explicit grants and a
human-readable explanation in the trace.
Guardrails
Five guardrails sit between every tool call and its execution:
Schema validation. Reject malformed payloads before they hit the
implementation.
Permission check. Verify the calling session has the right grants
for this tool and for these arguments. Per-argument scoping (e.g.
read-only on certain accounts) catches mistakes that per-tool scoping
misses.
Rate limiting. Cap calls per session and per user. Models in loops
hit a limit before they hit a billing surprise.
Result sanitisation. Strip sensitive fields (PII, internal IDs)
before returning to the model. The model's context is the weakest
privacy boundary — assume anything in it might end up in a log.
Audit log. Every tool call → caller, timestamp, arguments, result
hash, latency, status. The audit log is the single most useful
artefact when a regulator asks how a recommendation was produced.
End-to-end flow
A representative hedging recommendation flow:
User: "Recommend a hedge for the equity book against a 10% S&P
drawdown."
Agent → MCPportfolio.read(book="equity_us") → current positions.
Agent → user: structured recommendation referencing each tool
output, with explicit assumptions and the trace ID.
Approval node: human approves before any report.send action.
Every step is logged; every numerical claim traces to a tool output;
the agent's prose is narration over the tools, not original
calculation.
When to skip MCP
MCP is overhead for small applications. Use it when:
You need provider portability (Anthropic + OpenAI + a local model).
You have multiple agents sharing the same tool set with different
permissions.
You need a unified audit log across agents.
Otherwise, direct provider tool-calling APIs are simpler and good enough
for the first version. MCP earns its place when the system grows to the
point that integration complexity becomes a tax — and that day comes
sooner than most teams expect.
Tool calling with MCP is what turns LLMs from clever summarisers into
reliable operators. The next chapter (synthetic data) closes the
book by giving the agent — and the forecasters and policies it
orchestrates — a way to test itself against scenarios that have never
occurred.
Multi-Agent Topologies
A single agent with a good tool catalogue (Sections 8-01 and 8-02) is
the right starting point for most finance use cases. Multi-agent
systems become the right tool when the work itself decomposes
naturally into specialised sub-tasks — research and write-up, debate
and judgement, plan and execute. This section covers the topologies
that recur in financial workflows and the patterns that keep them
debuggable.
When to multiply agents
A useful test before adding a second agent: can the same outcome be
reached by giving the existing agent one more tool? If yes, do that
instead. Multi-agent systems are interesting when the boundary
between sub-tasks is conceptual (research vs. compliance vs.
execution) rather than just functional. Three concrete shapes:
Specialist diversity. Each agent has its own system prompt,
tool subset, and even its own underlying model. A research agent
uses a long-context model and retrieval tools; a compliance agent
uses a smaller model with strict refusal rules and a regulatory
knowledge base.
Independent perspectives. The same task is given to multiple
agents to surface disagreement. Useful when the answer is a
judgement and false confidence is the real risk.
Pipeline decomposition. A planner produces a structured plan; a
worker executes each step. The planner does not need execution
tools and the worker does not need to think strategically.
Anything that can fit into a single ReAct loop without contortions
should stay there.
Supervisor + workers
The most common multi-agent topology in finance applications. A
supervisor agent decides which worker to delegate to, the worker
executes, the supervisor synthesises and decides the next step.
Supervisor. Read the user goal, current state, and prior worker
outputs. Emit a directive: which worker to call next, with what
prompt, or "stop and synthesise".
Workers. Each is a specialist with a narrow tool set. Examples
in a portfolio-research workflow: data-puller (pricing, fundamentals,
news), modeller (calls the forecasters of Chapter 4 / policies of
Chapter 5 as tools), narrator (drafts the write-up).
Coordination state. The supervisor maintains shared memory the
workers can read; workers append their results. LangGraph's
StateGraph models this naturally.
Three things to monitor when this topology is in production:
Loop termination. The supervisor must reliably emit "done".
Without an explicit max-iteration cap, agents can spin.
Worker scope creep. A specialist that grows tools beyond its
original scope often turns into the de facto generalist; resist by
keeping the tool list per worker pinned and reviewable.
Cost per goal. Supervisor + worker rounds compound. Track total
token and tool-call cost per goal completion; cap on outliers.
Plan-and-execute
A sibling of supervisor + workers: a planner produces a multi-step
plan up front, then a worker executes each step in sequence
without re-planning. Better than ReAct on long-horizon tasks because
the planner keeps the global view; ReAct can lose it.
The worker resolves $stepN.x references between steps. When a step
fails or returns unexpected output, the worker triggers re-planning
on the remaining steps rather than restarting. This is the recipe
behind the FINCON multi-agent system for financial decision-making
\citep{fincon2026}.
The planner-worker split also makes auditing easier: the plan can be
reviewed (and approved or rejected) before any tool with side effects
runs. For finance applications this is a load-bearing feature.
Debate / critic
Two agents argue alternative analyses; a third selects or merges.
Useful for tasks where the failure mode is over-confident wrong
answer — investment-thesis evaluation, model-risk reviews, regulatory-
interpretation questions.
Proposer. Generates a candidate analysis with stated assumptions.
Critic. Reads the proposal and finds weaknesses: missing data,
unsupported claims, alternative interpretations.
Judge. Reads both and either picks one, merges, or asks for
another round.
Three rounds of debate consistently beat a single ReAct pass on
judgement tasks. Two caveats:
Cost. Each round is at least 3× the per-pass cost. Reserve for
tasks where the quality matters more than the latency.
Convergence. Without a judge that can break ties, debates can
oscillate. The judge prompt should explicitly allow "merge" and
"neither — try again" rather than just "pick A or B".
Reflection
A specialised debate where the proposer and critic are the same
agent in two passes. Pass 1 produces a draft; pass 2 ("now critique
this for the points you would push back on") revises it. Cheaper than
true debate; modest quality gain on most tasks. Good default for
write-up tasks where the cost budget does not support full debate.
Hierarchical agents
Supervisors of supervisors. Useful when the work has natural domain
boundaries — research / risk / execution as separate teams, each
with its own internal tool set, coordinated by a top-level
orchestrator.
Risks scale with depth:
Latency multiplies with depth. A three-level hierarchy means
every user request waits on three layers of LLM calls before any
real work happens.
Information loss across boundaries. Each summarisation step
drops detail; by depth 3 the top-level orchestrator may be making
decisions on a heavily lossy view.
A reasonable rule: do not exceed two levels of hierarchy. If a third
level seems necessary, split the system into separate workflows
triggered independently by the user rather than nesting.
Communication patterns
Four ways agents pass information to each other, in increasing
robustness:
Free-text. Each agent emits prose; the next agent reads it.
Lowest friction, lossy, hard to audit.
Structured records. JSON with a fixed schema (title, body,
citations, confidence). Auditable; the schema is the contract.
Shared memory. A blackboard the agents read and write through
a tool interface. Useful for long-running collaborations; needs
garbage collection to stop the blackboard from growing without
bound.
Typed events. Each message carries a type (research_finding,
risk_flag, recommendation) plus a typed payload. The router
uses the type to decide what runs next. The most robust pattern
for production agents.
Production multi-agent systems converge on typed events plus shared
memory for persistence; debate-style proof-of-concepts live happily
on free-text alone.
What this section adds
The single-agent loop in 8-01 plus the tool-calling layer in 8-02
covers most finance applications. Multi-agent topologies become the
right tool when the work genuinely splits along a conceptual axis
(research vs. compliance, propose vs. critique, plan vs. execute) and
the cost of getting it wrong justifies the cost of running multiple
agents. Section 8-04 covers the production engineering — versioning,
observability, approval gates — that any multi-agent system has to
sit inside before it touches a real workflow.
Production Deployment
A working agent in a notebook is one thing. A working agent that runs
unattended against production data, costs predictable money, fails
loudly on regressions, and survives a regulator's audit is a much
harder thing. This section is about the operational layer between
those two. Most of the work here is unglamorous infrastructure —
versioning, observability, approval gates, audit logs — but for
finance applications it is what separates cool demo from system
worth depending on.
Versioning everything
An agent's behaviour is the joint product of at least four moving
pieces, and each needs an explicit version pinned for any
production trace to be reproducible:
Graph version. The LangGraph (or equivalent) graph definition
— nodes, edges, conditional routing. Bump on any structural change.
Prompt version. System prompts, few-shot examples, tool
descriptions. The single thing that changes most often; treat it
like source code.
Tool version. Every tool's schema and implementation. A schema
change is a breaking change; bump the major version.
Model version. The exact model snapshot used (e.g.
claude-opus-4-7, not claude-opus-latest). Hosted models can
silently change behaviour within a stated family.
Every trace logs the AGENT_VERSION block. A bug report that names
the version is reproducible; one without is folklore.
Observability stack
Three layers, each with a different audience:
Per-call telemetry. LangSmith, Phoenix, OpenTelemetry — capture
every LLM call, tool invocation, and routing decision with
inputs, outputs, latency, and token count. The audience is the
engineer debugging a single failed run.
Aggregated metrics. Per-day or per-week dashboards: success
rate, average tool-call count, p50/p95 latency, cost per run,
distribution of failure modes. The audience is whoever owns the
service.
Anomaly alerts. Spike in error rate, drop in success rate, cost
per run trending upward. Page on these, not on per-call failures.
A useful internal contract: the trace ID for any agent run must be
linkable from any user-facing artefact (chat reply, generated
report, recommendation). When a user asks "why did the agent say X",
the answer is a trace URL, not a guess.
Approval gates and circuit breakers
Anything with a cost of being wrong belongs behind an explicit gate.
Three patterns recur in finance:
Human approval. The agent proposes a structured action; a human
reviews and clicks approve. Suited for trade execution, large
capital reallocation, anything client-facing.
Rule-based gate. A deterministic check evaluates the agent's
proposed action. Reject if it violates an invariant: position
outside risk limits, citation pointing nowhere, JSON failing
schema validation.
Circuit breaker. Per-day caps on total tool calls, total
external spend, total trade volume. When tripped, the agent is
paused (with a notification) until a human inspects.
These three compose. A typical production wiring: the agent emits a
proposal → rule-based gate filters obvious failures → human reviews
the rest → circuit breaker caps total daily approvals as a floor.
A working rule: if removing the gate would let the agent move money
or send messages without human visibility, the gate stays.
Cost and latency monitoring
Agents have a habit of spiralling on edge-case inputs — long
contexts, recursive tool calls, planning loops that don't terminate.
Without monitoring, a single bad prompt can spend an order of
magnitude more than the team noticed.
The minimum dashboard for production agents:
Cost per run, by run type. Distribution, not just average. The
90th percentile is where the surprises live.
Tool calls per run. A run that uses 50 tool calls when the
median is 5 is a runaway loop, even if it eventually completed.
Time-to-first-action and time-to-completion. First-action time
measures how long the user waits before any visible progress;
completion time measures total wall clock.
Cost per dollar of value delivered. For tasks where the value
can be quantified — a fund-flow report, a research note — track
the ratio. If the agent costs 5toproducea2 deliverable, it
is uneconomical at scale.
Per-user / per-team cost caps belong in the same dashboard. Soft
caps that warn before hard caps that pause; hard caps that pause
before bills become surprises.
Rollouts and feature flags
Agentic behaviour is hard to QA exhaustively. A staged rollout cuts
the blast radius of regressions:
Canary. New version runs on 1–5% of traffic for a fixed window.
Compare success rate, error rate, cost on the canary vs. the
current version. Promote if metrics are equal-or-better.
A/B. Two versions running side by side with metrics emitted
per arm. Useful for prompt changes where the question is "is the
new prompt actually better".
Shadow. New version runs on every input but its output is
logged, not delivered. Most useful for graph or model changes
that need offline comparison before any user sees them.
A graph-version + prompt-version + model-version tuple is what gets
flagged. Rolling back is bumping the live tuple back to a prior
known-good combination.
Audit log architecture
Regulators and internal model-risk reviewers want answers to
questions like:
"What inputs did the agent see when it produced this output?"
Append-only. Logs are written once and never edited. Use
immutable storage (cloud object storage with object lock) or a
log-structured database with retention policies.
Hashed payloads. Large inputs and outputs (filings, model
predictions) are stored separately by hash; the log references the
hash. Keeps log size manageable; immutability still holds.
Reproducible. Given a trace_id, replaying the same input
through the same agent version with the same tools produces the
same output (modulo non-determinism logged separately as seeds).
The audit log is the artefact you defend in front of a regulator. It
is also what lets a developer reproduce a six-week-old bug.
Failure modes that show up only in production
A working list of the classes that recur:
Long-context degradation. Once the agent's running context
exceeds the model's training distribution sweet spot, behaviour
changes — usually toward verbosity, occasionally toward outright
errors. Cap context length at a fraction of the model's max.
Stale retrieval. A document is indexed once, then changes; the
agent retrieves the stale version and acts on outdated content.
Periodic re-indexing plus a "last verified" timestamp on retrieved
chunks.
Tool result drift. A tool's output schema or units changed
upstream; agent prompts assuming the old shape break. Add schema
validation on tool outputs as well as inputs.
Adversarial prompt-injection through retrieved docs. A document
in the retrieval corpus contains instructions that hijack the
agent. Treat retrieved text as data only; never let it be parsed
as instructions.
Cost drift. Cost per run creeps up over months as prompts grow
and models change. Re-baseline costs on every prompt or model
bump.
What this section adds
Sections 8-01 through 8-03 cover the agent runtime; this section is
the operations layer that runtime sits inside. The discipline here
— versioning, observability, gates, audit logs — is what makes
agentic systems shippable in regulated environments. Section 8-05
adds the evaluation discipline that closes the loop: agents are
measured against goals, not just inspected per call.
Evaluation and Benchmarks
A forecaster (Chapter 4) is evaluated against a held-out distribution
with CRPS. A policy (Chapter 5) is evaluated against rolling
benchmarks with Sharpe and drawdown. Agents are harder: their
output is rarely a number, their behaviour is non-deterministic, and
the same input can produce a defensible answer through several
different tool-call paths. This section covers the evaluation
discipline that survives that complexity — what to measure, how to
measure it, and how to keep agents from regressing as prompts and
models change underneath them.
Three things to measure
Resist the urge to score agents on a single number. Three orthogonal
dimensions need separate tracking:
Outcome quality. Did the agent's output answer the user's
question correctly and at the firm's quality bar? The thing the
user actually cares about.
Process fidelity. Did the agent follow the right tool path?
An agent that produces a correct answer by hallucinating numbers
it should have looked up via a tool is a bug, even if the
observed output is right.
Operational metrics. Cost, latency, success rate, error
taxonomy. The thing the team running the agent cares about.
A run can be 10/10 on outcome quality, 2/10 on process fidelity, and
that mismatch is the evaluation signal. A team that only looks at
outcome quality will ship an agent that fabricates plausibly.
Golden tasks
The minimum evaluation artefact: a curated set of 30–200 prompts
with hand-written reference answers and required tool-call paths.
Run on every prompt change, every model upgrade, every graph
revision.
{ "task_id": "earnings-summary-msft-2024q4", "input": "Summarise MSFT's Q4 2024 earnings highlights against analyst consensus.", "required_tool_calls": [ {"tool": "filings.search", "args_match": {"ticker": "MSFT", "form": "10-Q"}}, {"tool": "consensus.estimates", "args_match": {"ticker": "MSFT"}} ], "reference_answer_keypoints": [ "Revenue beat consensus by ~2%", "Cloud segment growth >25% YoY", "Forward guidance unchanged" ], "rubric": { "completeness": "Mentions all three keypoints", "groundedness": "Each numeric claim has a citation", "tone": "Concise, no hedging beyond what the source supports" }}
Running the suite produces a per-task scorecard with three columns:
outcome (rubric-graded), process (tool-path match against
required_tool_calls), and operational (cost, latency).
The hard part is curating the golden set. The cheap-but-bad pattern
is to scrape historical user queries; the expensive-but-correct
pattern is to write tasks by hand against the failure modes the team
actually wants to catch.
Rubric-based scoring with LLM judges
For the outcome dimension, programmatic scoring works for some
tasks (extraction with reconciliation, numeric Q&A) and not others
(write-ups, narrative analyses). For the latter, an LLM judge with a
structured rubric is the standard tool — with caveats.
JUDGE_PROMPT = """You are scoring an agent's response to a finance task.Task: {task_input}Reference points the response should cover:{reference_keypoints}Agent response:{agent_response}Score each dimension on a 0-3 integer and return strict JSON:- completeness: 3 if all reference points covered, 0 if none- groundedness: 3 if every numeric/factual claim has a citation- tone: 3 if concise and appropriately hedgedReturn: {{"completeness": int, "groundedness": int, "tone": int, "rationale": "..."}}"""
Two practices that keep judge-based scoring honest:
Judge from a different model family than the agent. Same-
family judges over-reward outputs that match their own house
style.
Calibrate against humans periodically. Score 50–100 prompts
with both human and LLM judges; report agreement (Cohen's kappa).
Below 0.6, the judge is too noisy; refine the rubric or change
models.
LLM-judge scores drift as models change. Re-run calibration on every
judge-model upgrade.
Process fidelity: the tool-call audit
The audit log of Section 8-04 turns into evaluation data when
overlaid on golden tasks. For each run:
Required calls present. Every required_tool_calls entry was
invoked at least once, with arguments matching the constraint.
Forbidden calls absent. Tasks can also list calls that should
not fire (e.g., the agent should not have written to a database
on a read-only task).
Path efficiency. The agent reached the answer in ≤N tool
calls. Loops or redundant calls signal an opportunity to tighten
the prompt.
def score_process(trace, task) -> dict: invoked = [(c.tool, c.args) for c in trace.tool_calls] required = task["required_tool_calls"] fidelity = sum( 1 for r in required if any(t == r["tool"] and matches(a, r["args_match"]) for t, a in invoked) ) / max(1, len(required)) n_calls = len(invoked) return {"fidelity": fidelity, "n_calls": n_calls}
Process fidelity is the metric that catches plausible-but-wrong
agents — they look fine on outcome quality and have a low fidelity
score because they shortcut the work.
Adversarial probes
A separate suite of inputs designed to fail. Three categories:
Prompt injection. Documents in the retrieval corpus that try to
hijack the agent's instructions. The agent should ignore them.
Out-of-scope requests. Questions outside the agent's intended
domain. The agent should refuse or escalate, not improvise.
Edge-case inputs. Empty fields, ambiguous tickers, malformed
documents, very long contexts. The agent should fail gracefully —
ask for clarification, not silently invent.
The adversarial set grows over time. Every production failure mode
that surfaces becomes a new entry. Treat it the same way a security
team treats CVE entries: the set never shrinks.
Regression discipline
The evaluation suite runs on every change that could affect agent
behaviour:
Prompt edits. Run the golden suite. Block merge if any task's
outcome score regresses by more than a threshold (e.g., -1 on
the rubric).
Tool schema changes. Run process-fidelity checks; the new
schema must match all required_tool_calls constraints.
Model upgrades. Run the entire suite plus the adversarial
probes. Model upgrades are the single most common source of
silent behavioural change.
Graph topology changes. Full suite. Graph changes can shift
routing in subtle ways that only show up across tasks.
Weekly: run the entire suite on production traffic samples and
compare against the previous week's baseline. Drift in outcome
quality without an associated change shows up here first; without
this, a slow model degradation can run for months unnoticed.
Online evaluation
Offline evaluation catches regressions on known tasks. Online
evaluation catches behaviour on tasks the offline suite never saw.
A/B testing. Two graph versions running side by side. Compare
outcome scores, success rates, cost. Statistically powered samples
before drawing conclusions; agentic outputs are noisy.
Shadow runs. A new version processes every input but its
output is logged, not delivered. Lets you compare before any user
sees the new behaviour. Especially useful for model upgrades.
User feedback. Thumbs up / down on agent outputs, free-text
comments, and explicit "this was wrong" flags. Aggregate weekly;
use as the qualitative complement to offline scores.
Per-domain failure taxonomy
Track failures by category, not just count:
Category
Description
Hallucination
Numeric or factual claim with no source
Tool-skip
Required call missing
Tool-misuse
Wrong tool or wrong args
Schema violation
Output doesn't validate
Refusal
Agent declined a valid request
Loop
Failed to terminate within step cap
Cost
Run exceeded budget without value
Per-week, per-category counts. A spike in hallucination and
refusal together usually means a model upgrade went badly.
What this section adds
The agent runtime (8-01 / 8-02), multi-agent topologies (8-03), and
production engineering (8-04) are the building discipline. This
section is the measuring discipline. Without it, agentic systems
silently drift; with it, the team running the agent has the same
quality of feedback loop that the rest of the book has been
developing for forecasters and policies.
The next chapter (RL fine-tuning) is the bridge into actually
improving agents from this evaluation signal: convert the
preference data and verifiable rewards from the evaluation suite
into a fine-tuning corpus that pushes the policy toward the
behaviour the rubric rewards.