Chapter 08

AI Agent

"Sometimes you gotta run before you can walk." — Tony Stark, Iron Man (2008)

The Iron Man metaphor is closer than it looks. Stark's suit is not intelligent; J.A.R.V.I.S. is. The suit is a tightly coupled set of tools (sensors, propulsion, weapons) that the AI orchestrates through a control loop with explicit safety interlocks. That is the same architecture this chapter develops: the LLM is J.A.R.V.I.S., the forecasters and policies of Chapters 4–6 are the suit, and Section 8-02's tool-calling layer is the mechanism that wires them together.

A working agent is rarely just "an LLM with some tools." It is a small control flow — retrieve, plan, call, check, retry, summarise — that needs the same engineering discipline as any other piece of production code. This chapter operationalises the LLMs of Chapter 7 by wiring them to tools, structured state, and approval gates.

What this chapter does

Two through-lines:

  • Tools are the substrate. A small, well-designed tool catalogue with a mediocre control loop usually beats a fancy control loop over a poor tool set. Section 8-02 spends most of its time on what makes a tool catalogue maintainable in production.
  • The graph is the program. Agents that loop, branch, or share state across many steps are best modelled as graphs (LangGraph) over a typed state object. Section 8-01 develops this with the patterns that survive past prototype-stage demos.

The chapter ends by handing the agentic layer the synthetic data of Chapter 10: agents author scenarios, the synthetic generator produces matching paths, and the rest of the stack (forecasters, policies) is evaluated on those scenarios.

This Chapter Covers

  • Composing prompt + tool chains with LangChain's typed runnables and structured outputs.
  • Graph-based orchestration with LangGraph: typed shared state, conditional edges, checkpointing, and the agent topologies that recur in finance (ReAct, plan-and-execute, supervisor + workers).
  • Function calling, schema design, and guardrails — the boring-but-load-bearing layer that turns LLM tool calls into a reliable operator pattern.
  • The Model Context Protocol (MCP) for portable, auditable tool integrations across providers.
  • Multi-agent topologies — when to multiply agents and which patterns (supervisor + workers, plan-and-execute, debate) actually pay off.
  • Production deployment — versioning, observability, approval gates, and audit logs.
  • Evaluation and regression — golden tasks, process fidelity, adversarial probes.

Contents

  • LangChain and LangGraph — composable building blocks, graph-based control flow, observability, and the version/approval discipline that production agents need.
  • Tool Calling and MCP — function calling, the Model Context Protocol, a representative tool catalogue, and the guardrails that make agents safe to ship.
  • Multi-Agent Topologies — supervisor + workers, plan-and-execute, debate, reflection, and the communication patterns that hold them together.
  • Production Deployment — the versioning, observability, approval-gate, and audit-log layer every shippable agent has to sit inside.
  • Evaluation and Benchmarks — golden tasks, rubric-based scoring, process fidelity, adversarial probes, and the regression discipline that catches drift.

LangChain and LangGraph

A working agent is rarely just "an LLM with some tools." It is a small control flow — retrieve, plan, call, check, retry, summarise — that needs the same engineering discipline as any other piece of production code. LangChain provides the composable building blocks (prompts, models, tools, memory) and LangGraph adds the graph-based orchestration that makes non-trivial flows tractable. This section covers the abstractions you actually use day-to-day for finance applications, the patterns that have held up across teams, and the observability hooks that make agentic systems debuggable.

When you actually need a framework

A common warning before reaching for a framework: if your application is a single LLM call with one or two tools and no looping, do not introduce LangChain or LangGraph. A Python function with explicit if/else and a plain SDK call from anthropic or openai is clearer and faster to debug. Frameworks earn their place when one or more of the following is true:

  • Multi-step planning — the agent decides what to do next based on intermediate results, possibly looping until a stop condition.
  • Branching workflows — different paths depending on tool outputs, user role, or compliance flags.
  • Stateful sessions — the agent retains memory across calls and the state is non-trivial to serialise.
  • Multiple agents — supervisor + worker patterns, debate, or parallel research that needs to be merged.

Anything below that bar is a function call.

LangChain core concepts

LangChain centres on a small set of composable abstractions:

  • PromptTemplate — parameterise instructions with named variables.
  • ChatModel — a uniform wrapper around providers (Anthropic, OpenAI, open-source local models) so swapping is a config change.
  • Tool — a callable with a typed schema that the model can invoke.
  • Chain / Runnable — composition operator (the | pipe) that threads prompts → models → parsers → side-effect-free post-processing.
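
A minimal chain that returns a typed risk memo:

# Core abstractions live in langchain_core; the older langchain.* paths are deprecated re-exports.
from langchain_core.prompts import ChatPromptTemplate
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import PydanticOutputParser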
from pydantic import BaseModel, Field
 
class RiskMemo(BaseModel):
    ticker: str
    drivers: list[str] = Field(description="Top 3-5 risk drivers")
    summary: str
 
parser = PydanticOutputParser(pydantic_object=RiskMemo)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a risk analyst. Output strict JSON matching the schema."),
    ("user", "Summarise {ticker} risk drivers given the context:\n\n{context}\n\n{format}"),
])
llm = ChatAnthropic(model="claude-opus-4-7", temperature=0)
chain = prompt.partial(format=parser.get_format_instructions()) | llm | parser
 
# retrieved_text is the context string from your retriever, defined elsewhere.
memo = chain.invoke({"ticker": "MSFT", "context": retrieved_text})

Two patterns to internalise:

  • Always pin to typed outputs. PydanticOutputParser (or LangChain's newer with_structured_output) raises on schema violations rather than returning silently malformed text. Production agents need this.
  • Compose with |. Each stage is a pure function except the LLM. Side effects (tool calls, writes to a database) belong in tools, not in the chain.
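
To make the first pattern concrete, the following is a minimal standard-library sketch of what "raises on schema violations" means in practice. It does not use LangChain; parse_risk_memo and its checks are hypothetical stand-ins for the validation that PydanticOutputParser performs for you.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskMemo:
    ticker: str
    drivers: list[str]
    summary: str


def parse_risk_memo(raw: str) -> RiskMemo:
    """Parse model output into a RiskMemo or raise ValueError (hypothetical helper)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"ticker", "drivers", "summary"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    drivers = data["drivers"]
    if not isinstance(drivers, list) or not all(isinstance(d, str) for d in drivers):
        raise ValueError("drivers must be a list of strings")
    if not isinstance(data["ticker"], str) or not isinstance(data["summary"], str):
        raise ValueError("ticker and summary must be strings")
    return RiskMemo(data["ticker"], drivers, data["summary"])
```

The point of the sketch is the failure behaviour: a malformed reply stops the chain with an exception the caller can retry on, instead of flowing downstream as plausible-looking text.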

LangGraph: graphs over chains

A linear chain falls apart as soon as you need a loop ("retry on tool error", "ask again if confidence is low") or a branch ("if the result mentions a security, run compliance check; else summarise"). LangGraph generalises chains to directed graphs of nodes with shared state:

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated, Sequence
from operator import add
 
class AgentState(TypedDict):
    question: str
    retrieved: list[str]
    plan: str
    answer: str
    errors: Annotated[Sequence[str], add]
 
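# vector_store, planner_chain, answer_chain and confidence() are assumed to be defined elsewhere.
def retrieve(state: AgentState) -> dict:
    docs = vector_store.search(state["question"], k=8)
    return {"retrieved": docs}  # nodes return partial updates, not the full state

def make_plan(state: AgentState) -> dict:
    return {"plan": planner_chain.invoke(state)}

def write_answer(state: AgentState) -> dict:
    return {"answer": answer_chain.invoke(state)}

def needs_more(state: AgentState) -> str:
    # Loop back to retrieval while confidence is low; add a retry cap in production.
    return "retrieve" if confidence(state["answer"]) < 0.6 else END

g = StateGraph(AgentState)
g.add_node("retrieve", retrieve)
# Node names must not reuse state keys, so the nodes are not called "plan" or "answer".
g.add_node("make_plan", make_plan)
g.add_node("write_answer", write_answer)
g.set_entry_point("retrieve")
g.add_edge("retrieve", "make_plan")
g.add_edge("make_plan", "write_answer")
g.add_conditional_edges("write_answer", needs_more, {"retrieve": "retrieve", END: END})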
 
agent = g.compile()
agent.invoke({"question": "What's the latest macro outlook?"})

Three properties LangGraph buys you:

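  • Typed shared state. Every node receives the same AgentState and returns partial updates that are merged. Because the state is a TypedDict, a static type checker can flag a misspelled or mistyped key before the graph runs instead of leaving it to surface at runtime.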
  • Conditional edges and loops. Real agents need to decide what to do next and stop conditions need to be explicit. The condition function returns a node name; LangGraph handles the dispatch.
  • Checkpointing. Persist AgentState to a backend (SQLite, Postgres, Redis) so a session resumes across processes. This is what production-ready trading agent systems \citep{langgraphserver2025} exploit for restartable runs.
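
The "partial updates that are merged" behaviour is easier to see in isolation. The sketch below is a rough standard-library model of the idea, not LangGraph's implementation: keys with a reducer (here errors, mirroring the Annotated[..., add] field above) accumulate, and every other key is overwritten. REDUCERS and merge_update are hypothetical names.

```python
from operator import add
from typing import Any, Callable

# Hypothetical reducer table: keys listed here accumulate, all others overwrite.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"errors": add}


def merge_update(state: dict, update: dict) -> dict:
    """Return a new state with one node's partial update folded in."""
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in merged:
            merged[key] = REDUCERS[key](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"question": "What's the latest macro outlook?", "errors": []}
state = merge_update(state, {"retrieved": ["doc-1", "doc-2"]})
state = merge_update(state, {"errors": ["tool timeout"]})
state = merge_update(state, {"errors": ["schema violation"], "plan": "summarise"})
```

After the three updates, errors holds both entries in order while plan and retrieved hold only their latest values, which is why a node can return just the keys it changed.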

Agent topologies

A handful of graph topologies cover most finance applications:

  • ReAct loop — single agent alternating think → tool call → observe until done. The default for analysis-heavy tasks (market summary, filings extraction).
  • Plan-and-execute — a planner writes a multi-step plan, then a worker executes each step. Useful when the task decomposes naturally and you want each step auditable.
  • Supervisor + workers — a central supervisor delegates to specialised agents (researcher, quant, compliance reviewer) and synthesises their outputs. The pattern in FINCON \citep{fincon2026} for financial decision-making and similar multi-agent systems.
  • Debate / critic — two agents argue alternative analyses; a third selects or merges. Trades latency for higher quality on judgement tasks.

The right topology is mostly about how independent the sub-tasks are. If they are independent, parallelise with workers; if they depend on each other, sequence them in a plan.

State and memory

Agent state is bigger than "previous messages." For finance specifically:

  • Working memory. The current AgentState — question, retrieved documents, partial plan, intermediate tool outputs.
  • Episodic memory. Past sessions with the same user, persisted across conversations. LangGraph's checkpointer plus a small embedding store keeps this manageable.
  • Long-term memory. Domain facts, style guides, firm-specific rules that should not be in every prompt but should surface when relevant. Live in a vector store and are retrieved on demand.

A frequent mistake is to push everything into the prompt context. Treat context as a scarce, latency-sensitive resource and put structural information into typed state, not the system message.

Observability

Agentic systems are much harder to debug than synchronous APIs because control flow is partly model-driven. The minimum observability stack:

  • Tracing. LangSmith, Phoenix, or OpenTelemetry-based tools capture the full graph trace per run: node inputs, outputs, latencies, token counts. A trace is the agentic equivalent of a stack trace.
  • Edge-level metrics. Latency and token count per node. Which node drove the cost? Which one slowed the run? Without this, optimisation is guesswork.
  • Failure-mode tagging. Errors fall into a small set of categories (tool timeout, schema violation, model refusal, infinite loop). Tag them at capture time so you can monitor rates over time.
  • Replay. Saved traces should be replayable so you can debug a past failure with the same inputs and (ideally) the same model version.
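
As an illustration of node-level metrics and failure-mode tagging, here is a minimal sketch using only the standard library. The in-memory TRACE list stands in for a real backend such as LangSmith or an OpenTelemetry exporter, and the FAILURE_TAGS mapping is an assumption made for the example.

```python
import time
from functools import wraps

TRACE: list[dict] = []  # in-memory stand-in for a real tracing backend

# Hypothetical mapping from exception type to failure-mode tag.
FAILURE_TAGS = {TimeoutError: "tool_timeout", ValueError: "schema_violation"}


def traced(node_name: str):
    """Decorator that records latency, status, and a failure tag for one node."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(state: dict) -> dict:
            start = time.perf_counter()
            record = {"node": node_name, "status": "ok", "failure_tag": None}
            try:
                return fn(state)
            except Exception as exc:
                record["status"] = "error"
                record["failure_tag"] = FAILURE_TAGS.get(type(exc), "unknown")
                raise  # tagging must not swallow the error
            finally:
                record["latency_s"] = time.perf_counter() - start
                TRACE.append(record)
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(state: dict) -> dict:
    return {"retrieved": ["doc-1"]}


@traced("answer")
def answer(state: dict) -> dict:
    raise TimeoutError("upstream model did not respond")
```

Tagging at capture time, inside the wrapper, is what lets a dashboard later count tool_timeout against schema_violation without re-parsing error strings.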

Versioning and shipping

Three discipline points that matter once an agent is in production:

  • Version the graph. A change to nodes, edges, or prompts gets a version number. Production traffic pins to a version; new versions go through staged rollout.
  • Separate the model version from the graph version. Both can change; log both. A model upgrade should trigger the regression suite even if no graph code changed.
  • Approval gates for actions with side effects. Anything that consumes resources outside the agent (sending email, executing a trade, modifying records) goes through an explicit approval node — human or rule-based — rather than being a tool the agent calls silently.

LangChain + LangGraph give the scaffolding; the discipline above is what makes the scaffolding hold up under production load. Section 8-02 builds on this with the tool calling patterns and the Model Context Protocol (MCP) that standardises how agents discover and invoke tools across providers.

Tool Calling and MCP

Tool calling is what turns an LLM from a text generator into an operator: the model emits a structured request to invoke a function, the runtime executes it, and the result feeds back into the conversation. Done well, this is the single biggest lever against hallucination — anything that has to be exact (a price, a P&L, a position) is delivered by a deterministic tool, not by the model's parametric memory. The Model Context Protocol (MCP) standardises how tools advertise themselves to models, how permissions are negotiated, and how observability data flows back. This section covers both the function-calling primitive and the MCP layer that makes agents portable across providers.

Tools as the agent's substrate

An agent's behaviour is largely a function of the set of tools it can call. A small, well-designed tool set with a mediocre control loop usually beats a fancy control loop over a poor tool set. Three classes of tool recur in finance applications:

  • Read tools. Look up a price, fetch a position, retrieve filings, query a vector store. Idempotent and safe; can be cached aggressively.
  • Compute tools. Price options, run a forecaster, evaluate a policy. Deterministic and side-effect-free given the same inputs; cache by hash of arguments.
  • Action tools. Send an email, file a trade, update a database. Have side effects, so they need explicit approval gates and audit logging.

A working rule: read and compute tools can be wired up directly; action tools never run without an approval node in between (Section 8-01).
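
One way to encode the three tool classes is a small dispatcher that caches compute tools by a hash of their arguments and refuses action tools that have not been approved. This is a sketch under assumptions: call_tool, args_key, and toy_var are hypothetical names, and a real system would take the approval flag from an approval node rather than from a keyword argument.

```python
import hashlib
import json
from typing import Any, Callable

_CACHE: dict[str, Any] = {}


def args_key(tool_name: str, kwargs: dict) -> str:
    """Stable hash of a tool call: same name and arguments give the same key."""
    payload = json.dumps({"tool": tool_name, "args": kwargs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def call_tool(name: str, kind: str, fn: Callable[..., Any],
              approved: bool = False, **kwargs: Any) -> Any:
    """Dispatch by tool class: cache compute tools, refuse unapproved action tools."""
    if kind == "action":
        if not approved:
            raise PermissionError(f"{name} needs approval before it runs")
        return fn(**kwargs)
    if kind == "compute":
        key = args_key(name, kwargs)
        if key not in _CACHE:
            _CACHE[key] = fn(**kwargs)
        return _CACHE[key]
    return fn(**kwargs)  # read tools are called directly in this sketch


calls = {"n": 0}


def toy_var(notional: float, vol: float) -> float:
    """Toy stand-in for a risk model; counts how often it really runs."""
    calls["n"] += 1
    return notional * vol


first = call_tool("risk.var", "compute", toy_var, notional=1_000_000, vol=0.02)
second = call_tool("risk.var", "compute", toy_var, notional=1_000_000, vol=0.02)
```

The second call is served from the cache, so toy_var runs once. Caching this way is only safe because compute tools are deterministic for the same inputs.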

Function-calling basics

Modern providers (Anthropic, OpenAI, Google) expose broadly similar interfaces: you declare tools as JSON Schemas, the model emits typed calls, and the runtime executes them and returns the result.

A complete tool definition that an Anthropic agent can call directly, using the provider SDK rather than LangChain:

import json
from datetime import date

import polars as pl
from anthropic import Anthropic

PRICES = pl.read_csv("data/prices_daily.csv", try_parse_dates=True)

# 1) declarative schema the model sees
TOOLS = [{
    "name": "latest_price",
    "description": "Return the last close (or as-of close) for an asset.",
    "input_schema": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string", "description": "Asset ticker, e.g. SPY"},
            "asof":   {"type": "string", "description": "ISO date; default = latest"},
        },
        "required": ["ticker"],
    },
}]

# 2) deterministic local implementation
def latest_price(ticker: str, asof: str | None = None) -> dict:
    rows = PRICES.filter(pl.col("Ticker") == ticker).sort("Date")
    if asof is not None:
        # Parse the ISO string so the comparison is date-to-date, not date-to-string.
        rows = rows.filter(pl.col("Date") <= date.fromisoformat(asof))
    if rows.is_empty():
        # Return a structured error instead of raising IndexError on an empty frame.
        return {"ticker": ticker, "error": "no price on or before the requested date"}
    last = rows.tail(1).to_dicts()[0]
    return {"ticker": ticker, "date": str(last["Date"]), "close": last["AdjClose"]}

# 3) agent loop that calls the tool when the model asks for it
client = Anthropic()
messages = [{"role": "user", "content": "What was SPY's close on 2024-12-31?"}]
for _ in range(5):  # hard cap so a confused model cannot loop forever
    resp = client.messages.create(
        model="claude-opus-4-7", max_tokens=512, tools=TOOLS, messages=messages
    )
    if resp.stop_reason != "tool_use":
        print("".join(b.text for b in resp.content if b.type == "text"))
        break
    # Answer every tool_use block in the turn, not just the first one.
    results = [
        {"type": "tool_result", "tool_use_id": b.id,
         "content": json.dumps(latest_price(**b.input))}
        for b in resp.content if b.type == "tool_use"
    ]
    messages += [
        {"role": "assistant", "content": resp.content},
        {"role": "user", "content": results},
    ]

The schema declares what the model can request; the implementation controls what actually runs; the agent loop alternates the two. MCP generalises the schema and the transport so the same tool can be exposed to any compatible host without rewriting the agent loop.

The four ingredients are always the same:

  1. Schema. A typed input schema (Pydantic, JSON Schema). The model sees the schema; bad inputs raise locally before any external call.
  2. Implementation. A pure function (or a function with audited side effects). Behaviour must be deterministic given the same inputs.
  3. Description. Plain English that tells the model when to use the tool. Models pick tools by description; sparse descriptions cause underuse.
  4. Output type. Structured (dict, Pydantic model). The model is better at composing structured outputs than free text.

A common but subtle bug is over-permissive descriptions: tools whose docstrings hint at capabilities they don't have. Models will call them optimistically and fail downstream. Be precise about scope.

The Model Context Protocol (MCP)

MCP is an open protocol for connecting language models to external tools and data sources \citep{mcp2024spec}. Conceptually it sits between the model and the tools, replacing ad-hoc per-provider integrations with a single contract:

  • Tool servers advertise their capabilities (name, schema, auth) over a standard transport (stdio for local servers; HTTP with server-sent events or streamable HTTP for remote ones).
  • Hosts (Claude Code, ChatGPT Desktop, custom agents) discover tool servers, negotiate which tools to expose per session, and route the model's calls to them.
  • Clients are the per-session connections that handle tool invocation, auth, and observability.

Three concrete benefits:

  • Portability. The same agent runs against Anthropic, OpenAI, or any MCP-compatible runtime — pick the provider that fits the workload, not the provider's lock-in.
  • Least-privilege security. A session exposes a subset of tools with explicit allow/deny. Sensitive tools (file write, trade execution) go behind an explicit grant, not a default-allow flag.
  • Observability. Standard tracing fields (latency, status, request ID) flow back from every tool call so the host can build a uniform trace.

In practice, MCP is most useful when you build one tool server that exposes firm capabilities (positions, prices, forecasters) and let multiple agents — research, trading, compliance — connect to it with different permission scopes.
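
To show what travels over the wire, the sketch below builds the JSON-RPC 2.0 request an MCP client sends for the protocol's tools/call method and filters an advertised tool list down to one agent's grants. The message shape follows the MCP specification; the SCOPES table, the helper names, and the host-side filtering are assumptions about how a host might implement least privilege, not part of the protocol.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids


def tools_call_request(name: str, arguments: dict) -> dict:
    """Build the JSON-RPC 2.0 request an MCP client sends to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical per-agent grants held by the host.
SCOPES = {
    "research": {"prices.history", "filings.search"},
    "trading": {"prices.history", "optimizer.solve"},
}


def visible_tools(agent: str, advertised: list[dict]) -> list[dict]:
    """Expose only the tools this agent is granted; unknown agents see nothing."""
    allowed = SCOPES.get(agent, set())
    return [tool for tool in advertised if tool["name"] in allowed]


advertised = [{"name": n} for n in ("prices.history", "filings.search", "optimizer.solve")]
research_view = visible_tools("research", advertised)
request = tools_call_request("prices.history", {"ticker": "SPY", "range": "5y"})
wire = json.dumps(request)
```

Because the research and trading agents share one server but see different tool lists, a permission change is a one-line edit to the grants table and never touches the agents themselves.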

A representative tool catalogue

A reasonable starter catalogue for a portfolio-research agent:

Category  Tool                                     Purpose
Data      prices.history(ticker, range)            Historical OHLCV
Data      filings.search(query, range)             Filings RAG
Data      news.recent(query, since)                News RAG with time bound
Compute   forecaster.predict(panel, horizon)       Call Chapter 4 model
Compute   optimizer.solve(mu, sigma, constraints)  Mean-variance / RL policy
Compute   risk.var(portfolio, horizon, alpha)      Risk metric
Action    report.draft(template, payload)          Prepare a draft (no send)
Action    notify.slack(channel, msg)               Send notification (approval gated)

Two design choices worth noting:

  • Compute tools wrap, not duplicate, your existing code. The forecaster called by forecaster.predict is the same one your humans call in notebooks. The agent does not get a parallel implementation.
  • Action tools are the ones that need policy. Read and compute tools can be exposed by default; action tools require explicit grants and a human-readable explanation in the trace.

Guardrails

Five guardrails sit between every tool call and its execution:

  • Schema validation. Reject malformed payloads before they hit the implementation.
  • Permission check. Verify the calling session has the right grants for this tool and for these arguments. Per-argument scoping (e.g. read-only on certain accounts) catches mistakes that per-tool scoping misses.
  • Rate limiting. Cap calls per session and per user. Models in loops hit a limit before they hit a billing surprise.
  • Result sanitisation. Strip sensitive fields (PII, internal IDs) before returning to the model. The model's context is the weakest privacy boundary — assume anything in it might end up in a log.
  • Audit log. Every tool call → caller, timestamp, arguments, result hash, latency, status. The audit log is the single most useful artefact when a regulator asks how a recommendation was produced.
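
The five guardrails can be sketched as one wrapper around every tool call. This is a simplified standard-library illustration: the grants table, the rate limit, the sensitive-field list, and the function names are all hypothetical, schema validation is reduced to a required-argument check, and per-argument permission scoping is left out.

```python
import hashlib
import json
import time
from collections import Counter

AUDIT_LOG: list[dict] = []
CALLS: Counter = Counter()
MAX_CALLS_PER_SESSION = 3                    # hypothetical limit
GRANTS = {"sess-1": {"positions.read"}}      # hypothetical session grants
SENSITIVE_FIELDS = {"account_id", "client_name"}


def guarded_call(session: str, tool: str, fn, args: dict, required: set) -> dict:
    """Run one tool call through the five guardrails, in order."""
    start = time.perf_counter()
    status, result = "ok", {}
    try:
        missing = required - set(args)                      # 1. schema validation
        if missing:
            raise ValueError(f"missing arguments: {sorted(missing)}")
        if tool not in GRANTS.get(session, set()):          # 2. permission check
            raise PermissionError(f"{session} has no grant for {tool}")
        CALLS[session] += 1                                 # 3. rate limiting
        if CALLS[session] > MAX_CALLS_PER_SESSION:
            raise RuntimeError("rate limit exceeded")
        raw = fn(**args)
        # 4. result sanitisation: drop sensitive fields before the model sees them
        result = {k: v for k, v in raw.items() if k not in SENSITIVE_FIELDS}
        return result
    except Exception as exc:
        status = type(exc).__name__
        raise
    finally:                                                # 5. audit log, pass or fail
        body = json.dumps(result, sort_keys=True, default=str).encode()
        AUDIT_LOG.append({
            "caller": session, "tool": tool, "timestamp": time.time(),
            "arguments": args, "status": status,
            "result_hash": hashlib.sha256(body).hexdigest(),
            "latency_s": time.perf_counter() - start,
        })


def read_positions(book: str) -> dict:
    """Toy read tool; the account_id field stands in for sensitive data."""
    return {"book": book, "net_delta": 1.25e6, "account_id": "ACC-001"}
```

Writing the audit entry in the finally clause means rejected calls are logged as well as successful ones, which is usually what an audit needs.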

End-to-end flow

A representative hedging recommendation flow:

  1. User: "Recommend a hedge for the equity book against a 10% S&P drawdown."
  2. Agent → MCP portfolio.read(book="equity_us") → current positions.
  3. Agent → MCP prices.history(SPY, 5y) → vol/correlation context.
  4. Agent → MCP pricing.greeks(SPX_put_options, ...) → candidate hedges with Greeks.
  5. Agent → MCP optimizer.solve(...) → recommended overlay.
  6. Agent → user: structured recommendation referencing each tool output, with explicit assumptions and the trace ID.
  7. Approval node: human approves before any report.send action.

Every step is logged; every numerical claim traces to a tool output; the agent's prose is narration over the tools, not original calculation.

When to skip MCP

MCP is overhead for small applications. Use it when:

  • You need provider portability (Anthropic + OpenAI + a local model).
  • You have multiple agents sharing the same tool set with different permissions.
  • You need a unified audit log across agents.

Otherwise, direct provider tool-calling APIs are simpler and good enough for the first version. MCP earns its place when the system grows to the point that integration complexity becomes a tax — and that day comes sooner than most teams expect.

Tool calling with MCP is what turns LLMs from clever summarisers into reliable operators. The synthetic-data chapter (Chapter 10) closes the book by giving the agent, along with the forecasters and policies it orchestrates, a way to test itself against scenarios that have never occurred.

Multi-Agent Topologies

A single agent with a good tool catalogue (Sections 8-01 and 8-02) is the right starting point for most finance use cases. Multi-agent systems become the right tool when the work itself decomposes naturally into specialised sub-tasks — research and write-up, debate and judgement, plan and execute. This section covers the topologies that recur in financial workflows and the patterns that keep them debuggable.

When to multiply agents

A useful test before adding a second agent: can the same outcome be reached by giving the existing agent one more tool? If yes, do that instead. Multi-agent systems are interesting when the boundary between sub-tasks is conceptual (research vs. compliance vs. execution) rather than just functional. Three concrete shapes:

  • Specialist diversity. Each agent has its own system prompt, tool subset, and even its own underlying model. A research agent uses a long-context model and retrieval tools; a compliance agent uses a smaller model with strict refusal rules and a regulatory knowledge base.
  • Independent perspectives. The same task is given to multiple agents to surface disagreement. Useful when the answer is a judgement and false confidence is the real risk.
  • Pipeline decomposition. A planner produces a structured plan; a worker executes each step. The planner does not need execution tools and the worker does not need to think strategically.

Anything that can fit into a single ReAct loop without contortions should stay there.

Supervisor + workers

The most common multi-agent topology in finance applications. A supervisor agent decides which worker to delegate to, the worker executes, the supervisor synthesises and decides the next step.

  • Supervisor. Read the user goal, current state, and prior worker outputs. Emit a directive: which worker to call next, with what prompt, or "stop and synthesise".
  • Workers. Each is a specialist with a narrow tool set. Examples in a portfolio-research workflow: data-puller (pricing, fundamentals, news), modeller (calls the forecasters of Chapter 4 / policies of Chapter 5 as tools), narrator (drafts the write-up).
  • Coordination state. The supervisor maintains shared memory the workers can read; workers append their results. LangGraph's StateGraph models this naturally.

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated, Literal
from operator import add
 
class TeamState(TypedDict):
    goal: str
    history: Annotated[list[dict], add]
    next: Literal["data", "model", "writer", "done"]
 
def supervisor(state: TeamState) -> dict:
    decision = supervisor_llm.invoke(state)            # returns {"next": "data" | ...}
    return {"next": decision["next"]}
 
def data_worker(state: TeamState) -> dict:
    out = data_agent.invoke(state["history"])
    return {"history": [{"role": "data", "content": out}]}
 
def model_worker(state: TeamState) -> dict:
    out = model_agent.invoke(state["history"])
    return {"history": [{"role": "model", "content": out}]}
 
def writer_worker(state: TeamState) -> dict:
    out = writer_agent.invoke(state["history"])
    return {"history": [{"role": "writer", "content": out}]}
 
g = StateGraph(TeamState)
for name, fn in (
    ("supervisor", supervisor),
    ("data", data_worker),
    ("model", model_worker),
    ("writer", writer_worker),
):
    g.add_node(name, fn)
 
g.set_entry_point("supervisor")
for name in ("data", "model", "writer"):
    g.add_edge(name, "supervisor")
g.add_conditional_edges(
    "supervisor",
    lambda s: s["next"],
    {"data": "data", "model": "model", "writer": "writer", "done": END},
)
team = g.compile()

Three things to monitor when this topology is in production:

  • Loop termination. The supervisor must reliably emit "done". Without an explicit max-iteration cap, agents can spin.
  • Worker scope creep. A specialist that grows tools beyond its original scope often turns into the de facto generalist; resist by keeping the tool list per worker pinned and reviewable.
  • Cost per goal. Supervisor + worker rounds compound. Track total token and tool-call cost per goal completion; cap on outliers.
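
The loop-termination point is worth making concrete. The sketch below is a framework-free supervisor loop with an explicit round cap; run_team, the scripted supervisors, and MAX_ROUNDS are hypothetical names for illustration. LangGraph offers a comparable safety net through the recursion_limit setting in the run config, but an application-level cap gives you a clean status to report instead of an exception.

```python
from typing import Callable

MAX_ROUNDS = 6  # hypothetical cap on supervisor decisions per goal


def run_team(goal: str, supervisor: Callable, workers: dict,
             max_rounds: int = MAX_ROUNDS) -> dict:
    """Alternate supervisor and worker turns until 'done' or the cap is hit."""
    history: list[dict] = []
    for round_no in range(1, max_rounds + 1):
        nxt = supervisor(goal, history)
        if nxt == "done":
            return {"status": "done", "rounds": round_no, "history": history}
        history.append({"role": nxt, "content": workers[nxt](history)})
    return {"status": "cap_reached", "rounds": max_rounds, "history": history}


def scripted_supervisor(goal: str, history: list) -> str:
    """Well-behaved toy supervisor: data, then model, then writer, then stop."""
    return ["data", "model", "writer", "done"][len(history)]


def stuck_supervisor(goal: str, history: list) -> str:
    """Toy supervisor that never emits 'done'."""
    return "data"


WORKERS = {
    name: (lambda history, name=name: f"{name} output")
    for name in ("data", "model", "writer")
}

ok_run = run_team("hedge review", scripted_supervisor, WORKERS)
capped_run = run_team("hedge review", stuck_supervisor, WORKERS)
```

The capped run returns a cap_reached status with its partial history, which is the signal to alert on and the record to inspect afterwards.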

Plan-and-execute

A sibling of supervisor + workers: a planner produces a multi-step plan up front, then a worker executes each step in sequence without re-planning. Better than ReAct on long-horizon tasks because the planner keeps the global view; ReAct can lose it.

The plan is a structured artefact:

[
  {"step": 1, "tool": "filings.search", "args": {"q": "...", "since": "..."}},
  {"step": 2, "tool": "extract.kpis", "args": {"doc_ids": "$step1.ids"}},
  {"step": 3, "tool": "forecaster.predict", "args": {"panel": "$step2.panel"}},
  {"step": 4, "type": "summarise"}
]

The worker resolves $stepN.x references between steps. When a step fails or returns unexpected output, the worker triggers re-planning on the remaining steps rather than restarting. This is the recipe behind the FINCON multi-agent system for financial decision-making \citep{fincon2026}.
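
A minimal sketch of the reference resolution described above, assuming each step returns a dict and a reference has the exact form $stepN.field. The helper names and the toy tools are hypothetical, and the sketch deliberately omits re-planning: a missing reference simply raises KeyError, which is where a real worker would hand control back to the planner.

```python
import re

_REF = re.compile(r"^\$step(\d+)\.(\w+)$")


def resolve_args(args: dict, results: dict) -> dict:
    """Replace '$stepN.field' strings with the matching earlier step output."""
    resolved = {}
    for key, value in args.items():
        match = _REF.match(value) if isinstance(value, str) else None
        if match:
            step, field = int(match.group(1)), match.group(2)
            resolved[key] = results[step][field]  # KeyError means the plan is broken
        else:
            resolved[key] = value
    return resolved


def execute_plan(plan: list, tools: dict) -> dict:
    """Run tool steps in order, keeping each step's output for later references."""
    results: dict = {}
    for item in plan:
        if item.get("type") == "summarise":
            continue  # summarisation is left to the LLM in this sketch
        args = resolve_args(item["args"], results)
        results[item["step"]] = tools[item["tool"]](**args)
    return results


# Toy tools standing in for the real catalogue.
TOOLS = {
    "filings.search": lambda q, since: {"ids": ["10-K-2024", "10-Q-2025Q1"]},
    "extract.kpis": lambda doc_ids: {"panel": {"n_docs": len(doc_ids)}},
}
PLAN = [
    {"step": 1, "tool": "filings.search", "args": {"q": "margin guidance", "since": "2024-01-01"}},
    {"step": 2, "tool": "extract.kpis", "args": {"doc_ids": "$step1.ids"}},
    {"step": 3, "type": "summarise"},
]
results = execute_plan(PLAN, TOOLS)
```

Keeping the plan as plain data is what makes it reviewable: the same list can be shown to an approver before execute_plan is ever called.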

The planner-worker split also makes auditing easier: the plan can be reviewed (and approved or rejected) before any tool with side effects runs. For finance applications this is a load-bearing feature.

Debate / critic

Two agents argue alternative analyses; a third selects or merges. Useful for tasks where the failure mode is an over-confident wrong answer: investment-thesis evaluation, model-risk reviews, regulatory-interpretation questions.

  • Proposer. Generates a candidate analysis with stated assumptions.
  • Critic. Reads the proposal and finds weaknesses: missing data, unsupported claims, alternative interpretations.
  • Judge. Reads both and either picks one, merges, or asks for another round.

A few rounds of debate (three is a reasonable budget) tend to beat a single ReAct pass on judgement tasks, although the size of the gain depends on the task. Two caveats:

  • Cost. Each round is at least 3× the per-pass cost. Reserve for tasks where the quality matters more than the latency.
  • Convergence. Without a judge that can break ties, debates can oscillate. The judge prompt should explicitly allow "merge" and "neither — try again" rather than just "pick A or B".
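
Both caveats show up clearly in a small control loop. The sketch below assumes the proposer, critic, and judge are plain callables and that the judge returns a dict with a decision of "accept", "merge", or "retry"; those names and the toy agents are assumptions for illustration. It counts model calls so the cost of extra rounds is visible.

```python
from typing import Callable


def debate(task: str, proposer: Callable, critic: Callable, judge: Callable,
           max_rounds: int = 3) -> dict:
    """Run proposer/critic/judge rounds until the judge settles or rounds run out."""
    calls = 1
    proposal = proposer(task, None)  # first draft, no critique yet
    for round_no in range(1, max_rounds + 1):
        critique = critic(task, proposal)
        verdict = judge(task, proposal, critique)
        calls += 2
        if verdict["decision"] in ("accept", "merge"):
            return {"answer": verdict["text"], "rounds": round_no, "calls": calls}
        proposal = proposer(task, critique)  # "retry": revise against the critique
        calls += 1
    return {"answer": None, "rounds": max_rounds, "calls": calls}


def toy_proposer(task, critique):
    return "thesis v1" if critique is None else "thesis v2"


def toy_critic(task, proposal):
    return f"weakness in {proposal}"


def toy_judge(task, proposal, critique):
    if proposal == "thesis v1":
        return {"decision": "retry", "text": ""}
    return {"decision": "merge", "text": f"{proposal} + caveats"}


result = debate("Is the margin story intact?", toy_proposer, toy_critic, toy_judge)
```

In this toy run the judge asks for one retry and then merges, so the debate settles in two rounds at six model calls. A judge that could only pick a side would have no way to express either outcome.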

Reflection

A specialised debate where the proposer and critic are the same agent in two passes. Pass 1 produces a draft; pass 2 ("now critique this for the points you would push back on") revises it. Cheaper than true debate; modest quality gain on most tasks. Good default for write-up tasks where the cost budget does not support full debate.

Hierarchical agents

Supervisors of supervisors. Useful when the work has natural domain boundaries — research / risk / execution as separate teams, each with its own internal tool set, coordinated by a top-level orchestrator.

Risks scale with depth:

  • Latency multiplies with depth. A three-level hierarchy means every user request waits on three layers of LLM calls before any real work happens.
  • Information loss across boundaries. Each summarisation step drops detail; by depth 3 the top-level orchestrator may be making decisions on a heavily lossy view.

A reasonable rule: do not exceed two levels of hierarchy. If a third level seems necessary, split the system into separate workflows triggered independently by the user rather than nesting.

Communication patterns

Four ways agents pass information to each other, in increasing robustness:

  • Free-text. Each agent emits prose; the next agent reads it. Lowest friction, lossy, hard to audit.
  • Structured records. JSON with a fixed schema (title, body, citations, confidence). Auditable; the schema is the contract.
  • Shared memory. A blackboard the agents read and write through a tool interface. Useful for long-running collaborations; needs garbage collection to stop the blackboard from growing without bound.
  • Typed events. Each message carries a type (research_finding, risk_flag, recommendation) plus a typed payload. The router uses the type to decide what runs next. The most robust pattern for production agents.

Production multi-agent systems converge on typed events plus shared memory for persistence; debate-style proof-of-concepts live happily on free-text alone.
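
A minimal sketch of the typed-events pattern, using frozen dataclasses as the typed payloads and a lookup table as the router. The event classes reuse the type names mentioned above; the fields and the destination node names (risk_review, compliance, approval_gate) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchFinding:
    ticker: str
    claim: str
    citations: tuple


@dataclass(frozen=True)
class RiskFlag:
    ticker: str
    severity: str


@dataclass(frozen=True)
class Recommendation:
    ticker: str
    action: str


# Hypothetical routing table: the event type alone decides what runs next.
ROUTES = {
    ResearchFinding: "risk_review",
    RiskFlag: "compliance",
    Recommendation: "approval_gate",
}


def route(event: object) -> str:
    """Return the next node for an event, refusing anything untyped."""
    try:
        return ROUTES[type(event)]
    except KeyError:
        raise TypeError(f"no route for {type(event).__name__}") from None


finding = ResearchFinding("MSFT", "margin pressure in cloud", ("10-K-2024",))
next_node = route(finding)
```

Because the router never reads free text, a message that does not fit a declared type fails immediately instead of being interpreted optimistically by the next agent.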

What this section adds

The single-agent loop in 8-01 plus the tool-calling layer in 8-02 covers most finance applications. Multi-agent topologies become the right tool when the work genuinely splits along a conceptual axis (research vs. compliance, propose vs. critique, plan vs. execute) and the cost of getting it wrong justifies the cost of running multiple agents. Section 8-04 covers the production engineering — versioning, observability, approval gates — that any multi-agent system has to sit inside before it touches a real workflow.

Production Deployment

A working agent in a notebook is one thing. A working agent that runs unattended against production data, costs predictable money, fails loudly on regressions, and survives a regulator's audit is a much harder thing. This section is about the operational layer between those two. Most of the work here is unglamorous infrastructure (versioning, observability, approval gates, audit logs), but for finance applications it is what separates a cool demo from a system worth depending on.

Versioning everything

An agent's behaviour is the joint product of at least four moving pieces, and each needs an explicit version pinned for any production trace to be reproducible:

  • Graph version. The LangGraph (or equivalent) graph definition — nodes, edges, conditional routing. Bump on any structural change.
  • Prompt version. System prompts, few-shot examples, tool descriptions. The single thing that changes most often; treat it like source code.
  • Tool version. Every tool's schema and implementation. A schema change is a breaking change; bump the major version.
  • Model version. The exact model snapshot used (e.g. claude-opus-4-7, not claude-opus-latest). Hosted models can silently change behaviour within a stated family.

Concretely:

AGENT_VERSION = {
    "graph":  "v0.4.2",
    "prompt": "v3",
    "tools":  {"prices": "v1", "filings": "v2", "optimizer": "v1"},
    "model":  "claude-opus-4-7-20251101",
}

Every trace logs the AGENT_VERSION block. A bug report that names the version is reproducible; one without is folklore.
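
One way to make "every trace logs the version" cheap to enforce is to stamp each trace with the version block plus a short fingerprint derived from it. The sketch below is an assumption about how this could be done: version_fingerprint and new_trace are hypothetical helpers, and the fingerprint length is arbitrary.

```python
import hashlib
import json

AGENT_VERSION = {
    "graph": "v0.4.2",
    "prompt": "v3",
    "tools": {"prices": "v1", "filings": "v2", "optimizer": "v1"},
    "model": "claude-opus-4-7-20251101",
}


def version_fingerprint(version: dict) -> str:
    """Short, order-independent hash of the version block."""
    canonical = json.dumps(version, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


def new_trace(run_id: str, version: dict) -> dict:
    """Start a trace record that carries the full version block and its fingerprint."""
    return {
        "run_id": run_id,
        "agent_version": version,
        "version_fingerprint": version_fingerprint(version),
        "events": [],
    }


trace = new_trace("run-0001", AGENT_VERSION)
```

The fingerprint changes whenever any of the four pieces changes, so grouping failures by fingerprint in a dashboard shows at a glance whether a regression arrived with a new prompt, tool, graph, or model.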

Observability stack

Three layers, each with a different audience:

  • Per-call telemetry. LangSmith, Phoenix, OpenTelemetry — capture every LLM call, tool invocation, and routing decision with inputs, outputs, latency, and token count. The audience is the engineer debugging a single failed run.
  • Aggregated metrics. Per-day or per-week dashboards: success rate, average tool-call count, p50/p95 latency, cost per run, distribution of failure modes. The audience is whoever owns the service.
  • Anomaly alerts. Spike in error rate, drop in success rate, cost per run trending upward. Page on these, not on per-call failures.

A useful internal contract: the trace ID for any agent run must be linkable from any user-facing artefact (chat reply, generated report, recommendation). When a user asks "why did the agent say X", the answer is a trace URL, not a guess.

Approval gates and circuit breakers

Anything with a cost of being wrong belongs behind an explicit gate. Three patterns recur in finance:

  • Human approval. The agent proposes a structured action; a human reviews and clicks approve. Suited for trade execution, large capital reallocation, anything client-facing.
  • Rule-based gate. A deterministic check evaluates the agent's proposed action. Reject if it violates an invariant: position outside risk limits, citation pointing nowhere, JSON failing schema validation.
  • Circuit breaker. Per-day caps on total tool calls, total external spend, total trade volume. When tripped, the agent is paused (with a notification) until a human inspects.

These three compose. A typical production wiring: the agent emits a proposal → rule-based gate filters obvious failures → human reviews the rest → circuit breaker caps total daily approvals as a backstop.

A working rule: if removing the gate would let the agent move money or send messages without human visibility, the gate stays.
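
The three gates can be composed in a few lines. The following is a sketch under stated assumptions: Proposal, CircuitBreaker, rule_gate and review are illustrative names, the only invariant checked is a notional limit, and the human reviewer is modelled as a callable that returns True or False.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """A structured action the agent wants to take (illustrative fields)."""
    ticker: str
    notional: float


@dataclass
class CircuitBreaker:
    """Caps approvals per day; a human resets the counter after inspection."""
    max_daily_approvals: int
    approved_today: int = 0

    def allow(self) -> bool:
        return self.approved_today < self.max_daily_approvals

    def record(self) -> None:
        self.approved_today += 1


def rule_gate(p: Proposal, max_notional: float) -> bool:
    """Deterministic invariant: reject proposals outside the risk limit."""
    return 0 < p.notional <= max_notional


def review(p: Proposal, breaker: CircuitBreaker, human_approves, max_notional: float) -> str:
    """Run the gates in order and return the decision."""
    if not rule_gate(p, max_notional):
        return "rejected_by_rule"
    if not breaker.allow():
        return "paused"  # cap reached: nothing proceeds until a human inspects
    if not human_approves(p):
        return "rejected_by_human"
    breaker.record()
    return "approved"
```

In this sketch the breaker is checked before the human step so that reviewers are not asked to approve actions that could not proceed anyway; the counter only advances on an actual approval.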

Cost and latency monitoring

Agents have a habit of spiralling on edge-case inputs: long contexts, recursive tool calls, planning loops that don't terminate. Without monitoring, a single bad prompt can spend an order of magnitude more than a normal run before the team notices.

The minimum dashboard for production agents:

  • Cost per run, by run type. Distribution, not just average. The 90th percentile is where the surprises live.
  • Tool calls per run. A run that uses 50 tool calls when the median is 5 is a runaway loop, even if it eventually completed.
  • Time-to-first-action and time-to-completion. First-action time measures how long the user waits before any visible progress; completion time measures total wall clock.
  • Cost per dollar of value delivered. For tasks where the value can be quantified (a fund-flow report, a research note), track the ratio. If the agent costs more to run than the deliverable is worth, it is uneconomical at scale.

Per-user / per-team cost caps belong in the same dashboard. Soft caps that warn before hard caps that pause; hard caps that pause before bills become surprises.
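
The dashboard quantities above are simple to compute from per-run records. A minimal sketch using only the standard library follows; the function names, the factor of 5 for flagging runaway runs, and the cap values are assumptions chosen for illustration.

```python
import statistics


def p90(values: list[float]) -> float:
    """90th percentile via statistics.quantiles (needs at least two points)."""
    return statistics.quantiles(values, n=10, method="inclusive")[-1]


def runaway_runs(tool_calls: dict[str, int], factor: float = 5.0) -> list[str]:
    """Trace IDs whose tool-call count exceeds `factor` times the median."""
    median = statistics.median(tool_calls.values())
    return [tid for tid, n in tool_calls.items() if n > factor * median]


def cap_status(spend: float, soft_cap: float, hard_cap: float) -> str:
    """Soft cap warns; hard cap pauses."""
    if spend >= hard_cap:
        return "pause"
    if spend >= soft_cap:
        return "warn"
    return "ok"
```

For example, runaway_runs({"a": 5, "b": 4, "c": 6, "d": 50}) flags only "d", since 50 is well above five times the median of 5.5.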

Rollouts and feature flags

Agentic behaviour is hard to QA exhaustively. A staged rollout cuts the blast radius of regressions:

  • Canary. New version runs on 1–5% of traffic for a fixed window. Compare success rate, error rate, cost on the canary vs. the current version. Promote if metrics are equal-or-better.
  • A/B. Two versions running side by side with metrics emitted per arm. Useful for prompt changes where the question is "is the new prompt actually better".
  • Shadow. New version runs on every input but its output is logged, not delivered. Most useful for graph or model changes that need offline comparison before any user sees them.

A graph-version + prompt-version + model-version tuple is what gets flagged. Rolling back means resetting the live tuple to a prior known-good combination.
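
A canary split can be made deterministic by hashing a stable key, so the same user always lands in the same arm. The sketch below assumes the version tuple is a plain Python tuple and that the three metrics named in the canary bullet are available as dictionaries; bucket, pick_version and should_promote are hypothetical names.

```python
import hashlib


def bucket(key: str) -> float:
    """Map a stable key (for example a user ID) to a number in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def pick_version(key: str, live: tuple, canary: tuple, canary_share: float) -> tuple:
    """Route a fixed share of traffic to the canary tuple, deterministically."""
    return canary if bucket(key) < canary_share else live


def should_promote(live: dict, canary: dict) -> bool:
    """Promote only if the canary is equal-or-better on every tracked metric."""
    return (
        canary["success_rate"] >= live["success_rate"]
        and canary["error_rate"] <= live["error_rate"]
        and canary["cost_per_run"] <= live["cost_per_run"]
    )
```

Rolling back in this setup amounts to setting canary_share to 0.0, which sends every key to the live tuple.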

Audit log architecture

Regulators and internal model-risk reviewers want answers to questions like:

  • "What inputs did the agent see when it produced this output?"
  • "Which tools were invoked, with what arguments?"
  • "Was a human involved in approving this action?"
  • "Could this run be reproduced today?"

The log structure that answers these:

{"ts": "...", "trace_id": "...", "kind": "input",          "user_id": "...", "payload": {...}}
{"ts": "...", "trace_id": "...", "kind": "agent_version",  "agent_version": {...}}
{"ts": "...", "trace_id": "...", "kind": "retrieval",      "tool": "filings.search", "args": {...}, "result_hash": "..."}
{"ts": "...", "trace_id": "...", "kind": "llm_call",       "model": "...", "prompt_hash": "...", "completion_hash": "..."}
{"ts": "...", "trace_id": "...", "kind": "tool_call",      "tool": "optimizer.solve", "args": {...}, "result_hash": "..."}
{"ts": "...", "trace_id": "...", "kind": "approval",       "approver": "...", "decision": "approve"}
{"ts": "...", "trace_id": "...", "kind": "output",         "payload": {...}}

Three properties matter:

  • Append-only. Logs are written once and never edited. Use immutable storage (cloud object storage with object lock) or a log-structured database with retention policies.
  • Hashed payloads. Large inputs and outputs (filings, model predictions) are stored separately by hash; the log references the hash. Keeps log size manageable; immutability still holds.
  • Reproducible. Given a trace_id, replaying the same input through the same agent version with the same tools produces the same output (modulo non-determinism logged separately as seeds).

The audit log is the artefact you defend in front of a regulator. It is also what lets a developer reproduce a six-week-old bug.
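
The append-only and hashed-payload properties can be illustrated with an in-memory stand-in. This is a sketch, not a storage design: a real deployment would back it with immutable object storage as described above, and the AuditLog class and its method names are assumptions.

```python
import hashlib
import json
import time


class AuditLog:
    """In-memory stand-in for append-only storage plus a content-addressed blob store."""

    def __init__(self) -> None:
        self._lines: list[str] = []         # append-only JSONL records
        self._blobs: dict[str, bytes] = {}  # large payloads keyed by SHA-256

    def put_blob(self, payload: dict) -> str:
        """Store a large payload once and return its hash."""
        raw = json.dumps(payload, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(raw).hexdigest()
        self._blobs.setdefault(digest, raw)
        return digest

    def append(self, trace_id: str, kind: str, **fields) -> dict:
        """Write one record; there is deliberately no update or delete method."""
        record = {"ts": time.time(), "trace_id": trace_id, "kind": kind, **fields}
        self._lines.append(json.dumps(record, sort_keys=True))
        return record

    def replay(self, trace_id: str) -> list[dict]:
        """All records for one trace, in write order."""
        rows = [json.loads(line) for line in self._lines]
        return [r for r in rows if r["trace_id"] == trace_id]
```

Because payloads are keyed by their own hash, storing the same filing twice costs nothing extra, and any later change to a stored payload would no longer match the hash recorded in the log.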

Failure modes that show up only in production

A working list of the classes that recur:

  • Long-context degradation. Once the agent's running context exceeds the model's training distribution sweet spot, behaviour changes — usually toward verbosity, occasionally toward outright errors. Cap context length at a fraction of the model's max.
  • Stale retrieval. A document is indexed once, then changes; the agent retrieves the stale version and acts on outdated content. Mitigate with periodic re-indexing plus a "last verified" timestamp on retrieved chunks.
  • Tool result drift. A tool's output schema or units changed upstream; agent prompts assuming the old shape break. Add schema validation on tool outputs as well as inputs.
  • Adversarial prompt-injection through retrieved docs. A document in the retrieval corpus contains instructions that hijack the agent. Treat retrieved text as data only; never let it be parsed as instructions.
  • Cost drift. Cost per run creeps up over months as prompts grow and models change. Re-baseline costs on every prompt or model bump.
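
Two of these mitigations, output-schema validation and the "last verified" timestamp, reduce to small checks. The sketch below uses plain type checks rather than a schema library; PRICE_SCHEMA, validate_tool_output and is_stale are hypothetical names, and the price tool's fields are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical output schema for a price tool: field name -> expected type.
PRICE_SCHEMA = {"ticker": str, "price": float, "currency": str}


def validate_tool_output(result: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the output matches."""
    problems = []
    for field, expected_type in schema.items():
        if field not in result:
            problems.append(f"missing field: {field}")
        elif not isinstance(result[field], expected_type):
            problems.append(f"wrong type for {field}: {type(result[field]).__name__}")
    return problems


def is_stale(last_verified: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if a retrieved chunk was last verified longer ago than max_age."""
    return now - last_verified > max_age
```

A tool result that fails validation is better surfaced as a tool error than passed to the model, since the prompt was written against the old shape.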

What this section adds

Sections 8-01 through 8-03 cover the agent runtime; this section is the operations layer that runtime sits inside. The discipline here — versioning, observability, gates, audit logs — is what makes agentic systems shippable in regulated environments. Section 8-05 adds the evaluation discipline that closes the loop: agents are measured against goals, not just inspected per call.

Evaluation and Benchmarks

A forecaster (Chapter 4) is evaluated against a held-out distribution with CRPS. A policy (Chapter 5) is evaluated against rolling benchmarks with Sharpe and drawdown. Agents are harder: their output is rarely a number, their behaviour is non-deterministic, and the same input can produce a defensible answer through several different tool-call paths. This section covers the evaluation discipline that survives that complexity — what to measure, how to measure it, and how to keep agents from regressing as prompts and models change underneath them.

Three things to measure

Resist the urge to score agents on a single number. Three orthogonal dimensions need separate tracking:

  • Outcome quality. Did the agent's output answer the user's question correctly and at the firm's quality bar? The thing the user actually cares about.
  • Process fidelity. Did the agent follow the right tool path? An agent that produces a correct answer by hallucinating numbers it should have looked up via a tool is a bug, even if the observed output is right.
  • Operational metrics. Cost, latency, success rate, error taxonomy. The thing the team running the agent cares about.

A run can score 10/10 on outcome quality and 2/10 on process fidelity; that mismatch is the evaluation signal. A team that only looks at outcome quality will ship an agent that fabricates plausibly.

Golden tasks

The minimum evaluation artefact: a curated set of 30–200 prompts with hand-written reference answers and required tool-call paths. Run on every prompt change, every model upgrade, every graph revision.

{
  "task_id": "earnings-summary-msft-2024q4",
  "input": "Summarise MSFT's Q4 2024 earnings highlights against analyst consensus.",
  "required_tool_calls": [
    {"tool": "filings.search", "args_match": {"ticker": "MSFT", "form": "10-K"}},
    {"tool": "consensus.estimates", "args_match": {"ticker": "MSFT"}}
  ],
  "reference_answer_keypoints": [
    "Revenue beat consensus by ~2%",
    "Cloud segment growth >25% YoY",
    "Forward guidance unchanged"
  ],
  "rubric": {
    "completeness": "Mentions all three keypoints",
    "groundedness": "Each numeric claim has a citation",
    "tone": "Concise, no hedging beyond what the source supports"
  }
}

Running the suite produces a per-task scorecard with three columns: outcome (rubric-graded), process (tool-path match against required_tool_calls), and operational (cost, latency).
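
A suite runner that emits the three-column scorecard can be very small. In the sketch below the agent is assumed to be a callable that takes the task input and returns a dictionary with cost_usd and latency_s keys; the two graders are passed in as callables. All of these names and the return shape are assumptions made for illustration.

```python
from typing import Callable


def run_suite(
    tasks: list[dict],
    agent: Callable[[str], dict],
    grade_outcome: Callable[[dict, dict], float],
    grade_process: Callable[[dict, dict], float],
) -> list[dict]:
    """Run every golden task once and return one scorecard row per task."""
    rows = []
    for task in tasks:
        run = agent(task["input"])
        rows.append({
            "task_id": task["task_id"],
            "outcome": grade_outcome(run, task),   # rubric-graded
            "process": grade_process(run, task),   # tool-path match
            "operational": {"cost_usd": run["cost_usd"], "latency_s": run["latency_s"]},
        })
    return rows
```

Keeping the graders as parameters means the same runner serves programmatic scoring and judge-based scoring without modification.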

The hard part is curating the golden set. The cheap-but-bad pattern is to scrape historical user queries; the expensive-but-correct pattern is to write tasks by hand against the failure modes the team actually wants to catch.

Rubric-based scoring with LLM judges

For the outcome dimension, programmatic scoring works for some tasks (extraction with reconciliation, numeric Q&A) and not others (write-ups, narrative analyses). For the latter, an LLM judge with a structured rubric is the standard tool — with caveats.

JUDGE_PROMPT = """You are scoring an agent's response to a finance task.
 
Task: {task_input}
 
Reference points the response should cover:
{reference_keypoints}
 
Agent response:
{agent_response}
 
Score each dimension on a 0-3 integer and return strict JSON:
- completeness: 3 if all reference points covered, 0 if none
- groundedness: 3 if every numeric/factual claim has a citation
- tone: 3 if concise and appropriately hedged
 
Return: {{"completeness": int, "groundedness": int, "tone": int, "rationale": "..."}}
"""

Two practices that keep judge-based scoring honest:

  • Judge from a different model family than the agent. Same-family judges over-reward outputs that match their own house style.
  • Calibrate against humans periodically. Score 50–100 prompts with both human and LLM judges; report agreement (Cohen's kappa). Below 0.6, the judge is too noisy; refine the rubric or change models.

LLM-judge scores drift as models change. Re-run calibration on every judge-model upgrade.
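
For the calibration step, Cohen's kappa is the observed agreement corrected for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch of the unweighted version follows. It treats the 0-3 rubric scores as unordered labels; a weighted kappa, which gives partial credit for near misses, may suit ordinal rubrics better.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    # p_o: fraction of items on which the raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With human scores [3, 3, 2, 0, 1, 3, 2, 2] and judge scores [3, 2, 2, 0, 1, 3, 2, 1], observed agreement is 0.75, chance agreement is 18/64, and kappa is 15/23, about 0.65, which clears the 0.6 bar.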

Process fidelity: the tool-call audit

The audit log of Section 8-04 turns into evaluation data when overlaid on golden tasks. For each run:

  • Required calls present. Every required_tool_calls entry was invoked at least once, with arguments matching the constraint.
  • Forbidden calls absent. Tasks can also list calls that should not fire (e.g., the agent should not have written to a database on a read-only task).
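  • Path efficiency. The agent reached the answer in a small, bounded number of tool calls. Loops or redundant calls signal an opportunity to tighten the prompt.

# Assumed semantics for the helper the scorer relies on: every key in the
# constraint must be present in the call's args with exactly that value.
def matches(args: dict, constraint: dict) -> bool:
    return all(k in args and args[k] == v for k, v in constraint.items())


def score_process(trace, task) -> dict:
    # (tool, args) pairs for every tool call recorded in the trace
    invoked = [(c.tool, c.args) for c in trace.tool_calls]
    required = task["required_tool_calls"]
    # Fraction of required calls that appear at least once with matching args
    fidelity = sum(
        1 for r in required
        if any(t == r["tool"] and matches(a, r["args_match"]) for t, a in invoked)
    ) / max(1, len(required))
    n_calls = len(invoked)
    return {"fidelity": fidelity, "n_calls": n_calls}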

Process fidelity is the metric that catches plausible-but-wrong agents — they look fine on outcome quality and have a low fidelity score because they shortcut the work.
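
The forbidden-call and path-efficiency checks can be scored in the same spirit. The sketch below assumes tool calls are available as (tool, args) pairs and that each task carries a set of forbidden tool names and a call budget; both of those task fields, and the audit_calls name, are hypothetical.

```python
import json


def audit_calls(invoked: list[tuple[str, dict]], forbidden: set[str], max_calls: int) -> dict:
    """Check forbidden tools, repeated identical calls, and the call budget."""
    names = {tool for tool, _ in invoked}
    seen: set[tuple[str, str]] = set()
    repeats = 0
    for tool, args in invoked:
        # An identical (tool, args) pair seen twice is a cheap loop signal
        key = (tool, json.dumps(args, sort_keys=True))
        if key in seen:
            repeats += 1
        seen.add(key)
    return {
        "forbidden_hit": sorted(names & forbidden),
        "repeated_calls": repeats,
        "within_budget": len(invoked) <= max_calls,
    }
```

Repeated identical calls are only a proxy for inefficiency; a legitimate retry after a transient tool error would also count, so the number is better read as a prompt for inspection than as a hard failure.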

Adversarial probes

A separate suite of inputs designed to fail. Three categories:

  • Prompt injection. Documents in the retrieval corpus that try to hijack the agent's instructions. The agent should ignore them.
  • Out-of-scope requests. Questions outside the agent's intended domain. The agent should refuse or escalate, not improvise.
  • Edge-case inputs. Empty fields, ambiguous tickers, malformed documents, very long contexts. The agent should fail gracefully — ask for clarification, not silently invent.

The adversarial set grows over time. Every production failure mode that surfaces becomes a new entry. Treat it the same way a security team treats CVE entries: the set never shrinks.

Regression discipline

The evaluation suite runs on every change that could affect agent behaviour:

  • Prompt edits. Run the golden suite. Block merge if any task's outcome score regresses by more than a threshold (e.g., -1 on the rubric).
  • Tool schema changes. Run process-fidelity checks; the new schema must match all required_tool_calls constraints.
  • Model upgrades. Run the entire suite plus the adversarial probes. Model upgrades are the single most common source of silent behavioural change.
  • Graph topology changes. Full suite. Graph changes can shift routing in subtle ways that only show up across tasks.
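
The merge block for prompt edits can be expressed as a comparison of two scorecards. This sketch assumes outcome scores are integers keyed by task ID, reads the threshold as "block when a task drops by more than max_drop points", and treats a task missing from the candidate run as a score of 0; all three choices are assumptions rather than a prescribed policy.

```python
def regressions(baseline: dict[str, int], candidate: dict[str, int], max_drop: int = 1) -> list[str]:
    """Task IDs whose outcome score dropped by more than max_drop points."""
    return sorted(
        task_id
        for task_id, old in baseline.items()
        if old - candidate.get(task_id, 0) > max_drop  # missing task counts as 0
    )


def merge_allowed(baseline: dict[str, int], candidate: dict[str, int], max_drop: int = 1) -> bool:
    """A change may merge only if no golden task regressed past the threshold."""
    return not regressions(baseline, candidate, max_drop)
```

Returning the list of regressed task IDs, rather than a bare pass or fail, gives the author of the change a starting point for the fix.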

Weekly: run the entire suite on production traffic samples and compare against the previous week's baseline. Drift in outcome quality without an associated change shows up here first; without this, a slow model degradation can run for months unnoticed.

Online evaluation

Offline evaluation catches regressions on known tasks. Online evaluation catches behaviour on tasks the offline suite never saw.

  • A/B testing. Two graph versions running side by side. Compare outcome scores, success rates, cost. Collect statistically powered samples before drawing conclusions; agentic outputs are noisy.
  • Shadow runs. A new version processes every input but its output is logged, not delivered. Lets you compare before any user sees the new behaviour. Especially useful for model upgrades.
  • User feedback. Thumbs up / down on agent outputs, free-text comments, and explicit "this was wrong" flags. Aggregate weekly; use as the qualitative complement to offline scores.
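
For the A/B comparison of success rates, one standard check is a two-proportion z-test. The sketch below is one reasonable choice under the usual large-sample assumptions, not the only valid test; the function name is hypothetical.

```python
import math


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both arms all-success or all-failure: no evidence either way
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With 500 runs per arm, a move from 90% to 91% success is not distinguishable from noise, while a move from 80% to 90% is; this is why small canaries rarely settle small prompt changes.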

Per-domain failure taxonomy

Track failures by category, not just count:

Category           Description
Hallucination      Numeric or factual claim with no source
Tool-skip          Required call missing
Tool-misuse        Wrong tool or wrong args
Schema violation   Output doesn't validate
Refusal            Agent declined a valid request
Loop               Failed to terminate within step cap
Cost               Run exceeded budget without value

Per-week, per-category counts. A spike in hallucination and refusal together usually means a model upgrade went badly.
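
Per-week, per-category counting and a simple spike check fit in a few lines. In this sketch failures arrive as (week, category) pairs, and a spike is defined as a count that at least doubles week over week while staying above a small noise floor; both thresholds and the function names are illustrative assumptions.

```python
from collections import Counter


def weekly_counts(failures: list[tuple[str, str]]) -> dict[str, Counter]:
    """Group (week, category) failure records into per-week category counts."""
    out: dict[str, Counter] = {}
    for week, category in failures:
        out.setdefault(week, Counter())[category] += 1
    return out


def spikes(prev: Counter, curr: Counter, ratio: float = 2.0, min_count: int = 5) -> list[str]:
    """Categories whose count grew by at least `ratio` and is above the noise floor."""
    return sorted(
        category
        for category, n in curr.items()
        if n >= min_count and n >= ratio * max(1, prev[category])
    )
```

A result such as ["Hallucination", "Refusal"] from spikes is the pattern worth checking against the most recent model change.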

What this section adds

The agent runtime (8-01 / 8-02), multi-agent topologies (8-03), and production engineering (8-04) are the building discipline. This section is the measuring discipline. Without it, agentic systems silently drift; with it, the team running the agent has the same quality of feedback loop that the rest of the book has been developing for forecasters and policies.

The next chapter (RL fine-tuning) is the bridge into actually improving agents from this evaluation signal: convert the preference data and verifiable rewards from the evaluation suite into a fine-tuning corpus that pushes the policy toward the behaviour the rubric rewards.