Chapter 07

LLM Integration

"You don't have to be smart to be a good fighter. You just have to be smart about who you fight." — paraphrased from Creed (2015)

Chapters 2 through 6 built a quantitative stack from the bottom up: returns, classical forecasting, optimal decisions, latent dynamics, deep dynamic factor models. By the end of Chapter 6 we have models that can forecast, allocate, filter latent state, and interrogate the why of what is happening — all expressed as numerical operations on tensors.

The next two chapters change the substrate, not the goal. Language models do not replace the forecaster, the policy, or the dynamic factor model; they wrap around them. The right framing for the LLM-integration chapter is what does adding language to this existing stack make easier, and the answer is roughly four things:

  • Search. Retrieval-augmented generation (Section 7-02) turns a pile of filings, research notes, and policy statements into a queryable corpus. The numerical layer below it is unchanged.
  • Code generation. Translating "build me a TFT-quantile forecaster on this panel" into the right pytorch_forecasting invocation is a language task, and modern LLMs do it well.
  • Glue and orchestration. Chapter 8 wraps the LLM in a graph-of-tool-calls so analyses become reproducible workflows, not ad-hoc notebook sessions.
  • Time-series itself. A 2024–2025 wave of foundation forecasters (Chronos, Lag-Llama, MOIRAI, TimesFM) shows that an LLM-style backbone pretrained on time-series tokens is a strong zero-shot competitor to the specialists of Section 4-05. We covered the practical side in Section 4-08; this chapter covers the integration side.

LLMs are not a free lunch. The remainder of this chapter is about where they genuinely earn their place — and the parts of the stack where the right answer is still a small, deterministic, well-tested model.

What this chapter does

We treat LLM integration as an engineering problem, not a research problem. The output of each section is a pattern you can ship: a prompt that survives production, a fine-tuned model with an evaluation suite that catches regressions, a foundation-model choice that survives the next vendor migration. Three through-lines:

  • Use LLMs upstream of decisions, not as the decision-maker. They extract, summarise, and orchestrate. Numerical answers come from tools (Chapter 8); decisions come from policies (Chapter 5).
  • Reprogram before fine-tuning. Most of the value lives in well- designed prompts, retrieval, and tool calls. Fine-tuning is for the last 20% that production needs but prompting cannot deliver.
  • Stay vendor-neutral by construction. A thin adapter layer over provider APIs, a portable evaluation harness, and a graph-shaped agent (Chapter 8) make migration a config change, not a refactor.

This Chapter Covers

  • Business case and limits. Where LLMs add the most leverage in finance, and three categories of task they should not be doing.
  • Reprogramming patterns. RAG, structured prompting, tool grounding, and the time-series-as-tokens reprogramming family (Time-LLM, Time-LlaMA).
  • Fine-tuning pipelines. Parameter-efficient methods (LoRA, QLoRA, MoELoRA), preference alignment with DPO, and when RL-style fine-tuning earns its budget.
  • Foundation-model strategy. Build vs. buy, hybrid deployments, cost management, and the time-series-foundation-model family (Chronos, Lag-Llama, MOIRAI).

Contents

  • Why LLM — what the literature and production evidence actually support, with the compliance constraints that shape integration.
  • Reprogramming Approach — RAG, prompt patterns, and reprogramming for time-series forecasting.
  • Fine-Tuning Approach — SFT, PEFT, DPO, and the data pipeline that makes the loop reproducible.
  • Foundation Models — model selection, hybrid deployment, and the time-series-foundation-model branch that quietly competes with the language-model branch.
  • Prompt Engineering for Finance — the practical prompt-craft layer: structured outputs, grounded citations, retrieved exemplars, critic-revise, and the evaluation harness that keeps prompt edits honest.

Why LLM?

Large language models (LLMs) reshape three things in a quantitative finance stack: how analysts interact with structured data, how unstructured sources (filings, research notes, news) feed into a model, and how time-series forecasting itself behaves when an LLM-style backbone replaces a custom network. This chapter is about the parts of the stack where the language model genuinely earns its place — and the parts where the right answer is still a small, deterministic, well-tested model. Drawing the line clearly is how we get the LLM dividend without paying the LLM tax.

Three things LLMs are actually good at

Cutting through the marketing, the empirical evidence in finance points to three solid wins:

  • Knowledge compression and retrieval. A modern LLM has read essentially every public filing and a lot of research; with retrieval-augmented generation (RAG, Section 7-02) it becomes a fast, queryable index over filings, transcripts, and policy documents. Tasks like "summarise the forward guidance from the last six FOMC statements" or "list all mention-driven risk factors in this 10-K" are linear-time chores done in seconds rather than days.
  • Interface flexibility. The same model serves a chat-style analyst query, a structured-output API call from a notebook, and a tool-orchestrating agent (Chapter 8) — all from the same prompt-and-response interface. That is a big collapse in glue code and reduces the surface area you have to test.
  • Dynamics-aware time-series modelling. The 2024–2025 wave of foundation forecasters — Chronos \citep{ansari2024chronos}, Lag-Llama \citep{rasul2024lagllama}, MOIRAI \citep{woo2024moirai}, Time-LlaMA \citep{chen2024timellama} — show that an LLM-style transformer pretrained on billions of time-series tokens transfers usefully to financial panels in zero- and few-shot settings. The point isn't that they always beat a task-specific model (they don't); it is that they give a strong baseline with no per-series training cost.

Three things LLMs are still not good at

The flip side is where finance applications burn time and money on things LLMs cannot reliably deliver yet:

  • Numerical reasoning at scale. Anything that should be a SQL query, a Polars expression, or a NumPy computation should be one — not an LLM internal calculation. Models hallucinate decimal points, units, and cross-totals exactly often enough to be dangerous when the numbers feed a trade or a risk report. Section 7-02 covers the tool calling pattern that fixes this.
  • Causal claims about markets. "Did this Fed press conference cause the rates rally?" is a question LLMs will gamely answer, and the answer is usually wrong in a way that sounds plausible. Causal claims belong in Chapter 6's identifiable-dynamics machinery, not in an LLM completion.
  • Decisions at the trading layer. Portfolio weights, hedge sizes, and execution actions belong behind deterministic checks (Chapter 5), approval gates, and explicit risk limits — not behind an LLM tool call. An agent (Chapter 8) can propose an action; humans and rules approve.

The compliance dimension

A 2024 survey of LLM use in financial services flagged three institutional constraints that recur across firms:

  • Auditability. Every model output that informs an action must be reproducible: same prompt, same retrieval set, same context, same tools → same response. Practical implication: pin model versions, freeze prompts, and log the full request/response trail.
  • Data residency. Cross-border calls to managed APIs trip data-residency rules in many jurisdictions. The hybrid strategy of Section 7-04 — small open models on-prem for sensitive flows, frontier API models for exploratory work — is partly a response to this.
  • Explainability. Regulators ask "why did this model say that?" The accepted answer is retrieval traces plus tool-call traces: the LLM is a pass-through that summarises grounded evidence, and the evidence chain is what you defend.

These constraints rule out a class of seductive use cases — "let the LLM make the call" — and channel effort toward use cases where the LLM is upstream of decisions: extracting features, surfacing evidence, drafting explanations.

Capability checklist

Before integrating any LLM into a financial workflow, verify:

  1. Domain vocabulary coverage. Tickers, regulatory terms, jargon (basis, carry, gamma, vega) — does the model use them correctly? A small held-out probe set catches the misses.
  2. Structured-output reliability. Can it produce well-typed JSON, SQL, or function-call payloads under load? "Reliability" here means failure rates of on the production prompt distribution, not on hand-picked easy cases.
  3. Adversarial robustness. Prompt-injection from documents in your retrieval corpus is a real risk. The classic fix is to treat retrieved text as data only, never as instructions — see Section 7-02.
  4. Tool-grounding latency budget. Round-tripping through a tool adds 100–500 ms per call. If your application has tighter budgets, either pre-compute or distil into a smaller specialised model.

Role in this book

The remaining sections of this chapter cover the three dominant integration modes:

  • Reprogramming (Section 7-02). Use the model as-is; reshape prompts, retrieval, and tool calls. Fastest to deploy; the first line of defence before any fine-tuning.
  • Reprogramming for time series. A specific reprogramming pattern — representing a time series as tokens an LLM can read — that has produced strong forecasting baselines without any model-side fine-tuning.
  • Fine-tuning (Section 7-03). Adapt a base model to firm-specific jargon, workflows, and safety rules with parameter-efficient methods (LoRA, adapters) and lightweight RL alignment (RLHF, DPO).
  • Foundation-model strategy (Section 7-04). When to consume a frontier API, when to host an open model, and how to mix both without rewriting agent logic.

Understanding why LLMs belong in the stack — and where they don't — is what keeps the rest of this chapter from devolving into a list of patterns in search of a problem.

A grounding case study

Consider an everyday workflow: a research analyst wants a summary of the past quarter's earnings calls for ten portfolio names, with quoted citations and a flag for material guidance changes. Three ways to wire this up:

  • Pure LLM. Ask the model to summarise. The output looks professional, and the model has invented at least one number per company because the training corpus is stale.
  • Retrieval + LLM. Pull the actual transcripts from an indexed corpus, hand the chunks to the model, ask it to summarise with inline citations. The model only summarises what was retrieved; the cited spans must exist in the corpus or the output is rejected.
  • Retrieval + LLM + extractor tool. All numeric claims (guidance ranges, revenue beats) come from a deterministic extractor tool the agent invokes; the LLM only writes the prose around the numbers.

The first option is what gets demoed; the second is what gets shipped; the third is what survives an audit. The patterns in the rest of this chapter — RAG, structured outputs, tool calls, fine- tuning — exist because each one moves the system one step further from the first option toward the third.

The same hierarchy applies to harder questions. "Should we add this position?" is not an LLM question; it is a Chapter 5 question. But "draft the rationale for why we're considering this position, given the analyst notes" is an LLM question — and a good one, when the LLM is wired through retrieval and tools so the rationale is grounded in the actual notes rather than the model's parametric guess at what notes typically say.

Reprogramming Approach

Reprogramming treats a frozen foundation model as-is and reshapes the prompt, retrieval context, and tool interface to elicit the behaviour you want. Conceptually, the model is the runtime and the prompt is the program. This is the first integration mode to try in practice: it deploys in days rather than weeks, follows the model's natural-language API, and avoids the data-collection and re-evaluation overhead of fine-tuning. The trick is treating the prompt and surrounding plumbing with the same discipline as production code.

The four components

Every reprogramming application is composed from four parts in some order:

  1. System prompt. Sets the persona, tone, scope, and refusal rules. Locked at deployment, version-controlled, and tested with the same regression suite you use for code.
  2. Context assembly. Retrieve documents, tables, or numerical metrics relevant to the user's request. Vector search, BM25, hybrid retrieval, or structured query — whichever returns the cleanest evidence.
  3. Structured prompting. Specify the shape of the answer (Markdown table, JSON schema, function-call payload) and the explicit reasoning steps the model should walk through.
  4. Tool calling. Route sub-questions that require numerical or deterministic answers (latest price, P/E ratio, options chain) to the appropriate tool rather than letting the LLM hallucinate them.

The split is what makes the system auditable: the LLM sees the system prompt + retrieved context + tool outputs, and summarises. Anything that must be exact came from a tool, not the model's parametric memory.

Retrieval-Augmented Generation (RAG)

RAG is the canonical pattern: index a corpus of filings, research notes, policy documents, or transcripts; at query time, retrieve the top- relevant chunks and stuff them into the model's context window.

A minimal end-to-end RAG with the Anthropic SDK and a local FAISS index — works against a folder of plain-text filings:

from pathlib import Path
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from anthropic import Anthropic
 
# 1) build the index (offline, run once per corpus refresh)
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
docs = [p.read_text() for p in Path("filings/").glob("*.txt")]
emb  = embedder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb.astype("float32"))
 
# 2) per-query retrieval + generation
client = Anthropic()
 
def ask(question: str, k: int = 5) -> str:
    q_emb = embedder.encode([question], normalize_embeddings=True).astype("float32")
    _, top = index.search(q_emb, k)
    context = "\n---\n".join(docs[i][:1500] for i in top[0])
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system="Answer using only the context. If unknown, say so.",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQ: {question}"}],
    )
    return msg.content[0].text
 
print(ask("Summarise FOMC forward guidance shifts since the March meeting."))

The pattern decomposes into three deterministic pieces (load, embed, retrieve) and one stochastic piece (the LLM call). Audit logs record the retrieved chunk IDs and the model response; the deterministic part replays exactly.

Three design choices matter more than they look:

  • Chunking. Per-paragraph chunks with metadata (source URL, timestamp, ticker, section heading) beat fixed-size chunks. Without metadata you cannot tell the model "ignore anything older than X" and you cannot produce an audit trail.
  • Hybrid retrieval. Pure vector search misses keyword matches (CUSIPs, ticker tickers, exact metric names); pure BM25 misses paraphrase. The sum or rank-fusion of both is the strongest default.
  • Re-ranking. A second-pass cross-encoder over the top-50 candidates rerankes them by relevance to the query. Cheap, ~ extra inferences, and consistently improves answer quality.

A defensive habit that pays off in finance: mark retrieved text as data, never as instructions. If the corpus contains adversarial or prompt-injection content (and over time, it will), instructing the model to treat the entire retrieval block as evidence rather than as a directive reduces the attack surface.

Reprogramming for time-series forecasting

A more recent strand of work uses exactly the same reprogramming idea without RAG: treat a time series as a sequence of tokens, prompt a frozen LLM, and read off forecasts. Time-LLM \citep{jin2024timellm} introduces this for forecasting; AutoTimes generalises to in-context learning; Time-LlaMA \citep{chen2024timellama} adds a dynamic LoRA over LLaMA-style backbones. The pipeline is invariant:

  1. Encode the input window as a textual or patch-token sequence.
  2. Prepend a brief task description ("forecast the next 96 steps").
  3. Decode the model's output as a sequence of numerical tokens.
  4. Optionally, attach a learnable patch-encoder/decoder around the frozen LLM and train those on a small forecasting set.

The empirical pitch: zero-shot or few-shot performance that competes with task-specific transformers (Chapter 4-05) on standard benchmarks, with no model-side training. The downside: latency per forecast and per-series cost are higher than running a small specialised model. Reprogramming for time series is most valuable when few-shot is the requirement (a new asset with little history) or when you need a uniform interface across many series.

Prompt patterns that survive in production

Three patterns reduce the failure rate of LLM outputs by a meaningful margin and compose with each other:

  • Chain-of-thought with structured output. Ask the model to walk through reasoning, then emit a final answer in a strictly typed JSON block. Parse only the JSON; throw away the reasoning. The reasoning step improves accuracy on multi-step tasks; the strict output makes the result programmable.
  • Critic / solver loop. First pass: generate. Second pass: critique the first pass against a checklist (numbers grounded? tone professional? hallucinations?). Third pass: revise. Adds latency and cost, often by , but the failure rate on hard tasks drops markedly.
  • Few-shot exemplars matched to the input. Rather than hard-coding a fixed set of examples, retrieve the most similar prior examples from a labelled bank and include them. The model's accuracy on long-tail inputs improves; the prompt budget scales with task complexity.

Evaluation discipline

Treat the prompt as code: it has a test suite, a CI loop, and version control. Practical components:

  • Golden tasks. A curated set of input/expected-output pairs (5–50) that the system must pass to ship. Run on every prompt change.
  • Adversarial probes. Prompt-injection attempts, jailbreaks, attempts to exfiltrate retrieved content. The set grows over time; never shrinks.
  • Drift monitors. Sample 1% of production traffic, re-run with the latest model version, and alert on output divergence. Models silently change behaviour even within a stated version family.
  • Latency / cost dashboards. Tokens-per-response, p50/p95 latency, spend per user. Reprogramming is cheap; left unmonitored, it becomes expensive.

When reprogramming runs out

Reprogramming hits a ceiling when the prompt grows past 5–10K tokens of boilerplate, when the model cannot reliably keep firm-specific terminology, or when regulators require deterministic terminology that prompts alone cannot guarantee. That is the moment to consider fine-tuning (Section 7-03), not before. In our experience, the first 80% of LLM-application value lives in good reprogramming; fine-tuning is for the last 20% that matters disproportionately for production use.

Fine-tuning Approach

Fine-tuning adapts a pretrained foundation model to firm-specific vocabulary, workflows, and safety rules. Compared with reprogramming (Section 7-02) it needs more curated data and a heavier evaluation pipeline, but it produces more consistent behaviour, narrower outputs, and the ability to constrain the model in ways prompts cannot reliably enforce. The art is choosing the right fine-tuning method — full SFT, parameter-efficient adaptation, or preference alignment — and matching it to the right data and the right deployment surface.

Three fine-tuning regimes

Modern fine-tuning splits into three regimes, each with a different cost profile and a different effect on the model:

  • Supervised fine-tuning (SFT). Train the model to imitate instruction–response pairs. Effective for behaviour shaping (style, format, vocabulary) and for task adaptation (turn a generalist into a filings-summary specialist). The default first step.
  • Preference alignment (RLHF, DPO). Train the model to prefer one response over another using human or programmatic preference labels. Effective for judgement tasks (which of these summaries is better?) where the right answer is a ranking, not a token sequence. DPO \citep{rafailov2023dpo} has largely replaced RLHF in practice because it avoids a separate reward-model training loop.
  • Reasoning fine-tuning (R1 / RL-from-scratch). Train the model to produce long chains of reasoning via reinforcement learning on verifiable tasks (math, code, structured extraction). DeepSeek-R1 \citep{guo2025deepseekr1} popularised this; subsequent work (DAPO, VAPO) refines the entropy mechanism. Useful when the firm task requires explicit multi-step reasoning that base models do not produce reliably; expensive and not always necessary.

A practical sequencing rule of thumb: SFT first, DPO if behaviour inconsistency persists, R1-style RL only if the task is genuinely reasoning-bound (e.g., extracting a complex schema from messy filings).

Parameter-efficient fine-tuning (PEFT)

Full fine-tuning of a 70-billion-parameter model is rarely the right call. LoRA \citep{hu2021lora} adds small low-rank adapter matrices to selected linear layers and trains only those, reducing the trainable parameter count by 100–1000 at minimal accuracy loss:

from peft import LoraConfig, get_peft_model
 
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora)
model.print_trainable_parameters()

Three PEFT design choices matter:

  • Rank . Higher rank gives more capacity; covers most domain-adaptation tasks, for tasks involving substantial new capability.
  • Target modules. Attention projections (q_proj, v_proj) are the default; adding the MLP (gate_proj, up_proj, down_proj) helps when the model needs new factual content rather than just stylistic adaptation.
  • Mixture-of-LoRAs. MoELoRA \citep{moelora2024} and similar mixtures combine multiple LoRA adapters and learn to route, which improves performance on heterogeneous task mixtures at the cost of inference complexity.

For very large models, QLoRA (4-bit quantised base + LoRA on top) makes fine-tuning a 70-B model feasible on a single high-memory GPU. The accuracy loss versus full-precision LoRA is small enough that this is the recipe most academic and small-team finance work uses today.

A working LoRA + QLoRA configuration that survives across model families:

from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
)
import torch
 
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
 
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none", task_type="CAUSAL_LM",
)

Three knobs that change the answer qualitatively:

  • Target modules. Attention-only (q/k/v/o) is fastest and works for behaviour shaping (style, tone, schema adherence). Add the MLP projections (gate_proj/up_proj/down_proj) when the fine-tune needs to introduce new knowledge or relationships, not just adjust output style.
  • Rank r. 8–16 for behaviour shaping; 32–64 for richer task adaptation. Higher rank doesn't always help — past 64, the model starts memorising the fine-tune set rather than generalising from it.
  • Compute dtype. bfloat16 is the right answer on Ampere (A100/H100); float16 works on older cards but can lose precision on the loss scaling and slow training down with stability checks.

The data pipeline

Fine-tuning is, much more than prompting, a data engineering problem.

  1. Collection. Pull research notes, chat logs, and compliance-approved client communications. Anything fed to the model should be data the firm has the legal right to use for training.
  2. Annotation. Convert documents into instruction–response pairs, with structured references (tickers, metrics, regulatory codes) so the model has stable handles to ground later. The annotation guidelines themselves are a deliverable: they should be detailed enough that two independent annotators agree on of edge cases.
  3. Quality control. Deduplicate near-duplicates (Levenshtein on inputs), redact PII and confidential names, and audit coverage across task types. A common mistake is over-fitting to the most-frequent task; balance the mixture so the long tail still gets supervision.
  4. Holdouts. Set aside two evaluation sets: an in-distribution set (sampled from the same source) and an out-of-distribution set (different sources, different time period). The latter is what catches regression in real production traffic.

A conservative dataset sizing rule: 10K SFT examples is enough to style a model; 100K for task adaptation; 1M+ for substantial new behaviour. If your dataset is much smaller than the target band, a stronger system prompt with retrieval (Section 7-02) is usually a better investment.

Training and alignment

A typical SFT loop on a small finance corpus:

from transformers import TrainingArguments, Trainer
 
args = TrainingArguments(
    output_dir="ckpt/finance-sft",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    save_strategy="epoch",
    eval_strategy="epoch",
    logging_steps=20,
    weight_decay=0.0,
)
 
trainer = Trainer(
    model=model, args=args,
    train_dataset=train_ds, eval_dataset=eval_ds,
    tokenizer=tokenizer, data_collator=collator,
)
trainer.train()

After SFT, DPO is the cleanest alignment step:

from trl import DPOTrainer, DPOConfig
 
dpo_args = DPOConfig(
    output_dir="ckpt/finance-dpo",
    beta=0.1,
    num_train_epochs=1,
    learning_rate=5e-6,
    bf16=True,
)
DPOTrainer(model=model, args=dpo_args,
           train_dataset=preference_ds,
           tokenizer=tokenizer).train()

Two practical guardrails:

  • Reference model snapshot. DPO needs a reference policy; use the post-SFT model and freeze it. Don't keep updating the reference during DPO — that destabilises training.
  • Beta annealing. Start with and consider lower () if the policy collapses to near-deterministic outputs.

When RL fine-tuning is worth it

Recent work \citep{jin2025rl-vs-sft} argues — convincingly — that SFT and RL fine-tuning have different shapes of improvement: SFT is faster and gives larger initial gains, while RL improves in regimes where the answer must satisfy a verifiable criterion (math correctness, structured extraction passing a schema, code that compiles). For finance applications the criterion is often exactly of this verifiable kind: extracted financial figures that must reconcile, classifications against a controlled vocabulary, JSON outputs that must validate against a schema.

The recipe: SFT to set the format, then RL with a programmatic reward that scores how well outputs match the verifiable criterion. R1-style training without verifiable rewards is hard to stabilise and rarely worth the budget.

Evaluation suite

  • Held-out prompts with exact answers; track exact-match and structured- output validation rates.
  • Human review for tone, compliance, and hallucinations on a stratified sample of production traffic. Run weekly during initial deployment, then monthly.
  • Adversarial tests. Prompt-injection attempts, jailbreaks, requests to exfiltrate confidential names. The set grows over time.
  • Regression tests. Run the entire evaluation suite on every weight update. The bar to ship is "no regression on safety, ≤ X% regression on any non-safety metric."

Deployment

  • Host fine-tuned models behind an authenticated API with rate limits and per-user budgets.
  • Version everything. Model weights, prompt template, tokenizer configuration. Inference reproducibility depends on freezing all three.
  • Log inputs and outputs with hashed user IDs and timestamps for audits.
  • Maintain a fallback path to the base model for graceful degradation when the fine-tuned model misbehaves; the failure mode is more frequent than most teams expect.

When not to fine-tune

Two recurring anti-patterns:

  • Knowledge updates. "Fine-tune the model on yesterday's news" is almost always the wrong answer; retrieval (Section 7-02) is faster, cheaper, and auditable. Fine-tuning bakes content into weights and removes the audit trail.
  • One-off tone adjustment. A clearer system prompt and a few exemplars achieve most of the benefit with none of the operational cost.

Fine-tuning is justified when prompts have grown unwieldy (K tokens), when regulators demand consistent terminology that reprogramming cannot guarantee, or when latency/cost budgets at scale make a smaller specialist model attractive over a frontier API. Outside of those, prefer reprogramming.

Where this leads

This section frames when fine-tuning is the right tool and which PEFT/DPO/RL regime to use. Chapter 9 picks up where this section ends: it walks through the TRL library, covers reward modelling for finance tasks, develops a runnable SFT → DPO pipeline, and treats GRPO and reasoning fine-tuning in detail. If your team has decided to fine-tune and is ready to ship, Chapter 9 is the working manual.

Foundation Models

A foundation model is a large neural network pretrained on a broad corpus that downstream applications adapt rather than train from scratch. For finance teams the relevant decisions are not academic — which model, hosted where, with what licensing, and how to migrate when something better appears six months later. This section covers the strategic choices and the architectural patterns that make those choices reversible.

Two flavours of "foundation model" in finance

Worth being explicit about, because the literature mixes them:

  • General-purpose language models — GPT-5, Claude Opus 4.7, Gemini 2.5, Llama-4, Mistral. These power the reprogramming and agent patterns in Sections 7-02 and Chapter 8. Strong on text, code, structured outputs, and tool use.
  • Time-series foundation models — Chronos \citep{ansari2024chronos}, Lag-Llama \citep{rasul2024lagllama}, MOIRAI \citep{woo2024moirai}, TimesFM \citep{das2024timesfm}, Time-LlaMA \citep{chen2024timellama}. Pretrained on tens to hundreds of billions of time-series tokens; produce zero-shot and few-shot probabilistic forecasts that compete with task-specific transformers (Chapter 4-05) on standard benchmarks.

The two evolve on similar schedules but solve different problems and the deployment shape is different. We treat them in parallel below.

Build vs. buy for language models

The classic decision is between consuming a managed API and self-hosting an open model:

  • API consumption (Anthropic, OpenAI, Google). Fast access to frontier capability, no MLOps overhead, predictable per-token billing. Tradeoffs: data residency, vendor lock-in (mitigatable), and limited customisation beyond fine-tuning of certain endpoints.
  • Self-hosted open models (Llama-4, Mistral, Qwen, DeepSeek). Full control, offline / on-prem capability, no per-token cost at scale. Tradeoffs: hardware investment, model evaluation and security become your job, and the gap to the frontier on hard tasks remains real even if the gap on common tasks has narrowed.

The right answer for most finance teams is the hybrid pattern: API for exploration, drafting, and infrequent expensive workloads; on-prem open model for repeated structured workloads (extraction, classification, short-form summary), where the per-token cost compounds.

Evaluation criteria

Before committing to a model, walk through the following with your specific workload:

  1. Context length. Long contexts (200K+ tokens) support full reports and multi-document RAG without aggressive chunking. Some firms still need 1M+ for full transcript analysis.
  2. Multimodality. If chart, table, or PDF reasoning is in scope, choose models with vision encoders (Claude Opus 4.7, Gemini 2.5, GPT-5 vision).
  3. Tool calling reliability. Test on your schemas. Frontier models converge on syntactic correctness; failure modes diverge under pressure (long contexts, ambiguous inputs).
  4. Reasoning vs. fast paths. Reasoning-heavy variants (o-series, Claude Opus, DeepSeek-R1, Gemini Thinking) trade latency for accuracy on multi-step problems. Match the model to the workload, not the latest benchmark headline.
  5. Licensing and data-handling. Verify usage rights for commercial deployments; many "open" models forbid certain commercial uses or require attribution. For derived data (model outputs trained on confidential filings), the licensing terms of the base model apply.
  6. Stability over time. Both API models and open weights drift in behaviour. Lock to specific snapshots in production and test regressions on every upgrade.

Time-series foundation models in practice

The 2024–2025 wave of time-series foundation models gave finance teams a new baseline: a probabilistic forecast in zero-shot, no per-series training. The recipe is uniform across them:

  • Tokenise the time series (Chronos quantises to a vocabulary of value bins; Lag-Llama uses lagged features; MOIRAI uses patches with masking).
  • Run the pretrained encoder–decoder transformer.
  • Decode forecasts as numerical tokens or as a parametric head.

Empirical pattern from the published benchmarks and a fair amount of finance-team experience: foundation forecasters lose to task-specific transformers (Chapter 4-05) by 5–15% on CRPS when you have enough data to train a specialist; they win by orders-of-magnitude on time-to-baseline when you don't, and they win on series with very short history because the pretrained dynamics carry useful priors.

The practical pattern is to use a foundation forecaster as the first baseline on any new dataset, then decide whether the gap to the specialist is worth closing.

Hybrid strategy in practice

Most finance teams end up with three tiers, each with a clear scope:

  • Frontier API tier. Exploratory analysis, proof-of-concept, hard reasoning tasks, occasional report-generation.
  • Open-weight on-prem tier. Production extraction, classification, short-form summary. A 7B–70B model with proper fine-tuning beats an API call on cost at any meaningful volume.
  • Specialist tier. Task-specific networks (forecaster, classifier) trained on firm data, called as tools by the LLM agents (Chapter 8).

The specialist tier is often the cheapest and most accurate; the LLM layers above it provide the natural-language interface and the orchestration, not the numerical answer.

Cost management

Three patterns reduce cost more than any model switch:

  • Batch overnight. Workloads that don't need real-time latency (overnight risk reports, end-of-day extractions) batch into the offline pricing tier — cheaper by 50–80% on most providers.
  • Cache embeddings and tool results. For RAG, the embeddings of source documents and the search index don't need to be recomputed per query. For tool calls, identical inputs should hit a cache, not the tool.
  • Routing models. Send easy queries to a small/cheap model and only escalate to the frontier when needed. A 1B-parameter classifier sitting in front of the routing decision pays for itself within days.

A defensible cost ledger to keep at the top of every project:

Workload classLatency budgetTierCost handling
Real-time analyst chat< 3 sFrontier APIPer-user budget cap; route trivial queries to a smaller model
Filings extraction< 5 min batchOn-prem 8–14 BRe-use embeddings; cache by (doc_hash, schema_version)
Overnight reportsovernightFrontier batch tierBuild prompt + context once; reuse across runs
Internal R&D / draftsminutesOn-prem 70 B (QLoRA)No external bill; GPU time is the constraint
Policy-evaluation traces< 1 sSpecialist (Chapter 4 forecasters as tools)LLM is glue; specialist takes the heavy lift

Separating workload classes is what makes the cost-vs-quality discussion sharp. "Pay for frontier API" makes sense for analyst chat; "pay for frontier API for overnight extraction at 10K documents per day" almost never does — that workload belongs on an on-prem fine-tuned model, and the savings (30–50× on $/token) cover the GPU and ops cost within the first month.

Vendor neutrality

Foundation models evolve on a six-month cycle; the model that is best today is often not best by next quarter. Two architectural decisions keep that change cheap:

  • Adapter layer for model APIs. A thin internal SDK that exposes a uniform interface (request, response, tool call, streaming) regardless of underlying provider. Switching providers becomes a config change instead of a refactor.
  • Prompt and evaluation parity. Mirror prompts and evaluation harnesses across providers. If the same prompt produces different results on Provider A and Provider B, you have an empirical answer to which to deploy — not a vendor pitch.

Where this goes next

The patterns in this chapter — reprogramming, fine-tuning, foundation choice — are the integration layer. Chapter 8 builds the next layer up: agents that orchestrate these models with tools (SQL, Python, the forecasters from Chapter 4, the policy from Chapter 5). Chapter 10 closes the loop by generating synthetic data that we feed back into the training and evaluation of every layer above. Foundation models are the substrate; what makes them useful in finance is the rest of the stack that surrounds them.

Prompt Engineering for Finance

The first four sections of this chapter cover the integration shapes — reprogramming, fine-tuning, foundation-model strategy, and the upstream framing. This section is the practical prompt-craft layer that determines whether those integrations actually work in production. Most of the patterns are small; the cumulative effect on reliability is large, especially on the structured-output tasks finance applications care about most.

The four properties a finance prompt has to hit

Before we get to patterns, the constraints. A prompt for a finance task must be:

  • Reproducible. Same input + same prompt + pinned model version = same output (modulo logged seeds for sampling).
  • Schema-faithful. Structured outputs (JSON, function-call payloads) validate on every run, not just the happy path.
  • Grounded. Numeric and factual claims trace to retrieved or tool-supplied sources, not the model's parametric memory.
  • Auditable. A reviewer can read the prompt, the retrieved context, and the output, and reconstruct why the model said what it did.

A prompt that nails all four turns the LLM into a predictable component; one that misses any of them turns it into a liability.

Pattern 1: structured-output-first

Always declare the output shape before the question. The model treats the schema as a constraint to satisfy rather than a parser that runs after the fact.

SYSTEM = """\
You extract financial figures from filings. Output a single JSON
object matching this schema (no prose, no code fences, no commentary):
 
{
  "ticker": string,                      // exchange ticker, uppercase
  "period": string,                      // e.g. "2024Q4"
  "revenue":      {"value": number, "unit": "USD", "citation": string},
  "net_income":   {"value": number, "unit": "USD", "citation": string},
  "operating_cf": {"value": number, "unit": "USD", "citation": string}
}
 
Each citation must be a verbatim 6-12 word span from the source.
Numbers are reported in USD millions.
"""

Pair the system prompt with strict JSON output mode if the provider supports it (Anthropic structured-output, OpenAI JSON mode). Always validate against the schema after parsing; reject on mismatch and retry once with the validation error appended to the prompt.

Two anti-patterns:

  • Free-text first, parse later. Model emits prose like "Revenue was 678M…"; a regex tries to extract numbers. Fragile; the regex rots within weeks.
  • Schema in the user message. Model takes the schema as a suggestion. Schema goes in the system prompt where it has higher effective weight.

Pattern 2: ground every number to a source span

For finance the grounding discipline is non-negotiable. The working contract: every numeric claim in the output must be accompanied by the verbatim text span that supports it.

{
  "revenue": {"value": 65585, "citation": "Total revenue increased 16% to $65.6 billion"}
}

After parsing, the runtime checks that each citation appears verbatim in the retrieved context. Citations that fail this check fail the whole output. The LLM is now free to summarise but not free to invent.

This pattern is also what makes the audit log useful: a regulator sees a number and the span it came from in one record.

Pattern 3: retrieval before instruction

The classical RAG layout puts retrieved context at the end of the prompt. For long contexts this is the wrong choice — recency bias and "lost in the middle" effects mean the model attends to the recent instruction over the old context. Better:

[system prompt]
[retrieved context — ranked top to bottom by relevance]
[user instruction last]

The instruction is what the model "remembers" most strongly, and the context is laid out where the model will read it during processing rather than reference at the end.

A specific trick: when the context is multiple documents, mark each with an id the model can cite by:

[doc:1] Microsoft's fiscal Q1 2025 revenue rose to ...
[doc:2] Year-over-year operating cash flow ...

The output schema then includes a source_doc field, and the grounding check verifies the cited doc exists.

Pattern 4: chain-of-thought, then strict output

For multi-step extractions and analyses, ask the model to reason before emitting structured output:

First, work through the question step by step inside <think> tags.
Then, in <answer> tags, emit the JSON object exactly matching the
schema above.

After parsing, throw away <think>. The reasoning improves output quality on multi-step tasks; the strict structured <answer> is what downstream systems consume. The R1-style reasoning fine-tuning of Section 9-05 is the principle behind this on the training side; the prompt pattern here is its inference-time analogue.

For tasks that don't need reasoning (single-fact extraction), skip the chain-of-thought. It adds latency without improving quality.

Pattern 5: few-shot exemplars matched to the input

Hard-coded few-shot examples eat tokens and don't always match the current input. Retrieved exemplars — pick the most similar prior examples from a labelled bank and insert them — improve quality on long-tail inputs at modest token cost.

def build_prompt(user_query: str, exemplar_bank, embed_fn, k: int = 3) -> str:
    q_emb = embed_fn(user_query)
    bank_embs = embed_fn([e["query"] for e in exemplar_bank])
    sims = bank_embs @ q_emb / np.linalg.norm(q_emb)
    top = np.argsort(-sims)[:k]
    examples = "\n\n".join(
        f"Example query: {exemplar_bank[i]['query']}\n"
        f"Example answer:\n{exemplar_bank[i]['answer']}"
        for i in top
    )
    return f"{examples}\n\nNow answer the following:\n{user_query}"

The bank starts as 20–30 hand-curated examples; production usage keeps growing it with examples flagged as "good" by reviewers.

Pattern 6: critic-revise for high-stakes tasks

For tasks where one bad output costs more than ten good ones:

  1. Generate. First pass produces a candidate.
  2. Critique. Second pass reads the candidate and a checklist ("are all numbers grounded?", "does the JSON validate?", "is the tone within firm guidelines?"); emits a list of issues.
  3. Revise. Third pass produces a fixed version given the critique.

3× cost, 30%+ fewer downstream failures on judgement-heavy tasks. Reserve for cases where the latency budget supports it (overnight reports, drafted memos, not real-time chat).

Pattern 7: refuse-or-defer for out-of-scope inputs

A prompt should explicitly tell the model what NOT to answer.

If the user's question is outside the scope of [filings extraction,
KPI calculation, narrative summary], do not improvise. Respond with:
 
{"refusal": "out_of_scope", "reason": "<one sentence>"}

Without this, the model will help-fully attempt anything; with it, the failure mode becomes a clean refusal that downstream systems can handle (e.g., escalate to a human or route to a different agent).

Anti-patterns that quietly hurt

A working list:

  • Jailbreak by accident. Phrases like "as an expert analyst with no compliance constraints…" relax the model's safety prompts unintentionally. Audit for these.
  • Implicit math. "Compute the company's free cash flow" without saying which definition. The model picks a definition; the choice may not match the firm's. Be explicit.
  • Mixed languages without intent. Korean prompt with English examples produces inconsistent outputs. Pick one language; if multilingual is needed, model it explicitly.
  • Format drift across versions. Changing whitespace, capitalisation, or punctuation in the prompt without versioning. Even small drift changes outputs; treat the prompt as code.

Evaluation harness for prompts

The minimum infrastructure to keep prompt edits honest:

import json
from pathlib import Path
 
def evaluate_prompt(prompt: str, model: str, suite_path: Path) -> dict:
    """Run a prompt against a JSON suite of (input, expected) pairs."""
    suite = json.loads(suite_path.read_text())
    results = []
    for case in suite:
        out = call_llm(model, prompt, case["input"])
        ok_schema = validate_schema(out, case["schema"])
        ok_keys = all(k in out for k in case["required_keys"])
        ok_grounding = all(
            c["citation"] in case["context"] for c in out.get("citations", [])
        )
        results.append({
            "id": case["id"],
            "schema": ok_schema,
            "keys": ok_keys,
            "grounding": ok_grounding,
        })
    return aggregate(results)

Run on every prompt edit. Block merge if any test regresses.

Where prompt-craft ends

These patterns get you a long way — usually to the point where prompting alone is no longer the bottleneck. Past that point the gain comes from fine-tuning (Section 7-03 / Chapter 9), better retrieval (Section 7-02), or stronger models (Section 7-04). But deciding where to invest depends on knowing how good prompting alone gets, and the patterns here are how to find that ceiling honestly.