"You don't have to be smart to be a good fighter. You just have to
be smart about who you fight."
— paraphrased from Creed (2015)
Chapters 2 through 6 built a quantitative stack from the bottom up:
returns, classical forecasting, optimal decisions, latent dynamics,
deep dynamic factor models. By the end of Chapter 6 we have models
that can forecast, allocate, filter latent state, and interrogate the
why of what is happening — all expressed as numerical operations
on tensors.
The next two chapters change the substrate, not the goal. Language
models do not replace the forecaster, the policy, or the dynamic
factor model; they wrap around them. The right framing for the
LLM-integration chapter is what does adding language to this
existing stack make easier, and the answer is roughly four things:
Search. Retrieval-augmented generation (Section 7-02) turns a
pile of filings, research notes, and policy statements into a
queryable corpus. The numerical layer below it is unchanged.
Code generation. Translating "build me a TFT-quantile forecaster
on this panel" into the right pytorch_forecasting invocation is a
language task, and modern LLMs do it well.
Glue and orchestration. Chapter 8 wraps the LLM in a
graph-of-tool-calls so analyses become reproducible workflows, not
ad-hoc notebook sessions.
Time-series itself. A 2024–2025 wave of foundation forecasters
(Chronos, Lag-Llama, MOIRAI, TimesFM) shows that an LLM-style
backbone pretrained on time-series tokens is a strong zero-shot
competitor to the specialists of Section 4-05. We covered the
practical side in Section 4-08; this chapter covers the
integration side.
LLMs are not a free lunch. The remainder of this chapter is about
where they genuinely earn their place — and the parts of the stack
where the right answer is still a small, deterministic, well-tested
model.
What this chapter does
We treat LLM integration as an engineering problem, not a research
problem. The output of each section is a pattern you can ship: a
prompt that survives production, a fine-tuned model with an
evaluation suite that catches regressions, a foundation-model choice
that survives the next vendor migration. Three through-lines:
Use LLMs upstream of decisions, not as the decision-maker. They
extract, summarise, and orchestrate. Numerical answers come from
tools (Chapter 8); decisions come from policies (Chapter 5).
Reprogram before fine-tuning. Most of the value lives in well-
designed prompts, retrieval, and tool calls. Fine-tuning is for the
last 20% that production needs but prompting cannot deliver.
Stay vendor-neutral by construction. A thin adapter layer over
provider APIs, a portable evaluation harness, and a graph-shaped
agent (Chapter 8) make migration a config change, not a refactor.
This Chapter Covers
Business case and limits. Where LLMs add the most leverage in
finance, and three categories of task they should not be doing.
Reprogramming patterns. RAG, structured prompting, tool
grounding, and the time-series-as-tokens reprogramming family
(Time-LLM, Time-LlaMA).
Fine-tuning pipelines. Parameter-efficient methods (LoRA,
QLoRA, MoELoRA), preference alignment with DPO, and when
RL-style fine-tuning earns its budget.
Foundation-model strategy. Build vs. buy, hybrid deployments,
cost management, and the time-series-foundation-model family
(Chronos, Lag-Llama, MOIRAI).
Contents
Why LLM — what the literature and production
evidence actually support, with the compliance constraints that
shape integration.
Reprogramming Approach — RAG,
prompt patterns, and reprogramming for time-series forecasting.
Fine-Tuning Approach — SFT,
PEFT, DPO, and the data pipeline that makes the loop reproducible.
Foundation Models — model selection,
hybrid deployment, and the time-series-foundation-model branch
that quietly competes with the language-model branch.
Prompt Engineering for Finance —
the practical prompt-craft layer: structured outputs, grounded
citations, retrieved exemplars, critic-revise, and the evaluation
harness that keeps prompt edits honest.
Why LLM?
Large language models (LLMs) reshape three things in a quantitative finance
stack: how analysts interact with structured data, how unstructured sources
(filings, research notes, news) feed into a model, and how time-series
forecasting itself behaves when an LLM-style backbone replaces a custom
network. This chapter is about the parts of the stack where the language
model genuinely earns its place — and the parts where the right answer is
still a small, deterministic, well-tested model. Drawing the line clearly is
how we get the LLM dividend without paying the LLM tax.
Three things LLMs are actually good at
Cutting through the marketing, the empirical evidence in finance points to
three solid wins:
Knowledge compression and retrieval. A modern LLM has read essentially
every public filing and a lot of research; with retrieval-augmented
generation (RAG, Section 7-02) it becomes a fast, queryable index over
filings, transcripts, and policy documents. Tasks like "summarise the
forward guidance from the last six FOMC statements" or "list all
mention-driven risk factors in this 10-K" are linear-time chores done in
seconds rather than days.
Interface flexibility. The same model serves a chat-style analyst
query, a structured-output API call from a notebook, and a tool-orchestrating
agent (Chapter 8) — all from the same prompt-and-response interface. That
is a big collapse in glue code and reduces the surface area you have to
test.
Dynamics-aware time-series modelling. The 2024–2025 wave of
foundation forecasters — Chronos \citep{ansari2024chronos}, Lag-Llama
\citep{rasul2024lagllama}, MOIRAI \citep{woo2024moirai}, Time-LlaMA
\citep{chen2024timellama} — show that an LLM-style transformer pretrained
on billions of time-series tokens transfers usefully to financial panels in
zero- and few-shot settings. The point isn't that they always beat a
task-specific model (they don't); it is that they give a strong baseline
with no per-series training cost.
Three things LLMs are still not good at
The flip side is where finance applications burn time and money on things
LLMs cannot reliably deliver yet:
Numerical reasoning at scale. Anything that should be a SQL query, a
Polars expression, or a NumPy computation should be one — not an LLM
internal calculation. Models hallucinate decimal points, units, and
cross-totals exactly often enough to be dangerous when the numbers feed a
trade or a risk report. Section 7-02 covers the tool calling pattern
that fixes this.
Causal claims about markets. "Did this Fed press conference cause the
rates rally?" is a question LLMs will gamely answer, and the answer is
usually wrong in a way that sounds plausible. Causal claims belong in
Chapter 6's identifiable-dynamics machinery, not in an LLM completion.
Decisions at the trading layer. Portfolio weights, hedge sizes, and
execution actions belong behind deterministic checks (Chapter 5),
approval gates, and explicit risk limits — not behind an LLM tool call.
An agent (Chapter 8) can propose an action; humans and rules approve.
The compliance dimension
A 2024 survey of LLM use in financial services flagged three institutional
constraints that recur across firms:
Auditability. Every model output that informs an action must be
reproducible: same prompt, same retrieval set, same context, same tools →
same response. Practical implication: pin model versions, freeze prompts,
and log the full request/response trail.
Data residency. Cross-border calls to managed APIs trip data-residency
rules in many jurisdictions. The hybrid strategy of Section 7-04 — small
open models on-prem for sensitive flows, frontier API models for
exploratory work — is partly a response to this.
Explainability. Regulators ask "why did this model say that?" The
accepted answer is retrieval traces plus tool-call traces: the LLM is a
pass-through that summarises grounded evidence, and the evidence chain is
what you defend.
These constraints rule out a class of seductive use cases — "let the LLM
make the call" — and channel effort toward use cases where the LLM is
upstream of decisions: extracting features, surfacing evidence, drafting
explanations.
Capability checklist
Before integrating any LLM into a financial workflow, verify:
Domain vocabulary coverage. Tickers, regulatory terms, jargon
(basis, carry, gamma, vega) — does the model use them correctly? A
small held-out probe set catches the misses.
Structured-output reliability. Can it produce well-typed JSON, SQL,
or function-call payloads under load? "Reliability" here means failure
rates of ≤1% on the production prompt distribution, not on
hand-picked easy cases.
Adversarial robustness. Prompt-injection from documents in your
retrieval corpus is a real risk. The classic fix is to treat retrieved
text as data only, never as instructions — see Section 7-02.
Tool-grounding latency budget. Round-tripping through a tool
adds 100–500 ms per call. If your application has tighter budgets,
either pre-compute or distil into a smaller specialised model.
Role in this book
The remaining sections of this chapter cover the three dominant integration
modes:
Reprogramming (Section 7-02). Use the model as-is; reshape prompts,
retrieval, and tool calls. Fastest to deploy; the first line of defence
before any fine-tuning.
Reprogramming for time series. A specific reprogramming pattern —
representing a time series as tokens an LLM can read — that has produced
strong forecasting baselines without any model-side fine-tuning.
Fine-tuning (Section 7-03). Adapt a base model to firm-specific
jargon, workflows, and safety rules with parameter-efficient methods
(LoRA, adapters) and lightweight RL alignment (RLHF, DPO).
Foundation-model strategy (Section 7-04). When to consume a frontier
API, when to host an open model, and how to mix both without rewriting
agent logic.
Understanding why LLMs belong in the stack — and where they don't — is
what keeps the rest of this chapter from devolving into a list of patterns
in search of a problem.
A grounding case study
Consider an everyday workflow: a research analyst wants a summary of
the past quarter's earnings calls for ten portfolio names, with
quoted citations and a flag for material guidance changes. Three ways
to wire this up:
Pure LLM. Ask the model to summarise. The output looks
professional, and the model has invented at least one number per
company because the training corpus is stale.
Retrieval + LLM. Pull the actual transcripts from an indexed
corpus, hand the chunks to the model, ask it to summarise with
inline citations. The model only summarises what was retrieved;
the cited spans must exist in the corpus or the output is
rejected.
Retrieval + LLM + extractor tool. All numeric claims (guidance
ranges, revenue beats) come from a deterministic extractor tool
the agent invokes; the LLM only writes the prose around the
numbers.
The first option is what gets demoed; the second is what gets
shipped; the third is what survives an audit. The patterns in the
rest of this chapter — RAG, structured outputs, tool calls, fine-
tuning — exist because each one moves the system one step further
from the first option toward the third.
The same hierarchy applies to harder questions. "Should we add this
position?" is not an LLM question; it is a Chapter 5 question. But
"draft the rationale for why we're considering this position, given
the analyst notes" is an LLM question — and a good one, when the
LLM is wired through retrieval and tools so the rationale is
grounded in the actual notes rather than the model's parametric
guess at what notes typically say.
Reprogramming Approach
Reprogramming treats a frozen foundation model as-is and reshapes the
prompt, retrieval context, and tool interface to elicit the
behaviour you want. Conceptually, the model is the runtime and the prompt is
the program. This is the first integration mode to try in practice: it
deploys in days rather than weeks, follows the model's natural-language API,
and avoids the data-collection and re-evaluation overhead of fine-tuning.
The trick is treating the prompt and surrounding plumbing with the same
discipline as production code.
The four components
Every reprogramming application is composed from four parts in some order:
System prompt. Sets the persona, tone, scope, and refusal rules.
Locked at deployment, version-controlled, and tested with the same
regression suite you use for code.
Context assembly. Retrieve documents, tables, or numerical metrics
relevant to the user's request. Vector search, BM25, hybrid retrieval, or
structured query — whichever returns the cleanest evidence.
Structured prompting. Specify the shape of the answer (Markdown
table, JSON schema, function-call payload) and the explicit reasoning
steps the model should walk through.
Tool calling. Route sub-questions that require numerical or
deterministic answers (latest price, P/E ratio, options chain) to the
appropriate tool rather than letting the LLM hallucinate them.
The split is what makes the system auditable: the LLM sees the system
prompt + retrieved context + tool outputs, and summarises. Anything that
must be exact came from a tool, not the model's parametric memory.
Retrieval-Augmented Generation (RAG)
RAG is the canonical pattern: index a corpus of filings, research notes,
policy documents, or transcripts; at query time, retrieve the top-k
relevant chunks and stuff them into the model's context window.
A minimal end-to-end RAG with the Anthropic SDK and a local FAISS
index — works against a folder of plain-text filings:
from pathlib import Pathimport faissimport numpy as npfrom sentence_transformers import SentenceTransformerfrom anthropic import Anthropic# 1) build the index (offline, run once per corpus refresh)embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")docs = [p.read_text() for p in Path("filings/").glob("*.txt")]emb = embedder.encode(docs, normalize_embeddings=True)index = faiss.IndexFlatIP(emb.shape[1])index.add(emb.astype("float32"))# 2) per-query retrieval + generationclient = Anthropic()def ask(question: str, k: int = 5) -> str: q_emb = embedder.encode([question], normalize_embeddings=True).astype("float32") _, top = index.search(q_emb, k) context = "\n---\n".join(docs[i][:1500] for i in top[0]) msg = client.messages.create( model="claude-opus-4-7", max_tokens=1024, system="Answer using only the context. If unknown, say so.", messages=[{"role": "user", "content": f"Context:\n{context}\n\nQ: {question}"}], ) return msg.content[0].textprint(ask("Summarise FOMC forward guidance shifts since the March meeting."))
The pattern decomposes into three deterministic pieces (load,
embed, retrieve) and one stochastic piece (the LLM call). Audit logs
record the retrieved chunk IDs and the model response; the
deterministic part replays exactly.
Three design choices matter more than they look:
Chunking. Per-paragraph chunks with metadata (source URL, timestamp,
ticker, section heading) beat fixed-size chunks. Without metadata you
cannot tell the model "ignore anything older than X" and you cannot
produce an audit trail.
Hybrid retrieval. Pure vector search misses keyword matches (CUSIPs,
ticker tickers, exact metric names); pure BM25 misses paraphrase. The
sum or rank-fusion of both is the strongest default.
Re-ranking. A second-pass cross-encoder over the top-50 candidates
rerankes them by relevance to the query. Cheap, ~O(50) extra
inferences, and consistently improves answer quality.
A defensive habit that pays off in finance: mark retrieved text as data,
never as instructions. If the corpus contains adversarial or
prompt-injection content (and over time, it will), instructing the model to
treat the entire retrieval block as evidence rather than as a directive
reduces the attack surface.
Reprogramming for time-series forecasting
A more recent strand of work uses exactly the same reprogramming idea
without RAG: treat a time series as a sequence of tokens, prompt a frozen
LLM, and read off forecasts. Time-LLM \citep{jin2024timellm} introduces
this for forecasting; AutoTimes generalises to in-context learning;
Time-LlaMA \citep{chen2024timellama} adds a dynamic LoRA over LLaMA-style
backbones. The pipeline is invariant:
Encode the input window as a textual or patch-token sequence.
Prepend a brief task description ("forecast the next 96 steps").
Decode the model's output as a sequence of numerical tokens.
Optionally, attach a learnable patch-encoder/decoder around the frozen
LLM and train those on a small forecasting set.
The empirical pitch: zero-shot or few-shot performance that competes with
task-specific transformers (Chapter 4-05) on standard benchmarks, with no
model-side training. The downside: latency per forecast and per-series cost
are higher than running a small specialised model. Reprogramming for time
series is most valuable when few-shot is the requirement (a new asset
with little history) or when you need a uniform interface across many
series.
Prompt patterns that survive in production
Three patterns reduce the failure rate of LLM outputs by a meaningful
margin and compose with each other:
Chain-of-thought with structured output. Ask the model to walk
through reasoning, then emit a final answer in a strictly typed JSON
block. Parse only the JSON; throw away the reasoning. The reasoning step
improves accuracy on multi-step tasks; the strict output makes the result
programmable.
Critic / solver loop. First pass: generate. Second pass: critique
the first pass against a checklist (numbers grounded? tone professional?
hallucinations?). Third pass: revise. Adds latency and cost, often by
2×, but the failure rate on hard tasks drops markedly.
Few-shot exemplars matched to the input. Rather than hard-coding a
fixed set of examples, retrieve the k most similar prior examples from
a labelled bank and include them. The model's accuracy on long-tail
inputs improves; the prompt budget scales with task complexity.
Evaluation discipline
Treat the prompt as code: it has a test suite, a CI loop, and version
control. Practical components:
Golden tasks. A curated set of input/expected-output pairs (5–50)
that the system must pass to ship. Run on every prompt change.
Adversarial probes. Prompt-injection attempts, jailbreaks, attempts
to exfiltrate retrieved content. The set grows over time; never shrinks.
Drift monitors. Sample ∼1% of production traffic, re-run with
the latest model version, and alert on output divergence. Models silently
change behaviour even within a stated version family.
Latency / cost dashboards. Tokens-per-response, p50/p95 latency,
spend per user. Reprogramming is cheap; left unmonitored, it becomes
expensive.
When reprogramming runs out
Reprogramming hits a ceiling when the prompt grows past 5–10K tokens of
boilerplate, when the model cannot reliably keep firm-specific terminology,
or when regulators require deterministic terminology that prompts alone
cannot guarantee. That is the moment to consider fine-tuning (Section
7-03), not before. In our experience, the first 80% of LLM-application
value lives in good reprogramming; fine-tuning is for the last 20% that
matters disproportionately for production use.
Fine-tuning Approach
Fine-tuning adapts a pretrained foundation model to firm-specific vocabulary,
workflows, and safety rules. Compared with reprogramming (Section 7-02) it
needs more curated data and a heavier evaluation pipeline, but it produces
more consistent behaviour, narrower outputs, and the ability to constrain
the model in ways prompts cannot reliably enforce. The art is choosing the
right fine-tuning method — full SFT, parameter-efficient adaptation, or
preference alignment — and matching it to the right data and the right
deployment surface.
Three fine-tuning regimes
Modern fine-tuning splits into three regimes, each with a different cost
profile and a different effect on the model:
Supervised fine-tuning (SFT). Train the model to imitate
instruction–response pairs. Effective for behaviour shaping (style,
format, vocabulary) and for task adaptation (turn a generalist into a
filings-summary specialist). The default first step.
Preference alignment (RLHF, DPO). Train the model to prefer one
response over another using human or programmatic preference labels.
Effective for judgement tasks (which of these summaries is better?)
where the right answer is a ranking, not a token sequence. DPO
\citep{rafailov2023dpo} has largely replaced RLHF in practice because it
avoids a separate reward-model training loop.
Reasoning fine-tuning (R1 / RL-from-scratch). Train the model to
produce long chains of reasoning via reinforcement learning on
verifiable tasks (math, code, structured extraction). DeepSeek-R1
\citep{guo2025deepseekr1} popularised this; subsequent work
(DAPO, VAPO) refines the entropy mechanism. Useful when the firm task
requires explicit multi-step reasoning that base models do not produce
reliably; expensive and not always necessary.
A practical sequencing rule of thumb: SFT first, DPO if behaviour
inconsistency persists, R1-style RL only if the task is genuinely
reasoning-bound (e.g., extracting a complex schema from messy filings).
Parameter-efficient fine-tuning (PEFT)
Full fine-tuning of a 70-billion-parameter model is rarely the right call.
LoRA \citep{hu2021lora} adds small low-rank adapter matrices to selected
linear layers and trains only those, reducing the trainable parameter count
by 100–1000× at minimal accuracy loss:
Rank r. Higher rank gives more capacity; r=8–32 covers most
domain-adaptation tasks, r=64+ for tasks involving substantial new
capability.
Target modules. Attention projections (q_proj, v_proj) are the
default; adding the MLP (gate_proj, up_proj, down_proj) helps when
the model needs new factual content rather than just stylistic adaptation.
Mixture-of-LoRAs.MoELoRA \citep{moelora2024} and similar mixtures
combine multiple LoRA adapters and learn to route, which improves
performance on heterogeneous task mixtures at the cost of inference
complexity.
For very large models, QLoRA (4-bit quantised base + LoRA on top) makes
fine-tuning a 70-B model feasible on a single high-memory GPU. The accuracy
loss versus full-precision LoRA is small enough that this is the recipe
most academic and small-team finance work uses today.
A working LoRA + QLoRA configuration that survives across model
families:
Target modules. Attention-only (q/k/v/o) is fastest and works
for behaviour shaping (style, tone, schema adherence). Add the MLP
projections (gate_proj/up_proj/down_proj) when the
fine-tune needs to introduce new knowledge or relationships, not
just adjust output style.
Rank r. 8–16 for behaviour shaping; 32–64 for richer task
adaptation. Higher rank doesn't always help — past 64, the model
starts memorising the fine-tune set rather than generalising from
it.
Compute dtype. bfloat16 is the right answer on Ampere
(A100/H100); float16 works on older cards but can lose precision
on the loss scaling and slow training down with stability checks.
The data pipeline
Fine-tuning is, much more than prompting, a data engineering problem.
Collection. Pull research notes, chat logs, and compliance-approved
client communications. Anything fed to the model should be data the firm
has the legal right to use for training.
Annotation. Convert documents into instruction–response pairs, with
structured references (tickers, metrics, regulatory codes) so the model
has stable handles to ground later. The annotation guidelines themselves
are a deliverable: they should be detailed enough that two independent
annotators agree on ≥90% of edge cases.
Quality control. Deduplicate near-duplicates (Levenshtein on inputs),
redact PII and confidential names, and audit coverage across task types.
A common mistake is over-fitting to the most-frequent task; balance the
mixture so the long tail still gets supervision.
Holdouts. Set aside two evaluation sets: an in-distribution set
(sampled from the same source) and an out-of-distribution set
(different sources, different time period). The latter is what catches
regression in real production traffic.
A conservative dataset sizing rule: 10K SFT examples is enough to style
a model; 100K for task adaptation; 1M+ for substantial new behaviour.
If your dataset is much smaller than the target band, a stronger system
prompt with retrieval (Section 7-02) is usually a better investment.
Reference model snapshot. DPO needs a reference policy; use the
post-SFT model and freeze it. Don't keep updating the reference during
DPO — that destabilises training.
Beta annealing. Start with β=0.1 and consider lower
(0.01–0.05) if the policy collapses to near-deterministic outputs.
When RL fine-tuning is worth it
Recent work \citep{jin2025rl-vs-sft} argues — convincingly — that SFT and
RL fine-tuning have different shapes of improvement: SFT is faster and
gives larger initial gains, while RL improves in regimes where the answer
must satisfy a verifiable criterion (math correctness, structured
extraction passing a schema, code that compiles). For finance applications
the criterion is often exactly of this verifiable kind: extracted
financial figures that must reconcile, classifications against a
controlled vocabulary, JSON outputs that must validate against a schema.
The recipe: SFT to set the format, then RL with a programmatic reward
that scores how well outputs match the verifiable criterion. R1-style
training without verifiable rewards is hard to stabilise and rarely
worth the budget.
Evaluation suite
Held-out prompts with exact answers; track exact-match and structured-
output validation rates.
Human review for tone, compliance, and hallucinations on a stratified
sample of production traffic. Run weekly during initial deployment, then
monthly.
Adversarial tests. Prompt-injection attempts, jailbreaks, requests to
exfiltrate confidential names. The set grows over time.
Regression tests. Run the entire evaluation suite on every weight
update. The bar to ship is "no regression on safety, ≤ X% regression on
any non-safety metric."
Deployment
Host fine-tuned models behind an authenticated API with rate limits and
per-user budgets.
Version everything. Model weights, prompt template, tokenizer
configuration. Inference reproducibility depends on freezing all three.
Log inputs and outputs with hashed user IDs and timestamps for audits.
Maintain a fallback path to the base model for graceful degradation when
the fine-tuned model misbehaves; the failure mode is more frequent than
most teams expect.
When not to fine-tune
Two recurring anti-patterns:
Knowledge updates. "Fine-tune the model on yesterday's news" is
almost always the wrong answer; retrieval (Section 7-02) is faster,
cheaper, and auditable. Fine-tuning bakes content into weights and
removes the audit trail.
One-off tone adjustment. A clearer system prompt and a few exemplars
achieve most of the benefit with none of the operational cost.
Fine-tuning is justified when prompts have grown unwieldy (>5−10K
tokens), when regulators demand consistent terminology that reprogramming
cannot guarantee, or when latency/cost budgets at scale make a smaller
specialist model attractive over a frontier API. Outside of those, prefer
reprogramming.
Where this leads
This section frames when fine-tuning is the right tool and which
PEFT/DPO/RL regime to use. Chapter 9 picks up where this section
ends: it walks through the TRL library, covers reward modelling for
finance tasks, develops a runnable SFT → DPO pipeline, and treats
GRPO and reasoning fine-tuning in detail. If your team has decided to
fine-tune and is ready to ship, Chapter 9 is the working manual.
Foundation Models
A foundation model is a large neural network pretrained on a broad corpus
that downstream applications adapt rather than train from scratch. For
finance teams the relevant decisions are not academic — which model,
hosted where, with what licensing, and how to migrate when something
better appears six months later. This section covers the strategic choices
and the architectural patterns that make those choices reversible.
Two flavours of "foundation model" in finance
Worth being explicit about, because the literature mixes them:
General-purpose language models — GPT-5, Claude Opus 4.7, Gemini 2.5,
Llama-4, Mistral. These power the reprogramming and agent patterns in
Sections 7-02 and Chapter 8. Strong on text, code, structured outputs,
and tool use.
Time-series foundation models — Chronos
\citep{ansari2024chronos}, Lag-Llama \citep{rasul2024lagllama}, MOIRAI
\citep{woo2024moirai}, TimesFM \citep{das2024timesfm}, Time-LlaMA
\citep{chen2024timellama}. Pretrained on tens to hundreds of billions of
time-series tokens; produce zero-shot and few-shot probabilistic forecasts
that compete with task-specific transformers (Chapter 4-05) on standard
benchmarks.
The two evolve on similar schedules but solve different problems and the
deployment shape is different. We treat them in parallel below.
Build vs. buy for language models
The classic decision is between consuming a managed API and self-hosting
an open model:
API consumption (Anthropic, OpenAI, Google). Fast access to frontier
capability, no MLOps overhead, predictable per-token billing. Tradeoffs:
data residency, vendor lock-in (mitigatable), and limited customisation
beyond fine-tuning of certain endpoints.
Self-hosted open models (Llama-4, Mistral, Qwen, DeepSeek). Full
control, offline / on-prem capability, no per-token cost at scale.
Tradeoffs: hardware investment, model evaluation and security become
your job, and the gap to the frontier on hard tasks remains real even
if the gap on common tasks has narrowed.
The right answer for most finance teams is the hybrid pattern: API for
exploration, drafting, and infrequent expensive workloads; on-prem open
model for repeated structured workloads (extraction, classification,
short-form summary), where the per-token cost compounds.
Evaluation criteria
Before committing to a model, walk through the following with your specific
workload:
Context length. Long contexts (200K+ tokens) support full reports
and multi-document RAG without aggressive chunking. Some firms still
need 1M+ for full transcript analysis.
Multimodality. If chart, table, or PDF reasoning is in scope, choose
models with vision encoders (Claude Opus 4.7, Gemini 2.5, GPT-5 vision).
Tool calling reliability. Test on your schemas. Frontier models
converge on ≥99% syntactic correctness; failure modes diverge
under pressure (long contexts, ambiguous inputs).
Reasoning vs. fast paths. Reasoning-heavy variants (o-series, Claude
Opus, DeepSeek-R1, Gemini Thinking) trade latency for accuracy on
multi-step problems. Match the model to the workload, not the latest
benchmark headline.
Licensing and data-handling. Verify usage rights for commercial
deployments; many "open" models forbid certain commercial uses or
require attribution. For derived data (model outputs trained on
confidential filings), the licensing terms of the base model apply.
Stability over time. Both API models and open weights drift in
behaviour. Lock to specific snapshots in production and test
regressions on every upgrade.
Time-series foundation models in practice
The 2024–2025 wave of time-series foundation models gave finance teams a
new baseline: a probabilistic forecast in zero-shot, no per-series
training. The recipe is uniform across them:
Tokenise the time series (Chronos quantises to a vocabulary of value
bins; Lag-Llama uses lagged features; MOIRAI uses patches with masking).
Run the pretrained encoder–decoder transformer.
Decode forecasts as numerical tokens or as a parametric head.
Empirical pattern from the published benchmarks and a fair amount of
finance-team experience: foundation forecasters lose to task-specific
transformers (Chapter 4-05) by 5–15% on CRPS when you have enough data to
train a specialist; they win by orders-of-magnitude on time-to-baseline
when you don't, and they win on series with very short history because the
pretrained dynamics carry useful priors.
The practical pattern is to use a foundation forecaster as the first
baseline on any new dataset, then decide whether the gap to the
specialist is worth closing.
Hybrid strategy in practice
Most finance teams end up with three tiers, each with a clear scope:
Frontier API tier. Exploratory analysis, proof-of-concept, hard
reasoning tasks, occasional report-generation.
Open-weight on-prem tier. Production extraction, classification,
short-form summary. A 7B–70B model with proper fine-tuning beats an API
call on cost at any meaningful volume.
Specialist tier. Task-specific networks (forecaster, classifier)
trained on firm data, called as tools by the LLM agents (Chapter 8).
The specialist tier is often the cheapest and most accurate; the LLM
layers above it provide the natural-language interface and the
orchestration, not the numerical answer.
Cost management
Three patterns reduce cost more than any model switch:
Batch overnight. Workloads that don't need real-time latency
(overnight risk reports, end-of-day extractions) batch into the offline
pricing tier — cheaper by 50–80% on most providers.
Cache embeddings and tool results. For RAG, the embeddings of source
documents and the search index don't need to be recomputed per query.
For tool calls, identical inputs should hit a cache, not the tool.
Routing models. Send easy queries to a small/cheap model and only
escalate to the frontier when needed. A 1B-parameter classifier sitting
in front of the routing decision pays for itself within days.
A defensible cost ledger to keep at the top of every project:
Workload class
Latency budget
Tier
Cost handling
Real-time analyst chat
< 3 s
Frontier API
Per-user budget cap; route trivial queries to a smaller model
Filings extraction
< 5 min batch
On-prem 8–14 B
Re-use embeddings; cache by (doc_hash, schema_version)
Overnight reports
overnight
Frontier batch tier
Build prompt + context once; reuse across runs
Internal R&D / drafts
minutes
On-prem 70 B (QLoRA)
No external bill; GPU time is the constraint
Policy-evaluation traces
< 1 s
Specialist (Chapter 4 forecasters as tools)
LLM is glue; specialist takes the heavy lift
Separating workload classes is what makes the cost-vs-quality
discussion sharp. "Pay for frontier API" makes sense for analyst
chat; "pay for frontier API for overnight extraction at 10K
documents per day" almost never does — that workload belongs on
an on-prem fine-tuned model, and the savings (30–50× on $/token)
cover the GPU and ops cost within the first month.
Vendor neutrality
Foundation models evolve on a six-month cycle; the model that is best
today is often not best by next quarter. Two architectural decisions keep
that change cheap:
Adapter layer for model APIs. A thin internal SDK that exposes a
uniform interface (request, response, tool call, streaming) regardless
of underlying provider. Switching providers becomes a config change
instead of a refactor.
Prompt and evaluation parity. Mirror prompts and evaluation
harnesses across providers. If the same prompt produces different
results on Provider A and Provider B, you have an empirical answer to
which to deploy — not a vendor pitch.
Where this goes next
The patterns in this chapter — reprogramming, fine-tuning, foundation
choice — are the integration layer. Chapter 8 builds the next layer up:
agents that orchestrate these models with tools (SQL, Python, the
forecasters from Chapter 4, the policy from Chapter 5). Chapter 10 closes
the loop by generating synthetic data that we feed back into the
training and evaluation of every layer above. Foundation models are the
substrate; what makes them useful in finance is the rest of the stack
that surrounds them.
Prompt Engineering for Finance
The first four sections of this chapter cover the integration shapes
— reprogramming, fine-tuning, foundation-model strategy, and the
upstream framing. This section is the practical prompt-craft layer
that determines whether those integrations actually work in
production. Most of the patterns are small; the cumulative effect on
reliability is large, especially on the structured-output tasks
finance applications care about most.
The four properties a finance prompt has to hit
Before we get to patterns, the constraints. A prompt for a finance
task must be:
Reproducible. Same input + same prompt + pinned model version =
same output (modulo logged seeds for sampling).
Schema-faithful. Structured outputs (JSON, function-call
payloads) validate on every run, not just the happy path.
Grounded. Numeric and factual claims trace to retrieved or
tool-supplied sources, not the model's parametric memory.
Auditable. A reviewer can read the prompt, the retrieved
context, and the output, and reconstruct why the model said what
it did.
A prompt that nails all four turns the LLM into a predictable
component; one that misses any of them turns it into a liability.
Pattern 1: structured-output-first
Always declare the output shape before the question. The model
treats the schema as a constraint to satisfy rather than a parser
that runs after the fact.
SYSTEM = """\You extract financial figures from filings. Output a single JSONobject matching this schema (no prose, no code fences, no commentary):{ "ticker": string, // exchange ticker, uppercase "period": string, // e.g. "2024Q4" "revenue": {"value": number, "unit": "USD", "citation": string}, "net_income": {"value": number, "unit": "USD", "citation": string}, "operating_cf": {"value": number, "unit": "USD", "citation": string}}Each citation must be a verbatim 6-12 word span from the source.Numbers are reported in USD millions."""
Pair the system prompt with strict JSON output mode if the
provider supports it (Anthropic structured-output, OpenAI JSON mode).
Always validate against the schema after parsing; reject on mismatch
and retry once with the validation error appended to the prompt.
Two anti-patterns:
Free-text first, parse later. Model emits prose like "Revenue
was 12.3B,netincome678M…"; a regex tries to extract numbers.
Fragile; the regex rots within weeks.
Schema in the user message. Model takes the schema as a
suggestion. Schema goes in the system prompt where it has higher
effective weight.
Pattern 2: ground every number to a source span
For finance the grounding discipline is non-negotiable. The
working contract: every numeric claim in the output must be
accompanied by the verbatim text span that supports it.
After parsing, the runtime checks that each citation appears
verbatim in the retrieved context. Citations that fail this check
fail the whole output. The LLM is now free to summarise but not
free to invent.
This pattern is also what makes the audit log useful: a regulator
sees a number and the span it came from in one record.
Pattern 3: retrieval before instruction
The classical RAG layout puts retrieved context at the end of the
prompt. For long contexts this is the wrong choice — recency bias
and "lost in the middle" effects mean the model attends to the
recent instruction over the old context. Better:
[system prompt][retrieved context — ranked top to bottom by relevance][user instruction last]
The instruction is what the model "remembers" most strongly, and
the context is laid out where the model will read it during
processing rather than reference at the end.
A specific trick: when the context is multiple documents, mark each
with an id the model can cite by:
[doc:1] Microsoft's fiscal Q1 2025 revenue rose to ...[doc:2] Year-over-year operating cash flow ...
The output schema then includes a source_doc field, and the
grounding check verifies the cited doc exists.
Pattern 4: chain-of-thought, then strict output
For multi-step extractions and analyses, ask the model to reason
before emitting structured output:
First, work through the question step by step inside <think> tags.Then, in <answer> tags, emit the JSON object exactly matching theschema above.
After parsing, throw away <think>. The reasoning improves output
quality on multi-step tasks; the strict structured <answer> is
what downstream systems consume. The R1-style reasoning fine-tuning
of Section 9-05 is the principle behind this on the training side;
the prompt pattern here is its inference-time analogue.
For tasks that don't need reasoning (single-fact extraction),
skip the chain-of-thought. It adds latency without improving
quality.
Pattern 5: few-shot exemplars matched to the input
Hard-coded few-shot examples eat tokens and don't always match the
current input. Retrieved exemplars — pick the k most similar
prior examples from a labelled bank and insert them — improve
quality on long-tail inputs at modest token cost.
def build_prompt(user_query: str, exemplar_bank, embed_fn, k: int = 3) -> str: q_emb = embed_fn(user_query) bank_embs = embed_fn([e["query"] for e in exemplar_bank]) sims = bank_embs @ q_emb / np.linalg.norm(q_emb) top = np.argsort(-sims)[:k] examples = "\n\n".join( f"Example query: {exemplar_bank[i]['query']}\n" f"Example answer:\n{exemplar_bank[i]['answer']}" for i in top ) return f"{examples}\n\nNow answer the following:\n{user_query}"
The bank starts as 20–30 hand-curated examples; production usage
keeps growing it with examples flagged as "good" by reviewers.
Pattern 6: critic-revise for high-stakes tasks
For tasks where one bad output costs more than ten good ones:
Generate. First pass produces a candidate.
Critique. Second pass reads the candidate and a checklist
("are all numbers grounded?", "does the JSON validate?", "is the
tone within firm guidelines?"); emits a list of issues.
Revise. Third pass produces a fixed version given the
critique.
3× cost, 30%+ fewer downstream failures on judgement-heavy tasks.
Reserve for cases where the latency budget supports it (overnight
reports, drafted memos, not real-time chat).
Pattern 7: refuse-or-defer for out-of-scope inputs
A prompt should explicitly tell the model what NOT to answer.
If the user's question is outside the scope of [filings extraction,KPI calculation, narrative summary], do not improvise. Respond with:{"refusal": "out_of_scope", "reason": "<one sentence>"}
Without this, the model will help-fully attempt anything; with it,
the failure mode becomes a clean refusal that downstream systems
can handle (e.g., escalate to a human or route to a different
agent).
Anti-patterns that quietly hurt
A working list:
Jailbreak by accident. Phrases like "as an expert analyst with
no compliance constraints…" relax the model's safety prompts
unintentionally. Audit for these.
Implicit math. "Compute the company's free cash flow" without
saying which definition. The model picks a definition; the choice
may not match the firm's. Be explicit.
Mixed languages without intent. Korean prompt with English
examples produces inconsistent outputs. Pick one language; if
multilingual is needed, model it explicitly.
Format drift across versions. Changing whitespace, capitalisation,
or punctuation in the prompt without versioning. Even small drift
changes outputs; treat the prompt as code.
Evaluation harness for prompts
The minimum infrastructure to keep prompt edits honest:
import jsonfrom pathlib import Pathdef evaluate_prompt(prompt: str, model: str, suite_path: Path) -> dict: """Run a prompt against a JSON suite of (input, expected) pairs.""" suite = json.loads(suite_path.read_text()) results = [] for case in suite: out = call_llm(model, prompt, case["input"]) ok_schema = validate_schema(out, case["schema"]) ok_keys = all(k in out for k in case["required_keys"]) ok_grounding = all( c["citation"] in case["context"] for c in out.get("citations", []) ) results.append({ "id": case["id"], "schema": ok_schema, "keys": ok_keys, "grounding": ok_grounding, }) return aggregate(results)
Run on every prompt edit. Block merge if any test regresses.
Where prompt-craft ends
These patterns get you a long way — usually to the point where
prompting alone is no longer the bottleneck. Past that point the
gain comes from fine-tuning (Section 7-03 / Chapter 9), better
retrieval (Section 7-02), or stronger models (Section 7-04). But
deciding where to invest depends on knowing how good prompting
alone gets, and the patterns here are how to find that ceiling
honestly.