"There are no two words in the English language more harmful than
'good job.'"
— Whiplash (2014)
Supervised fine-tuning teaches a language model to imitate. RL
fine-tuning teaches it to be measured against something — a reward
model, a verifiable criterion, a preference signal — and to update
its weights toward a higher score on that measure. For a lot of
finance applications the difference matters: the question is rarely
"what is the most likely next token" but "given two extractions from
this 10-K, which one survives reconciliation against the bundled
totals?" That second question is exactly what RL fine-tuning is for.
This chapter is the working manual. We cover TRL — the Hugging
Face library that has become the de facto standard for fine-tuning
language models with RL — in enough depth to ship a finance-specific
DPO or GRPO pipeline. The bigger picture from Section 7-03 (when to
fine-tune at all) is the prerequisite; this chapter assumes you have
decided that prompting and SFT are not enough and you need the model
to prefer certain outputs over others rather than merely produce
them.
What this chapter does
The chapter is organised so that its end product is a runnable
SFT → DPO pipeline that fine-tunes a 7B-class model on a single GPU
with a small preference dataset:
Why RL fine-tuning matters for financial tasks specifically — the
parts where verifiable rewards are the right tool, and the parts
where it is overkill.
The TRL library at a level of detail that lets you choose
between SFTTrainer, DPOTrainer, PPOTrainer, GRPOTrainer
based on your data and your loss surface.
Reward modelling for finance: programmatic rewards (schema
validation, reconciliation), preference-based rewards (analyst
rankings), and the trade-offs between them.
An end-to-end SFT → DPO pipeline on a small finance dataset
with the evaluation discipline that catches regressions.
GRPO and R1-style reasoning — the post-2024 recipe that powers
the new wave of reasoning-finetuned LLMs for financial extraction
and structured analysis.
Three through-lines
Pick the cheapest method that solves the problem. SFT first,
DPO if behaviour disagreements persist, full RL (PPO / GRPO) only
when verifiable rewards are central. The evidence in the literature
\citep{jin2025rl-vs-sft} is consistent with this ordering.
Programmatic rewards beat learned reward models when you can
get them. Finance tasks have a lot of verifiable structure
(numbers must reconcile, JSON must validate, citations must point
to real spans). Use that structure.
Distillation and small models. A fine-tuned 7B model with
good rewards routinely beats a frontier API on a narrow finance
task at a fraction of the per-token cost. The downstream
cost-management story of Section 7-04 is the reason RL fine-tuning
matters operationally.
Contents
Why RL Fine-Tuning — when RL helps
beyond SFT, which finance tasks it suits, and the risks of doing
it for the wrong reason.
The TRL Library — SFTTrainer,
DPOTrainer, PPOTrainer, GRPOTrainer, RewardTrainer, and
the surrounding plumbing (PEFT, vLLM rollout, accelerate).
Reward Modelling — programmatic and
preference-based rewards, schema-and-reconciliation patterns for
financial extraction, and how to build a calibration set.
SFT to DPO Pipeline — runnable
end-to-end fine-tune of a 7B model on a small finance dataset,
with the evaluation suite that decides whether to ship.
GRPO and Reasoning — Group Relative
Policy Optimization, R1-style reasoning fine-tuning, and the
finance applications that benefit (multi-step extraction,
structured analysis, code-from-spec).
Why RL Fine-Tuning
Section 7-03 covered when fine-tuning is worth the budget at all.
This section is one level deeper: given that you have decided to
fine-tune, when should you reach past SFT and add an RL step?
The empirical answer that has stabilised over 2024–2025 is mostly
no, sometimes yes — and the sometimes is large enough in finance
to deserve a chapter. Two recent papers frame the trade-off cleanly.
Jin et al. \citep{jin2025rl-vs-sft} show that SFT delivers most of
the gain, fastest, on tasks where the right answer is a fixed token
sequence; RL pulls ahead on tasks where the criterion is
verifiable but the answer space is large (math, code, structured
extraction). DeepSeek-R1 \citep{guo2025deepseekr1} shows what happens
when you push the second case to its limit: pure RL with verifiable
rewards produces a reasoning-capable model from a non-reasoning base
without any supervised reasoning traces.
Both directions matter for finance.
When SFT is enough
Three task shapes that fine-tune to production quality with SFT alone,
no RL needed:
Style and tone normalisation. Producing memos in a firm's house
style; rewriting client-facing prose to a tone guideline. The
target distribution is well-defined and SFT examples are easy to
collect.
Schema-bound extraction with abundant data. Pulling out
structured fields from filings when you have thousands of
high-quality (input, output) pairs. SFT will learn the schema and
the field semantics directly.
Domain vocabulary adaptation. Teaching a base model the firm's
terminology, ticker conventions, and product names. A small SFT on
curated examples is the cheapest path.
If you are in any of these regimes, stop here. The cost of an RL
pipeline (reward model, preference data, more compute) is not earned.
When RL fine-tuning helps
Three shapes that genuinely benefit from an RL step on top of SFT:
Verifiable-reward extraction. The output has to reconcile:
totals add up, fields match a schema, citations point to real
spans, derived figures match the source. A reward function can
check this programmatically; RL pushes the model toward outputs
that pass.
Preference judgements. The right answer is a ranking, not a
token sequence — which of two analyst summaries is sharper, which
of two trade rationales is more defensible. SFT cannot encode "this
one over that one" cleanly; DPO can.
Multi-step reasoning. Tasks that require working through
several intermediate steps before producing the final answer
(multi-table extraction, structured derivation of a financial
metric from primary sources). GRPO and similar group-relative
methods reinforce reasoning patterns that SFT examples cannot
cleanly demonstrate.
The pattern across these three: the criterion is clear, but the
path to satisfy it is not unique. RL is a tool for telling the model
"this output is better than that one" without prescribing exactly
how it gets there.
The three RL fine-tuning regimes
The chapter covers three regimes in increasing operational complexity:
DPO (Direct Preference Optimisation) \citep{rafailov2023dpo}.
Closed-form contrastive loss between a chosen and a rejected
response. No reward model, no reinforcement-learning loop. The
default first attempt; works well when preferences exist as pairs.
PPO with a learned reward model. The classical RLHF recipe.
Train a reward model on preferences, then PPO the policy against
it. More moving parts; more compute; pays off when preferences are
too rich to fit DPO's pair-wise contrast.
GRPO (Group Relative Policy Optimisation). Sample N
candidates per prompt, score each with a verifiable reward,
normalise within the group, and update. No value model, no
separate critic; the group itself sets the baseline. The recipe
behind DeepSeek-R1 and most reasoning-finetuned models since.
Section 9-02 walks through the TRL library trainer for each of these.
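The "normalise within the group" step of GRPO can be sketched in a few lines of plain Python. This is a minimal illustration, not TRL's implementation: the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for the example.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise rewards within one prompt's group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumption)
    # eps keeps the division finite when every completion gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Binary verifiable rewards for N = 4 completions of a single prompt
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group mean get a positive advantage and the rest a negative one, so the group itself plays the role of the baseline. A group where every completion scores the same yields all-zero advantages and therefore no update.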
Three failure modes worth knowing
RL fine-tuning has a richer failure surface than SFT, and the
failures are easy to miss without explicit monitoring.
Reward hacking. The policy finds an output the reward model
scores highly but humans dislike. The classic example: outputs
that exploit the reward model's biases (verbose summaries, hedged
claims) without genuinely improving. Mitigation: a held-out
human eval set scored separately.
Mode collapse. RL pushes the policy toward a small set of
near-deterministic outputs. Token-level entropy collapses; the
model loses creativity on unseen prompts. Mitigation: KL
regularisation against the SFT reference policy
\citep{cui2025entropy} and an explicit entropy floor in the
optimiser.
Reward model drift. As the policy updates, its outputs move
outside the distribution the reward model was trained on; the
reward model's predictions become less reliable. Mitigation:
periodic reward-model retraining on fresh policy samples, or
programmatic rewards (next section) that do not have this
problem.
How this chapter connects
RL fine-tuning sits between Section 7-03 (when to fine-tune at
all) and the agent layer of Chapter 8 (where fine-tuned models
actually get called as tools). The synthetic-data Chapter 10 then
provides the evaluation regime where fine-tuned policies are
stress-tested. The four chapters together close the loop:
Decide whether to fine-tune (Section 7-03).
Decide how to fine-tune (this chapter).
Wrap the fine-tuned model in agents and tools (Chapter 8).
Evaluate the result on synthetic stress data (Chapter 10).
Section 9-02 covers the TRL library that provides the actual
machinery for step 2.
The TRL Library
Hugging Face's TRL (Transformer Reinforcement Learning) library
is the de facto standard for fine-tuning language models with RL on
top of transformers. It exposes a set of trainers — SFTTrainer,
RewardTrainer, DPOTrainer, PPOTrainer, GRPOTrainer,
KTOTrainer, ORPOTrainer — that share the same configuration and
checkpoint conventions as the rest of the Hugging Face stack and
compose with PEFT (LoRA / QLoRA), accelerate, and vllm for
inference rollouts. This section is a working tour: what each
trainer does, when to reach for it, and the integration glue that
makes the whole thing run on a single GPU.
Architecture in one paragraph
TRL trainers are subclasses of transformers.Trainer (or its
distributed variants) with the loss and the data-loading patched
for the RL or preference-learning objective. The model under training
is a transformers.PreTrainedModel; the optional reference model
(for KL terms) is another instance of the same architecture; the
optional reward model is yet another. PEFT adapters can be inserted
on the model under training so the trainable parameter count drops
to a few percent of the base; the reference model stays frozen and
shares weights via LoRA's adapter-disable trick.
from trl import SFTTrainer, RewardTrainer, DPOTrainer, PPOTrainer, GRPOTrainer
The five trainers below are the load-bearing ones; the rest
(KTOTrainer, ORPOTrainer, CPOTrainer) are variations on the
DPO theme that make sense once you have a reason to prefer a
different preference-learning loss.
SFTTrainer
The starting point and the prerequisite for everything else. SFT is
ordinary causal-language-model fine-tuning with TRL's data-formatting
and packing helpers.
packing=True packs short examples into longer sequences,
which roughly doubles SFT throughput on finance datasets where
most examples are 200–800 tokens.
max_seq_length should match the longest example you actually
see; padding to a larger context wastes compute.
For PEFT-only training, wrap the model in a LoRA config first; TRL
recognises PEFT models and keeps the reference / value heads on the
right adapter graph automatically.
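To make the packing bullet concrete, here is a minimal sketch of the concatenate-and-chunk idea in plain Python. It assumes the examples are already tokenised into lists of integer IDs and uses a hypothetical `EOS_ID` as the separator; TRL's own packing implementation may differ in detail.

```python
from typing import Iterable

EOS_ID = 0  # hypothetical end-of-sequence token id, for illustration only


def pack_examples(examples: Iterable[list[int]], max_seq_length: int) -> list[list[int]]:
    """Concatenate tokenised examples (EOS-separated) and cut fixed-size blocks."""
    stream: list[int] = []
    for ids in examples:
        stream.extend(ids)
        stream.append(EOS_ID)  # mark the boundary between examples
    # Keep only full blocks; the trailing remainder is dropped in this sketch.
    n_full = len(stream) // max_seq_length
    return [stream[i * max_seq_length:(i + 1) * max_seq_length] for i in range(n_full)]


# Three short examples (3, 2 and 4 tokens) packed into blocks of 4 tokens
blocks = pack_examples([[5, 6, 7], [8, 9], [10, 11, 12, 13]], max_seq_length=4)
```

Every block is completely filled with real tokens, which is where the throughput gain over padding each short example to the full context comes from.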
RewardTrainer
Trains a reward model from preference data — pairs of
(prompt, chosen, rejected). Loss is a Bradley–Terry pairwise
log-likelihood. Useful only if you plan to do PPO; DPO skips this
step.
The reward model needs a held-out preference set with human-graded
agreement: the percentage of held-out pairs where the reward
model's preference matches the human label. Below 70% the reward
model is too noisy to PPO against; above 85% it is good enough to
serve as the optimisation target for several thousand steps before
needing a refresh.
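The pairwise loss and the held-out agreement check can be sketched without any library. The function names and the "borderline" label for the 70–85% band are assumptions for illustration; the two thresholds are the ones quoted above.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    # -log(sigmoid(d)) == log(1 + exp(-d)), where d is the reward margin
    return math.log1p(math.exp(-(r_chosen - r_rejected)))


def pair_accuracy(scored_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs the reward model orders correctly."""
    return sum(1 for c, r in scored_pairs if c > r) / len(scored_pairs)


def rm_verdict(acc: float) -> str:
    """Map held-out agreement to the thresholds quoted in the text."""
    if acc < 0.70:
        return "too noisy for PPO"
    if acc > 0.85:
        return "good enough"
    return "borderline"  # label for the middle band is an assumption


# Reward-model scores on four held-out pairs; the human label is always "chosen"
held_out = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.9, 0.2)]
acc = pair_accuracy(held_out)
```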
DPOTrainer
Direct Preference Optimisation \citep{rafailov2023dpo} reformulates
RLHF as a closed-form contrastive loss: maximise the log-ratio of
chosen to rejected probabilities relative to a frozen reference
model, with no reward model and no value head. The default first
attempt for any preference-based fine-tuning task.
beta: KL regularisation strength. Larger β keeps the
policy closer to the reference; smaller β allows bigger
shifts. Start at 0.1; raise it if the policy is collapsing.
ref_model: pass None when using PEFT — TRL re-uses the
base model with adapters disabled to compute the reference. This
is the trick that makes DPO fit on a single GPU.
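For intuition, the DPO loss for a single preference pair can be written in plain Python. This is a sketch under stated assumptions: the inputs are summed sequence log-probabilities under the policy and under the frozen reference, and the function name and overflow guard are illustrative rather than taken from TRL.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair from summed sequence log-probabilities."""
    # Log-ratios of policy to frozen reference for each response
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)); for very negative margins it is approximately -margin,
    # which avoids overflow in exp()
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin


# Policy identical to the reference: margin is zero, loss is log(2)
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved toward the chosen response and away from the rejected one
loss_after = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Because the margin is scaled by `beta`, the same shift in log-probabilities moves the loss more when `beta` is larger.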
PPOTrainer
The classical RLHF recipe: rollout from the policy, score with a
reward model, compute advantages, PPO update. More moving parts than
DPO; reach for it when preferences are too rich to fit a pairwise
contrast.
vLLM rollout backend can be plugged in to accelerate the
generation step (the bottleneck of PPO) — pass
use_vllm=True and vllm_device in the PPOConfig if your TRL
version supports it.
kl_coef and DPO's beta express the same idea; PPO usually wants
a smaller value because the rollout-and-update loop has its own
exploration mechanism.
A working recipe for PPO on a small finance dataset that surfaces
the operational complexity:
Three knobs that decide whether PPO converges or oscillates:
target_kl early stop. The trainer stops the current epoch
when the per-batch KL exceeds this; without it, the policy can
drift far enough that the importance ratio explodes.
mini_batch_size vs batch_size. The full batch is split into
mini-batches and each mini-batch produces an update. Smaller
mini-batches give noisier but more frequent updates; larger gives
smoother but slower training. 8/64 (mini_batch_size / batch_size) is the standard ratio.
Reward standardisation. The reward model's outputs may have
large absolute scale; PPO is sensitive to advantage variance.
Normalise rewards per batch (subtract mean, divide by std) before
computing returns, the same trick that stabilises actor-critic in
Section 5-04.
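Two of these mechanics are easy to see in isolation: the clipped importance ratio and the KL-based early stop. The sketch below is plain Python on per-sample log-probabilities; the function names, `clip_eps`, and the simple sample-mean KL estimator are assumptions for illustration, not the trainer's internals.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      clip_eps: float = 0.2) -> float:
    """PPO clipped objective for one sample (a quantity to be maximised)."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def should_stop_early(logp_old: list[float], logp_new: list[float],
                      target_kl: float) -> bool:
    """Stop when the sample estimate of KL(old || new) exceeds target_kl."""
    approx_kl = sum(o - n for o, n in zip(logp_old, logp_new)) / len(logp_old)
    return approx_kl > target_kl


# A ratio of 2 with a positive advantage is capped at 1 + clip_eps
capped = clipped_surrogate(math.log(2.0), 0.0, advantage=1.0)
```

The clip limits how much a single sample can gain from a large ratio, and the KL check ends the epoch before the ratios drift far enough for that cap to dominate every update.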
When the reward model drifts (its predictions diverge from human
preferences as the policy moves out of the RM's training
distribution), the standard fix is iterative: every N PPO epochs,
sample fresh rollouts from the current policy, re-label with humans,
re-train the RM, and continue. This is the core RLHF loop; modern
DPO and GRPO recipes mostly exist to avoid it.
GRPOTrainer
Group Relative Policy Optimisation \citep{shao2024deepseekmath} is
the "PPO without the value model" idea that DeepSeek-R1
\citep{guo2025deepseekr1} popularised for reasoning fine-tuning. For
each prompt, sample N candidate completions; score each with a
verifiable reward (math-correctness, JSON validity, schema
match); normalise within the group; PPO-style update. No value
network, no separate critic, no reward-model drift.
from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig

# `model`, `tok`, and `score_one` are assumed to be defined elsewhere.

def reward_fn(completions, prompts, **kwargs):
    """Programmatic reward: 1.0 if the completion's JSON validates against
    the target schema and reconciles with the source totals; 0.0 otherwise."""
    return [score_one(p, c) for p, c in zip(prompts, completions)]

GRPOTrainer(
    model=model,
    reward_funcs=reward_fn,  # a callable or a list of callables; multiple rewards mix
    processing_class=tok,
    args=GRPOConfig(
        output_dir="ckpt/grpo",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        num_generations=8,  # group size N
        max_prompt_length=2048,
        max_completion_length=2048,
        learning_rate=1e-6,
        beta=0.04,
        bf16=True,
    ),
    train_dataset=load_dataset("json", data_files="data/prompts.jsonl", split="train"),
).train()
GRPO shines when:
The reward is verifiable (you can write score_one as a small
function), not "judge with another LLM".
The task admits diverse correct trajectories — multi-step
reasoning, structured extraction with several valid orderings.
You want a small model that beats a frontier API on a narrow
task at low inference cost.
Section 9-05 develops a full GRPO recipe for a financial reasoning
task.
PEFT, accelerate, and vllm — the integration glue
Three companion libraries make TRL pipelines fit on consumer-class
hardware:
PEFT (LoRA / QLoRA / DoRA). Add a peft_config to any of the
trainers above and only the LoRA adapter is updated. QLoRA's 4-bit
base + LoRA on top makes 70B fine-tuning feasible on a single
H100; for 7B it makes the whole pipeline run on a 24 GB consumer
GPU.
accelerate. TRL trainers are accelerate-aware. Multi-GPU
data parallelism is an accelerate config away; FSDP / DeepSpeed
for >1B trainable parameters are a config-flag away.
vllm. GRPO and PPO bottleneck on rollout generation. vllm's
paged-attention kernel speeds up generation 4–10× and TRL's
use_vllm=True flag plugs it in transparently. For finance
fine-tunes on a single GPU this turns "one epoch overnight" into
"one epoch in two hours".
For replacing DPO when the chosen/rejected gap is small: KTOTrainer, ORPOTrainer.
The decision tree:
Is there a verifiable reward? → GRPO.
Are there preference pairs? → DPO (or RewardTrainer + PPO if
pairs are too rich for a pairwise loss).
Is the goal to imitate exemplars? → SFT.
Otherwise: SFT first, observe whether the failure mode is "wrong
style" (stop) or "wrong choice between defensible alternatives"
(DPO) or "wrong reasoning trajectory" (GRPO).
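The decision tree above reads naturally as a small function. This is only a sketch of the ordering of the questions; the function name, the boolean flags, and the returned strings are illustrative assumptions.

```python
def choose_trainer(verifiable_reward: bool, preference_pairs: bool,
                   imitate_exemplars: bool, pairs_too_rich: bool = False) -> str:
    """Walk the decision tree in the order the questions are asked."""
    if verifiable_reward:
        return "GRPOTrainer"
    if preference_pairs:
        # Fall back to the two-stage recipe when a pairwise loss cannot capture the signal
        return "RewardTrainer + PPOTrainer" if pairs_too_rich else "DPOTrainer"
    if imitate_exemplars:
        return "SFTTrainer"
    return "SFTTrainer, then inspect the failure mode"


choice = choose_trainer(verifiable_reward=False, preference_pairs=True,
                        imitate_exemplars=False)
```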
Section 9-03 covers reward modelling — both programmatic and
preference-based — in detail. Section 9-04 wires SFT and DPO
together into a runnable pipeline.
Reward Modelling
The reward function is the single most important design decision in
an RL fine-tuning pipeline. Everything else — the trainer, the
batch size, the KL coefficient — is downstream of what you are
optimising. Get the reward right and a small model fine-tunes into
something useful; get it wrong and a large model finds the
loophole instead of the task.
This section covers the three classes of reward that recur in
finance applications, the trade-offs between them, and the
calibration discipline that decides whether a reward is good enough
to optimise against.
Three classes of reward
Reward functions fall into three buckets, in increasing order of
difficulty to construct and operate.
Programmatic reward. A pure function of the model's output
(and optionally the source data) that returns a scalar. JSON
validity, schema match, numeric reconciliation, code that
compiles, citations that resolve. These are the cleanest rewards
and the foundation for GRPO-style fine-tuning.
Preference-based reward. A learned model that scores a
candidate output by predicting how a human would rank it against
alternatives. Used in classical RLHF; the substrate for RewardTrainer
+ PPO.
LLM-judge reward. A separate language model is prompted to
score the candidate against a rubric. Cheap to construct, easy
to bias, and prone to reward hacking; treat as a last resort or
as a secondary signal alongside a programmatic one.
The empirical evidence over 2024–2025 is consistent: when a
programmatic reward exists, it dominates. Most finance tasks have
more programmatic structure than the literature gives them credit
for.
Programmatic rewards for finance
Three patterns recur and cover most production use cases.
Schema and reconciliation
The output is a structured extraction — fields from a 10-K,
positions from a portfolio report, entries from a press release.
The reward checks that:
The output validates against a JSON schema (typed, required
fields present, correct cardinalities).
Numeric fields reconcile: line totals add up, sub-categories
sum to category totals, derived ratios match recomputation from
primary fields.
Citations point to spans that actually exist in the source.
import json, re
from jsonschema import validate, ValidationError

def schema_reconcile_reward(prompt: str, completion: str,
                            schema: dict, source_text: str) -> float:
    # extract the JSON block from the completion
    m = re.search(r"```json\s*(\{.*?\})\s*```", completion, re.S)
    if not m:
        return 0.0
    try:
        out = json.loads(m.group(1))
    except json.JSONDecodeError:
        return 0.0
    # schema validity
    try:
        validate(out, schema)
    except ValidationError:
        return 0.2  # partial credit for valid JSON
    # reconciliation: subtotals must equal sum of components
    components = out.get("components", [])
    total = out.get("total")
    if total is not None and components:
        recomputed = sum(c.get("value", 0) for c in components)
        if abs(recomputed - total) > 1e-2:
            return 0.5  # JSON valid, totals broken
    # citations resolve
    for cite in out.get("citations", []):
        if cite not in source_text:
            return 0.7  # mostly right, citation broken
    return 1.0  # all checks pass
The graded scoring (0.2 / 0.5 / 0.7 / 1.0) gives the optimiser a
dense gradient signal: there are partial credits for the
intermediate steps so the policy can climb the reward surface
gradually, instead of finding only the trivial 0.0 vs 1.0 cliff.
Numeric correctness
For tasks with a unique numeric answer (financial-math problems,
ratio computations from primary statements), the reward is an exact
match on the numeric field, up to a small relative tolerance, after
normalisation:
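import re

def numeric_match_reward(completion: str, target: float, tol: float = 1e-3) -> float:
    # expects a line such as "answer: 12.5" in the completion
    m = re.search(r"answer:\s*(-?\d+\.?\d*)", completion, re.I)
    if not m:
        return 0.0
    pred = float(m.group(1))
    # relative error, with a small constant guarding against target == 0
    return 1.0 if abs(pred - target) / (abs(target) + 1e-9) < tol else 0.0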
DeepSeekMath \citep{shao2024deepseekmath} and the broader R1 line
\citep{guo2025deepseekr1} show that a binary numeric reward is
enough to teach a non-reasoning base model to reason on math —
because the action space (CoT trajectories) is rich enough that
the policy can find paths that satisfy the criterion. The same
applies to financial-math problems with a unique answer.
Code-from-spec
For tasks where the model must produce code that satisfies a spec
(generate a Polars query for a question, generate a backtest from
a strategy description), the reward is (compiles, passes_tests):
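import subprocess
import tempfile

# `extract_code_block` and `test_cases_payload` are assumed helpers defined
# elsewhere; the payload must be bytes because the subprocess runs in binary mode.

def code_reward(completion: str, test_cases: list) -> float:
    code = extract_code_block(completion)
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py") as f:
        f.write(code)
        f.flush()
        try:
            r = subprocess.run(
                ["python", f.name],
                capture_output=True,
                timeout=10,
                input=test_cases_payload(test_cases),
            )
            # exit code 0 earns full credit; any non-zero exit earns partial credit
            return 1.0 if r.returncode == 0 else 0.3
        except subprocess.TimeoutExpired:
            return 0.0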
For finance specifically, this works well for narrow code tasks
(generate a feature pipeline, write a backtest harness with named
parameters). It does not work for "write a trading strategy" —
the criterion is not verifiable.
Preference-based rewards
When the task admits no programmatic reward — "summarise this
analyst note in our house style", "draft a client-facing memo about
this market move" — the alternative is to collect human preferences
and learn a reward model.
Preference data collection
Each item in the dataset is a triple (prompt, chosen, rejected).
Two collection patterns:
Pairwise from a single annotator. Show an annotator a prompt
and two candidate completions; they pick the better one.
Group ranking. Show N candidates per prompt; the annotator
ranks them. Decompose the ranking into all N(N-1)/2 pairs
for the pairwise loss.
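Decomposing a ranking into pairs is a one-liner with `itertools.combinations`. The sketch assumes the annotator's ranking arrives as a best-first list; the function name and the record layout (`prompt`, `chosen`, `rejected`) are illustrative.

```python
from itertools import combinations


def ranking_to_pairs(prompt: str, ranked: list[str]) -> list[dict]:
    """`ranked` is best-first; emit every (chosen, rejected) pair it implies."""
    # combinations() preserves input order, so the first element of each pair ranks higher
    return [{"prompt": prompt, "chosen": better, "rejected": worse}
            for better, worse in combinations(ranked, 2)]


# N = 4 candidates give 4 * 3 / 2 = 6 pairs
pairs = ranking_to_pairs("Summarise the note.", ["A", "B", "C", "D"])
```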
The dataset size that produces a useful reward model is in the
low thousands of pairs for narrow finance tasks. Below that the
reward model is too noisy; above ~50K the marginal signal flattens.
Reward-model training
RewardTrainer (Section 9-02) handles the actual fitting. The
output you should monitor:
Held-out pair accuracy. Random baseline is 50%; a useful
model is ≥70%; a strong model is ≥85%. Below 70%, do
not PPO against this model — the gradient is too noisy.
Calibration. Plot reward-model score vs human-graded quality
on a held-out set. A flat or non-monotone curve means the
optimiser will follow the wrong direction.
Out-of-distribution behaviour. Sample from the current
policy and score; the score distribution should overlap the
training distribution. Drift means the reward model has expired.
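The calibration check can be approximated without plotting: bucket the held-out items by human grade and confirm that the mean reward-model score rises with the grade. The sketch assumes integer human grades; the function names are illustrative.

```python
from collections import defaultdict
from statistics import fmean


def calibration_curve(human_grades: list[int],
                      rm_scores: list[float]) -> list[tuple[int, float]]:
    """Mean reward-model score per human grade, sorted by grade."""
    buckets = defaultdict(list)
    for grade, score in zip(human_grades, rm_scores):
        buckets[grade].append(score)
    return [(g, fmean(buckets[g])) for g in sorted(buckets)]


def is_monotone(curve: list[tuple[int, float]]) -> bool:
    """True when the mean score strictly increases with the human grade."""
    means = [m for _, m in curve]
    return all(a < b for a, b in zip(means, means[1:]))


curve = calibration_curve([1, 1, 2, 2, 3, 3], [-0.5, -0.3, 0.1, 0.3, 0.9, 1.1])
```

A curve that fails `is_monotone` is the flat or non-monotone case described above.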
When to refresh the reward model
The classical RLHF pattern: train RM, do a few thousand PPO steps,
evaluate, retrain RM on fresh policy outputs labelled by humans,
repeat. For finance applications the cadence depends on the rate of
policy drift; a useful default is every 1–2K PPO steps in early
training, every 5–10K once the policy stabilises.
LLM-judge rewards (use sparingly)
A growing literature uses one LLM to score another's outputs
against a rubric. Cheap to set up; treacherous to optimise against.
Three failure modes to watch for:
Self-preference. A judge from the same model family rewards
outputs in its own house style; the policy learns the style, not
the substance.
Length bias. Judges over-reward verbose outputs; the policy
learns to be wordy.
Hedge bias. Judges over-reward outputs that include caveats;
the policy learns to refuse useful conclusions.
If you must use an LLM judge:
Pair it with a programmatic reward and weight the programmatic
one higher.
Use a judge from a different model family than the policy.
Audit a sample of policy outputs against the judge's labels every
few hundred steps.
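The first rule, weighting the programmatic reward above the judge, can be enforced in code rather than by convention. The weights below are hypothetical defaults chosen for the example, not recommended values.

```python
def combined_reward(programmatic: float, judge: float,
                    w_programmatic: float = 0.8, w_judge: float = 0.2) -> float:
    """Blend a programmatic reward with an LLM-judge score, both assumed in [0, 1]."""
    if w_programmatic <= w_judge:
        raise ValueError("the programmatic reward should carry the larger weight")
    return w_programmatic * programmatic + w_judge * judge


# An output the judge loves but that fails the programmatic check scores low
judge_only = combined_reward(programmatic=0.0, judge=1.0)
checks_only = combined_reward(programmatic=1.0, judge=0.0)
```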
Building a reward calibration set
Whichever class of reward you use, set aside a held-out reward
calibration set: prompts with both human-graded labels and
programmatic-reward scores. This set lets you:
Verify the reward model agrees with humans (preference rewards).
Verify the programmatic reward correlates with what people
actually want (programmatic rewards).
Detect reward drift across training (re-score the policy on the
same prompts every k steps and watch for anomalies).
The calibration set is also what you report to a model-risk reviewer.
A pipeline that cannot defend its reward function in front of a
human auditor is not ready for production.
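A minimal version of the drift check compares the mean reward on the fixed calibration prompts against a baseline taken at the start of training. The `max_shift` threshold and the report layout are assumptions for illustration.

```python
from statistics import fmean


def drift_report(baseline_scores: list[float], current_scores: list[float],
                 max_shift: float = 0.1) -> dict:
    """Compare mean reward on the fixed calibration prompts against a baseline."""
    shift = fmean(current_scores) - fmean(baseline_scores)
    return {"shift": shift, "anomalous": abs(shift) > max_shift}


# Scores on the same two calibration prompts at step 0 and at a later checkpoint
report = drift_report([0.5, 0.5], [0.9, 0.7])
```

Running this every k steps on the same prompts gives a small time series that can be handed to a reviewer alongside the reward definition.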
What this section adds
The reward function is the thing the policy is being pushed
toward. The TRL trainers in Section 9-02 are the mechanism. The
next section wires the two together into a runnable SFT → DPO
pipeline; Section 9-05 takes the same architecture into the GRPO
regime where verifiable rewards from this section get to flex.
SFT to DPO Pipeline
This section wires Sections 9-02 (TRL library) and 9-03 (reward
design) into a runnable end-to-end pipeline: take a base 7B model,
SFT it on finance instructions, DPO it on preference pairs, and
ship a fine-tuned adapter that beats the base on a held-out finance
benchmark. The recipe is the one most production teams actually use
in 2025.
The pipeline at a glance
base model (e.g. Llama-3.1-8B-Instruct)
    │
    ▼  SFT (Section 9-02 SFTTrainer, Section 7-03 data pipeline)
    │
SFT model (style adapted, schema mostly correct)
    │
    ▼  DPO (Section 9-02 DPOTrainer, Section 9-03 preferences)
    │
DPO model (preferred outputs preferred more often)
    │
    ▼  Eval (Section 9-03 calibration set + held-out OOD)
    │
ship as PEFT adapter
Three guard-rails that travel with this pipeline:
PEFT (LoRA) only. Both SFT and DPO update the same LoRA
adapter. The base weights stay frozen. The shipped artefact is a
small (~50 MB) adapter, not a 16 GB model copy.
Reference model from the SFT checkpoint. DPO compares the
policy to a reference; using the SFT checkpoint as the reference
rather than the original base gives DPO a much closer starting
distribution and stabler training.
Held-out evaluation untouched by either step. The same OOD
set is used to compare base, SFT, and DPO models; no reuse of
training prompts in evaluation.
Stage 1: SFT
Assume a finance instruction dataset
data/sft_finance.jsonl where each row is
{"prompt": "Extract revenue and net income from the following 10-Q...", "completion": "{\n \"revenue\": 12345,\n \"net_income\": 678\n}\n"}
packing=True doubles throughput on short examples; use it
unless your data has very long single examples.
learning_rate=2e-4 for LoRA; full fine-tuning would be
∼10× smaller. Don't accidentally use full-FT learning
rates with LoRA — you will under-train.
Stage 2: collect preference pairs
Three patterns for getting (prompt, chosen, rejected) triples on
a finance dataset:
Sample-and-judge. Sample 4–8 completions per prompt from the
SFT model with temperature=0.7. Have an annotator (human or LLM
judge with a programmatic backstop from Section 9-03) pick the
best and worst. Use those as chosen and rejected.
SFT-vs-rule. Use the SFT completion as chosen and a
deliberately-flawed alternative (rule-based output, base model
output, mangled JSON) as rejected. Cheap; the resulting DPO
mostly teaches the policy to avoid obvious errors.
Programmatic re-ranking. For tasks with a verifiable reward
(Section 9-03), score N samples and pick the best as chosen,
worst as rejected. The reward function does the labelling work.
A reasonable starter dataset size is 1–5K pairs; below that
DPO is noisy.
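The programmatic re-ranking pattern can be sketched as follows. The `build_pair` helper, the `min_gap` filter, and the toy JSON-validity reward are all hypothetical stand-ins; in practice the reward would be the scorer from Section 9-03.

```python
import json
from typing import Callable, Optional


def build_pair(prompt: str, samples: list[str],
               reward_fn: Callable[[str, str], float],
               min_gap: float = 0.0) -> Optional[dict]:
    """Score N samples and keep the best as chosen, the worst as rejected."""
    scored = sorted(((reward_fn(prompt, c), c) for c in samples), key=lambda t: t[0])
    (low, worst), (high, best) = scored[0], scored[-1]
    if high - low <= min_gap:
        return None  # no usable contrast between best and worst
    return {"prompt": prompt, "chosen": best, "rejected": worst}


def toy_reward(prompt: str, completion: str) -> float:
    """Toy stand-in reward: 1.0 if the completion parses as JSON, else 0.0."""
    try:
        json.loads(completion)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


pair = build_pair("Extract revenue.",
                  ['{"revenue": 1}', "revenue is 1", '{"revenue": '],
                  toy_reward)
```

Dropping prompts where all samples score the same avoids feeding DPO pairs with no real contrast.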
Stage 3: DPO
from datasets import load_dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig

# `base` (model id) and `tok` (tokenizer) are assumed to carry over from Stage 1.
# Reload the SFT-adapted model as the policy
sft_path = "ckpt/sft-finance"
policy = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
policy = PeftModel.from_pretrained(policy, sft_path, is_trainable=True)

ds_pref = load_dataset("json", data_files="data/pref_pairs.jsonl", split="train")
# expects columns: prompt, chosen, rejected

dpo_cfg = DPOConfig(
    output_dir="ckpt/dpo-finance",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    beta=0.1,
    bf16=True,
    max_length=4096,
    max_prompt_length=2048,
    save_strategy="epoch",
)

DPOTrainer(
    model=policy,
    # With ref_model=None, TRL computes the reference with the adapter disabled.
    # Caution: that is the bare base model unless the SFT adapter was merged
    # into the base weights first.
    ref_model=None,
    processing_class=tok,
    args=dpo_cfg,
    train_dataset=ds_pref,
).train()
Three things to watch during DPO training:
reward_chosen vs reward_rejected margin in the trainer's
metrics. Should grow during training; if it spikes early and
collapses, the policy has found a degenerate solution.
KL divergence to the reference. Should grow gradually; a
step-function jump means the optimisation blew up.
Token-level entropy. Should decrease but not collapse to zero.
If it does, raise beta or stop earlier.
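Token-level entropy is cheap to compute from the next-token distributions. A minimal sketch, assuming each distribution is available as a list of probabilities (in practice it would come from a softmax over the logits); the function names are illustrative.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(distributions: list[list[float]]) -> float:
    """Average entropy over the token positions of a completion."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


# A healthy position (uncertain between two tokens) and a collapsed one
healthy = token_entropy([0.5, 0.5])
collapsed = token_entropy([1.0, 0.0])
```

Logging `mean_token_entropy` on a fixed set of prompts at each checkpoint makes a collapse toward zero visible before it shows up in output quality.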
Stage 4: evaluation
Three evaluation tracks, all run on a calibration set the model
has not seen:
Programmatic eval. Run the schema/reconciliation reward
(Section 9-03) on base, SFT, DPO outputs. The DPO model should
score ≥ SFT ≥ base.
Human-graded eval. A held-out sample of 100–500 prompts
reviewed by an annotator using the firm's rubric. Report
preference rates: SFT vs base, DPO vs SFT, DPO vs base.
Negative-control eval. A set of out-of-scope prompts (e.g.
general-purpose factual questions) where the fine-tuned model
should not have regressed. Catches the case where DPO
over-specialised and broke the model on tasks outside the finance
scope.
def evaluate_models(prompts, base_model, sft_model, dpo_model, tok, reward_fn):
    rows = []
    for p in prompts:
        outs = {
            "base": generate(base_model, tok, p),
            "sft": generate(sft_model, tok, p),
            "dpo": generate(dpo_model, tok, p),
        }
        rows.append({k: reward_fn(p, v) for k, v in outs.items()})
    return rows
The bar to ship: DPO wins on programmatic eval and human eval
and does not regress on negative controls. Two out of three is
not enough.
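The ship bar can be written down as an explicit gate so that "two out of three" cannot pass by accident. The function name, input layout, and the reading of "wins on human eval" as a DPO-vs-SFT preference rate above 0.5 are assumptions for illustration.

```python
def ready_to_ship(programmatic: dict, human_pref_dpo_vs_sft: float,
                  negative_control: dict, tolerance: float = 0.0) -> bool:
    """All three evaluation tracks must pass; any single failure blocks shipping."""
    wins_programmatic = programmatic["dpo"] > programmatic["sft"]
    wins_human = human_pref_dpo_vs_sft > 0.5  # assumption: majority preference counts as a win
    no_regression = negative_control["dpo"] >= negative_control["base"] - tolerance
    return wins_programmatic and wins_human and no_regression


# Wins on both finance tracks but regresses on out-of-scope prompts: do not ship
decision = ready_to_ship({"sft": 0.6, "dpo": 0.8}, 0.62, {"base": 0.9, "dpo": 0.7})
```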
A more complete evaluation harness for production use:
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class EvalReport:
    dataset: str
    n: int
    base_score: float
    sft_score: float
    dpo_score: float
    sft_vs_base_pref: float  # fraction of pairs where SFT > base
    dpo_vs_sft_pref: float
    dpo_vs_base_pref: float
    regressions: list[dict]  # tasks where DPO did worse than SFT


def production_eval(suite, base, sft, dpo, tok, reward_fn,
                    out_path: Path) -> EvalReport:
    base_scores, sft_scores, dpo_scores = [], [], []
    regressions = []
    for case in suite:
        b = reward_fn(case, generate(base, tok, case["input"]))
        s = reward_fn(case, generate(sft, tok, case["input"]))
        d = reward_fn(case, generate(dpo, tok, case["input"]))
        base_scores.append(b); sft_scores.append(s); dpo_scores.append(d)
        if d < s - 0.5:
            regressions.append({"id": case["id"], "sft": s, "dpo": d})
    n = len(suite)
    pref = lambda a, b: sum(1 for x, y in zip(a, b) if x > y) / n
    report = EvalReport(
        dataset=str(out_path),
        n=n,
        base_score=sum(base_scores) / n,
        sft_score=sum(sft_scores) / n,
        dpo_score=sum(dpo_scores) / n,
        sft_vs_base_pref=pref(sft_scores, base_scores),
        dpo_vs_sft_pref=pref(dpo_scores, sft_scores),
        dpo_vs_base_pref=pref(dpo_scores, base_scores),
        regressions=regressions,
    )
    out_path.write_text(json.dumps(asdict(report), indent=2))
    return report
The output is the artefact you take to a model-risk reviewer: mean
rewards per model, pairwise preference rates, and a list of per-task
regressions. A reviewer who cannot reproduce the numbers from this
JSON given the model checkpoints and the dataset has not been given
enough; one who can has the full audit trail.
What can go wrong
A working list of failure modes and fixes:
DPO does not improve over SFT. Most common cause: weak
preference data. Re-collect with stronger contrasts (chosen
clearly better than rejected); the noisier the labels, the more
pairs you need. A second cause: β too high — try 0.05.
Mode collapse. The DPO policy emits near-identical responses
for varied prompts. Lower the LoRA rank, raise β, shorten
training.
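Catastrophic forgetting on negative controls. The DPO policy
is much worse on out-of-scope prompts. Add a small sample of
general instruction data to the DPO training set, with the SFT
response as chosen and a clearly degraded response as rejected.
Pairs whose chosen and rejected responses are identical do not
help: under the standard DPO loss their two log-ratio terms cancel
and the gradient is zero. Raising β also keeps the policy closer
to the SFT distribution on out-of-scope inputs.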
Reward hacking on programmatic rewards. The policy finds a
loophole in the reward function. Tighten the reward (Section 9-03)
or add a graded penalty for length / format / boilerplate.
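Mode collapse in the list above is easier to act on when it is measured. One crude proxy, sketched here under the assumption that exact matches after case and whitespace normalisation are a usable signal, is the share of distinct responses across a set of varied prompts. The function name is hypothetical, and a production check would more likely use n-gram or embedding similarity.

```python
def distinct_response_ratio(responses):
    """Fraction of responses that are unique after light normalisation.

    Values near 1.0 mean varied outputs; values near 1/len(responses)
    mean the policy is emitting the same text for every prompt.
    """
    if not responses:
        raise ValueError("need at least one response")
    # Lower-case and collapse whitespace so trivial differences do not count.
    normalised = [" ".join(r.lower().split()) for r in responses]
    return len(set(normalised)) / len(normalised)
```

Tracking this ratio for the SFT and DPO checkpoints on the same prompt set shows whether diversity dropped during preference training.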
Shipping the artefact
The final artefact is a PEFT adapter, ~50 MB on disk. To deploy:
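A minimal sketch of the merge step, assuming the adapter was trained with PEFT LoRA on top of a Hugging Face causal LM. The function name and all three path arguments are placeholders rather than values from this chapter, and the base model is assumed to be loaded in bf16 (not quantised) so that the adapter weights can be folded in.

```python
def merge_adapter(base_model_name: str, adapter_dir: str, out_dir: str) -> None:
    """Fold a LoRA adapter into its base model and save a plain checkpoint."""
    # Imports are local so this module can be imported without the GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the same base model the adapter was trained against.
    base = AutoModelForCausalLM.from_pretrained(
        base_model_name, torch_dtype=torch.bfloat16
    )
    # Attach the adapter, then merge its weights into the base layers.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()
    # Save model and tokenizer together so the output directory is servable.
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_model_name).save_pretrained(out_dir)
```

A call would look like `merge_adapter("<base-model-id>", "<adapter-dir>", "<merged-dir>")`, with the three placeholders replaced by the actual base model and directories.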
The merged model is a standard AutoModelForCausalLM; it serves
through vllm or any other inference runtime without the PEFT
overhead.
Where this fits
This pipeline is the operational version of the methods in 9-02
and 9-03. Section 9-05 takes the same skeleton into a third regime —
GRPO with verifiable rewards — for tasks where pairwise preference
data is impractical to collect but a programmatic scorer exists.
GRPO and Reasoning
The 2024–2025 wave of "reasoning" language models —
DeepSeek-R1 \citep{guo2025deepseekr1}, OpenAI's o1/o3 family,
Anthropic's extended-thinking modes — train policies that produce
long chains of thought before emitting a final answer. The
training method behind most open-source members of this family is
Group Relative Policy Optimisation (GRPO) with verifiable
rewards. This section covers how GRPO works, why it suits financial
reasoning tasks, and how to run a GRPO fine-tune with TRL on a
narrow finance benchmark.
What GRPO is
PPO needs a value function (a learned baseline) to compute
advantages. GRPO removes the value function and uses group
statistics instead. For each prompt, sample N candidate
completions, score each with a reward, and normalise within the
group:
$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_N\})}{\operatorname{std}(\{r_1, \dots, r_N\})}.
$$
The policy update is then PPO-style with the clipped surrogate, but
using $\hat{A}_i$ in place of the GAE advantage from a value model.
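The group normalisation is small enough to write out directly. The sketch below assumes a population standard deviation and a small `eps` in the denominator to avoid dividing by zero when every rollout in a group receives the same reward. Both choices are assumptions; library implementations may differ in these details.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise the rewards of one prompt's N rollouts within the group."""
    if len(rewards) < 2:
        raise ValueError("a group needs at least two rollouts")
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    # eps keeps the division defined when all rewards are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group with rewards `[0.0, 0.0, 1.0, 1.0]` the advantages come out close to `[-1, -1, 1, 1]`. A group where every rollout scores the same yields all-zero advantages, so that prompt contributes no gradient for the step.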
Two consequences make GRPO operationally attractive:
No value-model training. One fewer network to train, debug,
and keep aligned with the policy.
Variance reduction within a group. A baseline computed from the
prompt's own rollouts does not depend on how well a value network
has been fitted, which in practice tends to give more stable
training at smaller batch sizes.
The trade-off is that GRPO needs N rollouts per prompt rather
than one — typically N∈[4,16]. With vLLM rollouts (Section
9-02) this is affordable; without it, the per-step cost scales
poorly.
Why GRPO suits finance reasoning
Three task shapes that benefit from the GRPO recipe:
Multi-step extraction with reconciliation. Pull figures from
multiple tables in a 10-K; produce derived ratios that must
reconcile to source totals. The reasoning trajectory has many
branching points; a verifiable reward (totals match) tells the
policy when it found a working trajectory.
Structured derivation. "Compute the company's free cash flow
from these primary statements." There is a unique answer; the
path to it requires interpretation of accounting conventions; a
numeric-match reward suffices for training signal.
Code-from-spec. "Write a Polars query that computes a
rolling 60-day Sharpe per ticker." The reward is whether the code
compiles and produces the expected output on a small test fixture.
Notice the common structure: clear criterion, rich solution
space, partial-credit shape in the reward function. SFT teaches
"do this exact trajectory"; GRPO teaches "find any trajectory that
hits the criterion".
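For the structured-derivation shape, a numeric-match reward might look like the sketch below. The function names, the tolerance, and the 0.1 partial credit for a parseable but wrong number are illustrative assumptions. The parser takes the first number it finds, so it assumes the text passed in is the already-extracted final answer rather than the full reasoning trace.

```python
import math
import re

# First signed number in the text, allowing thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*\.?\d*")


def parse_number(text):
    """Return the first number in text as a float, or None if there is none."""
    m = _NUMBER.search(text)
    if m is None:
        return None
    return float(m.group(0).replace(",", ""))


def numeric_match_reward(answer_text, target, rel_tol=1e-3, abs_tol=1e-9):
    """1.0 for a match within tolerance, 0.1 for a wrong number, 0.0 otherwise."""
    value = parse_number(answer_text)
    if value is None:
        return 0.0  # nothing numeric to compare
    if math.isclose(value, target, rel_tol=rel_tol, abs_tol=abs_tol):
        return 1.0
    return 0.1  # parseable but wrong: small partial credit (assumption)
```

The relative tolerance absorbs rounding differences between the model's arithmetic and the reference figure; it should be set from the precision the task actually requires.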
A composite reward made of programmatic checks (Section 9-03):
```python
import json, re
from jsonschema import validate, ValidationError

SCHEMA = {...}  # the target schema for this task


def reasoning_reward(completions, prompts, **kwargs):
    """Reward a list of completions; return a list of scalars."""
    rewards = []
    for prompt, comp in zip(prompts, completions):
        score = 0.0
        # 1) format: <think>...</think><answer>...</answer>
        if not re.search(r"<think>.*?</think>", comp, re.S):
            rewards.append(0.0)
            continue
        m = re.search(r"<answer>(.*?)</answer>", comp, re.S)
        if not m:
            rewards.append(0.1)
            continue
        score += 0.2
        # 2) JSON validity in the <answer>
        try:
            out = json.loads(m.group(1))
            validate(out, SCHEMA)
            score += 0.3
        except (json.JSONDecodeError, ValidationError):
            rewards.append(score)
            continue
        # 3) reconciliation against source totals
        # reconciles() is the task-specific checker; it is not defined in this snippet
        if reconciles(out, prompt):
            score += 0.5
        rewards.append(score)
    return rewards
```
The dataset is a list of prompts; no chosen / rejected columns
are needed because GRPO scores its own rollouts. Three
hyperparameters need the most care:
num_generations (N). Larger groups give a smaller-variance
baseline. N=4 is the floor; N=8–16 is typical.
beta (KL coefficient). Smaller than DPO; 0.02–0.05 is
the working range. R1-style training uses very small β
because the policy needs to drift far from the SFT base to
develop reasoning patterns.
max_completion_length. Reasoning trajectories grow during
training; under-budgeting here is the most common cause of GRPO
failing to learn.
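The three settings above correspond to fields of the same name on TRL's `GRPOConfig`. The sketch below shows one way they might be wired into a `GRPOTrainer`. The specific values, the batch size, and the helper names are illustrative assumptions, not recommendations from this chapter. The batch size is set to a multiple of `num_generations` because TRL requires the generation batch to divide evenly into groups.

```python
def grpo_hyperparams():
    """Illustrative values inside the ranges discussed above (assumptions)."""
    return {
        "num_generations": 8,               # N rollouts per prompt
        "beta": 0.04,                       # KL coefficient
        "max_completion_length": 2048,      # headroom for long reasoning traces
        "per_device_train_batch_size": 8,   # multiple of num_generations
    }


def build_grpo_trainer(model_path, train_dataset, reward_fn, output_dir):
    """Construct a GRPOTrainer; train_dataset needs a "prompt" column."""
    # Local import so the hyperparameter dict is usable without TRL installed.
    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(output_dir=output_dir, **grpo_hyperparams())
    return GRPOTrainer(
        model=model_path,          # path or hub id of the SFT checkpoint
        reward_funcs=reward_fn,    # e.g. a reward function like the one above
        args=config,
        train_dataset=train_dataset,
    )
```

Calling `.train()` on the returned trainer starts the run. Defaults for these fields have changed between TRL releases, so it is worth setting them explicitly rather than relying on the installed version.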
What "develops" during GRPO
A surprising empirical observation from the R1 line of work: as GRPO
training progresses, completion length grows without being
explicitly rewarded for it. The model discovers that thinking
longer lets it satisfy the reward more often. The pattern often
looks like:
Step 0–500: completions are short and shallow; reward is near-zero.
Step 500–2K: completions get longer; the model starts emitting
intermediate reasoning steps; reward climbs.
Step 2K+: occasional "aha" moments where the model produces a
qualitatively new reasoning structure; reward plateaus then
jumps.
This is partly why GRPO suits reasoning: the optimiser has to find a
trajectory through a large space of possible reasoning paths, and
it does so by lengthening the chain of thought.
Three diagnostics to track:
Mean completion length. Should grow then stabilise. Sudden
collapse means the policy found a short loophole in the reward.
Reward distribution. Should shift from 0-heavy to 1-heavy as
training progresses; multimodal in the early phase, narrower
later.
Format validity rate. The fraction of completions that
satisfy the format reward (the lowest-difficulty component) —
should hit ≥95% within a few hundred steps and stay there.
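The three diagnostics can be computed from a batch of logged rollouts with a few lines of code. This sketch makes several simplifying assumptions: length is measured in whitespace-separated tokens rather than tokenizer tokens, rewards lie in [0, 1], and the format check is a single regex for the `<think>`/`<answer>` layout. The function name and the bin count are hypothetical.

```python
import re
from collections import Counter
from statistics import fmean

# <think>...</think> followed by <answer>...</answer>, across newlines.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.S)


def rollout_diagnostics(completions, rewards, n_bins=5):
    """Summarise one batch of rollouts for training-time monitoring."""
    if not completions or len(completions) != len(rewards):
        raise ValueError("completions and rewards must be non-empty and aligned")
    lengths = [len(c.split()) for c in completions]
    valid = [bool(FORMAT_RE.search(c)) for c in completions]
    # Histogram of rewards over equal-width bins on [0, 1], clamped at the ends.
    bins = Counter(
        max(0, min(int(r * n_bins), n_bins - 1)) for r in rewards
    )
    return {
        "mean_length": fmean(lengths),
        "format_valid_rate": sum(valid) / len(valid),
        "reward_hist": [bins.get(i, 0) for i in range(n_bins)],
    }
```

Logging this dictionary every few hundred steps gives the three curves to watch: length, format validity, and the shape of the reward distribution.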
Caveats and known failure modes
Reward hacking is more dangerous in GRPO than in DPO,
because the reward function is the only signal. Programmatic
rewards must be tight; partial-credit components must be
monotone in genuine quality, not just length or formatting.
Multilingual drift. The R1 paper reports chains of thought that
mix languages during RL training, and counters it with a
language-consistency reward. A small auxiliary reward of that kind
is the standard fix.
Collapse to verbose hedging. Without an entropy penalty, the
policy can collapse to long but uninformative outputs that scrape
through the format reward without doing the task. Mitigation:
weight the substantive component (reconciliation, numeric
correctness) much higher than the format component.
Compute footprint. GRPO with N=8 on an 8B-parameter model can take
roughly 10× the wall-clock of DPO at the same dataset size. Budget
accordingly; vLLM rollouts roughly halve the cost.
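For the multilingual-drift item above, an auxiliary reward can be as simple as the share of words in the target language. The sketch below is a deliberately crude proxy for English: it counts words whose letters are all ASCII. That heuristic is an assumption of this example and cannot tell English apart from other Latin-script languages without accents; a real pipeline would more likely use a language-identification model. The function name is hypothetical.

```python
def language_consistency_reward(text):
    """Share of alphabetic words written only with ASCII letters (crude proxy)."""
    # Ignore tokens with no letters at all (numbers, punctuation, symbols).
    words = [w for w in text.split() if any(ch.isalpha() for ch in w)]
    if not words:
        return 0.0
    ascii_words = sum(
        1 for w in words if all(ch.isascii() for ch in w if ch.isalpha())
    )
    return ascii_words / len(words)
```

In a composite reward this term would carry a small weight, so that it nudges the language of the trace without competing with the substantive components.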
When GRPO is overkill
GRPO is the right tool when you have a verifiable reward and a
task that benefits from chain-of-thought reasoning. It is
overkill when:
The task is surface-level (style, vocabulary, schema match without
reasoning). SFT or DPO suffice.
The reward function is brittle. Reward hacking will dominate the
signal; you will spend more time tightening the reward than
training the model.
The compute budget is small. Even with vLLM, GRPO is the most
expensive training mode in this chapter.
Where this fits
GRPO closes the chapter's progression: SFT (Section 7-03), DPO
(Sections 9-02 and 9-04), and finally GRPO for reasoning. Together, the
three cover the practical fine-tuning landscape for finance LLMs
in 2025. Chapter 10 (synthetic data) provides the stress-test
substrate for evaluating these models on regimes the training data
did not contain — the same stress-test discipline the rest of the
book uses for forecasters and policies.