Chapter 09

RL Fine-Tuning

"There are no two words in the English language more harmful than 'good job.'" (Whiplash, 2014)

Supervised fine-tuning teaches a language model to imitate. RL fine-tuning teaches it to be measured against something — a reward model, a verifiable criterion, a preference signal — and to update its weights toward higher score on that measure. For a lot of finance applications the difference matters: the question is rarely "what is the most likely next token" but "given two extractions from this 10-K, which one survives reconciliation against the bundled totals?" That second question is exactly what RL fine-tuning is for.

This chapter is the working manual. We cover TRL — the Hugging Face library that has become the de facto standard for fine-tuning language models with RL — in enough depth to ship a finance-specific DPO or GRPO pipeline. The bigger picture from Section 7-03 (when to fine-tune at all) is the prerequisite; this chapter assumes you have decided that prompting and SFT are not enough and you need the model to prefer certain outputs over others rather than merely produce them.

What this chapter does

The chapter is organised so that it ends with a runnable SFT → DPO pipeline that fine-tunes a 7B-class model on a single GPU with a small preference dataset:

  • Why RL fine-tuning matters for financial tasks specifically — the parts where verifiable rewards are the right tool, and the parts where it is overkill.
  • The TRL library at a level of detail that lets you choose between SFTTrainer, DPOTrainer, PPOTrainer, and GRPOTrainer based on your data and your loss surface.
  • Reward modelling for finance: programmatic rewards (schema validation, reconciliation), preference-based rewards (analyst rankings), and the trade-offs between them.
  • An end-to-end SFT → DPO pipeline on a small finance dataset with the evaluation discipline that catches regressions.
  • GRPO and R1-style reasoning — the post-2024 recipe that powers the new wave of reasoning-finetuned LLMs for financial extraction and structured analysis.

Three through-lines

  • Pick the cheapest method that solves the problem. SFT first, DPO if behaviour disagreements persist, full RL (PPO / GRPO) only when verifiable rewards are central. The evidence in the literature \citep{jin2025rl-vs-sft} is consistent with this ordering.
  • Programmatic rewards beat learned reward models when you can get them. Finance tasks have a lot of verifiable structure (numbers must reconcile, JSON must validate, citations must point to real spans). Use that structure.
  • Distillation and small models. A fine-tuned 7B model with good rewards routinely beats a frontier API on a narrow finance task at a fraction of the per-token cost. The downstream cost-management story of Section 7-04 is the reason RL fine-tuning matters operationally.

Contents

  • Why RL Fine-Tuning — when RL helps beyond SFT, which finance tasks it suits, and the risks of doing it for the wrong reason.
  • The TRL Library: SFTTrainer, DPOTrainer, PPOTrainer, GRPOTrainer, RewardTrainer, and the surrounding plumbing (PEFT, vLLM rollout, accelerate).
  • Reward Modelling — programmatic and preference-based rewards, schema-and-reconciliation patterns for financial extraction, and how to build a calibration set.
  • SFT to DPO Pipeline — runnable end-to-end fine-tune of a 7B model on a small finance dataset, with the evaluation suite that decides whether to ship.
  • GRPO and Reasoning — Group Relative Policy Optimization, R1-style reasoning fine-tuning, and the finance applications that benefit (multi-step extraction, structured analysis, code-from-spec).

Why RL Fine-Tuning

Section 7-03 covered when fine-tuning is worth the budget at all. This section is one level deeper: given that you have decided to fine-tune, when should you reach past SFT and add an RL step?

The empirical answer that has stabilised over 2024–2025 is mostly no, sometimes yes — and the sometimes is large enough in finance to deserve a chapter. Two recent papers frame the trade-off cleanly. Jin et al. \citep{jin2025rl-vs-sft} show that SFT delivers most of the gain, fastest, on tasks where the right answer is a fixed token sequence; RL pulls ahead on tasks where the criterion is verifiable but the answer space is large (math, code, structured extraction). DeepSeek-R1 \citep{guo2025deepseekr1} shows what happens when you push the second case to its limit: pure RL with verifiable rewards produces a reasoning-capable model from a non-reasoning base without any supervised reasoning traces.

Both directions matter for finance.

When SFT is enough

Three task shapes reach production quality with SFT alone, with no RL step needed:

  • Style and tone normalisation. Producing memos in a firm's house style; rewriting client-facing prose to a tone guideline. The target distribution is well-defined and SFT examples are easy to collect.
  • Schema-bound extraction with abundant data. Pulling out structured fields from filings when you have thousands of high-quality (input, output) pairs. SFT will learn the schema and the field semantics directly.
  • Domain vocabulary adaptation. Teaching a base model the firm's terminology, ticker conventions, and product names. A small SFT on curated examples is the cheapest path.

If you are in any of these regimes, stop here. The cost of an RL pipeline (reward model, preference data, more compute) is not earned.

When RL fine-tuning helps

Three shapes that genuinely benefit from an RL step on top of SFT:

  • Verifiable-reward extraction. The output has to reconcile: totals add up, fields match a schema, citations point to real spans, derived figures match the source. A reward function can check this programmatically; RL pushes the model toward outputs that pass.
  • Preference judgements. The right answer is a ranking, not a token sequence — which of two analyst summaries is sharper, which of two trade rationales is more defensible. SFT cannot encode "this one over that one" cleanly; DPO can.
  • Multi-step reasoning. Tasks that require working through several intermediate steps before producing the final answer (multi-table extraction, structured derivation of a financial metric from primary sources). GRPO and similar group-relative methods reinforce reasoning patterns that SFT examples cannot cleanly demonstrate.

The pattern across these three: the criterion is clear, but the path to satisfy it is not unique. RL is a tool for telling the model "this output is better than that one" without prescribing exactly how it gets there.

The three RL fine-tuning regimes

The chapter covers three regimes in increasing operational complexity:

  • DPO (Direct Preference Optimisation) \citep{rafailov2023dpo}. Closed-form contrastive loss between a chosen and a rejected response. No reward model, no reinforcement-learning loop. The default first attempt; works well when preferences exist as pairs.
  • PPO with a learned reward model. The classical RLHF recipe. Train a reward model on preferences, then PPO the policy against it. More moving parts; more compute; pays off when preferences are too rich to fit DPO's pair-wise contrast.
  • GRPO (Group Relative Policy Optimisation). Sample N candidates per prompt, score each with a verifiable reward, normalise within the group, and update. No value model, no separate critic; the group itself sets the baseline. The recipe behind DeepSeek-R1 and most reasoning-finetuned models since.

Section 9-02 walks through the TRL library trainer for each of these.

Three failure modes worth knowing

RL fine-tuning has a richer failure surface than SFT, and the failures are easy to miss without explicit monitoring.

  • Reward hacking. The policy finds an output the reward model scores highly but humans dislike. The classic example: outputs that exploit the reward model's biases (verbose summaries, hedged claims) without genuinely improving. Mitigation: a held-out human eval set scored separately.
  • Mode collapse. RL pushes the policy toward a small set of near-deterministic outputs. Token-level entropy collapses; the model loses creativity on unseen prompts. Mitigation: KL regularisation against the SFT reference policy \citep{cui2025entropy} and an explicit entropy floor in the optimiser.
  • Reward model drift. As the policy updates, its outputs move outside the distribution the reward model was trained on; the reward model's predictions become less reliable. Mitigation: periodic reward-model retraining on fresh policy samples, or programmatic rewards (next section) that do not have this problem.
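
Mode collapse is easiest to catch by tracking token-level entropy over training. The following is a minimal sketch, assuming you can read per-position next-token probability distributions off the policy during rollouts; the function names and the 0.5-nat floor are illustrative assumptions, not TRL settings.

```python
import math
from typing import Sequence

def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average entropy over all generated positions in a rollout."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

def entropy_collapsed(distributions, floor: float = 0.5) -> bool:
    """Flag a rollout whose mean entropy fell below a hypothetical floor."""
    return mean_entropy(distributions) < floor

# Toy distributions over a four-token vocabulary.
healthy = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
collapsed = [[0.99, 0.01, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
```

Logging the mean entropy per training step next to the reward makes it visible when reward keeps rising while entropy falls toward zero, which is the signature described above.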

How this chapter connects

RL fine-tuning sits between Section 7-03 (when to fine-tune at all) and the agent layer of Chapter 8 (where fine-tuned models actually get called as tools). The synthetic-data Chapter 10 then provides the evaluation regime where fine-tuned policies are stress-tested. The four chapters together close the loop:

  1. Decide whether to fine-tune (Section 7-03).
  2. Decide how to fine-tune (this chapter).
  3. Wrap the fine-tuned model in agents and tools (Chapter 8).
  4. Evaluate the result on synthetic stress data (Chapter 10).

Section 9-02 covers the TRL library that provides the actual machinery for step 2.

The TRL Library

Hugging Face's TRL (Transformer Reinforcement Learning) library is the de facto standard for fine-tuning language models with RL on top of transformers. It exposes a set of trainers — SFTTrainer, RewardTrainer, DPOTrainer, PPOTrainer, GRPOTrainer, KTOTrainer, ORPOTrainer — that share the same configuration and checkpoint conventions as the rest of the Hugging Face stack and compose with PEFT (LoRA / QLoRA), accelerate, and vllm for inference rollouts. This section is a working tour: what each trainer does, when to reach for it, and the integration glue that makes the whole thing run on a single GPU.

Architecture in one paragraph

TRL trainers are subclasses of transformers.Trainer (or its distributed variants) with the loss and the data-loading patched for the RL or preference-learning objective. The model under training is a transformers.PreTrainedModel; the optional reference model (for KL terms) is another instance of the same architecture; the optional reward model is yet another. PEFT adapters can be inserted on the model under training so the trainable parameter count drops to a few percent of the base; the reference model stays frozen and shares weights via LoRA's adapter-disable trick.

from trl import SFTTrainer, RewardTrainer, DPOTrainer, PPOTrainer, GRPOTrainer

The five trainers below are the load-bearing ones; the rest (KTOTrainer, ORPOTrainer, CPOTrainer) are variations on the DPO theme that make sense once you have a reason to prefer a different preference-learning loss.

SFTTrainer

The starting point and the prerequisite for everything else. SFT is ordinary causal-language-model fine-tuning with TRL's data-formatting and packing helpers.

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
 
base = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
 
ds = load_dataset("json", data_files="data/sft_finance.jsonl", split="train")
 
cfg = SFTConfig(
    output_dir="ckpt/sft",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,     # LoRA-scale rate; full fine-tuning needs a much smaller value
    bf16=True,
    packing=True,           # concatenate short examples into longer sequences
    max_seq_length=4096,
)
 
SFTTrainer(model=model, processing_class=tok, args=cfg, train_dataset=ds).train()

Two practical knobs that move the needle:

  • packing=True packs short examples into longer sequences, which roughly doubles SFT throughput on finance datasets where most examples are 200–800 tokens.
  • max_seq_length should match the longest example you actually see; padding to a larger context wastes compute.
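
To make the packing idea concrete, here is a simplified sketch of concatenate-and-chunk packing on already-tokenised examples. It illustrates the general technique only; TRL's own packing differs in detail, and the token ids and the EOS id of 0 are made up for the example.

```python
from typing import Iterable, List

def pack_examples(examples: Iterable[List[int]], max_len: int,
                  eos_id: int) -> List[List[int]]:
    """Concatenate tokenised examples (separated by EOS) and cut the
    stream into chunks of at most max_len tokens."""
    stream: List[int] = []
    for ids in examples:
        stream.extend(ids)
        stream.append(eos_id)          # mark the example boundary
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]

def padding_waste(examples: List[List[int]], max_len: int) -> float:
    """Fraction of positions that would be padding without packing."""
    used = sum(min(len(e), max_len) for e in examples)
    return 1.0 - used / (len(examples) * max_len)

# Three short examples with hypothetical token ids.
toy = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_examples(toy, max_len=8, eos_id=0)
```

In this toy case the three examples would occupy three padded rows of length 8, most of it padding, whereas packing fills two rows. The price is that an example can straddle a chunk boundary, which is why packing suits short examples better than very long ones.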

For PEFT-only training, wrap the model in a LoRA config first; TRL recognises PEFT models and keeps the reference / value heads on the right adapter graph automatically.

RewardTrainer

Trains a reward model from preference data — pairs of (prompt, chosen, rejected). Loss is a Bradley–Terry pairwise log-likelihood. Useful only if you plan to do PPO; DPO skips this step.

from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification
 
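tok.pad_token = tok.eos_token          # the Llama tokenizer ships without a pad token
rm_base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", num_labels=1, torch_dtype="bfloat16"
)
rm_base.config.pad_token_id = tok.pad_token_id   # needed to find the last token in padded batches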
RewardTrainer(
    model=rm_base, processing_class=tok,
    args=RewardConfig(output_dir="ckpt/rm", learning_rate=1e-5,
                       per_device_train_batch_size=4, num_train_epochs=1, bf16=True),
    train_dataset=load_dataset("json", data_files="data/pref_pairs.jsonl",
                                split="train"),
).train()

The reward model needs a held-out preference set with human-graded agreement: the percentage of held-out pairs where the reward model's preference matches the human label. Below 70% the reward model is too noisy to PPO against; above 85% it is good enough to serve as the optimisation target for several thousand steps before needing a refresh.
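
The Bradley–Terry objective mentioned above can be written out for scalar scores. This is a sketch of the loss on plain floats, not the RewardTrainer implementation; the function names are illustrative.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response wins:
    -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # log1p(exp(-m)) equals -log(sigmoid(m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def batch_loss(pairs) -> float:
    """Mean loss over (r_chosen, r_rejected) score pairs."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss is log 2 when the reward model cannot tell the two responses apart, falls toward zero as the chosen score pulls ahead, and grows roughly linearly when the model prefers the rejected response.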

DPOTrainer

Direct Preference Optimisation \citep{rafailov2023dpo} reformulates RLHF as a closed-form contrastive loss: maximise the log-ratio of chosen to rejected probabilities relative to a frozen reference model, with no reward model and no value head. The default first attempt for any preference-based fine-tuning task.

from trl import DPOTrainer, DPOConfig
 
DPOTrainer(
    model=model,
    ref_model=None,                # TRL builds the reference itself (adapter-disable under PEFT)
    processing_class=tok,
    args=DPOConfig(
        output_dir="ckpt/dpo",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,
        beta=0.1,                  # KL strength; 0.05–0.3 typical
        bf16=True,
        max_length=4096,
        max_prompt_length=2048,
    ),
    train_dataset=load_dataset("json", data_files="data/pref_pairs.jsonl",
                                split="train"),
).train()

Two knobs that matter:

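  • beta: KL regularisation strength. Larger keeps the policy closer to the reference; smaller allows bigger shifts. Start at 0.1, the value in the config above; raise it if the policy is collapsing, and lower it if the policy barely moves away from the reference.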
  • ref_model: pass None when using PEFT — TRL re-uses the base model with adapters disabled to compute the reference. This is the trick that makes DPO fit on a single GPU.
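
To see what beta and the reference model do inside the loss, the DPO objective for a single preference pair can be sketched on plain floats. The inputs are assumed to be summed log-probabilities of each full response under the policy and under the frozen reference; this illustrates the published loss, not TRL's code.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair, given summed log-probabilities of each
    response under the policy (pi_*) and the frozen reference (ref_*)."""
    # Implicit reward = beta * (policy log-prob - reference log-prob).
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

At initialisation the policy equals the reference, so the margin is zero and the loss is log 2 regardless of beta. The loss only drops when the policy raises the chosen response relative to the rejected one by more than the reference already does.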

PPOTrainer

The classical RLHF recipe: rollout from the policy, score with a reward model, compute advantages, PPO update. More moving parts than DPO; reach for it when preferences are too rich to fit a pairwise contrast.

from trl import PPOTrainer, PPOConfig
 
ppo = PPOTrainer(
    model=model, ref_model=None, reward_model=rm_base,
    processing_class=tok,
    args=PPOConfig(
        output_dir="ckpt/ppo",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=1e-6,
        cliprange=0.2,
        kl_coef=0.05,
        bf16=True,
    ),
    train_dataset=load_dataset("json", data_files="data/prompts.jsonl",
                                split="train"),
)
ppo.train()

Operational notes:

  • A vLLM rollout backend can be plugged in to accelerate the generation step, which is the bottleneck of PPO: pass use_vllm=True and vllm_device in the PPOConfig if your TRL version supports it.
  • kl_coef and DPO's beta express the same idea; PPO usually wants a smaller value because the rollout-and-update loop has its own exploration mechanism.

A working recipe for PPO on a small finance dataset that surfaces the operational complexity:

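import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)
from trl import PPOTrainer, PPOConfig
 
base = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
 
policy = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
policy = get_peft_model(policy, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))
 
# Reward model trained earlier with RewardTrainer. It is a sequence classifier
# with one scalar output, so it is loaded with the matching class.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "ckpt/rm-finance", num_labels=1, torch_dtype=torch.bfloat16,
)
# PPOTrainer / PPOConfig arguments have changed across TRL releases (newer ones
# also expect a value_model); check them against your installed version.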
 
ppo = PPOTrainer(
    model=policy,
    ref_model=None,                                 # adapter-disable
    reward_model=reward_model,
    processing_class=tok,
    args=PPOConfig(
        output_dir="ckpt/ppo-finance",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=1e-6,
        cliprange=0.2,
        cliprange_value=0.2,
        kl_coef=0.05,
        target_kl=0.1,
        batch_size=64,
        mini_batch_size=8,
        num_ppo_epochs=2,
        bf16=True,
    ),
    train_dataset=load_dataset("json", data_files="data/prompts.jsonl",
                                split="train"),
)
ppo.train()

Three knobs that decide whether PPO converges or oscillates:

  • target_kl early stop. The trainer stops the current epoch when the per-batch KL exceeds this; without it, the policy can drift far enough that the importance ratio explodes.
  • mini_batch_size vs batch_size. The full batch is split into mini-batches and each mini-batch produces an update. Smaller mini-batches give noisier but more frequent updates; larger ones give smoother but slower training. The 8/64 split in the config above is a common starting point.
  • Reward standardisation. The reward model's outputs may have large absolute scale; PPO is sensitive to advantage variance. Normalise rewards per batch (subtract mean, divide by std) before computing returns, the same trick that stabilises actor-critic in Section 5-04.
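
The three knobs above map onto a few lines of arithmetic. The sketch below works on plain floats and is not the PPOTrainer implementation: it shows per-batch reward standardisation, the clipped surrogate that cliprange controls, and a simple KL estimate of the kind a target_kl check could use. The choice of estimator is an assumption for illustration.

```python
import math
from statistics import mean, pstdev

def standardise(rewards, eps: float = 1e-8):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, cliprange: float = 0.2) -> float:
    """PPO clipped objective for one sample (a quantity to maximise)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - cliprange, min(ratio, 1.0 + cliprange))
    return min(ratio * advantage, clipped * advantage)

def should_stop(logp_new, logp_old, target_kl: float = 0.1) -> bool:
    """Early-stop check using the estimator mean(logp_old - logp_new)
    over tokens sampled from the old policy."""
    approx_kl = mean([o - n for o, n in zip(logp_old, logp_new)])
    return approx_kl > target_kl
```

With a positive advantage, a ratio of 1.5 is capped at 1.2, so the update gets no extra credit for moving further. With a negative advantage the unclipped term is kept, which is what makes the objective pessimistic.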

When the reward model drifts (its predictions diverge from human preferences as the policy moves out of the RM's training distribution), the standard fix is iterative: every N PPO epochs, sample fresh rollouts from the current policy, re-label with humans, re-train the RM, and continue. This is the core RLHF loop; modern DPO and GRPO recipes mostly exist to avoid it.

GRPOTrainer

Group Relative Policy Optimisation \citep{shao2024deepseekmath} is the "PPO without the value model" idea that DeepSeek-R1 \citep{guo2025deepseekr1} popularised for reasoning fine-tuning. For each prompt, sample N candidate completions; score each with a verifiable reward (math-correctness, JSON validity, schema match); normalise within the group; PPO-style update. No value network, no separate critic, no reward-model drift.

from trl import GRPOTrainer, GRPOConfig
 
def reward_fn(completions, prompts, **kwargs):
    """Programmatic reward: 1.0 if the completion's JSON validates
    against the target schema and reconciles with the source totals;
    0.0 otherwise."""
    # score_one(prompt, completion) -> float is your own scorer, not part of TRL.
    return [score_one(p, c) for p, c in zip(prompts, completions)]
 
GRPOTrainer(
    model=model,
    reward_funcs=reward_fn,           # a callable or a list of callables; multiple rewards are combined
    processing_class=tok,
    args=GRPOConfig(
        output_dir="ckpt/grpo",
        num_train_epochs=1,
        per_device_train_batch_size=8, # effective batch must be divisible by num_generations
        num_generations=8,             # group size N
        max_prompt_length=2048,
        max_completion_length=2048,
        learning_rate=1e-6,
        beta=0.04,
        bf16=True,
    ),
    train_dataset=load_dataset("json", data_files="data/prompts.jsonl",
                                split="train"),
).train()
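
The "normalise within the group" step is the heart of GRPO and fits in a few lines. This sketch uses the population standard deviation and a small epsilon; real implementations differ in such details, so treat the exact normalisation as an assumption.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps: float = 1e-4):
    """Advantage of each completion relative to its own group:
    (r - mean(group)) / (std(group) + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, N = 8 sampled completions, binary verifiable reward.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = group_advantages(rewards)

# A group in which every completion earns the same reward carries no signal.
flat = group_advantages([1.0] * 8)
```

The two passing completions get a positive advantage and the six failures a smaller negative one, with no value network involved. A prompt the policy always solves (or never solves) yields all-zero advantages, which is why prompt difficulty matters for GRPO datasets.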

GRPO shines when:

  • The reward is verifiable (you can write score_one as a small function), not "judge with another LLM".
  • The task admits diverse correct trajectories — multi-step reasoning, structured extraction with several valid orderings.
  • You want a small model that beats a frontier API on a narrow task at low inference cost.

Section 9-05 develops a full GRPO recipe for a financial reasoning task.

PEFT, accelerate, and vllm — the integration glue

Three companion libraries make TRL pipelines fit on consumer-class hardware:

  • PEFT (LoRA / QLoRA / DoRA). Add a peft_config to any of the trainers above and only the LoRA adapter is updated. QLoRA's 4-bit base + LoRA on top makes 70B fine-tuning feasible on a single H100; for 7B it makes the whole pipeline run on a 24 GB consumer GPU.
  • accelerate. TRL trainers are accelerate-aware. Multi-GPU data parallelism is an accelerate config away; FSDP / DeepSpeed for >1B trainable parameters is a config flag away.
  • vllm. GRPO and PPO bottleneck on rollout generation. vllm's paged-attention kernel speeds up generation 4–10× and TRL's use_vllm=True flag plugs it in transparently. For finance fine-tunes on a single GPU this turns "one epoch overnight" into "one epoch in two hours".

When to choose which trainer

Use case | Trainer
First-pass fine-tune to firm style and vocabulary | SFTTrainer
Pairwise human preferences ("A is better than B") | DPOTrainer
Pointwise reward model trained on preferences | RewardTrainer + PPOTrainer
Verifiable programmatic reward (schema, math, code) | GRPOTrainer
Training a reward classifier alone | RewardTrainer
Replacing DPO when the chosen/rejected gap is small | KTOTrainer, ORPOTrainer

The decision tree:

  1. Is there a verifiable reward? → GRPO.
  2. Are there preference pairs? → DPO (or RewardTrainer + PPO if pairs are too rich for a pairwise loss).
  3. Is the goal to imitate exemplars? → SFT.
  4. Otherwise: SFT first, observe whether the failure mode is "wrong style" (stop) or "wrong choice between defensible alternatives" (DPO) or "wrong reasoning trajectory" (GRPO).
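
The decision tree above is small enough to encode directly, which can be handy as a checklist in a project template. The function and argument names are illustrative, not part of TRL.

```python
def choose_trainer(verifiable_reward: bool, preference_pairs: bool,
                   exemplars: bool, pairs_too_rich: bool = False) -> str:
    """Mirror the four-step decision tree and return a trainer name."""
    if verifiable_reward:
        return "GRPOTrainer"
    if preference_pairs:
        # Fall back to a learned reward model only when a pairwise
        # loss cannot capture the preferences.
        return "RewardTrainer + PPOTrainer" if pairs_too_rich else "DPOTrainer"
    if exemplars:
        return "SFTTrainer"
    return "SFTTrainer, then diagnose the failure mode"
```

The order of the checks matters: a task with both a verifiable reward and preference pairs resolves to GRPO, matching step 1 of the tree.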

Section 9-03 covers reward modelling — both programmatic and preference-based — in detail. Section 9-04 wires SFT and DPO together into a runnable pipeline.

Reward Modelling

The reward function is the single most important design decision in an RL fine-tuning pipeline. Everything else — the trainer, the batch size, the KL coefficient — is downstream of what you are optimising. Get the reward right and a small model fine-tunes into something useful; get it wrong and a large model finds the loophole instead of the task.

This section covers the three classes of reward that recur in finance applications, the trade-offs between them, and the calibration discipline that decides whether a reward is good enough to optimise against.

Three classes of reward

Reward functions fall into three buckets, in increasing order of difficulty to construct and operate.

  • Programmatic reward. A pure function of the model's output (and optionally the source data) that returns a scalar. JSON validity, schema match, numeric reconciliation, code that compiles, citations that resolve. These are the cleanest rewards and the foundation for GRPO-style fine-tuning.
  • Preference-based reward. A learned model that scores a candidate output by predicting how a human would rank it against alternatives. Used in classical RLHF; the substrate for RewardTrainer + PPO.
  • LLM-judge reward. A separate language model is prompted to score the candidate against a rubric. Cheap to construct, easy to bias, and prone to reward hacking; treat as a last resort or as a secondary signal alongside a programmatic one.

The empirical evidence over 2024–2025 is consistent: when a programmatic reward exists, it dominates. Most finance tasks have more programmatic structure than the literature gives them credit for.

Programmatic rewards for finance

Three patterns recur and cover most production use cases.

Schema and reconciliation

The output is a structured extraction — fields from a 10-K, positions from a portfolio report, entries from a press release. The reward checks that:

  1. The output validates against a JSON schema (typed, required fields present, correct cardinalities).
  2. Numeric fields reconcile: line totals add up, sub-categories sum to category totals, derived ratios match recomputation from primary fields.
  3. Citations point to spans that actually exist in the source.

import json, re
from jsonschema import validate, ValidationError
 
def schema_reconcile_reward(prompt: str, completion: str, schema: dict,
                             source_text: str) -> float:
    # extract the JSON block from the completion
    m = re.search(r"```json\s*(\{.*?\})\s*```", completion, re.S)
    if not m:
        return 0.0
    try:
        out = json.loads(m.group(1))
    except json.JSONDecodeError:
        return 0.0
    # schema validity
    try:
        validate(out, schema)
    except ValidationError:
        return 0.2                                  # partial credit for valid JSON
    # reconciliation: subtotals must equal sum of components
    components = out.get("components", [])
    total = out.get("total")
    if total is not None and components:
        recomputed = sum(c.get("value", 0) for c in components)
        if abs(recomputed - total) > 1e-2:
            return 0.5                              # JSON valid, totals broken
    # citations resolve
    for cite in out.get("citations", []):
        if cite not in source_text:
            return 0.7                              # mostly right, citation broken
    return 1.0                                       # all checks pass

The graded scoring (0.2 / 0.5 / 0.7 / 1.0) gives the optimiser a dense gradient signal: there are partial credits for the intermediate steps so the policy can climb the reward surface gradually, instead of finding only the trivial 0.0 vs 1.0 cliff.
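
The reconciliation step generalises beyond a single total. As a sketch, assuming an extraction with hypothetical fields revenue, segments, net_income, and net_margin, the check below verifies that sub-categories sum to the category total and that a derived ratio matches its recomputation from primary fields.

```python
import math

def reconciles(extraction: dict, rel_tol: float = 1e-4) -> bool:
    """True if segment revenues sum to total revenue and the reported
    net margin matches net_income / revenue."""
    revenue = extraction["revenue"]
    if revenue == 0:
        return False                      # margin is undefined
    segments_sum = sum(extraction["segments"].values())
    if not math.isclose(segments_sum, revenue, rel_tol=rel_tol):
        return False
    recomputed_margin = extraction["net_income"] / revenue
    return math.isclose(recomputed_margin, extraction["net_margin"],
                        rel_tol=rel_tol, abs_tol=1e-6)

# Illustrative values only.
good = {"revenue": 1000.0, "segments": {"retail": 600.0, "wholesale": 400.0},
        "net_income": 50.0, "net_margin": 0.05}
bad = dict(good, segments={"retail": 600.0, "wholesale": 350.0})
```

A relative tolerance is used rather than an absolute one because filings report figures at very different scales (units, thousands, millions), and a fixed absolute threshold would be too strict for some and too loose for others.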

Numeric correctness

For tasks with a unique numeric answer (financial-math problems, ratio computations from primary statements), the reward is an exact match on the numeric field, within a small relative tolerance, after normalisation:

def numeric_match_reward(completion: str, target: float, tol: float = 1e-3) -> float:
    m = re.search(r"answer:\s*(-?\d+\.?\d*)", completion, re.I)
    if not m:
        return 0.0
    pred = float(m.group(1))
    return 1.0 if abs(pred - target) / (abs(target) + 1e-9) < tol else 0.0

DeepSeekMath \citep{shao2024deepseekmath} and the broader R1 line \citep{guo2025deepseekr1} show that a binary numeric reward is enough to teach a non-reasoning base model to reason on math — because the action space (CoT trajectories) is rich enough that the policy can find paths that satisfy the criterion. The same applies to financial-math problems with a unique answer.
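
The "after normalisation" step deserves its own helper, because financial text writes numbers in many ways. The sketch below handles thousands separators, currency symbols, scale words, and accounting-style parentheses for negatives; the set of formats covered is an assumption and will need extending for real filings.

```python
import re
from typing import Optional

_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

def normalise_number(text: str) -> Optional[float]:
    """Parse strings such as '$1,234.5 million' or '(12.3)' into a float.
    Parentheses are read as an accounting-style negative."""
    s = text.strip().lower()
    negative = s.startswith("(") and s.endswith(")")
    m = re.search(r"-?\d[\d,]*\.?\d*", s)
    if not m:
        return None
    value = float(m.group(0).replace(",", ""))
    for word, factor in _SCALE.items():
        if word in s:
            value *= factor               # apply the first scale word found
            break
    return -abs(value) if negative else value
```

Normalising both the model's answer and the target with the same function before comparing avoids penalising a correct answer for its formatting.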

Code-from-spec

For tasks where the model must produce code that satisfies a spec (generate a Polars query for a question, generate a backtest from a strategy description), the reward is (compiles, passes_tests):

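import subprocess, sys, tempfile
 
def code_reward(completion: str, test_cases: list) -> float:
    # extract_code_block and test_cases_payload are your own helpers: the first
    # pulls the code out of the completion, the second serialises the test
    # cases to bytes for the script's stdin.
    code = extract_code_block(completion)
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py") as f:
        f.write(code); f.flush()
        try:
            # The generated code is untrusted: run it only inside a sandbox.
            r = subprocess.run([sys.executable, f.name], capture_output=True,
                               timeout=10,
                               input=test_cases_payload(test_cases))
            # 0.3 is partial credit for code that ran but exited non-zero.
            return 1.0 if r.returncode == 0 else 0.3
        except subprocess.TimeoutExpired:
            return 0.0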

For finance specifically, this works well for narrow code tasks (generate a feature pipeline, write a backtest harness with named parameters). It does not work for "write a trading strategy" — the criterion is not verifiable.

Preference-based rewards

When the task admits no programmatic reward — "summarise this analyst note in our house style", "draft a client-facing memo about this market move" — the alternative is to collect human preferences and learn a reward model.

Preference data collection

Each item in the dataset is a triple (prompt, chosen, rejected). Two collection patterns:

  • Pairwise from a single annotator. Show an annotator a prompt and two candidate completions; they pick the better one.
  • Group ranking. Show candidates per prompt; the annotator ranks them. Decompose the ranking into all pairs for the pairwise loss.
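
Decomposing a ranking into pairs is mechanical. A minimal sketch, assuming the annotator's ranking arrives as a list ordered from best to worst and that the output should use the (prompt, chosen, rejected) field names from above:

```python
from itertools import combinations

def ranking_to_pairs(prompt: str, ranked: list) -> list:
    """Expand a best-to-worst ranking into (prompt, chosen, rejected) rows."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked, 2)
    ]

# Four ranked candidates, labelled A (best) to D (worst) for illustration.
pairs = ranking_to_pairs("Summarise the Q3 note.", ["A", "B", "C", "D"])
```

A ranking of k candidates yields k(k-1)/2 pairs, so four candidates give six training rows from a single annotation. The pairs are not independent, though, so they carry less information than six separately collected judgements.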

The dataset size that produces a useful reward model is in the low thousands of pairs for narrow finance tasks. Below that the reward model is too noisy; above ~50K the marginal signal flattens.

Reward-model training

RewardTrainer (Section 9-02) handles the actual fitting. The output you should monitor:

  • Held-out pair accuracy. Random baseline is 50%; a useful model is at or above 70%; a strong model is at or above 85%. Below 70%, do not PPO against this model: the gradient is too noisy.
  • Calibration. Plot reward-model score vs human-graded quality on a held-out set. A flat or non-monotone curve means the optimiser will follow the wrong direction.
  • Out-of-distribution behaviour. Sample from the current policy and score; the score distribution should overlap the training distribution. Drift means the reward model has expired.
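
Held-out pair accuracy is simple to compute once the reward model has scored both sides of each pair. The sketch below assumes a list of (chosen_score, rejected_score) tuples and applies the 70% and 85% thresholds from Section 9-02; the function names are illustrative.

```python
def pair_accuracy(scored_pairs) -> float:
    """Fraction of held-out pairs where the reward model scores the
    human-chosen response strictly above the rejected one."""
    hits = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return hits / len(scored_pairs)

def verdict(accuracy: float) -> str:
    """Map accuracy onto the 70% / 85% thresholds used in this chapter."""
    if accuracy < 0.70:
        return "too noisy for PPO"
    if accuracy < 0.85:
        return "useful"
    return "strong"

# Made-up scores for four held-out pairs.
held_out = [(1.0, 0.5), (0.2, 0.9), (0.7, 0.1), (0.4, 0.3)]
```

Ties count as misses here, a deliberately conservative choice: a reward model that cannot separate the two responses gives the optimiser nothing to follow.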

When to refresh the reward model

The classical RLHF pattern: train RM, do a few thousand PPO steps, evaluate, retrain RM on fresh policy outputs labelled by humans, repeat. For finance applications the cadence depends on the rate of policy drift; a useful default is every 1–2K PPO steps in early training, every 5–10K once the policy stabilises.

LLM-judge rewards (use sparingly)

A growing literature uses one LLM to score another's outputs against a rubric. Cheap to set up; treacherous to optimise against.

Three failure modes to watch for:

  • Self-preference. A judge from the same model family rewards outputs in its own house style; the policy learns the style, not the substance.
  • Length bias. Judges over-reward verbose outputs; the policy learns to be wordy.
  • Hedge bias. Judges over-reward outputs that include caveats; the policy learns to refuse useful conclusions.

If you must use an LLM judge:

  • Pair it with a programmatic reward and weight the programmatic one higher.
  • Use a judge from a different model family than the policy.
  • Audit a sample of policy outputs against the judge's labels every few hundred steps.
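
The first recommendation can be enforced in code rather than left to convention. A sketch, assuming both scores are already scaled to [0, 1]; the 0.8 / 0.2 split is an illustrative default, not a recommended value.

```python
def combined_reward(programmatic: float, judge: float,
                    w_prog: float = 0.8, w_judge: float = 0.2) -> float:
    """Weighted mix of a programmatic score and an LLM-judge score,
    both assumed to lie in [0, 1]."""
    if w_prog <= w_judge:
        raise ValueError("the programmatic reward should carry the larger weight")
    return w_prog * programmatic + w_judge * judge
```

With these weights an output that fails every programmatic check can earn at most 0.2, however much the judge likes it, so the judge can only break ties among outputs that already pass.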

Building a reward calibration set

Whichever class of reward you use, set aside a held-out reward calibration set: prompts with both human-graded labels and programmatic-reward scores. This set lets you:

  • Verify the reward model agrees with humans (preference rewards).
  • Verify the programmatic reward correlates with what people actually want (programmatic rewards).
  • Detect reward drift across training (re-score the policy on the same prompts at a fixed step interval and watch for anomalies).

The calibration set is also what you report to a model-risk reviewer. A pipeline that cannot defend its reward function in front of a human auditor is not ready for production.

What this section adds

The reward function is the thing the policy is being pushed toward. The TRL trainers in Section 9-02 are the mechanism. The next section wires the two together into a runnable SFT → DPO pipeline; Section 9-05 takes the same architecture into the GRPO regime where verifiable rewards from this section get to flex.

SFT to DPO Pipeline

This section wires Sections 9-02 (TRL library) and 9-03 (reward design) into a runnable end-to-end pipeline: take a base 7B model, SFT it on finance instructions, DPO it on preference pairs, and ship a fine-tuned adapter that beats the base on a held-out finance benchmark. The recipe is the one most production teams actually use in 2025.

The pipeline at a glance

base model               (e.g. Llama-3.1-8B-Instruct)

   ▼  SFT  (Section 9-02 SFTTrainer, Section 7-03 data pipeline)

SFT model                (style adapted, schema mostly correct)

   ▼  DPO  (Section 9-02 DPOTrainer, Section 9-03 preferences)

DPO model                (preferred outputs preferred more often)

   ▼  Eval (Section 9-03 calibration set + held-out OOD)

ship as PEFT adapter

Three guard-rails that travel with this pipeline:

  • PEFT (LoRA) only. Both SFT and DPO update the same LoRA adapter. The base weights stay frozen. The shipped artefact is a small adapter (tens of MB, up to roughly 100 MB depending on rank and target modules), not a 16 GB model copy.
  • Reference model from the SFT checkpoint. DPO compares the policy to a reference; using the SFT checkpoint as the reference rather than the original base gives DPO a much closer starting distribution and stabler training.
  • Held-out evaluation untouched by either step. The same OOD set is used to compare base, SFT, and DPO models; no reuse of training prompts in evaluation.
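
The third guard-rail is easy to state and easy to violate by accident. A minimal sketch of a leak check, assuming prompts are plain strings; `prompt_key` and `leaked_prompts` are hypothetical helper names. It only catches exact matches after whitespace and case normalisation, so near-duplicates (the same filing with a reworded instruction) still need a separate check.

```python
import hashlib
import re

def prompt_key(prompt: str) -> str:
    """Hash a prompt after lower-casing and collapsing whitespace."""
    normalised = re.sub(r"\s+", " ", prompt.strip().lower())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def leaked_prompts(train_prompts, eval_prompts):
    """Return the eval prompts whose normalised form also appears in training."""
    seen = {prompt_key(p) for p in train_prompts}
    return [p for p in eval_prompts if prompt_key(p) in seen]

train = ["Extract revenue from the 10-Q.", "Summarise the MD&A section."]
held_out = ["extract  revenue from the 10-Q. ", "Compute free cash flow."]
print(leaked_prompts(train, held_out))  # the first held-out prompt is a leak
```

Run it against the union of the SFT prompts and the preference-pair prompts, since either stage can contaminate the evaluation set.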

Stage 1: SFT

Assume a finance instruction dataset data/sft_finance.jsonl where each row is

{"prompt": "Extract revenue and net income from the following 10-Q...",
 "completion": "{\n  \"revenue\": 12345,\n  \"net_income\": 678\n}\n"}

The trainer:

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
 
base = "meta-llama/Llama-3.1-8B-Instruct"
tok  = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
 
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
 
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
 
cfg = SFTConfig(
    output_dir="ckpt/sft-finance",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    packing=True,
    max_seq_length=4096,
    save_strategy="epoch",
)
 
ds_train = load_dataset("json", data_files="data/sft_finance.jsonl", split="train")
 
trainer = SFTTrainer(
    model=model, processing_class=tok,
    args=cfg, train_dataset=ds_train, peft_config=lora,
)
trainer.train()
trainer.save_model("ckpt/sft-finance")

Two practical notes:

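  • packing=True concatenates short examples into full-length sequences, so fewer tokens are wasted on padding and throughput rises substantially on short examples; use it unless your data is dominated by very long single examples.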
  • learning_rate=2e-4 for LoRA; full fine-tuning would be smaller. Don't accidentally use full-FT learning rates with LoRA — you will under-train.
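
The LoRA configuration above also fixes how large the shipped adapter is. A back-of-envelope sketch of the arithmetic follows. The layer shapes are assumptions taken from the public Llama-3.1-8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), so check them against your own base model.

```python
# Assumed Llama-3.1-8B shapes; verify against the config of the base you use.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) of each targeted projection in one decoder layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(rank, target_modules, layers=LAYERS):
    """LoRA adds A (rank x in) and B (out x rank) for every targeted matrix."""
    per_layer = sum(rank * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * layers

n = lora_params(16, list(SHAPES))
print(f"{n:,} trainable parameters")
print(f"{n * 2 / 1e6:.0f} MB at 2 bytes/param, {n * 4 / 1e6:.0f} MB at 4 bytes/param")
```

Restricting `target_modules` to the four attention projections cuts the count to roughly a third, which is the main lever if adapter size matters for deployment.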

Stage 2: collect preference pairs

Three patterns for getting (prompt, chosen, rejected) triples on a finance dataset:

  • Sample-and-judge. Sample 4–8 completions per prompt from the SFT model with temperature=0.7. Have an annotator (human or LLM judge with a programmatic backstop from Section 9-03) pick the best and worst. Use those as chosen and rejected.
  • SFT-vs-rule. Use the SFT completion as chosen and a deliberately-flawed alternative (rule-based output, base model output, mangled JSON) as rejected. Cheap; the resulting DPO mostly teaches the policy to avoid obvious errors.
  • Programmatic re-ranking. For tasks with a verifiable reward (Section 9-03), score samples and pick the best as chosen, worst as rejected. The reward function does the labelling work.

A reasonable starter dataset size is 1–5K pairs; below that DPO is noisy.
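
The programmatic re-ranking pattern can be sketched in a few lines. This assumes completions have already been sampled and that the reward takes `(prompt, completion)` and returns a float. `build_pairs`, `toy_reward`, and the `min_gap` threshold are illustrative names and values, not part of TRL.

```python
import json

def build_pairs(samples_by_prompt, reward_fn, min_gap=0.2):
    """Turn scored samples into rows with prompt, chosen, rejected columns.

    Prompts whose best and worst samples score within `min_gap` of each
    other are skipped: a pair with no clear contrast is label noise.
    """
    rows = []
    for prompt, completions in samples_by_prompt.items():
        if len(completions) < 2:
            continue
        scored = sorted(((reward_fn(prompt, c), c) for c in completions),
                        key=lambda t: t[0])
        (low, rejected), (high, chosen) = scored[0], scored[-1]
        if high - low < min_gap:
            continue
        rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return rows

def toy_reward(prompt, completion):
    """Stand-in reward: 1.0 if the completion parses as JSON, else 0.0."""
    try:
        json.loads(completion)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

samples = {
    "Extract revenue.": ['{"revenue": 100}', '{"revenue": 100', "revenue is 100"],
    "Extract net income.": ['{"net_income": 5}', '{"net_income": 7}'],
}
pairs = build_pairs(samples, toy_reward)
print(len(pairs), pairs[0]["chosen"])
```

The second prompt yields no pair because both samples score the same. That filter keeps weak contrasts out of the DPO set at the cost of discarding some prompts.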

Stage 3: DPO

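from datasets import load_dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig

# `base` and `tok` are the model id and tokenizer defined in Stage 1.
# Load the SFT adapter twice on one frozen base: once as the trainable
# policy (adapter "default") and once as a frozen copy for the reference.
sft_path = "ckpt/sft-finance"
policy = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
policy = PeftModel.from_pretrained(policy, sft_path, is_trainable=True)
policy.load_adapter(sft_path, adapter_name="reference")

ds_pref = load_dataset("json", data_files="data/pref_pairs.jsonl",
                        split="train")
# expects columns: prompt, chosen, rejected

dpo_cfg = DPOConfig(
    output_dir="ckpt/dpo-finance",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    beta=0.1,
    bf16=True,
    max_length=4096,
    max_prompt_length=2048,
    save_strategy="epoch",
    model_adapter_name="default",    # adapter that receives gradients
    ref_adapter_name="reference",    # frozen SFT adapter used as the reference
)                                    # (check these two names in your TRL version)

trainer = DPOTrainer(
    model=policy,
    ref_model=None,                 # reference = "reference" adapter; without it,
                                    # TRL disables the adapter and compares against
                                    # the frozen base, not the SFT model
    processing_class=tok,
    args=dpo_cfg, train_dataset=ds_pref,
)
trainer.train()
trainer.save_model("ckpt/dpo-finance")   # writes the trained adapter for deployment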

Three things to watch during DPO training:

  • The margin between rewards/chosen and rewards/rejected (logged as rewards/margins) in the trainer's metrics. Should grow during training; if it spikes early and collapses, the policy has found a degenerate solution.
  • KL divergence to the reference. Should grow gradually; a step-function jump means the optimisation blew up.
  • Token-level entropy. Should decrease but not collapse to zero. If it does, raise beta (a larger beta holds the policy closer to the reference) or stop earlier.
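
To read those metrics it helps to see what they are computed from. The sketch below evaluates the standard DPO objective for a single pair from four summed sequence log-probabilities. The function name `dpo_terms` and the example numbers are illustrative, and a real trainer works on batched tensors rather than floats.

```python
import math

def dpo_terms(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO quantities from summed sequence log-probabilities."""
    reward_chosen = beta * (pi_chosen - ref_chosen)        # implicit reward, chosen
    reward_rejected = beta * (pi_rejected - ref_rejected)  # implicit reward, rejected
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable softplus form
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return {"reward_chosen": reward_chosen, "reward_rejected": reward_rejected,
            "margin": margin, "loss": loss}

# At step 0 the policy equals the reference: zero margin, loss = log 2.
start = dpo_terms(-45.0, -50.0, -45.0, -50.0)
# Later the policy has raised the chosen log-prob and lowered the rejected one.
later = dpo_terms(-40.0, -60.0, -45.0, -50.0)
print(start["loss"], later["margin"], later["loss"])
```

Two things follow directly from the formula. The loss starts at log 2 (about 0.693) whatever the data, so that value is the baseline to compare against. And beta scales how much log-probability movement is needed for a given margin, which is why a larger beta keeps the policy closer to the reference.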

Stage 4: evaluation

Three evaluation tracks, all run on a calibration set the model has not seen:

  • Programmatic eval. Run the schema/reconciliation reward (Section 9-03) on base, SFT, DPO outputs. The scores should be ordered DPO ≥ SFT ≥ base.
  • Human-graded eval. A held-out sample of 100–500 prompts reviewed by an annotator using the firm's rubric. Report preference rates: SFT vs base, DPO vs SFT, DPO vs base.
  • Negative-control eval. A set of out-of-scope prompts (e.g. general-purpose factual questions) where the fine-tuned model should not have regressed. Catches the case where DPO over-specialised and broke the model on tasks outside the finance scope.

def evaluate_models(prompts, base_model, sft_model, dpo_model, tok, reward_fn):
    rows = []
    for p in prompts:
        outs = {
            "base": generate(base_model, tok, p),
            "sft":  generate(sft_model,  tok, p),
            "dpo":  generate(dpo_model,  tok, p),
        }
        rows.append({k: reward_fn(p, v) for k, v in outs.items()})
    return rows

The bar to ship: DPO wins on programmatic eval and human eval and does not regress on negative controls. Two out of three is not enough.
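
The three-way bar is simple enough to encode as a gate in the release script. A minimal sketch, assuming each track has already been reduced to summary numbers; the dictionary keys, the 0.5 preference threshold, and the `tolerance` parameter are illustrative choices.

```python
def ready_to_ship(programmatic, human_pref, negative_control, tolerance=0.0):
    """Apply the ship bar: every check must pass.

    programmatic: mean reward per model, keys 'base', 'sft', 'dpo'.
    human_pref: fraction of human comparisons DPO wins, keys 'vs_sft', 'vs_base'.
    negative_control: mean out-of-scope score, keys 'base', 'dpo'.
    tolerance: drop on negative controls allowed before it counts as a regression.
    """
    checks = {
        "programmatic": programmatic["dpo"] > max(programmatic["sft"],
                                                  programmatic["base"]),
        "human": human_pref["vs_sft"] > 0.5 and human_pref["vs_base"] > 0.5,
        "negative_control": (negative_control["dpo"]
                             >= negative_control["base"] - tolerance),
    }
    return all(checks.values()), checks

ok, detail = ready_to_ship(
    programmatic={"base": 0.41, "sft": 0.72, "dpo": 0.81},
    human_pref={"vs_sft": 0.58, "vs_base": 0.77},
    negative_control={"base": 0.66, "dpo": 0.59},
)
print(ok, detail)
```

In this made-up example the model wins two tracks and still fails the gate, because the negative-control score dropped.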

A more complete evaluation harness for production use:

import json
from dataclasses import dataclass, asdict
from pathlib import Path
 
@dataclass
class EvalReport:
    dataset: str
    n: int
    base_score: float
    sft_score: float
    dpo_score: float
    sft_vs_base_pref: float          # fraction of pairs where SFT > base
    dpo_vs_sft_pref: float
    dpo_vs_base_pref: float
    regressions: list[dict]          # tasks where DPO did worse than SFT
 
 
def production_eval(suite, base, sft, dpo, tok, reward_fn,
                    out_path: Path) -> EvalReport:
    base_scores, sft_scores, dpo_scores = [], [], []
    regressions = []
    for case in suite:
        b = reward_fn(case, generate(base, tok, case["input"]))
        s = reward_fn(case, generate(sft,  tok, case["input"]))
        d = reward_fn(case, generate(dpo,  tok, case["input"]))
        base_scores.append(b); sft_scores.append(s); dpo_scores.append(d)
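        if d < s - 0.5:              # regression threshold; tune to the reward scale
            regressions.append({"id": case["id"], "sft": s, "dpo": d})

    n = len(suite)
    if n == 0:
        raise ValueError("evaluation suite is empty")

    def pref(a, b):
        """Fraction of cases where model a scores strictly higher than model b."""
        return sum(1 for x, y in zip(a, b) if x > y) / n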
    report = EvalReport(
        dataset=str(out_path),
        n=n,
        base_score=sum(base_scores) / n,
        sft_score=sum(sft_scores) / n,
        dpo_score=sum(dpo_scores) / n,
        sft_vs_base_pref=pref(sft_scores, base_scores),
        dpo_vs_sft_pref=pref(dpo_scores, sft_scores),
        dpo_vs_base_pref=pref(dpo_scores, base_scores),
        regressions=regressions,
    )
    out_path.write_text(json.dumps(asdict(report), indent=2))
    return report

The output is the artefact you take to a model-risk reviewer: mean rewards per model, pairwise preference rates, and a list of per-task regressions. A reviewer who cannot reproduce the numbers from this JSON given the model checkpoint and the dataset has not been given enough; one who can has the full audit trail.

What can go wrong

A working list of failure modes and fixes:

  • DPO does not improve over SFT. Most common cause: weak preference data. Re-collect with stronger contrasts (chosen clearly better than rejected); the noisier the labels, the more pairs you need. A second cause: beta too high, which holds the policy too close to the reference; try a lower value.
  • Mode collapse. The DPO policy emits near-identical responses for varied prompts. Lower the LoRA rank, raise beta, shorten training.
  • Catastrophic forgetting on negative controls. The DPO policy is much worse on out-of-scope prompts. Add a small sample of general instruction data to the DPO training set, pairing the SFT response as chosen with a visibly degraded response as rejected; this anchors the policy to the SFT distribution on out-of-scope inputs. Pairs with identical chosen and rejected responses do not help: their DPO margin is zero whatever the weights are, so they contribute no gradient.
  • Reward hacking on programmatic rewards. The policy finds a loophole in the reward function. Tighten the reward (Section 9-03) or add a graded penalty for length / format / boilerplate.
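
For the last item, a graded penalty is usually safer than a hard cut-off, because the policy still receives a gradient of reward around the boundary. A minimal sketch of a length penalty, with hypothetical parameter names and defaults (`budget`, `slope`, `floor`) that would need tuning to the task:

```python
def penalised_reward(base_reward, n_tokens, budget=512, slope=0.001, floor=0.0):
    """Subtract a penalty that grows linearly once a completion exceeds `budget`.

    base_reward: score from the programmatic reward.
    n_tokens: completion length in tokens.
    slope: reward lost per token over budget.
    floor: lowest value the penalised reward can take.
    """
    overflow = max(0, n_tokens - budget)
    return max(floor, base_reward - slope * overflow)

for length in (400, 1012, 5000):
    print(length, penalised_reward(1.0, length))
```

Completions inside the budget are untouched, so the penalty only shapes behaviour where padding or boilerplate starts to appear.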

Shipping the artefact

The final artefact is a PEFT adapter, roughly 80–170 MB on disk for the LoRA config used here, depending on the saved precision. To deploy:

from peft import PeftModel
from transformers import AutoModelForCausalLM
 
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype="bfloat16"
)
adapter = PeftModel.from_pretrained(base, "ckpt/dpo-finance")
adapter.eval()

For lower latency, merge the adapter into the base weights once training is final:

merged = adapter.merge_and_unload()
merged.save_pretrained("ckpt/dpo-finance-merged")

The merged model is a standard AutoModelForCausalLM; it serves through vllm or any other inference runtime without the PEFT overhead.

Where this fits

This pipeline is the operational version of the methods in 9-02 and 9-03. Section 9-05 takes the same skeleton into a third regime — GRPO with verifiable rewards — for tasks where pairwise preference data is impractical to collect but a programmatic scorer exists.

GRPO and Reasoning

The 2024–2025 wave of "reasoning" language models, including DeepSeek-R1 (Guo et al., 2025), OpenAI's o1/o3 family, and Anthropic's extended-thinking modes, trains policies that produce long chains of thought before emitting a final answer. The training method behind most open-source members of this family is Group Relative Policy Optimisation (GRPO) with verifiable rewards. This section covers how GRPO works, why it suits financial reasoning tasks, and how to run a GRPO fine-tune with TRL on a narrow finance benchmark.

What GRPO is

PPO needs a value function (a learned baseline) to compute advantages. GRPO removes the value function and uses group statistics instead. For each prompt, sample $N$ candidate completions, score each with a reward $r_i$, and normalise within the group:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_N)}{\operatorname{std}(r_1, \dots, r_N)}, \qquad i = 1, \dots, N$$

The policy update is then PPO-style with the clipped surrogate, but using $\hat{A}_i$ in place of the GAE advantage from a value model.
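
The group normalisation is small enough to write out. The sketch below standardises the rewards of one prompt's rollouts (subtract the group mean, divide by the group standard deviation) and applies the PPO-style clip to a single probability ratio. It uses the population standard deviation and a small `eps` for stability. Both are assumptions of this sketch, and implementations differ on such details.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Standardise rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for one token, fed the group advantage."""
    clipped = min(max(ratio, 1 - clip), 1 + clip)
    return min(ratio * advantage, clipped * advantage)

mixed = group_advantages([0.0, 0.0, 1.0, 1.0])    # two failures, two successes
uniform = group_advantages([0.5, 0.5, 0.5, 0.5])  # every rollout scored the same
print(mixed)
print(uniform)
print(clipped_term(1.5, 1.0), clipped_term(1.5, -1.0))
```

The second group shows a practical consequence: when every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal. This is one reason partial-credit rewards matter.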

Two consequences make GRPO operationally attractive:

  • No value-model training. One fewer network to train, debug, and keep aligned with the policy.
  • Variance reduction within a group. A prompt-specific baseline is closer to the optimal baseline than any global value-function estimate, which means more stable training at smaller batches.

The trade-off is that GRPO needs N rollouts per prompt rather than one (the recipe below uses N = 8). With vLLM rollouts (Section 9-02) this is affordable; without it, the per-step cost scales poorly.

Why GRPO suits finance reasoning

Three task shapes that benefit from the GRPO recipe:

  • Multi-step extraction with reconciliation. Pull figures from multiple tables in a 10-K; produce derived ratios that must reconcile to source totals. The reasoning trajectory has many branching points; a verifiable reward (totals match) tells the policy when it found a working trajectory.
  • Structured derivation. "Compute the company's free cash flow from these primary statements." There is a unique answer; the path to it requires interpretation of accounting conventions; a numeric-match reward suffices for training signal.
  • Code-from-spec. "Write a Polars query that computes a rolling 60-day Sharpe per ticker." The reward is whether the code compiles and produces the expected output on a small test fixture.

Notice the common structure: clear criterion, rich solution space, partial-credit shape in the reward function. SFT teaches "do this exact trajectory"; GRPO teaches "find any trajectory that hits the criterion".
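
For the structured-derivation case, a numeric-match reward with a partial-credit band might look like the sketch below. The function name and both tolerances are illustrative assumptions. The shape is the point: full credit near the target, a linear ramp down, zero beyond.

```python
def numeric_match_reward(predicted, expected, full_tol=0.001, partial_tol=0.05):
    """1.0 within `full_tol` relative error, decaying linearly to 0.0 at `partial_tol`."""
    if predicted is None:          # the answer could not be parsed as a number
        return 0.0
    scale = max(abs(expected), 1e-9)
    rel_err = abs(predicted - expected) / scale
    if rel_err <= full_tol:
        return 1.0
    if rel_err >= partial_tol:
        return 0.0
    return (partial_tol - rel_err) / (partial_tol - full_tol)

for guess in (1000.0, 1010.0, 1020.0, 1100.0, None):
    print(guess, numeric_match_reward(guess, 1000.0))
```

The ramp gives groups with no exact hit something to rank, at the cost of rewarding near misses. In a reconciliation setting, a near miss may be exactly the error you want to punish, so the band should be kept narrow.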

A GRPO recipe for finance

The skeleton:

from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
 
base = "meta-llama/Llama-3.1-8B-Instruct"
tok  = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
 
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

A composite reward made of programmatic checks (Section 9-03):

import json, re
from jsonschema import validate, ValidationError
 
SCHEMA = {...}                       # the target schema for this task
 
def reasoning_reward(completions, prompts, **kwargs):
    """Reward a list of completions; return a list of scalars."""
    rewards = []
    for prompt, comp in zip(prompts, completions):
        score = 0.0
 
        # 1) format: <think>...</think><answer>...</answer>
        if not re.search(r"<think>.*?</think>", comp, re.S):
            rewards.append(0.0); continue
        m = re.search(r"<answer>(.*?)</answer>", comp, re.S)
        if not m:
            rewards.append(0.1); continue
        score += 0.2
 
        # 2) JSON validity in the <answer>
        try:
            out = json.loads(m.group(1))
            validate(out, SCHEMA)
            score += 0.3
        except (json.JSONDecodeError, ValidationError):
            rewards.append(score); continue
 
        # 3) reconciliation against source totals
        if reconciles(out, prompt):
            score += 0.5
        rewards.append(score)
    return rewards
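
The reward above calls a `reconciles` helper that is not shown. What it checks depends entirely on the task, so the following is only one possible shape under loudly stated assumptions: the answer JSON carries a `components` mapping and a `total` (both field names are hypothetical), and the source total appears as a figure somewhere in the prompt text.

```python
import re

def reconciles(out, prompt, rel_tol=1e-6):
    """Hypothetical check: parts sum to the total, and the total is in the source."""
    components = out.get("components")
    total = out.get("total")
    if not isinstance(components, dict) or not isinstance(total, (int, float)):
        return False
    if not all(isinstance(v, (int, float)) for v in components.values()):
        return False
    slack = rel_tol * max(abs(total), 1.0)

    # (a) internal consistency: the parts must add up to the stated total
    if abs(sum(components.values()) - total) > slack:
        return False

    # (b) grounding: the stated total must appear as a figure in the prompt
    figures = {float(tok.replace(",", ""))
               for tok in re.findall(r"\d[\d,]*(?:\.\d+)?", prompt)}
    return any(abs(total - f) <= slack for f in figures)

source = "Segment A revenue 7,000; Segment B revenue 5,345; total revenue 12,345."
print(reconciles({"components": {"A": 7000, "B": 5345}, "total": 12345}, source))
print(reconciles({"components": {"A": 7000, "B": 5000}, "total": 12000}, source))
```

The second call is internally consistent but not grounded in the source, which is the kind of plausible-looking answer a reconciliation reward exists to reject.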

The dataset is a list of prompts; no chosen / rejected columns are needed because GRPO scores its own rollouts.

ds = load_dataset("json", data_files="data/finance_prompts.jsonl",
                   split="train")
 
cfg = GRPOConfig(
    output_dir="ckpt/grpo-finance",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_generations=8,                # group size N
    max_prompt_length=2048,
    max_completion_length=2048,
    learning_rate=1e-6,
    beta=0.04,
    bf16=True,
    use_vllm=True,                   # accelerated rollout
)
 
GRPOTrainer(
    model=model, peft_config=lora, processing_class=tok,
    args=cfg, train_dataset=ds, reward_funcs=[reasoning_reward],
).train()

Three knobs that matter most:

  • num_generations (N). Larger groups give a smaller-variance baseline, and very small groups leave the baseline too noisy to be useful; the recipe above uses N = 8.
  • beta (KL coefficient). Smaller than in DPO: the recipe above uses 0.04, against 0.1 in the DPO stage. R1-style training uses a very small beta because the policy needs to drift far from the SFT base to develop reasoning patterns.
  • max_completion_length. Reasoning trajectories grow during training; under-budgeting here is the most common cause of GRPO failing to learn.

What "develops" during GRPO

A surprising empirical observation from the R1 line of work: as GRPO training progresses, completion length grows without being explicitly rewarded for it. The model discovers that thinking longer lets it satisfy the reward more often. The pattern often looks like:

  • Step 0–500: completions are short and shallow; reward is near-zero.
  • Step 500–2K: completions get longer; the model starts emitting intermediate reasoning steps; reward climbs.
  • Step 2K+: occasional "aha" moments where the model produces a qualitatively new reasoning structure; reward plateaus then jumps.

This is partly why GRPO suits reasoning: the optimiser has to find a trajectory through a large space of possible reasoning paths, and it does so by lengthening the chain of thought.

Three diagnostics to track:

  • Mean completion length. Should grow then stabilise. Sudden collapse means the policy found a short loophole in the reward.
  • Reward distribution. Should shift from 0-heavy to 1-heavy as training progresses; multimodal in the early phase, narrower later.
  • Format validity rate. The fraction of completions that satisfy the format reward (the lowest-difficulty component); it should climb to a high, stable level within a few hundred steps and stay there.
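
All three diagnostics can be computed from a logged batch of rollouts. A minimal sketch, assuming completions are strings in the `<think>`/`<answer>` format used by the reward above; the function name is hypothetical, and whitespace-split word count stands in for token count so no tokenizer is needed.

```python
import re
from collections import Counter
from statistics import mean

FORMAT = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.S)

def rollout_diagnostics(completions, rewards):
    """Summarise one batch of rollouts: length, format validity, reward histogram."""
    lengths = [len(c.split()) for c in completions]   # cheap proxy for token count
    valid = [bool(FORMAT.search(c)) for c in completions]
    buckets = Counter(round(r, 1) for r in rewards)
    return {
        "mean_length": mean(lengths),
        "format_validity": sum(valid) / len(completions),
        "reward_hist": dict(sorted(buckets.items())),
    }

batch = [
    '<think>sum the segments</think><answer>{"total": 3}</answer>',
    "no tags here",
    "<think>x</think>",
    "<think>a b</think> <answer>1</answer>",
]
stats = rollout_diagnostics(batch, [1.0, 0.0, 0.1, 0.5])
print(stats)
```

Logging this dictionary at a fixed step interval gives the three curves described above without touching the trainer internals.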

Caveats and known failure modes

  • Reward hacking is more dangerous in GRPO than in DPO, because the reward function is the only signal. Programmatic rewards must be tight; partial-credit components must be monotone in genuine quality, not just length or formatting.
  • Multilingual drift. The R1 paper notes that early training led to outputs mixing languages because the model had not anchored on English. A small language-consistency auxiliary reward is the standard fix.
  • Collapse to verbose hedging. Without an entropy penalty, the policy can collapse to long but uninformative outputs that scrape through the format reward without doing the task. Mitigation: weight the substantive component (reconciliation, numeric correctness) much higher than the format component.
  • Compute footprint. GRPO with a group of N rollouts per prompt on an 8B-parameter model takes roughly 10× more wall-clock than DPO at the same dataset size. Budget accordingly; vLLM rollouts roughly halve the cost.
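
As an illustration of the language-consistency idea, the sketch below scores the share of alphabetic characters that are ASCII. This is a deliberately crude proxy for "written in English" and an assumption of this sketch, not the measure used in the R1 work. A production version would use a proper language identifier.

```python
def language_consistency(text):
    """Fraction of alphabetic characters that are ASCII letters (crude English proxy)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isascii() for ch in letters) / len(letters)

def with_language_bonus(task_reward, text, weight=0.1):
    """Add a small auxiliary term; `weight` keeps it well below the task reward."""
    return task_reward + weight * language_consistency(text)

print(language_consistency("Revenue grew 4% year on year"))
print(language_consistency("收入 grew"))
print(with_language_bonus(0.5, "Revenue grew"))
```

Keeping `weight` small matters for the same reason as in the verbose-hedging item above: an auxiliary term that rivals the substantive reward becomes a target in its own right.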

When GRPO is overkill

GRPO is the right tool when you have a verifiable reward and a task that benefits from chain-of-thought reasoning. It is overkill when:

  • The task is surface-level (style, vocabulary, schema match without reasoning). SFT or DPO suffice.
  • The reward function is brittle. Reward hacking will dominate the signal; you will spend more time tightening the reward than training the model.
  • The compute budget is small. Even with vLLM, GRPO is the most expensive training mode in this chapter.

Where this fits

GRPO closes the chapter's progression: SFT (Section 7-03), DPO (Sections 9-02 and 9-04), and finally GRPO for reasoning. Together they cover the practical fine-tuning landscape for finance LLMs in 2025. Chapter 10 (synthetic data) provides the stress-test substrate for evaluating these models on regimes the training data did not contain, the same stress-test discipline the rest of the book uses for forecasters and policies.