Chapter 05

Optimal Decision

“I wake up every morning, despite not knowing what to do, I put one foot in front of the other and try to make the best choice I can. I screw up all the time but that is being human and that's my greatest strength…”
Superman (2025)

The quote captures a fundamental truth about decision-making: the future is uncertain, and mistakes are inevitable. Yet we constantly evaluate options, anticipate consequences, and try to select the best possible choice given what we know.

This chapter explores how economics, decision theory, and machine learning formalize this idea. Here, an optimal decision is not a perfect choice—rather, it is the best action an agent can take given their information, constraints, and objectives. We will study how optimal decisions are defined, modeled, and computed in both classical economics and modern reinforcement learning.

By the end of the chapter, you will understand:

  • how economics defines optimality under uncertainty

  • how dynamic programming provides a recursive method for optimal decision-making

  • how optimal policies are derived

  • and how Deep Q-Networks approximate optimal decisions in complex environments


What Is Optimal Decision?

"Optimal" in finance rarely means perfect. It means the best action an agent can take given its information, constraints, and objectives. The job of this chapter is to build that idea up from a single-period allocation problem to dynamic programming and reinforcement learning, keeping the same notation and the same conceptual line: forecasts feed a utility, the utility lives inside a constraint set, and time-consistent behaviour falls out of a recursion.

Ingredients of an optimal decision

Every optimal-decision problem has four ingredients:

  • Objective. The utility or cost function $U$ to maximise (or minimise). The shape of $U$ (linear, quadratic, CRRA, recursive Epstein–Zin) is what encodes risk attitude and intertemporal preference.
  • Information set. $\mathcal{F}_t$, the history available at the decision moment. Policies may only depend on $\mathcal{F}_t$; a policy that uses future prices is not a policy.
  • Action space. $a_t \in \mathcal{A}$, possibly state-dependent. Box constraints, simplex projections, leverage caps, turnover budgets all live here.
  • Dynamics. How actions and shocks evolve the state. Wealth update, inventory dynamics, regime transitions — see Chapter 6 for the state-space view of these.

If any ingredient is missing, "optimal" is undefined. A trading signal without execution constraints is not optimal once slippage is real; a utility without a discount factor is not optimal across horizons.

Static vs. dynamic decisions

  • Static. Choose once given $\mathcal{F}_t$; for example, Markowitz weights given $(\mu, \Sigma)$. The optimisation is a single convex problem (Section 02-03) and the answer has closed-form structure.
  • Dynamic. Choose repeatedly; future opportunities depend on today's choice. Requires dynamic programming (Section 5-03) or policy search (Sections 5-04 / 5-05).

Real trading is dynamic. Risk budgets reset daily, funding costs change, counterparties default, and information arrives. The static problem is the right starting point — it is the benchmark every dynamic policy should be measured against — but it is rarely the answer.

Utility and risk preferences

Utility shapes the optimal portfolio. Three forms recur:

  • Quadratic / mean–variance. Penalises variance linearly; equivalent to expected utility under joint Gaussianity.
  • CRRA / CARA. Captures diminishing marginal utility of wealth. CRRA scales risk premium with wealth (the right fit for compounding investors); CARA gives wealth-independent positions (the right fit for fixed-capacity desks).
  • Prospect theory. Asymmetric value function with loss aversion and probability weighting. Important for understanding investor flows; rarely used inside an optimiser because it is non-convex.

Choosing the wrong utility is equivalent to optimising someone else's problem. Stakeholder interviews and governance guidelines should inform the selected objective before any code is written.

Information sets and avoiding leakage

Optimality is conditional on what you know. A policy can depend on the information set $\mathcal{F}_t$ and nothing more. This is also the constraint that prevents look-ahead bugs in machine-learning pipelines: feature engineering must avoid future data, and validation must split along time. The forecasting evaluation discipline of Section 02-02 is what keeps this enforceable.

From forecasts to decisions

Forecasting models deliver expected returns, conditional volatilities, and predictive distributions. Optimal-decision modules convert them into trades by solving optimisation problems or evaluating learned policies. The glue between the two is the loss / utility function. A probabilistic return forecast feeds a mean–CVaR optimisation; a scenario tree feeds a reinforcement-learning environment; a regime probability feeds a switching policy.

Without the principle of optimality, we would have to solve the entire multi-period decision problem in one impossible step. With it, we can work backward — solve the last-period problem for any reachable state, plug that solution into the second-to-last period, and so on. That is the recursion the rest of the chapter formalises.


Investing for three days

To see the principle of optimality in practice, consider a three-day investment problem. We begin at day 0 with wealth $W_0$. At each decision date (day 0 and day 1) we choose how to allocate wealth between a risky asset with random gross return $R_{t+1}$ and a risk-free asset with known gross return $R_f$. For now, the risk-free rate is constant and there is only one risky asset; think of the choice as "S&P 500 vs. cash." We assume all wealth stays invested, and the only thing that matters is final wealth on day 2, $W_2$.

Even in this minimal setting, sequential optimal decision-making appears clearly.

Wealth dynamics

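Let $\alpha_t$ be the fraction of wealth invested in the risky asset at day $t$, so $1 - \alpha_t$ goes into the risk-free asset. The portfolio gross return from day $t$ to $t+1$ is

$$R^p_{t+1} = \alpha_t R_{t+1} + (1 - \alpha_t) R_f.$$

Wealth then evolves as $W_{t+1} = W_t R^p_{t+1}$. Over three days,

$$W_1 = W_0 R^p_1, \qquad W_2 = W_1 R^p_2.$$

By substitution, final wealth depends on both decisions:

$$W_2 = W_0\,[\alpha_0 R_1 + (1 - \alpha_0) R_f]\,[\alpha_1 R_2 + (1 - \alpha_1) R_f].$$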

Objective

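We model investor preferences with a utility function $U$ satisfying $U'(W) > 0$ (more wealth is better) and $U''(W) < 0$ (concavity, hence risk aversion). Concavity is what produces diversification in the static portfolio theory of Section 02-03; here it produces dynamic caution. With random returns, final wealth $W_2$ is random too, so we maximise expected utility over the two risky shares $\alpha_0$ and $\alpha_1$:

$$\max_{\alpha_0,\,\alpha_1} \; \mathbb{E}\big[U(W_2)\big].$$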

This generalises mean–variance optimisation. When utility is quadratic, expected utility reduces to mean–variance; for general utility, it handles non-normal return distributions and richer preferences.

Solving all at once

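A brute-force approach treats the two risky shares $\alpha_0$ and $\alpha_1$ as two unknowns and solves the joint first-order conditions:

$$\frac{\partial}{\partial \alpha_0}\,\mathbb{E}\big[U(W_2)\big] = 0, \qquad \frac{\partial}{\partial \alpha_1}\,\mathbb{E}\big[U(W_2)\big] = 0.$$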

For two periods this is feasible. It does not scale: the action space grows exponentially in horizon, and the joint optimisation has no useful structure. We need to exploit the sequential nature of the problem.

The principle of optimality

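Imagine we are standing at day 1 with wealth $W_1$ already known. Only one decision remains, the day-1 risky share $\alpha_1$:

$$V_1(W_1) = \max_{\alpha_1} \; \mathbb{E}\big[U\big(W_1\,[\alpha_1 R_2 + (1 - \alpha_1) R_f]\big) \,\big|\, W_1\big].$$

This subproblem does not depend on how we arrived at $W_1$. Whatever happened on day 0 is summarised in the state $W_1$, and the optimal $\alpha_1^*(W_1)$ is defined entirely by the day-1 information set.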

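Now step back to day 0. When we choose the day-0 risky share $\alpha_0$ we know two things:

  1. $\alpha_0$ determines the distribution of $W_1$.
  2. From day 1 onwards we will follow the optimal day-1 rule $\alpha_1^*(W_1)$.

The day-0 problem becomes

$$\max_{\alpha_0} \; \mathbb{E}\big[V_1(W_1)\big], \qquad W_1 = W_0\,[\alpha_0 R_1 + (1 - \alpha_0) R_f].$$

Day 0 no longer has to "guess" what we will do in the future. The future is summarised by the day-1 value function $V_1(W_1)$, the maximised expected utility from day 1 onwards. This is the principle of optimality: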

If the decision for the final period is optimal, then the optimal decision for the previous period is the one that leads into that optimal future.

The recursion is solved backward and runs in time linear in the horizon. Section 5-03 turns this informal statement into the Bellman equation.
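The equivalence between the joint problem and the backward recursion can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text: log utility, two equally likely gross returns (1.3 and 0.8), a gross risk-free return of 1.0, and risky shares restricted to a coarse grid on [0, 1]. The function names are hypothetical.

```python
import itertools
import numpy as np

R_F = 1.0                               # gross risk-free return (assumed)
RETURNS = np.array([1.3, 0.8])          # two risky scenarios (assumed)
PROBS = np.array([0.5, 0.5])            # scenario probabilities (assumed)
GRID = np.linspace(0.0, 1.0, 21)        # candidate risky shares

def utility(w):
    return np.log(w)

def growth(alpha):
    """Gross portfolio return in each scenario for risky share alpha."""
    return alpha * RETURNS + (1.0 - alpha) * R_F

def solve_day1(w1):
    """Day-1 subproblem: best share and value for a known wealth w1."""
    values = [PROBS @ utility(w1 * growth(a)) for a in GRID]
    k = int(np.argmax(values))
    return GRID[k], values[k]

def solve_backward(w0):
    """Day-0 problem with the future summarised by the day-1 value."""
    best_value, best_share = -np.inf, None
    for a0 in GRID:
        w1 = w0 * growth(a0)            # wealth in each day-1 scenario
        v = PROBS @ np.array([solve_day1(w)[1] for w in w1])
        if v > best_value:
            best_value, best_share = v, a0
    return best_value, best_share

def solve_joint(w0):
    """Brute force over the day-0 share and one day-1 share per scenario."""
    best = -np.inf
    for a0, a1_up, a1_dn in itertools.product(GRID, repeat=3):
        w1 = w0 * growth(a0)
        v = (PROBS[0] * (PROBS @ utility(w1[0] * growth(a1_up)))
             + PROBS[1] * (PROBS @ utility(w1[1] * growth(a1_dn))))
        best = max(best, v)
    return best

value_backward, share_day0 = solve_backward(1.0)
value_joint = solve_joint(1.0)
```

Both routes reach the same expected utility, but the joint search enumerates every combination of decisions while the backward pass solves one small problem per state.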


Case study: a risk-neutral agent

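To see an explicit solution, take linear utility $U(W) = W$. Since $U''(W) = 0$, there is no curvature and no risk aversion. At day 1, with $\alpha_1$ the risky share,

$$\max_{\alpha_1} \; \mathbb{E}[W_2 \mid W_1] = W_1 \max_{\alpha_1} \big\{ R_f + \alpha_1\,(\mathbb{E}[R_2] - R_f) \big\}.$$

$W_1$ is positive and known, so the choice maximises $R_f + \alpha_1(\mathbb{E}[R_2] - R_f)$, which is linear in $\alpha_1$ with slope $\mathbb{E}[R_2] - R_f$. Therefore, with $\alpha_1$ restricted to $[0, 1]$ (no leverage, no shorting),

  • $\alpha_1^* = 1$ if $\mathbb{E}[R_2] > R_f$,
  • $\alpha_1^* = 0$ if $\mathbb{E}[R_2] < R_f$,
  • any $\alpha_1^* \in [0, 1]$ if the two are equal.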

The same argument at day 0 gives the same kind of bang-bang rule. The optimal three-day strategy under linear utility: in each period, invest all wealth in whichever asset has the higher expected gross return at that time.

This explicit solution is not realistic. Real investors are risk-averse, not risk-neutral, and a risk-neutral policy has a catastrophic flaw: whenever the risky asset carries some probability of a wipe-out, betting all wealth on it every period accumulates ruin probability. A 5% chance of ruin per period compounds to roughly a 64% chance of ruin over 20 periods ($1 - 0.95^{20} \approx 0.64$) and passes 99% after about 90 periods. Empirical evidence from financial markets is consistent with this: strategies that ignore downside risk tend to collapse eventually. Risk-neutral preferences are useful as a teaching device for the recursion; they are the wrong utility for any policy that has to live through stress.
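A short sketch of the compounding arithmetic, assuming ruin events are independent across periods with a constant per-period probability (a simplification introduced here for illustration; the function names are hypothetical):

```python
import math

def ruin_probability(p_ruin: float, periods: int) -> float:
    """Probability of at least one ruin event in `periods` independent periods."""
    return 1.0 - (1.0 - p_ruin) ** periods

def periods_to_reach(p_ruin: float, target: float) -> int:
    """Smallest number of periods whose cumulative ruin probability is >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_ruin))

print(round(ruin_probability(0.05, 20), 3))   # 0.642
print(periods_to_reach(0.05, 0.99))           # 90
```

Survival probability decays geometrically, so even a small per-period ruin chance dominates the outcome over a long enough horizon.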

Adding survival instinct

The risk-neutral case shows the recursion clearly. To get realistic behaviour we need a utility that encodes both risk aversion and survival concern. The minimum requirements:

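  • $U'(W) > 0$: more wealth is always better.
  • $U''(W) < 0$: concavity, hence risk aversion (Section 02-03).
  • $U(W) \to -\infty$ as $W \to 0^+$: ruin is catastrophic.

The optimisation problem is unchanged in form, $\max \mathbb{E}[U(W_2)]$, but low-wealth trajectories are heavily penalised. The day-1 first-order condition for the risky share $\alpha_1$ now reads

$$\mathbb{E}\big[U'(W_2)\,W_1\,(R_2 - R_f) \,\big|\, W_1\big] = 0,$$

with $W_2 = W_1\,[\alpha_1 R_2 + (1 - \alpha_1) R_f]$. Because $U'$ is decreasing, the marginal utility depends on $W_1$: lower initial wealth makes the agent more sensitive to downside risk. The optimal policy is in general wealth-dependent, $\alpha_1^* = \alpha_1^*(W_1)$, and for utilities with decreasing relative risk aversion it decreases as $W_1$ approaches zero. (CRRA utilities such as log are the knife-edge case in which the wealth dependence cancels exactly.) The dynamic programming structure is unchanged: we still solve backward, but the utility shape has produced cautious, wealth-dependent behaviour without imposing external constraints.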

What this chapter builds toward

Section 5-02 fills in the explicit utility forms (CRRA, log, quadratic, Epstein–Zin) and turns the recursion into the static and intertemporal optimisation problems they correspond to. Section 5-03 formalises the recursion as the Bellman equation and walks through value iteration, policy iteration, and the HJB equation. Sections 5-04 and 5-05 carry the recursion into reinforcement learning, where the value function and the policy are approximated by neural networks because the state space is too large to enumerate.

The principle of optimality is the load-bearing idea. The rest of the chapter is about what to do when its assumptions hold imperfectly — when the state is high-dimensional, the dynamics are unknown, or the utility includes risk attitudes that closed-form solutions do not support.

Economic Optimal Decision

The static portfolio theory of Section 02-03 is the right starting point; it is also where we stop if we ignore time. The economic version of the optimal-decision problem turns the single-period utility maximisation into a sequential one, and that is what every dynamic policy in the rest of the chapter will approximate. Working through the economic version first means the reinforcement-learning machinery later in the chapter has a clear target — and a clear failure criterion: when the RL solution diverges from the closed-form economic benchmark on the benchmark's home turf, the bug is in the RL setup, not in the theory.

Expected utility maximisation, in one period and many

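Given wealth $W_t$ and action $a_t$ (the risky share), wealth evolves as $W_{t+1} = W_t\,[a_t R_{t+1} + (1 - a_t) R_f]$ for stochastic returns $R_{t+1}$. The agent picks the sequence $\{a_t\}$ to maximise

$$\mathbb{E}\big[U(W_T) \,\big|\, \mathcal{F}_0\big].$$

The choice of $U$ is the choice of what the agent wants. Section 02-03 introduces the static recipe (mean–variance, CRRA, CARA); this section focuses on the dynamic generalisations that mean–variance cannot express.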

Beyond expected utility: recursive utility

Discounted expected CRRA tangles two attitudes that finance treats as separate: aversion to risk in the cross-section, and willingness to substitute consumption (or value) across time. Epstein–Zin recursive utility \citep{epstein1989substitution} unbundles them:

$$V_t = \Big[(1 - \beta)\,c_t^{\,1 - 1/\psi} + \beta\,\big(\mathbb{E}_t\big[V_{t+1}^{\,1 - \gamma}\big]\big)^{\frac{1 - 1/\psi}{1 - \gamma}}\Big]^{\frac{1}{1 - 1/\psi}},$$

with $\gamma$ risk aversion, $\psi$ the elasticity of intertemporal substitution, $\beta$ the subjective discount factor, and $c_t$ a per-period consumption (or consumption-like proxy when the agent does not literally consume). When $\gamma = 1/\psi$ the formula collapses to expected discounted CRRA. The two-parameter version is what asset-pricing models reach for whenever the long-run risk channel matters, and Section 5-04 plugs the same machinery into a critic-based RL backup so the policy learns risk attitude separately from patience.
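A minimal sketch of a single Epstein–Zin backup, assuming the standard form of the recursion with a discount factor `beta`, discrete scenarios for the continuation value, and $\gamma \neq 1$, $\psi \neq 1$. The function name, argument names, and default parameter values are illustrative assumptions, not taken from the text.

```python
import numpy as np

def epstein_zin_value(c, v_next, probs, beta=0.95, gamma=5.0, psi=1.5):
    """One Epstein-Zin backup over discrete scenarios.

    c      : current consumption (or consumption-like proxy), positive
    v_next : positive continuation values, one per scenario
    probs  : scenario probabilities
    """
    v_next = np.asarray(v_next, dtype=float)
    probs = np.asarray(probs, dtype=float)
    rho = 1.0 - 1.0 / psi                      # time-aggregation exponent
    # Certainty equivalent of the continuation value under risk aversion gamma.
    ce = (probs @ v_next ** (1.0 - gamma)) ** (1.0 / (1.0 - gamma))
    # Aggregate today's consumption with the certainty equivalent.
    return ((1.0 - beta) * c ** rho + beta * ce ** rho) ** (1.0 / rho)

# Riskless continuation: risk aversion is irrelevant.
safe_low = epstein_zin_value(1.0, [2.0, 2.0], [0.5, 0.5], gamma=2.0)
safe_high = epstein_zin_value(1.0, [2.0, 2.0], [0.5, 0.5], gamma=10.0)

# Risky continuation: higher risk aversion lowers the value.
risky_low = epstein_zin_value(1.0, [1.0, 3.0], [0.5, 0.5], gamma=2.0)
risky_high = epstein_zin_value(1.0, [1.0, 3.0], [0.5, 0.5], gamma=10.0)
```

The two comparisons show the separation at work: `gamma` only matters through the certainty equivalent, so it has no effect when the continuation value is riskless, while `psi` governs how today's consumption and the certainty equivalent are combined.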

Budget and market constraints in a dynamic problem

Section 02-03 lists the static constraints. The dynamic version adds two operational concerns that single-period optimisation cannot model:

  • Path-dependent costs. Quadratic transaction cost is the right approximation for temporary impact; fixed costs and minimum lot sizes break convexity and require specialised solvers or approximations.
  • State-dependent action sets. Cash gates, leverage caps that scale with realised volatility, margin buffers — the feasible set depends on the current state, which is precisely what static optimisation cannot represent.

Encoding these inside the dynamic problem is what stops the RL agent from learning policies that look great in simulation but are unrecognisable to a risk officer.

Lagrangian view and the static benchmark

Section 02-03 derives the closed-form mean–variance solution under a budget constraint. That solution is the static benchmark we keep referring to. Two operational uses:

  • Sanity-check the dynamic solver. On a stationary, single-period problem with Gaussian returns, any dynamic optimiser must reproduce the static solution.
  • Stress-test parameter sensitivity. A 10% perturbation of the expected-return vector $\mu$ should not flip a position long-to-short. If it does in the static benchmark, no dynamic version will save it.

The robust patches discussed in Section 02-03 — Black–Litterman, Ledoit–Wolf shrinkage, resampled frontier — are prerequisites for a sensible dynamic policy, not alternatives. Always run them on the static benchmark before anyone trains an RL policy on the same inputs.

Dynamic intertemporal problems and Merton's solution

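Single-period mean–variance ignores that today's allocation affects tomorrow's opportunity set. Merton's classical consumption–investment problem \citep{merton1969lifetime} solves the continuous-time version: with a single risky asset, CRRA preferences, and i.i.d. excess returns, the optimal risky share is

$$\alpha^* = \frac{\mu - r_f}{\gamma\,\sigma^2},$$

where $\mu - r_f$ is the expected excess return, $\sigma^2$ the return variance, and $\gamma$ the coefficient of relative risk aversion.

def merton_share(mu_excess: float, sigma2: float, gamma: float) -> float:
    """Merton risky share: expected excess return / (risk aversion * variance)."""
    return mu_excess / (gamma * sigma2)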

Three lessons survive every more elaborate model in the book:

  • Risk aversion → smaller risky share. Doubling $\gamma$ halves the position; mechanical, not optional.
  • Volatility → smaller risky share. Doubling $\sigma$ quarters the position. This is why a vol-targeting overlay (scale exposure down as forecast volatility rises) captures most of the value of the more elaborate dynamic policies.
  • Sharpe sets the level. $\alpha^* = \mathrm{SR} / (\gamma\,\sigma)$ with $\mathrm{SR} = (\mu - r_f)/\sigma$, so improving the Sharpe of a forecast is more valuable than improving its raw mean.
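The three lessons can be checked directly against the closed form. The numbers below (5% excess return, 20% volatility, risk aversion of 3) are illustrative assumptions only.

```python
def merton_share(mu_excess: float, sigma2: float, gamma: float) -> float:
    """Merton risky share: expected excess return / (risk aversion * variance)."""
    return mu_excess / (gamma * sigma2)

mu_excess, sigma, gamma = 0.05, 0.20, 3.0      # illustrative inputs

base = merton_share(mu_excess, sigma ** 2, gamma)
double_gamma = merton_share(mu_excess, sigma ** 2, 2 * gamma)
double_sigma = merton_share(mu_excess, (2 * sigma) ** 2, gamma)

# The same share written in terms of the Sharpe ratio.
sharpe = mu_excess / sigma
via_sharpe = sharpe / (gamma * sigma)

print(round(double_gamma / base, 6))   # 0.5
print(round(double_sigma / base, 6))   # 0.25
```

Because volatility enters squared, an error in the volatility estimate moves the position more than a proportional error in risk aversion or in the expected excess return.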

When returns are time-varying (say a small autoregressive drift component on the conditional mean $\mu_t$), the closed form generalises to a state-dependent rule that the recursion of Section 5-03 reproduces numerically.

What recursive utility changes for the policy

The Merton solution is a mean-driven policy: the optimal share moves with the expected excess return $\mu - r_f$ and ignores the shape of the conditional distribution beyond its first two moments. Recursive utility breaks that simplification. With Epstein–Zin preferences and $\gamma \neq 1/\psi$, the optimal policy:

  • Reacts to predictive variance, not just predictive mean. Tail risk in the predictive distribution shrinks the risky share even when the conditional mean is favourable.
  • Penalises persistence. A regime with high predictive volatility and high persistence reduces the position more than a high-vol regime that mean-reverts quickly.
  • Decouples patience from caution. The elasticity of intertemporal substitution $\psi$ sets the willingness to wait; the risk aversion $\gamma$ sets the willingness to take risk; the two move independently.

This is the structural difference that motivates the recursive-utility critic in Section 5-04: the value function the critic learns must encode the certainty equivalent of continuation value, not just its mean.

A practical checklist before any optimisation

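  1. Document the objective. Write down $U$ explicitly and state every preference parameter it uses (for example the risk aversion $\gamma$, the elasticity of intertemporal substitution $\psi$, and the discount factor).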
  2. Encode every operational constraint. Leverage, turnover, sector limits, ESG filters — in the optimiser, not in production patches.
  3. Validate assumptions with stakeholders. Can leverage exceed ? Are short positions allowed? Is the universe truly fixed over the rebalance period?
  4. Stress-test under multiple covariances. Crisis-2008, COVID-2020, calm-2017. The allocation should change in directions a risk officer can defend.
  5. Confirm sensitivity to $\mu$. A 10% perturbation of $\mu$ must not flip a long position to short. If it does, robustify (Section 02-03) before shipping.
  6. Decide static or dynamic. Path-dependent costs and time-varying forecasts justify dynamic; the rest of the chapter develops the tools for that.

The economic lens keeps machine learning grounded: every RL policy must ultimately map back to a clear objective and a constraint set we can audit. Section 5-03 formalises the recursion; Sections 5-04 and 5-05 build the function-approximation machinery we need when exact backward induction stops being feasible.

Dynamic Programming

Dynamic programming (DP) is the algorithmic backbone behind every method in the rest of this chapter. It solves sequential decision problems by working backwards from the future, exploiting Bellman's principle of optimality: once the current state summarises the past sufficiently, the optimal action from there does not depend on how we got here. Reinforcement learning is, in one phrase, "DP with function approximation when exact DP is too expensive." Understanding DP is therefore the prerequisite for understanding what RL is trying to approximate and where it should and should not be trusted.

The Bellman equation

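Let $s_t$ be the state, $a_t$ the action, $r(s_t, a_t)$ the per-period reward (e.g. log-return minus transaction cost), and $\gamma \in (0, 1)$ the discount factor (not the risk-aversion coefficient that Section 5-02 also writes as $\gamma$). The value function satisfies

$$V_t(s) = \max_{a \in \mathcal{A}(s)} \Big\{ r(s, a) + \gamma\,\mathbb{E}\big[V_{t+1}(s') \,\big|\, s, a\big] \Big\}.$$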

Three ingredients matter:

  • State $s$. The Markovian summary of the past, typically wealth, current weights, and observable predictors (Chapter 6's latent factors are popular here).
  • Action set $\mathcal{A}(s)$. State-dependent because feasible actions depend on cash, leverage caps, position limits.
  • Transition. The conditional law $p(s' \mid s, a)$ is the forecasting model from Chapter 4 plus the wealth-update rule.

Two flavours of the equation appear in the literature: the action-value form $Q(s, a) = r(s, a) + \gamma\,\mathbb{E}[\max_{a'} Q(s', a') \mid s, a]$ with $V(s) = \max_a Q(s, a)$, and the expected-utility form for recursive utility (Section 5-02) where the inner expectation is replaced by a certainty equivalent. Both are useful: the $Q$ form is the one RL algorithms actually parameterise; the recursive-utility form is what we use when the agent cares about the shape of the continuation distribution, not just its mean.

Backward induction

Finite-horizon problems admit a direct recursion:

  1. Initialise the terminal value $V_T(s)$.
  2. Step backward $t = T-1, \dots, 0$. At each $t$ and every state on the grid, evaluate the Bellman equation by enumerating actions and computing $\mathbb{E}[V_{t+1}(s') \mid s, a]$.
  3. Record the maximising action; this is the optimal policy $\pi_t^*(s)$.

import numpy as np

T = 3
wealth_grid = np.linspace(0.5, 1.5, 101)
actions = np.linspace(-0.5, 1.0, 51)           # fraction in risky asset
V = np.zeros((T + 1, len(wealth_grid)))
policy = np.zeros_like(V)                      # row T is unused (no decision at T)

V[T] = np.log(wealth_grid)                     # terminal value: log utility of final wealth
for t in range(T - 1, -1, -1):                 # step backward from T-1 to 0
    for i, w in enumerate(wealth_grid):
        next_w = w * (1 + actions * 0.02)      # deterministic 2% risky return, for demo only
        # Continuation value V[t+1] at next-period wealth. np.interp clamps to
        # the grid edges, so values near the boundaries are approximate.
        values = np.interp(next_w, wealth_grid, V[t + 1])
        best_idx = int(np.argmax(values))
        V[t, i] = values[best_idx]
        policy[t, i] = actions[best_idx]

For stochastic transitions, replace the inner deterministic step with a weighted sum over scenarios. Two standard approximations:

  • Discrete scenarios. Quantise the return $R_{t+1}$ into a small number of nodes with associated probabilities; this is the lattice approach used in classical option-pricing exercises.
  • Quadrature on a Gaussian / GHM grid. For continuous return distributions, Gauss–Hermite quadrature gives accurate expectations with five to ten nodes per dimension.
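A minimal sketch of the quadrature approach for a single Gaussian shock, using NumPy's `hermgauss` nodes and weights. The helper name and the example parameters (1% mean, 20% standard deviation, a 50% risky share) are assumptions made for illustration.

```python
import numpy as np

def gauss_hermite_expectation(f, mean, std, n_nodes=7):
    """Approximate E[f(X)] for X ~ N(mean, std^2) with Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    x = mean + np.sqrt(2.0) * std * nodes      # change of variables to N(mean, std^2)
    return float(weights @ f(x) / np.sqrt(np.pi))

# Example: expected log utility of next-period wealth for a 50% risky share,
# with a Gaussian net return (illustrative numbers, wealth normalised to 1).
share = 0.5
expected_log_wealth = gauss_hermite_expectation(
    lambda r: np.log(1.0 + share * r), mean=0.01, std=0.20
)

# Sanity checks against known closed forms.
second_moment = gauss_hermite_expectation(lambda x: x ** 2, 0.01, 0.20)
lognormal_mean = gauss_hermite_expectation(np.exp, 0.01, 0.20)
```

Inside backward induction, this function would replace the deterministic next-wealth step: `f` becomes the interpolated continuation value evaluated at next-period wealth. The nodes must stay inside the region where `f` is defined, which is one reason to bound the action grid.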

The result of backward induction is a complete policy, a function that maps state to action at every $t$, together with the value function $V_t(s)$ that RL methods later try to approximate.

Worked example: Merton with a horizon

Consider a finite-horizon log-utility investor with a single risky asset and i.i.d. excess returns $r_{t+1}$. Backward induction gives, for any horizon $T$, a constant-mix optimal policy: the optimal weight does not depend on $t$ or $W_t$, only on the moments of $r_{t+1}$. This is the discrete-time analogue of Merton's continuous-time result and a useful sanity check on any solver: if you change the horizon and the action trajectory shifts in your numerical solution, your solver has a bug.
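The constant-mix property can be used as a solver check. The sketch below assumes a two-scenario return distribution and relies on the fact that, with log utility and multiplicative wealth dynamics, the value function takes the form $V_t(W) = \log W + c_t$. All names and numbers are illustrative assumptions.

```python
import numpy as np

R_F = 1.0                                   # gross risk-free return (assumed)
RETURNS = np.array([1.3, 0.8])              # equally likely gross risky returns (assumed)
PROBS = np.array([0.5, 0.5])
SHARES = np.linspace(0.0, 1.0, 201)         # candidate risky shares

def stage_solution(wealth, continuation_const):
    """Best share and value when V_{t+1}(W) = log(W) + continuation_const."""
    growth = R_F + np.outer(SHARES, RETURNS - R_F)      # rows: shares, cols: scenarios
    values = (np.log(wealth * growth) + continuation_const) @ PROBS
    k = int(np.argmax(values))
    return SHARES[k], values[k]

def constant_mix_check(horizon, wealth_levels=(0.5, 1.0, 2.0)):
    """Run backward induction and collect every optimal share encountered."""
    shares, const = set(), 0.0              # c_T = 0 because V_T(W) = log(W)
    for _ in range(horizon):
        for w in wealth_levels:
            share, _ = stage_solution(w, const)
            shares.add(float(share))
        _, const = stage_solution(1.0, const)   # c_t = V_t(1), since log(1) = 0
    return shares

shares_found = constant_mix_check(horizon=5)
```

A single element in `shares_found` means the solver picked the same weight at every date and wealth level. For these scenarios the first-order condition gives a share of 5/6, so the grid answer should sit within one grid step of that value.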

When returns are predictable (say a small autoregressive component drives the conditional mean $\mu_t$), the optimal policy becomes time-varying and state-dependent. Backward induction gives the right answer; the RL methods later in this chapter are different ways to approximate it when the state space is too large to enumerate.

Policy iteration and value iteration

For infinite-horizon problems with stationary dynamics, the Bellman equation becomes a fixed-point equation in $V$. Two algorithms solve it:

  • Value iteration. Iterate $V_{k+1} = \mathcal{T} V_k$ where $(\mathcal{T} V)(s) = \max_a \{ r(s, a) + \gamma\,\mathbb{E}[V(s') \mid s, a] \}$; $\mathcal{T}$ is a contraction with modulus $\gamma$, so $V_k$ converges geometrically.
  • Policy iteration. Alternate between policy evaluation (solve a linear system for $V^\pi$) and policy improvement ($\pi'(s) = \arg\max_a \{ r(s, a) + \gamma\,\mathbb{E}[V^\pi(s') \mid s, a] \}$). Converges in finitely many steps in the tabular setting.

Both are written for tabular state spaces and are the conceptual targets that approximate methods aim to mimic. Value iteration is the closer ancestor of $Q$-learning (Section 5-05); policy iteration is the closer ancestor of actor–critic methods.
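A minimal tabular value-iteration sketch, assuming the transition probabilities are stored as an array `P[a, s, s']` and the rewards as `R[s, a]`. The array layout, function name, and the two-state toy problem are assumptions made for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Tabular value iteration.

    P[a, s, s'] is the transition probability, R[s, a] the reward.
    Returns the value function and a greedy policy.
    """
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + gamma * sum over s' of P[a, s, s'] * V[s']
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=1)

# Toy problem: two states, action 0 stays, action 1 switches state.
# Being in state 1 pays 1 per period; state 0 pays nothing.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])

V_star, greedy = value_iteration(P, R, gamma=0.9)
```

In the toy problem the optimal plan is to move to state 1 and stay there, so the values should approach 9 and 10 with a greedy policy of "switch" in state 0 and "stay" in state 1.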

The curse of dimensionality

DP's grid explodes with the number of state dimensions. With $n$ assets and a 100-point grid per dimension you have $100^n$ cells: already $10^{10}$ at $n = 5$, which is beyond exact tabulation. Three families of remedies, each leading into a later section:

  • Approximate DP. Replace the exact with a parametric form — linear in basis functions, neural network — and update by Bellman-residual minimisation. Section 5-04 builds this view out in detail.
  • Monte Carlo tree search. Sample state trajectories rather than enumerating; useful when the action space is small and the transition model is the expensive part. Less common in finance because the action space (continuous weights) is the hard side, not the transitions.
  • Reinforcement learning. Learn value or policy functions from samples, using gradient methods to navigate the parameter space rather than the state space. Sections 5-04 and 5-05 cover this.

Continuous-time DP and the HJB equation

In continuous time, the Bellman equation becomes the Hamilton–Jacobi–Bellman (HJB) PDE. With wealth dynamics $dW_t = W_t\,[(r_f + \alpha_t(\mu - r_f))\,dt + \alpha_t \sigma\,dB_t]$ and CRRA utility, the HJB has a closed-form solution recovering Merton's constant share $\alpha^* = (\mu - r_f)/(\gamma \sigma^2)$, where $\gamma$ here is the risk-aversion coefficient. The continuous-time view is also where the certainty-equivalent generalisation of Section 5-02 lives naturally (Duffie–Epstein–Zin in continuous time), and the closed forms serve as benchmarks for both the discrete-time DP solvers above and the RL methods below.

Practical guidance

  • Always solve a small DP first. Even a 2-asset, 3-period problem with Gaussian returns gives you a benchmark to validate every subsequent approximation against. Most RL bugs in finance trace back to a missing benchmark.
  • State-grid resolution drives accuracy. Wealth and weight grids should be denser near boundaries (cash limit, leverage cap) where the policy is least smooth.
  • Numerical-stability traps. Working in log-wealth and clipping wealth to a finite range avoids the underflow/overflow that bites every naive CRRA solver.
  • Plot the policy, not just the value. Bugs in the action grid show up as discontinuous policy surfaces long before they show up as biased value functions.

Dynamic programming therefore provides the theoretical backbone — and the working ground truth — for the policy-optimisation and reinforcement-learning machinery that the next two sections build.

Optimal Policy

An optimal policy $\pi^*$ maps every state to the action that maximises expected return (or, more honestly, expected utility) according to the Bellman equation from Section 5-03. Computing $\pi^*$ exactly is feasible only on small state spaces; the rest of this section is about how to iterate between evaluating a policy and improving it, and how to make that iteration work when the state space is too big for exact methods.

Policy evaluation

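Given a fixed policy $\pi$, the value function $V^\pi$ satisfies

$$V^\pi(s) = r(s, \pi(s)) + \gamma\,\mathbb{E}\big[V^\pi(s') \,\big|\, s, \pi(s)\big].$$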

Three options for solving it:

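  • Exact (small state space). Treat the equation as a linear system, $V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$ with $P^\pi$ and $r^\pi$ the transition matrix and reward vector under $\pi$, and solve directly. Works whenever the number of states $|\mathcal{S}|$ is in the thousands or fewer.
  • Iterative (medium state space). Iterate $V_{k+1} = r^\pi + \gamma P^\pi V_k$ to a fixed point. Each step is a contraction with modulus $\gamma$ so convergence is geometric.
  • Monte Carlo (large or unknown dynamics). Roll out trajectories under $\pi$ and average discounted returns.

import numpy as np

def evaluate(policy, env, gamma=0.99, episodes=100):
    """Monte Carlo estimate of the expected discounted return from reset states.

    Assumes a classic Gym-style env: reset() returns the state and
    step(a) returns (state, reward, done, info).
    """
    values = []
    for _ in range(episodes):
        s = env.reset()
        G, discount, done = 0.0, 1.0, False
        while not done:
            a = policy(s)
            s, r, done, _ = env.step(a)
            G += discount * r          # accumulate the discounted reward
            discount *= gamma
        values.append(G)
    return float(np.mean(values))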

The Monte Carlo estimator is unbiased but has high variance; TD(λ) sits between TD(0) (one-step bootstrap) and Monte Carlo (full rollout) and is a better default when episodes are long.
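When the transition matrix and reward vector under the policy are known, the exact and iterative routes can be compared directly. The sketch below assumes a two-state chain with made-up numbers; `P_pi`, `r_pi`, and the function names are illustrative.

```python
import numpy as np

def evaluate_exact(P_pi, r_pi, gamma):
    """Solve (I - gamma * P_pi) V = r_pi directly."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def evaluate_iterative(P_pi, r_pi, gamma, tol=1e-12, max_iter=100_000):
    """Repeatedly apply V <- r_pi + gamma * P_pi V until the update is tiny."""
    V = np.zeros(len(r_pi))
    for _ in range(max_iter):
        V_new = r_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Two-state chain under a fixed policy (illustrative numbers).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_pi = np.array([1.0, -0.5])

V_exact = evaluate_exact(P_pi, r_pi, gamma=0.95)
V_iter = evaluate_iterative(P_pi, r_pi, gamma=0.95)
```

The direct solve costs a cubic number of operations in the number of states, while each iterative sweep is quadratic, which is why the iterative route takes over as the state space grows.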

Policy improvement

Policy iteration alternates evaluation with greedy improvement:

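  1. Evaluate the current policy $\pi_k$ to get $V^{\pi_k}$.
  2. Improve greedily, $\pi_{k+1}(s) = \arg\max_a \{ r(s, a) + \gamma\,\mathbb{E}[V^{\pi_k}(s') \mid s, a] \}$.
  3. Repeat until $\pi_{k+1} = \pi_k$.

In a finite tabular MDP this terminates in finitely many steps. Value iteration is the special case where we collapse the two operators into a single update, $V_{k+1}(s) = \max_a \{ r(s, a) + \gamma\,\mathbb{E}[V_k(s') \mid s, a] \}$.

The policy improvement theorem is what makes the loop work: at every step, $V^{\pi_{k+1}}(s) \ge V^{\pi_k}(s)$ for all $s$, with strict improvement at some state unless $\pi_k$ is already optimal. Monotone improvement over the finite set of deterministic policies gives finite-step convergence.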
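The evaluate-improve loop can be sketched for a tabular MDP as follows, assuming transitions stored as `P[a, s, s']` and rewards as `R[s, a]`. The layout, the function name, and the two-state example are assumptions for illustration.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration with exact policy evaluation.

    P[a, s, s'] is the transition probability, R[s, a] the reward.
    Returns the value function, the final policy, and the number of sweeps.
    """
    n_states = R.shape[0]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    sweeps = 0
    while True:
        sweeps += 1
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi.
        P_pi = P[policy, states]                 # row s is P[policy[s], s, :]
        r_pi = R[states, policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy, sweeps
        policy = new_policy

# Toy problem: action 0 stays, action 1 switches; state 1 pays 1 per period.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])

V_pi, pi_star, n_sweeps = policy_iteration(P, R, gamma=0.9)
```

On this toy problem the loop stops after two sweeps. That is typical of policy iteration, which usually needs far fewer sweeps than value iteration but does more work per sweep.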

Constraints, regularisation, and the realised reward

Frictionless mean reward is rarely the right reward. In financial RL the realised reward is some combination of PnL, transaction cost, and risk penalty; tuning these is half the work of getting a stable policy.

import numpy as np

def reward(pnl, weights, prev_weights, leverage, target_vol, realised_vol):
    """Per-step reward: PnL net of trading cost and soft constraint penalties."""
    turnover = float(np.abs(weights - prev_weights).sum())
    cost     = 1e-3 * turnover                                    # proportional turnover cost
    breach   = 5.0 * max(0.0, leverage - 1.5)                     # penalty above 1.5x leverage
    vol_pen  = 0.5 * max(0.0, realised_vol - target_vol) ** 2     # penalty on ex-post vol overshoot
    return pnl - cost - breach - vol_pen

Three knobs come up over and over:

  • Turnover penalty. Without it, the policy reacts to every blip in the forecast and bleeds PnL to costs.
  • Hard constraints via large penalties. Add a term that is zero in the feasible set and large outside it; this is operationally simpler than projecting actions, and the gradient still points back into feasibility.
  • Entropy regularisation. Adding a scaled entropy term $\eta\,\mathcal{H}(\pi(\cdot \mid s))$ to the objective keeps the policy stochastic enough to explore and smooths the value landscape. Standard in PPO, soft actor–critic, and the recursive-utility variants in Section 5-05.

A subtle but important rule: tune the risk-aversion parameter in the reward, not the learning rate of the optimiser, when you want the policy to take less risk. They look interchangeable on a single backtest and are nothing alike out of sample.

Approximate policies

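When the state space is continuous or high-dimensional we cannot tabulate the policy $\pi$. Two parametric routes:

  • Value-based. Parameterise $Q_\theta(s, a)$ and act greedily, $\pi(s) = \arg\max_a Q_\theta(s, a)$. The DQN family (Section 5-05) is this view extended with replay buffers and target networks.
  • Policy-based. Parameterise $\pi_\theta(a \mid s)$ directly and update along the policy gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\; Q^{\pi_\theta}(s, a)\big].$$

In practice we use the advantage $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$ in place of $Q^{\pi}$ to reduce variance. The result is the actor–critic family (the policy is the actor, the critic estimates $V^{\pi}$), which is what most production-quality RL algorithms reduce to.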

For policy-gradient methods we need the policy to be differentiable in $\theta$ and to assign positive probability to every action that could matter. Common choices:

  • Gaussian policy: $a \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta^2(s))$, the default for continuous actions like portfolio weights.
  • Tanh-Gaussian with squashing $a = \tanh(u)$, $u \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta^2(s))$, for bounded actions; the log-probability correction matters and is easy to forget.
  • Dirichlet for actions on the simplex (long-only weights summing to 1) — the natural choice that saves you the projection step.
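The log-probability correction for a squashed Gaussian follows from the change of variables $a = \tanh(u)$: the density of $a$ is the Gaussian density of $u$ divided by $1 - a^2$. A minimal NumPy sketch for a scalar action, with an assumed function name and illustrative parameters:

```python
import numpy as np

def tanh_gaussian_log_prob(action, mean, std):
    """Log-density of a = tanh(u) with u ~ N(mean, std^2), for action in (-1, 1)."""
    u = np.arctanh(action)                                  # invert the squashing
    gaussian = (-0.5 * ((u - mean) / std) ** 2
                - np.log(std) - 0.5 * np.log(2.0 * np.pi))  # log N(u; mean, std^2)
    correction = np.log1p(-action ** 2)                     # log(1 - a^2) = log |da/du|
    return gaussian - correction

# Check that the corrected density integrates to one over (-1, 1).
grid = np.linspace(-0.999999, 0.999999, 200_001)
density = np.exp(tanh_gaussian_log_prob(grid, mean=0.3, std=0.5))
total_mass = float(np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(grid)))
```

Dropping the correction term leaves the Gaussian density of the pre-squash variable, which does not integrate to one over the action interval and biases the likelihood ratios used by policy-gradient updates.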

Trust regions and PPO

Vanilla policy gradient takes destructive steps when the policy moves too far from its previous version: the advantage estimates were computed under the old policy and become misleading. Trust-region and proximal methods constrain the per-update change in policy:

  • TRPO \citep{schulman2015trpo} solves a constrained optimisation step with a KL-divergence trust region.
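  • PPO \citep{schulman2017ppo} approximates the trust region with a clipped surrogate objective,

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,A_t,\ \mathrm{clip}(\rho_t(\theta),\,1 - \epsilon,\,1 + \epsilon)\,A_t\big)\Big],$$

with $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ the probability ratio and $A_t$ the advantage estimate. Simpler to implement, similar performance; the de facto default in financial RL benchmarks.

PPO with $\epsilon = 0.2$, a small entropy bonus, and Adam at a small learning rate is the recipe that survives in most papers. A careful advantage normalisation per minibatch (subtract mean, divide by std) is the single most important trick to keep training stable.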

A complete PPO update step concretely:

import torch
import torch.nn.functional as F
 
def ppo_update(
    policy, value, optimiser,
    obs, actions, old_log_probs, advantages, returns,
    *,
    clip=0.2, vf_coef=0.5, ent_coef=0.01, max_grad_norm=1.0,
):
    """One minibatch update with the standard PPO recipe.
 
    `advantages` and `returns` come from a GAE rollout buffer; both
    are detached from the graph by construction.
    """
 
    # Advantage normalisation — by far the highest-leverage stability trick.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
 
    dist = policy(obs)                           # current policy distribution
    log_probs = dist.log_prob(actions)
    entropy = dist.entropy().mean()
 
    ratio = torch.exp(log_probs - old_log_probs)  # rho_t in the equation above
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip, 1 + clip) * advantages
    actor_loss = -torch.min(surr1, surr2).mean()
 
    value_pred = value(obs).squeeze(-1)
    critic_loss = F.smooth_l1_loss(value_pred, returns)
 
    loss = actor_loss + vf_coef * critic_loss - ent_coef * entropy
 
    optimiser.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        list(policy.parameters()) + list(value.parameters()), max_grad_norm,
    )
    optimiser.step()
 
    # Diagnostics worth logging every step
    with torch.no_grad():
        approx_kl = (old_log_probs - log_probs).mean().item()
        clip_fraction = (
            (torch.abs(ratio - 1) > clip).float().mean().item()
        )
    return {
        "actor_loss": actor_loss.item(),
        "critic_loss": critic_loss.item(),
        "entropy": entropy.item(),
        "approx_kl": approx_kl,
        "clip_fraction": clip_fraction,
    }

Three diagnostics this update emits and why each matters:

  • approx_kl. The mean log-ratio between current and old policy. Tracks how much the policy actually moved; a hard upper bound (e.g. early-stop the epoch when approx_kl > 1.5 * target_kl) prevents blow-ups even when the clipped objective allows them.
  • clip_fraction. Fraction of samples where the clip kicked in. 10–30% is healthy; near 0 means the learning rate is too small; near 1 means too large.
  • entropy. Should decay smoothly during training. A flat-line early or a sudden collapse is a warning sign — either the policy is over-regularised by the entropy bonus or it has fallen into a near-deterministic local optimum.

A typical training loop calls ppo_update once per minibatch over several epochs of the same rollout buffer; a target_kl of about 0.02 is a common early-stop threshold.
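
The epoch structure with KL early stopping can be sketched without committing to a framework. In the sketch below, `run_ppo_epochs`, `update_fn`, `FakeUpdate`, and the `n_epochs=4` default are illustrative names and values that do not appear in the listing above; `update_fn` stands in for one call to `ppo_update` on a minibatch and is assumed to return a dict with an `approx_kl` entry.

```python
def run_ppo_epochs(update_fn, minibatches, n_epochs=4, target_kl=0.02):
    """Run up to `n_epochs` passes over one rollout buffer.

    Stops after the first epoch whose mean approx_kl exceeds
    1.5 * target_kl. Returns (epochs completed, per-epoch mean KL).
    """
    kl_history = []
    for _ in range(n_epochs):
        kls = [update_fn(mb)["approx_kl"] for mb in minibatches]
        mean_kl = sum(kls) / len(kls)
        kl_history.append(mean_kl)
        if mean_kl > 1.5 * target_kl:
            break  # the policy has drifted far enough from the rollout policy
    return len(kl_history), kl_history


class FakeUpdate:
    """Toy stand-in for ppo_update: KL grows by 0.004 on every call."""

    def __init__(self):
        self.calls = 0

    def __call__(self, minibatch):
        self.calls += 1
        return {"approx_kl": 0.004 * self.calls}


fake = FakeUpdate()
epochs_done, history = run_ppo_epochs(fake, minibatches=[0, 1, 2, 3])
```

Checking the mean KL once per epoch, rather than after every minibatch, is a design choice of this sketch; a per-minibatch check reacts faster but is noisier.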

Risk-aware critics

The critic does not have to be a mean estimator. Replacing the Bellman expectation with a certainty-equivalent (Epstein–Zin) or with a CVaR target turns it into a risk-aware value learner that drops into PPO / A2C without changing the actor update. Section 5-06 develops both extensions in detail; the brief preview is that they pay off when and the data has heavy left tails — the regime where mean-variance optima fall apart.

Diagnostics that actually correlate with deployment success

  • Policy entropy curve. Decaying smoothly is healthy; spiking up means the optimiser blew up; collapsing to zero too fast means premature exploitation. Watch this every run.
  • KL between successive policies. Should stay below the trust-region bound; sudden jumps explain bad iterations after the fact.
  • Advantage variance. Decreasing across training is the right sign; exploding means your normalisation is off.
  • Held-out CRPS / utility on synthetic data. Train on real data, evaluate on synthetic counterfactuals (Chapter 10). A policy that wins on real out-of-sample but fails on stress synthetic is a regime-fitter; don't ship it.

This section's machinery — policy iteration, advantage estimation, trust regions, recursive-utility critics — is what Section 5-05 specialises into the DQN family for discrete actions, and what the recursive-utility actor–critic work referenced in Chapter 1 builds on for portfolio policies.

Deep Q-Network

Deep Q-Networks (DQNs) extend tabular Q-learning to settings where the state is high-dimensional — limit-order books, multi-asset feature vectors, factor-augmented panels — by approximating the action-value function with a neural network. The original Atari paper \citep{mnih2015dqn} put the algorithmic core on the map, and a string of follow-up improvements (Double DQN, Dueling DQN, Prioritised Experience Replay, Rainbow) made it stable enough for finance applications. This section walks through the algorithm, the stability tricks, and the diagnostics that decide whether a DQN-style policy is worth shipping.

When DQN is the right tool

DQN shines when the action set is small and discrete: buy / hold / sell, target weight in a small grid, position size in N quantiles. For continuous action spaces (raw simplex weights), policy-gradient methods from Section 5-04 — PPO, SAC — usually fit better. The hybrid is to discretise the action space coarsely, run DQN, and then warm-start a continuous-action PPO from the discretised policy.

A common application pattern in finance:

  • Inventory / execution. Discrete actions (place limit at level $k$, market-cross, wait) over a continuous state (book imbalance, depth, spread). DQN-style methods do well because the action set is naturally discrete \citep{cartea2024rl-execution}.
  • Tactical asset allocation with discrete targets. Action = pick one of a small set of target exposure levels. Useful when stakeholders want positions to land at named risk levels.
  • Pair-trading / spread strategies. Action = open / close / flip / hold. Small action set, complex state from cointegration features.

Algorithm sketch

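  1. Initialise an online network $Q_\theta$ and a target network $Q_{\theta^-}$ with $\theta^- \leftarrow \theta$.
  2. Collect transitions $(s_t, a_t, r_t, s_{t+1})$ via an $\varepsilon$-greedy policy and store them in a replay buffer.
  3. Sample minibatches of transitions and minimise the TD loss
     $$\mathcal{L}(\theta) = \mathbb{E}\Big[\big(r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a') - Q_\theta(s_t, a_t)\big)^2\Big].$$
  4. Refresh the target network every $C$ steps (hard copy) or via Polyak averaging $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$.
  5. Anneal $\varepsilon$ from a large value (e.g. 1.0) toward a small floor (e.g. 0.05) over training.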
import torch
import torch.nn as nn
import torch.optim as optim
 
class QNet(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )
    def forward(self, x):
        return self.net(x)
 
q = QNet(state_dim=64, action_dim=11)
target_q = QNet(64, 11)
target_q.load_state_dict(q.state_dict())
optimizer = optim.Adam(q.parameters(), lr=1e-4)
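
The exploration schedule and the soft target update from the algorithm sketch can be illustrated in NumPy. This is a minimal sketch under stated assumptions: the function names `epsilon_at`, `epsilon_greedy`, and `polyak_update`, the linear shape of the anneal, and `tau=0.005` are illustrative choices, not values taken from the text.

```python
import numpy as np


def epsilon_at(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linear anneal from eps_start to eps_end over total_steps, then hold."""
    frac = min(1.0, step / total_steps)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng):
    """Uniform random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def polyak_update(target_params, online_params, tau=0.005):
    """Soft update: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


rng = np.random.default_rng(0)
greedy_action = epsilon_greedy(np.array([0.1, 0.7, 0.2]), epsilon=0.0, rng=rng)
new_target = polyak_update([np.zeros(2)], [np.ones(2)], tau=0.1)
```

In a PyTorch training loop the same soft update would be applied parameter by parameter to `target_q`, in place of the hard `load_state_dict` copy.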

Experience replay

The replay buffer breaks the correlation between consecutive transitions that would otherwise destabilise SGD on neural networks.

from collections import deque
import random
import numpy as np
 
buffer = deque(maxlen=200_000)
 
def remember(s, a, r, s2, done):
    buffer.append((s, a, r, s2, done))
 
def sample(batch_size):
    batch = random.sample(buffer, batch_size)
    return tuple(np.array(x) for x in zip(*batch))

Prioritised experience replay \citep{schaul2016per} weights transitions by their absolute TD error so the network spends compute on transitions it finds surprising. Empirically: faster early learning, sometimes a worse final policy unless you also correct the gradients with importance-sampling weights whose exponent $\beta$ is annealed to 1.

A working PER buffer. For brevity it keeps priorities in a flat array and samples in O(N) time rather than using a sum tree:

import numpy as np
 
class PrioritisedReplay:
    def __init__(self, capacity: int, alpha: float = 0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.buffer = [None] * capacity
        self.pos = 0
        self.size = 0
        self.max_priority = 1.0
 
    def add(self, transition):
        self.buffer[self.pos] = transition
        self.priorities[self.pos] = self.max_priority
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
 
    def sample(self, batch_size: int, beta: float = 0.4):
        prios = self.priorities[: self.size] ** self.alpha
        probs = prios / prios.sum()
        idx = np.random.choice(self.size, batch_size, p=probs)
        batch = [self.buffer[i] for i in idx]
        # Importance-sampling weights, normalised to max=1 for stability.
        weights = (self.size * probs[idx]) ** (-beta)
        weights /= weights.max()
        return batch, idx, weights.astype(np.float32)
 
    def update_priorities(self, idx, td_errors):
        new_prios = np.abs(td_errors) + 1e-6
        self.priorities[idx] = new_prios
        self.max_priority = max(self.max_priority, float(new_prios.max()))

The training step uses the sampled weights to scale the per-sample loss before averaging, then writes back the new TD errors as updated priorities:

def per_train_step(replay, q, target_q, optimiser, beta):
    batch, idx, w = replay.sample(BATCH, beta=beta)
    states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
    states      = torch.as_tensor(states, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    actions     = torch.as_tensor(actions, dtype=torch.long)
    rewards     = torch.as_tensor(rewards, dtype=torch.float32)
    dones       = torch.as_tensor(dones, dtype=torch.float32)
    weights     = torch.as_tensor(w)
 
    q_values = q(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_a = q(next_states).argmax(dim=1, keepdim=True)
        next_q = target_q(next_states).gather(1, next_a).squeeze(1)
        target = rewards + GAMMA * (1.0 - dones) * next_q
    td = target - q_values
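    # Importance-weighted Huber loss: per-sample loss, scaled, then averaged.
    huber = nn.functional.smooth_l1_loss(q_values, target, reduction="none")
    loss = (weights * huber).mean()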
 
    optimiser.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(q.parameters(), 10.0)
    optimiser.step()
 
    replay.update_priorities(idx, td.detach().cpu().numpy())
    return loss.item()

Three knobs for PER:

  • alpha controls how much priorities matter. $\alpha = 0$ is uniform replay; $\alpha = 1$ is fully proportional. $\alpha = 0.6$ is the default that balances sharpness and bias.
  • beta controls importance-sampling correction strength. Anneal from $\beta = 0.4$ at training start to $\beta = 1$ by the end; updates are only unbiased at $\beta = 1$, and the residual bias matters most near convergence.
  • Buffer size. For finance the buffer size is the most under-discussed hyperparameter. Daily data yields on the order of 250 transitions per instrument per year, which means a buffer of 200K transitions effectively memorises decades of history. If you're worried about regime shift, consider a smaller, sliding buffer that explicitly discards old data; the canonical "remember everything" recipe was designed for Atari, where the environment doesn't drift.
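
Two of these knobs reduce to simple arithmetic. The sketch below is illustrative only: `beta_at` and `years_held` are hypothetical helper names, the anneal is assumed to be linear, and the 30-instrument universe in the usage lines is an invented example.

```python
def beta_at(step, total_steps, beta_start=0.4, beta_end=1.0):
    """Linear anneal of the importance-sampling exponent, clamped to [start, end]."""
    frac = min(1.0, max(0.0, step / total_steps))
    return beta_start + frac * (beta_end - beta_start)


def years_held(capacity, steps_per_year=250, n_series=1):
    """How many years of history a full replay buffer spans."""
    return capacity / (steps_per_year * n_series)


single_asset_years = years_held(200_000)             # one daily series
universe_years = years_held(200_000, n_series=30)    # hypothetical 30 instruments
```

Passing `beta=beta_at(step, total_steps)` into `per_train_step` on every update implements the anneal; choosing the buffer capacity from a target value of `years_held` makes the regime-memory trade-off explicit.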

Dueling head, in detail

The Dueling DQN \citep{wang2016dueling} architecture decomposes $Q(s,a) = V(s) + \big(A(s,a) - \tfrac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\big)$. The mean-subtraction is crucial: without it, $V$ and $A$ are not identifiable (any constant can be shifted between them):

class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),    nn.ReLU(),
        )
        self.value     = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, action_dim)
 
    def forward(self, x):
        h = self.trunk(x)
        v = self.value(h)
        a = self.advantage(h)
        return v + (a - a.mean(dim=-1, keepdim=True))

Drop-in replacement for QNet. Helps most in states where many actions look equally good — the value stream takes responsibility for "how good is this state" while the advantage stream isolates "which action".

Training loop

import torch.nn.functional as F
 
BATCH = 512
GAMMA = 0.99
 
def train_step():
    states, actions, rewards, next_states, dones = sample(BATCH)
    states      = torch.as_tensor(states, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    actions     = torch.as_tensor(actions, dtype=torch.long)
    rewards     = torch.as_tensor(rewards, dtype=torch.float32)
    dones       = torch.as_tensor(dones, dtype=torch.float32)
 
    q_values = q(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Double DQN: choose the action with the online net, evaluate with the target net
        next_actions = q(next_states).argmax(dim=1, keepdim=True)
        next_q = target_q(next_states).gather(1, next_actions).squeeze(1)
        target = rewards + GAMMA * (1.0 - dones) * next_q
 
    loss = F.smooth_l1_loss(q_values, target)        # Huber loss
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(q.parameters(), 10.0)
    optimizer.step()

Stability tricks that matter in practice

  • Double DQN \citep{vanhasselt2016doubledqn}. Use the online net for the argmax but the target net for the value. Reduces the chronic overestimation bias of vanilla DQN; one-line change with consistent empirical wins.
  • Dueling architecture \citep{wang2016dueling}. Decompose $Q(s,a) = V(s) + A(s,a)$, with the mean advantage subtracted. The value stream learns "how good is this state?" and the advantage stream learns "which action is best here?"; helps in states where most actions have similar Q.
  • Huber / smooth-L1 loss. Caps the gradient on outlier targets, which is exactly the kind of thing financial data produces.
  • Reward clipping. Originally Atari-specific; in finance, prefer a vol-normalised reward (divide by realised vol) over hard clipping so that large legitimate moves do not get squashed.
  • Action masking. Disallow actions that breach hard constraints (position limits, cash shortfalls) by setting their Q values to $-\infty$ before the argmax. Simpler than penalty terms and never produces infeasible actions.
  • Gradient clipping. Norm clip at 10 — sometimes 5 — keeps a single bad batch from corrupting the network.
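
Action masking and vol-normalised rewards are both one-liners in practice. A minimal NumPy sketch follows; the helper names `masked_greedy_action` and `vol_normalised_reward` and the `floor` guard are assumptions made for illustration.

```python
import numpy as np


def masked_greedy_action(q_values, feasible):
    """Greedy action over feasible entries only; infeasible Q values become -inf."""
    feasible = np.asarray(feasible, dtype=bool)
    if not feasible.any():
        raise ValueError("no feasible action in this state")
    masked = np.where(feasible, q_values, -np.inf)
    return int(np.argmax(masked))


def vol_normalised_reward(pnl, realised_vol, floor=1e-8):
    """Scale PnL by realised volatility instead of clipping it."""
    return pnl / max(realised_vol, floor)


q_row = np.array([0.9, 0.4, 0.7])
mask = np.array([False, True, True])   # action 0 would breach a position limit
best = masked_greedy_action(q_row, mask)
```

The same mask should also be applied when computing the bootstrap target, otherwise the target can still reference an action the agent is never allowed to take.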

Distributional and risk-aware extensions

Two structural problems with vanilla DQN — overestimation in the mean target, blindness to tail risk — are fixed by distributional DQN (C51, QR-DQN) and by risk-aware Bellman targets (CVaR, Epstein–Zin certainty equivalent). Both are first-class topics in Section 5-06, so we defer the full treatment there; the takeaway for the DQN training loop above is that swapping the value head for a quantile head (and the mean target for a CVaR target) is a one-screen change that often pays off on stress regimes.

Evaluation

Simulate trading episodes with the learned policy and benchmark against rule-based strategies — equal-weight, vol-targeted, classical mean–variance. Numbers to track, in order of importance:

  • Sharpe ratio with seed-level confidence intervals. A single seed's Sharpe is meaningless; report mean ± std over at least 5 seeds, ideally 10.
  • Maximum drawdown and time-under-water. Tail-aware metrics are what get a strategy paused in production; they should not surprise you in evaluation.
  • Turnover and realised cost. Evaluate the policy with realistic cost assumptions, not the cost penalty used during training.
  • Constraint violations. Should be zero by construction (action masking) or asymptotically zero (penalty); if not, fix before any PnL claim.
  • Comparison to passive baselines. $1/N$ (equal-weight), buy-and-hold, fixed-mix. Beating these consistently across rolling windows is the bar. Beating a single test split is one seed's worth of evidence.

DQN policies can overfit simulator idiosyncrasies — execution model, slippage assumptions, training distribution — more than is obvious from in-sample reward curves. Two practical defences: (i) train on multiple data splits with different macro regimes and require the policy to win on each; (ii) evaluate on synthetic counterfactuals from Chapter 10 that the model has never seen. A policy that wins on real and synthetic out-of-sample is a candidate; a policy that wins on only one is a regime-fitter.
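
Seed-level reporting of the Sharpe ratio can be sketched as follows. The helper names, the 252-day annualisation, and the synthetic Gaussian returns are assumptions for illustration; in a real evaluation each array would be the out-of-sample daily return series of a policy trained with a different seed.

```python
import numpy as np


def annualised_sharpe(daily_returns, periods_per_year=252):
    """Annualised Sharpe ratio of a daily return series (zero risk-free rate)."""
    r = np.asarray(daily_returns, dtype=float)
    sd = r.std(ddof=1)
    if sd == 0.0:
        return 0.0
    return float(np.sqrt(periods_per_year) * r.mean() / sd)


def sharpe_across_seeds(returns_by_seed):
    """Mean and sample std of the per-seed Sharpe ratios, plus the raw values."""
    sharpes = np.array([annualised_sharpe(r) for r in returns_by_seed])
    return float(sharpes.mean()), float(sharpes.std(ddof=1)), sharpes


# Synthetic stand-in for five training seeds (not real results).
rng = np.random.default_rng(42)
fake_runs = [rng.normal(0.0005, 0.01, size=500) for _ in range(5)]
mean_sr, std_sr, per_seed = sharpe_across_seeds(fake_runs)
```

Reporting `mean_sr` together with `std_sr` makes it obvious when the spread across seeds is as large as the claimed edge.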

Where DQN sits in the broader picture

DQN is the value-based limb of the family that began with Q-learning and ends, on the policy-based side, with PPO and SAC. The choice between them is mostly determined by the action space: discrete and small → DQN-family; continuous, especially on a simplex → policy-gradient family. The infrastructure — replay buffers, target networks, advantage estimation, trust regions — is increasingly shared across the two camps in modern implementations. Section 5-04's actor–critic backbone and this section's DQN backbone are the two pillars on which the rest of the chapter's applications are built.

Risk-Aware Reinforcement Learning

The DQN and PPO algorithms of the previous sections optimise expected reward. Finance applications almost always want something subtler: high expected reward subject to a cap on tail risk, or expected utility under a concave utility, or a recursive Epstein–Zin objective that unbundles risk aversion from intertemporal substitution. This section covers the family of risk-aware RL methods — distributional RL, CVaR optimisation, recursive-utility critics — that turn the standard machinery into something you can defend in front of a risk officer.

Why mean is not enough

  • Symmetric losses are a fiction. Real utility weights a 1% loss more heavily than a 1% gain. Mean-only optimisation gets the answer wrong by exactly the asymmetry it ignores.
  • The thing that ends a career is in the tail. Drawdown, not average PnL, drives termination. Risk-aware objectives align the optimiser with the objective the firm actually has.
  • Stress samples are scarce and informative. A tail-aware learner uses every stress observation maximally; a mean learner under-weights the most informative samples in the dataset.

Distributional RL

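Standard RL learns a value function $Q(s,a) = \mathbb{E}[Z(s,a)]$, where $Z(s,a)$ is the discounted return random variable. Distributional RL \citep{bellemare2017c51} learns the full distribution of $Z(s,a)$ instead, then derives any risk measure from it.

The Bellman equation for distributions reads

$$Z(s,a) \overset{D}{=} R(s,a) + \gamma\, Z(S', A'), \qquad S' \sim P(\cdot \mid s,a),\quad A' \sim \pi(\cdot \mid S'),$$

where $\overset{D}{=}$ denotes equality in distribution.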

Two practical parameterisations:

  • C51 \citep{bellemare2017c51}. Discretise the support into $N$ atoms (51 in the original paper); the network outputs a probability over atoms. Trained by minimising the KL divergence between current and bootstrapped distributions; needs a projection step to keep the bootstrapped target on the same atom grid.
  • QR-DQN \citep{dabney2018qrdqn}. Output $N$ quantile estimates directly; train with quantile-Huber loss. No support discretisation, no projection, easier to scale to high $N$. The default for most recent finance applications.
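
The quantile-Huber loss that QR-DQN trains with fits in a few lines of NumPy. This is a sketch for a single state-action pair under stated assumptions: the function name is illustrative, quantile fractions are taken at the midpoints $(i + 0.5)/N$, and the Huber term is divided by `kappa`, a convention some implementations use and others omit.

```python
import numpy as np


def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile-Huber loss between N predicted quantiles and M target samples."""
    theta = np.asarray(pred_quantiles, dtype=float)
    z = np.asarray(target_samples, dtype=float)
    n = theta.shape[0]
    taus = (np.arange(n) + 0.5) / n                # midpoint quantile fractions
    u = z[None, :] - theta[:, None]                # (N, M) pairwise TD errors
    abs_u = np.abs(u)
    huber = np.where(abs_u <= kappa,
                     0.5 * u ** 2,
                     kappa * (abs_u - 0.5 * kappa))
    # Asymmetric weight: |tau - 1{u < 0}| penalises over- and under-shoots unequally.
    weight = np.abs(taus[:, None] - (u < 0.0).astype(float))
    # Average over target samples, sum over predicted quantiles.
    return float((weight * huber / kappa).mean(axis=1).sum())


loss_example = quantile_huber_loss([0.0, 0.0], [1.0])
```

In a full agent the target samples would be the bootstrapped quantiles `r + gamma * theta_target(s', a*)`, and the same expression would be written with differentiable tensor operations.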

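Once the distribution of $Z(s,a)$ is available the policy can act on any risk functional of it. Three useful ones:

  • CVaR at level $\alpha$: $\mathrm{CVaR}_\alpha(Z) = \mathbb{E}[Z \mid Z \le \mathrm{VaR}_\alpha(Z)]$, the mean of the worst $\alpha$ fraction of returns. Coherent and finance-standard.
  • Expected utility under concave $u$: $\mathbb{E}[u(Z)]$, computed by sampling from the distribution and averaging.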
  • Wang transform / proportional hazard: a parametric distortion that smoothly interpolates between expected value (no distortion) and worst-case (full distortion).

Acting greedily under a risk measure rather than the mean is a one-line change once the value distribution is available:

import numpy as np
 
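def cvar_action(quantile_values: np.ndarray, alpha: float = 0.05) -> int:
    """Pick the action with the highest CVaR_alpha given QR-DQN-style
    per-action quantile predictions of shape (A, N_quantiles)."""
    # The network does not guarantee monotone outputs, so sort each row
    # before reading off the lower tail.
    sorted_q = np.sort(quantile_values, axis=1)
    n_quantiles = sorted_q.shape[1]
    k = max(1, int(alpha * n_quantiles))                  # size of the lower tail
    cvar = sorted_q[:, :k].mean(axis=1)                   # tail mean per action
    return int(np.argmax(cvar))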

CVaR-optimal policies

Two routes to a policy that explicitly optimises CVaR:

  • CVaR DQN. Use a distributional DQN, derive CVaR from $Z$, act greedily under CVaR. Simple, strong baseline; doesn't optimise CVaR exactly because the policy improvement step uses a different criterion than the value learning step.
  • Risk-sensitive policy gradient. Replace the mean advantage in a PPO update with a CVaR-weighted advantage; the gradient pushes the policy toward states with high CVaR, not just high mean. The CVaR policy gradient of Tamar et al. \citep{tamar2015cvarpg} is the foundational work; recent variants use distributional critics to estimate CVaR more efficiently.

The empirical pattern: CVaR-trained policies under-perform mean-trained policies on average return by a small amount and outperform them on stress-regime metrics by a larger amount. The right risk trade-off is set by $\alpha$; typical choices are 5% or 10%.

Recursive-utility critics

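Sections 02-03 and 05-02 introduced Epstein–Zin recursive utility, which unbundles risk aversion ($\gamma$) from intertemporal substitution ($\psi$). The RL counterpart replaces the standard Bellman backup with a certainty-equivalent continuation:

$$V_t = \Big[(1-\delta)\, c_t^{\,1-1/\psi} + \delta\, \mathrm{CE}_t^{\,1-1/\psi}\Big]^{\frac{1}{1-1/\psi}}, \qquad \mathrm{CE}_t = \Big(\mathbb{E}_t\big[V_{t+1}^{\,1-\gamma}\big]\Big)^{\frac{1}{1-\gamma}},$$

where $\mathrm{CE}_t$ is the power-mean certainty equivalent, $c_t$ is a per-period "consumption" or accounting flow, and $\delta$ is the time-preference discount factor. A practical Monte Carlo estimator,

$$\widehat{\mathrm{CE}}_t = \Big(\frac{1}{M}\sum_{m=1}^{M} \big(V_{t+1}^{(m)}\big)^{1-\gamma}\Big)^{\frac{1}{1-\gamma}},$$

drops into a critic update with no change to the actor. This is the mechanism behind the recursive-utility portfolio policy referenced in Chapter 1; the empirical claim is that on Korean ETF panels, the recursive critic produces a higher Sharpe ratio than a naive discounted critic at the cost of training-time compute.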

Two implementation notes that matter:

  • Numerical stability. Working in log-wealth and clipping the exponent avoids the overflow / underflow that bites every naive CRRA solver.
  • Sample size $M$. The CE estimator's variance scales with $1/M$; with heavy left tails, stable training needs a substantially larger $M$ than light-tailed data would suggest.
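
The numerical-stability point can be made concrete with a log-space estimator of the power-mean certainty equivalent, CE = (mean of V^(1-gamma))^(1/(1-gamma)). The sketch below is an illustration, not the chapter's implementation: the function name is hypothetical, inputs are assumed to be log continuation values, and it uses a max-shift (log-sum-exp) trick in place of explicit exponent clipping.

```python
import numpy as np


def log_certainty_equivalent(log_values, gamma):
    """log of the power-mean CE of exp(log_values) under risk aversion gamma."""
    x = np.asarray(log_values, dtype=float)
    if abs(gamma - 1.0) < 1e-12:
        return float(x.mean())            # log-utility limit: geometric mean
    a = (1.0 - gamma) * x                 # log of V ** (1 - gamma)
    a_max = a.max()
    # Shift by the max before exponentiating so nothing overflows.
    log_mean = a_max + np.log(np.mean(np.exp(a - a_max)))
    return float(log_mean / (1.0 - gamma))


samples = np.log([1.0, 3.0])
ce_risk_neutral = log_certainty_equivalent(samples, gamma=0.0)   # log of the mean
ce_risk_averse = log_certainty_equivalent(samples, gamma=5.0)
ce_extreme = log_certainty_equivalent([-1000.0, 1000.0], gamma=10.0)
```

With `gamma=0` the estimator reduces to the log of the arithmetic mean, and raising `gamma` pulls the certainty equivalent toward the worst sample, which is the behaviour a risk-aware critic is meant to encode.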

Risk-constrained policy optimisation

Sometimes the right framing is "maximise mean reward subject to a risk constraint" rather than swapping the objective entirely. The Lagrangian formulation

$$\max_\pi \min_{\lambda \ge 0}\; \mathbb{E}_\pi[R] - \lambda\,\big(\rho(\pi) - \bar\rho\big),$$

where $R$ is the discounted return, $\rho$ is some risk measure (CVaR, drawdown, breach rate) and $\bar\rho$ is its cap, maps to a multi-objective RL problem. Two standard approaches:

  • Lagrangian multiplier. Train with reward $r_t - \lambda \rho_t$ and adapt $\lambda$ online to keep the constraint tight. Used by safe-RL libraries; works well when the constraint is differentiable through the policy.
  • Reward shaping with hard penalty. Add a large negative reward for constraint violations. Simpler but the optimisation landscape is harsher and the constraint is satisfied "in spirit" rather than exactly.
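
The multiplier adaptation is a projected dual-ascent step. A minimal sketch, assuming the risk estimate is oriented so that larger means worse (for example CVaR of losses); the names `update_multiplier` and `shaped_reward`, the step size, and the `lam_max` ceiling are illustrative assumptions.

```python
def update_multiplier(lam, risk_estimate, risk_cap, lr=0.05, lam_max=100.0):
    """Raise lambda when the constraint is violated, lower it otherwise."""
    lam = lam + lr * (risk_estimate - risk_cap)
    return min(max(lam, 0.0), lam_max)     # keep lambda in [0, lam_max]


def shaped_reward(reward, risk_cost, lam):
    """Per-step reward seen by the policy under the current multiplier."""
    return reward - lam * risk_cost


lam = 0.0
for measured_risk in [0.30, 0.28, 0.25, 0.18]:   # hypothetical per-iteration estimates
    lam = update_multiplier(lam, measured_risk, risk_cap=0.20)
```

The multiplier is typically updated on a slower timescale than the policy, for example once per rollout, so the policy sees a nearly stationary reward within each update.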

For finance the most operationally useful constraint is drawdown cap: a soft penalty proportional to peak-to-trough loss, plus a hard episode termination if drawdown exceeds a threshold.
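
A drawdown cap of this kind needs only a running peak. The sketch below is one possible shape for it; the class name `DrawdownMonitor`, the penalty weight, and the 20% termination threshold are assumptions chosen for illustration.

```python
class DrawdownMonitor:
    """Soft penalty proportional to drawdown, plus a hard termination flag."""

    def __init__(self, penalty=1.0, max_drawdown=0.20):
        self.penalty = penalty
        self.max_drawdown = max_drawdown
        self.peak = None

    def reset(self, wealth):
        """Start a new episode at the given wealth level."""
        self.peak = wealth

    def step(self, wealth):
        """Return (soft_penalty, terminate) for the current wealth level."""
        self.peak = max(self.peak, wealth)
        drawdown = 1.0 - wealth / self.peak      # fraction below the running peak
        return self.penalty * drawdown, drawdown >= self.max_drawdown


monitor = DrawdownMonitor(penalty=2.0, max_drawdown=0.20)
monitor.reset(100.0)
p1, stop1 = monitor.step(110.0)   # new peak, no drawdown
p2, stop2 = monitor.step(99.0)    # 10% below the peak
p3, stop3 = monitor.step(85.0)    # about 22.7% below the peak
```

Inside an environment's `step`, the penalty would be subtracted from the reward and the flag would be returned as the termination signal.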

Calibration of the risk metric

A subtle but consequential point: the training risk measure should match the evaluation risk measure. A policy trained on CVaR should be evaluated on CVaR, not on Sharpe and not on hit rate. Mismatched metrics produce policies that look worse than they are under the deployment criterion. Track training and evaluation metrics side by side and watch for drift; if the policy starts winning on the training metric and losing on the evaluation metric, it is overfitting the training objective.

Diagnostics specific to risk-aware training

  • Entropy of the value distribution. A distributional critic with collapsed entropy is reporting near-deterministic estimates and is probably wrong about the tails.
  • Coverage of the predicted tail. On held-out trajectories, the fraction of returns at or below the predicted $\mathrm{VaR}_\alpha$ should approximately equal $\alpha$, and the mean of those breaching returns should match the predicted $\mathrm{CVaR}_\alpha$. Drift in either is the cleanest signal that the value distribution is mis-calibrated.
  • Per-regime evaluation. Average risk metrics conceal regime-specific failures. Always report CVaR per quartile of realised volatility.
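
A tail-calibration check on held-out returns takes two numbers: the breach rate at the predicted VaR (which should be close to alpha) and the gap between the realised tail mean and the predicted CVaR (which should be close to zero). The function name and the evenly spaced toy returns below are assumptions for illustration only.

```python
import numpy as np


def tail_calibration(realised, predicted_var, predicted_cvar):
    """Return (breach rate at the predicted VaR, realised tail mean minus predicted CVaR)."""
    r = np.asarray(realised, dtype=float)
    breaches = r <= predicted_var
    breach_rate = float(breaches.mean())
    if not breaches.any():
        return breach_rate, float("nan")   # no tail observations to compare
    return breach_rate, float(r[breaches].mean() - predicted_cvar)


# Toy held-out returns: 100 evenly spaced values from -0.50 to 0.49.
held_out = np.arange(-50, 50) / 100.0
rate, gap = tail_calibration(held_out, predicted_var=-0.46, predicted_cvar=-0.48)
```

Running the same check separately within each volatility quartile gives the per-regime view described above.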

Where this fits

Risk-aware RL is the operational version of the recursive-utility material in Section 05-02. The ideas are the same; the difference is that here we are training a policy from samples rather than solving a recursion in closed form. Sections 05-07 and 05-08 take the resulting policies into execution and multi-asset portfolio settings; Chapter 10 uses synthetic stress data to validate that risk-aware policies actually behave as designed when the regime moves.

Execution and Market Making

The earlier sections of this chapter treat the allocation problem: choose target weights given a forecast and a utility. Execution is the orthogonal problem: given a target trade, decide how to send it to the market. The choice has a real PnL impact (slippage that costs 5 bps of portfolio value per trading day compounds to roughly 12% per year) and it has a clean RL formulation, which is why this section earns its place alongside the portfolio policies.

The closely related market-making problem inverts the question: post quotes that earn the bid-ask spread while bounding inventory risk. Both share the same structure (a small order-book state, a low-dimensional action, a strict latency budget), and the techniques developed for execution carry directly into market-making with a sign change on the inventory term.

The execution problem

Consider a meta-order to buy $X$ shares over a horizon of $T$ minutes (or seconds, or ticks). At each instant the agent chooses how much to slice into the market and how aggressively (limit vs. market) to post the slice. Three costs to balance:

  • Permanent impact. The expected price drift caused by the meta-order; once paid, it stays.
  • Temporary impact. The transient price elevation around each child order, which decays over a known timescale.
  • Risk of waiting. Price drifts adversely while you delay; longer execution windows incur more diffusion risk.

The classical Almgren–Chriss \citep{almgren2001optimal} solution trades these off in a closed-form quadratic problem with linear permanent impact and linear temporary impact under Gaussian price diffusion. The optimal trajectory is a hyperbolic-sine curve that looks exponential when risk aversion is high: front-loaded when volatility risk dominates, flattening toward a uniform (TWAP-like) schedule when impact dominates.
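
The shape of the Almgren–Chriss schedule is easy to see numerically. The sketch below uses the continuous-time form of the solution, remaining quantity $x(t) = X \sinh(\kappa (T - t)) / \sinh(\kappa T)$ with $\kappa = \sqrt{\lambda \sigma^2 / \eta}$, as a simplifying assumption; the function names, and the convention of passing the dimensionless product $\kappa T$, are illustrative.

```python
import numpy as np


def almgren_chriss_holdings(total_shares, n_steps, kappa_T):
    """Remaining quantity at each of n_steps + 1 grid points over the horizon."""
    t = np.linspace(0.0, 1.0, n_steps + 1)        # time as a fraction of the horizon
    if kappa_T < 1e-8:
        return total_shares * (1.0 - t)           # no risk aversion: uniform schedule
    return total_shares * np.sinh(kappa_T * (1.0 - t)) / np.sinh(kappa_T)


def urgency(risk_aversion, sigma, temp_impact, horizon=1.0):
    """kappa * T with kappa = sqrt(risk_aversion * sigma^2 / temp_impact)."""
    return horizon * np.sqrt(risk_aversion * sigma ** 2 / temp_impact)


x_urgent = almgren_chriss_holdings(10_000, 10, kappa_T=5.0)   # risk dominates
x_twap = almgren_chriss_holdings(10_000, 10, kappa_T=0.0)     # impact dominates
child_orders = -np.diff(x_urgent)                              # size traded per step
```

Comparing `x_urgent` with `x_twap` at the midpoint shows the front-loading: the urgent schedule has far less than half the order left, while the uniform one has exactly half.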

Where Almgren–Chriss runs out

Three places the closed form is wrong, and where RL or other data-driven methods are competitive:

  • Nonlinear impact. Real impact is closer to square-root in size \citep{tothlillo2011}: $I(Q) \propto \sqrt{Q/V}$ for traded volume $Q$ and daily volume $V$. Almgren–Chriss approximates this with a term that is linear in $Q$.
  • Order-book state. The closed form ignores book depth, queue position, and microstructure. Empirically, conditioning on the book imbalance and the instantaneous spread changes the optimal child-order sizing substantially.
  • Adversarial conditions. When the meta-order signals information to the market, the impact term is no longer a fixed function — the market reacts to the order's history.

RL formulation

State, action, and reward for a typical execution agent:

  • State. Time remaining, inventory remaining, recent volatility, book imbalance, queue depth at top-of-book, recent trade flow.
  • Action. Discrete actions over (size, aggression):
    • market vs. limit at level $k$ above / below mid;
    • a fraction of remaining inventory.
  • Reward per step. Negative slippage on the filled portion plus penalty terms for market-impact contribution and unfilled inventory at termination.

A skeletal Gymnasium environment for execution shows the loop concretely:

import numpy as np
import gymnasium as gym
 
class ExecutionEnv(gym.Env):
    def __init__(self, prices, mid, target_qty=10_000, horizon=60):
        self.prices, self.mid = np.asarray(prices), np.asarray(mid)
        self.target_qty, self.horizon = target_qty, horizon
        # action = (level in {0,1,2}, fraction in {0.0, 0.05, 0.1, 0.25, 0.5})
        self.action_space = gym.spaces.MultiDiscrete([3, 5])
        self.observation_space = gym.spaces.Box(
            low=-1e6, high=1e6, shape=(5,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.inv = 0, self.target_qty
        return self._obs(), {}

    def _obs(self):
        # Clamp the index so the terminal observation never reads past the data.
        i = min(self.t, len(self.mid) - 1)
        time_left = (self.horizon - self.t) / self.horizon
        vol = float(np.std(np.diff(np.log(self.mid[max(0, i - 20):i + 1])))) \
              if i > 1 else 0.0
        spread = float(self.prices[i] - self.mid[i])
        imbalance = 0.0   # placeholder; real env reads book imbalance
        # Cast to float32 so observations match the declared observation space.
        return np.array(
            [time_left, self.inv / self.target_qty, vol, imbalance, spread],
            dtype=np.float32)
 
    def step(self, action):
        level, frac_idx = int(action[0]), int(action[1])
        frac = [0.0, 0.05, 0.1, 0.25, 0.5][frac_idx]
        size = int(self.inv * frac)
        fill_price = self.prices[self.t] + level * 0.0001 * self.mid[self.t]
        slippage = (fill_price - self.mid[self.t]) * size
        self.inv -= size
        self.t   += 1
        done = self.t >= self.horizon or self.inv <= 0
        reward = -slippage
        return self._obs(), reward, done, False, {"inv": self.inv}

This minimal env is enough to sanity-check a DQN agent before plugging in a high-fidelity order-book simulator.

The action space is small and discrete, which makes DQN-family methods (Section 05-05) the right tool. Distributional DQN is a particularly good fit: the value distribution of slippage matters more than its mean, because the policy needs to avoid the worst-case fills, not just minimise the average. Recent benchmarks \citep{cartea2024rl-execution} report that distributional DQN outperforms Almgren–Chriss on out-of-sample slippage across a range of conditions, with the gap largest under high-volatility, high-imbalance regimes.

Imitation warm-start

Training execution RL from scratch is sample-inefficient and expensive (real-market backtesting requires high-fidelity simulation of the order book). A practical pattern:

  1. Imitate Almgren–Chriss. Roll out the closed-form solution as the expert; train the policy with behaviour cloning on the resulting trajectories.
  2. Fine-tune with RL. Switch to a value-based or policy-gradient loss; the warm-started policy is already in the neighbourhood of a strong solution.

The expert does not have to be Almgren–Chriss; any sensible heuristic (VWAP slicer, time-weighted average price (TWAP) tracker) works as the imitation target.

Order-book simulation

A reliable simulation environment is what makes execution RL useful. Two layers of fidelity:

  • Replay-with-impact. Replay a historical limit-order book and simulate the agent's child orders by adding them to the queue at the appropriate level, using a market-impact model to perturb subsequent prices. Cheap, and defensible if the impact model is calibrated; treat it as the minimum acceptable level of fidelity.
  • Agent-based simulation. Multiple synthetic agents (informed trader, market maker, noise trader) generate the order flow; the agent under training trades within this synthetic ecosystem. More realistic for adversarial conditions; calibration is harder.

For most production use cases, replay-with-impact on a year of book data is sufficient. The synthetic-data techniques in Chapter 10 can generate stress scenarios (a vol shock, a liquidity vacuum) that extend the replay to regimes the historical data does not contain.

Market making

Inverse of execution. The agent posts buy and sell quotes; rewards are spread captured minus inventory penalty.

  • State. Inventory, time-of-day, recent trade flow, book pressure, midprice volatility.
  • Action. Per-tick: bid offset $\delta^b$, ask offset $\delta^a$, size at each level, optional inventory-skew bias.
  • Reward per step. Spread captured on filled quotes minus an inventory penalty, such as $\phi\, q_t^2$ for inventory $q_t$, to penalise inventory drift.

The optimal market-making problem also has a closed-form benchmark: Avellaneda–Stoikov \citep{avellaneda2008mm} solves the HJB equation for a CARA-utility maker with Brownian midprice and Poisson fills. Like Almgren–Chriss, it gets the qualitative structure right (skew quotes against inventory, widen during high volatility) and gets the magnitude wrong because real microstructure is not Brownian-Poisson.

The same imitation + RL recipe applies: warm-start from Avellaneda–Stoikov; fine-tune with a distributional critic that targets risk-adjusted PnL rather than mean PnL.
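
The Avellaneda–Stoikov benchmark used for the warm start reduces to two formulas: a reservation price $r = s - q\,\gamma\,\sigma^2 (T - t)$ and a total spread $\gamma \sigma^2 (T - t) + (2/\gamma) \ln(1 + \gamma / k)$ placed symmetrically around it. The sketch below implements those two expressions; the function name and the parameter values in the usage lines are illustrative assumptions, with `k` denoting the decay rate of fill intensity with distance from mid.

```python
import math


def avellaneda_stoikov_quotes(mid, inventory, gamma, sigma, time_left, k):
    """Return (bid, ask) from the Avellaneda-Stoikov closed form."""
    # Reservation price: skewed against the current inventory.
    reservation = mid - inventory * gamma * sigma ** 2 * time_left
    # Total spread: a risk term plus a fill-intensity term.
    spread = (gamma * sigma ** 2 * time_left
              + (2.0 / gamma) * math.log(1.0 + gamma / k))
    return reservation - spread / 2.0, reservation + spread / 2.0


bid_flat, ask_flat = avellaneda_stoikov_quotes(
    100.0, inventory=0, gamma=0.1, sigma=2.0, time_left=0.5, k=1.5)
bid_long, ask_long = avellaneda_stoikov_quotes(
    100.0, inventory=5, gamma=0.1, sigma=2.0, time_left=0.5, k=1.5)
```

With a long inventory both quotes shift down, which makes the ask more likely to fill and the bid less likely: exactly the "skew quotes against inventory" behaviour an imitation target should pass on to the RL agent.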

Production considerations

Three things separate research-quality execution / market-making agents from production-quality ones.

  • Latency. A useful execution agent must decide and submit in under a few milliseconds end-to-end; a market-making agent in microseconds. Network distillation to a tiny student model and hand-written inference paths are standard.
  • Risk circuit breakers. Hard inventory caps, daily-loss caps, and a kill-switch that the agent cannot override. The RL policy proposes; the rule-based safety layer disposes.
  • Online evaluation. Keep a static benchmark policy live next to the RL agent and route a controlled fraction of meta-orders to each. Drift in the relative performance of the two is the cleanest signal that the RL policy needs retraining.
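
The "policy proposes, safety layer disposes" pattern can be sketched as a thin wrapper between the agent and the order gateway. Everything named here (`SafetyLayer`, `approve`, the specific caps) is hypothetical; a production system would add audit logging and a manual reset path.

```python
from dataclasses import dataclass


@dataclass
class SafetyLayer:
    max_inventory: float
    max_daily_loss: float
    killed: bool = False

    def approve(self, proposed_trade, inventory, daily_pnl):
        """Return the signed trade size that is allowed to reach the market."""
        if self.killed or daily_pnl <= -self.max_daily_loss:
            self.killed = True        # latches: the agent cannot clear this flag
            return 0.0
        # Clip so that post-trade inventory stays inside the hard cap.
        lowest = -self.max_inventory - inventory
        highest = self.max_inventory - inventory
        return min(max(proposed_trade, lowest), highest)


layer = SafetyLayer(max_inventory=100, max_daily_loss=1_000)
clipped_buy = layer.approve(50, inventory=80, daily_pnl=0.0)
clipped_sell = layer.approve(-500, inventory=80, daily_pnl=0.0)
after_loss = layer.approve(10, inventory=0, daily_pnl=-1_500.0)
still_blocked = layer.approve(10, inventory=0, daily_pnl=0.0)
```

Keeping the layer stateless apart from the kill flag makes it easy to test exhaustively, which matters more here than in the policy itself.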

Where this fits

Execution and market-making are the micro end of the decision spectrum: tiny actions, fast clocks, tight risk caps. The methods — distributional RL, imitation warm-starting, order-book simulation — extend the chapter's machinery rather than replace it. Section 05-08 goes the opposite direction and treats macro portfolio decisions where the action space is large and the clock is daily; the same distributional and risk-aware ideas show up there too, scaled to a different domain.

Multi-Asset Portfolio RL Benchmark

The earlier sections develop the algorithmic core (DQN, PPO, recursive-utility critic, distributional methods). This section is about putting them to work on the target problem of the chapter: choosing weights across a multi-asset portfolio under realistic constraints, realistic frictions, and realistic noise. We organise it as a benchmark (environments, baselines, evaluation protocol) because that is the form in which RL portfolio results are reported in the literature \citep{liu2020finrl, dynamicfinrl2025} and the form that survives the move from research to production.

What "benchmark" means here

A benchmark is more than a number on a leaderboard. It is the package of:

  • Environment. State space, action space, transition dynamics, reward function, and termination criteria.
  • Baselines. The policies the new method has to beat to be interesting — equal-weight, vol-targeted, classical mean–variance, rules-based.
  • Evaluation protocol. Splits, seeds, metrics, and the rules for reporting.
  • Failure modes. The conditions under which any policy in the family is expected to break.

The literature has multiple frameworks (FinRL, FinRL-Meta, RecursiveRL); the design choices below are the intersection that travels across most of them.

Environment design

Five recurring decisions matter more than the choice of RL algorithm.

  • State representation. At minimum: log-wealth, current weights, recent returns, recent volatility per asset, regime indicators. For factor-aware policies add factor exposures. For execution-aware policies add inventory and recent slippage. More state is not always better; redundant state slows training and amplifies overfitting.

  • Action space. Three parameterisations recur:

    • Target weights on the simplex (long-only) or in a box (long-short). Use a Dirichlet head for the simplex or a tanh head for the box.
    • Trade increments, projected onto the feasible set after the policy emits them. A zero increment means "do nothing", so inertia is easier to learn than when the policy predicts absolute weights from scratch.
    • Discrete weight grid. Five or seven exposure levels per asset. DQN-family methods apply directly, but the joint action space grows exponentially with the number of assets (five levels across ten assets is already 5^10, about 9.8 million actions), so this is only viable when allocating across a handful of asset classes.
  • Reward shaping. PnL alone produces unstable policies. The recipe that travels is the portfolio log-return, less a transaction cost proportional to turnover, less penalties for breaching the volatility cap and the leverage cap. The hard breach indicators stabilise training; a pure soft penalty often produces a policy that hugs the constraint boundary and slips through under noise. Concretely:

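import numpy as np

def step_reward(weights_new, weights_old, returns, *, realised_vol=None,
                cost_bps=5.0, target_vol=0.10, lev_cap=1.5):
    """One-step shaped reward: return net of cost, vol penalty, leverage penalty."""
    weights_new = np.asarray(weights_new, dtype=float)
    weights_old = np.asarray(weights_old, dtype=float)
    port_ret = float(np.dot(weights_new, returns))             # one-period portfolio return
    turnover = float(np.abs(weights_new - weights_old).sum())  # total absolute weight change
    cost     = (cost_bps / 1e4) * turnover                     # proportional cost, in return units
    lev      = float(np.abs(weights_new).sum())                # gross leverage
    # Assumption: realised_vol is an annualised portfolio-volatility estimate
    # supplied by the caller. If omitted, fall back to the cross-sectional std
    # of this step's asset returns, which is only a crude proxy.
    vol      = float(np.std(returns)) if realised_vol is None else float(realised_vol)
    vol_pen  = max(0.0, vol - target_vol) ** 2
    breach   = 5.0 * max(0.0, lev - lev_cap)                   # grows with the excess leverage
    return port_ret - cost - 0.5 * vol_pen - breach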
  • Episode structure. Episodes should span enough time for the dynamics to evolve (many months) but be short enough that resetting recovers from a ruinous trajectory. Two patterns:

    • Random start. Each episode begins at a random date in the training window; the agent sees the panel from there. Builds robustness to regime-start variability.
    • Walk-forward. Sequential windows; the agent sees regimes in chronological order. Simpler to debug and matches deployment.
  • Reward normalisation. Vol-normalise reward by an EWMA of the realised reward variance. Prevents the optimiser from over-weighting calm periods and ignoring stress.
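Two of the choices above, trade increments projected onto the feasible set and EWMA reward normalisation, can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `project_to_simplex`, `apply_trade`, and `EwmaRewardNormaliser` are hypothetical, the feasible set is assumed to be the long-only simplex, and the "realised reward variance" is approximated by an EWMA of squared rewards with an assumed half-life.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u) - 1.0                   # cumulative sums minus the budget of 1
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]   # last index that stays positive
    theta = css[rho] / (rho + 1)               # common shift applied to all weights
    return np.maximum(v - theta, 0.0)

def apply_trade(weights_old, increment):
    """Trade-increment action: add the raw increment, then project to long-only weights."""
    raw = np.asarray(weights_old, dtype=float) + np.asarray(increment, dtype=float)
    return project_to_simplex(raw)

class EwmaRewardNormaliser:
    """Divide each reward by the square root of an EWMA of squared rewards."""

    def __init__(self, halflife=60, eps=1e-8):
        # Assumption: a 60-step half-life; tune to the reward frequency.
        self.alpha = 1.0 - 0.5 ** (1.0 / halflife)
        self.var = None
        self.eps = eps

    def __call__(self, reward):
        sq = float(reward) ** 2
        # Initialise on the first reward, then update the running second moment.
        self.var = sq if self.var is None else (1 - self.alpha) * self.var + self.alpha * sq
        return float(reward) / (np.sqrt(self.var) + self.eps)
```

A zero increment leaves valid weights unchanged, which is what makes "do nothing" a cheap action for the policy to discover. The normaliser is stateful, so it would typically be reset (or deliberately carried over) at episode boundaries depending on the episode structure chosen.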

Baselines that have to be beaten

Three baselines that every RL portfolio result should be compared against:

  • Equal-weight (1/N). Notoriously hard to beat after costs on most benchmarks; if the RL policy doesn't beat this, there is no story.
  • Volatility targeting. Scale a fixed weight vector by the ratio of target volatility to realised volatility. Captures a large fraction of Merton's value when the dominant signal is just "sometimes the market is choppy." A clean one-line benchmark.
  • Markowitz with shrinkage. Static mean–variance optimiser using a Black–Litterman-style expected-return estimate and a Ledoit–Wolf shrinkage covariance estimate. The classical tangency portfolio, properly robustified. Anything worth deploying has to clear this.
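The first two baselines are short enough to sketch directly. The function names, the use of a trailing sample standard deviation as the realised-volatility estimate, and the leverage cap on the scale factor are illustrative assumptions rather than a prescribed implementation; the shrinkage mean–variance baseline needs a covariance estimator and is omitted here.

```python
import numpy as np

def equal_weight(n_assets):
    """1/N baseline: the same weight on every asset."""
    return np.full(n_assets, 1.0 / n_assets)

def vol_target_weights(base_weights, past_returns, *, target_vol=0.10,
                       periods_per_year=252, max_leverage=1.5):
    """Scale base_weights by target_vol / realised_vol, with the scale capped.

    past_returns is a (T, N) array of per-period asset returns; realised
    volatility is the annualised sample std of the base portfolio's returns
    (an assumed estimator; an EWMA would work equally well).
    """
    base_weights = np.asarray(base_weights, dtype=float)
    port = np.asarray(past_returns, dtype=float) @ base_weights   # (T,) portfolio returns
    realised = port.std(ddof=1) * np.sqrt(periods_per_year)
    scale = min(target_vol / max(realised, 1e-12), max_leverage)  # cap the scale factor
    return scale * base_weights
```

Both functions return a weight vector, so they can be wrapped as policies and run through the same environment and evaluation protocol as the RL agent.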

Reporting RL results without these baselines is one of the most common failure modes in the financial-RL literature.

Algorithm choice

Three RL families show up, each with a different sweet spot.

  • PPO (continuous actions). The default for simplex-valued weights or box-constrained allocations. Stable training, well-understood hyperparameters; the recursive-utility critic of Section 05-04 plugs directly into PPO.
  • SAC (continuous actions). Off-policy with entropy regularisation. More sample-efficient than PPO; the price is more hyperparameter-tuning sensitivity. Worth trying when sample efficiency is binding.
  • DQN-family (discrete actions). Use when the action space collapses to a small grid (regime-aware exposure choice). DQN with risk-aware extensions from Section 05-06 is the right tool for "buy/sell/hold-then-rebalance" style problems.

A useful empirical pattern: PPO with a recursive-utility critic beats vanilla PPO (mean critic) on Sharpe and drawdown when the training environment includes left-tailed regimes; the gap closes on calm-market data. This is the kind of result risk-aware methods are designed to produce.

Evaluation protocol

The protocol that survives the research-to-production transition.

  • Multiple seeds. Five at the absolute minimum, ten preferred. Report mean ± std; a single-seed Sharpe is not evidence.

  • Multiple data splits. A rolling-origin walk over the training window with at least four splits. The same policy class must win on every split, not just one.

  • Out-of-time hold-out. A genuinely held-out final window (typically the last 10–20% of the data) that no hyperparameter was tuned on. Numbers reported on this set are the only ones worth quoting.

  • Stress test on synthetic data. Generate counterfactual paths from the diffusion model of Chapter 10; evaluate the trained policy on those. A policy that ranks consistently in real and synthetic stress is the candidate to ship; one that wins on real but degrades on synthetic is a regime-fitter. Concretely:

    import numpy as np

    def evaluate_on_synthetic(policy, env_factory, synthetic_returns, n_episodes=64):
        """Run the trained policy on environments seeded with synthetic
        return paths and return per-episode Sharpe / drawdown."""
        sharpe, drawdown = [], []
        # Never index past the number of synthetic paths supplied.
        for s in range(min(n_episodes, len(synthetic_returns))):
            env = env_factory(returns=synthetic_returns[s])
            obs, _ = env.reset(seed=s)
            rewards = []
            done = False
            while not done:
                action = policy(obs)
                # Gymnasium-style step: stop on termination or on truncation.
                obs, r, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                rewards.append(r)
            rewards = np.asarray(rewards)
            # Annualised Sharpe, treating each reward as a daily log-return.
            sharpe.append(rewards.mean() / (rewards.std() + 1e-9) * np.sqrt(252))
            equity = np.exp(np.cumsum(rewards))
            drawdown.append(float(np.max(1 - equity / np.maximum.accumulate(equity))))
        return np.array(sharpe), np.array(drawdown)
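The split and seed rules in the protocol above can be made concrete with a small helper. This is a sketch under stated assumptions: the function names, the 20% hold-out, and the 40% minimum initial training fraction are illustrative choices, and observations are assumed to be indexed 0..n_obs-1 in chronological order.

```python
import numpy as np

def rolling_origin_splits(n_obs, *, n_splits=4, holdout_frac=0.2, min_train_frac=0.4):
    """Return ([(train_slice, val_slice), ...], holdout_slice).

    The final holdout_frac of the data is reserved and never appears in any
    train or validation slice. Each training window starts at 0 and grows,
    and each validation window immediately follows its training window.
    """
    holdout_start = int(round(n_obs * (1.0 - holdout_frac)))
    first_val = int(round(holdout_start * min_train_frac))
    val_len = (holdout_start - first_val) // n_splits
    if val_len < 1:
        raise ValueError("not enough observations for the requested splits")
    splits = []
    for k in range(n_splits):
        val_start = first_val + k * val_len
        # The last validation window absorbs any remainder up to the hold-out.
        val_end = val_start + val_len if k < n_splits - 1 else holdout_start
        splits.append((slice(0, val_start), slice(val_start, val_end)))
    return splits, slice(holdout_start, n_obs)

def summarise_seeds(values):
    """Mean and sample std of a per-seed metric (e.g. one Sharpe per seed)."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))
```

With 1,000 observations and the defaults, the hold-out is the last 200 points and the four validation windows tile the range 320 to 800; hyperparameters would be tuned only on those windows.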

Metrics

Five metrics, in order of importance for a deployed policy:

  • Sharpe ratio with seed-level confidence intervals.
  • Maximum drawdown and time under water.
  • Calmar ratio = annual return / max drawdown. Penalises tail-heavy strategies.
  • Turnover-adjusted return. PnL net of realistic transaction costs.
  • Constraint compliance. Should be zero violations under action masking; under penalty-based constraints, report the rate.

A policy that wins on Sharpe but has 5× the drawdown of equal-weight is not a winning policy. A policy that wins on Calmar after costs is.
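The path-based metrics can be computed from a single series of per-period log-returns. The sketch below is one reasonable set of conventions, not the only one: the function name is hypothetical, returns are assumed to be daily log-returns net of costs, starting wealth is 1, and time under water is taken as the longest run of consecutive periods below the previous peak.

```python
import numpy as np

def performance_metrics(log_returns, periods_per_year=252):
    """Sharpe, max drawdown, time under water, and Calmar from log-returns."""
    r = np.asarray(log_returns, dtype=float)
    equity = np.exp(np.cumsum(r))                            # wealth path, starting from 1
    peak = np.maximum.accumulate(np.maximum(equity, 1.0))    # running peak incl. start
    drawdown = 1.0 - equity / peak
    max_dd = float(drawdown.max())

    # Longest run of consecutive periods spent below the previous peak.
    longest = current = 0
    for under in drawdown > 0:
        current = current + 1 if under else 0
        longest = max(longest, current)

    ann_return = float(np.exp(r.mean() * periods_per_year) - 1.0)
    sharpe = float(r.mean() / (r.std(ddof=1) + 1e-12) * np.sqrt(periods_per_year))
    calmar = ann_return / max_dd if max_dd > 0 else float("inf")
    return {"sharpe": sharpe, "max_drawdown": max_dd,
            "time_under_water": longest, "calmar": calmar}
```

Running this once per seed and per split gives the per-seed values from which the confidence intervals on Sharpe are built.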

Reproducibility

The reproducibility floor for any reported RL portfolio result:

  1. Pinned environment. Commit hash for the env code, version pins for the data sources, frozen pre-processing pipeline.
  2. Seeded everything. Python, NumPy, PyTorch, the env's PRNG, the action sampler.
  3. Logged hyperparameters and intermediate metrics. Not just the final number; the training curves, the per-seed results, the per-split breakdowns.
  4. Released training scripts and inference scripts. A reader should be able to reproduce the headline number with one command.

Without these, a reported Sharpe is a literary device, not a scientific result.
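For the "seeded everything" item, a minimal helper might look like the following. The name `seed_everything` is hypothetical, and PyTorch is treated as optional so that the snippet also runs where it is not installed.

```python
import random
import numpy as np

def seed_everything(seed):
    """Seed Python, NumPy and, if installed, PyTorch; return a NumPy Generator."""
    random.seed(seed)               # Python's built-in PRNG
    np.random.seed(seed)            # NumPy's legacy global PRNG
    try:
        import torch                # optional dependency
        torch.manual_seed(seed)
    except ImportError:
        pass
    # A dedicated Generator for code that takes an explicit rng argument.
    return np.random.default_rng(seed)
```

The environment's PRNG and the action sampler are seeded separately. In a Gymnasium-style environment that is typically `env.reset(seed=seed)`, as in the evaluation loop earlier in this section, together with `env.action_space.seed(seed)`.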

What this section adds

Sections 05-01 through 05-06 are organised by technique. This section is organised by problem. The point is the bridge: the machinery from earlier in the chapter exists to be deployed on this benchmark. The synthetic-data work in Chapter 10 closes the loop by generating the stress regimes the policies have to be tested on beyond the historical record. Production-grade portfolio RL in finance amounts to assembling these pieces with the right reproducibility discipline; the algorithms are the easy part.