"In every job that must be done, there is an element of fun. You
find the fun, and snap! The job's a game."
— Mary Poppins (1964)
There is no fun in financial mathematics without the right framing.
This chapter assembles the mathematical toolkit that underpins the
rest of the book — return conventions, forecasting evaluation,
portfolio theory, classical time-series models, estimation and
simulation — and keeps each piece close to the trading or risk
decision it enables. The theory is the spoonful of sugar; the
operational use is the medicine that makes everything later actually
matter. Intuition first, formulas when needed, and a constant
reminder of how the theory plugs into trading and risk decisions.
This Chapter Covers
Return conventions, compounding rules, and the empirical facts that break
Gaussian assumptions.
Forecasting principles and evaluation frameworks that keep models honest.
Portfolio theory as an optimisation problem with realistic constraints and
modern estimators.
Classical time-series baselines (ARIMA, GARCH, VAR) and state-space thinking.
Estimation and simulation techniques that bridge statistical models and
downstream decision modules.
We begin with the unit of account for every model that follows: returns.
Prices are path-dependent, currency-specific, and scale with splits or
corporate actions. Returns render those raw prices into dimensionless
quantities that can be aggregated, compared across assets, and fed into
optimisation or simulation routines. The chapter stays practical —
anchoring each formula to the trading or risk decision it enables — and
the conventions it introduces (notation for $r_t$, $g_t$, $w_t$,
$\Sigma$) are used unchanged for the rest of the book.
Simple, log, and excess returns
Let $P_t$ be the asset price at time $t$ and $D_t$ the cash distribution
over $(t-1, t]$.
Simple return. $r_t = (P_t + D_t - P_{t-1}) / P_{t-1}$. Intuitive for
performance attribution and compatible with portfolio weights summing
to one.
Log return. $g_t = \log P_t - \log P_{t-1}$. Converts multiplicative
compounding into addition, $\log P_T - \log P_0 = \sum_{t=1}^{T} g_t$,
and stays well-defined for long horizons or near-zero prices.
Excess return. $x_t = r_t - r_t^{rf}$ for risk-free rate
$r_t^{rf}$. Centres the distribution on the economic compensation
for risk.
Use simple returns for cash-flow modelling and reporting; log returns for
analytics and anything involving multi-period summations. In code, store
both so that downstream modules can select the appropriate representation.
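The three definitions above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a toy price vector; the function names (`simple_returns`, `log_returns`, `excess_returns`) are hypothetical helpers, not part of any library.

```python
import numpy as np

def simple_returns(prices, dividends=None):
    """r_t = (P_t + D_t - P_{t-1}) / P_{t-1} for t = 1..T."""
    prices = np.asarray(prices, dtype=float)
    if dividends is None:
        dividends = np.zeros_like(prices)
    dividends = np.asarray(dividends, dtype=float)
    return (prices[1:] + dividends[1:] - prices[:-1]) / prices[:-1]

def log_returns(prices):
    """g_t = log P_t - log P_{t-1} (price-only, as in the definition above)."""
    return np.diff(np.log(np.asarray(prices, dtype=float)))

def excess_returns(returns, risk_free):
    """x_t = r_t - r_t^{rf}; both inputs must refer to the same period length."""
    return np.asarray(returns, dtype=float) - np.asarray(risk_free, dtype=float)

prices = np.array([100.0, 102.0, 99.0, 103.0])  # illustrative prices only
r = simple_returns(prices)
g = log_returns(prices)
x = excess_returns(r, 0.0001)  # assumed flat per-period risk-free rate
```

Storing both `r` and `g` costs little, and without distributions the two are linked by `np.expm1(g) == r` up to rounding, which makes a convenient consistency check.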
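For $n$ periods with log returns $g_t$, the cumulative log return is
$G_{0 \to n} = \sum_{t=1}^{n} g_t$ and the cumulative simple return is
$\exp(G_{0 \to n}) - 1$.
Annualise a daily series by
\[ \mu_{ann} = 252 \cdot E[g_t], \qquad \sigma_{ann} = \sqrt{252} \cdot \mathrm{Std}[g_t]. \]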
Scaling assumes iid increments; when volatility clusters, replace
constants with realised counts per regime (e.g., 21-day rolling windows)
to avoid overstating confidence during calm periods.
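As a small sketch of the annualisation step, assuming the usual iid scaling (mean scales with the number of periods, volatility with its square root) and 252 trading days per year. The simulated series and the helper names are illustrative assumptions.

```python
import numpy as np

TRADING_DAYS = 252  # day-count convention assumed in the text

def annualise(daily_log_returns, periods_per_year=TRADING_DAYS):
    """Return (mu_ann, sigma_ann) under the iid scaling assumption."""
    g = np.asarray(daily_log_returns, dtype=float)
    mu_ann = periods_per_year * g.mean()
    sigma_ann = np.sqrt(periods_per_year) * g.std(ddof=1)
    return mu_ann, sigma_ann

def cumulative_simple_return(log_returns):
    """exp(sum of log returns) - 1."""
    return float(np.expm1(np.sum(log_returns)))

rng = np.random.default_rng(0)
g = rng.normal(0.0004, 0.01, size=1000)  # toy daily log returns
mu_ann, sigma_ann = annualise(g)
```

When volatility clusters, the same function can be applied to a rolling window instead of the full sample, at the price of noisier estimates.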
Returns rarely align to a single clock. When aggregating intraday to
daily or daily to monthly, watch for microstructure noise and
overlapping-window bias. For simple returns, the $h$-period compounded
return is $\prod_{k=0}^{h-1} (1 + r_{t+k}) - 1$. For log returns,
additivity simplifies alignment: $g_{t:t+h} = \sum_{k=0}^{h-1} g_{t+k}$.
Currency conversion adds another layer: if the asset return in local
currency is $r_t^{LC}$ and the FX return is $r_t^{FX}$, the
USD return is $(1 + r_t^{LC})(1 + r_t^{FX}) - 1$. Keep FX
returns as separate features so that hedged and unhedged views are
auditable.
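The compounding and currency rules combine cleanly, which the following sketch checks on made-up numbers. The helper names and the three-day return vectors are assumptions for illustration.

```python
import numpy as np

def compound(simple_returns):
    """h-period simple return: prod(1 + r_{t+k}) - 1."""
    return float(np.prod(1.0 + np.asarray(simple_returns, dtype=float)) - 1.0)

def usd_return(r_local, r_fx):
    """(1 + r^LC)(1 + r^FX) - 1, element by element."""
    return (1.0 + np.asarray(r_local)) * (1.0 + np.asarray(r_fx)) - 1.0

r_local = np.array([0.010, -0.005, 0.007])  # local-currency asset returns (toy)
r_fx = np.array([-0.002, 0.004, 0.001])     # FX returns versus USD (toy)
r_usd = usd_return(r_local, r_fx)

# Compounding the USD series equals combining the two compounded legs.
three_day_usd = compound(r_usd)
three_day_from_legs = (1 + compound(r_local)) * (1 + compound(r_fx)) - 1
```

Keeping `r_local` and `r_fx` as separate columns means the unhedged view (`r_usd`) and an approximately hedged view (`r_local`) can both be rebuilt from the same stored data.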
Stylised facts to respect
Real markets violate Gaussian fairy tales. Four stylised facts recur and
shape almost every modelling choice in the book:
Heavy tails. Empirical kurtosis exceeds 3; drawdown risk is
non-negligible at any horizon.
Volatility clustering. $g_t^2$ exhibits strong positive
autocorrelation, the empirical ground for GARCH (Section 02-04) and
realised-volatility features.
Asymmetry and leverage. Negative shocks often precede higher
volatility; downside and upside tails differ.
Mild serial correlation at low frequency. Slow-moving factors
create predictive structure over weeks or months — what the dynamic
factor models of Chapter 6 try to capture.
A backtest that assumes iid normal noise will be fragile unless the
target horizon truly washes out these effects.
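Two of these facts are cheap to check numerically. The sketch below computes raw kurtosis and the lag-1 autocorrelation of squared returns on a simulated series with a calm half and a stressed half; the two-regime simulation is an assumption chosen only to make the effects visible, not a model of any market.

```python
import numpy as np

def sample_kurtosis(x):
    """Raw (non-excess) kurtosis; a Gaussian sample gives roughly 3."""
    z = np.asarray(x, dtype=float)
    z = z - z.mean()
    return float(np.mean(z**4) / np.mean(z**2) ** 2)

def autocorr(x, lag=1):
    """Sample autocorrelation at the given positive lag."""
    z = np.asarray(x, dtype=float)
    z = z - z.mean()
    return float(np.dot(z[:-lag], z[lag:]) / np.dot(z, z))

rng = np.random.default_rng(42)
calm = rng.normal(0.0, 0.005, size=2000)      # low-volatility regime (toy)
stressed = rng.normal(0.0, 0.02, size=2000)   # high-volatility regime (toy)
g = np.concatenate([calm, stressed])

kurt = sample_kurtosis(g)          # mixture of variances fattens the tails
acf_sq = autocorr(g**2, lag=1)     # persistence in squared returns
```

Even though each half is Gaussian, the pooled series shows kurtosis above 3 and positively autocorrelated squares, which is why a single full-sample volatility can mislead.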
Diagnostics and good hygiene
Before fitting fancy models, interrogate the distribution and dynamics:
Descriptive statistics per asset: mean, std, skew, kurtosis.
For tail modelling beyond empirical quantiles, Extreme Value Theory
(EVT) gives a principled fit. The Generalised Pareto Distribution (GPD)
approximates exceedances above a high threshold u with shape parameter
ξ and scale β. Backtesting EVT-based VaR requires independence
checks of exceedances and stability of ξ across rolling samples;
without those, the GPD fit is over-confident exactly where over-confidence
is dangerous.
Robust estimation of moments
Sample moments degrade in the presence of outliers or regime shifts.
Robust estimators — median, median absolute deviation (MAD), shrinkage
covariance — are not optional dressing; they are the default in any
production pipeline.
A widely used shrinkage covariance is Ledoit–Wolf
\citep{ledoitwolf2004},
\[ \hat{\Sigma}^{LW} = (1 - \delta) S + \delta F, \]
where $S$ is the sample covariance, $F$ a structured target
(constant-correlation, single-factor), and $\delta$ the shrinkage intensity
estimated from data. Robust moments feed directly into portfolio
optimisation (Section 02-03) and scenario generation (Chapter 10).
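A minimal sketch of the shrinkage formula with a constant-correlation target follows. One loud assumption: the intensity `delta` is fixed by hand here, whereas Ledoit–Wolf estimates it from the data; that estimator is not reproduced. Function names and the random toy data are illustrative.

```python
import numpy as np

def constant_correlation_target(S):
    """Target F: sample variances on the diagonal, average correlation elsewhere."""
    std = np.sqrt(np.diag(S))
    corr = S / np.outer(std, std)
    n = S.shape[0]
    avg_corr = (corr.sum() - n) / (n * (n - 1))
    F = avg_corr * np.outer(std, std)
    np.fill_diagonal(F, np.diag(S))
    return F

def shrink_covariance(S, delta):
    """(1 - delta) * S + delta * F for a user-supplied delta in [0, 1]."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return (1.0 - delta) * S + delta * constant_correlation_target(S)

rng = np.random.default_rng(1)
returns = rng.normal(size=(60, 5)) * 0.01   # 60 observations, 5 assets (toy data)
S = np.cov(returns, rowvar=False)
sigma_hat = shrink_covariance(S, delta=0.3)  # delta chosen arbitrarily here
```

The diagonal is untouched by construction, so per-asset variances survive; only the noisy off-diagonal structure is pulled toward the target.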
Dependence, co-movement, and multivariate tails
Risk in a portfolio depends on joint behaviour, not just per-asset
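moments. Four tools, roughly in increasing order of generality:
Linear correlation. $\rho_{ij} = \mathrm{Cov}(r_i, r_j) / (\sigma_i \sigma_j)$.
Easy to compute, blind to tail co-movement.
Rank correlation (Spearman, Kendall $\tau$). Captures monotonic
relationships and is resilient to outliers and nonlinearities.
Copulas. Model the joint distribution via marginal CDFs $F_i$ and a
copula $C$ such that $P(R_1 \le r_1, R_2 \le r_2) = C(F_1(r_1), F_2(r_2))$.
Gaussian copulas imply zero tail dependence whenever the correlation is
below one; Student-t copulas admit tail dependence. The upper
tail-dependence coefficient
$\lambda_U = \lim_{u \to 1^-} P(U_2 > u \mid U_1 > u)$, applied to
losses, summarises how often two assets crash together; for returns the
lower-tail analogue
$\lambda_L = \lim_{u \to 0^+} P(U_2 \le u \mid U_1 \le u)$ plays that role.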
Dynamic conditional correlation (DCC-GARCH) \citep{engle2002dcc}.
Marginal GARCH per series with a slowly-evolving correlation matrix —
the workhorse risk-model engine in many investment banks.
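Rank correlation and an empirical tail-dependence estimate need nothing beyond ranks. The sketch below is a rough illustration: the finite-threshold estimator `lower_tail_dependence` is a crude stand-in for the limit in the definition, ties are broken by order, and all names are hypothetical.

```python
import numpy as np

def ranks(x):
    """Ranks 1..n, ties broken by position (adequate for continuous returns)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(1, len(x) + 1)
    return r

def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

def lower_tail_dependence(x, y, u=0.05):
    """Empirical P(U2 <= u | U1 <= u) with pseudo-uniforms rank / (n + 1)."""
    n = len(x)
    ux, uy = ranks(x) / (n + 1), ranks(y) / (n + 1)
    in_tail = ux <= u
    if not in_tail.any():
        return float("nan")
    return float(np.mean(uy[in_tail] <= u))

x = np.arange(100.0)
y = x**3              # monotone but nonlinear in x
rho_s = spearman(x, y)
lam_l = lower_tail_dependence(x, y, u=0.05)
```

In practice the estimate is unstable for very small `u`, so it is worth plotting it over a range of thresholds rather than trusting one value.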
For multivariate stress tests, copula-based EVT combines parametric
tails on the margins with a copula on the dependence: transform returns
to uniforms, fit a Gaussian / t / vine copula, and invert with EVT
margins. Stress tests built this way expose scenarios where diversified
books still incur synchronous losses.
When feeding features into machine-learning models, decorrelate inputs
with PCA or autoencoders to reduce redundancy, but retain interpretable
factors (value, momentum, carry) for explainability — the dynamic-factor
machinery of Chapter 6 builds on this view.
Microstructure considerations
High-frequency prices embed bid-ask bounce, order-book depth, and latency
artifacts. Preprocessing steps that stabilise returns include:
Mid-quote returns instead of last trade to dampen bid-ask
oscillation.
Volume- or dollar-weighted bars to align sampling with liquidity
rather than wall-clock time.
Subsampling or pre-averaging to mitigate autocorrelation induced
by microstructure noise.
Handling zero returns common in illiquid names; tick-size effects
matter when scaling volatility.
These details prevent spurious predictability and improve the realism of
simulated trading costs in Chapter 10.
Regime awareness
Markets alternate between calm and stressed regimes. A binary regime
indicator — based on rolling realised volatility, credit spreads, or a
liquidity proxy — is the simplest tool that prevents a model trained on
one environment from over-confidently extrapolating into another. Hidden
Markov models and Bayesian change-point detection refine the same idea.
Regime probabilities recur as state variables in Chapter 5's policies and
as conditioning inputs in Chapter 10's generators.
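The simplest version of such an indicator can be sketched as a causal rolling volatility compared with a fixed threshold. The window length, the threshold, and the deterministic toy series are all assumptions; in a real pipeline the threshold would be set on training data only.

```python
import numpy as np

def rolling_realised_vol(log_returns, window=21):
    """Causal rolling volatility: entry t uses returns t-window+1 .. t only."""
    g = np.asarray(log_returns, dtype=float)
    out = np.full(g.shape, np.nan)
    for t in range(window - 1, len(g)):
        out[t] = g[t - window + 1 : t + 1].std(ddof=1)
    return out

def stress_flag(vol, threshold):
    """1 where vol exceeds the threshold, else 0; the NaN warm-up maps to 0."""
    return (np.nan_to_num(vol, nan=0.0) > threshold).astype(int)

# Deterministic toy series: 50 calm observations followed by 50 stressed ones.
calm = 0.001 * np.tile([1.0, -1.0], 25)
stressed = 0.02 * np.tile([1.0, -1.0], 25)
g = np.concatenate([calm, stressed])

vol = rolling_realised_vol(g, window=10)
flags = stress_flag(vol, threshold=0.005)
```

Because each window looks backward only, the flag at time `t` is usable as a feature for a forecast made at `t` without leaking future information.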
Cross-sectional dependence and factor hygiene
Equity universes share factor exposures (value, momentum, quality), and
that shared structure has to be respected to avoid double-counting.
Factor-neutral returns. Regress asset returns on factor returns
and use the residuals $\hat{\varepsilon}_t$ as model inputs. This isolates
idiosyncratic signals and stabilises covariance estimates.
Cluster-aware sampling. When bootstrapping (Section 02-05), sample
by industry clusters to preserve realistic co-movement. Block bootstrap
over time and clusters reduces the risk of overly optimistic
backtests.
Regularised correlations. Compute correlation matrices with
graphical lasso or random-matrix-theory filtering to remove
noise-dominated eigenvalues, which matters for risk-parity allocations
whose stability depends on $\Sigma^{-1}$.
Worked mathematics: from moments to risk limits
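Let weights be $w$ and asset log returns $g_t$. The
portfolio log return is approximately $g_{p,t} \approx w^\top g_t$
(the identity is exact for simple returns and first-order accurate for
log returns). Assuming finite second moments,
\[ \sigma_p^2 = w^\top \Sigma w, \qquad \Sigma = \mathrm{Cov}(g_t). \]
A maximum volatility limit $\sigma_p \le \bar{\sigma}$ is a quadratic
constraint on weights. Adding a drawdown cap $D_{max}$, an approximate
peak-to-trough loss under Gaussianity is $\Phi^{-1}(\alpha) \sigma_p$ at
confidence $\alpha$, giving the linearised limit
$\Phi^{-1}(\alpha) \sigma_p \le D_{max}$. These mathematical links turn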
descriptive statistics into actionable guardrails — the same form the
constrained portfolio optimisations of Section 02-03 consume.
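A worked instance of the volatility and Gaussian loss limits, as a hedged sketch: the two-asset covariance, the weights, and the cap values are made-up numbers, and `gaussian_loss_limit_ok` is a hypothetical helper.

```python
import numpy as np
from statistics import NormalDist

def portfolio_vol(w, cov):
    """sigma_p = sqrt(w' Sigma w)."""
    w = np.asarray(w, dtype=float)
    return float(np.sqrt(w @ cov @ w))

def gaussian_loss_limit_ok(w, cov, d_max, alpha=0.99):
    """Check Phi^{-1}(alpha) * sigma_p <= D_max."""
    z = NormalDist().inv_cdf(alpha)
    return z * portfolio_vol(w, cov) <= d_max

cov = np.array([[0.04, 0.006],
                [0.006, 0.01]])   # illustrative annualised covariance
w = np.array([0.5, 0.5])
sigma_p = portfolio_vol(w, cov)   # sqrt(0.0155), about 12.4%
```

With these numbers the 99% Gaussian loss is roughly 0.29, so a cap of 0.30 passes and a cap of 0.25 fails. The check is only as good as the Gaussian assumption behind it.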
Scenario libraries for stress testing
Beyond point estimates, assemble scenario libraries from historical
crises (2008, 2020), synthetic shocks (FX devaluation, rate spike), and
factor rotations. Each scenario is a vector of returns
$r^{(s)}$ used later for Monte Carlo overlays or constraint
testing. Label scenarios with metadata (duration, trigger, regime) so
that conditional risk conversations are reproducible. Chapter 10
revisits this with diffusion-based generators that produce unseen but
plausible scenarios on demand.
Case study: a return stack for a global book
Consider a portfolio of global equities, EM sovereign bonds, and crypto.
Four operational concerns shape the return stack before any modelling:
Clock alignment. Equities align to local closes; bonds to New York
17:00; crypto trades 24/7. Normalise to UTC and define a canonical
evaluation timestamp per asset class.
Corporate actions and carry. Equities need dividends and splits;
sovereigns need accrued interest and pull-to-par; perpetual futures
need funding-rate adjustments. Maintain per-asset adjustment ledgers so
derived returns can be reconstructed and audited.
Liquidity tiers. Thinly traded names have sporadic prints. Use
Kalman smoothing on log prices to impute small gaps, flagging the
imputed points so downstream models can downweight them.
Holiday calendars. Forward-filling across holidays that differ by
country corrupts factor timing. Build calendar-aware joins.
The output is a tall data frame
(timestamp, asset_id, local_return, fx_return, hedged_return, log_return, quality_flags).
Each column feeds later chapters: portfolio optimisation consumes hedged
returns; forecasting models ingest log returns and quality flags;
simulation routines draw from the scenario library linked to those
flags.
Implementation checklist
Version every transformation: raw prices → adjusted prices → returns →
features.
Persist metadata: currency, exchange, adjustment flags, data latency.
Validate monotonicity of timestamps and absence of duplicate
(asset, date) rows.
Roll-aware handling of futures: splice contracts using volume- or
open-interest-based rolls; record roll yield separately.
Deterministic reruns via fixed seeds and pinned library versions in
lock files.
Unit tests that recompute summary statistics after every pipeline
change and assert tolerance bounds.
A reproducible return series underpins every chapter that follows; errors
here propagate everywhere else.
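Two of the checklist items, monotone timestamps and no duplicate (asset, date) rows, can be enforced with a few lines of standard-library Python. The row layout and the function name are assumptions for illustration.

```python
from datetime import date

def validate_rows(rows):
    """Raise ValueError on duplicate (asset, date) rows or on dates that
    do not strictly increase within an asset. `rows` is an iterable of
    (asset_id, date) pairs in storage order."""
    seen = set()
    last_date = {}
    for asset, d in rows:
        if (asset, d) in seen:
            raise ValueError(f"duplicate row: {asset} {d}")
        if asset in last_date and d <= last_date[asset]:
            raise ValueError(f"non-monotonic timestamp: {asset} {d}")
        seen.add((asset, d))
        last_date[asset] = d
    return len(seen)

rows = [
    ("AAA", date(2024, 1, 2)),
    ("BBB", date(2024, 1, 2)),
    ("AAA", date(2024, 1, 3)),
]
n_valid = validate_rows(rows)  # passes: three distinct, ordered rows
```

Running such a check as a unit test after every pipeline change is cheap compared with tracing a duplicated row through a backtest.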
Forecasting and Evaluation Basics
Forecasts translate past information into beliefs about the future
distribution of returns, volatility, or latent states. They are not
decisions yet; they are the inputs that the optimisation routines of
Chapter 5 will later consume. A clean separation between forecasting and
decision-making is what keeps pipelines honest about what each model
knew at each point in time, and it is the discipline this section lays
down.
Forecasts as conditional expectations
At time $t$ with information set $\mathcal{F}_t$, the optimal point
forecast for $X_{t+h}$ under squared-error loss is the conditional
expectation $\hat{x}_{t+h} = E[X_{t+h} \mid \mathcal{F}_t]$.
Different loss functions induce different optimal summaries — quantiles
for pinball loss, full densities for log loss. Stating the loss
explicitly is the first step toward a well-posed forecasting problem.
Information sets matter. A forecast built with future prices is
meaningless in a trading system; a forecast restricted to tradable
signals at t is credible. Always make the time index and permissible
inputs explicit in code and documentation. The validation workflow at
the end of this section assumes you have already done so.
Crafting a forecast
Three design choices repeat across every model class in Chapter 4.
Recursive vs. direct horizons
Recursive (iterated). Fit a one-step model and roll it forward
using its own predictions. Economical, but error compounds across
horizons.
Direct (multi-output). Fit a model to predict
h=1,2,…,H simultaneously. Higher data demand; avoids
feedback loops.
Hybrid. Direct forecasts for the near term combined with recursive
tails for longer horizons. The right answer is workload-dependent;
test both on the held-out window before committing.
Point, quantile, and density forecasts
Point forecasts. Single-number summaries; easy to use in
heuristics or linear constraints.
Quantile forecasts. $q_\alpha(X_{t+h} \mid \mathcal{F}_t)$ for a
small set of levels $\alpha$; informs VaR-style limits and asymmetric payoffs.
Density forecasts. Model $p(x \mid \mathcal{F}_t)$ explicitly via
parametric (Gaussian, Student-t) or nonparametric (normalising flow,
diffusion) forms. Essential for simulation-based decision modules and
for the synthetic-data work in Chapter 10.
The decision layer of Chapter 5 consumes some functional of the
predictive distribution — never just a point — so density / quantile
outputs are the default in this book.
Practical features
Rolling features. Means, vols, correlations over windows tuned to
the trading horizon.
Calendar effects. Day-of-week, month-end, time-to-event for macro
releases and earnings.
Regime flags. Volatility-state, liquidity stress, or
spread-level proxies — see Section 02-01 for construction.
Feature engineering must respect causality: only data available at t
may enter the feature for a forecast made at t, and the
transformations must be auditable to that constraint.
Feature temporal integrity and label construction
Accurate timestamps matter more than algorithm choice. Enforce a strict
data model:
Align features to the forecast origin t; forward-fill only where
economically justified, and never across embargoed windows.
Construct labels that mirror the tradable horizon:
$y_t = \log P_{t+h} - \log P_t$ for returns, or
$\mathbf{1}\{P_{t+h} > P_t\}$ for direction.
Avoid overlapping labels that inflate sample size and induce
leakage. If unavoidable, use block bootstrap (Section 02-05) or
Newey–West adjustments when reporting uncertainty.
A short validation script that asserts these invariants prevents subtle
data snooping and is cheaper than the post-hoc detective work after a
failed deployment.
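The label definitions translate directly into code. The sketch below builds forward log-return and direction labels and picks non-overlapping forecast origins; helper names and the toy prices are assumptions.

```python
import numpy as np

def forward_log_return_labels(prices, h):
    """y_t = log P_{t+h} - log P_t; the last h entries are NaN (not yet known)."""
    if h < 1:
        raise ValueError("h must be at least 1")
    logp = np.log(np.asarray(prices, dtype=float))
    y = np.full(logp.shape, np.nan)
    y[:-h] = logp[h:] - logp[:-h]
    return y

def direction_labels(prices, h):
    """1.0 if P_{t+h} > P_t, 0.0 otherwise, NaN where the label is unknown."""
    y = forward_log_return_labels(prices, h)
    out = np.full(y.shape, np.nan)
    valid = ~np.isnan(y)
    out[valid] = (y[valid] > 0).astype(float)
    return out

def non_overlapping_index(n, h):
    """Origins t = 0, h, 2h, ... whose label windows (t, t+h] do not overlap."""
    return np.arange(0, n - h, h)

prices = [100.0, 110.0, 121.0, 133.1]   # toy prices growing 10% per step
y = forward_log_return_labels(prices, h=2)
d = direction_labels(prices, h=2)
```

Leaving the trailing labels as NaN, rather than dropping rows, keeps features and labels aligned on the same index and makes accidental look-ahead easier to spot.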
Feature engineering playbook
Robust forecasts rest on disciplined feature construction:
Stationarisation. Differences, log ratios, or demeaning by
rolling means to stabilise levels. Compute every transformation with
causal windows.
Volatility alignment. Scale features by recent realised volatility
so that model parameters are comparable across regimes. For deep
models, standardise per feature using expanding or rolling
statistics.
Event windows. Pre- and post-event aggregates around earnings or
macro releases, labelled with surprise magnitudes for conditional
forecasting.
Microstructure features. Order-book imbalance, realised
volatility, signed volume — make sure the timestamps line up with
the trading session of interest.
Hierarchical features. Combine cross-sectional statistics
(industry mean, sector momentum) with asset-level signals; helps
generalisation for sparse names.
Pipeline hygiene comes down to unit tests: monotonic timestamps, no
forward fills across embargoed periods, stable feature counts after
schema evolution.
Evaluating forecast quality
Evaluation must mirror deployment. Split data along time, not randomly.
Match metrics to the forecast type and the downstream objective.
Point metrics
MAE. $\frac{1}{N} \sum_i |y_i - \hat{y}_i|$. More robust to outliers
than RMSE.
RMSE. $\sqrt{\frac{1}{N} \sum_i (y_i - \hat{y}_i)^2}$. Penalises
large mistakes.
MAPE. Scale-free but unstable when values approach zero — avoid
for returns near zero.
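A short numerical illustration of how MAE and RMSE react differently to a single large error. The four forecast values are invented for the example.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))

def rmse(y, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

y = np.zeros(4)
y_hat = np.array([0.01, -0.01, 0.01, 0.05])  # one forecast is far off

err_mae = mae(y, y_hat)    # (0.01 + 0.01 + 0.01 + 0.05) / 4 = 0.02
err_rmse = rmse(y, y_hat)  # sqrt(0.0028 / 4), about 0.0265
```

The single 0.05 miss lifts RMSE noticeably above MAE; the gap between the two is itself a quick indicator of how concentrated the errors are.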
Probabilistic metrics
Pinball loss for quantiles:
$\rho_\alpha(y, \hat{q}) = (\alpha - \mathbf{1}\{y < \hat{q}\})(y - \hat{q})$.
Continuous Ranked Probability Score (CRPS). Integral of pinball
losses over all quantiles; compares the full predictive distribution
to observations. The default probabilistic score in this book.
Log-likelihood / cross-entropy. Rewards calibrated densities,
punishes overconfident tails — but is sensitive to outliers, so
report alongside CRPS rather than instead of it.
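Both scores are straightforward to compute. The sketch below implements the pinball loss as defined above and estimates CRPS from predictive draws using the energy form $E|X - y| - \tfrac{1}{2} E|X - X'|$; the sample-based estimator is an assumption about how the predictive distribution is represented (as a finite set of draws).

```python
import numpy as np

def pinball_loss(y, q_hat, alpha):
    """Mean of (alpha - 1{y < q_hat}) * (y - q_hat)."""
    y = np.asarray(y, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)
    indicator = (y < q_hat).astype(float)
    return float(np.mean((alpha - indicator) * (y - q_hat)))

def crps_from_samples(samples, y):
    """CRPS estimate E|X - y| - 0.5 * E|X - X'| from predictive draws X."""
    x = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return float(term1 - term2)

under = pinball_loss([1.0], [0.0], alpha=0.9)  # outcome above the quantile: 0.9
over = pinball_loss([0.0], [1.0], alpha=0.9)   # outcome below the quantile: 0.1
crps_point = crps_from_samples([2.0], 3.0)     # point mass reduces to |error|
crps_two = crps_from_samples([0.0, 2.0], 1.0)  # spread around the outcome
```

The pairwise term costs memory quadratic in the number of draws, so for large ensembles it is worth subsampling or using a sorted-sample formula instead.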
Calibration and sharpness
A density forecast should be calibrated (statistically consistent
with observed frequencies) and sharp (concentrated). The
probability integral transform (PIT) histogram is the diagnostic:
$u_i = F_{\hat{\theta}}(y_i)$ should be uniform on $[0, 1]$ under perfect
calibration. A U-shaped histogram signals predictive tails that are too
thin (under-dispersion); a hump-shaped histogram signals predictive
densities that are too wide (over-dispersion). For quantile models, check
coverage of prediction intervals: the proportion of observations below
the $\alpha$-quantile should approach $\alpha$.
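A minimal PIT and coverage check might look as follows. It assumes Gaussian predictive densities with per-observation mean and standard deviation, which is only one possible forecast family; the helper names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def pit_values(y, mu, sigma):
    """u_i = F_i(y_i) for Gaussian predictive densities N(mu_i, sigma_i^2)."""
    return np.array([NormalDist(m, s).cdf(v) for v, m, s in zip(y, mu, sigma)])

def pit_histogram(u, bins=10):
    """Relative frequencies of PIT values in equal-width bins on [0, 1]."""
    counts, _ = np.histogram(u, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

def quantile_coverage(y, q_hat):
    """Share of observations strictly below the forecast quantile."""
    return float(np.mean(np.asarray(y) < np.asarray(q_hat)))

u_mid = pit_values([0.0], [0.0], [1.0])       # outcome at the median: 0.5
u_grid = (np.arange(10) + 0.5) / 10           # perfectly uniform PIT values
hist = pit_histogram(u_grid)                  # flat histogram, 0.1 per bin
cov = quantile_coverage([1, 2, 3, 4], [2.5] * 4)
```

On real data the histogram is compared against the flat line by eye or with a uniformity test; serial dependence in the PIT values is a separate diagnostic that this sketch does not cover.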
Sharpness is measured by entropy or average interval width conditional
on correct coverage. A model with slightly higher MAE but honest tails
often dominates in decision contexts involving leverage or options.
Classification-style metrics
For directional forecasts, use balanced accuracy or Matthews
correlation to avoid being fooled by class imbalance (mostly up days).
Thresholds should be chosen via nested validation, not cherry-picked
post-hoc.
Backtesting and cross-validation
Time-series cross-validation preserves ordering. Common schemes:
Expanding window. Train on [1,t], test on (t,t+h], expand
and repeat. Mimics live trading where history accumulates.
Sliding window. Fixed-length training window — guards against
concept drift but limits sample size.
Purged gaps. Remove a buffer between train and test to mitigate
leakage from overlapping labels (important for high-frequency data;
see López de Prado's purged k-fold).
Backtests should ingest the same feature construction pipeline used
at inference; running a separate "research" pipeline at training time
is the most common source of look-ahead leakage.
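The expanding-window scheme with a purge gap can be sketched as a small generator. The parameter names (`initial_train`, `test_size`, `purge`) are illustrative assumptions, and the function works on integer positions rather than timestamps.

```python
def expanding_window_splits(n_samples, initial_train, test_size, purge=0):
    """Yield (train_idx, test_idx) lists with `purge` samples skipped between
    the end of the training window and the start of the test window."""
    start = initial_train
    while start + purge + test_size <= n_samples:
        train = list(range(0, start))
        test = list(range(start + purge, start + purge + test_size))
        yield train, test
        start += test_size  # the training window grows by one test block

splits = list(expanding_window_splits(n_samples=10, initial_train=4,
                                      test_size=2, purge=1))
# First fold trains on 0..3 and tests on 5..6; index 4 is purged.
```

Setting `purge` to at least the label horizon keeps any training label from overlapping the test window.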
Bias–variance trade-offs and model risk
Forecast errors decompose into bias (systematic mis-specification),
variance (sensitivity to sample noise), and irreducible noise. Linear
models with strong regularisation shrink variance at the cost of bias;
over-parameterised deep nets push bias low and variance up. Calibration
curves and reliability diagrams diagnose which side you are on.
Ensembling (bagging, boosting, stacking) reduces variance and
diversifies inductive biases — see "Forecast combination" below.
Model-risk governance turns these diagnostics into operational rules:
benchmark against naive models (random walk, historical average),
monitor population drift, and record decision logs that tie forecasts
to positions. When a model degrades, having a challenger ready limits
downtime.
Forecast combination and ensembling
Single models overfit idiosyncratic noise. Three reliable combiners:
Simple averages. Surprisingly robust when the errors of individual
models are weakly correlated.
Precision-weighted averages. Weights proportional to inverse
forecast variance or past risk-adjusted performance.
Stacking. A meta-learner that consumes base forecasts and
metadata (horizon, regime score) as inputs. Keep the meta-learner
linear or monotone to avoid amplifying base-model noise.
Track ensemble diversity with the pairwise correlation of forecast
errors; ensembles whose members are too similar offer little benefit.
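The first two combiners and the diversity check fit in a few lines. In this sketch the weights are proportional to the inverse variance of each model's past errors; the error histories and forecasts are invented numbers, and the helper names are hypothetical.

```python
import numpy as np

def simple_average(forecasts):
    """Equal-weight combination; `forecasts` has shape (models, horizons)."""
    return np.mean(np.asarray(forecasts, dtype=float), axis=0)

def precision_weights(past_errors):
    """Weights proportional to inverse error variance, one row per model."""
    var = np.var(np.asarray(past_errors, dtype=float), axis=1, ddof=1)
    inv = 1.0 / var
    return inv / inv.sum()

def precision_weighted(forecasts, weights):
    """Weighted combination of model forecasts."""
    return np.asarray(weights) @ np.asarray(forecasts, dtype=float)

past_errors = np.array([[1.0, -1.0, 1.0, -1.0],    # model A: smaller errors
                        [2.0, -2.0, 2.0, -2.0]])   # model B: errors twice as large
forecasts = np.array([[0.01, 0.02],
                      [0.03, 0.00]])

w = precision_weights(past_errors)                 # [0.8, 0.2]
combined = precision_weighted(forecasts, w)
equal = simple_average(forecasts)
error_corr = np.corrcoef(past_errors)[0, 1]        # 1.0: no diversity here
```

In this toy case the two error series are perfectly correlated, so combining them reduces no variance at all, which is exactly what the pairwise-correlation check is meant to reveal.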
Regime-conditioned forecasting
Forecast skill varies by regime. Make the regime explicit:
Hidden Markov models or Bayesian change-point detection to
identify shifts in volatility or correlation.
Regime-conditioned models. Train one model per regime, or include
regime probabilities as features. Smooth transitions via
mixture-of-experts with temperature-controlled gating networks.
Stress-aware validation. Make sure each evaluation fold contains
stress episodes; otherwise accuracy will vanish exactly when it
matters.
Regime probabilities recur as state variables in Chapter 5's policies
and as conditioning inputs in Chapter 10's generators.
Stress testing forecasts
Even well-tuned models fail under stress. Probe robustness by:
Perturbing inputs. Add noise or simulate missing data; forecasts
that swing wildly are fragile.
Regime conditioning. Evaluate separately in high-volatility and
low-liquidity windows.
Scenario shocks. Force macro variables or spreads to crisis
levels and propagate through the model.
Stress results inform downstream safeguards: caps on position size,
volatility scaling, or fallbacks to simpler models during turmoil.
From forecasts to decisions
Forecasts supply distributions; decisions consume them via utility or
cost functions. Keep the two modular: a change in the decision rule
should not require re-fitting the forecaster, and vice versa.
Economic evaluation
Statistical accuracy is necessary but not sufficient. Convert forecast
quality into economic terms:
Utility-based scoring. Maximise
$E[U(\mathrm{PnL})]$ for a concave $U$ that
penalises tail risk.
PnL attribution. Run a simple trading rule (sign of forecast,
vol-scaling) and decompose PnL into hit rate, average win/loss,
turnover, and costs.
Cost-aware scoring. Subtract realistic transaction costs and
market impact; penalise turnover or gap risk explicitly.
Capacity analysis. Examine decay of forecast value as notional
increases; fit impact curves (square-root law) to convert accuracy
into capacity estimates.
Mandate alignment. Evaluation must respect the leverage,
concentration, and liquidity constraints that downstream portfolio
construction will enforce.
Economic backtests belong in a separate, audited pipeline so they do
not contaminate the pure forecasting evaluation.
Interpreting models
Explainability builds trust and uncovers spurious signals. For linear
models, inspect coefficients and partial correlations. For tree
ensembles or neural networks, use SHAP, permutation importance, and
counterfactual analysis. Temporal saliency (integrated gradients
over time) shows whether the model anchors on economically meaningful
windows or just memorises noise. Interpretability checks should be
part of model acceptance, not an afterthought.
Practical validation workflow
A lightweight but rigorous workflow:
Data split. Training, validation, and test chronologically with
embargo gaps.
Hyperparameter tuning. Bayesian optimisation with time-series
cross-validation.
Model audit. Calibration plots, residual diagnostics,
sensitivity to feature perturbations.
Live shadow. Run forecasts in paper trading to verify latency,
cost estimates, and monitoring dashboards.
Promotion. Freeze weights, document feature schema, register the
model with metadata (version, owner, validation set, decay policy).
This process mirrors the MLOps routines used later when forecasts are
plugged into the RL or portfolio systems of Chapter 5.
Monitoring in production
Once deployed, forecasts require continuous health checks:
Data drift. Track feature distributions via population stability
index (PSI) or Wasserstein distance; alert on threshold breaches.
Performance decay. Monitor rolling calibration error, hit rate,
and realised Sharpe of forecast-driven strategies.
Latency budgets. Log end-to-end inference times, especially when
features rely on streaming market data.
Fallback policies. Define deterministic rules — neutrality,
risk-off — when data quality flags fire or models exceed error
budgets.
These monitoring practices echo the control loops in Chapter 5: a
forecast is part of a closed system, not a detached prediction, and the
feedback path matters as much as the model itself.
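As one example of a drift check, the population stability index can be sketched as below. It uses the common form $\sum_i (a_i - e_i) \ln(a_i / e_i)$ over bins defined by quantiles of the reference sample; the bin count and the `eps` floor for empty bins are assumptions, and alert thresholds are left to the reader.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges taken from the reference sample's quantiles.
    inner = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1)[1:-1])
    e_idx = np.searchsorted(inner, expected, side="right")
    a_idx = np.searchsorted(inner, actual, side="right")
    e_frac = np.bincount(e_idx, minlength=bins) / len(expected)
    a_frac = np.bincount(a_idx, minlength=bins) / len(actual)
    # Floor empty bins so the logarithm stays finite.
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

reference = np.linspace(0.0, 1.0, 1000)          # toy training-time feature
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, reference + 0.5)
```

An unchanged distribution scores zero and a shifted one scores well above it; the value is sensitive to the bin count, so the same binning should be used consistently over time.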
Portfolio Theory
We rarely invest in a single asset. The investor's problem is to allocate
wealth across correlated bets while respecting constraints, preferences,
and market frictions. Portfolio theory provides the map: how to quantify
trade-offs between risk and reward, and how to express them as
optimisation problems that the dynamic policies of Chapter 5 will later
solve through time. The static formulations here are the ground truth
that every later RL or function-approximation method should be benchmarked
against.
Mean–variance core
Let $w \in \mathbb{R}^n$ be portfolio weights summing to one,
$\mu$ the vector of expected returns, and $\Sigma$ the
covariance matrix of returns. The portfolio mean and variance are
\[ \mu_p = w^\top \mu, \qquad \sigma_p^2 = w^\top \Sigma w. \]
Markowitz's classical problem chooses $w$ to minimise variance
subject to a return target,
\[ \min_{w} \; w^\top \Sigma w \quad \text{s.t.} \quad w^\top \mu \ge \mu^\star, \quad \mathbf{1}^\top w = 1. \]
The set of optimal $(\sigma_p, \mu_p)$ pairs traces the efficient
frontier. Adding a risk-free asset converts the frontier into the
Capital Market Line with slope equal to the Sharpe ratio of the
tangency portfolio,
\[ w_{tan}^\star \propto \Sigma^{-1} (\mu - r_f \mathbf{1}). \]
This is the closed-form benchmark referenced throughout Chapter 5; if
your dynamic policy under-performs the tangency portfolio out-of-sample,
the right place to look first is the inputs $(\mu, \Sigma)$,
not the policy class.
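The tangency weights are one linear solve away. The following sketch uses invented two-asset inputs and normalises the weights to sum to one, which presumes the unnormalised weights have a positive sum.

```python
import numpy as np

def tangency_weights(mu, cov, rf):
    """Weights proportional to Sigma^{-1} (mu - rf * 1), scaled to sum to one."""
    excess = np.asarray(mu, dtype=float) - rf
    raw = np.linalg.solve(cov, excess)  # avoids forming the explicit inverse
    return raw / raw.sum()

def sharpe(w, mu, cov, rf):
    """Ex-ante Sharpe ratio of a fully invested portfolio."""
    w = np.asarray(w, dtype=float)
    return float((w @ (np.asarray(mu) - rf)) / np.sqrt(w @ cov @ w))

mu = np.array([0.08, 0.05])              # illustrative expected returns
cov = np.array([[0.04, 0.006],
                [0.006, 0.01]])          # illustrative covariance
rf = 0.02

w_tan = tangency_weights(mu, cov, rf)    # one third / two thirds here
s_tan = sharpe(w_tan, mu, cov, rf)
s_equal = sharpe([0.5, 0.5], mu, cov, rf)
```

With these inputs the tangency portfolio has the higher ex-ante Sharpe ratio, as it must; out of sample the ranking depends on how well `mu` and `cov` were estimated.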
Utility functions and risk preferences
The mean–variance problem is one of several utility-based formulations.
Common choices, and where they fit:
Quadratic / mean–variance utility. $U(w) = w^\top \mu - \frac{\lambda}{2} w^\top \Sigma w$,
giving the closed-form $w^\star = \frac{1}{\lambda} \Sigma^{-1} \mu$
when no constraints bind. Approximates expected utility under joint
Gaussianity.
CRRA utility. $U(W) = \frac{W^{1-\gamma}}{1-\gamma}$ for $\gamma \ne 1$ (and
$\log W$ at $\gamma = 1$). Produces multi-period rules in Chapter 5
whose risk premium scales with wealth, the natural fit for compounding
investors.
Prospect-theory utility. Asymmetric value function with loss
aversion and probability weighting. Useful for understanding investor
flows; rarely used directly inside an optimiser.
A subtle but important point: the choice of risk-aversion parameter
(λ or γ) is a modelling choice, not a market quantity.
Sensitivity analysis across plausible values should be part of every
honest report.
Risk measures beyond variance
Variance treats upside and downside symmetrically. The alternatives
focus on the loss region and matter once tails dominate the decision:
Value-at-Risk (VaR). $\alpha$-quantile of the portfolio loss
distribution. Easy to communicate, not coherent (subadditivity
can fail).
Conditional VaR / Expected Shortfall. $\mathrm{CVaR}_\alpha = E[L \mid L \ge \mathrm{VaR}_\alpha]$.
Coherent and admits a linear-programming reformulation
\citep{rockafellar2000cvar}, which is why portfolio optimisers prefer
it over VaR when the question is "minimise tail loss subject to
return target."
Tracking error. Standard deviation of active returns
$w^\top r - b^\top r$ relative to a
benchmark $b$. The right risk measure for benchmarked mandates.
Choosing the risk measure changes the optimum: CVaR-optimised portfolios
reallocate mass away from fat-tailed exposures; tracking-error mandates
suppress active bets regardless of standalone Sharpe ratio.
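An empirical VaR and CVaR pair can be read off a loss sample directly. This sketch relies on NumPy's default linear-interpolation quantile and an evenly spaced toy loss vector; other quantile conventions give slightly different VaR values.

```python
import numpy as np

def var_cvar(losses, alpha=0.95):
    """Empirical VaR (alpha-quantile of losses) and CVaR (mean loss at or
    beyond VaR). Losses are positive numbers for money lost."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)
    cvar = losses[losses >= var].mean()
    return float(var), float(cvar)

losses = np.arange(1.0, 101.0)   # toy losses 1, 2, ..., 100
var_95, cvar_95 = var_cvar(losses, alpha=0.95)
# VaR interpolates between the 95th and 96th values; CVaR averages 96..100.
```

CVaR is never below VaR at the same level, and the gap between them is a rough measure of how heavy the tail beyond the quantile is.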
Constraints, costs, and frictions
Real mandates deviate from frictionless theory. Encode these constraints
inside the optimisation, not as post-processing — that is what keeps
the static benchmark honest enough to compare against dynamic policies.
Budget and leverage. $\mathbf{1}^\top w = 1$ for fully
invested books, or $\|w\|_1 \le L$ to cap gross
exposure.
Box / sign constraints. $w_i^{min} \le w_i \le w_i^{max}$;
$w_i \ge 0$ for long-only.
Turnover and transaction costs. Linear costs $c_i |\Delta w_i|$
approximate commissions and bid-ask; quadratic costs
$\eta \Delta w_i^2$ approximate temporary market impact and yield
smooth optimisations that play well with gradient methods.
Risk caps. Volatility ceilings
$w^\top \Sigma w \le \sigma_{max}^2$, factor
exposure caps $|f^\top w| \le b$, concentration limits
by sector or region.
Liquidity caps. $|\Delta w_i| \le k \times \mathrm{ADV}_i$ with
$k \in [0.1, 0.3]$ depending on venue depth, where the trade and
$\mathrm{ADV}_i$ are expressed in the same units.
ESG / policy filters. Exclusion lists or scoring thresholds with
weights re-normalised after the filter is applied.
With quadratic risk and linear constraints, a quadratic-program solver
(for example via cvxpy or qpsolvers) finds the global optimum. For non-convex frictions
(integer lot sizes, fixed costs), switch to specialised solvers but keep
the same objective structure so the result is comparable to the convex
benchmark.
Factor models and shrinkage
Estimating Σ is brittle when assets outnumber observations. Two
remedies, used together rather than apart:
Factor models. Express returns as
$r = B f + \varepsilon$ with factor loadings
$B$, factor covariance $\Sigma_f$, and idiosyncratic risk $\Psi$
(diagonal). Then $\Sigma = B \Sigma_f B^\top + \Psi$, with far fewer
parameters, and the loadings carry an economic interpretation. This
is also the bridge to the dynamic-factor machinery of Chapter 6.
Shrinkage covariance. Combine sample covariance $S$ with a
structured target $F$:
\[ \hat{\Sigma} = \delta F + (1 - \delta) S. \]
Ledoit–Wolf \citep{ledoitwolf2004} chooses $\delta$ optimally
under quadratic loss with $F$ a constant-correlation target. For
factor structures, shrinking the residual covariance $\Psi$ rather
than the full $\Sigma$ tends to perform better.
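The factor decomposition of the covariance is a one-line matrix product; the sketch below also counts parameters to show where the saving comes from. Loadings, factor covariance, and idiosyncratic variances are invented numbers, and the helper names are hypothetical.

```python
import numpy as np

def factor_covariance(B, cov_f, psi_diag):
    """Sigma = B Sigma_f B' + diag(psi)."""
    return B @ cov_f @ B.T + np.diag(psi_diag)

def n_params_full(n):
    """Free parameters in an unrestricted n x n covariance matrix."""
    return n * (n + 1) // 2

def n_params_factor(n, k):
    """Loadings, factor covariance, and diagonal idiosyncratic variances."""
    return n * k + k * (k + 1) // 2 + n

B = np.array([[1.0, 0.2],
              [0.8, -0.1],
              [1.2, 0.5],
              [0.9, 0.0]])                  # 4 assets, 2 factors (toy loadings)
cov_f = np.array([[0.04, 0.0],
                  [0.0, 0.01]])             # factor covariance (toy)
psi = np.array([0.02, 0.03, 0.01, 0.02])    # idiosyncratic variances (toy)

sigma = factor_covariance(B, cov_f, psi)
```

For 500 assets and 5 factors the count drops from 125,250 to 3,015 parameters, and the result is positive definite whenever the idiosyncratic variances are positive.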
These techniques keep the optimisation well-posed and align with the
linear-algebra workflows of Chapter 3.
Robust and Bayesian portfolio optimisation
Markets shift, and the inputs to a static optimiser are estimated from
data that lags those shifts. Two routes to a less fragile optimum:
Robust optimisation. Treat inputs as uncertain sets and optimise
the worst case. For mean uncertainty in an ellipsoid
$\{\mu : \|\mu - \hat{\mu}\|_2 \le \delta\}$,
the worst-case objective penalises return by $\delta \|w\|_2$.
Distributionally robust optimisation (DRO) specifies ambiguity sets
in Wasserstein distance and yields portfolios resilient to tail shifts.
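Bayesian / Black–Litterman. Place priors on $\mu$ and
$\Sigma$. The Black–Litterman \citep{blacklitterman1992} formula
blends market-implied equilibrium returns $\pi$ with
subjective views $q$ through
\[ \mu_{BL} = \left[ (\tau \Sigma)^{-1} + P^\top \Omega^{-1} P \right]^{-1} \left[ (\tau \Sigma)^{-1} \pi + P^\top \Omega^{-1} q \right], \]
where $P$ maps assets to views, $\Omega$ is the covariance of view
errors, and $\tau$ scales the uncertainty of the prior.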
This is the canonical recipe to turn idiosyncratic conviction into a
confidence-weighted allocation rather than letting the optimiser ride
the noisiest forecast.
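The posterior-mean formula can be sketched directly, here with a single relative view on two assets. All inputs are invented, `tau=0.05` is an arbitrary choice, and explicit inverses are used for readability rather than numerical efficiency.

```python
import numpy as np

def black_litterman_mean(pi, cov, P, q, omega, tau=0.05):
    """Posterior mean blending equilibrium returns pi with views (P, q, omega)."""
    prior_prec = np.linalg.inv(tau * cov)
    omega_inv = np.linalg.inv(omega)
    A = prior_prec + P.T @ omega_inv @ P
    b = prior_prec @ pi + P.T @ omega_inv @ q
    return np.linalg.solve(A, b)

pi = np.array([0.05, 0.03])                 # equilibrium returns (toy)
cov = np.array([[0.04, 0.006],
                [0.006, 0.01]])             # covariance (toy)
P = np.array([[1.0, -1.0]])                 # view: asset 1 minus asset 2
q = np.array([0.04])                        # ... outperforms by 4%
tau = 0.05

# View uncertainty set equal to the prior uncertainty of the same spread.
omega = tau * P @ cov @ P.T
mu_bl = black_litterman_mean(pi, cov, P, q, omega, tau)
spread = float((P @ mu_bl)[0])              # halfway between 0.02 and 0.04
```

With equal prior and view uncertainty the posterior spread lands exactly midway between the equilibrium spread and the view; a vaguer view (larger `omega`) pulls it back toward equilibrium.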
Robust and Bayesian methods change the answer qualitatively, not just
quantitatively. Always test the policy under both on the same data;
fragile static benchmarks are usually traceable to skipping this step.
Risk parity, minimum variance, and alternatives
Mean–variance is one objective among many. Four alternatives recur:
Risk parity. Equalise risk contributions
$w_i (\Sigma w)_i$ across assets. Better diversification when
$\mu$ is hard to estimate (which is most of the time).
Minimum variance. Set $\mu = 0$ and minimise
$w^\top \Sigma w$. A defensive core; surprisingly hard
to beat out-of-sample.
Maximum diversification. Maximise
$w^\top \sigma / \sqrt{w^\top \Sigma w}$
for a vector of volatilities $\sigma$. Favours
low-correlation assets.
Drawdown-constrained. Constrain expected drawdown via scenarios
(Chapter 10) or via CVaR-style objectives on rolling peak-to-trough
losses.
Strategy stacking — running several of these policies side by side and
allocating across them — is a common operational answer when no single
objective dominates across regimes.
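Three of these objectives have simple numerical forms. The sketch below computes risk contributions, the unconstrained minimum-variance weights, and the diversification ratio for an uncorrelated two-asset case, where inverse-volatility weights happen to be exactly risk parity. General risk-parity solutions need an iterative solver, which is not shown; the inputs are invented.

```python
import numpy as np

def risk_contributions(w, cov):
    """w_i * (Sigma w)_i; the contributions sum to the portfolio variance."""
    w = np.asarray(w, dtype=float)
    return w * (cov @ w)

def min_variance_weights(cov):
    """Fully invested, unconstrained: proportional to Sigma^{-1} 1."""
    raw = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return raw / raw.sum()

def diversification_ratio(w, cov):
    """w' sigma / sqrt(w' Sigma w) with sigma the vector of volatilities."""
    w = np.asarray(w, dtype=float)
    vols = np.sqrt(np.diag(cov))
    return float((w @ vols) / np.sqrt(w @ cov @ w))

cov = np.diag([0.04, 0.01])            # uncorrelated assets, vols 20% and 10%
vols = np.sqrt(np.diag(cov))
w_rp = (1 / vols) / (1 / vols).sum()   # inverse-vol weights: [1/3, 2/3]

rc = risk_contributions(w_rp, cov)     # equal contributions
w_mv = min_variance_weights(cov)       # [0.2, 0.8]
dr = diversification_ratio(w_rp, cov)  # sqrt(2) for two uncorrelated assets
```

Minimum variance leans harder into the low-volatility asset than risk parity does, which is the usual pattern and one reason the two are often run side by side.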
Multi-period and transaction-cost-aware allocation
Single-period optimality undershoots when trading costs and path
dependence matter. Three pragmatic generalisations:
Static plus turnover penalty. A quadratic
$\eta \|w_t - w_{t-1}\|_2^2$ in the
objective reproduces most of the multi-period benefit at the cost of
a single QP per rebalance.
Model predictive control (MPC). Solve a horizon-H optimisation
each period using forecasted returns and costs, apply only the first
step, and repeat. Useful when the forecast horizon and rebalance
cadence are mismatched.
Stochastic dynamic programming. The fully general formulation,
treated in Chapter 5. The closed-form Merton solution is a special
case; the policies of Section 5-04 approximate it under realistic
noise.
Incorporating quadratic transaction costs and limit-order-fill
probabilities yields allocations that trade less during illiquidity
spikes — the bias that almost every realistic objective wants.
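The effect of a quadratic turnover penalty is easy to see in the unconstrained case, where the penalised mean-variance problem has a closed form. The sketch below assumes the objective mu'w - (gamma/2) w'Σw - eta ||w - w_prev||², with no budget or leverage constraints; `gamma`, `eta`, and all inputs are hypothetical.

```python
import numpy as np

def mv_turnover_weights(mu, Sigma, w_prev, gamma=5.0, eta=0.0):
    """Maximise mu'w - (gamma/2) w'Sigma w - eta * ||w - w_prev||_2^2."""
    n = len(mu)
    # First-order condition: (gamma*Sigma + 2*eta*I) w = mu + 2*eta*w_prev
    return np.linalg.solve(gamma * Sigma + 2.0 * eta * np.eye(n),
                           mu + 2.0 * eta * w_prev)

mu = np.array([0.05, 0.07, 0.10])
Sigma = np.array([[0.04, 0.006, 0.0],
                  [0.006, 0.09, 0.012],
                  [0.0, 0.012, 0.16]])
w_prev = np.array([0.5, 0.3, 0.2])        # weights held before the rebalance

etas = [0.0, 0.1, 1.0, 10.0]
turnover = [np.linalg.norm(mv_turnover_weights(mu, Sigma, w_prev, eta=e) - w_prev)
            for e in etas]
print(dict(zip(etas, turnover)))          # trade size shrinks as eta grows
```

With constraints the closed form disappears, but the problem remains a single QP per rebalance, as noted above.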
Mean–variance frontier and performance attribution
The frontier with a risk-free asset has the closed form
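$$w^\star \propto \Sigma^{-1}(\mu - r_f \mathbf{1}), \qquad w^\star_{\mathrm{tan}} = \frac{\Sigma^{-1}(\mu - r_f \mathbf{1})}{\mathbf{1}^\top \Sigma^{-1}(\mu - r_f \mathbf{1})}.$$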
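A short sketch of the tangency closed form: weights proportional to Σ⁻¹(μ − r_f 1), normalised to sum to one. The inputs are invented for illustration, and the normalisation assumes the sum of the unnormalised weights is positive.

```python
import numpy as np

def tangency_weights(mu, Sigma, rf):
    """Sigma^{-1}(mu - rf*1), rescaled so the weights sum to one."""
    raw = np.linalg.solve(Sigma, mu - rf)   # solve instead of forming the inverse
    return raw / raw.sum()

def sharpe(w, mu, Sigma, rf):
    return (w @ mu - rf) / np.sqrt(w @ Sigma @ w)

mu = np.array([0.05, 0.07, 0.10])
rf = 0.02
Sigma = np.array([[0.04, 0.006, 0.0],
                  [0.006, 0.09, 0.012],
                  [0.0, 0.012, 0.16]])

w_tan = tangency_weights(mu, Sigma, rf)
w_eq = np.ones(3) / 3
print(w_tan, sharpe(w_tan, mu, Sigma, rf), sharpe(w_eq, mu, Sigma, rf))
```

On these inputs the tangency portfolio has a higher Sharpe ratio than the equal-weight one, as the closed form implies when the true μ and Σ are known.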
Performance attribution decomposes realised return into contributions
from factors, allocation, and selection. Brinson-style attribution works
for long-only mandates; for long–short, prefer exposure-based
attribution with beta and residual components. Attribution is what makes
the optimum narratable — without it, a stakeholder review produces
arguments rather than decisions.
Scenario-based robustness checks
Stress testing complements optimisation:
Historical replay. Apply candidate weights to crisis periods and
evaluate drawdowns, turnover, and breach counts for hard limits.
Factor shocks. Shock value, momentum, carry, rates, or FX factors
by multiples of their historical volatilities; observe portfolio
responses.
Path-dependent costs. Simulate limit-order execution with
fill probabilities and queue positioning to reveal slippage that a
quadratic cost term hides.
Stress results feed back into constraint tuning or risk overlays — the
same loop the synthetic-data workflow of Chapter 10 closes via generated
counterfactuals.
Numerical stability and solver hygiene
Optimisation routines fail silently without care.
Ill-conditioned covariance. Add diagonal jitter, prefer
Cholesky-based solvers, or reduce dimensionality via factors. Always
check condition number before trusting the inverse.
Scaling. Rescale weights and returns so that typical magnitudes
are around one — gradient methods need this; QPs are forgiving but
benefit from it.
Warm starts. Start from the previous-period weights and gradually
tighten constraints to prevent oscillations in daily rebalances.
Stochastic approximations. For very large universes, use SGD or
coordinate descent on differentiable objectives (risk parity,
quadratic-cost MV) with periodic full-frontier re-projections.
Document solver tolerances and convergence diagnostics as part of the
portfolio build sheet; reproducibility of an optimum requires the
solver's internals, not just the output.
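The covariance-hygiene advice can be sketched as a guarded solve: check the condition number, add diagonal jitter until it is acceptable, then solve through a Cholesky factor. The threshold `max_cond` and the jitter schedule are arbitrary assumptions for illustration, not recommended production values.

```python
import numpy as np

def stable_solve(Sigma, b, max_cond=1e8, jitter0=1e-10):
    """Solve (Sigma + jitter*I) x = b, growing the jitter until well conditioned."""
    Sigma = 0.5 * (Sigma + Sigma.T)                # enforce symmetry
    n = Sigma.shape[0]
    scale = np.trace(Sigma) / n                    # jitter relative to average variance
    jitter, A = 0.0, Sigma
    while np.linalg.cond(A) > max_cond:
        jitter = jitter0 * scale if jitter == 0.0 else jitter * 10.0
        A = Sigma + jitter * np.eye(n)
    L = np.linalg.cholesky(A)                      # raises LinAlgError if A is not PD
    # Two solves against the factor; scipy.linalg.cho_solve would also
    # exploit the triangular structure.
    x = np.linalg.solve(L.T, np.linalg.solve(L, b))
    return x, jitter, np.linalg.cond(A)

# Two perfectly collinear assets: the raw covariance is singular.
Sigma_bad = np.array([[0.04, 0.04], [0.04, 0.04]])
b = np.ones(2)
x, jitter, cond_after = stable_solve(Sigma_bad, b)
print(x, jitter, cond_after)
```

Recording `jitter` and `cond_after` alongside the weights is one way to capture the solver diagnostics that the build sheet should document.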
Linking to reinforcement-learning policies
The portfolio policies of Chapter 5 build on these structures. Two
practical bridges:
Action spaces. RL agents output target weights or trades; the
action space must respect leverage and liquidity constraints. Use
squashing functions (tanh) and projection layers onto feasible sets,
or parameterise directly on the simplex with a Dirichlet head.
Reward shaping. Encode risk preferences via utility, penalise
turnover and limit breaches, and normalise rewards by realised
volatility so training is stable across regimes.
Aligning RL environments with the classical portfolio theory in this
section is what stops agents from discovering policies that look great
in simulation but are unrecognisable to a risk officer.
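One way to realise a projection layer is the Euclidean projection onto the probability simplex, sketched below with the standard sort-based algorithm. Treating the feasible set as long-only and fully invested is an assumption of this example; leverage or liquidity limits would need a different feasible set.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = ks[u - css / ks > 0][-1]             # number of strictly positive weights
    theta = css[rho - 1] / rho                 # uniform shift applied before clipping
    return np.maximum(v - theta, 0.0)

# Hypothetical squashed network output, not yet a valid weight vector.
raw_action = np.tanh(np.array([2.0, 0.3, -1.0]))
weights = project_to_simplex(raw_action)
print(raw_action, weights)
```

Points already on the simplex are left unchanged, so the layer only intervenes when the raw action is infeasible.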
Governance and model risk management
Institutional portfolios operate under governance, and the optimisation
loop is no exception:
Model inventory. Log optimisation configurations, parameter
priors, and constraint sets; attach performance history and known
failure modes.
Challenge sessions. Periodically stress assumptions (estimation
horizon, cost curves) and record outcomes. Rotate challengers to
avoid groupthink.
Override protocols. Define when human overrides are allowed
(liquidity crises, system outages) and how they are rolled back.
Overrides must be auditable and must not bypass pre-trade risk
checks.
Governance is what keeps sophisticated optimisation accountable. A
well-specified static benchmark is also what lets you justify the next
chapter's switch to dynamic optimisation: the reader knows exactly
which preferences and frictions the dynamic version is being asked to
respect.
Classical Time-Series Models
Classical time-series models are the first diagnostic tool for any
financial series. They impose structure that is transparent, quick to
estimate, and easy to stress. Even when deep models eventually replace
them in production, the residuals of an ARIMA or a GARCH fit reveal
whether more complex architectures are fighting the data or are merely
re-discovering the structure these classical baselines already capture.
This section sets up the family — ARIMA, GARCH, VAR, state-space, plus a
few specialised tools — that recurs throughout Chapters 4 and 6.
Univariate dynamics: AR, MA, ARIMA, SARIMA
An autoregressive model of order p writes
$$r_t = c + \sum_{i=1}^{p} \phi_i r_{t-i} + \epsilon_t,$$
with white-noise innovations ϵt. A moving-average MA(q)
captures shock propagation,
$$r_t = c + \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \epsilon_t.$$
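Combining both and allowing differencing gives ARIMA(p,d,q),
$$\phi(L)(1 - L)^d r_t = c + \theta(L)\epsilon_t,$$
with L the lag operator, d the integer order of differencing, and lag
polynomials $\phi(L) = 1 - \sum_{i=1}^{p} \phi_i L^i$ and
$\theta(L) = 1 + \sum_{i=1}^{q} \theta_i L^i$.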
Seasonal ARIMA (SARIMA) layers seasonal differencing on top:
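$$\Phi(L^S)\phi(L)(1 - L)^d (1 - L^S)^D y_t = \Theta(L^S)\theta(L)\epsilon_t,$$
with period S, seasonal lag polynomials $\Phi$ and $\Theta$, and seasonal
differencing order D. Use information criteria (AIC / BIC / HQIC) and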
diagnostics on residual autocorrelation to choose parsimonious orders.
Finance often prefers low orders (1–2) to avoid chasing noise.
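A minimal sketch of order selection by information criterion: simulate an AR(1), fit AR(p) by ordinary least squares for a few candidate orders on a common sample, and compare BIC. OLS stands in for full maximum likelihood here, and the simulated series and helper name `fit_ar_ols` are assumptions made for illustration.

```python
import numpy as np

def fit_ar_ols(r, p, max_p):
    """OLS fit of AR(p) with intercept on a common sample; returns (coefficients, BIC)."""
    T = len(r)
    y = r[max_p:]                                   # same target sample for every p
    cols = [np.ones(T - max_p)] + [r[max_p - i:T - i] for i in range(1, p + 1)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    bic = n * np.log(resid @ resid / n) + (p + 1) * np.log(n)
    return beta, bic

# Simulate r_t = 0.5 * r_{t-1} + eps_t (hypothetical data-generating process).
rng = np.random.default_rng(0)
T, phi_true = 5000, 0.5
eps = rng.standard_normal(T)
r = np.zeros(T)
for t in range(1, T):
    r[t] = phi_true * r[t - 1] + eps[t]

max_p = 3
fits = {p: fit_ar_ols(r, p, max_p) for p in range(max_p + 1)}
best_p = min(fits, key=lambda p: fits[p][1])
print("BIC-selected order:", best_p, "AR(1) coefficient:", fits[1][0][1])
```

Holding the estimation sample fixed across candidate orders keeps the criteria comparable; in practice a library routine would also handle the MA terms and differencing.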
Exponential smoothing (ETS) decomposes the series into level, trend,
and seasonal components and updates each with its own smoothing
parameter. ETS lacks an explicit autocorrelation structure but is
remarkably stable on short series — a reliable benchmark before
escalating to ARIMA.
Modelling volatility: GARCH and friends
Returns often have little autocorrelation while squared returns clearly
do — that is the volatility-clustering stylised fact of Section 02-01.
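GARCH models capture it by writing the conditional variance as a
recursion. The workhorse GARCH(1,1) reads
$$\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2,$$
with $\omega > 0$, $\alpha, \beta \ge 0$, and $\alpha + \beta < 1$ for a
finite unconditional variance. Common variants:
EGARCH. Models $\log \sigma_t^2$, so the variance stays positive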
without parameter constraints; admits asymmetric responses to positive
and negative shocks.
GJR-GARCH. Adds a leverage term so negative shocks raise variance
more than positive shocks of the same size.
Component GARCH. Splits the conditional variance into long- and
short-run components — better fit on series with persistent volatility
regimes.
Conditional variance forecasts feed directly into portfolio risk limits
(Section 02-03) and volatility-scaling rules. Long-memory series
(absolute returns with slowly decaying autocorrelation) are better
captured by ARFIMA with fractional differencing $d \in (0, 0.5)$,
estimated by Geweke–Porter–Hudak log-periodogram regression or Whittle
likelihood; ARFIMA-GARCH hybrids are a common choice for
realised-volatility forecasting.
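The volatility-clustering fact can be reproduced with a few lines of simulation. The sketch assumes the textbook GARCH(1,1) recursion σ_t² = ω + α ε_{t−1}² + β σ_{t−1}² with Gaussian shocks; the parameter values are hypothetical.

```python
import numpy as np

def simulate_garch11(T, omega, alpha, beta, rng):
    """Simulate returns eps_t = sigma_t * z_t with a GARCH(1,1) variance."""
    sigma2 = np.empty(T)
    eps = np.empty(T)
    sigma2[0] = omega / (1.0 - alpha - beta)        # start at the unconditional variance
    eps[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, T):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return eps, sigma2

def acf1(x):
    """Lag-1 sample autocorrelation."""
    x = x - x.mean()
    return (x[1:] @ x[:-1]) / (x @ x)

rng = np.random.default_rng(0)
omega, alpha, beta = 1e-6, 0.10, 0.85
eps, sigma2 = simulate_garch11(20_000, omega, alpha, beta, rng)
print("ACF(1) of returns:        ", acf1(eps))
print("ACF(1) of squared returns:", acf1(eps ** 2))
```

Returns show almost no lag-1 autocorrelation while squared returns do, mirroring the stylised fact that motivates the model. Estimation on real data would go through a likelihood-based fit rather than simulation.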
Multivariate dynamics: VAR, VECM, DCC
Vector autoregressions model joint evolution:
$$y_t = c + \sum_{i=1}^{p} \Phi_i y_{t-i} + \varepsilon_t.$$
Granger causality tests whether adding lagged $y_j$ improves
prediction of $y_i$ given the rest. Impulse-response analysis traces
how a one-off shock to one variable propagates through the system over
time.
When non-stationary series share a long-run equilibrium (the basis
between a futures contract and its underlying, the relationship between
inflation and short rates), use a Vector Error-Correction Model (VECM).
VECMs difference the series but add an equilibrium-correction term:
$$\Delta y_t = \Pi y_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta y_{t-i} + \varepsilon_t, \qquad \Pi = \alpha \beta^\top.$$
The columns of β define the cointegrating vectors (the spreads
that mean-revert) and α is the adjustment speed. The Johansen
test estimates the cointegration rank. Most pair-trading and
statistical-arbitrage work in financial practice ultimately reduces to a
VECM in disguise.
Dynamic Conditional Correlation (DCC-GARCH) (Engle, 2002)
gives a tractable multivariate volatility model: marginal GARCH per
series with a slowly evolving correlation matrix,
$H_t = D_t R_t D_t$, $D_t = \mathrm{diag}(\sigma_{i,t})$. DCC is the
default risk-model engine in many investment banks because it produces
covariance forecasts cheap enough to refit nightly and stable enough to
plug straight into a mean-variance optimiser.
Section 04-02 (multivariate forecasting) extends these ideas to feature-
enriched panels and brings BVAR shrinkage priors and reduced-rank VAR
into the picture; this section is the generative baseline that those
forecasters compete against.
State-space view and the Kalman filter
State-space models unify trend, seasonality, and dynamics into a single
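framework. With latent state $\alpha_t \in \mathbb{R}^r$ and
observation $y_t \in \mathbb{R}^n$, the linear Gaussian form is
$$\alpha_t = T \alpha_{t-1} + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, Q),$$
$$y_t = Z \alpha_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, R).$$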
The Kalman filter recursively computes the posterior mean and
covariance of the state given observations up to t, and the Kalman
smoother refines those estimates using the full sample (offline). Use
cases that recur throughout the book:
Local-level / local-trend models. $y_t = \mu_t + \varepsilon_t$,
$\mu_t = \mu_{t-1} + \eta_t$: the classical Bayesian filter for
drifting levels.
Time-varying parameter models. AR or regression coefficients evolve
over time, enabling adaptive strategies in changing regimes.
Trend–cycle decomposition. Separate a smooth trend from a
stationary cyclical component for macro forecasting.
Latent-factor extraction. When Z is the loading matrix, the
filter delivers latent factor trajectories — the connecting tissue
to Chapter 6's dynamic factor models.
Beyond the linear Gaussian setting, extended and unscented Kalman
filters handle smooth nonlinearities, and particle filters handle
heavy-tailed noise or full nonlinear dynamics at higher computational
cost. The full state-space machinery — including identification,
smoothing, and EM-based parameter estimation — is the subject of
Chapter 6, which treats it as a first-class topic rather than as a tool.
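For the local-level model the Kalman recursion collapses to a few scalar updates, which makes it a convenient first sketch. The noise variances, the diffuse initial variance `p0`, and the simulated series below are assumptions chosen for illustration.

```python
import random

def local_level_filter(y, var_eps, var_eta, m0=0.0, p0=1e6):
    """Kalman filter for y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t (scalar case)."""
    m, p = m0, p0                      # filtered mean and variance of the level
    means, variances = [], []
    for obs in y:
        p_pred = p + var_eta           # predict: the level drifts, uncertainty grows
        gain = p_pred / (p_pred + var_eps)
        m = m + gain * (obs - m)       # update: move towards the new observation
        p = (1.0 - gain) * p_pred
        means.append(m)
        variances.append(p)
    return means, variances

# Simulate a drifting level observed with noise (hypothetical variances).
random.seed(1)
level, ys = 0.0, []
for _ in range(200):
    level += random.gauss(0.0, 0.1 ** 0.5)
    ys.append(level + random.gauss(0.0, 1.0))

means, variances = local_level_filter(ys, var_eps=1.0, var_eta=0.1)
print(means[-1], variances[-1])
```

The filtered variance does not depend on the data and settles quickly at its steady-state value, so in this special case the filter behaves like exponential smoothing with a fixed gain.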
Seasonality, cycles, and regime changes
Classical models separate deterministic seasonality from stochastic
cycles. A multiplicative decomposition of prices becomes additive in
logs, so for a log price $y_t$
$$y_t = \mu_t + s_t + c_t + \varepsilon_t,$$
with $\mu_t$ a trend, $s_t$ a seasonal of period S, $c_t$ a cyclical
ARMA component, and $\varepsilon_t$ noise. SARIMA captures both trend
and seasonality jointly. For regime changes,
Markov-switching AR processes allow parameters $(\phi_k, \sigma_k)$
to depend on a latent regime $k_t \in \{1, \dots, K\}$ governed by a
transition matrix P. Markov-switching models are the simplest answer
to "the data look like two different processes" and a clean
generalisation of the regime-flag approach in Section 02-01.
Calendar effects deserve their own engineering layer:
Day-of-week and month-end effects. Include seasonal dummies or
deterministic regressors for settlement cycles and fund-flow patterns.
Holiday proximity. Widened spreads and reduced depth around
holidays can be encoded via binary indicators or spline terms.
Intraday seasonality. U-shaped volatility in equities and futures
needs time-of-day factors before applying ARMA or GARCH at intraday
frequency.
Seasonal adjustment prevents spurious autocorrelation and improves
forecast calibration when markets follow habitual rhythms.
Frequency-domain tools
The spectral density f(λ) summarises variance allocation across
frequencies. The periodogram
$$I(\lambda) = \frac{1}{2\pi T} \left| \sum_{t=1}^{T} y_t e^{-i\lambda t} \right|^2$$
highlights dominant cycles; smoothing kernels yield consistent estimates.
Coherence extends the idea to pairs of series, measuring how strongly
they co-move at each frequency, while the accompanying phase spectrum
identifies lead–lag relations: the analogue of cross-correlation in the
frequency domain. Frequency-domain filters (Hodrick–Prescott, Baxter–King,
Christiano–Fitzgerald) isolate business-cycle components for macro
applications.
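A quick sketch of the periodogram at the Fourier frequencies λ_j = 2πj/T, computed with the FFT. Demeaning before the transform and the synthetic 20-period cycle are choices made for this illustration.

```python
import numpy as np

def periodogram(y):
    """I(lambda_j) = |sum_t y_t exp(-i*lambda_j*t)|^2 / (2*pi*T) at lambda_j = 2*pi*j/T."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    dft = np.fft.rfft(y - y.mean())        # demean so the zero frequency does not dominate
    freqs = 2.0 * np.pi * np.arange(len(dft)) / T
    return freqs, np.abs(dft) ** 2 / (2.0 * np.pi * T)

# Synthetic series: a 20-period cycle plus noise.
rng = np.random.default_rng(0)
T = 200
t = np.arange(1, T + 1)
y = np.sin(2.0 * np.pi * t / 20.0) + 0.2 * rng.standard_normal(T)

freqs, power = periodogram(y)
peak = int(np.argmax(power))
print("dominant period:", 2.0 * np.pi / freqs[peak])
```

The raw periodogram is noisy at every frequency, which is why the text recommends kernel smoothing before reading off anything subtler than a dominant peak.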
For intraday data and non-stationary regimes, wavelet and empirical
mode decomposition give time–frequency localisation: trade intraday
mean reversion at one scale while holding weekly momentum at another,
with cost models tuned per scale. Multiscale decomposition also informs
hierarchical RL policies in Chapter 5 with separate intra-episode and
inter-episode dynamics.
Model checking and stability
Regardless of the specification, the diagnostic recipe is the same:
Residual autocorrelation. ACF / PACF should show no significant
spikes; Ljung–Box tests formalise it. Persistent structure means the
lag order is wrong.
Residual normality and tails. Standardised residuals should
resemble iid noise. Heavy tails suggest Student-t innovations
rather than Gaussian.
Conditional heteroskedasticity. ARCH tests on squared residuals;
if they fire, the volatility specification needs work.
Parameter stability. CUSUM and Chow tests detect structural
breaks. Re-estimate on rolling windows when breaks are frequent.
Out-of-sample. Rolling-origin forecasts are the only honest test;
every diagnostic above is necessary, none is sufficient.
Order selection via AIC / BIC balances parsimony and fit, but should
always be tempered by economic judgement.
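The residual-autocorrelation check can be sketched with a hand-rolled Ljung–Box statistic, Q = T(T+2) Σ_k ρ_k² / (T − k). The example compares white noise with an AR(1) series against the 95% chi-square critical value for 10 degrees of freedom; when the input is a set of residuals from a fitted ARMA(p,q), the degrees of freedom would normally be reduced by p + q, which this sketch does not do.

```python
import numpy as np

def ljung_box_q(resid, h=10):
    """Ljung-Box Q statistic over lags 1..h."""
    x = np.asarray(resid, dtype=float)
    x = x - x.mean()
    T = len(x)
    denom = x @ x
    rho = np.array([(x[k:] @ x[:-k]) / denom for k in range(1, h + 1)])
    return T * (T + 2.0) * np.sum(rho ** 2 / (T - np.arange(1, h + 1)))

CHI2_95_DF10 = 18.307   # 95% critical value, chi-square with 10 degrees of freedom

rng = np.random.default_rng(7)
noise = rng.standard_normal(2000)
ar = np.zeros(2000)
for t in range(1, 2000):
    ar[t] = 0.5 * ar[t - 1] + noise[t]     # leftover AR(1) structure

q_noise, q_ar = ljung_box_q(noise), ljung_box_q(ar)
print(q_noise, q_ar, CHI2_95_DF10)
```

A statistic far above the critical value, as for the AR(1) series, is the signal that the mean specification has left structure in the residuals.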
Practical implementation
Rolling estimation. Window length should match the half-life of
regime persistence. Expanding windows reduce variance but adapt
slowly; rolling windows react faster with more variance.
Missing data. Use Kalman smoothing or EM rather than forward-fill.
Software. statsmodels for ARIMA / SARIMA / VAR / VECM,
pmdarima for auto-order selection, arch for GARCH-family,
pykalman and dfm-python for state-space and dynamic-factor models.
Wrap the estimators in reproducible pipelines with versioned
parameters and seeded random draws (Section 02-05).
Bridging to machine learning
Classical models are not just a baseline; they are also feature
generators. Three patterns recur:
Lag features and residual features. AR coefficients, GARCH
conditional volatility, ARIMA residuals, VAR impulse responses — all
feed into the tree and deep models of Chapter 4.
Hybrid predictors. ARIMA for linear structure plus a tree
ensemble or neural net on the residuals; a strong default that often
beats either component alone.
Benchmarks. Every deep model in Chapter 4 should beat an
appropriately-tuned ARIMA / GARCH on validation data; if it does not,
the deep model is the wrong tool, the features are wrong, or the
evaluation is leaking.
The classical layer is where structure becomes visible. The chapters
that follow add capacity, but they do not change the principle: an
honest forecast lives inside an honest data-generating story, and
classical models are how we tell that story.
Estimation and Simulation
Estimating model parameters and simulating scenarios are the two hands
that shape any quantitative workflow. Estimation aligns models with
observed data; simulation explores worlds not yet seen. Together they
provide the inputs to the portfolio construction of Section 02-03 and
the optimal control of Chapter 5 — which means a careful estimation
pipeline pays for itself many times over once the downstream policies
are running on its outputs.
Maximum likelihood and friends
Maximum likelihood (MLE) finds parameters θ that maximise the
likelihood L(θ)=p(y1:T∣θ). Under regularity
conditions, MLE is consistent and asymptotically efficient. In practice:
Use gradient-based optimisers with automatic differentiation when
the likelihood is differentiable (ARIMA, GARCH, neural likelihoods).
Apply regularisation — ridge, lasso, elastic-net — to stabilise
estimates when dimensionality approaches sample size.
For constrained parameters (variances, probabilities), optimise in
unconstrained space via softplus or log transforms; this avoids
fighting the optimiser at the boundary.
Bayesian estimation augments likelihood with priors. Posterior
summaries provide credible intervals and propagate parameter uncertainty
into downstream decisions — a crucial advantage when sample sizes are
small or when stakeholders need explicit uncertainty bands. MCMC
(Hamiltonian Monte Carlo, NUTS) is the workhorse; variational inference
trades exactness for speed and is often good enough for streaming
estimation.
Loss functions as design choices
Every estimator minimises some loss. Stating the loss explicitly aligns
estimation with how the model will be used rather than with what is
easiest to optimise.
Squared error. Conditional means; ideal when the downstream
decision is symmetric in error.
Quantile (pinball) loss. Conditional quantiles; the right loss
when the decision is asymmetric (VaR, capital allocation).
Negative log-likelihood. Calibrated densities; the right loss when
the downstream module consumes the full distribution.
CRPS as a training loss for probabilistic forecasters when the
downstream evaluation is also CRPS. Easier than NLL on heavy-tailed
data.
Custom utility-aligned losses — downside-weighted, turnover-
penalising — that tie estimation directly to a Chapter-5 objective.
The choice of loss is a modelling decision; document it as you would a
prior or a constraint.
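To see why the loss determines the estimand, the sketch below minimises average pinball loss over a constant forecast and recovers a sample quantile, whereas squared error would recover the mean. The data and the choice of τ = 0.9 are made up for the illustration.

```python
def pinball_loss(y, q_hat, tau):
    """Quantile loss: under-prediction costs tau, over-prediction costs 1 - tau."""
    diff = y - q_hat
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

def mean_pinball(ys, q_hat, tau):
    return sum(pinball_loss(y, q_hat, tau) for y in ys) / len(ys)

ys = list(range(1, 100))          # observations 1, 2, ..., 99
tau = 0.9

# Best constant forecast under each loss, searched over the observed values.
best_quantile = min(ys, key=lambda c: mean_pinball(ys, c, tau))
best_mean = min(ys, key=lambda c: sum((y - c) ** 2 for y in ys))
print(best_quantile, best_mean)
```

The same data give two different answers because the two losses price errors differently, which is the sense in which the loss is a design choice.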
Estimation under constraints and regularisation
When sample size is small relative to dimensionality, unconstrained MLE
overfits. Regularisation introduces bias to reduce variance:
Ridge / lasso / elastic-net. Penalise ℓ2 / ℓ1 / mixed
norms of parameters. Lasso encourages sparsity (useful when only a
handful of regressors should matter); ridge is the safer default in
finance where signals are weak and dense.
Positive-definite covariance. Project noisy covariance matrices
onto the nearest PSD matrix via eigenvalue clipping or Tikhonov
regularisation $\hat{\Sigma} + \lambda I$.
Graphical lasso. Sparse inverse-covariance estimates produce more
stable risk matrices and illuminate conditional dependence structures
— useful for risk-parity and factor-neutral construction.
Shrinkage priors. Bayesian horseshoe or spike-and-slab encourage
sparsity while quantifying uncertainty.
Stability selection. Subsample data and record variable inclusion
frequencies to identify robust predictors. The most defensible way to
pick a sparse model when the validation horizon is short.
Constraints embody domain knowledge: non-negativity of variances,
monotonicity of yield curves, no-arbitrage on implied-volatility
surfaces. Estimators that respect these constraints produce more
realistic simulations downstream.
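Eigenvalue clipping is short enough to sketch directly. The example repairs an indefinite correlation-like matrix, of the kind that can arise when pairwise correlations are estimated on different samples; the matrix and the `floor` value are assumptions for illustration.

```python
import numpy as np

def clip_to_psd(S, floor=1e-8):
    """Replace eigenvalues below `floor` and rebuild the symmetric matrix."""
    S = 0.5 * (S + S.T)                          # symmetrise first
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.maximum(vals, floor)) @ vecs.T

# Pairwise correlations that no joint distribution can satisfy.
noisy = np.array([[1.0, 0.9, -0.9],
                  [0.9, 1.0, 0.9],
                  [-0.9, 0.9, 1.0]])
fixed = clip_to_psd(noisy)

print("min eigenvalue before:", np.linalg.eigvalsh(noisy).min())
print("min eigenvalue after: ", np.linalg.eigvalsh(fixed).min())
```

Clipping changes the diagonal, so a correlation matrix usually needs rescaling back to unit diagonal afterwards; that step is omitted here.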
State-space estimation
For linear Gaussian state-space models the Kalman filter (Section 02-04)
provides optimal online estimation of hidden states given observations,
and the smoother refines them using the full sample. When the system
matrices T,Z,Q,R are unknown, the Expectation–Maximisation (EM)
algorithm is the standard parameter-learning recipe:
E-step. Run the filter and smoother to compute sufficient
statistics under the current parameters.
M-step. Update parameters to maximise the expected complete-data
log-likelihood.
EM is robust, monotonically increases the marginal likelihood, and
underpins many time-varying factor and beta models. Convergence is to a
local optimum; multiple random initialisations and a sanity-check on
the log-likelihood path are part of the recipe.
For nonlinear or non-Gaussian state-space models, particle filters
combined with particle MCMC give samples from the joint posterior
over states and parameters. Slower; appropriate when the linear-Gaussian
approximation is materially wrong.
Calibration and parameter uncertainty
Point estimates hide uncertainty. Three complementary tools:
Fisher information / asymptotic confidence intervals. Cheap,
asymptotic, often optimistic on financial sample sizes.
Parametric or nonparametric bootstrap. Refit the model on
resampled data; the empirical distribution of estimates is the
uncertainty band. Block bootstraps (next subsection) for time series.
Bayesian posterior. Treat parameters as random and propagate the
full posterior into simulation. The right answer when downstream
decisions need credible intervals.
Sensitivity analysis — perturbing parameters within a confidence set and
re-running the downstream pipeline — reveals stability of decisions, not
just of estimates. Decisions that are stable under parameter
perturbation are the ones to ship.
Bootstrap, block bootstrap, and jackknife
Resampling quantifies uncertainty without strong parametric assumptions.
IID bootstrap fails for dependent series, so use block methods:
Moving block bootstrap. Resample contiguous blocks of length
ℓ to preserve autocorrelation. Choose ℓ proportional to the
half-life of the dominant autocorrelation.
Stationary bootstrap. Random block lengths with geometric
distribution mitigate boundary artifacts.
Circular bootstrap. Wrap the series end-to-start to reduce edge
effects on short samples.
Jackknife leave-one-out approximations provide cheap bias estimates
for smooth statistics. Combined with parameter shrinkage, resampling
methods give honest uncertainty bands under data scarcity.
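A minimal moving-block bootstrap, assuming a fixed block length and blocks drawn uniformly with replacement. The AR(1) series and the block length of 25 are illustrative choices, not recommendations.

```python
import numpy as np

def moving_block_bootstrap(x, block_len, rng):
    """Resample contiguous blocks of length block_len and trim to the original length."""
    x = np.asarray(x)
    T = len(x)
    n_blocks = -(-T // block_len)                        # ceiling division
    starts = rng.integers(0, T - block_len + 1, size=n_blocks)
    idx = (starts[:, None] + np.arange(block_len)[None, :]).ravel()[:T]
    return x[idx]

# Autocorrelated toy series: the IID formula understates the error of the mean.
rng = np.random.default_rng(0)
T = 1000
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()

boot_means = [moving_block_bootstrap(x, 25, rng).mean() for _ in range(500)]
se_block = np.std(boot_means)
se_iid = x.std() / np.sqrt(T)
print("block-bootstrap SE:", se_block, "naive IID SE:", se_iid)
```

Within each block the original ordering survives, which is what preserves the short-range dependence that an IID resample would destroy.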
Monte Carlo for scenario generation
Simulation serves two purposes: forward-looking scenarios for stress
testing, and algorithm benchmarking. Three layers:
Parametric. Draw $r \sim \mathcal{N}(\mu, \Sigma)$
for the simplest case; layer GARCH on top for volatility clustering;
add a Student-t or skew-t marginal for heavy tails.
Copula-based. Fit margins and a copula separately; sample by
drawing from the copula and inverting the margins. The right answer
when the dependence structure is what matters most for the stress
test.
Generative. Diffusion, GAN, or VAE generators that learn the joint
distribution end-to-end. Treated in detail in Chapter 10; the Monte
Carlo point of view here is a useful prior for that work.
Variance reduction (antithetic variates, control variates, importance
sampling) accelerates Monte Carlo whenever the quantity of interest has
identifiable structure — almost always a worthwhile investment for
risk-tail scenarios.
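Antithetic variates are the cheapest of these techniques to demonstrate. The sketch estimates E[exp(Z)] for standard normal Z, whose true value is exp(0.5), using the same number of function evaluations with and without antithetic pairing; the integrand is a stand-in chosen because the answer is known.

```python
import numpy as np

def compare_antithetic(f, n_pairs, rng):
    """Return per-pair estimates from independent draws and from antithetic draws."""
    z = rng.standard_normal(n_pairs)
    z_indep = rng.standard_normal(n_pairs)
    plain_pairs = 0.5 * (f(z) + f(z_indep))    # two independent evaluations
    anti_pairs = 0.5 * (f(z) + f(-z))          # evaluation at z and its mirror image
    return plain_pairs, anti_pairs

rng = np.random.default_rng(42)
plain_pairs, anti_pairs = compare_antithetic(np.exp, 50_000, rng)

print("true value:         ", np.exp(0.5))
print("antithetic estimate:", anti_pairs.mean())
print("variance per pair, plain vs antithetic:", plain_pairs.var(), anti_pairs.var())
```

The gain comes from the negative correlation between f(z) and f(-z), which holds for monotone integrands; for a symmetric integrand the pairing would bring no benefit.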
Validation of simulated worlds
A simulator is only useful if it preserves the aspects of reality
relevant to the decision at hand. The validation protocol has four
layers, in increasing order of strength:
Marginal fit. Mean, std, skew, kurtosis per series compared to
real; KS or energy distance tests on returns.
Autocorrelation match. ACF of returns and ACF of squared /
absolute returns. A simulator that breaks volatility clustering is
unusable for risk modelling.
Tail coverage. VaR, ES at 1%, 0.5%, 0.1% on simulated vs. real
paths. Tail under-coverage is the failure mode that quietly breaks
stress tests.
Downstream test. Train a forecaster (Chapter 4) or a policy
(Chapter 5) on simulated data; evaluate on real held-out data.
Performance in the same band as training-on-real means the simulator
preserved the right structure; a collapse means the simulator is
leaking spurious correlations.
This protocol applies whether the generator is a parametric Gaussian
copula or a neural diffusion model.
Data quality and error models
Estimation is only as good as the input quality. Model the imperfections
explicitly:
Outlier models. Mixture distributions with contamination
components capture occasional bad prints; downweight those points in
the likelihood rather than dropping them silently.
Missingness mechanisms. Distinguish missing-at-random (MAR) from
missing-not-at-random (MNAR); use EM with explicit observation indicators
rather than imputing in advance.
Latency and revision. Macro series are revised; maintain vintages
and estimate models on real-time data to reflect the information set
available at the decision moment.
Quality flags as features. Forward-propagate per-row quality
indicators (imputed, stale, out-of-hours) so downstream models can
downweight them.
Document data filters and error assumptions so simulation draws reflect
the same imperfections seen in production.
Sensitivity analysis and parameter heatmaps
Before trusting an estimator, map how outputs change with inputs:
Sweep estimation windows, decay factors, and regularisation strengths;
visualise stability bands for downstream metrics (Sharpe, turnover).
Compute Sobol (variance-based) sensitivity indices when models are
cheap to evaluate or can be emulated.
Present parameter heatmaps to stakeholders so they can see fragile
regions and avoid operating near cliffs where small input changes flip
sign or magnitude.
These tools reduce model risk and guide hyperparameter defaults.
Bridging classical and deep estimation
Deep models rely on the same principles. Training a neural forecaster
with negative log-likelihood is still MLE; using quantile loss is still
estimating conditional quantiles. Regularisation, cross-validation,
and calibration checks mirror the classical toolkit. The main difference
is capacity: deep models can approximate complex nonlinearities,
demanding careful evaluation to avoid learning artefacts of the training
regime rather than genuine market structure. Chapters 4 and 6 each
revisit estimation under this view.
Linking estimation, simulation, and optimisation
Estimation produces parameters; simulation explores scenarios;
optimisation selects decisions. Keep the pipeline modular so
improvements in any component can be swapped without rewriting the
others. Concretely:
A portfolio optimiser (Section 02-03) consumes a generic scenario
matrix, whether it came from a bootstrapped history or a deep
generative model.
A reinforcement-learning policy (Chapter 5) consumes an environment
whose dynamics may come from any of the simulators above; the policy
itself does not care.
A monitoring layer (Section 02-02) consumes a stream of forecasts and
outcomes; the forecasting model can be replaced without retraining
the monitor.
Logging interfaces that record which estimator and seed produced each
parameter set or scenario simplify audits and reproducibility.
End-to-end reproducibility playbook
Immutable datasets. Versioned Parquet/Arrow snapshots with
hashing; no on-the-fly downloading during estimation runs.
Deterministic randomness. Seed every PRNG (Python, NumPy,
PyTorch, simulation libraries) and log the seeds with results.
Configuration management. Typed configs (Pydantic, OmegaConf,
Hydra) that specify estimator choices, priors, simulation seeds, and
validation windows.
CI hooks. Lightweight estimation–simulation sanity checks on every
commit so drifts are caught early.
Result manifests. Every artefact (parameter set, scenario file,
forecast) carries a manifest with the input data hash, the config,
and the seed.
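A result manifest can be as small as a dictionary holding the data hash, the config, and the seed. The sketch below uses only the standard library; the field names, the toy bootstrap estimator, and the config keys are hypothetical.

```python
import hashlib
import json
import random

def run_with_manifest(data, config, seed):
    """Run a toy seeded estimation and return (estimate, manifest)."""
    rng = random.Random(seed)                        # local PRNG, no global state
    resample = [rng.choice(data) for _ in data]      # toy bootstrap resample
    estimate = sum(resample) / len(resample)
    data_bytes = json.dumps(data).encode("utf-8")    # stand-in for a snapshot file
    manifest = {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "config": config,
        "seed": seed,
    }
    return estimate, manifest

data = [0.01, -0.02, 0.015, 0.003, -0.007]
config = {"estimator": "bootstrap_mean", "n_resamples": 1}

est_a, manifest_a = run_with_manifest(data, config, seed=123)
est_b, manifest_b = run_with_manifest(data, config, seed=123)
print(est_a == est_b, json.dumps(manifest_a, sort_keys=True))
```

Re-running with the same data, config, and seed reproduces both the estimate and the manifest, which is the property the playbook is after.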
A reproducible estimation–simulation backbone is the difference between
"the model performed well" and "we can re-run that result and explain
why." Every later chapter consumes this backbone, so the discipline
introduced here pays the largest compounding interest in the book.