Chapter 02

Theoretical Basics

"In every job that must be done, there is an element of fun. You find the fun, and snap! The job's a game." Mary Poppins (1964)

There is no fun in financial mathematics without the right framing. This chapter assembles the mathematical toolkit that underpins the rest of the book (return conventions, forecast evaluation, portfolio theory, classical time-series models, estimation and simulation) and keeps each piece close to the trading or risk decision it enables. The theory is the spoonful of sugar; the operational use is the medicine that makes everything later actually matter. The approach throughout is intuition first, formulas when needed.

This Chapter Covers

  • Return conventions, compounding rules, and the empirical facts that break Gaussian assumptions.
  • Forecasting principles and evaluation frameworks that keep models honest.
  • Portfolio theory as an optimisation problem with realistic constraints and modern estimators.
  • Classical time-series baselines (ARIMA, GARCH, VAR) and state-space thinking.
  • Estimation and simulation techniques that bridge statistical models and downstream decision modules.

Returns and Distributions

We begin with the unit of account for every model that follows: returns. Prices are path-dependent, currency-specific, and scale with splits or corporate actions. Returns render those raw prices into dimensionless quantities that can be aggregated, compared across assets, and fed into optimisation or simulation routines. The chapter stays practical, anchoring each formula to the trading or risk decision it enables, and the conventions it introduces (notation for prices $P_t$, cash distributions $D_t$, simple returns $R_t$, and log returns $r_t$) are used unchanged for the rest of the book.

Simple, log, and excess returns

Let $P_t$ be the asset price at time $t$ and $D_t$ the cash distribution over $(t-1, t]$.

  • Simple return $R_t = \frac{P_t + D_t}{P_{t-1}} - 1$. Intuitive for performance attribution and compatible with portfolio weights summing to one.
  • Log return $r_t = \ln \frac{P_t + D_t}{P_{t-1}} = \ln(1 + R_t)$. Converts multiplicative compounding into addition, $r_{t \to t+h} = \sum_{k=1}^{h} r_{t+k}$, and stays well-defined for long horizons or near-zero prices.
  • Excess return $R_t^{e} = R_t - R_t^{f}$ for risk-free rate $R_t^{f}$. Centres the distribution on the economic compensation for risk.

Use simple returns for cash-flow modelling and reporting; log returns for analytics and anything involving multi-period summations. In code, store both so that downstream modules can select the appropriate representation.

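import polars as pl

# Daily adjusted closes, one row per (Ticker, Date).
prices = pl.read_csv("data/prices_daily.csv", try_parse_dates=True)
returns = (
    prices.sort(["Ticker", "Date"])
    .with_columns(
        # .over("Ticker") keeps each return inside its own ticker, so the first
        # row of one ticker never differences against the last row of another.
        pl.col("AdjClose").pct_change().over("Ticker").alias("simple_ret"),
        pl.col("AdjClose").log().diff().over("Ticker").alias("log_ret"),
    )
)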

Compounding, annualisation, and aggregation

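For $T$ periods with log returns $r_1, \dots, r_T$, cumulative log return is $r_{1:T} = \sum_{t=1}^{T} r_t$ and cumulative growth $G_T = \exp(r_{1:T})$. Annualise a daily series (252 trading days per year) by

\[
\mu_{\text{ann}} = 252\,\bar{r}, \qquad \sigma_{\text{ann}} = \sqrt{252}\,\sigma_{\text{daily}}.
\]

Scaling assumes iid increments; when volatility clusters, replace constants with realised counts per regime (e.g., 21-day rolling windows) to avoid overstating confidence during calm periods.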

Returns rarely align to a single clock. When aggregating intraday to daily or daily to monthly, watch for microstructure noise and overlapping-window bias. For simple returns, the $h$-period compounded return is $R_{t \to t+h} = \prod_{k=1}^{h} (1 + R_{t+k}) - 1$. For log returns, additivity simplifies alignment: $r_{t \to t+h} = \sum_{k=1}^{h} r_{t+k}$.

Currency conversion adds another layer: if the asset return in local currency is $R_t^{\text{loc}}$ and the FX return is $R_t^{\text{FX}}$, the USD return is $R_t^{\text{USD}} = (1 + R_t^{\text{loc}})(1 + R_t^{\text{FX}}) - 1$. Keep FX returns as separate features so that hedged and unhedged views are auditable.
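
The compounding, annualisation, and currency rules above can be checked numerically. The following is a minimal sketch using only the standard library; the return values, the 252-day convention, and the helper names `annualise` and `to_usd` are illustrative assumptions rather than part of the book's codebase.

```python
import math

# Hypothetical daily simple returns for one asset (illustrative numbers only).
simple = [0.01, -0.02, 0.015]

# Log returns: r_t = ln(1 + R_t).
log_rets = [math.log1p(R) for R in simple]

# Multi-period simple return by compounding: prod(1 + R) - 1.
compounded = math.prod(1.0 + R for R in simple) - 1.0

# The same quantity recovered from log returns: exp(sum r) - 1.
from_logs = math.expm1(sum(log_rets))


def annualise(mean_daily_log, std_daily_log, periods=252):
    """Scale daily log-return moments to one year, assuming iid increments."""
    return periods * mean_daily_log, math.sqrt(periods) * std_daily_log


def to_usd(local_ret, fx_ret):
    """Combine a local-currency simple return with an FX simple return."""
    return (1.0 + local_ret) * (1.0 + fx_ret) - 1.0
```

Both routes to the multi-period return agree to floating-point precision, which is the practical reason to store log returns alongside simple returns.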

Stylised facts to respect

Real markets violate Gaussian fairy tales. Four stylised facts recur and shape almost every modelling choice in the book:

  • Heavy tails. Empirical kurtosis exceeds 3; drawdown risk is non-negligible at any horizon.
  • Volatility clustering. $r_t^2$ exhibits strong positive autocorrelation, the empirical ground for GARCH (Section 02-04) and realised-volatility features.
  • Asymmetry and leverage. Negative shocks often precede higher volatility; downside and upside tails differ.
  • Mild serial correlation at low frequency. Slow-moving factors create predictive structure over weeks or months — what the dynamic factor models of Chapter 6 try to capture.

A backtest that assumes iid normal noise will be fragile unless the target horizon truly washes out these effects.

Diagnostics and good hygiene

Before fitting fancy models, interrogate the distribution and dynamics:

  • Descriptive statistics per asset: mean, std, skew, kurtosis.
  • Normality tests (Jarque–Bera), autocorrelation (Ljung–Box), conditional heteroskedasticity (ARCH test).
  • Visual checks: histograms with Gaussian overlay, QQ plots, rolling volatility.

# Per-ticker descriptive statistics of daily log returns.
summary = returns.group_by("Ticker").agg([
    pl.col("log_ret").mean().alias("mean"),
    pl.col("log_ret").std().alias("std"),
    pl.col("log_ret").skew().alias("skew"),
    # Polars reports excess (Fisher) kurtosis by default: 0 for a Gaussian, not 3.
    pl.col("log_ret").kurtosis().alias("kurt"),
])

Diagnostics shape preprocessing — demeaning by asset, volatility scaling, outlier winsorisation, regime segmentation. Clean inputs reduce downstream model variance more than elaborate architectures.
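
To make the normality test in the list above concrete, here is a minimal sketch of the Jarque–Bera statistic computed from sample moments. The function name and the simulated series are assumptions for illustration; in practice a statistics library would supply the test together with its p-value.

```python
import random


def jarque_bera(xs):
    """Return (JB statistic, sample skewness, sample kurtosis).

    Uses biased (divide-by-n) central moments. Under normality JB is
    asymptotically chi-squared with 2 degrees of freedom, so values
    above roughly 5.99 reject normality at the 5% level.
    """
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2  # raw kurtosis: 3 for a Gaussian
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb, skew, kurt


rng = random.Random(0)
# Hypothetical heavy-tailed returns: mostly calm days, occasional large shocks.
heavy = [rng.gauss(0.0, 0.05 if rng.random() < 0.05 else 0.01) for _ in range(5000)]
jb_heavy, skew_heavy, kurt_heavy = jarque_bera(heavy)
```

The mixture series mimics the heavy-tails fact listed earlier: its kurtosis sits well above 3 and the statistic lands far beyond the 5% critical value.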

Tail risk, drawdowns, and EVT

Risk management lives in the tails. Three measures show up everywhere later, and they do not say the same thing.

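  • Value at Risk (VaR) at level $\alpha$: $\mathrm{VaR}_\alpha = \inf\{\ell : \Pr(L \le \ell) \ge \alpha\}$ for loss $L = -r$. Easy to communicate, not coherent (not subadditive).
  • Expected Shortfall (ES / CVaR): $\mathrm{ES}_\alpha = \mathbb{E}[L \mid L \ge \mathrm{VaR}_\alpha]$. Coherent and differentiable, which is why optimisation routines (Section 02-03) prefer it.
  • Maximum drawdown: $\mathrm{MDD} = \max_t (1 - V_t / M_t)$ for running peak $M_t = \max_{s \le t} V_s$ and value $V_t$. Relevant for levered strategies and behavioural risk caps.

# Growth of one unit invested, accumulated separately for each ticker.
cum = returns.with_columns(
    pl.col("log_ret").fill_null(0).cum_sum().exp().over("Ticker").alias("growth")
)
# Running peak per ticker, then drawdown relative to that peak (values <= 0).
drawdowns = cum.with_columns(
    pl.col("growth").cum_max().over("Ticker").alias("peak")
).with_columns(
    (pl.col("growth") / pl.col("peak") - 1).alias("dd")
)
# The most negative drawdown per ticker equals -MDD in the notation above.
max_drawdown = drawdowns.group_by("Ticker").agg(pl.col("dd").min().alias("max_dd"))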
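
The first two measures in the list can also be estimated directly from a return sample. The sketch below is a plain historical estimator under one common quantile convention; the function name and the choice of convention are assumptions, and other conventions (interpolated quantiles, for instance) give slightly different numbers.

```python
import math


def historical_var_es(returns, alpha=0.95):
    """Empirical VaR and ES of losses L = -r at confidence level alpha.

    VaR is the smallest observed loss whose empirical CDF is >= alpha;
    ES is the average of the losses at or beyond that VaR.
    """
    losses = sorted(-r for r in returns)
    k = max(math.ceil(alpha * len(losses)) - 1, 0)
    var = losses[k]
    tail = losses[k:]
    es = sum(tail) / len(tail)
    return var, es


# Hypothetical sample of ten daily returns.
sample = [-0.10, -0.05, -0.02, 0.0, 0.01, 0.02, 0.03, 0.01, 0.0, 0.02]
var_90, es_90 = historical_var_es(sample, alpha=0.9)
```

ES is never smaller than VaR at the same level, because it averages only the losses at or beyond the VaR threshold.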

For tail modelling beyond empirical quantiles, Extreme Value Theory (EVT) gives a principled fit. The Generalised Pareto Distribution (GPD) approximates exceedances above a high threshold $u$ with shape parameter $\xi$ and scale $\beta$. Backtesting EVT-based VaR requires independence checks of exceedances and stability of $\xi$ across rolling samples; without those, the GPD fit is over-confident exactly where over-confidence is dangerous.

Robust estimation of moments

Sample moments degrade in the presence of outliers or regime shifts. Robust estimators — median, median absolute deviation (MAD), shrinkage covariance — are not optional dressing; they are the default in any production pipeline.

A widely used shrinkage covariance is Ledoit–Wolf \citep{ledoitwolf2004},

\[
\hat{\Sigma} = \delta F + (1 - \delta) S,
\]

where $S$ is the sample covariance, $F$ a structured target (constant-correlation, single-factor), and $\delta \in [0, 1]$ the shrinkage intensity estimated from data. Robust moments feed directly into portfolio optimisation (Section 02-03) and scenario generation (Chapter 10).
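
As a rough illustration of the shrinkage idea, the sketch below blends a noisy sample covariance with a constant-correlation target. It assumes NumPy is available and, importantly, fixes the shrinkage intensity by hand; it does not implement the Ledoit–Wolf estimate of that intensity. Function names and the simulated data are hypothetical.

```python
import numpy as np


def constant_correlation_target(S):
    """Target with sample variances on the diagonal and the average
    pairwise correlation applied to every off-diagonal entry."""
    std = np.sqrt(np.diag(S))
    corr = S / np.outer(std, std)
    n = S.shape[0]
    rho_bar = (corr.sum() - n) / (n * (n - 1))
    F = rho_bar * np.outer(std, std)
    np.fill_diagonal(F, np.diag(S))
    return F


def shrink_covariance(S, delta):
    """Convex combination delta * F + (1 - delta) * S, with 0 <= delta <= 1."""
    F = constant_correlation_target(S)
    return delta * F + (1.0 - delta) * S


rng = np.random.default_rng(0)
X = rng.standard_normal((30, 20))   # 30 observations of 20 assets: a noisy S
S = np.cov(X, rowvar=False)
# delta is an arbitrary illustrative choice, NOT the data-driven estimate.
S_shrunk = shrink_covariance(S, delta=0.5)
```

With few observations per asset the sample matrix is badly conditioned; the shrunk matrix keeps the same variances but has a much smaller condition number, which is what stabilises downstream optimisers.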

Dependence, co-movement, and multivariate tails

Risk in a portfolio depends on joint behaviour, not just per-asset moments. Four tools, in increasing order of generality:

  • Linear correlation $\rho_{ij} = \operatorname{Cov}(r_i, r_j) / (\sigma_i \sigma_j)$. Easy to compute, blind to tail co-movement.
  • Rank correlation (Spearman, Kendall $\tau$). Captures monotonic relationships and is resilient to outliers and nonlinearities.
  • Copulas. Model the joint distribution via marginal CDFs $F_i$ and a copula $C$ such that $F(x_1, \dots, x_n) = C(F_1(x_1), \dots, F_n(x_n))$. Gaussian copulas enforce elliptical symmetry; Student-$t$ copulas admit tail dependence. The tail-dependence coefficient $\lambda_L$ summarises how often two assets crash together.
  • Dynamic conditional correlation (DCC-GARCH) \citep{engle2002dcc}. Marginal GARCH per series with a slowly-evolving correlation matrix — the workhorse risk-model engine in many investment banks.
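
The contrast between the first two tools in the list is easy to see on a toy series. This is a small standard-library sketch; the helper names are hypothetical and the rank function assumes no tied values.

```python
def pearson(x, y):
    """Linear (Pearson) correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def ranks(values):
    """Ranks 1..n; assumes all values are distinct (ties are not averaged)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
y = [v ** 5 for v in x]   # monotone but strongly nonlinear
y[-1] = 50.0              # one extreme observation, still the largest
```

The relationship is perfectly monotone, so the rank correlation is 1, while the single extreme value drags the linear correlation well below that.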

For multivariate stress tests, copula-based EVT combines parametric tails on the margins with a copula on the dependence: transform returns to uniforms, fit a Gaussian / $t$ / vine copula, and invert with EVT margins. Stress tests built this way expose scenarios where diversified books still incur synchronous losses.

When feeding features into machine-learning models, decorrelate inputs with PCA or autoencoders to reduce redundancy, but retain interpretable factors (value, momentum, carry) for explainability — the dynamic-factor machinery of Chapter 6 builds on this view.

Microstructure considerations

High-frequency prices embed bid-ask bounce, order-book depth, and latency artifacts. Preprocessing steps that stabilise returns include:

  • Mid-quote returns instead of last trade to dampen bid-ask oscillation.
  • Volume- or dollar-weighted bars to align sampling with liquidity rather than wall-clock time.
  • Subsampling or pre-averaging to mitigate autocorrelation induced by microstructure noise.
  • Handling zero returns common in illiquid names; tick-size effects matter when scaling volatility.

These details prevent spurious predictability and improve the realism of simulated trading costs in Chapter 10.

Regime awareness

Markets alternate between calm and stressed regimes. A binary regime indicator — based on rolling realised volatility, credit spreads, or a liquidity proxy — is the simplest tool that prevents a model trained on one environment from over-confidently extrapolating into another. Hidden Markov models and Bayesian change-point detection refine the same idea. Regime probabilities recur as state variables in Chapter 5's policies and as conditioning inputs in Chapter 10's generators.
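
A binary indicator of this kind takes only a few lines. The sketch below flags a stressed regime when trailing realised volatility exceeds a threshold; the window, the threshold, the zero-mean volatility estimate, and the function names are all illustrative assumptions.

```python
import math


def rolling_vol(returns, window):
    """Trailing realised volatility using only returns up to and including t.

    Uses the root mean square of returns (zero-mean assumption).
    """
    out = [None] * len(returns)
    for t in range(window - 1, len(returns)):
        chunk = returns[t - window + 1 : t + 1]
        out[t] = math.sqrt(sum(r * r for r in chunk) / window)
    return out


def regime_flags(returns, window=21, threshold=0.02):
    """1 = stressed, 0 = calm, None = not enough history yet."""
    vols = rolling_vol(returns, window)
    return [None if v is None else int(v > threshold) for v in vols]


calm = [0.005, -0.005] * 15     # 30 quiet days
stress = [0.04, -0.04] * 5      # 10 turbulent days
flags = regime_flags(calm + stress, window=5, threshold=0.02)
```

Because the window is strictly trailing, the flag switches on only after stressed observations enter the window, which keeps the indicator usable as a causal feature.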

Cross-sectional dependence and factor hygiene

Equity universes share factor exposures (value, momentum, quality), and that shared structure has to be respected to avoid double-counting.

  • Factor-neutral returns. Regress asset returns on factor returns and use the residuals as model inputs. This isolates idiosyncratic signals and stabilises covariance estimates.
  • Cluster-aware sampling. When bootstrapping (Section 02-05), sample by industry clusters to preserve realistic co-movement. Block bootstrap over time and clusters reduces the risk of overly optimistic backtests.
  • Regularised correlations. Compute correlation matrices with graphical lasso or random-matrix-theory filtering to remove noise-dominated eigenvalues, which is important for risk-parity allocations whose stability depends on the estimated covariance $\hat{\Sigma}$.

Worked mathematics: from moments to risk limits

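Let weights be $w \in \mathbb{R}^n$ and asset log returns $r_t \in \mathbb{R}^n$ with mean $\mu$ and covariance $\Sigma$. The portfolio log return is approximately $r_{p,t} = w^\top r_t$ (the identity is exact for simple returns). Assuming finite second moments,

\[
\mathbb{E}[r_{p,t}] = w^\top \mu, \qquad \sigma_p^2 = \operatorname{Var}(r_{p,t}) = w^\top \Sigma w.
\]

A maximum volatility limit $\sqrt{w^\top \Sigma w} \le \sigma_{\max}$ is a quadratic constraint on weights. Adding a drawdown cap $D_{\max}$, an approximate peak-to-trough loss under Gaussianity is $z_\alpha \sigma_p$ at confidence $\alpha$, giving the linearised limit $\sigma_p \le D_{\max} / z_\alpha$. These mathematical links turn descriptive statistics into actionable guardrails, the same form the constrained portfolio optimisations of Section 02-03 consume.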
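
A short numerical check makes the guardrails tangible. The sketch assumes NumPy, a three-asset book with made-up weights, volatilities, and correlations, and a Gaussian approximation in which the peak-to-trough loss is taken as a quantile multiple of portfolio volatility; that approximation and every number below are assumptions for illustration.

```python
import numpy as np

w = np.array([0.5, 0.3, 0.2])            # hypothetical weights
vols = np.array([0.15, 0.10, 0.25])      # illustrative annualised volatilities
corr = np.array([[1.0, 0.2, 0.5],
                 [0.2, 1.0, 0.1],
                 [0.5, 0.1, 1.0]])
Sigma = np.outer(vols, vols) * corr      # covariance from vols and correlations

# Portfolio volatility: sqrt(w' Sigma w).
sigma_p = float(np.sqrt(w @ Sigma @ w))

sigma_max = 0.12     # volatility ceiling (assumed mandate limit)
D_max = 0.20         # drawdown cap (assumed mandate limit)
z_alpha = 1.645      # one-sided Gaussian quantile at 95%

vol_ok = sigma_p <= sigma_max
# Linearised drawdown guardrail: z_alpha * sigma_p must stay below D_max.
dd_ok = z_alpha * sigma_p <= D_max
```

Both checks reduce to comparisons on a single scalar, which is why limits of this form slot directly into a quadratic program as constraints.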

Scenario libraries for stress testing

Beyond point estimates, assemble scenario libraries from historical crises (2008, 2020), synthetic shocks (FX devaluation, rate spike), and factor rotations. Each scenario is a vector of returns used later for Monte Carlo overlays or constraint testing. Label scenarios with metadata (duration, trigger, regime) so that conditional risk conversations are reproducible. Chapter 10 revisits this with diffusion-based generators that produce unseen but plausible scenarios on demand.

Case study: a return stack for a global book

Consider a portfolio of global equities, EM sovereign bonds, and crypto. Four operational concerns shape the return stack before any modelling:

  • Clock alignment. Equities align to local closes; bonds to New York 17:00; crypto trades 24/7. Normalise to UTC and define a canonical evaluation timestamp per asset class.
  • Corporate actions and carry. Equities need dividends and splits; sovereigns need accrued interest and pull-to-par; perpetual futures need funding-rate adjustments. Maintain per-asset adjustment ledgers so derived returns can be reconstructed and audited.
  • Liquidity tiers. Thinly traded names have sporadic prints. Use Kalman smoothing on log prices to impute small gaps, flagging the imputed points so downstream models can downweight them.
  • Holiday calendars. Forward-filling across holidays that differ by country corrupts factor timing. Build calendar-aware joins.

The output is a tall data frame (timestamp, asset_id, local_return, fx_return, hedged_return, log_return, quality_flags). Each column feeds later chapters: portfolio optimisation consumes hedged returns; forecasting models ingest log returns and quality flags; simulation routines draw from the scenario library linked to those flags.

Implementation checklist

  • Version every transformation: raw prices → adjusted prices → returns → features.
  • Persist metadata: currency, exchange, adjustment flags, data latency.
  • Validate monotonicity of timestamps and absence of duplicate (asset, date) rows.
  • Roll-aware handling of futures: splice contracts using volume- or open-interest-based rolls; record roll yield separately.
  • Deterministic reruns via fixed seeds and pinned library versions in lock files.
  • Unit tests that recompute summary statistics after every pipeline change and assert tolerance bounds.

A reproducible return series underpins every chapter that follows; errors here propagate everywhere else.

Forecasting and Evaluation Basics

Forecasts translate past information into beliefs about the future distribution of returns, volatility, or latent states. They are not decisions yet; they are the inputs that the optimisation routines of Chapter 5 will later consume. A clean separation between forecasting and decision-making is what keeps pipelines honest about what each model knew at each point in time, and it is the discipline this section lays down.

Forecasts as conditional expectations

At time $t$ with information set $\mathcal{F}_t$, the optimal point forecast for $y_{t+h}$ under squared-error loss is the conditional expectation $\hat{y}_{t+h \mid t} = \mathbb{E}[y_{t+h} \mid \mathcal{F}_t]$. Different loss functions induce different optimal summaries: quantiles for pinball loss, full densities for log loss. Stating the loss explicitly is the first step toward a well-posed forecasting problem.

Information sets matter. A forecast built with future prices is meaningless in a trading system; a forecast restricted to tradable signals at $t$ is credible. Always make the time index and permissible inputs explicit in code and documentation. The validation workflow at the end of this section assumes you have already done so.

Crafting a forecast

Three design choices repeat across every model class in Chapter 4.

Recursive vs. direct horizons

  • Recursive (iterated). Fit a one-step model and roll it forward using its own predictions. Economical, but error compounds across horizons.
  • Direct (multi-output). Fit a model to predict $(y_{t+1}, \dots, y_{t+H})$ simultaneously. Higher data demand; avoids feedback loops.
  • Hybrid. Direct forecasts for the near term combined with recursive tails for longer horizons. The right answer is workload-dependent; test both on the held-out window before committing.
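
The difference between the recursive and direct strategies is clearest with a one-lag linear model. The sketch below is a standard-library illustration; the direct variant fits one regression per horizon rather than a single multi-output model, and all function names are hypothetical.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def recursive_forecast(series, horizon):
    """Fit y_{t+1} = a + b * y_t once, then feed predictions back in."""
    b, a = ols_slope_intercept(series[:-1], series[1:])
    y, path = series[-1], []
    for _ in range(horizon):
        y = a + b * y
        path.append(y)
    return path


def direct_forecast(series, horizon):
    """Fit a separate regression y_{t+h} = a_h + b_h * y_t for each h."""
    path = []
    for h in range(1, horizon + 1):
        b, a = ols_slope_intercept(series[:-h], series[h:])
        path.append(a + b * series[-1])
    return path


# Noise-free AR(1) path y_t = 1 + 0.5 * y_{t-1}, starting from 10.
series = [10.0]
for _ in range(9):
    series.append(1.0 + 0.5 * series[-1])

rec = recursive_forecast(series, 3)
direct = direct_forecast(series, 3)
```

On a noise-free series the two strategies coincide; with real data they diverge because the recursive path compounds one-step errors while the direct fits lose observations as the horizon grows.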

Point, quantile, and density forecasts

  • Point forecasts. Single-number summaries; easy to use in heuristics or linear constraints.
  • Quantile forecasts. $q_\tau(y_{t+h} \mid \mathcal{F}_t)$ for a small set of levels $\tau$; informs VaR-style limits and asymmetric payoffs.
  • Density forecasts. Model $p(y_{t+h} \mid \mathcal{F}_t)$ explicitly via parametric (Gaussian, Student-$t$) or nonparametric (normalising flow, diffusion) forms. Essential for simulation-based decision modules and for the synthetic-data work in Chapter 10.

The decision layer of Chapter 5 consumes some functional of the predictive distribution — never just a point — so density / quantile outputs are the default in this book.

Practical features

  • Rolling features. Means, vols, correlations over windows tuned to the trading horizon.
  • Calendar effects. Day-of-week, month-end, time-to-event for macro releases and earnings.
  • Regime flags. Volatility-state, liquidity stress, or spread-level proxies — see Section 02-01 for construction.

Feature engineering must respect causality: only data available at or before $t$ may enter the feature for a forecast made at $t$, and the transformations must be auditable to that constraint.

Feature temporal integrity and label construction

Accurate timestamps matter more than algorithm choice. Enforce a strict data model:

  1. Align features to the forecast origin $t$; forward-fill only where economically justified, and never across embargoed windows.
  2. Construct labels that mirror the tradable horizon: $y_{t+h} = \ln(P_{t+h} / P_t)$ for returns, or $\operatorname{sign}(y_{t+h})$ for direction.
  3. Avoid overlapping labels that inflate sample size and induce leakage. If unavoidable, use block bootstrap (Section 02-05) or Newey–West adjustments when reporting uncertainty.

A short validation script that asserts these invariants prevents subtle data snooping and is cheaper than the post-hoc detective work after a failed deployment.

Feature engineering playbook

Robust forecasts rest on disciplined feature construction:

  • Stationarisation. Differences, log ratios, or demeaning by rolling means to stabilise levels. Compute every transformation with causal windows.
  • Volatility alignment. Scale features by recent realised volatility so that model parameters are comparable across regimes. For deep models, standardise per feature using expanding or rolling statistics.
  • Event windows. Pre- and post-event aggregates around earnings or macro releases, labelled with surprise magnitudes for conditional forecasting.
  • Microstructure features. Order-book imbalance, realised volatility, signed volume — make sure the timestamps line up with the trading session of interest.
  • Hierarchical features. Combine cross-sectional statistics (industry mean, sector momentum) with asset-level signals; helps generalisation for sparse names.

Pipeline hygiene comes down to unit tests: monotonic timestamps, no forward fills across embargoed periods, stable feature counts after schema evolution.

Evaluating forecast quality

Evaluation must mirror deployment. Split data along time, not randomly. Match metrics to the forecast type and the downstream objective.

Point metrics

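  • MAE $= \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$. Robust to outliers.
  • RMSE $= \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$. Penalises large mistakes.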
  • MAPE. Scale-free but unstable when values approach zero — avoid for returns near zero.

Probabilistic metrics

  • Pinball loss for quantiles: $\rho_\tau(y, \hat{q}) = \max\{\tau (y - \hat{q}),\ (\tau - 1)(y - \hat{q})\}$.
  • Continuous Ranked Probability Score (CRPS). Integral of pinball losses over all quantiles; compares the full predictive distribution to observations. The default probabilistic score in this book.
  • Log-likelihood / cross-entropy. Rewards calibrated densities, punishes overconfident tails — but is sensitive to outliers, so report alongside CRPS rather than instead of it.
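
The link between pinball loss and CRPS can be sketched in a few lines. The approximation below averages pinball losses over an evenly spaced quantile grid and doubles the result; the grid, the scaling convention, and the function names are assumptions for illustration, and production code would typically use a closed form or a finer grid.

```python
def pinball_loss(y, q_hat, tau):
    """Pinball (quantile) loss for one observation and one quantile level."""
    diff = y - q_hat
    return max(tau * diff, (tau - 1.0) * diff)


def crps_from_quantiles(y, quantiles):
    """Approximate CRPS as twice the average pinball loss over a quantile grid.

    `quantiles` maps level tau -> predicted quantile. The levels should be
    evenly spaced on (0, 1) so the average approximates the integral.
    """
    losses = [pinball_loss(y, q, tau) for tau, q in quantiles.items()]
    return 2.0 * sum(losses) / len(losses)


# A degenerate forecast that puts every quantile at 0, scored against y = 1.
grid = {k / 10: 0.0 for k in range(1, 10)}
score = crps_from_quantiles(1.0, grid)
```

For a point-mass forecast the score collapses to the absolute error, which is the sense in which CRPS generalises MAE to full distributions.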

Calibration and sharpness

A density forecast should be calibrated (statistically consistent with observed frequencies) and sharp (concentrated). The probability integral transform (PIT) histogram is the diagnostic: $u_t = F_t(y_t)$ should be uniform on $[0, 1]$ under perfect calibration. Deviations reveal predictive tails that are too thin (U-shape) or dispersion that is too wide (hump shape). For quantile models, check coverage of prediction intervals: the proportion of observations below the $\tau$-quantile should approach $\tau$.

Sharpness is measured by entropy or average interval width conditional on correct coverage. A model with slightly higher MAE but honest tails often dominates in decision contexts involving leverage or options.
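
The PIT and coverage checks described above can be sketched with the standard library alone. The example assumes Gaussian predictive distributions and simulated observations; the function names, sample size, and the deliberately overconfident forecast are illustrative assumptions.

```python
import random
from statistics import NormalDist


def pit_values(observations, means, stds):
    """PIT u_t = F_t(y_t) for Gaussian predictive distributions."""
    return [NormalDist(m, s).cdf(y) for y, m, s in zip(observations, means, stds)]


def pit_histogram(pits, bins=10):
    """Relative frequency of PIT values in equal-width bins on [0, 1]."""
    counts = [0] * bins
    for u in pits:
        counts[min(int(u * bins), bins - 1)] += 1
    return [c / len(pits) for c in counts]


def coverage(pits, tau):
    """Share of observations falling at or below the predicted tau-quantile."""
    return sum(u <= tau for u in pits) / len(pits)


rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(4000)]
calibrated = pit_values(y, [0.0] * len(y), [1.0] * len(y))
# Predictive std of 0.5 understates the true dispersion of 1.0.
overconfident = pit_values(y, [0.0] * len(y), [0.5] * len(y))
hist_over = pit_histogram(overconfident)
```

The overconfident forecast piles PIT mass into the outer bins, the U-shape that signals tails that are too thin, while the calibrated forecast's coverage stays close to the nominal level.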

Classification-style metrics

For directional forecasts, use balanced accuracy or Matthews correlation to avoid being fooled by class imbalance (mostly up days). Thresholds should be chosen via nested validation, not cherry-picked post-hoc.

Backtesting and cross-validation

Time-series cross-validation preserves ordering. Common schemes:

  • Expanding window. Train on $[1, t]$, test on $(t, t+h]$, expand and repeat. Mimics live trading where history accumulates.
  • Sliding window. Fixed-length training window; guards against concept drift but limits sample size.
  • Purged gaps. Remove a buffer between train and test to mitigate leakage from overlapping labels (important for high-frequency data, per López de Prado's purged $k$-fold).
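
An expanding-window splitter with a purge buffer is short enough to write by hand. The generator below is a minimal sketch, with hypothetical parameter names, that yields index ranges only; it is not a drop-in replacement for a library splitter.

```python
def expanding_window_splits(n_obs, initial_train, test_size, purge=0):
    """Yield (train_idx, test_idx) ranges in time order.

    The training range always starts at 0 and grows by `test_size` each
    fold; `purge` rows between train and test are dropped entirely.
    """
    train_end = initial_train
    while train_end + purge + test_size <= n_obs:
        test_start = train_end + purge
        yield range(0, train_end), range(test_start, test_start + test_size)
        train_end += test_size


splits = list(expanding_window_splits(n_obs=20, initial_train=8, test_size=4, purge=2))
```

With 20 observations this yields two folds: train on rows 0 to 7 and test on rows 10 to 13, then train on rows 0 to 11 and test on rows 14 to 17. The two purged rows before each test block are never used in that fold.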

Backtests should ingest the same feature construction pipeline used at inference; running a separate "research" pipeline at training time is the most common source of look-ahead leakage.

Bias–variance trade-offs and model risk

Forecast errors decompose into bias (systematic mis-specification), variance (sensitivity to sample noise), and irreducible noise. Linear models with strong regularisation shrink variance at the cost of bias; over-parameterised deep nets push bias low and variance up. Calibration curves and reliability diagrams diagnose which side you are on. Ensembling (bagging, boosting, stacking) reduces variance and diversifies inductive biases — see "Forecast combination" below.

Model-risk governance turns these diagnostics into operational rules: benchmark against naive models (random walk, historical average), monitor population drift, and record decision logs that tie forecasts to positions. When a model degrades, having a challenger ready limits downtime.

Forecast combination and ensembling

Single models overfit idiosyncratic noise. Three reliable combiners:

  • Simple averages. Surprisingly robust when individual models are weakly correlated.
  • Precision-weighted averages. Weights proportional to inverse forecast variance or past risk-adjusted performance.
  • Stacking. A meta-learner that consumes base forecasts and metadata (horizon, regime score) as inputs. Keep the meta-learner linear or monotone to avoid amplifying base-model noise.

Track ensemble diversity with the pairwise correlation of forecast errors; ensembles whose members are too similar offer little benefit.
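
Precision weighting is simple to sketch. The example below weights each model by the inverse of its past mean squared error; the model names, error histories, and forecasts are made up for illustration.

```python
def inverse_variance_weights(errors_by_model):
    """Weights proportional to 1 / MSE of each model's past forecast errors."""
    precision = {
        name: 1.0 / (sum(e * e for e in errs) / len(errs))
        for name, errs in errors_by_model.items()
    }
    total = sum(precision.values())
    return {name: p / total for name, p in precision.items()}


def combine(forecasts, weights):
    """Weighted average of the current forecasts."""
    return sum(weights[name] * f for name, f in forecasts.items())


# Hypothetical error histories: model_b's errors are twice as large.
past_errors = {
    "model_a": [0.01, -0.01, 0.01, -0.01],
    "model_b": [0.02, -0.02, 0.02, -0.02],
}
weights = inverse_variance_weights(past_errors)
blended = combine({"model_a": 0.004, "model_b": 0.010}, weights)
```

Doubling a model's error size quarters its precision, so the weights here come out at 0.8 and 0.2. In practice the error histories should come from out-of-sample forecasts only.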

Regime-conditioned forecasting

Forecast skill varies by regime. Make the regime explicit:

  • Hidden Markov models or Bayesian change-point detection to identify shifts in volatility or correlation.
  • Regime-conditioned models. Train one model per regime, or include regime probabilities as features. Smooth transitions via mixture-of-experts with temperature-controlled gating networks.
  • Stress-aware validation. Make sure each evaluation fold contains stress episodes; otherwise accuracy will vanish exactly when it matters.

As noted in Section 02-01, the resulting regime probabilities reappear as state variables in Chapter 5 and as conditioning inputs in Chapter 10.

Stress testing forecasts

Even well-tuned models fail under stress. Probe robustness by:

  • Perturbing inputs. Add noise or simulate missing data; forecasts that swing wildly are fragile.
  • Regime conditioning. Evaluate separately in high-volatility and low-liquidity windows.
  • Scenario shocks. Force macro variables or spreads to crisis levels and propagate through the model.

Stress results inform downstream safeguards: caps on position size, volatility scaling, or fallbacks to simpler models during turmoil.

From forecasts to decisions

Forecasts supply distributions; decisions consume them via utility or cost functions. Keep the two modular: a change in the decision rule should not require re-fitting the forecaster, and vice versa.

Economic evaluation

Statistical accuracy is necessary but not sufficient. Convert forecast quality into economic terms:

  • Utility-based scoring. Maximise $\mathbb{E}[U(W)]$ over resulting wealth $W$ for a concave $U$ that penalises tail risk.
  • PnL attribution. Run a simple trading rule (sign of forecast, vol-scaling) and decompose PnL into hit rate, average win/loss, turnover, and costs.
  • Cost-aware scoring. Subtract realistic transaction costs and market impact; penalise turnover or gap risk explicitly.
  • Capacity analysis. Examine decay of forecast value as notional increases; fit impact curves (square-root law) to convert accuracy into capacity estimates.
  • Mandate alignment. Evaluation must respect the leverage, concentration, and liquidity constraints that downstream portfolio construction will enforce.

Economic backtests belong in a separate, audited pipeline so they do not contaminate the pure forecasting evaluation.

Interpreting models

Explainability builds trust and uncovers spurious signals. For linear models, inspect coefficients and partial correlations. For tree ensembles or neural networks, use SHAP, permutation importance, and counterfactual analysis. Temporal saliency (integrated gradients over time) shows whether the model anchors on economically meaningful windows or just memorises noise. Interpretability checks should be part of model acceptance, not an afterthought.

Practical validation workflow

A lightweight but rigorous workflow:

  1. Data split. Training, validation, and test chronologically with embargo gaps.
  2. Hyperparameter tuning. Bayesian optimisation with time-series cross-validation.
  3. Model audit. Calibration plots, residual diagnostics, sensitivity to feature perturbations.
  4. Live shadow. Run forecasts in paper trading to verify latency, cost estimates, and monitoring dashboards.
  5. Promotion. Freeze weights, document feature schema, register the model with metadata (version, owner, validation set, decay policy).

This process mirrors the MLOps routines used later when forecasts are plugged into the RL or portfolio systems of Chapter 5.

Monitoring in production

Once deployed, forecasts require continuous health checks:

  • Data drift. Track feature distributions via population stability index (PSI) or Wasserstein distance; alert on threshold breaches.
  • Performance decay. Monitor rolling calibration error, hit rate, and realised Sharpe of forecast-driven strategies.
  • Latency budgets. Log end-to-end inference times, especially when features rely on streaming market data.
  • Fallback policies. Define deterministic rules — neutrality, risk-off — when data quality flags fire or models exceed error budgets.
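
The population stability index from the first bullet can be sketched as follows. Bin edges, the guard for empty bins, and the function name are assumptions; alert thresholds are a matter of convention (values around 0.1 and 0.25 are commonly quoted rules of thumb) and should be tuned per feature.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between a reference and a live sample.

    `edges` are the interior bin boundaries; `eps` replaces empty-bin
    shares so the logarithm stays finite.
    """
    def shares(sample):
        counts = [0] * (len(edges) + 1)
        for x in sample:
            counts[sum(x > e for e in edges)] += 1
        return [max(c / len(sample), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


reference = [i / 100 for i in range(100)]     # training-time feature values
shifted = [x + 0.3 for x in reference]        # live values after a level shift
edges = [0.25, 0.5, 0.75]

psi_same = psi(reference, reference, edges)
psi_shift = psi(reference, shifted, edges)
```

Identical samples score exactly zero, while the shifted sample empties the lowest bin and produces a large index, the kind of breach that should trigger the fallback policies in the list.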

These monitoring practices echo the control loops in Chapter 5: a forecast is part of a closed system, not a detached prediction, and the feedback path matters as much as the model itself.

Portfolio Theory

We rarely invest in a single asset. The investor's problem is to allocate wealth across correlated bets while respecting constraints, preferences, and market frictions. Portfolio theory provides the map: how to quantify trade-offs between risk and reward, and how to express them as optimisation problems that the dynamic policies of Chapter 5 will later solve through time. The static formulations here are the ground truth that every later RL or function-approximation method should be benchmarked against.

Mean–variance core

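Let $w$ be portfolio weights summing to one, $\mu$ the vector of expected returns, and $\Sigma$ the covariance matrix of returns. The portfolio mean and variance are

\[
\mu_p = w^\top \mu, \qquad \sigma_p^2 = w^\top \Sigma w.
\]

Markowitz's classical problem chooses $w$ to minimise variance subject to a return target,

\[
\min_{w} \; \tfrac{1}{2} w^\top \Sigma w \quad \text{s.t.} \quad w^\top \mu = \mu^\star, \quad \mathbf{1}^\top w = 1.
\]

The set of optimal $(\sigma_p, \mu_p)$ pairs traces the efficient frontier. Adding a risk-free asset with rate $r_f$ converts the frontier into the Capital Market Line with slope equal to the Sharpe ratio of the tangency portfolio,

\[
w_{\tan} = \frac{\Sigma^{-1} (\mu - r_f \mathbf{1})}{\mathbf{1}^\top \Sigma^{-1} (\mu - r_f \mathbf{1})}, \qquad \mathrm{SR}_{\tan} = \frac{w_{\tan}^\top \mu - r_f}{\sqrt{w_{\tan}^\top \Sigma w_{\tan}}}.
\]

This is the closed-form benchmark referenced throughout Chapter 5; if your dynamic policy under-performs the tangency portfolio out-of-sample, the right place to look first is the inputs $(\mu, \Sigma)$, not the policy class.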
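
For reference, the unconstrained tangency portfolio takes a handful of lines with NumPy. The inputs below are made-up numbers, and the function name is hypothetical; the sketch ignores every constraint discussed later in the section.

```python
import numpy as np


def tangency_portfolio(mu, Sigma, rf=0.0):
    """Fully invested tangency weights and their Sharpe ratio (no constraints)."""
    excess = mu - rf
    raw = np.linalg.solve(Sigma, excess)     # proportional to Sigma^{-1} (mu - rf)
    w = raw / raw.sum()                      # normalise so weights sum to one
    sharpe = float((w @ mu - rf) / np.sqrt(w @ Sigma @ w))
    return w, sharpe


mu = np.array([0.08, 0.05, 0.11])            # illustrative expected returns
vols = np.array([0.15, 0.10, 0.25])
corr = np.array([[1.0, 0.2, 0.5],
                 [0.2, 1.0, 0.1],
                 [0.5, 0.1, 1.0]])
Sigma = np.outer(vols, vols) * corr
w_tan, sr_tan = tangency_portfolio(mu, Sigma, rf=0.02)
```

Solving the linear system instead of inverting the covariance matrix is the numerically safer habit, and the resulting Sharpe ratio is at least as high as that of any single asset in the universe.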

Utility functions and risk preferences

The mean–variance problem is one of several utility-based formulations. Common choices, and where they fit:

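  • Quadratic / mean–variance utility: $U(w) = w^\top \mu - \frac{\lambda}{2} w^\top \Sigma w$, giving a closed-form $w^\star$ subject to constraints (unconstrained, $w^\star = \frac{1}{\lambda} \Sigma^{-1} \mu$). Approximates expected utility under joint Gaussianity.
  • CRRA utility: $U(W) = \frac{W^{1-\gamma}}{1-\gamma}$ for $\gamma \neq 1$ (and $U(W) = \ln W$ at $\gamma = 1$). Produces multi-period rules in Chapter 5 whose risk premium scales with wealth, the natural fit for compounding investors.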
  • Prospect-theory utility. Asymmetric value function with loss aversion and probability weighting. Useful for understanding investor flows; rarely used directly inside an optimiser.

A subtle but important point: the choice of risk-aversion parameter ($\lambda$ or $\gamma$) is a modelling choice, not a market quantity. Sensitivity analysis across plausible values should be part of every honest report.

Risk measures beyond variance

Variance treats upside and downside symmetrically. The alternatives focus on the loss region and matter once tails dominate the decision:

  • Value-at-Risk (VaR). $\alpha$-quantile of the portfolio loss distribution. Easy to communicate, not coherent (subadditivity fails).
  • Conditional VaR / Expected Shortfall. $\mathrm{CVaR}_\alpha = \mathbb{E}[L \mid L \ge \mathrm{VaR}_\alpha]$ for portfolio loss $L$. Coherent and admits a linear-programming reformulation \citep{rockafellar2000cvar}, which is why portfolio optimisers prefer it over VaR when the question is "minimise tail loss subject to return target."
  • Tracking error. Variance of active returns, $(w - w_b)^\top \Sigma (w - w_b)$, relative to a benchmark $w_b$. The right risk measure for benchmarked mandates.

Choosing the risk measure changes the optimum: CVaR-optimised portfolios reallocate mass away from fat-tailed exposures; tracking-error mandates suppress active bets regardless of standalone Sharpe ratio.

Constraints, costs, and frictions

Real mandates deviate from frictionless theory. Encode these constraints inside the optimisation, not as post-processing — that is what keeps the static benchmark honest enough to compare against dynamic policies.

  • Budget and leverage: $\mathbf{1}^\top w = 1$ for fully invested books, or $\|w\|_1 \le L$ to cap gross exposure.
  • Box / sign constraints: $l \le w \le u$; $w \ge 0$ for long-only.
  • Turnover and transaction costs. Linear costs approximate commissions and bid-ask; quadratic costs approximate temporary market impact and yield smooth optimisations that play well with gradient methods.
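  • Risk caps. Volatility ceilings $\sqrt{w^\top \Sigma w} \le \sigma_{\max}$, factor exposure caps $|B^\top w| \le b_{\max}$ for factor loadings $B$, concentration limits by sector or region.
  • Liquidity caps. $|w_i| \le \kappa_i$ with $\kappa_i$ depending on venue depth.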
  • ESG / policy filters. Exclusion lists or scoring thresholds with weights re-normalised after the filter is applied.

With convex risk and linear constraints, a quadratic-program solver (cvxpy, qpsolvers) finds the exact solution. For non-convex frictions (integer lot sizes, fixed costs), switch to specialised solvers but keep the same objective structure so the result is comparable to the convex benchmark.

Factor models and shrinkage

Estimating $\Sigma$ is brittle when assets outnumber observations. Two remedies, used together rather than apart:

  • Factor models. Express returns as $r_t = B f_t + \varepsilon_t$ with factor loadings $B$, factor covariance $\Sigma_f$, and idiosyncratic risk $D$ (diagonal). Then $\Sigma = B \Sigma_f B^\top + D$: far fewer parameters, and the loadings carry an economic interpretation. This is also the bridge to the dynamic-factor machinery of Chapter 6.
  • Shrinkage covariance. Combine sample covariance $S$ with a structured target $F$:

\[
\hat{\Sigma} = \delta F + (1 - \delta) S, \qquad \delta \in [0, 1].
\]

Ledoit–Wolf \citep{ledoitwolf2004} chooses $\delta$ optimally under quadratic loss with a constant-correlation target. For factor structures, shrinking the residual covariance rather than the full $\Sigma$ tends to perform better.

These techniques keep the optimisation well-posed and align with the linear-algebra workflows of Chapter 3.

Robust and Bayesian portfolio optimisation

Markets shift, and the inputs to a static optimiser are estimated from data that lags those shifts. Two routes to a less fragile optimum:

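  • Robust optimisation. Treat inputs as uncertain sets and optimise the worst case. For mean uncertainty in an ellipsoid $\{\mu : (\mu - \hat{\mu})^\top \Sigma_\mu^{-1} (\mu - \hat{\mu}) \le \kappa^2\}$, the worst-case objective penalises return by $\kappa \sqrt{w^\top \Sigma_\mu w}$. Distributionally robust optimisation (DRO) specifies ambiguity sets in Wasserstein distance and yields portfolios resilient to tail shifts.
  • Bayesian / Black–Litterman. Place priors on $\mu$ and $\Sigma$. The Black–Litterman \citep{blacklitterman1992} formula blends market-implied equilibrium returns $\pi$ with subjective views $q$ through

\[
\mu_{\text{BL}} = \left[ (\tau \Sigma)^{-1} + P^\top \Omega^{-1} P \right]^{-1} \left[ (\tau \Sigma)^{-1} \pi + P^\top \Omega^{-1} q \right],
\]

where $P$ maps views onto assets, $\Omega$ is the covariance of view uncertainty, and $\tau$ scales confidence in the prior. This is the canonical recipe to turn idiosyncratic conviction into a confidence-weighted allocation rather than letting the optimiser ride the noisiest forecast.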
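
The Black–Litterman posterior mean is a few lines of linear algebra. The sketch below assumes NumPy and the standard notation (equilibrium returns `pi`, pick matrix `P`, view returns `q`, view uncertainty `Omega`, prior scale `tau`); the two-asset numbers and the default `tau=0.05` are illustrative assumptions.

```python
import numpy as np


def black_litterman_mean(pi, Sigma, P, q, Omega, tau=0.05):
    """Posterior mean blending equilibrium returns with views (P, q, Omega)."""
    prior_prec = np.linalg.inv(tau * Sigma)          # precision of the prior
    omega_inv = np.linalg.inv(Omega)                 # precision of the views
    lhs = prior_prec + P.T @ omega_inv @ P
    rhs = prior_prec @ pi + P.T @ omega_inv @ q
    return np.linalg.solve(lhs, rhs)


Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])
pi = np.array([0.05, 0.07])       # market-implied equilibrium returns
P = np.array([[1.0, -1.0]])       # one relative view: asset 1 minus asset 2
q = np.array([0.02])              # asset 1 is expected to outperform by 2%

mu_bl = black_litterman_mean(pi, Sigma, P, q, Omega=np.array([[0.001]]))
```

The two limits are a useful sanity check: as the view uncertainty shrinks toward zero the posterior honours the view exactly, and as it grows the posterior falls back to the equilibrium returns.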

Robust and Bayesian methods change the answer qualitatively, not just quantitatively. Always test the policy under both on the same data; fragile static benchmarks are usually traceable to skipping this step.

Risk parity, minimum variance, and alternatives

Mean–variance is one objective among many. Four alternatives recur:

  • Risk parity. Equalise risk contributions $w_i (\Sigma w)_i / \sigma_p$ across assets. Better diversification when $\mu$ is hard to estimate (which is most of the time).
  • Minimum variance. Set $\mu = 0$ and minimise $w^\top \Sigma w$. A defensive core; surprisingly hard to beat out-of-sample.
  • Maximum diversification. Maximise $\frac{w^\top \sigma}{\sqrt{w^\top \Sigma w}}$ for a vector of volatilities $\sigma$. Favours low-correlation assets.
  • Drawdown-constrained. Constrain expected drawdown via scenarios (Chapter 10) or via CVaR-style objectives on rolling peak-to-trough losses.

Strategy stacking — running several of these policies side by side and allocating across them — is a common operational answer when no single objective dominates across regimes.
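Two of these objectives are easy to sketch with NumPy. Minimum variance has a closed form when only the budget constraint is imposed (shorts allowed), and equal risk contributions can be found by cyclical coordinate descent on a log-barrier formulation. The solver choice and the covariance matrix below are illustrative assumptions.

```python
import numpy as np

def risk_contributions(w, Sigma):
    """Each asset's share of total portfolio variance; the shares sum to one."""
    marginal = Sigma @ w
    return w * marginal / (w @ marginal)

def min_variance_weights(Sigma):
    """Fully invested minimum-variance weights with shorts allowed."""
    x = np.linalg.solve(Sigma, np.ones(Sigma.shape[0]))
    return x / x.sum()

def risk_parity_weights(Sigma, n_iter=500):
    """Equal-risk-contribution weights via cyclical coordinate descent.

    Minimises 0.5 * w' Sigma w - b * sum(log w_i), then rescales to sum to one.
    """
    n = Sigma.shape[0]
    b = 1.0 / n                                   # equal risk budget per asset
    w = 1.0 / np.sqrt(np.diag(Sigma))             # inverse-volatility starting point
    for _ in range(n_iter):
        for i in range(n):
            c = Sigma[i] @ w - Sigma[i, i] * w[i]     # cross term from the other assets
            # Positive root of Sigma_ii * w_i^2 + c * w_i - b = 0.
            w[i] = (-c + np.sqrt(c * c + 4.0 * Sigma[i, i] * b)) / (2.0 * Sigma[i, i])
    return w / w.sum()

Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
w_mv = min_variance_weights(Sigma)
w_rp = risk_parity_weights(Sigma)
```

Neither routine needs an expected-return estimate, which is the practical appeal of both policies.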

Multi-period and transaction-cost-aware allocation

Single-period optimality undershoots when trading costs and path dependence matter. Three pragmatic generalisations:

  • Static plus turnover penalty. A quadratic penalty $\kappa \lVert w_t - w_{t-1} \rVert_2^2$ in the objective reproduces most of the multi-period benefit at the cost of a single QP per rebalance.
  • Model predictive control (MPC). Solve a horizon-$H$ optimisation each period using forecasted returns and costs, apply only the first step, and repeat. Useful when the forecast horizon and rebalance cadence are mismatched.
  • Stochastic dynamic programming. The fully general formulation, treated in Chapter 5. The closed-form Merton solution is a special case; the policies of Section 5-04 approximate it under realistic noise.

Incorporating quadratic transaction costs and limit-order-fill probabilities yields allocations that trade less during illiquidity spikes — the bias that almost every realistic objective wants.
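Without constraints, the turnover-penalised problem has a closed form, which makes the effect of the penalty easy to inspect. The sketch assumes the objective $\mu^\top w - \tfrac{\gamma}{2} w^\top \Sigma w - \tfrac{\kappa}{2} \lVert w - w_{\text{prev}} \rVert_2^2$; the parameter names and numbers are illustrative.

```python
import numpy as np

def turnover_penalised_weights(mu, Sigma, w_prev, gamma=5.0, kappa=1.0):
    """Unconstrained maximiser of mu'w - gamma/2 w'Sigma w - kappa/2 ||w - w_prev||^2.

    First-order condition: (gamma * Sigma + kappa * I) w = mu + kappa * w_prev.
    """
    n = len(mu)
    return np.linalg.solve(gamma * Sigma + kappa * np.eye(n), mu + kappa * w_prev)

Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
mu = np.array([0.03, 0.05, 0.07])
w_prev = np.full(3, 1.0 / 3.0)

# Distance traded shrinks as the penalty kappa grows.
trades = [np.linalg.norm(turnover_penalised_weights(mu, Sigma, w_prev, 5.0, k) - w_prev)
          for k in (0.0, 1.0, 10.0)]
```

Setting `kappa=0` recovers the plain mean–variance solution, and a very large `kappa` pins the portfolio to the previous weights. With leverage or long-only constraints the same objective goes to a QP solver instead.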

Mean–variance frontier and performance attribution

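The frontier with a risk-free asset has the closed form

$$\mu_p = r_f + \sigma_p \sqrt{(\mu - r_f \mathbf{1})^\top \Sigma^{-1} (\mu - r_f \mathbf{1})},$$

attained by portfolios whose risky weights are proportional to $\Sigma^{-1} (\mu - r_f \mathbf{1})$.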

Performance attribution decomposes realised return into contributions from factors, allocation, and selection. Brinson-style attribution works for long-only mandates; for long–short, prefer exposure-based attribution with beta and residual components. Attribution is what makes the optimum narratable — without it, a stakeholder review produces arguments rather than decisions.

Scenario-based robustness checks

Stress testing complements optimisation:

  • Historical replay. Apply candidate weights to crisis periods and evaluate drawdowns, turnover, and breach counts for hard limits.
  • Factor shocks. Shock value, momentum, carry, rates, or FX factors by multiples of their historical volatilities; observe portfolio responses.
  • Path-dependent costs. Simulate limit-order execution with fill probabilities and queue positioning to reveal slippage that a quadratic cost term hides.

Stress results feed back into constraint tuning or risk overlays — the same loop the synthetic-data workflow of Chapter 10 closes via generated counterfactuals.

Numerical stability and solver hygiene

Optimisation routines fail silently without care.

  • Ill-conditioned covariance. Add diagonal jitter, prefer Cholesky-based solvers, or reduce dimensionality via factors. Always check condition number before trusting the inverse.
  • Scaling. Rescale weights and returns so that typical magnitudes are around one — gradient methods need this; QPs are forgiving but benefit from it.
  • Warm starts. Start from the previous-period weights and gradually tighten constraints to prevent oscillations in daily rebalances.
  • Stochastic approximations. For very large universes, use SGD or coordinate descent on differentiable objectives (risk parity, quadratic-cost MV) with periodic full-frontier re-projections.

Document solver tolerances and convergence diagnostics as part of the portfolio build sheet; reproducibility of an optimum requires the solver's internals, not just the output.
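The conditioning checks can be sketched as follows: measure the condition number, add diagonal jitter scaled by the average variance until it is acceptable, and solve through a Cholesky factor rather than an explicit inverse. The threshold `max_cond` and the jitter schedule are illustrative assumptions to be tuned per universe.

```python
import numpy as np

def stabilise_covariance(Sigma, max_cond=1e6, jitter=1e-10):
    """Add growing diagonal jitter until the condition number is at most max_cond."""
    n = Sigma.shape[0]
    scale = np.trace(Sigma) / n          # average variance sets the jitter scale
    out = Sigma.copy()
    eps = jitter
    while np.linalg.cond(out) > max_cond:
        out = Sigma + eps * scale * np.eye(n)
        eps *= 10.0
    return out

def solve_spd(Sigma, b):
    """Solve Sigma x = b through a Cholesky factor instead of an explicit inverse."""
    L = np.linalg.cholesky(Sigma)        # raises LinAlgError if Sigma is not positive definite
    y = np.linalg.solve(L, b)
    return np.linalg.solve(L.T, y)

rng = np.random.default_rng(0)
Sigma_raw = np.cov(rng.standard_normal((5, 8)), rowvar=False)   # 5 observations, 8 assets
Sigma_ok = stabilise_covariance(Sigma_raw)
b = np.ones(8)
x = solve_spd(Sigma_ok, b)
```

The Cholesky call doubles as a diagnostic: it fails loudly on a matrix that is not positive definite, whereas a plain inverse may return garbage without complaint.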

Linking to reinforcement-learning policies

The portfolio policies of Chapter 5 build on these structures. Two practical bridges:

  • Action spaces. RL agents output target weights or trades; the action space must respect leverage and liquidity constraints. Use squashing functions (tanh) and projection layers onto feasible sets, or parameterise directly on the simplex with a Dirichlet head.
  • Reward shaping. Encode risk preferences via utility, penalise turnover and limit breaches, and normalise rewards by realised volatility so training is stable across regimes.

Aligning RL environments with the classical portfolio theory in this section is what stops agents from discovering policies that look great in simulation but are unrecognisable to a risk officer.
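A projection layer for the action space can be sketched with the standard sort-based Euclidean projection onto the probability simplex. The sketch assumes a long-only, fully invested mandate as the feasible set; other mandates need a different projection. The `logits` vector stands in for a hypothetical policy-network output.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    n = len(v)
    u = np.sort(v)[::-1]                       # sort in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, n + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]   # last index that stays positive
    theta = css[rho] / (rho + 1)               # uniform shift applied to every entry
    return np.maximum(v - theta, 0.0)

logits = np.array([1.5, -0.3, 0.2, 0.9])       # hypothetical raw policy output
weights = project_to_simplex(np.tanh(logits))  # squash, then project onto the feasible set
```

Vectors that are already feasible pass through unchanged, so the layer only intervenes when the raw action violates the budget or sign constraints.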

Governance and model risk management

Institutional portfolios operate under governance, and the optimisation loop is no exception:

  • Model inventory. Log optimisation configurations, parameter priors, and constraint sets; attach performance history and known failure modes.
  • Challenge sessions. Periodically stress assumptions (estimation horizon, cost curves) and record outcomes. Rotate challengers to avoid groupthink.
  • Override protocols. Define when human overrides are allowed (liquidity crises, system outages) and how they are rolled back. Overrides must be auditable and must not bypass pre-trade risk checks.

Governance is what keeps sophisticated optimisation accountable. A well-specified static benchmark is also what lets you justify the next chapter's switch to dynamic optimisation: the reader knows exactly which preferences and frictions the dynamic version is being asked to respect.

Classical Time-Series Models

Classical time-series models are the first diagnostic tool for any financial series. They impose structure that is transparent, quick to estimate, and easy to stress. Even when deep models eventually replace them in production, the residuals of an ARIMA or a GARCH fit reveal whether more complex architectures are fighting the data or are merely re-discovering the structure these classical baselines already capture. This section sets up the family — ARIMA, GARCH, VAR, state-space, plus a few specialised tools — that recurs throughout Chapters 4 and 6.

Univariate dynamics: AR, MA, ARIMA, SARIMA

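An autoregressive model of order $p$ writes

$$y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \varepsilon_t,$$

with white-noise innovations $\varepsilon_t \sim \mathrm{WN}(0, \sigma^2)$. A moving-average MA($q$) captures shock propagation,

$$y_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}.$$

Combining both and allowing differencing gives ARIMA($p, d, q$),

$$\phi(L) (1 - L)^d y_t = c + \theta(L) \varepsilon_t, \qquad \phi(L) = 1 - \sum_{i=1}^{p} \phi_i L^i, \quad \theta(L) = 1 + \sum_{j=1}^{q} \theta_j L^j,$$

with $L$ the lag operator and $d$ the integer order of differencing. Seasonal ARIMA (SARIMA) layers seasonal differencing on top:

$$\Phi(L^s) \phi(L) (1 - L)^d (1 - L^s)^D y_t = c + \Theta(L^s) \theta(L) \varepsilon_t,$$

with period $s$. Use information criteria (AIC / BIC / HQIC) and diagnostics on residual autocorrelation to choose parsimonious orders. Finance often prefers low orders (1–2) to avoid chasing noise.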
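To make the autoregressive part concrete, the sketch below simulates an AR(2) series and recovers its coefficients by ordinary least squares on lagged values. This is conditional least squares rather than exact maximum likelihood; for production fits the statsmodels estimators are the natural choice. The function names are hypothetical.

```python
import numpy as np

def simulate_ar(phi, n, sigma=1.0, seed=0):
    """Simulate y_t = sum_i phi_i * y_{t-i} + eps_t, starting from zeros."""
    rng = np.random.default_rng(seed)
    p = len(phi)
    y = np.zeros(n + p)
    for t in range(p, n + p):
        lags = y[t - p:t][::-1]                 # y_{t-1}, ..., y_{t-p}
        y[t] = np.dot(phi, lags) + sigma * rng.standard_normal()
    return y[p:]

def fit_ar_ols(y, p):
    """Regress y_t on a constant and its first p lags; returns (c, phi)."""
    n = len(y)
    columns = [np.ones(n - p)] + [y[p - i:n - i] for i in range(1, p + 1)]
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return beta[0], beta[1:]

y = simulate_ar([0.6, -0.2], n=20_000)
c_hat, phi_hat = fit_ar_ols(y, p=2)
```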

Exponential smoothing (ETS) decomposes the series into level, trend, and seasonal components and updates each with its own smoothing parameter. ETS lacks an explicit autocorrelation structure but is remarkably stable on short series — a reliable benchmark before escalating to ARIMA.

Modelling volatility: GARCH and friends

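Returns often have little autocorrelation while squared returns clearly do; that is the volatility-clustering stylised fact of Section 02-01. GARCH models capture it by writing the conditional variance as a recursion. For GARCH(1,1):

$$\varepsilon_t = \sigma_t z_t, \qquad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2, \qquad \omega > 0, \; \alpha, \beta \ge 0, \; \alpha + \beta < 1.$$

Useful extensions:

  • EGARCH. Models $\log \sigma_t^2$, so the variance stays positive without parameter constraints; admits asymmetric responses to positive and negative shocks.
  • GJR-GARCH. Adds a leverage term $\gamma \varepsilon_{t-1}^2 \mathbb{1}\{\varepsilon_{t-1} < 0\}$ so negative shocks raise variance more than positive shocks of the same size.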
  • Component GARCH. Splits the conditional variance into long- and short-run components — better fit on series with persistent volatility regimes.

Conditional variance forecasts feed directly into portfolio risk limits (Section 02-03) and volatility-scaling rules. Long-memory series (absolute returns with slowly-decaying autocorrelation) are better captured by ARFIMA with fractional differencing $d \in (0, 1/2)$, estimated by Geweke–Porter–Hudak log-periodogram or Whittle likelihood; ARFIMA-GARCH hybrids are the default for realised-volatility forecasting.
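The GARCH(1,1) recursion $\sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2$ can be sketched directly in NumPy. The snippet simulates a path and then filters the conditional variance back out for known parameters; it does not estimate them (the `arch` package handles that). Parameter values are illustrative.

```python
import numpy as np

def garch11_variance(eps, omega, alpha, beta):
    """Conditional variance path given shocks eps and known parameters."""
    sigma2 = np.empty(len(eps))
    sigma2[0] = omega / (1.0 - alpha - beta)        # start at the unconditional variance
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

def simulate_garch11(n, omega, alpha, beta, seed=0):
    """Simulate shocks eps_t = sigma_t * z_t with Gaussian z_t."""
    rng = np.random.default_rng(seed)
    eps = np.empty(n)
    sigma2 = np.empty(n)
    sigma2[0] = omega / (1.0 - alpha - beta)
    eps[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, n):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return eps, sigma2

eps, sigma2_true = simulate_garch11(2_000, omega=1e-6, alpha=0.08, beta=0.9)
sigma2_filtered = garch11_variance(eps, omega=1e-6, alpha=0.08, beta=0.9)
```

The next-step volatility forecast is the square root of the last recursion value, which is the quantity a volatility-scaling rule would consume.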

Multivariate dynamics: VAR, VECM, DCC

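Vector autoregressions model joint evolution:

$$y_t = c + \sum_{i=1}^{p} A_i y_{t-i} + \varepsilon_t, \qquad \varepsilon_t \sim (0, \Sigma_\varepsilon).$$

Granger causality tests whether adding lagged $y_j$ improves prediction of $y_i$ given the rest. Impulse-response analysis traces how a one-off shock to one variable propagates through the system over time.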

When non-stationary series share a long-run equilibrium (the basis between a futures contract and its underlying, the relationship between inflation and short rates), use a Vector Error-Correction Model (VECM). VECMs difference the series but add an equilibrium-correction term:

$$\Delta y_t = \alpha \beta^\top y_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta y_{t-i} + \varepsilon_t.$$

The columns of $\beta$ define the cointegrating vectors (the spreads that mean-revert) and $\alpha$ is the adjustment speed. The Johansen test estimates the cointegration rank. Most pair-trading and statistical-arbitrage work in financial practice ultimately reduces to a VECM in disguise.

Dynamic Conditional Correlation (DCC-GARCH) \citep{engle2002dcc} gives a tractable multivariate volatility model: marginal GARCH per series with a slowly-evolving correlation matrix $R_t$, so that the conditional covariance is $H_t = D_t R_t D_t$ with $D_t$ the diagonal matrix of conditional volatilities. DCC is the default risk-model engine in many investment banks because it produces covariance forecasts cheap enough to refit nightly and stable enough to plug straight into a mean-variance optimiser.

Section 04-02 (multivariate forecasting) extends these ideas to feature-enriched panels and brings BVAR shrinkage priors and reduced-rank VAR into the picture; this section is the generative baseline that those forecasters compete against.
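A VAR(1) can be fitted equation by equation with least squares, and its non-orthogonalised impulse responses are powers of the coefficient matrix. The sketch below uses a simulated two-variable system with made-up coefficients; for lag selection, Granger tests, and orthogonalised responses, statsmodels is the practical tool.

```python
import numpy as np

def fit_var1(Y):
    """OLS for y_t = c + A y_{t-1} + e_t, where Y has shape (T, k)."""
    X = np.column_stack([np.ones(len(Y) - 1), Y[:-1]])
    coef, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)
    return coef[0], coef[1:].T            # intercept c and coefficient matrix A

def impulse_responses(A, horizon):
    """Response after h steps to a unit, non-orthogonalised shock: the matrix power A^h."""
    return np.stack([np.linalg.matrix_power(A, h) for h in range(horizon + 1)])

rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.1],
                   [0.0, 0.3]])           # zero entry: series 0 does not help predict series 1
Y = np.zeros((20_000, 2))
for t in range(1, len(Y)):
    Y[t] = A_true @ Y[t - 1] + rng.standard_normal(2)

c_hat, A_hat = fit_var1(Y)
irf = impulse_responses(A_hat, horizon=5)
```

Entry `A[i, j]` is the effect of lagged series `j` on series `i`, so a near-zero estimate in `A_hat[1, 0]` is the informal counterpart of failing to reject the Granger non-causality null.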

State-space view and the Kalman filter

State-space models unify trend, seasonality, and dynamics into a single framework. With latent state $x_t$ and observation $y_t$,

$$x_t = F x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q), \qquad y_t = H x_t + v_t, \quad v_t \sim \mathcal{N}(0, R).$$

The Kalman filter recursively computes the posterior mean and covariance of the state given observations up to $t$, and the Kalman smoother refines those estimates using the full sample (offline). Use cases that recur throughout the book:

  • Local-level / local-trend models. $y_t = \mu_t + \varepsilon_t$, $\mu_t = \mu_{t-1} + \eta_t$: the classical Bayesian filter for drifting levels.
  • Time-varying parameter models. AR or regression coefficients evolve over time, enabling adaptive strategies in changing regimes.
  • Trend–cycle decomposition. Separate a smooth trend from a stationary cyclical component for macro forecasting.
  • Latent-factor extraction. When the observation matrix $H$ is a factor-loading matrix, the filter delivers latent factor trajectories, the connecting tissue to Chapter 6's dynamic factor models.

Beyond the linear Gaussian setting, extended and unscented Kalman filters handle smooth nonlinearities, and particle filters handle heavy-tailed noise or full nonlinear dynamics at higher computational cost. The full state-space machinery — including identification, smoothing, and EM-based parameter estimation — is the subject of Chapter 6, which treats it as a first-class topic rather than as a tool.
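For the scalar local-level model the filter reduces to a few lines, which makes the predict and update steps easy to see. The sketch assumes known noise variances `q` (state) and `r` (observation) and a diffuse initial variance; missing observations are handled by skipping the update.

```python
import numpy as np

def local_level_filter(y, q, r, m0=0.0, p0=1e6):
    """Kalman filter for y_t = mu_t + v_t, mu_t = mu_{t-1} + w_t."""
    m, p = m0, p0
    means = np.empty(len(y))
    variances = np.empty(len(y))
    for t, obs in enumerate(y):
        p = p + q                          # predict: state uncertainty grows by q
        if not np.isnan(obs):              # missing observation: keep the prediction
            k = p / (p + r)                # Kalman gain
            m = m + k * (obs - m)          # move the mean towards the observation
            p = (1.0 - k) * p              # the update shrinks the variance
        means[t], variances[t] = m, p
    return means, variances

y = np.full(50, 5.0)
y[20] = np.nan                             # one missing observation
means, variances = local_level_filter(y, q=0.01, r=1.0)
```

The filtered variance rises at the gap and falls again once data resume.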

Seasonality, cycles, and regime changes

Classical models separate deterministic seasonality from stochastic cycles. A multiplicative price decomposition, additive in logs, reads

$$\log P_t = T_t + S_t + C_t + \varepsilon_t,$$

with $T_t$ a trend, $S_t$ a seasonal of period $s$, $C_t$ a cyclical ARMA component, and $\varepsilon_t$ noise. SARIMA captures both trend and seasonality jointly. For regime changes, Markov-switching AR processes allow parameters to depend on a latent regime governed by a transition matrix $\Pi$. Markov-switching models are the simplest answer to "the data look like two different processes" and a clean generalisation of the regime-flag approach in Section 02-01.

Calendar effects deserve their own engineering layer:

  • Day-of-week and month-end effects. Include seasonal dummies or deterministic regressors for settlement cycles and fund-flow patterns.
  • Holiday proximity. Widened spreads and reduced depth around holidays can be encoded via binary indicators or spline terms.
  • Intraday seasonality. U-shaped volatility in equities and futures needs time-of-day factors before applying ARMA or GARCH at intraday frequency.

Seasonal adjustment prevents spurious autocorrelation and improves forecast calibration when markets follow habitual rhythms.

Frequency-domain tools

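The spectral density $f(\omega)$ summarises variance allocation across frequencies. The periodogram

$$I(\omega_k) = \frac{1}{2 \pi n} \left| \sum_{t=1}^{n} y_t e^{-i \omega_k t} \right|^2, \qquad \omega_k = \frac{2 \pi k}{n},$$

highlights dominant cycles; smoothing kernels yield consistent estimates. Coherence extends the idea to pairs of series and measures how strongly they co-move at each frequency, while the accompanying phase spectrum identifies lead–lag relations; together they are the analogue of cross-correlation in the frequency domain. Frequency-domain filters (Hodrick–Prescott, Baxter–King, Christiano–Fitzgerald) isolate business-cycle components for macro applications.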

For intraday data and non-stationary regimes, wavelet and empirical mode decomposition give time–frequency localisation: trade intraday mean reversion at one scale while holding weekly momentum at another, with cost models tuned per scale. Multiscale decomposition also informs hierarchical RL policies in Chapter 5 with separate intra-episode and inter-episode dynamics.
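The raw periodogram is a one-liner on top of the FFT. The sketch below evaluates it at the positive Fourier frequencies and locates the dominant cycle in a synthetic series with a known 20-step period; the test signal is made up, and no smoothing kernel is applied.

```python
import numpy as np

def periodogram(y):
    """Raw periodogram at the positive Fourier frequencies omega_k = 2*pi*k/n."""
    n = len(y)
    dft = np.fft.rfft(y - y.mean())                 # demean so frequency zero is empty
    freqs = 2.0 * np.pi * np.arange(len(dft)) / n
    power = np.abs(dft) ** 2 / (2.0 * np.pi * n)
    return freqs[1:], power[1:]                     # drop the zero frequency

rng = np.random.default_rng(0)
t = np.arange(400)
y = np.sin(2.0 * np.pi * t / 20.0) + 0.5 * rng.standard_normal(400)

freqs, power = periodogram(y)
peak_freq = freqs[np.argmax(power)]
dominant_period = 2.0 * np.pi / peak_freq           # in time steps
```

Averaging neighbouring ordinates, or tapering the series first, turns this noisy raw estimate into the consistent smoothed estimate mentioned above.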

Model checking and stability

Regardless of the specification, the diagnostic recipe is the same:

  • Residual autocorrelation. ACF / PACF should show no significant spikes; Ljung–Box tests formalise it. Persistent structure means the lag order is wrong.
  • Residual normality and tails. Standardised residuals should resemble iid noise. Heavy tails suggest Student-$t$ innovations rather than Gaussian.
  • Conditional heteroskedasticity. ARCH tests on squared residuals; if they fire, the volatility specification needs work.
  • Parameter stability. CUSUM and Chow tests detect structural breaks. Re-estimate on rolling windows when breaks are frequent.
  • Out-of-sample. Rolling-origin forecasts are the only honest test; every diagnostic above is necessary, none is sufficient.

Order selection via AIC / BIC balances parsimony and fit, but should always be tempered by economic judgement.
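The Ljung–Box statistic behind the first diagnostic is simple to compute by hand. The sketch returns the $Q$ statistic only and compares it with 18.307, the 95% chi-square quantile for 10 degrees of freedom. For residuals of a fitted ARMA($p, q$) the degrees of freedom should be reduced by $p + q$, which this sketch does not do.

```python
import numpy as np

def ljung_box_q(resid, lags=10):
    """Ljung-Box Q statistic over the first `lags` sample autocorrelations."""
    x = resid - resid.mean()
    n = len(x)
    denom = x @ x
    ks = np.arange(1, lags + 1)
    rho = np.array([(x[k:] @ x[:-k]) / denom for k in ks])   # sample autocorrelations
    return n * (n + 2) * np.sum(rho ** 2 / (n - ks))

rng = np.random.default_rng(0)
white = rng.standard_normal(2_000)

# AR(1) series with coefficient 0.5: clearly autocorrelated "residuals".
ar1 = np.zeros(2_000)
for t in range(1, len(ar1)):
    ar1[t] = 0.5 * ar1[t - 1] + white[t]

CHI2_95_DF10 = 18.307                      # 95% quantile of chi-square with 10 dof
q_white = ljung_box_q(white)
q_ar1 = ljung_box_q(ar1)
reject_ar1 = q_ar1 > CHI2_95_DF10
```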

Practical implementation

  • Rolling estimation. Window length should match the half-life of regime persistence. Expanding windows reduce variance but adapt slowly; rolling windows react faster with more variance.
  • Missing data. Use Kalman smoothing or EM rather than forward-fill.
  • Software. statsmodels for ARIMA / SARIMA / VAR / VECM, pmdarima for auto-order selection, arch for GARCH-family, pykalman and dfm-python for state-space and dynamic-factor models. Wrap the estimators in reproducible pipelines with versioned parameters and seeded random draws (Section 02-05).

Bridging to machine learning

Classical models are not just a baseline; they are also feature generators. Three patterns recur:

  • Lag features and residual features. AR coefficients, GARCH conditional volatility, ARIMA residuals, VAR impulse responses — all feed into the tree and deep models of Chapter 4.
  • Hybrid predictors. ARIMA for linear structure plus a tree ensemble or neural net on the residuals; a strong default that often beats either component alone.
  • Benchmarks. Every deep model in Chapter 4 should beat an appropriately-tuned ARIMA / GARCH on validation data; if it does not, the deep model is the wrong tool, the features are wrong, or the evaluation is leaking.

The classical layer is where structure becomes visible. The chapters that follow add capacity, but they do not change the principle: an honest forecast lives inside an honest data-generating story, and classical models are how we tell that story.

Estimation and Simulation

Estimating model parameters and simulating scenarios are the two hands that shape any quantitative workflow. Estimation aligns models with observed data; simulation explores worlds not yet seen. Together they provide the inputs to the portfolio construction of Section 02-03 and the optimal control of Chapter 5 — which means a careful estimation pipeline pays for itself many times over once the downstream policies are running on its outputs.

Maximum likelihood and friends

Maximum likelihood (MLE) finds parameters $\theta$ that maximise the likelihood $L(\theta) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1}; \theta)$. Under regularity conditions, MLE is consistent and asymptotically efficient. In practice:

  • Use gradient-based optimisers with automatic differentiation when the likelihood is differentiable (ARIMA, GARCH, neural likelihoods).
  • Apply regularisation — ridge, lasso, elastic-net — to stabilise estimates when dimensionality approaches sample size.
  • For constrained parameters (variances, probabilities), optimise in unconstrained space via softplus or log transforms; this avoids fighting the optimiser at the boundary.

Bayesian estimation augments likelihood with priors. Posterior summaries provide credible intervals and propagate parameter uncertainty into downstream decisions — a crucial advantage when sample sizes are small or when stakeholders need explicit uncertainty bands. MCMC (Hamiltonian Monte Carlo, NUTS) is the workhorse; variational inference trades exactness for speed and is often good enough for streaming estimation.

Loss functions as design choices

Every estimator minimises some loss. Stating the loss explicitly aligns estimation with how the model will be used rather than with what is easiest to optimise.

  • Squared error. Conditional means; ideal when the downstream decision is symmetric in error.
  • Quantile (pinball) loss. Conditional quantiles; the right loss when the decision is asymmetric (VaR, capital allocation).
  • Negative log-likelihood. Calibrated densities; the right loss when the downstream module consumes the full distribution.
  • CRPS as a training loss for probabilistic forecasters when the downstream evaluation is also CRPS. Easier than NLL on heavy-tailed data.
  • Custom utility-aligned losses (downside-weighted, turnover-penalising) that tie estimation directly to a Chapter-5 objective.

The choice of loss is a modelling decision; document it as you would a prior or a constraint.
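The link between a loss and the quantity it estimates can be checked numerically. The sketch below implements the pinball loss and confirms on simulated data that the constant prediction minimising it at level 0.05 sits at the empirical 5% quantile. The grid search is for illustration only.

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Average quantile (pinball) loss at level tau for predictions q_pred."""
    diff = y - q_pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

rng = np.random.default_rng(0)
y = rng.standard_normal(10_000)            # stand-in for a return sample
tau = 0.05

# Brute-force search over constant predictions.
grid = np.linspace(-3.0, 3.0, 601)
losses = np.array([pinball_loss(y, g, tau) for g in grid])
best_constant = grid[np.argmin(losses)]
empirical_quantile = np.quantile(y, tau)
```

Swapping in squared error over the same grid would pull the minimiser to the sample mean instead, which is the design choice the list above describes.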

Estimation under constraints and regularisation

When sample size is small relative to dimensionality, unconstrained MLE overfits. Regularisation introduces bias to reduce variance:

  • Ridge / lasso / elastic-net. Penalise $\ell_2$ / $\ell_1$ / mixed norms of parameters. Lasso encourages sparsity (useful when only a handful of regressors should matter); ridge is the safer default in finance where signals are weak and dense.
  • Positive-definite covariance. Project noisy covariance matrices onto the nearest PSD matrix via eigenvalue clipping or Tikhonov regularisation $\Sigma + \lambda I$.
  • Graphical lasso. Sparse inverse-covariance estimates produce more stable risk matrices and illuminate conditional dependence structures — useful for risk-parity and factor-neutral construction.
  • Shrinkage priors. Bayesian horseshoe or spike-and-slab encourage sparsity while quantifying uncertainty.
  • Stability selection. Subsample data and record variable inclusion frequencies to identify robust predictors. The most defensible way to pick a sparse model when the validation horizon is short.

Constraints embody domain knowledge: non-negativity of variances, monotonicity of yield curves, no-arbitrage on implied-volatility surfaces. Estimators that respect these constraints produce more realistic simulations downstream.
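Eigenvalue clipping is short enough to sketch in full. Clipping at zero gives the nearest positive-semidefinite matrix in Frobenius norm; the small positive `floor` used here is an assumption that makes the result strictly positive definite so it can be inverted. The example matrix is a deliberately inconsistent set of pairwise correlations.

```python
import numpy as np

def clip_to_psd(A, floor=1e-8):
    """Symmetrise A and raise every eigenvalue below `floor` up to `floor`."""
    sym = 0.5 * (A + A.T)
    vals, vecs = np.linalg.eigh(sym)
    clipped = np.maximum(vals, floor)
    return (vecs * clipped) @ vecs.T           # V diag(clipped) V'

# Pairwise correlations that cannot all hold at once (indefinite matrix).
bad_corr = np.array([[ 1.0, 0.9, -0.9],
                     [ 0.9, 1.0,  0.9],
                     [-0.9, 0.9,  1.0]])
fixed = clip_to_psd(bad_corr)
```

Clipping changes the diagonal slightly, so a correlation matrix usually needs rescaling back to a unit diagonal afterwards.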

State-space estimation

For linear Gaussian state-space models the Kalman filter (Section 02-04) provides optimal online estimation of hidden states given observations, and the smoother refines them using the full sample. When the system matrices are unknown, the Expectation–Maximisation (EM) algorithm is the standard parameter-learning recipe:

  1. E-step. Run the filter and smoother to compute sufficient statistics under the current parameters.
  2. M-step. Update parameters to maximise the expected complete-data log-likelihood.

EM is robust, monotonically increases the marginal likelihood, and underpins many time-varying factor and beta models. Convergence is to a local optimum; multiple random initialisations and a sanity-check on the log-likelihood path are part of the recipe.

For nonlinear or non-Gaussian state-space models, particle filters combined with particle MCMC give samples from the joint posterior over states and parameters. Slower; appropriate when the linear-Gaussian approximation is materially wrong.

Calibration and parameter uncertainty

Point estimates hide uncertainty. Three complementary tools:

  • Fisher information / asymptotic confidence intervals. Cheap, asymptotic, often optimistic on financial sample sizes.
  • Parametric or nonparametric bootstrap. Refit the model on resampled data; the empirical distribution of estimates is the uncertainty band. Use block bootstraps (next subsection) for time series.
  • Bayesian posterior. Treat parameters as random and propagate the full posterior into simulation. The right answer when downstream decisions need credible intervals.

Sensitivity analysis — perturbing parameters within a confidence set and re-running the downstream pipeline — reveals stability of decisions, not just of estimates. Decisions that are stable under parameter perturbation are the ones to ship.

Bootstrap, block bootstrap, and jackknife

Resampling quantifies uncertainty without strong parametric assumptions. IID bootstrap fails for dependent series, so use block methods:

  • Moving block bootstrap. Resample contiguous blocks of length $b$ to preserve autocorrelation. Choose $b$ proportional to the half-life of the dominant autocorrelation.
  • Stationary bootstrap. Random block lengths with geometric distribution mitigate boundary artifacts.
  • Circular bootstrap. Wrap the series end-to-start to reduce edge effects on short samples.

Jackknife leave-one-out approximations provide cheap bias estimates for smooth statistics. Combined with parameter shrinkage, resampling methods give honest uncertainty bands under data scarcity.
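The two most common block schemes can be sketched as index generators. The block length and the mean block length are inputs the user must choose; the values below are placeholders, and the function names are hypothetical.

```python
import numpy as np

def moving_block_bootstrap(x, block_len, rng):
    """Concatenate randomly chosen contiguous blocks and trim to the original length."""
    n = len(x)
    n_blocks = -(-n // block_len)                              # ceiling division
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    idx = (starts[:, None] + np.arange(block_len)).ravel()[:n]
    return x[idx]

def stationary_bootstrap(x, mean_block_len, rng):
    """Geometric block lengths with circular wrap-around."""
    n = len(x)
    p_new = 1.0 / mean_block_len             # probability of starting a new block
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(n)
    for t in range(1, n):
        if rng.random() < p_new:
            idx[t] = rng.integers(n)         # jump to a fresh random start
        else:
            idx[t] = (idx[t - 1] + 1) % n    # continue the block, wrapping at the end
    return x[idx]

rng = np.random.default_rng(0)
x = np.arange(100.0)                         # the index itself, so blocks are visible
mbb_sample = moving_block_bootstrap(x, block_len=10, rng=rng)
sb_sample = stationary_bootstrap(x, mean_block_len=10, rng=rng)
```

Refitting the estimator on a few hundred such resamples and taking percentiles of the estimates gives the uncertainty band described above.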

Monte Carlo for scenario generation

Simulation serves two purposes: forward-looking scenarios for stress testing, and algorithm benchmarking. Three layers:

  • Parametric. Draw $r_t \sim \mathcal{N}(\mu, \Sigma)$ for the simplest case; layer GARCH on top for volatility clustering; add a Student-$t$ or skew-$t$ marginal for heavy tails.
  • Copula-based. Fit margins and a copula separately; sample by drawing from the copula and inverting the margins. The right answer when the dependence structure is what matters most for the stress test.
  • Generative. Diffusion, GAN, or VAE generators that learn the joint distribution end-to-end. Treated in detail in Chapter 10; the Monte Carlo point of view here is a useful prior for that work.

Variance reduction (antithetic variates, control variates, importance sampling) accelerates Monte Carlo whenever the quantity of interest has identifiable structure — almost always a worthwhile investment for risk-tail scenarios.
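Antithetic variates are the cheapest of these techniques to try. The sketch estimates $E[e^{Z}]$ for standard normal $Z$, whose true value is $e^{1/2}$, once with antithetic pairs and once with plain sampling at the same number of function evaluations. The target quantity is a toy stand-in for a payoff or loss functional.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 50_000

# Antithetic estimator: average f(z) and f(-z) within each pair.
z = rng.standard_normal(n_pairs)
pair_means = 0.5 * (np.exp(z) + np.exp(-z))
antithetic_estimate = pair_means.mean()
antithetic_se = pair_means.std(ddof=1) / np.sqrt(n_pairs)

# Plain estimator with the same budget of 2 * n_pairs evaluations.
plain = np.exp(rng.standard_normal(2 * n_pairs))
plain_estimate = plain.mean()
plain_se = plain.std(ddof=1) / np.sqrt(2 * n_pairs)

true_value = np.exp(0.5)
```

The gain comes from the negative correlation between `exp(z)` and `exp(-z)`; for functions that are not monotone in the driver the benefit can vanish, so comparing standard errors as above is a worthwhile check.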

Validation of simulated worlds

A simulator is only useful if it preserves the aspects of reality relevant to the decision at hand. The validation protocol has four layers, in increasing order of strength:

  • Marginal fit. Mean, std, skew, kurtosis per series compared to real; KS or energy distance tests on returns.
  • Autocorrelation match. ACF of returns and ACF of squared / absolute returns. A simulator that breaks volatility clustering is unusable for risk modelling.
  • Tail coverage. VaR, ES at 1%, 0.5%, 0.1% on simulated vs. real paths. Tail under-coverage is the failure mode that quietly breaks stress tests.
  • Downstream test. Train a forecaster (Chapter 4) or a policy (Chapter 5) on simulated data; evaluate on real held-out data. Performance in the same band as training-on-real means the simulator preserved the right structure; a collapse means the simulator is leaking spurious correlations.

This protocol applies whether the generator is a parametric Gaussian copula or a neural diffusion model.

Data quality and error models

Estimation is only as good as the input quality. Model the imperfections explicitly:

  • Outlier models. Mixture distributions with contamination components capture occasional bad prints; downweight those points in the likelihood rather than dropping them silently.
  • Missingness mechanisms. Distinguish missing-at-random (MAR) from not-at-random (MNAR); use EM with explicit observation indicators rather than imputing in advance.
  • Latency and revision. Macro series are revised; maintain vintages and estimate models on real-time data to reflect the information set available at the decision moment.
  • Quality flags as features. Forward-propagate per-row quality indicators (imputed, stale, out-of-hours) so downstream models can downweight them.

Document data filters and error assumptions so simulation draws reflect the same imperfections seen in production.

Sensitivity analysis and parameter heatmaps

Before trusting an estimator, map how outputs change with inputs:

  • Sweep estimation windows, decay factors, and regularisation strengths; visualise stability bands for downstream metrics (Sharpe, turnover).
  • Compute Sobol or other variance-based sensitivity indices when models are cheap to evaluate or emulatable.
  • Present parameter heatmaps to stakeholders so they can see fragile regions and avoid operating near cliffs where small input changes flip sign or magnitude.

These tools reduce model risk and guide hyperparameter defaults.

Bridging classical and deep estimation

Deep models rely on the same principles. Training a neural forecaster with negative log-likelihood is still MLE; using quantile loss is still estimating conditional quantiles. Regularisation, cross-validation, and calibration checks mirror the classical toolkit. The main difference is capacity: deep models can approximate complex nonlinearities, demanding careful evaluation to avoid learning artefacts of the training regime rather than genuine market structure. Chapters 4 and 6 each revisit estimation under this view.

Linking estimation, simulation, and optimisation

Estimation produces parameters; simulation explores scenarios; optimisation selects decisions. Keep the pipeline modular so improvements in any component can be swapped without rewriting the others. Concretely:

  • A portfolio optimiser (Section 02-03) consumes a generic scenario matrix, whether it came from a bootstrapped history or a deep generative model.
  • A reinforcement-learning policy (Chapter 5) consumes an environment whose dynamics may come from any of the simulators above; the policy itself does not care.
  • A monitoring layer (Section 02-02) consumes a stream of forecasts and outcomes; the forecasting model can be replaced without retraining the monitor.

Logging interfaces that record which estimator and seed produced each parameter set or scenario simplify audits and reproducibility.

End-to-end reproducibility playbook

  • Immutable datasets. Versioned Parquet/Arrow snapshots with hashing; no on-the-fly downloading during estimation runs.
  • Deterministic randomness. Seed every PRNG (Python, NumPy, PyTorch, simulation libraries) and log the seeds with results.
  • Configuration management. Typed configs (Pydantic, OmegaConf, Hydra) that specify estimator choices, priors, simulation seeds, and validation windows.
  • CI hooks. Lightweight estimation–simulation sanity checks on every commit so drifts are caught early.
  • Result manifests. Every artefact (parameter set, scenario file, forecast) carries a manifest with the input data hash, the config, and the seed.

A reproducible estimation–simulation backbone is the difference between "the model performed well" and "we can re-run that result and explain why." Every later chapter consumes this backbone, so the discipline introduced here pays the largest compounding interest in the book.
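A result manifest can be sketched with the standard library alone. The field names (`data_sha256`, `config`, `seed`, `manifest_id`) and the example configuration are hypothetical; the point is that the identifier is derived from canonical JSON, so identical inputs always map to the same manifest.

```python
import hashlib
import json

def data_hash(raw_bytes):
    """SHA-256 digest of the exact bytes the estimator consumed."""
    return hashlib.sha256(raw_bytes).hexdigest()

def build_manifest(raw_bytes, config, seed):
    """Bundle the data hash, config, and seed, and derive a stable identifier."""
    manifest = {"data_sha256": data_hash(raw_bytes), "config": config, "seed": seed}
    # Canonical JSON (sorted keys, fixed separators) keeps the id stable across runs.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["manifest_id"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return manifest

raw = b"date,ret\n2024-01-02,0.0012\n2024-01-03,-0.0034\n"   # stand-in for a dataset snapshot
config = {"estimator": "garch11", "window": 750}              # hypothetical configuration
manifest = build_manifest(raw, config, seed=42)
```

Storing the manifest next to each artefact lets an audit trace a parameter set or scenario file back to the data, configuration, and seed that produced it.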