Chapter 04

Forecasting

"In this world, there's two kinds of people, my friend: those with loaded guns and those who dig. You dig."The Good, the Bad and the Ugly (1966)

A forecast is the loaded gun. The decision layer of the next chapter is what gets to use it. The forecaster who confuses the two — who ships a model into production with no policy on how its outputs will be acted on — has dug a hole, not built a system.

A forecast is an answer to "what is the conditional distribution of the next observation given everything we know now?" Everything downstream — position sizing, hedging, scenario testing — is some functional of that distribution. This chapter walks the spectrum from disciplined univariate baselines through tree ensembles, deep sequence models, and transformer-based forecasters, with the same evaluation discipline applied throughout.

What this chapter does

Three things repeat across every section, regardless of model class:

  • Define the target. Match the target horizon, transformation, and loss to the decision the forecast feeds. A forecaster trained on point-RMSE will produce different positions than one trained on pinball or CRPS, even for the same architecture.
  • Respect temporal splits. Walk-forward training, rolling-origin evaluation, and a strict separation between future-known and future-unknown features. Most forecasting bugs in finance trace back to a leak here.
  • Evaluate distributionally. RMSE on a median forecast is rarely enough. CRPS, reliability diagrams, and per-regime breakdowns are the metrics that actually correlate with downstream PnL.

This Chapter Covers

  • Crafting disciplined univariate baselines (ARIMA, ETS, simple ML) that any later model has to beat to justify itself.
  • Leveraging cross-series structure via VARs, BVARs, VECMs, and feature-enriched panels — including the cross-sectional volatility models that risk overlays consume.
  • Tree ensembles for nonlinear tabular signals; the strongest baseline on most engineered-feature problems and the one to beat before reaching for a deep model.
  • Deep sequence models (RNN, TCN, DeepAR, N-BEATS / N-HiTS) for panels where shared parameters across series and long-range temporal structure pay off.
  • Transformer-based forecasters — PatchTST, iTransformer, TimeMixer, TimeXer — and the foundation-forecaster baselines (Chronos, Lag-Llama, MOIRAI) that show up in Chapter 7.

Contents

Univariate Forecasting

We start with the simplest setting: one time series, one forecast horizon. The goal is not to compete with the deep models later in the chapter but to establish the disciplined baselines that those models will have to clear. If a transformer cannot beat a tuned ARIMA or an exponential smoothing model on a clean univariate series, the problem is either ill-posed or the features are insufficient — both worth knowing before the next training run.

The classical time-series machinery (AR, MA, ARIMA, GARCH, ETS) is introduced in Section 02-04. This section assumes that vocabulary and focuses on the workflow: how to turn the theory into honest forecasts that downstream chapters can consume.

Workflow

The univariate workflow is the same recipe every other forecasting model in the book follows.

  1. Define the target. Log returns, realised volatility, a macro indicator. Match the target to the horizon and the loss; a forecaster trained on point-RMSE produces different positions than one trained on pinball or CRPS.
  2. Split by time. Train / validation / test folds respect chronology. Reserve the latest 15–20% of observations for held-out testing and apply purged gaps if labels overlap.
  3. Choose a naive benchmark. Random walk for log returns, seasonal naïve for volatility or macro series, drift for trending levels. The benchmark is what every later metric is compared against.
  4. Fit a small set of statistical models. AR, ARIMA, ETS. Use information criteria for order selection and inspect residuals.
  5. Roll the evaluation forward. Sliding-origin or expanding-origin re-estimation captures regime change; a single train/test split does not.

Baseline models

Autoregressive

An AR() regression of the current value on the past lags captures mean-reversion and momentum. Useful as a first model on log returns and realised volatility.

import polars as pl
from statsmodels.tsa.ar_model import AutoReg
 
series = pl.read_csv("data/returns_daily.csv", try_parse_dates=True)
asset = series.filter(pl.col("Ticker") == "SPY").sort("Date")
y = asset.get_column("log_ret").to_numpy().astype("float64")
 
model = AutoReg(y, lags=5, trend="n", old_names=False).fit()
forecast_one_step = model.predict(start=len(y), end=len(y))

ARIMA / SARIMA

When a series carries unit roots or deterministic trend (yields, inflation, level series), difference before modelling. ARIMA() absorbs levels of differencing plus an MA component; SARIMA adds seasonal terms with period — weekly cycles in daily macro data, monthly cycles in retail prints.

from statsmodels.tsa.arima.model import ARIMA
 
arima = ARIMA(y, order=(2, 1, 2)).fit()
next_value = arima.forecast()[0]

For order selection on long batches, pmdarima.auto_arima automates the AIC/BIC sweep and adds a stationarity-test gate. A short manual ACF/PACF inspection is still a good idea — auto-selection occasionally over-fits when the series has structural breaks.

Exponential smoothing

Holt–Winters and the broader ETS family decompose level, trend, and seasonality and update each with its own smoothing parameter. ETS is the right tool when seasonality dominates and the noise is mild — weekly volatility cycles, monthly fund flows, intraday volume profiles after detrending.

from statsmodels.tsa.holtwinters import ExponentialSmoothing
 
ets = ExponentialSmoothing(
    asset.get_column("realised_vol").to_numpy(),
    trend="add",
    seasonal="add",
    seasonal_periods=52,
).fit()
weekly_forecast = ets.forecast(steps=1)

GARCH for volatility

For conditional volatility forecasting, the right baseline is GARCH (Section 02-04), not the AR / ARIMA family applied to squared returns. The arch package is the production default:

from arch import arch_model
 
am = arch_model(y * 100, mean="Constant", vol="Garch", p=1, q=1, dist="t")
fit = am.fit(disp="off")
sigma_forecast = fit.forecast(horizon=5).variance.values[-1, :] ** 0.5

The Student- innovation distribution is almost always preferable to Gaussian on equity-class data; use the QQ plot of standardised residuals to verify.

Rolling backtest

Every univariate model gets the same evaluation harness. The pattern: fit on a training window, predict the next horizon, slide forward, and record the full error distribution rather than a single summary.

import numpy as np
 
window  = 252 * 3
horizon = 5
preds, targets = [], []
 
for start in range(0, len(y) - window - horizon):
    train = y[start : start + window]
    model = AutoReg(train, lags=10, old_names=False).fit()
    preds.append(
        model.predict(start=len(train), end=len(train) + horizon - 1)
    )
    targets.append(y[start + window : start + window + horizon])
 
errors = np.array(preds) - np.array(targets)
rmse  = float(np.sqrt((errors ** 2).mean()))
mae   = float(np.abs(errors).mean())

Two diagnostics belong on every backtest report:

  • Per-regime breakdown. Split the held-out errors by realised volatility tertile, by year, or by an exogenous regime flag. A model that wins on the average but loses in stressed regimes is the wrong model for any decision that has to live through stress.
  • Error distribution, not just RMSE. Histogram of errors, ACF of errors, and a Diebold–Mariano test against the naive benchmark. Persistent autocorrelation in errors is a sign that the model is systematically missing structure.

Probabilistic outputs

Point forecasts hide risk; the decision layer of Chapter 5 needs full predictive distributions. Two recipes work well at the univariate baseline level:

  • Quantile regression. Train one model per quantile minimising the pinball loss . Cheap, distribution-free, robust.
  • Parametric density. ARIMA + Student- residuals or GARCH + -innovations gives a closed-form predictive distribution. Validate with the probability integral transform (PIT) histogram (Section 02-02): under a calibrated model, is uniform on .

For univariate scenario generation, bootstrapping standardised residuals preserves the empirical tail without committing to a parametric distribution. Carrying the GARCH fit from the snippet above:

import numpy as np
forecast    = fit.forecast(horizon=horizon, reindex=False)
mean_path   = forecast.mean.values[-1, :]            # shape (horizon,)
sigma_path  = forecast.variance.values[-1, :] ** 0.5 # shape (horizon,)
 
resid_std   = fit.std_resid.dropna().values          # standardised residuals
rng         = np.random.default_rng(0)
boot        = rng.choice(resid_std, size=(1_000, horizon), replace=True)
scenarios   = mean_path + boot * sigma_path           # shape (1_000, horizon)

The scenarios feed the synthetic-data and stress-testing workflows of Chapter 10.

When the univariate baseline runs out

Three signals that the univariate model has stopped being enough:

  • Cross-series structure dominates. Adding correlated series to the feature set genuinely improves the held-out CRPS — Section 04-02 is where to go.
  • Long-range dependencies that lag cannot capture. Slow drift, multi-week cycles, complex seasonality — tree models with engineered features (04-03) or deep models (04-04 / 04-05) are better suited.
  • Distributional output is the deliverable. Quantile boosters and deep probabilistic forecasters give full distributions with less fragility than ad-hoc bootstrap-on-residuals.

Even in those cases, keep the univariate baseline live. Drift in the gap between the univariate baseline's CRPS and the production model's CRPS is the earliest, cheapest signal of regime change you will get.

Multivariate Forecasting

Real portfolios rarely depend on a single time series. Macro indicators, rates, commodities, and alternative data interact, and the cross-sectional structure carries as much signal as the dynamics of any individual series. Multivariate models are the workhorse layer that turns those interactions into forecasts you can hand to a portfolio optimiser.

When to go multivariate

Three questions tell you whether a univariate model is enough.

  • Are the series correlated? Pairs (oil and energy equities), regimes (rates and credit spreads), and panels (a sector basket) all carry contemporaneous covariance that univariate models cannot exploit.
  • Do you need coherent forecasts? A bond portfolio needs forecasts whose yield-curve shape is internally consistent; a basket of FX crosses needs forecasts that respect triangular arbitrage. Independent univariate forecasts do not.
  • Is the forecast variance the binding constraint? When the covariance of the forecast error matters more than its mean — risk budgeting, tracking error, hedge sizing — you need a model that learns the covariance structure end-to-end.

Vector Autoregressions (VAR)

VAR models extend AR ideas to -dimensional vectors. Each variable depends on its own lags and the lags of others,

Use them when you care about impulse responses (how a one-off shock to one variable propagates), Granger causality (whether series 's past helps predict series given the rest), or simultaneous forecasts of correlated series such as inflation, unemployment, and policy rates.

import polars as pl
from statsmodels.tsa.api import VAR
 
data = pl.read_csv("data/macro_weekly.csv", try_parse_dates=True)
wide = data.pivot(index="Date", columns="Series", values="Value").drop_nulls()
var = VAR(wide.to_pandas()).fit(maxlags=12, ic="aic")
forecast = var.forecast(wide.to_pandas().values[-var.k_ar:], steps=4)

Two diagnostics belong on every VAR you ship: the eigenvalues of the companion matrix (must lie inside the unit circle for stationarity), and the residual autocorrelation matrix (white-ish off-diagonals; persistent structure means you under-specified the lag order or missed a regime). Information criteria (AIC, BIC, HQIC) trade fit against parsimony — BIC tends to under-fit on financial data, AIC to over-fit; HQIC sits in the middle and is a reasonable default.

Bayesian VAR (BVAR)

Financial macro datasets often contain more variables than observations. Coefficient estimates of unrestricted VARs blow up; forecasts inherit that noise. Bayesian priors shrink coefficients toward economically plausible values:

  • Minnesota prior \citep{litterman1986bvar}: shrinks own-lag coefficients toward 1 and cross-lag coefficients toward 0, with shrinkage strength decaying in lag order. The default in central-bank work.
  • Horseshoe / spike-and-slab priors: heavier-tailed shrinkage that allows a small number of large coefficients while crushing the rest. Better when the truth is genuinely sparse.
  • Hierarchical priors for panels (countries, sectors): pool information across cross-sectional units while letting each unit deviate.
import numpy as np
from statsmodels.tsa.api import VAR as SMVAR
from sklearn.linear_model import Ridge
 
# A small Minnesota-style ridge regression on a stacked design matrix.
# k series, p lags  =>  X has k * p columns + 1 (intercept).
def build_lag_matrix(y: np.ndarray, p: int):
    T, k = y.shape
    rows = [np.concatenate([y[t - i] for i in range(1, p + 1)]) for t in range(p, T)]
    return np.asarray(rows), y[p:]
 
X, Y = build_lag_matrix(wide.to_pandas().to_numpy(), p=12)
bvar = Ridge(alpha=1.0).fit(X, Y)               # one ridge per equation, stacked
print(bvar.coef_.shape)                          # (k_target, k * p)

Ridge with a single is the simplest realisation of the Minnesota prior; the more elaborate priors (Horseshoe, hierarchical) are best approached through bayesian packages (pybats, statsmodels's BayesianVAR) once the ridge baseline is calibrated.

For high-dimensional VARs (dozens of series), consider reduced-rank VARs or factor-augmented VARs (FAVAR) that project data into a low-dimensional latent space — exactly the dynamic-factor machinery of Chapter 6 — and then run the VAR on factors. That route also makes the structural-shock identification in the chapter on dynamics much cleaner.

Cointegration and VECM

When nonstationary series share a long-run equilibrium (think pairs trading, the basis between a futures contract and its underlying, or the relationship between inflation and short rates), use a Vector Error-Correction Model. VECMs differentiate the series but add a term that pulls them back toward equilibrium,

with . The columns of define the cointegrating vectors (the spreads that mean-revert), and is the adjustment speed back to equilibrium. The Johansen test estimates the rank of and so the number of cointegrating relationships.

In financial practice the most common use is statistical arbitrage on a small basket: estimate , watch the residual spread , and trade mean-reversion in that spread with a half-life set by . Always test stability of in rolling windows — cointegrating relationships break, and the moment they do is exactly when the strategy would otherwise stack risk on a stale signal.

Cross-Series Volatility: DCC-GARCH

For risk-budgeting and option pricing the object of interest is not the mean forecast but the covariance one. A full multivariate GARCH is heavy because it has free parameters; Dynamic Conditional Correlation (DCC) \citep{engle2002dcc} buys most of the value at cost by separating marginal volatilities and the correlation matrix:

with a univariate GARCH per series and a slowly-evolving correlation matrix. DCC is the default risk-model engine in many investment banks because it is fast, stable, and produces covariance forecasts that plug straight into a mean-variance optimiser.

Feature-Enriched Deep Models

Modern deep forecasters (temporal convolutions, transformers) eat multivariate panels too, but they require richer features and a consistent tensor layout:

  1. Align calendars. Forward-fill macro data to match daily market frequency while flagging imputed points so the model can downweight them.
  2. Engineer covariates. Include lagged returns, rolling statistics, calendar features (day-of-week, month, holiday flags), and exogenous drivers (oil, FX, rates) that you genuinely have at forecast time.
  3. Standardise per series. Scale each column with its own mean/std to avoid dominance by high-variance signals; for fat-tailed series consider robust scaling (median / MAD).
  4. Store metadata. Keep masks for missing values and a separate channel for known-future features (planned policy meetings, scheduled earnings, roll dates) so the model can condition on them without leaking.
from sklearn.preprocessing import StandardScaler
import numpy as np
 
values = wide.select(pl.all().exclude("Date")).to_numpy()
scaler = StandardScaler().fit(values)
scaled = scaler.transform(values)
np.savez("artifacts/multivariate_tensor.npz", data=scaled)

Two architectural patterns dominate the recent literature on multivariate deep forecasting and we will return to both:

  • Channel-mixing. Cross-series attention or MLP layers that explicitly exchange information across the panel. Effective when the cross-sectional signal is strong; iTransformer \citep{liu2024itransformer} and TimeMixer \citep{wang2024timemixer} are representative.
  • Channel-independence with shared parameters. The same architecture applied per series with shared weights, betting that temporal patterns generalise across series and that cross-series structure can be handled downstream. PatchTST \citep{nie2022patchtst} popularised this view.

Whether mixing or independence wins depends on the panel. Equity-cross-section problems with strong factor structure tend to benefit from mixing; long single-asset series with rich seasonality often do better with independence. The forecasters in Sections 04-04 and 04-05 implement both, and we benchmark them on the same macro and equity panels.

Tree-Based Forecasting

Gradient-boosted trees and random forests absorb nonlinear interactions and discontinuous signals better than linear models, and they remain the strongest baseline on most financial forecasting problems where the signal is in tabular engineered features rather than in the raw waveform of the series. For the M5 retail competition, the M6 financial-forecasting competition, and a long string of Kaggle finance challenges, the winning entries were variations on LightGBM with thoughtful features. The lesson generalises: before reaching for deep models, build a tree baseline you can defend.

Why trees are the default baseline

Three properties of trees make them well-matched to financial tabular data.

  • Heterogeneous features without preprocessing. Trees split on a single feature at a time, so they handle a mix of integer indicators, log-scaled prices, raw counts, and categorical IDs without needing whitening or one-hot expansions. That removes a whole class of bugs around scaling, missing values, and outliers.
  • Stable on small data. Boosted trees with regularisation and early stopping fit reliably on the 5–20 year history that most financial datasets give us. Deep models in the same regime tend to overfit unless you have strong cross-series structure to share parameters across.
  • Nonparametric interactions for free. Tree depth automatically captures conjunctions ("low-volatility and steep yield curve"); you do not need to specify the interaction up front the way you would in a linear model.

The tradeoff is that trees do not extrapolate beyond the training range and do not learn the temporal continuity of the series. Those are the gaps that the deep and transformer-based forecasters in the next two sections fill.

Feature engineering is the model

A boosted tree is only as good as the features you hand it. Build a supervised table where each row is one observation:

  • Target: — the future return, log-return, or realised volatility at the horizon you actually care about. Match the target to the decision (Section 5-01): if the policy uses log-returns, train on log-returns.
  • Lagged features: for a lookback window tied to the autocorrelation horizon of the data.
  • Rolling statistics: moving averages, standard deviations, skew, kurtosis; realised range and downside semi-variance for risk-aware features.
  • Cross-asset features: sector returns, factor exposures (value, momentum, quality), or correlations to a benchmark.
  • Calendar and event features: day-of-week, month, time-to-earnings, holiday distance, scheduled-event flags. Trees love these.
  • Metadata for the trainer: sample weights, regime labels, asset IDs as categorical features (LightGBM handles them natively).
import polars as pl
 
def build_features(df: pl.DataFrame, horizon: int = 5) -> pl.DataFrame:
    return (
        df.sort("Date")
        .with_columns([pl.col("log_ret").shift(i).alias(f"lag_{i}") for i in range(1, 11)])
        .with_columns([
            pl.col("log_ret").rolling_mean(window_size=20).alias("ma20"),
            pl.col("log_ret").rolling_std(window_size=20).alias("vol20"),
            pl.col("log_ret").rolling_skew(window_size=60).alias("skew60"),
            pl.col("Volume").log().alias("log_volume"),
            pl.col("Date").dt.weekday().alias("weekday"),
            pl.col("Date").dt.month().alias("month"),
        ])
        .with_columns(pl.col("log_ret").shift(-horizon).alias("target"))
        .drop_nulls()
    )

The most common subtle bug in tree-based finance pipelines is target leakage: a "lagged" feature accidentally constructed from same-day data, a rolling statistic that includes the row's own observation, an asset ID encoder fit on the full sample. Every feature should be auditable to the information available at time , and the audit is easier if every column name encodes the lag.

Gradient boosting in practice

LightGBM is the default for fast experimentation; XGBoost and CatBoost are near-equivalent and the choice mostly comes down to ergonomics and the shape of your categorical features. For most financial settings:

import lightgbm as lgb
from sklearn.model_selection import TimeSeriesSplit
 
features = build_features(asset_df)
X = features.drop(["Date", "target"]).to_numpy()
y = features["target"].to_numpy()
 
tscv = TimeSeriesSplit(n_splits=5)
params = dict(
    objective="regression",
    metric="rmse",
    learning_rate=0.05,
    num_leaves=128,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=5,
    lambda_l2=1.0,
    min_child_samples=200,
)
 
scores = []
for train_idx, val_idx in tscv.split(X):
    train = lgb.Dataset(X[train_idx], label=y[train_idx])
    valid = lgb.Dataset(X[val_idx], label=y[val_idx])
    model = lgb.train(
        params, train,
        valid_sets=[valid],
        num_boost_round=2000,
        callbacks=[lgb.early_stopping(50)],
    )
    scores.append(model.best_score["valid_0"]["rmse"])

Three knobs do most of the work:

  • num_leaves × min_child_samples controls capacity. On daily equity data with 2000 rows per asset, num_leaves=64–128 and min_child_samples=200–500 is a sane starting point; larger leaves overfit the tail.
  • feature_fraction and bagging_fraction add stochastic regularisation per tree. Both around 0.8 is the standard finance recipe.
  • learning_rate × num_boost_round trades fit speed for stability. Smaller learning rates with early stopping on a held-out window almost always win on out-of-sample loss; default to learning_rate=0.03–0.05 and let early stopping pick the round count.

Monitor feature importance qualitatively. Trees split greedily, so two collinear features can each take half the splits and look "less important" than they are. Use SHAP values for a more honest attribution, especially when you need to defend the model to a non-modeller stakeholder.

Quantile and probabilistic forecasts

A point forecast from a tree is rarely enough for the decision layer of Chapter 5. Two patterns recover a full predictive distribution:

  • Quantile regression with pinball loss. Train one model per quantile with LightGBM's objective="quantile", alpha=tau. Concatenate the predictions to get a predictive CDF. Cheap and reliable; the only catch is non-monotone predictions across quantiles, which you fix with a small isotonic post-processing step.
  • Quantile Regression Forests \citep{meinshausen2006qrf}. Ordinary random forests are aggregated by mean; QRF aggregates the empirical distribution of leaf observations and inverts it. Slower at inference but distributionally consistent by construction.

For evaluation, use CRPS and reliability diagrams from Section 02-02 — not RMSE on the median quantile. A model that wins on RMSE but is mis-calibrated will produce position sizes that are systematically too large or too small.

Random forests as a stress-resistant baseline

When you need a model that resists overfitting and is easy to validate, bootstrap ensembles shine. Random forests cannot extrapolate far beyond the training range — so they are not the right tool when the live regime has moved out-of-sample — but they provide stable probability estimates and very honest feature attribution.

from sklearn.ensemble import RandomForestRegressor
 
rf = RandomForestRegressor(
    n_estimators=500,
    max_depth=10,
    min_samples_leaf=200,
    random_state=0,
    n_jobs=-1,
)
rf.fit(X, y)

A useful pattern is to keep the random forest as a sanity check alongside a boosted-tree production model: when the two diverge sharply on live data, the boosted model is usually overfitting a recent anomaly that the forest's averaging has dampened.

Cross-sectional panels

For equity universes, build panel features and fit a single model across all tickers rather than one model per ticker:

  • Stack rows for every ticker into one training table.
  • Add the ticker as a categorical feature (LightGBM's categorical_feature) or use target encoding with proper out-of-fold construction.
  • Add cross-sectional features: sector or country dummies, industry-level averages, factor exposures.
  • Use grouped time-series cross-validation: train on 2010–2018, validate 2019, test 2020+. Ensure no leakage from future fundamentals, revised data, or look-ahead in cross-sectional rankings.

The panel approach amortises noise across tickers and is what made tree ensembles competitive on the M6 financial competition. The catch is that the panel must be reasonably homogeneous — pooling small-cap technology stocks with large-cap utilities will hurt unless you condition on the segment.

Deployment notes

  • Daily refresh. Update features incrementally with a join on the latest observation rather than recomputing the full table.
  • Audit signals. SHAP or accumulated local effects (ALE) plots show how each feature drives the prediction at the row level; this is what you show in a model risk review.
  • Latency. Export models to treelite or ONNX if you need microsecond-class inference; for daily strategies the native LightGBM predict is usually fast enough.
  • Versioning. Persist both the model and the feature recipe (the exact Polars expressions that built the table). Replays of historical forecasts break the moment the recipe drifts.

Tree-based forecasters bridge classical statistics and deep learning. They thrive on structured, high-SNR signals; they produce interpretable outputs for stakeholders; and they set the bar that the deep models in the next two sections have to clear.

Deep-Learning-Integrated Forecasting

Tree ensembles set the baseline; deep sequence models earn their keep when the signal lives in temporal structure that cannot be flattened into a tabular row, or when the panel is large enough that shared parameters across series materially helps. Before the transformer-specific architectures of the next section, we cover the broader deep-forecasting toolkit — recurrent nets, temporal convolutions, deep AR/seq2seq, and decomposition-based architectures like N-BEATS / N-HiTS — that often beat trees with a fraction of the parameter budget of a transformer.

When to reach for a deep model

Three concrete conditions make a deep model the right choice rather than a boosted tree.

  • Many series, shared dynamics. A panel of hundreds or thousands of related series (an entire equity universe, an electricity-grid panel, a retail-sales panel) lets you fit one model with shared weights. Trees fit per series — or with categorical IDs, per row — but cannot share inductive bias across series the way a deep model with channel-independence can.
  • Long-range temporal pattern. Seasonality at multiple scales, slow regime drift, or hierarchical structure (intraday → daily → weekly) is what RNN/TCN/transformer models reconstruct without you hand-engineering rolling features for every scale.
  • Probabilistic outputs at scale. DeepAR and its descendants produce full-distribution forecasts in one shot; quantile-loss boosters can match them per-quantile but become awkward at panel scale.

If none of these holds — handful of series, mostly tabular signal, short horizon — a LightGBM with the features from Section 04-03 will usually win.

Sequence preparation

Deep forecasters expect tensors shaped (batch, time, feature) plus masks describing missing observations and known-future inputs. The pieces:

  • Encoder inputs: historical target values, past covariates, static metadata embeddings (sector, country, asset class).
  • Decoder inputs (for seq2seq architectures): known-future features (calendar, scheduled events, holidays) and previously generated targets.
  • Masks: binary indicators for observed vs. padded vs. imputed positions. The mask is what lets the model down-weight imputed values rather than treat them as ground truth.
import torch
 
def make_batch(history, future, static):
    encoder = torch.tensor(history, dtype=torch.float32)
    decoder = torch.tensor(future, dtype=torch.float32)
    static_embed = torch.tensor(static, dtype=torch.float32)
    enc_mask = torch.isnan(encoder)
    return dict(
        encoder=encoder.nan_to_num(),
        decoder=decoder.nan_to_num(),
        static=static_embed,
        mask=enc_mask,
    )

A normalisation step that shows up in every reproducible deep-forecasting pipeline is per-series instance normalisation: subtract the per-window mean, divide by the per-window standard deviation, predict in standardised space, and undo the scaling at output. Models like RevIN \citep{kim2021revin} make this part of the network rather than the data pipeline; either way, it is the single biggest robustness lever against regime shift.

Recurrent and convolutional baselines

Three classical deep architectures still hold up:

  • LSTM / GRU encoders with an attention readout for the forecast horizon. Fast to train, easy to debug; good when the autocorrelation structure is short and seasonality is mild.
  • Temporal convolutional networks (TCN). Stacked dilated 1D convolutions give exponentially growing receptive fields with linear parameter cost. TCNs train more stably than RNNs on long contexts and remain competitive with much larger transformers on standard univariate benchmarks.
  • DeepAR \citep{salinas2020deepar}. An RNN that emits the parameters of a per-step likelihood (Gaussian, Student-, negative binomial), trained by maximum likelihood and rolled out autoregressively for multi-step probabilistic forecasts. The default deep probabilistic forecaster in AWS/GluonTS and a strong baseline on retail and load forecasting.
import torch.nn as nn
 
class TCNBlock(nn.Module):
    def __init__(self, c, k=3, d=1):
        super().__init__()
        pad = (k - 1) * d
        self.conv = nn.Conv1d(c, c, kernel_size=k, padding=pad, dilation=d)
        self.act = nn.GELU()
    def forward(self, x):
        y = self.conv(x)[..., : x.shape[-1]]
        return self.act(x + y)

The recipe for any of these baselines is the same: rolling-origin training with early stopping on a held-out window, instance normalisation, and a small number of layers — five or six is usually enough.

Decomposition-based architectures: N-BEATS and N-HiTS

A separate strand of work bypasses recurrence and attention entirely. N-BEATS \citep{oreshkin2020nbeats} stacks fully-connected blocks that output two projections per layer: a backcast that subtracts what the block has explained from the input, and a forecast that contributes to the prediction. Trained end-to-end, this acts like a learned, hierarchical decomposition into trend, seasonality, and residuals. N-HiTS \citep{challu2023nhits} adds multi-rate signal sampling so that long-horizon forecasts spend their parameter budget on coarse structure and short-horizon forecasts on fine detail.

Two reasons to keep these on the menu:

  • They are interpretable: the per-block backcasts decompose the forecast into named components (trend, seasonality, residual) you can plot.
  • They are efficient: an MLP-only architecture with no attention runs faster than a transformer at the same accuracy on M3, M4, and Tourism benchmarks.

neuralforecast provides a clean PyTorch-Lightning wrapper for both.

A working DeepAR-style head

Section 4-06 covers probabilistic forecasting from the output-type perspective — the proper-scoring discipline that any backbone has to satisfy. Here we walk through a concrete parametric likelihood head that pairs naturally with the RNN / TCN backbones above: a Student- density per step, trained by negative log-likelihood \citep{salinas2020deepar}.

import torch
import torch.nn as nn
from torch.distributions import StudentT
 
class StudentTHead(nn.Module):
    """Predicts (location, scale, df) of a Student-t per time step."""
 
    def __init__(self, hidden: int):
        super().__init__()
        self.loc   = nn.Linear(hidden, 1)
        self.scale = nn.Linear(hidden, 1)
        self.df    = nn.Linear(hidden, 1)
 
    def forward(self, h: torch.Tensor) -> StudentT:
        loc   = self.loc(h).squeeze(-1)
        scale = nn.functional.softplus(self.scale(h)).squeeze(-1) + 1e-3
        df    = nn.functional.softplus(self.df(h)).squeeze(-1) + 2.5
        return StudentT(df=df, loc=loc, scale=scale)
 
 
class DeepARLite(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64, layers: int = 2):
        super().__init__()
        self.rnn  = nn.GRU(n_features, hidden, num_layers=layers,
                            batch_first=True)
        self.head = StudentTHead(hidden)
 
    def forward(self, x: torch.Tensor) -> StudentT:
        h, _ = self.rnn(x)
        return self.head(h)
 
    def loss(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        dist = self.forward(x)
        return -dist.log_prob(y).mean()

Three notes that decide whether this trains stably on financial data:

  • df floor. A Student- with has infinite variance; the gradient explodes around the boundary. The +2.5 floor in the softplus keeps optimisation stable.
  • Per-window standardisation (Section 4-05's RevIN trick) is just as important here as for transformers; without it, the head's scale parameters chase the moving baseline.
  • Teacher forcing during training, autoregressive sampling at inference. At training time, feed the true previous values; at inference, sample from the predicted distribution and feed the sample as the next input. The classical DeepAR rollout.

For probabilistic evaluation (Section 4-06), the head exposes a density, so CRPS and reliability diagrams come straight out of sampling 100 trajectories and aggregating.

Quantile head as the alternative

When the parametric assumption is uncomfortable (volatility regime shifts, mixture of return scales), a quantile head trained with pinball loss is the distribution-free alternative:

import torch
import torch.nn as nn
 
QUANTILES = torch.tensor([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95])
 
class QuantileHead(nn.Module):
    def __init__(self, hidden: int, n_quantiles: int = 7):
        super().__init__()
        self.proj = nn.Linear(hidden, n_quantiles)
 
    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)                                # (B, T, Q)
 
def pinball_loss(pred: torch.Tensor, y: torch.Tensor,
                 quantiles: torch.Tensor) -> torch.Tensor:
    diff = y.unsqueeze(-1) - pred                          # (B, T, Q)
    loss = torch.maximum(quantiles * diff, (quantiles - 1) * diff)
    return loss.mean()

Two practical complications:

  • Quantile crossing. The model can emit non-monotone predictions (5%-quantile above 95%-quantile). A small isotonic regression as post-processing fixes this without touching the trained weights.
  • CRPS approximation. With 7 quantiles the average pinball loss over the grid converges to CRPS as the grid is refined; with only 3 quantiles the approximation is too coarse to be useful.

Conformal post-processing (Section 4-06) wraps either head: train the point or quantile output, then compute conformal intervals from held-out residuals to recover coverage guarantees.

Training tips that actually move the needle

  • Patching. Treat the time axis as a sequence of small patches (length 8, 16, or 32) rather than per-step tokens. This is the single change that closed most of the gap between transformer forecasters and TCN/RNN baselines, and it generalises beyond transformers — patch-and-MLP architectures (TSMixer) get most of the same benefit.
  • Curriculum on horizon. Start with short encoder/decoder lengths and scale up. Stabilises early training and avoids the model collapsing into a copy-of-last-value solution.
  • Channel independence vs. mixing. Try both. PatchTST-style channel independence is a strong default for univariate-shape problems with shared dynamics; iTransformer and TimeMixer mix channels and often win on cross-asset panels with strong factor structure.
  • Loss-weighting. For long horizons, equally-weighted point losses systematically under-fit the back end. A schedule that emphasises later horizons during training is a small but reliable improvement.
  • Data augmentation. Mixup on the time dimension and stochastic input dropping prevent overfitting to specific regimes and cost almost nothing.

Monitoring and deployment

  • Quantile coverage. Track empirical coverage of nominal 80% / 90% intervals on a rolling window. Drift in coverage is the earliest signal of regime change and almost always precedes a drop in PnL.
  • Attention / variable importance. For TFT-style models, log the variable selection weights; for transformer-style, log the average attention pattern. Sudden shifts mean the model has started relying on a different feature, and that deserves a second look before the model goes into the next decision.
  • Baseline alongside. Keep a tree-ensemble baseline live next to the deep model. If the baseline suddenly beats the deep model on rolling CRPS, pause and retrain — that is the cleanest drift trigger you will get.
  • Export. PyTorch → TorchScript or ONNX for production; quantise to float16 (or int8 for embedded scenarios) when latency matters more than the last decimal of accuracy.

The deep models in this section are the right tool when shared parameters across many series, long-range temporal structure, or probabilistic outputs are central. For pure attention-based architectures and the foundation-model-style pretrained transformers, jump to Section 04-05.

Transformer-Based Forecasting

Transformer architectures, originally developed for language, now dominate the public time-series leaderboards. They are not a free lunch — they are data-hungry, expensive at long contexts, and they bring their own failure modes — but for problems with rich cross-series structure they often beat the tree and recurrent baselines of Sections 04-03 and 04-04 by a clear margin. This section covers the architectural ideas, the practical recipes, and the tradeoffs that make a transformer the right or wrong tool for a financial forecasting workload.

What changes when we move from MLP / RNN to a transformer

Three architectural ideas matter for forecasting.

  • Self-attention. Every time step in the input window can attend directly to every other step, weighted by learned compatibility. Compared with RNN recurrence, this removes the bottleneck of a fixed-width hidden state and gives the network direct access to long-range structure.
  • Cross-series attention (channel mixing). Letting the model exchange information between related series — sectors moving together, FX cross-rates, related futures contracts — without flattening them into a single feature vector is the core trick that makes transformers competitive on multivariate panels.
  • Patching and tokenisation. Recent forecasters split the time axis into patches before applying attention; PatchTST and TSMixer popularised this view. The choice of tokenisation is often as important as the attention pattern itself, and it is what closed most of the gap between early Transformer-for-time-series papers (Informer, Autoformer) and the simpler baselines they tried to beat.

A short genealogy

Three families show up repeatedly in financial work.

Encoder-only forecasters

Take the input window, encode it with a stack of attention layers, and project to the forecast horizon. No autoregressive decoding, no encoder–decoder cross attention.

  • PatchTST \citep{nie2022patchtst}. Tokenise each series into patches of length 8–32, apply a standard transformer encoder channel-independently (one network per series with shared weights), and project the last patch's representation to the forecast horizon. The dominant baseline for univariate-shape long-horizon problems.
  • iTransformer \citep{liu2024itransformer}. Inverts the role of time and channel: each series becomes a token, and self-attention operates across series at each time step. A natural fit for cross-asset panels with strong factor structure.

Encoder–decoder forecasters

Use an encoder for the historical window and a decoder that generates the forecast horizon, often with cross-attention.

  • Informer \citep{zhou2021informer}. ProbSparse attention — only the top- keys per query — to handle very long contexts.
  • Autoformer \citep{wu2021autoformer}. Decomposition blocks (trend / seasonal) inside the transformer and an auto-correlation attention that aggregates by lag rather than by similarity.
  • TimeXer \citep{wang2024timexer}. Encoder–decoder design with an explicit endogenous–exogenous split, so known-future covariates (calendar, scheduled events, holiday flags) are routed through a dedicated channel rather than mixed into the historical window. A clean fit for forecasting workloads that have strong known-future structure (electricity load, retail demand, macro releases).

Mixer-style hybrids

Replace some or all attention layers with simpler mixing primitives — MLPs, depthwise convolutions, linear attention — that retain the patching ideas but reduce compute.

  • TimeMixer \citep{wang2024timemixer}. Multiscale mixing across decomposed seasonal and trend components; competitive with full transformers at a fraction of the parameter budget.
  • TSMixer. All-MLP architecture with per-time and per-channel mixing layers. The simplest baseline that captures the patching benefits without any attention.

For most financial datasets — daily or weekly observations, low hundreds of series, limited history — encoder-only PatchTST and the mixer hybrids are the most reliable starting points. Encoder–decoder models become attractive when you genuinely need autoregressive long-horizon generation or when known-future covariates are central.

Sequence preparation

Transformer forecasters expect tensors shaped (batch, time, feature) plus masks describing missing observations and known-future inputs.

import torch
 
def make_batch(history, future, static):
    encoder = torch.tensor(history, dtype=torch.float32)
    decoder = torch.tensor(future,  dtype=torch.float32)
    static_embed = torch.tensor(static, dtype=torch.float32)
    enc_mask = torch.isnan(encoder)
    return dict(
        encoder=encoder.nan_to_num(),
        decoder=decoder.nan_to_num(),
        static=static_embed,
        mask=enc_mask,
    )

Two preprocessing steps are non-negotiable:

  • Patch tokenisation. Group consecutive time steps into patches of length 8, 16, or 32 and treat each patch as a token. Reduces sequence length and makes attention quadratic in patches rather than steps.
  • Reversible instance normalisation (RevIN) \citep{kim2021revin}. Subtract the per-window mean, divide by the per-window standard deviation, predict in standardised space, undo the scaling at output. The single biggest robustness lever against regime drift in transformer forecasters.

For multivariate inputs, decide between channel independence (one network per series, shared weights) and channel mixing (cross-series attention). Try both; on equity panels with strong factor structure mixing usually wins, on long single-asset series independence usually wins.

A working PatchTST-style training loop

Section 4-04 covered the parametric-likelihood head. The transformer- forecasting workhorse pairs the patch-tokenised encoder with a quantile head trained on pinball loss. A minimal but realistic training step:

import torch
import torch.nn as nn
 
class PatchTSTLite(nn.Module):
    """PatchTST-flavoured transformer forecaster with a quantile head.
 
    - Channel-independent: each series goes through the same encoder
      with shared weights. For multivariate panels, the channel
      dimension is folded into the batch.
    - Patch length is the most-tuned hyperparameter; 16 is a good
      starting point for daily data.
    """
 
    def __init__(self, *, patch_len=16, d_model=128, n_heads=4,
                 n_layers=3, n_quantiles=7, horizon=20):
        super().__init__()
        self.patch_len = patch_len
        self.embed = nn.Linear(patch_len, d_model)
        self.pos = nn.Parameter(torch.zeros(1, 256, d_model))   # max patches
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_model * 4,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_quantiles * horizon)
        self.n_quantiles = n_quantiles
        self.horizon = horizon
 
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L) for univariate; for multivariate (B, C, L) call
        # this in a loop or fold C into B.
        B, L = x.shape
        n_patches = L // self.patch_len
        x = x[:, : n_patches * self.patch_len]
        patches = x.reshape(B, n_patches, self.patch_len)
        h = self.embed(patches) + self.pos[:, :n_patches]
        h = self.encoder(h)
        # Final patch's representation predicts all horizon quantiles.
        out = self.head(h[:, -1])
        return out.view(B, self.horizon, self.n_quantiles)
 
 
QUANTILES = torch.tensor([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95])
 
def pinball(pred: torch.Tensor, y: torch.Tensor,
            qs: torch.Tensor = QUANTILES) -> torch.Tensor:
    diff = y.unsqueeze(-1) - pred                       # (B, H, Q)
    return torch.maximum(qs * diff, (qs - 1) * diff).mean()
 
 
# Training step with reversible instance normalisation
def step(model, opt, x, y):
    mu = x.mean(dim=-1, keepdim=True)
    sd = x.std(dim=-1, keepdim=True).clamp_min(1e-6)
    pred = model((x - mu) / sd) * sd.unsqueeze(-1) + mu.unsqueeze(-1)
    loss = pinball(pred, y)
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    return loss.item()

Three knobs that move the held-out CRPS most:

  • Patch length × lookback. The encoder sees lookback / patch_len tokens; aim for 64–128 tokens. For daily equity windows, that puts patch length in 8–16 with 1–2 years of lookback.
  • Channel mode. Channel-independent (shared encoder) is more data- efficient on small panels; channel-mixing (token-per-series, à la iTransformer) wins on equity cross-sections with strong factor structure. Test both; the cost is a config flag.
  • Quantile grid. Seven quantiles ([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]) is the sweet spot for CRPS approximation; coarser grids leave coverage gaps, finer grids add training noise.

For probabilistic evaluation, post-process the quantile output with isotonic regression to remove crossings (Section 4-06), then compute CRPS as the average pinball over the grid. The same fitted model plugs into the conformal wrapper if calibrated coverage guarantees are required.

Practical considerations

  • Lookback windows. Long contexts help only up to the autocorrelation horizon of the data. For daily equity returns, windows beyond a few months of history rarely add signal — they mostly add overfitting risk. Patch length and lookback should be jointly tuned; a useful starting point is 96 patches of length 16, i.e. about 6 years of daily data.
  • Channel independence vs. mixing — tune empirically. A small grid over patch length × channel mode × number of layers usually surfaces a clear winner.
  • Calibration drift. Transformers tend to look very confident on in-distribution data and silently miscalibrate when regimes shift. Rolling-origin CRPS plus a per-regime reliability diagram is non-negotiable.
  • Pretraining vs. from scratch. Foundation-style time-series models — Chronos, Lag-Llama, MOIRAI, TimesFM — are useful baselines when the target dataset is small. Section 4-08 covers the practical side (which to pick, how to evaluate, how to combine with a specialist); Chapter 7 covers the broader LLM-integration story.

Training recipe

A recipe that travels well across financial forecasting workloads:

  • Optimiser. AdamW at with cosine schedule and warm-up. Weight decay .
  • Batch size. As large as the GPU permits; gradient accumulation if necessary. Larger batches stabilise multivariate training in particular.
  • Loss. Quantile or NLL on standardised targets; un-standardise only for evaluation.
  • Regularisation. Mixup along the time axis, stochastic depth in the encoder, and a small dropout (0.05–0.1) on attention outputs.
  • Curriculum on horizon. Start with short forecast lengths and increase across epochs; stabilises early training and prevents collapse to the copy-of-last-value solution.
  • Early stopping. On held-out CRPS, not training loss.

Three training-time pitfalls worth knowing:

  • Look-ahead via padding. Right-padding the encoder input mixes future zeros into attention. Mask, do not just pad.
  • Leakage via instance normalisation. RevIN must use causal statistics; the moments must be computed only from the window the encoder sees.
  • Label leakage in panel splits. Stack rows by date, split dates not rows, when training on cross-sectional panels.

Monitoring and deployment

  • Quantile coverage of nominal 80% / 90% intervals, tracked on a rolling window. Drift in coverage is the earliest signal of regime change and almost always precedes a drop in PnL.
  • Attention pattern logged on a fixed validation set. Sudden shifts mean the model has started relying on a different feature; that is worth a second look before the model goes into the next decision.
  • Latency budget. Quantise to float16 (or int8 for embedded scenarios) and export to ONNX or TorchScript when latency is binding.
  • Baseline alongside. Keep the LightGBM baseline of 04-03 and the univariate baseline of 04-01 live next to the transformer. If either baseline suddenly beats the transformer on rolling CRPS, pause and investigate — that is the cleanest drift signal you will get.

Where this fits in the book

Transformer forecasters slot in alongside the tree-based forecasters from Section 04-03 and the deep-learning baselines from 04-04. The recommended workflow is:

  1. Start with a tree baseline (LightGBM with engineered features from 04-03).
  2. Add a small transformer (PatchTST or iTransformer at a few hundred thousand parameters).
  3. Graduate to larger or more elaborate variants only when the smaller ones plateau on the held-out CRPS.

The forecasts produced here feed directly into the decision layer of Chapter 5 and the dynamics layer of Chapter 6; they are also the benchmarks that the foundation forecasters of Chapter 7 are compared against.

Probabilistic Forecasting

Every forecast in this book is, ultimately, a probabilistic statement. The earlier sections covered point forecasts and a handful of distribution-friendly tools (quantile heads, conformal wrappers); this section is the deep dive. We treat probabilistic forecasting as a first-class discipline — what to model, how to score it, how to calibrate it, and how to plug the resulting distributions into the decision layer of Chapter 5 without losing information along the way.

What we mean by "probabilistic"

A probabilistic forecast is a predictive distribution over the future quantity given the information set . Three concrete forms recur in practice, each with a different deployment cost:

  • Quantile forecasts. A discrete set with . Cheap to produce, easy to consume; missing the part of the distribution between adjacent quantiles is the price.
  • Parametric densities. A named family — Gaussian, Student-, mixture of Gaussians, normalising flow — with parameters as a function of . Compact and composable; sensitive to family mis-specification when tails are heavy.
  • Sample-based ensembles. A set of trajectory samples drawn from the model. Robust to arbitrary distributional shape; the cost is storage and the need for to be large enough that tail statistics are estimable.

Picking among them is a deployment question, not a theoretical one. Mean-variance optimisers consume parametric Gaussian neatly; a recursive-utility critic (Section 5-04) needs samples; a VaR limit needs a tight quantile. Match the form to the consumer.

CRPS, log-score, and the case for proper scoring

The right loss to train and evaluate a probabilistic forecaster is a proper scoring rule: a function such that the expected score is minimised when is the true distribution of . Two dominate financial practice.

  • Continuous Ranked Probability Score (CRPS). . Equivalently, for independent , which gives a tractable Monte Carlo estimator from samples. Defaults in this book: CRPS reported per-horizon, averaged over rolling-origin folds.
  • Logarithmic score. . Sharper signal per observation than CRPS but extremely sensitive to tail mis-specification; a single outlier far from the predictive density can dominate the average. Use alongside CRPS, not instead of.

A useful property: CRPS generalises MAE — it equals MAE when the predictive distribution collapses to a point. Practical consequence: CRPS-trained probabilistic forecasters often match or beat point forecasters even on point metrics. Distributional richness is not bought at the cost of point accuracy.

A pure-NumPy CRPS for sample-based forecasts:

import numpy as np
 
def crps_sample(samples: np.ndarray, actual: float) -> float:
    """CRPS for a single observation given S Monte Carlo samples."""
    fa = np.mean(np.abs(samples - actual))                    # E|X - x|
    fb = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))  # 0.5 E|X - X'|
    return float(fa - fb)

For a panel of forecasts evaluated at many origins, average crps_sample over (date, series, horizon); report per-horizon breakdowns rather than a single scalar.

For quantile-based forecasts, the pinball loss at quantile is what you train on; the average pinball over converges to CRPS as the quantile grid is refined.

Calibration vs. sharpness

A probabilistic forecast has two properties worth measuring separately.

  • Calibration. Statistical consistency between forecasted probabilities and observed frequencies. Diagnostics:
    • PIT histogram. Under perfect calibration, is uniform on . U-shapes flag mis-specified tails; bumps flag mis-specified mean.
    • Reliability diagram. Observed frequency vs nominal probability for binned predictions. Straight diagonal = calibrated.
  • Sharpness. Concentration of the predictive distribution. A forecast that says "5% chance of large loss" every day is technically calibrated if 5% of days have large losses, but uselessly so. Sharpness is what makes calibrated forecasts actionable.

The principle: maximise sharpness subject to calibration \citep{gneiting2007strictly}. A forecaster that is sharper at the cost of calibration is fragile; a forecaster that is calibrated at the cost of sharpness is uninformative.

Conformal prediction

Conformal prediction \citep{shafer2008conformal} gives distribution- free coverage guarantees by post-processing any point predictor with a held-out calibration set. The recipe:

  1. Train a point forecaster on .
  2. On a held-out calibration set , compute residuals .
  3. For target coverage , find = the -th smallest residual.
  4. Predictive interval at test point : .

Under exchangeability, the resulting interval has marginal coverage exactly . Conformalised quantile regression \citep{romano2019cqr} replaces step 2 with the largest quantile- violation on the calibration set and produces locally adaptive widths.

A 30-line conformal-prediction wrapper around any point forecaster:

import numpy as np
 
def conformal_intervals(predict_fn, X_cal, y_cal, X_test, alpha=0.1):
    """Split-conformal symmetric intervals at coverage 1 - alpha."""
    preds_cal = predict_fn(X_cal)
    residuals = np.abs(y_cal - preds_cal)
    n = len(residuals)
    q = np.quantile(residuals, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    preds_test = predict_fn(X_test)
    return preds_test - q, preds_test + q

Practical caveats for finance:

  • Exchangeability fails. Time series are not exchangeable; conformal coverage is approximate at best. Use rolling conformal with a sliding calibration window to track regime change.
  • Distribution-free is also distribution-flat. Conformal intervals do not adapt the shape of the distribution, only the level. For asymmetric losses, prefer quantile or distributional outputs.
  • Per-quantile efficiency. Adaptive conformal methods (online conformal, AC-MLP) maintain coverage under distribution shift at the cost of small loss in efficiency.

Distributional model choices

Three architectural patterns for emitting full predictive distributions, paired with the loss they are trained against.

  • Parametric likelihood head. Network outputs (location, scale, optional shape) of a chosen family; loss is negative log-likelihood. DeepAR \citep{salinas2020deepar} is the canonical RNN version; the same pattern attaches to transformers. Strong on data with consistent shape, fragile when the shape changes.
  • Mixture density network (MDN). Output multiple triplets and form a Gaussian mixture. Captures multimodality (regime-switching returns) at the cost of a harder optimisation landscape; constrain the mixture weights with a softmax and use a small temperature schedule during training.
  • Quantile head with isotonic post-processing. Train a vector of quantiles with pinball loss; pass through an isotonic regression at inference to enforce monotonicity. The defaults are robust on heavy-tailed data and the cheapest path to CRPS-competitive forecasts.
  • Normalising flow. A bijective transformation from a base distribution to the data. Captures arbitrary shape; expensive to train and prone to mode collapse on financial data. Use as a generator (Chapter 10) more than as a routine forecaster.
  • Diffusion-based forecaster. Recently competitive on standard benchmarks \citep{su2025multimodal}. Treats the forecast as a denoising target; produces sample trajectories rather than a closed-form density. Pair with diffusion's natural conditioning machinery for stress-test scenarios.

Backtesting probabilistic forecasts

The rolling-origin protocol of Section 02-02 generalises directly. Three metrics travel together on every probabilistic backtest:

  • Per-horizon CRPS. A vector across forecast horizons; track the ratio against a strong baseline (a Student- GARCH for univariate returns is the appropriate floor).
  • Coverage of nominal / / intervals. A drop below nominal is the early warning of regime change.
  • Pinball loss decomposed by quantile level. Asymmetric drift — good upside coverage, poor downside — is exactly the failure mode a CVaR-driven decision layer is sensitive to.

A useful diagnostic the literature underuses: PIT autocorrelation. Under a calibrated, conditionally-independent forecaster, the PIT sequence is iid uniform; persistent autocorrelation in means the forecaster is missing structure that recurs through time (typical: missing seasonality or regime persistence).

Connecting to the decision layer

The decision modules in Chapter 5 consume the predictive distribution in different ways, and the form has to match.

  • Mean–variance / robust mean–variance. Need . Parametric likelihoods give them directly; quantile-based outputs need post-processing.
  • CVaR / risk-aware optimisation. Needs samples or a parametric family with a tractable tail. Quantile heads with isotonic post- processing give a lower bound on CVaR; MDN or diffusion give consistent estimators.
  • Recursive-utility critic. Needs samples of the continuation value distribution at the next state. Sample-based forecasters compose naturally; parametric ones require an explicit re-sampling step.
  • Expected utility under arbitrary . Sample average over the predictive distribution: . Number of samples is set by the variance of the importance estimator, not by the model complexity.

A common architectural mistake is to throw away the predictive distribution at the forecaster–decision boundary and pass only the mean. The resulting policy is blind to the very distributional information the forecaster's training loss was paying to learn. Pass the full distribution (or at least the moments and tails) all the way through.

Stress-testing forecasts with synthetic data

Real out-of-sample windows contain whatever regimes happen to fall in them, which is rarely all the regimes you want a forecaster to survive. The diffusion generator of Section 10-01 produces conditional counterfactual paths — "a 10% S&P drawdown driven by tech earnings", "a stagflation regime that lasts six months" — that you feed back into the same backtest harness:

import numpy as np
 
def stress_crps(forecaster, history, synthetic_paths, horizon):
    """Apply a trained forecaster to N synthetic stress paths and
    return per-path CRPS against the synthetic path's true continuation."""
    crps_per_path = []
    for path in synthetic_paths:        # path: shape (history_len + horizon,)
        prefix = path[:-horizon]
        actual = path[-horizon:]
        samples = forecaster.predict(prefix, horizon=horizon, n_samples=100)
        crps_per_path.append(crps_sample(samples, actual))
    return np.array(crps_per_path)       # shape (N,)

A forecaster that ranks well on rolling-origin real CRPS and on synthetic-stress CRPS is the candidate to ship. One that wins on real and degrades on synthetic is fitted to the recent regime; pause before deploying it.

What this section adds to the chapter

Sections 04-01 through 04-05 are organised by model class. This section is organised by output type. The two views are complementary: every model class can emit point, quantile, parametric, or sample-based outputs, and the choice is what determines whether the resulting forecast is decision-ready. The forecasters with the strongest published probabilistic performance on financial benchmarks are typically a transformer backbone (04-05) with a quantile head and an isotonic post-processor — but the discipline matters more than the backbone.

Hierarchical and Cross-Sectional Forecasting

Most financial forecasting problems are not flat:

  • A bond-portfolio forecast must be coherent with the per-issuer forecasts beneath it.
  • A sector forecast must add up to the per-stock forecasts in the sector.
  • A GDP forecast has to be consistent with the per-component forecasts (consumption, investment, government, net exports).

Hierarchical and cross-sectional forecasting is the discipline of producing forecasts that respect this structure without giving up the per-node accuracy of treating each series independently.

Why structure matters

Two distinct goals that show up at the same time.

  • Coherence. The forecasts at different aggregation levels must add up. Without this, downstream consumers (a portfolio risk system, a budget allocation, a regulator-facing report) get contradictory numbers depending on which level they query.
  • Pooling for accuracy. Sparse leaves benefit from information at higher aggregation levels (where signal-to-noise is better); high-aggregation forecasts benefit from leaf-level structure.

The two goals usually trade off: forecasts that are independently optimal at each node are rarely coherent; forecasts that are coherent by construction usually pay an accuracy tax somewhere. Hierarchical reconciliation methods aim to give up as little accuracy as possible.

Hierarchies and the summation matrix

A hierarchy is described by a summation matrix of shape . If stacks all node values and holds the leaf-level values, coherence is the linear constraint . Three structures recur in finance:

  • Strict hierarchy. Sector → industry → ticker; each leaf has exactly one parent at each level.
  • Grouped hierarchy. Multiple parallel groupings (sector, geography, market cap) with the same leaves; multiple parent paths.
  • Cross-sectional panel. series with no formal aggregation but strong factor structure. Treated by the same techniques with a factor-loading matrix in place of .

The literature \citep{hyndman2011hier, athanasopoulos2017forecasting} formalises hierarchical forecasting as a linear-algebra problem on the columns of . A small example with one parent and three leaves:

import numpy as np
 
# 3 leaves -> 1 parent.  S maps b in R^3 to y = (parent, leaf1, leaf2, leaf3).
S = np.array([
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])
b = np.array([3.2, 5.1, 2.7])
y = S @ b                              # coherent vector for all 4 nodes
print(y)                                # array([11.0,  3.2,  5.1,  2.7])

Reconciliation methods, in increasing sophistication

Given base forecasts that are not coherent, reconciliation produces coherent forecasts for some reconciliation matrix .

  • Bottom-up. Forecast leaves; aggregate. collapses to picking the leaf rows. Coherent by construction; accuracy at higher levels is whatever the leaf forecasts deliver.
  • Top-down. Forecast the top; disaggregate by historical proportions or forecasted proportions. Loses leaf-level signal; rarely worth the simplicity.
  • MinT (minimum trace). \citep{wickramasuriya2019mint} where is the covariance of base-forecast errors. Optimal under unbiased base forecasts: minimises the trace of the reconciled forecast-error covariance. The default in modern hierarchical packages.
  • Trace-minimisation with shrinkage. When is hard to estimate, replace with a Ledoit–Wolf shrinkage estimator or a diagonal target. Gives most of MinT's benefit at a fraction of the estimation cost.

A working MinT reconciler in 15 lines of NumPy:

import numpy as np
 
def mint(y_hat: np.ndarray, S: np.ndarray, W: np.ndarray) -> np.ndarray:
    """y_hat: base forecasts (n_total,); S: (n_total, n_leaves); W: (n_total, n_total)."""
    Winv = np.linalg.pinv(W)
    G = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv)
    return S @ G @ y_hat                                  # coherent forecast

A useful empirical pattern: MinT with shrinkage beats both pure bottom-up and pure top-down on every standard benchmark, and the gap is largest in the middle of the hierarchy where information from both directions matters.

Cross-sectional forecasting on equity panels

Cross-sectional forecasting is hierarchical forecasting where the "parent" is not a sum but a factor. Returns of asset decompose as

with factor loadings , factor returns , and idiosyncratic residuals . Three forecasting strategies that exploit this structure:

  • Forecast factors, propagate to assets. Estimate cross-sectionally, forecast with the multivariate tools of Section 04-02, then . Stable on large universes; ignores asset-specific signal.
  • Forecast residuals, add factor predictions. A two-stage approach: fit factors, then fit a per-asset residual model. The asset-specific signal is the part the factor structure cannot explain.
  • Joint cross-sectional model. Train a single network on the full panel with shared parameters across assets and a per-asset embedding (Section 04-04 / 04-05). The model learns the implicit factor structure rather than relying on a pre-specified .

For a coherent portfolio-level forecast, the factor-level and asset-level forecasts have to satisfy . MinT-style reconciliation works here too: treat the portfolio as the "top" of the hierarchy and reconcile.

Probabilistic reconciliation

Coherence on point forecasts is straightforward; coherence on distributions is harder. Two practical recipes:

  • Reconcile samples. Generate sample trajectories at the leaves; aggregate each sample . The empirical distribution at every node is coherent by construction. Storage scales with ; the computation is embarrassingly parallel.
  • Reconcile parameters. For Gaussian predictive densities at the bottom, the propagated distribution at any aggregation level is also Gaussian with mean and covariance . Closed-form, but parametric mis- specification at the leaves propagates to every level.

Hierarchical-aware foundation models

The 2024–2025 wave of foundation forecasters (Section 04-08) does not yet handle hierarchy natively — they predict per-series and leave reconciliation to the user. Two practical patterns:

  • Forecast per-leaf with the foundation model, reconcile with MinT. Best of both: foundation-model coverage at the leaves, classical coherence at aggregations.
  • Forecast at each level independently, ensemble. The simplest fallback when a foundation model has different strengths at different aggregation levels — top forecast from the foundation model, leaf forecast from a tree baseline, reconcile both.

A workflow that travels

A pattern that works across most financial hierarchies:

  1. Define explicitly. Write down the aggregation as a sparse matrix; check that columns sum correctly.
  2. Forecast each node independently with the per-node best model. Boring is fine — LightGBM with engineered features is the most common winner per node.
  3. Reconcile with MinT (shrinkage). Use a held-out window to estimate the base-error covariance with shrinkage; recompute periodically.
  4. Evaluate at every level. Per-node CRPS and aggregate-level CRPS. Reconciliation should not silently degrade either.
  5. Audit residual coherence. Confirm that the reconciled outputs actually satisfy to tolerance — useful as a CI test.

Where this fits in the chapter

Hierarchical forecasting is the organisational concern that makes forecasts deployable in books with structured products. The forecasters of Sections 04-01 through 04-05 are the components; this section is the glue that makes them produce numbers that downstream systems can defend.

Foundation Forecasters in Practice

Between 2023 and 2025 a new family of time-series models arrived that are pretrained on billions of time-series tokens and produce probabilistic forecasts in zero- or few-shot settings. Chronos \citep{ansari2024chronos}, Lag-Llama \citep{rasul2024lagllama}, MOIRAI \citep{woo2024moirai}, TimesFM, and Time-LlaMA \citep{chen2024timellama} all belong to this family. They are not the final word in time-series forecasting — task-specific transformers (Section 04-05) typically still beat them by 5–15% on CRPS when you have enough data — but they give you a strong baseline with zero per-series training, which is often the single most useful thing in a finance research workflow.

This section is the hands-on companion to the Chapter 7 LLM- integration discussion. The focus is operational: when to reach for a foundation forecaster, how to evaluate it on your own data, and how to combine it with task-specific models.

What "foundation forecaster" means

The term covers two architectural lineages:

  • Decoder-only transformer trained on tokenised series. Chronos quantises every time series into a fixed vocabulary and trains a language-model-style decoder over that vocabulary. Lag-Llama uses a LLaMA backbone with engineered lag features as the token sequence. TimesFM and Lag-Llama are also in this family.
  • Universal encoder–decoder with masking. MOIRAI handles arbitrary frequency, multivariate panels, and exogenous variables in one model with a masked-patch encoder. The more flexible architecture comes at the cost of more parameters.

The unifying claim: pretraining on a sufficiently broad corpus of time series teaches the model temporal patterns that transfer to unseen series. The empirical evidence supports a moderate version of this claim — foundation models are state of the art on cold-start forecasting (no target-series history) and competitive when target history is short.

When to reach for one

Three concrete scenarios where foundation forecasters earn their place.

  • Cold-start. A new asset, a new dataset, a new product line. No history to train a specialist; foundation forecaster gives a defensible baseline within minutes.
  • Many sparse series. A panel of hundreds of series with little history each. Training a specialist per series is impractical; fine-tuning the foundation model on the panel is.
  • Rapid prototyping. Before committing to a feature pipeline and a custom architecture, run a foundation forecaster as a sanity check. If a pretrained model produces a forecast comparable to your custom system, the custom system is not earning its complexity.

Three scenarios where they are the wrong choice:

  • Long, stable history with a specific failure mode. When the important signal is in the tails or in a regime-specific dynamics the pretraining corpus does not contain (a fund-flow seasonal, a market-microstructure pattern), a specialist does it better.
  • Tight latency budget. Foundation forecasters are large; even a quantised inference is slower than a small task-specific transformer or a LightGBM baseline.
  • Hard distribution requirements. If you need exact CRPS on a Student- tail or a specific quantile shape, a specialist with a matching head will win.

A vendor-neutral evaluation harness

The same harness should work across all foundation forecasters so the comparison is honest. Pseudocode pattern:

A concrete pattern using Chronos-T5 (the simplest foundation forecaster API to start with). The model loads from Hugging Face; inference is on a single GPU or CPU.

import numpy as np
import pandas as pd
import torch
from chronos import ChronosPipeline
 
# 1) load a pretrained foundation forecaster (zero-shot)
pipe = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",            # or "cuda" if available
    torch_dtype=torch.float32,
)
 
# 2) take a single asset's history
prices = pd.read_csv("data/prices_daily.csv", parse_dates=["Date"])
spy    = prices.query("Ticker == 'SPY'").sort_values("Date")
ctx    = torch.tensor(np.log(spy["AdjClose"]).diff().dropna().values[-512:])
 
# 3) probabilistic forecast: 100 samples for the next 20 days
samples = pipe.predict(context=ctx, prediction_length=20, num_samples=100)
samples = samples.cpu().numpy()             # shape (1, 100, 20)
 
# 4) per-horizon CRPS against held-out actuals (one rolling origin)
def crps_sample(forecast_samples, actual):
    fa = np.abs(forecast_samples - actual).mean(axis=0)
    fb = 0.5 * np.abs(forecast_samples[:, None] - forecast_samples[None, :]).mean(axis=(0, 1))
    return fa - fb
 
actual = np.log(spy["AdjClose"]).diff().dropna().values[-20:]   # next 20 days actual
crps_h = crps_sample(samples[0], actual)                         # (20,)
print(f"mean CRPS = {crps_h.mean():.4f}")

For a rolling-origin evaluation, wrap steps 2–4 in a loop over a sequence of as-of dates; the per-horizon CRPS averaged across origins is the headline number to compare across foundation models.

Three loops to keep separate: rolling origin, horizon, sample. The aggregation across rolling origins is what gives the headline number; the per-horizon and per-sample distributions are the diagnostics.

A practical complication: foundation models have different default context lengths and different default number of samples. Always benchmark with the same context (largest the smallest model supports) and the same sample count; otherwise the comparison is about hyperparameters, not models.

Patterns for combining with task-specific models

Three patterns that beat either component alone.

  • Foundation model → tree residual. Use the foundation model's point forecast as a feature, train a LightGBM on the residual. The tree absorbs systematic biases the pretrained model has on this particular dataset.
  • Foundation model + specialist ensemble. A precision-weighted ensemble of the foundation forecaster's CRPS and a specialist (PatchTST or LightGBM) routinely produces better calibration and sharpness than either alone.
  • Foundation model warm-start, fine-tune on target. Fine-tune the foundation model on the panel with a small learning rate and parameter-efficient adapters (LoRA, prefix tuning). Trade-off: more compute, but the resulting model often beats both the zero-shot foundation model and the specialist trained from scratch.

Limits and failure modes worth knowing

  • Distributional under-spread on rare regimes. Foundation models trained on a broad corpus tend to produce narrow predictive distributions on stress regimes that look unusual relative to the training mix. The reliability diagram per regime (Section 02-02) catches this.
  • Calendar blindness. Most pretrained forecasters do not see calendar features (day-of-week, month-end, holidays) by default. Inject them as exogenous covariates if the model supports it (MOIRAI does; Chronos does not natively).
  • Tokenisation artefacts. Quantisation-based models (Chronos) bin values; the bin grid is set at pretraining and may not span your data's range. Re-scale targets to the model's training range before inference.

Operational notes

  • Hosting. Foundation forecasters in the 100M–1B parameter range fit on a single GPU; quantised variants run on CPU at acceptable latency. Hugging Face hosts most of them.
  • Reproducibility. Pin the model weights to a commit hash; pin the inference library version; log the seed for the sampling step.
  • Caching. Cache zero-shot forecasts by (model_hash, series_hash, history_hash) — the same query two days in a row should not recompute.

Where this fits

Foundation forecasters sit in a different layer than the task- specific transformers of Section 04-05. Treat them as strong baselines you do not have to train rather than as replacements for the specialist machinery in the rest of the chapter. Their main contribution to a financial AI workflow is velocity: time from a new dataset to a defensible forecast measured in minutes rather than weeks. The decision framework in Chapter 5 and the agent integration in Chapter 8 both compose with foundation forecasters as easily as with specialist forecasters; the rest of the book does not need to care which is under the hood, only that the predictive distributions are calibrated and sharp.