"In this world, there's two kinds of people, my friend: those
with loaded guns and those who dig. You dig."
— The Good, the Bad and the Ugly (1966)
A forecast is the loaded gun. The decision layer of the next chapter
is what gets to use it. The forecaster who confuses the two — who
ships a model into production with no policy on how its outputs will
be acted on — has dug a hole, not built a system.
A forecast is an answer to "what is the conditional distribution of the
next observation given everything we know now?" Everything downstream —
position sizing, hedging, scenario testing — is some functional of that
distribution. This chapter walks the spectrum from disciplined univariate
baselines through tree ensembles, deep sequence models, and
transformer-based forecasters, with the same evaluation discipline
applied throughout.
What this chapter does
Three things repeat across every section, regardless of model class:
Define the target. Match the target horizon, transformation, and
loss to the decision the forecast feeds. A forecaster trained on
point-RMSE will produce different positions than one trained on
pinball or CRPS, even for the same architecture.
Respect temporal splits. Walk-forward training, rolling-origin
evaluation, and a strict separation between future-known and
future-unknown features. Most forecasting bugs in finance trace
back to a leak here.
Evaluate distributionally. RMSE on a median forecast is rarely
enough. CRPS, reliability diagrams, and per-regime breakdowns are
the metrics that actually correlate with downstream PnL.
This Chapter Covers
Crafting disciplined univariate baselines (ARIMA, ETS, simple
ML) that any later model has to beat to justify itself.
Leveraging cross-series structure via VARs, BVARs, VECMs, and
feature-enriched panels — including the cross-sectional volatility
models that risk overlays consume.
Tree ensembles for nonlinear tabular signals; the strongest
baseline on most engineered-feature problems and the one to beat
before reaching for a deep model.
Deep sequence models (RNN, TCN, DeepAR, N-BEATS / N-HiTS) for
panels where shared parameters across series and long-range temporal
structure pay off.
Transformer-based forecasters — PatchTST, iTransformer,
TimeMixer, TimeXer — and the foundation-forecaster baselines
(Chronos, Lag-Llama, MOIRAI) that show up in Chapter 7.
Contents
Univariate Forecasting —
ARIMA / ETS / state-space baselines and the rolling-evaluation
protocol used everywhere else.
Multivariate Forecasting —
VAR, BVAR, VECM, DCC-GARCH, and the panel pre-processing that feeds
deep models.
Deep-Learning-Integrated Forecasting —
RNN / TCN / DeepAR / N-BEATS / N-HiTS and the practical training
recipes that make them stable on financial panels.
Transformer-Based Forecasting —
attention, patching, and the modern transformer family for time
series, with the architecture choices that matter at production
scale.
Probabilistic Forecasting —
proper scoring rules, conformal prediction, distributional model
choices, and how to keep the forecast distribution intact all the
way to the decision layer.
Foundation Forecasters in Practice —
Chronos, Lag-Llama, MOIRAI, TimesFM as zero-shot baselines and the
patterns for combining them with task-specific models.
Univariate Forecasting
We start with the simplest setting: one time series, one forecast horizon.
The goal is not to compete with the deep models later in the chapter but
to establish the disciplined baselines that those models will have to
clear. If a transformer cannot beat a tuned ARIMA or an exponential
smoothing model on a clean univariate series, the problem is either
ill-posed or the features are insufficient — both worth knowing before
the next training run.
The classical time-series machinery (AR, MA, ARIMA, GARCH, ETS) is
introduced in Section 02-04. This section assumes that vocabulary and
focuses on the workflow: how to turn the theory into honest forecasts
that downstream chapters can consume.
Workflow
The univariate workflow is the same recipe every other forecasting model
in the book follows.
Define the target. Log returns, realised volatility, a macro
indicator. Match the target to the horizon and the loss; a forecaster
trained on point-RMSE produces different positions than one trained
on pinball or CRPS.
Split by time. Train / validation / test folds respect chronology.
Reserve the latest 15–20% of observations for held-out testing and
apply purged gaps if labels overlap.
Choose a naive benchmark. Random walk for log returns, seasonal
naïve for volatility or macro series, drift for trending levels. The
benchmark is what every later metric is compared against.
Fit a small set of statistical models. AR, ARIMA, ETS. Use
information criteria for order selection and inspect residuals.
Roll the evaluation forward. Sliding-origin or expanding-origin
re-estimation captures regime change; a single train/test split does
not.
Baseline models
Autoregressive
An AR(p) regression of the current value on the past p lags captures
mean-reversion and momentum. Useful as a first model on log returns and
realised volatility.
When a series carries unit roots or deterministic trend (yields,
inflation, level series), difference before modelling. ARIMA(p,d,q)
absorbs d levels of differencing plus an MA component;
SARIMA(p,d,q)(P,D,Q)s adds seasonal terms with period s —
weekly cycles in daily macro data, monthly cycles in retail prints.
For order selection on long batches, pmdarima.auto_arima automates the
AIC/BIC sweep and adds a stationarity-test gate. A short manual ACF/PACF
inspection is still a good idea — auto-selection occasionally over-fits
when the series has structural breaks.
Exponential smoothing
Holt–Winters and the broader ETS family decompose level, trend, and
seasonality and update each with its own smoothing parameter. ETS is the
right tool when seasonality dominates and the noise is mild — weekly
volatility cycles, monthly fund flows, intraday volume profiles after
detrending.
For conditional volatility forecasting, the right baseline is GARCH
(Section 02-04), not the AR / ARIMA family applied to squared returns.
The arch package is the production default:
The Student-t innovation distribution is almost always preferable to
Gaussian on equity-class data; use the QQ plot of standardised residuals
to verify.
Rolling backtest
Every univariate model gets the same evaluation harness. The pattern:
fit on a training window, predict the next horizon, slide forward, and
record the full error distribution rather than a single summary.
Per-regime breakdown. Split the held-out errors by realised
volatility tertile, by year, or by an exogenous regime flag. A model
that wins on the average but loses in stressed regimes is the wrong
model for any decision that has to live through stress.
Error distribution, not just RMSE. Histogram of errors, ACF of
errors, and a Diebold–Mariano test against the naive benchmark.
Persistent autocorrelation in errors is a sign that the model is
systematically missing structure.
Probabilistic outputs
Point forecasts hide risk; the decision layer of Chapter 5 needs full
predictive distributions. Two recipes work well at the univariate
baseline level:
Quantile regression. Train one model per quantile τ∈{0.05,0.1,…,0.95} minimising the pinball loss
ρτ(y,q^)=(τ−1{y<q^})(y−q^).
Cheap, distribution-free, robust.
Parametric density. ARIMA + Student-t residuals or GARCH +
t-innovations gives a closed-form predictive distribution. Validate
with the probability integral transform (PIT) histogram (Section
02-02): under a calibrated model, ut=F(yt∣Ft−1)
is uniform on [0,1].
For univariate scenario generation, bootstrapping standardised
residuals preserves the empirical tail without committing to a parametric
distribution. Carrying the GARCH fit from the snippet above:
The scenarios feed the synthetic-data and stress-testing workflows of
Chapter 10.
When the univariate baseline runs out
Three signals that the univariate model has stopped being enough:
Cross-series structure dominates. Adding correlated series to the
feature set genuinely improves the held-out CRPS — Section 04-02 is
where to go.
Long-range dependencies that lag p cannot capture. Slow drift,
multi-week cycles, complex seasonality — tree models with engineered
features (04-03) or deep models (04-04 / 04-05) are better suited.
Distributional output is the deliverable. Quantile boosters and
deep probabilistic forecasters give full distributions with less
fragility than ad-hoc bootstrap-on-residuals.
Even in those cases, keep the univariate baseline live. Drift in the
gap between the univariate baseline's CRPS and the production model's
CRPS is the earliest, cheapest signal of regime change you will get.
Multivariate Forecasting
Real portfolios rarely depend on a single time series. Macro indicators, rates,
commodities, and alternative data interact, and the cross-sectional
structure carries as much signal as the dynamics of any individual series.
Multivariate models are the workhorse layer that turns those interactions into
forecasts you can hand to a portfolio optimiser.
When to go multivariate
Three questions tell you whether a univariate model is enough.
Are the series correlated? Pairs (oil and energy equities), regimes
(rates and credit spreads), and panels (a sector basket) all carry
contemporaneous covariance that univariate models cannot exploit.
Do you need coherent forecasts? A bond portfolio needs forecasts whose
yield-curve shape is internally consistent; a basket of FX crosses needs
forecasts that respect triangular arbitrage. Independent univariate forecasts
do not.
Is the forecast variance the binding constraint? When the
covariance of the forecast error matters more than its mean — risk
budgeting, tracking error, hedge sizing — you need a model that learns the
covariance structure end-to-end.
Vector Autoregressions (VAR)
VAR models extend AR ideas to k-dimensional vectors. Each variable depends on
its own lags and the lags of others,
yt=c+i=1∑pAiyt−i+εt,εt∼N(0,Σ).
Use them when you care about impulse responses (how a one-off shock to one
variable propagates), Granger causality (whether series j's past helps
predict series i given the rest), or simultaneous forecasts of correlated
series such as inflation, unemployment, and policy rates.
Two diagnostics belong on every VAR you ship: the eigenvalues of the
companion matrix (must lie inside the unit circle for stationarity), and the
residual autocorrelation matrix (white-ish off-diagonals; persistent
structure means you under-specified the lag order or missed a regime).
Information criteria (AIC, BIC, HQIC) trade fit against parsimony — BIC tends
to under-fit on financial data, AIC to over-fit; HQIC sits in the middle and
is a reasonable default.
Bayesian VAR (BVAR)
Financial macro datasets often contain more variables than observations.
Coefficient estimates of unrestricted VARs blow up; forecasts inherit that
noise. Bayesian priors shrink coefficients toward economically plausible
values:
Minnesota prior \citep{litterman1986bvar}: shrinks own-lag coefficients
toward 1 and cross-lag coefficients toward 0, with shrinkage strength
decaying in lag order. The default in central-bank work.
Horseshoe / spike-and-slab priors: heavier-tailed shrinkage that allows a
small number of large coefficients while crushing the rest. Better when the
truth is genuinely sparse.
Hierarchical priors for panels (countries, sectors): pool information
across cross-sectional units while letting each unit deviate.
import numpy as npfrom statsmodels.tsa.api import VAR as SMVARfrom sklearn.linear_model import Ridge# A small Minnesota-style ridge regression on a stacked design matrix.# k series, p lags => X has k * p columns + 1 (intercept).def build_lag_matrix(y: np.ndarray, p: int): T, k = y.shape rows = [np.concatenate([y[t - i] for i in range(1, p + 1)]) for t in range(p, T)] return np.asarray(rows), y[p:]X, Y = build_lag_matrix(wide.to_pandas().to_numpy(), p=12)bvar = Ridge(alpha=1.0).fit(X, Y) # one ridge per equation, stackedprint(bvar.coef_.shape) # (k_target, k * p)
Ridge with a single λ is the simplest realisation of the
Minnesota prior; the more elaborate priors (Horseshoe, hierarchical)
are best approached through bayesian packages (pybats,
statsmodels's BayesianVAR) once the ridge baseline is calibrated.
For high-dimensional VARs (dozens of series), consider reduced-rank VARs
or factor-augmented VARs (FAVAR) that project data into a low-dimensional
latent space — exactly the dynamic-factor machinery of Chapter 6 — and then
run the VAR on factors. That route also makes the structural-shock
identification in the chapter on dynamics much cleaner.
Cointegration and VECM
When nonstationary series share a long-run equilibrium (think pairs trading,
the basis between a futures contract and its underlying, or the relationship
between inflation and short rates), use a Vector Error-Correction Model.
VECMs differentiate the series but add a term that pulls them back toward
equilibrium,
Δyt=Πyt−1+i=1∑p−1ΓiΔyt−i+εt,
with Π=αβ⊤. The columns of β define the
cointegrating vectors (the spreads that mean-revert), and α is the
adjustment speed back to equilibrium. The Johansen test estimates the
rank of Π and so the number of cointegrating relationships.
In financial practice the most common use is statistical arbitrage on a small
basket: estimate β, watch the residual spread β⊤yt,
and trade mean-reversion in that spread with a half-life set by α.
Always test stability of β in rolling windows — cointegrating
relationships break, and the moment they do is exactly when the strategy
would otherwise stack risk on a stale signal.
Cross-Series Volatility: DCC-GARCH
For risk-budgeting and option pricing the object of interest is not the mean
forecast but the covariance one. A full multivariate GARCH is heavy because
it has O(k2) free parameters; Dynamic Conditional Correlation
(DCC) \citep{engle2002dcc} buys most of the value at O(k) cost
by separating marginal volatilities and the correlation matrix:
Ht=DtRtDt,
with Dt=diag(σ1,t,…,σk,t) a univariate
GARCH per series and Rt a slowly-evolving correlation matrix. DCC is the
default risk-model engine in many investment banks because it is fast,
stable, and produces covariance forecasts that plug straight into a
mean-variance optimiser.
Feature-Enriched Deep Models
Modern deep forecasters (temporal convolutions, transformers) eat
multivariate panels too, but they require richer features and a consistent
tensor layout:
Align calendars. Forward-fill macro data to match daily market
frequency while flagging imputed points so the model can downweight them.
Engineer covariates. Include lagged returns, rolling statistics,
calendar features (day-of-week, month, holiday flags), and exogenous
drivers (oil, FX, rates) that you genuinely have at forecast time.
Standardise per series. Scale each column with its own mean/std to
avoid dominance by high-variance signals; for fat-tailed series consider
robust scaling (median / MAD).
Store metadata. Keep masks for missing values and a separate channel
for known-future features (planned policy meetings, scheduled earnings,
roll dates) so the model can condition on them without leaking.
from sklearn.preprocessing import StandardScalerimport numpy as npvalues = wide.select(pl.all().exclude("Date")).to_numpy()scaler = StandardScaler().fit(values)scaled = scaler.transform(values)np.savez("artifacts/multivariate_tensor.npz", data=scaled)
Two architectural patterns dominate the recent literature on multivariate
deep forecasting and we will return to both:
Channel-mixing. Cross-series attention or MLP layers that explicitly
exchange information across the panel. Effective when the cross-sectional
signal is strong; iTransformer \citep{liu2024itransformer} and TimeMixer
\citep{wang2024timemixer} are representative.
Channel-independence with shared parameters. The same architecture
applied per series with shared weights, betting that temporal patterns
generalise across series and that cross-series structure can be handled
downstream. PatchTST \citep{nie2022patchtst} popularised this view.
Whether mixing or independence wins depends on the panel. Equity-cross-section
problems with strong factor structure tend to benefit from mixing; long
single-asset series with rich seasonality often do better with independence.
The forecasters in Sections 04-04 and 04-05 implement both, and we benchmark
them on the same macro and equity panels.
Tree-Based Forecasting
Gradient-boosted trees and random forests absorb nonlinear interactions and
discontinuous signals better than linear models, and they remain the strongest
baseline on most financial forecasting problems where the signal is in
tabular engineered features rather than in the raw waveform of the series.
For the M5 retail competition, the M6 financial-forecasting competition, and a
long string of Kaggle finance challenges, the winning entries were variations
on LightGBM with thoughtful features. The lesson generalises: before reaching
for deep models, build a tree baseline you can defend.
Why trees are the default baseline
Three properties of trees make them well-matched to financial tabular data.
Heterogeneous features without preprocessing. Trees split on a single
feature at a time, so they handle a mix of integer indicators, log-scaled
prices, raw counts, and categorical IDs without needing whitening or one-hot
expansions. That removes a whole class of bugs around scaling, missing
values, and outliers.
Stable on small data. Boosted trees with regularisation and early
stopping fit reliably on the 5–20 year history that most financial datasets
give us. Deep models in the same regime tend to overfit unless you have
strong cross-series structure to share parameters across.
Nonparametric interactions for free. Tree depth automatically captures
conjunctions ("low-volatility and steep yield curve"); you do not need to
specify the interaction up front the way you would in a linear model.
The tradeoff is that trees do not extrapolate beyond the training range and
do not learn the temporal continuity of the series. Those are the gaps that
the deep and transformer-based forecasters in the next two sections fill.
Feature engineering is the model
A boosted tree is only as good as the features you hand it. Build a
supervised table where each row is one (t,asset) observation:
Target:yt+h — the future return, log-return, or realised
volatility at the horizon you actually care about. Match the target to the
decision (Section 5-01): if the policy uses log-returns, train on log-returns.
Lagged features:yt,yt−1,…,yt−L for a lookback window
L tied to the autocorrelation horizon of the data.
Rolling statistics: moving averages, standard deviations, skew, kurtosis;
realised range and downside semi-variance for risk-aware features.
Cross-asset features: sector returns, factor exposures (value, momentum,
quality), or correlations to a benchmark.
Calendar and event features: day-of-week, month, time-to-earnings,
holiday distance, scheduled-event flags. Trees love these.
Metadata for the trainer: sample weights, regime labels, asset IDs as
categorical features (LightGBM handles them natively).
import polars as pldef build_features(df: pl.DataFrame, horizon: int = 5) -> pl.DataFrame: return ( df.sort("Date") .with_columns([pl.col("log_ret").shift(i).alias(f"lag_{i}") for i in range(1, 11)]) .with_columns([ pl.col("log_ret").rolling_mean(window_size=20).alias("ma20"), pl.col("log_ret").rolling_std(window_size=20).alias("vol20"), pl.col("log_ret").rolling_skew(window_size=60).alias("skew60"), pl.col("Volume").log().alias("log_volume"), pl.col("Date").dt.weekday().alias("weekday"), pl.col("Date").dt.month().alias("month"), ]) .with_columns(pl.col("log_ret").shift(-horizon).alias("target")) .drop_nulls() )
The most common subtle bug in tree-based finance pipelines is target
leakage: a "lagged" feature accidentally constructed from same-day data, a
rolling statistic that includes the row's own observation, an asset ID
encoder fit on the full sample. Every feature should be auditable to the
information available at time t, and the audit is easier if every column
name encodes the lag.
Gradient boosting in practice
LightGBM is the default for fast experimentation; XGBoost and CatBoost are
near-equivalent and the choice mostly comes down to ergonomics and the shape
of your categorical features. For most financial settings:
num_leaves × min_child_samples controls capacity. On daily equity
data with ∼2000 rows per asset, num_leaves=64–128 and
min_child_samples=200–500 is a sane starting point; larger leaves overfit
the tail.
feature_fraction and bagging_fraction add stochastic regularisation
per tree. Both around 0.8 is the standard finance recipe.
learning_rate × num_boost_round trades fit speed for stability.
Smaller learning rates with early stopping on a held-out window almost
always win on out-of-sample loss; default to learning_rate=0.03–0.05 and
let early stopping pick the round count.
Monitor feature importance qualitatively. Trees split greedily, so two
collinear features can each take half the splits and look "less important"
than they are. Use SHAP values for a more honest attribution, especially
when you need to defend the model to a non-modeller stakeholder.
Quantile and probabilistic forecasts
A point forecast from a tree is rarely enough for the decision layer of
Chapter 5. Two patterns recover a full predictive distribution:
Quantile regression with pinball loss. Train one model per quantile
τ∈{0.05,0.1,0.25,0.5,0.75,0.9,0.95} with LightGBM's
objective="quantile", alpha=tau. Concatenate the predictions to get a
predictive CDF. Cheap and reliable; the only catch is non-monotone
predictions across quantiles, which you fix with a small isotonic
post-processing step.
Quantile Regression Forests \citep{meinshausen2006qrf}. Ordinary random
forests are aggregated by mean; QRF aggregates the empirical distribution
of leaf observations and inverts it. Slower at inference but distributionally
consistent by construction.
For evaluation, use CRPS and reliability diagrams from Section 02-02 — not
RMSE on the median quantile. A model that wins on RMSE but is mis-calibrated
will produce position sizes that are systematically too large or too small.
Random forests as a stress-resistant baseline
When you need a model that resists overfitting and is easy to validate,
bootstrap ensembles shine. Random forests cannot extrapolate far beyond the
training range — so they are not the right tool when the live regime has
moved out-of-sample — but they provide stable probability estimates and very
honest feature attribution.
A useful pattern is to keep the random forest as a sanity check alongside a
boosted-tree production model: when the two diverge sharply on live data, the
boosted model is usually overfitting a recent anomaly that the forest's
averaging has dampened.
Cross-sectional panels
For equity universes, build panel features and fit a single model across all
tickers rather than one model per ticker:
Stack rows for every ticker into one training table.
Add the ticker as a categorical feature (LightGBM's categorical_feature)
or use target encoding with proper out-of-fold construction.
Add cross-sectional features: sector or country dummies, industry-level
averages, factor exposures.
Use grouped time-series cross-validation: train on 2010–2018, validate
2019, test 2020+. Ensure no leakage from future fundamentals, revised data,
or look-ahead in cross-sectional rankings.
The panel approach amortises noise across tickers and is what made tree
ensembles competitive on the M6 financial competition. The catch is that the
panel must be reasonably homogeneous — pooling small-cap technology stocks
with large-cap utilities will hurt unless you condition on the segment.
Deployment notes
Daily refresh. Update features incrementally with a join on the latest
observation rather than recomputing the full table.
Audit signals. SHAP or accumulated local effects (ALE) plots show how
each feature drives the prediction at the row level; this is what you show
in a model risk review.
Latency. Export models to treelite or ONNX if you need
microsecond-class inference; for daily strategies the native LightGBM
predict is usually fast enough.
Versioning. Persist both the model and the feature recipe (the exact
Polars expressions that built the table). Replays of historical forecasts
break the moment the recipe drifts.
Tree-based forecasters bridge classical statistics and deep learning. They
thrive on structured, high-SNR signals; they produce interpretable outputs
for stakeholders; and they set the bar that the deep models in the next two
sections have to clear.
Deep-Learning-Integrated Forecasting
Tree ensembles set the baseline; deep sequence models earn their keep when the
signal lives in temporal structure that cannot be flattened into a tabular
row, or when the panel is large enough that shared parameters across series
materially helps. Before the transformer-specific architectures of the next
section, we cover the broader deep-forecasting toolkit — recurrent nets,
temporal convolutions, deep AR/seq2seq, and decomposition-based architectures
like N-BEATS / N-HiTS — that often beat trees with a fraction of the parameter
budget of a transformer.
When to reach for a deep model
Three concrete conditions make a deep model the right choice rather than a
boosted tree.
Many series, shared dynamics. A panel of hundreds or thousands of
related series (an entire equity universe, an electricity-grid panel, a
retail-sales panel) lets you fit one model with shared weights. Trees fit
per series — or with categorical IDs, per row — but cannot share inductive
bias across series the way a deep model with channel-independence can.
Long-range temporal pattern. Seasonality at multiple scales, slow regime
drift, or hierarchical structure (intraday → daily → weekly) is what
RNN/TCN/transformer models reconstruct without you hand-engineering rolling
features for every scale.
Probabilistic outputs at scale. DeepAR and its descendants produce
full-distribution forecasts in one shot; quantile-loss boosters can match
them per-quantile but become awkward at panel scale.
If none of these holds — handful of series, mostly tabular signal, short
horizon — a LightGBM with the features from Section 04-03 will usually win.
Sequence preparation
Deep forecasters expect tensors shaped (batch, time, feature) plus masks
describing missing observations and known-future inputs. The pieces:
Decoder inputs (for seq2seq architectures): known-future features
(calendar, scheduled events, holidays) and previously generated targets.
Masks: binary indicators for observed vs. padded vs. imputed positions.
The mask is what lets the model down-weight imputed values rather than
treat them as ground truth.
A normalisation step that shows up in every reproducible deep-forecasting
pipeline is per-series instance normalisation: subtract the per-window
mean, divide by the per-window standard deviation, predict in standardised
space, and undo the scaling at output. Models like RevIN \citep{kim2021revin}
make this part of the network rather than the data pipeline; either way, it
is the single biggest robustness lever against regime shift.
Recurrent and convolutional baselines
Three classical deep architectures still hold up:
LSTM / GRU encoders with an attention readout for the forecast horizon.
Fast to train, easy to debug; good when the autocorrelation structure is
short and seasonality is mild.
Temporal convolutional networks (TCN). Stacked dilated 1D convolutions
give exponentially growing receptive fields with linear parameter cost.
TCNs train more stably than RNNs on long contexts and remain competitive
with much larger transformers on standard univariate benchmarks.
DeepAR \citep{salinas2020deepar}. An RNN that emits the parameters of a
per-step likelihood (Gaussian, Student-t, negative binomial), trained by
maximum likelihood and rolled out autoregressively for multi-step
probabilistic forecasts. The default deep probabilistic forecaster in
AWS/GluonTS and a strong baseline on retail and load forecasting.
import torch.nn as nnclass TCNBlock(nn.Module): def __init__(self, c, k=3, d=1): super().__init__() pad = (k - 1) * d self.conv = nn.Conv1d(c, c, kernel_size=k, padding=pad, dilation=d) self.act = nn.GELU() def forward(self, x): y = self.conv(x)[..., : x.shape[-1]] return self.act(x + y)
The recipe for any of these baselines is the same: rolling-origin training
with early stopping on a held-out window, instance normalisation, and a
small number of layers — five or six is usually enough.
Decomposition-based architectures: N-BEATS and N-HiTS
A separate strand of work bypasses recurrence and attention entirely. N-BEATS
\citep{oreshkin2020nbeats} stacks fully-connected blocks that output two
projections per layer: a backcast that subtracts what the block has
explained from the input, and a forecast that contributes to the
prediction. Trained end-to-end, this acts like a learned, hierarchical
decomposition into trend, seasonality, and residuals. N-HiTS
\citep{challu2023nhits} adds multi-rate signal sampling so that long-horizon
forecasts spend their parameter budget on coarse structure and short-horizon
forecasts on fine detail.
Two reasons to keep these on the menu:
They are interpretable: the per-block backcasts decompose the forecast
into named components (trend, seasonality, residual) you can plot.
They are efficient: an MLP-only architecture with no attention runs
faster than a transformer at the same accuracy on M3, M4, and Tourism
benchmarks.
neuralforecast provides a clean PyTorch-Lightning wrapper for both.
A working DeepAR-style head
Section 4-06 covers probabilistic forecasting from the output-type
perspective — the proper-scoring discipline that any backbone has to
satisfy. Here we walk through a concrete parametric likelihood
head that pairs naturally with the RNN / TCN backbones above: a
Student-t density per step, trained by negative log-likelihood
\citep{salinas2020deepar}.
Three notes that decide whether this trains stably on financial data:
df floor. A Student-t with ν≤2 has infinite variance;
the gradient explodes around the boundary. The +2.5 floor in the
softplus keeps optimisation stable.
Per-window standardisation (Section 4-05's RevIN trick) is just
as important here as for transformers; without it, the head's
scale parameters chase the moving baseline.
Teacher forcing during training, autoregressive sampling at
inference. At training time, feed the true previous values; at
inference, sample from the predicted distribution and feed the
sample as the next input. The classical DeepAR rollout.
For probabilistic evaluation (Section 4-06), the head exposes a
density, so CRPS and reliability diagrams come straight out of
sampling 100 trajectories and aggregating.
Quantile head as the alternative
When the parametric assumption is uncomfortable (volatility regime
shifts, mixture of return scales), a quantile head trained with
pinball loss is the distribution-free alternative:
Quantile crossing. The model can emit non-monotone predictions
(5%-quantile above 95%-quantile). A small isotonic regression as
post-processing fixes this without touching the trained weights.
CRPS approximation. With 7 quantiles the average pinball loss
over the grid converges to CRPS as the grid is refined; with only
3 quantiles the approximation is too coarse to be useful.
Conformal post-processing (Section 4-06) wraps either head: train
the point or quantile output, then compute conformal intervals from
held-out residuals to recover coverage guarantees.
Training tips that actually move the needle
Patching. Treat the time axis as a sequence of small patches (length 8,
16, or 32) rather than per-step tokens. This is the single change that
closed most of the gap between transformer forecasters and TCN/RNN
baselines, and it generalises beyond transformers — patch-and-MLP
architectures (TSMixer) get most of the same benefit.
Curriculum on horizon. Start with short encoder/decoder lengths and
scale up. Stabilises early training and avoids the model collapsing into a
copy-of-last-value solution.
Channel independence vs. mixing. Try both. PatchTST-style channel
independence is a strong default for univariate-shape problems with shared
dynamics; iTransformer and TimeMixer mix channels and often win on
cross-asset panels with strong factor structure.
Loss-weighting. For long horizons, equally-weighted point losses
systematically under-fit the back end. A schedule that emphasises later
horizons during training is a small but reliable improvement.
Data augmentation. Mixup on the time dimension and stochastic input
dropping prevent overfitting to specific regimes and cost almost nothing.
Monitoring and deployment
Quantile coverage. Track empirical coverage of nominal 80% / 90%
intervals on a rolling window. Drift in coverage is the earliest signal of
regime change and almost always precedes a drop in PnL.
Attention / variable importance. For TFT-style models, log the variable
selection weights; for transformer-style, log the average attention pattern.
Sudden shifts mean the model has started relying on a different feature, and
that deserves a second look before the model goes into the next decision.
Baseline alongside. Keep a tree-ensemble baseline live next to the deep
model. If the baseline suddenly beats the deep model on rolling CRPS, pause
and retrain — that is the cleanest drift trigger you will get.
Export. PyTorch → TorchScript or ONNX for production; quantise to
float16 (or int8 for embedded scenarios) when latency matters more than the
last decimal of accuracy.
The deep models in this section are the right tool when shared parameters
across many series, long-range temporal structure, or probabilistic outputs
are central. For pure attention-based architectures and the
foundation-model-style pretrained transformers, jump to Section 04-05.
Transformer-Based Forecasting
Transformer architectures, originally developed for language, now
dominate the public time-series leaderboards. They are not a free lunch
— they are data-hungry, expensive at long contexts, and they bring their
own failure modes — but for problems with rich cross-series structure
they often beat the tree and recurrent baselines of Sections 04-03 and
04-04 by a clear margin. This section covers the architectural ideas,
the practical recipes, and the tradeoffs that make a transformer the
right or wrong tool for a financial forecasting workload.
What changes when we move from MLP / RNN to a transformer
Three architectural ideas matter for forecasting.
Self-attention. Every time step in the input window can attend
directly to every other step, weighted by learned compatibility.
Compared with RNN recurrence, this removes the bottleneck of a
fixed-width hidden state and gives the network direct access to
long-range structure.
Cross-series attention (channel mixing). Letting the model
exchange information between related series — sectors moving
together, FX cross-rates, related futures contracts — without
flattening them into a single feature vector is the core trick that
makes transformers competitive on multivariate panels.
Patching and tokenisation. Recent forecasters split the time axis
into patches before applying attention; PatchTST and TSMixer popularised
this view. The choice of tokenisation is often as important as the
attention pattern itself, and it is what closed most of the gap between
early Transformer-for-time-series papers (Informer, Autoformer) and
the simpler baselines they tried to beat.
A short genealogy
Three families show up repeatedly in financial work.
Encoder-only forecasters
Take the input window, encode it with a stack of attention layers, and
project to the forecast horizon. No autoregressive decoding, no
encoder–decoder cross attention.
PatchTST \citep{nie2022patchtst}. Tokenise each series into
patches of length 8–32, apply a standard transformer encoder
channel-independently (one network per series with shared weights),
and project the last patch's representation to the forecast horizon.
The dominant baseline for univariate-shape long-horizon problems.
iTransformer \citep{liu2024itransformer}. Inverts the role of
time and channel: each series becomes a token, and self-attention
operates across series at each time step. A natural fit for
cross-asset panels with strong factor structure.
Encoder–decoder forecasters
Use an encoder for the historical window and a decoder that generates
the forecast horizon, often with cross-attention.
Informer \citep{zhou2021informer}. ProbSparse attention — only
the top-k keys per query — to handle very long contexts.
Autoformer \citep{wu2021autoformer}. Decomposition blocks
(trend / seasonal) inside the transformer and an auto-correlation
attention that aggregates by lag rather than by similarity.
TimeXer \citep{wang2024timexer}. Encoder–decoder design with an
explicit endogenous–exogenous split, so known-future covariates
(calendar, scheduled events, holiday flags) are routed through a
dedicated channel rather than mixed into the historical window. A
clean fit for forecasting workloads that have strong known-future
structure (electricity load, retail demand, macro releases).
Mixer-style hybrids
Replace some or all attention layers with simpler mixing primitives —
MLPs, depthwise convolutions, linear attention — that retain the
patching ideas but reduce compute.
TimeMixer \citep{wang2024timemixer}. Multiscale mixing across
decomposed seasonal and trend components; competitive with full
transformers at a fraction of the parameter budget.
TSMixer. All-MLP architecture with per-time and per-channel
mixing layers. The simplest baseline that captures the patching
benefits without any attention.
For most financial datasets — daily or weekly observations, low hundreds
of series, limited history — encoder-only PatchTST and the mixer
hybrids are the most reliable starting points. Encoder–decoder models
become attractive when you genuinely need autoregressive long-horizon
generation or when known-future covariates are central.
Sequence preparation
Transformer forecasters expect tensors shaped (batch, time, feature)
plus masks describing missing observations and known-future inputs.
Patch tokenisation. Group consecutive time steps into patches of
length 8, 16, or 32 and treat each patch as a token. Reduces sequence
length and makes attention quadratic in patches rather than steps.
Reversible instance normalisation (RevIN) \citep{kim2021revin}.
Subtract the per-window mean, divide by the per-window standard
deviation, predict in standardised space, undo the scaling at output.
The single biggest robustness lever against regime drift in
transformer forecasters.
For multivariate inputs, decide between channel independence (one
network per series, shared weights) and channel mixing (cross-series
attention). Try both; on equity panels with strong factor structure
mixing usually wins, on long single-asset series independence usually
wins.
A working PatchTST-style training loop
Section 4-04 covered the parametric-likelihood head. The transformer-
forecasting workhorse pairs the patch-tokenised encoder with a
quantile head trained on pinball loss. A minimal but realistic
training step:
import torchimport torch.nn as nnclass PatchTSTLite(nn.Module): """PatchTST-flavoured transformer forecaster with a quantile head. - Channel-independent: each series goes through the same encoder with shared weights. For multivariate panels, the channel dimension is folded into the batch. - Patch length is the most-tuned hyperparameter; 16 is a good starting point for daily data. """ def __init__(self, *, patch_len=16, d_model=128, n_heads=4, n_layers=3, n_quantiles=7, horizon=20): super().__init__() self.patch_len = patch_len self.embed = nn.Linear(patch_len, d_model) self.pos = nn.Parameter(torch.zeros(1, 256, d_model)) # max patches layer = nn.TransformerEncoderLayer( d_model, n_heads, dim_feedforward=d_model * 4, batch_first=True, norm_first=True, ) self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers) self.head = nn.Linear(d_model, n_quantiles * horizon) self.n_quantiles = n_quantiles self.horizon = horizon def forward(self, x: torch.Tensor) -> torch.Tensor: # x: (B, L) for univariate; for multivariate (B, C, L) call # this in a loop or fold C into B. B, L = x.shape n_patches = L // self.patch_len x = x[:, : n_patches * self.patch_len] patches = x.reshape(B, n_patches, self.patch_len) h = self.embed(patches) + self.pos[:, :n_patches] h = self.encoder(h) # Final patch's representation predicts all horizon quantiles. out = self.head(h[:, -1]) return out.view(B, self.horizon, self.n_quantiles)QUANTILES = torch.tensor([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95])def pinball(pred: torch.Tensor, y: torch.Tensor, qs: torch.Tensor = QUANTILES) -> torch.Tensor: diff = y.unsqueeze(-1) - pred # (B, H, Q) return torch.maximum(qs * diff, (qs - 1) * diff).mean()# Training step with reversible instance normalisationdef step(model, opt, x, y): mu = x.mean(dim=-1, keepdim=True) sd = x.std(dim=-1, keepdim=True).clamp_min(1e-6) pred = model((x - mu) / sd) * sd.unsqueeze(-1) + mu.unsqueeze(-1) loss = pinball(pred, y) opt.zero_grad() loss.backward() nn.utils.clip_grad_norm_(model.parameters(), 1.0) opt.step() return loss.item()
Three knobs that move the held-out CRPS most:
Patch length × lookback. The encoder sees lookback / patch_len
tokens; aim for 64–128 tokens. For daily equity windows, that puts
patch length in 8–16 with 1–2 years of lookback.
Channel mode. Channel-independent (shared encoder) is more data-
efficient on small panels; channel-mixing (token-per-series, à la
iTransformer) wins on equity cross-sections with strong factor
structure. Test both; the cost is a config flag.
Quantile grid. Seven quantiles
([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]) is the sweet spot for
CRPS approximation; coarser grids leave coverage gaps, finer grids
add training noise.
For probabilistic evaluation, post-process the quantile output with
isotonic regression to remove crossings (Section 4-06), then compute
CRPS as the average pinball over the grid. The same fitted model
plugs into the conformal wrapper if calibrated coverage guarantees
are required.
Practical considerations
Lookback windows. Long contexts help only up to the
autocorrelation horizon of the data. For daily equity returns,
windows beyond a few months of history rarely add signal — they mostly
add overfitting risk. Patch length and lookback should be jointly
tuned; a useful starting point is 96 patches of length 16, i.e. about
6 years of daily data.
Channel independence vs. mixing — tune empirically. A small grid
over patch length × channel mode × number of layers usually surfaces
a clear winner.
Calibration drift. Transformers tend to look very confident on
in-distribution data and silently miscalibrate when regimes shift.
Rolling-origin CRPS plus a per-regime reliability diagram is
non-negotiable.
Pretraining vs. from scratch. Foundation-style time-series
models — Chronos, Lag-Llama, MOIRAI, TimesFM — are useful baselines
when the target dataset is small. Section 4-08 covers the practical
side (which to pick, how to evaluate, how to combine with a
specialist); Chapter 7 covers the broader LLM-integration story.
Training recipe
A recipe that travels well across financial forecasting workloads:
Optimiser. AdamW at 1e−4 with cosine schedule and
∼5% warm-up. Weight decay 0.01.
Batch size. As large as the GPU permits; gradient accumulation if
necessary. Larger batches stabilise multivariate training in
particular.
Loss. Quantile or NLL on standardised targets; un-standardise only
for evaluation.
Regularisation. Mixup along the time axis, stochastic depth in
the encoder, and a small dropout (0.05–0.1) on attention outputs.
Curriculum on horizon. Start with short forecast lengths and
increase across epochs; stabilises early training and prevents
collapse to the copy-of-last-value solution.
Early stopping. On held-out CRPS, not training loss.
Three training-time pitfalls worth knowing:
Look-ahead via padding. Right-padding the encoder input mixes
future zeros into attention. Mask, do not just pad.
Leakage via instance normalisation. RevIN must use causal
statistics; the moments must be computed only from the window the
encoder sees.
Label leakage in panel splits. Stack rows by date, split dates
not rows, when training on cross-sectional panels.
Monitoring and deployment
Quantile coverage of nominal 80% / 90% intervals, tracked on a
rolling window. Drift in coverage is the earliest signal of regime
change and almost always precedes a drop in PnL.
Attention pattern logged on a fixed validation set. Sudden shifts
mean the model has started relying on a different feature; that is
worth a second look before the model goes into the next decision.
Latency budget. Quantise to float16 (or int8 for embedded
scenarios) and export to ONNX or TorchScript when latency is binding.
Baseline alongside. Keep the LightGBM baseline of 04-03 and the
univariate baseline of 04-01 live next to the transformer. If either
baseline suddenly beats the transformer on rolling CRPS, pause and
investigate — that is the cleanest drift signal you will get.
Where this fits in the book
Transformer forecasters slot in alongside the tree-based forecasters
from Section 04-03 and the deep-learning baselines from 04-04. The
recommended workflow is:
Start with a tree baseline (LightGBM with engineered features
from 04-03).
Add a small transformer (PatchTST or iTransformer at a few
hundred thousand parameters).
Graduate to larger or more elaborate variants only when the
smaller ones plateau on the held-out CRPS.
The forecasts produced here feed directly into the decision layer of
Chapter 5 and the dynamics layer of Chapter 6; they are also the
benchmarks that the foundation forecasters of Chapter 7 are compared
against.
Probabilistic Forecasting
Every forecast in this book is, ultimately, a probabilistic statement.
The earlier sections covered point forecasts and a handful of
distribution-friendly tools (quantile heads, conformal wrappers); this
section is the deep dive. We treat probabilistic forecasting as a
first-class discipline — what to model, how to score it, how to
calibrate it, and how to plug the resulting distributions into the
decision layer of Chapter 5 without losing information along the way.
What we mean by "probabilistic"
A probabilistic forecast is a predictive distributionF^t+h∣t over the future quantity Xt+h given the
information set Ft. Three concrete forms recur in
practice, each with a different deployment cost:
Quantile forecasts. A discrete set
{q^τ1,…,q^τK} with τk∈(0,1).
Cheap to produce, easy to consume; missing the part of the
distribution between adjacent quantiles is the price.
Parametric densities. A named family pθ(x) — Gaussian,
Student-t, mixture of Gaussians, normalising flow — with
parameters θ as a function of Ft. Compact and
composable; sensitive to family mis-specification when tails are
heavy.
Sample-based ensembles. A set of trajectory samples
{Xt+h(s)}s=1S drawn from the model. Robust to
arbitrary distributional shape; the cost is storage and the need
for S to be large enough that tail statistics are estimable.
Picking among them is a deployment question, not a theoretical one.
Mean-variance optimisers consume parametric Gaussian neatly; a
recursive-utility critic (Section 5-04) needs samples; a VaR limit
needs a tight quantile. Match the form to the consumer.
CRPS, log-score, and the case for proper scoring
The right loss to train and evaluate a probabilistic forecaster is
a proper scoring rule: a function S(F,x) such that the expected
score is minimised when F is the true distribution of x. Two
dominate financial practice.
Continuous Ranked Probability Score (CRPS).CRPS(F,x)=∫−∞∞(F(y)−1{y≥x})2dy.
Equivalently, CRPS(F,x)=EF∣Y−x∣−21EF∣Y−Y′∣
for independent Y,Y′∼F, which gives a tractable Monte Carlo
estimator from samples. Defaults in this book: CRPS reported
per-horizon, averaged over rolling-origin folds.
Logarithmic score.−logpθ(x). Sharper signal per
observation than CRPS but extremely sensitive to tail
mis-specification; a single outlier far from the predictive density
can dominate the average. Use alongside CRPS, not instead of.
A useful property: CRPS generalises MAE — it equals MAE when the
predictive distribution collapses to a point. Practical consequence:
CRPS-trained probabilistic forecasters often match or beat point
forecasters even on point metrics. Distributional richness is not
bought at the cost of point accuracy.
A pure-NumPy CRPS for sample-based forecasts:
import numpy as npdef crps_sample(samples: np.ndarray, actual: float) -> float: """CRPS for a single observation given S Monte Carlo samples.""" fa = np.mean(np.abs(samples - actual)) # E|X - x| fb = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :])) # 0.5 E|X - X'| return float(fa - fb)
For a panel of forecasts evaluated at many origins, average
crps_sample over (date, series, horizon); report per-horizon
breakdowns rather than a single scalar.
For quantile-based forecasts, the pinball loss at quantile τ
is what you train on; the average pinball over τ converges to
CRPS as the quantile grid is refined.
Calibration vs. sharpness
A probabilistic forecast has two properties worth measuring
separately.
Calibration. Statistical consistency between forecasted
probabilities and observed frequencies. Diagnostics:
PIT histogram. Under perfect calibration,
ut=F^t∣t−h(xt) is uniform on [0,1]. U-shapes
flag mis-specified tails; bumps flag mis-specified mean.
Reliability diagram. Observed frequency vs nominal probability
for binned predictions. Straight diagonal = calibrated.
Sharpness. Concentration of the predictive distribution. A
forecast that says "5% chance of large loss" every day is
technically calibrated if 5% of days have large losses, but
uselessly so. Sharpness is what makes calibrated forecasts
actionable.
The principle: maximise sharpness subject to calibration
\citep{gneiting2007strictly}. A forecaster that is sharper at the cost
of calibration is fragile; a forecaster that is calibrated at the cost
of sharpness is uninformative.
Conformal prediction
Conformal prediction \citep{shafer2008conformal} gives distribution-
free coverage guarantees by post-processing any point predictor with
a held-out calibration set. The recipe:
Train a point forecaster μ^(x) on Dtrain.
On a held-out calibration set Dcal, compute residuals
ri=∣yi−μ^(xi)∣.
For target coverage 1−α, find r^α = the
⌈(ncal+1)(1−α)⌉-th smallest residual.
Predictive interval at test point x:
[μ^(x)−r^α,μ^(x)+r^α].
Under exchangeability, the resulting interval has marginal coverage
exactly 1−α. Conformalised quantile regression
\citep{romano2019cqr} replaces step 2 with the largest quantile-
violation on the calibration set and produces locally adaptive widths.
A 30-line conformal-prediction wrapper around any point forecaster:
Exchangeability fails. Time series are not exchangeable;
conformal coverage is approximate at best. Use rolling conformal
with a sliding calibration window to track regime change.
Distribution-free is also distribution-flat. Conformal intervals
do not adapt the shape of the distribution, only the level. For
asymmetric losses, prefer quantile or distributional outputs.
Per-quantile efficiency. Adaptive conformal methods
(online conformal, AC-MLP) maintain coverage under distribution
shift at the cost of small loss in efficiency.
Distributional model choices
Three architectural patterns for emitting full predictive
distributions, paired with the loss they are trained against.
Parametric likelihood head.
Network outputs (μθ(x),σθ(x),νθ(x))
(location, scale, optional shape) of a chosen family; loss is
negative log-likelihood. DeepAR \citep{salinas2020deepar} is the
canonical RNN version; the same pattern attaches to transformers.
Strong on data with consistent shape, fragile when the shape changes.
Mixture density network (MDN).
Output multiple (πk,μk,σk) triplets and form a
Gaussian mixture. Captures multimodality (regime-switching returns)
at the cost of a harder optimisation landscape; constrain the
mixture weights with a softmax and use a small temperature schedule
during training.
Quantile head with isotonic post-processing.
Train a vector of quantiles with pinball loss; pass through an
isotonic regression at inference to enforce monotonicity. The
defaults are robust on heavy-tailed data and the cheapest path to
CRPS-competitive forecasts.
Normalising flow.
A bijective transformation fθ from a base distribution to
the data. Captures arbitrary shape; expensive to train and prone to
mode collapse on financial data. Use as a generator (Chapter 10)
more than as a routine forecaster.
Diffusion-based forecaster.
Recently competitive on standard benchmarks
\citep{su2025multimodal}. Treats the forecast as a denoising target;
produces sample trajectories rather than a closed-form density. Pair
with diffusion's natural conditioning machinery for stress-test
scenarios.
Backtesting probabilistic forecasts
The rolling-origin protocol of Section 02-02 generalises directly. Three
metrics travel together on every probabilistic backtest:
Per-horizon CRPS. A vector across forecast horizons; track the
ratio against a strong baseline (a Student-t GARCH for univariate
returns is the appropriate floor).
Coverage of nominal 80% / 90% / 95% intervals. A drop
below nominal is the early warning of regime change.
Pinball loss decomposed by quantile level. Asymmetric drift —
good upside coverage, poor downside — is exactly the failure mode a
CVaR-driven decision layer is sensitive to.
A useful diagnostic the literature underuses: PIT autocorrelation.
Under a calibrated, conditionally-independent forecaster, the PIT
sequence ut is iid uniform; persistent autocorrelation in ut
means the forecaster is missing structure that recurs through time
(typical: missing seasonality or regime persistence).
Connecting to the decision layer
The decision modules in Chapter 5 consume the predictive distribution
in different ways, and the form has to match.
Mean–variance / robust mean–variance. Need
(μ,Σ). Parametric likelihoods give them
directly; quantile-based outputs need post-processing.
CVaR / risk-aware optimisation. Needs samples or a parametric
family with a tractable tail. Quantile heads with isotonic post-
processing give a lower bound on CVaR; MDN or diffusion give
consistent estimators.
Recursive-utility critic. Needs samples of the continuation
value distribution at the next state. Sample-based forecasters
compose naturally; parametric ones require an explicit re-sampling
step.
Expected utility under arbitrary U. Sample average over the
predictive distribution: S1∑sU(Wt+1(s)). Number
of samples S is set by the variance of the importance estimator,
not by the model complexity.
A common architectural mistake is to throw away the predictive
distribution at the forecaster–decision boundary and pass only the
mean. The resulting policy is blind to the very distributional
information the forecaster's training loss was paying to learn. Pass
the full distribution (or at least the moments and tails) all the way
through.
Stress-testing forecasts with synthetic data
Real out-of-sample windows contain whatever regimes happen to fall in
them, which is rarely all the regimes you want a forecaster to
survive. The diffusion generator of Section 10-01 produces conditional
counterfactual paths — "a 10% S&P drawdown driven by tech earnings",
"a stagflation regime that lasts six months" — that you feed back
into the same backtest harness:
import numpy as npdef stress_crps(forecaster, history, synthetic_paths, horizon): """Apply a trained forecaster to N synthetic stress paths and return per-path CRPS against the synthetic path's true continuation.""" crps_per_path = [] for path in synthetic_paths: # path: shape (history_len + horizon,) prefix = path[:-horizon] actual = path[-horizon:] samples = forecaster.predict(prefix, horizon=horizon, n_samples=100) crps_per_path.append(crps_sample(samples, actual)) return np.array(crps_per_path) # shape (N,)
A forecaster that ranks well on rolling-origin real CRPS and on
synthetic-stress CRPS is the candidate to ship. One that wins on
real and degrades on synthetic is fitted to the recent regime; pause
before deploying it.
What this section adds to the chapter
Sections 04-01 through 04-05 are organised by model class. This
section is organised by output type. The two views are
complementary: every model class can emit point, quantile, parametric,
or sample-based outputs, and the choice is what determines whether
the resulting forecast is decision-ready. The forecasters with the
strongest published probabilistic performance on financial benchmarks
are typically a transformer backbone (04-05) with a quantile head and
an isotonic post-processor — but the discipline matters more than the
backbone.
Hierarchical and Cross-Sectional Forecasting
Most financial forecasting problems are not flat:
A bond-portfolio forecast must be coherent with the per-issuer
forecasts beneath it.
A sector forecast must add up to the per-stock forecasts in the
sector.
A GDP forecast has to be consistent with the per-component forecasts
(consumption, investment, government, net exports).
Hierarchical and cross-sectional forecasting is the
discipline of producing forecasts that respect this structure without
giving up the per-node accuracy of treating each series
independently.
Why structure matters
Two distinct goals that show up at the same time.
Coherence. The forecasts at different aggregation levels must
add up. Without this, downstream consumers (a portfolio risk
system, a budget allocation, a regulator-facing report) get
contradictory numbers depending on which level they query.
Pooling for accuracy. Sparse leaves benefit from information at
higher aggregation levels (where signal-to-noise is better);
high-aggregation forecasts benefit from leaf-level structure.
The two goals usually trade off: forecasts that are independently
optimal at each node are rarely coherent; forecasts that are coherent
by construction usually pay an accuracy tax somewhere. Hierarchical
reconciliation methods aim to give up as little accuracy as
possible.
Hierarchies and the summation matrix
A hierarchy is described by a summation matrixS of shape
(ntotal×nleaves). If
yt∈Rntotal stacks all node values
and bt∈Rnleaves holds the
leaf-level values, coherence is the linear constraint
yt=Sbt. Three structures recur in finance:
Strict hierarchy. Sector → industry → ticker; each leaf has
exactly one parent at each level.
Grouped hierarchy. Multiple parallel groupings (sector,
geography, market cap) with the same leaves; multiple parent
paths.
Cross-sectional panel.n series with no formal aggregation
but strong factor structure. Treated by the same techniques with a
factor-loading matrix in place of S.
The literature \citep{hyndman2011hier, athanasopoulos2017forecasting}
formalises hierarchical forecasting as a linear-algebra problem on
the columns of S. A small example with one parent and three leaves:
import numpy as np# 3 leaves -> 1 parent. S maps b in R^3 to y = (parent, leaf1, leaf2, leaf3).S = np.array([ [1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1],])b = np.array([3.2, 5.1, 2.7])y = S @ b # coherent vector for all 4 nodesprint(y) # array([11.0, 3.2, 5.1, 2.7])
Reconciliation methods, in increasing sophistication
Given base forecastsy^t that are not coherent,
reconciliation produces coherent forecasts y~t=SGy^t
for some reconciliation matrix G.
Bottom-up. Forecast leaves; aggregate.
G=(S⊤S)−1S⊤ collapses to picking the leaf rows.
Coherent by construction; accuracy at higher levels is whatever the
leaf forecasts deliver.
Top-down. Forecast the top; disaggregate by historical
proportions or forecasted proportions. Loses leaf-level signal;
rarely worth the simplicity.
MinT (minimum trace). \citep{wickramasuriya2019mint}
G=(S⊤W−1S)−1S⊤W−1 where W is the
covariance of base-forecast errors. Optimal under unbiased base
forecasts: minimises the trace of the reconciled forecast-error
covariance. The default in modern hierarchical packages.
Trace-minimisation with shrinkage. When W is hard to
estimate, replace with a Ledoit–Wolf shrinkage estimator or a
diagonal target. Gives most of MinT's benefit at a fraction of the
estimation cost.
A working MinT reconciler in 15 lines of NumPy:
import numpy as npdef mint(y_hat: np.ndarray, S: np.ndarray, W: np.ndarray) -> np.ndarray: """y_hat: base forecasts (n_total,); S: (n_total, n_leaves); W: (n_total, n_total).""" Winv = np.linalg.pinv(W) G = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv) return S @ G @ y_hat # coherent forecast
A useful empirical pattern: MinT with shrinkage beats both pure
bottom-up and pure top-down on every standard benchmark, and the gap
is largest in the middle of the hierarchy where information from
both directions matters.
Cross-sectional forecasting on equity panels
Cross-sectional forecasting is hierarchical forecasting where the
"parent" is not a sum but a factor. Returns of asset i
decompose as
ri,t=bi⊤ft+εi,t,
with factor loadings bi, factor returns ft, and
idiosyncratic residuals εi,t. Three forecasting
strategies that exploit this structure:
Forecast factors, propagate to assets. Estimate bi
cross-sectionally, forecast ft with the multivariate
tools of Section 04-02, then r^i,t+1=bi⊤f^t+1.
Stable on large universes; ignores asset-specific signal.
Forecast residuals, add factor predictions. A two-stage
approach: fit factors, then fit a per-asset residual model. The
asset-specific signal is the part the factor structure cannot
explain.
Joint cross-sectional model. Train a single network on the full
panel with shared parameters across assets and a per-asset embedding
(Section 04-04 / 04-05). The model learns the implicit factor
structure rather than relying on a pre-specified B.
For a coherent portfolio-level forecast, the factor-level and
asset-level forecasts have to satisfy
r^p,t=wt⊤r^t. MinT-style
reconciliation works here too: treat the portfolio as the "top" of
the hierarchy and reconcile.
Probabilistic reconciliation
Coherence on point forecasts is straightforward; coherence on
distributions is harder. Two practical recipes:
Reconcile samples. Generate sample trajectories
{bt(s)}s=1S at the leaves; aggregate each sample
yt(s)=Sbt(s). The empirical
distribution at every node is coherent by construction. Storage
scales with S×n; the computation is embarrassingly
parallel.
Reconcile parameters. For Gaussian predictive densities
N(μ^t,Σ^t) at the bottom,
the propagated distribution at any aggregation level is also
Gaussian with mean Sμ^t and covariance
SΣ^tS⊤. Closed-form, but parametric mis-
specification at the leaves propagates to every level.
Hierarchical-aware foundation models
The 2024–2025 wave of foundation forecasters (Section 04-08) does
not yet handle hierarchy natively — they predict per-series and
leave reconciliation to the user. Two practical patterns:
Forecast per-leaf with the foundation model, reconcile with MinT.
Best of both: foundation-model coverage at the leaves, classical
coherence at aggregations.
Forecast at each level independently, ensemble. The simplest
fallback when a foundation model has different strengths at
different aggregation levels — top forecast from the foundation
model, leaf forecast from a tree baseline, reconcile both.
A workflow that travels
A pattern that works across most financial hierarchies:
Define S explicitly. Write down the aggregation as a sparse
matrix; check that columns sum correctly.
Forecast each node independently with the per-node best model.
Boring is fine — LightGBM with engineered features is the most
common winner per node.
Reconcile with MinT (shrinkage). Use a held-out window to
estimate the base-error covariance with shrinkage; recompute
periodically.
Evaluate at every level. Per-node CRPS and aggregate-level
CRPS. Reconciliation should not silently degrade either.
Audit residual coherence. Confirm that the reconciled outputs
actually satisfy y~=Sb~ to
tolerance — useful as a CI test.
Where this fits in the chapter
Hierarchical forecasting is the organisational concern that makes
forecasts deployable in books with structured products. The
forecasters of Sections 04-01 through 04-05 are the components; this
section is the glue that makes them produce numbers that downstream
systems can defend.
Foundation Forecasters in Practice
Between 2023 and 2025 a new family of time-series models arrived that
are pretrained on billions of time-series tokens and produce
probabilistic forecasts in zero- or few-shot settings. Chronos
\citep{ansari2024chronos}, Lag-Llama \citep{rasul2024lagllama}, MOIRAI
\citep{woo2024moirai}, TimesFM, and Time-LlaMA
\citep{chen2024timellama} all belong to this family. They are not the
final word in time-series forecasting — task-specific transformers
(Section 04-05) typically still beat them by 5–15% on CRPS when you
have enough data — but they give you a strong baseline with zero
per-series training, which is often the single most useful thing in
a finance research workflow.
This section is the hands-on companion to the Chapter 7 LLM-
integration discussion. The focus is operational: when to reach for a
foundation forecaster, how to evaluate it on your own data, and how
to combine it with task-specific models.
What "foundation forecaster" means
The term covers two architectural lineages:
Decoder-only transformer trained on tokenised series.
Chronos quantises every time series into a fixed vocabulary and
trains a language-model-style decoder over that vocabulary.
Lag-Llama uses a LLaMA backbone with engineered lag features as the
token sequence. TimesFM and Lag-Llama are also in this family.
Universal encoder–decoder with masking.
MOIRAI handles arbitrary frequency, multivariate panels, and
exogenous variables in one model with a masked-patch encoder. The
more flexible architecture comes at the cost of more parameters.
The unifying claim: pretraining on a sufficiently broad corpus of
time series teaches the model temporal patterns that transfer to
unseen series. The empirical evidence supports a moderate version of
this claim — foundation models are state of the art on cold-start
forecasting (no target-series history) and competitive when target
history is short.
When to reach for one
Three concrete scenarios where foundation forecasters earn their place.
Cold-start. A new asset, a new dataset, a new product line.
No history to train a specialist; foundation forecaster gives a
defensible baseline within minutes.
Many sparse series. A panel of hundreds of series with little
history each. Training a specialist per series is impractical;
fine-tuning the foundation model on the panel is.
Rapid prototyping. Before committing to a feature pipeline and
a custom architecture, run a foundation forecaster as a sanity
check. If a pretrained model produces a forecast comparable to
your custom system, the custom system is not earning its
complexity.
Three scenarios where they are the wrong choice:
Long, stable history with a specific failure mode. When the
important signal is in the tails or in a regime-specific dynamics
the pretraining corpus does not contain (a fund-flow seasonal, a
market-microstructure pattern), a specialist does it better.
Tight latency budget. Foundation forecasters are large; even
a quantised inference is slower than a small task-specific
transformer or a LightGBM baseline.
Hard distribution requirements. If you need exact CRPS on a
Student-t tail or a specific quantile shape, a specialist with a
matching head will win.
A vendor-neutral evaluation harness
The same harness should work across all foundation forecasters so the
comparison is honest. Pseudocode pattern:
A concrete pattern using Chronos-T5 (the simplest foundation
forecaster API to start with). The model loads from Hugging Face;
inference is on a single GPU or CPU.
import numpy as npimport pandas as pdimport torchfrom chronos import ChronosPipeline# 1) load a pretrained foundation forecaster (zero-shot)pipe = ChronosPipeline.from_pretrained( "amazon/chronos-t5-small", device_map="cpu", # or "cuda" if available torch_dtype=torch.float32,)# 2) take a single asset's historyprices = pd.read_csv("data/prices_daily.csv", parse_dates=["Date"])spy = prices.query("Ticker == 'SPY'").sort_values("Date")ctx = torch.tensor(np.log(spy["AdjClose"]).diff().dropna().values[-512:])# 3) probabilistic forecast: 100 samples for the next 20 dayssamples = pipe.predict(context=ctx, prediction_length=20, num_samples=100)samples = samples.cpu().numpy() # shape (1, 100, 20)# 4) per-horizon CRPS against held-out actuals (one rolling origin)def crps_sample(forecast_samples, actual): fa = np.abs(forecast_samples - actual).mean(axis=0) fb = 0.5 * np.abs(forecast_samples[:, None] - forecast_samples[None, :]).mean(axis=(0, 1)) return fa - fbactual = np.log(spy["AdjClose"]).diff().dropna().values[-20:] # next 20 days actualcrps_h = crps_sample(samples[0], actual) # (20,)print(f"mean CRPS = {crps_h.mean():.4f}")
For a rolling-origin evaluation, wrap steps 2–4 in a loop over a
sequence of as-of dates; the per-horizon CRPS averaged across origins
is the headline number to compare across foundation models.
Three loops to keep separate: rolling origin, horizon, sample. The
aggregation across rolling origins is what gives the headline number;
the per-horizon and per-sample distributions are the diagnostics.
A practical complication: foundation models have different default
context lengths and different default number of samples. Always
benchmark with the same context (largest the smallest model
supports) and the same sample count; otherwise the comparison is
about hyperparameters, not models.
Patterns for combining with task-specific models
Three patterns that beat either component alone.
Foundation model → tree residual. Use the foundation model's
point forecast as a feature, train a LightGBM on the residual. The
tree absorbs systematic biases the pretrained model has on this
particular dataset.
Foundation model + specialist ensemble. A precision-weighted
ensemble of the foundation forecaster's CRPS and a specialist
(PatchTST or LightGBM) routinely produces better calibration and
sharpness than either alone.
Foundation model warm-start, fine-tune on target. Fine-tune the
foundation model on the panel with a small learning rate and
parameter-efficient adapters (LoRA, prefix tuning). Trade-off: more
compute, but the resulting model often beats both the zero-shot
foundation model and the specialist trained from scratch.
Limits and failure modes worth knowing
Distributional under-spread on rare regimes. Foundation models
trained on a broad corpus tend to produce narrow predictive
distributions on stress regimes that look unusual relative to the
training mix. The reliability diagram per regime (Section 02-02)
catches this.
Calendar blindness. Most pretrained forecasters do not see
calendar features (day-of-week, month-end, holidays) by default.
Inject them as exogenous covariates if the model supports it
(MOIRAI does; Chronos does not natively).
Tokenisation artefacts. Quantisation-based models (Chronos)
bin values; the bin grid is set at pretraining and may not span
your data's range. Re-scale targets to the model's training range
before inference.
Operational notes
Hosting. Foundation forecasters in the 100M–1B parameter range
fit on a single GPU; quantised variants run on CPU at acceptable
latency. Hugging Face hosts most of them.
Reproducibility. Pin the model weights to a commit hash; pin
the inference library version; log the seed for the sampling step.
Caching. Cache zero-shot forecasts by (model_hash, series_hash, history_hash) — the same query two days in a row
should not recompute.
Where this fits
Foundation forecasters sit in a different layer than the task-
specific transformers of Section 04-05. Treat them as strong
baselines you do not have to train rather than as replacements for
the specialist machinery in the rest of the chapter. Their main
contribution to a financial AI workflow is velocity: time from a
new dataset to a defensible forecast measured in minutes rather than
weeks. The decision framework in Chapter 5 and the agent integration
in Chapter 8 both compose with foundation forecasters as easily as
with specialist forecasters; the rest of the book does not need to
care which is under the hood, only that the predictive distributions
are calibrated and sharp.