"What is real? How do you define real?"
— Morpheus, The Matrix (1999)
For a stress test the answer is: real enough that a policy's risk
profile is recognisably the same on real and synthetic data. Real
market data is finite, biased toward bull regimes, and stingy with
exactly the tail events we most need to stress-test against.
Synthetic data is how we fill those gaps in a disciplined way:
train a generative model on the historical record, sample from it
under controlled conditioning, and use the resulting paths for stress
testing, policy evaluation, and pretraining. The chapters before this
one assumed the synthetic loop existed; this chapter is where it
finally gets built.
Why generate, not only resample
Bootstrap and block-bootstrap methods reuse only observed values, so they
preserve the empirical marginals by construction and are easy to defend,
but they cannot produce scenarios that never happened.
A generative model can. The trade-off is that generation introduces a
new layer of modelling assumptions — the generator itself — that have to
be evaluated.
Three modes the book uses
Stress testing. Generate paths that lie deliberately in the tail
(USD spike, credit blow-out, vol regime that the training set never
saw) and check that the policies of Chapter 5 stay within their risk
limits.
Policy evaluation. Replace held-out real trajectories with
held-out synthetic ones from the same DGP to estimate the variance
of an estimator without burning real out-of-sample data.
Pretraining and augmentation. When a target dataset is small (a
niche fund, an emerging asset class), augment with synthetic samples
that share the relevant statistical structure, then fine-tune on the
real data.
What this chapter covers
Diffusion as the modern default — forward / reverse processes,
training loss, conditioning patterns, sampling speed.
GAN-based generators (CTGAN, TimeGAN) for the cases where
GANs still lead — tabular cross-sections and faster sampling.
VAE-based generators for factor-aware counterfactual
generation, the generative sibling of Chapter 6's identifiable
dynamics.
Bootstrap and copula synthesis — the classical, defensible
toolkit for compliance-friendly stress generation.
Evaluation and privacy — the four-leg validation protocol,
memorisation audits, differential privacy, and the
reproducibility manifest every synthetic dataset needs.
Contents
Diffusion — denoising diffusion as the
workhorse generator for time-series synthesis, with the
conditioning patterns that make stress testing operational.
GAN-Based Generators — CTGAN, TimeGAN,
and the strengths and failure modes (mode collapse, tail
under-fit) of the adversarial family on financial data.
VAE-Based Generators — conditional
VAEs and factor-VAEs, the generative bridge from Chapter 6's
identifiable dynamics into synthetic counterfactual paths.
Bootstrap and Copula Synthesis —
block bootstrap, copula-based scenarios, and hybrid recipes for
compliance-friendly stress.
Evaluation and Privacy — the
four-leg validation protocol, memorisation detection, DP-SGD,
and the manifest schema that makes synthetic datasets
reproducible.
Diffusion
Denoising diffusion has become the dominant generative method for time
series, displacing both VAEs (which struggle with sharp tails) and GANs
(which are unstable and mode-drop on heterogeneous panels). The recent
literature converges on diffusion as the right substrate for probabilistic
time-series generation \citep{su2024diffusion-survey}: it accepts rich
conditioning, captures multimodal distributions, and produces tails that
match the training distribution rather than collapsing to a Gaussian
approximation. This section covers the core algorithm, the variants that
matter for finance, and the evaluation protocol that decides whether a
generator is good enough to feed back into Chapters 4–6.
Forward and reverse processes
The forward process progressively adds Gaussian noise to a real sample
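$x_0$ across $T$ steps,
\[ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \]
with a noise schedule $\{\beta_t\}_{t=1}^{T}$. A useful closed form follows
by composition: with $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$,
\[ q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right), \]
so that $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.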
In the denoising-diffusion probabilistic model (DDPM) parameterisation
\citep{ho2020ddpm} we predict the noise that was added,
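$\epsilon_\theta(x_t, t)$, and recover the mean from
the closed-form posterior. The training loss reduces to a denoising MSE:
\[ \mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right], \]
where $x_t$ is $x_0$ noised to step $t$ with $\epsilon \sim \mathcal{N}(0, I)$.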
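A minimal NumPy sketch of the three pieces, evaluation only. The linear schedule endpoints, the window length, and the `ToyDenoiser` class are illustrative assumptions rather than settings from this chapter, and the gradient step itself would need an autodiff framework.

```python
import numpy as np

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule plus the cumulative products alpha_bar_t."""
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) in one shot using the closed form."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t][:, None]                 # (batch, 1), broadcast over the window
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

class ToyDenoiser:
    """One-hidden-layer MLP for eps_theta(x_t, t); a placeholder, not a real backbone."""
    def __init__(self, window, hidden, T, rng):
        self.T = T
        self.W1 = rng.standard_normal((window + 1, hidden)) / np.sqrt(window + 1)
        self.W2 = rng.standard_normal((hidden, window)) / np.sqrt(hidden)

    def __call__(self, xt, t):
        t_feat = (t / self.T)[:, None]        # crude scalar time embedding
        h = np.tanh(np.concatenate([xt, t_feat], axis=1) @ self.W1)
        return h @ self.W2

def denoising_loss(model, x0, alpha_bar, rng):
    """Monte Carlo estimate of the denoising MSE on one batch."""
    t = rng.integers(0, len(alpha_bar), size=x0.shape[0])
    xt, eps = q_sample(x0, t, alpha_bar, rng)
    return float(np.mean((eps - model(xt, t)) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bar = make_schedule()
x0 = 0.01 * rng.standard_normal((64, 32))     # stand-in for 64 return windows of length 32
model = ToyDenoiser(window=32, hidden=64, T=len(betas), rng=rng)
loss = denoising_loss(model, x0, alpha_bar, rng)
```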
The denoiser, schedule, and loss are all that is needed to start.
A serious implementation swaps the toy denoiser for a transformer or
state-space backbone (Section "Backbone choices" below) and trains
on real return windows; the loss form is unchanged.
Backbone choices
The denoiser network ϵθ does most of the work.
Three backbones cover the territory:
1-D U-Net. The original DDPM backbone, transplanted from images.
Works well on regular univariate series but does not scale gracefully to
many channels.
Transformer / DiT. Treats time as a sequence of patches with
attention. The default for multivariate panels and long sequences;
conditioning is plumbed through cross-attention or AdaLN.
State-space backbones (Mamba/S4). Recently competitive on long
series, with much lower compute at long context than transformers. A
reasonable choice when sequences are very long (intraday, tick-level).
Empirically, transformer backbones with patch tokenisation produce the
best stability vs. quality trade-off on equity panels and macro series in
the 100–1000 step range — the regime most finance applications live in.
Conditioning
Unconditional generation rarely answers the question we actually have.
Useful conditioners for finance:
Macro regime. Discrete labels (recession / expansion, high-vol /
low-vol) injected via embedding lookup. Lets you sample stress
scenarios on demand.
Calendar features. Day-of-week, day-of-month, time-to-earnings.
Without these, generators produce series that ignore microstructural
seasonality.
Continuous covariates. Yield-curve slope, credit spread,
realised-volatility level. Inject via FiLM or AdaLN.
Text. Multimodal conditional diffusion \citep{su2025multimodal}
conditions the generator on textual scenario descriptions ("a 10%
S&P drawdown driven by tech earnings"). This is where Chapter 8's
agents become useful — they author the conditioning text from a
scenario library.
A FiLM-style conditioning layer that handles the most common case
(scalar regime label + continuous covariate vector) in the denoiser:
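A minimal NumPy sketch of such a layer, forward pass only. The class name `FiLM`, the one-hot regime encoding (equivalent to an embedding lookup once multiplied by `W`), and all dimensions are assumptions made for illustration.

```python
import numpy as np

class FiLM:
    """Feature-wise linear modulation: h -> h * (1 + gamma(c)) + beta(c)."""
    def __init__(self, n_regimes, n_cov, n_features):
        self.n_regimes = n_regimes
        self.n_cov = n_cov
        # Zero init: gamma = beta = 0, so the layer starts out as the identity.
        self.W = np.zeros((n_regimes + n_cov, 2 * n_features))
        self.b = np.zeros(2 * n_features)

    def conditioner(self, regime=None, cov=None, batch=1):
        """Concatenate one-hot regime and covariates; zero-pad whichever is missing."""
        onehot = np.zeros((batch, self.n_regimes))
        if regime is not None:
            onehot[np.arange(batch), regime] = 1.0
        if cov is None:
            cov = np.zeros((batch, self.n_cov))
        return np.concatenate([onehot, np.asarray(cov, dtype=float)], axis=1)

    def __call__(self, h, c):
        gamma, beta = np.split(c @ self.W + self.b, 2, axis=1)
        return h * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
film = FiLM(n_regimes=3, n_cov=2, n_features=8)
h = rng.standard_normal((4, 8))               # hidden activations inside the denoiser
c = film.conditioner(regime=np.array([0, 2, 1, 1]),
                     cov=rng.standard_normal((4, 2)), batch=4)
out = film(h, c)                              # equals h until W and b are trained
```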
Concatenated conditioner. Discrete regime + continuous
covariates fold into one vector before FiLM. Lets you condition
on either alone (zero-pad the missing component) or both.
(1 + gamma) scaling. The standard FiLM trick: at
initialisation gamma = 0 and the layer reduces to identity, so
the conditioner doesn't disrupt training before it has signal to
contribute.
The dominant training-time conditioning trick is classifier-free
guidance \citep{ho2022cfg}: train with the conditioner randomly
dropped (probability ~0.1), then at sample time combine the
conditional and unconditional noise predictions with a guidance scale $w$,
\[ \epsilon_w(x_t, t, c) = (1+w)\,\epsilon_\theta(x_t, t, c) - w\,\epsilon_\theta(x_t, t, \varnothing). \]
Setting $w = 0$ recovers the plain conditional model. Larger $w$ produces
samples that adhere more strongly to the conditioning
at the cost of diversity; finance applications tend to use $w \in [1, 3]$.
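Both halves of the trick fit in a few lines. The sketch below assumes the null conditioner is an all-zeros vector and uses `toy_eps` as a hypothetical stand-in for a trained conditional denoiser.

```python
import numpy as np

def drop_condition(cond, null_cond, p_drop, rng):
    """Training side: swap each row's conditioner for the null one with probability p_drop."""
    mask = rng.random(cond.shape[0]) < p_drop
    out = cond.copy()
    out[mask] = null_cond
    return out, mask

def guided_eps(eps_model, x_t, t, cond, null_cond, w):
    """Sampling side: (1 + w) * eps(x_t, t, c) - w * eps(x_t, t, null)."""
    null = np.broadcast_to(null_cond, cond.shape)
    eps_c = eps_model(x_t, t, cond)
    eps_u = eps_model(x_t, t, null)
    return (1.0 + w) * eps_c - w * eps_u

def toy_eps(x_t, t, c):
    """Hypothetical denoiser: any callable with this signature would do."""
    return 0.1 * x_t + c.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((8, 16))
cond = rng.standard_normal((8, 4))
null_cond = np.zeros(4)
eps_w = guided_eps(toy_eps, x_t, 10, cond, null_cond, w=2.0)
```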
Sampling: speed vs. quality
DDPM with T=1000 steps is slow. Three accelerations:
DDIM \citep{song2021ddim}. A deterministic, non-Markovian sampler for the
same trained model that lets you skip ahead in fewer steps (50 steps is
typical). Quality is marginally lower than full 1000-step DDPM sampling;
usually worth it.
Consistency models / progressive distillation. Train a "shortcut"
model to invert the entire diffusion in 1–4 steps. Tens of times
faster; quality depends on the distillation budget.
Solver-based methods (DPM-Solver). Treat the reverse process as an
ODE and use higher-order ODE solvers; 10–20 steps with tight quality.
For finance we mostly care about batch sampling rather than real-time,
so DDPM with DDIM acceleration is usually sufficient.
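A sketch of the deterministic DDIM update on a strided subset of steps. The schedule endpoints are assumed, and `oracle_eps` is a hypothetical denoiser that is exact only for a toy data distribution concentrated on a single point `mu`, which makes the sampler easy to check.

```python
import numpy as np

def ddim_sample(eps_model, alpha_bar, shape, n_steps, rng):
    """Deterministic DDIM sampling on n_steps evenly strided steps (n_steps <= T)."""
    T = len(alpha_bar)
    steps = np.linspace(T - 1, 0, n_steps).round().astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(steps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[steps[i + 1]] if i + 1 < n_steps else 1.0
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)        # implied clean sample
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps    # jump to the earlier step
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
mu = np.array([0.5, -1.0, 2.0])

def oracle_eps(x, t):
    """Exact noise prediction when every training sample equals mu."""
    return (x - np.sqrt(alpha_bar[t]) * mu) / np.sqrt(1.0 - alpha_bar[t])

sample = ddim_sample(oracle_eps, alpha_bar, shape=(4, 3), n_steps=50,
                     rng=np.random.default_rng(0))
```

With the oracle denoiser, 50 steps land on `mu` for every row, which is a cheap sanity check to run before plugging in a trained network.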
Evaluation: more than visual inspection
A diffusion generator that looks right can still mis-fit the parts that
matter for finance. The evaluation protocol has four legs:
Marginal-fit. Per-series mean, std, skewness, kurtosis on synthetic
vs. real. Two-sample tests (Kolmogorov–Smirnov, energy distance) on
marginal returns.
Autocorrelation and clustering. Compare ACF of returns, ACF of
absolute / squared returns. A generator that breaks volatility
clustering is unusable for risk modelling.
Tail coverage. Estimate VaR and ES at 1%, 0.5%, 0.1% on synthetic
and real. Tail under-coverage is the failure mode that quietly breaks
stress tests.
Downstream performance. The test that matters: train the
forecaster of Chapter 4 or the policy of Chapter 5 on synthetic data;
evaluate on real held-out data. If performance is in the same band as
training on real-only, the generator is preserving the right
structure. If it collapses, the generator is leaking spurious
correlations.
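The first two legs need only a few lines of NumPy. In the sketch below a simulated GARCH(1,1) series stands in for real returns and an IID Gaussian series for a generator that matches the scale but has lost volatility clustering; both choices, and the GARCH parameters, are assumptions for illustration.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def acf(x, max_lag):
    """Sample autocorrelation at lags 1..max_lag."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

def simulate_garch(n, omega, alpha, beta, rng):
    """GARCH(1,1) returns, used here only as a stand-in for real data."""
    r = np.empty(n)
    var = omega / (1.0 - alpha - beta)
    for t in range(n):
        r[t] = np.sqrt(var) * rng.standard_normal()
        var = omega + alpha * r[t] ** 2 + beta * var
    return r

rng = np.random.default_rng(0)
real = simulate_garch(5000, omega=0.05, alpha=0.15, beta=0.80, rng=rng)
fake = real.std() * rng.standard_normal(5000)     # right scale, no clustering

ks = ks_statistic(real, fake)                     # leg 1: marginal fit
clustering_real = acf(real ** 2, 10).mean()       # leg 2: ACF of squared returns
clustering_fake = acf(fake ** 2, 10).mean()
```

The marginal check alone can look acceptable here while the squared-return ACF of the synthetic series is close to zero, which is why the legs are run together.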
A useful complementary diagnostic from the synthetic-data literature
is sample-level utility/disclosure scoring
\citep{drechsler2024synthpop}, which trades realism against memorisation
risk on a per-sample basis. Worth running when the synthetic set is
shared outside the team that owns the source data.
Memorisation and privacy
Diffusion models can and do memorise training samples when the data is
small or repetitive. Two practical defences:
DP-SGD training. Adds calibrated noise to the gradient updates.
Slows training and reduces sample quality; appropriate when the
training set contains client-level data.
Memorisation audits. After training, sample N synthetic windows
and find each sample's nearest training neighbour by L2 distance.
Inspect the distribution of nearest-neighbour distances; samples
much closer than the bulk are likely memorised.
For purely market-data applications, memorisation is mostly an
evaluation concern (you don't want to test policies on copies of the
training set), not a privacy concern.
Deployment patterns
Persist seeds and conditioning. Every synthetic path used in a
report or test is reproducible: save the seed, the conditioner, and
the model version.
Blend, don't replace. When training a downstream model, mix
synthetic and real samples (50/50 is a reasonable default) rather
than substituting synthetic for real. Pure-synthetic training tends to
produce models that ace synthetic out-of-sample and degrade on real.
Monitor distribution shift. When the live regime moves, the
generator trained on yesterday's regime starts producing scenarios
that look implausible. Re-fit on a sliding window — every quarter
is a reasonable cadence for daily macro/equity panels.
Connecting back to the rest of the book
The synthetic-data layer closes the loop:
Train a generator on real data (this section).
Generate stress scenarios conditioned on regimes or text from
Chapter 8 agents.
Evaluate the forecasters of Chapter 4 and the policies of Chapter 5
on those scenarios.
Feed the worst-case results back into Chapter 5's risk constraints.
Diffusion is the engine that makes the loop possible. The rest of the
book — forecasting, decisions, dynamics, agents — is what gives the
loop something to test.
GAN-Based Generators
Diffusion models (Section 10-01) have become the default time-series
generator, but the GAN family — generative adversarial networks —
remains worth knowing for two reasons. First, several finance-specific
generators (CTGAN for tabular cross-sectional data, TimeGAN for
sequential panels) live here. Second, GAN samples are typically faster
to draw than diffusion samples, which matters when the synthetic-data
pipeline is rate-limiting.
This section covers what GANs do, what they get right and wrong on
financial data, and when to reach for them over diffusion.
What a GAN is, in one paragraph
A GAN trains two networks against each other. A generator $G_\theta$
maps noise to a sample, $G_\theta(z) \in \mathbb{R}^d$. A
discriminator $D_\phi$ tries to distinguish real samples from
generator samples and emits a probability of "real". The generator
is trained to fool the discriminator; the discriminator is trained
to be harder to fool. At equilibrium the generator's distribution
matches the data distribution and the discriminator's accuracy is
50%. The training objective is the minimax game
\[ \min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D_\phi(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D_\phi(G_\theta(z))\right)\right]. \]
The original formulation \citep{goodfellow2014gan} suffers from
training instability and mode collapse; the Wasserstein GAN
variant \citep{arjovsky2017wgan} replaces the cross-entropy with an
Earth-Mover-style distance and is the practical default in modern
work.
Where GANs work for finance
Two specialised architectures have stuck.
CTGAN — tabular cross-sectional data
CTGAN \citep{xu2019ctgan} is the standard for synthesising tabular
data with a mix of continuous and categorical columns. It introduces
two ideas that matter for finance tabular data:
Mode-specific normalisation of continuous columns. Continuous
features in finance (returns, volumes, ratios) often have multiple
modes — separating regimes, asset classes, or sector clusters.
CTGAN fits a Gaussian mixture per column and generates the mode
index plus a normalised value within the mode.
Conditional sampling on rare categories. Without it, GANs
collapse to majority-class samples; with it, the generator can be
asked to produce samples conditioned on a sector, country, or
rating.
    import polars as pl
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import CTGANSynthesizer

    # Ticker-level features for one cross-section; SDV works on pandas frames.
    panel = pl.read_csv("data/cross_section.csv").to_pandas()

    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(panel)

    # batch_size must be even and divisible by CTGAN's `pac` setting (default 10).
    syn = CTGANSynthesizer(metadata, epochs=300, batch_size=500)
    syn.fit(panel)

    # Generate a synthetic panel of the same shape.
    fake = syn.sample(num_rows=len(panel))
CTGAN works well for per-period cross-sections — a snapshot of
the universe at a single date — but not for time-series structure.
For that, TimeGAN.
TimeGAN — sequential panels
TimeGAN \citep{yoon2019timegan} layers a GAN on top of an embedding
network so the adversarial loss is computed in a learned latent
space rather than on raw sequences. Three losses combined:
Reconstruction loss for the embedding network (autoencoder
style).
Adversarial loss between real and synthetic latents.
Supervised loss on next-step prediction in the latent space —
this is the trick that gives TimeGAN better temporal consistency
than vanilla sequence GANs.
The result is plausible-looking returns sequences that preserve
autocorrelation and volatility clustering reasonably well — better
than independent GANs, not as well as a properly trained diffusion
model on the same data.
Where GANs struggle
Three failure modes recur in finance:
Mode collapse. Generator emits a small number of distinct
samples that fool the discriminator without covering the data
distribution. On equity panels this looks like "all synthetic
series resemble the same handful of stocks".
Training instability. Loss curves that don't converge,
generator/discriminator capacity mismatches, hyperparameter
brittleness. Wasserstein GAN with gradient penalty (WGAN-GP)
mitigates but does not eliminate.
Tail under-fit. GAN training emphasises the bulk of the
distribution where the discriminator's gradient is largest. Tails
— the part finance applications care most about — are
systematically under-sampled compared to diffusion.
For pure stress-testing applications where tail fidelity is the
primary requirement, diffusion remains the better choice. For
augmentation tasks where the bulk of the distribution matters more
than the extremes, GANs are competitive and faster.
Evaluation specifics
The four-leg validation protocol of Section 10-01 (marginal,
autocorrelation, tail, downstream) applies to GAN samples too,
with one extra check:
Mode coverage. Cluster the real data into K regimes
(k-means on returns + vol features, or a GMM). Compute the mode
occupancy of synthetic samples and compare. Strong under-coverage
of any cluster is mode collapse, even if the marginal looks fine.
A practical pattern: train both a CTGAN and a small diffusion model
on the same data; if mode-coverage diverges, trust diffusion and
debug the GAN.
When to use GANs
A working decision rule:
Tabular cross-sections without strong temporal structure →
CTGAN. Faster than diffusion, well-supported tooling (SDV).
Sequential panels with limited compute → TimeGAN, with the
caveat that tail fidelity will be worse than diffusion.
Both → ensemble. Use CTGAN samples for augmentation and
diffusion samples for stress. The downstream evaluation in Chapter
5 stays the same.
GANs are not the future of generative modelling for finance — that
title now belongs to diffusion and (where preference data exists)
flow matching. But the specialised finance-tabular and sequential
variants are mature, well-tooled, and worth keeping in the toolkit
for the cases they were designed for.
VAE Generators
Variational autoencoders sit between GANs and diffusion in the
generative-model landscape — a probabilistic encoder-decoder pair
trained by maximising a likelihood-style objective rather than an
adversarial game. For finance applications the practical shape that
matters is the factor-aware variant: a VAE whose latent space is
constrained to admit identifiable axes, which lets us generate
counterfactual paths conditioned on factor exposures rather than on
raw observation features.
This connects directly to Chapter 6's identifiable dynamic factor
models. The VAE side of the story is the generative sibling of the
identification story.
What a VAE does
A VAE consists of an encoder $q_\phi(z \mid x)$
that maps observations to a distribution over a latent variable
$z$, and a decoder $p_\theta(x \mid z)$
that maps latents back to observations. Training maximises the
evidence lower bound (ELBO):
\[ \mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right). \]
The first term is reconstruction quality; the second is a
regularisation that keeps the encoder's posterior close to a prior
$p(z)$ (typically standard normal). Sampling is by drawing
$z \sim p(z)$ and decoding.
VAEs are easier to train than GANs (no adversarial dynamics) and
give an explicit lower bound on the log-likelihood, the ELBO, for
evaluation. They tend to produce
blurrier samples than GANs or diffusion in the image domain;
in the time-series domain the analogue is "smooth, mean-reverting"
generated paths that under-represent shocks.
Conditional VAEs
The version that does most of the work in finance is the conditional
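VAE (CVAE): the encoder and decoder both condition on auxiliary
information $c$ (regime, calendar, factor exposure). The
ELBO becomes
\[ \mathcal{L}(\theta, \phi; c) = \mathbb{E}_{q_\phi(z \mid x, c)}\left[\log p_\theta(x \mid z, c)\right] - \mathrm{KL}\left(q_\phi(z \mid x, c) \,\|\, p(z \mid c)\right). \]
To generate a path under regime $c^\star$:
sample $z \sim p(z \mid c^\star)$
and decode through $p_\theta(x \mid z, c^\star)$.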
Factor-VAE for finance
The architecture that maps cleanly onto Chapter 6's identifiable
dynamic factor model:
Encoder $q_\phi(z_t \mid y_{1:t}, u_t)$
produces innovations $z_t$ conditioned on observations
and auxiliary context $u_t$ (calendar, regime).
The latent dynamics are diagonal linear:
\[ f_{t+1} = A f_t + B z_t. \]
Decoder $p_\theta(y_t \mid f_t)$ is an
injective MLP plus fixed observation noise.
Innovation prior $p(z_t \mid u_t)$ is
conditional non-Gaussian (Laplace, in practice).
Under the right assumptions on $u_t$ variation, the
innovations are identifiable up to permutation and component-wise
affine maps; the same identifiability story carries over to the
factors. The generative side: sample new innovations, propagate
through the dynamics, decode. The result is a synthetic time-series
panel whose latent factors have the same semantic meaning as the
real-data factors — which makes it usable for factor-aware
backtesting.
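The generative side can be sketched in a few lines of NumPy. Everything specific below is an assumption for illustration and not the dfm-python implementation: a single leaky-ReLU layer over a tall matrix `W` as the injective decoder, a fixed-scale Laplace innovation prior with no dependence on $u_t$, and the particular values of `a_diag` and `B`.

```python
import numpy as np

def leaky_relu(v, slope=0.2):
    return np.where(v > 0, v, slope * v)

def simulate_factor_panel(a_diag, B, W, n_steps, rng, obs_std=0.01):
    """Sample innovations, propagate the factors, decode to observations."""
    f = np.zeros(len(a_diag))
    factors, panel = [], []
    for _ in range(n_steps):
        z = rng.laplace(0.0, 1.0, size=B.shape[1])     # non-Gaussian innovations z_t
        f = a_diag * f + B @ z                          # f_{t+1} = A f_t + B z_t, A diagonal
        y = leaky_relu(W @ f)                           # decoder mean
        factors.append(f)
        panel.append(y + obs_std * rng.standard_normal(y.shape))
    return np.array(factors), np.array(panel)

rng = np.random.default_rng(0)
a_diag = np.array([0.9, 0.5, 0.2])        # one persistence parameter per factor
B = np.eye(3)
W = rng.standard_normal((6, 3))           # 6 observed series driven by 3 factors
factors, panel = simulate_factor_panel(a_diag, B, W, n_steps=250, rng=rng)
```

Because the factors are returned alongside the panel, an intervention such as scaling one entry of `a_diag` or zeroing a column of `B` can be compared path by path against the baseline.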
This is the "iVDFM" architecture from the paper line referenced in
Chapter 6; the dfm-python package (Section 6-05) ships a working
implementation.
A worked CVAE for daily returns
A minimal training loop for a per-asset CVAE on daily returns,
conditioned on a regime indicator:
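The loop itself needs an autodiff framework. As a framework-free sketch, the per-batch objective such a loop would minimise, and the sampling step under a chosen regime, can be written in NumPy. The linear encoder, prior and decoder, the `params` dictionary, and the dimensions are all assumptions made to keep the example short.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL between diagonal Gaussians q and p, summed over latent dimensions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p) - 1.0,
        axis=1,
    )

def kl_weight(epoch, warmup_epochs, beta=1.0):
    """Linear KL annealing from 0 up to beta over warmup_epochs."""
    return beta * min(1.0, epoch / warmup_epochs)

def cvae_loss(x, c, params, epoch, rng, warmup_epochs=10, beta=1.0):
    """Negative conditional ELBO on one batch (single-sample estimate)."""
    xc = np.concatenate([x, c], axis=1)
    mu_q, logvar_q = xc @ params["enc_mu"], xc @ params["enc_logvar"]     # q(z | x, c)
    mu_p = c @ params["prior_mu"]                                          # p(z | c), unit variance
    z = mu_q + np.exp(0.5 * logvar_q) * rng.standard_normal(mu_q.shape)    # reparameterisation
    x_hat = np.concatenate([z, c], axis=1) @ params["dec"]                 # mean of p(x | z, c)
    recon = 0.5 * np.sum((x - x_hat) ** 2, axis=1)                         # Gaussian NLL up to a constant
    kl = gaussian_kl(mu_q, logvar_q, mu_p, np.zeros_like(mu_p))
    return float(np.mean(recon + kl_weight(epoch, warmup_epochs, beta) * kl))

def sample_cvae(c_star, params, rng):
    """Draw z from p(z | c*) and decode to the mean of p(x | z, c*)."""
    mu_p = c_star @ params["prior_mu"]
    z = mu_p + rng.standard_normal(mu_p.shape)
    return np.concatenate([z, c_star], axis=1) @ params["dec"]

rng = np.random.default_rng(0)
xdim, zdim, cdim = 5, 2, 2
params = {
    "enc_mu": 0.1 * rng.standard_normal((xdim + cdim, zdim)),
    "enc_logvar": 0.1 * rng.standard_normal((xdim + cdim, zdim)),
    "prior_mu": 0.1 * rng.standard_normal((cdim, zdim)),
    "dec": 0.1 * rng.standard_normal((zdim + cdim, xdim)),
}
x = 0.01 * rng.standard_normal((32, xdim))
c = np.eye(cdim)[rng.integers(0, cdim, size=32)]      # one-hot regime indicator
loss_start = cvae_loss(x, c, params, epoch=0, rng=np.random.default_rng(1))
loss_full = cvae_loss(x, c, params, epoch=10, rng=np.random.default_rng(1))
stress = sample_cvae(np.tile([0.0, 1.0], (8, 1)), params, rng)
```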
Latent dimension (zdim). Too small under-fits the data;
too large makes the prior loose and the samples blurry. Start
with xdim as a heuristic, tune.
KL annealing. Linearly ramp the KL term from 0 to its full
weight over the first 5–20 epochs. Avoids posterior collapse,
the failure mode where the encoder ignores x.
β-VAE. Scale the KL term by a factor β. β>1
pushes the latent toward more disentangled representations at the
cost of reconstruction quality; useful when factor interpretability
is the goal.
Where VAEs struggle
Tail fidelity. Standard Gaussian-prior VAEs systematically
smooth out the heavy tails of returns. Switching to a Student-t
or Laplace prior (and likelihood) helps; even so, diffusion
typically wins on tail metrics.
Sharp regime switches. A single Gaussian latent struggles to
represent abrupt regime changes. Mixture-of-Gaussian priors and
hierarchical VAEs (e.g., NVAE-style) handle them better.
Long sequences. Autoregressive VAEs lose long-range coherence
for the same reason RNN forecasters do; transformer-based VAEs
fix this at higher compute cost.
When VAEs are the right choice
Three concrete scenarios:
Factor-conditioned generation. Generate paths under "factor 2
doubled, factor 3 zeroed" — exactly the kind of intervention
Chapter 6's identifiable dynamics enables. CVAE plus a structured
latent space is the natural fit.
Per-asset augmentation in a small panel. A few hundred series
with limited history. VAE training is more stable than GAN at
this scale; quality is sufficient for augmentation.
Pretraining a forecaster. A VAE encoder, dropped into a
forecaster as a feature extractor, gives reasonable latent
representations to start from.
For pure stress generation, prefer diffusion (Section 10-01). For
identifiable factor-aware generation, VAE is the right tool. For
both at once, the iVDFM-style hybrid in Chapter 6 is the cleanest
expression of the goal.
Bootstrap and Copula Synthesis
Before reaching for a deep generative model, ask whether classical
resampling and copula methods cover the use case. They are simpler,
more defensible, and frequently sufficient, especially when the
operational requirement is to preserve the joint distribution of
historical returns rather than to generate scenarios that never
happened. This section covers the classical methods that often
beat fancy generators on benchmarks where the ground truth is
"reproduce the empirical distribution".
When non-generative methods suffice
Three concrete cases that resampling handles cleanly:
Backtesting variance estimates. Quantify the variance of a
Sharpe ratio or a drawdown estimate by resampling the historical
series. No generator needed.
Tail-aware Monte Carlo. Generate joint scenarios under the
empirical dependence structure plus parametric tails (EVT-GPD)
on the margins.
Compliance-friendly stress. A regulator asks for "stress
tests using historical observations". A bootstrap of crisis
windows is exactly that, with a defensible audit trail; a
diffusion sample is harder to defend.
If the use case is augmentation or truly out-of-distribution
stress, jump back to Sections 10-01 / 10-02 / 10-03. For the rest,
this section.
Block bootstrap, refresher
Section 02-05 introduced block bootstrap as the fix for the IID
bootstrap's incompatibility with serial dependence. The recipe:
Choose block length ℓ, ideally proportional to the
half-life of the dominant autocorrelation in the data.
Sample ⌈T/ℓ⌉ blocks of length ℓ uniformly
from the original series, with replacement.
Concatenate to form a synthetic series of length ∼T.
Three variants:
Moving block bootstrap. Blocks are contiguous slices.
Stationary bootstrap \citep{politis1994stationary}. Block
lengths are random, geometrically distributed with mean ℓ.
Reduces edge effects from fixed block boundaries.
Circular bootstrap. Wraps the series end-to-start before
block sampling, eliminating boundary artefacts.
    import numpy as np

    def stationary_bootstrap(series: np.ndarray, mean_block: int, n: int,
                             rng: np.random.Generator) -> np.ndarray:
        """Generate one stationary-bootstrap path of length n."""
        T = len(series)
        out = np.empty(n)
        p = 1.0 / mean_block          # per-step probability of starting a new block
        i = rng.integers(T)           # random start of the first block
        for t in range(n):
            out[t] = series[i % T]    # wrap around the end of the series
            if rng.random() < p:
                i = rng.integers(T)   # jump: start a new block
            else:
                i += 1                # continue the current block
        return out
For a multivariate panel, the same indices are used across all
series in a block — this preserves contemporaneous dependence by
construction. Trying to bootstrap each series independently destroys
the cross-sectional structure and is almost never what you want.
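The point about shared indices is easy to demonstrate. The sketch below is a moving-block bootstrap for a `(T, d)` panel; the two-series toy panel with a common driver is an assumption chosen so the effect on correlation is visible.

```python
import numpy as np

def moving_block_bootstrap(panel, block_len, rng):
    """One moving-block path for a (T, d) panel; whole rows are resampled jointly."""
    T = panel.shape[0]
    n_blocks = int(np.ceil(T / block_len))
    starts = rng.integers(0, T - block_len + 1, size=n_blocks)
    idx = (starts[:, None] + np.arange(block_len)[None, :]).ravel()[:T]
    return panel[idx], idx

rng = np.random.default_rng(0)
common = rng.standard_normal(2000)
panel = np.column_stack([common + 0.3 * rng.standard_normal(2000),
                         common + 0.3 * rng.standard_normal(2000)])

joint, idx = moving_block_bootstrap(panel, block_len=20, rng=rng)

# Anti-pattern, for contrast: each column gets its own block indices.
separate = np.column_stack(
    [moving_block_bootstrap(panel[:, [j]], 20, rng)[0][:, 0] for j in range(2)]
)

corr_real = np.corrcoef(panel, rowvar=False)[0, 1]
corr_joint = np.corrcoef(joint, rowvar=False)[0, 1]
corr_separate = np.corrcoef(separate, rowvar=False)[0, 1]
```

The jointly resampled panel keeps the cross-sectional correlation of the original, while the column-by-column version drives it towards zero.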
Block bootstrap of innovations
A useful refinement: rather than bootstrapping the raw series,
bootstrap the standardised residuals of a fitted model and
re-introduce the model's mean and variance dynamics.
Fit a model (AR / ARMA / GARCH) to the data.
Standardise residuals $\hat\varepsilon_t / \hat\sigma_t$.
Block-bootstrap the standardised residuals.
Plug the bootstrapped innovations back into the model's recursion
to produce a synthetic path.
The benefit: the synthetic path has the model's mean and variance
dynamics (which capture volatility clustering and seasonality)
combined with the empirical shape of the innovations (which
captures the heavy tails). For finance applications this often beats
both pure block bootstrap and pure parametric Monte Carlo on
downstream evaluation.
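The four steps can be sketched with the simplest possible model. The example below assumes an AR(1) fitted by least squares with a constant residual scale; with a GARCH fit the standardisation would use the fitted $\hat\sigma_t$ path instead, and that part is not shown.

```python
import numpy as np

def fit_ar1(x):
    """Least-squares AR(1): x_t = c + phi * x_{t-1} + residual."""
    x0, x1 = x[:-1], x[1:]
    d0 = x0 - x0.mean()
    phi = np.dot(d0, x1 - x1.mean()) / np.dot(d0, d0)
    c = x1.mean() - phi * x0.mean()
    return c, phi, x1 - (c + phi * x0)

def residual_block_bootstrap(x, block_len, rng):
    """Fit, standardise residuals, block-bootstrap them, re-run the recursion."""
    c, phi, resid = fit_ar1(x)
    sigma = resid.std()
    z = (resid - resid.mean()) / sigma                       # standardised residuals
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, len(z) - block_len + 1, size=n_blocks)
    z_star = np.concatenate([z[s:s + block_len] for s in starts])[:n]
    path = np.empty(n)
    prev = x[0]
    for t in range(n):
        prev = c + phi * prev + sigma * z_star[t]            # model dynamics + empirical shocks
        path[t] = prev
    return path

rng = np.random.default_rng(0)
x = np.zeros(3000)
for t in range(1, 3000):
    x[t] = 0.6 * x[t - 1] + rng.standard_t(4)                # heavy-tailed innovations
path = residual_block_bootstrap(x, block_len=10, rng=rng)
```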
Copula-based scenario generation
Copulas decouple margins from dependence, which is the right
decomposition for many financial stress problems.
The setup: for a $d$-dimensional return vector with margins
$F_1, \dots, F_d$ and joint CDF $F$, Sklar's theorem says
there is a copula $C$ with
\[ F(x_1, \dots, x_d) = C(F_1(x_1), \dots, F_d(x_d)), \]
and $C$ is unique when the margins are continuous. The recipe:
Fit margins separately. For finance, the right choice for tails
is EVT-GPD above a threshold, empirical CDF below it.
Transform observations to uniforms via $u_i = F_i(x_i)$.
Fit a copula on the uniforms.
To sample: draw $u \sim C$, then
$x_i = F_i^{-1}(u_i)$.
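An end-to-end sketch of the recipe, under two simplifying assumptions that are not part of the text above: purely empirical margins (the EVT-GPD tail fit is left out) and a Gaussian copula fitted on normal scores. The function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def to_uniforms(X):
    """Rank-based pseudo-observations u = rank / (n + 1), column by column."""
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    return ranks / (n + 1.0)

def sample_gaussian_copula_scenarios(X, n_samples, rng):
    U = to_uniforms(X)                                   # step 2: margins -> uniforms
    R = np.corrcoef(norm.ppf(U), rowvar=False)           # step 3: fit the copula
    Z = rng.standard_normal((n_samples, X.shape[1])) @ np.linalg.cholesky(R).T
    U_new = norm.cdf(Z)                                  # step 4a: draw u ~ C
    # step 4b: invert each margin with its empirical quantile function
    return np.column_stack(
        [np.quantile(X[:, j], U_new[:, j]) for j in range(X.shape[1])]
    )

rng = np.random.default_rng(0)
L = np.linalg.cholesky(np.array([[1.0, 0.7], [0.7, 1.0]]))
X = (rng.standard_normal((2000, 2)) @ L.T) ** 3          # heavy-tailed toy margins
scen = sample_gaussian_copula_scenarios(X, 2000, rng)

rank_corr_real = np.corrcoef(to_uniforms(X), rowvar=False)[0, 1]
rank_corr_scen = np.corrcoef(to_uniforms(scen), rowvar=False)[0, 1]
```

With empirical margins the scenarios never leave the observed range of each series; extending beyond it is the job of the parametric tail fit in step 1.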
Three copula choices that recur:
Gaussian copula. Simple, fast, no tail dependence. Convenient
default; a known understatement of joint extremes.
Student-t copula. Same parameterisation as Gaussian plus
degrees of freedom; admits symmetric tail dependence. The
default when joint extremes matter and computational cost is a
concern.
Vine copula. Decomposes the joint into a tree of pair
copulas, each chosen separately. Captures asymmetric and
heterogeneous dependence; expensive to fit but the most flexible
classical option.
    import numpy as np
    from scipy.stats import norm, t as student_t

    def fit_t_copula(U: np.ndarray) -> tuple[np.ndarray, float]:
        """Fit a Student-t copula to uniform samples U (shape (n, d))."""
        Z = norm.ppf(U)                      # to standard normal
        R = np.corrcoef(Z, rowvar=False)     # correlation matrix
        # Estimate df by maximum likelihood (omitted: use scipy.optimize)
        nu = 6.0                             # placeholder
        return R, nu

    def sample_t_copula(R: np.ndarray, nu: float, n: int,
                        rng: np.random.Generator) -> np.ndarray:
        """Draw n samples from a Student-t copula."""
        d = R.shape[0]
        L = np.linalg.cholesky(R)
        # standard multivariate-t: correlated normals over sqrt(chi2 / nu)
        g = rng.chisquare(nu, size=(n, 1)) / nu
        z = rng.standard_normal(size=(n, d)) @ L.T
        t = z / np.sqrt(g)
        # to uniforms
        return student_t(df=nu).cdf(t)
Hybrid: bootstrap dependence + parametric tails
A pattern that delivers most of the realism of a deep generator
with much less compute and fewer modelling assumptions:
Empirical CDF for the margins, except the tails (top and
bottom 5%), which are fit with GPD.
Block-bootstrap the rank-uniformised observations to
preserve serial dependence in the dependence structure.
Invert each margin (empirical or GPD) to produce the synthetic
path.
Combines empirical dependence + temporal dependence + parametric
tails. Strong default for stress generation when the auditor wants
"all components defensible".
Validation for resampled / copula-generated paths
The four-leg protocol from Section 02-05 / 10-01 applies, with a
cross-sectional dependence leg in place of the separate tail leg:
Marginal fit. KS or energy distance on each series. Should be
near-perfect for non-tail bootstraps; check tail under-coverage
if EVT margins are used.
Autocorrelation. ACF of returns and squared returns. Block
bootstrap should reproduce the empirical ACF on average; if it
does not, the block length is wrong.
Cross-sectional dependence. Pearson and Kendall τ
matrices on synthetic vs. real. Joint stress (correlation in the
tail) is what copula methods can preserve and IID bootstrap
cannot.
Downstream backtest. Run the policies of Chapter 5 on the
synthetic data. Under bootstrap, the Sharpe / drawdown distribution
should match the realised distribution; under tail-conditioned copula
sampling, the same distributions show how the policy behaves under stress.
When deep generators are the better answer
These classical methods are appropriate when the synthetic data must
reproduce the historical distribution. They are not appropriate
when the data must be plausible but unseen — a regime that the
historical record does not contain. Diffusion (10-01), GAN (10-02),
and VAE (10-03) all fit a model to the historical distribution and can
sample beyond the observed record; the classical methods here cannot
extrapolate, only resample.
A working rule:
| Use case | First tool |
| --- | --- |
| Backtest variance | Block bootstrap |
| Compliance-friendly stress | Block bootstrap + EVT |
| Tail-aware joint stress | Hybrid copula + EVT margins |
| Generate unseen regimes | Diffusion (10-01) |
| Factor-aware counterfactual | Factor-VAE (10-03) |
| Tabular cross-section augmentation | CTGAN (10-02) |
What this section adds
The deep generators get the press; the classical methods get the
production work in regulated environments. Section 10-05 closes the
chapter with the evaluation and privacy discipline that applies to
both classes of generator.
Evaluation and Privacy
A synthetic-data pipeline that lacks an evaluation contract is a
liability — it produces output that looks plausible without any
guarantee it preserves the structure downstream consumers depend
on. This section pulls the evaluation discipline scattered across
10-01 to 10-04 into a single workflow, then layers on the privacy
considerations that matter when the source data is sensitive or
client-level.
The four-leg evaluation protocol
Apply this to every generator the book builds, regardless of class.
Marginal fit. Per-series mean, std, skew, kurtosis on
synthetic vs. real. Two-sample tests (Kolmogorov–Smirnov, energy
distance) on returns. A generator that fails the marginal fit
cannot be useful for any downstream task — fix it before anything
else.
Autocorrelation match. ACF of returns and squared / absolute
returns. The squared-returns ACF is the main carrier of volatility
clustering, and a generator that breaks it is unusable for risk
modelling.
Tail coverage. Empirical VaR and ES at α∈{1%,0.5%,0.1%}
on synthetic and real paths. Tail under-coverage is the failure
mode that quietly breaks stress tests; the diagnostic is a ratio
plot of synthetic / real tail estimates.
Downstream test. The decisive evaluation. Train the forecaster
of Chapter 4 or the policy of Chapter 5 on the synthetic panel;
evaluate on the real held-out panel. Performance in the same band
as training-on-real means the generator preserved the right
structure; a collapse means it leaked spurious correlations.
A generator that scores well on marginals and ACF can still fail on
the downstream test — that is precisely why the downstream test is
the one that actually matters. Run it on every generator before
production deploy.
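The tail leg reduces to a handful of ratios. In the sketch below a Student-t sample stands in for real returns and a variance-matched Gaussian for a generator with thin tails; both are assumptions chosen to make the under-coverage visible.

```python
import numpy as np

def var_es(returns, alpha):
    """Empirical VaR and ES at level alpha, reported as positive losses."""
    q = np.quantile(returns, alpha)
    tail = returns[returns <= q]
    return -q, -tail.mean()

def tail_coverage_ratios(real, fake, alphas=(0.01, 0.005, 0.001)):
    """Synthetic / real ratios of (VaR, ES); values well below 1 mean tail under-coverage."""
    out = {}
    for a in alphas:
        var_r, es_r = var_es(real, a)
        var_f, es_f = var_es(fake, a)
        out[a] = (var_f / var_r, es_f / es_r)
    return out

rng = np.random.default_rng(0)
real = rng.standard_t(3, size=200_000)               # heavy-tailed stand-in for real returns
fake = real.std() * rng.standard_normal(200_000)     # same scale, Gaussian tails
ratios = tail_coverage_ratios(real, fake)
```

Plotting the ratios against $\alpha$ gives the diagnostic described in the tail-coverage leg: a curve that drifts below 1 as $\alpha$ shrinks.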
Mode coverage and diversity
For models prone to mode collapse (GANs especially, also low-rank
VAEs), an additional check:
Cluster the real data into K modes via k-means or GMM on
per-window features (mean return, realised vol, skew).
Compute the mode occupancy of synthetic samples.
Compare. Strong under-coverage of any mode means the generator
has missed a regime.
A total-variation distance between the real and synthetic mode-occupancy
vectors above 0.15 is a red flag; below 0.05 is healthy.
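A NumPy-only sketch of the check. The small Lloyd's-algorithm `kmeans` is a stand-in for a library implementation, and the 2-D Gaussian blobs are placeholders for the per-window features (mean return, realised vol, skew).

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest centre for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm with random initial centres."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def occupancy(X, centers):
    """Share of samples falling in each mode."""
    return np.bincount(assign(X, centers), minlength=len(centers)) / len(X)

def total_variation(p, q):
    return 0.5 * float(np.abs(p - q).sum())

rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])

def draw(counts):
    return np.vstack([m + rng.standard_normal((n, 2)) for m, n in zip(means, counts)])

real = draw([1000, 1000, 1000])
fake_ok = draw([1000, 1000, 1000])
fake_collapsed = draw([1500, 1500, 0])            # generator that never visits the third regime

centers = kmeans(real, k=3, rng=rng)              # modes are defined on real data only
p_real = occupancy(real, centers)
tv_ok = total_variation(p_real, occupancy(fake_ok, centers))
tv_collapsed = total_variation(p_real, occupancy(fake_collapsed, centers))
```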
Sample-level utility / disclosure metrics
For a synthetic dataset that will be released externally (to
collaborators, regulators, or the public), the literature on
synthetic-data evaluation distinguishes utility — how well the
synthetic data reproduces statistical properties — from disclosure
risk — how much information about specific real records leaks
through synthetic samples \citep{drechsler2024synthpop}.
Per-sample diagnostics that recur:
Nearest-neighbour distance ratio (NNDR). For each synthetic
sample, compute the ratio of its distance to the nearest real
sample over its distance to the second-nearest real sample. A
ratio near 1 indicates the synthetic sample is "between" two real
ones (likely safe); a ratio near 0 indicates it is almost a copy
of a single real sample (likely memorised).
Membership inference attack success rate. Train a classifier
to distinguish "in training" from "not in training" records given the
synthetic data. If its accuracy on a balanced set of members and
non-members exceeds chance by more than a tolerance, that is
50%+ϵ, the generator has memorised parts of the training set.
Authenticity score. Held-out real samples should be more
similar to training-real samples than synthetic samples are, on
average. If synthetic samples are more similar to specific
training samples than other reals are, that's memorisation.
For purely market-data applications (no personally identifiable
data), these are mostly evaluation concerns rather than privacy
ones — but if the synthetic dataset will be redistributed, the
checks become load-bearing.
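Of the diagnostics above, the nearest-neighbour distance ratio is the simplest to compute. The sketch below uses a dense pairwise Euclidean distance matrix in NumPy, which is an assumption suited to small audits only; for larger sets a tree- or index-based neighbour search would be the natural substitute. The function name `nndr` is illustrative.

```python
import numpy as np


def nndr(real: np.ndarray, fake: np.ndarray) -> np.ndarray:
    """Per-synthetic-sample ratio d1 / d2 of distances to the two nearest real samples."""
    # Dense (n_fake, n_real) distance matrix; memory grows as n_fake * n_real * dim.
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    # After partitioning at index 1, columns 0 and 1 hold the two smallest distances in order.
    two = np.partition(d, 1, axis=1)[:, :2]
    return two[:, 0] / np.maximum(two[:, 1], 1e-12)
```

A useful baseline is the same ratio computed for held-out real samples against the training set: synthetic ratios that sit well below that baseline point toward copying.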
Memorisation: detection and mitigation
Memorisation is the failure mode where the generator emits training
samples verbatim or near-verbatim. It can affect every generator
class when the training set is small or repetitive.
Detection (cheap, run after every training run):
import numpy as np
from sklearn.neighbors import NearestNeighbors

def memorisation_audit(real: np.ndarray, fake: np.ndarray, k: int = 2) -> np.ndarray:
    """Return per-fake-sample distance to nearest real sample."""
    nn = NearestNeighbors(n_neighbors=k).fit(real)
    dists, _ = nn.kneighbors(fake)
    return dists[:, 0]  # nearest real distance
Plot the histogram of nearest-neighbour distances for synthetic
samples; a heavy spike near zero is the symptom.
Mitigation:
More data. The single most effective fix; memorisation is a
data-scarcity symptom.
Stronger regularisation. Higher dropout, lower model capacity,
shorter training.
Differential privacy training. DP-SGD adds calibrated noise to
per-sample gradients. Slower to converge, lower-quality samples,
but provides formal privacy guarantees. Appropriate when the
training data contains client-level information.
Do not skip the audit. A generator that memorises is unusable for
any downstream task that relies on unseen samples, and near-copies
of training records cannot safely be treated as synthetic for
licensing or privacy purposes.
Differential privacy in brief
For applications where synthetic data must protect specific records
in the training set, the formal framework is differential privacy.
A generator M is (ε,δ)-differentially
private if, for any two datasets D, D′ differing in one record and
any set of outputs S,
Pr[M(D) ∈ S] ≤ e^{ε} · Pr[M(D′) ∈ S] + δ.
A smaller ε means stronger privacy. The standard mechanism is
DP-SGD: clip per-sample gradients to a max norm, add Gaussian
noise calibrated to the privacy budget, accumulate the privacy loss
across training. Practical libraries (Opacus for PyTorch) implement
the bookkeeping.
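To make the clip-then-noise mechanics concrete, here is a single DP-SGD update written in plain NumPy. It is a sketch of the update rule only: the privacy accounting that a library such as Opacus performs is deliberately omitted, and `dp_sgd_step`, `max_norm`, and `noise_multiplier` are names chosen for this example. Per-sample gradients are assumed to arrive as rows of a 2-D array.

```python
import numpy as np


def dp_sgd_step(params, per_sample_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each per-sample gradient, sum, add Gaussian noise, average."""
    grads = np.asarray(per_sample_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds max_norm; leave the rest untouched.
    clipped = grads * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    # Noise standard deviation is proportional to the clipping norm (the sensitivity).
    noise = rng.normal(0.0, noise_multiplier * max_norm, size=np.shape(params))
    noisy_mean = (clipped.sum(axis=0) + noise) / len(grads)
    return np.asarray(params, dtype=float) - lr * noisy_mean
```

Clipping bounds how much any single record can move the parameters, and the noise scale is tied to that bound; the resulting ε then depends on the noise multiplier, the sampling rate, and the number of steps, which is the part the accountant tracks.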
DP-SGD imposes a real cost on sample quality; usable models
typically sit at ε∈[1,10]. Below 1 the samples
degrade noticeably; above 10 the privacy guarantee becomes
too weak to be meaningful. For market-data applications without per-client data,
the cost rarely justifies the use; for applications involving
client-level positions or PII, DP-SGD is the responsible default.
Reproducibility and provenance
Every synthetic dataset that touches a downstream pipeline must
carry:
Seeds. Every PRNG seed used to generate the dataset.
Generator version. Model architecture commit hash, training
config, training-data hash.
Conditioning metadata. If the dataset was generated under a
specific regime, calendar, or text prompt, log it.
Validation report. The four-leg protocol's results and the
memorisation audit, embedded as a manifest.
Without this, a downstream Sharpe number computed on synthetic data
is not reproducible — and not defensible.
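One way to carry these four items is a small manifest serialised next to the dataset. The sketch below uses only the standard library; the class name `SyntheticManifest`, its field names, and the JSON layout are assumptions for illustration rather than a prescribed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SyntheticManifest:
    seeds: list                 # every PRNG seed used during generation
    generator_commit: str       # commit hash of the model architecture
    training_config: dict       # hyperparameters used to fit the generator
    training_data_sha256: str   # hash of the training data
    conditioning: dict          # regime, calendar, or prompt metadata
    validation_report: dict     # validation and memorisation-audit results


def sha256_of(data: bytes) -> str:
    """Hex digest used to fingerprint the training data."""
    return hashlib.sha256(data).hexdigest()


def to_json(manifest: SyntheticManifest) -> str:
    """Deterministic JSON so identical manifests serialise identically."""
    return json.dumps(asdict(manifest), sort_keys=True, indent=2)
```

Sorting keys keeps the serialised manifest stable, so the manifest itself can be hashed and referenced from downstream results.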
Validation gate. Synthetic data does not enter a downstream
pipeline until the validation report passes thresholds the team
has agreed to. The gate is automated: a failing dataset is
regenerated or blocked.
Mix-don't-replace. Train downstream models on a mix of real
and synthetic data (50/50 default). Pure-synthetic training is the
failure mode where the model learns the generator's idiosyncrasies
rather than market structure.
Refresh on regime shift. When the live regime moves, the
generator trained on the old regime starts producing scenarios
that look implausible. Re-fit on a sliding window — quarterly is a
reasonable default for daily macro/equity panels.
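The validation gate in the list above reduces to a threshold check over the validation report. The following is a minimal sketch: the function name `validation_gate`, the metric keys, and the limits in the usage note are hypothetical, apart from the 0.15 total-variation ceiling taken from the mode-coverage check.

```python
def validation_gate(report: dict, limits: dict) -> list:
    """Return the names of failed checks; an empty list means the dataset may proceed.

    `limits` maps a metric name to an inclusive (low, high) band. A metric that is
    missing from the report, or is NaN, counts as a failure.
    """
    failures = []
    for name, (low, high) in limits.items():
        value = report.get(name)
        if value is None or not (low <= value <= high):
            failures.append(name)
    return failures
```

For example, `validation_gate({"mode_tv": 0.2}, {"mode_tv": (0.0, 0.15)})` returns `["mode_tv"]`, and the pipeline would then regenerate or block the dataset. Treating a missing metric as a failure keeps an incomplete report from passing silently.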
What this section adds
Sections 10-01 through 10-04 cover the generators. This section is
the discipline that turns generated samples into trustworthy inputs
for the rest of the book — Chapter 4's forecasters, Chapter 5's
policies, Chapter 9's fine-tuning evaluation. The synthetic-data
loop closes here: generator → validate → mix into downstream pipeline
→ measure improvement in real performance → re-train generator on
fresh data when needed.
The book finishes with this loop because it is what makes the rest
of the book honest. A forecaster trained only on real history is a
forecaster that has seen one realisation of the world; a forecaster
trained on real history plus validated synthetic counterfactuals
is one that has seen many. The synthetic loop is the difference
between a model that fits and a system that survives.