Chapter 10

Synthetic Data

"What is real? How do you define real?" — Morpheus, The Matrix (1999)

For a stress test the answer is: real enough that a policy's risk profile is recognisably the same on real and synthetic data. Real market data is finite, biased toward bull regimes, and stingy with exactly the tail events we most need to stress-test against. Synthetic data is how we fill those gaps in a disciplined way: train a generative model on the historical record, sample from it under controlled conditioning, and use the resulting paths for stress testing, policy evaluation, and pretraining. The chapters before this one assumed the synthetic loop existed; this chapter is where it finally gets built.

Why generate, not only resample

Bootstrap and block-bootstrap methods preserve marginals exactly and are easy to defend, but they cannot produce scenarios that never happened. A generative model can. The trade-off is that generation introduces a new layer of modelling assumptions — the generator itself — that have to be evaluated.

Three modes the book uses

  • Stress testing. Generate paths that lie deliberately in the tail (USD spike, credit blow-out, vol regime that the training set never saw) and check that the policies of Chapter 5 stay within their risk limits.
  • Policy evaluation. Replace held-out real trajectories with held-out synthetic ones drawn from the same data-generating process (DGP) to estimate the variance of an estimator without burning real out-of-sample data.
  • Pretraining and augmentation. When a target dataset is small (a niche fund, an emerging asset class), augment with synthetic samples that share the relevant statistical structure, then fine-tune on the real data.

What this chapter covers

  • Diffusion as the modern default — forward / reverse processes, training loss, conditioning patterns, sampling speed.
  • GAN-based generators (CTGAN, TimeGAN) for the cases where GANs still lead — tabular cross-sections and faster sampling.
  • VAE-based generators for factor-aware counterfactual generation, the generative sibling of Chapter 6's identifiable dynamics.
  • Bootstrap and copula synthesis — the classical, defensible toolkit for compliance-friendly stress generation.
  • Evaluation and privacy — the four-leg validation protocol, memorisation audits, differential privacy, and the reproducibility manifest every synthetic dataset needs.

Contents

  • Diffusion — denoising diffusion as the workhorse generator for time-series synthesis, with the conditioning patterns that make stress testing operational.
  • GAN-Based Generators — CTGAN, TimeGAN, and the strengths and failure modes (mode collapse, tail under-fit) of the adversarial family on financial data.
  • VAE-Based Generators — conditional VAEs and factor-VAEs, the generative bridge from Chapter 6's identifiable dynamics into synthetic counterfactual paths.
  • Bootstrap and Copula Synthesis — block bootstrap, copula-based scenarios, and hybrid recipes for compliance-friendly stress.
  • Evaluation and Privacy — the four-leg validation protocol, memorisation detection, DP-SGD, and the manifest schema that makes synthetic datasets reproducible.

Diffusion

Denoising diffusion has become the dominant generative method for time series, displacing both VAEs (which struggle with sharp tails) and GANs (which are unstable and mode-drop on heterogeneous panels). The recent literature converges on diffusion as the right substrate for probabilistic time-series generation \citep{su2024diffusion-survey}: it accepts rich conditioning, captures multimodal distributions, and produces tails that match the training distribution rather than collapsing to a Gaussian approximation. This section covers the core algorithm, the variants that matter for finance, and the evaluation protocol that decides whether a generator is good enough to feed back into Chapters 4–6.

Forward and reverse processes

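The forward process progressively adds Gaussian noise to a real sample $x_0$ across steps $t = 1, \dots, T$,

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right),$$

with a noise schedule $\beta_1, \dots, \beta_T \in (0, 1)$. A useful closed form follows by composition: writing $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$,

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right),$$

so we can sample directly from $q(x_t \mid x_0)$ at any $t$.

The reverse process is parameterised as

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right).$$

In the denoising-diffusion probabilistic model (DDPM) parameterisation \citep{ho2020ddpm} we predict the noise that was added, $\epsilon_\theta(x_t, t)$, and recover the mean from the closed-form posterior. The training loss reduces to a denoising MSE:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0, t, \epsilon}\left[\bigl\lVert \epsilon - \epsilon_\theta\bigl(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\ t\bigr) \bigr\rVert^2\right], \qquad \epsilon \sim \mathcal{N}(0, I).$$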

import torch
 
def diffusion_loss(model, x0, alpha_bar, T):
    # Timesteps are 0-indexed to match alpha_bar (length T), so draw from [0, T).
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    abar = alpha_bar.to(x0.device)[t][:, None, None]   # (B, 1, 1) for (B, L, C) input
    noise = torch.randn_like(x0)
    xt = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * noise   # draw from q(x_t | x_0)
    pred = model(xt, t)                                # predicted noise, same shape as x0
    return torch.nn.functional.mse_loss(pred, noise)

A complete (small) training loop for a 1-D denoiser on standardised returns:

import torch, torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
 
T = 200
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)
 
class Denoiser(nn.Module):
    def __init__(self, c=1, hidden=64):
        super().__init__()
        self.t_embed = nn.Embedding(T, hidden)
        self.net = nn.Sequential(
            nn.Conv1d(c + hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv1d(hidden, c, 1),
        )
    def forward(self, x, t):                       # x: (B, L, 1), t: (B,)
        emb = self.t_embed(t)[:, :, None].expand(-1, -1, x.shape[1])
        h = torch.cat([x.transpose(1, 2), emb], dim=1)
        return self.net(h).transpose(1, 2)
 
# windows: (N, L, 1) of standardised returns
windows = torch.randn(2_000, 64, 1)                # placeholder for real data
loader  = DataLoader(TensorDataset(windows), batch_size=64, shuffle=True)
 
model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
    for (x0,) in loader:
        loss = diffusion_loss(model, x0, alpha_bar, T)
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"epoch {epoch}  loss {loss.item():.4f}")

The denoiser, schedule, and loss are all that is needed to start. A serious implementation swaps the toy denoiser for a transformer or state-space backbone (see "Backbone choices" below) and trains on real return windows; the loss form is unchanged.
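
Once a noise predictor is trained, the step "recover the mean from the closed-form posterior" can be sketched as an ancestral sampler. This is a minimal illustration under stated assumptions, not a reference implementation: ddpm_sample and stand_in are hypothetical names, timesteps are 0-indexed, the reverse variance is set to beta_t (one of the two choices discussed in the DDPM paper), and the zero-noise stand-in exists only so the snippet runs without a trained model.

```python
import torch

def ddpm_sample(model, shape, betas, generator=None):
    """Ancestral sampling: start from pure noise and apply the reverse step once per timestep."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, generator=generator)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        with torch.no_grad():
            eps = model(x, t_batch)                       # predicted noise
        # Posterior mean expressed through the predicted noise.
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            # Assumption: reverse variance sigma_t^2 = beta_t.
            x = mean + torch.sqrt(betas[t]) * torch.randn(shape, generator=generator)
        else:
            x = mean                                      # no noise on the final step
    return x

# Hypothetical stand-in for a trained denoiser: always predicts zero noise.
stand_in = lambda x, t: torch.zeros_like(x)
betas = torch.linspace(1e-4, 0.02, 200)
paths = ddpm_sample(stand_in, (4, 64, 1), betas,
                    generator=torch.Generator().manual_seed(0))
paths_again = ddpm_sample(stand_in, (4, 64, 1), betas,
                          generator=torch.Generator().manual_seed(0))
```

Passing an explicit generator keeps each batch of paths reproducible from its seed.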

Backbone choices

The denoiser network does most of the work. Three backbones cover the territory:

  • 1-D U-Net. The original DDPM backbone, transplanted from images. Works well on regular univariate series but does not scale gracefully to many channels.
  • Transformer / DiT. Treats time as a sequence of patches with attention. The default for multivariate panels and long sequences; conditioning is plumbed through cross-attention or AdaLN.
  • State-space backbones (Mamba/S4). Recently competitive on long series, with much lower compute at long context than transformers. A reasonable choice when sequences are very long (intraday, tick-level).

Empirically, transformer backbones with patch tokenisation produce the best stability vs. quality trade-off on equity panels and macro series in the 100–1000 step range — the regime most finance applications live in.

Conditioning

Unconditional generation rarely answers the question we actually have. Useful conditioners for finance:

  • Macro regime. Discrete labels (recession / expansion, high-vol / low-vol) injected via embedding lookup. Lets you sample stress scenarios on demand.
  • Calendar features. Day-of-week, day-of-month, time-to-earnings. Without these, generators produce series that ignore microstructural seasonality.
  • Continuous covariates. Yield-curve slope, credit spread, realised-volatility level. Inject via FiLM or AdaLN.
  • Text. Multimodal conditional diffusion \citep{su2025multimodal} conditions the generator on textual scenario descriptions ("a 10% S&P drawdown driven by tech earnings"). This is where Chapter 8's agents become useful — they author the conditioning text from a scenario library.

A FiLM-style conditioning layer that handles the most common case (scalar regime label + continuous covariate vector) in the denoiser:

import torch
import torch.nn as nn
 
class FiLM(nn.Module):
    """Feature-wise linear modulation: y = gamma(c) * x + beta(c)."""
 
    def __init__(self, c_dim: int, hidden: int):
        super().__init__()
        self.gamma = nn.Linear(c_dim, hidden)
        self.beta  = nn.Linear(c_dim, hidden)
 
    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (B, hidden, L), c: (B, c_dim)
        gamma = self.gamma(c).unsqueeze(-1)               # (B, hidden, 1)
        beta  = self.beta(c).unsqueeze(-1)
        return (1 + gamma) * x + beta
 
 
class ConditionedDenoiser(nn.Module):
    def __init__(self, in_ch=1, hidden=64, n_regimes=4, cont_dim=3):
        super().__init__()
        self.regime_emb = nn.Embedding(n_regimes, hidden // 2)
        self.cont_proj  = nn.Linear(cont_dim, hidden // 2)
        self.t_embed    = nn.Embedding(200, hidden)
        self.in_proj    = nn.Conv1d(in_ch, hidden, 3, padding=1)
        self.film1      = FiLM(c_dim=hidden, hidden=hidden)
        self.conv       = nn.Conv1d(hidden, hidden, 3, padding=1)
        self.film2      = FiLM(c_dim=hidden, hidden=hidden)
        self.out_proj   = nn.Conv1d(hidden, in_ch, 1)
 
    def forward(self, x, t, regime, cont):
        # x: (B, L, in_ch), regime: (B,), cont: (B, cont_dim)
        c = torch.cat([self.regime_emb(regime), self.cont_proj(cont)], dim=-1)
        c = c + self.t_embed(t)
        h = nn.functional.gelu(self.in_proj(x.transpose(1, 2)))
        h = self.film1(h, c)
        h = nn.functional.gelu(self.conv(h))
        h = self.film2(h, c)
        return self.out_proj(h).transpose(1, 2)

Two design choices worth noting:

  • Concatenated conditioner. Discrete regime + continuous covariates fold into one vector before FiLM. Lets you condition on either alone (zero-pad the missing component) or both.
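  • (1 + gamma) scaling. The standard FiLM trick: if the gamma and beta projections are zero-initialised, the layer starts as the identity, so the conditioner doesn't disrupt training before it has signal to contribute. The listing above keeps PyTorch's default nn.Linear initialisation, under which gamma and beta start small but non-zero; zero the weights and biases of both projections if an exact identity start is wanted.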

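The dominant training-time conditioning trick is classifier-free guidance \citep{ho2022cfg}: train with the conditioner randomly dropped (probability ~0.1), then at sample time interpolate between the conditional and unconditional model with a guidance scale $w$,

$$\hat\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\bigr),$$

where $\varnothing$ denotes the dropped conditioner, so $w = 1$ recovers the plain conditional model. Larger $w$ produces samples that adhere more strongly to the conditioning at the cost of diversity; finance applications tend to use .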
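
Both halves of classifier-free guidance are small enough to sketch. The snippet assumes a convention that is not fixed by the text: the guided prediction is the unconditional prediction plus the guidance scale times the gap between the conditional and unconditional predictions, so a scale of 1 recovers the conditional model. The helper names, the reserved null label, and the toy model that simply echoes its regime label are all hypothetical; a real denoiser would need one extra row in its regime embedding for the null label.

```python
import torch

def drop_condition(regime, null_id, p_drop=0.1, generator=None):
    """Training side: swap each regime label for a reserved null label with probability p_drop."""
    mask = torch.rand(regime.shape, generator=generator) < p_drop
    return torch.where(mask, torch.full_like(regime, null_id), regime)

def guided_eps(model, x, t, regime, null_id, w):
    """Sampling side: combine conditional and unconditional noise predictions."""
    eps_cond = model(x, t, regime)
    eps_uncond = model(x, t, torch.full_like(regime, null_id))
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy model for illustration only: predicts the regime label as a constant.
toy = lambda x, t, r: torch.zeros_like(x) + r.float()[:, None, None]

NULL = 4                                   # assumption: labels 0..3 are real regimes
x = torch.zeros(3, 8, 1)
t = torch.zeros(3, dtype=torch.long)
regime = torch.tensor([0, 1, 2])

eps_w1 = guided_eps(toy, x, t, regime, NULL, w=1.0)   # plain conditional prediction
eps_w0 = guided_eps(toy, x, t, regime, NULL, w=0.0)   # plain unconditional prediction
dropped = drop_condition(torch.zeros(10_000, dtype=torch.long), NULL, 0.1,
                         generator=torch.Generator().manual_seed(0))
```

Each guided step costs two forward passes; batching the conditional and unconditional inputs together is the usual way to hide that cost.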

Sampling: speed vs. quality

DDPM sampling needs one network call for each of the $T$ training steps, which is slow. Three accelerations:

  • DDIM \citep{song2021ddim}. A deterministic re-parameterisation that lets you skip ahead in fewer steps (50 steps is typical). Quality is marginally lower than full-length DDPM sampling; usually worth it.
  • Consistency models / progressive distillation. Train a "shortcut" model to invert the entire diffusion in 1–4 steps. Tens of times faster; quality depends on the distillation budget.
  • Solver-based methods (DPM-Solver). Treat the reverse process as an ODE and use higher-order ODE solvers; 10–20 steps with tight quality.

For finance we mostly care about batch sampling rather than real-time, so DDPM with DDIM acceleration is usually sufficient.
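
A deterministic DDIM sampler (the eta = 0 case) fits in a few lines. The sketch below is illustrative: ddim_sample and oracle are hypothetical names, and the evenly spaced timestep subsequence is one common choice rather than the only one. The oracle is the exact noise predictor for a dataset whose every sample is zero, which makes the result easy to check: the sampler should return zeros.

```python
import torch

def ddim_sample(model, shape, alpha_bar, n_steps=50, generator=None):
    """Deterministic DDIM sampling over an evenly spaced subsequence of timesteps."""
    T = len(alpha_bar)
    steps = torch.linspace(T - 1, 0, n_steps).round().long()   # T-1 down to 0
    x = torch.randn(shape, generator=generator)
    for i, t in enumerate(steps):
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        with torch.no_grad():
            eps = model(x, t_batch)
        abar_t = alpha_bar[t]
        # Clean sample implied by the current noise estimate.
        x0_hat = (x - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
        if i + 1 < n_steps:
            abar_prev = alpha_bar[steps[i + 1]]
            # Jump to the next (earlier) timestep without adding fresh noise.
            x = torch.sqrt(abar_prev) * x0_hat + torch.sqrt(1 - abar_prev) * eps
        else:
            x = x0_hat
    return x

betas = torch.linspace(1e-4, 0.02, 200)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# If x_0 is always zero then x_t = sqrt(1 - abar_t) * eps, so eps = x_t / sqrt(1 - abar_t).
oracle = lambda x, t: x / torch.sqrt(1 - alpha_bar[t])[:, None, None]
samples = ddim_sample(oracle, (4, 64, 1), alpha_bar, n_steps=50,
                      generator=torch.Generator().manual_seed(0))
```

Here 50 network calls replace the 200 that a full ancestral pass over this schedule would need.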

Evaluation: more than visual inspection

A diffusion generator that looks right can still mis-fit the parts that matter for finance. The evaluation protocol has four legs:

  • Marginal-fit. Per-series mean, std, skewness, kurtosis on synthetic vs. real. Two-sample tests (Kolmogorov–Smirnov, energy distance) on marginal returns.
  • Autocorrelation and clustering. Compare ACF of returns, ACF of absolute / squared returns. A generator that breaks volatility clustering is unusable for risk modelling.
  • Tail coverage. Estimate VaR and ES at 1%, 0.5%, 0.1% on synthetic and real. Tail under-coverage is the failure mode that quietly breaks stress tests.
  • Downstream performance. The test that matters: train the forecaster of Chapter 4 or the policy of Chapter 5 on synthetic data; evaluate on real held-out data. If performance is in the same band as training on real-only, the generator is preserving the right structure. If it collapses, the generator is leaking spurious correlations.
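
The first three legs reduce to a handful of summary statistics. The following sketch compares a real and a synthetic return sample; the function names and report keys are hypothetical, VaR and ES are historical estimates of the lower tail reported as positive losses, and the Student-t and Gaussian samples are placeholders for real data and generator output.

```python
import numpy as np
from scipy import stats

def acf(x, lag):
    """Sample autocorrelation at a single positive lag."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

def var_es(x, level):
    """Historical VaR and ES of the lower tail, as positive loss numbers."""
    q = np.quantile(x, level)
    return float(-q), float(-x[x <= q].mean())

def compare(real, synth, lags=(1, 5), levels=(0.01, 0.005, 0.001)):
    report = {
        "ks_pvalue": float(stats.ks_2samp(real, synth).pvalue),
        "skew": (float(stats.skew(real)), float(stats.skew(synth))),
        "kurtosis": (float(stats.kurtosis(real)), float(stats.kurtosis(synth))),
    }
    for lag in lags:      # volatility clustering shows up in squared returns
        report[f"acf_sq_{lag}"] = (acf(real**2, lag), acf(synth**2, lag))
    for level in levels:  # tail coverage
        report[f"es_{level}"] = (var_es(real, level)[1], var_es(synth, level)[1])
    return report

rng = np.random.default_rng(0)
real = rng.standard_t(df=5, size=20_000)          # placeholder: heavy-tailed "real" returns
synth = rng.normal(0.0, real.std(), size=20_000)  # placeholder: a Gaussian "generator"
report = compare(real, synth)
```

A Gaussian generator matched on standard deviation can look acceptable on the first two moments and still under-state expected shortfall, which is the tail under-coverage the third leg is meant to catch.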

A useful complementary diagnostic from the synthetic-data literature is sample-level utility/disclosure scoring \citep{drechsler2024synthpop}, which trades realism against memorisation risk on a per-sample basis. Worth running when the synthetic set is shared outside the team that owns the source data.

Memorisation and privacy

Diffusion models can and do memorise training samples when the data is small or repetitive. Two practical defences:

  • DP-SGD training. Adds calibrated noise to the gradient updates. Slows training and reduces sample quality; appropriate when the training set contains client-level data.
  • Memorisation audits. After training, sample N synthetic windows and find each sample's nearest training neighbour by L2 distance. Inspect the distribution of nearest-neighbour distances; samples much closer than the bulk are likely memorised.
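
The memorisation audit can be sketched with plain NumPy. The flagging rule below (distance under a fixed fraction of the median nearest-neighbour distance) is an assumption chosen for illustration, as are the function names; in practice the threshold should be set after looking at the distance histogram.

```python
import numpy as np

def nearest_train_distance(synth, train):
    """L2 distance from each synthetic window to its nearest training window."""
    s = synth.reshape(len(synth), -1)
    r = train.reshape(len(train), -1)
    # ||s - r||^2 = ||s||^2 - 2 s.r + ||r||^2, clipped at zero for numerical safety.
    d2 = (s**2).sum(axis=1)[:, None] - 2.0 * s @ r.T + (r**2).sum(axis=1)[None, :]
    return np.sqrt(np.clip(d2, 0.0, None).min(axis=1))

def flag_memorised(synth, train, ratio=0.1):
    """Flag samples whose nearest-neighbour distance is far below the bulk (hypothetical rule)."""
    d = nearest_train_distance(synth, train)
    return d < ratio * np.median(d), d

rng = np.random.default_rng(0)
train = rng.standard_normal((500, 64, 1))          # placeholder training windows
synth = rng.standard_normal((200, 64, 1))          # placeholder generator output
synth[:5] = train[:5] + 1e-3 * rng.standard_normal((5, 64, 1))   # planted near-copies
flags, distances = flag_memorised(synth, train)
```

The pairwise matrix is N by N_train, so large audits should be run in chunks.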

For purely market-data applications, memorisation is mostly an evaluation concern (you don't want to test policies on copies of the training set), not a privacy concern.

Deployment patterns

  • Persist seeds and conditioning. Every synthetic path used in a report or test is reproducible: save the seed, the conditioner, and the model version.
  • Blend, don't replace. When training a downstream model, mix synthetic and real samples (50/50 is a reasonable default) rather than substituting synthetic for real. Pure-synthetic training tends to produce models that ace synthetic out-of-sample and degrade on real.
  • Monitor distribution shift. When the live regime moves, the generator trained on yesterday's regime starts producing scenarios that look implausible. Re-fit on a sliding window — every quarter is a reasonable cadence for daily macro/equity panels.
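
One way to make "persist seeds and conditioning" concrete is a small manifest record stored next to each synthetic dataset. The field names here are hypothetical and are not the manifest schema defined elsewhere in this chapter; the sketch only shows the idea of hashing everything needed to regenerate the paths.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class SyntheticManifest:
    """Hypothetical minimal record of what is needed to regenerate a synthetic dataset."""
    model_version: str
    seed: int
    conditioner: dict
    n_paths: int
    sampler: str = "ddim-50"

    def to_json(self) -> str:
        # sort_keys makes the serialisation (and therefore the digest) deterministic
        return json.dumps(asdict(self), sort_keys=True)

    def digest(self) -> str:
        return hashlib.sha256(self.to_json().encode()).hexdigest()

m1 = SyntheticManifest("diffusion-v1", 42, {"regime": "high-vol"}, 1_000)
m2 = SyntheticManifest("diffusion-v1", 43, {"regime": "high-vol"}, 1_000)
same = SyntheticManifest("diffusion-v1", 42, {"regime": "high-vol"}, 1_000)
```

Quoting the digest in a report ties every number back to one exact seed, conditioner, and model version.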

Connecting back to the rest of the book

The synthetic-data layer closes the loop:

  • Train a generator on real data (this section).
  • Generate stress scenarios conditioned on regimes or text from Chapter 8 agents.
  • Evaluate the forecasters of Chapter 4 and the policies of Chapter 5 on those scenarios.
  • Feed the worst-case results back into Chapter 5's risk constraints.

Diffusion is the engine that makes the loop possible. The rest of the book — forecasting, decisions, dynamics, agents — is what gives the loop something to test.

GAN-Based Generators

Diffusion models (Section 10-01) have become the default time-series generator, but the GAN family — generative adversarial networks — remains worth knowing for two reasons. First, several finance-specific generators (CTGAN for tabular cross-sectional data, TimeGAN for sequential panels) live here. Second, GAN samples are typically faster to draw than diffusion samples, which matters when the synthetic-data pipeline is rate-limiting.

This section covers what GANs do, what they get right and wrong on financial data, and when to reach for them over diffusion.

What a GAN is, in one paragraph

A GAN trains two networks against each other. A generator $G$ maps noise $z \sim p(z)$ to a sample, $\hat{x} = G(z)$. A discriminator $D$ tries to distinguish real samples from generator samples and emits a probability of "real". The generator is trained to fool the discriminator; the discriminator is trained to be harder to fool. At equilibrium the generator's distribution matches the data distribution and the discriminator's accuracy is 50%. The training objective is the minimax game

$$\min_G \max_D\ \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr] + \mathbb{E}_{z \sim p(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr].$$

The original formulation \citep{goodfellow2014gan} suffers from training instability and mode collapse; the Wasserstein GAN variant \citep{arjovsky2017wgan} replaces the cross-entropy with an Earth-Mover-style distance and is the practical default in modern work.
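
In code, the original minimax game is usually written as two binary cross-entropy losses on discriminator logits. This is a sketch of the original (non-Wasserstein) objective only; the function names are hypothetical, and the generator loss uses the common non-saturating variant rather than the literal log(1 - D(G(z))) term.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits):
    """Negative of the discriminator's objective: label real as 1 and generated as 0."""
    real = F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake

def generator_loss(d_fake_logits):
    """Non-saturating generator loss: push D(G(z)) toward 1."""
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))

# At the equilibrium described above, D outputs 0.5 everywhere (logit 0).
at_equilibrium = torch.zeros(8, 1)
d_loss = discriminator_loss(at_equilibrium, at_equilibrium).item()
g_loss = generator_loss(at_equilibrium).item()
```

When the discriminator is at chance, its loss equals log 4, a useful sanity value to watch for in training logs.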

Where GANs work for finance

Two specialised architectures have stuck.

CTGAN — tabular cross-sectional data

CTGAN \citep{xu2019ctgan} is the standard for synthesising tabular data with a mix of continuous and categorical columns. It introduces two ideas that matter for finance tabular data:

  • Mode-specific normalisation of continuous columns. Continuous features in finance (returns, volumes, ratios) often have multiple modes — separating regimes, asset classes, or sector clusters. CTGAN fits a Gaussian mixture per column and generates the mode index plus a normalised value within the mode.
  • Conditional sampling on rare categories. Without it, GANs collapse to majority-class samples; with it, the generator can be asked to produce samples conditioned on a sector, country, or rating.

from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata
import polars as pl
 
# ticker-level features; SDV works on pandas, so convert once
panel = pl.read_csv("data/cross_section.csv").to_pandas()
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(panel)
 
# batch_size must be even and divisible by CTGAN's pac setting (10 by default)
syn = CTGANSynthesizer(metadata, epochs=300, batch_size=500)
syn.fit(panel)
 
# generate a synthetic panel with the same number of rows
fake = syn.sample(num_rows=len(panel))

CTGAN works well for per-period cross-sections — a snapshot of the universe at a single date — but not for time-series structure. For that, TimeGAN.

TimeGAN — sequential panels

TimeGAN \citep{yoon2019timegan} layers a GAN on top of an embedding network so the adversarial loss is computed in a learned latent space rather than on raw sequences. Three losses are combined:

  • Reconstruction loss for the embedding network (autoencoder style).
  • Adversarial loss between real and synthetic latents.
  • Supervised loss on next-step prediction in the latent space — this is the trick that gives TimeGAN better temporal consistency than vanilla sequence GANs.

The result is plausible-looking return sequences that preserve autocorrelation and volatility clustering reasonably well: better than independent GANs, not as well as a properly trained diffusion model on the same data.

Where GANs struggle

Three failure modes recur in finance:

  • Mode collapse. Generator emits a small number of distinct samples that fool the discriminator without covering the data distribution. On equity panels this looks like "all synthetic series resemble the same handful of stocks".
  • Training instability. Loss curves that don't converge, generator/discriminator capacity mismatches, hyperparameter brittleness. Wasserstein GAN with gradient penalty (WGAN-GP) mitigates but does not eliminate.
  • Tail under-fit. GAN training emphasises the bulk of the distribution where the discriminator's gradient is largest. Tails — the part finance applications care most about — are systematically under-sampled compared to diffusion.

For pure stress-testing applications where tail fidelity is the primary requirement, diffusion remains the better choice. For augmentation tasks where the bulk of the distribution matters more than the extremes, GANs are competitive and faster.

Evaluation specifics

The four-leg validation protocol of Section 10-01 (marginal, autocorrelation, tail, downstream) applies to GAN samples too, with one extra check:

  • Mode coverage. Cluster the real data into regimes (k-means on returns + vol features, or a GMM). Compute the mode occupancy of synthetic samples and compare. Strong under-coverage of any cluster is mode collapse, even if the marginal looks fine.

A practical pattern: train both a CTGAN and a small diffusion model on the same data; if mode-coverage diverges, trust diffusion and debug the GAN.
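
The mode-coverage check can be sketched without any clustering library. The code below uses a small Lloyd's k-means with a deterministic farthest-point initialisation, both chosen here for self-containment rather than taken from the text, and compares cluster occupancy between real and synthetic feature vectors. The three-cluster toy data and all function names are hypothetical.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest centre for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def mode_occupancy(real_feats, synth_feats, k=3):
    """Fraction of real and synthetic samples falling in each real-data cluster."""
    centers = kmeans(real_feats, k)
    real_occ = np.bincount(assign(real_feats, centers), minlength=k) / len(real_feats)
    synth_occ = np.bincount(assign(synth_feats, centers), minlength=k) / len(synth_feats)
    return real_occ, synth_occ

rng = np.random.default_rng(0)
means = np.array([[-5.0, 0.0], [0.0, 0.0], [5.0, 0.0]])   # three toy "regimes"
real = np.concatenate([m + 0.3 * rng.standard_normal((300, 2)) for m in means])
# A collapsed generator that never visits the third regime.
synth = np.concatenate([m + 0.3 * rng.standard_normal((450, 2)) for m in means[:2]])
real_occ, synth_occ = mode_occupancy(real, synth, k=3)
```

A cluster that holds a third of the real data and none of the synthetic data is a collapse signal even when pooled marginals look close.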

When to use GANs

A working decision rule:

  • Tabular cross-sections without strong temporal structure → CTGAN. Faster than diffusion, well-supported tooling (SDV).
  • Sequential panels with limited compute → TimeGAN, with the caveat that tail fidelity will be worse than diffusion.
  • Tail-critical stress generation → diffusion (Section 10-01).
  • Both → ensemble. Use CTGAN samples for augmentation and diffusion samples for stress. The downstream evaluation in Chapter 5 stays the same.

GANs are not the future of generative modelling for finance — that title now belongs to diffusion and (where preference data exists) flow matching. But the specialised finance-tabular and sequential variants are mature, well-tooled, and worth keeping in the toolkit for the cases they were designed for.

VAE-Based Generators

Variational autoencoders sit between GANs and diffusion in the generative-model landscape — a probabilistic encoder-decoder pair trained by maximising a likelihood-style objective rather than an adversarial game. For finance applications the practical shape that matters is the factor-aware variant: a VAE whose latent space is constrained to admit identifiable axes, which lets us generate counterfactual paths conditioned on factor exposures rather than on raw observation features.

This connects directly to Chapter 6's identifiable dynamic factor models. The VAE side of the story is the generative sibling of the identification story.

What a VAE does

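A VAE consists of an encoder $q_\phi(z \mid x)$ that maps observations $x$ to a distribution over a latent variable $z$, and a decoder $p_\theta(x \mid z)$ that maps latents back to observations. Training maximises the evidence lower bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr] - \mathrm{KL}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr).$$

The first term is reconstruction quality; the second is a regularisation that keeps the encoder's posterior close to a prior $p(z)$ (typically standard normal). Sampling is by drawing $z \sim p(z)$ and decoding.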

VAEs are easier to train than GANs (no adversarial dynamics) and give an explicit lower bound on the log-likelihood for evaluation. They tend to produce blurrier samples than GANs or diffusion in the image domain; in the time-series domain the analogue is "smooth, mean-reverting" generated paths that under-represent shocks.

Conditional VAEs

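The version that does most of the work in finance is the conditional VAE (CVAE): the encoder and decoder both condition on auxiliary information $c$ (regime, calendar, factor exposure). The ELBO becomes

$$\mathcal{L}(\theta, \phi; x, c) = \mathbb{E}_{q_\phi(z \mid x, c)}\bigl[\log p_\theta(x \mid z, c)\bigr] - \mathrm{KL}\bigl(q_\phi(z \mid x, c)\,\|\,p(z)\bigr).$$

To generate a path under regime $c$: sample $z \sim p(z)$ and decode through $p_\theta(x \mid z, c)$.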

Factor-VAE for finance

The architecture that maps cleanly onto Chapter 6's identifiable dynamic factor model:

  • Encoder produces innovations conditioned on observations and auxiliary context (calendar, regime).
  • The latent dynamics are diagonal linear: $f_t = A f_{t-1} + \eta_t$, with $A$ diagonal and $\eta_t$ the innovations.
  • Decoder is an injective MLP plus fixed observation noise.
  • Innovation prior is conditional non-Gaussian (Laplace, in practice).

Under the right assumptions on variation, the innovations are identifiable up to permutation and component-wise affine maps; the same identifiability story carries over to the factors. The generative side: sample new innovations, propagate through the dynamics, decode. The result is a synthetic time-series panel whose latent factors have the same semantic meaning as the real-data factors — which makes it usable for factor-aware backtesting.

This is the "iVDFM" architecture from the paper line referenced in Chapter 6; the dfm-python package (Section 6-05) ships a working implementation.

A worked CVAE for daily returns

A minimal per-asset CVAE for 64-step windows of daily returns, conditioned on a regime indicator, together with its loss and a conditional sampler:

import torch
import torch.nn as nn
import torch.nn.functional as F
 
class CVAE(nn.Module):
    def __init__(self, x_dim=64, z_dim=8, c_dim=2, h=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(x_dim + c_dim, h), nn.GELU(),
            nn.Linear(h, h), nn.GELU(),
            nn.Linear(h, 2 * z_dim),                # mean + log-var
        )
        self.dec = nn.Sequential(
            nn.Linear(z_dim + c_dim, h), nn.GELU(),
            nn.Linear(h, h), nn.GELU(),
            nn.Linear(h, x_dim),
        )
 
    def encode(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logv = h.chunk(2, dim=-1)
        return mu, logv
 
    def reparameterise(self, mu, logv):
        std = (0.5 * logv).exp()
        return mu + std * torch.randn_like(std)
 
    def decode(self, z, c):
        return self.dec(torch.cat([z, c], dim=-1))
 
    def forward(self, x, c):
        mu, logv = self.encode(x, c)
        z = self.reparameterise(mu, logv)
        recon = self.decode(z, c)
        return recon, mu, logv
 
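def elbo(x, recon, mu, logv):
    """Negative ELBO (a loss to minimise): squared-error reconstruction plus KL to N(0, I)."""
    rec = F.mse_loss(recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logv - mu.pow(2) - logv.exp())
    return rec + kl
 
# Sample under a target regime c_target of shape (c_dim,) or (1, c_dim).
@torch.no_grad()
def sample(model, c_target, n=1000, z_dim=8):
    z = torch.randn(n, z_dim, device=c_target.device)   # draw from the N(0, I) prior
    c = c_target.expand(n, -1)
    return model.decode(z, c)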

Three knobs that move quality:

  • Latent dimension ($d_z$, z_dim in the listing above). Too small under-fits the data; too large makes the prior loose and the samples blurry. The listing uses z_dim = 8 for 64-step windows; treat that as a starting point and tune.
  • KL annealing. Linearly ramp the KL term from 0 to its full weight over the first 5–20 epochs. Avoids posterior collapse, the failure mode where the approximate posterior collapses onto the prior and the decoder ignores $z$.
  • $\beta$-VAE. Scale the KL term by a factor $\beta$. $\beta > 1$ pushes the latent toward more disentangled representations at the cost of reconstruction quality; useful when factor interpretability is the goal.
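
KL annealing and the KL scale factor combine into a single weight on the KL term. The sketch below shows one way to wire that up; kl_weight and weighted_loss are hypothetical helpers, and the squared-error reconstruction term mirrors the loss used in the CVAE listing.

```python
import torch
import torch.nn.functional as F

def kl_weight(epoch, warmup_epochs=10, beta=1.0):
    """Linear ramp from 0 to beta over warmup_epochs, constant afterwards."""
    return beta * min(1.0, epoch / warmup_epochs)

def weighted_loss(x, recon, mu, logv, weight):
    """Reconstruction term plus a weighted KL divergence to the N(0, I) prior."""
    rec = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logv - mu.pow(2) - logv.exp())
    return rec + weight * kl

# Hand-checkable example: rec = 8 (eight entries, each off by 1), KL = 3 (six unit means).
x = torch.zeros(2, 4)
recon = torch.ones(2, 4)
mu = torch.ones(2, 3)
logv = torch.zeros(2, 3)
loss_start = weighted_loss(x, recon, mu, logv, kl_weight(0)).item()   # KL switched off
loss_mid = weighted_loss(x, recon, mu, logv, kl_weight(5)).item()     # KL at half weight
```

Inside a training loop the weight is recomputed once per epoch and passed to the loss in place of the fixed unit weight.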

Where VAEs struggle

  • Tail fidelity. Standard Gaussian-prior VAEs systematically smooth out the heavy tails of returns. Switching to a Student-$t$ or Laplace prior (and likelihood) helps; even so, diffusion typically wins on tail metrics.
  • Sharp regime switches. A single Gaussian latent struggles to represent abrupt regime changes. Mixture-of-Gaussian priors and hierarchical VAEs (e.g., NVAE-style) handle them better.
  • Long sequences. Autoregressive VAEs lose long-range coherence for the same reason RNN forecasters do; transformer-based VAEs fix this at higher compute cost.

When VAEs are the right choice

Three concrete scenarios:

  • Factor-conditioned generation. Generate paths under "factor 2 doubled, factor 3 zeroed" — exactly the kind of intervention Chapter 6's identifiable dynamics enables. CVAE plus a structured latent space is the natural fit.
  • Per-asset augmentation in a small panel. A few hundred series with limited history. VAE training is more stable than GAN at this scale; quality is sufficient for augmentation.
  • Pretraining a forecaster. A VAE encoder, dropped into a forecaster as a feature extractor, gives reasonable latent representations to start from.

For pure stress generation, prefer diffusion (Section 10-01). For identifiable factor-aware generation, VAE is the right tool. For both at once, the iVDFM-style hybrid in Chapter 6 is the cleanest expression of the goal.

Bootstrap and Copula Synthesis

Before reaching for a deep generative model, ask whether classical resampling and copula methods cover the use case. They are simpler, more defensible, and frequently sufficient, especially when the operational requirement is to preserve the joint distribution of historical returns rather than to generate scenarios that never happened. This section covers the classical methods that often beat fancy generators on benchmarks where the ground truth is "reproduce the empirical distribution".

When non-generative methods suffice

Three concrete cases that resampling handles cleanly:

  • Backtesting variance estimates. Quantify the variance of a Sharpe ratio or a drawdown estimate by resampling the historical series. No generator needed.
  • Tail-aware Monte Carlo. Generate joint scenarios under the empirical dependence structure plus parametric tails (EVT-GPD) on the margins.
  • Compliance-friendly stress. A regulator asks for "stress tests using historical observations". A bootstrap of crisis windows is exactly that, with a defensible audit trail; a diffusion sample is harder to defend.

If the use case is augmentation or truly out-of-distribution stress, jump back to Sections 10-01 / 10-02 / 10-03. For the rest, this section.

Block bootstrap, refresher

Section 02-05 introduced block bootstrap as the fix for the IID bootstrap's incompatibility with serial dependence. The recipe:

  1. Choose block length $b$, ideally proportional to the half-life of the dominant autocorrelation in the data.
  2. Sample $\lceil n / b \rceil$ blocks of length $b$ uniformly from the original series, with replacement.
  3. Concatenate (and truncate) to form a synthetic series of length $n$.

Three variants:

  • Moving block bootstrap. Blocks are contiguous slices.
  • Stationary bootstrap \citep{politis1994stationary}. Block lengths are random with a geometric distribution of mean $b$. Reduces edge effects from fixed block boundaries.
  • Circular bootstrap. Wraps the series end-to-start before block sampling, eliminating boundary artefacts.

import numpy as np
 
def stationary_bootstrap(series: np.ndarray, mean_block: int, n: int,
                         rng: np.random.Generator) -> np.ndarray:
    """Generate one stationary-bootstrap path of length n."""
    T = len(series)
    out = np.empty(n)
    p = 1.0 / mean_block
    i = rng.integers(T)
    for t in range(n):
        out[t] = series[i % T]
        if rng.random() < p:
            i = rng.integers(T)
        else:
            i += 1
    return out

For a multivariate panel, the same indices are used across all series in a block — this preserves contemporaneous dependence by construction. Trying to bootstrap each series independently destroys the cross-sectional structure and is almost never what you want.
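
The same-indices rule is easiest to enforce by bootstrapping row indices rather than values. The sketch below, with a hypothetical helper and a toy three-asset panel driven by one common factor, contrasts joint resampling with per-series resampling.

```python
import numpy as np

def stationary_bootstrap_indices(T, mean_block, n, rng):
    """Row indices for one stationary-bootstrap draw of length n from a series of length T."""
    idx = np.empty(n, dtype=np.int64)
    p = 1.0 / mean_block
    i = rng.integers(T)
    for t in range(n):
        idx[t] = i % T                                        # wrap around the end
        i = rng.integers(T) if rng.random() < p else i + 1    # restart or continue the block
    return idx

rng = np.random.default_rng(0)
common = rng.standard_normal(2_000)
# Toy panel: three series sharing one factor, so cross-correlations are close to 1.
panel = np.column_stack([common + 0.1 * rng.standard_normal(2_000) for _ in range(3)])

idx = stationary_bootstrap_indices(len(panel), mean_block=20, n=2_000, rng=rng)
joint = panel[idx]                     # one index path applied to every column

independent = np.column_stack([        # a separate index path per column
    panel[stationary_bootstrap_indices(len(panel), 20, 2_000, rng), j]
    for j in range(3)
])

corr_joint = np.corrcoef(joint, rowvar=False)[0, 1]
corr_indep = np.corrcoef(independent, rowvar=False)[0, 1]
```

The joint draw keeps the cross-correlation of the original panel, while the per-series draw drives it toward zero.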

Block bootstrap of innovations

A useful refinement: rather than bootstrapping the raw series, bootstrap the standardised residuals of a fitted model and re-introduce the model's mean and variance dynamics.

  1. Fit a model (AR / ARMA / GARCH) to the data.
  2. Standardise residuals: $z_t = \hat\varepsilon_t / \hat\sigma_t$.
  3. Block-bootstrap the standardised residuals.
  4. Plug the bootstrapped innovations back into the model's recursion to produce a synthetic path.

The benefit: the synthetic path has the model's mean and variance dynamics (which capture volatility clustering and seasonality) combined with the empirical shape of the innovations (which captures the heavy tails). For finance applications this often beats both pure block bootstrap and pure parametric Monte Carlo on downstream evaluation.
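
The four steps can be sketched with the simplest possible model. The example assumes an AR(1) with constant residual variance, fitted by least squares, so that it stays dependency-free; with a GARCH model the standardisation and the recursion would use the fitted conditional volatility instead. All function names are hypothetical.

```python
import numpy as np

def fit_ar1(x):
    """OLS fit of x_t = c + phi * x_{t-1} + e_t; returns (c, phi, residuals)."""
    X = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    (c, phi), *_ = np.linalg.lstsq(X, x[1:], rcond=None)
    return c, phi, x[1:] - X @ np.array([c, phi])

def moving_block_indices(T, block, n, rng):
    """Indices of ceil(n / block) contiguous blocks, truncated to length n."""
    starts = rng.integers(0, T - block + 1, size=int(np.ceil(n / block)))
    return np.concatenate([np.arange(s, s + block) for s in starts])[:n]

def ar1_residual_bootstrap(x, block, n, rng):
    c, phi, resid = fit_ar1(x)                                # step 1: fit the model
    sigma = resid.std()
    z = resid / sigma                                         # step 2: standardise (constant sigma)
    z_star = z[moving_block_indices(len(z), block, n, rng)]   # step 3: block-bootstrap
    path = np.empty(n)
    prev = x[-1]                                              # start from the last observation
    for t in range(n):                                        # step 4: rebuild through the recursion
        prev = c + phi * prev + sigma * z_star[t]
        path[t] = prev
    return path

rng = np.random.default_rng(0)
x = np.zeros(5_000)
for t in range(1, len(x)):                                    # toy AR(1) with heavy-tailed shocks
    x[t] = 0.6 * x[t - 1] + rng.standard_t(df=5)

path = ar1_residual_bootstrap(x, block=20, n=5_000, rng=rng)
phi_star = fit_ar1(path)[1]
```

The synthetic path inherits the fitted persistence, while its shocks are resampled from the empirical residuals and keep their heavy tails.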

Copula-based scenario generation

Copulas decouple margins from dependence, which is the right decomposition for many financial stress problems.

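The setup: for a $d$-dimensional return vector $X = (X_1, \dots, X_d)$ with margins $F_1, \dots, F_d$ and joint CDF $F$, Sklar's theorem says there is a copula $C$ with

$$F(x_1, \dots, x_d) = C\bigl(F_1(x_1), \dots, F_d(x_d)\bigr),$$

and $C$ is unique when the margins are continuous (in general, unique on the product of the ranges of the margins). The recipe:

  1. Fit margins separately. For finance, the right choice for tails is EVT-GPD above a threshold, empirical CDF below it.
  2. Transform observations to uniforms via $u_i = F_i(x_i)$.
  3. Fit a copula on the uniforms.
  4. To sample: draw $u \sim C$, then $x_i = F_i^{-1}(u_i)$.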

Three copula choices that recur:

  • Gaussian copula. Simple, fast, no tail dependence. Convenient default; a known understatement of joint extremes.
  • Student-$t$ copula. Same parameterisation as Gaussian plus degrees of freedom; admits symmetric tail dependence. The default when joint extremes matter and computational cost is a concern.
  • Vine copula. Decomposes the joint into a tree of pair copulas, each chosen separately. Captures asymmetric and heterogeneous dependence; expensive to fit but the most flexible classical option.

import numpy as np
from scipy.stats import t as student_t, norm
 
def fit_t_copula(U: np.ndarray) -> tuple[np.ndarray, float]:
    """Fit a Student-t copula to uniform samples U (shape (n, d))."""
    U = np.clip(U, 1e-6, 1 - 1e-6)               # keep ppf finite if any u is exactly 0 or 1
    Z = norm.ppf(U)                              # to standard normal scores
    R = np.corrcoef(Z, rowvar=False)             # normal-scores correlation, used as the copula correlation
    # Estimate df by maximum likelihood (omitted: use scipy.optimize)
    nu = 6.0                                       # placeholder
    return R, nu
 
def sample_t_copula(R: np.ndarray, nu: float, n: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Draw n samples from a Student-t copula."""
    d = R.shape[0]
    L = np.linalg.cholesky(R)
    # standard multivariate-t
    g = rng.chisquare(nu, size=(n, 1)) / nu
    z = rng.standard_normal(size=(n, d)) @ L.T
    t = z / np.sqrt(g)
    # to uniforms
    return student_t(df=nu).cdf(t)
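
The contrast between the Gaussian and Student-$t$ choices can be made concrete with the tail-dependence coefficient. For a bivariate $t$ copula with correlation $\rho$ and $\nu$ degrees of freedom it is $\lambda = 2\, t_{\nu+1}\!\big(-\sqrt{(\nu+1)(1-\rho)/(1+\rho)}\big)$, which tends to zero as $\nu$ grows and the copula approaches the Gaussian one. The helper name below is illustrative.

```python
import numpy as np
from scipy.stats import t as student_t

def t_copula_tail_dependence(rho: float, nu: float) -> float:
    """Tail-dependence coefficient of a bivariate Student-t copula."""
    arg = -np.sqrt((nu + 1.0) * (1.0 - rho) / (1.0 + rho))
    return float(2.0 * student_t.cdf(arg, df=nu + 1.0))

# Same correlation, increasing degrees of freedom: tail dependence fades.
for nu in (3, 6, 30, 300):
    print(nu, t_copula_tail_dependence(0.5, nu))
```

This is one way to read a fitted $\nu$: a low value says joint extremes are materially more likely than a Gaussian copula with the same correlation would imply.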

Hybrid: bootstrap dependence + parametric tails

A pattern that delivers most of the realism of a deep generator with much less compute and fewer modelling assumptions:

  1. Empirical CDF for the margins, except the tails (top and bottom 5%), which are fit with GPD.
  2. Block-bootstrap the rank-uniformised observations, resampling whole rows in contiguous blocks so that cross-sectional dependence is kept within each date and serial dependence within each block.
  3. Invert each margin (empirical or GPD) to produce the synthetic path.
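
The three steps above can be sketched as follows. This is a minimal version under stated assumptions: a moving-block bootstrap with a fixed block length, 5% tails on each side, and GPD fits with the location pinned at the threshold. The function names `block_indices`, `margin_quantile`, and `hybrid_paths` are made up for the illustration.

```python
import numpy as np
from scipy.stats import genpareto, rankdata

def block_indices(n: int, block: int, n_out: int, rng: np.random.Generator) -> np.ndarray:
    """Moving-block bootstrap: concatenate random contiguous blocks of row indices."""
    starts = rng.integers(0, n - block + 1, size=int(np.ceil(n_out / block)))
    idx = (starts[:, None] + np.arange(block)[None, :]).ravel()
    return idx[:n_out]

def margin_quantile(x: np.ndarray, p: np.ndarray, tail: float = 0.05) -> np.ndarray:
    """Invert one margin: empirical quantiles in the body, GPD in both tails."""
    lo, hi = np.quantile(x, [tail, 1.0 - tail])
    out = np.quantile(x, np.clip(p, tail, 1.0 - tail))   # body; tails are overwritten below
    # Upper tail: GPD fitted to exceedances over hi.
    xi_u, _, beta_u = genpareto.fit(x[x > hi] - hi, floc=0.0)
    up = p > 1.0 - tail
    out[up] = hi + genpareto.ppf((p[up] - (1.0 - tail)) / tail, xi_u, scale=beta_u)
    # Lower tail: GPD fitted to shortfalls below lo.
    xi_l, _, beta_l = genpareto.fit(lo - x[x < lo], floc=0.0)
    dn = p < tail
    out[dn] = lo - genpareto.ppf((tail - p[dn]) / tail, xi_l, scale=beta_l)
    return out

def hybrid_paths(X: np.ndarray, block: int, n_out: int,
                 rng: np.random.Generator, tail: float = 0.05) -> np.ndarray:
    """Rank-uniformise, block-bootstrap whole rows, then invert each margin."""
    n, d = X.shape
    U = rankdata(X, axis=0) / (n + 1.0)                  # pseudo-uniforms per series
    U_boot = U[block_indices(n, block, n_out, rng)]      # rows resampled in blocks
    return np.column_stack(
        [margin_quantile(X[:, j], U_boot[:, j], tail) for j in range(d)]
    )

rng = np.random.default_rng(1)
history = rng.standard_t(df=4, size=(2000, 3))           # stand-in for a return panel
synthetic = hybrid_paths(history, block=20, n_out=500, rng=rng)
```

Because the uniforms are resampled rather than drawn afresh, the synthetic tails are a GPD-smoothed version of the observed ones; they do not reach far beyond the historical extremes.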

This combines empirical cross-sectional dependence, temporal dependence, and parametric tails. It is a strong default for stress generation when the auditor wants "all components defensible".

Validation for resampled / copula-generated paths

The same four-leg protocol as in Sections 02-05 and 10-01 applies:

  • Marginal fit. KS or energy distance on each series. Should be near-perfect for non-tail bootstraps; check tail under-coverage if EVT margins are used.
  • Autocorrelation. ACF of returns and squared returns. Block bootstrap should reproduce the empirical ACF on average; if it does not, the block length is wrong.
  • Cross-sectional dependence. Pearson and Kendall matrices on synthetic vs. real. Joint stress (correlation in the tail) is what copula methods can preserve and an IID bootstrap applied series by series cannot.
  • Downstream backtest. Run the policies of Chapter 5 on the synthetic data. Under bootstrap, the Sharpe / drawdown distribution should match the realised distribution; under tail-conditioned copula sampling, the same distributions should come out visibly stressed.

When deep generators are the better answer

These classical methods are appropriate when the synthetic data must reproduce the historical distribution. They are not appropriate when the data must be plausible but unseen, as in a regime that the historical record does not contain. Diffusion (10-01), GAN (10-02), and VAE (10-03) all learn a model of the historical distribution that can then be conditioned to extrapolate beyond it; the classical methods here cannot extrapolate, only resample.

A working rule:

Use case                              First tool
Backtest variance                     Block bootstrap
Compliance-friendly stress            Block bootstrap + EVT
Tail-aware joint stress               Hybrid copula + EVT margins
Generate unseen regimes               Diffusion (10-01)
Factor-aware counterfactual           Factor-VAE (10-03)
Tabular cross-section augmentation    CTGAN (10-02)

What this section adds

The deep generators get the press; the classical methods get the production work in regulated environments. Section 10-05 closes the chapter with the evaluation and privacy discipline that applies to both classes of generator.

Evaluation and Privacy

A synthetic-data pipeline that lacks an evaluation contract is a liability — it produces output that looks plausible without any guarantee it preserves the structure downstream consumers depend on. This section pulls the evaluation discipline scattered across 10-01 to 10-04 into a single workflow, then layers on the privacy considerations that matter when the source data is sensitive or client-level.

The four-leg evaluation protocol

Apply this to every generator the book builds, regardless of class.

  • Marginal fit. Per-series mean, std, skew, kurtosis on synthetic vs. real. Two-sample tests (Kolmogorov–Smirnov, energy distance) on returns. A generator that fails the marginal fit cannot be useful for any downstream task — fix it before anything else.
  • Autocorrelation match. ACF of returns and squared / absolute returns. The squared-returns ACF is the main carrier of volatility clustering, and a generator that breaks it is unusable for risk modelling.
  • Tail coverage. Empirical VaR and ES at a fixed tail level $\alpha$ (for example 5%) on synthetic and real paths. Tail under-coverage is the failure mode that quietly breaks stress tests; the diagnostic is a ratio plot of synthetic / real tail estimates.
  • Downstream test. The decisive evaluation. Train the forecaster of Chapter 4 or the policy of Chapter 5 on the synthetic panel; evaluate on the real held-out panel. Performance in the same band as training-on-real means the generator preserved the right structure; a collapse means it leaked spurious correlations.

A generator that scores well on marginals and ACF can still fail on the downstream test — that is precisely why the downstream test is the one that actually matters. Run it on every generator before production deploy.
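
The first three legs are cheap to automate for a single series. The sketch below is one possible implementation, not the book's reference code: the function names, the 20-lag window, and the 5% tail level are assumptions for the example. The downstream leg is not covered because it depends on the forecaster or policy being evaluated.

```python
import numpy as np
from scipy.stats import ks_2samp

def acf(x: np.ndarray, max_lag: int) -> np.ndarray:
    """Sample autocorrelation at lags 1..max_lag."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

def var_es(x: np.ndarray, alpha: float) -> tuple[float, float]:
    """Left-tail VaR and ES of returns at level alpha, reported as positive losses."""
    q = np.quantile(x, alpha)
    return float(-q), float(-x[x <= q].mean())

def evaluate_series(real: np.ndarray, fake: np.ndarray,
                    max_lag: int = 20, alpha: float = 0.05) -> dict:
    """Marginal fit, volatility-clustering match, and tail coverage for one series."""
    var_r, es_r = var_es(real, alpha)
    var_f, es_f = var_es(fake, alpha)
    return {
        "ks_p_value": float(ks_2samp(real, fake).pvalue),
        # Largest gap between the squared-return ACFs over the lag window.
        "acf_sq_gap": float(np.abs(acf(real**2, max_lag) - acf(fake**2, max_lag)).max()),
        # Ratios below 1 mean the synthetic tail is thinner than the real one.
        "var_ratio": var_f / var_r,
        "es_ratio": es_f / es_r,
    }
```

A ratio well below 1 on `var_ratio` or `es_ratio` is the tail under-coverage described above.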

Mode coverage and diversity

For models prone to mode collapse (GANs especially, also low-rank VAEs), an additional check:

  • Cluster the real data into modes via k-means or GMM on per-window features (mean return, realised vol, skew).
  • Compute the mode occupancy of synthetic samples.
  • Compare. Strong under-coverage of any mode means the generator has missed a regime.
import numpy as np
from sklearn.cluster import KMeans
 
def mode_coverage(real: np.ndarray, fake: np.ndarray, k: int = 8) -> dict:
    """Compare mode occupancy on per-window summary features."""
    feats = lambda X: np.column_stack([X.mean(axis=1), X.std(axis=1)])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats(real))
    real_lab = km.labels_
    fake_lab = km.predict(feats(fake))
    real_p = np.bincount(real_lab, minlength=k) / len(real_lab)
    fake_p = np.bincount(fake_lab, minlength=k) / len(fake_lab)
    return {
        "real_p": real_p, "fake_p": fake_p,
        "tv_distance": 0.5 * np.abs(real_p - fake_p).sum(),
    }

A total-variation distance over 0.15 is a red flag; under 0.05 is healthy.

Sample-level utility / disclosure metrics

For a synthetic dataset that will be released externally (to collaborators, regulators, or the public), the literature on synthetic-data evaluation distinguishes utility — how well the synthetic data reproduces statistical properties — from disclosure risk — how much information about specific real records leaks through synthetic samples \citep{drechsler2024synthpop}.

Per-sample diagnostics that recur:

  • Nearest-neighbour distance ratio (NNDR). For each synthetic sample, compute the ratio of its distance to the nearest real sample over its distance to the second-nearest real sample. A ratio near 1 indicates the synthetic sample is "between" two real ones (likely safe); a ratio near 0 indicates it is almost a copy of a single real sample (likely memorised).
  • Membership inference attack success rate. Train a classifier to distinguish "in training" from "not in training" given the synthetic data. If success is materially above the chance rate (50% on a balanced split), the generator memorised parts of the training set.
  • Authenticity score. Held-out real samples should be more similar to training-real samples than synthetic samples are, on average. If synthetic samples are more similar to specific training samples than other reals are, that's memorisation.
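
The NNDR diagnostic in the first bullet can be computed directly with a two-neighbour query. This is a minimal sketch assuming samples are already flattened into feature vectors on a comparable scale; the function name `nndr` and the `eps` guard are choices made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nndr(real: np.ndarray, fake: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Per-synthetic-sample ratio of nearest to second-nearest real distance."""
    nn = NearestNeighbors(n_neighbors=2).fit(real)
    dists, _ = nn.kneighbors(fake)               # columns are sorted by distance
    return dists[:, 0] / (dists[:, 1] + eps)     # eps avoids division by zero

rng = np.random.default_rng(0)
real = rng.standard_normal((200, 5))
fresh = rng.standard_normal((50, 5))                          # new draws
copies = real[:10] + 1e-6 * rng.standard_normal((10, 5))      # near-verbatim copies
print(np.median(nndr(real, fresh)), np.max(nndr(real, copies)))
```

Near-copies show up as ratios close to zero, while fresh draws sit much closer to one.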

For purely market-data applications (no personally identifiable data), these are mostly evaluation concerns rather than privacy ones — but if the synthetic dataset will be redistributed, the checks become load-bearing.

Memorisation: detection and mitigation

Memorisation is the failure mode where the generator emits training samples verbatim or near-verbatim. It happens to all generator classes when the training set is small or repetitive.

Detection (cheap, run after every training):

import numpy as np
from sklearn.neighbors import NearestNeighbors
 
def memorisation_audit(real: np.ndarray, fake: np.ndarray,
                       k: int = 2) -> np.ndarray:
    """Return per-fake-sample distance to nearest real sample."""
    nn = NearestNeighbors(n_neighbors=k).fit(real)
    dists, _ = nn.kneighbors(fake)
    return dists[:, 0]                                   # nearest real distance

Plot the histogram of nearest-neighbour distances for synthetic samples; a heavy spike near zero is the symptom.

Mitigation:

  • More data. The single most effective fix; memorisation is a data-scarcity symptom.
  • Stronger regularisation. Higher dropout, lower model capacity, shorter training.
  • Differential privacy training. DP-SGD adds calibrated noise to per-sample gradients. Slower to converge, lower-quality samples, but provides formal privacy guarantees. Appropriate when the training data contains client-level information.

Do not skip the audit. A generator that memorises is unusable for any downstream task that relies on unseen samples, and samples that reproduce training records may not qualify as synthetic for legal or compliance purposes.

Differential privacy in brief

For applications where synthetic data must protect specific records in the training set, the formal framework is differential privacy. A generator $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for any two datasets $D$ and $D'$ differing in one record and any output set $S$,

$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta.$$

Smaller $\varepsilon$ = stronger privacy. The standard mechanism is DP-SGD: clip per-sample gradients to a max norm, add Gaussian noise calibrated to the privacy budget, accumulate the privacy loss across training. Practical libraries (Opacus for PyTorch) implement the bookkeeping.
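
The clip-then-noise step is easiest to see on a toy model. The sketch below performs one DP-SGD update for least-squares linear regression in plain NumPy; the model, the function name `dp_sgd_step`, and the parameter names are assumptions for the illustration. It does not track the privacy loss, which is the part a library such as Opacus handles.

```python
import numpy as np

def dp_sgd_step(w: np.ndarray, X: np.ndarray, y: np.ndarray, lr: float,
                clip: float, noise_mult: float,
                rng: np.random.Generator) -> np.ndarray:
    """One DP-SGD update for a least-squares linear model (no privacy accounting)."""
    resid = X @ w - y                                    # shape (batch,)
    grads = resid[:, None] * X                           # per-sample gradients, (batch, d)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale each per-sample gradient so its norm is at most `clip`.
    grads = grads * np.minimum(1.0, clip / (norms + 1e-12))
    # Gaussian noise with standard deviation noise_mult * clip, added to the summed gradient.
    noise = rng.normal(0.0, noise_mult * clip, size=w.shape)
    g = (grads.sum(axis=0) + noise) / len(y)
    return w - lr * g

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))
y = rng.standard_normal(32)
w = dp_sgd_step(np.zeros(4), X, y, lr=0.1, clip=1.0, noise_mult=1.1, rng=rng)
```

Clipping bounds how much any single record can move the update, and the noise scale is tied to that bound; together these are what make the privacy accounting possible.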

DP-SGD imposes a real cost on sample quality; privacy budgets for usable models typically fall in $\varepsilon \in [1, 10]$. Below 1 the samples degrade noticeably; above 10 the privacy guarantee becomes too loose to mean much. For market-data applications without per-client data, the cost rarely justifies the use; for applications involving client-level positions or PII, DP-SGD is the responsible default.

Reproducibility and provenance

Every synthetic dataset that touches a downstream pipeline must carry:

  • Seeds. Every PRNG seed used to generate the dataset.
  • Generator version. Model architecture commit hash, training config, training-data hash.
  • Conditioning metadata. If the dataset was generated under a specific regime, calendar, or text prompt, log it.
  • Validation report. The four-leg protocol's results and the memorisation audit, embedded as a manifest.

Without this, a downstream Sharpe number computed on synthetic data is not reproducible — and not defensible.

A working manifest schema:

{
  "dataset_id": "synth-equity-2026-q1",
  "generator": {"kind": "diffusion", "version": "v1.3", "config_hash": "..."},
  "training_data": {"source": "...", "as_of": "...", "row_hash": "..."},
  "conditioning": {"regime": "high_vol", "calendar": "2026-q1"},
  "seeds": [42, 43, 44],
  "n_samples": 10000,
  "validation": {
    "ks_p_value": 0.34,
    "acf_ratio": 0.97,
    "var_5pct_ratio": 1.02,
    "downstream_sharpe_delta": -0.04,
    "memorisation_max_recall": 0.0008
  }
}

Production patterns

Three patterns that travel:

  • Validation gate. Synthetic data does not enter a downstream pipeline until the validation report passes thresholds the team has agreed to. Automated; regenerate or block.
  • Mix-don't-replace. Train downstream models on a mix of real and synthetic data (50/50 default). Pure-synthetic training is the failure mode where the model learns the generator's idiosyncrasies rather than market structure.
  • Refresh on regime shift. When the live regime moves, the generator trained on the old regime starts producing scenarios that look implausible. Re-fit on a sliding window — quarterly is a reasonable default for daily macro/equity panels.
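
The validation gate can be as small as a function that checks the manifest's `validation` block against agreed bounds and reports what failed. The threshold values and the name `validation_gate` below are hypothetical; each team sets its own limits.

```python
from typing import Mapping, Optional

# Hypothetical (lower, upper) bounds per metric; None means unbounded on that side.
THRESHOLDS: dict[str, tuple[Optional[float], Optional[float]]] = {
    "ks_p_value": (0.05, None),
    "acf_ratio": (0.90, 1.10),
    "var_5pct_ratio": (0.90, 1.10),
    "memorisation_max_recall": (None, 0.01),
}

def validation_gate(validation: Mapping[str, float],
                    thresholds: Mapping[str, tuple] = THRESHOLDS) -> list[str]:
    """Return a list of failure messages; an empty list means the dataset passes."""
    failures = []
    for key, (lo, hi) in thresholds.items():
        if key not in validation:
            failures.append(f"{key}: missing from validation report")
            continue
        value = validation[key]
        if lo is not None and value < lo:
            failures.append(f"{key}: {value} below {lo}")
        if hi is not None and value > hi:
            failures.append(f"{key}: {value} above {hi}")
    return failures

report = {"ks_p_value": 0.34, "acf_ratio": 0.97, "var_5pct_ratio": 1.02,
          "downstream_sharpe_delta": -0.04, "memorisation_max_recall": 0.0008}
failures = validation_gate(report)
if failures:
    raise RuntimeError("synthetic dataset blocked: " + "; ".join(failures))
```

Treating a missing metric as a failure keeps a partially validated dataset from slipping through.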

What this section adds

Sections 10-01 through 10-04 cover the generators. This section is the discipline that turns generated samples into trustworthy inputs for the rest of the book — Chapter 4's forecasters, Chapter 5's policies, Chapter 9's fine-tuning evaluation. The synthetic-data loop closes here: generator → validate → mix into downstream pipeline → measure improvement in real performance → re-train generator on fresh data when needed.

The book finishes with this loop because it is what makes the rest of the book honest. A forecaster trained only on real history is a forecaster that has seen one realisation of the world; a forecaster trained on real history plus validated synthetic counterfactuals is one that has seen many. The synthetic loop is the difference between a model that fits and a system that survives.