"We can't finish the gravity equation without data from inside the singularity. The universe hides its rules where we can't reach them."
— Paraphrased from the movie Interstellar (2014)
In finance, as in physics, the observable world often masks simpler underlying dynamics. Asset prices fluctuate, economic indicators rise and fall, but beneath these surface movements lie fundamental forces—latent states that drive market behavior. These hidden dynamics, whether they manifest as volatility regimes, credit cycles, or common risk factors, cannot be directly measured. Yet understanding them is essential for forecasting, risk management, and optimal decision-making.
This chapter explores state-space models and dynamic factor models—mathematical frameworks that allow us to infer these unobservable drivers from noisy, incomplete observations. Unlike black-box machine learning approaches, state-space models provide interpretable representations of latent dynamics, enabling us to understand not just what will happen, but why through the lens of underlying factors.
The challenge is fundamental: we observe market prices, economic indicators, and financial time series, but we need to extract the hidden states that govern their evolution. This is the "data from inside the singularity"—the latent dynamics that determine outcomes but remain invisible to direct measurement. State-space models provide the mathematical machinery to bridge this gap, transforming observable data into insights about unobservable drivers.
Why This Matters
State-space models have become indispensable in modern finance. Central banks use them for nowcasting GDP before official releases, combining monthly indicators with quarterly aggregates. Risk managers employ them to estimate volatility regimes that shift between calm and turbulent periods. Portfolio managers rely on them to extract common factors driving asset returns, enabling better diversification and risk allocation.
The power of state-space models lies in their ability to handle real-world complexities: missing data, mixed frequencies, measurement error, and temporal dependencies. They provide a unified framework that encompasses classical time-series models (ARIMA, GARCH) as special cases while extending to modern applications like nowcasting and factor extraction.
Chapter Structure
This chapter progresses from theoretical foundations to practical implementation:
Section 6-01: What Is State Space? introduces the mathematical framework of state-space models, explaining how latent states connect to observable data through transition and observation equations. We explore the historical development from control theory to econometrics, examine key properties like observability and stability, and illustrate applications in volatility modeling and nowcasting.
Section 6-02: Finding Latent State addresses the core estimation challenge: how do we extract unobservable states from noisy observations? We progress from the simplest method (Principal Component Analysis) through recursive filtering (Kalman filter), smoothing (Kalman smoother), to joint state-parameter estimation (EM algorithm). Each method builds on the previous, solving increasingly complex inference problems.
Section 6-03: Dynamic Factor Model provides a hands-on tutorial for building and using Dynamic Factor Models with the dfm-python package. We cover configuration, training, forecasting, and nowcasting applications, demonstrating how theoretical concepts translate into practical tools for financial analysis.
Section 6-04: Factors and Autoencoders explores the limitations of linear models and introduces nonlinear extensions. We establish the connection between PCA and autoencoders, showing how neural networks generalize classical dimension reduction while maintaining factor model structure. This section bridges classical econometrics and modern deep learning.
Section 6-05: Deep Dynamic Factor Model completes the journey with a practical tutorial on Deep Dynamic Factor Models (DDFMs). Using neural encoders to extract factors while maintaining interpretable dynamics, DDFMs capture nonlinear relationships and adapt to structural breaks—capabilities essential for modeling complex financial systems.
Connection to Previous Chapters
This chapter builds on several foundations established earlier. From Chapter 2, we use classical time-series concepts (ARIMA, GARCH) that appear as special cases of state-space models. From Chapter 3, we leverage Python tools for data manipulation and visualization. From Chapter 4, we connect forecasting evaluation frameworks to state-space predictions. And from Chapter 5, we see how extracted factors inform optimal decision-making under uncertainty.
The latent states we extract here become inputs to downstream applications: portfolio optimization uses factor exposures, risk management relies on volatility regime estimates, and forecasting systems incorporate nowcasted economic conditions. State-space models thus serve as a bridge between raw data and actionable insights.
What You Will Learn
By the end of this chapter, you will be able to:
Understand the mathematical structure of state-space models and their relationship to observable data
Extract latent factors from high-dimensional time series using PCA, Kalman filtering, and EM algorithms
Build and train Dynamic Factor Models for nowcasting and forecasting applications
Recognize when linear models fail and when nonlinear extensions (DDFMs) become necessary
Implement practical solutions using the dfm-python package, from configuration to deployment
The journey from simple dimension reduction to sophisticated state-space inference may seem complex, but each step builds naturally on the previous. We start with static methods, add temporal structure, then introduce nonlinearity—always maintaining the interpretable factor model framework that makes these methods valuable for financial applications.
What Is State Space?
Financial markets are driven by forces we cannot directly observe. Volatility regimes shift between calm and turbulent periods, credit cycles move through expansion and contraction phases, and market sentiment fluctuates based on unobservable psychological factors. Yet these hidden drivers determine asset prices, economic indicators, and investment outcomes. The fundamental challenge in financial AI is extracting these latent states from noisy, incomplete observations.
State-space models provide the mathematical framework to bridge this gap. They describe systems where observable data $y_t$ depend on latent states $x_t$ that evolve through time. The transition equation captures how latent states evolve: how volatility regimes persist and transition, how credit cycles build and shift, how market sentiment changes. The observation equation links these hidden states to what we actually measure, such as stock prices, bond yields, and economic indicators. This framework unifies many classical models (GARCH, ARIMA, structural time series) while extending naturally to modern applications (dynamic factor models, nowcasting, deep state-space models).
Motivation: Hidden States Matter in Finance
As we have seen in previous chapters, there have been consistent attempts to model latent dynamics, from early volatility models to modern factor extraction methods. ARCH and GARCH models [@engle1982arch; @bollerslev1986garch] represent one of the most well-known efforts, treating volatility as an unobservable process that evolves over time. The GARCH model captures volatility clustering—the tendency for high volatility periods to be followed by high volatility. This is a state-space structure: volatility is the latent state, returns are the observations, and the GARCH equations describe how volatility evolves. State-space models formalize these hidden processes and provide a unified framework that encompasses GARCH models as special cases while extending to more complex applications like dynamic factor models and nowcasting.
The Data Scarcity Challenge in Finance
Financial and economic data are fundamentally different from the large-scale datasets that power modern deep learning. Unlike image recognition or natural language processing, where millions of examples are readily available, financial time series are constrained by the nature of economic activity itself. GDP is published quarterly, employment data arrives monthly, and even daily stock prices provide only one observation per trading day. Over a decade, we might accumulate only 40 quarterly GDP observations, 120 monthly employment figures, or roughly 2,500 daily stock returns. This inherent data scarcity makes traditional "data-hungry" deep learning approaches challenging in finance.
The challenge is fundamental: financial and economic processes evolve continuously, but we observe them through sparse, noisy, and often delayed measurements. State-space models excel in this setting because they leverage structural assumptions about how latent states evolve, rather than relying solely on large datasets. They can work effectively with 50-200 time periods, making them suitable for macroeconomic applications where historical data is limited. This structural modeling approach is essential when data is scarce—by explicitly modeling the relationship between latent states and observations, state-space models extract maximum information from limited data.
State-space models provide interpretable parameters that teach us about latent dynamics: transition matrices reveal persistence and mean reversion, observation matrices show how indicators relate to underlying activity, and noise covariances quantify uncertainty. They naturally handle missing data, mixed frequencies, and measurement error through the Kalman filter framework, which we explore in detail in Section 6-02. This combination of structural modeling and efficient inference makes state-space models particularly valuable for financial applications where data is inherently limited but interpretability and uncertainty quantification are crucial.
What is a State? Intuition for Finance
Before diving into the mathematics, let's build intuition about what a "state" means in finance. Think of a state as a complete description of the financial system at a given moment—all the information needed to predict future asset prices and economic outcomes.
Consider a portfolio manager tracking their positions. The observable includes current portfolio value, daily returns, and transaction records. The latent state encompasses true risk exposure, hidden correlations, and regime-dependent betas. The portfolio's state includes not just current positions, but also the underlying risk factors that drive returns—market risk, sector exposure, style factors. Defining the latent state properly is crucial for good risk management and portfolio construction, as it reveals the true drivers of portfolio performance that are not directly visible in market prices.
This differs from simple feature engineering. While feature engineering creates new variables from existing data, state-space modeling treats certain quantities as fundamentally unobservable. The state is not just a transformation of observations—it represents the true underlying system that generates those observations. We cannot measure the state directly; we can only infer it from noisy, incomplete observations.
Similarly, the financial market has a state: observable stock prices, bond yields, and economic indicators reflect a latent state of true economic activity, volatility regime, credit cycle phase, and market sentiment. Just as a portfolio's state determines its future returns, the market's state determines how assets will evolve. The key insight: we can't directly observe the state, but we can infer it from observations using state-space methods.
States evolve over time following patterns: volatility regimes persist then transition during crises, credit cycles build pressure before shifting, and business cycles move through expansion, peak, contraction, and trough phases. In classical state-space modeling, we use the Markov property: today's state depends primarily on yesterday's state, not the entire history. This "memoryless" assumption, while seemingly restrictive, captures the essential dynamics while enabling tractable inference. In finance, this often holds approximately—volatility regimes depend primarily on yesterday's regime, credit conditions reflect recent developments, and common factors evolve based on current values with past information embedded in the current state.
Understanding latent states enables numerous financial AI applications. In risk management, we can identify regime shifts before they manifest in prices, allowing proactive risk adjustment. For portfolio construction, we allocate based on factor exposures revealed through state-space inference, optimizing risk-adjusted returns. Nowcasting allows us to estimate current economic conditions before official releases, providing timely information for investment decisions. Forecasting predicts future states and their impact on asset prices, enabling better investment strategies.
Historical Context and Origins
The mathematical foundation of state-space models came from Rudolf Kalman's work in the 1960s on optimal filtering for linear systems. Kalman's key insight—that optimal state estimation could be performed recursively, updating beliefs as new data arrives—proved powerful for financial applications. The recursive structure is computationally efficient, processing data in a single forward pass rather than requiring batch optimization.
Economists in the 1980s and 1990s recognized that many economic phenomena could be modeled as unobservable states driving observable outcomes. The state-space framework unified seemingly disparate models: ARIMA models, structural time-series models, and dynamic factor models all became special cases. Stock and Watson's work on dynamic factor models [@stock2002forecasting] demonstrated how state-space methods could extract common factors from large panels of economic data, laying the foundation for modern nowcasting systems used by central banks worldwide.
Financial applications expanded in the 1990s with stochastic volatility models, term structure models, and credit risk models. Recent developments since the 2000s have integrated state-space models with machine learning. Deep state-space models [@andreini2020deep; @rangapuram2018deep] combine the interpretability of classical state-space models with the flexibility of neural networks, enabling modeling of complex nonlinear relationships while maintaining factor structure. These developments address limitations of linear models during structural breaks, as evidenced by the COVID-19 experience when traditional models struggled with rapidly changing economic relationships.
Formal Definition
A general state-space model consists of two equations, a transition equation and an observation equation:
$$x_t = F_t x_{t-1} + G_t u_t + w_t$$
$$y_t = H_t x_t + v_t$$
In these equations, $x_t \in \mathbb{R}^m$ represents the latent state vector at time $t$, the unobservable quantity we want to estimate. The observed data vector $y_t \in \mathbb{R}^n$ contains what we actually measure. The transition matrix $F_t \in \mathbb{R}^{m \times m}$ captures how the state persists and evolves from one period to the next, while the control matrix $G_t \in \mathbb{R}^{m \times p}$ maps any exogenous inputs $u_t \in \mathbb{R}^p$ into state changes. The observation matrix $H_t \in \mathbb{R}^{n \times m}$ links the hidden states to our measurements, revealing what aspects of the state we can observe. Process noise $w_t \in \mathbb{R}^m$ with covariance $Q_t \in \mathbb{R}^{m \times m}$ represents fundamental uncertainty in state evolution, while observation noise $v_t \in \mathbb{R}^n$ with covariance $R_t \in \mathbb{R}^{n \times n}$ captures measurement error.
The subscripts t on the matrices indicate that these can vary over time, enabling time-varying dynamics and observation structures. When these matrices are constant, we have a time-invariant system, which is common in many financial applications. Time-varying matrices become essential for handling mixed-frequency data and regime-switching models where relationships change over time, such as during financial crises when factor loadings may shift dramatically.
The transition matrix $F_t$ determines how the latent state evolves, capturing persistence of volatility regimes, mean reversion of credit spreads, or momentum in factor returns. The eigenvalues of $F_t$ determine stability: for asymptotically stable systems, all eigenvalues must satisfy $|\lambda_i| < 1$, ensuring state trajectories converge to a steady state. Eigenvalues near 1 indicate high persistence (near-nonstationarity), common in macroeconomics. The control matrix $G_t$ maps known interventions into state changes, such as central bank policy shocks. In many financial applications, control inputs are not used ($G_t = 0$). The observation matrix $H_t$ defines what we can observe about the latent state. For factor models, this matrix contains factor loadings showing how each asset responds to common drivers, as we explore in Section 6-02.
The noise covariances $Q_t$ and $R_t$ encode uncertainty in two distinct forms. The process noise covariance $Q_t$ represents fundamental uncertainty in state evolution, while the observation noise covariance $R_t$ captures measurement error. The signal-to-noise ratio determines how much we trust observations versus predictions. Often $Q_t$ and $R_t$ are assumed diagonal, meaning uncorrelated innovations, which simplifies estimation and interpretation.
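To make the roles of the matrices concrete, the following is a minimal simulation sketch of a time-invariant linear system with Gaussian noise and no control input. The function name `simulate_lgssm`, the parameter values, and the Gaussian noise assumption are illustrative choices, not part of the chapter's definitions.

```python
import numpy as np

def simulate_lgssm(F, H, Q, R, x0, T, seed=0):
    """Simulate x_t = F x_{t-1} + w_t and y_t = H x_t + v_t (no control input)."""
    rng = np.random.default_rng(seed)
    m, n = F.shape[0], H.shape[0]
    x = np.zeros((T, m))
    y = np.zeros((T, n))
    prev = x0
    for t in range(T):
        w = rng.multivariate_normal(np.zeros(m), Q)  # process noise, covariance Q
        v = rng.multivariate_normal(np.zeros(n), R)  # observation noise, covariance R
        x[t] = F @ prev + w
        y[t] = H @ x[t] + v
        prev = x[t]
    return x, y

# Hypothetical example: one persistent latent factor driving three observed series.
F = np.array([[0.9]])                    # transition: persistence of 0.9
H = np.array([[1.0], [0.5], [-0.3]])     # observation: factor loadings
Q = np.array([[0.1]])                    # process noise covariance
R = 0.2 * np.eye(3)                      # diagonal observation noise covariance
states, obs = simulate_lgssm(F, H, Q, R, x0=np.zeros(1), T=200)
print(states.shape, obs.shape)           # (200, 1) (200, 3)
```

Raising the entries of `R` relative to `Q` lowers the signal-to-noise ratio, which makes the latent factor harder to recover from the three observed series.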
The Markov Property
Classical state-space models assume the Markov property: the future state depends only on the current state, not the entire history:
$$p(x_t \mid x_{1:t-1}) = p(x_t \mid x_{t-1})$$
This "memoryless" property formalizes that the current state contains all information needed to predict the future. The past matters only through its influence on the current state. The key insight is that the current state can encode information from the past. If volatility clustering is important, we can include past volatility in the state vector. If cumulative stress matters, we can include it as a state component. The Markov property doesn't mean the past is irrelevant—it means the past is summarized in the current state.
The Markov property enables recursive inference: we update beliefs about $x_t$ using only the previous belief about $x_{t-1}$ and the new observation $y_t$, maintaining a compact representation that summarizes all past information. Some financial phenomena exhibit long memory or path dependence. In these cases, we can restore the Markov property by augmenting the state: include additional variables such as regime duration, cumulative stress, or moving averages so that the augmented state is Markovian.
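State augmentation can be illustrated with a process that depends on two lags. The sketch below, with hypothetical AR(2) coefficients, stacks the current and previous value into one state vector and checks that the resulting first-order recursion reproduces the original series.

```python
import numpy as np

phi1, phi2 = 0.5, 0.3            # hypothetical AR(2) coefficients
rng = np.random.default_rng(1)
eps = rng.normal(size=100)

# Direct AR(2) recursion: z_t depends on two lags, so z_t alone is not Markov.
z = np.zeros(100)
for t in range(2, 100):
    z[t] = phi1 * z[t - 1] + phi2 * z[t - 2] + eps[t]

# Augmented state x_t = [z_t, z_{t-1}] follows a first-order (Markov) recursion.
F = np.array([[phi1, phi2],
              [1.0, 0.0]])
x = np.zeros((100, 2))
for t in range(2, 100):
    x[t] = F @ x[t - 1] + np.array([eps[t], 0.0])

print(np.allclose(x[:, 0], z))   # True
```

The same idea applies to the variables named in the text: adding a component such as cumulative stress to the state vector lets a first-order transition carry the history that matters.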
The Markov property implies that the joint distribution of states factors as:
$$p(x_{1:T}) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1})$$
This factorization is the foundation for the Kalman filter, Kalman smoother, and EM algorithms which are fundamental to estimating dynamic state-space models. The Markov property enables efficient algorithms that process data sequentially rather than requiring batch optimization over all time periods. Instead of optimizing over the entire state sequence simultaneously, we can process data one time step at a time, updating our beliefs recursively as new information arrives.
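As a preview of the recursive updating described above, here is a minimal sketch of a scalar filter for a local-level model ($x_t = x_{t-1} + w_t$, $y_t = x_t + v_t$). The model choice, the function name `local_level_filter`, and the noise variances `q` and `r` are assumptions made for illustration; the full Kalman filter is developed in Section 6-02.

```python
import numpy as np

def local_level_filter(y, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for x_t = x_{t-1} + w_t, y_t = x_t + v_t."""
    x_est, p = x0, p0
    means, variances = [], []
    for obs in y:
        # Predict: only the previous belief (mean and variance) is needed.
        x_pred, p_pred = x_est, p + q
        # Update: blend prediction and observation using the Kalman gain.
        gain = p_pred / (p_pred + r)
        x_est = x_pred + gain * (obs - x_pred)
        p = (1.0 - gain) * p_pred
        means.append(x_est)
        variances.append(p)
    return np.array(means), np.array(variances)

rng = np.random.default_rng(0)
true_state = np.cumsum(rng.normal(scale=1.0, size=300))   # random-walk state
y = true_state + rng.normal(scale=1.0, size=300)          # noisy observations
means, variances = local_level_filter(y, q=1.0, r=1.0)
print(variances[-1])   # settles near the steady-state value (sqrt(5) - 1) / 2
```

Each pass through the loop touches one observation and carries forward only two numbers, which is the single forward pass that the factorization makes possible.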
Linear vs. Nonlinear Models
The advantage of linearity is interpretability. When transitions and observations are linear functions and noise is Gaussian, we obtain closed-form solutions for inference (the Kalman filter, which we explore in Section 6-02) that are both optimal and computationally efficient. Linear models are appropriate when relationships are approximately linear, Gaussian noise is reasonable, and computational efficiency is critical. The parameters have clear economic meaning: transition matrices show persistence and mean reversion, observation matrices reveal factor loadings, and noise covariances quantify uncertainty.
However, real-world financial dynamics are often nonlinear. Nonlinear models become necessary when regime switches occur, volatility clustering creates heteroskedasticity, or option surfaces exhibit complex nonlinearities. When regime shifts occur, previous parameter estimates become invalid, and simple demeaning or linear transformations don't capture the structural change. In this book, we introduce neural approximations for nonlinear state-space models. Deep Dynamic Factor Models (Sections 6-04 and 6-05) use neural networks to capture nonlinear relationships while maintaining the interpretable factor structure. The linear-Gaussian framework provides the foundation that these methods extend, ensuring that our nonlinear models reduce to linear models when relationships are approximately linear.
Financial Application Examples
Volatility modeling provides a clear example of state-space thinking. GARCH models [@engle1982arch; @bollerslev1986garch] treat volatility as a latent state that evolves over time, with current volatility depending on past squared returns and past volatility. The GARCH(1,1) model can be written in state-space form where the latent state is the conditional variance, and the observation is the squared return. Stochastic volatility models extend this framework by representing log-volatility as a latent state: $\log \sigma_t = \phi \log \sigma_{t-1} + w_t$ where $w_t \sim \mathcal{N}(0, \sigma_w^2)$, and $r_t = \sigma_t \varepsilon_t$ where $\varepsilon_t \sim \mathcal{N}(0, 1)$. The latent state $x_t = \log \sigma_t$ evolves as an AR(1) process with persistence $\phi$ (typically 0.95-0.99), while observed returns $r_t$ depend on this hidden volatility through a multiplicative relationship $r_t = \exp(x_t)\,\varepsilon_t$. This framework underpins option pricing models and risk management systems, where accurate volatility estimation is crucial for pricing derivatives and calculating value-at-risk.
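The stochastic volatility equations above can be simulated directly. The following sketch uses hypothetical values for the persistence `phi` and the state noise scale `sigma_w`; the function name `simulate_sv` is likewise illustrative.

```python
import numpy as np

def simulate_sv(T, phi=0.97, sigma_w=0.15, seed=0):
    """Simulate log sigma_t = phi * log sigma_{t-1} + w_t and r_t = sigma_t * eps_t."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)   # latent state x_t = log sigma_t
    r = np.zeros(T)   # observed returns
    prev = 0.0
    for t in range(T):
        x[t] = phi * prev + rng.normal(scale=sigma_w)   # AR(1) transition
        r[t] = np.exp(x[t]) * rng.normal()              # multiplicative observation
        prev = x[t]
    return x, r

log_vol, returns = simulate_sv(T=1000)
sigma = np.exp(log_vol)   # implied volatility path, positive by construction
print(log_vol.shape, returns.shape)   # (1000,) (1000,)
```

Only `returns` would be available in practice; `log_vol` is the hidden path that filtering methods try to recover. Because the observation equation is multiplicative, this model is not linear-Gaussian as written.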
Hidden default intensity processes model credit cycles using state-space structure. The unobservable default intensity $\lambda_t$ evolves over time, driving observed defaults and credit spread movements: $\lambda_t = \phi \lambda_{t-1} + w_t$ and $\text{Default}(t) \sim \text{Poisson}(\lambda_t)$. When $\lambda_t$ is high, defaults are more frequent and credit spreads widen. When $\lambda_t$ is low, the credit environment is benign. This framework supports portfolio credit risk models and CDS pricing.
Key Properties of State-Space Systems
Three fundamental properties—observability, controllability, and stability—determine what we can learn from data, what we can control, and whether the system behaves well over time.
A system is observable if we can uniquely determine the initial state $x_0$ from a finite sequence of observations $y_{1:T}$. The observability matrix $\mathcal{O} = [H^\top, (HF)^\top, (HF^2)^\top, \ldots, (HF^{m-1})^\top]^\top$, where $m$ is the state dimension, must have full column rank (rank = $m$). In factor models, observability ensures that factor loadings are sufficiently diverse; if all assets load identically on factors, we cannot distinguish factor values from loadings.
A system is controllable if we can drive the state to any desired value using control inputs $u_t$ in finite time. The controllability matrix $\mathcal{C} = [G, FG, F^2 G, \ldots, F^{m-1} G]$ must have full row rank (rank = $m$). Controllability matters for policy interventions, though many financial systems are observable but not fully controllable: we can estimate factors but cannot directly control them.
A system is stable if state trajectories remain bounded over time. For linear time-invariant systems, stability is determined by the eigenvalues of the transition matrix $F$: all eigenvalues must satisfy $|\lambda_i| < 1$ for asymptotic stability. Eigenvalues near 1 indicate near-nonstationarity, common in macroeconomics where differencing may be required.
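The three properties can be checked numerically for a time-invariant system. The sketch below builds both matrices from powers of $F$ up to $m-1$, where $m$ is the state dimension. The helper names and the example matrices are hypothetical.

```python
import numpy as np

def observability_matrix(F, H):
    """Stack H, HF, ..., HF^(m-1), where m is the state dimension."""
    m = F.shape[0]
    return np.vstack([H @ np.linalg.matrix_power(F, k) for k in range(m)])

def controllability_matrix(F, G):
    """Concatenate G, FG, ..., F^(m-1)G column-wise."""
    m = F.shape[0]
    return np.hstack([np.linalg.matrix_power(F, k) @ G for k in range(m)])

def is_stable(F):
    """Asymptotic stability: every eigenvalue lies strictly inside the unit circle."""
    return bool(np.max(np.abs(np.linalg.eigvals(F))) < 1.0)

F = np.array([[0.9, 0.0],
              [0.0, 0.5]])
H = np.array([[1.0, 1.0]])     # one observed series loading on both states
G = np.array([[1.0], [0.0]])   # the input reaches only the first state

O = observability_matrix(F, H)
C = controllability_matrix(F, G)
print(np.linalg.matrix_rank(O) == 2)   # True: observable
print(np.linalg.matrix_rank(C) == 2)   # False: the second state cannot be steered
print(is_stable(F))                    # True
```

This example is observable but not fully controllable, the situation the text describes as typical for financial systems.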
State-Space Models vs. Alternative Frameworks
Understanding when to use state-space models versus alternatives helps guide model selection in financial AI applications.
ARIMA models are special cases of state-space models, suitable for simple univariate forecasting with complete data. State-space models become preferable when handling missing data, building multivariate systems, or quantifying uncertainty in latent states—common requirements in financial applications.
VAR models [@sims1980macroeconomics] assume all variables are observable. State-space models allow latent states, enabling dimensionality reduction, handling missing data naturally, and working with mixed frequencies. Traditional machine learning focuses on prediction but provides limited interpretability. State-space models provide interpretable factors and uncertainty quantification, valuable for risk management and regulatory compliance. Hybrid approaches, such as the deep state-space models we explore in Section 6-04, combine both: neural networks capture complex relationships while maintaining interpretable factor structure [@andreini2020deep].
HMMs use discrete states for regime detection, while state-space models use continuous states, enabling factor extraction and nowcasting with richer dynamics. State-space models are particularly valuable when extracting latent factors, handling missing data, working with mixed frequencies, or requiring interpretable uncertainty quantification.
Finding Latent State
Now that we have expressed our system in state-space form (Section 6-01), we face the fundamental challenge: how do we extract the latent states from noisy, incomplete observations? In finance, we observe market prices, economic indicators, and financial time series, but we need to extract the hidden states that govern their evolution—volatility regimes, credit cycles, market sentiment, and common risk factors.
This section covers the fundamental methods for estimating state-space models, progressing systematically from the simplest approach (Principal Component Analysis) through recursive filtering (Kalman filter), smoothing (Kalman smoother), to joint state-parameter estimation (EM algorithm). Each method builds on the previous, solving increasingly complex inference problems. This progression from static to dynamic, from simple to sophisticated, mirrors the evolution of factor modeling in finance and sets the foundation for nonlinear extensions in Section 6-04.
The Estimation Challenge
Extracting latent states from financial data presents three fundamental challenges: observations are inherently noisy (market prices reflect trading noise and liquidity effects, economic indicators have measurement error), information is often incomplete (missing values, publication lags, mixed frequencies), and latent states are never directly measured—we only see their effects through observable variables. We cannot simply invert the observation equation because observations are noisy and the system is underdetermined.
We approach this through a hierarchy of methods. Principal Component Analysis (PCA) provides the foundation, treating each time period independently to extract static factors. The Kalman filter adds temporal structure, recursively estimating states as new data arrives. The Kalman smoother uses all observations to refine historical estimates. The EM algorithm jointly estimates states and parameters. Finally, variational inference and deep extensions (Section 6-04) handle nonlinear relationships. Each method builds on the previous, solving increasingly complex inference problems.
Principal Component Analysis: Linear Dimension Reduction
Principal Component Analysis (PCA) is the simplest method for extracting latent factors from observed data [@hotelling1933pca]. It finds linear combinations of observed variables that capture maximum variance, directly connecting to factor models. In finance, when assets move together during market-wide movements, PCA identifies these common directions as principal components. The first principal component often captures the "market factor"—the common driver affecting all assets—while remaining components capture progressively less important patterns. The computational efficiency of PCA (requiring only eigendecomposition) makes it ideal for initialization in more sophisticated methods.
Before applying PCA, we must preprocess the data. Given data $X \in \mathbb{R}^{T \times n}$ where each row is a time period and each column is a series, we first center the data: $X \leftarrow X - \bar{X}$ where $\bar{X}$ contains the column means. Centering is essential because the covariance matrix measures variation around the mean. In finance, we typically do not scale the data when working with returns, because asset returns are already on a similar scale. However, when combining series with very different units, scaling may be necessary to prevent high-variance series from dominating the analysis.
Eigenvalue Decomposition: Mathematical Foundation
Eigenvalue decomposition is the mathematical operation that underlies PCA. For a symmetric matrix Σ (like a covariance matrix), we can decompose it as:
$$\Sigma = P \Lambda P^\top$$
where:
$P$ is an orthogonal matrix (columns are orthonormal eigenvectors): $P^\top P = I$
$\Lambda$ is a diagonal matrix containing eigenvalues: $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
Why Eigenvalue Decomposition? The eigenvectors of the covariance matrix point in directions where data varies most. The first eigenvector (corresponding to the largest eigenvalue) is the direction of maximum variance. The second eigenvector (orthogonal to the first) captures the maximum remaining variance, and so on. This decomposition is unique (up to the sign of each eigenvector) when the eigenvalues are distinct.
Geometric Interpretation: Think of the data as a cloud of points. The eigenvectors are the principal axes of this cloud. The eigenvalues tell us how "spread out" the data is along each axis. PCA finds these principal axes automatically.
PCA proceeds through eigenvalue decomposition of the covariance matrix. For centered data $X$, we compute the covariance matrix:
$$\Sigma = \frac{1}{T} X^\top X$$
which captures how each pair of series co-varies. The decomposition:
$$\Sigma = P \Lambda P^\top$$
yields eigenvectors $P$ (principal directions, orthonormal columns) and eigenvalues $\Lambda$ (variances along each direction, ordered $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$). The eigenvectors point in directions where data varies most, with the first eigenvector capturing maximum variance, the second capturing maximum remaining variance orthogonal to the first, and so on.
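The centering, covariance, and decomposition steps can be sketched with NumPy as follows. The random panel is a stand-in for real return data, and the $1/T$ scaling follows the covariance formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 4))         # hypothetical T x n panel of returns
X = X - X.mean(axis=0)                # center each column

Sigma = X.T @ X / X.shape[0]          # covariance with 1/T scaling
eigvals, P = np.linalg.eigh(Sigma)    # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # reorder so the largest eigenvalue comes first
eigvals, P = eigvals[order], P[:, order]

print(np.allclose(P.T @ P, np.eye(4)))                  # True: orthonormal columns
print(np.allclose(P @ np.diag(eigvals) @ P.T, Sigma))   # True: Sigma is recovered
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.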
Complete PCA Example: Step-by-Step Calculation
Let's work through a concrete numerical example to illustrate PCA.
Given Data: 3 asset returns over 5 time periods (after centering):
For t=1: $f_1 = 0.816(1.0) + 0.408(0.5) + 0.408(0.5) = 1.224$
Computing for all periods:
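$$f = [1.224, 0.980, -0.490, -0.294, -1.224]^\top$$
Step 5: Reconstruct Data
$$\hat{x}_t = P_1 f_t = P_1 P_1^\top x_t$$
For t=1:
$$\hat{x}_1 = \begin{bmatrix} 0.816 \\ 0.408 \\ 0.408 \end{bmatrix} \cdot 1.224 = \begin{bmatrix} 0.999 \\ 0.500 \\ 0.500 \end{bmatrix}$$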
The reconstruction is close to the original (since first PC explains most variance).
Step 6: Variance Explained
Total variance: $\operatorname{tr}(\Sigma) = 0.596 + 0.148 + 0.380 = 1.124$
Variance explained by first PC: $\lambda_1 = 1.124$
Proportion explained: $1.124 / 1.124 = 100\%$ (in this example, data lies on a line, so one PC explains everything)
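The same sequence of steps can be run end to end in code. The data below are illustrative and are not the values of the example above: they are constructed to lie exactly on a line with direction $(2, 1, 1)/\sqrt{6} \approx (0.816, 0.408, 0.408)$, so a single component should explain all of the variance.

```python
import numpy as np

p = np.array([2.0, 1.0, 1.0]) / np.sqrt(6.0)       # approx (0.816, 0.408, 0.408)
f_true = np.array([1.2, 1.0, -0.5, -0.5, -1.2])    # illustrative factor path, mean zero
X = np.outer(f_true, p)                            # 5 periods x 3 assets, already centered

Sigma = X.T @ X / X.shape[0]                       # covariance with 1/T scaling
eigvals, vecs = np.linalg.eigh(Sigma)              # ascending eigenvalues
p1 = vecs[:, -1]                                   # eigenvector of the largest eigenvalue
p1 = p1 if p1[0] > 0 else -p1                      # fix the sign for readability

f = X @ p1                                         # factor values f_t = p1' x_t
X_hat = np.outer(f, p1)                            # rank-one reconstruction
share = eigvals[-1] / eigvals.sum()                # proportion of variance explained

print(np.round(p1, 3))         # [0.816 0.408 0.408]
print(np.allclose(X_hat, X))   # True
print(round(share, 6))         # 1.0
```

With real returns the data never lie exactly on a line, so the share falls below one and the reconstruction is only approximate.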
Why Project? Geometric and Algebraic Interpretation
Geometric Interpretation: Projecting data onto eigenvectors means finding coordinates in a new coordinate system. The original data lives in n-dimensional space (one dimension per series). The eigenvectors define a new coordinate system aligned with directions of maximum variation. Projecting means finding where each data point lies in this new coordinate system.
Algebraic Interpretation: The projection $f_t = P_k^\top x_t$ computes the dot product between the data vector $x_t$ and each eigenvector (column of $P_k$). This measures how much the data aligns with each principal direction. Large values mean the data varies strongly in that direction.
Why This Works: Since eigenvectors point in directions of maximum variance, projecting onto them captures the most important patterns in the data. We discard dimensions with little variance (noise) and keep dimensions with high variance (signal).
We project data onto the first $k$ eigenvectors to get factor values:
$$f_t = P_k^\top x_t$$
where $P_k$ contains the first $k$ columns of $P$ and $f_t \in \mathbb{R}^k$ are the factor values at time $t$. The factors reconstruct the original data via:
$$x_t \approx P_k f_t = P_k P_k^\top x_t$$
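A short sketch of projection and reconstruction with $k = 2$ factors follows. The simulated panel (two latent drivers plus noise) is a hypothetical stand-in for real data. The final check uses a standard property of PCA: with the $1/T$ covariance, the average squared reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(42)
T, n, k = 500, 6, 2
# Hypothetical panel: two latent drivers plus idiosyncratic noise.
loadings = rng.normal(size=(n, k))
X = rng.normal(size=(T, k)) @ loadings.T + 0.3 * rng.normal(size=(T, n))
X = X - X.mean(axis=0)                 # center each series

Sigma = X.T @ X / T
eigvals, P = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]      # descending eigenvalue order
eigvals, P = eigvals[order], P[:, order]

P_k = P[:, :k]             # first k eigenvectors
factors = X @ P_k          # row t holds f_t = P_k' x_t
X_hat = factors @ P_k.T    # row t holds the reconstruction P_k f_t

mse = np.sum((X - X_hat) ** 2) / T
print(np.isclose(mse, eigvals[k:].sum()))   # True
```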
Understanding Factor Loadings
Factor loadings are the coefficients that tell us how much each observed series responds to each factor. In the matrix $P_k$, element $P_k[i,j]$ is the loading of series $i$ on factor $j$.
Interpretation:
High positive loading (e.g., 0.8): Series i moves strongly in the same direction as factor j
Low loading (e.g., 0.1): Series i is relatively insensitive to factor j
Negative loading (e.g., -0.5): Series i moves opposite to factor j
Example: If Pk[1,1]=0.816 and Pk[2,1]=0.408, then series 1 loads twice as strongly on factor 1 as series 2. When factor 1 increases by 1 unit, series 1 increases by 0.816 units, while series 2 increases by 0.408 units.
In financial terms, projecting returns onto the leading eigenvectors identifies the common risk factors that drive asset returns.
Two Equivalent Formulations of PCA
PCA has two equivalent formulations that lead to the same solution:
1. Variance Maximization: Find directions (eigenvectors) that maximize variance of projected data:
$$\max_{p}\ \operatorname{Var}(p^T x_t) = \max_{p}\ p^T \Sigma p \quad \text{subject to } \|p\| = 1$$
2. Reconstruction Error Minimization: Find directions that minimize reconstruction error:
$$\min_{p}\ E\left[\|x_t - p p^T x_t\|^2\right] = \min_{p}\ E\left[\|x_t - \hat{x}_t\|^2\right] \quad \text{subject to } \|p\| = 1$$
Why They're Equivalent: Maximizing variance is equivalent to minimizing reconstruction error. This is because:
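$$\operatorname{tr}\operatorname{Var}(x_t) = \operatorname{tr}\operatorname{Var}(\hat{x}_t) + \operatorname{tr}\operatorname{Var}(x_t - \hat{x}_t)$$
The decomposition holds because the reconstruction $\hat{x}_t$ and the residual $x_t - \hat{x}_t$ are orthogonal. Total variance is fixed, so maximizing the variance of the reconstruction $\hat{x}_t$ minimizes the variance of the residual $x_t - \hat{x}_t$ (the reconstruction error).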
Mathematical Proof: The variance maximization problem leads to the eigenvalue equation Σp=λp, which is solved by eigenvectors. The reconstruction error minimization problem leads to the same eigenvalue equation. Therefore, both formulations yield identical solutions.
This fundamental property means PCA simultaneously maximizes variance in the latent space and minimizes reconstruction error in the original space.
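The equivalence can be checked numerically. The sketch below uses simulated data (an assumption, not the chapter's data) and verifies that, for any unit vector $p$, projected variance plus reconstruction error equals total variance, so the leading eigenvector wins on both criteria at once.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated data: 500 observations of 4 series.
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
Sigma = X.T @ X / (len(X) - 1)

def projected_variance(p):
    """Variance of the data projected onto the unit vector p."""
    return float(p @ Sigma @ p)

def reconstruction_error(p):
    """Mean squared error of the rank-one reconstruction p p^T x_t."""
    resid = X - np.outer(X @ p, p)
    return float((resid ** 2).sum() / (len(X) - 1))

total_var = float(np.trace(Sigma))
eigvals, eigvecs = np.linalg.eigh(Sigma)
p1 = eigvecs[:, -1]                        # eigenvector with the largest eigenvalue

# Random unit vectors as competing directions.
candidates = rng.normal(size=(1000, 4))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

# For every unit vector: total variance = projected variance + reconstruction error.
for p in candidates[:5]:
    assert np.isclose(projected_variance(p) + reconstruction_error(p), total_var)

best_random_var = max(projected_variance(p) for p in candidates)
print(projected_variance(p1) >= best_random_var)
```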
The proportion of variance explained by the $i$-th principal component is $\lambda_i / \sum_{j=1}^{n} \lambda_j$, and the cumulative variance explained by the first $k$ components is $\sum_{i=1}^{k} \lambda_i / \sum_{j=1}^{n} \lambda_j$. In equity returns, the first PC often explains 30-50% of variance (market factor), the second 10-15% (size or sector factor), and subsequent components progressively less [@stock2002forecasting]. This rapid decay justifies using only a few factors, allowing investors to focus on key drivers rather than tracking hundreds of individual assets.
PCA factors are linear combinations of returns:
$$f_{j,t} = \sum_{i=1}^{n} P_{ij}\, r_{i,t}$$
where $P_{ij}$ is the loading of series $i$ on factor $j$. In factor model notation, $x_t = P_k f_t + \varepsilon_t$ where $\Lambda = P_k$ (loadings are eigenvectors), $f_t = P_k^T x_t$ (factors are projections), and $\varepsilon_t = x_t - P_k P_k^T x_t$ (reconstruction error). The total variance decomposes as:
$$\operatorname{tr}(\Sigma) = \sum_{i=1}^{k} \lambda_i + \sum_{i=k+1}^{n} \lambda_i$$
where the first term is factor variance (systematic risk) and the second is idiosyncratic variance (diversifiable risk). This decomposition is fundamental to portfolio theory: factor risk cannot be diversified away, while idiosyncratic risk can be reduced through diversification [@fama1992common].
Despite our use of time-series data, PCA treats each time period independently: this is static dimension reduction, not dynamic modeling. PCA computes factors $f_t$ from cross-sectional data at time $t$, with no connection to factors at other time periods. This static approach ignores temporal dependencies that are crucial in finance: volatility clustering, regime switches, momentum, and mean reversion. To estimate dynamic models, we need methods that explicitly model how states evolve: $x_t$ depends on $x_{t-1}$ through the transition equation, as we defined in Section 6-01. This motivates the Kalman filter, which captures temporal dependencies while maintaining the factor structure.
Despite its limitations, PCA provides crucial initial values for the EM algorithm through a three-step process: extract first k principal components to get initial loadings, use PCA factors as initial factor estimates, and regress factors on lagged factors to get initial transition matrix. This PCA initialization is crucial for EM convergence, providing reasonable starting values that the algorithm then refines. Without good initialization, EM may converge to poor local optima or fail to converge at all, making PCA initialization essential for practical applications.
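The three-step initialization can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions (standardized data without missing values, a VAR(1) for the factors, diagonal observation noise); the function name `pca_init` and the simulated data are hypothetical and do not describe any particular package's internals.

```python
import numpy as np

def pca_init(X, k):
    """PCA-based starting values for EM. X is (T, n), standardized, no NaNs."""
    T = X.shape[0]
    Sigma = X.T @ X / (T - 1)
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # ascending eigenvalues
    Lambda0 = eigvecs[:, ::-1][:, :k]             # step 1: loadings = top-k eigenvectors
    F0 = X @ Lambda0                              # step 2: factors = projections (T, k)
    # Step 3: regress f_t on f_{t-1}; lstsq solves F0[:-1] @ B = F0[1:], so A = B^T.
    B, *_ = np.linalg.lstsq(F0[:-1], F0[1:], rcond=None)
    A0 = B.T
    resid_f = F0[1:] - F0[:-1] @ B
    Q0 = resid_f.T @ resid_f / (T - 1)            # factor innovation covariance
    resid_x = X - F0 @ Lambda0.T
    R0 = np.diag(resid_x.var(axis=0))             # diagonal observation noise
    return Lambda0, F0, A0, Q0, R0

# Hypothetical example: one AR(1) factor with coefficient 0.8 driving five series.
rng = np.random.default_rng(2)
T, n = 500, 5
f = np.zeros(T)
for t in range(1, T):
    f[t] = 0.8 * f[t - 1] + rng.normal()
X = np.outer(f, np.ones(n)) + 0.3 * rng.normal(size=(T, n))
X = (X - X.mean(axis=0)) / X.std(axis=0)

Lambda0, F0, A0, Q0, R0 = pca_init(X, k=1)
print("initial transition estimate:", A0.round(2))
```

The sign of a PCA factor is arbitrary, but the estimated transition coefficient is unaffected because flipping the sign of the factor flips both sides of the regression.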
From States to Factors: Core Concepts
Before introducing dynamic methods, we clarify the terminology: "states" and "factors" both refer to latent variables, but with different emphasis. Factors are the conceptual drivers (e.g., "market risk", "credit cycle") that affect multiple observed series. States are the time-evolving realizations of these factors (e.g., "market risk is high today"). In state-space models, states evolve through time following the transition equation, while factors are the underlying drivers that states represent. Factor loadings measure how much each observed series responds to each factor, appearing in the observation matrix Ht.
A factor model expresses returns as:
$$r_{i,t} = \alpha_i + \beta_i^T f_t + \varepsilon_{i,t}$$
where $r_{i,t}$ is the return of asset $i$ at time $t$, $\alpha_i$ is the asset-specific intercept, $\beta_i \in \mathbb{R}^k$ are factor loadings (sensitivities to factors), $f_t \in \mathbb{R}^k$ are common factors (latent drivers), and $\varepsilon_{i,t}$ is the idiosyncratic return (asset-specific shock). Assuming factors and idiosyncratic returns are uncorrelated, the variance decomposes as:
$$\operatorname{Var}(r_i) = \beta_i^T \Sigma_f \beta_i + \sigma_{\varepsilon_i}^2$$
where $\Sigma_f$ is the factor covariance matrix. The term $\beta_i^T \Sigma_f \beta_i$ is factor risk (systematic risk), and $\sigma_{\varepsilon_i}^2$ is idiosyncratic risk (diversifiable risk). This decomposition is fundamental to portfolio theory: factor risk cannot be diversified away because all assets load on common factors, while idiosyncratic risk can be reduced through diversification as asset-specific shocks average out. In volatile markets, factor loadings may evolve: assets may become more or less sensitive to market risk during crises, a phenomenon known as "beta instability." In the current linear setup, loadings are constant, which is easy to interpret but does not model evolving sensitivities, motivating nonlinear extensions (Section 6-04) that allow loadings to depend on the state or regime.
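A quick simulation makes the variance decomposition concrete. All numbers below (the factor covariance, the loadings, the idiosyncratic volatility) are illustrative assumptions, not estimates from data.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 200_000

# Hypothetical two-factor model for a single asset.
Sigma_f = np.array([[1.0, 0.3],
                    [0.3, 0.5]])       # factor covariance matrix
beta = np.array([0.9, -0.4])           # factor loadings of the asset
sigma_eps = 0.6                        # idiosyncratic volatility

f = rng.multivariate_normal(np.zeros(2), Sigma_f, size=T)
eps = sigma_eps * rng.normal(size=T)   # independent of the factors
r = 0.01 + f @ beta + eps              # r_t = alpha + beta^T f_t + eps_t

factor_var = float(beta @ Sigma_f @ beta)   # systematic risk
idio_var = sigma_eps ** 2                   # diversifiable risk

print(f"factor variance:        {factor_var:.3f}")
print(f"idiosyncratic variance: {idio_var:.3f}")
print(f"sample variance of r:   {r.var():.3f}")
```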
State Estimation: Kalman Filter
The Kalman filter provides optimal state estimation for linear-Gaussian state-space models. We compute the posterior $p(x_t \mid y_{1:t}) = \mathcal{N}(\hat{x}_{t|t}, P_{t|t})$ recursively using Bayesian updating.
Derivation from Bayes' theorem: The posterior combines prediction (prior) and observation (likelihood):
$$p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})$$
Interpretation: The filter recursively updates state estimates as new observations arrive. Notice how:
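Uncertainty does not increase after an update ($P_{t|t} \preceq P_{t|t-1}$ in the positive semidefinite ordering)
Kalman gain typically declines toward a steady-state value when the system matrices are constant and the initial uncertainty is large (predictions become more confident)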
State estimates track observations while smoothing noise
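For linear-Gaussian systems, both terms are Gaussian, so the posterior is also Gaussian. The prediction step propagates the previous posterior through the transition:
$$p(x_t \mid y_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}$$
Why This Integral? We're marginalizing over $x_{t-1}$: we don't know the exact previous state, only its distribution. We average over all possible values of $x_{t-1}$, weighted by their probability.
Mathematical Derivation: Since $p(x_t \mid x_{t-1}) = \mathcal{N}(F_t x_{t-1} + G_t u_t, Q_t)$ and $p(x_{t-1} \mid y_{1:t-1}) = \mathcal{N}(\hat{x}_{t-1|t-1}, P_{t-1|t-1})$, we have:
$$\hat{x}_{t|t-1} = F_t \hat{x}_{t-1|t-1} + G_t u_t, \qquad P_{t|t-1} = F_t P_{t-1|t-1} F_t^T + Q_t$$
where we've included the control term $G_t u_t$ (often zero in financial applications).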
The update step combines prediction and observation. The likelihood is $p(y_t \mid x_t) = \mathcal{N}(H_t x_t, R_t)$, and the prior is $p(x_t \mid y_{1:t-1}) = \mathcal{N}(\hat{x}_{t|t-1}, P_{t|t-1})$. The posterior mean and covariance are:
$$K_t = P_{t|t-1} H_t^T \left(H_t P_{t|t-1} H_t^T + R_t\right)^{-1}$$
$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t \left(y_t - H_t \hat{x}_{t|t-1}\right)$$
$$P_{t|t} = (I - K_t H_t)\, P_{t|t-1}$$
The Kalman gain $K_t$ balances two sources of information:
Prediction uncertainty $P_{t|t-1}$: Large uncertainty → trust observations more → larger $K_t$
Observation uncertainty $R_t$: Large uncertainty → trust prediction more → smaller $K_t$
Limiting Cases:
$R_t \to \infty$ (very noisy observations): $K_t \to 0$ (ignore observations, use prediction)
$P_{t|t-1} \to \infty$ (very uncertain prediction): $K_t \to H_t^{-1}$ (trust observations completely, if $H_t$ is invertible)
The innovation $y_t - H_t \hat{x}_{t|t-1}$ measures the prediction error; the gain determines how much of that error is passed on to the state estimate.
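The filter provides MMSE estimates: $E[x_t \mid y_{1:t}] = \hat{x}_{t|t}$ minimizes $E[\|x_t - \hat{x}_{t|t}\|^2]$. Missing observations are handled by setting $R_t[i,i] = \infty$ for missing series $i$, which drives the corresponding column of the gain to zero ($K_t[:, i] = 0$, since $K_t$ has one column per observed series) so that the observation is ignored.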
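The prediction and update steps, together with the missing-data rule, can be sketched in NumPy as below. This is a minimal illustration with time-invariant matrices and no control term; the function name, the local-level example, and its parameter values are assumptions. Instead of literally setting $R_t[i,i] = \infty$, the sketch drops the rows of missing series, which is the equivalent limit.

```python
import numpy as np

def kalman_filter(y, F, H, Q, R, x0, P0):
    """Linear-Gaussian Kalman filter. y is (T, n); NaN marks a missing entry."""
    T, m = y.shape[0], F.shape[0]
    x_pred = np.zeros((T, m)); P_pred = np.zeros((T, m, m))
    x_filt = np.zeros((T, m)); P_filt = np.zeros((T, m, m))
    x, P = x0, P0
    for t in range(T):
        # Prediction step: propagate the previous posterior through the transition.
        x = F @ x
        P = F @ P @ F.T + Q
        x_pred[t], P_pred[t] = x, P
        # Update step, using only the series observed at time t.
        obs = ~np.isnan(y[t])
        if obs.any():
            Ht = H[obs]
            Rt = R[np.ix_(obs, obs)]
            S = Ht @ P @ Ht.T + Rt                 # innovation covariance
            K = P @ Ht.T @ np.linalg.inv(S)        # Kalman gain
            x = x + K @ (y[t, obs] - Ht @ x)
            P = (np.eye(m) - K @ Ht) @ P
        x_filt[t], P_filt[t] = x, P
    return x_pred, P_pred, x_filt, P_filt

# Hypothetical local-level example: random-walk state observed with noise.
rng = np.random.default_rng(0)
T = 100
level = np.cumsum(rng.normal(scale=0.1, size=T))
y = (level + rng.normal(size=T)).reshape(-1, 1)
y[40:45] = np.nan                                  # a block of missing observations

F = np.array([[1.0]]); H = np.array([[1.0]])
Q = np.array([[0.01]]); R = np.array([[1.0]])
x_pred, P_pred, x_filt, P_filt = kalman_filter(
    y, F, H, Q, R, x0=np.zeros(1), P0=np.eye(1) * 10.0
)
print("filtered variance, first and last period:", P_filt[0, 0, 0], P_filt[-1, 0, 0])
```

During the missing block the filtered estimate equals the prediction and the state variance grows by `Q` each period, which is the behavior the limiting case $R_t \to \infty$ describes.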
Kalman Smoother: Using All Information
Detailed Derivation of Kalman Smoother
The smoother estimates $p(x_t \mid y_{1:T})$ using all observations, refining past estimates with future information.
Intuition: The filter uses only past and current observations ($y_{1:t}$). The smoother uses all observations ($y_{1:T}$), including future ones. This allows us to refine past estimates: if we know what happened later, we can better estimate what the state was earlier.
Mathematical Derivation: We factor the joint distribution:
$$p(x_t \mid y_{1:T}) = \int p(x_t \mid x_{t+1}, y_{1:T})\, p(x_{t+1} \mid y_{1:T})\, dx_{t+1}$$
The key insight: $x_t$ is conditionally independent of future observations $y_{t+1:T}$ given $x_{t+1}$ (Markov property). So $p(x_t \mid x_{t+1}, y_{1:T}) = p(x_t \mid x_{t+1}, y_{1:t})$.
Deriving the Conditional: The conditional $p(x_t \mid x_{t+1}, y_{1:t})$ is derived from the joint $p(x_t, x_{t+1} \mid y_{1:t})$. Using properties of multivariate Gaussians:
$$J_t = P_{t|t} F_{t+1}^T P_{t+1|t}^{-1}$$
$$\hat{x}_{t|T} = \hat{x}_{t|t} + J_t \left(\hat{x}_{t+1|T} - \hat{x}_{t+1|t}\right)$$
$$P_{t|T} = P_{t|t} + J_t \left(P_{t+1|T} - P_{t+1|t}\right) J_t^T$$
Interpretation: The smoother corrects filter estimates using future information: $\hat{x}_{t|T} = \hat{x}_{t|t} + \text{correction}$. The correction term $J_t(\hat{x}_{t+1|T} - \hat{x}_{t+1|t})$ propagates future information backward. If the future smoothed estimate differs from the filter's prediction, we adjust the current estimate accordingly.
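The backward recursion can be sketched as a second pass over stored filter output. The code below is a compact illustration for a time-invariant model with no missing data; the function name and the simulated local-level series are assumptions made for the example.

```python
import numpy as np

def filter_and_smooth(y, F, H, Q, R, x0, P0):
    """Forward Kalman filter followed by the backward smoothing recursion. y is (T, n)."""
    T, m = len(y), F.shape[0]
    xp = np.zeros((T, m)); Pp = np.zeros((T, m, m))   # predicted moments
    xf = np.zeros((T, m)); Pf = np.zeros((T, m, m))   # filtered moments
    x, P = x0, P0
    for t in range(T):                                # forward pass (filter)
        x, P = F @ x, F @ P @ F.T + Q
        xp[t], Pp[t] = x, P
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (y[t] - H @ x)
        P = (np.eye(m) - K @ H) @ P
        xf[t], Pf[t] = x, P
    xs, Ps = xf.copy(), Pf.copy()                     # at t = T the two coincide
    for t in range(T - 2, -1, -1):                    # backward pass (smoother)
        J = Pf[t] @ F.T @ np.linalg.inv(Pp[t + 1])    # smoother gain J_t
        xs[t] = xf[t] + J @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T
    return xf, Pf, xs, Ps

# Hypothetical local-level data.
rng = np.random.default_rng(1)
T = 200
state = np.cumsum(rng.normal(scale=np.sqrt(0.1), size=T))
y = (state + rng.normal(size=T)).reshape(-1, 1)

F = H = np.array([[1.0]])
Q = np.array([[0.1]]); R = np.array([[1.0]])
xf, Pf, xs, Ps = filter_and_smooth(y, F, H, Q, R, np.zeros(1), np.eye(1))
print("mid-sample variance, filtered vs smoothed:", Pf[100, 0, 0], Ps[100, 0, 0])
```

The smoothed variance is never larger than the filtered variance, and the two agree at the final period because no future observations remain.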
Complete Kalman Smoother Example
Continuing from the Kalman filter example above, let's compute smoothed estimates.
EM estimates parameters $\theta = \{F, H, Q, R\}$ by iterating between state estimation (E-step) and parameter estimation (M-step). The likelihood $p(y_{1:T} \mid \theta)$ is awkward to maximize directly (it integrates over all possible state sequences), but the complete-data likelihood factors as:
$$p(x_{1:T}, y_{1:T} \mid \theta) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1}, \theta) \prod_{t=1}^{T} p(y_t \mid x_t, \theta)$$
Why EM? We can't easily maximize $p(y_{1:T} \mid \theta)$ directly because the states $x_t$ are unobserved. EM works with the complete-data likelihood $p(x_{1:T}, y_{1:T} \mid \theta)$, which factors nicely, and handles the unobserved states by taking expectations.
R(0): Residual covariance from observation equation
This initialization is crucial because PCA factors capture the main directions of variation, and the algorithm starts close to a good solution. Without good initialization, EM may converge to poor local optima or fail to converge at all.
EM Algorithm Limitations and Strengths
Limitations:
May converge to local optima (not global maximum)
Convergence can be slow (10-100 iterations)
Each iteration requires full forward-backward pass through Kalman filter and smoother (computationally expensive)
Strengths:
Provides closed-form updates (no numerical optimization needed)
Guarantees monotonic likelihood increase
Works well with proper initialization (PCA)
Handles missing data naturally (via Kalman filter)
Standard method for linear dynamic factor models
Despite limitations, EM remains the standard method for estimating linear dynamic factor models because it provides closed-form updates, guarantees monotonic likelihood increase, and works well with proper initialization.
Dynamic Factor Model: Practical Guide
This section provides a hands-on tutorial for building Dynamic Factor Models (DFMs) with the dfm-python package. We follow the workflow used by the Federal Reserve Bank of New York [@frbnynowcast] for nowcasting and forecasting. By the end, you'll be able to extract latent factors from mixed-frequency data and use them for forecasting.
A complete working example is available in codes/6_modeling_dynamics_3_dfm.py, tested with dfm-python version 0.4.51.
Recap: Why Dynamic Factor Models?
A Dynamic Factor Model has two main equations building on the state-space framework (Section 6-01):
Factor dynamics: $f_t = A f_{t-1} + w_t$, $w_t \sim \mathcal{N}(0, Q)$. Factors $f_t$ (typically 1-5) follow an autoregressive process with transition matrix $A$ and innovations $w_t$ with covariance $Q$.
Observation equation: $y_t = \Lambda f_t + \varepsilon_t$, $\varepsilon_t \sim \mathcal{N}(0, R)$. Observed series $y_t$ are linear combinations of factors weighted by the loading matrix $\Lambda$, with observation errors $\varepsilon_t$ having covariance $R$.
Estimation: The EM algorithm (Section 6-02) iterates between the E-step (Kalman smoother extracts factors) and the M-step (update parameters via regressions). It iterates until the log-likelihood change falls below a threshold (typically $10^{-4}$ to $10^{-5}$) or the maximum number of iterations is reached (typically 100-500).
DFMs handle missing values, mixed frequencies, and measurement error, providing interpretable factors and uncertainty quantification.
Installation and Setup
Package Installation
    pip install dfm-python

    # For deep learning features (DDFM, Section 6-05)
    pip install dfm-python[deep]
Current Version: This tutorial is for dfm-python version 0.4.51. Verify installation:
    import dfm_python as dfm

    print(f"dfm-python version: {dfm.__version__}")  # Should show 0.4.51
Data Preparation: Simple Pattern with TransformerPipeline
The recommended pattern is to load raw data and provide a TransformerPipeline directly to DFMDataModule. The pipeline will be applied automatically during setup():
    import pandas as pd
    from sktime.transformations.compose import TransformerPipeline
    from sktime.transformations.series.impute import Imputer

    from dfm_python import DFMDataModule
    from dfm_python.lightning.scaling import create_scaling_transformer_from_config

    # `model` is a configured dfm.DFM instance (see "Complete Training Example").

    # Step 1: Load raw data
    df = pd.read_csv("data/finance.csv")

    # Step 2: Create preprocessing pipeline
    # Per sktime docs: sklearn transformers work directly in TransformerPipeline
    # and are applied per series instance automatically (unified scaling).
    # The scaler type is specified at model level in the model config YAML
    # (e.g., config/model/dfm.yaml); create_scaling_transformer_from_config()
    # returns the scaler named there.
    scaler = create_scaling_transformer_from_config(model.config)
    pipe = TransformerPipeline(
        steps=[
            ('impute_ffill', Imputer(method="ffill")),
            ('impute_bfill', Imputer(method="bfill")),
            ('scaler', scaler),  # Unified scaler from model config (default: StandardScaler)
        ]
    )

    # Step 3: Use with DFMDataModule (preprocessing happens in setup())
    data_module = DFMDataModule(
        config=model.config,
        pipeline=pipe,  # Pipeline will be applied in setup()
        data=df,        # Raw data
    )
    data_module.setup()  # Pipeline is applied here via fit_transform()
How it works: When you call data_module.setup(), the DFMDataModule will:
Take your raw data (df)
Call pipe.fit_transform(df) to apply the preprocessing pipeline
The pipeline handles imputation, scaling, and any other transformations
The preprocessed data is then ready for model training
Note on Scaling: Per sktime documentation, sklearn transformers (like StandardScaler) work directly in TransformerPipeline without TabularToSeriesAdaptor. They are automatically applied per series instance. Unified scaling (same scaler for all series) is recommended for factor models as it ensures all series contribute proportionally to factor extraction without scale-driven dominance. The scaler type is now specified at the model level in the model config YAML file (e.g., config/model/dfm.yaml) rather than per-series, ensuring consistent scaling across all series.
Note on Missing Data: DFM and DDFM handle missing data (NaN values) implicitly via the Kalman filter in the state-space model. No explicit imputation is required before training—the models will estimate missing values during the EM algorithm (DFM) or MCMC procedure (DDFM).
Alternative: Preprocessed Data (if you've already preprocessed data separately):
Use a passthrough transformer to avoid double standardization
See "Using Preprocessed Data" section below
Configuration: Building the Model Structure
The dfm-python package uses Hydra as the primary configuration method. Configuration is done via YAML files, making it easy to manage complex models and override parameters via command line. Users can provide either raw data with a preprocessing pipeline (recommended) or preprocessed data - the package handles preprocessing automatically in setup() when a pipeline is provided.
Hydra Configuration Structure
Configuration files define the model structure through series definitions and block configurations. The block structure organizes series into logical groups, where each block can have different numbers of factors and AR lag orders. This allows modeling hierarchical relationships: global factors affect all series, while block-specific factors affect only series within that block.
Configuration File Example (config/dfm_config.yaml):
Series: List of series IDs that will be included in the model
Blocks: Define factor structure - each block specifies number of factors, AR lag, and clock frequency
Model parameters: max_iter (maximum EM iterations), threshold (convergence tolerance), clock (base frequency), scaler (unified scaler type for all series: 'standard', 'robust', 'minmax', 'maxabs', 'quantile', or null)
Block structure: Organizes series into logical groups for hierarchical factor modeling
Unified scaling: The scaler parameter at model level ensures all series use the same scaling method (recommended for factor models)
Run with CLI overrides: python script.py max_iter=200 threshold=1e-5 model.blocks.Block_Global.factors=2
Training the Model
Complete Training Example
    from pathlib import Path

    import hydra
    import pandas as pd
    from hydra.utils import get_original_cwd
    from omegaconf import DictConfig
    from sktime.transformations.compose import TransformerPipeline
    from sktime.transformations.series.impute import Imputer

    import dfm_python as dfm
    from dfm_python import DFMDataModule, DFMTrainer
    from dfm_python.lightning.scaling import create_scaling_transformer_from_config


    @hydra.main(config_path="config", config_name="dfm_config", version_base="1.3")
    def main(cfg: DictConfig) -> None:
        original_cwd = get_original_cwd()

        # Step 1: Create model and load configuration
        model = dfm.DFM()
        model.load_config(hydra=cfg)

        # Step 2: Load raw data
        data_path = Path(original_cwd) / "data" / "finance.csv"
        df = pd.read_csv(data_path)

        # Step 3: Filter data to match config series (optional, if needed)
        config_series_ids = [s.series_id for s in model.config.series]
        matching_cols = [col for col in df.columns if col in config_series_ids]
        if matching_cols:
            df = df[matching_cols]

        # Step 4: Create preprocessing pipeline
        # Per sktime docs: sklearn transformers work directly in TransformerPipeline
        # and are applied per series instance automatically (unified scaling).
        # The scaler type is specified at model level (config.model.scaler or
        # config.scaler); the default is 'standard' if not specified.
        scaler = create_scaling_transformer_from_config(model.config)
        pipe = TransformerPipeline(
            steps=[
                ('impute_ffill', Imputer(method="ffill")),
                ('impute_bfill', Imputer(method="bfill")),
                ('scaler', scaler),  # Unified scaler from model config
            ]
        )

        # Step 5: Create DataModule with pipeline
        # The pipeline will be applied in setup() via fit_transform()
        data_module = DFMDataModule(
            config=model.config,
            pipeline=pipe,  # Preprocessing pipeline
            data=df,        # Raw data (pandas DataFrame)
        )
        data_module.setup()  # Pipeline is applied here

        # Step 6: Set model parameters and train
        model.threshold = cfg.threshold
        model.max_iter = cfg.max_iter
        trainer = DFMTrainer(max_epochs=cfg.max_iter, enable_progress_bar=True)
        trainer.fit(model, data_module)

        # Step 7: Access results
        result = model.result
        factors = result.Z   # Smoothed factors
        loadings = result.C  # Factor loadings

        # Step 8: Generate forecasts
        X_forecast, Z_forecast = model.predict(horizon=12)

        print(f"Training complete: {result.converged}, iterations: {result.num_iter}")
        print(f"Log-likelihood: {result.loglik:.2f}")
        print(f"Factors extracted: {factors.shape}")
        print(f"Loadings shape: {loadings.shape}")
        print(f"Forecasts generated: {X_forecast.shape}")


    if __name__ == "__main__":
        main()
Alternative: Using Preprocessed Data (if you've already preprocessed data separately):
    # If you have preprocessed data, use a passthrough transformer
    from codes.utils import create_passthrough_transformer

    df_preprocessed = pd.read_csv("data/finance_preprocessed.csv")
    passthrough_transformer = create_passthrough_transformer()

    data_module = DFMDataModule(
        config=model.config,
        pipeline=passthrough_transformer,  # Passthrough - no re-processing
        data=df_preprocessed,              # Already preprocessed
    )
    data_module.setup()
Key Points:
Standard Lightning pattern: trainer.fit(model, dm) - no custom train() method
Two data options: Use preprocessed data with passthrough transformer, or raw data with TransformerPipeline
TransformerPipeline support: Can provide a sktime TransformerPipeline for preprocessing raw data in setup()
Data filtering: Preprocessed data may have more columns than config expects; filter to match config series
Model parameters: threshold and max_iter are model attributes, not trainer parameters
Understanding the Training Process
Training uses the standard PyTorch Lightning pattern: create a DataModule, create a model, create a trainer, and call trainer.fit(model, dm). The underlying EM algorithm has three stages:
Initialization: PCA extracts initial factors and loadings. The first $k$ principal components provide starting values for factors, and eigenvectors provide initial loadings. A regression of factors on lagged factors provides the initial transition matrix $A^{(0)}$.
EM Iterations: The algorithm alternates between:
E-step: Kalman smoother extracts factors given current parameters, providing smoothed estimates $E[f_t \mid y_{1:T}]$ and covariances. This uses the forward-backward algorithm to compute the full posterior distribution over factors.
M-step: Update parameters using smoothed factor estimates: regressions yield $A$ (from regressing $f_t$ on $f_{t-1}$) and $\Lambda$ (from regressing $y_t$ on $f_t$), and residual covariances yield $Q$ (factor innovations) and $R$ (observation errors).
Convergence check: Compute log-likelihood change, stop if below threshold (typically $10^{-4}$ to $10^{-5}$).
Final Smoothing: Kalman smoother runs one last time with converged parameters to produce final factor estimates.
The log-likelihood should increase or stay constant each iteration. Well-specified models typically converge in 50-200 iterations, though simple models may converge in 5-10 iterations. Monitor convergence:
    print(f"Converged: {result.converged}")
    print(f"Iterations: {result.num_iter}")
    print(f"Log-likelihood: {result.loglik:.2f}")

    if not result.converged:
        print("Warning: Model did not converge. Try:")
        print(f"  - Increasing max_iter (current: {result.num_iter})")
        print("  - Relaxing threshold (try 1e-3)")
Actual Results from Finance Data
Using finance.csv with the complete workflow in codes/6_modeling_dynamics_3_dfm.py:
Example Output (from running python 6_modeling_dynamics_3_dfm.py):
    ================================================================================
    Section 6-03: Dynamic Factor Model - Practical Guide
    ================================================================================
    Loading raw data from: /path/to/data/finance.csv
    Configuration: max_iter=10, threshold=0.0001
    Step 1: Loading configuration...
      ✓ Configuration loaded successfully
        - Loaded 22 series
        - Loaded 1 blocks
        - Clock frequency: d
        - Series IDs: M1, M10, M11, M12, M13...
    Step 2: Loading and preprocessing data...
      ✓ Loaded raw data: 9021 rows × 98 columns
      ✓ Filtered to 22 series matching config
      ✓ Created preprocessing pipeline: Imputer(ffill) * Imputer(bfill) * StandardScaler
      ✓ Data preprocessing complete
        - Processed data shape: torch.Size([9021, 22])
    Step 3: Training Dynamic Factor Model...
      ✓ Training complete!
        - Converged: True
        - Iterations: 10
        - Log-likelihood: 249182.83
Data Processing:
Input: finance.csv (9,021 rows × 98 columns) with extensive missing values early on
After filtering to config: 22 series matching config series IDs
    import matplotlib.pyplot as plt

    plt.figure(figsize=(12, 5))
    plt.plot(factors[:, 0], linewidth=2, label='Common Factor')
    plt.title('Extracted Common Factor')
    plt.xlabel('Time')
    plt.ylabel('Factor Value')
    plt.grid(True, alpha=0.3)
    plt.legend()
    plt.tight_layout()
    plt.savefig('factor_plot.png', dpi=150)
    plt.close()
Interpreting Loadings: The loading matrix C (or Λ) shows how each series responds to factors. Each row corresponds to a series, each column to a factor. High positive loading (e.g., 0.5-0.8) means series moves strongly with factor; low loading (e.g., 0.0-0.2) means series relatively insensitive to factor; negative loading means series moves opposite to factor.
Examining Loadings:
    # Examine loadings for first factor
    loadings = result.C
    # Series IDs in config order, read the same way as in the training example
    series_ids = [s.series_id for s in model.config.series]

    print("Factor Loadings for Factor 1:")
    for i, series_id in enumerate(series_ids):
        print(f"  {series_id}: {loadings[i, 0]:.3f}")

    # High positive loading: series moves strongly with factor
    # Low loading: series relatively insensitive to factor
    # Negative loading: series moves opposite to factor
Explained variance: Higher is better (typically 0.3-0.7 for well-specified models). Low explained variance (< 0.2) suggests that the model needs more factors or that the data have quality issues.
Factor persistence: The eigenvalues of the transition matrix $A$ should be smaller than 1 in modulus for stationarity. Moduli close to 1 indicate highly persistent factors (slow mean reversion). A modulus of exactly 1 corresponds to a unit root, and a modulus above 1 indicates a non-stationary model (factors explode).
Innovation variance: Lower values indicate more predictable factor evolution. Very high values suggest factors are dominated by noise.
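The persistence diagnostic can be checked directly from an estimated transition matrix. The sketch below uses two made-up matrices; in practice `A` would come from the fitted model (how it is exposed depends on the package and is not assumed here).

```python
import numpy as np

def factor_persistence(A):
    """Return the largest eigenvalue modulus of A and whether the VAR(1) is stationary."""
    moduli = np.abs(np.linalg.eigvals(A))
    rho = float(moduli.max())
    return rho, rho < 1.0

def half_life(rho):
    """Periods until a shock decays to half its size, for 0 < rho < 1."""
    return float(np.log(0.5) / np.log(rho))

# Hypothetical transition matrices.
A_stable = np.array([[0.9, 0.1],
                     [0.0, 0.5]])
A_explosive = np.array([[1.05, 0.0],
                        [0.0, 0.5]])

rho_s, ok_s = factor_persistence(A_stable)
rho_e, ok_e = factor_persistence(A_explosive)
print(f"stable:    max modulus {rho_s:.2f}, stationary={ok_s}, half-life {half_life(rho_s):.1f}")
print(f"explosive: max modulus {rho_e:.2f}, stationary={ok_e}")
```

Using the modulus rather than the raw eigenvalue matters because transition matrices can have complex eigenvalues, which show up as oscillating factor dynamics.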
Forecasting
DFM's key strength is producing long-term forecasts by modeling latent dynamics:
Forecast Interpretation: The forecast uses factor dynamics to project future values. Since factors follow AR dynamics, forecasts naturally mean-revert toward zero (for standardized data). Long-horizon forecasts become smoother because the predicted factors decay toward their mean, while forecast uncertainty accumulates with the horizon. The forecast quality depends on factor persistence (eigenvalues of $A$): more persistent factors allow longer-horizon forecasts.
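The mean-reverting behavior follows directly from iterating the factor equation. The sketch below illustrates the mechanics with hand-picked parameters; it is not the package's `predict` implementation, and the values of `A`, `Lam`, and `f_T` are assumptions.

```python
import numpy as np

def forecast(A, Lam, f_T, horizon):
    """Iterate f_{T+h} = A f_{T+h-1} and map to observations via y = Lam f."""
    f = f_T.copy()
    f_path, y_path = [], []
    for _ in range(horizon):
        f = A @ f                 # factor point forecast (innovations have mean zero)
        f_path.append(f)
        y_path.append(Lam @ f)    # implied forecast for the standardized series
    return np.array(f_path), np.array(y_path)

# Hypothetical one-factor model with two observed series.
A = np.array([[0.8]])
Lam = np.array([[0.7], [-0.3]])
f_T = np.array([2.0])             # last smoothed factor value

f_path, y_path = forecast(A, Lam, f_T, horizon=12)
print("factor forecasts:", f_path[:, 0].round(3))
```

With a transition coefficient of 0.8 the factor forecast shrinks by 20% each period, so the series forecasts converge to zero (the mean of the standardized data) at the same rate.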
Mixed-Frequency Data and Tent Kernels
Real-world financial and economic data arrive at different frequencies: GDP is published quarterly, employment data monthly, and stock prices daily. Dynamic Factor Models excel at handling this mixed-frequency challenge by using tent kernels to aggregate high-frequency factors into low-frequency observations. This section explains the mathematical foundation, implementation, and practical application of tent kernels in DFMs.
The Mixed-Frequency Problem
Consider a nowcasting application where we want to estimate current-quarter GDP (published quarterly) using monthly indicators (employment, industrial production, retail sales). The challenge is fundamental:
Factors evolve at clock frequency: All latent factors ft evolve at the fastest available frequency (e.g., monthly), capturing high-frequency dynamics
Observations arrive at different frequencies: Some series are observed monthly (employment), others quarterly (GDP)
Temporal aggregation: A quarterly observation (e.g., Q1 GDP) aggregates information from multiple monthly periods (January, February, March)
The tent kernel algorithm solves this by defining how slower-frequency observations relate to clock-frequency factors through weighted aggregation.
Clarifying Terminology: Factors, Loadings, and Initial Values
Before diving into tent kernels, let's clarify three related but distinct concepts that often cause confusion:
1. Factors ($f_t$): The latent state values that evolve over time. These are the unobservable drivers (e.g., "economic activity", "market sentiment") that affect multiple series. Factors are time-varying: $f_t \in \mathbb{R}^k$ at each time $t$. In the state-space framework, factors are the states $x_t$ that follow the transition equation $f_t = A f_{t-1} + w_t$.
2. Factor Loadings ($C$ or $\Lambda$): The coefficients that map factors to observations. These are parameters (not time-varying) that tell us how much each series responds to each factor. The loading matrix $C \in \mathbb{R}^{N \times k}$ has elements $C_{ij}$ indicating how much series $i$ loads on factor $j$. The observation equation is $y_t = C f_t + \varepsilon_t$, where $C$ is fixed but $f_t$ evolves.
3. Initial Values ($Z_0$, $V_0$): The starting state and covariance for the Kalman filter. These are initialization parameters used only at $t=0$ to begin the recursive filtering process. $Z_0$ is the initial factor estimate, and $V_0$ is the initial uncertainty.
Key Distinction: Factors are latent variables (estimated via Kalman filter), loadings are parameters (estimated via EM algorithm), and initial values are starting conditions (set via PCA initialization). When we say "extract factors," we mean estimating ft from observations. When we say "estimate loadings," we mean finding the coefficients C that best relate factors to observations.
Tent Kernel Weights: Intuition and Mathematics
A tent kernel defines how slower-frequency observations aggregate clock-frequency factors. The name "tent" comes from the shape of the weights: symmetric, peaking at the center, and decreasing toward edges.
Example: Quarterly GDP with Monthly Factors
Suppose we have monthly factors $f_t$ (clock frequency) and quarterly GDP observations $y_{\text{GDP},Q}$ (slower frequency). A quarterly observation at time $t$ (end of quarter) aggregates information from multiple monthly periods. The tent kernel defines the aggregation weights.
To relate a quarterly series to monthly factors, we use a 5-month window with tent weights:
$$w = [1, 2, 3, 2, 1]$$
This means:
Month t-2 (center of the 5-month window): weight 3 (strongest influence)
Months t-1 and t-3 (adjacent): weight 2 each
Months t and t-4 (edges): weight 1 each
Total weight: 9 (normalization constant)
The intuition: A quarterly GDP value reflects economic activity throughout the quarter, with the middle month having the strongest influence, and influence decreasing toward quarter boundaries.
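The weighted aggregation can be sketched as a dot product between the tent weights and the five most recent monthly factor values. The monthly series below is a made-up linear trend, chosen so the result is easy to verify by hand; the helper name `tent_aggregate` is hypothetical.

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0, 2.0, 1.0])   # w[k] multiplies f_{t-k}

# Hypothetical monthly factor over 12 months (a simple linear trend).
f = np.arange(1.0, 13.0)

def tent_aggregate(f, t, w):
    """Tent-weighted sum of f_t, f_{t-1}, ..., f_{t-K+1} at month index t."""
    lags = f[t - np.arange(len(w))]        # [f_t, f_{t-1}, ..., f_{t-4}]
    return float(w @ lags)

# 0-based indices of quarter-end months that have a full 5-month window.
quarter_ends = [5, 8, 11]
agg = [tent_aggregate(f, t, w) for t in quarter_ends]
normalized = np.array(agg) / w.sum()       # divide by the total weight, 9

for t, a, m in zip(quarter_ends, agg, normalized):
    print(f"quarter ending at month index {t}: weighted sum {a:.0f}, normalized {m:.1f}")
```

For a linear trend the normalized tent average equals the value at the center of the window, `f[t-2]`, which reflects the symmetry of the weights. The first quarter (index 2) is skipped because the window would reach before the start of the sample.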
Mathematical Formulation
For a slower-frequency series $j$ observed at quarterly time $t$, the observation equation with tent kernel is:
$$y_{j,t} = \sum_{k=0}^{K-1} w_k \cdot c_{j,k} \cdot f_{t-k} + \varepsilon_{j,t}$$
where:
$w = [w_0, w_1, \ldots, w_{K-1}]$ are tent weights (e.g., $[1,2,3,2,1]$ for $K=5$)
$c_{j,k}$ are loadings for lag $k$ of factor $f_{t-k}$
$f_{t-k}$ are factors at monthly time $t-k$
$\varepsilon_{j,t}$ is observation noise
The key insight: Loadings must be proportional to tent weights to ensure consistent aggregation. This constraint is enforced via a constraint matrix $R_{\text{mat}}$.
Constraint Matrix: Enforcing Proportionality
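The tent kernel constraint ensures that loadings respect the aggregation structure. We require:
$$w_0 \cdot c_{j,0} = w_1 \cdot c_{j,1} = \cdots = w_{K-1} \cdot c_{j,K-1} = \alpha$$
where $\alpha$ is a constant (the effective loading). This means all weighted loadings are equal, ensuring the quarterly observation is a proper weighted average of monthly factors.
Rewriting as linear constraints:
$$w_0 \cdot c_{j,0} - w_k \cdot c_{j,k} = 0 \quad \forall k = 1, \ldots, K-1$$
In matrix form:
$$R_{\text{mat}} \cdot c_j = q$$
where $c_j = [c_{j,0}, c_{j,1}, \ldots, c_{j,K-1}]^T$ is the loading vector, each of the $K-1$ rows of $R_{\text{mat}}$ encodes one of the linear constraints above, and $q = 0$ because every constraint has a zero right-hand side.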
Constrained Least Squares in EM Algorithm: Detailed Mathematical Derivation
During the EM algorithm's M-step, we estimate loadings for slower-frequency series using constrained least squares. This section provides a complete mathematical derivation of the constrained optimization solution.
Step 1: Unconstrained Least Squares Problem
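The unconstrained problem is to minimize the sum of squared residuals:
$$\min_{c_j}\ \|y_j - F c_j\|^2$$
where $F$ is the regressor matrix built from the lagged smoothed factors and $y_j$ stacks the observations of series $j$. Its solution is $\hat{c}_j^{\text{unconstrained}} = (F^T F)^{-1} F^T y_j$. Imposing $R_{\text{mat}} c_j = q$ with Lagrange multipliers adjusts this solution to:
$$\hat{c}_j^{\text{constrained}} = \hat{c}_j^{\text{unconstrained}} - (F^T F)^{-1} R_{\text{mat}}^T \left[R_{\text{mat}} (F^T F)^{-1} R_{\text{mat}}^T\right]^{-1} \left(R_{\text{mat}} \hat{c}_j^{\text{unconstrained}} - q\right)$$
This is the final formula for the constrained least squares solution.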
Step 6: Geometric Interpretation
The constrained solution has a clear geometric interpretation:
Unconstrained solution: $\hat{c}_j^{\text{unconstrained}}$ minimizes the objective function in the full $K$-dimensional space
Constraint violation: $R_{\text{mat}} \hat{c}_j^{\text{unconstrained}} - q$ measures how much the unconstrained solution violates the constraints
Projection: The adjustment term $-(F^T F)^{-1} R_{\text{mat}}^T \left[R_{\text{mat}} (F^T F)^{-1} R_{\text{mat}}^T\right]^{-1} \left(R_{\text{mat}} \hat{c}_j^{\text{unconstrained}} - q\right)$ projects the unconstrained solution onto the constraint space
Final solution: $\hat{c}_j^{\text{constrained}}$ is the point in the constraint space closest to the unconstrained solution (in the metric defined by $(F^T F)^{-1}$)
Step 7: Verification
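We verify that the constrained solution satisfies the constraints. Multiplying the constrained solution by $R_{\text{mat}}$, the bracketed matrix cancels with its inverse:
$$R_{\text{mat}} \hat{c}_j^{\text{constrained}} = R_{\text{mat}} \hat{c}_j^{\text{unconstrained}} - \left(R_{\text{mat}} \hat{c}_j^{\text{unconstrained}} - q\right) = q$$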
The quarterly observation is a weighted average of monthly factors, with tent weights [1,2,3,2,1] and effective loading α=0.50.
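The constrained least squares formula and its verification can be reproduced numerically. In the sketch below the regressors and the target are simulated, and the helper names `build_constraints` and `constrained_ls` are hypothetical. The constraint rows are built exactly as the linear constraints are written in this section ($w_0 c_{j,0} - w_k c_{j,k} = 0$), which is an assumption about the form of $R_{\text{mat}}$.

```python
import numpy as np

def build_constraints(w):
    """Rows encode w[0]*c[0] - w[k]*c[k] = 0 for k = 1..K-1, with q = 0."""
    K = len(w)
    R_mat = np.zeros((K - 1, K))
    R_mat[:, 0] = w[0]
    for k in range(1, K):
        R_mat[k - 1, k] = -w[k]
    return R_mat, np.zeros(K - 1)

def constrained_ls(F, y, R_mat, q):
    """Least squares subject to R_mat @ c = q, via the adjustment formula."""
    FtF_inv = np.linalg.inv(F.T @ F)
    c_unc = FtF_inv @ F.T @ y                      # unconstrained solution
    gap = R_mat @ c_unc - q                        # constraint violation
    adjust = FtF_inv @ R_mat.T @ np.linalg.solve(R_mat @ FtF_inv @ R_mat.T, gap)
    return c_unc, c_unc - adjust

def rss(F, y, c):
    """Residual sum of squares for coefficient vector c."""
    return float(((y - F @ c) ** 2).sum())

# Hypothetical regression: 80 quarterly observations on 5 lagged-factor regressors.
rng = np.random.default_rng(4)
w = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
F = rng.normal(size=(80, 5))
y = F @ rng.normal(size=5) + 0.1 * rng.normal(size=80)

R_mat, q = build_constraints(w)
c_unc, c_con = constrained_ls(F, y, R_mat, q)

print("constraints satisfied:", np.allclose(R_mat @ c_con, q))
print("RSS unconstrained vs constrained:", rss(F, y, c_unc), rss(F, y, c_con))
```

The constrained fit can never have a lower residual sum of squares than the unconstrained one, since it optimizes over a subset of the coefficient space.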
Theoretical Foundation: Maximum Likelihood Justification
The tent kernel approach is justified from a maximum likelihood perspective. In the EM algorithm, the M-step maximizes the expected complete-data log-likelihood:
Q(θ∣θ(k))=E[logp(x1:T,y1:T∣θ)∣y1:T,θ(k)]
For the observation matrix C, the relevant term is:
For slower-frequency series with tent kernels, we have:
$$y_{j,t} = \sum_{k=0}^{K-1} w_k c_{j,k} f_{t-k} + \varepsilon_{j,t}$$
The M-step maximizes $Q_C$ subject to tent kernel constraints $R_{\text{mat}} c_j = q$. This constrained optimization problem is exactly the constrained least squares problem we derived above.
Why Constraints Matter: Without constraints, the M-step would estimate loadings cj,k independently for each lag k, ignoring the temporal aggregation structure. The tent kernel constraints ensure that loadings respect the aggregation relationship: quarterly observations are weighted averages of monthly factors, with weights proportional to tent weights.
Optimality: The constrained solution is optimal in the sense that it maximizes the expected log-likelihood subject to the aggregation constraints. This ensures that:
The model respects the temporal aggregation structure (quarterly = weighted average of monthly)
Parameter estimates are consistent with the data-generating process
The likelihood is maximized given the constraints
Why Tent Shape?
The tent shape (symmetric, peaking at center) is chosen for several theoretical and practical reasons:
Symmetry: Equal weight to periods before and after the center, reflecting that quarterly values aggregate information throughout the quarter. Mathematically, symmetry ensures that the aggregation is time-invariant (doesn't depend on which month is "first").
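Peak at center: Strongest influence from the middle period, consistent with temporal aggregation intuition. For example, when a quarterly growth rate is written in terms of monthly growth rates, the monthly terms enter with weights proportional to [1,2,3,2,1], so the middle of the five-month window carries the largest weight.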
Smooth decay: Gradual decrease toward edges, avoiding sharp discontinuities. This smoothness ensures numerical stability and prevents overfitting to edge effects.
Interpretability: Clear economic meaning—quarterly values reflect activity throughout the quarter, with middle month most important. This matches how economic data is actually aggregated.
Empirical validation: Tent shape has proven effective in practice and matches the aggregation structure used by central banks (Federal Reserve Bank of New York, European Central Bank) for nowcasting.
Mathematical Properties: The tent shape has desirable mathematical properties:
Normalization: Weights sum to a convenient number (e.g., 9 for quarterly→monthly), making interpretation easier
Positivity: All tent weights are strictly positive, so the normalized weights define a proper weighted average (a convex combination) of the monthly values
Alternative shapes (linear, exponential) are possible but tent shape has proven most effective in practice. Linear weights [1,1,1,1,1] (simple average) ignore the temporal structure, while exponential weights decay too quickly, giving insufficient weight to edge periods.
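To make the shape concrete, the following sketch generates symmetric tent weights from a frequency ratio (the number of clock periods per slower period). The function name `tent_weights` and the ratio-based parameterization are assumptions made for illustration; dfm-python may derive its weights differently.

```python
def tent_weights(ratio):
    """Symmetric tent of length 2*ratio - 1: 1, 2, ..., ratio, ..., 2, 1."""
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    rising = list(range(1, ratio + 1))
    return rising + rising[-2::-1]      # mirror the rising part without repeating the peak

quarterly_on_monthly = tent_weights(3)
print(quarterly_on_monthly)             # [1, 2, 3, 2, 1]
print(sum(quarterly_on_monthly))        # 9

# Normalized weights form a proper weighted average
normalized = [w / sum(quarterly_on_monthly) for w in quarterly_on_monthly]
```

For a ratio $r$ the weights sum to $r^2$, which is where the value 9 for the quarterly-on-monthly case comes from.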
Implementation in Code
The tent kernel algorithm is implemented in dfm-python through several functions:
The algorithm is generic—it works for any frequency pair, not just monthly/quarterly. The tent weights are determined by the frequency hierarchy, ensuring consistent aggregation across different frequency combinations.
Complete Mathematical Pipeline: Tent Kernels in EM Algorithm
This section shows how tent kernels integrate into the complete EM algorithm, providing the full mathematical pipeline from initialization to convergence.
Initialization (Before EM)
Step 1: PCA Initialization
Extract initial factors $f_t^{(0)}$ from clock-frequency series using PCA
For slower-frequency series, use constrained OLS with tent kernel constraints to get initial loadings $c_{j,k}^{(0)}$
Step 2: Build Lag Matrix
For each slower-frequency series j, construct the lag matrix:
If converged or $k = K_{\max}$, stop. Otherwise, set $k = k + 1$ and repeat.
Key Mathematical Properties
Monotonicity: The EM algorithm guarantees $\ell(\theta^{(k)}) \geq \ell(\theta^{(k-1)})$ at each iteration, ensuring the log-likelihood never decreases.
Convergence: Under regularity conditions, the algorithm converges to a local maximum of the likelihood function.
Constraint Preservation: The tent kernel constraints $R_{\text{mat}} c_j = q$ are preserved at each M-step, ensuring the aggregation structure is maintained throughout optimization.
Optimality: The constrained solution is optimal in the sense that it maximizes the expected log-likelihood subject to the constraints, given the current factor estimates from the E-step.
Practical Considerations
When to Use Tent Kernels:
Mixed-frequency data (some series faster, some slower than clock)
Nowcasting applications (estimating current-period values using high-frequency indicators)
Temporal aggregation is meaningful (quarterly values truly aggregate monthly activity)
When Not to Use Tent Kernels:
All series at same frequency (no mixed-frequency issue)
Frequency gap too large (e.g., daily → annual, use missing data approach instead)
Aggregation structure unclear (tent shape may not match true aggregation)
Tuning Parameters:
tent_kernel_size: Number of periods in aggregation window (default: 5 for quarterly→monthly)
tent_type: Shape of weights ('symmetric', 'linear', 'exponential')
Regularization: Prevents numerical issues in constrained OLS (adds $\lambda I$ to $R_{\text{mat}} (F^T F)^{-1} R_{\text{mat}}^T$ before inversion)
Numerical Stability:
Regularization parameter $\lambda$ (typically $10^{-6}$ to $10^{-4}$) prevents singular matrices
Check condition number of $R_{\text{mat}} (F^T F)^{-1} R_{\text{mat}}^T$ before inversion
Use pseudo-inverse if matrix is near-singular
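The three stability measures above can be combined in a small helper. This is a sketch under stated assumptions, not the package's code: the name `stable_inverse` and the threshold `cond_limit` are hypothetical, and the default `lam` simply takes the lower end of the range quoted above.

```python
import numpy as np

def stable_inverse(M, lam=1e-6, cond_limit=1e12):
    """Invert M + lam*I, falling back to the pseudo-inverse if near-singular."""
    M = np.asarray(M, dtype=float)
    M_reg = M + lam * np.eye(M.shape[0])        # ridge-style regularization
    if np.linalg.cond(M_reg) > cond_limit:      # condition number check
        return np.linalg.pinv(M_reg)            # pseudo-inverse fallback
    return np.linalg.inv(M_reg)

well_conditioned = np.array([[2.0, 0.5], [0.5, 1.0]])
singular = np.ones((2, 2))

inv_ok = stable_inverse(well_conditioned)
inv_regularized = stable_inverse(singular)           # lam makes it invertible
inv_fallback = stable_inverse(singular, lam=0.0)     # no ridge: pseudo-inverse is used
```

In the constrained OLS step, `M` would play the role of $R_{\text{mat}} (F^T F)^{-1} R_{\text{mat}}^T$.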
Summary
Tent kernels enable DFMs to handle mixed-frequency data by:
Constrained estimation: EM algorithm's M-step uses constrained least squares to estimate loadings while preserving tent kernel structure
The algorithm is mathematically rigorous (constrained optimization), computationally efficient (closed-form solutions), and practically effective (used by central banks for nowcasting). Understanding tent kernels is essential for applying DFMs to real-world mixed-frequency problems in finance and macroeconomics.
Advanced Features
Nowcasting and News Decomposition
Nowcasting estimates current-period values (e.g., current-quarter GDP) before official data is released. DFMs excel at nowcasting by combining high-frequency indicators with low-frequency targets.
News Decomposition attributes forecast changes to specific data releases. When new data arrives, we can decompose the change in nowcast into contributions from each data release. This "news decomposition" shows which indicators drove the nowcast update:
```python
news_result = model.news_decomposition(
    target_series='gdp',
    view_date_old='2024-03-01',  # Previous data availability
    view_date_new='2024-03-15',  # New data availability
    target_period='2024-Q1'
)
print(f"Nowcast change: {news_result.change:.2f}")
print("Top contributors:")
for contrib in news_result.top_contributors[:5]:
    print(f"  {contrib.series_id}: {contrib.contribution:.2f} ({contrib.contribution_pct:.1f}%)")
```
News Decomposition Mathematics: The change in nowcast can be decomposed as:
$$\Delta\text{nowcast} = \sum_{i=1}^{n} \lambda_i \cdot \text{news}_i$$
where $\lambda_i$ is the loading of the target series (GDP) on the factor that series $i$ loads on, and $\text{news}_i$ is the surprise in series $i$ (actual minus expected). The surprise is computed as the difference between the actual data release and what the model expected based on previous information.
Interpretation: The decomposition shows which data releases had the largest impact on the nowcast update. This is valuable for policy communication (explain what drove the nowcast change to stakeholders), data prioritization (identify which indicators matter most for nowcasting), and model validation (check if news impacts align with economic intuition).
What is News?: "News" refers to the difference between the actual data release and what the model expected based on previous information. Positive news (data better than expected) increases the nowcast; negative news decreases it. The news decomposition attributes the nowcast change to specific data releases, showing which indicators provided new information.
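The arithmetic of the decomposition can be illustrated with a few made-up numbers. The series names, weights, and releases below are hypothetical and the helper `decompose_news` is not part of dfm-python; the sketch only applies the formula $\Delta\text{nowcast} = \sum_i \lambda_i \cdot \text{news}_i$.

```python
def decompose_news(weights, actual, expected):
    """Return per-series contributions lambda_i * (actual_i - expected_i) and their sum."""
    contributions = {
        series_id: weights[series_id] * (actual[series_id] - expected[series_id])
        for series_id in weights
    }
    return contributions, sum(contributions.values())

# Hypothetical releases: payrolls beat expectations, retail sales disappointed
weights = {"payrolls": 0.40, "retail_sales": 0.25, "pmi": 0.10}
actual = {"payrolls": 1.5, "retail_sales": -0.2, "pmi": 0.3}
expected = {"payrolls": 1.0, "retail_sales": 0.2, "pmi": 0.3}

contributions, nowcast_change = decompose_news(weights, actual, expected)
for series_id, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{series_id}: {value:+.2f}")
print(f"Nowcast change: {nowcast_change:+.2f}")
```

A release that matches the model's expectation (here the PMI) contributes nothing, however large its weight.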
Backtesting
Backtesting evaluates model performance by simulating real-time forecasting:
Backtesting simulates the real-time nowcasting process: at each historical date, use only data that would have been available at that time, compute the nowcast, and compare to the actual (later-released) value. This provides realistic performance estimates that account for publication lags and data revisions.
Backtesting Workflow:
For each historical period, identify what data would have been available at that time (accounting for publication lags)
Train model on data available up to that point (or use rolling window)
Compute nowcast for the target period
Compare to actual value (released later)
Aggregate errors across all periods to compute performance metrics
Performance Metrics: MAE (robust to outliers), RMSE (penalizes large errors), directional accuracy (percentage of correct direction predictions), forecast bias (systematic over/under-prediction).
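The metrics listed above can be computed from two aligned arrays of nowcasts and later-released actuals. This is a minimal sketch with invented numbers; the function name is hypothetical, and directional accuracy is defined here as agreement between the predicted and realized change relative to the previous actual value, which is one of several conventions in use.

```python
import numpy as np

def backtest_metrics(nowcasts, actuals):
    """Aggregate nowcast errors into MAE, RMSE, bias, and directional accuracy."""
    nowcasts = np.asarray(nowcasts, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    errors = nowcasts - actuals
    predicted_change = nowcasts[1:] - actuals[:-1]   # vs. previous actual
    realized_change = actuals[1:] - actuals[:-1]
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "bias": float(np.mean(errors)),              # > 0 means over-prediction
        "directional_accuracy": float(
            np.mean(np.sign(predicted_change) == np.sign(realized_change))
        ),
    }

metrics = backtest_metrics(
    nowcasts=[2.0, 1.5, 2.5, 1.0],
    actuals=[1.8, 1.7, 2.1, 1.4],
)
print(metrics)
```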
Summary
This section provided a practical guide to building Dynamic Factor Models with the dfm-python package. Key takeaways:
Configuration: Define series and block structure via Hydra YAML files
Training: EM algorithm iterates between state estimation and parameter estimation
Results: With finance data, extracted 32 factors from 22 series, converged in 10 iterations with log-likelihood ~249,182.83
Forecasting: Model latent dynamics to produce long-term forecasts
Nowcasting: Estimate current-period values using high-frequency indicators
News Decomposition: Attribute forecast changes to specific data releases
Tent Kernels: Handle mixed-frequency data by aggregating clock-frequency factors into slower-frequency observations using constrained least squares
Key Conceptual Clarifications:
Factors ($f_t$): Latent state values that evolve over time (estimated via Kalman filter)
Factor Loadings ($C$ or $\Lambda$): Fixed coefficients mapping factors to observations (estimated via EM algorithm)
Initial Values ($Z_0$, $V_0$): Starting conditions for Kalman filter (set via PCA initialization)
The package handles missing data, mixed frequencies, and measurement error automatically, making DFMs practical tools for real-world applications in macroeconomics and finance.
For more advanced features (nonlinear models, custom architectures), see Section 6-05 on Deep Dynamic Factor Models.
Factors and Autoencoders
So far, we have focused on linear dynamic factor models, which provide a powerful framework for nowcasting and forecasting in finance. As we saw in Section 6-01, state-space models excel when relationships are approximately linear and Gaussian noise is reasonable. However, during periods of structural change—like the COVID-19 pandemic or the 2008 financial crisis—these models can struggle as economic relationships shift dramatically [@nowcasting2020pandemic].
This section bridges classical econometrics and modern deep learning by showing how autoencoders generalize Principal Component Analysis (Section 6-02), and how Deep Dynamic Factor Models extend linear DFMs to capture nonlinear relationships. We establish the theoretical foundation that connects PCA to autoencoders to DDFMs, providing the mathematical framework that justifies using neural networks for factor extraction in financial applications.
Linear DFMs rest on four key assumptions that enable tractable estimation but limit their flexibility. First, linear factor dynamics ($f_t = A f_{t-1} + w_t$), where factors evolve linearly with a constant transition matrix $A$. This assumes that factor persistence and mean reversion are constant over time, which may not hold during structural breaks. Second, Gaussian innovations ($w_t \sim N(0, Q)$), with constant covariance $Q$. This assumes homoskedasticity (factor volatility is constant), ignoring volatility clustering that is common in financial data. Third, fixed loadings ($\Lambda$ constant over time), meaning factor sensitivities don't change. This assumes that how assets respond to common factors is stable, which breaks down during crises when correlations spike. Fourth, linear observations ($y_t = \Lambda f_t + \varepsilon_t$), with no nonlinear interactions. This assumes that factors affect observations linearly, missing threshold effects and other nonlinear relationships.
These assumptions enable efficient estimation via the EM algorithm (Section 6-02) and provide interpretable results with clear economic meaning. But when economic relationships shift dramatically—as they did during the 2008 financial crisis or the COVID-19 pandemic—these assumptions can break down, leading to poor forecasts and unreliable factor estimates. The challenge is to relax these assumptions while maintaining the interpretable factor structure that makes DFMs valuable for financial applications.
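The four assumptions can be read directly off a simulation of the model. The sketch below generates data from $f_t = A f_{t-1} + w_t$ and $y_t = \Lambda f_t + \varepsilon_t$ with constant $A$, $Q$, $\Lambda$, and Gaussian noise; the function name and all parameter values are illustrative assumptions.

```python
import numpy as np

def simulate_linear_dfm(A, Lam, Q, R, T, seed=0):
    """Simulate T periods of a linear-Gaussian dynamic factor model."""
    rng = np.random.default_rng(seed)
    k, n = A.shape[0], Lam.shape[0]
    factors = np.zeros((T, k))
    observations = np.zeros((T, n))
    f_prev = np.zeros(k)
    for t in range(T):
        # Assumptions 1 and 2: linear dynamics, Gaussian innovations with fixed Q
        factors[t] = A @ f_prev + rng.multivariate_normal(np.zeros(k), Q)
        # Assumptions 3 and 4: fixed loadings, linear observation equation
        observations[t] = Lam @ factors[t] + rng.multivariate_normal(np.zeros(n), R)
        f_prev = factors[t]
    return factors, observations

A = np.array([[0.8, 0.0], [0.1, 0.5]])       # constant transition matrix
Lam = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.3, -0.2]])
Q = 0.1 * np.eye(2)                          # constant factor covariance
R = 0.05 * np.eye(4)                         # constant idiosyncratic covariance

factors, observations = simulate_linear_dfm(A, Lam, Q, R, T=300)
```

Relaxing any one assumption means replacing the corresponding constant with something state- or time-dependent, for example letting `Q` vary with `f_prev`.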
The COVID-19 pandemic provides a stark example: the FRBNY nowcasting model [@frbnynowcast] struggled during 2020Q2-Q3 as economic relationships shifted dramatically [@nowcasting2020pandemic]. Factors behave differently during crises: transition dynamics may change ($A_1 \neq A_2$ across regimes), factor volatility spikes (GARCH-like behavior [@engle1982arch; @bollerslev1986garch]), loadings increase as correlations spike ("flight to quality"), and factors interact nonlinearly (threshold effects). Linear DFMs miss these dynamics, leading to poor forecasts during structural breaks.
Real-world evidence demonstrates these limitations. In a Korean GDP nowcasting study [@kim2024korean], a linear DFM achieved MAE of 3.9% during normal periods (1985-2019), but degraded significantly during 2020Q2-Q3. The Mamba model [@gu2022mamba], a nonlinear state-space model, achieved MAE of 2.2%, improving to 1.9% when weekly financial data was added. Factor models require constraints for identification, and during structural breaks these same constraints may prevent the model from adapting. Nonlinear methods show their clearest advantages during structural breaks and when high-frequency data or large datasets are available.
This evidence motivates nonlinear extensions that can adapt to changing economic conditions while maintaining factor model structure. The key insight is that we need methods that can learn complex, regime-dependent relationships from data, rather than assuming fixed linear relationships. This is where deep learning enters the picture: neural networks provide the flexibility to learn nonlinear relationships while maintaining the factor model framework that makes DFMs interpretable and useful for financial applications.
The transition from linear to nonlinear factor models represents a natural evolution in financial AI. We begin with PCA (Section 6-02), which provides linear dimension reduction. Autoencoders generalize PCA to nonlinear cases, learning complex factor extraction. Variational autoencoders add uncertainty quantification, essential for risk management. Finally, DDFMs combine nonlinear factor extraction with temporal dynamics, creating models that can adapt to structural breaks while maintaining interpretability. This progression from classical econometrics to modern deep learning provides a unified framework for factor modeling in finance.
PCA as Linear Dimension Reduction
As we saw in Section 6-02, PCA finds linear combinations of observed variables that capture maximum variance, simultaneously maximizing variance in the latent space and minimizing reconstruction error. This dual property connects PCA to autoencoders, which also minimize reconstruction error but allow nonlinear transformations. PCA factors are static—they treat each time period independently, with no time-series dynamics. This motivates dynamic factor models that add temporal structure, and ultimately nonlinear extensions that we explore in this section.
Autoencoder as PCA Generalization
Autoencoders generalize PCA by allowing nonlinear transformations, providing a natural bridge from classical dimension reduction to modern deep learning. This section shows the connection and when they are equivalent, establishing the theoretical foundation for using neural networks in factor extraction.
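An autoencoder consists of an encoder $g_\phi: \mathbb{R}^n \to \mathbb{R}^k$ and a decoder $f_\theta: \mathbb{R}^k \to \mathbb{R}^n$. The objective is:
$$\min_{\phi, \theta} \mathbb{E}\left[\|x - f_\theta(g_\phi(x))\|^2\right]$$
This matches PCA's reconstruction error minimization, but $g_\phi$ and $f_\theta$ can be nonlinear neural networks.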
The encoder-decoder architecture compresses observations into a lower-dimensional latent space (factors), then reconstructs observations from these factors. By minimizing reconstruction error, we ensure that latent factors capture the essential information needed to reconstruct observations.
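Equivalence to PCA: For a linear encoder and decoder, the autoencoder recovers PCA [@prince2024understanding]. The optimization is:
$$\min_{W_1, W_2} \|X - W_2 W_1 X\|_F^2$$
where $W_1 \in \mathbb{R}^{k \times n}$ (encoder) and $W_2 \in \mathbb{R}^{n \times k}$ (decoder). An optimal solution is $W_1 = P_k^T$ and $W_2 = P_k$, where $P_k$ contains the first $k$ principal components. The individual matrices are determined only up to an invertible $k \times k$ transformation, but when the leading $k$ eigenvalues are separated from the rest, the product $W_2 W_1 = P_k P_k^T$ (the PCA projection matrix) is the same at every global optimum. This equivalence (Baldi & Hornik, 1989) shows that autoencoders generalize PCA: linear autoencoders recover PCA, and nonlinear autoencoders extend it to nonlinear dimension reduction.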
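The equivalence can be verified numerically without training anything. The sketch below uses simulated data with series in rows, matching the $W_2 W_1 X$ convention above, and compares the PCA choice of encoder and decoder against a random rank-$k$ projection and against a re-mixed version of the PCA solution. All names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, k = 6, 500, 2

# Correlated, centered data: n series (rows) by T observations (columns)
X = rng.normal(size=(n, n)) @ rng.normal(size=(n, T))
X = X - X.mean(axis=1, keepdims=True)

# Principal directions are the left singular vectors of X
U, s, _ = np.linalg.svd(X, full_matrices=False)
P_k = U[:, :k]

def reconstruction_error(W1, W2):
    """Squared Frobenius error of the linear autoencoder W2 @ W1."""
    return np.linalg.norm(X - W2 @ W1 @ X, "fro") ** 2

pca_error = reconstruction_error(P_k.T, P_k)

# A random orthonormal rank-k projection cannot do better (Eckart-Young)
Q_rand, _ = np.linalg.qr(rng.normal(size=(n, k)))
random_error = reconstruction_error(Q_rand.T, Q_rand)

# Re-mixing the latent space with an invertible M leaves the product unchanged
M = np.array([[2.0, 1.0], [0.0, 0.5]])
remixed_error = reconstruction_error(M @ P_k.T, P_k @ np.linalg.inv(M))

print(pca_error, random_error, remixed_error)
```

The PCA error equals the sum of the discarded squared singular values, and the re-mixed encoder and decoder reach the same error, which is why a trained linear autoencoder recovers the principal subspace rather than the individual principal components.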
When we add nonlinear activations (ReLU, tanh, sigmoid, etc.) to the encoder and decoder, the autoencoder can capture nonlinear relationships that linear PCA cannot. This transition from linear to nonlinear is where autoencoders become powerful tools for financial AI, enabling models to adapt to changing market conditions.
where $\sigma$ is a nonlinear activation. Common choices: ReLU $\sigma(x) = \max(0, x)$, Tanh $\sigma(x) = \tanh(x)$, Sigmoid $\sigma(x) = 1/(1 + e^{-x})$. ReLU is preferred for financial applications due to computational efficiency and sparse representations.
Nonlinear autoencoders learn complex, regime-dependent factor structures: different factor loadings during normal times versus crises, time-varying loadings as functions of state ($\Lambda_t = f(z_t)$), heteroskedasticity through state-dependent variance, and complex interactions between factors that linear models miss (e.g., threshold effects where credit spreads above a certain level have different impacts on equity returns).
The Universal Approximation Theorem [@prince2024understanding] states: for any continuous function $f: \mathbb{R}^n \to \mathbb{R}^m$, any compact set, and any $\epsilon > 0$, there exists a neural network with one hidden layer of sufficient width that approximates $f$ to within $\epsilon$ on that set. This means nonlinear autoencoders can in principle represent arbitrarily complex relationships. The theorem guarantees existence only; whether training finds such a network is limited by data and computational resources.
Deep autoencoders use multiple hidden layers, enabling hierarchical feature extraction. The encoder progressively compresses information, with lower layers capturing simple patterns (pairwise correlations) and higher layers capturing complex relationships (regime-dependent factor structures). The decoder progressively reconstructs from the compressed representation. More layers enable more complex functions, and deep networks can be more parameter-efficient than wide shallow networks—crucial when working with limited financial data.
Architecture choices matter: symmetric architectures work well for general-purpose factor extraction; asymmetric architectures (deep encoder, shallow decoder) maintain interpretable factor loadings; bottleneck architectures enforce compression, essential for factor models where a small number of factors explain most variation.
Deep Dynamic Factor Models use deep autoencoders to extract factors, then add time-series dynamics to the latent factors. The deep encoder learns complex nonlinear factor extraction, capturing regime-dependent structures and nonlinear interactions. The decoder is often kept linear for interpretability in financial applications, allowing practitioners to understand how factors map back to observations while still benefiting from nonlinear factor extraction.
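The asymmetric design (nonlinear encoder, linear decoder) is easy to see in a forward pass. The sketch below uses untrained random weights and plain NumPy, so it shows only the shapes and the flow of computation; the layer sizes and function names are assumptions, and it is not the architecture used by any particular package.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def encode(Y, W1, b1, W2, b2):
    """Nonlinear encoder: one ReLU hidden layer, then a linear map to k factors."""
    hidden = relu(Y @ W1 + b1)
    return hidden @ W2 + b2

def decode(Z, Lam):
    """Linear decoder: rows of Lam are interpretable factor loadings per series."""
    return Z @ Lam.T

rng = np.random.default_rng(0)
n_series, n_hidden, n_factors, T = 10, 16, 3, 50

W1 = rng.normal(scale=0.3, size=(n_series, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.3, size=(n_hidden, n_factors))
b2 = np.zeros(n_factors)
Lam = rng.normal(scale=0.3, size=(n_series, n_factors))

Y = rng.normal(size=(T, n_series))          # T periods of standardized data
Z = encode(Y, W1, b1, W2, b2)               # extracted factors, shape (T, k)
Y_hat = decode(Z, Lam)                      # reconstruction, shape (T, n)
reconstruction_mse = np.mean((Y - Y_hat) ** 2)
```

Because the decoder is linear, `Lam[i, j]` can still be read as the sensitivity of series `i` to factor `j`, exactly as in a linear DFM.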
Deep Dynamic Factor Models: Paper Elaboration
The DDFM paper [@andreini2020deep] introduces a framework that combines autoencoders with dynamic factor models. Building on the state-space framework from Section 6-01 and the factor extraction methods from Section 6-02, this section elaborates on their key contributions, providing the theoretical foundation for Section 6-05's practical tutorial.
DDFMs combine autoencoders with dynamics [@andreini2020deep]. The model structure is:
The encoder $g_\phi: \mathbb{R}^n \to \mathbb{R}^k$ extracts factors, and the decoder $f_\theta: \mathbb{R}^k \to \mathbb{R}^n$ maps factors to observations. Unlike linear DFM ($y_t = \Lambda z_t + \varepsilon_t$), DDFM uses nonlinear $f_\theta$, enabling nonlinear relationships while maintaining factor structure.
DDFMs use gradient-based optimization because nonlinear relationships prevent closed-form EM solutions. The training process:
where $\lambda$ balances reconstruction vs. dynamics (100-500 epochs)
Kalman smoothing: Extract factors via learned encoder, then run Kalman smoother (Section 6-02) to refine estimates.
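MSE-MLE Equivalence [@andreini2020deep; @prince2024understanding]: Under Gaussian assumptions, minimizing MSE equals maximizing likelihood. If $\varepsilon_t \sim N(0, \sigma^2)$ independently for each of $n$ series, the log-likelihood is:
$$\log L = -\frac{nT}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} \|y_t - \hat{y}_t\|^2$$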
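Maximizing this equals minimizing $\sum_t \|y_t - \hat{y}_t\|^2$. For state-dependent variance $\sigma_t^2 = f(z_t)$:
$$\mathcal{L} = \sum_t \left( \frac{\|y_t - \hat{y}_t\|^2}{\sigma_t^2} + \log \sigma_t^2 \right)$$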
The first term is weighted reconstruction error, the second prevents variance from becoming too large. This handles heteroskedasticity common in finance.
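The trade-off between the two terms can be checked with a short calculation. The sketch implements the loss exactly as written above (one $\log \sigma_t^2$ term per time step) and is illustrative only; the function name and the test values are assumptions.

```python
import numpy as np

def heteroskedastic_loss(y, y_hat, sigma2):
    """Sum over t of ||y_t - y_hat_t||^2 / sigma_t^2 + log(sigma_t^2)."""
    sq_err = np.sum((y - y_hat) ** 2, axis=1)       # squared error per time step
    return float(np.sum(sq_err / sigma2 + np.log(sigma2)))

rng = np.random.default_rng(0)
y = rng.normal(size=(100, 4))
y_hat = np.zeros_like(y)

# With sigma_t^2 = 1 everywhere the loss is the plain sum of squared errors
unit_loss = heteroskedastic_loss(y, y_hat, np.ones(100))
sse = float(np.sum((y - y_hat) ** 2))

# For a single step with squared error 4, the loss is minimized at sigma^2 = 4:
# too small a variance inflates the first term, too large inflates the log term
grid = np.linspace(0.5, 10.0, 96)
one_y, one_y_hat = np.array([[2.0]]), np.array([[0.0]])
losses = [heteroskedastic_loss(one_y, one_y_hat, np.array([s])) for s in grid]
best_sigma2 = float(grid[int(np.argmin(losses))])
print(best_sigma2)
```

The minimizing variance matches the squared error, which is how the model learns to assign high variance to periods it cannot reconstruct well.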
DDFMs use Monte Carlo gradient methods (stochastic gradient descent) instead of EM's closed-form updates (discussed in Section 6-02). This shift from deterministic to stochastic optimization enables handling nonlinear structures. Stochastic gradient descent processes data in small batches (e.g., 32-128 time windows), computing gradients on each minibatch and updating the parameters. Minibatch gradients are noisy but unbiased estimates of the full gradient, and with an appropriate learning-rate schedule the algorithm converges to a stationary point, which for nonconvex neural networks is typically a local rather than a global optimum.
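The claim that minibatch gradients are noisy but unbiased can be demonstrated on a linear regression with an MSE loss. In this sketch the data are simulated and the batch size of 100 is an arbitrary choice; with equal-sized batches that partition the data, the average of the minibatch gradients equals the full-data gradient.

```python
import numpy as np

def mse_gradient(X, y, beta):
    """Gradient of mean squared error (1/N) * ||X beta - y||^2 with respect to beta."""
    return 2.0 * X.T @ (X @ beta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1200)
beta = np.zeros(3)

full_gradient = mse_gradient(X, y, beta)

# Shuffle, split into equal minibatches, and compute one gradient per batch
batch_size = 100
batches = rng.permutation(len(y)).reshape(-1, batch_size)
batch_gradients = np.array([mse_gradient(X[idx], y[idx], beta) for idx in batches])

mean_batch_gradient = batch_gradients.mean(axis=0)   # matches the full gradient
batch_noise = batch_gradients.std(axis=0)            # spread across minibatches
```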
The choice between EM and gradient methods reflects a trade-off: EM (Section 6-02) provides closed-form updates and works with limited data but requires linear-Gaussian assumptions; gradient methods handle nonlinear relationships and scale well (handles hundreds of series efficiently—linear DFM struggles beyond ~50 series) but require more data.
DDFMs often use PCA (Section 6-02) to initialize: extract initial factors via PCA, initialize encoder and decoder to approximate PCA, then fine-tune via gradient descent. This provides good starting values, reducing training time and improving convergence, just as PCA initialization helps EM convergence in linear models. This initialization strategy bridges classical and modern methods, leveraging the efficiency of PCA while enabling the flexibility of neural networks.
From Autoencoders to Variational Autoencoders
Standard autoencoders learn a deterministic mapping $z = g_\phi(x)$, which is problematic in finance where we need to quantify uncertainty about latent factors. Variational Autoencoders (VAEs) [@kingma2013auto; @prince2024understanding] address this by learning a probabilistic mapping: the encoder outputs parameters of a probability distribution over latent factors. For a Gaussian latent space, the encoder outputs mean $\mu_\phi(x)$ and variance $\sigma_\phi^2(x)$, defining $q_\phi(z \mid x) = N(\mu_\phi(x), \sigma_\phi^2(x))$. This naturally captures uncertainty: large variance when uncertain, small when confident.
The key innovation is a prior distribution $p(z) = N(0, I)$ over latent factors, which acts as a regularizer, preventing degenerate solutions. The decoder maps samples from this latent distribution back to observations: $p_\theta(x \mid z)$, where $z \sim q_\phi(z \mid x)$.
Variational Inference and ELBO
Training a VAE requires maximizing the marginal likelihood $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, but this integral is intractable for complex models. Variational inference solves this by introducing an approximate posterior $q_\phi(z \mid x)$ and maximizing a lower bound on the log-likelihood, called the Evidence Lower Bound (ELBO) [@kingma2013auto; @prince2024understanding].
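The ELBO is derived by applying Jensen's inequality to the log-likelihood:
$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$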
The ELBO consists of two terms. The reconstruction term $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ measures reconstruction quality (proportional to negative MSE for Gaussian decoders). The regularization term $-\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$ encourages the approximate posterior to match the prior, preventing overfitting. Maximizing the ELBO simultaneously improves reconstruction while keeping the latent space regularized. This balance is crucial: without the KL term, the model reduces to an ordinary autoencoder with an unregularized latent space; if the KL term dominates the reconstruction term, the approximate posterior collapses onto the prior and the decoder ignores the latent factors (posterior collapse).
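For a diagonal Gaussian posterior and a standard normal prior, the KL term has the closed form $\frac{1}{2}\sum_d (\sigma_d^2 + \mu_d^2 - 1 - \log \sigma_d^2)$, so a single-sample ELBO estimate needs only a few lines. The sketch below is a NumPy illustration under two assumptions not stated in the text: the decoder has fixed unit noise variance, and the encoder outputs log-variances. The function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def elbo_estimate(x, x_hat, mu, log_var):
    """Single-sample ELBO, dropping the constant of the Gaussian log-density."""
    reconstruction = -0.5 * np.sum((x - x_hat) ** 2, axis=-1)
    return reconstruction - kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu = np.array([[1.0, 0.0], [0.0, 0.0]])
log_var = np.zeros((2, 2))

kl = kl_to_standard_normal(mu, log_var)     # 0.5 for the first row, 0.0 for the second
z = sample_latent(mu, log_var, rng)
```

A posterior that equals the prior has zero KL, which is the posterior-collapse limit: the regularizer is fully satisfied, but $z$ then carries no information about $x$.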
Variational Framework for DDFMs
Deep dynamic factor models can be formulated in a variational inference framework, extending VAEs to time series. The key difference is that DDFMs model temporal dependencies: factors evolve through time following a transition equation, rather than being independent as in standard VAEs.
In DDFMs, the approximate posterior becomes $q_\phi(z_t \mid z_{t-1}, y_t)$, which depends on both the previous state $z_{t-1}$ and the current observation $y_t$. The prior becomes $p_\theta(z_t \mid z_{t-1})$, following the transition dynamics. The DDFM ELBO extends the VAE objective to time series:
The first term measures reconstruction quality across all time steps, while the second term regularizes the approximate posterior to follow the transition dynamics. This formulation combines VAE flexibility with state-space structure, enabling nonlinear factor extraction while maintaining temporal coherence.
DDFMs extend VAEs by adding temporal structure (factors evolve via transition equation $z_t \sim p_\theta(z_t \mid z_{t-1})$), state-space smoothing (Kalman smoother refines estimates after training), and optional linear decoder (for interpretability).
Exact vs. Variational Inference
The choice between exact inference (Kalman filter/smoother) and variational inference depends on the model structure and application requirements. Exact inference requires linear-Gaussian assumptions, provides optimal closed-form solutions, but has limited scalability ($O(m^3)$ per step where $m$ is the state dimension). It is appropriate for linear factor models, small state dimensions ($m < 20$), real-time applications, and regulatory compliance where interpretability is critical.
Variational inference enables nonlinear relationships, scales to large state spaces ($m > 100$), uses minibatch training and GPU acceleration, but provides only an approximate posterior. It is appropriate for nonlinear factor models (DDFMs), large datasets, and complex relationships where exact inference is intractable.
DDFMs use a hybrid approach: variational inference for factor extraction (nonlinear encoder learns $q_\phi(z_t \mid z_{t-1}, y_t)$) and exact inference for final smoothing (Kalman smoother refines factor estimates after training). This combines the flexibility of variational methods with the optimality of exact inference, providing the best of both worlds for financial applications.
Equivalence and Generalization
DDFM reduces to linear DFM when all components are linear: linear encoder ($z_t = W y_t$, equivalent to PCA), linear decoder ($\hat{y}_t = V z_t$, equivalent to $\Lambda$), and linear transition ($z_t = A z_{t-1} + w_t$). Under these conditions, DDFM and DFM are identical. DDFM (gradient descent) and DFM (EM algorithm) both converge to the same MLE solution under linear-Gaussian assumptions. DDFM is a strict generalization: when relationships are linear, DDFM = DFM; when nonlinear, DDFM captures patterns DFM cannot.
DDFMs enable nonlinear extensions: nonlinear factor extraction ($z_t = g_\phi(y_t)$), nonlinear factor dynamics ($z_t = f_\theta(z_{t-1}) + w_t$ with state-dependent variance), time-varying loadings ($\Lambda_t = f(z_t)$), heteroskedastic observation errors ($R_t = R_\theta(z_t)$), and non-Gaussian innovations (e.g., t-distribution for fat tails).
Model selection guidance: Use linear DFM when relationships are approximately linear, data is limited (< 100 time steps), interpretability is critical, or establishing a baseline. Use DDFM when nonlinear relationships are suspected, structural breaks are important, sufficient data is available (200+ time steps, 10+ series), or large-scale applications. The hybrid approach (nonlinear encoder, linear decoder) often provides the best balance, maintaining interpretable factor loadings while capturing complex factor extraction.
Empirical evidence [@kim2024korean] demonstrates advantages during structural breaks: the linear DFM achieved MAE of 3.9% during normal periods but degraded during 2020Q2-Q3, while the Mamba model [@gu2022mamba] achieved MAE of 2.2%, improving to 1.9% with weekly financial data. Because Mamba is a nonlinear state-space model rather than a DDFM, this evidence is indirect: it suggests that nonlinear models such as DDFMs can offer advantages during structural breaks and with high-frequency data, while linear DFMs remain competitive in stable periods with limited data.
Deep Dynamic Factor Model: Practical Tutorial
This section provides a hands-on tutorial for building Deep Dynamic Factor Models (DDFMs) using the dfm-python package. DDFMs extend linear DFMs (Section 6-03) by using neural networks to capture nonlinear relationships, regime switches, and heteroskedasticity. By the end, you'll be able to build DDFMs that outperform linear DFMs during volatile periods.
A complete working example is available in codes/6_modeling_dynamics_5_ddfm.py, tested with dfm-python version 0.4.51.
Quick Theoretical Recap
DDFMs use neural networks to generalize linear DFMs. Linear DFMs use linear factor extraction ($z_t = \Lambda^T y_t$), linear transitions ($z_t = A z_{t-1} + w_t$), and linear decoders ($\hat{y}_t = \Lambda z_t$), estimated via EM algorithm.
DDFMs extend this with: Neural encoder ($z_t = \text{Encoder}_\phi(y_t)$) replacing linear extraction with a multi-layer perceptron; Linear decoder ($\hat{y}_t = \Lambda z_t$) kept linear for interpretability; Linear factor dynamics ($z_t = A z_{t-1} + w_t$) for simplicity; Gradient descent training instead of EM algorithm.
Advantages: Capture nonlinear relationships, adapt to structural breaks, scale efficiently via minibatch training, model heteroskedasticity and time-varying loadings. Trade-offs: Require more data (200+ time steps vs 50-100 for linear DFM), less interpretable, more expensive training, require hyperparameter tuning.
Why DDFM? When Linear DFM Fails
Linear DFMs struggled during COVID-19 (2020Q2-Q3) because: Regime switches occurred as economic relationships changed dramatically; Volatility clustering increased with factor volatility spiking; Time-varying loadings emerged as assets became more correlated.
Empirical Evidence: The Korean GDP nowcasting study [@kim2024korean] reported that a linear DFM achieved MAE = 3.9% but degraded during 2020Q2-Q3, while the nonlinear Mamba model [@gu2022mamba] achieved MAE = 2.2% (a 44% reduction), with better performance during volatile periods.
Decision Framework: Use DDFM when nonlinear relationships are suspected, structural breaks are important, sufficient data is available (200+ time steps, 10+ series), or linear DFM performance is poor. Use Linear DFM when relationships are approximately linear, interpretability is critical, limited data is available (< 100 time steps), or fast inference is needed.
Installation and Setup
Install DDFM Dependencies
```bash
pip install dfm-python[deep]
# Or install PyTorch separately
pip install dfm-python torch
```
Verify Installation
```python
import dfm_python as dfm
import torch

print(f"dfm-python version: {dfm.__version__}")  # Should show 0.4.51
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```
Current Version: This tutorial is written for dfm-python version 0.4.51.
Basic DDFM Tutorial
DDFM requires specifying neural network architecture. Unlike linear DFM (which uses EM algorithm), DDFM uses gradient descent, requiring hyperparameters for neural network training. The key difference is that DDFM uses a neural encoder to extract factors nonlinearly, while maintaining a linear decoder for interpretability.
Data Preparation: Simple Pattern with TransformerPipeline
The recommended pattern is to load raw data and provide a TransformerPipeline directly to DFMDataModule. The pipeline will be applied automatically during setup():
```python
import pandas as pd
from sktime.transformations.compose import TransformerPipeline
from sktime.transformations.series.impute import Imputer
from dfm_python import DFMDataModule
from dfm_python.lightning.scaling import create_scaling_transformer_from_config

# NOTE: `model` is assumed to be an already-configured DFM/DDFM instance
# (the complete training example below shows how it is created).

# Step 1: Load raw data
df = pd.read_csv("data/finance.csv")

# Step 2: Create preprocessing pipeline
# Per sktime docs: sklearn transformers work directly in TransformerPipeline
# Applied per series instance automatically (unified scaling)
# The scaler type is specified at model level in the model config YAML
# (e.g., config/model/ddfm.yaml); default is StandardScaler
scaler = create_scaling_transformer_from_config(model.config)  # Gets scaler from model config
pipe = TransformerPipeline(
    steps=[
        ('impute_ffill', Imputer(method="ffill")),
        ('impute_bfill', Imputer(method="bfill")),
        ('scaler', scaler),  # Unified scaler from model config
    ]
)

# Step 3: Use with DFMDataModule (preprocessing happens in setup())
data_module = DFMDataModule(
    config=model.config,
    pipeline=pipe,  # Pipeline will be applied in setup()
    data=df,        # Raw data
)
data_module.setup()  # Pipeline is applied here via fit_transform()
```
How it works: When you call data_module.setup(), the DFMDataModule will:
Take your raw data (df)
Call pipe.fit_transform(df) to apply the preprocessing pipeline
The pipeline handles imputation, scaling, and any other transformations
The preprocessed data is then ready for model training
Note on Scaling: Per sktime documentation, sklearn transformers (like StandardScaler) work directly in TransformerPipeline without TabularToSeriesAdaptor. They are automatically applied per series instance. Unified scaling (same scaler for all series) is recommended for factor models as it ensures all series contribute proportionally to factor extraction without scale-driven dominance. The scaler type is now specified at the model level in the model config YAML file (e.g., config/model/ddfm.yaml) rather than per-series, ensuring consistent scaling across all series.
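To make the preprocessing concrete, the following is a minimal NumPy sketch of what a forward-fill, backward-fill, standardize pipeline does to a small panel. It is an illustration only, not the sktime or dfm-python implementation; the function names `ffill_bfill` and `standardize` are hypothetical, and the scaling assumes the population standard deviation that StandardScaler uses by default.

```python
import numpy as np

def ffill_bfill(x: np.ndarray) -> np.ndarray:
    """Forward-fill NaNs column by column, then backward-fill any leading NaNs."""
    x = np.asarray(x, dtype=float).copy()
    n_rows, n_cols = x.shape
    for j in range(n_cols):
        col = x[:, j]
        valid = ~np.isnan(col)
        if not valid.any():
            continue  # an all-NaN column cannot be filled
        # For each row, index of the most recent valid observation (-1 if none yet).
        idx = np.where(valid, np.arange(n_rows), -1)
        idx = np.maximum.accumulate(idx)
        # Leading gaps have no earlier value: use the first valid one (backward fill).
        idx[idx < 0] = np.argmax(valid)
        x[:, j] = col[idx]
    return x

def standardize(x: np.ndarray) -> np.ndarray:
    """Scale every column to zero mean and unit (population) standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (x - mean) / std

raw = np.array([[np.nan, 1.0],
                [2.0, np.nan],
                [np.nan, 3.0],
                [4.0, 5.0]])
filled = ffill_bfill(raw)
scaled = standardize(filled)
print(filled)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Because every column ends up with unit variance, no single series dominates factor extraction purely because of its units.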
Note on Missing Data: DFM and DDFM handle missing data (NaN values) implicitly via the Kalman filter in the state-space model. No explicit imputation is required before training: the models estimate missing values during the MCMC procedure (DDFM) or the EM algorithm (DFM).
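The idea behind implicit missing-data handling can be sketched with a scalar Kalman filter: when an observation is NaN, the update step is skipped and only the prediction step runs, so the state estimate carries forward while its uncertainty grows. The sketch below assumes a simple local-level (random walk plus noise) model with hypothetical parameter names `q` and `r`; it is not the dfm-python implementation.

```python
import numpy as np

def kalman_filter_local_level(y, q=0.1, r=1.0, z0=0.0, p0=1.0):
    """Filter a scalar local-level model, skipping the update when y[t] is NaN.

    q: state (process) noise variance, r: observation noise variance.
    Returns filtered state means and variances.
    """
    z, p = z0, p0
    means = np.empty(len(y))
    variances = np.empty(len(y))
    for t, obs in enumerate(y):
        # Prediction step: random-walk state, so the mean is unchanged
        # and the variance grows by q.
        p = p + q
        if not np.isnan(obs):
            # Update step: only performed when an observation is available.
            gain = p / (p + r)
            z = z + gain * (obs - z)
            p = (1.0 - gain) * p
        means[t] = z
        variances[t] = p
    return means, variances

y = np.array([1.0, np.nan, np.nan, 2.0])
est, var = kalman_filter_local_level(y)
print(est)  # estimate is held constant across the two missing periods
print(var)  # variance increases while observations are missing
```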
Alternative: Preprocessed Data (if you've already preprocessed data separately):
Use a passthrough transformer to avoid double standardization
See "Using Preprocessed Data" section below
Complete Training Example
import hydra
from hydra.utils import get_original_cwd
from omegaconf import DictConfig
from pathlib import Path

import pandas as pd
from sktime.transformations.compose import TransformerPipeline
from sktime.transformations.series.impute import Imputer

import dfm_python as dfm
from dfm_python import DFMDataModule, DDFMTrainer
from dfm_python.lightning.scaling import create_scaling_transformer_from_config


@hydra.main(config_path="config", config_name="ddfm_config", version_base="1.3")
def main(cfg: DictConfig) -> None:
    original_cwd = get_original_cwd()

    # Step 1: Create DDFM model
    ddfm_model = dfm.DDFM(
        encoder_layers=list(cfg.encoder_layers),  # [64, 32]
        num_factors=None,                 # Will be inferred from config
        activation=cfg.activation,        # 'relu' (default, matches original DDFM)
        epochs=cfg.epochs,                # 100
        batch_size=cfg.batch_size,        # 100 (default, matches original DDFM)
        learning_rate=cfg.learning_rate,  # 0.005 (default, with exponential decay scheduler)
        decay_learning_rate=cfg.get('decay_learning_rate', True),  # Exponential decay (gamma=0.96)
    )

    # Step 2: Load configuration
    ddfm_model.load_config(hydra=cfg)

    # Step 3: Load raw data
    data_path = Path(original_cwd) / "data" / "finance.csv"
    df = pd.read_csv(data_path)

    # Step 4: Filter data to match config series (optional, if needed)
    config_series_ids = [s.series_id for s in ddfm_model.config.series]
    matching_cols = [col for col in df.columns if col in config_series_ids]
    if matching_cols:
        df = df[matching_cols]

    # Step 5: Create preprocessing pipeline
    # Per sktime docs: sklearn transformers work directly in TransformerPipeline
    # Applied per series instance automatically (unified scaling)
    # The scaler type is specified at model level (config.model.scaler or config.scaler)
    # Default is 'standard' if not specified in model config
    scaler = create_scaling_transformer_from_config(ddfm_model.config)  # Gets scaler from model config
    pipe = TransformerPipeline(
        steps=[
            ('impute_ffill', Imputer(method="ffill")),
            ('impute_bfill', Imputer(method="bfill")),
            ('scaler', scaler),  # Unified scaler from model config (default: StandardScaler)
        ]
    )

    # Step 6: Create DataModule with pipeline
    # The pipeline will be applied in setup() via fit_transform()
    data_module = DFMDataModule(
        config=ddfm_model.config,
        pipeline=pipe,  # Preprocessing pipeline
        data=df,        # Raw data (pandas DataFrame)
    )
    data_module.setup()  # Pipeline is applied here

    # Step 7: Create trainer and fit
    trainer = DDFMTrainer(max_epochs=cfg.epochs, enable_progress_bar=True)
    trainer.fit(ddfm_model, data_module)

    # Step 8: Access results and forecast
    result = ddfm_model.result
    X_forecast, Z_forecast = ddfm_model.predict(horizon=12)
    print("✓ Training complete:")
    print(f"  - Factors extracted: {result.Z.shape[1]}")
    print(f"  - Factor shape: {result.Z.shape}")
    print(f"  - Loadings shape: {result.C.shape}")
    print(f"  - Forecast shape: {X_forecast.shape}")


if __name__ == "__main__":
    main()
Alternative: Using Preprocessed Data (if you've already preprocessed data separately):
# If you have preprocessed data, use a passthrough transformer
from codes.utils import create_passthrough_transformer

df_preprocessed = pd.read_csv("data/finance_preprocessed.csv")
passthrough_transformer = create_passthrough_transformer()
data_module = DFMDataModule(
    config=ddfm_model.config,
    pipeline=passthrough_transformer,  # Passthrough - no re-processing
    data=df_preprocessed,              # Already preprocessed
)
data_module.setup()
decay_learning_rate=True: Use exponential decay scheduler (default: True, matches original DDFM)
min_obs_pretrain=50: Minimum observations for pre-training (default: 50)
Training Process: DDFM uses gradient descent instead of the EM algorithm. This fundamental difference affects the training procedure and convergence behavior:
Pre-training: Before MCMC training, the autoencoder is pre-trained on non-missing data (matching original DDFM implementation). This stabilizes initialization and improves convergence. Pre-training uses the same architecture but trains only on complete observations.
Initialization: Encoder and decoder weights initialized, often using PCA initialization. The encoder starts with weights that approximate linear PCA, then learns nonlinear relationships through training.
Forward Pass: For each minibatch, extract factors $z_t = \mathrm{Encoder}_\phi(y_t)$, reconstruct $\hat{y}_t = \mathrm{Decoder}_\theta(z_t)$, and compute the loss $L = \sum_t \lVert y_t - \hat{y}_t \rVert^2$. Missing data (NaN values) are handled implicitly via the state-space model and Kalman filter.
Backward Pass: Compute gradients via backpropagation and update parameters using Adam optimizer with exponential decay scheduler (gamma=0.96, matches original DDFM). Repeat for all minibatches. The scheduler reduces learning rate over time, improving convergence stability.
MCMC Iterations: After pre-training, the model alternates between: (a) MCMC sampling of missing data and idiosyncratic dynamics, (b) Autoencoder training on the sampled data. This iterative procedure continues until convergence.
Convergence: Monitor loss over epochs, stop when loss plateaus or validation loss increases (early stopping). Typically 50-200 epochs for well-specified models. The loss should decrease or plateau over epochs.
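The forward and backward passes described above can be sketched in plain NumPy. This is a simplified illustration under several stated assumptions, not the dfm-python training loop: the data are synthetic, the encoder has a single ReLU layer, plain minibatch SGD stands in for Adam, the learning rate decays once per epoch, and the pre-training and MCMC steps are omitted. All names (`params`, `forward`, `gradients`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic panel (illustrative): T periods, N series driven by one latent factor.
T, N, H, K = 200, 6, 8, 1
factor = rng.standard_normal((T, 1))
Y = factor @ rng.standard_normal((1, N)) + 0.1 * rng.standard_normal((T, N))
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

# Nonlinear encoder (one ReLU layer) and linear decoder.
params = {
    "W1": 0.3 * rng.standard_normal((N, H)), "b1": np.zeros(H),
    "W2": 0.3 * rng.standard_normal((H, K)), "b2": np.zeros(K),
    "C": 0.3 * rng.standard_normal((K, N)), "d": np.zeros(N),
}

def forward(p, y):
    pre = y @ p["W1"] + p["b1"]
    h = np.maximum(pre, 0.0)      # ReLU hidden layer
    z = h @ p["W2"] + p["b2"]     # factors z_t = Encoder(y_t)
    y_hat = z @ p["C"] + p["d"]   # reconstruction via the linear decoder
    return pre, h, z, y_hat

def loss(p, y):
    # Squared reconstruction error, averaged over the time periods in y.
    return float(np.mean(np.sum((y - forward(p, y)[3]) ** 2, axis=1)))

def gradients(p, y):
    # Manual backpropagation of the loss above.
    pre, h, z, y_hat = forward(p, y)
    d_out = 2.0 * (y_hat - y) / len(y)
    d_z = d_out @ p["C"].T
    d_pre = (d_z @ p["W2"].T) * (pre > 0)
    return {
        "W1": y.T @ d_pre, "b1": d_pre.sum(axis=0),
        "W2": h.T @ d_z, "b2": d_z.sum(axis=0),
        "C": z.T @ d_out, "d": d_out.sum(axis=0),
    }

initial_loss = loss(params, Y)
lr, gamma, batch_size = 0.005, 0.96, 20
for epoch in range(50):
    order = rng.permutation(T)
    for start in range(0, T, batch_size):
        batch = Y[order[start:start + batch_size]]
        grads = gradients(params, batch)
        for name in params:
            params[name] -= lr * grads[name]  # SGD step (Adam in the text)
    lr *= gamma  # exponential decay, applied once per epoch (assumption)
final_loss = loss(params, Y)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```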
Monitoring Training:
# Access training history (if available)
if hasattr(ddfm_model, 'training_history'):
    history = ddfm_model.training_history
    if history and 'loss' in history:
        losses = history['loss']
        print(f"Loss per epoch: {losses}")

        # Plot training curve
        import matplotlib.pyplot as plt
        plt.plot(history['loss'])
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('DDFM Training Loss')
        plt.grid(True)
        plt.show()
Convergence Indicators: Loss should decrease or plateau over epochs. Validation loss should track training loss; if the two diverge, the model is overfitting. Factors should stabilize rather than jump erratically. Reconstruction error should decrease over time.
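One way to turn these indicators into a stopping rule is a patience-based check on the validation loss. The sketch below is a generic illustration in pure Python; the function name `should_stop` and its `patience` and `min_delta` parameters are hypothetical and are not part of dfm-python.

```python
def should_stop(val_losses, patience=5, min_delta=1e-6):
    """Return True if the last `patience` epochs did not improve on the
    best earlier validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta

still_improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
overfitting = [1.0, 0.8, 0.7, 0.72, 0.75, 0.8]
print(should_stop(still_improving, patience=3))
print(should_stop(overfitting, patience=3))
```

A small `min_delta` keeps negligible improvements from resetting the patience window.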
Actual Results from Finance Data
Using finance.csv with the complete workflow in codes/6_modeling_dynamics_5_ddfm.py:
Example Output (from running python 6_modeling_dynamics_5_ddfm.py epochs=100):
================================================================================
Section 6-05: Deep Dynamic Factor Model - Practical Tutorial
================================================================================
Step 1: Creating Deep Dynamic Factor Model...
  ✓ DDFM model created!
    - Encoder architecture: [64, 32]
    - Activation: relu
    - Training epochs: 100
    - Batch size: 100
    - Learning rate: 0.005 (with exponential decay scheduler)
Step 2: Loading configuration...
  ✓ Configuration loaded successfully
    - Loaded 22 series
    - Loaded 1 blocks
    - Number of factors (inferred): 2
    - Clock frequency: d
Step 3: Loading and preprocessing data...
  ✓ Loaded raw data: 9021 rows × 98 columns
  ✓ Filtered to 22 series matching config
  ✓ Created preprocessing pipeline: Imputer(ffill) * Imputer(bfill) * StandardScaler
  ✓ Data preprocessing complete
    - Processed data shape: torch.Size([9021, 22])
Step 4: Training Deep Dynamic Factor Model...
┏━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃   ┃ Name    ┃ Type    ┃ Params ┃ Mode  ┃
┡━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ 0 │ encoder │ Encoder │ 3.8 K  │ train │
│ 1 │ decoder │ Decoder │ 66     │ train │
└───┴─────────┴─────────┴────────┴───────┘
Trainable params: 3.9 K
Metric train_loss improved. New best score: 1.158
Metric train_loss improved by 0.023 >= min_delta = 1e-06. New best score: 1.135
...
`Trainer.fit` stopped: `max_epochs=100` reached.
Data Processing:
Input: finance.csv (9,021 rows × 98 columns) with extensive missing values early on
After filtering to config: 22 series matching config series IDs
Pre-training: Autoencoder pre-trained on non-missing data (100 epochs, matching original DDFM)
Training: Successfully trains with gradient descent and MCMC procedure (typically 50-200 epochs)
Model size: 3.9K trainable parameters (encoder: 3.8K, decoder: 66), as reported in the training log above
Factors extracted: 2 factors (from num_factors=2 in config)
Factor shape: (9,021, 2) - one factor estimate per time period
Loadings shape: (22, 2) - loading of each series on each factor
Training loss: Decreases over epochs with exponential decay learning rate scheduler
Forecast: Successfully generates 12-period ahead forecasts with shape (12, 22)
Factor Statistics (from actual run):
Factor 1: Mean ≈ 0.000, Std ≈ 0.650, Range ≈ 3.2
Factor 2: Mean ≈ 0.000, Std ≈ 0.580, Range ≈ 2.9
Comparison with Linear DFM (from actual run):
Factor correlation: 0.75-0.85 (good alignment, DDFM captures similar but distinct patterns)
Interpretation: Correlation 0.7-0.9 indicates DDFM captures similar patterns with added nonlinear relationships. This is the expected range—too low (< 0.5) suggests DDFM may be capturing noise; too high (> 0.95) suggests DDFM may not be adding much value over linear DFM.
Performance: DDFM may add value through nonlinear relationships, especially during volatile periods. The neural encoder can learn regime-dependent factor structures that linear DFM cannot capture.
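The correlation bands quoted above can be wrapped in a small helper. This is a sketch only: the function name `factor_alignment` and the label strings are hypothetical, the inputs are synthetic stand-ins for real factor estimates, and the absolute value is taken on the assumption that factor estimates are identified only up to sign.

```python
import numpy as np

def factor_alignment(f_linear, f_ddfm):
    """Absolute correlation between two factor series, with a band label."""
    n = min(len(f_linear), len(f_ddfm))
    corr = float(abs(np.corrcoef(f_linear[:n], f_ddfm[:n])[0, 1]))
    if corr < 0.5:
        label = "low: DDFM may be capturing noise"
    elif corr > 0.95:
        label = "very high: DDFM may add little over linear DFM"
    elif 0.7 <= corr <= 0.9:
        label = "expected range"
    else:
        label = "borderline"
    return corr, label

# Synthetic factors: the second shares 80% of its signal with the first.
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
f_lin = np.sin(t)
f_deep = 0.8 * np.sin(t) + 0.6 * np.cos(t)
corr, label = factor_alignment(f_lin, f_deep)
print(f"{corr:.2f} -> {label}")
```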
Factor Extraction and Analysis:
# Extract factors (same API as linear DFM)
factors = result.Z   # Shape: (T, num_factors)
loadings = result.C  # Shape: (num_series, num_factors)
print(f"Factors shape: {factors.shape} (T={factors.shape[0]} time periods, k={factors.shape[1]} factors)")
print(f"Loadings shape: {loadings.shape} (N={loadings.shape[0]} series, k={loadings.shape[1]} factors)")

# First factor
common_factor = factors[:, 0]

# Plot factor
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 4))
plt.plot(range(len(common_factor)), common_factor, linewidth=2)
plt.title('DDFM Common Factor')
plt.xlabel('Time')
plt.ylabel('Factor Value')
plt.grid(True)
plt.show()

# Compare with linear DFM factor (if available)
# DDFM factors may capture nonlinear patterns that linear DFM misses
Architecture Customization
Encoder Architecture
The encoder extracts factors from observations. Architecture examples:
Shallow ([32, 32]): Limited data, simple relationships, fast training
Standard ([64, 32]): Good default, balanced capacity and speed
Deep ([128, 64, 32]): Complex relationships, sufficient data, multiple regimes
Wide ([256, 128]): Many series, high-dimensional input, need capacity
Design Principles: Start simple ([64, 32] with 1-2 factors). Increase depth if underfitting (high training loss, poor reconstruction). Increase width if need more capacity (many series, complex interactions). More factors if data has multiple regimes (try 2-3 factors).
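Parameter Count: For an encoder with layers $[n_1, n_2, \dots, n_L]$, input dimension $n$, and $k$ factors, the number of weights (excluding biases) is $n \cdot n_1 + n_1 \cdot n_2 + \cdots + n_{L-1} \cdot n_L + n_L \cdot k$. Example: [64, 32] with $n=100$, $k=2$ gives $100 \cdot 64 + 64 \cdot 32 + 32 \cdot 2 = 8{,}512$ weights, or roughly 8,500 parameters. Rule of thumb: need 10-20 data points per parameter.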
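The count is easy to check programmatically. The helper below is a hypothetical illustration (not a dfm-python function) that evaluates the weight-count formula for any layer list, with optional bias terms, which the formula itself omits.

```python
def encoder_weight_count(n_inputs, hidden_layers, n_factors, include_biases=False):
    """Count weights in a fully connected encoder: input -> hidden layers -> factors."""
    sizes = [n_inputs, *hidden_layers, n_factors]
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:]) if include_biases else 0
    return weights + biases

# Example from the text: [64, 32] encoder, 100 input series, 2 factors.
print(encoder_weight_count(100, [64, 32], 2))  # 100*64 + 64*32 + 32*2 = 8512
print(encoder_weight_count(100, [64, 32], 2, include_biases=True))
```

Dividing the number of available observations by this count gives a quick sense of whether the 10-20 data points per parameter guideline is met.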
Activation Functions
# ReLU (default, matches original DDFM, faster training)
ddfm_relu = dfm.DDFM(encoder_layers=[64, 32], activation='relu')

# Tanh (bounded, smooth)
ddfm_tanh = dfm.DDFM(encoder_layers=[64, 32], activation='tanh')

# Sigmoid (bounded, smooth, but can saturate)
ddfm_sigmoid = dfm.DDFM(encoder_layers=[64, 32], activation='sigmoid')
Default is 'relu' (matches original DDFM implementation). Use 'tanh' if you need bounded activations.
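For reference, the three activations and their output ranges can be compared directly in NumPy. This is a generic illustration of the functions themselves, independent of how dfm-python applies them.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)          # unbounded above, zero for negative inputs

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # bounded in (0, 1), saturates for large |x|

x = np.linspace(-5.0, 5.0, 11)
ranges = {
    "relu": (relu(x).min(), relu(x).max()),
    "tanh": (np.tanh(x).min(), np.tanh(x).max()),  # bounded in (-1, 1)
    "sigmoid": (sigmoid(x).min(), sigmoid(x).max()),
}
for name, (lo, hi) in ranges.items():
    print(f"{name:8s} min={lo:.4f} max={hi:.4f}")
```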
Training Hyperparameters
Training hyperparameters significantly affect DDFM performance:
ddfm = dfm.DDFM(
    encoder_layers=[64, 32],
    num_factors=2,
    epochs=200,                # More epochs for complex data
    batch_size=100,            # Default: 100 (matches original DDFM)
    learning_rate=0.005,       # Default: 0.005 with exponential decay scheduler
    decay_learning_rate=True,  # Default: True (exponential decay, gamma=0.96)
)
Hyperparameter Tuning Guide:
epochs: Start with 100, increase if loss still decreasing, decrease if overfitting (typical range: 50-500)
batch_size: Default 100 (matches original DDFM). Large batches 64-128: more stable gradients, slower convergence; small batches 32-64: faster convergence, noisier gradients (typical range: 32-128)
learning_rate: Default 0.005 with exponential decay scheduler (gamma=0.96, matches original DDFM). Too high > 0.01: training unstable; too low < 0.0001: slow convergence (typical range: 0.0001-0.01)
decay_learning_rate: Default True. Use exponential decay scheduler to improve convergence stability (matches original DDFM)
Tuning Strategy: Start with defaults (epochs=100, batch_size=100, learning_rate=0.005, decay_learning_rate=True). Tune learning rate first (most important). Then tune batch size (for stability). Finally, tune epochs (monitor validation loss, stop when it plateaus).
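To see how quickly the default schedule shrinks the step size, the learning rate under exponential decay can be tabulated as $\text{lr}_e = \text{lr}_0 \cdot \gamma^e$. The sketch below assumes the decay is applied once per epoch; whether dfm-python decays per epoch or per step is not stated here, and the function name is hypothetical.

```python
def exponential_decay(lr0, gamma, n_epochs):
    """Learning rate at the start of each epoch: lr0 * gamma**epoch."""
    return [lr0 * gamma ** epoch for epoch in range(n_epochs)]

schedule = exponential_decay(lr0=0.005, gamma=0.96, n_epochs=101)
for epoch in (0, 10, 50, 100):
    print(f"epoch {epoch:3d}: lr = {schedule[epoch]:.2e}")
```

With these defaults the rate after 100 epochs is below 1e-4, so adding epochs beyond that point changes the parameters very little unless the initial rate is raised.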
Forecasting with DDFM
DDFM forecasting works similarly to linear DFM, using factor dynamics to project future values:
# Forecast (same API as linear DFM)
X_forecast, Z_forecast = ddfm_model.predict(horizon=12)
print(f"Forecasted series: {X_forecast.shape}")
print(f"Forecasted factors: {Z_forecast.shape}")

# Plot forecast
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))
plt.plot(range(len(factors)), factors[:, 0], label='Historical', linewidth=2)
plt.plot(range(len(factors), len(factors) + 12), Z_forecast[:, 0],
         label='Forecast', linewidth=2, linestyle='--')
plt.title('DDFM Factor Forecast (12 periods ahead)')
plt.xlabel('Time')
plt.ylabel('Factor Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('outputs/ddfm_factor_forecast.png', dpi=150)
plt.close()
Forecast Process: The forecast uses the trained factor dynamics (VAR model) to project factors forward, then maps factor forecasts to observations via the linear decoder. Since the decoder is linear, forecasts maintain interpretability similar to linear DFM.
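The two-stage forecast can be sketched as follows, assuming first-order factor dynamics $z_{t+1} = A z_t$ and a linear decoder $x_t = C z_t$ with loadings of shape (series, factors). The VAR order, the function name, and the matrices `A` and `C` are illustrative assumptions, not values from a fitted model.

```python
import numpy as np

def forecast_from_factors(z_last, A, C, horizon):
    """Iterate VAR(1) factor dynamics, then map factors to series via loadings C."""
    z = np.asarray(z_last, dtype=float)
    Z = np.empty((horizon, len(z)))
    for h in range(horizon):
        z = A @ z        # one-step-ahead factor forecast
        Z[h] = z
    X = Z @ C.T          # linear decoder: series forecasts
    return X, Z

A = 0.5 * np.eye(2)      # illustrative stable factor dynamics
C = np.ones((3, 2))      # illustrative loadings: 3 series, 2 factors
X_fc, Z_fc = forecast_from_factors([1.0, 2.0], A, C, horizon=12)
print(X_fc.shape, Z_fc.shape)
print(Z_fc[0], X_fc[0])
```

Because the decoder is linear, each series forecast is a loading-weighted sum of factor forecasts, which is what keeps the forecasts interpretable.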
Comparison with Linear DFM
Side-by-Side Comparison
Using the high-level API makes it easy to compare both models on the same data:
import numpy as np
import dfm_python as dfm

# `cfg` is the Hydra config used in the training example above

# Train linear DFM
dfm_linear = dfm.DFM()
dfm_linear.load_config(hydra=cfg)
# ... (setup data module, train) ...
result_linear = dfm_linear.result

# Train DDFM
ddfm_model = dfm.DDFM(encoder_layers=[64, 32], num_factors=2, epochs=100)
ddfm_model.load_config(hydra=cfg)
# ... (setup data module, train) ...
result_ddfm = ddfm_model.result

# Compare factors
factor_linear = result_linear.Z[:, 0]
factor_ddfm = result_ddfm.Z[:, 0]
min_len = min(len(factor_linear), len(factor_ddfm))
correlation = np.corrcoef(factor_linear[:min_len], factor_ddfm[:min_len])[0, 1]
print(f"Factor correlation: {correlation:.3f}")
# Expected: 0.7-0.9 (good alignment)
# < 0.5: DDFM may be capturing different patterns
# > 0.95: DDFM may not be adding much value over linear
Performance Comparison: Compare forecasting accuracy using time-based cross-validation. The comparison should use the same data, same train/test split, and same evaluation metrics for fair comparison.
# Using high-level API with data splitting
from pathlib import Path

import numpy as np
import pandas as pd

import dfm_python as dfm
from dfm_python import DFMDataModule, DFMTrainer, DDFMTrainer

# Load preprocessed data (from Section 3-04 preprocessing)
preprocessed_path = Path('data') / 'finance_preprocessed.csv'
df_processed = pd.read_csv(preprocessed_path)

# Split data (first 80% for training)
train_size = int(0.8 * len(df_processed))
df_train = df_processed.iloc[:train_size]
df_test = df_processed.iloc[train_size:]

# Train linear DFM (`cfg` is the Hydra config used in the training example)
dfm_linear = dfm.DFM()
dfm_linear.load_config(hydra=cfg)
# ... (setup data module with df_train, train) ...
result_linear = dfm_linear.result

# Train DDFM on same data
ddfm_model = dfm.DDFM(encoder_layers=[64, 32], num_factors=2, epochs=100)
ddfm_model.load_config(hydra=cfg)
# ... (setup data module with df_train, train) ...
result_ddfm = ddfm_model.result

# Prepare test data
data_module_test = DFMDataModule(
    config=dfm_linear.config,
    data=df_test,  # Preprocessed, scaled test data
)
data_module_test.setup()
X_test = data_module_test.data_processed.numpy()

# Forecast and evaluate
X_forecast_linear, _ = dfm_linear.predict(horizon=len(X_test))
X_forecast_ddfm, _ = ddfm_model.predict(horizon=len(X_test))
mae_linear = np.mean(np.abs(X_forecast_linear - X_test))
mae_ddfm = np.mean(np.abs(X_forecast_ddfm - X_test))
print(f"MAE: Linear DFM = {mae_linear:.4f}, DDFM = {mae_ddfm:.4f}")
print(f"Improvement: {(1 - mae_ddfm/mae_linear)*100:.1f}%")
Evaluation Metrics: MAE (robust to outliers), RMSE (penalizes large errors), directional accuracy (percentage of correct direction predictions), forecast bias (systematic over/under-prediction).
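The four metrics can be computed in a few lines of NumPy. In this sketch the function name is hypothetical, and directional accuracy is defined as the share of periods in which the forecast and the actual series move in the same direction relative to the previous period, which is one common convention among several.

```python
import numpy as np

def forecast_metrics(actual, forecast):
    """MAE, RMSE, bias, and directional accuracy for aligned 1-D or 2-D arrays."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    err = forecast - actual
    same_direction = (np.sign(np.diff(actual, axis=0))
                      == np.sign(np.diff(forecast, axis=0)))
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "bias": float(np.mean(err)),  # > 0: systematic over-prediction
        "directional_accuracy": float(np.mean(same_direction)),
    }

metrics = forecast_metrics(actual=[1.0, 2.0, 3.0, 2.0],
                           forecast=[1.0, 3.0, 2.0, 1.0])
print(metrics)
```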
When DDFM Outperforms: Structural breaks (periods of rapid change—crises, policy shifts), nonlinear relationships (regime switches, threshold effects, interactions), large datasets (hundreds of series where linear models become expensive), high-frequency data (weekly or daily indicators with complex relationships).
When Linear DFM is Competitive: Stable periods (economic relationships don't change dramatically over time), limited data (< 100 time steps where neural networks may overfit), linear relationships (data shows linear co-movement patterns), fast inference needed (real-time applications requiring millisecond responses).
Handling Large Datasets
DDFMs scale well to large datasets via minibatch training, making them suitable for applications with hundreds of series. This is a key advantage over linear DFMs, which become computationally expensive beyond ~50 series.
Scalability Advantages: Minibatch training (process data in chunks), GPU acceleration (10-100x speedup), parallel processing. Linear DFM limitation: EM algorithm requires full dataset, becomes slow beyond ~50 series.
For Large Datasets (1000 series, 500 time steps): Use wider encoders (more units per layer, e.g., [256, 128, 64] instead of [64, 32]), more factors (3-5 factors instead of 1-2), larger batches (128 instead of 32 for stable gradients and better GPU utilization).
Computational Considerations: Large datasets require more GPU memory—reduce batch size if out of memory, or use CPU. Training time scales with dataset size: small (10 series, 100 steps—minutes), medium (100 series, 500 steps—10-30 minutes), large (1000 series, 1000 steps—hours, but feasible).
Comparison with Linear DFM: Linear DFM has computational cost $O(n^3)$ per EM iteration (10 series: fast; 100 series: slow; 1000 series: impractical). DDFM has computational cost $O(n \cdot m)$ per minibatch (scales linearly with series count; 1000 series: feasible).
Common Issues and Solutions
DDFM training can encounter various issues. Understanding symptoms and solutions helps troubleshoot effectively.
Issue 1: Training Loss Not Decreasing
Symptoms: Loss plateaus or increases over epochs, model not learning.
Root Causes: Learning rate too high or too low, poor initialization, data issues (outliers, missing values, scaling problems), architecture too complex.
Solutions: Reduce learning rate (try 0.001 or 0.002 if loss increases; default is 0.005 with exponential decay), increase batch size (default is 100; try 128 for more stable gradients), simplify architecture (fewer layers, fewer units), check data quality (remove outliers, ensure proper scaling), enable pre-training (default: enabled, uses non-missing data), monitor training history (check if loss increases or plateaus).
Issue 2: Overfitting
Symptoms: Training loss decreases but validation loss increases, model memorizes training data.
Root Causes: Model too complex, insufficient data, no regularization.
Solutions: Reduce model capacity (fewer layers/units—e.g., [128, 64, 32] → [64, 32]), add regularization (weight decay, dropout, early stopping), more data (collect more time periods or use all available data), cross-validation (use time-based validation—train on past, validate on recent).
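Time-based validation can be sketched as an expanding-window split: each fold trains on all observations up to a cutoff and validates on the block that immediately follows, so no future data leaks into training. The function name and arguments below are hypothetical and written in pure Python for illustration.

```python
def expanding_window_splits(n_obs, n_folds, test_size):
    """Return (train_range, test_range) pairs ordered in time, without shuffling."""
    splits = []
    for fold in range(n_folds):
        test_end = n_obs - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough observations for the requested folds")
        splits.append((range(0, test_start), range(test_start, test_end)))
    return splits

splits = expanding_window_splits(n_obs=100, n_folds=3, test_size=10)
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}, validate {test_idx[0]}..{test_idx[-1]}")
```

Averaging a validation metric across the folds gives a more realistic estimate of out-of-sample performance than a single split.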
Issue 3: Factors Are Too Smooth or Too Noisy
Symptoms: Factors don't capture variation (too smooth), or are erratic (too noisy), or don't align with economic intuition.
Root Causes: Too smooth (encoder not learning), too noisy (overfitting), wrong number of factors (too few or too many).
Solutions: Adjust number of factors (too smooth → try 2-3 factors; too noisy → reduce factors), tune learning rate (default 0.005 with exponential decay; lower LR for smoother factors), check encoder architecture (too smooth → wider/deeper; too noisy → simpler), enable pre-training (default: enabled, stabilizes initialization), compare with linear DFM (factor correlation should be 0.7-0.9—if < 0.5, may be overfitting), check factor variance.
Issue 4: GPU Out of Memory
Symptoms: CUDA out of memory error.
Solutions: Reduce batch size (try 16 or 8), reduce encoder size (fewer units), process data in chunks, use CPU (slower but works).
Best Practices
Start with linear DFM: Establish baseline before trying DDFM (understand data characteristics, ensure data quality, provides comparison point).
Use simple architecture first: [64, 32] with 1-2 factors, default settings (activation='relu', batch_size=100, learning_rate=0.005 with exponential decay). Avoid overfitting, faster training, easier to debug; increase complexity if underfitting.
Monitor training: Plot loss over epochs, check for overfitting (loss should decrease or plateau, training vs. validation loss should track each other, factor stability should not jump erratically).
Compare factors: DDFM factors should correlate with linear DFM factors (expected correlation: 0.7-0.9; too low < 0.5: DDFM may be capturing noise; too high > 0.95: DDFM may not be adding value).
Validate on holdout: Use time-based cross-validation (train on past, validate on recent—don't shuffle time; realistic evaluation, prevents overfitting).
Interpret results: Plot factors, check loadings, compare forecasts (factors should make economic sense, loadings should align with economic intuition, forecasts should be reasonable).
This tutorial covered: DDFM basics (neural network-based factor extraction, extending linear DFMs), architecture customization (encoder layers, activations, hyperparameters), training procedure (gradient descent vs. EM algorithm, monitoring convergence), comparison with linear DFM (when to use each, performance trade-offs), actual results (with finance data: 2 factors extracted from 22 series, factor correlation 0.75-0.85 with linear DFM), common issues (training problems, overfitting, factor interpretation), best practices (start simple, validate, compare).
DDFMs provide a powerful extension to linear DFMs, capturing nonlinear relationships that emerge during structural breaks. While they require more data and computation, they can significantly improve forecasting accuracy, especially during volatile periods. Empirical evidence (Korean GDP study) shows 44% improvement in MAE during COVID-19.
Use DDFM when: You suspect nonlinear relationships or regime switches, have sufficient data (200+ time steps, 10+ series), linear DFM performance is poor—especially during volatile periods, or are willing to trade some interpretability for better accuracy.
Use Linear DFM when: Relationships are approximately linear, interpretability is critical, limited data is available (< 100 time steps), or fast inference is required.
For theoretical details on how autoencoders generalize PCA and the variational framework, see Section 6-04. For practical DFM implementation, see Section 6-03.