Commits
28 commits
7d5249e
docs: add Tier 1 extensions design for Impulso
thomaspinder Mar 8, 2026
c0a854b
docs: add Tier 1 extensions implementation plan
thomaspinder Mar 8, 2026
97b2939
test: add tests for VARData.with_dummy_observations()
thomaspinder Mar 8, 2026
70cd5d5
feat: implement VARData.with_dummy_observations()
thomaspinder Mar 8, 2026
4134851
fix: add exog test coverage and n_lags docstring for dummy obs
thomaspinder Mar 8, 2026
72a6e4c
test: add tests for ConjugateVAR
thomaspinder Mar 8, 2026
4e47925
feat: add ConjugateVAR with direct NIW posterior sampling
thomaspinder Mar 8, 2026
dc657c8
refactor: extract shared NIW param computation, use conftest fixture
thomaspinder Mar 8, 2026
7de236f
test: add tests for GLP hierarchical prior selection
thomaspinder Mar 8, 2026
da26f5d
test: add tests for long-run Blanchard-Quah identification
thomaspinder Mar 8, 2026
a8390ef
feat: add Blanchard-Quah long-run identification
thomaspinder Mar 8, 2026
36f6802
test: add ForecastCondition and conditional forecast tests
thomaspinder Mar 9, 2026
aedae13
feat: add conditional forecasting on FittedVAR
thomaspinder Mar 9, 2026
c681655
feat: add conditional forecasting on IdentifiedVAR
thomaspinder Mar 9, 2026
7cfc667
fix: add input validation to LongRunRestriction.identify()
thomaspinder Mar 9, 2026
8f5572a
fix: add type annotations and exog_future guard for conditional forecast
thomaspinder Mar 9, 2026
5449ac1
fix: address final review findings
thomaspinder Mar 9, 2026
17c1674
docs: add Tier 1 documentation design
thomaspinder Mar 9, 2026
73a8a7d
docs: add Tier 1 documentation implementation plan
thomaspinder Mar 9, 2026
e285fdd
docs: add reference pages for conjugate and conditions modules
thomaspinder Mar 9, 2026
9c085fe
docs: update landing page with new features and conjugate example
thomaspinder Mar 9, 2026
7023d11
docs: add how-to guide for conjugate estimation
thomaspinder Mar 9, 2026
51137e6
docs: add how-to guide for long-run restrictions
thomaspinder Mar 9, 2026
5f49835
docs: add how-to guide for conditional forecasting
thomaspinder Mar 9, 2026
008f042
docs: extend Minnesota prior explanation with conjugacy, GLP, and dum…
thomaspinder Mar 9, 2026
a7d8dc1
docs: add long-run restrictions theory to identification explanation
thomaspinder Mar 9, 2026
787fba4
docs: add conditional forecasting theory explanation
thomaspinder Mar 9, 2026
2bf00ae
Drop plans from docs
thomaspinder Mar 9, 2026
53 changes: 53 additions & 0 deletions docs/explanation/conditional-forecasting.md
@@ -0,0 +1,53 @@
# Conditional Forecasting

## The problem

A standard VAR forecast projects all variables forward simultaneously, with no external constraints. But many questions in macroeconomics and finance require **conditional** projections:

- Central banks publish forecasts conditioned on assumed interest rate paths
- Financial institutions stress-test portfolios under assumed macroeconomic scenarios
- Policy analysts ask "what if" questions about specific variable trajectories

Conditional forecasting provides the mathematical framework for these exercises.

## The idea

A VAR forecast can be decomposed into two parts:

$$y_{t+h} = \underbrace{y_{t+h}^{u}}_{\text{unconditional}} + \underbrace{\sum_{s=0}^{h} \Phi_{h-s} \, P \, \varepsilon_{s}}_{\text{shock-driven deviation}}$$

where $y^{u}_{t+h}$ is the unconditional forecast (no future shocks), $\Phi_j$ are the moving-average (MA) coefficient matrices, $P$ is the impact matrix (Cholesky factor of $\Sigma$ or a structural matrix), and $\varepsilon_s$ are the future structural shocks.

The unconditional forecast is deterministic given the posterior draw. The future shocks are unknown. Conditional forecasting amounts to **choosing the shock paths** that make the forecast satisfy the desired constraints.

## The Waggoner-Zha algorithm

Waggoner & Zha (1999) showed that this can be formulated as a linear system. Stack all future shocks into a vector $\varepsilon = [\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_{H-1}]$ of length $H \times n$, where $H$ is the number of forecast steps and $n$ is the number of variables.

Each constraint (e.g., "variable $i$ equals value $v$ at period $h$") translates into a linear equation:

$$R \, \varepsilon = c$$

where $R$ is constructed from the MA coefficients and the impact matrix, and $c$ contains the differences between target values and unconditional forecasts.

The minimum-norm solution $\varepsilon^* = R^\top (R R^\top)^{-1} c$ gives the **smallest set of shocks** (in a least-squares sense) that satisfies all constraints. This is computed via `numpy.linalg.lstsq`.

The conditional forecast is then:

$$y_{t+h}^{c} = y_{t+h}^{u} + \sum_{s=0}^{h} \Phi_{h-s} \, P \, \varepsilon^*_s$$
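
The two equations above can be sketched end-to-end for a toy bivariate VAR(1). All values here (coefficients, horizon, the single constraint) are illustrative; the point is the minimum-norm solve and the check that the constrained variable hits its target:

```python
import numpy as np

# Toy stationary VAR(1): y_{t+1} = A @ y_t + P @ eps_t  (values illustrative)
A = np.array([[0.5, 0.1],
              [0.2, 0.4]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
P = np.linalg.cholesky(Sigma)      # impact matrix
y_last = np.array([1.0, 2.0])      # last observed value
H, n = 3, 2                        # forecast steps, number of variables

# MA coefficients of a VAR(1): Phi_j = A^j
Phi = [np.linalg.matrix_power(A, j) for j in range(H)]

# Unconditional forecast (no future shocks), 0-indexed periods 0..H-1
y_unc = np.empty((H, n))
y = y_last
for h in range(H):
    y = A @ y
    y_unc[h] = y

# One constraint: variable 1 equals 2.5 at period 1
i_var, h_star, target = 1, 1, 2.5
R = np.zeros((1, H * n))
for s in range(h_star + 1):
    R[0, s * n:(s + 1) * n] = (Phi[h_star - s] @ P)[i_var]
c = np.array([target - y_unc[h_star, i_var]])

# Minimum-norm shock path satisfying R @ eps = c
eps = np.linalg.lstsq(R, c, rcond=None)[0].reshape(H, n)

# Conditional forecast = unconditional + shock-driven deviation
y_cond = y_unc.copy()
for h in range(H):
    for s in range(h + 1):
        y_cond[h] += Phi[h - s] @ P @ eps[s]

print(y_cond[h_star, i_var])   # the target, 2.5, up to floating point
```

Note that `lstsq` returns the minimum-norm solution for this underdetermined system, so shocks after the constrained period come out exactly zero.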

## Structural extensions

When the model is identified (you have a structural impact matrix $P$ rather than just the Cholesky factor of $\Sigma$), you can also condition on **structural shock paths**. For example, "assume the supply shock is zero for the next 4 periods" translates into direct constraints on elements of $\varepsilon$.

Observable constraints and shock constraints can be combined in a single system. Observable constraints use the full MA representation (involving $\Phi$ and $P$), while shock constraints are simpler — they directly pin individual elements of $\varepsilon$.

## Bayesian uncertainty

The algorithm is applied independently to each posterior draw of $(B, \Sigma)$ or $(B, P)$. This means:

- The unconditional forecast differs across draws (parameter uncertainty)
- The MA coefficients differ across draws
- The shock paths satisfying the constraints differ across draws

The result is a full posterior distribution of conditional forecasts, from which you can compute medians, HDIs, and other summaries. The constrained variables will hit their targets exactly in every draw, but the unconstrained variables will show genuine posterior uncertainty about the conditional projection.
22 changes: 22 additions & 0 deletions docs/explanation/identification.md
@@ -37,3 +37,25 @@ scheme = SignRestriction(
```

Sign restrictions are weaker than Cholesky (they don't uniquely identify the model), but they require fewer assumptions.

## Long-run restrictions (Blanchard-Quah)

Cholesky and sign restrictions both constrain the **contemporaneous** (impact) response to structural shocks. Long-run restrictions take a different approach: they constrain the **cumulative** response as the horizon goes to infinity.

The key object is the **long-run multiplier matrix**:

$$C(1) = (I - A_1 - A_2 - \cdots - A_p)^{-1}$$

This matrix captures the total cumulative effect of a one-time shock. If the VAR is stationary, all shocks are transitory and $C(1)$ is finite. The long-run impact of structural shocks is $C(1) P$, where $P$ is the structural impact matrix.

Blanchard & Quah (1989) proposed forcing $C(1) P$ to be **lower triangular**. This is achieved by applying the Cholesky decomposition not to the residual covariance $\Sigma$ (as in short-run Cholesky) but to the long-run covariance:

$$C(1) \, \Sigma \, C(1)^\top = L \, L^\top$$

The structural impact matrix is then $P = C(1)^{-1} L$.
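
The construction can be verified in a few lines of numpy. The coefficient and covariance values are illustrative:

```python
import numpy as np

# Toy stationary VAR(2) coefficients (illustrative values)
A1 = np.array([[0.4, 0.1],
               [0.0, 0.3]])
A2 = np.array([[0.1, 0.0],
               [0.1, 0.1]])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 0.8]])

# Long-run multiplier C(1) = (I - A1 - A2)^{-1}
C1 = np.linalg.inv(np.eye(2) - A1 - A2)

# Cholesky of the long-run covariance C(1) Sigma C(1)'
L = np.linalg.cholesky(C1 @ Sigma @ C1.T)

# Structural impact matrix
P = np.linalg.inv(C1) @ L

# P still reproduces the residual covariance: P P' = Sigma
print(np.allclose(P @ P.T, Sigma))              # True
# and the long-run impact C(1) P is lower triangular
print(np.allclose(np.triu(C1 @ P, k=1), 0.0))   # True
```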

The interpretation depends on the variable ordering:
- Shocks later in the ordering have **zero long-run cumulative effect** on variables earlier in the ordering
- The first shock is unrestricted in its long-run effects

This is exactly the same Cholesky math, applied to a different matrix. The economics, however, are very different: you're restricting permanent effects rather than contemporaneous effects. In the classic Blanchard-Quah example, ordering output before prices means the second shock (interpreted as a demand shock) has no permanent effect on output — only supply shocks do.
69 changes: 68 additions & 1 deletion docs/explanation/minnesota-prior.md
@@ -1,6 +1,6 @@
# The Minnesota Prior

The **Minnesota prior** (Litterman, 1986) is the most widely used prior for Bayesian VARs. It encodes the belief that each variable follows a random walk, with coefficients on other variables' lags shrunk toward zero.

## Key hyperparameters

@@ -27,3 +27,70 @@ spec = VAR(lags=4, prior="minnesota")
prior = MinnesotaPrior(tightness=0.2, decay="geometric", cross_shrinkage=0.3)
spec = VAR(lags=4, prior=prior)
```

## Conjugate estimation

When the Minnesota prior is paired with a Normal-Inverse-Wishart (NIW) likelihood, the posterior belongs to the same family — this is called **conjugacy**. The practical consequence is dramatic: instead of running an iterative MCMC sampler, we can compute the posterior parameters in closed form and draw from it directly.

The prior specifies:

$$B \mid \Sigma \sim \mathcal{MN}(B_0, \Sigma, V_0), \qquad \Sigma \sim \mathcal{IW}(S_0, \nu_0)$$

where $\mathcal{MN}$ is the matrix normal distribution and $\mathcal{IW}$ is the inverse Wishart. Given data matrices $Y$ (observations) and $X$ (lagged regressors), the posterior parameters are:

$$V_{\text{post}} = (V_0^{-1} + X^\top X)^{-1}$$

$$B_{\text{post}} = V_{\text{post}}(V_0^{-1} B_0 + X^\top Y)$$

$$\nu_{\text{post}} = \nu_0 + T$$

$$S_{\text{post}} = S_0 + Y^\top Y + B_0^\top V_0^{-1} B_0 - B_{\text{post}}^\top V_{\text{post}}^{-1} B_{\text{post}}$$

The posterior mean for $B$ is a precision-weighted average of the prior mean $B_0$ and the OLS estimate $(X^\top X)^{-1} X^\top Y$. When the prior is tight (small $V_0$), the posterior stays close to the random walk prior. When data is abundant (large $X^\top X$), the posterior approaches OLS. The posterior for $\Sigma$ is an inverse Wishart with updated scale and degrees of freedom.

Sampling is straightforward: draw $\Sigma \sim \mathcal{IW}(S_{\text{post}}, \nu_{\text{post}})$, then $B \mid \Sigma \sim \mathcal{MN}(B_{\text{post}}, \Sigma, V_{\text{post}})$. Each draw is independent — no burn-in, no autocorrelation, no convergence diagnostics. `ConjugateVAR` implements this approach.
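
A minimal numpy/scipy sketch of the update and sampling steps, using the common convention that $B$ is $k \times n$ with row covariance $V$ and column covariance $\Sigma$. The toy sizes and flat prior values are illustrative, not Impulso's defaults:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)

# Toy data: T observations, n variables, k regressors (illustrative sizes)
T, n, k = 200, 2, 5
X = rng.standard_normal((T, k))
Y = rng.standard_normal((T, n))

# Prior: B | Sigma ~ MN(B0, V0, Sigma),  Sigma ~ IW(S0, nu0)
B0 = np.zeros((k, n))
V0 = np.eye(k)
S0 = np.eye(n)
nu0 = n + 2

# Closed-form posterior parameters
V0_inv = np.linalg.inv(V0)
V_post = np.linalg.inv(V0_inv + X.T @ X)
B_post = V_post @ (V0_inv @ B0 + X.T @ Y)
nu_post = nu0 + T
S_post = (S0 + Y.T @ Y + B0.T @ V0_inv @ B0
          - B_post.T @ np.linalg.inv(V_post) @ B_post)

# Direct, independent posterior draws: no burn-in, no convergence checks
Sigma_draw = invwishart.rvs(df=nu_post, scale=S_post, random_state=rng)
Z = rng.standard_normal((k, n))
B_draw = B_post + np.linalg.cholesky(V_post) @ Z @ np.linalg.cholesky(Sigma_draw).T
```

Repeating the last three lines yields as many i.i.d. posterior draws as needed.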

## Data-driven prior selection

The Minnesota prior's hyperparameters — `tightness`, `cross_shrinkage`, and `decay` — are often chosen by convention or trial-and-error. Giannone, Lenza & Primiceri (2015) proposed an empirical Bayes approach: choose hyperparameters by maximising the **marginal likelihood**.

The marginal likelihood is the probability of the observed data $Y$ given hyperparameters $\lambda$, after integrating out all model parameters:

$$p(Y \mid \lambda) = \int p(Y \mid B, \Sigma) \, p(B, \Sigma \mid \lambda) \, dB \, d\Sigma$$

Thanks to conjugacy, this integral has a closed-form solution involving determinants and multivariate gamma functions. Optimising $\log p(Y \mid \lambda)$ over $\lambda = (\text{tightness}, \text{cross\_shrinkage})$ using standard numerical optimisation (L-BFGS-B) is fast and reliable.
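
A sketch of the closed form, as it is usually written in the conjugate VAR literature. The helper name and demo values are illustrative, and additive constants that do not depend on the hyperparameters could be dropped when optimising:

```python
import numpy as np
from scipy.special import multigammaln

def log_marginal_likelihood(Y, X, B0, V0, S0, nu0):
    """Closed-form log p(Y | hyperparameters) for the conjugate NIW prior
    (a sketch; not Impulso's internal implementation)."""
    T, n = Y.shape
    V0_inv = np.linalg.inv(V0)
    V_post = np.linalg.inv(V0_inv + X.T @ X)
    B_post = V_post @ (V0_inv @ B0 + X.T @ Y)
    nu_post = nu0 + T
    S_post = (S0 + Y.T @ Y + B0.T @ V0_inv @ B0
              - B_post.T @ np.linalg.inv(V_post) @ B_post)

    _, ld_V0 = np.linalg.slogdet(V0)
    _, ld_Vp = np.linalg.slogdet(V_post)
    _, ld_S0 = np.linalg.slogdet(S0)
    _, ld_Sp = np.linalg.slogdet(S_post)

    return (-0.5 * T * n * np.log(np.pi)
            + 0.5 * n * (ld_Vp - ld_V0)
            + 0.5 * nu0 * ld_S0 - 0.5 * nu_post * ld_Sp
            + multigammaln(0.5 * nu_post, n) - multigammaln(0.5 * nu0, n))

rng = np.random.default_rng(1)
T, n, k = 120, 2, 3
X = rng.standard_normal((T, k))
Y = rng.standard_normal((T, n))
B0, S0, nu0 = np.zeros((k, n)), np.eye(n), n + 2

# Compare a tight and a loose prior (V0 = lambda^2 * I); the optimum over
# lambda would be found numerically, e.g. with scipy.optimize.minimize
lml_tight = log_marginal_likelihood(Y, X, B0, 0.01 * np.eye(k), S0, nu0)
lml_loose = log_marginal_likelihood(Y, X, B0, 10.0 * np.eye(k), S0, nu0)
```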

The marginal likelihood naturally balances two forces:

- **Fit**: a loose prior (high tightness) lets the model fit the data closely
- **Parsimony**: a loose prior also spreads probability mass over implausible parameter values, reducing the marginal likelihood

The optimal hyperparameters sit at the sweet spot where the prior is just flexible enough to capture the data's patterns without wasting probability on noise.

This approach is sometimes called **GLP** after the authors' initials. In Impulso, use `ConjugateVAR.optimize_prior()` or the shorthand `prior="minnesota_optimized"`.

## Dummy observation priors

An elegant trick from the classical VAR literature encodes prior beliefs not by modifying the prior distribution directly, but by **appending synthetic observations** to the dataset. These "dummy observations" push the posterior in the desired direction while preserving conjugacy.

### Sum-of-coefficients prior

The sum-of-coefficients prior (Doan, Litterman & Sims, 1984) encodes the belief that if all variables have been at their initial sample values forever, they should persist at those values. Formally, it adds observations implying:

$$\sum_{j=1}^{p} A_j \approx I$$

where $A_j$ are the lag coefficient matrices. This discourages the model from predicting rapid mean-reversion when it isn't supported by the data. The hyperparameter $\mu$ controls the prior's strength: smaller $\mu$ imposes the belief more tightly.
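
The dummy rows are short enough to sketch directly. The regressor layout assumed here (lag blocks first, then a constant column) is an assumption of this sketch, not necessarily Impulso's internal layout:

```python
import numpy as np

def sum_of_coefficients_dummies(y0_mean, p, mu):
    """Dummy observations for the sum-of-coefficients prior (a sketch).

    y0_mean : (n,) means of the initial observations
    p       : number of lags
    mu      : tightness; smaller mu imposes sum_j A_j = I more strongly
    Returns (Y_dummy, X_dummy), one row per variable, assuming the
    regressor layout [y_{t-1}, ..., y_{t-p}, constant].
    """
    n = len(y0_mean)
    D = np.diag(y0_mean) / mu
    Y_dummy = D                                        # (n, n)
    X_dummy = np.hstack([D] * p + [np.zeros((n, 1))])  # (n, n*p + 1)
    return Y_dummy, X_dummy

Y_d, X_d = sum_of_coefficients_dummies(np.array([1.2, 0.8]), p=2, mu=1.0)

# Any coefficient matrix whose lag blocks sum to I fits the dummies exactly
B = np.vstack([0.5 * np.eye(2), 0.5 * np.eye(2), np.zeros((1, 2))])
print(np.allclose(X_d @ B, Y_d))   # True
```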

### Single-unit-root prior

The single-unit-root prior (Sims, 1993) encodes the belief that the variables may share a common stochastic trend. It adds a single dummy observation in which every variable sits at its initial sample mean, making the model reluctant to move away from that configuration unless the data demand it; unlike the sum-of-coefficients prior, it remains consistent with cointegration. The hyperparameter $\delta$ controls its strength: smaller $\delta$ imposes the belief more tightly.

### Practical usage

In Impulso, dummy observations are added via `VARData.with_dummy_observations()`:

```python
data_augmented = data.with_dummy_observations(n_lags=4, mu=1.0, delta=1.0)
```

The augmented data can then be passed to any estimation method — `ConjugateVAR`, `VAR`, or used with `optimize_prior()`. Because the dummy observations are simply extra rows in the data matrices, they preserve conjugacy and integrate seamlessly with marginal likelihood optimisation.
113 changes: 113 additions & 0 deletions docs/how-to/conditional-forecasting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
# Conditional Forecasting

Standard forecasts let the model speak freely — given the historical data, where do the variables go next? But policy analysis often needs to answer a different question: **what happens if we assume a specific path for one variable?**

For example:
- "What happens to GDP and inflation if the central bank raises the policy rate by 25bp at each of the next 4 meetings?"
- "What is the inflation outlook if oil prices stay at \$80/barrel for the next year?"

Conditional forecasting answers these questions. It finds the **smallest set of shocks** consistent with the assumed path, then traces out the implications for all other variables. This implements the algorithm of Waggoner & Zha (1999).

## Defining conditions

A `ForecastCondition` specifies the variable, the periods (0-indexed forecast steps), and the target values:

```python
from impulso import ForecastCondition

# "The policy rate rises from 5.25 to 6.00 over steps 0 through 3"
rate_path = ForecastCondition(
variable="rate",
periods=[0, 1, 2, 3],
values=[5.25, 5.50, 5.75, 6.00],
)
```

You can specify multiple conditions on different variables:

```python
# Also condition on oil prices
oil_path = ForecastCondition(
variable="oil",
periods=[0, 1, 2, 3],
values=[80.0, 80.0, 80.0, 80.0],
)
```

!!! note "Periods are 0-indexed"
Period 0 is the first forecast step (one step ahead of the last observation). Period 3 is four steps ahead.

## Reduced-form conditional forecasts

On a `FittedVAR` (before identification), conditional forecasts use the Cholesky factor of the residual covariance to define the shock space:

```python
from impulso import VAR, VARData, ForecastCondition

data = VARData.from_df(df, endog=["gdp", "inflation", "rate"])
fitted = VAR(lags=4, prior="minnesota").fit(data)

# Condition on a rising rate path
rate_path = ForecastCondition(
variable="rate",
periods=[0, 1, 2, 3],
values=[5.25, 5.50, 5.75, 6.00],
)

result = fitted.conditional_forecast(steps=8, conditions=[rate_path])
```

The result is a `ConditionalForecastResult` — a subclass of `ForecastResult` that also stores the conditions. You can inspect it the same way:

```python
# Posterior median conditional forecast
result.median()

# HDI credible intervals
result.hdi(prob=0.89)

# Plot
result.plot()
```

The constrained variable ("rate") will hit its target values exactly at the specified periods. The unconstrained variables ("gdp", "inflation") show the model's best prediction given those constraints — how the economy would evolve under the assumed rate path.

!!! tip "Interpreting the results"
The conditional forecast answers: "What is the most likely path for all variables, given that `rate` follows the specified path?" The uncertainty bands for unconstrained variables reflect both parameter uncertainty (different posterior draws give different answers) and the uncertainty about which combination of shocks would produce the assumed path.

## Structural conditional forecasts

If you have an identified model, you can also condition on **structural shock paths**. This is useful for scenario analysis: "What happens if there are no supply shocks for the next 4 quarters?"

```python
from impulso.identification import Cholesky

identified = fitted.set_identification_strategy(
Cholesky(ordering=["gdp", "inflation", "rate"])
)

# Condition on zero supply shocks
no_supply_shocks = ForecastCondition(
variable="gdp", # shock named after the first variable in ordering
periods=[0, 1, 2, 3],
values=[0.0, 0.0, 0.0, 0.0],
)

result = identified.conditional_forecast(
steps=8,
conditions=[rate_path],
shock_conditions=[no_supply_shocks],
)
```

Here, `conditions` constrain observable variable paths (as before), and `shock_conditions` constrain structural shock paths. You can use either or both.

!!! warning "Shock naming"
Shock names correspond to the variable names in the identification scheme's ordering. With `Cholesky(ordering=["gdp", "inflation", "rate"])`, the shocks are named "gdp", "inflation", and "rate".

## Degrees of freedom

Each condition uses up one degree of freedom per constrained period. The total number of constraints cannot exceed `steps * n_variables` (the total number of shock values to be determined). In practice, keeping constraints well below this limit gives more stable results — the system becomes increasingly sensitive as you approach the maximum.
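
This bound is easy to check before fitting. The sketch below uses plain `(variable, periods)` pairs rather than Impulso's `ForecastCondition` objects:

```python
def check_degrees_of_freedom(conditions, steps, n_variables):
    """Count constraints and compare against the number of free shock values.

    `conditions` is a list of (variable_name, periods) pairs (illustrative
    helper, not part of the Impulso API).
    """
    n_constraints = sum(len(periods) for _, periods in conditions)
    n_free = steps * n_variables
    if n_constraints > n_free:
        raise ValueError(
            f"{n_constraints} constraints exceed {n_free} free shock values"
        )
    return n_constraints, n_free

print(check_degrees_of_freedom([("rate", [0, 1, 2, 3])], steps=8, n_variables=3))
# (4, 24)
```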

!!! tip "Start simple"
Begin with conditions on one variable at a few periods. Add more constraints incrementally and check that results remain sensible. Over-constraining can produce large, implausible shocks.