Can a Transformer "Learn" Economic Relationships?

Replication code for Arpit Gupta and Alex Imas, "Can a Transformer 'Learn' Economic Relationships? Revisiting the Lucas Critique in the age of Transformers" (Arpitrage, December 22, 2025).

This repository implements a simulation-based replication of the post's main exercise: train a causal Transformer on trajectories generated by a linear New Keynesian (NK) model and evaluate whether it can reproduce out-of-sample dynamics and impulse responses under unseen parameter draws. The exercise is deliberately narrow. The data-generating process is known, fixed, and simulated; the default test split holds out a more aggressive monetary-policy regime; and the Transformer is given more information than a reduced-form time-series benchmark.

Model

The simulator uses a linearised three-equation NK model:

$$ x_t = \mathbb{E}_t x_{t+1} - \frac{1}{\sigma}(i_t - \mathbb{E}_t \pi_{t+1} - r_t^n), $$

$$ \pi_t = \beta \mathbb{E}_t \pi_{t+1} + \kappa x_t + u_t, $$

$$ i_t = \phi_\pi \pi_t + \phi_x x_t + v_t. $$

The exogenous states follow independent AR(1) processes:

$$ r_t^n = \rho_r r_{t-1}^n + \varepsilon_t^r,\quad u_t = \rho_u u_{t-1} + \varepsilon_t^u,\quad v_t = \rho_v v_{t-1} + \varepsilon_t^v, $$

with Gaussian innovations and shock-specific standard deviations.

For each parameter vector, the model is solved as a linear rational-expectations system. The policy function is

$$ y_t = P s_t, $$

where $y_t=(x_t,\pi_t)$ and $s_t=(r_t^n,u_t,v_t)$. The interest rate is then recovered from the Taylor rule. Parameter draws that violate the Taylor principle or produce a singular solution matrix are rejected.
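One way to obtain $P$ is the method of undetermined coefficients. The sketch below is an illustration, not the repository's solver: the function name `solve_nk`, the determinant tolerance, and the use of the standard determinacy condition $\kappa(\phi_\pi-1)+(1-\beta)\phi_x>0$ as the Taylor-principle check are all assumptions. Substituting the Taylor rule into the IS curve gives $A y_t = B\,\mathbb{E}_t y_{t+1} + C s_t$, and because the shocks are independent AR(1) processes each column of $P$ solves a separate $2\times 2$ system.

```python
import numpy as np


def solve_nk(sigma, beta, kappa, phi_pi, phi_x, rhos):
    """Return the 2x3 matrix P in y_t = P s_t, or None if the draw is rejected.

    y_t = (x_t, pi_t), s_t = (r^n_t, u_t, v_t), rhos = (rho_r, rho_u, rho_v).
    """
    # Assumed rejection rule: standard determinacy condition for this Taylor rule.
    if kappa * (phi_pi - 1.0) + (1.0 - beta) * phi_x <= 0.0:
        return None
    # IS curve with the Taylor rule substituted in, stacked with the Phillips curve:
    # A y_t = B E_t y_{t+1} + C s_t.
    A = np.array([[1.0 + phi_x / sigma, phi_pi / sigma],
                  [-kappa, 1.0]])
    B = np.array([[1.0, 1.0 / sigma],
                  [0.0, beta]])
    C = np.array([[1.0 / sigma, 0.0, -1.0 / sigma],
                  [0.0, 1.0, 0.0]])
    P = np.empty((2, 3))
    for j, rho in enumerate(rhos):
        # E_t s_{t+1} = diag(rhos) s_t, so column j solves (A - rho_j B) P_j = C_j.
        M = A - rho * B
        if abs(np.linalg.det(M)) < 1e-12:  # tolerance is an assumption
            return None  # singular solution matrix: reject the draw
        P[:, j] = np.linalg.solve(M, C[:, j])
    return P


# Illustrative parameter values inside the prior ranges.
P = solve_nk(sigma=2.0, beta=0.99, kappa=0.1, phi_pi=1.5, phi_x=0.5,
             rhos=(0.8, 0.5, 0.5))
```

With these values a positive policy shock lowers both the output gap and inflation, which is a quick sanity check on the signs of the third column of `P`.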

Simulation Design

The default run draws 60,000 valid economies (50,000 for training, 5,000 for validation, and 5,000 for testing). Each economy is simulated for 200 periods, with the first 50 periods discarded to retain 150 observations.
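The simulation step for a single economy can be sketched as follows. This is a minimal illustration under stated assumptions: `simulate_economy` is a hypothetical name, the states are assumed to start at zero, and `P_demo` holds made-up numbers rather than a solved policy matrix.

```python
import numpy as np


def simulate_economy(P, rhos, sds, phi_pi, phi_x, T=200, burn_in=50, seed=0):
    """Simulate (x, pi, i) for T periods and drop the first burn_in periods."""
    rng = np.random.default_rng(seed)
    rhos = np.asarray(rhos)
    eps = rng.normal(0.0, np.asarray(sds), size=(T, 3))  # Gaussian innovations
    s = np.zeros(3)                                      # (r^n, u, v), assumed to start at zero
    obs = np.empty((T, 3))
    for t in range(T):
        s = rhos * s + eps[t]                  # independent AR(1) updates
        x, pi = P @ s                          # policy function y_t = P s_t
        i = phi_pi * pi + phi_x * x + s[2]     # Taylor rule with policy shock v_t
        obs[t] = (x, pi, i)
    return obs[burn_in:], eps[burn_in:]


# Made-up policy matrix, for illustration only.
P_demo = np.array([[1.0, -0.5, -0.8],
                   [0.3, 1.5, -0.2]])
y, eps = simulate_economy(P_demo, rhos=(0.8, 0.5, 0.5),
                          sds=(0.005, 0.004, 0.004), phi_pi=1.5, phi_x=0.5)
```

Discarding the burn-in removes the influence of the arbitrary initial state, so the retained 150 observations are close to draws from the stationary distribution.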

The parameter prior ranges are chosen to reflect recent empirical estimates (see parameters.md for a detailed literature review).

Parameter	Symbol	Range or fixed value	Reference
Inverse intertemporal elasticity of substitution	$\sigma$	$[1.0, 3.0]$	Galí (2008), Smets & Wouters (2007)
Discount factor	$\beta$	$0.99$	Woodford (2003), Galí (2008)
Phillips-curve slope	$\kappa$	$[0.01, 0.20]$	Hazell et al. (2022)
Taylor-rule inflation coefficient	$\phi_\pi$	$[1.1, 3.0]$	Clarida et al. (1999), Taylor (1993)
Taylor-rule output coefficient	$\phi_x$	$[0.0, 1.0]$	Taylor (1993), Smets & Wouters (2007)
Natural-rate persistence	$\rho_r$	$[0.50, 0.95]$	Smets & Wouters (2007)
Cost-push persistence	$\rho_u$	$[0.30, 0.80]$	Smets & Wouters (2007)
Policy-shock persistence	$\rho_v$	$[0.30, 0.70]$	Smets & Wouters (2007)
Natural-rate innovation sd	$\sigma_r$	$[0.004, 0.010]$	Smets & Wouters (2007)
Cost-push innovation sd	$\sigma_u$	$[0.001, 0.008]$	Smets & Wouters (2007)
Policy innovation sd	$\sigma_v$	$[0.001, 0.008]$	Smets & Wouters (2007)

By default, the policy-regime split holds out aggressive inflation-response Taylor rules for testing. Training and validation draw $\phi_\pi \in [1.1,2.4)$; test economies draw $\phi_\pi \in [2.4,3.0]$. Setting experiment.policy_holdout: none in the config file draws all splits from the full prior.
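The split logic can be sketched as below. The source gives ranges but not the sampling distribution, so uniform draws are an assumption, as are the names `draw_theta`, `PRIOR`, and the boolean `policy_holdout` argument (the repository's config uses the key experiment.policy_holdout).

```python
import random

# Ranges copied from the prior table; beta is fixed rather than drawn.
PRIOR = {
    "sigma": (1.0, 3.0), "kappa": (0.01, 0.20),
    "phi_pi": (1.1, 3.0), "phi_x": (0.0, 1.0),
    "rho_r": (0.50, 0.95), "rho_u": (0.30, 0.80), "rho_v": (0.30, 0.70),
    "sigma_r": (0.004, 0.010), "sigma_u": (0.001, 0.008),
    "sigma_v": (0.001, 0.008),
}
BETA = 0.99
PHI_PI_CUTOFF = 2.4


def draw_theta(split, rng, policy_holdout=True):
    """Draw one parameter vector for split in {'train', 'val', 'test'}."""
    # Uniform sampling is an assumption; the source only states the ranges.
    theta = {name: rng.uniform(lo, hi) for name, (lo, hi) in PRIOR.items()}
    theta["beta"] = BETA
    if policy_holdout:
        lo, hi = PRIOR["phi_pi"]
        if split == "test":
            theta["phi_pi"] = rng.uniform(PHI_PI_CUTOFF, hi)   # aggressive regime
        else:
            theta["phi_pi"] = rng.uniform(lo, PHI_PI_CUTOFF)   # training regime
    return theta


rng = random.Random(0)
train_draws = [draw_theta("train", rng) for _ in range(1000)]
test_draws = [draw_theta("test", rng) for _ in range(1000)]
```

Because the two $\phi_\pi$ intervals meet only at 2.4, no test economy shares its policy response with a training economy, while every other parameter is drawn from the same prior in all splits.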

For economy $i$ and date $t$, the supervised input is

$$ X_{i,t} = [\theta_i,\varepsilon_{i,t},y_{i,t-1}] \in \mathbb{R}^{17}, $$

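and the target is $y_{i,t}=(x_{i,t},\pi_{i,t},i_{i,t})$. Here $\theta_i$ stacks the 11 structural parameters, $\varepsilon_{i,t}$ the 3 innovations, and $y_{i,t-1}$ the 3 lagged observables, giving 17 features. Note that $y$ here includes the interest rate, whereas the policy function above is written for $(x_t,\pi_t)$ only. Features and targets are standardised with training-set moments. Generated arrays and normalisation statistics are cached under results/cache.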
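A minimal sketch of the feature construction for one economy follows. The helper names are hypothetical, dropping the first period (which has no lag) is an assumption, and so is the guard for zero-variance columns; some guard of this kind is needed because $\beta$ is fixed and therefore constant across economies.

```python
import numpy as np


def build_features(theta, eps, y):
    """Stack [theta, eps_t, y_{t-1}] for t = 1, ..., T-1 of one economy."""
    T = y.shape[0]
    theta_rows = np.tile(theta, (T - 1, 1))                     # (T-1, 11)
    X = np.concatenate([theta_rows, eps[1:], y[:-1]], axis=1)   # (T-1, 17)
    return X, y[1:]


def fit_scaler(train):
    """Column means and standard deviations computed on training rows only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std < 1e-12, 1.0, std)   # keep constant columns finite
    return mean, std


# Random placeholder arrays with the shapes used in the text.
rng = np.random.default_rng(0)
theta = rng.uniform(size=11)
eps = rng.normal(size=(150, 3))
y = rng.normal(size=(150, 3))
X, target = build_features(theta, eps, y)
mean, std = fit_scaler(X)
X_std = (X - mean) / std
```

In a full run the scaler would be fitted on rows pooled over all training economies and then applied unchanged to the validation and test splits.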

Transformer

The main model is a causal Transformer encoder with sinusoidal positional encodings. Key architecture details include:

Dimensions: Input (17), Output (3), Model (64), Feedforward (256)
Structure: 4 Layers, 4 Attention heads, 0.1 Dropout
Parameters: Approximately 184,000
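The two ingredients that make the encoder position-aware and causal can be illustrated in NumPy. This is a sketch of the standard constructions, not the repository's model code; the function names are hypothetical and the boolean convention (True means blocked) is the one used by PyTorch attention masks.

```python
import numpy as np


def sinusoidal_encoding(seq_len, d_model=64):
    """Standard sinusoidal positional encodings, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * freq)   # even columns
    pe[:, 1::2] = np.cos(pos * freq)   # odd columns
    return pe


def causal_mask(seq_len):
    """Boolean mask, True where attention is blocked (position t cannot see s > t)."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)


pe = sinusoidal_encoding(150)
mask = causal_mask(150)
```

The mask is what prevents the prediction for date $t$ from using information dated after $t$, which is required for the one-step and multi-step evaluations to be genuine forecasts.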

Training uses AdamW, mean squared error loss, cosine learning-rate decay, gradient clipping, early stopping, CUDA mixed precision when available, and torch.compile on supported CUDA installations.
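As an illustration of the cosine schedule, the learning rate at a given step can be computed as below. The base rate, the floor, and the absence of a warmup phase are assumptions; the actual values live in config.toml.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a 100-step run.
schedule = [cosine_lr(s, 100) for s in (0, 50, 100)]
```

The schedule falls slowly at first, fastest around the midpoint, and flattens near the end, which tends to pair well with early stopping on validation loss.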

Multi-step forecasts are generated autoregressively. The model is initialised with a 50-period context, receives future shock innovations during evaluation, and then uses its own predicted observables as subsequent lags.
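The rollout described above can be sketched with a stand-in model. Here `predict_last` is a hypothetical callable that takes the feature history and returns the prediction for its newest row; the toy linear rule used in the demonstration is not the Transformer and only exercises the feedback loop.

```python
import numpy as np


def rollout(predict_last, X_context, theta, eps_future, y_last):
    """Forecast len(eps_future) steps, feeding predictions back in as lags."""
    history = list(X_context)        # context rows built from realised data
    y_prev = y_last
    preds = []
    for eps_t in eps_future:
        # New feature row [theta, eps_t, y_{t-1}], with y_{t-1} the last prediction.
        history.append(np.concatenate([theta, eps_t, y_prev]))
        y_prev = predict_last(np.stack(history))
        preds.append(y_prev)
    return np.stack(preds)


# Toy stand-in: y_t = 0.5 * y_{t-1} + eps_t, read off the newest feature row.
toy = lambda X: 0.5 * X[-1, 14:] + X[-1, 11:14]
preds = rollout(toy, X_context=np.zeros((50, 17)), theta=np.zeros(11),
                eps_future=np.zeros((20, 3)), y_last=np.ones(3))
```

Because each predicted value becomes the next lag, errors compound with the horizon, which is why multi-step MSE is reported separately from one-step MSE.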

Benchmarks

Three reduced-form benchmarks are estimated separately for each test economy. The first two are used in the main comparison and the third in the y-only experiment:

OLS VAR: Lag order selected by AIC over $p \in \{1,\ldots,8\}$.
BVAR: Minnesota prior, fixed $p=4$, conjugate Normal-inverse-Wishart posterior.
Kalman VAR: VAR(1) written as a linear Gaussian state-space model (used in the y-only experiment).

The benchmarks use only realised observables. They do not observe the structural parameter vector or the contemporaneous innovation vector. This information difference is part of the experimental design and should be considered when interpreting relative performance.
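AIC-based lag selection for the OLS VAR can be sketched as follows. This is a generic implementation under assumptions: an intercept is included, all candidate lag orders are scored on a common estimation sample, and the function names are hypothetical rather than taken from benchmarks.py.

```python
import numpy as np


def fit_var(y, p, start=None):
    """OLS VAR(p) with intercept, using rows t = start, ..., T-1 as targets."""
    T = y.shape[0]
    start = p if start is None else start
    lags = [y[start - l:T - l] for l in range(1, p + 1)]
    Z = np.hstack([np.ones((T - start, 1))] + lags)
    Y = y[start:]
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ coef
    return coef, resid.T @ resid / len(Y)


def select_lag_aic(y, max_p=8):
    """Pick p in 1..max_p minimising AIC on a common estimation sample."""
    T, k = y.shape
    n = T - max_p
    scores = {}
    for p in range(1, max_p + 1):
        _, sigma = fit_var(y, p, start=max_p)
        _, logdet = np.linalg.slogdet(sigma)
        scores[p] = logdet + 2.0 * (k * k * p + k) / n   # fit plus parameter penalty
    return min(scores, key=scores.get)


# Synthetic VAR(1) data, for illustration only.
rng = np.random.default_rng(0)
y = np.zeros((2000, 3))
for t in range(1, 2000):
    y[t] = 0.5 * y[t - 1] + rng.normal(size=3)
coef, _ = fit_var(y, p=1)
p_hat = select_lag_aic(y[:150])
```

Note that the regressors contain only lagged observables, so this benchmark sees neither $\theta$ nor $\varepsilon_t$, in line with the information gap described above.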

Evaluation

The pipeline reports the following metrics:

One-step MSE: Predict $y_t$ from $\theta$, $\varepsilon_t$, and the true lag $y_{t-1}$; score after a 50-period warmup.
Multi-step MSE: Forecast horizons $h \in \{1,4,8,12,20\}$ from a 50-period context.
IRF MSE: Compare predicted impulse responses to analytical NK impulse responses over 20 quarters.
IRF sign accuracy: Share of nonzero-horizon responses for which the predicted sign matches the analytical response.
Sample-size curve: Retrain the Transformer on subsets of the training economies and compare with the per-economy VAR/BVAR baselines, which are flat because they do not depend on the number of training economies.

Transformer impulse responses are generated by feeding a one-standard-deviation structural shock at impact and zeros thereafter. VAR and BVAR impulse responses use Cholesky identification, so their IRFs are reduced-form comparisons rather than the same structural objects.
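The analytical reference and the two IRF metrics can be sketched as below. Several choices are assumptions: the function names, the use of 20 rows starting at impact ($h=0$), and the reading of "nonzero-horizon" as horizons $h \ge 1$. `P_demo` holds made-up numbers rather than a solved policy matrix.

```python
import numpy as np


def analytical_irf(P, rhos, sds, phi_pi, phi_x, shock, horizons=20):
    """Response of (x, pi, i) to a one-sd innovation in state `shock` at h = 0."""
    rhos = np.asarray(rhos)
    s = np.zeros(3)
    s[shock] = sds[shock]
    out = np.empty((horizons, 3))
    for h in range(horizons):
        x, pi = P @ s
        out[h] = (x, pi, phi_pi * pi + phi_x * x + s[2])
        s = rhos * s                  # no further innovations after impact
    return out


def irf_metrics(pred, true):
    """IRF MSE over all horizons and sign accuracy over horizons h >= 1."""
    mse = np.mean((pred - true) ** 2)
    sign_acc = np.mean(np.sign(pred[1:]) == np.sign(true[1:]))
    return mse, sign_acc


# Made-up policy matrix, for illustration only.
P_demo = np.array([[1.0, -0.5, -0.8],
                   [0.3, 1.5, -0.2]])
irf = analytical_irf(P_demo, rhos=(0.8, 0.5, 0.5), sds=(0.005, 0.004, 0.004),
                     phi_pi=1.5, phi_x=0.5, shock=2)
mse_same, acc_same = irf_metrics(irf, irf)
mse_flip, acc_flip = irf_metrics(-irf, irf)
```

Sign accuracy complements MSE because a prediction can be close in level near zero while still getting the direction of the response wrong.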

Y-only Robustness Experiment

The article's follow-up experiment removes the Transformer's access to structural parameters and innovations. This repository implements that comparison by training a second causal Transformer with input

$$ X_{i,t}^{y\text{-only}} = y_{i,t-1} $$

and target $y_{i,t}$. Its benchmark is a reduced-form Kalman filter: a VAR(1) fitted on the first 50 observations of each test economy and represented as a linear Gaussian state-space model. The pipeline reports one-step MSE for both models.
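A sketch of such a benchmark is given below. It is one possible reading of the description, not the repository's or the original authors' specification: the measurement equation $y_t = z_t + e_t$ with noise covariance `R` is an assumption, as are the function names and the synthetic data.

```python
import numpy as np


def fit_var1(y):
    """OLS VAR(1) with intercept: returns c, A, and residual covariance Q."""
    Z = np.hstack([np.ones((len(y) - 1, 1)), y[:-1]])
    coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
    resid = y[1:] - Z @ coef
    return coef[0], coef[1:].T, resid.T @ resid / len(resid)


def kalman_one_step(y, c, A, Q, R):
    """One-step-ahead predictions of y[1:], filtering forward from y[0].

    State: z_t = c + A z_{t-1} + w_t, w ~ N(0, Q).
    Measurement (assumed): y_t = z_t + e_t, e ~ N(0, R).
    """
    k = y.shape[1]
    z, S = y[0].copy(), np.zeros((k, k))
    preds = []
    for t in range(1, len(y)):
        z_pred = c + A @ z                           # predict
        S_pred = A @ S @ A.T + Q
        preds.append(z_pred)
        K = np.linalg.solve(S_pred + R, S_pred).T    # gain S_pred (S_pred + R)^{-1}
        z = z_pred + K @ (y[t] - z_pred)             # update
        S = (np.eye(k) - K) @ S_pred
    return np.stack(preds)


# Synthetic VAR(1) data, for illustration only.
rng = np.random.default_rng(0)
y = np.zeros((150, 3))
for t in range(1, 150):
    y[t] = 0.5 * y[t - 1] + rng.normal(size=3)
c, A, Q = fit_var1(y[:50])                           # fit on the first 50 observations
preds = kalman_one_step(y[49:], c, A, Q, R=np.zeros((3, 3)))
mse = np.mean((preds - y[50:]) ** 2)
```

With `R` set to zero the gain is the identity and the filter reduces to the plain VAR(1) forecast $c + A y_{t-1}$, so the state-space form only changes the predictions when measurement noise is assumed.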

Limitations

This replication should not be read as evidence that a Transformer has recovered a structural macroeconomic model in the usual econometric sense.

First, the DGP is fixed to the linearised NK model. The experiment tests learning within a known model class, not robustness to a different structural economy.

Second, the main Transformer observes structural parameters and contemporaneous innovations. This is a stronger information set than the one available to the VAR and BVAR benchmarks, which use only realised observables.

Third, the y-only experiment removes that information advantage, but its Kalman benchmark is a reduced-form VAR state-space approximation. It is not a structural estimator of the NK state, and it may not match the exact Kalman specification used by the original authors.

Fourth, the default high-$\phi_\pi$ hold-out evaluates a nearby policy-regime shift within the same NK model class. It is not evidence of robustness to large regime changes, alternative structural models, non-linear constraints, or shifts outside the prior support.

Fifth, predictive accuracy does not identify the learned mechanism. A model may reproduce trajectories while failing to encode economically meaningful counterfactual structure, welfare objects, or interpretable decision rules.

Sixth, Cholesky VAR/BVAR impulse responses are not structurally identified NK shocks. Their inclusion is useful as a reduced-form reference point, but not as a clean test of structural recovery.

Finally, the repository defines the replication pipeline. Numerical claims should be made only after running the full pipeline and inspecting the generated result files.

Repository Structure

nk-transformers/
├── run.py              # Pipeline entry point
├── config.toml        # Experiment configuration
├── src/
│   ├── simulator.py  # NK solution, simulation, caching
│   ├── model.py       # Causal Transformer
│   ├── train.py       # Training loop
│   ├── evaluate.py    # Forecast, IRF, and y-only metrics
│   ├── benchmarks.py  # VAR, BVAR, and Kalman implementations
│   ├── plots.py       # Figures and tables
│   └── __init__.py
├── requirements.txt
└── README.md

Usage

All experiment configuration is managed in config.toml. Copy the default and modify as needed:

cp config.toml config.local.toml  # create local override
# edit config.local.toml with your settings

Run the pipeline:

pip install -r requirements.txt
python run.py --config config.local.toml

Useful operational flags (can be combined):

python run.py --skip-train      # skip training, load checkpoint
python run.py --skip-benchmarks # skip VAR/BVAR evaluations
python run.py --skip-yonly      # skip y-only vs Kalman experiment

See config.toml for all configurable options (paths, training hyperparameters, experiment settings).

Outputs

A complete run writes cached data, model checkpoints, result summaries, and figures under results/:

results/
├── cache/
├── checkpoints/
└── figures/

The main figures are Transformer trajectory overlays, all-model trajectory overlays, IRF path plots, IRF error summaries, sample-size learning curves, and forecast-horizon error plots. The plotting code uses a common large-format academic style: sparse grids, shared legends, consistent colours, and high-resolution output. Numerical summaries, including the y-only Transformer and Kalman MSEs, are written to results/results.json.

References

Clarida, R., Galí, J., & Gertler, M. (1999). The Science of Monetary Policy: A New Keynesian Perspective. Journal of Economic Literature, 37(4), 1661–1707.

Galí, J., & Gertler, M. (1999). Inflation Dynamics: A Structural Econometric Analysis. Journal of Monetary Economics, 44(2), 195–222.

Galí, J. (2008). Monetary Policy, Inflation, and the Business Cycle: An Introduction to the New Keynesian Framework. Princeton University Press.

Gupta, A., & Imas, A. (2025). Can a Transformer 'Learn' Economic Relationships? Revisiting the Lucas Critique in the Age of Transformers. Arpitrage (Substack), December 22, 2025.

Hazell, J., Herreño, J., Nakamura, E., & Steinsson, J. (2022). The Slope of the Phillips Curve: Evidence from U.S. States. Quarterly Journal of Economics, 137(3), 1299–1344.

Rotemberg, J., & Woodford, M. (1997). An Optimization-Based Econometric Framework for the Evaluation of Monetary Policy. NBER Macroeconomics Annual, 12, 297–346.

Smets, F., & Wouters, R. (2007). Shocks and Frictions in US Business Cycles: A Bayesian DSGE Approach. American Economic Review, 97(3), 586–606.

Taylor, J. B. (1993). Discretion versus Policy Rules in Practice. Carnegie-Rochester Conference Series on Public Policy, 39, 195–214.

Woodford, M. (2003). Interest and Prices: Foundations of a Theory of Monetary Policy. Princeton University Press.
