nnagarajan/TRADEFM

TradeFM

A replication of TradeFM (arXiv 2602.23784) — a 524M-parameter decoder-only Transformer that generates realistic order flow by learning from raw Databento MBO (Level 3) event streams.

Kawawa-Beaudan, Sood, Papasotiriou, Borrajo, Veloso — JPMorgan AI Research, Feb 2026

Overview

TradeFM applies the foundation model paradigm to market microstructure. A single model learns unified trade-flow dynamics from billions of transactions across thousands of US equities, without asset-specific calibration. In closed-loop evaluation, generated order flow reproduces canonical stylized facts: heavy-tailed returns, volatility clustering, and the absence of linear autocorrelation in returns.
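
The three stylized facts can each be checked with a simple statistic. The following is a minimal sketch in plain Python, not the code in evaluation/stylized_facts.py; the function names are hypothetical and chosen only for illustration.

```python
import math


def log_returns(prices):
    """Log returns r_t = ln(p_t / p_{t-1}) for a list of positive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def excess_kurtosis(xs):
    """Population excess kurtosis; values above 0 indicate tails heavier than Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0


def acf(xs, lag):
    """Sample autocorrelation of xs at a positive lag."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    numer = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    return numer / denom
```

With r = log_returns(prices), heavy tails show up as excess_kurtosis(r) > 0, the absence of linear autocorrelation as acf(r, k) close to 0, and volatility clustering as acf([abs(x) for x in r], k) staying positive over many lags.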

Key properties:

  • Partial observability — learns from the Level 3 event stream (what any market participant sees), not full limit order book snapshots
  • Scale-invariant features — normalizes price, volume, and time features so one model works across assets with vastly different prices and liquidity profiles
  • Zero-shot geographic generalization — trained on US equities, transfers to APAC markets with moderate perplexity degradation
  • Closed-loop simulation — integrates with a deterministic LOB simulator for realistic rollouts
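
To make the closed-loop property concrete, here is a minimal sketch of a deterministic price-time-priority book. It is not the repository's market_simulator.py: the class name TinyLOB, its methods, and the use of integer tick prices are all assumptions made for illustration.

```python
from collections import deque


class TinyLOB:
    """Hypothetical price-time priority limit order book (illustration only)."""

    def __init__(self):
        self.book = {"B": {}, "A": {}}  # side -> price -> deque of [order_id, size]
        self.index = {}                 # order_id -> (side, price) for resting orders

    def best(self, side):
        """Best bid (highest) or best ask (lowest), or None if that side is empty."""
        prices = self.book[side]
        if not prices:
            return None
        return max(prices) if side == "B" else min(prices)

    def add(self, order_id, side, price, size):
        """Match against the opposite side, rest any remainder, return the fills."""
        opp = "A" if side == "B" else "B"
        fills = []
        while size > 0:
            top = self.best(opp)
            crosses = top is not None and (price >= top if side == "B" else price <= top)
            if not crosses:
                break
            queue = self.book[opp][top]
            resting = queue[0]              # oldest order at the best price
            traded = min(size, resting[1])
            fills.append((top, traded))
            size -= traded
            resting[1] -= traded
            if resting[1] == 0:
                queue.popleft()
                del self.index[resting[0]]
                if not queue:
                    del self.book[opp][top]
        if size > 0:
            self.book[side].setdefault(price, deque()).append([order_id, size])
            self.index[order_id] = (side, price)
        return fills

    def cancel(self, order_id):
        """Remove a resting order; returns False for unknown ids."""
        if order_id not in self.index:
            return False
        side, price = self.index.pop(order_id)
        queue = self.book[side][price]
        entry = next(e for e in queue if e[0] == order_id)
        queue.remove(entry)
        if not queue:
            del self.book[side][price]
        return True

    def mid(self):
        """Mid-price, or None while either side of the book is empty."""
        bid, ask = self.best("B"), self.best("A")
        if bid is None or ask is None:
            return None
        return (bid + ask) / 2
```

Because the book has no randomness, replaying the same sequence of generated orders always yields the same fills and prices, which is what makes rollouts reproducible.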

Architecture

Databento MBO .csv.zst
  └─ databento_loader.py     load + decode events
       ├─ compute_adv()       rolling ADV → liquidity tier
       └─ preprocess_mbo_for_tradefm()
            ├─ ew_vwap.py     EW-VWAP mid-price from Trade/Fill events
            └─ scale-invariant features per (instrument, date) session
                  δp = (p_order − p_mid) / p_mid
                  v  = log(1 + size)
                  Δt = ts_recv diff (seconds)
                  Δp = (p_mid − p_open) / p_open

tokenizer.py   calibrate() on first 30 days → encode() → composite token
  └─ mixed-base vocab: 2×2×16×16×16 = 16,384 tokens
     contextual (not predicted): liquidity bin, market/participant flag, Δp bin

dataset.py     sliding-window sequences per (instrument, date) → PyTorch Dataset

architecture.py  TradeFM
  ├─ TabularEmbedding  4 embedding tables → concat → Linear projection
  ├─ N × DecoderLayer  (RMSNorm + GQA + SwiGLU MLP)
  └─ lm_head → next token (cross-entropy loss)

trainer.py     AdamW (β₁=0.9, β₂=0.95), linear warmup+decay, fp16, grad accumulation

market_simulator.py   deterministic LOB for closed-loop rollouts

evaluation/stylized_facts.py   ACF, kurtosis, K-S, Wasserstein-1
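
The mixed-base vocabulary in the diagram can be read as a positional number system with one digit per field. The sketch below assumes the five digits are ordered (action, side, price bin, volume bin, time bin). The diagram only gives the bases 2×2×16×16×16, so that ordering and the function names are hypothetical and may differ from what tokenizer.py does.

```python
import math

# Assumed field order: action, side, price bin, volume bin, time bin.
BASES = (2, 2, 16, 16, 16)
VOCAB_SIZE = math.prod(BASES)  # 2 * 2 * 16 * 16 * 16


def encode(digits, bases=BASES):
    """Pack one digit per field into a single composite token id."""
    token = 0
    for digit, base in zip(digits, bases):
        if not 0 <= digit < base:
            raise ValueError(f"digit {digit} out of range for base {base}")
        token = token * base + digit
    return token


def decode(token, bases=BASES):
    """Unpack a composite token id back into its per-field digits."""
    digits = []
    for base in reversed(bases):
        token, digit = divmod(token, base)
        digits.append(digit)
    return tuple(reversed(digits))
```

The mapping is a bijection between field tuples and the integers 0 to VOCAB_SIZE − 1, so decode(encode(d)) returns d for any valid tuple d.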

Model sizes

Preset  Layers  Hidden  Heads (GQA, query / KV)  Params
125M    12      768     12 / 4                   ~125M
250M    24      1024    16 / 4                   ~250M
500M    32      1024    32 / 8                   ~524M
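
As a rough illustration of what the head counts imply, the sketch below assumes the two numbers in the heads column are query heads and key/value heads, and that consecutive query heads share one key/value head (a common GQA convention, not confirmed for architecture.py). The function name gqa_layout is hypothetical.

```python
def gqa_layout(hidden, q_heads, kv_heads):
    """Return (head_dim, queries per KV head, KV head index for each query head)."""
    if hidden % q_heads != 0:
        raise ValueError("hidden size must be divisible by the number of query heads")
    if q_heads % kv_heads != 0:
        raise ValueError("query heads must be divisible by key/value heads")
    head_dim = hidden // q_heads
    group = q_heads // kv_heads
    # Assumed convention: query heads 0..group-1 use KV head 0, and so on.
    kv_for_query = [q // group for q in range(q_heads)]
    return head_dim, group, kv_for_query


# Presets copied from the table above.
PRESETS = {
    "125M": (768, 12, 4),
    "250M": (1024, 16, 4),
    "500M": (1024, 32, 8),
}

for name, (hidden, q_heads, kv_heads) in PRESETS.items():
    head_dim, group, _ = gqa_layout(hidden, q_heads, kv_heads)
    print(name, "head_dim =", head_dim, "queries per KV head =", group)
```

Under these assumptions the key/value cache is smaller than with full multi-head attention by the group factor, which matters for long closed-loop rollouts.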

Installation

pip install -r requirements.txt

Requirements: torch>=2.0, numpy, pandas, scipy, zstandard

Usage

Training

python train.py \
    --data "data/**/*.mbo.csv.zst" \
    --model-size 500M \
    --output-dir checkpoints/500M

Key arguments:

Argument          Default      Description
--data            required     Path, glob, or list of Databento MBO .csv.zst files
--model-size      500M         Size preset: 125M, 250M, or 500M
--calib-days      30           Trading days used to calibrate the tokenizer
--val-days        30           Trailing days held out for validation
--context-length  1024         Sequence length in tokens
--epochs          4            Training epochs
--batch-size      24           Per-device batch size
--accum-steps     56           Gradient accumulation steps (24 × 56 = 1,344 sequences per device; effective batch ≈ 4,032 when run on 3 devices)
--lr              5e-5         Peak learning rate
--output-dir      checkpoints  Checkpoint output directory
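
The interaction between batch size, accumulation, and the learning-rate schedule can be sketched as follows. This is an illustration, not the code in trainer.py: warmup_steps=1000, decay to exactly zero, and both function names are assumptions; only the peak rate of 5e-5 and the batch arguments come from the table above.

```python
def effective_batch(per_device, accum_steps, num_devices=1):
    """Sequences contributing to one optimizer step."""
    return per_device * accum_steps * num_devices


def lr_at(step, total_steps, peak_lr=5e-5, warmup_steps=1000):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


print(effective_batch(24, 56))      # one device
print(effective_batch(24, 56, 3))   # three devices
```

Note that the schedule is indexed by optimizer steps, so with gradient accumulation one step corresponds to accum_steps forward and backward passes.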

Syntax check

python -c "
import ast, pathlib
for f in pathlib.Path('tradefm').rglob('*.py'):
    ast.parse(f.read_text())
    print('OK', f)
"

Input Data Format

Databento MBO .csv.zst files (zstandard-compressed CSV) with columns:

Column    Example                   Notes
ts_event  2026-03-30T08:00:00.016Z  Exchange timestamp (nanosecond UTC)
ts_recv   2026-03-30T08:00:00.015Z  Feed receive timestamp
action    A / C / T / F             Add, Cancel, Trade, Fill
side      A / B                     Ask (sell) / Bid (buy)
price     183.460000000             Decimal dollars
size      1635                      Shares
order_id  435123903                 Links Add → Cancel/Fill events
symbol    NVDA                      Ticker

Only A (Add) and C (Cancel) events are model targets. T/F events update the EW-VWAP estimator only.
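
The routing rule above can be sketched as a single pass over the event stream. The code below is an assumption-laden illustration rather than the logic in databento_loader.py or ew_vwap.py: the exponential-weighting form, the decay alpha=0.1, the choice to skip orders seen before the first trade, and the names EWVWAP and build_targets are all hypothetical. The two features follow the definitions δp = (p_order − p_mid) / p_mid and v = log(1 + size).

```python
import math


class EWVWAP:
    """Exponentially weighted VWAP (assumed form; alpha is an illustrative decay)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.pv = 0.0  # exponentially weighted sum of price * size
        self.v = 0.0   # exponentially weighted sum of size

    def update(self, price, size):
        self.pv = (1 - self.alpha) * self.pv + self.alpha * price * size
        self.v = (1 - self.alpha) * self.v + self.alpha * size

    @property
    def value(self):
        return self.pv / self.v if self.v > 0 else None


def build_targets(events, alpha=0.1):
    """events: iterable of (action, price, size) tuples in feed order."""
    mid = EWVWAP(alpha)
    rows = []
    for action, price, size in events:
        if action in ("T", "F"):
            mid.update(price, size)           # trades and fills only move the estimator
        elif action in ("A", "C") and mid.value is not None:
            rows.append({
                "action": action,
                "delta_p": (price - mid.value) / mid.value,  # relative price offset
                "v": math.log1p(size),                       # log(1 + size)
            })
    return rows
```

Only Add and Cancel events produce rows, so the sequence handed to the tokenizer is shorter than the raw feed.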

Design Choices

  • Tokenizer frozen after calibration — calibrated once on the first 30 trading days, then fixed for all training and inference
  • Equal-frequency bins for price — quantile binning gives high resolution near the mid-price where most orders cluster
  • Equal-width bins for volume/time — applied to log-transformed values, effectively logarithmic in the original space
  • Session boundaries respected — no sliding window crosses (instrument_id, date) boundaries
  • Closed-loop generation: TradeFM.generate() feeds each predicted token through the LOB simulator to obtain the updated price level, which is appended to the context before the next step
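
The two binning schemes and the frozen-after-calibration rule can be sketched as below. This is a simplified illustration and not the code in tokenizer.py; the function names and the clamping of out-of-range values to the end bins are assumptions.

```python
import bisect
import math


def equal_frequency_edges(values, n_bins):
    """Interior edges at empirical quantiles, so each bin holds roughly equal counts."""
    xs = sorted(values)
    return [xs[(len(xs) * k) // n_bins] for k in range(1, n_bins)]


def equal_width_edges(values, n_bins):
    """Interior edges that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]


def to_bin(x, edges):
    """Bin index in 0..len(edges); values beyond the calibration range clamp to the end bins."""
    return bisect.bisect_right(edges, x)


# Calibrate once, then reuse the same edges for all later data.
price_edges = equal_frequency_edges(list(range(16)), n_bins=4)
log_sizes = [math.log1p(s) for s in (0, 9, 99, 999)]
size_edges = equal_width_edges(log_sizes, n_bins=3)
```

Equal-width edges on log1p(size) correspond to geometrically growing intervals in raw share counts, which is the sense in which those bins are logarithmic in the original space.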

Research Lineage

This work builds on:

  1. Kawawa-Beaudan et al. 2024 (arXiv 2409.07619) — "Ensemble Methods for Sequence Classification with Hidden Markov Models": established order flow as a sequence modeling problem using HMM ensembles for anomaly classification. TradeFM replaces HMMs with a large generative Transformer and flips the task from classification to generation.

  2. Sirignano & Cont 2021: showed that a single deep learning model trained on pooled multi-stock data outperforms asset-specific models, motivating cross-asset generalization.

About

A replication of the TradeFM research paper from JPMorgan AI Research
