FX Synthetic Data Generator & Ingestor for QuestDB

This script generates highly realistic, multi-level FX market data and ingests it into QuestDB at high speed. It produces three interconnected datasets:

market_data - Full depth orderbook snapshots (L2 market data)
core_price - Best bid/offer (BBO) with indicators and metadata
fx_trades - Executed trades with realistic order matching against orderbooks

Designed for stress testing, benchmarking, and live demo scenarios, it supports both wall-clock-paced ("real-time") and maximum-throughput ("faster-than-life") simulation with multi-process orchestration, pip-accurate pricing, WAL backpressure detection, and robust state management.

Features

Data Generation

Three-Table Architecture:
- market_data: Multi-level orderbook snapshots (bids/asks at 3-50 levels) at 1,200-15,000 events/sec
- core_price: Best bid/offer snapshots with ECN, reason codes, and technical indicators at 700-1,100 events/sec
- fx_trades: Realistic trade executions (5-30 orders/sec) with partial fills, level walking, and counterparty LEIs
Realistic Order Execution:
- Orders execute against actual orderbook liquidity
- Supports passive (limit) and aggressive (market) orders
- Orders can walk multiple levels until filled or limit price reached
- Each order generates 1-N trades with consistent order_id
- ECN consistency: all trades from same order execute on same ECN
30 FX Pairs with Liquidity Ranking:
- Major pairs (EURUSD, USDJPY, GBPUSD) have higher trade frequency
- Exotic pairs trade less frequently based on realistic liquidity distribution
- Pip-accurate pricing for each pair (0.0001 for majors, 0.01 for JPY crosses)
Deterministic Counterparty Pool:
- Generates realistic 20-character LEIs (Legal Entity Identifiers)
- Hash-based generation ensures expandability (first N LEIs always identical)
- Configurable pool size (default: 2000)
Live Market Data Integration (Real-time Mode):
- Yahoo Finance updates price brackets every 5 minutes
- Keeps simulation aligned with current market conditions
- Prevents unrealistic price drift
Temporal Consistency:
- Intra-second price interpolation for smooth OHLC bars
- close(second N) always equals open(second N+1)
- Trades timestamped after their source orderbook snapshot (ASOF join compatible)
- All three tables flush synchronously for dashboard consistency

Performance & Reliability

Parallel High-Throughput Ingestion:
- Multiprocessing splits event generation across workers
- Each process handles unique time partitions (no overlap)
- Typical: 8-11K events/sec sustained on 6-8 cores
WAL Lag Detection and Flow Control:
- Dedicated monitor process queries QuestDB's WAL progress
- Workers pause when lag exceeds threshold, resume when cleared
- Prevents QuestDB memory pressure during fast ingestion
Resumable, Incremental Ingestion:
- Automatically detects latest timestamp in database
- Advances start time to avoid overlap
- Supports continuous/incremental data generation
Materialized Views Auto-Setup:
- Creates OHLC views at multiple time granularities (1s, 1m, 15m, 1d)
- Separate views for core_price and market_data aggregations
- Optional TTLs for demo environments

Tables & Schema

`market_data`

Full orderbook depth with array storage for efficient ingestion:

timestamp TIMESTAMP
symbol SYMBOL
bids DOUBLE[][]  -- bids[0] = prices, bids[1] = volumes
asks DOUBLE[][]  -- asks[0] = prices, asks[1] = volumes

Volume: 1,200-15,000 events/second

`core_price`

Best bid/offer snapshots with metadata:

timestamp TIMESTAMP
symbol SYMBOL
ecn SYMBOL  -- LMAX, EBS, Hotspot, Currenex
bid_price DOUBLE
bid_volume LONG
ask_price DOUBLE
ask_volume LONG
reason SYMBOL  -- normal, news_event, liquidity_event
indicator1 DOUBLE
indicator2 DOUBLE

Volume: 700-1,100 events/second

`fx_trades`

Executed trades with nanosecond precision:

timestamp TIMESTAMP_NS  -- Nanosecond precision
symbol SYMBOL
ecn SYMBOL
trade_id UUID
side SYMBOL  -- buy, sell
passive BOOLEAN
price DOUBLE
quantity DOUBLE
counterparty SYMBOL  -- 20-char LEI
order_id UUID  -- Multiple trades can share same order_id (partial fills)

Volume: 5-30 orders/second → typically 5-50 trades/second (depending on partial fills)

Arguments

Argument	Type	Default	Description
`--host`	str	`127.0.0.1`	Host/IP of QuestDB instance
`--pg_port`	str/int	`8812`	PostgreSQL port for QuestDB metadata queries
`--user`	str	`admin`	Database user for metadata
`--password`	str	`quest`	Password for metadata
`--token`	str	None	ILP authentication token (JWK) for HTTP/HTTPS
`--token_x`	str	None	JWK token X coordinate (for tcps)
`--token_y`	str	None	JWK token Y coordinate (for tcps)
`--ilp_user`	str	`admin`	ILP/HTTP ingestion user
`--protocol`	str	`http`	`tcp` or `http` (`tcps`/`https` if token present)
`--mode`	str	Required	`real-time` (wall clock) or `faster-than-life` (max speed)
`--market_data_min_eps`	int	`1200`	Min `market_data` events/sec (must be > `core_max_eps`)
`--market_data_max_eps`	int	`15000`	Max `market_data` events/sec
`--core_min_eps`	int	`700`	Min `core_price` events/sec (must be > `orders_max_per_sec`)
`--core_max_eps`	int	`1000`	Max `core_price` events/sec (must be < `market_data_min_eps`)
`--orders_min_per_sec`	int	`5`	Min orders/sec (each order → 1-N trades, must be < `core_min_eps`)
`--orders_max_per_sec`	int	`30`	Max orders/sec
`--lei_pool_size`	int	`2000`	Number of unique counterparty LEIs to generate
`--total_market_data_events`	int	`1_000_000`	Total `market_data` events to generate (faster-than-life mode only)
`--start_ts`	str	now (UTC)	Simulation start time (ISO8601). Not allowed in real-time mode
`--end_ts`	str	None	Simulation end time (ISO8601)
`--chunk_seconds`	int	`900`	Max seconds to precompute per chunk in faster-than-life mode (limits memory usage)
`--processes`	int	`1`	Number of worker processes (real-time allows only 1)
`--min_levels`	int	`40`	Min orderbook levels
`--max_levels`	int	`40`	Max orderbook levels
`--incremental`	bool	`false`	Load last state from DB to continue appending (faster-than-life only)
`--create_views`	bool	`true`	Create materialized views if not present
`--short_ttl`	bool	`false`	Enforce TTLs on all tables (3 days for tables, 4 hours-1 month for views)
`--suffix`	str	`""`	Suffix to append to all table/view names
`--yahoo_refresh_secs`	int	`300`	Yahoo Finance bracket refresh interval (real-time mode only)

Event Rate Validation

The script enforces a hierarchy to maintain realism:

market_data_min_eps > core_max_eps (orderbooks update faster than BBO snapshots)
core_min_eps > orders_max_per_sec (more price updates than trades)

Violating these constraints will cause the script to exit with an error message.

Ingestion Modes

Real-Time Mode (`--mode real-time`)

Generates data at wall-clock pace (1 simulated second = 1 real second)
Timestamps are always 2 seconds ahead of wall clock for dashboard responsiveness
Yahoo Finance refreshes price brackets every 5 minutes (configurable)
Single process only (--processes 1)
Runs indefinitely until stopped or --total_market_data_events reached
Ignores --start_ts (always starts from now)

Use cases: Live demos, continuous data streams, dashboard testing

Faster-Than-Life Mode (`--mode faster-than-life`)

Generates data as fast as possible (no wall-clock pacing)
Processes in chunks (default 15 min) to limit memory usage for long runs
Maintains OHLC continuity across chunk boundaries
Supports multiprocessing for parallel ingestion
Requires --total_market_data_events to be set
Can use --start_ts and --end_ts for specific time ranges
Yahoo Finance queried once at startup (or bypassed if --incremental)

Use cases: Backfills, stress testing, benchmarking, generating historical datasets

Usage Examples

Real-Time Streaming (Live Demo)

python fx_data_generator.py \
  --host 172.31.42.41 \
  --market_data_min_eps 1200 \
  --market_data_max_eps 2500 \
  --core_min_eps 700 \
  --core_max_eps 1000 \
  --orders_min_per_sec 5 \
  --orders_max_per_sec 30 \
  --lei_pool_size 2000 \
  --token "your_token_here" \
  --token_x "your_token_x_here" \
  --token_y "your_token_y_here" \
  --ilp_user ilp_ingest \
  --protocol tcp \
  --mode real-time \
  --processes 1 \
  --total_market_data_events 800_000_000

High-Speed Batch Backfill (14 hours of data)

python fx_data_generator.py \
  --host 172.31.42.41 \
  --market_data_min_eps 8200 \
  --market_data_max_eps 11000 \
  --core_min_eps 700 \
  --core_max_eps 1000 \
  --orders_min_per_sec 5 \
  --orders_max_per_sec 30 \
  --lei_pool_size 2000 \
  --token "your_token_here" \
  --token_x "your_token_x_here" \
  --token_y "your_token_y_here" \
  --ilp_user ilp_ingest \
  --protocol tcp \
  --mode faster-than-life \
  --processes 6 \
  --total_market_data_events 600_000_000 \
  --start_ts "2025-11-11T00:00:00.000000Z" \
  --end_ts "2025-11-11T14:00:00.000000Z" \
  --min_levels 3 \
  --max_levels 3 \
  --incremental false

Local Development (No Authentication)

python fx_data_generator.py \
  --host 127.0.0.1 \
  --market_data_min_eps 110 \
  --market_data_max_eps 150 \
  --core_min_eps 70 \
  --core_max_eps 100 \
  --orders_min_per_sec 5 \
  --orders_max_per_sec 30 \
  --protocol tcp \
  --mode real-time \
  --processes 1 \
  --total_market_data_events 100_000_000 \
  --lei_pool_size 2000

Docker Support

Build and run via Docker:

docker build -t fx-generator .

docker run -e HOST=172.31.42.41 \
           -e TOKEN="your_token" \
           -e TOKEN_X="your_token_x" \
           -e TOKEN_Y="your_token_y" \
           -e ILP_USER=ilp_ingest \
           -e MARKET_DATA_MIN_EPS=1200 \
           -e MARKET_DATA_MAX_EPS=2500 \
           -e CORE_MIN_EPS=700 \
           -e CORE_MAX_EPS=1000 \
           -e ORDERS_MIN_PER_SEC=5 \
           -e ORDERS_MAX_PER_SEC=30 \
           -e LEI_POOL_SIZE=2000 \
           fx-generator

All parameters can be overridden via environment variables or passed as arguments to the entrypoint script.

Materialized Views

The script automatically creates OHLC aggregation views:

Core Price Views

core_price_1s: 1-second OHLC (mid-price, spread, bid/ask stats)
core_price_1d: 1-day OHLC

Market Data Views

market_data_ohlc_1m: 1-minute OHLC from orderbook best bid
market_data_ohlc_15m: 15-minute OHLC
market_data_ohlc_1d: 1-day OHLC

Trade Views (User-Created)

Example views for fx_trades aggregations:

CREATE MATERIALIZED VIEW fx_trades_ohlc_1m AS (
    SELECT timestamp, symbol,
        first(price) AS open,
        max(price) AS high,
        min(price) AS low,
        last(price) AS close,
        SUM(quantity) AS total_volume
    FROM fx_trades
    SAMPLE BY 1m
) PARTITION BY HOUR TTL 2 DAYS;

CREATE MATERIALIZED VIEW fx_trades_ohlc_1h AS (
    SELECT timestamp, symbol,
        first(price) AS open,
        max(price) AS high,
        min(price) AS low,
        last(price) AS close,
        SUM(quantity) AS total_volume
    FROM fx_trades
    SAMPLE BY 1h
) PARTITION BY DAY TTL 1 MONTH;

Data Characteristics

Realism Features

Price Continuity:
- close(second N) = open(second N+1) for every symbol
- Intra-second interpolation produces realistic candlesticks (not flat lines)
- Controlled random walk with shock events (±20 pips occasionally)
Spread Dynamics:
- Spreads evolve realistically (1-8 pips for majors)
- Rare stress-widening events (0.05% probability)
- Spread measured in pips to avoid floating-point artifacts
Volume Distribution:
- Log-scaled volume ladder: thinner at best levels, deeper at worse levels
- Best bid/ask: 50K-100K typical
- Deep levels: 500M-1B liquidity
Trade Execution:
- 40% passive orders (limit), 60% aggressive (market)
- Passive: 0-2 pips slippage, aggressive: 3-10 pips slippage
- Orders walk multiple levels if size exceeds liquidity
- Log-normal trade size distribution (many small, few large)
ECN Distribution:
- 4 ECNs: LMAX, EBS, Hotspot, Currenex
- Random but consistent per order (all trades from order at same ECN)
Liquidity Ranking:
- Top 9 pairs (EURUSD, USDJPY, GBPUSD, etc.) trade 10x more than exotics
- Rank-weighted selection for realistic market share

Timestamp Precision

market_data: Microsecond (TIMESTAMP)
core_price: Microsecond (TIMESTAMP)
fx_trades: Nanosecond (TIMESTAMP_NS)

Trade timestamps are always ≥1 microsecond after their source core_price event, ensuring ASOF joins work correctly despite precision differences.

WAL Backpressure Control

Pause Threshold: sequencerTxn - writerTxn > 3 * num_processes
Resume: Only when lag clears completely (sequencerTxn == writerTxn)
Workers print status messages when paused/resumed
Prevents QuestDB OOM during extreme ingestion rates

Safety & Data Integrity

No Overlapping Data: Script queries latest timestamp before starting and advances past it
No Time Travel: If requested start is already covered, script exits cleanly
Atomic Flushes: All three tables flush synchronously for dashboard consistency
Clean Shutdown: All subprocesses (workers, WAL monitor, Yahoo refresher) terminate cleanly

Troubleshooting

Trades Not Appearing

If fx_trades table is empty or has very few rows:

Check --orders_min_per_sec and --orders_max_per_sec are set (default: 5-30)
Verify event rate hierarchy: core_min_eps > orders_max_per_sec
In real-time mode with low event rates, trades may take several seconds to accumulate and flush

Yahoo Finance Errors (Real-time Mode)

If Yahoo Finance fails to fetch prices:

Script falls back to hardcoded price brackets (non-fatal)
Check network connectivity to finance.yahoo.com
Increase --yahoo_refresh_secs to reduce query frequency

WAL Lag Warnings

If workers frequently pause due to WAL lag:

Reduce --processes count
Lower event rates (--market_data_max_eps, etc.)
Increase QuestDB's cairo.max.uncommitted.rows setting
Enable more CPU cores for QuestDB writer threads

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE.txt		LICENSE.txt
README.md		README.md
demo_queries.sql		demo_queries.sql
docker-entrypoint.sh		docker-entrypoint.sh
fx_backfill_ingest.sh		fx_backfill_ingest.sh
fx_data_generator.py		fx_data_generator.py
fx_real_time_ingest.sh		fx_real_time_ingest.sh
grafana_dashboard.json		grafana_dashboard.json
requirements.txt		requirements.txt

License

javier/fx-data-generator

Folders and files

Latest commit

History

Repository files navigation

FX Synthetic Data Generator & Ingestor for QuestDB

Features

Data Generation

Performance & Reliability

Tables & Schema

market_data

core_price

fx_trades

Arguments

Event Rate Validation

Ingestion Modes

Real-Time Mode (--mode real-time)

Faster-Than-Life Mode (--mode faster-than-life)

Usage Examples

Real-Time Streaming (Live Demo)

High-Speed Batch Backfill (14 hours of data)

Local Development (No Authentication)

Docker Support

Materialized Views

Core Price Views

Market Data Views

Trade Views (User-Created)

Data Characteristics

Realism Features

Timestamp Precision

WAL Backpressure Control

Safety & Data Integrity

Troubleshooting

Trades Not Appearing

Yahoo Finance Errors (Real-time Mode)

WAL Lag Warnings

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2