Tick-by-tick financial data is one of the most information-dense environments you can study. At millisecond resolution, markets reveal microstructure dynamics that are completely invisible in daily or hourly data — bid-ask bounce, order flow clustering, volatility regime shifts, and fleeting arbitrage signals. This project dives into that world using modern machine learning.
The goal is to go beyond textbook time series analysis and wrestle with data that is messy, non-stationary, and enormous by design. If you can build models that work here, you can build models that work anywhere.
At the tick level, markets expose their mechanics:
- Microstructure effects — bid-ask bounce and order flow create autocorrelation patterns that break random walk assumptions
- Regime switching — markets cycle between calm and turbulent states that require adaptive, state-dependent models
- Information asymmetry — short-term price signals are embedded in order flow and tick patterns before they aggregate away
- Scale — millions of observations per symbol per month demand efficient, vectorized computation
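The first point can be illustrated with a small simulation in the spirit of the Roll model of bid-ask bounce. This is only a sketch on synthetic data: the half-spread, the volatility of the efficient price, and the assumption that trades hit bid and ask with equal probability are all illustrative choices, not values taken from the project's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ticks = 100_000
half_spread = 0.5  # assumed half-spread, in arbitrary price units
sigma = 0.1        # assumed std. dev. of efficient-price innovations

# The unobserved efficient price follows a random walk.
efficient = np.cumsum(rng.normal(0.0, sigma, n_ticks))
# Each trade executes at the bid (-1) or the ask (+1) with equal probability.
side = rng.choice([-1.0, 1.0], size=n_ticks)
observed = efficient + half_spread * side


def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


rho_efficient = lag1_autocorr(np.diff(efficient))
rho_observed = lag1_autocorr(np.diff(observed))
print(f"lag-1 autocorrelation, efficient price changes: {rho_efficient:+.3f}")
print(f"lag-1 autocorrelation, observed price changes:  {rho_observed:+.3f}")
```

Under these assumptions the efficient price changes are uncorrelated, while the observed changes show strongly negative first-order autocorrelation. That is the signature of bid-ask bounce and the reason raw tick returns do not behave like a random walk.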
Working at this resolution is not just academically interesting — it is genuinely hard:
- Noise — tick prices are contaminated by microstructure; raw data is not your signal
- Non-stationarity — regimes shift intraday, requiring models that can adapt or detect change
- Dimensionality — traditional time series tools struggle with millions of correlated observations
- Overfitting — rich data invites spurious patterns; disciplined regularization is essential
Regime-Switching Models
Hidden Markov Models and Markov-switching autoregression identify latent market states, and are combined with GARCH for state-dependent volatility dynamics.
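As a rough illustration of latent-state identification, the sketch below runs a forward (Hamilton-style) filter for a two-state Gaussian hidden Markov model on simulated returns. The transition matrix, the state volatilities, and the function name `filter_regimes` are assumptions made for the example. The parameters are treated as known here, whereas in practice they would have to be estimated.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Normal density, evaluated elementwise over the state parameters."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def filter_regimes(returns, trans, means, stds, init):
    """Forward filter: P(state_t = k | returns observed up to time t)."""
    probs = np.empty((len(returns), len(means)))
    prior = init
    for t, r in enumerate(returns):
        joint = prior * gaussian_pdf(r, means, stds)  # prior times likelihood
        probs[t] = joint / joint.sum()                # Bayes update
        prior = probs[t] @ trans                      # one-step state prediction
    return probs


# Illustrative parameters: state 0 is calm, state 1 is turbulent.
trans = np.array([[0.99, 0.01],
                  [0.02, 0.98]])  # trans[i, j] = P(next state = j | state = i)
means = np.array([0.0, 0.0])
stds = np.array([1.0, 4.0])

# Simulate a regime path and returns drawn from the active regime.
rng = np.random.default_rng(1)
n_obs = 5000
states = np.zeros(n_obs, dtype=int)
for t in range(1, n_obs):
    states[t] = rng.choice(2, p=trans[states[t - 1]])
returns = rng.normal(means[states], stds[states])

probs = filter_regimes(returns, trans, means, stds, init=np.array([0.5, 0.5]))
accuracy = float(np.mean(probs.argmax(axis=1) == states))
print(f"share of ticks assigned to the true regime: {accuracy:.2%}")
```

The filter uses only past and current observations, so it can run online. A smoother that also uses future observations would classify regimes more accurately after the fact.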
Pre-averaging and Feature Engineering
Raw ticks are aggregated into meaningful windows, and microstructure features are extracted: bid-ask spreads, order flow imbalance, and inter-tick durations.
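The sketch below shows one possible way to compute these features with pandas. It runs on synthetic quotes, and the column names (`bid`, `ask`), the one-minute window, and the tick-rule proxy for order flow are assumptions for illustration rather than the project's actual schema or method.

```python
import numpy as np
import pandas as pd

# Synthetic quotes standing in for real ticks (column names are assumed).
rng = np.random.default_rng(2)
n = 10_000
timestamps = pd.Timestamp("2026-01-05 09:00") + pd.to_timedelta(
    np.cumsum(rng.exponential(0.25, n)), unit="s"
)
true_mid = 1.10 + np.cumsum(rng.normal(0.0, 1e-5, n))
quoted_spread = rng.choice([1e-5, 2e-5, 3e-5], size=n)
ticks = pd.DataFrame(
    {"bid": true_mid - quoted_spread / 2, "ask": true_mid + quoted_spread / 2},
    index=timestamps,
)

# Tick-level microstructure features.
ticks["mid"] = (ticks["bid"] + ticks["ask"]) / 2
ticks["spread"] = ticks["ask"] - ticks["bid"]
ticks["duration"] = ticks.index.to_series().diff().dt.total_seconds()
# Tick-rule proxy for signed flow: +1 on a mid uptick, -1 on a downtick.
ticks["direction"] = np.sign(ticks["mid"].diff()).fillna(0.0)

# Aggregate into one-minute windows; averaging the mid damps tick-level noise.
bars = ticks.resample("1min").agg(
    {"mid": "mean", "spread": "mean", "duration": "mean", "direction": "sum"}
)
bars["n_ticks"] = ticks["mid"].resample("1min").count()
bars["ofi"] = bars["direction"] / bars["n_ticks"]  # imbalance in [-1, 1]
print(bars.head())
```

With real data that includes volumes, the signed-flow proxy would normally be weighted by traded or quoted size instead of counting ticks.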
Rolling Window Analysis
AR models are fitted on rolling windows to capture time-varying autocorrelation, and the evolution of the model parameters across the trading day is visualized.
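A minimal sketch of the rolling-window idea, assuming an AR(1) model fitted by ordinary least squares on overlapping windows. The helper `rolling_ar1`, the window length, and the simulated series with a mid-sample change in the AR coefficient are all hypothetical choices for the example.

```python
import numpy as np


def rolling_ar1(x, window, step):
    """OLS estimate of phi in x_t = c + phi * x_{t-1} + e_t for each window."""
    starts = np.arange(0, len(x) - window + 1, step)
    phis = np.empty(len(starts))
    for k, s in enumerate(starts):
        w = x[s:s + window]
        design = np.column_stack([np.ones(window - 1), w[:-1]])
        coef, *_ = np.linalg.lstsq(design, w[1:], rcond=None)
        phis[k] = coef[1]
    return starts, phis


# Simulated series whose AR(1) coefficient switches halfway through.
rng = np.random.default_rng(3)
n = 20_000
phi_true = np.where(np.arange(n) < n // 2, -0.4, 0.3)
shocks = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true[t] * x[t - 1] + shocks[t]

window, step = 1000, 500
starts, phis = rolling_ar1(x, window, step)
for s, phi in zip(starts[::6], phis[::6]):
    print(f"window starting at {s:>6}: phi = {phi:+.3f}")
```

Plotting `phis` against `starts` gives the kind of parameter-evolution picture described above. Shorter windows react faster to change but produce noisier estimates.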
Pairs Trading
Cointegration analysis finds mean-reverting relationships between currency pairs, with ML-enhanced spread modeling for strategy development.
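The first step of an Engle-Granger style analysis can be sketched as follows: estimate a hedge ratio by OLS, form the spread, and check how quickly it mean-reverts. The two series are simulated with a shared stochastic trend, and every parameter is an assumption for the example. A formal cointegration test, such as an augmented Dickey-Fuller test on the residuals, is deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Two synthetic log-prices driven by one common random-walk trend.
common = np.cumsum(rng.normal(0.0, 1.0, n))
noise = np.zeros(n)
shocks = rng.normal(0.0, 0.5, n)
for t in range(1, n):
    noise[t] = 0.9 * noise[t - 1] + shocks[t]  # stationary AR(1) deviation
x = common
y = 2.0 + 1.5 * common + noise

# Step 1: OLS regression of y on x gives the hedge ratio.
design = np.column_stack([np.ones(n), x])
(intercept, hedge_ratio), *_ = np.linalg.lstsq(design, y, rcond=None)

# Step 2: the residual is the candidate mean-reverting spread.
spread = y - intercept - hedge_ratio * x

# Step 3: AR(1) coefficient of the spread and the implied half-life.
phi = float(np.dot(spread[:-1], spread[1:]) / np.dot(spread[:-1], spread[:-1]))
half_life = -np.log(2.0) / np.log(phi)

# Standardized spread, often used as an entry/exit signal.
zscore = (spread - spread.mean()) / spread.std()
print(f"hedge ratio {hedge_ratio:.3f}, phi {phi:.3f}, half-life {half_life:.1f} ticks")
```

An AR(1) coefficient well below one indicates mean reversion. A value close to one would suggest the pair is not cointegrated at this horizon.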
Every model in this project is a building block in the same hierarchy — from simple linear structure to full conditional heteroskedasticity. Understanding each step makes the next one obvious.
AR(p) — Autoregressive
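The price at time $t$ depends linearly on its previous $p$ values plus a random shock:

$$X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t$$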
MA(q) — Moving Average
Instead of lagged prices, the model depends on lagged shocks. Short-lived, mean-reverting dynamics.
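In the standard textbook form, with $\mu$ the mean, $\theta_j$ the moving-average coefficients, and $\varepsilon_t$ the shock at time $t$ (the notation is an assumption, chosen to match the usual convention), the MA(q) model reads:

$$X_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$

A shock affects the series for only $q$ periods, so the autocorrelation function is zero beyond lag $q$.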
ARMA(p, q) — Autoregressive Moving Average
Combines both: persistent autocorrelation from AR, transient shock responses from MA.
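Written out in the usual notation (assumed here: $\phi_i$ for the autoregressive coefficients, $\theta_j$ for the moving-average coefficients, $\varepsilon_t$ for the shock), the ARMA(p, q) model is:

$$X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$

Setting $q = 0$ recovers AR(p), and setting $p = 0$ recovers MA(q).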
ARIMA(p, d, q) — Integrated
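Non-stationary series are differenced $d$ times until they are stationary, and an ARMA(p, q) model is then fitted to the differenced series $Y_t = (1 - L)^d X_t$, where $L$ is the lag operator ($L X_t = X_{t-1}$).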
ARIMAX(p, d, q) — With Exogenous Variables
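ARIMA extended with external regressors $x_t$ (for example, spreads or order flow imbalance) that enter the mean equation alongside the ARMA terms.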
ARCH(q) — Autoregressive Conditional Heteroskedasticity
Volatility is not constant — it clusters. ARCH models the conditional variance as a function of past squared residuals.
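In the standard formulation (the notation is an assumption: $\sigma_t^2$ is the conditional variance, $z_t$ is an i.i.d. shock with zero mean and unit variance), ARCH(q) is:

$$\varepsilon_t = \sigma_t z_t, \qquad \sigma_t^2 = \omega + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2, \qquad \omega > 0,\ \alpha_i \ge 0$$

A large shock at time $t-1$ raises the variance at time $t$, which is what produces volatility clusters.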
GARCH(p, q) — Generalized ARCH
Adds lagged variance terms to ARCH, capturing the long memory of volatility with far fewer parameters. The workhorse of financial volatility modeling.
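For the common GARCH(1, 1) case the variance recursion is $\sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2$. The sketch below simulates that recursion and checks two of its well-known properties. The parameter values and the function name `simulate_garch11` are illustrative assumptions, and Gaussian shocks are assumed for simplicity.

```python
import numpy as np


def simulate_garch11(n, omega, alpha, beta, rng):
    """Simulate eps_t = sigma_t * z_t with a GARCH(1, 1) variance recursion."""
    eps = np.empty(n)
    var = np.empty(n)
    var[0] = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    eps[0] = np.sqrt(var[0]) * rng.standard_normal()
    for t in range(1, n):
        var[t] = omega + alpha * eps[t - 1] ** 2 + beta * var[t - 1]
        eps[t] = np.sqrt(var[t]) * rng.standard_normal()
    return eps, var


def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


rng = np.random.default_rng(5)
omega, alpha, beta = 0.1, 0.1, 0.8  # alpha + beta < 1 keeps the variance finite
eps, var = simulate_garch11(100_000, omega, alpha, beta, rng)

print(f"sample variance:          {eps.var():.3f}")
print(f"unconditional variance:   {omega / (1 - alpha - beta):.3f}")
print(f"lag-1 autocorr of eps:    {lag1_autocorr(eps):+.3f}")
print(f"lag-1 autocorr of eps^2:  {lag1_autocorr(eps ** 2):+.3f}")
```

The shocks themselves are close to uncorrelated, while their squares are positively autocorrelated. That combination is the volatility clustering GARCH is designed to capture.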
In all models, $\varepsilon_t$ denotes the zero-mean innovation (shock) at time $t$.
Install the dependencies with:

    pip install -r requirements.txt

Data lives in `code/data/processed/` as compressed Parquet files. Small CSV samples (1,000 rows each) are available in `code/data/samples/` for quick experimentation.
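A quick way to inspect the samples might look like the sketch below. The helper `load_samples` is hypothetical, and the code assumes the sample CSVs carry a header row. It discovers the files by pattern and does not assume any particular file names or columns.

```python
from pathlib import Path

import pandas as pd


def load_samples(sample_dir="code/data/samples"):
    """Read every CSV in sample_dir into a dict keyed by file stem."""
    directory = Path(sample_dir)
    if not directory.is_dir():
        return {}  # nothing to load when run outside the repository root
    return {
        path.stem: pd.read_csv(path)  # assumes a header row (not verified)
        for path in sorted(directory.glob("*.csv"))
    }


samples = load_samples()
for name, frame in samples.items():
    print(name, frame.shape, list(frame.columns))
```

The full Parquet files can be read in the same spirit with `pd.read_parquet`, which needs a Parquet engine such as `pyarrow` installed.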
The project draws on three data sources covering EUR/USD, EUR/CHF, USD/ZAR and a broad set of other forex pairs. All are stored as Parquet for fast, memory-efficient loading.
| Source | Coverage | Format | Link |
|---|---|---|---|
| HistData | 32 pairs · Jan–Feb 2026 | NinjaTrader tick CSV | histdata.com |
| TrueFX | 3 pairs · Nov 2025 – Jan 2026 | Tick CSV (bid/ask) | truefx.com |
| Dukascopy | 3 pairs · Nov 2025 – Jan 2026 | API (bid/ask + volume) | dukascopy.com · PyPI |
| Source | Files | Symbols | Total Ticks | Size on Disk |
|---|---|---|---|---|
| HistData | 72 | 32 | ~48.5M | ~423 MB |
| TrueFX | 18 | 3 | ~14.1M | ~122 MB |
| Dukascopy | 18 | 3 | ~36.1M | ~331 MB |
| Total | 108 | 35 | ~98.7M | ~876 MB |
HistData
--- means the Ask/Bid file was not downloaded for that symbol (Last price only). EUR/USD, BCO/USD and USD/ZAR include Ask/Bid from ASCII-format files.
| Script | Source | Purpose |
|---|---|---|
| `code/scripts/p_hist.py` | HistData | Converts NinjaTrader + ASCII CSVs to Parquet |
| `code/scripts/p_true.py` | TrueFX | Converts TrueFX tick CSVs to Parquet |
| `code/scripts/p_duka.py` | Dukascopy | Downloads tick data via API and saves as Parquet |
| `code/scripts/samp.py` | n/a | Generates small CSV samples from Parquet files |
- Course Page (UiO)
- HistData — Free Forex Historical Data
- TrueFX — Historical Downloads
- Dukascopy — Historical Market Data
- dukascopy-python (PyPI)
University of Oslo · Department of Mathematics · Spring 2026