Kalshi Trading Engine

A low-latency C++20 trading stack for the Kalshi prediction market, with a conformant exchange simulator to benchmark against.

Hot-path 486 ns p50 · 576 ns p99 · 3.18 M msg/s · 256-ticker dispatch · zero-alloc verified

C++20 · Linux x86-64 · Boost.Asio + io_uring · Boost.Beast WebSocket · OpenSSL RSA-PSS · simdjson · Google Benchmark · License: MIT


Results · Why This Exists · Architecture · Hot Path · Simulator · Quick Start · Reproduce


Why This Exists

There is no C++ client for any prediction market. Every existing Kalshi or Polymarket client is Python, TypeScript, or Go, which is fine for analytics but unusable for the latency tier real market makers operate in. kalshi-cpp fills that gap, and along the way exercises the systems-programming techniques used in production trading infrastructure: lock-free queues, custom arena, pool, and hash allocators, CPU pinning, SCHED_FIFO, mlockall, huge pages, io_uring, and nanosecond-resolution latency measurement.

Trading directly against the live exchange is a non-starter for an open-source project. Kalshi's demo and production environments require SSN and KYC. The repository therefore ships kalshi-sim, a conformant exchange written in C++ that speaks Kalshi's wire protocol byte for byte (RSA-PSS-signed REST and WebSocket market data) and runs a price-time-priority CLOB. The simulator is what makes end-to-end benchmarks possible at all, and what makes adversarial scenarios (forced disconnects, sequence-number gaps, partial fills, throttle storms) testable on demand. Both processes are optimized to the same standard, so measured tick-to-trade latency reflects the protocol and network stack rather than server-side sloppiness.

Headline numbers measure the application-internal hot path (parse → FlatHashMap dispatch over 256 tickers → orderbook → signal → reconcile → wire-serialize) across two CPU-pinned threads connected by a lock-free SPSC queue: 486 ns p50 / 576 ns p99 / 3.18 M msg/s sustained, on AMD EPYC 7V12, over 1 million iterations. Every message hashes its ticker string (FNV-1a 64-bit), looks up the corresponding per-market Book + Signal + OrderManager state through a Robin Hood FlatHashMap, and dispatches into a Pool-backed MarketState. The hot path performs zero heap allocations across all three composed primitives (SPSC queue + FlatHashMap + Pool). This is verified, not asserted: a global operator new interposer bumps a counter on every heap allocation during the timed window, and the counter must read 0 for the run to pass. End-to-end tick-to-trade (NIC ↔ userspace ↔ exchange) is a separate budget not yet measured in this repo; see Caveats.


Measured Results

Hot path pipeline (parse → FlatHashMap dispatch → orderbook → signal → reconcile → serialize)

Two-thread SPSC pipeline across 256 distinct tickers. The producer parses JSON and pushes timestamped {OrderbookDelta, produce_tsc} messages over a lock-free queue. The consumer pops, hashes the ticker string (FNV-1a 64-bit), looks up the per-market state through a Robin Hood FlatHashMap<TickerKey, MarketState*, 1024>, applies the delta to a flat-array book, evaluates the signal, reconciles desired vs. live orders, and serializes the resulting Actions to wire bytes. The 256 MarketState instances are allocated up-front from a fixed-capacity Pool<MarketState, 256>. Per-message latency is measured as rdtscp_consumer − rdtsc_producer; both threads are CPU-pinned to distinct physical cores on a shared-L3 CCX. Zero heap allocations in the timed window are enforced across all three composed primitives (SPSC queue + FlatHashMap + Pool): a global operator new interposer bumps a thread-shared counter on every heap allocation, and the bench reports PASS only when the counter is 0.
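The allocation check described above can be sketched as follows. This is a minimal illustration, not the repository's interposer: `g_heap_allocs` and `allocations_during` are hypothetical names, only the unaligned global `operator new` is replaced, and direct `malloc` calls are not counted.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>

// Hypothetical counter: incremented on every call to the global operator new.
std::atomic<std::size_t> g_heap_allocs{0};

// Replacement for the global allocation function. Every new-expression and
// every std::allocator-based container in the program routes through here.
void* operator new(std::size_t size) {
    g_heap_allocs.fetch_add(1, std::memory_order_relaxed);
    void* p = std::malloc(size != 0 ? size : 1);
    if (p == nullptr) std::abort();  // assumption: abort instead of throwing
    return p;
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Runs fn and returns how many heap allocations happened while it ran.
// A zero-allocation check passes only when this returns 0.
template <typename Fn>
std::size_t allocations_during(Fn&& fn) {
    const std::size_t before = g_heap_allocs.load(std::memory_order_relaxed);
    fn();
    return g_heap_allocs.load(std::memory_order_relaxed) - before;
}
```

A bench built this way would wrap its timed loop in `allocations_during` and print PASS only when the result is 0.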

| Metric | Value | Notes |
|---|---|---|
| End-to-end p50 | 486 ns | parse + push + cross-L1d handoff + pop + hash + map lookup + compute + serialize |
| End-to-end p90 | 536 ns | 99.8 % of messages clear in < 837 ns |
| End-to-end p99 | 576 ns | unimodal; no compute-side fat tail |
| End-to-end p99.9 | 3.00 µs | residual kernel-tick preemption (regular Linux, no nohz_full) |
| End-to-end max | 81 µs | single-event scheduler-class outlier |
| Sustained throughput | 3.18 M msg/s | bottleneck is consumer-side compute (~314 ns/msg) |
| FlatHashMap dispatch cost | ~20 ns p50 / ~50 ns p99 | one FNV-1a over 32-byte ticker + one Robin Hood probe at ~25 % load factor |
| Heap allocations / 1 M messages | 0 | enforced across SPSC + FlatHashMap + Pool via global new/delete interposer |
| Cycles per message (p50) | 1,188 | ≈ 486 ns at a ~2.45 GHz TSC rate |

Platform: Azure VM, AMD EPYC 7V12 64-Core (96 logical CPUs, no SMT), Ubuntu 24.04, kernel 6.17. Cores 16 and 18 (same L3 CCX, NUMA node 0). Release build -O3 -march=native -flto -fno-exceptions -fno-rtti. 1 M iterations after 10 K warmup, 256 tickers round-robined uniformly. Reproduced via bench/bench_hotpath_multi.cpp; raw cycle counts dumped to /tmp/bench_hotpath_multi_cycles.bin; histogram rendered by script/plot_hotpath_hist.py.
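As a rough illustration of how the raw cycle counts turn into the percentile rows above, the sketch below computes nearest-rank percentiles and converts TSC cycles to nanoseconds. The function names are hypothetical, and the 2.45 GHz figure is taken from the table rather than calibrated at runtime.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile. permille is the percentile times 10, so
// p50 -> 500, p99 -> 990, p99.9 -> 999. Takes a copy so the caller's
// sample order is preserved.
std::uint64_t percentile_permille(std::vector<std::uint64_t> cycles,
                                  std::uint64_t permille) {
    if (cycles.empty()) return 0;
    std::sort(cycles.begin(), cycles.end());
    std::uint64_t rank = (permille * cycles.size() + 999) / 1000;  // ceiling
    if (rank == 0) rank = 1;
    if (rank > cycles.size()) rank = cycles.size();
    return cycles[static_cast<std::size_t>(rank - 1)];
}

// Converts TSC cycles to nanoseconds given the TSC rate in GHz
// (cycles per nanosecond).
double cycles_to_ns(std::uint64_t cycles, double tsc_ghz) {
    return static_cast<double>(cycles) / tsc_ghz;
}
```

With the table's numbers, `cycles_to_ns(1188, 2.45)` comes out just under 485 ns, in line with the reported p50.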

[Figure: Hot-path latency histogram, 1M messages, AMD EPYC 7V12, log-log axes]
End-to-end per-message latency across 1 M iterations (1-ticker compute floor). Log-log axes; main mode at ~1 µs holds 99.7 % of samples; residual tail (kernel timer preemption) terminates near 100 µs. p50/p90/p99/p99.9 markers overlaid. The 256-ticker dispatch shifts the main mode right by ~20 ns at p50 and ~50 ns at p99 (see the metric table above) and tightens the residual tail (p99.9 drops 16.2 µs → 3.0 µs, max 105 µs → 81 µs across the run).

Tuning ablation

Each row applies one additional production-hardening knob. Same bench, same payload distribution; only the OS/topology configuration changes.

| Configuration | p50 | p99 | p99.9 | Throughput | Platform |
|---|---|---|---|---|---|
| Single-thread baseline (no queue handoff) | 318 ns | 618 ns | 1.02 µs | | WSL2 (i7-1370P) |
| Two-thread, pinned to IRQ-busy cores (CPU 2 / 4) | 695 ns | 252 µs | 5.50 ms | 2.31 M/s | WSL2 (i7-1370P) |
| + SCHED_FIFO + mlockall | 499 ns | 183 µs | 1.13 ms | 3.39 M/s | WSL2 (i7-1370P, sudo) |
| + move off IRQ-busy cores (CPU 16 / 18) | 571 ns | 94 µs | 348 µs | 2.92 M/s | WSL2 (i7-1370P, sudo) |
| Quiet dedicated VM, same-L3 CCX (CPU 16 / 18), 1 ticker | 466 ns | 526 ns | 16.2 µs | 3.15 M/s | Azure EPYC 7V12 |
| + 256-ticker FlatHashMap dispatch + Pool state | 486 ns | 576 ns | 3.00 µs | 3.18 M/s | Azure EPYC 7V12 (headline) |

Each step's impact, in order:

  • SCHED_FIFO + mlockall (row 3): outranks softirq/CFS so kernel interrupt handlers stop preempting mid-iteration; page locking eliminates minor-fault outliers. p50 −28 %, throughput +47 %.
  • Quiet cores (row 4): /proc/interrupts showed virtio0-virtqueues MSI pinned to CPU 2 — every NIC interrupt was preempting our producer. Moving to CPUs far from the boot CPU and the IRQ-host cores shrinks the 5 – 40 µs scheduler-noise bump. p99 ↓ 2× (183 → 94 µs).
  • Dedicated VM, same-L3 CCX (row 5): no Windows host scheduler stealing vCPUs; no Hyper-V multi-tasking; producer/consumer pinned to two cores in the same EPYC CCX share L3, so the SPSC slot's cache line migrates within one CCX (~30 cycles) rather than crossing CCXs. p99 collapses 178× (94 µs → 526 ns). Establishes the 1-ticker compute floor.
  • 256-ticker dispatch (row 6, headline): adds one FNV-1a hash over the 32-byte ticker and one Robin Hood find() probe per message, against a FlatHashMap at ~25 % load factor; per-market Book + Signal + OrderManager state lives in a Pool<MarketState, 256> allocated up-front. Dispatch costs ~20 ns at p50 and ~50 ns at p99, consistent with roughly one extra cache-line load per message, and brings the production-realistic scenario online while the zero-allocation invariant continues to hold across all three composed primitives (SPSC + FlatHashMap + Pool). Throughput is essentially unchanged (3.15 → 3.18 M msg/s).
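For reference, the per-message hash in the dispatch step is small. Below is a minimal FNV-1a 64-bit sketch using the standard offset basis and prime; `fnv1a_64` and `slot_for` are hypothetical names, and masking to 1024 slots is an assumption based on the FlatHashMap capacity quoted earlier.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// FNV-1a, 64-bit: XOR each byte into the state, then multiply by the prime.
constexpr std::uint64_t fnv1a_64(std::string_view bytes) noexcept {
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ull;                  // FNV prime
    }
    return h;
}

// Maps a hash to a home slot in a power-of-two table (1024 assumed here).
constexpr std::size_t slot_for(std::uint64_t hash) noexcept {
    return static_cast<std::size_t>(hash) & (1024u - 1u);
}

// The empty input must hash to the offset basis.
static_assert(fnv1a_64("") == 14695981039346656037ull);
```

Because the function is `constexpr`, hashes for a fixed ticker universe could also be computed at compile time.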

The remaining p99.9 = 3 µs / max ≈ 81 µs floor is the regular Linux timer tick (LOC interrupts) and one stray scheduler-class outlier — bare metal with isolcpus / nohz_full would tighten both.

Microbenchmarks (Google Benchmark)

Per-operation latency on the development workstation (Intel i7-1370P, WSL2, 22 logical CPUs, Release -O3 -march=native -flto). Numbers are mean per-op unless noted. Emitted by bench/bench_spsc.cpp and bench/bench_pool.cpp.

| Component | Operation | This project | Standard library | Speedup |
|---|---|---|---|---|
| SPSC queue | single-thread push/pop (int) | 1.01 ns | std::queue: 1.18 ns | |
| SPSC queue | single-thread push/pop (108-byte struct) | 8.05 ns | dominated by payload memcpy | |
| SPSC queue | single-thread w/ mutex (no contention) | | mutex std::queue: 30.5 ns | 30× |
| SPSC queue | cross-thread sustained | 44 M item/s | mutex std::queue: 8.9 M | |
| Pool allocator | realistic alloc + field write + free | 0.76 ns | malloc / free: 8.97 ns | 12× |
| Pool allocator | 1024-order steady-state churn | 1.80 ns | malloc / free: 7.72 ns; std::list push/pop: 14.9 ns | 4 – 8× |
| Pool allocator | sustained throughput | 776 M op/s | malloc / free: 163 M | 4.8× |

These are the primitives the hot path is built from. The hot-path bench above is what you get when SPSC queue + Robin Hood FlatHashMap + intrusive Pool compose under realistic two-thread queueing with 256 dispatch keys — and the zero-allocation guarantee holds across all three simultaneously, on both producer and consumer threads, across 1 M iterations.

Caveats

  • Measurement boundary. The hot-path numbers cover parsed-bytes-in to wire-bytes-out — application-internal compute latency. Production tick-to-trade additionally includes NIC ↔ userspace traversal (~1–5 µs with kernel-bypass, ~10–20 µs with the TCP fast path) and the exchange round trip (sub-µs colocated, double-digit µs at LAN distance). Real HFT firms report tick-to-trade with hardware-timestamped NICs; that path is on the design roadmap but not in the current measurement.
  • Platforms. WSL2 rows are bounded below by the Hyper-V hypervisor scheduler (~100 µs preemptions that no in-VM syscall can reach). The EPYC row is bounded by the regular Linux timer tick. Bare metal with isolcpus / nohz_full / rcu_nocbs is the next floor.
  • Cross-core TSC. Both rdtsc reads are on different physical cores; correctness depends on invariant TSC. EPYC 7V12 exposes constant_tsc, nonstop_tsc, tsc_known_freq, tsc_reliable; verified via /proc/cpuinfo.
  • Hardware-counter introspection (per-iteration cycles/IPC/cache-miss/branch-mispredict via perf_event_open) is deferred: deepx-3 sets perf_event_paranoid=4, blocking userspace perf collection without root; WSL2 PMU exposure under Hyper-V is uneven.

Architecture

Two processes, identical low-latency discipline on both sides, joined by real TCP through the loopback interface. tc netem can inject realistic delay and jitter on loopback (100 µs RTT ±20 µs in the LAN profile), so that end-to-end measurements approximate a networked deployment rather than an idealized zero-latency loopback.

   ┌──────────────────────────────┐         ┌──────────────────────────────┐
   │       kalshi-cpp  client     │         │       kalshi-sim  server     │
   │                              │         │                              │
   │  ┌──────────┐                │   WSS   │              ┌────────────┐  │
   │  │ Network  │ ─── orders ──> │ ──────> │ ──── feed ── │ Matching   │  │
   │  │ Thread   │                │  REST   │              │ Engine     │  │
   │  │          │ <── fills ──── │ <────── │ ── replies ─>│ + Auth Ver │  │
   │  │  io_uring│                │ TLS+PSS │              │            │  │
   │  └──────────┘                │  signed │              └────────────┘  │
   │       │                      │         │                    │         │
   │       │ SPSC (lock-free)     │         │                    │         │
   │       v                      │         │                    v         │
   │  ┌────────────┐              │         │   ┌──────────────────────┐   │
   │  │  Strategy  │              │         │   │  scenario injection: │   │
   │  │  Order Mgr │              │         │   │  latency, drops,     │   │
   │  │  Book      │              │         │   │  partial fills,      │   │
   │  └────────────┘              │         │   │  seq gaps, throttle  │   │
   │       │                      │         │   └──────────────────────┘   │
   │       v                      │         │                              │
   │  ┌──────────────────────┐    │         │  CPU pin, NODELAY, busy poll │
   │  │  Latency Logger      │    │         └──────────────────────────────┘
   │  │  (rdtsc, p50/p99)    │    │                          │
   │  └──────────────────────┘    │                          │
   │                              │            tc netem on loopback adds
   │  Memory: Arena + Pool        │            realistic latency + jitter
   │  OS: CPU pin, huge pages,    │            (100 µs RTT ±20 µs jitter)
   │      mlockall, SCHED_FIFO    │
   └──────────────────────────────┘

The client runs two threads connected by lock-free SPSC ring buffers. There is no mutex and no condition variable, and the only state the threads share is the rings themselves.

| Thread | Owns | Responsibility |
|---|---|---|
| Network | WS connection, REST socket, io_uring | Parse incoming JSON (simdjson), push deltas onto SPSC; pop outbound orders, sign, send. |
| Strategy | Order book, order manager, signals | Pop deltas, apply to flat-array book, decide, push orders onto outbound SPSC. |
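A minimal sketch of the kind of ring that connects the two threads is shown below. It assumes a power-of-two capacity, acquire/release ordering, cache-line-separated indices, and cached copies of the opposite index. `SpscRing` and its method names are hypothetical, not the contents of src/core/spsc_queue.h.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Producer thread only.
    bool try_push(const T& value) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - cached_head_ == N) {  // looks full: refresh the cached head
            cached_head_ = head_.load(std::memory_order_acquire);
            if (t - cached_head_ == N) return false;
        }
        buf_[t & (N - 1)] = value;
        tail_.store(t + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer thread only.
    std::optional<T> try_pop() {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == cached_tail_) {  // looks empty: refresh the cached tail
            cached_tail_ = tail_.load(std::memory_order_acquire);
            if (h == cached_tail_) return std::nullopt;
        }
        T value = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);  // release the slot
        return value;
    }

private:
    // Each side's hot fields sit on their own cache line to avoid false sharing.
    alignas(64) std::atomic<std::size_t> head_{0};  // written by the consumer
    alignas(64) std::size_t cached_tail_{0};        // consumer-local
    alignas(64) std::atomic<std::size_t> tail_{0};  // written by the producer
    alignas(64) std::size_t cached_head_{0};        // producer-local
    alignas(64) std::array<T, N> buf_{};
};
```

The cached indices mean that in the common case each side touches only its own cache lines, and the shared atomic is re-read only when the ring looks full or empty.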

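Boost.Asio abstracts the kernel interface. On Linux 5.15+ the build selects Asio's io_uring backend, which submits and reaps I/O through shared rings and so needs far fewer syscalls per operation than an epoll readiness loop. The receive-path gain over epoll is expected to be roughly 30 to 50%, though this repo does not yet measure it. On older kernels the build falls back to Asio's epoll backend, and the application code does not change.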


Hot Path

The path from parsed-bytes-in to wire-bytes-out — parse, orderbook delta apply, signal evaluation, order reconciliation, JSON serialization — runs without a single heap allocation. That property is why the p99 number is what it is. If anything on the hot path takes a page fault or calls into malloc, the tail blows up by orders of magnitude. The same discipline extends out to the NIC boundaries (io_uring receive, RSA-PSS-signed REST, mmap-backed arena for incoming frames) in the surrounding design; those layers are on the roadmap.

| Concern | Mechanism | Why this matters for tail latency |
|---|---|---|
| Cross-thread queueing | Power-of-two SPSC ring, acquire/release, alignas(64) head/tail, cached indices | Avoids seq_cst, false sharing, and any kernel call. |
| Order-object lifetime | Intrusive free-list pool allocator, 64-byte Order | O(1) alloc, O(1) dealloc, deterministic, no fragmentation. |
| JSON parse buffers | Bump arena reset per tick, mmap-backed | O(1) "free everything," with SIMDJSON_PADDING accounted for at the edge. |
| Order-ID lookup | Robin Hood FlatHashMap<OrderId, Order*, 1<<19> | Open-addressing, backshift deletion, sentinel-keyed slots, never grows. |
| Orderbook (Kalshi 1-99¢) | Flat 99-element array of (qty, count) keyed by tick | Cache-resident, branch-light delta apply, integer fixed-point, no float. |
| JSON parsing | simdjson (zero-copy, SIMD) | 2 to 4× faster than nlohmann/json, no per-message allocations. |
| Timestamps | rdtsc calibrated against CLOCK_MONOTONIC at startup | 8-cycle acquisition vs. about 21 ns syscall; resolves sub-ns differences in HDR. |
| Allocation verification | Debug operator new override + thread-local hot_path_active flag | Any accidental heap allocation aborts with a backtrace in CI. |
| Exceptions / RTTI | Compiled with -fno-exceptions -fno-rtti | Beast async uses error_code overloads exclusively; uncaught throw aborts. |
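The intrusive free-list pool in the table can be sketched roughly as below. This is an illustration under assumptions, not the code in src/core/pool_alloc.h: `Pool::create` and `Pool::destroy` are hypothetical names, capacity is fixed at compile time, and exhaustion returns `nullptr` rather than growing.

```cpp
#include <cstddef>
#include <new>
#include <utility>

template <typename T, std::size_t N>
class Pool {
public:
    Pool() noexcept {
        // Thread every slot onto the free list up front.
        for (std::size_t i = 0; i < N; ++i) {
            slots_[i].next = free_;
            free_ = &slots_[i];
        }
    }
    Pool(const Pool&) = delete;
    Pool& operator=(const Pool&) = delete;

    // O(1): pop a slot off the free list and construct T in place.
    template <typename... Args>
    T* create(Args&&... args) {
        if (free_ == nullptr) return nullptr;  // exhausted: never touches the heap
        Slot* s = free_;
        free_ = s->next;
        return ::new (static_cast<void*>(s->storage)) T(std::forward<Args>(args)...);
    }

    // O(1): destroy T and push its slot back onto the free list.
    void destroy(T* p) noexcept {
        p->~T();
        Slot* s = reinterpret_cast<Slot*>(p);  // storage sits at offset 0 of the slot
        s->next = free_;
        free_ = s;
    }

private:
    // A free slot stores the next-pointer in the same bytes a live T would use,
    // which is what makes the free list "intrusive".
    union Slot {
        Slot* next;
        alignas(T) unsigned char storage[sizeof(T)];
    };
    Slot slots_[N];
    Slot* free_ = nullptr;
};
```

Because the free list is LIFO, a slot that was just released is the next one handed out, which tends to keep recently used memory warm in cache.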

All primitives are implemented in src/core/ and exercised by both the client and the simulator. OS-level tuning (CPU pinning, huge pages, mlockall, SCHED_FIFO) lives in src/system/tuning.cpp.
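To make the "Robin Hood, sentinel-keyed, backshift delete" description concrete, here is a compact sketch of a fixed-capacity map with those properties. It is not the FlatHashMap in src/core/: `RobinHoodMap` is a hypothetical name, keys are assumed to be already-hashed non-zero 64-bit values, and 0 is assumed as the empty sentinel.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

template <typename V, std::size_t Capacity>
class RobinHoodMap {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    static constexpr std::uint64_t kEmpty = 0;  // sentinel: key 0 is reserved

    bool insert(std::uint64_t key, V value) {
        // Always keep one empty slot so probes terminate; never grows.
        if (key == kEmpty || size_ + 1 >= Capacity) return false;
        std::size_t idx = home(key), dist = 0;
        for (;;) {
            Slot& s = slots_[idx];
            if (s.key == kEmpty) {
                s.key = key; s.value = std::move(value); ++size_;
                return true;
            }
            if (s.key == key) { s.value = std::move(value); return true; }
            const std::size_t resident = probe_distance(idx);
            if (resident < dist) {  // resident is "richer": take its slot
                std::swap(key, s.key);
                std::swap(value, s.value);
                dist = resident;
            }
            idx = (idx + 1) & kMask;
            ++dist;
        }
    }

    V* find(std::uint64_t key) {
        const std::size_t idx = find_index(key);
        return idx == Capacity ? nullptr : &slots_[idx].value;
    }

    bool erase(std::uint64_t key) {
        std::size_t idx = find_index(key);
        if (idx == Capacity) return false;
        // Backshift: pull displaced followers one slot closer to home.
        std::size_t next = (idx + 1) & kMask;
        while (slots_[next].key != kEmpty && probe_distance(next) > 0) {
            slots_[idx] = std::move(slots_[next]);
            idx = next;
            next = (next + 1) & kMask;
        }
        slots_[idx].key = kEmpty;
        --size_;
        return true;
    }

    std::size_t size() const { return size_; }

private:
    struct Slot { std::uint64_t key = kEmpty; V value{}; };
    static constexpr std::size_t kMask = Capacity - 1;

    static std::size_t home(std::uint64_t key) {
        return static_cast<std::size_t>(key) & kMask;
    }
    std::size_t probe_distance(std::size_t idx) const {
        return (idx - home(slots_[idx].key)) & kMask;
    }
    std::size_t find_index(std::uint64_t key) const {
        if (key == kEmpty) return Capacity;
        std::size_t idx = home(key), dist = 0;
        for (;;) {
            const Slot& s = slots_[idx];
            if (s.key == kEmpty) return Capacity;
            if (s.key == key) return idx;
            if (probe_distance(idx) < dist) return Capacity;  // would have been placed earlier
            idx = (idx + 1) & kMask;
            ++dist;
        }
    }

    std::array<Slot, Capacity> slots_{};
    std::size_t size_ = 0;
};
```

Backshift deletion avoids tombstones, so lookups after heavy churn probe no further than they would in a freshly built table.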


Reproducing the headline number

The hot-path bench is fully self-contained — no exchange, no simulator, no network. It builds and runs on any Linux x86-64 box with CMake ≥ 3.20 and GCC ≥ 12.

# Build
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target bench_hotpath_multi -j

# Pick two CPUs on the same physical core complex / shared L3.
# Inspect topology first:
lscpu --extended | head -20       # CORE and L3 columns matter
cat /proc/interrupts | head -20   # avoid CPUs hosting busy IRQs

# Edit CORE_PRODUCER / CORE_CONSUMER in bench/bench_hotpath_multi.cpp,
# rebuild, then run. Add sudo for SCHED_FIFO + mlockall to take effect.
sudo ./build/bench_hotpath_multi             # headline (256-ticker dispatch)
./build/bench_hotpath_pipe                   # 1-ticker compute floor (compare against)
./build/bench_hotpath                        # single-thread floor (no queue handoff)

Each bench prints min / p50 / p90 / p99 / p99.9 / max in both TSC cycles and nanoseconds, sustained throughput, the zero-allocation PASS/FAIL line, and an ASCII log2-bucket histogram. Raw per-message cycle counts are dumped to /tmp/bench_hotpath_multi_cycles.bin (and the 1-ticker variant to /tmp/bench_hotpath_pipe_cycles.bin) for offline analysis; script/plot_hotpath_hist.py renders the matplotlib histogram shown above.

For tightest measurements, prefer:

  • Cores on the same L3 CCX (lscpu --extended, match the L3 column).
  • Cores off any CPU that /proc/interrupts shows as hosting NIC / NVMe MSI IRQs.
  • sudo so SCHED_FIFO priority 50 and mlockall(MCL_CURRENT | MCL_FUTURE) actually take effect; otherwise the bench prints warnings and falls back to SCHED_OTHER.
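The three knobs in the list above map onto a handful of Linux calls. The sketch below shows one plausible shape for them; the function names are hypothetical and this is not the code in src/system/tuning.cpp. Each helper returns false instead of aborting, so an unprivileged run can warn and continue.

```cpp
#include <sched.h>
#include <sys/mman.h>

// Pins the calling thread to one CPU. Returns false if the index is out of
// range or the kernel rejects the mask.
bool pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = calling thread
}

// Switches the calling thread to SCHED_FIFO at the given priority.
// Usually needs root or CAP_SYS_NICE; returns false otherwise.
bool enable_fifo(int priority) {
    sched_param sp{};
    sp.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &sp) == 0;
}

// Locks current and future pages into RAM to avoid page-fault outliers.
bool lock_all_memory() {
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
}
```

A bench thread might call `pin_current_thread(16)`, then `enable_fifo(50)` and `lock_all_memory()`, printing a warning for any call that returns false.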

Production network-stack knobs (TCP_NODELAY, SO_BUSY_POLL, TCP_QUICKACK, io_uring, huge pages, kernel-bypass NIC paths via Solarflare ef_vi / DPDK) and an end-to-end tick-to-trade bench against kalshi-sim under tc netem are on the roadmap; current measurements do not include them.


The Simulator (kalshi-sim)

The simulator is not a mock. It is a real C++ server that owns:

  • A custom epoll-based HTTP/1.1 stack (raw socket / bind / listen / accept / epoll_ctl, per-connection buffers, partial-read handling, no Beast, no framework). Walking the kernel-level path is part of the pedagogical value.
  • A Boost.Beast WebSocket server for the market-data feed, with both replay mode (captured Kalshi payloads) and generative mode (random-walk synthetic feed, configurable volatility and book depth).
  • An OpenSSL RSA-PSS verifier (EVP_DigestVerify*) that checks KALSHI-ACCESS-KEY/TIMESTAMP/SIGNATURE on every REST request and rejects on unknown key, ±5 s timestamp skew, or signature mismatch. Same checks the real exchange performs.
  • A price-time-priority CLOB matching engine (limit, IOC, FOK, GTC; post_only and reduce_only modifiers) using the same FlatHashMap from src/core/. Covered by test/test_matching_engine.cpp.
  • A control endpoint exposing adversarial-scenario knobs that no live exchange would offer.

Layered design

domain/   ← matching_engine, market_registry, account_book, types.h.
            Pure business logic. No I/O, no JSON, no Boost.
  ↑
services/ ← exchange_service. The single place where order placement composes
            engine.match → account.apply → fan-out. No transport leaks down here.
  ↑
http/  ws/  scenario/   ← transport adapters. Bytes ↔ service call ↔ bytes.
                          rest_server, handlers, auth_middleware on the HTTP side.
  ↑
auth/   ← auth_verify. Protocol-agnostic OpenSSL, reused by HTTP and the WS handshake.

Scenario injection (control channel, localhost only)

| Endpoint | Effect | Tests the client's … |
|---|---|---|
| POST /sim/inject_delay {ms} | Stalls responses for N ms. | Timeout and latency-budget paths |
| POST /sim/drop_connection | Force-closes the client's WebSocket. | Reconnection FSM, re-subscribe, snapshot replay |
| POST /sim/inject_seq_gap | Skips a sequence number on the feed. | Gap detector, BookStale flow, fresh snapshot path |
| POST /sim/partial_fill_rate {0..1} | Fraction of orders partial-filled. | Order-lifecycle state machine |
| POST /sim/rate_limit_burst | Returns synthetic 429 responses. | Token-bucket rate limiter and back-off |

These endpoints make reconnection, gap recovery, and rate-limit back-off testable on demand instead of waiting for a real Thursday-3-AM-ET maintenance window.
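As an example of the client-side logic that /sim/inject_seq_gap exercises, a gap detector can be as small as the sketch below. `GapDetector`, `SeqStatus`, and the snapshot hook are hypothetical names; the source only says that a gap marks the book stale and triggers a fresh snapshot.

```cpp
#include <cstdint>

enum class SeqStatus { Ok, Duplicate, Gap };

class GapDetector {
public:
    // Feed every sequence number from the market-data stream through here.
    SeqStatus observe(std::uint64_t seq) {
        if (!have_last_) {                 // first message establishes the baseline
            have_last_ = true;
            last_ = seq;
            return SeqStatus::Ok;
        }
        if (seq == last_ + 1) {            // in order
            last_ = seq;
            return SeqStatus::Ok;
        }
        if (seq <= last_) return SeqStatus::Duplicate;  // replayed or stale message
        stale_ = true;                     // one or more messages were skipped
        last_ = seq;
        return SeqStatus::Gap;
    }

    // True until a fresh snapshot has been applied after a gap.
    bool stale() const { return stale_; }

    // Call after applying a snapshot that is current as of snapshot_seq.
    void on_snapshot(std::uint64_t snapshot_seq) {
        last_ = snapshot_seq;
        have_last_ = true;
        stale_ = false;
    }

private:
    std::uint64_t last_ = 0;
    bool have_last_ = false;
    bool stale_ = false;
};
```

While `stale()` is true, a strategy would typically stop quoting from the book and wait for the snapshot path to complete.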


Quick Start

All build and run commands target Linux. WSL2 on Windows is supported. WSL2 ships a real Linux 5.15+ kernel, so epoll, sched_setaffinity, mmap(MAP_HUGETLB), mlockall, and io_uring all work.

# Prerequisites (Ubuntu 22.04+ / WSL2)
sudo apt install cmake g++-12 libboost-all-dev libssl-dev

# (Optional, for Phase-4 numbers) huge pages: 64 × 2 MB = 128 MB
echo 64 | sudo tee /proc/sys/vm/nr_hugepages

# Build (run from the repository root so the ./build/... paths below resolve)
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"

One-time keypair setup (replaces Kalshi KYC/SSN flow)

mkdir -p ~/.kalshi
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out ~/.kalshi/dev_private.pem
openssl rsa -in ~/.kalshi/dev_private.pem -pubout -out ~/.kalshi/dev_public.pem
chmod 600 ~/.kalshi/dev_private.pem
uuidgen > ~/.kalshi/dev_key_id

export KALSHI_KEY_ID="$(cat ~/.kalshi/dev_key_id)"
export KALSHI_KEY_PATH="$HOME/.kalshi/dev_private.pem"
export KALSHI_API_BASE="http://127.0.0.1:8443"
export KALSHI_WS_URL="ws://127.0.0.1:8444/trade-api/ws/v2"

Run end-to-end against the simulator

# Terminal 1: start kalshi-sim and register the public key
./build/kalshi-sim --register ~/.kalshi/dev_public.pem --key-id "$KALSHI_KEY_ID"

# Terminal 2: inject LAN-class delay (~100 µs RTT) for end-to-end runs; the hot-path benches do not need it
sudo bash scripts/netem_lan.sh

# Terminal 3: start the client
./build/kalshi-cpp

Run against the real exchange instead

Only environment variables change. Point KALSHI_API_BASE and KALSHI_WS_URL at Kalshi's URLs and KALSHI_KEY_PATH at the key uploaded through Kalshi's dashboard. The C++ code is identical.

Benchmarks and tests

# Hot-path pipeline (headline number — see Measured Results)
sudo ./build/bench_hotpath_multi       # 256-ticker FlatHashMap + Pool dispatch (headline)
sudo ./build/bench_hotpath_pipe        # 2-thread SPSC pipeline, 1 ticker (compute floor)
./build/bench_hotpath                  # single-thread compute floor (no queue handoff)

# Component microbenchmarks (Google Benchmark)
./build/bench_spsc
./build/bench_pool

# Render the latency histogram from the most recent pipe run
source ~/miniconda3/etc/profile.d/conda.sh && conda activate motus    # or any env with matplotlib + numpy
python script/plot_hotpath_hist.py --out docs/hotpath_latency_histogram.png

# Tests
ctest --test-dir build --output-on-failure   # parser, book, signal, order_manager, serialize, spsc, pool, matching_engine

Platform

Production target is Linux x86-64. Development happens on Windows 11 + WSL2 (Ubuntu 22.04+). The codebase uses POSIX/Linux APIs idiomatically rather than hiding them behind a portability shim. Windows equivalents are listed for reference.

| Capability | Linux API | Windows equivalent (reference only) |
|---|---|---|
| CPU pinning | sched_setaffinity() | SetThreadAffinityMask() |
| Real-time scheduling | sched_setscheduler(SCHED_FIFO) | SetPriorityClass(REALTIME_PRIORITY_CLASS) |
| Lock memory | mlockall(MCL_CURRENT \| MCL_FUTURE) | VirtualLock() |
| Huge pages | mmap(MAP_HUGETLB) | VirtualAlloc(MEM_LARGE_PAGES) |
| Transparent huge pages | madvise(MADV_HUGEPAGE) | N/A (Windows uses explicit large pages) |
| High-res timestamp | rdtsc inline asm | __rdtsc() intrinsic (identical instruction) |
| Async I/O | io_uring / epoll | IOCP |

Project Structure

kalshi-cpp/
├── CMakeLists.txt
├── DESIGN.md                            # Original design doc, kept verbatim for cross-reference
├── src/                                 # === CLIENT (kalshi-cpp) ===
│   ├── main.cpp                         # Entry point, composition root, OS tuning sequencing
│   ├── net/                             # Networking layer
│   │   ├── ws_client.{h,cpp}            # Boost.Beast WebSocket over TLS, error_code paths only
│   │   ├── ws_reconnect.{h,cpp}         # Reconnect FSM, exp backoff 100ms to 5s, gap-aware
│   │   ├── rest_client.{h,cpp}          # REST + RSA-PSS signing, base URL via env
│   │   ├── rate_limiter.h               # Token-bucket mirroring Kalshi tiers (Basic/Adv/Premier)
│   │   ├── sockopt.{h,cpp}              # NODELAY · BUSY_POLL · QUICKACK · RCVBUF/SNDBUF
│   │   └── auth.{h,cpp}                 # RSA-PSS SHA-256 via OpenSSL EVP
│   ├── feed/
│   │   ├── parser.{h,cpp}               # JSON to POD structs (simdjson, arena-backed)
│   │   └── book.{h,cpp}                 # 99-element flat-array orderbook, integer ticks
│   ├── core/                            # Low-latency primitives, shared with sim/
│   │   ├── spsc_queue.h                 # Lock-free SPSC ring, alignas(64), cached indices
│   │   ├── arena_alloc.h                # Bump allocator, mmap-backed, reset-per-tick
│   │   ├── pool_alloc.h                 # Intrusive free-list, fixed-size blocks
│   │   ├── flat_hash_map.h              # Robin Hood, sentinel-keyed, backshift delete
│   │   ├── clock.h                      # rdtsc + CLOCK_MONOTONIC calibration
│   │   └── json_io.{h,cpp}              # Shared JSON helpers (used by client + sim)
│   ├── strategy/
│   │   ├── signal.{h,cpp}               # Spread + microstructure signal
│   │   └── order_manager.{h,cpp}        # Pool-allocated Orders, FlatHashMap lookup
│   ├── system/tuning.{h,cpp}            # CPU pin · SCHED_FIFO · mlockall · huge pages
│   └── util/{log.h,histogram.h}         # Lock-free log, HDR latency histogram
├── sim/                                 # === SIMULATOR (kalshi-sim) ===
│   ├── main.cpp                         # Composition root
│   ├── domain/                          # Pure business logic
│   │   ├── types.h                      # Side · OrderId · ClientId · Price · Qty · Fill
│   │   ├── matching_engine.{h,cpp}      # Price-time priority CLOB
│   │   ├── market_registry.{h,cpp}      # Per-ticker engine instances (unique_ptr stability)
│   │   └── account_book.{h,cpp}         # Per-client balance + position bookkeeping
│   ├── services/
│   │   └── exchange_service.{h,cpp}     # engine.match → account.apply → fan-out fills
│   ├── http/
│   │   ├── rest_server.{h,cpp}          # Custom epoll HTTP/1.1. Raw sockets, no Beast, no TLS
│   │   ├── handlers.{h,cpp}             # Request → ExchangeService → Response (JSON only here)
│   │   └── auth_middleware.{h,cpp}      # KALSHI-ACCESS-* parsing, threads ClientId downstream
│   ├── ws/                              # WebSocket feed: replay + generative modes
│   ├── auth/auth_verify.{h,cpp}         # EVP_DigestVerify against registered pubkeys
│   ├── scenario/                        # /sim/* control endpoint (delay, drops, gaps, ...)
│   └── replay/*.json                    # Captured Kalshi payload samples
├── bench/                               # Google Benchmark
│   ├── bench_spsc.cpp · bench_pool.cpp · bench_flat_hash_map.cpp
│   ├── bench_arena.cpp · bench_parser.cpp · bench_book.cpp
│   └── bench_e2e.cpp                    # End-to-end vs kalshi-sim under tc netem
├── test/                                # Google Test
│   ├── test_spsc · test_pool · test_flat_hash_map
│   ├── test_rest_client · test_matching_engine
│   ├── test_auth · test_book · test_parser
├── experiments/                         # Pedagogical TCP/epoll experiments
├── scripts/
│   ├── isolate_cpus.sh                  # isolcpus + irqaffinity tooling
│   ├── netem_colocated.sh               # ~5 µs RTT, ±1 µs
│   ├── netem_lan.sh                     # ~100 µs RTT, ±20 µs jitter   (headline profile)
│   ├── netem_wan.sh                     # 1 ms RTT, 0.1% loss
│   └── netem_clear.sh                   # tc qdisc del
└── docs/
    ├── PRODUCTION_TUNING.md             # Kernel-bypass path (Solarflare, DPDK, RDMA, PTP)
    └── BENCHMARK_RESULTS.md             # Full per-knob ablation across all netem profiles

Tech Stack

| Layer | Technology | Notes |
|---|---|---|
| Language | C++20 | Concepts, designated initializers, std::atomic_ref, ranges where useful. |
| Build | CMake 3.20+, compile_commands.json for IntelliSense | Single-config Release with -O3 -march=native -flto -fno-exceptions -fno-rtti. |
| WebSocket + HTTP transport | Boost.Beast | Zero-overhead async, integrates with Asio, error_code overloads everywhere. |
| Async I/O | Boost.Asio with BOOST_ASIO_HAS_IO_URING | io_uring backend on Linux 5.15+ for the receive path, epoll fallback. |
| TLS + RSA-PSS signing | OpenSSL EVP | Required for Kalshi auth. EVP_DigestVerify* reused on the simulator side. |
| JSON parsing | simdjson | 2 to 4× faster than alternatives, zero-copy, arena-backed input buffers. |
| Microbenchmarks | Google Benchmark | Per-op latency histograms, regression-detectable in CI. |
| Tests | Google Test | Used for core/ correctness and matching-engine conformance. |
| Network emulation | Linux tc netem on loopback | Realistic LAN, colocated, lossy-WAN profiles without leaving the host. |
| CI runtime | Ubuntu 22.04, kernel 5.15+ | Matches production tier. WSL2 also satisfies this for local dev. |

References

@misc{kalshi-cpp-2026,
  author = {Zhou, Yincheng},
  title  = {kalshi-cpp: A Low-Latency C++ Trading Stack and Conformant Exchange Simulator for the Kalshi Prediction Market},
  year   = {2026},
  url    = {https://github.com/ArtysicistZ/Kalshi_Cpp}
}

MIT License © 2026 Zhou Yincheng
