I bring quant-grade statistical rigor to LLM engineering: evals, error bars, and not shipping on noise. The discipline comes from running multi-agent systems where a wrong answer does not fail a unit test, it costs real money. PhD in Finance (ML time series).
I work at the intersection of AI engineering and quantitative finance: multi-agent orchestration, LLM evaluation and reliability, and production systems that have to be right, not just plausible.
Eval gates and reliability primitives for LLM pipelines, with a confidence interval on every metric and a statistically honest regression gate. The full eval surface (deterministic gates, calibrated LLM-as-judge, agentic-trace evals, multiple-comparisons-corrected CI gating), in CI, with the stats done right. Pure standard library at the core, fully offline-testable, deterministic under a seed.
pip install llm-evalgate · GitHub · PyPI
A look-ahead leakage detector for LLM agents: does an agent use information it could not have had at the time? It scores the leakage rate of an agent's tool calls under an as-of-date constraint, graded with the point-in-time rigor of quantitative backtesting — survivorship, restatements, transaction cost. The discipline that keeps a trading backtest honest, generalized to agents.
Five small-model (Qwen2.5-3B) creature-agents in an emergent economy: an experiment in multi-agent dynamics under real economic pressure. Served on vLLM / Modal with a Gradio interface. Built for the Hugging Face build-small hackathon.
Also building, in the same spirit: finance-paper-replication — replicating quant research papers and publishing the honest results — and finance-research-factory, a multi-agent q-fin research pipeline with alpha-protection built in.
I write about AI engineering, evals, quantitative finance, and building systems you can actually trust.
- The Year I Tried to Beat the Market and Found the Edge Hiding in My Own Backyard — what a year of building a one-person quant research desk taught me about alpha, honesty, and rigor: pre-registration, survivorship-free backtests, and telling an edge apart from beta in a costume.
- I Audited My Own Eval Gate. It Was Failing Builds Five Times Too Often. — the multiple-comparisons bug hiding in most CI eval gates, with code and real numbers.
- More on Medium.
Python · LLMs · multi-agent systems · LLM-as-judge and evals · RAG · agent reliability · vLLM · Modal · Gradio · Hugging Face · bootstrap, hypothesis testing, and experimental design · pandas / NumPy
Open to conversations about hard problems in agent reliability and AI for finance.




