Lester A Leong LesterALeong

Lester Leong

AI engineer building reliable, evaluated agentic systems.

I bring quant-grade statistical rigor to LLM engineering: evals, error bars, and not shipping on noise. The discipline comes from running multi-agent systems where a wrong answer does not fail a unit test, it costs real money. PhD in Finance (ML time series).

I work at the intersection of AI engineering and quantitative finance: multi-agent orchestration, LLM evaluation and reliability, and production systems that have to be right, not just plausible.

Featured

llm-evalgate

Eval gates and reliability primitives for LLM pipelines, with a confidence interval on every metric and a statistically honest regression gate. The full eval surface (deterministic gates, calibrated LLM-as-judge, agentic-trace evals, multiple-comparisons-corrected CI gating), in CI, with the stats done right. Pure standard library at the core, fully offline-testable, deterministic under a seed.

pip install llm-evalgate · GitHub · PyPI

anachron

A look-ahead leakage detector for LLM agents: does an agent use information it could not have had at the time? It scores the leakage rate of an agent's tool calls under an as-of-date constraint, graded with the point-in-time rigor of quantitative backtesting — survivorship, restatements, transaction cost. The discipline that keeps a trading backtest honest, generalized to agents.

GitHub

Thousand Token Wood

Five small-model (Qwen2.5-3B) creature-agents in an emergent economy: an experiment in multi-agent dynamics under real economic pressure. Served on vLLM / Modal with a Gradio interface. Built for the Hugging Face build-small hackathon.

Live demo

Also building, in the same spirit: finance-paper-replication — replicating quant research papers and publishing the honest results — and finance-research-factory, a multi-agent q-fin research pipeline with alpha-protection built in.

Writing

I write about AI engineering, evals, quantitative finance, and building systems you can actually trust.

The Year I Tried to Beat the Market and Found the Edge Hiding in My Own Backyard — what a year of building a one-person quant research desk taught me about alpha, honesty, and rigor: pre-registration, survivorship-free backtests, and telling an edge apart from beta in a costume.
I Audited My Own Eval Gate. It Was Failing Builds Five Times Too Often. — the multiple-comparisons bug hiding in most CI eval gates, with code and real numbers.
More on Medium.

Stack

Python · LLMs · multi-agent systems · LLM-as-judge and evals · RAG · agent reliability · vLLM · Modal · Gradio · Hugging Face · bootstrap, hypothesis testing, and experimental design · pandas / NumPy

Open to conversations about hard problems in agent reliability and AI for finance.

LinkedIn: https://www.linkedin.com/in/lester-a-leong
X: https://x.com/RealLesterLeong
Medium: https://medium.com/gradient-growth

Provide feedback

Saved searches

Use saved searches to filter your results more quickly