feat: QED Math Environment with MCP tools and LLM-judge rubric by rycerzes · Pull Request #865 · huggingface/OpenEnv

rycerzes · 2026-06-25T16:02:31Z

Summary

Adds the QED Math Environment, a mathematical proof generation and evaluation env ported from QED-Nano. It deeply integrates both MCP tools (the sole agent API) and LLM-judge rubrics (structured 0–7 grading with normalized rewards), making it a reference implementation for the MCP + Rubrics features.

Supersedes #446 (closed). This PR is that branch rebased cleanly onto current main, plus a pass that aligns the per-rollout reward signal with QED-Nano's actual defaults (see Fidelity alignment below).

What's included

Environment (`qed_math_env`)

- `QEDMathEnvironment`: extends `MCPEnvironment`; manages problem lifecycle, dataset loading, and proof submission
- `MathProofRubric`: extends `openenv.core.rubrics.base.Rubric`; LLM-judge grading via OpenAI-compatible endpoints with `<score>N</score>` parsing, retry logic, and optional score thresholding
- `QEDMathEnv` client: extends `MCPToolClient` with typed `ProblemObservation` / `ProofSubmissionObservation` models
- 3 MCP tools: `get_problem`, `submit_proof`, `get_grading_guidelines`
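
For readers unfamiliar with the `<score>N</score>` convention, the parsing and 0–7 normalization step can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the helper names `parse_score` and `normalize_reward` are hypothetical, and taking the last in-range tag is an assumption about how ambiguous judge output might be handled. The actual `MathProofRubric` additionally covers retries and thresholding.

```python
import re
from typing import Optional

# Hypothetical helper names; the PR's actual implementation may differ.
_SCORE_RE = re.compile(r"<score>\s*(\d+)\s*</score>")
MAX_SCORE = 7  # the rubric grades on a 0-7 scale


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last in-range <score>N</score> value, or None if absent."""
    for raw in reversed(_SCORE_RE.findall(judge_output)):
        value = int(raw)
        if 0 <= value <= MAX_SCORE:
            return value
    return None  # caller would typically retry or record a grading failure


def normalize_reward(score: int) -> float:
    """Map a 0-7 score onto a [0, 1] reward."""
    return score / MAX_SCORE
```

Returning `None` rather than raising keeps the retry decision with the caller, which is one plausible way to feed the retry/failure paths listed under Testing.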

Key features

- Flexible dataset loading: local JSONL/JSON, Hugging Face Hub, or built-in bootstrap problems
- Answer-mode verification: process-pool `math_verify` `\boxed{}` checking (no LLM call needed)
- Reward shaping: discount factor (γ^tokens), length penalty (buffer zone), optional score thresholding (collapses 1–5 → 1)
- Reasoning stripping: configurable delimiters (e.g. `</think>`) removed before grading
- Multi-step problems: configurable max attempts with per-attempt feedback
- Verifier metrics: QED-Nano-compatible metrics (`verifier/rollouts/*`, `verifier/failures/*`, latency, token counts) in observation metadata
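
To make the reward-shaping bullet concrete, here is a hedged sketch of how the three pieces could compose. Only the ideas (γ^tokens discount, buffer-zone length penalty, 1–5 → 1 thresholding) come from this description; the function signatures, the linear ramp inside the buffer zone, and the additive combination are assumptions made for illustration, and `shaped_reward` is a hypothetical name.

```python
from typing import Optional


def threshold_score(score: int) -> int:
    """Collapse partial-credit scores 1-5 to 1; leave 0, 6 and 7 unchanged."""
    return 1 if 1 <= score <= 5 else score


def length_penalty(num_tokens: int, max_tokens: int, buffer_tokens: int) -> float:
    """Assumed soft penalty: 0 before the buffer zone, then a linear ramp to -1."""
    start = max_tokens - buffer_tokens
    if num_tokens <= start:
        return 0.0
    if num_tokens >= max_tokens:
        return -1.0
    return -(num_tokens - start) / buffer_tokens


def shaped_reward(
    score: int,
    num_tokens: int,
    *,
    gamma: float = 1.0,
    max_tokens: Optional[int] = None,
    buffer_tokens: int = 0,
    use_threshold: bool = False,
) -> float:
    """Normalize a 0-7 score, discount by gamma**tokens, add the length penalty."""
    if use_threshold:
        score = threshold_score(score)
    reward = (score / 7) * gamma**num_tokens
    # The penalty only applies when a length budget and buffer are configured.
    if max_tokens is not None and buffer_tokens > 0:
        reward += length_penalty(num_tokens, max_tokens, buffer_tokens)
    return reward
```

With `gamma=1.0` and no `max_tokens`, the shaping is a no-op and the reward reduces to `score / 7`.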

Fidelity alignment with QED-Nano

This branch reconciles the port against the upstream CMU-AIRe/QED-Nano source (rollouts.py, conf/base.yaml, evaluator prompts):

- Evaluator prompts: ship v0/v1 alongside v2; default v1 (full 0–7 range, the QED-Nano default) instead of v2 (which constrains scores to {0,1,6,7}).
- Reasoning stripping now defaults to `["</think>"]`, so only the post-think answer is graded, matching upstream.
- Grader sampling params (`temperature=1.0`, optional `max_output_tokens`) are forwarded, and real provider token usage is captured (heuristic only as a fallback).
- Proof vs. answer routing is auto-detected from grading-schema presence (upstream's `if "schema" in problem`) when a row sets no explicit type.
- Answer-mode rewards are selectable via `answer_reward_preset` (`pure_success` default, `base` adds wrong/no_answer/unparsable penalties), keyed on verifier status.
- `is_correct` success threshold aligned to upstream `score == 7`.
- Stale refs fixed: source link meta-pytorch → CMU-AIRe; build refs → huggingface/OpenEnv.
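
The routing and preset items above can be illustrated with a small sketch. The `"schema"` check mirrors the upstream condition quoted above; everything else is assumed for illustration: the `"type"` field name, the `"proof"`/`"answer"` values, the `"correct"` status label, and in particular the penalty magnitudes, which are placeholders rather than the PR's actual numbers.

```python
from typing import Any, Dict, Mapping


def detect_mode(problem: Mapping[str, Any]) -> str:
    """Use an explicit type if the row sets one, else route on schema presence."""
    explicit = problem.get("type")  # field name is an assumption
    if explicit in ("proof", "answer"):
        return explicit
    return "proof" if "schema" in problem else "answer"


# Placeholder values only; the real presets live in the environment config.
ANSWER_REWARD_PRESETS: Dict[str, Dict[str, float]] = {
    "pure_success": {"correct": 1.0},
    "base": {"correct": 1.0, "wrong": -0.5, "no_answer": -1.0, "unparsable": -1.0},
}


def answer_reward(status: str, preset: str = "pure_success") -> float:
    """Look up the reward for a verifier status; unknown statuses stay neutral."""
    return ANSWER_REWARD_PRESETS[preset].get(status, 0.0)
```

Defaulting unknown statuses to `0.0` is one way to keep infrastructure errors neutral, so a verifier timeout neither rewards nor penalizes the policy.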

Out of scope (by design): QED-Nano's Reasoning-Cache (RC) training loop — the iterative summarize→refine cycle, separate summarization model, and stream batching — is training orchestration, not an environment concern (and would conflict with the "controls only for orchestration" + dual-API invariants). The env already provides what RC needs: grading against original_problem and reasoning stripping.

Testing

tests/envs/test_qed_math_environment.py — 142 unit tests pass (5 server-integration tests marked @pytest.mark.integration):

| Area | Coverage |
| --- | --- |
| MCP tools | `ListToolsAction` / `CallToolAction` for all 3 tools |
| Rubric grading | mocked `grade()`, score→reward normalization, retry/failure paths |
| Answer verification | correct/wrong/missing `\boxed{}` via `math_verify` service |
| Reward shaping | discount factor, length penalty, score thresholding |
| Reasoning stripping | `</think>` handling + fallback |
| Config defaults | v1 default, v0/v1 prompt loading, `</think>` default, grader sampling |
| Mode routing | proof/answer auto-detection from schema presence |
| Answer rewards | `pure_success` / `base` presets |
| Success threshold | `score == 7` semantics |
| Token usage | real-usage capture with heuristic fallback |
| Dataset loading | local JSONL/JSON, problem-ID selection, `\boxed{}` wrapping |
| Multi-step / original_problem | attempt tracking + RC-stream grading target |
| Verifier metrics | `GradingResult` + submission payload |

PYTHONPATH=src:envs uv run pytest tests/envs/test_qed_math_environment.py -v

Lint (ruff format / ruff check src tests), usort, and scripts/sync_env_docs.py --check all pass.

CC: @burtenshaw

- `submit_proof` method accepts an optional `output_length_tokens` parameter for reward shaping.
- Introduced `remove_reasoning` function to strip reasoning traces from model output.
- Added `length_penalty` function to compute penalties for overlong sequences.
- Adjusted grading logic to apply discount factors and penalties based on token count.
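
As a rough picture of what a `remove_reasoning` helper might do, the sketch below keeps only the text after the last configured delimiter. The function name comes from the commit message, but the signature, the choice of the last occurrence, and the pass-through fallback when no delimiter is found are all assumptions.

```python
from typing import Optional, Sequence


def remove_reasoning(
    text: str, delimiters: Optional[Sequence[str]] = ("</think>",)
) -> str:
    """Return the text after the last delimiter, or the input if none matches."""
    if not delimiters:
        return text  # stripping disabled
    cut = -1
    for delimiter in delimiters:
        idx = text.rfind(delimiter)
        if idx != -1:
            # Keep the latest cut point across all configured delimiters.
            cut = max(cut, idx + len(delimiter))
    if cut == -1:
        return text  # assumed fallback: grade the full output unchanged
    return text[cut:].lstrip()
```

For example, `remove_reasoning("<think>scratch work</think>\nFinal proof.")` yields `"Final proof."`, so the grader never sees the scratch work.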

…Nano

Bring the per-rollout reward signal in line with QED-Nano:

- Default `prompt_name` v2 -> v1 (matches QED-Nano base.yaml; full 0-7 range).
- Default `reasoning_delimiters` None -> `["</think>"]` so only the post-think answer is graded, as upstream does.
- Forward grader sampling params (`grader_temperature=1.0`, optional `grader_max_output_tokens`) to the Responses/Chat calls, and capture real token usage from the provider response (heuristic fallback only when absent).
- Auto-detect proof vs answer mode from grading-schema presence, mirroring upstream's `if "schema" in problem` routing, when no explicit type is set.
- Add `answer_reward_preset` (`pure_success` default, `base` adds wrong/no_answer/unparsable penalties) keyed on verifier status; infra errors stay neutral.
- Align `is_correct` success threshold to upstream score==7 (default 7).
- Fix stale meta-pytorch/OpenEnv issue link in a docstring.
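
The token-usage item could look roughly like the sketch below. The `output_tokens` and `completion_tokens` keys are the usage fields reported by OpenAI's Responses and Chat Completions APIs respectively; the helper name `count_output_tokens` and the four-characters-per-token estimate are assumptions, since the PR does not spell out its heuristic here.

```python
from typing import Any, Mapping, Optional


def count_output_tokens(usage: Optional[Mapping[str, Any]], text: str) -> int:
    """Prefer provider-reported usage; fall back to a rough length estimate."""
    if usage:
        # Responses API reports output_tokens; Chat Completions reports
        # completion_tokens.
        for key in ("output_tokens", "completion_tokens"):
            value = usage.get(key)
            if isinstance(value, int):
                return value
    # Assumed heuristic: about four characters per token, at least one token.
    return max(1, len(text) // 4)
```

Keeping the fallback in one place makes it easy to tell in metrics which counts were measured and which were estimated, should a flag be added later.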

Add tests for the new defaults and behavior: v1 default + v0/v1 prompt loading, default </think> reasoning stripping, proof/answer auto-detection from schema presence, answer reward presets (pure_success/base), success threshold == 7, grader sampling-param forwarding, and real token-usage capture.

Update the config tables and prose for the new defaults (v1 prompt, </think> stripping, grader sampling params, answer_reward_preset, success threshold 7, proof/answer auto-detection) and clarify token-usage metrics. Fix the broken QED-Nano source link (meta-pytorch -> CMU-AIRe) and add the generated docs/source/environments/qed_math.md stub.

rycerzes added 30 commits June 25, 2026 20:29

initial qed-math OpenEnv scaffold with models and client

c8795ed

- add QEDMathAction and QEDMathObservation dataclasses in models.py - implement QEDMathEnv client with reset, step, get_problem, submit_proof

implement MCP server tools and QEDMath environment

2c9d1b0

- main env class - mcp server tools - step & reset logic

rubric implementation

a9fc15f

- impl MathProofRubric - LLM Grading Logic - rubric config in env

data pipeline integration

d3b38c1

- map problem data structure - support multiple types of problems

grading prompts

15112a0

API & Client

98970f1

- wss handling - client methods

fix deps

dfb4ff3

fix eval prompt

58650c2

original_problem field for grading in QED-Nano RC stream

89609fe

add verifier metrics to grading results

8d66115

- implement metrics aggregation

docs

0601674

dockerfile

tests for QED Math Env

365935e

add QED Math inference example

2440eca

- fix dockerfile

improve async step handling

7c1b18b

- refer to huggingface#456

increase timeout for LLM judge calls and adjust WebSocket ping settings

603e051

add feedback output for incorrect answers in QED Math inference

fc1dbc8

refactor get_problem method to include metadata and adjust reference …

220d961

…solution handling - update tests for answer mode

refactor math expression parsing and simplify async calls in tests

56aacd6

impl process-based math verification service

0ef8d55

- integrate with QED Math environment

gold answer caching and request admission control

23ae83e

- tests

worker pool restart and health probe reporting

51d7091

preserve custom observation fields during parsing

4756ff2

- Fix MCP client parsing bug where reset/step observations were coerced into base Observation, dropping env-specific fields - tests

metrics tracking for MathVerifierService and QEDMathEnvironment

b0ede2d

update env docs

662bc05

refactor: remove output_length_tokens param from submit_proof and rel…

7914510

…ated methods

chore(lint): usort + ruff format

04348b7

feat(qed_math): add upstream v0/v1 evaluator prompts

5e2d4e2

Ship the QED-Nano v0 and v1 evaluator prompt templates verbatim alongside the existing v2. v1 is the QED-Nano default (full 0-7 score range); v2 constrains scores to {0,1,6,7}; v0 additionally uses the reference solution.

fix(qed_math): point build refs at canonical huggingface/OpenEnv

1c923bf

The repo moved from meta-pytorch/OpenEnv to huggingface/OpenEnv. Update the Docker base image (ghcr.io/huggingface/openenv-base), the openenv-core git dependency in pyproject.toml, and the matching uv.lock source entries.

rycerzes added 2 commits June 25, 2026 21:31

rycerzes wants to merge 32 commits into huggingface:main from rycerzes:feat/qed-math-env

Environment (`qed_math_env`)