feat: QED Math Environment with MCP tools and LLM-judge rubric #865

Open
rycerzes wants to merge 32 commits into huggingface:main from rycerzes:feat/qed-math-env

rycerzes (Contributor) commented Jun 25, 2026

Summary

Adds the QED Math Environment, a mathematical proof generation and evaluation env ported from QED-Nano. It deeply integrates both MCP tools (the sole agent API) and LLM-judge rubrics (structured 0–7 grading with normalized rewards), making it a reference implementation for the MCP + Rubrics features.

Supersedes #446 (closed). This PR is that branch rebased cleanly onto current main, plus a pass that aligns the per-rollout reward signal with QED-Nano's actual defaults (see Fidelity alignment below).

What's included

Environment (qed_math_env)

  • QEDMathEnvironment — extends MCPEnvironment; manages problem lifecycle, dataset loading, and proof submission
  • MathProofRubric — extends openenv.core.rubrics.base.Rubric; LLM-judge grading via OpenAI-compatible endpoints with <score>N</score> parsing, retry logic, and optional score thresholding
  • QEDMathEnv client — extends MCPToolClient with typed ProblemObservation / ProofSubmissionObservation models
  • 3 MCP tools: get_problem, submit_proof, get_grading_guidelines
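
The `<score>N</score>` parsing and score-to-reward normalization described for MathProofRubric might look roughly like the sketch below. The helper names, the choice to read the last tag in the judge output, and the treatment of out-of-range values as failures are assumptions for illustration, not the PR's actual code.

```python
import re
from typing import Optional

# Assumed tag format, taken from the PR description: <score>N</score>
SCORE_RE = re.compile(r"<score>\s*(\d+)\s*</score>")
MAX_SCORE = 7  # grades are on a 0-7 scale


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last in-range score tag, or None if parsing fails."""
    matches = SCORE_RE.findall(judge_output)
    if not matches:
        return None  # a caller could retry the judge call here
    score = int(matches[-1])  # assumption: the final tag is the verdict
    return score if 0 <= score <= MAX_SCORE else None


def normalize(score: int) -> float:
    """Map a 0-7 grade onto a reward in [0, 1]."""
    return score / MAX_SCORE
```

A `None` result is the natural trigger for the retry path the rubric mentions; how many retries are attempted is not specified here.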

Key features

  • Flexible dataset loading: local JSONL/JSON, Hugging Face Hub, or built-in bootstrap problems
  • Answer-mode verification: process-pool math_verify \boxed{} checking (no LLM call needed)
  • Reward shaping: discount factor (γ^tokens), length penalty (buffer zone), optional score thresholding (collapses 1–5 → 1)
  • Reasoning stripping: configurable delimiters (e.g. </think>) removed before grading
  • Multi-step problems: configurable max attempts with per-attempt feedback
  • Verifier metrics: QED-Nano-compatible metrics (verifier/rollouts/*, verifier/failures/*, latency, token counts) in observation metadata
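
As a rough illustration of how the three reward-shaping pieces listed above could compose, here is a minimal sketch. The function names, parameter names, the order of operations, and the linear shape of the buffer-zone penalty are all assumptions; the environment's real formulas may differ.

```python
from typing import Optional


def threshold_score(score: int) -> int:
    """Collapse partial-credit grades 1-5 to 1; leave 0, 6 and 7 unchanged."""
    return 1 if 1 <= score <= 5 else score


def length_penalty(max_tokens: int, n_tokens: int, buffer_tokens: int) -> float:
    """Assumed linear penalty: 0 before the buffer zone, down to -1 at the cap."""
    start = max_tokens - buffer_tokens
    if buffer_tokens <= 0 or n_tokens <= start:
        return 0.0
    over = min(n_tokens, max_tokens) - start
    return -over / buffer_tokens


def shaped_reward(
    score: int,
    n_tokens: int,
    *,
    gamma: float = 1.0,
    max_tokens: Optional[int] = None,
    buffer_tokens: int = 0,
    use_threshold: bool = False,
) -> float:
    """Normalize a 0-7 grade, then apply discount and length penalty."""
    if use_threshold:
        score = threshold_score(score)
    reward = score / 7
    reward *= gamma ** n_tokens  # discount factor, gamma^tokens
    if max_tokens is not None:
        reward += length_penalty(max_tokens, n_tokens, buffer_tokens)
    return reward
```

With `gamma=1.0` and no `max_tokens`, the shaped reward reduces to the plain normalized grade, so shaping stays opt-in in this sketch.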

Fidelity alignment with QED-Nano

This branch reconciles the port against the upstream CMU-AIRe/QED-Nano source (rollouts.py, conf/base.yaml, evaluator prompts):

  • Evaluator prompts: ship v0/v1 alongside v2; default v1 (full 0–7 range, the QED-Nano default) instead of v2 (which constrains scores to {0,1,6,7}).
  • Reasoning stripping now defaults to ["</think>"], so only the post-think answer is graded — matching upstream.
  • Grader sampling params (temperature=1.0, optional max_output_tokens) are forwarded, and real provider token usage is captured (heuristic only as a fallback).
  • Proof vs. answer routing is auto-detected from grading-schema presence (upstream's if "schema" in problem) when a row sets no explicit type.
  • Answer-mode rewards are selectable via answer_reward_preset (pure_success default, base adds wrong/no_answer/unparsable penalties), keyed on verifier status.
  • is_correct success threshold aligned to upstream score == 7.
  • Stale refs fixed: source link meta-pytorch → CMU-AIRe; build refs → huggingface/OpenEnv.
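
The mode routing and answer-reward presets from the list above can be sketched as follows. The `"schema"` key check mirrors the upstream `if "schema" in problem` test quoted in the description; the `"type"` field name, the status strings, and every numeric penalty value are placeholders chosen for illustration, not values taken from QED-Nano or this PR.

```python
from typing import Any, Dict, Mapping


def detect_mode(problem: Mapping[str, Any]) -> str:
    """Use an explicit type if the row sets one, else route on schema presence."""
    explicit = problem.get("type")  # hypothetical field name
    if explicit:
        return explicit
    return "proof" if "schema" in problem else "answer"


# Placeholder reward tables; the real penalty magnitudes are not stated here.
ANSWER_REWARD_PRESETS: Dict[str, Dict[str, float]] = {
    "pure_success": {"correct": 1.0},
    "base": {
        "correct": 1.0,
        "wrong": -0.5,       # hypothetical value
        "no_answer": -1.0,   # hypothetical value
        "unparsable": -1.0,  # hypothetical value
    },
}


def answer_reward(status: str, preset: str = "pure_success") -> float:
    """Look up the reward for a verifier status; unknown statuses stay neutral."""
    return ANSWER_REWARD_PRESETS[preset].get(status, 0.0)
```

Defaulting unlisted statuses to 0.0 is one way to keep infrastructure errors neutral, as the commit message for this change describes.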

Out of scope (by design): QED-Nano's Reasoning-Cache (RC) training loop — the iterative summarize→refine cycle, separate summarization model, and stream batching — is training orchestration, not an environment concern (and would conflict with the "controls only for orchestration" + dual-API invariants). The env already provides what RC needs: grading against original_problem and reasoning stripping.
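
For reference, reasoning stripping with the default `</think>` delimiter could be sketched like this. The commit log names a `remove_reasoning` function, but the signature below, the choice to split on the last occurrence, and the pass-through fallback when no delimiter is found are assumptions.

```python
from typing import Sequence


def remove_reasoning(text: str, delimiters: Sequence[str] = ("</think>",)) -> str:
    """Keep only the text after the last reasoning delimiter, if one is present."""
    for delimiter in delimiters:
        if delimiter in text:
            # Everything after the final delimiter is treated as the graded answer.
            return text.rsplit(delimiter, 1)[1].strip()
    # Assumed fallback: no delimiter found, so grade the text unchanged.
    return text
```

Under these assumptions, `remove_reasoning("<think>scratch work</think>Final proof.")` yields `"Final proof."`, so the judge never sees the scratch work.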

Testing

tests/envs/test_qed_math_environment.py: 142 unit tests pass (5 server-integration tests marked @pytest.mark.integration):

| Area | Coverage |
| --- | --- |
| MCP tools | ListToolsAction / CallToolAction for all 3 tools |
| Rubric grading | mocked grade(), score→reward normalization, retry/failure paths |
| Answer verification | correct/wrong/missing \boxed{} via math_verify service |
| Reward shaping | discount factor, length penalty, score thresholding |
| Reasoning stripping | </think> handling + fallback |
| Config defaults | v1 default, v0/v1 prompt loading, </think> default, grader sampling |
| Mode routing | proof/answer auto-detection from schema presence |
| Answer rewards | pure_success / base presets |
| Success threshold | score == 7 semantics |
| Token usage | real-usage capture with heuristic fallback |
| Dataset loading | local JSONL/JSON, problem-ID selection, \boxed{} wrapping |
| Multi-step / original_problem | attempt tracking + RC-stream grading target |
| Verifier metrics | GradingResult + submission payload |

PYTHONPATH=src:envs uv run pytest tests/envs/test_qed_math_environment.py -v

Lint (ruff format / ruff check src tests), usort, and scripts/sync_env_docs.py --check all pass.


CC: @burtenshaw

rycerzes added 30 commits June 25, 2026 20:29
- add QEDMathAction and QEDMathObservation dataclasses in models.py
- implement QEDMathEnv client with reset, step, get_problem,
submit_proof
- main env class
- mcp server tools
- step & reset logic
- impl MathProofRubric
- LLM Grading Logic
- rubric config in env
- map problem data structure
- support multiple types of problems
- wss handling
- client methods
- `submit_proof` method accepts an optional `output_length_tokens` parameter for reward shaping.
- Introduced `remove_reasoning` function to strip reasoning traces from model output.
- Added `length_penalty` function to compute penalties for overlong sequences.
- Adjusted grading logic to apply discount factors and penalties based on token count.
- implement metrics aggregation
dockerfile
…solution handling

- update tests for answer mode
- integrate with QED Math environment
- Fix MCP client parsing bug where reset/step observations were coerced
into base Observation, dropping env-specific fields
- tests
Ship the QED-Nano v0 and v1 evaluator prompt templates verbatim alongside
the existing v2. v1 is the QED-Nano default (full 0-7 score range); v2
constrains scores to {0,1,6,7}; v0 additionally uses the reference solution.
…Nano

Bring the per-rollout reward signal in line with QED-Nano:

- Default prompt_name v2 -> v1 (matches QED-Nano base.yaml; full 0-7 range).
- Default reasoning_delimiters None -> ["</think>"] so only the post-think
  answer is graded, as upstream does.
- Forward grader sampling params (grader_temperature=1.0, optional
  grader_max_output_tokens) to the Responses/Chat calls, and capture real
  token usage from the provider response (heuristic fallback only when absent).
- Auto-detect proof vs answer mode from grading-schema presence, mirroring
  upstream's `if "schema" in problem` routing, when no explicit type is set.
- Add answer_reward_preset (pure_success default, base adds wrong/no_answer/
  unparsable penalties) keyed on verifier status; infra errors stay neutral.
- Align is_correct success threshold to upstream score==7 (default 7).
- Fix stale meta-pytorch/OpenEnv issue link in a docstring.
The repo moved from meta-pytorch/OpenEnv to huggingface/OpenEnv. Update the
Docker base image (ghcr.io/huggingface/openenv-base), the openenv-core git
dependency in pyproject.toml, and the matching uv.lock source entries.
rycerzes added 2 commits June 25, 2026 21:31
Add tests for the new defaults and behavior: v1 default + v0/v1 prompt
loading, default </think> reasoning stripping, proof/answer auto-detection
from schema presence, answer reward presets (pure_success/base), success
threshold == 7, grader sampling-param forwarding, and real token-usage
capture.
Update the config tables and prose for the new defaults (v1 prompt, </think>
stripping, grader sampling params, answer_reward_preset, success threshold 7,
proof/answer auto-detection) and clarify token-usage metrics. Fix the broken
QED-Nano source link (meta-pytorch -> CMU-AIRe) and add the generated
docs/source/environments/qed_math.md stub.