feat: QED Math Environment with MCP tools and LLM-judge rubric #865
Open
rycerzes wants to merge 32 commits into
- add QEDMathAction and QEDMathObservation dataclasses in models.py - implement QEDMathEnv client with reset, step, get_problem, submit_proof
- main env class - mcp server tools - step & reset logic
- impl MathProofRubric - LLM Grading Logic - rubric config in env
- map problem data structure - support multiple types of problems
- wss handling - client methods
- `submit_proof` method accepts an optional `output_length_tokens` parameter for reward shaping.
- Introduced `remove_reasoning` function to strip reasoning traces from model output.
- Added `length_penalty` function to compute penalties for overlong sequences.
- Adjusted grading logic to apply discount factors and penalties based on token count.
- implement metrics aggregation
- fix dockerfile
- refer to huggingface#456
…solution handling - update tests for answer mode
- integrate with QED Math environment
- Fix MCP client parsing bug where reset/step observations were coerced into base Observation, dropping env-specific fields
- tests
Ship the QED-Nano v0 and v1 evaluator prompt templates verbatim alongside
the existing v2. v1 is the QED-Nano default (full 0-7 score range); v2
constrains scores to {0,1,6,7}; v0 additionally uses the reference solution.
…Nano
Bring the per-rollout reward signal in line with QED-Nano:
- Default prompt_name v2 -> v1 (matches QED-Nano base.yaml; full 0-7 range).
- Default reasoning_delimiters None -> ["</think>"] so only the post-think answer is graded, as upstream does.
- Forward grader sampling params (grader_temperature=1.0, optional grader_max_output_tokens) to the Responses/Chat calls, and capture real token usage from the provider response (heuristic fallback only when absent).
- Auto-detect proof vs answer mode from grading-schema presence, mirroring upstream's `if "schema" in problem` routing, when no explicit type is set.
- Add answer_reward_preset (pure_success default, base adds wrong/no_answer/unparsable penalties) keyed on verifier status; infra errors stay neutral.
- Align is_correct success threshold to upstream score==7 (default 7).
- Fix stale meta-pytorch/OpenEnv issue link in a docstring.
The repo moved from meta-pytorch/OpenEnv to huggingface/OpenEnv. Update the Docker base image (ghcr.io/huggingface/openenv-base), the openenv-core git dependency in pyproject.toml, and the matching uv.lock source entries.
Add tests for the new defaults and behavior: v1 default + v0/v1 prompt loading, default </think> reasoning stripping, proof/answer auto-detection from schema presence, answer reward presets (pure_success/base), success threshold == 7, grader sampling-param forwarding, and real token-usage capture.
Update the config tables and prose for the new defaults (v1 prompt, </think> stripping, grader sampling params, answer_reward_preset, success threshold 7, proof/answer auto-detection) and clarify token-usage metrics. Fix the broken QED-Nano source link (meta-pytorch -> CMU-AIRe) and add the generated docs/source/environments/qed_math.md stub.
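The reasoning stripping and token-based reward shaping described in the commit notes above can be illustrated with a rough sketch. This is not the PR's code: the helper names `strip_reasoning`, `overlong_penalty`, and `shaped_reward`, their signatures, the linear ramp inside the buffer zone, and the choice to return the text unchanged when no delimiter is found are all assumptions made for illustration.

```python
from typing import Sequence


def strip_reasoning(text: str, delimiters: Sequence[str] = ("</think>",)) -> str:
    """Keep only the text after the last occurrence of each delimiter.

    Assumption: if no delimiter is present, the text is returned unchanged.
    """
    for delimiter in delimiters:
        if delimiter in text:
            text = text.rsplit(delimiter, 1)[1]
    return text.strip()


def overlong_penalty(n_tokens: int, max_tokens: int, buffer_tokens: int) -> float:
    """Hypothetical buffer-zone penalty: 0 below the buffer, 1 at the limit.

    Inside the buffer zone the penalty rises linearly (an assumed shape).
    """
    buffer_start = max_tokens - buffer_tokens
    if n_tokens <= buffer_start:
        return 0.0
    if n_tokens >= max_tokens:
        return 1.0
    return (n_tokens - buffer_start) / buffer_tokens


def shaped_reward(
    base_reward: float,
    n_tokens: int,
    gamma: float,
    max_tokens: int,
    buffer_tokens: int,
) -> float:
    """Apply a per-token discount (gamma ** n_tokens), then subtract the penalty."""
    discounted = base_reward * gamma**n_tokens
    return discounted - overlong_penalty(n_tokens, max_tokens, buffer_tokens)


if __name__ == "__main__":
    answer = strip_reasoning("<think>scratch work</think> The proof follows.")
    print(answer)
    print(shaped_reward(1.0, 900, gamma=1.0, max_tokens=1000, buffer_tokens=200))
```

With `gamma=1.0` the discount is disabled and only the length penalty applies, which makes each term easy to check in isolation.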
Summary
Adds the QED Math Environment, a mathematical proof generation and evaluation env ported from QED-Nano. It deeply integrates both MCP tools (the sole agent API) and LLM-judge rubrics (structured 0–7 grading with normalized rewards), making it a reference implementation for the MCP + Rubrics features.
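As a rough illustration of the structured 0–7 grading with normalized rewards, the sketch below parses a `<score>N</score>` tag from grader output and maps it to a reward. The function names, the choice to take the last in-range tag, and the `score / 7` normalization are assumptions, not the rubric's confirmed behavior. The optional thresholding mirrors the "collapses 1–5 → 1" rule listed among this PR's features.

```python
import re
from typing import Optional

MAX_SCORE = 7
SCORE_RE = re.compile(r"<score>\s*(\d+)\s*</score>")


def parse_score(grader_output: str) -> Optional[int]:
    """Return the last in-range <score>N</score> value, or None if none is found."""
    for raw in reversed(SCORE_RE.findall(grader_output)):
        value = int(raw)
        if 0 <= value <= MAX_SCORE:
            return value
    return None


def threshold_score(score: int) -> int:
    """Optional thresholding: collapse partial-credit scores 1-5 to 1."""
    return 1 if 1 <= score <= 5 else score


def score_to_reward(score: int, use_threshold: bool = False) -> float:
    """Normalize a 0-7 score to [0, 1] (assumed normalization: score / 7)."""
    if use_threshold:
        score = threshold_score(score)
    return score / MAX_SCORE


if __name__ == "__main__":
    output = "The proof is nearly complete. <score>6</score>"
    score = parse_score(output)
    if score is not None:
        print(score, score_to_reward(score))
```

A caller would treat a `None` result as a parse failure and retry or record it, rather than mapping it to a reward of zero.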
What's included
Environment (`qed_math_env`)
- `QEDMathEnvironment`: extends `MCPEnvironment`; manages problem lifecycle, dataset loading, and proof submission
- `MathProofRubric`: extends `openenv.core.rubrics.base.Rubric`; LLM-judge grading via OpenAI-compatible endpoints with `<score>N</score>` parsing, retry logic, and optional score thresholding
- `QEDMathEnv` client: extends `MCPToolClient` with typed `ProblemObservation` / `ProofSubmissionObservation` models
- MCP tools: `get_problem`, `submit_proof`, `get_grading_guidelines`
Key features
- Answer mode: `math_verify` `\boxed{}` checking (no LLM call needed)
- Reward shaping: discount factor (`γ^tokens`), length penalty (buffer zone), optional score thresholding (collapses 1–5 → 1)
- Reasoning traces (delimited by `</think>`) removed before grading
- Metrics (`verifier/rollouts/*`, `verifier/failures/*`, latency, token counts) in observation metadata
Fidelity alignment with QED-Nano
This branch reconciles the port against the upstream `CMU-AIRe/QED-Nano` source (`rollouts.py`, `conf/base.yaml`, evaluator prompts):
- Evaluator prompts: ships `v0`/`v1` alongside `v2`; default `v1` (full 0–7 range, the QED-Nano default) instead of `v2` (which constrains scores to `{0,1,6,7}`).
- Reasoning delimiters default to `["</think>"]`, so only the post-think answer is graded, matching upstream.
- Grader sampling params (`temperature=1.0`, optional `max_output_tokens`) are forwarded, and real provider token usage is captured (heuristic only as a fallback).
- Proof vs answer mode is auto-detected from grading-schema presence (mirroring upstream's `if "schema" in problem`) when a row sets no explicit type.
- `answer_reward_preset` (`pure_success` default, `base` adds wrong/no_answer/unparsable penalties), keyed on verifier status.
- `is_correct` success threshold aligned to upstream `score == 7`.
- Links: QED-Nano source `meta-pytorch` → `CMU-AIRe`; build refs → `huggingface/OpenEnv`.
Testing
`tests/envs/test_qed_math_environment.py`: 142 unit tests pass (5 server-integration tests marked `@pytest.mark.integration`):
- `ListToolsAction` / `CallToolAction` for all 3 tools
- `grade()`, score→reward normalization, retry/failure paths
- `\boxed{}` via `math_verify` service
- `</think>` handling + fallback
- `</think>` default, grader sampling
- `pure_success` / `base` presets
- `score == 7` semantics
- `\boxed{}` wrapping
- `GradingResult` + submission payload
Lint (`ruff format` / `ruff check src tests`), `usort`, and `scripts/sync_env_docs.py --check` all pass.
CC: @burtenshaw
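For reviewers, the proof/answer routing and the answer reward presets described in this PR can be pictured with the sketch below. Only the `"schema" in problem` check, the preset names (`pure_success`, `base`), and the neutral treatment of infra errors come from the PR text. The `"type"` field name, the status strings, and every numeric penalty value are hypothetical placeholders.

```python
from typing import Any, Dict, Mapping

# Hypothetical reward tables; the penalty values are illustrative placeholders,
# not the values used by this PR or by QED-Nano.
ANSWER_REWARD_PRESETS: Dict[str, Dict[str, float]] = {
    "pure_success": {"correct": 1.0},
    "base": {
        "correct": 1.0,
        "wrong": -0.5,
        "no_answer": -1.0,
        "unparsable": -1.0,
    },
}


def detect_mode(problem: Mapping[str, Any]) -> str:
    """Pick proof or answer mode for a dataset row.

    An explicit type wins; otherwise the presence of a grading schema
    selects proof mode. The "type" key is an assumed field name.
    """
    explicit = problem.get("type")
    if explicit:
        return str(explicit)
    return "proof" if "schema" in problem else "answer"


def answer_reward(status: str, preset: str = "pure_success") -> float:
    """Map a verifier status to a reward.

    Any status missing from the preset (for example an infra error)
    stays neutral at 0.0.
    """
    return ANSWER_REWARD_PRESETS[preset].get(status, 0.0)


if __name__ == "__main__":
    print(detect_mode({"problem": "Prove that ...", "schema": "rubric text"}))
    print(detect_mode({"problem": "Compute 2 + 2", "answer": "4"}))
    print(answer_reward("wrong", preset="base"))
```

Keeping unknown statuses neutral means a grader outage does not push the policy in either direction.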