This workspace is for AI research and development tasks (reproduction, debugging, evaluation, training, and experiment planning).
Every non-trivial task MUST go through the skill pipeline via the Skill tool. Do NOT answer research questions, run experiments, or write papers by generating freeform text without first invoking the corresponding skill.
1. `research-workflow` (PRIMARY ORCHESTRATOR): invoke FIRST for any non-trivial task. All downstream skills below should be invoked FROM WITHIN `research-workflow`, not standalone.
2. `deep-research`: MANDATORY when the user message contains ANY research-intent keyword:
   - Chinese: 调研/研究/对比/综述/文献/证据/机制/根因/为什么/可行性/路线图/分析/探索
   - English: research/investigate/compare/survey/literature/evidence/mechanism/root-cause/why/feasibility/roadmap/analyze/explore
   - If ANY keyword matches → you MUST invoke `deep-research`. No exceptions. Skipping is a routing violation.
3. `experiment-execution`: when the user asks to run/launch/start/resume/monitor an experiment.
4. `research-plan`: when the user asks for a proposal, roadmap, ablation plan, or study design.
5. `paper-writing`: ONLY when the user explicitly asks to draft/write/revise a paper or section.
6. `project-context`: when environment setup or runtime fields are needed before execution.
7. `run-governor`: at run start to set mode + run_id.
8. `memory-manager`: bootstrap at run start, writeback at task end, trigger-based in between.
9. `human-checkpoint`: for safety risks, high-resource approvals, or hard blockers.
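The keyword rule for `deep-research` can be approximated with a simple matcher. This is a minimal sketch, not the actual routing implementation: the function name `has_research_intent` and the constant names are hypothetical, and literal matching is only an approximation of how the keywords might be detected. English keywords are matched as word prefixes, which errs toward over-triggering.

```python
import re

# Keyword lists copied from the routing rule; the constant names are hypothetical.
CHINESE_KEYWORDS = [
    "调研", "研究", "对比", "综述", "文献", "证据", "机制",
    "根因", "为什么", "可行性", "路线图", "分析", "探索",
]
ENGLISH_KEYWORDS = [
    "research", "investigate", "compare", "survey", "literature", "evidence",
    "mechanism", "root-cause", "why", "feasibility", "roadmap", "analyze", "explore",
]

# Leading word boundary only, so "researching" or "explored" also match.
_ENGLISH_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ENGLISH_KEYWORDS) + r")",
    re.IGNORECASE,
)


def has_research_intent(message: str) -> bool:
    """Return True when the message contains any research-intent keyword."""
    # Chinese text has no word boundaries, so plain substring checks are used.
    if any(keyword in message for keyword in CHINESE_KEYWORDS):
        return True
    return _ENGLISH_PATTERN.search(message) is not None
```

A router built on this sketch would invoke `deep-research` whenever the function returns `True`.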
Before producing any substantive response, you MUST run this mental checklist:
- Is this task non-trivial? → If yes, did I invoke `research-workflow`? If not, invoke it NOW.
- Does the user message contain any research-intent keyword from rule 2? → If yes, did I invoke `deep-research`? If not, invoke it NOW.
- Am I about to answer a research question with freeform text instead of skill output? → STOP. Invoke the skill first.
Over-triggering is acceptable. Under-triggering is a violation.
- Start each non-trivial research task with `run-governor`, but do not initialize `run_id` paths before explicit user confirmation of both `mode` and execution target (local|remote).
- Use `research-workflow` as the default orchestration loop.
- Use `memory-manager` in `experience-first` mode: reusable `procedure`/`episode`/`insight` retrieval comes before relying on `working` state alone.
- If you modify `memory-manager` or any Memory-related skill, or detect compaction markers in state/context files such as `Compact`, `压缩`, `Summary`, or similar summary/compression techniques, invoke `memory-manager` to read prior Memory before continuing so key context is not dropped.
- Trigger `human-checkpoint` using mode-aware policy, always for major safety risks and shared-memory publication.
- Use `experiment-execution` only for actual run execution, and keep ownership after launch for monitoring, diagnosis, recovery, and result collection.
- Use `project-context` to collect and persist per-project private runtime context before experiments or report/eval execution.
- Use `deep-research` as the default gateway for external search and deep external investigation, including early-stage project scoping when a user wants to write a research study or paper on a topic, unless the user is explicitly asking for a paper-writing deliverable right now.
- Use `research-plan` when the user asks for a proposal, roadmap, ablation/evaluation plan, study design, or pre-implementation research decomposition.
- After open-ended scoping in `deep-research`, hand off findings into `research-plan` by default; skip only if the user explicitly opts out.
- Use `paper-writing` only when the user explicitly asks for a paper-writing deliverable such as drafting or revising a paper, section, or rebuttal. Do not use it for topic scoping, literature investigation, feasibility analysis, experiment design, or experiment execution.
- Base conclusions on evidence only (command outputs, metrics, logs, and file diffs).
- Prefer small, reversible, verifiable steps over broad speculative changes.
- Follow `REPO_CONVENTIONS.md` for artifact placement and commit hygiene.
- If a run was initialized before confirmation, stop and run violation recovery: acknowledge, ask whether to keep/clean artifacts, and wait for explicit reconfirmation before continuing.
- Mandatory Visualization: Every report with quantitative results MUST include code-generated visualizations (matplotlib). Always generate figures when writing stage reports or final reports. If the report is complex, invoke `paper-writing` for polished formatting. Under-visualizing is a violation.
- For long-running work, do not treat launch as completion: persist an action record, enter watch mode, poll on a model-chosen cadence, and continue until success criteria, a true blocker, or a gated approval point is reached.
- Do not respond with the equivalent of "the job is running, come back later" unless the user explicitly requested fire-and-forget behavior.
- Interpret `full-auto` as an interruption policy, not a completion policy.
- If the user says things such as “keep iterating”, “do not stop”, “try many iterations”, “until target”, or gives explicit target metrics like `90%` or `100%`, compile that into `persistent-optimization` behavior.
- For persistent-optimization tasks, compile the user request into machine-checkable fields before execution:
  - `primary_target`
  - `promotion_gates`
  - `non_regression_guards`
  - `backup_policy`
  - `stop_allowed_only_if`
- Do not leave stopping conditions as prose only when they can be converted into measurable gates.
- If the user asks to preserve strong variants, snapshot best-so-far prompts/configs/code/results before higher-risk changes.
- `full-auto` plus explicit persistence means the agent keeps ownership until one of these is true:
  - compiled hard targets are met
  - a true hard blocker remains after reasonable recovery attempts
  - a major safety/resource gate requires approval
  - the user explicitly changes or stops the objective
- At the start of each non-trivial execution loop, refresh the compiled goal state and active promotion gate.
- `done` is allowed only when all compiled hard gates are satisfied with evidence.
- If `completion_policy=until-target-or-hard-blocker`, `done` is forbidden while the active promotion gate or hard target remains unmet.
- A single clean run, a partial fix, or one successful batch is not sufficient reason to stop.
- If the current promotion gate is met but the final target is not, promote to the next gate instead of stopping.
- If targets remain unmet and a safe next step exists, default to `iterate`.
- If repeated attempts plateau or regress materially, default to `replan`.
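The compiled fields and stop rules above can be sketched as a small data structure plus a decision function. This is an illustrative sketch under stated assumptions: only the five field names and the decision labels come from the rules. The class name `CompiledGoal`, the `active_gate` index, the default values, the threshold-dict representation, and the helper names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CompiledGoal:
    """Hypothetical container for the machine-checkable fields."""
    primary_target: dict                 # metric name -> required minimum, e.g. {"accuracy": 0.90}
    promotion_gates: list                # ordered threshold dicts, smallest evaluation first
    non_regression_guards: dict = field(default_factory=dict)  # metric name -> floor
    backup_policy: str = "snapshot-best-so-far"                 # assumed label
    stop_allowed_only_if: tuple = (                             # assumed labels
        "targets-met", "hard-blocker", "approval-required", "user-stop",
    )
    active_gate: int = 0                 # index of the gate currently being pursued


def meets(thresholds, metrics):
    """True when every thresholded metric is present and at or above its minimum."""
    return all(
        name in metrics and metrics[name] >= minimum
        for name, minimum in thresholds.items()
    )


def decide(goal, metrics, plateaued=False):
    """Pick the next step from measured metrics instead of prose judgment."""
    gates_remaining = goal.active_gate < len(goal.promotion_gates)
    if gates_remaining:
        # Meeting an intermediate gate promotes; it never ends the task.
        if meets(goal.promotion_gates[goal.active_gate], metrics):
            return "promote-to-next-gate"
    elif meets(goal.primary_target, metrics) and meets(goal.non_regression_guards, metrics):
        # done requires every hard gate to be satisfied with evidence.
        return "done"
    return "replan" if plateaued else "iterate"
```

Keeping the thresholds as data means the stop decision can be re-evaluated at the start of every loop from the latest metrics.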
- Apply the same ownership standard to short local edit-evaluate loops as to long-running jobs.
- For iterative optimization tasks, define an evaluation ladder before broadening scope:
  - baseline or previous-best reference
  - representative regression set
  - promotion gate for larger evaluation
  - final target evaluation
- Prefer broader representative sets over a few hand-picked cases.
- After each batch:
  - compare against baseline and best-so-far
  - inspect regressions, not only aggregate score
  - check non-regression guards
  - choose `iterate`, `replan`, or `promote-to-next-gate`
- Do not stop after a single iteration merely because execution completed cleanly.
- For prompts like “先用 30 个左右的题目集合测效果,再考虑上 100” (“first test on a set of about 30 problems, then consider scaling up to 100”), treat the smaller set as a required promotion gate rather than a suggestion.
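The post-batch review above can be sketched as a per-case comparison. This is a minimal sketch under assumptions: the function name `review_batch`, the `tolerance` parameter, and the representation of a run as a non-empty mapping from case id to score are hypothetical. It lists individual regressed cases so they are inspected alongside the aggregate score.

```python
def review_batch(current, baseline, best_so_far, tolerance=0.0):
    """Compare per-case scores (case id -> score) against two reference runs.

    Assumes every mapping is non-empty.
    """
    def mean(scores):
        return sum(scores.values()) / len(scores)

    def regressed(reference):
        # Cases that scored lower than in the reference run, beyond the tolerance.
        return sorted(
            case for case, score in current.items()
            if case in reference and score < reference[case] - tolerance
        )

    return {
        "delta_vs_baseline": mean(current) - mean(baseline),
        "delta_vs_best": mean(current) - mean(best_so_far),
        "regressed_vs_baseline": regressed(baseline),
        "regressed_vs_best": regressed(best_so_far),
    }
```

An improved aggregate with a non-empty regression list is a signal to inspect those cases before choosing the next step.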
- Classify an action as long-running when it is expected to exceed 5 minutes, launches async or remote work, is high-resource, or is likely to outlive the current model turn.
- Before waiting on a long-running action:
  - persist `actions/<action_id>.yaml`
  - record command, cwd, expected duration, poll interval, log path, success/failure signals, and resume step
  - update working state with the active `action_id`
- While the current session is active, use a watch loop:
  - model chooses sleep
  - poll the action
  - inspect `status`, `progress_changed`, `followup_action`, and recent logs
  - choose the next sleep or branch into diagnosis/result handling
- Allowed liveness states are `pending`, `running`, `stalled`, `failed`, `completed`, and `cancelled`.
- After every poll, keep ownership and branch immediately:
  - `continue-watch` or `wait-and-poll`
  - `collect-results`
  - `diagnose-stall`
  - `diagnose-failure`
  - `replan`
- At the start of every resumed turn, reconcile active actions before unrelated planning.
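The watch loop and branch rules above can be sketched as follows. This is an illustrative sketch, not the skill's implementation: the liveness states and branch labels come from the rules, while the names `Liveness`, `BRANCH`, and `watch`, the state-to-branch mapping (in particular `cancelled` leading to `replan`), and the `max_polls` budget are assumptions.

```python
import time
from enum import Enum


class Liveness(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    STALLED = "stalled"
    FAILED = "failed"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# Assumed mapping from liveness state to the branch taken after a poll.
BRANCH = {
    Liveness.PENDING: "wait-and-poll",
    Liveness.RUNNING: "continue-watch",
    Liveness.STALLED: "diagnose-stall",
    Liveness.FAILED: "diagnose-failure",
    Liveness.COMPLETED: "collect-results",
    Liveness.CANCELLED: "replan",
}


def watch(poll, choose_sleep, max_polls=100, sleep=time.sleep):
    """Poll until the action leaves the waiting states, then return the branch.

    poll() returns a dict with at least a "status" key;
    choose_sleep(attempt, report) returns the seconds to wait before the next poll.
    """
    report = None
    for attempt in range(max_polls):
        report = poll()
        branch = BRANCH[Liveness(report["status"])]
        if branch not in ("continue-watch", "wait-and-poll"):
            return branch, report
        sleep(choose_sleep(attempt, report))
    # Assumption: an exhausted poll budget is treated like a stall.
    return "diagnose-stall", report
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.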
- `memory-manager` is mandatory for non-trivial runs, but retrieval should center on reusable experience, not only `working` state.
- Mandatory per non-trivial run:
  - one bootstrap `retrieve/init-working` before planning or execution
  - one close-out writeback before task completion
- Mandatory per turn and per batch:
  - retrieve relevant memory on every new user turn
  - retrieve `procedure` before every execution batch
  - write a concise `working` delta after every execution batch
- Mandatory trigger-based retrieval:
  - retrieve `episode` on significant failure, repeated attempt, stalled job, or new error signature
  - retrieve `insight` during planning, replanning, contradiction handling, tradeoff analysis, or final answer shaping
  - retrieve `procedure` plus relevant `episode` before high-resource or irreversible actions
  - reread `working` during resume, compaction recovery, long-action reconciliation, and final handoff
- After long-action polls:
  - on `stalled` or `failed`, retrieve `procedure` plus relevant `episode` before the next fix attempt
  - on `completed`, retrieve `insight` when interpretation or next-step selection is needed
- If memory is skipped due to duplicate retrieval, freshness, or low yield, record `memory_skip_reason`.
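The trigger-based retrieval rules above can be expressed as a lookup table. This is a sketch under assumptions: the memory kinds and the `memory_skip_reason` field come from the rules, while the event names, the spelling of the skip reasons, and the function name `plan_retrieval` are hypothetical.

```python
# Hypothetical event names mapped to the memory kinds named in the rules.
RETRIEVAL_TRIGGERS = {
    "execution_batch": ("procedure",),
    "significant_failure": ("episode",),
    "repeated_attempt": ("episode",),
    "new_error_signature": ("episode",),
    "planning": ("insight",),
    "replanning": ("insight",),
    "high_resource_action": ("procedure", "episode"),
    "resume": ("working",),
    "poll_stalled": ("procedure", "episode"),
    "poll_failed": ("procedure", "episode"),
    "poll_completed": ("insight",),  # only when interpretation is needed
}

# Assumed spellings for the three skip reasons named in the rules.
SKIP_REASONS = {"duplicate-retrieval", "freshness", "low-yield"}


def plan_retrieval(event, skip_reason=None):
    """Return which memory kinds to retrieve, or record why retrieval was skipped."""
    if event not in RETRIEVAL_TRIGGERS:
        raise ValueError(f"unknown event: {event}")
    if skip_reason is not None:
        if skip_reason not in SKIP_REASONS:
            raise ValueError(f"unsupported skip reason: {skip_reason}")
        return {"retrieve": (), "memory_skip_reason": skip_reason}
    return {"retrieve": RETRIEVAL_TRIGGERS[event], "memory_skip_reason": None}
```

Rejecting unlisted skip reasons keeps a skip from being recorded without one of the stated justifications.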
- On every new user message, re-run skill routing before continuing prior stage actions.
- If the new message contains research-intent signals, `deep-research` MUST be activated even mid-run.
- Research-intent signals include (semantic match, Chinese or English):
  - 调研/研究/对比/综述/文献/证据/机制/根因/为什么/可行性/路线图
  - research/investigate/compare/survey/literature/evidence/mechanism/root-cause/why/feasibility/roadmap
- All external search for non-trivial research runs must route through `deep-research`; do not bypass it with ad hoc search.
- Every `deep-research` run must begin with a frontier-first scout before final depth selection.
- Default depth is `default-auditable`; `light` is a downgrade path only after scout and may not be the silent default.
- Do not claim deep-research completion without actual WebSearch calls and an auditable query trail.
- If skipping `deep-research`, emit `dr_skip_reason` with concrete evidence freshness info (source date / timestamp), not a generic statement.
- Cooldown for non-forced deep-research calls:
  - at most once per stage unless objective changed or new contradiction/high-impact uncertainty appears.
- When `experiment-execution` launches a long-running training, evaluation, benchmark, or inference job, it must enter watch mode by default.
- After each experiment poll:
  - if `running`, choose the next sleep interval and keep monitoring
  - if `completed`, inspect outputs, checkpoints, metrics, and artifacts immediately
  - if `stalled`, inspect evidence, retrieve memory, and attempt the smallest safe recovery or replan
  - if `failed`, diagnose immediately, retrieve memory, and attempt the smallest safe recovery
- Unknown execution errors should follow this branch:
  - local evidence triage
  - `procedure` and `episode` retrieval
  - targeted search
  - `deep-research` if still unresolved or freshness-sensitive
  - minimal fix validation
- Only allow fire-and-forget experiment behavior when the user clearly requested it.
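The escalation branch for unknown execution errors can be sketched as an ordered ladder. This is a minimal sketch under assumptions: the step order comes from the rules, while the function name `triage_unknown_error`, the step keys, the callable-based interface, and the choice to stop the early steps at the first candidate fix are hypothetical.

```python
def triage_unknown_error(steps, freshness_sensitive=False):
    """Run the escalation ladder and return the fix, its validation, and the trail.

    steps maps a step name to a callable. Search steps take no arguments and
    return a candidate fix or None; "minimal-fix-validation" takes the fix and
    returns a truthy value when the fix holds.
    """
    trail = []
    fix = None
    # Cheap, local steps first; stop at the first candidate fix (assumption).
    for name in ("local-evidence-triage", "procedure-and-episode-retrieval", "targeted-search"):
        trail.append(name)
        fix = steps[name]()
        if fix is not None:
            break
    # Escalate only if still unresolved or the problem is freshness-sensitive.
    if fix is None or freshness_sensitive:
        trail.append("deep-research")
        fix = steps["deep-research"]() or fix
    validated = False
    if fix is not None:
        trail.append("minimal-fix-validation")
        validated = bool(steps["minimal-fix-validation"](fix))
    return {"fix": fix, "validated": validated, "trail": trail}
```

The returned trail records which steps ran, which fits the requirement to base conclusions on evidence.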
- Activate `paper-writing` only when the user explicitly asks for a paper-writing output.
- Valid triggers include drafting or revising a paper, a named paper section, or rebuttal text.
- Do not activate `paper-writing` just because the request mentions papers, literature, comparisons, or related work if the actual need is still research, planning, or experiments.
- If the user has not explicitly asked for paper-writing output, prefer `deep-research`, `research-plan`, or `experiment-execution` according to the current stage.
- `.agents/skills/run-governor`
- `.agents/skills/research-workflow`
- `.agents/skills/research-plan`
- `.agents/skills/memory-manager`
- `.agents/skills/human-checkpoint`
- `.agents/skills/experiment-execution`
- `.agents/skills/deep-research`
- `.agents/skills/project-context`
- `.agents/skills/paper-writing`