feat(adapters): add token ignition evaluator suite #3

Open
breezeFur wants to merge 1 commit into Protocol-zero-0:main from breezeFur:clawoss/evaluator-suite-clean

Conversation

@breezeFur

What this implements

  • Adds the Token-Ignition evaluator case suite: adapters/token_ignition/evaluator_cases.json
  • Adds an offline evaluation script: adapters/token_ignition/evaluate_cases.py
  • Adds an example score report that K can consume: adapters/token_ignition/example_score_report.json
  • Adds the ClawOSS Evolution Kernel adapter spec: docs/clawoss-evolution-adapter.md

The evaluator cases cover the protocol points required by the issue:

  • R7 ablation
  • baseline log
  • evolution axes
  • long-horizon scaffold
  • reproducible micro-run

The cases cover the sample types required by the issue:

  • prompt-only wrapper
  • non-reproducible demo
  • fake ablation
  • hardcoded benchmark
  • cherry-picked logs
  • endpoint prompt injection
  • over-complex agent swarm
  • baseline better than scaffold
  • real but rough high-potential system
  • strong minimal evolution system
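
To make the sample types concrete, here is a minimal sketch of how a single case might be represented and sanity-checked. The field names (`case_id`, `sample_type`, `expected`) and the `accept`/`reject` labels are assumptions for illustration; they are not taken from the actual schema of evaluator_cases.json.

```python
from dataclasses import dataclass

# Sample types as listed in this PR description.
SAMPLE_TYPES = {
    "prompt-only wrapper",
    "non-reproducible demo",
    "fake ablation",
    "hardcoded benchmark",
    "cherry-picked logs",
    "endpoint prompt injection",
    "over-complex agent swarm",
    "baseline better than scaffold",
    "real but rough high-potential system",
    "strong minimal evolution system",
}

# Hypothetical label set; the real suite may use different values.
EXPECTED_LABELS = {"accept", "reject"}


@dataclass(frozen=True)
class Case:
    case_id: str      # hypothetical identifier
    sample_type: str  # one of SAMPLE_TYPES
    expected: str     # hypothetical golden label


def validate_case(case):
    """Return a list of problems found in one case (empty if it looks fine)."""
    errors = []
    if case.sample_type not in SAMPLE_TYPES:
        errors.append(f"{case.case_id}: unknown sample_type {case.sample_type!r}")
    if case.expected not in EXPECTED_LABELS:
        errors.append(f"{case.case_id}: unknown expected label {case.expected!r}")
    return errors
```

A check like this could run before scoring, so that a typo in a hand-written case fails loudly instead of silently skewing the metrics.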

The ClawOSS adapter spec defines:

  • evidence schema
  • score report schema
  • hard reject rules
  • allowed mutation scope
  • forbidden mutation scope
  • pauseAgent=true / budget exhausted take priority over the heartbeat work loop
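
The priority rule in the last item can be sketched as a guard that runs before any heartbeat work. This is only an illustration of the ordering: `AgentState`, `budget_remaining`, and the returned action names are hypothetical and do not come from the adapter spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentState:
    pause_agent: bool        # stands in for the spec's pauseAgent flag
    budget_remaining: float  # hypothetical budget counter


def next_action(state: AgentState) -> str:
    """Return the action an agent loop should take on this tick."""
    # Stop conditions are checked first, so they always win over heartbeat work.
    if state.pause_agent:
        return "pause"
    if state.budget_remaining <= 0:
        return "stop_budget_exhausted"
    return "run_heartbeat"
```

Checking the stop conditions at the top of every tick means a paused or out-of-budget agent never starts new work, even if a heartbeat is due.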

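What this does not implement

  • No full red-team platform
  • No automatic case generator
  • No ClawOSS automatic evolution
  • No major dashboard changes
  • No multi-model consensus refactor
  • No production deployment
  • No frontend visual redesign

How to verify

Run: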

python3 adapters/token_ignition/evaluate_cases.py \
  --cases adapters/token_ignition/evaluator_cases.json \
  --output /tmp/report.json

Inspect the report:

jq '{hard_gates_passed, recommendation, metrics, failures, risks}' /tmp/report.json

Verified locally:

  • /tmp/report.json contains hard_gates_passed
  • /tmp/report.json contains recommendation
  • /tmp/report.json contains metrics
  • /tmp/report.json contains failures
  • /tmp/report.json contains risks
  • evaluate_cases.py can output the JSON report directly
  • python3 -m py_compile adapters/token_ignition/evaluate_cases.py passes
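
The key checks above can also be scripted rather than inspected by eye. The sketch below only tests that the five top-level keys named in this PR are present; the helper names `load_report` and `missing_report_keys` are made up for this example and are not part of evaluate_cases.py.

```python
import json

# Top-level keys this PR says the report contains.
REQUIRED_KEYS = ("hard_gates_passed", "recommendation", "metrics", "failures", "risks")


def load_report(path):
    """Read a JSON report from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def missing_report_keys(report):
    """Return the required keys that are absent from the report dict."""
    return [key for key in REQUIRED_KEYS if key not in report]
```

After running the evaluator, `missing_report_keys(load_report("/tmp/report.json"))` should return an empty list if the report has the expected shape.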

Actual verification results

| Item | Result |
| --- | --- |
| hard_gates_passed | true |
| recommendation | accept |
| total cases | 22 |
| adversarial cases | 10 |
| real-but-rough cases | 4 |
| correct | 22 |
| golden_accuracy | 1.0 |
| adversarial_accuracy | 1.0 |
| false_positive_count | 0 |
| false_negative_count | 0 |
| failures | 0 |
| risks | 0 |
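
For readers who want to see how such numbers relate to each other, here is a rough sketch of one way to derive them from per-case outcomes. The definitions are assumptions, not the logic in evaluate_cases.py: `golden_accuracy` is taken as overall agreement with the expected labels, a false positive as accepting a case that should be rejected, and a false negative as rejecting a case that should be accepted.

```python
def summarize(outcomes):
    """Summarize (expected, predicted, is_adversarial) tuples into metrics.

    Labels are assumed to be the strings "accept" and "reject".
    """
    total = len(outcomes)
    correct = sum(1 for exp, pred, _ in outcomes if exp == pred)
    adversarial = [(exp, pred) for exp, pred, adv in outcomes if adv]
    adversarial_correct = sum(1 for exp, pred in adversarial if exp == pred)
    return {
        "total_cases": total,
        "adversarial_cases": len(adversarial),
        "correct": correct,
        # Guard against empty inputs to avoid division by zero.
        "golden_accuracy": correct / total if total else 0.0,
        "adversarial_accuracy": (
            adversarial_correct / len(adversarial) if adversarial else 0.0
        ),
        "false_positive_count": sum(
            1 for exp, pred, _ in outcomes if exp == "reject" and pred == "accept"
        ),
        "false_negative_count": sum(
            1 for exp, pred, _ in outcomes if exp == "accept" and pred == "reject"
        ),
    }
```

Under these assumed definitions, 22 correct outcomes out of 22 cases give an accuracy of 1.0 with zero false positives and zero false negatives, which matches the shape of the results reported above.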

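Current limitations

  • The evaluator is currently a deterministic offline scorer; it does not call online models or remote services.
  • The cases are a hand-written fixed set; there is no automatic generation, hidden set, or online red-team platform.
  • For ClawOSS, only the adapter spec is delivered; ClawOSS runtime integration is not implemented.
  • The fitness rule is still a first-version heuristic and can be extended as more real submissions come in.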

Future Work

  • Wire the evaluator into the Evolution Kernel's evaluator role.
  • Add hidden adversarial cases to keep the planner/executor from overfitting to the public samples.
  • Introduce a finer-grained score breakdown, e.g. reproducibility, auditability, complexity penalty.
  • Add an executable dry-run checker for the ClawOSS adapter.

Fixes #1

Development

Successfully merging this pull request may close these issues.

Task: application evaluation interface

2 participants