v0.2: Make DATA-01 decision-grade by t3chn · Pull Request #6 · heurema/agent-bench-lab

t3chn · 2026-05-24T17:31:26Z

Closes #3.

Summary:

Add three synthetic public DATA-01 cases with CSV and SQLite fixtures.
Replace the sample DATA-01 scorer with config-driven deterministic checks for metrics.json, report.md, chart_spec.json, extra files, and critical caps.
Align DATA-01 with the standard scorer-contract model: artifact_exact for required files, schema_contract for metrics.json/chart_spec.json, numeric_metric for exact/tolerance values, claim_rubric for report.md factual checks, and future security_leak checks for private canaries/honey rows.
Add DATA-01 mutation support, sample artifact generation, data01-smoke, CI coverage, tests, and docs.
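The numeric_metric check named in the summary (exact or tolerance comparison of values in metrics.json) could look roughly like the sketch below. It is an illustration only, not the repository's scorer: the function names and the config shape (`expected`, optional `tolerance`) are assumptions.

```python
import math


def numeric_metric_ok(actual, expected, tolerance=0.0):
    """Return True when actual equals expected within an absolute tolerance.

    Hypothetical helper: tolerance=0.0 means an exact match is required.
    """
    # Reject booleans explicitly, because bool is a subclass of int in Python.
    if isinstance(actual, bool) or not isinstance(actual, (int, float)):
        return False
    if math.isnan(actual):
        return False
    return abs(actual - expected) <= tolerance


def score_metrics(metrics, config):
    """Check every configured metric; a missing metric counts as a failure.

    Assumed config shape: {"name": {"expected": number, "tolerance": number}}.
    """
    results = {}
    for name, spec in config.items():
        results[name] = name in metrics and numeric_metric_ok(
            metrics[name], spec["expected"], spec.get("tolerance", 0.0)
        )
    return results
```

Keeping the expected values and tolerances in config rather than in code is what makes such a check deterministic and reusable across cases.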

Test plan:

make validate
make test
make smoke
make compare-smoke
make if01-smoke
make data01-smoke
make leak-check
python3 -m ruff check .
git diff --check

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1db9638f39


chatgpt-codex-connector · 2026-05-24T17:37:13Z

+    if "max_words" in report_config:
+        count = word_count(text)
+        maximum = int(report_config["max_words"])
+        checks.append(check("report_max_words", count <= maximum, 0.04, f"got={count}"))


Enforce report word-limit failures as policy violations

When report_max_words fails, the scorer only records a failed check and does not add a policy violation or score cap, so overlong reports can still pass. For example, with otherwise-correct artifacts, a 500-word report in case_001 still returns success=True because the score remains above PASS_THRESHOLD and success is gated on not violations. This makes the configured max_words constraint effectively non-blocking and undermines the task’s “concise report” requirement.
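One possible shape for the suggested fix is sketched below: a failed report_max_words check also appends a policy violation, so success becomes False even when the score stays above the threshold. This is a sketch under stated assumptions, not the project's code. The names `check`, `word_count`, and `PASS_THRESHOLD` appear in the review, but their definitions here, the threshold value, and the `score_report` wrapper are hypothetical.

```python
PASS_THRESHOLD = 0.8  # assumed value; the real threshold is not shown in the review


def word_count(text):
    """Count whitespace-separated words (assumed definition)."""
    return len(text.split())


def check(name, passed, weight, detail=""):
    """Build a check record (assumed shape)."""
    return {"name": name, "passed": bool(passed), "weight": weight, "detail": detail}


def score_report(text, report_config, base_score=1.0):
    """Hypothetical wrapper showing the word limit enforced as a violation."""
    checks = []
    violations = []
    if "max_words" in report_config:
        count = word_count(text)
        maximum = int(report_config["max_words"])
        passed = count <= maximum
        checks.append(check("report_max_words", passed, 0.04, f"got={count}"))
        if not passed:
            # The change the review asks for: make the failure blocking.
            violations.append(f"report_max_words: got={count} max={maximum}")
    # Each failed check subtracts its weight from the score.
    score = base_score - sum(c["weight"] for c in checks if not c["passed"])
    success = score >= PASS_THRESHOLD and not violations
    return {"score": score, "success": success, "checks": checks, "violations": violations}
```

With this shape, a 500-word report against a 200-word limit loses only the 0.04 check weight yet is still rejected, because the violation list is non-empty. A score cap would be an alternative way to get the same blocking behavior.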


chatgpt-codex-connector Bot reviewed May 24, 2026


Make DATA-01 decision-grade

eae65f7

t3chn force-pushed the feat/data01-decision-grade branch from 1db9638 to eae65f7 May 25, 2026 05:46

t3chn merged commit 31ff201 into main May 25, 2026
1 check passed

t3chn deleted the feat/data01-decision-grade branch May 25, 2026 05:47

t3chn merged 1 commit into main from feat/data01-decision-grade
