v0.3: Make DOC-01 decision-grade by t3chn · Pull Request #13 · heurema/agent-bench-lab

t3chn · 2026-05-25T06:16:03Z

Closes #12.

Adds DOC-01 as the third decision-grade task family for fixed-corpus document QA, grounded answers, citation validation, and unsupported-claim detection.

Includes:

synthetic public DOC-01 cases;
deterministic scorer;
answer.md / citations.json / claims.json artifact contract;
local corpus citation validation;
unsupported-claim and stale/distractor source checks;
mutation support;
sample artifact generation;
doc01-smoke;
tests and docs.

Scorer contract mapping:

artifact_exact for required files and no-extra-file checks;
schema_contract for citations.json and claims.json;
claim_rubric for supported/unsupported claims and evidence;
security_leak for future private canaries / hidden corpus leaks;
redacted feedback remains enforced through public compare report tests.
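
As a rough illustration of the artifact_exact row above, the required-file and no-extra-file checks could be sketched as a set comparison. The function name `check_artifact_exact` and the result shape are hypothetical and are not taken from the repository; only the three artifact names come from the contract listed in this PR.

```python
# Hypothetical sketch: not the scorer shipped in this PR.
# The three required names come from the DOC-01 artifact contract.
REQUIRED_FILES = frozenset({"answer.md", "citations.json", "claims.json"})


def check_artifact_exact(present):
    """Compare produced file names against the exact required set."""
    present_set = set(present)
    missing = sorted(REQUIRED_FILES - present_set)
    extra = sorted(present_set - REQUIRED_FILES)
    # The check passes only when nothing is missing and nothing is extra.
    return {"passed": not missing and not extra, "missing": missing, "extra": extra}
```

A run that also writes, say, `notes.txt` would fail this check with `extra == ["notes.txt"]`, even though all three required artifacts are present.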

Non-goals:

no live web;
no browser workflow;
no private holdout fixtures in public repo;
no real customer data;
no private scorer configs;
no consumer app coupling;
no private bundle runtime.

Test plan:

make validate
make test
make smoke
make compare-smoke
make if01-smoke
make data01-smoke
make doc01-smoke
make leak-check
python3 -m ruff check .
git diff --check
tracked-file audit

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa99722ea2

chatgpt-codex-connector · 2026-05-25T06:20:04Z

+        unique_id = bool(citation_id and citation_id not in seen_ids)
+        checks.append(check(f"citation_id_unique:{item_name}", item_ok and unique_id, 0.02))
+        if citation_id:
+            seen_ids.add(citation_id)
+            citation_map[citation_id] = item


Reject duplicate citation IDs as policy violations

A duplicate citation ID currently only loses 0.02 points and never adds a policy violation, so malformed citations.json can still pass (e.g., adding a second entry with an existing id yields success=true and a high score). Because citation_map is still updated for duplicates, claim resolution remains ambiguous while the run is marked successful, which weakens the decision-grade contract.
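
One way to address this, sketched under assumptions: collect duplicates as explicit violations while indexing, and keep only the first entry so claim resolution stays unambiguous. The helper name `index_citations`, the violation string format, and the use of an `id` key on each citation entry are illustrative guesses, not the scorer's actual API.

```python
from typing import Any


def index_citations(items: list[Any]) -> tuple[dict[str, dict], list[str]]:
    """Index citations by ID and report repeated IDs as violations.

    Hypothetical helper: assumes each citation is a dict with an "id" key.
    """
    citation_map: dict[str, dict] = {}
    violations: list[str] = []
    for item in items:
        if not isinstance(item, dict) or not item.get("id"):
            continue  # malformed entries are left to the schema checks
        citation_id = str(item["id"])
        if citation_id in citation_map:
            # Do not overwrite: the first entry stays authoritative and
            # the duplicate is surfaced so the caller can fail the run.
            violations.append(f"duplicate_citation_id:{citation_id}")
            continue
        citation_map[citation_id] = item
    return citation_map, violations
```

A caller could then treat any non-empty `violations` list as a policy violation and set the run to unsuccessful, rather than relying on the 0.02 score deduction alone.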

chatgpt-codex-connector · 2026-05-25T06:20:04Z

+    actual = {
+        str(item.get("id")): item
+        for item in items
+        if isinstance(item, dict) and item.get("id") is not None
+    }


Detect duplicate claim IDs before collapsing to a dict

The scorer collapses claims into a dict keyed by id, so repeated claim IDs are silently overwritten and never penalized; with allow_extra_claims=false, duplicated required IDs still produce a perfect pass. This accepts structurally invalid claims.json as successful output and can hide contradictory claim entries in evaluation artifacts.
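
A minimal sketch of the suggested fix, assuming the same `items` list and `id` key as the snippet above: count IDs before building the dict, so repeats are visible instead of silently overwritten. The function name `find_duplicate_claim_ids` is hypothetical.

```python
from collections import Counter
from typing import Any


def find_duplicate_claim_ids(items: list[Any]) -> list[str]:
    """Return claim IDs that appear more than once, in sorted order.

    IDs are stringified the same way as the dict keys in the scorer
    snippet, so 1 and "1" count as the same ID here too.
    """
    counts = Counter(
        str(item.get("id"))
        for item in items
        if isinstance(item, dict) and item.get("id") is not None
    )
    return sorted(claim_id for claim_id, n in counts.items() if n > 1)
```

If this returns a non-empty list, the scorer could add a failing check or policy violation before collapsing the claims into a dict.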

Make DOC-01 decision-grade

aa99722

chatgpt-codex-connector Bot reviewed May 25, 2026

t3chn merged commit 3e8ec69 into main May 25, 2026
1 check passed

t3chn deleted the feat/doc01-decision-grade branch May 25, 2026 06:34
