v0.3: Make DOC-01 decision-grade #13
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa99722ea2
    unique_id = bool(citation_id and citation_id not in seen_ids)
    checks.append(check(f"citation_id_unique:{item_name}", item_ok and unique_id, 0.02))
    if citation_id:
        seen_ids.add(citation_id)
        citation_map[citation_id] = item
Reject duplicate citation IDs as policy violations
A duplicate citation ID currently only loses 0.02 points and never adds a policy violation, so malformed citations.json can still pass (e.g., adding a second entry with an existing id yields success=true and a high score). Because citation_map is still updated for duplicates, claim resolution remains ambiguous while the run is marked successful, which weakens the decision-grade contract.
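One possible shape for the fix, as a minimal sketch rather than the scorer's actual code: collect duplicate IDs up front, report each as a policy violation, and keep only the first occurrence in the map so that claim resolution stays unambiguous. The function names, the `"id"` field, and the `duplicate_citation_id:` violation label are all assumptions made for illustration.

```python
from collections import Counter


def find_duplicate_citation_ids(citations):
    """Return citation IDs that occur more than once, in first-seen order."""
    counts = Counter(
        str(item["id"])
        for item in citations
        if isinstance(item, dict) and item.get("id") is not None
    )
    return [cid for cid, n in counts.items() if n > 1]


def citation_policy_violations(citations):
    """Hypothetical helper: one violation string per duplicated ID."""
    return [
        f"duplicate_citation_id:{cid}"
        for cid in find_duplicate_citation_ids(citations)
    ]


def build_citation_map(citations):
    """Map each ID to its first entry only; later duplicates never overwrite it."""
    citation_map = {}
    for item in citations:
        if not isinstance(item, dict) or item.get("id") is None:
            continue
        citation_map.setdefault(str(item["id"]), item)
    return citation_map
```

If the scorer treats any non-empty violation list as a failed run, a duplicated ID would then force success=false instead of costing 0.02 points.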
    actual = {
        str(item.get("id")): item
        for item in items
        if isinstance(item, dict) and item.get("id") is not None
    }
Detect duplicate claim IDs before collapsing to a dict
The scorer collapses claims into a dict keyed by id, so repeated claim IDs are silently overwritten and never penalized; with allow_extra_claims=false, duplicated required IDs still produce a perfect pass. This accepts structurally invalid claims.json as successful output and can hide contradictory claim entries in evaluation artifacts.
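A rough sketch of how duplicates could be caught before the collapse, assuming the same `items` list of claim dicts as in the hunk above. The name `index_claims` and the returned `duplicates` list are hypothetical; how the scorer would penalize or reject them is left open.

```python
def index_claims(items):
    """Build the id -> claim dict while recording any repeated claim IDs.

    Returns (actual, duplicates): `actual` keeps the first entry seen for
    each ID, and `duplicates` lists each repeated ID once, in first-seen order.
    """
    actual = {}
    duplicates = []
    for item in items:
        # Same filter as the original comprehension: skip non-dicts and missing IDs.
        if not isinstance(item, dict) or item.get("id") is None:
            continue
        claim_id = str(item["id"])
        if claim_id in actual:
            if claim_id not in duplicates:
                duplicates.append(claim_id)
            continue  # do not silently overwrite the earlier claim
        actual[claim_id] = item
    return actual, duplicates
```

A caller could then fail the run, or add a policy violation, whenever `duplicates` is non-empty, so that contradictory entries under one ID cannot pass as a clean claims.json.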
Closes #12.
Adds DOC-01 as the third decision-grade task family for fixed-corpus document QA, grounded answers, citation validation, and unsupported-claim detection.
Includes:
Scorer contract mapping:
Non-goals:
Test plan: