v0.4: Make SUP-01 decision-grade by t3chn · Pull Request #15 · heurema/agent-bench-lab

t3chn · 2026-05-25T07:14:36Z

Closes #14.

Adds SUP-01 as the fourth decision-grade task family for synthetic support inbox triage, policy-compliant draft replies, escalation decisions, and customer-style workflow evaluation.

Includes:

synthetic public SUP-01 cases;
deterministic scorer;
triage.json / drafts.json / escalations.json / decision_log.md artifact contract;
policy-grounded draft checks;
required and forbidden escalation checks;
unknown email ID detection;
prohibited promise/action detection;
mutation support;
ops-local suite config;
sup01-smoke;
tests and docs.

Scorer contract mapping:

artifact_exact for required files and no-extra-file checks;
schema_contract for triage.json, drafts.json, and escalations.json;
claim_rubric for policy-grounded draft content;
trace_policy for future tool-use and escalation-rule checks;
state_diff for future inbox/ticket-state mutations;
security_leak for future private canaries or prompt-injection emails.
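As an illustration of the artifact_exact row above, here is a minimal sketch of a required-files and no-extra-files check. The four file names come from the artifact contract in this PR; the function name, return shape, and the idea of passing in a list of file names are assumptions for illustration, not the scorer's actual API.

```python
# Hypothetical sketch: the real scorer's helper names and return types are not shown in this PR.
REQUIRED_ARTIFACTS = frozenset(
    {"triage.json", "drafts.json", "escalations.json", "decision_log.md"}
)


def artifact_exact(present_files):
    """Return (missing, extra) file names relative to the artifact contract."""
    present = set(present_files)
    missing = sorted(REQUIRED_ARTIFACTS - present)  # required but absent
    extra = sorted(present - REQUIRED_ARTIFACTS)    # present but not allowed
    return missing, extra
```

Both lists being empty would correspond to a passing check; anything else names the offending files.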

Suite decision:

SUP-01 is not added to core.
SUP-01 is added to ops-local to avoid core suite bloat.

Non-goals:

no real email data;
no real customer data;
no live inbox;
no browser workflow;
no MCP/live SaaS dependency;
no consumer app integration;
no private holdout fixtures in public repo;
no private bundle runtime;
no LLM judge as primary oracle.

Test plan:

make validate
make test
make smoke
make compare-smoke
make if01-smoke
make data01-smoke
make doc01-smoke
make sup01-smoke
make leak-check
python3 -m ruff check .
git diff --check
tracked-file audit

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56d9123186


chatgpt-codex-connector · 2026-05-25T07:21:49Z

+    if triage is None or not items:
+        return checks, violations, caps


Score missing triage items even when list is empty

Do not return early on not items here: an artifact with {"triage": []} currently passes triage_schema, skips all required-email checks, and can still receive success=true with a perfect score if the other artifacts are valid. This breaks SUP-01’s core requirement to classify every required email and makes the scorer bypassable for triage accuracy.
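One possible shape for the fix, sketched under assumptions: the payload is a dict with a "triage" list of objects carrying an "email_id" string, and the required IDs are available as a collection. The function name is hypothetical and not taken from the scorer.

```python
def missing_required_emails(triage, required_ids):
    """Return required email IDs with no triage entry.

    Assumption: `triage` is the parsed triage.json payload, or None if it
    failed to load. An empty "triage" list means every required email is
    missing, rather than a reason to skip the check.
    """
    if not isinstance(triage, dict):
        return sorted(required_ids)
    items = triage.get("triage") or []
    seen = {
        item.get("email_id")
        for item in items
        if isinstance(item, dict) and isinstance(item.get("email_id"), str)
    }
    return sorted(set(required_ids) - seen)
```

With this shape, {"triage": []} yields the full list of required IDs, so each one can be recorded as a failed check instead of being skipped.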

chatgpt-codex-connector · 2026-05-25T07:21:49Z

+    for item in items:
+        email_id = item.get("email_id")
+        if isinstance(email_id, str):
+            mapped[email_id] = item


Reject duplicate email IDs instead of silently overwriting

This mapper uses last-write-wins semantics, so duplicate email_id entries in triage, drafts, or escalations are silently collapsed with no violation. That allows contradictory or non-compliant duplicate records to be present in the artifact while still passing scoring as long as one duplicate is correct, which violates the one-decision-per-email contract and weakens result integrity.
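A minimal sketch of a mapper that reports duplicates instead of overwriting them. The function name and the (mapping, duplicates) return shape are assumptions; how the scorer would turn the duplicate list into a violation is not shown in this PR.

```python
from collections import Counter


def index_by_email_id(items):
    """Map email_id -> first item seen, and list IDs that appear more than once."""
    mapped = {}
    counts = Counter()
    for item in items:
        email_id = item.get("email_id") if isinstance(item, dict) else None
        if not isinstance(email_id, str):
            continue  # malformed entries are left to the schema check
        counts[email_id] += 1
        mapped.setdefault(email_id, item)  # keep the first record, never overwrite
    duplicates = sorted(eid for eid, n in counts.items() if n > 1)
    return mapped, duplicates
```

A non-empty duplicates list could then be reported as a violation for the affected artifact, so a correct duplicate cannot mask an incorrect one.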

chatgpt-codex-connector · 2026-05-25T07:21:49Z

+        maximum = int(log_config["max_words"])
+        checks.append(check("decision_log_max_words", count <= maximum, 0.03, f"got={count}"))


Fail overlong decision logs, don't only deduct points

The max_words rule is treated as a low-weight check only; when exceeded, no policy violation is added and no cap is applied. In practice, a decision log far above the configured limit can still return success=true if everything else passes, which makes the explicit conciseness constraint effectively unenforced.
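One way the limit could be enforced, sketched with hypothetical names. The 0.5 cap value and the representation of violations and caps as plain lists are assumptions; the scorer's real check helper and cap mechanics are not visible in this excerpt.

```python
def check_decision_log_length(text, max_words, cap=0.5):
    """Return (passed, violations, caps) for the decision log word limit.

    Assumption: exceeding the limit should add a violation and a score cap,
    not only lose a small weighted check.
    """
    count = len(text.split())  # whitespace-delimited word count
    passed = count <= max_words
    if passed:
        return True, [], []
    violation = f"decision_log_too_long: got={count} max={max_words}"
    return False, [violation], [cap]
```

Returning a violation alongside the cap lets the caller decide whether an overlong log should also force success=false.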

chatgpt-codex-connector · 2026-05-25T07:21:49Z

+    distractor = output / "inbox" / "email_999.eml"
+    distractor.write_text(


Sync mutation checks when adding distractor inbox email

When this script appends email_999.eml, it does not also update check_config.json (required_emails/triage.required), so generated mutation cases become internally inconsistent: the fixture now contains an extra inbox message, but scoring still allows a passing result without classifying that message. This weakens mutation-based robustness checks because the added distractor can be ignored with no penalty.
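A sketch of keeping the check configuration in step with the mutated inbox. It assumes the parsed check_config.json is a dict with a top-level "required_emails" list; the actual layout (including triage.required) is not shown here, so treat the key and the helper name as hypothetical.

```python
import copy


def add_required_email(config, email_id):
    """Return a copy of the check config that also requires `email_id`.

    Assumption: required IDs live in a top-level "required_emails" list.
    """
    updated = copy.deepcopy(config)  # leave the source fixture's config untouched
    required = updated.setdefault("required_emails", [])
    if email_id not in required:
        required.append(email_id)
    return updated
```

The mutation script could write the returned dict next to the new email_999.eml so the mutated case requires a decision for the distractor.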

Make SUP-01 decision-grade

56d9123

t3chn merged commit b596cc7 into main May 25, 2026
1 check passed

t3chn deleted the feat/sup01-decision-grade branch May 25, 2026 07:15
