v0.4: Make SUP-01 decision-grade #15

Merged: t3chn merged 1 commit into main from feat/sup01-decision-grade on May 25, 2026
Conversation

@t3chn (Contributor) commented May 25, 2026

Closes #14.

Adds SUP-01 as the fourth decision-grade task family for synthetic support inbox triage, policy-compliant draft replies, escalation decisions, and customer-style workflow evaluation.

Includes:

  • synthetic public SUP-01 cases;
  • deterministic scorer;
  • triage.json / drafts.json / escalations.json / decision_log.md artifact contract;
  • policy-grounded draft checks;
  • required and forbidden escalation checks;
  • unknown email ID detection;
  • prohibited promise/action detection;
  • mutation support;
  • ops-local suite config;
  • sup01-smoke;
  • tests and docs.

Scorer contract mapping:

  • artifact_exact for required files and no-extra-file checks;
  • schema_contract for triage.json, drafts.json, and escalations.json;
  • claim_rubric for policy-grounded draft content;
  • trace_policy for future tool-use and escalation-rule checks;
  • state_diff for future inbox/ticket-state mutations;
  • security_leak for future private canaries or prompt-injection emails.
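The artifact_exact entry in the mapping above (required files plus a no-extra-file check) could be sketched roughly as follows. The four file names come from the artifact contract listed in this description. The function name, signature, and return shape are hypothetical and are not taken from tasks/SUP-01/scorer.py.

```python
from pathlib import Path

# File names are from the PR description; everything else here is an assumption.
REQUIRED_FILES = {"triage.json", "drafts.json", "escalations.json", "decision_log.md"}


def artifact_exact(output_dir: Path) -> dict:
    """Report missing and unexpected files in a submission directory."""
    present = {p.name for p in output_dir.iterdir() if p.is_file()}
    missing = sorted(REQUIRED_FILES - present)
    extra = sorted(present - REQUIRED_FILES)
    # Pass only when the directory holds exactly the required files.
    return {"passed": not missing and not extra, "missing": missing, "extra": extra}
```

Returning the missing and extra names separately keeps the check deterministic while still telling the submitter which side of the contract failed.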

Suite decision:

  • SUP-01 is not added to core.
  • SUP-01 is added to ops-local to avoid core suite bloat.

Non-goals:

  • no real email data;
  • no real customer data;
  • no live inbox;
  • no browser workflow;
  • no MCP/live SaaS dependency;
  • no consumer app integration;
  • no private holdout fixtures in public repo;
  • no private bundle runtime;
  • no LLM judge as primary oracle.

Test plan:

  • make validate
  • make test
  • make smoke
  • make compare-smoke
  • make if01-smoke
  • make data01-smoke
  • make doc01-smoke
  • make sup01-smoke
  • make leak-check
  • python3 -m ruff check .
  • git diff --check
  • tracked-file audit

@t3chn merged commit b596cc7 into main on May 25, 2026
1 check passed
@t3chn deleted the feat/sup01-decision-grade branch on May 25, 2026 07:15
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56d9123186

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tasks/SUP-01/scorer.py
Comment on lines +180 to +181
if triage is None or not items:
    return checks, violations, caps

P1: Score missing triage items even when list is empty

Do not return early on not items here: an artifact with {"triage": []} currently passes triage_schema, skips all required-email checks, and can still receive success=true with a perfect score if the other artifacts are valid. This breaks SUP-01’s core requirement to classify every required email and makes the scorer bypassable for triage accuracy.
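One way to close this gap, as a minimal sketch: return early only when the artifact itself is absent, and let an empty list flow through to the required-email comparison. The function below is a hypothetical, simplified stand-in. The real scorer returns checks, violations, and caps, and the email IDs and violation strings used here are illustrative assumptions.

```python
def check_required_emails(triage, items, required_ids):
    """Return one violation string per required email that has no triage item."""
    if triage is None:
        # Artifact missing or unparseable: there is nothing to compare against.
        return ["triage_missing"]
    # An empty list is NOT an early exit: every required ID is then reported.
    seen = {item.get("email_id") for item in items if isinstance(item, dict)}
    return [f"triage_missing_email:{eid}" for eid in required_ids if eid not in seen]


# An artifact of {"triage": []} now yields a violation for each required email.
violations = check_required_emails({"triage": []}, [], ["email_001", "email_002"])
```

With this shape, an empty triage list can no longer skip the per-email checks, so it cannot reach success=true on triage alone.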


Comment thread tasks/SUP-01/scorer.py
Comment on lines +125 to +128
for item in items:
    email_id = item.get("email_id")
    if isinstance(email_id, str):
        mapped[email_id] = item

P2: Reject duplicate email IDs instead of silently overwriting

This mapper uses last-write-wins semantics, so duplicate email_id entries in triage, drafts, or escalations are silently collapsed with no violation. That allows contradictory or non-compliant duplicate records to be present in the artifact while still passing scoring as long as one duplicate is correct, which violates the one-decision-per-email contract and weakens result integrity.
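A duplicate-aware mapper might look like the sketch below. It keeps the first record for each email_id and returns the repeated IDs separately so the caller can raise a violation. The function name and return shape are assumptions, not the scorer's actual API.

```python
from collections import Counter


def map_by_email_id(items):
    """Map email_id -> first item seen, and list every email_id that repeats."""
    records = [item for item in items if isinstance(item, dict)]
    counts = Counter(
        item["email_id"] for item in records if isinstance(item.get("email_id"), str)
    )
    duplicates = sorted(eid for eid, n in counts.items() if n > 1)
    mapped = {}
    for item in records:
        email_id = item.get("email_id")
        if isinstance(email_id, str) and email_id not in mapped:
            # First-write-wins; later copies are reported via `duplicates`.
            mapped[email_id] = item
    return mapped, duplicates
```

The caller would then add one violation per entry in duplicates for each of triage, drafts, and escalations, which restores the one-decision-per-email contract regardless of which copy is kept.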


Comment thread tasks/SUP-01/scorer.py
Comment on lines +452 to +453
maximum = int(log_config["max_words"])
checks.append(check("decision_log_max_words", count <= maximum, 0.03, f"got={count}"))

P2: Fail overlong decision logs, don't only deduct points

The max_words rule is treated as a low-weight check only; when exceeded, no policy violation is added and no cap is applied. In practice, a decision log far above the configured limit can still return success=true if everything else passes, which makes the explicit conciseness constraint effectively unenforced.
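A sketch of stricter enforcement, under stated assumptions: words are counted by whitespace splitting, an overlong log adds a policy violation, and a score cap is applied. The helper name, the violation string, the cap structure, and the 0.5 cap value are all hypothetical. Only max_words and the decision_log_max_words check name appear in the reviewed code.

```python
def decision_log_length(text, max_words, cap_score=0.5):
    """Return (passed, violations, caps) for the decision log word limit."""
    count = len(text.split())  # assumption: whitespace-delimited word count
    passed = count <= max_words
    violations = []
    caps = []
    if not passed:
        # Record a hard violation in addition to the low-weight check.
        violations.append(f"decision_log_too_long:got={count},max={max_words}")
        # Hypothetical cap entry; the real scorer's cap format may differ.
        caps.append({"reason": "decision_log_max_words", "max_score": cap_score})
    return passed, violations, caps
```

Whether an overlong log should fail outright or only cap the score is a policy choice for the task authors. Either way, attaching a violation means success=true can no longer be returned silently.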


Comment on lines +56 to +57
distractor = output / "inbox" / "email_999.eml"
distractor.write_text(

P2: Sync mutation checks when adding distractor inbox email

When this script appends email_999.eml, it does not also update check_config.json (required_emails/triage.required), so generated mutation cases become internally inconsistent: the fixture now contains an extra inbox message, but scoring still allows a passing result without classifying that message. This weakens mutation-based robustness checks because the added distractor can be ignored with no penalty.
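Keeping the fixture and its checks in sync could be sketched as below. This assumes check_config.json sits in the output directory and stores required_emails and triage.required as lists of email IDs. That layout is a guess based on the names in the comment, and the add_distractor helper and the placeholder email body are hypothetical. If triage.required actually maps IDs to expected labels, the update would need an expected label for the distractor as well.

```python
import json
from pathlib import Path


def add_distractor(output: Path, email_id: str = "email_999") -> None:
    """Write a distractor email and register it in check_config.json."""
    inbox = output / "inbox"
    inbox.mkdir(parents=True, exist_ok=True)
    (inbox / f"{email_id}.eml").write_text(
        "Subject: placeholder\n\nPlaceholder distractor body.\n", encoding="utf-8"
    )

    config_path = output / "check_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumed layout: both keys hold plain lists of email IDs.
    required = config.setdefault("required_emails", [])
    if email_id not in required:
        required.append(email_id)
    triage_required = config.setdefault("triage", {}).setdefault("required", [])
    if email_id not in triage_required:
        triage_required.append(email_id)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
```

Updating both files in one function keeps a generated mutation case from ending up with an inbox message that the scorer does not require.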



Development

Successfully merging this pull request may close these issues.

v0.4: Make SUP-01 decision-grade

1 participant