v0.4: Make SUP-01 decision-grade #15
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56d9123186
    if triage is None or not items:
        return checks, violations, caps
Score missing triage items even when list is empty
Do not return early on `not items` here: an artifact with `{"triage": []}` currently passes `triage_schema`, skips all required-email checks, and can still receive `success=true` with a perfect score if the other artifacts are valid. This breaks SUP-01's core requirement to classify every required email and makes the scorer bypassable for triage accuracy.
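One way to address this is sketched below. It is a minimal illustration, not the repository's scorer: `score_triage`, the `check` stand-in, the violation names, and the `0.10` weight are all hypothetical, and the real function may receive its inputs differently. The point it shows is that only a missing or malformed artifact short-circuits, while an empty list still goes through the required-email coverage check.

```python
def check(name, passed, weight, detail=""):
    # Hypothetical stand-in for the scorer's own check() helper.
    return {"name": name, "passed": passed, "weight": weight, "detail": detail}


def score_triage(triage, required_emails):
    """Score triage coverage; an empty list is scored, not skipped."""
    checks, violations, caps = [], [], []

    # Return early only when the artifact is absent or not a list.
    if not isinstance(triage, list):
        violations.append("triage_missing")  # assumed violation name
        return checks, violations, caps

    # Collect every email_id that was actually classified.
    classified = {
        item.get("email_id")
        for item in triage
        if isinstance(item, dict) and isinstance(item.get("email_id"), str)
    }

    # Every required email must appear, so [] fails whenever anything is required.
    missing = sorted(set(required_emails) - classified)
    checks.append(
        check("triage_covers_required", not missing, 0.10, f"missing={missing}")
    )
    if missing:
        violations.append("triage_missing_required_emails")  # assumed name
    return checks, violations, caps
```

With this shape, `score_triage([], ["email_001"])` records a failed check and a violation instead of returning empty results.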
    for item in items:
        email_id = item.get("email_id")
        if isinstance(email_id, str):
            mapped[email_id] = item
Reject duplicate email IDs instead of silently overwriting
This mapper uses last-write-wins semantics, so duplicate `email_id` entries in triage, drafts, or escalations are silently collapsed with no violation. That allows contradictory or non-compliant duplicate records to be present in the artifact while still passing scoring as long as one duplicate is correct, which violates the one-decision-per-email contract and weakens result integrity.
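A possible shape for a duplicate-aware mapper is sketched here. The function name `map_by_email_id` and its return signature are assumptions for illustration; the actual mapper in the PR may differ. The sketch reports every repeated `email_id` and, as one design choice among several, leaves duplicated IDs out of the mapping so downstream checks cannot pass on an arbitrarily chosen record.

```python
from collections import Counter


def map_by_email_id(items):
    """Return (mapped, duplicates); duplicated email_ids are excluded from mapped."""
    # Keep only well-formed records that carry a string email_id.
    valid = [
        item
        for item in items
        if isinstance(item, dict) and isinstance(item.get("email_id"), str)
    ]

    # Count occurrences so repeated IDs can be reported instead of overwritten.
    counts = Counter(item["email_id"] for item in valid)
    duplicate_ids = {email_id for email_id, n in counts.items() if n > 1}

    # Only unambiguous IDs are mapped; the caller turns duplicates into violations.
    mapped = {
        item["email_id"]: item
        for item in valid
        if item["email_id"] not in duplicate_ids
    }
    return mapped, sorted(duplicate_ids)
```

A caller could then append one violation per entry in the returned duplicate list for each of the triage, drafts, and escalations artifacts.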
    maximum = int(log_config["max_words"])
    checks.append(check("decision_log_max_words", count <= maximum, 0.03, f"got={count}"))
Fail overlong decision logs, don't only deduct points
The `max_words` rule is treated as a low-weight check only; when exceeded, no policy violation is added and no cap is applied. In practice, a decision log far above the configured limit can still return `success=true` if everything else passes, which makes the explicit conciseness constraint effectively unenforced.
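The following sketch shows one way the limit could be enforced rather than only scored. It assumes words are counted by whitespace splitting, that `violations` and `caps` are plain lists, and that a hypothetical `overlong_cap` config key sets the score cap; none of these details are confirmed by the PR. The check name and `0.03` weight are taken from the snippet above.

```python
def check_decision_log_length(text, log_config, checks, violations, caps):
    """Record the max_words check and enforce it with a violation and a cap."""
    count = len(text.split())  # assumption: whitespace-delimited word count
    maximum = int(log_config["max_words"])
    within_limit = count <= maximum

    # Same low-weight check as before, kept for score detail.
    checks.append(
        {
            "name": "decision_log_max_words",
            "passed": within_limit,
            "weight": 0.03,
            "detail": f"got={count}",
        }
    )

    # New: exceeding the limit is also a violation and caps the final score.
    if not within_limit:
        violations.append("decision_log_too_long")  # assumed violation name
        caps.append(float(log_config.get("overlong_cap", 0.5)))  # hypothetical key
    return within_limit
```

Whether an overlong log should force `success=false` outright or only cap the score is a policy decision for the suite; the sketch records both signals so the caller can choose.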
    distractor = output / "inbox" / "email_999.eml"
    distractor.write_text(
Sync mutation checks when adding distractor inbox email
When this script appends `email_999.eml`, it does not also update `check_config.json` (`required_emails`/`triage.required`), so generated mutation cases become internally inconsistent: the fixture now contains an extra inbox message, but scoring still allows a passing result without classifying that message. This weakens mutation-based robustness checks because the added distractor can be ignored with no penalty.
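A minimal sketch of keeping the fixture and its config in sync follows. The layout of `check_config.json` is not shown in the review, so the sketch assumes a top-level `required_emails` list of IDs without the `.eml` suffix; the helper name `add_distractor` and the placeholder message body are also hypothetical. If the real config nests the list under `triage.required`, the same update would apply there.

```python
import json
from pathlib import Path


def add_distractor(output: Path, email_id: str = "email_999") -> None:
    """Write a distractor email and register it as required in check_config.json."""
    inbox = output / "inbox"
    inbox.mkdir(parents=True, exist_ok=True)

    # Placeholder message; the real script supplies its own distractor text.
    distractor = inbox / f"{email_id}.eml"
    distractor.write_text(
        "Subject: Distractor\n\nPlaceholder body for a mutation case.\n",
        encoding="utf-8",
    )

    # Assumed schema: {"required_emails": [...]} at the top level of the config.
    config_path = output / "check_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    required = config.setdefault("required_emails", [])
    if email_id not in required:
        required.append(email_id)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
```

Updating both files in one helper keeps each generated mutation case self-consistent, so a run that ignores the distractor fails the coverage check.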
Closes #14.
Adds SUP-01 as the fourth decision-grade task family for synthetic support inbox triage, policy-compliant draft replies, escalation decisions, and customer-style workflow evaluation.
Includes:
Scorer contract mapping:
Suite decision:
Non-goals:
Test plan: