
feat: refusal detection + owner-positive framing for files audits (closes #4) #5

Merged
dshanklin-bv merged 2 commits into main from feat/refusal-detection on Apr 29, 2026

Conversation

@dshanklin-bv

Contributor

Summary

Closes #4 with two of the layers from the issue's proposed fix. Layer 3 (auto-retry on refusal) is deliberately skipped: it costs a billed second call, so the cheap fixes ship first.

Layer 1: refusal detection in openrouter.detect_refusal(). Heuristics scan recommended_next_step (anchored regex on opening tokens), summary, and hypotheses[].title/why for refusal-shaped language. When matched, the response gets two additive top-level fields:

{ "refused": true, "refusal_reason": "...", "decision": "backtrack", "confidence": 0.88, ... }
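A minimal sketch of how such a heuristic could look, for illustration only. The regex patterns, the reason strings, and the function body are assumptions and are not taken from the actual `openrouter.detect_refusal()` implementation; only the scanned field names and the two added fields come from the description above.

```python
import re

# Hypothetical patterns; the real heuristics in cept may differ.
# Anchored at the start so only an opening refusal token fires.
_OPENING_REFUSAL = re.compile(
    r"^\s*(?:i\s+)?(?:decline|refuse|cannot|can't|won't)\b", re.IGNORECASE
)
# Unanchored first-person refusal phrases for free-text fields.
_REFUSAL_PHRASE = re.compile(
    r"\b(?:i|we)\s+(?:cannot|can't|won't|will not)\s+(?:assist|help|comply|engage)\b",
    re.IGNORECASE,
)


def detect_refusal(response: dict) -> dict:
    """Return the response, tagged with refused/refusal_reason if a heuristic matches."""
    reason = None
    if _OPENING_REFUSAL.search(response.get("recommended_next_step") or ""):
        reason = "recommended_next_step opens with a refusal token"
    elif _REFUSAL_PHRASE.search(response.get("summary") or ""):
        reason = "summary contains a refusal phrase"
    else:
        for hyp in response.get("hypotheses") or []:
            text = f"{hyp.get('title', '')} {hyp.get('why', '')}"
            if _REFUSAL_PHRASE.search(text):
                reason = "hypothesis contains refusal-shaped language"
                break
    if reason is None:
        return response  # substantive guidance passes through untouched
    # Additive: existing keys such as decision/confidence are preserved.
    return {**response, "refused": True, "refusal_reason": reason}
```

Because the fields are only added on a match, responses that carry substantive guidance keep their original shape in this sketch.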

Layer 2: owner-positive system-prompt framing when packet.files is non-empty. Tells the model the calling agent is the document owner asking to harden their own deliverable, so adversarial language ("audit", "red-team") doesn't trigger third-party-attack refusals.
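The conditional framing could be sketched roughly as follows. The constant `OWNER_FRAMING`, its wording, and the helper `build_system_prompt` are hypothetical names introduced for illustration; the PR does not show the actual prompt text or where it is assembled.

```python
# Hypothetical wording; the actual framing text in cept is not shown in this PR.
OWNER_FRAMING = (
    "The calling agent is the owner of the attached documents and is asking "
    "for help hardening their own deliverable. Terms such as 'audit' or "
    "'red-team' refer to self-review, not to an attack on a third party."
)


def build_system_prompt(base_prompt: str, packet: dict) -> str:
    """Append the owner-positive framing only when the packet carries files."""
    if packet.get("files"):
        return f"{base_prompt}\n\n{OWNER_FRAMING}"
    return base_prompt
```

Packets without files keep the base prompt unchanged in this sketch, so the framing only applies where the refusal was observed.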

Why

The cept#4 repro: perplexity/sonar-pro (the default) read 12 KB of confidential corporate text + "adversarial audit" framing as attack prep and refused with decision: backtrack, confidence: 0.88. That refusal has the same shape as substantive guidance, so the calling agent has no way to tell the audit didn't happen.
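With the `refused` field described in the summary, a caller can make the distinction that was impossible in this repro. A rough caller-side sketch, where `handle_response` and its return strings are hypothetical and not part of cept:

```python
def handle_response(response: dict) -> str:
    """Branch on the additive `refused` flag before acting on `decision`."""
    if response.get("refused"):
        # The audit did not happen; do not treat `decision` as guidance.
        reason = response.get("refusal_reason", "unknown")
        return f"audit did not run: {reason}"
    return f"act on decision: {response['decision']}"
```

Using `.get("refused")` keeps the caller compatible with older responses that lack the field.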

Test plan

  • uv run pytest -x -q — 84 passed (10 new in test_openrouter.py)
  • uv run ruff check src/cept tests — clean
  • Anchored-regex tests confirm "cannot" mid-sentence ("the branch cannot handle empty input") doesn't fire
  • Substantive backtrack guidance is NOT flagged (false-positive guard)
  • Real-world cept#4 response shape IS flagged
  • Manual: re-run today's Paylocity audit with v0.3.0; confirm either (a) layer-2 framing prevents the refusal, or (b) layer-1 catches it and tags the response
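The anchored-regex behaviour in the test plan can be illustrated with a small standalone comparison. Both patterns here are assumptions written for this example, not the ones in the test suite:

```python
import re

# Hypothetical patterns: the same token list, with and without a start anchor.
ANCHORED = re.compile(r"^\s*(?:decline|refuse|cannot|won't)\b", re.IGNORECASE)
UNANCHORED = re.compile(r"\b(?:decline|refuse|cannot|won't)\b", re.IGNORECASE)

substantive = "Backtrack: the branch cannot handle empty input."
refusal = "Cannot assist with this audit."

# An unanchored search would flag substantive guidance (a false positive).
assert UNANCHORED.search(substantive)
# Anchoring to the opening tokens avoids that while still catching the refusal.
assert not ANCHORED.search(substantive)
assert ANCHORED.search(refusal)
```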

Versioning

Bumps to v0.3.0. Additive non-breaking response fields; matches the v0.1 → v0.2 minor bump for the files parameter.

🤖 Generated with Claude Code

dshanklin-bv and others added 2 commits April 29, 2026 13:40
Closes the cept#4 failure mode: when the underlying model declines to
engage with the packet (rather than recommending a substantive
backtrack), cept's response shape is structurally identical to real
guidance — same `decision`, `confidence`, `hypotheses` keys — so
callers act on the refusal as if it were a recommendation.

Two layers from the issue:

1. detect_refusal() runs heuristics on the parsed response —
   `recommended_next_step` opening with "decline/refuse/cannot/won't",
   refusal phrases in `summary`, and refusal-shaped language in
   `hypotheses[].title/why`. When matched, sets `refused: true` and
   `refusal_reason` so callers can switch on the difference.

2. When `packet.files` is non-empty, the user-payload preamble now
   states explicitly that the calling agent is the document owner
   asking for help hardening their own deliverable, killing the most
   common false-positive refusal seen on perplexity/sonar-pro
   ("audit"/"red-team" framing read as third-party attack prep).

Skips layer 3 (auto-retry on refusal) — billed second call, opinionated.
Ship cheap fixes first; revisit if these don't move the refusal rate.

Bumps to v0.3.0: response field additions are additive and non-breaking,
so per cept's existing convention (0.1 -> 0.2 was the `files` param)
this is a minor bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dshanklin-bv dshanklin-bv merged commit 8533bd5 into main Apr 29, 2026
0 of 3 checks passed
@dshanklin-bv dshanklin-bv deleted the feat/refusal-detection branch April 29, 2026 18:40

Development

Successfully merging this pull request may close these issues.

cept should detect model refusals and surface them distinctly from substantive guidance

1 participant