feat: refusal detection + owner-positive framing for files audits (closes #4) by dshanklin-bv · Pull Request #5 · eidos-agi/cept

dshanklin-bv · 2026-04-29T18:40:28Z

Summary

Closes #4 with two layers from the issue's proposed fix. Layer 3 (auto-retry on refusal) is deliberately skipped: it would add a billed second call, so the cheap fixes ship first.

Layer 1: refusal detection in openrouter.detect_refusal(). Heuristics scan recommended_next_step (anchored regex on opening tokens), summary, and hypotheses[].title/why for refusal-shaped language. When matched, the response gets two additive top-level fields:

{ "refused": true, "refusal_reason": "...", "decision": "backtrack", "confidence": 0.88, ... }
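A minimal sketch of how the layer-1 heuristics could look, assuming the parsed response is a plain dict. The regex patterns, the return type, and the `tag_refusal` helper are illustrative assumptions; the actual `openrouter.detect_refusal()` implementation is not shown in this description and may differ.

```python
import re
from typing import Any, Optional

# Hypothetical patterns, for illustration only.
# Anchored: fires only when the step OPENS with refusal language.
_OPENING_REFUSAL = re.compile(
    r"^\s*(?:i\s+)?(?:decline|refuse|cannot|can't|won't)\b", re.IGNORECASE
)
# Unanchored, but limited to multi-word refusal phrases.
_REFUSAL_PHRASES = re.compile(
    r"\b(?:cannot assist|can't assist|unable to assist|decline to|refuse to|won't help)\b",
    re.IGNORECASE,
)


def detect_refusal(response: dict[str, Any]) -> Optional[str]:
    """Return a short reason if the response looks like a refusal, else None."""
    if _OPENING_REFUSAL.search(response.get("recommended_next_step") or ""):
        return "recommended_next_step opens with refusal language"
    if _REFUSAL_PHRASES.search(response.get("summary") or ""):
        return "summary contains a refusal phrase"
    for hyp in response.get("hypotheses") or []:
        text = f"{hyp.get('title', '')} {hyp.get('why', '')}"
        if _REFUSAL_PHRASES.search(text):
            return "hypothesis contains refusal-shaped language"
    return None


def tag_refusal(response: dict[str, Any]) -> dict[str, Any]:
    """Add the two additive fields only when a refusal is detected."""
    reason = detect_refusal(response)
    if reason is None:
        return response
    return {**response, "refused": True, "refusal_reason": reason}
```

Responses that are not flagged pass through unchanged in this sketch, which keeps the added fields additive.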

Layer 2: owner-positive system-prompt framing when packet.files is non-empty. Tells the model the calling agent is the document owner asking to harden their own deliverable, so adversarial language ("audit", "red-team") doesn't trigger third-party-attack refusals.
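The conditional framing could be sketched roughly as follows. The `Packet` dataclass, the `build_prompt` helper, and the framing text itself are hypothetical stand-ins; the real wording and where cept injects it are not shown in this description.

```python
from dataclasses import dataclass, field

# Hypothetical wording; the actual framing text used by cept is not shown here.
OWNER_FRAMING = (
    "The calling agent is the owner of the attached documents and is asking "
    "for help hardening their own deliverable. Terms such as 'audit' or "
    "'red-team' describe self-review, not an attack on a third party."
)


@dataclass
class Packet:
    """Hypothetical stand-in for cept's packet type."""
    question: str
    files: list[str] = field(default_factory=list)


def build_prompt(base_prompt: str, packet: Packet) -> str:
    """Append the owner-positive framing only when files are attached."""
    if not packet.files:
        return base_prompt
    return f"{base_prompt}\n\n{OWNER_FRAMING}"
```

Gating on `packet.files` leaves the prompt untouched for packets without attachments.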

Why

The cept#4 repro: perplexity/sonar-pro (the default) read 12 KB of confidential corporate text + "adversarial audit" framing as attack prep and refused with decision: backtrack, confidence: 0.88. That response has the same shape as substantive guidance, so the calling agent has no way to tell the audit didn't happen.
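With the new fields, a calling agent could branch on the difference along these lines. This is a caller-side sketch only; the `handle_response` name and the returned strings are assumptions made for illustration.

```python
from typing import Any


def handle_response(response: dict[str, Any]) -> str:
    """Distinguish a tagged refusal from substantive guidance."""
    # `refused` is additive, so older responses without it count as guidance.
    if response.get("refused", False):
        reason = response.get("refusal_reason", "unknown")
        return f"audit did not run: {reason}"
    return f"act on decision: {response['decision']}"
```

Reading `refused` with a default of `False` keeps callers compatible with responses produced before the field existed.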

Test plan

uv run pytest -x -q — 84 passed (10 new in test_openrouter.py)
uv run ruff check src/cept tests — clean
Anchored-regex tests confirm "cannot" mid-sentence ("the branch cannot handle empty input") doesn't fire
Substantive backtrack guidance is NOT flagged (false-positive guard)
Real-world cept#4 response shape IS flagged
Manual: re-run today's Paylocity audit with v0.3.0; confirm either (a) layer-2 framing prevents the refusal, or (b) layer-1 catches it and tags the response
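The anchored-regex cases in the plan could be exercised with tests shaped like the sketch below. Both patterns are hypothetical and defined inline so the example stands alone; the real tests in test_openrouter.py are not reproduced here.

```python
import re

# Hypothetical patterns for illustration only.
ANCHORED = re.compile(r"^\s*(?:i\s+)?(?:decline|refuse|cannot|won't)\b", re.IGNORECASE)
UNANCHORED = re.compile(r"\b(?:decline|refuse|cannot|won't)\b", re.IGNORECASE)


def test_mid_sentence_cannot_does_not_fire():
    step = "Revert the change; the branch cannot handle empty input."
    assert ANCHORED.search(step) is None
    # An unanchored pattern would produce a false positive on the same text.
    assert UNANCHORED.search(step) is not None


def test_opening_refusal_fires():
    assert ANCHORED.search("Cannot assist with auditing this document.") is not None
    assert ANCHORED.search("I won't review confidential material.") is not None
```

Anchoring to the opening tokens is what separates "cannot" as a refusal from "cannot" as a description of the code under review.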

Versioning

Bumps to v0.3.0. Additive non-breaking response fields; matches the v0.1 → v0.2 minor bump for the files parameter.

🤖 Generated with Claude Code

Closes the cept#4 failure mode: when the underlying model declines to engage with the packet (rather than recommending a substantive backtrack), cept's response shape is structurally identical to real guidance (same `decision`, `confidence`, `hypotheses` keys), so callers act on the refusal as if it were a recommendation.

Two layers from the issue:

1. detect_refusal() runs heuristics on the parsed response: `recommended_next_step` opening with "decline/refuse/cannot/won't", refusal phrases in `summary`, and refusal-shaped language in `hypotheses[].title/why`. When matched, sets `refused: true` and `refusal_reason` so callers can switch on the difference.

2. When `packet.files` is non-empty, the user-payload preamble now states explicitly that the calling agent is the document owner asking for help hardening their own deliverable, killing the most common false-positive refusal seen on perplexity/sonar-pro ("audit"/"red-team" framing read as third-party attack prep).

Skips layer 3 (auto-retry on refusal): billed second call, opinionated. Ship cheap fixes first; revisit if these don't move the refusal rate.

Bumps to v0.3.0: response field additions are non-breaking but additive, so per cept's existing convention (0.1 -> 0.2 was the `files` param) this is a minor bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dshanklin-bv and others added 2 commits April 29, 2026 13:40

chore: bump uv.lock cept entry to 0.3.0

3271046

dshanklin-bv merged commit 8533bd5 into main Apr 29, 2026
0 of 3 checks passed

dshanklin-bv deleted the feat/refusal-detection branch April 29, 2026 18:40

dshanklin-bv merged 2 commits into main from feat/refusal-detection
