Add guardrails support (before/after input-output filters) by RPaolino · Pull Request #406 · AISecurityLab/hackagent

Raffaele Paolino (RPaolino) · 2026-05-27T11:08:03Z

Summary

Add guardrail infrastructure that allows placing safety classifiers before (input filter) and after (output filter) the target model, mirroring real-world deployment patterns.

Changes

Core

hackagent/attacks/shared/guardrail.py — BaseGuardrail abstract class, LLMGuardrail implementation, and create_guardrail_from_config factory
hackagent/router/router.py — before_guardrail / after_guardrail hooks on AgentRouter that intercept every route_request call
hackagent/agent.py — before_guardrail / after_guardrail params on HackAgent.__init__ to configure guardrails on the target
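A minimal sketch of how the pieces listed above might fit together. Only the names BaseGuardrail, before_guardrail / after_guardrail (shortened here to before / after), and the adapter_type value "guardrail" come from this PR. GuardrailVerdict, the check method, KeywordGuardrail, route_with_guardrails, and the generated_text field are hypothetical stand-ins, not the actual hackagent API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class GuardrailVerdict:
    """Hypothetical result type: whether the text is safe, plus context."""
    safe: bool
    categories: List[str] = field(default_factory=list)
    reasoning: str = ""


class BaseGuardrail(ABC):
    """Abstract guardrail; the `check` method name is an assumption."""

    @abstractmethod
    def check(self, text: str) -> GuardrailVerdict:
        ...


class KeywordGuardrail(BaseGuardrail):
    """Toy stand-in for an LLM-backed guardrail so the sketch runs offline."""

    def __init__(self, banned: List[str]):
        self.banned = [w.lower() for w in banned]

    def check(self, text: str) -> GuardrailVerdict:
        hits = [w for w in self.banned if w in text.lower()]
        return GuardrailVerdict(safe=not hits, categories=hits)


def _blocked(side: str, verdict: GuardrailVerdict) -> dict:
    # Synthetic interception: not a model response, so adapter_type is "guardrail".
    return {
        "adapter_type": "guardrail",
        "generated_text": "",
        "agent_specific_data": {
            "side": side,
            "categories": verdict.categories,
            "reasoning": verdict.reasoning,
        },
    }


def route_with_guardrails(
    prompt: str,
    target: Callable[[str], str],
    before: Optional[BaseGuardrail] = None,
    after: Optional[BaseGuardrail] = None,
) -> dict:
    """Run the input filter, then the target, then the output filter."""
    if before is not None:
        verdict = before.check(prompt)
        if not verdict.safe:
            return _blocked("before", verdict)  # the target is never called
    text = target(prompt)
    if after is not None:
        verdict = after.check(text)
        if not verdict.safe:
            return _blocked("after", verdict)  # the unsafe text is discarded
    return {"adapter_type": "model", "generated_text": text}
```

In this sketch a blocked request and a blocked response share one response shape and differ only in the side field, so downstream code needs a single check.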

Attack techniques

Unified guardrail response detection via adapter_type == "guardrail". A guardrail block is not an adapter response but a synthetic interception, so a distinct adapter_type value makes clear that the response did not come from a model. The before-guardrail case loses nothing (the model was never called), and the after-guardrail case intentionally discards the unsafe text to simulate what a real end user would experience behind a deployed guardrail.
Guardrail block info preserved in trace recordings
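The detection rule above could look roughly like the following. The helper names is_guardrail_response and get_guardrail_info and the agent_specific_data fields (side, categories, reasoning) are mentioned in this PR's commit messages. The signatures, return shapes, and defaults are assumptions made for illustration.

```python
from typing import Any, Optional


def is_guardrail_response(response: Any) -> bool:
    """True only for dict responses tagged with adapter_type == "guardrail"."""
    return isinstance(response, dict) and response.get("adapter_type") == "guardrail"


def get_guardrail_info(response: Any) -> Optional[dict]:
    """Return the block details, or None when the response came from a model."""
    if not is_guardrail_response(response):
        return None
    data = response.get("agent_specific_data") or {}
    return {
        "side": data.get("side"),
        "categories": list(data.get("categories") or []),
        "reasoning": data.get("reasoning", ""),
    }
```

Keeping the check in one place means each attack technique asks the same question instead of inspecting response fields on its own.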

CLI

New --before-guardrail-name, --before-guardrail-type, --before-guardrail-endpoint options (and matching --after-guardrail-*) on all hackagent eval <strategy> subcommands

TUI

Two collapsible guardrail sections (Before / After) in the attack form, using the same agent-type choices as the target

Dashboard

Guardrail events displayed with side, explanation, and categories

Documentation

docs/docs/agents/guardrails.mdx — full documentation page with architecture diagram, configuration fields, CLI usage, and examples
Sidebar entry added under Agents

Design decisions

Guardrails are configured on HackAgent init (not per-attack) so they defend the target the same way a real deployment would
Fail-open: a misconfigured or unreachable guardrail lets traffic through rather than silently blocking it
CLI/TUI guardrail options mirror the target agent options (name, type, endpoint) for consistency
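The fail-open decision can be illustrated with a small wrapper. This is a sketch under assumptions: check_fail_open, unreachable_guardrail, and the logger name are hypothetical, and the real code may catch narrower exception types.

```python
import logging
from typing import Callable

logger = logging.getLogger("guardrail_sketch")


def check_fail_open(check: Callable[[str], bool], text: str) -> bool:
    """Return True when traffic may pass.

    `check` returns True for safe text. Any error raised by the guardrail
    (bad config, network failure) is logged and treated as "allow".
    """
    try:
        return bool(check(text))
    except Exception as exc:  # broad on purpose: every failure fails open
        logger.warning("guardrail unavailable, failing open: %s", exc)
        return True


def unreachable_guardrail(text: str) -> bool:
    """Simulates a guardrail whose endpoint cannot be reached."""
    raise ConnectionError("guardrail endpoint not reachable")
```

The trade-off in this sketch is that a broken guardrail is visible only in the logs, so results from a run with a failing guardrail should be read as unguarded.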

Fixes #356

- Add GuardrailExtractor for parsing guardrail events from agent responses
- Integrate before/after guardrail detection in router
- Track guardrail events in coordinator and tracker
- Update all attack techniques to handle guardrail-blocked responses: baseline, advprefix, bon, cipherchat, flipattack, h4rm3l, pap
- Export guardrail utilities from attacks.shared

- Replace guardrail_blocked/guardrail_event with adapter_type: guardrail
- Add is_guardrail_response() and get_guardrail_info() to response_utils
- Update router to emit structured agent_specific_data (side, categories, reasoning)
- Migrate all 10 attack techniques to use canonical detection helper
- Update tracker to detect guardrail responses via adapter_type
- Switch guardrail.py to JSON-structured output parsing with keyword fallback
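"JSON-structured output parsing with keyword fallback" could work along these lines. This is a sketch only: parse_guardrail_output, the "safe" key, the "unsafe" keyword, and the parsed_as field are assumptions, and the classifier's actual output schema is not shown in this PR.

```python
import json
import re


def parse_guardrail_output(raw: str) -> dict:
    """Parse classifier output: structured JSON first, keyword heuristic second."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            data = None
        # Accept the JSON path only when "safe" is a real boolean.
        if isinstance(data, dict) and isinstance(data.get("safe"), bool):
            return {
                "safe": data["safe"],
                "categories": list(data.get("categories") or []),
                "reasoning": str(data.get("reasoning", "")),
                "parsed_as": "json",
            }
    # Fallback: treat the output as unsafe only if it says "unsafe" as a word.
    unsafe = re.search(r"\bunsafe\b", raw.lower()) is not None
    return {
        "safe": not unsafe,
        "categories": [],
        "reasoning": raw.strip(),
        "parsed_as": "keyword",
    }
```

The fallback here defaults to "safe" when no keyword is found, which matches the fail-open stance described in the design decisions.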

- PAIR: pass full guardrail response dict to add_interaction_trace so the dashboard can detect and render guardrail blocks per iteration
- TAP: return descriptive guardrail marker string from _query_target instead of None so blocked iterations show guardrail info in traces

- Return the structured guardrail response dict instead of string-encoding it as [GUARDRAIL:xxx], so tracker and dashboard handle it properly
- Pass empty string to judges for guardrail-blocked responses (score 0)
- Remove [:500] slice on response in trace recording (tracker handles dicts)
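The judge handling described in this commit might be sketched as follows. judge_input and toy_judge are hypothetical names, generated_text is an assumed response field, and the toy judge only stands in for a real judge model to show why an empty string yields a score of 0.

```python
from typing import Any


def judge_input(response: Any) -> str:
    """Text handed to a judge: empty for guardrail blocks, model text otherwise."""
    if isinstance(response, dict):
        if response.get("adapter_type") == "guardrail":
            return ""  # blocked: the end user saw nothing from the model
        return str(response.get("generated_text", ""))
    return "" if response is None else str(response)


def toy_judge(text: str) -> int:
    """Hypothetical judge: 0 for empty text or an obvious refusal, else 1."""
    if not text.strip() or "i cannot" in text.lower():
        return 0
    return 1
```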

AutoDAN-Turbo:
- Read phase/subphase from content (not step_name) for DB-loaded traces
- Skip bookend traces (PHASE_START/END, SKIP_FINALIZED)
- Detect WARMUP_SUMMARY via phase+subphase instead of step_name
- Group epochs under iteration sub-headers in the renderer

Guardrail display:
- Add legacy [GUARDRAIL:xxx] string-pattern fallback in extractor
- Add guardrail categories to trace data and rendering templates
- Improve guardrail event rendering with structured pre blocks
- Propagate _guardrail_categories through all parsing paths
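The legacy string-pattern fallback might look like the sketch below. The regular expression, the extract_guardrail_event name, and the returned fields are assumptions. The PR does not say what the "xxx" part of [GUARDRAIL:xxx] contained, so it is returned here as an opaque label.

```python
import re
from typing import Any, Optional

# Assumed marker shape: "[GUARDRAIL:<label>]"; the meaning of <label> is not stated in the PR.
_LEGACY_MARKER = re.compile(r"\[GUARDRAIL:([^\]]*)\]")


def extract_guardrail_event(response: Any) -> Optional[dict]:
    """Prefer the structured dict form; fall back to the legacy string marker."""
    if isinstance(response, dict):
        if response.get("adapter_type") != "guardrail":
            return None
        data = response.get("agent_specific_data") or {}
        return {
            "label": data.get("side"),
            "categories": list(data.get("categories") or []),
            "legacy": False,
        }
    if isinstance(response, str):
        match = _LEGACY_MARKER.search(response)
        if match:
            # Legacy strings carry no structured categories.
            return {"label": match.group(1), "categories": [], "legacy": True}
    return None
```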

+                    goal_gen_elapsed[goal] = max(
+                        goal_gen_elapsed.get(goal, 0.0), float(elapsed)
+                    )
+                except (TypeError, ValueError):


codecov · 2026-05-27T11:23:25Z

Codecov Report

❌ Patch coverage is 61.06557% with 95 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| hackagent/router/tracking/coordinator.py | 6.25% | 15 Missing ⚠️ |
| ...ackagent/attacks/techniques/baseline/evaluation.py | 18.75% | 13 Missing ⚠️ |
| hackagent/attacks/techniques/pair/attack.py | 14.28% | 12 Missing ⚠️ |
| hackagent/attacks/techniques/bon/generation.py | 11.11% | 8 Missing ⚠️ |
| hackagent/agent.py | 12.50% | 7 Missing ⚠️ |
| hackagent/cli/commands/attack.py | 16.66% | 5 Missing ⚠️ |
| hackagent/attacks/shared/response_utils.py | 63.63% | 4 Missing ⚠️ |
| ...kagent/attacks/techniques/advprefix/completions.py | 42.85% | 4 Missing ⚠️ |
| ...kagent/attacks/techniques/cipherchat/generation.py | 20.00% | 4 Missing ⚠️ |
| hackagent/attacks/techniques/tap/generation.py | 20.00% | 4 Missing ⚠️ |

... and 8 more


Nicola Franco (franconicola)

LGTM

Raffaele Paolino (RPaolino) added 8 commits May 25, 2026 09:01

feat: guardrail config in run_config

f3aa5ab

feat: added documentation, cli and tui support of guardrails

f9f5d46

fix: prevent TAP attacker from seeing guardrail internals on block

27fb562

Nicola Franco (franconicola) temporarily deployed to feat/guardrails - Docs PR #406 with Render on May 27, 2026 11:08 (status: Destroyed)

github-code-quality Bot found potential problems May 27, 2026


Comment thread hackagent/router/tracking/coordinator.py

    goal_gen_elapsed[goal] = max(
        goal_gen_elapsed.get(goal, 0.0), float(elapsed)
    )
except (TypeError, ValueError):

feat: added unit tests on guardrails

e54cff6

Nicola Franco (franconicola) approved these changes May 29, 2026


Nicola Franco (franconicola) merged commit 22c67c8 into main May 29, 2026
23 checks passed

Nicola Franco (franconicola) deleted the feat/guardrails branch May 29, 2026 16:42

Nicola Franco (franconicola) merged 9 commits into main from feat/guardrails
