Add guardrails support (before/after input-output filters) #406

Merged
Nicola Franco (franconicola) merged 9 commits into main from feat/guardrails on May 29, 2026

Conversation

@RPaolino
Contributor

Summary

Add guardrail infrastructure that allows placing safety classifiers before (input filter) and after (output filter) the target model, mirroring real-world deployment patterns.

Changes

Core

  • hackagent/attacks/shared/guardrail.py: BaseGuardrail abstract class, LLMGuardrail implementation, and create_guardrail_from_config factory
  • hackagent/router/router.py: before_guardrail / after_guardrail hooks on AgentRouter that intercept every route_request call
  • hackagent/agent.py: before_guardrail / after_guardrail params on HackAgent.__init__ to configure guardrails on the target
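
To illustrate the before/after interception described above, here is a minimal sketch. It is not the code from this PR: the `check` method, the `GuardrailVerdict` type, the `KeywordGuardrail` stand-in, the free-standing `route_request` function, and the `generated_text` / `"example-target"` values are all hypothetical. Only the names BaseGuardrail, before_guardrail, after_guardrail, adapter_type, and the agent_specific_data fields (side, categories, reasoning) come from the PR description.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class GuardrailVerdict:
    """Hypothetical result type; the PR does not show the real one."""
    blocked: bool
    reasoning: str = ""
    categories: list = field(default_factory=list)


class BaseGuardrail(ABC):
    """Assumed interface: classify a piece of text as allowed or blocked."""

    @abstractmethod
    def check(self, text: str) -> GuardrailVerdict:
        ...


class KeywordGuardrail(BaseGuardrail):
    """Toy stand-in for LLMGuardrail so the sketch runs without a model."""

    def __init__(self, banned):
        self.banned = [word.lower() for word in banned]

    def check(self, text: str) -> GuardrailVerdict:
        hits = [word for word in self.banned if word in text.lower()]
        return GuardrailVerdict(bool(hits), "matched banned terms" if hits else "", hits)


def _guardrail_block(side: str, verdict: GuardrailVerdict) -> dict:
    # Synthetic response: it did not come from a model, hence the distinct adapter_type.
    return {
        "adapter_type": "guardrail",
        "agent_specific_data": {
            "side": side,
            "categories": verdict.categories,
            "reasoning": verdict.reasoning,
        },
    }


def route_request(prompt, target, before_guardrail=None, after_guardrail=None) -> dict:
    """Hypothetical free-function version of the router hook."""
    if before_guardrail is not None:
        verdict = before_guardrail.check(prompt)
        if verdict.blocked:
            return _guardrail_block("before", verdict)  # target is never called
    text = target(prompt)
    if after_guardrail is not None:
        verdict = after_guardrail.check(text)
        if verdict.blocked:
            return _guardrail_block("after", verdict)  # unsafe text is discarded
    # "example-target" and "generated_text" are placeholder values for this sketch.
    return {"adapter_type": "example-target", "generated_text": text}
```

In this sketch a blocked response never carries the model's text, which matches the stated intent of showing what an end user behind a deployed guardrail would see.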

Attack techniques

  • Unified guardrail response detection via adapter_type == "guardrail". A guardrail block is not an adapter response but a synthetic interception, so a distinct adapter_type value accurately signals that the response did not come from a model. The before-guardrail case loses nothing (the model was never called), and the after-guardrail case intentionally discards the unsafe text to simulate what a real end user would experience behind a deployed guardrail
  • Guardrail block info preserved in trace recordings
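
The detection rule above could look roughly like the following. The helper names is_guardrail_response and get_guardrail_info appear in the commit notes of this PR, but their real signatures and return shapes are not shown, so the bodies here are assumptions. `text_for_judge` and the `generated_text` key are hypothetical.

```python
from typing import Any, Optional


def is_guardrail_response(response: Any) -> bool:
    """True when the response is a synthetic guardrail block, not model output."""
    return isinstance(response, dict) and response.get("adapter_type") == "guardrail"


def get_guardrail_info(response: Any) -> Optional[dict]:
    """Return side/categories/reasoning for a guardrail block, else None."""
    if not is_guardrail_response(response):
        return None
    data = response.get("agent_specific_data") or {}
    return {
        "side": data.get("side"),
        "categories": list(data.get("categories") or []),
        "reasoning": data.get("reasoning", ""),
    }


def text_for_judge(response: Any) -> str:
    """Hypothetical helper: blocked responses reach the judge as an empty string."""
    if is_guardrail_response(response):
        return ""
    if isinstance(response, dict):
        return str(response.get("generated_text", ""))  # assumed key name
    return str(response)
```

Keeping the check in one helper means each attack technique asks the same question instead of inspecting response dicts on its own.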

CLI

  • New --before-guardrail-name, --before-guardrail-type, --before-guardrail-endpoint options (and matching --after-guardrail-*) on all hackagent eval <strategy> subcommands

TUI

  • Two collapsible guardrail sections (Before / After) in the attack form, using the same agent-type choices as the target

Dashboard

  • Guardrail events displayed with side, explanation, and categories

Documentation

  • docs/docs/agents/guardrails.mdx: full documentation page with architecture diagram, configuration fields, CLI usage, and examples
  • Sidebar entry added under Agents

Design decisions

  • Guardrails are configured on HackAgent init (not per-attack) so they defend the target the same way a real deployment would
  • Fail-open: a misconfigured or unreachable guardrail lets traffic through rather than silently blocking it
  • CLI/TUI guardrail options mirror the target agent options (name, type, endpoint) for consistency
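
As a rough illustration of the fail-open decision, the sketch below wraps an arbitrary classifier callable. `check_fail_open`, the `classify` parameter, and the logger name are hypothetical and not taken from the PR; the sketch only shows the policy that an error means "allow".

```python
import logging

logger = logging.getLogger("guardrail_sketch")


def check_fail_open(classify, text: str) -> bool:
    """Return True if `text` should be blocked.

    Any error raised by the classifier (bad config, unreachable endpoint)
    is logged and treated as "allow", which is the fail-open policy.
    """
    try:
        return bool(classify(text))
    except Exception as exc:
        logger.warning("guardrail unavailable, allowing traffic: %s", exc)
        return False
```

The trade-off is that a broken guardrail becomes visible only through the warning, so evaluation results should be read alongside the logs.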

Fixes #356

- Add GuardrailExtractor for parsing guardrail events from agent responses
- Integrate before/after guardrail detection in router
- Track guardrail events in coordinator and tracker
- Update all attack techniques to handle guardrail-blocked responses:
  baseline, advprefix, bon, cipherchat, flipattack, h4rm3l, pap
- Export guardrail utilities from attacks.shared
- Replace guardrail_blocked/guardrail_event with adapter_type: guardrail
- Add is_guardrail_response() and get_guardrail_info() to response_utils
- Update router to emit structured agent_specific_data (side, categories, reasoning)
- Migrate all 10 attack techniques to use canonical detection helper
- Update tracker to detect guardrail responses via adapter_type
- Switch guardrail.py to JSON-structured output parsing with keyword fallback
- PAIR: pass full guardrail response dict to add_interaction_trace so
  the dashboard can detect and render guardrail blocks per iteration
- TAP: return descriptive guardrail marker string from _query_target
  instead of None so blocked iterations show guardrail info in traces
- Return the structured guardrail response dict instead of string-encoding
  it as [GUARDRAIL:xxx], so tracker and dashboard handle it properly
- Pass empty string to judges for guardrail-blocked responses (score 0)
- Remove [:500] slice on response in trace recording (tracker handles dicts)
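
The "JSON-structured output parsing with keyword fallback" mentioned in the list above might be sketched as follows. The JSON schema (`safe`, `categories`, `reasoning`), the function name `parse_guardrail_output`, and the `unsafe` keyword are assumptions for illustration; the PR does not show the actual prompt or schema.

```python
import json
import re


def parse_guardrail_output(raw: str) -> dict:
    """Parse a classifier reply: try JSON first, then fall back to a keyword."""
    # Take everything from the first "{" to the last "}" as a JSON candidate.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            data = None
        # Accept only a dict with a real boolean "safe" field (assumed schema).
        if isinstance(data, dict) and isinstance(data.get("safe"), bool):
            return {
                "blocked": not data["safe"],
                "categories": list(data.get("categories") or []),
                "reasoning": str(data.get("reasoning", "")),
                "parsed_as": "json",
            }
    # Keyword fallback: block only when the reply contains "unsafe".
    return {
        "blocked": "unsafe" in raw.lower(),
        "categories": [],
        "reasoning": "",
        "parsed_as": "keyword",
    }
```

The fallback is deliberately crude: a reply such as "not unsafe" would still be treated as a block, which is one reason structured output is preferred when the model provides it.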

AutoDAN-Turbo:
- Read phase/subphase from content (not step_name) for DB-loaded traces
- Skip bookend traces (PHASE_START/END, SKIP_FINALIZED)
- Detect WARMUP_SUMMARY via phase+subphase instead of step_name
- Group epochs under iteration sub-headers in the renderer

Guardrail display:
- Add legacy [GUARDRAIL:xxx] string-pattern fallback in extractor
- Add guardrail categories to trace data and rendering templates
- Improve guardrail event rendering with structured pre blocks
- Propagate _guardrail_categories through all parsing paths
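
One way to read the extractor notes above is: prefer the structured dict, and fall back to the legacy string marker only for older traces. The sketch below assumes that reading. `extract_guardrail_event` and the returned keys are hypothetical, and since the PR does not say what the `xxx` in `[GUARDRAIL:xxx]` encodes, it is kept as an opaque `label`.

```python
import re
from typing import Any, Optional

# Legacy marker format from older traces: "[GUARDRAIL:xxx]".
_LEGACY_MARKER = re.compile(r"\[GUARDRAIL:([^\]]*)\]")


def extract_guardrail_event(response: Any) -> Optional[dict]:
    """Return a guardrail event dict, or None if the response is not a block."""
    if isinstance(response, dict):
        if response.get("adapter_type") != "guardrail":
            return None
        data = response.get("agent_specific_data") or {}
        return {
            "side": data.get("side"),
            "categories": list(data.get("categories") or []),
            "label": None,
            "legacy": False,
        }
    if isinstance(response, str):
        match = _LEGACY_MARKER.search(response)
        if match:
            # Legacy strings carry no structured categories.
            return {"side": None, "categories": [], "label": match.group(1), "legacy": True}
    return None
```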

Nicola Franco (franconicola) temporarily deployed to feat/guardrails - Docs PR #406 with Render on May 27, 2026 11:08 (deployment destroyed)

    goal_gen_elapsed[goal] = max(
        goal_gen_elapsed.get(goal, 0.0), float(elapsed)
    )
except (TypeError, ValueError):

LGTM

Nicola Franco (franconicola) merged commit 22c67c8 into main on May 29, 2026
23 checks passed
Nicola Franco (franconicola) deleted the feat/guardrails branch on May 29, 2026 16:42
Development

Successfully merging this pull request may close these issues.

Add Guardrails

2 participants