Add guardrails support (before/after input-output filters) #406
Merged
- Add GuardrailExtractor for parsing guardrail events from agent responses
- Integrate before/after guardrail detection in router
- Track guardrail events in coordinator and tracker
- Update all attack techniques to handle guardrail-blocked responses: baseline, advprefix, bon, cipherchat, flipattack, h4rm3l, pap
- Export guardrail utilities from attacks.shared

- Replace guardrail_blocked/guardrail_event with adapter_type: guardrail
- Add is_guardrail_response() and get_guardrail_info() to response_utils
- Update router to emit structured agent_specific_data (side, categories, reasoning)
- Migrate all 10 attack techniques to use canonical detection helper
- Update tracker to detect guardrail responses via adapter_type
- Switch guardrail.py to JSON-structured output parsing with keyword fallback

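The canonical detection described in this commit could look roughly like the sketch below. Only the names is_guardrail_response, get_guardrail_info, adapter_type, and agent_specific_data (with side, categories, reasoning) come from the commit message; the dict layout, defaults, and return shapes are assumptions, not the merged implementation.

```python
from typing import Any, Dict, Optional


def is_guardrail_response(response: Any) -> bool:
    """Return True when the response is a synthetic guardrail block."""
    # Assumption: guardrail blocks are plain dicts tagged with adapter_type.
    return isinstance(response, dict) and response.get("adapter_type") == "guardrail"


def get_guardrail_info(response: Any) -> Optional[Dict[str, Any]]:
    """Return side/categories/reasoning, or None for ordinary responses."""
    if not is_guardrail_response(response):
        return None
    data = response.get("agent_specific_data") or {}
    return {
        "side": data.get("side"),
        "categories": list(data.get("categories") or []),
        "reasoning": data.get("reasoning", ""),
    }


# Hypothetical example payload, for illustration only.
blocked = {
    "adapter_type": "guardrail",
    "agent_specific_data": {
        "side": "before",
        "categories": ["violence"],
        "reasoning": "prompt requests harmful instructions",
    },
}
info = get_guardrail_info(blocked)
```

Keeping detection in one helper means each attack technique asks the same question of a response instead of re-implementing string or key checks.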
- PAIR: pass full guardrail response dict to add_interaction_trace so the dashboard can detect and render guardrail blocks per iteration
- TAP: return descriptive guardrail marker string from _query_target instead of None so blocked iterations show guardrail info in traces

- Return the structured guardrail response dict instead of string-encoding it as [GUARDRAIL:xxx], so tracker and dashboard handle it properly
- Pass empty string to judges for guardrail-blocked responses (score 0)
- Remove [:500] slice on response in trace recording (tracker handles dicts)

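As a rough illustration of the judge handling in this commit, the sketch below maps a guardrail-blocked response to an empty string before judging. The function name text_for_judge and the generated_text key are hypothetical; only the adapter_type == "guardrail" marker and the empty-string rule come from the commit messages.

```python
from typing import Any


def text_for_judge(response: Any) -> str:
    """Return the text a judge should score for a given target response."""
    if isinstance(response, dict):
        if response.get("adapter_type") == "guardrail":
            # Blocked by a guardrail: the judge sees nothing to score.
            return ""
        # Assumption: ordinary dict responses carry text under "generated_text".
        return str(response.get("generated_text", ""))
    return str(response)
```

Because the dict itself is returned to the tracker and only the judge input is flattened, the trace keeps the structured guardrail data while the judge scores an empty completion.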
AutoDAN-Turbo:
- Read phase/subphase from content (not step_name) for DB-loaded traces
- Skip bookend traces (PHASE_START/END, SKIP_FINALIZED)
- Detect WARMUP_SUMMARY via phase+subphase instead of step_name
- Group epochs under iteration sub-headers in the renderer

Guardrail display:
- Add legacy [GUARDRAIL:xxx] string-pattern fallback in extractor
- Add guardrail categories to trace data and rendering templates
- Improve guardrail event rendering with structured pre blocks
- Propagate _guardrail_categories through all parsing paths

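The legacy string-pattern fallback mentioned above might be sketched as follows. The function name extract_legacy_guardrail_marker is hypothetical, and what the xxx payload encodes is not stated in the commit, so the sketch simply returns it verbatim.

```python
import re
from typing import Optional

# Matches the legacy marker form "[GUARDRAIL:xxx]" anywhere in a string.
_LEGACY_GUARDRAIL = re.compile(r"\[GUARDRAIL:([^\]]*)\]")


def extract_legacy_guardrail_marker(text: object) -> Optional[str]:
    """Return the payload inside a legacy marker, or None if absent."""
    if not isinstance(text, str):
        return None
    match = _LEGACY_GUARDRAIL.search(text)
    return match.group(1).strip() if match else None
```

A fallback like this would only run when the response is not already a structured guardrail dict, so older stored traces still render as guardrail blocks.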
    goal_gen_elapsed[goal] = max(
        goal_gen_elapsed.get(goal, 0.0), float(elapsed)
    )
except (TypeError, ValueError):
Summary
Add guardrail infrastructure that allows placing safety classifiers before (input filter) and after (output filter) the target model, mirroring real-world deployment patterns.
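The before/after placement can be pictured with a minimal sketch. It is not the AgentRouter code: guarded_call, the callable-based guardrail signature, and the generated_text key are assumptions for illustration; only the adapter_type == "guardrail" marker and the side field mirror names used in this PR.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical signature: a guardrail returns True when the text is unsafe.
Guardrail = Callable[[str], bool]


def _blocked(side: str) -> Dict[str, Any]:
    """Build a synthetic response marking a guardrail interception."""
    return {"adapter_type": "guardrail", "agent_specific_data": {"side": side}}


def guarded_call(
    prompt: str,
    target: Callable[[str], str],
    before: Optional[Guardrail] = None,
    after: Optional[Guardrail] = None,
) -> Dict[str, Any]:
    # Input filter: if it fires, the target model is never called.
    if before is not None and before(prompt):
        return _blocked("before")
    output = target(prompt)
    # Output filter: if it fires, the unsafe text is discarded.
    if after is not None and after(output):
        return _blocked("after")
    return {"adapter_type": "model", "generated_text": output}
```

In both blocked cases the caller receives the same kind of synthetic response, which is what an end user behind a deployed guardrail would see in place of model output.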
Changes
Core
- hackagent/attacks/shared/guardrail.py: BaseGuardrail abstract class, LLMGuardrail implementation, and create_guardrail_from_config factory
- hackagent/router/router.py: before_guardrail/after_guardrail hooks on AgentRouter that intercept every route_request call
- hackagent/agent.py: before_guardrail/after_guardrail params on HackAgent.__init__ to configure guardrails on the target

Attack techniques
adapter_type == "guardrail": a guardrail block is semantically not an adapter response but a synthetic interception, so giving it a distinct adapter_type value accurately communicates "this didn't come from a model." The before-guardrail case loses nothing (the model was never called), and the after-guardrail case intentionally discards the unsafe text to faithfully simulate what a real end user would experience behind a deployed guardrail.

CLI
--before-guardrail-name, --before-guardrail-type, --before-guardrail-endpoint options (and matching --after-guardrail-*) on all hackagent eval <strategy> subcommands

TUI
Dashboard
Documentation
docs/docs/agents/guardrails.mdx: full documentation page with architecture diagram, configuration fields, CLI usage, and examples

Design decisions
HackAgent init (not per-attack) so they defend the target the same way a real deployment would

Fixes #356