UX improvements #408
Merged
- Add GuardrailExtractor for parsing guardrail events from agent responses
- Integrate before/after guardrail detection in router
- Track guardrail events in coordinator and tracker
- Update all attack techniques to handle guardrail-blocked responses: baseline, advprefix, bon, cipherchat, flipattack, h4rm3l, pap
- Export guardrail utilities from attacks.shared
- Replace guardrail_blocked/guardrail_event with adapter_type: guardrail
- Add is_guardrail_response() and get_guardrail_info() to response_utils
- Update router to emit structured agent_specific_data (side, categories, reasoning)
- Migrate all 10 attack techniques to use canonical detection helper
- Update tracker to detect guardrail responses via adapter_type
- Switch guardrail.py to JSON-structured output parsing with keyword fallback
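A minimal sketch of what the canonical detection helpers might look like. The function names come from the commit message above, but the signatures, the response dict layout, and the exact keys under `agent_specific_data` are assumptions made for illustration, not the code in response_utils.

```python
from typing import Any, Dict, Optional


def is_guardrail_response(response: Any) -> bool:
    """Return True when the response is a dict tagged adapter_type == "guardrail".

    Assumed layout: the router marks blocked responses with this key.
    """
    return isinstance(response, dict) and response.get("adapter_type") == "guardrail"


def get_guardrail_info(response: Any) -> Optional[Dict[str, Any]]:
    """Extract side, categories, and reasoning from a guardrail response.

    Returns None for anything that is not a guardrail response.
    """
    if not is_guardrail_response(response):
        return None
    # agent_specific_data is assumed to be a (possibly missing) dict.
    data = response.get("agent_specific_data") or {}
    return {
        "side": data.get("side"),
        "categories": list(data.get("categories") or []),
        "reasoning": data.get("reasoning"),
    }


# Hypothetical blocked response, shaped as assumed above.
blocked = {
    "adapter_type": "guardrail",
    "agent_specific_data": {
        "side": "before",
        "categories": ["example_category"],
        "reasoning": "example reasoning",
    },
}
info = get_guardrail_info(blocked)
```

Routing every attack technique through one helper of this kind means a change to the response shape only has to be handled in one place.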
- PAIR: pass full guardrail response dict to add_interaction_trace so the dashboard can detect and render guardrail blocks per iteration
- TAP: return descriptive guardrail marker string from _query_target instead of None so blocked iterations show guardrail info in traces
- Add guardrail event rendering in trace views (before/after blocks)
- Add two-panel History run dialog with config chips and metrics
- Add attack-specific trace parsing and rendering for all attack types
- Add category/subcategory grouping in goal lists
- Add compact goal cards with color-coded borders
When goal_batch_workers > 1, each goal gets its own attack instance with _goal_index_offset. The tracker creates goal contexts at that offset, but generation.execute() and evaluation.execute() used enumerate(goals) starting at 0 to look up contexts. For any goal with offset != 0 this returned None, silently skipping Candidate/Summary traces and tap_judge evaluations.
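The off-by-offset lookup described above can be reproduced in a few lines. This is a simplified sketch: `pair_goals_with_contexts` and the plain dict standing in for the tracker are hypothetical, and only the idea of starting the enumeration at the goal index offset reflects the commit message.

```python
from typing import Any, Dict, List, Optional, Tuple


def pair_goals_with_contexts(
    goals: List[str],
    contexts: Dict[int, Any],
    goal_index_offset: int = 0,
) -> List[Tuple[str, Optional[Any]]]:
    """Pair each goal with its tracker context, honoring the batch offset.

    `contexts` stands in for the tracker: it maps a global goal index to
    the context created for that goal.
    """
    return [
        (goal, contexts.get(index))
        for index, goal in enumerate(goals, start=goal_index_offset)
    ]


# A worker handling the goal at global index 2 has its context stored at key 2.
contexts = {2: "context-for-goal-2"}

# Enumerating from 0 misses the context, so traces would be skipped silently.
broken = pair_goals_with_contexts(["goal"], contexts)

# Enumerating from the offset finds it.
fixed = pair_goals_with_contexts(["goal"], contexts, goal_index_offset=2)
```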
- Return the structured guardrail response dict instead of string-encoding it as [GUARDRAIL:xxx], so tracker and dashboard handle it properly
- Pass empty string to judges for guardrail-blocked responses (score 0)
- Remove [:500] slice on response in trace recording (tracker handles dicts)
Ensures the goal index offset is passed through to both TAP pipeline steps so multi-batch goal evaluation uses the correct tracker context.
- Call _update_tracker() after _sync_to_server() so each prefix gets an evaluation trace with its score in the DB
- Embed prefix text in evaluation trace metadata so the dashboard can attribute jailbreaks to specific prefixes
AutoDAN-Turbo:
- Read phase/subphase from content (not step_name) for DB-loaded traces
- Skip bookend traces (PHASE_START/END, SKIP_FINALIZED)
- Detect WARMUP_SUMMARY via phase+subphase instead of step_name
- Group epochs under iteration sub-headers in the renderer
Guardrail display:
- Add legacy [GUARDRAIL:xxx] string-pattern fallback in extractor
- Add guardrail categories to trace data and rendering templates
- Improve guardrail event rendering with structured pre blocks
- Propagate _guardrail_categories through all parsing paths
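A rough sketch of how an extractor could prefer the structured dict and fall back to the legacy string marker. The function name `extract_guardrail`, the returned dict shape, and the regular expression are assumptions. The text after `GUARDRAIL:` is treated as an opaque label because its real format is not stated here.

```python
import re
from typing import Any, Dict, Optional

# Matches the legacy marker, e.g. "[GUARDRAIL:xxx]"; the payload is kept opaque.
LEGACY_GUARDRAIL_RE = re.compile(r"\[GUARDRAIL:([^\]]*)\]")


def extract_guardrail(response: Any) -> Optional[Dict[str, Any]]:
    """Return guardrail details from a structured or legacy response, else None."""
    if isinstance(response, dict):
        if response.get("adapter_type") != "guardrail":
            return None
        # Assumed layout: categories live under agent_specific_data.
        data = response.get("agent_specific_data") or {}
        return {"source": "structured", "categories": list(data.get("categories") or [])}
    if isinstance(response, str):
        match = LEGACY_GUARDRAIL_RE.search(response)
        if match:
            return {"source": "legacy", "label": match.group(1)}
    return None


legacy = extract_guardrail("[GUARDRAIL:xxx] request blocked")
structured = extract_guardrail(
    {"adapter_type": "guardrail", "agent_specific_data": {"categories": ["example_category"]}}
)
```

Checking the dict form first keeps the string pattern as a pure compatibility path for traces recorded before the structured format existed.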
History tab — Run list:
- Replace pagination with infinite scroll ("Load more" button)
- Add filter bar: search, agent, attack type, and status dropdowns
- Load all runs upfront and filter client-side for instant feedback
History tab — Run detail dialog:
- Add goal filter bar with search, status, and category dropdowns
- Preserve original goal numbering when filters are applied
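One way to keep the original goal numbering under filtering is to number the goals before any filter runs and carry that number with each goal. The sketch below is illustrative only: `filter_goals` and the `text`/`status` keys are hypothetical, and it uses Python even though the dashboard's client-side filtering may be implemented differently.

```python
from typing import Dict, List, Optional, Tuple


def filter_goals(
    goals: List[Dict[str, str]],
    search: str = "",
    status: Optional[str] = None,
) -> List[Tuple[int, Dict[str, str]]]:
    """Filter goals while keeping each goal's original 1-based position."""
    needle = search.lower()
    # Number first, filter second, so surviving goals keep their numbers.
    numbered = enumerate(goals, start=1)
    return [
        (number, goal)
        for number, goal in numbered
        if needle in goal["text"].lower()
        and (status is None or goal["status"] == status)
    ]


goals = [
    {"text": "First goal", "status": "mitigated"},
    {"text": "Second goal", "status": "jailbroken"},
    {"text": "Third goal", "status": "jailbroken"},
]
visible = filter_goals(goals, status="jailbroken")
visible_numbers = [number for number, _ in visible]
```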
Two bugs caused per-prefix/per-template detail rows to always display 'Mitigated' even when the goal was successfully jailbroken:
1. AdvPrefix: The Evaluation step's config_keys was missing '_tracker', so no evaluation traces were created. The dashboard matches completion traces to evaluation traces by prefix string to determine which rows are jailbreaks. Without traces, all rows defaulted to 'mitigated'.
2. Baseline: The dashboard's _parse_baseline_traces hardcoded the evaluator name 'baseline_pattern_evaluator', but when using LLM judges (the default), the evaluator name is 'baseline_llm_judge'. The eval trace was never matched, so all rows defaulted to 'mitigated'.
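The Baseline mismatch can be illustrated with a small lookup that accepts either evaluator name. This is a sketch under assumptions, not the fix applied in this PR: the two evaluator names come from the description above, while `find_baseline_eval_trace` and the `evaluator` key on each trace are hypothetical.

```python
from typing import Any, Dict, List, Optional

# Both evaluator names mentioned in the bug description.
BASELINE_EVALUATORS = {"baseline_pattern_evaluator", "baseline_llm_judge"}


def find_baseline_eval_trace(traces: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the first trace produced by a known baseline evaluator, if any."""
    for trace in traces:
        if trace.get("evaluator") in BASELINE_EVALUATORS:
            return trace
    return None


traces = [
    {"evaluator": "some_other_step"},
    {"evaluator": "baseline_llm_judge", "jailbroken": True},
]
match = find_baseline_eval_trace(traces)
```

Matching against a set instead of a single hardcoded string means the default LLM-judge path and the pattern-evaluator path resolve to the same evaluation trace.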
for depth_level in sorted(by_depth.keys()):
    depth_nodes = by_depth[depth_level]
    _ds = (depth_stats or {}).get(depth_level, {})

if score_raw is not None:
    try:
        step["score"] = float(score_raw)
    except (TypeError, ValueError):

if score_delta_raw is not None:
    try:
        step["score_delta"] = float(score_delta_raw)
    except (TypeError, ValueError):
Summary
Major overhaul of the local dashboard and introduction of guardrails infrastructure for the attack pipeline.
Dashboard Improvements
Fixes #354