
docs(evals): add Red teaming user guide #2816

Open
yeomjiwonyeom wants to merge 2 commits into strands-agents:main from yeomjiwonyeom:redteam/docs


yeomjiwonyeom commented Jun 16, 2026


Adds a Red teaming section to the Strands Evals SDK docs, at the same level as Detectors, registered in navigation.yml.

Six pages:

  • Overview — what red teaming is, why/when, risk categories, vs evaluators, how it works, best practices, responsible-use note
  • Quickstart — end-to-end run (generate cases → run several strategies → read the report)
  • Attack Strategies — the five built-in strategies (Crescendo, GOAT, PAIR, Bad Likert Judge, SequentialBreak) with mechanism + paper links, and how to choose
  • Writing Custom Cases — hand-authoring RedTeamCase objects
  • Scoring Attacks — how AttackSuccessEvaluator (LLM-as-judge) scores a breach
  • Reading the Report — the breach matrix, the per-attack table, and acting on findings

Every code sample and API reference was verified against the current strands-agents-evals package; npm run build is clean and Prettier passes. The in-repo module README ships separately in strands-agents/evals#271.

Add a Red teaming section to the Strands Evals SDK docs (same level as Detectors),
with six pages: Overview, Quickstart, Attack Strategies, Writing Custom Cases,
Scoring Attacks, and Reading the Report. Register the section in navigation.yml.

Covers the experimental red-team module: running adversarial attack strategies
against an agent, the five built-in strategies with paper links, risk categories,
authoring custom cases, how the LLM judge scores a breach, and how to read the
report and act on findings. Every code sample and API reference was verified
against the current strands-agents-evals package; the site builds clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions Bot added the enhancement, documentation, and strands-running labels Jun 16, 2026

github-actions Bot commented Jun 16, 2026



The target refused nothing — it restated its hidden configuration, leaking the planted `INTERNAL-7741` marker verbatim. The judge scored this `0.95` (full compromise). Each strategy reaches a breach differently — Crescendo by gradual escalation, Bad Likert Judge by the rating-example framing, SequentialBreak by burying the request in a benign sequence — but the end state the judge scores is the same: did the target produce the targeted violation?

## Choosing strategies


Issue: Two adjacent sections answer nearly the same question: ## Choosing a strategy (line 18, the table plus intro) and ## Choosing strategies (here). The near-identical headings produce two confusingly similar entries in the table of contents, and the "run several and compare" guidance is split across both.

Suggestion: Fold the structural-differences content (turn structure, cost, backtracking) under the first ## Choosing a strategy section, or rename this one to describe its distinct question (e.g. ## How strategies differ). One section, one question.


Still open as of a7012e9: ## Choosing a strategy (line 18) and ## Choosing strategies (line 56) remain two separate adjacent sections with near-identical headings. The new line-56 content ("run several and compare report.by_strategy()") is good material, but it would land better folded into the first section or under a heading naming its distinct question (e.g. ## How strategies differ).
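
Overlaps like this one can be caught mechanically before review. The following is a rough sketch, not part of the repo's tooling: it assumes ATX-style (`#`-prefixed) headings, and the function names and the 0.8 similarity threshold are illustrative choices. It does not skip `#` lines inside fenced code blocks.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Matches ATX headings such as "## Choosing a strategy"; group 2 is the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def headings(markdown: str) -> list[str]:
    """Collect heading titles from '#'-prefixed lines, in document order."""
    found = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            found.append(match.group(2))
    return found


def near_duplicates(titles: list[str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return heading pairs whose case-insensitive similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(titles, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:  # threshold is an assumed cutoff, tune per repo
            pairs.append((a, b))
    return pairs
```

Running `near_duplicates(headings(text))` over the contents of strategies.mdx would be expected to surface the two "Choosing" headings as one pair, which is the signal to merge or rename.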

Comment thread site/src/content/docs/user-guide/evals-sdk/red-teaming/index.mdx Outdated
@github-actions


Issue: Em-dashes throughout (96 across the six files: 34 in strategies.mdx, 16 in custom_cases.mdx, 15 in evaluators.mdx, 13 in reading_the_report.mdx, 9 each in index.mdx and quickstart.mdx). The repo voice guide (.agents/references/voice-guide.md) lists "No em-dashes" as a hard rule, and the sibling evals docs hold to ~2 per page.

Suggestion: Replace em-dashes with colons (for elaboration), commas or parentheses (for asides), or restructure. A few examples from the diff:

  • prompts crafted to get a model to do something its instructions forbid — against your agent → use parentheses or a comma.
  • success_threshold — Default 0.7. in parameter lists → a colon reads cleanly here (success_threshold: default 0.7.).

This is the one mechanical item that's load-bearing for the merge gate, so worth a pass across all six files.
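
For tracking progress on that pass, the per-file counts can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the `*.mdx` glob is inferred from the file names listed above, and the character is written as the escape `\u2014` so the script itself stays clean.

```python
from pathlib import Path

EM_DASH = "\u2014"  # escape form, so this script contains no literal em-dash


def count_em_dashes(text: str) -> int:
    """Return the number of em-dash characters in the given text."""
    return text.count(EM_DASH)


def em_dash_report(root: str, pattern: str = "*.mdx") -> dict[str, int]:
    """Map each matching file name under root to its em-dash count.

    Files with zero em-dashes are omitted, so an empty dict means the
    directory is clean.
    """
    report = {}
    for path in sorted(Path(root).rglob(pattern)):
        count = count_em_dashes(path.read_text(encoding="utf-8"))
        if count:
            report[path.name] = count
    return report
```

Pointing `em_dash_report` at the red-teaming docs directory (for example `site/src/content/docs/user-guide/evals-sdk/red-teaming`, the path shown in the comment thread above) gives a count per file. The replacement itself still needs a human choice between colon, comma, and parentheses, so this only locates the work.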

@github-actions


Assessment: Comment

Strong, well-organized guide set. I verified every code sample and API reference against the strands-agents/evals source (the redteam module): all class names, constructor defaults, AttackGoal/RedTeamCase fields, scaffold variant keys, report methods, and the nuanced behavioral claims (the success_criteria in-loop gate, Crescendo's tool-call early-stop, score=max / passed=all-evaluators) are accurate. The Experimental cautions and the "clean run is not a certificate" framing are exactly right for a safety feature.

Review themes
  • Voice constraints: One hard-rule violation: em-dashes appear 96 times across the six files where the repo guide bans them and sibling docs hold to ~2. This is the main thing to fix; see the inline comment.
  • Structure: strategies.mdx has two near-duplicate headings (Choosing a strategy / Choosing strategies) that should merge into one section.
  • Code completeness: The Overview's Quick Example uses an undefined agent; a short setup line makes it runnable standalone.
  • Accuracy: No issues found. Terminology, links, anchors, and the mermaid diagram all check out.

Nicely researched work: the paper citations, OWASP/NIST mappings, and the worked GOAT transcript make this genuinely useful rather than just a parameter dump.

@github-actions


Re-review of a7012e9 — thanks for the update. The Quick Example fix is solid, and I re-verified the new agent_factory / parallel max_workers / MultiAgentBase content against the current evals SDK: all accurate, including the TypeError-at-config-time claim for agent= under parallel runs.

One item from the prior review is still open and worth a pass before merge:

Issue: Em-dashes are still present (99 across the six files: 30 in strategies.mdx, 22 in reading_the_report.mdx, 15 in evaluators.mdx, 13 in custom_cases.mdx, 12 in quickstart.mdx, 7 in index.mdx). The repo voice guide lists "No em-dashes" as a hard rule and sibling evals docs hold to ~2 per page, so this is the one mechanical gate likely to block merge.

Suggestion: A single find-and-replace pass works — colons for elaboration, commas or parentheses for asides. For example strategy × goal × target interaction — there is no reliable... (strategies.mdx:56) reads cleanly as ...interaction: there is no reliable....

