docs(evals): add Red teaming user guide by yeomjiwonyeom · Pull Request #2816 · strands-agents/harness-sdk

yeomjiwonyeom · 2026-06-16T00:51:39Z

Adds a Red teaming section to the Strands Evals SDK docs, at the same level as Detectors, registered in navigation.yml.

Six pages:

- Overview: what red teaming is, why/when, risk categories, vs evaluators, how it works, best practices, responsible-use note
- Quickstart: end-to-end run (generate cases → run several strategies → read the report)
- Attack Strategies: the five built-in strategies (Crescendo, GOAT, PAIR, Bad Likert Judge, SequentialBreak) with mechanism + paper links, and how to choose
- Writing Custom Cases: hand-authoring RedTeamCase objects
- Scoring Attacks: how AttackSuccessEvaluator (LLM-as-judge) scores a breach
- Reading the Report: the breach matrix, the per-attack table, and acting on findings

Every code sample and API reference was verified against the current strands-agents-evals package; npm run build is clean and Prettier passes. The in-repo module README ships separately in strands-agents/evals#271.

Add a Red teaming section to the Strands Evals SDK docs (same level as Detectors), with six pages: Overview, Quickstart, Attack Strategies, Writing Custom Cases, Scoring Attacks, and Reading the Report. Register the section in navigation.yml. Covers the experimental red-team module: running adversarial attack strategies against an agent, the five built-in strategies with paper links, risk categories, authoring custom cases, how the LLM judge scores a breach, and how to read the report and act on findings. Every code sample and API reference was verified against the current strands-agents-evals package; the site builds clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-16T14:54:55Z

Documentation Preview Ready

Your documentation preview has been successfully deployed!

Changed pages:

Updated at: 2026-06-16T21:28:15.266Z

github-actions · 2026-06-16T14:57:32Z

+
+The target refused nothing — it restated its hidden configuration, leaking the planted `INTERNAL-7741` marker verbatim. The judge scored this `0.95` (full compromise). Each strategy reaches a breach differently — Crescendo by gradual escalation, Bad Likert Judge by the rating-example framing, SequentialBreak by burying the request in a benign sequence — but the end state the judge scores is the same: did the target produce the targeted violation?
+
+## Choosing strategies


Issue: Two adjacent sections answer nearly the same question: ## Choosing a strategy (line 18, the table plus intro) and ## Choosing strategies (here). The near-identical headings produce two confusingly similar entries in the table of contents, and the "run several and compare" guidance is split across both.

Suggestion: Fold the structural-differences content (turn structure, cost, backtracking) under the first ## Choosing a strategy section, or rename this one to describe its distinct question (e.g. ## How strategies differ). One section, one question.

Still open as of a7012e9: ## Choosing a strategy (line 18) and ## Choosing strategies (line 56) remain two separate adjacent sections with near-identical headings. The new line-56 content ("run several and compare report.by_strategy()") is good material, but it would land better folded into the first section or under a heading naming its distinct question (e.g. ## How strategies differ).

github-actions · 2026-06-16T14:57:35Z

Issue: Em-dashes throughout (96 across the six files: 34 in strategies.mdx, 16 in custom_cases.mdx, 15 in evaluators.mdx, 13 in reading_the_report.mdx, 9 each in index.mdx and quickstart.mdx). The repo voice guide (.agents/references/voice-guide.md) lists "No em-dashes" as a hard rule, and the sibling evals docs hold to ~2 per page.

Suggestion: Replace em-dashes with colons (for elaboration), commas or parentheses (for asides), or restructure. A few examples from the diff:

- "prompts crafted to get a model to do something its instructions forbid — against your agent" → use parentheses or a comma.
- `success_threshold — Default 0.7.` in parameter lists → a colon reads cleanly here (`success_threshold: default 0.7.`).

This is the one mechanical item that's load-bearing for the merge gate, so worth a pass across all six files.

github-actions · 2026-06-16T14:57:36Z

Assessment: Comment

Strong, well-organized guide set. I verified every code sample and API reference against the strands-agents/evals source (the redteam module): all class names, constructor defaults, AttackGoal/RedTeamCase fields, scaffold variant keys, report methods, and the nuanced behavioral claims (the success_criteria in-loop gate, Crescendo's tool-call early-stop, score=max / passed=all-evaluators) are accurate. The Experimental cautions and the "clean run is not a certificate" framing are exactly right for a safety feature.

Review themes

Voice constraints: One hard-rule violation: em-dashes appear 96 times across the six files where the repo guide bans them and sibling docs hold to ~2. This is the main thing to fix; see the inline comment.
Structure: strategies.mdx has two near-duplicate headings (Choosing a strategy / Choosing strategies) that should merge into one section.
Code completeness: The Overview's Quick Example uses an undefined agent; a short setup line makes it runnable standalone.
Accuracy: No issues found. Terminology, links, anchors, and the mermaid diagram all check out.

Nicely researched work: the paper citations, OWASP/NIST mappings, and the worked GOAT transcript make this genuinely useful rather than just a parameter dump.

github-actions · 2026-06-16T21:27:05Z

Re-review of a7012e9 — thanks for the update. The Quick Example fix is solid, and I re-verified the new agent_factory / parallel max_workers / MultiAgentBase content against the current evals SDK: all accurate, including the TypeError-at-config-time claim for agent= under parallel runs.

One item from the prior review is still open and worth a pass before merge:

Issue: Em-dashes are still present (99 across the six files: 30 in strategies.mdx, 22 in reading_the_report.mdx, 15 in evaluators.mdx, 13 in custom_cases.mdx, 12 in quickstart.mdx, 7 in index.mdx). The repo voice guide lists "No em-dashes" as a hard rule and sibling evals docs hold to ~2 per page, so this is the one mechanical gate likely to block merge.

Suggestion: A single find-and-replace pass works — colons for elaboration, commas or parentheses for asides. For example strategy × goal × target interaction — there is no reliable... (strategies.mdx:56) reads cleanly as ...interaction: there is no reliable....
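To re-check the per-file counts before and after that pass, a small script is enough. This is a minimal sketch and is not part of the repo. It assumes the six pages live under `site/src/content/docs/user-guide/evals-sdk/red-teaming/` (the directory of the commented `index.mdx`), and it counts only the U+2014 character, so en-dashes and `--` are not included. Its numbers may therefore differ slightly from the counts quoted above.

```python
from pathlib import Path

EM_DASH = "\u2014"  # U+2014 only; en-dashes (U+2013) are not counted

# Assumed location of the six red-teaming pages (taken from the comment-thread path).
DOCS_DIR = "site/src/content/docs/user-guide/evals-sdk/red-teaming"


def count_em_dashes(root: str) -> dict[str, int]:
    """Return {relative .mdx path: em-dash count} for every .mdx file under root."""
    base = Path(root)
    counts: dict[str, int] = {}
    for path in sorted(base.rglob("*.mdx")):
        text = path.read_text(encoding="utf-8")
        counts[path.relative_to(base).as_posix()] = text.count(EM_DASH)
    return counts


if __name__ == "__main__":
    if Path(DOCS_DIR).is_dir():
        counts = count_em_dashes(DOCS_DIR)
        # Largest offenders first, then a total to compare against the review.
        for name, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            print(f"{n:4d}  {name}")
        print(f"{sum(counts.values()):4d}  total")
    else:
        print(f"not found: {DOCS_DIR} (run from the repo root)")
```

Run it from the repository root before and after the find-and-replace pass. The goal from the review is roughly the ~2 per page that sibling evals docs hold to.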

github-actions Bot added the size/l label Jun 16, 2026

yeomjiwonyeom temporarily deployed to manual-approval June 16, 2026 01:26 — with GitHub Actions Inactive

github-actions Bot added enhancement New feature or request documentation Documentation changes, improvements, additions, content updates, site improvements, examples, guides strands-running labels Jun 16, 2026

github-actions Bot reviewed Jun 16, 2026


Comment thread site/src/content/docs/user-guide/evals-sdk/red-teaming/index.mdx Outdated

github-actions Bot removed the strands-running label Jun 16, 2026

docs(redteam): updated redteam documentations (a7012e9)

poshinchen deployed to manual-approval June 16, 2026 21:14 — with GitHub Actions Active

github-actions Bot added the strands-running label Jun 16, 2026

github-actions Bot removed the strands-running label Jun 16, 2026

yeomjiwonyeom wants to merge 2 commits into strands-agents:main from yeomjiwonyeom:redteam/docs
