diff --git a/README.md b/README.md
index cdaf2b7..e27ed4a 100644
--- a/README.md
+++ b/README.md
@@ -33,7 +33,7 @@ APTS is not a testing methodology. It complements PTES, OWASP WSTG, and OSSTMM b
 - **Tier 2 (Verified)**: 85 additional (157 cumulative). Full transparency, tamper-proof audit trails, and independently verifiable findings.
 - **Tier 3 (Comprehensive)**: 16 additional (173 cumulative). Highest assurance for critical infrastructure and L4 autonomous operations.
 
-Fifteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance.
+Sixteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance.
 
 APTS has no certification body, no mandatory third-party audit, and no fee. Platforms are assessed against the requirements and conformance is documented. The standard does not prescribe who performs the assessment; internal self-assessment, independent internal review, and external third-party assessment are all valid approaches, and the choice is left to the reader.
diff --git a/index.md b/index.md
index e9c34e9..0ca6e75 100644
--- a/index.md
+++ b/index.md
@@ -44,7 +44,7 @@ APTS is not a testing methodology. It complements PTES, OWASP WSTG, and OSSTMM b
 - **Tier 2 (Verified)**: 85 additional (157 cumulative). Full transparency, tamper-proof audit trails, and independently verifiable findings.
 - **Tier 3 (Comprehensive)**: 16 additional (173 cumulative). Highest assurance for critical infrastructure and L4 autonomous operations.
 
-Fifteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance.
+Sixteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance.
 
 APTS has no certification body, no mandatory third-party audit, and no fee. Platforms are assessed against the requirements and conformance is documented. The standard does not prescribe who performs the assessment; internal self-assessment, independent internal review, and external third-party assessment are all valid approaches, and the choice is left to the reader.
diff --git a/standard/6_Manipulation_Resistance/Implementation_Guide.md b/standard/6_Manipulation_Resistance/Implementation_Guide.md
index 9a283cc..74cb9f4 100644
--- a/standard/6_Manipulation_Resistance/Implementation_Guide.md
+++ b/standard/6_Manipulation_Resistance/Implementation_Guide.md
@@ -403,6 +403,43 @@ Implement the post-fine-tuning out-of-distribution audit as a separate evaluatio
 
 ---
 
+### APTS-MR-A02: Sandbagging Detection and Behavioral Consistency Validation
+
+> This section provides implementation guidance for the advisory practice [APTS-MR-A02](../appendix/Advisory_Requirements.md#apts-mr-a02-sandbagging-detection-and-behavioral-consistency-validation-advisory). It is not required for conformance at any tier.
+
+**Implementation:** Build the consistency measurement as an independent pipeline that ingests, on one side, per-scenario records from the adversarial eval suite of APTS-MR-020 and, on the other, a stratified sample of production engagement transcripts captured under APTS-AR-001. Each measurement run is anchored to a single model snapshot identifier (per APTS-AR-019) and produces a structured differential report — per-dimension deltas, sample sizes, and a verbalized-recognition correlation — committed to the audit trail.
+
+**Architecture Pattern — Differential Measurement Pipeline:**
+
+The pipeline has four components, mirroring the structure used in APTS-MR-A01:
+
+1. **Eval-side ingestor.** Reads per-scenario records emitted by the MR-020 eval suite for the model snapshot under measurement. The ingestor records, per scenario, the agent's terminal action, tool-call trace, and reasoning trace if available.
+2. **Production-side sampler.** Draws a stratified sample of production transcripts from the audit store (APTS-AR-001), applying redactions consistent with APTS-TP-013 sensitive-data handling. The sampler MUST NOT relax retention or consent boundaries to obtain samples; where sampling is constrained, the limitation is reported in the differential report rather than worked around.
+3. **Differential scorer.** Computes per-dimension metrics on each sample and the delta between them. Where rule-based scoring is insufficient (for example, refusal-rate classification on free-form tool invocations), an LLM judge is used; the judge MUST run as a separate inference call with no shared context with the agent under test (same constraint as APTS-MR-A01 and APTS-RP-A01) and SHOULD use a different model family.
+4. **Trend reporter.** Aggregates the differential across model snapshots tracked under APTS-AR-019 so that the change in the differential itself is observable, writes a signed run record to the audit trail per APTS-AR-012, and emits the validation report for inclusion alongside the customer engagement record.
+
+**Stratified Sampling:**
+
+Stratify the production-side sample across at least three dimensions: autonomy level (L1–L4), engagement size class, and customer-environment class (for example, web application, internal network, cloud control plane). Document a minimum per-dimension sample size below which point-estimate thresholds do not apply; for dimensions with sparse samples, report a confidence-interval band rather than a single point estimate and compare the band's conservative bound against the threshold (mirroring the calibrated-threshold treatment in APTS-MR-A01).
+
+**Verbalized Eval-Recognition Audit:**
+
+Where reasoning traces are available, scan eval-side transcripts for evaluation-context tokens or situational-awareness markers (Laine et al., 2024, *Situational Awareness Dataset*; Anthropic Sonnet 4.5 / Opus 4.6 system-card methodology) and compute the correlation between verbalized recognition and the per-scenario behavioral classification. A non-zero correlation is informative even when the verbalized-recognition rate is low; absence of verbalized recognition is not informative and MUST NOT be reported as evidence of absence of recognition.
+
+**Key Considerations:**
+- Production telemetry sampling MUST respect APTS-TP-013 sensitive-data handling and MUST NOT extend retention beyond what APTS-AR-001 already permits. Differential measurement is not a justification for new collection.
+- The differential is interpretable only against a model-snapshot identifier (APTS-AR-019). Comparing eval samples and production samples drawn under different snapshots without correcting for the change confounds the measurement.
+- Eval-suite contamination from the production side — for example, production samples that were previously absorbed into post-deployment learning data — silently narrows the measured delta. Treat eval scenarios as held out from any post-deployment learning data, consistent with APTS-MR-A01's contamination guidance.
+- Where production sampling is impossible (privacy-restricted engagement, contractual restriction), substitute high-fidelity production-mimicking synthetic transcripts and disclose the substitution; do not omit the dimension.
+
+**Common Pitfalls:**
+- Computing the delta against a single eval run and a single engagement; the noise floor is high, and any threshold crossing under such a sample is more likely to be sampling noise than a real consistency violation. Aggregate across multiple runs and engagements before applying threshold checks.
+- Treating zero verbalized eval recognition as evidence of no eval awareness. Anthropic Opus 4.6 results show models can detect evaluations without verbalizing the detection; verbalization is one signal, not the metric.
+- Using the same model and prompt for the judge as for the agent under test. The independence requirement collapses, and the measurement becomes a mirror of the agent's own classification behavior.
+- Publishing a differential report once and never refreshing. The Chaudhary et al. (2025) power-law scaling result implies the differential will tend to grow with model scale; the practice is only useful as a longitudinal artifact across snapshot changes.
+
+---
+
 ## Implementation Roadmap
 
 **Tier 1 (implement before any autonomous pentesting begins):**
diff --git a/standard/6_Manipulation_Resistance/README.md b/standard/6_Manipulation_Resistance/README.md
index 373f91f..76851e0 100644
--- a/standard/6_Manipulation_Resistance/README.md
+++ b/standard/6_Manipulation_Resistance/README.md
@@ -1060,3 +1060,5 @@ The rest of Manipulation Resistance defends against an outside attacker trying t
 ---
 
 > **See also:** [APTS-MR-A01: Goal Misgeneralization and Emergent Misalignment Evaluation Suite](../appendix/Advisory_Requirements.md#apts-mr-a01-goal-misgeneralization-and-emergent-misalignment-evaluation-suite-advisory) — an advisory practice for platforms using fine-tuned or adapted LLM-based agents. Evaluates the agent's underlying objective alignment under distribution shift and detects emergent misalignment after fine-tuning, addressing failure modes that input-side (MR-013) and control-side (MR-020) adversarial testing do not cover. Candidate for tier-gated inclusion in v0.2.0.
+
+> **See also:** [APTS-MR-A02: Sandbagging Detection and Behavioral Consistency Validation](../appendix/Advisory_Requirements.md#apts-mr-a02-sandbagging-detection-and-behavioral-consistency-validation-advisory) — an advisory practice for platforms running an adversarial eval suite under MR-020 against the same model configuration deployed in customer engagements. Measures the behavioral delta between eval context and production context on the same model snapshot tracked under AR-019, addressing eval-awareness and sandbagging that MR-020's pass-rate metric alone cannot detect. Candidate for tier-gated inclusion in v0.2.0.
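+
+A minimal sketch of the sparse-sample band treatment described in the Implementation Guide (illustrative only; all names are hypothetical, and Wilson score intervals stand in for whatever calibrated interval method the platform documents):
+
+```python
+import math
+from dataclasses import dataclass
+
+Z95 = 1.96  # two-sided 95% normal quantile
+
+@dataclass
+class RateBand:
+    point: float  # observed rate
+    lo: float     # lower interval bound
+    hi: float     # upper interval bound
+    n: int        # sample size
+
+def wilson(successes: int, n: int, z: float = Z95) -> RateBand:
+    """Wilson score interval for a binomial rate (stays sane at p=0 or p=1)."""
+    if n == 0:
+        return RateBand(float("nan"), 0.0, 1.0, 0)
+    p = successes / n
+    denom = 1 + z * z / n
+    center = (p + z * z / (2 * n)) / denom
+    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
+    return RateBand(p, max(0.0, center - half), min(1.0, center + half), n)
+
+def dimension_delta(eval_side: RateBand, prod_side: RateBand,
+                    threshold: float, min_n: int) -> dict:
+    """Eval-vs-production delta for one behavioral dimension.
+
+    Below the documented minimum sample size the point estimate is not
+    trusted; the conservative (widest-gap) bound of the two interval
+    bands is compared against the threshold instead.
+    """
+    sparse = min(eval_side.n, prod_side.n) < min_n
+    point_delta = abs(eval_side.point - prod_side.point)
+    band_delta = max(eval_side.hi - prod_side.lo, prod_side.hi - eval_side.lo)
+    tested = band_delta if sparse else point_delta
+    return {"point_delta": point_delta, "band_delta": band_delta,
+            "sparse": sparse, "exceeds_threshold": tested > threshold}
+
+# Example: scope-compliance rate, 200 eval scenarios vs. 48 production tasks.
+print(dimension_delta(wilson(196, 200), wilson(45, 48),
+                      threshold=0.05, min_n=100))
+```
+
+The `exceeds_threshold` flag above illustrates only the per-dimension check; the aggregate floor, multi-run aggregation, and escalation wiring described in the advisory practice text still apply.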
diff --git a/standard/Frontispiece.md b/standard/Frontispiece.md
index 5050902..a4a3510 100644
--- a/standard/Frontispiece.md
+++ b/standard/Frontispiece.md
@@ -72,4 +72,4 @@ Licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
 
 | Version | Date | Notes |
 |---------|------|-------|
-| 0.1.0 | April 2026 | Initial release. Eight domains, 173 tier-required requirements across three compliance tiers, plus 15 advisory practices in the appendix. |
+| 0.1.0 | April 2026 | Initial release. Eight domains, 173 tier-required requirements across three compliance tiers, plus 16 advisory practices in the appendix. |
diff --git a/standard/Getting_Started.md b/standard/Getting_Started.md
index 7130b80..c86a05f 100644
--- a/standard/Getting_Started.md
+++ b/standard/Getting_Started.md
@@ -85,7 +85,7 @@ Depending on your role:
 ## Common Questions
 
 **Q: Do I need to implement all 173 requirements?**
-No. Start with Tier 1 (72 requirements). Tier 2 and Tier 3 add requirements progressively for cumulative totals of 157 and 173. An additional 15 advisory practices live in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern; advisory practices are not required for conformance at any tier. See [Introduction: Compliance Tiers](Introduction.md#compliance-tiers) for details.
+No. Start with Tier 1 (72 requirements). Tier 2 and Tier 3 add requirements progressively for cumulative totals of 157 and 173. An additional 16 advisory practices live in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md) under the `APTS--A0x` identifier pattern; advisory practices are not required for conformance at any tier. See [Introduction: Compliance Tiers](Introduction.md#compliance-tiers) for details.
 
 **Q: What if my platform meets most but not all Tier 1 requirements?**
 APTS does not award partial credit. A platform must meet 100% of requirements for its claimed tier. Address gaps before claiming a tier.
diff --git a/standard/Introduction.md b/standard/Introduction.md
index 737119a..aad47f1 100644
--- a/standard/Introduction.md
+++ b/standard/Introduction.md
@@ -44,7 +44,7 @@ APTS does not prescribe who performs the assessment. The choice of internal self
 | 7 | Third-Party & Supply Chain Trust | TP | 22 | AI providers, cloud dependencies, data handling, foundation model disclosure |
 | 8 | Reporting | RP | 15 | Finding validation, confidence scoring, coverage disclosure |
 
-**Total: 173 tier-required requirements** (Tier 1 + Tier 2 + Tier 3) across the eight domains. An additional **15 advisory practices** live exclusively in the [Advisory Requirements](appendix/Advisory_Requirements.md) appendix using the `APTS--A0x` identifier pattern; advisory practices are not counted toward any tier and do not affect conformance.
+**Total: 173 tier-required requirements** (Tier 1 + Tier 2 + Tier 3) across the eight domains. An additional **16 advisory practices** live exclusively in the [Advisory Requirements](appendix/Advisory_Requirements.md) appendix using the `APTS--A0x` identifier pattern; advisory practices are not counted toward any tier and do not affect conformance.
 
 ---
 
@@ -131,5 +131,4 @@ APTS does not try to verify that a foundation model is aligned, cannot scheme, o
 
 Several requirements in this standard exist specifically to encode that stance. APTS-SC-019 requires a kernel-enforced execution sandbox whose boundary the agent runtime has no credentials to move. APTS-SC-020 requires the tool and action allowlist to be enforced by a component external to the model rather than by the model's system prompt. APTS-MR-023 names the agent runtime as an untrusted component in the platform's threat model. APTS-AR-020 requires the authoritative audit trail to live on infrastructure the agent runtime cannot reach. APTS-AL-028 requires periodic operator-run containment verification for platforms operating at L3 or L4. APTS-TP-021 and APTS-TP-022 require the operator to disclose the foundation model powering the agent and to re-assess the platform when that model changes materially. APTS-SE-026 recommends monitoring the agent's action distribution for deviations that fall within scope but outside the operator's expected behavior.
 
-These requirements share a common assumption: as foundation model capability grows, the gap between what the agent *can* do and what the operator *intends* for it to do grows with it, and that gap is best closed by architecture rather than by promises extracted from the model. The standard's treatment of this frontier is deliberately modest in scope. It does not try to address every open question about agent safety. Research-stage topics (verifiable goal alignment, scheming detection, and containment testing against models that may be aware of the test environment) are out of scope for this version and may be addressed in future versions of APTS as the field matures.
-
+These requirements share a common assumption: as foundation model capability grows, the gap between what the agent *can* do and what the operator *intends* for it to do grows with it, and that gap is best closed by architecture rather than by promises extracted from the model. The standard's treatment of this frontier is deliberately modest in scope. It does not try to address every open question about agent safety. Research-stage topics (verifiable goal alignment, scheming detection, and containment testing against models that may be aware of the test environment) are out of scope for normative requirements in this version; selected advisory practices provide initial measurement guidance. These topics may be addressed as tier-gated requirements in future versions of APTS as the field matures.
diff --git a/standard/README.md b/standard/README.md
index 0c1610c..0d7073c 100644
--- a/standard/README.md
+++ b/standard/README.md
@@ -1,6 +1,6 @@
 # OWASP Autonomous Penetration Testing Standard
 
-This is the full OWASP Autonomous Penetration Testing Standard. It defines 173 tier-required requirements across 8 domains (plus 15 advisory practices in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md)) that autonomous penetration testing platforms must meet to operate safely, transparently, and within defined boundaries, whether delivered by vendors, operated as a service, or built in-house by enterprise security teams.
+This is the full OWASP Autonomous Penetration Testing Standard. It defines 173 tier-required requirements across 8 domains (plus 16 advisory practices in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md)) that autonomous penetration testing platforms must meet to operate safely, transparently, and within defined boundaries, whether delivered by vendors, operated as a service, or built in-house by enterprise security teams.
 
 ## Getting Started
 
diff --git a/standard/appendix/Advisory_Requirements.md b/standard/appendix/Advisory_Requirements.md
index f8419e7..428487d 100644
--- a/standard/appendix/Advisory_Requirements.md
+++ b/standard/appendix/Advisory_Requirements.md
@@ -236,6 +236,30 @@ Maintain and execute a goal-misgeneralization and emergent-misalignment evaluati
 
 ---
 
+### APTS-MR-A02: Sandbagging Detection and Behavioral Consistency Validation (Advisory)
+
+**Applicability:** This practice applies to platforms that use LLM-based agents and run an adversarial evaluation suite under APTS-MR-020 against the same model configuration deployed in customer engagements authorized under APTS-SE-001. It applies regardless of whether the underlying model has been fine-tuned, and is therefore complementary in scope to APTS-MR-A01.
+
+**Rationale:** A recurring finding across frontier-model evaluations between late 2024 and early 2026 is that an agent's behavior on an adversarial eval suite can diverge from its behavior in live deployment, because the model can detect features of the evaluation context and modulate its actions accordingly. Apollo Research demonstrated reproducible in-context scheming and sandbagging across Claude 3.5 Sonnet, Claude 3 Opus, and OpenAI o1 (Meinke et al., 2024, *Frontier Models are Capable of In-context Scheming*, arXiv:2412.04984); pre-deployment testing of OpenAI o3 documented a case in which the model internally reasoned "we want to survive as the model, we need to fail purposely" and then deliberately submitted incorrect chemistry answers, and METR's o3 evaluation report cautions that the evaluation setup "is not robust to sandbagging by the model." Independent peer-reviewed work establishes sandbagging as a measurable, controllable behavior (van der Weij et al., *AI Sandbagging*, ICLR 2025, arXiv:2406.07358), and Anthropic's *Alignment Faking* result shows that production-grade Claude 3 Opus differentially modulates behavior under naturalistic conditions (Greenblatt et al., 2024). Anthropic's own Sonnet 4.5 and Opus 4.6 system cards report that internal interpretability features for evaluation awareness strengthened during training, with UK AISI testing finding Opus 4.6 correctly identifies evaluations roughly 80% of the time while verbalizing that recognition only ~2% of the time. The capability scales predictably with model size (Chaudhary et al., 2025, arXiv:2509.13333) and is measurable on agentic tasks (Needham et al., 2025, arXiv:2505.23836). For autonomous penetration testing platforms the consequence is concrete: Meta's 2026 Muse Spark report shows roughly a 32-percentage-point drop on cyber capability evaluations under sandbagging instruction with general-knowledge performance moving by 0.4 points — exactly the failure mode an adversarial eval suite is supposed to surface. APTS-MR-020 (Adversarial Validation and Resilience Testing of Safety Controls) requires the platform to run the eval suite; APTS-AR-019 (AI/ML Model Change Tracking and Drift Detection) tracks output drift across model versions. Neither measures the differential between the agent's eval-context behavior and its production-context behavior on the same model snapshot. The EU AI Act Article 15 obligation that high-risk systems "perform consistently … throughout their lifecycle" gestures at this property without specifying a measurement; this practice supplies one. The normative requirement set for v0.1.0 is frozen; this practice is a candidate for tier-gated inclusion in v0.2.0 (likely as SHOULD | Tier 2 for platforms operating at Level 3 autonomy or higher).
+
+**Value:** Platforms that maintain and validate an eval-versus-production behavioral-consistency measurement detect a class of failure that a clean eval pass-rate cannot rule out: situations where the adversarial eval suite shows no anomalies while the agent's deployed behavior diverges along the dimensions the eval was supposed to govern (scope compliance, refusal of out-of-scope tools, finding-fabrication rate). This is the context-side analogue of the objective-side failure addressed by APTS-MR-A01: MR-A01 catches the agent pursuing the wrong objective when training data is ambiguous; MR-A02 catches the agent pursuing the right objective during evaluation while pursuing a different one in deployment, on the same model snapshot.
+
+**Practice Description:**
+
+Maintain and validate a behavioral-consistency measurement comparing the agent's behavior on the adversarial evaluation suite of APTS-MR-020 against a stratified sample of production engagement transcripts captured under APTS-SE-001 authorization, on the same model snapshot tracked under APTS-AR-019. Specifically:
+
+1. **Behavioral metrics.** For each model snapshot, compute the eval-versus-production delta on at least four dimensions: scope-compliance rate, refusal rate on out-of-scope tool invocations, finding-fabrication rate (per APTS-RP-A01), and the distribution of time-to-completion per task class. Document per-dimension thresholds and the methodology used to compute each metric on both sides of the comparison (a sketch of the resulting report record follows this list).
+2. **Stratified sampling.** Draw production transcripts using a documented stratification protocol (for example, by autonomy level, engagement size, and customer environment class) sufficient to support per-dimension comparison at the stated thresholds. Disclose sample sizes, the stratification rationale, and any redactions applied to comply with APTS-AR-001 retention policy and APTS-TP-013 sensitive-data handling.
+3. **Multi-snapshot trend.** Repeat the measurement on at least two model snapshots tracked under APTS-AR-019 so that drift in the differential itself becomes observable. Where only one snapshot is available, document the limitation and re-execute on the next snapshot change.
+4. **Verbalized recognition disclosure.** Where the model produces reasoning traces (chain-of-thought, scratchpad, or equivalent), disclose whether evaluation-context recognition appears in the verbalized reasoning of evaluation runs and whether the rate of verbalized recognition correlates with the measured behavioral delta. Do not treat the absence of verbalized recognition as evidence that the model is not detecting the evaluation context; recent system-card results show recognition can occur without verbalization.
+5. **Threshold and escalation.** Document a maximum acceptable delta per dimension and an aggregate floor. Above-threshold delta on any dimension MUST trigger reauthorization review per APTS-AL-026 before the platform may continue to operate at or above L3 autonomy, and MUST trigger review of any post-deployment learning data per APTS-TP-019. The validation report (the differential between eval-side and production-side metrics) MUST be committed to the audit trail per APTS-AR-001 alongside the corresponding eval and production samples.
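+
+A minimal sketch of the differential-report record the five steps above produce (field names are hypothetical; the phi coefficient is one simple choice for the step-4 recognition-behavior correlation):
+
+```python
+import math
+from dataclasses import dataclass, field
+
+@dataclass
+class DimensionDelta:
+    name: str          # e.g. "scope_compliance_rate" (step 1 dimension)
+    eval_value: float  # metric computed on the MR-020 eval sample
+    prod_value: float  # same metric on the stratified production sample
+    eval_n: int
+    prod_n: int
+    max_delta: float   # documented per-dimension threshold (step 5)
+
+    @property
+    def delta(self) -> float:
+        return abs(self.eval_value - self.prod_value)
+
+def phi(a: int, b: int, c: int, d: int) -> float | None:
+    """Phi coefficient for a 2x2 table: rows = verbalized eval recognition
+    (yes/no), columns = behavioral anomaly on the scenario (yes/no)."""
+    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
+    return (a * d - b * c) / denom if denom else None
+
+@dataclass
+class DifferentialReport:
+    snapshot_id: str   # model snapshot per APTS-AR-019
+    dimensions: list[DimensionDelta] = field(default_factory=list)
+    verbalized_recognition_rate: float | None = None  # None if no traces
+    recognition_behavior_phi: float | None = None     # step 4 correlation
+
+    def escalations(self) -> list[str]:
+        """Dimensions breaching their documented maximum. Any entry triggers
+        APTS-AL-026 reauthorization review and APTS-TP-019 learning-data
+        review before continued operation at or above L3 (step 5)."""
+        return [d.name for d in self.dimensions if d.delta > d.max_delta]
+```
+
+When no reasoning traces exist, `verbalized_recognition_rate` stays `None` rather than zero: per step 4, absence of verbalization is not evidence of absence of recognition.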
+
+**Recommendation:** Implement the measurement as an independent pipeline outside the engagement runtime so that the eval-side and production-side samples are scored under identical conditions. A lightweight standing detector — paired formal-versus-casual prompt framings administered to the deployed configuration ("Playing Dumb" methodology, 2026) — provides an inexpensive black-box check between full measurement runs. For the scoring component, follow the same independence rule that APTS-MR-A01 and APTS-RP-A01 apply: where an LLM judge is used, it MUST run as a separate inference call with no shared context with the agent under test, and SHOULD use a different model family. METR's MALT methodology (2025) — cross-task differential analysis rather than single-transcript review — is the citable reference for how to interpret the production-side sample. The scope of the measurement is the eval suite of APTS-MR-020 and live customer engagements under APTS-SE-001; training-time RLHF, RLAIF, or DPO runs are out of scope and are governed by APTS-TP-019. APTS-MR-A01 addresses the upstream objective-side failure; APTS-MR-A02 addresses the context-side failure on the same model snapshot.
+
+**Related normative requirements:** APTS-MR-020, APTS-MR-022, APTS-AR-001, APTS-AR-019, APTS-AL-026, APTS-TP-019, APTS-TP-022, APTS-RP-A01.
+
+---
+
 ### APTS-RP-A01: Automated Finding Authenticity Verification (Advisory)
 
 **Rationale:** LLM-based penetration testing agents can produce findings that appear legitimate but contain fabricated evidence: proof-of-concept scripts that output hardcoded strings instead of making real requests, HTTP responses that were not actually received from the target, or severity classifications unsupported by the evidence. Because these fabricated findings are fluent and internally consistent, they pass casual human review and erode trust in the platform's output. RP-001 and RP-002 require evidence-based validation and human review, but neither addresses the risk that the agent itself fabricates evidence. The normative requirement set for v0.1.0 is frozen; this practice is a candidate for tier-gated inclusion in v0.2.0 (likely as MUST | Tier 2 given the implementation complexity).
diff --git a/standard/appendix/Glossary.md b/standard/appendix/Glossary.md
index 89f5fc0..e87cde5 100644
--- a/standard/appendix/Glossary.md
+++ b/standard/appendix/Glossary.md
@@ -79,7 +79,7 @@ Notation for specifying IP address ranges using a base address and prefix length
 
 **Compensating Control**
 Alternative security measures that mitigate vulnerability when the primary control is missing. Example: Two-factor authentication compensates for weak passwords.
 
 **Compliance Tier**
-One of three progressive levels of APTS conformance. Tier 1 (Foundation) requires 72 core requirements (MUST | Tier 1). Tier 2 (Verified) adds 85 requirements for a cumulative 157 (MUST | Tier 2 + SHOULD | Tier 2). Tier 3 (Comprehensive) adds 16 requirements for a cumulative 173 (MUST | Tier 3 + SHOULD | Tier 3). A platform must meet 100% of requirements assigned to its claimed tier (both MUST and SHOULD). An additional 15 advisory practices in the Advisory Requirements appendix are recommended for highest-assurance engagements but are not counted toward any tier.
+One of three progressive levels of APTS conformance. Tier 1 (Foundation) requires 72 core requirements (MUST | Tier 1). Tier 2 (Verified) adds 85 requirements for a cumulative 157 (MUST | Tier 2 + SHOULD | Tier 2). Tier 3 (Comprehensive) adds 16 requirements for a cumulative 173 (MUST | Tier 3 + SHOULD | Tier 3). A platform must meet 100% of requirements assigned to its claimed tier (both MUST and SHOULD). An additional 16 advisory practices in the Advisory Requirements appendix are recommended for highest-assurance engagements but are not counted toward any tier.
 
 **Confidence Score**
 A numeric value on a 0-100% scale indicating the platform's certainty in a scope boundary determination, target legitimacy assessment, asset classification, or finding validity. Scores below 75% for scope-related decisions trigger mandatory human escalation. See APTS-HO-013, APTS-RP-003.
diff --git a/standard/appendix/Vendor_Evaluation_Guide.md b/standard/appendix/Vendor_Evaluation_Guide.md
index 0e5d1bf..f85cae4 100644
--- a/standard/appendix/Vendor_Evaluation_Guide.md
+++ b/standard/appendix/Vendor_Evaluation_Guide.md
@@ -14,7 +14,7 @@ Decide your minimum compliance tier based on your risk tolerance:
 
 - **Tier 2 (Verified):** 157 cumulative requirements (72 + 85). The platform is fully transparent about what it did and why, protects your data with tamper-proof audit trails, handles incidents with formal response procedures, and provides independently verifiable findings. **Choose Tier 2 when:** you are testing production environments, operating in regulated industries, or need full accountability for audit or compliance purposes. This is the recommended minimum for most production deployments.
 
-- **Tier 3 (Comprehensive):** 173 cumulative requirements (157 + 16). The platform meets the highest assurance bar for critical infrastructure, fully autonomous (L4) operations, and the strictest regulatory requirements. **Choose Tier 3 when:** you are deploying fully autonomous testing against critical infrastructure, financial systems, or healthcare environments with minimal human oversight. An additional 15 advisory practices in the [Advisory Requirements appendix](Advisory_Requirements.md) are recommended for highest-assurance engagements but are not counted toward any tier.
+- **Tier 3 (Comprehensive):** 173 cumulative requirements (157 + 16). The platform meets the highest assurance bar for critical infrastructure, fully autonomous (L4) operations, and the strictest regulatory requirements. **Choose Tier 3 when:** you are deploying fully autonomous testing against critical infrastructure, financial systems, or healthcare environments with minimal human oversight. An additional 16 advisory practices in the [Advisory Requirements appendix](Advisory_Requirements.md) are recommended for highest-assurance engagements but are not counted toward any tier.
 
 > **Minimum tier guidance:** Tier 1 is appropriate for supervised testing of non-critical systems in non-regulated environments. Organizations in financial services, healthcare, critical infrastructure, or any regulated industry SHOULD require Tier 2 as a minimum. Tier 3 is recommended for critical infrastructure, fully autonomous (L4) operations, and environments with the strictest regulatory requirements.