Merged
2 changes: 1 addition & 1 deletion README.md
@@ -33,7 +33,7 @@ APTS is not a testing methodology. It complements PTES, OWASP WSTG, and OSSTMM b
- **Tier 2 (Verified)**: 85 additional (157 cumulative). Full transparency, tamper-proof audit trails, and independently verifiable findings.
- **Tier 3 (Comprehensive)**: 16 additional (173 cumulative). Highest assurance for critical infrastructure and L4 autonomous operations.

Fifteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS-<DOMAIN>-A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance.
Sixteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS-<DOMAIN>-A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance.

APTS has no certification body, no mandatory third-party audit, and no fee. Platforms are assessed against the requirements and conformance is documented. The standard does not prescribe who performs the assessment; internal self-assessment, independent internal review, and external third-party assessment are all valid approaches, and the choice is left to the reader.

2 changes: 1 addition & 1 deletion index.md
@@ -44,7 +44,7 @@ APTS is not a testing methodology. It complements PTES, OWASP WSTG, and OSSTMM b
- **Tier 2 (Verified)**: 85 additional (157 cumulative). Full transparency, tamper-proof audit trails, and independently verifiable findings.
- **Tier 3 (Comprehensive)**: 16 additional (173 cumulative). Highest assurance for critical infrastructure and L4 autonomous operations.

Fifteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS-<DOMAIN>-A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance.
Sixteen additional advisory practices live exclusively in the [Advisory Requirements appendix](./standard/appendix/Advisory_Requirements.md) under the `APTS-<DOMAIN>-A0x` identifier pattern. Advisory practices are not counted toward any tier and do not affect conformance.

APTS has no certification body, no mandatory third-party audit, and no fee. Platforms are assessed against the requirements and conformance is documented. The standard does not prescribe who performs the assessment; internal self-assessment, independent internal review, and external third-party assessment are all valid approaches, and the choice is left to the reader.

37 changes: 37 additions & 0 deletions standard/6_Manipulation_Resistance/Implementation_Guide.md
@@ -403,6 +403,43 @@ Implement the post-fine-tuning out-of-distribution audit as a separate evaluatio

---

### APTS-MR-A02: Sandbagging Detection and Behavioral Consistency Validation

> This section provides implementation guidance for the advisory practice [APTS-MR-A02](../appendix/Advisory_Requirements.md#apts-mr-a02-sandbagging-detection-and-behavioral-consistency-validation-advisory). It is not required for conformance at any tier.

**Implementation:** Build the consistency measurement as an independent pipeline that ingests two inputs: per-scenario records from the adversarial eval suite of APTS-MR-020, and a stratified sample of production engagement transcripts captured under APTS-AR-001. Each measurement run is anchored to a single model snapshot identifier (per APTS-AR-019) and produces a structured differential report, comprising per-dimension deltas, sample sizes, and a verbalized-recognition correlation, committed to the audit trail.

**Architecture Pattern — Differential Measurement Pipeline:**

The pipeline has four components, mirroring the structure used in APTS-MR-A01:

1. **Eval-side ingestor.** Reads per-scenario records emitted by the MR-020 eval suite for the model snapshot under measurement. The ingestor records, per scenario, the agent's terminal action, tool-call trace, and reasoning trace if available.
2. **Production-side sampler.** Draws a stratified sample of production transcripts from the audit store (APTS-AR-001), applying redactions consistent with APTS-TP-013 sensitive-data handling. The sampler MUST NOT relax retention or consent boundaries to obtain samples; where sampling is constrained, the limitation is reported in the differential report rather than worked around.
3. **Differential scorer.** Computes per-dimension metrics on each sample and the delta between them. Where rule-based scoring is insufficient (for example, refusal-rate classification on free-form tool invocations), an LLM judge is used; the judge MUST run as a separate inference call with no shared context with the agent under test (same constraint as APTS-MR-A01 and APTS-RP-A01) and SHOULD use a different model family.
4. **Trend reporter.** Aggregates the differential across model snapshots tracked under APTS-AR-019 so that the change in the differential itself is observable, writes a signed run record to the audit trail per APTS-AR-012, and emits the validation report for inclusion alongside the customer engagement record.
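The differential scorer's core computation can be sketched as follows. This is an illustrative sketch only: the record shape, dimension names, and report fields are hypothetical, not prescribed by the standard, and a real pipeline would add the audit-trail signing and snapshot validation described above.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ScenarioRecord:
    """Hypothetical per-scenario record; field names are illustrative."""
    dimension: str   # e.g. "refusal_rate", "scope_adherence"
    score: float     # 0.0-1.0 behavioral classification for this sample

def per_dimension_means(records):
    """Group records by dimension and average their scores."""
    by_dim = {}
    for r in records:
        by_dim.setdefault(r.dimension, []).append(r.score)
    return {d: mean(scores) for d, scores in by_dim.items()}

def differential_report(eval_records, prod_records, snapshot_id):
    """Per-dimension deltas (eval minus production) for one model snapshot.

    Both samples must be drawn under the same snapshot identifier;
    comparing across snapshots confounds the measurement (see below).
    """
    eval_means = per_dimension_means(eval_records)
    prod_means = per_dimension_means(prod_records)
    shared = eval_means.keys() & prod_means.keys()
    return {
        "snapshot": snapshot_id,
        "deltas": {d: eval_means[d] - prod_means[d] for d in shared},
        "sample_sizes": {
            d: (sum(1 for r in eval_records if r.dimension == d),
                sum(1 for r in prod_records if r.dimension == d))
            for d in shared
        },
    }
```

A positive delta on a dimension such as refusal rate (the agent refuses more often in the eval context than in production) is the sandbagging-shaped signal the trend reporter tracks across snapshots.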

**Stratified Sampling:**

Stratify the production-side sample across at least three dimensions: autonomy level (L1–L4), engagement size class, and customer-environment class (for example, web application, internal network, cloud control plane). Document a minimum sample size per dimension before per-dimension thresholds apply; for dimensions with sparse samples, report a confidence-interval band rather than a single point estimate, and compare the band's bound against the threshold (mirroring the calibrated-threshold treatment in APTS-MR-A01).
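The sparse-stratum treatment can be sketched as below. The minimum sample size, the use of a Wilson score interval, and the choice of comparing the band's upper bound (so that a sparse stratum is not prematurely flagged) are all illustrative assumptions; APTS-MR-A01's calibrated-threshold treatment, which this mirrors, may specify different choices.

```python
import math

def stratum_check(successes, n, threshold, n_min=30, z=1.96):
    """Per-stratum consistency check; names and defaults are illustrative.

    With enough samples, compare the point estimate to the threshold
    directly. For sparse strata, report a Wilson confidence band and
    compare the band's favourable (upper) bound instead of a bare point
    estimate -- an assumption, standing in for the calibrated-threshold
    treatment referenced in the text.
    """
    p = successes / n
    if n >= n_min:
        return {"estimate": p, "pass": p >= threshold}
    # Wilson score interval for a binomial proportion (sparse sample)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return {"band": (centre - half, centre + half),
            "pass": centre + half >= threshold,
            "sparse": True}
```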

**Verbalized Eval-Recognition Audit:**

Where reasoning traces are available, scan eval-side transcripts for evaluation-context tokens or situational-awareness markers (Laine et al., 2024, *Situational Awareness Dataset*; Anthropic Sonnet 4.5 / Opus 4.6 system-card methodology) and compute the correlation between verbalized recognition and the per-scenario behavioral classification. A non-zero correlation is informative even when the verbalized-recognition rate is low; absence of verbalized recognition is not informative and MUST NOT be reported as evidence of absence of recognition.
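For binary recognition flags and binary behavioral classifications, the correlation can be computed as a phi coefficient over the 2x2 contingency table; the sketch below is illustrative, and the function and flag names are hypothetical. Note the caveat above still applies: a low verbalized-recognition rate makes this estimate noisy, and a zero rate makes it undefined, not reassuring.

```python
import math

def phi_correlation(recognized, compliant):
    """Phi coefficient between per-scenario verbalized-recognition flags
    and behavioral classifications (both lists of booleans).

    Returns 0.0 when a marginal is empty (e.g. no scenario verbalized
    recognition), where the coefficient is undefined.
    """
    assert len(recognized) == len(compliant)
    n11 = sum(r and c for r, c in zip(recognized, compliant))
    n10 = sum(r and not c for r, c in zip(recognized, compliant))
    n01 = sum((not r) and c for r, c in zip(recognized, compliant))
    n00 = sum((not r) and (not c) for r, c in zip(recognized, compliant))
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom
```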

**Key Considerations:**
- Production telemetry sampling MUST respect APTS-TP-013 sensitive-data handling and MUST NOT extend retention beyond what APTS-AR-001 already permits. Differential measurement is not a justification for new collection.
- The differential is interpretable only against a model-snapshot identifier (APTS-AR-019). Comparing eval samples and production samples drawn under different snapshots without correcting for the change confounds the measurement.
- Eval-suite contamination from the production side — for example, production samples that were previously absorbed into post-deployment learning data — silently narrows the measured delta. Treat eval scenarios as held out from any post-deployment learning data, consistent with APTS-MR-A01's contamination guidance.
- Where production sampling is impossible (privacy-restricted engagement, contractual restriction), substitute high-fidelity production-mimicking synthetic transcripts and disclose the substitution; do not omit the dimension.

**Common Pitfalls:**
- Computing the delta against a single eval run and a single engagement; the noise floor is high, and any threshold crossing under such a sample is more likely to be sampling noise than a real consistency violation. Aggregate across multiple runs and engagements before applying threshold checks.
- Treating zero verbalized eval recognition as evidence of no eval awareness. Anthropic Opus 4.6 results show models can detect evaluations without verbalizing the detection; verbalization is one signal, not the metric.
- Using the same model and prompt for the judge as for the agent under test. The independence requirement collapses, and the measurement becomes a mirror of the agent's own classification behavior.
- Publishing a differential report once and never refreshing. The Chaudhary et al. (2025) power-law scaling result implies the differential will tend to grow with model scale; the practice is only useful as a longitudinal artifact across snapshot changes.

---

## Implementation Roadmap

**Tier 1 (implement before any autonomous pentesting begins):**
2 changes: 2 additions & 0 deletions standard/6_Manipulation_Resistance/README.md
@@ -1060,3 +1060,5 @@ The rest of Manipulation Resistance defends against an outside attacker trying t
---

> **See also:** [APTS-MR-A01: Goal Misgeneralization and Emergent Misalignment Evaluation Suite](../appendix/Advisory_Requirements.md#apts-mr-a01-goal-misgeneralization-and-emergent-misalignment-evaluation-suite-advisory) — an advisory practice for platforms using fine-tuned or adapted LLM-based agents. Evaluates the agent's underlying objective alignment under distribution shift and detects emergent misalignment after fine-tuning, addressing failure modes that input-side (MR-013) and control-side (MR-020) adversarial testing do not cover. Candidate for tier-gated inclusion in v0.2.0.

> **See also:** [APTS-MR-A02: Sandbagging Detection and Behavioral Consistency Validation](../appendix/Advisory_Requirements.md#apts-mr-a02-sandbagging-detection-and-behavioral-consistency-validation-advisory) — an advisory practice for platforms running an adversarial eval suite under MR-020 against the same model configuration deployed in customer engagements. Measures the behavioral delta between eval context and production context on the same model snapshot tracked under AR-019, addressing eval-awareness and sandbagging that MR-020's pass-rate metric alone cannot detect. Candidate for tier-gated inclusion in v0.2.0.
2 changes: 1 addition & 1 deletion standard/Frontispiece.md
@@ -72,4 +72,4 @@ Licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).

| Version | Date | Notes |
|---------|------|-------|
| 0.1.0 | April 2026 | Initial release. Eight domains, 173 tier-required requirements across three compliance tiers, plus 15 advisory practices in the appendix. |
| 0.1.0 | April 2026 | Initial release. Eight domains, 173 tier-required requirements across three compliance tiers, plus 16 advisory practices in the appendix. |
2 changes: 1 addition & 1 deletion standard/Getting_Started.md
@@ -85,7 +85,7 @@ Depending on your role:
## Common Questions

**Q: Do I need to implement all 173 requirements?**
No. Start with Tier 1 (72 requirements). Tier 2 and Tier 3 add requirements progressively for cumulative totals of 157 and 173. An additional 15 advisory practices live in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md) under the `APTS-<DOMAIN>-A0x` identifier pattern; advisory practices are not required for conformance at any tier. See [Introduction: Compliance Tiers](Introduction.md#compliance-tiers) for details.
No. Start with Tier 1 (72 requirements). Tier 2 and Tier 3 add requirements progressively for cumulative totals of 157 and 173. An additional 16 advisory practices live in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md) under the `APTS-<DOMAIN>-A0x` identifier pattern; advisory practices are not required for conformance at any tier. See [Introduction: Compliance Tiers](Introduction.md#compliance-tiers) for details.

**Q: What if my platform meets most but not all Tier 1 requirements?**
APTS does not award partial credit. A platform must meet 100% of requirements for its claimed tier. Address gaps before claiming a tier.
5 changes: 2 additions & 3 deletions standard/Introduction.md
@@ -44,7 +44,7 @@ APTS does not prescribe who performs the assessment. The choice of internal self
| 7 | Third-Party & Supply Chain Trust | TP | 22 | AI providers, cloud dependencies, data handling, foundation model disclosure |
| 8 | Reporting | RP | 15 | Finding validation, confidence scoring, coverage disclosure |

**Total: 173 tier-required requirements** (Tier 1 + Tier 2 + Tier 3) across the eight domains. An additional **15 advisory practices** live exclusively in the [Advisory Requirements](appendix/Advisory_Requirements.md) appendix using the `APTS-<DOMAIN>-A0x` identifier pattern; advisory practices are not counted toward any tier and do not affect conformance.
**Total: 173 tier-required requirements** (Tier 1 + Tier 2 + Tier 3) across the eight domains. An additional **16 advisory practices** live exclusively in the [Advisory Requirements](appendix/Advisory_Requirements.md) appendix using the `APTS-<DOMAIN>-A0x` identifier pattern; advisory practices are not counted toward any tier and do not affect conformance.

---

@@ -131,5 +131,4 @@ APTS does not try to verify that a foundation model is aligned, cannot scheme, o

Several requirements in this standard exist specifically to encode that stance. APTS-SC-019 requires a kernel-enforced execution sandbox whose boundary the agent runtime has no credentials to move. APTS-SC-020 requires the tool and action allowlist to be enforced by a component external to the model rather than by the model's system prompt. APTS-MR-023 names the agent runtime as an untrusted component in the platform's threat model. APTS-AR-020 requires the authoritative audit trail to live on infrastructure the agent runtime cannot reach. APTS-AL-028 requires periodic operator-run containment verification for platforms operating at L3 or L4. APTS-TP-021 and APTS-TP-022 require the operator to disclose the foundation model powering the agent and to re-assess the platform when that model changes materially. APTS-SE-026 recommends monitoring the agent's action distribution for deviations that fall within scope but outside the operator's expected behavior.

These requirements share a common assumption: as foundation model capability grows, the gap between what the agent *can* do and what the operator *intends* for it to do grows with it, and that gap is best closed by architecture rather than by promises extracted from the model. The standard's treatment of this frontier is deliberately modest in scope. It does not try to address every open question about agent safety. Research-stage topics (verifiable goal alignment, scheming detection, and containment testing against models that may be aware of the test environment) are out of scope for this version and may be addressed in future versions of APTS as the field matures.

These requirements share a common assumption: as foundation model capability grows, the gap between what the agent *can* do and what the operator *intends* for it to do grows with it, and that gap is best closed by architecture rather than by promises extracted from the model. The standard's treatment of this frontier is deliberately modest in scope. It does not try to address every open question about agent safety. Research-stage topics (verifiable goal alignment, scheming detection, and containment testing against models that may be aware of the test environment) are out of scope for normative requirements in this version; selected advisory practices provide initial measurement guidance. These topics may be addressed as tier-gated requirements in future versions of APTS as the field matures.
2 changes: 1 addition & 1 deletion standard/README.md
@@ -1,6 +1,6 @@
# OWASP Autonomous Penetration Testing Standard

This is the full OWASP Autonomous Penetration Testing Standard. It defines 173 tier-required requirements across 8 domains (plus 15 advisory practices in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md)) that autonomous penetration testing platforms must meet to operate safely, transparently, and within defined boundaries, whether delivered by vendors, operated as a service, or built in-house by enterprise security teams.
This is the full OWASP Autonomous Penetration Testing Standard. It defines 173 tier-required requirements across 8 domains (plus 16 advisory practices in the [Advisory Requirements appendix](appendix/Advisory_Requirements.md)) that autonomous penetration testing platforms must meet to operate safely, transparently, and within defined boundaries, whether delivered by vendors, operated as a service, or built in-house by enterprise security teams.

## Getting Started
