161 changes: 161 additions & 0 deletions .github/designs/topology-aware-adaptive-controls.md

Member

This is not a design document showing benefits. You state measurable Outcomes but do not provide any samples to demonstrate usefulness.

Author

Addressed in 012a96643d566113e7d53c00505638c611ea07e6 by adding a Sample Evidence section to this design doc.

It now includes concrete samples for each control:

  • ACG convergence: benchmark fixture with `observations-to-decision: without=50, with=3`
  • tool drift: sample cohort centroid movement from `[1, 2, 0.0, 2]` to `[2, 3, 0.4, 3]`, showing stale plan invalidation
  • hint governor: sample learned hint shed at `latency_sensitivity = 2.0` under `epsilon = 10.0`, while manual override at `11.0` still emits hints

I also ran `cd fern && npx fern check --warnings`; it passes with only the expected unauthenticated redirects warning.

Author

Addressed in 131ffbc60948a0bcc2d8b8d813c8283f7eeb48d6.

Changes made:

  • moved the design note out of the published Fern docs tree to `.github/designs/topology-aware-adaptive-controls.md`
  • removed the public docs link/page from `docs/adaptive-plugin/about.mdx` and `docs/adaptive-plugin/configuration.mdx`
  • rewrote the design note around benefit gates rather than capability: each control now has a concrete gate for when it should stay disabled versus when it is useful
  • kept the sample evidence in the internal note, including the benchmark sample `observations-to-decision: without=50, with=3`

Docs validation still passes with `cd fern && npx fern check --warnings`.

@teerthsharma teerthsharma Jun 22, 2026

Author

On it. I will provide as many measured samples as I can from my local device.

@@ -0,0 +1,161 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Topology-Aware Adaptive Controls Design

This is a reviewer-facing design note for PR #282. It is intentionally kept out
of the published Fern documentation because it records internal implementation
tradeoffs, benefit gates, and validation samples rather than user-facing usage
instructions.

## Evidence Boundaries

The samples in this note are deterministic fixtures and targeted tests. They
prove control behavior at specific decision points, not general production
frequency or end-to-end performance. Claims in this note use these meanings:

- Proven by test: executable tests assert the before/after state.
- Shown by fixture: deterministic samples or benchmarks show a bounded result.
- Plausible but unproven: the mechanism follows from the implementation, but
this PR does not provide representative workload data.
- Not claimed: outside this PR's evidence.

## Problem

The Adaptive plugin learns from repeated runtime observations. Before this
change, the relevant paths had three avoidable failure modes:

- ACG learning kept consuming observations for a stable prompt profile until
the observation window was exhausted, even when the profile had already been
stable for multiple epochs.
- Tool parallelism could retain stale fan-out groups after the observed tool
cohort shape changed sharply.
- Learned adaptive hints were injected whenever defaults existed, even when the
learned latency-sensitivity signal was below the configured value needed to
justify request metadata.

The proposed controls are useful only if they make one of those states
observable and measurably better. If a representative workload does not show
one of the benefit gates below, the control should remain disabled.

## Goals

- Stop ACG learning after repeated stable prompt structure has been observed.
- Discard stale tool-parallelism plans when observed tool cohort shape changes
sharply.
- Shed learned adaptive hints below a configurable sensitivity threshold while
preserving manual latency-sensitivity overrides.
- Keep every control disabled by default and observable through existing
adaptive state, request metadata, and validation reports.

## Non-Goals

- Exact persistent homology or a general-purpose topology library.
- New public Rust, Python, Node.js, Go, WebAssembly, or C FFI topology
primitives.
- Changes to NeMo Relay scope semantics, event shape, callback execution, or
user callback return values.
- Public documentation of internal topology algorithms.

## Internal Design

The adaptive crate owns a small internal module, `crate::topology`, with three
bounded primitives:

- `ConvergenceDetector` tracks a fixed history of Betti-like stability
signatures, drift, and error.
- `DriftDetector` tracks centroid motion for tool cohort feature vectors.
- `GeometricGovernor` adapts a sensitivity threshold for learned hint
injection.

ACG maps each stability analysis result to:

```text
beta_0 = stable_prefix_length
beta_1 = total_spans - stable_prefix_length
drift = 1 - stable_prefix_length / total_spans
error = 1 - average_stability_score
```
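The mapping above can be sketched in Rust as follows. This is a minimal illustration, not the crate's code: the `StabilitySignature` struct, the `signature` function, the field types, and the choice to return `None` for inconsistent input are all assumptions made for the example.

```rust
/// Hypothetical container for the four values in the mapping above.
/// The real type names and field types in `crate::topology` are not shown
/// in this note.
#[derive(Debug, Clone, PartialEq)]
pub struct StabilitySignature {
    pub beta_0: usize,
    pub beta_1: usize,
    pub drift: f64,
    pub error: f64,
}

/// Maps one stability analysis result to a Betti-like signature.
/// Assumption: inconsistent input (no spans, or a stable prefix longer
/// than the span count) yields `None` instead of a signature.
pub fn signature(
    stable_prefix_length: usize,
    total_spans: usize,
    average_stability_score: f64,
) -> Option<StabilitySignature> {
    if total_spans == 0 || stable_prefix_length > total_spans {
        return None;
    }
    let stable_ratio = stable_prefix_length as f64 / total_spans as f64;
    Some(StabilitySignature {
        beta_0: stable_prefix_length,
        beta_1: total_spans - stable_prefix_length,
        drift: 1.0 - stable_ratio,
        error: 1.0 - average_stability_score,
    })
}
```

Under these assumptions, a fully stable prompt (`stable_prefix_length == total_spans`, score `1.0`) has `beta_1 = 0`, `drift = 0`, and `error = 0`.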

The tool-parallelism learner maps observed tool cohorts to a four-value
centroid:

```text
[cohort_count, unique_tool_count, duplicate_reference_ratio, max_cohort_size]
```
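A rough sketch of how such a centroid and a drift check could be computed is shown below. The function names, the definition of `duplicate_reference_ratio` (repeated tool references divided by all tool references), and the Euclidean distance metric are assumptions for illustration; the note does not specify how `DriftDetector` measures centroid motion.

```rust
use std::collections::HashSet;

/// Builds the four-value centroid from observed cohorts, where each cohort
/// is the list of tool names that ran together.
/// Assumption: duplicate_reference_ratio is the fraction of tool references
/// that repeat a tool already counted as unique.
pub fn cohort_centroid(cohorts: &[Vec<&str>]) -> [f64; 4] {
    let total_refs: usize = cohorts.iter().map(|c| c.len()).sum();
    let unique: HashSet<&str> = cohorts.iter().flatten().copied().collect();
    let max_cohort_size = cohorts.iter().map(|c| c.len()).max().unwrap_or(0);
    let duplicate_ratio = if total_refs == 0 {
        0.0
    } else {
        (total_refs - unique.len()) as f64 / total_refs as f64
    };
    [
        cohorts.len() as f64,
        unique.len() as f64,
        duplicate_ratio,
        max_cohort_size as f64,
    ]
}

/// Euclidean distance between two centroids (the metric is an assumption).
pub fn centroid_distance(a: &[f64; 4], b: &[f64; 4]) -> f64 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f64>()
        .sqrt()
}

/// True when centroid movement reaches the threshold, in which case the
/// stored plan would be discarded and rebuilt from the current cohorts.
pub fn has_drifted(previous: &[f64; 4], current: &[f64; 4], threshold: f64) -> bool {
    centroid_distance(previous, current) >= threshold
}
```

With these definitions, one `search`/`fetch` cohort gives `[1, 2, 0.0, 2]`, and two cohorts `[compile, test, lint]` and `[compile, test]` give `[2, 3, 0.4, 3]`. The cohort contents in that second case are illustrative, not taken from the test fixture.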

Adaptive hints use the governor only for learned hints. A manual
`set_latency_sensitivity()` override still forces hint injection for the current
request.
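The precedence described above (manual override first, then the governor check on learned hints) can be sketched as a small decision function. `HintDecision` and `decide_hint` are hypothetical names, and the sketch treats `epsilon` as a fixed value, whereas `GeometricGovernor` is described as adapting its threshold.

```rust
/// Hypothetical outcome of the hint-injection decision.
#[derive(Debug, PartialEq)]
pub enum HintDecision {
    /// A manual override always wins for the current request.
    InjectManual(f64),
    /// The learned hint met the governor threshold.
    InjectLearned(f64),
    /// No hint metadata is added to the request.
    Omit,
}

/// Decides whether a latency-sensitivity hint is injected.
/// Assumption: `epsilon` is passed in as a plain value; how the governor
/// adapts it over time is outside this sketch.
pub fn decide_hint(
    manual_override: Option<f64>,
    learned_sensitivity: Option<f64>,
    epsilon: f64,
) -> HintDecision {
    if let Some(value) = manual_override {
        return HintDecision::InjectManual(value);
    }
    match learned_sensitivity {
        Some(value) if value >= epsilon => HintDecision::InjectLearned(value),
        _ => HintDecision::Omit,
    }
}
```

For example, `decide_hint(None, Some(2.0), 10.0)` yields `Omit`, while `decide_hint(Some(11.0), Some(2.0), 10.0)` yields `InjectManual(11.0)`.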

## Architecture Flow

```mermaid
flowchart TD
A["Managed runtime call completes"] --> B["Adaptive telemetry receives RunRecord"]
B --> C{"Enabled learner/control?"}
C -->|"ACG convergence enabled"| D["Map prompt stability to Betti-like signature, drift, and error"]
D --> E{"Stable for configured window?"}
E -->|"yes"| F["Persist StabilityAnalysisResult.converged = true"]
E -->|"no"| G["Continue normal ACG observation window"]
F --> H["Later runs reuse cached stability and skip observation growth"]

C -->|"tool drift enabled"| I["Map observed tool cohorts to centroid"]
I --> J{"Centroid movement >= threshold?"}
J -->|"yes"| K["Discard stale stored plan and rebuild from current cohorts"]
J -->|"no"| L["Merge current cohorts into existing plan"]

C -->|"hint governor enabled"| M["Build learned AgentHints"]
M --> N{"Manual latency override?"}
N -->|"yes"| O["Inject manual hint"]
N -->|"no"| P{"learned sensitivity >= epsilon?"}
P -->|"yes"| Q["Inject learned hint"]
P -->|"no"| R["Omit learned hint metadata"]

C -->|"disabled or unset"| S["Existing adaptive behavior"]
```
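The "Stable for configured window?" branch in the flow above could look roughly like the sketch below. `ConvergenceSketch` is a hypothetical stand-in for `ConvergenceDetector`; the stability rule (identical Betti-like values, with drift and error within `epsilon` of the oldest entry in the window) and the treatment of each observation as one epoch are assumptions, not the crate's documented criterion.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded-history convergence check.
pub struct ConvergenceSketch {
    window: usize,
    epsilon: f64,
    // Each entry is (beta_0, beta_1, drift, error) for one epoch.
    history: VecDeque<(usize, usize, f64, f64)>,
}

impl ConvergenceSketch {
    /// Assumption: a window of 0 is clamped to 1 so the history stays bounded.
    pub fn new(window: usize, epsilon: f64) -> Self {
        let window = window.max(1);
        Self { window, epsilon, history: VecDeque::with_capacity(window) }
    }

    /// Records one epoch and returns true once the last `window` epochs
    /// share the same Betti-like values and near-identical drift and error.
    pub fn observe(&mut self, beta_0: usize, beta_1: usize, drift: f64, error: f64) -> bool {
        if self.history.len() == self.window {
            self.history.pop_front();
        }
        self.history.push_back((beta_0, beta_1, drift, error));
        if self.history.len() < self.window {
            return false;
        }
        let first = self.history[0];
        self.history.iter().all(|e| {
            e.0 == first.0
                && e.1 == first.1
                && (e.2 - first.2).abs() <= self.epsilon
                && (e.3 - first.3).abs() <= self.epsilon
        })
    }
}
```

Under these assumptions, a window of 3 fed identical epochs reports convergence on the third call, and any change in the signature delays convergence until the window again holds only matching epochs.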

## Benefit Gates

Each control must satisfy a concrete benefit gate before it is enabled for a
workload:

| Control | Benefit Gate | Observable Signal | Validation |
|---|---|---|---|
| ACG convergence | Stable profiles use fewer observations before decision than the configured observation window while preserving stored stability. | Persisted `StabilityAnalysisResult.converged = true`; later runs reuse cached stability and skip observation repair only after the observations are stored. | `crates/adaptive/tests/integration/topology_convergence_tests.rs` and `crates/adaptive/benches/convergence_bench.rs`. |
| Tool drift | A plan that was learned from an old tool-cohort shape is removed when the next observed cohort shape crosses the configured drift threshold. | The stored `ExecutionPlan` no longer contains stale fan-out groups after drift. | `crates/adaptive/tests/unit/tool_parallelism_learner_tests.rs`. |
| Hint governor | Low-sensitivity learned hints are omitted when below `adaptive_hints.governor.epsilon`, while manual overrides still emit hints. | `nvext.agent_hints` is absent from request headers/body for shed learned hints; manual overrides still add the field. | `crates/adaptive/tests/unit/adaptive_hints_intercept_tests.rs`. |
| Config safety | Invalid thresholds fail before activation. | Plugin validation diagnostics name invalid topology-aware fields. | `crates/adaptive/tests/unit/runtime_tests.rs` and `crates/adaptive/tests/unit/plugin_component_tests.rs`. |
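As an illustration of the config-safety gate, the sketch below collects diagnostics that name each invalid field before anything is activated. The struct, its layout, and the specific validity rules (window between 1 and `observation_window`, finite non-negative epsilons) are assumptions; only the names `stability_window`, `observation_window`, `epsilon`, and `adaptive_hints.governor.epsilon` appear in this note.

```rust
/// Hypothetical flattened view of the topology-aware settings.
pub struct TopologyConfigSketch {
    pub stability_window: usize,
    pub observation_window: usize,
    pub convergence_epsilon: f64,
    pub governor_epsilon: f64,
}

/// Returns one diagnostic per invalid field; an empty list means the
/// configuration may be activated. The rules here are illustrative only.
pub fn validate(cfg: &TopologyConfigSketch) -> Vec<String> {
    let mut diagnostics = Vec::new();
    if cfg.stability_window == 0 || cfg.stability_window > cfg.observation_window {
        diagnostics.push(format!(
            "stability_window: must be between 1 and observation_window ({})",
            cfg.observation_window
        ));
    }
    let thresholds = [
        ("epsilon", cfg.convergence_epsilon),
        ("adaptive_hints.governor.epsilon", cfg.governor_epsilon),
    ];
    for (name, value) in thresholds {
        // NaN and infinities are rejected along with negative values.
        if !value.is_finite() || value < 0.0 {
            diagnostics.push(format!("{}: must be a finite, non-negative number", name));
        }
    }
    diagnostics
}
```

Returning every diagnostic at once, rather than failing on the first, lets an operator fix all invalid fields in one pass.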

## Sample Evidence

These samples use deterministic fixtures from this change set. They are not
general performance guarantees; they show the expected decision points and the
state a reviewer or operator can inspect.

| Control | Sample Workload | Baseline | With Control | Observable Result |
|---|---|---|---|---|
| ACG convergence | `50` repeated stable prompt observations, `observation_window = 100`, `stability_window = 3`, and `epsilon = 0.001`. | The benchmark fixture processes all 50 observations before the decision path ends. | Convergence is declared after the third stable epoch. | Shown by deterministic fixture: `cargo bench -p nemo-relay-adaptive --bench convergence_bench -- --sample-size 10` prints `observations-to-decision: without=50, with=3`. This does not claim provider token savings, real workload latency gains, or cache-hit economics. |
| Tool drift | First run observes overlapping `search` and `fetch`; next run observes overlapping `compile`, `test`, and `lint`. | The no-drift fixture retains the old `fanout:existing` group while merging newly observed groups. | The drift-enabled fixture starts from an empty plan when centroid movement crosses the test threshold, then stores only current observed groups. | Proven by targeted tests: `process_run_merges_new_cohorts_into_existing_plan` shows retained `fanout:existing`; `process_run_invalidates_existing_plan_when_tool_cohort_topology_drifts` shows `fanout:existing` removed. This does not quantify how often stale plans occur in production. |
| Hint governor | Learned default hints have `latency_sensitivity = 2.0`; governor `epsilon = 10.0`. | Without a governor, learned defaults are injected whenever defaults exist. | The low-sensitivity learned hint is omitted from both header and body; a manual `set_latency_sensitivity(11)` override still forces injection. | Proven by targeted test: `test_adaptive_hints_governor_sheds_low_sensitivity_hints_but_keeps_manual_override`. This claims request metadata hygiene, not measured model latency improvement. |

## Rollout

All topology-aware fields default to disabled. A rollout should enable one
control at a time, validate representative workloads, and use existing adaptive
state inspection to confirm the observable signals above before enabling the
next control.

Recommended rollout order:

1. Enable ACG convergence only for profiles with stable prompts and compare
observations-to-decision against the observation window.
2. Enable tool drift only for agents where stale fan-out plans have been seen or
where tool cohorts are expected to change between phases.
3. Enable hint governor only after learned hints are present and request
metadata volume matters.

If any gate does not show a benefit on the target workload, leave that control
disabled.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -52,7 +52,7 @@ repos:
hooks:
- id: ty
name: ty (type check)
- entry: uv run ty check . --exclude docs/** --exclude fern/** --exclude third_party/** --exclude ./examples/** --exclude .cache/** --exclude .claude/**
+ entry: uv run ty check . --force-exclude --exclude docs/ --exclude fern/ --exclude third_party/ --exclude examples/ --exclude .cache/ --exclude .claude/
language: system
types: [python]
pass_filenames: false