feat(steering): cross-rank applied-action checksum in dynamic status#233

Open
RhizoNymph wants to merge 1 commit into feat/dynamic-steering from feat/steering-determinism-checksum

@RhizoNymph

Owner

Failure mode

Sync capture consumers run identically and independently on every TP rank, in lock-step and with zero communication. Each rank applies the returned steering actions via SteeringModelRunnerMixin._apply_steering_actions. This contract is enforced only by convention: if one rank hits a local fault (for example a consumer OOM, or an exception swallowed at the observer boundary in _run_sync_consumers), its dyn_id allocation and steering tables silently and permanently desync from its siblings, corrupting output with no error surfaced anywhere.
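A toy illustration of this failure mode, not the real code path: `ToyRank`, its counter-based id allocator, and the simulated `MemoryError` are all hypothetical stand-ins chosen only to show how one swallowed exception shifts every later allocation on that rank.

```python
class ToyRank:
    """Hypothetical stand-in for one TP rank's steering bookkeeping."""

    def __init__(self):
        self.next_dyn_id = 0
        self.table = {}  # req_id -> dyn_id

    def step(self, req_id, fault=False):
        try:
            if fault:
                raise MemoryError("simulated consumer OOM")
            # Allocation is purely local: no rank ever talks to a sibling.
            self.table[req_id] = self.next_dyn_id
            self.next_dyn_id += 1
        except MemoryError:
            pass  # swallowed at the boundary, nothing is surfaced


rank0, rank1 = ToyRank(), ToyRank()
for req_id in ["a", "b", "c"]:
    rank0.step(req_id)
    rank1.step(req_id, fault=(req_id == "b"))  # rank 1 faults once

# Every allocation after the fault is shifted on rank 1.
print(rank0.table)
print(rank1.table)
```

After the single fault, the two ranks disagree on the id for request "c" and never reconverge, which is the state the checksum is meant to expose.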

Detector

An always-on, cheap rolling checksum of applied steering actions per worker:

  • Worker (steering_model_runner_mixin.py): _apply_steering_actions folds every action that is actually applied (rejected actions excluded) into self._steering_action_checksum (u64). The fold runs zlib.crc32 over a compact, PYTHONHASHSEED-independent digest of the action content (class, target req_id/config_hash/dyn_id, hook/layer, source, and a bit-exact shape+CRC of any vector/probe payload). Digests are mixed splitmix-style in application order together with a per-drain-batch ordinal, so the same actions applied at a different step produce a different checksum. Digests are bit-exact rather than norm-based because actions are host-side numpy arrays built from rank-identical inputs; a bit-exact comparison is strictly stronger and never legitimately diverges across ranks. Cost is O(applied actions), with zero cost on idle steps. The rollback of a failed declarative override is itself folded. get_dynamic_steering_status exposes action_checksum (hex) and action_count, both picklable primitives.
  • Router (_merge.py + api_router.py): GET /v1/steering/dynamic compares checksums across workers via check_action_determinism. A mismatch does not return a 500: the response carries determinism: {consistent: false, checksums: {...}} and a rate-limited server-side ERROR fires. On a match, the response carries determinism: {consistent: true, action_count: N}.
  • Reuses steering_update_accepted (new thin wrapper over the existing _validate_update) so the batched vector-update path folds exactly the applied set with no duplicated validation logic.
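The worker-side fold described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the helper names (`splitmix64`, `action_digest`, `fold`, `run_rank`), the field encoding, and the action tuples such as `"SetVector"` are hypothetical. Only `zlib.crc32`, the splitmix64 constants, and the NumPy calls are real.

```python
import zlib

import numpy as np

MASK64 = (1 << 64) - 1


def splitmix64(x: int) -> int:
    """One splitmix64 mixing round, truncated to 64 bits."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def action_digest(kind, target, hook, layer, payload=None) -> int:
    """CRC32 over an explicit byte encoding; never uses built-in hash(),
    so the result does not depend on PYTHONHASHSEED."""
    crc = zlib.crc32("\x1f".join([kind, target, hook, str(layer)]).encode("utf-8"))
    if payload is not None:
        arr = np.ascontiguousarray(payload)
        # Bit-exact: dtype and shape header, then the raw bytes of the array.
        crc = zlib.crc32(f"{arr.dtype.str}:{arr.shape}".encode("utf-8"), crc)
        crc = zlib.crc32(arr.tobytes(), crc)
    return crc


def fold(checksum: int, batch_ordinal: int, digest: int) -> int:
    """Order-sensitive fold of one applied action into the rolling u64."""
    return splitmix64(checksum ^ splitmix64((batch_ordinal << 32) | digest))


def run_rank(batches, skip=None):
    """Apply batches in order; `skip` simulates one action lost to a local fault."""
    checksum, count = 0, 0
    for ordinal, batch in enumerate(batches):
        for i, action in enumerate(batch):
            if skip == (ordinal, i):
                continue
            checksum = fold(checksum, ordinal, action_digest(*action))
            count += 1
    return f"{checksum:016x}", count


vec = np.arange(4, dtype=np.float32)
batches = [
    [("SetVector", "req-1", "post_mlp", 3, vec)],
    [("SetScale", "req-1", "post_mlp", 3), ("Clear", "req-2", "post_attn", 7)],
]
healthy = run_rank(batches)
faulty = run_rank(batches, skip=(1, 0))
print(healthy, faulty)
```

Two ranks that apply the same actions in the same batches end up with the same `(checksum, count)` pair, while a rank that drops a single action diverges from that point on.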

Topology scoping / granularity caveat

Comparison is scoped within each PP stage (grouped by pp_rank), mirroring deep_merge_status: TP ranks in a stage own identical layers and must match, while PP stages own disjoint layers and may legitimately differ. Sync-consumer-originated actions only exist at pipeline_parallel_size == 1 (enforced in vllm/v1/capture/registry.py), where this reduces to an all-workers comparison, which is the intended behavior. Granularity is poll-time: a desync is detected on the next status poll, so the checksum bounds the duration of corruption rather than preventing it; pair it with periodic polling.
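The per-stage scoping can be sketched as below. This is an illustration only: the function name `check_determinism_sketch`, the shape of the per-worker status dicts (a `pp_rank` key alongside `action_checksum` and `action_count`), the key format in `checksums`, and the choice to sum one count per stage on a match are assumptions, not the behavior of the PR's check_action_determinism.

```python
from collections import defaultdict


def check_determinism_sketch(statuses):
    """Compare (checksum, count) only among workers that share a pp_rank."""
    by_stage = defaultdict(list)
    for worker_idx, status in enumerate(statuses):
        by_stage[status.get("pp_rank", 0)].append((worker_idx, status))

    mismatched = {}
    total_count = 0
    for pp_rank, members in sorted(by_stage.items()):
        fingerprints = {
            (status["action_checksum"], status["action_count"])
            for _, status in members
        }
        if len(fingerprints) > 1:
            # TP siblings within one stage disagree: report every member.
            for worker_idx, status in members:
                key = f"pp{pp_rank}/worker{worker_idx}"
                mismatched[key] = status["action_checksum"]
        else:
            total_count += members[0][1]["action_count"]

    if mismatched:
        return {"consistent": False, "checksums": mismatched}
    return {"consistent": True, "action_count": total_count}


tp2_ok = [
    {"pp_rank": 0, "action_checksum": "00ab", "action_count": 3},
    {"pp_rank": 0, "action_checksum": "00ab", "action_count": 3},
]
tp2_bad = [
    {"pp_rank": 0, "action_checksum": "00ab", "action_count": 3},
    {"pp_rank": 0, "action_checksum": "00cd", "action_count": 2},
]
pp2_ok = [
    {"pp_rank": 0, "action_checksum": "00ab", "action_count": 3},
    {"pp_rank": 1, "action_checksum": "00ef", "action_count": 1},
]
print(check_determinism_sketch(tp2_ok))
print(check_determinism_sketch(tp2_bad))
print(check_determinism_sketch(pp2_ok))
```

Differing checksums across PP stages do not count as a mismatch here, while any disagreement between TP siblings in the same stage does.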

Docs

New docs/design/dynamic_steering.md §6.1 documents the detector and its poll-time granularity.

Expected sibling-PR conflicts

  • chore/steering-row-owner touches steering_manager.py and the mixin's scale/monitor apply methods (different regions of the same file).
  • Docs conflicts possible in docs/design/dynamic_steering.md with chore/steering-trust-hardening.
