@diegomrsantos
Member

@diegomrsantos diegomrsantos commented Dec 16, 2025

Issue Addressed

Implements the subnet topology fork transition (SIP-43) to safely migrate the network from committee-ID-based to operator-ID-based (MinHash topology) subnet calculation.

Proposed Changes

The Problem: The network needs to change how it calculates which subnet (gossipsub topic) a committee belongs to. This affects every network operation: subscription, publishing, message validation, and peer scoring. Getting this wrong causes network partitions, message loss, or incorrect peer punishment.

Centralized Routing Policy: Instead of scattering fork logic across components, this PR creates a single routing policy module (routing.rs) that answers four questions for any epoch:

  1. Which schema to publish with? → publish_schema()
  2. Which schemas to subscribe to? → subscribe_schemas()
  3. Which schemas to accept messages from (gossipsub)? → accept_schemas()
  4. Which schemas to process messages for (committee work)? → process_schemas()

SIP-43 Compliant Fork Timeline (F = fork_epoch):

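| Epoch | Subscribe   | Publish | Accept (gossipsub) | Process |
|-------|-------------|---------|--------------------|---------|
| F-3   | {Pre}       | Pre     | {Pre}              | {Pre}   |
| F-2   | {Pre, Post} | Pre     | {Pre, Post}        | {Pre}   |
| F-1   | {Pre, Post} | Pre     | {Pre, Post}        | {Pre}   |
| F     | {Post}      | Post    | {Post}             | {Post}  |
| F+1   | {Post}      | Post    | {Post}             | {Post}  |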

Key behaviors per SIP-43:

  • PRIOR_WINDOW = 2: Start dual-subscribing 2 epochs before fork
  • Accept during pre-subscribe: Accept PostFork messages for gossipsub propagation during F-2 and F-1 to warm up the mesh
  • No PostFork processing before the fork: Only PreFork messages are processed until the fork epoch
  • Immediate cutoff at fork: No grace period after fork
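
The timeline and behaviors above can be sketched as four pure functions of the current epoch. This is a minimal illustration, not the contents of routing.rs: the `Schema` enum, the plain `u64` epochs, the `Vec` return types, and the two private helpers are assumptions made for the example, and the PR's own types and signatures may differ.

```rust
/// Subnet calculation scheme (hypothetical stand-in for the PR's schema type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Schema {
    Pre,
    Post,
}

/// Number of epochs before the fork at which dual subscription starts.
pub const PRIOR_WINDOW: u64 = 2;

/// `fork_epoch == None` models an unconfigured fork (pre-fork forever).
fn at_or_after_fork(epoch: u64, fork_epoch: Option<u64>) -> bool {
    matches!(fork_epoch, Some(f) if epoch >= f)
}

/// True for the epochs F-2 and F-1.
fn in_prior_window(epoch: u64, fork_epoch: Option<u64>) -> bool {
    matches!(fork_epoch, Some(f) if epoch < f && epoch >= f.saturating_sub(PRIOR_WINDOW))
}

/// Which schema to publish with.
pub fn publish_schema(epoch: u64, fork_epoch: Option<u64>) -> Schema {
    if at_or_after_fork(epoch, fork_epoch) {
        Schema::Post
    } else {
        Schema::Pre
    }
}

/// Which schemas to subscribe to.
pub fn subscribe_schemas(epoch: u64, fork_epoch: Option<u64>) -> Vec<Schema> {
    if at_or_after_fork(epoch, fork_epoch) {
        vec![Schema::Post]
    } else if in_prior_window(epoch, fork_epoch) {
        vec![Schema::Pre, Schema::Post]
    } else {
        vec![Schema::Pre]
    }
}

/// Which schemas to accept at the gossipsub level.
/// In the timeline table this column matches the Subscribe column.
pub fn accept_schemas(epoch: u64, fork_epoch: Option<u64>) -> Vec<Schema> {
    subscribe_schemas(epoch, fork_epoch)
}

/// Which schemas to process for committee work.
/// In the timeline table this is always the single publishing schema.
pub fn process_schemas(epoch: u64, fork_epoch: Option<u64>) -> Vec<Schema> {
    vec![publish_schema(epoch, fork_epoch)]
}
```

With `fork_epoch = Some(10)`, epochs 8 and 9 subscribe to and accept both schemas while still publishing and processing only `Pre`; from epoch 10 onward every function returns `Post` only.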

Fork-Aware Network Operations: Updated all network components to be fork-aware:

  • Subnet subscription handles dual-subscription during the 2-epoch transition window
  • Message publishing switches algorithms at the fork epoch
  • Message validation accepts messages from appropriate schemas, distinguishing between gossipsub acceptance and processing
  • SubnetRouter provides centralized routing decisions shared across components

Type Safety Improvements:

  • Use NonZeroU64 for subnet_count so that a zero divisor is unrepresentable and division by zero is ruled out by the type system
  • Proper Result error handling with map_err instead of expect in production code
  • SUBNET_COUNT_NZ constant for safe subnet count usage
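
As a rough sketch of the idea behind these bullets: a `NonZeroU64` divisor makes the modulo infallible, and untrusted input is converted once at the boundary with a `Result`. The function names `subnet_for` and `parse_subnet_count` are hypothetical, and the constant below only mirrors the name used in the PR; its real definition is not shown in this conversation.

```rust
use std::num::NonZeroU64;

/// Compile-time checked constant: a zero value here would fail the build.
/// 128 is the modulus quoted in the PR description.
pub const SUBNET_COUNT_NZ: NonZeroU64 = match NonZeroU64::new(128) {
    Some(n) => n,
    None => panic!("subnet count must be non-zero"),
};

/// The modulo cannot panic because the divisor type excludes zero.
pub fn subnet_for(value: u64, subnet_count: NonZeroU64) -> u64 {
    value % subnet_count.get()
}

/// Convert untrusted input once, returning an error instead of panicking.
pub fn parse_subnet_count(raw: u64) -> Result<NonZeroU64, String> {
    NonZeroU64::new(raw).ok_or_else(|| "subnet count must be non-zero".to_string())
}
```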

Configuration: Added subnet_topology_fork_epoch as an optional network configuration parameter. When unset, the system operates in pre-fork mode indefinitely.

Additional Info

Design Principle: Centralized policy prevents subtle inconsistencies (e.g., subscription using one epoch threshold, validation using another) that could cause hard-to-debug network issues.

Safety: Message validation uses "Ignore" (not "Reject") for topic mismatches to avoid peer punishment during contentious upgrades or configuration issues.
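
The Ignore-versus-Reject choice can be illustrated with a small mapping function. This is a sketch under assumptions: `Acceptance` is a local stand-in for the three gossipsub outcomes, `ValidationFailure::IncorrectTopic` is named in this PR's commit messages, and the `MalformedMessage` variant is hypothetical and included only to contrast with a failure that does warrant Reject.

```rust
/// Local stand-in for the three gossipsub validation outcomes.
#[derive(Debug, PartialEq, Eq)]
pub enum Acceptance {
    /// Forward the message to other peers.
    Accept,
    /// Drop the message without penalizing the sender.
    Ignore,
    /// Drop the message and penalize the sender.
    Reject,
}

#[derive(Debug)]
pub enum ValidationFailure {
    /// Message arrived on a topic that no accepted schema maps to.
    IncorrectTopic,
    /// Hypothetical example of a failure that is unambiguously the sender's fault.
    MalformedMessage,
}

pub fn acceptance_for(result: Result<(), ValidationFailure>) -> Acceptance {
    match result {
        Ok(()) => Acceptance::Accept,
        // A topic mismatch may only mean the peer disagrees about the fork
        // epoch, so the message is dropped without affecting the peer's score.
        Err(ValidationFailure::IncorrectTopic) => Acceptance::Ignore,
        Err(ValidationFailure::MalformedMessage) => Acceptance::Reject,
    }
}
```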

MinHash Algorithm: Implements the algorithm specified in SIP-43: SHA-256-hash each operator ID (encoded as a little-endian u64), take the minimum hash, and return min_hash % 128.
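
A minimal sketch of the MinHash steps, written against a pluggable hash function so that it stays free of external crates. Several points are assumptions rather than statements about SIP-43 or this PR: the digest is compared and reduced as a big-endian integer, an empty operator list yields `None` (the PR reports `SubnetCalculationError::EmptyOperatorList`), and the names `min_hash_subnet` and `digest_mod` are hypothetical. A real implementation would pass a SHA-256 function as `hash`.

```rust
use std::num::NonZeroU64;

/// Reduce a 32-byte digest modulo `n`, reading the digest as a big-endian
/// integer (an assumption; the byte order is not stated in this PR).
fn digest_mod(digest: &[u8; 32], n: NonZeroU64) -> u64 {
    let n = u128::from(n.get());
    let mut acc: u128 = 0;
    for &byte in digest {
        // Horner's rule keeps the accumulator below `n` at every step.
        acc = (acc * 256 + u128::from(byte)) % n;
    }
    acc as u64
}

/// Hash each operator ID (little-endian u64 bytes), take the minimum digest,
/// and reduce it modulo the subnet count. Returns `None` for an empty list.
pub fn min_hash_subnet<H>(
    operator_ids: &[u64],
    subnet_count: NonZeroU64,
    hash: H,
) -> Option<u64>
where
    H: Fn(&[u8]) -> [u8; 32],
{
    operator_ids
        .iter()
        .map(|id| hash(&id.to_le_bytes()[..]))
        // Lexicographic order on byte arrays equals big-endian numeric order.
        .min()
        .map(|min_digest| digest_mod(&min_digest, subnet_count))
}
```

Because only the minimum digest matters, the result does not depend on the order in which operator IDs are supplied.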

@diegomrsantos diegomrsantos self-assigned this Dec 16, 2025
@claude-code-actions-sigp

claude-code-actions-sigp bot commented Dec 16, 2025

Claude encountered an error —— View job


I'll analyze this and get back to you.

@diegomrsantos diegomrsantos marked this pull request as ready for review December 16, 2025 13:20
@claude-code-actions-sigp

claude-code-actions-sigp bot commented Dec 16, 2025

Claude finished @diegomrsantos's task in 1m 49s —— View job


Code Review Complete

Todo List

  • Read repository guidelines (CLAUDE.md)
  • Review core routing policy module (routing.rs)
  • Review subnet_service changes
  • Review message_validator fork-aware validation
  • Review message_sender fork-aware publishing
  • Review configuration changes
  • Check test coverage
  • Provide comprehensive feedback

Summary

This PR implements a well-designed fork transition mechanism for migrating from committee-ID-based to operator-ID-based subnet routing. The centralized routing policy approach is sound and addresses a complex distributed systems challenge. The implementation demonstrates good practices with comprehensive tests and clear documentation.

Strengths

  1. Excellent architectural approach: The centralized routing policy module (routing.rs) is a textbook example of good design. It provides a single source of truth for fork-related decisions and prevents subtle inconsistencies.

  2. Comprehensive documentation: The timeline table in routing.rs:10-20 and detailed function documentation make the fork behavior crystal clear.

  3. Strong test coverage: The routing module has excellent test coverage including edge cases (pre-fork forever, node starts after fork, strict N-1 accept policy).

  4. Proper error handling: Uses Ignore instead of Reject for topic mismatches during transitions (message_validator/src/lib.rs:421), preventing peer punishment during contentious upgrades.

  5. Consistent epoch calculation: The current_epoch() helper ensures all components compute epochs identically.

Issues Identified

Critical

  1. Potential clock skew issues (routing.rs:291)

The current_epoch() function returns None before genesis, which is handled correctly. However, there's no explicit handling for clock skew between nodes during the transition period. Consider documenting the expected clock synchronization requirements.

Location: anchor/subnet_service/src/routing.rs:291

pub fn current_epoch(slot_clock: &impl SlotClock, slots_per_epoch: u64) -> Option<Epoch> {
    slot_clock.now().map(|slot| slot.epoch(slots_per_epoch))
}

Recommendation: Add a doc comment explaining that nodes must have synchronized clocks (within typical NTP bounds) for the fork transition to work correctly.


High Priority

  1. Missing integration tests for fork transition

While the routing module has excellent unit tests, there are no integration tests that verify the end-to-end behavior across multiple components during a fork transition. Specifically:

  • No tests verifying that subnet_service, message_validator, and message_sender all agree on subnet selection during transition periods
  • No tests simulating a network with mixed fork-aware and pre-fork nodes

Recommendation: Add integration tests that simulate:

  • A node going through the complete fork timeline
  • Message validation during the transition period with dual-subscription
  • Publishing algorithm switch at fork epoch

  2. Hardcoded transition constants (routing.rs:50, 56)

pub const PRE_SUBSCRIBE_EPOCHS: u64 = 1;
pub const POST_GRACE_EPOCHS: u64 = 2;

The PR description mentions these are "hardcoded based on network requirements," but there's no documentation explaining:

  • Why these specific values were chosen
  • What happens if they need to change in the future
  • Whether different networks might need different values

Recommendation: Add doc comments explaining the rationale for these values and the constraints that determined them.


  3. Error handling in calculate_subnet_for_committee (subnet_service/src/lib.rs:499-535)

When calculating subnet for PostFork schema, if the operator list is empty, it returns SubnetCalculationError::EmptyOperatorList. However, callers handle this by continue-ing to the next schema. During the grace period, this could silently skip committees that haven't been properly configured.

Location: anchor/subnet_service/src/lib.rs:525-527

Recommendation: Add logging at warn or error level when a committee cannot be calculated for PostFork schema, so operators can detect configuration issues before the grace period ends.


Medium Priority

  1. Inconsistent error handling patterns

In message_validator/src/lib.rs:406-419, the subnet calculation error is logged but validation continues. However, in message_sender/src/network.rs:164-173, the error causes an early return and message drop.

This inconsistency is intentional (validator is more permissive, sender is strict), but it's not documented.

Recommendation: Add comments explaining why the validator ignores calculation errors (to handle misconfigured peers gracefully) while the sender fails fast (to avoid sending invalid messages).


  2. No metrics for fork transition progress

There are no metrics to monitor:

  • Current fork phase (pre-fork, transition, grace, post-grace)
  • Number of messages validated under each schema
  • Subnet calculation failures

Recommendation: Add Prometheus metrics for fork transition observability.


  3. Missing validation for fork_epoch configuration (ssv_network_config/src/lib.rs:42-44)

The subnet_topology_fork_epoch field is optional but there's no validation that:

  • The fork epoch is in the future (setting it to a past epoch could cause immediate dual-subscription)
  • The fork epoch is reasonable (not too close to current epoch)

Recommendation: Add validation in the configuration loading logic or document expected operator behavior.
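
One possible shape for such a check is a pure classification function that the configuration loader could call and log from. Everything here is hypothetical: the `ForkEpochCheck` enum, the `check_fork_epoch` name, and the `min_lead_epochs` threshold are illustrative and do not appear in the PR. Since a node may legitimately start after the fork, a past fork epoch is classified rather than treated as a hard error.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ForkEpochCheck {
    /// No fork epoch configured: pre-fork operation indefinitely.
    NotConfigured,
    /// Fork epoch is at or before the current epoch.
    AlreadyActive,
    /// Fork is closer than the requested lead time.
    TooSoon { epochs_until_fork: u64 },
    /// Fork is comfortably in the future.
    Scheduled,
}

pub fn check_fork_epoch(
    fork_epoch: Option<u64>,
    current_epoch: u64,
    min_lead_epochs: u64,
) -> ForkEpochCheck {
    match fork_epoch {
        None => ForkEpochCheck::NotConfigured,
        // checked_sub returns None when the fork epoch is already in the past.
        Some(fork) => match fork.checked_sub(current_epoch) {
            None | Some(0) => ForkEpochCheck::AlreadyActive,
            Some(distance) if distance < min_lead_epochs => ForkEpochCheck::TooSoon {
                epochs_until_fork: distance,
            },
            Some(_) => ForkEpochCheck::Scheduled,
        },
    }
}
```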


Low Priority

  1. Doctest marked as ignore (commit 35c1502)

The commit message mentions marking current_epoch doctest as ignore. The doctest in routing.rs:280-290 uses ignore because it requires mocking a SlotClock implementation.

Recommendation: Consider creating a test helper that provides a mock SlotClock so the doctest can run, or convert it to a regular unit test.


  2. Code duplication in schema iteration

The pattern for iterating through active schemas is repeated in multiple places:

let (schema_count, schemas) = routing::accept_schemas(epoch, fork_schedule);
for schema in schemas.iter().take(schema_count) {
    // ... use schema
}

Recommendation: Consider adding a helper method that returns an iterator or slice to reduce repetition and potential errors.
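
A helper along these lines might wrap the count and the fixed-size array in one small type, so callers cannot forget the `take(schema_count)` step. This is a sketch only: `ActiveSchemas` and its methods are hypothetical names, and `SubnetSchema` is redeclared locally so the example stands alone.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubnetSchema {
    PreFork,
    PostFork,
}

/// Stack-allocated set of one or two active schemas (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct ActiveSchemas {
    len: usize,
    items: [SubnetSchema; 2],
}

impl ActiveSchemas {
    pub fn one(schema: SubnetSchema) -> Self {
        // The second slot is padding and is never exposed.
        Self { len: 1, items: [schema, schema] }
    }

    pub fn two(first: SubnetSchema, second: SubnetSchema) -> Self {
        Self { len: 2, items: [first, second] }
    }

    /// Only the valid prefix is visible to callers.
    pub fn as_slice(&self) -> &[SubnetSchema] {
        &self.items[..self.len]
    }

    pub fn iter(&self) -> impl Iterator<Item = SubnetSchema> + '_ {
        self.as_slice().iter().copied()
    }
}
```

Call sites then reduce to `for schema in schemas.iter() { ... }`, with no separate count to keep in sync and no heap allocation.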


  3. Type safety for SubnetId (subnet_service/src/lib.rs:31-103)

SubnetId uses Deref to expose the inner u64, which makes it easy to accidentally use the raw value instead of the SubnetId type. This reduces type safety benefits.

Recommendation: Consider removing the Deref impl and providing explicit accessor methods. This is a minor point as the current implementation is acceptable.
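
For illustration, a newtype without `Deref` could look like the sketch below. The accessor name `as_u64` and the `From` impl are assumptions made for the example; the PR's actual `SubnetId` API is not shown in this conversation.

```rust
/// Newtype with no `Deref`: the raw value is only reachable on request.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct SubnetId(u64);

impl SubnetId {
    pub const fn new(value: u64) -> Self {
        Self(value)
    }

    /// Explicit accessor: every use of the raw integer is visible at the call site.
    pub const fn as_u64(self) -> u64 {
        self.0
    }
}

impl From<SubnetId> for u64 {
    fn from(id: SubnetId) -> u64 {
        id.0
    }
}
```

Without `Deref`, an expression such as `*subnet_id + 1` no longer compiles, so arithmetic on subnet identifiers has to go through `as_u64()` deliberately.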


Code Quality

Positive

  • ✅ Follows Rust idioms and best practices
  • ✅ Comprehensive error types with context
  • ✅ Good use of const generics and type safety
  • ✅ Clear function documentation
  • ✅ Proper use of Option for fork epoch (pre-fork vs configured fork)
  • ✅ Avoids allocations with fixed-size arrays for schemas

Areas for Improvement

  • ⚠️ Limited observability (metrics) for fork transition
  • ⚠️ No integration tests for fork transition scenarios
  • ⚠️ Some code duplication in schema iteration patterns

Security

No security vulnerabilities identified. The implementation correctly:

  • Uses Ignore instead of Reject for ambiguous cases
  • Validates topic subscription before processing messages
  • Prevents misconfigured nodes from causing network splits

Performance

The implementation is efficient:

  • ✅ Uses stack-allocated arrays instead of heap allocations for schema lists
  • ✅ Epoch calculation is done once per message
  • ✅ Subnet calculation uses efficient hashing (SHA256 + modulo)
  • ✅ No unnecessary cloning or allocation in hot paths

One minor optimization opportunity: The calculate_subnet_for_committee function is called multiple times for the same committee during validation/scoring. Consider caching results if this becomes a bottleneck.


Testing

Strengths:

  • Excellent unit test coverage for routing logic
  • Tests cover edge cases (pre-fork forever, post-fork start, strict N-1 accept)
  • Tests verify order independence of operator IDs for MinHash

Gaps:

  • No integration tests for fork transition
  • No tests simulating clock skew scenarios
  • No tests for configuration validation

Documentation

Strengths:

  • Clear PR description explaining the problem and solution
  • Excellent inline documentation in routing.rs
  • Good use of doc comments with examples

Improvements needed:

  • Document clock synchronization requirements
  • Explain rationale for hardcoded transition constants
  • Add operator guide for fork transition

Recommendations

  1. Before merge (high priority):

    • Add doc comments explaining the rationale for PRE_SUBSCRIBE_EPOCHS and POST_GRACE_EPOCHS
    • Document clock synchronization requirements
    • Add warning-level logging when subnet calculation fails for PostFork schema
  2. Future work (medium priority):

    • Add integration tests for fork transition
    • Add Prometheus metrics for fork observability
    • Add configuration validation for fork_epoch
    • Consider reducing code duplication in schema iteration
  3. Nice to have (low priority):

    • Make doctests runnable or convert to unit tests
    • Add helper methods to reduce schema iteration boilerplate

Conclusion

This is a well-designed and well-implemented feature. The centralized routing policy approach is the right solution for this problem, and the implementation demonstrates careful consideration of edge cases and failure modes. The identified issues are mostly about observability, testing, and documentation rather than fundamental design flaws.

The code is production-ready with minor improvements for observability and testing.

Recommendation: Approve with minor improvements


@diegomrsantos diegomrsantos marked this pull request as draft December 16, 2025 13:26
diegomrsantos added a commit to diegomrsantos/anchor that referenced this pull request Dec 16, 2025
This commit addresses the three high-priority items from PR sigp#754 code review:

1. Enhanced documentation for PRE_SUBSCRIBE_EPOCHS and POST_GRACE_EPOCHS
   - Added detailed rationale explaining why these specific values (1 and 2 epochs)
   - Documented trade-offs between safety and resource efficiency
   - Explained operational constraints that determined these values

2. Documented clock synchronization requirements in current_epoch()
   - Made explicit that nodes must maintain NTP-synchronized clocks
   - Specified tolerance bounds (~100ms acceptable for 6.4-minute epochs)
   - Explained consequences of poor synchronization during fork transition

3. Added warning logging for PostFork subnet calculation failures
   - When empty operator list prevents PostFork subnet calculation in scoring
   - Helps operators detect configuration issues during grace period
   - Provides visibility into potential problems before grace period ends

All changes are documentation and logging improvements with no functional changes.
Tests pass and code formatting has been applied.
diegomrsantos added a commit to diegomrsantos/anchor that referenced this pull request Dec 16, 2025
Add TODO comments to all #[allow(clippy::too_many_arguments)] suppressions
referencing issue sigp#755 which proposes using ForkContext to reduce parameter
counts in fork-related code.

Changes:
- subnet_service: 3 locations (start_subnet_service, subnet_service, handle_subnet_changes)
- message_validator: 1 location (Validator::new - includes fork args added in PR sigp#754)
- validator_store: 2 locations (AnchorValidatorStore::new, collect_signature)

These TODOs document the technical debt and provide a path forward for
refactoring without blocking the fork transition work.
@diegomrsantos diegomrsantos force-pushed the feat/subnet-topology-fork-transition branch from f94bfa7 to b0ee4d8 Compare December 16, 2025 17:03
Create subnet_service::routing module that provides a single source of
truth for all fork-related routing decisions.

Key features:
- SubnetSchema enum (PreFork/PostFork) representing subnet calculation algorithms
- ForkSchedule struct with hardcoded transition windows (1 epoch pre-subscribe, 2 epochs grace)
- Decision functions: publish_schema(), subscribe_schemas(), accept_schemas()
- Consistency helper: current_epoch() for uniform epoch calculation
- Comprehensive unit tests covering full timeline and edge cases

Timeline (F = fork_epoch):
- Before F-1: PreFork only
- F-1: Dual-subscribe (strict PreFork accept to prevent early migration)
- F to F+2: Publish PostFork, accept both (grace period)
- After F+2: PostFork only

This centralizes fork logic that was scattered across multiple components
in the Go implementation, making transitions easier to reason about and modify.

Add `subnet_topology_fork_epoch: Option<Epoch>` field to `SsvNetworkConfig`
to support configuring when the subnet topology fork activates.

- Added `types` dependency to ssv_network_config
- Updated `constant()` method to set fork epoch to None for hardcoded networks
- Updated `load()` method to optionally read from ssv_subnet_topology_fork_epoch.txt

None means pre-fork operation (fork not configured/activated).

Pass ForkSchedule from client to subnet_service.

- Import ForkSchedule in client
- Create ForkSchedule from config.global_config.ssv_network.subnet_topology_fork_epoch
- Add fork_schedule parameter to start_subnet_service() signature
- Thread fork_schedule through to subnet_service async function
- Fork schedule will be used for dual-subscription logic in next commit

- Modified handle_subnet_changes() to support both PreFork and PostFork schemas
- Added fork_schedule and slot_clock parameters to determine active schemas
- Calculate subnets for each active schema based on current epoch
- Use routing::subscribe_schemas() to get transition state
- Handle empty operator lists with appropriate warnings
- Added #[allow(clippy::too_many_arguments)] attributes where necessary

…e publishing

This commit completes Phase 5 of the fork transition implementation:

1. Created centralized calculate_subnet_for_committee() function in
   subnet_service/src/lib.rs that handles subnet calculation for both
   PreFork (committee-ID based) and PostFork (operator-ID based) schemas.

2. Refactored handle_subnet_changes() to use the centralized helper,
   eliminating duplicated operator ID lookup and deduplication logic.

3. Extended NetworkMessageSenderConfig with fork-related fields:
   - fork_schedule: ForkSchedule
   - slot_clock: S
   - slots_per_epoch: u64
   - network_state_rx: watch::Receiver<NetworkState>

4. Modified do_send() in message_sender to:
   - Determine publishing schema using routing::publish_schema()
   - Calculate subnet using centralized helper
   - Properly handle errors for empty operator lists

5. Wired fork_schedule creation in client initialization and passed
   fork-related state to NetworkMessageSenderConfig.

The centralized approach ensures consistency between subscription and
publishing paths, with a single source of truth for subnet calculation
logic.

The get_committee_info_for_subnet function was using only PreFork
(committee-ID) algorithm to identify which committees map to a subnet.
This caused incorrect message rate calculations for topic scoring
post-fork and during the grace period.

Now uses accept_schemas() to determine which routing schemas are valid
at the current epoch, and checks if each cluster maps to the target
subnet under ANY accepted schema. During the grace period (epochs F to F+2),
committees are included if they map via either PreFork OR PostFork.

This conservative approach ensures we don't under-estimate message rates
during the transition, which would cause incorrect peer scoring.

Thread topic through message_receiver to message_validator and validate that
messages arrive on the correct subnet topic based on fork-aware routing rules.

Changes:
- message_receiver: pass topic to validator
- message_validator: add topic parameter, validate subnet correctness
- message_validator: check committee maps to received subnet under any accepted schema
- message_sender: construct TopicHash for outgoing validation
- client: wire fork_schedule and subnet_count to Validator, fix initialization order

Returns ValidationFailure::IncorrectTopic (maps to Ignore) for wrong topics to
avoid peer punishment during transitions.

Remove temporary analysis files and directories used during development:
- fork_transition_detailed_plan.md
- ssv_analysis_data/ directory with all analysis scripts and results

Apply cargo fmt formatting fixes to client and message_validator.

The doctest example for current_epoch cannot compile without significant
setup overhead. Mark it as ignore since it's illustrative only.
@diegomrsantos diegomrsantos force-pushed the feat/subnet-topology-fork-transition branch from b0ee4d8 to a18d133 Compare January 2, 2026 20:22
diegomrsantos and others added 8 commits January 2, 2026 17:57
- Change PRE_SUBSCRIBE_EPOCHS from 1 to 2 (SIP-43 PRIOR_WINDOW)
- Remove POST_GRACE_EPOCHS (no grace period after fork per SIP-43)
- Add process_schemas() to distinguish gossipsub acceptance from processing
- Update accept_schemas() to accept both schemas during pre-subscribe window
- Update subscribe_schemas() to only subscribe PostFork at/after fork

New timeline (SIP-43 compliant):
| Epoch | Subscribe      | Publish | Accept     | Process |
|-------|----------------|---------|------------|---------|
| F-3   | {Pre}          | Pre     | {Pre}      | {Pre}   |
| F-2   | {Pre, Post}    | Pre     | {Pre,Post} | {Pre}   |
| F-1   | {Pre, Post}    | Pre     | {Pre,Post} | {Pre}   |
| F     | {Post}         | Post    | {Post}     | {Post}  |

Key SIP-43 behaviors:
- Accept PostFork messages during pre-subscribe for gossipsub propagation
- Only process PreFork messages until fork epoch
- Immediate cutoff at fork (no grace period)

Add matched_schema field to ValidatedMessage to track which SubnetSchema
the message was validated against. This enables the message_receiver to
distinguish between messages that should be processed vs. only propagated
during the pre-subscribe window (F-2, F-1).

During the pre-subscribe window, PostFork messages are accepted for
gossipsub propagation but should not be processed until fork epoch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add fork-aware message processing logic to NetworkMessageReceiver:
- Accept PostFork messages during pre-subscribe window (F-2, F-1) for
  gossipsub propagation
- Only process messages matching process_schemas for committee work
- Before fork: only process PreFork messages
- At/after fork: only process PostFork messages

This implements the SIP-43 requirement that during the pre-subscribe
window, nodes should accept and propagate PostFork messages but not
process them for consensus until the fork epoch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Eliminate duplication where message_sender, message_validator, and
message_receiver each had their own copies of fork-related dependencies
(network_state_rx, fork_schedule, slot_clock, slots_per_epoch).

Changes:
- Add SubnetRouter struct to routing.rs with methods:
  - publish_subnet(): calculate subnet for publishing
  - validate_subnet(): validate incoming messages against accepted schemas
  - should_process(): determine if message should be processed
- Update message_sender to use router instead of 5 separate params
- Update message_validator to use router for network state and validation
- Update message_receiver to use router.should_process()
- Create single SubnetRouter instance in client and inject into components

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Previously, subnet subscriptions were only recomputed when the database
changed. This meant fork transitions (from PreFork to PostFork schemas)
would not happen unless the DB changed, leaving nodes on wrong topics
and missing messages.

Now the epoch boundary handler always runs and checks if we're in the
fork transition window [F-2, F]. If so, it triggers subscription
recomputation regardless of DB changes.

Changes:
- Add is_fork_transition_epoch() helper to detect fork window
- Always handle epoch boundaries (not just when scoring enabled)
- Only trigger subscription changes during fork transition epochs
- Add comprehensive tests for fork transition detection
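
The window check described in this commit could be sketched as follows. The function name comes from the commit message, but the signature, the plain `u64` epochs, and the use of `saturating_sub` for forks scheduled very close to genesis are assumptions made for the example.

```rust
/// Epochs of dual subscription before the fork (SIP-43 PRIOR_WINDOW).
pub const PRE_SUBSCRIBE_EPOCHS: u64 = 2;

/// True when `epoch` lies in the inclusive window [F-2, F], which is when
/// subscriptions must be recomputed even if the database has not changed.
pub fn is_fork_transition_epoch(epoch: u64, fork_epoch: Option<u64>) -> bool {
    match fork_epoch {
        Some(fork) => {
            // saturating_sub keeps the window valid when the fork epoch is 0 or 1.
            let window_start = fork.saturating_sub(PRE_SUBSCRIBE_EPOCHS);
            epoch >= window_start && epoch <= fork
        }
        // No fork configured: there is never a transition.
        None => false,
    }
}
```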

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Rename handle_epoch_committee_update to send_scoring_rate_updates for clarity
- Add comprehensive docstring explaining gossipsub scoring context
- Improve inline comments explaining why rate updates are conditional
- Update debug message to better describe the operation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
mergify bot pushed a commit that referenced this pull request Jan 5, 2026
#767)

Preparatory refactor for the subnet topology fork transition work (#754).


  Split the 539-line monolithic `lib.rs` into focused modules with clear responsibilities:

| Module | Responsibility | Key Types/Functions |
|--------|---------------|---------------------|
| `subnet.rs` | Core types and calculation algorithms | `SubnetId`, `SubnetEvent`, `from_committee_alan()`, `from_operators()` |
| `service.rs` | Background service managing subscriptions | `SubnetService` struct, `start_subnet_service()` |
| `scoring.rs` | Message rate calculation for gossipsub | `calculate_message_rate_for_subnet()`, `get_committee_info_for_subnet()` |
| `lib.rs` | Public API re-exports only | — |

### Mental Model

```
subnet_service/
├── lib.rs          → Public API surface (re-exports)
├── subnet.rs       → WHAT a subnet IS (types, algorithms)
├── service.rs      → WHAT HAPPENS with subnets (lifecycle, events)
├── scoring.rs      → HOW to score subnets (message rates)
└── message_rate.rs → (unchanged) rate calculation math
```

### Why This Structure?

1. **Single Responsibility**: Each module has one clear purpose
2. **Easier Navigation**: Finding code by concept rather than scrolling
3. **Fork Transition Ready**: PR #754 can cleanly add `routing.rs` for fork-aware logic without touching unrelated code
4. **Encapsulation**: `SubnetService` struct groups related state and methods

### Cleanup

- Removed unused `test_tracker()` function (was defined but never called)


Co-Authored-By: diego <diego@sigmaprime.io>

Resolves merge conflicts by:
- Keeping modular structure from upstream (routing.rs, service.rs, subnet.rs, scoring.rs)
- Adding fork-aware features (ForkSchedule, SchemaSet, dual-subscription)
- Preserving clean run_scoring_loop/run_monitoring_loop structure
- Using fork-aware calculate_subnet_for_committee for multi-schema support
@diegomrsantos diegomrsantos force-pushed the feat/subnet-topology-fork-transition branch from e3fdf57 to 5f5ae8b Compare January 5, 2026 17:20
All clusters with the same committee_id have identical operators
(committee_id is derived from the operator set), so we only need
to look up one cluster instead of collecting from all and deduping.
diegomrsantos and others added 5 commits January 5, 2026 16:13
- Use NonZeroU64 for subnet_count to prevent division by zero at compile time
- Add SUBNET_COUNT_NZ constant for safe subnet count usage
- Change from_committee_alan to return Result like from_operators
- Use map_err instead of expect for production safety
- Optimize validate_subnet to accept operators directly, avoiding redundant DB lookup
- Fix impostor sender to log warnings instead of silently ignoring errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

During the dual-subscribe window, a PostFork calculation error (e.g.,
EmptyOperatorList) would previously cause immediate error propagation,
potentially rejecting messages before checking all schemas. Now errors
are handled per-schema with continue, ensuring we only return
IncorrectTopic when no schema matches.
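
The per-schema handling described in this commit might look roughly like the sketch below. The types are local stand-ins, and passing the subnet calculation in as a closure is an assumption made to keep the example self-contained; the real `validate_subnet` takes different arguments.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubnetSchema {
    PreFork,
    PostFork,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SubnetCalculationError {
    EmptyOperatorList,
}

/// Returned only when no accepted schema maps to the received subnet.
#[derive(Debug, PartialEq, Eq)]
pub struct IncorrectTopic;

/// Try every accepted schema and return the first one that matches.
pub fn validate_subnet<F>(
    received_subnet: u64,
    accepted: &[SubnetSchema],
    calculate: F,
) -> Result<SubnetSchema, IncorrectTopic>
where
    F: Fn(SubnetSchema) -> Result<u64, SubnetCalculationError>,
{
    for &schema in accepted {
        match calculate(schema) {
            Ok(expected) if expected == received_subnet => return Ok(schema),
            // Computed a different subnet under this schema: try the next one.
            Ok(_) => continue,
            // A failure under one schema must not veto the remaining schemas.
            Err(_) => continue,
        }
    }
    Err(IncorrectTopic)
}
```
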
operatorId - 1 could underflow if the contract emits operatorId == 0,
producing false "missing events" failures. Use checked_sub to safely
handle this edge case.
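
A small illustration of the `checked_sub` pattern, using a hypothetical continuity check (`is_next_operator` is not a function from the PR; the actual "missing events" logic is not shown here):

```rust
/// Returns true when `operator_id` directly follows `last_seen`.
pub fn is_next_operator(last_seen: u64, operator_id: u64) -> bool {
    match operator_id.checked_sub(1) {
        Some(previous) => previous == last_seen,
        // operator_id == 0 has no predecessor; a plain `operator_id - 1`
        // would panic in debug builds and wrap to u64::MAX in release builds.
        None => false,
    }
}
```
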
Add SubnetId::from_topic_str() and to_topic_string() helpers to avoid
duplicating the "ssv.v2.<subnet_id>" format in multiple places. This
ensures topic parsing stays aligned if the format ever changes.
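
A sketch of what such helpers might look like, assuming the topic format "ssv.v2.<subnet_id>" quoted above. The method names follow the commit message, but the signatures, the `Option` return type, and the digits-only check are assumptions; the PR's versions may differ, for example by returning a `Result`.

```rust
/// Single definition of the topic prefix shared by formatting and parsing.
const TOPIC_PREFIX: &str = "ssv.v2.";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SubnetId(u64);

impl SubnetId {
    pub fn new(value: u64) -> Self {
        Self(value)
    }

    pub fn to_topic_string(self) -> String {
        format!("{}{}", TOPIC_PREFIX, self.0)
    }

    pub fn from_topic_str(topic: &str) -> Option<Self> {
        let rest = topic.strip_prefix(TOPIC_PREFIX)?;
        // Accept plain decimal digits only (u64 parsing alone would allow "+5").
        if rest.is_empty() || !rest.bytes().all(|b| b.is_ascii_digit()) {
            return None;
        }
        rest.parse::<u64>().ok().map(SubnetId)
    }
}
```
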
@diegomrsantos diegomrsantos marked this pull request as ready for review January 5, 2026 21:04
@claude-code-actions-sigp

claude-code-actions-sigp bot commented Jan 5, 2026

Claude finished @diegomrsantos's task in 2m 11s —— View job


Code Review Complete

Todo List

  • Read repository guidelines (CLAUDE.md)
  • Review core routing policy module (routing.rs)
  • Review subnet_service changes
  • Review message_validator fork-aware validation
  • Review message_sender fork-aware publishing
  • Review configuration changes
  • Check test coverage and quality
  • Evaluate against previous review feedback
  • Provide comprehensive feedback

Executive Summary

This PR implements a well-designed fork transition mechanism for migrating the SSV network from committee-ID-based to operator-ID-based subnet routing (SIP-43). The centralized routing policy approach is architecturally sound and addresses a complex distributed systems challenge with clear separation of concerns.

Recommendation: Approve with minor suggested improvements

The implementation is production-ready. Identified issues are primarily about observability and edge case handling rather than fundamental design flaws.


Strengths

1. Excellent Architectural Design

The centralized routing policy in routing.rs is exemplary:

  • Single source of truth for all fork-related decisions
  • Clear separation between subscribe/accept/process schemas
  • Prevents subtle inconsistencies across components
  • Well-documented with timeline tables and rationale

2. Comprehensive Test Coverage

routing.rs has 9 test functions covering:

  • Pre-fork forever operation
  • Full fork timeline (F-3 through F+1)
  • Node starting after fork
  • Epoch boundary transitions
  • Pre-subscribe window vs processing logic

3. Type Safety Improvements

  • NonZeroU64 for subnet_count prevents division by zero at compile time
  • Proper Result error handling throughout
  • SUBNET_COUNT_NZ constant for safe arithmetic

4. Proper Error Handling

  • Uses Ignore instead of Reject for topic mismatches during transitions (message_validator/src/lib.rs:216,224)
  • Prevents peer punishment during contentious upgrades
  • Graceful handling of calculation errors

5. Clean Integration

All components (subnet_service, message_validator, message_sender) integrate cleanly with the centralized SubnetRouter.


Issues Identified

Critical: None

High Priority

1. PostFork calculation failures during transition are logged only at warn level

Location: anchor/subnet_service/src/service.rs:195-209

During the pre-subscribe window (F-2, F-1), if a cluster's PostFork subnet calculation fails (e.g., EmptyOperatorList), the code logs a warning and continues. However, this could mask configuration problems that will cause issues at fork epoch.

Err(SubnetCalculationError::EmptyOperatorList) => {
    warn!(
        ?cluster_id,
        ?schema,
        "Cluster has empty operator list, skipping subnet calculation"
    );
}

Recommendation: Add an error! level log when PostFork calculation fails within 1 epoch of fork_epoch to alert operators of impending problems.
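One possible shape for this escalation is a small function that picks the log severity from the distance to the fork. This is a sketch only: `Severity`, `ESCALATE_WITHIN_EPOCHS`, and `postfork_failure_severity` are hypothetical names, and the caller would map the result onto the existing `warn!` and `error!` macros.

```rust
/// Hypothetical severity selector; not part of the PR.
#[derive(Debug, PartialEq, Eq)]
pub enum Severity {
    Warn,
    Error,
}

/// Distance to the fork (in epochs) at which failures are escalated.
/// The value 1 follows the recommendation above; it is an assumption.
pub const ESCALATE_WITHIN_EPOCHS: u64 = 1;

/// Chooses the severity for a failed PostFork subnet calculation.
pub fn postfork_failure_severity(current_epoch: u64, fork_epoch: u64) -> Severity {
    // Escalate once the fork is imminent or has already happened.
    if current_epoch.saturating_add(ESCALATE_WITHIN_EPOCHS) >= fork_epoch {
        Severity::Error
    } else {
        Severity::Warn
    }
}
```

With `fork_epoch = 100`, a failure at epoch 98 stays a warning, while failures at epoch 99 and later are reported as errors.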


2. SubnetRouter validation allows continued operation with calculation errors

Location: anchor/subnet_service/src/routing.rs:386-403

In SubnetRouter::validate_subnet(), a subnet calculation error for one schema triggers a continue, and validation moves on to the next schema. During dual-subscription, a PostFork calculation failure therefore falls back silently to PreFork validation:

let expected_subnet = match schema {
    SubnetSchema::PreFork => {
        match SubnetId::from_committee_alan(committee_id, self.subnet_count) {
            Ok(subnet) => subnet,
            Err(_) => continue,  // Silent failure
        }
    }
    SubnetSchema::PostFork => {
        match SubnetId::from_operators(operators, self.subnet_count) {
            Ok(subnet) => subnet,
            Err(_) => continue,  // Silent failure - masks config issues
        }
    }
};

This is intentional for graceful handling, but operators won't know their PostFork configuration is broken until fork epoch hits.

Recommendation: Add debug or warn logging when subnet calculation fails so operators can detect configuration issues before fork.


3. No validation for fork_epoch configuration sanity

Location: anchor/common/ssv_network_config/src/lib.rs:90-94

The subnet_topology_fork_epoch field is loaded from config without validation:

let subnet_topology_fork_epoch = subnet_topology_fork_epoch_path
    .exists()
    .then(|| read(&subnet_topology_fork_epoch_path))
    .transpose()?;

No checks for:

  • Fork epoch has not already passed (per the fork timeline, a node configured with a past epoch would start directly on the PostFork schema, with no dual-subscription window)
  • Fork epoch is reasonable (not too close to current epoch for operators to prepare)

Recommendation: Add validation in client/src/lib.rs after loading config to check fork_epoch sanity against current epoch. Log warnings for suspicious values.
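A minimal sketch of such a check follows. `ForkEpochCheck`, `check_fork_epoch`, and the `min_lead_epochs` parameter are hypothetical and not part of the PR; the appropriate lead time is a policy decision this review does not specify.

```rust
/// Hypothetical result of a startup sanity check on the fork epoch.
#[derive(Debug, PartialEq, Eq)]
pub enum ForkEpochCheck {
    Ok,
    /// The configured fork epoch is at or before the current epoch.
    AlreadyPassed { epochs_ago: u64 },
    /// The fork is closer than the requested lead time.
    TooSoon { epochs_left: u64 },
}

pub fn check_fork_epoch(
    current_epoch: u64,
    fork_epoch: u64,
    min_lead_epochs: u64,
) -> ForkEpochCheck {
    if fork_epoch <= current_epoch {
        ForkEpochCheck::AlreadyPassed {
            epochs_ago: current_epoch - fork_epoch,
        }
    } else if fork_epoch - current_epoch < min_lead_epochs {
        ForkEpochCheck::TooSoon {
            epochs_left: fork_epoch - current_epoch,
        }
    } else {
        ForkEpochCheck::Ok
    }
}
```

Since a node starting after the fork is a supported scenario (the routing tests cover it), `AlreadyPassed` would more naturally produce a warning than a startup failure.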


Medium Priority

4. Deref implementation on SubnetId reduces type safety

Location: anchor/subnet_service/src/subnet.rs:136-142

impl Deref for SubnetId {
    type Target = u64;

    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

The Deref impl makes it easy to accidentally use the raw u64 value instead of the SubnetId type, reducing type safety benefits of the newtype pattern.

Observation: This pattern exists throughout the codebase (OperatorId, etc.), so changing it would be inconsistent. However, for new code, consider explicit accessor methods instead of Deref.
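For comparison, an accessor-based newtype might look like the sketch below. The struct is a simplified stand-in for illustration; the real SubnetId has its own constructors (such as from_committee_alan and from_operators), and `new` and `as_u64` are hypothetical names.

```rust
/// Simplified stand-in for SubnetId without a Deref impl.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct SubnetId(u64);

impl SubnetId {
    /// Hypothetical constructor for this sketch only.
    pub const fn new(raw: u64) -> Self {
        Self(raw)
    }

    /// Explicit escape hatch to the raw value. Call sites that need the
    /// u64 must say so, which keeps accidental arithmetic on subnet IDs
    /// visible in review.
    pub const fn as_u64(self) -> u64 {
        self.0
    }
}
```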


5. Clock synchronization requirements not documented at API level

Location: anchor/subnet_service/src/routing.rs:293-307

The current_epoch() function has excellent documentation about clock synchronization requirements, but this critical information isn't surfaced at the API boundaries where it matters most (client initialization, configuration).

/// # Clock Synchronization Requirements
///
/// **Critical for fork transitions**: Nodes must maintain synchronized clocks using NTP or similar.
pub fn current_epoch(slot_clock: &impl SlotClock, slots_per_epoch: u64) -> Option<Epoch> {
    slot_clock.now().map(|slot| slot.epoch(slots_per_epoch))
}

Recommendation: Add a note about clock synchronization requirements to:

  • The ForkSchedule struct documentation
  • The subnet_topology_fork_epoch config field documentation
  • Operator documentation/README

6. No observability metrics for fork transition

Fork transitions are critical network events, but there are no metrics for:

  • Current fork phase (pre-fork, transition, post-fork)
  • Number of messages validated/processed per schema
  • Subnet calculation failures by schema
  • Schema transition events

Recommendation: Add Prometheus metrics:

fork_phase{phase="pre_fork|pre_subscribe|post_fork"}
messages_validated_total{schema="pre_fork|post_fork"}
subnet_calculation_errors_total{schema="pre_fork|post_fork", error_type="..."}
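To show the intended shape of a per-schema counter without committing to a metrics crate, here is a sketch built only on standard-library atomics. `ValidatedCounters` and its methods are hypothetical; a real implementation would presumably register counters with whatever metrics library the project already uses.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the schema type.
#[derive(Debug, Clone, Copy)]
pub enum Schema {
    PreFork,
    PostFork,
}

impl Schema {
    fn label(self) -> &'static str {
        match self {
            Schema::PreFork => "pre_fork",
            Schema::PostFork => "post_fork",
        }
    }
}

/// One counter per schema for validated messages.
#[derive(Default)]
pub struct ValidatedCounters {
    pre_fork: AtomicU64,
    post_fork: AtomicU64,
}

impl ValidatedCounters {
    pub fn inc(&self, schema: Schema) {
        let counter = match schema {
            Schema::PreFork => &self.pre_fork,
            Schema::PostFork => &self.post_fork,
        };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Renders the counters as Prometheus-style text lines.
    pub fn render(&self) -> String {
        [
            (Schema::PreFork, &self.pre_fork),
            (Schema::PostFork, &self.post_fork),
        ]
        .iter()
        .map(|(schema, counter)| {
            format!(
                "messages_validated_total{{schema=\"{}\"}} {}\n",
                schema.label(),
                counter.load(Ordering::Relaxed)
            )
        })
        .collect()
    }
}
```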

7. Missing integration tests for fork transition

While routing.rs has excellent unit tests (9 test functions), there are no integration tests verifying:

  • End-to-end behavior across subnet_service, message_validator, and message_sender during transition
  • Message flow during the dual-subscription window
  • Correct behavior when nodes have slightly different clock times

Recommendation: Add integration tests in a separate test file or mark as future work in a GitHub issue.


Low Priority

8. Hardcoded transition constants lack justification comments

Location: anchor/subnet_service/src/routing.rs:135

pub const PRE_SUBSCRIBE_EPOCHS: u64 = 2;

The docstring gives the rationale (roughly 12.8 minutes of mesh warmup), but it doesn't explain why this specific value was chosen or what the minimum safe value is.

Recommendation: Add a comment explaining:

  • Why 2 epochs minimum (mesh connection establishment time, message propagation delays)
  • What would break with 1 epoch
  • Whether this value is network-dependent
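The 12.8-minute figure can be reproduced with a short calculation. The sketch assumes Ethereum mainnet timing (32 slots per epoch, 12 seconds per slot); these two constants are assumptions made here and may differ on other networks, which is one reason the value could be network-dependent.

```rust
use std::time::Duration;

/// Assumed mainnet timing parameters; not taken from the PR.
pub const SLOTS_PER_EPOCH: u64 = 32;
pub const SECONDS_PER_SLOT: u64 = 12;

/// Value of the constant quoted above.
pub const PRE_SUBSCRIBE_EPOCHS: u64 = 2;

/// Wall-clock length of a pre-subscribe window of `epochs` epochs.
pub const fn warmup_window(epochs: u64) -> Duration {
    Duration::from_secs(epochs * SLOTS_PER_EPOCH * SECONDS_PER_SLOT)
}
```

Under these assumptions, 2 epochs give 2 × 32 × 12 = 768 seconds, or 12.8 minutes; a single epoch would leave 384 seconds, or 6.4 minutes.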

9. Code duplication in schema iteration patterns

Locations: Multiple files iterate schemas with the same pattern:

subnet_service/src/service.rs:188-209:

for schema in current_schemas.iter() {
    match calculate_subnet_for_committee(...) {
        Ok(subnet_id) => { /* use it */ }
        Err(_) => continue,
    }
}

subnet_service/src/scoring.rs:96-123: Similar pattern

This repetition increases maintenance burden if the error handling strategy changes.

Recommendation: Consider adding a helper method to SubnetRouter:

pub fn calculate_subnets_for_schemas(
    &self,
    committee_id: CommitteeId,
    schema_set: SchemaSet
) -> Vec<(SubnetId, SubnetSchema)>
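A self-contained sketch of such a helper is shown below. It differs from the signature above in two ways, both of which are assumptions of this sketch: the calculation is passed in as a closure so the example does not depend on the real SubnetRouter, and failures are returned alongside the successes so callers can log them instead of dropping them. All types are simplified stand-ins.

```rust
/// Simplified stand-ins for the real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubnetSchema {
    PreFork,
    PostFork,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SubnetId(pub u64);

#[derive(Debug, PartialEq, Eq)]
pub enum SubnetCalculationError {
    EmptyOperatorList,
}

/// Runs `calculate` for each schema, separating successes from failures.
pub fn calculate_subnets_for_schemas<F>(
    schemas: &[SubnetSchema],
    mut calculate: F,
) -> (
    Vec<(SubnetId, SubnetSchema)>,
    Vec<(SubnetSchema, SubnetCalculationError)>,
)
where
    F: FnMut(SubnetSchema) -> Result<SubnetId, SubnetCalculationError>,
{
    let mut subnets = Vec::new();
    let mut failures = Vec::new();
    for &schema in schemas {
        match calculate(schema) {
            Ok(id) => subnets.push((id, schema)),
            // Keep the error so the caller decides how loudly to report it.
            Err(err) => failures.push((schema, err)),
        }
    }
    (subnets, failures)
}
```

Centralizing the loop this way would also give the error-handling strategy a single place to change.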

10. SubnetService::last_subscribe_schemas field could use better documentation

Location: anchor/subnet_service/src/service.rs:38

/// Track last subscribe schemas to detect epoch boundary schema changes.
last_subscribe_schemas: Option<routing::SchemaSet>,

The purpose is clear, but it's not obvious why this is Option instead of just initializing with the current schema set.

Minor improvement: Add a comment explaining why None is used initially (to force initial subscription calculation on first epoch boundary).
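If the reason is indeed to force the first calculation, the pattern can be sketched as follows. `SchemaSet` here is a simplified enum and `SubscriptionTracker` is a hypothetical type, not the actual SubnetService field or routing::SchemaSet.

```rust
/// Simplified stand-in for routing::SchemaSet.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchemaSet {
    PreOnly,
    Both,
    PostOnly,
}

/// Tracks the last observed schema set to detect changes.
#[derive(Default)]
pub struct SubscriptionTracker {
    // None means nothing has been observed yet, so the first
    // observation always counts as a change.
    last: Option<SchemaSet>,
}

impl SubscriptionTracker {
    /// Returns true when subscriptions need to be recalculated.
    pub fn observe(&mut self, current: SchemaSet) -> bool {
        let changed = self.last != Some(current);
        self.last = Some(current);
        changed
    }
}
```

Initializing the field with the current schema set instead would make the first comparison report no change, and the initial subscriptions would have to be set up through a separate code path.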


Code Quality Assessment

Adherence to Rust Best Practices

  • ✅ Idiomatic Rust patterns (Option, Result, iterators)
  • ✅ Proper error types with thiserror
  • ✅ No unsafe code
  • ✅ Good use of const generics and type safety
  • ✅ Clear function documentation with examples

Error Handling

  • ✅ Uses proper Result types throughout
  • ✅ No .unwrap() or .expect() in production code paths
  • ✅ Graceful degradation during transition periods
  • ⚠️ Some error paths are silent (by design, but could use more logging)

Architecture

  • ✅ Centralized policy prevents inconsistencies
  • ✅ Clean separation of concerns
  • ✅ Components interact through well-defined interfaces (SubnetRouter)
  • ✅ Fork logic is isolated to routing module

Testing

  • ✅ Comprehensive unit tests in routing.rs (9 test functions)
  • ✅ Tests cover edge cases (pre-fork forever, node starts after fork)
  • ✅ MinHash algorithm tests verify correctness and order independence
  • ⚠️ Missing integration tests for fork transition scenarios

Documentation

  • ✅ Excellent inline documentation in routing.rs
  • ✅ Clear PR description with timeline table
  • ✅ Rationale explained for key design decisions
  • ⚠️ Clock synchronization requirements could be more prominent

Security

No security vulnerabilities identified.

The implementation correctly:

  • Uses Ignore instead of Reject for ambiguous cases to prevent peer punishment exploitation
  • Validates topic subscription before processing messages
  • Prevents misconfigured nodes from causing network splits through graceful fallback
  • Uses a cryptographic hash (SHA256) for operator IDs

Performance

The implementation is efficient:

  • ✅ Stack-allocated SchemaSet arrays avoid heap allocations in hot paths
  • ✅ Epoch calculation done once per message
  • ✅ Subnet calculation uses efficient hashing (SHA256 + modulo)
  • ✅ No unnecessary cloning in validation paths
  • ✅ NonZeroU64 enables compiler optimizations

Minor optimization opportunity: The calculate_subnet_for_committee function is called multiple times for the same committee during validation/scoring. If this becomes a bottleneck, consider caching results per epoch.
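Should caching prove worthwhile, a per-epoch cache could be as simple as the sketch below. `EpochCache` is a hypothetical generic type; the key would be something like a (committee, schema) pair. The sketch assumes a committee's subnet does not change within an epoch, which would need to be confirmed (or handled with explicit invalidation) before adopting anything like this.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical cache that is cleared whenever the epoch changes.
pub struct EpochCache<K, V> {
    epoch: u64,
    entries: HashMap<K, V>,
}

impl<K: Hash + Eq, V: Copy> EpochCache<K, V> {
    pub fn new() -> Self {
        Self {
            epoch: 0,
            entries: HashMap::new(),
        }
    }

    /// Returns the cached value for `key` in `epoch`, computing it once.
    pub fn get_or_compute(&mut self, epoch: u64, key: K, compute: impl FnOnce() -> V) -> V {
        if epoch != self.epoch {
            // Entries from another epoch are stale: drop them all.
            self.entries.clear();
            self.epoch = epoch;
        }
        *self.entries.entry(key).or_insert_with(compute)
    }
}
```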


Comparison with Previous Review

The December 16 review identified several issues. Here's the status:

✅ Addressed:

  1. Clock skew documentation - Added in current_epoch() docstring (lines 293-307)
  2. Hardcoded transition constants documentation - Added rationale in PRE_SUBSCRIBE_EPOCHS docstring
  3. Logging for PostFork calculation errors - Added warn! logs in service.rs:199-203

⚠️ Partially Addressed:

  4. Error handling patterns documentation - Added some comments, but inconsistency between validator/sender not fully documented
  5. Missing metrics - Still not implemented (acceptable for initial version)

❌ Not Addressed:

  6. Integration tests - Still missing (marked as future work)
  7. Configuration validation - No validation for fork_epoch sanity
  8. Code duplication in schema iteration - Still present


Recommendations Summary

Before Merge (High Priority):

  1. Add error-level logging when PostFork calculation fails within 1 epoch of fork
  2. Add debug/warn logging in SubnetRouter::validate_subnet() when calculation fails
  3. Add fork_epoch validation at client initialization

After Merge (Medium Priority):

  4. Add Prometheus metrics for fork observability
  5. Document clock synchronization requirements in operator docs
  6. Create GitHub issue for integration test coverage

Future Improvements (Low Priority):

  7. Add helper methods to reduce schema iteration duplication
  8. Consider removing Deref for SubnetId in future major version


Conclusion

This PR demonstrates excellent software engineering:

  • Well-designed architecture solving a complex distributed systems problem
  • Comprehensive unit test coverage for core logic
  • Clear documentation and rationale
  • Production-safe error handling

The centralized routing policy is the correct approach and will make future fork transitions much easier. The identified issues are minor and mostly about operational visibility rather than correctness.

The code is ready for production deployment.

