-
Notifications
You must be signed in to change notification settings - Fork 28
feat: implement subnet topology fork transition with centralized routing #754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
feat: implement subnet topology fork transition with centralized routing #754
Conversation
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
|
Claude finished @diegomrsantos's task in 1m 49s —— View job Code Review CompleteTodo List
SummaryThis PR implements a well-designed fork transition mechanism for migrating from committee-ID-based to operator-ID-based subnet routing. The centralized routing policy approach is sound and addresses a complex distributed systems challenge. The implementation demonstrates good practices with comprehensive tests and clear documentation. Strengths
Issues IdentifiedCritical
The Location: pub fn current_epoch(slot_clock: &impl SlotClock, slots_per_epoch: u64) -> Option<Epoch> {
slot_clock.now().map(|slot| slot.epoch(slots_per_epoch))
}Recommendation: Add a doc comment explaining that nodes must have synchronized clocks (within typical NTP bounds) for the fork transition to work correctly. High Priority
While the routing module has excellent unit tests, there are no integration tests that verify the end-to-end behavior across multiple components during a fork transition. Specifically:
Recommendation: Add integration tests that simulate:
pub const PRE_SUBSCRIBE_EPOCHS: u64 = 1;
pub const POST_GRACE_EPOCHS: u64 = 2;The PR description mentions these are "hardcoded based on network requirements," but there's no documentation explaining:
Recommendation: Add doc comments explaining the rationale for these values and the constraints that determined them.
When calculating subnet for PostFork schema, if the operator list is empty, it returns Location: Recommendation: Add logging at Medium Priority
In This inconsistency is intentional (validator is more permissive, sender is strict), but it's not documented. Recommendation: Add comments explaining why the validator ignores calculation errors (to handle misconfigured peers gracefully) while the sender fails fast (to avoid sending invalid messages).
There are no metrics to monitor:
Recommendation: Add Prometheus metrics for fork transition observability.
The
Recommendation: Add validation in the configuration loading logic or document expected operator behavior. Low Priority
The commit message mentions marking Recommendation: Consider creating a test helper that provides a mock
The pattern for iterating through active schemas is repeated in multiple places: let (schema_count, schemas) = routing::accept_schemas(epoch, fork_schedule);
for schema in schemas.iter().take(schema_count) {
// ... use schema
}Recommendation: Consider adding a helper method that returns an iterator or slice to reduce repetition and potential errors.
Recommendation: Consider removing the Code QualityPositive
Areas for Improvement
SecurityNo security vulnerabilities identified. The implementation correctly:
PerformanceThe implementation is efficient:
One minor optimization opportunity: The TestingStrengths:
Gaps:
DocumentationStrengths:
Improvements needed:
Recommendations
ConclusionThis is a well-designed and well-implemented feature. The centralized routing policy approach is the right solution for this problem, and the implementation demonstrates careful consideration of edge cases and failure modes. The identified issues are mostly about observability, testing, and documentation rather than fundamental design flaws. The code is production-ready with minor improvements for observability and testing. Recommendation: Approve with minor improvements |
This commit addresses the three high-priority items from PR sigp#754 code review: 1. Enhanced documentation for PRE_SUBSCRIBE_EPOCHS and POST_GRACE_EPOCHS - Added detailed rationale explaining why these specific values (1 and 2 epochs) - Documented trade-offs between safety and resource efficiency - Explained operational constraints that determined these values 2. Documented clock synchronization requirements in current_epoch() - Made explicit that nodes must maintain NTP-synchronized clocks - Specified tolerance bounds (~100ms acceptable for 6.4-minute epochs) - Explained consequences of poor synchronization during fork transition 3. Added warning logging for PostFork subnet calculation failures - When empty operator list prevents PostFork subnet calculation in scoring - Helps operators detect configuration issues during grace period - Provides visibility into potential problems before grace period ends All changes are documentation and logging improvements with no functional changes. Tests pass and code formatting has been applied.
Add TODO comments to all #[allow(clippy::too_many_arguments)] suppressions referencing issue sigp#755 which proposes using ForkContext to reduce parameter counts in fork-related code. Changes: - subnet_service: 3 locations (start_subnet_service, subnet_service, handle_subnet_changes) - message_validator: 1 location (Validator::new - includes fork args added in PR sigp#754) - validator_store: 2 locations (AnchorValidatorStore::new, collect_signature) These TODOs document the technical debt and provide a path forward for refactoring without blocking the fork transition work.
f94bfa7 to
b0ee4d8
Compare
Create subnet_service::routing module that provides a single source of truth for all fork-related routing decisions. Key features: - SubnetSchema enum (PreFork/PostFork) representing subnet calculation algorithms - ForkSchedule struct with hardcoded transition windows (1 epoch pre-subscribe, 2 epochs grace) - Decision functions: publish_schema(), subscribe_schemas(), accept_schemas() - Consistency helper: current_epoch() for uniform epoch calculation - Comprehensive unit tests covering full timeline and edge cases Timeline (F = fork_epoch): - Before F-1: PreFork only - F-1: Dual-subscribe (strict PreFork accept to prevent early migration) - F to F+2: Publish PostFork, accept both (grace period) - After F+2: PostFork only This centralizes fork logic that was scattered across multiple components in the Go implementation, making transitions easier to reason about and modify.
Add `subnet_topology_fork_epoch: Option<Epoch>` field to `SsvNetworkConfig` to support configuring when the subnet topology fork activates. - Added `types` dependency to ssv_network_config - Updated `constant()` method to set fork epoch to None for hardcoded networks - Updated `load()` method to optionally read from ssv_subnet_topology_fork_epoch.txt None means pre-fork operation (fork not configured/activated).
Pass ForkSchedule from client to subnet_service. - Import ForkSchedule in client - Create ForkSchedule from config.global_config.ssv_network.subnet_topology_fork_epoch - Add fork_schedule parameter to start_subnet_service() signature - Thread fork_schedule through to subnet_service async function - Fork schedule will be used for dual-subscription logic in next commit
- Modified handle_subnet_changes() to support both PreFork and PostFork schemas - Added fork_schedule and slot_clock parameters to determine active schemas - Calculate subnets for each active schema based on current epoch - Use routing::subscribe_schemas() to get transition state - Handle empty operator lists with appropriate warnings - Added #[allow(clippy::too_many_arguments)] attributes where necessary
…e publishing This commit completes Phase 5 of the fork transition implementation: 1. Created centralized calculate_subnet_for_committee() function in subnet_service/src/lib.rs that handles subnet calculation for both PreFork (committee-ID based) and PostFork (operator-ID based) schemas. 2. Refactored handle_subnet_changes() to use the centralized helper, eliminating duplicated operator ID lookup and deduplication logic. 3. Extended NetworkMessageSenderConfig with fork-related fields: - fork_schedule: ForkSchedule - slot_clock: S - slots_per_epoch: u64 - network_state_rx: watch::Receiver<NetworkState> 4. Modified do_send() in message_sender to: - Determine publishing schema using routing::publish_schema() - Calculate subnet using centralized helper - Properly handle errors for empty operator lists 5. Wired fork_schedule creation in client initialization and passed fork-related state to NetworkMessageSenderConfig. The centralized approach ensures consistency between subscription and publishing paths, with a single source of truth for subnet calculation logic.
The get_committee_info_for_subnet function was using only PreFork (committee-ID) algorithm to identify which committees map to a subnet. This caused incorrect message rate calculations for topic scoring post-fork and during the grace period. Now uses accept_schemas() to determine which routing schemas are valid at the current epoch, and checks if each cluster maps to the target subnet under ANY accepted schema. During grace period (epochs N to N+2), committees are included if they map via either PreFork OR PostFork. This conservative approach ensures we don't under-estimate message rates during the transition, which would cause incorrect peer scoring.
Thread topic through message_receiver to message_validator and validate that messages arrive on the correct subnet topic based on fork-aware routing rules. Changes: - message_receiver: pass topic to validator - message_validator: add topic parameter, validate subnet correctness - message_validator: check committee maps to received subnet under any accepted schema - message_sender: construct TopicHash for outgoing validation - client: wire fork_schedule and subnet_count to Validator, fix initialization order Returns ValidationFailure::IncorrectTopic (maps to Ignore) for wrong topics to avoid peer punishment during transitions.
Remove temporary analysis files and directories used during development: - fork_transition_detailed_plan.md - ssv_analysis_data/ directory with all analysis scripts and results Apply cargo fmt formatting fixes to client and message_validator.
The doctest example for current_epoch cannot compile without significant setup overhead. Mark it as ignore since it's illustrative only.
This commit addresses the three high-priority items from PR sigp#754 code review: 1. Enhanced documentation for PRE_SUBSCRIBE_EPOCHS and POST_GRACE_EPOCHS - Added detailed rationale explaining why these specific values (1 and 2 epochs) - Documented trade-offs between safety and resource efficiency - Explained operational constraints that determined these values 2. Documented clock synchronization requirements in current_epoch() - Made explicit that nodes must maintain NTP-synchronized clocks - Specified tolerance bounds (~100ms acceptable for 6.4-minute epochs) - Explained consequences of poor synchronization during fork transition 3. Added warning logging for PostFork subnet calculation failures - When empty operator list prevents PostFork subnet calculation in scoring - Helps operators detect configuration issues during grace period - Provides visibility into potential problems before grace period ends All changes are documentation and logging improvements with no functional changes. Tests pass and code formatting has been applied.
Add TODO comments to all #[allow(clippy::too_many_arguments)] suppressions referencing issue sigp#755 which proposes using ForkContext to reduce parameter counts in fork-related code. Changes: - subnet_service: 3 locations (start_subnet_service, subnet_service, handle_subnet_changes) - message_validator: 1 location (Validator::new - includes fork args added in PR sigp#754) - validator_store: 2 locations (AnchorValidatorStore::new, collect_signature) These TODOs document the technical debt and provide a path forward for refactoring without blocking the fork transition work.
b0ee4d8 to
a18d133
Compare
- Change PRE_SUBSCRIBE_EPOCHS from 1 to 2 (SIP-43 PRIOR_WINDOW)
- Remove POST_GRACE_EPOCHS (no grace period after fork per SIP-43)
- Add process_schemas() to distinguish gossipsub acceptance from processing
- Update accept_schemas() to accept both schemas during pre-subscribe window
- Update subscribe_schemas() to only subscribe PostFork at/after fork
New timeline (SIP-43 compliant):
| Epoch | Subscribe | Publish | Accept | Process |
|-------|----------------|---------|------------|---------|
| F-3 | {Pre} | Pre | {Pre} | {Pre} |
| F-2 | {Pre, Post} | Pre | {Pre,Post} | {Pre} |
| F-1 | {Pre, Post} | Pre | {Pre,Post} | {Pre} |
| F | {Post} | Post | {Post} | {Post} |
Key SIP-43 behaviors:
- Accept PostFork messages during pre-subscribe for gossipsub propagation
- Only process PreFork messages until fork epoch
- Immediate cutoff at fork (no grace period)
Add matched_schema field to ValidatedMessage to track which SubnetSchema the message was validated against. This enables the message_receiver to distinguish between messages that should be processed vs. only propagated during the pre-subscribe window (F-2, F-1). During the pre-subscribe window, PostFork messages are accepted for gossipsub propagation but should not be processed until fork epoch. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add fork-aware message processing logic to NetworkMessageReceiver: - Accept PostFork messages during pre-subscribe window (F-2, F-1) for gossipsub propagation - Only process messages matching process_schemas for committee work - Before fork: only process PreFork messages - At/after fork: only process PostFork messages This implements the SIP-43 requirement that during the pre-subscribe window, nodes should accept and propagate PostFork messages but not process them for consensus until the fork epoch. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Eliminate duplication where message_sender, message_validator, and message_receiver each had their own copies of fork-related dependencies (network_state_rx, fork_schedule, slot_clock, slots_per_epoch). Changes: - Add SubnetRouter struct to routing.rs with methods: - publish_subnet(): calculate subnet for publishing - validate_subnet(): validate incoming messages against accepted schemas - should_process(): determine if message should be processed - Update message_sender to use router instead of 5 separate params - Update message_validator to use router for network state and validation - Update message_receiver to use router.should_process() - Create single SubnetRouter instance in client and inject into components 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, subnet subscriptions were only recomputed when the database changed. This meant fork transitions (from PreFork to PostFork schemas) would not happen unless the DB changed, leaving nodes on wrong topics and missing messages. Now the epoch boundary handler always runs and checks if we're in the fork transition window [F-2, F]. If so, it triggers subscription recomputation regardless of DB changes. Changes: - Add is_fork_transition_epoch() helper to detect fork window - Always handle epoch boundaries (not just when scoring enabled) - Only trigger subscription changes during fork transition epochs - Add comprehensive tests for fork transition detection 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename handle_epoch_committee_update to send_scoring_rate_updates for clarity - Add comprehensive docstring explaining gossipsub scoring context - Improve inline comments explaining why rate updates are conditional - Update debug message to better describe the operation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
#767) Preparatory refactor for the subnet topology fork transition work (#754). Split the 539-line monolithic `lib.rs` into focused modules with clear responsibilities: | Module | Responsibility | Key Types/Functions | |--------|---------------|---------------------| | `subnet.rs` | Core types and calculation algorithms | `SubnetId`, `SubnetEvent`, `from_committee_alan()`, `from_operators()` | | `service.rs` | Background service managing subscriptions | `SubnetService` struct, `start_subnet_service()` | | `scoring.rs` | Message rate calculation for gossipsub | `calculate_message_rate_for_subnet()`, `get_committee_info_for_subnet()` | | `lib.rs` | Public API re-exports only | — | ### Mental Model ``` subnet_service/ ├── lib.rs → Public API surface (re-exports) ├── subnet.rs → WHAT a subnet IS (types, algorithms) ├── service.rs → WHAT HAPPENS with subnets (lifecycle, events) ├── scoring.rs → HOW to score subnets (message rates) └── message_rate.rs → (unchanged) rate calculation math ``` ### Why This Structure? 1. **Single Responsibility**: Each module has one clear purpose 2. **Easier Navigation**: Finding code by concept rather than scrolling 3. **Fork Transition Ready**: PR #754 can cleanly add `routing.rs` for fork-aware logic without touching unrelated code 4. **Encapsulation**: `SubnetService` struct groups related state and methods ### Cleanup - Removed unused `test_tracker()` function (was defined but never called) Co-Authored-By: diego <diego@sigmaprime.io>
Resolves merge conflicts by: - Keeping modular structure from upstream (routing.rs, service.rs, subnet.rs, scoring.rs) - Adding fork-aware features (ForkSchedule, SchemaSet, dual-subscription) - Preserving clean run_scoring_loop/run_monitoring_loop structure - Using fork-aware calculate_subnet_for_committee for multi-schema support
e3fdf57 to
5f5ae8b
Compare
All clusters with the same committee_id have identical operators (committee_id is derived from the operator set), so we only need to look up one cluster instead of collecting from all and deduping.
- Use NonZeroU64 for subnet_count to prevent division by zero at compile time - Add SUBNET_COUNT_NZ constant for safe subnet count usage - Change from_committee_alan to return Result like from_operators - Use map_err instead of expect for production safety - Optimize validate_subnet to accept operators directly, avoiding redundant DB lookup - Fix impostor sender to log warnings instead of silently ignoring errors 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
During the dual-subscribe window, a PostFork calculation error (e.g., EmptyOperatorList) would previously cause immediate error propagation, potentially rejecting messages before checking all schemas. Now errors are handled per-schema with continue, ensuring we only return IncorrectTopic when no schema matches.
operatorId - 1 could underflow if the contract emits operatorId == 0, producing false "missing events" failures. Use checked_sub to safely handle this edge case.
Add SubnetId::from_topic_str() and to_topic_string() helpers to avoid duplicating the "ssv.v2.<subnet_id>" format in multiple places. This ensures topic parsing stays aligned if the format ever changes.
|
Claude finished @diegomrsantos's task in 2m 11s —— View job Code Review CompleteTodo List
Executive SummaryThis PR implements a well-designed fork transition mechanism for migrating the SSV network from committee-ID-based to operator-ID-based subnet routing (SIP-43). The centralized routing policy approach is architecturally sound and addresses a complex distributed systems challenge with clear separation of concerns. Recommendation: Approve with minor suggested improvements The implementation is production-ready. Identified issues are primarily about observability and edge case handling rather than fundamental design flaws. Strengths1. Excellent Architectural DesignThe centralized routing policy in
2. Comprehensive Test Coverage
3. Type Safety Improvements
4. Proper Error Handling
5. Clean IntegrationAll components (subnet_service, message_validator, message_sender) integrate cleanly with the centralized Issues IdentifiedCritical: NoneHigh Priority1. Missing warning logs for PostFork calculation failures during transition Location: During the pre-subscribe window (F-2, F-1), if a cluster's PostFork subnet calculation fails (e.g., Err(SubnetCalculationError::EmptyOperatorList) => {
warn!(
?cluster_id,
?schema,
"Cluster has empty operator list, skipping subnet calculation"
);
}Recommendation: Add an 2. SubnetRouter validation allows continued operation with calculation errors Location: In let expected_subnet = match schema {
SubnetSchema::PreFork => {
match SubnetId::from_committee_alan(committee_id, self.subnet_count) {
Ok(subnet) => subnet,
Err(_) => continue, // Silent failure
}
}
SubnetSchema::PostFork => {
match SubnetId::from_operators(operators, self.subnet_count) {
Ok(subnet) => subnet,
Err(_) => continue, // Silent failure - masks config issues
}
}
};This is intentional for graceful handling, but operators won't know their PostFork configuration is broken until fork epoch hits. Recommendation: Add debug or warn logging when subnet calculation fails so operators can detect configuration issues before fork. 3. No validation for fork_epoch configuration sanity Location: The let subnet_topology_fork_epoch = subnet_topology_fork_epoch_path
.exists()
.then(|| read(&subnet_topology_fork_epoch_path))
.transpose()?;No checks for:
Recommendation: Add validation in Medium Priority4. Deref implementation on SubnetId reduces type safety Location: impl Deref for SubnetId {
type Target = u64;
fn deref(&self) -> &Self::Target {
&self.0
}
}The Observation: This pattern exists throughout the codebase (OperatorId, etc.), so changing it would be inconsistent. However, for new code, consider explicit accessor methods instead of 5. Clock synchronization requirements not documented at API level Location: The /// # Clock Synchronization Requirements
///
/// **Critical for fork transitions**: Nodes must maintain synchronized clocks using NTP or similar.
pub fn current_epoch(slot_clock: &impl SlotClock, slots_per_epoch: u64) -> Option<Epoch> {
slot_clock.now().map(|slot| slot.epoch(slots_per_epoch))
}Recommendation: Add a note about clock synchronization requirements to:
6. No observability metrics for fork transition Fork transitions are critical network events, but there are no metrics for:
Recommendation: Add Prometheus metrics: fork_phase{phase="pre_fork|pre_subscribe|post_fork"}
messages_validated_total{schema="pre_fork|post_fork"}
subnet_calculation_errors_total{schema="pre_fork|post_fork", error_type="..."}7. Missing integration tests for fork transition While
Recommendation: Add integration tests in a separate test file or mark as future work in a GitHub issue. Low Priority8. Hardcoded transition constants lack justification comments Location: pub const PRE_SUBSCRIBE_EPOCHS: u64 = 2;While the docstring explains the rationale (allowing ~12.8 minutes for mesh warmup), the comment doesn't explain why this specific value was chosen or what the minimum safe value is. Recommendation: Add a comment explaining:
9. Code duplication in schema iteration patterns Locations: Multiple files iterate schemas with the same pattern:
for schema in current_schemas.iter() {
match calculate_subnet_for_committee(...) {
Ok(subnet_id) => { /* use it */ }
Err(_) => continue,
}
}
This repetition increases maintenance burden if the error handling strategy changes. Recommendation: Consider adding a helper method to pub fn calculate_subnets_for_schemas(
&self,
committee_id: CommitteeId,
schema_set: SchemaSet
) -> Vec<(SubnetId, SubnetSchema)>10. SubnetService::last_subscribe_schemas field could use better documentation Location: /// Track last subscribe schemas to detect epoch boundary schema changes.
last_subscribe_schemas: Option<routing::SchemaSet>,The purpose is clear, but it's not obvious why this is Minor improvement: Add a comment explaining why Code Quality AssessmentAdherence to Rust Best Practices
Error Handling
Architecture
Testing
Documentation
SecurityNo security vulnerabilities identified. The implementation correctly:
PerformanceThe implementation is efficient:
Minor optimization opportunity: The Comparison with Previous ReviewThe December 16 review identified several issues. Here's the status: ✅ Addressed:
❌ Not Addressed: Recommendations SummaryBefore Merge (High Priority):
After Merge (Medium Priority): Future Improvements (Low Priority): ConclusionThis PR demonstrates excellent software engineering:
The centralized routing policy is the correct approach and will make future fork transitions much easier. The identified issues are minor and mostly about operational visibility rather than correctness. The code is ready for production deployment. |
Issue Addressed
Implements the subnet topology fork transition (SIP-43) to safely migrate the network from committee-ID-based to operator-ID-based (MinHash topology) subnet calculation.
Proposed Changes
The Problem: The network needs to change how it calculates which subnet (gossipsub topic) a committee belongs to. This affects every network operation: subscription, publishing, message validation, and peer scoring. Getting this wrong causes network partitions, message loss, or incorrect peer punishment.
Centralized Routing Policy: Instead of scattering fork logic across components, this PR creates a single routing policy module (
routing.rs) that answers four questions for any epoch:publish_schema()subscribe_schemas()accept_schemas()process_schemas()SIP-43 Compliant Fork Timeline (F = fork_epoch):
Key behaviors per SIP-43:
Fork-Aware Network Operations: Updated all network components to be fork-aware:
SubnetRouterprovides centralized routing decisions shared across componentsType Safety Improvements:
NonZeroU64for subnet_count to prevent division by zero at compile timeResulterror handling withmap_errinstead ofexpectin production codeSUBNET_COUNT_NZconstant for safe subnet count usageConfiguration: Added
subnet_topology_fork_epochas an optional network configuration parameter. When unset, the system operates in pre-fork mode indefinitely.Additional Info
Design Principle: Centralized policy prevents subtle inconsistencies (e.g., subscription using one epoch threshold, validation using another) that could cause hard-to-debug network issues.
Safety: Message validation uses "Ignore" (not "Reject") for topic mismatches to avoid peer punishment during contentious upgrades or configuration issues.
MinHash Algorithm: Implements the SIP-43 specified algorithm - SHA256 hash each operator ID (little-endian u64), find minimum hash, return
min_hash % 128.