RFC: Server-side Transfer Planning
Status: Draft
experimental branch
Summary
This RFC proposes moving transfer planning from the client to the server. Instead of discovering peers directly, the client asks the server for a plan: a set of peers, each assigned a disjoint slice of tensors.
The server owns peer selection and tensor assignment.
To do this, we suggest adding two new RPCs (ComputeTransferPlan, AdvertiseInventory) and one additive WorkerMetadata.labels field. The existing four RPCs are unchanged.
To start with, we suggest hiding the feature behind a feature flag like MX_USE_TRANSFER_PLAN=1. By default, clients continue to be in charge of their own peer selection.
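A minimal sketch of how a client could gate on the flag. It assumes the flag is read from the process environment and that only the exact value "1" enables the new path; the function names `use_transfer_plan` and `select_path` are hypothetical and not part of the proposal.

```python
import os

def use_transfer_plan(env=None):
    """Return True only when MX_USE_TRANSFER_PLAN is exactly "1" (assumed semantics)."""
    env = os.environ if env is None else env
    return env.get("MX_USE_TRANSFER_PLAN", "0") == "1"

def select_path(env=None):
    # Hypothetical dispatch: the default stays client-side peer selection.
    return "server_plan" if use_transfer_plan(env) else "client_selection"
```

Treating any value other than "1" as off keeps the existing behavior the default, which matches the opt-in rollout described above.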
Rationale
Server-side planning has several downstream benefits:
- Load- or topology-aware peer preferences need a view of cluster and workload state that no individual client has, and that a client shouldn't be responsible for gathering. Moving planning to the server makes it possible to rank candidate transfer routes between peers later (see Future work).
- A single peer pod (or process) may not hold the entire model's weights. A planner lets clients build a transfer plan spanning multiple peers that assembles the complete set of weights they need, and this keeps working as experts are rebalanced.
- In a multi-tenant deployment, how peers are selected (or excluded) should be set centrally by the cluster operator. Individual misbehaving clients (e.g. crashlooping pods from one tenant) shouldn't be able to disrupt other workloads' transfers.
- Assigning disjoint slices across peers lends itself naturally to parallel downloads, which should improve transfer throughput. It also opens the door to a planner that trades some latency for lower fabric congestion, which again requires a global view of health that shouldn't be a client's job to compute.
The client also gets simpler: one RPC replaces the discover/fetch/retry loop, and clients no longer track selection logic between peers. Many of the roadmap items in the project README also lean on server-side planning as a substrate (parallel/sharded assembly, RL weight refit, fault-tolerant re-planning, predictive prefetch).
Protocol changes
service P2pService {
  // ... existing 4 RPCs unchanged ...
  rpc ComputeTransferPlan(...) returns (...);
  rpc AdvertiseInventory(...) returns (...);
}
- ComputeTransferPlan: given (identity, requester rank+id, optional max_peers, optional explicit tensor list), returns per-peer assignments with NIXL metadata inlined, structured diagnostics (below), and any uncovered_tensor_names the client should load from disk. An explicit tensor list restricts the plan to those names; an empty list means the union of all peers' inventories (the cold-start "I need the whole model" case). Each covered tensor is assigned to exactly one peer.
- AdvertiseInventory: a peer declares the tensors it owns (name, dtype, byte length; no GPU addresses). Required: the server cannot derive ownership from SourceIdentity for disjoint-shard models, so a peer that has not advertised an inventory is not eligible to serve any tensor in a plan. Sources advertise after load; a receiver re-advertises after its own transfer completes, so it can serve future requesters. Re-sending replaces the previous inventory for (mx_source_id, worker_id) iff the generation is higher; the lease expires with the worker's heartbeat.
- WorkerMetadata.labels (map<string,string>): opaque external identity (pod, node, rack, ...). The server stores it but does not interpret it today; it's the join surface that future peer ranking will key on (see Future work). This is the entire topology surface: there is no structured topology message.
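The AdvertiseInventory replacement rule above can be sketched as a small in-memory store. This is an illustration under assumptions, not the server's implementation: `TensorInfo`, `InventoryStore`, and the method names are hypothetical, and heartbeat expiry is modeled as an explicit `expire` call rather than a timer.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TensorInfo:
    name: str
    dtype: str
    nbytes: int  # byte length only; no GPU addresses are advertised

@dataclass
class InventoryStore:
    # (mx_source_id, worker_id) -> (generation, {tensor name: TensorInfo})
    entries: dict = field(default_factory=dict)

    def advertise(self, mx_source_id, worker_id, generation, tensors):
        """Replace the stored inventory only if `generation` is strictly higher."""
        key = (mx_source_id, worker_id)
        current = self.entries.get(key)
        if current is not None and generation <= current[0]:
            return False  # stale or duplicate advertisement is ignored
        self.entries[key] = (generation, {t.name: t for t in tensors})
        return True

    def expire(self, mx_source_id, worker_id):
        """Drop the inventory when the worker's heartbeat lease lapses."""
        self.entries.pop((mx_source_id, worker_id), None)

    def eligible_peers(self):
        """Only peers that have advertised an inventory can serve a plan."""
        return sorted(self.entries)
```

Keying on (mx_source_id, worker_id) and comparing generations means a delayed or retried advertisement cannot overwrite a newer one.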
Why label propagation (forward-looking)
modelexpress keys workers by internal IDs (mx_source_id, worker_id) that mean nothing to external systems. Any future external ranking signal is keyed by its own local IDs, for example a Prometheus series tagged pod=... or node labels such as rack=.... The labels field carries those external IDs, which the server needs to store to allow future joins.
The server never interprets the keys, so adding a new signal later won't require a proto change.
Example of how this could be populated via the k8s downward API:
env:
  - name: MX_LABEL_pod
    valueFrom: { fieldRef: { fieldPath: metadata.name } }
  - name: MX_LABEL_node
    valueFrom: { fieldRef: { fieldPath: spec.nodeName } }
  - name: MX_LABEL_tenant
    valueFrom: { fieldRef: { fieldPath: metadata.labels['tenant'] } }
MX_LABEL_* env vars become labels entries at publish time (pod, node, tenant).
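The env-to-labels mapping could look roughly like the sketch below. It assumes the label key is whatever follows the MX_LABEL_ prefix, with case preserved, and that a bare MX_LABEL_ variable with no key is skipped; `labels_from_env` is a hypothetical helper name.

```python
import os

LABEL_PREFIX = "MX_LABEL_"

def labels_from_env(env=None):
    """Collect MX_LABEL_<key>=<value> variables into a labels map."""
    env = os.environ if env is None else env
    return {
        name[len(LABEL_PREFIX):]: value
        for name, value in env.items()
        # Skip unrelated variables and a bare prefix with no key.
        if name.startswith(LABEL_PREFIX) and len(name) > len(LABEL_PREFIX)
    }
```

Because the helper copies keys through without validating them, an operator can introduce a new label by adding one env var, with no client or proto change.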
Planning
Many different planning schemes are possible, but I think a safe default is greedy bin packing that tries to include more peers. Some research will be needed to ensure the default strategy is robust at different peer counts and that parallel transfers don't interrupt inference workloads already on the fabric (KV transfers, all2alls).
The planning algorithm is a potential plugin point, but I would caution against over-rotating on extensibility and would instead tie features directly to use-cases and examples. Too many knobs might confuse operators.
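One possible reading of the greedy default, sketched under assumptions: tensors are taken largest first and each goes to the currently least-loaded peer that owns it, which tends to spread work across more peers. The input shape, the function name `compute_plan`, and the tie-breaking by peer name are all hypothetical; max_peers and dtype-conflict handling are deliberately left out.

```python
def compute_plan(inventories, wanted=()):
    """Greedy assignment sketch.

    inventories: {peer: {tensor_name: (dtype, nbytes)}}
    wanted: explicit tensor names; empty means the union of all inventories.
    Returns (assignments, uncovered): peer -> [tensor names], and sorted
    names that no peer owns.
    """
    names = set(wanted) or {n for inv in inventories.values() for n in inv}
    owners = {n: [p for p, inv in inventories.items() if n in inv] for n in names}
    uncovered = sorted(n for n, peers in owners.items() if not peers)

    def size(name):
        # Byte length as advertised by any owner (largest, if they differ).
        return max(inventories[p][name][1] for p in owners[name])

    load = {p: 0 for p in inventories}
    assignments = {}
    covered = [n for n in names if owners[n]]
    # Largest tensors first; name is a tiebreak so the plan is deterministic.
    for name in sorted(covered, key=lambda n: (-size(n), n)):
        peer = min(owners[name], key=lambda p: (load[p], p))
        assignments.setdefault(peer, []).append(name)
        load[peer] += inventories[peer][name][1]
    return assignments, uncovered
```

Each covered tensor lands on exactly one peer, and names nobody owns come back as uncovered so the client can load them from disk.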
Diagnostics
Peer sort ordering is Debug-printable, and the server should publish metrics on planning decisions: peers considered vs. assigned, uncovered tensor counts, and dtype conflicts that dropped a tensor.
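The metrics listed above could be gathered into a single per-plan record, as in this sketch. The `PlanDiagnostics` fields and helper names are hypothetical, and a dtype conflict is assumed to mean that two peers advertise the same tensor name with different dtypes.

```python
from dataclasses import dataclass

@dataclass
class PlanDiagnostics:
    peers_considered: int
    peers_assigned: int
    uncovered_tensors: int
    dtype_conflicts: int

def find_dtype_conflicts(inventories):
    """Names of tensors that peers advertise with differing dtypes."""
    seen = {}
    conflicts = set()
    for inv in inventories.values():
        for name, (dtype, _nbytes) in inv.items():
            # setdefault records the first dtype seen for each name.
            if seen.setdefault(name, dtype) != dtype:
                conflicts.add(name)
    return sorted(conflicts)

def summarize(inventories, assignments, uncovered):
    """Build the per-plan counters a server could export as metrics."""
    return PlanDiagnostics(
        peers_considered=len(inventories),
        peers_assigned=sum(1 for names in assignments.values() if names),
        uncovered_tensors=len(uncovered),
        dtype_conflicts=len(find_dtype_conflicts(inventories)),
    )
```

A gap between peers_considered and peers_assigned is the kind of signal that would show whether the default strategy really spreads load across peers.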
Future work (out of scope)
- Pluggable peer ranking. A background data-collection task on the server feeds relative scores between peers, used as a tiebreaker (and for hard exclusion) during assignment. Inputs could include:
  - topology
  - live inference-server load (e.g. scraped from Prometheus)
  - tenant labels
- Partial / diff need-sets. Let a receiver advertise its own inventory and scope a plan to the difference against what peers own, rather than the full union. This enables resume-after-failure and RL weight refit/resharding, where only changed weights move. (The proto already allows an empty inventory advertisement to support this.) This would likely require model server integration.
- Fault tolerance. If a transfer fails, it might be faster to re-route it to a different peer than to fall back to a different storage backend. We should support this type of retry.