[RFC] Proposal for a central transfer planner between peers

# RFC: Server-side Transfer Planning

Status: Draft

[experimental branch](https://github.com/wseaton/modelexpress/tree/planner)

## Summary

This RFC proposes moving transfer planning from the client to the server. Instead of discovering peers directly, the client asks the server for a *plan*: a set of peers, each assigned a disjoint slice of tensors.

The server owns peer selection and tensor assignment.

To do this, we suggest adding two new RPCs (`ComputeTransferPlan`, `AdvertiseInventory`) and one additive `WorkerMetadata.labels` field. The existing four RPCs are unchanged.

To start with, the feature sits behind a feature flag like `MX_USE_TRANSFER_PLAN=1`. By default, clients remain in charge of their own peer selection.
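A minimal sketch of how a client might gate on the flag. The helper names `use_transfer_plan` and `select_mode` are hypothetical, and treating only the literal value `1` as "on" is an assumption; the RFC does not specify how other values are parsed.

```python
import os


def use_transfer_plan(env=None):
    """Return True only when MX_USE_TRANSFER_PLAN is exactly "1".

    Assumption: any other value (or an unset variable) keeps the
    existing client-side peer selection.
    """
    env = os.environ if env is None else env
    return env.get("MX_USE_TRANSFER_PLAN", "0") == "1"


def select_mode(env=None):
    """Hypothetical helper naming the path a client would take."""
    return "server_plan" if use_transfer_plan(env) else "client_selection"
```

Passing `env` explicitly keeps the check testable without mutating the process environment.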

## Rationale

Server-side planning has several downstream benefits:

- Load- or topology-aware peer preferences need a view of cluster and workload state that no individual client has, and that shouldn't be a client responsibility to gather. Moving planning server side makes ranking potential transfer routes between candidate peers possible later (see Future work).

- A single peer pod (or process) may not hold the entire model's weights. A planner lets clients build a transfer plan spanning multiple peers that assembles the complete set of weights they need, and this keeps working as experts are rebalanced.

- In a multi-tenant deployment, how peers are selected (or excluded) should be set centrally by the cluster operator. Individual badly-behaving clients (e.g. crashlooping pods from one tenant) shouldn't be able to disrupt other workloads' transfers.

- Assigning disjoint slices across peers lends itself to parallel downloads, which can raise aggregate transfer throughput. It also opens the door to a planner that trades latency for lower fabric congestion, which again requires a global view of health that shouldn't be a client's job to compute.

The client also gets simpler: one RPC replaces the discover/fetch/retry loop, and clients no longer track selection logic between peers. Many of the roadmap items in the project README also lean on server-side planning as a substrate (parallel/sharded assembly, RL weight refit, fault-tolerant re-planning, predictive prefetch).

## Protocol changes

```proto
service P2pService {
  // ... existing 4 RPCs unchanged ...
  rpc ComputeTransferPlan(...) returns (...);
  rpc AdvertiseInventory(...) returns (...);
}
```

- **`ComputeTransferPlan`**: given `(identity, requester rank+id, optional max_peers, optional explicit tensor list)`, returns per-peer assignments with NIXL metadata inlined, structured diagnostics (below), and any `uncovered_tensor_names` the client should load from disk. An explicit tensor list restricts the plan to those names; an empty list means the union of all peers' inventories (the cold-start "I need the whole model" case). Each covered tensor is assigned to exactly one peer.

- **`AdvertiseInventory`**: a peer declares the tensors it owns (name, dtype, byte length — no GPU addresses). Required: the server cannot derive ownership from `SourceIdentity` for disjoint-shard models, so **a peer that has not advertised an inventory is not eligible to serve any tensor in a plan**. Sources advertise after load; a receiver re-advertises after its own transfer completes, so it can serve future requesters. Re-sending replaces the previous inventory for `(mx_source_id, worker_id)` iff the generation is higher; the lease expires with the worker's heartbeat.

- **`WorkerMetadata.labels`** (`map<string,string>`): opaque external identity (`pod`, `node`, `rack`, ...). The server stores it but does not interpret it today; it's the join surface that future peer ranking will key on (see Future work). This is the entire topology surface — no structured topology message.
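The replacement rule for `AdvertiseInventory` could be sketched roughly as follows. This is an illustration under stated assumptions, not the server's implementation: `InventoryStore`, `TensorInfo`, and their methods are hypothetical names, the generation is assumed to be a monotonically increasing integer, and lease expiry is modelled as an explicit `expire` call that the heartbeat logic would trigger.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TensorInfo:
    # Ownership facts only: no GPU addresses.
    name: str
    dtype: str
    nbytes: int


@dataclass
class InventoryStore:
    # (mx_source_id, worker_id) -> (generation, {tensor name: TensorInfo})
    _entries: dict = field(default_factory=dict)

    def advertise(self, mx_source_id, worker_id, generation, tensors):
        """Store the inventory; replace an existing one only if generation is higher."""
        key = (mx_source_id, worker_id)
        current = self._entries.get(key)
        if current is not None and generation <= current[0]:
            return False  # stale or duplicate advertisement is ignored
        self._entries[key] = (generation, {t.name: t for t in tensors})
        return True

    def expire(self, mx_source_id, worker_id):
        """Drop the inventory when the worker's heartbeat lease lapses."""
        self._entries.pop((mx_source_id, worker_id), None)

    def inventory(self, mx_source_id, worker_id):
        entry = self._entries.get((mx_source_id, worker_id))
        return {} if entry is None else dict(entry[1])

    def eligible_peers(self, mx_source_id):
        """Only workers that have advertised an inventory can appear in a plan."""
        return sorted(w for (s, w) in self._entries if s == mx_source_id)
```

A worker that never advertises is simply absent from `eligible_peers`, which matches the rule that it cannot serve any tensor in a plan.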

### Why label propagation (forward-looking)

`modelexpress` keys workers by internal IDs (`mx_source_id`, `worker_id`) that mean nothing to external systems. Any future external ranking signal is keyed by its own IDs, for example a Prometheus series tagged `pod=...` or node labels such as `rack=...`. Storing `labels` alongside each worker gives the server what it needs to join those signals to workers later.

The server never interprets the keys, so adding a new signal later won't require a proto change.

Example of how this could be populated via the k8s downward API:

```yaml
env:
  - name: MX_LABEL_pod
    valueFrom: { fieldRef: { fieldPath: metadata.name } }
  - name: MX_LABEL_node
    valueFrom: { fieldRef: { fieldPath: spec.nodeName } }
  - name: MX_LABEL_tenant
    valueFrom: { fieldRef: { fieldPath: metadata.labels['tenant'] } }
```

`MX_LABEL_*` env vars become `labels` entries at publish time (`pod`, `node`, `tenant`).
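As a rough sketch, the env-to-label mapping could look like the following. The function name `labels_from_env` is hypothetical, and preserving the suffix's case and skipping a bare `MX_LABEL_` key are assumptions rather than documented behaviour.

```python
import os

LABEL_PREFIX = "MX_LABEL_"


def labels_from_env(env=None):
    """Collect MX_LABEL_<key>=<value> variables into a {key: value} map.

    Assumptions: the suffix is used verbatim as the label key, and a
    variable named exactly "MX_LABEL_" (empty key) is skipped.
    """
    env = os.environ if env is None else env
    return {
        key[len(LABEL_PREFIX):]: value
        for key, value in env.items()
        if key.startswith(LABEL_PREFIX) and len(key) > len(LABEL_PREFIX)
    }
```

Because the server treats the keys as opaque, this client-side step is the only place where label names are chosen.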

## Planning

Many different planning schemes are possible, but I think a safe default is greedy bin packing that tries to include more peers. Some research is needed to ensure the default strategy is robust at different peer counts and that parallel transfers don't disrupt inference workloads already on the fabric (KV transfers, all2alls).

The planning algorithm is a potential plugin point, but I would caution against over-rotating on extensibility and instead try to tie features directly to use-cases and examples. Too many knobs might confuse operators.
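One possible shape for the greedy default, as a sketch only: tensors are visited largest first and each goes to the owning peer with the fewest bytes assigned so far, which tends to spread work across more peers. The function name `compute_plan`, the input shape, and the tie-breaking by peer ID are all assumptions for illustration; `max_peers` and dtype handling are left out.

```python
def compute_plan(inventories, wanted=None):
    """Greedy least-loaded assignment.

    inventories: {peer_id: {tensor_name: nbytes}}
    wanted: optional list of tensor names; empty/None means the union
            of all peers' inventories.
    Returns (plan, uncovered) where plan is {peer_id: [tensor_name, ...]}.
    """
    # Which peers own each tensor (peers visited in sorted order).
    owners = {}
    for peer in sorted(inventories):
        for name in inventories[peer]:
            owners.setdefault(name, []).append(peer)

    names = list(dict.fromkeys(wanted)) if wanted else sorted(owners)
    uncovered = [n for n in names if n not in owners]
    covered = [n for n in names if n in owners]

    def size(name):
        # Assumption: every owner reports the same byte length.
        return inventories[owners[name][0]][name]

    load = {peer: 0 for peer in inventories}
    plan = {peer: [] for peer in inventories}
    # Largest tensors first; ties broken by name for determinism.
    for name in sorted(covered, key=lambda n: (-size(n), n)):
        peer = min(owners[name], key=lambda p: (load[p], p))
        plan[peer].append(name)  # each covered tensor gets exactly one peer
        load[peer] += size(name)

    return {p: ns for p, ns in plan.items() if ns}, uncovered
```

For example, with peer `a` owning `w1`, `w2`, `w3` and peer `b` owning `w1`, `w2`, the sketch gives `w1` and `w3` to `a` and `w2` to `b`; a requested name that no peer owns comes back in `uncovered`.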

## Diagnostics

Peer sort ordering is `Debug`-printable, and the server should publish metrics on planning decisions: peers considered vs. assigned, `uncovered` tensor counts, and dtype conflicts that dropped a tensor.
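The counters listed above could be gathered into one printable record per planning call. The sketch below is illustrative: `PlanDiagnostics`, `find_dtype_conflicts`, and `summarize` are hypothetical names, and a "dtype conflict" is assumed to mean that two peers advertise the same tensor name with different dtypes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanDiagnostics:
    peers_considered: int
    peers_assigned: int
    uncovered_tensors: int
    dtype_conflicts: int


def find_dtype_conflicts(inventories):
    """inventories: {peer_id: {tensor_name: dtype}}.

    Returns the sorted tensor names advertised with more than one dtype.
    """
    seen = {}
    for inventory in inventories.values():
        for name, dtype in inventory.items():
            seen.setdefault(name, set()).add(dtype)
    return sorted(name for name, dtypes in seen.items() if len(dtypes) > 1)


def summarize(inventories, plan, uncovered):
    """Build one diagnostics record from a planning decision."""
    return PlanDiagnostics(
        peers_considered=len(inventories),
        peers_assigned=sum(1 for names in plan.values() if names),
        uncovered_tensors=len(uncovered),
        dtype_conflicts=len(find_dtype_conflicts(inventories)),
    )
```

The dataclass's generated `repr` gives a printable form for logs, and each field maps onto a counter or gauge the server could export.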

## Future work (out of scope)

- **Pluggable peer ranking.** A background data-collection task on the server feeds relative scores between peers, used as a tiebreak (and hard-exclusion) during assignment. Inputs could include:
    - topology
    - live inference-server load (e.g. scraped from Prometheus)
    - tenant labels
- **Partial / diff need-sets.** Let a receiver advertise its own inventory and scope a plan to the *difference* against what peers own, rather than the full union. This enables resume-after-failure and RL weight refit/resharding, where only changed weights move. (The proto already allows an empty inventory advertisement to support this.) This would likely require model server integration.
- **Fault tolerance.** If a transfer fails, re-routing it to a different peer may be faster than falling back to a different storage backend; we should support this type of retry.


