This plan will modify the DisaggregatedPDSystem2D class so that each prefill pipeline rank also performs tensor parallelism (TP), rather than only pipeline parallelism (PP).
Integrating Tensor Parallelism (TP) alongside Pipeline Parallelism (PP) transforms the model into a 2D parallelism scheme. In this configuration, each "rank" in your pipeline is no longer a single GPU, but a TP Group of GPUs working in lockstep.
The primary change is that a single ComputeJob for a layer is now split into multiple parallel compute tasks, and new All-Reduce or All-Gather communication overheads must be modeled within each pipeline stage.
- TP Group: A set of $TP$ GPUs. Each GPU holds a shard of the weights for the layers assigned to that PP stage.
- Intra-Stage Communication: Within a single PP stage, GPUs must perform an All-Reduce operation twice per Transformer layer (once after Attention, once after MLP).
- Inter-Stage Communication: When moving from PP stage $i$ to $i+1$, each GPU in TP Group $i$ typically sends its activation shard to the corresponding GPU in TP Group $i+1$.
To track 2D parallelism, compute job names must include the TP index alongside the PP stage and chunk.
Format: P_Rank_PP[p]_TP[t]_Chunk[c]
- P: Prefill phase.
- PP[p]: The pipeline stage (0 to $PP-1$).
- TP[t]: The tensor parallel shard index (0 to $TP-1$).
- Chunk[c]: The specific prefill chunk being processed.
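A small helper can generate these names consistently; this is a minimal sketch, and the function name is an assumption rather than part of the existing class:

```python
def prefill_job_name(p: int, t: int, c: int) -> str:
    """Build a compute-job name of the form P_Rank_PP[p]_TP[t]_Chunk[c].

    Hypothetical helper; the real simulator may build names inline.
    """
    return f"P_Rank_PP[{p}]_TP[{t}]_Chunk[{c}]"

# Example: pipeline stage 1, TP shard 3, chunk 0
name = prefill_job_name(1, 3, 0)  # "P_Rank_PP[1]_TP[3]_Chunk[0]"
```

Centralizing the format in one function keeps the TP-sync and handoff logic free of string-formatting bugs when parsing job names back apart.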
In the prefill phase, All-Reduce is typically implemented as a Reduce-Scatter followed by an All-Gather. For the simulator, we model the total data moved to synchronize a layer.
- Name: TP_AR_PP[p]_Layer[l]_Chunk[c]
- Path: Typically uses NVLink or NVSwitch (local to the server).
- Trigger: Starts after the partial matrix multiplication compute job on each TP GPU.
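Under the Reduce-Scatter + All-Gather decomposition above, a ring All-Reduce moves $2 \cdot \frac{TP-1}{TP}$ of the activation size per GPU. A sketch of this volume model, with assumed parameter names:

```python
def allreduce_bytes_per_gpu(activation_bytes: int, tp: int) -> float:
    """Data each GPU moves to synchronize one layer's activations.

    Models ring All-Reduce as Reduce-Scatter followed by All-Gather,
    each phase moving (tp - 1) / tp of the message per GPU.
    """
    return 2 * (tp - 1) / tp * activation_bytes

# Example: 1 KiB of activations across a TP group of 4
vol = allreduce_bytes_per_gpu(1024, 4)  # 1536.0 bytes per GPU
```

This per-GPU volume, divided by the NVLink/NVSwitch bandwidth, gives the duration of each TP_AR batch.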
Instead of one large activation batch, we send one activation shard per TP rank.
- Name: PP_Act_FromP[p]_ToP[p+1]_TP[t]_Chunk[c]
- Path: [GPU_P[p][t]_PCI, GPU_P[p+1][t]_PCI] (Assuming they share the same PLX).
Each TP GPU in the prefill cluster is responsible for a shard of the KV cache.
- Name: Handoff_PP[p]_TP[t]_Chunk[c]
- Path: Same as your original IB path, but mapped to NICs based on the global GPU index $(p \times TP + t)$.
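The global index and NIC selection can be sketched as below; the round-robin NIC assignment is an assumption about the topology, not an existing method:

```python
def global_gpu_index(p: int, t: int, tp: int) -> int:
    """Flatten (PP stage, TP shard) into the global GPU index p * TP + t."""
    return p * tp + t

def handoff_nic(p: int, t: int, tp: int, num_nics: int) -> int:
    """Hypothetical round-robin mapping of GPUs onto IB NICs."""
    return global_gpu_index(p, t, tp) % num_nics

# Example: PP stage 2, TP shard 3, TP degree 4, 8 NICs
idx = global_gpu_index(2, 3, 4)   # 11
nic = handoff_nic(2, 3, 4, 8)     # NIC 3
```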
- Add tp_degree parameter.
- Update params_per_rank: (W // L) * (L // pp_degree) // tp_degree.
- Update kv_bytes: Divide the per-rank KV cache by tp_degree (since heads are sharded).
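The constructor changes above can be sketched with a small config object; the attribute names other than `tp_degree`, `params_per_rank`, and `kv_bytes` are assumptions standing in for the real class fields:

```python
from dataclasses import dataclass

@dataclass
class PrefillShardConfig:
    """Sketch of the updated sizing logic, not the real class __init__."""
    total_params: int    # W: total weight bytes across all layers
    num_layers: int      # L
    pp_degree: int
    tp_degree: int
    kv_bytes_full: int   # per-rank KV cache size before TP sharding

    @property
    def params_per_rank(self) -> int:
        # (W // L) bytes per layer, (L // pp_degree) layers per stage,
        # then sharded across tp_degree GPUs.
        return (self.total_params // self.num_layers) \
            * (self.num_layers // self.pp_degree) // self.tp_degree

    @property
    def kv_bytes(self) -> int:
        # Attention heads are sharded, so each TP GPU holds 1/tp of the cache.
        return self.kv_bytes_full // self.tp_degree

# Example: W=96, L=12, PP=4, TP=2 -> 8 bytes/layer * 3 layers / 2 shards = 12
cfg = PrefillShardConfig(96, 12, 4, 2, 64)
```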
The compute time remains largely the same (total FLOPS divided by total GPUs), but you must add a "TP Sync" penalty.
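One way to model this, assuming a fixed per-sync latency and two All-Reduces per Transformer layer as described earlier (all parameter names are illustrative):

```python
def stage_time_s(total_flops: float, flops_per_gpu: float,
                 pp: int, tp: int, layers_per_stage: int,
                 tp_sync_latency_s: float) -> float:
    """Ideal per-stage compute time plus a TP synchronization penalty.

    Sketch only: assumes the sync cost is a fixed latency per All-Reduce,
    with two All-Reduces (post-Attention, post-MLP) per layer.
    """
    compute_s = total_flops / (pp * tp) / flops_per_gpu
    sync_s = 2 * layers_per_stage * tp_sync_latency_s
    return compute_s + sync_s
```

A bandwidth-dependent sync term (using the All-Reduce volume model above the TP_AR batches) can replace the fixed latency for larger activations.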
- Start: P_Rank_PP[0]_TP[0...t]_Chunk[0] compute jobs are added.
- TP Sync: When a TP compute job finishes, check if all $TP$ peers in that group are done. If yes, inject the TP_AR data batches.
- PP Forward: When TP_AR is complete, inject the PP_Act shards to the next PP stage.
- Handoff: Simultaneously inject the Handoff shards to the decode cluster.
Add a tracker (e.g., self.tp_sync_gate = defaultdict(int)) to ensure the pipeline only moves forward when all $TP$ shards in a group have finished their compute for the current chunk.
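A minimal sketch of this gate, kept separate from the simulator class for clarity (the class name and method are assumptions; only `tp_sync_gate` comes from the plan):

```python
from collections import defaultdict

class TPSyncGate:
    """Counts finished TP shards per (PP stage, chunk) key."""

    def __init__(self, tp_degree: int):
        self.tp_degree = tp_degree
        self.tp_sync_gate = defaultdict(int)  # (pp_stage, chunk) -> done count

    def on_tp_job_done(self, pp_stage: int, chunk: int) -> bool:
        """Record one finished shard; return True when the group is complete."""
        key = (pp_stage, chunk)
        self.tp_sync_gate[key] += 1
        if self.tp_sync_gate[key] == self.tp_degree:
            del self.tp_sync_gate[key]  # reset so the key can be reused
            return True
        return False

# Example: TP degree 2 -- only the second completion opens the gate,
# triggering the TP_AR injection for that stage/chunk.
gate = TPSyncGate(2)
gate.on_tp_job_done(0, 0)  # False: one shard still pending
gate.on_tp_job_done(0, 0)  # True: inject TP_AR batches now
```

Deleting the key on completion keeps the dictionary small and lets the same (stage, chunk) pair be gated again for subsequent layers if needed.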