
[Feature] Support FP8 weight transfer in awex colocated (CUDA-IPC) path #1359

@guozhihao-224

Checklist

  • This feature will maintain backward compatibility with the current APIs in
    areal/api/. If not, please raise a refactor issue first.

Background

After #1310 introduced the colocated CUDA-IPC weight transfer path, the awex Megatron adapter hardcodes fp8_direct_convert=False and quantization_config=None when iterating HF parameters. As a result, a Megatron model trained with FP8EngineConfig is silently dequantized to BF16 before being transferred to the inference engine through awex, even though the legacy xccl path already supports FP8 end-to-end.

Evidence

  • Hardcoded downgrade: areal/experimental/weight_update/awex/megatron_adapter.py:288-289

    gathered = all_gather_param(
        mcore_name,
        param,
        fp8_direct_convert=False,    # forces BF16
        quantization_config=None,    # disables FP8 quantize_params
        duplicated_param_names=...,
    )
  • The xccl path uses the same all_gather_param + convert_to_hf but plumbs self.fp8_direct_convert and self.quantization_config through (megatron_engine.py:1471-1486, 1511-1521), with FP8 gather wired in megatron_utils/megatron.py:86-122 (_all_gather_fp8_tensor_and_concat).

  • Repo-wide grep for fp8|float8 under areal/experimental/weight_update/ returns only the two hardcoded lines. The receiving SGLang adapter has no FP8 awareness either.

Impact for users running Megatron+FP8 with awex colocated mode:

  1. Silent correctness gap: no error or warning; users may believe FP8 is end-to-end while rollout always sees BF16 weights.
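  2. Bandwidth regression vs xccl: the weight payload is roughly 2× larger (BF16 at 2 bytes per element vs FP8 at 1 byte per element, plus a small overhead for scale tensors).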
  3. Inference cannot leverage FP8 acceleration because it never receives FP8 weights and scales.
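
A rough sketch of the arithmetic behind item 2. The 128×128 block size and float32 scale dtype are assumptions for illustration, not values taken from the AReaL code; `transfer_bytes` is a hypothetical helper.

```python
import math

def transfer_bytes(shape, fp8, block=(128, 128)):
    """Approximate bytes sent for one 2-D weight.

    BF16 uses 2 bytes per element and FP8 uses 1 byte per element.
    ASSUMPTION: FP8 adds one float32 scale (4 bytes) per block of
    `block` elements; the real layout depends on quantization_config.
    """
    rows, cols = shape
    if not fp8:
        return rows * cols * 2
    n_scales = math.ceil(rows / block[0]) * math.ceil(cols / block[1])
    return rows * cols + n_scales * 4

shape = (4096, 4096)
bf16_bytes = transfer_bytes(shape, fp8=False)
fp8_bytes = transfer_bytes(shape, fp8=True)
ratio = bf16_bytes / fp8_bytes  # slightly below 2 because of the scales
print(bf16_bytes, fp8_bytes, round(ratio, 4))
```

Under these assumptions the scale tensors add well under 1% to the FP8 payload, so the BF16 fallback costs close to twice the bytes.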

Potential Solution

awex matches sender and receiver by parameter name through _iter_hf_params() → ParameterMeta → TransferPlan. Enabling FP8 introduces an extra *.weight_scale_inv tensor per quantized weight, which both ends must declare.
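
To illustrate why both ends must declare the scale tensors, here is a minimal sketch of a name-alignment check. It is not awex's TransferPlan logic; `unmatched` and the parameter names are hypothetical, and the real plan may carry more than names.

```python
def unmatched(sender_names, receiver_names):
    """Return (names only the sender declares, names only the receiver declares)."""
    s, r = set(sender_names), set(receiver_names)
    return sorted(s - r), sorted(r - s)

# Hypothetical example: the sender is FP8-aware, the receiver is not.
sender = [
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.mlp.down_proj.weight_scale_inv",
]
receiver = ["model.layers.0.mlp.down_proj.weight"]

sender_only, receiver_only = unmatched(sender, receiver)
print(sender_only)    # the scale tensor has no destination on the receiver
print(receiver_only)
```

If only Stage 1 lands, the sender would declare tensors the receiver never asked for, which is why Stage 2 has to ship together with it.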

Stage 1 - Sender (Megatron adapter)

In awex/megatron_adapter.py:_iter_hf_params, replace the hardcoded values with the engine's existing fields (self._engine.fp8_direct_convert, self._engine.quantization_config, self._engine.hf_config). This pattern mirrors the xccl path in megatron_engine.py:1471-1521.
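
A minimal sketch of the plumbing, using stand-ins so it runs on its own. `gather_kwargs` and `all_gather_param_stub` are hypothetical; the stub only mimics the two keyword arguments shown in the Evidence snippet and is not the real all_gather_param.

```python
from types import SimpleNamespace

def gather_kwargs(engine):
    """Hypothetical helper: read the FP8 settings from the engine.

    Falls back to the current BF16 behaviour when a field is missing,
    so non-FP8 engines keep working unchanged.
    """
    return {
        "fp8_direct_convert": getattr(engine, "fp8_direct_convert", False),
        "quantization_config": getattr(engine, "quantization_config", None),
    }

def all_gather_param_stub(name, param, *, fp8_direct_convert, quantization_config):
    """Stand-in for all_gather_param; it just records what it was given."""
    return {
        "name": name,
        "fp8_direct_convert": fp8_direct_convert,
        "quantization_config": quantization_config,
    }

# Hypothetical engines: one configured for FP8, one without FP8 fields.
fp8_engine = SimpleNamespace(
    fp8_direct_convert=True,
    quantization_config={"quant_method": "fp8"},
)
bf16_engine = SimpleNamespace()

fp8_call = all_gather_param_stub("w", None, **gather_kwargs(fp8_engine))
bf16_call = all_gather_param_stub("w", None, **gather_kwargs(bf16_engine))
print(fp8_call["fp8_direct_convert"], bf16_call["fp8_direct_convert"])
```

Reading the fields in one place keeps the awex adapter and the xccl path from drifting apart if more FP8 options are added later.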

Stage 2 - Receiver (SGLang adapter)

awex/sglang_adapter.py must mirror the FP8-aware iteration so that the inference side declares matching ParameterMeta for *.weight_scale_inv tensors, allowing the TransferPlan to align names across both ends.
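
One way the receiver-side declaration could look, as a sketch only. `Meta` is a hypothetical stand-in for ParameterMeta (its real fields are not shown in this issue), and the block size, dtype strings, and scale shape rule are assumptions about a block-wise FP8 layout.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Meta:
    """Hypothetical stand-in for awex's ParameterMeta."""
    name: str
    shape: tuple
    dtype: str

def receiver_metas(weights, fp8_names, block=(128, 128)):
    """Declare one Meta per tensor the receiver expects.

    weights: dict of parameter name -> (rows, cols)
    fp8_names: names of weights stored as FP8 on the inference side
    ASSUMPTION: each FP8 weight has one float32 scale per `block` tile.
    """
    metas = []
    for name, (rows, cols) in weights.items():
        if name in fp8_names:
            metas.append(Meta(name, (rows, cols), "float8_e4m3fn"))
            scale_shape = (math.ceil(rows / block[0]), math.ceil(cols / block[1]))
            metas.append(Meta(name + "_scale_inv", scale_shape, "float32"))
        else:
            metas.append(Meta(name, (rows, cols), "bfloat16"))
    return metas

weights = {
    "model.layers.0.mlp.up_proj.weight": (256, 384),
    "lm_head.weight": (512, 256),
}
metas = receiver_metas(weights, fp8_names={"model.layers.0.mlp.up_proj.weight"})
for m in metas:
    print(m)
```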

Stage 3 - Tests

Add tests/experimental/weight_update/test_colocate_fp8.py: Megatron FP8EngineConfig enabled → awex colocated transfer → SGLang. Compare post-transfer weights (after dequant) against the xccl path baseline.
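
The comparison step might be sketched as below, in plain Python so it runs without a GPU. It assumes the dequantized weight is the quantized value multiplied by its block's weight_scale_inv entry; the helper names and toy values are hypothetical, and a real test would operate on tensors pulled from both transfer paths.

```python
def dequant_blockwise(q, scale_inv, block):
    """Dequantize a 2-D list: each element times its block's scale.

    ASSUMPTION: dequantized = q * scale_inv[block_row][block_col].
    """
    br, bc = block
    return [
        [q[i][j] * scale_inv[i // br][j // bc] for j in range(len(q[0]))]
        for i in range(len(q))
    ]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2-D lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy data: a 2x2 weight with one scale per row-block of shape (1, 2).
q = [[1.0, 2.0], [3.0, 4.0]]
scale_inv = [[0.5], [2.0]]
awex_weights = dequant_blockwise(q, scale_inv, block=(1, 2))

# Stand-in for the weights obtained through the xccl baseline.
xccl_weights = [[0.5, 1.0], [6.0, 8.0]]

diff = max_abs_diff(awex_weights, xccl_weights)
assert diff <= 1e-6, f"awex and xccl weights diverge by {diff}"
```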

Reference implementation: the xccl path (MegatronEngine._collect_param and _impl_update_weight_from_distributed) already does the equivalent work end-to-end, so Stage 1 is largely a matter of plumbing existing fields through.

Additional Information

Scope constraint: PR #1310 limits colocated mode to TP=PP=EP=1. This work should follow the same scope; broader parallelism support depends on the awex roadmap.

Open question: how should the SGLang side discover the FP8 configuration: by parsing it from the running SGLang server, or by being told via the gateway / KV store metadata?

Related:
