Flux Development Documentation

This document consolidates development progress, code review notes, and fixes applied to the Flux framework.

Development Status
Code Review Summary
Fixes Applied
Remaining Issues
Reference Repositories

Development Status

All phases of Flux development have been completed. The framework is production-ready.

Progress Summary

Phase	Status	Tests
Phase 1: Core Foundation	✅ Complete	86/86
Phase 2: Algorithm Framework	✅ Complete	68/68
Phase 3: Async Infrastructure	✅ Complete	97/97
Phase 4: Native Engines	✅ Complete	75/75
Phase 5: Coordinator & Trainer	✅ Complete	88/88
Phase 6: Production Readiness	✅ Complete	71/71

Total Tests: 485 passing

Phase 1: Core Foundation ✅

Make the package importable and testable.

Core Types (flux/core/types.py): TrainingPhase, TrainingState, PolicyVersion, AsyncDecision, StalenessMetrics, BatchMetrics, RolloutMetrics
Trajectory Data Structures (flux/core/trajectory.py): VersionSegment, Trajectory, PartialTrajectory, TrajectoryBatch, TrajectoryBuffer
Metrics Collection (flux/core/metrics.py): RunningStatistics, MetricsSnapshot, MetricsAggregator, MetricsLogger
Trainer Stub (flux/trainer.py): FluxTrainer stub with fit() signature
Tests: 86 tests in tests/unit/

Phase 2: Algorithm Framework ✅

VERL-style extensible algorithm design with registry pattern.

Algorithm Base & Registry (flux/training/algorithms/base.py):
- ADV_ESTIMATOR_REGISTRY, POLICY_LOSS_REGISTRY
- @register_adv_estimator(name), @register_policy_loss(name)
Built-in Algorithms:
- PPO: Clipped surrogate + GAE
- GRPO: Group Relative Policy Optimization (default)
- REINFORCE: Basic policy gradient with baseline
- DPO: Direct Preference Optimization + IPO
- DAPO: Decoupled clip + dynamic sampling
- GSPO: Group Stability Policy Optimization
- RLOO: Leave-One-Out baseline
Importance Correction (flux/controller/importance.py): UnifiedImportanceCorrection
Tests: 68 tests (44 algorithm + 24 importance)

Phase 3: Async Infrastructure ✅

AReaL-style asynchronous training with staleness control.

Async Task Runner (flux/coordinator/async_runner.py): Background thread with uvloop, thread-safe queues
Staleness Manager (flux/controller/staleness.py): KL + IW variance + version gap computation
Adaptive Async Controller (flux/controller/adaptive_async.py): PID controller with anti-windup
Tests: 97 tests

Phase 4: Native Engines ✅

Slime-style native integration with SGLang and Megatron.

SGLang Client (flux/rollout/sglang_client.py): HTTP client with connection pooling, retry logic
Streaming Rollout Manager (flux/rollout/manager.py): APRIL strategy implementation
Length Predictor (flux/rollout/length_predictor.py): Feature-based length prediction
Partial Buffer (flux/rollout/partial_buffer.py): Hash-based prompt matching
Weight Sync (flux/sync/weight_sync.py): Version tracking, lazy sync
CUDA IPC (flux/sync/cuda_ipc.py): Zero-copy tensor transfer
Delta Compression (flux/sync/delta_compression.py): Sparse and quantized encoding
Megatron Engine (flux/training/megatron_engine.py): Model loading, train_step()
Tests: 75 tests

Phase 5: Coordinator & Trainer ✅

Complete the training loop.

Smart Batch Composer (flux/training/batch_composer.py): Length bucketing, staleness balancing
Rewards (flux/rewards/): RewardFunction, CompositeReward, rule-based and model-based rewards
Coordinator (flux/coordinator/coordinator.py): Orchestrates rollout → reward → train → sync
Full Trainer (flux/trainer.py): fit(), checkpoints, evaluation
CLI (flux/cli.py): train, test, generate, info commands
Tests: 88 tests

Phase 6: Production Readiness ✅

Fault Tolerance (flux/utils/checkpoint.py, flux/utils/fault_tolerance.py):
- Atomic saves, checkpoint registry, best checkpoint tracking
- GracefulShutdown, RetryConfig, CircuitBreaker, HealthMonitor
Monitoring (flux/utils/monitoring.py):
- Prometheus-style metrics, HTTP exporter (/metrics, /health)
Documentation: Getting started, configuration, algorithms, API reference
Examples: basic_training.py, custom_reward.py, custom_algorithm.py
Configs: qwen3-8b-8gpu.yaml, qwen3-72b-64gpu.yaml
Tests: 71 tests

Code Review Summary

Review Date: 2026-01-09 Overall Grade: B+ (Production-ready foundation with fixes applied)

Quality Grades

Category	Grade	Notes
Architecture	A	Clean three-layer design, good separation of concerns
Code Quality	B+	Consistent style, good type hints, minor issues fixed
Error Handling	B	Improved with fixes, still some broad catches
Testing	B	Good structure, needs more E2E tests
Documentation	A-	Good docstrings and CLAUDE.md, needs API docs
Security	B	Basic validation added, needs more hardening
Performance	B+	Good optimizations, some remaining concerns

Fixes Applied

Import at EOF (coordinator.py)

Issue: AsyncIterator was imported at the end of the file (line 907) instead of at the top.

Fix: Moved import to top of file with other typing imports.

# After
from typing import Any, AsyncIterator, Callable, Iterator

Thread Safety in record_sync() (adaptive_async.py)

Issue: staleness_manager.record_sync() was called outside the lock, creating a race condition.

Location: flux/controller/adaptive_async.py:321-327

Fix: Moved the staleness_manager call inside the lock.

def record_sync(self) -> None:
    with self._lock:
        self._steps_since_sync = 0
        if self.staleness_manager is not None:
            self.staleness_manager.record_sync()  # Now inside lock

Broad Exception Handling (sglang_client.py)

Issue: Used except (httpx.RequestError, Exception) which catches too broadly.

Location: flux/rollout/sglang_client.py:379-389

Fix: Separated into specific exception handlers:

httpx.HTTPStatusError - HTTP errors with response
httpx.RequestError - Request-level errors
httpx.TimeoutException, ConnectionError, OSError - Network-level errors

Type Annotation Inconsistency (batch_composer.py)

Issue: max_length and max_staleness fields used float("inf") but type hint was int.

Location: flux/training/batch_composer.py:25-30, 55-61

Fix: Changed type hints to int | float to properly reflect that infinity is used.

max_length: int | float  # Can be float("inf") for unbounded upper limit

Input Validation for Prompts (trainer.py)

Issue: _prepare_prompts() had no size limits or sanitization.

Location: flux/trainer.py:499-560

Fix: Added comprehensive validation:

Maximum prompt length (default 32768 chars) with truncation
Maximum number of prompts limit
Strip leading/trailing whitespace
Skip empty prompts
Logging for truncated prompts

Silent Failure in Reward Computation (coordinator.py)

Issue: When reward computation failed, it silently set reward to 0.0 which could mask real problems.

Location: flux/coordinator/coordinator.py:647-687

Fix: Added:

More descriptive warning message per failure
Metadata tracking for failed computations
Error-level logging when failure rate exceeds 10%
Failure information stored in trajectory metadata

StreamingRolloutManager Integration

Files changed: flux/coordinator/coordinator.py

Added StreamingRolloutManager with APRIL strategy
Refactored _generate_rollouts_sync() for proper event loop handling
Added _generate_rollouts_stub() for testing without SGLang
Full APRIL strategy: oversample, abort long-tail, partial reuse

Weight Sync via SGLangClient.update_weights

Files changed: flux/coordinator/coordinator.py

Updated _sync_weights_async() to push weights via SGLang HTTP API
Graceful fallback when update_weights not available
Proper error handling and logging

Dedicated Event Loop for Async Operations

Files changed: flux/coordinator/coordinator.py

Added _ensure_async_loop() for dedicated background loop
Works in Jupyter notebooks without nest_asyncio
No event loop conflicts with existing async contexts
Proper cleanup on shutdown via _stop_async_loop()

Centralized Weight Serialization

Files changed: flux/sync/weight_sync.py

Moved all serialization logic to WeightSyncManager.serialize_for_sync()
Handles Full, Delta, and Per-tensor methods
Proper baseline management with get_baseline() and set_baseline()

Delta Baseline Versioning

Files changed: flux/sync/weight_sync.py, flux/coordinator/coordinator.py

Added baseline version tracking for delta computation
Baseline updated only after successful update_weights
Serialization moves tensors to CPU before torch.save

Config Flexibility Updates

Configurable Weight Sync Method: WeightSyncMethod enum with "full", "delta", "per_tensor" options
Pure String Algorithm Names: Changed AlgorithmConfig.name from enum to str for flexibility

Remaining Issues

High Priority

GPU Memory Management
- Location: flux/core/trajectory.py:383-384
- Issue: Trajectories contain torch tensors that could accumulate GPU memory
- Recommendation: Add explicit tensor cleanup or move to CPU when storing
Missing E2E Tests
- Location: tests/e2e/
- Issue: Directory exists but contains only __init__.py
- Recommendation: Add full training loop tests

Medium Priority

GRPO Vectorization
- Location: flux/training/algorithms/grpo.py:77-79
- Issue: CPU-bound loop for grouping could be vectorized
- Recommendation: Use torch operations instead of Python dict
Potential Off-by-One
- Location: flux/controller/adaptive_async.py:269
- Issue: >= vs > for max_steps_without_sync check
- Recommendation: Verify intended behavior
Unnecessary Tensor Copies
- Location: flux/sync/weight_sync.py:209
- Issue: Multiple .clone().cpu() calls could be expensive
- Recommendation: Consider lazy evaluation or single copy

Low Priority

API Documentation
- Issue: No sphinx/mkdocs documentation
- Recommendation: Add API documentation generator
Architecture Diagrams
- Issue: No visual documentation
- Recommendation: Add diagrams for three-layer architecture

Testing Recommendations

After making changes, run:

# Run all tests
pytest

# Run with coverage
pytest --cov=flux --cov-report=html

# Type checking
mypy flux/

# Linting
ruff check .

Reference Repositories

When implementing components, study these for best practices:

VERL (/Users/xxzhou/OSS/verl): Algorithm extensibility, HybridFlow controller
AReaL (/Users/xxzhou/OSS/AReaL): Async architecture, staleness management
Slime (/Users/xxzhou/OSS/slime): SGLang integration, APRIL strategy, CUDA IPC

Technical Notes

All algorithms use registry pattern (no class inheritance)
Importance weights computed by framework, algorithms just compute loss
Staleness = 0.4 * KL + 0.3 * IW_variance + 0.3 * version_gap
PID controller: async_ratio += kperror + kiintegral + kd*derivative

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flux Development Documentation

Table of Contents

Development Status

Progress Summary

Phase 1: Core Foundation ✅

Phase 2: Algorithm Framework ✅

Phase 3: Async Infrastructure ✅

Phase 4: Native Engines ✅

Phase 5: Coordinator & Trainer ✅

Phase 6: Production Readiness ✅

Code Review Summary

Quality Grades

Fixes Applied

Import at EOF (coordinator.py)

Thread Safety in record_sync() (adaptive_async.py)

Broad Exception Handling (sglang_client.py)

Type Annotation Inconsistency (batch_composer.py)

Input Validation for Prompts (trainer.py)

Silent Failure in Reward Computation (coordinator.py)

StreamingRolloutManager Integration

Weight Sync via SGLangClient.update_weights

Dedicated Event Loop for Async Operations

Centralized Weight Serialization

Delta Baseline Versioning

Config Flexibility Updates

Remaining Issues

High Priority

Medium Priority

Low Priority

Testing Recommendations

Reference Repositories

Technical Notes

FilesExpand file tree

DEVELOPMENT.md

Latest commit

History

DEVELOPMENT.md

File metadata and controls

Flux Development Documentation

Table of Contents

Development Status

Progress Summary

Phase 1: Core Foundation ✅

Phase 2: Algorithm Framework ✅

Phase 3: Async Infrastructure ✅

Phase 4: Native Engines ✅

Phase 5: Coordinator & Trainer ✅

Phase 6: Production Readiness ✅

Code Review Summary

Quality Grades

Fixes Applied

Import at EOF (coordinator.py)

Thread Safety in record_sync() (adaptive_async.py)

Broad Exception Handling (sglang_client.py)

Type Annotation Inconsistency (batch_composer.py)

Input Validation for Prompts (trainer.py)

Silent Failure in Reward Computation (coordinator.py)

StreamingRolloutManager Integration

Weight Sync via SGLangClient.update_weights

Dedicated Event Loop for Async Operations

Centralized Weight Serialization

Delta Baseline Versioning

Config Flexibility Updates

Remaining Issues

High Priority

Medium Priority

Low Priority

Testing Recommendations

Reference Repositories

Technical Notes