This document consolidates development progress, code review notes, and fixes applied to the Flux framework.
All phases of Flux development have been completed. The framework is production-ready.
| Phase | Status | Tests |
|---|---|---|
| Phase 1: Core Foundation | ✅ Complete | 86/86 |
| Phase 2: Algorithm Framework | ✅ Complete | 68/68 |
| Phase 3: Async Infrastructure | ✅ Complete | 97/97 |
| Phase 4: Native Engines | ✅ Complete | 75/75 |
| Phase 5: Coordinator & Trainer | ✅ Complete | 88/88 |
| Phase 6: Production Readiness | ✅ Complete | 71/71 |
Total Tests: 485 passing
Phase 1 (Core Foundation): make the package importable and testable.

- Core Types (`flux/core/types.py`): TrainingPhase, TrainingState, PolicyVersion, AsyncDecision, StalenessMetrics, BatchMetrics, RolloutMetrics
- Trajectory Data Structures (`flux/core/trajectory.py`): VersionSegment, Trajectory, PartialTrajectory, TrajectoryBatch, TrajectoryBuffer
- Metrics Collection (`flux/core/metrics.py`): RunningStatistics, MetricsSnapshot, MetricsAggregator, MetricsLogger
- Trainer Stub (`flux/trainer.py`): FluxTrainer stub with a fit() signature
- Tests: 86 tests in `tests/unit/`
Phase 2 (Algorithm Framework): VERL-style extensible algorithm design with a registry pattern.

- Algorithm Base & Registry (`flux/training/algorithms/base.py`): `ADV_ESTIMATOR_REGISTRY`, `POLICY_LOSS_REGISTRY`, `@register_adv_estimator(name)`, `@register_policy_loss(name)` (a minimal sketch follows this phase's list)
- Built-in Algorithms:
  - PPO: Clipped surrogate + GAE
  - GRPO: Group Relative Policy Optimization (default)
  - REINFORCE: Basic policy gradient with baseline
  - DPO: Direct Preference Optimization + IPO
  - DAPO: Decoupled clip + dynamic sampling
  - GSPO: Group Sequence Policy Optimization
  - RLOO: Leave-One-Out baseline
- Importance Correction (`flux/controller/importance.py`): UnifiedImportanceCorrection
- Tests: 68 tests (44 algorithm + 24 importance)
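A minimal sketch of the decorator-based registry pattern named above. The registry and decorator names mirror the bullets, while the GAE body and the lookup are illustrative, not the actual Flux code:

```python
from typing import Callable, Dict

# Hypothetical registry mirroring ADV_ESTIMATOR_REGISTRY in the bullets above.
ADV_ESTIMATOR_REGISTRY: Dict[str, Callable] = {}

def register_adv_estimator(name: str) -> Callable:
    """Decorator that registers an advantage estimator under a string key."""
    def decorator(fn: Callable) -> Callable:
        ADV_ESTIMATOR_REGISTRY[name] = fn
        return fn
    return decorator

@register_adv_estimator("gae")
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Standard GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages

# Algorithms are looked up by name at runtime -- no class inheritance needed.
estimator = ADV_ESTIMATOR_REGISTRY["gae"]
```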
Phase 3 (Async Infrastructure): AReaL-style asynchronous training with staleness control.

- Async Task Runner (`flux/coordinator/async_runner.py`): Background thread with uvloop, thread-safe queues
- Staleness Manager (`flux/controller/staleness.py`): KL + IW variance + version gap computation (see the sketch after this list)
- Adaptive Async Controller (`flux/controller/adaptive_async.py`): PID controller with anti-windup
- Tests: 97 tests
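The staleness computation can be sketched directly from the weighted formula in the key-principles note at the end of this document. The 0.4/0.3/0.3 weights come from that note; the version-gap normalization is an assumption for illustration, not necessarily what StalenessManager does:

```python
def staleness_score(kl_div: float, iw_variance: float, version_gap: int,
                    max_gap: int = 8) -> float:
    """Combine the three staleness signals into a single score.

    Weights follow the documented formula:
        staleness = 0.4 * KL + 0.3 * IW_variance + 0.3 * version_gap
    The version gap is normalized by max_gap here (an illustrative choice).
    """
    gap_term = min(version_gap / max_gap, 1.0)
    return 0.4 * kl_div + 0.3 * iw_variance + 0.3 * gap_term
```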
Phase 4 (Native Engines): Slime-style native integration with SGLang and Megatron.

- SGLang Client (`flux/rollout/sglang_client.py`): HTTP client with connection pooling, retry logic
- Streaming Rollout Manager (`flux/rollout/manager.py`): APRIL strategy implementation
- Length Predictor (`flux/rollout/length_predictor.py`): Feature-based length prediction
- Partial Buffer (`flux/rollout/partial_buffer.py`): Hash-based prompt matching
- Weight Sync (`flux/sync/weight_sync.py`): Version tracking, lazy sync
- CUDA IPC (`flux/sync/cuda_ipc.py`): Zero-copy tensor transfer
- Delta Compression (`flux/sync/delta_compression.py`): Sparse and quantized encoding (see the sketch after this list)
- Megatron Engine (`flux/training/megatron_engine.py`): Model loading, train_step()
- Tests: 75 tests
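A hedged sketch of the sparse-delta idea behind `flux/sync/delta_compression.py`: encode only the weight entries whose change exceeds a threshold. The threshold and function signatures are assumptions, not the module's real API:

```python
import torch

def encode_sparse_delta(new: torch.Tensor, baseline: torch.Tensor,
                        threshold: float = 1e-6):
    """Return (indices, values, shape) for entries that changed meaningfully."""
    delta = (new - baseline).flatten()
    mask = delta.abs() > threshold
    indices = mask.nonzero(as_tuple=False).squeeze(1)
    return indices, delta[indices], new.shape

def decode_sparse_delta(baseline: torch.Tensor, indices, values, shape):
    """Apply a sparse delta on top of the baseline tensor."""
    flat = baseline.flatten().clone()
    flat[indices] += values
    return flat.reshape(shape)
```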
Phase 5 (Coordinator & Trainer): complete the training loop.

- Smart Batch Composer (`flux/training/batch_composer.py`): Length bucketing, staleness balancing (see the sketch after this list)
- Rewards (`flux/rewards/`): RewardFunction, CompositeReward, rule-based and model-based rewards
- Coordinator (`flux/coordinator/coordinator.py`): Orchestrates rollout → reward → train → sync
- Full Trainer (`flux/trainer.py`): fit(), checkpoints, evaluation
- CLI (`flux/cli.py`): train, test, generate, info commands
- Tests: 88 tests
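A minimal sketch of the length-bucketing idea behind the Smart Batch Composer; the bucket boundaries and the `.length` attribute are illustrative assumptions:

```python
from collections import defaultdict

def bucket_by_length(trajectories, boundaries=(256, 512, 1024, 2048)):
    """Group trajectories into length buckets so each batch contains
    similar sequence lengths, reducing padding waste.

    `trajectories` is assumed to be a list of objects with a `.length`
    attribute (hypothetical interface, for illustration).
    """
    buckets = defaultdict(list)
    for traj in trajectories:
        # First boundary that fits, else an overflow bucket.
        key = next((b for b in boundaries if traj.length <= b), "overflow")
        buckets[key].append(traj)
    return buckets
```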
Phase 6 (Production Readiness):

- Fault Tolerance (`flux/utils/checkpoint.py`, `flux/utils/fault_tolerance.py`):
  - Atomic saves, checkpoint registry, best checkpoint tracking (see the atomic-save sketch after this list)
  - GracefulShutdown, RetryConfig, CircuitBreaker, HealthMonitor
- Monitoring (`flux/utils/monitoring.py`):
  - Prometheus-style metrics, HTTP exporter (/metrics, /health)
- Documentation: Getting started, configuration, algorithms, API reference
- Examples: basic_training.py, custom_reward.py, custom_algorithm.py
- Configs: qwen3-8b-8gpu.yaml, qwen3-72b-64gpu.yaml
- Tests: 71 tests
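The atomic-save pattern mentioned under Fault Tolerance is worth spelling out. This is a generic sketch of the technique (write to a temp file, fsync, then `os.replace`), not the actual `flux/utils/checkpoint.py` code:

```python
import os
import tempfile
import torch

def atomic_save(state: dict, path: str) -> None:
    """Save a checkpoint so a crash mid-write never leaves a corrupt file.

    torch.save writes to a temp file in the same directory; os.replace is
    atomic on POSIX, so `path` is always either the old complete file or
    the new complete one, never a partial write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # discard the partial temp file on any failure
        raise
```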
Review Date: 2026-01-09
Overall Grade: B+ (production-ready foundation with fixes applied)
| Category | Grade | Notes |
|---|---|---|
| Architecture | A | Clean three-layer design, good separation of concerns |
| Code Quality | B+ | Consistent style, good type hints, minor issues fixed |
| Error Handling | B | Improved with fixes, still some broad catches |
| Testing | B | Good structure, needs more E2E tests |
| Documentation | A- | Good docstrings and CLAUDE.md, needs API docs |
| Security | B | Basic validation added, needs more hardening |
| Performance | B+ | Good optimizations, some remaining concerns |
Issue: AsyncIterator was imported at the end of the file (line 907) instead of at the top.
Fix: Moved import to top of file with other typing imports.
```python
# After
from typing import Any, AsyncIterator, Callable, Iterator
```

Issue: `staleness_manager.record_sync()` was called outside the lock, creating a race condition.
Location: flux/controller/adaptive_async.py:321-327
Fix: Moved the staleness_manager call inside the lock.
```python
def record_sync(self) -> None:
    with self._lock:
        self._steps_since_sync = 0
        if self.staleness_manager is not None:
            self.staleness_manager.record_sync()  # now inside the lock
```

Issue: Used `except (httpx.RequestError, Exception)`, which catches too broadly: the bare `Exception` makes the `httpx.RequestError` redundant and swallows unrelated errors.
Location: flux/rollout/sglang_client.py:379-389
Fix: Separated into specific exception handlers:
- `httpx.HTTPStatusError`: HTTP errors with a response
- `httpx.RequestError`: request-level errors
- `httpx.TimeoutException`, `ConnectionError`, `OSError`: network-level errors
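A sketch of what the separated handlers look like. `httpx.HTTPStatusError`, `httpx.TimeoutException`, and `httpx.RequestError` are real httpx exception types (TimeoutException subclasses RequestError, so it must be caught first); the surrounding function is illustrative, not the client's actual code:

```python
import httpx

def post_with_diagnostics(client: httpx.Client, url: str, payload: dict):
    """Issue a request and classify failures instead of a blanket `except Exception`."""
    try:
        response = client.post(url, json=payload)
        response.raise_for_status()
        return response.json()
    except httpx.HTTPStatusError as exc:
        # The server answered with 4xx/5xx -- the response body is available.
        raise RuntimeError(
            f"HTTP {exc.response.status_code}: {exc.response.text}") from exc
    except httpx.TimeoutException as exc:
        raise RuntimeError(f"request to {url} timed out") from exc
    except (httpx.RequestError, ConnectionError, OSError) as exc:
        # The request never completed: DNS, connect, or socket-level failure.
        raise RuntimeError(f"request to {url} failed: {exc}") from exc
```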
Issue: `max_length` and `max_staleness` fields used `float("inf")` but the type hint was `int`.
Location: flux/training/batch_composer.py:25-30, 55-61
Fix: Changed type hints to int | float to properly reflect that infinity is used.
```python
max_length: int | float  # can be float("inf") for an unbounded upper limit
```

Issue: `_prepare_prompts()` had no size limits or sanitization.
Location: flux/trainer.py:499-560
Fix: Added comprehensive validation:
- Maximum prompt length (default 32768 chars) with truncation
- Maximum number of prompts limit
- Strip leading/trailing whitespace
- Skip empty prompts
- Logging for truncated prompts
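A sketch of that validation logic. The function mirrors `_prepare_prompts()` in spirit, with the 32768-char default from the fix notes; the prompt-count limit and the signature are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

def prepare_prompts(prompts: list[str], max_chars: int = 32768,
                    max_prompts: int = 10000) -> list[str]:
    """Validate and sanitize raw prompts before rollout generation."""
    if len(prompts) > max_prompts:
        raise ValueError(f"too many prompts: {len(prompts)} > {max_prompts}")
    cleaned = []
    for i, prompt in enumerate(prompts):
        prompt = prompt.strip()          # strip leading/trailing whitespace
        if not prompt:
            continue                     # skip empty prompts
        if len(prompt) > max_chars:
            logger.warning("prompt %d truncated from %d to %d chars",
                           i, len(prompt), max_chars)
            prompt = prompt[:max_chars]  # truncate over-long prompts
        cleaned.append(prompt)
    return cleaned
```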
Issue: When reward computation failed, the reward was silently set to 0.0, which could mask real problems.
Location: flux/coordinator/coordinator.py:647-687
Fix: Added:
- More descriptive warning message per failure
- Metadata tracking for failed computations
- Error-level logging when failure rate exceeds 10%
- Failure information stored in trajectory metadata
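A sketch of the failure-tracking behavior, assuming trajectories expose `id`, `reward`, and a `metadata` dict (an illustrative interface, not the coordinator's real one):

```python
import logging

logger = logging.getLogger(__name__)

def compute_rewards(trajectories, reward_fn):
    """Compute rewards, recording failures instead of silently zeroing them."""
    failures = 0
    for traj in trajectories:
        try:
            traj.reward = reward_fn(traj)
        except Exception as exc:
            failures += 1
            traj.reward = 0.0
            # Keep the failure visible: per-failure warning + metadata record.
            traj.metadata["reward_error"] = repr(exc)
            logger.warning("reward failed for trajectory %s: %s", traj.id, exc)
    rate = failures / max(len(trajectories), 1)
    if rate > 0.10:
        logger.error("reward failure rate %.1f%% exceeds 10%% threshold",
                     rate * 100)
    return trajectories
```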
Files changed: flux/coordinator/coordinator.py
- Added `StreamingRolloutManager` with APRIL strategy
- Refactored `_generate_rollouts_sync()` for proper event loop handling
- Added `_generate_rollouts_stub()` for testing without SGLang
- Full APRIL strategy: oversample, abort long-tail, partial reuse
Files changed: flux/coordinator/coordinator.py
- Updated `_sync_weights_async()` to push weights via the SGLang HTTP API
- Graceful fallback when `update_weights` is not available
- Proper error handling and logging
Files changed: flux/coordinator/coordinator.py
- Added `_ensure_async_loop()` for a dedicated background loop (see the sketch after this list)
- Works in Jupyter notebooks without nest_asyncio
- No event loop conflicts with existing async contexts
- Proper cleanup on shutdown via `_stop_async_loop()`
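The dedicated-loop technique behind `_ensure_async_loop()` is a standard asyncio pattern; this generic sketch shows it independently of the coordinator code:

```python
import asyncio
import threading

class BackgroundLoop:
    """Run one asyncio event loop on a daemon thread, so synchronous code
    (including Jupyter cells that already have a running loop) can submit
    coroutines without event-loop conflicts."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever,
                                        daemon=True)
        self._thread.start()

    def run(self, coro):
        """Submit a coroutine and block until its result is ready."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result()

    def stop(self) -> None:
        """Stop the loop and join the thread (cleanup on shutdown)."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
```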
Files changed: flux/sync/weight_sync.py
- Moved all serialization logic to `WeightSyncManager.serialize_for_sync()`
- Handles Full, Delta, and Per-tensor methods
- Proper baseline management with `get_baseline()` and `set_baseline()`
Files changed: flux/sync/weight_sync.py, flux/coordinator/coordinator.py
- Added baseline version tracking for delta computation
- Baseline updated only after a successful `update_weights`
- Serialization moves tensors to CPU before `torch.save` (see the sketch below)
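A sketch of that CPU-move step: detach each tensor and move it to CPU before `torch.save`, so serialization never holds GPU memory. Illustrative only, not `WeightSyncManager.serialize_for_sync()` itself:

```python
import io
import torch

def serialize_state_dict(state_dict: dict) -> bytes:
    """Serialize weights for sync, moving every tensor to CPU first."""
    cpu_state = {name: tensor.detach().cpu()
                 for name, tensor in state_dict.items()}
    buffer = io.BytesIO()
    torch.save(cpu_state, buffer)
    return buffer.getvalue()
```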
- Configurable Weight Sync Method: `WeightSyncMethod` enum with "full", "delta", and "per_tensor" options
- Pure String Algorithm Names: Changed `AlgorithmConfig.name` from an enum to `str` for flexibility
- GPU Memory Management
  - Location: `flux/core/trajectory.py:383-384`
  - Issue: Trajectories contain torch tensors that could accumulate GPU memory
  - Recommendation: Add explicit tensor cleanup or move tensors to CPU when storing
- Missing E2E Tests
  - Location: `tests/e2e/`
  - Issue: Directory exists but contains only `__init__.py`
  - Recommendation: Add full training loop tests
- GRPO Vectorization
  - Location: `flux/training/algorithms/grpo.py:77-79`
  - Issue: CPU-bound loop for grouping could be vectorized
  - Recommendation: Use torch operations instead of a Python dict
- Potential Off-by-One
  - Location: `flux/controller/adaptive_async.py:269`
  - Issue: `>=` vs `>` for the max_steps_without_sync check
  - Recommendation: Verify intended behavior
- Unnecessary Tensor Copies
  - Location: `flux/sync/weight_sync.py:209`
  - Issue: Multiple `.clone().cpu()` calls could be expensive
  - Recommendation: Consider lazy evaluation or a single copy
- API Documentation
  - Issue: No Sphinx/MkDocs documentation
  - Recommendation: Add an API documentation generator
- Architecture Diagrams
  - Issue: No visual documentation
  - Recommendation: Add diagrams for the three-layer architecture
After making changes, run:

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=flux --cov-report=html

# Type checking
mypy flux/

# Linting
ruff check .
```

When implementing components, study these frameworks for best practices:
- VERL (`/Users/xxzhou/OSS/verl`): Algorithm extensibility, HybridFlow controller
- AReaL (`/Users/xxzhou/OSS/AReaL`): Async architecture, staleness management
- Slime (`/Users/xxzhou/OSS/slime`): SGLang integration, APRIL strategy, CUDA IPC
- All algorithms use the registry pattern (no class inheritance)
- Importance weights are computed by the framework; algorithms just compute the loss
- Staleness: `staleness = 0.4 * KL + 0.3 * IW_variance + 0.3 * version_gap`
- PID controller: `async_ratio += kp*error + ki*integral + kd*derivative` (see the sketch below)
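A sketch of that PID update with the anti-windup clamp mentioned in Phase 3; the gains and clamp bound are illustrative assumptions, not Flux's tuned values:

```python
class PIDController:
    """Implements async_ratio += kp*error + ki*integral + kd*derivative,
    with the integral term clamped to prevent windup."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.05, integral_limit=10.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral_limit = integral_limit
        self._integral = 0.0
        self._prev_error = 0.0

    def update(self, error: float, dt: float = 1.0) -> float:
        # Anti-windup: clamp the accumulated integral so a long period of
        # one-sided error cannot dominate future corrections.
        self._integral = max(-self.integral_limit,
                             min(self.integral_limit,
                                 self._integral + error * dt))
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return (self.kp * error
                + self.ki * self._integral
                + self.kd * derivative)
```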