diff --git a/examples/README.md b/examples/README.md index 09ce407..57e7699 100644 --- a/examples/README.md +++ b/examples/README.md @@ -2,6 +2,7 @@ | Date | Summary of Changes | |------------|--------------------| +| 2026-06-14 | Added PDD pd-disaggregation examples, script index, and release-scope guidance for local PR preparation. | | 2026-06-08 | Clarified that dummy analytical co-location smoke runs validate runtime plumbing, not profiling fidelity. | | 2026-06-08 | Split co-location examples into `offline/` and `online/`, added suite runner and cross-validation guidance. | | 2026-06-07 | Added optimized co-location advanced MoE recipes, top-level profiling examples, and corrected metrics behavior for Thinking Mode. | @@ -16,36 +17,43 @@ This directory contains runnable examples for the release-supported Frontier sim ## Release Scope -`pre-release-v0.1` supports only the `co-location` architecture. Historical `pd-disaggregation` and `pd-af-disaggregation` examples are intentionally not included in this branch. If those architectures are requested through CLI/config, Frontier exits with the release error documented in the top-level `README.md`. +`pre-release-v0.2` foregrounds **PDD / `pd-disaggregation`** examples: prefill runs in the `PREFILL` cluster, decode runs in the unified `DECODE` cluster, and KV cache is transferred between them. The public PDD example path uses the sequential simulator mode through `--no-enable_parallel_clusters`. -## Quick Start +The `pd-af-disaggregation` architecture and split `DECODE_ATTN` / `DECODE_FFN` release surface remain intentionally outside this examples release scope. Co-location examples are still kept as baseline comparison recipes and historical v0.1-compatible references. -The co-location examples are split by simulation mode: +## Quick Start -- `examples/architecture/co-location/offline/`: offline batch simulations. Existing offline examples were moved here unchanged in scenario intent. 
-- `examples/architecture/co-location/online/`: online serving simulations that mirror the offline scenarios while preserving generated request arrivals. -- `examples/architecture/co-location/run_all.sh`: one-click suite runner for all 10 co-location cases. +The PDD dense example uses dummy execution-time prediction and the analytical communication backend, so it does not require profiling data or the collective-sim binary for the first smoke run. ```bash export PYTHONPATH=$PWD export WANDB_DISABLED=true export VIDUR_DISABLE_WANDB=1 -# Run all five offline cases and all five online cases. -bash examples/architecture/co-location/run_all.sh +bash examples/architecture/pdd/offline/dense_model_basic.sh +``` -# Run one case directly. -bash examples/architecture/co-location/offline/dense_model_basic.sh -bash examples/architecture/co-location/online/dense_model_basic_online.sh +For the complete PDD architecture suite, run: -# Thinking Mode examples are available in both modes. -bash examples/architecture/co-location/offline/thinking_mode_basic.sh -bash examples/architecture/co-location/online/thinking_mode_basic_online.sh +```bash +bash examples/architecture/pdd/run_all.sh ``` -All co-location examples default to `--cc_backend_config_type analytical` so the suite is one-click runnable on a fresh checkout without building the collective-sim binary. To exercise the topology-aware backend, set `CC_BACKEND=collective_sim` and build `frontier/cc_backend/backends/collective-sim/sim/datacenter/htsim_ndp` first. +Co-location baseline and advanced recipes remain available for comparison. 
The current co-location layout is split into offline and online entrypoints: -Profiling commands can be validated without launching GPU kernels by using `--dry-run`: +```bash +bash examples/architecture/co-location/run_all.sh +bash examples/architecture/co-location/offline/dense_model_basic.sh +bash examples/architecture/co-location/offline/moe_model_basic.sh +bash examples/architecture/co-location/offline/thinking_mode_basic.sh +bash examples/architecture/co-location/offline/moe_spec_dec.sh +bash examples/architecture/co-location/offline/moe_prefix_caching.sh +bash examples/architecture/co-location/online/dense_model_basic_online.sh +bash examples/architecture/co-location/online/moe_model_basic_online.sh +bash examples/architecture/co-location/online/thinking_mode_basic_online.sh +bash examples/architecture/co-location/online/moe_spec_dec_online.sh +bash examples/architecture/co-location/online/moe_prefix_caching_online.sh +``` ```bash bash examples/profiling/profile_linear_op.sh --dry-run @@ -62,6 +70,21 @@ examples/ │ └── prefix_cache_shared_session_trace.csv ├── architecture/ │ ├── README.md +│ ├── pdd/ +│ │ ├── run_all.sh +│ │ ├── dense_model_basic.sh +│ │ ├── offline/ +│ │ │ ├── dense_model_basic.sh +│ │ │ ├── moe_model_basic.sh +│ │ │ ├── thinking_mode_basic.sh +│ │ │ ├── moe_spec_dec.sh +│ │ │ └── moe_prefix_caching.sh +│ │ └── online/ +│ │ ├── dense_model_basic_online.sh +│ │ ├── moe_model_basic_online.sh +│ │ ├── thinking_mode_basic_online.sh +│ │ ├── moe_spec_dec_online.sh +│ │ └── moe_prefix_caching_online.sh │ └── co-location/ │ ├── run_all.sh │ ├── offline/ @@ -88,9 +111,19 @@ examples/ ## Architecture Mode +### PDD / pd-disaggregation + +Separate prefill and decode clusters model prefill/decode disaggregation without splitting decode attention and decode FFN into separate public release clusters. + +- `--sys_arch pd-disaggregation` +- Uses `PREFILL` and unified `DECODE` clusters. 
+- Supports Dense, MoE, Thinking Mode, Speculative Decoding / MTP, and Prefix Caching examples in offline and online modes. +- Uses `--no-enable_parallel_clusters` because the pre-release-v0.2 public PDD path is the sequential simulator path; parallel cluster processing is still guarded. +- Keeps `pd-af-disaggregation` and global `--use_cuda_graph` outside the v0.2 examples release surface. + ### Co-location -Single monolithic cluster handles all prefill and decode work. +Single monolithic cluster handles all prefill and decode work. These examples are retained as baseline comparison recipes. - `--sys_arch co-location` - Supports dense and MoE model configs. @@ -99,13 +132,22 @@ Single monolithic cluster handles all prefill and decode work. ## Key Configuration Options +### PDD Cluster Layout + +- `--cluster_config_prefill_cluster_num_replicas`: Number of `PREFILL` cluster replicas. +- `--cluster_config_decode_cluster_num_replicas`: Number of unified `DECODE` cluster replicas. +- `--cluster_config_prefill_replica_config_*`: `PREFILL` replica parallelism and device fields. +- `--cluster_config_decode_replica_config_*`: `DECODE` replica parallelism and device fields. +- `--analytical_kv_cache_transfer_config_network_bandwidth_gbps`: Analytical KV transfer bandwidth. +- `--analytical_kv_cache_transfer_config_network_latency_ms`: Analytical KV transfer latency. + ### Parallelism -- `--replica_config_attn_tensor_parallel_size`: Attention tensor parallelism. -- `--replica_config_moe_tensor_parallel_size`: MoE tensor parallelism. -- `--replica_config_moe_expert_parallel_size`: Expert parallelism. -- `--replica_config_num_pipeline_stages`: Pipeline parallelism. -- `--cluster_config_num_replicas`: Number of monolithic cluster replicas. +- `--replica_config_attn_tensor_parallel_size`: Attention tensor parallelism for co-location examples. +- `--replica_config_moe_tensor_parallel_size`: MoE tensor parallelism for co-location examples. 
+- `--replica_config_moe_expert_parallel_size`: Expert parallelism for co-location examples. +- `--replica_config_num_pipeline_stages`: Pipeline parallelism for co-location examples. +- `--cluster_config_num_replicas`: Number of monolithic cluster replicas for co-location examples. ### Request Generation @@ -129,11 +171,11 @@ Single monolithic cluster handles all prefill and decode work. ## Running Examples -The checked-in co-location simulation examples use dummy mode (`--random_forrest_execution_time_predictor_config_enable_dummy_mode`) for quick testing without profiling data. Dummy mode skips ML predictor training and profiling metadata loading, so missing profiling CSVs do not affect smoke-test correctness. +The checked-in PDD examples use dummy mode (`--random_forrest_execution_time_predictor_config_enable_dummy_mode`), analytical communication cost modeling, and `--no-enable_parallel_clusters` for quick testing without profiling data. The expected minimal dense smoke behavior is one completed request, one KV cache transfer, and no release-guard crash. Metrics are written under `outputs/examples/pdd` by default. -These examples validate CLI/runtime plumbing and metrics artifact generation, not profiling fidelity. Use non-dummy profiling data before drawing hardware accuracy conclusions. +PDD Thinking Mode can produce multiple prefill-to-decode handoffs for one user request. The default small smoke configuration completes one request and records two KV transfers. -Offline cases write under `outputs/examples/co-location/offline//offline_batch//` by default. Online cases write under `outputs/examples/co-location/online//online_serving//` by default. The mode-specific `offline_batch` / `online_serving` path segment is added by Frontier's metrics taxonomy. +Co-location examples also use dummy mode for quick testing without profiling data. These examples validate CLI/runtime plumbing and metrics artifact generation, not profiling fidelity. 
Use non-dummy profiling data before drawing hardware accuracy conclusions. Baseline co-location scripts default to `decode_cuda_graph_mode=full_decode_only` and Chunked Prefill. The Speculative Decoding / MTP recipes use `decode_cuda_graph_mode=none` because speculative decoding currently conflicts with decode CUDA Graph modeling. The Prefix Caching recipes replay `examples/fixtures/prefix_cache_shared_session_trace.csv` to exercise cache-hit behavior. @@ -154,7 +196,7 @@ When comparing offline and online pairs, validate the following for each scenari ## Thinking Mode Example -The Thinking Mode scripts use: +The PDD and co-location Thinking Mode scripts use: - `--enable_thinking_mode` - `--thinking_depth 2` diff --git a/examples/architecture/README.md b/examples/architecture/README.md index 3ee7ff8..b780fe3 100644 --- a/examples/architecture/README.md +++ b/examples/architecture/README.md @@ -1,10 +1,18 @@ # Architecture Examples -This directory contains one-click architecture entrypoints for Frontier's release-supported runtime layout. +## Modification History + +| Date | Summary of Changes | +|------------|--------------------| +| 2026-06-14 | Added PDD pd-disaggregation script list, configuration contract, and validation criteria for local PR preparation. | + +This directory contains one-click architecture entrypoints for Frontier's release-supported runtime layouts. ## Release Scope -`pre-release-v0.1` supports only `co-location`. Disaggregated architecture examples are intentionally absent from this branch because the runtime guard rejects `pd-disaggregation` and `pd-af-disaggregation`. +`pre-release-v0.2` foregrounds **PDD / `pd-disaggregation`** examples. Prefill runs in the `PREFILL` cluster, decode runs in the unified `DECODE` cluster, and KV cache is transferred between them. The public PDD example path uses the sequential simulator mode through `--no-enable_parallel_clusters`. 
+ +`co-location` examples remain available as baseline comparison recipes and v0.1-compatible architecture references. `pd-af-disaggregation` and split `DECODE_ATTN` / `DECODE_FFN` public examples remain outside this release scope. ## Scripts @@ -21,6 +29,38 @@ This directory contains one-click architecture entrypoints for Frontier's releas | `co-location/online/thinking_mode_basic_online.sh` | Online Thinking Mode v1 co-location | Mirrors Thinking Mode offline settings with `--simulation_mode online` | | `co-location/online/moe_spec_dec_online.sh` | Online MoE Speculative Decoding / MTP | Mirrors Speculative Decoding offline settings with `--simulation_mode online` | | `co-location/online/moe_prefix_caching_online.sh` | Online MoE Prefix Caching | Replays the same prefix-cache fixture with `--simulation_mode online` | +| `pdd/run_all.sh` | Full PDD suite | Runs all five offline PDD cases and all five online PDD cases; pass extra Frontier CLI flags after `--` | +| `pdd/offline/dense_model_basic.sh` | Offline dense PDD baseline | Sequential `pd-disaggregation`, analytical backend, dummy execution time, Chunked Prefill, CSV/JSON metrics | +| `pdd/offline/moe_model_basic.sh` | Offline MoE PDD baseline | Sequential `pd-disaggregation`, reference-runnable shared-domain MoE topology, Chunked Prefill, CSV/JSON metrics | +| `pdd/offline/thinking_mode_basic.sh` | Offline Thinking Mode v1 PDD | Thinking Mode with two KV transfer handoffs for the one-request smoke configuration | +| `pdd/offline/moe_spec_dec.sh` | Offline MoE PDD Speculative Decoding / MTP | Speculative Decoding enabled; Prefix Caching intentionally disabled; `DECODE_CUDA_GRAPH_MODE=none` | +| `pdd/offline/moe_prefix_caching.sh` | Offline MoE PDD Prefix Caching | Sticky scheduler with `examples/fixtures/prefix_cache_shared_session_trace.csv` | +| `pdd/online/dense_model_basic_online.sh` | Online dense PDD baseline | Mirrors dense offline settings with `--simulation_mode online` | +| 
`pdd/online/moe_model_basic_online.sh` | Online MoE PDD baseline | Mirrors MoE offline settings with `--simulation_mode online` | +| `pdd/online/thinking_mode_basic_online.sh` | Online Thinking Mode v1 PDD | Mirrors Thinking Mode offline settings with `--simulation_mode online` | +| `pdd/online/moe_spec_dec_online.sh` | Online MoE PDD Speculative Decoding / MTP | Mirrors Speculative Decoding offline settings with `--simulation_mode online` | +| `pdd/online/moe_prefix_caching_online.sh` | Online MoE PDD Prefix Caching | Replays the same prefix-cache fixture with `--simulation_mode online` | + +## PDD Configuration Contract + +All PDD scripts use these release-supported defaults unless overridden from the shell: + +- `--sys_arch pd-disaggregation` +- `--no-enable_parallel_clusters` +- explicit `PREFILL` and unified `DECODE` cluster settings +- `--cc_backend_config_type analytical` +- dummy execution-time prediction enabled by default +- CSV/JSON metrics enabled by default through `--metrics_config_write_metrics` and `--metrics_config_store_request_metrics` +- plots, Chrome trace, and JSON event trace disabled for lightweight one-click artifacts + +MoE PDD scripts also enforce the shared-domain invariant before launching Frontier: + +```text +PREFILL_ATTN_TP * PREFILL_ATTN_DP == PREFILL_MOE_TP * PREFILL_MOE_EP +DECODE_ATTN_TP * DECODE_ATTN_DP == DECODE_MOE_TP * DECODE_MOE_EP +``` + +This fail-fast check prevents known non-runnable MoE topology combinations from entering the simulator. ## Thinking Mode v1 @@ -34,10 +74,29 @@ The Thinking Mode examples use: - `--cc_backend_config_type analytical` so the one-click smoke run works on a minimal single-replica layout - CSV/JSON metrics enabled by default, with plots, Chrome trace, and JSON event trace disabled for lightweight artifacts +Under PDD, one user request can produce multiple prefill-to-decode KV handoffs. The default Thinking Mode smoke case completes one request and records two KV transfers. 
+ ## Recommended Start Order ```bash -# Full suite. +# Full PDD suite for pre-release-v0.2. +bash examples/architecture/pdd/run_all.sh + +# PDD offline cases. +bash examples/architecture/pdd/offline/dense_model_basic.sh +bash examples/architecture/pdd/offline/moe_model_basic.sh +bash examples/architecture/pdd/offline/thinking_mode_basic.sh +bash examples/architecture/pdd/offline/moe_spec_dec.sh +bash examples/architecture/pdd/offline/moe_prefix_caching.sh + +# PDD online cases. +bash examples/architecture/pdd/online/dense_model_basic_online.sh +bash examples/architecture/pdd/online/moe_model_basic_online.sh +bash examples/architecture/pdd/online/thinking_mode_basic_online.sh +bash examples/architecture/pdd/online/moe_spec_dec_online.sh +bash examples/architecture/pdd/online/moe_prefix_caching_online.sh + +# Full co-location comparison suite. bash examples/architecture/co-location/run_all.sh # Offline cases. @@ -55,7 +114,7 @@ bash examples/architecture/co-location/online/moe_spec_dec_online.sh bash examples/architecture/co-location/online/moe_prefix_caching_online.sh ``` -Use the baseline scripts first, then use the Speculative Decoding / MTP and Prefix Caching recipes as advanced cases. +Use the dense baseline scripts first, then use the Thinking Mode, Speculative Decoding / MTP, and Prefix Caching recipes as advanced cases. ## Cross-validation Criteria @@ -66,3 +125,11 @@ For each offline/online pair: 3. Record expected request count, actual request rows, completed request rows, total input tokens, total output tokens, mean TTFT, mean latency, and request throughput when present. 4. Confirm offline outputs include the `offline_batch` taxonomy segment and online outputs include `online_serving`. 5. Treat latency differences as expected when online mode preserves request arrival times; investigate only if counts, token totals, output files, or finite numeric metrics diverge unexpectedly. + +For every PDD script, the release gate should additionally record: + +1. 
The script exits with code `0`. +2. `request_metrics.csv` and `system_metrics.json` exist in the metrics output directory. +3. Request row count, `total_requests`, and `completed_requests` match the expected case size. +4. KV transfer count, total KV bytes, and KV transfer time are present and positive. +5. Request-level `ttft`, `tpot`, `request_e2e_time`, and `transfer_kv_cache` are finite and positive. diff --git a/examples/architecture/pdd/dense_model_basic.sh b/examples/architecture/pdd/dense_model_basic.sh new file mode 100755 index 0000000..3ac4e64 --- /dev/null +++ b/examples/architecture/pdd/dense_model_basic.sh @@ -0,0 +1,16 @@ +#!/bin/bash +# ============================================================================= +# Compatibility entrypoint for the PDD dense offline example +# ============================================================================= +# The canonical pre-release-v0.2 PDD example layout is: +# examples/architecture/pdd/offline/dense_model_basic.sh +# examples/architecture/pdd/online/dense_model_basic_online.sh +# +# Keep this top-level script as a backward-compatible alias for users who ran +# the early PDD dense smoke entrypoint before the offline/online split. 
+# ============================================================================= + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +exec bash "$SCRIPT_DIR/offline/dense_model_basic.sh" "$@" diff --git a/examples/architecture/pdd/offline/dense_model_basic.sh b/examples/architecture/pdd/offline/dense_model_basic.sh new file mode 100755 index 0000000..0492e7c --- /dev/null +++ b/examples/architecture/pdd/offline/dense_model_basic.sh @@ -0,0 +1,217 @@ +#!/bin/bash +# ============================================================================= +# PDD / pd-disaggregation Offline Mode - Dense Model Example +# ============================================================================= +# This script mirrors the co-location example surface while using the +# pre-release-v0.2 PDD / pd-disaggregation architecture: prefill runs in the PREFILL cluster, +# decode runs in the DECODE cluster, and KV cache is transferred between them. +# +# This script demonstrates the release-supported sequential pd-disaggregation path. +# Decode CUDA Graph modeling and Chunked Prefill can be toggled with +# DECODE_CUDA_GRAPH_MODE and ENABLE_CHUNKED_PREFILL. +# Override any uppercase variable from the shell, and append extra Frontier CLI +# flags after "--" if you need to customize the run. +# ============================================================================= + +set -euo pipefail + +REPO_ROOT="$(cd "$(dirname "$0")/../../../.." 
&& pwd)" +export PYTHONPATH="$REPO_ROOT${PYTHONPATH:+:$PYTHONPATH}" +export WANDB_DISABLED=true +export VIDUR_DISABLE_WANDB=1 +PYTHON_BIN="${PYTHON_BIN:-python3}" + +MODEL_NAME="${MODEL_NAME:-meta-llama/Llama-2-7b-hf}" +SYS_ARCH="${SYS_ARCH:-pd-disaggregation}" +PREFILL_REPLICAS="${PREFILL_REPLICAS:-1}" +DECODE_REPLICAS="${DECODE_REPLICAS:-1}" +PREFILL_ATTN_TP="${PREFILL_ATTN_TP:-1}" +PREFILL_ATTN_DP="${PREFILL_ATTN_DP:-1}" +PREFILL_MOE_TP="${PREFILL_MOE_TP:-1}" +PREFILL_MOE_EP="${PREFILL_MOE_EP:-1}" +PREFILL_PP="${PREFILL_PP:-1}" +PREFILL_DEVICE="${PREFILL_DEVICE:-a800}" +PREFILL_MEMORY_MARGIN_FRACTION="${PREFILL_MEMORY_MARGIN_FRACTION:-0.2}" +DECODE_ATTN_TP="${DECODE_ATTN_TP:-1}" +DECODE_ATTN_DP="${DECODE_ATTN_DP:-1}" +DECODE_MOE_TP="${DECODE_MOE_TP:-1}" +DECODE_MOE_EP="${DECODE_MOE_EP:-1}" +DECODE_PP="${DECODE_PP:-1}" +DECODE_DEVICE="${DECODE_DEVICE:-a800}" +DECODE_MEMORY_MARGIN_FRACTION="${DECODE_MEMORY_MARGIN_FRACTION:-0.2}" +TOTAL_EXPERTS="${TOTAL_EXPERTS:-1}" +ROUTER_TOPK="${ROUTER_TOPK:-1}" +MOE_ROUTING_MODE="${MOE_ROUTING_MODE:-simulation}" +MOE_ROUTING_SEED="${MOE_ROUTING_SEED:-42}" +REPLICA_SCHEDULER="${REPLICA_SCHEDULER:-vllm_v1}" +NUM_REQUESTS="${NUM_REQUESTS:-8}" +PREFILL_TOKENS="${PREFILL_TOKENS:-512}" +DECODE_TOKENS="${DECODE_TOKENS:-64}" +QPS="${QPS:-1.0}" +ENABLE_DUMMY_MODE="${ENABLE_DUMMY_MODE:-true}" +DUMMY_EXEC_TIME_MS="${DUMMY_EXEC_TIME_MS:-1.0}" +DECODE_CUDA_GRAPH_MODE="${DECODE_CUDA_GRAPH_MODE:-none}" +ENABLE_CHUNKED_PREFILL="${ENABLE_CHUNKED_PREFILL:-true}" +MAX_TOKENS_IN_BATCH="${MAX_TOKENS_IN_BATCH:-1024}" +LONG_PREFILL_TOKEN_THRESHOLD="${LONG_PREFILL_TOKEN_THRESHOLD:-64}" +KV_TRANSFER_BANDWIDTH_GBPS="${KV_TRANSFER_BANDWIDTH_GBPS:-200.0}" +KV_TRANSFER_LATENCY_MS="${KV_TRANSFER_LATENCY_MS:-0.5}" +METRICS_OUTPUT_DIR="${METRICS_OUTPUT_DIR:-$REPO_ROOT/outputs/examples/pdd/offline}" +RUN_ID="${RUN_ID:-dense_model_basic}" + +require_bool() { + local name="$1" + local value="$2" + if [ "$value" != "true" ] && [ "$value" != "false" ]; then + echo 
"ERROR: $name must be true or false; got $value" >&2 + exit 2 + fi +} + +require_non_negative_integer() { + local name="$1" + local value="$2" + if [[ ! "$value" =~ ^[0-9]+$ ]]; then + echo "ERROR: $name must be a non-negative integer; got $value" >&2 + exit 2 + fi +} + +require_positive_integer() { + local name="$1" + local value="$2" + [[ "$value" =~ ^[1-9][0-9]*$ ]] || { echo "ERROR: $name must be a positive integer; got $value" >&2; exit 2; } +} + +require_bool "ENABLE_DUMMY_MODE" "$ENABLE_DUMMY_MODE" +require_bool "ENABLE_CHUNKED_PREFILL" "$ENABLE_CHUNKED_PREFILL" + +if [ "$SYS_ARCH" != "pd-disaggregation" ]; then + echo "ERROR: this example only supports SYS_ARCH=pd-disaggregation; got SYS_ARCH=$SYS_ARCH" >&2 + exit 2 +fi + +if [ "$DECODE_CUDA_GRAPH_MODE" = "none" ]; then + echo "INFO: Decode CUDA Graph modeling is disabled by DECODE_CUDA_GRAPH_MODE=none." +elif [ "$DECODE_CUDA_GRAPH_MODE" != "full_decode_only" ] && [ "$DECODE_CUDA_GRAPH_MODE" != "piecewise" ]; then + echo "ERROR: DECODE_CUDA_GRAPH_MODE must be none, full_decode_only, or piecewise; got $DECODE_CUDA_GRAPH_MODE" >&2 + exit 2 +fi + +if [ "$ENABLE_CHUNKED_PREFILL" = "false" ] && [ "$LONG_PREFILL_TOKEN_THRESHOLD" != "0" ]; then + echo "ERROR: LONG_PREFILL_TOKEN_THRESHOLD must be 0 when ENABLE_CHUNKED_PREFILL=false" >&2 + exit 2 +fi + +if ! 
command -v "$PYTHON_BIN" >/dev/null 2>&1; then + echo "ERROR: PYTHON_BIN is not executable or not on PATH: $PYTHON_BIN" >&2 + exit 2 +fi + +CMD=( + "$PYTHON_BIN" -m frontier.main + --simulation_mode offline + --sys_arch "$SYS_ARCH" + --no-enable_parallel_clusters + --cluster_config_prefill_cluster_num_replicas "$PREFILL_REPLICAS" + --cluster_config_decode_cluster_num_replicas "$DECODE_REPLICAS" + --cluster_config_prefill_replica_config_num_pipeline_stages "$PREFILL_PP" + --cluster_config_prefill_replica_config_attn_tensor_parallel_size "$PREFILL_ATTN_TP" + --cluster_config_prefill_replica_config_attn_data_parallel_size "$PREFILL_ATTN_DP" + --cluster_config_prefill_replica_config_moe_tensor_parallel_size "$PREFILL_MOE_TP" + --cluster_config_prefill_replica_config_moe_expert_parallel_size "$PREFILL_MOE_EP" + --cluster_config_prefill_replica_config_total_expert_num "$TOTAL_EXPERTS" + --cluster_config_prefill_replica_config_router_topk "$ROUTER_TOPK" + --cluster_config_prefill_replica_config_device "$PREFILL_DEVICE" + --cluster_config_prefill_replica_config_memory_margin_fraction "$PREFILL_MEMORY_MARGIN_FRACTION" + --cluster_config_decode_replica_config_num_pipeline_stages "$DECODE_PP" + --cluster_config_decode_replica_config_attn_tensor_parallel_size "$DECODE_ATTN_TP" + --cluster_config_decode_replica_config_attn_data_parallel_size "$DECODE_ATTN_DP" + --cluster_config_decode_replica_config_moe_tensor_parallel_size "$DECODE_MOE_TP" + --cluster_config_decode_replica_config_moe_expert_parallel_size "$DECODE_MOE_EP" + --cluster_config_decode_replica_config_total_expert_num "$TOTAL_EXPERTS" + --cluster_config_decode_replica_config_router_topk "$ROUTER_TOPK" + --cluster_config_decode_replica_config_device "$DECODE_DEVICE" + --cluster_config_decode_replica_config_memory_margin_fraction "$DECODE_MEMORY_MARGIN_FRACTION" + --cc_backend_config_type analytical + --replica_config_model_name "$MODEL_NAME" + --replica_config_moe_routing_mode "$MOE_ROUTING_MODE" + 
--replica_config_moe_routing_seed "$MOE_ROUTING_SEED" + --replica_scheduler_config_type "$REPLICA_SCHEDULER" + --decode_cuda_graph_mode "$DECODE_CUDA_GRAPH_MODE" + --vllm_v1_scheduler_config_max_tokens_in_batch "$MAX_TOKENS_IN_BATCH" + --vllm_v1_scheduler_config_long_prefill_token_threshold "$LONG_PREFILL_TOKEN_THRESHOLD" + --vllm_v1_scheduler_config_block_size "${BLOCK_SIZE:-16}" + --vllm_v1_scheduler_config_num_blocks "${NUM_BLOCKS:-128}" + --request_generator_config_type synthetic + --synthetic_request_generator_config_num_requests "$NUM_REQUESTS" + --length_generator_config_type fixed + --fixed_request_length_generator_config_prefill_tokens "$PREFILL_TOKENS" + --fixed_request_length_generator_config_decode_tokens "$DECODE_TOKENS" + --interval_generator_config_type poisson + --poisson_request_interval_generator_config_qps "$QPS" + --analytical_kv_cache_transfer_config_network_bandwidth_gbps "$KV_TRANSFER_BANDWIDTH_GBPS" + --analytical_kv_cache_transfer_config_network_latency_ms "$KV_TRANSFER_LATENCY_MS" + --metrics_config_output_dir "$METRICS_OUTPUT_DIR" + --metrics_config_run_id "$RUN_ID" + --metrics_config_write_metrics + --metrics_config_store_request_metrics + --metrics_config_store_batch_metrics + --metrics_config_store_token_completion_metrics + --metrics_config_store_utilization_metrics + --no-metrics_config_store_plots + --no-metrics_config_enable_chrome_trace + --no-metrics_config_write_json_trace + +) + +if [ "$ENABLE_CHUNKED_PREFILL" = "true" ]; then + CMD+=(--vllm_v1_scheduler_config_enable_chunked_prefill) +else + CMD+=(--no-vllm_v1_scheduler_config_enable_chunked_prefill) +fi + +if [ "$ENABLE_DUMMY_MODE" = "true" ]; then + CMD+=( + --random_forrest_execution_time_predictor_config_enable_dummy_mode + --random_forrest_execution_time_predictor_config_dummy_execution_time_ms "$DUMMY_EXEC_TIME_MS" + ) +fi + +if [ "$#" -gt 0 ]; then + if [ "$1" = "--" ]; then + shift + fi + CMD+=("$@") +fi + +cat <&2 + exit "$exit_code" +fi diff --git 
a/examples/architecture/pdd/offline/moe_model_basic.sh b/examples/architecture/pdd/offline/moe_model_basic.sh new file mode 100755 index 0000000..e4e9dce --- /dev/null +++ b/examples/architecture/pdd/offline/moe_model_basic.sh @@ -0,0 +1,229 @@ +#!/bin/bash +# ============================================================================= +# PDD / pd-disaggregation Offline Mode - MoE Model Example +# ============================================================================= +# This script mirrors the co-location example surface while using the +# pre-release-v0.2 PDD / pd-disaggregation architecture: prefill runs in the PREFILL cluster, +# decode runs in the DECODE cluster, and KV cache is transferred between them. +# +# This script demonstrates the release-supported sequential pd-disaggregation path. +# Decode CUDA Graph modeling and Chunked Prefill can be toggled with +# DECODE_CUDA_GRAPH_MODE and ENABLE_CHUNKED_PREFILL. +# Override any uppercase variable from the shell, and append extra Frontier CLI +# flags after "--" if you need to customize the run. +# ============================================================================= + +set -euo pipefail + +REPO_ROOT="$(cd "$(dirname "$0")/../../../.." 
&& pwd)" +export PYTHONPATH="$REPO_ROOT${PYTHONPATH:+:$PYTHONPATH}" +export WANDB_DISABLED=true +export VIDUR_DISABLE_WANDB=1 +PYTHON_BIN="${PYTHON_BIN:-python3}" + +MODEL_NAME="${MODEL_NAME:-Phi-tiny-MoE-instruct}" +SYS_ARCH="${SYS_ARCH:-pd-disaggregation}" +PREFILL_REPLICAS="${PREFILL_REPLICAS:-1}" +DECODE_REPLICAS="${DECODE_REPLICAS:-1}" +PREFILL_ATTN_TP="${PREFILL_ATTN_TP:-2}" +PREFILL_ATTN_DP="${PREFILL_ATTN_DP:-1}" +PREFILL_MOE_TP="${PREFILL_MOE_TP:-1}" +PREFILL_MOE_EP="${PREFILL_MOE_EP:-2}" +PREFILL_PP="${PREFILL_PP:-1}" +PREFILL_DEVICE="${PREFILL_DEVICE:-a800}" +PREFILL_MEMORY_MARGIN_FRACTION="${PREFILL_MEMORY_MARGIN_FRACTION:-0.2}" +DECODE_ATTN_TP="${DECODE_ATTN_TP:-2}" +DECODE_ATTN_DP="${DECODE_ATTN_DP:-1}" +DECODE_MOE_TP="${DECODE_MOE_TP:-1}" +DECODE_MOE_EP="${DECODE_MOE_EP:-2}" +DECODE_PP="${DECODE_PP:-1}" +DECODE_DEVICE="${DECODE_DEVICE:-a800}" +DECODE_MEMORY_MARGIN_FRACTION="${DECODE_MEMORY_MARGIN_FRACTION:-0.2}" +TOTAL_EXPERTS="${TOTAL_EXPERTS:-8}" +ROUTER_TOPK="${ROUTER_TOPK:-2}" +MOE_ROUTING_MODE="${MOE_ROUTING_MODE:-simulation}" +MOE_ROUTING_SEED="${MOE_ROUTING_SEED:-42}" +REPLICA_SCHEDULER="${REPLICA_SCHEDULER:-vllm_v1}" +NUM_REQUESTS="${NUM_REQUESTS:-8}" +PREFILL_TOKENS="${PREFILL_TOKENS:-256}" +DECODE_TOKENS="${DECODE_TOKENS:-32}" +QPS="${QPS:-1.0}" +ENABLE_DUMMY_MODE="${ENABLE_DUMMY_MODE:-true}" +DUMMY_EXEC_TIME_MS="${DUMMY_EXEC_TIME_MS:-1.0}" +DECODE_CUDA_GRAPH_MODE="${DECODE_CUDA_GRAPH_MODE:-none}" +ENABLE_CHUNKED_PREFILL="${ENABLE_CHUNKED_PREFILL:-true}" +MAX_TOKENS_IN_BATCH="${MAX_TOKENS_IN_BATCH:-1024}" +LONG_PREFILL_TOKEN_THRESHOLD="${LONG_PREFILL_TOKEN_THRESHOLD:-64}" +KV_TRANSFER_BANDWIDTH_GBPS="${KV_TRANSFER_BANDWIDTH_GBPS:-200.0}" +KV_TRANSFER_LATENCY_MS="${KV_TRANSFER_LATENCY_MS:-0.5}" +METRICS_OUTPUT_DIR="${METRICS_OUTPUT_DIR:-$REPO_ROOT/outputs/examples/pdd/offline}" +RUN_ID="${RUN_ID:-moe_model_basic}" + +require_bool() { + local name="$1" + local value="$2" + if [ "$value" != "true" ] && [ "$value" != "false" ]; then + echo 
"ERROR: $name must be true or false; got $value" >&2 + exit 2 + fi +} + +require_non_negative_integer() { + local name="$1" + local value="$2" + if [[ ! "$value" =~ ^[0-9]+$ ]]; then + echo "ERROR: $name must be a non-negative integer; got $value" >&2 + exit 2 + fi +} + +require_positive_integer() { + local name="$1" + local value="$2" + [[ "$value" =~ ^[1-9][0-9]*$ ]] || { echo "ERROR: $name must be a positive integer; got $value" >&2; exit 2; } +} + +require_bool "ENABLE_DUMMY_MODE" "$ENABLE_DUMMY_MODE" +require_bool "ENABLE_CHUNKED_PREFILL" "$ENABLE_CHUNKED_PREFILL" + +if [ "$SYS_ARCH" != "pd-disaggregation" ]; then + echo "ERROR: this example only supports SYS_ARCH=pd-disaggregation; got SYS_ARCH=$SYS_ARCH" >&2 + exit 2 +fi + +if [ "$DECODE_CUDA_GRAPH_MODE" = "none" ]; then + echo "INFO: Decode CUDA Graph modeling is disabled by DECODE_CUDA_GRAPH_MODE=none." +elif [ "$DECODE_CUDA_GRAPH_MODE" != "full_decode_only" ] && [ "$DECODE_CUDA_GRAPH_MODE" != "piecewise" ]; then + echo "ERROR: DECODE_CUDA_GRAPH_MODE must be none, full_decode_only, or piecewise; got $DECODE_CUDA_GRAPH_MODE" >&2 + exit 2 +fi + +if [ "$ENABLE_CHUNKED_PREFILL" = "false" ] && [ "$LONG_PREFILL_TOKEN_THRESHOLD" != "0" ]; then + echo "ERROR: LONG_PREFILL_TOKEN_THRESHOLD must be 0 when ENABLE_CHUNKED_PREFILL=false" >&2 + exit 2 +fi + +if (( PREFILL_ATTN_TP * PREFILL_ATTN_DP != PREFILL_MOE_TP * PREFILL_MOE_EP )); then + echo "ERROR: shared-domain prefill MoE requires PREFILL_ATTN_TP * PREFILL_ATTN_DP == PREFILL_MOE_TP * PREFILL_MOE_EP" >&2 + echo " got PREFILL_ATTN_TP=$PREFILL_ATTN_TP, PREFILL_ATTN_DP=$PREFILL_ATTN_DP, PREFILL_MOE_TP=$PREFILL_MOE_TP, PREFILL_MOE_EP=$PREFILL_MOE_EP" >&2 + exit 2 +fi + +if (( DECODE_ATTN_TP * DECODE_ATTN_DP != DECODE_MOE_TP * DECODE_MOE_EP )); then + echo "ERROR: shared-domain decode MoE requires DECODE_ATTN_TP * DECODE_ATTN_DP == DECODE_MOE_TP * DECODE_MOE_EP" >&2 + echo " got DECODE_ATTN_TP=$DECODE_ATTN_TP, DECODE_ATTN_DP=$DECODE_ATTN_DP, DECODE_MOE_TP=$DECODE_MOE_TP, 
DECODE_MOE_EP=$DECODE_MOE_EP" >&2 + exit 2 +fi + +if ! 
command -v "$PYTHON_BIN" >/dev/null 2>&1; then + echo "ERROR: PYTHON_BIN is not executable or not on PATH: $PYTHON_BIN" >&2 + exit 2 +fi + +CMD=( + "$PYTHON_BIN" -m frontier.main + --simulation_mode offline + --sys_arch "$SYS_ARCH" + --no-enable_parallel_clusters + --cluster_config_prefill_cluster_num_replicas "$PREFILL_REPLICAS" + --cluster_config_decode_cluster_num_replicas "$DECODE_REPLICAS" + --cluster_config_prefill_replica_config_num_pipeline_stages "$PREFILL_PP" + --cluster_config_prefill_replica_config_attn_tensor_parallel_size "$PREFILL_ATTN_TP" + --cluster_config_prefill_replica_config_attn_data_parallel_size "$PREFILL_ATTN_DP" + --cluster_config_prefill_replica_config_moe_tensor_parallel_size "$PREFILL_MOE_TP" + --cluster_config_prefill_replica_config_moe_expert_parallel_size "$PREFILL_MOE_EP" + --cluster_config_prefill_replica_config_total_expert_num "$TOTAL_EXPERTS" + --cluster_config_prefill_replica_config_router_topk "$ROUTER_TOPK" + --cluster_config_prefill_replica_config_device "$PREFILL_DEVICE" + --cluster_config_prefill_replica_config_memory_margin_fraction "$PREFILL_MEMORY_MARGIN_FRACTION" + --cluster_config_decode_replica_config_num_pipeline_stages "$DECODE_PP" + --cluster_config_decode_replica_config_attn_tensor_parallel_size "$DECODE_ATTN_TP" + --cluster_config_decode_replica_config_attn_data_parallel_size "$DECODE_ATTN_DP" + --cluster_config_decode_replica_config_moe_tensor_parallel_size "$DECODE_MOE_TP" + --cluster_config_decode_replica_config_moe_expert_parallel_size "$DECODE_MOE_EP" + --cluster_config_decode_replica_config_total_expert_num "$TOTAL_EXPERTS" + --cluster_config_decode_replica_config_router_topk "$ROUTER_TOPK" + --cluster_config_decode_replica_config_device "$DECODE_DEVICE" + --cluster_config_decode_replica_config_memory_margin_fraction "$DECODE_MEMORY_MARGIN_FRACTION" + --cc_backend_config_type analytical + --replica_config_model_name "$MODEL_NAME" + --replica_config_moe_routing_mode "$MOE_ROUTING_MODE" + 
--replica_config_moe_routing_seed "$MOE_ROUTING_SEED" + --replica_scheduler_config_type "$REPLICA_SCHEDULER" + --decode_cuda_graph_mode "$DECODE_CUDA_GRAPH_MODE" + --vllm_v1_scheduler_config_max_tokens_in_batch "$MAX_TOKENS_IN_BATCH" + --vllm_v1_scheduler_config_long_prefill_token_threshold "$LONG_PREFILL_TOKEN_THRESHOLD" + --vllm_v1_scheduler_config_block_size "${BLOCK_SIZE:-16}" + --vllm_v1_scheduler_config_num_blocks "${NUM_BLOCKS:-128}" + --request_generator_config_type synthetic + --synthetic_request_generator_config_num_requests "$NUM_REQUESTS" + --length_generator_config_type fixed + --fixed_request_length_generator_config_prefill_tokens "$PREFILL_TOKENS" + --fixed_request_length_generator_config_decode_tokens "$DECODE_TOKENS" + --interval_generator_config_type poisson + --poisson_request_interval_generator_config_qps "$QPS" + --analytical_kv_cache_transfer_config_network_bandwidth_gbps "$KV_TRANSFER_BANDWIDTH_GBPS" + --analytical_kv_cache_transfer_config_network_latency_ms "$KV_TRANSFER_LATENCY_MS" + --metrics_config_output_dir "$METRICS_OUTPUT_DIR" + --metrics_config_run_id "$RUN_ID" + --metrics_config_write_metrics + --metrics_config_store_request_metrics + --metrics_config_store_batch_metrics + --metrics_config_store_token_completion_metrics + --metrics_config_store_utilization_metrics + --no-metrics_config_store_plots + --no-metrics_config_enable_chrome_trace + --no-metrics_config_write_json_trace + +) + +if [ "$ENABLE_CHUNKED_PREFILL" = "true" ]; then + CMD+=(--vllm_v1_scheduler_config_enable_chunked_prefill) +else + CMD+=(--no-vllm_v1_scheduler_config_enable_chunked_prefill) +fi + +if [ "$ENABLE_DUMMY_MODE" = "true" ]; then + CMD+=( + --random_forrest_execution_time_predictor_config_enable_dummy_mode + --random_forrest_execution_time_predictor_config_dummy_execution_time_ms "$DUMMY_EXEC_TIME_MS" + ) +fi + +if [ "$#" -gt 0 ]; then + if [ "$1" = "--" ]; then + shift + fi + CMD+=("$@") +fi + +cat <&2 + exit "$exit_code" +fi diff --git 
a/examples/architecture/pdd/offline/moe_prefix_caching.sh b/examples/architecture/pdd/offline/moe_prefix_caching.sh new file mode 100755 index 0000000..ba62994 --- /dev/null +++ b/examples/architecture/pdd/offline/moe_prefix_caching.sh @@ -0,0 +1,241 @@ +#!/bin/bash +# ============================================================================= +# PDD / pd-disaggregation Offline Mode - MoE Prefix Caching Recipe +# ============================================================================= +# This script mirrors the co-location example surface while using the +# pre-release-v0.2 PDD / pd-disaggregation architecture: prefill runs in the PREFILL cluster, +# decode runs in the DECODE cluster, and KV cache is transferred between them. +# +# This recipe enables vLLM V1 Prefix Caching with a public shared-session trace +# fixture. Prefix Caching and Speculative Decoding are intentionally separate; +# this script does not enable speculative decoding. +# Override any uppercase variable from the shell, and append extra Frontier CLI +# flags after "--" if you need to customize the run. +# ============================================================================= + +set -euo pipefail + +REPO_ROOT="$(cd "$(dirname "$0")/../../../.." 
&& pwd)" +export PYTHONPATH="$REPO_ROOT${PYTHONPATH:+:$PYTHONPATH}" +export WANDB_DISABLED=true +export VIDUR_DISABLE_WANDB=1 +PYTHON_BIN="${PYTHON_BIN:-python3}" + +MODEL_NAME="${MODEL_NAME:-Phi-tiny-MoE-instruct}" +SYS_ARCH="${SYS_ARCH:-pd-disaggregation}" +PREFILL_REPLICAS="${PREFILL_REPLICAS:-2}" +DECODE_REPLICAS="${DECODE_REPLICAS:-2}" +PREFILL_ATTN_TP="${PREFILL_ATTN_TP:-2}" +PREFILL_ATTN_DP="${PREFILL_ATTN_DP:-1}" +PREFILL_MOE_TP="${PREFILL_MOE_TP:-1}" +PREFILL_MOE_EP="${PREFILL_MOE_EP:-2}" +PREFILL_PP="${PREFILL_PP:-1}" +PREFILL_DEVICE="${PREFILL_DEVICE:-a800}" +PREFILL_MEMORY_MARGIN_FRACTION="${PREFILL_MEMORY_MARGIN_FRACTION:-0.2}" +DECODE_ATTN_TP="${DECODE_ATTN_TP:-2}" +DECODE_ATTN_DP="${DECODE_ATTN_DP:-1}" +DECODE_MOE_TP="${DECODE_MOE_TP:-1}" +DECODE_MOE_EP="${DECODE_MOE_EP:-2}" +DECODE_PP="${DECODE_PP:-1}" +DECODE_DEVICE="${DECODE_DEVICE:-a800}" +DECODE_MEMORY_MARGIN_FRACTION="${DECODE_MEMORY_MARGIN_FRACTION:-0.2}" +TOTAL_EXPERTS="${TOTAL_EXPERTS:-8}" +ROUTER_TOPK="${ROUTER_TOPK:-2}" +MOE_ROUTING_MODE="${MOE_ROUTING_MODE:-simulation}" +MOE_ROUTING_SEED="${MOE_ROUTING_SEED:-42}" +REPLICA_SCHEDULER="${REPLICA_SCHEDULER:-vllm_v1}" +NUM_REQUESTS="${NUM_REQUESTS:-2}" +PREFILL_TOKENS="${PREFILL_TOKENS:-32}" +DECODE_TOKENS="${DECODE_TOKENS:-8}" +QPS="${QPS:-1.0}" +ENABLE_DUMMY_MODE="${ENABLE_DUMMY_MODE:-true}" +DUMMY_EXEC_TIME_MS="${DUMMY_EXEC_TIME_MS:-1.0}" +DECODE_CUDA_GRAPH_MODE="${DECODE_CUDA_GRAPH_MODE:-none}" +ENABLE_CHUNKED_PREFILL="${ENABLE_CHUNKED_PREFILL:-true}" +MAX_TOKENS_IN_BATCH="${MAX_TOKENS_IN_BATCH:-1024}" +LONG_PREFILL_TOKEN_THRESHOLD="${LONG_PREFILL_TOKEN_THRESHOLD:-64}" +KV_TRANSFER_BANDWIDTH_GBPS="${KV_TRANSFER_BANDWIDTH_GBPS:-200.0}" +KV_TRANSFER_LATENCY_MS="${KV_TRANSFER_LATENCY_MS:-0.5}" +TRACE_FILE="${TRACE_FILE:-$REPO_ROOT/examples/fixtures/prefix_cache_shared_session_trace.csv}" +MAX_TOKENS="${MAX_TOKENS:-128}" +EXPECTED_TRACE_REQUESTS="${EXPECTED_TRACE_REQUESTS:-2}" +BLOCK_SIZE="${BLOCK_SIZE:-16}" +NUM_BLOCKS="${NUM_BLOCKS:-128}" 
+METRICS_OUTPUT_DIR="${METRICS_OUTPUT_DIR:-$REPO_ROOT/outputs/examples/pdd/offline}" +RUN_ID="${RUN_ID:-moe_prefix_caching}" + +require_bool() { + local name="$1" + local value="$2" + if [ "$value" != "true" ] && [ "$value" != "false" ]; then + echo "ERROR: $name must be true or false; got $value" >&2 + exit 2 + fi +} + +require_non_negative_integer() { + local name="$1" + local value="$2" + if [[ ! "$value" =~ ^[0-9]+$ ]]; then + echo "ERROR: $name must be a non-negative integer; got $value" >&2 + exit 2 + fi +} + +require_positive_integer() { + local name="$1" + local value="$2" + [[ "$value" =~ ^[1-9][0-9]*$ ]] +} + +require_bool "ENABLE_DUMMY_MODE" "$ENABLE_DUMMY_MODE" +require_bool "ENABLE_CHUNKED_PREFILL" "$ENABLE_CHUNKED_PREFILL" + +if [ "$SYS_ARCH" != "pd-disaggregation" ]; then + echo "ERROR: this example only supports SYS_ARCH=pd-disaggregation; got SYS_ARCH=$SYS_ARCH" >&2 + exit 2 +fi + +if [ "$DECODE_CUDA_GRAPH_MODE" = "none" ]; then + echo "INFO: Decode CUDA Graph modeling is disabled by DECODE_CUDA_GRAPH_MODE=none." 
+elif [ "$DECODE_CUDA_GRAPH_MODE" != "full_decode_only" ] && [ "$DECODE_CUDA_GRAPH_MODE" != "piecewise" ]; then + echo "ERROR: DECODE_CUDA_GRAPH_MODE must be none, full_decode_only, or piecewise; got $DECODE_CUDA_GRAPH_MODE" >&2 + exit 2 +fi + +if [ "$ENABLE_CHUNKED_PREFILL" = "false" ] && [ "$LONG_PREFILL_TOKEN_THRESHOLD" != "0" ]; then + echo "ERROR: LONG_PREFILL_TOKEN_THRESHOLD must be 0 when ENABLE_CHUNKED_PREFILL=false" >&2 + exit 2 +fi + +if (( PREFILL_ATTN_TP * PREFILL_ATTN_DP != PREFILL_MOE_TP * PREFILL_MOE_EP )); then + echo "ERROR: shared-domain prefill MoE requires PREFILL_ATTN_TP * PREFILL_ATTN_DP == PREFILL_MOE_TP * PREFILL_MOE_EP" >&2 + echo " got PREFILL_ATTN_TP=$PREFILL_ATTN_TP, PREFILL_ATTN_DP=$PREFILL_ATTN_DP, PREFILL_MOE_TP=$PREFILL_MOE_TP, PREFILL_MOE_EP=$PREFILL_MOE_EP" >&2 + exit 2 +fi + +if (( DECODE_ATTN_TP * DECODE_ATTN_DP != DECODE_MOE_TP * DECODE_MOE_EP )); then + echo "ERROR: shared-domain decode MoE requires DECODE_ATTN_TP * DECODE_ATTN_DP == DECODE_MOE_TP * DECODE_MOE_EP" >&2 + echo " got DECODE_ATTN_TP=$DECODE_ATTN_TP, DECODE_ATTN_DP=$DECODE_ATTN_DP, DECODE_MOE_TP=$DECODE_MOE_TP, DECODE_MOE_EP=$DECODE_MOE_EP" >&2 + exit 2 +fi + +if [ ! -f "$TRACE_FILE" ]; then + echo "ERROR: TRACE_FILE does not exist: $TRACE_FILE" >&2 + exit 2 +fi + +if ! 
command -v "$PYTHON_BIN" >/dev/null 2>&1; then + echo "ERROR: PYTHON_BIN is not executable or not on PATH: $PYTHON_BIN" >&2 + exit 2 +fi + +CMD=( + "$PYTHON_BIN" -m frontier.main + --simulation_mode offline + --sys_arch "$SYS_ARCH" + --no-enable_parallel_clusters + --cluster_config_prefill_cluster_num_replicas "$PREFILL_REPLICAS" + --cluster_config_decode_cluster_num_replicas "$DECODE_REPLICAS" + --cluster_scheduler_config_type sticky_round_robin + --cluster_config_prefill_replica_config_num_pipeline_stages "$PREFILL_PP" + --cluster_config_prefill_replica_config_attn_tensor_parallel_size "$PREFILL_ATTN_TP" + --cluster_config_prefill_replica_config_attn_data_parallel_size "$PREFILL_ATTN_DP" + --cluster_config_prefill_replica_config_moe_tensor_parallel_size "$PREFILL_MOE_TP" + --cluster_config_prefill_replica_config_moe_expert_parallel_size "$PREFILL_MOE_EP" + --cluster_config_prefill_replica_config_total_expert_num "$TOTAL_EXPERTS" + --cluster_config_prefill_replica_config_router_topk "$ROUTER_TOPK" + --cluster_config_prefill_replica_config_device "$PREFILL_DEVICE" + --cluster_config_prefill_replica_config_memory_margin_fraction "$PREFILL_MEMORY_MARGIN_FRACTION" + --cluster_config_decode_replica_config_num_pipeline_stages "$DECODE_PP" + --cluster_config_decode_replica_config_attn_tensor_parallel_size "$DECODE_ATTN_TP" + --cluster_config_decode_replica_config_attn_data_parallel_size "$DECODE_ATTN_DP" + --cluster_config_decode_replica_config_moe_tensor_parallel_size "$DECODE_MOE_TP" + --cluster_config_decode_replica_config_moe_expert_parallel_size "$DECODE_MOE_EP" + --cluster_config_decode_replica_config_total_expert_num "$TOTAL_EXPERTS" + --cluster_config_decode_replica_config_router_topk "$ROUTER_TOPK" + --cluster_config_decode_replica_config_device "$DECODE_DEVICE" + --cluster_config_decode_replica_config_memory_margin_fraction "$DECODE_MEMORY_MARGIN_FRACTION" + --cc_backend_config_type analytical + --replica_config_model_name "$MODEL_NAME" + 
--replica_config_moe_routing_mode "$MOE_ROUTING_MODE" + --replica_config_moe_routing_seed "$MOE_ROUTING_SEED" + --replica_scheduler_config_type "$REPLICA_SCHEDULER" + --decode_cuda_graph_mode "$DECODE_CUDA_GRAPH_MODE" + --vllm_v1_scheduler_config_max_tokens_in_batch "$MAX_TOKENS_IN_BATCH" + --vllm_v1_scheduler_config_long_prefill_token_threshold "$LONG_PREFILL_TOKEN_THRESHOLD" + --vllm_v1_scheduler_config_block_size "${BLOCK_SIZE:-16}" + --vllm_v1_scheduler_config_num_blocks "${NUM_BLOCKS:-128}" + --request_generator_config_type trace_replay + --trace_request_generator_config_trace_file "$TRACE_FILE" + --trace_request_generator_config_max_tokens "$MAX_TOKENS" + --analytical_kv_cache_transfer_config_network_bandwidth_gbps "$KV_TRANSFER_BANDWIDTH_GBPS" + --analytical_kv_cache_transfer_config_network_latency_ms "$KV_TRANSFER_LATENCY_MS" + --metrics_config_output_dir "$METRICS_OUTPUT_DIR" + --metrics_config_run_id "$RUN_ID" + --metrics_config_write_metrics + --metrics_config_store_request_metrics + --metrics_config_store_batch_metrics + --metrics_config_store_token_completion_metrics + --metrics_config_store_utilization_metrics + --no-metrics_config_store_plots + --no-metrics_config_enable_chrome_trace + --no-metrics_config_write_json_trace + +) + +if [ "$ENABLE_CHUNKED_PREFILL" = "true" ]; then + CMD+=(--vllm_v1_scheduler_config_enable_chunked_prefill) +else + CMD+=(--no-vllm_v1_scheduler_config_enable_chunked_prefill) +fi + +CMD+=(--vllm_v1_scheduler_config_enable_prefix_caching) + +if [ "$ENABLE_DUMMY_MODE" = "true" ]; then + CMD+=( + --random_forrest_execution_time_predictor_config_enable_dummy_mode + --random_forrest_execution_time_predictor_config_dummy_execution_time_ms "$DUMMY_EXEC_TIME_MS" + ) +fi + +if [ "$#" -gt 0 ]; then + if [ "$1" = "--" ]; then + shift + fi + CMD+=("$@") +fi + +cat <&2 + exit "$exit_code" +fi diff --git a/examples/architecture/pdd/offline/moe_spec_dec.sh b/examples/architecture/pdd/offline/moe_spec_dec.sh new file mode 100755 index 
0000000..72f9fe6 --- /dev/null +++ b/examples/architecture/pdd/offline/moe_spec_dec.sh @@ -0,0 +1,282 @@ +#!/bin/bash +# ============================================================================= +# PDD / pd-disaggregation Offline Mode - MoE Speculative Decoding / MTP Recipe +# ============================================================================= +# This script mirrors the co-location example surface while using the +# pre-release-v0.2 PDD / pd-disaggregation architecture: prefill runs in the PREFILL cluster, +# decode runs in the DECODE cluster, and KV cache is transferred between them. +# +# Speculative decoding and Prefix Caching have separate runtime contracts. +# This recipe enables speculative decoding and intentionally leaves Prefix +# Caching disabled. DECODE_CUDA_GRAPH_MODE defaults to "none" and any other value is +# rejected, because production speculative decoding requires eager decode scheduling. +# +# For MTP-style methods, set SPEC_METHOD to an MTP method and keep +# MTP_N_PREDICT / MTP_NUM_LAYERS positive. +# Override any uppercase variable from the shell, and append extra Frontier CLI +# flags after "--" if you need to customize the run. +# ============================================================================= + +set -euo pipefail + +REPO_ROOT="$(cd "$(dirname "$0")/../../../.." 
&& pwd)" +export PYTHONPATH="$REPO_ROOT${PYTHONPATH:+:$PYTHONPATH}" +export WANDB_DISABLED=true +export VIDUR_DISABLE_WANDB=1 +PYTHON_BIN="${PYTHON_BIN:-python3}" + +MODEL_NAME="${MODEL_NAME:-Phi-tiny-MoE-instruct}" +SYS_ARCH="${SYS_ARCH:-pd-disaggregation}" +PREFILL_REPLICAS="${PREFILL_REPLICAS:-1}" +DECODE_REPLICAS="${DECODE_REPLICAS:-1}" +PREFILL_ATTN_TP="${PREFILL_ATTN_TP:-2}" +PREFILL_ATTN_DP="${PREFILL_ATTN_DP:-1}" +PREFILL_MOE_TP="${PREFILL_MOE_TP:-1}" +PREFILL_MOE_EP="${PREFILL_MOE_EP:-2}" +PREFILL_PP="${PREFILL_PP:-1}" +PREFILL_DEVICE="${PREFILL_DEVICE:-a800}" +PREFILL_MEMORY_MARGIN_FRACTION="${PREFILL_MEMORY_MARGIN_FRACTION:-0.2}" +DECODE_ATTN_TP="${DECODE_ATTN_TP:-2}" +DECODE_ATTN_DP="${DECODE_ATTN_DP:-1}" +DECODE_MOE_TP="${DECODE_MOE_TP:-1}" +DECODE_MOE_EP="${DECODE_MOE_EP:-2}" +DECODE_PP="${DECODE_PP:-1}" +DECODE_DEVICE="${DECODE_DEVICE:-a800}" +DECODE_MEMORY_MARGIN_FRACTION="${DECODE_MEMORY_MARGIN_FRACTION:-0.2}" +TOTAL_EXPERTS="${TOTAL_EXPERTS:-8}" +ROUTER_TOPK="${ROUTER_TOPK:-2}" +MOE_ROUTING_MODE="${MOE_ROUTING_MODE:-simulation}" +MOE_ROUTING_SEED="${MOE_ROUTING_SEED:-42}" +REPLICA_SCHEDULER="${REPLICA_SCHEDULER:-vllm_v1}" +NUM_REQUESTS="${NUM_REQUESTS:-8}" +PREFILL_TOKENS="${PREFILL_TOKENS:-256}" +DECODE_TOKENS="${DECODE_TOKENS:-32}" +QPS="${QPS:-1.0}" +ENABLE_DUMMY_MODE="${ENABLE_DUMMY_MODE:-true}" +DUMMY_EXEC_TIME_MS="${DUMMY_EXEC_TIME_MS:-1.0}" +DECODE_CUDA_GRAPH_MODE="${DECODE_CUDA_GRAPH_MODE:-none}" +ENABLE_CHUNKED_PREFILL="${ENABLE_CHUNKED_PREFILL:-true}" +MAX_TOKENS_IN_BATCH="${MAX_TOKENS_IN_BATCH:-1024}" +LONG_PREFILL_TOKEN_THRESHOLD="${LONG_PREFILL_TOKEN_THRESHOLD:-64}" +KV_TRANSFER_BANDWIDTH_GBPS="${KV_TRANSFER_BANDWIDTH_GBPS:-200.0}" +KV_TRANSFER_LATENCY_MS="${KV_TRANSFER_LATENCY_MS:-0.5}" +SPEC_METHOD="${SPEC_METHOD:-ngram}" +SPEC_MODEL_NAME="${SPEC_MODEL_NAME:-}" +NUM_SPECULATIVE_TOKENS="${NUM_SPECULATIVE_TOKENS:-2}" +COMMITTED_TOKENS_PER_ITERATION="${COMMITTED_TOKENS_PER_ITERATION:-2}" 
+if [ -z "${PROPOSER_OVERHEAD_MS_BY_METHOD:-}" ]; then PROPOSER_OVERHEAD_MS_BY_METHOD='{"ngram":0.0,"qwen3_next_mtp":0.0,"deepseek_mtp":0.0,"ernie_mtp":0.0}'; fi # A JSON default inside ${VAR:-...} ends at its first closing brace and appends a stray one to overrides. +MTP_N_PREDICT="${MTP_N_PREDICT:-0}" +MTP_NUM_LAYERS="${MTP_NUM_LAYERS:-0}" +METRICS_OUTPUT_DIR="${METRICS_OUTPUT_DIR:-$REPO_ROOT/outputs/examples/pdd/offline}" +RUN_ID="${RUN_ID:-moe_spec_dec}" + +require_bool() { + local name="$1" + local value="$2" + if [ "$value" != "true" ] && [ "$value" != "false" ]; then + echo "ERROR: $name must be true or false; got $value" >&2 + exit 2 + fi +} + +require_non_negative_integer() { + local name="$1" + local value="$2" + if [[ ! "$value" =~ ^[0-9]+$ ]]; then + echo "ERROR: $name must be a non-negative integer; got $value" >&2 + exit 2 + fi +} + +require_positive_integer() { + local name="$1" + local value="$2" + [[ "$value" =~ ^[1-9][0-9]*$ ]] +} + +require_bool "ENABLE_DUMMY_MODE" "$ENABLE_DUMMY_MODE" +require_bool "ENABLE_CHUNKED_PREFILL" "$ENABLE_CHUNKED_PREFILL" + +if [ "$SYS_ARCH" != "pd-disaggregation" ]; then + echo "ERROR: this example only supports SYS_ARCH=pd-disaggregation; got SYS_ARCH=$SYS_ARCH" >&2 + exit 2 +fi + +if [ "$DECODE_CUDA_GRAPH_MODE" = "none" ]; then + echo "INFO: Decode CUDA Graph modeling is disabled by DECODE_CUDA_GRAPH_MODE=none." 
+elif [ "$DECODE_CUDA_GRAPH_MODE" != "full_decode_only" ] && [ "$DECODE_CUDA_GRAPH_MODE" != "piecewise" ]; then + echo "ERROR: DECODE_CUDA_GRAPH_MODE must be none, full_decode_only, or piecewise; got $DECODE_CUDA_GRAPH_MODE" >&2 + exit 2 +fi + +if [ "$ENABLE_CHUNKED_PREFILL" = "false" ] && [ "$LONG_PREFILL_TOKEN_THRESHOLD" != "0" ]; then + echo "ERROR: LONG_PREFILL_TOKEN_THRESHOLD must be 0 when ENABLE_CHUNKED_PREFILL=false" >&2 + exit 2 +fi + +if (( PREFILL_ATTN_TP * PREFILL_ATTN_DP != PREFILL_MOE_TP * PREFILL_MOE_EP )); then + echo "ERROR: shared-domain prefill MoE requires PREFILL_ATTN_TP * PREFILL_ATTN_DP == PREFILL_MOE_TP * PREFILL_MOE_EP" >&2 + echo " got PREFILL_ATTN_TP=$PREFILL_ATTN_TP, PREFILL_ATTN_DP=$PREFILL_ATTN_DP, PREFILL_MOE_TP=$PREFILL_MOE_TP, PREFILL_MOE_EP=$PREFILL_MOE_EP" >&2 + exit 2 +fi + +if (( DECODE_ATTN_TP * DECODE_ATTN_DP != DECODE_MOE_TP * DECODE_MOE_EP )); then + echo "ERROR: shared-domain decode MoE requires DECODE_ATTN_TP * DECODE_ATTN_DP == DECODE_MOE_TP * DECODE_MOE_EP" >&2 + echo " got DECODE_ATTN_TP=$DECODE_ATTN_TP, DECODE_ATTN_DP=$DECODE_ATTN_DP, DECODE_MOE_TP=$DECODE_MOE_TP, DECODE_MOE_EP=$DECODE_MOE_EP" >&2 + exit 2 +fi + +require_non_negative_integer "MTP_N_PREDICT" "$MTP_N_PREDICT" +require_non_negative_integer "MTP_NUM_LAYERS" "$MTP_NUM_LAYERS" + +if [ "$DECODE_CUDA_GRAPH_MODE" != "none" ]; then + echo "ERROR: speculative decoding currently requires DECODE_CUDA_GRAPH_MODE=none in production recipes; got $DECODE_CUDA_GRAPH_MODE" >&2 + exit 2 +fi + +case "$SPEC_METHOD" in + deepseek_mtp|ernie_mtp|qwen3_moe_mtp|qwen3_next_mtp) + if ! require_positive_integer "MTP_N_PREDICT" "$MTP_N_PREDICT" || ! 
require_positive_integer "MTP_NUM_LAYERS" "$MTP_NUM_LAYERS"; then + echo "ERROR: SPEC_METHOD=$SPEC_METHOD requires MTP_N_PREDICT>0 and MTP_NUM_LAYERS>0" >&2 + exit 2 + fi + ;; + *) + if [ "$MTP_N_PREDICT" -ne 0 ] || [ "$MTP_NUM_LAYERS" -ne 0 ]; then + echo "ERROR: MTP_N_PREDICT/MTP_NUM_LAYERS are only valid for MTP SPEC_METHOD values" >&2 + exit 2 + fi + ;; +esac + +if ! command -v "$PYTHON_BIN" >/dev/null 2>&1; then + echo "ERROR: PYTHON_BIN is not executable or not on PATH: $PYTHON_BIN" >&2 + exit 2 +fi + +CMD=( + "$PYTHON_BIN" -m frontier.main + --simulation_mode offline + --sys_arch "$SYS_ARCH" + --no-enable_parallel_clusters + --cluster_config_prefill_cluster_num_replicas "$PREFILL_REPLICAS" + --cluster_config_decode_cluster_num_replicas "$DECODE_REPLICAS" + --cluster_config_prefill_replica_config_num_pipeline_stages "$PREFILL_PP" + --cluster_config_prefill_replica_config_attn_tensor_parallel_size "$PREFILL_ATTN_TP" + --cluster_config_prefill_replica_config_attn_data_parallel_size "$PREFILL_ATTN_DP" + --cluster_config_prefill_replica_config_moe_tensor_parallel_size "$PREFILL_MOE_TP" + --cluster_config_prefill_replica_config_moe_expert_parallel_size "$PREFILL_MOE_EP" + --cluster_config_prefill_replica_config_total_expert_num "$TOTAL_EXPERTS" + --cluster_config_prefill_replica_config_router_topk "$ROUTER_TOPK" + --cluster_config_prefill_replica_config_device "$PREFILL_DEVICE" + --cluster_config_prefill_replica_config_memory_margin_fraction "$PREFILL_MEMORY_MARGIN_FRACTION" + --cluster_config_decode_replica_config_num_pipeline_stages "$DECODE_PP" + --cluster_config_decode_replica_config_attn_tensor_parallel_size "$DECODE_ATTN_TP" + --cluster_config_decode_replica_config_attn_data_parallel_size "$DECODE_ATTN_DP" + --cluster_config_decode_replica_config_moe_tensor_parallel_size "$DECODE_MOE_TP" + --cluster_config_decode_replica_config_moe_expert_parallel_size "$DECODE_MOE_EP" + --cluster_config_decode_replica_config_total_expert_num "$TOTAL_EXPERTS" + 
--cluster_config_decode_replica_config_router_topk "$ROUTER_TOPK" + --cluster_config_decode_replica_config_device "$DECODE_DEVICE" + --cluster_config_decode_replica_config_memory_margin_fraction "$DECODE_MEMORY_MARGIN_FRACTION" + --cc_backend_config_type analytical + --replica_config_model_name "$MODEL_NAME" + --replica_config_moe_routing_mode "$MOE_ROUTING_MODE" + --replica_config_moe_routing_seed "$MOE_ROUTING_SEED" + --replica_scheduler_config_type "$REPLICA_SCHEDULER" + --decode_cuda_graph_mode "$DECODE_CUDA_GRAPH_MODE" + --vllm_v1_scheduler_config_max_tokens_in_batch "$MAX_TOKENS_IN_BATCH" + --vllm_v1_scheduler_config_long_prefill_token_threshold "$LONG_PREFILL_TOKEN_THRESHOLD" + --vllm_v1_scheduler_config_block_size "${BLOCK_SIZE:-16}" + --vllm_v1_scheduler_config_num_blocks "${NUM_BLOCKS:-128}" + --request_generator_config_type synthetic + --synthetic_request_generator_config_num_requests "$NUM_REQUESTS" + --length_generator_config_type fixed + --fixed_request_length_generator_config_prefill_tokens "$PREFILL_TOKENS" + --fixed_request_length_generator_config_decode_tokens "$DECODE_TOKENS" + --interval_generator_config_type poisson + --poisson_request_interval_generator_config_qps "$QPS" + --analytical_kv_cache_transfer_config_network_bandwidth_gbps "$KV_TRANSFER_BANDWIDTH_GBPS" + --analytical_kv_cache_transfer_config_network_latency_ms "$KV_TRANSFER_LATENCY_MS" + --metrics_config_output_dir "$METRICS_OUTPUT_DIR" + --metrics_config_run_id "$RUN_ID" + --metrics_config_write_metrics + --metrics_config_store_request_metrics + --metrics_config_store_batch_metrics + --metrics_config_store_token_completion_metrics + --metrics_config_store_utilization_metrics + --no-metrics_config_store_plots + --no-metrics_config_enable_chrome_trace + --no-metrics_config_write_json_trace + +) + +if [ "$ENABLE_CHUNKED_PREFILL" = "true" ]; then + CMD+=(--vllm_v1_scheduler_config_enable_chunked_prefill) +else + CMD+=(--no-vllm_v1_scheduler_config_enable_chunked_prefill) +fi + +CMD+=( + 
--speculative_decoding_config_enabled + --speculative_decoding_config_method "$SPEC_METHOD" + --speculative_decoding_config_spec_model_name "$SPEC_MODEL_NAME" + --speculative_decoding_config_num_speculative_tokens "$NUM_SPECULATIVE_TOKENS" + --speculative_decoding_config_committed_tokens_per_iteration "$COMMITTED_TOKENS_PER_ITERATION" + --speculative_decoding_config_proposer_overhead_ms_by_method "$PROPOSER_OVERHEAD_MS_BY_METHOD" +) + +if [ "$MTP_N_PREDICT" -gt 0 ]; then + CMD+=(--speculative_decoding_config_mtp_n_predict "$MTP_N_PREDICT") +fi + +if [ "$MTP_NUM_LAYERS" -gt 0 ]; then + CMD+=(--speculative_decoding_config_mtp_num_layers "$MTP_NUM_LAYERS") +fi + +if [ "$ENABLE_DUMMY_MODE" = "true" ]; then + CMD+=( + --random_forrest_execution_time_predictor_config_enable_dummy_mode + --random_forrest_execution_time_predictor_config_dummy_execution_time_ms "$DUMMY_EXEC_TIME_MS" + ) +fi + +if [ "$#" -gt 0 ]; then + if [ "$1" = "--" ]; then + shift + fi + CMD+=("$@") +fi + +cat <&2 + exit "$exit_code" +fi diff --git a/examples/architecture/pdd/offline/thinking_mode_basic.sh b/examples/architecture/pdd/offline/thinking_mode_basic.sh new file mode 100755 index 0000000..7cf4a9e --- /dev/null +++ b/examples/architecture/pdd/offline/thinking_mode_basic.sh @@ -0,0 +1,225 @@ +#!/bin/bash +# ============================================================================= +# PDD / pd-disaggregation Offline Mode - Dense Thinking Mode Example +# ============================================================================= +# This script mirrors the co-location example surface while using the +# pre-release-v0.2 PDD / pd-disaggregation architecture: prefill runs in the PREFILL cluster, +# decode runs in the DECODE cluster, and KV cache is transferred between them. +# +# This script demonstrates Thinking Mode on the release-supported sequential +# pd-disaggregation path. 
+# Override any uppercase variable from the shell, and append extra Frontier CLI +# flags after "--" if you need to customize the run. +# ============================================================================= + +set -euo pipefail + +REPO_ROOT="$(cd "$(dirname "$0")/../../../.." && pwd)" +export PYTHONPATH="$REPO_ROOT${PYTHONPATH:+:$PYTHONPATH}" +export WANDB_DISABLED=true +export VIDUR_DISABLE_WANDB=1 +PYTHON_BIN="${PYTHON_BIN:-python3}" + +MODEL_NAME="${MODEL_NAME:-meta-llama/Llama-2-7b-hf}" +SYS_ARCH="${SYS_ARCH:-pd-disaggregation}" +PREFILL_REPLICAS="${PREFILL_REPLICAS:-2}" +DECODE_REPLICAS="${DECODE_REPLICAS:-2}" +PREFILL_ATTN_TP="${PREFILL_ATTN_TP:-1}" +PREFILL_ATTN_DP="${PREFILL_ATTN_DP:-1}" +PREFILL_MOE_TP="${PREFILL_MOE_TP:-1}" +PREFILL_MOE_EP="${PREFILL_MOE_EP:-1}" +PREFILL_PP="${PREFILL_PP:-1}" +PREFILL_DEVICE="${PREFILL_DEVICE:-a800}" +PREFILL_MEMORY_MARGIN_FRACTION="${PREFILL_MEMORY_MARGIN_FRACTION:-0.2}" +DECODE_ATTN_TP="${DECODE_ATTN_TP:-1}" +DECODE_ATTN_DP="${DECODE_ATTN_DP:-1}" +DECODE_MOE_TP="${DECODE_MOE_TP:-1}" +DECODE_MOE_EP="${DECODE_MOE_EP:-1}" +DECODE_PP="${DECODE_PP:-1}" +DECODE_DEVICE="${DECODE_DEVICE:-a800}" +DECODE_MEMORY_MARGIN_FRACTION="${DECODE_MEMORY_MARGIN_FRACTION:-0.2}" +TOTAL_EXPERTS="${TOTAL_EXPERTS:-1}" +ROUTER_TOPK="${ROUTER_TOPK:-1}" +MOE_ROUTING_MODE="${MOE_ROUTING_MODE:-simulation}" +MOE_ROUTING_SEED="${MOE_ROUTING_SEED:-42}" +REPLICA_SCHEDULER="${REPLICA_SCHEDULER:-vllm_v1}" +NUM_REQUESTS="${NUM_REQUESTS:-1}" +PREFILL_TOKENS="${PREFILL_TOKENS:-8}" +DECODE_TOKENS="${DECODE_TOKENS:-2}" +QPS="${QPS:-1.0}" +ENABLE_DUMMY_MODE="${ENABLE_DUMMY_MODE:-true}" +DUMMY_EXEC_TIME_MS="${DUMMY_EXEC_TIME_MS:-1.0}" +DECODE_CUDA_GRAPH_MODE="${DECODE_CUDA_GRAPH_MODE:-none}" +ENABLE_CHUNKED_PREFILL="${ENABLE_CHUNKED_PREFILL:-true}" +MAX_TOKENS_IN_BATCH="${MAX_TOKENS_IN_BATCH:-1024}" +LONG_PREFILL_TOKEN_THRESHOLD="${LONG_PREFILL_TOKEN_THRESHOLD:-64}" +KV_TRANSFER_BANDWIDTH_GBPS="${KV_TRANSFER_BANDWIDTH_GBPS:-200.0}" 
+KV_TRANSFER_LATENCY_MS="${KV_TRANSFER_LATENCY_MS:-0.5}" +THINKING_DEPTH="${THINKING_DEPTH:-2}" +TOOL_CALL_LATENCY="${TOOL_CALL_LATENCY:-0.001}" +THINKING_ROUND_PREFILL_TOKENS="${THINKING_ROUND_PREFILL_TOKENS:-3}" +THINKING_ROUND_DECODE_TOKENS="${THINKING_ROUND_DECODE_TOKENS:-1}" +METRICS_OUTPUT_DIR="${METRICS_OUTPUT_DIR:-$REPO_ROOT/outputs/examples/pdd/offline}" +RUN_ID="${RUN_ID:-thinking_mode_basic}" + +require_bool() { + local name="$1" + local value="$2" + if [ "$value" != "true" ] && [ "$value" != "false" ]; then + echo "ERROR: $name must be true or false; got $value" >&2 + exit 2 + fi +} + +require_non_negative_integer() { + local name="$1" + local value="$2" + if [[ ! "$value" =~ ^[0-9]+$ ]]; then + echo "ERROR: $name must be a non-negative integer; got $value" >&2 + exit 2 + fi +} + +require_positive_integer() { + local name="$1" + local value="$2" + [[ "$value" =~ ^[1-9][0-9]*$ ]] +} + +require_bool "ENABLE_DUMMY_MODE" "$ENABLE_DUMMY_MODE" +require_bool "ENABLE_CHUNKED_PREFILL" "$ENABLE_CHUNKED_PREFILL" + +if [ "$SYS_ARCH" != "pd-disaggregation" ]; then + echo "ERROR: this example only supports SYS_ARCH=pd-disaggregation; got SYS_ARCH=$SYS_ARCH" >&2 + exit 2 +fi + +if [ "$DECODE_CUDA_GRAPH_MODE" = "none" ]; then + echo "INFO: Decode CUDA Graph modeling is disabled by DECODE_CUDA_GRAPH_MODE=none." +elif [ "$DECODE_CUDA_GRAPH_MODE" != "full_decode_only" ] && [ "$DECODE_CUDA_GRAPH_MODE" != "piecewise" ]; then + echo "ERROR: DECODE_CUDA_GRAPH_MODE must be none, full_decode_only, or piecewise; got $DECODE_CUDA_GRAPH_MODE" >&2 + exit 2 +fi + +if [ "$ENABLE_CHUNKED_PREFILL" = "false" ] && [ "$LONG_PREFILL_TOKEN_THRESHOLD" != "0" ]; then + echo "ERROR: LONG_PREFILL_TOKEN_THRESHOLD must be 0 when ENABLE_CHUNKED_PREFILL=false" >&2 + exit 2 +fi + +if ! 
command -v "$PYTHON_BIN" >/dev/null 2>&1; then + echo "ERROR: PYTHON_BIN is not executable or not on PATH: $PYTHON_BIN" >&2 + exit 2 +fi + +CMD=( + "$PYTHON_BIN" -m frontier.main + --simulation_mode offline + --sys_arch "$SYS_ARCH" + --no-enable_parallel_clusters + --cluster_config_prefill_cluster_num_replicas "$PREFILL_REPLICAS" + --cluster_config_decode_cluster_num_replicas "$DECODE_REPLICAS" + --cluster_config_prefill_replica_config_num_pipeline_stages "$PREFILL_PP" + --cluster_config_prefill_replica_config_attn_tensor_parallel_size "$PREFILL_ATTN_TP" + --cluster_config_prefill_replica_config_attn_data_parallel_size "$PREFILL_ATTN_DP" + --cluster_config_prefill_replica_config_moe_tensor_parallel_size "$PREFILL_MOE_TP" + --cluster_config_prefill_replica_config_moe_expert_parallel_size "$PREFILL_MOE_EP" + --cluster_config_prefill_replica_config_total_expert_num "$TOTAL_EXPERTS" + --cluster_config_prefill_replica_config_router_topk "$ROUTER_TOPK" + --cluster_config_prefill_replica_config_device "$PREFILL_DEVICE" + --cluster_config_prefill_replica_config_memory_margin_fraction "$PREFILL_MEMORY_MARGIN_FRACTION" + --cluster_config_decode_replica_config_num_pipeline_stages "$DECODE_PP" + --cluster_config_decode_replica_config_attn_tensor_parallel_size "$DECODE_ATTN_TP" + --cluster_config_decode_replica_config_attn_data_parallel_size "$DECODE_ATTN_DP" + --cluster_config_decode_replica_config_moe_tensor_parallel_size "$DECODE_MOE_TP" + --cluster_config_decode_replica_config_moe_expert_parallel_size "$DECODE_MOE_EP" + --cluster_config_decode_replica_config_total_expert_num "$TOTAL_EXPERTS" + --cluster_config_decode_replica_config_router_topk "$ROUTER_TOPK" + --cluster_config_decode_replica_config_device "$DECODE_DEVICE" + --cluster_config_decode_replica_config_memory_margin_fraction "$DECODE_MEMORY_MARGIN_FRACTION" + --cc_backend_config_type analytical + --replica_config_model_name "$MODEL_NAME" + --replica_config_moe_routing_mode "$MOE_ROUTING_MODE" + 
--replica_config_moe_routing_seed "$MOE_ROUTING_SEED" + --replica_scheduler_config_type "$REPLICA_SCHEDULER" + --decode_cuda_graph_mode "$DECODE_CUDA_GRAPH_MODE" + --vllm_v1_scheduler_config_max_tokens_in_batch "$MAX_TOKENS_IN_BATCH" + --vllm_v1_scheduler_config_long_prefill_token_threshold "$LONG_PREFILL_TOKEN_THRESHOLD" + --vllm_v1_scheduler_config_block_size "${BLOCK_SIZE:-16}" + --vllm_v1_scheduler_config_num_blocks "${NUM_BLOCKS:-128}" + --request_generator_config_type synthetic + --synthetic_request_generator_config_num_requests "$NUM_REQUESTS" + --length_generator_config_type fixed + --fixed_request_length_generator_config_prefill_tokens "$PREFILL_TOKENS" + --fixed_request_length_generator_config_decode_tokens "$DECODE_TOKENS" + --interval_generator_config_type poisson + --poisson_request_interval_generator_config_qps "$QPS" + --analytical_kv_cache_transfer_config_network_bandwidth_gbps "$KV_TRANSFER_BANDWIDTH_GBPS" + --analytical_kv_cache_transfer_config_network_latency_ms "$KV_TRANSFER_LATENCY_MS" + --metrics_config_output_dir "$METRICS_OUTPUT_DIR" + --metrics_config_run_id "$RUN_ID" + --metrics_config_write_metrics + --metrics_config_store_request_metrics + --metrics_config_store_batch_metrics + --metrics_config_store_token_completion_metrics + --metrics_config_store_utilization_metrics + --no-metrics_config_store_plots + --no-metrics_config_enable_chrome_trace + --no-metrics_config_write_json_trace + --enable_thinking_mode + --thinking_depth "$THINKING_DEPTH" + --tool_call_latency "$TOOL_CALL_LATENCY" + --thinking_round_prefill_tokens "$THINKING_ROUND_PREFILL_TOKENS" + --thinking_round_decode_tokens "$THINKING_ROUND_DECODE_TOKENS" +) + +if [ "$ENABLE_CHUNKED_PREFILL" = "true" ]; then + CMD+=(--vllm_v1_scheduler_config_enable_chunked_prefill) +else + CMD+=(--no-vllm_v1_scheduler_config_enable_chunked_prefill) +fi + +if [ "$ENABLE_DUMMY_MODE" = "true" ]; then + CMD+=( + --random_forrest_execution_time_predictor_config_enable_dummy_mode + 
--random_forrest_execution_time_predictor_config_dummy_execution_time_ms "$DUMMY_EXEC_TIME_MS" + ) +fi + +if [ "$#" -gt 0 ]; then + if [ "$1" = "--" ]; then + shift + fi + CMD+=("$@") +fi + +cat <