Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling (paper link)
Block is a research prototype that explores predictive, performance-aware scheduling for distributed large-language-model (LLM) inference. It builds on top of Microsoft’s Vidur simulator, which was initially developed for offline evaluation and optimal configuration search, and adds
- a side-car Predictor service that forecasts per–instance leading metrics with Vidur at run time,
- a Global Scheduler that uses these predictions (or live metrics) to route requests, and
- tooling for training a lightweight length-estimator model so the scheduler can reason about prompts it has never seen before.
Everything needed to reproduce the paper’s results—source code, datasets, experiment scripts—lives in this repository and is anonymised for the reviewing process.
• Predictor (block/predictor)
Co-located with every inference node. It collects live stats or spins up a Vidur simulation on demand, and answers “What if I got one more request?”
• Global Scheduler (block/global_scheduler)
Receives requests, queries Predictors, and applies the scheduling policy (default: Block; alternatives: LLumnix, round-robin, …).
• Query Length Tagger (block/length_estimation)
A RoBERTa-based regressor that predicts the response-token count for unseen (model, prompt) pairs, feeding the scheduler with input-aware cost estimates. Currently we only run this model offline to tag the ShareGPT dataset with predicted response lengths under data/trace_data/sharegpt/generate, but it should be easy to adapt it to any runtime model-serving system such as TensorFlow Serving or TorchServe.
Block is inference-engine agnostic. We provide an implementation for vLLM 0.7.2 (see the sealed repo at https://github.com/AKafakA/vllm/tree/block).
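To make the control flow concrete, here is a minimal, purely illustrative sketch of the predictive routing loop: the Global Scheduler asks every Predictor for a load score that accounts for the estimated response length, then routes to the least-loaded instance. All names (PredictorClient, predict_load, estimate_response_tokens) are hypothetical and do not match the actual APIs in block/global_scheduler or block/predictor.

```python
# Purely illustrative sketch of the predictive routing loop; class and function
# names are invented and do not match the actual code in block/global_scheduler
# or block/predictor.
import asyncio
from dataclasses import dataclass

@dataclass
class PredictorClient:
    """Stand-in for one per-node Predictor side-car."""
    host: str
    queued_tokens: int = 0  # toy stand-in for whatever live stats the node reports

    async def predict_load(self, prompt_tokens: int, est_response_tokens: int) -> float:
        # The real Predictor answers "what if I got one more request?" from live
        # engine stats or an on-demand Vidur simulation; here it is a toy formula.
        return self.queued_tokens + prompt_tokens + est_response_tokens

def estimate_response_tokens(prompt: str) -> int:
    """Stand-in for the RoBERTa length estimator (block/length_estimation)."""
    return 128  # a fixed guess, just for the sketch

async def route(prompt: str, predictors: list[PredictorClient]) -> PredictorClient:
    prompt_tokens = len(prompt.split())          # crude stand-in for tokenization
    est_out = estimate_response_tokens(prompt)   # input-aware cost estimate
    scores = await asyncio.gather(
        *(p.predict_load(prompt_tokens, est_out) for p in predictors)
    )
    # Route to the instance whose predicted post-admission load is lowest.
    return min(zip(predictors, scores), key=lambda ps: ps[1])[0]

if __name__ == "__main__":
    nodes = [PredictorClient("node-0", 500), PredictorClient("node-1", 50)]
    print(asyncio.run(route("Summarise this paper ...", nodes)).host)  # -> node-1
```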
block/
├── predictor/ # side-car prediction service
├── global_scheduler/ # request router
├── length_estimation/ # training & inference of token-length regressor
├── benchmark/ # Poisson load-generator (forked from vLLM)
├── config/
│   └── host_configs.json # cluster description template
├── exp/
│ ├── generate_config.py
│ ├── setup.sh # installs deps & deploys cluster
│ ├── end_to_end_exp_scripts/
│ └── … # run, plot, gather
└── data/ # ShareGPT, BurstGPT, ArXiv-Summ datasets
vidur/ # same as the original Vidur repo, but with new replica schedulers and a revised simulator
├── ...
└── scheduler/
    ├── sarathi_replica_scheduler.py          # vLLM scheduler simulator with chunked prefill
    ├── simulate_predict_replica_scheduler.py # adapts other simulators / a latency linear model into the Predictor service
    └── …                                     # other simulators
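As a rough illustration of the “latency linear model” option mentioned in the tree above, a Predictor can be backed by something as simple as a linear fit on profiling data instead of a full Vidur simulation. The sketch below is hypothetical: the coefficients are made up and it does not reflect the actual interface of simulate_predict_replica_scheduler.py.

```python
# Hypothetical sketch of a linear latency model that a Predictor could use in
# place of a full Vidur simulation; the real adapter lives in
# vidur/scheduler/simulate_predict_replica_scheduler.py and has a different API.
from dataclasses import dataclass

@dataclass
class LinearLatencyModel:
    # Coefficients would normally be fitted offline on profiling data
    # (e.g. the traces under data/profiling); these values are invented.
    per_prefill_token_ms: float = 0.35
    per_decode_token_ms: float = 9.0
    fixed_overhead_ms: float = 20.0

    def predict_request_ms(self, prompt_tokens: int, response_tokens: int) -> float:
        """Predicted end-to-end latency of one request on an otherwise idle replica."""
        return (
            self.fixed_overhead_ms
            + self.per_prefill_token_ms * prompt_tokens
            + self.per_decode_token_ms * response_tokens
        )

if __name__ == "__main__":
    model = LinearLatencyModel()
    # e.g. a 512-token prompt with an estimated 128-token response
    print(f"{model.predict_request_ms(512, 128):.1f} ms")
```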
- Set up the cluster

- Generate the cluster hosts configuration
  If using CloudLab, download its manifest.xml, move it to block/prediction, and run:
  python block/exp/generate_config.py --username USERNAME_SSH_TO_HOST
  Otherwise, manually generate the configuration and hostname listing files, following the examples under block/config. (A toy sketch of extracting hostnames from a CloudLab manifest appears after these quick-start steps.)
- Deploy the software stack and build vLLM
  sh block/exp/setup.sh
  Insert the vLLM GitHub link if it is missing, or manually distribute the vLLM repo to the testing workers.
- Start a full end-to-end experiment (≈50 h on 12 × A30 GPUs)
  Fill in your Hugging Face token in exp/run_exp_vllm.sh and the hostname that runs the global scheduler and benchmarking in exp/experiment.sh, then run the end-to-end scripts, e.g.:
  sh block/exp/end_to_end_exp_scripts/main_experiments.sh ... # results under experiment_output/data/
  All testing scripts are located in the block/exp/end_to_end_exp_scripts directory. They reproduce the different experiments as follows:
  - main_experiment.sh: generates the results for Figure 6 (see Section 5.3).
  - auto_provision_exp.sh: produces the results for Figure 8 (see Section 5.5).
  - config_search_experiment.sh: tests different batch sizes and chunk sizes, corresponding to Table 2 (see Section 5.6).
  - extension_experiment.sh: tests the Qwen model and the BurstGPT dataset, also related to the first two columns of Table 2 (see Section 5.6).
  - prediction_experiment.sh: provides the results for debugging prediction accuracy, Figure 5 (see Section 5.2.2).
  - warmup.sh: used for simple experiments and debugging, and for warming up models for the other experiments.
- Plot and summarise results after all of the above experiments have finished
  sh block/exp/end_to_end_exp_scripts/plot.sh # figures end up in experiment_output/results/
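Referring back to the cluster-configuration step above, the following toy sketch shows the kind of information generate_config.py has to pull out of a CloudLab manifest. It is not the real script; the RSpec namespace, element and attribute names, and the output schema are assumptions rather than something verified against this repository.

```python
# Toy sketch only: block/exp/generate_config.py is the real tool. The RSpec
# namespace and element/attribute names below are assumptions about CloudLab's
# manifest.xml format, and the output schema is invented.
import json
import xml.etree.ElementTree as ET

RSPEC_NS = {"r": "http://www.geni.net/resources/rspec/3"}  # assumed GENI RSpec v3 namespace

def hostnames_from_manifest(path: str) -> list[str]:
    """Collect the fully qualified host names of all nodes in a manifest.xml."""
    root = ET.parse(path).getroot()
    hosts = []
    for node in root.findall("r:node", RSPEC_NS):
        host = node.find("r:host", RSPEC_NS)
        if host is not None and host.get("name"):
            hosts.append(host.get("name"))
    return hosts

if __name__ == "__main__":
    hosts = hostnames_from_manifest("manifest.xml")
    # A hypothetical cluster description; the real schema is whatever
    # block/config/host_configs.json expects.
    print(json.dumps({"username": "USERNAME_SSH_TO_HOST", "hosts": hosts}, indent=2))
```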
If you wish to regenerate the regressor instead of using the provided tagged data:
Download the dataset from
wget https://huggingface.co/datasets/shibing624/sharegpt_gpt4/blob/main/sharegpt_gpt4.jsonl
Then train the regressor and generate the tagged data
python block/length_estimation/sample
python block/length_estimation/train_roberta
# Tag ShareGPT prompts with ground-truth response lengths
python block/length_estimation/eval_roberta --tag-data True
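For reference, a minimal sketch of what a RoBERTa-based length regressor looks like with Hugging Face Transformers is shown below; the actual training and tagging logic lives in block/length_estimation, and the base model, hyper-parameters, and example prompt here are illustrative only.

```python
# Minimal illustrative sketch of a RoBERTa regression head for response-length
# prediction; the real pipeline is block/length_estimation/train_roberta and
# eval_roberta, and the choices below are made up.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 plus a regression loss turns the classification head into a regressor.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1, problem_type="regression"
)

def predict_response_tokens(prompt: str) -> float:
    """Predict how many tokens the response to `prompt` will contain."""
    inputs = tokenizer(prompt, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 1)
    return logits.item()

if __name__ == "__main__":
    # An untrained head gives meaningless numbers; after fine-tuning on
    # (prompt, response-length) pairs it feeds the scheduler's cost estimates.
    print(predict_response_tokens("Explain the difference between TCP and UDP."))
```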
block/benchmark/benchmark_serving.py, which is modified from the vLLM benchmark scripts, can replay a JSON/CSV dataset at a configurable Poisson arrival rate:
python block/benchmark/benchmark_serving.py

• Add a new scheduling heuristic: 1) implement the inference-side changes that export the required metrics to the Predictor (check the latest commits of the vLLM fork), 2) define its load score inside simulate_predictor.py, and 3) append the name of the new scheduler to the enum class in vidur/types/optimal_global_scheduler_target_metric.py and update the end-to-end shell scripts.
• Support a different inference engine: expose the metrics in a new API, taking the scheduler_trace handling in vllm/entrypoints/api_server.py of the vLLM implementation as an example (see the sketch after this list).
• Support different models with different GPU SKUs:
  - Follow the Vidur instructions to gather the profiling data and move it to data/profiling.
  - Append the new model configurations in vidur/config/model_config.py. Currently we have only profiled the CloudLab d7525 host (with a single A30 GPU) with Qwen2-7B and Llama-2-7B-hf for prototyping.
  - Provide a new model config for the global scheduler to use, following block/config/llama_config.json as an example, and update the end-to-end scripts to use the new model config accordingly; see '/block/end_to_end_exp_scripts/entension', which runs the experiments on both Llama2 and Qwen2.
  - Finally, if using an estimated length associated with the new model (as Block* in the paper), follow the steps above to train/evaluate the length estimator, get the tagged data, and put it under data/trace_data/DATASET_NAME/generate/MODEL_NAME/
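As a rough illustration of the “expose the metrics in a new API” step above, the sketch below shows a FastAPI-style endpoint returning a few load metrics. It is not the actual scheduler_trace implementation in the vLLM fork; the endpoint path, field names, and engine accessors are all hypothetical.

```python
# Hypothetical sketch of a metrics endpoint for a new inference engine; the real
# reference is the scheduler_trace handling in vllm/entrypoints/api_server.py of
# the vLLM fork. Field names and engine accessors here are invented.
from fastapi import FastAPI

app = FastAPI()

class DummyEngine:
    """Stand-in for whatever engine object the serving framework exposes."""
    def num_running_requests(self) -> int: return 3
    def num_waiting_requests(self) -> int: return 1
    def kv_cache_usage(self) -> float: return 0.42  # fraction of KV-cache blocks in use

engine = DummyEngine()

@app.get("/scheduler_trace")
def scheduler_trace() -> dict:
    # The Predictor side-car would poll this endpoint (or receive pushes from it)
    # to build its view of the node's current load.
    return {
        "running": engine.num_running_requests(),
        "waiting": engine.num_waiting_requests(),
        "kv_cache_usage": engine.kv_cache_usage(),
    }

# Run with: uvicorn metrics_api_sketch:app --port 8001
```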
Block was tested with the following software stack:
• Python 3.10
• CUDA 12.6
• flashinfer-python 0.2.5, triton 3.2.0, PyTorch 2.5+, and the customised vLLM based on 0.7.2
Please check requirements.txt and block/exp/setup.sh for the full list.
If you find Block useful, please cite our paper:
@misc{da2025blockbalancingloadllm,
title={Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling},
author={Wei Da and Evangelia Kalyvianaki},
year={2025},
eprint={2508.03611},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2508.03611},
}
This work is released under the MIT license. See LICENSE for details.
Happy scheduling!
