| Author | PatchouliTaisa |
|---|---|
Note
- License in the `traineagle3` directory: Apache 2.0 (same as the original eagle repo)
- License everywhere else: MIT
The above benchmark was run with this benchmarking file. The left-hand side is the baseline and the right-hand side is Eagle3, which achieves around a 1.32x speedup in inference.
Environment: A single H100 GPU, dtype = float16, attention-backend = flashinfer, num_prompts = 1
Speculative config: method = eagle3, draft_tensor_parallel_size = 1, num_speculative_tokens = 2
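As a rough illustration, the settings listed above could be passed to a vLLM server along the following lines. This is only a sketch, not the contents of the repo's serve script: `BASE_MODEL`, `DRAFT_MODEL`, and `DRY_RUN` are hypothetical names introduced here, and selecting the attention backend through the `VLLM_ATTENTION_BACKEND` environment variable is an assumption that depends on the installed vLLM version.

```shell
# Hypothetical placeholders: point these at your local base model and Eagle3 draft model.
BASE_MODEL="${BASE_MODEL:-/path/to/base-model}"
DRAFT_MODEL="${DRAFT_MODEL:-/path/to/eagle3-draft-model}"

# Speculative settings from the benchmark; "model" names the draft checkpoint.
SPEC_CONFIG="{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL\", \"draft_tensor_parallel_size\": 1, \"num_speculative_tokens\": 2}"

# Assumption: this vLLM version reads the attention backend from this variable.
export VLLM_ATTENTION_BACKEND=FLASHINFER

# Build the command as positional parameters so quoting is preserved.
set -- vllm serve "$BASE_MODEL" --dtype float16 --speculative-config "$SPEC_CONFIG"

# Print the command by default; set DRY_RUN=0 to actually start the server.
if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
else
    exec "$@"
fi
```

Printing the command first makes it easy to compare the flags against the environment listed above before committing GPU memory to a server start.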
The above benchmark was run with this benchmarking file. The left-hand side is the baseline and the right-hand side is Eagle3, which achieves around a 1.56x speedup in inference.
Environment: A single H100 GPU, dtype = float16, attention-backend = flashinfer, mem-fraction-static = 0.8, max-total-tokens = 131072, cuda-graph-max-bs = 32, num_prompts = 1
Speculative decoding config: speculative-algorithm = EAGLE3, speculative-num-steps = 3, speculative-eagle-topk = 24, speculative-num-draft-tokens = 128
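The flag names in this configuration match SGLang's server arguments, so a launch command could look roughly like the sketch below. It is an illustration under assumptions rather than the repo's actual serve script: `BASE_MODEL`, `DRAFT_MODEL`, and `DRY_RUN` are hypothetical names, and the draft model is assumed to be passed through `--speculative-draft-model-path`.

```shell
# Hypothetical placeholders: point these at your local base model and Eagle3 draft model.
BASE_MODEL="${BASE_MODEL:-/path/to/base-model}"
DRAFT_MODEL="${DRAFT_MODEL:-/path/to/eagle3-draft-model}"

# Build the launch command from the environment and speculative settings listed above.
set -- python -m sglang.launch_server \
    --model-path "$BASE_MODEL" \
    --dtype float16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --max-total-tokens 131072 \
    --cuda-graph-max-bs 32 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path "$DRAFT_MODEL" \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 24 \
    --speculative-num-draft-tokens 128

# Print the command by default; set DRY_RUN=0 to actually start the server.
if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
else
    exec "$@"
fi
```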
Note
The models on huggingface are the state_19 checkpoint (20th epoch). They were trained on 6 H200 GPUs for around 48 hrs (1 epoch took around 2h10m).
The 2 eagle3 model weights on huggingface for `vllm` and `sglang` are identical; the only difference is the `config.json`.
- Download the Eagle3 model weights from huggingface (you will need to accept the terms on huggingface before you can download them).
- Pull the vllm-openai image from dockerhub:
  $ docker pull vllm/vllm-openai:latest
- Or, if you're on HPC, you might want to build the singularity image:
  $ singularity build --fakeroot vllm.sif vllm.def
- Start the server using the script `benchmark/vllm/serve.sh` (some modifications are necessary to adapt to your environment).
- Run the benchmark using `benchmark/vllm/benchmark.sh` (you will need to clone the `vllm` repo to the project root first).
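The download and clone prerequisites in the steps above might be scripted roughly as follows. This is a sketch only: `EAGLE3_REPO`, the `models/eagle3` target directory, and the `run` helper with its `DRY_RUN` switch are hypothetical names introduced for illustration, and the actual Hugging Face repo ID has to be filled in by hand.

```shell
# Hypothetical placeholder: substitute the actual Hugging Face repo ID of the Eagle3 weights.
EAGLE3_REPO="${EAGLE3_REPO:-<org>/<eagle3-model>}"

# Echo each command; execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" != 1 ]; then
        "$@"
    fi
}

# Log in so gated downloads work (accept the model terms on the website first).
run huggingface-cli login

# Download the weights into a local directory (the directory name is an assumption).
run huggingface-cli download "$EAGLE3_REPO" --local-dir models/eagle3

# Clone vllm into the project root, as the benchmark script expects.
run git clone https://github.com/vllm-project/vllm.git
```

Running it once with the default dry-run setting prints the three commands, which is a cheap way to check the repo ID and paths before any download starts.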
- Download the Eagle3 model weights from huggingface (you will need to accept the terms on huggingface before you can download them).
- Pull the sglang image from dockerhub:
  $ docker pull lmsysorg/sglang:latest
- Or, if you're on HPC, you might want to build the singularity image:
  $ singularity build --fakeroot sglang.sif sglang.def
- Start the server using the script `benchmark/sglang/serve.sh` (some modifications are necessary to adapt to your environment).
- Run the benchmark using `benchmark/sglang/benchmark.sh` (you will need to clone the `vllm` repo to the project root first).
# you will need to accept the terms on huggingface first before you can download it
$ huggingface-cli download seanmamasde/sharegpt-gpt4-taide --local-dir dataset/sharegpt-taide
# install required packages using pip/conda or other package manager of your choice
$ pip install -r requirement.txt
# cd into the directory and change necessary stuff in the config
$ cd traineagle3/llama
$ vim ds_config.json
# start training using deepspeed
$ deepspeed main.py --deepspeed_config ds_config.json

