
[Feature] What is the upper limit to the performance improvement of Dflash training as the step increases? #469


Description

@5SSjw


Motivation

Hi maintainers,
I’m trying to reproduce the official DFlash training pipeline, following the instructions from issue #465. During reproduction, I’m observing that DFlash decoding efficiency keeps improving as training steps increase, but the marginal gains become quite small in later stages. I’d like to understand what the expected “ceiling” (if any) is when scaling training steps/epochs.

Observation:
In my runs, the improvements are clearly diminishing, but even after ~12 epochs the metrics are still trending upward.

Measured results:
Official baseline: speedup 4.72×, τ = 5.97
Step 60,000: speedup 2.58×, τ = 3.18
Step 80,000: speedup 2.64×, τ = 3.26
Step 100,000: speedup 2.71×, τ = 3.34
Step 142,000: speedup 2.86×, τ = 3.55

(Training curves attached: train/accuracy and train/loss. They look stable overall, and the improvement trend flattens later.)
Additionally, how strongly does this task depend on the training dataset? Is multi-dataset training necessary, or are there other training techniques that help? Discussion is welcome.
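For what it's worth, here is a rough way to extrapolate where the ceiling might be from the four measured checkpoints. The saturating-exponential model `speedup(s) = a - b*exp(-c*s)` and the fitting routine are entirely my own assumption for illustration (not from DFlash/SpecForge), and with only four near-linear points the implied asymptote `a` is a very coarse estimate:

```python
# Hypothetical extrapolation of the speedup ceiling from measured checkpoints.
# Assumed model (not from the DFlash authors): speedup(s) = a - b*exp(-c*s),
# where "a" is the implied asymptotic ceiling as steps s -> infinity.
import math

steps    = [60_000, 80_000, 100_000, 142_000]
speedups = [2.58, 2.64, 2.71, 2.86]

def fit_ceiling(xs, ys):
    """Grid-search the rate c; for each c, solve a and b in closed form
    via the normal equations of the linear model y = a - b * exp(-c*x)."""
    best = None
    for i in range(1, 200):
        c = i * 1e-6
        es = [math.exp(-c * x) for x in xs]
        n   = len(xs)
        se  = sum(es)
        se2 = sum(e * e for e in es)
        sy  = sum(ys)
        sye = sum(y * e for y, e in zip(ys, es))
        denom = n * se2 - se * se
        if abs(denom) < 1e-12:
            continue
        b = (se * sy - n * sye) / denom  # least-squares slope for y = a - b*e
        a = (sy + b * se) / n            # least-squares intercept
        sse = sum((y - (a - b * e)) ** 2 for y, e in zip(ys, es))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]

a, b, c = fit_ceiling(steps, speedups)
print(f"implied ceiling: ~{a:.2f}x speedup")
```

Since the measured points are still close to linear in this range, the fit is dominated by the smallest rates `c` and the estimate should be treated as an upper-bound sketch, not a prediction.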

Related resources

Here's my training shell template (line continuations restored so it runs as-is):

```bash
set -e

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ROOT_DIR=$(dirname "$SCRIPT_DIR")
export TORCHINDUCTOR_CACHE_DIR=$ROOT_DIR/cache/compiled_kernels
export SPECFORGE_DATA_NUM_PROC=32
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NUM_GPUS=${1:-8}

ATTENTION_BACKEND=${2:-flex_attention}

torchrun \
    --standalone \
    --nproc_per_node "$NUM_GPUS" \
    "$ROOT_DIR/scripts/train_dflash.py" \
    --target-model-path /models/Qwen3-8B \
    --draft-config-path "$ROOT_DIR/configs/qwen3-8b-dflash.json" \
    --train-data-path "$ROOT_DIR/cache/dataset/ultrachat_train_regen_nonthinking.jsonl" \
    --output-dir "$ROOT_DIR/outputs/qwen3-8b-dflash-perfectblend-v2" \
    --num-epochs 20 \
    --batch-size 2 \
    --accumulation-steps 2 \
    --learning-rate 6e-4 \
    --warmup-ratio 0.04 \
    --max-grad-norm 1.0 \
    --max-length 3072 \
    --chat-template qwen \
    --attention-backend "$ATTENTION_BACKEND" \
    --random-anchor \
    --num-anchors 512 \
    --loss-decay-gamma 7.0 \
    --log-interval 50 \
    --save-interval 1000 \
    --cache-dir "$ROOT_DIR/cache" \
    --report-to tensorboard
```
