[Feature] LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding #485

### Checklist

- [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
- [x] 2. Please use English, otherwise it will be closed.

### Motivation

Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is largely determined by the acceptance rate, yet standard training minimizes Kullback-Leibler (KL) divergence as a proxy objective. While KL divergence and acceptance rate share the same global optimum, small draft models, having limited capacity, typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate. To address this issue, we propose LK losses, special training objectives that directly target acceptance rate. Comprehensive experiments across four draft architectures and six target models, ranging from 8B to 685B parameters, demonstrate consistent improvements in acceptance metrics across all configurations compared to the standard KL-based training. We evaluate our approach on general, coding and math domains and report gains of up to 8-10% in average acceptance length.
LK losses are easy to implement, introduce no computational overhead and can be directly integrated into any existing speculator training framework, making them a compelling alternative to the existing draft training objectives.
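
To make the gap between the two objectives concrete, here is a minimal toy sketch. It is not the LK loss from the paper. It assumes the standard speculative sampling rule, under which the per-token acceptance probability is the sum over the vocabulary of `min(p(x), q(x))` for target distribution `p` and draft distribution `q`, and it assumes the KL baseline is the forward direction `KL(p || q)`. The three-token distributions and the function names are hypothetical, chosen only for illustration.

```python
import math


def acceptance_rate(p, q):
    """Expected acceptance probability under standard speculative sampling:
    sum_x min(p(x), q(x)), which equals 1 minus the total variation distance."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """Forward KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy distributions over a three-token vocabulary.
target = [0.60, 0.30, 0.10]
draft_a = [0.45, 0.35, 0.20]  # spreads its error across all tokens
draft_b = [0.66, 0.33, 0.01]  # nearly drops the rarest token

for name, draft in (("draft_a", draft_a), ("draft_b", draft_b)):
    kl = kl_divergence(target, draft)
    acc = acceptance_rate(target, draft)
    print(f"{name}: KL = {kl:.4f}, acceptance = {acc:.4f}")
```

In this toy case `draft_a` has the lower KL divergence while `draft_b` has the higher acceptance rate, so ranking the two drafts by KL and ranking them by acceptance rate disagree. This is the kind of mismatch that can matter when a capacity-limited draft model cannot reach the shared global optimum.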

### Related resources

https://huggingface.co/papers/2602.23881

https://arxiv.org/abs/2602.23881
