When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding

arXiv Stat

CC BY

This outlet displays the full text directly under a public or free license.

Abstract

Speculative decoding accelerates language model inference by using a fast drafter to propose candidate tokens that are then verified by a larger target model.

Existing theory largely studies the stochastic, distribution-preserving setting, where the goal is to exactly sample from the target distribution.

In contrast, many practical systems use greedy decoding, relaxed acceptance rules, or tree-based candidate sets, where success is governed by local ranking and threshold events rather than exact distributional equality.

We develop a theory for these regimes.

We identify that many common acceptance criteria have rejection regions that can be characterized as lower level sets of the target distribution.

For these criteria, we characterize the exact KL divergence required for rejection, yielding exact certificates and sharp margin-based bounds for strict greedy decoding, additive and multiplicative relaxed acceptance, top-m relaxed criteria, and entropy-thresholded acceptance.
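To make the "lower level set" idea concrete, the following is a minimal sketch of four of the acceptance rules named above, written as threshold tests on the target distribution. The function names, the parameters `delta`, `gamma`, and `m`, and the exact form of each rule are illustrative assumptions, not the paper's definitions; entropy-thresholded acceptance is omitted because the abstract does not specify it. In each case a draft token is rejected exactly when its target probability falls below a threshold, so the rejection region is a lower level set of the target distribution.

```python
# Illustrative sketch only: rule definitions and names are assumptions,
# not taken from the paper. p is the target distribution as a list of
# probabilities indexed by token id; x is the drafted token id.

def greedy_accept(p, x):
    # Strict greedy: accept only a token that attains the target maximum.
    return p[x] >= max(p)

def additive_accept(p, x, delta):
    # Additive relaxation (assumed form): within delta of the maximum.
    return p[x] >= max(p) - delta

def multiplicative_accept(p, x, gamma):
    # Multiplicative relaxation (assumed form): at least gamma * maximum.
    return p[x] >= gamma * max(p)

def top_m_accept(p, x, m):
    # Top-m relaxation (assumed form): at least the m-th largest probability.
    threshold = sorted(p, reverse=True)[m - 1]
    return p[x] >= threshold

p = [0.5, 0.3, 0.2]
results = {
    "greedy": greedy_accept(p, 1),
    "additive": additive_accept(p, 1, delta=0.25),
    "multiplicative": multiplicative_accept(p, 1, gamma=0.5),
    "top_m": top_m_accept(p, 1, m=2),
}
```

With this toy distribution, token 1 fails the strict greedy test but passes each relaxed test, which is the sense in which relaxed rules enlarge the acceptance region.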

We then extend the framework to greedy tree decoding, deriving exact and margin-only certificates for when the target greedy token remains covered by the drafter's top-m candidates.
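The coverage event described here can be sketched as a simple check: does the target's greedy token appear among the drafter's m highest-probability tokens? The helper names and the toy distributions below are hypothetical, and the sketch only tests the event itself; it does not compute the paper's certificates.

```python
# Illustrative sketch only: names and example values are assumptions.
# p is the target distribution, q the drafter distribution, both lists
# of probabilities indexed by token id.

def top_m_tokens(q, m):
    # Token ids of the m largest drafter probabilities, largest first.
    return sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:m]

def greedy_token_covered(p, q, m):
    # True when the target's argmax token is among the drafter's top-m.
    target_token = max(range(len(p)), key=lambda i: p[i])
    return target_token in top_m_tokens(q, m)

p = [0.40, 0.35, 0.25]  # target prefers token 0
q = [0.30, 0.45, 0.25]  # drafter prefers token 1

covered_m1 = greedy_token_covered(p, q, m=1)
covered_m2 = greedy_token_covered(p, q, m=2)
```

In this example the drafter's single best guess misses the target's greedy token, but widening the candidate set to two tokens covers it.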

Finally, we evaluate the resulting certificates on Qwen3 models, showing that relaxed and tree-based criteria substantially enlarge the region of certified acceptance, especially on decoding steps where the target model's distribution has a low margin.

These results complement existing distribution-preserving analyses of speculative decoding by characterizing the deterministic local acceptance events common in practical inference systems.
