Ranking Before Serving: Low-Latency LLM Serving via Pairwise Learning-to-Rank
이 뉴스, 어떠셨어요?
한 번의 탭으로 반응을 남겨요 · 로그인 불필요
Abstract
Efficient scheduling of large language model (LLM) inference tasks is critical for achieving low latency and high throughput, a challenge that is becoming increasingly acute with the rise of reasoning-capable LLMs whose generation lengths are highly variable.
Traditional strategies like First Come, First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them.
In this paper, we introduce PARS, a prompt-aware LLM task scheduler that mitigates HOL blocking by approximating shortest-job-first (SJF) scheduling through pairwise ranking with a margin ranking loss.
PARS effectively predicts response-length-based task ordering directly from prompts, thereby optimizing scheduling decisions with minimal overhead.
In addition, it integrates seamlessly with vLLM, a state-of-the-art LLM serving system, for the research community.
Extensive experiments across multiple LLM models and real-world inference use cases, including chat, math, and code generation, demonstrate that PARS significantly reduces latency by up to 15.7x compared to the vLLM default scheduler.
Cross-model evaluations demonstrate that our design generalizes effectively, allowing effective scheduling across diverse LLMs without requiring model-specific retraining.