Ray Serve LLM Enhances Distributed Inference with 24x Boost

Ray Serve LLM, a framework for distributed large language model (LLM) inference, has gained a set of optimizations that deliver up to 24x higher throughput on decode-heavy workloads. The updates, developed in collaboration with Google Kubernetes Engine (GKE), target the main performance bottlenecks in scalable, low-latency LLM deployment.

Three major architectural upgrades are driving this leap in performance:

Direct Streaming: Introduced in Ray 2.56, this feature decouples routing decisions from response streaming. HAProxy establishes direct HTTP connections to the target replicas, so responses no longer pass through intermediate routing layers. The result is lower latency and a lower time per output token (TPOT), especially for decode-heavy tasks.
vLLM Ray Executor Backend V2: The revamped backend leverages asynchronous scheduling and improved process management to optimize inference pipelines. Included by default in vLLM 0.21.0, the backend facilitates better resource utilization and reduces orchestration overhead.
HAProxy Integration: A C-based HAProxy ingress load balancer, combined with optimizations such as disabling Nagle’s algorithm (which otherwise holds back small TCP writes so they can be coalesced), significantly improves throughput and streaming performance. These updates are available in Ray’s latest container images.
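
To make the Nagle's algorithm point concrete, here is a minimal sketch of what disabling it looks like at the socket level. It uses Python's standard `socket` module only; the helper name `make_low_latency_socket` is hypothetical, and this is not how HAProxy or Ray is configured internally, just an illustration of the underlying TCP option.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (illustrative helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 asks the kernel to send small writes immediately
    # instead of holding them back to coalesce with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

Token streaming produces many small writes, one per token or small group of tokens, which is the traffic pattern where coalescing delays are most visible.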

Benchmarks quantify the impact of these updates. In prefill-heavy workloads with an input sequence length (ISL) of 8,000 and an output sequence length (OSL) of 50, Ray Serve LLM achieved 4.4x higher throughput than its baseline. On decode-heavy workloads (ISL 50, OSL 500), it delivered a 24x improvement. Realistic agentic multi-turn scenarios, simulating coding agent interactions, confirm that Ray Serve LLM now matches or outperforms the vLLM router on key metrics like time-to-first-token (TTFT) and throughput.
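
For readers less familiar with the metrics, the sketch below shows one common way to compute TTFT and TPOT from the arrival times of streamed tokens. The definitions used here (TTFT as the delay until the first token, TPOT as the mean gap between subsequent tokens) are widely used conventions, assumed rather than taken from the benchmark itself; the names `StreamMetrics` and `stream_metrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token after the first, in seconds


def stream_metrics(request_sent_s: float, token_times_s: Sequence[float]) -> StreamMetrics:
    """Derive TTFT and TPOT from one request's token arrival timestamps."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_sent_s
    n = len(token_times_s)
    # TPOT averages the gaps between consecutive tokens; with a single
    # token there are no gaps, so it is reported as 0.0.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return StreamMetrics(ttft_s=ttft, tpot_s=tpot)
```

A request with an OSL of 500 has 499 inter-token gaps, so any per-token overhead in the streaming path is paid hundreds of times. That is why decode-heavy workloads are the most sensitive to it.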

For enterprises scaling LLMs across multi-GPU and multi-node clusters, these updates matter. Ray Serve LLM supports prefill-decode disaggregation, meaning the prompt processing (prefill) and token generation (decode) phases can scale independently. This flexibility, combined with Ray’s fault tolerance and observability features, makes it a versatile choice for production-grade LLM serving.
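
A rough back-of-the-envelope sketch shows why independent scaling is useful. It sizes a prefill pool and a decode pool separately from the token load each phase sees. The function names and all capacity figures are hypothetical placeholders chosen for illustration; they are not Ray Serve LLM APIs or measured numbers.

```python
import math


def replicas_needed(load_tokens_per_s: float, capacity_tokens_per_s: float) -> int:
    """Smallest replica count (at least 1) whose combined capacity covers the load."""
    if capacity_tokens_per_s <= 0:
        raise ValueError("capacity must be positive")
    return max(1, math.ceil(load_tokens_per_s / capacity_tokens_per_s))


def plan_pools(requests_per_s: float, isl: int, osl: int,
               prefill_capacity: float, decode_capacity: float) -> dict:
    """Size the prefill and decode pools independently of each other."""
    prefill_load = requests_per_s * isl  # prompt tokens to process per second
    decode_load = requests_per_s * osl   # output tokens to generate per second
    return {
        "prefill": replicas_needed(prefill_load, prefill_capacity),
        "decode": replicas_needed(decode_load, decode_capacity),
    }
```

With the assumed capacities of 40,000 prefill tokens/s and 2,000 decode tokens/s per replica, `plan_pools(10, 8000, 50, 40000, 2000)` asks for two prefill replicas and one decode replica, while `plan_pools(10, 50, 500, 40000, 2000)` asks for one and three. A deployment that scales both phases together would have to provision for the larger of the two in every replica.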

Ray Serve LLM’s direct streaming and enhanced vLLM backend are particularly well-suited for workloads requiring high concurrency and low latency. For example, in a test using eight Qwen3-0.6B replicas, Ray Serve LLM matched or outperformed the vLLM router on TTFT (e.g., 355 ms vs. 389 ms in prefill-heavy scenarios) and in decode-heavy workloads (165 ms vs. 190 ms). The improved efficiency stems from HAProxy’s direct connections and reduced routing overhead.
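
To put the two latency pairs quoted above in relative terms, the following small calculation may help. It assumes that both pairs are latencies where lower is better and treats the vLLM router figure as the baseline; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline_ms: float, improved_ms: float) -> float:
    """Fraction by which improved_ms is lower than baseline_ms."""
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    return (baseline_ms - improved_ms) / baseline_ms


if __name__ == "__main__":
    # Figures quoted in the article: (vLLM router, Ray Serve LLM)
    for label, base, new in [("prefill-heavy", 389, 355), ("decode-heavy", 190, 165)]:
        print(f"{label}: {relative_reduction(base, new):.1%} lower")
```

This works out to roughly 8.7% lower in the prefill-heavy case and 13.2% lower in the decode-heavy case.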

As the demand for LLM inference grows, these optimizations strengthen Ray Serve LLM’s position in the market. Balancing throughput, fault tolerance, and resource efficiency at scale has been a persistent challenge for LLM serving frameworks. By addressing it, Ray Serve LLM provides developers with a single, engine-agnostic platform that covers everything from independently scaled components to complex multi-replica deployments.

Developers can experiment with these features in Ray 2.56 and leverage updated container images, such as rayproject/ray-llm:2.56-py312-cu130, which include the latest optimizations. For more details on implementation, benchmarks, and configuration, visit the official announcement.

With these advancements, Ray Serve LLM is poised to power the next generation of distributed AI applications, enabling enterprises to deploy large-scale LLMs with unprecedented efficiency and reliability.
