Announcing Day-0 Support for NVIDIA Nemotron 3 Ultra on vLLM
vLLM Blog
19 items · Foundation Models & Frontier AI Labs
Fast & Efficient LLM Inference with vLLM: A New Course with DeepLearning.AI
Session-Aware Agentic Routing: Continuity-Aware Model Selection for Long-Horizon LLM Agents
Accelerating vLLM-Omni Inference with AutoRound Quantization
vLLM on the DGX Spark: Architecture, Configuration, and Local Evaluation
Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor
Native RL APIs in vLLM
Speculators v0.5.0: DFlash Support and Online Training
From Text to Multimodal Routing: Hardening Vision Signals in vLLM Semantic Router
EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec
vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache
Elastic Expert Parallelism in vLLM
Announcing VeRL-Omni: Easy, Fast, and Stable RL Training for Diffusion and Omni-Modality Models
A First Comprehensive Study of TurboQuant: Accuracy and Performance
vLLM Tops the Artificial Analysis Leaderboard
Serving Agentic Workloads at Scale with vLLM x Mooncake
Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM
DeepSeek V4 in vLLM: Efficient Long-context Attention
The State of FP8 KV-Cache and Attention Quantization in vLLM