Product · 1/14/2026
vLLM's Wide Expert Parallelism Makes DeepSeek Inference 10x More Efficient at Scale
The vLLM team just published benchmarks showing 2,200 tokens per second per H200 GPU for DeepSeek inference. Their 'wide-ep' approach could reshape the economics of serving massive mixture-of-experts models in production.
AI Infrastructure · Open Source · Machine Learning