I shipped Google's TurboQuant as a vLLM plugin 72 hours after the paper — here's what nobody else tested

Source: DEV Community
Google published TurboQuant at ICLR 2026 — a technique that compresses transformer KV caches to 4 bits per coordinate with zero accuracy loss. The paper reports 5-6x memory reduction on H100 GPUs, tested on text models like Gemma and Mistral. I wanted to know: does it work on a vision-language model processing video? On a consumer GPU? 72 hours later, turboquant-vllm is on PyPI.

## Quick Start

```shell
pip install turboquant-vllm[vllm]
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
```

That's it. The plugin auto-registers via vLLM's entry point system. No code changes, no forking, no monkey-patching.

For HuggingFace users:

```python
from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
# Pass cache (not wrapper) to model.generate()
```

## Why Vision-Language Models Matter

Every other TurboQuant implementation tests on text-only models with hundreds of tokens. But a 12-second video clip th
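To make the "4 bits per coordinate" idea concrete, here is a minimal sketch of quantizing a KV block to 4-bit codes with plain per-row uniform quantization. This is illustrative only: `quantize_4bit` and `dequantize_4bit` are hypothetical helpers, and TurboQuant's actual scheme is considerably more sophisticated than this.

```python
import numpy as np

def quantize_4bit(x: np.ndarray):
    """Uniformly quantize each row of x to 4-bit codes in [0, 15].

    Illustrative sketch, not TurboQuant's algorithm: each row gets its
    own offset (lo) and scale, and values are rounded to 16 levels.
    """
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    codes = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return codes, lo, scale

def dequantize_4bit(codes: np.ndarray, lo: np.ndarray, scale: np.ndarray):
    """Reconstruct approximate float values from 4-bit codes."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 128)).astype(np.float32)  # 8 tokens, head_dim=128
codes, lo, scale = quantize_4bit(kv)
recon = dequantize_4bit(codes, lo, scale)
err = np.abs(kv - recon).max()  # bounded by scale/2 per row

# Two 4-bit codes pack into one byte, so raw storage drops ~8x vs fp32
# (~4x vs fp16), before counting the per-row scale/offset overhead.
```

The reconstruction error here is bounded by half a quantization step per coordinate; the interesting part of the paper (and of any real implementation) is keeping that error from compounding through attention, which this toy sketch does not address.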