This article introduces hierarchical sparse attention: the full KV Cache is stored on the CPU, while the GPU keeps only a Top-k LRU Buffer.
This article introduces a dual memory-pool inference framework enabling efficient hybrid Transformer-Mamba model execution by resolving conflicting caching mechanisms.
This article introduces the architecture and implementation of Tair KVCache Manager, an open-source enterprise-grade global KVCache management service for scalable Agentic AI inference.
This article introduces engineering optimizations to 3FS—KVCache's foundation layer—across performance, productization, and cloud-native management for scalable AI inference.
This article shows how SGLang RBG + Mooncake enable production-grade, cloud-native LLM inference with PD-disaggregation.
Alibaba Cloud has unveiled AI Lakebase architecture alongside a suite of upgrades for its flagship database, PolarDB at PolarDB Developer Conference in China.
This article introduces HiCache, a hierarchical KVCache infrastructure developed by Alibaba Cloud Tair and SGLang to optimize performance and memory capacity for long-context "agentic" LLM inference.