We are happy to share that RetentiveKV has been accepted to ACL 2026 Findings.
Multimodal Large Language Models face severe challenges in computational efficiency and memory consumption due to the substantial expansion of the visual KV cache when processing long visual contexts. Existing KV cache compression methods typically rely on the "persistence of importance" hypothesis to prune tokens — but this proves fragile in multimodal settings. RetentiveKV takes a fundamentally different approach: instead of permanently discarding tokens, we reformulate KV eviction from discrete context truncation to continuous memory evolution based on State Space Models.
The problem: why importance-based eviction fails for multimodal models
Current KV cache compression methods assume that tokens with high initial attention remain critical throughout decoding. However, we identify two key issues that break this assumption in multimodal contexts:
-
Deferred Importance. Visual tokens often exhibit low initial salience but become pivotal at later decoding timesteps. Early decoding prioritizes semantic comprehension of the textual query, while attention pivots toward visual tokens only when generating responses requiring specific visual attributes. Importance-based pruning prematurely evicts these "deferred-critical" tokens.
-
Visual Continuity Collapse. Visual tokens are characterized by intrinsic spatial continuity and strong inter-patch correlations. Discrete eviction of KV pairs fragments these continuous representations, disrupting spatial coherence. Existing modality-aware methods that merely adjust compression ratios fail to fundamentally address this structural discontinuity.
RetentiveKV: from discrete pruning to continuous state evolution
Drawing inspiration from State Space Models (SSMs), RetentiveKV assimilates low-attention tokens into a continuous state space governed by information entropy, rather than permanently discarding them. This preserves visual tokens exhibiting high uncertainty during early decoding and updates them continuously as decoding progresses — all with only O(1) memory overhead.
The framework consists of three core components:

1. Entropy-Guided KV Retention Estimator
Traditional eviction uses only immediate attention scores. RetentiveKV adds a prospective uncertainty dimension: we leverage the entropy of the cross-modal attention distribution (from textual tokens to visual tokens) to estimate the "undetermined potential" of low-attention tokens.
The retention score combines both signals:
Rtl,i=λαtl,i+(1−λ)Htl,i
where α is the attention score and H is the cross-modal attention entropy. High entropy indicates diffuse attention across visual tokens — implying high uncertainty and potential relevance for subsequent decoding.
2. Entropy-Guided State Transition
Instead of binary keep/discard decisions, evicted KV pairs are absorbed into a continuous state space. The state evolves via a recursive update:
St=Ht⊙St−1+At⊙(kt⊤vt)
where Ht is the retention matrix (modulated by token-level entropy) and At is the absorption matrix (modulated by accumulated attention). This formulation preserves selective attention to historical segments while naturally incorporating positional and spatial information.
3. Dual-State Architecture
Inspired by hierarchical human memory, RetentiveKV maintains two complementary states:
| State | Purpose | Design |
|---|
| Visual-Dominant State (SV) | Preserve spatial topology of visual patches | Spatially-aware SSM with root-patch conditioning and Manhattan-distance decay |
| Recall-Oriented State (ST) | Capture long-term semantic dependencies | Sliding window for recent tokens + entropy-guided absorption for evicted tokens |
During inference, a Query-Conditioned Retrieval mechanism dynamically queries both states, retrieving and reactivating deferred-critical information only when semantic relevance is detected:
Ot=Attnlocal(qt,Klocal,Vlocal)+OS(t)
Experimental results
We evaluate RetentiveKV across three MLLM architectures (LLaVA-v1.5-7B, Qwen3-VL-4B, Qwen3-VL-8B) on eight benchmarks covering document understanding, mathematical reasoning, holistic perception, conversational QA, and embodied AI.
Key results:
- Fine-grained visual perception (DocVQA, TextVQA): +2.1 points over importance-centric baselines, thanks to entropy-guided preservation of high-uncertainty visual tokens.
- Long-context reasoning (MMCoQA, ALFRED): +3.0 points over modality-aware baselines, demonstrating effective retrieval of deferred-critical information under long-horizon reasoning.
- Efficiency: At 5% cache budget, RetentiveKV reduces memory from 2.24 GiB to 0.23 GiB (achieving 5× KV cache compression) and accelerates decoding by 1.75× (from 32.15 ms/token to 18.42 ms/token).
| Method | Budget | Latency | Memory |
|---|
| Full Cache | 100% | 32.15 ms/token | 2.24 GiB |
| RetentiveKV | 50% | 27.84 ms/token | 1.18 GiB |
| RetentiveKV | 20% | 21.42 ms/token | 0.46 GiB |
| RetentiveKV | 5% | 18.42 ms/token | 0.23 GiB |
Under extreme compression (5%–10% budget), conventional methods suffer abrupt performance degradation, confirming the Deferred Importance hypothesis. RetentiveKV maintains robust performance stability across all budget levels.
Ablation highlights
| Component | Avg. impact |
|---|
| Query-Conditioned Retrieval (QR) | −4.6% to −5.4% without it |
| Modality-Agnostic dual state (MA) | −3.1% with unified state |
| Entropy-driven metric (EM) | −2.9% (MMCoQA), −4.6% (DocVQA) without it |
The query-conditioned retrieval mechanism is the most critical component — it enables the model to recall specific visual and textual cues that were previously evicted, mitigating irreversible information loss.
Takeaway
RetentiveKV demonstrates that KV cache eviction need not be a one-way operation. By reinterpreting low-attention tokens through information entropy and absorbing them into continuously evolving state spaces, we preserve both spatial continuity and deferred semantic relevance — achieving substantial compression without the fragility of discrete pruning.
Further reading