DeepSeek has released a new technical paper detailing “Engram,” a conditional-memory technique that gives AI models a queryable database of information held in system memory. By committing common sequences of data to this static memory, Engram achieves demonstrably higher performance on long-context queries. The approach reduces how much the model must reason out on the fly, freeing GPU compute for more complex tasks. Crucially, it improves performance while easing the industry’s heavy reliance on scarce High-Bandwidth Memory (HBM).
The paper details how N-grams, short contiguous sequences of tokens, are integrated into the model’s neural network as a queryable memory bank. Engram lets models simply “remember” facts rather than reason them out, a process that is far more computationally expensive. Released on the company’s GitHub page, Engram aims to curb reliance on costly GPU-attached memory by committing its knowledge library to more common system memory standards, such as CXL, so that static memory can be held separately from an LLM’s compute.
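For readers who think in code, the core idea can be pictured as a hash-indexed table of n-gram vectors living in ordinary host DRAM rather than on the GPU. The snippet below is a minimal sketch under that assumption; the class name, hashing scheme, and sizes are invented for illustration and are not DeepSeek’s actual implementation.

```python
# Minimal sketch (illustrative, not DeepSeek's implementation): an n-gram
# memory bank held in plain host DRAM, keyed by hashed token sequences and
# queried during the forward pass instead of being recomputed by the network.
import numpy as np

class NgramMemoryBank:
    def __init__(self, num_slots: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # The table lives in ordinary system memory (DRAM / CXL), not on the GPU.
        self.table = rng.standard_normal((num_slots, dim)).astype(np.float32)
        self.num_slots = num_slots

    def _slot(self, ngram: tuple[int, ...]) -> int:
        # Deterministic hash of a token-ID n-gram into a table slot.
        return hash(ngram) % self.num_slots

    def lookup(self, token_ids: list[int], n: int = 2) -> np.ndarray:
        # Fetch the stored vector for every n-gram in the sequence.
        vectors = [
            self.table[self._slot(tuple(token_ids[i : i + n]))]
            for i in range(len(token_ids) - n + 1)
        ]
        return np.stack(vectors)

# Example: retrieve memory vectors for the bigrams of a short token sequence.
bank = NgramMemoryBank(num_slots=100_000, dim=128)
memory_vectors = bank.lookup([101, 2054, 2003, 1996, 3007, 102], n=2)
print(memory_vectors.shape)  # (5, 128): one stored vector per bigram
```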
As detailed in the paper, an Engram-based model scaled to nearly 27 billion parameters can outperform a standard Mixture of Experts (MoE) model in long-context training. Standard MoE models rely on “conditional computation,” forcing the model to reconstruct the same pieces of knowledge every time they are referenced. Engram eliminates this waste by first asking, “Do I already have this data?” This avoids what the paper describes as “expensive runtime reconstruction of a static lookup table,” saving valuable sequential depth for higher-level reasoning.
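The contrast the paper draws is loosely analogous to memoization: rather than rebuilding a static mapping on every reference, keep it in a lookup table and fetch it. The toy sketch below illustrates only that lookup-before-compute pattern; the function names are hypothetical, and the real Engram tables are prepared ahead of time rather than filled in at runtime.

```python
# Toy analogy only (hypothetical names, not the paper's mechanism): check a
# static table before recomputing. Conditional computation pays the compute
# cost on every reference; conditional memory pays it at most once.
import numpy as np

def expensive_recompute(ngram: tuple[int, ...], dim: int = 128) -> np.ndarray:
    # Stand-in for pushing the n-gram back through expert layers each time.
    rng = np.random.default_rng(abs(hash(ngram)) % (2**32))
    return rng.standard_normal(dim).astype(np.float32)

memory: dict[tuple[int, ...], np.ndarray] = {}

def conditional_memory(ngram: tuple[int, ...]) -> np.ndarray:
    # "Do I already have this data?" Look it up first, compute only on a miss.
    if ngram not in memory:
        memory[ngram] = expensive_recompute(ngram)
    return memory[ngram]

# The first reference pays the compute cost; later references are a cheap fetch.
first = conditional_memory((2054, 2003))
second = conditional_memory((2054, 2003))
assert second is first  # retrieved from memory, not reconstructed
```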
Engram is distinct from solutions like Nvidia’s KVCache, which offloads context data to NVMe storage. While KVCache acts as a short-term aid for remembering recent conversation history, akin to storing handwritten notes, Engram acts as a persistent record of a whole encyclopedia. Through tokenizer compression and “Multi-Head Hashing,” Engram reduces vocabulary size and enables rapid lookup of stored information, while context-aware gating ensures that similar phrases (like “Universal” vs. “Universal Studios”) are retrieved as distinct concepts.
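A rough way to picture multi-head hashing with context-aware gating is sketched below: the same n-gram is hashed by several independent functions into separate tables, and a gate driven by the surrounding hidden state weights the retrieved candidates, so phrases that collide in one table can still be told apart. The structure, names, and gating form here are assumptions for illustration, not details from the paper.

```python
# Assumed structure for illustration (names and gating form are not from the
# paper): each n-gram is hashed by several independent hash functions into
# separate tables, and a gate computed from the surrounding hidden state
# weights the retrieved candidates, keeping colliding phrases distinguishable.
import numpy as np

class MultiHeadHashMemory:
    def __init__(self, num_heads: int, slots_per_head: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.tables = rng.standard_normal((num_heads, slots_per_head, dim)).astype(np.float32)
        self.gate_w = rng.standard_normal((dim, num_heads)).astype(np.float32)
        self.slots = slots_per_head
        self.heads = num_heads

    def _slot(self, ngram: tuple[int, ...], head: int) -> int:
        # A different salt per head approximates independent hash functions.
        return hash((head, ngram)) % self.slots

    def retrieve(self, ngram: tuple[int, ...], context: np.ndarray) -> np.ndarray:
        # One candidate vector per head.
        candidates = np.stack(
            [self.tables[h, self._slot(ngram, h)] for h in range(self.heads)]
        )
        # Context-aware gate: softmax over heads driven by the hidden state.
        logits = context @ self.gate_w
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        return weights @ candidates  # weighted mix of the retrieved vectors

mem = MultiHeadHashMemory(num_heads=4, slots_per_head=10_000, dim=128)
hidden_state = np.zeros(128, dtype=np.float32)  # stand-in for the model's context
vec = mem.retrieve((3098, 4835), context=hidden_state)
print(vec.shape)  # (128,)
```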
DeepSeek also explored the optimal balance between memory and compute, discovering a “U-curve” where allocating roughly 20–25% of the sparse parameter budget to Engram yields the best performance. In an experiment dubbed the “Infinite Memory Regime,” they found that performance scales linearly with memory size even when the compute budget is fixed. This implies that future AI improvements may not be solely bound by compute power, but could be achieved by expanding long-term “Engram” memory banks using standard DRAM within data centers.
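As a back-of-the-envelope reading of those percentages (not figures from the paper), a 27-billion-parameter sparse budget with 20–25% allocated to Engram would place roughly 5.4 to 6.75 billion parameters in DRAM-resident memory tables:

```python
# Back-of-the-envelope reading of the percentages above (not figures from the
# paper): the share of a 27B sparse parameter budget that the reported
# 20-25% sweet spot would place in DRAM-resident Engram tables.
sparse_budget = 27e9
for frac in (0.20, 0.25):
    engram_params = sparse_budget * frac
    print(f"{frac:.0%} of budget -> {engram_params / 1e9:.2f}B parameters in Engram tables")
```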
The performance results highlight the potential of this architecture. In parallel testing, an Engram-27B model surpassed a standard 27B MoE model by 3.4 to 4 points in knowledge-intensive tasks and saw a massive leap in “Needle in a Haystack” long-context accuracy, scoring 97% compared to the MoE’s 84.2%. With DeepSeek viewing conditional memory as an “indispensable modeling primitive,” industry observers suggest this technology could be central to the rumored DeepSeek V4, potentially shifting hardware demand from HBM to standard system DRAM.
