The memory wall has re-emerged as a critical bottleneck, now defined not by processor-to-memory speed gaps but by DRAM’s inability to scale capacity for AI inference workloads.
The evolving memory wall
The term “memory wall,” coined in the 1990s, originally described the widening latency gap between CPUs and DRAM. For three decades, cache hierarchies, prefetching, and interleaving masked this divide. Today, however, AI models growing from billions to trillions of parameters expose a more fundamental constraint: capacity scaling. Rising DRAM and HBM costs, growing energy consumption, and diminishing returns from traditional optimization techniques signal that the architecture itself must change.
AI inference redefines memory demands
Large language models and growing context sizes, driven by retrieval-augmented generation, chain-of-thought reasoning, and user-specific data, require key-value (KV) caches whose memory footprint often exceeds that of the model weights. AI inference workloads are predominantly read-heavy and latency-tolerant, with deterministic, prefetch-friendly access patterns. This makes HBM’s narrow focus on raw bandwidth insufficient on its own. The bottleneck shifts from speed to the efficient orchestration of high-capacity, sequential data retrieval.
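To make the claim about KV caches outgrowing model weights concrete, here is a back-of-the-envelope sketch. It assumes a standard transformer attention cache (one key and one value vector per layer, per KV head, per token) and 16-bit storage. The function names and the model configuration (80 layers, 8 KV heads, head dimension 128, 70 billion parameters, 131,072-token context, batch of 8) are hypothetical values chosen for illustration, not figures from any specific deployment.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Estimate KV cache size for a standard attention cache.

    Assumption: every layer stores one key and one value vector
    (hence the factor of 2) per KV head for every cached token.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


def weight_bytes(n_params, bytes_per_param=2):
    """Estimate the size of the model weights at a given precision."""
    return n_params * bytes_per_param


# Hypothetical configuration, for illustration only.
kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                    seq_len=131_072, batch_size=8)
weights = weight_bytes(70_000_000_000)

print(f"KV cache: {kv / 2**30:.0f} GiB")
print(f"Weights:  {weights / 2**30:.0f} GiB")
print(f"KV cache exceeds weights: {kv > weights}")
```

Under these assumed numbers the cache is more than twice the size of the weights, and it grows linearly with both context length and the number of concurrent requests, while the weights stay fixed.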
High-bandwidth flash as an alternative
High-bandwidth flash emerges as a scalable alternative that addresses this capacity constraint. Leveraging NAND’s density advantages through stacking and wafer bonding (e.g., CMOS directly bonded to array technology), these architectures deliver higher capacity than HBM at lower cost and power. While their latency is higher than DRAM’s, AI inference is increasingly bandwidth-bound, and its predictable access patterns allow that latency to be hidden by prefetching. High-bandwidth flash excels at large-granularity reads via concurrent array access, and its non-volatility enables persistent KV cache reuse for long-term memory. Thermal stability in high-energy environments further positions it as a practical solution.
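One way to see why large-granularity reads and concurrent array access matter is a toy latency-plus-transfer model. The sketch below does not model any specific device. The function name and the latency, peak bandwidth, and read sizes are hypothetical, chosen only to show the shape of the trade-off.

```python
def effective_bandwidth(read_bytes, latency_s, peak_bw, in_flight=1):
    """Sustained bytes/s under a simple model, capped at peak bandwidth.

    Assumption: each request costs a fixed access latency plus the
    transfer time, and `in_flight` independent requests overlap fully.
    """
    per_request_time = latency_s + read_bytes / peak_bw
    return min(peak_bw, in_flight * read_bytes / per_request_time)


# Hypothetical device parameters, for illustration only.
LATENCY_S = 10e-6   # assumed fixed access latency per request
PEAK_BW = 1e12      # assumed peak bandwidth in bytes per second

small = effective_bandwidth(4 * 1024, LATENCY_S, PEAK_BW)
large = effective_bandwidth(16 * 1024**2, LATENCY_S, PEAK_BW)
large_concurrent = effective_bandwidth(16 * 1024**2, LATENCY_S, PEAK_BW,
                                       in_flight=4)

for label, bw in [("4 KiB, 1 in flight", small),
                  ("16 MiB, 1 in flight", large),
                  ("16 MiB, 4 in flight", large_concurrent)]:
    print(f"{label}: {bw / PEAK_BW:.1%} of peak")
```

In this model, small reads are dominated by the fixed latency and reach only a tiny fraction of peak bandwidth, while large reads amortize it and a few overlapping requests are enough to saturate the link. That is the regime the read-heavy, prefetch-friendly access patterns of inference make reachable.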
Implications for data center design
Historically, data centers balanced compute and memory by partitioning workloads across multiple expensive accelerators, a strategy that left compute underutilized whenever memory capacity, rather than compute demand, dictated the accelerator count, but that was justified at scale. For smaller enterprises or heterogeneous workloads, this approach becomes inefficient. High-bandwidth flash offers a more direct path: optimizing when and how data is retrieved rather than brute-forcing bandwidth.
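The cost of partitioning purely for memory capacity can be sketched with simple arithmetic. The function names and all numbers below (a 460 GiB working set, 80 GiB of memory per accelerator, and compute demand equivalent to two accelerators) are hypothetical and serve only to illustrate how capacity-driven sizing leaves compute idle.

```python
import math


def accelerators_for_capacity(footprint_bytes, memory_per_accelerator_bytes):
    """Accelerators needed just to hold the working set in device memory."""
    return math.ceil(footprint_bytes / memory_per_accelerator_bytes)


def compute_utilization(needed_for_compute, deployed):
    """Fraction of deployed compute the workload actually uses."""
    return needed_for_compute / deployed


GIB = 2**30

# Hypothetical workload and hardware, for illustration only.
deployed = accelerators_for_capacity(460 * GIB, 80 * GIB)
utilization = compute_utilization(needed_for_compute=2, deployed=deployed)

print(f"Accelerators deployed for capacity: {deployed}")
print(f"Compute utilization: {utilization:.0%}")
```

With these assumed values, memory capacity alone forces six accelerators even though two would cover the compute, so roughly two thirds of the purchased compute sits idle.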
The memory wall in the AI era is not about speed but about capacity and data orchestration. Relying solely on DRAM and HBM will constrain architectural innovation. High-bandwidth flash provides a scalable, efficient memory alternative tailored to inference-driven workloads, where performance is determined by the efficiency of data flow, not raw latency.
