The LLM Efficiency Wars: Seven Breakthroughs Rewriting the Rules of AI

#AI Infrastructure #Article #BitNet #LLM #Quantization #TurboQuant

By Deepak Pachiannan · May 9, 2026 · 13 min read

For most of AI’s modern era, the playbook was simple: more parameters, more compute, more data. Bigger models were better models, and “better” was worth almost any price. Hundred-billion-parameter behemoths ran on clusters of thousands of specialised chips, cost tens of millions of dollars to train, and required racks of high-end GPUs just to serve a single user. The economics were brutal, access was gatekept, and the energy bill was staggering.

That era is ending. Not because we have run out of room to scale — but because the field has discovered something more interesting than raw size: targeted efficiency. In the span of roughly eighteen months, seven distinct breakthroughs have emerged, each attacking a different chokepoint in the AI pipeline. Together they are compressing what once required a data centre into something that runs on a laptop, shrinking inference costs by orders of magnitude, and making million-token context windows affordable for the first time.

This is not one story. It is seven interlocking ones, each with different authors, different methods, and different parts of the problem in their sights. But they share a common ambition: to make AI so efficient that the constraint is no longer hardware or budget — it is imagination.

6× KV cache compression
Google TurboQuant

1.58 bits per weight
Microsoft BitNet

1000× less attention compute
SubQ at 12M tokens

90% KV cache reduction
DeepSeek V4

Understanding the battlefield

Before diving into the individual breakthroughs, it helps to understand what they are each attacking. A large language model has a pipeline — a sequence of stages from training to output — and each stage has its own bottleneck. The seven techniques in this article map almost perfectly onto that pipeline.

Where each technique intervenes in the LLM stack

🏗 Training ← OSP: eliminates outlier weights before they form
🗜 Model weights ← BitNet: compresses to ternary (−1, 0, +1 only)
🧠 Attention architecture ← DeepSeek V4: redesigns attention with a CSA + HCA hybrid
📈 Attention scaling ← SubQ: makes compute grow linearly, not quadratically
💾 GPU memory I/O ← FlashAttention 4: tiles compute to minimise HBM↔SRAM transfers
🗄 KV cache ← TurboQuant: compresses stored keys and values 6× online
⚡ Token generation ← Speculative decoding: drafts and verifies multiple tokens in parallel

What makes this moment remarkable is not any single breakthrough but the fact that all seven are arriving simultaneously, from different labs, attacking different parts of the same machine. They are, almost accidentally, a complete stack.

The seven techniques, in depth

Technique 01 · Training phase OSP

Outlier-Safe Pre-Training

Target: weight outliers that silently destroy quantization quality

Most people trying to run an efficient LLM compress the model after training, a process called post-training quantization. The problem is that modern neural networks spontaneously develop extreme outlier values in certain weight dimensions. A handful of weights might be 100× larger than average, and because the quantization range has to stretch to cover them, the ordinary weights are left with almost no resolution once the model is squeezed into 4-bit or lower precision.
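To make the outlier problem concrete, here is a minimal sketch in plain Python. It uses simple symmetric absmax quantization as a stand-in (an assumption, not the scheme any particular toolkit uses), and the weight values are hypothetical: one hundred small weights, then the same weights with a single 100× outlier appended.

```python
def quantize_absmax(weights, bits=4):
    """Symmetric absmax quantization: snap floats to a signed integer grid and back."""
    qmax = 2 ** (bits - 1) - 1                      # 7 levels each side for 4-bit
    scale = max(abs(w) for w in weights) / qmax     # the largest weight sets the grid spacing
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Hypothetical weights, evenly spread over [-1, 1].
normal = [(i - 50) / 50 for i in range(101)]
# Same weights, but the last one is replaced by a 100x outlier.
with_outlier = normal[:-1] + [100.0]

restored_normal = quantize_absmax(normal)
restored_outlier = quantize_absmax(with_outlier)

# Compare the error on the 100 ordinary weights only.
err_normal = mean_abs_error(normal[:100], restored_normal[:100])
err_outlier = mean_abs_error(with_outlier[:100], restored_outlier[:100])

print(f"mean error without outlier: {err_normal:.3f}")
print(f"mean error with outlier:    {err_outlier:.3f}")
```

In this toy case the outlier stretches the grid so far that every ordinary weight rounds to zero, which is the failure mode the paragraph above describes.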

OSP, published in 2025, tackles this at the source by combining the Muon optimizer with modified normalization layers. Rather than detecting and working around outliers after the fact, OSP prevents them from forming during training. The result is a model that quantizes cleanly — without the accuracy penalties that typically come with aggressive compression.

It is the least glamorous of the seven techniques: no headline compression ratio, no dramatic benchmark score. But it is arguably the most foundational. If you want BitNet-style extreme quantization to work well, OSP is what you build on.

Technique 02 · Model weights BitNet b1.58

Microsoft BitNet b1.58

Target: the enormous memory and energy footprint of floating-point weights

A traditional neural network weight is stored as a 32-bit floating-point number — a precise decimal value anywhere in a vast continuous range. BitNet b1.58, released by Microsoft Research in April 2025, asks a radical question: what if every weight could only be one of three values? Negative one. Zero. Or positive one.

That is not a simplification. That is a complete reimagining of what a neural network weight is. At 1.58 bits per weight (log2 3 ≈ 1.585, the theoretical minimum to represent three states), storage requirements collapse by more than 95% relative to 32-bit floats, or by roughly 90% relative to the 16-bit formats most LLMs ship in. A model that previously needed 32GB of GPU memory might fit in under 2GB. A 100-billion-parameter model that previously required a server farm can run on a single laptop CPU.
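The arithmetic behind those figures can be checked in a few lines. The sketch below is illustrative only: the 8-billion-parameter count is a hypothetical chosen so that the 32-bit baseline comes to 32GB, and the base-3 packing helpers (`pack_trits`, `unpack_trits`) are one simple way to store ternary values, not a description of how bitnet.cpp lays out weights.

```python
import math

# Information content of a weight that can take one of three values.
bits_per_ternary = math.log2(3)          # about 1.585

def weights_gb(n_params, bits_per_weight):
    """Storage for n_params weights at the given precision, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

def pack_trits(trits):
    """Pack five ternary weights (-1, 0, +1) into one byte using base-3 digits."""
    if len(trits) != 5:
        raise ValueError("expected exactly five ternary weights")
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)      # map -1, 0, +1 to digits 0, 1, 2
    return value                          # at most 242, so it fits in 8 bits

def unpack_trits(byte):
    """Inverse of pack_trits."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits

# Hypothetical 8B-parameter model.
fp32_gb = weights_gb(8e9, 32)                    # 32.0
ternary_gb = weights_gb(8e9, bits_per_ternary)   # about 1.58
saving = 1 - ternary_gb / fp32_gb                # about 0.95

print(f"{fp32_gb:.1f} GB -> {ternary_gb:.2f} GB ({saving:.1%} smaller)")
print(unpack_trits(pack_trits([-1, 0, 1, 1, -1])))
```

Packing five weights per byte costs 1.6 bits each, which is why practical storage lands just above the 1.585-bit floor.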

A 100 billion parameter model running on a single CPU at human reading speed. That sentence would have been science fiction eighteen months ago.

The catch, and there always is one, is that BitNet models cannot be created by compressing existing models. They must be trained from scratch in this format. Microsoft’s open-source bitnet.cpp framework demonstrated that a 2B-parameter BitNet model, trained on 4 trillion tokens, achieves competitive quality on standard benchmarks. On x86 CPUs the speedup is 2.4× to 6.2×, with energy reduction of 72–82%. On a single CPU, a 100B BitNet model runs at 5–7 tokens per second — comparable to human reading speed. The implication is profound: frontier-scale intelligence, running offline, on commodity hardware, without a cloud subscription.

Technique 03 · Attention architecture DeepSeek V4

DeepSeek V4 — Hybrid Attention

Target: the KV cache memory footprint baked into the architecture itself

The Chinese AI lab DeepSeek caused its first earthquake in January 2025, when it released R1 — a reasoning model competitive with OpenAI’s best, trained in two months for under $6 million. The world sat up. Then, in April 2026, it did something structurally more interesting.

DeepSeek V4 introduced a hybrid attention architecture that combines two novel mechanisms: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). These are not incremental tweaks to the standard transformer. They fundamentally redesign how the model handles long sequences, interleaving full-precision attention layers with heavily compressed ones to achieve dramatic efficiency gains at scale.

At a one-million-token context, DeepSeek V4-Pro requires just 27% of the inference compute of its predecessor V3.2 — and only 10% of the KV cache memory. Those gains are baked into the architecture itself, not applied as a post-hoc layer. The 1.6-trillion-parameter Pro model (with 49B parameters active per token via MoE) is priced at $1.74 per million input tokens — cheaper than most mid-tier models from Western labs. It is also the first frontier model designed to run optimally on Huawei Ascend NPUs rather than Nvidia GPUs. The geopolitical implications of that alone are considerable.

Technique 04 · Attention scaling SubQ

SubQ — Subquadratic Sparse Attention

Target: the quadratic scaling law that makes long context economically prohibitive

The original transformer architecture, introduced in 2017, has a mathematical problem baked into its design: attention is quadratic. Double the context length and you quadruple the compute. This means that while labs have been racing to support one-million-token context windows, actually using them at full capacity has been economically prohibitive.
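A rough cost model shows where the quadrupling comes from. This is a sketch under simplifying assumptions: it counts only the two large matrix products of a single dense attention head, and the head dimension of 128 is an illustrative default.

```python
def attention_flops(n_tokens, head_dim=128):
    """Approximate FLOPs for one dense attention head over n_tokens.

    Scoring every query against every key costs about 2 * n * n * d,
    and the weighted sum over the values costs about the same again.
    """
    return 4 * n_tokens * n_tokens * head_dim

base = attention_flops(128_000)
doubled = attention_flops(256_000)
ratio = doubled / base        # 4.0: twice the context, four times the compute

print(ratio)
```

The `n_tokens * n_tokens` term is the whole story: every other factor stays fixed as the context grows.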

A Miami-based startup called Subquadratic emerged from stealth on May 5, 2026, with a claim that would constitute the most significant architectural shift since the transformer itself: their model, SubQ, is the first frontier-scale LLM built entirely on a subquadratic sparse attention architecture. Compute grows linearly with context length — not quadratically.

The core innovation is SSA — Subquadratic Sparse Attention — which observes that in any given attention computation, only a small fraction of token-to-token relationships actually matter. Standard attention computes all of them anyway. SSA finds and focuses only on the relationships that carry signal, discarding the rest. At 12 million tokens, the company claims this reduces attention compute by roughly 1,000× compared to standard transformers. Early benchmarks: SubQ scored 97% on RULER 128K at a cost of $8 per run, versus approximately $2,600 for frontier models. Independent verification is pending, but the architecture is real and the economic logic is sound.
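The source does not describe how SSA chooses which relationships to keep, so the following is only back-of-envelope arithmetic, not a model of SubQ. It assumes each token attends to a fixed budget of `k` other tokens; the value of 12,000 is hypothetical and chosen purely to show what budget a 1,000× reduction at 12 million tokens would imply.

```python
def dense_pairs(n_tokens):
    """Token-to-token pairs scored by standard attention."""
    return n_tokens * n_tokens

def sparse_pairs(n_tokens, budget):
    """Pairs scored when each token attends to at most `budget` others."""
    return n_tokens * min(budget, n_tokens)

n = 12_000_000
k = 12_000                                        # hypothetical per-token budget
reduction = dense_pairs(n) / sparse_pairs(n, k)   # 1000.0

print(reduction)
```

With a fixed budget the saving is simply `n / k`, so it keeps growing as the context gets longer, which is what makes the scaling linear rather than quadratic.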

Technique 05 · GPU memory I/O FlashAttention 4

FlashAttention 4

Target: wasted data movement between GPU memory tiers on every forward pass

FlashAttention is older than the others on this list — Tri Dao introduced the original in 2022 — but its fourth iteration, presented at Hot Chips 2025 and targeting Nvidia Blackwell GPUs, delivers another 20% speedup over an already-impressive baseline.

The insight is elegant. A GPU has two kinds of memory: High Bandwidth Memory (HBM, large but slow) and on-chip SRAM (tiny but fast). Standard attention repeatedly shuttles data between them, and those memory reads and writes dominate the runtime. FlashAttention restructures the computation using tiling — breaking the attention matrix into blocks that fit in SRAM and computing them without ever writing intermediate results to HBM.
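The numerical trick that makes tiling possible is a running ("online") softmax: each block of keys and values is folded into a partial result, and earlier partial results are rescaled whenever a larger score appears. The sketch below shows that idea for a single query vector in plain Python. It is a CPU illustration of the arithmetic only, not the GPU kernel; the function names and block size are assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Reference version: materialise every score, then softmax, then weighted sum."""
    scale = 1 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def tiled_attention(q, keys, values, block=2):
    """Process keys and values one block at a time, never holding all scores at once."""
    scale = 1 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.5, 0.5], [-0.3, 0.8], [2.0, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block=2)
print(reference)
print(tiled)
```

Both functions return the same result up to floating-point rounding. The tiled version only ever holds one block of scores, which is the property that lets a real kernel keep its working set in SRAM.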

Unlike most of the others on this list, FlashAttention requires no changes to models or training. It is a drop-in kernel that any transformer can use immediately. It is already integrated into vLLM and TensorRT-LLM, which means it is quietly running under most production AI deployments today, often without the operator even knowing the name.

Technique 06 · KV cache TurboQuant

Google TurboQuant

Target: the KV cache memory explosion that dominates GPU cost at scale

Imagine you are running a 70-billion-parameter language model serving 512 concurrent users, each with a conversation context of 512 tokens. The model weights themselves take about 140GB of GPU memory. The KV cache — the running record of every key and value vector the model has computed for those conversations — takes an additional 512GB. That is nearly four times the memory consumed by the model itself, and it grows with every new message.

Google’s TurboQuant, published on arXiv in April 2025 and formally presented at ICLR 2026, is a software-only solution. It is an online vector quantization algorithm that compresses the KV cache to as few as 3.5 bits per value — roughly a 6× reduction — as keys and values are written to the cache, decompressing them when attention reads them back. It requires no calibration data, no per-model tuning, and no retraining. On Nvidia H100 GPUs, it delivers up to 8× speedup in attention computation at 4-bit precision.

TurboQuant shrinks a 42GB KV cache for a single 128K-token request down to roughly 7GB. That is the difference between serving one user per GPU and serving six.
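The source does not say which model the 42GB figure refers to. As a sanity check, the sketch below assumes a 70B-class configuration with 80 layers, 8 key/value heads (grouped-query attention), a head dimension of 128, and 16-bit cache entries. Those numbers are assumptions chosen for illustration, and the function name is hypothetical.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """One key and one value vector (the factor 2) per layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

# Assumed 70B-class config: 80 layers, 8 KV heads, head dim 128, FP16 cache.
full = kv_cache_bytes(n_tokens=128 * 1024, n_layers=80, n_kv_heads=8, head_dim=128)

full_gb = full / 1e9            # about 42.9
compressed_gb = full_gb / 6     # about 7.2 after a 6x reduction

print(f"{full_gb:.1f} GB -> {compressed_gb:.1f} GB")
```

Under those assumptions a single 128K-token request needs about 43GB of cache at 16-bit precision, in line with the figure quoted above, and a 6× reduction brings it to roughly 7GB.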

Google’s official implementation has not yet shipped. But the community moved immediately: independent developers built working implementations in PyTorch, MLX for Apple Silicon, and C/CUDA for llama.cpp within hours of the paper going public. One developer tested on Gemma 3 4B and reported character-identical output to the uncompressed baseline at 2-bit precision. The theory checks out.

Technique 07 · Token generation Speculative decoding

Speculative Decoding — The New Generation

Target: the latency of sequential, one-token-at-a-time autoregressive generation

Every token an LLM generates requires a full forward pass through the entire model. There is no way to parallelise this across time: the model must see token n before it can generate token n+1. At large scale, this sequential bottleneck dominates end-to-end latency.

Speculative decoding offers a clever escape. A small, fast draft model guesses the next several tokens simultaneously. The large model then verifies all those guesses in a single parallel forward pass — accepting the correct ones, rejecting the first wrong one, and regenerating from that point. When the draft is right — which it often is — the large model effectively generates multiple tokens in the time it would have taken to generate one.
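The draft-then-verify loop can be sketched with two toy "models". Everything here is hypothetical: `target_next` and `draft_next` are deterministic stand-ins for real networks, and the sketch covers only the greedy case. Production systems verify all drafted positions in one batched forward pass and use a probabilistic acceptance rule when sampling.

```python
def target_next(tokens):
    """Hypothetical large model: a deterministic toy rule standing in for greedy decoding."""
    return (tokens[-1] * 3 + len(tokens)) % 10

def draft_next(tokens):
    """Hypothetical small model: agrees with the target unless the last token is 7."""
    guess = target_next(tokens)
    return (guess + 1) % 10 if tokens[-1] == 7 else guess

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, n_new, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens, one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. One verification pass checks every proposed position.
        #    (A real target model does this in a single parallel forward pass.)
        target_passes += 1
        accepted = []
        for guess in draft:
            correct = target_next(tokens + accepted)
            if guess == correct:
                accepted.append(guess)
            else:
                accepted.append(correct)   # replace the first wrong guess and stop
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new], target_passes

output, passes = speculative_decode([1, 2], n_new=20)
baseline = greedy_decode([1, 2], n_new=20)
print(output == baseline, passes)
```

The output is identical to plain greedy decoding because every accepted token is one the target would have chosen itself. The gain is in the number of verification passes, which is far lower than the number of tokens generated whenever the draft is usually right.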

The years 2025 and 2026 have seen a wave of advances on this base technique. Google achieved 3× speedups on TPUs by integrating DFlash, a block-diffusion variant that collapses O(K) sequential drafting into O(1) parallel generation. MIT’s “Taming the Long Tail” (TLT) made speculative decoding work during reinforcement learning training, which was previously impossible because the static draft model went stale after model updates. Nvidia’s NeMo RL demonstrated a 1.8× rollout speedup at 8B parameters, projecting 2.5× end-to-end at 235B. Crucially, speculative decoding is mathematically lossless: the output distribution matches what the large model would have produced on its own. Latency drops with zero quality tradeoff.

How they fit together

A complete stack — if you choose to build it

Each of these seven techniques would be significant in isolation. What makes this moment genuinely historic is that they are complementary. They can be stacked. Consider a deployment that combines all seven: OSP during pre-training eliminates outliers at the source. BitNet-style ternary weights slash storage by 95%. DeepSeek V4’s hybrid attention reduces KV footprint structurally. SubQ’s sparse attention makes long context linear in compute. FlashAttention-4 kernels minimise memory I/O on chip. TurboQuant compresses the remaining KV cache 6×. Speculative decoding then halves token latency at zero quality cost.

No one has built this complete stack yet. But the individual pieces exist, have been validated, and are being integrated into production frameworks. The trajectory is clear.

The question of why now has several answers. The most honest one is cumulative: each technique required predecessor insights that only became available recently. TurboQuant builds on PolarQuant and QJL. FlashAttention-4 builds on three prior versions. OSP became possible because the Muon optimizer matured. These are compounding dividends, not independent inventions. There is also an economic driver. When serving a frontier model costs $2,600 per complex long-context request, most applications are not viable. When the same request costs $8, entirely new product categories open up.

What this means

The end of the size race

The most immediate implication is democratisation. When a frontier-capable model fits in 2GB and runs on a CPU, the barrier to entry for AI deployment collapses. Hospitals in regions without reliable cloud infrastructure can run diagnostic tools locally. Schools can run tutoring AI on decade-old hardware. Developers in countries with expensive internet can build AI products that work entirely offline.

The memory chip market has already noticed. Following the attention TurboQuant received in March 2026, analysts observed a downward drift in the stock prices of major memory suppliers including Micron and Western Digital. If AI giants can compress memory requirements by 6× through software alone, the insatiable demand for High Bandwidth Memory may be moderated by algorithmic efficiency. The picks-and-shovels trade in AI hardware is being disrupted by the very AI it was supposed to power.

The size race of the 2020s was a blunt instrument. More parameters, more data, more GPUs, more money. It worked — the models got dramatically better. But it was always going to hit physical and economic walls. What is happening now is the field responding to those walls not by pushing harder against them, but by going around.

Weights need to be precise (32-bit) → BitNet: −1, 0, +1 is enough

Attention must be quadratic → SubQ: linear scaling is possible

KV cache stored at full precision → TurboQuant: 3.5 bits, zero loss

Generation must be sequential → Speculative decoding: draft in parallel

Outliers handled post-training → OSP: prevented during training

GPU memory I/O is fixed overhead → FlashAttention: tiling eliminates it

Seven assumptions, discarded. The models on the other side of those discards are faster, cheaper, smaller, and in many ways more capable — not because they have more parameters, but because they waste fewer of the ones they have. The next decade of AI will be defined not by who builds the biggest model, but by who builds the most efficient one. That race is already underway. And for the first time in the history of the field, the teams in the lead are not necessarily the ones with the most money.
