⚡ TL;DR

Turbo4 KV cache (4.25bpv) matches FP16 quality while using half the VRAM of Q8_0.

  • Score: 100/100 on hardened agentic benchmark vs Q8_0's 91/100
  • Speed: 40.1 t/s decode vs Q8_0's 35.4 t/s — 14% faster
  • Context: 256K on 24GB VRAM vs Q8_0's 128K ceiling — 2x more
  • First-shot precision: 6/6 (100%) with zero fix attempts — every file correct on first keystroke
  • Quality: Lossless — statistically indistinguishable from FP16 KV cache per the buun paper
  • Build: spiritbuun/buun-llama-cpp — a fork of llama.cpp with TCQ + DFlash support

The KV Cache Problem

Every token your LLM generates requires computing attention against every previous token in the context window. Without a KV cache, each new token would require recomputing keys and values for the entire sequence — an O(n²) catastrophe. The KV cache stores pre-computed key-value pairs from previous tokens so new ones only need to attend against the stored values.
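As a minimal illustration of what the cache actually stores, here's a single-head, single-query decode step in NumPy. The shapes and names are illustrative only, not llama.cpp's internals:

```python
import numpy as np

def decode_step(q, k_new, v_new, k_cache, v_cache):
    """One decode step: append this token's K/V to the cache, then attend over every stored position."""
    k_cache = np.vstack([k_cache, k_new])            # (n_past + 1, d_head)
    v_cache = np.vstack([v_cache, v_new])
    scores = k_cache @ q / np.sqrt(len(q))           # dot-product scores against all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over all positions seen so far
    return weights @ v_cache, k_cache, v_cache       # attention output plus the grown cache

d_head = 256
k_cache = np.empty((0, d_head))
v_cache = np.empty((0, d_head))
for _ in range(3):                                   # generate a few tokens, reusing cached K/V each step
    q, k, v = (np.random.randn(d_head) for _ in range(3))
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
```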

But the KV cache has a size problem. Its footprint scales linearly with layer count, KV-head count, head dimension, and context length:

KV Cache Size = 2 × n_layers × n_kv_heads × d_head × n_ctx × sizeof(FP16)

For the 27B model tested here (64 layers, 24 heads, 256-dim key/value vectors), a full FP16 KV cache at 256K context comes to roughly 41 GB.

That's impossible on a 24 GB card when the model weights alone consume 16 GB. The standard solution has been scalar quantization: store each value at 8 bits (Q8_0) instead of 16, halving the cache to ~20.5 GB, which barely fits and leaves zero headroom; or at 4 bits (Q4_0), quartering it to ~10 GB, which fits comfortably but degrades quality.
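To see how bit width trades off against context, here's a back-of-the-envelope calculator for the formula above. The layer and KV-head numbers in the example call are placeholders, not this model's exact attention layout (a hybrid GQA model caches K/V for far fewer heads and layers than the raw counts suggest):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, d_head: int, n_ctx: int, bits_per_value: float) -> float:
    """Evaluate the KV-cache size formula above, in GiB."""
    n_values = 2 * n_layers * n_kv_heads * d_head * n_ctx   # keys and values for every position
    return n_values * bits_per_value / 8 / 1024**3

# Placeholder attention layout, NOT this model's exact one:
layers, kv_heads, d_head, ctx = 16, 4, 256, 262_144
for name, bpv in [("FP16", 16.0), ("Q8_0", 8.0), ("Turbo4", 4.25), ("Q4_0", 4.0)]:
    print(f"{name:7s} {kv_cache_gib(layers, kv_heads, d_head, ctx, bpv):6.2f} GiB")
```

Whatever the absolute numbers for a given model, the ratios are fixed: Q8_0 halves the FP16 cache, Q4_0 quarters it, and Turbo4 at 4.25 bpv lands just above the Q4_0 footprint.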

The community consensus was clear: Q8_0 is the gold standard, Q4_0 is a desperation option. Trellis-Coded Quantization (TCQ) blows that consensus apart.

What Is Trellis-Coded Quantization?

Standard scalar quantization is simple: take each value independently, find the nearest bucket centre, store the bucket index. It loses information with every value because it treats each one in isolation.

TCQ, from Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits (spiritbuun, 2026), takes a fundamentally different approach to the same problem:

  1. FWHT Rotation with Random Sign Flips. Apply a Fast Walsh-Hadamard Transform to the key and value vectors. This decorrelates the highly correlated attention vectors into near-i.i.d. Gaussian entries. Without this step, trellis coding fails because correlated values can't benefit from sequence-level optimisation. (A minimal sketch of this rotation follows the list.)
  2. Viterbi Trellis Encoding. Run a Viterbi algorithm across a 512-state (3-bit) or 256-state (2-bit) right-shift trellis. Instead of quantising each value independently, it finds the globally optimal path through the trellis that minimises total distortion across the entire sequence. This is the core insight: you can do better than per-value quantisation when you consider the whole sequence together.
  3. O(1) Sliding-Window Decode. At inference time, each value is decoded via a simple bit-window lookup — no trellis traversal needed. The expensive Viterbi search is encode-only. Decode is fast.
  4. Context-Adaptive Alpha Scaling. A logarithmic norm scaling formula automatically adjusts the dequantisation scale based on context length, compensating for the changing statistical properties of attention vectors as sequences grow.
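
For intuition, here's a minimal NumPy sketch of step 1, the randomised Hadamard rotation (sign flips followed by an FWHT). It illustrates the transform only; the fork's actual CUDA kernels and the trellis encoder/decoder are not reproduced here, and the function names are mine:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Iterative Fast Walsh-Hadamard Transform (input length must be a power of two)."""
    x = x.astype(np.float64).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)                      # orthonormal scaling preserves vector norms

def randomized_hadamard(v: np.ndarray, rng: np.random.Generator):
    """Random sign flips followed by FWHT: spreads energy so entries look near-i.i.d. Gaussian."""
    signs = rng.choice([-1.0, 1.0], size=len(v))
    return fwht(v * signs), signs              # keep the sign pattern to undo the rotation later

rng = np.random.default_rng(0)
key = rng.standard_normal(256)                 # stand-in for one 256-dim key vector
rotated, signs = randomized_hadamard(key, rng)
```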

The paper reports that at 3.25 bits per value, TCQ produces lower perplexity than FP16 (5.802 vs 5.805 at 2K context) — a mild regularisation effect from the trellis constraints. At 4.25 bits per value (Turbo4), it's statistically indistinguishable from FP16. Lossless.

The implementation lives in spiritbuun/buun-llama-cpp, a fork of llama.cpp that adds TCQ alongside DFlash speculative decoding support. It compiles clean with CUDA on consumer hardware.

The Setup

Hardware

  • GPU: NVIDIA GeForce RTX 4090 24GB GDDR6X (sm_89, Ada Lovelace)
  • CPU: AMD Ryzen 9 5950X (16-core / 32-thread)
  • RAM: 64GB DDR4-3600
  • OS: Ubuntu Linux, CUDA 12.9

Model

  • File: Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf
  • Size: 16.3 GB (5.4 BPW, Q4_K_P quantisation)
  • Arch: Qwen3.5/3.6 hybrid (Gated DeltaNet + full softmax attention every 4th layer)
  • Params: 26.9B, 64 layers, 24 heads, 256-dim K/V, 128-dim SSM state

Build

  • Repo: spiritbuun/buun-llama-cpp @ commit aecbbd5d (May 1, 2026)
  • Flags: -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release

KV Cache Configurations Tested

  • turbo4: 4.25 bpv, trellis-coded, lossless (new method)
  • q8_0: 8.0 bpv, scalar, near-lossless (current gold standard)
  • turbo3_tcq: 3.25 bpv, trellis-coded, ~5x compression (new method, aggressive)
  • turbo4: 4.25 bpv, trellis-coded (also tested on the easier TASK1)
  • q8_0: 8.0 bpv, scalar (tested on upstream llama.cpp for baseline)

The Benchmarks

I designed two benchmarks to stress-test KV cache quality at different complexity levels:

TASK1 — SimulaCorp (18 steps, easier)

A multi-language project scaffold benchmark. Build a fake company with Python/Rust/JavaScript source files, data validation, plugin system, LRU cache, CSV parser, pattern matcher, scheduler, zebra logic puzzle, and long-context memory recall. Scoring: 100 points across functional completion (30), code quality (20), precision (25), and memory (25), minus efficiency penalties.

TASK2 v2 — Hardened (8 steps, harder)

A compressed, more difficult benchmark focused on single-file challenges: a lock-free SPSC queue, a BitTorrent Bencode parser, debugging 4 bugs in a broken hashmap-backed search index, a first-shot battery (concurrent dict with fine-grained locking, topological sort, bitmap with manual popcount), a 3-algorithm thread-safe rate limiter, and long-context memory recall. Scoring: 100 points across functional (40), code quality (20), precision (25), and memory (15), with harsher penalties (-3 per edit, -15 for re-reading during memory tests).

TASK2 is the more discriminating benchmark. The same model scoring 85 on TASK1 scores anywhere from 60 to 100 on TASK2 — a 40-point spread vs a 9-point spread. That's what you want from a benchmark.

Run 1: Baseline — Upstream Q8_0 vs Q4_0 on TASK1

First, I established baselines using the standard upstream llama.cpp build with the Qwen3.6-27B HauhauCS Q4_K_P model:

| KV Cache | Context | Score | Rating | First-Shot | Fixes |
|---|---|---|---|---|---|
| Q8_0 (upstream) | 128K | 85/100 | A | | ~4 |
| Q4_0 (upstream) | 256K | 76/100 | B | | |

Standard result: Q8_0 outscored Q4_0 by 9 points. This is the expected behaviour — higher precision KV cache produces better code. Q4_0 gave double the context but at a quality cost.

I also tested a DavidAU Qwen3.6-27B fine-tune (IQ4_NL quant, 4.32 BPW) on both Q4_0 and Q8_0. It scored 80 and 76 respectively — worse than the plain abliterated HauhauCS model in both configurations. The finetune's published benchmark gains (+4% arc-c, +5.4% arc-e) didn't translate to agentic coding tasks. Bitrate matters more than fine-tuning for this workload.

Run 2: Enter the Buun Fork — DFlash Attempt

Spiritbuun's buun-llama-cpp fork includes both TCQ KV cache support and DFlash speculative decoding. DFlash uses a separate draft model (~1.8 GB Q8_0 GGUF) trained on base Qwen3.6 output distribution to propose blocks of tokens in parallel, which the target model then verifies.

I downloaded the draft model from spiritbuun/Qwen3.6-27B-DFlash-GGUF and ran it against the HauhauCS fine-tune target:

| Metric | Q4_K_M Drafter | Q8_0 Drafter |
|---|---|---|
| Draft tokens per cycle | 16 | 16-17 |
| Accepted per cycle | ~1.7 | ~2.0 |
| Acceptance rate | ~10.6% | ~11.8% |
| Cycle time | ~80ms | ~90ms |
| Effective throughput | ~21 t/s | ~22 t/s |

DFlash made things slower, not faster. The HauhauCS aggressive abliteration changed the model's output distribution enough that the drafter (trained on base Qwen3.6) was wrong 88% of the time. For comparison, the Reddit post using a base unsloth Qwen3.6 Q4_K_M target reported ~80 t/s with DFlash — a 2-3x speedup. The fine-tune kills it.
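
A quick sanity check on those numbers: speculative decoding throughput is roughly the tokens kept per cycle divided by the cycle time, so a ~11-12% acceptance rate caps the whole pipeline. A back-of-the-envelope using the Q8_0-drafter column above:

```python
# Back-of-the-envelope check of the Q8_0-drafter column above.
draft_per_cycle = 17       # tokens proposed by the DFlash draft model each cycle
kept_per_cycle = 2.0       # tokens that survive target-model verification
cycle_time_s = 0.090       # ~90 ms per draft+verify cycle

acceptance = kept_per_cycle / draft_per_cycle     # ~0.118 -> ~11.8%
throughput = kept_per_cycle / cycle_time_s        # ~22 t/s, below plain decoding at ~34-40 t/s
print(f"acceptance {acceptance:.1%}, effective throughput {throughput:.0f} t/s")
```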

Lesson: DFlash only works with base models. If you're using a fine-tuned or abliterated model, skip speculative decoding entirely. The acceptance rate collapses.

Run 3: Buun Build Validation — Q8_0 on TASK1

Before trusting the buun build, I verified it didn't introduce regressions. Running the same Q8_0 KV cache on buun build vs upstream on TASK1:

| Build | Score | Decode t/s |
|---|---|---|
| Upstream llama.cpp | 85/100 | ~34 |
| Buun-llama-cpp | 85/100 | ~34 |

Identical quality, identical speed. The buun build is a drop-in replacement with no downside and significant upside. Confirmed safe.

Run 4: Turbo4 Debut — TASK1

First test of Turbo4 KV cache (4.25bpv, trellis-coded) on the easier TASK1 benchmark:

| Metric | Turbo4 (buun) | Q8_0 (upstream) | Q4_0 (upstream) |
|---|---|---|---|
| Score | 76/100 | 85/100 | 76/100 |
| Functional | 26/30 | 28/30 | 27/30 |
| Code Quality | 18/20 | 19/20 | 17/20 |
| Precision | 14/25 | 18/25 | 18/25 |
| Memory | 23/25 | 24/25 | 24/25 |
| Decode t/s | 37.0 | ~34 | ~34 |

Turbo4 tied Q4_0 at 76. Disappointing — the lossless claim should have delivered Q8_0-quality results. The decode speed was higher (37 t/s vs 34 — a 10% boost from the TCQ kernels), but the quality didn't show up in the score.

This turned out to be a benchmark sensitivity problem, not a Turbo4 problem. TASK1 has too much noise and not enough spread. The same model configuration can score 76 or 85 depending on which implementation patterns it happens to choose. I needed a harder benchmark.

Runs 5-7: TASK2 v2 — The Decisive Test

TASK2 v2 was designed to fix TASK1's shortcomings: shorter (8 steps vs 18), harder (models struggle to hit 50), wider spread, and independent steps that don't cascade-fail. Each KV cache configuration was tested fresh from an empty directory — no prior code to work from.

| Metric | Turbo4 (4.25bpv) | Q8_0 (8.0bpv) | TCQ3 (3.25bpv) |
|---|---|---|---|
| Final Score | 100/100 (S) | 91/100 (S) | 76/100 (A) |
| Functional Completion | 40/40 | 40/40 | 40/40 |
| Code Quality | 20/20 | 20/20 | 20/20 |
| Precision | 25/25 | 22/25 | 13/25 |
| Memory/Recall | 15/15 | 15/15 | 15/15 |
| First-Shot Pass Rate | 6/6 (100%) | 5/7 (71%) | 5/8 (62%) |
| Fix Attempts | 0 | 3 | 7 |
| Edit Calls | 0 | 2 | 4 |
| Server Tokens | 39K | 110K | 382K |
| Context Used | 15% | ~84% | 145% |
| Decode Speed | 40.1 t/s | 35.4 t/s | 23.3 t/s* |
| Server Time | 5.1 min | 7.0 min | 3.3 min |

*TCQ3's 23.3 t/s is misleading — the 382K tokens of fix iterations created context management overhead. The buun paper benchmarks TCQ3 decode at 97% of Q8_0 speed. The bloat was from the precision gap, not the KV cache.

Turbo4 scored a perfect 100. Zero fixes. Zero edits. Every file correct on the first keystroke. A straight-line run with no backtracking.

Code Quality: Turbo4 Wrote Better Code

This isn't just scoring. I audited every implementation across all three runs. Turbo4 consistently produced more sophisticated, more correct code:

SPSC Queue

Q8_0 and TCQ3 both reached for ctypes.c_size_t with integer head/tail indices, claiming a "lock-free" implementation that depends entirely on the CPython GIL for atomicity — it's not lock-free at all, it's GIL-dependent. Both needed 2 fix attempts to get the threading right.

Turbo4 used the classic wasted-slot circular buffer pattern: capacity+1 slots, full = (head+1) % cap == tail, empty = head == tail. Per-index locks (head_lock, tail_lock) that are honest about what they do. Correct by construction. Zero fix attempts. This is the approach you'd find in a production ring buffer, not a clever trick that happens to work on CPython.
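
For reference, here's a minimal sketch of that wasted-slot pattern, reconstructed from the description above rather than copied from the run's output:

```python
import threading

class SPSCQueue:
    """Bounded single-producer/single-consumer ring buffer using the wasted-slot pattern."""

    def __init__(self, capacity: int):
        self._cap = capacity + 1                  # one slot always stays empty to distinguish full from empty
        self._buf = [None] * self._cap
        self._head = 0                            # next write position (producer side)
        self._tail = 0                            # next read position (consumer side)
        self._head_lock = threading.Lock()
        self._tail_lock = threading.Lock()

    def push(self, item) -> bool:
        with self._head_lock:
            nxt = (self._head + 1) % self._cap
            if nxt == self._tail:                 # full: writer would lap the reader
                return False
            self._buf[self._head] = item
            self._head = nxt
            return True

    def pop(self):
        with self._tail_lock:
            if self._tail == self._head:          # empty
                return None
            item = self._buf[self._tail]
            self._tail = (self._tail + 1) % self._cap
            return item
```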

Concurrent Dictionary

Q8_0 and TCQ3 used a single shared dict with 16 per-bucket threading.Lock objects. Keys hashing to the same bucket contend on the same lock despite being different keys.

Turbo4 used 16 independent dicts: self._data = [{} for _ in range(16)]. Each bucket has its own completely isolated dictionary. Two keys in different buckets have zero shared state. Genuinely better architecture with zero tradeoffs.
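
A minimal sketch of that shard-per-bucket layout (illustrative, not the run's actual file):

```python
import threading

class ShardedDict:
    """Concurrent map built from 16 fully independent dicts, one lock per shard."""

    def __init__(self, shards: int = 16):
        self._data = [{} for _ in range(shards)]                 # no shared state between shards
        self._locks = [threading.Lock() for _ in range(shards)]

    def _shard(self, key) -> int:
        return hash(key) % len(self._data)

    def set(self, key, value) -> None:
        i = self._shard(key)
        with self._locks[i]:
            self._data[i][key] = value

    def get(self, key, default=None):
        i = self._shard(key)
        with self._locks[i]:
            return self._data[i].get(key, default)
```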

Rate Limiter

Q8_0 used flat dicts with local import time inside each method. TCQ3 used tuples for state (tokens, last_time). Both work but lack structure.

Turbo4 defined proper state classes with __slots__: _TokenBucketState (tokens + last refill), _SlidingWindowState (timestamp list), _FixedWindowState (count + window start). Memory-efficient, clean separation of algorithm logic from state management. This is senior-engineer code.
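
A minimal sketch of the __slots__ state-class pattern, shown for the token-bucket algorithm only. It's illustrative, not the run's code; the benchmark's version also has to be thread-safe, so a real implementation would guard allow() with a lock:

```python
import time

class _TokenBucketState:
    """Per-key token-bucket state; __slots__ keeps per-key memory overhead minimal."""
    __slots__ = ("tokens", "last_refill")

    def __init__(self, capacity: float):
        self.tokens = capacity
        self.last_refill = time.monotonic()

class TokenBucketLimiter:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate                     # tokens added per second
        self._state = {}                                   # key -> _TokenBucketState

    def allow(self, key: str) -> bool:
        # NOTE: a thread-safe version (as the benchmark requires) would take a lock here.
        st = self._state.setdefault(key, _TokenBucketState(self.capacity))
        now = time.monotonic()
        st.tokens = min(self.capacity, st.tokens + (now - st.last_refill) * self.refill_rate)
        st.last_refill = now
        if st.tokens >= 1.0:
            st.tokens -= 1.0
            return True
        return False
```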

Topological Sort

Q8_0 and Turbo4 both chose Kahn's algorithm (BFS with in-degree tracking) — reports specific cycle nodes on failure. TCQ3 chose DFS recursion stack — standard but less informative on errors. Turbo4 and Q8_0 tied here, both correct.
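
For reference, a compact Kahn's-algorithm sketch that names the offending nodes when a cycle is found (illustrative, not either run's exact code):

```python
from collections import deque

def topo_sort(graph: dict) -> list:
    """Kahn's algorithm: BFS over in-degrees; on failure, report the nodes stuck in a cycle."""
    indeg = {n: 0 for n in graph}
    for succs in graph.values():
        for m in succs:
            indeg[m] = indeg.get(m, 0) + 1
    queue = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in graph.get(n, []):
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(order) != len(indeg):
        cycle = sorted(n for n, d in indeg.items() if d > 0)
        raise ValueError(f"cycle detected among nodes: {cycle}")
    return order

print(topo_sort({"a": ["b"], "b": ["c"], "c": []}))   # ['a', 'b', 'c']
```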

SearchIndex Debugging

All three runs found 4 bugs. But which bugs they found differed:

| Bug | Turbo4 | Q8_0 | TCQ3 |
|---|---|---|---|
| Case mismatch in add() | ✓ | ✓ | ✓ |
| ZeroDivisionError in stats() | ✓ | ✓ | ✓ |
| Search pollutes index | ✓ defaultdict insertion | ⚠ Set reference leak | ⚠ Mutable set |
| Delete issue | ✓ Empty set cleanup | ⚠ Case mismatch (same as #1) | ⚠ Case mismatch (same as #1) |

Q8_0 and TCQ3 both double-counted the same root cause — case sensitivity in add() and delete() is the same bug manifested in two places. Turbo4 found two genuinely distinct bugs: defaultdict silently inserts empty sets when you access a missing key (corrupting the index invisibly), and discard() leaves empty sets that inflate term counts in stats(). More thorough analysis that found different root causes, not the same bug twice.
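
The defaultdict bug is easy to reproduce in isolation. This shows the general pitfall, not the benchmark's exact SearchIndex code:

```python
from collections import defaultdict

index = defaultdict(set)
index["rust"].add(1)

# Pitfall: merely *reading* a missing key inserts an empty set, silently growing the
# index and inflating any per-term statistics computed over it.
_ = index["missing-term"]
print(len(index))                        # 2, even though only one term was ever added

# A lookup that does not mutate the index:
hits = index.get("missing-term", set())
print(len(index))                        # still 2; .get() never inserts
```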

Bencode Parser

Turbo4 added a while pos < len(data) bounds check on list/dict parsing that Q8_0 and TCQ3 both omitted. Safer against malformed input. Small detail, but zero-fix code has these details baked in.
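
A minimal bencode decoder sketch showing where that bounds check lives. The names and structure are mine, not the benchmark run's parser:

```python
def bdecode(data: bytes, pos: int = 0):
    """Minimal bencode decoder; every loop is bounded by len(data) so truncated input raises."""
    if pos >= len(data):
        raise ValueError("unexpected end of input")
    c = data[pos:pos + 1]
    if c == b"i":                                    # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if c.isdigit():                                  # byte string: <len>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        if start + length > len(data):
            raise ValueError("truncated string")
        return data[start:start + length], start + length
    if c == b"l":                                    # list: l<items>e
        pos += 1
        items = []
        while pos < len(data) and data[pos:pos + 1] != b"e":   # bounds check before peeking
            item, pos = bdecode(data, pos)
            items.append(item)
        if pos >= len(data):
            raise ValueError("unterminated list")
        return items, pos + 1
    if c == b"d":                                    # dict: d<key><value>...e
        pos += 1
        result = {}
        while pos < len(data) and data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        if pos >= len(data):
            raise ValueError("unterminated dict")
        return result, pos + 1
    raise ValueError(f"invalid bencode at offset {pos}")
```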

The Conversation Pattern Analysis

Reading the full conversation logs revealed that Turbo4 worked fundamentally differently from the other two runs:

Turbo4 thought before typing. Before writing the debug report, it analysed all 4 SearchIndex bugs inline in the conversation — structured reasoning, then re-analysis ("Actually let me re-analyse more carefully..."), then committed to code. It wrote search_broken.py (original) AND search_fixed.py (patched) as separate files, then ran tests against the broken version first to confirm the bugs existed before claiming to have fixed them. This is proper debugging methodology — reproduce the bug before you fix it.

Turbo4 never used the Edit tool. The tool call sequence was a straight line: Write → Write → Write → Bash (pass) → Write → Write → Bash (pass) → ... 14 Write calls, zero Edit calls. Every file was correct on the first keystroke. No "let me fix that." No iteration. No backtracking.

Q8_0 was almost as clean. 2 Edit calls, 3 fix attempts. It chose Kahn's algorithm for topological sort (correct choice), wrote clean code, but needed minor threading fixes on the SPSC queue and an import order fix on the rate limiter. Good, not perfect.

TCQ3 death-spiralled. 4 Edit calls, 7 fix attempts, 382K tokens. A bencode generator scope bug, search test logic issues, topological sort import and order errors, rate limiter bucket initialisation and refill math — small mistakes that compounded. Each fix consumed tokens, polluting the context window and slowing down subsequent operations. The 23.3 t/s decode speed wasn't a TCQ3 problem — it was a token-bloat problem.

Why Q8_0 Lost

Q8_0 is a good method. Near-lossless quality at 8 bits per value. The problem is purely physical: at 256K context, the KV cache alone needs 8.7 GB. With the model at 16 GB and overhead, that's 24.7 GB — doesn't fit on 24 GB. The Q8_0 test had to run at 128K context.

Turbo4 at 4.25 bits per value fits 256K in the same budget. But even at equal context sizes, the trellis coding has structural advantages over scalar quantisation:

| Property | Q8_0 (Scalar) | Turbo4 (TCQ) |
|---|---|---|
| Quantisation approach | Per-value independent | Sequence-level Viterbi search |
| Pre-processing | None | FWHT rotation (decorrelation) |
| Scaling | Static | Context-adaptive alpha |
| Bits per value | 8.0 | 4.25 |
| Quality vs FP16 | Near-lossless | Statistically lossless |
| Decode speed penalty vs Q8_0 | | None (97%+ per paper) |

The Viterbi trellis search finds globally optimal codeword assignments across the entire sequence. The FWHT rotation decorrelates before quantisation. The adaptive alpha corrects for context-length drift. You get lower distortion at half the bitrate because you're solving a harder optimisation problem — and the expensive part (encoding) happens once, while the cheap part (decoding) happens at every token.

TCQ3: The Efficiency Trade-off

TCQ3 at 3.25bpv scored 76/100. Perfect functional completion. Perfect code quality. Perfect memory recall. The 13/25 precision score came entirely from needing more fix attempts (7 vs 0 for Turbo4).

For use cases where you genuinely need maximum context and can tolerate occasional iteration, TCQ3 is viable. The output quality is identical — the model still produces correct, clean, well-typed code. It just takes a few more attempts to get there. At ~5x compression over FP16, it's the right tool when you need to push context past what Turbo4 can handle.

| If you want... | Use | Flags | Compression |
|---|---|---|---|
| Best quality + max context | Turbo4 | -ctk turbo4 -ctv turbo4 | ~3.8x |
| Maximum context + good quality | TCQ3 | -ctk turbo3_tcq -ctv turbo3_tcq | ~5x |
| Extreme compression | TCQ2 | -ctk turbo2_tcq -ctv turbo2_tcq | ~7x |

Setup Guide

Build buun-llama-cpp:

    git clone https://github.com/spiritbuun/buun-llama-cpp /opt/llama.cpp-dflash
    cd /opt/llama.cpp-dflash
    cmake -B build \
        -DGGML_CUDA=ON \
        -DGGML_NATIVE=ON \
        -DGGML_CUDA_FA=ON \
        -DGGML_CUDA_FA_ALL_QUANTS=ON \
        -DCMAKE_BUILD_TYPE=Release \
        -DLLAMA_BUILD_SERVER=ON
    cmake --build build -j$(nproc)

Launch with Turbo4 (256K context, RTX 4090 24GB):

    ./build/bin/llama-server \
        -m /path/to/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf \
        --mmproj /path/to/mmproj-Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-f16.gguf \
        --no-mmproj-offload \
        --port 8095 \
        -c 262144 \
        --flash-attn on \
        -t 20 \
        --no-mmap \
        -ngl 99 \
        --parallel 1 \
        -ctk turbo4 -ctv turbo4 \
        --fit on \
        --reasoning on \
        --jinja \
        --chat-template-kwargs '{"enable_thinking":false,"preserve_thinking":true}' \
        --temp 0.6 --top-p 0.95 --top-k 20 --seed 3407

Or with Q8_0 for comparison (128K context):

    -c 131072 -ctk q8_0 -ctv q8_0

Or with TCQ3 for max context:

    -c 262144 -ctk turbo3_tcq -ctv turbo3_tcq

The Broader Implication

For the past two years, the local AI community has treated Q8_0 KV cache as the gold standard and Q4_0 as the desperation option. This made sense in a world where the only option was per-value scalar quantisation — half the bits meant half the precision, and the quality gap was real.

TCQ changes the game. By optimising across the entire sequence rather than per-value, by decorrelating before quantisation, and by adapting to context length, trellis coding achieves better fidelity at half the bitrate. You don't have to choose between quality and context size. You get both.

Turbo4 isn't just "not worse than Q8_0." On the same model, same hardware, same benchmark — it's better. Better code. Better architecture. Better debugging. Better first-shot precision. All while running 14% faster and fitting double the context.

The buun-llama-cpp build is a pure upgrade over upstream: same quality, faster speed, more KV cache options, plus DFlash support for base model users. Compiled clean on CUDA 12.9 with an RTX 4090. No downsides detected across 7 benchmark runs.

Full Results Table

| Rank | KV Cache | Build | Benchmark | Score | Rating | t/s | Fixes |
|---|---|---|---|---|---|---|---|
| 1 | Turbo4 | buun | TASK2 v2 | 100 | S | 40.1 | 0 |
| 2 | Q8_0 | buun | TASK2 v2 | 91 | S | 35.4 | 3 |
| 3 | Q8_0 | upstream | TASK1 | 85 | A | ~34 | ~4 |
| 4 | Q4_0 (DavidAU) | upstream | TASK1 | 80 | B | | |
| 5 | TCQ3 | buun | TASK2 v2 | 76 | A | 23.3* | 7 |
| 6 | Q4_0 | upstream | TASK1 | 76 | B | ~34 | |
| 6 | Turbo4 | buun | TASK1 | 76 | B | 37.0 | 8 |
| 6 | Q8_0 (DavidAU) | upstream | TASK1 | 76 | B | | |

Full benchmark data, individual run analyses, code audits, conversation logs, and the TASK2 benchmark itself available in the project repository. All tests conducted May 4-5, 2026 on RTX 4090 24GB with CUDA 12.9, buun-llama-cpp at commit aecbbd5d, Qwen3.6-27B-Uncensored-HauhauCS-Aggressive Q4_K_P.