The Conventional Wisdom

If you've spent time in local AI communities, you've heard the advice: never use Q4 KV cache. It destroys quality. Models become incoherent. The consensus is clear — stick to Q8 or f16 for KV cache, or accept degraded output.

There's even a documented GitHub issue from December 2024 showing Q8 KV cache degrading model performance on reasoning tasks. The user reported consistently incorrect answers with the compressed KV cache on prompts the model answered correctly with the uncompressed one.

So when I started benchmarking Q4 vs Q8 KV cache, I expected garbage. What I got instead made me question the conventional wisdom.

The Test Setup

I ran identical coding benchmarks across two different model families, testing Q8 vs Q4 KV cache on the exact same prompt:

Task: Build a concurrent rate-limited API client with circuit breaker pattern
Requirements: Token bucket rate limiter, circuit breaker, exponential backoff, concurrent request pool, timeout handling, metrics collection
Hardware: NVIDIA RTX 4090 24GB, AMD Ryzen 9 5950X, 64GB RAM
Models tested: GLM-4.7-Flash-PRISM, Qwen3-Coder-Next-Abliterated

The benchmark was designed to test practical coding ability — producing working, production-quality code with multiple interacting components.
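To make the task concrete, here is a minimal sketch of one required component, a token-bucket rate limiter. This is my own illustration of the problem, not either model's output:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: `rate` tokens refill per
    second, up to `capacity`. acquire() is non-blocking."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A production-quality answer also needs locking, async support, and integration with the circuit breaker and retry logic — that interaction between components is what the benchmark grades.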

The Unexpected Results

Across both model families, Q4 KV cache didn't just match Q8 — it performed better.

GLM-4.7-Flash Results

| Metric | Q8 KV Cache | Q4 KV Cache |
| --- | --- | --- |
| Prompt Processing | 461 t/s | 1,149 t/s (+149%) |
| Token Generation | 59.4 t/s | 96.0 t/s (+62%) |
| Code Quality Score | 9/10 | 9.5/10 |
| Files Generated | 4 files, 1,646 lines | 4 files, 1,534 lines |

Qwen3-Coder-Next Results

| Metric | Q8 KV Cache | Q4 KV Cache |
| --- | --- | --- |
| Prompt Processing | 1,392 t/s | 1,523 t/s (+9.4%) |
| Token Generation | 65.0 t/s | 68.8 t/s (+5.8%) |
| Code Quality Score | 7/10 | 9/10 |
| Test Suite | None | 16 tests created |
| Completion Status | Stuck in loop | Clean finish |

What Actually Happened

The Q8 KV cache test on Qwen3-Coder-Next got stuck in a repetition loop after 10 minutes: instead of completing the task, the model repeatedly emitted tiny 33-token responses until I intervened manually.

The Q4 KV cache test on the same model completed cleanly, generated a full test suite, and finished without issues.

For GLM-4.7-Flash, both configurations completed, but Q4 KV produced better-structured code with custom exception hierarchies, separate configuration dataclasses, and more polished implementations — despite the conventional wisdom suggesting Q4 should produce worse output.
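To show what "better-structured" means here, the Q4 output favored patterns like the following. This is my paraphrase of the shape of the code, not the model's actual output:

```python
from dataclasses import dataclass

class ClientError(Exception):
    """Base of a custom exception hierarchy, so callers can catch
    one type or distinguish specific failures."""

class RateLimitExceeded(ClientError):
    pass

class CircuitOpen(ClientError):
    pass

@dataclass(frozen=True)
class ClientConfig:
    """Separate configuration dataclass instead of loose constructor
    kwargs scattered through the client."""
    requests_per_second: float = 10.0
    failure_threshold: int = 5
    timeout_seconds: float = 30.0
```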

Why Might This Be Happening?

I'm not claiming to have definitive answers. But recent research offers some clues:

Layer Sensitivity Varies by Model

A February 2025 paper, KVTuner, demonstrated that different layers in a model have vastly different sensitivities to KV cache quantization. Some layers can tolerate aggressive quantization with minimal impact; others degrade quickly.

The "Q4 destroys quality" conclusion may come from testing on models where critical layers are highly sensitive. But not all models share the same sensitivity profile.
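A toy experiment illustrates why sensitivity can vary: a coarse grid loses far more precision when a layer's values contain outliers that stretch the quantization scale. This is simplified round-to-nearest absmax quantization, not llama.cpp's actual block-wise q4_0/q8_0 formats:

```python
def fake_quantize(xs, bits):
    # Symmetric absmax round-to-nearest: snap values onto a grid of
    # 2^(bits-1) - 1 levels, then map back to floats.
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / levels
    return [round(x / scale) * scale for x in xs]

def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

smooth = [x / 100 for x in range(-100, 101)]  # well-behaved values in [-1, 1]
spiky = smooth + [8.0]                        # one outlier stretches the scale

for name, data in [("smooth", smooth), ("spiky", spiky)]:
    e4 = mse(data, fake_quantize(data, 4))
    e8 = mse(data, fake_quantize(data, 8))
    print(f"{name}: 4-bit MSE {e4:.6f}, 8-bit MSE {e8:.6f}")
```

On the smooth distribution, 4-bit error is small; add one outlier and 4-bit error balloons while 8-bit barely moves. If a model's sensitive layers look "spiky" in this sense, Q4 hurts; if not, it may be nearly free.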

Model Architecture Matters

Both models I tested differ architecturally from the Llama-3.3-70B referenced in the GitHub issue showing Q8 degradation: GLM-4.7-Flash uses a different attention mechanism entirely, and Qwen3-Coder-Next is a Mixture of Experts model with 512 experts.

These architectural differences may mean KV cache quantization impacts them differently. The blanket advice to "avoid Q4 KV" assumes all models respond the same way.

VRAM Pressure and Stability

Here's a practical consideration: Q4 KV cache uses roughly half the VRAM of Q8. With less memory pressure, the model has more headroom for computation buffers and can avoid edge cases that trigger instability.

The Qwen3-Coder-Next Q8 test got stuck in a loop. The Q4 test didn't. Correlation isn't causation, but reduced memory pressure might contribute to more stable generation.

The Counter-Evidence

I should acknowledge where the conventional wisdom comes from. The GitHub issue I mentioned earlier shows clear quality degradation with Q8 KV cache on Llama-3.3-70B. The model that previously answered date calculations correctly started producing wrong answers consistently.

That's real. That's documented. And it contradicts my results.

The difference? Different model. Different architecture. Different quantization. Different task type. The question isn't whether KV cache quantization can degrade quality — it's whether it degrades quality for your specific use case.

What This Means Practically

I'm not recommending everyone switch to Q4 KV cache tomorrow. But I am suggesting the "never use Q4" advice is overgeneralized.

What I'm Using Now

  • Default: Q4 KV cache for coding tasks — faster, equal or better quality in my tests
  • Fallback: If a model produces degraded output, I'll switch to Q8
  • Testing: I test both configurations before settling on one for any new model

The VRAM savings are substantial. At 168K context, Q4 KV cache saves roughly 2GB compared to Q8. That's 2GB more for model weights, larger context windows, or running models that wouldn't otherwise fit.
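You can estimate the savings for your own setup. The per-element sizes below approximate llama.cpp's formats (f16 = 2 bytes; q8_0 ≈ 1.0625 and q4_0 ≈ 0.5625 bytes, from 34 and 18 bytes per 32-element block); the model dimensions are placeholder assumptions, not GLM's or Qwen's actual values:

```python
# Approximate bytes per cached element for llama.cpp KV cache types.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * BYTES_PER_ELEM[cache_type]

# Hypothetical GQA model: 32 layers, 8 KV heads, head_dim 128, 32K context.
for t in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(32, 8, 128, 32768, t) / 2**30
    print(f"{t}: {gib:.2f} GiB")
```

Plug in your model's layer count, KV head count, head dimension, and context length to see what Q4 buys you before you benchmark.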

How to Test This Yourself

If you're running local models with llama.cpp, testing Q4 vs Q8 KV cache is straightforward:

# Q8 KV Cache (conventional recommendation)
./llama-server \
  -m your-model.gguf \
  -c 168000 \
  -ctk q8_0 \
  -ctv q8_0 \
  --flash-attn on

# Q4 KV Cache (try it yourself)
./llama-server \
  -m your-model.gguf \
  -c 168000 \
  -ctk q4_0 \
  -ctv q4_0 \
  --flash-attn on

Run the same prompt on both. Check output quality. Check generation speed. Check whether the model completes or gets stuck. Your results may differ from mine — that's exactly the point.
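If you want numbers rather than eyeballing, a small harness can time both configurations on the same prompt. This sketch assumes llama-server's native /completion endpoint (which reports `tokens_predicted`) and two instances on ports 8080 and 8081 — both assumptions to adjust for your setup:

```python
import json
import time
import urllib.request

def tokens_per_sec(n_tokens, elapsed):
    """Throughput helper; guards against a zero-length timing window."""
    return n_tokens / elapsed if elapsed > 0 else 0.0

def benchmark(url, prompt, n_predict=256):
    """Send one completion request; return (tokens generated, t/s)."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
    elapsed = time.monotonic() - start
    n_tokens = result.get("tokens_predicted", 0)
    return n_tokens, tokens_per_sec(n_tokens, elapsed)

# Example usage (requires two running llama-server instances):
# for label, url in [("q8_0", "http://localhost:8080/completion"),
#                    ("q4_0", "http://localhost:8081/completion")]:
#     tokens, tps = benchmark(url, "Write a token bucket rate limiter.")
#     print(f"{label}: {tokens} tokens at {tps:.1f} t/s")
```

Wall-clock throughput is only half the picture — still read the outputs, since the quality gap was the more interesting result in my tests.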

The Takeaway

Conventional wisdom in local AI circles often comes from specific tests on specific models. Those results get generalized into universal rules. But models aren't universal — they're diverse architectures with different properties.

Q4 KV cache might be unusable for some models. It might also be perfectly fine — or even preferable — for others. The only way to know is to test.

For my coding workloads on GLM-4.7-Flash and Qwen3-Coder-Next, Q4 KV cache has become my default. Faster generation, cleaner completions, and equal or better code quality. Your mileage may vary.

Don't take my word for it. Don't take the conventional wisdom either.

Test both configurations on your models, your hardware, your use cases. Make decisions based on your own benchmarks, not assumptions from other people's tests on different models.

Resources