⚡ February 2026 Update: 4x Speed Improvement

New optimizations push throughput from 22 t/s → 94 t/s (about 4.3x). Key changes:

  • Q8 KV Cache — 50% smaller KV buffer, more room for model weights on GPU
  • Abliterated Model — huihui-ai's uncensored variant, no safety refusals
  • llama.cpp b8054 — Latest build with Flash Attention optimizations (PR #19375)
  • AUTO-FIT Mode — Let llama.cpp auto-optimize GPU/CPU layer split
  • 168K Context — Sweet spot balancing speed vs context window

Background

Throughput is the main bottleneck when running large language models locally. For Qwen3-Coder-Next — an 80B-parameter Mixture of Experts model with 512 experts per layer — the choice of quantization and KV cache directly determines how many layers fit in VRAM and how fast each token generates.

My original tests compared UD-IQ3_XXS (33 GB) vs UD-IQ2_XXS (26 GB). Since then, I've switched to the huihui-ai Abliterated IQ2_XS variant (21 GB) for uncensored responses, and discovered that Q8 KV cache provides a massive speed boost.

Current Benchmark Results

All tests run on RTX 4090 with llama.cpp build 8054. The current "MAX SPEED" configuration uses AUTO-FIT mode with Q8 KV cache.

Configuration                  Speed     Context   KV Cache
Abliterated IQ2_XS (Current)   ~94 t/s   168K      Q8
UD-IQ2_XXS (Original)          22 t/s    200K      Q4
UD-IQ3_XXS                     12 t/s    200K      Q4

What Changed?

  • Q8 KV Cache — Reduces KV buffer by 50% vs f16 (2GB instead of 4GB at 168K context). This frees VRAM for more model weights on GPU. Q8 has "very small loss, usually no noticeable impact" compared to f16.
  • 168K Context — Slightly smaller than 200K, but the reduced KV cache size allows more layers on GPU, dramatically increasing speed.
  • AUTO-FIT Mode — Let llama.cpp automatically determine optimal GPU/CPU layer split instead of manual -ngl tuning.
  • Abliterated Model — huihui-ai's IQ2_XS variant removes safety refusals. Smaller (21GB vs 26GB) and fits more comfortably in VRAM.
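
The KV-cache saving can be sanity-checked with back-of-envelope arithmetic. This is a minimal sketch: the layer and head counts below are illustrative placeholders chosen to land near the 4 GB f16 figure, not Qwen3-Coder-Next's actual attention config; q8_0 is modeled as 34 bytes per block of 32 values:

```shell
# KV cache size estimate: (K + V values per token) x context length
# x bytes per value for each cache type. All dims are ILLUSTRATIVE.
CTX=168000        # context length
KV_LAYERS=6       # hypothetical: layers that keep a full KV cache
N_KV_HEADS=8      # hypothetical: grouped-query KV heads
HEAD_DIM=128      # hypothetical: head dimension

PER_TOKEN=$((2 * KV_LAYERS * N_KV_HEADS * HEAD_DIM))  # K + V values per token
F16_BYTES=$((PER_TOKEN * CTX * 2))                    # f16: 2 bytes per value
Q8_BYTES=$((PER_TOKEN * CTX * 34 / 32))               # q8_0: 34 bytes per 32 values

echo "f16:  $((F16_BYTES / 1024 / 1024)) MiB"
echo "q8_0: $((Q8_BYTES / 1024 / 1024)) MiB"
```

With these placeholder dims the sketch gives roughly 3.9 GiB for f16 and 2.1 GiB for q8_0, reproducing the roughly-half ratio described above.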

Original IQ2 vs IQ3 Comparison

Original tests run at 200K context on the same hardware. Each metric averaged across 5 runs.

Metric              IQ2_XXS     IQ3_XXS     Delta
Overall Speed       21.78 t/s   11.79 t/s   +84.7%
Coding Tasks        21.64 t/s   11.36 t/s   +90.5%
Technical Writing   22.00 t/s   12.44 t/s   +76.9%
Model Size          26 GB       33 GB       -21%
GPU Layers          42 / 48     34 / 48     +23%
CPU-MoE Layers      10          18          -44%
Load Time           ~7s         ~12s        -42%

IQ2 is roughly 85% faster while being 7 GB smaller. The speed gain comes from two factors: the smaller file means more layers fit in GPU memory (42 vs 34), and fewer expert weights need to transfer from system RAM per token (10 CPU-MoE layers instead of 18).
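
The percentage deltas in the table can be reproduced directly from the averaged throughputs, for example with awk:

```shell
# Recompute the speed deltas from the raw per-category averages
awk 'BEGIN {
  printf "overall: +%.1f%%\n", (21.78 - 11.79) / 11.79 * 100
  printf "coding:  +%.1f%%\n", (21.64 - 11.36) / 11.36 * 100
}'
```

This prints +84.7% overall and +90.5% for coding tasks, matching the table.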

Quality Comparison

The obvious concern with more aggressive quantization is output quality. I tested both models across several categories using identical prompts:

  • Code generation (Python, JavaScript, HTML/CSS) — no observable difference in correctness or style
  • Technical writing (documentation, explanations) — indistinguishable outputs
  • Debugging (bug identification, optimization) — same insights from both
  • Complex reasoning (multi-step logic, architecture decisions) — comparable quality

As a concrete example, I asked both models to generate a complete Snake game in JavaScript with HTML5 canvas, score tracking, collision detection, and responsive design. IQ3 produced it in ~25 seconds at 12.45 t/s. IQ2 produced functionally identical code in ~15 seconds at 21.21 t/s. Both games worked correctly.

For coding and technical tasks, the quantization loss between IQ3 and IQ2 appears negligible in practice.

Why IQ2 Is Faster

The performance difference comes down to memory bandwidth. Qwen3-Coder-Next uses a Mixture of Experts architecture: 512 experts per layer, 10 active per token. At inference time, the active experts' weights must be fetched in every layer for every generated token.

IQ2_XXS uses 2-bit quantization compared to IQ3's 3-bit. This reduces the model from 33 GB to 26 GB, which has cascading effects:

  • More GPU layers — 42 of 48 layers fit in 24 GB VRAM, versus 34 for IQ3
  • Less CPU offload — only 10 layers on CPU instead of 18, meaning less data moves across the PCIe bus per token
  • Higher bandwidth efficiency — each expert weight transfer is ~33% smaller

The architecture (80B parameters, 512 experts/layer, MoE design) is unchanged. IQ2 simply packs the same weights into a smaller representation.
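
To see why the CPU-MoE layer count dominates, here is a rough per-token transfer estimate. The per-expert size is a made-up placeholder, not a value measured from the GGUF; only the proportionality matters:

```shell
# Bytes pulled from system RAM per generated token for CPU-offloaded
# MoE layers: CPU layers x active experts x per-expert weight size.
ACTIVE_EXPERTS=10   # experts routed per token (from the model card)
EXPERT_KB=2048      # HYPOTHETICAL: ~2 MiB of quantized weights per active expert

for CPU_LAYERS in 10 18; do
  MIB=$((CPU_LAYERS * ACTIVE_EXPERTS * EXPERT_KB / 1024))
  echo "$CPU_LAYERS CPU-MoE layers -> ~$MIB MiB per token"
done
```

Whatever the real per-expert size is, going from 18 to 10 CPU layers cuts per-token PCIe traffic by ~44%, which lines up with the throughput gap.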

Hardware and Configuration

Test System

GPU: NVIDIA RTX 4090 24GB
RAM: 64GB DDR4-3200
CPU: AMD Ryzen 9 5950X (16 cores)
OS: Ubuntu 24.04 LTS
llama.cpp: b8054 (February 2026) with Flash Attention PR #19375

Current Server Configuration (MAX SPEED)

# 168K context (sweet spot), Q8 KV cache for both K and V,
# AUTO-FIT mode: no -ngl flag, llama.cpp picks the layer split
./llama-server \
  -m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
  --port 8085 \
  -c 168000 \
  -ctk q8_0 \
  -ctv q8_0 \
  --flash-attn on \
  --jinja \
  --metrics \
  --host 0.0.0.0

Key difference: No manual -ngl or --n-cpu-moe flags. AUTO-FIT mode automatically places 49 layers on GPU with 11 overflowing to CPU. The Q8 KV cache saves ~2GB VRAM at 168K context compared to f16, allowing this better layer distribution.
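
Once running, the server speaks llama.cpp's OpenAI-compatible API, so any HTTP client works. A minimal sketch (the model field is required by the API shape, but llama-server serves whichever model it loaded; the prompt is just an example):

```shell
# Build a chat-completions request body for the local server
chat_payload() {
  printf '{"model":"local","messages":[{"role":"user","content":"%s"}],"max_tokens":256}' "$1"
}

# Usage, assuming the MAX SPEED server above is listening on 8085:
# chat_payload "Write FizzBuzz in Python" \
#   | curl -s http://localhost:8085/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```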

Original Configuration (IQ2_XXS)

# 200K context, 42 GPU layers, 10 CPU-offloaded MoE layers,
# q4_0 KV cache compression, batch sizes 1024, 16 threads
./llama-server \
  -m Qwen3-Coder-Next-UD-IQ2_XXS.gguf \
  --port 8084 \
  -c 200000 \
  -ngl 42 \
  --n-cpu-moe 10 \
  -b 1024 --ub 1024 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --flash-attn on \
  --cont-batching \
  --jinja \
  -t 16

For IQ3_XXS, substitute -ngl 34 --n-cpu-moe 18. The larger file fits fewer layers in VRAM, so more layers must be offloaded to the CPU.

Key configuration notes: q4_0 KV cache compression and flash attention are essential — without them, the 200K context window would exceed available VRAM. The 16 threads match the 5950X core count for optimal CPU-MoE throughput.

Getting the Models

Model variants available on Hugging Face in GGUF format:

# RECOMMENDED: Abliterated + Fast (MAX SPEED config)
huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf   # ~21 GB

# Original UD variants (for comparison)
Qwen3-Coder-Next-UD-IQ2_XXS.gguf                     # ~26 GB
Qwen3-Coder-Next-UD-IQ3_XXS.gguf                     # ~33 GB

The huihui-ai Abliterated variant removes safety refusals while maintaining code quality. At 21GB, it fits more comfortably in VRAM than the 26GB UD-IQ2_XXS, allowing better GPU layer distribution.

The UD (Unsloth Dynamic) variants keep the full 512 experts per layer, compared to 308 in REAP-pruned variants. The higher expert count preserves better task specialization within the MoE routing.

Claude Code Integration

I run the abliterated IQ2_XS as a local backend for Claude Code via a wrapper script:

# ~/.local/bin/coder-next-ab-fast (v1.1)
MODEL="/path/to/huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf"
PORT=8085
CONTEXT=168000
KV_CACHE="q8_0"     # both K and V
MODE="AUTO-FIT"     # no manual -ngl

This provides local coding assistance with 168K context at ~94 t/s, no API costs, and no data leaving the machine. The wrapper automatically starts the llama.cpp server with Q8 KV cache and AUTO-FIT optimization.
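
One detail worth handling in such a wrapper: the model takes several seconds to load, so it helps to poll llama-server's GET /health endpoint before pointing Claude Code at the port. A sketch (port and retry count are parameters; the defaults are assumptions):

```shell
# Block until llama-server reports healthy, or give up after N tries
wait_ready() {
  port="${1:-8085}"
  tries="${2:-30}"
  i=0
  until curl -sf "http://localhost:${port}/health" >/dev/null 2>&1; do
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && return 1
    sleep 1
  done
}
```

Called as `wait_ready "$PORT" 30`, it returns nonzero if the server never comes up, letting the wrapper fail loudly instead of handing Claude Code a dead endpoint.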

Trade-offs

  • Quantization loss — theoretically present, but not observable in coding or technical writing tasks during my testing
  • Hardware requirement — still requires 24 GB VRAM for the GPU-layer split to work well
  • Model availability — not all models have IQ2 quantizations yet; this depends on upstream GGUF support

For users with less than 24 GB VRAM, IQ2 would still be faster than IQ3 at equivalent layer counts, but the absolute throughput would be lower.

What's Next

With the current optimizations in place, these are the areas I'm watching for further gains:

  • ✓ KV cache compression — Q8 cache implemented, saving 2GB VRAM at 168K context. Avoid Q4 KV cache — it "tanks quality in any model".
  • llama.cpp updates — PR #19375 brought 8.5% speed improvement. Watching for future Flash Attention enhancements.
  • IQ4_XS experiment — Could try higher quality quant at reduced context (100K) to compare quality vs speed tradeoff.
  • Expert caching — Caching frequently-used experts in GPU memory to reduce CPU-MoE transfer overhead.

The current MAX SPEED configuration (Abliterated IQ2_XS + Q8 KV + 168K context) is the fastest practical local setup I've found for this model on a single RTX 4090.

Summary

Abliterated IQ2_XS + Q8 KV Cache is the optimal single-GPU configuration.

~94 t/s at 168K context on RTX 4090 — over 4x faster than the original IQ3 setup. The huihui-ai abliterated variant removes safety refusals while the Q8 KV cache provides the VRAM savings needed for optimal GPU layer placement.

Original finding still holds: IQ2 quantization has no measurable quality loss for coding tasks. The Q8 KV cache adds minimal quality impact while providing significant speed gains.

Resources