⚡ February 2026 Update: 4x Speed Improvement
New optimizations push throughput from 22 t/s → 94 t/s (~4.3x faster). Key changes:
- Q8 KV Cache — 50% smaller KV buffer, more room for model weights on GPU
- Abliterated Model — huihui-ai's uncensored variant, no safety refusals
- llama.cpp b8054 — Latest build with Flash Attention optimizations (PR #19375)
- AUTO-FIT Mode — Let llama.cpp auto-optimize GPU/CPU layer split
- 168K Context — Sweet spot balancing speed vs context window
Background
Throughput is the main bottleneck when running large language models locally. For Qwen3-Coder-Next — an 80B-parameter Mixture of Experts model with 512 experts per layer — the choice of quantization and KV cache directly determines how many layers fit in VRAM and how fast each token generates.
My original tests compared UD-IQ3_XXS (33 GB) vs UD-IQ2_XXS (26 GB). Since then, I've switched to the huihui-ai Abliterated IQ2_XS variant (21 GB) for uncensored responses, and discovered that Q8 KV cache provides a massive speed boost.
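The trade-off between quantization size and GPU residency can be sketched with simple arithmetic. This is a deliberately crude estimator with an assumed fixed overhead; it treats all layers as equal-sized and ignores that `--n-cpu-moe` offloads only a layer's expert tensors rather than whole layers, so it won't reproduce the exact layer counts below, only the direction of the effect:

```python
def layers_on_gpu(model_gb, n_layers, vram_gb, kv_cache_gb, overhead_gb=1.5):
    """Rough estimate of how many full layers fit in VRAM.

    Assumes equal-sized layers and a fixed runtime overhead; real
    llama.cpp placement (and --n-cpu-moe) is more fine-grained.
    """
    per_layer_gb = model_gb / n_layers
    budget_gb = vram_gb - kv_cache_gb - overhead_gb
    return min(n_layers, max(0, int(budget_gb / per_layer_gb)))

# 26 GB IQ2_XXS vs 33 GB IQ3_XXS on a 24 GB card with a 2 GB KV cache:
print(layers_on_gpu(26, 48, 24, 2))  # smaller quant -> more layers on GPU
print(layers_on_gpu(33, 48, 24, 2))
```

The point of the sketch: every GB saved by a smaller quant converts directly into extra resident layers, which is where the throughput comes from.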
Current Benchmark Results
All tests were run on an RTX 4090 with llama.cpp build 8054. The current "MAX SPEED" configuration uses AUTO-FIT mode with a Q8 KV cache.
| Configuration | Speed | Context | KV Cache |
|---|---|---|---|
| Abliterated IQ2_XS (Current) | ~94 t/s | 168K | Q8 |
| UD-IQ2_XXS (Original) | 22 t/s | 200K | Q4 |
| UD-IQ3_XXS | 12 t/s | 200K | Q4 |
What Changed?
- Q8 KV Cache — Reduces KV buffer by 50% vs f16 (2GB instead of 4GB at 168K context). This frees VRAM for more model weights on GPU. Q8 has "very small loss, usually no noticeable impact" compared to f16.
- 168K Context — Slightly smaller than 200K, but the reduced KV cache size allows more layers on GPU, dramatically increasing speed.
- AUTO-FIT Mode — Let llama.cpp automatically determine the optimal GPU/CPU layer split instead of manual -ngl tuning.
- Abliterated Model — huihui-ai's IQ2_XS variant removes safety refusals. It's also smaller (21 GB vs 26 GB) and fits more comfortably in VRAM.
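The KV-cache saving is easy to verify with back-of-the-envelope arithmetic. The sketch below assumes the Qwen3-Next hybrid attention layout (12 of the 48 layers use full attention, with 2 KV heads of head dimension 256; the linear-attention layers need no KV cache). Check the GGUF metadata for your model's actual values:

```python
def kv_cache_bytes(full_attn_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value):
    # K and V each store n_kv_heads * head_dim values per token per layer.
    return 2 * full_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value

n_ctx = 168_000
f16 = kv_cache_bytes(12, 2, 256, n_ctx, 2.0)      # f16: 2 bytes per value
q8  = kv_cache_bytes(12, 2, 256, n_ctx, 34 / 32)  # q8_0: 34 bytes per 32-value block
print(f"f16:  {f16 / 1e9:.1f} GB")
print(f"q8_0: {q8 / 1e9:.1f} GB")
```

With these assumed dimensions the numbers land close to the ~4 GB (f16) and ~2 GB (q8_0) figures quoted above.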
Original IQ2 vs IQ3 Comparison
Original tests run at 200K context on the same hardware. Each metric averaged across 5 runs.
| Metric | IQ2_XXS | IQ3_XXS | Delta (IQ2 vs IQ3) |
|---|---|---|---|
| Overall Speed | 21.78 t/s | 11.79 t/s | +84.7% |
| Coding Tasks | 21.64 t/s | 11.36 t/s | +90.5% |
| Technical Writing | 22.00 t/s | 12.44 t/s | +76.9% |
| Model Size | 26 GB | 33 GB | -21% |
| GPU Layers | 42 / 48 | 34 / 48 | +23% |
| CPU-MoE Layers | 10 | 18 | -44% |
| Load Time | ~7s | ~12s | -42% |
IQ2 is roughly 85% faster while being 7 GB smaller. The speed gain comes from two factors: the smaller file means more layers fit in GPU memory (42 vs 34), and fewer expert weights need to transfer from system RAM per token (10 CPU-MoE layers instead of 18).
Quality Comparison
The obvious concern with more aggressive quantization is output quality. I tested both models across several categories using identical prompts:
- Code generation (Python, JavaScript, HTML/CSS) — no observable difference in correctness or style
- Technical writing (documentation, explanations) — indistinguishable outputs
- Debugging (bug identification, optimization) — same insights from both
- Complex reasoning (multi-step logic, architecture decisions) — comparable quality
As a concrete example, I asked both models to generate a complete Snake game in JavaScript with HTML5 canvas, score tracking, collision detection, and responsive design. IQ3 produced it in ~25 seconds at 12.45 t/s. IQ2 produced functionally identical code in ~15 seconds at 21.21 t/s. Both games worked correctly.
For coding and technical tasks, the quantization loss between IQ3 and IQ2 appears negligible in practice.
Why IQ2 Is Faster
The performance difference comes down to memory bandwidth. Qwen3-Coder-Next uses a Mixture of Experts architecture: 512 experts per layer, 10 active per token. At inference time, the model needs to load the active expert weights for each layer of each token.
IQ2_XXS uses 2-bit quantization compared to IQ3's 3-bit. This reduces the model from 33 GB to 26 GB, which has cascading effects:
- More GPU layers — 42 of 48 layers fit in 24 GB VRAM, versus 34 for IQ3
- Less CPU offload — only 10 layers on CPU instead of 18, meaning less data moves across the PCIe bus per token
- Higher bandwidth efficiency — each expert weight transfer is ~33% smaller
The architecture (80B parameters, 512 experts/layer, MoE design) is unchanged. IQ2 simply packs the same weights into a smaller representation.
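A back-of-the-envelope model of the PCIe bottleneck illustrates the cascade. All numbers here are illustrative assumptions, not measurements: equal-sized layers, 10 active of 512 experts (from the article), and ~25 GB/s effective PCIe 4.0 x16 bandwidth (my assumption). It models only the expert-weight transfer cost, so the results are ceilings, not predictions:

```python
def pcie_ceiling_tps(model_gb, n_layers, cpu_moe_layers,
                     active_experts=10, total_experts=512, pcie_gbps=25):
    """Upper bound on tokens/s if transferring the active experts'
    weights for CPU-resident layers were the only per-token cost."""
    gb_per_layer = model_gb / n_layers
    active_fraction = active_experts / total_experts
    gb_per_token = cpu_moe_layers * gb_per_layer * active_fraction
    return pcie_gbps / gb_per_token

print(f"IQ2, 10 CPU-MoE layers: ~{pcie_ceiling_tps(26, 48, 10):.0f} t/s ceiling")
print(f"IQ3, 18 CPU-MoE layers: ~{pcie_ceiling_tps(33, 48, 18):.0f} t/s ceiling")
```

Even as a toy model, it shows why fewer and smaller CPU-resident MoE layers roughly double the achievable rate.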
Hardware and Configuration
Test System
GPU: NVIDIA RTX 4090 24GB
RAM: 64GB DDR4-3200
CPU: AMD Ryzen 9 5950X (16 cores)
OS: Ubuntu 24.04 LTS
llama.cpp: b8054 (February 2026) with Flash Attention PR #19375
Current Server Configuration (MAX SPEED)
./llama-server \
    -m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
    --port 8085 \
    -c 168000 \
    -ctk q8_0 \
    -ctv q8_0 \
    --flash-attn on \
    --jinja \
    --metrics \
    --host 0.0.0.0
# -c 168000: 168K context (sweet spot). -ctk/-ctv q8_0: Q8 KV cache for K and V.
# No -ngl flag: AUTO-FIT lets llama.cpp pick the GPU/CPU layer split.
Key difference: No manual -ngl or --n-cpu-moe flags. AUTO-FIT mode automatically places 49 layers on GPU with 11 overflowing to CPU. The Q8 KV cache saves ~2GB VRAM at 168K context compared to f16, allowing this better layer distribution.
Original Configuration (IQ2_XXS)
./llama-server \
    -m Qwen3-Coder-Next-UD-IQ2_XXS.gguf \
    --port 8084 \
    -c 200000 \
    -ngl 42 \
    --n-cpu-moe 10 \
    -b 1024 --ub 1024 \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    --flash-attn on \
    --cont-batching \
    --jinja \
    -t 16
# -c 200000: 200K context. -ngl 42: GPU layers. --n-cpu-moe 10: CPU-offloaded MoE layers.
# -b/--ub 1024: batch sizes. --cache-type-k/v q4_0: KV cache compression.
For IQ3_XXS, substitute -ngl 34 --n-cpu-moe 18. The larger model fits fewer layers on GPU and needs more CPU offloading.
Key configuration notes: q4_0 KV cache compression and flash attention are essential — without them, the 200K context window would exceed available VRAM. The 16 threads match the 5950X core count for optimal CPU-MoE throughput.
Getting the Models
Model variants available on Hugging Face in GGUF format:
# RECOMMENDED: Abliterated + Fast (MAX SPEED config)
huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf # ~21 GB
# Original UD variants (for comparison)
Qwen3-Coder-Next-UD-IQ2_XXS.gguf # ~26 GB
Qwen3-Coder-Next-UD-IQ3_XXS.gguf # ~33 GB
The huihui-ai Abliterated variant removes safety refusals while maintaining code quality. At 21GB, it fits more comfortably in VRAM than the 26GB UD-IQ2_XXS, allowing better GPU layer distribution.
The UD (Unsloth Dynamic) variants keep all 512 experts per layer, compared to REAP's pruned 308. The higher expert count preserves more task specialisation within the MoE routing.
Claude Code Integration
I run the abliterated IQ2_XS as a local backend for Claude Code via a wrapper script:
#!/usr/bin/env bash
# ~/.local/bin/coder-next-ab-fast (v1.1)
# Assumes llama-server is on PATH; adjust MODEL to your local path.
MODEL="/path/to/huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf"
PORT=8085
CONTEXT=168000
KV_CACHE="q8_0"   # both K and V
# No -ngl flag: AUTO-FIT lets llama.cpp choose the GPU/CPU layer split
exec llama-server -m "$MODEL" --port "$PORT" -c "$CONTEXT" \
    -ctk "$KV_CACHE" -ctv "$KV_CACHE" --flash-attn on --jinja --metrics
This provides local coding assistance with 168K context at ~94 t/s, no API costs, and no data leaving the machine. The wrapper automatically starts the llama.cpp server with Q8 KV cache and AUTO-FIT optimization.
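To put ~94 t/s in user-facing terms, a quick bit of arithmetic (generation time only, ignoring prompt processing) shows what each configuration means for a typical coding reply:

```python
def response_seconds(tokens, tps):
    """Time to generate a reply of `tokens` length at `tps` tokens/s."""
    return tokens / tps

# Compare the three measured rates for a 500-token reply:
for tps in (94, 22, 12):
    print(f"{tps:>3} t/s -> {response_seconds(500, tps):.1f}s for a 500-token reply")
```

At 94 t/s a medium-length answer arrives in about five seconds, which is the difference between an interactive assistant and a batch tool.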
Trade-offs
- Quantization loss — theoretically present, but not observable in coding or technical writing tasks during my testing
- Hardware requirement — still requires 24 GB VRAM for the GPU-layer split to work well
- Model availability — not all models have IQ2 quantizations yet; this depends on upstream GGUF support
For users with less than 24 GB VRAM, IQ2 would still be faster than IQ3 at equivalent layer counts, but the absolute throughput would be lower.
What's Next
With the current optimizations in place, these are the areas I'm watching for further gains:
- ✓ KV cache compression — Q8 cache implemented, saving 2GB VRAM at 168K context. Avoid Q4 KV cache — it "tanks quality in any model".
- llama.cpp updates — PR #19375 brought 8.5% speed improvement. Watching for future Flash Attention enhancements.
- IQ4_XS experiment — Could try higher quality quant at reduced context (100K) to compare quality vs speed tradeoff.
- Expert caching — Caching frequently-used experts in GPU memory to reduce CPU-MoE transfer overhead.
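Expert caching is speculative on my part, but the intuition behind the last bullet is easy to simulate: if expert routing is skewed (some experts are "hot"), even a small GPU-resident LRU cache gets a high hit rate. This is a hypothetical sketch using a synthetic Zipf-like routing distribution, not real router statistics from the model:

```python
import random
from collections import OrderedDict

def simulate_hit_rate(n_experts=512, cache_size=64, n_tokens=10_000, seed=0):
    """Hit rate of an LRU cache of expert weights under skewed routing."""
    rng = random.Random(seed)
    # Synthetic Zipf-like skew: expert i is routed to with weight 1/(i+1).
    weights = [1 / (i + 1) for i in range(n_experts)]
    cache, hits, total = OrderedDict(), 0, 0
    for _ in range(n_tokens):
        # 10 active experts per token, as in Qwen3-Coder-Next
        for expert in rng.choices(range(n_experts), weights=weights, k=10):
            total += 1
            if expert in cache:
                hits += 1
                cache.move_to_end(expert)       # mark as recently used
            else:
                cache[expert] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)   # evict least-recently-used
    return hits / total

print(f"hit rate with 64/512 experts cached: {simulate_hit_rate():.0%}")
```

If real routing is anywhere near this skewed, a cache holding an eighth of the experts would serve most activations from VRAM; if routing is close to uniform, the idea buys little.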
The current MAX SPEED configuration (Abliterated IQ2_XS + Q8 KV + 168K context) is the fastest practical local setup I've found for this model on single RTX 4090.
Summary
Abliterated IQ2_XS + Q8 KV Cache is the optimal single-GPU configuration.
~94 t/s at 168K context on RTX 4090 — over 4x faster than the original 22 t/s IQ2_XXS setup, and nearly 8x faster than IQ3_XXS. The huihui-ai abliterated variant removes safety refusals, while the Q8 KV cache provides the VRAM savings needed for optimal GPU layer placement.
The original finding still holds: IQ2 quantization showed no quality loss I could measure on coding tasks. The Q8 KV cache adds minimal quality impact while providing significant speed gains.
Resources
- llama.cpp — github.com/ggerganov/llama.cpp (build 8054+ recommended for PR #19375)
- Abliterated Model — bartowski/huihui-ai_Qwen3-Coder-Next-abliterated-GGUF
- Original UD Models — available on Hugging Face (search "Qwen3-Coder-Next GGUF")
- Related — Running Uncensored AI Locally (GLM-4.7-Flash setup guide)
- Setup help — AI Setup & Consultation
