Background

Throughput is the main bottleneck when running large language models locally. For Qwen3-Coder-Next — an 80B-parameter Mixture of Experts model with 512 experts per layer — the choice of quantization directly determines how many layers fit in VRAM and how quickly each token is generated.

I tested the two available GGUF quantizations: UD-IQ3_XXS (33 GB) and UD-IQ2_XXS (26 GB). The difference was significant enough to document.

Benchmark Results

All tests were run at 200K context on the same hardware, same prompts, same llama.cpp build. Each metric is averaged across 5 runs.

Metric               IQ2_XXS     IQ3_XXS     Delta (IQ2 relative to IQ3)
Overall Speed        21.78 t/s   11.79 t/s   +84.7%
Coding Tasks         21.64 t/s   11.36 t/s   +90.5%
Technical Writing    22.00 t/s   12.44 t/s   +76.9%
Model Size           26 GB       33 GB       -21%
GPU Layers           42 / 48     34 / 48     +23%
CPU-MoE Layers       10          18          -44%
Load Time            ~7s         ~12s        -42%

IQ2 is roughly 85% faster while being 7 GB smaller. The speed gain comes from two factors: the smaller file means more layers fit in GPU memory (42 vs 34), and fewer expert weights need to transfer from system RAM per token (10 CPU-MoE layers instead of 18).
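
If you want to sanity-check these numbers on your own hardware, here is a minimal sketch (not the exact harness I used): llama-server's native /completion endpoint returns a timings object, so you can average predicted_per_second over a few runs with curl and jq.

# Prompt is arbitrary; the benchmarks above used separate coding and writing prompts.
for i in $(seq 1 5); do
  curl -s http://localhost:8084/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a Python function that parses a CSV file.", "n_predict": 256}' \
    | jq '.timings.predicted_per_second'
done | awk '{ sum += $1 } END { printf "average: %.2f t/s over %d runs\n", sum / NR, NR }'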

Quality Comparison

The obvious concern with more aggressive quantization is output quality. I tested both models across several categories using identical prompts:

  • Code generation (Python, JavaScript, HTML/CSS) — no observable difference in correctness or style
  • Technical writing (documentation, explanations) — indistinguishable outputs
  • Debugging (bug identification, optimization) — same insights from both
  • Complex reasoning (multi-step logic, architecture decisions) — comparable quality

As a concrete example, I asked both models to generate a complete Snake game in JavaScript with HTML5 canvas, score tracking, collision detection, and responsive design. IQ3 produced it in ~25 seconds at 12.45 t/s. IQ2 produced functionally identical code in ~15 seconds at 21.21 t/s. Both games worked correctly.

For coding and technical tasks, the quantization loss between IQ3 and IQ2 appears negligible in practice.
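
To repeat the side-by-side comparison yourself, a simple approach is to run one server per quantization and diff the outputs for an identical prompt. A sketch, assuming the IQ3 instance is started on a second port (8085 here is arbitrary), with temperature 0 so each model is deterministic:

PROMPT='Write a complete Snake game in JavaScript using HTML5 canvas with score tracking.'
for port in 8084 8085; do   # 8084 = IQ2 server, 8085 = assumed second instance running IQ3
  curl -s "http://localhost:${port}/completion" \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"${PROMPT}\", \"n_predict\": 2048, \"temperature\": 0}" \
    | jq -r '.content' > "out_${port}.txt"
done
diff out_8084.txt out_8085.txt | head

Even at temperature 0 the two outputs will rarely be byte-identical, so the diff is a starting point for manual review rather than a pass/fail test.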

Why IQ2 Is Faster

The performance difference comes down to memory bandwidth. Qwen3-Coder-Next uses a Mixture of Experts architecture: 512 experts per layer, of which 10 are active per token. For every generated token, the weights of the active experts must be fetched in every layer, and for layers offloaded to the CPU that fetch has to cross the PCIe bus.

IQ2_XXS uses 2-bit quantization compared to IQ3's 3-bit. This reduces the model from 33 GB to 26 GB, which has cascading effects:

  • More GPU layers — 42 of 48 layers fit in 24 GB VRAM, versus 34 for IQ3
  • Less CPU offload — only 10 layers on CPU instead of 18, meaning less data moves across the PCIe bus per token
  • Higher bandwidth efficiency — each expert weight transfer is ~33% smaller

The architecture (80B parameters, 512 experts/layer, MoE design) is unchanged. IQ2 simply packs the same weights into a smaller representation.
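
As a back-of-envelope illustration of the bandwidth argument, the data streamed from system RAM per generated token scales with active experts per token, CPU-offloaded layers, and bytes per expert. The per-expert sizes below are assumed placeholders, not measured values:

# Rough per-token transfer estimate for the CPU-offloaded MoE layers.
# EXPERT_MB_* are assumptions -- substitute the real per-expert tensor sizes for your quant.
ACTIVE_EXPERTS=10
EXPERT_MB_IQ2=0.8    # hypothetical size of one 2-bit expert, in MB
EXPERT_MB_IQ3=1.2    # hypothetical size of one 3-bit expert, in MB
CPU_LAYERS_IQ2=10
CPU_LAYERS_IQ3=18
awk -v e="$ACTIVE_EXPERTS" \
    -v m2="$EXPERT_MB_IQ2" -v m3="$EXPERT_MB_IQ3" \
    -v l2="$CPU_LAYERS_IQ2" -v l3="$CPU_LAYERS_IQ3" \
    'BEGIN { printf "IQ2: ~%.0f MB/token   IQ3: ~%.0f MB/token\n", e*m2*l2, e*m3*l3 }'

With those placeholder numbers IQ3 would move several times more data over PCIe per token, which matches the direction (though not the exact size) of the measured speed gap.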

Hardware and Configuration

Test System

GPU: NVIDIA RTX 4090 (24 GB)
RAM: 64 GB DDR4-3200
CPU: AMD Ryzen 9 5950X (16 cores)
OS: Ubuntu 24.04 LTS
llama.cpp: b4785 (February 2026)

Server Configuration (IQ2_XXS)

# 200K context, 42 GPU layers, 10 CPU-offloaded MoE layers,
# batch/micro-batch size 1024, q4_0 KV cache compression
./llama-server \
  -m Qwen3-Coder-Next-UD-IQ2_XXS.gguf \
  --port 8084 \
  -c 200000 \
  -ngl 42 \
  --n-cpu-moe 10 \
  -b 1024 -ub 1024 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --flash-attn on \
  --cont-batching \
  --jinja \
  -t 16

For IQ3_XXS, substitute -ngl 34 --n-cpu-moe 18: the larger file fits fewer layers in VRAM, so more MoE layers have to be offloaded to the CPU.

Key configuration notes: q4_0 KV cache compression and flash attention are essential — without them, the 200K context window would exceed available VRAM. The 16 threads match the 5950X core count for optimal CPU-MoE throughput.
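
When tuning -ngl and --n-cpu-moe for a different card, it helps to watch VRAM while the server loads; a quick check:

# If memory.used creeps toward the 24 GB ceiling during warm-up, lower -ngl
# by a layer or two and raise --n-cpu-moe to match.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader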

Getting the Models

Both quantizations are available on Hugging Face in GGUF format:

# Recommended
Qwen3-Coder-Next-UD-IQ2_XXS.gguf   # ~26 GB

# Alternative (larger, slower on single GPU)
Qwen3-Coder-Next-UD-IQ3_XXS.gguf   # ~33 GB

The UD (Unsloth Dynamic) quantizations keep the full 512 experts per layer, compared to 308 in the REAP-pruned variants. The higher expert count preserves more task specialisation within the MoE routing.
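
A download sketch using huggingface-cli; the repository ID below is a placeholder, since this post doesn't name one, so substitute whichever repo actually hosts the GGUF you want:

pip install -U "huggingface_hub[cli]"
huggingface-cli download <gguf-repo-id> \
  Qwen3-Coder-Next-UD-IQ2_XXS.gguf \
  --local-dir ./models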

Claude Code Integration

I run IQ2 as a local backend for Claude Code via a wrapper script:

# ~/.local/bin/coder-next-opt-full
MODEL="/path/to/Qwen3-Coder-Next-UD-IQ2_XXS.gguf"
PORT=8084
CONTEXT=200000
GPU_LAYERS=42
CPU_MOE=10
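
The rest of the wrapper isn't shown here; a minimal version, assuming it does nothing beyond launching llama-server (on the PATH) with those variables and the flags from the configuration above, would be:

exec llama-server \
  -m "$MODEL" \
  --port "$PORT" \
  -c "$CONTEXT" \
  -ngl "$GPU_LAYERS" \
  --n-cpu-moe "$CPU_MOE" \
  -b 1024 -ub 1024 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --flash-attn on --cont-batching --jinja -t 16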

This provides local coding assistance with 200K context, no API costs, and no data leaving the machine.

Trade-offs

  • Quantization loss — theoretically present, but not observable in coding or technical writing tasks during my testing
  • Hardware requirement — still requires 24 GB VRAM for the GPU-layer split to work well
  • Model availability — not all models have IQ2 quantizations yet; this depends on upstream GGUF support

For users with less than 24 GB VRAM, IQ2 would still be faster than IQ3 at equivalent layer counts, but the absolute throughput would be lower.

What's Next

Areas I'm watching for further performance gains:

  • KV cache compression — GEAR and KIVI techniques could reduce VRAM usage further, allowing more GPU layers or larger context
  • Expert caching — caching frequently-used experts in GPU memory to reduce CPU-MoE transfer overhead
  • Speculative decoding — using a smaller draft model to predict tokens in advance, reducing wall-clock time per token (a launch sketch follows this list)
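
For the speculative-decoding item, llama-server already accepts a draft model via -md / -ngld. A sketch only: the draft GGUF named below is hypothetical, and whether a suitable small draft exists for this model family is an open question.

# The draft filename is a placeholder; -ngld keeps the draft fully on GPU.
DRAFT=small-draft-model.gguf
./llama-server \
  -m Qwen3-Coder-Next-UD-IQ2_XXS.gguf \
  -md "$DRAFT" \
  -ngld 99 \
  --port 8084 -c 200000 -ngl 42 --n-cpu-moe 10 \
  -b 1024 -ub 1024 --cache-type-k q4_0 --cache-type-v q4_0 \
  --flash-attn on --cont-batching --jinja -t 16

The draft model takes VRAM of its own, so -ngl for the main model would likely need to come down a layer or two.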

IQ2_XXS at the configuration above is the fastest local setup I've found for this model.

Summary

IQ2_XXS is the clear choice for single-GPU local inference: roughly 85% faster generation, a 7 GB smaller footprint, and no measurable quality loss for coding tasks. If you have an RTX 4090 and want to run Qwen3-Coder-Next locally, there's no practical reason to prefer IQ3.

Resources