Background

Throughput is the main bottleneck when running large language models locally. For Qwen3-Coder-Next — an 80B-parameter Mixture of Experts model with 512 experts per layer — the choice of quantization directly determines how many layers fit in VRAM and how quickly each token is generated.

I tested the two available GGUF quantizations: UD-IQ3_XXS (33 GB) and UD-IQ2_XXS (26 GB). The difference was significant enough to document.

Benchmark Results

All tests were run at 200K context on the same hardware, same prompts, same llama.cpp build. Each metric is averaged across 5 runs.

Metric               IQ2_XXS     IQ3_XXS     Delta (IQ2 relative to IQ3)
Overall Speed        21.78 t/s   11.79 t/s   +84.7%
Coding Tasks         21.64 t/s   11.36 t/s   +90.5%
Technical Writing    22.00 t/s   12.44 t/s   +76.9%
Model Size           26 GB       33 GB       -21%
GPU Layers           42 / 48     34 / 48     +23%
CPU-MoE Layers       10          18          -44%
Load Time            ~7s         ~12s        -42%

IQ2 is roughly 85% faster while being 7 GB smaller. The speed gain comes from two factors: the smaller file means more layers fit in GPU memory (42 vs 34), and fewer expert weights need to transfer from system RAM per token (10 CPU-MoE layers instead of 18).
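
If you want to sanity-check these numbers on your own hardware, here is a minimal sketch (not the exact harness I used): llama-server's native /completion endpoint returns a timings object, so you can average predicted_per_second over a few runs with curl and jq.

# Prompt is arbitrary; the benchmarks above used separate coding and writing prompts.
for i in $(seq 1 5); do
  curl -s http://localhost:8084/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a Python function that parses a CSV file.", "n_predict": 256}' \
    | jq '.timings.predicted_per_second'
done | awk '{ sum += $1 } END { printf "average: %.2f t/s over %d runs\n", sum / NR, NR }'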

Quality Comparison

The obvious concern with more aggressive quantization is output quality. I tested both models across several categories using identical prompts:

  • Code generation (Python, JavaScript, HTML/CSS) — no observable difference in correctness or style
  • Technical writing (documentation, explanations) — indistinguishable outputs
  • Debugging (bug identification, optimization) — same insights from both
  • Complex reasoning (multi-step logic, architecture decisions) — comparable quality

As a concrete example, I asked both models to generate a complete Snake game in JavaScript with HTML5 canvas, score tracking, collision detection, and responsive design. IQ3 produced it in ~25 seconds at 12.45 t/s. IQ2 produced functionally identical code in ~15 seconds at 21.21 t/s. Both games worked correctly.

For coding and technical tasks, the quantization loss between IQ3 and IQ2 appears negligible in practice.
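
To repeat the side-by-side comparison yourself, a simple approach is to run one server per quantization and diff the outputs for an identical prompt. A sketch, assuming the IQ3 instance is started on a second port (8085 here is arbitrary), with temperature 0 so each model is deterministic:

PROMPT='Write a complete Snake game in JavaScript using HTML5 canvas with score tracking.'
for port in 8084 8085; do   # 8084 = IQ2 server, 8085 = assumed second instance running IQ3
  curl -s "http://localhost:${port}/completion" \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"${PROMPT}\", \"n_predict\": 2048, \"temperature\": 0}" \
    | jq -r '.content' > "out_${port}.txt"
done
diff out_8084.txt out_8085.txt | head

Even at temperature 0 the two outputs will rarely be byte-identical, so the diff is a starting point for manual review rather than a pass/fail test.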

Why IQ2 Is Faster

The performance difference comes down to memory bandwidth. Qwen3-Coder-Next uses a Mixture of Experts architecture: 512 experts per layer, of which 10 are active per token. For every generated token, the weights of the active experts must be fetched in every layer, and for layers offloaded to the CPU that fetch has to cross the PCIe bus.

IQ2_XXS uses 2-bit quantization compared to IQ3's 3-bit. This reduces the model from 33 GB to 26 GB, which has cascading effects:

  • More GPU layers — 42 of 48 layers fit in 24 GB VRAM, versus 34 for IQ3
  • Less CPU offload — only 10 layers on CPU instead of 18, meaning less data moves across the PCIe bus per token
  • Higher bandwidth efficiency — each expert weight transfer is ~33% smaller

The architecture (80B parameters, 512 experts/layer, MoE design) is unchanged. IQ2 simply packs the same weights into a smaller representation.
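
As a back-of-envelope illustration of the bandwidth argument, the data streamed from system RAM per generated token scales with active experts per token, CPU-offloaded layers, and bytes per expert. The per-expert sizes below are assumed placeholders, not measured values:

# Rough per-token transfer estimate for the CPU-offloaded MoE layers.
# EXPERT_MB_* are assumptions -- substitute the real per-expert tensor sizes for your quant.
ACTIVE_EXPERTS=10
EXPERT_MB_IQ2=0.8    # hypothetical size of one 2-bit expert, in MB
EXPERT_MB_IQ3=1.2    # hypothetical size of one 3-bit expert, in MB
CPU_LAYERS_IQ2=10
CPU_LAYERS_IQ3=18
awk -v e="$ACTIVE_EXPERTS" \
    -v m2="$EXPERT_MB_IQ2" -v m3="$EXPERT_MB_IQ3" \
    -v l2="$CPU_LAYERS_IQ2" -v l3="$CPU_LAYERS_IQ3" \
    'BEGIN { printf "IQ2: ~%.0f MB/token   IQ3: ~%.0f MB/token\n", e*m2*l2, e*m3*l3 }'

With those placeholder numbers IQ3 would move several times more data over PCIe per token, which matches the direction (though not the exact size) of the measured speed gap.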

Hardware and Configuration

Test System

GPU: NVIDIA RTX 4090 (24 GB)
RAM: 64 GB DDR4-3200
CPU: AMD Ryzen 9 5950X (16 cores)
OS: Ubuntu 24.04 LTS
llama.cpp: b4785 (February 2026)

Server Configuration (IQ2_XXS)

# 200K context, 42 GPU layers, 10 CPU-offloaded MoE layers,
# batch/micro-batch size 1024, q4_0 KV cache compression
./llama-server \
  -m Qwen3-Coder-Next-UD-IQ2_XXS.gguf \
  --port 8084 \
  -c 200000 \
  -ngl 42 \
  --n-cpu-moe 10 \
  -b 1024 -ub 1024 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --flash-attn on \
  --cont-batching \
  --jinja \
  -t 16

For IQ3_XXS, substitute -ngl 34 --n-cpu-moe 18: the larger file fits fewer layers in VRAM, so more MoE layers have to be offloaded to the CPU.

Key configuration notes: q4_0 KV cache compression and flash attention are essential — without them, the 200K context window would exceed available VRAM. The 16 threads match the 5950X core count for optimal CPU-MoE throughput.
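
When tuning -ngl and --n-cpu-moe for a different card, it helps to watch VRAM while the server loads; a quick check:

# If memory.used creeps toward the 24 GB ceiling during warm-up, lower -ngl
# by a layer or two and raise --n-cpu-moe to match.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader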

Getting the Models

Both quantizations are available on Hugging Face in GGUF format:

# Recommended
Qwen3-Coder-Next-UD-IQ2_XXS.gguf   # ~26 GB

# Alternative (larger, slower on single GPU)
Qwen3-Coder-Next-UD-IQ3_XXS.gguf   # ~33 GB

The UD (Unsloth Dynamic) quantizations keep the full 512 experts per layer, compared to 308 in the REAP-pruned variants. The higher expert count preserves more task specialisation within the MoE routing.
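
A download sketch using huggingface-cli; the repository ID below is a placeholder, since this post doesn't name one, so substitute whichever repo actually hosts the GGUF you want:

pip install -U "huggingface_hub[cli]"
huggingface-cli download <gguf-repo-id> \
  Qwen3-Coder-Next-UD-IQ2_XXS.gguf \
  --local-dir ./models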

Claude Code Integration

I run IQ2 as a local backend for Claude Code via a wrapper script:

# ~/.local/bin/coder-next-opt-full
MODEL="/path/to/Qwen3-Coder-Next-UD-IQ2_XXS.gguf"
PORT=8084
CONTEXT=200000
GPU_LAYERS=42
CPU_MOE=10
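
The rest of the wrapper isn't shown here; a minimal version, assuming it does nothing beyond launching llama-server (on the PATH) with those variables and the flags from the configuration above, would be:

exec llama-server \
  -m "$MODEL" \
  --port "$PORT" \
  -c "$CONTEXT" \
  -ngl "$GPU_LAYERS" \
  --n-cpu-moe "$CPU_MOE" \
  -b 1024 -ub 1024 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --flash-attn on --cont-batching --jinja -t 16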

This provides local coding assistance with 200K context, no API costs, and no data leaving the machine.

Trade-offs

  • Quantization loss — theoretically present, but not observable in coding or technical writing tasks during my testing
  • Hardware requirement — still requires 24 GB VRAM for the GPU-layer split to work well
  • Model availability — not all models have IQ2 quantizations yet; this depends on upstream GGUF support

For users with less than 24 GB VRAM, IQ2 would still be faster than IQ3 at equivalent layer counts, but the absolute throughput would be lower.

What's Next

Areas I'm watching for further performance gains:

  • KV cache compression — GEAR and KIVI techniques could reduce VRAM usage further, allowing more GPU layers or larger context
  • Expert caching — caching frequently-used experts in GPU memory to reduce CPU-MoE transfer overhead
  • Speculative decoding — using a smaller draft model to predict tokens in advance, reducing wall-clock time per token (a launch sketch follows this list)
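
For the speculative-decoding item, llama-server already accepts a draft model via -md / -ngld. A sketch only: the draft GGUF named below is hypothetical, and whether a suitable small draft exists for this model family is an open question.

# The draft filename is a placeholder; -ngld keeps the draft fully on GPU.
DRAFT=small-draft-model.gguf
./llama-server \
  -m Qwen3-Coder-Next-UD-IQ2_XXS.gguf \
  -md "$DRAFT" \
  -ngld 99 \
  --port 8084 -c 200000 -ngl 42 --n-cpu-moe 10 \
  -b 1024 -ub 1024 --cache-type-k q4_0 --cache-type-v q4_0 \
  --flash-attn on --cont-batching --jinja -t 16

The draft model takes VRAM of its own, so -ngl for the main model would likely need to come down a layer or two.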

IQ2_XXS at the configuration above is the fastest local setup I've found for this model.

Summary

IQ2_XXS is the clear choice for single-GPU local inference: roughly 85% faster generation, a 7 GB smaller footprint, and no measurable quality loss for coding tasks. If you have an RTX 4090 and want to run Qwen3-Coder-Next locally, there's no practical reason to prefer IQ3.

Resources