⚡ TL;DR

IQ2_XS is ~2x faster than IQ3_XXS with negligible quality loss for coding tasks.

  • Speed: 86.4 t/s (IQ2) vs 44.6 t/s (IQ3) — 94% faster
  • Model size: 21 GB (IQ2) vs 31 GB (IQ3) — fits more in VRAM
  • CPU offload: 7 layers (IQ2) vs 22 layers (IQ3) — less PCIe bottleneck
  • Quality: Indistinguishable for coding, debugging, and technical writing

Why This Matters

When running large language models locally on a single GPU, quantization choice determines everything: how much fits in VRAM, how many layers spill to CPU, and ultimately how fast tokens generate. For Qwen3-Coder-Next (80B parameters, 512 experts/layer), this decision is critical.

I ran comprehensive benchmarks comparing IQ2_XS (21 GB) against IQ3_XXS (31 GB) on an RTX 4090 to answer a simple question: is the theoretical quality improvement of IQ3 worth the speed penalty?

Short answer: No.

Benchmark Setup

  • GPU: NVIDIA RTX 4090 (24 GB)
  • RAM: 64 GB DDR4-3200
  • CPU: AMD Ryzen 9 5950X (16 cores)
  • OS: Ubuntu 24.04 LTS
  • llama.cpp: b8054 (February 2026) with Flash Attention
  • Context: 168K tokens
  • KV cache: Q8 (both K and V)

Both models tested with identical llama.cpp settings:

```
./llama-server \
  -m [MODEL].gguf \
  --port 8085 \
  -c 168000 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn on \
  --jinja \
  --metrics
```
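To sanity-check the reported speeds on your own hardware, a small client can time the server's OpenAI-compatible completions endpoint. This is a sketch: the `/v1/completions` path and `usage` field follow llama-server's OpenAI-compatible API, and the port matches the command above.

```python
import json
import time
import urllib.request

SERVER = "http://localhost:8085"  # matches --port 8085 above


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Generation speed: completion tokens divided by wall-clock seconds."""
    return n_tokens / elapsed_s


def measure(prompt: str, max_tokens: int = 128) -> float:
    """Time one request against llama-server's OpenAI-compatible endpoint."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(SERVER + "/v1/completions", body,
                                 {"Content-Type": "application/json"})
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    return tokens_per_second(out["usage"]["completion_tokens"],
                             time.monotonic() - t0)


# e.g. 512 tokens generated in 5.93 s:
print(f"{tokens_per_second(512, 5.93):.1f} t/s")
```

Note that wall-clock timing includes prompt processing, so for long prompts the streaming per-token timings in the server's logs are more precise.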

Speed Results

Generation speed averaged across multiple test types:

Test Type                  IQ2_XS     IQ3_XXS    Speed Delta
Algorithm Implementation   92.7 t/s   39.3 t/s   +136%
System Design              89.3 t/s   38.2 t/s   +134%
Debugging                  87.2 t/s   36.7 t/s   +138%
Code Refactoring           87.7 t/s   36.8 t/s   +138%
Architecture Explanation   85.9 t/s   37.1 t/s   +131%
SQL Queries                87.9 t/s   36.7 t/s   +139%
AVERAGE                    88.5 t/s   37.5 t/s   +136%
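The "Speed Delta" column is simply the percentage speedup of IQ2 over IQ3; the per-test deltas and the average can be reproduced from the raw numbers:

```python
# (IQ2_XS, IQ3_XXS) generation speeds in t/s, from the table above
results = {
    "Algorithm Implementation": (92.7, 39.3),
    "System Design": (89.3, 38.2),
    "Debugging": (87.2, 36.7),
    "Code Refactoring": (87.7, 36.8),
    "Architecture Explanation": (85.9, 37.1),
    "SQL Queries": (87.9, 36.7),
}

# percentage speedup of IQ2 over IQ3 for each test type
for name, (iq2, iq3) in results.items():
    print(f"{name}: +{(iq2 / iq3 - 1) * 100:.0f}%")

avg2 = sum(v[0] for v in results.values()) / len(results)
avg3 = sum(v[1] for v in results.values()) / len(results)
print(f"AVERAGE: {avg2:.1f} vs {avg3:.1f} t/s -> +{(avg2 / avg3 - 1) * 100:.0f}%")
```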

The 31 GB IQ3_XXS model overflows 22-23 layers to CPU (system RAM), creating a massive PCIe bus bottleneck. The 21 GB IQ2_XS only overflows 7 layers, keeping most computation on GPU.

Context Size Experiment: 168K vs 128K

I also tested IQ3_XXS at 128K context to see if reducing KV cache size would improve speed:

Configuration      Speed      CPU Layers   Speed Gain?
IQ3_XXS @ 168K     44.6 t/s   23           (baseline)
IQ3_XXS @ 128K     44.1 t/s   22           No (−1%)

Result: Reducing context from 168K to 128K made no meaningful difference. The bottleneck is model weight overflow to CPU, not KV cache size. IQ3_XXS is simply too large for 24 GB VRAM.

Quality Comparison

Speed is meaningless if quality suffers. I tested both models across multiple domains:

1. Debugging Challenge

I provided both models with a 500+ line async Python codebase containing subtle race conditions, missing awaits, and connection pool bugs. Task: find and fix ALL bugs with severity ratings.

Result: Both models identified the same critical bugs:

  • Rate limiter list reassignment bug (Critical)
  • Missing await in wait_if_needed (Critical)
  • Connection pool semaphore double-release (Major)
  • Race condition in concurrent access (Major)

The explanations and fixes were functionally identical. IQ3 was slightly more verbose in explanations, but provided no additional bug discoveries.
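For context, here is a minimal reconstruction of what the first two (Critical) bugs look like once fixed. The benchmark codebase is not published, so this limiter is illustrative; only the `wait_if_needed` name comes from the test itself.

```python
import asyncio
import time


class RateLimiter:
    """Sliding-window limiter, with the two Critical bug classes fixed."""

    def __init__(self, max_calls: int, window_s: float):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls: list[float] = []
        self._lock = asyncio.Lock()  # fix: serialize access to shared state

    async def wait_if_needed(self) -> None:
        async with self._lock:
            now = time.monotonic()
            # fix: prune in place -- rebinding self.calls to a new list lets
            # tasks holding a reference to the old list mutate dead state
            self.calls[:] = [t for t in self.calls if now - t < self.window_s]
            if len(self.calls) >= self.max_calls:
                # fix: without `await`, this line creates a coroutine and
                # discards it, silently turning the limiter into a no-op
                await asyncio.sleep(self.window_s - (now - self.calls[0]))
            self.calls.append(time.monotonic())


async def demo() -> float:
    rl = RateLimiter(max_calls=2, window_s=0.2)
    t0 = time.monotonic()
    for _ in range(4):
        await rl.wait_if_needed()
    return time.monotonic() - t0


print(f"{asyncio.run(demo()):.2f}s")  # the third call forces a ~0.2 s wait
```

Both quantizations spotted both defects, which is notable because a missing `await` produces no traceback, only wrong timing behavior.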

2. SQL Query Generation

Complex PostgreSQL queries including cohort analysis, market basket analysis, and RFM segmentation with index recommendations.

Result: Both models produced correct, optimized queries with appropriate indexes. IQ3 suggested pg_trgm extension for fuzzy matching — a nice addition, but IQ2's solution using LEVENSHTEIN was equally valid.

3. Agentic Task: Build Complete CLI Application

Multi-phase task requiring planning, implementation, testing, and documentation of a Python CLI todo application with JSON storage, argparse interface, and unit tests.

Metric             IQ2_XS @ 168K   IQ3_XXS @ 168K   IQ3_XXS @ 128K
Speed              86.4 t/s        44.6 t/s         44.1 t/s
Total Time         93 seconds      180 seconds      182 seconds
Files Created      8               9                15
Phases Completed   3/4             1/4              1/4

Observation: IQ3_XXS at 128K created more files (15 vs 9), but this was due to including command outputs as pseudo-files, not better organization. IQ2_XS followed the multi-phase structure more precisely.

Why IQ2 Is Faster: Technical Deep Dive

The speed difference comes down to PCIe bus bandwidth. Qwen3-Coder-Next uses Mixture of Experts: 512 experts per layer, 10 active per token. Each token requires loading the active expert weights for all 48 layers.
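The routing step itself is cheap; what matters for memory traffic is that each token touches only a small subset of experts per layer. A toy top-k router makes this concrete (pure Python sketch; the 512/10 figures come from the text above, the gating scores are random stand-ins):

```python
import math
import random

EXPERTS, ACTIVE = 512, 10  # experts per layer, active per token


def route(scores: list[float], k: int = ACTIVE):
    """Pick the k highest-scoring experts; softmax their scores for mixing."""
    top = sorted(range(len(scores)), key=scores.__getitem__)[-k:]
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    # only these k experts' weights must be resident for this token
    return top, [e / total for e in exps]


random.seed(0)
scores = [random.gauss(0, 1) for _ in range(EXPERTS)]
experts, weights = route(scores)
print(len(experts), round(sum(weights), 6))  # 10 experts, weights sum to 1
```

Because the active set changes per token per layer, an offloaded layer cannot pre-stage its experts: the weights have to cross the PCIe bus on demand.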

With IQ3_XXS (31 GB) on 24 GB VRAM:

  • Only ~26 layers fit entirely in GPU memory
  • 22-23 layers require fetching expert weights from system RAM
  • Each token: 22 PCIe round-trips for expert weights
  • PCIe 4.0 x16: ~32 GB/s bandwidth → bottleneck

With IQ2_XS (21 GB) on 24 GB VRAM:

  • 41 of 48 layers fit entirely in GPU memory
  • Only 7 layers require CPU fetch
  • 3x fewer PCIe round-trips per token
  • Result: ~2x faster generation

The quantization itself (2-bit vs 3-bit) doesn't directly speed things up — it's the reduced model size enabling better GPU layer placement.
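A crude back-of-envelope makes the asymmetry concrete. Every size here is an assumption (weights split evenly across layers and experts, attention weights and latency ignored), so treat the absolute numbers as illustrative only:

```python
# Rough per-token PCIe cost model -- all sizes are assumptions.
PCIE_BW = 32e9          # PCIe 4.0 x16, ~32 GB/s (from the text above)
LAYERS = 48
EXPERTS_PER_LAYER = 512
ACTIVE_EXPERTS = 10


def per_token_pcie_ms(model_bytes: float, cpu_layers: int) -> float:
    """Milliseconds per token spent fetching active expert weights over
    PCIe, assuming weights split evenly across layers and experts."""
    expert_bytes = model_bytes / LAYERS / EXPERTS_PER_LAYER
    fetched = cpu_layers * ACTIVE_EXPERTS * expert_bytes
    return fetched / PCIE_BW * 1e3


iq3 = per_token_pcie_ms(31e9, cpu_layers=22)
iq2 = per_token_pcie_ms(21e9, cpu_layers=7)
print(f"IQ3_XXS: {iq3:.1f} ms/token on the bus")
print(f"IQ2_XS:  {iq2:.1f} ms/token on the bus")
print(f"ratio:   {iq3 / iq2:.1f}x")
```

This toy model overstates the gap (GPU compute and other transfers share the per-token budget, which is why the observed speedup is ~2x rather than ~4.6x), but it shows why the offloaded-layer count, not quantization precision, dominates throughput.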

Perceived Quality Test

I blind-tested outputs from both models across:

  • Code generation (Python, JavaScript, SQL) — Could not reliably distinguish
  • Technical writing (documentation, explanations) — IQ3 slightly more verbose, same information
  • Debugging (bug identification, fixes) — Identical bug detection
  • Architecture decisions (system design, trade-offs) — Comparable reasoning quality

The only observable difference was verbosity: IQ3 tends to write longer explanations. Whether this is "better" depends on your use case — for coding tasks, I prefer concise, actionable output.

Recommendation

For RTX 4090 (24 GB VRAM): Use IQ2_XS.

Factor             IQ2_XS      IQ3_XXS     Winner
Generation Speed   86-94 t/s   37-45 t/s   IQ2 (2x)
Model Size         21 GB       31 GB       IQ2
CPU Offload        7 layers    22 layers   IQ2
Code Quality       Excellent   Excellent   Tie
Debugging          Excellent   Excellent   Tie
Verbosity          Concise     Verbose     Depends

The quality gap between IQ2 and IQ3 is negligible for practical coding work. The speed gap is massive and immediately noticeable in daily use.

IQ3_XXS might make sense if:

  • You have 48+ GB VRAM (dual 3090/4090 or A6000)
  • You're doing creative writing where verbosity is desired
  • You're doing one-off analysis where speed doesn't matter

Configuration

My current production setup (Abliterated IQ2_XS):

```
./llama-server \
  -m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
  --port 8085 \
  -c 168000 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn on \
  --jinja \
  --metrics
```

Flags: 168K context, Q8 KV cache for both K and V, Flash Attention, Jinja chat templates, and the server's metrics endpoint.

The huihui-ai Abliterated variant removes safety refusals while maintaining code quality. Combined with Q8 KV cache, this achieves ~94 t/s at 168K context.

Conclusion

After extensive testing across coding, debugging, SQL generation, and agentic tasks, the conclusion is clear:

IQ2_XS provides 2x the speed of IQ3_XXS with no measurable quality loss for technical work.

The bottleneck for local inference on 24 GB VRAM is PCIe bandwidth for CPU-offloaded layers, not quantization precision. IQ2's smaller size enables better GPU layer placement, which translates directly to faster generation.

For anyone running large MoE models on single consumer GPUs, aggressive quantization (IQ2_XS or similar) is the practical choice. The theoretical quality improvement of IQ3 doesn't survive contact with VRAM constraints.

Resources