⚡ TL;DR
IQ2_XS is ~2x faster than IQ3_XXS with negligible quality loss for coding tasks.
- Speed: 86.4 t/s (IQ2) vs 44.6 t/s (IQ3) on the end-to-end agentic task — 94% faster (raw generation averaged +136%)
- Model size: 21 GB (IQ2) vs 31 GB (IQ3) — fits more in VRAM
- CPU offload: 7 layers (IQ2) vs 22 layers (IQ3) — less PCIe bottleneck
- Quality: Indistinguishable for coding, debugging, and technical writing
Why This Matters
When running large language models locally on a single GPU, quantization choice determines everything: how much fits in VRAM, how many layers spill to CPU, and ultimately how fast tokens generate. For Qwen3-Coder-Next (80B parameters, 512 experts/layer), this decision is critical.
I ran comprehensive benchmarks comparing IQ2_XS (21 GB) against IQ3_XXS (31 GB) on an RTX 4090 to answer a simple question: is the theoretical quality improvement of IQ3 worth the speed penalty?
Short answer: No.
Benchmark Setup
GPU: NVIDIA RTX 4090 24GB
RAM: 64GB DDR4-3200
CPU: AMD Ryzen 9 5950X (16 cores)
OS: Ubuntu 24.04 LTS
llama.cpp: b8054 (February 2026) with Flash Attention
Context: 168K tokens
KV Cache: Q8 (both K and V)
Both models tested with identical llama.cpp settings:
```bash
./llama-server \
  -m [MODEL].gguf \
  --port 8085 \
  -c 168000 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn on \
  --jinja \
  --metrics
```
Speed Results
Generation speed averaged across multiple test types:
| Test Type | IQ2_XS | IQ3_XXS | Speed Delta |
|---|---|---|---|
| Algorithm Implementation | 92.7 t/s | 39.3 t/s | +136% |
| System Design | 89.3 t/s | 38.2 t/s | +134% |
| Debugging | 87.2 t/s | 36.7 t/s | +138% |
| Code Refactoring | 87.7 t/s | 36.8 t/s | +138% |
| Architecture Explanation | 85.9 t/s | 37.1 t/s | +131% |
| SQL Queries | 87.9 t/s | 36.7 t/s | +139% |
| AVERAGE | 88.5 t/s | 37.5 t/s | +136% |
The 31 GB IQ3_XXS model overflows 22-23 layers to CPU (system RAM), creating a massive PCIe bus bottleneck. The 21 GB IQ2_XS only overflows 7 layers, keeping most computation on GPU.
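The averages and the overall delta are straightforward to verify from the per-type rows:

```python
# Reproduce the per-type averages and overall delta from the table above.
iq2 = [92.7, 89.3, 87.2, 87.7, 85.9, 87.9]  # IQ2_XS t/s per test type
iq3 = [39.3, 38.2, 36.7, 36.8, 37.1, 36.7]  # IQ3_XXS t/s per test type

avg_iq2 = sum(iq2) / len(iq2)            # ~88.5 t/s
avg_iq3 = sum(iq3) / len(iq3)            # ~37.5 t/s
delta = (avg_iq2 / avg_iq3 - 1) * 100    # ~+136%

print(f"{avg_iq2:.1f} t/s vs {avg_iq3:.1f} t/s (+{delta:.0f}%)")
```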
Context Size Experiment: 168K vs 128K
I also tested IQ3_XXS at 128K context to see if reducing KV cache size would improve speed:
| Configuration | Speed | CPU Layers | Speed Gain? |
|---|---|---|---|
| IQ3_XXS @ 168K | 44.6 t/s | 23 | — |
| IQ3_XXS @ 128K | 44.1 t/s | 22 | No (−1%) |
Result: Reducing context from 168K to 128K made no meaningful difference. The bottleneck is model weight overflow to CPU, not KV cache size. IQ3_XXS is simply too large for 24 GB VRAM.
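A rough KV-cache calculation supports this. The KV head count, head dimension, and the number of KV-carrying layers below are illustrative assumptions (not confirmed Qwen3-Coder-Next specs), but they show the scale: trimming 40K tokens of context frees only about 1 GB, roughly one or two IQ3 layers' worth of weights.

```python
# Rough q8_0 KV-cache sizes at the two context lengths. kv_layers,
# n_kv_heads, and head_dim are ASSUMED for illustration only, not
# confirmed Qwen3-Coder-Next architecture values.

def kv_cache_gb(ctx, kv_layers=12, n_kv_heads=8, head_dim=128,
                bytes_per_elt=1.0625):  # q8_0 ~= 8.5 bits per element
    # K and V each store kv_layers * n_kv_heads * head_dim elements per token
    return ctx * 2 * kv_layers * n_kv_heads * head_dim * bytes_per_elt / 1e9

print(round(kv_cache_gb(168_000), 1))  # ~4.4 GB
print(round(kv_cache_gb(128_000), 1))  # ~3.3 GB
```

Under these assumptions the 168K-to-128K reduction saves about 1 GB, which matches moving just one layer back onto the GPU; the ~10 GB of weight overflow still dominates.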
Quality Comparison
Speed is meaningless if quality suffers. I tested both models across multiple domains:
1. Debugging Challenge
I provided both models with a 500+ line async Python codebase containing subtle race conditions, missing awaits, and connection pool bugs. Task: find and fix ALL bugs with severity ratings.
Result: Both models identified the same critical bugs:
- Rate limiter list reassignment bug (Critical)
- Missing `await` in `wait_if_needed` (Critical)
- Connection pool semaphore double-release (Major)
- Race condition in concurrent access (Major)
The explanations and fixes were functionally identical. IQ3 was slightly more verbose in explanations, but provided no additional bug discoveries.
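For a concrete picture of the "missing await" bug class, here is a minimal sketch. This is not the actual test codebase; the `wait_if_needed` name is borrowed from the bug list purely for illustration.

```python
import asyncio

class RateLimiter:
    """Toy limiter illustrating the missing-await bug class."""
    def __init__(self) -> None:
        self.calls = 0

    async def wait_if_needed(self) -> None:
        await asyncio.sleep(0)   # yield to the event loop, as a real limiter would
        self.calls += 1

async def buggy() -> int:
    rl = RateLimiter()
    rl.wait_if_needed()          # BUG: coroutine created but never awaited
    return rl.calls              # body never ran, so this is still 0

async def fixed() -> int:
    rl = RateLimiter()
    await rl.wait_if_needed()    # FIX: await the coroutine
    return rl.calls              # now 1

print(asyncio.run(buggy()), asyncio.run(fixed()))
```

The buggy version silently does nothing (Python only emits a "coroutine was never awaited" RuntimeWarning), which is exactly why both models flagging it is a meaningful signal.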
2. SQL Query Generation
Complex PostgreSQL queries including cohort analysis, market basket analysis, and RFM segmentation with index recommendations.
Result: Both models produced correct, optimized queries with appropriate indexes. IQ3 suggested the pg_trgm extension for fuzzy matching — a nice addition, but IQ2's solution using levenshtein (from the fuzzystrmatch extension) was equally valid.
3. Agentic Task: Build Complete CLI Application
Multi-phase task requiring planning, implementation, testing, and documentation of a Python CLI todo application with JSON storage, argparse interface, and unit tests.
| Metric | IQ2_XS @ 168K | IQ3_XXS @ 168K | IQ3_XXS @ 128K |
|---|---|---|---|
| Speed | 86.4 t/s | 44.6 t/s | 44.1 t/s |
| Total Time | 93 seconds | 180 seconds | 182 seconds |
| Files Created | 8 | 9 | 15 |
| Phases Completed | 3/4 | 1/4 | 1/4 |
Observation: IQ3_XXS at 128K created more files (15 vs 9), but this was due to including command outputs as pseudo-files, not better organization. IQ2_XS followed the multi-phase structure more precisely.
Why IQ2 Is Faster: Technical Deep Dive
The speed difference comes down to PCIe bus bandwidth. Qwen3-Coder-Next uses Mixture of Experts: 512 experts per layer, 10 active per token. Each token requires loading the active expert weights for all 48 layers.
With IQ3_XXS (31 GB) on 24 GB VRAM:
- Only ~26 layers fit entirely in GPU memory
- 22-23 layers require fetching expert weights from system RAM
- Each token: 22 PCIe round-trips for expert weights
- PCIe 4.0 x16: ~32 GB/s bandwidth → bottleneck
With IQ2_XS (21 GB) on 24 GB VRAM:
- 41 of 48 layers fit entirely in GPU memory
- Only 7 layers require CPU fetch
- 3x fewer PCIe round-trips per token
- Result: ~2x faster generation
The quantization itself (2-bit vs 3-bit) doesn't directly speed things up — it's the reduced model size enabling better GPU layer placement.
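Following that model, a back-of-envelope estimate of per-token PCIe transfer time can be sketched. Expert size is crudely derived from file size divided evenly across layers and experts (a rough assumption that ignores attention weights and shared tensors):

```python
# Per-token PCIe transfer estimate, following the article's model of
# expert weights being fetched from system RAM for offloaded layers.
# Expert size is a crude estimate: file size spread evenly over
# 48 layers x 512 experts.

def pcie_ms_per_token(model_gb, offloaded_layers, n_layers=48,
                      experts_per_layer=512, active_experts=10,
                      pcie_gb_s=32.0):
    gb_per_expert = model_gb / n_layers / experts_per_layer
    gb_per_token = offloaded_layers * active_experts * gb_per_expert
    return gb_per_token / pcie_gb_s * 1000  # milliseconds

iq3 = pcie_ms_per_token(31, 22)  # ~8.7 ms/token on transfers alone
iq2 = pcie_ms_per_token(21, 7)   # ~1.9 ms/token
print(round(iq3, 1), round(iq2, 1), round(iq3 / iq2, 1))
```

The estimate overstates the real gap (about 4.6x on transfer time vs the observed ~2.4x generation speedup) because GPU compute and other overheads are shared by both configurations, but it shows why the offloaded layers dominate the per-token budget.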
Perceived Quality Test
I blind-tested outputs from both models across:
- Code generation (Python, JavaScript, SQL) — Could not reliably distinguish
- Technical writing (documentation, explanations) — IQ3 slightly more verbose, same information
- Debugging (bug identification, fixes) — Identical bug detection
- Architecture decisions (system design, trade-offs) — Comparable reasoning quality
The only observable difference was verbosity: IQ3 tends to write longer explanations. Whether this is "better" depends on your use case — for coding tasks, I prefer concise, actionable output.
Recommendation
For RTX 4090 (24 GB VRAM): Use IQ2_XS.
| Factor | IQ2_XS | IQ3_XXS | Winner |
|---|---|---|---|
| Generation Speed | 86-94 t/s | 37-45 t/s | IQ2 (2x) |
| Model Size | 21 GB | 31 GB | IQ2 |
| CPU Offload | 7 layers | 22 layers | IQ2 |
| Code Quality | Excellent | Excellent | Tie |
| Debugging | Excellent | Excellent | Tie |
| Verbosity | Concise | Verbose | Depends |
The quality gap between IQ2 and IQ3 is negligible for practical coding work. The speed gap is massive and immediately noticeable in daily use.
IQ3_XXS might make sense if:
- You have 48+ GB VRAM (dual 3090/4090 or A6000)
- You're doing creative writing where verbosity is desired
- You're doing one-off analysis where speed doesn't matter
Configuration
My current production setup (Abliterated IQ2_XS):
```bash
./llama-server \
  -m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
  --port 8085 \
  -c 168000 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn on \
  --jinja \
  --metrics
```

`-c 168000` sets the 168K context, `-ctk`/`-ctv q8_0` select the Q8 KV cache for keys and values, `--jinja` enables Jinja chat templates, and `--metrics` exposes a Prometheus-compatible `/metrics` endpoint for monitoring.
The huihui-ai Abliterated variant removes safety refusals while maintaining code quality. Combined with Q8 KV cache, this achieves ~94 t/s at 168K context.
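To exercise this setup, llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the configured port. A minimal stdlib-only client sketch (the helper names and the prompt are mine, not part of the server):

```python
import json
import urllib.request

def make_chat_payload(prompt, max_tokens=512, temperature=0.2):
    # OpenAI-style chat body; llama-server ignores/accepts "model" freely,
    # so only the essentials are included here.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt, url="http://localhost:8085/v1/chat/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(make_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# usage (requires the server above to be running):
#   print(chat("Write a Python function that reverses a linked list."))
```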
Conclusion
After extensive testing across coding, debugging, SQL generation, and agentic tasks, the conclusion is clear:
IQ2_XS provides 2x the speed of IQ3_XXS with no measurable quality loss for technical work.
The bottleneck for local inference on 24 GB VRAM is PCIe bandwidth for CPU-offloaded layers, not quantization precision. IQ2's smaller size enables better GPU layer placement, which translates directly to faster generation.
For anyone running large MoE models on single consumer GPUs, aggressive quantization (IQ2_XS or similar) is the practical choice. The theoretical quality improvement of IQ3 doesn't survive contact with VRAM constraints.
Resources
- IQ2_XS Model — bartowski/huihui-ai_Qwen3-Coder-Next-abliterated-GGUF
- IQ3_XXS Model — unsloth/Qwen3-Coder-Next-GGUF
- llama.cpp — github.com/ggerganov/llama.cpp
- Related — Qwen3-Coder-Next IQ2 Local Setup
