⚡ TL;DR

IQ2_XS is ~2x faster than IQ3_XXS with negligible quality loss for coding tasks.

  • Speed: 86.4 t/s (IQ2) vs 44.6 t/s (IQ3) — 94% faster
  • Model size: 21 GB (IQ2) vs 31 GB (IQ3) — fits more in VRAM
  • CPU offload: 7 layers (IQ2) vs 22 layers (IQ3) — less PCIe bottleneck
  • Quality: Indistinguishable for coding, debugging, and technical writing

Why This Matters

When running large language models locally on a single GPU, quantization choice determines everything: how much fits in VRAM, how many layers spill to CPU, and ultimately how fast tokens generate. For Qwen3-Coder-Next (80B parameters, 512 experts/layer), this decision is critical.

I ran comprehensive benchmarks comparing IQ2_XS (21 GB) against IQ3_XXS (31 GB) on an RTX 4090 to answer a simple question: is the theoretical quality improvement of IQ3 worth the speed penalty?

Short answer: No.

Benchmark Setup

  • GPU: NVIDIA RTX 4090 (24 GB)
  • RAM: 64 GB DDR4-3200
  • CPU: AMD Ryzen 9 5950X (16 cores)
  • OS: Ubuntu 24.04 LTS
  • llama.cpp: b8054 (February 2026) with Flash Attention
  • Context: 168K tokens
  • KV cache: Q8 (both K and V)

Both models tested with identical llama.cpp settings:

```
./llama-server \
  -m [MODEL].gguf \
  --port 8085 \
  -c 168000 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn on \
  --jinja \
  --metrics
```
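To sanity-check the reported speeds on your own hardware, a small client can time the server's OpenAI-compatible completions endpoint. This is a sketch: the `/v1/completions` path and `usage` field follow llama-server's OpenAI-compatible API, and the port matches the command above.

```python
import json
import time
import urllib.request

SERVER = "http://localhost:8085"  # matches --port 8085 above


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Generation speed: completion tokens divided by wall-clock seconds."""
    return n_tokens / elapsed_s


def measure(prompt: str, max_tokens: int = 128) -> float:
    """Time one request against llama-server's OpenAI-compatible endpoint."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(SERVER + "/v1/completions", body,
                                 {"Content-Type": "application/json"})
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    return tokens_per_second(out["usage"]["completion_tokens"],
                             time.monotonic() - t0)


# e.g. 512 tokens generated in 5.93 s:
print(f"{tokens_per_second(512, 5.93):.1f} t/s")
```

Note that wall-clock timing includes prompt processing, so for long prompts the streaming per-token timings in the server's logs are more precise.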

Speed Results

Generation speed averaged across multiple test types:

Test Type                  IQ2_XS     IQ3_XXS    Speed Delta
Algorithm Implementation   92.7 t/s   39.3 t/s   +136%
System Design              89.3 t/s   38.2 t/s   +134%
Debugging                  87.2 t/s   36.7 t/s   +138%
Code Refactoring           87.7 t/s   36.8 t/s   +138%
Architecture Explanation   85.9 t/s   37.1 t/s   +131%
SQL Queries                87.9 t/s   36.7 t/s   +139%
AVERAGE                    88.5 t/s   37.5 t/s   +136%
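The "Speed Delta" column is simply the percentage speedup of IQ2 over IQ3; the per-test deltas and the average can be reproduced from the raw numbers:

```python
# (IQ2_XS, IQ3_XXS) generation speeds in t/s, from the table above
results = {
    "Algorithm Implementation": (92.7, 39.3),
    "System Design": (89.3, 38.2),
    "Debugging": (87.2, 36.7),
    "Code Refactoring": (87.7, 36.8),
    "Architecture Explanation": (85.9, 37.1),
    "SQL Queries": (87.9, 36.7),
}

# percentage speedup of IQ2 over IQ3 for each test type
for name, (iq2, iq3) in results.items():
    print(f"{name}: +{(iq2 / iq3 - 1) * 100:.0f}%")

avg2 = sum(v[0] for v in results.values()) / len(results)
avg3 = sum(v[1] for v in results.values()) / len(results)
print(f"AVERAGE: {avg2:.1f} vs {avg3:.1f} t/s -> +{(avg2 / avg3 - 1) * 100:.0f}%")
```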

The 31 GB IQ3_XXS model overflows 22-23 layers to CPU (system RAM), creating a massive PCIe bus bottleneck. The 21 GB IQ2_XS only overflows 7 layers, keeping most computation on GPU.

Context Size Experiment: 168K vs 128K

I also tested IQ3_XXS at 128K context to see if reducing KV cache size would improve speed:

Configuration      Speed      CPU Layers   Speed Gain?
IQ3_XXS @ 168K     44.6 t/s   23           (baseline)
IQ3_XXS @ 128K     44.1 t/s   22           No (−1%)

Result: Reducing context from 168K to 128K made no meaningful difference. The bottleneck is model weight overflow to CPU, not KV cache size. IQ3_XXS is simply too large for 24 GB VRAM.

Quality Comparison

Speed is meaningless if quality suffers. I tested both models across multiple domains:

1. Debugging Challenge

I provided both models with a 500+ line async Python codebase containing subtle race conditions, missing awaits, and connection pool bugs. Task: find and fix ALL bugs with severity ratings.

Result: Both models identified the same critical bugs:

  • Rate limiter list reassignment bug (Critical)
  • Missing await in wait_if_needed (Critical)
  • Connection pool semaphore double-release (Major)
  • Race condition in concurrent access (Major)

The explanations and fixes were functionally identical. IQ3 was slightly more verbose in explanations, but provided no additional bug discoveries.
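For context, here is a minimal reconstruction of what the first two (Critical) bugs look like once fixed. The benchmark codebase is not published, so this limiter is illustrative; only the `wait_if_needed` name comes from the test itself.

```python
import asyncio
import time


class RateLimiter:
    """Sliding-window limiter, with the two Critical bug classes fixed."""

    def __init__(self, max_calls: int, window_s: float):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls: list[float] = []
        self._lock = asyncio.Lock()  # fix: serialize access to shared state

    async def wait_if_needed(self) -> None:
        async with self._lock:
            now = time.monotonic()
            # fix: prune in place -- rebinding self.calls to a new list lets
            # tasks holding a reference to the old list mutate dead state
            self.calls[:] = [t for t in self.calls if now - t < self.window_s]
            if len(self.calls) >= self.max_calls:
                # fix: without `await`, this line creates a coroutine and
                # discards it, silently turning the limiter into a no-op
                await asyncio.sleep(self.window_s - (now - self.calls[0]))
            self.calls.append(time.monotonic())


async def demo() -> float:
    rl = RateLimiter(max_calls=2, window_s=0.2)
    t0 = time.monotonic()
    for _ in range(4):
        await rl.wait_if_needed()
    return time.monotonic() - t0


print(f"{asyncio.run(demo()):.2f}s")  # the third call forces a ~0.2 s wait
```

Both quantizations spotted both defects, which is notable because a missing `await` produces no traceback, only wrong timing behavior.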

2. SQL Query Generation

Complex PostgreSQL queries including cohort analysis, market basket analysis, and RFM segmentation with index recommendations.

Result: Both models produced correct, optimized queries with appropriate indexes. IQ3 suggested pg_trgm extension for fuzzy matching — a nice addition, but IQ2's solution using LEVENSHTEIN was equally valid.

3. Agentic Task: Build Complete CLI Application

Multi-phase task requiring planning, implementation, testing, and documentation of a Python CLI todo application with JSON storage, argparse interface, and unit tests.

Metric             IQ2_XS @ 168K   IQ3_XXS @ 168K   IQ3_XXS @ 128K
Speed              86.4 t/s        44.6 t/s         44.1 t/s
Total Time         93 seconds      180 seconds      182 seconds
Files Created      8               9                15
Phases Completed   3/4             1/4              1/4

Observation: IQ3_XXS at 128K created more files (15 vs 9), but this was due to including command outputs as pseudo-files, not better organization. IQ2_XS followed the multi-phase structure more precisely.

Why IQ2 Is Faster: Technical Deep Dive

The speed difference comes down to PCIe bus bandwidth. Qwen3-Coder-Next uses Mixture of Experts: 512 experts per layer, 10 active per token. Each token requires loading the active expert weights for all 48 layers.
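The routing step itself is cheap; what matters for memory traffic is that each token touches only a small subset of experts per layer. A toy top-k router makes this concrete (pure Python sketch; the 512/10 figures come from the text above, the gating scores are random stand-ins):

```python
import math
import random

EXPERTS, ACTIVE = 512, 10  # experts per layer, active per token


def route(scores: list[float], k: int = ACTIVE):
    """Pick the k highest-scoring experts; softmax their scores for mixing."""
    top = sorted(range(len(scores)), key=scores.__getitem__)[-k:]
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    # only these k experts' weights must be resident for this token
    return top, [e / total for e in exps]


random.seed(0)
scores = [random.gauss(0, 1) for _ in range(EXPERTS)]
experts, weights = route(scores)
print(len(experts), round(sum(weights), 6))  # 10 experts, weights sum to 1
```

Because the active set changes per token per layer, an offloaded layer cannot pre-stage its experts: the weights have to cross the PCIe bus on demand.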

With IQ3_XXS (31 GB) on 24 GB VRAM:

  • Only ~26 layers fit entirely in GPU memory
  • 22-23 layers require fetching expert weights from system RAM
  • Each token: 22 PCIe round-trips for expert weights
  • PCIe 4.0 x16: ~32 GB/s bandwidth → bottleneck

With IQ2_XS (21 GB) on 24 GB VRAM:

  • 41 of 48 layers fit entirely in GPU memory
  • Only 7 layers require CPU fetch
  • 3x fewer PCIe round-trips per token
  • Result: ~2x faster generation

The quantization itself (2-bit vs 3-bit) doesn't directly speed things up — it's the reduced model size enabling better GPU layer placement.
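A crude back-of-envelope makes the asymmetry concrete. Every size here is an assumption (weights split evenly across layers and experts, attention weights and latency ignored), so treat the absolute numbers as illustrative only:

```python
# Rough per-token PCIe cost model -- all sizes are assumptions.
PCIE_BW = 32e9          # PCIe 4.0 x16, ~32 GB/s (from the text above)
LAYERS = 48
EXPERTS_PER_LAYER = 512
ACTIVE_EXPERTS = 10


def per_token_pcie_ms(model_bytes: float, cpu_layers: int) -> float:
    """Milliseconds per token spent fetching active expert weights over
    PCIe, assuming weights split evenly across layers and experts."""
    expert_bytes = model_bytes / LAYERS / EXPERTS_PER_LAYER
    fetched = cpu_layers * ACTIVE_EXPERTS * expert_bytes
    return fetched / PCIE_BW * 1e3


iq3 = per_token_pcie_ms(31e9, cpu_layers=22)
iq2 = per_token_pcie_ms(21e9, cpu_layers=7)
print(f"IQ3_XXS: {iq3:.1f} ms/token on the bus")
print(f"IQ2_XS:  {iq2:.1f} ms/token on the bus")
print(f"ratio:   {iq3 / iq2:.1f}x")
```

This toy model overstates the gap (GPU compute and other transfers share the per-token budget, which is why the observed speedup is ~2x rather than ~4.6x), but it shows why the offloaded-layer count, not quantization precision, dominates throughput.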

Perceived Quality Test

I blind-tested outputs from both models across:

  • Code generation (Python, JavaScript, SQL) — Could not reliably distinguish
  • Technical writing (documentation, explanations) — IQ3 slightly more verbose, same information
  • Debugging (bug identification, fixes) — Identical bug detection
  • Architecture decisions (system design, trade-offs) — Comparable reasoning quality

The only observable difference was verbosity: IQ3 tends to write longer explanations. Whether this is "better" depends on your use case — for coding tasks, I prefer concise, actionable output.

Recommendation

For RTX 4090 (24 GB VRAM): Use IQ2_XS.

Factor             IQ2_XS      IQ3_XXS     Winner
Generation Speed   86-94 t/s   37-45 t/s   IQ2 (2x)
Model Size         21 GB       31 GB       IQ2
CPU Offload        7 layers    22 layers   IQ2
Code Quality       Excellent   Excellent   Tie
Debugging          Excellent   Excellent   Tie
Verbosity          Concise     Verbose     Depends

The quality gap between IQ2 and IQ3 is negligible for practical coding work. The speed gap is massive and immediately noticeable in daily use.

IQ3_XXS might make sense if:

  • You have 48+ GB VRAM (dual 3090/4090 or A6000)
  • You're doing creative writing where verbosity is desired
  • You're doing one-off analysis where speed doesn't matter

Configuration

My current production setup (Abliterated IQ2_XS):

```
./llama-server \
  -m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
  --port 8085 \
  -c 168000 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn on \
  --jinja \
  --metrics
```

Flags: 168K context, Q8 KV cache for both K and V, Flash Attention, Jinja chat templates, and the server's metrics endpoint.

The huihui-ai Abliterated variant removes safety refusals while maintaining code quality. Combined with Q8 KV cache, this achieves ~94 t/s at 168K context.

Conclusion

After extensive testing across coding, debugging, SQL generation, and agentic tasks, the conclusion is clear:

IQ2_XS provides 2x the speed of IQ3_XXS with no measurable quality loss for technical work.

The bottleneck for local inference on 24 GB VRAM is PCIe bandwidth for CPU-offloaded layers, not quantization precision. IQ2's smaller size enables better GPU layer placement, which translates directly to faster generation.

For anyone running large MoE models on single consumer GPUs, aggressive quantization (IQ2_XS or similar) is the practical choice. The theoretical quality improvement of IQ3 doesn't survive contact with VRAM constraints.

Resources