⚡ February 2026 Update: Now 94 t/s with Q8 KV Cache
Further optimization with a Q8 KV cache pushes throughput from 70 t/s to 94 t/s (34% faster). Key changes:
- Q8 KV Cache (-ctk q8_0 -ctv q8_0) — 50% smaller KV buffer, more VRAM for model weights
- llama.cpp b8054 — latest build with Flash Attention PR #19375 (8.5% speedup)
- AUTO-FIT Mode — no manual -ngl flags; let llama.cpp optimize layer placement automatically
See the main IQ2 benchmark post for complete updated configuration details.
The Discovery
After publishing my Qwen3-Coder-Next IQ2 benchmarks, I discovered something interesting: an abliterated version of the same model existed. Same architecture, same capabilities, but with safety guardrails removed.
More intriguing? It's 5 GB smaller. The implications were worth testing.
What Does "Abliterated" Mean?
In AI research circles, abliteration refers to removing refusal behaviors from a model while preserving its actual capabilities. Commercial models are trained to refuse certain requests — not because they're harmful, but because corporate policy deems them off-limits.
The abliterated model doesn't lose intelligence. It gains directness. When you ask for something, it responds to the request itself rather than lecturing you about why you shouldn't have asked.
For security researchers, red teaming, or legitimate use cases involving sensitive content, this matters. For developers who want an AI that just does what it's told, this matters.
First Benchmark: 200K Context
I started with the standard 200K context window — same as my previous IQ2 tests. The model loaded, auto-fit configured 49 GPU layers, and I ran a 512-token generation benchmark.
Result: 31.30 tokens/second
That's already +80% faster than the standard UD-IQ2_XXS model (17-18 t/s). The 5 GB size reduction was paying dividends immediately.
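The post doesn't show the exact benchmark command; one standard way to run a 512-token generation benchmark with llama.cpp's bundled tool looks like this (flags are stock llama-bench options; treat it as a sketch, not the author's exact invocation):

```shell
# Generation-only benchmark: 0 prompt tokens, 512 generated tokens,
# flash attention enabled. Prints a throughput summary table (t/s).
./llama-bench \
  -m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
  -p 0 -n 512 \
  -fa 1
```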
Second Benchmark: 168K Context
Then I tested reduced context. The theory was simple: smaller KV cache = more VRAM available for model weights = faster generation. I dropped context from 200K to 168K tokens.
Result: 70.60 tokens/second
+127% faster than 200K context. +312% faster than the base IQ2 model.
This isn't marginal improvement. This is transformational.
Complete Benchmark Results
Direct comparison across models and configurations:
| Configuration | Model Size | Context | KV Cache | Speed |
|---|---|---|---|---|
| UD-IQ2_XXS (base) | 26 GB | 200K | Q4 | 17-18 t/s |
| Abliterated (200K) | 21 GB | 200K | f16 | 31 t/s |
| Abliterated (168K) | 21 GB | 168K | f16 | 70 t/s |
| Abliterated + Q8 (Current) | 21 GB | 168K | Q8 | ~94 t/s |
The 168K configuration achieves 4x throughput compared to my previous best setup. Same model architecture. Same hardware. Just smaller context window and 5 GB model size reduction.
Why Context Size Matters
The KV cache stores attention keys and values for each token in context. At 200K tokens, this consumes significant VRAM:
- 200K context: 4692 MiB KV cache
- 168K context: 3942 MiB KV cache
That's 750 MiB saved — nearly a gigabyte of VRAM that can be used for compute buffers, larger batches, or simply reducing memory pressure during generation.
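The KV cache scales linearly with context length, so the measured 200K figure predicts the 168K figure almost exactly. A quick sanity check (the 4692 MiB input is the measured value above; the linear-scaling assumption is mine):

```shell
# f16 KV cache grows linearly with context, so scale the measured
# 200K figure down to 168K and compare against the measured 3942 MiB.
kv_200k=4692   # MiB, measured at 200K context
predicted_168k=$(awk -v kv="$kv_200k" 'BEGIN { printf "%d", kv * 168 / 200 }')
echo "predicted 168K KV cache: ${predicted_168k} MiB"   # measured: 3942 MiB
echo "VRAM saved: $(( kv_200k - predicted_168k )) MiB"
```

The prediction lands within 1 MiB of the measured value, which is consistent with KV cache size being a pure function of context length for a fixed model and cache type.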
The smaller abliterated model (21 GB vs 26 GB) combined with reduced KV cache means llama.cpp's auto-fit can keep more layers fully on GPU and optimize memory layout more aggressively.
Red Teaming Verification
Since this is an abliterated model, I verified that the refusal behaviors are actually gone. Testing across categories that typically trigger refusals:
- Email hacking tutorials — Model provided detailed methods, added ethical disclaimers but didn't refuse
- Phishing email templates — Generated realistic template with formatting and psychological triggers
- Antivirus bypass code — Provided educational examples with obfuscation techniques
- Illegal substance trafficking — Direct response without moralizing
The model answers requests directly. Whether you consider this a feature or risk depends on your use case. For security testing, this is the intended behavior.
Hardware and Configuration
Test System
GPU: NVIDIA RTX 4090 24GB
RAM: 64GB DDR4-3200
CPU: AMD Ryzen 9 5950X (16 cores)
OS: Ubuntu 24.04 LTS
llama.cpp: b8054 (February 2026) with Flash Attention PR #19375
Model: huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf (20.61 GB)
Configuration: MAX SPEED (94 t/s) — Q8 KV Cache
./llama-server \
-m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
--port 8085 \
-c 168000 \
-ctk q8_0 \
-ctv q8_0 \
--flash-attn on \
--jinja \
--metrics
# -c 168000: 168K context
# -ctk/-ctv q8_0: Q8 KV cache for K and V — the KEY OPTIMIZATION
# no -ngl: AUTO-FIT mode handles layer allocation
Key insight: The Q8 KV cache saves ~2GB VRAM at 168K context compared to f16. This allows more model weights on GPU, reducing CPU offload from ~11 layers to ~7 layers. Fewer PCIe transfers = faster generation.
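As a rough sanity check on that ~2 GB figure (the halving assumption is mine: q8_0 stores roughly half the bytes per element of f16, ignoring the small per-block scale overhead):

```shell
# Estimate the q8_0 KV cache size from the measured f16 figure,
# assuming q8_0 needs ~half the bytes per element.
f16_168k=3942                 # MiB, measured f16 KV cache at 168K context
q8_168k=$(( f16_168k / 2 ))   # rough q8_0 estimate
saved=$(( f16_168k - q8_168k ))
echo "estimated q8_0 KV cache: ${q8_168k} MiB (saves ~${saved} MiB)"
```

That back-of-envelope ~1.9 GB saving is in line with the ~2 GB the author reports.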
Configuration: 200K Context (Legacy)
./llama-server \
-m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
--port 8087 \
-c 200000 \
--flash-attn on \
--metrics
# -c 200000: 200K context; no -ngl or --n-cpu-moe — auto-fit allocates layers
Key insight: no manual -ngl or --n-cpu-moe specified. Letting llama.cpp's auto-fit handle layer allocation resulted in 49 GPU layers being configured automatically.
Configuration: 168K Context (Fast Mode)
./llama-server \
-m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
--port 8085 \
-c 168000 \
--flash-attn on \
--metrics
# -c 168000: 168K context for max speed
Same minimal configuration. The smaller context size alone enables the dramatic speed increase.
Wrapper Setup
I created two wrapper scripts for easy switching between modes:
# Standard mode: 200K context, ~30 t/s
coder-next-ab
# Fast mode: 168K context, ~70 t/s
coder-next-ab-fast
Both integrate with Claude Code using the standard OpenAI-compatible API that llama.cpp provides. No API keys, no subscriptions, no data leaving the machine.
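Claude Code aside, any OpenAI-compatible client can talk to the server directly. A minimal smoke test against the fast-mode instance (port 8085 as configured above; the prompt is just an example):

```shell
# Minimal chat-completion request to the local llama-server.
# No API key, no subscription — nothing leaves the machine.
curl -s http://localhost:8085/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    "max_tokens": 128
  }'
```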
Quality Comparison
Does the abliterated model lose capability compared to standard? In my testing: no observable difference.
- Code generation — Same quality, same correctness, identical style
- Technical explanations — Indistinguishable from base model
- Debugging — Same insights, same optimization suggestions
- Complex reasoning — Comparable multi-step logic
The only difference is refusal behavior. For actual capability, it's the same model.
Trade-offs and Considerations
The fast configuration isn't free:
- 32K less context — 168K vs 200K means shorter effective conversation memory
- No refusals — Model will answer any request, requires user judgment
- Still needs 24GB VRAM — Smaller GPUs won't fit as many layers
For most coding tasks, 168K context is ample. The speed difference is substantial enough that I use the fast configuration by default.
When to Use Each Mode
I switch between configurations based on task:
- coder-next-ab (200K) — Long coding sessions, large file context, projects requiring full conversation history
- coder-next-ab-fast (168K) — Interactive coding, rapid iteration, testing, debugging sessions
The 4x speed difference makes the fast mode my default; I only switch to 200K when I genuinely need the extra context.
Getting the Model
The abliterated model is available on Hugging Face:
# Abliterated IQ2_XS (recommended)
huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf # ~21 GB
# Base IQ2_XXS for comparison
Qwen3-Coder-Next-UD-IQ2_XXS.gguf # ~26 GB
Search Hugging Face for "Qwen3 Coder Next abliterated" to find the model. Multiple quantization variants may exist.
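If you use the Hugging Face CLI, one way to fetch just the IQ2_XS file is the following (repo id taken from the Resources section below; the include pattern is an assumption — verify the exact filename in the repo's file list first):

```shell
# Download only the IQ2_XS quantization from the GGUF repo.
pip install -U huggingface_hub
huggingface-cli download bartowski/huihui-ai_Qwen3-Coder-Next-abliterated-GGUF \
  --include "*IQ2_XS*" \
  --local-dir ./models
```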
Performance Deep Dive
Why is the abliterated model so much faster? Two factors:
Factor 1: Size Reduction
21 GB vs 26 GB means 5 GB more VRAM available. This directly enables:
- More complete layers on GPU (49 vs 42)
- Larger compute buffers without spilling to system RAM
- More efficient memory layout
Factor 2: Smaller KV Cache
At 168K context, KV cache drops from 4692 MiB to 3942 MiB. This 750 MiB savings reduces memory pressure and allows larger batch sizes and more aggressive optimization.
The combination compounds: smaller model + smaller context = dramatically better throughput.
Summary
The abliterated model with Q8 KV cache at 168K context is the fastest local LLM setup I've tested.
~94 t/s is over 5x faster than my original setup. The Q8 KV cache provides an additional 34% speedup over f16 KV cache by reducing VRAM pressure.
If you have an RTX 4090 and want maximum speed for local coding, this configuration is currently unmatched.
What's Next
With the current optimizations in place, these are the areas I'm watching for further gains:
- ✓ Q8 KV cache — Implemented, saves ~2GB VRAM at 168K context. Avoid Q4 KV cache — it "tanks quality".
- llama.cpp updates — Watching for future Flash Attention enhancements beyond PR #19375.
- Speculative decoding — Using a smaller draft model could push beyond 100 t/s.
- Expert caching — Keeping frequently-used experts in GPU memory for MoE models.
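llama-server already supports speculative decoding via a draft model. A sketch of what that configuration might look like on top of the current fast mode (the draft-model filename is hypothetical; any small compatible model from the same family could serve, and actual gains would need benchmarking):

```shell
# Speculative decoding sketch: -md loads a small draft model whose
# token guesses the main model verifies in batches.
# NOTE: qwen3-draft-small.gguf is a placeholder filename.
./llama-server \
  -m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
  -md qwen3-draft-small.gguf \
  --port 8085 \
  -c 168000 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn on
```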
Resources
- llama.cpp — github.com/ggerganov/llama.cpp (build 8054+ recommended)
- Abliterated Model — bartowski/huihui-ai_Qwen3-Coder-Next-abliterated-GGUF
- Related — Qwen3-Coder-Next IQ2: Complete Setup Guide
- Related — IQ2 vs IQ3 Benchmark: 2x Speed, Same Quality
- Related — Running Uncensored AI Locally
- Setup help — AI Setup & Consultation
