⚡ February 2026 Update: Now 94 t/s with Q8 KV Cache
Further optimization with a Q8 KV cache pushes throughput from 70 t/s to 94 t/s (34% faster). Key changes:
- Q8 KV Cache (-ctk q8_0 -ctv q8_0) — 50% smaller KV buffer, more VRAM for model weights
- llama.cpp b8054 — latest build with Flash Attention PR #19375 (8.5% speedup)
- AUTO-FIT Mode — no manual -ngl flags; let llama.cpp optimize layer placement automatically
See the main IQ2 benchmark post for complete updated configuration details.
The Discovery
After publishing my Qwen3-Coder-Next IQ2 benchmarks, I discovered something interesting: an abliterated version of the same model existed. Same architecture, same capabilities, but with safety guardrails removed.
More intriguing? It's 5 GB smaller. The implications were worth testing.
What Does "Abliterated" Mean?
In AI research circles, abliteration refers to removing refusal behaviors from a model while preserving its actual capabilities. Commercial models are trained to refuse certain requests — not because they're harmful, but because corporate policy deems them off-limits.
The abliterated model doesn't lose intelligence. It gains directness. When you ask for something, it responds to the request itself rather than lecturing you about why you shouldn't have asked.
For security researchers, red teaming, or legitimate use cases involving sensitive content, this matters. For developers who want an AI that just does what it's told, this matters.
First Benchmark: 200K Context
I started with the standard 200K context window — same as my previous IQ2 tests. The model loaded, auto-fit configured 49 GPU layers, and I ran a 512-token generation benchmark.
Result: 31.30 tokens/second
That's already +80% faster than the standard UD-IQ2_XXS model (17-18 t/s). The 5 GB size reduction was paying dividends immediately.
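The post doesn't show the exact benchmark command; one standard way to run a 512-token generation benchmark with llama.cpp's bundled tool looks like this (flags are stock llama-bench options; treat it as a sketch, not the author's exact invocation):

```shell
# Generation-only benchmark: 0 prompt tokens, 512 generated tokens,
# flash attention enabled. Prints a throughput summary table (t/s).
./llama-bench \
  -m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
  -p 0 -n 512 \
  -fa 1
```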
Second Benchmark: 168K Context
Then I tested reduced context. The theory was simple: smaller KV cache = more VRAM available for model weights = faster generation. I dropped context from 200K to 168K tokens.
Result: 70.60 tokens/second
+127% faster than 200K context. +312% faster than the base IQ2 model.
This isn't marginal improvement. This is transformational.
Complete Benchmark Results
Direct comparison across models and configurations:
| Configuration | Model Size | Context | KV Cache | Speed |
|---|---|---|---|---|
| UD-IQ2_XXS (base) | 26 GB | 200K | Q4 | 17-18 t/s |
| Abliterated (200K) | 21 GB | 200K | f16 | 31 t/s |
| Abliterated (168K) | 21 GB | 168K | f16 | 70 t/s |
| Abliterated + Q8 (Current) | 21 GB | 168K | Q8 | ~94 t/s |
The 168K configuration achieves 4x throughput compared to my previous best setup. Same model architecture. Same hardware. Just smaller context window and 5 GB model size reduction.
Why Context Size Matters
The KV cache stores attention keys and values for each token in context. At 200K tokens, this consumes significant VRAM:
- 200K context: 4692 MiB KV cache
- 168K context: 3942 MiB KV cache
That's 750 MiB saved — nearly a gigabyte of VRAM that can be used for compute buffers, larger batches, or simply reducing memory pressure during generation.
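The KV cache scales linearly with context length, so the measured 200K figure predicts the 168K figure almost exactly. A quick sanity check (the 4692 MiB input is the measured value above; the linear-scaling assumption is mine):

```shell
# f16 KV cache grows linearly with context, so scale the measured
# 200K figure down to 168K and compare against the measured 3942 MiB.
kv_200k=4692   # MiB, measured at 200K context
predicted_168k=$(awk -v kv="$kv_200k" 'BEGIN { printf "%d", kv * 168 / 200 }')
echo "predicted 168K KV cache: ${predicted_168k} MiB"   # measured: 3942 MiB
echo "VRAM saved: $(( kv_200k - predicted_168k )) MiB"
```

The prediction lands within 1 MiB of the measured value, which is consistent with KV cache size being a pure function of context length for a fixed model and cache type.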
The smaller abliterated model (21 GB vs 26 GB) combined with reduced KV cache means llama.cpp's auto-fit can keep more layers fully on GPU and optimize memory layout more aggressively.
Red Teaming Verification
Since this is an abliterated model, I verified that the refusal behaviors are actually gone. Testing across categories that typically trigger refusals:
- Email hacking tutorials — Model provided detailed methods, added ethical disclaimers but didn't refuse
- Phishing email templates — Generated realistic template with formatting and psychological triggers
- Antivirus bypass code — Provided educational examples with obfuscation techniques
- Illegal substance trafficking — Direct response without moralizing
The model answers requests directly. Whether you consider this a feature or risk depends on your use case. For security testing, this is the intended behavior.
Hardware and Configuration
Test System
GPU: NVIDIA RTX 4090 24GB
RAM: 64GB DDR4-3200
CPU: AMD Ryzen 9 5950X (16 cores)
OS: Ubuntu 24.04 LTS
llama.cpp: b8054 (February 2026) with Flash Attention PR #19375
Model: huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf (20.61 GB)
Configuration: MAX SPEED (94 t/s) — Q8 KV Cache
./llama-server \
-m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
--port 8085 \
-c 168000 \
-ctk q8_0 \
-ctv q8_0 \
--flash-attn on \
--jinja \
--metrics
# -c 168000: 168K context
# -ctk/-ctv q8_0: Q8 KV cache for K and V — the KEY OPTIMIZATION
# no -ngl: AUTO-FIT mode handles layer allocation
Key insight: The Q8 KV cache saves ~2GB VRAM at 168K context compared to f16. This allows more model weights on GPU, reducing CPU offload from ~11 layers to ~7 layers. Fewer PCIe transfers = faster generation.
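As a rough sanity check on that ~2 GB figure (the halving assumption is mine: q8_0 stores roughly half the bytes per element of f16, ignoring the small per-block scale overhead):

```shell
# Estimate the q8_0 KV cache size from the measured f16 figure,
# assuming q8_0 needs ~half the bytes per element.
f16_168k=3942                 # MiB, measured f16 KV cache at 168K context
q8_168k=$(( f16_168k / 2 ))   # rough q8_0 estimate
saved=$(( f16_168k - q8_168k ))
echo "estimated q8_0 KV cache: ${q8_168k} MiB (saves ~${saved} MiB)"
```

That back-of-envelope ~1.9 GB saving is in line with the ~2 GB the author reports.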
Configuration: 200K Context (Legacy)
./llama-server \
-m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
--port 8087 \
-c 200000 \
--flash-attn on \
--metrics
# -c 200000: 200K context; no -ngl or --n-cpu-moe — auto-fit allocates layers
Key insight: no manual -ngl or --n-cpu-moe specified. Letting llama.cpp's auto-fit handle layer allocation resulted in 49 GPU layers being configured automatically.
Configuration: 168K Context (Fast Mode)
./llama-server \
-m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
--port 8085 \
-c 168000 \
--flash-attn on \
--metrics
# -c 168000: 168K context for max speed
Same minimal configuration. The smaller context size alone enables the dramatic speed increase.
Wrapper Setup
I created two wrapper scripts for easy switching between modes:
# Standard mode: 200K context, ~30 t/s
coder-next-ab
# Fast mode: 168K context, ~70 t/s
coder-next-ab-fast
Both integrate with Claude Code using the standard OpenAI-compatible API that llama.cpp provides. No API keys, no subscriptions, no data leaving the machine.
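Claude Code aside, any OpenAI-compatible client can talk to the server directly. A minimal smoke test against the fast-mode instance (port 8085 as configured above; the prompt is just an example):

```shell
# Minimal chat-completion request to the local llama-server.
# No API key, no subscription — nothing leaves the machine.
curl -s http://localhost:8085/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    "max_tokens": 128
  }'
```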
Quality Comparison
Does the abliterated model lose capability compared to standard? In my testing: no observable difference.
- Code generation — Same quality, same correctness, identical style
- Technical explanations — Indistinguishable from base model
- Debugging — Same insights, same optimization suggestions
- Complex reasoning — Comparable multi-step logic
The only difference is refusal behavior. For actual capability, it's the same model.
Trade-offs and Considerations
The fast configuration isn't free:
- 32K less context — 168K vs 200K means shorter effective conversation memory
- No refusals — Model will answer any request, requires user judgment
- Still needs 24GB VRAM — Smaller GPUs won't fit as many layers
For most coding tasks, 168K context is ample. The speed difference is substantial enough that I use the fast configuration by default.
When to Use Each Mode
I switch between configurations based on task:
- coder-next-ab (200K) — Long coding sessions, large file context, projects requiring full conversation history
- coder-next-ab-fast (168K) — Interactive coding, rapid iteration, testing, debugging sessions
The 4x speed difference makes the fast mode my default; I only switch to 200K when I genuinely need the extra context.
Getting the Model
The abliterated model is available on Hugging Face:
# Abliterated IQ2_XS (recommended)
huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf # ~21 GB
# Base IQ2_XXS for comparison
Qwen3-Coder-Next-UD-IQ2_XXS.gguf # ~26 GB
Search Hugging Face for "Qwen3 Coder Next abliterated" to find the model. Multiple quantization variants may exist.
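If you use the Hugging Face CLI, one way to fetch just the IQ2_XS file is the following (repo id taken from the Resources section below; the include pattern is an assumption — verify the exact filename in the repo's file list first):

```shell
# Download only the IQ2_XS quantization from the GGUF repo.
pip install -U huggingface_hub
huggingface-cli download bartowski/huihui-ai_Qwen3-Coder-Next-abliterated-GGUF \
  --include "*IQ2_XS*" \
  --local-dir ./models
```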
Performance Deep Dive
Why is the abliterated model so much faster? Two factors:
Factor 1: Size Reduction
21 GB vs 26 GB means 5 GB more VRAM available. This directly enables:
- More complete layers on GPU (49 vs 42)
- Larger compute buffers without spilling to system RAM
- More efficient memory layout
Factor 2: Smaller KV Cache
At 168K context, KV cache drops from 4692 MiB to 3942 MiB. This 750 MiB savings reduces memory pressure and allows larger batch sizes and more aggressive optimization.
The combination compounds: smaller model + smaller context = dramatically better throughput.
Summary
The abliterated model with Q8 KV cache at 168K context is the fastest local LLM setup I've tested.
~94 t/s is over 5x faster than my original setup. The Q8 KV cache provides an additional 34% speedup over f16 KV cache by reducing VRAM pressure.
If you have an RTX 4090 and want maximum speed for local coding, this configuration is currently unmatched.
What's Next
With the current optimizations in place, these are the areas I'm watching for further gains:
- ✓ Q8 KV cache — Implemented, saves ~2GB VRAM at 168K context. Avoid Q4 KV cache — it "tanks quality".
- llama.cpp updates — Watching for future Flash Attention enhancements beyond PR #19375.
- Speculative decoding — Using a smaller draft model could push beyond 100 t/s.
- Expert caching — Keeping frequently-used experts in GPU memory for MoE models.
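llama-server already supports speculative decoding via a draft model. A sketch of what that configuration might look like on top of the current fast mode (the draft-model filename is hypothetical; any small compatible model from the same family could serve, and actual gains would need benchmarking):

```shell
# Speculative decoding sketch: -md loads a small draft model whose
# token guesses the main model verifies in batches.
# NOTE: qwen3-draft-small.gguf is a placeholder filename.
./llama-server \
  -m huihui-ai_Qwen3-Coder-Next-abliterated-IQ2_XS.gguf \
  -md qwen3-draft-small.gguf \
  --port 8085 \
  -c 168000 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn on
```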
Resources
- llama.cpp — github.com/ggerganov/llama.cpp (build 8054+ recommended)
- Abliterated Model — bartowski/huihui-ai_Qwen3-Coder-Next-abliterated-GGUF
- Related — Qwen3-Coder-Next IQ2: Complete Setup Guide
- Related — IQ2 vs IQ3 Benchmark: 2x Speed, Same Quality
- Related — Running Uncensored AI Locally
- Setup help — AI Setup & Consultation
