Fused TBQ4 Flash Attention: 82 tok/s with Lossless 4-bit KV at 200K
We fused TurboQuant dequantization directly into the flash attention kernel — reading raw TBQ4 blocks inline via centroid lookup in the FWHT-rotated domain. 82+ tok/s with lossless 4.25 bpv KV at 200K context on an RTX 4090. To our knowledge, no one else has done this.
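The core idea — store KV blocks as 4-bit centroid indices in a rotated domain, and reconstruct values on read via a table lookup plus the inverse rotation — can be illustrated in NumPy. This is a minimal reference sketch, not the fused CUDA kernel: the `CENTROIDS` codebook, block size, and function names here are illustrative assumptions, not the actual TBQ4 format.

```python
import numpy as np

def fwht(x):
    # Orthonormal fast Walsh-Hadamard transform (length must be a power of two).
    # With 1/sqrt(n) scaling, fwht is its own inverse: fwht(fwht(x)) == x.
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

# Hypothetical 16-entry codebook for 4-bit codes (illustrative values only).
CENTROIDS = np.linspace(-2.5, 2.5, 16)

def encode_block(block):
    # Rotate into the FWHT domain, then map each value to its nearest centroid index.
    rotated = fwht(block)
    return np.abs(rotated[:, None] - CENTROIDS[None, :]).argmin(axis=1).astype(np.uint8)

def decode_block(codes):
    # The "fused" read path: centroid lookup, then the inverse rotation —
    # this is what the kernel does inline instead of dequantizing to a buffer first.
    return fwht(CENTROIDS[codes])

block = np.random.default_rng(0).standard_normal(64)
recon = decode_block(encode_block(block))
```

The rotation spreads outliers across the block so a small shared codebook covers the value range; fusing the lookup into the attention kernel avoids ever materializing a dequantized KV buffer in VRAM.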
Read Article →

Turbo4 KV Cache: Better Than Q8 at Half the VRAM
Trellis-coded quantization benchmark: Turbo4 scores 100/100 vs Q8_0's 91 on a hardened agentic benchmark. Lossless FP16 quality, 40 t/s, 256K context on an RTX 4090 24GB.
Read Article →

Q4 KV Cache: Surprisingly Viable
Benchmarks show Q4 KV cache producing faster, higher-quality code than Q8. The conventional wisdom about Q4 being unusable may be wrong.
Read Article →

IQ2 vs IQ3 Quantization: 2x Speed, Same Quality
Comprehensive RTX 4090 benchmark: IQ2_XS hits 86 t/s vs IQ3_XXS at 44 t/s. Quality testing across coding, debugging, and agentic tasks shows negligible difference.
Read Article →

Maxing Out Qwen3-Coder-Next Abliterated: 94 t/s
The abliterated version hits 94 t/s at 168K context with Q8 KV cache — over 5x faster than the base model. Complete optimization guide with red-team verification.
Read Article →

Qwen3-Coder-Next: IQ2 vs IQ3 Benchmarks
IQ2_XXS achieves 22 t/s on RTX 4090 at 200K context — 85% faster than IQ3 with no measurable quality loss. Full benchmark data and configuration.
Read Article →

Running Uncensored AI Locally: My PRISM Setup
How I set up GLM-4.7-Flash locally with web search and vision capabilities. No content filters, no API subscription, just my hardware doing what I tell it to.
Read Article →

Get AI Tips in Your Inbox
Subscribe for tutorials on local AI, Stable Diffusion, LoRA training, and Claude Code workflows.
More Articles Coming Soon
We're preparing additional tutorials and guides:
- Stable Diffusion Mastery — Advanced prompting and workflow optimization
- LoRA Training from Scratch — Create custom models for any subject
- LLM Fine-tuning Guide — Adapt open-source models for your domain
- AI Automation Pipelines — Build end-to-end workflows with n8n
- Claude Code Pro Tips — Advanced usage and MCP development
Have a topic you'd like us to cover? Let us know.