TurboQuant on Apple Silicon
MLX, llama.cpp (Metal), Apple Silicon (M1 Pro), Python
Independent evaluation of TurboQuant (arXiv 2504.19874) ported to run on Apple Silicon. Open source and reproducible.
Problem
TurboQuant is a near-optimal vector quantization method, applied here to compressing the LLM KV cache, but the reference path assumed dedicated GPU hardware. The open question: can it run, and hold long-context accuracy, on consumer Apple Silicon with no discrete GPU?
Approach
- Worked from a CPU-only fork on an M1 Pro (16GB) and fixed five implementation bugs that were blocking correct inference.
- Ran a two-round study: an MLX path and a separate llama.cpp Metal path, each benchmarked on long-context needle-in-a-haystack retrieval.
- Published the full evaluation, the bug fixes, and reproducible results as an open-source repository, with writeups on LinkedIn and X.
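For readers unfamiliar with needle-in-a-haystack retrieval, the idea can be sketched in a few lines of Python. This is a minimal illustration, not the harness from the repository: the function names, the filler sentence, and the substring-match scoring rule are all assumptions made for the example.

```python
def build_haystack(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler sentences.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    sentences = [filler] * n_filler
    # Index of the filler sentence the needle is inserted before.
    position = round(depth * n_filler)
    sentences.insert(position, needle)
    return " ".join(sentences)


def retrieval_accuracy(answers: list[str], expected: str) -> float:
    """Fraction of model answers containing `expected` (case-insensitive substring match)."""
    if not answers:
        return 0.0
    hits = sum(expected.lower() in answer.lower() for answer in answers)
    return hits / len(answers)


# Example: hide a fact halfway through a long context, then score model answers.
needle = "The secret code is 7421."
prompt = build_haystack(needle, "The sky was clear that day.", n_filler=1000, depth=0.5)
prompt += " What is the secret code?"
```

In a real run, `prompt` would be sent to the quantized model at several depths and context lengths, and the collected answers would be passed to `retrieval_accuracy`.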
Results
- Needle retrieval @ 16K: 0% -> 100%
- KV cache memory: baseline -> significantly reduced
- Bugs fixed in fork: 5 blocking -> 0
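To give a sense of why KV cache size matters at 16K context, the standard size estimate can be written as a short calculation. The model shape below (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration, not the model evaluated here. The estimate also ignores per-block metadata such as scales, so real quantized caches are somewhat larger.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Approximate KV cache size in bytes, ignoring quantization metadata."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    total_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return total_values * bits_per_value / 8


GIB = 1024 ** 3
# Hypothetical model shape (an assumption, not the evaluated model).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=16 * 1024)

fp16_gib = kv_cache_bytes(**shape, bits_per_value=16) / GIB  # 2.0 GiB
q4_gib = kv_cache_bytes(**shape, bits_per_value=4) / GIB     # 0.5 GiB
print(f"fp16: {fp16_gib:.2f} GiB, 4-bit: {q4_gib:.2f} GiB")
```

These figures are arithmetic on the assumed shape only; they are not measurements from this study.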
Trade-offs
A CPU-only target trades raw throughput for accessibility: the point was to prove that strong quantization and long-context accuracy are reachable on hardware anyone has on their desk, not to win a latency benchmark. This reflects how I approach AI infrastructure: take a research-grade method, get it running on accessible hardware, measure it honestly, and share it.