@simonw
That doesn't matter in this case because it's effectively a read-only workload - all of that read activity shouldn't hurt the SSD at all
Qwen 3.5 397B-A17B (209GB on disk) runs on an M3 Mac at ~5.7 t/s using only 5.5GB of active memory, by quantizing and streaming weights from SSD (~17GB/s). Sentiment: 54% supportive, 20% critical.
Dan says he's got Qwen 3.5 397B-A17B - a 209GB on disk MoE model - running on an M3 Mac at ~5.7 tokens per second using only 5.5 GB of active memory (!) by quantizing and then streaming weights from SSD (at ~17GB/s), since MoE models only use a small subset of their weights for each token
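The headline numbers are roughly self-consistent. A back-of-envelope sketch, assuming (this is an assumption, not a figure from the thread) that all ~17B active parameters are re-read from SSD for every token at 2 bits per weight:

```python
# Back-of-envelope: worst-case token rate for SSD-streamed MoE inference.
# Assumes every active weight is streamed from SSD for every token,
# with no expert caching between tokens (a pessimistic assumption).

active_params = 17e9      # "A17B": ~17B active parameters per token
bits_per_weight = 2       # the 2-bit quantization discussed in the replies
ssd_bandwidth = 17e9      # bytes/s, the ~17GB/s streaming figure

bytes_per_token = active_params * bits_per_weight / 8   # ~4.25 GB/token
worst_case_tps = ssd_bandwidth / bytes_per_token        # ~4.0 tokens/s

print(f"{bytes_per_token / 1e9:.2f} GB/token -> {worst_case_tps:.1f} t/s worst case")
```

The observed ~5.7 t/s sits above this worst case, which is consistent with some experts staying cached in RAM and being reused across consecutive tokens.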
Real-time analysis of public opinion and engagement
What the community is saying — both sides
Supporters cite ~5.7 tok/s on a 209GB, 397B-parameter MoE model while touching only ~5.5GB of active memory, effectively turning NAND into slow VRAM and shifting the bottleneck from RAM to SSD I/O.
Several note that 2-bit quantization broke tool calling and that upgrading to 4-bit (≈4.36 t/s) restored it: extreme compression can save resources but can also break model behavior for some tasks.
Commenters map out RAM tiers: 8–16GB is constrained but usable, 16GB+ is workable; mmap_lock and similar tricks can roughly double swap performance if you have 8–16GB to spare.
Some see it reframing the cloud vs. local debate from "can you run it" to "is the latency acceptable."
Others see promise for multi-agent setups and argue it weakens the case for centralized, datacenter-only inference: capability is getting cheaper faster than supervision.
Engineers discuss optimizations such as prefetching expert matrices and treating SSD bandwidth (~17GB/s ceilings) as the key resource; software design choices will determine how far local inference scales.
Pragmatists ask whether these speeds are acceptable for production coding/agent loops and urge people to test use-case latency and quality before assuming practicality.
hammering an SSD repeatedly will wear it out; when the storage is soldered to a Mac mini, that wear can force replacing the entire machine, not just the drive.
real-world streaming is limited by bandwidth (examples cited: ~6 GB/s on M3, PCIe4.0 ~7.5 GB/s on M3 Pro/Max, ~10–11 GB/s on M5), so IO throughput is the main constraint.
cutting weights to 2 bits or shifting between Q2/Q4 formats materially degrades outputs; claims of “production quality” for heavily quantized runs are contested.
reported token rates (e.g., ~5.7 tokens/second) and slow responses make these setups “too slow to use” or outright “unusable” for many real tasks.
cool technical achievement but addresses a narrow problem; many see it as hobbyist/tinkering rather than production-ready engineering.
reducing hardware requirements doesn’t solve hallucinations, correctness, or the broader “authority” problem; it’s not evidence of intelligence or reliability.
critics argue the presentation may have actually run a smaller model on GPU (e.g., a 17B), which would undercut claims of pure SSD-based inference.
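On the wear question, the disagreement turns on reads vs. writes: NAND endurance ratings (TBW, terabytes written) count program/erase cycles, i.e. writes, while streaming inference is almost entirely reads. A quick sketch of the read volume involved, using the thread's bandwidth figure and a hypothetical daily-usage number:

```python
# Read volume of sustained SSD-streamed inference. TBW endurance
# ratings count writes only; sustained reads do not consume them.

read_bw = 17e9      # bytes/s streamed from SSD (figure from the thread)
hours_per_day = 8   # hypothetical heavy daily use (an assumption)

daily_reads_tb = read_bw * hours_per_day * 3600 / 1e12   # ~490 TB/day

print(f"~{daily_reads_tb:.0f} TB read per day, none of it counted against TBW")
```

So the read traffic is enormous, but under the usual endurance model it does not eat into the drive's rated write budget, which is the point the top reply makes.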
Most popular replies, ranked by engagement
Dan found that the 2-bit quantization broke tool calling but upgrading to 4-bit (at 4.36 tokens/second) got that working
btw llama.cpp's mmap_lock option forces the active experts into RAM which gets you like 2x faster swapping if you've got 8-16GB to spare
One problem with hammering your SSD like this is that SSDs have a finite number of writes. This is fine if SSDs are cheap and replaceable, but when it’s hard soldered to your Mac mini, then you’ll eventually have to replace the whole thing.
I refuse to believe that quantising down to 2 bits per weight & reducing the number of experts doesn’t measurably impact the quality of the output.
Wat.
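The mmap-style tricks in the replies (prefetching expert matrices so the pages needed next are already in the page cache) can be sketched in a few lines. A minimal illustration on Linux/macOS using Python's `mmap.madvise`; the weight file and `expert_offsets` are hypothetical stand-ins, not llama.cpp internals, and actual RAM pinning (mlock) would need an extra OS call not shown here:

```python
import mmap
import os
import tempfile

# Hypothetical weight file holding two "expert" slices of 1 MiB each.
EXPERT_SIZE = 1 << 20
expert_offsets = {0: 0, 1: EXPERT_SIZE}

fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(2 * EXPERT_SIZE))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Prefetch: hint the kernel to pull the next expert's pages into
    # the page cache before the matmul that needs them.
    if hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED, expert_offsets[1], EXPERT_SIZE)

    # Touching the pages now should hit the page cache, not raw SSD.
    hot = mm[expert_offsets[1]:expert_offsets[1] + EXPERT_SIZE]
    mm.close()

os.close(fd)
os.unlink(path)
print(len(hot))
```

This is the same idea behind the mmap_lock suggestion above: keep the weights that are about to be (or frequently are) touched resident in RAM, so only cold experts pay the SSD latency.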