
M3 Mac Runs Qwen 3.5 MoE Model via SSD Streaming Efficiently

Qwen 3.5 397B-A17B (209GB on disk) runs on an M3 Mac at ~5.7 t/s with only 5.5GB of active memory by quantizing and streaming weights from SSD (~17GB/s). Support: 54%, Oppose: 20%.

@simonw posted on X

Dan says he's got Qwen 3.5 397B-A17B - a 209GB-on-disk MoE model - running on an M3 Mac at ~5.7 tokens per second using only 5.5 GB of active memory (!) by quantizing and then streaming weights from SSD (at ~17GB/s), since MoE models only use a small subset of their weights for each token.
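The claimed numbers can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes ~17B active parameters per token (the "A17B" in the model name), 2-bit quantization, and that every active weight is read from SSD on every token; these are illustrative assumptions, not details confirmed by the post.

```python
# Back-of-envelope check: is ~5.7 tok/s plausible when streaming weights from SSD?
# Assumptions (not from the original post): ~17e9 active parameters per token,
# 2-bit quantization (0.25 bytes/param), and a ~17 GB/s SSD read ceiling.

active_params = 17e9        # "A17B": active parameters per token (assumed)
bytes_per_param = 2 / 8     # 2-bit quantization -> 0.25 bytes per parameter
ssd_bandwidth = 17e9        # bytes/s: the ~17 GB/s figure from the post

bytes_per_token = active_params * bytes_per_param   # ~4.25 GB read per token
tokens_per_sec = ssd_bandwidth / bytes_per_token    # ~4 tok/s upper bound

print(f"{bytes_per_token / 1e9:.2f} GB read per token")
print(f"{tokens_per_sec:.1f} tok/s if every active weight comes from SSD")
```

The reported ~5.7 tok/s sits above this worst-case bound, which is consistent with some weights (shared layers, recently used experts) staying cached in RAM rather than being re-read from SSD for every token.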


Community Sentiment Analysis

Real-time analysis of public opinion and engagement

Sentiment Distribution

Engaged: 74%
Positive: 54%
Negative: 20%
Neutral: 26%

Key Takeaways

What the community is saying — both sides

Supporting

1. MoE + quantization + SSD streaming lets huge models run locally: people cite 5.7 tok/s on a 209GB/397B MoE model while touching only ~5.5GB of active memory, turning NAND into a slow VRAM and shifting the bottleneck from RAM to SSD I/O.

2. 2-bit quantization broke tool-calling; upgrading to 4-bit (≈4.36 t/s) restored it, so extreme compression can save resources but can also break model behavior for some tasks.

3. Memory determines viability: under 8GB context collapses fast, 8–16GB is constrained but usable, and 16GB+ is workable; mmap_lock and similar tricks can double swap performance if you have 8–16GB to spare.

4. Local inference needs no API keys, has lower latency, and keeps data on-device, reframing the cloud-vs-local debate from "can you run it" to "is the latency acceptable."

5. Local runs are cheaper than API costs for multi-agent setups, and supporters argue this weakens the case for centralized, datacenter-only inference: capability is getting cheaper faster than supervision.

6. The approach rests on streaming active experts on demand, prefetching expert matrices, and treating SSD bandwidth (~17GB/s ceilings) as the key resource; software design choices will determine how far local inference scales.

7. Even supporters flag potential latency spikes, quality degradation, and workflow tradeoffs, asking whether these speeds are acceptable for production coding/agent loops and urging people to test use-case latency and quality before assuming practicality.
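The streaming mechanism the community describes (paging in only the experts the router selects) can be sketched in a few lines. Everything here is illustrative: the flat file layout, the expert count, and the expert size are hypothetical stand-ins, not the actual implementation from the post.

```python
# Illustrative sketch of on-demand expert streaming (hypothetical file layout,
# not the actual implementation): memory-map the weight file so the OS pages in
# only the experts selected for this token; untouched experts never leave the SSD.
import mmap

N_EXPERTS = 32              # hypothetical expert count
EXPERT_SIZE = 256 * 1024    # hypothetical bytes per quantized expert

# Stand-in weight file so the sketch is runnable.
with open("experts.bin", "wb") as f:
    f.write(b"\x00" * (N_EXPERTS * EXPERT_SIZE))

def load_active_experts(mm, expert_ids):
    """Read only the selected experts' byte ranges from the mapping."""
    return [mm[e * EXPERT_SIZE:(e + 1) * EXPERT_SIZE] for e in expert_ids]

with open("experts.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    active = load_active_experts(mm, expert_ids=[3, 7, 19, 28])
    print(sum(len(a) for a in active))  # bytes actually read for this token
```

Because the mapping is read-only, repeated inference only generates SSD reads, which is why write-endurance concerns (see the opposing takeaways) apply less than they might appear to.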

Opposing

1. SSD write endurance is finite: hammering an SSD repeatedly will wear it out, and when the storage is soldered to a Mac mini, that wear can force replacing the entire machine, not just the drive.

2. Transfer speed from SSD/PCIe is the bottleneck: real-world streaming is limited by bandwidth (examples cited: ~6 GB/s on M3, ~7.5 GB/s over PCIe 4.0 on M3 Pro/Max, ~10–11 GB/s on M5), so I/O throughput is the main constraint.

3. Aggressive quantization harms quality: cutting weights to 2 bits or shifting between Q2/Q4 formats materially degrades outputs, and claims of “production quality” for heavily quantized runs are contested.

4. Latency and token throughput make it impractical: reported rates (e.g., ~5.7 tokens/second) and slow responses make these setups “too slow to use” or outright “unusable” for many real tasks.

5. It’s an engineering stunt, not a general solution: a cool technical achievement that addresses a narrow problem; many see it as hobbyist tinkering rather than production-ready engineering.

6. Streaming from SSD doesn’t fix model trust: reducing hardware requirements doesn’t solve hallucinations, correctness, or the broader “authority” problem; it’s not evidence of intelligence or reliability.

7. Some say the demo relied on GPU work: critics argue the presentation may have actually run a smaller model on GPU (e.g., a 17B), which would undercut claims of pure SSD-based inference.
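The disputed bandwidth figures are easy to measure directly. Below is a minimal sequential-read benchmark; the file path and sizes are placeholders, and results vary with hardware and caching.

```python
# Minimal sequential-read benchmark to estimate SSD throughput.
# Caveat: reading a file just after writing it largely measures the OS page
# cache; for a realistic number, use a file much larger than RAM.
import os
import time

PATH = "testfile.bin"
CHUNK = 64 * 1024 * 1024        # 64 MiB per read
N_CHUNKS = 4                    # small file, just for the sketch

with open(PATH, "wb") as f:     # create a test file of random bytes
    f.write(os.urandom(CHUNK) * N_CHUNKS)

start = time.perf_counter()
total = 0
with open(PATH, "rb", buffering=0) as f:
    while chunk := f.read(CHUNK):
        total += len(chunk)
elapsed = time.perf_counter() - start

print(f"{total / elapsed / 1e9:.2f} GB/s over {total / 1e9:.2f} GB")
```

Numbers like the cited ~6 GB/s (M3) versus the ~17 GB/s in the original claim matter directly, since on a bandwidth-bound workload token throughput scales with read speed.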

Top Reactions

Most popular replies, ranked by engagement


@simonw

Supporting

That doesn't matter in this case because it's effectively a read-only workload - all of that read activity shouldn't hurt the SSD at all

32 · 1 · 1.4K

@simonw

Supporting

Dan found that the 2-bit quantization broke tool calling but upgrading to 4-bit (at 4.36 tokens/second) got that working

25 · 5 · 5.1K

@NirDiamantAI

Supporting

btw llama.cpp's mmap_lock option forces the active experts into RAM which gets you like 2x faster swapping if you've got 8-16GB to spare

25 · 0 · 2.7K

@FixTechStuff1

Opposing

One problem with hammering your SSD like this is SSDs have a finite number of writes. This is fine if SSDs are cheap and replaceable, but when it’s hard soldered to your Mac mini, then you’ll eventually have to replace the whole thing.

11 · 1 · 1.5K

@pharmst

Opposing

I refuse to believe that quantising down to 2 bits per weight & reducing the number of experts doesn’t measurably impact the quality of the output.

2 · 0 · 85

@mykola

Opposing

Wat.

1 · 0 · 237
