
Your Phone's SSD Is the New VRAM
A 397B parameter model running on 12GB of RAM. The trick isn't new ML theory. It's demand paging, the same architecture pattern we've used since the 1960s.
Someone ran a 397-billion parameter model on a phone with 12GB of RAM this week. Not through an API. Not on a server. On the actual phone, in airplane mode.
The speed was 0.6 tokens per second. One word every two seconds. Completely unusable for chat. And I think it's one of the most interesting systems engineering demos I've seen this year.
The numbers that shouldn't work
The model is Qwen3.5-397B-A17B, a Mixture of Experts architecture. 397 billion total parameters. At 4-bit quantization, that's roughly 209GB of weights. The iPhone 17 Pro has 12GB of LPDDR5X RAM.
209GB into 12GB doesn't fit. Obviously.
Flash-MoE, the open-source project behind this demo, solves it with a technique that every backend engineer already knows. They just don't know it applies here.
You've already built this system
If you've ever tuned PostgreSQL's shared_buffers, you've worked with the exact same pattern Flash-MoE uses. If you've configured InnoDB's buffer pool, same thing. If you've ever debugged page fault storms in a Linux process that exceeded its resident set size, you already understand why this demo runs at 0.6 tokens per second on a phone and 5.7 tokens per second on a laptop.
The pattern is demand paging. OS textbooks have covered it since the 1960s.
Here's what Flash-MoE actually does:
- The full 209GB model lives on the phone's NVMe SSD
- For each token, the MoE router picks which experts to activate
- Those expert weights get loaded from SSD into memory via parallel pread() calls
- The OS page cache keeps recently used experts hot in RAM
- Cold experts get evicted when memory pressure rises
That's it. No custom caching layer. No sophisticated eviction policy. The authors explicitly call their approach "Trust the OS." They let the filesystem page cache handle which experts stay resident and which get evicted. The same LRU-ish eviction that manages your database's buffer pool manages the neural network's working set.
Why MoE makes this possible (and why dense models can't)
This trick only works because of how absurdly sparse MoE models are at inference time.
A dense 397B model needs all 397B parameters loaded for every single token. No shortcuts. You need the full model in memory or you get nothing.
Qwen3.5-397B-A17B has 512 experts per layer but only activates 10 per token. The Flash-MoE team found you can prune that to 4 active experts with no measurable quality loss. Four out of 512.
That changes the math completely.
Each expert is roughly 6.75MB. Four experts per layer, 60 layers. Your active working set per token is around 1.6GB. Even with overhead for attention layers, the shared expert, and OS needs, the whole engine uses about 6GB of RAM on the laptop build. On a 48GB machine, that leaves 42GB for the OS page cache to keep hot experts resident.
The cache hit ratio is everything. When the experts your model needs are already in the page cache, you're running at memory bandwidth speed. When they're not, you're waiting on SSD reads. The difference between "usable" and "unusable" is literally your cache hit rate, exactly like a database workload where queries slow to a crawl when your working set exceeds the buffer pool.
The latency breakdown nobody's talking about
On a MacBook Pro with an M3 Max (48GB RAM, 17.5 GB/s SSD reads), Flash-MoE hits 4.4 to 5.7 tokens per second. On an M4 Max with 64GB, it reaches 10-15 TPS.
Those numbers tell you something important about where time goes.
Apple Silicon has a quirk that matters here: SSD DMA and GPU compute share the same memory controller. You can't overlap them profitably. That means every cache miss is a serial stall. Load expert from SSD, then compute, then load the next one. No pipelining.
On the iPhone 17 Pro with 12GB RAM, the working set of hot experts barely fits alongside the OS and the app itself. Cache misses are frequent. Every miss means a round trip to NAND flash. 0.6 tokens per second isn't a performance bug. It's exactly what you'd predict from the cache hit ratio on a memory-constrained device.
This is why I find the demo more interesting than the "400B on a phone!" headlines suggest. It's a clean demonstration of a universal performance principle: your system's speed is determined by the ratio of cache hits to cache misses in whatever storage hierarchy you're operating in. CPU cache, database buffer pool, CDN layer, or neural network expert cache. Same math. Same bottleneck. Same tuning levers.
The code was written in 24 hours with Claude
One detail that caught my eye: the Flash-MoE paper and all the code were produced by Daniel Woods working with Claude Opus 4.6 in a 24-hour sprint. The entire inference engine is about 7,000 lines of C and hand-tuned Metal shaders. No Python. No ML frameworks.
I initially dismissed that as a gimmick. Then I looked at the code. It's doing parallel pread() with GCD dispatch groups, custom 4-bit dequantization kernels with FMA optimization, and Accelerate BLAS calls for the linear attention layers. This isn't generated boilerplate. It's the kind of low-level systems code that requires understanding memory layout, GPU compute pipelines, and NVMe I/O patterns simultaneously.
Whether you attribute that to the human, the model, or the collaboration pattern, the output is a genuine inference engine that runs a 397B model on consumer hardware. That's the proof that matters.
The SSD wear question
Every backend engineer's second thought after "cool demo" is "what about write amplification?"
Let's do the math, using the working-set figures from earlier. Each token activates 4 experts per layer across 60 layers at roughly 6.75MB each, about 1.6GB of expert weights per token. Assume a 50% cache hit rate (generous for the iPhone's constrained memory). That's roughly 800MB of actual SSD reads per token.
Generate 10,000 tokens in a day (about 4.6 hours of continuous inference at 0.6 TPS). That's around 8TB of SSD reads. Apple's SSDs are rated for hundreds of terabytes written (TBW), but that rating measures writes, and reads don't wear flash cells the way writes do. The wear concern is mostly a non-issue for reads.
But there's a subtler problem. The OS page cache evicts dirty pages under memory pressure, and metadata writes from the filesystem add up. It's not zero wear. For a proof of concept, it doesn't matter. For a feature Apple ships to a billion phones, it would need careful measurement.
What this actually means for your architecture
I don't think anyone's going to chat with a 400B model on their phone at one word per two seconds. That's not the point.
The point is that the storage hierarchy trick generalizes. If your inference workload has a sparse access pattern (and MoE guarantees that), you can run models dramatically larger than your available memory by treating fast storage as an extension of your compute memory. The same way databases have treated disk as an extension of RAM since the 1970s.
The practical implications show up at every scale:
Edge inference gets more capable. A 70B MoE model with aggressive quantization could run at usable speeds on phones with 16-24GB of RAM. Not the 400B party trick, but genuinely useful models for on-device tasks where privacy matters and a round trip to a cloud endpoint isn't acceptable.
Laptop inference is already there. 5.7 TPS on a 48GB MacBook Pro with a 397B model. Slow for interactive use, fine for background tasks: code review, document analysis, batch processing. No API costs. No network dependency. Full privacy.
Server inference could use this pattern to run fewer, larger models instead of many smaller ones. If your MoE model's hot expert set fits in RAM and your NVMe tier handles the cold experts, you might serve a 400B model from hardware that was specced for a 70B dense model.
The real lesson
The "400B model on a phone" headline makes people think about AI. I think about something more fundamental.
Every performance problem in computing comes down to the same hierarchy: register, cache, memory, storage, network. The art is figuring out which level your working set fits in and what your miss rate costs you. CPU architects figured this out in the 1960s. Database engineers figured it out in the 1970s. CDN engineers figured it out in the 2000s.
ML engineers are figuring it out now.
Flash-MoE didn't invent anything new. It applied a 60-year-old architecture pattern to a new domain and got a result that looked like magic. The best engineering usually does.