Running a 35B MoE Model on a 16GB Consumer GPU

Wed, 27 May 2026 03:10:00 -0400

A 35-billion-parameter model belongs in a datacenter. That’s the assumption. You need an H100, or two, or eight of them. A typical consumer GPU has 16 GB of VRAM, and 35 billion parameters at 16-bit precision come to about 70 GB of weights alone. You’re not fitting that in there. End of story.

Except Qwen3.6-35B-A3B isn’t a normal 35B model. It’s a Mixture-of-Experts architecture: 35 billion parameters spread across 256 specialized expert modules, but only 8 of those experts activate on any given token. That’s roughly 3 billion active parameters per forward pass, which is the “A3B” in the name. The other 248 experts sleep.
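
To make the "8 of 256" idea concrete, here is a minimal sketch of top-k expert routing for a single token. It is an illustration under assumptions, not Qwen's actual router: the function name `route_token` is hypothetical, the random logits stand in for the output of a learned router, and taking the softmax over only the selected experts is one common variant rather than a confirmed detail of this model.

```python
import math
import random

NUM_EXPERTS = 256  # expert count quoted in the text
TOP_K = 8          # experts activated per token, as quoted in the text


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Hypothetical helper for illustration; a real router is a learned layer.
    """
    # Indices of the top_k largest logits.
    top = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:top_k]
    # Softmax over the selected logits only, so the weights sum to 1
    # (an assumed normalization, not confirmed for this model).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
# Stand-in router scores for one token: one score per expert.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(logits)
idle = NUM_EXPERTS - len(selected)
print(f"active experts: {len(selected)}, idle experts: {idle}")
```

Only the experts in `selected` would run their feed-forward computation for this token. The remaining 248 still have to be stored somewhere, but they cost no compute on this pass.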

