How to serve Qwen3.6-35B-A3B, a Mixture-of-Experts model with 3B active parameters, on an RTX 5070 Ti using llama.cpp. Full config, performance numbers, and the flags that make it fit.
Running a 35B MoE Model on a 16GB Consumer GPU



A technology preview: using a DJI Mic 2 transmitter and free local transcription on Apple Silicon instead of a Plaud NotePin subscription. Better audio, word-level timestamps, no cloud, and a fraction of the cost.