<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llm-Inference on Notes from the Rabbit Hole</title><link>https://magnus919.com/tags/llm-inference/</link><description>Recent content in Llm-Inference on Notes from the Rabbit Hole</description><generator>Hugo</generator><language>en</language><copyright>© [Magnus Hedemark](https://github.com/magnus919)</copyright><lastBuildDate>Wed, 27 May 2026 03:10:00 -0400</lastBuildDate><atom:link href="https://magnus919.com/tags/llm-inference/index.xml" rel="self" type="application/rss+xml"/><item><title>Running a 35B MoE Model on a 16GB Consumer GPU</title><link>https://magnus919.com/2026/05/running-a-35b-moe-model-on-a-16gb-consumer-gpu/</link><pubDate>Wed, 27 May 2026 03:10:00 -0400</pubDate><guid>https://magnus919.com/2026/05/running-a-35b-moe-model-on-a-16gb-consumer-gpu/</guid><description>&lt;p>A 35-billion-parameter model belongs in a datacenter. That&amp;rsquo;s the assumption. You need an H100, or two, or eight of them. A consumer GPU tops out at 16 GB of VRAM and you&amp;rsquo;re not fitting a 35B model in there. End of story.&lt;/p>
&lt;p>Except Qwen3.6-35B-A3B isn&amp;rsquo;t a normal 35B model. It&amp;rsquo;s a Mixture-of-Experts architecture: 35 billion parameters spread across 256 specialized expert modules, but only 8 of those experts activate on any given token. That&amp;rsquo;s roughly 3 billion active parameters per forward pass. The other 248 experts sleep.&lt;/p></description></item></channel></rss>