Running a 35B MoE Model on a 16GB Consumer GPU

A 35-billion-parameter model belongs in a datacenter. That’s the assumption. You need an H100, or two, or eight of them. A mainstream consumer GPU tops out at 16 GB of VRAM and you’re not fitting a 35B model in there. End of story.

Except Qwen3.6-35B-A3B isn’t a normal 35B model. It’s a Mixture-of-Experts architecture: 35 billion parameters spread across 256 specialized expert modules, but only 8 routed experts, plus one shared expert, activate on any given token. That’s about 3 billion active parameters per pass. The other 248 routed experts sleep.

Three billion active parameters is well within reach of a consumer GPU. And with the right quantization, specifically the APEX Nano variant at 10.88 GB, the entire model fits on a single RTX 5070 Ti. All 256 experts, all 40 layers, the vision encoder. Everything on GPU. No CPU offloading, no PCIe round-trips, no compromises.

This is the guide I wish I’d had when I set this up. It covers picking the right quantization, configuring llama.cpp, and getting the server running as a persistent systemd service. No cloud API keys, no usage quotas, no telemetry. Just a 35B MoE model running locally at 121 tokens per second on a $750 GPU.

What Qwen3.6-35B-A3B Actually Is

Alibaba’s Qwen team released Qwen3.6 in May 2026. The 35B-A3B variant is the sweet spot for local inference. It’s a hybrid reasoning model, which means it supports both deep chain-of-thought thinking and direct instruct responses from the same weights. You control which mode you get per request.

The architecture is genuinely unusual:

40 layers, arranged in 10 repeating blocks of: 3 Gated DeltaNet layers, one MoE block, one Gated Attention layer, another MoE block
256 total experts, with 8 routed plus 1 shared activated per token
3 billion active parameters per forward pass, drawn from a 35 billion total pool
Native 262,144 token context window, extensible to 1 million via YaRN
Multimodal: vision input is supported natively with a projection file

The quant we’re using is the APEX I-Nano from mudler’s repo: Adaptive Precision for EXpert Models. Instead of applying the same bit-width to every weight, APEX uses different precision for different model components: higher precision for the shared dense layers, lower precision for the 256 expert FFNs that only activate occasionally. The result is a 10.88 GB file that preserves quality where it matters and sheds bits where it doesn’t. It also has Multi-Token Prediction heads baked in, which enables speculative decoding without a separate draft model. More on that later.

The Hardware

The build assumes an RTX 5070 Ti, but any GPU with 16 GB of VRAM should work. Here’s the reference setup:

Reference Hardware

GPU: NVIDIA GeForce RTX 5070 Ti (16,303 MiB VRAM)
RAM: 32 GB DDR4
CPU: Intel i7-14700F (28 threads)
OS: Ubuntu 24.04
CUDA: 12.9 (Blackwell SM_120)
llama.cpp: Latest build from source

The model runs entirely on GPU with this setup. The 32 GB of system RAM provides comfortable headroom for the OS and any concurrent workloads, but the model itself doesn’t need it. All 10.88 GB of weights live in VRAM.

Getting the Model

The GGUF you want is the APEX Nano quantization from mudler’s repo on Hugging Face. It’s 10.88 GB, small enough to fit the entire model on a 16 GB GPU:

Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-I-Nano.gguf

This mixed-precision scheme is the trick that makes the Nano variant work on a consumer card: the shared dense layers that every token passes through keep higher precision, while the expert FFNs that only activate occasionally get fewer bits.

You also need a vision projection file for multimodal input. The Carnice model doesn’t ship its own mmproj, but Qwen3.6 variants share the same vision encoder architecture. The AtomicChat mmproj works cross-compatibly:

mmproj-35B-F16.gguf (858 MB, from the same parent model family)

Download both into your models directory. Total disk footprint: about 11.7 GB.

Why Not the UDT Q3_K_XL?

Good question, because I started there. The UDT Q3_K_XL quantization is 16.5 GB and requires a GPU/CPU split (-ngl 24). That triggers a llama.cpp kernel limitation: the fused Gated DeltaNet optimization gets disabled when layers span both GPU and CPU, forcing all 30 dense backbone layers to run on CPU. GPU utilization sits at 14%. Generation runs at 33-42 tok/s.

The APEX Nano at 10.88 GB fits all 40 layers on GPU, bypassing that kernel limitation entirely. GPU utilization hits 50%. Generation runs at 121 tok/s. That’s the difference between “it technically works” and “it’s actually good.”

Building llama.cpp

Clone and build from source with CUDA support:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

The CMake build with CUDA enabled will compile the CUDA kernels for your GPU architecture. On Blackwell (RTX 50 series), this takes a few minutes.
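
If you’d rather not compile kernels for architectures you don’t own, CMake’s standard CMAKE_CUDA_ARCHITECTURES variable can narrow the build. This is a sketch rather than a required step: the helper name is mine, and it assumes your driver’s nvidia-smi supports the compute_cap query field. On an RTX 50-series card the result should correspond to the SM_120 noted in the hardware list.

```shell
#!/usr/bin/env bash
# Convert a compute capability such as "12.0" (read from stdin) into the
# form CMake expects ("120"). Helper name is illustrative.
cuda_arch_from_cap() {
    tr -d '.'
}

# Usage on the build machine (run from the llama.cpp checkout):
#   ARCH=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1 | cuda_arch_from_cap)
#   cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$ARCH"
#   cmake --build build --config Release -j"$(nproc)"
#   ./build/bin/llama-server --version   # confirms the binary was built
```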

The Magic Flags

Here’s the command that runs a 35B MoE model entirely on a 16 GB consumer GPU:

llama-server --metrics \
    --host 0.0.0.0 --port 8080 \
    --jinja \
    --flash-attn on \
    --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-n-min 1 \
    -ngl 40 \
    -c 32768 -cb -ctk q8_0 -ctv q8_0 \
    -t 16 -tb 16 -ub 1024 --prio 1 \
    -np 1 \
    -m /path/to/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-I-Nano.gguf \
    --mmproj /path/to/mmproj-35B-F16.gguf

Let’s walk through why each flag matters.

All Layers on GPU: `-ngl 40`

This is the headline. The APEX Nano quantization is 10.88 GB. The RTX 5070 Ti has 16 GB. That leaves about 5 GB for the KV cache, the vision projection matrix, and CUDA runtime overhead. It fits. All 40 layers live on the GPU, all 256 experts are GPU-resident, and there’s no CPU round-trip for expert routing.

This is what the UDT Q3_K_XL quantization couldn’t do at 16.5 GB. That version required -ngl 24 with --fit on, splitting layers between GPU and CPU. The split triggered a llama.cpp limitation where the fused Gated DeltaNet kernel gets disabled when layers span devices. With APEX Nano, there’s no split. The kernel stays enabled. Every layer is GPU-native.

MTP Speculative Decoding: `--spec-type draft-mtp`

Multi-Token Prediction lets the model predict 2 tokens ahead instead of one. The first token is the standard next-token prediction. The second is a draft: the model guesses what comes after that. The acceptance rate runs about 60%, translating to a roughly 1.6x effective speedup.

MTP heads are baked into the GGUF. No separate draft model required.
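
The 1.6x figure falls out of simple arithmetic. Here’s a rough sketch of it, under two assumptions the measurements above don’t confirm: each drafted token is accepted independently with the same probability, and running the MTP head costs nothing. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Expected tokens emitted per forward pass with speculative decoding.
# Simplified model (an assumption): each draft token is accepted with
# probability p, and a rejected draft discards every draft after it, so
#   tokens = 1 + p + p^2 + ... + p^n   for n draft tokens.
expected_tokens_per_pass() {
    local accept_rate="$1" n_draft="$2"
    awk -v p="$accept_rate" -v n="$n_draft" 'BEGIN {
        total = 0; term = 1
        for (k = 0; k <= n; k++) { total += term; term *= p }
        printf "%.2f\n", total
    }'
}

# One draft token at a 60% acceptance rate, as described above.
expected_tokens_per_pass 0.6 1   # prints 1.60
```

Real speedups land a little below this estimate, because the draft step is not free.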

KV Cache: `-c 32768 -ctk q8_0 -ctv q8_0`

The KV cache stores attention keys and values for every token in the context window. Quantizing it to 8-bit (q8_0) roughly halves the memory footprint relative to the default f16 cache (q8_0 packs 32 values into 34 bytes instead of 64), with negligible quality impact. At 32K context, the quantized cache consumes about 4 GB. Dropping the context from an earlier 64K setting to 32K freed enough VRAM to fit the vision projection matrix without sacrificing anything a local inference workload actually needs.

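Everything Else

--host 0.0.0.0 --port 8080: Listens on every network interface. Use --host 127.0.0.1 if only local clients need access.
--metrics: Exposes a Prometheus-compatible /metrics endpoint for monitoring.
--jinja: Uses the model’s built-in Jinja chat template, which handles the thinking/non-thinking switch.
--flash-attn on: Flash attention. Faster, less VRAM for long-context attention.
-cb: Continuous batching. Decodes requests from different slots together so a short approval check doesn’t queue behind a long generation. With -np 1 there is only one slot, so this pays off once you raise -np.
-np 1: Single slot. Enough for auxiliary agent tasks. The 2 GB of free VRAM could support a second slot if needed.
-t 16 -tb 16 -ub 1024 --prio 1: 16 threads for generation and for batch processing (the GPU does the real work), a micro-batch size of 1024 to keep the CUDA cores fed during prefill, and medium process priority.
--mmproj mmproj-35B-F16.gguf: Vision projection from the AtomicChat repo. Cross-compatible with Carnice since they share the same base architecture.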

Performance

Here’s what you can expect:

Generation: 121 tok/s (non-thinking mode, all layers on GPU)
Prompt Processing: 407 tok/s (during prefill, 16-thread dispatch)
GPU Utilization: ~50% SM (14.3 GB VRAM used, 2 GB free)

VRAM sits at about 14.3 GB used, leaving 2 GB free. That 14.3 GB already covers the weights, the KV cache, and the vision projection matrix, so nothing spills into system memory.
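
To check the same numbers on your own card, nvidia-smi can report used and total memory as CSV. The small helper below is my own illustration (not part of llama.cpp) and assumes a single GPU, so one line of output.

```shell
#!/usr/bin/env bash
# Print free VRAM in MiB from one "used, total" CSV line, as produced by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
vram_free_mib() {
    awk -F', *' '{ print $2 - $1 }'
}

# Usage on a machine with the NVIDIA driver installed, while the server is loaded:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_free_mib
```

If the free figure drops close to zero under load, lower -c before anything else.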

GPU utilization under load runs at about 50%. That’s not a problem. It’s the architecture working as designed. The MoE routing means 8 out of 256 experts activate per token, so the GPU is never going to saturate all its compute units the way a dense model would. The remaining 50% is headroom: room for a second concurrent request, room for vision processing alongside text generation, room for a larger batch size if you need it.

This utilization profile is a dramatic improvement over the GPU/CPU split approach. With the UDT Q3_K_XL quantization and -ngl 24, the fused Gated DeltaNet kernel was disabled and GPU utilization sat at 14%. The CPU handled all 30 dense backbone layers while the GPU waited. The move to APEX Nano eliminates that bottleneck entirely.

What This Setup Is Good For (and What It Isn’t)

Let’s be honest about the tradeoffs. You’re running a 35B model at 121 tok/s with a 32K context window on a $750 GPU. That’s genuinely impressive. It’s also not a ChatGPT replacement.

The 32K context window is the limiting factor, but the real constraint is more severe than it looks. In an agent harness like Hermes Agent, the system prompt alone (identity documents, memory, available tools, skills, conversation history) can consume 20,000 to 30,000 tokens before a single user message enters the context. The first actual prompt exhausts the window. Hermes Agent has a specific error for this: “max compression attempts (3) reached.” It’s not a bug. It’s the upstream model server’s context window being too small for the agent’s context overhead.

This server isn’t a main agent model. It can’t be. The math doesn’t work. But that’s not what it’s for.
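
Here is that math spelled out. The 20,000 and 30,000 token system prompts come from the range above; the 8,000 token reply budget is the thinking-mode floor recommended later in this guide. The function name is just for illustration.

```shell
#!/usr/bin/env bash
# Tokens left for the actual conversation once the fixed overhead is
# subtracted from the context window. A negative result means the request
# cannot fit at all.
remaining_context() {
    local n_ctx="$1" system_prompt="$2" reply_budget="$3"
    echo $(( n_ctx - system_prompt - reply_budget ))
}

remaining_context 32768 20000 8000   # prints 4768
remaining_context 32768 30000 8000   # prints -5232
```

Even in the best case, fewer than 5,000 tokens remain for the user’s messages and tool output.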

Where it shines is auxiliary agent tasks: the behind-the-scenes work that your primary AI assistant farms out to specialized models. That is the use case that justifies the hardware. Here’s what that looks like in practice.

Hermes Agent Auxiliaries

Hermes Agent can route specific tasks to secondary models, keeping the expensive frontier model free for the work that needs it. This server handles those secondary roles:

Vision analysis. When your agent needs to look at a screenshot, diagram, or photo, this model processes it locally. No sending images to a third-party API. No per-image charges. The mmproj handles it at about 1.1 GB of VRAM overhead per image, then releases the memory.
Approval checks. Before your agent takes a destructive action (deleting files, sending emails, pushing to production), it can ask this model for a quick safety check. 50 tokens, non-thinking mode, done in under a second.
Title generation. Bookmarking a conversation? Summarizing a session? The model generates concise, useful titles without burning credits on a frontier model.
Goal judging. Long-running agent tasks need periodic evaluation: “has the objective been met?” This model handles those judgment calls.

The cost math is straightforward. Frontier API models charge by the token. A single vision analysis on GPT-5 might cost a few cents. A hundred of them over a working session adds up. This server handles all of those for the price of electricity. At 50% GPU utilization, it can handle multiple concurrent auxiliary requests without breaking stride.

Code Completion in VS Code

The OpenAI-compatible API means any tool that speaks the chat completions protocol can use this server as a backend. Continue.dev and similar VS Code extensions let you point at a local endpoint instead of a cloud API. The non-thinking mode at 121 tok/s is fast enough for inline code suggestions: type a comment, hit tab, and the model fills in the implementation before your fingers leave the keyboard.

The 32K context is actually an advantage here. Code completion doesn’t need a 100K-token conversation history. It needs the current file, a few imports, and maybe the function signature. The model can hold your entire current module plus several dependencies in its context window with room to spare.

One-Shot Analysis and Batch Processing

The 32K window is plenty for structured, self-contained tasks: summarize this document, extract entities from this article, classify these support tickets, translate this technical spec. Fire a request, get a response, move on. No context to maintain between calls.

This is how LightRAG uses the server: entity extraction runs as individual requests, each one self-contained. The model doesn’t need to remember the previous document. It just needs to find the entities, relationships, and key claims in the text in front of it. 32K tokens is more than enough for that.
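
The same pattern is easy to reproduce by hand. The sketch below is not LightRAG’s code or prompt: the instruction text, the max_tokens value, and the docs/ directory are placeholders of mine. It assumes jq is installed, which takes care of JSON-escaping the document body.

```shell
#!/usr/bin/env bash
# Build one self-contained extraction request for a text file (requires jq).
extraction_payload() {
    local file="$1"
    jq -n --rawfile doc "$file" '{
        messages: [{
            role: "user",
            content: ("Extract the entities and relationships from this text:\n\n" + $doc)
        }],
        chat_template_kwargs: {enable_thinking: false},
        max_tokens: 1024
    }'
}

# Usage: one request per document, each response saved next to its input.
#   for f in docs/*.txt; do
#       extraction_payload "$f" |
#         curl -s http://localhost:8080/v1/chat/completions \
#              -H "Content-Type: application/json" -d @- > "${f%.txt}.json"
#   done
```

Each request carries its whole context, so a failed document can be retried without replaying anything else.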

Not a Chatbot

Here’s what this server won’t do well: extended multi-turn conversations where the model needs to track evolving context across dozens of exchanges. If you want a local ChatGPT alternative with long conversational memory, you want a model with a 128K+ context window, and you’ll need more VRAM or a smaller model. The 35B MoE with 32K context is the wrong tool for that job.

But for the auxiliary tasks, the one-shot analysis, the code completion, and the vision processing that would otherwise cost real money on every API call, it’s exactly right.

Thinking vs. Non-Thinking Modes

Qwen3.6 is a hybrid reasoning model. You control the mode per request via chat_template_kwargs:

# Thinking OFF (fast, direct answers)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "chat_template_kwargs": {"enable_thinking": false},
    "max_tokens": 100
  }'

# Thinking ON (deep reasoning)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain the CAP theorem"}],
    "max_tokens": 8000
  }'

Critical detail about thinking mode: the reasoning tokens count against your max_tokens budget. If you request 200 tokens with thinking enabled, the model can spend all 200 on its internal reasoning and return an empty content field. It ran out of budget before it could answer. For thinking-mode queries, set max_tokens to at least 8,000.
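
A client can catch this case instead of silently passing along an empty answer. The check below is a sketch that assumes jq is installed and that the response follows the usual OpenAI-style shape, with finish_reason set to "length" when the token limit was hit. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Classify a chat-completions response read from stdin (requires jq).
# Prints "starved" when generation stopped at the token limit with no
# visible answer, otherwise "ok".
check_starved() {
    jq -r '.choices[0]
           | if .finish_reason == "length" and ((.message.content // "") == "")
             then "starved" else "ok" end'
}

# Usage:
#   curl -s http://localhost:8080/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d '{"messages":[{"role":"user","content":"Explain the CAP theorem"}],"max_tokens":200}' |
#     check_starved
```

When it prints "starved", retry with a larger max_tokens or with thinking disabled.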

Recommended sampling parameters:

Thinking mode: temperature 1.0, top_p 0.95, top_k 20, presence_penalty 1.5
Instruct mode: temperature 0.7, top_p 0.8, top_k 20, presence_penalty 1.5
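
These presets can be sent with each request. The helper below is a sketch: it assumes the server accepts top_k and presence_penalty as top-level fields alongside the standard OpenAI ones, and the instruct-mode max_tokens of 200 is a placeholder of mine. It does no JSON escaping, so keep the prompt free of quotes and backslashes.

```shell
#!/usr/bin/env bash
# Build a chat-completions request body using the sampling presets above.
# mode: "thinking" or anything else for instruct (non-thinking) mode.
sampling_payload() {
    local mode="$1" prompt="$2"
    if [ "$mode" = "thinking" ]; then
        printf '{"messages":[{"role":"user","content":"%s"}],"temperature":1.0,"top_p":0.95,"top_k":20,"presence_penalty":1.5,"max_tokens":8000}\n' "$prompt"
    else
        printf '{"messages":[{"role":"user","content":"%s"}],"chat_template_kwargs":{"enable_thinking":false},"temperature":0.7,"top_p":0.8,"top_k":20,"presence_penalty":1.5,"max_tokens":200}\n' "$prompt"
    fi
}

# Usage against the server from this guide:
#   sampling_payload instruct "Suggest a title for a chat about GPU quantization." |
#     curl -s http://localhost:8080/v1/chat/completions \
#          -H "Content-Type: application/json" -d @-
```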

When to Use Each Mode

Non-thinking mode is the default for most auxiliary tasks. It’s fast, deterministic enough to be reliable, and doesn’t waste tokens on internal monologue. Use it for:

Vision analysis ("what’s in this screenshot?")
Approval checks ("is this command safe to run?")
Title generation and summarization
Code completion and inline suggestions
Entity extraction and structured data tasks

The moderate temperature (0.7) gives it enough variety to avoid sounding robotic without introducing the kind of randomness that makes agent tool calls unreliable.

Thinking mode is for tasks where the answer benefits from explicit step-by-step reasoning before the final output. Use it for:

Debugging a complex error where you want the model to walk through possible causes
Architecture decisions where tradeoffs need to be weighed
Code review on non-trivial changes where the model should explain its reasoning
Any question you’d ask a senior engineer and expect a five-minute answer, not a one-liner

The higher temperature (1.0) with presence_penalty (1.5) encourages the model to explore alternatives rather than committing to the first plausible answer. It’s slower. It burns more tokens. It’s also more likely to catch something the quick path would miss.

In practice, I run non-thinking mode for about 90% of auxiliary tasks. The thinking switch is there when I need it, but most agent work doesn’t benefit from the model showing its work. For code completion, approval checks, and vision processing, the direct path is the right path.

Vision Mode

The 35B model is natively multimodal. Images are supported via base64 data URLs in the standard OpenAI multimodal format. You don’t need a separate vision model or pipeline: the same server handles text and images.

The mmproj file bridges the vision encoder output into the language model’s embedding space. It’s 858 MB and loads once at startup. Image processing overhead is about 1.1 GB of extra VRAM while an image is being processed.

External HTTPS image URLs are not supported directly in this configuration, because llama.cpp was built without OpenSSL. Convert images to base64 data URLs before sending them. If you’re using Hermes Agent, the vision_analyze tool handles this automatically.
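
Doing the conversion by hand takes a few lines of shell. This is a minimal sketch: the function names, the default image/png MIME type, the max_tokens value, and screenshot.png are all placeholders of mine, and the prompt is inserted without JSON escaping.

```shell
#!/usr/bin/env bash
# Turn a local image file into a base64 data URL.
# Usage: image_data_url <file> [mime-type]
image_data_url() {
    local file="$1" mime="${2:-image/png}"
    # tr strips the line wrapping that some base64 implementations add.
    printf 'data:%s;base64,%s' "$mime" "$(base64 < "$file" | tr -d '\n')"
}

# Build a multimodal chat request with one text part and one image part.
vision_payload() {
    local file="$1" prompt="$2"
    printf '{"messages":[{"role":"user","content":[{"type":"text","text":"%s"},{"type":"image_url","image_url":{"url":"%s"}}]}],"chat_template_kwargs":{"enable_thinking":false},"max_tokens":300}\n' \
        "$prompt" "$(image_data_url "$file")"
}

# Usage:
#   vision_payload screenshot.png "What is in this screenshot?" |
#     curl -s http://localhost:8080/v1/chat/completions \
#          -H "Content-Type: application/json" -d @-
```

Piping the body into curl with -d @- keeps large images off the command line, where argument length limits would otherwise bite.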

Running as a Systemd Service

You don’t want to keep a terminal open indefinitely. Here’s the systemd user unit file (~/.config/systemd/user/llama-server.service):

[Unit]
Description=llama.cpp Server (Qwen3.6-35B-A3B APEX Nano)
After=network.target

[Service]
ExecStart=%h/llama.cpp/build/bin/llama-server --metrics \
    --host 0.0.0.0 --port 8080 \
    --jinja \
    --flash-attn on \
    --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-n-min 1 \
    -ngl 40 \
    -c 32768 -cb -ctk q8_0 -ctv q8_0 \
    -t 16 -tb 16 -ub 1024 --prio 1 \
    -np 1 \
    -m %h/models/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-I-Nano.gguf \
    --mmproj %h/models/mmproj-35B-F16.gguf
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target

Enable and start it:

systemctl --user daemon-reload
systemctl --user enable --now llama-server

The model takes about 30 seconds to load. Check the logs during startup:

journalctl --user -u llama-server -f

You’ll see the layer offloading summary, VRAM allocation, and finally the HTTP server binding to port 8080. Once the server is up, test it:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello in exactly three words."}],"max_tokens":20}'
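
Since the model needs a while to load, scripts that depend on the server should wait for it rather than fire immediately. The sketch below polls llama-server’s /health endpoint; the function name, retry count, and delay are arbitrary choices of mine.

```shell
#!/usr/bin/env bash
# Poll the server's /health endpoint until it answers with a success status.
# Usage: wait_for_server <base-url> [max-tries] [sleep-seconds]
wait_for_server() {
    local base_url="$1" tries="${2:-60}" delay="${3:-1}" i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl exit non-zero on HTTP errors, such as 503 while loading.
        if curl -sf --max-time 5 -o /dev/null "$base_url/health"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage:
#   wait_for_server http://localhost:8080 && echo "server is ready"
#
# Optional: user services normally start at login. To start this one at boot
# without an interactive session, enable lingering for your account:
#   loginctl enable-linger "$USER"
```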

Why This Matters

A year ago, running a 35B model at home meant accepting 5 tokens per second on CPU or spending thousands on a used A6000. Today a $750 consumer GPU serves one at 121 tokens per second with headroom to spare. The bottleneck wasn’t hardware capacity. It was quantization strategy.

The APEX Nano quantization is the key insight here. Uniform quantization treats every weight the same: a single bit-width, say 3.4 bits per parameter, applied across shared layers and expert FFNs alike. APEX recognizes that different model components have different sensitivity to precision loss. The shared dense layers that every token passes through get higher precision. The expert FFNs that activate only occasionally get lower precision. The result is a 10.88 GB model that preserves quality where it matters and sheds bits where it doesn’t.

MoE models are a different category of thing from the dense models most people think of when they hear “35B parameters.” The total parameter count is misleading. What matters for local inference is how many parameters fire per token (3 billion) and whether the quantization strategy lets you fit the whole model on GPU. With the right quantization, you can.

The practical upshot: a model that reasons at the level of frontier open-weight releases from late 2025, running on hardware you can buy at Micro Center. No API keys. No rate limits. No telemetry phoning home. The flags above will get you there. The rest is downloading a 10.88 GB GGUF and waiting 30 seconds for it to load.