Why Local AI Is More Than Just Convenience
If you're sending sensitive data to a cloud API, you're trusting someone else's infrastructure with your most valuable information. For some tasks, that's fine. For others? It's not an option.
Local inference is the only solution when:
- You're working with legally privileged, HIPAA-regulated, or GDPR-sensitive material (compliance can't be outsourced)
- You need to protect unreleased code or proprietary workflows (no room for leaks or telemetry)
- You're operating in air-gapped environments (no network means no API calls)
- You want to avoid training consent issues (no feeding your prompts into someone else's training pipeline)
The good news? The tooling is already here. Ollama, LM Studio, and SillyTavern all support local workflows that keep your data off cloud model APIs when configured properly.
The open-weight frontier has closed the gap. The distance between open and proprietary models now sits at 30–40 ELO—smaller than ever before. And smaller than most benchmarks suggest, since self-reported benchmark scores run 4–13 points above independent evaluations.
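To put a 30–40 point gap in perspective, the classic Elo expected-score formula converts a rating difference into a head-to-head win rate. This is a minimal sketch that assumes Arena ratings behave like standard Elo on a 400-point scale, which is an approximation; the function name is illustrative.

```python
def elo_win_probability(rating_gap: float) -> float:
    """Expected win rate of the higher-rated model under the classic Elo formula."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


# Gaps mentioned in this guide: 11 (Gemma 4 26B-A4B vs 31B) and 30-40 (open vs proprietary).
for gap in (11, 30, 40):
    print(f"{gap:>3} ELO gap -> {elo_win_probability(gap):.1%} expected win rate")
```

Under that assumption, a 30–40 point lead means the stronger model is preferred in roughly 54–56% of head-to-head comparisons, which is why small ELO gaps rarely decide a pick on their own.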
The Problem With Benchmarks
Benchmarks are useful. They're also systematically misleading.
Here's what actually happens with published leaderboard scores:
- Models peak with specific system prompts and prompting tricks. Remove those (as you do in real use) and performance drops 10–20%.
- 60–70% of SWE-Bench Verified results are self-reported. Independent evaluations at vals.ai consistently land 4–13 points lower than provider claims.
- OpenAI abandoned SWE-Bench Verified on February 23, 2026, citing test contamination and design flaws, directing users to SWE-bench Pro instead.
- MMLU-Pro is saturated: the top 15 models cluster within 5.4 points of 85%. A 3-point gap means nothing in that range.
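As a rough sanity check, the 4–13 point discount described above can be applied to any provider-reported score to get a plausible independent range. This is a sketch, not a calibrated model: the helper name is hypothetical, and the example scores are the provider-reported SWE-Bench Verified figures quoted later in this guide.

```python
def independent_score_range(provider_score: float,
                            min_gap: float = 4.0,
                            max_gap: float = 13.0) -> tuple[float, float]:
    """Return (low, high) bounds for the likely independent score, in percentage points."""
    low = max(0.0, provider_score - max_gap)
    high = max(0.0, provider_score - min_gap)
    return (round(low, 1), round(high, 1))


# Provider-reported SWE-Bench Verified figures quoted in this guide.
for model, score in {"Qwen3.6-27B": 77.2, "Qwen3.6-35B-A3B": 73.4}.items():
    low, high = independent_score_range(score)
    print(f"{model}: provider {score}% -> plausibly {low}-{high}% independently")
```

The point is not the exact numbers but the width of the band: two models whose provider scores differ by a few points can easily overlap once the discount is applied.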
This guide wasn't built on leaderboards. It was built on real constraints:
- Can you actually run it?
- Does the tool support it?
- Does it fit your VRAM without swapping?
- Is the license compatible with your use?
Benchmarks are a reference. Not a verdict.
Your VRAM Budget Is the Real Constraint
Not your GPU model. Not your CUDA version. Your actual memory in gigabytes.
Every model in this guide has been verified: it loads, it runs, it doesn't crawl. Tool availability checked against official documentation.
Tier 1: Entry Level (8GB VRAM)
Target hardware: Laptops (RTX 4060 Mobile, RTX 3060) or capable integrated graphics.
Our picks:
- Qwen3.5-9B at Q4_K_M (~5.7 GB, ~35–40 tok/s on RTX 4060, Apache 2.0, HF). The best quality-to-VRAM ratio in this tier—no competition.
- Gemma 4 E4B at Q4_K_M (~5.4 GB, ~25–30 tok/s, Apache 2.0, HF). Multimodal (vision + audio + text). Dense model, 4.5B effective / 8B with embeddings. Unique at this size: no other model in the 4B class offers native audio + vision + text under Apache 2.0.
What doesn't work:
14B+ models at Q4_K_M. At 8GB, your practical context ceiling is 8K–16K tokens, not 32K. Some 14B models technically load in ~8–9GB—but they swap constantly and become unusable.
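A back-of-the-envelope estimate shows why 14B is out of reach at 8GB. The sketch below assumes Q4_K_M averages about 4.85 bits per weight and reserves 2 GB for KV cache and runtime buffers; both numbers are illustrative assumptions, and real GGUF files (including the sizes quoted in this guide) run slightly larger.

```python
Q4_K_M_BITS_PER_WEIGHT = 4.85  # assumed average; real GGUF files vary by architecture


def estimate_weights_gb(params_billion: float,
                        bits_per_weight: float = Q4_K_M_BITS_PER_WEIGHT) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, vram_gb: float, reserve_gb: float = 2.0) -> bool:
    """True if weights plus a reserve for KV cache and buffers fit in VRAM.

    reserve_gb is an illustrative allowance, not a measured value.
    """
    return estimate_weights_gb(params_billion) + reserve_gb <= vram_gb


for size in (9, 14):
    print(f"{size}B at Q4_K_M: ~{estimate_weights_gb(size):.1f} GB weights, "
          f"fits 8 GB with reserve: {fits(size, 8)}")
```

With these assumptions a 9B model leaves a little room for context, while a 14B model already exceeds 8 GB on weights alone.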
Our recommendation:
- General use + coding: Qwen3.5-9B (solid all-arounder for chat, short coding tasks, everyday use — best quality-to-VRAM ratio in this tier)
- Commercial work: Gemma 4 E4B (Apache 2.0, no strings attached — the default for enterprise and legally sensitive deployments)
- Multimodal + audio: Gemma 4 E4B (vision + audio + text under Apache 2.0 — unique at this size, no other 4B-class model offers this)
ollama run qwen3.5:9b
Tier 2: Capable Daily Driver (16GB VRAM)
Target hardware: Desktop GPUs (RTX 4060 Ti 16GB, RTX 4080 16GB), workstation GPUs (RTX A4000).
Our picks:
- Qwen3-14B at Q4_K_M (~9 GB, Apache 2.0, Arena ELO 1443, HF). The strongest all-arounder for chat, coding, and writing. Note that this is Qwen3, not Qwen3.5: Qwen3.5 tops out at 9B, and the 14B class comes from the Qwen3 family, released April 2025. Text-only, no multimodal.
- Phi-4 Reasoning 14B at Q4_K_M (~9.1 GB, MIT license, HF). Best math and STEM reasoning in this tier. GSM8K: 95.6%, HumanEval: 82.6%.
What doesn't work:
24GB+ models. Gemma 4 26B-A4B and Qwen3.6-27B both need 24GB minimum.
Our recommendation:
- Reasoning + math: Phi-4 Reasoning 14B (strongest chain-of-thought in this tier; explicitly trained for STEM problem-solving, GSM8K and HumanEval benchmarks reflect this specialization)
- All-around + writing: Qwen3-14B (Apache 2.0, best balance — chat, coding, and general writing without specialist tradeoffs; Qwen3 family, released April 2025, not Qwen3.5 which maxes at 9B)
⚠️ Phi-4 Reasoning isn't competition-math ready. AIME 2024: 12%. For serious math competition, jump to 24GB+.
ollama run phi4-reasoning:14b
ollama run qwen3:14b
Tier 3: The Sweet Spot (24GB VRAM)
Target hardware: RTX 4090 24GB, RTX 3090 24GB, Mac M4 Pro/Max (48GB+), RTX A5000.
This is where local inference becomes genuinely useful for real work: coding, multi-document processing, multimodal agents, agentic workflows.
If you're buying hardware for local AI, buy for this tier.
What runs reliably:
- Qwen3.6-27B at Q4_K_M (~17.5 GB file, ~19–21 GB with KV cache, HF). 262K context standard. Apache 2.0. Primary recommendation for this tier. Fits 24GB at moderate context; at full 262K KV cache, VRAM usage climbs toward 21GB, leaving tight but workable headroom. For headroom-focused users, IQ4_XS (~15 GB) is a safer pick.
- Qwen3.6-35B-A3B at Q4_K_M (~21–22 GB, HF). MoE: 35B total / 3B active per token. ~120–146 tok/s on RTX 4090. Fits 24GB at Q4_K_M with ~2–3 GB headroom. Apache 2.0.
- Gemma 4 26B-A4B at Q4_K_M (~14–17 GB, Arena ELO ~1441, HF). MoE: 3.8B active, 26B total. Only 11 ELO points behind Gemma 4 31B, but faster inference (3–6 tok/s vs 2–4 tok/s) and lower VRAM. The sweet spot for 24GB. 256K context. Apache 2.0.
- Gemma 4 31B at Q4_K_M (~20 GB, Arena ELO 1451, HF). Dense multimodal: text + image + video (up to 60s) + tool calling + thinking. 256K context. Apache 2.0.
⚠️ Gemma 4 31B VRAM caveat: Gemma 4's hybrid attention architecture (sliding-window + global) produces a KV cache 2–3× larger than comparable models. At 8K context, total VRAM is ~26 GB, exceeding 24GB. At 16K context, ~33 GB. "Fits 24GB" is only realistic at shorter contexts (under 8K) or with aggressive KV cache quantization. Gemma 4 26B-A4B is the more practical 24GB pick.
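The caveat is easy to reproduce with simple arithmetic. The sketch below adds the ~20 GB Q4_K_M weights to a KV cache sized from the ~0.85 MB/token figure quoted for Gemma 4 31B in the Qwen3.6-27B deep dive; it ignores activation and runtime overhead, so treat the results as approximate. The function name is illustrative.

```python
def total_vram_gb(weights_gb: float, context_tokens: int, kv_mb_per_token: float) -> float:
    """Weights plus KV cache, ignoring activation and runtime overhead."""
    kv_cache_gb = context_tokens * kv_mb_per_token / 1024.0
    return weights_gb + kv_cache_gb


GEMMA_4_31B_WEIGHTS_GB = 20.0       # Q4_K_M size quoted in this guide
GEMMA_4_31B_KV_MB_PER_TOKEN = 0.85  # per-token KV figure quoted in this guide

for context in (4096, 8192, 16384):
    total = total_vram_gb(GEMMA_4_31B_WEIGHTS_GB, context, GEMMA_4_31B_KV_MB_PER_TOKEN)
    verdict = "fits" if total <= 24.0 else "exceeds"
    print(f"{context:>6} tokens: ~{total:.1f} GB ({verdict} 24 GB)")
```

The estimates land within a gigabyte of the ~26 GB and ~33 GB figures above, and they show that only contexts well under 8K leave the model inside a 24GB budget.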
Benchmark reference (Arena AI / provider benchmarks, May 2026):
| Model | Arena ELO | SWE-Bench Verified | MMLU-Pro | License | VRAM (Q4_K_M) |
|---|---|---|---|---|---|
| Gemma 4 31B | 1451 | — | 85.2% | Apache 2.0 | ~20 GB + KV |
| Gemma 4 26B-A4B | ~1441 | — | — | Apache 2.0 | ~14–17 GB |
| Qwen3.6-27B | — | 77.2% (provider) | ~83% | Apache 2.0 | ~17.5 GB + KV |
| Qwen3.6-35B-A3B | — | 73.4% (provider) | ~84% | Apache 2.0 | ~21–22 GB |
Our picks for 24GB:
- Best all-around — first pick: Qwen3.6-27B (coding, agentic workflows, general writing — best VRAM-to-quality ratio, proven tooling, Apache 2.0; Qwen 3.6 is the coding benchmark leader at this size, beating Gemma 4 on HumanEval, SWE-Bench, MBPP, and LiveCodeBench per independent comparisons)
- Fast responses + long context: Qwen3.6-35B-A3B (MoE efficiency, 3–4× faster token generation than 27B, fits 24GB, Apache 2.0; particularly praised for thinking through complex refactoring tasks)
- Fastest Gemma 4 at 24GB: Gemma 4 26B-A4B (3–6 tok/s vs 2–4 tok/s for the 31B, only 11 ELO behind it, lower VRAM at ~14–17 GB)
- Multimodal (vision + tool calling): Gemma 4 31B — only model offering this at 24GB, with context caveat above
ollama run qwen3.6:27b # Primary — daily driver
ollama run qwen3.6:35b # MoE efficiency — fits 24GB at Q4_K_M
ollama run gemma4:26b-a4b # Sweet spot — fast + low VRAM
ollama run gemma4:31b # Multimodal (context caveat applies)
Tier 4: Professional and High-End (48GB+ VRAM)
Target hardware: 2× RTX 4090 24GB, workstation GPUs, Mac Studio M4 Max (128GB+), Mac Ultra (192GB+).
Our picks:
- DeepSeek V4 Flash (284B total / 13B active, MoE, Arena ELO 1432–1433, HF). 1M context. MIT license. Best reasoning model at this tier. Q4_K_M needs ~161–172 GB — requires multi-GPU or Q3_K_M quant (~127–136 GB) for 2×H100 80GB.
- MiniMax M2.7 (229B total / ~45B active, MoE, Arena ELO 1409, HF). 200K context. Modified MIT license. The only model at this capability level that runs on Apple Silicon M4 Max 128GB. Open weights confirmed released May 2026 (HuggingFace + ModelScope). FP8 weights are ~230 GB; community INT4 quants are ~115 GB, which is the size that fits in 128GB of unified memory.
Benchmark reference (Arena AI, May 2026):
| Model | Arena ELO | SWE-Bench Verified | License | VRAM (Q4/Q3) |
|---|---|---|---|---|
| DeepSeek V4 Flash | 1432–1433 | 79% | MIT | ~161–172 GB (Q4_K_M) / ~127–136 GB (Q3_K_M) |
| MiniMax M2.7 | 1409 | 56% (SWE-bench Pro) | Modified MIT | ~230 GB (FP8) / ~115 GB (INT4) |
Our picks for 48GB+:
- Best reasoning + agentic tasks: DeepSeek V4 Flash — 79% SWE-Bench, 1M context, MIT license, complex reasoning chains and math
- Apple Silicon path: MiniMax M2.7 — only model that fits unified memory at this capability level, now confirmed open weights
# DeepSeek V4 Flash — Q3_K_M on 2×H100 80GB
llama-server -hf deepseek-ai/DeepSeek-V4-Flash-GGUF --hf-file DeepSeek-V4-Flash-Q3_K_M.gguf -ngl 99 -c 4096 --split-mode layer --tensor-split 1,1
# MiniMax M2.7
ollama run minimax-m2.7
Tier 5: The Open-Weight Frontier (Reference)
These models set the ceiling. They're here for completeness. If you have the hardware, you already know. All verified on HuggingFace as of May 2026.
| Model | Architecture | Arena ELO | License | HuggingFace URL |
|---|---|---|---|---|
| GLM-5.1 | MoE 754B / ~40B | 1471 | MIT | HF |
| Kimi K2.6 | MoE 1T / 32B | 1465–1466 | Modified MIT | HF |
| DeepSeek V4 Pro | MoE 1.6T / 49B | 1459–1461 | MIT | HF |
| MiMo 2.5 Pro | MoE 1.02T / 42B | 1463–1465 | MIT | HF |
The standout:
GLM-5.1 sits at Arena ELO 1471, within ~30 ELO of the proprietary frontier, and is the best coding model by SWE-Bench Pro. MIT license. But it needs a workstation cluster (TP8). Kimi K2.6 and DeepSeek V4 Pro require datacenter hardware.
Deep Dive: Model Highlights
Qwen3.6-27B — The Daily Driver Champ
The community calls it the default for terminal coding in Claude Code and OpenCode — and the numbers back that up. Qwen 3.6 beats Gemma 4 on every coding benchmark that matters: HumanEval (+2.7 pts), SWE-Bench Verified (+3.7 pts), MBPP (+3.7 pts), LiveCodeBench (+2.5 pts). For writing, it leads technical work and retains themes best at 10K+ tokens. For agentic work, it has the best tool-use + planning combination with native support in OpenClaw. Apache 2.0 license, mature tooling, and VRAM efficiency — that's why it earned the position.
Qwen3.6-27B fits 24GB at moderate context: the ~17.5 GB file plus KV cache comes to roughly 19–21 GB, with the upper end reached near the full 262K context. Tight but workable. For headroom-focused users, IQ4_XS (~15 GB) is the safer pick. No native multimodal; use LM Studio for vision. Arena ELO not yet independently verified (released April 2026, too new for public leaderboard accumulation).
SWE-Bench Verified: 77.2% (provider-reported; independent vals.ai scores lower). KV cache efficiency: ~0.003 MB/token vs Gemma 4 31B's ~0.85 MB/token — dramatically more efficient.
Run it: ollama run qwen3.6:27b · Qwen/Qwen3.6-27B on HuggingFace
Qwen3.6-35B-A3B — The Speed Pick
If you want speed at 24GB, this is the pick. MoE activates only 3B parameters per token, so per-token compute is closer to a small dense model than a 35B one, and it runs 3–4× faster than Qwen3.6-27B (146 tok/s vs 45 tok/s on RTX 4090). All 35B weights still have to sit in memory, which is why the Q4_K_M file needs ~21–22 GB. r/LocalLLaMA praises it for thinking through complex refactoring tasks where dense models rush to completion. The quality tradeoff is real: SWE-bench -3.8 pts, Terminal-Bench -7.8 pts, SkillsBench -19.5 pts vs 27B. These are not equivalent picks: 27B is the quality choice, 35B-A3B is the speed choice.
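To see what the throughput gap means in practice, the sketch below converts the two quoted RTX 4090 figures into wall-clock generation time. It ignores prompt prefill and assumes throughput stays constant over the response, so the numbers are indicative only.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response, ignoring prompt prefill."""
    return output_tokens / tokens_per_second


DENSE_27B_TPS = 45.0  # Qwen3.6-27B on RTX 4090, as quoted above
MOE_35B_TPS = 146.0   # Qwen3.6-35B-A3B on RTX 4090, as quoted above

speedup = MOE_35B_TPS / DENSE_27B_TPS
print(f"Speedup: {speedup:.1f}x")
for tokens in (500, 2000):
    dense = generation_seconds(tokens, DENSE_27B_TPS)
    moe = generation_seconds(tokens, MOE_35B_TPS)
    print(f"{tokens} tokens: 27B ~{dense:.0f}s, 35B-A3B ~{moe:.0f}s")
```

For a 2,000-token answer that is the difference between waiting about three quarters of a minute and about a quarter of a minute, which matters most in interactive coding sessions.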
Fits 24GB at Q4_K_M with ~2–3 GB headroom. 262K context. Apache 2.0. Prefill is also 2–2.4× faster.
SWE-bench Verified: 73.4% (provider-reported). ~120–146 tok/s standard, ~186 tok/s with MTP.
Run it: ollama run qwen3.6:35b · Qwen/Qwen3.6-35B-A3B on HuggingFace
Gemma 4 26B-A4B — The 24GB Sweet Spot
Only 11 ELO points behind Gemma 4 31B, but running faster (3–6 tok/s vs 2–4 tok/s) and using less VRAM (~14–17 GB vs ~20 GB). The MoE architecture (3.8B active / 26B total) keeps it efficient. Community pick for LangChain tool-calling agents and creative writing where theme consistency through ~5K tokens matters. Multimodal: text, image, video up to 60s. Apache 2.0. 256K context.
For anything under 8K tokens where Qwen3.6-27B's tighter VRAM fit becomes a liability, this is the right call.
Arena ELO: ~1441. Inference speed: 3–6 tok/s on RTX 4090.
Run it: ollama run gemma4:26b-a4b · google/gemma-4-26B-A4B on HuggingFace
Gemma 4 31B — The Multimodal Option
The only 31B-class model with native multimodal support: text, image, video (up to 60s), tool calling, and configurable thinking depth (1–3 steps). Community favorite for LangChain where its native tool tokens simplify agentic pipelines. Apache 2.0, the same license as the Qwen3.6 models, keeps it a safe commercial choice for code-generation tools. AIME 2026: 89.2%.
But: Gemma 4's hybrid attention architecture (sliding-window + global) produces a KV cache 2–3× larger than comparable models. At 8K context, total VRAM is ~26 GB, exceeding 24GB. At 16K context, ~33 GB. "Fits 24GB" is only realistic at shorter contexts or with aggressive KV cache quantization. For most users, Gemma 4 26B-A4B is the more practical 24GB pick.
Arena ELO: 1451. MMLU-Pro: 85.2%.
Run it: ollama run gemma4:31b · google/gemma-4-31B on HuggingFace
DeepSeek V4 Flash — The Reasoning Pick
DeepSeek's explicit chain-of-thought produces the most auditable reasoning chains in the local model landscape. If you're debugging complex multi-step logic or need an agent that shows its work, this is the pick. 10M+ community downloads, dominant trust in local reasoning deployments. 79% SWE-Bench Verified. 1M context window. MIT license.
The catch: Q4_K_M needs ~161–172 GB, so you need the Q3_K_M quant (~127–136 GB) to run it on 2×H100 80GB. Multi-GPU required. This is a datacenter tier, not a consumer tier.
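The GPU count follows directly from the file sizes. This is a minimal sketch that divides the upper-bound size of each quant by the per-GPU memory; the optional `headroom_gb` parameter is a hypothetical allowance for KV cache and buffers, not a measured value.

```python
import math


def gpus_needed(model_gb: float, gpu_vram_gb: float = 80.0, headroom_gb: float = 0.0) -> int:
    """Minimum number of identical GPUs whose combined VRAM holds the model.

    headroom_gb is optional per-GPU space reserved for KV cache and buffers.
    """
    usable = gpu_vram_gb - headroom_gb
    return math.ceil(model_gb / usable)


# Upper-bound sizes quoted above for DeepSeek V4 Flash.
for quant, size_gb in {"Q4_K_M": 172.0, "Q3_K_M": 136.0}.items():
    print(f"{quant} (~{size_gb:.0f} GB): {gpus_needed(size_gb)} x 80 GB GPUs minimum")
```

Q3_K_M on two 80 GB cards leaves roughly 24–33 GB in total for KV cache and buffers, which is consistent with the short `-c 4096` context used in the run command.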
Arena ELO: 1432–1433.
Run it: huggingface-cli download deepseek-ai/DeepSeek-V4-Flash-GGUF DeepSeek-V4-Flash-Q3_K_M.gguf --local-dir ./ then llama-cli -m DeepSeek-V4-Flash-Q3_K_M.gguf -ngl 99 -c 4096 --split-mode layer --tensor-split 1,1 · deepseek-ai/DeepSeek-V4-Flash on HuggingFace
MiniMax M2.7 — The Apple Silicon Pick
The only model at this capability tier that runs on Apple Silicon M4 Max 128GB. 229B MoE with ~45B active parameters. Open weights confirmed released May 2026 on HuggingFace and ModelScope. FP8 weights are ~230GB, which does not fit in 128GB; the M4 Max path relies on the community INT4 quants (~115GB). Arena ELO 1409. Modified MIT: non-commercial by default, contact MiniMax for commercial licensing.
Community pick for creative writing and roleplay where MiniMax's less restrictive content filter matters — it's the practical choice at this tier for anything that touches adult themes. 200K context.
SWE-bench Pro: 56.22%. Terminal Bench 2: 57.0%.
Run it: ollama run minimax-m2.7 · MiniMaxAI/MiniMax-M2.7 on HuggingFace
How to Get Started
Ollama (recommended for most users)
# Install — https://ollama.com
# Run a model
ollama run qwen3.6:27b # Best all-around 24GB pick
ollama run qwen3.5:9b # Best 8GB pick
ollama run qwen3:14b # Best 16GB pick
ollama run phi4-reasoning:14b # Best for math
ollama run gemma4:26b-a4b # Best Gemma 4 for 24GB
ollama run gemma4:31b # Best multimodal (context caveat applies)
# List models already downloaded to this machine
ollama list
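Once a model is pulled, scripts can talk to it through Ollama's local HTTP API instead of the interactive prompt. The sketch below uses only the Python standard library and assumes the server is running on Ollama's default port (11434) with the model already downloaded; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call (requires the Ollama server running and the model already pulled):
# print(generate("qwen3.6:27b", "Explain KV cache in one sentence."))
```

Because the URL points at localhost, the prompt never leaves the machine; pointing the same code at a remote host would change that, so keep the endpoint under your control.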
LM Studio (desktop GUI, no CLI required)
- Download from lmstudio.ai
- Search model name in the browser
- Click Download
- Load with one click — runs locally, no data leaves your machine
SillyTavern (agentic workflows)
# Install SillyTavern — https://docs.sillytavern.ai
# Connect Ollama or LM Studio as backend
# Recommended for: roleplay, coding agents, long-session character memory
What to Avoid
Benchmark shopping. MMLU-Pro is saturated — a 3-point gap between models is noise, not signal. Pick based on VRAM fit, tooling support, and license.
The self-reported trap. 60–70% of SWE-Bench Verified results are published by the model provider. Independent evaluation at vals.ai consistently shows 4–13 points lower. Treat provider claims as ceiling, not floor.
Fake model names. Always verify the exact model name before downloading. Qwen3.5 goes up to 9B — there is no Qwen3.5-14B. Gemma 4 small sizes are E2B and E4B — there is no Gemma 4 9B. When in doubt, check the HuggingFace model card directly.
Custom license traps. Some "open source" models include use restrictions beyond standard open source licenses. MiniMax M2.7 has additional terms beyond standard MIT. Read the LICENSE file before deploying in a commercial product.
The Trust Checklist
Before you run any model locally:
- [ ] Does it fit my VRAM at the quant I want to use?
- [ ] Is the license compatible with my use case (commercial, internal, educational)?
- [ ] Is tooling mature for my use case (Ollama, LM Studio, SillyTavern)?
- [ ] Are independent benchmark scores available, not just provider claims?
- [ ] Have I read the license file for custom terms?
- [ ] Did I verify the exact model name on HuggingFace before downloading?
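For teams that evaluate several models, the checklist can be turned into a small gate that reports which items fail. This is one possible sketch: the class and field names are hypothetical, the two license items are folded into a single flag, and the example values are illustrative apart from the ~26 GB figure quoted for Gemma 4 31B at 8K context.

```python
from dataclasses import dataclass


@dataclass
class ModelCandidate:
    name: str
    required_vram_gb: float       # weights plus KV cache at the intended context
    license_ok: bool              # license file read and compatible with the use case
    tooling_ok: bool              # supported by the runtime you plan to use
    independent_benchmarks: bool  # scores available beyond provider claims
    name_verified: bool           # exact name confirmed on the model card


def failed_checks(candidate: ModelCandidate, vram_gb: float) -> list[str]:
    """Return the checklist items the candidate does not pass."""
    checks = {
        "fits VRAM": candidate.required_vram_gb <= vram_gb,
        "license compatible": candidate.license_ok,
        "tooling mature": candidate.tooling_ok,
        "independent benchmarks": candidate.independent_benchmarks,
        "model name verified": candidate.name_verified,
    }
    return [item for item, passed in checks.items() if not passed]


candidate = ModelCandidate("Gemma 4 31B", 26.0, True, True, True, True)
print(failed_checks(candidate, vram_gb=24.0))
```

An empty list means every item passed; anything else is a reason to stop and look before downloading.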
Final Recommendations
| VRAM Budget | First Pick | Second Pick | Notes |
|---|---|---|---|
| 8GB | Qwen3.5-9B (Apache 2.0) | Gemma 4 E4B (Apache 2.0) | Best quality-to-VRAM ratio; E4B has vision+audio |
| 16GB | Qwen3-14B (Apache 2.0) | Phi-4 Reasoning 14B (MIT) | Qwen3.5-14B doesn't exist — 14B class is Qwen3 |
| 24GB | Qwen3.6-27B (Apache 2.0) | Qwen3.6-35B-A3B (Apache 2.0) | 27B is quality pick; 35B-A3B is speed pick; Gemma 4 26B-A4B also strong |
| 48GB+ | DeepSeek V4 Flash (MIT) | MiniMax M2.7 (Modified MIT) | Flash for reasoning; M2.7 for Apple Silicon |
| Frontier | GLM-5.1 (MIT, ELO 1471) | Kimi K2.6 (Modified MIT) | Datacenter hardware required |
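The consumer rows of the table reduce to a simple lookup. The sketch below covers only the 8GB, 16GB, and 24GB tiers, because the larger tiers depend on the specific memory figures discussed in Tier 4 rather than on a single threshold; the function name is illustrative.

```python
from typing import Optional

# (minimum VRAM in GB, first pick, second pick), taken from the table above.
CONSUMER_TIERS = [
    (24, "Qwen3.6-27B", "Qwen3.6-35B-A3B"),
    (16, "Qwen3-14B", "Phi-4 Reasoning 14B"),
    (8, "Qwen3.5-9B", "Gemma 4 E4B"),
]


def recommend(vram_gb: float) -> Optional[tuple[str, str]]:
    """Return (first pick, second pick) for the largest tier that fits, or None below 8 GB."""
    for min_vram, first, second in CONSUMER_TIERS:
        if vram_gb >= min_vram:
            return (first, second)
    return None


for vram in (6, 12, 16, 24):
    print(f"{vram} GB -> {recommend(vram)}")
```

A 12GB card falls back to the 8GB picks, which matches the guide's advice to size for the memory you actually have rather than the next tier up.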
The bottom line:
Local AI is no longer a compromise. The open-weight gap to proprietary models has shrunk to roughly 30–40 ELO. For most users, the deciding factor isn't model quality; it's whether the model fits your VRAM and has mature tooling. Qwen3.6-27B at Q4_K_M is the sweet spot: it fits 24GB, has full tooling support, and comes with Apache 2.0. That's the answer for most people reading this guide.