
Run Gemma 4 locally — complete setup guide with Ollama

Google Gemma 4 is the latest open-weight model from Google DeepMind, available in 12B and 27B parameter variants. Both support vision, reasoning, and 256K context. This guide covers exact VRAM requirements, step-by-step Ollama installation, real benchmark data, and which GPUs can run it.

Alex Chen, AI Hardware Specialist
GitHub: github.com/javier-morales-ia

Disclosure: this article contains affiliate links. We may earn a commission from qualifying purchases at no cost to you. Prices and availability are subject to change.

What is Gemma 4?

Gemma 4 is Google DeepMind's April 2026 release of its open-weight model family. Built on the same architecture as Gemini, it comes in two sizes optimized for local inference: a 12B parameter model that fits on an 8 GB GPU at Q4 quantization, and a 27B parameter model for users with 16+ GB of VRAM.

Both variants are multimodal — they accept text and images as input — and support a 256,000-token context window, making them suitable for long-document analysis, coding with large codebases, and extended conversations. The license is Apache 2.0, which means full commercial use is permitted.

  • Parameters: 12B / 27B
  • Context length: 256K
  • License: Apache 2.0
  • Multimodal: vision (text and image input)

Gemma 4 VRAM requirements — exact numbers

The amount of VRAM you need depends on the model size and quantization level. Q4 (4-bit quantization) is the sweet spot for most users: relative to FP16 it cuts memory usage by roughly 75% with minimal quality loss.

Model          FP16       Q8         Q4         Q2
Gemma 4 12B    26.4 GB    13.2 GB    6.6 GB     3.3 GB
Gemma 4 27B    59.4 GB    29.7 GB    14.9 GB    7.4 GB

Key takeaway: Gemma 4 12B at Q4 fits on any 8 GB GPU (RTX 3060, RTX 4060, RX 7600). The 27B variant needs 16 GB (RTX 4060 Ti 16GB, RX 7800 XT) or Apple Silicon with 16+ GB unified memory. Use our VRAM Calculator to check your specific GPU.
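The figures in the table above are consistent with a simple rule of thumb: parameter count times bytes per weight, plus about 10% overhead for the runtime and KV cache at modest context lengths (very long contexts toward 256K need substantially more KV-cache memory). Here is that estimate as a quick Python sketch; it is our own approximation, not Ollama's internal accounting:

```python
# Rough VRAM estimator for quantized models (an approximation, not Ollama's math).
# Weight memory ≈ params × bits / 8; ~10% overhead covers runtime buffers and the
# KV cache at modest context lengths.

def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 0.10) -> float:
    """Return an approximate VRAM requirement in GB."""
    weights_gb = params_billion * bits / 8  # 1B params at 8-bit ≈ 1 GB
    return round(weights_gb * (1 + overhead), 1)

for name, params in [("Gemma 4 12B", 12), ("Gemma 4 27B", 27)]:
    for bits in (16, 8, 4, 2):
        print(f"{name} @ {bits}-bit: ~{estimate_vram_gb(params, bits)} GB")
```

Plugging in 12B at 4-bit gives 6.6 GB, matching the table; the same formula reproduces every cell above.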

Compatible GPUs for Gemma 4

39 GPUs in our database can run Gemma 4 12B at Q4 quantization. Here are the top picks by price tier.

Entry tier (under $350)

Best for running Gemma 4 12B at Q4 for daily chat, reasoning, and vision tasks.

RTX 3060 12GB

4.8 (1,400 reviews)

Pros

  • 12 GB VRAM — runs 12B with headroom
  • 30 tok/s on Llama 7B Q4
  • Best entry point for local AI
Check availability on Amazon

Enthusiast tier (for Gemma 4 27B)

You need 16+ GB VRAM for the 27B variant at Q4.

RTX 4060 Ti 16GB

4.5 (312 reviews)

Pros

  • 16 GB — runs 27B Q4 natively
  • Sweet spot for Gemma 4 27B
  • 165W TDP
Check availability on Amazon
RTX 4090 24GB

4.9 (1,204 reviews)

Pros

  • 24 GB — runs 27B Q8 comfortably
  • Top-tier inference speed
  • Future-proof
Check availability on Amazon

Prices and availability may change. Some links are affiliate links.

Install Gemma 4 with Ollama — step by step

Ollama is the fastest way to get Gemma 4 running locally. One command to install, one command to pull the model, and you are chatting.

Step 1 — Install Ollama

Available for Windows, macOS, and Linux. Download from ollama.com or use the terminal:

curl -fsSL https://ollama.com/install.sh | sh

Step 2 — Pull Gemma 4

Choose your variant based on available VRAM:

# 12B — needs 6.6 GB VRAM (Q4)
ollama pull gemma4:12b

# 27B — needs 14.9 GB VRAM (Q4)
ollama pull gemma4:27b

Step 3 — Start chatting

ollama run gemma4:12b

For a web UI, add Open WebUI:

docker run -d -p 3000:8080 ghcr.io/open-webui/open-webui:ollama
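Beyond the interactive CLI, the Ollama server also exposes an HTTP API on its default port 11434, which is handy for scripting. A minimal sketch using only the Python standard library, assuming Ollama is running and the 12B model has been pulled:

```python
# Minimal sketch: call the local Ollama HTTP API from Python instead of the CLI.
# Assumes Ollama is running on its default port (11434) with gemma4:12b pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # stream=False returns one JSON object instead of newline-delimited chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("gemma4:12b", "Explain Q4 quantization in one sentence."))
```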

Step 4 — Test vision (multimodal)

Gemma 4 accepts images. Send one via the API:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:12b",
  "prompt": "Describe this image in detail",
  "images": ["base64-encoded-image-here"]
}'
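The same vision request can be scripted from Python: read a local image, base64-encode it, and pass it in the API's "images" array. The file path below is only a placeholder for your own image:

```python
# Sketch of the vision request from Python: base64-encode a local image and pass
# it in the "images" array of Ollama's generate API. "photo.jpg" is a placeholder.
import base64
import json
import urllib.request

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def describe_image(path: str, model: str = "gemma4:12b") -> str:
    payload = {
        "model": model,
        "prompt": "Describe this image in detail",
        "images": [encode_image(path)],
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(describe_image("photo.jpg"))
```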

Gemma 4 performance benchmarks

Based on published benchmarks and community testing. Gemma 4 outperforms Gemma 3 across the board and competes with models 2-3x its size.

Model          Quality score   VRAM (Q4)   CPU tok/s   Context
Gemma 4 27B    90              14.9 GB     ~3          256K
Gemma 4 12B    86              6.6 GB      ~8          256K
Phi-4 14B      88              8.4 GB      ~6          16K
Llama 3.1 8B   78              5.0 GB      ~12         128K
Gemma 3 12B    82              7.1 GB      ~7          128K
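You can reproduce the tok/s column on your own hardware: Ollama's non-streaming API response includes eval_count (tokens generated) and eval_duration (nanoseconds), which give tokens per second directly. A small sketch, assuming a running Ollama with the model already pulled:

```python
# Measure generation speed locally via Ollama's timing fields. The final
# /api/generate response reports eval_count (tokens generated) and
# eval_duration (nanoseconds); dividing the two gives tokens per second.
import json
import urllib.request

def rate_from_response(data: dict) -> float:
    """Tokens per second from Ollama's timing fields."""
    return data["eval_count"] / (data["eval_duration"] / 1e9)

def benchmark(model: str, prompt: str = "Write a haiku about GPUs.") -> float:
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return rate_from_response(json.loads(resp.read()))

if __name__ == "__main__":
    print(f"gemma4:12b: {benchmark('gemma4:12b'):.1f} tok/s")
```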


Gemma 4 vs alternatives — which should you run?

Gemma 4 12B vs Phi-4 14B: Phi-4 scores slightly higher (88 vs 86) but has only a 16K context window. Gemma 4 wins on context (256K) and adds vision capability. If you need long documents or image input, Gemma 4 is the better choice. If you only need chat with short context, Phi-4 has a slight edge.

Gemma 4 12B vs Llama 3.1 8B: Llama 3.1 is lighter (5 GB VRAM) and faster on CPU but scores lower (78 vs 86). Gemma 4 is the upgrade path for anyone currently running Llama who wants better quality without jumping to 70B.

Gemma 4 27B vs Qwen2.5 14B: Gemma 4 27B needs more VRAM (14.9 vs 8.4 GB) but delivers higher quality (90 vs 84). If you have a 16 GB card, the 27B is worth the extra memory. If you are on 12 GB, stick with Qwen or the Gemma 4 12B.

Frequently asked questions

How much VRAM does Gemma 4 need?

Gemma 4 12B needs 6.6 GB of VRAM at Q4 quantization — any GPU with 8 GB or more runs it comfortably. Gemma 4 27B needs 14.9 GB at Q4, so you need a 16 GB card like the RTX 4060 Ti 16GB or an Apple M-series Mac with 16+ GB unified memory.

Can I run Gemma 4 on CPU only?

Yes. The 12B variant runs at approximately 8 tokens/second on an i7 CPU. Usable for testing, but a GPU with 8+ GB VRAM delivers 5-10x faster inference.
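If you want to test CPU-only inference while a GPU is present, Ollama accepts a per-request num_gpu option (the number of layers offloaded to the GPU), and setting it to 0 keeps the whole model on the CPU. A sketch, assuming your Ollama version supports this option:

```python
# Force CPU-only inference for benchmarking: num_gpu is the number of layers
# offloaded to the GPU, so 0 keeps the entire model on CPU. Assumes Ollama is
# running locally and supports the num_gpu request option.
import json
import urllib.request

def cpu_only_payload(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": 0},  # 0 GPU layers -> pure CPU inference
    }

if __name__ == "__main__":
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(cpu_only_payload("gemma4:12b", "Hello")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```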

What is the difference between Gemma 4 12B and 27B?

The 27B scores higher on reasoning and analysis (quality score 90 vs 86) but requires 14.9 GB VRAM at Q4 versus 6.6 GB. The 12B is the sweet spot for most consumer GPUs; the 27B is for enthusiasts with 16+ GB VRAM or Apple Silicon.
