Question 1

How much VRAM does Llama 3.1 8B need?

Accepted Answer

Llama 3.1 8B requires about 5 GB of VRAM at Q4 quantization, 8 GB at Q8, or 16 GB at full FP16 precision. For most users, Q4 (5 GB) offers the best balance of quality and memory usage. An RTX 3060 (12 GB), or any GPU with 6 GB or more, can run the Q4 version comfortably.

Question 2

Can an RTX 3060 run AI locally?

Accepted Answer

Yes. The RTX 3060 has 12 GB of VRAM, which is enough to run many popular AI models. It can run Llama 3.1 8B at Q8 (8 GB), Mistral 7B at Q8 (around 7 GB), DeepSeek R1 Distill 8B at Q8 (9.6 GB), and most 7-8B models at Q8 or lower. It struggles with 13B+ models at Q8 but can handle 13B at Q4.

Question 3

What quantization should I use for 8GB VRAM?

Accepted Answer

With 8 GB VRAM, use Q4 quantization. This lets you run Llama 3.1 8B (5 GB at Q4), Mistral 7B (~4.5 GB at Q4), DeepSeek R1 Distill 8B (4.8 GB at Q4), and Phi-3 Mini (under 4 GB). Q4 provides good quality with roughly 10-15% quality reduction compared to full precision. Avoid Q8 for 8B+ models as they will often exceed your VRAM.
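
The choice between Q4 and Q8 can be sketched as a simple fit check. The snippet below uses the Llama 3.1 8B figures quoted on this page (5 GB at Q4, 8 GB at Q8, 16 GB at FP16). The function name `best_quant` and the 1 GB headroom reserved for context and other GPU usage are illustrative assumptions, not measured values.

```python
# VRAM figures (GB) quoted on this page for Llama 3.1 8B.
LLAMA_3_1_8B = {"Q4": 5.0, "Q8": 8.0, "FP16": 16.0}

# Precision levels, highest quality first.
PRECISION_ORDER = ("FP16", "Q8", "Q4")


def best_quant(sizes_gb, vram_gb, headroom_gb=1.0):
    """Return the highest-precision level that fits in vram_gb, or None.

    headroom_gb is an assumed reserve for the context (KV cache) and other
    GPU usage; 1 GB is an illustrative value, not a measurement.
    """
    for level in PRECISION_ORDER:
        if level in sizes_gb and sizes_gb[level] + headroom_gb <= vram_gb:
            return level
    return None


if __name__ == "__main__":
    for vram in (4, 8, 12):
        print(f"{vram} GB VRAM -> {best_quant(LLAMA_3_1_8B, vram)}")
```

With the assumed 1 GB reserve, Q8 (8 GB) no longer fits an 8 GB card, which is why Q4 is the safer pick there, while a 12 GB card has room for Q8.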

Question 4

How much VRAM does DeepSeek R1 need?

Accepted Answer

DeepSeek R1 (the full 671B model) requires about 403 GB of VRAM at Q4. That calls for a server-grade multi-GPU cluster and is not practical for home use. For consumer hardware, use the DeepSeek R1 Distill models instead: at Q4, the 8B distill needs 4.8 GB of VRAM, the 14B distill needs 8.4 GB, and the 32B distill needs 19.2 GB.
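
The DeepSeek figures in this answer are consistent with a simple rule of thumb: parameter count (in billions) times bytes per weight, plus roughly 20% overhead. That 1.2 factor is inferred from the numbers quoted here and is an assumption, not a documented formula, and `estimate_vram_gb` is a hypothetical helper name. A minimal sketch:

```python
# Bits stored per weight at each precision level.
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8, "Q4": 4, "Q2": 2}


def estimate_vram_gb(params_billion, quant, overhead=1.2):
    """Rough VRAM estimate in GB: weight size times an overhead factor.

    The default overhead of 1.2 is an assumption inferred from the figures
    quoted in this answer, not an official formula.
    """
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    return round(params_billion * bytes_per_weight * overhead, 1)


if __name__ == "__main__":
    for size in (8, 14, 32, 671):
        print(f"{size}B at Q4: {estimate_vram_gb(size, 'Q4')} GB")
```

Treat the result as a starting estimate only. Actual usage also depends on context length and the runtime, and other figures on this page (for example Llama 3.1 8B at FP16) do not follow this exact factor.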

Question 5

What is quantization and how does it affect AI model performance?

Accepted Answer

Quantization reduces the precision of model weights so the model uses less memory. FP16 (16-bit) is the unquantized baseline. Q8 (8-bit) halves memory relative to FP16 with minimal quality loss. Q4 (4-bit) cuts memory to a quarter with a 5-10% quality reduction, which makes it the sweet spot for most users. Q2 (2-bit) minimizes memory but may noticeably reduce output quality. Lower precision usually also speeds up inference, because fewer bytes have to be read from memory for each generated token.
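
To make the precision trade-off concrete, here is a simplified sketch of symmetric uniform quantization: weights are mapped to small signed integers with one shared scale, then mapped back. Real formats such as the Q4 and Q8 variants used by local inference tools typically use per-block scales and other refinements, so this illustrates the idea rather than any specific format. The sample weights and function names are made up for the example.

```python
def quantize_roundtrip(weights, bits):
    """Quantize floats to signed `bits`-bit integers and convert them back."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)  # all-zero input: nothing to quantize
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up sample weights, for illustration only.
weights = [0.42, -0.17, 0.03, -0.88, 0.61, 0.09, -0.33, 0.75]

if __name__ == "__main__":
    for bits in (8, 4, 2):
        restored = quantize_roundtrip(weights, bits)
        print(f"{bits}-bit mean abs error: {mean_abs_error(weights, restored):.4f}")
```

On this sample the reconstruction error grows as the bit width shrinks, which mirrors the quality ordering described above: Q8 is nearly lossless, Q4 loses a little, and Q2 loses a lot.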

GPU / VRAM	Compatible models (quantization level)
RTX 3060 (12 GB)	Llama 3.1 8B (Q8), Mistral 7B (Q8), Phi-4 14B (Q4), DeepSeek R1 Distill 8B (Q8)
RTX 4060 Ti (16 GB)	Qwen2.5 14B (Q8), Gemma 3 12B (Q8), Mistral Small 24B (Q4)
RTX 4090 (24 GB)	Llama 3.3 70B (Q2), Qwen2.5 32B (Q4), Mixtral 8x7B (Q4)
