By the RunAIatHome editorial team. Hardware notes are aligned with real VRAM thresholds and local inference constraints.
How to Run Llama 3 Locally: Complete Step-by-Step Guide
From zero to chatting with Llama 3 on your own hardware. Hardware requirements, installation, quantization explained, performance tuning, and troubleshooting.
1. What Is Llama 3?
Llama 3 is Meta's open-weight large language model family, released in 2024. It comes in several sizes — 8B, 70B, and 405B parameters — and is one of the most capable open models available for local use.
"Open-weight" means you can download the model files and run them on your own hardware. No API keys, no cloud subscription, no data leaving your machine. Your prompts, documents, and code stay private.
Llama 3 vs 3.1 vs 3.2 — Quick Summary
- Llama 3 (April 2024): Original release. 8B and 70B. 8K context window.
- Llama 3.1 (July 2024): Updated with 128K context window. Added 405B model. This is the version most people should use.
- Llama 3.2 (Sept 2024): Added smaller models (1B, 3B) and multimodal variants (11B, 90B with vision). The 1B/3B are great for very constrained hardware.
For this guide, we focus on Llama 3.1 8B and 70B — the two variants most useful for local deployment. The 8B runs on virtually any modern GPU; the 70B is the best open model you can realistically run at home.
2. Hardware Requirements
VRAM is the limiting factor. The table below shows exact requirements for each Llama 3.1 variant at different quantization levels. If a model does not fit in your GPU's VRAM, it spills to system RAM and runs 10-20x slower.
| Model | FP16 | Q8 | Q4_K_M | Min GPU (Q4) |
|---|---|---|---|---|
| Llama 3.2 3B | 6 GB | 3.2 GB | 2.0 GB | GTX 1060 6GB, any modern GPU |
| Llama 3.1 8B | 16 GB | 8.5 GB | 4.9 GB | RTX 3060 8GB, GTX 1080, RX 6700 XT |
| Llama 3.1 70B | 140 GB | 70 GB | 36 GB | RTX 3090 + offload, RTX 4090, A6000 |
| Llama 3.1 405B | 810 GB | 405 GB | ~220 GB | Multi-GPU server (not feasible at home) |
Not sure what fits? Use our VRAM Calculator — enter your GPU and the model, and get an exact fit/no-fit answer with headroom details.
System Requirements (Beyond GPU)
RAM: At least 16 GB for 8B models, 32 GB for 70B. The system needs RAM for Ollama overhead, KV cache, and OS processes even if the model is fully on GPU.
Storage: SSD strongly recommended. Llama 3.1 8B Q4 is ~4.7 GB download, 70B Q4 is ~40 GB. Models are stored in ~/.ollama/models/.
CPU: Any modern multi-core CPU works. The CPU handles prompt tokenization and scheduling — not a bottleneck for inference if the model fits on GPU.
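The fit check can be scripted. Below is a minimal sketch that parses the CSV output of `nvidia-smi` and applies the Q4_K_M thresholds from the table above; the helper functions (`parse_mib`, `models_that_fit`, `detect_and_report`) are illustrative names, not part of any library:

```python
import subprocess

# Q4_K_M VRAM requirements from the table above (GB)
Q4_REQUIREMENTS = {"Llama 3.2 3B": 2.0, "Llama 3.1 8B": 4.9, "Llama 3.1 70B": 36.0}

def parse_mib(csv_field: str) -> float:
    """Convert a field like '12288 MiB' (nvidia-smi CSV output) to GiB."""
    return float(csv_field.strip().split()[0]) / 1024

def models_that_fit(vram_gb: float) -> list[str]:
    """Return the Llama variants whose Q4_K_M weights fit in the given VRAM."""
    return [m for m, need in Q4_REQUIREMENTS.items() if need <= vram_gb]

def detect_and_report() -> list[str]:
    """Query the first GPU via nvidia-smi and list the variants that fit."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout
    return models_that_fit(parse_mib(out.splitlines()[0]))
```

A 12 GB card, for example, passes the 3B and 8B thresholds but not the 70B.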
Recommended GPUs for Llama 3 — Prices in Euros
If you are shopping for a GPU to run Llama 3 locally, the table below compares the most popular options by approximate European street price, inference speed, and the largest Llama 3 model each card runs comfortably.
| GPU | VRAM | Approx. price (€) | Speed (tok/s) | Use case |
|---|---|---|---|---|
| RTX 4090 | 24 GB | ~€1,999 | ~90 tok/s | Llama 3.1 70B (with partial offload) |
| RTX 3090 (used) | 24 GB | ~€799 | ~70 tok/s | Llama 3.1 70B (with partial offload) |
| RTX 4070 Ti Super | 16 GB | ~€699 | ~55 tok/s | Llama 3.1 8B at Q8 or FP16 |
| RTX 4070 Super | 12 GB | ~€549 | ~45 tok/s | Llama 3.1 8B with headroom |
| RTX 4060 Ti 16 GB | 16 GB | ~€399 | ~35 tok/s | Llama 3.1 8B at Q8, fully in VRAM |
| RTX 3080 (used) | 10 GB | ~€499 | ~50 tok/s | Llama 3.1 8B Q4 fully in VRAM; Q8 is tight |
| RTX 3060 | 12 GB | ~€269 | ~30 tok/s | Llama 3.1 8B Q4 — budget entry point |
Prices are indicative for the European market (April 2026). Speeds measured with Llama 3.1 8B Q4_K_M in Ollama. Used prices reflect marketplaces such as Wallapop, eBay.es, and Back Market.
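One way to compare these cards is price per GB of VRAM, since VRAM is the binding constraint for local inference. A quick sketch using a subset of the indicative prices above:

```python
# (gpu, vram_gb, price_eur) — indicative April 2026 prices from the table above
CARDS = [("RTX 4090", 24, 1999), ("RTX 3090 (used)", 24, 799),
         ("RTX 4060 Ti 16 GB", 16, 399), ("RTX 3060", 12, 269)]

# Sort by euros per GB of VRAM, cheapest VRAM first
for name, vram, price in sorted(CARDS, key=lambda c: c[2] / c[1]):
    print(f"{name}: €{price / vram:.0f} per GB of VRAM")
```

By this metric the budget cards win on value, while the RTX 4090 buys speed rather than cheap capacity.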
3. Quantization Explained (Q4_K_M, Q8, FP16)
Quantization is the process of reducing a model's numerical precision to make it smaller and faster. Understanding this is essential for local AI — it determines the tradeoff between quality and VRAM usage.
| Format | Bits/Param | VRAM (8B) | Quality Loss | When to Use |
|---|---|---|---|---|
| FP16 | 16 | ~16 GB | None (baseline) | If you have enough VRAM. Maximum quality. |
| Q8_0 | 8 | ~8.5 GB | Negligible (<1%) | Best balance if VRAM allows. Nearly indistinguishable from FP16. |
| Q4_K_M | 4.5 (avg) | ~4.9 GB | Minimal (1-3%) | The standard choice. Best quality-to-VRAM ratio. Recommended default. |
| Q4_K_S | 4.3 (avg) | ~4.6 GB | Small (2-4%) | Slightly smaller than K_M. Use if you need to squeeze a model in. |
| Q3_K_M | 3.5 (avg) | ~3.8 GB | Noticeable (5-8%) | Quality starts degrading. Only if you absolutely cannot fit Q4. |
| Q2_K | 2.6 (avg) | ~2.8 GB | Significant (10%+) | Not recommended. Severe quality loss. Use a smaller model at Q4 instead. |
What does "K_M" mean?
The "K" stands for k-quant, a quantization method from llama.cpp that assigns different bit widths to different layers based on their importance. "M" means medium — a balanced profile. "S" is small (more aggressive compression), "L" is large (less compression, higher quality). Q4_K_M is the community standard for a reason: it hits the sweet spot.
Rule of thumb: If your VRAM can fit Q8 — use Q8. If not, Q4_K_M. Never go below Q3 unless you have no other option. A smaller model at Q4 will almost always outperform a larger model at Q2.
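The sizes in the table follow directly from bits per parameter; here is a back-of-the-envelope sketch (weights only; real files and runtime VRAM run slightly higher because embeddings stay at higher precision and the runtime adds KV cache and overhead):

```python
def weight_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough weight footprint in GB: parameters (billions) * bits / 8 bits-per-byte."""
    return params_billion * bits_per_param / 8

# Llama 3.1 8B at common quantization levels (bits from the table above)
for fmt, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5)]:
    print(f"{fmt}: ~{weight_size_gb(8.0, bits):.1f} GB")
```

FP16 lands at ~16 GB and Q4_K_M at ~4.5 GB, consistent with the ~4.9 GB VRAM figure once overhead is included.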
4. Step-by-Step: Install Ollama and Run Llama 3
Step 1: Check Your VRAM
Before anything, verify how much VRAM you have.
NVIDIA (Linux/Windows):
```
nvidia-smi
```
AMD (Linux):
```
rocm-smi
```
macOS (Apple Silicon):
```
system_profiler SPDisplaysDataType | grep "VRAM\|Memory"
```
Or use the VRAM Calculator to check which Llama 3 variant your GPU can handle.
Step 2: Install Ollama
Linux (one-liner):
```
curl -fsSL https://ollama.com/install.sh | sh
```
macOS / Windows: download from ollama.com/download
Verify installation:
```
ollama --version
```
For detailed installation steps, see the Complete Ollama Guide.
Step 3: Pull Llama 3
Download the model that fits your hardware.
Llama 3.2 3B (minimum setup, ~2 GB download):
```
ollama pull llama3.2:3b
```
Llama 3.1 8B (recommended, ~4.7 GB download):
```
ollama pull llama3.1
```
Llama 3.1 70B (24+ GB VRAM, ~40 GB download):
```
ollama pull llama3.1:70b
```
Ollama defaults to Q4_K_M quantization, which is the best choice for most users. For Q8, pull llama3.1:8b-instruct-q8_0 instead.
Step 4: Run Llama 3
Start an interactive chat session:
```
ollama run llama3.1
```
You should see a prompt where you can type. Try something:
Example conversation:
>>> Explain the difference between TCP and UDP in two sentences.
TCP is a connection-oriented protocol that guarantees delivery and order of packets through acknowledgments and retransmission, making it reliable but slower. UDP is connectionless, sending packets without confirmation, which makes it faster and suitable for real-time applications like video streaming and gaming where occasional packet loss is acceptable.
Step 5: Verify GPU Usage
While the model is running, open another terminal and check that it is using your GPU:
```
nvidia-smi
```
You should see Ollama listed under "Processes" and VRAM usage matching the expected model size. If VRAM usage is near zero, the model is running on CPU — see the Troubleshooting section.
5. Performance Tuning
Out of the box, Ollama works well. But these tweaks can significantly improve speed and quality:
Context Length (num_ctx)
Llama 3.1 supports 128K context, but each token in the KV cache uses VRAM. More context = more VRAM. Default is typically 2048-4096.
Set context to 8192 tokens from the interactive CLI:
```
/set parameter num_ctx 8192
```
Or set it permanently via a Modelfile (PARAMETER num_ctx 8192). For 8K context on the 8B model, add ~1-2 GB VRAM overhead. For 32K context, add ~4-6 GB. Monitor with nvidia-smi while running.
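That overhead is the KV cache, and its size follows from the model architecture. A worked sketch for Llama 3.1 8B, which uses 32 layers, 8 KV heads (grouped-query attention), and a head dimension of 128, with the cache in FP16:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"8K context ≈ {gib:.1f} GiB of KV cache")  # ≈ 1.0 GiB
```

That is where the ~1-2 GB figure for 8K context comes from; quadrupling the context to 32K quadruples the cache.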
GPU Layer Offloading
If the model does not fully fit in VRAM, Ollama automatically offloads some layers to CPU. You can control this with the num_gpu parameter (the number of layers placed on the GPU).
Force all layers onto the GPU (the model fails to load if VRAM is insufficient):
```
/set parameter num_gpu 999
```
Force CPU-only inference (useful for testing or if the GPU is busy):
```
/set parameter num_gpu 0
```
Keep Model Loaded
By default, Ollama unloads models after 5 minutes of inactivity (to free VRAM). If you are using the model frequently:
Keep model loaded indefinitely:
```
OLLAMA_KEEP_ALIVE=-1 ollama serve
```
This eliminates the cold-start delay when you send a new prompt after a pause.
6. Benchmarks by GPU
Real-world token generation speed for Llama 3.1, measured with Ollama. These are eval throughput (tokens/sec during generation), not prompt processing speed.
Llama 3.1 8B (Q4_K_M)
| GPU | VRAM Used | Tokens/sec | Feel |
|---|---|---|---|
| RTX 4090 | ~5.2 GB | ~120 tok/s | Instant |
| RTX 4070 Ti Super | ~5.2 GB | ~85 tok/s | Instant |
| RTX 3090 | ~5.2 GB | ~95 tok/s | Instant |
| RTX 3060 12GB | ~5.2 GB | ~40 tok/s | Fast |
| RX 7900 XTX | ~5.2 GB | ~75 tok/s | Instant |
| M4 Max (unified) | ~5.2 GB | ~55 tok/s | Fast |
| CPU only (i7-13700K) | N/A (uses RAM) | ~8 tok/s | Slow but usable |
Llama 3.1 70B (Q4_K_M)
| GPU | VRAM Used | Tokens/sec | Notes |
|---|---|---|---|
| RTX 4090 (24 GB) | 24 GB + offload | ~20-25 tok/s | Partial CPU offload; still very usable |
| RTX 3090 (24 GB) | 24 GB + offload | ~15-18 tok/s | Partial CPU offload; conversational |
| A6000 (48 GB) | ~38 GB | ~35 tok/s | Fits fully in VRAM; no offload |
| M4 Max 128 GB | ~38 GB unified | ~18 tok/s | Fits fully in memory; bandwidth-limited |
| 2x RTX 3090 (48 GB) | ~38 GB across GPUs | ~25-30 tok/s | Requires llama.cpp multi-GPU; fits fully |
Benchmarks measured at 2048 context length. Longer context reduces throughput. See our Best GPUs for Local AI guide for detailed methodology.
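Throughput maps directly onto how long you wait for an answer: divide the response length by tokens per second. A quick sketch using speeds from the tables above:

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock generation time for a response of a given length."""
    return tokens / tok_per_s

# A ~300-token answer at speeds from the benchmark tables above
for setup, speed in [("RTX 4090, 8B", 120), ("RTX 3090, 70B", 16), ("CPU only, 8B", 8)]:
    print(f"{setup}: {seconds_for(300, speed):.1f} s")
```

At 120 tok/s a full paragraph appears in a couple of seconds; at 8 tok/s the same answer takes over half a minute, which is why CPU-only inference feels slow but usable.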
7. Advanced Usage
REST API
Ollama exposes a local API on port 11434. Use it from scripts, applications, or other tools:
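From Python, for instance, the generate endpoint can be called with nothing but the standard library. This is a minimal sketch (the function names are illustrative, and it assumes Ollama is serving on the default port):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming generate request body."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST the request and return the 'response' field of the JSON reply."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Usage (requires a running Ollama server):
# print(generate("llama3.1", "Explain TCP vs UDP in one sentence."))
```

The equivalent raw requests look like this: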
```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a Python function to calculate fibonacci numbers",
  "stream": false
}'
```
For chat-style conversations with history:
```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a CSV file in Python?"}
  ],
  "stream": false
}'
```
Custom Modelfile
Create a custom model with specific system prompts, temperature, and parameters:
Create a file called Modelfile:
```
FROM llama3.1
SYSTEM "You are a senior software engineer. Give concise, correct answers with code examples. Prefer Python unless asked otherwise."
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
```
Build and run:
```
ollama create my-coder -f Modelfile
ollama run my-coder
```
Running Multiple Models
Ollama can serve multiple models simultaneously if you have enough VRAM. Each model loaded adds its VRAM footprint. Monitor with:
List loaded models:
```
ollama ps
```
Unload a specific model:
```
ollama stop llama3.1
```
Web UI Options
Prefer a ChatGPT-like interface? These open-source UIs connect to Ollama's API:
- Open WebUI — Full-featured, supports chat history, file uploads, RAG. Install with Docker:
```
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
```
- Chatbox — Desktop app (Windows/Mac/Linux). Simple, fast, no Docker needed.
- Jan — Another desktop app with built-in model management. Good for beginners.
8. Troubleshooting
"Model is extremely slow (< 5 tok/s)"
Cause: The model is running on CPU instead of GPU, or partially offloaded to CPU RAM.
Fix:
- Run nvidia-smi and check whether Ollama appears in the process list.
- If it does not, your GPU driver may be outdated. Install the latest NVIDIA driver.
- If VRAM usage is low, the model may be too large for your GPU. Try a smaller quantization or model size.
- On Linux, ensure nvidia-container-toolkit is installed if using Docker.
"Error: model requires more memory than available"
Cause: VRAM is insufficient for the requested model.
Fix:
- Use a smaller model: llama3.1 instead of llama3.1:70b.
- Use a more aggressive quantization: pull the Q4_K_S variant instead of Q4_K_M.
- Close other VRAM-consuming apps (browsers with hardware acceleration, games, other models).
- Reduce the context length: /set parameter num_ctx 2048.
"Connection refused" on API calls
Cause: Ollama server is not running.
Fix:
- Start the server: ollama serve (runs in the foreground).
- On Linux, check the systemd service: systemctl status ollama.
- Verify it is listening: curl http://localhost:11434 — should return "Ollama is running".
"Model output is gibberish or low quality"
Cause: Likely using too aggressive quantization (Q2 or Q3), or temperature is too high.
Fix:
- Switch to Q4_K_M if you are using Q3 or lower. Quality improves dramatically.
- Lower the temperature: /set parameter temperature 0.7 in the Ollama CLI.
- Ensure you are using the instruct variant (the default in Ollama), not the base model.
"AMD GPU not detected"
Cause: ROCm not installed or GPU not supported.
Fix:
- Install ROCm: sudo apt install rocm-dev (Ubuntu/Debian).
- Verify with rocm-smi — your GPU should be listed.
- Check the ROCm compatibility list — not all AMD GPUs are supported.
- Note: AMD GPU support on Windows is limited for AI workloads; Linux is the better-supported path.
9. Frequently Asked Questions
Can I run Llama 3 on my PC?
Yes. Llama 3.1 8B at Q4_K_M requires only ~4.9 GB VRAM — any GPU with 6+ GB VRAM can run it. Even CPU-only inference works (just slower). Apple Silicon Macs with 8+ GB unified memory work too.
What is quantization and does it hurt quality?
Quantization reduces numerical precision to save VRAM. Q4_K_M loses 1-3% quality vs FP16 — most users cannot tell the difference. Q8 is virtually identical to FP16. Below Q4 (Q3, Q2), quality degrades noticeably. See the detailed breakdown above.
Llama 3 vs Llama 3.1 vs 3.2 — which should I use?
Llama 3.1 8B for most users — 128K context, best quality at this size. Llama 3.2 3B if you have very limited VRAM (<6 GB). Llama 3.1 70B if you have 24+ GB VRAM and want the best open-weight model available.
Is Ollama the best way to run Llama 3?
For getting started: yes. Ollama handles everything (download, quantization, GPU detection, API). For power users, llama.cpp gives more control over inference parameters. For serving multiple users, vLLM or TGI are better choices.
How does Llama 3 compare to ChatGPT?
Llama 3.1 70B is competitive with GPT-3.5 Turbo and approaches GPT-4 on many benchmarks. The 8B is more limited but surprisingly good for code, summarization, and general Q&A. The main advantage of local: privacy, zero cost per query, and offline access. The main disadvantage: no live internet access or tool use (without extra setup).
Recommended GPUs for Running Llama 3 Locally
Llama 3.1 8B Q4 requires ~5 GB VRAM; the 70B needs 24+ GB.
Prices and availability may change. Affiliate links.
Entry Tier (8–12 GB VRAM): RTX 4060 (8 GB), RTX 3060 (12 GB)
Mid Tier (12–16 GB VRAM): RTX 4060 Ti 16GB (16 GB), RTX 4070 (12 GB)
High Tier (24 GB VRAM): RTX 4090 (24 GB), RTX 3090 (24 GB)
Check Which Llama 3 Model Fits Your GPU
Enter your GPU model and see exactly which Llama variants fit — with VRAM headroom for different quantization levels.