By the RunAIatHome editorial team. Hardware notes are aligned with real VRAM thresholds and local inference constraints.
How to Run Llama 3 Locally: Complete Step-by-Step Guide
From zero to chatting with Llama 3 on your own hardware. Hardware requirements, installation, quantization explained, performance tuning, and troubleshooting.
1. What Is Llama 3?
Llama 3 is Meta's open-weight large language model family, released in 2024. It comes in several sizes — 8B, 70B, and 405B parameters — and is one of the most capable open models available for local use.
"Open-weight" means you can download the model files and run them on your own hardware. No API keys, no cloud subscription, no data leaving your machine. Your prompts, documents, and code stay private.
Llama 3 vs 3.1 vs 3.2 — Quick Summary
- Llama 3 (April 2024): Original release. 8B and 70B. 8K context window.
- Llama 3.1 (July 2024): Updated with 128K context window. Added 405B model. This is the version most people should use.
- Llama 3.2 (Sept 2024): Added smaller models (1B, 3B) and multimodal variants (11B, 90B with vision). The 1B/3B are great for very constrained hardware.
For this guide, we focus on Llama 3.1 8B and 70B — the two variants most useful for local deployment. The 8B runs on virtually any modern GPU; the 70B is the best open model you can realistically run at home.
2. Hardware Requirements
VRAM is the limiting factor. The table below shows exact requirements for each Llama 3.1 variant at different quantization levels. If a model does not fit in your GPU's VRAM, it spills to system RAM and runs 10-20x slower.
| Model | FP16 | Q8 | Q4_K_M | Min GPU (Q4) |
|---|---|---|---|---|
| Llama 3.2 3B | 6 GB | 3.2 GB | 2.0 GB | GTX 1060 6GB, any modern GPU |
| Llama 3.1 8B | 16 GB | 8.5 GB | 4.9 GB | RTX 3060 8GB, GTX 1080, RX 6700 XT |
| Llama 3.1 70B | 140 GB | 70 GB | 36 GB | RTX 3090 + offload, RTX 4090, A6000 |
| Llama 3.1 405B | 810 GB | 405 GB | ~220 GB | Multi-GPU server (not feasible at home) |
Not sure what fits? Use our VRAM Calculator — enter your GPU and the model, and get an exact fit/no-fit answer with headroom details.
System Requirements (Beyond GPU)
RAM: At least 16 GB for 8B models, 32 GB for 70B. The system needs RAM for Ollama overhead, KV cache, and OS processes even if the model is fully on GPU.
Storage: SSD strongly recommended. Llama 3.1 8B Q4 is ~4.7 GB download, 70B Q4 is ~40 GB. Models are stored in ~/.ollama/models/.
CPU: Any modern multi-core CPU works. The CPU handles prompt tokenization and scheduling — not a bottleneck for inference if the model fits on GPU.
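The fit check can be scripted. Below is a minimal sketch that parses the CSV output of `nvidia-smi` and applies the Q4_K_M thresholds from the table above; the helper functions (`parse_mib`, `models_that_fit`, `detect_and_report`) are illustrative names, not part of any library:

```python
import subprocess

# Q4_K_M VRAM requirements from the table above (GB)
Q4_REQUIREMENTS = {"Llama 3.2 3B": 2.0, "Llama 3.1 8B": 4.9, "Llama 3.1 70B": 36.0}

def parse_mib(csv_field: str) -> float:
    """Convert a field like '12288 MiB' (nvidia-smi CSV output) to GiB."""
    return float(csv_field.strip().split()[0]) / 1024

def models_that_fit(vram_gb: float) -> list[str]:
    """Return the Llama variants whose Q4_K_M weights fit in the given VRAM."""
    return [m for m, need in Q4_REQUIREMENTS.items() if need <= vram_gb]

def detect_and_report() -> list[str]:
    """Query the first GPU via nvidia-smi and list the variants that fit."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout
    return models_that_fit(parse_mib(out.splitlines()[0]))
```

A 12 GB card, for example, passes the 3B and 8B thresholds but not the 70B.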
Recommended GPUs for Llama 3 — Prices in Euros
If you are shopping for a GPU to run Llama 3 locally, the table below compares the most popular options by approximate European street price, inference speed, and the largest Llama 3 model each card runs comfortably.
| GPU | VRAM | Approx. price (€) | Speed (tok/s) | Use case |
|---|---|---|---|---|
| RTX 4090 | 24 GB | ~€1,999 | ~90 tok/s | Llama 3.1 70B (with partial offload) |
| RTX 3090 (used) | 24 GB | ~€799 | ~70 tok/s | Llama 3.1 70B (with partial offload) |
| RTX 4070 Ti Super | 16 GB | ~€699 | ~55 tok/s | Llama 3.1 8B at Q8 or FP16 |
| RTX 4070 Super | 12 GB | ~€549 | ~45 tok/s | Llama 3.1 8B with headroom |
| RTX 4060 Ti 16 GB | 16 GB | ~€399 | ~35 tok/s | Llama 3.1 8B at Q8, fully in VRAM |
| RTX 3080 (used) | 10 GB | ~€499 | ~50 tok/s | Llama 3.1 8B Q4 fully in VRAM; Q8 is tight |
| RTX 3060 | 12 GB | ~€269 | ~30 tok/s | Llama 3.1 8B Q4 — budget entry point |
Prices are indicative for the European market (April 2026). Speeds measured with Llama 3.1 8B Q4_K_M in Ollama. Used prices reflect marketplaces such as Wallapop, eBay.es, and Back Market.
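One way to compare these cards is price per GB of VRAM, since VRAM is the binding constraint for local inference. A quick sketch using a subset of the indicative prices above:

```python
# (gpu, vram_gb, price_eur) — indicative April 2026 prices from the table above
CARDS = [("RTX 4090", 24, 1999), ("RTX 3090 (used)", 24, 799),
         ("RTX 4060 Ti 16 GB", 16, 399), ("RTX 3060", 12, 269)]

# Sort by euros per GB of VRAM, cheapest VRAM first
for name, vram, price in sorted(CARDS, key=lambda c: c[2] / c[1]):
    print(f"{name}: €{price / vram:.0f} per GB of VRAM")
```

By this metric the budget cards win on value, while the RTX 4090 buys speed rather than cheap capacity.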
3. Quantization Explained (Q4_K_M, Q8, FP16)
Quantization is the process of reducing a model's numerical precision to make it smaller and faster. Understanding this is essential for local AI — it determines the tradeoff between quality and VRAM usage.
| Format | Bits/Param | VRAM (8B) | Quality Loss | When to Use |
|---|---|---|---|---|
| FP16 | 16 | ~16 GB | None (baseline) | If you have enough VRAM. Maximum quality. |
| Q8_0 | 8 | ~8.5 GB | Negligible (<1%) | Best balance if VRAM allows. Nearly indistinguishable from FP16. |
| Q4_K_M | 4.5 (avg) | ~4.9 GB | Minimal (1-3%) | The standard choice. Best quality-to-VRAM ratio. Recommended default. |
| Q4_K_S | 4.3 (avg) | ~4.6 GB | Small (2-4%) | Slightly smaller than K_M. Use if you need to squeeze a model in. |
| Q3_K_M | 3.5 (avg) | ~3.8 GB | Noticeable (5-8%) | Quality starts degrading. Only if you absolutely cannot fit Q4. |
| Q2_K | 2.6 (avg) | ~2.8 GB | Significant (10%+) | Not recommended. Severe quality loss. Use a smaller model at Q4 instead. |
What does "K_M" mean?
The "K" stands for k-quant, a quantization method from llama.cpp that assigns different bit widths to different layers based on their importance. "M" means medium — a balanced profile. "S" is small (more aggressive compression), "L" is large (less compression, higher quality). Q4_K_M is the community standard for a reason: it hits the sweet spot.
Rule of thumb: If your VRAM can fit Q8 — use Q8. If not, Q4_K_M. Never go below Q3 unless you have no other option. A smaller model at Q4 will almost always outperform a larger model at Q2.
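The sizes in the table follow directly from bits per parameter; here is a back-of-the-envelope sketch (weights only; real files and runtime VRAM run slightly higher because embeddings stay at higher precision and the runtime adds KV cache and overhead):

```python
def weight_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough weight footprint in GB: parameters (billions) * bits / 8 bits-per-byte."""
    return params_billion * bits_per_param / 8

# Llama 3.1 8B at common quantization levels (bits from the table above)
for fmt, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5)]:
    print(f"{fmt}: ~{weight_size_gb(8.0, bits):.1f} GB")
```

FP16 lands at ~16 GB and Q4_K_M at ~4.5 GB, consistent with the ~4.9 GB VRAM figure once overhead is included.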
4. Step-by-Step: Install Ollama and Run Llama 3
Step 1: Check Your VRAM
Before anything, verify how much VRAM you have.
NVIDIA (Linux/Windows):
```
nvidia-smi
```
AMD (Linux):
```
rocm-smi
```
macOS (Apple Silicon):
```
system_profiler SPDisplaysDataType | grep "VRAM\|Memory"
```
Or use the VRAM Calculator to check which Llama 3 variant your GPU can handle.
Step 2: Install Ollama
Linux (one-liner):
```
curl -fsSL https://ollama.com/install.sh | sh
```
macOS / Windows: download from ollama.com/download
Verify installation:
```
ollama --version
```
For detailed installation steps, see the Complete Ollama Guide.
Step 3: Pull Llama 3
Download the model that fits your hardware.
Llama 3.2 3B (minimum setup, ~2 GB download):
```
ollama pull llama3.2:3b
```
Llama 3.1 8B (recommended, ~4.7 GB download):
```
ollama pull llama3.1
```
Llama 3.1 70B (24+ GB VRAM, ~40 GB download):
```
ollama pull llama3.1:70b
```
Ollama defaults to Q4_K_M quantization, which is the best choice for most users. For Q8, pull llama3.1:8b-instruct-q8_0 instead.
Step 4: Run Llama 3
Start an interactive chat session:
```
ollama run llama3.1
```
You should see a prompt where you can type. Try something:
Example conversation:
>>> Explain the difference between TCP and UDP in two sentences.
TCP is a connection-oriented protocol that guarantees delivery and order of packets through acknowledgments and retransmission, making it reliable but slower. UDP is connectionless, sending packets without confirmation, which makes it faster and suitable for real-time applications like video streaming and gaming where occasional packet loss is acceptable.
Step 5: Verify GPU Usage
While the model is running, open another terminal and check that it is using your GPU:
```
nvidia-smi
```
You should see Ollama listed under "Processes" and VRAM usage matching the expected model size. If VRAM usage is near zero, the model is running on CPU — see the Troubleshooting section.
5. Performance Tuning
Out of the box, Ollama works well. But these tweaks can significantly improve speed and quality:
Context Length (num_ctx)
Llama 3.1 supports 128K context, but each token in the KV cache uses VRAM. More context = more VRAM. Default is typically 2048-4096.
Set context to 8192 tokens from the interactive CLI:
```
/set parameter num_ctx 8192
```
Or set it permanently via a Modelfile (PARAMETER num_ctx 8192). For 8K context on the 8B model, add ~1-2 GB VRAM overhead. For 32K context, add ~4-6 GB. Monitor with nvidia-smi while running.
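That overhead is the KV cache, and its size follows from the model architecture. A worked sketch for Llama 3.1 8B, which uses 32 layers, 8 KV heads (grouped-query attention), and a head dimension of 128, with the cache in FP16:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"8K context ≈ {gib:.1f} GiB of KV cache")  # ≈ 1.0 GiB
```

That is where the ~1-2 GB figure for 8K context comes from; quadrupling the context to 32K quadruples the cache.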
GPU Layer Offloading
If the model does not fully fit in VRAM, Ollama automatically offloads some layers to CPU. You can control this with the num_gpu parameter (the number of layers placed on the GPU).
Force all layers onto the GPU (the model fails to load if VRAM is insufficient):
```
/set parameter num_gpu 999
```
Force CPU-only inference (useful for testing or if the GPU is busy):
```
/set parameter num_gpu 0
```
Keep Model Loaded
By default, Ollama unloads models after 5 minutes of inactivity (to free VRAM). If you are using the model frequently:
Keep model loaded indefinitely:
```
OLLAMA_KEEP_ALIVE=-1 ollama serve
```
This eliminates the cold-start delay when you send a new prompt after a pause.
6. Benchmarks by GPU
Real-world token generation speed for Llama 3.1, measured with Ollama. These are eval throughput (tokens/sec during generation), not prompt processing speed.
Llama 3.1 8B (Q4_K_M)
| GPU | VRAM Used | Tokens/sec | Feel |
|---|---|---|---|
| RTX 4090 | ~5.2 GB | ~120 tok/s | Instant |
| RTX 4070 Ti Super | ~5.2 GB | ~85 tok/s | Instant |
| RTX 3090 | ~5.2 GB | ~95 tok/s | Instant |
| RTX 3060 12GB | ~5.2 GB | ~40 tok/s | Fast |
| RX 7900 XTX | ~5.2 GB | ~75 tok/s | Instant |
| M4 Max (unified) | ~5.2 GB | ~55 tok/s | Fast |
| CPU only (i7-13700K) | N/A (uses RAM) | ~8 tok/s | Slow but usable |
Llama 3.1 70B (Q4_K_M)
| GPU | VRAM Used | Tokens/sec | Notes |
|---|---|---|---|
| RTX 4090 (24 GB) | 24 GB + offload | ~20-25 tok/s | Partial CPU offload; still very usable |
| RTX 3090 (24 GB) | 24 GB + offload | ~15-18 tok/s | Partial CPU offload; conversational |
| A6000 (48 GB) | ~38 GB | ~35 tok/s | Fits fully in VRAM; no offload |
| M4 Max 128 GB | ~38 GB unified | ~18 tok/s | Fits fully in memory; bandwidth-limited |
| 2x RTX 3090 (48 GB) | ~38 GB across GPUs | ~25-30 tok/s | Requires llama.cpp multi-GPU; fits fully |
Benchmarks measured at 2048 context length. Longer context reduces throughput. See our Best GPUs for Local AI guide for detailed methodology.
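Throughput maps directly onto how long you wait for an answer: divide the response length by tokens per second. A quick sketch using speeds from the tables above:

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock generation time for a response of a given length."""
    return tokens / tok_per_s

# A ~300-token answer at speeds from the benchmark tables above
for setup, speed in [("RTX 4090, 8B", 120), ("RTX 3090, 70B", 16), ("CPU only, 8B", 8)]:
    print(f"{setup}: {seconds_for(300, speed):.1f} s")
```

At 120 tok/s a full paragraph appears in a couple of seconds; at 8 tok/s the same answer takes over half a minute, which is why CPU-only inference feels slow but usable.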
7. Advanced Usage
REST API
Ollama exposes a local API on port 11434. Use it from scripts, applications, or other tools:
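From Python, for instance, the generate endpoint can be called with nothing but the standard library. This is a minimal sketch (the function names are illustrative, and it assumes Ollama is serving on the default port):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming generate request body."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST the request and return the 'response' field of the JSON reply."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Usage (requires a running Ollama server):
# print(generate("llama3.1", "Explain TCP vs UDP in one sentence."))
```

The equivalent raw requests look like this: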
```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a Python function to calculate fibonacci numbers",
  "stream": false
}'
```
For chat-style conversations with history:
```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a CSV file in Python?"}
  ],
  "stream": false
}'
```
Custom Modelfile
Create a custom model with specific system prompts, temperature, and parameters:
Create a file called Modelfile:
```
FROM llama3.1
SYSTEM "You are a senior software engineer. Give concise, correct answers with code examples. Prefer Python unless asked otherwise."
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
```
Build and run:
```
ollama create my-coder -f Modelfile
ollama run my-coder
```
Running Multiple Models
Ollama can serve multiple models simultaneously if you have enough VRAM. Each model loaded adds its VRAM footprint. Monitor with:
List loaded models:
```
ollama ps
```
Unload a specific model:
```
ollama stop llama3.1
```
Web UI Options
Prefer a ChatGPT-like interface? These open-source UIs connect to Ollama's API:
- Open WebUI — Full-featured, supports chat history, file uploads, RAG. Install with Docker:
```
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
```
- Chatbox — Desktop app (Windows/Mac/Linux). Simple, fast, no Docker needed.
- Jan — Another desktop app with built-in model management. Good for beginners.
8. Troubleshooting
"Model is extremely slow (< 5 tok/s)"
Cause: The model is running on CPU instead of GPU, or partially offloaded to CPU RAM.
Fix:
- Run nvidia-smi and check whether Ollama appears in the process list.
- If it does not, your GPU driver may be outdated. Install the latest NVIDIA driver.
- If VRAM usage is low, the model may be too large for your GPU. Try a smaller quantization or model size.
- On Linux, ensure nvidia-container-toolkit is installed if using Docker.
"Error: model requires more memory than available"
Cause: VRAM is insufficient for the requested model.
Fix:
- Use a smaller model: llama3.1 instead of llama3.1:70b.
- Use a more aggressive quantization: pull the Q4_K_S variant instead of Q4_K_M.
- Close other VRAM-consuming apps (browsers with hardware acceleration, games, other models).
- Reduce the context length: /set parameter num_ctx 2048.
"Connection refused" on API calls
Cause: Ollama server is not running.
Fix:
- Start the server: ollama serve (runs in the foreground).
- On Linux, check the systemd service: systemctl status ollama.
- Verify it is listening: curl http://localhost:11434 — should return "Ollama is running".
"Model output is gibberish or low quality"
Cause: Likely using too aggressive quantization (Q2 or Q3), or temperature is too high.
Fix:
- Switch to Q4_K_M if you are using Q3 or lower. Quality improves dramatically.
- Lower the temperature: /set parameter temperature 0.7 in the Ollama CLI.
- Ensure you are using the instruct variant (the default in Ollama), not the base model.
"AMD GPU not detected"
Cause: ROCm not installed or GPU not supported.
Fix:
- Install ROCm: sudo apt install rocm-dev (Ubuntu/Debian).
- Verify with rocm-smi — your GPU should be listed.
- Check the ROCm compatibility list — not all AMD GPUs are supported.
- Note: AMD GPU support on Windows is limited for AI workloads; Linux is the better-supported path.
9. Frequently Asked Questions
Can I run Llama 3 on my PC?
Yes. Llama 3.1 8B at Q4_K_M requires only ~4.9 GB VRAM — any GPU with 6+ GB VRAM can run it. Even CPU-only inference works (just slower). Apple Silicon Macs with 8+ GB unified memory work too.
What is quantization and does it hurt quality?
Quantization reduces numerical precision to save VRAM. Q4_K_M loses 1-3% quality vs FP16 — most users cannot tell the difference. Q8 is virtually identical to FP16. Below Q4 (Q3, Q2), quality degrades noticeably. See the detailed breakdown above.
Llama 3 vs Llama 3.1 vs 3.2 — which should I use?
Llama 3.1 8B for most users — 128K context, best quality at this size. Llama 3.2 3B if you have very limited VRAM (<6 GB). Llama 3.1 70B if you have 24+ GB VRAM and want the best open-weight model available.
Is Ollama the best way to run Llama 3?
For getting started: yes. Ollama handles everything (download, quantization, GPU detection, API). For power users, llama.cpp gives more control over inference parameters. For serving multiple users, vLLM or TGI are better choices.
How does Llama 3 compare to ChatGPT?
Llama 3.1 70B is competitive with GPT-3.5 Turbo and approaches GPT-4 on many benchmarks. The 8B is more limited but surprisingly good for code, summarization, and general Q&A. The main advantage of local: privacy, zero cost per query, and offline access. The main disadvantage: no live internet access or tool use (without extra setup).
Recommended GPUs for Running Llama 3 Locally
Llama 3.1 8B Q4 requires ~5 GB VRAM; the 70B needs 24+ GB.
Prices and availability may change. Affiliate links.
Entry Tier (8–12 GB VRAM): RTX 4060 (8 GB), RTX 3060 (12 GB)
Mid Tier (12–16 GB VRAM): RTX 4060 Ti 16GB (16 GB), RTX 4070 (12 GB)
High Tier (24 GB VRAM): RTX 4090 (24 GB), RTX 3090 (24 GB)
Check Which Llama 3 Model Fits Your GPU
Enter your GPU model and see exactly which Llama variants fit — with VRAM headroom for different quantization levels.