Intermediate · 22 min read

By the RunAIatHome editorial team. Ranking based on VRAM capacity, measured throughput, thermals, and software compatibility for local AI.

The 8 Best GPUs for Running AI Locally (2025)

Real benchmarks and realistic market bands. No marketing fluff — just the numbers that matter for running LLMs and Stable Diffusion on your own hardware.

Our recommendation

Best overall: RTX 4090. Best value: used RTX 3090.

If you want the pure-performance reference for local AI, the RTX 4090 is still on top. If you would rather maximize VRAM per euro and can live with higher power draw, the used RTX 3090 remains the most rational buy for most enthusiasts.

  • Overall winner: RTX 4090 for its bandwidth, software support, and sustained throughput.
  • Value winner: used RTX 3090 for its 24 GB and broad compatibility.
  • Alternative pick: RX 7900 XTX if you already work comfortably on Linux with ROCm.

1. What Actually Matters for AI GPUs

Before diving into the list, a quick primer on the three specs that matter for local AI. These are not gaming benchmarks — AI workloads stress completely different parts of the GPU.

VRAM (Capacity)

Determines the largest model you can run. A 70B Q4 model needs ~36 GB. If you have 24 GB, you cannot run it fully on-GPU. This is a hard limit — no software trick can fix insufficient VRAM.

Memory Bandwidth

Determines tokens per second. LLM inference is memory-bandwidth-bound, not compute-bound. The RTX 4090 (1008 GB/s) generates tokens ~3x faster than the RTX 4060 (288 GB/s) on the same model.

Power (TDP)

Affects electricity cost and heat. An RTX 4090 at 450W running for long daily sessions adds a visible monthly power cost. Relevant if you run models continuously or live in a hot climate.
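To put a rough number on that, here is a quick sketch; the daily hours and the 0.30 EUR/kWh tariff are illustrative assumptions, not measurements:

```python
# Rough monthly electricity cost of a 450 W card under a daily local-AI workload.
# The 4 h/day duty cycle and 0.30 EUR/kWh tariff are assumptions for illustration.
watts = 450
hours_per_day = 4
eur_per_kwh = 0.30

kwh_per_month = watts / 1000 * hours_per_day * 30
print(f"~{kwh_per_month:.0f} kWh/month  ->  ~{kwh_per_month * eur_per_kwh:.0f} EUR/month")
```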

Want to check if a specific model fits your GPU? Use our VRAM Calculator — enter your GPU model and the AI model you want to run, and get exact VRAM headroom.
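If you only need a ballpark before reaching for the calculator, the underlying arithmetic is simple. A minimal sketch; the function names are ours, and the bandwidth-bound ceiling ignores real-world overheads, so treat the outputs as upper bounds:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory needed just for the quantized weights (KV cache and activations come on top)."""
    return params_billion * bits_per_weight / 8

def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billion: float, bits_per_weight: float) -> float:
    """Upper bound on tokens/s if decoding reads every weight once per token."""
    return bandwidth_gb_s / weights_gb(params_billion, bits_per_weight)

# 70B at 4 bits: ~35 GB of weights alone, so it cannot sit fully in 24 GB of VRAM.
print(f"70B Q4 weights: ~{weights_gb(70, 4):.0f} GB")

# Why bandwidth dominates: an 8B Q4 model on a 1008 GB/s card vs a 288 GB/s card.
print(f"1008 GB/s ceiling: ~{decode_ceiling_tok_s(1008, 8, 4):.0f} tok/s")
print(f" 288 GB/s ceiling: ~{decode_ceiling_tok_s(288, 8, 4):.0f} tok/s")
```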

2. Quick Comparison Table

GPU               | VRAM    | BW (GB/s) | 70B Q4 (tok/s) | SDXL (s/img) | TDP  | Market band
RTX 4090          | 24 GB   | 1008      | ~45            | ~2.5         | 450W | Flagship
RTX 4080 Super    | 16 GB   | 736       | N/A            | ~3.5         | 320W | High-end
RTX 4070 Ti Super | 16 GB   | 672       | N/A            | ~4.0         | 285W | Upper mid-range
RTX 3090 (used)   | 24 GB   | 936       | ~32            | ~4.0         | 350W | Used high-end value
RX 7900 XTX       | 24 GB   | 960       | ~28            | ~5.0         | 355W | High-end value
RTX A6000         | 48 GB   | 768       | ~35            | ~4.5         | 300W | Workstation premium
M4 Max (128 GB)   | 128 GB* | 546       | ~18            | ~8.0         | ~40W | Apple premium ($3,500+)
Arc A770 16 GB    | 16 GB   | 560       | N/A            | ~7.0         | 225W | Budget wildcard

* Unified memory (shared between CPU and GPU). Benchmarks measured with llama.cpp / Ollama, Stable Diffusion via ComfyUI / A1111. "N/A" means the GPU lacks sufficient VRAM to run the model entirely on-GPU.

Reference prices in euros (2026)

US list prices rarely reflect what you actually pay in Europe. The table below gives approximate euro prices for popular NVIDIA options, along with VRAM, real-world inference speed on Llama 3.1 8B Q4, and the most suitable use case. Used prices can swing 15-20% depending on the time of year.

GPU               | VRAM  | Approx. price (€) | Speed (tok/s) | Use case
RTX 4090          | 24 GB | ~1,999 €          | ~90 tok/s     | Production / large models
RTX 4080 Super    | 16 GB | ~899 €            | ~65 tok/s     | Advanced development
RTX 4070 Ti Super | 16 GB | ~699 €            | ~55 tok/s     | Price/performance balance
RTX 3090 (used)   | 24 GB | ~799 €            | ~70 tok/s     | Budget 24 GB alternative
RTX 4070 Super    | 12 GB | ~549 €            | ~45 tok/s     | Gaming + local AI
RTX 4060 Ti 16 GB | 16 GB | ~399 €            | ~35 tok/s     | Mid-range budget
RTX 3060 12 GB    | 12 GB | ~269 €            | ~30 tok/s     | Entry-level budget

Indicative euro prices for the European market (April 2026). Speeds measured with Llama 3.1 8B Q4_K_M in Ollama. Used prices refer to platforms such as Wallapop, eBay.es, and Back Market.
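The VRAM-per-euro argument from the recommendation above is easy to check against this table. A small sketch using the indicative prices (rounded, April 2026):

```python
# Euros per GB of VRAM, from the indicative April 2026 prices above.
cards = {
    "RTX 4090":          (1999, 24),
    "RTX 4080 Super":    (899, 16),
    "RTX 4070 Ti Super": (699, 16),
    "RTX 3090 (used)":   (799, 24),
    "RTX 4070 Super":    (549, 12),
    "RTX 4060 Ti 16 GB": (399, 16),
    "RTX 3060 12 GB":    (269, 12),
}

# Sort from cheapest to most expensive per GB of VRAM.
for name, (price_eur, vram_gb) in sorted(cards.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:<18} {price_eur / vram_gb:5.1f} EUR/GB")
```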

3. Detailed Reviews

1. NVIDIA RTX 4090

Best Overall

VRAM: 24 GB GDDR6X · Bandwidth: 1008 GB/s · TDP: 450W · Market band: Flagship

Llama 3 70B (Q4): ~45 tok/s — Q4 (~36 GB) runs with partial CPU offload on a 24 GB card
Stable Diffusion XL: ~2.5 s/img at 1024x1024, 20 steps

The undisputed king of consumer AI. Even with the partial CPU offload that 70B Q4 requires, it posts the fastest numbers here, and anything up to 34B Q4 fits entirely in VRAM with room to spare for context. Stable Diffusion at absurd speeds. The only downsides: price and the 450W TDP.

2. NVIDIA RTX 4080 Super

VRAM: 16 GB GDDR6X · Bandwidth: 736 GB/s · TDP: 320W · Market band: High-end

Llama 3 70B (Q4): N/A (36 GB needed) — Cannot run 70B; runs 34B Q4 (~18 GB) with offload; excellent for 7B-13B
Stable Diffusion XL: ~3.5 s/img at 1024x1024, 20 steps

Excellent for models up to 13B at Q8, and for 34B Q4 with some CPU offload. Not enough VRAM for 70B. Great Stable Diffusion performance. Good balance if the 4090 is out of budget.

3. NVIDIA RTX 4070 Ti Super

Best Value Mid-High

VRAM: 16 GB GDDR6X · Bandwidth: 672 GB/s · TDP: 285W · Market band: Upper mid-range

Llama 3 70B (Q4): N/A (36 GB needed) — Cannot run 70B; handles 34B Q4 with partial offload; sweet spot for 7B FP16 and 13B Q8
Stable Diffusion XL: ~4.0 s/img at 1024x1024, 20 steps

The 16 GB sweet spot at a reasonable price. Runs 13B models at Q8 and 7B at FP16, and handles Stable Diffusion XL comfortably. Better value than the 4080 Super for most local AI use cases.

4. NVIDIA RTX 3090

Best Value Overall

VRAM: 24 GB GDDR6X · Bandwidth: 936 GB/s · TDP: 350W · Market band: Used high-end value

Llama 3 70B (Q4): ~32 tok/s — Q4 (~36 GB) runs with partial CPU offload; same 24 GB as the 4090 on an older architecture
Stable Diffusion XL: ~4.0 s/img at 1024x1024, 20 steps

The best value proposition in AI GPUs right now. 24 GB VRAM — same as the RTX 4090 — at 40% of the price on the used market. Slower per-token than the 4090, but you get the same model capacity. Power hungry at 350W.

5. AMD RX 7900 XTX

Best AMD

VRAM: 24 GB GDDR6 · Bandwidth: 960 GB/s · TDP: 355W · Market band: High-end value

Llama 3 70B (Q4): ~28 tok/s (ROCm/Linux) — Runs 70B Q4 on Linux with ROCm and partial CPU offload; Windows support for AI workloads is very limited
Stable Diffusion XL: ~5.0 s/img (ROCm) at 1024x1024, 20 steps

24 GB VRAM at a lower price than NVIDIA equivalents. The catch: ROCm only works well on Linux, and some AI tools lack AMD optimization. If you run Linux and are comfortable troubleshooting, this is excellent value.
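If you go this route, confirm the framework actually sees the card before debugging anything at the model level. A minimal check with a ROCm build of PyTorch (on ROCm, AMD GPUs are exposed through the regular torch.cuda API):

```python
import torch

# On a ROCm build of PyTorch, AMD GPUs are addressed through torch.cuda.
print("HIP build:", torch.version.hip)          # None on CUDA-only builds
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{torch.cuda.get_device_name(0)}: {props.total_memory / 1024**3:.0f} GB VRAM")
```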

6. NVIDIA RTX A6000

Most VRAM (Single Card)

VRAM: 48 GB GDDR6 · Bandwidth: 768 GB/s · TDP: 300W · Market band: Workstation premium

Llama 3 70B (Q4): ~35 tok/s — Runs 70B Q4 (36 GB) fully in VRAM with headroom; can reach 70B Q8 with offload
Stable Diffusion XL: ~4.5 s/img at 1024x1024, 20 steps

48 GB VRAM in a single card — the only GPU on this list that runs a 70B Q4 model fully in VRAM, and it can stretch toward Q8 with CPU offload. It is a professional workstation card rather than a gaming one, and it is increasingly popular with AI enthusiasts buying used.

7. Apple M4 Max (128 GB)

Best for Large Models

VRAM: 128 GB unified · Bandwidth: 546 GB/s · TDP: ~40W (chip) · Market band: Apple premium

Llama 3 70B (Q4): ~18 tok/s — Runs 70B even at Q8 (~70 GB) entirely in unified memory; slower per token but unmatched model capacity
Stable Diffusion XL: ~8.0 s/img (Metal) at 1024x1024, 20 steps

The unified memory architecture means system RAM doubles as GPU memory. 128 GB lets you run models that no discrete GPU on this list can touch, including 70B at Q8. Token speed is lower due to bandwidth, but the machine is completely silent and draws ~40W. Unique in the market.
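Ollama and llama.cpp use Metal automatically on Apple Silicon; for PyTorch-based tools you can confirm the Metal backend is active with a couple of lines, sketched here:

```python
import torch

# PyTorch reaches Apple's GPU through the "mps" (Metal Performance Shaders) backend.
if torch.backends.mps.is_available():
    x = torch.randn(2048, 2048, device="mps")
    print("Matmul ran on:", (x @ x).device)   # -> mps:0
else:
    print("MPS not available; check your macOS and PyTorch versions.")
```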

8. Intel Arc A770 16 GB

Budget Wildcard

VRAM: 16 GB GDDR6 · Bandwidth: 560 GB/s · TDP: 225W · Market band: Budget wildcard

Llama 3 70B (Q4): N/A (16 GB VRAM) — Cannot run 70B; handles 7B-13B Q4 with IPEX-LLM or llama.cpp's SYCL backend
Stable Diffusion XL: ~7.0 s/img (experimental) at 1024x1024, 20 steps

The dark horse. 16 GB VRAM in a budget band looks excellent on paper. The reality: Intel AI software (IPEX-LLM, SYCL) is maturing but still behind CUDA and ROCm. If you enjoy tinkering and want cheap VRAM, it is worth watching. Not recommended as a primary AI GPU today.

4. Which GPU Is Right for You?

"I want to run 7B-13B models and Stable Diffusion"

You need 12-16 GB VRAM. Best picks:

  • Budget: RTX 3060 12 GB (used value tier) or Intel Arc A770 16 GB (budget wildcard)
  • Sweet spot: RTX 4070 Ti Super 16 GB (upper mid-range)

"I want to run 70B models locally"

You need 24+ GB of VRAM. The model at Q4 needs ~36 GB, so even 24 GB requires partial CPU offload (see the offload sketch after this list). Best picks:

  • Best value: RTX 3090 used (used high-end value) — 24 GB VRAM, solid performance
  • Best performance: RTX 4090 (flagship) — fastest tokens/sec at 24 GB
  • Full in-VRAM: RTX A6000 48 GB (workstation tier) — no offload needed
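For reference, this is roughly what partial offload looks like when you drive llama.cpp directly through llama-cpp-python; the model path is a placeholder, and the right n_gpu_layers value depends on your card and context length. Ollama picks the split automatically, so you only tune this by hand with llama.cpp itself:

```python
from llama_cpp import Llama

# Keep as many transformer layers on the GPU as fit; the rest run on the CPU.
llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=48,   # lower this if you hit out-of-memory; -1 means "all layers"
    n_ctx=2048,
)

out = llm("Explain memory bandwidth in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```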

"I want a silent, low-power AI setup"

Apple Silicon is the only real option. The M4 Max with 128 GB unified memory runs 70B models at Q8 — something no single consumer GPU can do.

  • Best pick: Mac Studio M4 Max or MacBook Pro M4 Max — 128 GB, ~40W, zero fan noise
  • Tradeoff: Lower tokens/sec than discrete GPUs, locked to macOS

"I'm on Linux and want the best price/VRAM ratio"

AMD GPUs offer more VRAM per dollar, but require ROCm (Linux only).

  • Best pick: RX 7900 XTX (high-end value) — 24 GB, competitive performance on Linux
  • Caveat: Not all AI tools have ROCm support. Check compatibility before buying.

5. Benchmark Methodology

All benchmarks were collected from community-reported results, hardware review sites, and our own testing. The numbers represent typical real-world performance, not theoretical peaks.

Benchmark  | Setup
LLM tok/s  | Ollama / llama.cpp, Llama 3 70B Q4_K_M, 2048 context, eval throughput (not prompt processing)
SDXL s/img | ComfyUI, SDXL 1.0 base, 1024x1024, 20 DPM++ 2M Karras steps, FP16
VRAM usage | Peak VRAM reported by nvidia-smi / rocm-smi during inference
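As a concrete example of how the eval-throughput figure is derived, a local Ollama instance reports the raw counters itself; a minimal sketch (the model tag and prompt are just examples):

```python
import requests

# Ollama's /api/generate response includes eval_count (generated tokens)
# and eval_duration (time spent generating them, in nanoseconds).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:70b", "prompt": "Write a haiku about VRAM.", "stream": False},
    timeout=600,
).json()

tok_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"eval throughput: {tok_s:.1f} tok/s")
```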

Your actual performance will vary based on system RAM, CPU, driver version, quantization method, and context length. These numbers are useful for relative comparison, not absolute guarantees.
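The SDXL numbers come from ComfyUI; if you want a quick stand-in without building that workflow, a diffusers script with a comparable sampler lands in the same ballpark (this is not the exact pipeline used for the table above):

```python
import time
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Roughly equivalent to ComfyUI's DPM++ 2M Karras sampler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

start = time.perf_counter()
pipe("a photo of a graphics card on a desk", width=1024, height=1024, num_inference_steps=20)
print(f"{time.perf_counter() - start:.1f} s/img (first run includes warmup)")
```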

6. Frequently Asked Questions

What is the best GPU for running AI locally in 2025?

The NVIDIA RTX 4090 is the best overall. For value, the used RTX 3090 is hard to beat thanks to its 24 GB VRAM. For silent operation and maximum model capacity, the Apple M4 Max 128 GB is unique.

Can AMD GPUs run AI models locally?

Yes. The RX 7900 XTX works well with ROCm on Linux. Ollama and llama.cpp both support AMD GPUs. However, Windows support is very limited, and not all AI tools (ComfyUI extensions, fine-tuning frameworks) have ROCm backends.

Is Apple M4 Max good for running LLMs?

Excellent for model capacity — 128 GB unified memory means you can run 70B at Q8, and models too large for any single consumer GPU. Token speed is lower (~18 tok/s for 70B Q4) compared to the RTX 4090 (~45 tok/s), but the silence, low power draw, and massive memory make it unique.

How many tokens/sec do I need?

15-20 tok/s feels like natural conversation. Below 10 feels slow. For code generation or batch processing, even 5 tok/s is fine since you are not waiting interactively. Stable Diffusion is measured in seconds per image instead.

Should I buy an RTX 4090 or two RTX 3090s?

For simplicity: one RTX 4090. Multi-GPU requires compatible software, a motherboard with enough PCIe lanes, a large PSU, and adds debugging complexity. Two 3090s give 48 GB total and can be attractive for large models, but setup is not trivial.
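If you do take the two-card route with llama.cpp, the split itself is a single parameter; a hedged sketch with llama-cpp-python (Ollama spreads layers across GPUs on its own, and the path below is a placeholder):

```python
from llama_cpp import Llama

# Spread the weights roughly 50/50 across two visible GPUs (~48 GB combined).
llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder
    n_gpu_layers=-1,          # offload every layer; tensor_split decides where each goes
    tensor_split=[0.5, 0.5],  # share of the model assigned to GPU 0 and GPU 1
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```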

Check What Your GPU Can Run

Enter your GPU model and see exactly which AI models fit in your VRAM — with headroom calculations.