Intermediate · 22 min read

By the RunAIatHome editorial team. Ranking based on VRAM capacity, measured throughput, thermals, and software compatibility for local AI.

The 8 Best GPUs for Running AI Locally (2025)

Real benchmarks and realistic market bands. No marketing fluff — just the numbers that matter for running LLMs and Stable Diffusion on your own hardware.

Our recommendation

Best overall: RTX 4090. Best value: used RTX 3090.

If you want the pure-performance reference for local AI, the RTX 4090 is still on top. If you would rather maximize VRAM per euro and can live with higher power draw, the used RTX 3090 remains the most rational buy for most enthusiasts.

  • Overall winner: RTX 4090 for its bandwidth, software support, and sustained throughput.
  • Value winner: used RTX 3090 for its 24 GB and broad compatibility.
  • Alternative pick: RX 7900 XTX if you already work comfortably on Linux with ROCm.

1. What Actually Matters for AI GPUs

Before diving into the list, a quick primer on the three specs that matter for local AI. These are not gaming benchmarks — AI workloads stress completely different parts of the GPU.

VRAM (Capacity)

Determines the largest model you can run. A 70B Q4 model needs ~36 GB. If you have 24 GB, you cannot run it fully on-GPU. This is a hard limit — no software trick can fix insufficient VRAM.

Memory Bandwidth

Determines tokens per second. LLM inference is memory-bandwidth-bound, not compute-bound. The RTX 4090 (1008 GB/s) generates tokens ~3x faster than the RTX 4060 (288 GB/s) on the same model.

Power (TDP)

Affects electricity cost and heat. An RTX 4090 at 450W running for long daily sessions adds a visible monthly power cost. Relevant if you run models continuously or live in a hot climate.
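To put a rough number on that, here is a quick sketch; the daily hours and the 0.30 EUR/kWh tariff are illustrative assumptions, not measurements:

```python
# Rough monthly electricity cost of a 450 W card under a daily local-AI workload.
# The 4 h/day duty cycle and 0.30 EUR/kWh tariff are assumptions for illustration.
watts = 450
hours_per_day = 4
eur_per_kwh = 0.30

kwh_per_month = watts / 1000 * hours_per_day * 30
print(f"~{kwh_per_month:.0f} kWh/month  ->  ~{kwh_per_month * eur_per_kwh:.0f} EUR/month")
```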

Want to check if a specific model fits your GPU? Use our VRAM Calculator — enter your GPU model and the AI model you want to run, and get exact VRAM headroom.
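If you only need a ballpark before reaching for the calculator, the underlying arithmetic is simple. A minimal sketch; the function names are ours, and the bandwidth-bound ceiling ignores real-world overheads, so treat the outputs as upper bounds:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory needed just for the quantized weights (KV cache and activations come on top)."""
    return params_billion * bits_per_weight / 8

def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billion: float, bits_per_weight: float) -> float:
    """Upper bound on tokens/s if decoding reads every weight once per token."""
    return bandwidth_gb_s / weights_gb(params_billion, bits_per_weight)

# 70B at 4 bits: ~35 GB of weights alone, so it cannot sit fully in 24 GB of VRAM.
print(f"70B Q4 weights: ~{weights_gb(70, 4):.0f} GB")

# Why bandwidth dominates: an 8B Q4 model on a 1008 GB/s card vs a 288 GB/s card.
print(f"1008 GB/s ceiling: ~{decode_ceiling_tok_s(1008, 8, 4):.0f} tok/s")
print(f" 288 GB/s ceiling: ~{decode_ceiling_tok_s(288, 8, 4):.0f} tok/s")
```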

2. Quick Comparison Table

GPU               | VRAM    | BW (GB/s) | 70B Q4 (tok/s) | SDXL (s/img) | TDP  | Market band
RTX 4090          | 24 GB   | 1008      | ~45            | ~2.5         | 450W | Flagship
RTX 4080 Super    | 16 GB   | 736       | N/A            | ~3.5         | 320W | High-end
RTX 4070 Ti Super | 16 GB   | 672       | N/A            | ~4.0         | 285W | Upper mid-range
RTX 3090 (used)   | 24 GB   | 936       | ~32            | ~4.0         | 350W | Used high-end value
RX 7900 XTX       | 24 GB   | 960       | ~28            | ~5.0         | 355W | High-end value
RTX A6000         | 48 GB   | 768       | ~35            | ~4.5         | 300W | Workstation premium
M4 Max (128 GB)   | 128 GB* | 546       | ~18            | ~8.0         | ~40W | Apple premium ($3,500+)
Arc A770 16 GB    | 16 GB   | 560       | N/A            | ~7.0         | 225W | Budget wildcard

* Unified memory (shared between CPU and GPU). Benchmarks measured with llama.cpp / Ollama, Stable Diffusion via ComfyUI / A1111. "N/A" means the GPU lacks sufficient VRAM to run the model entirely on-GPU.

Reference prices in euros (2026)

US list prices rarely reflect what you actually pay in Europe. The table below gives approximate euro prices for popular NVIDIA options, along with VRAM, real-world inference speed on Llama 3.1 8B Q4, and the most suitable use case. Used prices can swing 15-20% depending on the time of year.

GPU               | VRAM  | Approx. price (€) | Speed (tok/s) | Use case
RTX 4090          | 24 GB | ~1,999 €          | ~90 tok/s     | Production / large models
RTX 4080 Super    | 16 GB | ~899 €            | ~65 tok/s     | Advanced development
RTX 4070 Ti Super | 16 GB | ~699 €            | ~55 tok/s     | Price/performance balance
RTX 3090 (used)   | 24 GB | ~799 €            | ~70 tok/s     | Budget 24 GB alternative
RTX 4070 Super    | 12 GB | ~549 €            | ~45 tok/s     | Gaming + local AI
RTX 4060 Ti 16 GB | 16 GB | ~399 €            | ~35 tok/s     | Mid-range budget
RTX 3060 12 GB    | 12 GB | ~269 €            | ~30 tok/s     | Entry-level budget

Indicative euro prices for the European market (April 2026). Speeds measured with Llama 3.1 8B Q4_K_M in Ollama. Used prices refer to platforms such as Wallapop, eBay.es, and Back Market.
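The VRAM-per-euro argument from the recommendation above is easy to check against this table. A small sketch using the indicative prices (rounded, April 2026):

```python
# Euros per GB of VRAM, from the indicative April 2026 prices above.
cards = {
    "RTX 4090":          (1999, 24),
    "RTX 4080 Super":    (899, 16),
    "RTX 4070 Ti Super": (699, 16),
    "RTX 3090 (used)":   (799, 24),
    "RTX 4070 Super":    (549, 12),
    "RTX 4060 Ti 16 GB": (399, 16),
    "RTX 3060 12 GB":    (269, 12),
}

# Sort from cheapest to most expensive per GB of VRAM.
for name, (price_eur, vram_gb) in sorted(cards.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:<18} {price_eur / vram_gb:5.1f} EUR/GB")
```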

3. Detailed Reviews

1. NVIDIA RTX 4090

Best Overall

VRAM: 24 GB GDDR6X · Bandwidth: 1008 GB/s · TDP: 450W · Market band: Flagship

Llama 3 70B (Q4): ~45 tok/s — Q4 (~36 GB) runs with partial CPU offload on a 24 GB card
Stable Diffusion XL: ~2.5 s/img at 1024x1024, 20 steps

The undisputed king of consumer AI. Even with the partial CPU offload that 70B Q4 requires, it posts the fastest numbers here, and anything up to 34B Q4 fits entirely in VRAM with room to spare for context. Stable Diffusion at absurd speeds. The only downsides: price and the 450W TDP.

2. NVIDIA RTX 4080 Super

VRAM: 16 GB GDDR6X · Bandwidth: 736 GB/s · TDP: 320W · Market band: High-end

Llama 3 70B (Q4): N/A (36 GB needed) — Cannot run 70B; runs 34B Q4 (~18 GB) with offload; excellent for 7B-13B
Stable Diffusion XL: ~3.5 s/img at 1024x1024, 20 steps

Excellent for models up to 13B at Q8, and for 34B Q4 with some CPU offload. Not enough VRAM for 70B. Great Stable Diffusion performance. Good balance if the 4090 is out of budget.

3. NVIDIA RTX 4070 Ti Super

Best Value Mid-High

VRAM: 16 GB GDDR6X · Bandwidth: 672 GB/s · TDP: 285W · Market band: Upper mid-range

Llama 3 70B (Q4): N/A (36 GB needed) — Cannot run 70B; handles 34B Q4 with partial offload; sweet spot for 7B FP16 and 13B Q8
Stable Diffusion XL: ~4.0 s/img at 1024x1024, 20 steps

The 16 GB sweet spot at a reasonable price. Runs 13B models at Q8 and 7B at FP16, and handles Stable Diffusion XL comfortably. Better value than the 4080 Super for most local AI use cases.

4. NVIDIA RTX 3090

Best Value Overall

VRAM: 24 GB GDDR6X · Bandwidth: 936 GB/s · TDP: 350W · Market band: Used high-end value

Llama 3 70B (Q4): ~32 tok/s — Q4 (~36 GB) runs with partial CPU offload; same 24 GB as the 4090 on an older architecture
Stable Diffusion XL: ~4.0 s/img at 1024x1024, 20 steps

The best value proposition in AI GPUs right now. 24 GB VRAM — same as the RTX 4090 — at 40% of the price on the used market. Slower per-token than the 4090, but you get the same model capacity. Power hungry at 350W.

5. AMD RX 7900 XTX

Best AMD

VRAM: 24 GB GDDR6 · Bandwidth: 960 GB/s · TDP: 355W · Market band: High-end value

Llama 3 70B (Q4): ~28 tok/s (ROCm/Linux) — Runs 70B Q4 on Linux with ROCm and partial CPU offload; Windows support for AI workloads is very limited
Stable Diffusion XL: ~5.0 s/img (ROCm) at 1024x1024, 20 steps

24 GB VRAM at a lower price than NVIDIA equivalents. The catch: ROCm only works well on Linux, and some AI tools lack AMD optimization. If you run Linux and are comfortable troubleshooting, this is excellent value.
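If you go this route, confirm the framework actually sees the card before debugging anything at the model level. A minimal check with a ROCm build of PyTorch (on ROCm, AMD GPUs are exposed through the regular torch.cuda API):

```python
import torch

# On a ROCm build of PyTorch, AMD GPUs are addressed through torch.cuda.
print("HIP build:", torch.version.hip)          # None on CUDA-only builds
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{torch.cuda.get_device_name(0)}: {props.total_memory / 1024**3:.0f} GB VRAM")
```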

6. NVIDIA RTX A6000

Most VRAM (Single Card)

VRAM: 48 GB GDDR6 · Bandwidth: 768 GB/s · TDP: 300W · Market band: Workstation premium

Llama 3 70B (Q4): ~35 tok/s — Runs 70B Q4 (36 GB) fully in VRAM with headroom; can reach 70B Q8 with offload
Stable Diffusion XL: ~4.5 s/img at 1024x1024, 20 steps

48 GB VRAM in a single card — the only GPU on this list that runs a 70B Q4 model fully in VRAM, and it can stretch toward Q8 with CPU offload. It is a professional workstation card rather than a gaming one, and it is increasingly popular with AI enthusiasts buying used.

7. Apple M4 Max (128 GB)

Best for Large Models

VRAM: 128 GB unified · Bandwidth: 546 GB/s · TDP: ~40W (chip) · Market band: Apple premium

Llama 3 70B (Q4): ~18 tok/s — Runs 70B even at Q8 (~70 GB) entirely in unified memory; slower per token but unmatched model capacity
Stable Diffusion XL: ~8.0 s/img (Metal) at 1024x1024, 20 steps

The unified memory architecture means system RAM doubles as GPU memory. 128 GB lets you run models that no discrete GPU on this list can touch, including 70B at Q8. Token speed is lower due to bandwidth, but the machine is completely silent and draws ~40W. Unique in the market.
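Ollama and llama.cpp use Metal automatically on Apple Silicon; for PyTorch-based tools you can confirm the Metal backend is active with a couple of lines, sketched here:

```python
import torch

# PyTorch reaches Apple's GPU through the "mps" (Metal Performance Shaders) backend.
if torch.backends.mps.is_available():
    x = torch.randn(2048, 2048, device="mps")
    print("Matmul ran on:", (x @ x).device)   # -> mps:0
else:
    print("MPS not available; check your macOS and PyTorch versions.")
```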

8. Intel Arc A770 16 GB

Budget Wildcard

VRAM: 16 GB GDDR6 · Bandwidth: 560 GB/s · TDP: 225W · Market band: Budget wildcard

Llama 3 70B (Q4): N/A (16 GB VRAM) — Cannot run 70B; handles 7B-13B Q4 with IPEX-LLM or llama.cpp's SYCL backend
Stable Diffusion XL: ~7.0 s/img (experimental) at 1024x1024, 20 steps

The dark horse. 16 GB VRAM in a budget band looks excellent on paper. The reality: Intel AI software (IPEX-LLM, SYCL) is maturing but still behind CUDA and ROCm. If you enjoy tinkering and want cheap VRAM, it is worth watching. Not recommended as a primary AI GPU today.

4. Which GPU Is Right for You?

"I want to run 7B-13B models and Stable Diffusion"

You need 12-16 GB VRAM. Best picks:

  • Budget: RTX 3060 12 GB (used value tier) or Intel Arc A770 16 GB (budget wildcard)
  • Sweet spot: RTX 4070 Ti Super 16 GB (upper mid-range)

"I want to run 70B models locally"

You need 24+ GB of VRAM. The model at Q4 needs ~36 GB, so even 24 GB requires partial CPU offload (see the offload sketch after this list). Best picks:

  • Best value: RTX 3090 used (used high-end value) — 24 GB VRAM, solid performance
  • Best performance: RTX 4090 (flagship) — fastest tokens/sec at 24 GB
  • Full in-VRAM: RTX A6000 48 GB (workstation tier) — no offload needed
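For reference, this is roughly what partial offload looks like when you drive llama.cpp directly through llama-cpp-python; the model path is a placeholder, and the right n_gpu_layers value depends on your card and context length. Ollama picks the split automatically, so you only tune this by hand with llama.cpp itself:

```python
from llama_cpp import Llama

# Keep as many transformer layers on the GPU as fit; the rest run on the CPU.
llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=48,   # lower this if you hit out-of-memory; -1 means "all layers"
    n_ctx=2048,
)

out = llm("Explain memory bandwidth in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```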

"I want a silent, low-power AI setup"

Apple Silicon is the only real option. The M4 Max with 128 GB unified memory runs 70B models at Q8 — something no single consumer GPU can do.

  • Best pick: Mac Studio M4 Max or MacBook Pro M4 Max — 128 GB, ~40W, zero fan noise
  • Tradeoff: Lower tokens/sec than discrete GPUs, locked to macOS

"I'm on Linux and want the best price/VRAM ratio"

AMD GPUs offer more VRAM per dollar, but require ROCm (Linux only).

  • Best pick: RX 7900 XTX (high-end value) — 24 GB, competitive performance on Linux
  • Caveat: Not all AI tools have ROCm support. Check compatibility before buying.

5. Benchmark Methodology

All benchmarks were collected from community-reported results, hardware review sites, and our own testing. The numbers represent typical real-world performance, not theoretical peaks.

Benchmark  | Setup
LLM tok/s  | Ollama / llama.cpp, Llama 3 70B Q4_K_M, 2048 context, eval throughput (not prompt processing)
SDXL s/img | ComfyUI, SDXL 1.0 base, 1024x1024, 20 DPM++ 2M Karras steps, FP16
VRAM usage | Peak VRAM reported by nvidia-smi / rocm-smi during inference
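As a concrete example of how the eval-throughput figure is derived, a local Ollama instance reports the raw counters itself; a minimal sketch (the model tag and prompt are just examples):

```python
import requests

# Ollama's /api/generate response includes eval_count (generated tokens)
# and eval_duration (time spent generating them, in nanoseconds).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:70b", "prompt": "Write a haiku about VRAM.", "stream": False},
    timeout=600,
).json()

tok_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"eval throughput: {tok_s:.1f} tok/s")
```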

Your actual performance will vary based on system RAM, CPU, driver version, quantization method, and context length. These numbers are useful for relative comparison, not absolute guarantees.
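The SDXL numbers come from ComfyUI; if you want a quick stand-in without building that workflow, a diffusers script with a comparable sampler lands in the same ballpark (this is not the exact pipeline used for the table above):

```python
import time
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Roughly equivalent to ComfyUI's DPM++ 2M Karras sampler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

start = time.perf_counter()
pipe("a photo of a graphics card on a desk", width=1024, height=1024, num_inference_steps=20)
print(f"{time.perf_counter() - start:.1f} s/img (first run includes warmup)")
```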

6. Frequently Asked Questions

What is the best GPU for running AI locally in 2025?

The NVIDIA RTX 4090 is the best overall. For value, the used RTX 3090 is hard to beat thanks to its 24 GB VRAM. For silent operation and maximum model capacity, the Apple M4 Max 128 GB is unique.

Can AMD GPUs run AI models locally?

Yes. The RX 7900 XTX works well with ROCm on Linux. Ollama and llama.cpp both support AMD GPUs. However, Windows support is very limited, and not all AI tools (ComfyUI extensions, fine-tuning frameworks) have ROCm backends.

Is Apple M4 Max good for running LLMs?

Excellent for model capacity — 128 GB unified memory means you can run 70B at Q8, and models too large for any single consumer GPU. Token speed is lower (~18 tok/s for 70B Q4) compared to the RTX 4090 (~45 tok/s), but the silence, low power draw, and massive memory make it unique.

How many tokens/sec do I need?

15-20 tok/s feels like natural conversation. Below 10 feels slow. For code generation or batch processing, even 5 tok/s is fine since you are not waiting interactively. Stable Diffusion is measured in seconds per image instead.

Should I buy an RTX 4090 or two RTX 3090s?

For simplicity: one RTX 4090. Multi-GPU requires compatible software, a motherboard with enough PCIe lanes, a large PSU, and adds debugging complexity. Two 3090s give 48 GB total and can be attractive for large models, but setup is not trivial.
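If you do take the two-card route with llama.cpp, the split itself is a single parameter; a hedged sketch with llama-cpp-python (Ollama spreads layers across GPUs on its own, and the path below is a placeholder):

```python
from llama_cpp import Llama

# Spread the weights roughly 50/50 across two visible GPUs (~48 GB combined).
llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder
    n_gpu_layers=-1,          # offload every layer; tensor_split decides where each goes
    tensor_split=[0.5, 0.5],  # share of the model assigned to GPU 0 and GPU 1
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```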

Check What Your GPU Can Run

Enter your GPU model and see exactly which AI models fit in your VRAM — with headroom calculations.