Home AI research lab
Full setup to experiment with 70B models, fine-tuning, and benchmarks. For the most demanding users.
GPUs compatible with this setup
This scenario requires at least 24 GB of VRAM. These GPUs can run it:
Prices and availability may vary. See all NVIDIA GPUs →
Why this setup
This scenario is designed for researchers, ML students, and advanced enthusiasts. The RTX 4090 24 GB offers the best balance between capacity (24 GB of VRAM minimum), market availability, and relative cost for this scenario's use cases.
With 24 GB of VRAM you can load the recommended models at Q4 quantization without sacrificing too much quality. The software listed was selected for being open source, actively maintained, and compatible with this tier's hardware.
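To check whether a given model fits your card, a quick back-of-the-envelope estimate helps: Q4-family quantizations average roughly 4.5 bits per parameter (an approximation; Q4_K_M sits slightly above 4 bits), and the KV cache and runtime buffers come on top of that. A minimal sketch:

```python
def q4_vram_gb(params_billion: float, bits_per_param: float = 4.5) -> float:
    """Rough weight-only VRAM estimate for a Q4-quantized model.

    bits_per_param ~4.5 approximates Q4_K_M. KV cache and runtime
    buffers are NOT included, so budget a few extra GB on top.
    """
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for p in (8, 13, 34, 70):
    print(f"{p}B -> ~{q4_vram_gb(p):.1f} GB")
```

For 70B this lands around 39 GB of weights alone, which is why the FAQ below recommends offloading or more than one GPU at that size.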
Software stack
Step-by-step setup guide
1. Install the latest NVIDIA drivers and CUDA Toolkit 12.x.
2. Install Ollama and pull the base model: `ollama pull llama3.1:70b`.
3. Build llama.cpp for advanced benchmarks: `git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j`.
4. For fine-tuning with Axolotl: `pip install axolotl` and prepare your dataset in JSONL format.
5. Configure JupyterLab or VS Code for experimentation notebooks.
6. Use `nvidia-smi dmon` to monitor VRAM and temperature in real time.
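The monitoring in the last step can be automated: pipe `nvidia-smi dmon` output into a small parser that flags samples over a temperature threshold. A sketch under assumptions — the column layout below (`gpu pwr gtemp sm mem`) varies across driver versions, so check the header your driver actually prints, and the sample text is hypothetical:

```python
# Hypothetical `nvidia-smi dmon` capture; real column order depends on
# driver version, so verify the `#` header line before relying on indexes.
SAMPLE = """\
# gpu   pwr  gtemp    sm   mem
# Idx     W      C     %     %
    0   412     78    99    95
    0   418     83    98    97
"""

def hot_samples(dmon_text: str, temp_limit: int = 80) -> list[int]:
    """Return GPU temperatures at or above temp_limit from dmon output."""
    temps = []
    for line in dmon_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split()
        temps.append(int(fields[2]))  # gtemp column in this assumed layout
    return [t for t in temps if t >= temp_limit]

print(hot_samples(SAMPLE))  # → [83]
```

In practice you would feed it live data, e.g. `nvidia-smi dmon -c 10` captured via `subprocess`.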
Recommended compatible models
Frequently asked questions
Can I fine-tune a 70B model on a single RTX 4090?
With QLoRA (Quantized LoRA), fine-tuning models up to 13B–30B is possible on an RTX 4090. For 70B you would need at least 48 GB of effective VRAM, which means aggressive gradient checkpointing or a multi-GPU setup.
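The arithmetic behind that answer can be sketched: QLoRA freezes the base weights in 4-bit and only trains small adapter matrices, so the footprint is roughly the quantized base plus the adapters and their optimizer states. A lower-bound estimate (the 200M adapter-parameter figure is an assumption, and activation memory, which scales with batch size and sequence length, is ignored):

```python
def qlora_vram_gb(params_billion: float, lora_params_million: float = 200) -> float:
    """Very rough QLoRA lower bound: frozen 4-bit base weights plus
    LoRA adapters (fp16) and their Adam optimizer states (~8 B/param).
    Activations are excluded, so real usage is higher."""
    base = params_billion * 1e9 * 4 / 8 / 1e9            # 4-bit frozen weights
    adapters = lora_params_million * 1e6 * (2 + 8) / 1e9  # weights + optimizer
    return base + adapters

print(f"13B: ~{qlora_vram_gb(13):.1f} GB")  # comfortably inside 24 GB
print(f"70B: ~{qlora_vram_gb(70):.1f} GB")  # over 24 GB before activations
```

This is why 13B–30B fine-tunes are realistic on one RTX 4090 while 70B is not.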
Does Llama 3.1 70B at Q4 fit in 24 GB of VRAM?
Not fully. Llama 3.1 70B at Q4 needs ~40 GB of VRAM. With 24 GB you can offload layers to RAM (CPU offload), which works but slows inference. For full 70B on GPU, you need two RTX 4090s or an RTX 5090.
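To size the CPU offload, you can estimate how many transformer layers fit on the GPU; the result maps to llama.cpp's `-ngl` / `--n-gpu-layers` flag. A sketch assuming layers are roughly equal in size (the 2 GB reserve for KV cache and buffers is an assumed margin):

```python
def gpu_layers(total_layers: int, model_gb: float, vram_gb: float,
               reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit on the GPU, assuming uniform layer
    size and reserving some VRAM for KV cache and runtime buffers."""
    per_layer = model_gb / total_layers
    return min(total_layers, int((vram_gb - reserve_gb) / per_layer))

# Llama 3.1 70B has 80 transformer layers, ~40 GB at Q4
print(gpu_layers(80, 40.0, 24.0))  # → 44
```

So a single 24 GB card runs roughly half the layers on GPU, e.g. `llama-cli -m llama-70b-q4.gguf -ngl 44`, with the rest on CPU — functional, but noticeably slower than full GPU residency.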
What is the difference between Ollama and llama.cpp for research?
Ollama is more convenient and faster to use. llama.cpp gives full control over custom quantization, context size, number of GPU/CPU layers, and performance metrics. For serious research, combining both is optimal.
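For the llama.cpp side of that workflow, `llama-bench` can emit machine-readable results that are easy to post-process. A sketch — the CSV below is a hypothetical sample (real column names vary by llama.cpp version), the point is simply extracting tokens/sec per test:

```python
import csv
import io

# Hypothetical llama-bench CSV output; check your version's actual columns.
SAMPLE = """model,n_gpu_layers,test,t/s
llama-70b-q4,44,pp512,61.2
llama-70b-q4,44,tg128,4.8
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for r in rows:
    # pp = prompt processing, tg = token generation
    print(f"{r['test']}: {float(r['t/s']):.1f} tok/s")
```

Logging runs like this over different quantizations and `-ngl` values is where llama.cpp earns its place next to Ollama for research use.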
Other scenarios
Related tools
Found this useful? Get guides like this in your inbox every week.