Home AI research lab
Full setup to experiment with 70B models, fine-tuning, and benchmarks. For the most demanding users.
GPUs compatible with this setup
This scenario requires at least 24 GB of VRAM. These GPUs can run it:
Prices and availability may vary. See all NVIDIA GPUs →
Why this setup
This scenario is designed for researchers, ML students, and advanced enthusiasts. The RTX 4090 24 GB offers the best balance between capacity (24 GB of VRAM minimum), market availability, and relative cost for this scenario's use cases.
With 24 GB of VRAM you can load the recommended models at Q4 quantization without sacrificing too much quality. The software listed was selected for being open source, actively maintained, and compatible with this tier's hardware.
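To check whether a given model fits your card, a quick back-of-the-envelope estimate helps: Q4-family quantizations average roughly 4.5 bits per parameter (an approximation; Q4_K_M sits slightly above 4 bits), and the KV cache and runtime buffers come on top of that. A minimal sketch:

```python
def q4_vram_gb(params_billion: float, bits_per_param: float = 4.5) -> float:
    """Rough weight-only VRAM estimate for a Q4-quantized model.

    bits_per_param ~4.5 approximates Q4_K_M. KV cache and runtime
    buffers are NOT included, so budget a few extra GB on top.
    """
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for p in (8, 13, 34, 70):
    print(f"{p}B -> ~{q4_vram_gb(p):.1f} GB")
```

For 70B this lands around 39 GB of weights alone, which is why the FAQ below recommends offloading or more than one GPU at that size.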
Software stack
Step-by-step setup guide
1. Install the latest NVIDIA drivers and CUDA Toolkit 12.x.
2. Install Ollama and pull the base model: `ollama pull llama3.1:70b`.
3. Build llama.cpp for advanced benchmarks: `git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j`.
4. For fine-tuning with Axolotl: `pip install axolotl` and prepare your dataset in JSONL format.
5. Configure JupyterLab or VS Code for experimentation notebooks.
6. Use `nvidia-smi dmon` to monitor VRAM and temperature in real time.
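The monitoring in the last step can be automated: pipe `nvidia-smi dmon` output into a small parser that flags samples over a temperature threshold. A sketch under assumptions — the column layout below (`gpu pwr gtemp sm mem`) varies across driver versions, so check the header your driver actually prints, and the sample text is hypothetical:

```python
# Hypothetical `nvidia-smi dmon` capture; real column order depends on
# driver version, so verify the `#` header line before relying on indexes.
SAMPLE = """\
# gpu   pwr  gtemp    sm   mem
# Idx     W      C     %     %
    0   412     78    99    95
    0   418     83    98    97
"""

def hot_samples(dmon_text: str, temp_limit: int = 80) -> list[int]:
    """Return GPU temperatures at or above temp_limit from dmon output."""
    temps = []
    for line in dmon_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split()
        temps.append(int(fields[2]))  # gtemp column in this assumed layout
    return [t for t in temps if t >= temp_limit]

print(hot_samples(SAMPLE))  # → [83]
```

In practice you would feed it live data, e.g. `nvidia-smi dmon -c 10` captured via `subprocess`.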
Recommended compatible models
Frequently asked questions
Can I fine-tune a 70B model on a single RTX 4090?
With QLoRA (Quantized LoRA), fine-tuning models up to 13B–30B is possible on an RTX 4090. For 70B you would need at least 48 GB of effective VRAM, which means aggressive gradient checkpointing or a multi-GPU setup.
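The arithmetic behind that answer can be sketched: QLoRA freezes the base weights in 4-bit and only trains small adapter matrices, so the footprint is roughly the quantized base plus the adapters and their optimizer states. A lower-bound estimate (the 200M adapter-parameter figure is an assumption, and activation memory, which scales with batch size and sequence length, is ignored):

```python
def qlora_vram_gb(params_billion: float, lora_params_million: float = 200) -> float:
    """Very rough QLoRA lower bound: frozen 4-bit base weights plus
    LoRA adapters (fp16) and their Adam optimizer states (~8 B/param).
    Activations are excluded, so real usage is higher."""
    base = params_billion * 1e9 * 4 / 8 / 1e9            # 4-bit frozen weights
    adapters = lora_params_million * 1e6 * (2 + 8) / 1e9  # weights + optimizer
    return base + adapters

print(f"13B: ~{qlora_vram_gb(13):.1f} GB")  # comfortably inside 24 GB
print(f"70B: ~{qlora_vram_gb(70):.1f} GB")  # over 24 GB before activations
```

This is why 13B–30B fine-tunes are realistic on one RTX 4090 while 70B is not.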
Does Llama 3.1 70B at Q4 fit in 24 GB of VRAM?
Not fully. Llama 3.1 70B at Q4 needs ~40 GB of VRAM. With 24 GB you can offload layers to RAM (CPU offload), which works but slows inference. For full 70B on GPU, you need two RTX 4090s or an RTX 5090.
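To size the CPU offload, you can estimate how many transformer layers fit on the GPU; the result maps to llama.cpp's `-ngl` / `--n-gpu-layers` flag. A sketch assuming layers are roughly equal in size (the 2 GB reserve for KV cache and buffers is an assumed margin):

```python
def gpu_layers(total_layers: int, model_gb: float, vram_gb: float,
               reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit on the GPU, assuming uniform layer
    size and reserving some VRAM for KV cache and runtime buffers."""
    per_layer = model_gb / total_layers
    return min(total_layers, int((vram_gb - reserve_gb) / per_layer))

# Llama 3.1 70B has 80 transformer layers, ~40 GB at Q4
print(gpu_layers(80, 40.0, 24.0))  # → 44
```

So a single 24 GB card runs roughly half the layers on GPU, e.g. `llama-cli -m llama-70b-q4.gguf -ngl 44`, with the rest on CPU — functional, but noticeably slower than full GPU residency.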
What is the difference between Ollama and llama.cpp for research?
Ollama is more convenient and faster to use. llama.cpp gives full control over custom quantization, context size, number of GPU/CPU layers, and performance metrics. For serious research, combining both is optimal.
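For the llama.cpp side of that workflow, `llama-bench` can emit machine-readable results that are easy to post-process. A sketch — the CSV below is a hypothetical sample (real column names vary by llama.cpp version), the point is simply extracting tokens/sec per test:

```python
import csv
import io

# Hypothetical llama-bench CSV output; check your version's actual columns.
SAMPLE = """model,n_gpu_layers,test,t/s
llama-70b-q4,44,pp512,61.2
llama-70b-q4,44,tg128,4.8
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for r in rows:
    # pp = prompt processing, tg = token generation
    print(f"{r['test']}: {float(r['t/s']):.1f} tok/s")
```

Logging runs like this over different quantizations and `-ngl` values is where llama.cpp earns its place next to Ollama for research use.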
Other scenarios
Related tools
Found this useful? Get guides like this in your inbox every week.