
Personal local AI assistant

A setup for chatting with an LLM privately and without limits. Your conversations never leave your PC.

Minimum VRAM: 8 GB · For: users who want privacy and want to skip cloud subscriptions · Reference GPU: ~$320–540 (GPU only)

GPUs compatible with this setup

This scenario needs at least 8 GB of VRAM. These GPUs can run it:

Prices and availability may vary. See all NVIDIA GPUs →

Why this setup

This scenario is designed for users who want privacy and want to skip cloud subscriptions. The RTX 4060 8 GB offers the best balance between capacity (the 8 GB VRAM minimum), market availability, and relative cost for this scenario's use cases.

With 8 GB of VRAM you can load the recommended models at Q4 quantization without sacrificing too much quality. The software listed was selected for being open source, actively maintained, and compatible with this tier's hardware.

Step-by-step setup guide

  1. Install Ollama from ollama.com (available for Windows, macOS, and Linux).

  2. Pull your first model: open a terminal and run `ollama pull llama3.1:8b`.

  3. Install Open WebUI with Docker: `docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:ollama` (`--gpus=all` exposes the GPU to the container and requires the NVIDIA Container Toolkit; the volume mounts persist downloaded models and chat history across restarts).

  4. Open http://localhost:3000 in your browser and create your local account.

  5. Select the downloaded model and start chatting.
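Beyond the web UI, you can also talk to the model programmatically. A minimal sketch using Ollama's native REST API, assuming the server is running on its default port 11434 and that the model name matches the one pulled in step 2:

```python
import json
import urllib.request

# Ollama's default local chat endpoint (assumes a stock install from step 1).
OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a single-turn, non-streaming chat call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete JSON response instead of chunks
    }

def chat(model: str, prompt: str) -> str:
    """Send one prompt to the local Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server from step 1 running, `chat("llama3.1:8b", "Hello")` returns the model's reply as a plain string; nothing in the exchange leaves your machine.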

Frequently asked questions

What is the minimum VRAM for a personal assistant?

With 8 GB of VRAM you can run 7B–8B parameter models at Q4 smoothly. Phi-4 (14B) at Q4 needs at least 10 GB; an 8 GB RTX 4060 can handle it with some offloading.
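These figures come from a simple back-of-envelope calculation: at Q4, each weight takes roughly half a byte, plus extra memory for the KV cache and runtime buffers. A rough sketch (the ~20% overhead factor is an assumption, not an exact figure, and real usage grows with context length):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus an assumed ~20% fudge
    factor for KV cache, activations, and runtime buffers."""
    weight_gb = params_billions * bits_per_weight / 8  # 1e9 weights * bytes each / 1e9
    return round(weight_gb * overhead, 1)

# An 8B model at Q4 (~4 bits/weight):
print(estimate_vram_gb(8, 4))   # ~4.8 GB -> fits in 8 GB with headroom for context
# A 14B model like Phi-4 at Q4:
print(estimate_vram_gb(14, 4))  # ~8.4 GB -> tight on an 8 GB card, hence offloading
```

This is why 7B–8B models are the sweet spot for an 8 GB card, while 14B models only fit by offloading some layers to system RAM.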

Is running a local LLM actually private?

Yes. The model runs 100% on your hardware. Neither the model provider nor any third party sees your conversations. It is the most private option for AI assistants.

What is the difference between Ollama and llama.cpp?

Ollama is an abstraction layer on top of llama.cpp that adds model management, an OpenAI-compatible REST API, and multi-model support. For most users, Ollama is the best choice. llama.cpp is more powerful if you need low-level control.
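The OpenAI-compatible API means existing OpenAI-client code can target the local server just by changing the base URL. A minimal sketch of that request shape, assuming Ollama's default port 11434 (with the official `openai` Python package you would instead point `base_url` at `http://localhost:11434/v1`):

```python
import json
import urllib.request

# Ollama serves an OpenAI-compatible endpoint alongside its native API.
OPENAI_COMPAT_URL = "http://localhost:11434/v1/chat/completions"

def openai_style_request(model: str, prompt: str) -> tuple:
    """Return (url, payload) following the OpenAI chat-completions schema."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return OPENAI_COMPAT_URL, payload

def ask(model: str, prompt: str) -> str:
    """POST to the local server; replies use OpenAI's `choices` shape."""
    url, payload = openai_style_request(model, prompt)
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the request and response shapes match OpenAI's, tools built against the OpenAI API can be repointed at your own GPU with no code changes beyond the URL.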
