By the editorial team at RunAIatHome. Tested on local AI builds; we only estimate where we have no real measurements.
Hermes 3 locally: requirements, installation and complete 2026 guide
The Nous Research finetune of Llama 3.1 that turns the base model into a useful agent: structured function-calling, roleplay without aggressive refusals, and chain-of-thought reasoning. Hermes 3 locally runs on consumer hardware with the same footprint as Llama 3.1 — but behaves very differently.
Reference pricing: RTX 3060 12GB ~€269 · RTX 4070 Super ~€499 · RTX 4090 ~€1,799.
What hardware do I need to run Hermes 3?
Hermes 3 8B runs on RTX 3060 12GB (~€269) in Q4, Hermes 3 70B needs RTX 4090 24GB (~€1,799)
1. The quick answer
If you are searching for "Hermes 3 local" and are in a hurry: the 8B runs on any GPU with 6+ GB, the 70B needs 24 GB as a minimum, and the 405B is for clusters. Use the VRAM calculator before downloading anything.
| Variant | VRAM Q4 | Recommended GPU | GPU price |
|---|---|---|---|
| Hermes 3 8B | 4.8 GB | RTX 3060 12GB · RTX 4070 Super 12GB | €269 – €499 |
| Hermes 3 70B | 40 GB | RTX 4090 24GB (with offload) · 2× 3090 | €1,799+ |
| Hermes 3 405B | 230 GB | Multi-GPU clusters only | N/A at home |
For 95% of home users: Hermes 3 8B in Q4 is the sweet spot. Fits any GPU ≥6 GB, runs at 60–100 tok/s on an RTX 4070 Super, and inherits all the Nous Research improvements on top of Llama 3.1.
€499
NVIDIA GeForce RTX 4070 Super 12GB
Pros
- 12 GB VRAM for Hermes 3 8B Q8 with headroom
- ~75 tok/s on Hermes 3 8B Q4
- Ada Lovelace — top efficiency in its class
Cons
- Not enough for 70B on a single GPU
2. What Hermes 3 is and why it is not just another Llama
Hermes 3 is the third generation of models from Nous Research, an open source lab that specializes in high-quality finetunes. The Hermes 3 series is a full finetune of Llama 3.1 — same parameters, same architecture, same VRAM footprint. What changes is what comes out of the model when you talk to it.
Nous Research trained Hermes 3 with three clear goals: native structured function-calling, agent capabilities with <scratchpad>-style chain-of-thought reasoning, and neutral alignment — the model does not refuse legitimate requests or inject unnecessary disclaimers like Llama 3.1-Instruct does. It still has judgment, but it is not paternalistic.
Why the finetune matters
Llama 3.1 base is great as a foundation model, but the official Instruct version (trained by Meta) is heavily aligned toward rejecting anything ambiguous. For roleplay, fiction, coding assistance without babysitting, or technical analysis on frontier topics, the official Instruct causes constant friction. Hermes 3 removes that behavior while keeping the technical capability intact.
The three Hermes 3 variants
- Hermes 3 8B — based on Llama 3.1 8B. The home workhorse. HuggingFace: NousResearch/Hermes-3-Llama-3.1-8B.
- Hermes 3 70B — based on Llama 3.1 70B. Quality close to GPT-4 on many tasks. HuggingFace: NousResearch/Hermes-3-Llama-3.1-70B.
- Hermes 3 405B — based on Llama 3.1 405B. Frontier open model. HuggingFace: NousResearch/Hermes-3-Llama-3.1-405B. Clusters only.
All variants keep the 128K-token context window from Llama 3.1. That is roughly 200 pages of text — enough to pass an entire codebase, long logs or full books.
3. 8B, 70B and 405B variants: exact VRAM
The numbers below are the real ones from the model loaded with llama.cpp — they include KV cache overhead and base activations. Since Hermes 3 shares architecture with Llama 3.1, requirements are identical to the base model.
Hermes 3 8B
Recommended · home
| FP16 | Q8 | Q4 (rec.) | Q2 |
|---|---|---|---|
| 19.2 GB | 9.6 GB | 4.8 GB | 2.4 GB |
With Q4 (4.8 GB) it fits in any GPU ≥6 GB: RTX 3060 12GB, RTX 4060 8GB, RTX 4070 Super 12GB,
even a MacBook Air with 16 GB unified memory. With Q8 (9.6 GB) it fits comfortably in 12 GB.
Ollama tag: hermes3:8b.
Hermes 3 70B
High-end · workstation
| FP16 | Q8 | Q4 (rec.) | Q2 |
|---|---|---|---|
| 168 GB | 80 GB | 40 GB | 20 GB |
40 GB Q4 does not fit in a single consumer GPU. Real options: 2× RTX 3090 24GB (48 GB total, ideal),
RTX 4090 24GB with RAM offload (works, but slow at ~6 tok/s),
or Apple M3 Max 64GB / M2 Ultra 128GB (unified memory).
Ollama tag: hermes3:70b.
Hermes 3 405B
Clusters only
| FP16 | Q8 | Q4 | Q2 |
|---|---|---|---|
| 972 GB | 460 GB | 230 GB | 115 GB |
Not viable at home. Requires 8× A100 80GB or 4× H100 for Q4. Included here for completeness — if you have cluster access, HuggingFace Inference Endpoints or Together AI serve it.
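A quick way to sanity-check or extrapolate these tables: parameter count × bytes per weight × ~1.2 overhead reproduces the measured numbers within rounding. A minimal Python sketch; the 1.2 factor is our approximation for KV cache and activations, not an official figure:

# Rough VRAM estimate: parameter count × bytes per weight × overhead.
# The 1.2 overhead factor approximates KV cache + activations; treat the
# output as a ballpark, not a guarantee that a model fits.
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5, "Q2": 0.25}

def vram_gb(params_billion: float, quant: str) -> float:
    return params_billion * BYTES_PER_WEIGHT[quant] * 1.2

for name, params in [("Hermes 3 8B", 8), ("Hermes 3 70B", 70), ("Hermes 3 405B", 405)]:
    cells = " | ".join(f"{q} {vram_gb(params, q):.1f} GB" for q in BYTES_PER_WEIGHT)
    print(f"{name}: {cells}")
# Hermes 3 8B: FP16 19.2 GB | Q8 9.6 GB | Q4 4.8 GB | Q2 2.4 GB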
4. Hermes 3 vs Llama 3.1 base: when to pick each
Hermes 3 and Llama 3.1-Instruct have exactly the same VRAM footprint and speed — they are the same base model. The choice is about behavior, not hardware. If you already have Llama 3.1 installed, you can try Hermes 3 without changing a single GPU. For more background on Llama specifically, read our Llama vs Mistral comparison.
| Task | Llama 3.1-Instruct | Hermes 3 |
|---|---|---|
| General chat | Good | Equivalent |
| Structured function-calling | Inconsistent | Native |
| Roleplay / fiction | Refuses often | Fluent |
| Frontier technical analysis | Disclaimers | Direct |
| Chain-of-thought reasoning (agent) | Generic | <scratchpad> |
| Academic benchmarks (MMLU, GSM8K) | Baseline | ±1–2% difference |
Rule of thumb: if you are going to build an agent, integrate tool use, or do anything creative, Hermes 3. If you just want a basic chatbot with official Meta branding, Llama 3.1-Instruct. Both take exactly the same space on disk and VRAM.
5. Install with Ollama (step by step)
Ollama is the most direct way to run Hermes 3 locally. It handles download, quantization and a REST API with no configuration. Total time from zero: 5 minutes plus the model download (~5 GB for 8B Q4).
Step 1: Install Ollama
# Linux / macOS — one line
curl -fsSL https://ollama.ai/install.sh | sh
# Windows — download the installer from https://ollama.ai/download

Step 2: Download Hermes 3
# Hermes 3 8B — 4.8 GB VRAM Q4 — any GPU ≥6 GB
ollama pull hermes3:8b
# Hermes 3 70B — 40 GB VRAM Q4 — 2× 3090 or 4090 + offload
ollama pull hermes3:70b

Step 3: Run
# Interactive chat
ollama run hermes3:8b
# Direct query from terminal
ollama run hermes3:8b "Design an agent that reads my calendar and summarizes my day"
# REST API (automatic on localhost:11434)
curl http://localhost:11434/api/generate -d '{
"model": "hermes3:8b",
"prompt": "Hello Hermes",
"stream": false
}'

Extended context:
By default Ollama uses a 2048-token context window. To use more of Hermes 3's 128K, set num_ctx: type /set parameter num_ctx 32768 inside the interactive chat, or pass "options": {"num_ctx": 32768} in an API request (tune the value based on your VRAM).
Every extra 8K tokens adds ~1 GB VRAM on the 8B.
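From code, the same setting goes in the request options. A minimal Python sketch against the default local endpoint; the prompt is a placeholder:

import requests

# Query Hermes 3 through Ollama's REST API with an enlarged context.
# Remember: every extra 8K tokens of context costs ~1 GB VRAM on the 8B.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hermes3:8b",
        "prompt": "Summarize this log: ...",  # placeholder prompt
        "stream": False,
        "options": {"num_ctx": 32768},
    },
    timeout=300,
)
print(resp.json()["response"])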
6. HuggingFace and llama.cpp (alternative)
If you need the original raw weights (for further fine-tuning, vLLM, or custom quantizations), Hermes 3 models live on HuggingFace under the NousResearch account:
- NousResearch/Hermes-3-Llama-3.1-8B — original FP16 weights
- NousResearch/Hermes-3-Llama-3.1-8B-GGUF — already converted to GGUF for llama.cpp
- NousResearch/Hermes-3-Llama-3.1-70B — 70B version
- NousResearch/Hermes-3-Llama-3.1-405B — frontier version
# Download with huggingface-cli
pip install huggingface_hub
huggingface-cli download NousResearch/Hermes-3-Llama-3.1-8B-GGUF \
Hermes-3-Llama-3.1-8B.Q4_K_M.gguf --local-dir ./models
# Run with llama.cpp
./llama-server -m ./models/Hermes-3-Llama-3.1-8B.Q4_K_M.gguf \
-c 32768 --host 0.0.0.0 --port 8080

Use llama.cpp if you want full control over quantization (Q4_K_M, Q5_K_S, IQ4_XS), per-layer offloading, or integration into custom stacks. For 99% of users, Ollama is faster to get running.
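llama-server also exposes an OpenAI-compatible endpoint, so any OpenAI-style client can talk to it. A minimal sketch with plain requests, assuming the server command above is running on port 8080:

import requests

# llama-server serves /v1/chat/completions locally; no API key required.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello Hermes"}],
        "max_tokens": 256,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])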
7. Function-calling and agent capabilities
This is the killer feature of Hermes 3 over Llama 3.1 base. Nous Research trained the model with structured function-calling datasets, ReAct-style reasoning, and parallel tool use. JSON output is consistent — no need for brittle regex to parse it.
Function-calling format
<scratchpad>
The user is asking for the weather in Madrid. I need to call get_weather
before answering.
</scratchpad>
<tool_call>
{ "name": "get_weather", "arguments": { "city": "Madrid" } }
</tool_call>
The model emits <scratchpad> blocks for internal reasoning, then <tool_call> blocks with valid JSON. This integrates well with MCP (Model Context Protocol), LangChain, LlamaIndex and custom agent stacks.
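Because the JSON inside the tags is valid by construction, extracting a call from code is a tag split plus json.loads rather than regex over free text. A minimal sketch; the reply string is a canned example, not live model output:

import json

def extract_tool_calls(text: str) -> list:
    """Pull every <tool_call> block out of a Hermes 3 completion."""
    calls = []
    for chunk in text.split("<tool_call>")[1:]:
        payload = chunk.split("</tool_call>")[0]
        calls.append(json.loads(payload))
    return calls

reply = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Madrid"}}\n</tool_call>'
print(extract_tool_calls(reply))
# [{'name': 'get_weather', 'arguments': {'city': 'Madrid'}}]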
Agent capabilities
- Parallel tool use: can call several functions in a single turn
- Chain-of-thought reasoning: <scratchpad> to plan before acting
- Native JSON mode: structured output without temperature hacks
- Long system prompts: follows complex instructions without degrading
- Self-correction: if a tool call fails, it adjusts and retries
To build a real local agent with Hermes 3, the recommended combo is: Ollama (serving) + LangChain or LlamaIndex (orchestration) + your custom tools exposed as Python functions. With an RTX 4070 Super you get sub-second latency for most tool calls.
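To see the shape of that loop without any framework, here is a deliberately bare sketch against the Ollama API. get_weather is a stand-in for your own tool, the single-shot /api/generate calls skip proper chat history, and the <tool_response> wrapper follows the Hermes convention; check the model card for the exact system prompt before building on this:

import json
import requests

def get_weather(city: str) -> str:
    """Stand-in tool; replace with a real integration."""
    return f"Sunny, 21 degrees in {city}"

TOOLS = {"get_weather": get_weather}

def ask(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "hermes3:8b", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return resp.json()["response"]

reply = ask("What is the weather in Madrid?")
if "<tool_call>" in reply:
    # Execute the first tool call and feed the result back to the model.
    payload = reply.split("<tool_call>")[1].split("</tool_call>")[0]
    call = json.loads(payload)
    result = TOOLS[call["name"]](**call["arguments"])
    reply = ask(f"<tool_response>\n{result}\n</tool_response>\nAnswer the user now.")
print(reply)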
8. Real-world use cases
Local productivity agent
Reads calendar, emails, markdown notes and generates daily summaries. All local, nothing sent to the cloud. Hermes 3 8B + Ollama + MCP servers for each integration. Runs on an RTX 3060 12GB at ~35 tok/s without issues.
Coding copilot without restrictions
Unlike Llama 3.1-Instruct, Hermes 3 does not inject disclaimers when explaining security code, reverse engineering or educational exploits. For devs working on CTFs, pentesting or malware analysis, it is a practical pick. Integrates well with Continue.dev or Cline.
Roleplay and creative writing
Hermes 3 keeps characters consistent across long sessions (128K context helps), without breaking role with "as a language model I cannot...". For narrative design, writing assistants, or SillyTavern-style frontends, it is one of the best open models available.
RAG over private documentation
With 128K of context, you can drop an entire technical PDF into the prompt and ask questions without needing complex embeddings for small-to-medium projects. For larger corpora, pair it with a vector store (Qdrant, Weaviate).
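For that small-to-medium case, "drop the document in the prompt" really is just string concatenation. A minimal sketch; the file name and question are placeholders, and the PDF is assumed to be already extracted to plain text. Size num_ctx to your VRAM (section 5: ~1 GB per extra 8K tokens on the 8B):

import pathlib
import requests

# Stuff an entire document into the prompt and ask a question over it.
doc = pathlib.Path("manual.txt").read_text()  # placeholder: your extracted PDF text
prompt = f"Document:\n{doc}\n\nQuestion: What are the safety limits?"

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hermes3:8b",
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": 32768},  # must cover the document length
    },
    timeout=600,
)
print(resp.json()["response"])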
9. Benchmarks and expected speed
Speeds estimated from each GPU's memory bandwidth and the model size. Since Hermes 3 shares architecture with Llama 3.1, tokens/sec are essentially identical to Llama 3.1.
| GPU | Hermes 3 8B Q4 | Hermes 3 8B Q8 | Hermes 3 70B Q4 |
|---|---|---|---|
| RTX 3060 12GB | ~35 tok/s | ~22 tok/s | Not viable |
| RTX 4060 Ti 8GB | ~55 tok/s | Does not fit | Not viable |
| RTX 4070 Super 12GB | ~75 tok/s | ~48 tok/s | Not viable |
| RTX 4090 24GB | ~110 tok/s | ~70 tok/s | ~6 tok/s (offload) |
| 2× RTX 3090 24GB | ~85 tok/s | ~55 tok/s | ~15 tok/s |
| M4 Max 64GB | ~62 tok/s | ~38 tok/s | ~9 tok/s |
To compare options and reason about your specific case, also see our guide on DeepSeek R1 locally, which covers reasoning with similar hardware.
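The estimation model behind the table is simple: generating one token streams the whole quantized model out of VRAM, so tokens per second is roughly memory bandwidth divided by model size, scaled by an efficiency factor. A sketch with published bandwidth specs; the 0.6 factor is our rough average, and because the real factor varies per GPU the output will not match every cell exactly:

# tok/s ≈ memory bandwidth / model size × efficiency.
# Bandwidths are published specs (GB/s); 0.6 is an assumed average efficiency.
GPUS = {"RTX 3060 12GB": 360, "RTX 4070 Super 12GB": 504, "RTX 4090 24GB": 1008}

def tokens_per_second(bandwidth_gbs: float, model_gb: float, efficiency: float = 0.6) -> float:
    return bandwidth_gbs / model_gb * efficiency

for gpu, bw in GPUS.items():
    print(f"{gpu}: ~{tokens_per_second(bw, 4.8):.0f} tok/s on Hermes 3 8B Q4")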
Recommended hardware for Hermes 3
Four GPUs that cover the four real scenarios: budget entry for 8B, sweet spot for 8B with headroom, premium for 8B in Q8, and workstation for 70B.
€269
RTX 3060 12GB
Pros
- 12 GB VRAM — Hermes 3 8B Q8 fits comfortably
- Unbeatable entry price (~€269)
- Ideal to start with local agents
Cons
- Lower memory bandwidth than Ada cards
- Not enough for 70B
€399
RTX 4060 Ti 8GB
Pros
- Ada Lovelace — efficiency per watt
- Hermes 3 8B Q4 with headroom
- Good price/performance balance
Cons
- 8 GB rules out Q8 of the 8B
- Long context pressures VRAM
€499
RTX 4070 Super 12GB
Pros
- Best pick for Hermes 3 8B in 2026
- Q8 comfortable + 32K context easy
- ~75 tok/s in Q4
Cons
- Price above RTX 3060
- 12 GB tight for 8B + 128K context
€1799
RTX 4090 24GB
Pros
- 24 GB VRAM — only consumer GPU for Hermes 3 70B with offload
- Ada Lovelace flagship
- Peak speed on 8B (~110 tok/s)
Cons
- Flagship price (~€1,799)
- 450W typical draw
As an Amazon Associate we earn from qualifying purchases. This does not affect our recommendations.
10. Frequently asked questions
What hardware do I need to run Hermes 3?
Hermes 3 8B runs on RTX 3060 12GB (~€269) in Q4 (4.8 GB VRAM). Hermes 3 70B needs RTX 4090 24GB (~€1,799) with offloading, or 2× RTX 3090 without offloading. The 405B is for clusters and does not run on home hardware.
What makes Hermes 3 different from Llama 3.1 base?
Hermes 3 is a full Nous Research finetune of Llama 3.1: native structured function-calling, agent capabilities with chain-of-thought reasoning, and neutral alignment without the aggressive refusals of Llama-Instruct. Same hardware, noticeably different behavior.
How do I install Hermes 3 with Ollama?
One command: ollama pull hermes3:8b for the 8B, ollama pull hermes3:70b for the 70B.
Then ollama run hermes3:8b to start the chat.
Is Hermes 3 censored or restricted?
Hermes 3 uses neutral alignment: it does not refuse legitimate requests by default or inject unnecessary disclaimers. It is not strictly "uncensored" — it still has judgment — but it behaves like a useful model instead of one that avoids topics. For roleplay, fiction and creative tasks, it makes a real difference versus Llama-Instruct.
Is Hermes 3 worth it for function-calling?
Yes — it is one of the best open models for tool use in 2026. Nous Research specifically trained Hermes 3 with structured function-calling datasets. Consistent JSON output, parallel calls, and chain-of-thought reasoning before invoking tools.
Calculate your exact case
Pick your GPU and the Hermes 3 variant. The calculator tells you if it fits in VRAM and how many tokens/sec to expect.
Calculate my GPU now →