Technical guide · 17 min read

By the editorial team at RunAIatHome. Tested on local AI builds; where we have real measurements, we report them instead of estimates.

Independently tested. Amazon affiliate links fund our work. We only link products we would recommend anyway — at no extra cost to you.
Alex Chen · AI Hardware Specialist
GitHub: github.com/javier-morales-ia

Hermes 3 locally: requirements, installation and complete 2026 guide

The Nous Research finetune of Llama 3.1 that turns the base model into a useful agent: structured function-calling, roleplay without aggressive refusals, and chain-of-thought reasoning. Hermes 3 locally runs on consumer hardware with the same footprint as Llama 3.1 — but behaves very differently.

Reference pricing: RTX 3060 12GB ~€269 · RTX 4070 Super ~€499 · RTX 4090 ~€1,799.

What hardware do I need to run Hermes 3?

Hermes 3 8B runs on RTX 3060 12GB (~€269) in Q4, Hermes 3 70B needs RTX 4090 24GB (~€1,799)

1. The quick answer

If you want to run Hermes 3 locally and are in a hurry: the 8B runs on any GPU with 6+ GB of VRAM, the 70B needs at least 24 GB, and the 405B is for clusters. Use the VRAM calculator before downloading anything.

Variant | VRAM Q4 | Recommended GPU | GPU price
Hermes 3 8B | 4.8 GB | RTX 3060 12GB · RTX 4070 Super 12GB | €269 – €499
Hermes 3 70B | 40 GB | RTX 4090 24GB (with offload) · 2× 3090 | €1,799+
Hermes 3 405B | 230 GB | Multi-GPU clusters only | N/A at home

For 95% of home users: Hermes 3 8B in Q4 is the sweet spot. Fits any GPU ≥6 GB, runs at 60–100 tok/s on an RTX 4070 Super, and inherits all the Nous Research improvements on top of Llama 3.1.

NVIDIA GeForce RTX 4070 Super 12GB

€499

Amazon Prime · 4.7 (520 reviews)

Pros

  • 12 GB VRAM for Hermes 3 8B Q8 with headroom
  • ~75 tok/s on Hermes 3 8B Q4
  • Ada Lovelace — top efficiency in its class

Cons

  • Not enough for 70B on a single GPU

2. What Hermes 3 is and why it is not just another Llama

Hermes 3 is the third generation of models from Nous Research, an open source lab that specializes in high-quality finetunes. The Hermes 3 series is a full finetune of Llama 3.1 — same parameters, same architecture, same VRAM footprint. What changes is what comes out of the model when you talk to it.

Nous Research trained Hermes 3 with three clear goals: native structured function-calling, agent capabilities with <scratchpad>-style chain-of-thought reasoning, and neutral alignment — the model does not refuse legitimate requests or inject unnecessary disclaimers like Llama 3.1-Instruct does. It still has judgment, but it is not paternalistic.

Why the finetune matters

Llama 3.1 base is great as a foundation model, but the official Instruct version (trained by Meta) is heavily aligned toward rejecting anything ambiguous. For roleplay, fiction, coding assistance without babysitting, or technical analysis on frontier topics, the official Instruct causes constant friction. Hermes 3 removes that behavior while keeping the technical capability intact.

The three Hermes 3 variants

  • Hermes 3 8B — based on Llama 3.1 8B. The home workhorse. HuggingFace: NousResearch/Hermes-3-Llama-3.1-8B.
  • Hermes 3 70B — based on Llama 3.1 70B. Quality close to GPT-4 on many tasks. HuggingFace: NousResearch/Hermes-3-Llama-3.1-70B.
  • Hermes 3 405B — based on Llama 3.1 405B. Frontier open model. HuggingFace: NousResearch/Hermes-3-Llama-3.1-405B. Clusters only.

All variants keep the 128K-token context window from Llama 3.1. That is ~100 pages of text — enough to pass an entire codebase, long logs or full books.

3. 8B, 70B and 405B variants: exact VRAM

The numbers below are measured with the model actually loaded in llama.cpp; they include KV cache overhead and base activations. Since Hermes 3 shares its architecture with Llama 3.1, requirements are identical to the base model.

Hermes 3 8B

Recommended · home
FP16: 19.2 GB · Q8: 9.6 GB · Q4 (rec.): 4.8 GB · Q2: 2.4 GB

With Q4 (4.8 GB) it fits in any GPU ≥6 GB: RTX 3060 12GB, RTX 4060 8GB, RTX 4070 Super 12GB, even a MacBook Air with 16 GB unified memory. With Q8 (9.6 GB) it fits comfortably in 12 GB. Ollama tag: hermes3:8b.

Hermes 3 70B

High-end · workstation
FP16: 168 GB · Q8: 80 GB · Q4 (rec.): 40 GB · Q2: 20 GB

40 GB Q4 does not fit in a single consumer GPU. Real options: 2× RTX 3090 24GB (48 GB total, ideal), RTX 4090 24GB with RAM offload (works, slower ~6 tok/s), or Apple M3 Max 64GB / M2 Ultra 128GB (unified memory). Ollama tag: hermes3:70b.

Hermes 3 405B

Clusters only
FP16: 972 GB · Q8: 460 GB · Q4: 230 GB · Q2: 115 GB

Not viable at home. Requires 8× A100 80GB or 4× H100 for Q4. Included here for completeness — if you have cluster access, HuggingFace Inference Endpoints or Together AI serve it.
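
The per-variant numbers above follow a simple rule of thumb: parameters × bytes per weight, plus overhead for KV cache and activations. A minimal sketch; the 1.2 overhead factor is our assumption, tuned to reproduce the tables above, not an official formula:

```python
# Rough VRAM rule of thumb: parameters x bytes/weight, plus ~20% overhead
# for KV cache and activations. The 1.2 factor is an assumption tuned to
# match the tables above.
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5, "Q2": 0.25}

def vram_gb(params_billion: float, quant: str, overhead: float = 1.2) -> float:
    """Approximate GB needed to load the weights with headroom."""
    return round(params_billion * BYTES_PER_WEIGHT[quant] * overhead, 1)

print(vram_gb(8, "Q4"))     # → 4.8, matching the 8B table
print(vram_gb(70, "FP16"))  # → 168.0, matching the 70B table
```

Plug in your own quant and parameter count to sanity-check a download before starting it.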

4. Hermes 3 vs Llama 3.1 base: when to pick each

Hermes 3 and Llama 3.1-Instruct have exactly the same VRAM footprint and speed; they share the same base model. The choice is about behavior, not hardware. If you already have Llama 3.1 installed, you can try Hermes 3 without any hardware change. For more background on Llama specifically, read our Llama vs Mistral comparison.

Task | Llama 3.1-Instruct | Hermes 3
General chat | Good | Equivalent
Structured function-calling | Inconsistent | Native
Roleplay / fiction | Refuses often | Fluent
Frontier technical analysis | Disclaimers | Direct
Chain-of-thought reasoning (agent) | Generic | <scratchpad>
Academic benchmarks (MMLU, GSM8K) | Baseline | ±1–2% difference

Rule of thumb: if you are going to build an agent, integrate tool use, or do anything creative, Hermes 3. If you just want a basic chatbot with official Meta branding, Llama 3.1-Instruct. Both take exactly the same space on disk and VRAM.

5. Install with Ollama (step by step)

Ollama is the most direct way to run Hermes 3 locally. It handles download, quantization and a REST API with no configuration. Total time from zero: 5 minutes plus the model download (~5 GB for 8B Q4).

Step 1: Install Ollama

# Linux / macOS — one line
curl -fsSL https://ollama.ai/install.sh | sh

# Windows — download the installer from https://ollama.ai/download

Step 2: Download Hermes 3

# Hermes 3 8B — 4.8 GB VRAM Q4 — any GPU ≥6 GB
ollama pull hermes3:8b

# Hermes 3 70B — 40 GB VRAM Q4 — 2× 3090 or 4090 + offload
ollama pull hermes3:70b

Step 3: Run

# Interactive chat
ollama run hermes3:8b

# Direct query from terminal
ollama run hermes3:8b "Design an agent that reads my calendar and summarizes my day"

# REST API (automatic on localhost:11434)
curl http://localhost:11434/api/generate -d '{
  "model": "hermes3:8b",
  "prompt": "Hello Hermes",
  "stream": false
}'
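
The same endpoint can be called from Python with nothing beyond the standard library. A minimal sketch; the helper names are ours, while the payload fields match the generate request shown above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> bytes:
    """Serialize a non-streaming request body for /api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """POST the prompt and return the full response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With Ollama running: generate("hermes3:8b", "Hello Hermes")
```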

Extended context: by default Ollama uses a 2048-token context window. To use more of Hermes 3's 128K, raise num_ctx: type /set parameter num_ctx 32768 inside the interactive chat, or pass "options": {"num_ctx": 32768} in an API request (tune based on VRAM). Every extra 8K tokens adds ~1 GB of VRAM on the 8B.
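
If you always want the larger window, you can bake it into its own tag with a Modelfile (the hermes3-8b-32k name is our choice; FROM and PARAMETER num_ctx are standard Ollama Modelfile syntax):

```
FROM hermes3:8b
PARAMETER num_ctx 32768
```

Build it with ollama create hermes3-8b-32k -f Modelfile, then ollama run hermes3-8b-32k as usual.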

6. HuggingFace and llama.cpp (alternative)

If you need the original raw weights (for further fine-tuning, vLLM, or custom quantizations), Hermes 3 models live on HuggingFace under the NousResearch account:

  • NousResearch/Hermes-3-Llama-3.1-8B — original FP16 weights
  • NousResearch/Hermes-3-Llama-3.1-8B-GGUF — already converted to GGUF for llama.cpp
  • NousResearch/Hermes-3-Llama-3.1-70B — 70B version
  • NousResearch/Hermes-3-Llama-3.1-405B — frontier version
# Download with huggingface-cli
pip install huggingface_hub
huggingface-cli download NousResearch/Hermes-3-Llama-3.1-8B-GGUF \
  Hermes-3-Llama-3.1-8B.Q4_K_M.gguf --local-dir ./models

# Run with llama.cpp
./llama-server -m ./models/Hermes-3-Llama-3.1-8B.Q4_K_M.gguf \
  -c 32768 --host 0.0.0.0 --port 8080

Use llama.cpp if you want full control over quantization (Q4_K_M, Q5_K_S, IQ4_XS), per-layer offloading, or integration into custom stacks. For 99% of users, Ollama is faster to get running.

7. Function-calling and agent capabilities

This is the killer feature of Hermes 3 over Llama 3.1 base. Nous Research trained the model with structured function-calling datasets, ReAct-style reasoning, and parallel tool use. JSON output is consistent — no need for brittle regex to parse it.

Function-calling format

<tool_call>
{ "name": "get_weather", "arguments": { "city": "Madrid" } }
</tool_call>

<scratchpad>
The user is asking for the weather in Madrid. I need to call get_weather
before answering.
</scratchpad>

The model emits <tool_call> blocks with valid JSON, and <scratchpad> blocks for internal reasoning. Integrates well with MCP (Model Context Protocol), LangChain, LlamaIndex and custom agent stacks.
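
Because the JSON inside <tool_call> is consistent, extracting calls client-side reduces to finding the tags and parsing their body. A minimal sketch; the regex and function name are ours, not part of any library:

```python
import json
import re

# Hermes 3 wraps each call in <tool_call>…</tool_call> with a JSON body.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(completion: str) -> list[dict]:
    """Return parsed {'name': ..., 'arguments': ...} dicts, skipping malformed JSON."""
    calls = []
    for raw in TOOL_CALL_RE.findall(completion):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # ignore blocks the model garbled
    return calls

reply = '<tool_call>\n{ "name": "get_weather", "arguments": { "city": "Madrid" } }\n</tool_call>'
print(parse_tool_calls(reply)[0]["name"])  # → get_weather
```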

Agent capabilities

  • Parallel tool use: can call several functions in a single turn
  • Chain-of-thought reasoning: <scratchpad> to plan before acting
  • Native JSON mode: structured output without temperature hacks
  • Long system prompts: follows complex instructions without degrading
  • Self-correction: if a tool call fails, it adjusts and retries

To build a real local agent with Hermes 3, the recommended combo is: Ollama (serving) + LangChain or LlamaIndex (orchestration) + your custom tools exposed as Python functions. With an RTX 4070 Super you get sub-second latency for most tool calls.
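
The orchestration half of that combo boils down to mapping the name in each parsed tool call onto a Python function. A toy sketch; get_weather and TOOLS are illustrative stand-ins for your own tools, not library APIs:

```python
# Custom tools exposed as plain Python functions, selected by the name
# the model put in its <tool_call> block.

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would hit a weather API

TOOLS = {"get_weather": get_weather}

def dispatch(call: dict) -> str:
    """Run the tool named in a parsed tool call and return its result."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        # Feed the error back to the model so it can self-correct and retry
        return f"Unknown tool: {call['name']}"
    return fn(**call["arguments"])

print(dispatch({"name": "get_weather", "arguments": {"city": "Madrid"}}))  # → Sunny in Madrid
```

The result string goes back into the conversation as the tool's output, and the loop repeats until the model answers without calling a tool.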

8. Real-world use cases

Local productivity agent

Reads calendar, emails, markdown notes and generates daily summaries. All local, nothing sent to the cloud. Hermes 3 8B + Ollama + MCP servers for each integration. Runs on an RTX 3060 12GB at ~35 tok/s without issues.

Coding copilot without restrictions

Unlike Llama 3.1-Instruct, Hermes 3 does not inject disclaimers when explaining security code, reverse engineering or educational exploits. For devs working on CTFs, pentesting or malware analysis, it is a practical pick. Integrates well with Continue.dev or Cline.

Roleplay and creative writing

Hermes 3 keeps characters consistent across long sessions (128K context helps), without breaking role with "as a language model I cannot...". For narrative design, writing assistants, or SillyTavern-style frontends, it is one of the best open models available.

RAG over private documentation

With 128K of context, you can drop an entire technical PDF into the prompt and ask questions without needing complex embeddings for small-to-medium projects. For larger corpora, pair it with a vector store (Qdrant, Weaviate).
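
For the small-to-medium case, "RAG" can literally be prompt stuffing. A naive sketch; the 4-characters-per-token budget is a rough assumption, and the helper name is ours:

```python
# Naive long-context RAG: put the whole document in the prompt and ask.
# A crude 4-chars-per-token estimate guards against overflowing num_ctx.

def build_rag_prompt(document: str, question: str, ctx_tokens: int = 32768) -> str:
    """Build a single prompt holding the full document plus the question."""
    budget_chars = (ctx_tokens - 512) * 4  # reserve ~512 tokens for the answer
    return (
        "Answer using only the document below.\n\n"
        f"<document>\n{document[:budget_chars]}\n</document>\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt("Hermes 3 keeps Llama 3.1's 128K context.", "What context size?")
```

Send the resulting prompt through Ollama as usual; move to a vector store only when the corpus stops fitting in context.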

9. Benchmarks and expected speed

Speeds are estimated from each GPU's memory bandwidth and the model size. Since Hermes 3 shares its architecture with Llama 3.1, tokens/sec are essentially identical to Llama 3.1.

GPU | Hermes 3 8B Q4 | Hermes 3 8B Q8 | Hermes 3 70B Q4
RTX 3060 12GB | ~35 tok/s | ~22 tok/s | Not viable
RTX 4060 Ti 8GB | ~55 tok/s | Does not fit | Not viable
RTX 4070 Super 12GB | ~75 tok/s | ~48 tok/s | Not viable
RTX 4090 24GB | ~110 tok/s | ~70 tok/s | ~6 tok/s (offload)
2× RTX 3090 24GB | ~85 tok/s | ~55 tok/s | ~15 tok/s
M4 Max 64GB | ~62 tok/s | ~38 tok/s | ~9 tok/s

To compare options and reason about your specific case, also see our guide on DeepSeek R1 locally, which covers reasoning with similar hardware.

10. Frequently asked questions

What hardware do I need to run Hermes 3?

Hermes 3 8B runs on RTX 3060 12GB (~€269) in Q4 (4.8 GB VRAM). Hermes 3 70B needs RTX 4090 24GB (~€1,799) with offloading, or 2× RTX 3090 without offloading. The 405B is for clusters and does not run on home hardware.

What makes Hermes 3 different from Llama 3.1 base?

Hermes 3 is a full Nous Research finetune of Llama 3.1: native structured function-calling, agent capabilities with chain-of-thought reasoning, and neutral alignment without the aggressive refusals of Llama-Instruct. Same hardware, noticeably different behavior.

How do I install Hermes 3 with Ollama?

One command: ollama pull hermes3:8b for the 8B, ollama pull hermes3:70b for the 70B. Then ollama run hermes3:8b to start the chat.

Is Hermes 3 censored or restricted?

Hermes 3 uses neutral alignment: it does not refuse legitimate requests by default or inject unnecessary disclaimers. It is not strictly "uncensored" — it still has judgment — but it behaves like a useful model instead of one that avoids topics. For roleplay, fiction and creative tasks, it makes a real difference versus Llama-Instruct.

Is Hermes 3 worth it for function-calling?

Yes — it is one of the best open models for tool use in 2026. Nous Research specifically trained Hermes 3 with structured function-calling datasets. Consistent JSON output, parallel calls, and chain-of-thought reasoning before invoking tools.

Calculate your exact case

Pick your GPU and the Hermes 3 variant. The calculator tells you if it fits in VRAM and how many tokens/sec to expect.

