By the editorial team at RunAIatHome. Tested on local AI builds; we only estimate where we have no real measurements.
Hermes 3 locally: requirements, installation and complete 2026 guide
The Nous Research finetune of Llama 3.1 that turns the base model into a useful agent: structured function-calling, roleplay without aggressive refusals, and chain-of-thought reasoning. Hermes 3 locally runs on consumer hardware with the same footprint as Llama 3.1 — but behaves very differently.
Reference pricing: RTX 3060 12GB ~€269 · RTX 4070 Super ~€499 · RTX 4090 ~€1,799.
What hardware do I need to run Hermes 3?
Hermes 3 8B runs on RTX 3060 12GB (~€269) in Q4, Hermes 3 70B needs RTX 4090 24GB (~€1,799)
1. The quick answer
If you are searching for "Hermes 3 local" and are in a hurry: the 8B runs on any GPU with 6+ GB, the 70B needs 24 GB as a minimum, and the 405B is for clusters. Use the VRAM calculator before downloading anything.
| Variant | VRAM Q4 | Recommended GPU | GPU price |
|---|---|---|---|
| Hermes 3 8B | 4.8 GB | RTX 3060 12GB · RTX 4070 Super 12GB | €269 – €499 |
| Hermes 3 70B | 40 GB | RTX 4090 24GB (with offload) · 2× 3090 | €1,799+ |
| Hermes 3 405B | 230 GB | Multi-GPU clusters only | N/A at home |
For 95% of home users: Hermes 3 8B in Q4 is the sweet spot. Fits any GPU ≥6 GB, runs at 60–100 tok/s on an RTX 4070 Super, and inherits all the Nous Research improvements on top of Llama 3.1.
€499
NVIDIA GeForce RTX 4070 Super 12GB
Pros
- 12 GB VRAM for Hermes 3 8B Q8 with headroom
- ~75 tok/s on Hermes 3 8B Q4
- Ada Lovelace — top efficiency in its class
Cons
- Not enough for 70B on a single GPU
2. What Hermes 3 is and why it is not just another Llama
Hermes 3 is the third generation of models from Nous Research, an open source lab that specializes in high-quality finetunes. The Hermes 3 series is a full finetune of Llama 3.1 — same parameters, same architecture, same VRAM footprint. What changes is what comes out of the model when you talk to it.
Nous Research trained Hermes 3 with three clear goals: native structured function-calling, agent capabilities with <scratchpad>-style chain-of-thought reasoning, and neutral alignment — the model does not refuse legitimate requests or inject unnecessary disclaimers like Llama 3.1-Instruct does. It still has judgment, but it is not paternalistic.
Why the finetune matters
Llama 3.1 base is great as a foundation model, but the official Instruct version (trained by Meta) is heavily aligned toward rejecting anything ambiguous. For roleplay, fiction, coding assistance without babysitting, or technical analysis on frontier topics, the official Instruct causes constant friction. Hermes 3 removes that behavior while keeping the technical capability intact.
The three Hermes 3 variants
- Hermes 3 8B — based on Llama 3.1 8B. The home workhorse. HuggingFace: NousResearch/Hermes-3-Llama-3.1-8B.
- Hermes 3 70B — based on Llama 3.1 70B. Quality close to GPT-4 on many tasks. HuggingFace: NousResearch/Hermes-3-Llama-3.1-70B.
- Hermes 3 405B — based on Llama 3.1 405B. Frontier open model. HuggingFace: NousResearch/Hermes-3-Llama-3.1-405B. Clusters only.
All variants keep the 128K-token context window from Llama 3.1. That is roughly 200 pages of text — enough to pass an entire codebase, long logs or full books.
3. 8B, 70B and 405B variants: exact VRAM
The numbers below are the real ones from the model loaded with llama.cpp — they include KV cache overhead and base activations. Since Hermes 3 shares architecture with Llama 3.1, requirements are identical to the base model.
Hermes 3 8B
Recommended · home
| FP16 | Q8 | Q4 (rec.) | Q2 |
|---|---|---|---|
| 19.2 GB | 9.6 GB | 4.8 GB | 2.4 GB |
With Q4 (4.8 GB) it fits in any GPU ≥6 GB: RTX 3060 12GB, RTX 4060 8GB, RTX 4070 Super 12GB,
even a MacBook Air with 16 GB unified memory. With Q8 (9.6 GB) it fits comfortably in 12 GB.
Ollama tag: hermes3:8b.
Hermes 3 70B
High-end · workstation
| FP16 | Q8 | Q4 (rec.) | Q2 |
|---|---|---|---|
| 168 GB | 80 GB | 40 GB | 20 GB |
40 GB Q4 does not fit in a single consumer GPU. Real options: 2× RTX 3090 24GB (48 GB total, ideal),
RTX 4090 24GB with RAM offload (works, but slow at ~6 tok/s),
or Apple M3 Max 64GB / M2 Ultra 128GB (unified memory).
Ollama tag: hermes3:70b.
Hermes 3 405B
Clusters only
| FP16 | Q8 | Q4 | Q2 |
|---|---|---|---|
| 972 GB | 460 GB | 230 GB | 115 GB |
Not viable at home. Requires 8× A100 80GB or 4× H100 for Q4. Included here for completeness — if you have cluster access, HuggingFace Inference Endpoints or Together AI serve it.
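A quick way to sanity-check or extrapolate these tables: parameter count × bytes per weight × ~1.2 overhead reproduces the measured numbers within rounding. A minimal Python sketch; the 1.2 factor is our approximation for KV cache and activations, not an official figure:

# Rough VRAM estimate: parameter count × bytes per weight × overhead.
# The 1.2 overhead factor approximates KV cache + activations; treat the
# output as a ballpark, not a guarantee that a model fits.
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5, "Q2": 0.25}

def vram_gb(params_billion: float, quant: str) -> float:
    return params_billion * BYTES_PER_WEIGHT[quant] * 1.2

for name, params in [("Hermes 3 8B", 8), ("Hermes 3 70B", 70), ("Hermes 3 405B", 405)]:
    cells = " | ".join(f"{q} {vram_gb(params, q):.1f} GB" for q in BYTES_PER_WEIGHT)
    print(f"{name}: {cells}")
# Hermes 3 8B: FP16 19.2 GB | Q8 9.6 GB | Q4 4.8 GB | Q2 2.4 GB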
4. Hermes 3 vs Llama 3.1 base: when to pick each
Hermes 3 and Llama 3.1-Instruct have exactly the same VRAM footprint and speed — they are the same base model. The choice is about behavior, not hardware. If you already have Llama 3.1 installed, you can try Hermes 3 without changing a single GPU. For more background on Llama specifically, read our Llama vs Mistral comparison.
| Task | Llama 3.1-Instruct | Hermes 3 |
|---|---|---|
| General chat | Good | Equivalent |
| Structured function-calling | Inconsistent | Native |
| Roleplay / fiction | Refuses often | Fluent |
| Frontier technical analysis | Disclaimers | Direct |
| Chain-of-thought reasoning (agent) | Generic | <scratchpad> |
| Academic benchmarks (MMLU, GSM8K) | Baseline | ±1–2% difference |
Rule of thumb: if you are going to build an agent, integrate tool use, or do anything creative, Hermes 3. If you just want a basic chatbot with official Meta branding, Llama 3.1-Instruct. Both take exactly the same space on disk and VRAM.
5. Install with Ollama (step by step)
Ollama is the most direct way to run Hermes 3 locally. It handles download, quantization and a REST API with no configuration. Total time from zero: 5 minutes plus the model download (~5 GB for 8B Q4).
Step 1: Install Ollama
# Linux / macOS — one line
curl -fsSL https://ollama.ai/install.sh | sh
# Windows — download the installer from https://ollama.ai/download

Step 2: Download Hermes 3
# Hermes 3 8B — 4.8 GB VRAM Q4 — any GPU ≥6 GB
ollama pull hermes3:8b
# Hermes 3 70B — 40 GB VRAM Q4 — 2× 3090 or 4090 + offload
ollama pull hermes3:70b

Step 3: Run
# Interactive chat
ollama run hermes3:8b
# Direct query from terminal
ollama run hermes3:8b "Design an agent that reads my calendar and summarizes my day"
# REST API (automatic on localhost:11434)
curl http://localhost:11434/api/generate -d '{
"model": "hermes3:8b",
"prompt": "Hello Hermes",
"stream": false
}'

Extended context:
By default Ollama uses a 2048-token context window. To use more of Hermes 3's 128K, set num_ctx: type /set parameter num_ctx 32768 inside the interactive chat, or pass "options": {"num_ctx": 32768} in an API request (tune the value based on your VRAM).
Every extra 8K tokens adds ~1 GB VRAM on the 8B.
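From code, the same setting goes in the request options. A minimal Python sketch against the default local endpoint; the prompt is a placeholder:

import requests

# Query Hermes 3 through Ollama's REST API with an enlarged context.
# Remember: every extra 8K tokens of context costs ~1 GB VRAM on the 8B.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hermes3:8b",
        "prompt": "Summarize this log: ...",  # placeholder prompt
        "stream": False,
        "options": {"num_ctx": 32768},
    },
    timeout=300,
)
print(resp.json()["response"])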
6. HuggingFace and llama.cpp (alternative)
If you need the original raw weights (for further fine-tuning, vLLM, or custom quantizations), Hermes 3 models live on HuggingFace under the NousResearch account:
- NousResearch/Hermes-3-Llama-3.1-8B — original FP16 weights
- NousResearch/Hermes-3-Llama-3.1-8B-GGUF — already converted to GGUF for llama.cpp
- NousResearch/Hermes-3-Llama-3.1-70B — 70B version
- NousResearch/Hermes-3-Llama-3.1-405B — frontier version
# Download with huggingface-cli
pip install huggingface_hub
huggingface-cli download NousResearch/Hermes-3-Llama-3.1-8B-GGUF \
Hermes-3-Llama-3.1-8B.Q4_K_M.gguf --local-dir ./models
# Run with llama.cpp
./llama-server -m ./models/Hermes-3-Llama-3.1-8B.Q4_K_M.gguf \
-c 32768 --host 0.0.0.0 --port 8080

Use llama.cpp if you want full control over quantization (Q4_K_M, Q5_K_S, IQ4_XS), per-layer offloading, or integration into custom stacks. For 99% of users, Ollama is faster to get running.
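llama-server also exposes an OpenAI-compatible endpoint, so any OpenAI-style client can talk to it. A minimal sketch with plain requests, assuming the server command above is running on port 8080:

import requests

# llama-server serves /v1/chat/completions locally; no API key required.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello Hermes"}],
        "max_tokens": 256,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])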
7. Function-calling and agent capabilities
This is the killer feature of Hermes 3 over Llama 3.1 base. Nous Research trained the model with structured function-calling datasets, ReAct-style reasoning, and parallel tool use. JSON output is consistent — no need for brittle regex to parse it.
Function-calling format
<scratchpad>
The user is asking for the weather in Madrid. I need to call get_weather
before answering.
</scratchpad>
<tool_call>
{ "name": "get_weather", "arguments": { "city": "Madrid" } }
</tool_call>
The model emits <scratchpad> blocks for internal reasoning, then <tool_call> blocks with valid JSON. This integrates well with MCP (Model Context Protocol), LangChain, LlamaIndex and custom agent stacks.
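Because the JSON inside the tags is valid by construction, extracting a call from code is a tag split plus json.loads rather than regex over free text. A minimal sketch; the reply string is a canned example, not live model output:

import json

def extract_tool_calls(text: str) -> list:
    """Pull every <tool_call> block out of a Hermes 3 completion."""
    calls = []
    for chunk in text.split("<tool_call>")[1:]:
        payload = chunk.split("</tool_call>")[0]
        calls.append(json.loads(payload))
    return calls

reply = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Madrid"}}\n</tool_call>'
print(extract_tool_calls(reply))
# [{'name': 'get_weather', 'arguments': {'city': 'Madrid'}}]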
Agent capabilities
- Parallel tool use: can call several functions in a single turn
- Chain-of-thought reasoning: <scratchpad> to plan before acting
- Native JSON mode: structured output without temperature hacks
- Long system prompts: follows complex instructions without degrading
- Self-correction: if a tool call fails, it adjusts and retries
To build a real local agent with Hermes 3, the recommended combo is: Ollama (serving) + LangChain or LlamaIndex (orchestration) + your custom tools exposed as Python functions. With an RTX 4070 Super you get sub-second latency for most tool calls.
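To see the shape of that loop without any framework, here is a deliberately bare sketch against the Ollama API. get_weather is a stand-in for your own tool, the single-shot /api/generate calls skip proper chat history, and the <tool_response> wrapper follows the Hermes convention; check the model card for the exact system prompt before building on this:

import json
import requests

def get_weather(city: str) -> str:
    """Stand-in tool; replace with a real integration."""
    return f"Sunny, 21 degrees in {city}"

TOOLS = {"get_weather": get_weather}

def ask(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "hermes3:8b", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return resp.json()["response"]

reply = ask("What is the weather in Madrid?")
if "<tool_call>" in reply:
    # Execute the first tool call and feed the result back to the model.
    payload = reply.split("<tool_call>")[1].split("</tool_call>")[0]
    call = json.loads(payload)
    result = TOOLS[call["name"]](**call["arguments"])
    reply = ask(f"<tool_response>\n{result}\n</tool_response>\nAnswer the user now.")
print(reply)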
8. Real-world use cases
Local productivity agent
Reads calendar, emails, markdown notes and generates daily summaries. All local, nothing sent to the cloud. Hermes 3 8B + Ollama + MCP servers for each integration. Runs on an RTX 3060 12GB at ~35 tok/s without issues.
Coding copilot without restrictions
Unlike Llama 3.1-Instruct, Hermes 3 does not inject disclaimers when explaining security code, reverse engineering or educational exploits. For devs working on CTFs, pentesting or malware analysis, it is a practical pick. Integrates well with Continue.dev or Cline.
Roleplay and creative writing
Hermes 3 keeps characters consistent across long sessions (128K context helps), without breaking role with "as a language model I cannot...". For narrative design, writing assistants, or SillyTavern-style frontends, it is one of the best open models available.
RAG over private documentation
With 128K of context, you can drop an entire technical PDF into the prompt and ask questions without needing complex embeddings for small-to-medium projects. For larger corpora, pair it with a vector store (Qdrant, Weaviate).
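For that small-to-medium case, "drop the document in the prompt" really is just string concatenation. A minimal sketch; the file name and question are placeholders, and the PDF is assumed to be already extracted to plain text. Size num_ctx to your VRAM (section 5: ~1 GB per extra 8K tokens on the 8B):

import pathlib
import requests

# Stuff an entire document into the prompt and ask a question over it.
doc = pathlib.Path("manual.txt").read_text()  # placeholder: your extracted PDF text
prompt = f"Document:\n{doc}\n\nQuestion: What are the safety limits?"

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hermes3:8b",
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": 32768},  # must cover the document length
    },
    timeout=600,
)
print(resp.json()["response"])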
9. Benchmarks and expected speed
Speeds estimated from each GPU's memory bandwidth and the model size. Since Hermes 3 shares architecture with Llama 3.1, tokens/sec are essentially identical to Llama 3.1.
| GPU | Hermes 3 8B Q4 | Hermes 3 8B Q8 | Hermes 3 70B Q4 |
|---|---|---|---|
| RTX 3060 12GB | ~35 tok/s | ~22 tok/s | Not viable |
| RTX 4060 Ti 8GB | ~55 tok/s | Does not fit | Not viable |
| RTX 4070 Super 12GB | ~75 tok/s | ~48 tok/s | Not viable |
| RTX 4090 24GB | ~110 tok/s | ~70 tok/s | ~6 tok/s (offload) |
| 2× RTX 3090 24GB | ~85 tok/s | ~55 tok/s | ~15 tok/s |
| M4 Max 64GB | ~62 tok/s | ~38 tok/s | ~9 tok/s |
To compare options and reason about your specific case, also see our guide on DeepSeek R1 locally, which covers reasoning with similar hardware.
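The estimation model behind the table is simple: generating one token streams the whole quantized model out of VRAM, so tokens per second is roughly memory bandwidth divided by model size, scaled by an efficiency factor. A sketch with published bandwidth specs; the 0.6 factor is our rough average, and because the real factor varies per GPU the output will not match every cell exactly:

# tok/s ≈ memory bandwidth / model size × efficiency.
# Bandwidths are published specs (GB/s); 0.6 is an assumed average efficiency.
GPUS = {"RTX 3060 12GB": 360, "RTX 4070 Super 12GB": 504, "RTX 4090 24GB": 1008}

def tokens_per_second(bandwidth_gbs: float, model_gb: float, efficiency: float = 0.6) -> float:
    return bandwidth_gbs / model_gb * efficiency

for gpu, bw in GPUS.items():
    print(f"{gpu}: ~{tokens_per_second(bw, 4.8):.0f} tok/s on Hermes 3 8B Q4")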
Recommended hardware for Hermes 3
Four GPUs that cover the four real scenarios: budget entry for 8B, sweet spot for 8B with headroom, premium for 8B in Q8, and workstation for 70B.
€269
RTX 3060 12GB
Pros
- 12 GB VRAM — Hermes 3 8B Q8 fits comfortably
- Unbeatable entry price (~€269)
- Ideal to start with local agents
Cons
- Lower memory bandwidth than Ada cards
- Not enough for 70B
€399
RTX 4060 Ti 8GB
Pros
- Ada Lovelace — efficiency per watt
- Hermes 3 8B Q4 with headroom
- Good price/performance balance
Cons
- 8 GB rules out Q8 of the 8B
- Long context pressures VRAM
€499
RTX 4070 Super 12GB
Pros
- Best pick for Hermes 3 8B in 2026
- Q8 comfortable + 32K context easy
- ~75 tok/s in Q4
Cons
- Price above RTX 3060
- 12 GB tight for 8B + 128K context
€1799
RTX 4090 24GB
Pros
- 24 GB VRAM — only consumer GPU for Hermes 3 70B with offload
- Ada Lovelace flagship
- Peak speed on 8B (~110 tok/s)
Cons
- Flagship price (~€1,799)
- 450W typical draw
As an Amazon Associate we earn from qualifying purchases. This does not affect our recommendations.
10. Frequently asked questions
What hardware do I need to run Hermes 3?
Hermes 3 8B runs on RTX 3060 12GB (~€269) in Q4 (4.8 GB VRAM). Hermes 3 70B needs RTX 4090 24GB (~€1,799) with offloading, or 2× RTX 3090 without offloading. The 405B is for clusters and does not run on home hardware.
What makes Hermes 3 different from Llama 3.1 base?
Hermes 3 is a full Nous Research finetune of Llama 3.1: native structured function-calling, agent capabilities with chain-of-thought reasoning, and neutral alignment without the aggressive refusals of Llama-Instruct. Same hardware, noticeably different behavior.
How do I install Hermes 3 with Ollama?
One command: ollama pull hermes3:8b for the 8B, ollama pull hermes3:70b for the 70B.
Then ollama run hermes3:8b to start the chat.
Is Hermes 3 censored or restricted?
Hermes 3 uses neutral alignment: it does not refuse legitimate requests by default or inject unnecessary disclaimers. It is not strictly "uncensored" — it still has judgment — but it behaves like a useful model instead of one that avoids topics. For roleplay, fiction and creative tasks, it makes a real difference versus Llama-Instruct.
Is Hermes 3 worth it for function-calling?
Yes — it is one of the best open models for tool use in 2026. Nous Research specifically trained Hermes 3 with structured function-calling datasets. Consistent JSON output, parallel calls, and chain-of-thought reasoning before invoking tools.
Calculate your exact case
Pick your GPU and the Hermes 3 variant. The calculator tells you if it fits in VRAM and how many tokens/sec to expect.
Calculate my GPU now →