You know the feeling. You fire up your agent, ask a question about something you worked on together last week, and it stares back at you like you just met. “I don’t have context from previous sessions.” You explain the project again. You restate your preferences. You remind it who you are.

This isn’t an intelligence problem. It’s a memory architecture problem.

Most AI tools treat every conversation as a blank slate. ChatGPT doesn’t remember you from yesterday. Claude doesn’t carry context from one project to the next. Some bolt on a vector database and call it memory: dump embeddings in, pull the nearest neighbors out, hope the semantic space lines up with what you actually needed. It works about as well as you’d expect.

hermes-cashew takes a different approach. It builds a thought graph.

Instead of a flat pile of embeddings, every conversation turn becomes a node: observations, facts, decisions, beliefs, insights. Those nodes get connected by edges that capture how they relate: derivation, contradiction, cross-linking, synthesis. When you ask a question, retrieval doesn’t just find similar vectors. It walks the graph. Vector search finds seed nodes. Then BFS follows neighbors. Then neighbors of neighbors, scoring at each hop. The topology is the retrieval signal.
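If you want the shape of that in code, here is a minimal sketch of hop-decayed BFS retrieval. Everything in it (the decay constant, the neighbors and similarity callables, the hop cap) is illustrative, not hermes-cashew's actual implementation:

import heapq
from collections import deque

def graph_retrieve(seeds, neighbors, similarity, max_hops=2, decay=0.7, top_k=10):
    # seeds: node ids from vector search; neighbors(n) -> iterable of node ids;
    # similarity(n) -> float in [0, 1] against the query embedding.
    scored = {}
    queue = deque((node, 0) for node in seeds)
    while queue:
        node, hop = queue.popleft()
        score = similarity(node) * (decay ** hop)  # topology is the retrieval signal
        if score <= scored.get(node, 0.0):
            continue  # already reached this node by a better path
        scored[node] = score
        if hop < max_hops:
            queue.extend((n, hop + 1) for n in neighbors(node))
    return heapq.nlargest(top_k, scored.items(), key=lambda kv: kv[1])

The point of the hop decay is that a node two edges away from a strong seed can still outrank a weak direct match. Distance in the graph carries meaning.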

I wrote about this in April when hermes-cashew was at v0.2.0, more proof of concept than daily driver. Today it ships v0.8.0, and it’s ready for other people to use. Here’s how to get it running in five minutes.

What You’ll Get#

Once installed, hermes-cashew gives your Hermes Agent a persistent semantic memory that survives sessions, platforms, and restarts. Two LLM-accessible tools (cashew_query and cashew_extract) live alongside the built-in memory tool without conflict. Session-start prefetch pulls relevant context into your system prompt automatically.

Behind the scenes, three systems keep the graph healthy:

Think cycles. Every N turns, the system picks a cluster of related nodes and asks your configured LLM: what’s the non-obvious connection here? The resulting insight nodes bridge islands of knowledge you didn’t know were adjacent.

Sleep consolidation. At session end, the graph tidies itself up. It cross-links similar nodes, deduplicates near-identical content, and garbage-collects low-value noise. It promotes frequently accessed nodes to permanent status, then generates a dream node. All in about four seconds for a typical graph.

Organic decay. Low-value knowledge fades over time. Important patterns strengthen with use. You don’t have to curate anything.
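The decay mechanic is easy to picture as exponential weight decay plus a use-based boost. A toy sketch of the idea; the constants and function names here are mine, not the plugin's:

import math
import time

HALF_LIFE_DAYS = 30      # illustrative: an untouched node's weight halves monthly
REINFORCE_BOOST = 0.25   # illustrative: each retrieval strengthens the node

def decayed_weight(weight, last_access_ts, now=None):
    # Exponential decay driven by time since last access.
    idle_days = ((now or time.time()) - last_access_ts) / 86400
    return weight * math.exp(-math.log(2) * idle_days / HALF_LIFE_DAYS)

def reinforce(weight):
    # Strengthen on use, capped so weights stay bounded.
    return min(1.0, weight + REINFORCE_BOOST)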

Everything lives in a single SQLite file. Embeddings use all-MiniLM-L6-v2 locally. No data leaves your machine unless you explicitly configure an LLM for extraction; even then, only the prompts you choose to send.
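Both pieces are ordinary local tools, so the storage pattern is easy to reproduce in a few lines. A sketch of the pattern, not hermes-cashew's actual schema:

import sqlite3
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings, computed locally
db = sqlite3.connect("brain.db")
db.execute("CREATE TABLE IF NOT EXISTS nodes (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)")

def add_node(text):
    emb = model.encode(text).astype(np.float32)  # nothing leaves the machine
    db.execute("INSERT INTO nodes (text, embedding) VALUES (?, ?)", (text, emb.tobytes()))
    db.commit()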

Install#

hermes-cashew is a Hermes Agent plugin. One command:

hermes plugins install magnus919/hermes-cashew

That’s it. The plugin registers itself as a memory provider, creates the config file at ~/.hermes/cashew.json, and initializes an empty SQLite database at ~/.hermes/cashew/brain.db.

No server to run. No Docker container. No API key required unless you want LLM-powered features.

Configure (Optional but Worth It)#

The plugin works with zero configuration: it uses heuristics to extract facts from your conversations, stores them as nodes, and serves them back on query. You’ll get basic memory right away.

To unlock the real features (typed nodes, think cycles, and sleep consolidation), you need an LLM wired in. Add one line to ~/.hermes/cashew.json:

"llm_aux_role": "memory"

Then point it at a model in your Hermes config.yaml:

auxiliary:
  memory:
    provider: opencode-go
    model: deepseek-v4-flash
    base_url: https://opencode.ai/zen/go/v1
    timeout: 120

This uses Hermes’s existing auxiliary model infrastructure. No separate API key management. The plugin reads your Hermes config, resolves credentials, and constructs a callable, exactly the same pattern other memory providers use.
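Conceptually, the resolution step is just a lookup against the YAML above. A simplified sketch, assuming the config lives at ~/.hermes/config.yaml (an assumption; the real plugin goes through Hermes's own loader and credential handling):

import yaml  # pip install pyyaml
from pathlib import Path

def resolve_aux_model(role):
    # Hypothetical: find the auxiliary role's spec, e.g.
    # {"provider": ..., "model": ..., "base_url": ..., "timeout": ...}
    config = yaml.safe_load(Path("~/.hermes/config.yaml").expanduser().read_text())  # assumed path
    return config["auxiliary"][role]

spec = resolve_aux_model("memory")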

The model doesn’t need to be big. For extraction and think cycles, a fast, cheap model works fine. I use DeepSeek V4 Flash through OpenCode Go. OpenAI’s gpt-4o-mini or a local llama.cpp instance would work just as well.

Full config reference with all 31 settings is in the README.

Your First Query#

Once the plugin is installed and your gateway restarted, memory works automatically. Session-start prefetch pulls relevant context into your system prompt. You can also query explicitly:

What does the thought graph remember about mesh networking in Raleigh?

The agent calls cashew_query behind the scenes, which seeds a vector search, walks the graph recursively via BFS, and returns formatted context. You’ll see results tagged by domain and node type (fact, observation, insight, decision).

If you want to explicitly persist something to the graph:

Save this to memory: the Eastern Repeater now has a 9.3 dBi Diamond BC920 antenna.

That calls cashew_extract, which runs the full extraction pipeline: LLM classification if wired, heuristic fallback if not.
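That wired-or-fallback split is a pattern worth copying in your own tools. A hedged sketch in which both the classification prompt and the heuristics are stand-ins, not the plugin's real ones:

import re

NODE_TYPES = ("fact", "observation", "decision", "belief", "insight")

def extract(text, llm=None):
    # LLM classification when a model is wired in...
    if llm is not None:
        label = llm(f"Classify this memory as one of {NODE_TYPES}: {text!r}").strip().lower()
        if label in NODE_TYPES:
            return {"type": label, "content": text}
    # ...crude heuristic fallback when it isn't (illustrative patterns only)
    if re.search(r"\b(decided|we will|going with)\b", text, re.IGNORECASE):
        return {"type": "decision", "content": text}
    if re.search(r"\b(is|has|was)\b", text):
        return {"type": "fact", "content": text}
    return {"type": "observation", "content": text}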

What v0.8.0 Fixed#

The April article was honest about what was missing: “The parts of Cashew that are most distinctive (engineered forgetting, think cycles, and autonomous synthesis) are working in cashew-brain itself today; my Hermes integration exposes the retrieval primitives and not yet the brain operations.”

v0.8.0 closes that gap.

More specifically, it closes a gap that v0.7.4 created. If you’d installed that version, your agent would hang at session end. You’d wonder if the plugin was broken. You might uninstall it. The upstream sleep cycle was catastrophically slow. On our 7,100-node graph, it computed a full N-by-N embedding similarity matrix (49 million floats, 24.5 million pair comparisons) in a Python double loop, then committed each new cross-link edge to SQLite one transaction at a time, in DELETE journal mode. It ran at 100% CPU for hours without finishing.

The refactored sleep cycle in v0.8.0 replaces that with a nine-phase vectorized pipeline: numpy similarity search, batched 500-edge commits in WAL mode, connected-component dedup instead of Bron–Kerbosch clique detection, LLM-powered dream nodes, and a work cap so it converges gradually over a few sessions instead of eating the whole graph at once.
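The recipe here is the standard one: vectorize the math, batch the writes. A condensed sketch of those two wins (illustrative threshold and schema, not the actual nine-phase code):

import sqlite3
import numpy as np

def cross_link(db_path, node_ids, embeddings, threshold=0.85, batch=500):
    # One vectorized pass replaces the Python double loop. (For very large
    # graphs you'd chunk the matmul instead of materializing all of it.)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    rows, cols = np.where(np.triu(sims, k=1) >= threshold)  # each pair counted once

    db = sqlite3.connect(db_path)
    db.execute("PRAGMA journal_mode=WAL")  # writes no longer stall on DELETE-mode journaling
    pairs = [(node_ids[i], node_ids[j]) for i, j in zip(rows, cols)]
    for start in range(0, len(pairs), batch):  # batched 500-edge commits
        db.executemany("INSERT INTO edges (src, dst) VALUES (?, ?)", pairs[start:start + batch])
        db.commit()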

Result: 4 seconds. Down from hours. Tested at 7,100 nodes and 61,000 edges. 100% retrieval overlap with the original graph. Zero orphan edges. 32 new tests. And it runs automatically at session end when sleep_cycles: true is set. No cron job, no daemon thread, no dual-path flag. It just fires, logs a one-line summary, and finishes before you notice it started.

That’s the pattern hermes-cashew is built on: each release sharpens the parts that make it distinctive. v0.8.0 made sleep fast. The next ones will make it smarter.

Where to Go From Here#

hermes-cashew is open source (MIT). The GitHub repo has the full README, changelog, and contributing guide. If you hit a bug, find the extraction prompt biased toward certain types of content, or have an idea for a feature, open an issue. If you want to contribute, the CONTRIBUTING.md walks through the dev setup, test suite, and PR conventions.

The cashew-brain engine that powers all of this is Raj Kripal Danday’s project: an independent thought-graph engine with recursive BFS retrieval, organic decay, and autonomous synthesis protocols. hermes-cashew is a thin adapter around it. The upstream is active and worth following.

If you’re building your own agent memory system: the retrieval architecture matters far more than the extraction pipeline. On the LoCoMo benchmark, accuracy varied by 20 points depending on how the system retrieved memories, and only 3 to 8 points depending on how it summarized them at write time. How you find old context is more important than how you save it. The graph walk is the differentiator.

What’s in your agent’s memory right now that it can’t find?