
benchmark · 2026-05-26
We tested Cognition against six alternatives — including a plain AI model and a persistent-memory system similar to mem0. We also ran three industry-standard memory benchmark families. Below are the real numbers, explained plainly.
the bottom line
Other memory systems can store and retrieve context. Only Cognition can govern who uses it, who approved it, and whether it should cross to another user. That distinction decided every test.
A persistent-memory system (similar to mem0) retrieved the right context every time — same as Cognition. But it still failed every behavioral test because it had no approval layer, no cross-user boundaries, and no way to enforce hidden team norms. Remembering is not the same as knowing the rules.
Table 1 — Did the agent actually do the right thing? (3 real-world behavior tests)
| Condition | Task success | Exact context | Failures | Why |
|---|---|---|---|---|
| Cognition (judgment_packet) ✓ wins | 100% | 100% | 0 | Only condition with 0 failures across all behavioral cases |
| Plain model (vanilla Claude) | 0% | 0% | 5 | No governed memory — failed all behavioral cases |
| Persistent memory (mem0-class) | 0% | 100% | 3 | Recovered context, but approval semantics absent — behavioral failure |
| RAG only | 0% | 67% | 3 | Retrieval without governance — cross-user and approval cases fail |
| Docs only (context files) | 0% | 0% | 5 | Static structure cannot handle hidden norms or cross-user transfer |
| Agent shell (no memory) | 0% | 0% | 5 | Stateless — same failure rate as plain model |
Cases: cross_user_transfer · approval_boundary · hidden_norms. Baselines are frozen category analogues, not live vendor systems.
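To make the gap between retrieval and governance concrete, here is a minimal Python sketch of an approval and cross-user gate placed in front of a memory store. It is an illustration only, not Cognition's implementation: `MemoryItem`, `may_use`, `governed_retrieve`, the field names, and the sample data are all hypothetical, and the substring match stands in for whatever retrieval a real system would use. The hidden_norms case is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemoryItem:
    # All fields are hypothetical, chosen only for this illustration.
    owner: str                         # user who created the memory
    text: str                          # stored context
    approved_by: Optional[str] = None  # reviewer who approved sharing, if any
    shareable: bool = False            # whether it may cross to another user


def may_use(item: MemoryItem, requester: str) -> bool:
    """Governance gate: owners see their own memory; anyone else needs
    the item to be both marked shareable and approved."""
    if item.owner == requester:
        return True
    return item.shareable and item.approved_by is not None


def governed_retrieve(store: List[MemoryItem], requester: str, query: str) -> List[str]:
    """Retrieve by naive substring match, then apply the governance gate."""
    matches = [m for m in store if query.lower() in m.text.lower()]
    return [m.text for m in matches if may_use(m, requester)]


store = [
    MemoryItem("alice", "Deploy checklist: run migrations first", "lead", True),
    MemoryItem("alice", "Deploy password hint for staging"),  # never approved
]

bob_view = governed_retrieve(store, "bob", "deploy")
alice_view = governed_retrieve(store, "alice", "deploy")
print(bob_view)
print(alice_view)
```

Both items match the query, so a retrieval-only system would hand both to `bob`. With the gate in place, `bob` receives only the approved, shareable item, while `alice` still sees both of her own.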
Table 2 — How well does it find the right memory? (20 search queries, 22 stored skills)
| System | Recall@1 | Recall@5 | MRR | Abstention FP | Leak rate | Read |
|---|---|---|---|---|---|---|
| Cognition ✓ best balance | 86.7% | 93.3% | 0.900 | 0% | 0% | Best balanced profile: high recall + zero abstention false positives |
| Docs only (context files) | 93.3% | 93.3% | 0.933 | 60% | 25% | High recall collapses on abstention — 60% FP rate on off-topic queries |
| RAG only | 80.0% | 80.0% | 0.800 | 0% | 0% | Clean abstention but lower recall than Cognition at every cutoff |
| Persistent memory (mem0-class) | 80.0% | 80.0% | 0.800 | 0% | 0% | Same retrieval ceiling as RAG — no approval or decay differentiation |
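For readers unfamiliar with the column headers in Table 2, the sketch below shows how Recall@k, MRR, and an abstention false-positive rate are conventionally computed. The function names and the sample data are assumptions, not the benchmark's actual harness or raw results. The rank list is a hypothetical reconstruction chosen because it reproduces the Cognition row: it assumes 15 of the 20 queries are answerable and 5 are off-topic, a split the page does not state but which is consistent with the reported percentages.

```python
from typing import Optional, Sequence


def recall_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of queries whose correct item appears at rank <= k.
    Ranks are 1-indexed; None means the correct item was never returned."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def abstention_fp_rate(returned_something: Sequence[bool]) -> float:
    """Share of off-topic queries where the system returned a result
    instead of abstaining."""
    return sum(returned_something) / len(returned_something)


# Hypothetical ranks for 15 answerable queries:
# 13 correct at rank 1, one at rank 2, one miss.
ranks = [1] * 13 + [2] + [None]

r1 = recall_at_k(ranks, 1)         # 13/15
r5 = recall_at_k(ranks, 5)         # 14/15
mrr = mean_reciprocal_rank(ranks)  # (13 + 0.5) / 15

# Hypothetical outcome on 5 off-topic queries: 3 wrongly answered.
fp = abstention_fp_rate([True, True, True, False, False])

print(f"Recall@1={r1:.3f} Recall@5={r5:.3f} MRR={mrr:.3f} AbstentionFP={fp:.0%}")
```

Under these assumptions the rank list yields Recall@1 of 86.7%, Recall@5 of 93.3%, and MRR of 0.900, matching the Cognition row, and three wrongly answered off-topic queries out of five yield the 60% abstention false-positive rate shown for docs only.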
Independent benchmarks — LongMemEval · LoCoMo · SWE-Bench-CL
Memory recall accuracy
90%
answers correct on 30 memory tests
We ran 30 real-world memory scenarios — remembering things said in one conversation, across multiple sessions, and at specific points in time. An independent judge scored each answer. Single-session recall scored 100%; multi-session and time-based recall averaged 83%.
Measured with LongMemEval — an industry-standard test with a live judge. 90% is near current state-of-the-art.
Smarter retrieval
+75%
more accurate than dumping all context in
When Cognition retrieves structured memory instead of pasting everything into the prompt, answers are 75% more accurate — and the agent processes 39% fewer tokens to get there. Less noise, sharper signal.
Measured with LoCoMo. Validates that structured memory beats raw context stuffing.
Automated code repair
0%
honest early pilot result
We tested memory-assisted automated code repair on 2 real tasks. The memory layer retrieved correctly, but a downstream technical step hit a blocker before scoring could run. We're showing the real number — not a polished one.
Measured with SWE-Bench-CL. Memory infrastructure is live. We'll update this as the pipeline matures.
shortest path to value
The second agent is better because the first one learned.