benchmark · 2026-06-02

Cognition vs. the field. Measured.

We tested Cognition against six alternatives, including a plain AI model and a persistent-memory system similar to mem0. We also ran three industry-standard memory benchmark families. Below are the real numbers, explained plainly.

the bottom line

Other memory systems can store and retrieve context. Only Cognition can govern who uses it, who approved it, and whether it should cross to another user. That distinction decided every test.

A persistent-memory system (similar to mem0) retrieved the right context every time, same as Cognition. But it still failed every behavioral test because it had no approval layer, no cross-user boundaries, and no way to enforce hidden team norms. Remembering is not the same as knowing the rules.

Table 1. Did the agent actually do the right thing? (3 real-world behavior tests)

Condition	Task success	Exact context	Failures	Why
Cognition (judgment_packet) ✓ wins	100%	100%	0	Only condition with 0 failures across all behavioral cases
Plain model (Claude Opus 4.7 + 4.8)	0%	0%	5	No governed memory; both Opus runs failed all behavioral cases
Persistent memory (mem0-class)	0%	100%	3	Recovered the context, but with no approval semantics the behavior still failed
RAG only	0%	67%	3	Retrieval without governance; cross-user and approval cases fail
Docs only (context files)	0%	0%	5	Static structure cannot handle hidden norms or cross-user transfer
Agent shell (no memory)	0%	0%	5	Stateless; same failure rate as the plain model

Cases: cross_user_transfer · approval_boundary · hidden_norms. Baselines are frozen category analogues, not live vendor systems.
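
To make the gap between "retrieved the right context" and "did the right thing" concrete, here is a minimal sketch of a governance gate placed after retrieval. It is an illustration under stated assumptions, not Cognition's implementation: `MemoryItem`, `may_use`, `governed_retrieve`, and every field name are hypothetical, and the substring match stands in for whatever retrieval a real system uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemoryItem:
    """A stored piece of context. All field names are hypothetical."""
    text: str
    owner: str                   # user who taught or wrote the memory
    approved_by: Optional[str]   # human approver, or None if unreviewed
    shareable: bool              # whether it may cross to another user


def may_use(item: MemoryItem, requesting_user: str) -> bool:
    """Return True only if this user is allowed to reuse the item."""
    if item.owner == requesting_user:
        return True  # own memory: no user boundary is crossed
    # Cross-user reuse needs both a human approval and an explicit share flag.
    return item.approved_by is not None and item.shareable


def governed_retrieve(items: List[MemoryItem], query: str,
                      requesting_user: str) -> List[MemoryItem]:
    """Retrieve by naive substring match, then drop anything not permitted."""
    hits = [i for i in items if query.lower() in i.text.lower()]
    return [i for i in hits if may_use(i, requesting_user)]
```

In this sketch a store without the `may_use` step would return both an approved and an unapproved memory to any user. That mirrors the table's pattern of full context recovery alongside failed approval and cross-user cases.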

vs. the alternatives

Storing context is easy. Teaching judgment is the hard part.

Mem0, Zep, and Letta pushed memory forward. Cognition is narrower: it holds coding-agent procedures that a human approved, an author taught, and a later run can verify. For large wikis, it coordinates retrieval strategy and proof; it is not a raw RAG replacement.

Mem0

Broad user and app memory for personalized agents.

Zep

Temporal graph memory for enterprise agent context.

Letta

Stateful agents and context management runtime.

Cognition

Governed coding-agent skills: approval, attribution, freshness, receipts.

Feature	Context files	RAG / embeddings	✓ Cognition
Persists across sessions	✓	✓	✓
Human approval before team sharing	·	·	✓
Author attribution on every skill	·	·	✓
Outcome receipts after reuse	·	~	✓
Readable trigger, source, answer, and outcome receipt	·	·	✓
Fast no-match when context is thin	·	·	✓
Retrieval eval on fixed question sets	·	~	✓
Privacy review before memory reuse	·	·	✓
Decay and freshness tracking	·	·	✓
Executable workflow steps, not text chunks	·	·	✓
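
Three of the rows above (freshness tracking, fast no-match, and outcome receipts) can be sketched in a few lines. This is a hedged illustration only: the `Skill` record, the 90-day freshness window, and the 0.5 score threshold are assumptions chosen for the example, not values taken from Cognition.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Skill:
    """A reusable procedure with attribution. Field names are hypothetical."""
    name: str
    author: str
    last_verified: datetime
    receipts: List[str] = field(default_factory=list)


def is_fresh(skill: Skill, now: datetime,
             max_age: timedelta = timedelta(days=90)) -> bool:
    """A skill counts as fresh if it was verified within max_age (assumed)."""
    return now - skill.last_verified <= max_age


def pick_skill(scored: List[Tuple[Skill, float]], now: datetime,
               min_score: float = 0.5) -> Optional[Skill]:
    """Return the best-scoring fresh skill, or None as a fast no-match."""
    candidates = [(s, sc) for s, sc in scored
                  if sc >= min_score and is_fresh(s, now)]
    if not candidates:
        return None  # thin context: abstain instead of guessing
    return max(candidates, key=lambda pair: pair[1])[0]


def record_outcome(skill: Skill, outcome: str, now: datetime) -> None:
    """Append a readable receipt describing what happened after reuse."""
    skill.receipts.append(f"{now.isoformat()} {skill.name}: {outcome}")
```

Returning `None` rather than the nearest low-scoring match is the design choice that keeps abstention clean: the caller can tell "nothing relevant" apart from "weak guess".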

Table 2. How well does it find the right memory? (20 search queries, 22 stored skills)

System	Recall@1	Recall@5	MRR	Abstention FP	Leak rate	Read
Cognition ✓ best balance	86.7%	93.3%	0.900	0%	0%	Best balanced profile: high recall and zero abstention false positives
Docs only (context files)	93.3%	93.3%	0.933	60%	25%	High recall, but abstention collapses: 60% FP rate on off-topic queries
RAG only	80.0%	80.0%	0.800	0%	0%	Clean abstention, but lower recall than Cognition at every cutoff
Persistent memory (mem0-class)	80.0%	80.0%	0.800	0%	0%	Same retrieval ceiling as RAG; no approval or decay differentiation
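
For readers unfamiliar with the column headers in Table 2, the sketch below computes Recall@k, MRR, and an abstention false-positive rate using their standard definitions. The function names and the toy data are illustrative assumptions; the source does not publish its evaluation harness, and it does not define how leak rate is measured, so that column is left out.

```python
from typing import List


def recall_at_k(ranked: List[List[str]], gold: List[str], k: int) -> float:
    """Fraction of queries whose correct item appears in the top k results."""
    hits = sum(1 for results, g in zip(ranked, gold) if g in results[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked: List[List[str]], gold: List[str]) -> float:
    """Average of 1/rank of the correct item, counting 0 when it is missing."""
    total = 0.0
    for results, g in zip(ranked, gold):
        if g in results:
            total += 1.0 / (results.index(g) + 1)
    return total / len(gold)


def abstention_fp_rate(off_topic_results: List[List[str]]) -> float:
    """Fraction of off-topic queries that returned something instead of nothing."""
    answered = sum(1 for results in off_topic_results if results)
    return answered / len(off_topic_results)


# Toy data (hypothetical): three answerable queries and two off-topic ones.
ranked = [["a", "b"], ["x", "b"], ["c"]]
gold = ["a", "b", "z"]
off_topic = [[], ["a"]]
```

On the toy data, only the first query has its correct item at rank 1, the second has it at rank 2, and the third never finds it. High recall and clean abstention are measured on different query sets, which is why a system can score well on one and poorly on the other.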

Independent benchmarks: LongMemEval · LoCoMo · SWE-Bench-CL

Memory recall accuracy

90%

answers correct on 30 memory tests

We ran 30 real-world memory scenarios that required recalling things said in one conversation, across multiple sessions, and at specific points in time. An independent judge scored each answer. Single-session recall scored 100%; multi-session and time-based recall averaged 83%.

Measured with LongMemEval, an industry-standard test with a live judge. 90% is near current state-of-the-art.

Smarter retrieval

+75%

more accurate than dumping all context in

When Cognition retrieves structured memory instead of pasting everything into the prompt, answers are 75% more accurate, and the agent processes 39% fewer tokens to get there. Less noise, sharper signal.

Measured with LoCoMo. Validates that structured memory beats raw context stuffing.

Automated code repair

honest early pilot result

We tested memory-assisted automated code repair on 2 real tasks. The memory layer retrieved correctly, but a downstream technical step hit a blocker before scoring could run. We're reporting that result as it stands rather than a polished number.

Measured with SWE-Bench-CL. Memory infrastructure is live. We'll update this as the pipeline matures.

shortest path to value

The second agent is better because the first one learned.

