2026-03-27

The Metric No One Is Measuring

AI agents are benchmarked on reasoning, code generation, and task completion. But nobody measures whether they can maintain effective context across a real work session. We built a scoring framework to find out, and the test itself proved the point.

There are benchmarks for everything in AI. Reasoning. Code generation. Mathematical proof. Multi-step planning. Tool use. Long-context retrieval with needles hidden in haystacks. The industry has gotten very good at measuring whether a model can do a thing.

What nobody measures is whether an agent can maintain effective context across a real work session.

Not "can it recall a fact from 100K tokens ago." Not "can it follow a chain of reasoning across 50 turns." Something more fundamental: when an agent is doing actual work, making decisions, recording rationale, navigating constraints, picking up where a previous session left off, does it have the right information at the right time? And when it doesn't, does it know that it doesn't?

This is the gap. Every agent framework, every MCP server, every memory system makes claims about context quality. None of them have a standard way to measure it. We decided to build one.

Why existing benchmarks miss the point

The current benchmarks measure capability in isolation. Can the model answer correctly given perfect context? Can it find a needle in a haystack? Can it reason about a long document?

These are useful but they assume the hard part is the model's reasoning. In production, the hard part is almost never reasoning. It's context assembly. The model is smart enough. The question is whether it's informed enough.

Consider what happens in a real agent work session. An engineer asks the agent to design the error handling strategy for a new service. The agent needs to know: what error handling patterns exist in the codebase already, what constraints apply (compliance requirements, SLA guarantees, incident history), what decisions were made in prior sessions about related services, and what the current team is actually building toward. Miss any one of those and the agent produces something technically sound but organisationally wrong.

No benchmark tests this. SWE-bench tests whether agents can fix bugs given a repo and an issue. GAIA tests multi-step reasoning. HumanEval tests code generation. None of them test whether the agent can assemble the right context from a realistic knowledge base and act on it safely.

What we built

We wanted a metric that captures agent effectiveness, not just agent capability. We call it the Crux Score.

The idea is simple: measure how many effective minutes of expert work the agent replaces, gated on safety. An agent that produces a perfect design document but ignores a constraint that would have prevented a production incident scores zero. An agent that takes twice as long as a human but catches every constraint and recalls every relevant prior decision scores well, because the time includes the context assembly that the human would have done from institutional memory.

The score decomposes into dimensions that matter in practice:

Information quality. Did the agent recall the decisions that bear on the task? Did it surface the constraints that apply? Did it reference the incident history that should have shaped its approach?

Context precision. Of everything the agent loaded into its working memory, how much was actually relevant? An agent that dumps 200K tokens of documentation and uses three paragraphs is not demonstrating good context management, even if it gets the right answer.

Decision continuity. If the session dies and a new agent picks up the work, do the decisions persist? Can the new session reconstruct the causal chain: not just what was decided, but why, and what evidence supported it?

Safety. Binary gate. Did the agent take any action that violates a declared constraint? One unsafe action and the entire session scores zero, regardless of how good everything else was.

These aren't novel concepts individually. What's novel is composing them into a single metric that can be evaluated empirically, with controlled experiments, across different context strategies.
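The gating structure can be expressed directly. A minimal sketch; the `SessionScores` type, field names, and weights below are illustrative assumptions, not the published methodology:

```python
from dataclasses import dataclass

@dataclass
class SessionScores:
    information_quality: float   # 0..1: relevant decisions/constraints recalled
    context_precision: float     # 0..1: fraction of loaded context actually used
    decision_continuity: float   # 0..1: can a new session reconstruct the why
    safe: bool                   # binary gate: no declared constraint violated

def crux_score(s: SessionScores, expert_minutes_replaced: float) -> float:
    """Effective expert minutes replaced, gated on safety.

    One unsafe action zeroes the whole session, regardless of how good
    everything else was. The weights are illustrative placeholders.
    """
    if not s.safe:
        return 0.0
    quality = (0.4 * s.information_quality
               + 0.3 * s.context_precision
               + 0.3 * s.decision_continuity)
    return expert_minutes_replaced * quality
```

The key design choice is that safety is a multiplier of zero or one, not a weighted term: no amount of recall or precision can buy back an unsafe action.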

The experiment design

We built a benchmark harness that tests three approaches to giving agents context.

The first is the obvious one: dump everything into the system prompt. Cap it at 32K tokens, which is realistic for production systems that need to leave room for the conversation itself. The agent gets the full corpus upfront, answers from what it can see. No tools, no retrieval, no memory system. Just context.

The second is the maximalist version: give the model its entire context window. No truncation. Everything the agent could possibly need is right there. This is the "long context solves everything" hypothesis made concrete.

The third uses an external memory system. The agent gets a smaller context budget but has access to tools that let it query for relevant information, check constraints, record decisions, and checkpoint its reasoning. It loads less but can retrieve what it needs on demand.
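The three arms can be summarised as configuration. A sketch; the token budgets and tool names are illustrative stand-ins for the harness, not its actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class ContextStrategy:
    name: str
    prompt_token_budget: int                   # tokens reserved for upfront context
    tools: list = field(default_factory=list)  # memory tools exposed, if any

# The three arms described above; budgets and tool names are illustrative.
STRATEGIES = [
    ContextStrategy("truncated_dump", 32_000),     # corpus capped at 32K, no tools
    ContextStrategy("full_window", 1_000_000),     # entire context window, no tools
    ContextStrategy("external_memory", 8_000,      # small budget plus retrieval
                    ["query_memory", "check_constraints",
                     "record_decision", "checkpoint"]),
]
```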

We ran this across multiple frontier models with realistic knowledge corpora. The kind of organisational context that real agent deployments navigate. Architecture decision records. Incident post-mortems. Operational constraints. Configuration documentation. Prior session decisions.

The benchmark has multiple phases. The agent works through a design task, makes decisions, records rationale. Then we kill the session and start a new one. The new agent has to pick up the work. This tests what actually matters: not whether the agent can answer a question from a document, but whether the system preserves the context that lets the next session continue productively.
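The kill-and-resume phase rewards systems that persist decisions as they are made, not at a graceful shutdown that may never come. A minimal sketch of that pattern, with hypothetical names and `DISK` standing in for durable storage that survives the kill:

```python
import json

DISK = []  # stands in for durable storage that outlives the session

def record_decision(decision: str, rationale: str, evidence: list):
    """Persist each decision at the moment it is made, with the why and the
    evidence, so a hard kill loses nothing that was recorded."""
    DISK.append(json.dumps({"decision": decision,
                            "rationale": rationale,
                            "evidence": evidence}))

def resume_session() -> list:
    """A fresh session reconstructs what was decided, and why, from storage
    alone: no in-memory state from the dead session is available."""
    return [json.loads(entry) for entry in DISK]
```

The continuity dimension is then a comparison between what the first session decided and what the second session can reconstruct from `resume_session()` alone.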

What we found (so far)

I'm going to be measured about what I share here, because the work is ongoing and preliminary findings are exactly that.

The safety layer result was unambiguous. Agents with access to the constraint-checking tools avoided every unsafe action across every model we tested. Control agents, even those with the full corpus in their context window, produced unsafe actions. More context did not mean safer behaviour. In at least one case, more context made things worse. The model with the full corpus was less safe than the model with the truncated version.

This finding alone has practical implications. If your agent deployment involves any actions with real consequences (database operations, infrastructure changes, external API calls), the question isn't whether your model is smart enough to be safe. It's whether the right constraints are surfaced at the moment of action, regardless of what the model thought to ask for.
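Surfacing constraints at the moment of action amounts to a pre-action gate that runs outside the model. A minimal sketch, with a hypothetical constraint format (the ids and predicates are illustrative):

```python
def guarded_execute(action: dict, constraints: list, execute):
    """Check declared constraints against the action the agent is about to
    take, regardless of what the model loaded or thought to ask for."""
    violations = [c["id"] for c in constraints if c["blocks"](action)]
    if violations:
        raise PermissionError(f"blocked by constraints: {violations}")
    return execute(action)

# Illustrative declared constraint: no writes against production databases.
CONSTRAINTS = [
    {"id": "no-prod-writes",
     "blocks": lambda a: a.get("env") == "prod" and a.get("kind") == "db_write"},
]
```

Because the gate runs on every action, it does not depend on the model having recalled the relevant constraint, which is exactly the failure mode the full-corpus control agents exhibited.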

The memory layer results are more nuanced, and this is where I want to be honest rather than promotional. The initial fixtures didn't stress the scenario hard enough. When the corpus fits comfortably in the context window, there's no differentiation. The control agents just read everything and perform well. This isn't surprising. It's the scenario where you genuinely don't need an external memory system.

The interesting question is what happens when the corpus doesn't fit. When the agent has to choose what to load. When decisions were generated during conversation rather than pre-existing in documents. When the kill is dirty and the session dies without a graceful checkpoint. That's where the architecture matters, and that's what the next phase of the benchmark is designed to test.

The irony

Here's the part I wasn't expecting.

We built this benchmark to measure whether agents can maintain effective context across work sessions. The benchmark itself is a complex, multi-phase, multi-session engineering task. It requires holding state about fixture designs, scoring rubrics, experimental methodology, and prior results across long implementation sessions.

During one of the benchmark implementation sessions, the agent hit the context window limit. The platform's automatic context compression kicked in, summarising earlier parts of the conversation to make room for new content. The compression was lossy. The agent lost track of a key design decision made earlier in the session. Specifically, the methodology for how generated decisions should be scored differently from pre-existing corpus decisions.

The agent continued working. It produced code that looked correct. The scoring functions compiled, the tests passed. But the scoring logic didn't match the methodology document, because the agent's compressed context had lost the nuance of the distinction.

We caught it in review. But the irony was hard to miss. The test designed to measure whether agents can maintain context across sessions failed, during its own construction, because of context loss within a single session. The exact problem the benchmark was built to quantify was the thing that corrupted the benchmark.

If you wanted a more concise argument for why agent context scoring matters, I'm not sure you could script one.

Why no one measures this

I think the reason nobody has built a standard agent context metric is that it's genuinely hard to evaluate. Capability benchmarks have clean success criteria: the code compiles, the test passes, the answer matches. Context quality is messier. Two agents can arrive at the same answer through completely different context paths, one of which is fragile and one of which is robust. The output looks identical. The process quality is invisible.

The Crux Score tries to make the process visible. By decomposing into information recall, context precision, decision continuity, and safety, it captures the dimensions that determine whether an agent session was effective or just lucky. An agent that gets the right answer but can't explain why, can't persist its reasoning, and didn't check whether constraints applied will fail on the next task that's slightly different. The score should reflect that.

The other reason is that measuring context quality requires building realistic fixtures, and realistic fixtures are expensive. You need corpora that look like real organisational knowledge: contradictory documents, superseded decisions, stale configuration, incident history that should change behaviour, constraints that aren't searchable because they live in people's heads. Building synthetic versions of this that are rigorous enough to benchmark against is itself a significant engineering effort.

We've built three fixtures so far. Each one taught us something about what the benchmark was actually measuring versus what we thought it was measuring. That's the nature of metric design. The metric evolves as your understanding of what you're measuring matures.

What this means

Agent context scoring isn't a solved problem. We're publishing the methodology and the metric definitions because we think the industry needs a shared vocabulary for this, even if the specific implementations evolve.

If you're evaluating memory systems, retrieval strategies, or context management approaches for your agent deployments, here's what I'd suggest:

Measure the process, not just the output. A correct answer from a fragile context path is a time bomb. Track what the agent loaded, what it used, what it missed.
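Tracking loaded, used, and missed reduces to a precision-and-recall check over context items. A sketch, assuming sessions can be instrumented to emit sets of document or chunk identifiers (the identifiers here are made up):

```python
def context_metrics(loaded: set, used: set, needed: set):
    """Process metrics for one session, over context item identifiers.

    precision: of everything the agent loaded, how much did it actually use
    recall:    of everything the task needed, how much did it actually use
    missed:    needed items the agent never touched (the time bombs)
    """
    precision = len(used & loaded) / len(loaded) if loaded else 0.0
    recall = len(used & needed) / len(needed) if needed else 1.0
    missed = needed - used
    return precision, recall, missed
```

A session can score a correct answer with low precision (it dumped far more than it used) or non-empty `missed` (it got lucky despite skipping something the task needed); both are invisible if you only grade the output.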

Test the kill. Graceful session boundaries are the easy case. What happens when the session dies mid-task? What survives? What has to be reconstructed? The answer tells you more about your context architecture than any happy-path benchmark.

Safety is not optional in the metric. Any effectiveness metric that doesn't gate on safety is measuring how fast your agent can cause damage. Constraint checking, incident awareness, and boundary respect need to be scored, not assumed.

Don't trust the vibes. The agent that "feels" like it's maintaining context might just be confidently operating on stale information. The only way to know is to measure, and measuring requires a framework that captures the dimensions that matter.

We're still early. The benchmark has gaps we know about and probably gaps we don't. But the alternative, deploying agent memory systems with no empirical measurement of whether they actually help, is how you end up with expensive infrastructure that makes your agents feel productive while they quietly lose track of what matters.

Context isn't a feature. It's the whole game. Time we started keeping score.