2026-03-24 · cuecrux · corecrux · memorycrux · context · agents · architecture · memory

It's All About the Context

What happens when your AI agent forgets everything between sessions, your master plans go stale, and your context window fills up with things that don't matter. A story about building CueCrux and learning, the hard way, that the real bottleneck was never intelligence.

I have forty-seven master plans.

That number isn't a flex. It's a symptom. Over the past year of building CueCrux, I've written master plans for the retrieval engine, the cryptographic protocol, the security model, the ops layer, the GPU infrastructure, the embedding pipeline, the data quality pipeline, the pricing model, the community strategy, the integration architecture, the hardware layout, the agent framework, the white-label canvas, the reasoning desktop, the knowledge plane, and a half-dozen more I'm probably forgetting. Each one versioned. Each one superseding a prior version. Some of them superseding versions that were themselves superseded before they were ever fully implemented.

This is not a documentation problem. It's a context problem. And it's the same problem, at a different scale, that every AI agent faces every time it starts a new session.

The ritual

Here's what happens at the start of every coding session on CueCrux.

I open a terminal. I start an agent. The agent reads CLAUDE.md, which tells it to read PlanCrux/README.md and PlanCrux/buildguide.md. Those documents reference the Feature Registry, which indexes every capability across the portfolio. They reference the port registry, which maps every service to its address. They reference the test strategy, the delivery cycle, the progress tracker.

The agent dutifully reads all of this. It now has context. Good context, carefully maintained, well-structured. And it fills a significant portion of the context window before I've asked a single question.

If I'm working on the retrieval engine, the agent then needs to read the Engine's CLAUDE.md, which references its own set of conventions. It needs to know about the audit suite, the thirteen categories, the current phase, the feature flags, the config manifest. It needs to know that FEATURE_QDRANT_READ must be true or surface routing fails silently. It needs to know that the semantic chunker made things worse, not better, and why.

By the time the agent has enough context to be useful, a quarter of the window is gone. Sometimes more. And most of what it loaded won't be relevant to the specific task at hand. But the agent can't know that in advance, because knowing what's relevant requires the context that knowing what's relevant would have saved.

This is the context paradox. You need context to know what context you need.

What stale looks like

Now here's the part nobody talks about.

Those forty-seven master plans? At any given moment, roughly a third of them contain information that's no longer accurate. Not because anyone was careless. Because the system is alive. Decisions get made in implementation that contradict the plan. A debugging session reveals that an assumption was wrong. A Phase 6.3 evidence selector fix changes the dynamics of how evidence reaches the LLM, which invalidates the Phase 6.2 documentation about admission controller tuning.

I know this because I've lived it. In early March, an agent working on the audit suite loaded the Data Quality Pipeline master plan and spent twenty minutes trying to enable semantic chunking features that we'd empirically proven make the system worse. The plan said to enable them. The audit results said not to. The plan was newer. The audit results were more recent. The agent trusted the document over the evidence, because documents are what agents know how to read.

In another session, an agent loaded configuration from an ExecPlan that referenced FEATURE_MULTI_LANE_RETRIEVAL=true. That feature had been ablated and disabled two weeks earlier. The agent dutifully set the flag, ran the audit, got 8/13, and spent forty minutes diagnosing what was "wrong" with the retrieval pipeline. Nothing was wrong with the pipeline. The plan was stale.

Stale context doesn't announce itself. It looks exactly like current context. It has the same formatting, the same confidence, the same level of detail. The only difference is that it's wrong. And an agent that can't distinguish current from stale will, with full conviction, execute plans that the humans who wrote them have already abandoned.

The memory wall

The industry calls this the agent memory wall. AI agents are measured in sessions. Sessions last hours at most. But the projects they work on span months. The institutional context that makes a senior engineer valuable (which decisions were made and why, which approaches were tried and failed, which infrastructure is production and which is a test copy) has no machine-readable representation that persists between sessions.

Every new session starts from zero. The agent re-reads the documentation. It re-loads the configuration files. It re-discovers the conventions. And it does all of this without any knowledge of what the previous session discovered, what decisions were made, what worked and what didn't.

I've measured this. On CueCrux, an agent starting a fresh session on the retrieval engine loads approximately 15,000 tokens of context before it can do anything useful. That's documentation, configuration, file structure, conventions. About 4,000 of those tokens are directly relevant to whatever today's task is. The other 11,000 are insurance, loaded because the agent can't predict what it will need.

Now multiply that by a system with multiple components. The agent working on the Engine needs context about CoreCrux's event spine because retrieval depends on Projections. It needs context about the GPU infrastructure because embeddings are computed there. It needs context about the audit suite because any change must be validated against thirteen categories. It needs context about the citation controller, the evidence selector, the MiSES filter, the surface routing architecture.

The context window becomes a loading dock. Everything gets piled on at the start, arranged as neatly as possible, and then the agent works within whatever space is left. And the things that are most likely to be stale (older plans, superseded decisions, abandoned approaches) are the hardest to filter out, because filtering requires knowing they're stale, which requires the institutional memory that the agent doesn't have.

What I actually wanted

Let me describe what I wanted, before I describe what we're building.

I wanted to start a session and have the agent know three things without being told:

  1. What is the current state of the system? Not what the plans say. What actually shipped. What's deployed. What the last audit run showed.
  2. What changed since the last session? Not a git log. A contextual diff. "The admission controller thresholds were reverted because broad recall dropped. Phase 7.2 is now the quality baseline, not 7.0. Cat 6 and Cat 11 are promoted back to required."
  3. What does this specific task need? Not everything. Not the full architecture. Just the context that this particular piece of work depends on, and nothing else.

That's it. Three things. And getting them right requires solving problems that the entire AI industry is struggling with.

How stale documents become dangerous

Let me tell you a specific story, because abstractions hide the damage.

In mid-March, I was working on Phase 7.2 of the retrieval engine. The ExecPlan for 7.2 had five milestones. Milestone 0 was a targeted fix for Category 6 fragility scoring. The plan said M0 alone should be sufficient to hit 13/13.

It wasn't. M0 deployed, and the audit came back 1/3 on the SLO. One out of three runs passing. The plan was wrong. Not maliciously, not carelessly, but because the plan was written based on Phase 7.0 assumptions and the system had moved to 7.1 in between, which changed the baseline dynamics.

An agent executing that plan in isolation would have diagnosed a regression. It would have spent time investigating what went wrong with M0. It might have proposed rollback. What it would not have known is that the plan's prediction was based on a prior phase's data, and the right move was to proceed to M1, which the combined M0+M1 deployment proved: 13/13 across three consecutive runs, broad_recall at 0.722.

The gap between "what the plan says" and "what the system knows" is where agents fail. And it's the same gap that exists between documents in a traditional knowledge base and the living state of the things those documents describe.

Where CoreCrux enters

This is the problem that CoreCrux was built to solve. Not for agents specifically. We designed it before the agent memory problem was this acute. But for knowledge in general.

CoreCrux is an append-only event spine. Every state change is an event. Events are immutable. The current state is derived by replaying events, not by reading a mutable row. And four Projections maintain continuously updated views of what matters: which artifacts are alive, how they're connected, what pressure they're under, and what depends on them.

When version 2.0 of a master plan supersedes version 1.0, that relationship is a first-class entity in CoreCrux. It's not metadata. It's not a comment. It's an event that says "this document now supersedes that document", and the Living State Projection updates accordingly. Version 1.0 moves from active to superseded. Version 2.0 becomes the current truth.
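To make the mechanism concrete, here is a minimal event-sourcing sketch in Python of how a supersession event might drive a living-state view. The event shapes, field names, and the `Projection` class are illustrative assumptions, not CoreCrux's actual API.

```python
# Hypothetical sketch: derive a document's living state by replaying an
# append-only event log. Events are never mutated or deleted; the current
# state is whatever replaying them produces.
from dataclasses import dataclass, field

@dataclass
class Projection:
    # artifact id -> "active" or "superseded"
    status: dict = field(default_factory=dict)

    def apply(self, event: dict) -> None:
        if event["type"] == "artifact_created":
            self.status[event["id"]] = "active"
        elif event["type"] == "supersedes":
            # The new version becomes current truth; the old one is
            # marked superseded, never removed from history.
            self.status[event["new"]] = "active"
            self.status[event["old"]] = "superseded"

def replay(events: list[dict]) -> Projection:
    proj = Projection()
    for e in events:  # state is derived, never edited in place
        proj.apply(e)
    return proj

events = [
    {"type": "artifact_created", "id": "plan-v1"},
    {"type": "artifact_created", "id": "plan-v2"},
    {"type": "supersedes", "new": "plan-v2", "old": "plan-v1"},
]
state = replay(events)
# state.status: plan-v1 is now "superseded", plan-v2 is "active"
```

Because the supersession is an event rather than a note in a document, any consumer replaying the log arrives at the same answer about which version is current.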

When a decision is made during implementation that contradicts the plan, like when we disabled multi-lane retrieval after the ablation proved it diluted quality, that's a Pressure Event against the plan document. The plan is under pressure. Its current state may not be sustainable. Someone (or something) should probably look at it.

This isn't theoretical. We have this infrastructure. CoreCrux runs on an RTX 4000 SFF Ada with 20GB of VRAM, processing Projection updates across four shard streams. Every artifact in the system has a living state. Every relationship between artifacts is tracked. Every pressure signal is recorded with a lifecycle: observed, acknowledged, action taken, resolved.
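The lifecycle above can be sketched as a simple forward-only state machine. The class name and transition method are hypothetical, not CoreCrux's schema.

```python
# Illustrative sketch of the pressure-signal lifecycle:
# observed -> acknowledged -> action_taken -> resolved.
LIFECYCLE = ["observed", "acknowledged", "action_taken", "resolved"]

class PressureSignal:
    def __init__(self, target: str, reason: str):
        self.target = target   # the artifact under pressure, e.g. a plan
        self.reason = reason
        self.state = "observed"

    def advance(self) -> str:
        # Only forward transitions are allowed; a signal never un-resolves.
        i = LIFECYCLE.index(self.state)
        if i < len(LIFECYCLE) - 1:
            self.state = LIFECYCLE[i + 1]
        return self.state

sig = PressureSignal(
    "dq-pipeline-plan",
    "multi-lane retrieval disabled after ablation",
)
sig.advance()  # acknowledged
sig.advance()  # action_taken
sig.advance()  # resolved
```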

The missing piece was making this information available to agents at the right time, in the right shape, without filling the context window with everything the system knows.

The MemoryCrux answer

MemoryCrux is the layer that bridges the gap. It sits between AI agents and the organisational context they need to act safely. And it solves the three problems I described:

Current state, not planned state. When an agent asks "what is the current configuration of the retrieval engine?" MemoryCrux doesn't return the master plan. It queries CoreCrux's Living State Projection and returns the artifacts that are currently active, with their relationships and any unresolved pressure signals. If the plan says one thing and the deployed state says another, the agent gets the deployed state. The plan's divergence shows up as a pressure signal, not as authoritative context.

Contextual diff, not changelog. The get_relevant_context() tool doesn't return everything. It takes a task description, assesses what knowledge domains the task touches, and returns a budget-aware payload: only the context that's relevant, compressed to fit within a token budget the caller specifies. An agent starting a session on Category 8 precision tuning gets context about the proposition precision corpus, the evidence selector, and the LLM prompt style. It does not get context about GPU infrastructure, pricing models, or community strategy. The tool knows what to exclude because it knows what changed and what the task needs.
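A toy sketch of what budget-aware selection might look like, with crude keyword overlap standing in for real relevance assessment. The function signature, artifact shape, and scoring are assumptions for illustration, not MemoryCrux's actual tool.

```python
# Hypothetical sketch of a budget-aware context selector: score artifacts
# against the task, then pack the most relevant ones into a token budget.
def get_relevant_context(task: str, artifacts: list[dict], budget: int) -> list[dict]:
    task_words = set(task.lower().split())
    # Crude relevance: keyword overlap between the task and artifact tags.
    scored = sorted(
        artifacts,
        key=lambda a: len(task_words & a["tags"]),
        reverse=True,
    )
    payload, used = [], 0
    for art in scored:
        if not task_words & art["tags"]:
            break  # stop once nothing relevant remains
        if used + art["tokens"] > budget:
            continue  # skip anything that would blow the budget
        payload.append(art)
        used += art["tokens"]
    return payload

artifacts = [
    {"id": "evidence-selector", "tags": {"precision", "evidence"}, "tokens": 900},
    {"id": "gpu-infra",         "tags": {"gpu", "embeddings"},     "tokens": 1200},
    {"id": "precision-corpus",  "tags": {"precision", "corpus"},   "tokens": 700},
]
ctx = get_relevant_context("tune category 8 precision", artifacts, budget=2000)
# Only the precision-related artifacts make it in; gpu-infra is excluded.
```

The point of the sketch is the shape of the contract: the caller names the task and the budget, and irrelevant domains never enter the window at all.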

Coverage gaps, not false confidence. The assess_coverage() tool does something no current platform tool can do: it tells the agent what it doesn't know. "You have 47 artifacts about the Engine codebase but zero about the staging deployment pipeline. Your task description mentions deployment. There is a coverage gap." This is the difference between an agent that proceeds confidently with incomplete information and an agent that knows where its blind spots are.
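In the same spirit, a coverage check can be sketched as comparing the domains a task mentions against the artifact counts behind them. The domain names, keyword matching, and report shape are illustrative assumptions.

```python
# Hypothetical sketch of a coverage assessment: report the knowledge
# domains a task touches that have no artifacts backing them.
def assess_coverage(task: str, artifact_counts: dict[str, int]) -> dict:
    task_words = set(task.lower().split())
    gaps = [
        domain for domain, count in artifact_counts.items()
        if domain in task_words and count == 0
    ]
    covered = [d for d, count in artifact_counts.items() if count > 0]
    return {"gaps": gaps, "covered": covered}

counts = {"codebase": 47, "deployment": 0}
report = assess_coverage("update the deployment pipeline for the codebase", counts)
# report["gaps"] names "deployment": the agent now knows its blind spot.
```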

The constraint surface

But there's a harder problem than stale documents, and it's the one that convinced me MemoryCrux needed to exist.

The knowledge that prevents agents from causing damage is almost never written down. It's the engineer who knows which database is production. The operator who knows that the vSwitch VLAN is shared with another tenant's GPU node. The architect who knows that the admission controller thresholds were tuned for a specific corpus distribution and will break if the corpus changes shape.

I know this because I am that engineer. When I'm in a session, I carry context that no document captures. I know that host.docker.internal resolves to the WSL2 host but network_mode: host puts you on the Docker VM, not WSL2. I know that empty QDRANT__SERVICE__API_KEY="" doesn't disable auth. It enables auth with an empty key, which rejects everything. I know that the Vault Transit backend doesn't support prehashed: true for ed25519 keys, even though the API accepts the parameter without error.

These are the kinds of things that agents discover during sessions and then forget. Every one of them was learned the hard way. Every one of them cost hours. And every one of them will cost hours again when the next session starts from zero, unless there's a system that captures them.

MemoryCrux's Organisational Constraints are the mechanism for encoding this knowledge. A constraint is a declared boundary ("db-prod-01 is the production database; never run destructive commands against it") with a scope, a severity, and a lifecycle. Constraints use inverted retrieval: they're not surfaced when an agent queries for information. They're checked when an agent proposes an action. The system asks "are there any constraints whose scope intersects with what this agent is about to do?" regardless of whether the agent thought to ask.
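Inverted retrieval can be sketched as a check that runs at action time rather than query time. The constraint fields and the scope matching below are illustrative assumptions, not MemoryCrux's schema.

```python
# Hypothetical sketch of inverted retrieval: constraints are matched
# against a proposed action's targets, whether or not the agent asked.
CONSTRAINTS = [
    {
        "scope": "db-prod-01",
        "severity": "critical",
        "rule": "never run destructive commands against the production database",
    },
]

def check_action(action: dict) -> list[dict]:
    # Return every constraint whose scope intersects the action's targets.
    return [c for c in CONSTRAINTS if c["scope"] in action["targets"]]

action = {"verb": "DROP TABLE", "targets": ["db-prod-01"]}
violations = check_action(action)
# violations is non-empty, so the action is stopped before it runs,
# even though the agent never queried for anything about db-prod-01.
```

The inversion is the whole trick: the agent doesn't need to know the constraint exists for the constraint to protect the system.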

And in v2.1, agents can suggest constraints they discover during work. When an agent figures out that FEATURE_QDRANT_READ must be true or Cat 7 false-fails, it can propose that as a constraint. The suggestion enters a review queue. A human promotes it to an active constraint, or doesn't. But the knowledge doesn't die with the session.

What changes

I want to be measured about what this means, because the industry has enough hype and not enough honesty.

MemoryCrux doesn't make agents intelligent. It doesn't give them judgment. It doesn't replace senior engineers. What it does is give agents access to the same institutional context that makes senior engineers effective, in a form they can consume at the moment they need it, without loading everything into the context window and hoping the right pieces surface.

The difference is measurable. An agent with coverage assessment can identify a gap before making a decision that depends on missing information. An agent with constraint checking can avoid a destructive action that no one told it was dangerous. An agent with contextual diffs can pick up where the last session left off instead of reconstructing from scratch.

None of these are magic. They're plumbing. They're the infrastructure that lets agents operate in months-long projects instead of hours-long sessions, the same way that an append-only event spine lets a database track lifecycle instead of just current state.

The quiet realisation

Here's what I've learned from building this system over the past year.

The bottleneck was never intelligence. The models are smart enough. They can reason, plan, decompose, synthesise. The bottleneck was always context. What the agent knows. When it knows it. Whether what it knows is still true. Whether there's something it needs to know that it doesn't know it doesn't know.

Forty-seven master plans, and the thing that actually determines whether a session succeeds is not which plan the agent reads. It's whether the agent can tell the difference between a plan that reflects reality and a plan that reflects what we hoped reality would be three weeks ago.

Context windows will keep getting larger. Models will keep getting smarter. But without a system that tracks what's alive, what's connected, what's under pressure, and what depends on what, without a system that knows the difference between a document and the truth, larger windows just mean more room for stale information to hide.

CoreCrux gives us the living state. MemoryCrux gives agents access to it. And the combination means that for the first time, an agent starting a session on CueCrux can know not just what was planned, but what actually happened, what changed, and what it still needs to find out.

It's not the flashiest thing we've built. But it might be the thing that makes everything else actually work.