Three Versions of the Same Question
How the CueCrux retrieval engine evolved across three major versions, why V4.1 is fundamentally different from anything else in the market, and what the benchmarks actually show.
Every retrieval system answers the same fundamental question: given what the user asked, what should they see?
Most systems answer that question once. They embed the query, find the nearest documents, return the top results. The question is answered. Move on.
CueCrux answers it three different ways, depending on what's at stake. And the journey from the first version to the third is the story of discovering that "find the nearest documents" is the wrong question entirely.
The question that standard RAG answers
Standard retrieval-augmented generation works like this. A user submits a query. The system converts it into a vector embedding and searches a vector database for the most similar documents. It takes the top results, stuffs them into a prompt, and asks a language model to synthesise an answer.
This works well enough for simple cases. Ask about a specific topic and you'll probably get relevant documents back. The language model will produce a fluent, confident answer. Everyone moves on.
The problems show up gradually. A policy document gets updated but the old version still ranks higher because more text in the corpus references it. A chain of related documents (incident report, post-mortem, remediation, updated runbook) gets fragmented because each document uses different vocabulary. The system can't tell you whether the evidence it found is current, complete, or fragile. It certainly can't prove it.
These aren't edge cases. They're the normal operating conditions of any organisation that has more than a few hundred documents and has been operating for more than a few months.
CueCrux was built to handle these conditions, and the three retrieval versions represent three stages of understanding what that actually requires.
V1: the honest baseline
V1 is hybrid retrieval at its cleanest. Two lanes running in parallel: BM25 for keyword matching and vector similarity for semantic matching. Results from both lanes are fused using reciprocal rank fusion with a K factor of 60, producing a single ranked list.
The fusion weights are content-type aware. Standard text gets a 60/40 split favouring vector similarity. Structured documents (JSON, YAML) get an 80/20 split because their distinctive syntax makes keyword matching less effective. Informal content (chat logs, meeting notes) gets entity-enriched keyword indexing to compensate for conversational vocabulary that embeddings handle poorly.
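The fusion step described above can be sketched in a few lines. This is an illustrative reconstruction, not CueCrux's actual code: the lane names, weight table, and function signature are assumptions, while the K factor of 60 and the 60/40 and 80/20 splits come from the text.

```python
from collections import defaultdict

# Illustrative content-type weights (60/40 and 80/20 splits from the text).
LANE_WEIGHTS = {
    "text":       {"vector": 0.6, "bm25": 0.4},
    "structured": {"vector": 0.2, "bm25": 0.8},
}

def rrf_fuse(bm25_ranked, vector_ranked, content_type="text", k=60):
    """Reciprocal rank fusion: each lane contributes weight / (k + rank),
    and per-document contributions are summed into one ranked list."""
    w = LANE_WEIGHTS[content_type]
    scores = defaultdict(float)
    for rank, doc in enumerate(bm25_ranked, start=1):
        scores[doc] += w["bm25"] / (k + rank)
    for rank, doc in enumerate(vector_ranked, start=1):
        scores[doc] += w["vector"] / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears high in both lanes outscores one that tops a single lane, which is the property that makes the fusion robust to either lane's blind spots.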
V1 doesn't know about document relationships. It doesn't understand that a document has been superseded. It doesn't track whether the evidence it found is current or stale. It treats every document as equally valid and every query as equally simple.
And within those constraints, it works well.
On a clean corpus of 10,000 documents, V1 achieves supersession recall of 1.000 and causal chain recall of 1.000. Every expected document gets found. Every chain gets followed.
On an enterprise corpus of 25,000 documents across eight MIME types, things get more interesting. Supersession recall drops to 0.833 because cross-format supersession chains (a markdown policy replaced by a JSON configuration update) are harder to retrieve when the only signal is text similarity. Causal chain recall drops to 0.667 because vocabulary gaps between regulatory language, system architecture language, and operational incident language break the semantic link.
The degradation slope (how much precision drops per thousand documents added) is -0.020 per thousand with OpenAI embeddings and -0.010 with Nomic. At 10,000 documents, V1 retains 0.300 precision (OpenAI) or 0.400 (Nomic). At 25,000 documents it holds 0.200.
V1 is the baseline against which everything else is measured. It's also the proof that hybrid retrieval alone, however well-tuned, cannot solve the problems that matter in enterprise contexts. Finding documents by similarity is a solved problem. Knowing which documents are current, how they relate to each other, and whether the evidence is sufficient: that's something else entirely.
V3.1: the system starts to understand
V3.1 adds two capabilities that change the fundamental nature of what the engine can do: the artifact relations graph and the living state machine.
The artifact relations graph tracks how documents relate to each other. Not through text similarity, but through explicit typed relationships: supersedes, derived_from, cites, elaborates, implements, contradicts, is_version_of. These relationships are stored in a dedicated database table and can be traversed during retrieval.
This means the engine can find an amendment to a policy not because the amendment mentions the same keywords, but because the amendment is explicitly linked to the policy via an implements relationship. It can follow causal chains not because each document uses similar vocabulary, but because each document is linked to the next via derived_from.
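The difference between text similarity and explicit edges can be made concrete with a small sketch. The document IDs and the in-memory representation below are hypothetical (the text says relations live in a database table); only the relation types come from the source.

```python
# Hypothetical typed edges as (source, relation, target) triples.
RELATIONS = [
    ("postmortem-7",  "derived_from", "incident-7"),
    ("remediation-7", "derived_from", "postmortem-7"),
    ("runbook-v2",    "supersedes",   "runbook-v1"),
]

def follow_chain(start, relation, relations):
    """Walk a causal chain by explicit typed edges, not vocabulary overlap."""
    by_target = {(rel, dst): src for src, rel, dst in relations}
    chain = [start]
    while (relation, chain[-1]) in by_target:
        chain.append(by_target[(relation, chain[-1])])
    return chain
```

The traversal succeeds even when the incident report and the remediation plan share no vocabulary at all, which is exactly the case that broke V1's causal chain recall.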
The living state machine classifies every document into a lifecycle state: active, superseded, deprecated, contested, dormant, or stale. The classification is computed from the relation graph and temporal windows. When document B supersedes document A, A's living state transitions from active to superseded. When no new relations or references appear within a configurable window, a document can transition to dormant or stale.
This means the engine can answer questions about the present ("what does policy X currently require?") differently from questions about the past ("what did the team know when they made decision Y?"). It can reconstruct the knowledge state at any point in time.
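A minimal sketch of the classification logic, covering only a subset of the six states. The window length and function shape are assumptions; the state names and the rule that supersession and temporal windows drive transitions come from the text.

```python
from datetime import datetime, timedelta

STALE_WINDOW = timedelta(days=180)  # configurable window; this value is illustrative

def living_state(doc_id, relations, last_referenced, now):
    """Classify a document's lifecycle state from the relation graph plus
    temporal windows. relations: (source, relation, target) triples."""
    if any(rel == "supersedes" and dst == doc_id for _, rel, dst in relations):
        return "superseded"
    if any(rel == "contradicts" and dst == doc_id for _, rel, dst in relations):
        return "contested"
    if now - last_referenced > STALE_WINDOW:
        return "dormant"
    return "active"
```

The key design point is that state is computed, not stored: re-running the classification against the event log at an earlier cursor position is what makes point-in-time reconstruction possible.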
V3.1 also introduces two components that don't exist in standard retrieval systems.
MiSES (Minimum Information Sufficiency Evidence Selector) enforces answer quality constraints. Instead of returning the top K documents, MiSES selects the minimum set of evidence that satisfies domain diversity requirements. In verified mode, every answer must be supported by evidence from at least two independent source domains. MiSES also measures fragility: for each citation, it simulates removal and checks whether the remaining set still satisfies the diversity constraint. This produces a fragility score that tells the user how robust the conclusion is: a high score means removing a single piece of evidence could collapse the answer, while a low score means any one piece can be removed and the answer still holds.
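The fragility measurement lends itself to a leave-one-out sketch. The citation shape and the 0-to-1 scoring convention below are assumptions; the remove-one-and-recheck procedure and the two-domain constraint come from the text.

```python
def fragility(citations, min_domains=2):
    """Fraction of citations whose removal breaks the domain-diversity
    constraint. 0.0 = robust (any one citation can go); 1.0 = every
    citation is load-bearing. citations: (doc_id, domain) pairs."""
    def diverse(cites):
        return len({domain for _, domain in cites}) >= min_domains
    if not citations or not diverse(citations):
        return 1.0  # the answer fails the constraint outright
    fragile = sum(
        1 for i in range(len(citations))
        if not diverse(citations[:i] + citations[i + 1:])
    )
    return fragile / len(citations)
```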
CROWN receipts appear in V3.1 as the cryptographic proof layer. Every answer produces a receipt containing a BLAKE3 hash of the query, hashed citations, a knowledge-state cursor anchoring the receipt to the engine's append-only event log, and a fragility score. Receipts are signed with ed25519 keys and chained via parent snapshot IDs. Each receipt links to its predecessor, forming a tamper-evident history that can be verified independently.
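Hash chaining of receipts can be sketched with the standard library alone. Two loud caveats: `blake2b` stands in here for BLAKE3 (which is not in Python's stdlib), and the ed25519 signature step is omitted entirely, so this shows only the tamper-evident chaining, not the full CROWN scheme. All field names are assumptions.

```python
import hashlib
import json

def _digest(data: bytes) -> str:
    # Stand-in for BLAKE3; same role (fast cryptographic hash), different algorithm.
    return hashlib.blake2b(data, digest_size=32).hexdigest()

def make_receipt(query, citation_hashes, cursor, fragility, parent_id):
    """Build a receipt whose id commits to every field, chained via parent_id."""
    body = {
        "query_hash": _digest(query.encode()),
        "citations": sorted(citation_hashes),
        "cursor": cursor,        # position in the append-only event log
        "fragility": fragility,
        "parent": parent_id,     # previous receipt id, or None for the first
    }
    body["id"] = _digest(json.dumps(body, sort_keys=True).encode())
    return body

def verify_chain(receipts):
    """Recompute every receipt id and check every parent link resolves."""
    by_id = {r["id"]: r for r in receipts}
    for r in receipts:
        unsigned = {k: v for k, v in r.items() if k != "id"}
        if _digest(json.dumps(unsigned, sort_keys=True).encode()) != r["id"]:
            return False  # receipt contents were altered after issuance
        if r["parent"] is not None and r["parent"] not in by_id:
            return False  # broken chain
    return True
```

Because each id commits to the parent id, altering any historical receipt invalidates every receipt issued after it, which is what makes the history tamper-evident.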
The benchmark numbers tell the story of what these capabilities add.
On the enterprise corpus at 25,000 documents, V3.1 achieves temporal reconstruction accuracy of 96.6% (172 out of 178 lifecycle states correctly classified). The six misclassifications fall into three patterns: contested-to-superseded overwrites at lifecycle boundaries, active-to-superseded transitions near temporal window edges, and rapid succession updates where two documents arrive within seconds of each other. When we isolated these patterns in the v3 capability probes, the engine scored 12 out of 12. The misclassifications were caused by ambiguous relation graphs in the enterprise corpus, not by logic errors.
Receipt chain verification is effectively constant-cost in practice: at depth 50 (fifty receipts chained together), full verification takes 2 milliseconds, with every chain intact. The cryptographic proof holds regardless of how long the system has been running.
But V3.1 also reveals a problem that it cannot solve on its own. The degradation slope at enterprise scale is -0.012, slightly worse than V1's -0.010 with Nomic embeddings. Living state filtering removes superseded documents from consideration, which is correct, but it also removes documents that V1 would have surfaced. When the relation graph is incomplete (which it always is in the real world), living state can penalise recall.
Precision at 25,000 documents drops to 0.133, compared to V1's 0.200. V3.1 is smarter about which documents matter, but that intelligence comes at a cost when the information it relies on is imperfect.
V4.1: the engine that corrects itself
V4.1 is where the architecture becomes something genuinely different from anything else in the retrieval landscape.
The gap that V3.1 exposed was the gap between retrieval and citation. The engine could find the right documents. The living state machine could classify them correctly. MiSES could select a diverse, sufficient set. But the language model, the component that actually produces the answer, would cite the wrong documents.
Not randomly wrong. Systematically wrong. In predictable, repeatable ways.
When presented with multiple versions of a document, the LLM consistently cites the most comprehensive version regardless of what the user asked for. Ask for "the original access control policy" and the model cites v3.0 because v3.0 contains the most detail. Ask about both a decision and its implementation and the model cites only the decision, because decisions tend to be self-contained while implementations reference their parent.
V4.1 addresses this with three components that don't exist in any standard retrieval system.
The citation controller is a deterministic post-LLM repair layer. It runs after the language model has produced its answer and before the response is sent to the user. It catches two classes of error.
Version swap repair: the controller parses every citation for version family membership (e.g., "policy-acl-v1.0" belongs to the "policy-acl" family at version 1.0). It extracts version intent from the query, detecting explicit version references ("v1.0"), temporal markers ("original", "latest", "current"), and compound temporal phrases ("before the automated scanning was introduced"). When the LLM has cited the wrong version, the controller swaps it to the correct one from the same family.
Relation partner injection: the controller detects when a query asks about a pair of related documents (via fourteen linguistic patterns covering "rationale and execution", "design and runbook", "ADR and implementation", and similar constructs). When the LLM cites only one half of the pair, the controller adds the missing partner from the artifact relations graph.
No additional LLM calls, no inference cost, and negligible latency: the repair is deterministic pattern matching and graph traversal.
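The version swap repair can be illustrated with a simplified sketch. The regexes, the intent table, and the function shape are assumptions; the "policy-acl-v1.0" naming and the "original"/"latest" markers come from the text, and compound temporal phrases are left out here.

```python
import re

VERSION_RE = re.compile(r"^(?P<family>.+)-v(?P<ver>\d+(?:\.\d+)+)$")

# Simplified temporal markers; the real controller also handles compound
# phrases, which this sketch omits.
INTENT = {"original": min, "latest": max, "current": max}

def repair_version(query, cited_id, corpus_ids):
    """If the query's version intent disagrees with the cited version,
    swap the citation to the right member of the same version family."""
    m = VERSION_RE.match(cited_id)
    if not m:
        return cited_id  # not a versioned artifact; nothing to repair
    family = m.group("family")
    versions = {
        vm.group("ver"): cid
        for cid in corpus_ids
        if (vm := VERSION_RE.match(cid)) and vm.group("family") == family
    }
    explicit = re.search(r"v(\d+(?:\.\d+)+)", query)
    if explicit and explicit.group(1) in versions:
        return versions[explicit.group(1)]
    for marker, pick in INTENT.items():
        if marker in query.lower():
            best = pick(versions, key=lambda v: [int(p) for p in v.split(".")])
            return versions[best]
    return cited_id  # no version intent detected; leave the citation alone
```

Run against the failure mode described above, the repair swaps the LLM's comprehensive-but-wrong citation back to the version the user actually asked for.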
The evidence selector sits between reranking and the LLM. It solves a counterintuitive problem: giving the LLM more evidence makes it cite worse. When presented with 10 to 20 reranked candidates, the model drowns in options and cites whichever documents feel most comprehensive rather than whichever are most relevant. The evidence selector groups candidates by artifact, takes the top-scoring representative from each, and fills remaining slots by overall score. This narrows the LLM's context from 10-20 candidates to 4-6, and the effect on citation quality is dramatic.
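The two-pass selection reads naturally as code. The candidate tuple shape and the slot count default are assumptions; the group-by-artifact-then-fill-by-score logic comes from the text.

```python
def select_evidence(candidates, max_slots=6):
    """Narrow 10-20 reranked candidates to a small, artifact-diverse set.
    candidates: (doc_id, artifact_id, score) tuples."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    chosen, seen_artifacts = [], set()
    # Pass 1: keep the top-scoring representative of each artifact.
    for doc, artifact, score in ranked:
        if artifact not in seen_artifacts:
            chosen.append(doc)
            seen_artifacts.add(artifact)
    # Pass 2: fill any remaining slots by overall score.
    for doc, _, _ in ranked:
        if len(chosen) >= max_slots:
            break
        if doc not in chosen:
            chosen.append(doc)
    return chosen[:max_slots]
```

The effect is that near-duplicate chunks of the same artifact stop crowding out other artifacts, so the LLM sees breadth first and depth only where slots remain.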
Surface routing gives the engine two fundamentally different retrieval strategies for different types of queries. Precision queries go through standard hybrid retrieval and reranking. Broad queries are decomposed into sub-queries by the LLM, each sub-query runs its own retrieval, and results are fused across sub-queries before being filtered through an admission controller. After admission, the engine splits results into a coverage set (everything found, surfaced in the receipt) and an answer set (curated representatives, given to the LLM). This separation means the receipt proves everything the engine found, while the answer reflects only what the engine judged most relevant.
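The broad-query lane described above can be sketched as a pipeline of injected stages. Every callable signature here is an assumption; only the stage order (decompose, retrieve per sub-query, fuse, admit, split) comes from the text.

```python
def answer_broad(query, llm_decompose, retrieve, fuse, admit, select):
    """Broad-query lane: decompose into sub-queries, retrieve each, fuse,
    filter through admission, then split into coverage and answer sets."""
    sub_queries = llm_decompose(query)
    fused = fuse([retrieve(sq) for sq in sub_queries])
    admitted = admit(fused)
    coverage_set = admitted        # everything found: surfaced in the receipt
    answer_set = select(admitted)  # curated representatives: given to the LLM
    return coverage_set, answer_set
```

The coverage/answer split is the structural point: the receipt can prove completeness while the LLM context stays small enough to cite well.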
The benchmark numbers show what V4.1 achieves.
On the enterprise corpus at 25,000 documents, V4.1's degradation slope is -0.008 per thousand. That's the flattest of any version. It translates to real numbers: precision at 25,000 documents is 0.233, compared to V3.1's 0.133 and V1's 0.200. V4.1 retains 75% more precision than V3.1 at maximum scale.
On the v4 audit suite (1,127 documents, 462 queries, 13 categories), the final Phase 7.0 results across three consecutive runs:
- Supersession recall: 1.000
- Relation-bootstrapped recall: 1.000
- Format-aware retrieved recall: 1.000 (across all six MIME types)
- Temporal reconstruction: 100% on edge cases
- Receipt chain integrity: 10 out of 10 chains intact, 2ms verification at depth 50
- Fragility monotonicity: confirmed
- Broad query recall: 0.927
- Proposition precision at rank 1: 0.963 (77 out of 80)
- Semantic dedup effectiveness: 1.000
- Multi-hop chain completeness: 0.933 to 0.967
- Multi-document broad recall: 0.927
- Hard-negative version precision: 1.000
- Hard-negative parent-child recall: 1.000
- Adversarial recall: 0.818
Zero variance on precision, dedup, broad recall, parent-child recall, and adversarial recall across all three runs. The only categories with any variance are those involving LLM citation selection (Categories 2, 3, and 10), and even there the variance is plus or minus 0.03.
The citation controller alone accounts for a 125% improvement in version precision (0.444 to 1.000) and a 116% improvement in parent-child recall (0.462 to 1.000). These aren't retrieval improvements. The engine already found the right documents. These are citation accuracy improvements, fixing the LLM's systematic biases without additional model calls.
Why V4.1
We tested V1, V3.1, and V4.1 side by side across every audit suite. Every suite runs all three modes, producing a direct comparison.
On clean text with friendly queries, the versions are close. V1 works. V3.1 is comparable. V4.1 shows its advantage in degradation slope but the absolute numbers are similar at small scale.
On enterprise data with real-world query patterns, the gap opens. V3.1's living state adds temporal intelligence but at the cost of recall when the relation graph is imperfect. V4.1 adds the same temporal intelligence while recovering the recall loss through surface routing and evidence selection.
On adversarial queries (Category 12), the gap becomes a chasm. V1 has no concept of versions or relationships, so it cites whatever is most similar. V3.1 can filter by living state but can't correct the LLM's version confusion. V4.1 finds the right documents, corrects the LLM's citations deterministically, and produces a receipt proving exactly what it did.
The production decision came down to degradation slope and citation accuracy. V4.1 degrades at -0.008 per thousand documents, the flattest of any version. It achieves perfect version precision and perfect parent-child recall on the adversarial corpus. And it does both while maintaining or exceeding V1's performance on every other metric.
Why we went GPU
The retrieval pipeline runs on commodity hardware. The database is Postgres with pgvector on a dedicated server (i9-13900, 192GB DDR5, 2x1.92TB NVMe RAID-1). The application layer runs on standard cloud instances.
The GPU exists for three things: embedding, reranking, and future model inference.
Embedding is the foundational operation. Every document that enters the system gets converted to a 768-dimensional vector. Every query gets embedded before retrieval can begin. The production embedding model (Nomic embed-text-v1.5) runs on an RTX 4000 SFF Ada with 20GB of VRAM via the TEI (Text Embeddings Inference) runtime. Alongside it, a secondary embedding model (bge-m3 at 1024 dimensions) runs for the verified retrieval lane, and a cross-encoder reranker (BGE-reranker-v2-m3) runs for post-retrieval scoring.
The GPU decision was driven by the embedding bake-off results. We tested OpenAI's API-hosted embedding against self-hosted Nomic across every audit suite. Nomic matched or exceeded OpenAI on every metric and showed a 2x advantage in degradation slope on clean text corpora. At 10,000 documents, Nomic retained 0.400 precision versus OpenAI's 0.300.
Self-hosting also eliminates a dependency. API-hosted embeddings mean every ingest operation and every query depends on a third-party service. A rate limit, an outage, or a model deprecation affects the entire pipeline. With a dedicated GPU, the embedding service is as available as the rest of the infrastructure.
The reranker adds marginal quality (roughly one additional correct result per eighty queries in precision benchmarks) but its real value is in the verified and audit retrieval modes where evidence quality matters more than latency. In light mode, reranking is skipped entirely.
The cost of a dedicated GPU is fixed and predictable. The alternative (API-hosted embeddings and reranking) scales linearly with query volume and corpus size. For a system designed to handle living, growing enterprise knowledge bases, the self-hosted approach is both cheaper at scale and more reliable.
What the benchmarks show
The numbers tell a specific story, but it's worth being explicit about what they show and what they don't.
What they show:
The CueCrux engine retrieves the right documents with high consistency. Retrieved recall is 1.000 across all MIME types, all relation types, and all query patterns. The engine finds what it's supposed to find.
The engine degrades gracefully under scale. At 25,000 documents, V4.1 retains 0.233 precision, losing only 0.008 precision per thousand documents added. This is the flattest degradation curve we've measured.
The citation controller corrects systematic LLM errors deterministically. Version precision goes from 0.444 to 1.000. Parent-child recall goes from 0.462 to 1.000. These corrections are reproducible: 470 shadow replay records produce identical output every time.
The system is stable. Three consecutive full-suite runs produce zero variance on the metrics that matter most. The stochastic element (LLM citation selection) is bounded to plus or minus 0.03 on the categories where it appears.
What they don't show:
These benchmarks run against synthetic corpora. The Meridian Financial Services corpus is designed to be realistic, with heterogeneous formats, overlapping concerns, and adversarial edge cases, but it is not a real customer's knowledge base. Production corpora will have patterns we haven't anticipated.
The benchmarks test retrieval and citation quality, not end-to-end answer quality. The LLM's synthesis of cited evidence into a coherent answer is assessed only indirectly, through whether it cited the right documents.
The system has not been tested at 100,000 or 1,000,000 documents. The degradation slope predicts behaviour at those scales, but extrapolation is not measurement.
Why this shouldn't be compared to other systems
Standard RAG benchmarks measure retrieval accuracy: did the system find the right documents? CueCrux measures that, but it also measures things that standard benchmarks don't account for.
Temporal correctness. Does the system know which version of a document is current? Can it reconstruct the knowledge state at a past point in time? Standard RAG has no concept of document lifecycle.
Citation accuracy. After the LLM receives the right documents, does it cite the right ones in its answer? Most systems don't measure this at all. They assume that if the right documents are in the context, the answer will cite them. Our data shows that assumption is wrong roughly half the time for versioned documents.
Evidence sufficiency. Does the answer draw from enough independent sources? Is the answer fragile, dependent on a single piece of evidence, or robust, supported by multiple domains? Standard RAG returns top-K and hopes for the best.
Cryptographic provenance. Can the answer be independently verified? Can a third party confirm what evidence was used, that it was current, and that the record hasn't been altered? No standard RAG system even attempts this.
Comparing CueCrux to a standard RAG system on retrieval accuracy alone would be like comparing a car to a bicycle on speed and declaring them equivalent because both can travel at 15 miles per hour. The car also has brakes, seatbelts, and a navigation system. Those don't show up in a speed comparison, but they matter enormously when you're carrying passengers.
The meaningful comparison isn't "does it find the right documents?" The meaningful comparison is "can you prove what it found, that what it found was current, and that the proof hasn't been tampered with?"
Today, CueCrux is the only system that can answer yes to all three.
Living data at scale
There's a distinction worth drawing between static corpora and living knowledge bases.
A static corpus is a collection of documents that rarely changes. Academic papers. Published standards. Historical records. Standard RAG was designed for static corpora, and it works reasonably well there because the problems it ignores (supersession, temporal drift, lifecycle management) don't apply when documents never change.
A living knowledge base is different. Policies get updated. Decisions get revisited. Incidents trigger post-mortems that trigger remediation plans that trigger updated runbooks. Documents are born, they're active, they get contested by newer information, and eventually they're superseded. A living knowledge base is a moving target.
CueCrux's engine is designed for living data. The living state machine tracks where each document is in its lifecycle. The relation graph tracks how documents connect to each other. Receipts anchor answers to a specific point in the knowledge timeline, so that an answer given today can be verified against the evidence that existed today, even if that evidence changes tomorrow.
The enterprise corpus we tested against, Meridian Financial Services, is designed to behave like a living knowledge base. Documents have creation dates, supersession chains, and lifecycle transitions. At 25,000 documents with eight MIME types, the engine maintains temporal reconstruction accuracy of 96.6%. The six misclassifications out of 178 all occur at lifecycle boundaries where the correct classification is genuinely ambiguous.
The question isn't how much data the system can hold. Postgres with pgvector scales to millions of vectors. Qdrant scales horizontally. The question is how much living, changing, contradicting, superseding data the system can track correctly. At 25,000 documents with active lifecycle management, the answer is 96.6% temporal accuracy and a degradation slope of -0.008 per thousand. If that slope holds, the system retains usable precision well into six-figure corpus sizes.
Where V4.1 goes from here
V4.1 is the production baseline. Twenty-three feature flags frozen in a config manifest. Three consecutive 13/13 runs. But it's not finished.
The citation recall gap between markdown and structured formats (1.000 versus 0.670) is a known limitation. The engine finds structured documents with perfect recall, but the LLM cites them less reliably. Format-aware citation prompting is the next step: giving the LLM explicit instructions about how to reference JSON, YAML, and CSV evidence.
Multi-lane retrieval is currently disabled. We tested it, measured it, and found that for the current corpus size it dilutes quality rather than improving it. The feature exists, gated behind a flag, ready for when the corpus grows large enough that multiple parallel embedding models add value rather than noise. The architecture is designed for that moment but isn't forcing it prematurely.
The adversarial corpus will keep growing. Category 12v2 was added during Phase 7.0 with adversarial queries specifically designed to confuse the citation controller. The system handles them at 0.818 recall against a 0.70 threshold. We'll keep adding harder cases. The point of an adversarial test suite is that it's never done.
And the shadow replay framework, capturing every citation decision for offline analysis, means that when the system encounters query patterns in production that it hasn't seen in the audit suite, we'll know. Not because someone files a bug report. Because the deterministic replay tells us exactly what the citation controller would have done differently.
V4.1 is not the final version. It's the first version where the engine can prove what it did, and where the proof holds up under sustained adversarial testing.
That's the version you build a cryptographic protocol on top of.
For the full testing story behind V4.1, see Thirteen Out of Thirteen. For what happens end to end when you ask the engine a question, including the monitoring that begins after the answer arrives, see What Happens When You Hit Submit.