Thirteen Out of Thirteen
The full story of testing CueCrux's retrieval engine, from four categories to thirteen, from clean text to adversarial hard-negatives, and why submitting CROWN to the SCITT working group feels like the right next step.
In the near future we will be submitting CROWN to the IETF SCITT working group.
CROWN is a protocol for producing cryptographic receipts that anchor AI-assisted decisions to the evidence that informed them. It's been in development for nine months, quietly evolving alongside the retrieval engine it was designed to prove. But before I talk about the submission, I want to talk about the testing that made it possible. All of it. From the beginning.
Because the protocol only matters if the system underneath it holds up. And proving that the system holds up, really proving it, not benchmarking it on friendly data and calling it done, turned out to be the hardest, most rewarding engineering challenge we've faced.
Where it started
Every retrieval system begins the same way. You ingest some documents, you run some queries, you look at the results. If the right document comes back near the top, you call it working.
That's where we were in early March. The engine was functional. It could retrieve, rank, and cite. But "functional" isn't a standard you can build a cryptographic protocol on top of. If CROWN receipts are going to prove what evidence a system used, the retrieval underneath had better be provably reliable. Not anecdotally. Not on curated demos. Provably.
So we started building an audit suite. Not a benchmark. Benchmarks tell you how a system performs on average. An audit tells you where it breaks.
The first four categories
The v1 suite was simple. Four categories, around 40 clean text documents each, scaling up to 10,000 with deterministic noise.
Supersession accuracy. When a newer document replaces an older one, does the engine know? If someone asks about a policy and the original has been superseded by version 3, does the engine surface version 3 and rank it higher than version 1? This sounds trivial. It isn't. Most retrieval systems treat every document as equally current.
Causal chain retrieval. When documents form a sequence, where an incident report leads to a post-mortem, the post-mortem leads to a remediation plan, and the remediation plan leads to an updated runbook, can the engine follow the chain? If you ask about the original incident, do you get all the downstream consequences?
Corpus degradation. What happens as the corpus grows? At 100 documents, everything works. At 1,000, things start to blur. At 10,000, retrieval systems that seemed fine at small scale quietly fall apart. We measured the degradation slope: how much precision drops per thousand documents added.
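The degradation slope is just a least-squares fit of precision against corpus size. A minimal sketch (the measurement points here are hypothetical, chosen to land near the v1 slope of -0.020):

```python
def degradation_slope(points):
    """Least-squares slope of precision vs corpus size (in thousands).

    `points` is a list of (docs_in_thousands, precision) pairs.
    A slope of -0.020 means precision drops 0.020 per thousand docs added.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

# Hypothetical measurements: precision at 100, 1,000, and 10,000 documents.
slope = degradation_slope([(0.1, 0.90), (1, 0.88), (10, 0.70)])
```

A single number like this makes regressions comparable across suites: a slope twice as steep means the corpus punishes growth twice as hard.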
Temporal reconstruction. Documents have lifetimes. They're created, become active, get contested by newer information, and are eventually superseded. Can the engine reconstruct this lifecycle correctly? When three documents are all "active" at different points in time, does the system understand which one is current right now?
The v1 suite ran in nine minutes. 12/12 passed. But 12/12 on clean text with friendly queries means almost nothing. Real corpora aren't clean. Real queries aren't friendly.
Enterprise scale: the v2 corpus
We built a synthetic enterprise corpus called Meridian Financial Services. 550 base documents across eight MIME types: markdown, JSON, YAML, CSV, plain text chat logs, meeting notes, technical specs, policy documents. Twenty employees. Fifteen microservices. Ten projects. Scaled to 25,000 documents with deterministic noise.
This is closer to what a real organisation's knowledge base looks like. Not a tidy collection of well-formatted articles, but a messy pile of formats, authors, overlapping concerns, and contradictory information.
The v2 results told us something important. The engine handled enterprise scale better than clean text. The degradation slope at 25,000 documents was -0.008 per thousand, compared to v1's -0.020. The heterogeneous formats produced more distinctive embeddings, which actually improved discrimination under noise. Enterprise mess, it turned out, was easier to search than artificial tidiness.
But supersession recall dropped from 1.000 to 0.750. Cross-format supersession chains, such as a markdown policy superseded by a JSON configuration update, were harder to retrieve than same-format chains. Temporal reconstruction hit 96.6%, with six misclassifications out of 178, mostly at lifecycle boundaries where "contested" and "superseded" blur together.
We were learning where the edges were.
The capability probes: v3
The v3 suite added six focused categories designed to isolate specific engine capabilities.
Relation-bootstrapped retrieval. When a document has an amendment filed against it, does the engine find the amendment through the relation graph, not just through text similarity? This tests whether the engine understands document relationships as first-class entities or just treats everything as isolated chunks.
Format-aware ingestion recall. Can the engine retrieve documents in every supported format? Markdown, JSON, CSV, YAML, chat logs, meeting notes. Retrieved recall was 1.000 across all formats. The engine found everything. But citation recall, whether the LLM actually used the evidence in its answer, was 0.000 for YAML, chat, and notes. The engine found the right documents. The LLM ignored them. That distinction turned out to matter enormously.
BM25 versus vector decomposition. The engine runs two retrieval lanes: keyword-based (BM25) and semantic (vector similarity). We created three classes of documents. K-class documents use distinctive terminology that BM25 excels at finding. V-class documents use synonyms and rephrased concepts that only vector similarity can match. H-class documents work for both. The result: vector-only documents achieved 100% retrieved recall but 0% citation recall. The LLM doesn't cite documents that lack keyword anchors matching the query, even when the semantic meaning is identical.
Temporal edge cases. The v2 suite had six misclassifications. The v3 probes isolated the three failure patterns (contested-to-superseded transitions, rapid succession updates, and window boundary effects) and tested them in isolation. 12/12. The engine's living state machine was correct. The v2 misclassifications were caused by ambiguous relation graphs in the enterprise corpus, not by logic errors.
Receipt chain stress. CROWN receipts form hash chains. Each receipt links to its predecessor. We tested chain verification at depths from 5 to 50. Latency at depth 50: 2 milliseconds. All chains intact. Verification is effectively O(1) regardless of depth.
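The chain structure is simple to sketch: each receipt's hash covers its payload plus its parent's hash, so editing any link invalidates everything downstream. CROWN uses BLAKE3; this stdlib-only sketch substitutes SHA-256 and an illustrative payload shape:

```python
import hashlib
import json

GENESIS = "0" * 64

def receipt_hash(payload: dict, parent_hash: str) -> str:
    # Hash the canonical payload together with the parent link.
    # CROWN uses BLAKE3; sha256 stands in to keep this sketch stdlib-only.
    body = json.dumps(payload, sort_keys=True) + parent_hash
    return hashlib.sha256(body.encode()).hexdigest()

def build_chain(payloads):
    chain, parent = [], GENESIS
    for p in payloads:
        h = receipt_hash(p, parent)
        chain.append({"payload": p, "parent": parent, "hash": h})
        parent = h
    return chain

def verify_chain(chain) -> bool:
    parent = GENESIS
    for r in chain:
        if r["parent"] != parent or receipt_hash(r["payload"], parent) != r["hash"]:
            return False
        parent = r["hash"]
    return True

chain = build_chain([{"query": f"q{i}"} for i in range(50)])
assert verify_chain(chain)

chain[20]["payload"]["query"] = "tampered"  # any edit breaks the chain
assert not verify_chain(chain)
```

Each verification step is a fixed-cost hash over a small record, which is why even depth-50 chains verify in a couple of milliseconds.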
Fragility calibration. When you systematically remove pieces of evidence, how fragile is the answer? An answer supported by two documents from two domains is maximally fragile. Remove either and the answer loses a domain of support. An answer supported by six documents from four domains is robust. We tested whether the engine's fragility scoring reflects this correctly, whether it's monotonic: more evidence from more domains should always mean lower fragility.
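The monotonicity property is easy to state in code. This is a toy scoring function, not the engine's actual formula; it exists only to show the shape of the invariant the calibration tests enforce:

```python
def fragility(evidence):
    """Toy fragility score: more documents from more distinct domains
    means a less fragile answer. The formula is illustrative, not the
    engine's actual scoring function."""
    docs = len(evidence)
    domains = len({e["domain"] for e in evidence})
    return 1.0 / (docs * domains)

fragile = [{"domain": "eng"}, {"domain": "ops"}]        # 2 docs, 2 domains
robust = [{"domain": d} for d in
          ("eng", "ops", "legal", "sec", "eng", "ops")]  # 6 docs, 4 domains

# Monotonicity: more evidence from more domains must score lower.
assert fragility(robust) < fragility(fragile)
```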
The v3 suite confirmed that the core capabilities worked. Retrieved recall was strong across the board. But it also made clear that the gap between "finding the right documents" and "the LLM actually citing them in its answer" was a separate and much harder problem.
The embedding bake-off
Before pushing further into the audit suite, we had to settle a foundational question: which embedding model should power production?
We ran every suite (v1, v2, v3) twice. Once with OpenAI's text-embedding-3-small (768 dimensions, API-hosted) and once with Nomic embed-text-v1.5 (768 dimensions, self-hosted on our own GPU).
The capability probes were identical. Every single metric matched. Format recall, temporal accuracy, chain stress, fragility ordering. At the categorical level, whether the model can handle a given type of query, the embedding model doesn't matter.
Where it did matter was scale. At 10,000 documents, Nomic's degradation slope was half of OpenAI's. Precision at 10K was 0.400 versus 0.300, a 33% advantage. The enterprise corpus converged (both hit -0.008 slope at 25K), but on clean text Nomic was measurably better at keeping results sharp as the corpus grew.
We deployed Nomic. Not because self-hosting is philosophically superior, but because the data said it was better at the thing that matters in production: holding up at scale. It runs on an RTX 4000 SFF Ada with 20GB of VRAM, on a dedicated GPU node. Zero marginal cost per embedding. No API dependency. And the empirical evidence says it's at least as good as the commercial alternative.
That decision was made in early March and hasn't been revisited since.
The DQP experiment
Around the same time, we'd been building what we called the Data Quality Pipeline, a suite of advanced retrieval techniques: semantic chunking (breaking documents at natural semantic boundaries rather than fixed token windows), HyDE (hypothetical document embeddings to bridge vocabulary gaps), quality gating, and context notation.
These are well-regarded techniques in the retrieval literature. They're the kind of thing you read about in papers and assume will make your system better.
We tested them rigorously. The results were not what we expected.
Baseline recall with no DQP features enabled: 89.5%, 8 out of 10 categories passing. With semantic chunking alone: 24.3% recall, 4 out of 10. With the full DQP stack enabled (semantic chunking, HyDE, quality gating, the lot): 12.8% recall, 3 out of 10.
The advanced techniques made the system dramatically worse.
The semantic chunker was the culprit. Two mechanisms. First, splitting documents at semantic boundaries destroys the holistic signal that whole-document embeddings preserve. Queries written against whole documents fail to match fragments of those documents. Second, re-chunking redistributes terms across chunk boundaries, destroying the term frequency patterns that BM25 ranking depends on. One retrieval lane went from 52 found documents to 1.
We didn't ship DQP to production. The baseline stays. We published the negative result because pretending advanced techniques always help is how the industry got into the mess it's in. Sometimes the simpler system is the better system, and the only way to know is to measure it honestly.
What we did keep: a no-split fix for the semantic chunker that preserves original content when no split occurs, and per-tenant benchmark isolation that eliminated a confounding variable where categories contaminated each other's results. Both were methodology improvements, not feature improvements.
The v4 suite: thirteen categories
By mid-March we had a solid foundation: reliable embeddings, a clean retrieval pipeline without DQP complications, and three audit suites that exercised different dimensions of quality.
The v4 suite was designed to be comprehensive. Not just "does it work?" but "does it work under every condition we can think of, including conditions specifically designed to make it fail?"
We expanded from six categories to thirteen. The corpus grew to 1,127 ingested documents and 462 queries. Here's what each category tests and why it exists.
Category 1: Relation-bootstrapped retrieval. Eight documents connected by amendment, support, and implementation relations. One query. Tests whether the engine can traverse the artifact relation graph to find related documents that text similarity alone wouldn't surface. This was the first capability we wanted to prove: that the engine understands documents as connected entities, not isolated vectors.
Category 2: Format-aware ingestion recall. 270 documents across six MIME types, 45 topics. Tests whether every ingested format can be retrieved and cited. This category exists because most retrieval systems are silently terrible at structured data. A YAML configuration file and a markdown policy document that describe the same thing should both be findable. Citation recall for structured formats is still lower than markdown (0.670 versus 1.000), but retrieved recall is 1.000 across the board. The engine finds everything. What the LLM does with it is a separate problem.
Category 3: BM25 versus vector decomposition. 165 documents across three retrieval lanes, 55 queries. Tests whether keyword-based and semantic retrieval complement each other correctly. This category caught one of the most persistent intermittent failures: LLM query decomposition would occasionally produce different sub-queries on repeated runs, leading to different retrieval results. We fixed it with a content-addressable decomposition cache and keyword retry fallback.
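A content-addressable cache for decomposition is the kind of fix that can be sketched in a few lines: key the cache on a hash of the normalised query text, so repeated runs reuse the first decomposition instead of re-sampling the LLM. The shape below is illustrative, not the engine's actual implementation:

```python
import hashlib

class DecompositionCache:
    """Content-addressable cache for LLM query decomposition (a sketch).
    The same query text always maps to the same key, so repeated runs
    get identical sub-queries instead of fresh LLM samples."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get_or_compute(self, query: str, decompose):
        k = self.key(query)
        if k not in self._store:
            self._store[k] = decompose(query)  # LLM call only on a miss
        return self._store[k]

cache = DecompositionCache()
calls = []

def fake_llm(q):
    calls.append(q)
    return [f"{q} (sub-{i})" for i in range(3)]

first = cache.get_or_compute("What are our security policies?", fake_llm)
second = cache.get_or_compute("What are our security policies?", fake_llm)
assert first == second and len(calls) == 1  # second run hit the cache
```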
Category 4: Temporal edge cases. Tests the engine's living state machine at boundary conditions: rapid succession updates, contested-to-superseded transitions, window edges. Runs in V1 mode (skipping living state features not yet deployed), so it validates the core logic without depending on features still under development.
Category 5: Receipt chain stress. Two documents, ten queries at varying chain depths. Verifies that CROWN receipt chains maintain integrity and performance as they grow. This is the direct test of the cryptographic protocol: can you verify a receipt 50 links deep in under 10 milliseconds?
Category 6: Fragility calibration. Twelve documents across three perturbation scenarios, three queries. Tests whether the engine's fragility scoring is monotonic, that answers supported by more diverse evidence are correctly assessed as less fragile. This category forced us to invent pre-selector fragility computation, because the evidence selector was pruning candidates before fragility could be measured, making every answer appear equally fragile.
Category 7: Hierarchical broad query recall. 180 documents across 12 themes, 36 queries. When someone asks a broad question like "What are our security policies?" can the engine find documents across multiple themes and sub-topics? This tests the surface routing architecture: query decomposition into sub-queries, Qdrant summary search, admission control, and representative selection. Broad queries are fundamentally harder than precision queries because the answer space is wide and the engine has to decide which themes are relevant without knowing in advance.
Category 8: Proposition precision. 80 documents, 80 queries. When someone asks for a specific value like "What is the SLA for the payments API?" does the engine cite the exact document that contains the answer, at rank 1? Precision at rank 1 (P@1) is 0.963. Seventy-seven out of eighty queries return the right document first. The three misses are edge cases involving nearly identical propositions in different documents.
Category 9: Semantic deduplication. 126 documents arranged in 35 clusters, with one canonical document and two to three near-duplicates per cluster. 35 queries. When the corpus contains multiple versions of effectively the same information, does the engine surface the canonical version and suppress the duplicates? Dedup effectiveness is 1.000. Zero duplicates in results. This was broken at 0.000 in Phase 6.0. The dedup status was computed at ingest but never used during retrieval. A three-layer fix (scoring penalty, hard filter, representative selector awareness) resolved it completely.
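Two of the three layers of that fix can be sketched directly: penalise near-duplicates at scoring time, then hard-filter to one document per cluster. The field names and penalty factor are illustrative, not the engine's schema:

```python
def apply_dedup(candidates, penalty=0.5):
    """Sketch of two dedup layers: a scoring penalty on near-duplicates,
    then a hard filter keeping one document per cluster. Field names
    and the penalty factor are illustrative."""
    for c in candidates:
        if c["dedup_status"] == "duplicate":
            c["score"] *= penalty          # layer 1: penalise duplicates
    seen, result = set(), []
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if c["cluster_id"] in seen:        # layer 2: one doc per cluster
            continue
        seen.add(c["cluster_id"])
        result.append(c)
    return result

candidates = [
    {"id": "canon", "cluster_id": 1, "dedup_status": "canonical", "score": 0.8},
    {"id": "dup-a", "cluster_id": 1, "dedup_status": "duplicate", "score": 0.9},
    {"id": "other", "cluster_id": 2, "dedup_status": "canonical", "score": 0.7},
]
kept = apply_dedup(candidates)
assert [c["id"] for c in kept] == ["canon", "other"]
```

The penalty matters even with the hard filter in place: it ensures the canonical document, not a higher-scoring duplicate, is the one that survives the per-cluster cut.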
Category 10: Contextual chain recall. 120 documents forming 30 multi-hop reasoning chains. When the answer requires following a chain of evidence, where document A references document B which references document C, does the engine retrieve the full chain? Chain completeness is 0.933 to 0.967 across runs, with minor LLM variance on which chain links get cited.
Category 11: Multi-document broad recall. 50 documents designed as chunking stress tests (500 to 2,000+ tokens), 80 queries across four modes: within-chunk, cross-chunk, broad, and multi-document precision. Tests whether the engine can handle the full spectrum from targeted single-document retrieval to broad multi-document synthesis.
Category 12: Hard-negative overlap. This is the adversarial category. It didn't exist before Phase 6.0.
The v1 corpus has 53 documents across five themes, with 36 queries. Documents are deliberately designed so that the wrong version of a document is textually more similar to the query than the right one. Five versioned document families where asking for "the original policy" should return v1.0, but v3.0 is a better semantic match. Parent-child document pairs (a decision and its implementation) where the LLM cites one but not the other.
The v2 corpus, added in Phase 7.0, expands to 61 documents and 51 queries. It adds a four-version security patching policy (v1.0 through v3.0), cross-theme ADR-to-runbook pairs that span different vocabulary domains, and adversarial queries with ambiguous temporal markers ("before the automated scanning was introduced"), indirect pair phrasing ("how did the team act on that decision"), and compound intent that requires resolving both version and relation in the same query.
Category 12v2: Adversarial expansion. The same hard-negative tests, run against the expanded corpus with deliberately misleading queries. Adversarial recall is 0.818 against a 0.70 threshold. This is the category that keeps us honest.
These thirteen categories weren't designed at once. They grew over ten days, each one added because we found a failure mode the existing categories didn't catch. The suite is an archaeology of everything that went wrong and how we proved it got fixed.
Phase 6.0: the surface routing bet
We deployed the surface routing architecture as the largest single change to the retrieval pipeline since the engine was built.
The core insight was that different types of queries need fundamentally different retrieval strategies. A precision query ("What is the SLA for the payments API?") needs to find one specific document and rank it first. A broad query ("What are our security policies?") needs to find documents across multiple themes and synthesise them coherently.
Surface routing handles this by profiling each query as precision or broad, then sending it down a different path. Precision queries go through standard hybrid retrieval: BM25 keyword matching combined with vector similarity, followed by cross-encoder reranking. Broad queries go through a more elaborate pipeline: the query is decomposed into eight sub-queries by the LLM, each sub-query runs its own hybrid retrieval, results are fused across sub-queries using reciprocal rank fusion, then filtered through an admission controller that decides which document groups are relevant enough to pass through.
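Reciprocal rank fusion, the fusion step in that broad-query pipeline, is worth seeing concretely. This is the standard RRF formula with the conventional k=60 constant; the engine's actual constant and weighting are not published here:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion across sub-query result lists.
    `rankings` is a list of ranked doc-id lists, one per sub-query.
    Each appearance contributes 1 / (k + rank) to the doc's fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([
    ["d1", "d2", "d3"],   # sub-query 1
    ["d2", "d4"],         # sub-query 2
    ["d2", "d1"],         # sub-query 3
])
assert fused[0] == "d2"  # d2 appears near the top of all three lists
```

The appeal of RRF over score-based fusion is that it only consumes ranks, so BM25 scores and cosine similarities never need to be normalised onto a common scale.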
After admission, the engine loads group members, selects representatives (canonical document, freshest document, highest-scoring document, overflow fill), and splits them into a coverage set (all group members, surfaced in the receipt as retrieved evidence) and an answer set (representatives only, passed to the LLM for synthesis). The coverage set proves what was found. The answer set keeps the LLM focused.
Phase 6.0 also introduced Category 12, the hard-negative overlap tests. We were confident enough in the broad retrieval architecture to start testing the adversarial cases.
Result: 10 out of 11. Category 3 failed due to LLM decomposition variance, where the same query produced different sub-queries on repeated runs. Everything else passed. Not a bad start, but we knew the one failure was going to be stubborn.
Phase 6.1: the first clean sweep
Two days later, we hit 12/12 for the first time.
Three root causes resolved in a single push. Category 9's semantic dedup had been computed but never applied during retrieval, and a three-layer fix added scoring penalties, hard filtering, and representative selector awareness. Category 11 had been producing false passes because the audit scripts weren't setting the Qdrant URL, so data was written to Postgres during ingest but searches went to an empty Qdrant instance. And Category 12 needed an implements relation type added to the expansion set.
The 12/12 felt good. It lasted about eight hours.
Phase 6.2: the cost of hardening
Phase 6.2 was infrastructure work. Seven milestones: fail-closed preflight diagnostics that abort the audit on critical infrastructure failures, DQP-native corpus types carrying dedup and living metadata inline, temporal hardening, observability and SLO baselines, format-aware citation hints, per-scenario isolation for fragility testing, and operational tuning documentation.
All necessary. All sensible. And when we ran the audit: 10/12. Category 3 had regressed (LLM decomposition flakiness was back) and Category 11 had regressed (LLM citation variance on multi-document precision queries).
We'd fixed infrastructure and broken quality. The admission controller changes that helped broad queries (relaxing the maximum groups from 3 to 12 and dropping the minimum group score from 0.60 to 0) had changed the dynamics of what evidence reached the LLM, and the LLM responded differently.

This was the first time we felt the full force of the pipeline interaction problem. You can't change one layer without affecting every layer downstream.
Phase 6.3: the evidence selector
The diagnosis was clear: the LLM was receiving too many candidates and citing poorly as a result. Ten to twenty reranked documents dumped into the prompt, and the model would pick whichever ones felt right, which often wasn't the right ones.
The evidence selector was the answer. A non-LLM filter between reranking and the LLM that groups candidates by artifact, takes the top-scoring representative from each artifact, and fills remaining slots by overall score. The effect: LLM context dropped from 10-20 candidates to 4-6.
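The selection logic can be sketched in a few lines: take the top-scoring candidate per artifact, then fill any remaining slots by overall score. The field names are illustrative, not the engine's actual schema:

```python
def select_evidence(candidates, max_contexts=6):
    """Sketch of the evidence selector: one top-scoring representative per
    artifact first, then remaining slots filled by overall score."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    by_artifact = {}
    for c in ranked:
        by_artifact.setdefault(c["artifact_id"], c)  # first seen = top-scoring
    selected = list(by_artifact.values())[:max_contexts]
    if len(selected) < max_contexts:
        remaining = [c for c in ranked if c not in selected]
        selected += remaining[:max_contexts - len(selected)]
    return selected

candidates = [
    {"artifact_id": "A", "score": 0.9},
    {"artifact_id": "A", "score": 0.8},
    {"artifact_id": "B", "score": 0.7},
]
# Two artifacts, so the duplicate A-chunk is dropped before slot-filling.
assert [c["score"] for c in select_evidence(candidates, max_contexts=2)] == [0.9, 0.7]
```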
This fixed Category 3 (the decomposition cache and keyword retry eliminated flakiness) and Category 11 (fewer, better contexts meant more consistent citation). But it broke Category 6 (with only 4 contexts, every piece of evidence was load-bearing, making every answer appear maximally fragile) and Category 12 (taking top-1-per-artifact dropped parent-child document pairs).
10/12. Three times, consistently. Zero flakiness, which was progress. But two categories down.
Phase 6.4: fragility resolution
The fragility problem had an elegant solution: compute fragility from the full reranked candidate set before the evidence selector prunes it, not from the post-selector LLM citations.
Pre-selector fragility means the engine evaluates answer robustness against all the evidence it found, not just the subset it chose to present. This is more honest. Fragility should reflect how much evidence supports the answer in the corpus, not how many of the curated contexts the LLM chose to cite.
11/12. Category 6 fixed. Category 12 remained the sole failure.
And then a debugging detour. One run regressed to 6/12 due to a stale environment variable left over from a tuning sweep. We added config manifest enforcement after that: a frozen list of every feature flag and its expected value, checked before every audit run. Stale config would never silently corrupt results again.
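Manifest enforcement is a small amount of code for a large amount of protection. A minimal sketch, with hypothetical flag names standing in for the real manifest:

```python
import os

# Hypothetical frozen manifest: flag name -> required value.
MANIFEST = {
    "SURFACE_ROUTING_ENABLED": "true",
    "EVIDENCE_SELECTOR_MAX": "6",
    "SEMANTIC_CHUNKING_ENABLED": "false",
}

def enforce_manifest(env=None):
    """Abort before the audit runs if any flag drifts from its frozen value."""
    env = os.environ if env is None else env
    drift = {k: env.get(k) for k, v in MANIFEST.items() if env.get(k) != v}
    if drift:
        raise SystemExit(f"stale config, refusing to run audit: {drift}")

# A matching environment passes silently; any drift aborts the run.
enforce_manifest(dict(MANIFEST))
```

Failing closed here is the point: a missing flag is treated the same as a wrong one, so an audit can never run against a half-configured engine.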
Phase 6.5: the citation controller
Category 12 was, at this point, a well-understood problem. Retrieved recall was 1.000. The engine found every document it was supposed to find. But version precision was 0.444 (the LLM cited the wrong version more than half the time) and parent-child recall was 0.538 (the LLM failed to cite both the decision and its implementation).
This was not a retrieval problem. It was an LLM problem. The model deterministically chose the "most comprehensive" version of a document and ignored relation partners. No amount of retrieval tuning would fix it.
The citation controller was our answer. A deterministic post-LLM repair layer that catches two classes of error without making additional LLM calls.
Version swap repair: the controller parses document IDs for version families, extracts version intent from the query (explicit version references like "v1.0", or temporal markers like "original" or "latest"), and swaps citations to the correct version from the same family. Nine out of nine version swaps in the test corpus. Precision went from 0.444 to 1.000.
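The repair logic is deterministic enough to sketch end to end. The ID scheme (`policy-v1.0`) and the marker patterns below are illustrative; the real controller uses the engine's document ID format and a larger pattern set:

```python
import re

def version_intent(query: str):
    """Extract version intent: an explicit 'vX.Y', or a temporal marker."""
    m = re.search(r"\bv(\d+\.\d+)\b", query)
    if m:
        return m.group(1)
    if re.search(r"\boriginal\b", query, re.I):
        return "oldest"
    if re.search(r"\b(latest|current)\b", query, re.I):
        return "newest"
    return None

def repair_citation(cited_id: str, family: list, query: str) -> str:
    """Swap a citation to the right member of its version family."""
    intent = version_intent(query)
    if intent is None:
        return cited_id
    versions = sorted(
        family, key=lambda d: [int(p) for p in d.split("-v")[1].split(".")]
    )
    if intent == "oldest":
        return versions[0]
    if intent == "newest":
        return versions[-1]
    return next((d for d in family if d.endswith(f"-v{intent}")), cited_id)

family = ["policy-v1.0", "policy-v2.0", "policy-v3.0"]
assert repair_citation("policy-v3.0", family, "the original policy") == "policy-v1.0"
assert repair_citation("policy-v1.0", family, "summarise policy v2.0") == "policy-v2.0"
```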
Relation partner injection: when the query asks about both a decision and its implementation (detected via fourteen patterns like "rationale and execution", "design and runbook", "ADR and implementation"), the controller adds the missing relation partner from the artifact relations graph.
This was the defining insight of the entire audit cycle. Some classes of LLM error are entirely predictable. The model will always prefer the most comprehensive version. It will always cite one of a pair when both are needed. These aren't random failures. They're systematic biases. And systematic biases can be corrected deterministically, after the fact, without going back to the model.
12/12. Three times. Zero flakiness. Version precision: 1.000. But parent-child recall had actually dropped from 0.538 to 0.462. The controller was adding the right citations, but something downstream was removing them.
Phase 6.6: finding the real filter
This was the most satisfying debugging session of the entire cycle.
The original diagnosis blamed packCitations, a function that assembles the final citation set. We'd built a supplementary candidates mechanism to inject controller-added citations back in. It didn't work. After careful analysis, we discovered why: llmCandidates is always a subset of final (the evidence selector narrows from the full set, not expands beyond it), so the supplementary candidates filter was always empty. We'd been patching the wrong layer.
The actual filter was MiSES, the Minimum Information Sufficiency Evidence Selector. MiSES ensures answer diversity by picking evidence from different source domains. With a maximum size of 3 and non-greedy mode, when multiple documents share a domain (say, eng.meridian.test), MiSES picks one per domain. The citation controller was adding relation partners that shared a domain with existing candidates, and MiSES was dropping them in favour of other same-domain candidates with more recent dates.
The fix was a pinnedIds parameter. Controller-output citations are seeded into MiSES first, bypassing domain and recency selection. Remaining slots fill with domain-diverse candidates as before. If pinned citations exceed the maximum size, the effective maximum expands to accommodate them.
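The pinning behaviour can be sketched as a small selection loop: pins are seeded first and exempt from domain diversity, then remaining slots take one candidate per domain. Field names and the example IDs are illustrative:

```python
def mises_select(candidates, pinned_ids=(), max_size=3):
    """Sketch of MiSES with pinned citations: pinned ids bypass domain
    diversity; remaining slots take one candidate per source domain.
    The effective limit expands if the pins alone exceed max_size."""
    selected = [c for c in candidates if c["id"] in pinned_ids]
    limit = max(max_size, len(selected))
    domains = {c["domain"] for c in selected}
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if len(selected) >= limit:
            break
        if c["id"] in pinned_ids or c["domain"] in domains:
            continue                       # one representative per domain
        selected.append(c)
        domains.add(c["domain"])
    return selected

candidates = [
    {"id": "adr-7", "domain": "eng.meridian.test", "score": 0.9},
    {"id": "runbook-7", "domain": "eng.meridian.test", "score": 0.6},
    {"id": "memo-2", "domain": "ops.meridian.test", "score": 0.8},
]
# Without pinning, domain diversity keeps only one eng document.
# Pinning both halves of the relation pair keeps them together.
ids = [c["id"] for c in mises_select(candidates, pinned_ids={"adr-7", "runbook-7"})]
assert "adr-7" in ids and "runbook-7" in ids
```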
Parent-child recall: 0.462 to 1.000. Thirteen out of thirteen relation pairs correctly cited. Every single Category 12 metric hit 1.000.
12/12. Three times. And for the first time, every metric in the hardest category was perfect.
Phase 7.0: stabilisation
Phase 7.0 was not about adding features. It was about proving that what we'd built was real.
Six milestones, all focused on validation rather than innovation.
We published the 6.6 results as the canonical baseline, documenting every trade-off: pinnedIds improves citation recall in Categories 2 and 12 but creates a soft trade-off on Categories 10 and 11, where MiSES's domain diversity would otherwise have selected slightly different evidence. We ran a full ablation (pinnedIds always-on, repair-only, and disabled) and confirmed that the always-on policy was the only one that kept Category 12 parent-child recall at 1.000. The trade-off is structural. No variant improves everything. We accepted it.
We integrated the adversarial Category 12v2 corpus side-by-side with v1. Adversarial recall: 0.818, well above the 0.70 threshold. Queries with ambiguous temporal markers, indirect pair phrasing, and compound intent are the hardest retrieval problems we could design, and the system got them right more than four times out of five.
We built and ran shadow replay across 470 captured records. Every single record produced identical output when replayed through the citation controller. 100% agreement. Zero repairs. Zero regressions. The controller is fully deterministic.
And then we ran the final validation. 13/13. Three times.
10/11 > 12/12 > 10/12 > 10/12 > 11/12 > 12/12 > 12/12 > 13/13
The trajectory tells the story better than any summary. Every regression was diagnosed. Every diagnosis was verified. Every fix was validated three times. And at the end, zero variance on the metrics that define quality:
- Precision at rank 1: 0.963. Three runs. Zero variance.
- Semantic dedup effectiveness: 1.000. Zero variance.
- Broad recall: 0.927. Zero variance.
- Parent-child recall: 1.000. Zero variance.
- Adversarial recall: 0.818. Zero variance.
- Shadow replay agreement: 470/470. 100%.
Categories 2 and 3 showed minor LLM citation variance of plus or minus 0.03, the irreducible stochasticity of language model citation selection. Everything else was rock solid.
Why this matters for CROWN
CROWN receipts prove three things: what evidence was retrieved, that the evidence was current at query time, and that the receipt hasn't been tampered with.
Each receipt contains a BLAKE3 hash of the query, hashed citation sets with individual quote hashes, a knowledge-state cursor anchoring the receipt to a position in an append-only event log, and a fragility score indicating how robust the answer is to evidence removal. The receipt chain links each new receipt to its predecessor via parent snapshot IDs, creating a tamper-evident history. Everything is signed with ed25519 keys managed through Vault Transit.
The protocol operates at three assurance levels. Light mode is for low-stakes queries, fast with minimal verification. Verified mode enforces a minimum of two independent source domains and includes fragility scoring. Audit mode uses the same constraints but with higher retrieval budgets and stricter evidence completeness requirements, for regulatory and formal audit contexts.
A receipt can exist unsigned if Vault is unavailable, but it cannot be retroactively signed. If signing is later restored, a separate detached attestation must be issued. This prevents silent hash collision attacks on unsigned receipts. It's a small design decision that matters enormously for auditability.
But none of this matters if the retrieval engine underneath can't reliably find the right evidence. A cryptographic proof of what the system found is only valuable if what the system found is correct. If the engine retrieves the wrong version of a document, the receipt faithfully records that it retrieved the wrong version. If the engine misses a related document, the receipt faithfully records the gap.
That's why we spent ten days getting to 13/13. The audit results aren't marketing. They're the evidence layer underneath the protocol. When someone verifies a CROWN receipt, they're trusting that the retrieval engine identified the right evidence, the citation controller corrected any systematic LLM errors, and the evidence chain hasn't been tampered with. The 462 queries across 13 categories, run three times with zero variance, are our proof that the first two claims hold.
The SCITT submission
CROWN is designed as a SCITT application profile, an implementation of the Supply Chain Integrity, Transparency and Trust architecture being standardised by the IETF. The compatibility layer includes CDDL schemas (RFC 8610) for CBOR-encoded receipts, COSE_Sign1 wrapped examples with annotated hex walkthroughs, SCITT terminology mappings, registration policy for Transparency Services, and privacy considerations for redacted proof packs.
It's complementary to other profiles emerging in the space. Kamimura's CAP-SRP covers refusal provenance: when an AI system refuses to act, there's a signed record of why. CROWN covers evidence provenance: when an AI system uses evidence to support a decision, there's a signed record of what it used and whether it was current. Different layer, same infrastructure.
We're submitting with published test vectors, a standalone verification library (zero vendor dependencies, just BLAKE3 and ed25519), and a regulatory mapping covering EU AI Act Articles 13-14 and DORA Articles 8-11 with per-article benchmark citations. Everything is published under CC BY 4.0.
We're seeking feedback from the working group on the CROWN-to-SCITT mapping, interest in AI evidence provenance as a use case for SCITT, and guidance on next steps toward a formal application profile document.
What comes next
On the engine side, the Phase 7.0 configuration is frozen as our production baseline. Twenty-three feature flags locked in a config manifest that's enforced before every audit run. The next phase of work will focus on areas we've deliberately deferred: format-aware citation prompting to close the gap between markdown and structured format citation recall, deeper adversarial corpus expansion, and integration with production traffic via the shadow replay framework we built in Phase 7.0.
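Enforcing a frozen manifest before every audit run can be as simple as hashing a canonical serialisation of the flags and refusing to start on any mismatch. The sketch below is illustrative only: the flag names are hypothetical examples, not the actual twenty-three, and the enforcement mechanism is an assumption about how such a check could work.

```python
import hashlib
import json

# Hypothetical excerpt of a frozen Phase 7.0 manifest (names illustrative).
FROZEN_FLAGS = {
    "semantic_chunking": False,   # disabled: failed as a universal strategy
    "hyde": False,                # implemented but gated pending validation
    "reranking": True,            # live
}

def manifest_digest(flags: dict) -> str:
    """Canonical JSON (sorted keys) so key order can't change the digest."""
    return hashlib.sha256(json.dumps(flags, sort_keys=True).encode()).hexdigest()

FROZEN_DIGEST = manifest_digest(FROZEN_FLAGS)

def enforce_manifest(runtime_flags: dict) -> None:
    """Refuse to start an audit run if any flag drifted from the baseline."""
    if manifest_digest(runtime_flags) != FROZEN_DIGEST:
        raise RuntimeError("config drift detected: audit run refused")

enforce_manifest(dict(FROZEN_FLAGS))  # baseline matches, run proceeds
```

A digest check like this turns "the config is frozen" from a convention into an invariant: a single flipped flag changes the digest and blocks the run.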
There's also the question of what DQP features come back. Semantic chunking failed as a universal strategy, but the no-split fix and the benchmark isolation methodology both proved valuable. HyDE (hypothetical document embeddings) is implemented and gated, waiting for validation once the base retrieval is fully proven. Reranking is live. The DQP story isn't over. We now know which parts help and which parts hurt, because we measured instead of assuming.
On the protocol side, the SCITT submission opens a conversation about transparency service registration. When CROWN receipts are produced in audit mode, they should be registered with a Transparency Service promptly, within seconds to minutes. In verified mode, the window is minutes to hours. In light mode, daily batches. The infrastructure for this exists in the specification; making it real depends on the SCITT ecosystem maturing.
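The three registration windows could be encoded as a simple mode table that a registration checker consults. The mode names below follow the post; the concrete timedelta bounds are assumptions chosen to match the stated ranges, not values from the specification.

```python
from datetime import timedelta

# Upper bound for registering a receipt with a Transparency Service,
# per receipt mode. Bounds are illustrative assumptions.
REGISTRATION_WINDOWS = {
    "audit":    timedelta(minutes=5),   # "seconds to minutes"
    "verified": timedelta(hours=6),     # "minutes to hours"
    "light":    timedelta(days=1),      # daily batches
}

def within_window(mode: str, elapsed: timedelta) -> bool:
    """Was the receipt registered inside its mode's upper bound?"""
    return elapsed <= REGISTRATION_WINDOWS[mode]
```

Making the windows data rather than logic means a future profile revision can tighten a bound without touching the checker.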
To understand how the engine itself evolved across three major versions, and why V4.1 became our production baseline, see Three Versions of the Same Question. For a walk through what happens end to end when a query arrives, including the monitoring that begins after the answer is delivered, see What Happens When You Hit Submit.
Why this matters beyond CueCrux
I want to step back and say something about what this represents, because I think it matters beyond our own system.
The AI industry has a provenance gap. Retrieval-augmented generation is deployed in compliance workflows, financial research tools, legal assistants, healthcare decision support. These systems retrieve evidence and use it to support decisions that affect real outcomes for real people. And in almost every case, they leave no verifiable trail of what evidence was used, whether it was current at query time, or how confident the conclusion was.
This isn't an abstract concern. The EU AI Act Article 13 transparency requirements begin enforcement in August 2026. DORA audit trail obligations are already in effect. For any AI system operating in a regulated domain (financial services, healthcare, legal, public sector) the gap between "the system produced an answer" and "we can prove what the system knew when it produced that answer" is a compliance gap that will only widen.
Most enterprise AI platforms can produce citations. A link. A reference. What they cannot produce is proof. A cryptographically signed evidence chain that a third party can independently verify without access to the originating system.
That's the gap CROWN is designed to close. And the audit work we've done over the past ten days is our proof that the system underneath the protocol is worth signing receipts about.
But this isn't just about CueCrux. The techniques we've developed (deterministic post-LLM citation repair, pre-selector fragility computation, evidence-aware diversity filtering) are solutions to problems that every retrieval-augmented system will eventually face. The insight that LLM citation errors are systematic and correctable without additional model calls is applicable far beyond our architecture. The willingness to publish negative results (the DQP findings) and adversarial test suites is something the industry needs more of, not less.
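Because LLM citation errors are systematic (slips like a swapped separator or a dropped hyphen rather than random noise), a deterministic repair pass can fix most of them without another model call. The sketch below is a generic illustration of that idea using stdlib fuzzy matching, not CueCrux's actual citation controller; the cutoff value and the `None` convention for unrepairable citations are assumptions.

```python
import difflib

def repair_citations(cited_ids, retrieved_ids, cutoff=0.6):
    """Map each LLM-emitted citation to the closest retrieved doc id.
    Deterministic string matching: no additional model call needed."""
    repaired = []
    for cid in cited_ids:
        if cid in retrieved_ids:
            repaired.append(cid)  # already a valid citation
            continue
        match = difflib.get_close_matches(cid, retrieved_ids, n=1, cutoff=cutoff)
        repaired.append(match[0] if match else None)  # None: unrepairable
    return repaired

retrieved = ["policy-v3", "policy-v1", "dora-art-9"]
cited = ["policy-v3", "policy_v3", "dora-art9"]  # two systematic slips
```

Here `"policy_v3"` and `"dora-art9"` both land back on real retrieved ids, while anything too far from every candidate is flagged rather than silently guessed, which keeps the repair auditable.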
We've published CROWN under CC BY 4.0 deliberately. The verification library requires only BLAKE3 and ed25519, with no vendor dependencies and no CueCrux infrastructure. We want other systems to be able to produce and verify evidence receipts. We want regulators to have a standard they can point to. We want the SCITT working group to evaluate whether AI evidence provenance belongs in the same transparency infrastructure as software supply chain integrity.
The question isn't whether AI systems need provenance. The regulations have already answered that. The question is whether the industry will build provenance into the architecture from the beginning, or bolt it on after the enforcement date and hope no one looks too closely.
We chose to build it in. Thirteen categories. 462 queries. Three runs. Zero variance. And a protocol submission to show for it.
A moment worth marking
I don't usually write posts like this. Most of the work on CueCrux happens quietly, one migration at a time, one failing category at a time, one diagnosis at a time.
But this feels different. Not because we've finished. We haven't, and the list of what's next is long. But because something has shifted.
For months, the question was "can we build a retrieval engine that holds up under adversarial testing?" We've answered that.
For months, the question was "can we produce cryptographic evidence receipts that a third party can verify?" We've answered that too.
The question now is bigger than either of those. It's "what does the industry do when provable retrieval is possible?" When you can sign a receipt for what your AI system knew, when you can verify that receipt without the originating vendor's involvement, when you can prove to a regulator that the evidence was current and the chain is intact, what changes?
I think quite a lot changes. And today, by submitting CROWN to the people building the transparency infrastructure of the future, we're taking the first step toward finding out.