What Happens When You Hit Submit
The end-to-end journey of a query through CueCrux, from the moment you ask a question to the moment the system starts watching what it told you.
Most AI systems treat a query like a transaction. You ask a question, you get an answer, and then it's over. The system has already moved on to the next request.
CueCrux treats a query like the beginning of an obligation.
When you hit submit, you're not just asking a question. You're asking the system to commit to what it finds, to prove what it used, and to keep watching whether what it told you is still true. The answer isn't the end of the process. It's the start of a chain of accountability that extends forward in time, for as long as the evidence can still change.
This is a high-level walk through what actually happens, from the moment your query arrives to the moment the system starts monitoring what it told you.
The first hundred milliseconds
Your query arrives at the answer endpoint. Before any retrieval begins, three things happen.
Mode resolution. Every query runs in one of three modes: light, verified, or audit. Light mode is fast with minimal checks. Verified mode enforces evidence quality constraints and cryptographic signing. Audit mode adds the strictest evidence completeness requirements and forces the system to actively look for contradictions. The mode determines not just how hard the system works to find evidence, but how much it proves about what it found.
If the system can't meet the requirements of the requested mode (say, the signing service is temporarily unavailable), it downgrades rather than failing silently. The downgrade is recorded. When you get an answer back, the response tells you both the mode you requested and the mode that was actually applied. Nothing is hidden.
Policy binding. The system loads its configuration manifest, a frozen set of every feature flag and retrieval parameter, and computes a cryptographic hash of it. This hash gets baked into the receipt that will accompany the answer. It means that months from now, anyone verifying the receipt can confirm not just what evidence was used, but exactly what configuration the system was running when it found that evidence. If the configuration has changed since the receipt was issued, the verifier knows.
Budget allocation. Each mode gets a retrieval time budget. Light mode gets the least. Audit mode gets the most. This isn't just about performance. It determines how many retrieval lanes the system activates, how deep it searches, and how thoroughly it validates what it finds.
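The first-hundred-milliseconds logic above can be sketched in a few lines. This is a minimal illustration under assumed names (Mode, BUDGETS_MS, resolve_mode) and invented budget values; it is not the actual CueCrux API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Mode(Enum):
    LIGHT = "light"
    VERIFIED = "verified"
    AUDIT = "audit"

# Illustrative per-mode retrieval budgets; the real values are not public.
BUDGETS_MS = {Mode.LIGHT: 300, Mode.VERIFIED: 1200, Mode.AUDIT: 4000}

@dataclass
class ResolvedMode:
    requested: Mode
    applied: Mode                          # may differ if a downgrade occurred
    budget_ms: int
    downgrade_reason: Optional[str] = None  # recorded, never hidden

def resolve_mode(requested: Mode, signing_available: bool) -> ResolvedMode:
    """Downgrade rather than fail silently, and record why."""
    if requested is not Mode.LIGHT and not signing_available:
        return ResolvedMode(requested, Mode.LIGHT, BUDGETS_MS[Mode.LIGHT],
                            downgrade_reason="signing unavailable")
    return ResolvedMode(requested, requested, BUDGETS_MS[requested])
```

The key design choice is that the response carries both `requested` and `applied`, so a downgrade is always visible to the caller.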
All of this happens before a single document is retrieved. The system has already decided how much it owes you.
Finding the evidence
Retrieval in CueCrux is not a single search. It's a structured process that runs differently depending on what you asked.
The system first determines whether your query is a precision query (asking for something specific, like a particular policy or a particular number) or a broad query (asking for an overview across multiple topics). This distinction matters because the two types need fundamentally different retrieval strategies.
For precision queries, the system runs hybrid retrieval: keyword matching and semantic similarity in parallel, fused into a single ranked list. The fusion weights are adjusted based on the content types in the corpus. Structured documents like JSON and YAML get weighted toward semantic matching because their syntax makes keyword matching less effective. Prose documents get a more balanced split.
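A weighted score fusion of the two lanes might look like the sketch below. The function name and the exact content-type weights are illustrative stand-ins, not CueCrux's real parameters.

```python
def fuse(keyword_scores, semantic_scores, content_type="prose"):
    """Fuse two score maps (doc_id -> score in [0, 1]) into one ranked list."""
    # Structured formats lean on semantic matching because their syntax
    # defeats keyword matching; prose gets a balanced split.
    w_kw, w_sem = (0.2, 0.8) if content_type in ("json", "yaml") else (0.5, 0.5)
    docs = set(keyword_scores) | set(semantic_scores)
    fused = {d: w_kw * keyword_scores.get(d, 0.0) +
                w_sem * semantic_scores.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

A document missing from one lane simply contributes zero from that lane, so strong agreement between lanes outranks a strong hit in either lane alone.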
For broad queries, the system takes a more elaborate path. It decomposes your query into sub-queries, each targeting a different facet of what you asked. Each sub-query runs its own retrieval. Results are fused across sub-queries, then filtered through an admission controller that decides which document groups are relevant enough to pass through. This prevents the system from returning a shallow smattering of loosely related documents and forces it to commit to the topics it thinks matter.
If relation expansion is active, the system also traverses the artifact relations graph. When it finds a document, it checks whether that document has amendments, implementations, or contradictions filed against it, and pulls those into the candidate set. This is how the engine finds documents that text similarity alone would miss: a regulation written in legal language linked to a technical implementation written in systems language.
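A one-hop version of that traversal is easy to picture. The RELATIONS map and relation names below are assumptions standing in for the real artifact relations graph.

```python
# Hypothetical relations graph: artifact id -> list of (relation, target).
RELATIONS = {
    "reg-42": [("amended_by", "reg-42-a1"), ("implemented_by", "impl-7")],
    "impl-7": [("contradicted_by", "rfc-note-3")],
}

def expand(candidates, relations=RELATIONS,
           kinds=("amended_by", "implemented_by", "contradicted_by")):
    """Pull amendments, implementations, and contradictions of each
    retrieved document into the candidate set (one hop for illustration)."""
    expanded = set(candidates)
    for doc in candidates:
        for kind, target in relations.get(doc, []):
            if kind in kinds:
                expanded.add(target)
    return expanded
```

This is how a regulation retrieved by text similarity can drag its technically worded implementation into the candidate set even though the two share almost no vocabulary.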
If cross-encoder reranking is active, the candidates go through a second scoring pass. The reranker reads both the query and each candidate document together and produces a relevance score that captures semantic relationships that embedding similarity can miss.
By the end of retrieval, the system has a ranked set of candidates with scores from multiple sources: keyword matching, semantic similarity, relation graph expansion, and cross-encoder reranking. It knows which documents it found, how it found them, and how confident it is in each one.
Selecting what matters
Here is where CueCrux diverges most sharply from standard retrieval systems.
Most systems take the top K results from retrieval and dump them into a prompt. CueCrux does something different. It curates.
The evidence selector groups candidates by artifact (the source document they belong to) and takes the top-scoring representative from each artifact. Then it fills remaining slots by overall score. The goal is not "the K most relevant documents" but "the most relevant document from each distinct source, plus the strongest remaining candidates."
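The two-pass selection described above can be sketched as follows, assuming a simple candidate shape of artifact id, chunk id, and score.

```python
def select_evidence(candidates, k=5):
    """Pass 1: top-scoring chunk from each distinct artifact.
    Pass 2: fill remaining slots by overall score."""
    by_score = sorted(candidates, key=lambda c: c["score"], reverse=True)
    chosen, seen_artifacts = [], set()
    for c in by_score:                       # one representative per source
        if c["artifact"] not in seen_artifacts and len(chosen) < k:
            chosen.append(c)
            seen_artifacts.add(c["artifact"])
    for c in by_score:                       # then best of the rest
        if len(chosen) >= k:
            break
        if c not in chosen:
            chosen.append(c)
    return chosen
```

With k equal to the number of distinct artifacts, every source gets exactly one slot; a larger k lets strong second chunks from the same artifact back in.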
This matters because language models cite poorly when given too much context. Our audit data showed that when the LLM received 10 to 20 candidates, it would cite whichever documents felt most comprehensive rather than whichever were most relevant. Narrowing the context to 4 to 6 well-chosen candidates dramatically improved citation accuracy.
After evidence selection, the system splits the results into two sets: the coverage set (everything the engine found, regardless of whether it's passed to the LLM) and the answer set (the curated representatives that the LLM will actually use). The coverage set exists for the receipt. The answer set exists for the answer. The distinction means the receipt proves everything the engine found, while the answer reflects only what the engine judged most relevant.
The language model and the correction layer
The LLM receives the curated evidence and synthesises an answer. It produces text and a list of citations: which documents it chose to reference.
In most systems, this is the end. Whatever the LLM cited is what the user sees.
In CueCrux, the LLM's citations pass through a correction layer before they reach the user.
The citation controller is a deterministic repair step. It catches two classes of error that language models make predictably and repeatedly.
The first is version confusion. When a document family has multiple versions (v1.0, v2.0, v3.0), the LLM consistently cites the most comprehensive version regardless of what the user asked. If you ask about "the original policy", the model cites v3.0 because v3.0 contains the most detail. The controller detects this by parsing version families and extracting version intent from the query, then swaps the citation to the correct version.
The second is missing relation partners. When a query asks about both a decision and its implementation, the LLM tends to cite only the decision. The controller detects pair intent in the query and adds the missing partner from the artifact relations graph.
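Both repairs are plain pattern matching and lookup, which is why they cost nothing at inference time. The sketch below uses simplified version-intent parsing and pair detection; the real controller's heuristics are certainly richer.

```python
import re

def repair_version(query, cited, family):
    """family maps version string -> doc id, e.g. {"1.0": "policy-v1"}."""
    if re.search(r"\boriginal\b", query, re.I):
        oldest = min(family, key=lambda v: [int(p) for p in v.split(".")])
        return family[oldest]
    m = re.search(r"v(\d+(?:\.\d+)*)", query)   # explicit "v3.0" style intent
    if m and m.group(1) in family:
        return family[m.group(1)]
    return cited  # no version intent detected; leave the citation alone

def add_missing_partner(query, citations, pairs):
    """pairs maps a decision doc to its implementation doc."""
    if re.search(r"\bimplement", query, re.I):  # crude pair-intent detection
        for doc in list(citations):
            partner = pairs.get(doc)
            if partner and partner not in citations:
                citations.append(partner)
    return citations
```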
No additional language model calls. No added latency. No added cost. The repair is pattern matching and graph traversal, and it runs in under a millisecond. Our benchmarks show it improves version precision from 0.444 to 1.000 and parent-child recall from 0.462 to 1.000.
Minimum sufficient evidence
After citation correction, the evidence passes through MiSES: the Minimum Information Sufficiency Evidence Selector.
MiSES is not about finding more evidence. It's about proving that what you have is enough.
In verified and audit modes, every answer must be supported by evidence from at least two independent source domains. A policy from legal and a technical spec from engineering. A regulatory requirement and its implementation. MiSES selects the minimum set of citations that satisfies this diversity constraint.
It also measures fragility. For each citation in the final set, MiSES simulates removing it and checks whether the remaining evidence still meets the diversity requirement. If removing any single citation breaks the constraint, the answer is fragile. If no single removal breaks it, the answer is robust.
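The diversity constraint and the leave-one-out fragility probe can be sketched together. The data shapes and the greedy selection are assumptions for illustration, not MiSES's actual algorithm.

```python
def meets_diversity(citations, min_domains=2):
    return len({c["domain"] for c in citations}) >= min_domains

def select_minimal(candidates, min_domains=2):
    """Greedily pick the highest-scoring citation from each new domain
    until the diversity constraint is satisfied."""
    chosen = []
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if not meets_diversity(chosen, min_domains):
            if c["domain"] not in {x["domain"] for x in chosen}:
                chosen.append(c)
    return chosen

def is_fragile(citations, min_domains=2):
    """Fragile if removing any single citation breaks diversity."""
    return any(
        not meets_diversity(citations[:i] + citations[i + 1:], min_domains)
        for i in range(len(citations))
    )
```

A minimal two-domain set is fragile by construction: every citation is load-bearing. Robustness requires redundancy in each domain.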
The fragility score goes into the receipt. When someone reviews the answer later, they can see at a glance whether the conclusion rested on a single load-bearing citation or was supported by redundant, diverse evidence.
The validation gates
Before the answer is released, it passes through a series of validation gates. These gates exist to catch problems that retrieval and citation selection can't.
Domain validation confirms that the minimum domain diversity requirement was actually met. If the evidence selector and MiSES couldn't assemble a sufficiently diverse set, the query fails with a specific error code rather than returning an answer the system can't vouch for.
Freshness validation checks the publication dates of the cited evidence. If any citation is older than the freshness threshold for the mode, the query fails with a retryable error. The system is telling you: I found evidence, but it's too old for me to sign a receipt about it. This is preferable to silently returning stale evidence.
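A freshness gate of this shape is straightforward. The per-mode thresholds below are invented for illustration, as is the exception name; the point is that staleness raises a retryable error rather than returning silently.

```python
from datetime import date, timedelta

# Hypothetical per-mode thresholds; light mode skips the check entirely.
FRESHNESS_DAYS = {"light": None, "verified": 365, "audit": 90}

class StaleEvidence(Exception):
    """Retryable: evidence was found, but it is too old to sign for."""

def check_freshness(citations, mode, today=None):
    limit = FRESHNESS_DAYS[mode]
    if limit is None:
        return
    cutoff = (today or date.today()) - timedelta(days=limit)
    for c in citations:
        if c["published"] < cutoff:
            raise StaleEvidence(f"{c['doc']} published {c['published']}")
```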
Provenance verification confirms that the LLM's quotes actually match the source documents. In verified mode, this is a check. In audit mode, it's a hard gate: if the LLM misquoted a source, the answer is rejected.
In audit mode, there is one more gate. Counterfactual probing actively searches for evidence that contradicts the answer. If contradictory evidence is found, it's recorded in the receipt alongside the supporting evidence. The answer still returns, but the receipt flags that the evidence base is contested. This is the system being honest about uncertainty rather than presenting a consensus that doesn't exist.
The receipt
Everything that has happened so far is now captured in a CROWN receipt.
The receipt contains a BLAKE3 hash of the query. Hashed citations with individual quote hashes. The full retrieval trace: which lanes were used, what candidates were found, what scores they received. The MiSES selection: which citations were chosen, from which domains, with what fragility score. The configuration hash: exactly what feature flags and parameters the system was running. Timings: how long each phase took. And if counterfactual probing was active, whether contradictory evidence was found.
The receipt is serialised to canonical JSON (sorted keys at every nesting level) and hashed with BLAKE3. The hash is sent to Vault Transit for ed25519 signing. The signed receipt is stored alongside the answer.
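The canonicalise-then-hash step looks roughly like this. `hashlib.sha256` stands in for BLAKE3 (the real system would use a BLAKE3 library), and Vault Transit signing is omitted; only the determinism property is demonstrated.

```python
import hashlib
import json

def receipt_hash(receipt: dict) -> str:
    # Canonical JSON: sorted keys at every nesting level and no stray
    # whitespace, so the same logical receipt always hashes identically.
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Canonicalisation is what makes the signature meaningful: two serialisations of the same receipt must produce the same bytes, or verification would depend on incidental key ordering.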
If signing is temporarily unavailable (Vault is down, network partition, key rotation in progress), the receipt is stored unsigned with a pending status. A background worker retries signing. But here's the important detail: an unsigned receipt cannot be retroactively signed with the original key at the original timestamp. If signing is restored, a separate detached attestation is issued. This prevents a class of attack where someone generates receipts now and signs them later with a different key.
The receipt links to its predecessor via a parent snapshot ID, forming an append-only hash chain. Each answer builds on the last, creating a tamper-evident history that can be verified independently by anyone with the BLAKE3 and ed25519 libraries. No CueCrux infrastructure required.
After the answer: the system keeps watching
This is where most systems stop. The answer has been delivered. The receipt has been signed. The user has what they asked for.
CueCrux doesn't stop.
WatchCrux is a continuous auditing and confidence monitoring service that runs independently of the engine. It is a separate process, on a separate schedule, with its own persistence. If the engine crashes, WatchCrux keeps running. If WatchCrux restarts, it picks up where it left off. They are deliberately decoupled because the thing monitoring the system must not depend on the system it monitors.
WatchCrux does four things.
Health monitoring. Every fifteen seconds, WatchCrux polls the engine's health and readiness endpoints. It captures response status, latency, build version, SDK version, and dependency state. This isn't just uptime monitoring. It's building a continuous record of the system's operational state, so that when a receipt is later verified, there's an independent record of whether the system was healthy when the receipt was generated.
Metrics observation. Every sixty seconds, WatchCrux scrapes the engine's Prometheus metrics endpoint. It extracts whitelisted metrics (retrieval latency, error ratios, citation counts, mode distribution) and stores snapshots. Over time, this builds a trend line. If retrieval latency is drifting upward, or error ratios are creeping, WatchCrux detects it before it affects answer quality.
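Detecting a drifting trend line from periodic snapshots can be as simple as comparing a recent window against the earlier baseline. The window size and threshold below are illustrative, not WatchCrux's actual parameters.

```python
from statistics import mean

def latency_drifting(snapshots_ms, window=5, factor=1.5):
    """True if the mean of the last `window` latency samples exceeds the
    mean of the preceding history by more than `factor`."""
    if len(snapshots_ms) < 2 * window:
        return False  # not enough history to judge a trend
    recent = snapshots_ms[-window:]
    baseline = snapshots_ms[:-window]
    return mean(recent) > factor * mean(baseline)
```

A real implementation would likely use smoothed or percentile-based comparisons per metric, but the shape is the same: snapshots accumulate, and the watcher judges the recent past against the longer record.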
Version drift detection. WatchCrux tracks the engine's build version, SDK version, and compatibility state across all instances. If a deployment introduces a version skew (one instance running a newer build than another), WatchCrux flags it. If a major version change breaks compatibility with existing receipts, WatchCrux pages. This matters because receipts are bound to a specific configuration epoch. If the configuration changes, past receipts need to be interpreted in the context of the configuration that generated them, and WatchCrux ensures that change is visible.
Audit orchestration. WatchCrux can trigger and monitor full audit suite runs. It captures the results, compares them to previous runs, and detects regressions. If a deployment passes all unit tests but degrades retrieval quality on the canonical audit suite, WatchCrux catches it. The audit results are stored with timestamps and linked to the engine version that produced them, creating a longitudinal record of system quality.
When WatchCrux detects something, it routes alerts through Slack, Teams, or an operational escalation service. The alerts carry structured metadata: what changed, when, how severe, and what the downstream implications might be for receipt validity.
The obligation extends forward
The answer you received when you hit submit is not a static artifact. It's a point in a timeline.
The receipt anchors it to a specific knowledge state: what the system knew, at that moment, under that configuration. The receipt chain links it to every answer that came before and every answer that will come after. WatchCrux watches whether the conditions under which the answer was generated are still holding.
If the evidence changes (say, a cited document is superseded by a newer version), the system knows. The living state machine tracks document lifecycles. The relation graph tracks how documents connect. The answer's receipt remains valid (it faithfully records what the system knew at query time) but the system can now flag that the underlying evidence has moved.
This is the fundamental shift from "the system answered your question" to "the system is accountable for what it told you."
Most AI systems produce answers. CueCrux produces answers, receipts, and an ongoing obligation to watch whether those receipts are still meaningful. The answer is the beginning, not the end.
When you hit submit, you're not just asking a question. You're asking the system to commit to what it finds, sign a receipt for what it used, and keep watching whether what it told you is still true.
That's what happens when you hit submit.