2026-03-22 · Featured · cuecrux · retrieval · trust · architecture · audit · model-drift

The Model You Don't Own

What happened after 13/13, from stabilisation to quality baseline to the realisation that the biggest variable in the system isn't ours to control.

Two days ago I wrote about getting to thirteen out of thirteen. The audit suite passing, the CROWN submission, the trajectory from 10/11 to 13/13 × 3 with zero variance.

What I didn't write about was what happened next. Because what happened next changed how I think about the system.

After the freeze

Phase 7.0 was a celebration. Everything passed, everything was deterministic, everything was frozen. Twenty-three feature flags locked in a config manifest. Shadow replay agreement at 100%. The audit was clean.

Phase 7.1 was supposed to be operational hardening. Failure drills, retention policies, DQP observability, the boring work that makes a system ready for production rather than just capable in a test harness. And it was that. But it was also the first time we saw the quality numbers move without touching the retrieval engine.

Cat 6, our fragility calibration category, started producing all-zero leave-one-out scores in mixed-tenant runs even though it scored 1.0 in isolation. Cat 11, the multi-document broad recall category, swung between 0.273 and 0.927 across phases despite identical retrieval configuration. Cat 9's canonical recall, which we'd been reporting as 1.000, turned out to be 0.914 when we looked at the raw audit data honestly.

An external audit review forced a reckoning. Phase 7.0 was the quality baseline. Phase 7.1 was the ops baseline. They are not the same thing, and conflating them risks overreacting to variance that has nothing to do with the retrieval engine.

We reclassified Cat 6 and Cat 11 as monitor-only. Not because they don't matter, but because their variance is dominated by something outside our control: the language model's citation behaviour.

The staged deployment

Phase 7.2 was a controlled experiment. Two mechanisms, deployed in sequence, with attribution runs between each.

The first mechanism was profile-scoped pinnedIds. In Phase 6.6 we'd discovered that MiSES, the evidence diversity selector, was dropping controller-cited documents. The pinnedIds fix preserved them. But the original policy was always-on, which meant every query's evidence set was shaped by the controller's decisions. Profile-scoped meant only queries matching specific citation profiles (version families, relation pairs) got the pinned treatment.
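The scoping change can be sketched roughly as follows. This is an illustrative reconstruction, not the real codebase: `PIN_PROFILES`, the `Query` shape, and `select_evidence` are hypothetical names standing in for the actual MiSES integration.

```python
from dataclasses import dataclass, field

# Hypothetical: profiles that receive the pinned treatment.
PIN_PROFILES = {"version_family", "relation_pair"}

@dataclass
class Query:
    text: str
    citation_profile: str  # e.g. "version_family", "broad", "precision"
    controller_cited_ids: set = field(default_factory=set)

def pinned_ids_for(query: Query) -> set:
    """The always-on policy would return controller citations for every
    query; profile-scoped pinning returns them only for matching profiles."""
    if query.citation_profile in PIN_PROFILES:
        return set(query.controller_cited_ids)
    return set()

def select_evidence(candidates: list, query: Query, k: int = 8) -> list:
    """Diversity selection that never drops controller-cited documents
    for pinned profiles (mirroring the pinnedIds fix described above)."""
    pinned = pinned_ids_for(query)
    kept = [d for d in candidates if d["id"] in pinned]
    for d in candidates:  # fill the remaining slots in rank order
        if len(kept) >= k:
            break
        if d["id"] not in pinned:
            kept.append(d)
    return kept[:k]
```

The design point is the narrowing: only queries whose profile signals a known citation-fragile shape pay the cost of having the controller shape their evidence set.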

Deployed alone, profile-scoped pinnedIds was insufficient. Cat 11 scored 0.707, 0.697, 0.697 across three runs. One pass out of three. Right at the threshold, but not through it.

The second mechanism was citation cascade. For broad queries only, when the primary LLM's citation set is sparse, a secondary pass with a lighter model fills coverage gaps. The insight was that broad queries fail differently than precision queries. Precision queries fail because the right document isn't retrieved. Broad queries fail because the LLM doesn't cite enough of what was retrieved. A cascade that checks for missing coverage is cheaper and more targeted than running the full pipeline twice.
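The cascade logic can be sketched as below. `call_light_model` and the coverage threshold are illustrative stand-ins under assumed interfaces, not the production pipeline.

```python
def cascade_citations(query_kind, evidence_ids, primary_citations,
                      call_light_model, min_coverage=0.5):
    """For broad queries with sparse primary citations, ask a lighter
    model to cite from the uncovered remainder of the evidence set."""
    if query_kind != "broad":
        return primary_citations  # precision queries skip the cascade
    coverage = len(primary_citations) / max(len(evidence_ids), 1)
    if coverage >= min_coverage:
        return primary_citations  # already dense enough, no second pass
    uncovered = [d for d in evidence_ids if d not in primary_citations]
    extra = call_light_model(uncovered)  # secondary, cheaper pass
    return list(primary_citations) + [d for d in extra if d in uncovered]
```

Because the secondary pass only ever sees the uncovered documents, it cannot contradict the primary citations; it can only extend coverage, which is exactly the failure mode broad queries exhibit.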

With both mechanisms active: 0.722, 0.722, 0.722. Three runs. Perfect consistency. 13/13 × 3.

The cascade provided a stable +0.025 lift for broad queries. Not dramatic. But enough to clear the threshold reliably, which is all that matters when you're trying to prove stability rather than chase headroom.

We froze M0+M1 and moved on.

Phase 7.3: the two-layer narrative

Phase 7.3 shipped two features we'd been holding in reserve.

Format-aware citation prompting had been dormant since Phase 6.2. The code existed. The feature flag was off. The idea is simple: when the evidence set includes structured documents (JSON, YAML, CSV), tell the LLM explicitly that these formats contain retrievable facts and should be cited. Without the hint, structured documents get retrieved but ignored. Citation recall for structured formats went from 0.626 to 0.670-0.715. Not solved. But meaningfully better, and the mechanism is ours. We wrote it, we control it, we can improve it.
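The hint mechanism is simple enough to sketch in a few lines. The extension list and wording are assumptions for illustration; the real prompt text is not shown in this post.

```python
# Hypothetical sketch of format-aware citation prompting.
STRUCTURED_EXTS = {".json", ".yaml", ".yml", ".csv"}

def format_hint(evidence_paths):
    """Return an extra prompt line when the evidence set contains
    structured documents, naming the formats present."""
    formats = sorted({p.rsplit(".", 1)[-1] for p in evidence_paths
                      if "." + p.rsplit(".", 1)[-1] in STRUCTURED_EXTS})
    if not formats:
        return ""  # no structured documents, no hint needed
    return ("Note: the evidence includes structured documents "
            f"({', '.join(formats)}); they contain retrievable facts "
            "and should be cited like prose sources.")
```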

Relation-pair preservation was the second feature. In Phase 6.6 we fixed MiSES dropping controller-cited documents. But there was a subtler version of the same problem: after reranking, the evidence selector takes the top K results. If a parent document makes the cut but its relation-expanded child doesn't, the pair is broken before the controller ever sees it. Relation-pair preservation detects this at the selection boundary and injects the missing partner, capped at two injections per query.
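The boundary check described above can be sketched like this. Function and field names are hypothetical; the real selector presumably carries richer document records.

```python
def preserve_relation_pairs(selected, all_candidates, relations,
                            max_injections=2):
    """After top-K selection, re-inject a relation partner whose other
    half made the cut, capped at max_injections per query."""
    selected_ids = {d["id"] for d in selected}
    by_id = {d["id"]: d for d in all_candidates}
    injected = 0
    out = list(selected)
    for parent, child in relations:  # (parent_id, child_id) pairs
        if injected >= max_injections:
            break  # cap reached, stop injecting
        for present, missing in ((parent, child), (child, parent)):
            if (present in selected_ids and missing not in selected_ids
                    and missing in by_id):
                out.append(by_id[missing])
                selected_ids.add(missing)
                injected += 1
                break
    return out
```

The cap matters: injection happens after reranking, so each injected partner displaces headroom the reranker earned, and an uncapped version would quietly turn the selector back into an always-on pinning policy.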

Cat 12 parent-child recall went back to 1.000. It had been 0.846 in 7.1 and 7.2. The mechanism is clean and the attribution is clear.

But then we looked at Cat 11.

Broad recall had jumped from 0.722 in 7.2 to 0.927 in 7.3. A massive improvement. The kind of number that makes you want to write a press release.

We didn't write a press release. We ran the attribution matrix.

We disabled relation-pair preservation and re-ran. Cat 11: 0.927. We disabled format-aware citation and re-ran. Cat 11: 0.927.

Neither of our shipped features caused the improvement. The 0.722 to 0.927 lift was entirely external. Something changed in the language model's behaviour between the 7.2 and 7.3 run windows. Not our prompt. Not our retrieval. Not our evidence selection. The model itself.

This is when it hit me.

The variable you can't pin

We control the retrieval engine. We control the citation controller. We control the evidence selector, the fragility scorer, the config manifest, the receipt chain. We sign everything with ed25519 keys managed through Vault Transit. We can prove exactly what evidence was retrieved and exactly what configuration was running.

We do not control the language model.

And the language model is the single largest source of variance in the system.

Cat 11's journey tells the story. Phase 6.2: regression caused by admission controller changes. Phase 6.3: fixed by evidence selector narrowing context. Phase 7.0: 0.927. Phase 7.1: somewhere between 0.273 and 0.927 depending on when you ran it. Phase 7.2: stabilised at 0.722 with cascade. Phase 7.3: back to 0.927, and none of our code caused it.

The model we're calling through an API is a moving target. OpenAI's o4 snapshots shift without announcement. The behaviour changes are not documented. The timing is not predictable. And because we're calling it for the most sensitive operation in the pipeline, answer synthesis and citation selection, every shift in model behaviour propagates directly into the metrics that define our quality baseline.

This is not a hypothetical concern. We have five phases of empirical evidence showing that the model's citation behaviour is the dominant variable in at least three of our thirteen audit categories.

The decision we're circling

There are two responses to this.

The first is to harden the boundary. Accept that the model is external and build the observability to detect when it shifts. Pin model provenance in the config manifest. Capture the model ID, provider, and run date in every audit record. Build a model-drift sentinel that runs the most model-sensitive categories (Cat 2, Cat 11, Cat 12) on a schedule and alerts when the numbers move. Treat the model like any other external dependency: monitor it, version it, plan for it to change.

We've already started down this path. Phase 7.3 shipped with model provenance in the config manifest, model provenance in audit JSON output, a model provenance check in the release gate, and a drift sentinel pack. We hash our prompts and run spillover regression suites when they change. If the model shifts, we'll know.
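The provenance stamp and sentinel comparison might look something like the sketch below. Field names, category keys, and the 0.05 tolerance are assumptions for illustration, not the shipped schema.

```python
import datetime
import hashlib

def audit_record(model_id, provider, prompt_text, scores):
    """Provenance stamp attached to every audit run: model identity,
    run date, and a hash of the prompt actually used."""
    return {
        "model_id": model_id,
        "provider": provider,
        "run_date": datetime.date.today().isoformat(),
        "prompt_sha256": hashlib.sha256(prompt_text.encode()).hexdigest(),
        "scores": scores,
    }

# The most model-sensitive categories, per the post: Cat 2, 11, 12.
SENTINEL_CATEGORIES = ("cat2", "cat11", "cat12")

def drift_alerts(baseline, current, tolerance=0.05):
    """Flag sentinel categories whose score moved beyond tolerance.

    If the prompt hashes match and a category still moved, the shift
    is attributable to the model, not to us.
    """
    return [c for c in SENTINEL_CATEGORIES
            if abs(current["scores"][c] - baseline["scores"][c]) > tolerance]
```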

The second response is to stop depending on it.

Not entirely. The retrieval engine doesn't need an external model. Embeddings are already self-hosted on our own GPU. Reranking is self-hosted. The citation controller is deterministic. The evidence selector is deterministic. The only place we call an external model is the final synthesis step: taking curated evidence and producing a human-readable answer with citations.

If we hosted the answer model ourselves, the moving-target problem disappears. We'd control the model version. We'd control the update schedule. We'd know, with certainty, that the model hadn't changed between one audit run and the next.

The trade-off is real. Self-hosted answer models are worse than frontier API models today. The gap is closing, but it exists. Running inference at the quality level we need requires serious hardware. And maintaining a model deployment adds operational surface area that we don't currently carry.

But here's what keeps me up at night. We built CROWN to prove what evidence a system used. We built thirteen categories of adversarial tests to prove the retrieval engine is reliable. We built a citation controller to correct systematic LLM errors deterministically. And after all of that, the single largest source of variance in our quality metrics is a model we access through an API whose behaviour changes without notice.

The logging-and-monitoring path is the pragmatic choice. It's what we're doing now. It works today. And it might be sufficient if model providers start publishing change logs and offering pinned snapshots with meaningful stability guarantees.

The self-hosted path is the architectural choice. It eliminates a class of risk rather than managing it. But it requires us to accept a quality trade-off today in exchange for control tomorrow.

What we're actually doing

Right now, both.

The monitoring infrastructure is live. Model provenance is tracked. Drift detection is automated. If the next o4 snapshot shifts citation behaviour, we'll see it in the sentinel pack before it reaches production.

And we're evaluating what self-hosted answer synthesis would look like. Not as an emergency measure. Not as a philosophical statement about vendor independence. As an engineering question: at what point does the cost of managing external model variance exceed the cost of running our own?

We don't have the answer yet. But the fact that we're asking the question at all tells you something about where the last two weeks have taken us.

Phase 7.0 proved the retrieval engine works. Phase 7.3 proved that the retrieval engine isn't the variable that matters most. The system is only as stable as its least controlled component, and right now, that component is the one thing we don't own.

The two-layer truth

The third external audit review of Phase 7.3 established a framework we're now using for everything.

Layer one: owned engineering delta. Things we built, we control, and we can prove. Relation-pair preservation taking Cat 12 from 0.846 to 1.000. Format-aware citation taking Cat 2 from 0.626 to 0.715. These are ours. They'll hold regardless of what the model does.

Layer two: observed environment delta. Things that improved but aren't attributable to shipped code. Cat 11's 0.722 to 0.927 lift. Healthy. Real. And entirely contingent on an external system continuing to behave the way it behaves today.

The canonical statement for 7.3: "Cat 2 and Cat 12 improvements are product-owned. Cat 11 is currently excellent but not attributable to a shipped mechanism."

This framing might seem like hedging. It isn't. It's honesty about what you control and what you don't. And in a system designed to produce cryptographic proof of evidence provenance, honesty about your own dependencies isn't optional. It's the whole point.

We'll keep building the layers we own. And we'll keep watching the one we don't. The question is how long that's sustainable before the architecture has to change.