A few weeks ago I ran a small, slightly lazy test, mostly to confirm our monitoring was working.
I took twelve B2B prompts we use to track a client’s AI search visibility and ran them across four engines — ChatGPT, Perplexity, Gemini, and Claude. Four days later, with nothing about the client changed in between — no new content, no fixes shipped — I ran the same twelve again.
The citation sets didn’t match. On the most volatile of the four engines, a meaningful share of the sources it had leaned on the first time were simply gone the second time, replaced by others. Same questions. Same correct answer. Four days apart.
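One way to put a number on "gone the second time" is the share of first-run sources that are missing from the second run. What follows is a minimal sketch, assuming each run has been reduced to a set of cited URLs per prompt and engine. The function name and the example URLs are hypothetical and are not what our monitoring actually runs.

```python
def citation_turnover(first_run: set[str], second_run: set[str]) -> float:
    """Share of sources cited in the first run that are absent from the second."""
    if not first_run:
        return 0.0  # nothing was cited the first time, so nothing could drop out
    dropped = first_run - second_run
    return len(dropped) / len(first_run)


# Hypothetical citation sets for one prompt on one engine, four days apart.
run_day_0 = {"a.example/guide", "b.example/thread", "c.example/report", "d.example/blog"}
run_day_4 = {"a.example/guide", "c.example/report", "e.example/news", "f.example/docs"}

turnover = citation_turnover(run_day_0, run_day_4)
print(f"{turnover:.0%} of first-run sources were gone")  # 2 of the 4 were dropped
```

The measure is deliberately one-directional: it asks what was lost from the first run, not how similar the two sets are overall.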
I want to be careful with that, because this is where field notes go wrong. Twelve prompts is not a sample. It’s a probe. It’s enough to tell me something is worth measuring properly. It is nowhere near enough to tell me what. So this piece is not a result. It’s the probe, and the test we’ve designed because of it — written down in public before we run it, so the design can’t quietly bend to fit whatever answer we’d like.
It isn’t only us seeing this
The instability itself isn’t a westOeast discovery. Earlier this year Kevin Indig published an analysis of 3.7 million citations across 20,000 prompts and found that only about 2.37% of cited URLs showed up across three LLMs for the same prompt — roughly nine in ten were single-platform.
That’s a different cut of the problem than mine. His is about how little engines agree with each other; my probe is about how little an engine agrees with itself a few days later. But the underlying message is the same: “AI visibility” is not one stable thing you can measure once and file. If anything, my probe suggested the instability runs along a second axis — time — on top of the one Indig already mapped.
The test we’re running
Here’s the experiment we’ve set up. I’m describing the design, not the findings, on purpose — there are no findings yet.
Sixty prompts instead of twelve. That is still small, with only double-digit coverage per vertical, but it is enough that one strange prompt can’t steer the whole picture. The same four engines. And instead of two runs four days apart, one run every day for two weeks, because the probe’s four-day gap was too coarse to tell a steady drift apart from an uneven one.
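To make the cadence argument concrete, here is a small sketch of how a daily series can separate steady drift from uneven drift. It assumes drift is scored as Jaccard similarity between consecutive days' citation sets, which is my illustrative choice and not a settled part of the design. The source labels are made up.

```python
from statistics import pstdev


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two citation sets: 1.0 is identical, 0.0 is disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def day_over_day(daily_sets: list[set[str]]) -> list[float]:
    """Similarity of each day's citation set to the previous day's."""
    return [jaccard(prev, cur) for prev, cur in zip(daily_sets, daily_sets[1:])]


# Size of the planned grid: prompts x engines x daily runs over two weeks.
PROMPTS, ENGINES, DAYS = 60, 4, 14
total_observations = PROMPTS * ENGINES * DAYS  # 3360 engine responses

# Hypothetical four-day series for one prompt on one engine.
steady = [{"a", "b", "c", "d"}, {"b", "c", "d", "e"}, {"c", "d", "e", "f"}, {"d", "e", "f", "g"}]
uneven = [{"a", "b", "c", "d"}, {"a", "b", "c", "d"}, {"e", "f", "g", "h"}, {"e", "f", "g", "h"}]

for name, series in (("steady", steady), ("uneven", uneven)):
    scores = day_over_day(series)
    # A low spread means the set changes a little every day;
    # a high spread means long quiet stretches broken by sudden swaps.
    print(name, [round(s, 2) for s in scores], "spread:", round(pstdev(scores), 2))
```

Comparing only the first and last day, both series look heavily changed. The day-over-day scores are what tell them apart.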
One clarification, because it matters: this is internal R&D. It is not the 40-prompt baseline audit we run for client engagements — that’s a different instrument for a different job. This test exists to pressure-test our own methodology, not to deliver a client report.
What we’re specifically trying to settle:
- Whether the drift is uneven by engine. The probe hinted that some engines hold steadier than others. If that holds, then a blended cross-engine “visibility score” — the number most dashboards lead with — is averaging away the one engine you’d most need to watch.
- Whether it’s the index or the route. Between my two probe runs, the brand never dropped out of the engines’ knowledge. What changed was which sources they assembled the answer from. If that pattern survives 60 prompts, then being indexable is necessary but nowhere near sufficient: the work is about which route the model walks, not just what it can reach.
- Whether cited sources flip category, not just rank — a community thread giving way to an industry article rather than the same kinds of source merely reshuffling.
- And whether there’s a language pattern. In the tiny probe, non-English prompts appeared to move less than English ones. That is exactly the sort of thing a 12-prompt probe will hand you by accident, so I’m holding it loosely until a real run agrees or disagrees.
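The third question, category flips versus mere reshuffles, is the one that most needs an explicit rule before the run starts. This is a rough sketch of one possible rule, assuming every citation has already been hand-labelled with a source category. The `Citation` type, the category labels, and the four outcome names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    url: str
    category: str  # hypothetical labels, e.g. "community", "industry", "vendor"


def category_mix(citations: list[Citation]) -> Counter:
    """How many citations fall into each source category."""
    return Counter(c.category for c in citations)


def classify_change(before: list[Citation], after: list[Citation]) -> str:
    """Label the change between two ranked citation lists for the same prompt."""
    urls_before = [c.url for c in before]
    urls_after = [c.url for c in after]
    if urls_before == urls_after:
        return "stable"              # same sources, same order
    if set(urls_before) == set(urls_after):
        return "rank_only"           # same sources, different order
    if category_mix(before) == category_mix(after):
        return "same_category_swap"  # new sources, same kinds of source
    return "category_flip"           # the kinds of source themselves changed


# Hypothetical example: a community thread gives way to an industry article.
day_1 = [Citation("forum.example/t/1", "community"), Citation("mag.example/a", "industry")]
day_2 = [Citation("mag.example/a", "industry"), Citation("journal.example/b", "industry")]
print(classify_change(day_1, day_2))
```

Fixing the rule in advance is the point: if the categories or thresholds are chosen after looking at the data, "flip" can be made to mean whatever the data happens to show.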
What we already got wrong
Two things, before the full run has even produced a number.
The probe’s design was wrong in a way worth saying out loud. Four days between runs can establish that drift exists, but not its rhythm. If we’d kept that cadence and scaled only the prompt count, we’d have produced a tidier — and more confident — conclusion than the data could carry. The daily cadence is a correction to our own first mistake.
And I’m walking in carrying a hypothesis I’m a little too fond of: that drift is worst for new, thin content and settles as a page matures. That’s precisely the kind of belief an experiment exists to kill. I’m writing it down here so I can’t quietly drop it later if the run disagrees.
What the probe already earns the right to say
Even with no full result yet, the probe justifies one practical point — and it’s for the people who read AI-visibility reports, not the ones who run them.
If someone hands you a monthly AI search visibility number, ask two questions before you trust it. How often was it measured? And is it a single figure, or a spread? A number sampled once a month is closer to one weather reading than to a climate. A single blended score across engines has usually had its most useful information — the disagreement between engines — averaged out of it.
We’ve already changed our own reporting on that basis: we sample on a cadence rather than once, we report per engine rather than blended, and we lead with a median across a window instead of a latest-value snapshot. It makes the report less tidy. The 60-prompt run is partly there to tell us how much less tidy it honestly needs to be.
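For readers who want to see what that reporting change looks like in practice, here is an illustrative sketch. It assumes the tracked metric is a daily citation rate per engine, meaning the share of prompts in which the brand was cited. The engine names and all figures are invented for the example and are not probe data.

```python
from statistics import mean, median

# Hypothetical daily citation rates over a five-day window, one series per engine.
daily_rates = {
    "engine_a": [0.40, 0.42, 0.41, 0.39, 0.40],
    "engine_b": [0.10, 0.55, 0.12, 0.50, 0.60],
}


def summarize(series: list[float]) -> dict[str, float]:
    """Report a window as a median plus its spread, alongside the latest value."""
    return {
        "median": median(series),
        "min": min(series),
        "max": max(series),
        "latest": series[-1],
    }


per_engine = {engine: summarize(series) for engine, series in daily_rates.items()}

# The single blended snapshot that many dashboards would lead with.
blended_latest = mean(series[-1] for series in daily_rates.values())

for engine, summary in per_engine.items():
    print(engine, summary)
print("blended latest:", round(blended_latest, 2))
```

In this made-up window the blended snapshot sits near 0.5, which hides that one engine barely moved while the other swung between 0.10 and 0.60.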
Closing
This is deliberately a field note from before the result, not after it. When the full run is done I’ll post what it shows — starting with whichever of those four questions it answers in a way I didn’t expect. The hypothesis it kills will probably be the most useful part.