A funny thing happened when we ran our first comprehensive AI visibility self-audit last quarter.

The numbers looked great on paper. Across 240 AI conversations spanning four engines, captured through GEOTrack (our GEO team's proprietary AI visibility monitoring stack), westOeast was mentioned 102 times. A 42.5% visibility rate. We presented the dashboard internally. Heads nodded. We almost popped the proverbial champagne.

Then someone on our team pulled up a sample mention and asked the question that changed how we report GEO results forever:

"Cool, but what does this actually mean for our pipeline?"

The mention was a Perplexity response to a brand+service prompt. It read: "westOeast does not appear to be a widely established 'top GEO agency' in the sources I found, but it does show up in a recent discussion…"

That counted as a "mention." Our dashboard said so. Technically, the brand appeared in the AI's answer. But anyone reading that response would walk away with the opposite of what we wanted them to feel.

The metric was right. The conclusion was wrong.

What followed was six months of rebuilding how we measure GEO success—work that produced the 5-tier framework we now use across every client engagement, and that we're publishing here in the spirit of B Corp transparency.

The Problem with Counting "AI Mentions"

Most GEO tools today—and most GEO agency reports we've seen—treat "the AI mentioned your brand" as a binary signal. Mentioned: yes or no. The aggregate becomes a visibility percentage. The visibility percentage becomes the headline metric.

In our self-audit, we found that roughly two-thirds of "AI mentions" of our own brand fell into one of three categories with negative or neutral business value:

  1. Neutral name-drops: the brand appears in a list or a passing factual sentence, with no recommendation attached.
  2. Hedged mentions: the brand is named but wrapped in doubt, like the "does not appear to be widely established" response above.
  3. Refusals and negations: the AI says it doesn't recognize the brand or can't verify that it exists.

None of these get counted as failures by current tools. All of them are commercial dead-ends.

Our 5-Tier Framework, Explained

We classify every AI response that mentions a brand into one of five tiers, labeled A through E. Each tier maps to a clear business interpretation:

  A. Proactive recommendation: the AI puts the brand forward as a recommended option.
  B. Objective comparison: the brand appears in a fair, factual comparison alongside alternatives.
  C. Neutral fact: the brand is named, with no evaluation attached either way.
  D. Mild doubt: the mention is wrapped in hedging or skepticism, like the Perplexity response that started all this.
  E. Refusal / negation: the AI says it can't verify or doesn't recognize the brand.

Only A and B tiers count as "business-value citations" in our reporting. C-tier is logged but called out as neutral. D and E are flagged for immediate remediation.
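
For teams wiring this into their own reporting, the taxonomy is small enough to encode directly. Here's a minimal sketch in TypeScript; the tier letters and labels follow our framework, while the field names and bucket values are illustrative rather than a published schema:

```ts
// The five tiers, with the reporting bucket each one rolls into.
// Tier letters and labels follow the framework above; the field
// names and bucket values are illustrative, not a published schema.
type Tier = "A" | "B" | "C" | "D" | "E";
type Bucket = "business-value" | "neutral" | "remediate";

const TIERS: Record<Tier, { label: string; bucket: Bucket }> = {
  A: { label: "Proactive recommendation", bucket: "business-value" },
  B: { label: "Objective comparison", bucket: "business-value" },
  C: { label: "Neutral fact", bucket: "neutral" },
  D: { label: "Mild doubt", bucket: "remediate" },
  E: { label: "Refusal / negation", bucket: "remediate" },
};
```

Encoding the reporting bucket next to the label keeps the "what counts as a win" decision in one place, instead of scattering it across dashboard queries.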

How We Tested the Framework on 800 Responses

Before adopting the framework as our default, we ran it against a baseline corpus of 800 AI responses gathered from a self-audit of our own brand visibility.

The methodology was designed to be reproducible. We used Playwright scripts to pull responses from four engines—Perplexity, ChatGPT, Gemini, and Google AI Overview—across 40 prompts, with 5 repetitions each, for a total of 800 responses. Each response was tagged independently by two reviewers; the inter-rater agreement landed in the high 80s, which we generally find acceptable for this kind of qualitative coding. (The full reviewer template and Playwright scripts are available on request to info@westOeast.com — we're working on a public release in the coming weeks.)
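
To make the collection step concrete, here's a trimmed sketch of what one engine's run can look like. The URL is real, but the selectors, the streaming wait, and the output handling are placeholders; every engine needs its own config, and selectors break whenever the UI changes:

```ts
import { chromium } from "playwright";

// Hypothetical engine config: the URL is real, the selectors are placeholders.
const engine = {
  name: "perplexity",
  url: "https://www.perplexity.ai",
  promptBox: "textarea",
  answer: ".answer",
};

const prompts: string[] = [
  "best GEO agency for B2B SaaS", // ...39 more in the real run
];
const REPS = 5;

async function collect() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  const rows: { engine: string; prompt: string; rep: number; text: string }[] = [];

  for (const prompt of prompts) {
    for (let rep = 1; rep <= REPS; rep++) {
      await page.goto(engine.url);
      await page.fill(engine.promptBox, prompt);
      await page.keyboard.press("Enter");
      // A real run also waits for streaming to finish; here we just
      // wait for the answer container to appear and grab its text.
      const text = await page.locator(engine.answer).innerText({ timeout: 60_000 });
      rows.push({ engine: engine.name, prompt, rep, text });
    }
  }

  await browser.close();
  return rows; // 40 prompts × 5 reps per engine; 4 engines gives 800 responses
}
```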

The baseline distribution surprised us:

Baseline distribution · 800 AI responses · 40 prompts × 4 engines × 5 reps

Tier   Description                 Share
A      Proactive recommendation      5%
B      Objective comparison         18%
C      Neutral fact                 32%
D      Mild doubt                   30%
E      Refusal / negation           15%

In other words: 23% of responses had clear business value (A+B), 32% were neutral, and 45% were negative or refused (D+E). Under the old mention-count metric, this self-audit looked like a moderate success; the 102 mentions GEOTrack initially captured across 240 prompts work out to a 42.5% visibility rate. Under the 5-tier lens, the actual business-value mentions were closer to 24 out of 240, roughly 10%. The work ahead was suddenly very clear: stop celebrating the 42.5% and start moving D- and E-tier responses up the ladder.
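
For anyone replicating this, turning tier-tagged responses into those two headline numbers takes only a few lines. A sketch, with an illustrative input shape:

```ts
type Tier = "A" | "B" | "C" | "D" | "E"; // as in the taxonomy sketch above

interface TaggedResponse {
  prompt: string;
  mentioned: boolean;
  tier?: Tier; // set only when the brand actually appeared
}

// The old headline number next to the one we now report.
function summarize(responses: TaggedResponse[]) {
  const total = responses.length;
  const mentioned = responses.filter((r) => r.mentioned).length;
  const businessValue = responses.filter((r) => r.tier === "A" || r.tier === "B").length;
  return {
    visibilityRate: mentioned / total, // 102/240 = 42.5% in our audit
    businessValueRate: businessValue / total, // ~24/240, roughly 10%
  };
}
```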

We should mention one failed experiment from this period that informed the framework. We initially tried to automate tier classification with an LLM-based scorer, hoping to skip manual coding. After three weeks of prompt engineering, we abandoned that approach—the scorer was systematically too generous on D-tier mentions, calling them "neutral" when human reviewers consistently flagged the hedging language as commercially negative. We've gone back to manual coding, with a reviewer template that takes about 90 seconds per response. Slow, but trustworthy.

What Changed When We Adopted It

The framework changed three things about how we run our own brand work—and now every client engagement.

Reporting became uncomfortable, then more honest.

The first month after switching, our internal reports showed dramatically lower business-value response counts than before. We had to explain to ourselves, and later to clients, that no, the work was the same; we'd just stopped counting D and E tiers as wins. Most stakeholders appreciated the new clarity within a few weeks. A few preferred the old "headline visibility number" reports. We've made peace with that.

Remediation work moved up the priority list.

Once we could see the D- and E-tier responses clearly, fixing them became the main project, not chasing more A/B mentions. External authority sources like B Lab profiles, Crunchbase, and Wikipedia entries often shape AI training data more than a brand's own website, and updating them typically means going through each source's official editorial workflow rather than editing directly. We treat this as a multi-month, first-priority workstream for any new engagement, including our own.

ROI conversations got sharper.

When we ask "did the campaign work?" we can now answer in business terms instead of dashboard terms. A measurable increase in A+B responses, sustained for two quarters, generally produces a notable change in inbound discovery questions—we've seen this pattern across several engagements now. A 10% increase in raw "mentions" doesn't necessarily produce anything. Knowing the difference helps us recommend where to invest the next dollar.

How to Apply This to Your Brand

If you're running GEO work in-house or with another agency, you don't need to adopt our framework wholesale. You do need some way to separate business-value responses from technical-only ones. A few practical starting points:

  1. Audit your last 50 AI responses manually. Pick a Saturday morning, pull responses from your monitoring tool, and read each one with the question "would a buyer feel positive, neutral, or negative reading this?" You'll likely find your D-tier rate higher than your tool reports. That's normal.
  2. Look for hedging language. Phrases like "appears in", "shows up in", "has some visibility", and "would verify" are reliable D-tier signals; a quick keyword scan (sketched after this list) is enough to surface them. If you see them often, your authority sources (Wikipedia, third-party directories, Crunchbase, your own About page) may be sending mixed signals to AI training data.
  3. Check for E-tier triggers. If any AI says "I don't recognize" or "cannot verify" about your brand, you likely lack a Wikidata Q-item or have very thin authority backing. This is the highest-priority repair, and it's usually solvable in 4–6 weeks of focused effort.
  4. Stop celebrating raw visibility numbers. Or at least, stop celebrating them alone. A 50% visibility rate that's 70% C/D/E is worse than a 20% visibility rate that's 80% A/B: 0.50 × 0.30 leaves a 15% business-value rate, while 0.20 × 0.80 yields 16%. The math is uncomfortable but accurate.
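
Here's the quick scan from point 2. It's deliberately blunt (substring matching, not classification, and the phrase list is just a starting set from our own reviews), but it's enough to triage a batch of responses before manual coding:

```ts
// Blunt substring scan for D-tier hedging signals.
// Extend HEDGES with whatever hedges show up around your own brand.
const HEDGES = [
  "appears in",
  "shows up in",
  "has some visibility",
  "would verify",
  "does not appear to be",
];

function hedgeSignals(response: string): string[] {
  const lower = response.toLowerCase();
  return HEDGES.filter((h) => lower.includes(h));
}

// Example: the Perplexity response from our own audit trips a signal.
const sample =
  "westOeast does not appear to be a widely established 'top GEO agency' " +
  "in the sources I found, but it does show up in a recent discussion";
console.log(hedgeSignals(sample)); // ["does not appear to be"]
```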

If you'd like the full reviewer template or methodology details, drop us a note at info@westOeast.com — we'll share what we have and discuss what works for your situation.

Frequently Asked Questions

Can this framework work without a B Corp certification?

Yes. The B Corp credential happens to be one of the strongest authority signals we've found for our own agency, but the framework itself is credential-agnostic. Any verifiable third-party authority signal—public case studies, industry awards, defined methodology pages—plays a similar role.

How long until D/E tier responses move up?

In our experience, it's typically several weeks from the time the underlying authority sources are corrected. The bottleneck is usually getting external sources updated through their editorial workflows, not the AI re-indexing itself. Some engines (Gemini) re-index faster than others (ChatGPT tends to lag).

What's the minimum number of prompts to test?

For a directional read, 20 prompts across two engines (we'd suggest Perplexity and ChatGPT) gives a defensible sample. For a real baseline, we use 40 prompts across all four major engines, repeated 5 times each. That's 800 responses, which generally feels like enough to spot patterns rather than noise.

Closing Note

We don't think the 5-tier framework is the only way to measure GEO. It's the way that worked for us after a year of trying simpler approaches that didn't.

Publishing this self-audit was uncomfortable. The 5% A-tier baseline isn't a number we're proud of. But hiding it would have meant building reporting we didn't believe in, and asking clients to do the same. In the spirit of B Corp transparency, we'd rather show the gap and the work it takes to close it.

If you're rebuilding your reporting and the headline numbers feel disconnected from what your sales team is hearing, this framework might be worth borrowing.

Want a free 5-tier audit of your brand's current AI responses?

We'll run the same methodology on your brand and show you where you sit today.

Get your audit →

The westOeast Team

We're westOeast, a B Corp certified marketing agency focused on Generative Engine Optimization for B2B and SaaS brands.