1. Why we publish this page
Measuring how a site is structured for AI agents is one thing. Showing that the resulting number tracks real-world authority is another. Our internal review found that the legacy 5-dimension score inflated younger sites with strong technical hygiene and under-scored institutional sites that LLMs actually cite. We added an Authority dimension (Wikipedia presence, domain age, archive footprint, Open PageRank, Tranco rank) at 25% of the composite to fix it.
This page is how we test it. We hand-classify 30 sites into three strata — A institutional, B mid-tier SaaS, C younger / local — and verify that the 6-dimension score moves each stratum in the direction we'd expect: A goes up, C goes down. That's construct validity. We also run an LLM citation probe as a secondary check, with the limitations honestly disclosed below.
2. Method
- 30 sites across 3 strata: A — institutional / high-citation (LinkedIn, Wikipedia, GitHub, …); B — mid-tier SaaS (Calendly, PostHog, Linear, …); C — younger or local sites with limited corpus presence.
- 5 standardised category-level prompts per site, fired against a single LLM (OpenAI GPT-4o). We discuss the single-judge limit explicitly in the disclosure below.
- citationRate = (responses citing the domain) / (5 prompts) — a scalar in [0, 1].
- Primary signal: per-stratum mean of v1 (legacy) and v2 (with Authority). The Authority dimension is validated when stratum A's mean v2 ≥ mean v1 and stratum C's mean v2 ≤ mean v1.
- Secondary signal: Spearman rank correlation ρ(score, citationRate). Reported with bootstrap CI but interpreted with care — single-judge probes saturate at ceiling for sites the model already knows, especially in non-English markets.
| Queries per site | Models tested | Validation criterion |
|---|---|---|
| 5 (5 prompts × 1 model) | 1 (OpenAI GPT-4o) | MET (Authority direction matches strata) |
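The citationRate definition and the primary validation criterion from the Method list can be sketched as follows. This is a minimal illustration, not our production pipeline: the function names are hypothetical, and treating a case-insensitive substring match as "citing the domain" is an assumption made for the example.

```python
from statistics import mean


def citation_rate(responses: list[str], domain: str) -> float:
    """Fraction of responses that mention the domain, a scalar in [0, 1].

    Assumption: a response "cites" the domain when the domain string
    appears anywhere in it (case-insensitive substring match).
    """
    if not responses:
        raise ValueError("need at least one response")
    hits = sum(domain.lower() in r.lower() for r in responses)
    return hits / len(responses)


def authority_validated(v1_a: list[float], v2_a: list[float],
                        v1_c: list[float], v2_c: list[float]) -> bool:
    """Primary criterion: stratum A's mean v2 >= mean v1 and
    stratum C's mean v2 <= mean v1."""
    return mean(v2_a) >= mean(v1_a) and mean(v2_c) <= mean(v1_c)
```

With 5 prompts per site, `citation_rate` can only return one of six values (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), which is the coarseness discussed in the single-judge note.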
3. Construct validity: score behaviour by stratum
If the Authority dimension does what we designed it to do, stratum A institutional sites should see their score lift (or stay flat) and stratum C smaller / younger sites should see it drop. The table below shows the mean v1 → v2 movement for each stratum, computed on the latest run (sentinel sites excluded).
| Stratum | Sites | Mean v1 | Mean v2 | Δ (v2 − v1) | Mean Authority | Mean cite rate |
|---|---|---|---|---|---|---|
| A · Institutional | 10 | 49.2 | 50.7 | +1.5 | 66.9 | 94% |
| B · Mid-tier SaaS | 10 | 57.0 | 51.2 | -5.8 | 47.9 | 70% |
| C · Younger / local | 9 | 54.7 | 45.1 | -9.6 | 28.3 | 93% |
↑ Authority lifted institutional sites by 1.5 points on average — exactly the direction the dimension is designed to push.
↓ Authority pulled younger / local sites down by 9.6 points on average — the inflation we were targeting.
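As a quick arithmetic check, the Δ column and the validation criterion can be recomputed from the rounded stratum means in the table above. The dictionary layout and variable names are illustrative only; the real run works from per-site scores, not from these rounded means.

```python
# Stratum means copied from the table above (rounded to one decimal).
strata = {
    "A": {"v1": 49.2, "v2": 50.7},
    "B": {"v1": 57.0, "v2": 51.2},
    "C": {"v1": 54.7, "v2": 45.1},
}

# Delta = mean v2 - mean v1, rounded to match the table's precision.
deltas = {name: round(s["v2"] - s["v1"], 1) for name, s in strata.items()}

# Criterion from the Method section: A must not drop, C must not rise.
criterion_met = deltas["A"] >= 0 and deltas["C"] <= 0

print(deltas)         # {'A': 1.5, 'B': -5.8, 'C': -9.6}
print(criterion_met)  # True
```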
4. Secondary signal · LLM citation probe
| Metric | Value | Note |
|---|---|---|
| ρ(v1, citation) | -0.20 | Legacy 5-dim score |
| ρ(v2, citation) | -0.02 | 6-dim with Authority · 95% CI [-0.38, 0.34] |
| Phase 1 status | MET | Δ = 0.18 · interpret with single-judge caveat |
Single-judge note: with one LLM and 5 prompts per site, citationRate can take only 6 distinct values (0/5 … 5/5), and it saturates at 100% for sites the model knows well. The effect is strongest for French brands: they are present in the model's training data whatever their Authority signals (Wikipedia, PageRank, Tranco), and those signals skew anglophone. Spearman is reported for transparency, but with this sampling budget it cannot independently validate the score. A multi-judge Phase 2 (multiple LLMs plus a ranked-position metric) is the proper empirical test; we publish Phase 1 as it stands rather than wait.
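For readers who want to reproduce the secondary signal on their own data, here is a minimal standard-library sketch of Spearman's ρ with tie-averaged ranks and a percentile bootstrap confidence interval. It is an assumption-laden illustration: the function names, the number of resamples, the seed, and the choice of the percentile method are ours for this example and are not a description of the exact procedure behind the figures above.

```python
import random
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks; None if either side is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Spearman's rho, resampling sites in pairs."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho is not None:  # skip degenerate resamples with no variance
            stats.append(rho)
    if len(stats) < 2:
        raise ValueError("too few usable resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

Because citationRate has only six possible values, many sites tie on rank; the tie-averaging in `average_ranks` is what keeps ρ well defined in that situation, and heavy ties at 5/5 are one reason the interval is wide.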
5. What Phase 2 would change
Three changes would close the empirical gap left by Phase 1: (a) two additional LLM judges to break the single-model bias, (b) a ranked-position citationRate (where does the brand appear in a top-N list?) to recover variance lost to saturation, (c) doubling the dataset to 60 sites with more international stratum C entries. We're publishing Phase 1 because the construct-validity evidence above is independent of those changes — they only refine the secondary signal.
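One possible shape for the ranked-position metric in (b) is a mean reciprocal rank over the top-N lists returned for each prompt. This is purely a sketch of the idea: Phase 2 is not specified yet, and the function names, the exact-name matching, and the choice of reciprocal rank are all assumptions made for illustration.

```python
def reciprocal_rank(ranked_brands: list[str], brand: str) -> float:
    """1 / position of the brand in a top-N list, or 0.0 if it is absent.

    Assumption: brands are matched by case-insensitive exact name.
    """
    for position, name in enumerate(ranked_brands, start=1):
        if name.casefold() == brand.casefold():
            return 1.0 / position
    return 0.0


def ranked_citation_rate(top_n_lists: list[list[str]], brand: str) -> float:
    """Mean reciprocal rank of the brand across all prompts' top-N lists."""
    if not top_n_lists:
        raise ValueError("need at least one ranked list")
    total = sum(reciprocal_rank(lst, brand) for lst in top_n_lists)
    return total / len(top_n_lists)
```

Unlike the binary cited / not cited count, a brand listed first on every prompt and a brand listed fifth on every prompt get different scores, which is how a metric of this kind would recover variance lost to saturation.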
Phase 1 run updated 2026-05-14. We re-run the benchmark whenever the scoring formula changes materially; results are versioned and the historical record stays available on request.