What is an Agent-Readiness Score?

An Agent-Readiness Score (0-100) measures how well AI agents can discover, understand, and recommend your business. It evaluates structured data, LLM readability, technical accessibility, and agentic SEO signals.

How does AgentLayers help with EU AI Act compliance?

AgentLayers automatically evaluates AI agents against EU AI Act requirements including risk classification, transparency obligations, and documentation standards. It provides compliance scoring and readiness checklists aligned with Regulation (EU) 2024/1689.

Is the Agent-Readiness Scanner free?

Yes, all features are free during the beta phase. You can run unlimited scans without signing up. Create a free account to save your scan history and track improvements over time.

What is the AgentLayers Trust Score for AI agents?

The Trust Score (0-100) evaluates AI agents across multiple dimensions: security, interoperability, documentation, and reliability. High-scoring agents earn a verified AgentLayers Certified badge and are listed in our curated agent directory.

Methodology Validation — Score Reproducibility

1. Why we publish this page

Measuring how a site is structured for AI agents is one thing. Showing that the resulting number tracks real-world authority is another. Our internal review found that the legacy 5-dimension score inflated younger sites with strong technical hygiene and under-scored institutional sites that LLMs actually cite. We added an Authority dimension (Wikipedia presence, domain age, archive footprint, Open PageRank, Tranco rank) at 25% of the composite to fix it.

This page is how we test it. We hand-classify 30 sites into three strata — A institutional, B mid-tier SaaS, C younger / local — and verify that the 6-dimension score moves each stratum in the direction we'd expect: A goes up, C goes down. That's construct validity. We also run an LLM citation probe as a secondary check, with the limitations honestly disclosed below.

2. Method

30 sites across 3 strata: A — institutional / high-citation (LinkedIn, Wikipedia, GitHub, …); B — mid-tier SaaS (Calendly, PostHog, Linear, …); C — younger or local sites with limited corpus presence.
5 standardised category-level prompts per site, fired against a single LLM (OpenAI GPT-4o). We discuss the single-judge limit explicitly in the disclosure below.
citationRate = (responses citing the domain) / (5 prompts) — a scalar in [0, 1].
Primary signal: per-stratum mean of v1 (legacy) and v2 (with Authority). The Authority dimension is validated when stratum A's mean v2 ≥ mean v1 and stratum C's mean v2 ≤ mean v1.
Secondary signal: Spearman rank correlation ρ(score, citationRate). Reported with bootstrap CI but interpreted with care — single-judge probes saturate at ceiling for sites the model already knows, especially in non-English markets.
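
The citationRate and Spearman steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the benchmark's actual code: the function names are hypothetical, the percentile bootstrap is an assumption (the page only says "bootstrap CI"), and the sample data is made up.

```python
import random

def citation_rate(cited_flags):
    """Fraction of prompts whose response cited the domain (a value in [0, 1])."""
    return sum(cited_flags) / len(cited_flags)

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (vx * vy) ** 0.5

def bootstrap_ci(xs, ys, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for rho, resampling sites with replacement (assumed method)."""
    rng = random.Random(seed)
    n = len(xs)
    rhos = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([xs[i] for i in idx], [ys[i] for i in idx])
        if rho == rho:  # drop NaN from degenerate resamples
            rhos.append(rho)
    rhos.sort()
    lo = rhos[int((alpha / 2) * len(rhos))]
    hi = rhos[int((1 - alpha / 2) * len(rhos)) - 1]
    return lo, hi

# Made-up example: six sites, one score and one citationRate each.
scores = [72, 65, 58, 50, 44, 39]
rates = [1.0, 0.8, 1.0, 0.6, 1.0, 0.4]
print(citation_rate([True, True, False, True, False]))  # 3 of 5 prompts cited
print(spearman(scores, rates), bootstrap_ci(scores, rates))
```

Averaging tied ranks matters here because citationRate takes few distinct values, so many sites share the same rank.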

Queries per site: 5 prompts × 1 model

Models tested: OpenAI GPT-4o

Validation criterion: MET (Authority direction matches strata)

3. Construct validity — score behaviour by stratum

If the Authority dimension does what we designed it to do, stratum A institutional sites should see their score lift (or stay flat) and stratum C smaller / younger sites should see it drop. The table below shows the mean v1 → v2 movement for each stratum, computed on the latest run (sentinel sites excluded).

Stratum	Sites	Mean v1	Mean v2	Δ (v2 − v1)	Mean Authority	Mean cite rate
A · Institutional	10	49.2	50.7	+1.5	66.9	94%
B · Mid-tier SaaS	10	57.0	51.2	-5.8	47.9	70%
C · Younger / local	9	54.7	45.1	-9.6	28.3	93%

↑ Authority lifted institutional sites by 1.5 points on average — exactly the direction the dimension is designed to push.

↓ Authority pulled younger / local sites down by 9.6 points on average — the inflation we were targeting.
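
As a quick arithmetic check, the validation criterion from the Method section (stratum A's mean v2 ≥ mean v1, stratum C's mean v2 ≤ mean v1) can be applied directly to the per-stratum means in the table above. The function and variable names below are illustrative only.

```python
# Per-stratum means copied from the table above: (mean v1, mean v2).
strata = {
    "A": (49.2, 50.7),
    "B": (57.0, 51.2),
    "C": (54.7, 45.1),
}

def deltas(means):
    """v2 - v1 per stratum, rounded to one decimal like the table."""
    return {name: round(v2 - v1, 1) for name, (v1, v2) in means.items()}

def criterion_met(means):
    """True when A does not go down and C does not go up."""
    a_v1, a_v2 = means["A"]
    c_v1, c_v2 = means["C"]
    return a_v2 >= a_v1 and c_v2 <= c_v1

print(deltas(strata))         # {'A': 1.5, 'B': -5.8, 'C': -9.6}
print(criterion_met(strata))  # True
```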

4. Secondary signal · LLM citation probe

ρ(v1, citation): -0.20 (legacy 5-dim score)

ρ(v2, citation): -0.02 (6-dim with Authority; 95% CI [-0.38, 0.34])

Phase 1 status: MET (Δ = ρ(v2) − ρ(v1) = 0.18; interpret with single-judge caveat)

Single-judge note: with one LLM and 5 prompts per site, citationRate has only 6 distinct levels (0/5 … 5/5), and the model saturates at 100% for sites it knows well. This is most visible for French brands, which are present in its training data even though the Authority signals (Wikipedia, PageRank, Tranco) skew anglophone. Spearman is reported for transparency but cannot independently validate the score with this sampling budget. A multi-judge Phase 2 (multiple LLMs + ranked-position metric) is the proper empirical test; we publish Phase 1 as it stands rather than wait.
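
To make the resolution limit concrete, the short sketch below enumerates the possible citationRate values for 5 prompts and measures how many sites sit at the ceiling. The rates in `example_rates` are hypothetical, not benchmark data, and `ceiling_share` is an illustrative helper.

```python
from fractions import Fraction

N_PROMPTS = 5

# Every value citationRate can take with 5 prompts: 0/5, 1/5, ..., 5/5.
levels = [Fraction(k, N_PROMPTS) for k in range(N_PROMPTS + 1)]
print(len(levels))  # 6

def ceiling_share(rates):
    """Share of sites whose citationRate equals the maximum value of 1."""
    return sum(1 for r in rates if r == 1) / len(rates)

# Hypothetical rates for eight sites (not benchmark data).
example_rates = [1, 1, 1, 0.8, 1, 1, 0.6, 1]
print(ceiling_share(example_rates))  # 0.75
print(len(set(example_rates)))       # 3 distinct values across 8 sites
```

When most sites tie at the ceiling, their ranks are identical, so a rank correlation has very little variance left to work with.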

5. What Phase 2 would change

Three changes would close the empirical gap left by Phase 1: (a) two additional LLM judges to break the single-model bias, (b) a ranked-position citationRate (where does the brand appear in a top-N list?) to recover variance lost to saturation, (c) doubling the dataset to 60 sites with more international stratum C entries. We're publishing Phase 1 because the construct-validity evidence above is independent of those changes — they only refine the secondary signal.
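
One way a ranked-position citationRate could be computed is a mean reciprocal rank over the prompts. This is only a sketch under assumptions: the page does not define the Phase 2 metric, and the function names, the case-insensitive match, and the sample responses are all hypothetical.

```python
def reciprocal_rank(ranked_brands, brand):
    """1/position of the brand in a top-N list, or 0.0 if it is absent."""
    for position, name in enumerate(ranked_brands, start=1):
        if name.lower() == brand.lower():
            return 1 / position
    return 0.0

def ranked_citation_score(responses, brand):
    """Mean reciprocal rank of the brand across all prompt responses."""
    return sum(reciprocal_rank(r, brand) for r in responses) / len(responses)

# Hypothetical top-N lists returned for three prompts.
responses = [
    ["Acme", "Beta"],           # Acme first  -> 1.0
    ["Beta", "Acme", "Gamma"],  # Acme second -> 0.5
    ["Gamma"],                  # Acme absent -> 0.0
]
print(ranked_citation_score(responses, "Acme"))  # 0.5
```

Unlike a binary cited / not cited flag, a position-weighted score can still separate two brands that are both mentioned in every response.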

Phase 1 run updated 2026-05-14. We re-run the benchmark whenever the scoring formula changes materially; results are versioned and the historical record stays available on request.
