Abstract
As AI agents become critical infrastructure in business and consumer applications, the need for a standardized, reproducible, and transparent evaluation framework becomes paramount. This paper presents the dual-scoring methodology developed by AgentLayers: the Agent-Readiness Score, which measures a website's capacity to be discovered, understood, and recommended by AI agents (score 0–100); and the Trust Score, which evaluates the reliability, transparency, security, and compliance of AI agents through fully automated, timestamped, and publicly auditable tests. We detail each criterion, its weighting, the testing protocol, and the technical implementation. All tests are designed to be reproducible and independent of any commercial relationship with the evaluated entities.
1 Introduction
The rapid proliferation of AI agents — from autonomous assistants to specialized task executors — has created a trust deficit in the ecosystem. Businesses deploying agents need assurance of reliability and compliance; end-users need guarantees of safety and transparency; and businesses wanting to be discovered by agents need to adapt their digital presence.
AgentLayers addresses this gap with two complementary evaluation instruments. The Agent-Readiness Score targets businesses seeking to optimize their web presence for the agentic economy. The Trust Score targets AI agents themselves, providing an objective, automated assessment of their operational quality. Additionally, the Skill Trust Score evaluates third-party skills and plugins before they are installed on AI agents, detecting prompt injection, obfuscation, excessive permissions, and supply chain risks. The MCP Server Trust Score evaluates Model Context Protocol server configurations for endpoint security, permission scope, data exfiltration, and auth weaknesses. The A2A Protocol Trust Score assesses Agent-to-Agent protocol implementations for authentication, message signing, delegation control, and identity verification.
2 Agent-Readiness Score (Businesses)
The Agent-Readiness Score measures a website's capacity to be discovered, understood, and recommended by AI agents. The score ranges from 0 to 100 and is computed automatically by crawling the site and analyzing its structure, content, and metadata.
| Criterion | Weight | Description |
|---|---|---|
| Structured Data | 20% | JSON-LD, Open Graph, Microdata, RDFa detection and richness |
| LLM Readability | 15% | Ability for LLMs to comprehend positioning, offer and value proposition |
| Technical Accessibility | 10% | Crawlability, response time, SSL, robots.txt, sitemap |
| Agentic SEO | 15% | Likelihood of agent recommendation over competitors |
| Protocol Discovery | 15% | Well-known endpoints for MCP, OAuth, A2A, API Catalog, Agent Skills (Cloudflare-aligned) |
| Authority | 25% | External credibility — Wikipedia presence, domain age, Wayback footprint, Open PageRank, Tranco rank |
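The weighting in the table can be illustrated with a short sketch. It assumes each criterion is first scored on a 0–100 scale and then combined as a weighted sum; the dictionary keys and the function name `agent_readiness_score` are hypothetical, and only the weights come from the table.

```python
# Weights copied from the criteria table. The key names, the function name and
# the per-criterion 0-100 input scale are assumptions made for illustration.
WEIGHTS = {
    "structured_data": 0.20,
    "llm_readability": 0.15,
    "technical_accessibility": 0.10,
    "agentic_seo": 0.15,
    "protocol_discovery": 0.15,
    "authority": 0.25,
}


def agent_readiness_score(sub_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (each 0-100) into a single 0-100 total."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # A missing criterion counts as 0; out-of-range values are clamped.
        value = max(0.0, min(100.0, sub_scores.get(name, 0.0)))
        total += weight * value
    return round(total, 1)
```

Because the weights sum to 1, a site scoring 100 on every criterion reaches exactly 100, and each criterion can contribute at most its own weight in points.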
2.1 Structured Data (20%)
Structured data constitutes the primary signal analyzed by AI agents. It transforms human-readable content into machine-readable information. Our crawler (built on Cheerio) parses the HTML document and extracts all <script type="application/ld+json"> blocks, itemscope/itemtype attributes (Microdata), typeof/property attributes (RDFa), and og:* meta tags.
Each JSON-LD block is parsed and its @type field is extracted. Types are compared against a curated list of high-value schema types (Organization, Product, FAQ, LocalBusiness, etc.). Each sub-criterion is scored on a 0–100 scale, and the section's contribution to the overall score is capped at its weight.
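The JSON-LD part of this extraction can be sketched as follows. The production crawler is built on Cheerio; this sketch swaps in Python's standard-library `html.parser` purely for illustration. The `HIGH_VALUE_TYPES` set is a hypothetical subset of the curated list (with `FAQPage` as the schema.org type name for FAQ content), and the helper names are not taken from the AgentLayers codebase.

```python
import json
from html.parser import HTMLParser

# Hypothetical subset of the curated list; the real list is not published here.
HIGH_VALUE_TYPES = {"Organization", "Product", "FAQPage", "LocalBusiness"}


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Malformed JSON-LD is skipped rather than treated as fatal.


def collect_types(node) -> set:
    """Recursively gather every @type value in a parsed JSON-LD structure."""
    types = set()
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            types.add(declared)
        elif isinstance(declared, list):
            types.update(t for t in declared if isinstance(t, str))
        for value in node.values():
            types |= collect_types(value)
    elif isinstance(node, list):
        for item in node:
            types |= collect_types(item)
    return types


def high_value_types(html: str) -> set:
    """Return the high-value schema types declared in a page's JSON-LD."""
    parser = JsonLdExtractor()
    parser.feed(html)
    found = set()
    for block in parser.blocks:
        found |= collect_types(block)
    return found & HIGH_VALUE_TYPES
```

Microdata, RDFa, and Open Graph detection would follow the same pattern by inspecting `itemtype`, `typeof`, and `og:*` attributes in `handle_starttag`.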
2.2 LLM Readability (15%)
This criterion assesses whether an LLM can comprehend the positioning, offer, and value proposition of a business by reading the page content. We evaluate clarity of value proposition, content structure and hierarchy, presence of pricing information, and use of natural language descriptions optimized for AI comprehension. The Live Test feature — shipped in production on PRO plans — probes real LLMs with brand-aware queries that include the domain name (e.g. "What do you know about Acme (acme.com)?") to disambiguate same-name brands and improve recall accuracy.
2.3 Technical Accessibility (10%)
This criterion evaluates whether an AI agent can technically access the site's content without friction. Sub-criteria include: response time (<2s), SSL/TLS configuration, robots.txt accessibility, sitemap presence, proper HTTP status codes, and absence of aggressive anti-bot measures that would block legitimate AI crawlers.
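One of these sub-criteria, whether robots.txt blocks AI crawlers, can be checked offline with Python's standard `urllib.robotparser`. The sketch below is illustrative only: the crawler tokens in `AI_CRAWLERS` are examples of publicly documented user agents, not the list AgentLayers actually tests, and the function name is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example user-agent tokens; the set of crawlers actually checked is an assumption.
AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def blocked_ai_crawlers(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the AI crawlers that the given robots.txt content forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_CRAWLERS if not parser.can_fetch(agent, url)]
```

For example, a file containing `User-agent: GPTBot` followed by `Disallow: /` would report `GPTBot` as blocked while leaving the other crawlers allowed.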
2.4 Agentic SEO (15%)
This is the most strategic criterion: it measures whether an AI agent would choose to recommend this business over a competitor. Factors include domain authority signals, citation frequency in AI training data, and semantic relevance of content to probable agent queries.
2.5 Protocol Discovery (15%)
Beyond content and metadata, agents need machine-readable entry points: which APIs to call, where to authenticate, what skills are exposed. Protocol Discovery probes six well-known endpoints aligned with the IETF / OpenAPI / MCP / A2A specifications and Cloudflare's agent-readiness category. We validate the RFC-mandatory fields and ping the endpoints advertised in the metadata — a 200 response with valid-shape JSON is not enough. WebMCP is shown as a 7th informational row but is not counted in the score (it requires a headless browser to verify).
- MCP Server Card describing tools and capabilities (MCP SEP-1649).
- OAuth 2.0 Authorization Server metadata (RFC 8414). OIDC discovery accepted as fallback.
- OAuth 2.0 Protected Resource metadata (RFC 9728).
- API Catalog linkset pointing at OpenAPI / AsyncAPI service descriptions (RFC 9727).
- Google Agent2Agent (A2A) agent card.
- Agent Skills index (Cloudflare proposal).
- WebMCP runtime registration of browser-callable tools (best-effort; full check requires a headless browser).
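As an illustration of field validation for one of these endpoints, the sketch below checks an OAuth 2.0 Authorization Server metadata document. RFC 8414 unconditionally requires `issuer` and `response_types_supported`, and requires the returned `issuer` to match the issuer used to build the well-known URL. Everything else here is an assumption: the function name, the returned dictionary shape, and the rule for choosing which advertised endpoints to ping. The network request itself is left out.

```python
# Fields RFC 8414 marks as REQUIRED unconditionally. Conditionally required
# fields (authorization_endpoint, token_endpoint) are not enforced in this sketch.
REQUIRED_FIELDS = ("issuer", "response_types_supported")


def check_as_metadata(metadata: dict, expected_issuer: str) -> dict:
    """Validate a parsed metadata document and list the endpoints it advertises."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in metadata
    ]
    issuer = metadata.get("issuer")
    if issuer is not None and issuer != expected_issuer:
        problems.append("issuer does not match the origin that was queried")
    # Assumption: every string value under a key ending in "_endpoint" gets pinged.
    to_ping = sorted(
        value
        for key, value in metadata.items()
        if key.endswith("_endpoint") and isinstance(value, str)
    )
    return {"problems": problems, "endpoints_to_ping": to_ping}
```

A document that returns HTTP 200 with well-formed JSON but omits `response_types_supported` would therefore still be reported as a problem rather than counted as a pass.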
2.6 Authority (25%)
The first five dimensions measure how well a site is structured for agents — they tell you whether your content can be extracted, parsed, and acted on. They do not tell you whether an LLM will actually cite you. That signal lives outside the page: in Wikipedia, in the link graph, in the Wayback footprint, in the years your domain has been online.
Authority is the dimension that closes the validity gap. It composites five free, independently verifiable sub-signals — none of which can be gamed in a sprint:
- Open PageRank (30%): a Domain Authority-style 0–10 score derived from the public web graph (free tier, 1000 lookups / month).
- Wikipedia presence (20%): whether a Wikipedia article exists for the domain, and in how many languages. A site with an article in 30 languages is very likely to be well represented in the training data of major LLMs.
- Tranco rank (20%): a research-grade top-1M domain ranking with log-scale normalisation (rank 1 → 100, rank 1M → 0).
- Wayback Machine first seen (15%): years of continuity archived by archive.org. Sites crawled since the 1990s are likely to recur across many training corpora.
- Domain age (15%): registration date via RDAP. This acts as a floor that prevents brand-new domains from impersonating established players.
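The composite can be sketched as a weighted sum of the five sub-signals. The weights and the Tranco anchor points (rank 1 → 100, rank 1M → 0) come from the list above; the specific log10 interpolation between those anchors, the treatment of unranked domains, and the assumption that every sub-signal arrives already normalised to 0–100 are illustrative choices, not the documented implementation.

```python
import math

# Weights taken from the list above; key names are hypothetical.
AUTHORITY_WEIGHTS = {
    "open_pagerank": 0.30,
    "wikipedia": 0.20,
    "tranco": 0.20,
    "wayback": 0.15,
    "domain_age": 0.15,
}


def tranco_score(rank) -> float:
    """Log-scale normalisation: rank 1 -> 100, rank 1,000,000 or worse -> 0.

    Assumption: unranked domains (rank is None) score 0, and the curve is a
    straight line in log10(rank) between the two anchor points.
    """
    if rank is None or rank >= 1_000_000:
        return 0.0
    rank = max(1, rank)
    return 100.0 * (1.0 - math.log10(rank) / 6.0)


def authority_score(signals: dict[str, float]) -> float:
    """Weighted sum of sub-signals, each assumed to be pre-normalised to 0-100."""
    return sum(
        weight * max(0.0, min(100.0, signals.get(name, 0.0)))
        for name, weight in AUTHORITY_WEIGHTS.items()
    )
```

Under this interpolation a domain ranked 1,000 sits halfway along the log scale, so it receives a Tranco sub-score of 50 and contributes 10 points to the Authority composite.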
2.7 LLM Tests: Knowledge Recall + Live Discovery (PRO)
The Agent-Readiness Score is not a single signal. AgentLayers runs two complementary live LLM tests on every PRO scan, because they answer different questions. Reading them together is what produces an honest verdict — neither test alone is sufficient.
2.7.1 Knowledge Recall (offline, training-data only)
The Recall test queries a chat model with no web access (default: gpt-4o-mini). It asks brand-aware questions like "What do you know about Acme (acme.com)?" along with category and location prompts derived from your site's structured data. It measures whether the model already knows your brand from its training data.
Detection is mention-based and negation-aware: brand echoes wrapped in phrases like "I don't know" or "I'm not familiar with" are not counted as a mention. Each prompt is classified Direct, Likely, or Absent. This signal is most meaningful for established brands with significant pre-existing web presence at the model's training cutoff.
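A minimal sketch of negation-aware mention detection follows. It assumes sentence-level matching and a small hand-written phrase list seeded with the two examples above; the production detector may use a different granularity and a longer list, and the Direct / Likely / Absent classification is not reproduced here because its rules are not specified.

```python
import re

# The first and third phrases come from the text; the expanded variants are assumptions.
NEGATION_PHRASES = (
    "i don't know",
    "i do not know",
    "i'm not familiar with",
    "i am not familiar with",
)


def counts_as_mention(answer: str, brand: str) -> bool:
    """True if the brand appears in at least one sentence that is not a disclaimer."""
    # Split on sentence-ending punctuation followed by whitespace, so that
    # dots inside domain names such as "acme.com" do not break a sentence.
    sentences = re.split(r"(?<=[.!?])\s+", answer)
    for sentence in sentences:
        lowered = sentence.lower().replace("\u2019", "'")  # normalise curly apostrophes
        if brand.lower() not in lowered:
            continue
        if any(phrase in lowered for phrase in NEGATION_PHRASES):
            continue  # Brand echoed inside an "I don't know"-style sentence.
        return True
    return False
```

With this rule, an answer such as "I'm not familiar with Acme (acme.com)." does not count, while "Acme is a supplier of anvils." does.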
A low Recall score for a recent or low-traffic site is expected and not a failure — it simply means the model hasn't memorized you yet. Read the Discovery test below for the live signal.
2.7.2 Live Discovery (web-grounded)
The Discovery test queries a search-enabled model (default: gpt-4o-mini-search-preview) with category-level prompts that deliberately do not name your brand. It measures whether AI agents organically surface your site when a real user asks a realistic question today, using live web search results.
A prompt counts as a hit if either (a) your brand name appears in the answer or (b) at least one citation URL points to your own domain. Each result includes the cited sources so you can audit which pages the model actually fetched. This is the signal that matches what users see when they prompt ChatGPT, Perplexity, or Claude with web search enabled.
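The hit rule above can be expressed in a few lines. This sketch assumes case-insensitive brand matching and treats subdomains of the site's domain as "your own domain"; both choices, along with the function and parameter names, are assumptions rather than documented behaviour.

```python
from urllib.parse import urlparse


def is_discovery_hit(answer: str, citations: list[str], brand: str, domain: str) -> bool:
    """Hit if (a) the brand is named in the answer or (b) a citation points at the domain."""
    if brand.lower() in answer.lower():
        return True
    domain = domain.lower().removeprefix("www.")
    for url in citations:
        host = (urlparse(url).hostname or "").lower()
        # Exact match or subdomain match; "notacme.com" must not match "acme.com".
        if host == domain or host.endswith("." + domain):
            return True
    return False
```

Comparing parsed hostnames rather than raw substrings avoids counting look-alike domains or URLs that merely contain the domain in a query string.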
3 Foundational Principles
Reproducibility
Every test can be re-run by any party and will produce the same score (within stochastic LLM variance). The methodology is documented publicly.
Temporal Evolution
Scores are not static. Each test is timestamped and stored. Users see score evolution curves over time. Degrading agents are flagged; improving agents gain visibility.
Methodology Transparency
The complete methodology is published in open access. The competitive advantage is not the methodology — it's execution, accumulated data, and network effects.
Independence
AgentLayers does not sell optimization services to the agents it rates. The business model (subscriptions, premium listings, API) creates no conflict of interest with scoring.
This document constitutes the public technical reference for AgentLayers's scoring algorithms. It is updated as the product evolves and serves as the basis for all evaluation processes.
© 2026 AgentLayers Research — Open methodology, v1.0 — March 2026