The problem with AI ratings
Ask any AI to rate a landing page on a scale of 1 to 10. Do it five times without changing the page. The scores will differ by 2 to 3 points.
This variance is not a bug in any specific tool. Direct numerical rating is structurally unreliable when performed by language models. The anchor for what "7" means shifts depending on context length, prompt phrasing, temperature, and model version. Maier et al. measured the correlation between direct AI numerical ratings and human expert judgment across systematic trials. The result: ρ=0.26 to 0.39.
That means a direct AI score shares little signal with a CRO specialist's assessment: a correlation of 0.26 to 0.39 means the rating explains roughly 7 to 15 percent of the variance in expert judgment. The rest is noise. Any tool that asks an AI "rate this page 1-10" and reports the result as a score is operating at this level of reliability, whether the tool acknowledges it or not.
SSR: Semantic Similarity Rating
SSR replaces absolute judgment with comparative judgment. Instead of asking the AI "rate this page 1-10," BuyerEyes asks the AI to describe the page in natural language: what works, what fails, what a visitor would experience.
That description is then compared against 150 calibrated anchor statements using cosine similarity. The anchors function as benchmarks: "A page at this level of CTA performance looks like this," expressed in plain language. Six independent sets of anchors are compared and averaged. The score is derived from where the AI's description lands relative to those benchmarks.
The result: ρ=0.90 correlation with human expert judgment. That is a 3.5x improvement over direct rating, and it crosses the threshold where scores become actionable rather than decorative.
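The mechanism can be illustrated with a minimal sketch. This is not BuyerEyes internals: the function names, the softmax temperature, and the similarity-weighted averaging are assumptions chosen to show how a description's embedding can be mapped onto anchored scores.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ssr_score(desc_vec, anchors, temperature=0.1):
    """anchors: list of (anchor_score, anchor_embedding) pairs.
    A softmax over cosine similarities weights each anchor's score,
    so the result lands between the benchmarks the description
    most resembles."""
    sims = np.array([cosine(desc_vec, vec) for _, vec in anchors])
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    scores = np.array([score for score, _ in anchors])
    return float(weights @ scores)
```

In the full pipeline, a step like this would run once per anchor set and average the six results; a single set is enough to show the mapping from description to score.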
The anchors were developed and calibrated for e-commerce and SaaS conversion contexts. "Social proof at this level" has a specific definition: how many reviews, how visible, whether they address the objections a first-time buyer has. These are not generic quality descriptions.
Bias correction
SSR alone is not sufficient. Language models carry systematic biases that distort scores in predictable ways.
Position bias: elements near the top of a page tend to score higher than equivalent elements near the bottom. Length bias: detailed sections score higher than concise ones, independent of quality. Embedding anisotropy: the vector space used for semantic comparison is not uniformly distributed, making similarity measurements more reliable in some regions than others.
BuyerEyes corrects for all three. Evaluation order is randomized across dimensions. Copy scoring is normalized for content length. Embedding vectors are mean-centered before cosine similarity computation (following Ethayarajh, arXiv:2403.05440). The corrections are invisible in the output. Without them, a page with a long trust section near the top would systematically outscore an equivalent page with a shorter trust section near the bottom.
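The anisotropy correction is the least intuitive of the three, so here is a sketch of the general technique: raw sentence embeddings cluster in a narrow cone of the vector space, which inflates cosine similarity for almost any pair, and subtracting the corpus mean removes that shared offset. The function name and two-dimensional toy vectors are illustrative only.

```python
import numpy as np

def centered_cosine(a, b, corpus_mean):
    """Cosine similarity after subtracting the corpus mean embedding.
    Without centering, the large shared component dominates the dot
    product and nearly everything looks similar to everything else."""
    a_c, b_c = a - corpus_mean, b - corpus_mean
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))
```

With two vectors that share a large common offset, raw cosine similarity is near 1.0 while the centered similarity exposes that their distinctive components point in opposite directions.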
Multi-agent debate
A single AI evaluating a page confirms its own assumptions. BuyerEyes runs 14 specialized agents in parallel. Five domain agents score different conversion dimensions. Up to 10 buyer personas evaluate the page from distinct perspectives. An audience discovery agent selects which personas to deploy. An alignment agent checks whether the page matches the traffic driving visitors there.
When the domain agents disagree (standard deviation above 1.5 or score spread above 3.0), BuyerEyes triggers structured debate rounds. Each agent reviews the others' scores and reasoning. They revise or defend their positions with explicit justification. Up to three rounds, with convergence detection that stops the debate when positions stabilize.
After debate, an adversarial review layer runs. Every score above 7.0 faces a Devil's Advocate challenge: what weaknesses are being overlooked? Every score below 4.0 faces a Defender review: what strengths are being undervalued? Adjustments are capped at plus or minus 1.0 to prevent adversarial roles from dominating the final result.
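The trigger and cap logic described above is simple enough to sketch directly. The thresholds come from the text; the function and constant names are hypothetical.

```python
import statistics

DEBATE_STDEV = 1.5      # standard deviation threshold for triggering debate
DEBATE_SPREAD = 3.0     # max-minus-min threshold for triggering debate
ADJUSTMENT_CAP = 1.0    # cap on adversarial review adjustments

def needs_debate(domain_scores):
    """Trigger structured debate rounds when domain agents disagree."""
    return (statistics.stdev(domain_scores) > DEBATE_STDEV
            or max(domain_scores) - min(domain_scores) > DEBATE_SPREAD)

def apply_adversarial_adjustment(score, delta):
    """Clamp Devil's Advocate / Defender adjustments to +/-1.0 so the
    adversarial roles cannot dominate the final result."""
    return score + max(-ADJUSTMENT_CAP, min(ADJUSTMENT_CAP, delta))
```

Five tightly clustered domain scores pass through without debate; a spread of 4 points trips both conditions.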
The debate mechanism is grounded in two research lines: Hu et al. "Multi-Agent Debate for LLM Judges with Adaptive Stability Detection" (NeurIPS 2025, arXiv:2510.12697) and Du et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate" (ICML 2024, arXiv:2305.14325).
29 sub-scores across 6 dimensions
A score of "Visual: 6.5" does not tell your developer what to fix. BuyerEyes decomposes each of the six evaluation dimensions into atomic sub-scores, 29 in total. Each sub-score has its own number, its own rubric, and its own actionable recommendation.
Visual design: contrast, hierarchy, whitespace, mobile layout, CTA prominence, image quality, brand consistency, above-fold composition.

Copy and messaging: first impression, value proposition clarity, benefit specificity, urgency, readability, persuasion framework coverage.

CTA effectiveness: visibility, copy strength, placement, urgency, friction reduction.

Trust and credibility: social proof specificity, pricing transparency, authority signals, review authenticity, dark pattern detection.

Technical experience: load performance, layout stability, mobile usability, form quality, accessibility compliance.

Purchase intent (persona simulation): value equation, risk perception, social validation needs, commitment readiness, objection resolution.
Each recommendation in the report carries an effort tag (low, medium, high) and an impact estimate. "CTA Prominence: 4.2. Move primary CTA above fold on mobile. High impact, low effort." That is a ready-made ticket for your developer, not a suggestion to "improve your CTA."
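The effort and impact tags imply a natural prioritization: highest impact first, lowest effort as the tiebreaker, so quick wins surface at the top of the report. A sketch with hypothetical field names:

```python
from dataclasses import dataclass

IMPACT = {"low": 0, "medium": 1, "high": 2}
EFFORT = {"low": 0, "medium": 1, "high": 2}

@dataclass
class Recommendation:
    sub_score: str      # e.g. "CTA Prominence"
    score: float        # e.g. 4.2
    fix: str            # e.g. "Move primary CTA above fold on mobile"
    impact: str         # low / medium / high
    effort: str         # low / medium / high

def prioritize(recs):
    """Sort so quick wins surface first: highest impact,
    then lowest effort as the tiebreaker."""
    return sorted(recs, key=lambda r: (-IMPACT[r.impact], EFFORT[r.effort]))
```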
Saliency heatmaps
Every report includes a visual attention heatmap generated from a single screenshot. No traffic required. No tracking code. No panel recruitment.
The heatmap uses TranSalNet, a transformer-based visual saliency model. It was validated against real eye-tracking data on 640 web pages, achieving CC=0.78 correlation with ground truth. Processing time: approximately 50 milliseconds per screenshot.
The heatmap answers one question: where does visual attention go on this page? That prediction feeds into the sub-score system. CTA Prominence is scored partly based on whether the CTA sits above the predicted attention threshold. If the heatmap shows the attention dropoff at position Y=400 on mobile and your CTA sits at Y=720, the report flags it with a specific score and a specific fix.
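A dropoff check like the one described can be sketched from a saliency map alone. The 80 percent mass cutoff and the function names here are illustrative assumptions, not the actual threshold BuyerEyes uses.

```python
import numpy as np

def attention_dropoff_y(saliency, mass=0.8):
    """Return the row below which only (1 - mass) of the predicted
    attention remains. `saliency` is a 2D map (rows = vertical pixels),
    such as a saliency model's output resized to the screenshot."""
    row_mass = saliency.sum(axis=1)
    cumulative = np.cumsum(row_mass) / row_mass.sum()
    return int(np.searchsorted(cumulative, mass))

def cta_flagged(cta_y, saliency, mass=0.8):
    """Flag a CTA that sits below the predicted attention dropoff."""
    return cta_y > attention_dropoff_y(saliency, mass)
```

With a synthetic map whose attention all falls in the top half of an 800-pixel-tall page, a CTA at Y=720 is flagged and one at Y=100 is not.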
Confidence intervals
Every score in a BuyerEyes report comes with a confidence range. Not "Copy: 7.1" but "Copy: 7.1 [6.8 - 7.4, high stability]." The range comes from multi-trial scoring variance. When the agents converge tightly, the range is narrow and you can act on the number directly. When the range is wide, the score is a starting point that warrants investigation.
When evidence is insufficient to produce a reliable score, the report says "Insufficient data" instead of forcing a low number. That distinction matters. A low score means something is wrong. Insufficient data means the system cannot determine whether something is wrong. Those situations call for different responses.
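Both behaviors, the interval and the refusal to force a number, can be sketched together. The three-trial minimum and the 1.96 multiplier are illustrative assumptions; the source only states that the range comes from multi-trial scoring variance.

```python
import statistics

MIN_TRIALS = 3  # assumed floor below which variance cannot be estimated

def score_with_interval(trials, k=1.96):
    """Mean score plus a normal-approximation interval from repeated
    scoring trials. Returns None for the 'insufficient data' case
    rather than forcing an unreliable number."""
    if len(trials) < MIN_TRIALS:
        return None
    mean = statistics.fmean(trials)
    se = statistics.stdev(trials) / len(trials) ** 0.5
    return mean, (mean - k * se, mean + k * se)
```

Tightly converged trials produce a narrow range you can act on directly; too few trials produce no score at all.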
Validation
The scoring system has passed 1,294 tests across the production pipeline. The methodology draws on 30+ peer-reviewed papers across visual attention, persuasion science, cognitive load, trust calibration, and LLM evaluation reliability. The SCIENCE.md document in the BuyerEyes codebase tracks every paper with its implementation status and the specific files where its methods are applied.
Anchor statements are calibrated, not prompt-engineered. Changing them changes the entire scoring system. They were developed through iterative validation against CRO specialist assessments on e-commerce and SaaS pages across multiple verticals.
See it in action
29 sub-scores. Confidence intervals. Saliency heatmap. Prioritized recommendations. Report in 24-48 hours.
Get Your Report