How BuyerEyes scores are made

The scoring methodology, the agents, the validation pipeline. Everything that happens between your URL and your report.

ρ=0.90 correlation with human expert judgment
14 specialized agents per audit
29 atomic sub-scores per page
1,294 validation tests in production

The problem with AI ratings

Ask any AI to rate a landing page on a scale of 1 to 10. Do it five times without changing the page. The scores will differ by 2 to 3 points.

This variance is not a bug in any specific tool. Direct numerical rating is structurally unreliable when performed by language models. The anchor for what "7" means shifts depending on context length, prompt phrasing, temperature, and model version. Maier et al. measured the correlation between direct AI numerical ratings and human expert judgment across systematic trials. The result: ρ=0.26 to 0.39.

At ρ=0.26 to 0.39, a direct AI score explains only about 7 to 15 percent of the variance in a CRO specialist's assessment. The rest is noise. Any tool that asks an AI "rate this page 1-10" and reports the result as a score is operating at this level of reliability, whether the tool acknowledges it or not.

Maier, S. et al. "Semantic Similarity Rating for Likert-Scale Evaluation with LLMs." arXiv:2510.08338v3, October 2025. The paper that established SSR as a calibrated alternative to direct AI rating.

SSR: Semantic Similarity Rating

SSR replaces absolute judgment with comparative judgment. Instead of asking the AI "rate this page 1-10," BuyerEyes asks the AI to describe the page in natural language: what works, what fails, what a visitor would experience.

That description is then compared against 150 calibrated anchor statements using cosine similarity. The anchors function as benchmarks: "A page at this level of CTA performance looks like this," expressed in plain language. Six independent sets of anchors are compared and averaged. The score is derived from where the AI's description lands relative to those benchmarks.
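A minimal sketch of that comparison step, assuming the description and anchor embeddings are already computed. The similarity-weighted averaging, the sharpening temperature, and every name here are illustrative assumptions, not BuyerEyes internals:

```python
import numpy as np

def ssr_score(description_vec: np.ndarray, anchor_sets: list) -> float:
    """Derive a score by comparing a page description against calibrated
    anchor statements with cosine similarity (illustrative sketch).

    anchor_sets: list of independent anchor sets; each set is a list of
    (anchor_vec, anchor_score) pairs spanning the scale.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    set_scores = []
    for anchors in anchor_sets:
        # Similarity-weighted average of the anchor scores: the
        # description "lands" nearest the anchors it most resembles.
        sims = np.array([cosine(description_vec, vec) for vec, _ in anchors])
        weights = np.exp(sims / 0.1)   # sharpen toward the best matches
        weights /= weights.sum()
        scores = np.array([score for _, score in anchors])
        set_scores.append(float(weights @ scores))

    # Independent anchor sets are averaged, as described above.
    return float(np.mean(set_scores))
```

Sharpening the similarity weights keeps the derived score close to the best-matching anchors rather than smearing it across the whole scale.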

The result: ρ=0.90 correlation with human expert judgment. That is up to a 3.5x improvement over the ρ=0.26 to 0.39 direct-rating baseline, and it crosses the threshold where scores become actionable rather than decorative.

The anchors were developed and calibrated for e-commerce and SaaS conversion contexts. "Social proof at this level" has a specific definition: how many reviews, how visible, whether they address the objections a first-time buyer has. These are not generic quality descriptions.

Bias correction

SSR alone is not sufficient. Language models carry systematic biases that distort scores in predictable ways.

Position bias: elements near the top of a page tend to score higher than equivalent elements near the bottom. Length bias: detailed sections score higher than concise ones, independent of quality. Embedding anisotropy: the vector space used for semantic comparison is not uniformly distributed, making similarity measurements more reliable in some regions than others.

BuyerEyes corrects for all three. Evaluation order is randomized across dimensions. Copy scoring is normalized for content length. Embedding vectors are mean-centered before cosine similarity computation (following Ethayarajh, arXiv:2403.05440). The corrections are invisible in the output. Without them, a page with a long trust section near the top would systematically outscore an equivalent page with a shorter trust section near the bottom.
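The anisotropy correction reduces to mean-centering before the cosine computation. A sketch, where the corpus mean vector is an assumed input rather than a documented BuyerEyes artifact:

```python
import numpy as np

def centered_cosine(a: np.ndarray, b: np.ndarray, mean_vec: np.ndarray) -> float:
    """Cosine similarity after mean-centering, a standard correction for
    embedding anisotropy: subtracting the corpus mean spreads vectors more
    uniformly, so similarities are comparable across regions of the space."""
    a_c, b_c = a - mean_vec, b - mean_vec
    return float(np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))
```

When all embeddings share a large common offset, raw cosine similarity is inflated toward 1.0 for every pair; centering removes that shared component so only the distinctive directions are compared.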

Multi-agent debate

A single AI evaluating a page confirms its own assumptions. BuyerEyes runs 14 specialized agents in parallel. Five domain agents score different conversion dimensions. Up to 10 buyer personas evaluate the page from distinct perspectives. An audience discovery agent selects which personas to deploy. An alignment agent checks whether the page matches the traffic driving visitors there.

When the domain agents disagree (standard deviation above 1.5 or score spread above 3.0), BuyerEyes triggers structured debate rounds. Each agent reviews the others' scores and reasoning. They revise or defend their positions with explicit justification. Up to three rounds, with convergence detection that stops the debate when positions stabilize.
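The disagreement trigger reduces to a two-condition check using the thresholds above; the function name is illustrative:

```python
import statistics

def needs_debate(scores, std_threshold=1.5, spread_threshold=3.0):
    """Trigger structured debate when domain agents disagree: standard
    deviation above 1.5 or max-min spread above 3.0 (thresholds from
    the text above)."""
    return (statistics.stdev(scores) > std_threshold
            or max(scores) - min(scores) > spread_threshold)
```

A tight cluster of scores skips debate entirely; a single outlier agent is enough to trip either condition.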

After debate, an adversarial review layer runs. Every score above 7.0 faces a Devil's Advocate challenge: what weaknesses are being overlooked? Every score below 4.0 faces a Defender review: what strengths are being undervalued? Adjustments are capped at plus or minus 1.0 to prevent adversarial roles from dominating the final result.
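A sketch of the capped adversarial adjustment, using the 7.0/4.0 bands and the ±1.0 clamp from the text; the function and parameter names are assumptions:

```python
def apply_adversarial_review(score: float, proposed_adjustment: float) -> float:
    """Scores above 7.0 face a Devil's Advocate challenge, scores below
    4.0 a Defender review; either way the resulting adjustment is clamped
    to +/-1.0 so the adversarial role cannot dominate the final result."""
    if score > 7.0 or score < 4.0:
        adjustment = max(-1.0, min(1.0, proposed_adjustment))
        return max(0.0, min(10.0, score + adjustment))
    return score  # mid-band scores pass through untouched
```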

The debate mechanism is grounded in two research lines: Hu et al. "Multi-Agent Debate for LLM Judges with Adaptive Stability Detection" (NeurIPS 2025, arXiv:2510.12697) and Du et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate" (ICML 2024, arXiv:2305.14325).

Hu, T. et al. "Multi-Agent Debate for LLM Judges with Adaptive Stability Detection." arXiv:2510.12697, October 2025. NeurIPS 2025. The adaptive convergence mechanism used in BuyerEyes debate rounds.

29 sub-scores across 6 dimensions

A score of "Visual: 6.5" does not tell your developer what to fix. BuyerEyes decomposes each of the six evaluation dimensions into atomic sub-scores, 29 in total. Each sub-score has its own number, its own rubric, and its own actionable recommendation.

Visual design (8 sub-scores): contrast, hierarchy, whitespace, mobile layout, CTA prominence, image quality, brand consistency, above-fold composition.

Copy and messaging (6 sub-scores): first impression, value proposition clarity, benefit specificity, urgency, readability, persuasion framework coverage.

CTA effectiveness (5 sub-scores): visibility, copy strength, placement, urgency, friction reduction.

Trust and credibility (5 sub-scores): social proof specificity, pricing transparency, authority signals, review authenticity, dark pattern detection.

Technical experience (5 sub-scores): load performance, layout stability, mobile usability, form quality, accessibility compliance.

Purchase intent, persona simulation (5 sub-scores via SSR): value equation, risk perception, social validation needs, commitment readiness, objection resolution.

Each recommendation in the report carries an effort tag (low, medium, high) and an impact estimate. "CTA Prominence: 4.2. Move primary CTA above fold on mobile. High impact, low effort." That is a ready-made ticket for your developer, not a suggestion to "improve your CTA."
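The ticket format can be sketched as a small record; the field names are illustrative, not the actual report schema:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    """One actionable line item from a report: a sub-score, its value,
    a concrete fix, and the effort/impact tags described above."""
    sub_score: str
    value: float
    fix: str
    impact: str  # "low" | "medium" | "high"
    effort: str  # "low" | "medium" | "high"

# The worked example from the text, as a ticket:
rec = Recommendation("CTA Prominence", 4.2,
                     "Move primary CTA above fold on mobile",
                     impact="high", effort="low")
```

Sorting tickets by high impact and low effort gives a developer a ready-made priority order.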

Saliency heatmaps

Every report includes a visual attention heatmap generated from a single screenshot. No traffic required. No tracking code. No panel recruitment.

The heatmap uses TranSalNet, a transformer-based visual saliency model. It was validated against real eye-tracking data on 640 web pages, achieving CC=0.78 correlation with ground truth. Processing time: approximately 50 milliseconds per screenshot.

The heatmap answers one question: where does visual attention go on this page? That prediction feeds into the sub-score system. CTA Prominence is scored partly based on whether the CTA sits above the predicted attention threshold. If the heatmap shows the attention dropoff at position Y=400 on mobile and your CTA sits at Y=720, the report flags it with a specific score and a specific fix.
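The threshold check in that example reduces to a position comparison; the names and the returned dict shape are illustrative:

```python
from typing import Optional

def flag_cta_below_attention(cta_y: int, dropoff_y: int) -> Optional[dict]:
    """Flag a CTA that sits below the predicted attention dropoff
    (the Y=400 vs Y=720 mobile example from the text)."""
    if cta_y > dropoff_y:
        return {
            "issue": "CTA below predicted attention threshold",
            "cta_y": cta_y,
            "dropoff_y": dropoff_y,
            "fix": "Move primary CTA above the attention dropoff",
        }
    return None  # CTA already sits inside the high-attention region
```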

Lou, J. et al. "TranSalNet: Towards Perceptually Relevant Visual Saliency Prediction." arXiv:2110.03593, 2021. Validated on the WIC640 web page dataset. CC=0.78, NSS=2.42.

Confidence intervals

Every score in a BuyerEyes report comes with a confidence range. Not "Copy: 7.1" but "Copy: 7.1 [6.8 - 7.4, high stability]." The range comes from multi-trial scoring variance. When the agents converge tightly, the range is narrow and you can act on the number directly. When the range is wide, the score is a starting point that warrants investigation.
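A sketch of how a range and stability label can fall out of multi-trial variance, using mean ± one standard deviation; the 0.5 stability cutoff and the function name are assumptions, not documented BuyerEyes parameters:

```python
import statistics

def score_with_interval(trial_scores, stability_cutoff=0.5):
    """Collapse repeated trial scores into a reported score, a range,
    and a stability label based on how tightly the trials converge."""
    mean = statistics.mean(trial_scores)
    sd = statistics.stdev(trial_scores)
    stability = "high" if sd < stability_cutoff else "low"
    return round(mean, 1), round(mean - sd, 1), round(mean + sd, 1), stability
```

Tight convergence yields a narrow band you can act on; scattered trials widen the band and flip the label, signalling the score is a starting point rather than a verdict.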

When evidence is insufficient to produce a reliable score, the report says "Insufficient data" instead of forcing a low number. That distinction matters. A low score means something is wrong. Insufficient data means the system cannot determine whether something is wrong. Those situations call for different responses.

Validation

The scoring system has passed 1,294 tests in production across the pipeline. The methodology draws on 30+ peer-reviewed papers across visual attention, persuasion science, cognitive load, trust calibration, and LLM evaluation reliability. The SCIENCE.md document in the BuyerEyes codebase tracks every paper with its implementation status and the specific files where its methods are applied.

Anchor statements are calibrated, not prompt-engineered. Changing them changes the entire scoring system. They were developed through iterative validation against CRO specialist assessments on e-commerce and SaaS pages across multiple verticals.

Built by Kamil Andrusz, who spent 30 years building and optimizing web infrastructure before asking a different question: what if we could see a website through a buyer's eyes before spending a dollar on traffic? The answer took 30+ research papers, 14 agents, and a scoring methodology that survives its own internal debate.

See it in action

29 sub-scores. Confidence intervals. Saliency heatmap. Prioritized recommendations. Report in 24-48 hours.

Get Your Report