The problem with AI ratings
Ask any AI to rate a landing page on a scale of 1 to 10. Do it five times without changing the page. The scores will differ by 2 to 3 points.
This variance is not a bug in any specific tool. Direct numerical rating is structurally unreliable when performed by language models. The anchor for what "7" means shifts depending on context length, prompt phrasing, temperature, and model version. Maier et al. measured the correlation between direct AI numerical ratings and human expert judgment across systematic trials. The result: ρ=0.26 to 0.39.
That means a direct AI score shares little signal with a CRO specialist's assessment: a correlation of 0.26 to 0.39 means the rating explains roughly 7 to 15 percent of the variance in expert judgment. The rest is noise. Any tool that asks an AI "rate this page 1-10" and reports the result as a score is operating at this level of reliability, whether the tool acknowledges it or not.
SSR: Semantic Similarity Rating
SSR replaces absolute judgment with comparative judgment. Instead of asking the AI "rate this page 1-10," BuyerEyes asks the AI to describe the page in natural language: what works, what fails, what a visitor would experience.
That description is then compared against 150 calibrated anchor statements using cosine similarity. The anchors function as benchmarks: "A page at this level of CTA performance looks like this," expressed in plain language. Six independent sets of anchors are compared and averaged. The score is derived from where the AI's description lands relative to those benchmarks.
The result: ρ=0.90 correlation with human expert judgment. That is a 3.5x improvement over direct rating, and it crosses the threshold where scores become actionable rather than decorative.
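The mechanism can be illustrated with a minimal sketch. This is not BuyerEyes internals: the function names, the softmax temperature, and the similarity-weighted averaging are assumptions chosen to show how a description's embedding can be mapped onto anchored scores.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ssr_score(desc_vec, anchors, temperature=0.1):
    """anchors: list of (anchor_score, anchor_embedding) pairs.
    A softmax over cosine similarities weights each anchor's score,
    so the result lands between the benchmarks the description
    most resembles."""
    sims = np.array([cosine(desc_vec, vec) for _, vec in anchors])
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    scores = np.array([score for score, _ in anchors])
    return float(weights @ scores)
```

In the full pipeline, a step like this would run once per anchor set and average the six results; a single set is enough to show the mapping from description to score.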
The anchors were developed and calibrated for e-commerce and SaaS conversion contexts. "Social proof at this level" has a specific definition: how many reviews, how visible, whether they address the objections a first-time buyer has. These are not generic quality descriptions.
Bias correction
SSR alone is not sufficient. Language models carry systematic biases that distort scores in predictable ways.
Position bias: elements near the top of a page tend to score higher than equivalent elements near the bottom. Length bias: detailed sections score higher than concise ones, independent of quality. Embedding anisotropy: the vector space used for semantic comparison is not uniformly distributed, making similarity measurements more reliable in some regions than others.
BuyerEyes corrects for all three. Evaluation order is randomized across dimensions. Copy scoring is normalized for content length. Embedding vectors are mean-centered before cosine similarity computation (following Ethayarajh, arXiv:2403.05440). The corrections are invisible in the output. Without them, a page with a long trust section near the top would systematically outscore an equivalent page with a shorter trust section near the bottom.
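The anisotropy correction is the least intuitive of the three, so here is a sketch of the general technique: raw sentence embeddings cluster in a narrow cone of the vector space, which inflates cosine similarity for almost any pair, and subtracting the corpus mean removes that shared offset. The function name and two-dimensional toy vectors are illustrative only.

```python
import numpy as np

def centered_cosine(a, b, corpus_mean):
    """Cosine similarity after subtracting the corpus mean embedding.
    Without centering, the large shared component dominates the dot
    product and nearly everything looks similar to everything else."""
    a_c, b_c = a - corpus_mean, b - corpus_mean
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))
```

With two vectors that share a large common offset, raw cosine similarity is near 1.0 while the centered similarity exposes that their distinctive components point in opposite directions.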
Multi-agent debate
A single AI evaluating a page confirms its own assumptions. BuyerEyes runs 14 specialized agents in parallel. Five domain agents score different conversion dimensions. Up to 10 buyer personas evaluate the page from distinct perspectives. An audience discovery agent selects which personas to deploy. An alignment agent checks whether the page matches the traffic driving visitors there.
When the domain agents disagree (standard deviation above 1.5 or score spread above 3.0), BuyerEyes triggers structured debate rounds. Each agent reviews the others' scores and reasoning. They revise or defend their positions with explicit justification. Up to three rounds, with convergence detection that stops the debate when positions stabilize.
After debate, an adversarial review layer runs. Every score above 7.0 faces a Devil's Advocate challenge: what weaknesses are being overlooked? Every score below 4.0 faces a Defender review: what strengths are being undervalued? Adjustments are capped at plus or minus 1.0 to prevent adversarial roles from dominating the final result.
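The trigger and cap logic described above is simple enough to sketch directly. The thresholds come from the text; the function and constant names are hypothetical.

```python
import statistics

DEBATE_STDEV = 1.5      # standard deviation threshold for triggering debate
DEBATE_SPREAD = 3.0     # max-minus-min threshold for triggering debate
ADJUSTMENT_CAP = 1.0    # cap on adversarial review adjustments

def needs_debate(domain_scores):
    """Trigger structured debate rounds when domain agents disagree."""
    return (statistics.stdev(domain_scores) > DEBATE_STDEV
            or max(domain_scores) - min(domain_scores) > DEBATE_SPREAD)

def apply_adversarial_adjustment(score, delta):
    """Clamp Devil's Advocate / Defender adjustments to +/-1.0 so the
    adversarial roles cannot dominate the final result."""
    return score + max(-ADJUSTMENT_CAP, min(ADJUSTMENT_CAP, delta))
```

Five tightly clustered domain scores pass through without debate; a spread of 4 points trips both conditions.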
The debate mechanism is grounded in two research lines: Hu et al. "Multi-Agent Debate for LLM Judges with Adaptive Stability Detection" (NeurIPS 2025, arXiv:2510.12697) and Du et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate" (ICML 2024, arXiv:2305.14325).
29 sub-scores across 6 dimensions
A score of "Visual: 6.5" does not tell your developer what to fix. BuyerEyes decomposes each of the six evaluation dimensions into atomic sub-scores, 29 in total. Each sub-score has its own number, its own rubric, and its own actionable recommendation.
Visual design: contrast, hierarchy, whitespace, mobile layout, CTA prominence, image quality, brand consistency, above-fold composition.

Copy and messaging: first impression, value proposition clarity, benefit specificity, urgency, readability, persuasion framework coverage.

CTA effectiveness: visibility, copy strength, placement, urgency, friction reduction.

Trust and credibility: social proof specificity, pricing transparency, authority signals, review authenticity, dark pattern detection.

Technical experience: load performance, layout stability, mobile usability, form quality, accessibility compliance.

Purchase intent (persona simulation): value equation, risk perception, social validation needs, commitment readiness, objection resolution.
Each recommendation in the report carries an effort tag (low, medium, high) and an impact estimate. "CTA Prominence: 4.2. Move primary CTA above fold on mobile. High impact, low effort." That is a ready-made ticket for your developer, not a suggestion to "improve your CTA."
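The effort and impact tags imply a natural prioritization: highest impact first, lowest effort as the tiebreaker, so quick wins surface at the top of the report. A sketch with hypothetical field names:

```python
from dataclasses import dataclass

IMPACT = {"low": 0, "medium": 1, "high": 2}
EFFORT = {"low": 0, "medium": 1, "high": 2}

@dataclass
class Recommendation:
    sub_score: str      # e.g. "CTA Prominence"
    score: float        # e.g. 4.2
    fix: str            # e.g. "Move primary CTA above fold on mobile"
    impact: str         # low / medium / high
    effort: str         # low / medium / high

def prioritize(recs):
    """Sort so quick wins surface first: highest impact,
    then lowest effort as the tiebreaker."""
    return sorted(recs, key=lambda r: (-IMPACT[r.impact], EFFORT[r.effort]))
```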
Saliency heatmaps
Every report includes a visual attention heatmap generated from a single screenshot. No traffic required. No tracking code. No panel recruitment.
The heatmap uses TranSalNet, a transformer-based visual saliency model. It was validated against real eye-tracking data on 640 web pages, achieving CC=0.78 correlation with ground truth. Processing time: approximately 50 milliseconds per screenshot.
The heatmap answers one question: where does visual attention go on this page? That prediction feeds into the sub-score system. CTA Prominence is scored partly based on whether the CTA sits above the predicted attention threshold. If the heatmap shows the attention dropoff at position Y=400 on mobile and your CTA sits at Y=720, the report flags it with a specific score and a specific fix.
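A dropoff check like the one described can be sketched from a saliency map alone. The 80 percent mass cutoff and the function names here are illustrative assumptions, not the actual threshold BuyerEyes uses.

```python
import numpy as np

def attention_dropoff_y(saliency, mass=0.8):
    """Return the row below which only (1 - mass) of the predicted
    attention remains. `saliency` is a 2D map (rows = vertical pixels),
    such as a saliency model's output resized to the screenshot."""
    row_mass = saliency.sum(axis=1)
    cumulative = np.cumsum(row_mass) / row_mass.sum()
    return int(np.searchsorted(cumulative, mass))

def cta_flagged(cta_y, saliency, mass=0.8):
    """Flag a CTA that sits below the predicted attention dropoff."""
    return cta_y > attention_dropoff_y(saliency, mass)
```

With a synthetic map whose attention all falls in the top half of an 800-pixel-tall page, a CTA at Y=720 is flagged and one at Y=100 is not.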
Confidence intervals
Every score in a BuyerEyes report comes with a confidence range. Not "Copy: 7.1" but "Copy: 7.1 [6.8 - 7.4, high stability]." The range comes from multi-trial scoring variance. When the agents converge tightly, the range is narrow and you can act on the number directly. When the range is wide, the score is a starting point that warrants investigation.
When evidence is insufficient to produce a reliable score, the report says "Insufficient data" instead of forcing a low number. That distinction matters. A low score means something is wrong. Insufficient data means the system cannot determine whether something is wrong. Those situations call for different responses.
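Both behaviors, the interval and the refusal to force a number, can be sketched together. The three-trial minimum and the 1.96 multiplier are illustrative assumptions; the source only states that the range comes from multi-trial scoring variance.

```python
import statistics

MIN_TRIALS = 3  # assumed floor below which variance cannot be estimated

def score_with_interval(trials, k=1.96):
    """Mean score plus a normal-approximation interval from repeated
    scoring trials. Returns None for the 'insufficient data' case
    rather than forcing an unreliable number."""
    if len(trials) < MIN_TRIALS:
        return None
    mean = statistics.fmean(trials)
    se = statistics.stdev(trials) / len(trials) ** 0.5
    return mean, (mean - k * se, mean + k * se)
```

Tightly converged trials produce a narrow range you can act on directly; too few trials produce no score at all.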
Validation
The scoring system has passed 1,294 tests across the production pipeline. The methodology draws on 30+ peer-reviewed papers across visual attention, persuasion science, cognitive load, trust calibration, and LLM evaluation reliability. The SCIENCE.md document in the BuyerEyes codebase tracks every paper with its implementation status and the specific files where its methods are applied.
Anchor statements are calibrated, not prompt-engineered. Changing them changes the entire scoring system. They were developed through iterative validation against CRO specialist assessments on e-commerce and SaaS pages across multiple verticals.
See it in action
29 sub-scores. Confidence intervals. Saliency heatmap. Prioritized recommendations. Report in 24-48 hours.
Get Your Report