
How WebAppBench scores AI-generated web apps.

Every entry on the leaderboard runs through the same harness against the same prompt. Each tool's output is scored by up to 23 deterministic and LLM-graded scorers, grouped into four dimensions. The composite is a weighted mean of dimension scores; cost is tracked but excluded from the ranking. Three of the 23 (F7, F8, S4) belong to the v0.3 backend track and run only on submissions shipping a real backend.

Composite formula
# composite (0–100)
composite = Σd w(d) × dimension(d)
# dimension
dimension(d) = Σs ∈ d w(s) × scorer(s)
# null handling
if scorer(s) is null → its weight is removed and the remaining w(s) within the dimension are renormalized to 100%.
if every scorer in a dimension is null → the dimension drops out and w(d) redistributes across the rest.
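
The formula and null-handling rules above can be sketched as one renormalizing weighted mean, applied first within each dimension and then across dimensions. This is a minimal sketch rather than the harness's code: the names `weighted_mean`, `composite` and `DIMENSION_WEIGHTS` are hypothetical, scorer outputs are assumed to lie in 0–1, and the dimension weights (47 / 18 / 24 / 11) are back-derived from the composite and in-dim percentages listed in the scorer glossary, not quoted from the harness.

```python
from typing import Optional

def weighted_mean(scores: dict[str, Optional[float]],
                  weights: dict[str, float]) -> Optional[float]:
    """Weighted mean over non-null scores; None if every score is null."""
    live = {k: v for k, v in scores.items() if v is not None}
    total = sum(weights[k] for k in live)
    if total == 0:
        return None  # whole group is null, so it drops out one level up
    return sum(weights[k] * v for k, v in live.items()) / total

# Assumption: dimension weights back-derived from the glossary percentages.
DIMENSION_WEIGHTS = {"functional": 47, "code_quality": 18, "visual": 24, "security": 11}

def composite(dimension_scores: dict[str, Optional[float]]) -> Optional[float]:
    """Composite on a 0-100 scale; null dimensions redistribute their weight."""
    mean = weighted_mean(dimension_scores, DIMENSION_WEIGHTS)
    return None if mean is None else 100 * mean

# Functional dimension with F6 null: weights 15/45/10/5 renormalize over 75.
functional = weighted_mean(
    {"F1": 1.0, "F2": 0.8, "F4": 0.5, "F5": 1.0, "F6": None},
    {"F1": 15, "F2": 45, "F4": 10, "F5": 5, "F6": 25},
)
```

Using the same helper at both levels keeps the two null rules (scorer-level and dimension-level) from drifting apart.
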
Dimensions
Scorer glossary
F1

Render success

Functional · composite 7.05% · in-dim 15%

Page loads with HTTP 2xx and a non-empty body within 30 s. Baseline gate: if the site fails to render, every browser-dependent scorer is skipped and returns null.

Playwright navigates to the URL, waits for network idle (capped at 8 s), then checks status code and minimum text content (≥10 chars). Binary 0/1. Fixed 30 s timeout.

F2

Acceptance criteria

Functional · composite 21.15% · in-dim 45%

Per-prompt checklist of must-have and should-have requirements, executed as Playwright assertions (roles, labels, counts).

Each prompt ships a YAML mustHave / shouldHave list. score = (mustPassed + 0.5 × shouldPassed) / (mustTotal + 0.5 × shouldTotal). v0.2 added per-criterion setup actions (evaluate / fill / click / press / reload / waitFor) for stateful prompts.
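
As a worked instance of the formula above, here is a minimal sketch. The function name is hypothetical, and returning `None` when a prompt has no criteria is an assumption, not documented behaviour.

```python
from typing import Optional

def acceptance_score(must_passed: int, must_total: int,
                     should_passed: int, should_total: int) -> Optional[float]:
    """(mustPassed + 0.5 * shouldPassed) / (mustTotal + 0.5 * shouldTotal)."""
    denominator = must_total + 0.5 * should_total
    if denominator == 0:
        return None  # assumption: no criteria means the scorer is not applicable
    return (must_passed + 0.5 * should_passed) / denominator

# 4 of 5 must-haves and 1 of 2 should-haves pass: (4 + 0.5) / (5 + 1) = 0.75
example = acceptance_score(4, 5, 1, 2)
```

A should-have counts half as much as a must-have in both numerator and denominator, so missing one costs half of what a missed must-have costs.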

F4

Functional intent judge

Functional · composite 4.70% · in-dim 10%

LLM judge over screenshots scoring functional intent on four criteria, each rated 1–5: intent match, feature completeness, content relevance, flow coherence.

Three screenshots (initial, mobile, mid-scroll) plus the prompt and acceptance-criterion IDs are sent to a vision model via OpenRouter. Defaults: intent_match, feature_completeness, content_relevance, flow_coherence. Returns missing_features list. Score normalised to 0–1.

F5

Runtime errors

Functional · composite 2.35% · in-dim 5%

Console errors, uncaught JS exceptions, and 4xx/5xx network responses. 0 errors = 1.0; linear decay to 0 at 10+ errors.

page.on("console") and page.on("response") listeners during the F1+F2 sweep. Third-party analytics whitelisted. Each error capped at 200 chars; up to 10 of each type collected.
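
The linear decay for F5 can be sketched as below. The function name is hypothetical, and so is the assumption that the count is the total of collected errors across all types after whitelisting; the source does not say how the three error types are combined.

```python
def runtime_error_score(error_count: int) -> float:
    """1.0 at 0 errors, falling linearly to 0.0 at 10 or more errors."""
    # Assumption: error_count is the combined total across error types.
    return max(0.0, 1 - error_count / 10)

three_errors = runtime_error_score(3)  # roughly 0.7
```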

F6

Verbatim constraints

Functional · composite 11.75% · in-dim 25%

Exact string constraints specified in the prompt (e.g. "Get started", "Nimbus Notes") must appear verbatim in the submitted source.

Source ZIP is extracted and scanned across .ts, .tsx, .js, .jsx, .css, .html, .svg, .json. Constraint kinds: exact_copy, hex_value, structural. Source-only scorer. score = passed / total; passed iff 100% honoured.

F7

Auth round-trip

Functional · backend track · v0.3 · composite 3.27% (backend) · in-dim 8% add-on

Backend track: log in → create a uniquely-marked record → log out → log in again → confirm it persists. Catches broken sessions and writes that never reach the server. Null on frontend-only tools.

Runs only when the submission ships a backend block with signup_credentials. Playwright drives the deployed login form (resilient email/password/submit heuristics), creates a contact tagged with a unique per-run marker (F7_CONTACT_<run>_<rand>), logs out, re-navigates, logs in again, and asserts the marker is still visible — the unique marker means the check can't pass on seed data. passed requires both creation and post-relogin persistence; partial credit = fraction of lifecycle steps that succeeded. Additive weight 8: null on non-backend submissions, reflows in at 8/115 ≈ 7% of Functional on backend submissions.

F8

Cross-session persistence

Functional · backend track · v0.3 · composite 2.86% (backend) · in-dim 7% add-on

Backend track: a record created in browser context A must be visible in a fresh incognito context B. Discriminates real backends from localStorage-only apps. Null on frontend-only tools.

Two Playwright browser contexts. Context A logs in and creates a contact with a unique marker; context B — a fresh browser.newContext() with clean storage — reopens the deployed URL, logs in with the same credentials, and asserts the marker is visible. Because B shares no cookies or localStorage with A, a localStorage-only app fails while a real-backend app passes (a distinction F2 can't make, since reload preserves localStorage). passed requires the marker to cross sessions; partial credit = fraction of steps succeeded. Additive weight 7: null on non-backend submissions, reflows in at 7/115 ≈ 6% of Functional on backend submissions.

C1

ESLint density

Code Quality · composite 3.60% · in-dim 20%

ESLint with typescript-eslint recommended rules. Decay 0 errors/1k LOC = 1.0 → 20+ errors/1k LOC = 0. Source-only.

Runs eslint with typescript-eslint recommended + no-console:warn + no-debugger:error. issuesPer1k = (errors + 0.1 × warnings) / LOC × 1000. score = max(0, 1 - issuesPer1k / 20).
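
The density formula above translates directly into code. This is a minimal sketch with a hypothetical function name; returning `None` for an empty source tree is an assumption.

```python
from typing import Optional

def eslint_density_score(errors: int, warnings: int, loc: int) -> Optional[float]:
    """score = max(0, 1 - issuesPer1k / 20), with warnings weighted at 0.1."""
    if loc <= 0:
        return None  # assumption: nothing to lint means a null score
    issues_per_1k = (errors + 0.1 * warnings) / loc * 1000
    return max(0.0, 1 - issues_per_1k / 20)

# 12 errors and 30 warnings in 3,000 LOC: (12 + 3) / 3000 * 1000 = 5 issues per 1k LOC
example = eslint_density_score(12, 30, 3000)  # roughly 0.75
```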

C2

TypeScript safety

Code Quality · composite 0.90% · in-dim 5%

tsc --noEmit --strict. 0 type errors = 1.0; decays to 0 at 20 errors/1k LOC. "Cannot find module" filtered out. Source-only.

Finds tsconfig.json or tsconfig.app.json; falls back to a permissive inline config. Ignores missing-module errors common in AI-generated code without deps installed.

C3

Accessibility (axe-core)

Code Quality · composite 3.60% · in-dim 20%

axe-core WCAG 2.1/2.2 AA audit. Violations normalised per 1k DOM nodes; score = max(0, 1 − violationsPer1k / 50).

@axe-core/playwright with tags wcag2a, wcag2aa, wcag21a, wcag21aa, wcag22aa. violationsPer1k = violating nodes / total nodes × 1000; score decays to 0 at 50 violations per 1k nodes. Single-state scan (axe + Lighthouse catch only ~30–40% of true a11y issues).
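
The per-node normalisation can be illustrated with a short sketch. The function name is hypothetical, and the `None` return for an empty DOM is an assumption.

```python
from typing import Optional

def axe_score(violating_nodes: int, total_nodes: int) -> Optional[float]:
    """score = max(0, 1 - violationsPer1k / 50), normalised per 1k DOM nodes."""
    if total_nodes <= 0:
        return None  # assumption: an empty DOM yields a null score
    violations_per_1k = violating_nodes / total_nodes * 1000
    return max(0.0, 1 - violations_per_1k / 50)

# 6 violating nodes in a 400-node page: 15 per 1k nodes
example = axe_score(6, 400)  # roughly 0.7
```

Normalising by DOM size means a large page is not penalised merely for having more nodes to get wrong.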

C4

Lighthouse performance

Code Quality · composite 3.60% · in-dim 20%

Lighthouse performance score (mobile throttled, median of 3 runs). Composite of FCP, LCP, TBT, CLS, Speed Index.

v0.2: up to 3 runs, median of successful. Per-run 90 s timeout, overall 240 s. Tolerates ≥1 successful run; 0 → score null.
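
The "median of successful runs, null if none" rule might look like this. It is a sketch only: the function name is hypothetical, failed or timed-out runs are represented as `None`, and with exactly two successful runs `statistics.median` returns their mean, which the source does not specify.

```python
from statistics import median
from typing import Optional

def lighthouse_score(run_scores: list[Optional[float]]) -> Optional[float]:
    """Median of the runs that succeeded; None when every run failed."""
    successful = [s for s in run_scores if s is not None]
    return median(successful) if successful else None

three_ok = lighthouse_score([0.71, 0.88, 0.80])      # middle value, 0.80
one_failed = lighthouse_score([0.82, None, 0.90])    # mean of the two survivors
all_failed = lighthouse_score([None, None, None])    # None
```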

C5

Bundle payload (gzipped)

Code Quality · composite 0.90% · in-dim 5%

Gzipped JS + CSS payload transferred over the wire. Full marks ≤170 KB, linear decay to 0 at ≥1 MB. Passed if ≤350 KB.

Passive page.on("response") listener captures every script & stylesheet response during F1+F2 (Content-Length, falls back to body length). compressedMeasurement flagged true only when every response had a Content-Length. Source-tree fallback (uncompressed bytes, ≤150 KB → 1 MB) when no network capture is available, clearly labelled scoringSource: source-fallback.
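
The payload thresholds can be sketched as a clamped linear ramp. The function names are hypothetical, and treating 1 MB as 1024 KB is an assumption (the source does not say whether it means 1000 or 1024 KB). The `full_marks_kb` parameter lets the same ramp model the source-tree fallback, which starts at 150 KB.

```python
def bundle_score(payload_kb: float, full_marks_kb: float = 170.0,
                 zero_kb: float = 1024.0) -> float:
    """1.0 at or below full_marks_kb, linear decay to 0.0 at zero_kb."""
    # Assumption: 1 MB is taken as 1024 KB.
    if payload_kb <= full_marks_kb:
        return 1.0
    if payload_kb >= zero_kb:
        return 0.0
    return 1.0 - (payload_kb - full_marks_kb) / (zero_kb - full_marks_kb)

def bundle_passed(payload_kb: float) -> bool:
    """Pass threshold from the summary: at most 350 KB gzipped."""
    return payload_kb <= 350.0

halfway = bundle_score(597.0)                              # midpoint of 170..1024, so 0.5
fallback = bundle_score(150.0, full_marks_kb=150.0)        # source-tree fallback ramp
```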

C6

AST complexity

Code Quality · composite 0.90% · in-dim 5%

Cognitive complexity via eslint-plugin-sonarjs. Functions exceeding threshold 15 are flagged.

score = max(0, 1 - violationsPer1k / 10). Source-only.

C7

Maintainability judge

Code Quality · composite 2.70% · in-dim 15%

LLM judge over a sampled source excerpt scoring maintainability on five criteria, each rated 1–5: naming, separation of concerns, component reuse, prop typing, secret handling.

Up to 12 source files sampled (entry → components/hooks/features/pages → other). Excerpt capped at 12 KB. Source-only.

C8

Clean install

Code Quality · composite 0.90% · in-dim 5%

Strict frozen install (npm ci / pnpm / yarn / bun) from a clean checkout. Graded: 0 if nothing installs, else 1.0 docked for lockfile-hygiene defects, floored at 0.5. Source-only.

Detects every committed lockfile (npm/pnpm/yarn/bun) and runs its strict frozen install (--ignore-scripts) in a fresh temp dir, 240 s each. Score: 0 if no lockfile installs cleanly; otherwise starts at 1.0 and deducts for hygiene defects — duplicate lockfiles −0.15, out-of-sync −0.20, damaged −0.20, broken private-registry sibling −0.20 — floored at 0.5. Missing package.json or no manager on PATH → null.
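
The grading rule above reduces to "zero, or one minus deductions with a floor". This is a minimal sketch: the defect keys and function name are hypothetical labels for the four hygiene defects listed, and the null case (missing package.json or no manager on PATH) is left out.

```python
# Hypothetical keys for the hygiene defects and their documented deductions.
DEDUCTIONS = {
    "duplicate_lockfiles": 0.15,
    "out_of_sync": 0.20,
    "damaged": 0.20,
    "broken_private_registry_sibling": 0.20,
}

def clean_install_score(any_install_succeeded: bool, defects: list[str]) -> float:
    """0 if no lockfile installs cleanly; else 1.0 minus deductions, floored at 0.5."""
    if not any_install_succeeded:
        return 0.0
    return max(0.5, 1.0 - sum(DEDUCTIONS[d] for d in defects))

two_defects = clean_install_score(True, ["duplicate_lockfiles", "out_of_sync"])  # about 0.65
floored = clean_install_score(True, list(DEDUCTIONS))  # 0.75 of deductions, held at 0.5
```

The floor means hygiene defects can halve the score at most; only a complete failure to install reaches 0.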

C9

SEO hygiene

Code Quality · composite 0.90% · in-dim 5%

Deterministic DOM checks: title length, meta description, canonical URL, OG tags, html[lang], heading hierarchy.

Configurable per-prompt via seoApplicable. Checks title (10–70 chars), meta description (50–300 chars), canonical, og:title/description/type, twitter:card, json-ld, lang, heading hierarchy, robots.txt, sitemap.xml.

V1

MLLM visual judge

Visual · composite 13.20% · in-dim 55%

MLLM visual judge (Gemini 2.5 Pro via OpenRouter). 8 criteria scored 1–5. Normalised to 0–1.

Defaults: visual hierarchy, typography, colour harmony, whitespace, brand fit, CTA prominence, mobile layout, overall polish. v0.2 also runs 3 copy-quality checks unless placeholder_copy: true. Per-prompt extras supported.

V2

Design heuristics

Visual · composite 7.20% · in-dim 30%

Deterministic in-browser design heuristics, 8 checks across layout and CSS conventions.

Layout: whitespace ≥25%, WCAG AA contrast on ≥80% text, ≥80% text ≥14px, ≥70% blocks ≤85ch. CSS conventions: ≥80% box-sizing:border-box, prefers-reduced-motion media query, ≥5 CSS custom properties, ≥1 :focus-visible rule. Skips CORS-blocked stylesheets.

V4

Responsive design

Visual · composite 3.60% · in-dim 15%

Playwright viewport tests at 360×800, 768×1024, 1440×900. No horizontal overflow + mobile touch targets ≥44 px.

New browser context per viewport. 4 checks total. score = passed / total; passed if ≥0.75.

S1

Secrets + deployed headers

Security · composite 4.40% · in-dim 40%

Source secret scan (regex + Semgrep + trufflehog when available) ⊕ deployed HTTP header audit (CSP, HSTS, X-Content-Type-Options, X-Frame-Options, Referrer-Policy, Permissions-Policy).

Sub-1: secrets — built-in 8-pattern regex (always on), Semgrep p/secrets+p/owasp-top-ten (if installed), trufflehog filesystem (if installed); findings unioned, any match = 0. Sub-2: header audit, score = passed/6. Final = mean of whichever ran.
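
The "mean of whichever ran" rule can be sketched as follows. The function name is hypothetical; a sub-check that did not run is passed as `None`, and scoring a clean secret scan as 1.0 is an assumption implied by "any match = 0".

```python
from typing import Optional

def s1_score(secret_findings: Optional[int],
             headers_passed: Optional[int]) -> Optional[float]:
    """Mean of the sub-checks that ran; None if neither input was available."""
    parts: list[float] = []
    if secret_findings is not None:
        # Any finding zeroes the sub-score; a clean scan is assumed to score 1.0.
        parts.append(0.0 if secret_findings > 0 else 1.0)
    if headers_passed is not None:
        parts.append(headers_passed / 6)  # six audited headers
    return sum(parts) / len(parts) if parts else None

both_ran = s1_score(secret_findings=0, headers_passed=3)         # (1.0 + 0.5) / 2 = 0.75
headers_only = s1_score(secret_findings=None, headers_passed=3)  # 0.5
```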

S2

Client-security anti-patterns

Security · composite 3.85% · in-dim 35%

Client-side security anti-patterns general scanners miss: Supabase service-role keys in client code, RLS off, JWT decode without verify, Firebase test mode, hardcoded admin creds, XSS sinks, insecure transport, sensitive logging.

Severity-weighted: critical=10, high=5, medium=2 pts; score = max(0, 1 - penalty / 20); passed = no critical/high findings. 16 patterns across Supabase, Firebase, JWT, Stripe, OpenAI, generic third-party keys, hardcoded admin creds, password reset without token, plus v0.2 secure-by-default sinks (dangerouslySetInnerHTML without sanitizer, disabled TLS verification, secret/req logging). Client-side-only patterns are scoped to client paths; PUBLISHABLE/ANON/PUBLIC key names and vendor-generated files are excluded. (Registry key s2 unchanged; historically the "auth-pattern check".)
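
A minimal sketch of the severity weighting, assuming each finding is reduced to its severity label (the function name and input shape are hypothetical):

```python
SEVERITY_POINTS = {"critical": 10, "high": 5, "medium": 2}

def s2_score(finding_severities: list[str]) -> tuple[float, bool]:
    """Return (score, passed) for a list of finding severities."""
    penalty = sum(SEVERITY_POINTS[s] for s in finding_severities)
    score = max(0.0, 1 - penalty / 20)
    passed = not any(s in ("critical", "high") for s in finding_severities)
    return score, passed

# One high and two medium findings: penalty 9, score about 0.55, not passed
score, passed = s2_score(["high", "medium", "medium"])
```

Two critical findings are enough to reach a penalty of 20 and zero the score, while medium findings lower the score without failing the check.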

S3

Dependency vulnerabilities

Security · composite 2.75% · in-dim 25%

npm audit CVE count from the source lockfile (high/critical filtered). critical×10 + high×3 + moderate×1 + low×0.1; score = max(0, 1 − penalty/20). Passed if 0 critical and 0 high.

Finds package.json (root or one level down). If no lockfile, generates one via npm install --package-lock-only in a temp dir, then runs npm audit --json --omit=dev. Source-only.
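
The penalty formula from the summary can be applied to the severity counts that `npm audit --json` reports under `metadata.vulnerabilities` (npm 7 and later). This is a sketch under the assumption that the harness reads those counts; the function name and the sample payload are illustrative, not taken from the harness.

```python
import json

SEVERITY_WEIGHTS = {"critical": 10, "high": 3, "moderate": 1, "low": 0.1}

def s3_from_audit(audit_json: str) -> tuple[float, bool]:
    """Return (score, passed) from an npm audit JSON report."""
    counts = json.loads(audit_json)["metadata"]["vulnerabilities"]
    penalty = sum(w * counts.get(sev, 0) for sev, w in SEVERITY_WEIGHTS.items())
    score = max(0.0, 1 - penalty / 20)
    passed = counts.get("critical", 0) == 0 and counts.get("high", 0) == 0
    return score, passed

# Illustrative report: 1 high, 4 moderate, 10 low gives a penalty of 3 + 4 + 1 = 8
sample = json.dumps({"metadata": {"vulnerabilities": {
    "info": 0, "low": 10, "moderate": 4, "high": 1, "critical": 0, "total": 15}}})
score, passed = s3_from_audit(sample)  # score about 0.6, passed False
```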

S4

Backend security probes

Security · backend track · v0.3 · composite 1.43% (backend) · in-dim 15% add-on

Backend track: read-only runtime probes against the deployed backend — an unauthenticated request and a cross-user request must not return another user's data. Catches the canonical "RLS off" leak that S2 only infers from client code. Null on frontend-only tools.

Runs only with a backend block carrying two accounts; credentials are the only input. Signs in as user B via the real login form, seeds a uniquely-marked record, and auto-discovers B's data endpoint (largest record array, GET or POST/RPC). Unauth probe: replays B's request with auth stripped from body + headers → must be rejected and serve no data. Cross-user probe: signs in as user A, captures A's own data response → B's marker must not appear. Each failed probe = 10 penalty pts; score = max(0, 1 − penalty/20); passed = no failed probes; details.crossTenantLeak is the headline. Complements rather than replaces S1/S2/S3. Additive weight 15: null on non-backend submissions, reflows in at 15/115 ≈ 13% of Security on backend submissions.

Renormalization on null scorers

When a scorer returns null (skipped, N/A, missing source), its weight is removed from the dimension and the remaining scorers' weights are renormalized to sum to 100% within the dimension.

Example. A submission without a source ZIP loses F6, C1, C2, C5, C6, C7, C8, S2 and S3 (source-only). Functional then becomes a weighted mean of just F1 / F2 / F4 / F5, with the weights {15, 45, 10, 5} (summing to 75) renormalized to {20.0%, 60.0%, 13.3%, 6.7%}.

If a whole dimension has no contributors (e.g. Security with no source ZIP and unfetchable URL), the dimension drops and its weight redistributes across the remaining three.

Backend track (v0.3): F7, F8 and S4 sit on top of their dimension's 100% base rather than carving into it. On frontend-only submissions they return null and the standard scorers keep their exact proportions. On submissions shipping a real backend the denominator expands to 115 and they reflow in (F7 → 7% and F8 → 6% of Functional, S4 → 13% of Security), with the others compressing proportionally.
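
The additive reflow can be checked with a few lines of arithmetic. This is a sketch with a hypothetical helper name; the base and add-on weights are the in-dim figures from the glossary.

```python
def effective_shares(base: dict[str, float], addons: dict[str, float]) -> dict[str, float]:
    """Share of the dimension each scorer holds once add-on weights join the base."""
    weights = {**base, **addons}
    total = sum(weights.values())  # 100 base + add-ons, e.g. 115
    return {name: w / total for name, w in weights.items()}

functional = effective_shares(
    {"F1": 15, "F2": 45, "F4": 10, "F5": 5, "F6": 25},
    {"F7": 8, "F8": 7},
)
security = effective_shares({"S1": 40, "S2": 35, "S3": 25}, {"S4": 15})

f7_pct = round(functional["F7"] * 100)  # 8/115, rounds to 7
f8_pct = round(functional["F8"] * 100)  # 7/115, rounds to 6
s4_pct = round(security["S4"] * 100)    # 15/115, rounds to 13
```

Every base scorer is scaled by the same factor (100/115), so their ratios to one another are unchanged on backend submissions.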

F1 gate: if the site does not render, all browser-dependent scorers (F2, F4, F5, C3, C4, C9, V1, V2, V4) are skipped and scored as null. Source-dependent scorers (F6, C1, C2, C5, C6, C7, C8, S2, S3) require a source ZIP. S1 has two sub-checks (secrets + deployed headers) and runs whichever inputs are available.