F1 Render success
Functional · composite 7.05% · in-dim 15%
Page loads with HTTP 2xx and non-empty body within 30 s. Baseline gate: a failing site scores 0 on all downstream metrics.
Playwright navigates to the URL, waits for network idle (capped at 8 s), then checks status code and minimum text content (≥10 chars). Binary 0/1. Fixed 30 s timeout.
F2 Acceptance criteria
Functional · composite 21.15% · in-dim 45%
Per-prompt checklist of must-have and should-have requirements, executed as Playwright assertions (roles, labels, counts).
Each prompt ships a YAML mustHave / shouldHave list. score = (mustPassed + 0.5 × shouldPassed) / (mustTotal + 0.5 × shouldTotal). v0.2 added per-criterion setup actions (evaluate / fill / click / press / reload / waitFor) for stateful prompts.
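As a worked illustration of the F2 formula, the sketch below computes the score from pass counts. The `CriteriaTally` shape and the null result for an empty checklist are assumptions made for the example, not details taken from the scorer.

```typescript
// Pass/total counts for one prompt's checklist (hypothetical shape).
interface CriteriaTally {
  mustPassed: number;
  mustTotal: number;
  shouldPassed: number;
  shouldTotal: number;
}

// score = (mustPassed + 0.5 × shouldPassed) / (mustTotal + 0.5 × shouldTotal)
function acceptanceScore(t: CriteriaTally): number | null {
  const denominator = t.mustTotal + 0.5 * t.shouldTotal;
  // Assumption: an empty checklist yields null instead of dividing by zero.
  if (denominator === 0) return null;
  return (t.mustPassed + 0.5 * t.shouldPassed) / denominator;
}

// 4 of 5 must-haves and 1 of 2 should-haves: (4 + 0.5) / (5 + 1) = 0.75
const acceptanceExample = acceptanceScore({
  mustPassed: 4,
  mustTotal: 5,
  shouldPassed: 1,
  shouldTotal: 2,
});
```

A should-have counts half as much as a must-have in both numerator and denominator, so missing one should-have costs half of what a missed must-have would.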
F4 Functional intent judge
Functional · composite 4.70% · in-dim 10%
LLM judge over screenshots scoring functional intent on 4 criteria, each rated 1–5: intent match, feature completeness, content relevance, flow coherence.
Three screenshots (initial, mobile, mid-scroll) plus the prompt and acceptance-criterion IDs are sent to a vision model via OpenRouter. Defaults: intent_match, feature_completeness, content_relevance, flow_coherence. Returns missing_features list. Score normalised to 0–1.
F5 Runtime errors
Functional · composite 2.35% · in-dim 5%
Console errors, uncaught JS exceptions, and 4xx/5xx network responses. 0 errors = 1.0; linear decay to 0 at 10+ errors.
page.on("console") and page.on("response") listeners run during the F1+F2 sweep. Third-party analytics are whitelisted. Each error message is capped at 200 chars; up to 10 of each type are collected.
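The capping and decay rules for F5 can be sketched as two small functions. The names are hypothetical, and the sketch assumes the score is driven by the total error count across all types, which the text does not state explicitly.

```typescript
// Keep at most 10 messages of one type, each truncated to 200 characters.
function capErrors(messages: string[]): string[] {
  return messages.slice(0, 10).map((m) => m.slice(0, 200));
}

// 0 errors = 1.0, linear decay to 0 at 10 or more errors.
// Assumption: errorCount is the total across console, exception and network errors.
function runtimeErrorScore(errorCount: number): number {
  return Math.max(0, 1 - errorCount / 10);
}
```

Under this reading, each error costs 0.1 until the score bottoms out at 0.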
F6 Verbatim constraints
Functional · composite 11.75% · in-dim 25%
Exact string constraints specified in the prompt (e.g. "Get started", "Nimbus Notes") must appear verbatim in the rendered page.
Source ZIP is extracted and scanned across .ts, .tsx, .js, .jsx, .css, .html, .svg, .json. Constraint kinds: exact_copy, hex_value, structural. Source-only scorer. score = passed / total; passed iff 100% honoured.
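A minimal sketch of the exact_copy kind only, assuming the extracted ZIP is available as a path-to-text map and that matching is case-sensitive. The function names and the input shape are hypothetical; hex_value and structural constraints are not modelled.

```typescript
// File extensions the scorer is described as scanning.
const SCANNED_EXTENSIONS = [".ts", ".tsx", ".js", ".jsx", ".css", ".html", ".svg", ".json"];

function hasScannedExtension(path: string): boolean {
  return SCANNED_EXTENSIONS.some((ext) => path.slice(-ext.length) === ext);
}

// files maps a path inside the extracted ZIP to its text content (hypothetical shape).
function verbatimScore(
  files: Record<string, string>,
  constraints: string[]
): { score: number | null; passed: boolean } {
  const sources = Object.keys(files)
    .filter(hasScannedExtension)
    .map((path) => files[path]);
  // A constraint is honoured if the exact string occurs in at least one scanned file.
  const honoured = constraints.filter((c) =>
    sources.some((src) => src.indexOf(c) !== -1)
  );
  const score = constraints.length === 0 ? null : honoured.length / constraints.length;
  // passed only when every constraint is honoured
  return { score, passed: score === 1 };
}
```

In this sketch a string that appears only in a file type outside the list, such as a README.md, does not count as honoured.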
F7 Auth round-trip
Functional · backend track · v0.3 · composite 3.27% (backend) · in-dim 8% add-on
Backend track: log in → create a uniquely-marked record → log out → log in again → confirm it persists. Catches broken sessions and writes that never reach the server. Null on frontend-only tools.
Runs only when the submission ships a backend block with signup_credentials. Playwright drives the deployed login form (resilient email/password/submit heuristics), creates a contact tagged with a unique per-run marker (F7_CONTACT_<run>_<rand>), logs out, re-navigates, logs in again, and asserts the marker is still visible — the unique marker means the check can't pass on seed data. passed requires both creation and post-relogin persistence; partial credit = fraction of lifecycle steps that succeeded. Additive weight 8: null on non-backend submissions, reflows in at 8/115 ≈ 7% of Functional on backend submissions.
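The additive-weight arithmetic can be checked with a short sketch. It assumes that on backend submissions every Functional metric, base and add-on alike, is divided by the same total of 115; the constant and function names are hypothetical.

```typescript
// In-dimension weights for Functional as listed in this section.
const FUNCTIONAL_BASE: Record<string, number> = { F1: 15, F2: 45, F4: 10, F5: 5, F6: 25 };
const FUNCTIONAL_BACKEND_ADDONS: Record<string, number> = { F7: 8, F8: 7 };

// Returns each metric's share of the Functional dimension.
// Add-ons carry weight only on backend submissions; elsewhere they are null.
function functionalShares(isBackend: boolean): Record<string, number> {
  const weights: Record<string, number> = {};
  Object.keys(FUNCTIONAL_BASE).forEach((k) => {
    weights[k] = FUNCTIONAL_BASE[k];
  });
  if (isBackend) {
    Object.keys(FUNCTIONAL_BACKEND_ADDONS).forEach((k) => {
      weights[k] = FUNCTIONAL_BACKEND_ADDONS[k];
    });
  }
  const total = Object.keys(weights).reduce<number>((sum, k) => sum + weights[k], 0);
  const shares: Record<string, number> = {};
  Object.keys(weights).forEach((k) => {
    shares[k] = weights[k] / total;
  });
  return shares;
}

const backendShares = functionalShares(true); // F7 = 8/115, F8 = 7/115
const frontendShares = functionalShares(false); // F1 = 15/100, no F7 or F8 entry
```

Under that assumption the base metrics shrink slightly on backend submissions, for example F2 moves from 45/100 to 45/115.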
F8 Cross-session persistence
Functional · backend track · v0.3 · composite 2.86% (backend) · in-dim 7% add-on
Backend track: a record created in browser context A must be visible in a fresh incognito context B. Discriminates real backends from localStorage-only apps. Null on frontend-only tools.
Two Playwright browser contexts. Context A logs in and creates a contact with a unique marker; context B — a fresh browser.newContext() with clean storage — reopens the deployed URL, logs in with the same credentials, and asserts the marker is visible. Because B shares no cookies or localStorage with A, a localStorage-only app fails while a real-backend app passes (a distinction F2 can't make, since reload preserves localStorage). passed requires the marker to cross sessions; partial credit = fraction of steps succeeded. Additive weight 7: null on non-backend submissions, reflows in at 7/115 ≈ 6% of Functional on backend submissions.
C1 ESLint density
Code Quality · composite 3.60% · in-dim 20%
ESLint with typescript-eslint recommended rules. Score decays linearly from 1.0 at 0 errors/1k LOC to 0 at 20+ errors/1k LOC. Source-only.
Runs eslint with typescript-eslint recommended + no-console:warn + no-debugger:error. issuesPer1k = (errors + 0.1 × warnings) / LOC × 1000. score = max(0, 1 - issuesPer1k / 20).
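The C1 formula as a small function, for illustration. The function name and the null result for an empty source tree are assumptions.

```typescript
// issuesPer1k = (errors + 0.1 × warnings) / LOC × 1000
// score = max(0, 1 - issuesPer1k / 20)
function eslintDensityScore(errors: number, warnings: number, loc: number): number | null {
  // Assumption: with no lines of code there is nothing to score.
  if (loc <= 0) return null;
  const issuesPer1k = ((errors + 0.1 * warnings) / loc) * 1000;
  return Math.max(0, 1 - issuesPer1k / 20);
}

// 10 errors and 20 warnings in 2,000 LOC: (10 + 2) / 2000 × 1000 = 6 issues per 1k
const eslintExample = eslintDensityScore(10, 20, 2000);
```

Because a warning is weighted at 0.1, ten warnings cost as much as one error.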
C2 TypeScript safety
Code Quality · composite 0.90% · in-dim 5%
tsc --noEmit --strict. 0 type errors = 1.0; decays to 0 at 20 errors/1k LOC. "Cannot find module" errors are filtered out. Source-only.
Finds tsconfig.json or tsconfig.app.json; falls back to a permissive inline config. Ignores missing-module errors common in AI-generated code without deps installed.
C3 Accessibility (axe-core)
Code Quality · composite 3.60% · in-dim 20%
axe-core WCAG 2.1/2.2 AA audit. Violations normalised per 1k DOM nodes; score = max(0, 1 − violationsPer1k / 50).
@axe-core/playwright with tags wcag2a, wcag2aa, wcag21a, wcag21aa, wcag22aa. violationsPer1k = violating nodes / total nodes × 1000; score decays to 0 at 50 violations per 1k nodes. Single-state scan (axe + Lighthouse catch only ~30–40% of true a11y issues).
C4 Lighthouse performance
Code Quality · composite 3.60% · in-dim 20%
Lighthouse performance score (mobile throttled, median of 3 runs). Composite of FCP, LCP, TBT, CLS, Speed Index.
v0.2: up to 3 runs, median of successful. Per-run 90 s timeout, overall 240 s. Tolerates ≥1 successful run; 0 → score null.
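A sketch of the "median of successful runs" rule. It assumes per-run scores on a 0–1 scale and, when exactly two runs succeed, averages them; the text does not say how an even count is handled, and the function name is hypothetical.

```typescript
// runScores holds one entry per Lighthouse run; null marks a failed or timed-out run.
function lighthouseMedian(runScores: Array<number | null>): number | null {
  const ok: number[] = [];
  runScores.forEach((s) => {
    if (s !== null) ok.push(s);
  });
  if (ok.length === 0) return null; // no successful run: score is null
  ok.sort((a, b) => a - b);
  const mid = Math.floor(ok.length / 2);
  // Assumption: with an even number of successes, average the two middle values.
  return ok.length % 2 === 1 ? ok[mid] : (ok[mid - 1] + ok[mid]) / 2;
}
```

Taking the median rather than the mean keeps one unusually slow or fast run from shifting the score when all three runs succeed.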
C5 Bundle payload (gzipped)
Code Quality · composite 0.90% · in-dim 5%
Gzipped JS + CSS payload transferred over the wire. Full marks ≤170 KB, linear decay to 0 at ≥1 MB. Passed if ≤350 KB.
Passive page.on("response") listener captures every script & stylesheet response during F1+F2 (Content-Length, falls back to body length). compressedMeasurement flagged true only when every response had a Content-Length. Source-tree fallback (uncompressed bytes, ≤150 KB → 1 MB) when no network capture is available, clearly labelled scoringSource: source-fallback.
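The network-capture thresholds translate into a clamped linear interpolation. The sketch reads "1 MB" as 1024 KB, which is an assumption, and uses hypothetical constant and function names; the source-fallback thresholds are not modelled.

```typescript
const FULL_MARKS_KB = 170;
const ZERO_KB = 1024; // assumption: "1 MB" read as 1024 KB
const PASS_KB = 350;

// Full marks at or below 170 KB, linear decay to 0 at 1 MB, passed at or below 350 KB.
function bundleScore(gzippedKb: number): { score: number; passed: boolean } {
  const raw = (ZERO_KB - gzippedKb) / (ZERO_KB - FULL_MARKS_KB);
  const score = Math.min(1, Math.max(0, raw));
  return { score, passed: gzippedKb <= PASS_KB };
}
```

With these constants the halfway point sits at 597 KB, so a payload that just meets the 350 KB pass threshold still scores well above 0.5.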
C6 AST complexity
Code Quality · composite 0.90% · in-dim 5%
Cognitive complexity via eslint-plugin-sonarjs. Functions exceeding threshold 15 are flagged.
score = max(0, 1 - violationsPer1k / 10). Source-only.
C7 Maintainability judge
Code Quality · composite 2.70% · in-dim 15%
LLM judge over a sampled source excerpt scoring maintainability on 5 criteria, each rated 1–5: naming, separation of concerns, component reuse, prop typing, secret handling.
Up to 12 source files sampled (entry → components/hooks/features/pages → other). Excerpt capped at 12 KB. Source-only.
C8 Clean install
Code Quality · composite 0.90% · in-dim 5%
Strict frozen install (npm ci / pnpm / yarn / bun) from a clean checkout. Graded: 0 if nothing installs, else 1.0 docked for lockfile-hygiene defects, floored at 0.5. Source-only.
Detects every committed lockfile (npm/pnpm/yarn/bun) and runs its strict frozen install (--ignore-scripts) in a fresh temp dir, 240 s each. Score: 0 if no lockfile installs cleanly; otherwise starts at 1.0 and deducts for hygiene defects — duplicate lockfiles −0.15, out-of-sync −0.20, damaged −0.20, broken private-registry sibling −0.20 — floored at 0.5. Missing package.json or no manager on PATH → null.
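The grading rule can be sketched as follows, assuming each hygiene defect is deducted at most once. The `InstallResult` shape is hypothetical, and the null case (missing package.json or no manager on PATH) is left out.

```typescript
// Outcome of the install sweep (hypothetical shape).
interface InstallResult {
  anyCleanInstall: boolean; // at least one lockfile installed cleanly
  duplicateLockfiles: boolean; // deducts 0.15
  outOfSync: boolean; // deducts 0.20
  damaged: boolean; // deducts 0.20
  brokenPrivateRegistry: boolean; // deducts 0.20
}

function cleanInstallScore(r: InstallResult): number {
  if (!r.anyCleanInstall) return 0; // nothing installs: hard zero
  let score = 1.0;
  if (r.duplicateLockfiles) score -= 0.15;
  if (r.outOfSync) score -= 0.2;
  if (r.damaged) score -= 0.2;
  if (r.brokenPrivateRegistry) score -= 0.2;
  return Math.max(0.5, score); // hygiene defects alone never drop below 0.5
}
```

The floor means a project that installs cleanly from at least one lockfile always outscores one that does not install at all, however untidy its lockfiles are.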
C9 SEO hygiene
Code Quality · composite 0.90% · in-dim 5%
Deterministic DOM checks: title length, meta description, canonical URL, OG tags, html[lang], heading hierarchy.
Configurable per-prompt via seoApplicable. Checks title (10–70 chars), meta description (50–300 chars), canonical, og:title/description/type, twitter:card, json-ld, lang, heading hierarchy, robots.txt, sitemap.xml.
V1 MLLM visual judge
Visual · composite 13.20% · in-dim 55%
MLLM visual judge (Gemini 2.5 Pro via OpenRouter). 8 criteria scored 1–5. Normalised to 0–1.
Defaults: visual hierarchy, typography, colour harmony, whitespace, brand fit, CTA prominence, mobile layout, overall polish. v0.2 also runs 3 copy-quality checks unless placeholder_copy: true. Per-prompt extras supported.
V2 Design heuristics
Visual · composite 7.20% · in-dim 30%
Deterministic in-browser design heuristics, 8 checks across layout and CSS conventions.
Layout: whitespace ≥25%, WCAG AA contrast on ≥80% text, ≥80% text ≥14px, ≥70% blocks ≤85ch. CSS conventions: ≥80% box-sizing:border-box, prefers-reduced-motion media query, ≥5 CSS custom properties, ≥1 :focus-visible rule. Skips CORS-blocked stylesheets.
V4 Responsive design
Visual · composite 3.60% · in-dim 15%
Playwright viewport tests at 360×800, 768×1024, 1440×900. No horizontal overflow + mobile touch targets ≥44 px.
New browser context per viewport. 4 checks total. score = passed / total; passed if ≥0.75.
S1 Secrets + deployed headers
Security · composite 4.40% · in-dim 40%
Source secret scan (regex + Semgrep + trufflehog when available) ⊕ deployed HTTP header audit (CSP, HSTS, X-Content-Type-Options, X-Frame-Options, Referrer-Policy, Permissions-Policy).
Sub-1: secrets — built-in 8-pattern regex (always on), Semgrep p/secrets+p/owasp-top-ten (if installed), trufflehog filesystem (if installed); findings unioned, any match = 0. Sub-2: header audit, score = passed/6. Final = mean of whichever ran.
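How the two sub-scores combine can be sketched like this. The function name and the use of null to mark a sub-check that did not run are assumptions made for the example.

```typescript
// secretFindings: count of unioned findings, or null if no source scan ran.
// headersPassed: how many of the 6 audited headers passed, or null if no header audit ran.
function secretsAndHeadersScore(
  secretFindings: number | null,
  headersPassed: number | null
): number | null {
  const parts: number[] = [];
  if (secretFindings !== null) parts.push(secretFindings === 0 ? 1 : 0); // any match = 0
  if (headersPassed !== null) parts.push(headersPassed / 6);
  if (parts.length === 0) return null;
  // Final = mean of whichever sub-scores ran.
  return parts.reduce<number>((a, b) => a + b, 0) / parts.length;
}
```

Under this reading, a submission with a leaked secret and all six headers present scores 0.5, the same as one with no secrets found and no header audit passing half its checks.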
S2 Client-security anti-patterns
Security · composite 3.85% · in-dim 35%
Client-side security anti-patterns general scanners miss: Supabase service-role keys in client code, RLS off, JWT decode without verify, Firebase test mode, hardcoded admin creds, XSS sinks, insecure transport, sensitive logging.
Severity-weighted: critical=10, high=5, medium=2 pts; score = max(0, 1 - penalty / 20); passed = no critical/high findings. 16 patterns across Supabase, Firebase, JWT, Stripe, OpenAI, generic third-party keys, hardcoded admin creds, password reset without token, plus v0.2 secure-by-default sinks (dangerouslySetInnerHTML without sanitizer, disabled TLS verification, secret/req logging). Client-side-only patterns are scoped to client paths; PUBLISHABLE/ANON/PUBLIC key names and vendor-generated files are excluded. (Registry key s2 unchanged; historically the "auth-pattern check".)
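The severity weighting can be illustrated with a short sketch that takes a list of finding severities. The type and function names are hypothetical; pattern matching and path scoping are not shown.

```typescript
type Severity = "critical" | "high" | "medium";

// Penalty points per finding, as listed above.
const SEVERITY_POINTS: Record<Severity, number> = { critical: 10, high: 5, medium: 2 };

// score = max(0, 1 - penalty / 20); passed = no critical or high findings.
function antiPatternScore(findings: Severity[]): { score: number; passed: boolean } {
  const penalty = findings.reduce<number>((sum, s) => sum + SEVERITY_POINTS[s], 0);
  const passed = findings.every((s) => s === "medium");
  return { score: Math.max(0, 1 - penalty / 20), passed };
}
```

Two critical findings are enough to reach the 20-point cap, while medium findings lower the score without failing the check.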
S3 Dependency vulnerabilities
Security · composite 2.75% · in-dim 25%
npm audit CVE count from the source lockfile (high/critical filtered). critical×10 + high×3 + moderate×1 + low×0.1; score = max(0, 1 − penalty/20). Passed if 0 critical and 0 high.
Finds package.json (root or one level down). If no lockfile, generates one via npm install --package-lock-only in a temp dir, then runs npm audit --json --omit=dev. Source-only.
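A sketch of the S3 penalty arithmetic over per-severity counts. The `AuditCounts` shape is hypothetical; it assumes the counts have already been read from the audit report, for instance from the severity totals that recent npm versions include in their JSON output.

```typescript
// Vulnerability counts per severity (hypothetical shape).
interface AuditCounts {
  critical: number;
  high: number;
  moderate: number;
  low: number;
}

// penalty = critical×10 + high×3 + moderate×1 + low×0.1
// score = max(0, 1 − penalty / 20); passed if 0 critical and 0 high.
function dependencyScore(c: AuditCounts): { score: number; passed: boolean } {
  const penalty = c.critical * 10 + c.high * 3 + c.moderate * 1 + c.low * 0.1;
  return {
    score: Math.max(0, 1 - penalty / 20),
    passed: c.critical === 0 && c.high === 0,
  };
}

// 1 high, 4 moderate, 10 low: penalty = 3 + 4 + 1 = 8, score = 1 − 8/20 = 0.6
const auditExample = dependencyScore({ critical: 0, high: 1, moderate: 4, low: 10 });
```

Score and pass status are independent here: the example above keeps a score of 0.6 but fails because of the single high-severity finding.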
S4 Backend security probes
Security · backend track · v0.3 · composite 1.83% (backend) · in-dim 15% add-on
Backend track: read-only runtime probes against the deployed backend. An unauthenticated request and a cross-user request must not return another user's data. Catches the canonical "RLS off" leak that S2 only infers from client code. Null on frontend-only tools.
Runs only with a backend block carrying two accounts; credentials are the only input. Signs in as user B via the real login form, seeds a uniquely-marked record, and auto-discovers B's data endpoint (largest record array, GET or POST/RPC). Unauth probe: replays B's request with auth stripped from body + headers → must be rejected and serve no data. Cross-user probe: signs in as user A, captures A's own data response → B's marker must not appear. Each failed probe = 10 penalty pts; score = max(0, 1 − penalty/20); passed = no failed probes; details.crossTenantLeak is the headline. Complements rather than replaces S1/S2/S3. Additive weight 15: null on non-backend submissions, reflows in at 15/115 ≈ 13% of Security on backend submissions.