
WebAppBench.

The open benchmark for AI-generated web apps.

A reproducible harness that scores every tool on the same prompt across four dimensions: functional correctness, code quality, visual design, and security. The composite is a weighted mean of the dimension scores, reported out of 100; higher is better.

benchmark month: June 2026 (2/2) · benchmark: v0.3 · scored runs: 9
Dimensions

The composite is a weighted mean of dimension scores; each dimension is a weighted mean of its scorers. When a scorer is N/A its weight redistributes within its dimension; when a whole dimension is empty its weight redistributes across the rest.

Functional (47%)

Does the page actually render, satisfy acceptance criteria, and honour verbatim instructions from the prompt?

Code Quality (18%)

Lint, types, accessibility, performance, bundle size, complexity, maintainability, install hygiene, SEO.

Visual (24%)

MLLM judge over screenshots plus deterministic design heuristics and responsive viewport tests.

Security (11%)

Source-side secret/auth-pattern scans, deployed HTTP headers, CVE audit, and (backend track) live cross-tenant probes.
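The weighting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the function names are hypothetical, and it assumes that "redistributes" means the weight of an N/A entry is shared among the remaining entries in proportion to their own weights, which the page does not state explicitly. Only the four dimension weights come from the page.

```python
from typing import Mapping, Optional

# Dimension weights as published on the page (percent).
DIMENSION_WEIGHTS = {
    "functional": 47.0,
    "code_quality": 18.0,
    "visual": 24.0,
    "security": 11.0,
}


def weighted_mean(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted mean over the entries that actually have a score.

    Entries scored as None (N/A) are skipped. Dividing by the sum of the
    remaining weights redistributes the skipped weight proportionally
    (an assumption; the page only says the weight "redistributes").
    Returns None when no entry is scored.
    """
    present = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(weights[name] for name in present)
    if total_weight == 0:
        return None
    return sum(weights[name] * s for name, s in present.items()) / total_weight


def composite(dimension_scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Composite score from per-dimension scores (hypothetical helper)."""
    return weighted_mean(dimension_scores, DIMENSION_WEIGHTS)


if __name__ == "__main__":
    # Dimension scores for the top leaderboard entry (v0).
    v0 = {"functional": 99, "code_quality": 98, "visual": 73, "security": 83}
    print(round(composite(v0), 1))  # 90.8
```

The same `weighted_mean` function would apply one level down, combining the scorers inside a dimension, since the page describes both levels with the same rule. Recomputing a composite from the rounded dimension scores shown on the leaderboard can differ from the published figure by a few tenths, presumably because the published composite uses unrounded inputs.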

Leaderboard

Click any tool to see its full per-run scorecard with comments.

#   Tool       Model            Composite  Functional  Code Quality  Visual  Security  Cost   Time
01  v0         v0 Auto          90.8       99          98            73      83        $4.82  7m 51s
02  manus      Undisclosed      88.1       99          77            78      83        $3.94  16m 03s
03  modelence  Claude Opus 4.8  88.1       99          83            74      81        $2.41  6m 12s
04  replit     Undisclosed      87.4       97          75            80      86        $2.17  9m 10s
05  lovable    Undisclosed      86.5       99          90            71      61        $0.55  2m 42s
06  bolt       Standard         84.2       89          92            70      84        $0.25  3m 23s
07  anything   Default          84.1       100         78            73      51        $0.22  1m 43s
08  emergent   Claude Opus 4.7  83.9       94          76            82      61        $1.88  8m 16s
09  base44     Authomatic       83.1       96          72            71      73        $0.24  1m 51s
WebAppBench · open source · scores captured at submission time · URL results may change