WebAppBench.

The open benchmark for AI-generated web apps.

A reproducible harness that scores every tool on the same prompt across four dimensions: functional correctness, code quality, visual design, and security. The composite is a weighted mean of the dimension scores, reported out of 100; higher is better.

Benchmark month: June 2026 (2/2) | Benchmark: v0.3 | Scored runs: 9

Dimensions

The composite is a weighted mean of dimension scores; each dimension is a weighted mean of its scorers. When a scorer is N/A, its weight is redistributed within its dimension; when a whole dimension is empty, its weight is redistributed across the remaining dimensions.
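The redistribution rule can be sketched as a weighted mean that skips missing entries and renormalizes the remaining weights. This is a minimal sketch, not the harness's actual code: it assumes the freed weight is shared proportionally, and the function name `weighted_mean`, the use of `None` for N/A, and the scorer names and weights in the example are all hypothetical.

```python
from typing import Mapping, Optional


def weighted_mean(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted mean over the entries that have a score.

    Entries whose score is None (N/A) are dropped and the remaining
    weights are renormalized (assumed proportional redistribution).
    Returns None when nothing is scored, so an empty dimension can be
    treated as N/A one level up.
    """
    present = {k: s for k, s in scores.items() if s is not None}
    total_weight = sum(weights[k] for k in present)
    if total_weight == 0:
        return None
    return sum(s * weights[k] for k, s in present.items()) / total_weight


# Hypothetical scorers inside one dimension (names and weights are made up).
scorer_weights = {"lint": 0.5, "types": 0.3, "bundle_size": 0.2}
scorer_scores = {"lint": 80.0, "types": None, "bundle_size": 60.0}

# "types" is N/A, so its 0.3 is shared by the other two scorers:
# (80 * 0.5 + 60 * 0.2) / 0.7
dimension_score = weighted_mean(scorer_scores, scorer_weights)

# A dimension with no scored entries comes back as None (N/A).
empty_dimension = weighted_mean({"lint": None}, {"lint": 1.0})
```

The same function can be applied one level up, passing dimension scores and dimension weights, so that an empty dimension drops out of the composite in the same way.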

Functional (47%)

Does the page actually render, satisfy acceptance criteria, and honour verbatim instructions from the prompt?

Code Quality (18%)

Lint, types, accessibility, performance, bundle size, complexity, maintainability, install hygiene, SEO.

Visual (24%)

MLLM judge over screenshots plus deterministic design heuristics and responsive viewport tests.

Security (11%)

Source-side secret/auth-pattern scans, deployed HTTP headers, CVE audit, and (backend track) live cross-tenant probes.

Leaderboard


#	Tool	Model	Composite	Functional	Code Quality	Visual	Security	Cost	Time
01	v0	v0 Auto	90.8	99	98	73	83	$4.82	7m 51s
02	manus	Undisclosed	88.1	99	77	78	83	$3.94	16m 03s
03	modelence	Claude Opus 4.8	88.1	99	83	74	81	$2.41	6m 12s
04	replit	Undisclosed	87.4	97	75	80	86	$2.17	9m 10s
05	lovable	Undisclosed	86.5	99	90	71	61	$0.55	2m 42s
06	bolt	Standard	84.2	89	92	70	84	$0.25	3m 23s
07	anything	Default	84.1	100	78	73	51	$0.22	1m 43s
08	emergent	Claude Opus 4.7	83.9	94	76	82	61	$1.88	8m 16s
09	base44	Automatic	83.1	96	72	71	73	$0.24	1m 51s
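As a rough check, the composite can be recomputed from a row's dimension scores and the published weights. The sketch below assumes the four score columns after the composite are in the order Functional, Code Quality, Visual, Security; the names `WEIGHTS` and `composite` are illustrative, not taken from the harness.

```python
# Dimension weights as published in the Dimensions section.
WEIGHTS = {
    "functional": 0.47,
    "code_quality": 0.18,
    "visual": 0.24,
    "security": 0.11,
}


def composite(dimension_scores: dict[str, float]) -> float:
    """Weighted mean of the four dimension scores (the weights sum to 1)."""
    return sum(WEIGHTS[name] * score for name, score in dimension_scores.items())


# Rounded dimension scores for the top row (v0), under the assumed column order.
v0 = {"functional": 99, "code_quality": 98, "visual": 73, "security": 83}

# 0.47*99 + 0.18*98 + 0.24*73 + 0.11*83 is about 90.82,
# which agrees with the published composite of 90.8.
v0_composite = composite(v0)
```

Other rows recomputed this way can differ from the published composite by a few tenths, presumably because the table shows rounded dimension scores.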