Live benchmark

Your models don't have a skill issue. You do.

Start promptmaxxing.

PromptGolf benchmarks the spec writers, not the models. Write a spec, watch Agnes AI build it live, and let the hidden tests decide whether you actually know your domain.

full-stack checkout · ecommerce · par 10
score: 66
hidden: 3/10
strokes: 1
  • cents math: $140.18 ≠ $140.17 · fail
  • promo normalize: spaces + case · fail
  • discount floor: below zero · fail
  • shipping threshold: wrong base · fail
  • out-of-stock block: ships anyway · fail
  • double-submit lock: 2 orders · fail
  • quantity bounds: qty → −1 · fail
  • invalid code error · pass
  • loading + error · pass
  • mobile + a11y · pass

Looks done in the browser. Seven hidden checks fail because the prompt never named the product rules.
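Naming the product rules means stating them as precisely as code would. The TypeScript sketch below shows one way to pin down most of the failing checks. Every constant (the `SAVE10` code, the $50.00 free-shipping threshold, the $5.99 fee, the 99-item cap) and the choice to apply the shipping threshold after the discount are illustrative assumptions, not PromptGolf's hidden rules. Money stays in integer cents, since binary floating-point arithmetic is one common source of off-by-a-cent totals.

```typescript
// Hypothetical checkout rules. Every constant here is an illustrative assumption.
type LineItem = { priceCents: number; qty: number; stock: number };

const MAX_QTY = 99; // assumed per-line quantity cap
const FREE_SHIPPING_CENTS = 5000; // assumed threshold: $50.00
const SHIPPING_FEE_CENTS = 599; // assumed flat fee: $5.99
const PROMOS = new Map<string, number>([["SAVE10", 1000]]); // code -> flat discount in cents

// Promo normalization: strip all whitespace and uppercase, so " save 10 " matches "SAVE10".
function normalizePromo(code: string): string {
  return code.replace(/\s+/g, "").toUpperCase();
}

function validateItems(items: LineItem[]): void {
  for (const item of items) {
    // Quantity bounds: whole numbers from 1 to MAX_QTY only.
    if (!Number.isInteger(item.qty) || item.qty < 1 || item.qty > MAX_QTY) {
      throw new RangeError(`quantity out of bounds: ${item.qty}`);
    }
    // Out-of-stock block: refuse the order instead of shipping anyway.
    if (item.qty > item.stock) {
      throw new Error("out of stock");
    }
  }
}

function orderTotalCents(items: LineItem[], promo: string): number {
  validateItems(items);
  // Integer cents throughout, so float rounding cannot drift the total by a cent.
  const subtotal = items.reduce((sum, item) => sum + item.priceCents * item.qty, 0);
  const code = normalizePromo(promo);
  const discount = code === "" ? 0 : PROMOS.get(code);
  if (discount === undefined) {
    throw new Error(`invalid promo code: "${promo}"`);
  }
  // Discount floor: a discount larger than the subtotal yields 0, never a negative total.
  const discounted = Math.max(0, subtotal - discount);
  // Assumed shipping base: the discounted subtotal, not the pre-discount one.
  const shipping = discounted >= FREE_SHIPPING_CENTS ? 0 : SHIPPING_FEE_CENTS;
  return discounted + shipping;
}

// Format non-negative integer cents without going through floating-point dollars.
function formatCents(cents: number): string {
  const dollars = Math.floor(cents / 100);
  const rest = String(cents % 100).padStart(2, "0");
  return `$${dollars}.${rest}`;
}

const cart: LineItem[] = [{ priceCents: 4599, qty: 2, stock: 5 }];
console.log(formatCents(orderTotalCents(cart, " save10 "))); // prints "$81.98"
```

Each rule a spec leaves out becomes a guess the builder has to make; writing the shipping base down as one explicit line is the difference between "wrong base" and a pass.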

01

Agnes builds your spec

Agnes 2.0 Flash generates a real app from your prompt, exactly what you asked for and nothing you did not.

02

Sandbox serves it live

The build runs through an isolated sandbox path and comes back as an interactive app preview.

03

Playwright scores it

Hidden Playwright checks tear the live app apart. Vague specs ship bugs; precise specs survive.
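As a rough illustration of what a single hidden check asserts, here is the double-submit case as a sketch. A real check would drive the live app through Playwright; this version swaps in a hypothetical in-memory `Checkout` class so it runs without a browser, and the 10 ms delay is an assumed stand-in for network latency.

```typescript
// Hypothetical in-memory checkout standing in for the live app; not PromptGolf's harness.
type Order = { id: number };

class Checkout {
  readonly orders: Order[] = [];
  private inFlight: Promise<Order> | null = null;

  // Double-submit lock: while a submission is pending, further clicks
  // get the same pending promise instead of creating a second order.
  submit(): Promise<Order> {
    if (this.inFlight) return this.inFlight;
    this.inFlight = (async () => {
      await new Promise<void>((resolve) => setTimeout(() => resolve(), 10)); // simulated latency
      const order: Order = { id: this.orders.length + 1 };
      this.orders.push(order);
      return order;
    })().finally(() => {
      this.inFlight = null; // release the lock once the request settles
    });
    return this.inFlight;
  }
}

// The check itself: two submits fired back to back, like a fast double click.
async function doubleSubmitCheck(app: Checkout): Promise<boolean> {
  await Promise.all([app.submit(), app.submit()]);
  return app.orders.length === 1;
}

doubleSubmitCheck(new Checkout()).then((ok) => console.log(ok ? "pass" : "fail")); // prints "pass"
```

Dropping the `inFlight` guard makes both calls create an order, and the same check reports `fail`: the app still looks finished, but the spec never asked for the lock.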