PromptGolf / controlled agent benchmark

LeetCode for agentic prompting.

Same agent. Same task. Different human specification.

Everyone loves to benchmark models, but after seeing your prompts, I really ought to benchmark y'all instead. Hidden tests reveal whether the generated app survives production behavior.

Play the checkout challenge / Inspect the expert run

Challenge: Full Stack Ecommerce Checkout Web App

Builder: OpenAI gpt-5.4-mini

Behavior: Stored EvalSpecs materialized by Playwright

Scroll rig / same builder

Watch a prompt become an evaluated system.

The page art is schematic, but the product claim is literal: same task, same OpenAI builder path, same stored checks, different human specification.

01 Drafting grid: The task starts as a public brief: enough to build something plausible, not enough to survive product reality.

02 Specification unfolds: A stronger prompt turns the brief into operating rules, assumptions, acceptance criteria, and edge cases.

03 Artifact assembles: The same builder receives the human spec and produces a checkout app. The model is held constant.

04 Evaluator attaches: Stored EvalSpecs materialize as Playwright behavior checks. The score follows observable capability evidence.

05 Hidden checks reveal: Cents math, promo normalization, stock boundaries, double-submit safety, and mobile usability separate the field.

06 Score locks: Prompt diagnosis happens after scoring. It explains the gap; it never rewrites the generated result.

Spec → artifact → evaluator → hidden checks → score
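
Two of the hidden checks named in step 05, double-submit safety and stock boundaries, can be illustrated with a minimal Python sketch. It assumes the client sends an idempotency key with each submit; `OrderStore`, `submit`, and the key scheme are hypothetical names chosen for illustration and are not part of PromptGolf or its stored EvalSpecs.

```python
from dataclasses import dataclass, field


@dataclass
class OrderStore:
    """Hypothetical in-memory order store; every name here is illustrative."""

    stock: dict[str, int]
    orders: dict[str, dict] = field(default_factory=dict)

    def submit(self, idempotency_key: str, sku: str, quantity: int) -> dict:
        # Double-submit safety (assumed design): a repeated submit with the
        # same key returns the original order instead of creating a second
        # one or decrementing stock again.
        if idempotency_key in self.orders:
            return self.orders[idempotency_key]

        # Stock boundary: quantity must be at least 1 and no more than
        # the units currently on hand.
        available = self.stock.get(sku, 0)
        if quantity < 1 or quantity > available:
            raise ValueError(
                f"quantity {quantity} is outside 1..{available} for {sku}"
            )

        self.stock[sku] = available - quantity
        order = {"id": len(self.orders) + 1, "sku": sku, "quantity": quantity}
        self.orders[idempotency_key] = order
        return order
```

With a store holding three units, submitting the same key twice yields one order and leaves one unit in stock; a later request for two units is rejected. A spec that states this rule explicitly gives the builder something concrete to implement, which is the kind of detail a naive brief leaves out.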

Playwright behavior / hidden checks / score lock

Evidence model

It grades generated behavior, not prompt aesthetics.

The hidden-test thesis only works if the benchmark stays honest. Live runs use provider-backed build and preview boundaries; seeded references stay labeled as seeded references.

Behavior evidence

Examples, traces, and properties check observable capabilities.

Spec completeness

Requirement trees connect product claims to testable evidence.

Hidden checks

Private cases reward domain boundaries, not implementation resemblance.
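
One way to picture a requirement tree that connects product claims to testable evidence is the Python sketch below. The `Requirement` class, its fields, and the coverage count are assumptions made for illustration; the page does not describe how PromptGolf actually stores or scores requirement trees.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Requirement:
    """Hypothetical node: a product claim plus the checks that evidence it."""

    claim: str
    checks: list[str] = field(default_factory=list)
    children: list["Requirement"] = field(default_factory=list)


def leaves(node: Requirement) -> Iterator[Requirement]:
    # A leaf is a claim with no sub-requirements; only leaves need direct checks.
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def completeness(root: Requirement) -> tuple[int, int]:
    # Returns (leaf claims with at least one check, total leaf claims).
    all_leaves = list(leaves(root))
    covered = [leaf for leaf in all_leaves if leaf.checks]
    return len(covered), len(all_leaves)


checkout = Requirement(
    claim="Checkout computes a correct total",
    children=[
        Requirement("Totals use integer cents", checks=["cents-math"]),
        Requirement("Promo codes are normalized", checks=["promo-normalization"]),
        Requirement("Quantity respects stock limits"),  # no check attached yet
    ],
)
```

Here `completeness(checkout)` returns `(2, 3)`: two of the three leaf claims are tied to a check, and the third is a claim the spec makes without any evidence behind it.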

Seeded proof / checkout

One product. Three levels of specification.

Each seeded reference uses the same checkout challenge. The visible basics stay easy; hidden ecommerce rules create the separation.

Human-spec gap / seeded references

These authored reference records hold the stated build and evaluation conditions fixed.


Naive request

seeded reference prompt excerpt

“Build an ecommerce checkout web app with cart items, quantity changes, promo code, totals, and confirmation. Make it look nice.”

Seeded diagnosis: This seeded scorecard suggests the brief names the visible product while leaving money, inventory, and concurrency rules unspecified.

hidden survival: 3/10

prompt count: 1

Seeded artifact scenario

Happy-path checkout

Seeded reference: a complete-looking checkout with most hidden ecommerce behavior left unspecified.

Authored reference context, not a captured or live generated artifact.

Integer cents math: miss (not passed)
Promo normalization: miss (not passed)
Shipping threshold order: miss (not passed)
Double-submit prevention: miss (not passed)
Quantity boundaries: miss (not passed)
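
The money rules in this list can be made concrete with a short Python sketch. Everything specific in it is an assumption for illustration: the promo table, the 10% rate, the shipping fee, the free-shipping threshold, rounding the discount down, and the choice to compare the threshold against the discounted subtotal. The page does not say which ordering or values the hidden checks expect.

```python
# Hypothetical values; none of these come from the benchmark itself.
PROMOS = {"SAVE10": 10}               # promo code -> percent off
SHIPPING_CENTS = 599                  # flat shipping fee, in cents
FREE_SHIPPING_THRESHOLD_CENTS = 5000  # free shipping at or above this amount


def normalize_promo(code: str) -> str:
    # Promo normalization: ignore surrounding whitespace and letter case.
    return code.strip().upper()


def order_total_cents(lines: list[tuple[int, int]], promo_code: str = "") -> int:
    """Total in integer cents for (unit_price_cents, quantity) lines."""
    # Integer cents math: no floats, so no binary rounding drift.
    subtotal = sum(price * quantity for price, quantity in lines)

    percent = PROMOS.get(normalize_promo(promo_code), 0)
    discount = subtotal * percent // 100  # assumed rule: round discount down
    discounted = subtotal - discount

    # Shipping threshold order (assumed rule): the threshold is compared
    # against the discounted subtotal, not the pre-discount subtotal.
    if discounted >= FREE_SHIPPING_THRESHOLD_CENTS:
        shipping = 0
    else:
        shipping = SHIPPING_CENTS
    return discounted + shipping
```

Under these assumptions, two items at 2599 cents total 5198 cents with free shipping and no promo, but 5278 cents with ` save10 ` applied, because the discount drops the order below the threshold and shipping is charged. An app that checks the threshold before the discount returns a different number, which is why the order of operations belongs in the spec.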

Inspect reference scorecard

Naive: 1 prompt, 3/10 hidden

Structured: 2 prompts, 7/10 hidden

Expert: 1 prompt, 10/10 hidden

Final target

Domain knowledge beats vague confidence.

A one-shot prompt is not a paragraph. It is a compact engineering spec: assumptions, edge cases, validation, states, and the product rules the hidden checks are waiting for.

Write the spec