
Every AI benchmark that collapses creative quality into a single number is making an editorial decision before the evaluation even begins. It decides that disagreement is error, that taste is noise, and that the goal is a ranking. Contra Labs built a framework that refuses all three assumptions.
The Human Creativity Benchmark (HCB), published April 30, 2026, draws on Contra's network of over 1.5 million independent professional creatives who have earned over $250 million on the platform. A selected group of evaluators across five domains assessed AI-generated outputs at three phases of the creative process, producing roughly 15,000 individual judgments. The central argument is structural: creative evaluation produces two distinct signals, convergence and divergence, and flattening them into one score destroys the most actionable information in the dataset.
Two Signals, Not One
Convergence emerges when evaluators agree. It surfaces shared professional standards around composition, legibility, hierarchy, and technical correctness. These are verifiable, stable, and the right target for training models to produce reliable output. Divergence emerges when evaluators disagree, not because one of them is wrong, but because the work has cleared the competence threshold and the remaining question is purely one of taste. Aesthetic direction, mood, and conceptual risk are legitimately distributed across professionals. Smoothing those differences into a consensus score, and optimizing models against it, produces exactly the generic output creative teams already find unusable.
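The HCB's own machinery is richer than this, but the distinction can be shown with a minimal sketch: report the central tendency of evaluator scores alongside their spread instead of collapsing both into one number. The function name and toy scores below are illustrative, not the benchmark's implementation.

```python
import statistics

def summarize(scores: list[float]) -> dict[str, float]:
    """Keep both signals instead of one flattened score:
    the mean tracks convergence on shared standards,
    the spread tracks divergence of taste."""
    return {
        "consensus": statistics.mean(scores),
        "dispersion": statistics.stdev(scores),
    }

# Two outputs with the same mean score but very different evaluator spread.
print(summarize([4, 4, 4, 4, 4]))  # agreement: a convergence signal
print(summarize([5, 3, 5, 3, 4]))  # same mean, split taste: a divergence signal
```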
The Phase Problem No Model Has Solved
No model leads all three phases in any domain. That finding alone reframes how agencies should think about tool selection. The benchmark structures creative work into ideation, mockup, and refinement, and the models that excel at open-ended generation consistently struggle when asked to iterate on existing work. Claude Opus 4.6 leads ideation in landing pages with strong visual hierarchy and layout coherence, but Gemini 3.1 Pro Preview takes over at mockup with a 68.9% win rate and the highest Usability scalar in that phase (4.03), because design-system constraints reward fidelity over invention. By refinement, the field compresses: all four models cluster between 3.9 and 4.4 across scalar dimensions, and preference returns to taste.
Where Agreement Actually Lives
The benchmark uses Kendall's W to quantify evaluator agreement at each phase. Ad images produce the clearest convergence arc in the dataset, with agreement rising at every transition, from 0.345 to 0.436 to 0.549. The reason is that refinement in ad design reduces to verifiable criteria: is the typography legible, is the CTA placed correctly, does the contrast hold? Evaluators reach those answers without coordination. Landing pages run in the opposite direction, moving from 0.484 to 0.293 to 0.333, because once all outputs become acceptable under design-system constraints, personal judgment takes over and agreement collapses. Visual appeal produces more evaluator disagreement than prompt adherence across every domain, and that gap is informative rather than problematic.
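Kendall's W has a standard closed form for untied rankings; the sketch below computes it from a matrix of evaluator rankings. This is a textbook implementation for illustration, not the benchmark's published code, and the toy rankings are invented.

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's coefficient of concordance for an (m, n) matrix:
    m evaluators each ranking the same n outputs (1 = best), no ties.
    Returns a value in [0, 1]; higher means stronger agreement."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)                   # total rank each output received
    mean_sum = m * (n + 1) / 2                      # expected total if evaluators shared no preference
    s = float(((rank_sums - mean_sum) ** 2).sum())  # squared deviation from that baseline
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Toy example: three evaluators ranking four ad-image outputs.
rankings = np.array([
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
])
print(round(kendalls_w(rankings), 3))  # ~0.778: substantial agreement
```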
Generation Strength as Iteration Weakness
Veo 3.1 is the only model in the product video domain that degrades across all three phases on every measured dimension. It leads ideation at 61.1% but introduces new elements rather than applying targeted edits when the task shifts to refinement. Its realism sentiment ratio falls from +6 in ideation to -3 in refinement. Grok Imagine Video moves in the opposite direction, from -15 to +20 across the same span, eventually leading refinement at 56.5%. The benchmark's epistemic network analysis maps this directly: Veo 3.1's evaluation profile clusters around generation-quality themes, while Grok Imagine's clusters around production-fidelity themes like scene coherence. The creative strength that makes a model excellent at first-draft generation is structurally the same property that makes it unreliable for targeted iteration.
Usability as a Hard Gate
In ad image evaluation, the data reveals that evaluators do not make holistic judgments. They follow a decision hierarchy. Usability acts as a hard gate: outputs scoring one on usability reach the top two positions only 10% of the time, rising to 22% at a score of two and 36% at a score of three, regardless of visual quality. Once that threshold is cleared, prompt adherence becomes the primary ordering criterion. Visual appeal functions as a tiebreaker. High visual appeal cannot rescue low prompt adherence. This hierarchy explains why GPT Image 1.5 leads ideation and mockup but drops to third by refinement, while Seedream 4.5 climbs from third in ideation to first by refinement.
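Read as a sort procedure, that hierarchy is easy to state. The sketch below is one hedged rendering of it; the gate threshold, field names, and example scores are assumptions for illustration, not values taken from the benchmark.

```python
from dataclasses import dataclass

@dataclass
class AdOutput:
    name: str
    usability: int         # 1-5 scalar
    prompt_adherence: int  # 1-5 scalar
    visual_appeal: int     # 1-5 scalar

USABILITY_GATE = 3  # assumed threshold; the reported top-two odds keep rising through a score of three

def rank(outputs: list[AdOutput]) -> list[AdOutput]:
    """Order candidates the way the evaluation data suggests evaluators do:
    usability gates first, prompt adherence orders the survivors,
    visual appeal only breaks ties."""
    return sorted(
        outputs,
        key=lambda o: (o.usability >= USABILITY_GATE, o.prompt_adherence, o.visual_appeal),
        reverse=True,
    )

candidates = [
    AdOutput("striking-but-off-brief", usability=4, prompt_adherence=2, visual_appeal=5),
    AdOutput("plain-but-on-brief",     usability=4, prompt_adherence=5, visual_appeal=3),
    AdOutput("unusable",               usability=1, prompt_adherence=5, visual_appeal=5),
]
print([o.name for o in rank(candidates)])  # on-brief work wins; high appeal cannot rescue the rest
```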
The benchmark's practical value for teams using AI in production workflows is that it names the mechanism behind a frustration most creative professionals already feel. Models that look impressive in demos often fail during iteration. The HCB shows this is not a product quality issue or a prompting failure. It is a structural property of what different models optimize for.
If agencies begin selecting models by phase rather than by overall capability score, the downstream effect on workflow design could be significant. A studio might route ideation tasks to Claude Opus 4.6, hand mockup to Gemini 3.1 Pro Preview, and finish in refinement with Grok Imagine Video or Seedream 4.5, treating model selection as a craft decision rather than a procurement one. Whether that level of orchestration becomes standard practice may depend on how quickly tooling catches up to the insight the HCB has now put on paper.
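Sketched as configuration, that orchestration is small. The routing table below simply encodes the domain and phase leaders quoted above; the string identifiers, function, and fallback are hypothetical, and a real pipeline would also carry prompts, assets, and handoff state between phases.

```python
# Hypothetical (domain, phase) -> model routing table built from the HCB
# results discussed above; the identifiers are illustrative.
ROUTES = {
    ("landing_page", "ideation"):    "claude-opus-4.6",
    ("landing_page", "mockup"):      "gemini-3.1-pro-preview",
    ("ad_image", "refinement"):      "seedream-4.5",
    ("product_video", "ideation"):   "veo-3.1",
    ("product_video", "refinement"): "grok-imagine-video",
}

def pick_model(domain: str, phase: str, default: str = "claude-opus-4.6") -> str:
    """Treat model selection as a per-phase craft decision:
    return the benchmark leader for this domain and phase, else a default."""
    return ROUTES.get((domain, phase), default)

print(pick_model("product_video", "refinement"))  # grok-imagine-video
```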