Verity, evaluated.
Two independent test series — one on reasoning, one on prose quality — using the same model, the same evaluator, and a single variable: Verity on or off.
The harder the problem, the more Verity delivers.
We ran Claude Sonnet 4.5, with extended thinking disabled, through three escalating reasoning benchmarks (A, B, and C), once without Verity and once with it. The gap widens as the benchmarks get harder.
How the gap grows with difficulty
Baseline performance degrades significantly under high reasoning load. Verity holds. The shaded area shows the growing benefit — widening as the test gets harder.
Score gain attributable to Verity — by difficulty tier
All other variables are held constant: same model, same questions, same evaluator. The score difference at each tier is therefore attributable to Verity, and it scales with reasoning load.
C-test: dimension breakdown
Five independent dimensions, each scored out of 150. Verity lifts every one — the largest gains in compliance and clarity.
The A and B tests establish a strong baseline (95.5% and 91.7%), confirming that the model is not weak to begin with. The C-test is designed to expose structural reasoning failures under sustained, multi-layered load, which is where Verity's scaffold exerts the most corrective force.
Consistent analytical voice across every domain.
Reasoning quality is one dimension. Prose quality is another. We evaluated Verity's writing across four domains — philosophy, science, practical instruction, and long-form argument — using an eight-criterion rubric scored out of 40.
Voice quality across domains — baseline vs. Verity
The base model's prose quality varies with domain formality — strongest on philosophy, weakest on practical instruction. Verity holds steady across all four. The shaded area shows where it is doing corrective work.
Criterion breakdown — mean scores across all sets
Eight rubric criteria, each scored 1–5, with means taken across all four test sets. Tone discipline reaches a perfect Verity mean of 5.0. The smallest gain is in LLM artifact suppression, which partially degrades in the long-form set: there the output reverts to bold headers, reintroducing a structural artifact.
The voice difference score increases as domain formality decreases (A: +8, B: +11, C: +16), then contracts at long-form (D: +6). Verity does the most corrective work where the base model's defaults are weakest. The base model's scores span 9 points across domains (20–29); Verity's span just 4 (33–37).
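The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted above. The variable names are illustrative and are not part of any Verity tooling, and the per-domain raw scores are not reproduced here because the text gives only the gains and the score ranges.

```python
# Figures quoted in the text; all names below are illustrative assumptions.
gains = {"A": 8, "B": 11, "C": 16, "D": 6}  # Verity minus baseline, per test set
base_low, base_high = 20, 29                # baseline score range across domains
verity_low, verity_high = 33, 37            # Verity score range across domains
criteria, max_per_criterion = 8, 5          # rubric: eight criteria, each scored 1-5

# Maximum possible rubric total (the "scored out of 40" figure).
rubric_max = criteria * max_per_criterion

# Spread of scores across domains for each condition.
base_spread = base_high - base_low
verity_spread = verity_high - verity_low

# Test set where Verity adds the most.
largest_gain_set = max(gains, key=gains.get)

print(f"rubric maximum: {rubric_max}")
print(f"baseline spread: {base_spread}, Verity spread: {verity_spread}")
print(f"largest gain: set {largest_gain_set} (+{gains[largest_gain_set]})")
```

The spreads come straight from the quoted ranges, so the sketch confirms the stated 9-point and 4-point figures are internally consistent; it does not validate the underlying scores.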