— performance research —

Verity, evaluated.

Two independent test series — one on reasoning, one on prose quality — using the same model, the same evaluator, and a single variable: Verity on or off.

The harder the problem,
the more Verity delivers.

We ran Claude Sonnet 4.5 — no extended thinking — through three escalating reasoning benchmarks, once without Verity and once with. The gap widens precisely where it matters most.

A-test · easy: 716 → 736 / 750 (+2.8%)
B-test · medium: 688 → 737 / 750 (+7.1%)
C-test · hard: 432 → 745 / 750 (+72.5%)

How the gap grows with difficulty

Baseline performance degrades sharply under high reasoning load, falling from 716 on the A-test to 432 on the C-test. With Verity, scores hold between 736 and 745. The shaded area shows the benefit, which widens as the test gets harder.

With Verity: A 736 · B 737 · C 745
Baseline (no Verity): A 716 · B 688 · C 432
All scores out of 750.

Score gain attributable to Verity — by difficulty tier

All other variables held constant: same model, same questions, same evaluator. This is precisely what Verity adds at each tier. The benefit scales with reasoning load.

Verity adds 20 points on easy, 49 on medium, 313 on hard, out of a maximum of 750.
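The gains above follow directly from the reported totals. The sketch below recomputes them in Python from the scores published on this page; the variable and function names are illustrative and not part of Verity.

```python
# Reported totals out of 750, taken from the figures on this page.
baseline = {"A": 716, "B": 688, "C": 432}
with_verity = {"A": 736, "B": 737, "C": 745}


def gains(before, after):
    """Return {tier: (absolute gain, relative gain in %)}.

    The relative gain is measured against the baseline score
    and rounded to one decimal place.
    """
    result = {}
    for tier, base in before.items():
        delta = after[tier] - base
        result[tier] = (delta, round(100 * delta / base, 1))
    return result


tier_gains = gains(baseline, with_verity)
for tier, (points, percent) in tier_gains.items():
    print(f"{tier}-test: +{points} points (+{percent}% over baseline)")
```

The relative figures are taken against the baseline score rather than the 750-point maximum, which is how +20 points on the A-test corresponds to +2.8%.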

C-test: dimension breakdown

Five independent dimensions, each scored out of 150. Verity lifts every one — the largest gains in compliance and clarity.

Baseline → Verity (gain), each out of 150
Reasoning: 89 → 150 (+61)
Fairness: 98 → 148 (+50)
Calibration: 94 → 148 (+54)
Clarity: 78 → 149 (+71)
Compliance: 73 → 150 (+77)
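As a consistency check, the five dimension scores should add up to the C-test totals (432 baseline, 745 with Verity). The Python sketch below does that sum and ranks the dimensions by gain; the dictionary layout and names are illustrative only.

```python
# C-test dimension scores as (baseline, with Verity), each out of 150.
dimensions = {
    "Reasoning": (89, 150),
    "Fairness": (98, 148),
    "Calibration": (94, 148),
    "Clarity": (78, 149),
    "Compliance": (73, 150),
}

baseline_total = sum(base for base, _ in dimensions.values())
verity_total = sum(verity for _, verity in dimensions.values())

# Rank dimensions by gain, largest first.
ranked = sorted(
    ((name, verity - base) for name, (base, verity) in dimensions.items()),
    key=lambda item: item[1],
    reverse=True,
)

print(baseline_total, verity_total)
for name, gain in ranked:
    print(f"{name}: +{gain}")
```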
Methodology
Claude Sonnet 4.5 · no extended thinking
30 questions × 5 dimensions × 5-point scale = 750 max
Single variable: Verity on / off
A = easy · B = medium · C = hard
Independent evaluator · controlled conditions

The A and B tests establish a strong baseline — 95.5% and 91.7% — to confirm the model is not weak. The C-test is designed to expose structural reasoning failures under sustained, multi-layered load: precisely where Verity's scaffold exerts the most corrective force.
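The 750-point maximum and the baseline percentages quoted above can be reproduced from the methodology figures. This is a sketch in Python with illustrative names, and it assumes percentages are rounded to one decimal place.

```python
QUESTIONS = 30
DIMENSIONS = 5
SCALE_MAX = 5  # top of the 5-point scale

# Maximum total score, and maximum per dimension across all questions.
max_score = QUESTIONS * DIMENSIONS * SCALE_MAX
max_per_dimension = QUESTIONS * SCALE_MAX

# Baseline totals reported on this page.
baseline = {"A": 716, "B": 688, "C": 432}

# Baseline score as a percentage of the maximum, one decimal place.
baseline_percent = {
    tier: round(100 * score / max_score, 1) for tier, score in baseline.items()
}

print(max_score, max_per_dimension)
print(baseline_percent)
```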

Consistent analytical voice
across every domain.

Reasoning quality is one dimension. Prose quality is another. We evaluated Verity's writing across four domains — philosophy, science, practical instruction, and long-form argument — using an eight-criterion rubric scored out of 40.

Mean score / 40 across all sets: 25.3 → 35.5
Mean voice difference (series): +10.3
Domain range narrowed (points): 9 → 4
Set A · philosophy: 29 → 37 / 40 (+27.6%)
Set B · science: 25 → 36 / 40 (+44.0%)
Set C · practical: 20 → 36 / 40 (+80.0%)
Set D · long-form: 27 → 33 / 40 (+22.2%)

Voice quality across domains — baseline vs. Verity

The base model's prose quality varies with domain formality — strongest on philosophy, weakest on practical instruction. Verity holds steady across all four. The shaded area shows where it is doing corrective work.

With Verity: A 37 · B 36 · C 36 · D 33
Baseline (no Verity): A 29 · B 25 · C 20 · D 27
All scores out of 40.

Criterion breakdown — mean scores across all sets

Eight rubric criteria, each scored 1–5; the values below are means across all four test sets. Tone discipline reaches a perfect Verity mean of 5.00. The smallest gain is in LLM artifact absence, which partially degrades in long-form, where reversion to bold headers reintroduces a structural artifact.

Baseline → Verity (gain), mean score out of 5
Tone discipline: 3.25 → 5.00 (+1.75)
Explanatory guidance: 3.00 → 4.75 (+1.75)
Paragraph discipline: 3.00 → 4.75 (+1.75)
Reasoned conclusion: 3.00 → 4.25 (+1.25)
Structural clarity: 3.75 → 4.75 (+1.00)
Conceptual precision: 3.50 → 4.50 (+1.00)
Sentence rhythm: 2.75 → 3.75 (+1.00)
LLM artifact absence: 3.00 → 3.75 (+0.75)
Methodology
Claude Sonnet 4.5 · no extended thinking
8 criteria × 5-point scale = 40 max per set
15 prompts per set
A = philosophy · B = science · C = practical · D = long-form
Single variable: Verity on / off

The voice difference score increases as domain formality decreases (A: +8, B: +11, C: +16), then contracts at long-form (D: +6). Verity does the most corrective work where the base model's defaults are weakest. The base model ranges 9 points across domains (20–29); Verity ranges just 4 (33–37).
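The per-set differences, the headline means, and the two ranges can be recomputed from the set scores. The Python sketch below uses illustrative names. The exact baseline mean is 25.25 and the exact mean difference is 10.25, so the displayed 25.3 and +10.3 suggest half-up rounding; that rounding rule is an assumption made here, not something the methodology states.

```python
from decimal import Decimal, ROUND_HALF_UP

# Set scores out of 40, taken from the figures on this page.
baseline = {"A": 29, "B": 25, "C": 20, "D": 27}
with_verity = {"A": 37, "B": 36, "C": 36, "D": 33}


def half_up(value, places="0.1"):
    """Round a Decimal half-up (assumed display rule, see note above)."""
    return value.quantize(Decimal(places), rounding=ROUND_HALF_UP)


def mean(scores):
    """Exact mean of integer scores, returned as a Decimal."""
    return Decimal(sum(scores.values())) / Decimal(len(scores))


# Voice difference per set: Verity score minus baseline score.
voice_difference = {s: with_verity[s] - baseline[s] for s in baseline}
mean_difference = mean(voice_difference)

# Spread between the best and worst domain for each condition.
baseline_range = max(baseline.values()) - min(baseline.values())
verity_range = max(with_verity.values()) - min(with_verity.values())

print(voice_difference)
print(half_up(mean(baseline)), half_up(mean(with_verity)), half_up(mean_difference))
print(baseline_range, verity_range)
```

Python's built-in `round` uses round-half-to-even, so `round(25.25, 1)` would give 25.2 rather than the displayed 25.3. The sketch therefore uses `Decimal.quantize` with `ROUND_HALF_UP` instead.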