— performance research —

Verity, evaluated.

Two independent test series — one on reasoning, one on prose quality — using the same model, the same evaluator, and a single variable: Verity on or off.

I · Reasoning

The harder the problem,
the more Verity delivers.

We ran Claude Sonnet 4.5, with no extended thinking, through three escalating reasoning benchmarks, once without Verity and once with. The gap widens with difficulty, from +2.8% on the easy tier to +72.5% on the hard one.

A-test · easy

716 → 736 / 750

+2.8%

B-test · medium

688 → 737 / 750

+7.1%

C-test · hard ★

432 → 745 / 750

+72.5%
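The percentages on the three cards appear to be relative gains, measured against the baseline score rather than against the 750-point maximum. A minimal Python sketch that reproduces them from the published totals (the function and variable names are illustrative, not part of any Verity tooling):

```python
# Total scores out of 750, copied from the three tier cards above.
# Each pair is (baseline, with Verity).
scores = {
    "A (easy)": (716, 736),
    "B (medium)": (688, 737),
    "C (hard)": (432, 745),
}


def relative_gain(baseline: int, with_verity: int) -> float:
    """Percentage gain measured against the baseline score (assumed definition)."""
    return (with_verity - baseline) / baseline * 100


# Round to one decimal place, matching the precision shown on the cards.
gains = {tier: round(relative_gain(b, v), 1) for tier, (b, v) in scores.items()}

for tier, gain in gains.items():
    print(f"{tier}: +{gain}%")
```

Under that definition the sketch yields 2.8, 7.1, and 72.5, which matches the cards.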

How the gap grows with difficulty

Baseline performance degrades sharply under high reasoning load, falling from 716 on the A-test to 432 on the C-test. Verity holds between 736 and 745. The shaded area shows the benefit widening as the test gets harder.

With Verity · Baseline (no Verity)

Score gain attributable to Verity — by difficulty tier

All other variables were held constant: same model, same questions, same evaluator. The difference at each tier is therefore what Verity adds: +20 points on A, +49 on B, and +313 on C. The benefit scales with reasoning load.

C-test: dimension breakdown

Five independent dimensions, each scored out of 150. Verity lifts every one — the largest gains in compliance and clarity.

Dimension · Baseline → Verity · Gain

Reasoning · 89 → 150 · +61
Fairness · 98 → 148 · +50
Calibration · 94 → 148 · +54
Clarity · 78 → 149 · +71
Compliance · 73 → 150 · +77
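As a consistency check, the five dimension scores should add up to the C-test totals reported earlier (432 and 745). A short Python sketch, with illustrative variable names, that sums the breakdown and ranks the point gains:

```python
# (baseline, with Verity) per dimension, each out of 150,
# copied from the C-test breakdown above.
dimensions = {
    "Reasoning": (89, 150),
    "Fairness": (98, 148),
    "Calibration": (94, 148),
    "Clarity": (78, 149),
    "Compliance": (73, 150),
}

baseline_total = sum(b for b, _ in dimensions.values())
verity_total = sum(v for _, v in dimensions.values())

# Absolute point gain per dimension, then the two largest.
point_gains = {name: v - b for name, (b, v) in dimensions.items()}
top_two = sorted(point_gains, key=point_gains.get, reverse=True)[:2]

print(baseline_total, verity_total, top_two)
```

The totals agree with the C-test card, and the two largest gains are compliance and clarity, as stated.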

Methodology

Claude Sonnet 4.5 · no extended thinking
30 questions × 5 dimensions × 5-point scale = 750 max
Single variable: Verity on / off
A = easy · B = medium · C = hard
Independent evaluator · controlled conditions

The A and B tests establish a strong baseline — 95.5% and 91.7% — to confirm the model is not weak. The C-test is designed to expose structural reasoning failures under sustained, multi-layered load: precisely where Verity's scaffold exerts the most corrective force.
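The scoring arithmetic in the methodology can be sketched as follows. This assumes every question is scored once per dimension on the 5-point scale, which is what the 750 maximum and the 150-per-dimension figure imply; the constant names are illustrative.

```python
QUESTIONS = 30
DIMENSIONS = 5
SCALE_MAX = 5  # top of the 5-point scale

# Maximum for a whole test and for a single dimension (assumed structure).
max_total = QUESTIONS * DIMENSIONS * SCALE_MAX
max_per_dimension = QUESTIONS * SCALE_MAX

# Baseline (no Verity) totals from the three tier cards.
baseline_scores = {"A": 716, "B": 688, "C": 432}

# Baseline as a percentage of the maximum, to one decimal place.
baseline_pct = {
    tier: round(score / max_total * 100, 1)
    for tier, score in baseline_scores.items()
}

print(max_total, max_per_dimension, baseline_pct)
```

This gives the 95.5% and 91.7% baselines quoted for the A and B tests.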

II · Voice

Consistent analytical voice
across every domain.

Reasoning quality is one dimension. Prose quality is another. We evaluated Verity's writing across four domains — philosophy, science, practical instruction, and long-form argument — using an eight-criterion rubric scored out of 40.

25.25 → 35.5

mean score / 40 across all sets

+10.25

mean voice difference across the four sets

9 → 4

domain range narrowed (points)

Set A · philosophy

29 → 37 /40

+27.6%

Set B · science

25 → 36 /40

+44.0%

Set C · practical ★

20 → 36 /40

+80.0%

Set D · long-form

27 → 33 /40

+22.2%

Voice quality across domains — baseline vs. Verity

The base model's prose quality varies with domain formality — strongest on philosophy, weakest on practical instruction. Verity holds steady across all four. The shaded area shows where it is doing corrective work.

With Verity · Baseline (no Verity)

Criterion breakdown — mean scores across all sets

Eight rubric criteria, each scored 1–5, shown as means across all four test sets. Tone discipline reaches a perfect Verity mean of 5.00. The smallest gain is in LLM artifact absence (+0.75): the score drops in long-form, where Verity reverts to bold headers and reintroduces a structural artifact.

Criterion · Baseline → Verity · Gain

Tone discipline · 3.25 → 5.00 · +1.75
Explanatory guidance · 3.00 → 4.75 · +1.75
Paragraph discipline · 3.00 → 4.75 · +1.75
Reasoned conclusion · 3.00 → 4.25 · +1.25
Structural clarity · 3.75 → 4.75 · +1.00
Conceptual precision · 3.50 → 4.50 · +1.00
Sentence rhythm · 2.75 → 3.75 · +1.00
LLM artifact absence · 3.00 → 3.75 · +0.75
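Because each set score is the sum of eight criterion scores, the eight criterion means should add up to the mean set score. A small Python sketch of that cross-check, using the figures above (names are illustrative):

```python
# (baseline mean, Verity mean) per criterion, each on a 1-5 scale,
# copied from the criterion breakdown above.
criteria = {
    "Tone discipline": (3.25, 5.00),
    "Explanatory guidance": (3.00, 4.75),
    "Paragraph discipline": (3.00, 4.75),
    "Reasoned conclusion": (3.00, 4.25),
    "Structural clarity": (3.75, 4.75),
    "Conceptual precision": (3.50, 4.50),
    "Sentence rhythm": (2.75, 3.75),
    "LLM artifact absence": (3.00, 3.75),
}

# Summing the criterion means gives the mean score out of 40.
baseline_mean_total = sum(b for b, _ in criteria.values())
verity_mean_total = sum(v for _, v in criteria.values())

# The criterion with the smallest gain.
smallest_gain = min(criteria, key=lambda name: criteria[name][1] - criteria[name][0])

print(baseline_mean_total, verity_mean_total, smallest_gain)
```

The sums come to 25.25 and 35.5 out of 40, consistent with the headline means for the voice series.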

Methodology

Claude Sonnet 4.5 · no extended thinking
8 criteria × 5-point scale = 40 max per set
15 prompts per set
A = philosophy · B = science · C = practical · D = long-form
Single variable: Verity on / off

The voice difference score increases as domain formality decreases (A: +8, B: +11, C: +16), then contracts at long-form (D: +6). Verity does the most corrective work where the base model's defaults are weakest. The base model ranges 9 points across domains (20–29); Verity ranges just 4 (33–37).
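The per-set differences, the mean difference, and the two ranges can all be recomputed from the four set scores. A minimal Python sketch under that reading, with illustrative names:

```python
from statistics import mean

# (baseline, with Verity) score out of 40 for each set,
# copied from the four set cards.
sets = {"A": (29, 37), "B": (25, 36), "C": (20, 36), "D": (27, 33)}

baseline = [b for b, _ in sets.values()]
verity = [v for _, v in sets.values()]

# Voice difference per set, and its mean across the series.
differences = {name: v - b for name, (b, v) in sets.items()}
mean_difference = mean(list(differences.values()))

# Spread between the best and worst domain for each condition.
baseline_range = max(baseline) - min(baseline)
verity_range = max(verity) - min(verity)

print(differences, mean_difference, baseline_range, verity_range)
```

The unrounded mean difference is 10.25 points, and the ranges come out to 9 and 4, matching the figures in the text.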
