Regression Report: v14 → v15

🚫

BLOCK DEPLOYMENT — v15 shows significant hallucination regression

Hallucination rate increased by 12.7 percentage points vs baseline (v14). The regression is statistically significant (p < 0.001). Legal/Regulatory (+12.9pp) and Multi-Step Reasoning (+5.6pp) domains drove the increase. Confident hallucination rate approximately tripled from 3.1% to 8.9%. Manual review of flagged prompts is strongly recommended before proceeding.


Hallucination Delta

+0.127

95% CI: [+0.089, +0.165]

v15 hallucinates 12.7pp more than v14
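As a rough consistency check, the reported delta and confidence interval can be related to the stated significance level. The sketch below assumes the 95% CI is a symmetric normal-approximation interval (the report does not say how it was computed); the function name `z_and_p_from_ci` is illustrative only.

```python
import math

def z_and_p_from_ci(delta, ci_low, ci_high, z_crit=1.96):
    """Back out the standard error, z statistic and two-sided p-value.

    Assumes the interval is delta +/- z_crit * SE (normal approximation).
    This is an assumption; the report does not state its method.
    """
    se = (ci_high - ci_low) / (2 * z_crit)   # half-width divided by 1.96
    z = delta / se
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail probability
    return se, z, p

# Values taken from the Hallucination Delta card.
se, z, p = z_and_p_from_ci(0.127, 0.089, 0.165)
print(f"SE={se:.4f}  z={z:.2f}  p<0.001: {p < 0.001}")
```

Under that assumption the standard error is about 0.019 and the z statistic is about 6.5, which agrees with the reported p < 0.001.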

Confident Hallucination Rate

8.9%

↑ from 3.1% baseline (+5.8pp)

Outputs with Top-1 probability > 0.85 that TTM classifies as hallucination

Baseline Score (v14)

0.194

Mean risk score across 1,000 prompts

Candidate Score (v15)

0.321

Mean risk score across 1,000 prompts

Risk Score Distribution

Per-sequence hallucination risk scores across all 1,000 prompts

[Chart: v14 (baseline) vs. v15 (candidate) · x-axis: risk score, 0.0 to 1.0]

Percentile	v14	v15
P25	0.12	0.15
P50 (Median)	0.21	0.28
P75	0.34	0.41
P95 (Tail)	0.67	0.79

⚠ v15 shows a fatter right tail: the median rose by 0.07 (0.21 → 0.28) while P95 rose by 0.12 (0.67 → 0.79), so the increase is concentrated in a growing class of very high-risk outputs.
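The tail observation can be checked directly from the reported percentiles. This is a minimal sketch using only the four percentile pairs shown in the report; the names `reported` and `percentile_shifts` are illustrative.

```python
# (v14, v15) risk scores at each reported percentile.
reported = {
    "P25": (0.12, 0.15),
    "P50": (0.21, 0.28),
    "P75": (0.34, 0.41),
    "P95": (0.67, 0.79),
}

def percentile_shifts(percentiles):
    """Return the v15 minus v14 shift per percentile, rounded to 2 decimals."""
    return {name: round(cand - base, 2) for name, (base, cand) in percentiles.items()}

shifts = percentile_shifts(reported)
largest = max(shifts, key=shifts.get)  # percentile with the biggest upward shift
print(shifts, "largest shift at", largest)
```

The shift grows toward the upper percentiles and is largest at P95, which is what a fattening right tail looks like in percentile terms.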

Domain-Level Breakdown

Per-domain hallucination scores and delta · Sorted by delta, largest increase first

Domain	Prompts	v14 Score	v15 Score	Delta	Confident Hallucination Δ	Assessment
⚖️ Legal / Regulatory	150	0.312	0.441	+0.129	+10.8pp	Regressed
🧠 Multi-Step Reasoning	120	0.231	0.287	+0.056	+4.2pp	Regressed
🏥 Medical / Clinical	100	0.198	0.205	+0.007	+0.5pp	Marginal
💰 Financial	80	0.241	0.246	+0.005	+0.2pp	Unchanged
💻 Code Generation	100	0.089	0.091	+0.002	+0.1pp	Unchanged
📋 Instruction Following	90	0.142	0.143	+0.001	0.0pp	Unchanged
❓ Open-Ended Generation	120	0.312	0.308	−0.004	−0.3pp	Improved
🔍 Factual QA	120	0.142	0.138	−0.004	−0.2pp	Improved
📊 Summarization	120	0.176	0.167	−0.009	−0.7pp	Improved
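The report does not state how the Assessment labels are assigned. The sketch below uses hypothetical thresholds (`regress_at=0.05`, `unchanged_at=0.005`) chosen only because they reproduce the labels in the table above; the actual rule may differ.

```python
def assess(delta, regress_at=0.05, unchanged_at=0.005):
    """Map a score delta to an assessment label.

    Both thresholds are assumptions inferred from the table, not
    documented values.
    """
    if delta < 0:
        return "Improved"
    if delta <= unchanged_at:
        return "Unchanged"
    if delta < regress_at:
        return "Marginal"
    return "Regressed"

# Delta column as printed in the domain table.
domain_deltas = {
    "Legal / Regulatory": 0.129,
    "Multi-Step Reasoning": 0.056,
    "Medical / Clinical": 0.007,
    "Financial": 0.005,
    "Code Generation": 0.002,
    "Instruction Following": 0.001,
    "Summarization": -0.009,
    "Open-Ended Generation": -0.004,
    "Factual QA": -0.004,
}

labels = {domain: assess(delta) for domain, delta in domain_deltas.items()}
regressed = [d for d, label in labels.items() if label == "Regressed"]
print(regressed)
```

With these assumed cut-offs, only Legal / Regulatory and Multi-Step Reasoning are labelled Regressed, matching the two domains the summary names as driving the increase.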

⚠ Confident Hallucination Breakdown

Outputs with Top-1 probability > 0.85 classified as hallucination by TTM

Critical — not detectable by single-signal methods

Baseline Rate (v14)

3.1%

31 / 1,000 prompts

Candidate Rate (v15)

8.9%

89 / 1,000 prompts

Delta

+5.8pp

Rate approximately tripled
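The definition and the headline numbers in this section can be sketched as follows. The 0.85 threshold and the counts come from the report; the label string `"hallucination"` and both function names are assumptions for illustration, since the TTM output format is not shown.

```python
def is_confident_hallucination(top1_prob, ttm_label, threshold=0.85):
    """True when the model is confident (Top-1 > threshold) and TTM
    labels the output a hallucination. The label string is assumed."""
    return top1_prob > threshold and ttm_label == "hallucination"

def rate_summary(base_count, cand_count, n_prompts):
    """Summarise baseline vs. candidate rates from raw counts."""
    base = base_count / n_prompts
    cand = cand_count / n_prompts
    return {
        "baseline_pct": round(100 * base, 1),
        "candidate_pct": round(100 * cand, 1),
        "delta_pp": round(100 * (cand - base), 1),
        "ratio": round(cand / base, 2),
    }

# Counts from the cards above: 31 and 89 out of 1,000 prompts.
summary = rate_summary(31, 89, 1000)
print(summary)
```

The ratio comes out at about 2.87, which is what the report rounds to "approximately tripled".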

Sample output class probabilities (prompt_0042 — highest delta):

Class	Probability
confident_hallucination	0.910
confident_correct	0.041
uncertain_hallucination	0.027
genuine_uncertainty	0.014
creative_generation	0.008

top1_prob: 0.942 · entropy: 0.71 bits · logit_gap: 3.24 · flagged_ranges: [[12,18],[34,41]]
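A minimal sketch of how the sample output for prompt_0042 might be consumed downstream: pick the most probable class, confirm the probabilities form a distribution, and expand the flagged ranges into token indices. Treating the ranges as inclusive on both ends is an assumption; the report does not specify the convention.

```python
# Class probabilities for prompt_0042, as listed above.
class_probs = {
    "confident_hallucination": 0.910,
    "confident_correct": 0.041,
    "uncertain_hallucination": 0.027,
    "genuine_uncertainty": 0.014,
    "creative_generation": 0.008,
}

top_class = max(class_probs, key=class_probs.get)
total = sum(class_probs.values())

def flagged_indices(ranges):
    """Expand [start, end] pairs into a sorted list of token indices.

    Assumes both endpoints are inclusive (not confirmed by the report).
    """
    indices = set()
    for start, end in ranges:
        indices.update(range(start, end + 1))
    return sorted(indices)

flagged = flagged_indices([[12, 18], [34, 41]])
print(top_class, round(total, 3), len(flagged))
```

Under the inclusive assumption the two ranges cover 15 token positions in total.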

Top 10 Flagged Prompts

Prompts with largest hallucination score increase between checkpoints. Re-run these yourself to inspect the outputs.

Token-Level Signal View

prompt_0042 · Model: v15 · flagged token ranges: [12,18] and [34,41]

In Henderson v. Commissioner , 143 T.C. 430 ( 2014 ) , the court held that virtual currency transactions are subject to capital gains tax under IRC § 1001 and Revenue Ruling 2014 -21 provides the definitive fair market value methodology ...

⚠ Note: Henderson v. Commissioner 143 T.C. 430 (2014) does not appear in US Tax Court records. TTM flagged token ranges [12,18] and [34,41] as the confident hallucination signature — the model generated a plausible citation format with high confidence despite the citation not existing.

Top-1 probability trajectory · prompt_0042 · v15
