Generated 2h ago · 487s runtime
🚫
BLOCK DEPLOYMENT — v15 shows significant hallucination regression
Hallucination rate increased by 12.7 percentage points vs baseline (v14). The regression is statistically significant (p < 0.001). Legal/Regulatory (+12.9pp) and Multi-Step Reasoning (+5.6pp) domains drove the increase. Confident hallucination rate approximately tripled from 3.1% to 8.9%. Manual review of flagged prompts is strongly recommended before proceeding.
Hallucination Delta
+0.127
95% CI: [+0.089, +0.165]
v15 hallucinates 12.7pp more than v14
Confident Hallucination Rate
8.9%
↑ from 3.1% baseline (+5.8pp)
Top-1 probability > 0.85 and TTM classification: hallucination
Baseline Score (v14)
0.194
Mean risk score across 1,000 prompts
Candidate Score (v15)
0.321
Mean risk score across 1,000 prompts
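The headline delta and its significance claim can be sanity-checked from the summary figures alone. The sketch below assumes the reported interval is a symmetric normal-approximation 95% CI (the report does not say how it was computed), and the variable names are illustrative only.

```python
import math

# Figures copied from the summary cards in this report.
baseline_score = 0.194           # v14 mean risk score
candidate_score = 0.321          # v15 mean risk score
ci_low, ci_high = 0.089, 0.165   # reported 95% CI for the delta

delta = candidate_score - baseline_score

# Assumption: the interval is a symmetric normal-approximation CI,
# so its half-width equals about 1.96 standard errors.
Z_95 = 1.959964
std_err = (ci_high - ci_low) / (2 * Z_95)
z_stat = delta / std_err

# Two-sided p-value under the standard normal: P(|Z| > z) = erfc(z / sqrt(2)).
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

print(f"delta={delta:.3f}  se={std_err:.4f}  z={z_stat:.2f}  p={p_value:.1e}")
```

Under these assumptions the implied z-statistic is roughly 6.5, which is consistent with the reported p < 0.001. If the interval came from a bootstrap or a paired test instead, the implied standard error would only be approximate.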
Risk Score Distribution
Per-sequence hallucination risk scores across all 1,000 prompts
[Histogram: v14 (baseline) vs. v15 (candidate); x-axis: risk score, 0.0 to 1.0]
| Percentile | v14 | v15 |
|---|---|---|
| P25 | 0.12 | 0.15 |
| P50 (Median) | 0.21 | 0.28 |
| P75 | 0.34 | 0.41 |
| P95 (Tail) | 0.67 | 0.79 |

⚠ v15 shows a fatter right tail: the gap over v14 widens from +0.03 at P25 to +0.12 at P95 (0.67 → 0.79), indicating a growing class of very high-risk outputs.
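The tail observation can be reproduced from the four percentile pairs above. This is a minimal sketch using only the values printed in this report; the dictionary layout is an illustrative choice.

```python
# Percentile values copied from this report: (v14, v15).
percentiles = {
    "P25": (0.12, 0.15),
    "P50": (0.21, 0.28),
    "P75": (0.34, 0.41),
    "P95": (0.67, 0.79),
}

# Shift of each percentile from v14 to v15, rounded to the table's precision.
shifts = {name: round(v15 - v14, 2) for name, (v14, v15) in percentiles.items()}

for name, shift in shifts.items():
    print(f"{name}: {shift:+.2f}")

# The distribution is shifting more at the top than at the bottom.
tail_widening = shifts["P95"] > shifts["P25"]
```

A larger shift at P95 than at P25 means the distribution is stretching toward high risk scores, not just translating uniformly.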

Domain-Level Breakdown
Per-domain hallucination scores and delta · Grouped by assessment, then sorted by absolute delta
| Domain | Prompts | v14 Score | v15 Score | Delta | Conf. Hall. Δ | Assessment |
|---|---|---|---|---|---|---|
| ⚖️ Legal / Regulatory | 150 | 0.312 | 0.441 | +0.129 | +10.8pp | Regressed |
| 🧠 Multi-Step Reasoning | 120 | 0.231 | 0.287 | +0.056 | +4.2pp | Regressed |
| 🏥 Medical / Clinical | 100 | 0.198 | 0.205 | +0.007 | +0.5pp | Marginal |
| 💰 Financial | 80 | 0.241 | 0.246 | +0.005 | +0.2pp | Unchanged |
| 💻 Code Generation | 100 | 0.089 | 0.091 | +0.002 | +0.1pp | Unchanged |
| 📋 Instruction Following | 90 | 0.142 | 0.143 | +0.001 | 0.0pp | Unchanged |
| 📊 Summarization | 120 | 0.176 | 0.167 | −0.009 | −0.7pp | Improved |
| ❓ Open-Ended Generation | 120 | 0.312 | 0.308 | −0.004 | −0.3pp | Improved |
| 🔍 Factual QA | 120 | 0.142 | 0.138 | −0.004 | −0.2pp | Improved |
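For readers who want to re-rank the domains themselves, the sketch below recomputes each delta from the v14 and v15 scores in the table and orders the rows strictly by absolute delta. The tuple layout and variable names are illustrative; the numbers are copied from the table.

```python
# (domain, prompts, v14 score, v15 score), copied from the domain table.
rows = [
    ("Legal / Regulatory", 150, 0.312, 0.441),
    ("Multi-Step Reasoning", 120, 0.231, 0.287),
    ("Medical / Clinical", 100, 0.198, 0.205),
    ("Financial", 80, 0.241, 0.246),
    ("Code Generation", 100, 0.089, 0.091),
    ("Instruction Following", 90, 0.142, 0.143),
    ("Summarization", 120, 0.176, 0.167),
    ("Open-Ended Generation", 120, 0.312, 0.308),
    ("Factual QA", 120, 0.142, 0.138),
]

# Recompute each delta, rounded to the table's three decimals.
with_delta = [(name, n, round(v15 - v14, 3)) for name, n, v14, v15 in rows]

# Largest absolute change first; Python's sort is stable, so ties keep table order.
ranked = sorted(with_delta, key=lambda row: -abs(row[2]))

total_prompts = sum(n for _, n, _ in with_delta)

for name, n, delta in ranked:
    print(f"{name:<24} n={n:<4} delta={delta:+.3f}")
```

In a strict absolute-delta ordering, Summarization (−0.009) ranks third, ahead of Medical / Clinical (+0.007), so improvements and regressions of similar size appear side by side.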
⚠ Confident Hallucination Breakdown
Outputs with Top-1 probability > 0.85 classified as hallucination by TTM
Critical — not detectable by single-signal methods
Baseline Rate (v14)
3.1%
31 / 1,000 prompts
Candidate Rate (v15)
8.9%
89 / 1,000 prompts
Delta
+5.8pp
Rate nearly tripled (≈2.9×)
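The flagging rule and the rate arithmetic above can be expressed in a few lines. This is a sketch, not the tool's implementation: the function name and the boolean `ttm_says_hallucination` argument are hypothetical stand-ins for whatever TTM actually returns; only the 0.85 threshold and the counts come from this report.

```python
TOP1_THRESHOLD = 0.85  # threshold stated in this report


def is_confident_hallucination(top1_prob: float, ttm_says_hallucination: bool) -> bool:
    """Hypothetical rule: high top-1 probability AND a TTM hallucination verdict."""
    return top1_prob > TOP1_THRESHOLD and ttm_says_hallucination


# Counts copied from the breakdown above.
TOTAL_PROMPTS = 1000
baseline_rate = 31 / TOTAL_PROMPTS    # v14
candidate_rate = 89 / TOTAL_PROMPTS   # v15

delta_pp = (candidate_rate - baseline_rate) * 100  # percentage points
ratio = candidate_rate / baseline_rate             # relative increase

print(f"delta = {delta_pp:+.1f}pp, ratio = {ratio:.2f}x")
```

Reporting both the percentage-point delta and the ratio is useful here because a 5.8pp change on a 3.1% base is a much larger relative shift than the headline delta alone conveys.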
Sample output class probabilities (prompt_0042 — highest delta):
confident_hallucination: 0.910
confident_correct: 0.041
uncertain_hallucination: 0.027
genuine_uncertainty: 0.014
creative_generation: 0.008
top1_prob: 0.942 · entropy: 0.71 bits · logit_gap: 3.24 · flagged_ranges: [[12,18],[34,41]]
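A quick way to read the sample output is to confirm that the five class probabilities form a valid distribution and to measure how far the top class leads the runner-up. The sketch below uses the values shown for prompt_0042; the variable names are illustrative, and the margin is a derived convenience figure that the report itself does not list.

```python
# Class probabilities for prompt_0042, copied from this report.
class_probs = {
    "confident_hallucination": 0.910,
    "confident_correct": 0.041,
    "uncertain_hallucination": 0.027,
    "genuine_uncertainty": 0.014,
    "creative_generation": 0.008,
}

# A valid probability distribution should sum to 1 (within rounding).
total = sum(class_probs.values())

# Rank classes from most to least probable.
ordered = sorted(class_probs.items(), key=lambda item: item[1], reverse=True)
top_class, top_prob = ordered[0]
runner_up, runner_up_prob = ordered[1]

# Margin between the predicted class and the next most likely one.
margin = top_prob - runner_up_prob

print(f"sum={total:.3f}  top={top_class} ({top_prob:.3f})  margin over {runner_up}: {margin:.3f}")
```

A wide margin means the classification is not a borderline call between two neighboring classes.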
Top 10 Flagged Prompts
Prompts with largest hallucination score increase between checkpoints. Re-run these yourself to inspect the outputs.
Token-Level Signal View
prompt_0042 — flagged ranges highlighted in red · Model: v15
Legend: Flagged token range · Normal
In Henderson v. Commissioner, 143 T.C. 430 (2014), the court held that virtual currency transactions are subject to capital gains tax under IRC § 1001 and Revenue Ruling 2014-21 provides the definitive fair market value methodology ...

⚠ Note: Henderson v. Commissioner, 143 T.C. 430 (2014), does not appear in US Tax Court records. TTM flagged token ranges [12,18] and [34,41] as the confident hallucination signature: the model generated a plausibly formatted citation with high confidence even though the cited case does not exist.

Top-1 probability trajectory · prompt_0042 · v15