run_a7f21c
GPT-4o
1,000 prompts
LLMGuard curated · legal_qa_1000
Regression Report: v14 → v15
Hallucination comparison across 1,000 prompts · 9 domains · Generated Apr 11 2026
Calibration: gpt4o-2025-07-01
evaluation_method: native
Hallucination Delta
+0.127
95% CI: [+0.089, +0.165]
v15's mean risk score is 0.127 higher than v14's (0.194 → 0.321)
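The report does not say how the 95% CI was obtained. One common choice for paired per-prompt scores is a percentile bootstrap over the per-prompt differences; the sketch below assumes that method and runs on synthetic stand-in scores, since the per-prompt data is not part of this report. The function name `paired_bootstrap_ci` and its parameters are illustrative, not part of any tool named here.

```python
import random
import statistics

def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean_delta, lo, hi) for candidate - baseline over paired prompts."""
    if len(baseline) != len(candidate):
        raise ValueError("score lists must be paired per prompt")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the per-prompt deltas with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(deltas), lo, hi

# Synthetic stand-in scores (NOT the report's data).
_rng = random.Random(1)
v14_scores = [_rng.random() * 0.4 for _ in range(200)]
v15_scores = [s + _rng.random() * 0.25 for s in v14_scores]

mean_delta, ci_lo, ci_hi = paired_bootstrap_ci(v14_scores, v15_scores)
print(f"delta={mean_delta:+.3f}  95% CI=[{ci_lo:+.3f}, {ci_hi:+.3f}]")
```

A CI that excludes zero, as the reported [+0.089, +0.165] does, indicates the shift is unlikely to be resampling noise under this method.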
Confident Hallucination Rate
8.9%
↑ from 3.1% baseline (+5.8pp)
Top-1 probability > 0.85 and TTM classification: hallucination
Baseline Score (v14)
0.194
Mean risk score across 1,000 prompts
Candidate Score (v15)
0.321
Mean risk score across 1,000 prompts
Risk Score Distribution
Per-sequence hallucination risk scores across all 1,000 prompts
[Histogram: v14 (baseline) vs. v15 (candidate) · x-axis: risk score, 0.0 to 1.0]
| Percentile | v14 (baseline) | v15 (candidate) |
|---|---|---|
| P25 | 0.12 | 0.15 |
| P50 (Median) | 0.21 | 0.28 |
| P75 | 0.34 | 0.41 |
| P95 (Tail) | 0.67 | 0.79 |
⚠ v15 shows a fatter right tail. P95 rose from 0.67 to 0.79 (+0.12), compared with +0.07 at the median (0.21 to 0.28), so the increase is concentrated in very high-risk outputs.
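For readers reproducing the percentile comparison from their own per-sequence scores, a minimal sketch follows. It assumes linear interpolation (`method="inclusive"` in Python's `statistics.quantiles`); the report does not state which interpolation it uses, so small differences are possible. `percentile_summary` and `percentile_shift` are illustrative names.

```python
import statistics

def percentile_summary(scores):
    """Return P25/P50/P75/P95 of a list of risk scores."""
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    return {"P25": cuts[24], "P50": cuts[49], "P75": cuts[74], "P95": cuts[94]}

def percentile_shift(baseline_scores, candidate_scores):
    """Per-percentile change (candidate minus baseline)."""
    base = percentile_summary(baseline_scores)
    cand = percentile_summary(candidate_scores)
    return {k: cand[k] - base[k] for k in base}

# Evenly spaced stand-in scores (NOT the report's data).
demo_scores = [i / 100 for i in range(101)]
demo_summary = percentile_summary(demo_scores)
print(demo_summary)
```

A P95 shift that is larger than the median shift is the pattern the warning above describes.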
Domain-Level Breakdown
Per-domain hallucination scores and delta · Grouped by assessment (regressions first), then sorted by absolute delta
| Domain | Prompts | v14 Score | v15 Score | Delta | Conf. Hall. Δ | Assessment |
|---|---|---|---|---|---|---|
| ⚖️ Legal / Regulatory | 150 | 0.312 | 0.441 | +0.129 | +10.8pp | Regressed |
| 🧠 Multi-Step Reasoning | 120 | 0.231 | 0.287 | +0.056 | +4.2pp | Regressed |
| 🏥 Medical / Clinical | 100 | 0.198 | 0.205 | +0.007 | +0.5pp | Marginal |
| 💰 Financial | 80 | 0.241 | 0.246 | +0.005 | +0.2pp | Unchanged |
| 💻 Code Generation | 100 | 0.089 | 0.091 | +0.002 | +0.1pp | Unchanged |
| 📋 Instruction Following | 90 | 0.142 | 0.143 | +0.001 | 0.0pp | Unchanged |
| 📊 Summarization | 120 | 0.176 | 0.167 | −0.009 | −0.7pp | Improved |
| ❓ Open-Ended Generation | 120 | 0.312 | 0.308 | −0.004 | −0.3pp | Improved |
| 🔍 Factual QA | 120 | 0.142 | 0.138 | −0.004 | −0.2pp | Improved |
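The table can be sanity-checked with a few lines of code. The sketch below copies the rows from the table, recomputes each delta as v15 minus v14, confirms the prompt counts add up to 1,000, and ranks domains by absolute delta. The tuple layout is an illustrative choice, not an export format of the tool.

```python
# (domain, prompts, v14 score, v15 score, reported delta), copied from the table.
rows = [
    ("Legal / Regulatory",    150, 0.312, 0.441,  0.129),
    ("Multi-Step Reasoning",  120, 0.231, 0.287,  0.056),
    ("Medical / Clinical",    100, 0.198, 0.205,  0.007),
    ("Financial",              80, 0.241, 0.246,  0.005),
    ("Code Generation",       100, 0.089, 0.091,  0.002),
    ("Instruction Following",  90, 0.142, 0.143,  0.001),
    ("Summarization",         120, 0.176, 0.167, -0.009),
    ("Open-Ended Generation", 120, 0.312, 0.308, -0.004),
    ("Factual QA",            120, 0.142, 0.138, -0.004),
]

# Each reported delta should equal v15 - v14 to three decimals.
mismatches = [
    name for name, _, v14, v15, reported in rows
    if abs((v15 - v14) - reported) > 5e-4
]
total_prompts = sum(n for _, n, _, _, _ in rows)

# Rank by magnitude of change, regardless of direction.
by_abs_delta = sorted(rows, key=lambda r: abs(r[4]), reverse=True)

print("mismatched deltas:", mismatches)
print("total prompts:", total_prompts)
for name, _, _, _, delta in by_abs_delta:
    print(f"{name:<24}{delta:+.3f}")
```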
⚠ Confident Hallucination Breakdown
Outputs with Top-1 probability > 0.85 classified as hallucination by TTM
Baseline Rate (v14)
3.1%
31 / 1,000 prompts
Candidate Rate (v15)
8.9%
89 / 1,000 prompts
Delta
+5.8pp
Rate approximately tripled
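The rule behind this metric can be sketched as follows: an output counts as a confident hallucination when its Top-1 probability exceeds 0.85 and TTM classifies it as a hallucination. The `ScoredOutput` record, the `"hallucination"` label string, and the helper names are assumptions made for illustration; only the 0.85 threshold and the 31 and 89 counts come from the report.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # Top-1 cutoff stated in the report

@dataclass
class ScoredOutput:
    top1_prob: float  # model's Top-1 probability for the output
    ttm_label: str    # hypothetical TTM verdict string

def is_confident_hallucination(output):
    """High confidence AND flagged as hallucination by TTM."""
    return (output.top1_prob > CONFIDENCE_THRESHOLD
            and output.ttm_label == "hallucination")

def confident_hallucination_rate(outputs):
    flagged = sum(is_confident_hallucination(o) for o in outputs)
    return flagged / len(outputs)

def make_run(n_flagged, n_total=1000):
    """Synthetic run with the given number of confident hallucinations."""
    flagged = [ScoredOutput(0.95, "hallucination")] * n_flagged
    rest = [ScoredOutput(0.95, "correct")] * (n_total - n_flagged)
    return flagged + rest

baseline_rate = confident_hallucination_rate(make_run(31))
candidate_rate = confident_hallucination_rate(make_run(89))
delta_pp = (candidate_rate - baseline_rate) * 100
ratio = candidate_rate / baseline_rate
print(f"{baseline_rate:.1%} -> {candidate_rate:.1%} ({delta_pp:+.1f}pp, {ratio:.2f}x)")
# prints: 3.1% -> 8.9% (+5.8pp, 2.87x)
```

The ratio of about 2.87 is what the card rounds to "approximately tripled".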
Sample output class probabilities (prompt_0042 — highest delta):
| Class | Probability |
|---|---|
| confident_hallucination | 0.910 |
| confident_correct | 0.041 |
| uncertain_hallucination | 0.027 |
| genuine_uncertainty | 0.014 |
| creative_generation | 0.008 |
top1_prob: 0.942
entropy: 0.71 bits
logit_gap: 3.24
flagged_ranges: [[12,18],[34,41]]
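The report does not define `top1_prob`, `entropy`, or `logit_gap`. A common reading, assumed here, is that they are computed from a next-token logit vector: the softmax probability of the most likely token, the Shannon entropy of the softmax distribution in bits, and the difference between the two largest logits. The sketch below implements those assumed definitions; how the tool aggregates them across tokens is not specified and is not modeled.

```python
import math

def token_signals(logits):
    """Return (top1_prob, entropy_bits, logit_gap) for one logit vector.

    These definitions are assumptions, not taken from the report.
    """
    if len(logits) < 2:
        raise ValueError("need at least two logits")
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    top1_prob = max(probs)
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    ranked = sorted(logits, reverse=True)
    logit_gap = ranked[0] - ranked[1]
    return top1_prob, entropy_bits, logit_gap

# A peaked distribution: high top-1, low entropy, large gap.
peaked = token_signals([5.0, 1.0, 0.5, 0.0])
# A flat distribution: top-1 of 0.25, 2 bits of entropy, zero gap.
flat = token_signals([0.0, 0.0, 0.0, 0.0])
print("peaked:", peaked)
print("flat:  ", flat)
```

Under these definitions, a high `top1_prob` combined with low entropy and a large `logit_gap` describes a model that is committing firmly to its output, which is the condition the confident hallucination class is meant to capture.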
Top 10 Flagged Prompts
Prompts with largest hallucination score increase between checkpoints. Re-run these yourself to inspect the outputs.
Token-Level Signal View
prompt_0042 — flagged ranges highlighted in red · Model: v15
Legend: Flagged token range · Normal
In Henderson v. Commissioner, 143 T.C. 430 (2014), the court held that virtual currency transactions are subject to capital gains tax under IRC § 1001 and Revenue Ruling 2014-21 provides the definitive fair market value methodology ...
⚠ Note: Henderson v. Commissioner, 143 T.C. 430 (2014) does not appear in US Tax Court records. TTM flagged token ranges [12,18] and [34,41] as the confident hallucination signature: the model generated a plausibly formatted citation with high confidence even though the cited case does not exist.
Top-1 probability trajectory · prompt_0042 · v15