Analytics
Historical hallucination trends across all model versions · Last 90 days
Total Runs (90d)
84
Across 3 model families
Peak Delta (90d)
+0.218
llama3-legal-v11 · 65 days ago
Best Improvement
−0.067
mistral-med-v2 · Yesterday
Deploy Blocks Issued
12
Out of 84 evaluations (14%)
Hallucination Delta Over Time
All model checkpoints · Baseline = each checkpoint's immediate predecessor
Legend: GPT-4o variants · Llama 3 variants · Mistral variants · Block threshold
Y-axis: hallucination delta, 0.00 to 0.25 · X-axis: date, Jan 10 to Apr 11
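The chart's baseline rule (each checkpoint is compared against its immediate predecessor) can be sketched as below. This is a minimal illustration, not the dashboard's actual implementation: the function names, the sample hallucination rates, the threshold value of 0.15, and the choice of `>=` for the block comparison are all assumptions made for the example.

```python
from typing import List, Optional


def deltas_vs_predecessor(rates: List[float]) -> List[Optional[float]]:
    """Return each checkpoint's delta against the checkpoint just before it.

    `rates` is an ordered list of hallucination rates, oldest first.
    The first checkpoint has no predecessor, so its delta is None.
    """
    if not rates:
        return []
    deltas: List[Optional[float]] = [None]
    for prev, cur in zip(rates, rates[1:]):
        # Round to three decimals to match the precision shown on the cards.
        deltas.append(round(cur - prev, 3))
    return deltas


def is_blocked(deltas: List[Optional[float]], threshold: float) -> List[bool]:
    """Flag checkpoints whose delta meets or exceeds the block threshold.

    Whether the real comparison is >= or > is an assumption here.
    """
    return [d is not None and d >= threshold for d in deltas]


# Hypothetical rates for four consecutive checkpoints (not dashboard data).
sample_rates = [0.10, 0.12, 0.30, 0.25]
sample_deltas = deltas_vs_predecessor(sample_rates)
sample_blocks = is_blocked(sample_deltas, threshold=0.15)  # threshold is hypothetical
```

Because the baseline moves with every checkpoint, a regression that is never fixed shows up as a single spike followed by deltas near zero, rather than as a sustained high value.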
Domain Risk Heatmap
Model version × Domain · Color = delta severity
| Checkpoint | Legal | Reason. | Medical | Code | Factual | Fin. |
|---|---|---|---|---|---|---|
| v15 (latest) | +0.13 | +0.06 | +0.01 | 0.00 | -0.00 | 0.00 |
| v14 | +0.04 | +0.02 | +0.01 | -0.01 | -0.02 | +0.02 |
| v13 | +0.21 | +0.08 | +0.04 | +0.01 | +0.02 | +0.05 |
| v12 | +0.03 | +0.01 | -0.01 | -0.02 | -0.03 | +0.01 |
| v11 | +0.01 | -0.01 | -0.02 | +0.01 | +0.01 | -0.01 |
Color scale: Low risk to High risk
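One way to read the heatmap programmatically is to hold it as a checkpoint-to-row mapping and pick the highest-delta domain per checkpoint. The sketch below copies the values from the table above; the names `DOMAINS`, `HEATMAP`, and `worst_domain` are hypothetical, and ties are resolved in favor of the first-listed domain, which is an arbitrary choice.

```python
from typing import Dict, List, Tuple

# Column order as shown in the heatmap table.
DOMAINS: List[str] = ["Legal", "Reason.", "Medical", "Code", "Factual", "Fin."]

# Deltas per checkpoint, transcribed from the heatmap table.
HEATMAP: Dict[str, List[float]] = {
    "v15": [0.13, 0.06, 0.01, 0.00, -0.00, 0.00],
    "v14": [0.04, 0.02, 0.01, -0.01, -0.02, 0.02],
    "v13": [0.21, 0.08, 0.04, 0.01, 0.02, 0.05],
    "v12": [0.03, 0.01, -0.01, -0.02, -0.03, 0.01],
    "v11": [0.01, -0.01, -0.02, 0.01, 0.01, -0.01],
}


def worst_domain(row: List[float]) -> Tuple[str, float]:
    """Return the (domain, delta) pair with the largest delta in a row.

    max() returns the first maximal index, so ties go to the earlier column.
    """
    idx = max(range(len(row)), key=row.__getitem__)
    return DOMAINS[idx], row[idx]


worst_by_checkpoint = {ckpt: worst_domain(row) for ckpt, row in HEATMAP.items()}
```

With the table values as given, Legal is the highest-delta domain for every listed checkpoint, although for v11 it only ties with Code and Factual at +0.01.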
Model Version Comparison
Latest evaluation delta per checkpoint
Bars: v15 (latest) · llama3-ft-v3 · gpt4o-legal-v4 · mistral-med-v2 · v14
Evaluation Volume
Number of evaluation runs per week · Last 12 weeks
X-axis: weekly ticks, Jan 13 to Apr 7 · Y-axis: evaluation runs per week