Total Runs (90d)
84
Across 3 model families
Peak Delta (90d)
+0.218
llama3-legal-v11 · 65 days ago
Best Improvement
−0.067
mistral-med-v2 · Yesterday
Deploy Blocks Issued
12
Out of 84 evaluations (14%)
Hallucination Delta Over Time
All model checkpoints · Baseline = each checkpoint's immediate predecessor
[Chart. X-axis: date, Jan 10 to Apr 11. Y-axis: hallucination delta, 0.00 to 0.25. Series: GPT-4o variants, Llama 3 variants, Mistral variants. A block threshold line is labeled BLOCK.]
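The chart subtitle says each checkpoint is compared against its immediate predecessor. A minimal sketch of that calculation follows, assuming the underlying metric is a per-checkpoint hallucination rate and that a positive delta means the newer checkpoint hallucinates more. The function name `predecessor_deltas` and the sample rates are hypothetical and are not taken from the dashboard.

```python
from typing import List, Tuple


def predecessor_deltas(rates: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Return (checkpoint, delta) pairs, where delta = rate - predecessor's rate.

    `rates` must be ordered oldest to newest. The first checkpoint has no
    predecessor, so it produces no delta.
    """
    deltas = []
    for (_, prev_rate), (name, rate) in zip(rates, rates[1:]):
        # Round to three decimals, matching the precision shown on the cards.
        deltas.append((name, round(rate - prev_rate, 3)))
    return deltas


# Hypothetical hallucination rates for illustration only.
example = [("v1", 0.100), ("v2", 0.120), ("v3", 0.090)]
print(predecessor_deltas(example))
```

Because the baseline moves with every checkpoint, a regression in one version followed by no change in the next would show a large delta and then a delta near zero, even though the absolute rate stays elevated.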
Domain Risk Heatmap
Model version × Domain · Color = delta severity
Checkpoint     Legal   Reason.  Medical  Code    Factual  Fin.
v15 (latest)   +0.13   +0.06    +0.01    0.00    -0.00    0.00
v14            +0.04   +0.02    +0.01    -0.01   -0.02    +0.02
v13            +0.21   +0.08    +0.04    +0.01   +0.02    +0.05
v12            +0.03   +0.01    -0.01    -0.02   -0.03    +0.01
v11            +0.01   -0.01    -0.02    +0.01   +0.01    -0.01
Color scale: Low risk to High risk
Model Version Comparison
Latest evaluation delta per checkpoint
Checkpoint        Delta    Status
v15 (latest)      +0.127   Block
llama3-ft-v3      +0.043   Warn
gpt4o-legal-v4    +0.007   Approved
mistral-med-v2    −0.031   Improved
v14               −0.021   Approved
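The mapping from a delta to a status label can be sketched as a simple threshold check. The dashboard does not state its cutoffs, so the three constants below are assumptions, chosen only because they reproduce the five labels listed above. The function name `classify_delta` is also hypothetical.

```python
# Assumed cutoffs: the dashboard does not publish its real thresholds.
BLOCK_AT = 0.10      # delta at or above this blocks the deploy
WARN_AT = 0.03       # delta at or above this raises a warning
IMPROVED_AT = -0.03  # delta at or below this counts as an improvement


def classify_delta(delta: float) -> str:
    """Map a hallucination delta to a status label using the assumed cutoffs."""
    if delta >= BLOCK_AT:
        return "Block"
    if delta >= WARN_AT:
        return "Warn"
    if delta <= IMPROVED_AT:
        return "Improved"
    return "Approved"


# Deltas from the Model Version Comparison panel.
latest = {
    "v15 (latest)": 0.127,
    "llama3-ft-v3": 0.043,
    "gpt4o-legal-v4": 0.007,
    "mistral-med-v2": -0.031,
    "v14": -0.021,
}
for checkpoint, delta in latest.items():
    print(f"{checkpoint}: {delta:+.3f} {classify_delta(delta)}")
```

Other cutoffs would fit the same five rows equally well, so these values should be read as one consistent possibility rather than the dashboard's actual configuration.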
Evaluation Volume
Number of evaluation runs per week · Last 12 weeks
[Chart. X-axis: week, Jan 13 to Apr 7. Y-axis: number of evaluation runs.]