Get your first regression report
Run a hallucination comparison between two model checkpoints in under 15 minutes.
1. ✓ pip install llmguard
2. ✓ Add API key
3. Enable logprobs=True
4. Run evaluation
Overview
Hallucination evaluation metrics · Last 30 days
Total Eval Runs
12,847
↑ 34% vs last month
Across 6 model checkpoints
Avg Hallucination Delta
+0.127
↑ vs baseline v14
Latest checkpoint v15
Confident Hallucinations
8.9%
↑ 5.8pp vs baseline
High Top-1 + incorrect output
Domains Improved
3 / 9
↓ in reasoning & legal
Factual QA, code, summarization
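The "Confident Hallucinations" card describes its metric as a high top-1 probability combined with an incorrect output. A minimal sketch of how such a rate could be computed from per-token log-probabilities is shown here. The `EvalRecord` type, the 0.9 cutoff, and the use of the mean token probability as the confidence score are all assumptions made for illustration; the dashboard does not state how it aggregates logprobs or which cutoff it applies.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Log-probability of each emitted token. Under greedy decoding this is
    # the top-1 token; with sampling it may not be (assumption: greedy).
    token_logprobs: list[float]
    # Correctness label from a grader; how it is produced is out of scope.
    is_correct: bool


def mean_top1_probability(token_logprobs: list[float]) -> float:
    """Average per-token probability, used here as a simple confidence score."""
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def confident_hallucination_rate(
    records: list[EvalRecord], confidence_cutoff: float = 0.9
) -> float:
    """Fraction of all records that are incorrect AND at or above the cutoff.

    The 0.9 cutoff is a hypothetical default, not a value from the dashboard.
    """
    if not records:
        return 0.0
    flagged = sum(
        1
        for r in records
        if not r.is_correct
        and mean_top1_probability(r.token_logprobs) >= confidence_cutoff
    )
    return flagged / len(records)
```

The denominator is every evaluated prompt, so the result is directly comparable to a percentage like the one on the card; dividing by incorrect outputs only would answer a different question (how often errors are confident).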
Hallucination Delta Trend
v14 baseline → rolling candidate checkpoints
[Chart: hallucination delta by checkpoint, with a threshold line at 0.05]
Current delta (v15)
+0.127
95% CI
[+0.089, +0.165]
Recommendation
BLOCK DEPLOYMENT
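The panel above combines a delta, a 95% confidence interval, and a 0.05 threshold into a recommendation. One way such a gate might work is sketched below, assuming paired per-prompt hallucination flags (0 or 1) for the baseline and the candidate, a normal-approximation interval, and a hypothetical three-way rule. The function names, the interval method, and the WARN/APPROVE logic are illustrative assumptions; the dashboard does not document its actual rule.

```python
import math
from statistics import mean, stdev


def delta_with_ci(baseline_flags, candidate_flags, z=1.96):
    """Paired hallucination delta (candidate minus baseline) with a
    normal-approximation confidence interval (z=1.96 gives roughly 95%)."""
    if len(baseline_flags) != len(candidate_flags):
        raise ValueError("flags must be paired per prompt")
    if len(baseline_flags) < 2:
        raise ValueError("need at least two prompts to estimate variance")
    # Per-prompt difference: +1 new hallucination, -1 fixed, 0 unchanged.
    diffs = [float(c) - float(b) for b, c in zip(baseline_flags, candidate_flags)]
    delta = mean(diffs)
    half_width = z * stdev(diffs) / math.sqrt(len(diffs))
    return delta, delta - half_width, delta + half_width


def recommend(ci_low, ci_high, threshold=0.05):
    """Hypothetical gate: block only when the whole interval is above the
    threshold, warn when the interval overlaps it, approve otherwise."""
    if ci_low > threshold:
        return "BLOCK"
    if ci_high > threshold:
        return "WARN"
    return "APPROVE"
```

With the interval shown in the panel, `recommend(0.089, 0.165)` returns `"BLOCK"` under this rule, because the lower bound already exceeds 0.05. A bootstrap interval would be a reasonable alternative to the normal approximation when the number of prompts is small.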
Recent Evaluation Runs
| Run ID | Baseline → Candidate | Model | Prompts | Delta | Conf. Hall. | Status | Started |
|---|---|---|---|---|---|---|---|
| run_a7f21c | v14 → v15 | GPT-4o | 1,000 | +0.127 | 8.9% | Block | 2h ago |
| run_b3d9ae | llama3-ft-v2 → llama3-ft-v3 | Llama 3 8B | 500 | +0.043 | 3.1% | Warn | 5h ago |
| run_c1e44f | mistral-med-v1 → mistral-med-v2 | Mistral 7B | 2,000 | −0.031 | 1.2% | Approved | Yesterday |
| run_d8b22a | gpt4o-legal-v3 → gpt4o-legal-v4 | GPT-4o ft | 1,000 | +0.007 | 2.1% | Approved | 2 days ago |
| run_e5c31b | gemma-code-v1 → gemma-code-v2 | Gemma 7B | 3,000 | n/a | n/a | Running | 14 min ago |
| run_f2a89d | llama3-fin-v5 → llama3-fin-v6 | Llama 3 70B | 5,000 | n/a | n/a | Queued | Just now |
Showing 6 of 84 runs
Domain Risk Snapshot
v14 → v15
| Domain | Delta |
|---|---|
| Legal / Regulatory | +0.129 |
| Multi-Step Reasoning | +0.056 |
| Medical / Clinical | +0.007 |
| Code Generation | +0.002 |
| Factual QA | −0.004 |
| Summarization | −0.009 |
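A per-domain breakdown like the snapshot above could be derived by grouping paired per-prompt results by domain. The sketch below assumes each row is a `(domain, baseline_hallucinated, candidate_hallucinated)` tuple and treats any negative delta as an improvement; both the row shape and that definition of "improved" are assumptions, and the dashboard may apply a tolerance band instead.

```python
from collections import defaultdict


def domain_deltas(rows):
    """Return {domain: delta} sorted from most regressed to most improved.

    rows: iterable of (domain, baseline_hallucinated, candidate_hallucinated),
    where the two flags are booleans for the same prompt (hypothetical shape).
    """
    # Per domain: [prompt count, baseline hallucinations, candidate hallucinations]
    totals = defaultdict(lambda: [0, 0, 0])
    for domain, baseline_bad, candidate_bad in rows:
        entry = totals[domain]
        entry[0] += 1
        entry[1] += int(baseline_bad)
        entry[2] += int(candidate_bad)
    # Delta is the change in hallucination rate, candidate minus baseline.
    deltas = {d: (c - b) / n for d, (n, b, c) in totals.items()}
    return dict(sorted(deltas.items(), key=lambda kv: kv[1], reverse=True))


def count_improved(deltas):
    """Count domains whose delta is strictly negative (assumed definition)."""
    return sum(1 for value in deltas.values() if value < 0)
```

Sorting by delta in descending order puts the riskiest domain first, which matches the ordering used in the snapshot.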
Plan Usage
Growth
Evaluation runs
64,230 / 100,000
35,770 runs remaining · Resets May 1
Batch jobs this month
23 / unlimited
Team members
4 / 10
Next billing date: May 1, 2026 · $999/mo