Get your first regression report

Run a hallucination comparison between two model checkpoints in under 15 minutes.

1. pip install llmguard
2. Add API key
3. Enable logprobs=True
4. Run evaluation
Total Eval Runs: 12,847 (↑ 34% vs last month; across 6 model checkpoints)
Avg Hallucination Delta: +0.127 (↑ vs baseline v14; latest checkpoint v15)
Confident Hallucinations: 8.9% (↑ 5.8pp vs baseline; high top-1 + incorrect output)
Domains Improved: 3 / 9 (↓ in reasoning & legal; improved in factual QA, code, summarization)
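
The Confident Hallucinations figure is described as "high top-1 + incorrect output". A minimal sketch of how such a rate could be computed from per-prompt log-probabilities follows. The `Sample` type, the function name, and the 0.9 probability cutoff are assumptions for illustration; the dashboard does not state how "high" is defined or how correctness is judged.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One evaluated prompt (hypothetical record layout)."""
    top1_logprob: float  # log-probability the model assigned to its top-1 choice
    correct: bool        # whether the output was judged correct


def confident_hallucination_rate(samples, min_top1_prob=0.9):
    """Fraction of samples that are incorrect AND have top-1 probability >= cutoff.

    min_top1_prob is an assumed cutoff, not a value taken from the dashboard.
    """
    if not samples:
        return 0.0
    flagged = sum(
        1
        for s in samples
        if not s.correct and math.exp(s.top1_logprob) >= min_top1_prob
    )
    return flagged / len(samples)
```

Converting log-probabilities with `math.exp` keeps the cutoff on a 0 to 1 scale, which is easier to reason about than a raw log value.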
Hallucination Delta Trend (v14 baseline → rolling candidate checkpoints)
[Chart: hallucination delta from D-30 to Today, plotted against a threshold line at 0.05; y-axis ticks at 0.02, 0.10, 0.20]
Current delta (v15): +0.127
95% CI: [+0.089, +0.165]
Recommendation: BLOCK DEPLOYMENT
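
The delta, confidence interval, and recommendation above can be related with a short sketch. It assumes the delta is the difference between two independent hallucination rates, uses a normal-approximation interval, and applies a hypothetical decision rule around the 0.05 threshold. The actual statistics and gating rule behind the dashboard are not stated, so the function names and the rule itself are illustrative only.

```python
import math


def delta_with_ci(base_hall, base_n, cand_hall, cand_n, z=1.96):
    """Difference in hallucination rate (candidate - baseline) with a normal-approx CI.

    Assumes two independent samples; a paired design would need a different formula.
    """
    p_base = base_hall / base_n
    p_cand = cand_hall / cand_n
    delta = p_cand - p_base
    se = math.sqrt(
        p_base * (1 - p_base) / base_n + p_cand * (1 - p_cand) / cand_n
    )
    return delta, delta - z * se, delta + z * se


def recommend(ci_low, ci_high, threshold=0.05):
    """Hypothetical gating rule (an assumption, not the product's documented logic).

    BLOCK   if the whole interval lies above the threshold,
    WARN    if only part of the interval lies above it,
    APPROVE otherwise.
    """
    if ci_low > threshold:
        return "BLOCK"
    if ci_high > threshold:
        return "WARN"
    return "APPROVE"
```

Under this assumed rule, the interval reported above gives `recommend(0.089, 0.165) == "BLOCK"`, because the lower bound already exceeds the 0.05 threshold.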
Recent Evaluation Runs
Run ID      Baseline        Candidate       Model        Prompts  Delta   Conf. Hall.  Status    Started
run_a7f21c  v14             v15             GPT-4o       1,000    +0.127  8.9%         Block     2h ago
run_b3d9ae  llama3-ft-v2    llama3-ft-v3    Llama 3 8B   500      +0.043  3.1%         Warn      5h ago
run_c1e44f  mistral-med-v1  mistral-med-v2  Mistral 7B   2,000    −0.031  1.2%         Approved  Yesterday
run_d8b22a  gpt4o-legal-v3  gpt4o-legal-v4  GPT-4o ft    1,000    +0.007  2.1%         Approved  2 days ago
run_e5c31b  gemma-code-v1   gemma-code-v2   Gemma 7B     3,000    n/a     n/a          Running   14 min ago
run_f2a89d  llama3-fin-v5   llama3-fin-v6   Llama 3 70B  5,000    n/a     n/a          Queued    Just now
Showing 6 of 84 runs
Domain Risk Snapshot (v14 → v15)
Legal / Regulatory: +0.129
Multi-Step Reasoning: +0.056
Medical / Clinical: +0.007
Code Generation: +0.002
Factual QA: −0.004
Summarization: −0.009
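
A per-domain breakdown like the snapshot above can be sketched by grouping prompt-level results by domain and subtracting the baseline hallucination rate from the candidate rate. The record layout `(domain, checkpoint, hallucinated)` and the function name are assumptions for illustration; the dashboard does not describe its aggregation.

```python
from collections import defaultdict


def domain_deltas(records):
    """Per-domain hallucination delta (candidate rate - baseline rate).

    records: iterable of (domain, checkpoint, hallucinated) tuples, where
    checkpoint is "baseline" or "candidate" (hypothetical layout).
    Returns a dict ordered from highest delta (most regressed) to lowest.
    """
    # counts[domain][checkpoint] = [hallucinated_count, total_count]
    counts = defaultdict(lambda: {"baseline": [0, 0], "candidate": [0, 0]})
    for domain, checkpoint, hallucinated in records:
        tally = counts[domain][checkpoint]
        tally[0] += int(hallucinated)
        tally[1] += 1

    deltas = {}
    for domain, c in counts.items():
        (base_hall, base_n), (cand_hall, cand_n) = c["baseline"], c["candidate"]
        if base_n == 0 or cand_n == 0:
            continue  # skip domains missing one side of the comparison
        deltas[domain] = cand_hall / cand_n - base_hall / base_n

    return dict(sorted(deltas.items(), key=lambda kv: kv[1], reverse=True))
```

Domains with a negative delta are the ones that improved under this definition, so a count of improved domains is `sum(1 for d in deltas.values() if d < 0)`.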
Plan Usage (Growth plan)
Evaluation runs: 64,230 / 100,000 (35,770 runs remaining · Resets May 1)
Batch jobs this month: 23 / unlimited
Team members: 4 / 10
Next billing date: May 1, 2026 · $999/mo