Dashboard — LLMGuard

Get your first regression report

Run a hallucination comparison between two model checkpoints in under 15 minutes.

✓ pip install llmguard

✓ Add API key

Enable logprobs=True

Run evaluation →

Total Eval Runs

12,847

↑ 34% vs last month

Across 6 model checkpoints

Avg Hallucination Delta

+0.127

↑ vs baseline v14

Latest checkpoint v15

Confident Hallucinations

8.9%

↑ 5.8pp vs baseline

High top-1 token probability with an incorrect output
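A minimal sketch of how a confident-hallucination rate like the one above could be computed from logged token log-probabilities. The dashboard does not say how "high" is defined or how per-token values are aggregated, so the `EvalRecord` type, the 0.9 cutoff, and the mean-probability aggregation are all assumptions made for illustration, not LLMGuard's actual method.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One graded model response (hypothetical record shape)."""
    token_logprobs: list[float]  # log-probability of the chosen token at each position
    is_correct: bool             # grading result for the whole output


HIGH_CONFIDENCE = 0.9  # hypothetical cutoff; the dashboard does not state one


def mean_top1_prob(record: EvalRecord) -> float:
    """Average per-token probability, used here as a simple confidence proxy."""
    if not record.token_logprobs:
        return 0.0
    probs = [math.exp(lp) for lp in record.token_logprobs]
    return sum(probs) / len(probs)


def confident_hallucination_rate(records: list[EvalRecord],
                                 cutoff: float = HIGH_CONFIDENCE) -> float:
    """Fraction of all responses that are both incorrect and high-confidence."""
    if not records:
        return 0.0
    flagged = sum(
        1 for r in records
        if not r.is_correct and mean_top1_prob(r) >= cutoff
    )
    return flagged / len(records)
```

Running the same function over the baseline and the candidate and subtracting the two rates would give a percentage-point change of the kind shown in this card.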

Domains Improved

3 / 9

Regressed: reasoning & legal

Improved: factual QA, code, summarization

Hallucination Delta Trend

v14 baseline → rolling candidate checkpoints

Hallucination delta

Threshold (0.05)

Current delta (v15)

+0.127

95% CI

[+0.089, +0.165]

Recommendation

BLOCK DEPLOYMENT
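The delta, interval, and recommendation above can be approximated with a short sketch. It assumes the hallucination delta is the candidate's hallucination rate minus the baseline's, uses a standard normal-approximation (Wald) interval for a difference of two proportions, and gates on the point estimate. Only the 0.05 threshold comes from the dashboard; the 0.02 warn level, the function names, and the gating rule itself are hypothetical.

```python
import math

BLOCK_THRESHOLD = 0.05  # from the dashboard's "Threshold (0.05)"
WARN_THRESHOLD = 0.02   # hypothetical; the dashboard does not state a warn level


def delta_with_ci(base_bad: int, base_n: int,
                  cand_bad: int, cand_n: int,
                  z: float = 1.96) -> tuple[float, float, float]:
    """Return (delta, ci_low, ci_high) for candidate rate minus baseline rate.

    Uses the Wald interval for a difference of two independent proportions;
    z = 1.96 corresponds to a 95% interval.
    """
    p_base = base_bad / base_n
    p_cand = cand_bad / cand_n
    delta = p_cand - p_base
    se = math.sqrt(p_base * (1 - p_base) / base_n
                   + p_cand * (1 - p_cand) / cand_n)
    return delta, delta - z * se, delta + z * se


def recommend(delta: float,
              block_at: float = BLOCK_THRESHOLD,
              warn_at: float = WARN_THRESHOLD) -> str:
    """Map a hallucination delta to a deployment status."""
    if delta > block_at:
        return "Block"
    if delta > warn_at:
        return "Warn"
    return "Approved"


# Illustrative counts only, not taken from the dashboard.
delta, low, high = delta_with_ci(base_bad=100, base_n=1000,
                                 cand_bad=200, cand_n=1000)
status = recommend(delta)
```

With these assumed cutoffs, the deltas listed in the recent runs table (+0.127, +0.043, −0.031, +0.007) map to Block, Warn, Approved, and Approved. A stricter gate could compare the lower CI bound to the threshold instead of the point estimate.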

Recent Evaluation Runs

Run ID	Baseline → Candidate	Model	Prompts	Delta	Conf. Hall.	Status	Started
run_a7f21c	v14 → v15	GPT-4o	1,000	+0.127	8.9%	Block	2h ago
run_b3d9ae	llama3-ft-v2 → llama3-ft-v3	Llama 3 8B	500	+0.043	3.1%	Warn	5h ago
run_c1e44f	mistral-med-v1 → mistral-med-v2	Mistral 7B	2,000	−0.031	1.2%	Approved	Yesterday
run_d8b22a	gpt4o-legal-v3 → gpt4o-legal-v4	GPT-4o ft	1,000	+0.007	2.1%	Approved	2 days ago
run_e5c31b	gemma-code-v1 → gemma-code-v2	Gemma 7B	3,000	—	—	Running	14 min ago
run_f2a89d	llama3-fin-v5 → llama3-fin-v6	Llama 3 70B	5,000	—	—	Queued	Just now

Showing 6 of 84 runs

Domain Risk Snapshot

v14 → v15

Legal / Regulatory

+0.129

Multi-Step Reasoning

+0.056

Medical / Clinical

+0.007

Code Generation

+0.002

Factual QA

−0.004

Summarization

−0.009

Plan Usage

Growth

Evaluation runs 64,230 / 100,000

35,770 runs remaining · Resets May 1

Batch jobs this month 23 / unlimited

Team members 4 / 10

Next billing date: May 1, 2026 · $999/mo
