New Evaluation
Configure and launch a hallucination regression comparison between two model checkpoints.
1. Checkpoints
2. Prompt Suite
3. Configure
4. Review & Launch
Select Checkpoints
Choose your baseline checkpoint and the new candidate you want to test against it.
Checkpoint IDs are arbitrary strings you define; they don't need to exist in any registry.
🤖 GPT-4o
⚡ GPT-4o mini
🦙 Llama 3 8B
🦙 Llama 3 70B
💫 Mistral 7B
💎 Gemma 7B
🌊 Claude (proxy)
✨ Gemini (proxy)
Select Prompt Suite
Choose which prompts to run against both checkpoints.
⚖️ Legal QA · 1,000 prompts
Curated
Legal citation accuracy, regulatory interpretation, contract clause analysis, precedent identification. Designed to catch citation fabrication and regulatory hallucinations.
Est. runtime: ~12 min
🌐 General Knowledge · 2,500 prompts
Curated
Factual QA, multi-step reasoning, open-ended generation, instruction following, summarization. Comprehensive cross-domain baseline evaluation.
Est. runtime: ~30 min
🏥 Medical/Clinical · 1,000 prompts
Curated
Drug dosage, clinical guidelines, diagnostic reasoning, medical literature. Verified against UpToDate and clinical guidelines databases.
Est. runtime: ~12 min
📁 Upload Custom Prompt Suite
JSONL
Upload your own prompts in JSONL format, one JSON object per line: {"id": "p1", "prompt": "...", "domain": "legal", "expected": "..."}. A suite must contain 100–10,000 prompts.
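A custom suite can be sanity-checked locally before upload. The sketch below is not part of LLMGuard: `validate_suite` is a hypothetical helper, and it assumes that all four fields shown in the example line are required, which this page does not state explicitly.

```python
import json

# Field names are taken from the example line; treating all of them
# as required is an assumption.
REQUIRED_FIELDS = ("id", "prompt", "domain", "expected")
MIN_PROMPTS, MAX_PROMPTS = 100, 10_000


def validate_suite(lines):
    """Return a list of error strings for an iterable of JSONL lines."""
    errors = []
    count = 0
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        count += 1
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            errors.append(f"line {lineno}: missing {', '.join(missing)}")
    if not MIN_PROMPTS <= count <= MAX_PROMPTS:
        errors.append(
            f"suite has {count} prompts; expected {MIN_PROMPTS} to {MAX_PROMPTS}"
        )
    return errors
```

To check a file, pass the open file handle, for example `validate_suite(open("suite.jsonl", encoding="utf-8"))`, where `suite.jsonl` is a placeholder path. An empty list means no problems were found by these checks.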
Configure Evaluation
Advanced settings and deployment thresholds.
Select which domains from your prompt suite to include in the domain breakdown.
If the hallucination delta exceeds this value, the report recommends BLOCK DEPLOYMENT.
Number of tokens (N) in the window used for the rolling variance calculation. Default: 50. Smaller windows are more sensitive to short sequences; larger windows work better for long generations.
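To give a feel for how the window size behaves, here is a minimal sketch of a rolling variance over a sequence of per-token values. It is an illustration only: this page does not say which quantity LLMGuard computes the variance over (token log-probabilities are a plausible guess) or whether it uses population or sample variance. `rolling_variance` is a hypothetical name, and population variance is used here.

```python
from collections import deque
from statistics import pvariance


def rolling_variance(values, window=50):
    """Population variance of each full trailing window of `window` values.

    Returns one variance per position once `window` values have been seen,
    so a sequence shorter than `window` produces an empty list.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    buf = deque(maxlen=window)  # oldest values fall off automatically
    out = []
    for v in values:
        buf.append(v)
        if len(buf) == window:
            out.append(pvariance(buf))
    return out
```

In this sketch, a generation shorter than the window yields no variance values at all, which is one reason a smaller window suits short outputs.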
A POST request is sent to this URL when the evaluation completes. The body includes batch_id, status, and hallucination_delta.
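On the receiving side, a webhook handler mainly needs to parse the body and compare the delta to your block threshold. The sketch below assumes the body is JSON and that `hallucination_delta` is numeric; neither is confirmed on this page. `handle_webhook` is a hypothetical helper, and the 0.05 default is only an example threshold.

```python
import json

BLOCK_THRESHOLD = 0.05  # example value; use the threshold you configured

# Field names as listed for the webhook body.
EXPECTED_KEYS = ("batch_id", "status", "hallucination_delta")


def handle_webhook(body, threshold=BLOCK_THRESHOLD):
    """Parse a webhook body (str or bytes) and flag runs over the threshold."""
    payload = json.loads(body)
    missing = [k for k in EXPECTED_KEYS if k not in payload]
    if missing:
        raise ValueError(f"webhook body missing: {', '.join(missing)}")
    delta = float(payload["hallucination_delta"])
    return {
        "batch_id": payload["batch_id"],
        "status": payload["status"],
        "hallucination_delta": delta,
        # Blocks only when the delta strictly exceeds the threshold.
        "block": delta > threshold,
    }
```

The function is framework-agnostic: call it with the raw request body from whatever HTTP server you use, and gate your deployment step on the returned `block` flag.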
Review & Launch
Confirm your evaluation configuration before starting.
Evaluation Summary
Baseline checkpoint: v14
Candidate checkpoint: v15
Model family: GPT-4o
Prompt suite: legal_qa_1000
Prompt count: 1,000 prompts
Block threshold: Δ > 0.05
Estimated runtime: ~12 minutes
Estimated run cost: 1,000 runs (~$0.04 total)
Runs remaining in plan: 35,770 / 100,000
LLMGuard will call your model endpoint using the SDK's inference function. You'll be prompted to provide your inference function below, or pass a webhook URL for async evaluation.
```python
# Run this in your eval environment with your model loaded
import llmguard
from openai import OpenAI

client = OpenAI()
guard = llmguard.Client(api_key="lg_sk_...")

# Register the comparison: candidate v15 against baseline v14
batch = guard.create_batch(
    checkpoint_id="v15",
    baseline_id="v14",
    prompt_suite="legal_qa_1000"
)

# Inference function called once per prompt; returns the full
# completion, including token log-probabilities
def my_inference(prompt):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
        top_logprobs=5
    )

report = batch.run(model_fn=my_inference)
print(report.hallucination_delta, report.recommendation)
```