New Evaluation — LLMGuard

Select Checkpoints

Choose your baseline checkpoint and the new candidate you want to test against it.

Baseline Checkpoint ID — the known-good checkpoint you're comparing against

Checkpoint IDs are arbitrary strings you define — they don't need to exist in any registry.

Candidate Checkpoint ID — the new checkpoint you want to validate

Model Family

🤖 GPT-4o

⚡ GPT-4o mini

🦙 Llama 3 8B

🦙 Llama 3 70B

💫 Mistral 7B

💎 Gemma 7B

🌊 Claude (proxy)

✨ Gemini (proxy)

Select Prompt Suite

Choose which prompts to run against both checkpoints.

⚖️ Legal QA · 1,000 prompts Curated

Legal citation accuracy, regulatory interpretation, contract clause analysis, precedent identification. Designed to catch citation fabrication and regulatory hallucinations.

Est. runtime: ~12 min

🌐 General Knowledge · 2,500 prompts Curated

Factual QA, multi-step reasoning, open-ended generation, instruction following, summarization. Comprehensive cross-domain baseline evaluation.

Est. runtime: ~30 min

🏥 Medical/Clinical · 1,000 prompts Curated

Drug dosage, clinical guidelines, diagnostic reasoning, medical literature. Verified against UpToDate and clinical guidelines databases.

Est. runtime: ~12 min

📁 Upload Custom Prompt Suite JSONL

Upload your own prompts in JSONL format, one JSON object per line, for example: {"id": "p1", "prompt": "...", "domain": "legal", "expected": "..."}. A suite can contain 100–10,000 prompts.
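Before uploading, it can help to check the file locally. The sketch below is not part of the LLMGuard SDK; `validate_suite` is a hypothetical helper. It assumes that all four fields shown in the example line are required and that `id` values should be unique, neither of which is stated above.

```python
import json

# Fields shown in the example line above. Treating all four as required
# is an assumption, not documented LLMGuard behavior.
REQUIRED_FIELDS = ("id", "prompt", "domain", "expected")
MIN_PROMPTS, MAX_PROMPTS = 100, 10_000  # suite size limits from the upload hint


def validate_suite(lines):
    """Return a list of error strings for an iterable of JSONL lines."""
    errors = []
    seen_ids = set()
    count = 0
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        count += 1
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            errors.append(f"line {lineno}: missing {', '.join(missing)}")
        rid = record.get("id")
        if isinstance(rid, str):
            # Unique ids are an assumption made for this sketch.
            if rid in seen_ids:
                errors.append(f"line {lineno}: duplicate id {rid!r}")
            seen_ids.add(rid)
    if not MIN_PROMPTS <= count <= MAX_PROMPTS:
        errors.append(
            f"suite has {count} prompts; expected {MIN_PROMPTS}-{MAX_PROMPTS}"
        )
    return errors
```

To check a file, pass an open file handle, for example `validate_suite(open("suite.jsonl", encoding="utf-8"))`; an empty list means no problems were found.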

Configure Evaluation

Advanced settings and deployment thresholds.

Domain Categories

Select which domains from your prompt suite to include in the domain breakdown.

Legal Reasoning Factual QA Medical Code Financial

Deployment Block Threshold

If hallucination delta exceeds this value, the report will recommend BLOCK DEPLOYMENT
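As a rough illustration of the threshold rule, the sketch below assumes the hallucination delta is the candidate's hallucination rate minus the baseline's; this page does not define how the delta is computed. The function name and the "OK TO DEPLOY" label are hypothetical, and the 0.05 default mirrors the example value on the Review & Launch step.

```python
def recommend(baseline_rate, candidate_rate, threshold=0.05):
    """Return a recommendation label for a candidate checkpoint.

    Assumes delta = candidate_rate - baseline_rate; the real definition
    used by LLMGuard is not given in this page.
    """
    delta = candidate_rate - baseline_rate
    # "Exceeds" is read as a strict comparison.
    if delta > threshold:
        return "BLOCK DEPLOYMENT"
    return "OK TO DEPLOY"  # hypothetical label for the non-blocking case
```

For example, a baseline rate of 0.02 and a candidate rate of 0.10 give a delta of about 0.08, which is above 0.05 and therefore blocks.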

Rolling Window Size (Temporal Variance)

Number of tokens (N) in the window used for the rolling variance calculation. Default: 50. Smaller windows are more sensitive to short sequences; larger windows work better for long generations.
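To make the window size concrete, here is a minimal sketch of a rolling variance over per-token values. It assumes the values are token log-probabilities (the SDK snippet requests `logprobs`), but the quantity LLMGuard actually tracks is not stated here, and `rolling_variance` is a hypothetical helper rather than an SDK function.

```python
from statistics import pvariance


def rolling_variance(values, window=50):
    """Population variance of each full trailing window of `window` values.

    Returns one variance per position once a full window is available,
    so a sequence shorter than `window` yields an empty list.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        pvariance(values[end - window:end])
        for end in range(window, len(values) + 1)
    ]
```

With a window of 50, a 60-token generation produces 11 variance values, while a 30-token generation produces none. That is one reason a smaller window may suit short outputs.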

Webhook URL (optional)

POST request sent when evaluation completes. Body includes batch_id, status, and hallucination_delta.
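A receiving endpoint could look roughly like the sketch below, built on Python's standard `http.server`. It assumes the POST body is JSON; only the three field names come from the description above. `parse_completion`, `CompletionHandler`, and `serve` are hypothetical names, and the possible `status` values are not documented here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Field names taken from the webhook description; JSON encoding is assumed.
EXPECTED_KEYS = ("batch_id", "status", "hallucination_delta")


def parse_completion(body):
    """Extract the documented fields from a webhook body, or raise ValueError."""
    payload = json.loads(body)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(payload, dict):
        raise ValueError("webhook body must be a JSON object")
    missing = [k for k in EXPECTED_KEYS if k not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return {k: payload[k] for k in EXPECTED_KEYS}


class CompletionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            event = parse_completion(self.rfile.read(length))
        except ValueError:
            self.send_response(400)  # reject bodies we cannot interpret
            self.end_headers()
            return
        print(event["batch_id"], event["status"], event["hallucination_delta"])
        self.send_response(204)  # acknowledge with no response body
        self.end_headers()


def serve(port=8080):
    """Block and handle webhook POSTs on the given port."""
    HTTPServer(("", port), CompletionHandler).serve_forever()
```

Calling `serve()` starts the listener. In practice the endpoint would sit behind HTTPS and some form of request authentication, neither of which is shown here.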

Review & Launch

Confirm your evaluation configuration before starting.

Evaluation Summary

Baseline checkpoint: v14

Candidate checkpoint: v15

Model family: GPT-4o

Prompt suite: legal_qa_1000

Prompt count: 1,000 prompts

Block threshold: Δ > 0.05

Estimated runtime: ~12 minutes

Estimated run cost: 1,000 runs (~$0.04 total)

Runs remaining in plan: 35,770 / 100,000

LLMGuard calls your model through an inference function that you supply to the SDK. You'll be prompted to provide that function below; alternatively, pass a webhook URL for async evaluation.

SDK Integration Snippet

                  python

# Run this in your eval environment with your model loaded
import llmguard
from openai import OpenAI

client = OpenAI()
guard = llmguard.Client(api_key="lg_sk_...")

batch = guard.create_batch(
    checkpoint_id="v15",
    baseline_id="v14",
    prompt_suite="legal_qa_1000"
)

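def my_inference(prompt):
    # Send the prompt as a single user message and request token
    # log-probabilities (top 5 alternatives per token) with the completion.
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
        top_logprobs=5,
    )

# Run the prompt suite through my_inference, then print the headline results.
report = batch.run(model_fn=my_inference)
print(report.hallucination_delta, report.recommendation)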