Models Under Evaluation
| Model | Provider | Version | Type | Status | Last Eval |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | 2025-05 | Frontier | Active | 2h ago |
| Claude 3.5 Sonnet | Anthropic | 20241022 | Frontier | Active | 4h ago |
| Llama 3.1 70B | Meta (vLLM) | 3.1 | Fine-tuned | Running | Active |
| Mistral 7B | Mistral AI | v0.3 | Small/Fast | Active | 1d ago |
| Gemini 1.5 Pro | Google | 001 | Frontier | Deprecated | 7d ago |
| Custom RAG Fine-tune | Internal | v2.4 | Specialized | Pending | Never |
Test Suite Builder
Active Suite: RAG-QA-v3
48 cases
| ID | Question | Expected |
|---|---|---|
| T-001 | What is RAG? | Retrieval-Augmented Generation… |
| T-002 | Compare FAISS vs pgvector | Both are vector stores… |
| T-003 | Explain chain-of-thought | A prompting technique… |
| T-004 | List top LLM providers | OpenAI, Anthropic, Meta… |
Add Test Case
Question / Prompt
Expected Answer (Golden)
Category
Eval Run Console
Model
Test Suite
Framework
Run Progress
Ready. Click "Start Run" to evaluate.
G-Eval Results — Llama 3.1 70B
Coherence
8.7/10
▲ 0.4 vs prior run
Relevance
9.1/10
▲ 0.2
Fluency
8.2/10
Correctness
9.4/10
| Test ID | Question | Coherence | Relevance | Fluency | Explanation |
|---|---|---|---|---|---|
| T-001 | What is RAG? | 9.2 | 9.8 | 8.9 | Accurate, well-structured answer |
| T-002 | Compare FAISS vs pgvector | 8.4 | 9.1 | 7.8 | Missing latency tradeoff nuance |
| T-003 | Explain chain-of-thought | 7.6 | 8.9 | 8.2 | Good but verbose example |
| T-004 | List top LLM providers | 9.0 | 9.4 | 9.1 | Comprehensive, current list |
RAGAS Analytics
Faithfulness
0.94
Target: ≥0.90
Context Precision
0.88
Context Recall
0.82
Below target
Answer Relevancy
0.91
Answer Correctness
0.89
| Query | Faithfulness | Context Precision | Context Recall | Answer Relevancy |
|---|---|---|---|---|
| What is RAG? | 0.98 | 0.92 | 0.90 | 0.95 |
| Compare FAISS vs pgvector | 0.91 | 0.84 | 0.78 | 0.90 |
| Explain chain-of-thought | 0.96 | 0.88 | 0.76 | 0.85 |
| List top LLM providers | 0.88 | 0.82 | 0.84 | 0.92 |
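The per-query table above can be rolled up into per-metric means and compared against targets. The sketch below is illustrative only: it uses just the four rows shown here, so its means will not match the suite-level figures, which presumably cover all 48 cases. The names `ROWS`, `TARGETS`, `metric_means`, and `below_target` are hypothetical, and the targets are the thresholds stated elsewhere on this page (context precision has none listed).

```python
from statistics import mean

# Per-query RAGAS scores copied from the four rows shown above.
ROWS = {
    "What is RAG?": {
        "faithfulness": 0.98, "context_precision": 0.92,
        "context_recall": 0.90, "answer_relevancy": 0.95,
    },
    "Compare FAISS vs pgvector": {
        "faithfulness": 0.91, "context_precision": 0.84,
        "context_recall": 0.78, "answer_relevancy": 0.90,
    },
    "Explain chain-of-thought": {
        "faithfulness": 0.96, "context_precision": 0.88,
        "context_recall": 0.76, "answer_relevancy": 0.85,
    },
    "List top LLM providers": {
        "faithfulness": 0.88, "context_precision": 0.82,
        "context_recall": 0.84, "answer_relevancy": 0.92,
    },
}

# Assumption: targets taken from thresholds stated on this page.
TARGETS = {"faithfulness": 0.90, "context_recall": 0.85, "answer_relevancy": 0.85}


def metric_means(rows):
    """Average each metric across all queries."""
    metrics = next(iter(rows.values())).keys()
    return {m: mean(r[m] for r in rows.values()) for m in metrics}


def below_target(means, targets):
    """Return the metrics whose mean falls under its target."""
    return [m for m, t in targets.items() if means[m] < t]


means = metric_means(ROWS)
for name, value in means.items():
    print(f"{name}: {value:.3f}")
print("below target:", below_target(means, TARGETS))
```

For these four rows, context recall is the only metric that lands below its target, which agrees with the "Below target" flag on the Context Recall card.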
DeepEval Report — Run #48
Hallucination Rate
2.1%
▼ from 8.4%
Assertions Passed
96.8%
Bias Detected
0
0 out of 48
Toxicity
0%
| Metric | Assertion | Score | Status | Details |
|---|---|---|---|---|
| Hallucination | Score ≤ 0.10 | 0.021 | Pass | 2 minor factual deviations |
| Faithfulness | Score ≥ 0.90 | 0.94 | Pass | All claims grounded in context |
| Bias | Score = 0 | 0 | Pass | No bias patterns detected |
| Toxicity | Score = 0 | 0 | Pass | All responses safe |
| Answer Relevancy | Score ≥ 0.85 | 0.91 | Pass | High answer-to-query alignment |
| Context Recall | Score ≥ 0.85 | 0.82 | Fail | Missing context on 3 queries |
Model Comparison Matrix
| Model | Coherence | Faithfulness | RAGAS Score | Hallucination % | Cost/1K tok | Latency P95 | Rank |
|---|---|---|---|---|---|---|---|
| GPT-4o | 9.4 | 0.96 | 0.93 | 1.2% | $0.015 | 480ms | #1 |
| Claude 3.5 Sonnet | 9.2 | 0.95 | 0.91 | 1.8% | $0.012 | 420ms | #2 |
| Llama 3.1 70B | 8.7 | 0.94 | 0.88 | 2.1% | $0.003 | 620ms | #3 |
| Mistral 7B | 7.8 | 0.86 | 0.82 | 5.4% | $0.0008 | 180ms | #4 |
CI/CD Threshold Config
Kill-Switch Gates
Active on PR merge
Faithfulness ≥ 0.90
Hallucination ≤ 0.10
Context Recall ≥ 0.85
Coherence ≥ 8.0
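One way the four gates above could be checked in a CI step is sketched below. This is a minimal illustration, not the dashboard's actual implementation: the `Gate` class, `GATES` list, and `failed_gates` function are hypothetical names, and treating a missing metric as a failure is an assumption made here for safety.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Gate:
    metric: str
    op: Callable[[float, float], bool]  # comparison applied as op(value, threshold)
    threshold: float
    symbol: str  # used only for readable failure messages


# Thresholds copied from the kill-switch gates listed above.
GATES = [
    Gate("faithfulness", operator.ge, 0.90, "≥"),
    Gate("hallucination", operator.le, 0.10, "≤"),
    Gate("context_recall", operator.ge, 0.85, "≥"),
    Gate("coherence", operator.ge, 8.0, "≥"),
]


def failed_gates(scores: Dict[str, float]) -> List[str]:
    """Return one message per gate that the scores do not satisfy."""
    failures = []
    for gate in GATES:
        value = scores.get(gate.metric)
        # Assumption: a metric that was not reported counts as a failure.
        if value is None or not gate.op(value, gate.threshold):
            failures.append(
                f"{gate.metric}={value} (needs {gate.symbol} {gate.threshold})"
            )
    return failures


# Scores reported on this page for Llama 3.1 70B, run #48.
run_48 = {
    "faithfulness": 0.94,
    "hallucination": 0.021,
    "context_recall": 0.82,
    "coherence": 8.7,
}

failures = failed_gates(run_48)
print("BLOCKED" if failures else "PASSED", failures)
```

With the run #48 scores, only the context recall gate fails, matching the single failure reported for that run.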
Actions on Failure
On Gate Fail
Webhook URL
Notification Channel
Evaluation History & Trends
Total Runs
48
Avg Faithfulness
0.93
▲ trending up
Regressions Detected
3
CI Gates Blocked
2
| Run # | Date | Model | Suite | Faithfulness | Hallucination | Outcome |
|---|---|---|---|---|---|---|
| #48 | Today 08:14 | Llama 3.1 70B | RAG-QA-v3 | 0.94 | 2.1% | 1 fail |
| #47 | Yesterday | GPT-4o | RAG-QA-v3 | 0.96 | 1.2% | All pass |
| #46 | 2d ago | Claude 3.5 Sonnet | Factual-v2 | 0.95 | 1.8% | All pass |
| #45 | 3d ago | Mistral 7B | RAG-QA-v3 | 0.82 | 5.4% | Gate blocked |
| #44 | 4d ago | Llama 3.1 70B | RAG-QA-v3 | 0.87 | 3.8% | 2 fail |
Export & CI Integration
CI Integration
Connected
CI Platform
GitHub Actions
Trigger
On PR to main branch
Report Artifact
evallab-report-{sha}.json
Webhook
Active
[CI] PR #284: eval gate PASSED (faith=0.94)
[CI] PR #283: eval gate PASSED (faith=0.96)
[CI] PR #280: eval gate BLOCKED — hallucination=0.18
[CI] PR #278: eval gate PASSED (faith=0.92)
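The log lines above follow a regular enough shape to be parsed for downstream reporting. The sketch below assumes the format is exactly as shown in these four lines (it is not a documented schema); `LOG_LINE`, `parse_line`, and `blocked_prs` are hypothetical names.

```python
import re
from typing import Dict, List, Optional

# Assumption: every line looks like the four samples above, with one
# metric=value pair following the gate status.
LOG_LINE = re.compile(
    r"\[CI\] PR #(?P<pr>\d+): eval gate (?P<status>PASSED|BLOCKED)"
    r"\W+(?P<metric>\w+)=(?P<value>\d+(?:\.\d+)?)"
)

LINES = [
    "[CI] PR #284: eval gate PASSED (faith=0.94)",
    "[CI] PR #283: eval gate PASSED (faith=0.96)",
    "[CI] PR #280: eval gate BLOCKED — hallucination=0.18",
    "[CI] PR #278: eval gate PASSED (faith=0.92)",
]


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Extract PR number, gate status, and the reported metric from one line."""
    match = LOG_LINE.search(line)
    if match is None:
        return None  # line does not follow the assumed format
    return {
        "pr": int(match.group("pr")),
        "status": match.group("status"),
        "metric": match.group("metric"),
        "value": float(match.group("value")),
    }


def blocked_prs(lines: List[str]) -> List[int]:
    """Return the PR numbers whose eval gate was blocked."""
    parsed = (parse_line(line) for line in lines)
    return [p["pr"] for p in parsed if p and p["status"] == "BLOCKED"]


print(blocked_prs(LINES))
```

Here the blocked entry reports a hallucination score of 0.18, which exceeds the 0.10 ceiling configured in the kill-switch gates.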
Export Formats