LLM Eval Lab

Models Under Evaluation

Model	Provider	Version	Type	Status	Last Eval
GPT-4o	OpenAI	2025-05	Frontier	Active	2h ago
Claude 3.5 Sonnet	Anthropic	20241022	Frontier	Active	4h ago
Llama 3.1 70B	Meta (vLLM)	3.1	Fine-tuned	Running	Active
Mistral 7B	Mistral AI	v0.3	Small/Fast	Active	1d ago
Gemini 1.5 Pro	Google	001	Frontier	Deprecated	7d ago
Custom RAG Fine-tune	Internal	v2.4	Specialized	Pending	Never

Test Suite Builder

Active Suite: RAG-QA-v3

48 cases

ID	Question	Expected
T-001	What is RAG?	Retrieval-Augmented Generation…
T-002	Compare FAISS vs pgvector	Both are vector stores…
T-003	Explain chain-of-thought	A prompting technique…
T-004	List top LLM providers	OpenAI, Anthropic, Meta…

Add Test Case

Fields: Question / Prompt, Expected Answer (Golden), Category

Eval Run Console

Selectors: Model, Test Suite, Framework

Run Progress

Ready. Click "Start Run" to evaluate.

G-Eval Results — Llama 3.1 70B

Metric	Score	Change
Coherence	8.7/10	▲ 0.4 vs prior run
Relevance	9.1/10	▲ 0.2
Fluency	8.2/10	
Correctness	9.4/10	

Test ID	Question	Coherence	Relevance	Fluency	Explanation
T-001	What is RAG?	9.2	9.8	8.9	Accurate, well-structured answer
T-002	Compare FAISS vs pgvector	8.4	9.1	7.8	Missing latency tradeoff nuance
T-003	Explain chain-of-thought	7.6	8.9	8.2	Good but verbose example
T-004	List top LLM providers	9.0	9.4	9.1	Comprehensive, current list

RAGAS Analytics

Metric	Score	Note
Faithfulness	0.94	Target: ≥0.90
Context Precision	0.88	
Context Recall	0.82	Below target
Answer Relevancy	0.91	
Answer Correctness	0.89	

Query	Faithfulness	Context Precision	Context Recall	Answer Relevancy
What is RAG?	0.98	0.92	0.90	0.95
Compare FAISS vs pgvector	0.91	0.84	0.78	0.90
Explain chain-of-thought	0.96	0.88	0.76	0.85
List top LLM providers	0.88	0.82	0.84	0.92
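
As a rough illustration of how the per-query rows relate to the "Below target" flag, the sketch below averages each metric over the four rows shown and lists the queries whose Context Recall falls under 0.85. The 0.85 cutoff is assumed to be the same threshold used by the Context Recall assertion and gate on this dashboard; the data layout and the helper names `metric_means` and `below_target` are hypothetical and are not part of RAGAS. Only 4 of the 48 cases are listed, so these means need not equal the headline scores.

```python
from statistics import mean

# Per-query scores copied from the table above (4 of the 48 cases).
ROWS = {
    "What is RAG?": {
        "faithfulness": 0.98, "context_precision": 0.92,
        "context_recall": 0.90, "answer_relevancy": 0.95,
    },
    "Compare FAISS vs pgvector": {
        "faithfulness": 0.91, "context_precision": 0.84,
        "context_recall": 0.78, "answer_relevancy": 0.90,
    },
    "Explain chain-of-thought": {
        "faithfulness": 0.96, "context_precision": 0.88,
        "context_recall": 0.76, "answer_relevancy": 0.85,
    },
    "List top LLM providers": {
        "faithfulness": 0.88, "context_precision": 0.82,
        "context_recall": 0.84, "answer_relevancy": 0.92,
    },
}

# Assumption: same threshold as the Context Recall gate (>= 0.85).
RECALL_TARGET = 0.85


def metric_means(rows):
    """Return the unweighted mean of every metric across the given queries."""
    metrics = next(iter(rows.values())).keys()
    return {m: mean(scores[m] for scores in rows.values()) for m in metrics}


def below_target(rows, metric, target):
    """Return the queries (in table order) whose score for `metric` is under `target`."""
    return [query for query, scores in rows.items() if scores[metric] < target]


if __name__ == "__main__":
    for name, value in metric_means(ROWS).items():
        print(f"{name}: {value:.3f}")
    print("Below recall target:", below_target(ROWS, "context_recall", RECALL_TARGET))
```

On these four rows, three queries fall below the assumed recall target, which lines up with the "Missing context on 3 queries" detail in the DeepEval report.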

DeepEval Report — Run #48

Metric	Value	Note
Hallucination Rate	2.1%	▼ from 8.4%
Assertions Passed	96.8%	
Bias Detected	0	0 out of 48
Toxicity	0%	

Metric	Assertion	Score	Status	Details
Hallucination	Score ≤ 0.10	0.021	Pass	2 minor factual deviations
Faithfulness	Score ≥ 0.90	0.94	Pass	All claims grounded in context
Bias	Score = 0	0	Pass	No bias patterns detected
Toxicity	Score = 0	0	Pass	All responses safe
Answer Relevancy	Score ≥ 0.85	0.91	Pass	High answer-to-query alignment
Context Recall	Score ≥ 0.85	0.82	Fail	Missing context on 3 queries

Model Comparison Matrix

Model	Coherence	Faithfulness	RAGAS Score	Hallucination %	Cost/1K tok	Latency P95	Rank
GPT-4o	9.4	0.96	0.93	1.2%	$0.015	480ms	#1
Claude 3.5 Sonnet	9.2	0.95	0.91	1.8%	$0.012	420ms	#2
Llama 3.1 70B	8.7	0.94	0.88	2.1%	$0.003	620ms	#3
Mistral 7B	7.8	0.86	0.82	5.4%	$0.0008	180ms	#4
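
The matrix does not say how the Rank column is derived. As a small sketch, the snippet below sorts the four rows by a single column; on this data, ordering by RAGAS Score (highest first) or by Hallucination % (lowest first) happens to reproduce the displayed rank, while ordering by cost does not. The field names and the `rank_by` helper are hypothetical, and the real ranking may well be a weighted combination of several columns.

```python
# Rows copied from the comparison matrix above; key names are illustrative.
MATRIX = [
    {"model": "GPT-4o", "coherence": 9.4, "faithfulness": 0.96,
     "ragas": 0.93, "hallucination_pct": 1.2, "cost_per_1k": 0.015, "latency_p95_ms": 480},
    {"model": "Claude 3.5 Sonnet", "coherence": 9.2, "faithfulness": 0.95,
     "ragas": 0.91, "hallucination_pct": 1.8, "cost_per_1k": 0.012, "latency_p95_ms": 420},
    {"model": "Llama 3.1 70B", "coherence": 8.7, "faithfulness": 0.94,
     "ragas": 0.88, "hallucination_pct": 2.1, "cost_per_1k": 0.003, "latency_p95_ms": 620},
    {"model": "Mistral 7B", "coherence": 7.8, "faithfulness": 0.86,
     "ragas": 0.82, "hallucination_pct": 5.4, "cost_per_1k": 0.0008, "latency_p95_ms": 180},
]


def rank_by(rows, key, descending=True):
    """Return model names ordered by one column (best first when direction is chosen correctly)."""
    ordered = sorted(rows, key=lambda row: row[key], reverse=descending)
    return [row["model"] for row in ordered]


if __name__ == "__main__":
    print("By RAGAS score:   ", rank_by(MATRIX, "ragas"))
    print("By hallucination: ", rank_by(MATRIX, "hallucination_pct", descending=False))
    print("By cost (cheapest):", rank_by(MATRIX, "cost_per_1k", descending=False))
```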

CI/CD Threshold Config

Kill-Switch Gates

Active on PR merge

Faithfulness ≥ 0.90

Hallucination ≤ 0.10

Context Recall ≥ 0.85

Coherence ≥ 8.0
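
A minimal sketch of how the four gates above could be checked in a CI step, written in Python. The `Gate` dataclass, the metric key names, and the `failed_gates` and `exit_code` helpers are assumptions made for illustration, not the tool's actual configuration format. The sample metrics are the Run #48 values for Llama 3.1 70B shown on this dashboard, with the 2.1% hallucination rate expressed as 0.021.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    metric: str                         # key in the metrics dict (hypothetical naming)
    op: Callable[[float, float], bool]  # comparison that must hold for the gate to pass
    threshold: float
    symbol: str                         # used only for readable failure messages


# Thresholds copied from the Kill-Switch Gates list above.
GATES = [
    Gate("faithfulness", operator.ge, 0.90, ">="),
    Gate("hallucination", operator.le, 0.10, "<="),
    Gate("context_recall", operator.ge, 0.85, ">="),
    Gate("coherence", operator.ge, 8.0, ">="),
]

# Run #48 (Llama 3.1 70B) values as reported on the dashboard.
RUN_48 = {
    "faithfulness": 0.94,
    "hallucination": 0.021,
    "context_recall": 0.82,
    "coherence": 8.7,
}


def failed_gates(metrics: dict) -> list:
    """Return one message per gate that does not hold for the given metrics."""
    failures = []
    for gate in GATES:
        value = metrics.get(gate.metric)
        # Assumption: a missing metric fails closed rather than being skipped.
        if value is None or not gate.op(value, gate.threshold):
            failures.append(f"{gate.metric} {gate.symbol} {gate.threshold} (got {value})")
    return failures


def exit_code(metrics: dict) -> int:
    """Return 1 if any gate fails, else 0, suitable for passing to sys.exit in CI."""
    return 1 if failed_gates(metrics) else 0


if __name__ == "__main__":
    for message in failed_gates(RUN_48):
        print("GATE FAILED:", message)
```

With the Run #48 values, only the Context Recall gate fails, matching the single failure reported for that run. In a pipeline, a step could call `sys.exit(exit_code(metrics))` so that a non-zero status blocks the merge.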

Actions on Failure

Settings: On Gate Fail, Webhook URL, Notification Channel

Evaluation History & Trends

Metric	Value	Note
Total Runs	48	
Avg Faithfulness	0.93	▲ trending up
Regressions Detected	3	
CI Gates Blocked	2	

Run #	Date	Model	Suite	Faithfulness	Hallucination	Outcome
#48	Today 08:14	Llama 3.1 70B	RAG-QA-v3	0.94	2.1%	1 fail
#47	Yesterday	GPT-4o	RAG-QA-v3	0.96	1.2%	All pass
#46	2d ago	Claude 3.5 Sonnet	Factual-v2	0.95	1.8%	All pass
#45	3d ago	Mistral 7B	RAG-QA-v3	0.82	5.4%	Gate blocked
#44	4d ago	Llama 3.1 70B	RAG-QA-v3	0.87	3.8%	2 fail

Export & CI Integration

Setting	Value
CI Integration	Connected
CI Platform	GitHub Actions
Trigger	On PR to main branch
Report Artifact	evallab-report-{sha}.json
Webhook	Active
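
The report artifact name follows the pattern `evallab-report-{sha}.json`. The sketch below shows one way such a file could be named and written in Python. The report contents, the `artifact_name` and `write_report` helpers, and the choice of the full commit SHA (rather than a shortened one) are assumptions; the actual report schema is not shown on this page. `GITHUB_SHA` is the default environment variable that GitHub Actions sets to the triggering commit.

```python
import json
import os
from pathlib import Path

# Naming pattern taken from the "Report Artifact" setting above.
ARTIFACT_TEMPLATE = "evallab-report-{sha}.json"


def artifact_name(sha: str) -> str:
    """Fill the commit SHA into the artifact file name pattern."""
    return ARTIFACT_TEMPLATE.format(sha=sha)


def write_report(report: dict, out_dir: Path, sha: str) -> Path:
    """Serialize `report` as JSON into `out_dir` and return the written path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / artifact_name(sha)
    path.write_text(json.dumps(report, indent=2, sort_keys=True), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Falls back to "local" when run outside GitHub Actions.
    sha = os.environ.get("GITHUB_SHA", "local")
    # Hypothetical report body; the real schema is not documented here.
    report = {"suite": "RAG-QA-v3", "faithfulness": 0.94, "hallucination": 0.021}
    print("Wrote", write_report(report, Path("eval-artifacts"), sha))
```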

[CI] PR #284: eval gate PASSED (faith=0.94)

[CI] PR #283: eval gate PASSED (faith=0.96)

[CI] PR #280: eval gate BLOCKED — hallucination=0.18

[CI] PR #278: eval gate PASSED (faith=0.92)

Export Formats