Model Registry
6 Models Registered
Models Under Evaluation
Model | Provider | Version | Type | Status | Last Eval
GPT-4o | OpenAI | 2025-05 | Frontier | Active | 2h ago
Claude 3.5 Sonnet | Anthropic | 20241022 | Frontier | Active | 4h ago
Llama 3.1 70B | Meta (vLLM) | 3.1 | Fine-tuned | Running | Active
Mistral 7B | Mistral AI | v0.3 | Small/Fast | Active | 1d ago
Gemini 1.5 Pro | Google | 001 | Frontier | Deprecated | 7d ago
Custom RAG Fine-tune | Internal | v2.4 | Specialized | Pending | Never
Test Suite Builder
Active Suite: RAG-QA-v3 (48 cases)
ID | Question | Expected
T-001 | What is RAG? | Retrieval-Augmented Generation…
T-002 | Compare FAISS vs pgvector | Both are vector stores…
T-003 | Explain chain-of-thought | A prompting technique…
T-004 | List top LLM providers | OpenAI, Anthropic, Meta…
Add Test Case
Question / Prompt
Expected Answer (Golden)
Category
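The form fields above map naturally onto a small record type. The following is a minimal sketch, assuming a test case carries the ID, question, golden answer, and category shown in the suite table and form. The names `SuiteCase`, `next_case_id`, and the default category value are hypothetical, and the truncated golden answers are kept exactly as displayed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteCase:
    """One row of a test suite (hypothetical schema inferred from the form)."""
    case_id: str
    question: str
    expected: str                    # golden answer
    category: str = "uncategorized"  # assumed default; the UI does not show one

    def __post_init__(self):
        # Reject empty prompts and empty golden answers up front.
        if not self.question.strip():
            raise ValueError("question must not be empty")
        if not self.expected.strip():
            raise ValueError("expected answer must not be empty")


def next_case_id(existing):
    """Return the next ID in the T-001, T-002, ... sequence."""
    highest = max((int(c.case_id.split("-")[1]) for c in existing), default=0)
    return f"T-{highest + 1:03d}"


# The four cases visible in the suite table (golden answers are truncated there).
suite = [
    SuiteCase("T-001", "What is RAG?", "Retrieval-Augmented Generation…"),
    SuiteCase("T-002", "Compare FAISS vs pgvector", "Both are vector stores…"),
    SuiteCase("T-003", "Explain chain-of-thought", "A prompting technique…"),
    SuiteCase("T-004", "List top LLM providers", "OpenAI, Anthropic, Meta…"),
]

new_id = next_case_id(suite)
```

Making the record frozen keeps a suite version immutable once a run has referenced it, which is one reasonable design choice rather than something the dashboard confirms.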
Eval Run Console
Model
Test Suite
Framework
Run Progress
Ready. Click "Start Run" to evaluate.
G-Eval Results — Llama 3.1 70B
Coherence: 8.7/10 (▲ 0.4 vs prior run)
Relevance: 9.1/10 (▲ 0.2)
Fluency: 8.2/10
Correctness: 9.4/10
Test ID | Question | Coherence | Relevance | Fluency | Explanation
T-001 | What is RAG? | 9.2 | 9.8 | 8.9 | Accurate, well-structured answer
T-002 | Compare FAISS vs pgvector | 8.4 | 9.1 | 7.8 | Missing latency tradeoff nuance
T-003 | Explain chain-of-thought | 7.6 | 8.9 | 8.2 | Good but verbose example
T-004 | List top LLM providers | 9.0 | 9.4 | 9.1 | Comprehensive, current list
RAGAS Analytics
Faithfulness: 0.94 (Target: ≥0.90)
Context Precision: 0.88
Context Recall: 0.82 (Below target)
Answer Relevancy: 0.91
Answer Correctness: 0.89
Query | Faithfulness | Context Precision | Context Recall | Answer Relevancy
What is RAG? | 0.98 | 0.92 | 0.90 | 0.95
Compare FAISS vs pgvector | 0.91 | 0.84 | 0.78 | 0.90
Explain chain-of-thought | 0.96 | 0.88 | 0.76 | 0.85
List top LLM providers | 0.88 | 0.82 | 0.84 | 0.92
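The per-query scores can be aggregated and screened against a target with a few lines of Python. This is a sketch over the four displayed rows only, so its means need not match the headline figures, which presumably cover the whole suite. The helper names are hypothetical, and the 0.85 Context Recall target is taken from the threshold listed elsewhere on this page.

```python
from statistics import mean

# Per-query RAGAS scores as displayed in the table.
rows = {
    "What is RAG?": {
        "faithfulness": 0.98, "context_precision": 0.92,
        "context_recall": 0.90, "answer_relevancy": 0.95,
    },
    "Compare FAISS vs pgvector": {
        "faithfulness": 0.91, "context_precision": 0.84,
        "context_recall": 0.78, "answer_relevancy": 0.90,
    },
    "Explain chain-of-thought": {
        "faithfulness": 0.96, "context_precision": 0.88,
        "context_recall": 0.76, "answer_relevancy": 0.85,
    },
    "List top LLM providers": {
        "faithfulness": 0.88, "context_precision": 0.82,
        "context_recall": 0.84, "answer_relevancy": 0.92,
    },
}

RECALL_TARGET = 0.85  # assumed to be the Context Recall threshold used by the gates


def metric_means(rows):
    """Average each metric across all queries."""
    metrics = next(iter(rows.values())).keys()
    return {m: mean(r[m] for r in rows.values()) for m in metrics}


def below_target(rows, metric, target):
    """List the queries whose score for `metric` is under `target`."""
    return [query for query, scores in rows.items() if scores[metric] < target]


means = metric_means(rows)
weak_recall = below_target(rows, "context_recall", RECALL_TARGET)
```

On these four rows, `weak_recall` contains three queries, with only "What is RAG?" clearing the recall target.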
DeepEval Report — Run #48
Hallucination Rate: 2.1% (▼ from 8.4%)
Assertions Passed: 96.8%
Bias Detected: 0 (0 out of 48)
Toxicity: 0%
Metric | Assertion | Score | Status | Details
Hallucination | Score ≤ 0.10 | 0.021 | Pass | 2 minor factual deviations
Faithfulness | Score ≥ 0.90 | 0.94 | Pass | All claims grounded in context
Bias | Score = 0 | 0 | Pass | No bias patterns detected
Toxicity | Score = 0 | 0 | Pass | All responses safe
Answer Relevancy | Score ≥ 0.85 | 0.91 | Pass | High answer-to-query alignment
Context Recall | Score ≥ 0.85 | 0.82 | Fail | Missing context on 3 queries
Model Comparison Matrix
Model | Coherence | Faithfulness | RAGAS Score | Hallucination % | Cost/1K tok | Latency P95 | Rank
GPT-4o | 9.4 | 0.96 | 0.93 | 1.2% | $0.015 | 480ms | #1
Claude 3.5 Sonnet | 9.2 | 0.95 | 0.91 | 1.8% | $0.012 | 420ms | #2
Llama 3.1 70B | 8.7 | 0.94 | 0.88 | 2.1% | $0.003 | 620ms | #3
Mistral 7B | 7.8 | 0.86 | 0.82 | 5.4% | $0.0008 | 180ms | #4
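The page does not say how the Rank column is derived. As an illustration only, the sketch below orders the four models by RAGAS score, an assumption that happens to reproduce the displayed ranking, and then re-ranks them by P95 latency to show how the order shifts with the criterion. `ModelRow` and `rank_by` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRow:
    name: str
    coherence: float
    faithfulness: float
    ragas: float
    hallucination_pct: float
    cost_per_1k: float      # USD per 1K tokens
    latency_p95_ms: int


# Values copied from the comparison matrix.
matrix = [
    ModelRow("GPT-4o", 9.4, 0.96, 0.93, 1.2, 0.015, 480),
    ModelRow("Claude 3.5 Sonnet", 9.2, 0.95, 0.91, 1.8, 0.012, 420),
    ModelRow("Llama 3.1 70B", 8.7, 0.94, 0.88, 2.1, 0.003, 620),
    ModelRow("Mistral 7B", 7.8, 0.86, 0.82, 5.4, 0.0008, 180),
]


def rank_by(rows, key, higher_is_better=True):
    """Map each model name to its 1-based rank under the given criterion."""
    ordered = sorted(rows, key=key, reverse=higher_is_better)
    return {row.name: position for position, row in enumerate(ordered, start=1)}


# Assumed ranking criterion: RAGAS score, highest first.
quality_rank = rank_by(matrix, key=lambda r: r.ragas)

# Alternative criterion: P95 latency, lowest first.
latency_rank = rank_by(matrix, key=lambda r: r.latency_p95_ms, higher_is_better=False)
```

Under the latency criterion Mistral 7B moves to first place and Llama 3.1 70B to last, which is why a single Rank column is best read together with the cost and latency columns.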
CI/CD Threshold Config
Kill-Switch Gates
Active on PR merge
Faithfulness ≥ 0.90
Hallucination ≤ 0.10
Context Recall ≥ 0.85
Coherence ≥ 8.0
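The four gates above can be expressed as data and checked against a run's metrics. The following is a minimal sketch, not the dashboard's actual implementation: `Gate` and `failed_gates` are hypothetical names, hallucination is assumed to be a fraction (0.021 rather than 2.1%), and a missing metric is treated as a failure. The sample metrics are the Llama 3.1 70B figures reported for Run #48 on this page.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    metric: str
    compare: Callable[[float, float], bool]  # compare(value, threshold)
    threshold: float


# Thresholds copied from the Kill-Switch Gates list.
GATES = [
    Gate("faithfulness", operator.ge, 0.90),
    Gate("hallucination", operator.le, 0.10),
    Gate("context_recall", operator.ge, 0.85),
    Gate("coherence", operator.ge, 8.0),
]


def failed_gates(metrics):
    """Return the names of gates that the run does not satisfy."""
    failures = []
    for gate in GATES:
        value = metrics.get(gate.metric)
        # Assumption: a metric that was not measured blocks the merge.
        if value is None or not gate.compare(value, gate.threshold):
            failures.append(gate.metric)
    return failures


run_48 = {
    "faithfulness": 0.94,
    "hallucination": 0.021,
    "context_recall": 0.82,
    "coherence": 8.7,
}

failures = failed_gates(run_48)
should_block = bool(failures)
```

With these inputs only the Context Recall gate fails. A CI step could exit non-zero when `should_block` is true so that the merge is stopped.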
Actions on Failure
On Gate Fail
Webhook URL
Notification Channel
Evaluation History & Trends
Total Runs: 48
Avg Faithfulness: 0.93 (▲ trending up)
Regressions Detected: 3
CI Gates Blocked: 2
Run # | Date | Model | Suite | Faithfulness | Hallucination | Outcome
#48 | Today 08:14 | Llama 3.1 70B | RAG-QA-v3 | 0.94 | 2.1% | 1 fail
#47 | Yesterday | GPT-4o | RAG-QA-v3 | 0.96 | 1.2% | All pass
#46 | 2d ago | Claude 3.5 Sonnet | Factual-v2 | 0.95 | 1.8% | All pass
#45 | 3d ago | Mistral 7B | RAG-QA-v3 | 0.82 | 5.4% | Gate blocked
#44 | 4d ago | Llama 3.1 70B | RAG-QA-v3 | 0.87 | 3.8% | 2 fail
Export & CI Integration
CI Integration: Connected
CI Platform: GitHub Actions
Trigger: On PR to main branch
Report Artifact: evallab-report-{sha}.json
Webhook: Active
[CI] PR #284: eval gate PASSED (faith=0.94)
[CI] PR #283: eval gate PASSED (faith=0.96)
[CI] PR #280: eval gate BLOCKED — hallucination=0.18
[CI] PR #278: eval gate PASSED (faith=0.92)
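Log lines in this shape are regular enough to parse for alerting or trend reports. The sketch below assumes every line follows one of the two layouts seen above (a parenthesised metric after PASSED, or a dash-separated metric after BLOCKED). The pattern and the `parse_line` helper are hypothetical, not part of any CI tool.

```python
import re

# Matches "[CI] PR #<n>: eval gate <STATUS>" followed by "<metric>=<value>",
# with any punctuation between them and an optional closing parenthesis.
LINE_RE = re.compile(
    r"^\[CI\] PR #(?P<pr>\d+): eval gate (?P<status>PASSED|BLOCKED)"
    r"\W+(?P<metric>\w+)=(?P<value>\d+(?:\.\d+)?)\)?$"
)


def parse_line(line):
    """Return a dict for a recognised gate line, or None otherwise."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "pr": int(match["pr"]),
        "status": match["status"],
        "metric": match["metric"],
        "value": float(match["value"]),
    }


# The four lines shown in the log (\u2014 is the dash used in the BLOCKED line).
log = [
    "[CI] PR #284: eval gate PASSED (faith=0.94)",
    "[CI] PR #283: eval gate PASSED (faith=0.96)",
    "[CI] PR #280: eval gate BLOCKED \u2014 hallucination=0.18",
    "[CI] PR #278: eval gate PASSED (faith=0.92)",
]

events = [event for event in map(parse_line, log) if event is not None]
blocked_prs = [event["pr"] for event in events if event["status"] == "BLOCKED"]
```

Unrecognised lines are skipped rather than raising, so the parser tolerates unrelated CI output mixed into the same stream.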
Export Formats