Inference Fleet
All Systems Healthy
🖥 Active Providers
3
⚡ RPS
284
Live
⏱ P95 Latency
420ms
✅ Uptime
99.94%
💸 Today's Spend
$2,840
Provider | Model | Type | RPS | P95 Latency | Error Rate | Health
Azure OpenAI | GPT-4o | Frontier | 142 | 480ms | 0.02% | Healthy
Anthropic | Claude 3.5 Sonnet | Frontier | 98 | 420ms | 0.01% | Healthy
vLLM (K8s) | Llama 3.1 70B | Self-hosted | 44 | 620ms | 0.18% | Degraded
Active Routing Rules
Rule | Condition | Target Model | Fallback | Traffic % | Status
Cost-optimize-small | tokens < 500 | Llama 3.1 70B | GPT-4o | 38% | Active
Complex-reasoning | tags includes "analyze" | GPT-4o | Claude 3.5 | 32% | Active
Code-generation | tags includes "code" | Claude 3.5 Sonnet | GPT-4o | 22% | Active
Default fallback | all other | GPT-4o | - | 8% | Catch-all
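The rules above read as an ordered, first-match routing table. A minimal Python sketch of that evaluation follows, assuming rules are checked top to bottom (the dashboard does not state precedence) and that each request carries a token count and a set of tags. The `Request`, `RULES`, and `route` names are hypothetical and not taken from the dashboard.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    tokens: int
    tags: set = field(default_factory=set)


# (rule name, predicate, target model, fallback model), checked in order.
# The order is an assumption; the dashboard does not show rule precedence.
RULES = [
    ("Cost-optimize-small", lambda r: r.tokens < 500, "Llama 3.1 70B", "GPT-4o"),
    ("Complex-reasoning", lambda r: "analyze" in r.tags, "GPT-4o", "Claude 3.5"),
    ("Code-generation", lambda r: "code" in r.tags, "Claude 3.5 Sonnet", "GPT-4o"),
    ("Default fallback", lambda r: True, "GPT-4o", None),
]


def route(request):
    """Return (rule name, target model, fallback model) for the first match."""
    for name, predicate, target, fallback in RULES:
        if predicate(request):
            return name, target, fallback
    raise RuntimeError("unreachable: the default rule matches every request")
```

With this ordering, a short request tagged "code" would still go to Llama 3.1 70B, because the token-count rule is checked first. Whether the real router behaves that way is not visible from the dashboard.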
Fallback Chain
Primary: Llama 3.1 70B → Fallback 1: GPT-4o → Fallback 2: Claude 3.5 → Circuit Breaker
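One possible reading of this chain is: try each model in order, and if every model fails, open a circuit breaker that rejects requests for a cooldown period. The sketch below follows that reading. The `FallbackChain` class, the `call_model` callable, and the 30-second cooldown are illustrative assumptions, not details shown on the dashboard.

```python
import time

CHAIN = ["Llama 3.1 70B", "GPT-4o", "Claude 3.5"]


class CircuitOpenError(RuntimeError):
    """Raised when the chain is exhausted or the breaker is still open."""


class FallbackChain:
    def __init__(self, call_model, cooldown_s=30.0, clock=time.monotonic):
        self.call_model = call_model  # hypothetical callable: (model, prompt) -> str
        self.cooldown_s = cooldown_s  # assumed value, not from the dashboard
        self.clock = clock
        self.open_until = 0.0

    def complete(self, prompt):
        # While the breaker is open, fail fast without calling any model.
        if self.clock() < self.open_until:
            raise CircuitOpenError("circuit open, request rejected")
        for model in CHAIN:
            try:
                return model, self.call_model(model, prompt)
            except Exception:
                continue  # fall through to the next model in the chain
        # Every model failed: open the breaker for the cooldown period.
        self.open_until = self.clock() + self.cooldown_s
        raise CircuitOpenError("all models in the chain failed")
```

A production breaker would usually also track failures per provider and probe with a trial request before closing again; this sketch keeps only the chain order and the final trip.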
Live Metrics
Auto-refresh: 5s
RPS (real-time)
284
Tokens/sec
14,200
P50 Latency
210ms
P95 Latency
420ms
P99 Latency
840ms
Error Rate
0.04%
Latency Distribution (last 5m)
[Histogram; x-axis ticks: 0ms, 100ms, 200ms, 420ms, 600ms, 840ms, 1000ms, 2000ms]
Trace Explorer
Trace ID | Model | Prompt (truncated) | Tokens In | Tokens Out | Latency | Cost
tr-001-8a4f | GPT-4o | Analyze quarterly report and summarize… | 1,240 | 420 | 481ms | $0.025
tr-002-c2d1 | Claude 3.5 | Generate Python function to parse JSON… | 480 | 680 | 390ms | $0.014
tr-003-b7e2 | Llama 3.1 | Explain RAG architecture for enterprise… | 320 | 840 | 634ms | $0.003
tr-004-f1a8 | GPT-4o | Classify customer support ticket… | 240 | 120 | 280ms | $0.008
K8s GPU Cluster Monitor
Active Pods
24
GPU Utilization
84%
HPA threshold: 80%
HPA Events
8
Today
VRAM Used
89%
Pod | Model | GPU % | VRAM | Requests | Status
llm-pod-0 | Llama 3.1 70B | 92% | 38.2/40GB | 18 | High
llm-pod-1 | Llama 3.1 70B | 78% | 31.2/40GB | 14 | Normal
llm-pod-2 | Llama 3.1 70B | 65% | 26.0/40GB | 10 | Normal
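For context on the HPA threshold, the Kubernetes Horizontal Pod Autoscaler computes its target as ceil(currentReplicas × currentMetric / targetMetric) and skips scaling while the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below applies that rule. It assumes the 84% GPU utilization is the averaged metric the HPA sees and that all 24 active pods belong to one scaled workload, neither of which the dashboard confirms.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Kubernetes HPA rule: ceil(current_replicas * current_metric / target_metric).

    Replicas stay unchanged while the metric ratio is within `tolerance`
    of 1.0 (0.1 is the Kubernetes default).
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * current_metric / target_metric)
```

Under those assumptions, 24 pods at 84% against an 80% target give a ratio of 1.05, which sits inside the default tolerance, so no scale-up would be triggered. At 92% the same rule would ask for 28 pods.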
Prompt Cache Manager
Cache Hit Rate
38.4%
▲ 12% this week
Tokens Saved
2.8M
Cost Saved
$420
This month
Cache Entries
18,420
Cache Key (prefix) | Hits | Misses | Hit Rate | TTL | Tokens Saved
classify:support:* | 4,820 | 380 | 92.7% | 1h | 144K
summarize:report:* | 1,240 | 480 | 72.1% | 6h | 82K
codegen:python:* | 840 | 920 | 47.7% | 2h | 28K
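The hit rates in this table are consistent with hits / (hits + misses). A small Python check, using only the counts shown above (the `hit_rate` helper and `PREFIX_STATS` name are illustrative):

```python
# (hits, misses) per cache-key prefix, copied from the table above.
PREFIX_STATS = {
    "classify:support:*": (4820, 380),
    "summarize:report:*": (1240, 480),
    "codegen:python:*": (840, 920),
}


def hit_rate(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return 0.0 if total == 0 else hits / total


for prefix, (hits, misses) in PREFIX_STATS.items():
    print(f"{prefix}: {hit_rate(hits, misses):.1%}")
```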
Model Quantization Tradeoffs
Model | Format | Size | MMLU Score | Latency P95 | Throughput | Accuracy vs FP16 baseline
Llama 3.1 70B | INT8 | 37GB | 80.2 | 540ms | 18 tok/s | 99.2%
Llama 3.1 70B | FP16 | 70GB | 80.8 | 620ms | 14 tok/s | 100% (baseline)
Llama 3.1 70B | INT4 | 19GB | 76.4 | 320ms | 28 tok/s | 94.6%
Llama 3.1 7B | INT8 | 4.8GB | 62.8 | 120ms | 80 tok/s | 98.8%
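One way to use this table is to pick the most accurate format whose weights fit a given VRAM budget. The sketch below does that for the three Llama 3.1 70B rows, using the sizes and MMLU scores as listed. The `best_format` helper and its `headroom_gb` parameter (space reserved for KV cache and activations) are hypothetical.

```python
# (format, size in GB, MMLU score) for Llama 3.1 70B, copied from the table above.
FORMATS_70B = [
    ("FP16", 70.0, 80.8),
    ("INT8", 37.0, 80.2),
    ("INT4", 19.0, 76.4),
]


def best_format(vram_gb, headroom_gb=0.0):
    """Return the highest-MMLU format that fits in vram_gb, or None if none fit."""
    fitting = [f for f in FORMATS_70B if f[1] + headroom_gb <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda f: f[2])[0]
```

On a 40GB card with no headroom this selects INT8. Reserving 5GB of headroom pushes the choice down to INT4, which is the tradeoff the table is meant to expose.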
Alert Manager
Active Alerts
2
Resolved (24h)
7
Escalated
1
MTTA
4.2 min
Alert | Condition | Severity | Fired | Escalation | Status
vLLM High Error Rate | error_rate > 0.15% | Warning | 12m ago | On-call: A. Kim | Firing
GPU Utilization Critical | gpu_util > 90% | Critical | 38m ago | Slack #llm-ops | Escalated
P99 Latency SLO Breach | p99 > 2000ms | Info | 2h ago | - | Resolved
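The alert conditions above are simple threshold comparisons. A minimal Python sketch of evaluating them against a metrics snapshot follows. The `ALERT_RULES` structure and `firing_alerts` function are hypothetical, and the metric keys simply mirror the condition names in the table.

```python
import operator

# (alert name, metric key, comparison, threshold, severity), from the table above.
# error_rate and gpu_util are in percent, p99 in milliseconds.
ALERT_RULES = [
    ("vLLM High Error Rate", "error_rate", operator.gt, 0.15, "Warning"),
    ("GPU Utilization Critical", "gpu_util", operator.gt, 90.0, "Critical"),
    ("P99 Latency SLO Breach", "p99", operator.gt, 2000.0, "Info"),
]


def firing_alerts(metrics):
    """Return (name, severity) for every rule whose condition currently holds."""
    return [
        (name, severity)
        for name, key, compare, threshold, severity in ALERT_RULES
        if key in metrics and compare(metrics[key], threshold)
    ]
```

Feeding in the values shown elsewhere on this page (vLLM error rate 0.18%, P99 of 840ms, and 92% for the busiest pod, assuming the GPU alert is evaluated per pod rather than on the 84% aggregate) yields the first two alerts and not the third, which matches the statuses listed.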
LLM Cost Tracker
Month to Date
$48,200
Budget: $60,000
Projected Month
$54,800
Cost per 1K Tokens
$0.0082
▼ 18% vs last month
Cache Savings
$8,400
Provider | Model | Tokens Used | Cost MTD | % of MTD Spend | Trend
Azure OpenAI | GPT-4o | 1.82B | $27,300 | 57% | ↑ +8%
Anthropic | Claude 3.5 Sonnet | 1.12B | $13,440 | 28% | ↑ +4%
vLLM (K8s) | Llama 3.1 70B | 2.48B | $7,440 | 15% | ↑ +14%
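A quick check of this table's arithmetic, assuming the percentages are each provider's share of month-to-date spend (they match that reading, not a share of the $60,000 budget). The sketch also derives an effective cost per 1K tokens for each provider. The `USAGE` list and `cost_breakdown` function are illustrative names.

```python
# (provider, tokens used, cost MTD in USD), copied from the table above.
USAGE = [
    ("Azure OpenAI", 1.82e9, 27_300),
    ("Anthropic", 1.12e9, 13_440),
    ("vLLM (K8s)", 2.48e9, 7_440),
]


def cost_breakdown(usage):
    """Per provider: share of total spend (whole %) and USD per 1K tokens."""
    total = sum(cost for _, _, cost in usage)
    return {
        provider: {
            "share_pct": round(100 * cost / total),
            "usd_per_1k_tokens": round(cost / tokens * 1000, 4),
        }
        for provider, tokens, cost in usage
    }
```

By these figures the self-hosted Llama deployment serves the most tokens at roughly a fifth of the per-token cost of GPT-4o, which is the motivation behind the cost-optimizing routing rule.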
Incident History
ID | Title | Severity | Start | Duration | MTTR | Root Cause
INC-084 | vLLM OOM: pod restart loop | P2 | Jun 18 02:14 | 42 min | 38 min | Insufficient VRAM for batch size 32
INC-083 | Azure OpenAI rate limit hit | P3 | Jun 12 14:28 | 18 min | 12 min | Traffic spike during batch job
INC-082 | P99 latency breach: SLO violation | P2 | Jun 8 09:44 | 28 min | 24 min | Cache miss storm after deploy
INC-081 | Token cost spike: $4K/hr | P1 | Jun 2 22:10 | 4 min | 4 min | Prompt injection expanding tokens