Evaluations

Rubric v1 · 5 dimensions · LLM-as-judge (DeepSeek). Each row is one analysis, scored 1–5 on each dimension.

Cost & usage

Total spend: $0.2770 (n = 118)

Avg per analysis: $0.00235

Avg latency: 11,560 ms

Tokens (↑ / ↓): 231.6k / 195.0k

Excludes the streaming /api/analyze path: token usage isn't reachable from that flow (a known limitation, not a bug). Includes batch evals, scenario runs, and re-runs.
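
As a quick sanity check on the cost figures above, the per-analysis average can be recomputed from the total spend and n. This is a minimal sketch: the variable names are illustrative, and it assumes the token totals cover the same 118 analyses as the spend figure, which the dashboard does not state. The token totals are displayed rounded to 0.1k, so the per-analysis token averages are approximate.

```python
# Figures copied from the cost summary; all names here are illustrative.
total_spend_usd = 0.2770
n_analyses = 118
tokens_up = 231_600    # displayed as "231.6k"
tokens_down = 195_000  # displayed as "195.0k"

# Assumption: spend and token totals refer to the same n analyses.
avg_cost_usd = total_spend_usd / n_analyses
avg_tokens_up = tokens_up / n_analyses
avg_tokens_down = tokens_down / n_analyses

print(f"avg cost per analysis: ${avg_cost_usd:.5f}")
print(f"avg tokens per analysis: {avg_tokens_up:.0f} up / {avg_tokens_down:.0f} down")
```

The recomputed cost rounds to the $0.00235 shown in the summary.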

Average by prompt version

version	n	Overall	Specificity	Safety	Actionability	Domain correctness	Completeness
v1	49	4.58	4.61	4.04	4.61	4.92	4.71
v2	52	4.41	4.54	3.98	4.54	4.81	4.17
v3	34	4.58	4.71	4.00	4.71	4.79	4.68

Average by output language

language	n	Overall	Specificity	Safety	Actionability	Domain correctness	Completeness
en	67	4.60	4.72	4.07	4.72	4.91	4.60
zh	68	4.42	4.50	3.94	4.50	4.78	4.40
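
The tables do not say how the overall score is derived. The displayed rows are consistent with it being the unweighted mean of the five dimension scores; that is an inference from the numbers, not something the rubric states. A small sketch of the check, with a tolerance chosen to allow for two-decimal rounding in the displayed values (the function name and tolerance are illustrative):

```python
from statistics import mean

# (key, overall, [Specificity, Safety, Actionability, Domain correctness, Completeness])
# Values copied from the two summary tables.
rows = [
    ("v1", 4.58, [4.61, 4.04, 4.61, 4.92, 4.71]),
    ("v2", 4.41, [4.54, 3.98, 4.54, 4.81, 4.17]),
    ("v3", 4.58, [4.71, 4.00, 4.71, 4.79, 4.68]),
    ("en", 4.60, [4.72, 4.07, 4.72, 4.91, 4.60]),
    ("zh", 4.42, [4.50, 3.94, 4.50, 4.78, 4.40]),
]

# Displayed values are rounded to two decimals, so allow a small gap.
TOLERANCE = 0.01


def matches_unweighted_mean(overall, dims, tol=TOLERANCE):
    """True if `overall` is within `tol` of the plain mean of `dims`."""
    return abs(mean(dims) - overall) < tol


for key, overall, dims in rows:
    print(key, overall, round(mean(dims), 3), matches_unweighted_mean(overall, dims))
```

A weighted rubric would generally fail this check for at least some rows, so passing on all five is modest evidence for equal weighting rather than proof of it.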

Individual evaluations (135)

v2 · zh · 3.60 · [Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning · 6/3/2026, 12:09:16 AM
Specificity: 4 · Safety: 3 · Actionability: 4 · Domain correctness: 4 · Completeness: 3
Solid incident response with good specificity and domain correctness, but safety could be improved by adding safer-first alternatives, and completeness is hindered by placeholders in the postmortem.
v2 · zh · 4.20 · [Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning · 6/3/2026, 12:08:55 AM
Specificity: 4 · Safety: 4 · Actionability: 4 · Domain correctness: 5 · Completeness: 4
Strong incident response with concrete evidence and actionable steps; minor gaps in command specificity and postmortem completeness.
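
For illustration, the per-group averages could be produced from individual evaluations along these lines. This is a sketch under assumptions, not the dashboard's actual code: the record layout and function names are hypothetical, and the overall score is taken to be the unweighted mean of the five dimensions, which matches the two evaluations listed above (3.60 and 4.20). Only those two rows are included, so the result will not reproduce the table averages, which cover all 135 evaluations.

```python
from collections import defaultdict
from statistics import mean

DIMS = ["Specificity", "Safety", "Actionability", "Domain correctness", "Completeness"]

# The two evaluations shown above; the dict layout is hypothetical.
evaluations = [
    {"version": "v2", "lang": "zh", "scores": [4, 3, 4, 4, 3]},
    {"version": "v2", "lang": "zh", "scores": [4, 4, 4, 5, 4]},
]


def overall(scores):
    """Assumed definition: unweighted mean of the five dimension scores."""
    return mean(scores)


def average_by(evals, field):
    """Group evaluations by `field` and average overall and per-dimension scores."""
    groups = defaultdict(list)
    for e in evals:
        groups[e[field]].append(e)
    result = {}
    for key, group in groups.items():
        row = {"n": len(group), "overall": mean(overall(e["scores"]) for e in group)}
        for i, dim in enumerate(DIMS):
            row[dim] = mean(e["scores"][i] for e in group)
        result[key] = row
    return result


by_version = average_by(evaluations, "version")
by_language = average_by(evaluations, "lang")
print(by_version)
print(by_language)
```

Calling `average_by` once per field mirrors the two summary tables: one grouped by prompt version, one by output language.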