Evaluations

Rubric v1 · 5 dimensions · LLM-as-judge (DeepSeek). Each row is one analysis, scored 1–5 on each dimension.

Cost & usage

Total spend: $0.2709 (n = 116)
Avg per analysis: $0.00234
Avg latency: 11,411 ms
↑ / ↓ tokens: 223.6k / 191.4k

Excludes the streaming /api/analyze path, because usage isn't reachable from that flow (a limitation, not a bug). Includes batch evals, scenario runs, and re-runs.
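The per-analysis average can be reproduced from the totals shown above. This is a minimal sketch, not the dashboard's own code: the variable names are hypothetical, the token counts are the rounded "k" figures as displayed, and reading ↑ as tokens sent to the model and ↓ as tokens received is an assumption.

```python
# Figures copied from the cost summary; names are illustrative only.
total_spend_usd = 0.2709
n_analyses = 116

# Rounded display values ("223.6k" / "191.4k").
# Assumption: "↑" is tokens sent to the model, "↓" is tokens received.
tokens_up = 223_600
tokens_down = 191_400

# Average cost and average token volume per tracked analysis.
avg_cost_usd = total_spend_usd / n_analyses
avg_tokens = (tokens_up + tokens_down) / n_analyses

print(f"avg cost per analysis: ${avg_cost_usd:.5f}")
print(f"avg tokens per analysis: {avg_tokens:.0f}")
```

The computed average cost rounds to the $0.00234 shown on the page. The token average is only approximate, since it starts from rounded totals.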

Average by prompt version

key  n   overall  Specificity  Safety  Actionability  Domain correctness  Completeness
v1   49  4.58     4.61         4.04    4.61           4.92                4.71
v2   52  4.41     4.54         3.98    4.54           4.81                4.17
v3   34  4.58     4.71         4.00    4.71           4.79                4.68
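The page does not say how the overall column is computed. The figures are consistent with an unweighted mean of the five dimension scores, and the sketch below checks that under this assumption. The `rows` layout and the `overall_from_dims` helper are hypothetical names introduced for illustration.

```python
from statistics import mean

# key: (n, overall, [Specificity, Safety, Actionability,
#                    Domain correctness, Completeness])
rows = {
    "v1": (49, 4.58, [4.61, 4.04, 4.61, 4.92, 4.71]),
    "v2": (52, 4.41, [4.54, 3.98, 4.54, 4.81, 4.17]),
    "v3": (34, 4.58, [4.71, 4.00, 4.71, 4.79, 4.68]),
}

def overall_from_dims(dims):
    """Unweighted mean of the dimension scores (an assumption, not confirmed)."""
    return mean(dims)

for key, (n, overall, dims) in rows.items():
    computed = overall_from_dims(dims)
    print(f"{key}: reported {overall:.2f}, mean of dims {computed:.3f}")

# The per-version counts also add up to the number of individual evaluations.
total_n = sum(n for n, _, _ in rows.values())
print(f"total n: {total_n}")
```

Because the displayed scores are already rounded to two decimals, agreement to within rounding is the most this check can show.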

Average by output language

key  n   overall  Specificity  Safety  Actionability  Domain correctness  Completeness
en   67  4.60     4.72         4.07    4.72           4.91                4.60
zh   68  4.42     4.50         3.94    4.50           4.78                4.40
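Both breakdowns cover the same set of evaluations, so pooling either one with a count-weighted mean should give roughly the same overall score. The sketch below illustrates that cross-check. The `pooled_mean` helper is a hypothetical name, and the inputs are the rounded values shown in the two tables.

```python
# (n, overall) pairs copied from the two breakdown tables.
by_version = {"v1": (49, 4.58), "v2": (52, 4.41), "v3": (34, 4.58)}
by_language = {"en": (67, 4.60), "zh": (68, 4.42)}

def pooled_mean(groups):
    """Count-weighted mean of group averages: sum(n * avg) / sum(n)."""
    total_n = sum(n for n, _ in groups.values())
    weighted_sum = sum(n * avg for n, avg in groups.values())
    return weighted_sum / total_n

pooled_version = pooled_mean(by_version)
pooled_language = pooled_mean(by_language)

print(f"pooled over versions:  {pooled_version:.3f}")
print(f"pooled over languages: {pooled_language:.3f}")
```

The two results will not match exactly because each group average is already rounded, but they should agree to about two decimal places.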

Individual evaluations (135)