Deep dive: evaluation¶
A complete walkthrough of spine-eval — from a dataset to a CI gate.
The pieces¶
Case— one test:id,input, optionalexpected, optionalmetadata.Dataset— a list of cases (load_datasetreads YAML/JSON).Scorer—async def score(case, result) -> Score.evaluate(agent, dataset, scorers=None, *, concurrency=1) -> EvalReport.EvalReport— aggregates across Cost / Latency / Efficacy / Reliability.
1. Write a dataset¶
evals/support.yaml:
cases:
- id: refund-policy
input: "What is our refund window?"
expected: "30 days"
- id: greeting
input: "hi"
expected: "hello"
- id: math
input: "what is 12 * 12?"
expected: "144"
A bare list works too; id defaults to position.
2. Run it¶
from spine_eval import evaluate, load_dataset, Contains
dataset = load_dataset("evals/support.yaml")
report = await evaluate(agent, dataset, [Contains()], concurrency=4)
Each case runs in a fresh session (no cross-case leakage); concurrency bounds
parallel cases. A case that crashes is a failed case, not a failed eval.
3. Read the report¶
print(f"pass rate : {report.pass_rate:.0%} ({report.passed}/{report.total})")
print(f"error rate: {report.error_rate:.0%}")
print(f"cost : ${report.cost_total_usd:.4f} (avg ${report.cost_avg_usd:.5f})")
print(f"latency : avg {report.latency_avg_s:.2f}s · p95 {report.latency_p95_s:.2f}s")
print(f"by scorer : {report.scorer_means}")
print(f"stop mix : {report.stop_reasons}")
for r in report.results:
print(r.id, "PASS" if r.passed else "FAIL", r.scores)
The four dimensions map to the lab-to-prod gap: Efficacy (pass rate, scorer means), Cost (total/avg), Latency (avg/p95), Reliability (error rate, stop-reason mix).
4. Scorers¶
| Scorer | Use |
|---|---|
ExactMatch(strip=, casefold=) |
normalized equality |
Contains(casefold=) |
expected is a substring |
Regex(pattern, flags=) |
answer matches a pattern |
FunctionScorer(fn, name=, threshold=) |
any (case, result) -> bool \| float |
LLMJudge(provider, threshold=) |
grade with another model |
A custom scorer¶
from spine_eval import FunctionScorer
concise = FunctionScorer(
lambda case, result: len(result.answer or "") <= 280,
name="concise",
)
FunctionScorer accepts sync or async callables, returning bool or a float
score (compared to threshold).
LLM-as-judge¶
from spine_eval import LLMJudge
from spine_providers import OpenAIProvider
judge = LLMJudge(OpenAIProvider("gpt-4o-mini"), threshold=0.5)
report = await evaluate(agent, dataset, [judge])
The judge is asked for a JSON verdict ({"score", "pass", "reason"}); an
unparseable reply scores 0 (fails safe).
Stack scorers¶
report = await evaluate(agent, dataset, [Contains(), concise, judge])
# a case passes only if ALL its scorers pass; scorer_means has each one
5. From the CLI / in CI¶
Exits non-zero on any failure — drop it straight into a CI step:
6. Offline, deterministic eval¶
Pair eval with deterministic replay:
record real runs once, then eval the Replayer in CI — same scores, no API cost.