Skip to content

Deep dive: evaluation

A complete walkthrough of spine-eval — from a dataset to a CI gate.

The pieces

  • Case — one test: id, input, optional expected, optional metadata.
  • Dataset — a list of cases (load_dataset reads YAML/JSON).
  • Scorerasync def score(case, result) -> Score.
  • evaluate(agent, dataset, scorers=None, *, concurrency=1) -> EvalReport.
  • EvalReport — aggregates across Cost / Latency / Efficacy / Reliability.

1. Write a dataset

evals/support.yaml:

cases:
  - id: refund-policy
    input: "What is our refund window?"
    expected: "30 days"
  - id: greeting
    input: "hi"
    expected: "hello"
  - id: math
    input: "what is 12 * 12?"
    expected: "144"

A bare list works too; id defaults to position.

2. Run it

from spine_eval import evaluate, load_dataset, Contains

dataset = load_dataset("evals/support.yaml")
report = await evaluate(agent, dataset, [Contains()], concurrency=4)

Each case runs in a fresh session (no cross-case leakage); concurrency bounds parallel cases. A case that crashes is a failed case, not a failed eval.

3. Read the report

print(f"pass rate : {report.pass_rate:.0%} ({report.passed}/{report.total})")
print(f"error rate: {report.error_rate:.0%}")
print(f"cost      : ${report.cost_total_usd:.4f} (avg ${report.cost_avg_usd:.5f})")
print(f"latency   : avg {report.latency_avg_s:.2f}s · p95 {report.latency_p95_s:.2f}s")
print(f"by scorer : {report.scorer_means}")
print(f"stop mix  : {report.stop_reasons}")

for r in report.results:
    print(r.id, "PASS" if r.passed else "FAIL", r.scores)

The four dimensions map to the lab-to-prod gap: Efficacy (pass rate, scorer means), Cost (total/avg), Latency (avg/p95), Reliability (error rate, stop-reason mix).

4. Scorers

Scorer Use
ExactMatch(strip=, casefold=) normalized equality
Contains(casefold=) expected is a substring
Regex(pattern, flags=) answer matches a pattern
FunctionScorer(fn, name=, threshold=) any (case, result) -> bool \| float
LLMJudge(provider, threshold=) grade with another model

A custom scorer

from spine_eval import FunctionScorer

concise = FunctionScorer(
    lambda case, result: len(result.answer or "") <= 280,
    name="concise",
)

FunctionScorer accepts sync or async callables, returning bool or a float score (compared to threshold).

LLM-as-judge

from spine_eval import LLMJudge
from spine_providers import OpenAIProvider

judge = LLMJudge(OpenAIProvider("gpt-4o-mini"), threshold=0.5)
report = await evaluate(agent, dataset, [judge])

The judge is asked for a JSON verdict ({"score", "pass", "reason"}); an unparseable reply scores 0 (fails safe).

Stack scorers

report = await evaluate(agent, dataset, [Contains(), concise, judge])
# a case passes only if ALL its scorers pass; scorer_means has each one

5. From the CLI / in CI

spine eval evals/support.yaml --scorer contains

Exits non-zero on any failure — drop it straight into a CI step:

- run: uv run spine eval evals/support.yaml

6. Offline, deterministic eval

Pair eval with deterministic replay: record real runs once, then eval the Replayer in CI — same scores, no API cost.