
Evaluation modes

Use evaluation functions to turn task outputs into row scores and aggregate metrics.
Row evaluations run once per row and usually produce per-row score columns:
@ze.evaluation(mode="row", outputs=["exact_match"])
def exact_match(row, answer_col, prediction_col):
    # Compare the gold answer to the prediction for this single row.
    return {"exact_match": int(answer_col == prediction_col)}

How inputs bind

Use this mental model when writing evaluation functions:
  • Row evaluations receive one row plus any mapped scalar values.
  • Column evaluations receive mapped column lists across the whole run.
  • Run evaluations receive all_runs after repetitions (a sketch follows the examples below).
@ze.evaluation(mode="row", outputs=["exact_match"])
def exact_match(row, gold_col, pred_col):
    # Row mode: gold_col and pred_col arrive as scalar values from one row.
    return {"exact_match": int(gold_col == pred_col)}

@ze.evaluation(mode="column", outputs=["accuracy"])
def accuracy(exact_match_col):
    # Column mode: exact_match_col arrives as a list spanning the whole run.
    return {"accuracy": sum(exact_match_col) / len(exact_match_col)}

Column mapping

Use column_map to bind evaluator function args to columns:
run = run.score(
    [exact_match, accuracy],
    column_map={
        # Outer keys name evaluators; inner dicts map arg name -> column name.
        "exact_match": {
            "gold_col": "answer",
            "pred_col": "prediction",
        },
        # Column evaluations can consume scores produced by row evaluations.
        "accuracy": {"exact_match_col": "exact_match"},
    },
)
Every required evaluator argument must have a mapping. The SDK validates column_map up front and raises an error for unknown or missing columns.
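For example, a typo in a mapped column name should fail at validation time rather than mid-scoring. A sketch (the exact exception type is not specified here):
run.score(
    [exact_match],
    # "answr" is a typo for "answer"; validation rejects the mapping up front.
    column_map={"exact_match": {"gold_col": "answr", "pred_col": "prediction"}},
)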

score() vs eval()

  • run.score(...) is an alias of run.eval(...)
  • These docs use run.score(...) for clarity; both behave identically.
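Concretely, the two calls below are interchangeable; mapping stands in for the column_map shown earlier:
run.score([exact_match, accuracy], column_map=mapping)  # preferred in these docs
run.eval([exact_match, accuracy], column_map=mapping)   # identical behavior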

Signals vs metrics

Signals are not scores.
  • Emit signals during task execution to record runtime facts such as retrieved_doc_count, phase, or tests_failed_after.
  • Compute metrics after execution with row/column/run evaluations.
This separation keeps execution facts reusable across multiple scorers.
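As an illustration only: this section does not define the emission API, so ze.signal, retrieve, and generate below are hypothetical names. The point is the division of labor: the task records facts, and evaluations score them later.
def answer_question(row):
    docs = retrieve(row["question"])             # hypothetical helper
    ze.signal("retrieved_doc_count", len(docs))  # hypothetical emission API
    return {"prediction": generate(docs)}        # hypothetical helper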

Metric helpers

The eval API also provides mode-specific helpers:
run.column_metrics([accuracy])  # accepts column evaluations
run.run_metrics([accuracy_mean], all_runs=repeated_runs)  # accepts run evaluations
These helpers enforce that each evaluation matches its declared mode.

Output locations

  • Task outputs and row scores are appended to the corresponding row in run.rows
  • Aggregate metrics are placed in run.metrics
  • The execution summary is available in run.health
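
Putting the locations together, a minimal post-scoring inspection might look like this (attribute names as listed above; mapping stands in for the column_map shown earlier):
run = run.score([exact_match, accuracy], column_map=mapping)
print(run.rows[0]["exact_match"])  # per-row score, stored on the row
print(run.metrics["accuracy"])     # aggregate metric
print(run.health)                  # execution summary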