End-to-end example

import zeroeval as ze

ze.init()

dataset = ze.Dataset(
    "math-demo",
    data=[
        {"row_id": "q1", "question": "6 * 7", "answer": "42"},
        {"row_id": "q2", "question": "10 + 7", "answer": "17"},
    ],
)
dataset.push()

@ze.task(outputs=["prediction"])
def solve(row):
    # Replace with real model call.
    ze.emit_signal("phase", "solve")
    return {"prediction": row.answer}

@ze.evaluation(mode="row", outputs=["exact_match"])
def exact_match(row, answer_col, prediction_col):
    return {"exact_match": int(answer_col == prediction_col)}

@ze.evaluation(mode="column", outputs=["accuracy"])
def accuracy(exact_match_col):
    n = len(exact_match_col)
    return {"accuracy": (sum(exact_match_col) / n) if n else 0.0}

run = dataset.eval(solve, workers=8)
run = run.score(
    [exact_match, accuracy],
    column_map={
        "exact_match": {
            "answer_col": "answer",
            "prediction_col": "prediction",
        },
        "accuracy": {"exact_match_col": "exact_match"},
    },
)

print("eval_id:", run.id)
print("metrics:", run.metrics)
print("health:", run.health)

Core primitives

print(type(dataset).__name__)  # Dataset
print(type(run).__name__)      # Eval

  • Dataset stores rows and backend version metadata.
  • @ze.task(...) defines the outputs generated for each row.
  • run = dataset.eval(...) executes the task and returns an Eval object.
  • run.score(...) applies evaluation functions and writes aggregate metrics into run.metrics.
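
Taken together, the primitives compose into one short pipeline. The sketch below is a condensed recap of the example above using only calls it already shows; the dataset name and row contents are placeholders.

import zeroeval as ze

ze.init()

ds = ze.Dataset("mini-demo", data=[{"row_id": "r1", "question": "2 + 2", "answer": "4"}])
ds.push()                              # persist rows and version metadata

@ze.task(outputs=["prediction"])
def echo(row):
    return {"prediction": row.answer}  # stand-in for a real model call

run = ds.eval(echo)                    # returns an Eval; task outputs live on its rows
# run.metrics is filled in once run.score(...) is called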

Inspect the returned run

run = dataset.eval(solve)

print(run.id)         # backend eval ID
print(run.rows[0])    # task outputs live on rows
print(run.metrics)    # aggregate metrics live here
print(run.health)     # execution summary lives here
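
To look beyond the first row, a plain loop over run.rows works; each row carries the original inputs plus the task outputs, and row-level scores once run.score(...) has been applied. The exact shape of a row object is whatever print(run.rows[0]) shows in your environment, so treat this as a sketch.

for row in run.rows:
    print(row)            # inputs + task outputs (+ row-level scores after scoring)

print(len(run.rows))      # one entry per dataset row
print(run.metrics)        # aggregates land here only after run.score(...)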

What happened

1. Dataset rows were persisted. dataset.push() stores the dataset and returns backend-linked metadata.
2. Task outputs were generated. dataset.eval(...) executed solve over each row with the requested worker count and returned a run object (Eval).
3. Runtime signals were captured. ze.emit_signal(...) attached execution facts to the task trace so they can be inspected later, without turning them into scores immediately (see the sketch after this list).
4. Evaluations produced scores. Row-level exact_match values were added to rows, then accuracy was aggregated into run.metrics.
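
For reference, here is a hedged sketch of attaching several signals from one task. The emit_signal(key, value) shape follows the example above; the extra signal names ("model", "latency_ms") and the string-formatted latency value are illustrative assumptions, not documented behavior.

import time

@ze.task(outputs=["prediction"])
def solve(row):
    start = time.perf_counter()
    ze.emit_signal("phase", "solve")
    ze.emit_signal("model", "demo-echo")   # illustrative: record which backend answered
    prediction = row.answer                # stand-in for a real model call
    elapsed_ms = (time.perf_counter() - start) * 1000
    ze.emit_signal("latency_ms", f"{elapsed_ms:.1f}")  # string value, to stay close to the example
    return {"prediction": prediction}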

One mental model for scoring

  • A row evaluation receives one row plus any mapped scalar values.
  • A column evaluation receives lists of values across all rows.
  • A run evaluation receives all_runs after you repeat a run (a hedged sketch follows the code below).

@ze.evaluation(mode="row", outputs=["exact_match"])
def exact_match(row, gold_col, pred_col):
    return {"exact_match": int(gold_col == pred_col)}

@ze.evaluation(mode="column", outputs=["accuracy"])
def accuracy(exact_match_col):
    n = len(exact_match_col)
    return {"accuracy": (sum(exact_match_col) / n) if n else 0.0}

run = run.score(
    [exact_match, accuracy],
    column_map={
        "exact_match": {"gold_col": "answer", "pred_col": "prediction"},
        "accuracy": {"exact_match_col": "exact_match"},
    },
)
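
The code above covers the row and column modes. A run evaluation is sketched below; mode="run" and the exact shape of all_runs are assumptions based on the bullet above, since the source only states that a run evaluation receives all_runs after a run is repeated.

@ze.evaluation(mode="run", outputs=["runs_compared"])
def runs_compared(all_runs):
    # Assumed: all_runs is a list-like collection of repeated runs.
    return {"runs_compared": len(all_runs)}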