Build Your First Eval in 10 Minutes
This quickstart walks through the full path:
- Initialize the SDK
- Create and push a dataset
- Define a task and evaluations
- Execute a run and read metrics
LLM Stats uses ZeroEval as its core evaluation library, so this guide walks you through the same library that powers LLM Stats in production.
Install and authenticate
pip install zeroeval
export ZEROEVAL_API_KEY="sk_ze_..."
Then initialize:
import zeroeval as ze
ze.init() # reads ZEROEVAL_API_KEY
Create and push a dataset
import zeroeval as ze
ds = ze.Dataset(
    "capital-cities-demo",
    data=[
        {"question": "Capital of France?", "answer": "Paris"},
        {"question": "Capital of Germany?", "answer": "Berlin"},
        {"question": "Capital of Spain?", "answer": "Madrid"},
    ],
    description="Simple geography eval set",
)
ds.push()
Define a task and evaluations
import zeroeval as ze
@ze.task(outputs=["prediction"])
def predict_capital(row):
    # Replace with your model/provider call.
    return {"prediction": row.answer}

@ze.evaluation(mode="row", outputs=["exact_match"])
def exact_match(row, answer_col, prediction_col):
    return {"exact_match": int(answer_col == prediction_col)}

@ze.evaluation(mode="column", outputs=["accuracy"])
def accuracy(exact_match_col):
    total = len(exact_match_col)
    return {"accuracy": (sum(exact_match_col) / total) if total else 0.0}
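To make the two evaluation modes concrete, here is a plain-Python sketch (no ZeroEval calls) of what the row-mode and column-mode evaluators compute over three rows, one of which deliberately misses:

```python
# Plain-Python walkthrough of the scoring logic above (no SDK involved).
rows = [
    {"answer": "Paris", "prediction": "Paris"},
    {"answer": "Berlin", "prediction": "Berlin"},
    {"answer": "Madrid", "prediction": "Munich"},  # deliberate miss
]

# Row mode: one score per row.
exact = [int(r["answer"] == r["prediction"]) for r in rows]

# Column mode: aggregate the per-row scores into a single metric.
accuracy = sum(exact) / len(exact) if exact else 0.0

print(exact)     # per-row exact-match flags
print(accuracy)  # two of three rows correct
```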
Run and score
run = ds.eval(predict_capital, workers=8)
run = run.score(
    [exact_match, accuracy],
    column_map={
        "exact_match": {
            "answer_col": "answer",
            "prediction_col": "prediction",
        },
        "accuracy": {"exact_match_col": "exact_match"},
    },
)
print("run_id:", run.run_id)
print("metrics:", run.metrics)
print("health:", run.health)
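The returned metrics can then feed a quality gate, for example in CI. A minimal sketch, assuming the metrics arrive as a plain dict keyed by evaluator output name (verify the exact shape against your SDK version); `assert_min_accuracy` is a hypothetical helper, not part of ZeroEval:

```python
def assert_min_accuracy(metrics, threshold=0.9):
    # Fail fast if aggregate accuracy dips below the threshold.
    accuracy = metrics.get("accuracy", 0.0)
    if accuracy < threshold:
        raise AssertionError(f"accuracy {accuracy:.3f} < {threshold}")
    return accuracy

print(assert_min_accuracy({"accuracy": 0.95}))  # 0.95
```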
Understand the mapping model
- @ze.task defines new output columns.
- @ze.evaluation(mode="row") computes per-row scores.
- @ze.evaluation(mode="column") aggregates across rows.
- column_map binds evaluator function args to dataset/run column names.
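The column_map binding can be sketched in plain Python. This is a simplified mental model only (it omits the row argument and all SDK machinery); `bind_and_call` is illustrative, not a ZeroEval API:

```python
# Conceptual sketch: column_map renames dataset/run columns into the
# evaluator's keyword arguments before the evaluator is called.
def bind_and_call(evaluator, row, column_map):
    kwargs = {arg: row[col] for arg, col in column_map.items()}
    return evaluator(**kwargs)

def exact_match(answer_col, prediction_col):
    return {"exact_match": int(answer_col == prediction_col)}

row = {"answer": "Paris", "prediction": "Paris"}
mapping = {"answer_col": "answer", "prediction_col": "prediction"}
print(bind_and_call(exact_match, row, mapping))  # {'exact_match': 1}
```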
Prefer deterministic row IDs when you need reliable resume behavior. Include row_id in dataset rows whenever possible.
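A minimal sketch of one way to derive deterministic row IDs, hashing the question text so that re-pushing the dataset yields the same IDs every time (the row_id field name follows the note above; confirm it against your SDK version):

```python
import hashlib

def make_row(question, answer):
    # Derive a stable row_id from the question text so repeated
    # pushes of the same dataset produce identical IDs.
    row_id = hashlib.sha256(question.encode("utf-8")).hexdigest()[:12]
    return {"row_id": row_id, "question": question, "answer": answer}

data = [
    make_row("Capital of France?", "Paris"),
    make_row("Capital of Germany?", "Berlin"),
    make_row("Capital of Spain?", "Madrid"),
]
print(data[0]["row_id"])
```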