Build Your First Eval in 10 Minutes

This quickstart walks through the full path:
  • Initialize the SDK
  • Create and push a dataset
  • Define a task and evaluations
  • Execute a run and read metrics
LLM Stats uses ZeroEval as its core evaluation library, so this guide uses the same library that powers LLM Stats in production.
1. Install and authenticate

pip install zeroeval
export ZEROEVAL_API_KEY="sk_ze_..."
Then initialize:
import zeroeval as ze
ze.init()  # reads ZEROEVAL_API_KEY
2. Create and push a dataset

import zeroeval as ze

ds = ze.Dataset(
    "capital-cities-demo",
    data=[
        {"question": "Capital of France?", "answer": "Paris"},
        {"question": "Capital of Germany?", "answer": "Berlin"},
        {"question": "Capital of Spain?", "answer": "Madrid"},
    ],
    description="Simple geography eval set",
)
ds.push()
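
Real datasets rarely start as a hand-written list. The following is a minimal sketch, using only the Python standard library, of turning CSV text into the list-of-dicts shape passed as data= above. The helper name rows_from_csv and the sample CSV are hypothetical; nothing here calls ZeroEval.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be read from a file.
CSV_TEXT = """question,answer
Capital of France?,Paris
Capital of Germany?,Berlin
"""

def rows_from_csv(text):
    """Parse CSV text into a list of dicts, one per row, trimming whitespace."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: value.strip() for key, value in row.items()} for row in reader]

rows = rows_from_csv(CSV_TEXT)
print(rows)
```

The resulting rows list has the same structure as the inline data in the dataset example, so it could presumably be passed in its place.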
3. Define a task and evaluations

import zeroeval as ze

@ze.task(outputs=["prediction"])
def predict_capital(row):
    # Placeholder: echoes the ground-truth answer so the pipeline runs end to end.
    # Replace with your model/provider call.
    return {"prediction": row.answer}

@ze.evaluation(mode="row", outputs=["exact_match"])
def exact_match(row, answer_col, prediction_col):
    return {"exact_match": int(answer_col == prediction_col)}

@ze.evaluation(mode="column", outputs=["accuracy"])
def accuracy(exact_match_col):
    total = len(exact_match_col)
    return {"accuracy": (sum(exact_match_col) / total) if total else 0.0}
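
Once the placeholder task is swapped for a real model, strict equality tends to penalize harmless differences such as "Paris." versus "paris". One possible approach, sketched here in plain Python, is to normalize both strings before comparing. The helpers normalize and normalized_match are hypothetical names, not part of ZeroEval; the comparison could be dropped into the body of a row-mode evaluation like exact_match above.

```python
def normalize(text):
    """Trim whitespace, drop trailing sentence punctuation, and ignore case."""
    return str(text).strip().rstrip(".!?").strip().casefold()

def normalized_match(answer, prediction):
    """Return 1 if the two strings agree after normalization, else 0."""
    return int(normalize(answer) == normalize(prediction))

print(normalized_match("Paris", " paris. "))
print(normalized_match("Paris", "Berlin"))
```

Whether this leniency is appropriate depends on the task; for code or numeric answers, stricter matching is usually safer.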
4. Run and score

run = ds.eval(predict_capital, workers=8)
run = run.score(
    [exact_match, accuracy],
    column_map={
        "exact_match": {
            "answer_col": "answer",
            "prediction_col": "prediction",
        },
        "accuracy": {"exact_match_col": "exact_match"},
    },
)

print("run_id:", run.run_id)
print("metrics:", run.metrics)
print("health:", run.health)
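
After printing metrics, a common next step is to fail a script or CI job when a score drops too low. The sketch below assumes the metrics can be read as a plain dict mapping metric names to numbers; the exact shape of run.metrics is not shown in this guide, so treat that as an assumption and adapt accordingly. check_metrics and the threshold values are hypothetical.

```python
def check_metrics(metrics, thresholds):
    """Return a list of failure messages; an empty list means every threshold passed."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures

# Assumed dict-shaped metrics, standing in for whatever run.metrics returns.
example_metrics = {"accuracy": 0.5}
problems = check_metrics(example_metrics, {"accuracy": 0.9})
print(problems)
```

Returning a list rather than raising immediately lets the caller report every failing metric at once.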

Understand the mapping model

  • @ze.task defines new output columns.
  • @ze.evaluation(mode="row") computes per-row scores.
  • @ze.evaluation(mode="column") aggregates across rows.
  • column_map binds evaluator function args to dataset/run column names.
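
To make the binding idea concrete, here is a plain-Python sketch of what the bullets above describe: a row-mode function is called once per row with column values, and a column-mode function is called once with whole columns. This is an illustration only, not ZeroEval's implementation. apply_row_eval and apply_column_eval are hypothetical helpers, and the row argument from the evaluator in step 3 is omitted for brevity.

```python
rows = [
    {"answer": "Paris", "prediction": "Paris"},
    {"answer": "Berlin", "prediction": "Bonn"},
]

def exact_match(answer_col, prediction_col):
    return {"exact_match": int(answer_col == prediction_col)}

def accuracy(exact_match_col):
    total = len(exact_match_col)
    return {"accuracy": (sum(exact_match_col) / total) if total else 0.0}

def apply_row_eval(fn, rows, arg_map):
    """Call fn once per row, binding each argument to that row's mapped column value."""
    for row in rows:
        kwargs = {arg: row[col] for arg, col in arg_map.items()}
        row.update(fn(**kwargs))  # outputs become new columns on the row

def apply_column_eval(fn, rows, arg_map):
    """Call fn once, binding each argument to the full list of the mapped column."""
    kwargs = {arg: [row[col] for row in rows] for arg, col in arg_map.items()}
    return fn(**kwargs)

apply_row_eval(exact_match, rows, {"answer_col": "answer", "prediction_col": "prediction"})
metrics = apply_column_eval(accuracy, rows, {"exact_match_col": "exact_match"})
print(metrics)  # {'accuracy': 0.5}
```

The two arg_map dicts mirror the inner dicts of column_map in the scoring call: keys are function argument names, values are column names.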
Prefer deterministic row IDs when you need reliable resume behavior. Include row_id in dataset rows whenever possible.
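
One way to get deterministic IDs, sketched here with the standard library only, is to hash each row's content so the same row always yields the same row_id across runs and machines. The helpers stable_row_id and with_row_ids are hypothetical, and the 16-character length is an arbitrary choice; how ZeroEval consumes row_id beyond the note above is not covered here.

```python
import hashlib
import json

def stable_row_id(row, length=16):
    """Derive an ID from row content; key order does not affect the result."""
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]

def with_row_ids(rows):
    """Return copies of the rows with a content-derived row_id added."""
    return [{**row, "row_id": stable_row_id(row)} for row in rows]

data = with_row_ids([
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Capital of Germany?", "answer": "Berlin"},
])
print([row["row_id"] for row in data])
```

Content-derived IDs change whenever a row is edited, and two identical rows receive the same ID. If rows must keep their identity across edits, or duplicates are expected, an explicit ID from the source system is a better fit.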