Build Your First Eval in 10 Minutes

This quickstart walks through the full path:
  • Initialize the SDK
  • Create and push a dataset
  • Define a task and evaluations
  • Execute a run and read metrics

LLM Stats uses ZeroEval as its core evaluation library, so this guide has you working with the same library that powers LLM Stats in production.
1. Install and authenticate

pip install zeroeval
export ZEROEVAL_API_KEY="sk_ze_..."
Then initialize:
import zeroeval as ze
ze.init()  # reads ZEROEVAL_API_KEY
2. Create and push a dataset

import zeroeval as ze

ds = ze.Dataset(
    "capital-cities-demo",
    data=[
        {"question": "Capital of France?", "answer": "Paris"},
        {"question": "Capital of Germany?", "answer": "Berlin"},
        {"question": "Capital of Spain?", "answer": "Madrid"},
    ],
    description="Simple geography eval set",
)
ds.push()
3. Define a task and evaluations

import zeroeval as ze

@ze.task(outputs=["prediction"])
def predict_capital(row):
    # Replace with your model/provider call; echoing the gold
    # answer keeps this demo deterministic.
    return {"prediction": row.answer}

@ze.evaluation(mode="row", outputs=["exact_match"])
def exact_match(row, answer_col, prediction_col):
    return {"exact_match": int(answer_col == prediction_col)}

@ze.evaluation(mode="column", outputs=["accuracy"])
def accuracy(exact_match_col):
    total = len(exact_match_col)
    return {"accuracy": (sum(exact_match_col) / total) if total else 0.0}
4. Run and score

run = ds.eval(predict_capital, workers=8)
run = run.score(
    [exact_match, accuracy],
    column_map={
        "exact_match": {
            "answer_col": "answer",
            "prediction_col": "prediction",
        },
        "accuracy": {"exact_match_col": "exact_match"},
    },
)

print("eval_id:", run.id)
print("metrics:", run.metrics)
print("health:", run.health)

Understand the mapping model

  • @ze.task defines new output columns.
  • @ze.evaluation(mode="row") computes per-row scores.
  • @ze.evaluation(mode="column") aggregates across rows.
  • column_map binds evaluator function args to dataset/run column names.
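The binding can be pictured with a tiny, hypothetical re-implementation. This is an illustration of the concept only, not ZeroEval's actual internals; the function name apply_row_evaluation is invented for the sketch:

```python
# Hypothetical sketch of how column_map binding works: each evaluator
# argument name is looked up in the map, which names the dataset/run
# column that supplies its value.
def apply_row_evaluation(rows, fn, arg_to_column):
    results = []
    for row in rows:
        kwargs = {arg: row[col] for arg, col in arg_to_column.items()}
        results.append(fn(**kwargs))
    return results

rows = [
    {"answer": "Paris", "prediction": "Paris"},
    {"answer": "Berlin", "prediction": "Bonn"},
]

def exact_match(answer_col, prediction_col):
    return {"exact_match": int(answer_col == prediction_col)}

scores = apply_row_evaluation(
    rows,
    exact_match,
    {"answer_col": "answer", "prediction_col": "prediction"},
)
print(scores)  # [{'exact_match': 1}, {'exact_match': 0}]
```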

Dataset deep dive

Learn creation patterns, loading, versioning, and multimodal data.

Evals deep dive

Learn execution config, scoring modes, retries, and resume.

Prefer deterministic row IDs when you need reliable resume behavior: include row_id in dataset rows whenever possible.
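One way to make row IDs deterministic is to hash each row's stable content. This is a general-purpose sketch, not a ZeroEval-specific API; only the row_id field name comes from the note above, and the helper stable_row_id is invented here:

```python
import hashlib

def stable_row_id(question: str) -> str:
    # Hash the stable part of the row so re-pushing the same data
    # always yields the same IDs, which makes resume reliable.
    return hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]

data = [
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Capital of Germany?", "answer": "Berlin"},
]
for row in data:
    row["row_id"] = stable_row_id(row["question"])

# Same input text always maps to the same row_id.
assert stable_row_id("Capital of France?") == data[0]["row_id"]
```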