Build Your First Eval in 10 Minutes

This quickstart walks through the full path:
  • Initialize the SDK
  • Create and push a dataset
  • Define a task and evaluations
  • Execute a run and read metrics

LLM Stats uses ZeroEval as its core evaluation library, so this guide has you working with the same library that powers LLM Stats in production.
1. Install and authenticate

pip install zeroeval
export ZEROEVAL_API_KEY="sk_ze_..."
Then initialize:
import zeroeval as ze
ze.init()  # reads ZEROEVAL_API_KEY
2. Create and push a dataset

import zeroeval as ze

ds = ze.Dataset(
    "capital-cities-demo",
    data=[
        {"question": "Capital of France?", "answer": "Paris"},
        {"question": "Capital of Germany?", "answer": "Berlin"},
        {"question": "Capital of Spain?", "answer": "Madrid"},
    ],
    description="Simple geography eval set",
)
ds.push()
3. Define a task and evaluations

import zeroeval as ze

@ze.task(outputs=["prediction"])
def predict_capital(row):
    # Replace with your model/provider call; echoing the gold
    # answer keeps this demo deterministic.
    return {"prediction": row.answer}

@ze.evaluation(mode="row", outputs=["exact_match"])
def exact_match(row, answer_col, prediction_col):
    return {"exact_match": int(answer_col == prediction_col)}

@ze.evaluation(mode="column", outputs=["accuracy"])
def accuracy(exact_match_col):
    total = len(exact_match_col)
    return {"accuracy": (sum(exact_match_col) / total) if total else 0.0}
4. Run and score

run = ds.eval(predict_capital, workers=8)
run = run.score(
    [exact_match, accuracy],
    column_map={
        "exact_match": {
            "answer_col": "answer",
            "prediction_col": "prediction",
        },
        "accuracy": {"exact_match_col": "exact_match"},
    },
)

print("eval_id:", run.id)
print("metrics:", run.metrics)
print("health:", run.health)

Understand the mapping model

  • @ze.task defines new output columns.
  • @ze.evaluation(mode="row") computes per-row scores.
  • @ze.evaluation(mode="column") aggregates across rows.
  • column_map binds evaluator function args to dataset/run column names.
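The binding can be pictured with a tiny, hypothetical re-implementation. This is an illustration of the concept only, not ZeroEval's actual internals; the function name apply_row_evaluation is invented for the sketch:

```python
# Hypothetical sketch of how column_map binding works: each evaluator
# argument name is looked up in the map, which names the dataset/run
# column that supplies its value.
def apply_row_evaluation(rows, fn, arg_to_column):
    results = []
    for row in rows:
        kwargs = {arg: row[col] for arg, col in arg_to_column.items()}
        results.append(fn(**kwargs))
    return results

rows = [
    {"answer": "Paris", "prediction": "Paris"},
    {"answer": "Berlin", "prediction": "Bonn"},
]

def exact_match(answer_col, prediction_col):
    return {"exact_match": int(answer_col == prediction_col)}

scores = apply_row_evaluation(
    rows,
    exact_match,
    {"answer_col": "answer", "prediction_col": "prediction"},
)
print(scores)  # [{'exact_match': 1}, {'exact_match': 0}]
```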

Dataset deep dive

Learn creation patterns, loading, versioning, and multimodal data.

Evals deep dive

Learn execution config, scoring modes, retries, and resume.

Prefer deterministic row IDs when you need reliable resume behavior: include row_id in dataset rows whenever possible.
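One way to make row IDs deterministic is to hash each row's stable content. This is a general-purpose sketch, not a ZeroEval-specific API; only the row_id field name comes from the note above, and the helper stable_row_id is invented here:

```python
import hashlib

def stable_row_id(question: str) -> str:
    # Hash the stable part of the row so re-pushing the same data
    # always yields the same IDs, which makes resume reliable.
    return hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]

data = [
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Capital of Germany?", "answer": "Berlin"},
]
for row in data:
    row["row_id"] = stable_row_id(row["question"])

# Same input text always maps to the same row_id.
assert stable_row_id("Capital of France?") == data[0]["row_id"]
```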