Why versioning matters

Reproducible evaluation depends on running against a known dataset snapshot. After a push, the SDK exposes version metadata, and a specific version can be pulled later by number.
dataset.push()
print(dataset.version_id)
print(dataset.version_number)
To run against a known version:
dataset = ze.Dataset.pull("capital-cities", version_number=3)

Subset pulls

If your dataset includes named subsets, pull only the subset you want:
diamond = ze.Dataset.pull("gpqa", subset="diamond")
print(len(diamond))
If subset is omitted, the SDK attempts to resolve a default subset from the dataset's metadata.

Reproducibility checklist

1. Pin dataset version when benchmarking
   For benchmark comparisons over time, pin version_number instead of always pulling the latest version.

2. Track subset in experiments
   Store the subset name in run parameters for transparent reporting.

3. Use stable row identifiers
   Include deterministic row_id fields to make resume and row comparisons robust.
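One way to get deterministic row identifiers is to hash the row's content. The sketch below is a generic, stdlib-only approach under the assumption that rows are JSON-serializable dicts; it is not a helper provided by the SDK.

```python
import hashlib
import json


def stable_row_id(row: dict) -> str:
    """Derive a deterministic id from a row's content (illustrative).

    Serializing with sort_keys=True makes the id independent of dict
    insertion order, so the same logical row always hashes the same way.
    """
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Because the id depends only on content, re-running an evaluation after an interruption can match already-completed rows by id instead of by position.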
run = dataset.eval(
    task_func=my_task,
    parameters={
        "dataset_name": dataset.name,
        "dataset_version": dataset.version_number,
        "subset": "diamond",
    },
)
Attaching the dataset version and subset to run parameters keeps dashboards and downstream analysis self-describing.