Why versioning matters
Reproducible evaluation depends on running against a known dataset snapshot. The SDK exposes version metadata after push and supports pulling a specific version.
dataset.push()
print(dataset.version_id)
print(dataset.version_number)
To run against a known version:
dataset = ze.Dataset.pull("capital-cities", version_number=3)
Subset pulls
If your dataset includes named subsets, pull only the subset you want:
diamond = ze.Dataset.pull("gpqa", subset="diamond")
print(len(diamond))
If subset is omitted, the SDK attempts default subset resolution based on dataset metadata.
Reproducibility checklist
Pin dataset version when benchmarking
For benchmark comparisons over time, prefer version_number instead of always pulling latest.
Track subset in experiments
Store subset name in run parameters for transparent reporting.
Use stable row identifiers
Include deterministic row_id fields to make resume and row comparisons robust.
run = dataset.eval(
task_func=my_task,
parameters={
"dataset_name": dataset.name,
"dataset_version": dataset.version_number,
"subset": "diamond",
},
)
Attach dataset version + subset in run parameters so dashboards and downstream analysis remain self-describing.