## Why versioning matters
Reproducible evaluation depends on running against a known dataset snapshot. The SDK exposes version metadata after push and supports pulling a specific version.
```python
dataset.push()  # version metadata is populated after a successful push
print(dataset.version_id)
print(dataset.version_number)
```
To run against a known version:
```python
dataset = ze.Dataset.pull("capital-cities", version_number=3)
```
## One compact example
```python
dataset = ze.Dataset(
    "qa-demo",
    data=[{"row_id": "q1", "question": "6 * 7", "answer": "42"}],
)
dataset.push()

latest = ze.Dataset.pull("qa-demo")
pinned = ze.Dataset.pull("qa-demo", version_number=dataset.version_number)
subset = ze.Dataset.pull("gpqa", subset="diamond")

print(dataset.version_number)
print(len(latest), len(pinned), len(subset))
```
## Subset pulls
If your dataset includes named subsets, pull only the subset you want:
```python
diamond = ze.Dataset.pull("gpqa", subset="diamond")
print(len(diamond))
```
If `subset` is omitted, the SDK resolves the default subset automatically, in this order:

- The benchmark’s configured default subset
- A file named `data/default.parquet`
- If there is only one subset, that subset
- The first subset alphabetically
Subsets are detected from Parquet files in the `data/` directory: each file `data/<name>.parquet` becomes a subset named `<name>`. See Uploading Data for the full layout.
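The detection and default-resolution rules above can be sketched in plain Python. This is an illustration, not the SDK's implementation; `discover_subsets` and `resolve_default_subset` are hypothetical helper names:

```python
from pathlib import Path

def discover_subsets(data_dir):
    """Map each data/<name>.parquet file to a subset named <name>, sorted."""
    return sorted(p.stem for p in Path(data_dir).glob("*.parquet"))

def resolve_default_subset(subsets, configured_default=None):
    """Apply the default-subset resolution order described above."""
    if configured_default in subsets:
        return configured_default  # 1. the benchmark's configured default
    if "default" in subsets:
        return "default"           # 2. a data/default.parquet file
    if len(subsets) == 1:
        return subsets[0]          # 3. the only subset
    return subsets[0]              # 4. first alphabetically (list is sorted)
```

For example, a layout with `data/diamond.parquet` and `data/extended.parquet` but no configured default resolves to `diamond`.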
## Reproducibility checklist
**Pin dataset version when benchmarking.** For benchmark comparisons over time, prefer `version_number` instead of always pulling the latest.

**Track subset in experiments.** Store the subset name in run parameters for transparent reporting.

**Use stable row identifiers.** Include deterministic `row_id` fields to make resume and row comparisons robust.
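To illustrate the stable-identifier point, here is a minimal sketch that derives a deterministic `row_id` by hashing a stable content field. `make_row_id` and the choice of the `question` key are assumptions for this example, not SDK features:

```python
import hashlib

def make_row_id(row):
    """Derive a stable identifier from the row's content so resume and
    cross-run row comparisons survive reordering and re-uploads."""
    key = row["question"]  # hypothetical: any column that is stable and unique
    return "q-" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]

rows = [{"question": "6 * 7", "answer": "42"}]
rows = [{**row, "row_id": make_row_id(row)} for row in rows]
```

Because the id depends only on row content, re-running the script yields identical ids, unlike auto-incremented indices.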
```python
run = dataset.eval(
    task_func=my_task,
    parameters={
        "dataset_name": dataset.name,
        "dataset_version": dataset.version_number,
        "subset": "diamond",
    },
)
```
Attach the dataset version and subset in run parameters so dashboards and downstream analysis remain self-describing.