Why versioning matters

Reproducible evaluation depends on running against a known dataset snapshot. The SDK exposes version metadata after a push and supports pulling a specific version.

```python
dataset.push()
print(dataset.version_id)
print(dataset.version_number)
```

To run against a known version, pass `version_number` to pull:

```python
dataset = ze.Dataset.pull("capital-cities", version_number=3)
```

One compact example

```python
# Create and push a small dataset
dataset = ze.Dataset(
    "qa-demo",
    data=[{"row_id": "q1", "question": "6 * 7", "answer": "42"}],
)
dataset.push()

# Pull the latest version, a pinned version, and a named subset
latest = ze.Dataset.pull("qa-demo")
pinned = ze.Dataset.pull("qa-demo", version_number=dataset.version_number)
subset = ze.Dataset.pull("gpqa", subset="diamond")

print(dataset.version_number)
print(len(latest), len(pinned), len(subset))
```

Subset pulls

If your dataset includes named subsets, pull only the subset you want:

```python
diamond = ze.Dataset.pull("gpqa", subset="diamond")
print(len(diamond))
```

If `subset` is omitted, the SDK resolves the default subset automatically, in this order:

  1. The benchmark’s configured default subset
  2. A file named data/default.parquet
  3. If there is only one subset, that subset
  4. The first subset alphabetically

Subsets are detected from Parquet files in the data/ directory. Each file data/<name>.parquet becomes a subset named <name>. See Uploading Data for the full layout.
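The resolution order above can be sketched in plain Python. This is an illustrative model, not the SDK's actual internals: `configured_default` and `subset_files` stand in for whatever the SDK reads from the benchmark configuration and the data/ directory.

```python
# Hypothetical sketch of the default-subset resolution order described above.
def resolve_default_subset(configured_default, subset_files):
    """Return the subset name that would be pulled when `subset` is omitted."""
    # Subset names come from data/<name>.parquet files
    names = sorted(f[len("data/"):-len(".parquet")] for f in subset_files)

    # 1. The benchmark's configured default subset
    if configured_default and configured_default in names:
        return configured_default
    # 2. A file named data/default.parquet
    if "default" in names:
        return "default"
    # 3. If there is only one subset, that subset (falls through to 4,
    #    which returns the same thing for a single-element list)
    # 4. The first subset alphabetically
    return names[0]

print(resolve_default_subset(None, ["data/extended.parquet", "data/diamond.parquet"]))
# → diamond
```

With a configured default of "extended" the first rule wins; with only data/only.parquet the sole subset "only" is returned.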

Reproducibility checklist

1. Pin dataset version when benchmarking. For benchmark comparisons over time, prefer `version_number` instead of always pulling the latest version.

2. Track subset in experiments. Store the subset name in run parameters for transparent reporting.

3. Use stable row identifiers. Include deterministic `row_id` fields to make resume and row comparisons robust.
```python
run = dataset.eval(
    task_func=my_task,
    parameters={
        "dataset_name": dataset.name,
        "dataset_version": dataset.version_number,
        "subset": "diamond",
    },
)
```

Attach the dataset version and subset in run parameters so dashboards and downstream analysis remain self-describing.
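The stable-row-identifier item above can be sketched as follows. Deriving `row_id` from a content hash is one illustrative approach, not an SDK requirement; the `make_row_id` helper is hypothetical.

```python
import hashlib

# One way to build deterministic row_id values: hash the stable content of
# each row. The same row always yields the same id, so resume and per-row
# comparisons line up across runs and dataset versions.
def make_row_id(row: dict) -> str:
    key = row["question"].strip().lower()
    return "q-" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]

rows = [{"question": "6 * 7", "answer": "42"}]
for row in rows:
    row["row_id"] = make_row_id(row)

print(rows[0]["row_id"])  # stable across runs for the same question
```

Normalizing the key (strip, lowercase) before hashing keeps incidental whitespace or casing edits from changing the id; hash the fields that define row identity, not volatile ones.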