Adding a Dataset
Adding a new synchronous dataset to apairo requires:
- A YAML profile describing the directory structure
- A 2-line Python subclass
- Tests
Step 1 -- Write the YAML profile
Create apairo/dataset/profiles/<your_dataset>.yaml. Study your dataset's folder layout and map it onto the layer DSL.
Example -- a hypothetical PandaSet-style layout:
Profile:
# apairo/dataset/profiles/pandaset.yaml
layers:
- sequence # root/001/, root/002/, ...
- modality # root/001/lidar/, root/001/camera/
modalities:
lidar:
ext: .bin
dtype: float32
reshape: [-1, 4]
camera:
ext: .png
loader: img # explicit: torchvision read_image
dtype: uint8 # ignored for non-binary, but required field
Choosing layer order
Write out one example file path relative to the dataset root, then annotate each component:
Then list the layers in that order: [sequence, modality].
Step 2 -- Create the Python subclass
Create apairo/dataset/pandaset/dataset.py:
from apairo.core.profiled_dataset import ProfiledDataset
class PandaSetDataset(ProfiledDataset):
_profile = "pandaset.yaml"
Create apairo/dataset/pandaset/__init__.py:
Export from apairo/dataset/__init__.py:
And add to apairo/__init__.py and its __all__.
Step 3 -- Write tests
Create test/dataset/test_pandaset.py. Use tmp_path to build a minimal mock tree:
import numpy as np
import pytest
from apairo.dataset.pandaset import PandaSetDataset
N = 40
@pytest.fixture
def pandaset_root(tmp_path):
for seq in ["001", "002"]:
(tmp_path / seq / "lidar").mkdir(parents=True)
for i in range(3):
np.random.rand(N, 4).astype(np.float32).tofile(
tmp_path / seq / "lidar" / f"{i:06d}.bin"
)
return tmp_path
def test_len(pandaset_root):
ds = PandaSetDataset(pandaset_root, keys=["lidar"])
assert len(ds) == 6 # 2 seqs x 3 frames
def test_getitem_shape(pandaset_root):
ds = PandaSetDataset(pandaset_root, keys=["lidar"])
sample = ds[0]
assert sample.data["lidar"].shape == (N, 4)
def test_available_keys():
assert PandaSetDataset.available_keys == frozenset({"lidar", "camera"})
Derived key support
Derived keys (preprocessed outputs) work automatically for any ProfiledDataset subclass. The derived_path() method generates the correct output path by replacing the modality component in an existing file path.
PandaSetDataset.run_preprocess(MyPreprocessor(), "/data/pandaset")
ds = PandaSetDataset("/data/pandaset", keys=["lidar", "my_output"])
No extra code needed in your subclass.
When to extend SynchronousDataset directly
Use a direct SynchronousDataset subclass (instead of ProfiledDataset) only when:
- The directory structure is too irregular to describe with the layer DSL
- You need custom loading logic per frame that cannot be expressed with dtype/reshape/mask
In that case, implement __len__, __getitem__, __iter__, __next__, and override derived_path() and _seq_root() as needed. See apairo/dataset/goose/dataset.py (the pre-refactor version in git history) as a reference.