Skip to content

Views & Caching

Two lightweight primitives for controlling what gets loaded and where it lives in memory. Both return full apairo datasets — they chain with .transform(), .filter(), .join(), and plug directly into PyTorch DataLoader.


ds.select(keys) — channel projection

Returns a ChannelView: a view over a subset of channels. The parent's transforms are applied first; then only the requested keys are kept.

ds = Rellis3DDataset(root, keys=["lidar", "trav_gt", "ground_height_csf"])
ds.transform("ground_height_csf", expensive_smooth)

view = ds.select(["ground_height_csf"])
view[0].data  # {"ground_height_csf": ...}  — smooth already applied

select() is most useful as the step before .cache().


ds.cache() — in-memory materialisation

Returns a CachedDataset: iterates the full dataset once at call time, stores every sample in RAM, and serves all subsequent accesses from memory with no I/O.

ds_prior = ds.select(["ground_height_csf"]).cache()
# ds_prior is now in RAM — reading it costs no disk I/O

Memory warning — all samples are loaded at construction. Only call .cache() on datasets that fit in RAM, typically after .filter() or .select() has reduced the volume.


.cache() as a deterministic boundary

The most important property of .cache() is not performance — it's what it communicates.

The rule is simple:

  • Deterministic → safe to cache: dtype casts, coordinate transforms, filters, preprocessed channels
  • Stochastic → never cache: data augmentation, random subsampling, dropout

Placing .cache() mid-chain makes the boundary explicit and visible at a glance:

# Everything before .cache() is deterministic — computed once, frozen in RAM
ds_train_base = (
    ds.split("train")
    .filter("trav_gt", HasMinPositives(min_pos))  # deterministic filter
    .transform("lidar", RobotFilter(d=1.0))        # deterministic transform
    .cache()                                        # <-- boundary
)

# Everything after .cache() is stochastic — runs fresh every access
ds_train = ds_train_base.transform(SparseAugment(voxel_size))

Without .cache(), the boundary exists but is invisible — a reader must trace mentally through the pipeline to find where determinism ends. With .cache(), it is structural.


Caching a derived channel across training runs

Cache an expensive derived channel once, reuse it across multiple training configurations:

ds = Rellis3DDataset(root, keys=["lidar", "trav_gt", "ground_height_csf"])
ds.transform("ground_height_csf", expensive_smooth)  # deterministic

# Computed once, stored in RAM
ds_prior = ds.select(["ground_height_csf"]).cache()

# Each training run: prior served from RAM, augmentation applied fresh each access
ds_v1 = Rellis3DDataset(root, keys=["lidar", "trav_gt"]).join(ds_prior).transform(augment_v1)
ds_v2 = Rellis3DDataset(root, keys=["lidar", "trav_gt"]).join(ds_prior).transform(augment_v2)

Behaviour summary

select(keys) cache()
When evaluated At access time At construction (eager)
I/O cost Same as parent Zero after construction
RAM cost None All samples in memory
Parent transforms Applied before projection Applied before storing
Chaining Full (transform, filter, join, cache) Full
What to put before Anything Deterministic operations only
What to put after Stochastic operations (augmentation)