Transforms

Transforms let you apply callables to channel data at access time, without persisting anything to disk. This is the right tool for normalisations, type conversions, augmentations, or any operation cheap enough to run on the fly.

Companion library: apairo_transform ships a collection of ready-made transforms (range filters, normalisation, voxelisation, …) that plug directly into this API.

`dataset.transform()`

A single method, two forms. All steps are registered in order and run as a unified pipeline at access time.

Per-channel form

ds.transform(key, fn, output=None, keep=True)

fn receives sample.data[key] and returns the transformed value. By default the result overwrites key in-place:

(
    ds.transform("lidar", lambda pts: pts[pts[:, 2] > -2])   # keep points with z above -2
      .transform("lidar", lambda pts: pts / pts.max())       # scale by the largest value
)
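
Lambdas work for one-liners, but named functions are easier to test in isolation and to guard against edge cases such as an empty point cloud. The following is a minimal sketch, assuming the lidar channel is an (N, 3+) NumPy array with z in column 2. The function names `drop_below` and `scale_by_max` are illustrative and not part of apairo.

```python
import numpy as np

def drop_below(pts: np.ndarray, z_min: float = -2.0) -> np.ndarray:
    """Keep points whose z coordinate (column 2) is above z_min."""
    return pts[pts[:, 2] > z_min]

def scale_by_max(pts: np.ndarray) -> np.ndarray:
    """Divide by the largest value; return the input unchanged if it is empty or the max is zero."""
    if pts.size == 0:
        return pts
    peak = pts.max()
    return pts / peak if peak != 0 else pts

# Exercise the functions on a small array before registering them.
pts = np.array([[1.0, 2.0, -3.0], [4.0, 0.0, 0.5], [2.0, 8.0, 1.0]])
filtered = drop_below(pts)       # the first row has z = -3 and is dropped
scaled = scale_by_max(filtered)  # values are divided by 8.0
```

Functions like these could then be registered with ds.transform("lidar", drop_below) followed by ds.transform("lidar", scale_by_max).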

Sample-level form

ds.transform(fn)   # fn: Sample -> Sample

fn receives the full Sample. Use this when an operation must touch several channels consistently:

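import numpy as np

def range_filter(sample):
    # Keep points whose Euclidean distance from the origin (x, y, z) is below 50.0.
    mask = np.linalg.norm(sample.data["lidar"][:, :3], axis=1) < 50.0
    sample.data["lidar"] = sample.data["lidar"][mask]
    sample.data["labels"] = sample.data["labels"][mask]   # keep labels aligned with points
    return sample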

ds.transform(range_filter)
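
A sample-level function can be checked without a dataset by calling it on a stand-in object. The sketch below assumes only that a sample exposes a `.data` mapping, as described above. `ToySample` is a hypothetical stand-in, not apairo's Sample class, and the filter here uses the Euclidean norm as its distance measure.

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class ToySample:
    """Hypothetical stand-in exposing only the .data mapping."""
    data: dict = field(default_factory=dict)

def range_filter(sample, max_range=50.0):
    # One mask drives both channels, so points and labels stay aligned.
    pts = sample.data["lidar"]
    mask = np.linalg.norm(pts[:, :3], axis=1) < max_range
    sample.data["lidar"] = pts[mask]
    sample.data["labels"] = sample.data["labels"][mask]
    return sample

sample = ToySample(data={
    "lidar": np.array([[1.0, 0.0, 0.0],
                       [60.0, 0.0, 0.0],
                       [-70.0, 0.0, 0.0],
                       [3.0, 4.0, 0.0]]),
    "labels": np.array([0, 1, 2, 3]),
})
out = range_filter(sample)  # points at distance 60 and 70 are removed
```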

Both forms return self and compose in registration order:

(
    ds.transform("lidar", Normalize())   # step 1
      .transform(range_filter)           # step 2: sees normalised lidar
      .transform("lidar", Voxelize())    # step 3
)

Publishing a channel — `output`

Pass output to write the result of a per-channel transform to a new key while leaving the source intact. The published channel is then visible to all subsequent pipeline steps:

ds.transform("lidar", RangeFilter(max=50.0), output="lidar_f")

ds.transform("lidar_f", Normalize())   # branch 1 — reads published channel
ds.transform("lidar_f", Voxelize())    # branch 2 — same source, different op

Both branches read from lidar_f as it was when it was published, regardless of what the other branch does to it.

Temporary channels — `keep=False`

Set keep=False alongside output to drop the published channel from the final sample. Useful for intermediate results that are only needed within the pipeline:

ds.transform("lidar", compute_mask_fn, output="_mask", keep=False)
ds.transform(lambda s: apply_mask(s, "_mask"))
# "_mask" is gone from the returned sample; "lidar" and "labels" are filtered
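
To make the interplay of registration order, `output`, and `keep=False` concrete, here is a toy model of the behaviour described on this page. It is a sketch under stated assumptions, not apairo's implementation: `ToyPipeline` and its `run` method are hypothetical, samples are plain dicts, and channels are Python lists.

```python
class ToyPipeline:
    """Toy model: steps run in registration order on a dict of channels."""

    def __init__(self):
        self._steps = []

    def transform(self, key_or_fn, fn=None, output=None, keep=True):
        if callable(key_or_fn) and fn is None:
            # Sample-level form: the callable receives the whole mapping.
            self._steps.append(("sample", key_or_fn, None, None, True))
        else:
            # Per-channel form: the callable receives one channel.
            self._steps.append(("channel", fn, key_or_fn, output, keep))
        return self

    def run(self, data):
        data = dict(data)   # shallow copy so the caller's dict is untouched
        temporary = set()
        for kind, fn, key, output, keep in self._steps:
            if kind == "sample":
                data = fn(data)
                continue
            target = key if output is None else output
            data[target] = fn(data[key])
            if output is not None and not keep:
                temporary.add(output)
        for name in temporary:   # dropped only after every step has run
            data.pop(name, None)
        return data

def apply_mask(d):
    mask = d["_mask"]
    d["lidar"] = [p for p, m in zip(d["lidar"], mask) if m]
    d["labels"] = [l for l, m in zip(d["labels"], mask) if m]
    return d

pipe = ToyPipeline()
pipe.transform("lidar", lambda pts: [p > 0 for p in pts], output="_mask", keep=False)
pipe.transform(apply_mask)
result = pipe.run({"lidar": [-1, 2, 3], "labels": ["a", "b", "c"]})
```

The mask is visible to the second step because it was published earlier in the pipeline, and it is removed only once all steps have finished.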

`Compose`

Compose wraps multiple callables into one, useful for naming or reusing a pipeline:

from apairo import Compose

ds.transform("lidar", Compose([RangeFilter(max=50.0), Normalize()]))
print(ds._pipeline[-1])  # Compose([RangeFilter, Normalize])
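
Conceptually, composition of this kind amounts to calling each function on the previous one's result. The class below is a minimal sketch of that idea, named `ComposeSketch` to make clear it is not apairo's `Compose`, whose internals may differ.

```python
class ComposeSketch:
    """Chain callables left to right: ComposeSketch([f, g])(x) == g(f(x))."""

    def __init__(self, fns):
        self.fns = list(fns)

    def __call__(self, x):
        for fn in self.fns:
            x = fn(x)
        return x

    def __repr__(self):
        # Functions carry __name__; callable instances fall back to their class name.
        names = ", ".join(getattr(f, "__name__", type(f).__name__) for f in self.fns)
        return f"ComposeSketch([{names}])"

def add_one(x):
    return x + 1

def double(x):
    return x * 2

pipeline = ComposeSketch([add_one, double])
value = pipeline(3)  # add_one runs first, then double
```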

Behaviour summary

| Property | Detail |
| --- | --- |
| No disk writes | Transforms run in memory at `__getitem__` time. |
| Order | All steps (per-channel and sample-level) run in registration order. |
| Scope | Per-instance. Transforms on `ds` do not affect another instance at the same path. |
| `output` | Publishes result as a new channel; source channel unchanged. |
| `keep=False` | Removes an `output` channel from the final sample after the full pipeline runs. |

Transforms register in place

transform() mutates the dataset and returns the same object for chaining. After v1 = ds.transform(a) and v2 = ds.transform(b), all three names refer to the same object (v1 is v2 is ds), and both transforms are stacked on it. To build independent variants from one dataset, branch first, for example via ds.filter(...), ds.select(ds.keys), or separate instances, and then register transforms on each branch.
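
The aliasing is easy to reproduce with a toy class that follows the same mutate-and-return-self pattern. This is a sketch under stated assumptions: `ToyDataset` and its `branch` method are hypothetical, with `branch` standing in for whichever apairo call yields a separate instance.

```python
class ToyDataset:
    """Toy stand-in: transform() appends to a list and returns self."""

    def __init__(self, steps=None):
        self.steps = list(steps or [])

    def transform(self, fn):
        self.steps.append(fn)
        return self

    def branch(self):
        # Hypothetical: a new instance with its own copy of the step list.
        return ToyDataset(self.steps)

def a(x):
    return x

def b(x):
    return x

# Registering on the same object stacks both transforms.
ds = ToyDataset()
v1 = ds.transform(a)
v2 = ds.transform(b)
same_object = v1 is v2 is ds

# Branching first keeps the variants independent.
base = ToyDataset()
left = base.branch().transform(a)
right = base.branch().transform(b)
```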
