Lazy time-series cubes — DatasetCollection#
DatasetCollection is pyramids' time-series primitive: a stack of
aligned rasters — one per timestep — that exposes a single 4-D
(T, B, R, C) dask.array.Array and a matching set of reductions.
It sits on top of the lazy Dataset path: each timestep is opened
on demand through CachingFileManager, workers never see a live
gdal.Dataset handle, and the whole collection pickles cleanly for
dask.distributed.
See lazy-compute.md for schedulers and
configure. This page covers the collection-specific surface:
construction, the 4-D cube, reductions, and the two serialisation
formats (Zarr + kerchunk).
Installing the extras:
pip install 'pyramids-gis[lazy]' # cube + reductions + Zarr
pip install 'pyramids-gis[netcdf-lazy]' # + to_kerchunk
# from_stac needs no extra — it's duck-typed over any iterable of
# STAC-like items (pystac.Item, raw JSON dicts, etc.). Users who
# build items via pystac-client install pystac themselves.
Constructing a collection#
Three entry points:
| Entry point | Source |
|---|---|
DatasetCollection.from_files(files) |
A list of local or cloud paths |
DatasetCollection.from_stac(items, asset="B04") |
STAC items (pystac / raw dicts) |
DatasetCollection.read_multiple_files(folder) |
A folder, sorted by date regex |
from_files opens only the first file eagerly — enough to derive
the geobox and dtype — and references the rest by path only:
from pathlib import Path
from pyramids.dataset import DatasetCollection
files = sorted(Path("/data/noah").glob("noah_*.tif"))
cube = DatasetCollection.from_files(files)
cube.time_length # len(files)
cube.rows, cube.columns, cube.shape
No per-file opens happen until you access cube.data.
from_stac takes any iterable of STAC Items — pystac.Item
objects from a live pystac-client search, raw JSON dicts from an
HTTP response, or any other duck-typed object with .assets and
.bbox — and resolves the named asset's href per item. pyramids
does not import pystac; pull it yourself (directly or transitively
via pystac-client) when you need to construct Items:
import pystac_client
from pyramids.dataset import DatasetCollection
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
collections=["sentinel-2-l2a"],
bbox=(-1.0, 41.0, 0.0, 42.0),
datetime="2022-06-01/2022-06-30",
query={"eo:cloud_cover": {"lt": 20}},
)
cube = DatasetCollection.from_stac(
search.items(),
asset="red", # the B04 band alias
max_items=20, # cap for quick-look
)
Optional kwargs:
bbox=(minx, miny, maxx, maxy)— lon/lat filter; items whosebboxdoesn't intersect are dropped before hrefs are resolved.max_items=N— cap the item count after bbox filtering.patch_url=callable— rewrite each href before pyramids opens it. Useful for signing Planetary Computer URLs or adding query parameters.
The 4-D cube — collection.data#
collection.data is a lazy dask.array.Array of shape
(T, B, R, C):
cube = DatasetCollection.from_files(files)
cube.data.shape # (n_files, bands, rows, cols)
cube.data.chunks # ((1, 1, ...), (bands,), (rows,), (cols,))
Each per-file read is scheduled as a dask.delayed task that opens
the file via CachingFileManager and reads its full array. Workers
therefore never serialise a gdal.Dataset — only the file path
crosses the pickle boundary, matching the pattern xarray / stackstac
/ odc-stac use for dask.distributed safety.
Access the whole graph the same way you'd access any dask array:
# reductions along time
time_mean = cube.data.mean(axis=0).compute() # (B, R, C)
# slice a time window
sub = cube.data[0:30]
# combine with arithmetic
anomaly = cube.data - cube.data.mean(axis=0, keepdims=True)
Built-in reductions#
A small set of common reductions ships on the collection directly so
you don't have to write .data.nanmean(axis=0).compute() every
time:
| Method | Action |
|---|---|
collection.mean(skipna=True) |
Element-wise mean across time |
collection.sum(skipna=True) |
Element-wise sum |
collection.min(skipna=True) |
Element-wise min |
collection.max(skipna=True) |
Element-wise max |
collection.std(skipna=True) |
Element-wise standard deviation |
collection.var(skipna=True) |
Element-wise variance |
Each returns a numpy.ndarray of shape (B, R, C) after a single
.compute(). skipna=True (the default) routes to the nan* dask
variant.
cube = DatasetCollection.from_files(files)
mean_rain = cube.mean() # numpy.ndarray, shape (bands, rows, cols)
Grouped reductions — collection.groupby(labels)#
groupby lets you compute per-group reductions — monthly means,
per-cluster averages, seasonal climatologies. Pass a label array the
same length as the collection:
import numpy as np
cube = DatasetCollection.from_files(daily_files)
months = np.array([d.month for d in dates]) # one per file
grouped = cube.groupby(months)
monthly_mean = grouped.mean() # dict {1: array, 2: array, ...}
monthly_std = grouped.std()
Under the hood, pyramids routes through flox (single-pass
grouped reduction — installed with [lazy]). When flox isn't
available it falls back to a per-label loop with equivalent
semantics. Both paths return a dict mapping label → numpy array.
Per-timestep metadata — RasterMeta#
Every collection caches a picklable RasterMeta snapshot of the
template file's geobox + dtype + nodata. Available via
collection.meta — always reachable without reopening the template
dataset:
cube = DatasetCollection.from_files(files)
cube.meta.epsg # 4326
cube.meta.shape # (bands, rows, cols)
cube.meta.cell_size # 0.001
cube.meta.transform # GeoTransform tuple
cube.meta.nodata # per-band tuple
cube.meta.block_size # per-band (bw, bh)
The snapshot is derived eagerly at construction so:
- Downstream lazy paths (per-file reads, Zarr writes, kerchunk manifests) read geo metadata without paying a GDAL-open cost per call.
- The collection pickles cleanly even if the template Dataset handle
is closed or points at a
/vsimem/file.
Writing a cube to Zarr — collection.to_zarr#
Zarr is the only output where pyramids can do truly parallel writes:
each dask chunk in collection.data lands in an independent Zarr
chunk file. A rioxarray-compatible attribute schema is written so
downstream xr.open_zarr(store) consumers can reconstruct the geobox
without pyramids:
The root group carries epsg, GeoTransform, crs_wkt, nodata,
band_names, dtype, plus a pyramids_zarr_version marker and the
source file list.
Cloud writes via fsspec:
cube.to_zarr(
"s3://bucket/key.zarr",
storage_options={"anon": False, "key": "...", "secret": "..."},
)
Pass compute=False to return a dask.delayed.Delayed for deferred
execution; useful when the store-write is one step of a larger
graph.
collection.to_zarr raises RuntimeError on a collection without a
files list (e.g. the legacy create_cube(src, n) path) — Zarr
writes need a source file per timestep. ImportError raised when
the [lazy] extra is missing.
Kerchunk manifest — collection.to_kerchunk#
For NetCDF / HDF5-backed collections, emit a single JSON manifest that points at every timestep's source file. Downstream consumers open the entire cube as a lazy Zarr-backed xarray with zero data rewrite:
cube = DatasetCollection.from_files(netcdf_files)
cube.to_kerchunk("cube.kerchunk.json", concat_dim="time")
Consume with fsspec + xarray:
import fsspec, xarray as xr
mapper = fsspec.get_mapper("reference://", fo="cube.kerchunk.json")
ds = xr.open_zarr(mapper, consolidated=False)
GeoTIFF-backed collections raise NotImplementedError on
to_kerchunk — the GeoTIFF manifest path needs kerchunk.tiff +
tifffile which isn't yet wired. Use to_zarr for those today.
Kerchunk lives behind [netcdf-lazy].
When to use which output#
| Goal | Reach for |
|---|---|
| Parallel write, rewrite to compact storage | to_zarr |
| Zero-copy index over existing files | to_kerchunk |
| Reduce to a single array, no persistence | .mean() / etc. |
| Emit per-timestep processed artefacts | .data + user loop |
Interoperating with single-raster operations#
The template Dataset is always accessible via collection.base —
use it for anything that operates on a single raster (plotting,
zonal_stats with a reduced slab, geobox inspection):
cube = DatasetCollection.from_files(files)
# per-timestep zonal stats — loop over time axis
stats_per_step = []
for t in range(cube.time_length):
slab = cube.data[t].compute() # one timestep as numpy
# ... build a Dataset around `slab` if needed ...
# or: one zonal pass per reduction result
mean_raster = cube.mean() # numpy array (B, R, C)
Worked example — Sentinel-2 surface reflectance cube#
import pystac_client
from pyramids import configure
from pyramids.dataset import DatasetCollection
configure(cloud_defaults=True, aws={"aws_unsigned": True})
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
collections=["sentinel-2-l2a"],
bbox=(-1.0, 41.0, -0.7, 41.3),
datetime="2023-06-01/2023-08-31",
query={"eo:cloud_cover": {"lt": 10}},
)
cube = DatasetCollection.from_stac(
search.items(), asset="nir", max_items=30,
)
# Per-pixel summer mean NIR reflectance.
summer_mean = cube.mean() # numpy (1, H, W)
# Write the cube for reuse.
cube.to_zarr("s2_nir_summer.zarr")
A full runnable notebook spanning search → cube → reduction →
write is at docs/examples/dataset/dask-lazy-datasets.ipynb.
Minimal-install behaviour#
DatasetCollection.from_filesworks without the[lazy]extra — construction is metadata-only.collection.data, the six reductions, andto_zarrraiseImportErrorfor the[lazy]extra when missing.collection.to_kerchunkraisesImportErrorfor[netcdf-lazy]; also raisesNotImplementedErroron GeoTIFF-backed collections regardless of extras.DatasetCollection.from_stacneeds no pyramids extra — the implementation is duck-typed and does not import pystac.