Lazy DatasetCollection — complete cookbook¶
A DatasetCollection stacks per-file reads into a single 4-D (T, B, R, C) lazy dask.array.Array. Each timestep is opened on demand through CachingFileManager, so workers never serialise a live gdal.Dataset handle — only the file path crosses the pickle boundary.
This notebook exercises the full collection surface against local test data:
DatasetCollection.from_files(...)— construct without opening every file eagerly.collection.data— the 4-D lazy cube.- Built-in reductions (
mean/sum/min/max/std/var). collection.groupby(labels)— per-group reductions (monthly means, seasonal climatologies).collection.meta— the picklableRasterMetasnapshot.collection.to_zarr(store)— parallel cube write.
Requirements¶
pip install 'pyramids-gis[lazy]'
Setup¶
import os
os.environ['MPLBACKEND'] = 'Agg'
from pathlib import Path
import numpy as np
DATA = (Path('..') / '..' / '..' / 'tests' / 'data').resolve()
DATA.is_dir()
1. Build a synthetic stack¶
The pyramids test suite has a folder of aligned rasters under tests/data/raster_folder/ — 6 rasters with the same geobox, different values. Perfect for a short cube.
# 6 aligned rasters named 0.tif..5.tif — same geobox, different values.
folder = DATA / 'crop_aligned_folder'
files = sorted(folder.glob('[0-5].tif'))
[f.name for f in files]
from pyramids.dataset import DatasetCollection
cube = DatasetCollection.from_files(files)
cube.time_length, cube.rows, cube.columns
2. The lazy 4-D cube — collection.data¶
collection.data is a dask.array.Array of shape (T, B, R, C). Every per-file read is a dask.delayed task that opens the file via CachingFileManager.
cube.data.shape, cube.data.dtype
# Chunks are one-timestep-per-chunk by default —
# the delayed-based graph reads a whole file at a time.
cube.data.chunks
3. Built-in reductions¶
Six reductions ship directly on the collection so you don't have to write .data.nanmean(axis=0).compute() every time. Each returns a numpy.ndarray of shape (B, R, C) after a single .compute().
time_mean = cube.mean(skipna=True)
time_sum = cube.sum(skipna=True)
type(time_mean).__name__, time_mean.shape
# std / var are also available
stats = {
'min': float(cube.min().min()),
'max': float(cube.max().max()),
'mean': float(cube.mean().mean()),
'std': float(cube.std().mean()),
}
stats
4. Grouped reductions — groupby(labels)¶
Pass a per-timestep label array and get {label: array}. Pyramids routes through flox for the single-pass grouped reduction when the [lazy] extra is installed; otherwise it falls back to a per-label loop with equivalent semantics.
# Synthetic label array — group timesteps into two buckets.
labels = np.array(['A', 'A', 'A', 'B', 'B', 'B'][: cube.time_length])
grouped = cube.groupby(labels)
group_means = grouped.mean()
sorted(group_means.keys()), group_means['A'].shape
5. collection.meta — the picklable RasterMeta snapshot¶
Every collection caches a frozen RasterMeta of the template file's geobox + dtype + nodata. It's the single place the lazy paths read metadata from — no GDAL opens per call, and the collection pickles cleanly for dask.distributed.
meta = cube.meta
(
meta.epsg,
meta.shape,
meta.cell_size,
meta.nodata,
meta.dtype,
)
# The snapshot is picklable — ship across dask workers.
import pickle
restored = pickle.loads(pickle.dumps(meta))
restored == meta
6. Parallel cube write — collection.to_zarr¶
Each dask chunk in collection.data lands in an independent Zarr chunk file. The store carries pyramids + rioxarray attributes (epsg, crs_wkt, GeoTransform, nodata, band_names, time_length, source file list), so downstream xr.open_zarr consumers can reconstruct the geobox without pyramids.
import tempfile
workdir = Path(tempfile.mkdtemp(prefix='pyramids-cube-'))
store = workdir / 'cube.zarr'
cube.to_zarr(str(store), mode='w')
sorted(p.name for p in store.iterdir())
# Inspect the metadata — readable by any Zarr/xarray consumer.
import json
root_attrs = json.loads((store / '.zattrs').read_text())
root_attrs['pyramids_zarr_version'], root_attrs['time_length']
Closing notes¶
from_stac(items, asset='B04')is the STAC variant — same cube, different source. Works on any duck-typed STAC-like iterable (pystac.Item, raw JSON dicts, ...); no pyramids extra required.to_kerchunk(path, concat_dim='time')emits a JSON index for NetCDF/HDF5 cubes; GeoTIFF is tracked as a follow-on.- For a cloud-hosted Sentinel-2 walkthrough, see
dask-lazy-datasets.ipynb(pairs STAC search withfrom_stac).