Lazy Dataset — complete cookbook¶
This notebook exercises every lazy surface that pyramids.dataset.Dataset exposes, against local test data shipped with pyramids. Nothing here hits the network — you can run it end-to-end after a pip install 'pyramids-gis[lazy]'.
What you'll see¶
Dataset.read_array(chunks=…)— eager vs lazy returns.- Picking a chunk size and inspecting the task graph.
dask.arrayreductions and arithmetic over the lazy array.Dataset.to_zarr/Dataset.from_zarr— the only truly-parallel raster output path.focal_mean,focal_std,focal_apply,slope,aspect,hillshade— every neighbourhood op accepts the samechunks=switch.zonal_statsover a polygon FeatureCollection.- Dispatch helpers:
is_lazy,as_numpy.
Requirements¶
pip install 'pyramids-gis[lazy]'
Setup — force a headless matplotlib backend¶
cleopatra (pyramids' plotting peer) forces TkAgg at import time, which opens blocking popup windows during non-interactive runs. Force Agg before any import.
import os
os.environ['MPLBACKEND'] = 'Agg'
from pathlib import Path
import numpy as np
DATA = Path('..') / '..' / '..' / 'tests' / 'data'
DATA = DATA.resolve()
DATA.is_dir()
1. Dataset.read_array(chunks=…) — eager vs lazy¶
chunks=None (the default) returns a numpy.ndarray and does not import dask. Any other value opts in to a dask.array.Array. The switch is local to the call — no global mode, no class subclass.
from pyramids.dataset import Dataset
ds = Dataset.read_file(str(DATA / 'acc4000.tif'))
ds.shape, ds.cell_size, ds.epsg
# Eager — numpy array.
eager = ds.read_array()
type(eager).__name__, eager.shape, eager.dtype
# Lazy — dask array. Nothing is read yet.
lazy = ds.read_array(chunks=(32, 32))
type(lazy).__name__, lazy.shape, lazy.chunks
Dispatch helpers — is_lazy, as_numpy¶
Library code that accepts either backend should use the structural helpers in pyramids.base.protocols instead of comparing against np.ndarray / dask.array.Array directly.
from pyramids.base.protocols import as_numpy, is_lazy
is_lazy(eager), is_lazy(lazy)
# as_numpy is a no-op on numpy, and forces a compute on dask.
materialised = as_numpy(lazy)
type(materialised).__name__, materialised.shape
2. Reductions and arithmetic over the lazy array¶
Every dask.array op is available — pyramids does not add a wrapper. The graph is built without touching the file; .compute() triggers the read + reduction.
lazy = ds.read_array(chunks=(32, 32))
# Build an arithmetic graph — no I/O yet.
anomaly = lazy - lazy.mean()
len(anomaly.__dask_graph__())
# Materialise.
result = anomaly.compute()
result.shape, float(result.min()), float(result.max())
3. Parallel writes — Dataset.to_zarr / from_zarr¶
Zarr is the only raster output where pyramids can do truly parallel writes: each dask chunk lands in an independent chunk file. The store carries rioxarray-compatible geobox attributes, so downstream consumers can reopen it without pyramids.
import tempfile
workdir = Path(tempfile.mkdtemp(prefix='pyramids-lazy-'))
store = workdir / 'acc4000.zarr'
ds.to_zarr(str(store), chunks=(1, 32, 32), mode='w')
sorted(p.name for p in store.iterdir())[:5]
# Round-trip back into a Dataset — still lazy if chunks is given.
reloaded = Dataset.from_zarr(str(store), chunks=(1, 32, 32))
reloaded.epsg, reloaded.shape, reloaded.cell_size
# The Zarr attributes are readable without pyramids — just json.
import json
meta = json.loads((store / 'data' / '.zattrs').read_text())
meta['epsg'], meta['dtype'], meta['GeoTransform']
4. Neighbourhood ops — focal_mean / _std / _apply¶
Every focal op takes the same chunks= switch and resolves to dask.array.map_overlap(kernel, depth=radius, boundary='reflect') when it's set. The eager kernel is unchanged (SciPy ndimage filters); only the halo bookkeeping changes.
# Eager focal mean.
mean_eager = ds.focal_mean(radius=1)
type(mean_eager).__name__, mean_eager.shape
# Lazy focal mean — same radius, chunks= provided.
mean_lazy = ds.focal_mean(radius=1, chunks=(32, 32))
type(mean_lazy).__name__, mean_lazy.chunks
# User kernel via focal_apply — per-window max over a 3x3 window.
big_max = ds.focal_apply(np.max, radius=1, chunks=(32, 32))
materialised_max = big_max.compute()
float(materialised_max.max())
# L4 — focal_std uses a two-pass formulation. A constant
# raster has zero variance, which the unstable formulation
# can return as a tiny negative float — we should see 0.
zero_std = ds.focal_std(radius=1, chunks=(32, 32))
float(zero_std.min().compute()) >= 0.0
5. DEM derivatives — slope, aspect, hillshade¶
The three classic DEM ops wrap centered-difference gradients. On lazy inputs the gradient is computed per chunk via the same map_overlap path.
dem = Dataset.read_file(str(DATA / 'dem' / 'DEM5km_Rhine_burned_acc.tif'))
dem.shape, dem.cell_size
slope_lazy = dem.slope(chunks=(64, 64), units='degrees')
aspect_lazy = dem.aspect(chunks=(64, 64))
shade_lazy = dem.hillshade(azimuth=315.0, altitude=45.0, chunks=(64, 64))
slope_lazy.shape, aspect_lazy.shape, shade_lazy.shape
# Materialise hillshade once — it's the most visually
# meaningful of the three derivatives.
shade = shade_lazy.compute()
float(shade.min()), float(shade.max())
6. Zonal statistics — Dataset.zonal_stats¶
Single-pass: every polygon is rasterised into one integer label grid; np.bincount then gives mean / sum / count. The FeatureCollection must share the Dataset's CRS (H4 — pyramids raises instead of silently mis-aligning).
from pyramids.feature import FeatureCollection
polys = FeatureCollection.read_file(
str(DATA / 'coello_polygons.geojson'),
)
# Match the raster CRS.
polys = FeatureCollection(polys.to_crs(ds.epsg))
len(polys), polys.epsg
stats = ds.zonal_stats(polys, stats=('mean', 'sum', 'count'))
stats.head()
Closing notes¶
Datasetitself is always eager;chunks=flips the read to lazy.- Every lazy call is safe to ship to
dask.distributedworkers —CachingFileManagerpickles its recipe, not its handle. - For the cloud-hosted Sentinel-2 walkthrough, see
dask-lazy-datasets.ipynb. - For the
DatasetCollectionlazy cube, seelazy-collection-complete.ipynb.