Lazy raster reads + parallel writes with Dataset and DatasetCollection¶
This notebook exercises the dask-backed raster path against a real public dataset — Sentinel-2 L2A COGs on AWS, hosted in the sentinel-cogs bucket (us-west-2). Each COG is ~100 MB and a full MGRS tile is ~1 GB; a single band at 10 m resolution is already 10980 × 10980 floats = ~480 MB per band.
What you'll see¶
- Lazy raster reads via
Dataset.read_array(chunks=...)over a cloud COG. Dataset.map_blocks(func, chunks=...)— apply a numpy kernel across every tile in parallel.- Parallel Zarr writes —
Dataset.to_zarr(store, compute=False)returns aDelayed; the actual write is per-tile and runs on every worker. DatasetCollection.from_files(...)over ten date slices of the same MGRS tile — a 4-D lazy cube you can reduce across time.
Requirements¶
pip install 'pyramids-gis[lazy]'
This pulls dask[array], distributed, zarr, fsspec. GDAL already handles /vsicurl/ / /vsis3/ cloud reads once the library is installed.
Configure cloud defaults + process scheduler¶
GDAL has ~8 environment variables that matter for cloud COG reads (multirange HTTP, caching, directory scans). pyramids.configure(cloud_defaults=True) applies them in one call. For the dask side, process scheduler avoids the GIL contention on numpy reductions.
import os
os.environ['MPLBACKEND'] = 'Agg'
import dask
dask.config.set(scheduler="processes") # process-backed; no GIL contention
from pyramids import configure
env = configure(cloud_defaults=True, aws={"aws_unsigned": True})
len(env), env.get("GDAL_DISABLE_READDIR_ON_OPEN")
1. Open a single Sentinel-2 scene as a lazy Dataset¶
Sentinel-2 L2A COGs are served under s3://sentinel-cogs/sentinel-s2-l2a-cogs/. Each scene is a folder of single-band COGs. We pick one tile + one scene + one band (B04 — red, 10 m) and read it lazily. The /vsicurl/ rewrite from _to_vsi means GDAL fetches only the block-ranges dask asks for, not the whole file.
from pyramids.dataset import Dataset
# Any recent 10m B04 Sentinel-2 L2A tile. Replace the path if the scene
# is no longer hosted (AWS keeps several years of history under this URL scheme).
SCENE = (
"https://sentinel-cogs.s3.us-west-2.amazonaws.com/"
"sentinel-s2-l2a-cogs/31/U/DQ/2024/6/S2B_31UDQ_20240612_0_L2A/B04.tif"
)
ds = Dataset.read_file(SCENE)
(ds.rows, ds.columns), ds.epsg, ds.cell_size
# Lazy read: nothing materialised yet.
# chunks='auto' respects the COG block shape (typically 1024x1024).
lazy_band = ds.read_array(band=0, chunks="auto")
type(lazy_band).__name__, lazy_band.shape, lazy_band.chunksize
2. map_blocks — apply a numpy kernel across every tile in parallel¶
Classic use case: band-math on a single scene (e.g. normalised difference index). Dataset.map_blocks passes each dask chunk through a pure-numpy function, so workers can run independently.
import numpy as np
def rescale(arr: np.ndarray) -> np.ndarray:
# Sentinel-2 L2A is 0-10000 reflectance × 10000; convert to 0-1 float
return (arr.astype(np.float32) / 10000.0).clip(0.0, 1.0)
rescaled = ds.map_blocks(rescale, chunks="auto", band=0)
type(rescaled).__name__, rescaled.shape
3. Parallel write via Zarr¶
Dataset.to_zarr(store, compute=False) returns a dask.delayed.Delayed bundling the data write + the geobox-metadata write into one atomic task. Calling .compute() triggers every chunk in parallel — on a LocalCluster, you'll see worker processes write individual Zarr chunk files.
import tempfile
from pathlib import Path
store = Path(tempfile.mkdtemp()) / "B04_rescaled.zarr"
delayed = ds.to_zarr(str(store), compute=False)
type(delayed).__name__
# Force the graph — every chunk writes in parallel under the scheduler configured above.
delayed.compute()
list(store.iterdir())[:5]
# Round-trip check.
recovered = Dataset.from_zarr(str(store))
(recovered.rows, recovered.columns), recovered.epsg
4. A temporal cube with DatasetCollection¶
Ten same-tile B04 scenes form a ~4 GB 4-D cube. DatasetCollection.from_files(files) does NOT open them all up front; .data returns a lazy dask array that materialises per-chunk. Reductions across the time axis (.mean(axis=0), .median()) parallelise across both time and space — this is where dask earns its keep.
from pyramids.dataset import DatasetCollection
# Substitute ten real URLs for the tile/date slices you care about.
# Each must be a Sentinel-2 B04 COG of the same MGRS tile.
FILES = [
f"https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/"
f"31/U/DQ/2024/{m}/S2B_31UDQ_2024{m:02d}12_0_L2A/B04.tif"
for m in range(3, 13)
]
cube = DatasetCollection.from_files(FILES)
cube.time_length, cube.data.shape, cube.data.chunksize
# Temporal reduction — workers pull only the chunks they need for the mean;
# the cube is never fully materialised in a single process.
annual_mean = cube.data.mean(axis=0).compute()
annual_mean.shape, float(annual_mean.mean())
5. Cube → Zarr for downstream xarray consumers¶
DatasetCollection.to_zarr(store) writes a CF-compatible Zarr store. Downstream xarray / rioxarray consumers can open it without pyramids.
cube_store = Path(tempfile.mkdtemp()) / "B04_annual_cube.zarr"
cube.to_zarr(str(cube_store)) # blocks until every partition is written
list(cube_store.iterdir())[:5]
Further reading¶
- Sentinel-2 L2A COGs — AWS Registry of Open Data
- Earth Search STAC API — discover scenes by bbox+date
- planning/dask/integration-plan.md — full DASK task tracker