Lazy NetCDF — complete cookbook¶
NetCDFs are typically the heaviest single files in a GIS pipeline. Pyramids exposes four lazy entry points that together cover single-file reads, multi-file stacks, zero-copy Zarr indexing, and xarray interop. This notebook walks through them against local test data.
What you'll see¶
- NetCDF.read_array(variable, chunks=…): eager vs lazy returns.
- CF unpack=True: applying scale_factor / add_offset lazily.
- NetCDF.open_mfdataset(paths, variable): stack many files along a leading time axis.
- NetCDF.to_kerchunk / combine_kerchunk: zero-copy cube indexing (requires [netcdf-lazy]).
- Handing a pyramids lazy array to xarray.DataArray via the [xarray] extra; pyramids stays a peer, not a backend.
Requirements¶
pip install 'pyramids-gis[lazy]' # core lazy path
pip install 'pyramids-gis[netcdf-lazy]' # + kerchunk + xarray
Setup¶
import os
os.environ['MPLBACKEND'] = 'Agg'
from pathlib import Path
import numpy as np
DATA = (Path('..') / '..' / '..' / 'tests' / 'data').resolve() / 'netcdf'
sorted(p.name for p in DATA.glob('*.nc'))
1. NetCDF.read_array(variable, chunks=…)¶
chunks=None (the default) returns a numpy.ndarray. Any other value returns a dask.array.Array instead; its chunks are read on demand through GDAL's MDArray reader, so nothing is loaded until you compute.
from pyramids.netcdf import NetCDF
nc = NetCDF.read_file(str(DATA / 'pyramids-netcdf-3d.nc'))
nc.variable_names
# Eager — numpy.
eager = nc.read_array('values')
type(eager).__name__, eager.shape, eager.dtype
# Lazy — dask array. One chunk per time step.
lazy = nc.read_array('values', chunks=(1, -1, -1))
type(lazy).__name__, lazy.shape, lazy.chunks
# Reduce along time — no I/O until .compute().
per_cell_mean = lazy.mean(axis=0)
len(per_cell_mean.__dask_graph__())
materialised = per_cell_mean.compute()
materialised.shape, float(materialised.min()), float(materialised.max())
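The chunk spec (1, -1, -1) used above follows dask's convention that -1 means "one chunk spanning the whole axis". As a rough illustration of how such a spec expands into explicit chunk sizes, here is a minimal pure-Python sketch. The helper name expand_chunks and the example shape are hypothetical; this is not the pyramids or dask implementation.

```python
def expand_chunks(spec, shape):
    """Expand a per-axis chunk spec into explicit chunk sizes.

    Hypothetical helper mirroring dask's convention that -1 means
    "a single chunk covering the full axis". Assumes non-empty axes.
    """
    if len(spec) != len(shape):
        raise ValueError("spec and shape must have the same number of axes")
    expanded = []
    for size, length in zip(spec, shape):
        if size == -1:
            size = length  # whole axis in one chunk
        if size < 1:
            raise ValueError(f"invalid chunk size {size!r}")
        full, remainder = divmod(length, size)
        # Full-size chunks first, then one smaller trailing chunk if needed.
        expanded.append((size,) * full + ((remainder,) if remainder else ()))
    return tuple(expanded)


# A made-up (time, y, x) shape; not the shape of the test file.
print(expand_chunks((1, -1, -1), shape=(3, 4, 5)))
```

With one chunk per time step, a reduction along axis 0 touches each step's chunk independently, which is why the graph above has one read task per step.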
2. CF unpack=True — lazy scale/offset¶
NetCDF variables often store packed ints with scale_factor / add_offset CF attributes. unpack=True applies the transformation in the dask graph — no materialisation.
nc_packed = NetCDF.read_file(str(DATA / 'two_vars_scale_offset.nc'))
nc_packed.variable_names
# Variable with scale_factor / add_offset set.
packed_lazy = nc_packed.read_array(
    nc_packed.variable_names[0],
    unpack=True,
    chunks='auto',
)
packed_lazy.dtype
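For reference, the CF unpacking rule is unpacked = packed * scale_factor + add_offset. The sketch below applies it lazily with a plain Python generator, as a stand-in for the dask graph. The function name unpack_lazily and the sample values are made up for illustration, and fill-value handling is left out.

```python
def unpack_lazily(packed, scale_factor=1.0, add_offset=0.0):
    """Yield CF-unpacked values one at a time.

    Hypothetical stand-in for a lazy graph: nothing is computed
    until the generator is consumed.
    """
    for value in packed:
        yield value * scale_factor + add_offset


packed = [0, 100, 200]  # made-up packed integers
lazy_values = unpack_lazily(packed, scale_factor=0.5, add_offset=10.0)

# Consuming the generator is the equivalent of calling .compute().
unpacked = list(lazy_values)
print(unpacked)  # [10.0, 60.0, 110.0]
```

Because the scale and offset are floats, the unpacked result is floating point even though the stored values are integers.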
3. open_mfdataset(paths, variable) — multi-file stack¶
Unlike xarray.open_mfdataset, the pyramids helper is deliberately narrow: one variable at a time, no combine strategies. parallel=True fans out per-file opens via dask.delayed.
# Re-use the same file three times for a 3-element stack.
paths = [str(DATA / 'pyramids-netcdf-3d.nc')] * 3
stack = NetCDF.open_mfdataset(paths, variable='values')
stack.shape
# Mean across the leading (stacking) axis: reduces back to one (T, H, W) slab.
stack_mean = stack.mean(axis=0).compute()
stack_mean.shape
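To make the axis-0 reduction concrete, here is a small pure-Python sketch. It assumes, as the comment above suggests, that each file contributes one (T, H, W) block and that the stack adds a new leading axis. The helper mean_over_leading_axis and the tiny nested lists are hypothetical and only mimic what the dask reduction computes.

```python
# Three hypothetical "files", each a tiny (T=2, H=1, W=2) block.
files = [
    [[[1.0, 2.0]], [[3.0, 4.0]]],
    [[[3.0, 4.0]], [[5.0, 6.0]]],
    [[[5.0, 6.0]], [[7.0, 8.0]]],
]


def mean_over_leading_axis(blocks):
    """Element-wise mean across the first axis of equally shaped nested lists."""
    first = blocks[0]
    if isinstance(first, list):
        # Walk the remaining axes position by position.
        return [
            mean_over_leading_axis([block[i] for block in blocks])
            for i in range(len(first))
        ]
    # Leaf level: average the scalar found at this position in every block.
    return sum(blocks) / len(blocks)


toy_mean = mean_over_leading_axis(files)
print(toy_mean)  # [[[3.0, 4.0]], [[5.0, 6.0]]]
```

The result has the shape of a single block, which matches the expectation that averaging over the stacking axis yields one (T, H, W) slab.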
4. Kerchunk — zero-copy cube indexing¶
A kerchunk manifest is a JSON document containing byte-range pointers into each source file; no pixel data is moved. Requires the [netcdf-lazy] extra. The helper raises a clean ImportError if kerchunk is missing.
import tempfile
workdir = Path(tempfile.mkdtemp(prefix='pyramids-kerchunk-'))
manifest_path = workdir / 'single.json'
try:
    manifest = nc.to_kerchunk(str(manifest_path))
    outcome = f'manifest has {len(manifest.get("refs", manifest))} entries'
except ImportError as exc:
    outcome = f'extra missing: {exc}'
outcome
# Combine many files into one manifest. The test file's leading
# axis is named 'bands'; real-world cubes use 'time'.
combined_path = workdir / 'combined.json'
try:
    manifest = NetCDF.combine_kerchunk(
        paths,
        str(combined_path),
        concat_dims=('bands',),
        identical_dims=(),
    )
    outcome = manifest_path.exists(), combined_path.exists()
except ImportError as exc:
    outcome = f'extra missing: {exc}'
outcome
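To show what "byte-range pointers" means in practice, the sketch below builds a toy manifest by hand and resolves one chunk key to bytes. It follows the general shape of a kerchunk version 1 reference file, where a chunk key maps to [path, offset, length]. The file contents, key names, and the read_chunk helper are invented for illustration. Real manifests also carry inline Zarr metadata entries such as .zarray and .zattrs.

```python
import json
import tempfile
from pathlib import Path

toy_dir = Path(tempfile.mkdtemp(prefix="kerchunk-sketch-"))

# Stand-in for a NetCDF file: a 4-byte header followed by two 4-byte "chunks".
source = toy_dir / "source.bin"
source.write_bytes(b"HEAD" + b"AAAA" + b"BBBB")

toy_manifest = {
    "version": 1,
    "refs": {
        # chunk key -> [path, byte offset, byte length]
        "values/0.0.0": [str(source), 4, 4],
        "values/1.0.0": [str(source), 8, 4],
    },
}
toy_manifest_path = toy_dir / "toy.json"
toy_manifest_path.write_text(json.dumps(toy_manifest))


def read_chunk(manifest_file, key):
    """Resolve a chunk key through the manifest and read only that byte range."""
    refs = json.loads(Path(manifest_file).read_text())["refs"]
    path, offset, length = refs[key]
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


print(read_chunk(toy_manifest_path, "values/1.0.0"))  # b'BBBB'
```

The manifest stores only paths and offsets, so its size depends on the number of chunks rather than on the size of the underlying data.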
5. Handing a pyramids lazy array to xarray¶
Pyramids is a peer of xarray, not a backend beneath it — the lazy read goes through pyramids' CachingFileManager + GDAL MDArray, no xarray engine plugin in the middle. If you need xarray.DataArray ergonomics downstream, install [xarray] and wrap the lazy array yourself:
try:
    import xarray as xr
    lazy = nc.read_array('values', chunks=(1, -1, -1))
    da = xr.DataArray(lazy, dims=('band', 'y', 'x'), name='values')
    outcome = da.name, tuple(da.dims), da.shape
except ImportError as exc:
    outcome = f'extra missing: {exc}'
outcome
Closing notes¶
- read_array and open_mfdataset only need the [lazy] extra.
- to_kerchunk and combine_kerchunk need [netcdf-lazy]; they raise a clean ImportError naming the extra on minimal installs.
- NetCDF.to_xarray() / .from_xarray() need the separate [xarray] extra; pyramids does not install xarray by default.
- For a real ERA5-on-AWS walkthrough, see dask-lazy-netcdf.ipynb.
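The "clean ImportError naming the extra" behaviour mentioned in the notes can be approximated with a small guard around the optional import. This is a sketch of the general pattern only; require_extra is a hypothetical helper, and how pyramids implements its own check is not shown in this notebook.

```python
import importlib


def require_extra(module_name, extra):
    """Import module_name, or raise an ImportError that names the pip extra.

    Hypothetical helper illustrating the pattern; not the pyramids API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with: pip install 'pyramids-gis[{extra}]'"
        ) from exc


# A stdlib module is always importable, so this call succeeds.
json_module = require_extra("json", "lazy")

# A deliberately nonexistent module triggers the friendly error.
try:
    require_extra("module_that_does_not_exist_xyz", "netcdf-lazy")
except ImportError as exc:
    message = str(exc)
print(message)
```

Catching ImportError at the call site, as the cells above do, then lets a notebook degrade gracefully on minimal installs.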