Lazy NetCDF — NetCDF#
NetCDFs are typically the heaviest single files in a GIS pipeline — reanalysis, climate projections, ocean models. Pyramids exposes four native lazy entry points covering single-file reads, multi-file stacks, and zero-copy Zarr-backed cube indexing — no xarray in the lazy read path, so installs stay slim and pyramids stays a peer of xarray rather than a consumer:
| Entry point | Use when |
|---|---|
NetCDF.read_array(chunks=…) |
One file, one variable, partial reads |
NetCDF.open_mfdataset(paths, variable) |
Many files, one variable, stacked on time axis |
NetCDF.to_kerchunk(path) |
Emit a JSON index so downstream reads skip HDF5 |
NetCDF.combine_kerchunk(paths, ...) |
Combine per-file manifests into one cube index |
See lazy-compute.md for schedulers and configure
basics.
Installing the extras:
pip install 'pyramids-gis[lazy]' # read_array + open_mfdataset
pip install 'pyramids-gis[netcdf-lazy]' # + kerchunk manifests + h5py
Reading one variable lazily#
NetCDF.read_array preserves the legacy eager path on
chunks=None. Any other value flips to a dask-backed
dask.array.Array whose chunks are materialised on demand through
GDAL's MDArray reader:
from pyramids.netcdf import NetCDF
nc = NetCDF.read_file("era5_t2m_2023.nc")
temp_lazy = nc.read_array("t2m", chunks={"time": 24, "lat": 256, "lon": 256})
temp_lazy.shape # (8760, 721, 1440)
temp_lazy.chunks # chunked tuples per axis
Chunk spec accepts the same forms as Dataset.read_array: None
(eager), "auto", -1 (single chunk), an int, a tuple, or a dict
keyed by axis index or axis name.
Native chunks = best default#
Every variable has a preferred on-disk chunk size exposed via
VariableInfo.block_size. The default chunks="auto" respects it,
so if the file was written with chunksizes=(1, 256, 256), that's
what pyramids picks. Overriding only pays off when your access
pattern is diagonal to the on-disk layout (for example, reading a
time-series for one pixel over a time-chunked file).
unpack=True applies scale/offset lazily#
NetCDF variables often store packed ints with scale_factor /
add_offset CF attributes. unpack=True applies the transformation
in the dask graph — no materialisation:
temp_unpacked = nc.read_array("t2m", unpack=True, chunks={"time": 24})
# temp_unpacked.dtype == float32, still lazy
Locks#
Same mechanism as Dataset: lock=None picks the right
SerializableLock or dask.distributed.Lock automatically.
lock=False opts into lock-free reads when you know each worker has
its own handle.
Many files, one variable — open_mfdataset#
Unlike xarray.open_mfdataset, the pyramids helper is deliberately
narrow: one variable at a time, no "by_coords" inference, no combine
strategies. For the common hydrology / meteorology case — 365 daily
noah_YYYYMMDD.nc files, stack precipitation, reduce on time —
that narrowness is the whole point.
from pyramids.netcdf import NetCDF
stack = NetCDF.open_mfdataset(
"/data/noah_*.nc", # glob string
variable="precipitation",
chunks={"time": 1, "lat": 256, "lon": 256},
parallel=True, # fan out file opens over workers
)
stack.shape # (n_files, H, W)
stack.mean(axis=0).compute() # climatological mean
Three arguments worth knowing:
parallel=Truewraps each file's metadata read indask.delayedso opening 500 files on a distributed cluster fans out across workers instead of blocking sequentially. The probe of the first file stays eager so the stack shape + dtype are known at graph-construction time.chunks=forwards to every per-fileread_arraycall. Pick carefully: a chunk spec that doesn't align with per-file layout forces re-chunking on every read.preprocess=applies a user callable to eachNetCDFsubset before its array is extracted. Useful for unpacking, cropping, or dropping auxiliary variables.
def only_europe(nc):
return nc.crop(bbox=(-10.0, 35.0, 30.0, 65.0))
stack = NetCDF.open_mfdataset(
"/data/era5/*.nc",
variable="t2m",
preprocess=only_europe,
parallel=True,
)
paths accepts a glob string, an explicit path, or a list of paths.
Glob inputs are expanded and sorted alphabetically so the stack
order is deterministic.
Zero-copy cube indexing — kerchunk#
A kerchunk manifest is a JSON document containing byte-range pointers into each source file; no pixel data is moved. Downstream consumers open the resulting archive as a lazy Zarr-backed xarray with zero rewrite — the speedup is the avoided decode cost.
Emit a manifest for a single file:
from pyramids.netcdf import NetCDF
nc = NetCDF.read_file("era5_2023.nc")
manifest = nc.to_kerchunk(
"era5_2023.kerchunk.json",
inline_threshold=500, # embed chunks smaller than 500 bytes
vlen_encode="embed", # embed VLEN strings for compatibility
)
Combine many manifests into one cube index:
from pathlib import Path
srcs = sorted(Path("/data/era5").glob("era5_*.nc"))
manifest = NetCDF.combine_kerchunk(
srcs,
"era5_cube.json",
concat_dims=("time",),
identical_dims=("lat", "lon"),
)
Consume the manifest with fsspec + xarray:
import fsspec
import xarray as xr
mapper = fsspec.get_mapper(
"reference://",
fo="era5_cube.json",
remote_protocol="s3",
remote_options={"anon": True},
)
ds = xr.open_zarr(mapper, consolidated=False) # lazy, zero-copy
Kerchunk is behind [netcdf-lazy]. to_kerchunk and
combine_kerchunk raise a branded ImportError naming the extra
when it's absent.
When to use kerchunk vs open_mfdataset#
| You want… | Reach for |
|---|---|
| One-off parallel read-and-reduce | open_mfdataset |
| A reusable index other tools can open | combine_kerchunk |
| Cloud data read through xarray/fsspec | combine_kerchunk |
| Only pyramids consumers | open_mfdataset |
Handing a pyramids lazy array to xarray#
Pyramids deliberately does not register as an
engine= for xr.open_dataset. It is a peer of xarray, not a
backend beneath it. If your pipeline ends in xarray anyway, use
NetCDF.to_xarray() for an eager conversion, or wrap pyramids' lazy
array in xr.DataArray yourself:
import xarray as xr
from pyramids.netcdf import NetCDF
nc = NetCDF.read_file("era5_2023.nc")
arr = nc.read_array("t2m", chunks={"time": 24, "lat": 256, "lon": 256})
# arr is a dask.array.Array read via pyramids' CachingFileManager.
da = xr.DataArray(arr, dims=("time", "lat", "lon"), name="t2m")
You keep pyramids' CachingFileManager + SerializableLock
semantics (handle pooling, distributed-safe pickling) on the read
side, and xarray's ergonomics on the downstream side — no backend
adapter layer in between.
Install: [netcdf-lazy] pulls kerchunk + h5py; add the
[xarray] extra on top if you want xr.DataArray / .to_xarray() /
.from_xarray().
A worked pipeline — ERA5 on AWS#
from pyramids import configure
from pyramids.netcdf import NetCDF
configure(cloud_defaults=True, aws={"aws_unsigned": True})
nc = NetCDF.read_file(
"https://era5-pds.s3.amazonaws.com/2020/01/data/air_temperature_at_2_metres.nc",
)
# 1. Check metadata cheaply.
nc.variable_names
# 2. Read one month, 6 hours at a time, 256x256 spatial chunks.
temp = nc.read_array(
"air_temperature_at_2_metres",
chunks={"time": 6, "lat": 256, "lon": 256},
)
# 3. Monthly mean on time axis — builds dask graph.
monthly = temp.mean(axis=0)
# 4. Materialise — single parallel compute.
monthly_np = monthly.compute()
The full notebook at
docs/examples/netcdf/dask-lazy-netcdf.ipynb adds plotting, writes
the output to a local Zarr store, and demonstrates the equivalent
kerchunk-based path.
Minimal-install behaviour#
Calls that need dask raise an ImportError naming the [lazy]
extra:
Calls that need kerchunk name [netcdf-lazy]. NetCDF.to_xarray() /
.from_xarray() name [xarray]. The eager
NetCDF.read_array(chunks=None) path is always available and does
not touch dask / xarray / kerchunk.
UGRID unstructured grids#
UgridDataset (FVCOM, SCHISM, ADCIRC, Delft3D) is not yet wired to
pyramids' lazy stack — it currently uses eager reads throughout.
Lazy UGRID is tracked as a follow-on. The
docs/examples/netcdf/ugrid/dask-lazy-ugrid.ipynb notebook shows how
to layer dask on top of the eager UgridDataset manually using
dask.delayed for FVCOM NECOFS OPeNDAP endpoints; that path works
today but isn't yet a first-class pyramids API.