Lazy NetCDF reads + kerchunk + multi-file stacks with pyramids.netcdf
ERA5 climate reanalysis is the classic big-NetCDF dataset: ~9 PB across the full archive (1940 → present). The AWS Registry of Open Data hosts ERA5 in the public era5-pds bucket (us-east-1) as one NetCDF file per month per variable, each roughly 1.5-2 GB. Opening a full decade eagerly would run any laptop out of memory.
This notebook uses ERA5 to walk through the lazy NetCDF surface in pyramids:
What you'll see
- Lazy MDArray reads via NetCDF.read_array(chunks=...) against a cloud NetCDF.
- NetCDF.to_kerchunk(path): emit a sidecar JSON manifest so xarray can treat the file as a Zarr store (no rewrite, just byte-range pointers).
- NetCDF.combine_kerchunk(paths): a single manifest spanning many monthly files, giving xarray a unified time axis across a decade of data.
- NetCDF.open_mfdataset(paths, variable=...): one partitioned dask array across many files, no manifest step.
Requirements
pip install 'pyramids-gis[netcdf-lazy]'
This pulls xarray, netCDF4, h5netcdf (the [xarray] extra; cftime is a core pyramids dependency) plus dask[array], distributed, zarr, fsspec (the [lazy] extra) plus kerchunk.
Cloud + scheduler config
ERA5 files on S3 are best read over HTTP range requests; GDAL's VSI layer does this for us when GDAL_HTTP_MULTIRANGE=YES is set. configure(cloud_defaults=True) sets this plus seven other cloud-friendly options.
import os
os.environ['MPLBACKEND'] = 'Agg'
import dask
dask.config.set(scheduler="processes")
from pyramids import configure
configure(cloud_defaults=True, aws={"aws_unsigned": True})
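To make the idea of "cloud-friendly options" concrete, here is a rough sketch of setting a few GDAL configuration options by hand through environment variables, which GDAL consults for its config. The option names are real GDAL settings, but the selection is an assumption for illustration: this notebook does not list which options configure(cloud_defaults=True) actually applies, and apply_gdal_options is a hypothetical helper.

```python
import os

# Hypothetical selection of GDAL config options often used for cloud reads.
# This is NOT the list pyramids sets; it only illustrates the mechanism.
CLOUD_OPTIONS = {
    "GDAL_HTTP_MULTIRANGE": "YES",                # multi-range reads, as named in the text
    "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",  # merge adjacent byte ranges
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",  # skip directory listing on open
    "AWS_NO_SIGN_REQUEST": "YES",                 # anonymous access to public buckets
}


def apply_gdal_options(options):
    """Export each option as an environment variable and return the sorted names."""
    for key, value in options.items():
        os.environ[key] = value
    return sorted(options)


applied = apply_gdal_options(CLOUD_OPTIONS)
print(applied)
```

Environment variables must be set before GDAL opens the first remote file for them to take effect on that read.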
1. Open one ERA5 file lazily
ERA5 S3 layout: s3://era5-pds/YYYY/MM/data/VARIABLE.nc. We pick one month of air_temperature_at_2_metres (the most-read ERA5 variable). A single file is ~1.5 GB and contains 720 hourly time steps × 721 × 1440 on a global 0.25° grid. The lazy read fetches only the chunks we actually touch.
from pyramids.netcdf import NetCDF
# One month of 2-metre air temperature. AWS ERA5 layout:
# s3://era5-pds/YYYY/MM/data/VARIABLE.nc
# Pyramids rewrites to /vsis3/ automatically via _to_vsi.
ERA5_URL = (
    "https://era5-pds.s3.amazonaws.com/2024/06/data/"
    "air_temperature_at_2_metres.nc"
)
nc = NetCDF.read_file(ERA5_URL, open_as_multi_dimensional=True)
tair = nc.get_variable("air_temperature_at_2_metres")
tair.numpy_dtype, tair._block_size
# Lazy read: only the block shape is used to build the dask graph.
# Pass explicit per-dimension chunk sizes (as here) for time-heavy
# workflows, or 'auto' for dask's own heuristic.
lazy = tair.read_array(chunks={"time": 24, "lat": 721, "lon": 1440})
type(lazy).__name__, lazy.shape, lazy.chunksize
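A quick back-of-the-envelope check shows why the chunking matters. The sketch below computes the uncompressed size of one month and of one 24-hour chunk from the dimensions quoted above. The 4-byte value size is an assumption (float32 storage); the actual dtype is whatever tair.numpy_dtype reports.

```python
# Dimensions quoted in this notebook for one month (June 2024) of ERA5 T2m.
TIME_STEPS, N_LAT, N_LON = 720, 721, 1440
BYTES_PER_VALUE = 4  # assumption: float32 storage


def nbytes(shape, itemsize=BYTES_PER_VALUE):
    """Uncompressed size in bytes of an array with the given shape."""
    total = itemsize
    for n in shape:
        total *= n
    return total


month_bytes = nbytes((TIME_STEPS, N_LAT, N_LON))
chunk_bytes = nbytes((24, N_LAT, N_LON))  # one day of hourly fields
n_chunks = TIME_STEPS // 24               # number of chunks along time

print(f"one month : {month_bytes / 2**30:.2f} GiB")
print(f"one chunk : {chunk_bytes / 2**20:.1f} MiB")
print(f"chunks    : {n_chunks}")
print(f"120 months: {120 * month_bytes / 2**30:.0f} GiB (approximate)")
```

Under that assumption a single chunk is on the order of 100 MB, while a decade of monthly files decompresses to hundreds of gigabytes, which is the case the lazy path is meant to handle.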
# Reduce over time. A full-month mean has to read every chunk, but dask
# processes them block by block instead of loading the whole file at once.
monthly_mean = lazy.mean(axis=0).compute()
monthly_mean.shape, float(monthly_mean.mean()) - 273.15  # Kelvin → Celsius
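The reason a reduction like this fits in memory is that each block can be reduced as soon as it is read. The following is a minimal NumPy sketch of that idea on synthetic data; it is not dask's scheduler, and read_block is a hypothetical loader standing in for a remote chunk read.

```python
import numpy as np


def chunked_time_mean(read_block, n_time, block_len):
    """Mean over the time axis, reading `block_len` steps at a time.

    `read_block(start, stop)` is a hypothetical loader returning an array
    of shape (stop - start, lat, lon) for the requested time slice.
    """
    total = None
    for start in range(0, n_time, block_len):
        stop = min(start + block_len, n_time)
        block = read_block(start, stop).astype(np.float64)
        partial = block.sum(axis=0)  # reduce the block immediately
        total = partial if total is None else total + partial
    return total / n_time


# Small synthetic stand-in for a (time, lat, lon) variable in Kelvin.
rng = np.random.default_rng(0)
data = rng.normal(loc=288.0, scale=5.0, size=(72, 5, 8)).astype(np.float32)

result = chunked_time_mean(lambda a, b: data[a:b], n_time=72, block_len=24)
print(result.shape)
```

Only one block plus a running (lat, lon) sum is held at any moment, so peak memory scales with the chunk size rather than with the length of the time axis.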
2. Kerchunk: turn one NetCDF into a Zarr-indexable manifest
Kerchunk scans an HDF5/NetCDF file and records every chunk's byte offset and length in a JSON sidecar, which xarray can then open as if it were a Zarr store. No data is copied. Downstream reads use the manifest to fetch only the byte ranges that matter, same as the lazy path above, except that the manifest can be opened with bare xarray through the kerchunk engine, without needing pyramids at all.
import tempfile
from pathlib import Path
manifest_path = Path(tempfile.mkdtemp()) / "era5_202406_t2m.kerchunk.json"
manifest = nc.to_kerchunk(str(manifest_path))
manifest_path.stat().st_size, len(manifest.get("refs", {}))
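To show what the "refs" mapping counted above holds, here is a hand-written toy manifest in the kerchunk version 1 reference layout. The URL, variable name, offsets and lengths are invented for illustration, and the .zarray metadata is abbreviated; a real manifest produced by to_kerchunk will carry more fields.

```python
import json

# Hand-written toy manifest in the kerchunk "version 1" reference layout.
# URL, offsets and lengths below are invented for illustration.
URL = "https://example.com/data/t2m.nc"
toy_manifest = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),  # inline metadata (a string)
        "t2m/.zarray": json.dumps({"shape": [48, 4, 8], "chunks": [24, 4, 8]}),
        "t2m/0.0.0": [URL, 4096, 3072],             # [url, offset, length]
        "t2m/1.0.0": [URL, 7168, 3072],
    },
}


def chunk_refs(manifest):
    """Keep only entries that point at byte ranges, not inline metadata."""
    return {k: v for k, v in manifest["refs"].items() if isinstance(v, list)}


def range_header(offset, length):
    """HTTP Range header value covering `length` bytes starting at `offset`."""
    return f"bytes={offset}-{offset + length - 1}"


for key, (url, offset, length) in chunk_refs(toy_manifest).items():
    print(key, url, range_header(offset, length))
```

Each chunk key resolves to one ranged GET against the original file, which is why no data has to be rewritten to get Zarr-style access.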
# Round-trip via bare xarray + kerchunk engine — pyramids is out of the picture.
import xarray as xr
ds = xr.open_dataset(
    str(manifest_path),
    engine="kerchunk",
    chunks={},
)
ds.data_vars, dict(ds.sizes)
3. combine_kerchunk: a decade in one manifest
This is where the approach earns its keep. Ten years × 12 months = 120 source files. combine_kerchunk records every byte offset across every file into one JSON (typically ~10-50 MB). Downstream consumers see a unified dataset with a single hourly time axis of roughly 87,600 steps (10 × 8,760, plus 24 for each leap day), without copying anything.
# Ten months for demo; scale the range to 10 years when you're ready to wait.
srcs = [
    f"https://era5-pds.s3.amazonaws.com/2024/{m:02d}/data/"
    "air_temperature_at_2_metres.nc"
    for m in range(1, 11)
]
combined_path = Path(tempfile.mkdtemp()) / "era5_2024_t2m_combined.kerchunk.json"
NetCDF.combine_kerchunk(
    srcs,
    str(combined_path),
    concat_dims=("time",),
    identical_dims=("lat", "lon"),
)
combined_path.stat().st_size
# Open the union as one lazy dataset.
union = xr.open_dataset(str(combined_path), engine="kerchunk", chunks={})
union.sizes, union.air_temperature_at_2_metres.data.chunksize
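Scaling the demo to a full decade mostly means generating a longer source list. The sketch below builds the 120 monthly URLs following the bucket layout described earlier and counts the hourly steps the combined time axis would carry. The 2015-2024 range is a hypothetical choice, both helper functions are illustrative, and nothing here checks that every month actually exists in the bucket.

```python
import calendar

BASE = "https://era5-pds.s3.amazonaws.com"
VARIABLE = "air_temperature_at_2_metres"


def monthly_urls(first_year, last_year):
    """One URL per month, following the YYYY/MM/data/VARIABLE.nc layout."""
    return [
        f"{BASE}/{year}/{month:02d}/data/{VARIABLE}.nc"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


def hourly_steps(first_year, last_year):
    """Number of hourly time steps in the inclusive year range."""
    days = sum(
        366 if calendar.isleap(year) else 365
        for year in range(first_year, last_year + 1)
    )
    return days * 24


# Hypothetical decade; availability of each month is not verified here.
decade_urls = monthly_urls(2015, 2024)
print(len(decade_urls), decade_urls[0])
print(hourly_steps(2015, 2024))
```

The resulting list can be passed in place of srcs; the step count gives a quick sanity check against the size of the time dimension once the combined manifest is opened.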
4. open_mfdataset: alternative multi-file path without a manifest
If you don't want the manifest overhead, NetCDF.open_mfdataset(paths, variable=...) returns one big dask array of shape (n_files, *var_shape). Cheaper to get started; harder to share across tools. Use kerchunk if a manifest buys you anything (cross-tool, archival, long-lived); use open_mfdataset for ad-hoc sessions.
stack = NetCDF.open_mfdataset(
    srcs,
    variable="air_temperature_at_2_metres",
    chunks={"time": 24},
    parallel=True,
)
type(stack).__name__, stack.shape
# Mean across the 10 stacked months of T2m (file axis and time axis).
ten_month_mean = stack.mean(axis=(0, 1)).compute()
ten_month_mean.shape, float(ten_month_mean.mean()) - 273.15
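The stacked layout (n_files, *var_shape) differs from the single concatenated time axis that the kerchunk manifest exposes, so reductions need the extra file axis. A small NumPy sketch on synthetic data, assuming every file has the same number of time steps, shows that averaging over axes (0, 1) of the stack matches averaging over the joined time axis.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three synthetic "files", each (time, lat, lon) with the same time length.
files = [rng.normal(288.0, 5.0, size=(24, 3, 4)) for _ in range(3)]

stacked = np.stack(files, axis=0)        # (n_files, time, lat, lon)
joined = np.concatenate(files, axis=0)   # (n_files * time, lat, lon)

mean_stacked = stacked.mean(axis=(0, 1))  # collapse file and time axes
mean_joined = joined.mean(axis=0)         # collapse the single time axis

print(stacked.shape, joined.shape)
print(np.allclose(mean_stacked, mean_joined))
```

The equivalence relies on equal-length files. Calendar months differ in length, and how open_mfdataset treats that case is not covered in this notebook, so it is worth checking stack.shape before trusting a reduction over the file axis.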
Further reading
- ECMWF ERA5 Reanalysis on AWS — bucket layout + citation
- kerchunk documentation — what the manifest format does
- planning/dask/integration-plan.md — NetCDF dask tasks (DASK-11 to DASK-14)