Lazy NetCDF reads + kerchunk + multi-file stacks with `pyramids.netcdf`

ERA5 climate reanalysis is the classic big-NetCDF dataset: ~9 PB across the full archive (1940 → present). The AWS Registry of Open Data hosts ERA5 in the public era5-pds bucket (us-east-1) as one NetCDF file per variable per month, each roughly 1.5-2 GB. Eagerly opening a full decade of a single variable (120 such files) would exhaust the memory of any laptop.

This notebook uses ERA5 to walk through the lazy NetCDF surface in pyramids:

What you'll see

Lazy MDArray reads via NetCDF.read_array(chunks=...) against a cloud NetCDF.
NetCDF.to_kerchunk(path) — emit a sidecar JSON manifest so xarray can treat the file as a Zarr store (no rewrite, just byte-range pointers).
NetCDF.combine_kerchunk(paths) — a single manifest spanning many monthly files, giving xarray a unified time axis across a decade of data.
NetCDF.open_mfdataset(paths, variable=...) — one partitioned dask array across many files, no manifest step.

Requirements

pip install 'pyramids-gis[lazy]' xarray

[lazy] pulls dask[array], distributed, zarr, fsspec, kerchunk, h5py; xarray (a peer dep, not a pyramids extra) is for the bare xr.open_dataset round-trips below. cftime is a core pyramids dependency.

Cloud + scheduler config

ERA5 files on S3 are best read over HTTP range requests; GDAL's VSI layer does this for us when GDAL_HTTP_MULTIRANGE=YES is set. configure(cloud_defaults=True) sets this plus seven other cloud-friendly options.
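
For orientation, the kind of settings involved can be sketched with plain environment variables, which GDAL also reads as configuration options. This is only an illustration: `GDAL_HTTP_MULTIRANGE` comes from the text above, while the other three entries are an assumed selection of common GDAL cloud options, not the confirmed list that `configure(cloud_defaults=True)` applies. The helper `apply_cloud_options` is hypothetical.

```python
import os

# Illustrative selection of GDAL cloud options. Only the first entry is named
# in the text; the rest are assumptions, not the confirmed pyramids defaults.
CLOUD_OPTIONS = {
    "GDAL_HTTP_MULTIRANGE": "YES",
    "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",  # assumption
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",  # assumption
    "AWS_NO_SIGN_REQUEST": "YES",  # assumption: anonymous access to public buckets
}


def apply_cloud_options(options, environ=os.environ):
    """Set each option unless a value for it is already present."""
    applied = {}
    for key, value in options.items():
        if key not in environ:
            environ[key] = value
            applied[key] = value
    return applied


# A throwaway dict stands in for os.environ so the demo has no side effects.
applied = apply_cloud_options(CLOUD_OPTIONS, environ={})
print(sorted(applied))
```

Skipping keys that are already set keeps any value the user exported before starting the session.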

%matplotlib inline

import dask

dask.config.set(scheduler="processes")

from pyramids import configure

configure(cloud_defaults=True, aws={"aws_unsigned": True})

1. Open one ERA5 file lazily

ERA5 S3 layout: s3://era5-pds/YYYY/MM/data/VARIABLE.nc. We pick one month of air_temperature_at_2_metres (the most-read ERA5 variable). The June 2024 file is ~1.5 GB and holds 720 hourly time steps (30 days × 24) on a global 0.25° grid of 721 × 1440 cells. The lazy read fetches only the chunks we actually touch.
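
The arithmetic behind those numbers can be checked with a small sketch. It assumes the variable decodes to 4-byte floats (float32), which the notebook does not state; the helper names are hypothetical.

```python
import calendar


def era5_month_shape(year, month, n_lat=721, n_lon=1440):
    """Shape (time, lat, lon) of one hourly month on the 0.25° grid."""
    days = calendar.monthrange(year, month)[1]
    return (days * 24, n_lat, n_lon)


def uncompressed_bytes(shape, itemsize=4):
    """Decoded size in bytes; itemsize=4 assumes float32 values."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize


shape = era5_month_shape(2024, 6)
size = uncompressed_bytes(shape)
print(shape, f"{size / 2**30:.2f} GiB")
```

Under that assumption one decoded month is close to 3 GB, larger than the file on disk, so a decade held in memory at once is far beyond a laptop.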

from pyramids.netcdf import NetCDF

# One month of 2-metre air temperature. AWS ERA5 layout:
#   s3://era5-pds/YYYY/MM/data/VARIABLE.nc
# Pyramids rewrites to /vsis3/ automatically via _to_vsi.
ERA5_URL = (
    "https://era5-pds.s3.amazonaws.com/2024/06/data/"
    "air_temperature_at_2_metres.nc"
)

nc = NetCDF.read_file(ERA5_URL, open_as_multi_dimensional=True)
tair = nc.get_variable("air_temperature_at_2_metres")
tair.numpy_dtype, tair._block_size

Plot the opened field

nc is a concrete NetCDF handle, so nc.plot(...) renders the 2-metre air temperature field for the first time step directly from the cloud file — GDAL fetches only the byte ranges the map needs.

nc.plot(variable="air_temperature_at_2_metres", title="ERA5 2-metre air temperature (2024-06)")

# Lazy read: only the block shape is used to build the dask graph.
# Pass an explicit per-dimension chunk mapping (as here) for time-heavy
# workflows, or 'auto' for dask's own heuristic.
lazy = tair.read_array(chunks={"time": 24, "lat": 721, "lon": 1440})
type(lazy).__name__, lazy.shape, lazy.chunksize
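
To see what that chunk spec implies, the block grid can be worked out by hand. This is a sketch that assumes float32 values and a (time, lat, lon) dimension order; `chunk_grid` is a hypothetical helper, not a pyramids function.

```python
import math


def chunk_grid(shape, chunks, itemsize=4):
    """Return (blocks per axis, bytes in one full block)."""
    blocks = tuple(math.ceil(n / c) for n, c in zip(shape, chunks))
    block_bytes = math.prod(chunks) * itemsize
    return blocks, block_bytes


blocks, block_bytes = chunk_grid((720, 721, 1440), (24, 721, 1440))
print(blocks, f"{block_bytes / 2**20:.1f} MiB per block")
```

With 24 time steps per chunk the month splits into 30 day-long blocks of about 95 MiB each, so a task only ever holds one day of the global field.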

# Reduce over time. The mean touches every chunk, but dask processes them
# in 24-step blocks instead of loading the whole month at once.
monthly_mean = lazy.mean(axis=0).compute()
monthly_mean.shape, float(monthly_mean.mean()) - 273.15  # Kelvin → Celsius

2. Kerchunk: turn one NetCDF into a Zarr-indexable manifest

Kerchunk scans an HDF5/NetCDF file and records every chunk's byte offset and length in a JSON sidecar; xarray can then open that manifest as if it were a Zarr store. No data is copied. Downstream reads use the manifest to fetch only the byte ranges that matter, same as the lazy path above, except that the manifest opens with bare xarray (through the kerchunk engine), with no need for pyramids at all.
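
A toy manifest makes the idea concrete. The sketch below follows the kerchunk "version 1" reference layout, but the variable name `t2m`, the URL, and every offset and length are made up for illustration; a real manifest from `to_kerchunk` will differ.

```python
import json

# Toy manifest: Zarr metadata is stored as JSON strings, and each chunk key
# maps to [url, byte offset, byte length]. All values here are invented.
manifest = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "t2m/.zarray": json.dumps({
            "shape": [720, 721, 1440], "chunks": [24, 721, 1440],
            "dtype": "<f4", "compressor": None, "fill_value": None,
            "filters": None, "order": "C", "zarr_format": 2,
        }),
        "t2m/0.0.0": ["https://example.com/file.nc", 4096, 99671040],
        "t2m/1.0.0": ["https://example.com/file.nc", 99675136, 99671040],
    },
}


def range_header(refs, key):
    """Translate a chunk key into the URL and HTTP Range header to fetch it."""
    url, offset, length = refs[key]
    return url, {"Range": f"bytes={offset}-{offset + length - 1}"}


url, headers = range_header(manifest["refs"], "t2m/1.0.0")
print(url, headers)
```

A reader that wants the second day of data only needs this one range request; the rest of the file is never downloaded.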

import tempfile
from pathlib import Path

manifest_path = Path(tempfile.mkdtemp()) / "era5_202406_t2m.kerchunk.json"
manifest = nc.to_kerchunk(manifest_path)
manifest_path.stat().st_size, len(manifest.get("refs", {}))

# Round-trip via bare xarray + kerchunk engine — pyramids is out of the picture.
import xarray as xr

ds = xr.open_dataset(
    manifest_path,
    engine="kerchunk",
    chunks={},
)
ds.data_vars, dict(ds.sizes)

3. `combine_kerchunk`: a decade in one manifest

This is where the approach earns its keep. Ten years × 12 months = 120 source files. combine_kerchunk records every byte offset across every file into one JSON (typically ~10-50 MB). Downstream consumers see a unified dataset with a single hourly time axis of roughly 87,600 steps (10 × 365 × 24, plus 24 for each leap day), without copying anything.
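
The file and step counts for a concrete decade can be verified with the standard library. The years 2015-2024 are a hypothetical choice for the example, and `hourly_steps` is an illustrative helper.

```python
import calendar


def hourly_steps(first_year, last_year):
    """Count monthly files and hourly time steps for an inclusive year range."""
    n_files = 0
    n_steps = 0
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            n_files += 1
            n_steps += calendar.monthrange(year, month)[1] * 24
    return n_files, n_steps


# 2015-2024 is a hypothetical decade picked for the example.
n_files, n_steps = hourly_steps(2015, 2024)
print(n_files, n_steps)
```

The exact length of the time axis depends on how many leap years fall inside the chosen range.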

# Ten months for demo; scale the range to 10 years when you're ready to wait.
srcs = [
    f"https://era5-pds.s3.amazonaws.com/2024/{m:02d}/data/"
    f"air_temperature_at_2_metres.nc"
    for m in range(1, 11)
]

combined_path = Path(tempfile.mkdtemp()) / "era5_2024_t2m_combined.kerchunk.json"
NetCDF.combine_kerchunk(
    srcs,
    combined_path,
    concat_dims=("time",),
    identical_dims=("lat", "lon"),
)
combined_path.stat().st_size
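
Conceptually, concatenating along time means renumbering each file's chunk keys so they follow the files before it. The sketch below is a toy version of that idea, not the pyramids or kerchunk implementation: it assumes every file's time length is a whole number of chunks and ignores the metadata and coordinate rewriting that a real combine step (such as kerchunk's `MultiZarrToZarr`) also performs. The variable name `t2m` and all references are invented.

```python
def combine_along_time(per_file_refs, var, chunks_per_file):
    """Merge per-file chunk references into one table by shifting the
    leading (time) chunk index of each file past the files before it."""
    combined = {}
    offset = 0
    for refs, n_chunks in zip(per_file_refs, chunks_per_file):
        for key, ref in refs.items():
            name, _, index = key.partition("/")
            if name != var or index.startswith("."):
                continue  # skip other variables and metadata keys
            t, *rest = index.split(".")
            new_index = ".".join([str(int(t) + offset), *rest])
            combined[f"{var}/{new_index}"] = ref
        offset += n_chunks
    return combined


jan = {"t2m/0.0.0": ["jan.nc", 100, 50], "t2m/1.0.0": ["jan.nc", 150, 50]}
feb = {"t2m/.zarray": "{}", "t2m/0.0.0": ["feb.nc", 100, 50]}
merged = combine_along_time([jan, feb], "t2m", chunks_per_file=[2, 1])
print(sorted(merged))
```

Each merged entry still points at its original file and byte range, which is why the combined manifest stays small compared with the data it indexes.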

# Open the union as one lazy dataset.
union = xr.open_dataset(combined_path, engine="kerchunk", chunks={})
union.sizes, union.air_temperature_at_2_metres.data.chunksize

4. `open_mfdataset`: alternative multi-file path without a manifest

If you don't want the manifest overhead, NetCDF.open_mfdataset(paths, variable=...) returns one big dask array of shape (n_files, *var_shape). Cheaper to get started; harder to share across tools. Use kerchunk if a manifest buys you anything (cross-tool, archival, long-lived); use open_mfdataset for ad-hoc sessions.
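
The `(n_files, *var_shape)` layout can be pictured with a small NumPy stand-in. This is a sketch with synthetic arrays, not the pyramids implementation, and it assumes all files share one shape; calendar months differ in length, and how `open_mfdataset` reconciles that is not covered here.

```python
import numpy as np

# Three small stand-in "files", each shaped (time, lat, lon).
files = [np.full((4, 2, 3), 280.0 + i, dtype="float32") for i in range(3)]

stack = np.stack(files)             # shape (n_files, time, lat, lon)
mean_map = stack.mean(axis=(0, 1))  # reduce over files and time
print(stack.shape, mean_map.shape)
```

Reducing over axes 0 and 1 together collapses both the file axis and the per-file time axis, leaving one (lat, lon) map.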

stack = NetCDF.open_mfdataset(
    srcs,
    variable="air_temperature_at_2_metres",
    chunks={"time": 24},
    parallel=True,
)
type(stack).__name__, stack.shape

# Mean over the ten months of T2m (January to October 2024), not a full year.
mean_t2m = stack.mean(axis=(0, 1)).compute()
mean_t2m.shape, float(mean_t2m.mean()) - 273.15
