`earthlens.aggregate`: temporal aggregation reference

Overview

`earthlens.aggregate` turns a CDS-shaped NetCDF (one with a time dimension) into per-window GeoTIFFs (daily mean, monthly sum, weekly mean, seasonal mean, ...). Its runtime dependencies are pyramids, numpy, and pandas; xarray is not required.

The feature is reachable two ways:

Standalone. Run `from earthlens import aggregate_netcdf, AggregationConfig` and call `aggregate_netcdf` on any pyramids-readable NetCDF. No ECMWF instance is needed.
Bundled with download. Pass `aggregate=...` to `ECMWF.download`; every per-variable NetCDF that the backend retrieves is then fed through the aggregator immediately after `_api()` returns.

The bundled path is also available through the user-facing facade: `EarthLens(...).download(aggregate=...)`.

Public API

`AggregationConfig`

A frozen pydantic model carrying the windowing frequency, reduction operator, and output location. The model is frozen and sets `extra="forbid"`, so a typo in a field name (e.g. `freqency=`) fails loudly at construction time instead of silently falling back to the default.

Field	Type	Default	Purpose
`freq`	`str` (required)	—	Pandas offset alias defining the window.
`op`	`Literal["mean","sum","min","max","std","auto"]`	`"auto"`	Reduction within each window. `"auto"` reads `Variable.is_flux`.
`out_dir`	`Path | None`	`None`	Where per-window GeoTIFFs are written. `None` skips the write step.
`cell_size`	`float`	`0.125`	Pixel size in degrees (informational; the geotransform is read off the NetCDF).
`level`	`int | float | None`	`None`	Pin a pressure level for 4-D inputs.
`skipna`	`bool`	`True`	NaN-aware reduction (`np.nanmean` etc.).
`min_count`	`int \\| None`	`None`	Minimum non-NaN samples for a window to produce a non-NaN value.

`aggregate_netcdf(nc_path, var_info, config) -> list[tuple[...]]`

Slices a CDS-shaped NetCDF into per-window aggregated outputs. Returns a list of `(window_label, array, geotiff_path)` tuples, one per non-empty window. `geotiff_path` is `None` when `config.out_dir` is `None`.

Arguments:

`nc_path`: path to the NetCDF on disk.
`var_info`: an `earthlens.ecmwf.Variable` row. It resolves `op="auto"` via `is_flux`, drives the output filename via `cds_variable`, and selects the variable inside the NetCDF via `nc_variable`.
`config`: an `AggregationConfig` describing the window, reduction, and output location.

`ECMWF.download(aggregate=...)`

Adds an `aggregate: AggregationConfig | None` keyword-only argument. When supplied, every retrieved NetCDF is fed through `aggregate_netcdf` immediately after `_api()` returns. If `aggregate.out_dir` is `None`, it defaults to `<self.root_dir>/aggregated/`, unlike the standalone call, where `None` skips the write step. Aggregation failures are reported alongside retrieve failures in the per-variable failure summary, so a single bad variable does not abort the rest of the loop.
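The failure isolation and the `out_dir` fallback described above can be pictured with a small stand-alone sketch. It is not the earthlens implementation: `resolve_out_dir`, `run_variables`, `fake_retrieve`, and the variable names are hypothetical, chosen only to illustrate the pattern.

```python
from pathlib import Path


def resolve_out_dir(out_dir, root_dir):
    """Fall back to <root_dir>/aggregated when no out_dir was given."""
    return Path(out_dir) if out_dir is not None else Path(root_dir) / "aggregated"


def run_variables(names, retrieve, aggregate):
    """Retrieve then aggregate each variable, collecting failures per name.

    `retrieve` and `aggregate` are hypothetical callables standing in for
    the backend request and the aggregation step.
    """
    done, failures = [], {}
    for name in names:
        try:
            nc_path = retrieve(name)
            aggregate(nc_path)
        except Exception as exc:  # record the failure and keep looping
            failures[name] = f"{type(exc).__name__}: {exc}"
        else:
            done.append(name)
    return done, failures


def fake_retrieve(name):
    """Pretend backend: one made-up variable always fails."""
    if name == "bad-variable":
        raise RuntimeError("request rejected")
    return f"{name}.nc"


done, failures = run_variables(
    ["2m-temperature", "bad-variable", "total-precipitation"],
    fake_retrieve,
    lambda nc_path: None,  # no-op aggregation for the demo
)
print(done)      # the two good variables still complete
print(failures)  # the bad one is reported instead of raised
```

Collecting errors in a dictionary keyed by variable keeps the loop going and leaves one place to report everything at the end.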

Supported reduction operators

`op`	Reducer (skipna=True)	Reducer (skipna=False)
`"mean"`	`np.nanmean`	`np.mean`
`"sum"`	`np.nansum`	`np.sum`
`"min"`	`np.nanmin`	`np.min`
`"max"`	`np.nanmax`	`np.max`
`"std"`	`np.nanstd`	`np.std`
`"auto"`	resolves to `"mean"` (state) or `"sum"` (flux)	same
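As a rough illustration of how the reducer table, `skipna`, and `min_count` could fit together, the sketch below reduces a small `(time, y, x)` stack with plain numpy. The `reduce_window` helper and its `min_count` masking are assumptions made for this example, not the library's code.

```python
import numpy as np

# (nan-aware, plain) reducer pairs, mirroring the table above.
REDUCERS = {
    "mean": (np.nanmean, np.mean),
    "sum": (np.nansum, np.sum),
    "min": (np.nanmin, np.min),
    "max": (np.nanmax, np.max),
    "std": (np.nanstd, np.std),
}


def reduce_window(stack, op="mean", skipna=True, min_count=None):
    """Reduce a (time, y, x) stack along the time axis."""
    nan_aware, plain = REDUCERS[op]
    out = (nan_aware if skipna else plain)(stack, axis=0)
    if min_count is not None:
        # Assumed semantics: blank out pixels with too few valid samples.
        valid = np.sum(~np.isnan(stack), axis=0)
        out = np.where(valid >= min_count, out, np.nan)
    return out


# Three time steps over a 1 x 2 grid; NaN marks a missing sample.
stack = np.array([[[1.0, np.nan]], [[3.0, np.nan]], [[np.nan, 4.0]]])

nan_mean = reduce_window(stack, "mean")                   # NaNs ignored
strict_mean = reduce_window(stack, "mean", skipna=False)  # NaNs propagate
gated_mean = reduce_window(stack, "mean", min_count=2)    # needs 2 valid samples

print(nan_mean)
print(strict_mean)
print(gated_mean)
```

With `min_count=2`, the second pixel has a single valid sample and is therefore masked, while the first pixel keeps its NaN-aware mean.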

Supported `freq` values

Anything accepted by pandas.Grouper(freq=...) works. Common choices:

Alias	Window
`"1D"`	one calendar day
`"7D"`	consecutive seven-day windows (non-overlapping, not rolling)
`"1MS"`	one calendar month, anchored at month-start
`"QS-DEC"`	climatological seasons (DJF, MAM, JJA, SON)
`"YS"`	calendar year (`"AS"` is the older spelling, deprecated in recent pandas releases)

See the pandas offset aliases reference for the full grammar (e.g. "3h", "30min", "6MS", "YS-OCT", ...). Recent pandas releases deprecate the older uppercase spellings such as "3H" and "AS-OCT".
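To see how a given alias partitions a time axis, the following sketch counts the samples that land in each window for a synthetic 6-hourly series. It uses only pandas; `window_sizes` is a helper invented for this example, and the timestamps are made up rather than read from a NetCDF.

```python
import pandas as pd

# Synthetic 6-hourly timestamps covering 2022-01-01 .. 2022-03-31 (90 days).
times = pd.date_range("2022-01-01", periods=360, freq=pd.Timedelta(hours=6))
samples = pd.Series(1, index=times)


def window_sizes(freq):
    """Count how many samples fall into each window for a given alias."""
    return samples.groupby(pd.Grouper(freq=freq)).size()


monthly = window_sizes("1MS")
daily = window_sizes("1D")
weekly = window_sizes("7D")

# Window labels are the window start timestamps.
for label, count in monthly.items():
    print(label.date(), count)
```

Each monthly window holds four samples per day of that month, and each `"7D"` window holds a fixed block of 28 samples counted from the first day of data.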

`op="auto"` semantics: flux vs state

op="auto" is a sentinel that defers the choice of reducer to the catalog. The resolver is _resolve_op in earthlens.aggregate:

def _resolve_op(op, var_info):
    if op != "auto":
        return op
    return "sum" if var_info.is_flux else "mean"

Two-line decision:

An explicit op ("mean", "sum", "min", "max", "std") is returned unchanged. User choice always wins.
op="auto" reads var_info.is_flux:
True → "sum"
False → "mean"

Variable.is_flux is itself a thin property over the catalog row's types field (return self.types == "flux"). Set types: flux on the YAML row for accumulation-style variables; leave it unset (or set types: state) for instantaneous samples.
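The routing can be exercised without a catalog by using a stand-in row. In the sketch below, `VariableSketch` is a hypothetical dataclass that only mimics the `types` field and the `is_flux` property described above, and `resolve_op` repeats the resolver logic so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableSketch:
    """Hypothetical stand-in for a catalog row; only `types` matters here."""

    cds_variable: str
    types: str = "state"

    @property
    def is_flux(self):
        return self.types == "flux"


def resolve_op(op, var_info):
    """Same routing as the resolver shown above."""
    if op != "auto":
        return op  # an explicit choice always wins
    return "sum" if var_info.is_flux else "mean"


temperature = VariableSketch("2m_temperature")
evaporation = VariableSketch("evaporation", types="flux")

print(resolve_op("auto", temperature))  # state variable
print(resolve_op("auto", evaporation))  # flux variable
print(resolve_op("max", evaporation))   # explicit override
```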

Why state → mean, flux → sum

CDS daily ERA5 NetCDFs carry four 6-hourly slots per day at 00:00, 06:00, 12:00, 18:00.

State variables (temperature, pressure, humidity, wind components, geopotential) — each slot is the instantaneous value at that timestamp. The window mean is the natural daily/monthly summary.
Flux variables (precipitation, evaporation, runoff, radiation accumulations) — each slot is the accumulation since the previous post-processing step (6 hours of evaporation, in the legacy daily case). Summing the slots inside a window gives the actual window total; taking the mean does not.

Worked example: daily evaporation

Imagine the four sample values for one pixel on 2009-01-01 (in m of water equivalent):

Slot	Value (m)	Physical meaning
00:00	0.001	water that evaporated 18:00 (prev day) → 00:00
06:00	0.002	water that evaporated 00:00 → 06:00
12:00	0.005	water that evaporated 06:00 → 12:00
18:00	0.004	water that evaporated 12:00 → 18:00

The physically correct daily total is the sum of the four 6-hour accumulations:

daily total = 0.001 + 0.002 + 0.005 + 0.004 = 0.012 m

op="auto" (resolves to "sum" for evaporation) writes 0.012 m to the GeoTIFF — the actual daily total water that evaporated.

A plain `op="mean"` would write (0.001 + 0.002 + 0.005 + 0.004) / 4 = 0.003 m, the average 6-hour accumulation. For state variables that average is exactly what you want, because the mean of instantaneous samples is the natural summary. For fluxes at daily resolution it is 4× too small: you would have to multiply by the slot count to recover the daily total.

At monthly resolution the slot count is 4 × days_in_month (124 for a complete 31-day month), so using the mean would be too small by that factor.
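The arithmetic of the worked example is easy to check in plain Python. The values are the four sample slots from the table above; nothing here touches earthlens itself.

```python
import math

# Four 6-hourly evaporation accumulations for one pixel, in metres.
slots = [0.001, 0.002, 0.005, 0.004]

daily_sum = sum(slots)               # what a "sum" reduction writes
daily_mean = daily_sum / len(slots)  # what a "mean" reduction writes

print(f"sum  = {daily_sum:.3f} m")
print(f"mean = {daily_mean:.3f} m")
print(f"sum / mean = {daily_sum / daily_mean:.0f}")

# A complete 31-day month at four slots per day.
slots_per_month = 4 * 31
print(f"monthly sum / mean factor = {slots_per_month}")
```

Floating-point sums are compared with `math.isclose` rather than `==` when checked programmatically, since `0.001 + 0.002 + 0.005 + 0.004` is not guaranteed to be exactly `0.012` in binary arithmetic.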

Migration note (vs. the pre-refactor `post_download`)

The legacy `post_download` computed mean × days_later for fluxes. That scaling under-counted: at daily resolution it gave (s1 + s2 + s3 + s4) / 4 × 1, which is a quarter of the real daily total. The new `op="auto"` produces the correct total, so downstream code calibrated against the old (buggy) output will now see flux GeoTIFF values 4× larger at daily resolution.

For state variables (e.g. 2m-temperature), the old mean × days_later = mean × 1 = mean matches the new op="auto" exactly — no change.
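For anyone comparing old and new rasters, the daily-resolution difference can be reproduced numerically. The two helpers below are hypothetical reconstructions written from the description in this note, not the legacy code itself.

```python
import math


def legacy_flux_value(slots, days_later):
    """Legacy scaling as described above: window mean times `days_later`."""
    return sum(slots) / len(slots) * days_later


def auto_flux_value(slots):
    """New behaviour for flux variables: plain sum over the window."""
    return sum(slots)


one_day = [0.001, 0.002, 0.005, 0.004]  # four 6-hourly accumulations (m)

old = legacy_flux_value(one_day, days_later=1)
new = auto_flux_value(one_day)
ratio = new / old

print(f"legacy={old:.3f} m  new={new:.3f} m  ratio={ratio:.1f}")
```

Under these assumptions, a daily flux raster produced by the legacy path would need to be multiplied by 4 to be comparable with the new output.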

Overriding the routing

auto is just a default. Pass an explicit op when you want different semantics:

Use case	`op`
Reproduce the legacy buggy daily-flux output	`"mean"`
Daily max / min temperature	`"max"` / `"min"`
Per-window standard deviation	`"std"`
Pre-aggregated CDS datasets like `derived-era5--daily-statistics` (each NetCDF sample is already a daily aggregate; summing 4 of them would multiply by 4)	`"mean"`

Pressure-level support (`level=`)

Pyramids exposes NetCDF.dimension_names and NetCDF.sel(...), which aggregate_netcdf uses to handle 4-D (time, level, lat, lon) NetCDFs:

NetCDF shape	`level` set	Result
3-D `(time, lat, lon)`	not set	aggregates as-is
3-D `(time, lat, lon)`	set	`ValueError` ("no pressure-level dim")
4-D `(time, level, lat, lon)`	not set	`ValueError` ("pass `level=...`")
4-D `(time, level, lat, lon)`	set	`nc.sel(<dim>=level)`, then aggregate

Aggregation across all levels at once is intentionally not supported — the user must pick a level explicitly.
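The four rows of the table amount to a small validation step. The sketch below mirrors that decision on a tuple of dimension names; `pick_level_dim`, `outcome`, and the candidate names in `level_dims` are assumptions made for illustration, and the real lookup inside `aggregate_netcdf` may differ.

```python
def pick_level_dim(dim_names, level, level_dims=("level", "pressure_level")):
    """Return the level dimension to select on, or None for 3-D inputs.

    `level_dims` is an assumed list of candidate names, not the
    library's actual lookup.
    """
    found = [name for name in dim_names if name in level_dims]
    if not found:
        if level is not None:
            raise ValueError("no pressure-level dim in this NetCDF")
        return None
    if level is None:
        raise ValueError("4-D input: pass `level=...`")
    return found[0]


def outcome(dim_names, level):
    """Run the check and report the result or the error as text."""
    try:
        return pick_level_dim(dim_names, level)
    except ValueError as exc:
        return f"ValueError: {exc}"


three_d = ("time", "lat", "lon")
four_d = ("time", "level", "lat", "lon")

for dims, level in [(three_d, None), (three_d, 500), (four_d, None), (four_d, 500)]:
    print(dims, level, "->", outcome(dims, level))
```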

Worked example: download to monthly mean GeoTIFFs

Single-call pipeline that downloads daily ERA5 2-metre temperature for January 2022 over a 1° box and writes one monthly-mean GeoTIFF:

from earthlens import AggregationConfig
from earthlens.earthlens import EarthLens

# Named `lens` so the instance is not confused with the `earthlens` package.
lens = EarthLens(
    data_source="ecmwf",
    temporal_resolution="daily",
    start="2022-01-01",
    end="2022-01-31",
    variables={"reanalysis-era5-single-levels": ["2m-temperature"]},
    lat_lim=[4.0, 5.0],
    lon_lim=[-75.0, -74.0],
    path="out/era5",
)
# One call: download the NetCDF, then write one GeoTIFF per monthly window.
lens.download(
    aggregate=AggregationConfig(freq="1MS", op="mean"),
)

The retrieved NetCDF lands at out/era5/2m_temperature_reanalysis-era5-single-levels.nc; the aggregated GeoTIFF lands at out/era5/aggregated/2m_temperature_1MS_20220101.tif (default out_dir = <root_dir>/aggregated/).

Worked example: aggregate later, separately

If you already have the NetCDF on disk:

from earthlens import AggregationConfig, aggregate_netcdf
from earthlens.ecmwf import Catalog

spec = Catalog().get_variable(
    "reanalysis-era5-single-levels", "2m-temperature"
)
results = aggregate_netcdf(
    "out/era5/2m_temperature_reanalysis-era5-single-levels.nc",
    spec,
    AggregationConfig(freq="1MS", op="mean", out_dir="out/era5/monthly"),
)
for window_label, arr, target in results:
    print(window_label, arr.shape, target.name)

In-memory mode (`out_dir=None`)

Skip GeoTIFF writes entirely and inspect the per-window arrays:

from earthlens import AggregationConfig, aggregate_netcdf
from earthlens.ecmwf import Catalog

spec = Catalog().get_variable(
    "reanalysis-era5-single-levels", "2m-temperature"
)
results = aggregate_netcdf(
    "out/era5/2m_temperature_reanalysis-era5-single-levels.nc",
    spec,
    AggregationConfig(freq="1D", op="mean"),
)
first_label, first_array, first_path = results[0]
assert first_path is None

CLI demo

examples/post_process_ecmwf_netcdf.py is a thin CLI wrapper:

python examples/post_process_ecmwf_netcdf.py \
    out/era5/2m_temperature_reanalysis-era5-single-levels.nc \
    out/era5/daily \
    reanalysis-era5-single-levels \
    2m-temperature \
    --freq 1D --op auto

Flags map 1-to-1 to AggregationConfig fields. See --help for the full list.

Output filename convention

Per-window GeoTIFFs are named:

<cds_variable>_<freq>_<window-label-as-YYYYMMDD>.tif

Examples:

2m_temperature_1D_20220101.tif — daily mean for 2022-01-01.
total_precipitation_1MS_20220101.tif — monthly sum for 2022-01.
temperature_QS-DEC_20220301.tif — MAM seasonal mean.
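The naming pattern can be reproduced with a one-line formatter. `window_filename` is a hypothetical helper written for this example; it assumes the window label is the window's start timestamp.

```python
from datetime import datetime


def window_filename(cds_variable, freq, window_start):
    """Build <cds_variable>_<freq>_<YYYYMMDD>.tif for one window."""
    return f"{cds_variable}_{freq}_{window_start:%Y%m%d}.tif"


print(window_filename("2m_temperature", "1D", datetime(2022, 1, 1)))
print(window_filename("total_precipitation", "1MS", datetime(2022, 1, 1)))
print(window_filename("temperature", "QS-DEC", datetime(2022, 3, 1)))
```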

See also

`earthlens.aggregate.AggregationConfig`: frozen request payload.
`earthlens.aggregate.aggregate_netcdf`: the core function.
`earthlens.ecmwf.Catalog`: resolves (dataset, code) pairs to the `Variable` rows that drive `op="auto"` and the output filename.
`earthlens.ecmwf.ECMWF.download`: accepts the `aggregate` parameter for one-call download-and-aggregate.