Catalog & tooling#
The CHC backend is catalog-driven end-to-end. No FTP path,
filename pattern, region bound, or per-variable unit is hardcoded
in backend.py. This page documents the catalog's on-disk shape,
the validation it enforces at load, the health() self-report, and
the tooling that supports catalog work.
At a glance#
| File / tool | Role |
|---|---|
| src/earthlens/chc/catalog/_index.yaml | The informational available_datasets: walk-order list + the regions: block (named geographic-coverage profiles) |
| src/earthlens/chc/catalog/<family>.yaml | 8 per-family files (CHIRPS-2.0, CHIRPS v3, CHIRP, CHIRTS, GEFS, CMIP6, indices, derived) carrying the actual datasets: blocks |
| src/earthlens/chc/catalog.py | The loader (Catalog, Dataset, Variable, _build_chc_dataset, _load_catalog_data, _StrictSafeLoader) |
| src/earthlens/chc/backend.py | The CHIRPS class, which consumes the catalog through Dataset.ftp_bases / Dataset.file_patterns / Dataset.discrete_files |
| tools/chc/refresh_chc_catalog.py | Walks data.chc.ucsb.edu and regenerates the available_datasets: index |
| tools/chc/audit_chc_datasets.py | Coverage / staleness classifier (parallel of tools/gee/audit_gee_datasets.py) |
| tools/chc/probe_chirps_gefs.py | CHIRPS-GEFS FTP probe; was used to verify the v3 file patterns before they were withdrawn |
Layout — per-family split#
The catalog is a directory of per-family YAML files, mirroring
the GEE-style layout under src/earthlens/gee/catalog/:
src/earthlens/chc/catalog/
├── _index.yaml # 145 lines — available_datasets: + regions:
├── chirps-2.0.yaml # 27 datasets — CHIRPS-2.0 global/regional/preliminary
├── chirps-v3.yaml # 11 datasets — CHIRPS v3 (final + preliminary)
├── chirp.yaml # 7 datasets — CHIRP + CHIRP v3
├── chirts.yaml # 7 datasets — CHIRTSdaily + CHIRTSmonthly
├── gefs.yaml # 9 datasets — CHIRPS-GEFS v12
├── indices.yaml # 15 datasets — SPI/CHIRPS3 + SPEI v1
├── cmip6.yaml # 16 datasets — CHC_CMIP6 scenario deltas
└── derived.yaml # 5 datasets — CHPclim v2, WBGT, CentennialTrends v1
Family-file names are not load-bearing — the loader walks
*.yaml in the directory and merges every datasets: block into one
dict. Family files exist for editorial clarity (a maintainer touching
WBGT opens derived.yaml, not a 5000-line monolith). Dataset keys
must be unique across files; a collision raises ValueError
naming both filenames.
Catalog.load() accepts both the directory shape (canonical) and a
legacy single-file YAML for backwards compatibility / tests. The
loader dispatches on path.is_dir().
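The merge-and-collision rule described above can be sketched roughly as follows. This is an illustration only, not the loader's actual code: `merge_family_docs` is a hypothetical helper, and it takes already-parsed YAML documents keyed by filename so that the sketch needs no YAML library. The sample values are illustrative.

```python
from typing import Any, Mapping


def merge_family_docs(docs: Mapping[str, Mapping[str, Any]]) -> dict[str, Any]:
    """Merge every `datasets:` block into one dict, rejecting duplicate keys.

    `docs` maps a filename to its parsed YAML content (a hypothetical input
    shape; the real loader reads the files from disk itself).
    """
    merged: dict[str, Any] = {}
    origin: dict[str, str] = {}  # dataset key -> filename that first defined it
    for filename in sorted(docs):
        # Files without a datasets: block (e.g. _index.yaml) contribute nothing.
        for key, body in (docs[filename].get("datasets") or {}).items():
            if key in origin:
                raise ValueError(
                    f"dataset key {key!r} defined in both "
                    f"{origin[key]} and {filename}"
                )
            origin[key] = filename
            merged[key] = body
    return merged


# Illustrative input: one index file and two family files.
docs = {
    "_index.yaml": {"available_datasets": ["global-daily"], "regions": {}},
    "chirps-2.0.yaml": {"datasets": {"global-daily": {"region": "global"}}},
    "derived.yaml": {"datasets": {"wbgt-monthly": {"region": "global"}}},
}
merged = merge_family_docs(docs)
```

Because the filenames play no part in the merged result, renaming a family file changes nothing for callers, which matches the "not load-bearing" claim.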
_index.yaml#
Two top-level blocks, no datasets::
available_datasets:
  - global-daily
  - global-monthly
  - africa-daily
  - chirts-daily-tmax
  - wbgt-monthly
  - centennial-trends-v1-monthly
  - chc-cmip6-precip-daily-delta-2030-ssp245
  # … 97 entries total
regions:
  global:
    lat_boundaries: [-50, 50]
    lon_boundaries: [-180, 180]
  global-land:
    lat_boundaries: [-60, 70]
    lon_boundaries: [-180, 180]
  global-extended:
    lat_boundaries: [-90, 90]
    lon_boundaries: [-180, 180]
  africa:
    lat_boundaries: [-40, 40]
    lon_boundaries: [-20, 55]
  central-america-caribbean:
    lat_boundaries: [5, 35]
    lon_boundaries: [-120, -55]
  east-africa:
    lat_boundaries: [-12, 6]
    lon_boundaries: [28, 42]
  east-africa-centennial:   # CenTrends-specific wider extent
    lat_boundaries: [-12.25, 22.25]
    lon_boundaries: [21.25, 51.25]
  indonesia:
    lat_boundaries: [-11, 6]
    lon_boundaries: [95, 141]
  western-hemisphere:
    lat_boundaries: [-50, 50]
    lon_boundaries: [-180, 0]
The regions: block is the single source of truth for spatial
bounds. Every dataset references a region by name; the loader
resolves ds.region → regions[ds.region] → (lat_boundaries,
lon_boundaries) and pins those on the Dataset instance. A dataset
that carries inline lat_boundaries / lon_boundaries is
rejected with a ValueError pointing at the regions block: pick an
existing region or add a new entry. This stops the drift case where a
region rename in _index.yaml silently disagrees with stale inline
bounds on N datasets.
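A minimal sketch of that resolution step, under stated assumptions: `resolve_bounds` is a hypothetical function name, the raw dataset entry is a plain dict, and raising ValueError for an unknown region name is an assumption (the text above only specifies ValueError for inline bounds).

```python
# Two entries copied from the regions: block shown earlier.
REGIONS = {
    "global": {"lat_boundaries": [-50, 50], "lon_boundaries": [-180, 180]},
    "africa": {"lat_boundaries": [-40, 40], "lon_boundaries": [-20, 55]},
}


def resolve_bounds(ds_key: str, raw: dict, regions: dict) -> tuple:
    """Return (lat_boundaries, lon_boundaries) for one raw dataset entry."""
    # Inline bounds are rejected so the regions block stays authoritative.
    inline = {"lat_boundaries", "lon_boundaries"} & raw.keys()
    if inline:
        raise ValueError(
            f"{ds_key}: inline {sorted(inline)} not allowed; pick an existing "
            "region or add a new entry to the regions: block"
        )
    name = raw["region"]
    if name not in regions:  # assumption: unknown names also raise ValueError
        raise ValueError(f"{ds_key}: unknown region {name!r}")
    region = regions[name]
    return tuple(region["lat_boundaries"]), tuple(region["lon_boundaries"])


lat, lon = resolve_bounds("africa-daily", {"region": "africa"}, REGIONS)
```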
Per-family file structure#
A per-family <family>.yaml carries a datasets: block; each entry
follows this schema:
Schema skeleton#
datasets:
  <dataset_key>:                       # one block per dataset
    ftp_bases:                         # REQUIRED: format-keyed FTP dirs
      tif: pub/org/chc/products/CHIRPS-2.0/...   # relative to the FTP root
      cog: pub/org/chc/products/CHIRPS-2.0/...   # additional formats as available
      netcdf: ...
    file_patterns:                     # REQUIRED for per-date datasets
      tif: "{year}/chirps-v2.0.{year}.{month}.{day}.tif.gz"   # one template per format
      cog: ...                         # placeholders expanded by _placeholders()
    discrete_files:                    # XOR with file_patterns: fixed list
      tif:
        - CHPclim2.90-90.01.tif        # 12 files, one per climatological month
        - CHPclim2.90-90.02.tif
        # …
    region: <name>                     # REQUIRED: must exist in regions:
    temporal_resolution: <Literal>     # REQUIRED: see vocabulary below
    pandas_freq: <alias>               # REQUIRED: validated via to_offset
    spatial_resolution: [<deg>]        # REQUIRED: pixel size in degrees
    formats: [<fmt>, ...]              # REQUIRED: must match ftp_bases keys
    start_date: <YYYY-MM-DD>           # REQUIRED: inclusive
    end_date: <YYYY-MM-DD>             # optional: None for ongoing products
    preliminary: <bool>                # optional: defaults to false
    variables:                         # REQUIRED: per-variable map
      <variable_code>:
        description: <string>          # short human-readable description
        units: <string>                # unit string (e.g. "mm/day")
        types: <Literal>               # "flux" or "state"
Required vs optional#
- ftp_bases, region, temporal_resolution, pandas_freq, spatial_resolution, formats, start_date, variables: REQUIRED. Omission raises ValidationError at load.
- Exactly one of file_patterns / discrete_files must be set (enforced by a model_validator(mode="after") on Dataset).
- end_date, preliminary: OPTIONAL.
The model is frozen and rejects unknown fields
(ConfigDict(frozen=True, extra="forbid")), so a mistyped field name
raises at load rather than being silently ignored.
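The two schema rules above (exactly one of file_patterns / discrete_files, and no unknown fields) can be illustrated with a cut-down pydantic v2 model. `DatasetSketch` is a hypothetical stand-in with only three fields; the real Dataset carries many more, and its validator may be written differently.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError, model_validator


class DatasetSketch(BaseModel):
    """Hypothetical, reduced version of the catalog's Dataset model."""

    model_config = ConfigDict(frozen=True, extra="forbid")

    region: str
    file_patterns: Optional[dict[str, str]] = None
    discrete_files: Optional[dict[str, list[str]]] = None

    @model_validator(mode="after")
    def _exactly_one_shape(self) -> "DatasetSketch":
        # XOR: both set or both missing is an error.
        if (self.file_patterns is None) == (self.discrete_files is None):
            raise ValueError(
                "exactly one of file_patterns / discrete_files must be set"
            )
        return self


ok = DatasetSketch(region="global", file_patterns={"tif": "{year}.tif"})

try:
    # "regoin" is a typo; extra="forbid" turns it into a load-time error.
    DatasetSketch(regoin="global", file_patterns={"tif": "{year}.tif"})
except ValidationError as exc:
    typo_error = exc
```

pydantic wraps the ValueError raised inside the validator in a ValidationError, so callers only need to catch one exception type.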
temporal_resolution vocabulary#
Constrained to a 14-entry Literal[...] over the
_TEMPORAL_RESOLUTIONS tuple:
10-day 5-day annual
15-day 6-hourly daily
2-monthly daily-delta dekadal
3-monthly monthly monthly-climatology
pentadal seasonal
A typo (e.g. "daly") raises ValidationError at load with the
literal vocabulary in the message. Adding a new cadence is a
two-line edit: add to the tuple AND to the Literal[...] annotation
on Dataset.temporal_resolution. Catalog.list_datasets(temporal_resolution=...)
also validates the argument against the same tuple.
pandas_freq#
Every dataset's pandas_freq is validated at load via
pd.tseries.frequencies.to_offset(value). Catches:
- Typos ("daly", "montly").
- Deprecated aliases (pandas 2.2 deprecated "H" in favour of "h"; pandas 3.x removed "AS" outright; see the H3 commit on this branch that swapped every "AS" for "YS").
- Non-string values.
The error message points at the pandas offset alias table.
Discrete-files datasets carry a placeholder pandas_freq; the
check still runs (the placeholder must be a legal alias).
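A rough sketch of such a check, assuming pandas is installed. `validate_pandas_freq` is a hypothetical name and the error wording is illustrative; only `to_offset` is the real pandas call.

```python
from pandas.tseries.frequencies import to_offset


def validate_pandas_freq(value: object) -> str:
    """Return `value` if pandas accepts it as an offset alias, else raise."""
    if not isinstance(value, str):
        raise ValueError(
            f"pandas_freq must be a string alias, got {type(value).__name__}"
        )
    try:
        to_offset(value)  # raises ValueError for unknown aliases
    except ValueError as exc:
        raise ValueError(
            f"invalid pandas_freq {value!r}; see the pandas offset alias table"
        ) from exc
    return value


daily = validate_pandas_freq("D")
month_start = validate_pandas_freq("MS")
```

How a deprecated alias behaves (warning versus error) depends on the installed pandas version, so the sketch does not try to normalise that.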
file_patterns vs discrete_files#
A dataset publishes its bytes in one of two shapes — never both. The discriminator is structural, not a separate field:
| Shape | Field | Backend path |
|---|---|---|
| Per-date partitions (the common case) | file_patterns: {fmt: template} | _download_dataset iterates pd.date_range(start, end, freq=pandas_freq), calls _placeholders(date, pandas_freq) per date, formats the template, fetches over FTP |
| Fixed archive files (CHPclim, CenTrends) | discrete_files: {fmt: [filename, ...]} | _download_discrete iterates the filename list once per request, no date substitution |
Placeholders the backend's _placeholders() expands:
- {year}, {month}, {day}: calendar position
- {dekad} (1/2/3): third of the month
- {pentad} (1..6): five-day period of the month
- {hour}: 00–23
- {doy}: Julian day-of-year (3-digit, zero-padded)
- {start_yyyymmdd} / {end_yyyymmdd}: period-window endpoints derived from pandas_freq (WBGT)
{month_pair} (CHIRPS v3 2-monthly) and {res} / {scale} are not
implemented; a row using them would silently hit the per-date
KeyError-skip path. None of the shipped rows use them today.
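To make the expansion concrete, here is a simplified sketch covering the calendar placeholders only. It is not the backend's `_placeholders()`: the function names are hypothetical, the period-window placeholders are omitted, and zero-padding of month and day plus the rule that the last dekad and pentad absorb the month's remaining days are assumptions.

```python
from datetime import datetime
from typing import Optional


def placeholders_sketch(when: datetime) -> dict[str, str]:
    """Build the substitution dict for one date (calendar fields only)."""
    day = when.day
    return {
        "year": f"{when.year:04d}",
        "month": f"{when.month:02d}",  # assumption: zero-padded
        "day": f"{day:02d}",           # assumption: zero-padded
        # Days 1-10 -> 1, 11-20 -> 2, 21-end -> 3.
        "dekad": str(min((day - 1) // 10 + 1, 3)),
        # Days 1-5 -> 1, ..., 26-end -> 6.
        "pentad": str(min((day - 1) // 5 + 1, 6)),
        "hour": f"{when.hour:02d}",
        "doy": f"{when.timetuple().tm_yday:03d}",
    }


def expand(template: str, when: datetime) -> Optional[str]:
    """Format a template; return None if it uses an unknown placeholder."""
    try:
        return template.format(**placeholders_sketch(when))
    except KeyError:
        # Mirrors the skip behaviour described above for {month_pair} etc.
        return None


name = expand(
    "{year}/chirps-v2.0.{year}.{month}.{day}.tif.gz", datetime(2020, 1, 15)
)
skipped = expand("chirps.{month_pair}.tif", datetime(2020, 1, 15))
```

Returning None on KeyError is what makes an unimplemented placeholder easy to miss: nothing fails loudly, the date is simply skipped.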
Catalog.health()#
A structural-hygiene self-report. Most schema invariants are caught
at load time; health() covers the residual quality checks that
don't fit the pydantic schema:
| Check | What it surfaces |
|---|---|
| dataset_without_variables | datasets carrying zero curated variables (defence in depth, should always be []) |
| end_date_before_start_date | end_date < start_date (would yield an empty download window for every request) |
| unreferenced_region | keys in regions: that no dataset's region: field points at (registry rot) |
| index_missing_in_datasets | keys in available_datasets: that have no entry under datasets: (get_dataset(key) would KeyError) |
| datasets_missing_in_index | the reverse: keys in datasets: that the index doesn't advertise |
| variable_metadata_drift | (variable_name, temporal_resolution) groups where the constituent rows disagree on (units, types) |
from earthlens.chc import Catalog
issues = Catalog().health()
{k: v for k, v in issues.items() if v}
# {'variable_metadata_drift': ['precipitation/daily']}
# (the H3-tracked drift across the multi-region CHIRPS-2.0 daily rows
# whose precipitation `description` legitimately varies)
A clean catalog returns {...: []} for every key.
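Several of these checks reduce to set differences. The sketch below covers four of the six, operating on plain dicts rather than the real Catalog object; `health_sketch` and its input shapes are assumptions made for illustration.

```python
def health_sketch(index_keys: list, datasets: dict, regions: dict) -> dict:
    """Return a {check_name: [offending keys]} report for plain-dict input."""
    referenced = {ds.get("region") for ds in datasets.values()}
    return {
        "dataset_without_variables": sorted(
            key for key, ds in datasets.items() if not ds.get("variables")
        ),
        "unreferenced_region": sorted(set(regions) - referenced),
        "index_missing_in_datasets": sorted(set(index_keys) - set(datasets)),
        "datasets_missing_in_index": sorted(set(datasets) - set(index_keys)),
    }


# Illustrative input with one problem planted for each check.
report = health_sketch(
    index_keys=["global-daily", "ghost-dataset"],
    datasets={
        "global-daily": {"region": "global", "variables": {"precipitation": {}}},
        "africa-daily": {"region": "global", "variables": {}},
    },
    regions={"global": {}, "africa": {}},
)
```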
Caching#
Catalog() parses through a module-level cache keyed on
(resolved_path, fingerprint):
- Directory layout: fingerprint is a tuple((name, mtime_ns), ...) over the sorted *.yaml members. Editing any per-family file changes the tuple and invalidates the entry. Collision-free under mtime permutations (a swap that leaves the sum unchanged still produces a different tuple).
- File layout (legacy single-file): fingerprint is the file's stat().st_mtime_ns.
Repeated Catalog() construction across a process is therefore ~1 ms
(cache hit) rather than re-parsing every YAML. Tests that
monkey-patch CATALOG_PATH should call clear_catalog_cache() to
avoid stale entries.
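The fingerprinting scheme can be sketched with the standard library alone. `fingerprint`, `load_cached`, and `_CACHE` are hypothetical names; the real cache may also evict stale entries, which this sketch does not attempt.

```python
from pathlib import Path
from typing import Any, Callable

_CACHE: dict = {}  # (resolved_path, fingerprint) -> parsed catalog


def fingerprint(path: Path):
    """Directory: tuple of (name, mtime_ns) pairs; single file: its mtime_ns."""
    if path.is_dir():
        return tuple(
            (member.name, member.stat().st_mtime_ns)
            for member in sorted(path.glob("*.yaml"))
        )
    return path.stat().st_mtime_ns


def load_cached(path, parse: Callable[[Path], Any]) -> Any:
    """Parse `path` once per fingerprint; later calls hit the cache."""
    resolved = Path(path).resolve()
    key = (resolved, fingerprint(resolved))
    if key not in _CACHE:
        _CACHE[key] = parse(resolved)
    return _CACHE[key]
```

Keeping the filename next to each mtime is what makes a swap of two files' mtimes detectable, whereas a summed fingerprint would not change.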
Strict YAML loading#
_StrictSafeLoader rejects duplicate keys in any mapping:
# An accidentally-duplicated key in the YAML:
#
# datasets:
# global-daily: { ... }
# global-monthly: { ... }
# global-daily: { ... } # ← second occurrence
#
# Pre-fix: silent shadowing (the second wins, the first is lost).
# Post-fix:
# ValueError: duplicate YAML key 'global-daily' at line 142,
# column 3 of chirps-2.0.yaml: every key in a YAML mapping must
# be unique
Cross-file duplicates are caught separately by the directory loader
(which keeps a {ds_key: filename} map and raises on the second
occurrence).
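One common way to get duplicate-key rejection with PyYAML is to override construct_mapping on a SafeLoader subclass. The sketch below shows that pattern under the assumption that PyYAML is the parser in use; `StrictLoaderSketch` is a hypothetical name, and the real _StrictSafeLoader may differ in details such as message wording.

```python
import yaml


class StrictLoaderSketch(yaml.SafeLoader):
    """SafeLoader variant that raises on duplicate keys within one mapping."""

    def construct_mapping(self, node, deep=False):
        seen = set()
        for key_node, _value_node in node.value:
            # Merge keys ("<<") are expanded by the parent class; skip them.
            if key_node.tag == "tag:yaml.org,2002:merge":
                continue
            key = self.construct_object(key_node, deep=deep)
            if key in seen:
                mark = key_node.start_mark  # 0-based line and column
                raise ValueError(
                    f"duplicate YAML key {key!r} at line {mark.line + 1}, "
                    f"column {mark.column + 1}: every key in a YAML mapping "
                    "must be unique"
                )
            seen.add(key)
        return super().construct_mapping(node, deep=deep)


good = yaml.load("a: 1\nb: 2\n", Loader=StrictLoaderSketch)

try:
    yaml.load("a: 1\nb: 2\na: 3\n", Loader=StrictLoaderSketch)
except ValueError as exc:
    duplicate_error = str(exc)
```

This only sees duplicates inside a single document, which is why the directory loader needs its own cross-file check.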
Tooling#
tools/chc/refresh_chc_catalog.py#
Walks data.chc.ucsb.edu and regenerates the
available_datasets: index in _index.yaml. Run after CHC publishes
a new product. CHC analogue of
tools/gee/refresh_gee_catalog.py and
tools/ecmwf/refresh_available_datasets.py.
tools/chc/audit_chc_datasets.py#
Coverage / staleness classifier. Walks the catalog and reports which
shipped datasets have been verified against the live FTP (and which
have gone untouched for N days). Parallel of
tools/gee/audit_gee_datasets.py.
tools/chc/probe_chirps_gefs.py#
CHIRPS-GEFS-specific FTP probe. Lists the real contents of each
CHIRPS-GEFS/... directory on the live FTP, prints a sample of
filenames, and suggests a filename template inferred from the
listing. Was the basis for the H2 decision to withdraw the
CHIRPS-GEFS v3 rows — the probe found the directory shape didn't
match the YAML's provisional patterns
(year/ partitioning vs the YAML's year/month/ assumption;
anom/data/zscore subdir split on the dekad/pentad variants). The
rows are reinstated only when the probe confirms a verified pattern.
Adding a new dataset#
- Pick the right per-family file (e.g. a new SPEI window goes in indices.yaml).
- Add a datasets: entry following the schema above. If the region is new, add it to _index.yaml's regions: block first.
- Add the dataset key to _index.yaml's available_datasets: (alphabetical or walk-order; the order is informational).
- Run the catalog test suite.
- Run Catalog().health() and make sure no new keys show up non-empty.
- If the dataset uses a new placeholder, extend CHIRPS._placeholders() to expand it.
- If the FTP layout is provisional, leave a banner comment in the family file and write a probe under tools/chc/ before relying on the row.