Lazy FeatureCollection — complete cookbook¶
FeatureCollection.read_file(..., backend='dask') and FeatureCollection.read_parquet(..., backend='dask') return a LazyFeatureCollection — a dask_geopandas.GeoDataFrame subclass that satisfies the LazySpatialObject protocol. Every partition-aware op (to_crs, clip, sjoin, spatial_shuffle) runs lazily; materialise with .compute() when you need eager rows.
This notebook uses local test data — no cloud hits.
What you'll see¶
- Backend detection: has_lazy_backend, is_lazy_fc, the PEP-562 __getattr__ guard.
- Building a lazy FC from a GeoJSON + a GeoParquet file.
- Partition-aware ops: to_crs, clip, spatial_shuffle.
- spatial_shuffle → sjoin partition pruning.
- compute() vs persist() vs compute_total_bounds.
- Writing back with to_parquet (partitioned directory).
- pyramids.configure_lazy_vector: scheduler + target partition size.
Requirements¶
pip install 'pyramids-gis[parquet-lazy]'
Setup¶
import os
os.environ['MPLBACKEND'] = 'Agg'
from pathlib import Path
import numpy as np
DATA = (Path('..') / '..' / '..' / 'tests' / 'data').resolve()
DATA.is_dir()
1. Backend detection¶
On a minimal install without the [parquet-lazy] extra, from pyramids.feature import LazyFeatureCollection raises an ImportError with an actionable install hint. has_lazy_backend() is the cheap check; is_lazy_fc(obj) is the dispatch helper.
from pyramids.feature import has_lazy_backend, is_lazy_fc
has_lazy_backend()
# On minimal installs, the import raises — guard with
# try/except ImportError, NOT hasattr (Py3 hasattr only
# catches AttributeError).
try:
from pyramids.feature import LazyFeatureCollection
backend_ok = True
except ImportError:
LazyFeatureCollection = None
backend_ok = False
backend_ok
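The interaction between a PEP-562 module-level __getattr__ and hasattr can be reproduced without pyramids installed. The sketch below is a stand-in, not the pyramids implementation: the module name fake_feature, the helper make_guarded_module, and the error text are all hypothetical. It only illustrates why an ImportError raised from the guard escapes hasattr.

```python
import types


def make_guarded_module(backend_installed: bool) -> types.ModuleType:
    """Build a throwaway module whose lazy attribute is guarded PEP 562-style."""
    mod = types.ModuleType("fake_feature")  # hypothetical module name

    def module_getattr(name: str):
        # Called only when normal module attribute lookup fails (PEP 562).
        if name == "LazyFeatureCollection":
            if not backend_installed:
                raise ImportError(
                    "LazyFeatureCollection needs the optional dask backend; "
                    "try: pip install 'pyramids-gis[parquet-lazy]'"
                )
            return object  # placeholder standing in for the real class
        raise AttributeError(f"module {mod.__name__!r} has no attribute {name!r}")

    # Storing the function in the module namespace activates the PEP 562 hook.
    mod.__getattr__ = module_getattr
    return mod


minimal = make_guarded_module(backend_installed=False)

# hasattr() swallows AttributeError only, so the ImportError propagates.
try:
    hasattr(minimal, "LazyFeatureCollection")
    hasattr_leaked_import_error = False
except ImportError:
    hasattr_leaked_import_error = True

# The guard recommended above: catch ImportError explicitly.
try:
    lazy_cls = minimal.LazyFeatureCollection
except ImportError:
    lazy_cls = None
```

Names the guard does not know about still raise AttributeError, so hasattr keeps working for ordinary attributes.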
2. Lazy reads — GeoJSON and GeoParquet¶
read_file(backend='dask') handles the vector formats dask-geopandas can chunk (GeoJSON / Shapefile / GeoPackage, all split by row count). read_parquet(backend='dask') uses pyarrow's row-group splits, so the pushdown options (filters=, columns=, split_row_groups=) deliver true I/O savings.
from pyramids.feature import FeatureCollection
lfc = FeatureCollection.read_file(
str(DATA / 'coello-gauges.geojson'),
backend='dask',
npartitions=2,
)
type(lfc).__name__, lfc.npartitions
# is_lazy_fc is the safe dispatch helper.
is_lazy_fc(lfc)
# Round-trip to GeoParquet and read it back lazily.
import tempfile
workdir = Path(tempfile.mkdtemp(prefix='pyramids-lazy-fc-'))
pq = workdir / 'gauges.parquet'
lfc.compute().to_parquet(pq)
# backend='dask' returns LazyFeatureCollection;
# backend='pandas' (default) returns eager FeatureCollection.
lfc_pq = FeatureCollection.read_parquet(str(pq), backend='dask')
type(lfc_pq).__name__, lfc_pq.npartitions
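Why pushdown saves I/O on Parquet can be shown with a toy model. Each row group carries min/max statistics per column, so a reader can skip whole groups that cannot satisfy a filter and decode only the requested columns. The sketch below is pure Python and does not call pyarrow or pyramids; RowGroup, read_with_pushdown, and the id / elev columns are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Stand-in for one Parquet row group plus its column statistics."""
    rows: list   # list of dicts, one per feature
    stats: dict  # {column: (min, max)}


def make_row_group(rows):
    columns = rows[0].keys()
    stats = {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns}
    return RowGroup(rows=rows, stats=stats)


def read_with_pushdown(groups, column, lower_bound, columns):
    """Return rows with row[column] >= lower_bound, projected to `columns`."""
    groups_read = 0
    out = []
    for group in groups:
        _, col_max = group.stats[column]
        if col_max < lower_bound:
            continue  # statistics prove no row can match: skip the group
        groups_read += 1
        out.extend(
            {c: row[c] for c in columns}
            for row in group.rows
            if row[column] >= lower_bound
        )
    return out, groups_read


groups = [
    make_row_group([{"id": 1, "elev": 120}, {"id": 2, "elev": 340}]),
    make_row_group([{"id": 3, "elev": 900}, {"id": 4, "elev": 1500}]),
    make_row_group([{"id": 5, "elev": 2100}, {"id": 6, "elev": 2600}]),
]
rows, groups_read = read_with_pushdown(groups, "elev", 1000, ["id"])
```

Here the first group is never decoded, and the second is read but filtered row by row. That is the same two-level behaviour a statistics-based filter gives on a real file.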
3. Partition-aware ops¶
to_crs, clip, spatial_shuffle, and every inherited dask_geopandas.GeoDataFrame method run lazily. pyramids-specific helpers (extract_vertices, rasterize_with_col, with_coordinates, with_centroid, center_points) require .compute() first.
# Lazy CRS reproject. The __getattribute__ rebrand ensures every
# inherited dask-geopandas op returns LazyFeatureCollection, so
# pyramids-specific helpers (epsg, compute_total_bounds, is_lazy_fc)
# remain available after .to_crs / .clip / .copy / .drop_duplicates.
projected = lfc.to_crs(4326)
type(projected).__name__, projected.epsg
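The rebrand idea in the comment above can be sketched in plain Python. This is an assumption about the general technique, not the pyramids source: ParentFrame and BrandedFrame are hypothetical stand-ins for the dask-geopandas parent and the lazy subclass, and n_rows plays the role of a subclass-only helper such as epsg.

```python
import functools


class ParentFrame:
    """Stand-in for the inherited parent class."""

    def __init__(self, rows):
        self.rows = list(rows)

    def head(self, n):
        # Parent methods build plain ParentFrame results.
        return ParentFrame(self.rows[:n])


class BrandedFrame(ParentFrame):
    """Subclass that re-wraps parent results so its own helpers survive."""

    @property
    def n_rows(self):  # subclass-only helper
        return len(self.rows)

    def __getattribute__(self, name):
        attr = super().__getattribute__(name)
        # Leave private names and plain data attributes untouched.
        if name.startswith("_") or not callable(attr):
            return attr

        @functools.wraps(attr)
        def rebranding_call(*args, **kwargs):
            result = attr(*args, **kwargs)
            # Promote bare parent results back to the subclass.
            if isinstance(result, ParentFrame) and not isinstance(result, BrandedFrame):
                return BrandedFrame(result.rows)
            return result

        return rebranding_call


lazy = BrandedFrame([3, 1, 2])
subset = lazy.head(2)
```

Without the hook, subset would be a ParentFrame and subset.n_rows would raise AttributeError. With it, chained calls stay in the subclass.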
# Lazy clip by a bbox covering the data extent, in the reprojected CRS (EPSG:4326).
import geopandas as gpd
from shapely.geometry import box
xmin, ymin, xmax, ymax = projected.compute_total_bounds()
bbox = gpd.GeoDataFrame(
geometry=[box(xmin - 0.01, ymin - 0.01, xmax + 0.01, ymax + 0.01)],
crs='EPSG:4326',
)
clipped = projected.clip(bbox)
type(clipped).__name__, len(clipped.compute())
4. spatial_shuffle → sjoin pruning¶
The biggest speedup from going lazy comes from partition-pruned sjoin — each partition has a bounding box, and dask drops partition pairs that can't intersect before dispatching work. spatial_shuffle populates the spatial_partitions attribute that makes pruning possible.
# Read both sides lazily and reproject — with the rebrand hook,
# both .to_crs returns stay LazyFeatureCollection.
polys = FeatureCollection.read_file(
str(DATA / 'coello_polygons.geojson'),
backend='dask',
npartitions=2,
).to_crs(4326)
gauges = lfc.to_crs(4326)
(type(gauges).__name__, gauges.epsg), (type(polys).__name__, polys.epsg)
# spatial_shuffle — one-time cost, amortised across subsequent sjoins.
gauges_shuffled = gauges.spatial_shuffle(by='hilbert')
polys_shuffled = polys.spatial_shuffle(by='hilbert')
gauges_shuffled.spatial_partitions is not None, polys_shuffled.spatial_partitions is not None
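A Hilbert shuffle sorts features by their position along a space-filling curve, so neighbours in space tend to land in the same partition. The sketch below uses the classic integer grid-to-curve mapping on a small grid. It illustrates the ordering idea only and is not the dask-geopandas code path. The point labels, the 4x4 grid, and the chunk size of 3 are made up for the example.

```python
def hilbert_distance(order: int, x: int, y: int) -> int:
    """Map grid cell (x, y) on a 2**order by 2**order grid to its Hilbert index."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Hypothetical points, already snapped to a 4x4 grid (order=2).
points = {"a": (3, 0), "b": (0, 0), "c": (2, 2), "d": (0, 3), "e": (1, 1), "f": (3, 3)}

# Sort by curve position, then cut the sorted run into fixed-size partitions.
ordered = sorted(points, key=lambda k: hilbert_distance(2, *points[k]))
partitions = [ordered[i:i + 3] for i in range(0, len(ordered), 3)]
```

Because consecutive curve positions are always adjacent grid cells, each chunk of the sorted run covers a compact area. That is what keeps the per-partition bounding boxes small.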
# Partition-pruned sjoin — lazy.
joined = gauges_shuffled.sjoin(
polys_shuffled,
how='inner',
predicate='intersects',
)
type(joined).__name__, joined.npartitions
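Partition pruning comes down to a bounding-box test on partition pairs before any geometry is touched. The sketch below shows that test in plain Python under simplifying assumptions: each partition is summarised by one (xmin, ymin, xmax, ymax) box, and boxes_intersect and candidate_pairs are hypothetical helper names, not dask-geopandas functions.

```python
def boxes_intersect(a, b):
    """True when two (xmin, ymin, xmax, ymax) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(left_boxes, right_boxes):
    """Keep only the partition pairs whose boxes could contain a match."""
    return [
        (i, j)
        for i, left_box in enumerate(left_boxes)
        for j, right_box in enumerate(right_boxes)
        if boxes_intersect(left_box, right_box)
    ]


# Two partitions per side; only the first of each overlaps.
left = [(0, 0, 1, 1), (5, 5, 6, 6)]
right = [(0.5, 0.5, 2, 2), (10, 10, 11, 11)]

pairs = candidate_pairs(left, right)
pruned = len(left) * len(right) - len(pairs)
```

In this toy case three of the four partition pairs are dropped, so the row-level join runs on a single pair.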
5. compute() vs persist() vs compute_total_bounds¶
- .compute(): materialise to an eager FeatureCollection (leaves the lazy domain).
- .persist(): materialise the graph into worker memory but keep the lazy wrapper.
- compute_total_bounds(): one-line helper for the lazy total_bounds reduction.
# .compute() returns an eager FeatureCollection.
eager = lfc.compute()
type(eager).__name__, len(eager)
# .persist() keeps laziness but warms the graph.
persisted = lfc.persist()
type(persisted).__name__, is_lazy_fc(persisted)
# total_bounds is a dask Scalar on the lazy FC. The explicit
# helper returns the 4-float numpy array directly.
xmin, ymin, xmax, ymax = lfc.compute_total_bounds()
(xmin, ymin, xmax, ymax)
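A total-bounds reduction over partitions is a two-step fold: compute one box per partition, then merge the boxes. The sketch below shows that shape with plain tuples. It does not reproduce compute_total_bounds itself; the function names and coordinates are hypothetical.

```python
def partition_bounds(points):
    """Bounding box (xmin, ymin, xmax, ymax) of one partition's points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def merge_bounds(boxes):
    """Combine per-partition boxes into one overall box."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Two partitions of (x, y) points.
parts = [
    [(0.0, 1.0), (2.0, 3.0)],
    [(-1.0, 5.0), (4.0, 0.5)],
]

per_partition = [partition_bounds(p) for p in parts]  # could run in parallel
total = merge_bounds(per_partition)                   # cheap final step
```

Only the small per-partition boxes travel to the final merge, which is why the reduction stays cheap even when the partitions themselves are large.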
6. to_parquet — partitioned directory write¶
LazyFeatureCollection.to_parquet(path) is the only lazy-native write. It writes a partitioned directory of part.N.parquet files and always blocks until every partition is materialised — compute=False is rejected to keep the pyramids "to_* always writes" invariant.
out_dir = workdir / 'out.parquet'
lfc.to_parquet(str(out_dir))
sorted(p.name for p in out_dir.iterdir())[:5]
# Reopen the directory as a new lazy FC.
reopened = FeatureCollection.read_parquet(str(out_dir), backend='dask')
reopened.npartitions, len(reopened.compute())
7. pyramids.configure_lazy_vector — scheduler + partition size¶
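Shapely holds the GIL, so the default threads scheduler serialises vector ops to one core. Switch it globally with configure_lazy_vector(scheduler='processes'). Raise target_bytes_per_partition if you have more worker RAM. The cell below passes scheduler='synchronous' (dask's single-threaded scheduler) together with a 64 MiB partition target.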
from pyramids import configure_lazy_vector
applied = configure_lazy_vector(
scheduler='synchronous',
target_bytes_per_partition=64 * 1024 * 1024,
)
applied
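The notebook does not say how pyramids turns target_bytes_per_partition into a partition count. One plausible reading, stated purely as an assumption, is a ceiling division of the dataset size by the target. The helper estimate_npartitions below is hypothetical and only makes that arithmetic concrete.

```python
def estimate_npartitions(total_bytes: int, target_bytes: int) -> int:
    """Ceiling-divide dataset size by the per-partition target (at least 1)."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    # Integer ceiling division avoids float rounding on large sizes.
    return max(1, -(-total_bytes // target_bytes))


MIB = 1024 * 1024
GIB = 1024 * MIB

one_gib = estimate_npartitions(1 * GIB, 64 * MIB)       # exact multiple
just_over = estimate_npartitions(1 * GIB + 1, 64 * MIB)  # one extra byte
```

Under this reading, doubling the target roughly halves the number of partitions. That means less scheduling overhead but more memory held by each task.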
Closing notes¶
- The lazy FC has no lazy to_file (OGR) path; call .compute().to_file(path) to materialise first.
- Dataset.zonal_stats(lazy_fc) is not yet supported; call .compute() first (tracked as a follow-on).
- For a real Overture Maps walkthrough, see dask-lazy-features.ipynb.