Lazy vector reads with LazyFeatureCollection¶
This notebook walks through the dask-backed vector path in pyramids.feature. It uses a real, public, 100+ GB dataset — Overture Maps — hosted as GeoParquet on AWS S3. You will NOT download the whole thing; the point is to show that lazy reads + Arrow pushdown filters let you work with a dataset that doesn't fit in RAM.
What you'll see¶
- Reading a subset of Overture Places (places inside a small bbox over central Paris) with Arrow filters pushed down to Parquet, so only the matching row groups are fetched.
- Using LazyFeatureCollection methods (.spatial_shuffle, .sjoin, .persist, .compute) on a partitioned frame.
- Writing the result back to GeoParquet, the one genuinely lazy vector write path.
Requirements¶
pip install 'pyramids-gis[parquet-lazy]'
This pulls pyarrow, dask, dask-geopandas, fsspec, s3fs. The [parquet-lazy] extra is what unlocks everything below.
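Before running the cells below, it can help to confirm that the packages pulled in by the extra are importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical (not part of pyramids), and the import names are assumed to be the usual ones for these distributions (dask-geopandas imports as `dask_geopandas`).

```python
from importlib.util import find_spec

# Import names for the packages listed above (assumed; dask-geopandas -> dask_geopandas).
LAZY_VECTOR_MODULES = ("pyarrow", "dask", "dask_geopandas", "fsspec", "s3fs")


def missing_modules(names=LAZY_VECTOR_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("All lazy-vector dependencies are importable.")
```

If anything is reported missing, reinstalling with the `[parquet-lazy]` extra is the likely fix.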
⚠ Scheduler — set this BEFORE any compute¶
Shapely and GEOS hold the Python GIL. Under dask's default threaded scheduler, map_partitions on geometry ops runs single-core, and you will see slower timings than the eager path. Configure a process scheduler (or a LocalCluster) up front:
import os
os.environ['MPLBACKEND'] = 'Agg'  # pyramids' matplotlib hazard; harmless if you don't use it
from pyramids import configure_lazy_vector
# One call, applies to every LazyFeatureCollection.compute() in this session.
applied = configure_lazy_vector(
    scheduler="processes",
    target_bytes_per_partition=128 * 1024 * 1024,  # 128 MiB, the default; shown here for clarity
)
applied
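To get a feel for what a 128 MiB partition target means at this scale, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate under the assumption that data splits evenly at the target size; the helper `estimated_partitions` is hypothetical and is not how pyramids or dask actually plan partitions, which depends on the Parquet row-group layout.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def estimated_partitions(total_bytes, target_bytes_per_partition=128 * MIB):
    """Rough partition count if the data were split evenly at the target size."""
    if target_bytes_per_partition <= 0:
        raise ValueError("target_bytes_per_partition must be positive")
    # Integer ceiling division; always at least one partition.
    return max(1, -(-total_bytes // target_bytes_per_partition))


print(estimated_partitions(100 * GIB))  # 800
print(estimated_partitions(50 * MIB))   # 1
```

A full 100 GiB read would therefore imply on the order of hundreds of partitions, which is why the notebook restricts itself to a small bbox.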
1. Read a bounded slice of Overture Places¶
Overture partitions its data by theme and type under the public bucket s3://overturemaps-us-west-2/. With backend='dask' and the filters= pushdown, only row groups whose bbox statistics can satisfy the predicates are fetched. The filters below keep features whose bounding box lies entirely inside our bbox. We constrain the read to a tiny slice over central Paris so the notebook runs in a minute or two over a normal home connection.
from pyramids.feature import FeatureCollection
OVERTURE_PLACES = "s3://overturemaps-us-west-2/release/2026-03-18.0/theme=places/type=place/"
paris_bbox = (2.29, 48.85, 2.40, 48.90) # lon/lat (minx, miny, maxx, maxy)
lazy_places = FeatureCollection.read_parquet(
    OVERTURE_PLACES,
    backend="dask",
    columns=["names", "categories", "geometry"],
    filters=[
        ("bbox.xmin", ">=", paris_bbox[0]),
        ("bbox.ymin", ">=", paris_bbox[1]),
        ("bbox.xmax", "<=", paris_bbox[2]),
        ("bbox.ymax", "<=", paris_bbox[3]),
    ],
    storage_options={"anon": True},  # Overture bucket is public; s3fs won't try to sign
)
type(lazy_places).__name__, lazy_places.npartitions
The return value is a LazyFeatureCollection, a subclass of dask_geopandas.GeoDataFrame. Nothing has been fetched from S3 yet; only the Parquet file metadata has been scanned to build the task graph.
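The pushdown idea can be illustrated without touching S3. The sketch below is a simplified model of min/max statistics pruning, not pyarrow's actual implementation: each row group is represented by hypothetical per-column (min, max) statistics, and a row group is skipped only when some predicate cannot hold for any row in it. The helper names and the statistics values are made up for illustration.

```python
def bbox_filters(bbox):
    """Build the same four predicates used in the read above from (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = bbox
    return [
        ("bbox.xmin", ">=", minx),
        ("bbox.ymin", ">=", miny),
        ("bbox.xmax", "<=", maxx),
        ("bbox.ymax", "<=", maxy),
    ]


def row_group_may_match(stats, filters):
    """stats maps column -> (min, max). Return False only if no row can satisfy a predicate."""
    for column, op, value in filters:
        col_min, col_max = stats[column]
        if op == ">=" and col_max < value:
            return False  # even the largest value in the group is too small
        if op == "<=" and col_min > value:
            return False  # even the smallest value in the group is too large
    return True


filters = bbox_filters((2.29, 48.85, 2.40, 48.90))

# Made-up statistics for two row groups: one around Paris, one around Berlin.
paris_group = {
    "bbox.xmin": (2.20, 2.45), "bbox.xmax": (2.20, 2.45),
    "bbox.ymin": (48.80, 48.92), "bbox.ymax": (48.80, 48.92),
}
berlin_group = {
    "bbox.xmin": (13.2, 13.6), "bbox.xmax": (13.2, 13.6),
    "bbox.ymin": (52.4, 52.6), "bbox.ymax": (52.4, 52.6),
}

print(row_group_may_match(paris_group, filters))   # True: must be read and filtered row by row
print(row_group_may_match(berlin_group, filters))  # False: can be skipped entirely
```

Note that a row group that "may match" is still filtered row by row after it is read; statistics only rule groups out.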
2. Stay lazy while filtering, keep pushdown benefits¶
Any dask_geopandas.GeoDataFrame method works. Methods pyramids overrides (compute, persist, spatial_shuffle, to_parquet) return a LazyFeatureCollection so you keep the pyramids type contract across ops.
# Spatial shuffle → spatial_partitions populated → partition-pruned sjoin.
# Partially eager — computes Hilbert curve distances for every row.
shuffled = lazy_places.spatial_shuffle(by="hilbert").persist()
# Cheap because total_bounds is inherited from dask-geopandas.
# It returns a dask Scalar; compute_total_bounds() forces the O(partitions)
# reduction explicitly so you see the cost in the method name.
bounds = shuffled.compute_total_bounds()
bounds
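To see why sorting by Hilbert distance groups nearby features into the same partitions, here is a small pure-Python sketch of the classic Hilbert curve index on a square grid. It is an illustration only: dask-geopandas computes its distances in a vectorised way that may differ in detail, and the helper names, grid order, and sample points are assumptions made for this example.

```python
PARIS_BBOX = (2.29, 48.85, 2.40, 48.90)  # (minx, miny, maxx, maxy)


def hilbert_distance(order, x, y):
    """Distance along a Hilbert curve covering a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def grid_cell(lon, lat, bbox, order):
    """Map a lon/lat inside bbox to an integer grid cell, clamped to the grid."""
    minx, miny, maxx, maxy = bbox
    n = 1 << order
    col = min(n - 1, max(0, int((lon - minx) / (maxx - minx) * n)))
    row = min(n - 1, max(0, int((lat - miny) / (maxy - miny) * n)))
    return col, row


# Hypothetical sample points (label, lon, lat) inside the Paris bbox.
points = [("A", 2.295, 48.853), ("B", 2.395, 48.898), ("C", 2.297, 48.855)]
ordered = sorted(
    points,
    key=lambda p: hilbert_distance(8, *grid_cell(p[1], p[2], PARIS_BBOX, 8)),
)
print([label for label, _, _ in ordered])
```

Consecutive positions along the curve are always neighbouring grid cells, so contiguous ranges of the sort key tend to cover compact regions. That is what lets a later sjoin prune partitions by their bounds.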
3. Materialise a slice back to eager for plotting / exporting¶
LazyFeatureCollection.compute() returns a FeatureCollection — the eager side of the same protocol. All pyramids-specific methods (extract_vertices, with_coordinates, plot, to_file) become available at this point because you're back in eager land.
eager = shuffled.compute()
type(eager).__name__, len(eager), eager.epsg
4. Write back to GeoParquet — the one lazy-native writer¶
LazyFeatureCollection.to_parquet is the only write path that stays lazy across partitions. .to_file(...) raises NotImplementedError because dask-geopandas has no lazy OGR write path; to write other formats, materialise first with .compute().to_file(path).
# Returns None after every partition has been written; no Delayed leaked out.
shuffled.to_parquet("paris_places.parquet", compression="snappy")
# Quick sanity check — read it back via the eager path.
reread = FeatureCollection.read_parquet("paris_places.parquet")
len(reread), reread.crs
5. What the API deliberately does NOT do on lazy FCs¶
The separate-class design (LazyFeatureCollection subclasses dask_geopandas.GeoDataFrame, not pyramids' own FeatureCollection) means inherited methods behave consistently with dask-geopandas. pyramids doesn't try to fake eager semantics:
| Op | Behaviour on LazyFC |
|---|---|
| len(lfc) | raises: dask.dataframe can't size without compute |
| lfc.iloc[0] | raises: positional row access needs compute |
| lfc.total_bounds | returns a dask Scalar; call .compute() or lfc.compute_total_bounds() |
| lfc.plot() | raises NotImplementedError: no lazy plot path |
| lfc.to_file(path) | raises NotImplementedError: no lazy OGR write |
| lfc.extract_vertices(), .with_coordinates(), ... | AttributeError: pyramids-specific eager-only methods |
Use pyramids.feature.is_lazy_fc(x) for dispatch code that must accept both eager and lazy FCs without try/except ImportError ceremony.
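A dispatch helper along these lines might look like the sketch below. To keep it runnable without pyramids installed, the predicate is passed in as an argument and exercised with stand-in classes; in real code you would pass `pyramids.feature.is_lazy_fc`. The function `feature_count` and both stand-in classes are hypothetical and exist only for this illustration.

```python
def feature_count(fc, is_lazy):
    """Return the row count of an eager or lazy collection.

    `is_lazy` is a predicate such as pyramids.feature.is_lazy_fc.
    """
    if is_lazy(fc):
        # len() raises on the lazy type, so materialise first (this can be expensive).
        return len(fc.compute())
    return len(fc)


class EagerStandIn(list):
    """Stand-in for an eager FeatureCollection: sized like a list."""


class LazyStandIn:
    """Stand-in for a lazy collection: cannot be sized until computed."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __len__(self):
        raise TypeError("lazy collections cannot be sized without compute")

    def compute(self):
        return EagerStandIn(self._rows)


def is_lazy_stand_in(obj):
    # Hypothetical replacement for is_lazy_fc, valid only for the stand-ins above.
    return isinstance(obj, LazyStandIn)


print(feature_count(EagerStandIn([1, 2, 3]), is_lazy_stand_in))  # 3
print(feature_count(LazyStandIn([1, 2]), is_lazy_stand_in))      # 2
```

Passing the predicate in keeps the helper testable and avoids importing optional dependencies at module import time.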
Further reading¶
- docs/tutorials/lazy/lazy-vector.md — decision tree + method matrix
- Overture Maps Quickstart — dataset schema + partitioning
- dask-geopandas docs — every inherited method