Skip to content

DatasetCollection#

Time-stacked collection of co-registered rasters. Holds N rasters that share a spatial template (rows, columns, cell size, CRS) and exposes them along a single time axis for multi-temporal analysis.

The two paths#

The class operates through two distinct backing paths, each serving a different concern. Knowing which path a method routes through is the key to using the class correctly.

Path A — per-timestep gdal.Dataset handles#

A list of lazy Dataset instances, one per timestep, populated on first access. Each Dataset.read_file(path) opens a gdal handle without reading pixels; the cost is one file descriptor + a metadata read per timestep. Pixel data flows block-by-block through GDAL when methods invoke read_array / crop / to_crs etc.

Method Behavior
iloc(i), __getitem__, __setitem__ Direct indexed access — returns a Dataset (iloc) or its array.
head / tail / first / last Subset views — returns a 3D slice (head/tail) or 2D array.
values getter Derived per-call: np.stack([ds.read_array() for ds in datasets]).
values = setter Rebuilds the list via Dataset.create_from_array(...) per slice.
crop, to_crs, align, apply, overlay Loop the handles; produce a new collection wrapping the results.
to_file, to_cog_stack Loop the handles; write each timestep to disk.
plot Materialises the cube on demand for cleopatra.array_glyph.

Path A works for both file-backed and in-memory collections. After a mutating op (in-place crop, apply, __setitem__, values =), the collection becomes in-memory; the new Dataset instances live in the GDAL MEM driver and Path A keeps working.

Path B — dask graph over file paths#

A list of path strings. The data property assembles a dask.array.Array of shape (time, bands, rows, cols) from [dask.delayed(_read_time_step)(p) for p in self._files]. Workers reopen each path on demand via a process-cached CachingFileManager — gdal handles never cross the pickle boundary, only path strings do. This is what makes the path scale to dask.distributed clusters and to cubes larger than RAM.

Method Behavior
data (property) The dask graph itself.
mean / sum / min / max / std / var Time-axis reductions via _reduce over data.
groupby(labels).<reduction>(...) Per-label reductions via _GroupedCollection.
to_zarr Streams the cube to a Zarr store; never holds it all in RAM.
to_kerchunk Pure metadata pass; reads only a few bytes per file.

Path B works for file-backed collections only. After a mutating op clears _files, the property raises:

RuntimeError: DatasetCollection.data requires a file-backed collection.
Use DatasetCollection.from_files(...) to construct one.

The transition is explicit; in-memory collections do not silently fail or return stale results.

Boundary#

The two paths read different attributes (_datasets vs _files). They are not parallel views of the same store; they cannot drift. A collection moves from "file-backed + usable from both paths" to "in-memory + Path A only" the moment a mutating op runs.

Operation After: Path A Path B
read_multiple_files / from_files / from_stac file-backed works works
crop(inplace=False) etc. (returns new) in-memory works raises (explicit)
crop(inplace=True) etc. (mutates self) in-memory works raises (explicit)
values = arr / __setitem__ in-memory works raises (explicit)

Cost model#

Path A Path B
Handles at rest N file descriptors (one per Dataset) 0
Where reads happen Synchronously per-method Inside dask tasks (parallelisable across workers)
Caching The collection owns the handles Process-global LRU FILE_CACHE (default 128 paths)
Pickle Cache dropped on __getstate__ Path strings serialise cleanly
Larger-than-RAM cubes Block-streaming per timestep Block-streaming across the whole cube + workers

Choosing a path#

  • Per-timestep transformations (crop / reproject / align / apply / overlay / write each step to disk) — Path A. Each Dataset.<op> already block-streams through GDAL; the collection is the loop.
  • Time-axis reductions (mean / sum / std / groupby) and out-of-process writes (zarr, kerchunk) — Path B. Pickleable paths cross dask boundaries; gdal handles cannot.

pyramids.dataset.collection.DatasetCollection #

Time-stacked collection of co-registered rasters.

Holds N rasters that share a spatial template (rows, columns, cell size, CRS) and exposes them as a single logical "cube" along a time axis. Used for multi-temporal analysis (a daily precipitation series, an annual NDVI stack, a model output forecast, …).

The class operates through two distinct backing paths, each serving a different concern. Understanding which methods route through which is the key to using the class correctly.

Path A — per-timestep gdal.Dataset handles (self._datasets) Backing store is a list of lazy Dataset instances, one per timestep, populated by the datasets property on first access. Each Dataset.read_file(path) opens a gdal handle but does not read pixels — the cost per timestep is one file descriptor + a small metadata read. Pixel data flows block-by-block through GDAL when downstream methods invoke read_array / crop / to_crs etc.

Methods that route through Path A:

* ``iloc(i)``, ``__getitem__``, ``__setitem__``,
  ``head``, ``tail``, ``first``, ``last``,
  ``values`` (read-side: derived per-call cube),
  ``values=`` (write-side: rebuilds the list with
  ``Dataset.create_from_array(...)`` per slice).
* Per-timestep ops: ``crop``, ``to_crs``, ``align``,
  ``apply``, ``overlay``, ``to_file``, ``to_cog_stack``.
  Each loops the handles via ``_apply_per_timestep`` and
  produces a new collection wrapping the per-timestep
  results.
* Visualisation: ``plot`` materialises the cube on demand
  via ``np.stack([ds.read_array() for ds in datasets])``.

Works for both **file-backed** and **in-memory** collections.
After a mutating op (in-place ``crop``, ``apply``,
``__setitem__``, ``values =``), the collection is in-memory
and Path A continues to work because the new ``Dataset``
instances live in the GDAL ``MEM`` driver.

Path B — dask graph over file paths (self._files) Backing store is a list of file path strings. The data property assembles a dask.array.Array of shape (time, bands, rows, cols) from [dask.delayed(_read_time_step)(p) for p in self._files]. Workers re-open each path on demand via a process-cached CachingFileManager — gdal handles never cross the pickle boundary, only path strings do. This is what makes the path scale to dask.distributed clusters and to cubes larger than RAM.

Methods that route through Path B:

* Reductions over the time axis: ``mean``, ``sum``, ``min``,
  ``max``, ``std``, ``var`` (all via ``_reduce``);
  ``groupby(...).<reduction>(...)``.
* Out-of-process writes: ``to_zarr`` (streams the cube to
  a Zarr store; never holds it all in RAM), ``to_kerchunk``
  (pure metadata pass; reads only a few bytes per file).

Works for **file-backed** collections only. After a mutating
op clears ``_files``, Path B raises a clean
``RuntimeError("DatasetCollection.data requires a
file-backed collection. Use DatasetCollection.from_files(...)
to construct one.")``.

Boundary between the two paths The two paths read different attributes (_datasets vs _files) — they are not parallel views of the same store and cannot drift. The collection moves from "file-backed + usable from both paths" to "in-memory + Path A only" the moment a mutating op runs. The transition is explicit (_files = None) and Path B raises clearly when called on an in-memory collection. There is no silent disagreement.

The cost split is also explicit:

* Path A holds N file descriptors for the lifetime of the
  collection; reads happen synchronously per-method.
* Path B holds zero handles at rest; reads happen inside
  dask tasks and share the process-global LRU
  (``pyramids.base._file_manager.FILE_CACHE``, default 128
  handles) — workers re-using the same path hit the same
  cache slot regardless of which dask task opened it first.

Pickle __getstate__ drops the lazy _datasets cache so pickle stores only the canonical metadata + paths. The post-unpickle instance re-opens lazily on first access. gdal handles never cross the pickle boundary, by design.

See Also

:class:pyramids.dataset.Dataset — the per-timestep raster wrapped by Path A and read on demand by Path B. :class:_GroupedCollection — Path B view returned by groupby.

Source code in src/pyramids/dataset/collection.py
 273
 274
 275
 276
 277
 278
 279
 280
 281
 282
 283
 284
 285
 286
 287
 288
 289
 290
 291
 292
 293
 294
 295
 296
 297
 298
 299
 300
 301
 302
 303
 304
 305
 306
 307
 308
 309
 310
 311
 312
 313
 314
 315
 316
 317
 318
 319
 320
 321
 322
 323
 324
 325
 326
 327
 328
 329
 330
 331
 332
 333
 334
 335
 336
 337
 338
 339
 340
 341
 342
 343
 344
 345
 346
 347
 348
 349
 350
 351
 352
 353
 354
 355
 356
 357
 358
 359
 360
 361
 362
 363
 364
 365
 366
 367
 368
 369
 370
 371
 372
 373
 374
 375
 376
 377
 378
 379
 380
 381
 382
 383
 384
 385
 386
 387
 388
 389
 390
 391
 392
 393
 394
 395
 396
 397
 398
 399
 400
 401
 402
 403
 404
 405
 406
 407
 408
 409
 410
 411
 412
 413
 414
 415
 416
 417
 418
 419
 420
 421
 422
 423
 424
 425
 426
 427
 428
 429
 430
 431
 432
 433
 434
 435
 436
 437
 438
 439
 440
 441
 442
 443
 444
 445
 446
 447
 448
 449
 450
 451
 452
 453
 454
 455
 456
 457
 458
 459
 460
 461
 462
 463
 464
 465
 466
 467
 468
 469
 470
 471
 472
 473
 474
 475
 476
 477
 478
 479
 480
 481
 482
 483
 484
 485
 486
 487
 488
 489
 490
 491
 492
 493
 494
 495
 496
 497
 498
 499
 500
 501
 502
 503
 504
 505
 506
 507
 508
 509
 510
 511
 512
 513
 514
 515
 516
 517
 518
 519
 520
 521
 522
 523
 524
 525
 526
 527
 528
 529
 530
 531
 532
 533
 534
 535
 536
 537
 538
 539
 540
 541
 542
 543
 544
 545
 546
 547
 548
 549
 550
 551
 552
 553
 554
 555
 556
 557
 558
 559
 560
 561
 562
 563
 564
 565
 566
 567
 568
 569
 570
 571
 572
 573
 574
 575
 576
 577
 578
 579
 580
 581
 582
 583
 584
 585
 586
 587
 588
 589
 590
 591
 592
 593
 594
 595
 596
 597
 598
 599
 600
 601
 602
 603
 604
 605
 606
 607
 608
 609
 610
 611
 612
 613
 614
 615
 616
 617
 618
 619
 620
 621
 622
 623
 624
 625
 626
 627
 628
 629
 630
 631
 632
 633
 634
 635
 636
 637
 638
 639
 640
 641
 642
 643
 644
 645
 646
 647
 648
 649
 650
 651
 652
 653
 654
 655
 656
 657
 658
 659
 660
 661
 662
 663
 664
 665
 666
 667
 668
 669
 670
 671
 672
 673
 674
 675
 676
 677
 678
 679
 680
 681
 682
 683
 684
 685
 686
 687
 688
 689
 690
 691
 692
 693
 694
 695
 696
 697
 698
 699
 700
 701
 702
 703
 704
 705
 706
 707
 708
 709
 710
 711
 712
 713
 714
 715
 716
 717
 718
 719
 720
 721
 722
 723
 724
 725
 726
 727
 728
 729
 730
 731
 732
 733
 734
 735
 736
 737
 738
 739
 740
 741
 742
 743
 744
 745
 746
 747
 748
 749
 750
 751
 752
 753
 754
 755
 756
 757
 758
 759
 760
 761
 762
 763
 764
 765
 766
 767
 768
 769
 770
 771
 772
 773
 774
 775
 776
 777
 778
 779
 780
 781
 782
 783
 784
 785
 786
 787
 788
 789
 790
 791
 792
 793
 794
 795
 796
 797
 798
 799
 800
 801
 802
 803
 804
 805
 806
 807
 808
 809
 810
 811
 812
 813
 814
 815
 816
 817
 818
 819
 820
 821
 822
 823
 824
 825
 826
 827
 828
 829
 830
 831
 832
 833
 834
 835
 836
 837
 838
 839
 840
 841
 842
 843
 844
 845
 846
 847
 848
 849
 850
 851
 852
 853
 854
 855
 856
 857
 858
 859
 860
 861
 862
 863
 864
 865
 866
 867
 868
 869
 870
 871
 872
 873
 874
 875
 876
 877
 878
 879
 880
 881
 882
 883
 884
 885
 886
 887
 888
 889
 890
 891
 892
 893
 894
 895
 896
 897
 898
 899
 900
 901
 902
 903
 904
 905
 906
 907
 908
 909
 910
 911
 912
 913
 914
 915
 916
 917
 918
 919
 920
 921
 922
 923
 924
 925
 926
 927
 928
 929
 930
 931
 932
 933
 934
 935
 936
 937
 938
 939
 940
 941
 942
 943
 944
 945
 946
 947
 948
 949
 950
 951
 952
 953
 954
 955
 956
 957
 958
 959
 960
 961
 962
 963
 964
 965
 966
 967
 968
 969
 970
 971
 972
 973
 974
 975
 976
 977
 978
 979
 980
 981
 982
 983
 984
 985
 986
 987
 988
 989
 990
 991
 992
 993
 994
 995
 996
 997
 998
 999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692
1693
1694
1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
1752
1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
1770
1771
1772
1773
1774
1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
1818
1819
1820
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
1865
1866
1867
1868
1869
1870
1871
1872
1873
1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
1889
1890
1891
1892
1893
1894
1895
1896
1897
1898
1899
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
2027
2028
2029
2030
2031
2032
2033
2034
2035
2036
2037
2038
2039
2040
2041
2042
2043
2044
2045
2046
2047
2048
2049
2050
2051
2052
2053
2054
2055
2056
2057
2058
2059
2060
2061
2062
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
2073
2074
2075
2076
2077
2078
2079
2080
2081
2082
2083
2084
2085
2086
2087
2088
2089
2090
2091
2092
2093
2094
2095
2096
2097
2098
2099
2100
2101
2102
2103
2104
2105
2106
2107
2108
2109
2110
2111
2112
2113
2114
2115
2116
2117
2118
2119
2120
2121
2122
2123
2124
2125
2126
2127
2128
2129
2130
2131
2132
2133
2134
2135
2136
2137
2138
2139
2140
2141
2142
2143
2144
2145
2146
2147
2148
2149
2150
2151
2152
2153
2154
2155
2156
2157
2158
2159
2160
2161
2162
2163
2164
2165
2166
2167
2168
2169
2170
2171
2172
2173
2174
2175
2176
2177
2178
2179
2180
2181
2182
2183
2184
2185
2186
2187
2188
2189
2190
2191
2192
2193
2194
2195
2196
2197
2198
2199
2200
2201
2202
2203
2204
2205
2206
2207
2208
2209
2210
2211
2212
2213
2214
2215
2216
2217
2218
2219
2220
2221
2222
2223
2224
2225
2226
2227
2228
2229
2230
2231
2232
2233
2234
2235
2236
2237
2238
2239
2240
2241
2242
2243
2244
2245
2246
2247
2248
2249
2250
2251
2252
2253
2254
2255
2256
2257
2258
2259
2260
2261
2262
2263
2264
2265
2266
2267
2268
2269
2270
2271
2272
2273
2274
2275
2276
2277
2278
2279
2280
2281
2282
2283
2284
2285
2286
2287
2288
2289
2290
2291
2292
2293
2294
2295
2296
2297
2298
2299
2300
2301
2302
2303
2304
2305
2306
2307
2308
2309
2310
2311
2312
2313
2314
2315
2316
2317
2318
2319
2320
2321
2322
2323
2324
2325
2326
2327
2328
2329
2330
2331
2332
2333
2334
2335
2336
2337
2338
2339
2340
2341
2342
2343
2344
2345
2346
2347
2348
2349
2350
2351
2352
2353
2354
2355
2356
2357
2358
2359
2360
2361
class DatasetCollection:
    """Time-stacked collection of co-registered rasters.

    Holds N rasters that share a spatial template (rows, columns,
    cell size, CRS) and exposes them as a single logical "cube"
    along a time axis. Used for multi-temporal analysis (a daily
    precipitation series, an annual NDVI stack, a model output
    forecast, …).

    The class operates through **two distinct backing paths**, each
    serving a different concern. Understanding which methods route
    through which is the key to using the class correctly.

    Path A — per-timestep ``gdal.Dataset`` handles (``self._datasets``)
        Backing store is a list of lazy ``Dataset`` instances, one
        per timestep, populated by the ``datasets`` property on first
        access. Each ``Dataset.read_file(path)`` opens a gdal handle
        but does not read pixels — the cost per timestep is one file
        descriptor + a small metadata read. Pixel data flows
        block-by-block through GDAL when downstream methods invoke
        ``read_array`` / ``crop`` / ``to_crs`` etc.

        Methods that route through Path A:

        * ``iloc(i)``, ``__getitem__``, ``__setitem__``,
          ``head``, ``tail``, ``first``, ``last``,
          ``values`` (read-side: derived per-call cube),
          ``values=`` (write-side: rebuilds the list with
          ``Dataset.create_from_array(...)`` per slice).
        * Per-timestep ops: ``crop``, ``to_crs``, ``align``,
          ``apply``, ``overlay``, ``to_file``, ``to_cog_stack``.
          Each loops the handles via ``_apply_per_timestep`` and
          produces a new collection wrapping the per-timestep
          results.
        * Visualisation: ``plot`` materialises the cube on demand
          via ``np.stack([ds.read_array() for ds in datasets])``.

        Works for both **file-backed** and **in-memory** collections.
        After a mutating op (in-place ``crop``, ``apply``,
        ``__setitem__``, ``values =``), the collection is in-memory
        and Path A continues to work because the new ``Dataset``
        instances live in the GDAL ``MEM`` driver.

    Path B — dask graph over file paths (``self._files``)
        Backing store is a list of file path strings. The ``data``
        property assembles a ``dask.array.Array`` of shape
        ``(time, bands, rows, cols)`` from
        ``[dask.delayed(_read_time_step)(p) for p in self._files]``.
        Workers re-open each path on demand via a process-cached
        ``CachingFileManager`` — gdal handles never cross the
        pickle boundary, only path strings do. This is what makes
        the path scale to ``dask.distributed`` clusters and to
        cubes larger than RAM.

        Methods that route through Path B:

        * Reductions over the time axis: ``mean``, ``sum``, ``min``,
          ``max``, ``std``, ``var`` (all via ``_reduce``);
          ``groupby(...).<reduction>(...)``.
        * Out-of-process writes: ``to_zarr`` (streams the cube to
          a Zarr store; never holds it all in RAM), ``to_kerchunk``
          (pure metadata pass; reads only a few bytes per file).

        Works for **file-backed** collections only. After a mutating
        op clears ``_files``, Path B raises a clean
        ``RuntimeError("DatasetCollection.data requires a
        file-backed collection. Use DatasetCollection.from_files(...)
        to construct one.")``.

    Boundary between the two paths
        The two paths read different attributes (``_datasets`` vs
        ``_files``) — they are not parallel views of the same store
        and cannot drift. The collection moves from "file-backed +
        usable from both paths" to "in-memory + Path A only" the
        moment a mutating op runs. The transition is explicit
        (``_files = None``) and Path B raises clearly when called
        on an in-memory collection. There is no silent disagreement.

        The cost split is also explicit:

        * Path A holds N file descriptors for the lifetime of the
          collection; reads happen synchronously per-method.
        * Path B holds zero handles at rest; reads happen inside
          dask tasks and share the process-global LRU
          (``pyramids.base._file_manager.FILE_CACHE``, default 128
          handles) — workers re-using the same path hit the same
          cache slot regardless of which dask task opened it first.

    Pickle
        ``__getstate__`` drops the lazy ``_datasets`` cache so
        pickle stores only the canonical metadata + paths. The
        post-unpickle instance re-opens lazily on first access.
        gdal handles never cross the pickle boundary, by design.

    See Also:
        :class:`pyramids.dataset.Dataset` — the per-timestep raster
            wrapped by Path A and read on demand by Path B.
        :class:`_GroupedCollection` — Path B view returned by
            ``groupby``.
    """

    def __init__(
        self,
        src: Dataset,
        time_length: int,
        files: list[str] | None = None,
        *,
        meta: RasterMeta | None = None,
        datasets: list[Dataset] | None = None,
        gdal_env: dict[str, str] | None = None,
    ):
        """Construct DatasetCollection object.

        Args:
            src: Template :class:`~pyramids.dataset.Dataset` (also
                serves as the timestep when no per-file Datasets are
                given).
            time_length: Number of timesteps in the collection.
            files: Optional list of file paths backing each timestep.
                When given, per-timestep ops open each path as a lazy
                :class:`~pyramids.dataset.Dataset` on first access.
            meta: Optional :class:`RasterMeta` snapshot. When omitted,
                a snapshot is derived eagerly from `src` so downstream
                lazy paths can access geo metadata without
                reopening the template every call.
            datasets: Optional list of pre-opened
                :class:`~pyramids.dataset.Dataset` handles, one per
                timestep. When given, takes precedence over `files`
                and `time_length` for per-timestep access. Used
                internally by :meth:`to_crs` / :meth:`crop` /
                :meth:`align` to wrap the result of per-timestep ops
                without re-opening any files.
            gdal_env: Optional GDAL config (e.g. a signer's
                `gdal_env()`) installed around **every** open of the
                backing files — both the eager template open and each
                lazy per-timestep read on Path A (`datasets`) and Path B
                (`data` dask graph). Lets a signed / Requester-Pays
                collection (from :meth:`from_stac`) authenticate its
                reads. A plain dict so it survives pickling to dask
                workers. `None`/empty means no extra config.
        """
        self._base = src
        self._files = files
        self._time_length = time_length
        self._meta = meta if meta is not None else RasterMeta.from_dataset(src)
        self._gdal_env: dict[str, str] = dict(gdal_env) if gdal_env else {}
        # Cached lazy list of per-timestep Datasets. Populated on
        # first access via the `datasets` property: from `datasets=`
        # (caller-provided), then `files=` (open each path), then a
        # `[src] * time_length` fallback for legacy call sites that
        # don't pass either.
        self._datasets: list[Dataset] | None = (
            list(datasets) if datasets is not None else None
        )

    def __getstate__(self):
        """Pickle state — drop the lazy `_datasets` cache.

        Each `Dataset` in the cache wraps a live gdal handle that
        cannot be pickled. Stripping the cache forces the
        post-unpickle instance to re-open files on demand. The
        on-disk paths in `_files` are the canonical truth.
        """
        state = self.__dict__.copy()
        state["_datasets"] = None
        return state

    @property
    def datasets(self) -> list[Dataset]:
        """Lazy list of per-timestep :class:`Dataset` handles.

        Populates on first access. Three sources, in priority order:

        1. Caller-provided `datasets=` argument to ``__init__``
           (used by per-timestep ops to wrap their results).
        2. `files=` argument — each path opened as a lazy gdal
           handle via :meth:`Dataset.read_file`.
        3. Fallback for legacy ``DatasetCollection(src, time_length=N)``
           constructions with neither ``files`` nor ``datasets`` —
           the template ``src`` is replicated ``time_length`` times.

        The cache is per-instance and lives until the collection is
        garbage-collected. It is dropped on pickle (see
        :meth:`__getstate__`).
        """
        if self._datasets is None:
            if self._files is not None:
                # H4: install the persisted signer env (Requester-Pays / bearer /
                # SAS) around every per-timestep open so a signed file-backed
                # collection authenticates its Path A reads, not just the
                # template open in from_files. A no-op when _gdal_env is empty.
                with cloud_config_from_env(self._gdal_env):
                    self._datasets = [Dataset.read_file(str(p)) for p in self._files]
            else:
                self._datasets = [self._base] * self._time_length
        return self._datasets

    def __str__(self):
        """__str__."""
        source_line = (
            f"Files: {len(self._files)}"
            if self._files is not None
            else f"Time length: {self._time_length} (in-memory)"
        )
        message = f"""
            {source_line}
            Cell size: {self._base.cell_size}
            EPSG: {self._base.epsg}
            Dimension: {self.rows} * {self.columns}
            Mask: {self._base.no_data_value[0]}
        """
        return message

    def __repr__(self):
        """__repr__."""
        source_line = (
            f"Files: {len(self._files)}"
            if self._files is not None
            else f"Time length: {self._time_length} (in-memory)"
        )
        message = f"""
            {source_line}
            Cell size: {self._base.cell_size}
            EPSG: {self._base.epsg}
            Dimension: {self.rows} * {self.columns}
            Mask: {self._base.no_data_value[0]}
        """
        return message

    @property
    def base(self) -> Dataset:
        """base.

        Base Dataset
        """
        return self._base

    @property
    def files(self):
        """Files."""
        return self._files

    @property
    def time_length(self) -> int:
        """Length of the dataset."""
        return self._time_length

    @property
    def rows(self):
        """Number of rows."""
        return self._base.rows

    @property
    def shape(self):
        """Number of rows."""
        return self.time_length, self.rows, self.columns

    @property
    def columns(self):
        """Number of columns."""
        return self._base.columns

    @classmethod
    def create_cube(cls, src: Dataset, dataset_length: int) -> DatasetCollection:
        """Create DatasetCollection.

            - Create DatasetCollection from a sample raster and

        Args:
            src (Dataset):
                Raster object.
            dataset_length (int):
                Length of the dataset.

        Returns:
            DatasetCollection: DatasetCollection object.
        """
        return cls(src, dataset_length)

    def groupby(self, time_labels) -> _GroupedCollection:
        """Group time steps by per-timestep label.

        Returns a view exposing the same reduction surface as
        :class:`DatasetCollection` (`mean / sum / min / max / std /
        var`); each reduction runs once per unique label over the
        subset of timesteps carrying that label.

        Args:
            time_labels: Sequence of length `self.time_length` — each
                entry is the group label for the corresponding file
                (e.g. `["Jan", "Jan", "Feb", "Feb",...]` or integer
                month numbers for monthly groupings).

        Returns:
            _GroupedCollection: Lightweight view with `.mean()` etc.
            Each call returns a dict `{label: np.ndarray}`.

        Raises:
            ValueError: When `len(time_labels)!= self.time_length`.
        """
        if len(time_labels) != self._time_length:
            raise ValueError(
                f"time_labels length {len(time_labels)} does not match "
                f"time_length {self._time_length}"
            )
        return _GroupedCollection(self, list(time_labels))

    def reduce_time(
        self,
        times: Sequence,
        *,
        freq: str,
        op: str,
        skipna: bool = True,
    ) -> list[tuple[Any, Dataset]]:
        """Reduce the time axis by a calendar frequency, grid-attached.

        Buckets the timesteps by a pandas offset alias (``"1MS"``, ``"7D"``,
        ``"6h"``, …), reduces each window with ``op`` through the existing
        :meth:`groupby` reducer, and wraps each window's result back into a
        :class:`~pyramids.dataset.Dataset` carrying the collection's
        geotransform / CRS / no-data — so callers get ready-to-write rasters
        instead of the bare ``{label: ndarray}`` that :meth:`groupby` returns.

        The per-timestep timestamps are supplied by the caller (``times``)
        because a :class:`DatasetCollection` does not itself carry a time
        coordinate. The reduction runs through :attr:`data`, so the optional
        ``[lazy]`` extra (dask; flox recommended) is required.

        Args:
            times: Per-timestep timestamps, length ``self.time_length``. Any
                value :func:`pandas.to_datetime` accepts (``datetime``,
                ``"2022-01-01"``, ``pandas.Timestamp``, a ``DatetimeIndex``, …),
                aligned with the collection's timestep order.
            freq: A pandas offset alias naming the window size, e.g. ``"1MS"``
                (month start), ``"7D"`` (weekly), ``"1D"``, ``"6h"``.
            op: Reduction operation: one of ``"mean"``, ``"sum"``, ``"min"``,
                ``"max"``, ``"std"``, ``"var"``.
            skipna: When ``True`` (default) ignore the no-data value in each
                window; forwarded to the underlying reducer.

        Returns:
            list[tuple[Any, Dataset]]: ``(window_label, dataset)`` pairs, one
            per non-empty window, sorted by window label. ``window_label`` is
            the :class:`pandas.Timestamp` at the window's left edge; each
            ``dataset`` is a grid-attached reduction of that window.

        Raises:
            ValueError: ``op`` is not a supported reduction, ``len(times)`` does
                not match :attr:`time_length`, or ``times`` contains an
                unparseable / ``NaT`` entry.

        Examples:
            - Monthly means of a stack of daily COGs, ready to write:
                ```python
                >>> import pandas as pd  # doctest: +SKIP
                >>> from pyramids.dataset.collection import DatasetCollection  # doctest: +SKIP
                >>> coll = DatasetCollection.from_files(daily_cog_paths)  # doctest: +SKIP
                >>> times = pd.date_range("2022-01-01", periods=coll.time_length, freq="1D")  # doctest: +SKIP
                >>> monthly = coll.reduce_time(times, freq="1MS", op="mean")  # doctest: +SKIP
                >>> label, ds = monthly[0]  # doctest: +SKIP
                >>> ds.write_array  # a grid-attached Dataset, not a bare ndarray  # doctest: +SKIP

                ```
        """
        if op not in _GroupedCollection._OPS:
            raise ValueError(
                f"op must be one of {_GroupedCollection._OPS}, got {op!r}."
            )
        time_list = list(times)
        if len(time_list) != self._time_length:
            raise ValueError(
                f"times has {len(time_list)} entries but the collection has "
                f"{self._time_length} timesteps."
            )

        index = pd.DatetimeIndex(pd.to_datetime(time_list))
        if index.isna().any():
            raise ValueError(
                "times contains unparseable / NaT entries; every timestep must "
                "have a valid timestamp."
            )
        positions = pd.Series(np.arange(len(index)), index=index)
        window_labels: list[Any] = [None] * len(index)
        for window_key, members in positions.groupby(pd.Grouper(freq=freq)):
            for pos in members.to_numpy():
                window_labels[int(pos)] = window_key

        reduced = getattr(self.groupby(window_labels), op)(skipna=skipna)

        geo = self._base.geotransform
        epsg = self._base.epsg
        no_data_value = self._base.no_data_value[0]
        result: list[tuple[Any, Dataset]] = []
        for label in sorted(reduced):
            dataset = Dataset.create_from_array(
                np.asarray(reduced[label]),
                geo=geo,
                epsg=epsg,
                no_data_value=no_data_value,
            )
            result.append((label, dataset))
        return result

    def _reduce(self, op_name: str, *, skipna: bool) -> np.ndarray:
        """Shared reduction dispatcher over the time axis."""
        func, _ = resolve_dask_op(op_name, skipna=skipna)
        result = func(self.data, axis=0)
        return np.asarray(result.compute())

    def mean(self, *, skipna: bool = True) -> np.ndarray:
        """Element-wise mean across the time axis.

        Args:
            skipna: When True (default) skip `NaN` via
                :func:`dask.array.nanmean`; otherwise use
                :func:`dask.array.mean`.

        Returns:
            np.ndarray: Mean array of shape `(bands, rows, cols)`.
        """
        return self._reduce("mean", skipna=skipna)

    def sum(self, *, skipna: bool = True) -> np.ndarray:
        """Element-wise sum across the time axis."""
        return self._reduce("sum", skipna=skipna)

    def min(self, *, skipna: bool = True) -> np.ndarray:
        """Element-wise minimum across the time axis."""
        return self._reduce("min", skipna=skipna)

    def max(self, *, skipna: bool = True) -> np.ndarray:
        """Element-wise maximum across the time axis."""
        return self._reduce("max", skipna=skipna)

    def std(self, *, skipna: bool = True) -> np.ndarray:
        """Element-wise standard deviation across the time axis."""
        return self._reduce("std", skipna=skipna)

    def var(self, *, skipna: bool = True) -> np.ndarray:
        """Element-wise variance across the time axis."""
        return self._reduce("var", skipna=skipna)

    @property
    def data(self) -> Any:
        """Return a lazy `dask.array.Array` of shape `(T, B, R, C)`.

        Each per-file read is scheduled as a
        :func:`dask.delayed` task that opens the file via
        :class:`~pyramids.base._file_manager.CachingFileManager`
         and reads its full array. Workers therefore never
        serialise a `gdal.Dataset` — only the file path crosses the
        pickle boundary, which keeps the graph safe under dask.distributed.

        Raises:
            ImportError: If the optional `dask` extra is not
                installed.
            RuntimeError: If the collection was constructed without a
                `files` list (legacy `create_cube` path).
        """
        if self._files is None or len(self._files) == 0:
            raise RuntimeError(
                "DatasetCollection.data requires a file-backed collection. "
                "Use DatasetCollection.from_files(...) to construct one."
            )
        try:
            import dask
            import dask.array as da
        except ImportError as exc:
            raise ImportError(
                "DatasetCollection.data requires the optional 'dask' "
                "dependency. Install with one of:\n"
                "  - PyPI:        pip install 'pyramids-gis[lazy]'\n"
                "  - conda-forge: conda install -c conda-forge pyramids-lazy"
            ) from exc
        meta = self._meta
        shape = meta.shape
        dtype = np.dtype(meta.dtype)
        delayed_reads = [
            dask.delayed(_read_time_step)(path, self._gdal_env) for path in self._files
        ]
        arrays = [da.from_delayed(d, shape=shape, dtype=dtype) for d in delayed_reads]
        return da.stack(arrays, axis=0)

    @property
    def meta(self) -> RasterMeta:
        """Return the picklable :class:`RasterMeta` snapshot.

        Always accessible without reopening the template dataset — a
        snapshot is derived eagerly at construction (see
        :meth:`__init__`) so downstream lazy paths can read geobox +
        dtype metadata without paying a GDAL-open cost per call, and
        so the whole collection pickles cleanly even if the
        `_base` Dataset handle is closed or points at a /vsimem/
        file.
        """
        return self._meta

    def to_kerchunk(
        self,
        output_path,
        *,
        concat_dim: str = "time",
    ) -> dict:
        """Emit a combined kerchunk JSON manifest for the collection.

        Produces a single JSON sidecar that points at every timestep's
        source file — downstream consumers open the entire cube as a
        lazy Zarr-backed xarray with zero data rewrite.

        Currently routes through
        :func:`pyramids.netcdf._kerchunk.combine_kerchunk`, which
        handles NetCDF/HDF5 sources. GeoTIFF backing is a follow-on
        (kerchunk's tiff support requires `tifffile`).

        Args:
            output_path: Path where the manifest JSON is written.
            concat_dim: Dimension along which to concatenate per-file
                coordinates. Default `"time"`.

        Returns:
            dict: The combined manifest.

        Raises:
            ImportError: When kerchunk is not installed.
            RuntimeError: When the collection has no files list.
        """
        if self._files is None or len(self._files) == 0:
            raise RuntimeError(
                "DatasetCollection.to_kerchunk requires a file-backed "
                "collection. Use DatasetCollection.from_files(...) to "
                "construct one."
            )
        # current backend only handles HDF5 / NetCDF. Detect
        # GeoTIFF inputs and raise a clear NotImplementedError rather
        # than letting kerchunk.hdf produce a confusing failure mode.
        geotiff_exts = {".tif", ".tiff", ".cog"}
        geotiff_files = [
            p
            for p in self._files
            if any(str(p).lower().endswith(ext) for ext in geotiff_exts)
        ]
        if geotiff_files:
            raise NotImplementedError(
                "to_kerchunk currently supports NetCDF / HDF5 source files "
                "only. GeoTIFF support requires kerchunk.tiff + the "
                "tifffile backend which is not yet wired up. Offending "
                f"files: {geotiff_files[:3]}"
                f"{' ...' if len(geotiff_files) > 3 else ''}"
            )
        from pyramids.netcdf._kerchunk import combine_kerchunk

        return combine_kerchunk(
            self._files,
            output_path,
            concat_dims=(concat_dim,),
            identical_dims=(),
        )

    def to_zarr(
        self,
        store,
        *,
        compute: bool = True,
        mode: str = "w",
        storage_options: dict | None = None,
    ):
        """Serialise the 4-D `(T, B, R, C)` cube to a Zarr store.

        Each dask chunk in `self.data` lands in an independent Zarr
        chunk file — the only truly parallel raster output path pyramids
        offers. Geobox metadata (epsg, geotransform, nodata, band_names,
        time_length) is written as attributes on the root group + the
        `data` array following the standard `crs_wkt` / `GeoTransform`
        attribute convention, so downstream `xr.open_zarr(store)` consumers
        can reconstruct the geobox without pyramids.

        Args:
            store: Target store (path, fsspec URL, or zarr.Store).
            compute: `True` (default) writes immediately; `False`
                returns a :class:`dask.delayed.Delayed`.
            mode: Zarr open mode, typically `"w"` (fresh) or `"a"`.
            storage_options: Optional dict forwarded to
                :func:`fsspec.get_mapper` for cloud stores.

        Returns:
            `None` on `compute=True`; a :class:`dask.delayed.Delayed`
            on `compute=False`.

        Raises:
            OptionalPackageDoesNotExist: When the `[lazy]` extra is not
                installed.
            RuntimeError: When the collection has no files list.
        """
        if self._files is None or len(self._files) == 0:
            raise RuntimeError(
                "DatasetCollection.to_zarr requires a file-backed "
                "collection. Use DatasetCollection.from_files(...) to "
                "construct one."
            )
        import_zarr(
            "DatasetCollection.to_zarr requires the optional 'zarr' "
            "dependency. Install with one of:\n"
            "  - PyPI:        pip install 'pyramids-gis[lazy]'\n"
            "  - conda-forge: conda install -c conda-forge pyramids-lazy"
        )
        data = self.data
        resolved_store = _resolve_store(store, storage_options)
        write_result = data.to_zarr(
            resolved_store,
            component="data",
            overwrite=(mode == "w"),
            compute=compute,
        )
        if compute:
            _finalize_collection_metadata(resolved_store, self._meta, self._files)
            result: Any = None
        else:
            import dask

            result = dask.delayed(_finalize_after_write)(
                write_result,
                resolved_store,
                self._meta,
                self._files,
            )
        return result

    def to_netcdf(
        self,
        path: str | Path,
        *,
        time_dim: str = "time",
        time_coords: "Sequence[Any] | None" = None,
        var_per_band: bool = True,
    ) -> None:
        """Write the collection's ``(T, B, Y, X)`` cube to a single NetCDF.

        Materialises every timestep in memory, builds an
        :class:`xarray.Dataset`, and hands it to
        :meth:`pyramids.netcdf.NetCDF.from_xarray` (which routes through
        pyramids' own GDAL multidimensional NetCDF writer — no
        ``netcdf4`` / ``h5netcdf`` engine plug-in needed). The result is
        a self-describing NetCDF with one variable per band (``CF-1.8``
        ``Conventions`` attr; geobox attached as ``crs_wkt`` /
        ``GeoTransform`` root attrs).

        For huge cubes prefer :meth:`to_zarr` — this writer is
        eager (materialises the full T×B×Y×X array) since
        ``NetCDF.from_xarray`` itself materialises.

        No-data values are written as a ``nodata`` attribute on the root
        group and on each data variable. GDAL's multidim NetCDF writer
        rejects CF's standard ``_FillValue`` attribute via this code
        path, so the round-trip uses ``nodata`` for compatibility.

        Args:
            path: Output ``.nc`` path.
            time_dim: Name of the time dimension. Default ``"time"``.
            time_coords: Sequence of length ``time_length`` for the
                time axis values (e.g. ``pd.date_range(...)``). ``None``
                (default) emits a 0..T-1 integer index with a ``note``
                attr explaining it is positional, not calendar.
            var_per_band: When ``True`` (default), each band becomes its
                own data variable named after :attr:`meta.band_names`
                — CF-friendly and what :func:`aggregate_netcdf`-style
                consumers usually expect. When ``False``, one 4-D
                ``data`` variable is written with a ``band`` coordinate
                — saner for hyperspectral cubes with hundreds of bands.

        Raises:
            OptionalPackageDoesNotExist: When ``xarray`` is not
                installed. Install with one of: PyPI
                ``pip install 'pyramids-gis[xarray]'`` or conda-forge
                ``conda install -c conda-forge pyramids-xarray``.
            ValueError: When ``len(time_coords) != self.time_length``.
            RuntimeError: When :meth:`NetCDF.from_xarray` fails to write
                the file.

        Examples:
            - Stack two single-band rasters into one NetCDF and reopen it:
                ```python
                >>> import os, tempfile
                >>> import numpy as np
                >>> from pyramids.dataset import Dataset, DatasetCollection
                >>> from pyramids.netcdf import NetCDF
                >>> d = tempfile.mkdtemp()
                >>> paths = []
                >>> for i in range(2):
                ...     arr = (np.arange(20, dtype="int16").reshape(4, 5) + 100 * i)
                ...     p = os.path.join(d, f"t{i}.tif")
                ...     _ = Dataset.create_from_array(
                ...         arr, top_left_corner=(0, 0), cell_size=0.05, epsg=4326,
                ...         no_data_value=-9999, path=p,
                ...     ).close()
                ...     paths.append(p)
                >>> col = DatasetCollection.from_files(paths)
                >>> out = os.path.join(d, "cube.nc")
                >>> col.to_netcdf(out)
                >>> nc = NetCDF.read_file(out)
                >>> "Band_1" in nc.variables
                True
                >>> nc.epsg
                4326

                ```

        See Also:
            - :meth:`to_zarr`: parallel chunk-by-chunk writer; preferred
              for very large cubes.
            - :meth:`to_kerchunk`: emit a sidecar that points back at
              the source files without rewriting data.
            - :meth:`pyramids.netcdf.NetCDF.from_xarray`: the underlying
              writer.
        """
        try:
            import xarray as xr
        except ImportError as exc:
            raise OptionalPackageDoesNotExist(
                "DatasetCollection.to_netcdf requires the optional 'xarray' "
                "dependency. Install with one of:\n"
                "  - PyPI:        pip install 'pyramids-gis[xarray]'\n"
                "  - conda-forge: conda install -c conda-forge pyramids-xarray"
            ) from exc

        if time_coords is not None:
            # Materialise generators / iterators up front so np.asarray gets a
            # sized sequence (an iterator yields a 0-d object array, which
            # would trip a cryptic IndexError below).
            if not hasattr(time_coords, "__len__"):
                time_coords = list(time_coords)
            time_values = np.asarray(time_coords)
            if time_values.dtype.kind == "O":
                # pd.DatetimeIndex → datetime64 via asarray, but lists of
                # datetime / Timestamp objects come through as dtype=object.
                # Coerce so the datetime branch below picks them up.
                try:
                    time_values = np.asarray(time_values, dtype="datetime64[ns]")
                except (TypeError, ValueError):
                    pass
            if time_values.shape[0] != self.time_length:
                raise ValueError(
                    f"time_coords has {time_values.shape[0]} entries but "
                    f"the collection has {self.time_length} timesteps"
                )
            time_attrs: dict = {}
            if time_values.shape[0] > 1 and time_values.dtype.kind in "iufM":
                ordered = np.sort(time_values)
                if not np.array_equal(time_values, ordered):
                    warnings.warn(
                        "time_coords is not monotonically increasing; some "
                        "downstream tools (xr.open_dataset, aggregate_netcdf) "
                        "may reorder or refuse the axis",
                        stacklevel=2,
                    )
                if np.unique(time_values).size != time_values.size:
                    warnings.warn(
                        "time_coords contains duplicate values; downstream "
                        "indexers may pick an arbitrary timestep",
                        stacklevel=2,
                    )
            if time_values.dtype.kind == "M":
                # GDAL's multidim writer has no native datetime64 type; encode
                # as an int64 offset with CF `units` so xr.open_dataset can
                # decode it back to a calendar axis on read. Use nanosecond
                # resolution so the round-trip is lossless for the full
                # datetime64[ns] range (the CF "nanoseconds since …" time
                # unit).
                epoch = np.datetime64("1970-01-01", "ns")
                ns = (time_values.astype("datetime64[ns]") - epoch).astype("int64")
                time_values = ns
                time_attrs["units"] = "nanoseconds since 1970-01-01 00:00:00"
                time_attrs["calendar"] = "proleptic_gregorian"
        else:
            time_values = np.arange(self.time_length, dtype="int64")
            time_attrs = {
                "long_name": "time index",
                "note": "positional index, not a calendar time",
            }

        meta = self._meta
        nodata = (meta.nodata or (None,))[0]
        band_count = int(meta.shape[0])
        names: list[str] = (
            list(meta.band_names)
            if meta.band_names
            else [f"band_{i + 1}" for i in range(band_count)]
        )

        # Per-timestep read_array() returns (rows, cols) for a single-band
        # dataset and (bands, rows, cols) for multi-band, so np.stack gives
        # (T, rows, cols) or (T, bands, rows, cols). Insert a length-1 band
        # axis on the single-band path so the rest of this method can treat
        # the cube uniformly as (T, B, Y, X).
        cube = np.stack([np.asarray(ds.read_array()) for ds in self.datasets], axis=0)
        if cube.ndim == 3:
            cube = cube[:, np.newaxis, :, :]

        y_coord = np.asarray(self._base.y)
        x_coord = np.asarray(self._base.x)

        if var_per_band:
            data_vars = {
                names[i]: ((time_dim, "y", "x"), cube[:, i, :, :])
                for i in range(band_count)
            }
            coords = {
                time_dim: (time_dim, time_values, time_attrs),
                "y": ("y", y_coord),
                "x": ("x", x_coord),
            }
        else:
            # GDAL's multidim NetCDF writer can't write a string coord, so the
            # band axis carries an integer index and the human names ride along
            # on the root group as a ``band_names`` attribute. Round-trips are
            # lossless via ``xr.open_dataset``: caller reads
            # ``ds.attrs["band_names"]`` to recover the labels.
            data_vars = {"data": ((time_dim, "band", "y", "x"), cube)}
            coords = {
                time_dim: (time_dim, time_values, time_attrs),
                "band": ("band", np.arange(band_count)),
                "y": ("y", y_coord),
                "x": ("x", x_coord),
            }

        root_attrs: dict = {"Conventions": "CF-1.8"}
        try:
            crs_wkt = meta.crs.to_wkt() if meta.crs is not None else None
        except AttributeError:
            crs_wkt = None
        if crs_wkt:
            root_attrs["crs_wkt"] = crs_wkt
        if meta.epsg is not None:
            root_attrs["epsg"] = int(meta.epsg)
        root_attrs["GeoTransform"] = " ".join(str(v) for v in meta.geotransform)
        if not var_per_band:
            root_attrs["band_names"] = ",".join(names)

        if nodata is not None:
            typed_nodata = np.asarray(nodata, dtype=cube.dtype).item()
            # GDAL's multidim NetCDF writer rejects ``_FillValue`` as an
            # attribute (libnetcdf wants it set via the dedicated typed-fill
            # API the writer doesn't expose) and silently drops anything set
            # through ``xr.encoding``. Surface the no-data value under a
            # ``nodata`` attribute instead — both on the root group (matches
            # the root attrs ``to_zarr`` writes) and on every
            # data variable, so consumers can recover it.
            root_attrs["nodata"] = typed_nodata
        ds = xr.Dataset(data_vars=data_vars, coords=coords, attrs=root_attrs)
        if nodata is not None:
            target_vars = names if var_per_band else ["data"]
            for v_name in target_vars:
                ds[v_name].attrs["nodata"] = typed_nodata

        # Inline import: pyramids.netcdf depends on pyramids.dataset.Dataset,
        # so hoisting this to the module top would form a circular import
        # through pyramids.dataset.__init__. Matches the to_kerchunk pattern
        # (see ``to_kerchunk`` above) and CLAUDE.md's circular-import
        # carveout in "Code Style".
        from pyramids.netcdf import NetCDF  # noqa: E402

        NetCDF.from_xarray(ds, path)

    @classmethod
    def from_stac(
        cls,
        items,
        asset: str | Sequence[str],
        *,
        patch_url=None,
        bbox: tuple | None = None,
        max_items: int | None = None,
        signer: Any = None,
        align: bool = True,
        skip_missing: bool = False,
        groupby: str | None = None,
        like: Any = None,
        crs: int | str | None = None,
        resolution: float | None = None,
        bounds=None,
        anchor: str = "edge",
    ) -> DatasetCollection:
        """Build a collection from a STAC ItemCollection.

        Thin forwarder to :func:`pyramids.dataset._stac.from_stac`.
        Duck-typed — accepts :class:`pystac.Item` objects, raw JSON
        dicts, or any iterable of items with `.assets` + `.bbox`
        semantics. pyramids does not depend on pystac.

        Args:
            items: Iterable of STAC Items (pystac objects, raw JSON
                dicts, or any duck-typed equivalent).
            asset: A single asset key (`str`) for a single-asset time
                stack, or a sequence of keys (e.g. `["B04", "B03",
                "B02"]`) to stack those assets band-wise into one
                multi-band raster per timestep (band order = sequence
                order).
            patch_url: Optional low-level callable rewriting each href
                (runs before `signer`).
            bbox: M6 — optional `(minx, miny, maxx, maxy)` filter in
                lon/lat; items whose `bbox` doesn't intersect are
                dropped before hrefs are resolved.
            max_items: M6 — cap the number of items consumed (after
                bbox filtering). Useful for quick-look workflows.
            signer: Optional signer (e.g. a
                :class:`pyramids.stac.signers.Signer`). Its
                `sign_href` rewrites every asset href and its
                `gdal_env()` is captured onto the returned collection so
                every read of the backing files authenticates — making
                Requester-Pays / bearer / SAS catalogs work through
                `from_stac`. See :func:`pyramids.dataset._stac.from_stac`.
            align: Multi-asset only — resample assets at differing
                resolutions onto the first asset's grid (`True`,
                default) or raise on mismatch (`False`).
            skip_missing: Drop items missing any requested asset
                (`True`) instead of raising (`False`, default).
            groupby: `"solar_day"` mosaics same-solar-day items into one
                timestep each (single-asset only); `None` (default) keeps
                one timestep per item.
            like: Optional target-grid :class:`~pyramids.dataset.Dataset`;
                every timestep is aligned onto its CRS + grid. Mutually
                exclusive with `crs`/`resolution`/`bounds`.
            crs: Target CRS for an explicit grid (with `resolution`+`bounds`).
            resolution: Target pixel size for an explicit grid.
            bounds: Target `(minx, miny, maxx, maxy)` for an explicit grid.
            anchor: Grid-snap rule for the explicit grid (`"edge"`).

        Returns:
            DatasetCollection: File-backed collection (or grid-aligned
            collection when `like`/`crs` is given).
        """
        return _from_stac(
            items,
            asset,
            patch_url=patch_url,
            bbox=bbox,
            max_items=max_items,
            signer=signer,
            align=align,
            skip_missing=skip_missing,
            groupby=groupby,
            like=like,
            crs=crs,
            resolution=resolution,
            bounds=bounds,
            anchor=anchor,
        )

    @classmethod
    def from_point(
        cls,
        lat: float,
        lon: float,
        *,
        collection: str,
        bands,
        start_date: str,
        end_date: str,
        edge_size: int,
        resolution: float,
        units: str = "px",
        stac: str | None = None,
        query: Any = None,
        signer: Any = None,
        align: bool = True,
    ) -> DatasetCollection:
        """Build a point-centred STAC cube (cubo-style convenience constructor).

        Thin forwarder to :func:`pyramids.dataset._stac.from_point`: reprojects
        `(lat, lon)` to its local UTM, snaps to the `resolution` grid, expands to
        an `edge_size`-pixel (or -metre) square AOI, searches `collection` over
        that AOI + date range, and stacks the `bands` via :meth:`from_stac`.

        Args:
            lat: Center latitude in degrees (EPSG:4326).
            lon: Center longitude in degrees (EPSG:4326).
            collection: STAC collection id to search.
            bands: A single asset key or a sequence (multi-asset band axis).
            start_date: Search start (`YYYY-MM-DD` / RFC 3339).
            end_date: Search end (`YYYY-MM-DD` / RFC 3339).
            edge_size: Cube side length, in pixels (`units="px"`) or metres.
            resolution: Pixel size in metres.
            units: `"px"` (default) or `"m"`.
            stac: STAC API root URL; `None` uses the Planetary Computer default.
            query: Optional STAC `query` extension dict.
            signer: Optional signer, forwarded to the search and the reads.
            align: Multi-asset resolution policy (see :meth:`from_stac`).

        Returns:
            DatasetCollection: A time-stacked cube over the point AOI.
        """
        kwargs: dict[str, Any] = dict(
            collection=collection,
            bands=bands,
            start_date=start_date,
            end_date=end_date,
            edge_size=edge_size,
            resolution=resolution,
            units=units,
            query=query,
            signer=signer,
            align=align,
        )
        if stac is not None:
            kwargs["stac"] = stac
        return _from_point(lat, lon, **kwargs)

    @classmethod
    def from_files(
        cls,
        files: list[str | Path],
        *,
        meta: RasterMeta | None = None,
        gdal_env: dict[str, str] | None = None,
    ) -> DatasetCollection:
        """Build a collection from a list of files without pre-opening all.

        Only the first file is opened eagerly (to derive
        :class:`RasterMeta`). The remaining files are referenced by
        path only — lazy readers open them on demand through
        :class:`~pyramids.base._file_manager.CachingFileManager`.

        Args:
            files: Sequence of file paths backing each timestep.
            meta: Optional pre-computed :class:`RasterMeta`. When
                omitted, derived from the first file via
                :meth:`RasterMeta.from_dataset`.
            gdal_env: Optional GDAL config (e.g. a signer's
                `gdal_env()`) installed around every open of the backing
                files, including the eager template open below. Persisted
                on the collection for the lazy read paths. `None` (default)
                installs no extra config.

        Returns:
            DatasetCollection: A new collection whose `time_length`
            matches `len(files)`.

        Raises:
            ValueError: When `files` is empty.
        """
        resolved = [str(p) for p in files]
        if not resolved:
            raise ValueError("files must contain at least one path")
        with cloud_config_from_env(gdal_env):
            template = Dataset.read_file(resolved[0])
            if meta is None:
                meta = RasterMeta.from_dataset(template)
        return cls(template, len(resolved), files=resolved, meta=meta, gdal_env=gdal_env)

    @classmethod
    def from_archive(
        cls,
        url_or_path: str | Path,
        *,
        kind: str = "auto",
        member_glob: str = "*",
        meta: RasterMeta | None = None,
    ) -> DatasetCollection:
        """Build a collection from the raster members of an archive.

        Lists the archive's members (locally or over the network — a remote ZIP
        is read via the chained ``/vsizip//vsicurl/…`` path) and hands them to
        :meth:`from_files`, so each matching member becomes one timestep. Only
        the first member is opened eagerly; the rest are opened on demand.

        For "merge all members into one multi-band :class:`Dataset`" (bands,
        not timesteps) use :meth:`pyramids.dataset.Dataset.from_archive`.

        The archive's file name must carry a recognised extension (``.zip`` /
        ``.tar`` / ``.tar.gz`` / ``.gz``) — GDAL's archive handlers key off the
        extension. An extension-less download URL (e.g. an Earth Engine
        ``getDownloadURL`` ending in ``:getPixels``) must first be fetched and
        saved with a ``.zip`` name (or written to ``/vsimem/<name>.zip`` via
        :func:`osgeo.gdal.FileFromMemBuffer`) before calling this.

        Args:
            url_or_path: Path or URL of the archive (``.zip`` / ``.tar`` /
                ``.tar.gz`` / ``.gz``).
            kind: Archive kind — ``"zip"``, ``"tar"`` (also ``"tar.gz"`` /
                ``"tgz"``), ``"gzip"`` (also ``"gz"``), or ``"auto"`` (default,
                infer from the extension).
            member_glob: :mod:`fnmatch` pattern selecting which members to
                include, applied to top-level member names and sorted. Default
                ``"*"`` (all). Pass e.g. ``"*.tif"`` to skip sidecar files.
            meta: Optional pre-computed :class:`RasterMeta` for the timesteps.

        Returns:
            DatasetCollection: A collection whose ``time_length`` is the number
            of matching members.

        Raises:
            FileFormatNotSupportedError: ``kind="auto"`` and the extension is
                not recognised, or the archive could not be listed.
            FileNotFoundError: No member matched ``member_glob``.
            ValueError: ``kind`` is not a recognised archive kind.
        """
        dir_vsi = _io._archive_dir_vsi(url_or_path, kind)
        members = _io._archive_members(dir_vsi, member_glob)
        member_paths = [f"{dir_vsi}/{m}" for m in members]
        return cls.from_files(member_paths, meta=meta)

    @classmethod
    def read_multiple_files(
        cls,
        path: str | Path | list[str | Path],
        with_order: bool = False,
        regex_string: str = r"\d{4}.\d{2}.\d{2}",
        date: bool = True,
        file_name_data_fmt: str | None = None,
        start: str | None = None,
        end: str | None = None,
        fmt: str = "%Y-%m-%d",
        extension: str = ".tif",
    ) -> DatasetCollection:
        r"""read_multiple_files.

            - Read rasters from a folder (or list of files) and create a 3D array with the same 2D dimensions as the
              first raster and length equal to the number of files.

            - All rasters should have the same dimensions.
            - If you want to read the rasters with a certain order, the raster file names should contain a date
              that follows a consistent format (YYYY.MM.DD / YYYY-MM-DD or YYYY_MM_DD), e.g. "MSWEP_1979.01.01.tif".

        Args:
            path (str | list[str]):
                Path of the folder that contains all the rasters, or a list containing the paths of the rasters to read.
            with_order (bool):
                True if the raster names follow a certain order. Then the raster names should have a date that follows
                the same format (YYYY.MM.DD / YYYY-MM-DD or YYYY_MM_DD). For example:

                ```python
                >>> "MSWEP_1979.01.01.tif"
                >>> "MSWEP_1979.01.02.tif"
                >>> ...
                >>> "MSWEP_1979.01.20.tif"

                ```

            regex_string (str):
                A regex string used to locate the date in the file names. Default is r"\d{4}.\d{2}.\d{2}". For example:

                ```python
                >>> fname = "MSWEP_YYYY.MM.DD.tif"
                >>> regex_string = r"\d{4}.\d{2}.\d{2}"
                ```

                - Or:

                ```python
                >>> fname = "MSWEP_YYYY_M_D.tif"
                >>> regex_string = r"\d{4}_\d{1}_\d{1}"
                ```

                - If there is a number at the beginning of the name:

                ```python
                >>> fname = "1_MSWEP_YYYY_M_D.tif"
                >>> regex_string = r"\d+"
                ```

            date (bool):
                True if the number in the file name is a date. Default is True.
            file_name_data_fmt (str):
                If the file names contain a date and you want to read them ordered. Default is None. For example:

                ```python
                >>> "MSWEP_YYYY.MM.DD.tif"
                >>> file_name_data_fmt = "%Y.%m.%d"
                ```

            start (str):
                Start date if you want to read the input raster for a specific period only and not all rasters. If not
                given, all rasters in the given path will be read.
            end (str):
                End date if you want to read the input rasters for a specific period only. If not given, all rasters in
                the given path will be read.
            fmt (str):
                Format of the given date in the start/end parameter.
            extension (str):
                The extension of the files you want to read from the given path. Default is ".tif".

        Returns:
            DatasetCollection:
                Instance of the DatasetCollection class.

        Examples:
            - Read all rasters in a folder:

              ```python
              >>> from pyramids.dataset import DatasetCollection
              >>> raster_folder = "examples/GIS/data/raster-folder"
              >>> prec = DatasetCollection.read_multiple_files(raster_folder)

              ```

            - Read from a pre-collected list without ordering:

              ```python
              >>> raster_folder = Path("examples/GIS/data/raster-folder")
              >>> file_list = list(raster_folder.glob("*.tif"))
              >>> prec = DatasetCollection.read_multiple_files(file_list, with_order=False)

              ```
        """
        if not isinstance(path, (str, Path, list)):
            raise TypeError(
                f"path input should be string/Path/list type, given: {type(path)}"
            )

        if isinstance(path, (str, Path)):
            path = Path(path)
            # check whither the path exists or not
            if not path.exists():
                raise FileNotFoundError("The path you have provided does not exist")
            # get a list of all files
            files = [f.name for f in path.iterdir() if f.name.endswith(extension)]
            # check whether there are files or not inside the folder
            if len(files) < 1:
                raise FileNotFoundError("The path you have provided is empty")
        else:
            files = [str(p) for p in path]

        # to sort the files in the same order as the first number in the name
        if with_order:
            match_str_fn = lambda x: re.search(regex_string, x)
            list_dates = list(map(match_str_fn, files))

            if None in list_dates:
                raise ValueError(
                    "The date format/separator given does not match the file names"
                )
            if date:
                if file_name_data_fmt is None:
                    raise ValueError(
                        f"To read the raster with a certain order (with_order = {with_order}, then you have to enter "
                        f"the value of the parameter file_name_data_fmt(given: {file_name_data_fmt})"
                    )
                fn: Callable[[Any], Any] = lambda x: dt.datetime.strptime(
                    x.group(), file_name_data_fmt
                )
            else:
                fn = lambda x: int(x.group())
            list_dates = list(map(fn, list_dates))

            df = pd.DataFrame()
            df["files"] = files
            df["date"] = list_dates
            df.sort_values("date", inplace=True, ignore_index=True)
            files = df.loc[:, "files"].values

        if start is not None or end is not None:
            if date:
                start_dt: Any = dt.datetime.strptime(str(start), fmt)
                end_dt: Any = dt.datetime.strptime(str(end), fmt)

                files = (
                    df.loc[start_dt <= df["date"], :]
                    .loc[df["date"] <= end_dt, "files"]
                    .values
                )
            else:
                files = (
                    df.loc[start <= df["date"], :]
                    .loc[df["date"] <= end, "files"]
                    .values
                )

        if not isinstance(path, list):
            # add the path to all the files
            files = [f"{path}/{i}" for i in files]
        # create a 3d array with the 2d dimension of the first raster and the len
        # of the number of rasters in the folder
        sample = Dataset.read_file(files[0])

        return cls(sample, len(files), files)

    @property
    def values(self) -> np.ndarray:
        """Materialise the per-timestep arrays as a 3D numpy cube.

        **Derived, not cached.** Every access reads each timestep's
        first band via :meth:`Dataset.read_array` and stacks the
        result into ``(time, rows, cols)``. There is no stored cube
        that can drift from the canonical :attr:`datasets` source;
        callers that want repeated access should hold the returned
        array locally.

        Returns:
            np.ndarray: A fresh ``(time_length, rows, cols)`` float
                array each call.
        """
        return np.stack([ds.read_array(band=0) for ds in self.datasets], axis=0)

    @values.setter
    def values(self, val: np.ndarray) -> None:
        """Replace per-timestep Datasets with MEM Datasets built from a 3D array.

        Each slice ``val[i]`` becomes a new MEM-backed
        :class:`~pyramids.dataset.Dataset` cloned from the base
        template's georef with the slice written into band 1.
        Replaces — does not merge with — any current
        :attr:`datasets`. ``_files`` is cleared because the in-memory
        result no longer corresponds to the disk paths.

        Args:
            val: A ``(time_length, rows, cols)`` numpy array.

        Raises:
            ValueError: If ``val`` is not 3D, or if its first axis
                length disagrees with the existing :attr:`time_length`
                (when the collection has already been sized).
        """
        if val.ndim != 3:
            raise ValueError(
                f"values must be a 3D array (time, rows, cols); got "
                f"shape {val.shape}"
            )
        if (
            self._datasets is not None
            and self._datasets
            and val.shape[0] != self._time_length
        ):
            raise ValueError(
                f"The dimension of the new data: {val.shape}, differs "
                f"from the dimension of the original dataset: "
                f"({self._time_length}, {self.rows}, {self.columns}); "
                f"please redefine the base Dataset and dataset_length first"
            )
        # Build a fresh MEM Dataset per timestep using the INPUT
        # dtype (not the base template's). Cloning the base preserves
        # geo-ref but forces a dtype cast that silently lossy-rounds
        # if the base is e.g. float32 and the input is float64.
        # Using ``Dataset.create_from_array`` gives us the input
        # dtype back verbatim and reuses the base's georef explicitly.
        from pyramids.dataset.dataset import Dataset as _Dataset

        new_datasets = []
        for i in range(val.shape[0]):
            ds = _Dataset.create_from_array(
                arr=val[i],
                geo=self._base.geotransform,
                epsg=self._base.epsg,
                no_data_value=self._base.no_data_value[0],
            )
            new_datasets.append(ds)
        self._datasets = new_datasets
        self._time_length = val.shape[0]
        self._files = None

    def open_multi_dataset(self, band: int = 0) -> None:
        """Deprecated no-op (legacy API).

        The eager ``_values`` cube this method used to populate is
        gone. Per-timestep ``Dataset`` handles open lazily via
        :attr:`datasets` on first access; the legacy ``values`` /
        ``__getitem__`` / ``head`` / ``first`` views materialise on
        demand from those handles. There is nothing for this method
        to do.

        Kept as a callable shim so legacy code that does
        ``dc.open_multi_dataset()`` before reading ``.values`` still
        runs without modification. New code should not call it.

        Args:
            band: Ignored. The full per-timestep band selection
                happens inside :meth:`Dataset.read_array(band=...)`.
        """
        del band  # unused
        return None

    def __getitem__(self, key) -> np.ndarray:
        """Return one or more timestep arrays, indexed along the time axis.

        Equivalent to ``self.values[key]`` but with one slight
        optimisation: an integer ``key`` reads only that timestep's
        Dataset (never materialises the full cube).

        Args:
            key: Integer index or slice along the time axis.

        Returns:
            np.ndarray: A 2D array (single int) or a 3D array (slice).
        """
        if isinstance(key, int):
            return self.datasets[key].read_array(band=0)
        return self.values[key]

    def __setitem__(self, key: int, value: np.ndarray) -> None:
        """Replace a single timestep's Dataset with a MEM Dataset built from ``value``.

        Args:
            key (int): Integer index along the time axis.
            value (np.ndarray): A 2D ``(rows, cols)`` array.

        Raises:
            TypeError: If ``key`` is not an integer (slice assignment
                is not supported; rebuild the collection instead).
        """
        if not isinstance(key, int):
            raise TypeError(
                f"DatasetCollection.__setitem__ only accepts an integer "
                f"index along the time axis; got {type(key).__name__}. "
                f"Rebuild the collection if you need bulk replacement."
            )
        # Materialise the cache (so we have a list to modify) without
        # building the full cube. Use ``create_from_array`` so the
        # input array's dtype is preserved (CreateCopy on the base
        # would cast through whatever dtype the base happened to use).
        from pyramids.dataset.dataset import Dataset as _Dataset

        datasets = self.datasets
        datasets[key] = _Dataset.create_from_array(
            arr=value,
            geo=self._base.geotransform,
            epsg=self._base.epsg,
            no_data_value=self._base.no_data_value[0],
        )
        # The mutation breaks the disk correspondence for that slot;
        # if the user mutates any timestep, the lazy reductions can no
        # longer trust ``_files``. Drop the path list so they fall
        # through to the in-memory handles instead.
        self._files = None

    def __len__(self):
        """Number of timesteps in the collection."""
        return self._time_length

    def __iter__(self):
        """Iterate over per-timestep arrays (matches the legacy API)."""
        for ds in self.datasets:
            yield ds.read_array(band=0)

    def head(self, n: int = 5) -> np.ndarray:
        """First ``n`` timestep arrays as a 3D numpy slice.

        Args:
            n (int): Number of timesteps. Defaults to 5.

        Returns:
            np.ndarray: ``(min(n, time_length), rows, cols)`` array.
        """
        return self.values[:n]

    def tail(self, n: int = -5) -> np.ndarray:
        """Last ``-n`` timestep arrays as a 3D numpy slice.

        Matches the legacy signature: a NEGATIVE ``n`` (the default
        ``-5``) means "last 5". Implementation simply does
        ``self.values[n:]``, so a positive ``n`` would skip the first
        ``n`` rows instead — that's the legacy behaviour and left as
        is for back-compat.

        Args:
            n (int): Negative integer giving the offset from the
                end. Defaults to ``-5`` (last 5).

        Returns:
            np.ndarray: ``(abs(n), rows, cols)`` array (when ``n < 0``).
        """
        return self.values[n:]

    def first(self) -> np.ndarray:
        """First timestep array (2D).

        Cheaper than ``self.values[0]`` because it only reads one
        timestep instead of the full cube.
        """
        return self.datasets[0].read_array(band=0)

    def last(self) -> np.ndarray:
        """Last timestep array (2D).

        Cheaper than ``self.values[-1]`` because it only reads one
        timestep instead of the full cube.
        """
        return self.datasets[-1].read_array(band=0)

    def iloc(self, i: int) -> Dataset:
        """Return the ``Dataset`` at position ``i``.

        Args:
            i (int):
                Index of the timestep to access.

        Returns:
            Dataset: The lazy ``Dataset`` handle at position ``i``.
            Pixel values are not loaded — they're read on demand when
            the caller invokes a method on the returned Dataset.
        """
        return self.datasets[i]

    def plot(
        self, band: int = 0, exclude_value: Any | None = None, **kwargs: Any
    ) -> ArrayGlyph:
        r"""Render the collection as an animated stack of band slices.

            - read the values stored in a given band across every
              ``Dataset`` in the collection and hand the resulting
              ``(time, rows, cols)`` array to cleopatra's animation
              path.

        Implementation note: this method is a thin caller around the
        shared :func:`pyramids.dataset._plot_helpers.render_array`
        helper. It stacks one band per ``Dataset`` into a 3-D array
        and forwards to ``render_array(..., mode="animate",
        animation_axis_values=...)``. The duplicated ``ArrayGlyph``
        construction that used to live here is gone — the helper owns
        the cleopatra dispatch and the same code path serves the
        single-frame ``Dataset.plot`` and the multi-panel
        ``NetCDF.plot`` facets. See
        :mod:`pyramids.dataset._plot_helpers` for the three-mode
        contract.

        Args:
            band (int):
                The band you want to get its data. Default is 0.
            exclude_value (Any):
                Value to exclude from the plot. Default is None.
            **kwargs:
                | Parameter                  | Type                  | Description |
                |----------------------------|-----------------------|-------------|
                | points                     | array                 | 3-column array: col 1 = value to display, col 2 = row index, col 3 = column index. Columns 2 and 3 indicate the location of the point. |
                | point_color                | str                   | Color of the points. |
                | point_size                 | Any                   | Size of the points. |
                | pid_color                  | str                   | Color of the annotation of the point. Default is blue. |
                | pid_size                   | Any                   | Size of the point annotation. |
                | figsize                    | tuple, optional       | Figure size. Default is `(8, 8)`. |
                | title                      | str, optional         | Title of the plot. Default is `'Total Discharge'`. |
                | title_size                 | int, optional         | Title size. Default is `15`. |
                | orientation                | str, optional         | Orientation of the color bar (`horizontal` or `vertical`). Default is `'vertical'`. |
                | rotation                   | number, optional      | Rotation of the color bar label. Default is `-90`. |
                | cbar_length                | float, optional       | Ratio to control the height of the color bar. Default is `0.75`. |
                | ticks_spacing              | int, optional         | Spacing in the color bar ticks. Default is `2`. |
                | cbar_label_size            | int, optional         | Size of the color bar label. Default is `12`. |
                | cbar_label                 | str, optional         | Label of the color bar. Default is `'Discharge m³/s'`. |
                | color_scale                | str, optional         | Color-scale mode (default `"linear"`): one of `"linear"`, `"power"`, `"sym-lognorm"`, `"boundary-norm"`, `"midpoint"` (case-insensitive), or a `cleopatra.styles.ColorScale` member. Integer codes are no longer accepted. |
                | gamma                      | float, optional       | Exponent for `color_scale="power"`. Default is `1/2`. |
                | line_threshold             | float, optional       | `linthresh` for `color_scale="sym-lognorm"`. Default is `0.0001`. |
                | line_scale                 | float, optional       | `linscale` for `color_scale="sym-lognorm"`. Default is `0.001`. |
                | bounds                     | list                  | Discrete bounds for `color_scale="boundary-norm"`. Default is `None`. |
                | midpoint                   | float, optional       | Midpoint value for `color_scale="midpoint"`. Default is `0`. |
                | cmap                       | str, optional         | Color map style. Default is `'coolwarm_r'`. |
                | display_cell_value         | bool                  | Whether to display the values of the cells as text. |
                | num_size                   | int, optional         | Size of the numbers plotted on top of each cell. Default is `8`. |
                | background_color_threshold | float \| int, optional| Threshold for deciding number color: if value > threshold -> black; else white. If `None`, uses `max_value/2`. Default is `None`. |


        Returns:
            ArrayGlyph: A plotting/animation handle (from cleopatra.ArrayGlyph).
        """
        # Materialise the cube on demand for plotting. The render helper
        # expects a single (time, rows, cols) numpy array; reading each
        # Dataset's band into one stacked array is fine for a plot call
        # (the user explicitly asked to render). Delegates the cleopatra
        # call to :func:`render_array` (D-2 — shared with `Analysis.plot`).
        data = np.stack([ds.read_array(band=band) for ds in self.datasets], axis=0)
        exclude_value = (
            [self.base.no_data_value[band], exclude_value]
            if exclude_value is not None
            else [self.base.no_data_value[band]]
        )
        return render_array(
            arr=data,
            exclude_value=exclude_value,
            mode="animate",
            animation_axis_values=list(range(self.time_length)),
            **kwargs,
        )

    def to_file(
        self,
        path: str | Path | list[str | Path],
        driver: str = "geotiff",
        band: int = 0,
    ):
        """Save to geotiff format.

            saveRaster saves a raster to a path

        Args:
            path (str | list[str]):
                a path includng the name of the raster and extention.
            driver (str):
                driver = "geotiff".
            band (int):
                band index, needed only in case of ascii drivers. Default is 1.

        Examples:
            - Save to a file:

              ```python
              >>> raster_obj = Dataset.read_file("path/to/file/***.tif")
              >>> output_path = "examples/GIS/data/save_raster_test.tif"
              >>> raster_obj.to_file(output_path)

              ```
        """
        ext = CATALOG.get_extension(driver)

        if isinstance(path, (str, Path)):
            path = Path(path)
            if not path.exists():
                path.mkdir(parents=True, exist_ok=True)
            path = [str(path / f"{i}.{ext}") for i in range(self.time_length)]
        else:
            if not len(path) == self.time_length:
                raise ValueError(
                    f"Length of the given paths: {len(path)} does not equal number of rasters in the data cube: {self.time_length}"
                )
            path_list = [Path(p) for p in path]
            parent = path_list[0].parent
            if not parent.exists():
                parent.mkdir(parents=True, exist_ok=True)

        for i in range(self.time_length):
            src = self.iloc(i)
            arr = src.read_array()
            transient = Dataset.create_from_array(
                arr=arr,
                geo=src.geotransform,
                epsg=src.epsg,
                no_data_value=src.no_data_value[0],
            )
            transient.to_file(path[i], band=band)
            transient.close()
        self._datasets = None

    def to_cog_stack(
        self,
        directory: str | Path,
        *,
        pattern: str = "{name}_{i:04d}.tif",
        name: str = "slice",
        overwrite: bool = False,
        **cog_kwargs: Any,
    ) -> list[Path]:
        """Export each time slice of the collection as an individual COG.

        Args:
            directory: Output directory; created if missing.
            pattern: Filename template. Placeholders:

                - `{name}` — the `name` argument (default `'slice'`);
                - `{i}` — zero-padded integer index.

                The `{t}` placeholder is reserved for a future task
                that adds a time-coordinate axis; using it now raises
                :class:`ValueError`.
            name: Replacement for the `{name}` placeholder.
            overwrite: If `False`, raise :class:`FileExistsError`
                when a target path already exists.
            **cog_kwargs: Forwarded verbatim to
                :meth:`pyramids.dataset.engines.COG.to_cog`.

        Returns:
            List of written file paths, in temporal (index) order.

        Raises:
            DatasetNotFoundError: :meth:`open_multi_dataset` has not been
                called, so per-slice arrays are not loaded.
            ValueError: `{t}` placeholder used but no time coord is
                available.
            FileExistsError: `overwrite=False` and a target path exists.

        Examples:
            - Default naming — one COG per slice:
                ```python
                >>> dc.to_cog_stack("out/", compress="ZSTD")  # doctest: +SKIP
                [PosixPath('out/slice_0000.tif'), ..., PosixPath('out/slice_0002.tif')]

                ```
            - Custom filename pattern and name prefix:
                ```python
                >>> dc.to_cog_stack(  # doctest: +SKIP
                ...     "band4/",
                ...     pattern="B04_{i:03d}.tif",
                ...     name="B04",
                ... )
                [PosixPath('band4/B04_000.tif'), ...]

                ```
            - Overwrite existing outputs and forward COG options:
                ```python
                >>> dc.to_cog_stack(  # doctest: +SKIP
                ...     "out/",
                ...     overwrite=True,
                ...     compress="DEFLATE",
                ...     blocksize=256,
                ... )

                ```
        """
        # Check the backing attribute directly rather than going through
        # the `values` property: the property getter raises AttributeError
        # on unpopulated collections, which hasattr catches silently, but
        # a future refactor that changes the exception type would break
        if "{t}" in pattern:
            raise ValueError(
                "{t} placeholder not yet supported; DatasetCollection has "
                "no time-axis coord. Use {i} for integer index."
            )

        out_dir = Path(directory)
        out_dir.mkdir(parents=True, exist_ok=True)

        paths: list[Path] = []
        for i in range(self.time_length):
            filename = pattern.format(name=name, i=i)
            target = out_dir / filename
            if target.exists() and not overwrite:
                raise FileExistsError(
                    f"{target} exists; pass overwrite=True to replace."
                )
            slice_ds = self.iloc(i)
            slice_ds.to_cog(target, **cog_kwargs)
            paths.append(target)
        return paths

    def _apply_per_timestep(
        self, method_name: str, *args: Any, **kwargs: Any
    ) -> list[Dataset]:
        """Apply `Dataset.<method_name>(*args, **kwargs)` to each timestep.

        Iterates over the per-timestep ``Dataset`` handles in
        :attr:`datasets` and dispatches the named method. Each
        per-timestep call returns a new ``Dataset`` (typically a
        MEM-backed result of an internal :func:`gdal.Warp`); the
        list of those results is wrapped in a new collection by
        :meth:`_finalize_per_timestep_result`.

        Nothing materialises the full cube as a numpy array — each
        ``Dataset.<op>`` already streams blocks through GDAL.

        Args:
            method_name: Name of the method to call on each timestep
                Dataset (e.g. ``"to_crs"``, ``"crop"``, ``"align"``).
            *args, **kwargs: Forwarded to the per-timestep call.

        Returns:
            list[Dataset]: One ``Dataset`` per timestep — each is the
                output of calling the named method on the corresponding
                input handle.
        """
        return [getattr(ds, method_name)(*args, **kwargs) for ds in self.datasets]

    def to_crs(
        self,
        to_epsg: int = 3857,
        method: str = "nearest neighbor",
        maintain_alignment: bool = False,
        inplace: bool = False,
    ) -> DatasetCollection | None:
        """Reproject every timestep to a target EPSG.

        Args:
            to_epsg (int):
                Reference number to the new projection (https://epsg.io/)
                (default 3857, WGS84 web mercator).
            method (str):
                Resampling technique. Default is "nearest neighbor". See
                https://gisgeography.com/raster-resampling/. Accepted
                values are "nearest neighbor", "cubic", "bilinear".
            maintain_alignment (bool):
                True to maintain the number of rows and columns of the
                raster the same after reprojection. Default is False.
            inplace (bool):
                If True, mutate this collection in place and return None.
                If False (default), return a new `DatasetCollection`.

        Returns:
            DatasetCollection | None: New collection when
            `inplace=False`; `None` when `inplace=True`.

        Examples:
            - Reproject every timestep to EPSG:3857 and keep the result:

              ```python
              >>> reprojected = collection.to_crs(to_epsg=3857)  # doctest: +SKIP

              ```
            - Reproject in place:

              ```python
              >>> collection.to_crs(to_epsg=3857, inplace=True)  # doctest: +SKIP

              ```
        """
        new_datasets = self._apply_per_timestep(
            "to_crs",
            to_epsg,
            method=method,
            maintain_alignment=maintain_alignment,
        )
        return self._finalize_per_timestep_result(new_datasets, inplace=inplace)

    def crop(
        self,
        mask: Dataset | str | None = None,
        inplace: bool = False,
        touch: bool = True,
        *,
        bbox: tuple[float, float, float, float] | list[float] | None = None,
        epsg: Any = None,
    ) -> DatasetCollection | None:
        """Crop every timestep against ``mask`` or a ``bbox``.

        Args:
            mask (Dataset | None):
                Dataset object of the mask raster to crop the rasters (to get
                the NoData value and its location in the array). Mask should
                include the name of the raster and the extension like
                "data/dem.tif", or you can read the mask raster using gdal
                and use it as the first parameter to the function. Mutually
                exclusive with ``bbox``; exactly one of the two must be
                supplied.
            inplace (bool):
                If True, mutate this collection in place and return None.
                If False (default), return a new `DatasetCollection`.
            touch (bool):
                Include the cells that touch the polygon, not only those that lie entirely inside the polygon mask.
                Default is True.
            bbox (tuple[float, float, float, float] | None, keyword-only):
                ``(west, south, east, north)`` quadruple in the CRS named by
                ``epsg``. Internally wrapped in a one-row
                :class:`FeatureCollection` (built once and reused across
                timesteps). Mutually exclusive with ``mask``.
            epsg (Any, keyword-only):
                CRS for ``bbox`` — anything ``geopandas`` accepts. Defaults to
                the collection's own CRS.

        Returns:
            DatasetCollection | None: New collection when
            `inplace=False`; `None` when `inplace=True`.

        Examples:
            - Crop aligned rasters using a DEM mask:

              ```python
              >>> dem_path = "examples/GIS/data/acc4000.tif"
              >>> src_path = "examples/GIS/data/aligned_rasters/"
              >>> out_path = "examples/GIS/data/crop_aligned_folder/"
              >>> DatasetCollection.crop(dem_path, src_path, out_path)

              ```

            - Crop every timestep using a ``(W, S, E, N)`` bbox tuple — the FC
              is built once and reused across timesteps:

              ```python
              >>> import os, tempfile
              >>> import numpy as np
              >>> from pyramids.dataset import Dataset, DatasetCollection
              >>> d = tempfile.mkdtemp()
              >>> paths = []
              >>> for t in range(2):
              ...     p = os.path.join(d, f"t{t}.tif")
              ...     _ = Dataset.create_from_array(
              ...         (np.arange(100, dtype="int16").reshape(10, 10) * (t + 1)),
              ...         top_left_corner=(0, 0), cell_size=0.05, epsg=4326, path=p,
              ...     ).close()
              ...     paths.append(p)
              >>> col = DatasetCollection.from_files(paths)
              >>> cropped = col.crop(bbox=(0.1, -0.2, 0.2, -0.1))
              >>> cropped.time_length
              2
              >>> cropped.base.shape
              (1, 2, 2)

              ```
        """
        if bbox is not None:
            if mask is not None:
                raise ValueError("crop accepts either `mask` or `bbox`, not both")
            crs = epsg if epsg is not None else self._base.epsg
            mask = FeatureCollection.from_bbox(bbox, epsg=crs)
        if mask is None:
            raise TypeError(
                "crop requires a `mask` or a `bbox` (west, south, east, north)"
            )
        new_datasets = self._apply_per_timestep("crop", mask, touch=touch)
        return self._finalize_per_timestep_result(new_datasets, inplace=inplace)

    def align(
        self, alignment_src: Dataset, inplace: bool = False
    ) -> DatasetCollection | None:
        """Align every timestep to `alignment_src`.

        Matches the coordinate system, the number of rows and columns,
        and the cell size of every timestep raster to `alignment_src`.

        Args:
            alignment_src (Dataset):
                Dataset to use as the spatial template (CRS, rows, columns).
            inplace (bool):
                If True, mutate this collection in place and return None.
                If False (default), return a new `DatasetCollection`.

        Returns:
            DatasetCollection | None: New collection when
            `inplace=False`; `None` when `inplace=True`.

        Examples:
            - Align every timestep to a DEM template:

              ```python
              >>> aligned = collection.align(dem_dataset)  # doctest: +SKIP

              ```
        """
        if not isinstance(alignment_src, Dataset):
            raise TypeError("alignment_src input should be a Dataset object")
        new_datasets = self._apply_per_timestep("align", alignment_src)
        return self._finalize_per_timestep_result(new_datasets, inplace=inplace)

    def _finalize_per_timestep_result(
        self,
        new_datasets: list[Dataset],
        *,
        inplace: bool,
    ) -> DatasetCollection | None:
        """Wire a list of new per-timestep Datasets into a collection.

        Centralises the inplace / non-inplace contract used by
        :meth:`to_crs`, :meth:`crop`, and :meth:`align` so the three
        share a single decision point.

        Args:
            new_datasets: One ``Dataset`` per timestep — the output of
                the per-timestep op.
            inplace: When True, replace this collection's handles in
                place (and rebind ``_base`` to the first new handle).
                When False, return a new collection wrapping the list.
        """
        if inplace:
            self._datasets = new_datasets
            self._base = new_datasets[0]
            self._files = None  # In-memory results no longer correspond to disk paths.
            return None
        return DatasetCollection(
            new_datasets[0],
            time_length=len(new_datasets),
            datasets=new_datasets,
        )

    def merge(
        self,
        dst: str | Path,
        no_data_value: float | int | str = "0",
        init: float | int | str = "nan",
        n: float | int | str = "nan",
        method: str = "last",
    ) -> None:
        """Merge this collection's timesteps into one raster.

        File-backed collections merge their on-disk paths directly.
        In-memory collections (legacy `DatasetCollection(src,
        time_length=N)` constructions, anything produced by
        `crop(inplace=False)` / `apply()` / `to_crs(inplace=False)` /
        `align(inplace=False)`) are first staged through a temp
        directory, merged, and the staging directory is removed
        before the call returns.

        Args:
            dst (str | Path):
                Path to the output raster.
            no_data_value (float | int | str):
                Assign a specified nodata value to output bands.
            init (float | int | str):
                Pre-initialize the output image bands with these
                values. However, it is not marked as the nodata
                value in the output file. If only one value is
                given, the same value is used in all the bands.
            n (float | int | str):
                Ignore pixels from files being merged in with this
                pixel value.
            method (str):
                Overlap-resolution rule passed to
                :func:`~pyramids.dataset.merge.merge_rasters`: one of
                ``"first"``, ``"last"`` (default), ``"min"``, ``"max"``,
                ``"sum"``.

        Returns:
            None
        """
        if self._files:
            merge_rasters(
                self._files,
                dst,
                no_data_value=no_data_value,
                init=init,
                n=n,
                method=method,
            )
            return
        # In-memory collection (legacy `DatasetCollection(src,
        # time_length=N)` or anything returned by
        # `crop(inplace=False)` / `apply()` / `to_crs(inplace=False)` /
        # `align(inplace=False)`). Stage each timestep through a
        # tempfile, merge the temp paths, then drop the staging
        # directory. The tempfile pass is unavoidable: gdal_merge /
        # BuildVRT both take on-disk paths.
        with tempfile.TemporaryDirectory(prefix="pyramids-merge-") as staging:
            staging_path = Path(staging)
            self.to_file(staging_path, driver="geotiff")
            staged_files = sorted(staging_path.glob("*.tif"))
            merge_rasters(
                [str(p) for p in staged_files],
                dst,
                no_data_value=no_data_value,
                init=init,
                n=n,
                method=method,
            )

    def apply(
        self, ufunc: Callable, *, inplace: bool = False
    ) -> DatasetCollection | None:
        """Apply a function to every timestep raster.

        Each timestep ``Dataset.apply(ufunc)`` runs over the
        in-domain cells of its band; the result is a new
        ``Dataset``. The list of new ``Datasets`` is wrapped in a
        new collection (out-of-place) or replaces this collection's
        handles (inplace).

        Out-of-place is the default — the previous in-place
        signature mutated a shared numpy cube; with the
        ``Dataset``-list backing there is no shared cube to mutate
        and per-timestep ops always produce a new ``Dataset``.

        Args:
            ufunc (Callable):
                Callable universal function (builtin or user defined). See
                https://numpy.org/doc/stable/reference/ufuncs.html
                To create a ufunc from a normal function: https://numpy.org/doc/stable/reference/generated/numpy.frompyfunc.html
            inplace (bool):
                When True, replace this collection's per-timestep
                ``Dataset`` handles with the new outputs and return
                ``None``. When False (default), return a new
                ``DatasetCollection`` wrapping the new outputs.

        Returns:
            DatasetCollection | None: New collection when
            ``inplace=False``; ``None`` when ``inplace=True``.

        Examples:
            - Apply a simple modulo operation to each value:

              ```python
              >>> def func(val):
              ...    return val % 2
              >>> ufunc = np.frompyfunc(func, 1, 1)
              >>> result = collection.apply(ufunc)  # doctest: +SKIP

              ```
        """
        if not callable(ufunc):
            raise TypeError("The Second argument should be a function")
        new_datasets = self._apply_per_timestep("apply", ufunc)
        return self._finalize_per_timestep_result(new_datasets, inplace=inplace)

    def overlay(
        self,
        classes_map,
        exclude_value: float | int | None = None,
    ) -> dict[float, list[float]]:
        """Overlay.

        Args:
            classes_map (Dataset):
                Dataset object for the raster that has classes to overlay with.
            exclude_value (float | int, optional):
                Values to exclude from extracted values. Defaults to None.

        Returns:
            dict[float, list[float]]:
                Dictionary with a list of values in the basemap as keys and for each key a list of all the
                intersected values in the maps from the path.
        """
        values: dict[Any, list[float]] = {}
        for ds in self.datasets:
            dict_i = ds.overlay(classes_map, exclude_value)

            # these are the distinct values from the BaseMap which are keys in the
            # values dict with each one having a list of values
            for class_i, vals in dict_i.items():
                values.setdefault(class_i, []).extend(vals)

        return values

datasets property #

Lazy list of per-timestep :class:Dataset handles.

Populates on first access. Three sources, in priority order:

  1. Caller-provided datasets= argument to __init__ (used by per-timestep ops to wrap their results).
  2. files= argument — each path opened as a lazy gdal handle via :meth:Dataset.read_file.
  3. Fallback for legacy DatasetCollection(src, time_length=N) constructions with neither files nor datasets — the template src is replicated time_length times.

The cache is per-instance and lives until the collection is garbage-collected. It is dropped on pickle (see :meth:__getstate__).

base property #

base.

Base Dataset

time_length property #

Length of the dataset.

rows property #

Number of rows.

shape property #

Number of rows.

columns property #

Number of columns.

data property #

Return a lazy dask.array.Array of shape (T, B, R, C).

Each per-file read is scheduled as a :func:dask.delayed task that opens the file via :class:~pyramids.base._file_manager.CachingFileManager and reads its full array. Workers therefore never serialise a gdal.Dataset — only the file path crosses the pickle boundary, which keeps the graph safe under dask.distributed.

Raises:

Type Description
ImportError

If the optional dask extra is not installed.

RuntimeError

If the collection was constructed without a files list (legacy create_cube path).

meta property #

Return the picklable :class:RasterMeta snapshot.

Always accessible without reopening the template dataset — a snapshot is derived eagerly at construction (see :meth:__init__) so downstream lazy paths can read geobox + dtype metadata without paying a GDAL-open cost per call, and so the whole collection pickles cleanly even if the _base Dataset handle is closed or points at a /vsimem/ file.

values property writable #

Materialise the per-timestep arrays as a 3D numpy cube.

Derived, not cached. Every access reads each timestep's first band via :meth:Dataset.read_array and stacks the result into (time, rows, cols). There is no stored cube that can drift from the canonical :attr:datasets source; callers that want repeated access should hold the returned array locally.

Returns:

Type Description
ndarray

np.ndarray: A fresh (time_length, rows, cols) float array each call.

__init__(src, time_length, files=None, *, meta=None, datasets=None, gdal_env=None) #

Construct DatasetCollection object.

Parameters:

Name Type Description Default
src Dataset

Template :class:~pyramids.dataset.Dataset (also serves as the timestep when no per-file Datasets are given).

required
time_length int

Number of timesteps in the collection.

required
files list[str] | None

Optional list of file paths backing each timestep. When given, per-timestep ops open each path as a lazy :class:~pyramids.dataset.Dataset on first access.

None
meta RasterMeta | None

Optional :class:RasterMeta snapshot. When omitted, a snapshot is derived eagerly from src so downstream lazy paths can access geo metadata without reopening the template every call.

None
datasets list[Dataset] | None

Optional list of pre-opened :class:~pyramids.dataset.Dataset handles, one per timestep. When given, takes precedence over files and time_length for per-timestep access. Used internally by :meth:to_crs / :meth:crop / :meth:align to wrap the result of per-timestep ops without re-opening any files.

None
gdal_env dict[str, str] | None

Optional GDAL config (e.g. a signer's gdal_env()) installed around every open of the backing files — both the eager template open and each lazy per-timestep read on Path A (datasets) and Path B (data dask graph). Lets a signed / Requester-Pays collection (from :meth:from_stac) authenticate its reads. A plain dict so it survives pickling to dask workers. None/empty means no extra config.

None
Source code in src/pyramids/dataset/collection.py
def __init__(
    self,
    src: Dataset,
    time_length: int,
    files: list[str] | None = None,
    *,
    meta: RasterMeta | None = None,
    datasets: list[Dataset] | None = None,
    gdal_env: dict[str, str] | None = None,
):
    """Construct DatasetCollection object.

    Args:
        src: Template :class:`~pyramids.dataset.Dataset` (also
            serves as the timestep when no per-file Datasets are
            given).
        time_length: Number of timesteps in the collection.
        files: Optional list of file paths backing each timestep.
            When given, per-timestep ops open each path as a lazy
            :class:`~pyramids.dataset.Dataset` on first access.
        meta: Optional :class:`RasterMeta` snapshot. When omitted,
            a snapshot is derived eagerly from `src` so downstream
            lazy paths can access geo metadata without
            reopening the template every call.
        datasets: Optional list of pre-opened
            :class:`~pyramids.dataset.Dataset` handles, one per
            timestep. When given, takes precedence over `files`
            and `time_length` for per-timestep access. Used
            internally by :meth:`to_crs` / :meth:`crop` /
            :meth:`align` to wrap the result of per-timestep ops
            without re-opening any files.
        gdal_env: Optional GDAL config (e.g. a signer's
            `gdal_env()`) installed around **every** open of the
            backing files — both the eager template open and each
            lazy per-timestep read on Path A (`datasets`) and Path B
            (`data` dask graph). Lets a signed / Requester-Pays
            collection (from :meth:`from_stac`) authenticate its
            reads. A plain dict so it survives pickling to dask
            workers. `None`/empty means no extra config.
    """
    self._base = src
    self._files = files
    self._time_length = time_length
    self._meta = meta if meta is not None else RasterMeta.from_dataset(src)
    self._gdal_env: dict[str, str] = dict(gdal_env) if gdal_env else {}
    # Cached lazy list of per-timestep Datasets. Populated on
    # first access via the `datasets` property: from `datasets=`
    # (caller-provided), then `files=` (open each path), then a
    # `[src] * time_length` fallback for legacy call sites that
    # don't pass either.
    self._datasets: list[Dataset] | None = (
        list(datasets) if datasets is not None else None
    )

__getstate__() #

Pickle state — drop the lazy _datasets cache.

Each Dataset in the cache wraps a live gdal handle that cannot be pickled. Stripping the cache forces the post-unpickle instance to re-open files on demand. The on-disk paths in _files are the canonical truth.

Source code in src/pyramids/dataset/collection.py
def __getstate__(self):
    """Pickle state — drop the lazy `_datasets` cache.

    Each `Dataset` in the cache wraps a live gdal handle that
    cannot be pickled. Stripping the cache forces the
    post-unpickle instance to re-open files on demand. The
    on-disk paths in `_files` are the canonical truth.
    """
    state = self.__dict__.copy()
    state["_datasets"] = None
    return state

__str__() #

str.

Source code in src/pyramids/dataset/collection.py
def __str__(self):
    """__str__."""
    source_line = (
        f"Files: {len(self._files)}"
        if self._files is not None
        else f"Time length: {self._time_length} (in-memory)"
    )
    message = f"""
        {source_line}
        Cell size: {self._base.cell_size}
        EPSG: {self._base.epsg}
        Dimension: {self.rows} * {self.columns}
        Mask: {self._base.no_data_value[0]}
    """
    return message

__repr__() #

repr.

Source code in src/pyramids/dataset/collection.py
def __repr__(self):
    """__repr__."""
    source_line = (
        f"Files: {len(self._files)}"
        if self._files is not None
        else f"Time length: {self._time_length} (in-memory)"
    )
    message = f"""
        {source_line}
        Cell size: {self._base.cell_size}
        EPSG: {self._base.epsg}
        Dimension: {self.rows} * {self.columns}
        Mask: {self._base.no_data_value[0]}
    """
    return message

create_cube(src, dataset_length) classmethod #

Create DatasetCollection.

- Create DatasetCollection from a sample raster and

Parameters:

Name Type Description Default
src Dataset

Raster object.

required
dataset_length int

Length of the dataset.

required

Returns:

Name Type Description
DatasetCollection DatasetCollection

DatasetCollection object.

Source code in src/pyramids/dataset/collection.py
@classmethod
def create_cube(cls, src: Dataset, dataset_length: int) -> DatasetCollection:
    """Create DatasetCollection.

        - Create DatasetCollection from a sample raster and

    Args:
        src (Dataset):
            Raster object.
        dataset_length (int):
            Length of the dataset.

    Returns:
        DatasetCollection: DatasetCollection object.
    """
    return cls(src, dataset_length)

groupby(time_labels) #

Group time steps by per-timestep label.

Returns a view exposing the same reduction surface as :class:DatasetCollection (mean / sum / min / max / std / var); each reduction runs once per unique label over the subset of timesteps carrying that label.

Parameters:

Name Type Description Default
time_labels

Sequence of length self.time_length — each entry is the group label for the corresponding file (e.g. ["Jan", "Jan", "Feb", "Feb",...] or integer month numbers for monthly groupings).

required

Returns:

Name Type Description
_GroupedCollection _GroupedCollection

Lightweight view with .mean() etc.

_GroupedCollection

Each call returns a dict {label: np.ndarray}.

Raises:

Type Description
ValueError

When len(time_labels)!= self.time_length.

Source code in src/pyramids/dataset/collection.py
def groupby(self, time_labels) -> _GroupedCollection:
    """Group time steps by per-timestep label.

    Returns a view exposing the same reduction surface as
    :class:`DatasetCollection` (`mean / sum / min / max / std /
    var`); each reduction runs once per unique label over the
    subset of timesteps carrying that label.

    Args:
        time_labels: Sequence of length `self.time_length` — each
            entry is the group label for the corresponding file
            (e.g. `["Jan", "Jan", "Feb", "Feb",...]` or integer
            month numbers for monthly groupings).

    Returns:
        _GroupedCollection: Lightweight view with `.mean()` etc.
        Each call returns a dict `{label: np.ndarray}`.

    Raises:
        ValueError: When `len(time_labels)!= self.time_length`.
    """
    if len(time_labels) != self._time_length:
        raise ValueError(
            f"time_labels length {len(time_labels)} does not match "
            f"time_length {self._time_length}"
        )
    return _GroupedCollection(self, list(time_labels))

reduce_time(times, *, freq, op, skipna=True) #

Reduce the time axis by a calendar frequency, grid-attached.

Buckets the timesteps by a pandas offset alias ("1MS", "7D", "6h", …), reduces each window with op through the existing :meth:groupby reducer, and wraps each window's result back into a :class:~pyramids.dataset.Dataset carrying the collection's geotransform / CRS / no-data — so callers get ready-to-write rasters instead of the bare {label: ndarray} that :meth:groupby returns.

The per-timestep timestamps are supplied by the caller (times) because a :class:DatasetCollection does not itself carry a time coordinate. The reduction runs through :attr:data, so the optional [lazy] extra (dask; flox recommended) is required.

Parameters:

Name Type Description Default
times Sequence

Per-timestep timestamps, length self.time_length. Any value :func:pandas.to_datetime accepts (datetime, "2022-01-01", pandas.Timestamp, a DatetimeIndex, …), aligned with the collection's timestep order.

required
freq str

A pandas offset alias naming the window size, e.g. "1MS" (month start), "7D" (weekly), "1D", "6h".

required
op str

Reduction operation: one of "mean", "sum", "min", "max", "std", "var".

required
skipna bool

When True (default) ignore the no-data value in each window; forwarded to the underlying reducer.

True

Returns:

Name Type Description
list[tuple[Any, Dataset]]

list[tuple[Any, Dataset]]: (window_label, dataset) pairs, one

list[tuple[Any, Dataset]]

per non-empty window, sorted by window label. window_label is

the list[tuple[Any, Dataset]]

class:pandas.Timestamp at the window's left edge; each

list[tuple[Any, Dataset]]

dataset is a grid-attached reduction of that window.

Raises:

Type Description
ValueError

op is not a supported reduction, len(times) does not match :attr:time_length, or times contains an unparseable / NaT entry.

Examples:

  • Monthly means of a stack of daily COGs, ready to write:
    >>> import pandas as pd  # doctest: +SKIP
    >>> from pyramids.dataset.collection import DatasetCollection  # doctest: +SKIP
    >>> coll = DatasetCollection.from_files(daily_cog_paths)  # doctest: +SKIP
    >>> times = pd.date_range("2022-01-01", periods=coll.time_length, freq="1D")  # doctest: +SKIP
    >>> monthly = coll.reduce_time(times, freq="1MS", op="mean")  # doctest: +SKIP
    >>> label, ds = monthly[0]  # doctest: +SKIP
    >>> ds.write_array  # a grid-attached Dataset, not a bare ndarray  # doctest: +SKIP
    
Source code in src/pyramids/dataset/collection.py
def reduce_time(
    self,
    times: Sequence,
    *,
    freq: str,
    op: str,
    skipna: bool = True,
) -> list[tuple[Any, Dataset]]:
    """Reduce the time axis by a calendar frequency, grid-attached.

    Buckets the timesteps by a pandas offset alias (``"1MS"``, ``"7D"``,
    ``"6h"``, …), reduces each window with ``op`` through the existing
    :meth:`groupby` reducer, and wraps each window's result back into a
    :class:`~pyramids.dataset.Dataset` carrying the collection's
    geotransform / CRS / no-data — so callers get ready-to-write rasters
    instead of the bare ``{label: ndarray}`` that :meth:`groupby` returns.

    The per-timestep timestamps are supplied by the caller (``times``)
    because a :class:`DatasetCollection` does not itself carry a time
    coordinate. The reduction runs through :attr:`data`, so the optional
    ``[lazy]`` extra (dask; flox recommended) is required.

    Args:
        times: Per-timestep timestamps, length ``self.time_length``. Any
            value :func:`pandas.to_datetime` accepts (``datetime``,
            ``"2022-01-01"``, ``pandas.Timestamp``, a ``DatetimeIndex``, …),
            aligned with the collection's timestep order.
        freq: A pandas offset alias naming the window size, e.g. ``"1MS"``
            (month start), ``"7D"`` (weekly), ``"1D"``, ``"6h"``.
        op: Reduction operation: one of ``"mean"``, ``"sum"``, ``"min"``,
            ``"max"``, ``"std"``, ``"var"``.
        skipna: When ``True`` (default) ignore the no-data value in each
            window; forwarded to the underlying reducer.

    Returns:
        list[tuple[Any, Dataset]]: ``(window_label, dataset)`` pairs, one
        per non-empty window, sorted by window label. ``window_label`` is
        the :class:`pandas.Timestamp` at the window's left edge; each
        ``dataset`` is a grid-attached reduction of that window.

    Raises:
        ValueError: ``op`` is not a supported reduction, ``len(times)`` does
            not match :attr:`time_length`, or ``times`` contains an
            unparseable / ``NaT`` entry.

    Examples:
        - Monthly means of a stack of daily COGs, ready to write:
            ```python
            >>> import pandas as pd  # doctest: +SKIP
            >>> from pyramids.dataset.collection import DatasetCollection  # doctest: +SKIP
            >>> coll = DatasetCollection.from_files(daily_cog_paths)  # doctest: +SKIP
            >>> times = pd.date_range("2022-01-01", periods=coll.time_length, freq="1D")  # doctest: +SKIP
            >>> monthly = coll.reduce_time(times, freq="1MS", op="mean")  # doctest: +SKIP
            >>> label, ds = monthly[0]  # doctest: +SKIP
            >>> ds.write_array  # a grid-attached Dataset, not a bare ndarray  # doctest: +SKIP

            ```
    """
    if op not in _GroupedCollection._OPS:
        raise ValueError(
            f"op must be one of {_GroupedCollection._OPS}, got {op!r}."
        )
    time_list = list(times)
    if len(time_list) != self._time_length:
        raise ValueError(
            f"times has {len(time_list)} entries but the collection has "
            f"{self._time_length} timesteps."
        )

    index = pd.DatetimeIndex(pd.to_datetime(time_list))
    if index.isna().any():
        raise ValueError(
            "times contains unparseable / NaT entries; every timestep must "
            "have a valid timestamp."
        )
    positions = pd.Series(np.arange(len(index)), index=index)
    window_labels: list[Any] = [None] * len(index)
    for window_key, members in positions.groupby(pd.Grouper(freq=freq)):
        for pos in members.to_numpy():
            window_labels[int(pos)] = window_key

    reduced = getattr(self.groupby(window_labels), op)(skipna=skipna)

    geo = self._base.geotransform
    epsg = self._base.epsg
    no_data_value = self._base.no_data_value[0]
    result: list[tuple[Any, Dataset]] = []
    for label in sorted(reduced):
        dataset = Dataset.create_from_array(
            np.asarray(reduced[label]),
            geo=geo,
            epsg=epsg,
            no_data_value=no_data_value,
        )
        result.append((label, dataset))
    return result

mean(*, skipna=True) #

Element-wise mean across the time axis.

Parameters:

Name Type Description Default
skipna bool

When True (default) skip NaN via :func:dask.array.nanmean; otherwise use :func:dask.array.mean.

True

Returns:

Type Description
ndarray

np.ndarray: Mean array of shape (bands, rows, cols).

Source code in src/pyramids/dataset/collection.py
def mean(self, *, skipna: bool = True) -> np.ndarray:
    """Element-wise mean across the time axis.

    Args:
        skipna: When True (default) skip `NaN` via
            :func:`dask.array.nanmean`; otherwise use
            :func:`dask.array.mean`.

    Returns:
        np.ndarray: Mean array of shape `(bands, rows, cols)`.
    """
    return self._reduce("mean", skipna=skipna)

sum(*, skipna=True) #

Element-wise sum across the time axis.

Source code in src/pyramids/dataset/collection.py
def sum(self, *, skipna: bool = True) -> np.ndarray:
    """Element-wise sum across the time axis."""
    return self._reduce("sum", skipna=skipna)

min(*, skipna=True) #

Element-wise minimum across the time axis.

Source code in src/pyramids/dataset/collection.py
def min(self, *, skipna: bool = True) -> np.ndarray:
    """Element-wise minimum across the time axis."""
    return self._reduce("min", skipna=skipna)

max(*, skipna=True) #

Element-wise maximum across the time axis.

Source code in src/pyramids/dataset/collection.py
def max(self, *, skipna: bool = True) -> np.ndarray:
    """Element-wise maximum across the time axis."""
    return self._reduce("max", skipna=skipna)

std(*, skipna=True) #

Element-wise standard deviation across the time axis.

Source code in src/pyramids/dataset/collection.py
def std(self, *, skipna: bool = True) -> np.ndarray:
    """Element-wise standard deviation across the time axis."""
    return self._reduce("std", skipna=skipna)

var(*, skipna=True) #

Element-wise variance across the time axis.

Source code in src/pyramids/dataset/collection.py
def var(self, *, skipna: bool = True) -> np.ndarray:
    """Element-wise variance across the time axis."""
    return self._reduce("var", skipna=skipna)

to_kerchunk(output_path, *, concat_dim='time') #

Emit a combined kerchunk JSON manifest for the collection.

Produces a single JSON sidecar that points at every timestep's source file — downstream consumers open the entire cube as a lazy Zarr-backed xarray with zero data rewrite.

Currently routes through :func:pyramids.netcdf._kerchunk.combine_kerchunk, which handles NetCDF/HDF5 sources. GeoTIFF backing is a follow-on (kerchunk's tiff support requires tifffile).

Parameters:

Name Type Description Default
output_path

Path where the manifest JSON is written.

required
concat_dim str

Dimension along which to concatenate per-file coordinates. Default "time".

'time'

Returns:

Name Type Description
dict dict

The combined manifest.

Raises:

Type Description
ImportError

When kerchunk is not installed.

RuntimeError

When the collection has no files list.

Source code in src/pyramids/dataset/collection.py
def to_kerchunk(
    self,
    output_path,
    *,
    concat_dim: str = "time",
) -> dict:
    """Emit a combined kerchunk JSON manifest for the collection.

    Produces a single JSON sidecar that points at every timestep's
    source file — downstream consumers open the entire cube as a
    lazy Zarr-backed xarray with zero data rewrite.

    Currently routes through
    :func:`pyramids.netcdf._kerchunk.combine_kerchunk`, which
    handles NetCDF/HDF5 sources. GeoTIFF backing is a follow-on
    (kerchunk's tiff support requires `tifffile`).

    Args:
        output_path: Path where the manifest JSON is written.
        concat_dim: Dimension along which to concatenate per-file
            coordinates. Default `"time"`.

    Returns:
        dict: The combined manifest.

    Raises:
        ImportError: When kerchunk is not installed.
        RuntimeError: When the collection has no files list.
    """
    if self._files is None or len(self._files) == 0:
        raise RuntimeError(
            "DatasetCollection.to_kerchunk requires a file-backed "
            "collection. Use DatasetCollection.from_files(...) to "
            "construct one."
        )
    # current backend only handles HDF5 / NetCDF. Detect
    # GeoTIFF inputs and raise a clear NotImplementedError rather
    # than letting kerchunk.hdf produce a confusing failure mode.
    geotiff_exts = {".tif", ".tiff", ".cog"}
    geotiff_files = [
        p
        for p in self._files
        if any(str(p).lower().endswith(ext) for ext in geotiff_exts)
    ]
    if geotiff_files:
        raise NotImplementedError(
            "to_kerchunk currently supports NetCDF / HDF5 source files "
            "only. GeoTIFF support requires kerchunk.tiff + the "
            "tifffile backend which is not yet wired up. Offending "
            f"files: {geotiff_files[:3]}"
            f"{' ...' if len(geotiff_files) > 3 else ''}"
        )
    from pyramids.netcdf._kerchunk import combine_kerchunk

    return combine_kerchunk(
        self._files,
        output_path,
        concat_dims=(concat_dim,),
        identical_dims=(),
    )

to_zarr(store, *, compute=True, mode='w', storage_options=None) #

Serialise the 4-D (T, B, R, C) cube to a Zarr store.

Each dask chunk in self.data lands in an independent Zarr chunk file — the only truly parallel raster output path pyramids offers. Geobox metadata (epsg, geotransform, nodata, band_names, time_length) is written as attributes on the root group + the data array following the standard crs_wkt / GeoTransform attribute convention, so downstream xr.open_zarr(store) consumers can reconstruct the geobox without pyramids.

Parameters:

Name Type Description Default
store

Target store (path, fsspec URL, or zarr.Store).

required
compute bool

True (default) writes immediately; False returns a :class:dask.delayed.Delayed.

True
mode str

Zarr open mode, typically "w" (fresh) or "a".

'w'
storage_options dict | None

Optional dict forwarded to :func:fsspec.get_mapper for cloud stores.

None

Returns:

Type Description

None on compute=True; a :class:dask.delayed.Delayed

on compute=False.

Raises:

Type Description
OptionalPackageDoesNotExist

When the [lazy] extra is not installed.

RuntimeError

When the collection has no files list.

Source code in src/pyramids/dataset/collection.py
def to_zarr(
    self,
    store,
    *,
    compute: bool = True,
    mode: str = "w",
    storage_options: dict | None = None,
):
    """Serialise the 4-D `(T, B, R, C)` cube to a Zarr store.

    Each dask chunk in `self.data` lands in an independent Zarr
    chunk file — the only truly parallel raster output path pyramids
    offers. Geobox metadata (epsg, geotransform, nodata, band_names,
    time_length) is written as attributes on the root group + the
    `data` array following the standard `crs_wkt` / `GeoTransform`
    attribute convention, so downstream `xr.open_zarr(store)` consumers
    can reconstruct the geobox without pyramids.

    Args:
        store: Target store (path, fsspec URL, or zarr.Store).
        compute: `True` (default) writes immediately; `False`
            returns a :class:`dask.delayed.Delayed`.
        mode: Zarr open mode, typically `"w"` (fresh) or `"a"`.
        storage_options: Optional dict forwarded to
            :func:`fsspec.get_mapper` for cloud stores.

    Returns:
        `None` on `compute=True`; a :class:`dask.delayed.Delayed`
        on `compute=False`.

    Raises:
        OptionalPackageDoesNotExist: When the `[lazy]` extra is not
            installed.
        RuntimeError: When the collection has no files list.
    """
    if self._files is None or len(self._files) == 0:
        raise RuntimeError(
            "DatasetCollection.to_zarr requires a file-backed "
            "collection. Use DatasetCollection.from_files(...) to "
            "construct one."
        )
    import_zarr(
        "DatasetCollection.to_zarr requires the optional 'zarr' "
        "dependency. Install with one of:\n"
        "  - PyPI:        pip install 'pyramids-gis[lazy]'\n"
        "  - conda-forge: conda install -c conda-forge pyramids-lazy"
    )
    data = self.data
    resolved_store = _resolve_store(store, storage_options)
    write_result = data.to_zarr(
        resolved_store,
        component="data",
        overwrite=(mode == "w"),
        compute=compute,
    )
    if compute:
        _finalize_collection_metadata(resolved_store, self._meta, self._files)
        result: Any = None
    else:
        import dask

        result = dask.delayed(_finalize_after_write)(
            write_result,
            resolved_store,
            self._meta,
            self._files,
        )
    return result

to_netcdf(path, *, time_dim='time', time_coords=None, var_per_band=True) #

Write the collection's (T, B, Y, X) cube to a single NetCDF.

Materialises every timestep in memory, builds an :class:xarray.Dataset, and hands it to :meth:pyramids.netcdf.NetCDF.from_xarray (which routes through pyramids' own GDAL multidimensional NetCDF writer — no netcdf4 / h5netcdf engine plug-in needed). The result is a self-describing NetCDF with one variable per band (CF-1.8 Conventions attr; geobox attached as crs_wkt / GeoTransform root attrs).

For huge cubes prefer :meth:to_zarr — this writer is eager (materialises the full T×B×Y×X array) since NetCDF.from_xarray itself materialises.

No-data values are written as a nodata attribute on the root group and on each data variable. GDAL's multidim NetCDF writer rejects CF's standard _FillValue attribute via this code path, so the round-trip uses nodata for compatibility.

Parameters:

Name Type Description Default
path str | Path

Output .nc path.

required
time_dim str

Name of the time dimension. Default "time".

'time'
time_coords 'Sequence[Any] | None'

Sequence of length time_length for the time axis values (e.g. pd.date_range(...)). None (default) emits a 0..T-1 integer index with a note attr explaining it is positional, not calendar.

None
var_per_band bool

When True (default), each band becomes its own data variable named after :attr:meta.band_names — CF-friendly and what :func:aggregate_netcdf-style consumers usually expect. When False, one 4-D data variable is written with a band coordinate — saner for hyperspectral cubes with hundreds of bands.

True

Raises:

Type Description
OptionalPackageDoesNotExist

When xarray is not installed. Install with one of: PyPI pip install 'pyramids-gis[xarray]' or conda-forge conda install -c conda-forge pyramids-xarray.

ValueError

When len(time_coords) != self.time_length.

RuntimeError

When :meth:NetCDF.from_xarray fails to write the file.

Examples:

  • Stack two single-band rasters into one NetCDF and reopen it:
    >>> import os, tempfile
    >>> import numpy as np
    >>> from pyramids.dataset import Dataset, DatasetCollection
    >>> from pyramids.netcdf import NetCDF
    >>> d = tempfile.mkdtemp()
    >>> paths = []
    >>> for i in range(2):
    ...     arr = (np.arange(20, dtype="int16").reshape(4, 5) + 100 * i)
    ...     p = os.path.join(d, f"t{i}.tif")
    ...     _ = Dataset.create_from_array(
    ...         arr, top_left_corner=(0, 0), cell_size=0.05, epsg=4326,
    ...         no_data_value=-9999, path=p,
    ...     ).close()
    ...     paths.append(p)
    >>> col = DatasetCollection.from_files(paths)
    >>> out = os.path.join(d, "cube.nc")
    >>> col.to_netcdf(out)
    >>> nc = NetCDF.read_file(out)
    >>> "Band_1" in nc.variables
    True
    >>> nc.epsg
    4326
    
See Also
  • :meth:to_zarr: parallel chunk-by-chunk writer; preferred for very large cubes.
  • :meth:to_kerchunk: emit a sidecar that points back at the source files without rewriting data.
  • :meth:pyramids.netcdf.NetCDF.from_xarray: the underlying writer.
Source code in src/pyramids/dataset/collection.py
def to_netcdf(
    self,
    path: str | Path,
    *,
    time_dim: str = "time",
    time_coords: "Sequence[Any] | None" = None,
    var_per_band: bool = True,
) -> None:
    """Write the collection's ``(T, B, Y, X)`` cube to a single NetCDF.

    Materialises every timestep in memory, builds an
    :class:`xarray.Dataset`, and hands it to
    :meth:`pyramids.netcdf.NetCDF.from_xarray` (which routes through
    pyramids' own GDAL multidimensional NetCDF writer — no
    ``netcdf4`` / ``h5netcdf`` engine plug-in needed). The result is
    a self-describing NetCDF with one variable per band (``CF-1.8``
    ``Conventions`` attr; geobox attached as ``crs_wkt`` /
    ``GeoTransform`` root attrs).

    For huge cubes prefer :meth:`to_zarr` — this writer is
    eager (materialises the full T×B×Y×X array) since
    ``NetCDF.from_xarray`` itself materialises.

    No-data values are written as a ``nodata`` attribute on the root
    group and on each data variable. GDAL's multidim NetCDF writer
    rejects CF's standard ``_FillValue`` attribute via this code
    path, so the round-trip uses ``nodata`` for compatibility.

    Args:
        path: Output ``.nc`` path.
        time_dim: Name of the time dimension. Default ``"time"``.
        time_coords: Sequence of length ``time_length`` for the
            time axis values (e.g. ``pd.date_range(...)``). ``None``
            (default) emits a 0..T-1 integer index with a ``note``
            attr explaining it is positional, not calendar.
        var_per_band: When ``True`` (default), each band becomes its
            own data variable named after :attr:`meta.band_names`
            — CF-friendly and what :func:`aggregate_netcdf`-style
            consumers usually expect. When ``False``, one 4-D
            ``data`` variable is written with a ``band`` coordinate
            — saner for hyperspectral cubes with hundreds of bands.

    Raises:
        OptionalPackageDoesNotExist: When ``xarray`` is not
            installed. Install with one of: PyPI
            ``pip install 'pyramids-gis[xarray]'`` or conda-forge
            ``conda install -c conda-forge pyramids-xarray``.
        ValueError: When ``len(time_coords) != self.time_length``.
        RuntimeError: When :meth:`NetCDF.from_xarray` fails to write
            the file.

    Examples:
        - Stack two single-band rasters into one NetCDF and reopen it:
            ```python
            >>> import os, tempfile
            >>> import numpy as np
            >>> from pyramids.dataset import Dataset, DatasetCollection
            >>> from pyramids.netcdf import NetCDF
            >>> d = tempfile.mkdtemp()
            >>> paths = []
            >>> for i in range(2):
            ...     arr = (np.arange(20, dtype="int16").reshape(4, 5) + 100 * i)
            ...     p = os.path.join(d, f"t{i}.tif")
            ...     _ = Dataset.create_from_array(
            ...         arr, top_left_corner=(0, 0), cell_size=0.05, epsg=4326,
            ...         no_data_value=-9999, path=p,
            ...     ).close()
            ...     paths.append(p)
            >>> col = DatasetCollection.from_files(paths)
            >>> out = os.path.join(d, "cube.nc")
            >>> col.to_netcdf(out)
            >>> nc = NetCDF.read_file(out)
            >>> "Band_1" in nc.variables
            True
            >>> nc.epsg
            4326

            ```

    See Also:
        - :meth:`to_zarr`: parallel chunk-by-chunk writer; preferred
          for very large cubes.
        - :meth:`to_kerchunk`: emit a sidecar that points back at
          the source files without rewriting data.
        - :meth:`pyramids.netcdf.NetCDF.from_xarray`: the underlying
          writer.
    """
    try:
        import xarray as xr
    except ImportError as exc:
        raise OptionalPackageDoesNotExist(
            "DatasetCollection.to_netcdf requires the optional 'xarray' "
            "dependency. Install with one of:\n"
            "  - PyPI:        pip install 'pyramids-gis[xarray]'\n"
            "  - conda-forge: conda install -c conda-forge pyramids-xarray"
        ) from exc

    if time_coords is not None:
        # Materialise generators / iterators up front so np.asarray gets a
        # sized sequence (an iterator yields a 0-d object array, which
        # would trip a cryptic IndexError below).
        if not hasattr(time_coords, "__len__"):
            time_coords = list(time_coords)
        time_values = np.asarray(time_coords)
        if time_values.dtype.kind == "O":
            # pd.DatetimeIndex → datetime64 via asarray, but lists of
            # datetime / Timestamp objects come through as dtype=object.
            # Coerce so the datetime branch below picks them up.
            try:
                time_values = np.asarray(time_values, dtype="datetime64[ns]")
            except (TypeError, ValueError):
                pass
        if time_values.shape[0] != self.time_length:
            raise ValueError(
                f"time_coords has {time_values.shape[0]} entries but "
                f"the collection has {self.time_length} timesteps"
            )
        time_attrs: dict = {}
        if time_values.shape[0] > 1 and time_values.dtype.kind in "iufM":
            ordered = np.sort(time_values)
            if not np.array_equal(time_values, ordered):
                warnings.warn(
                    "time_coords is not monotonically increasing; some "
                    "downstream tools (xr.open_dataset, aggregate_netcdf) "
                    "may reorder or refuse the axis",
                    stacklevel=2,
                )
            if np.unique(time_values).size != time_values.size:
                warnings.warn(
                    "time_coords contains duplicate values; downstream "
                    "indexers may pick an arbitrary timestep",
                    stacklevel=2,
                )
        if time_values.dtype.kind == "M":
            # GDAL's multidim writer has no native datetime64 type; encode
            # as an int64 offset with CF `units` so xr.open_dataset can
            # decode it back to a calendar axis on read. Use nanosecond
            # resolution so the round-trip is lossless for the full
            # datetime64[ns] range (the CF "nanoseconds since …" time
            # unit).
            epoch = np.datetime64("1970-01-01", "ns")
            ns = (time_values.astype("datetime64[ns]") - epoch).astype("int64")
            time_values = ns
            time_attrs["units"] = "nanoseconds since 1970-01-01 00:00:00"
            time_attrs["calendar"] = "proleptic_gregorian"
    else:
        time_values = np.arange(self.time_length, dtype="int64")
        time_attrs = {
            "long_name": "time index",
            "note": "positional index, not a calendar time",
        }

    meta = self._meta
    nodata = (meta.nodata or (None,))[0]
    band_count = int(meta.shape[0])
    names: list[str] = (
        list(meta.band_names)
        if meta.band_names
        else [f"band_{i + 1}" for i in range(band_count)]
    )

    # Per-timestep read_array() returns (rows, cols) for a single-band
    # dataset and (bands, rows, cols) for multi-band, so np.stack gives
    # (T, rows, cols) or (T, bands, rows, cols). Insert a length-1 band
    # axis on the single-band path so the rest of this method can treat
    # the cube uniformly as (T, B, Y, X).
    cube = np.stack([np.asarray(ds.read_array()) for ds in self.datasets], axis=0)
    if cube.ndim == 3:
        cube = cube[:, np.newaxis, :, :]

    y_coord = np.asarray(self._base.y)
    x_coord = np.asarray(self._base.x)

    if var_per_band:
        data_vars = {
            names[i]: ((time_dim, "y", "x"), cube[:, i, :, :])
            for i in range(band_count)
        }
        coords = {
            time_dim: (time_dim, time_values, time_attrs),
            "y": ("y", y_coord),
            "x": ("x", x_coord),
        }
    else:
        # GDAL's multidim NetCDF writer can't write a string coord, so the
        # band axis carries an integer index and the human names ride along
        # on the root group as a ``band_names`` attribute. Round-trips are
        # lossless via ``xr.open_dataset``: caller reads
        # ``ds.attrs["band_names"]`` to recover the labels.
        data_vars = {"data": ((time_dim, "band", "y", "x"), cube)}
        coords = {
            time_dim: (time_dim, time_values, time_attrs),
            "band": ("band", np.arange(band_count)),
            "y": ("y", y_coord),
            "x": ("x", x_coord),
        }

    root_attrs: dict = {"Conventions": "CF-1.8"}
    try:
        crs_wkt = meta.crs.to_wkt() if meta.crs is not None else None
    except AttributeError:
        crs_wkt = None
    if crs_wkt:
        root_attrs["crs_wkt"] = crs_wkt
    if meta.epsg is not None:
        root_attrs["epsg"] = int(meta.epsg)
    root_attrs["GeoTransform"] = " ".join(str(v) for v in meta.geotransform)
    if not var_per_band:
        root_attrs["band_names"] = ",".join(names)

    if nodata is not None:
        typed_nodata = np.asarray(nodata, dtype=cube.dtype).item()
        # GDAL's multidim NetCDF writer rejects ``_FillValue`` as an
        # attribute (libnetcdf wants it set via the dedicated typed-fill
        # API the writer doesn't expose) and silently drops anything set
        # through ``xr.encoding``. Surface the no-data value under a
        # ``nodata`` attribute instead — both on the root group (matches
        # the root attrs ``to_zarr`` writes) and on every
        # data variable, so consumers can recover it.
        root_attrs["nodata"] = typed_nodata
    ds = xr.Dataset(data_vars=data_vars, coords=coords, attrs=root_attrs)
    if nodata is not None:
        target_vars = names if var_per_band else ["data"]
        for v_name in target_vars:
            ds[v_name].attrs["nodata"] = typed_nodata

    # Inline import: pyramids.netcdf depends on pyramids.dataset.Dataset,
    # so hoisting this to the module top would form a circular import
    # through pyramids.dataset.__init__. Matches the to_kerchunk pattern
    # (see ``to_kerchunk`` above) and CLAUDE.md's circular-import
    # carveout in "Code Style".
    from pyramids.netcdf import NetCDF  # noqa: E402

    NetCDF.from_xarray(ds, path)

from_stac(items, asset, *, patch_url=None, bbox=None, max_items=None, signer=None, align=True, skip_missing=False, groupby=None, like=None, crs=None, resolution=None, bounds=None, anchor='edge') classmethod #

Build a collection from a STAC ItemCollection.

Thin forwarder to :func:pyramids.dataset._stac.from_stac. Duck-typed — accepts :class:pystac.Item objects, raw JSON dicts, or any iterable of items with .assets + .bbox semantics. pyramids does not depend on pystac.

Parameters:

Name Type Description Default
items

Iterable of STAC Items (pystac objects, raw JSON dicts, or any duck-typed equivalent).

required
asset str | Sequence[str]

A single asset key (str) for a single-asset time stack, or a sequence of keys (e.g. ["B04", "B03", "B02"]) to stack those assets band-wise into one multi-band raster per timestep (band order = sequence order).

required
patch_url

Optional low-level callable rewriting each href (runs before signer).

None
bbox tuple | None

M6 — optional (minx, miny, maxx, maxy) filter in lon/lat; items whose bbox doesn't intersect are dropped before hrefs are resolved.

None
max_items int | None

M6 — cap the number of items consumed (after bbox filtering). Useful for quick-look workflows.

None
signer Any

Optional signer (e.g. a :class:pyramids.stac.signers.Signer). Its sign_href rewrites every asset href and its gdal_env() is captured onto the returned collection so every read of the backing files authenticates — making Requester-Pays / bearer / SAS catalogs work through from_stac. See :func:pyramids.dataset._stac.from_stac.

None
align bool

Multi-asset only — resample assets at differing resolutions onto the first asset's grid (True, default) or raise on mismatch (False).

True
skip_missing bool

Drop items missing any requested asset (True) instead of raising (False, default).

False
groupby str | None

"solar_day" mosaics same-solar-day items into one timestep each (single-asset only); None (default) keeps one timestep per item.

None
like Any

Optional target-grid :class:~pyramids.dataset.Dataset; every timestep is aligned onto its CRS + grid. Mutually exclusive with crs/resolution/bounds.

None
crs int | str | None

Target CRS for an explicit grid (with resolution+bounds).

None
resolution float | None

Target pixel size for an explicit grid.

None
bounds

Target (minx, miny, maxx, maxy) for an explicit grid.

None
anchor str

Grid-snap rule for the explicit grid ("edge").

'edge'

Returns:

Name Type Description
DatasetCollection DatasetCollection

File-backed collection (or grid-aligned

DatasetCollection

collection when like/crs is given).

Source code in src/pyramids/dataset/collection.py
@classmethod
def from_stac(
    cls,
    items,
    asset: str | Sequence[str],
    *,
    patch_url=None,
    bbox: tuple | None = None,
    max_items: int | None = None,
    signer: Any = None,
    align: bool = True,
    skip_missing: bool = False,
    groupby: str | None = None,
    like: Any = None,
    crs: int | str | None = None,
    resolution: float | None = None,
    bounds=None,
    anchor: str = "edge",
) -> DatasetCollection:
    """Build a collection from a STAC ItemCollection.

    Thin forwarder to :func:`pyramids.dataset._stac.from_stac`.
    Duck-typed — accepts :class:`pystac.Item` objects, raw JSON
    dicts, or any iterable of items with `.assets` + `.bbox`
    semantics. pyramids does not depend on pystac.

    Args:
        items: Iterable of STAC Items (pystac objects, raw JSON
            dicts, or any duck-typed equivalent).
        asset: A single asset key (`str`) for a single-asset time
            stack, or a sequence of keys (e.g. `["B04", "B03",
            "B02"]`) to stack those assets band-wise into one
            multi-band raster per timestep (band order = sequence
            order).
        patch_url: Optional low-level callable rewriting each href
            (runs before `signer`).
        bbox: M6 — optional `(minx, miny, maxx, maxy)` filter in
            lon/lat; items whose `bbox` doesn't intersect are
            dropped before hrefs are resolved.
        max_items: M6 — cap the number of items consumed (after
            bbox filtering). Useful for quick-look workflows.
        signer: Optional signer (e.g. a
            :class:`pyramids.stac.signers.Signer`). Its
            `sign_href` rewrites every asset href and its
            `gdal_env()` is captured onto the returned collection so
            every read of the backing files authenticates — making
            Requester-Pays / bearer / SAS catalogs work through
            `from_stac`. See :func:`pyramids.dataset._stac.from_stac`.
        align: Multi-asset only — resample assets at differing
            resolutions onto the first asset's grid (`True`,
            default) or raise on mismatch (`False`).
        skip_missing: Drop items missing any requested asset
            (`True`) instead of raising (`False`, default).
        groupby: `"solar_day"` mosaics same-solar-day items into one
            timestep each (single-asset only); `None` (default) keeps
            one timestep per item.
        like: Optional target-grid :class:`~pyramids.dataset.Dataset`;
            every timestep is aligned onto its CRS + grid. Mutually
            exclusive with `crs`/`resolution`/`bounds`.
        crs: Target CRS for an explicit grid (with `resolution`+`bounds`).
        resolution: Target pixel size for an explicit grid.
        bounds: Target `(minx, miny, maxx, maxy)` for an explicit grid.
        anchor: Grid-snap rule for the explicit grid (`"edge"`).

    Returns:
        DatasetCollection: File-backed collection (or grid-aligned
        collection when `like`/`crs` is given).
    """
    return _from_stac(
        items,
        asset,
        patch_url=patch_url,
        bbox=bbox,
        max_items=max_items,
        signer=signer,
        align=align,
        skip_missing=skip_missing,
        groupby=groupby,
        like=like,
        crs=crs,
        resolution=resolution,
        bounds=bounds,
        anchor=anchor,
    )

from_point(lat, lon, *, collection, bands, start_date, end_date, edge_size, resolution, units='px', stac=None, query=None, signer=None, align=True) classmethod #

Build a point-centred STAC cube (cubo-style convenience constructor).

Thin forwarder to :func:pyramids.dataset._stac.from_point: reprojects (lat, lon) to its local UTM, snaps to the resolution grid, expands to an edge_size-pixel (or -metre) square AOI, searches collection over that AOI + date range, and stacks the bands via :meth:from_stac.

Parameters:

Name Type Description Default
lat float

Center latitude in degrees (EPSG:4326).

required
lon float

Center longitude in degrees (EPSG:4326).

required
collection str

STAC collection id to search.

required
bands

A single asset key or a sequence (multi-asset band axis).

required
start_date str

Search start (YYYY-MM-DD / RFC 3339).

required
end_date str

Search end (YYYY-MM-DD / RFC 3339).

required
edge_size int

Cube side length, in pixels (units="px") or metres.

required
resolution float

Pixel size in metres.

required
units str

"px" (default) or "m".

'px'
stac str | None

STAC API root URL; None uses the Planetary Computer default.

None
query Any

Optional STAC query extension dict.

None
signer Any

Optional signer, forwarded to the search and the reads.

None
align bool

Multi-asset resolution policy (see :meth:from_stac).

True

Returns:

Name Type Description
DatasetCollection DatasetCollection

A time-stacked cube over the point AOI.

Source code in src/pyramids/dataset/collection.py
@classmethod
def from_point(
    cls,
    lat: float,
    lon: float,
    *,
    collection: str,
    bands,
    start_date: str,
    end_date: str,
    edge_size: int,
    resolution: float,
    units: str = "px",
    stac: str | None = None,
    query: Any = None,
    signer: Any = None,
    align: bool = True,
) -> DatasetCollection:
    """Build a point-centred STAC cube (cubo-style convenience constructor).

    Thin forwarder to :func:`pyramids.dataset._stac.from_point`: reprojects
    `(lat, lon)` to its local UTM, snaps to the `resolution` grid, expands to
    an `edge_size`-pixel (or -metre) square AOI, searches `collection` over
    that AOI + date range, and stacks the `bands` via :meth:`from_stac`.

    Args:
        lat: Center latitude in degrees (EPSG:4326).
        lon: Center longitude in degrees (EPSG:4326).
        collection: STAC collection id to search.
        bands: A single asset key or a sequence (multi-asset band axis).
        start_date: Search start (`YYYY-MM-DD` / RFC 3339).
        end_date: Search end (`YYYY-MM-DD` / RFC 3339).
        edge_size: Cube side length, in pixels (`units="px"`) or metres.
        resolution: Pixel size in metres.
        units: `"px"` (default) or `"m"`.
        stac: STAC API root URL; `None` uses the Planetary Computer default.
        query: Optional STAC `query` extension dict.
        signer: Optional signer, forwarded to the search and the reads.
        align: Multi-asset resolution policy (see :meth:`from_stac`).

    Returns:
        DatasetCollection: A time-stacked cube over the point AOI.
    """
    kwargs: dict[str, Any] = dict(
        collection=collection,
        bands=bands,
        start_date=start_date,
        end_date=end_date,
        edge_size=edge_size,
        resolution=resolution,
        units=units,
        query=query,
        signer=signer,
        align=align,
    )
    if stac is not None:
        kwargs["stac"] = stac
    return _from_point(lat, lon, **kwargs)

from_files(files, *, meta=None, gdal_env=None) classmethod #

Build a collection from a list of files without pre-opening all.

Only the first file is opened eagerly (to derive :class:RasterMeta). The remaining files are referenced by path only — lazy readers open them on demand through :class:~pyramids.base._file_manager.CachingFileManager.

Parameters:

Name Type Description Default
files list[str | Path]

Sequence of file paths backing each timestep.

required
meta RasterMeta | None

Optional pre-computed :class:RasterMeta. When omitted, derived from the first file via :meth:RasterMeta.from_dataset.

None
gdal_env dict[str, str] | None

Optional GDAL config (e.g. a signer's gdal_env()) installed around every open of the backing files, including the eager template open below. Persisted on the collection for the lazy read paths. None (default) installs no extra config.

None

Returns:

Name Type Description
DatasetCollection DatasetCollection

A new collection whose time_length

DatasetCollection

matches len(files).

Raises:

Type Description
ValueError

When files is empty.

Source code in src/pyramids/dataset/collection.py
@classmethod
def from_files(
    cls,
    files: list[str | Path],
    *,
    meta: RasterMeta | None = None,
    gdal_env: dict[str, str] | None = None,
) -> DatasetCollection:
    """Build a collection from a list of files without pre-opening all.

    Only the first file is opened eagerly (to derive
    :class:`RasterMeta`). The remaining files are referenced by
    path only — lazy readers open them on demand through
    :class:`~pyramids.base._file_manager.CachingFileManager`.

    Args:
        files: Sequence of file paths backing each timestep.
        meta: Optional pre-computed :class:`RasterMeta`. When
            omitted, derived from the first file via
            :meth:`RasterMeta.from_dataset`.
        gdal_env: Optional GDAL config (e.g. a signer's
            `gdal_env()`) installed around every open of the backing
            files, including the eager template open below. Persisted
            on the collection for the lazy read paths. `None` (default)
            installs no extra config.

    Returns:
        DatasetCollection: A new collection whose `time_length`
        matches `len(files)`.

    Raises:
        ValueError: When `files` is empty.
    """
    resolved = [str(p) for p in files]
    if not resolved:
        raise ValueError("files must contain at least one path")
    with cloud_config_from_env(gdal_env):
        template = Dataset.read_file(resolved[0])
        if meta is None:
            meta = RasterMeta.from_dataset(template)
    return cls(template, len(resolved), files=resolved, meta=meta, gdal_env=gdal_env)

from_archive(url_or_path, *, kind='auto', member_glob='*', meta=None) classmethod #

Build a collection from the raster members of an archive.

Lists the archive's members (locally or over the network — a remote ZIP is read via the chained /vsizip//vsicurl/… path) and hands them to :meth:from_files, so each matching member becomes one timestep. Only the first member is opened eagerly; the rest are opened on demand.

For "merge all members into one multi-band :class:Dataset" (bands, not timesteps) use :meth:pyramids.dataset.Dataset.from_archive.

The archive's file name must carry a recognised extension (.zip / .tar / .tar.gz / .gz) — GDAL's archive handlers key off the extension. An extension-less download URL (e.g. an Earth Engine getDownloadURL ending in :getPixels) must first be fetched and saved with a .zip name (or written to /vsimem/<name>.zip via :func:osgeo.gdal.FileFromMemBuffer) before calling this.

Parameters:

Name Type Description Default
url_or_path str | Path

Path or URL of the archive (.zip / .tar / .tar.gz / .gz).

required
kind str

Archive kind — "zip", "tar" (also "tar.gz" / "tgz"), "gzip" (also "gz"), or "auto" (default, infer from the extension).

'auto'
member_glob str

:mod:fnmatch pattern selecting which members to include, applied to top-level member names and sorted. Default "*" (all). Pass e.g. "*.tif" to skip sidecar files.

'*'
meta RasterMeta | None

Optional pre-computed :class:RasterMeta for the timesteps.

None

Returns:

Name Type Description
DatasetCollection DatasetCollection

A collection whose time_length is the number

DatasetCollection

of matching members.

Raises:

Type Description
FileFormatNotSupportedError

kind="auto" and the extension is not recognised, or the archive could not be listed.

FileNotFoundError

No member matched member_glob.

ValueError

kind is not a recognised archive kind.

Source code in src/pyramids/dataset/collection.py
@classmethod
def from_archive(
    cls,
    url_or_path: str | Path,
    *,
    kind: str = "auto",
    member_glob: str = "*",
    meta: RasterMeta | None = None,
) -> DatasetCollection:
    """Build a collection from the raster members of an archive.

    Lists the archive's members (locally or over the network — a remote ZIP
    is read via the chained ``/vsizip//vsicurl/…`` path) and hands them to
    :meth:`from_files`, so each matching member becomes one timestep. Only
    the first member is opened eagerly; the rest are opened on demand.

    For "merge all members into one multi-band :class:`Dataset`" (bands,
    not timesteps) use :meth:`pyramids.dataset.Dataset.from_archive`.

    The archive's file name must carry a recognised extension (``.zip`` /
    ``.tar`` / ``.tar.gz`` / ``.gz``) — GDAL's archive handlers key off the
    extension. An extension-less download URL (e.g. an Earth Engine
    ``getDownloadURL`` ending in ``:getPixels``) must first be fetched and
    saved with a ``.zip`` name (or written to ``/vsimem/<name>.zip`` via
    :func:`osgeo.gdal.FileFromMemBuffer`) before calling this.

    Args:
        url_or_path: Path or URL of the archive (``.zip`` / ``.tar`` /
            ``.tar.gz`` / ``.gz``).
        kind: Archive kind — ``"zip"``, ``"tar"`` (also ``"tar.gz"`` /
            ``"tgz"``), ``"gzip"`` (also ``"gz"``), or ``"auto"`` (default,
            infer from the extension).
        member_glob: :mod:`fnmatch` pattern selecting which members to
            include, applied to top-level member names and sorted. Default
            ``"*"`` (all). Pass e.g. ``"*.tif"`` to skip sidecar files.
        meta: Optional pre-computed :class:`RasterMeta` for the timesteps.

    Returns:
        DatasetCollection: A collection whose ``time_length`` is the number
        of matching members.

    Raises:
        FileFormatNotSupportedError: ``kind="auto"`` and the extension is
            not recognised, or the archive could not be listed.
        FileNotFoundError: No member matched ``member_glob``.
        ValueError: ``kind`` is not a recognised archive kind.
    """
    dir_vsi = _io._archive_dir_vsi(url_or_path, kind)
    members = _io._archive_members(dir_vsi, member_glob)
    member_paths = [f"{dir_vsi}/{m}" for m in members]
    return cls.from_files(member_paths, meta=meta)

read_multiple_files(path, with_order=False, regex_string='\\d{4}.\\d{2}.\\d{2}', date=True, file_name_data_fmt=None, start=None, end=None, fmt='%Y-%m-%d', extension='.tif') classmethod #

read_multiple_files.

- Read rasters from a folder (or list of files) and create a 3D array with the same 2D dimensions as the
  first raster and length equal to the number of files.

- All rasters should have the same dimensions.
- If you want to read the rasters with a certain order, the raster file names should contain a date
  that follows a consistent format (YYYY.MM.DD / YYYY-MM-DD or YYYY_MM_DD), e.g. "MSWEP_1979.01.01.tif".

Parameters:

Name Type Description Default
path str | list[str]

Path of the folder that contains all the rasters, or a list containing the paths of the rasters to read.

required
with_order bool

True if the raster names follow a certain order. Then the raster names should have a date that follows the same format (YYYY.MM.DD / YYYY-MM-DD or YYYY_MM_DD). For example:

>>> "MSWEP_1979.01.01.tif"
>>> "MSWEP_1979.01.02.tif"
>>> ...
>>> "MSWEP_1979.01.20.tif"
False
regex_string str

A regex string used to locate the date in the file names. Default is r"\d{4}.\d{2}.\d{2}". For example:

>>> fname = "MSWEP_YYYY.MM.DD.tif"
>>> regex_string = r"\d{4}.\d{2}.\d{2}"
  • Or:
>>> fname = "MSWEP_YYYY_M_D.tif"
>>> regex_string = r"\d{4}_\d{1}_\d{1}"
  • If there is a number at the beginning of the name:
>>> fname = "1_MSWEP_YYYY_M_D.tif"
>>> regex_string = r"\d+"
'\\d{4}.\\d{2}.\\d{2}'
date bool

True if the number in the file name is a date. Default is True.

True
file_name_data_fmt str

If the file names contain a date and you want to read them ordered. Default is None. For example:

>>> "MSWEP_YYYY.MM.DD.tif"
>>> file_name_data_fmt = "%Y.%m.%d"
None
start str

Start date if you want to read the input raster for a specific period only and not all rasters. If not given, all rasters in the given path will be read.

None
end str

End date if you want to read the input rasters for a specific period only. If not given, all rasters in the given path will be read.

None
fmt str

Format of the given date in the start/end parameter.

'%Y-%m-%d'
extension str

The extension of the files you want to read from the given path. Default is ".tif".

'.tif'

Returns:

Name Type Description
DatasetCollection DatasetCollection

Instance of the DatasetCollection class.

Examples:

  • Read all rasters in a folder:
>>> from pyramids.dataset import DatasetCollection
>>> raster_folder = "examples/GIS/data/raster-folder"
>>> prec = DatasetCollection.read_multiple_files(raster_folder)
  • Read from a pre-collected list without ordering:
>>> raster_folder = Path("examples/GIS/data/raster-folder")
>>> file_list = list(raster_folder.glob("*.tif"))
>>> prec = DatasetCollection.read_multiple_files(file_list, with_order=False)
Source code in src/pyramids/dataset/collection.py
@classmethod
def read_multiple_files(
    cls,
    path: str | Path | list[str | Path],
    with_order: bool = False,
    regex_string: str = r"\d{4}.\d{2}.\d{2}",
    date: bool = True,
    file_name_data_fmt: str | None = None,
    start: str | None = None,
    end: str | None = None,
    fmt: str = "%Y-%m-%d",
    extension: str = ".tif",
) -> DatasetCollection:
    r"""read_multiple_files.

        - Read rasters from a folder (or list of files) and create a 3D array with the same 2D dimensions as the
          first raster and length equal to the number of files.

        - All rasters should have the same dimensions.
        - If you want to read the rasters with a certain order, the raster file names should contain a date
          that follows a consistent format (YYYY.MM.DD / YYYY-MM-DD or YYYY_MM_DD), e.g. "MSWEP_1979.01.01.tif".

    Args:
        path (str | list[str]):
            Path of the folder that contains all the rasters, or a list containing the paths of the rasters to read.
        with_order (bool):
            True if the raster names follow a certain order. Then the raster names should have a date that follows
            the same format (YYYY.MM.DD / YYYY-MM-DD or YYYY_MM_DD). For example:

            ```python
            >>> "MSWEP_1979.01.01.tif"
            >>> "MSWEP_1979.01.02.tif"
            >>> ...
            >>> "MSWEP_1979.01.20.tif"

            ```

        regex_string (str):
            A regex string used to locate the date in the file names. Default is r"\d{4}.\d{2}.\d{2}". For example:

            ```python
            >>> fname = "MSWEP_YYYY.MM.DD.tif"
            >>> regex_string = r"\d{4}.\d{2}.\d{2}"
            ```

            - Or:

            ```python
            >>> fname = "MSWEP_YYYY_M_D.tif"
            >>> regex_string = r"\d{4}_\d{1}_\d{1}"
            ```

            - If there is a number at the beginning of the name:

            ```python
            >>> fname = "1_MSWEP_YYYY_M_D.tif"
            >>> regex_string = r"\d+"
            ```

        date (bool):
            True if the number in the file name is a date. Default is True.
        file_name_data_fmt (str):
            If the file names contain a date and you want to read them ordered. Default is None. For example:

            ```python
            >>> "MSWEP_YYYY.MM.DD.tif"
            >>> file_name_data_fmt = "%Y.%m.%d"
            ```

        start (str):
            Start date if you want to read the input raster for a specific period only and not all rasters. If not
            given, all rasters in the given path will be read.
        end (str):
            End date if you want to read the input rasters for a specific period only. If not given, all rasters in
            the given path will be read.
        fmt (str):
            Format of the given date in the start/end parameter.
        extension (str):
            The extension of the files you want to read from the given path. Default is ".tif".

    Returns:
        DatasetCollection:
            Instance of the DatasetCollection class.

    Examples:
        - Read all rasters in a folder:

          ```python
          >>> from pyramids.dataset import DatasetCollection
          >>> raster_folder = "examples/GIS/data/raster-folder"
          >>> prec = DatasetCollection.read_multiple_files(raster_folder)

          ```

        - Read from a pre-collected list without ordering:

          ```python
          >>> raster_folder = Path("examples/GIS/data/raster-folder")
          >>> file_list = list(raster_folder.glob("*.tif"))
          >>> prec = DatasetCollection.read_multiple_files(file_list, with_order=False)

          ```
    """
    if not isinstance(path, (str, Path, list)):
        raise TypeError(
            f"path input should be string/Path/list type, given: {type(path)}"
        )

    if isinstance(path, (str, Path)):
        path = Path(path)
        # check whither the path exists or not
        if not path.exists():
            raise FileNotFoundError("The path you have provided does not exist")
        # get a list of all files
        files = [f.name for f in path.iterdir() if f.name.endswith(extension)]
        # check whether there are files or not inside the folder
        if len(files) < 1:
            raise FileNotFoundError("The path you have provided is empty")
    else:
        files = [str(p) for p in path]

    # to sort the files in the same order as the first number in the name
    if with_order:
        match_str_fn = lambda x: re.search(regex_string, x)
        list_dates = list(map(match_str_fn, files))

        if None in list_dates:
            raise ValueError(
                "The date format/separator given does not match the file names"
            )
        if date:
            if file_name_data_fmt is None:
                raise ValueError(
                    f"To read the raster with a certain order (with_order = {with_order}, then you have to enter "
                    f"the value of the parameter file_name_data_fmt(given: {file_name_data_fmt})"
                )
            fn: Callable[[Any], Any] = lambda x: dt.datetime.strptime(
                x.group(), file_name_data_fmt
            )
        else:
            fn = lambda x: int(x.group())
        list_dates = list(map(fn, list_dates))

        df = pd.DataFrame()
        df["files"] = files
        df["date"] = list_dates
        df.sort_values("date", inplace=True, ignore_index=True)
        files = df.loc[:, "files"].values

    if start is not None or end is not None:
        if date:
            start_dt: Any = dt.datetime.strptime(str(start), fmt)
            end_dt: Any = dt.datetime.strptime(str(end), fmt)

            files = (
                df.loc[start_dt <= df["date"], :]
                .loc[df["date"] <= end_dt, "files"]
                .values
            )
        else:
            files = (
                df.loc[start <= df["date"], :]
                .loc[df["date"] <= end, "files"]
                .values
            )

    if not isinstance(path, list):
        # add the path to all the files
        files = [f"{path}/{i}" for i in files]
    # create a 3d array with the 2d dimension of the first raster and the len
    # of the number of rasters in the folder
    sample = Dataset.read_file(files[0])

    return cls(sample, len(files), files)

open_multi_dataset(band=0) #

Deprecated no-op (legacy API).

The eager _values cube this method used to populate is gone. Per-timestep Dataset handles open lazily via :attr:datasets on first access; the legacy values / __getitem__ / head / first views materialise on demand from those handles. There is nothing for this method to do.

Kept as a callable shim so legacy code that does dc.open_multi_dataset() before reading .values still runs without modification. New code should not call it.

Parameters:

Name Type Description Default
band int

Ignored. The full per-timestep band selection happens inside :meth:Dataset.read_array(band=...).

0
Source code in src/pyramids/dataset/collection.py
def open_multi_dataset(self, band: int = 0) -> None:
    """Deprecated no-op (legacy API).

    The eager ``_values`` cube this method used to populate is
    gone. Per-timestep ``Dataset`` handles open lazily via
    :attr:`datasets` on first access; the legacy ``values`` /
    ``__getitem__`` / ``head`` / ``first`` views materialise on
    demand from those handles. There is nothing for this method
    to do.

    Kept as a callable shim so legacy code that does
    ``dc.open_multi_dataset()`` before reading ``.values`` still
    runs without modification. New code should not call it.

    Args:
        band: Ignored. The full per-timestep band selection
            happens inside :meth:`Dataset.read_array(band=...)`.
    """
    del band  # unused
    return None

__getitem__(key) #

Return one or more timestep arrays, indexed along the time axis.

Equivalent to self.values[key] but with one slight optimisation: an integer key reads only that timestep's Dataset (never materialises the full cube).

Parameters:

Name Type Description Default
key

Integer index or slice along the time axis.

required

Returns:

Type Description
ndarray

np.ndarray: A 2D array (single int) or a 3D array (slice).

Source code in src/pyramids/dataset/collection.py
def __getitem__(self, key) -> np.ndarray:
    """Return one or more timestep arrays, indexed along the time axis.

    Equivalent to ``self.values[key]`` but with one slight
    optimisation: an integer ``key`` reads only that timestep's
    Dataset (never materialises the full cube).

    Args:
        key: Integer index or slice along the time axis.

    Returns:
        np.ndarray: A 2D array (single int) or a 3D array (slice).
    """
    if isinstance(key, int):
        return self.datasets[key].read_array(band=0)
    return self.values[key]

__setitem__(key, value) #

Replace a single timestep's Dataset with a MEM Dataset built from value.

Parameters:

Name Type Description Default
key int

Integer index along the time axis.

required
value ndarray

A 2D (rows, cols) array.

required

Raises:

Type Description
TypeError

If key is not an integer (slice assignment is not supported; rebuild the collection instead).

Source code in src/pyramids/dataset/collection.py
def __setitem__(self, key: int, value: np.ndarray) -> None:
    """Replace a single timestep's Dataset with a MEM Dataset built from ``value``.

    Args:
        key (int): Integer index along the time axis.
        value (np.ndarray): A 2D ``(rows, cols)`` array.

    Raises:
        TypeError: If ``key`` is not an integer (slice assignment
            is not supported; rebuild the collection instead).
    """
    if not isinstance(key, int):
        raise TypeError(
            f"DatasetCollection.__setitem__ only accepts an integer "
            f"index along the time axis; got {type(key).__name__}. "
            f"Rebuild the collection if you need bulk replacement."
        )
    # Materialise the cache (so we have a list to modify) without
    # building the full cube. Use ``create_from_array`` so the
    # input array's dtype is preserved (CreateCopy on the base
    # would cast through whatever dtype the base happened to use).
    from pyramids.dataset.dataset import Dataset as _Dataset

    datasets = self.datasets
    datasets[key] = _Dataset.create_from_array(
        arr=value,
        geo=self._base.geotransform,
        epsg=self._base.epsg,
        no_data_value=self._base.no_data_value[0],
    )
    # The mutation breaks the disk correspondence for that slot;
    # if the user mutates any timestep, the lazy reductions can no
    # longer trust ``_files``. Drop the path list so they fall
    # through to the in-memory handles instead.
    self._files = None

__len__() #

Number of timesteps in the collection.

Source code in src/pyramids/dataset/collection.py
def __len__(self):
    """Number of timesteps in the collection."""
    return self._time_length

__iter__() #

Iterate over per-timestep arrays (matches the legacy API).

Source code in src/pyramids/dataset/collection.py
def __iter__(self):
    """Iterate over per-timestep arrays (matches the legacy API)."""
    for ds in self.datasets:
        yield ds.read_array(band=0)

head(n=5) #

First n timestep arrays as a 3D numpy slice.

Parameters:

Name Type Description Default
n int

Number of timesteps. Defaults to 5.

5

Returns:

Type Description
ndarray

np.ndarray: (min(n, time_length), rows, cols) array.

Source code in src/pyramids/dataset/collection.py
def head(self, n: int = 5) -> np.ndarray:
    """First ``n`` timestep arrays as a 3D numpy slice.

    Args:
        n (int): Number of timesteps. Defaults to 5.

    Returns:
        np.ndarray: ``(min(n, time_length), rows, cols)`` array.
    """
    return self.values[:n]

tail(n=-5) #

Last -n timestep arrays as a 3D numpy slice.

Matches the legacy signature: a NEGATIVE n (the default -5) means "last 5". Implementation simply does self.values[n:], so a positive n would skip the first n rows instead — that's the legacy behaviour and left as is for back-compat.

Parameters:

Name Type Description Default
n int

Negative integer giving the offset from the end. Defaults to -5 (last 5).

-5

Returns:

Type Description
ndarray

np.ndarray: (abs(n), rows, cols) array (when n < 0).

Source code in src/pyramids/dataset/collection.py
def tail(self, n: int = -5) -> np.ndarray:
    """Last ``-n`` timestep arrays as a 3D numpy slice.

    Matches the legacy signature: a NEGATIVE ``n`` (the default
    ``-5``) means "last 5". Implementation simply does
    ``self.values[n:]``, so a positive ``n`` would skip the first
    ``n`` rows instead — that's the legacy behaviour and left as
    is for back-compat.

    Args:
        n (int): Negative integer giving the offset from the
            end. Defaults to ``-5`` (last 5).

    Returns:
        np.ndarray: ``(abs(n), rows, cols)`` array (when ``n < 0``).
    """
    return self.values[n:]

first() #

First timestep array (2D).

Cheaper than self.values[0] because it only reads one timestep instead of the full cube.

Source code in src/pyramids/dataset/collection.py
def first(self) -> np.ndarray:
    """First timestep array (2D).

    Cheaper than ``self.values[0]`` because it only reads one
    timestep instead of the full cube.
    """
    return self.datasets[0].read_array(band=0)

last() #

Last timestep array (2D).

Cheaper than self.values[-1] because it only reads one timestep instead of the full cube.

Source code in src/pyramids/dataset/collection.py
def last(self) -> np.ndarray:
    """Last timestep array (2D).

    Cheaper than ``self.values[-1]`` because it only reads one
    timestep instead of the full cube.
    """
    return self.datasets[-1].read_array(band=0)

iloc(i) #

Return the Dataset at position i.

Parameters:

Name Type Description Default
i int

Index of the timestep to access.

required

Returns:

Name Type Description
Dataset Dataset

The lazy Dataset handle at position i.

Dataset

Pixel values are not loaded — they're read on demand when

Dataset

the caller invokes a method on the returned Dataset.

Source code in src/pyramids/dataset/collection.py
def iloc(self, i: int) -> Dataset:
    """Return the ``Dataset`` at position ``i``.

    Args:
        i (int):
            Index of the timestep to access.

    Returns:
        Dataset: The lazy ``Dataset`` handle at position ``i``.
        Pixel values are not loaded — they're read on demand when
        the caller invokes a method on the returned Dataset.
    """
    return self.datasets[i]

plot(band=0, exclude_value=None, **kwargs) #

Render the collection as an animated stack of band slices.

- read the values stored in a given band across every
  ``Dataset`` in the collection and hand the resulting
  ``(time, rows, cols)`` array to cleopatra's animation
  path.

Implementation note: this method is a thin caller around the shared :func:pyramids.dataset._plot_helpers.render_array helper. It stacks one band per Dataset into a 3-D array and forwards to render_array(..., mode="animate", animation_axis_values=...). The duplicated ArrayGlyph construction that used to live here is gone — the helper owns the cleopatra dispatch and the same code path serves the single-frame Dataset.plot and the multi-panel NetCDF.plot facets. See :mod:pyramids.dataset._plot_helpers for the three-mode contract.

Parameters:

Name Type Description Default
band int

The band you want to get its data. Default is 0.

0
exclude_value Any

Value to exclude from the plot. Default is None.

None
**kwargs Any
Parameter Type Description
points array 3-column array: col 1 = value to display, col 2 = row index, col 3 = column index. Columns 2 and 3 indicate the location of the point.
point_color str Color of the points.
point_size Any Size of the points.
pid_color str Color of the annotation of the point. Default is blue.
pid_size Any Size of the point annotation.
figsize tuple, optional Figure size. Default is (8, 8).
title str, optional Title of the plot. Default is 'Total Discharge'.
title_size int, optional Title size. Default is 15.
orientation str, optional Orientation of the color bar (horizontal or vertical). Default is 'vertical'.
rotation number, optional Rotation of the color bar label. Default is -90.
cbar_length float, optional Ratio to control the height of the color bar. Default is 0.75.
ticks_spacing int, optional Spacing in the color bar ticks. Default is 2.
cbar_label_size int, optional Size of the color bar label. Default is 12.
cbar_label str, optional Label of the color bar. Default is 'Discharge m³/s'.
color_scale str, optional Color-scale mode (default "linear"): one of "linear", "power", "sym-lognorm", "boundary-norm", "midpoint" (case-insensitive), or a cleopatra.styles.ColorScale member. Integer codes are no longer accepted.
gamma float, optional Exponent for color_scale="power". Default is 1/2.
line_threshold float, optional linthresh for color_scale="sym-lognorm". Default is 0.0001.
line_scale float, optional linscale for color_scale="sym-lognorm". Default is 0.001.
bounds list Discrete bounds for color_scale="boundary-norm". Default is None.
midpoint float, optional Midpoint value for color_scale="midpoint". Default is 0.
cmap str, optional Color map style. Default is 'coolwarm_r'.
display_cell_value bool Whether to display the values of the cells as text.
num_size int, optional Size of the numbers plotted on top of each cell. Default is 8.
background_color_threshold float | int, optional Threshold for deciding number color: if value > threshold -> black; else white. If None, uses max_value/2. Default is None.
{}

Returns:

Name Type Description
ArrayGlyph ArrayGlyph

A plotting/animation handle (from cleopatra.ArrayGlyph).

Source code in src/pyramids/dataset/collection.py
def plot(
    self, band: int = 0, exclude_value: Any | None = None, **kwargs: Any
) -> ArrayGlyph:
    r"""Render the collection as an animated stack of band slices.

        - read the values stored in a given band across every
          ``Dataset`` in the collection and hand the resulting
          ``(time, rows, cols)`` array to cleopatra's animation
          path.

    Implementation note: this method is a thin caller around the
    shared :func:`pyramids.dataset._plot_helpers.render_array`
    helper. It stacks one band per ``Dataset`` into a 3-D array
    and forwards to ``render_array(..., mode="animate",
    animation_axis_values=...)``. The duplicated ``ArrayGlyph``
    construction that used to live here is gone — the helper owns
    the cleopatra dispatch and the same code path serves the
    single-frame ``Dataset.plot`` and the multi-panel
    ``NetCDF.plot`` facets. See
    :mod:`pyramids.dataset._plot_helpers` for the three-mode
    contract.

    Args:
        band (int):
            The band you want to get its data. Default is 0.
        exclude_value (Any):
            Value to exclude from the plot. Default is None.
        **kwargs:
            | Parameter                  | Type                  | Description |
            |----------------------------|-----------------------|-------------|
            | points                     | array                 | 3-column array: col 1 = value to display, col 2 = row index, col 3 = column index. Columns 2 and 3 indicate the location of the point. |
            | point_color                | str                   | Color of the points. |
            | point_size                 | Any                   | Size of the points. |
            | pid_color                  | str                   | Color of the annotation of the point. Default is blue. |
            | pid_size                   | Any                   | Size of the point annotation. |
            | figsize                    | tuple, optional       | Figure size. Default is `(8, 8)`. |
            | title                      | str, optional         | Title of the plot. Default is `'Total Discharge'`. |
            | title_size                 | int, optional         | Title size. Default is `15`. |
            | orientation                | str, optional         | Orientation of the color bar (`horizontal` or `vertical`). Default is `'vertical'`. |
            | rotation                   | number, optional      | Rotation of the color bar label. Default is `-90`. |
            | cbar_length                | float, optional       | Ratio to control the height of the color bar. Default is `0.75`. |
            | ticks_spacing              | int, optional         | Spacing in the color bar ticks. Default is `2`. |
            | cbar_label_size            | int, optional         | Size of the color bar label. Default is `12`. |
            | cbar_label                 | str, optional         | Label of the color bar. Default is `'Discharge m³/s'`. |
            | color_scale                | str, optional         | Color-scale mode (default `"linear"`): one of `"linear"`, `"power"`, `"sym-lognorm"`, `"boundary-norm"`, `"midpoint"` (case-insensitive), or a `cleopatra.styles.ColorScale` member. Integer codes are no longer accepted. |
            | gamma                      | float, optional       | Exponent for `color_scale="power"`. Default is `1/2`. |
            | line_threshold             | float, optional       | `linthresh` for `color_scale="sym-lognorm"`. Default is `0.0001`. |
            | line_scale                 | float, optional       | `linscale` for `color_scale="sym-lognorm"`. Default is `0.001`. |
            | bounds                     | list                  | Discrete bounds for `color_scale="boundary-norm"`. Default is `None`. |
            | midpoint                   | float, optional       | Midpoint value for `color_scale="midpoint"`. Default is `0`. |
            | cmap                       | str, optional         | Color map style. Default is `'coolwarm_r'`. |
            | display_cell_value         | bool                  | Whether to display the values of the cells as text. |
            | num_size                   | int, optional         | Size of the numbers plotted on top of each cell. Default is `8`. |
            | background_color_threshold | float \| int, optional| Threshold for deciding number color: if value > threshold -> black; else white. If `None`, uses `max_value/2`. Default is `None`. |


    Returns:
        ArrayGlyph: A plotting/animation handle (from cleopatra.ArrayGlyph).
    """
    # Materialise the cube on demand for plotting. The render helper
    # expects a single (time, rows, cols) numpy array; reading each
    # Dataset's band into one stacked array is fine for a plot call
    # (the user explicitly asked to render). Delegates the cleopatra
    # call to :func:`render_array` (D-2 — shared with `Analysis.plot`).
    data = np.stack([ds.read_array(band=band) for ds in self.datasets], axis=0)
    exclude_value = (
        [self.base.no_data_value[band], exclude_value]
        if exclude_value is not None
        else [self.base.no_data_value[band]]
    )
    return render_array(
        arr=data,
        exclude_value=exclude_value,
        mode="animate",
        animation_axis_values=list(range(self.time_length)),
        **kwargs,
    )

to_file(path, driver='geotiff', band=0) #

Save to geotiff format.

saveRaster saves a raster to a path

Parameters:

Name Type Description Default
path str | list[str]

a path includng the name of the raster and extention.

required
driver str

driver = "geotiff".

'geotiff'
band int

band index, needed only in case of ascii drivers. Default is 1.

0

Examples:

  • Save to a file:
>>> raster_obj = Dataset.read_file("path/to/file/***.tif")
>>> output_path = "examples/GIS/data/save_raster_test.tif"
>>> raster_obj.to_file(output_path)
Source code in src/pyramids/dataset/collection.py
def to_file(
    self,
    path: str | Path | list[str | Path],
    driver: str = "geotiff",
    band: int = 0,
):
    """Save to geotiff format.

        saveRaster saves a raster to a path

    Args:
        path (str | list[str]):
            a path includng the name of the raster and extention.
        driver (str):
            driver = "geotiff".
        band (int):
            band index, needed only in case of ascii drivers. Default is 1.

    Examples:
        - Save to a file:

          ```python
          >>> raster_obj = Dataset.read_file("path/to/file/***.tif")
          >>> output_path = "examples/GIS/data/save_raster_test.tif"
          >>> raster_obj.to_file(output_path)

          ```
    """
    ext = CATALOG.get_extension(driver)

    if isinstance(path, (str, Path)):
        path = Path(path)
        if not path.exists():
            path.mkdir(parents=True, exist_ok=True)
        path = [str(path / f"{i}.{ext}") for i in range(self.time_length)]
    else:
        if not len(path) == self.time_length:
            raise ValueError(
                f"Length of the given paths: {len(path)} does not equal number of rasters in the data cube: {self.time_length}"
            )
        path_list = [Path(p) for p in path]
        parent = path_list[0].parent
        if not parent.exists():
            parent.mkdir(parents=True, exist_ok=True)

    for i in range(self.time_length):
        src = self.iloc(i)
        arr = src.read_array()
        transient = Dataset.create_from_array(
            arr=arr,
            geo=src.geotransform,
            epsg=src.epsg,
            no_data_value=src.no_data_value[0],
        )
        transient.to_file(path[i], band=band)
        transient.close()
    self._datasets = None

to_cog_stack(directory, *, pattern='{name}_{i:04d}.tif', name='slice', overwrite=False, **cog_kwargs) #

Export each time slice of the collection as an individual COG.

Parameters:

Name Type Description Default
directory str | Path

Output directory; created if missing.

required
pattern str

Filename template. Placeholders:

  • {name} — the name argument (default 'slice');
  • {i} — zero-padded integer index.

The {t} placeholder is reserved for a future task that adds a time-coordinate axis; using it now raises :class:ValueError.

'{name}_{i:04d}.tif'
name str

Replacement for the {name} placeholder.

'slice'
overwrite bool

If False, raise :class:FileExistsError when a target path already exists.

False
**cog_kwargs Any

Forwarded verbatim to :meth:pyramids.dataset.engines.COG.to_cog.

{}

Returns:

Type Description
list[Path]

List of written file paths, in temporal (index) order.

Raises:

Type Description
DatasetNotFoundError

:meth:open_multi_dataset has not been called, so per-slice arrays are not loaded.

ValueError

{t} placeholder used but no time coord is available.

FileExistsError

overwrite=False and a target path exists.

Examples:

  • Default naming — one COG per slice:
    >>> dc.to_cog_stack("out/", compress="ZSTD")  # doctest: +SKIP
    [PosixPath('out/slice_0000.tif'), ..., PosixPath('out/slice_0002.tif')]
    
  • Custom filename pattern and name prefix:
    >>> dc.to_cog_stack(  # doctest: +SKIP
    ...     "band4/",
    ...     pattern="B04_{i:03d}.tif",
    ...     name="B04",
    ... )
    [PosixPath('band4/B04_000.tif'), ...]
    
  • Overwrite existing outputs and forward COG options:
    >>> dc.to_cog_stack(  # doctest: +SKIP
    ...     "out/",
    ...     overwrite=True,
    ...     compress="DEFLATE",
    ...     blocksize=256,
    ... )
    
Source code in src/pyramids/dataset/collection.py
def to_cog_stack(
    self,
    directory: str | Path,
    *,
    pattern: str = "{name}_{i:04d}.tif",
    name: str = "slice",
    overwrite: bool = False,
    **cog_kwargs: Any,
) -> list[Path]:
    """Export each time slice of the collection as an individual COG.

    Args:
        directory: Output directory; created if missing.
        pattern: Filename template. Placeholders:

            - `{name}` — the `name` argument (default `'slice'`);
            - `{i}` — zero-padded integer index.

            The `{t}` placeholder is reserved for a future task
            that adds a time-coordinate axis; using it now raises
            :class:`ValueError`.
        name: Replacement for the `{name}` placeholder.
        overwrite: If `False`, raise :class:`FileExistsError`
            when a target path already exists.
        **cog_kwargs: Forwarded verbatim to
            :meth:`pyramids.dataset.engines.COG.to_cog`.

    Returns:
        List of written file paths, in temporal (index) order.

    Raises:
        DatasetNotFoundError: :meth:`open_multi_dataset` has not been
            called, so per-slice arrays are not loaded.
        ValueError: `{t}` placeholder used but no time coord is
            available.
        FileExistsError: `overwrite=False` and a target path exists.

    Examples:
        - Default naming — one COG per slice:
            ```python
            >>> dc.to_cog_stack("out/", compress="ZSTD")  # doctest: +SKIP
            [PosixPath('out/slice_0000.tif'), ..., PosixPath('out/slice_0002.tif')]

            ```
        - Custom filename pattern and name prefix:
            ```python
            >>> dc.to_cog_stack(  # doctest: +SKIP
            ...     "band4/",
            ...     pattern="B04_{i:03d}.tif",
            ...     name="B04",
            ... )
            [PosixPath('band4/B04_000.tif'), ...]

            ```
        - Overwrite existing outputs and forward COG options:
            ```python
            >>> dc.to_cog_stack(  # doctest: +SKIP
            ...     "out/",
            ...     overwrite=True,
            ...     compress="DEFLATE",
            ...     blocksize=256,
            ... )

            ```
    """
    # Check the backing attribute directly rather than going through
    # the `values` property: the property getter raises AttributeError
    # on unpopulated collections, which hasattr catches silently, but
    # a future refactor that changes the exception type would break
    if "{t}" in pattern:
        raise ValueError(
            "{t} placeholder not yet supported; DatasetCollection has "
            "no time-axis coord. Use {i} for integer index."
        )

    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths: list[Path] = []
    for i in range(self.time_length):
        filename = pattern.format(name=name, i=i)
        target = out_dir / filename
        if target.exists() and not overwrite:
            raise FileExistsError(
                f"{target} exists; pass overwrite=True to replace."
            )
        slice_ds = self.iloc(i)
        slice_ds.to_cog(target, **cog_kwargs)
        paths.append(target)
    return paths

to_crs(to_epsg=3857, method='nearest neighbor', maintain_alignment=False, inplace=False) #

Reproject every timestep to a target EPSG.

Parameters:

Name Type Description Default
to_epsg int

Reference number to the new projection (https://epsg.io/) (default 3857, WGS84 web mercator).

3857
method str

Resampling technique. Default is "nearest neighbor". See https://gisgeography.com/raster-resampling/. Accepted values are "nearest neighbor", "cubic", "bilinear".

'nearest neighbor'
maintain_alignment bool

True to maintain the number of rows and columns of the raster the same after reprojection. Default is False.

False
inplace bool

If True, mutate this collection in place and return None. If False (default), return a new DatasetCollection.

False

Returns:

Type Description
DatasetCollection | None

DatasetCollection | None: New collection when

DatasetCollection | None

inplace=False; None when inplace=True.

Examples:

  • Reproject every timestep to EPSG:3857 and keep the result:

>>> reprojected = collection.to_crs(to_epsg=3857)  # doctest: +SKIP
- Reproject in place:

>>> collection.to_crs(to_epsg=3857, inplace=True)  # doctest: +SKIP
Source code in src/pyramids/dataset/collection.py
def to_crs(
    self,
    to_epsg: int = 3857,
    method: str = "nearest neighbor",
    maintain_alignment: bool = False,
    inplace: bool = False,
) -> DatasetCollection | None:
    """Reproject every timestep to a target EPSG.

    Args:
        to_epsg (int):
            Reference number to the new projection (https://epsg.io/)
            (default 3857, WGS84 web mercator).
        method (str):
            Resampling technique. Default is "nearest neighbor". See
            https://gisgeography.com/raster-resampling/. Accepted
            values are "nearest neighbor", "cubic", "bilinear".
        maintain_alignment (bool):
            True to maintain the number of rows and columns of the
            raster the same after reprojection. Default is False.
        inplace (bool):
            If True, mutate this collection in place and return None.
            If False (default), return a new `DatasetCollection`.

    Returns:
        DatasetCollection | None: New collection when
        `inplace=False`; `None` when `inplace=True`.

    Examples:
        - Reproject every timestep to EPSG:3857 and keep the result:

          ```python
          >>> reprojected = collection.to_crs(to_epsg=3857)  # doctest: +SKIP

          ```
        - Reproject in place:

          ```python
          >>> collection.to_crs(to_epsg=3857, inplace=True)  # doctest: +SKIP

          ```
    """
    new_datasets = self._apply_per_timestep(
        "to_crs",
        to_epsg,
        method=method,
        maintain_alignment=maintain_alignment,
    )
    return self._finalize_per_timestep_result(new_datasets, inplace=inplace)

crop(mask=None, inplace=False, touch=True, *, bbox=None, epsg=None) #

Crop every timestep against mask or a bbox.

Parameters:

Name Type Description Default
mask Dataset | None

Dataset object of the mask raster to crop the rasters (to get the NoData value and its location in the array). Mask should include the name of the raster and the extension like "data/dem.tif", or you can read the mask raster using gdal and use it as the first parameter to the function. Mutually exclusive with bbox; exactly one of the two must be supplied.

None
inplace bool

If True, mutate this collection in place and return None. If False (default), return a new DatasetCollection.

False
touch bool

Include the cells that touch the polygon, not only those that lie entirely inside the polygon mask. Default is True.

True
bbox (tuple[float, float, float, float] | None, keyword - only)

(west, south, east, north) quadruple in the CRS named by epsg. Internally wrapped in a one-row :class:FeatureCollection (built once and reused across timesteps). Mutually exclusive with mask.

None
epsg (Any, keyword - only)

CRS for bbox — anything geopandas accepts. Defaults to the collection's own CRS.

None

Returns:

Type Description
DatasetCollection | None

DatasetCollection | None: New collection when

DatasetCollection | None

inplace=False; None when inplace=True.

Examples:

  • Crop aligned rasters using a DEM mask:
>>> dem_path = "examples/GIS/data/acc4000.tif"
>>> src_path = "examples/GIS/data/aligned_rasters/"
>>> out_path = "examples/GIS/data/crop_aligned_folder/"
>>> DatasetCollection.crop(dem_path, src_path, out_path)
  • Crop every timestep using a (W, S, E, N) bbox tuple — the FC is built once and reused across timesteps:
>>> import os, tempfile
>>> import numpy as np
>>> from pyramids.dataset import Dataset, DatasetCollection
>>> d = tempfile.mkdtemp()
>>> paths = []
>>> for t in range(2):
...     p = os.path.join(d, f"t{t}.tif")
...     _ = Dataset.create_from_array(
...         (np.arange(100, dtype="int16").reshape(10, 10) * (t + 1)),
...         top_left_corner=(0, 0), cell_size=0.05, epsg=4326, path=p,
...     ).close()
...     paths.append(p)
>>> col = DatasetCollection.from_files(paths)
>>> cropped = col.crop(bbox=(0.1, -0.2, 0.2, -0.1))
>>> cropped.time_length
2
>>> cropped.base.shape
(1, 2, 2)
Source code in src/pyramids/dataset/collection.py
def crop(
    self,
    mask: Dataset | str | None = None,
    inplace: bool = False,
    touch: bool = True,
    *,
    bbox: tuple[float, float, float, float] | list[float] | None = None,
    epsg: Any = None,
) -> DatasetCollection | None:
    """Crop every timestep against ``mask`` or a ``bbox``.

    Args:
        mask (Dataset | None):
            Dataset object of the mask raster to crop the rasters (to get
            the NoData value and its location in the array). Mask should
            include the name of the raster and the extension like
            "data/dem.tif", or you can read the mask raster using gdal
            and use it as the first parameter to the function. Mutually
            exclusive with ``bbox``; exactly one of the two must be
            supplied.
        inplace (bool):
            If True, mutate this collection in place and return None.
            If False (default), return a new `DatasetCollection`.
        touch (bool):
            Include the cells that touch the polygon, not only those that lie entirely inside the polygon mask.
            Default is True.
        bbox (tuple[float, float, float, float] | None, keyword-only):
            ``(west, south, east, north)`` quadruple in the CRS named by
            ``epsg``. Internally wrapped in a one-row
            :class:`FeatureCollection` (built once and reused across
            timesteps). Mutually exclusive with ``mask``.
        epsg (Any, keyword-only):
            CRS for ``bbox`` — anything ``geopandas`` accepts. Defaults to
            the collection's own CRS.

    Returns:
        DatasetCollection | None: New collection when
        `inplace=False`; `None` when `inplace=True`.

    Examples:
        - Crop aligned rasters using a DEM mask:

          ```python
          >>> dem_path = "examples/GIS/data/acc4000.tif"
          >>> src_path = "examples/GIS/data/aligned_rasters/"
          >>> out_path = "examples/GIS/data/crop_aligned_folder/"
          >>> DatasetCollection.crop(dem_path, src_path, out_path)

          ```

        - Crop every timestep using a ``(W, S, E, N)`` bbox tuple — the FC
          is built once and reused across timesteps:

          ```python
          >>> import os, tempfile
          >>> import numpy as np
          >>> from pyramids.dataset import Dataset, DatasetCollection
          >>> d = tempfile.mkdtemp()
          >>> paths = []
          >>> for t in range(2):
          ...     p = os.path.join(d, f"t{t}.tif")
          ...     _ = Dataset.create_from_array(
          ...         (np.arange(100, dtype="int16").reshape(10, 10) * (t + 1)),
          ...         top_left_corner=(0, 0), cell_size=0.05, epsg=4326, path=p,
          ...     ).close()
          ...     paths.append(p)
          >>> col = DatasetCollection.from_files(paths)
          >>> cropped = col.crop(bbox=(0.1, -0.2, 0.2, -0.1))
          >>> cropped.time_length
          2
          >>> cropped.base.shape
          (1, 2, 2)

          ```
    """
    if bbox is not None:
        if mask is not None:
            raise ValueError("crop accepts either `mask` or `bbox`, not both")
        crs = epsg if epsg is not None else self._base.epsg
        mask = FeatureCollection.from_bbox(bbox, epsg=crs)
    if mask is None:
        raise TypeError(
            "crop requires a `mask` or a `bbox` (west, south, east, north)"
        )
    new_datasets = self._apply_per_timestep("crop", mask, touch=touch)
    return self._finalize_per_timestep_result(new_datasets, inplace=inplace)

align(alignment_src, inplace=False) #

Align every timestep to alignment_src.

Matches the coordinate system, the number of rows and columns, and the cell size of every timestep raster to alignment_src.

Parameters:

Name Type Description Default
alignment_src Dataset

Dataset to use as the spatial template (CRS, rows, columns).

required
inplace bool

If True, mutate this collection in place and return None. If False (default), return a new DatasetCollection.

False

Returns:

Type Description
DatasetCollection | None

DatasetCollection | None: New collection when

DatasetCollection | None

inplace=False; None when inplace=True.

Examples:

  • Align every timestep to a DEM template:
>>> aligned = collection.align(dem_dataset)  # doctest: +SKIP
Source code in src/pyramids/dataset/collection.py
def align(
    self, alignment_src: Dataset, inplace: bool = False
) -> DatasetCollection | None:
    """Align every timestep to `alignment_src`.

    Matches the coordinate system, the number of rows and columns,
    and the cell size of every timestep raster to `alignment_src`.

    Args:
        alignment_src (Dataset):
            Dataset to use as the spatial template (CRS, rows, columns).
        inplace (bool):
            If True, mutate this collection in place and return None.
            If False (default), return a new `DatasetCollection`.

    Returns:
        DatasetCollection | None: New collection when
        `inplace=False`; `None` when `inplace=True`.

    Examples:
        - Align every timestep to a DEM template:

          ```python
          >>> aligned = collection.align(dem_dataset)  # doctest: +SKIP

          ```
    """
    if not isinstance(alignment_src, Dataset):
        raise TypeError("alignment_src input should be a Dataset object")
    new_datasets = self._apply_per_timestep("align", alignment_src)
    return self._finalize_per_timestep_result(new_datasets, inplace=inplace)

merge(dst, no_data_value='0', init='nan', n='nan', method='last') #

Merge this collection's timesteps into one raster.

File-backed collections merge their on-disk paths directly. In-memory collections (legacy DatasetCollection(src, time_length=N) constructions, anything produced by crop(inplace=False) / apply() / to_crs(inplace=False) / align(inplace=False)) are first staged through a temp directory, merged, and the staging directory is removed before the call returns.

Parameters:

Name Type Description Default
dst str | Path

Path to the output raster.

required
no_data_value float | int | str

Assign a specified nodata value to output bands.

'0'
init float | int | str

Pre-initialize the output image bands with these values. However, it is not marked as the nodata value in the output file. If only one value is given, the same value is used in all the bands.

'nan'
n float | int | str

Ignore pixels from files being merged in with this pixel value.

'nan'
method str

Overlap-resolution rule passed to :func:~pyramids.dataset.merge.merge_rasters: one of "first", "last" (default), "min", "max", "sum".

'last'

Returns:

Type Description
None

None

Source code in src/pyramids/dataset/collection.py
def merge(
    self,
    dst: str | Path,
    no_data_value: float | int | str = "0",
    init: float | int | str = "nan",
    n: float | int | str = "nan",
    method: str = "last",
) -> None:
    """Merge this collection's timesteps into one raster.

    File-backed collections merge their on-disk paths directly.
    In-memory collections (legacy `DatasetCollection(src,
    time_length=N)` constructions, anything produced by
    `crop(inplace=False)` / `apply()` / `to_crs(inplace=False)` /
    `align(inplace=False)`) are first staged through a temp
    directory, merged, and the staging directory is removed
    before the call returns.

    Args:
        dst (str | Path):
            Path to the output raster.
        no_data_value (float | int | str):
            Assign a specified nodata value to output bands.
        init (float | int | str):
            Pre-initialize the output image bands with these
            values. However, it is not marked as the nodata
            value in the output file. If only one value is
            given, the same value is used in all the bands.
        n (float | int | str):
            Ignore pixels from files being merged in with this
            pixel value.
        method (str):
            Overlap-resolution rule passed to
            :func:`~pyramids.dataset.merge.merge_rasters`: one of
            ``"first"``, ``"last"`` (default), ``"min"``, ``"max"``,
            ``"sum"``.

    Returns:
        None
    """
    if self._files:
        merge_rasters(
            self._files,
            dst,
            no_data_value=no_data_value,
            init=init,
            n=n,
            method=method,
        )
        return
    # In-memory collection (legacy `DatasetCollection(src,
    # time_length=N)` or anything returned by
    # `crop(inplace=False)` / `apply()` / `to_crs(inplace=False)` /
    # `align(inplace=False)`). Stage each timestep through a
    # tempfile, merge the temp paths, then drop the staging
    # directory. The tempfile pass is unavoidable: gdal_merge /
    # BuildVRT both take on-disk paths.
    with tempfile.TemporaryDirectory(prefix="pyramids-merge-") as staging:
        staging_path = Path(staging)
        self.to_file(staging_path, driver="geotiff")
        staged_files = sorted(staging_path.glob("*.tif"))
        merge_rasters(
            [str(p) for p in staged_files],
            dst,
            no_data_value=no_data_value,
            init=init,
            n=n,
            method=method,
        )

apply(ufunc, *, inplace=False) #

Apply a function to every timestep raster.

Each timestep Dataset.apply(ufunc) runs over the in-domain cells of its band; the result is a new Dataset. The list of new Datasets is wrapped in a new collection (out-of-place) or replaces this collection's handles (inplace).

Out-of-place is the default — the previous in-place signature mutated a shared numpy cube; with the Dataset-list backing there is no shared cube to mutate and per-timestep ops always produce a new Dataset.

Parameters:

Name Type Description Default
ufunc Callable

Callable universal function (builtin or user defined). See https://numpy.org/doc/stable/reference/ufuncs.html To create a ufunc from a normal function: https://numpy.org/doc/stable/reference/generated/numpy.frompyfunc.html

required
inplace bool

When True, replace this collection's per-timestep Dataset handles with the new outputs and return None. When False (default), return a new DatasetCollection wrapping the new outputs.

False

Returns:

Type Description
DatasetCollection | None

DatasetCollection | None: New collection when

DatasetCollection | None

inplace=False; None when inplace=True.

Examples:

  • Apply a simple modulo operation to each value:
>>> def func(val):
...    return val % 2
>>> ufunc = np.frompyfunc(func, 1, 1)
>>> result = collection.apply(ufunc)  # doctest: +SKIP
Source code in src/pyramids/dataset/collection.py
def apply(
    self, ufunc: Callable, *, inplace: bool = False
) -> DatasetCollection | None:
    """Apply a function to every timestep raster.

    Each timestep ``Dataset.apply(ufunc)`` runs over the
    in-domain cells of its band; the result is a new
    ``Dataset``. The list of new ``Datasets`` is wrapped in a
    new collection (out-of-place) or replaces this collection's
    handles (inplace).

    Out-of-place is the default — the previous in-place
    signature mutated a shared numpy cube; with the
    ``Dataset``-list backing there is no shared cube to mutate
    and per-timestep ops always produce a new ``Dataset``.

    Args:
        ufunc (Callable):
            Callable universal function (builtin or user defined). See
            https://numpy.org/doc/stable/reference/ufuncs.html
            To create a ufunc from a normal function: https://numpy.org/doc/stable/reference/generated/numpy.frompyfunc.html
        inplace (bool):
            When True, replace this collection's per-timestep
            ``Dataset`` handles with the new outputs and return
            ``None``. When False (default), return a new
            ``DatasetCollection`` wrapping the new outputs.

    Returns:
        DatasetCollection | None: New collection when
        ``inplace=False``; ``None`` when ``inplace=True``.

    Examples:
        - Apply a simple modulo operation to each value:

          ```python
          >>> def func(val):
          ...    return val % 2
          >>> ufunc = np.frompyfunc(func, 1, 1)
          >>> result = collection.apply(ufunc)  # doctest: +SKIP

          ```
    """
    if not callable(ufunc):
        raise TypeError("The Second argument should be a function")
    new_datasets = self._apply_per_timestep("apply", ufunc)
    return self._finalize_per_timestep_result(new_datasets, inplace=inplace)

overlay(classes_map, exclude_value=None) #

Overlay.

Parameters:

Name Type Description Default
classes_map Dataset

Dataset object for the raster that has classes to overlay with.

required
exclude_value float | int

Values to exclude from extracted values. Defaults to None.

None

Returns:

Type Description
dict[float, list[float]]

dict[float, list[float]]: Dictionary with a list of values in the basemap as keys and for each key a list of all the intersected values in the maps from the path.

Source code in src/pyramids/dataset/collection.py
def overlay(
    self,
    classes_map,
    exclude_value: float | int | None = None,
) -> dict[float, list[float]]:
    """Overlay.

    Args:
        classes_map (Dataset):
            Dataset object for the raster that has classes to overlay with.
        exclude_value (float | int, optional):
            Values to exclude from extracted values. Defaults to None.

    Returns:
        dict[float, list[float]]:
            Dictionary with a list of values in the basemap as keys and for each key a list of all the
            intersected values in the maps from the path.
    """
    values: dict[Any, list[float]] = {}
    for ds in self.datasets:
        dict_i = ds.overlay(classes_map, exclude_value)

        # these are the distinct values from the BaseMap which are keys in the
        # values dict with each one having a list of values
        for class_i, vals in dict_i.items():
            values.setdefault(class_i, []).extend(vals)

    return values