Catalog

Catalog construction has two separable concerns:

  1. Fetch — query a STAC endpoint (CMR-STAC) for what / when / where → a Catalog (a stac-geoparquet table of granule metadata, reusable across many grids).
  2. Shard map — take a Catalog plus an output grid → a ShardMap: the work-distribution manifest mapping shard keys to granules.

The CLI chains them, building the output grid from the same pipeline config the aggregator uses, so a shard map can never be built against a different grid than the run (enforced at run time via grid.signature()).

Building a shard map (CLI)

# HEALPix grid from atl06.yaml, an ICESat-2 cycle, Antarctic polygon:
python -m zagg.catalog --config atl06.yaml --short-name ATL06 --cycle 22 \
    --polygon antarctica.geojson

# Rectilinear (UTM) grid from a config, explicit dates, over a bbox:
python -m zagg.catalog --config serc_atl03.yaml --short-name ATL03 \
    --start-date 2025-01-01 --end-date 2025-12-31 \
    --bbox=-76.62107,38.84504,-76.50583,38.93512

# Persist the fetched Catalog too (reusable for other grids):
python -m zagg.catalog --config atl06.yaml --short-name ATL06 --cycle 22 \
    --polygon antarctica.geojson --catalog-out cycle22.parquet

--polygon drives both the CMR query bbox and the coverage mask; --bbox gives the query box directly (coverage falls back to that rectangle). The geometry backend (--backend) defaults to auto: exact-S2 spherely if the spherely fork (with SpatialIndex) is installed separately, else mortie (HEALPix) / shapely (rectilinear).

Endpoint selection (S3 vs HTTPS) is not made here — each granule record keeps both hrefs, and the aggregator picks one at run time via data_source.driver.
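To illustrate what endpoint-neutral records allow, the sketch below picks one href from a granule record at run time. The pick_href helper and the driver names "s3" and "https" are assumptions made for this example, not zagg's API; only the record keys come from the reference below.

```python
def pick_href(granule: dict, driver: str) -> str:
    """Return the href matching the requested driver (hypothetical helper)."""
    if driver not in ("s3", "https"):
        raise ValueError(f"unknown driver: {driver!r}")
    href = granule.get(driver)
    if href is None:
        # Either href may be missing from a record, so fail loudly.
        raise KeyError(f"granule {granule.get('id')!r} has no {driver} href")
    return href


# Placeholder record; the id and URLs are made up for illustration.
granule = {
    "id": "example-granule",
    "s3": "s3://example-bucket/example.h5",
    "https": "https://example.invalid/example.h5",
}
s3_href = pick_href(granule, "s3")
https_href = pick_href(granule, "https")
```

Because both hrefs travel with every record, the same shard map can serve an in-region S3 run and an HTTPS run without being rebuilt.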

Fetch

zagg.catalog.sources.Query dataclass

A spatiotemporal metadata query: what, when, where.

Parameters:

  • short_name (str) –

    Product short name (e.g. "ATL03").

  • version (str) –

    Product version (e.g. "007").

  • start_date (str) –

    Inclusive start date, YYYY-MM-DD.

  • end_date (str) –

    Inclusive end date, YYYY-MM-DD.

  • region (tuple or str) –

    Either a (lon_min, lat_min, lon_max, lat_max) bbox or a path to a GeoJSON file (its bounding box is used for the STAC query).

  • provider (str, default: 'NSIDC_CPRD' ) –

    CMR provider / STAC sub-catalog. Default "NSIDC_CPRD".

collection property

collection: str

CMR-STAC collection id, {short_name}_{version}.
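The fields and the derived collection id can be pictured with a small stand-in dataclass. QuerySketch is a hypothetical name used only here; it mirrors the documented parameters but is not the zagg.catalog.sources.Query implementation.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class QuerySketch:
    """Stand-in for the documented Query fields (not the real class)."""

    short_name: str
    version: str
    start_date: str  # inclusive, YYYY-MM-DD
    end_date: str  # inclusive, YYYY-MM-DD
    region: Union[tuple, str]  # bbox tuple or path to a GeoJSON file
    provider: str = "NSIDC_CPRD"

    @property
    def collection(self) -> str:
        # Documented format: {short_name}_{version}
        return f"{self.short_name}_{self.version}"


query = QuerySketch(
    short_name="ATL03",
    version="007",
    start_date="2025-01-01",
    end_date="2025-12-31",
    region=(-76.62107, 38.84504, -76.50583, 38.93512),
)
collection_id = query.collection
```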

zagg.catalog.sources.CMRSource

Fetch granule metadata from NASA's CMR-STAC endpoint.

Parameters:

  • provider (str, default: None ) –

    Overrides the query provider for the STAC sub-catalog URL.

  • timeout (int, default: 60 ) –

    Per-request timeout in seconds.

fetch

fetch(
    query: Query, *, preserve_thumbnails: bool = False, limit: int = 2000
) -> "Catalog"

Run query against CMR-STAC and return a Catalog.

Parameters:

  • query (Query) –

    What/when/where to fetch.

  • preserve_thumbnails (bool, default: False ) –

    Keep thumbnail_*/browse assets (default drops them).

  • limit (int, default: 2000 ) –

    Page size hint; CMR clamps it and paging follows rel=next.

Returns:

  • Catalog
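Paging that follows rel=next can be sketched as below. This is a generic STAC item-collection loop under stated assumptions, not CMRSource's code: iter_items and get_page are hypothetical names, and the pages are in-memory dicts so the example runs without network access.

```python
from typing import Callable, Iterator, Optional


def iter_items(first_url: str, get_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield items from every page, following the link whose rel is "next"."""
    url: Optional[str] = first_url
    while url is not None:
        page = get_page(url)
        yield from page.get("features", [])
        # Stop when the page carries no "next" link.
        url = next(
            (link["href"] for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )


# Fake two-page response keyed by URL (placeholder data).
PAGES = {
    "page1": {
        "features": [{"id": "g1"}, {"id": "g2"}],
        "links": [{"rel": "next", "href": "page2"}],
    },
    "page2": {"features": [{"id": "g3"}], "links": []},
}
ids = [item["id"] for item in iter_items("page1", PAGES.__getitem__)]
```

Since the server may clamp the requested page size, a loop of this shape relies on the next link rather than on counting items per page.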

zagg.catalog.sources.Catalog dataclass

Fetched granule metadata: a stac-geoparquet table + provenance.

Reusable across many ShardMap builds. Endpoint-neutral -- each granule carries both its S3 and HTTPS .h5 hrefs.

Parameters:

  • table (Table) –

    stac-geoparquet table (one row per granule).

  • metadata (dict, default: dict() ) –

    Query provenance (product, version, bbox, dates, ...).

from_geoparquet classmethod

from_geoparquet(path: str) -> 'Catalog'

Load a catalog from a stac-geoparquet file (CMR or user-supplied).

granule_records

granule_records() -> list[dict]

Decode the table into per-granule dicts for ShardMap building.

Returns:

  • list of dict

    Each: {"id", "s3", "https", "lats", "lons"} where lats/lons are the footprint exterior-ring coordinate arrays (WGS84) and s3/https are the data-asset hrefs (either may be None).

to_geoparquet

to_geoparquet(path: str) -> None

Write the catalog to a stac-geoparquet file.

stac_geoparquet rewrites schema metadata with only the GeoParquet geo key, so we reopen and merge zagg provenance back in (keeping geo intact) before the final write.
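The merge step can be pictured with plain dicts standing in for Parquet schema metadata, which maps bytes keys to bytes values. The b"zagg" key name and the merge_schema_metadata helper are assumptions for illustration; the actual key zagg writes is not stated here.

```python
import json


def merge_schema_metadata(existing: dict, provenance: dict, key: bytes = b"zagg") -> dict:
    """Return a copy of `existing` with provenance added under `key`."""
    merged = dict(existing)  # keeps b"geo" and any other existing entries
    merged[key] = json.dumps(provenance, sort_keys=True).encode("utf-8")
    return merged


# Metadata as it might look after the first write: only the geo key survives.
existing = {b"geo": b'{"version": "1.1.0"}'}
provenance = {"product": "ATL06", "version": "007"}
merged = merge_schema_metadata(existing, provenance)
```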

Shard map

zagg.catalog.shardmap.ShardMap dataclass

Work-distribution manifest: shard key -> granules, tied to one grid.

Parameters:

  • grid_signature (dict) –

    grid.spatial_signature() at build time -- the spatial layout only (#89). The runner checks it against the run grid's spatial signature so a map can't be paired with a mismatched spatial grid, while staying reusable across configs that differ only in aggregation fields. (Kept as grid_signature for back-compat; old maps carry the full signature and still validate via a spatial-subset projection.)

  • shard_keys (list of int) –

    Sorted shard keys with at least one granule.

  • granules (list of list of dict) –

    Parallel to shard_keys. Each granule is {"id", "s3", "https"} (option C -- self-contained, endpoint-neutral).

  • metadata (dict, default: dict() ) –

    Provenance copied from the Catalog plus backend/timing info.
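One way to read the grid_signature check, including the back-compat case, is as a subset comparison: every key of the run grid's spatial signature must be present and equal in the stored signature, and extra keys in an older full signature are ignored. The helper name and all signature keys below are hypothetical; the runner's actual logic may differ.

```python
def spatial_signature_matches(stored: dict, run_spatial: dict) -> bool:
    """True when `stored` agrees with `run_spatial` on every spatial key."""
    return all(
        key in stored and stored[key] == value
        for key, value in run_spatial.items()
    )


# Hypothetical signatures; the real key names are not documented here.
run_spatial = {"kind": "healpix", "parent_order": 6}
new_map_sig = {"kind": "healpix", "parent_order": 6}
old_map_sig = {"kind": "healpix", "parent_order": 6, "fields": ["h_li_mean"]}
wrong_sig = {"kind": "healpix", "parent_order": 7}

new_ok = spatial_signature_matches(new_map_sig, run_spatial)
old_ok = spatial_signature_matches(old_map_sig, run_spatial)
wrong_ok = spatial_signature_matches(wrong_sig, run_spatial)
```

Under this reading, changing only aggregation fields leaves the map valid, while changing the spatial layout invalidates it.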

build classmethod

build(
    catalog,
    grid,
    *,
    region=None,
    backend: str = "auto",
    mortie_order: int | None = None,
    footprint: str = "swath",
) -> "ShardMap"

Build a ShardMap from a Catalog and an output grid.

Parameters:

  • catalog (Catalog) –

    Fetched granule metadata (provides granule_records()).

  • grid (OutputGrid) –

    Output grid (provides coverage, shard_footprint, spatial_signature).

  • region (list of (lats, lons), default: None ) –

    Coverage mask in WGS84. Defaults to the catalog bbox rectangle.

  • backend (('auto', 'spherely', 'mortie'), default: "auto" ) –

    Geometry backend. "auto" -> spherely when importable, else mortie for HEALPix grids (non-HEALPix grids require spherely and raise an ImportError with an install pointer when it is absent).

  • mortie_order (int, default: None ) –

    MOC order for the mortie backend. None (default) pins it to the grid's inner-chunk order grid.chunk_order (the chunk_inner order, defaulting to parent_order when unset), clamped to mortie's order-18 coverage cap -- the dispatch chunk's own resolution, enough to keep moc_to_order from upsampling a footprint onto neighbor shards (#92) at near-minimal compute. Raises if the resolved order is coarser than parent_order.

  • footprint (('swath', 'beams'), default: "swath" ) –

    Granule footprint used for intersection. "swath" (default) uses the raw CMR polygon. "beams" decomposes ICESat-2 ATL03/06 swaths into per-beam-pair corridors so granules stop being assigned to shards their beams never cross (issue #65); non-beam products fall back to the swath ring.

    Deprecated: the "beams" corridor mechanism is a stopgap (see beams.py); remove it once native per-beam CMR geometry, the memory-handling robustness in #66, or data virtualization (#97) lands.

Returns:

  • ShardMap
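The mortie_order rule described above can be sketched as a small resolver. This is a reading of the parameter description, not the library's code: resolve_mortie_order is a hypothetical name, and applying the order-18 clamp to explicit values as well as to the default is an assumption.

```python
from typing import Optional

MORTIE_MAX_ORDER = 18  # coverage cap quoted in the parameter description


def resolve_mortie_order(
    mortie_order: Optional[int],
    parent_order: int,
    chunk_inner: Optional[int] = None,
) -> int:
    """Resolve the MOC order: default to the chunk order, clamp, then validate."""
    # chunk_order falls back to parent_order when chunk_inner is unset.
    chunk_order = parent_order if chunk_inner is None else chunk_inner
    resolved = chunk_order if mortie_order is None else mortie_order
    resolved = min(resolved, MORTIE_MAX_ORDER)
    # A lower HEALPix order means coarser cells.
    if resolved < parent_order:
        raise ValueError(
            f"resolved order {resolved} is coarser than parent_order {parent_order}"
        )
    return resolved


default_order = resolve_mortie_order(None, parent_order=6)
inner_order = resolve_mortie_order(None, parent_order=6, chunk_inner=10)
clamped_order = resolve_mortie_order(None, parent_order=6, chunk_inner=20)
```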

from_json classmethod

from_json(path: str) -> 'ShardMap'

Load a manifest from JSON.

to_json

to_json(path: str) -> None

Write the manifest as JSON.
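The parallel-list layout survives a JSON round trip without extra work, as the sketch below shows with plain dicts. The on-disk key names mirror the dataclass fields, which is an assumption about to_json; the shard assignment, signature contents, and granule records are placeholder data.

```python
import json


def to_manifest_entry(record: dict) -> dict:
    """Keep only the self-contained, endpoint-neutral keys."""
    return {key: record[key] for key in ("id", "s3", "https")}


records = [
    {"id": "g1", "s3": "s3://example-bucket/g1.h5", "https": None,
     "lats": [0.0, 1.0], "lons": [0.0, 1.0]},
    {"id": "g2", "s3": None, "https": "https://example.invalid/g2.h5",
     "lats": [2.0, 3.0], "lons": [2.0, 3.0]},
]
# Hypothetical intersection result: shard key -> indexes into `records`.
assignment = {42: [0, 1], 7: [1]}

shard_keys = sorted(assignment)
granules = [[to_manifest_entry(records[i]) for i in assignment[k]] for k in shard_keys]
manifest = {
    "grid_signature": {"kind": "healpix"},  # placeholder signature
    "shard_keys": shard_keys,
    "granules": granules,
    "metadata": {},
}

restored = json.loads(json.dumps(manifest))
# Zip the parallel lists back into a lookup table.
by_key = dict(zip(restored["shard_keys"], restored["granules"]))
```

Storing keys and granules as parallel lists keeps the shard keys as JSON integers; a JSON object keyed by shard would turn them into strings.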

Convenience

zagg.catalog.make_shardmap

make_shardmap(
    query,
    grid,
    *,
    region=None,
    backend="auto",
    catalog_out=None,
    footprint="swath",
)

Fetch a Catalog and build a ShardMap in one call (concerns 1+2 chained).

Parameters:

  • query (Query) –

    What/when/where to fetch.

  • grid (OutputGrid) –

    Output grid (typically from_config(config)).

  • region (list of (lats, lons), default: None ) –

    Coverage mask. Defaults to the query bbox rectangle.

  • backend (str, default: 'auto' ) –

    Geometry backend for the shard map.

  • catalog_out (str, default: None ) –

    If given, persist the fetched Catalog to this geoparquet path.

  • footprint (('swath', 'beams'), default: "swath" ) –

    Granule footprint for intersection; "beams" tightens ICESat-2 ATL03/06 assignment to per-beam-pair corridors (issue #65).

    Deprecated: the "beams" corridor mechanism is a stopgap. Remove it once a better fix lands: native per-beam CMR geometry, the memory-handling robustness in #66, or data virtualization tracked in #97.

Returns:

  • ShardMap

Temporal / spatial helpers

zagg.catalog.cycle_to_dates

cycle_to_dates(cycle: int) -> tuple[datetime, datetime]

Convert an ICESat-2 repeat cycle number to a (start, end) date range.

Parameters:

  • cycle (int) –

    ICESat-2 cycle number (1-based).

Returns:

  • tuple of (datetime, datetime)
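The arithmetic behind a cycle-to-date conversion can be sketched from the 91-day length of an ICESat-2 repeat cycle. The epoch is deliberately left as a required argument because the reference date zagg uses is not stated here; cycle_window is a hypothetical name, and the epoch in the example is an arbitrary placeholder, not the mission's.

```python
from datetime import datetime, timedelta

CYCLE_LENGTH = timedelta(days=91)  # ICESat-2 repeat-cycle length


def cycle_window(cycle: int, epoch: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) for a 1-based cycle counted from `epoch`."""
    if cycle < 1:
        raise ValueError("cycle numbers are 1-based")
    start = epoch + (cycle - 1) * CYCLE_LENGTH
    return start, start + CYCLE_LENGTH


# Placeholder epoch chosen only to make the arithmetic easy to follow.
placeholder_epoch = datetime(2020, 1, 1)
start, end = cycle_window(2, placeholder_epoch)
```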

zagg.catalog.load_polygon

load_polygon(geojson_path: str) -> list[tuple]

Load polygon(s) from a GeoJSON file as (lats, lons) parts.

Supports Feature, FeatureCollection, Polygon, and MultiPolygon geometries.

Parameters:

  • geojson_path (str) –

    Path to a GeoJSON file.

Returns:

  • list of (lats, lons)

    One coordinate-array pair per polygon ring (WGS84).
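The flattening that load_polygon performs can be approximated with the standard library. The sketch below takes an already-parsed GeoJSON dict, returns plain lists rather than arrays, and emits every ring including holes; each of those choices is an assumption, and polygon_parts is a hypothetical name.

```python
import json


def polygon_parts(geojson: dict) -> list:
    """Flatten a GeoJSON object into (lats, lons) pairs, one per ring."""
    kind = geojson.get("type")
    if kind == "FeatureCollection":
        return [p for feature in geojson["features"] for p in polygon_parts(feature)]
    if kind == "Feature":
        return polygon_parts(geojson["geometry"])
    if kind == "Polygon":
        polygons = [geojson["coordinates"]]
    elif kind == "MultiPolygon":
        polygons = geojson["coordinates"]
    else:
        raise ValueError(f"unsupported GeoJSON type: {kind!r}")
    parts = []
    for rings in polygons:
        for ring in rings:
            # GeoJSON positions are [lon, lat]; the parts are (lats, lons).
            lons = [point[0] for point in ring]
            lats = [point[1] for point in ring]
            parts.append((lats, lons))
    return parts


feature = json.loads("""
{"type": "Feature", "properties": {}, "geometry": {"type": "Polygon",
 "coordinates": [[[-76.62107, 38.84504], [-76.50583, 38.84504],
                  [-76.50583, 38.93512], [-76.62107, 38.93512],
                  [-76.62107, 38.84504]]]}}
""")
parts = polygon_parts(feature)
```

Note the axis swap: GeoJSON stores longitude first, while the parts put latitudes first.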

zagg.catalog.polygon_to_bbox

polygon_to_bbox(parts: list[tuple]) -> tuple[float, float, float, float]

Compute a (lon_min, lat_min, lon_max, lat_max) bbox from polygon parts.

Parameters:

  • parts (list of (lats, lons)) –

Returns:

  • tuple of (lon_min, lat_min, lon_max, lat_max)
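A minimal version of the bbox computation takes the extremes over all parts. bbox_from_parts is a hypothetical stand-in; it does not handle polygons that cross the antimeridian, and whether polygon_to_bbox does is not stated here.

```python
def bbox_from_parts(parts: list) -> tuple:
    """Return (lon_min, lat_min, lon_max, lat_max) over all (lats, lons) parts."""
    lats = [lat for part_lats, _ in parts for lat in part_lats]
    lons = [lon for _, part_lons in parts for lon in part_lons]
    return (min(lons), min(lats), max(lons), max(lats))


# Two placeholder parts given as (lats, lons).
example_parts = [
    ([38.0, 39.0, 38.5], [-77.0, -76.0, -76.5]),
    ([40.0, 41.0], [-75.0, -74.0]),
]
bbox = bbox_from_parts(example_parts)
```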

zagg.catalog.load_antarctic_basins

load_antarctic_basins(filepath=None) -> list[tuple]

Load Antarctic drainage basin polygons as (lats, lons) parts.

Parameters:

  • filepath (str, default: None ) –

    Path to the basin polygon file. Defaults to the file shipped with mortie.

Returns:

  • list of (lats, lons)

    One pair per basin.