Schema Design

Single Source of Truth

The YAML pipeline config (configs/atl06.yaml) is the single source of truth for:

  • Column definitions and types: coordinate columns (cell_ids, morton) and data variables (count, h_mean, etc.)
  • Aggregation recipes: each data variable specifies either a function (resolved via resolve_function()) or an expression (evaluated at runtime), plus a source column and params
  • Zarr array configuration: dtype and fill value per column, used by xdggs_spec to generate the Zarr template

Coordinate and data variable column names are derived from the config via get_coords() and get_data_vars(), not hardcoded.
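
As a minimal sketch of how such a derivation might look, assume the parsed YAML is a plain dict with a hypothetical coords section next to aggregation.variables. The real key layout of configs/atl06.yaml and the actual signatures of get_coords() and get_data_vars() are not shown in this page, so the functions below are illustrative stand-ins only.

```python
# Hypothetical parsed config; the real key layout in configs/atl06.yaml may differ.
config = {
    "coords": {
        "cell_ids": {"dtype": "uint64"},
        "morton": {"dtype": "int64"},
    },
    "aggregation": {
        "variables": {
            "count": {"function": "len", "source": "h_li", "dtype": "int32", "fill_value": 0},
            "h_min": {"function": "min", "source": "h_li", "dtype": "float32", "fill_value": "NaN"},
        }
    },
}


def get_coords(config):
    """Return coordinate column names in config order (illustrative stand-in)."""
    return list(config["coords"])


def get_data_vars(config):
    """Return data variable names in config order (illustrative stand-in)."""
    return list(config["aggregation"]["variables"])


print(get_coords(config))     # ['cell_ids', 'morton']
print(get_data_vars(config))  # ['count', 'h_min']
```

Because both lists are read from the config, adding or renaming a variable there changes the column names everywhere they are consumed.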

Aggregation Config

Each variable in aggregation.variables carries metadata describing its behavior:

| Key | Values | Description |
| --- | --- | --- |
| function | e.g. "min", "average", "quantile" | Function resolved via resolve_function() (numpy shorthand or dotted path) |
| expression | e.g. "1.0 / np.sqrt(np.sum(1.0 / s_li**2))" | Python expression evaluated with column arrays in namespace |
| source | e.g. "h_li" | Input column from the raw observations |
| params | e.g. {weights: "1.0 / s_li**2"}, {q: 0.25} | Extra parameters; string values referencing data_source variables are resolved to arrays |
| dtype | e.g. "float32", "int32" | Data type for the Zarr array |
| fill_value | 0 or "NaN" | Fill value for unoccupied cells |

function and expression are mutually exclusive.
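
A check along the following lines could enforce that rule when the config is loaded. This is a sketch, not the pipeline's own validation: validate_variable is a hypothetical helper, and treating "neither key present" as an error is an assumption on top of the rule stated above.

```python
def validate_variable(name, meta):
    """Raise ValueError unless exactly one of 'function' / 'expression' is set.

    Hypothetical helper. Rejecting entries that define neither key is an
    assumption; the documented rule only forbids defining both.
    """
    has_function = "function" in meta
    has_expression = "expression" in meta
    if has_function and has_expression:
        raise ValueError(f"{name}: 'function' and 'expression' are mutually exclusive")
    if not has_function and not has_expression:
        raise ValueError(f"{name}: one of 'function' or 'expression' is required")


variables = {
    "h_min": {"function": "min", "source": "h_li", "dtype": "float32"},
    "h_sigma": {"expression": "1.0 / np.sqrt(np.sum(1.0 / s_li**2))", "dtype": "float32"},
}
for name, meta in variables.items():
    validate_variable(name, meta)  # passes silently for valid entries
```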

Aggregation Dispatch

calculate_cell_statistics is config-driven: it iterates over the aggregation variable metadata and dispatches each variable via resolve_function() (for function-based fields) or evaluate_expression() (for expression-based fields).
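
The dispatch described above can be sketched roughly as follows. The function names match the ones mentioned in this page, but their bodies and signatures here are assumptions written for illustration: in particular the special case for len / count, the use of eval, and the way string params are turned into arrays may differ from the real implementation.

```python
import importlib

import numpy as np


def resolve_function(name):
    """Resolve 'len'/'count', a numpy shorthand ('min'), or a dotted path."""
    if name in ("len", "count"):  # assumed special case: numpy has no np.len
        return len
    if "." in name:
        module_name, attr = name.rsplit(".", 1)
        return getattr(importlib.import_module(module_name), attr)
    return getattr(np, name)


def evaluate_expression(expression, columns):
    """Evaluate a Python expression with np and the column arrays in scope.

    eval is not a sandbox; this assumes the config comes from a trusted source.
    """
    namespace = {"__builtins__": {}, "np": np, "float": float, **columns}
    return eval(expression, namespace)


def resolve_params(params, columns):
    """Evaluate string params against the columns; pass other values through."""
    return {
        key: evaluate_expression(value, columns) if isinstance(value, str) else value
        for key, value in params.items()
    }


def calculate_cell_statistics(columns, variables):
    """Compute one value per configured variable for a single cell."""
    stats = {}
    for name, meta in variables.items():
        if "expression" in meta:
            stats[name] = evaluate_expression(meta["expression"], columns)
        else:
            func = resolve_function(meta["function"])
            kwargs = resolve_params(meta.get("params", {}), columns)
            stats[name] = func(columns[meta["source"]], **kwargs)
    return stats


# Three made-up observations falling in one cell.
columns = {
    "h_li": np.array([10.0, 12.0, 14.0]),
    "s_li": np.array([1.0, 2.0, 1.0]),
}
variables = {
    "count": {"function": "len", "source": "h_li"},
    "h_mean": {"function": "average", "source": "h_li",
               "params": {"weights": "1.0 / s_li**2"}},
    "h_q25": {"function": "quantile", "source": "h_li", "params": {"q": 0.25}},
    "h_sigma": {"expression": "1.0 / np.sqrt(np.sum(1.0 / s_li**2))"},
}
stats = calculate_cell_statistics(columns, variables)
print(stats)
```

With these inputs the weights are 1, 0.25 and 1, so the weighted mean is 27 / 2.25 = 12.0 and the expression for h_sigma gives 1 / sqrt(2.25), roughly 0.667.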

Available aggregation functions (via numpy or dotted import paths):

| Name | Description | Params |
| --- | --- | --- |
| len / count | Number of observations | none |
| min | Minimum value | none |
| max | Maximum value | none |
| var | Variance | none |
| average | Weighted average | weights: column name or expression |
| quantile | Quantile value | q: quantile (0 to 1) |
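
For reference, these are the direct numpy calls the names above would correspond to, assuming each shorthand maps to the numpy function of the same name (the page implies this but does not spell it out). The sample values are made up. Two numpy defaults are worth knowing: np.var computes the population variance (ddof=0), and np.quantile interpolates linearly between observations.

```python
import numpy as np

# Made-up elevations and per-observation uncertainties for one cell.
h_li = np.array([10.0, 12.0, 14.0])
s_li = np.array([1.0, 2.0, 1.0])

count = len(h_li)                                  # number of observations: 3
h_min, h_max = np.min(h_li), np.max(h_li)          # 10.0 and 14.0
h_variance = np.var(h_li)                          # population variance (ddof=0): 8/3
h_sample_variance = np.var(h_li, ddof=1)           # sample variance for comparison: 4.0
h_mean = np.average(h_li, weights=1.0 / s_li**2)   # inverse-variance weights: 12.0
h_q50 = np.quantile(h_li, q=0.5)                   # median via linear interpolation: 12.0
```

If a sample variance is wanted instead, ddof would have to be supplied explicitly, for example through params if the pipeline forwards it as a keyword argument.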

Extending the Schema

To add a new output statistic, add a variable entry to the YAML config:

aggregation:
  variables:
    h_iqr:
      expression: "float(np.quantile(h_li, 0.75) - np.quantile(h_li, 0.25))"
      dtype: float32

Or using a function:

    h_median:
      function: median
      source: h_li
      dtype: float32

Everything else (the Zarr template, calculate_cell_statistics, and process_morton_cell) adapts automatically.
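
Before adding an entry, it can help to try the expression or function on a small array. The snippet below is a standalone check with made-up values; it evaluates the h_iqr expression string with np and h_li in scope, which approximates but is not necessarily identical to what evaluate_expression() does.

```python
import numpy as np

# Made-up elevations standing in for the h_li column of one cell.
h_li = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# The expression string from the h_iqr entry, evaluated with np and h_li in scope.
expression = "float(np.quantile(h_li, 0.75) - np.quantile(h_li, 0.25))"
h_iqr = eval(expression, {"np": np, "h_li": h_li})

# The function-based h_median entry corresponds to a plain numpy call.
h_median = np.median(h_li)

print(h_iqr, h_median)  # 2.0 3.0
```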

Output Columns

| Column | Type | Description |
| --- | --- | --- |
| cell_ids | uint64 | HEALPix cell ID at child order |
| morton | int64 | Morton index at child order |
| count | int32 | Number of observations |
| h_mean | float32 | Inverse-variance weighted mean elevation |
| h_sigma | float32 | Uncertainty of weighted mean |
| h_min, h_max | float32 | Elevation range |
| h_variance | float32 | Variance |
| h_q25, h_q50, h_q75 | float32 | Quartiles |
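
To illustrate how dtype and fill_value describe an unoccupied cell, the sketch below allocates plain numpy arrays from per-column settings. It does not use xdggs_spec or Zarr; parse_fill_value and empty_columns are hypothetical helpers, and the fill values chosen for each column (0 for count, "NaN" for the float columns) are assumptions consistent with the allowed values listed earlier.

```python
import numpy as np

# Hypothetical per-column settings; dtypes follow the table above,
# fill values are assumed (0 for integers, "NaN" for floats).
column_specs = {
    "count": {"dtype": "int32", "fill_value": 0},
    "h_mean": {"dtype": "float32", "fill_value": "NaN"},
    "h_sigma": {"dtype": "float32", "fill_value": "NaN"},
}


def parse_fill_value(fill_value):
    """Convert the YAML string "NaN" to a float NaN; pass other values through."""
    if isinstance(fill_value, str) and fill_value.lower() == "nan":
        return float("nan")
    return fill_value


def empty_columns(column_specs, n_cells):
    """Allocate one array per column, pre-filled with its fill value."""
    return {
        name: np.full(n_cells, parse_fill_value(spec["fill_value"]), dtype=spec["dtype"])
        for name, spec in column_specs.items()
    }


arrays = empty_columns(column_specs, n_cells=4)
print(arrays["count"])   # [0 0 0 0]
print(arrays["h_mean"])  # [nan nan nan nan]
```

Cells that receive observations would overwrite these defaults, while cells that stay empty keep the fill value.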