Lambda benchmark CI/CD setup

This guide stands up the Lambda-benchmark pipeline (issue #110): on every same-repo PR and every merge to main, GitHub Actions dispatches the single densest shard over the NEON SERC AOP box to the deployed process-shard Lambda, harvests seconds + cost, and tracks cost/shard and cost/100 km² over merge history.

The code (workflows, scripts, pinned targets) ships in the repo. This page is the one-time infrastructure setup: the AWS OIDC trust, a scoped IAM role, an S3 output bucket, the GitHub repo variables/secrets, the benchmarks data branch, and the benchmark label — each with a step to confirm it worked.

You need: admin on the GitHub repo, an AWS account with the process-shard Lambda already deployed (see Standing Up the Backend), the AWS CLI authenticated to that account, and an Earthdata Login (the orchestrator mints NSIDC S3 read credentials it forwards to the Lambda).

Throughout, substitute your values for ACCOUNT_ID, REGION (e.g. us-west-2), FUNCTION_NAME (default process-shard), BUCKET and PREFIX (the S3 output location from section 3), and the repo englacial/zagg.

What the pipeline needs (overview)

| Piece | Why |
| --- | --- |
| GitHub OIDC provider in AWS | Lets Actions mint short-lived AWS creds with no stored secret |
| Scoped IAM role (`zagg-benchmark`) | What those creds can do: invoke the Lambda + probe account concurrency (no direct S3; the Lambda writes the store) |
| S3 bucket / prefix | Where the benchmark Zarr stores are written |
| Repo variables `BENCHMARK_*` | Non-secret config (role ARN, region, function, store prefix) |
| Repo secret `EARTHDATA_TOKEN` | Earthdata Login bearer token for the orchestrator to mint NSIDC read creds |
| `benchmarks` branch | Holds the retained parquet series + rendered charts |
| `benchmark` label | One of the two fork-PR on-demand triggers |

1. Create the GitHub OIDC identity provider in AWS

This is account-wide; skip if you already have it (e.g. from another repo).

aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1

Verify — it should list the provider:

aws iam list-open-id-connect-providers
# -> arn:aws:iam::ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com

The thumbprint is no longer security-critical (AWS validates GitHub's OIDC certificate via its own trust store) but the argument is still required.

2. Create the CI/CD roles + distribution bucket (one CloudFormation stack)

All three GitHub-OIDC roles — benchmark-invoke (this section), test-deploy and release (section 9) — plus the public sliderule-public-cors bucket (section 10) are provisioned by one template, deployment/aws/benchmark_cicd.yaml, kept in its own stack so it rolls back / tears down cleanly (the SlideRule CFN-first convention; mirrors execution_role.yaml). It references the existing OIDC provider from section 1 by account ID, so nothing here recreates it.

aws cloudformation deploy \
  --template-file deployment/aws/benchmark_cicd.yaml \
  --stack-name zagg-benchmark-cicd \
  --region us-west-2 \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
      GitHubRepo=englacial/zagg \
      BenchmarkFunctionName=process-shard \
      TestFunctionName=process-shard-test \
      BenchmarkStoreBucket=sliderule-public \
      DistBucketName=sliderule-public-cors

Parameters (all default to the values in this guide):

| Parameter | Default | Scopes |
| --- | --- | --- |
| `GitHubRepo` | `englacial/zagg` | the OIDC trust `sub` (`repo:<repo>:*`) on all three roles |
| `BenchmarkFunctionName` | `process-shard` | the stable function the invoke role calls + the release role deploys |
| `TestFunctionName` | `process-shard-test` | the function the deploy role updates + the invoke role calls |
| `BenchmarkStoreBucket` | `sliderule-public` | the benchmark output store + the staged test layer (`lambda-test/*`) |
| `DistBucketName` | `sliderule-public-cors` | the public distribution bucket (`CreateDistBucket=false` to reuse an existing one) |

The benchmark-invoke role (zagg-benchmark) gets lambda:InvokeFunction + lambda:GetFunctionConfiguration + lambda:GetFunctionConcurrency on the stable + test functions, plus the worker-sizing probe (lambda:GetAccountSettings + cloudwatch:GetMetricStatistics, issue #28/#63). No S3 — the Zarr template write happens inside the Lambda, so the orchestrator never touches the bucket.

The Lambda reads the ATL03 source granules itself, using the NSIDC S3 credentials the orchestrator forwards — that's the Lambda execution role's concern, not this role's. This role only invokes + probes concurrency.

Verify — the stack Outputs hand you the exact ARNs/names for section 4:

aws cloudformation describe-stacks --stack-name zagg-benchmark-cicd \
  --region us-west-2 --query 'Stacks[0].Outputs' --output table

The trust sub is the repo wildcard repo:englacial/zagg:*, so a token minted in a fork's own repository can't match. Branch, PR, and fork-context gating lives in the workflows and the production Environment (section 11). You can tighten the roles' conditions at the IAM layer, but the on-demand fork path runs under pull_request_target (base-repo sub) and the release job under environment:production, so an over-restrictive condition locks them out.
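As a rough illustration of the wildcard match described above, the sketch below uses Python's `fnmatchcase` as a stand-in for IAM's `StringLike` operator. This is an approximation: `fnmatch` also treats `[...]` as a character class, which none of these strings contain. The sub values follow GitHub's documented claim formats, and the fork owner `someone-else` is hypothetical.

```python
from fnmatch import fnmatchcase

# The sub pattern from the trust policy in this guide.
TRUST_SUB_PATTERN = "repo:englacial/zagg:*"


def sub_matches(sub: str, pattern: str = TRUST_SUB_PATTERN) -> bool:
    """Approximate IAM StringLike with a case-sensitive glob match."""
    return fnmatchcase(sub, pattern)


# Example sub claims and whether the repo wildcard should admit them.
EXAMPLE_SUBS = {
    "repo:englacial/zagg:ref:refs/heads/main": True,      # branch-context run
    "repo:englacial/zagg:pull_request": True,             # PR-context run in the base repo
    "repo:englacial/zagg:environment:production": True,   # job bound to an Environment
    "repo:someone-else/zagg:ref:refs/heads/main": False,  # workflow running in a fork's repo
}

if __name__ == "__main__":
    for sub, expected in EXAMPLE_SUBS.items():
        assert sub_matches(sub) == expected
        print(f"{'allow' if expected else 'deny'}  {sub}")
```

Narrowing the pattern (for example to `repo:englacial/zagg:ref:refs/heads/main`) would reject the `environment:production` sub, which is the lock-out risk noted above.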

Manual (raw-CLI) alternative for the invoke role

If you can't use CloudFormation, create just the invoke role with this trust (`trust.json`) and permission (`permissions.json`) policy:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": { "token.actions.githubusercontent.com:aud": "sts.amazonaws.com" },
      "StringLike": { "token.actions.githubusercontent.com:sub": "repo:englacial/zagg:*" }
    }
  }]
}

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InvokeBenchmarkLambda",
      "Effect": "Allow",
      "Action": ["lambda:InvokeFunction", "lambda:GetFunctionConfiguration", "lambda:GetFunctionConcurrency"],
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME"
    },
    {
      "Sid": "ConcurrencyProbe",
      "Effect": "Allow",
      "Action": ["lambda:GetAccountSettings", "cloudwatch:GetMetricStatistics"],
      "Resource": "*"
    }
  ]
}

aws iam create-role --role-name zagg-benchmark \
  --assume-role-policy-document file://trust.json
aws iam put-role-policy --role-name zagg-benchmark \
  --policy-name zagg-benchmark-access --policy-document file://permissions.json

(Section 9's deploy/release roles + section 10's bucket then also need their own `create-role`/`s3api` commands — the CloudFormation path does all of it at once.)

3. Provision the S3 output bucket / prefix

Reuse an existing bucket or make one. The store prefix is where each target's Zarr lands (overwritten every run — these are throwaway).

aws s3 mb s3://BUCKET --region REGION   # if it doesn't exist

Verify — a write+read+delete round-trip with the prefix you'll configure:

echo ok | aws s3 cp - s3://BUCKET/PREFIX/.probe
aws s3 ls s3://BUCKET/PREFIX/.probe && aws s3 rm s3://BUCKET/PREFIX/.probe

4. Configure GitHub repo variables and secrets

The workflows read four variables (non-secret) and one secret.

# Variables (Settings -> Secrets and variables -> Actions -> Variables)
gh variable set BENCHMARK_ROLE_ARN      --body "arn:aws:iam::ACCOUNT_ID:role/zagg-benchmark"
gh variable set BENCHMARK_AWS_REGION    --body "REGION"
gh variable set BENCHMARK_FUNCTION_NAME --body "FUNCTION_NAME"
gh variable set BENCHMARK_STORE_PREFIX  --body "s3://BUCKET/PREFIX"

# Secret: an Earthdata Login BEARER TOKEN for the orchestrator (the NSIDC S3
# credentials endpoint accepts a user token). Generate one at
# https://urs.earthdata.nasa.gov/documentation/for_users/user_token (Profile ->
# Generate Token). earthaccess reads it from EARTHDATA_TOKEN.
gh secret set EARTHDATA_TOKEN --body "<edl-bearer-token>"

A token is preferred over username/password: it's scoped and expiring (EDL allows up to 2 active tokens, ~60-day lifetime), and it rotates without touching the EDL account password. Set a calendar reminder to regenerate the token and re-run gh secret set before expiry; an expired token fails the benchmark at the NSIDC-credentials step.
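One way to back up the calendar reminder is to read the expiry out of the token itself. The sketch below assumes the EDL bearer token is a JWT whose payload carries a standard `exp` claim; that is an assumption, not something this guide confirms. It decodes the payload without verifying the signature, so it only suits a local "how long is left" check. `make_demo_token` and the 21-day warning threshold are hypothetical, for illustration only.

```python
import base64
import json
from datetime import datetime, timezone


def jwt_expiry(token: str) -> datetime:
    """Return the `exp` claim of a JWT as a UTC datetime (signature NOT verified)."""
    payload_b64 = token.split(".")[1]
    # base64url payloads drop '=' padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return datetime.fromtimestamp(claims["exp"], tz=timezone.utc)


def days_until_expiry(token: str, now: datetime) -> float:
    """Days remaining before the token expires (negative once expired)."""
    return (jwt_expiry(token) - now).total_seconds() / 86400


def make_demo_token(exp: int) -> str:
    """Build a synthetic, unsigned token for the demo (not a real EDL token)."""
    payload = json.dumps({"exp": exp}).encode()
    body = base64.urlsafe_b64encode(payload).rstrip(b"=").decode()
    return f"header.{body}.signature"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    demo = make_demo_token(int(now.timestamp()) + 14 * 86400)
    remaining = days_until_expiry(demo, now)
    # Hypothetical threshold: warn when fewer than 21 days remain.
    print("rotate soon" if remaining < 21 else "ok", f"({remaining:.1f} days left)")
```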

Verify:

gh variable list   # 4 BENCHMARK_* entries
gh secret list     # EARTHDATA_TOKEN

5. Bootstrap the `benchmarks` data branch

The retained parquet series and the rendered charts live on an orphan benchmarks branch. The merge job checks it out and fails if it's missing, so create it once:

git switch --orphan benchmarks
git rm -rf . >/dev/null 2>&1 || true
printf '# zagg benchmark data\n\nRetained `series.parquet` + rendered `site/` charts (issue #110).\n' > README.md
git add README.md
git commit -m "init benchmarks data branch"
git push origin benchmarks
git switch main

Verify:

git ls-remote --heads origin benchmarks   # one ref printed

6. Create the `benchmark` label

One of the two fork-PR on-demand triggers (the other is a /benchmark comment).

gh label create benchmark \
  --description "Run the Lambda benchmark on this PR" --color FBCA04

Verify: gh label list | grep benchmark.

7. End-to-end verification

Manual run. Trigger the workflow by hand — it runs the merge path (retains + publishes):

gh workflow run "Lambda Benchmark" --ref main
gh run watch

On success, the benchmarks branch gains series.parquet and site/:

git fetch origin benchmarks
git ls-tree -r origin/benchmarks --name-only | grep -E 'series.parquet|site/'

PR comment. Open a small same-repo PR (non-draft) and confirm a 🤖-free benchmark table comment appears (tagged with the workflow's marker) and is updated in place on each push. Also confirm that the point is not added to the retained series (PR runs are ephemeral).
Fork on-demand. On a fork PR, comment /benchmark (or apply the benchmark label) as a write/admin user and confirm the run fires; confirm a non-write user's /benchmark is ignored (the gate logs a notice).
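The fork on-demand gate checked above can be pictured as a small predicate. This is a minimal sketch rather than the workflow's actual code: the function names are hypothetical, and it assumes the actor's permission arrives as one of the strings GitHub's collaborator-permission API reports (`admin`, `write`, `read`, `none`).

```python
from typing import Optional

# Permission levels allowed to trigger a fork benchmark, per this guide.
ALLOWED_PERMISSIONS = {"write", "admin"}


def is_benchmark_request(comment_body: Optional[str] = None,
                         label: Optional[str] = None) -> bool:
    """True if the event is a /benchmark comment or the benchmark label."""
    first_word = (comment_body or "").split()[:1]
    return first_word == ["/benchmark"] or label == "benchmark"


def fork_run_allowed(permission: str,
                     comment_body: Optional[str] = None,
                     label: Optional[str] = None) -> bool:
    """Gate: the actor must hold write/admin AND have asked for a run."""
    return permission in ALLOWED_PERMISSIONS and is_benchmark_request(comment_body, label)


if __name__ == "__main__":
    print(fork_run_allowed("write", comment_body="/benchmark"))  # maintainer comment
    print(fork_run_allowed("read", comment_body="/benchmark"))   # non-write user
    print(fork_run_allowed("admin", label="benchmark"))          # maintainer label
```

Matching on the first word only is a design choice in this sketch, so that a comment merely mentioning /benchmark mid-sentence does not fire a billed run.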

If a run fails at Assume AWS role, re-check the trust policy sub matches the repo and the OIDC provider exists (steps 1–2). If it fails minting NSIDC creds, re-check the EARTHDATA_TOKEN secret (step 4) — an expired token fails here.

Surfacing the charts on GitHub Pages

The repo already serves the mkdocs docs from the Pages Actions source (docs.yml), and a repo has a single Pages site — so the benchmark does not deploy Pages itself. Instead the charts are embedded in the docs: the Benchmark results page references the rendered PNGs by their raw URL on the benchmarks branch, so they appear in the docs site and refresh as each merge updates the branch — no docs rebuild, no Pages reconfiguration.

The PNGs also render directly in the benchmarks branch file view on GitHub if you want the raw artifacts.
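For reference, a raw URL on a branch can be composed as below. This sketch assumes the `raw.githubusercontent.com/<repo>/<branch>/<path>` form; the docs page may use a different but equivalent raw link, and the chart file name shown is hypothetical.

```python
from urllib.parse import quote


def raw_chart_url(path: str,
                  repo: str = "englacial/zagg",
                  branch: str = "benchmarks") -> str:
    """Build the raw-content URL for a file on the benchmarks branch."""
    # quote() keeps '/' intact and escapes characters such as spaces.
    return f"https://raw.githubusercontent.com/{repo}/{branch}/{quote(path)}"


if __name__ == "__main__":
    # Hypothetical chart name under the rendered site/ directory.
    print(raw_chart_url("site/cost_per_shard.png"))
```

Because the URL names the branch rather than a commit, it resolves to whatever the latest merge pushed, which is why no docs rebuild is needed.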

Cost and safety notes

Bounded spend. Each run dispatches one shard per target, hard-capped by the Lambda timeout (720 s deploy default ≤ 900 s AWS max) — pennies per run. The auto PR job has per-PR concurrency, so rapid pushes don't stack billed runs.
No fork auto-runs. Fork PRs never get the role automatically; a write/admin maintainer must opt in per PR (/benchmark or the benchmark label). Doing so runs the fork's checked-out code with the (minimally-scoped) role. That's the cost of benchmarking a fork: the role can only invoke the benchmark Lambdas (stable + test) and probe account concurrency, and it has no direct S3 access (the Lambda writes the store).
Rotating creds. OIDC mints fresh AWS creds per run (nothing to rotate). Regenerate the EARTHDATA_TOKEN before it expires (EDL tokens are short-lived).
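To put a number on "pennies per run", the sketch below bounds one invocation by the 720 s timeout. The prices are assumed AWS list prices for arm64 Lambda (check current pricing for your region), and the memory size and shard area are hypothetical placeholders, since this guide does not state them.

```python
# Assumed AWS list prices (USD) for arm64 Lambda; verify against current pricing.
ARM64_USD_PER_GB_SECOND = 0.0000133334
USD_PER_REQUEST = 0.20 / 1_000_000

LAMBDA_TIMEOUT_S = 720  # deploy default quoted in this guide


def invocation_cost(seconds: float, memory_mb: int) -> float:
    """Duration charge (GB-seconds) plus the per-request charge for one invoke."""
    gb_seconds = seconds * (memory_mb / 1024)
    return gb_seconds * ARM64_USD_PER_GB_SECOND + USD_PER_REQUEST


def cost_per_100_km2(cost_usd: float, area_km2: float) -> float:
    """Normalize a shard's cost to a 100 km² footprint."""
    return cost_usd / area_km2 * 100


if __name__ == "__main__":
    memory_mb = 2048  # hypothetical memory setting
    area_km2 = 50.0   # hypothetical shard footprint
    worst_case = invocation_cost(LAMBDA_TIMEOUT_S, memory_mb)
    print(f"worst case per shard: ${worst_case:.4f}")
    print(f"per 100 km²: ${cost_per_100_km2(worst_case, area_km2):.4f}")
```

Under these assumptions a run that hits the timeout at 2048 MB costs roughly $0.02, and even at Lambda's 10,240 MB maximum it stays under $0.10.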

Deploy automation (issue #25)

The benchmark above measures the deployed Lambda worker. For a PR's worker changes to actually be measured, the PR's code has to be deployed first. This second pipeline does that across three tiers:

Internal PRs / merges (tier 2) that touch the deployed closure (src/zagg/**, deployment/aws/**, pyproject.toml) → build arm64 + redeploy a separate process-shard-test function, then benchmark against it. Non-affecting changes benchmark the stable function.
Fork PRs (tier 1) → never redeploy; the benchmark comment is annotated that the numbers reflect the stable worker, not the PR.
Releases / tags (tier 3) → update the production function in place and publish the zips to a public, listable bucket (sliderule-public-cors), replacing the source.coop mirror over a transition window.
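The routing between the test and stable functions can be sketched as a path check. This is an illustrative approximation using simple prefix matching, not the workflow's real path filter; the helper names are hypothetical, while the paths and function names are the ones quoted above.

```python
from typing import Iterable

# The "deployed closure" from this guide: changes here alter the Lambda worker.
CLOSURE_PREFIXES = ("src/zagg/", "deployment/aws/")
CLOSURE_FILES = {"pyproject.toml"}


def touches_deployed_closure(changed_paths: Iterable[str]) -> bool:
    """True if any changed file would end up in the deployed worker."""
    return any(
        path in CLOSURE_FILES or path.startswith(CLOSURE_PREFIXES)
        for path in changed_paths
    )


def benchmark_target(changed_paths: Iterable[str], is_fork: bool) -> str:
    """Pick the function to benchmark: forks and unaffected PRs use stable."""
    if is_fork or not touches_deployed_closure(changed_paths):
        return "process-shard"
    return "process-shard-test"


if __name__ == "__main__":
    print(benchmark_target(["src/zagg/processing/shard.py"], is_fork=False))
    print(benchmark_target(["docs/index.md"], is_fork=False))
    print(benchmark_target(["src/zagg/processing/shard.py"], is_fork=True))
```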

This is the "AWS update for this PR" step in the rollout order (#112 → benchmark setup → the deploy PR → this). Until these resources + variables exist, the deploy/distribute/prod jobs skip cleanly — the benchmark keeps running against the stable function and releases still attach zips.

8. The `process-shard-test` stack

A second copy of the backend stack, named process-shard-test, so PR benchmarks never touch production. Stand it up once (the template already parameterizes the function name); subsequent PR deploys only update-function-code, no stack churn.

OUTPUT_BUCKET=sliderule-public \
STACK_NAME=zagg-backend-test \
FUNCTION_NAME=process-shard-test \
  ./deployment/aws/stand_up.sh

(stand_up.sh passes FUNCTION_NAME to the template's FunctionName parameter.) Verify: aws lambda get-function --function-name process-shard-test returns the function.

9. Deploy + release IAM roles (OIDC)

The zagg-benchmark-deploy and zagg-lambda-release roles are created by the section-2 stack (benchmark_cicd.yaml). Their grants, for reference:

zagg-benchmark-deploy (test tier) updates the test function and stages the layer:
  lambda:UpdateFunctionCode, lambda:UpdateFunctionConfiguration, lambda:PublishLayerVersion, lambda:GetLayerVersion, lambda:GetFunction, lambda:GetFunctionConfiguration on process-shard-test (+ its -deps layer). GetFunctionConfiguration is required: deploy_lambda.sh calls aws lambda wait function-updated, whose waiter polls it. GetLayerVersion is required: update-function-configuration --layers validates the versioned layer ARN.
  s3:PutObject/s3:GetObject on s3://sliderule-public/lambda-test/* (the staged layer).
zagg-lambda-release (release tier) updates production and publishes:
  the same six lambda:* actions (incl. lambda:GetLayerVersion) on process-shard (+ process-shard-deps).
  s3:PutObject/s3:GetObject on s3://sliderule-public-cors/* (distribute).

Verify: run aws iam get-role --role-name zagg-benchmark-deploy, then again with --role-name zagg-lambda-release; each should return the role.

The release role can mutate production, so gate it (section 11) — don't rely on the trust policy alone.

10. The `sliderule-public-cors` distribution bucket

A real public bucket (readable + listable from anywhere — unlike s3://sliderule-public, which is cryocloud-only and no-list), in us-west-2 (CloudFormation reads Lambda code same-region) so stand_up.sh can fetch it directly. Listing lets a user resolve LAMBDA_VERSION=latest from versions.json instead of being pinned to their clone's version.

The section-2 stack creates this bucket (with the public read/list policy + CORS rule below) when CreateDistBucket=true (the default). One account-level caveat the stack can't handle: if your account has Block Public Access enabled (aws s3control get-public-access-block --account-id ACCOUNT_ID), the bucket-level setting can't override it — relax it at the account level too, or the public policy is silently ineffective.

Manual (raw-CLI) alternative

aws s3 mb s3://sliderule-public-cors --region us-west-2
aws s3api put-public-access-block --bucket sliderule-public-cors \
  --public-access-block-configuration BlockPublicPolicy=false,RestrictPublicBuckets=false
aws s3api put-bucket-policy --bucket sliderule-public-cors --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicReadList",
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::sliderule-public-cors",
                 "arn:aws:s3:::sliderule-public-cors/*"]
  }]
}'
aws s3api put-bucket-cors --bucket sliderule-public-cors --cors-configuration '{
  "CORSRules": [{
    "AllowedOrigins": ["*"],
    "AllowedMethods": ["GET", "HEAD"],
    "AllowedHeaders": ["*"],
    "ExposeHeaders": ["Content-Length", "Content-Range", "Content-Type", "ETag", "Accept-Ranges"]
  }]
}'

Verify (anonymous read + list both work):

aws s3 ls s3://sliderule-public-cors/ --no-sign-request
aws s3 cp s3://sliderule-public-cors/versions.json - --no-sign-request   # after the first release

The bucket must be in the same region as the production Lambda — the release job's publish-layer-version reads the layer zip from it in-region.

11. GitHub Environments, variables, and tag protection

The deploy/release jobs read these variables (no extra secrets — only the benchmark jobs use EARTHDATA_TOKEN). The role ARNs are the section-2 stack's Outputs (DeployRoleArn / ReleaseRoleArn) — paste those rather than hand-building the ARNs below:

# Benchmark test-deploy (tier 2)
gh variable set BENCHMARK_DEPLOY_ROLE_ARN    --body "arn:aws:iam::ACCOUNT_ID:role/zagg-benchmark-deploy"
gh variable set BENCHMARK_TEST_FUNCTION_NAME --body "process-shard-test"
gh variable set BENCHMARK_TEST_STAGE_BUCKET  --body "sliderule-public"
# Release (tier 3)
gh variable set LAMBDA_RELEASE_ROLE_ARN   --body "arn:aws:iam::ACCOUNT_ID:role/zagg-lambda-release"
gh variable set LAMBDA_PROD_FUNCTION_NAME --body "process-shard"
gh variable set LAMBDA_DIST_BUCKET        --body "sliderule-public-cors"
gh variable set LAMBDA_AWS_REGION         --body "us-west-2"

Protect the production deploy:

production Environment (Settings → Environments → New → production) with a required reviewer — the release workflow's deploy-prod job waits for an approval before it mutates the live function.
Tag protection for *.*.* so only maintainers can cut a release (a tag is what triggers the prod deploy + PyPI publish).

Verify: gh variable list shows the seven new vars; the Environments page shows production with a reviewer.

12. Verify the deploy tiers

Tier 2. Open an internal PR that edits src/zagg/processing/ — the benchmark run should include a deploy-test job and the comment should have no stale-worker banner. A docs-only PR should skip the deploy and (correctly) show no banner either.
Tier 1. On a fork PR that touches src/zagg/, a maintainer /benchmark → the comment carries the "worker = stable main" banner.
Tier 3. Cut a pre-release tag → the GitHub release gets the four zips attached, s3://sliderule-public-cors/<minor>/ is populated with a versions.json, and (after the production approval) process-shard is updated. LAMBDA_VERSION=latest ./stand_up.sh then resolves the new minor.

Distribution transition (source.coop → CORS bucket)

stand_up.sh now prefers sliderule-public-cors and falls back to the source.coop mirror for any minor not yet on the new bucket — so older standups keep working while new releases populate the new bucket. Once enough releases have published to the CORS bucket, retire the source.coop mirror (and its publish_mirror.sh step) in a follow-up. Set LAMBDA_VERSION=latest to always pull the newest published minor.
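The prefer-then-fall-back behaviour can be sketched as follows. stand_up.sh is a shell script and its real logic may differ; the function names here are hypothetical, and the shape of the published-minor list (plain strings such as "0.10") is an assumption, since the versions.json format is not shown in this guide.

```python
from typing import Iterable, Tuple

CORS_BUCKET = "sliderule-public-cors"
FALLBACK_MIRROR = "source.coop"


def minor_key(minor: str) -> Tuple[int, ...]:
    """Sort key so that '0.10' ranks above '0.9' (numeric, not lexicographic)."""
    return tuple(int(part) for part in minor.split("."))


def resolve_minor(requested: str, published: Iterable[str]) -> str:
    """Map LAMBDA_VERSION to a concrete minor; 'latest' picks the newest published."""
    if requested == "latest":
        return max(published, key=minor_key)
    return requested


def pick_source(minor: str, on_cors_bucket: Iterable[str]) -> str:
    """Prefer the CORS bucket; fall back to the mirror for minors not yet there."""
    return CORS_BUCKET if minor in set(on_cors_bucket) else FALLBACK_MIRROR


if __name__ == "__main__":
    on_cors = ["0.9", "0.10"]  # hypothetical minors already published to the bucket
    newest = resolve_minor("latest", on_cors)
    print(newest, pick_source(newest, on_cors))
    print("0.8", pick_source("0.8", on_cors))  # older minor still served by the mirror
```

Comparing minors numerically matters once a component reaches two digits: a plain string sort would rank "0.9" above "0.10".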