[FEATURE] Add NOAA passive acoustic data source #5

@amrit110

Problem Statement

whalu currently supports MBARI Pacific Sound (California) and Orcasound (Puget Sound). Adding NOAA's passive acoustic archive would expand coverage to 12 US ocean regions, including the Atlantic, Gulf of Mexico, Alaska, Hawaii, and National Marine Sanctuaries — unlocking multi-year, multi-site whale detection at national scale.

Proposed Solution

Implement whalu/data/noaa.py as a new data source module, analogous to mbari.py, backed by NOAA's public GCS bucket.

Bucket: gs://noaa-passive-bioacoustic (public, no auth required)

Two highest-priority sub-datasets:

1. NRS (Ocean Noise Reference Station Network)

12 fixed moorings, 2014-present, continuous long-term monitoring.

| Field | Value |
| --- | --- |
| Path | nrs/audio/{station_id}/{deployment}/audio/ |
| File format | FLAC, ~4h recordings |
| Sample rate | 5 kHz (optimised for 20 Hz-2 kHz low-frequency whales) |
| Naming | NRS01_20141014_234015.flac |
| Stations | NRS01 (Bering Sea), NRS02 (Gulf of Alaska), NRS03 (Olympic Coast), NRS04 (Hawaii), NRS05 (Channel Islands), NRS06 (Gulf of Mexico), NRS07-08 (Atlantic), NRS09 (Stellwagen Bank, right whales), NRS10 (American Samoa), NRS11 (Cordell Bank), NRS12 (US Virgin Islands) |

2. SanctSound (National Marine Sanctuaries)

30 sites across 8 sanctuaries, 2018-2021, higher sample rates.

| Field | Value |
| --- | --- |
| Path | sanctsound/audio/{site}/{deployment}/audio/ |
| File format | FLAC, 15-30 min recordings |
| Sample rate | 48-96 kHz (SoundTrap instruments) |
| Naming | SanctSound_MB01_01_671399971_20181115T000002Z.flac |
| Sites | mb=Monterey Bay, hi=Hawaiian Islands, sb=Stellwagen Bank, ci=Channel Islands, fk=Florida Keys, oc=Olympic Coast, gr=Gray's Reef, pm=Papahanaumokuakea |

Implementation Ideas

Data access (GCS, not AWS S3)

# google-cloud-storage with anonymous credentials
from google.cloud import storage
client = storage.Client.create_anonymous_client()
bucket = client.bucket("noaa-passive-bioacoustic")

New dependency: google-cloud-storage (to add to pyproject.toml).
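
To make the access pattern concrete, here is a minimal sketch of listing FLAC blobs under a prefix. The helper names `make_anonymous_client` and `list_flac_blobs` are hypothetical, not existing whalu functions; the only library calls assumed are `storage.Client.create_anonymous_client()` and `Client.list_blobs(bucket, prefix=...)`. The import is deferred so the listing helper can be exercised with a stub client when google-cloud-storage is not installed.

```python
from typing import Any

BUCKET = "noaa-passive-bioacoustic"  # bucket name taken from this issue


def make_anonymous_client() -> Any:
    """Create an unauthenticated client (requires google-cloud-storage)."""
    from google.cloud import storage  # deferred: optional dependency

    return storage.Client.create_anonymous_client()


def list_flac_blobs(client: Any, prefix: str) -> list[str]:
    """Return sorted names of blobs under `prefix` that end in .flac.

    Hypothetical helper: `client` only needs a `list_blobs(bucket, prefix=...)`
    method returning objects with a `.name` attribute.
    """
    blobs = client.list_blobs(BUCKET, prefix=prefix)
    return sorted(b.name for b in blobs if b.name.lower().endswith(".flac"))
```

Something like `list_flac_blobs(make_anonymous_client(), "nrs/audio/")` would then enumerate every NRS recording, which may be slow for a full program; narrowing the prefix to one station and deployment is likely the practical default.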

whalu/data/noaa.py — key functions

from collections.abc import Iterator

import numpy as np

def list_deployments(program: str, site: str) -> list[str]:
    # e.g. list_deployments("nrs", "01") -> ["nrs_01_2014-2015", ...]
    ...

def list_files(program: str, site: str, deployment: str) -> list[str]:
    # returns sorted GCS blob names for all FLAC files in the deployment
    ...

def download_audio(blob_name: str, target_sr: int, limit_s: float | None) -> tuple[np.ndarray, float]:
    # downloads FLAC to a tempfile, loads with librosa (handles FLAC natively)
    # and returns (samples, sample_rate)
    ...

def stream_chunks(blob_name: str, target_sr: int, chunk_s: float = 3600.0) -> Iterator[...]:
    # for long recordings (NRS ~4h files), yield the audio in chunk_s-second chunks
    ...
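
One way `stream_chunks` could split a long file is to compute frame boundaries up front and load each slice separately. The sketch below shows only that boundary arithmetic; `chunk_bounds` is a hypothetical helper, and how each slice is actually read (for example via the `offset` and `duration` arguments of `librosa.load`) is left open.

```python
from collections.abc import Iterator


def chunk_bounds(
    total_frames: int, sr: int, chunk_s: float = 3600.0
) -> Iterator[tuple[int, int]]:
    """Yield (start, stop) frame indices covering a file in chunk_s-second pieces.

    The final chunk is shorter when the file length is not a multiple of chunk_s.
    """
    frames_per_chunk = int(round(chunk_s * sr))
    if frames_per_chunk <= 0:
        raise ValueError("chunk_s * sr must be at least one frame")
    for start in range(0, total_frames, frames_per_chunk):
        yield start, min(start + frames_per_chunk, total_frames)
```

For a 4 h NRS file at 5 kHz (72,000,000 frames) with the default `chunk_s`, this yields four chunks of 18,000,000 frames each.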

Sample rate considerations

  • NRS at 5 kHz: the Perch multispecies_whale model expects 24 kHz input. librosa can upsample to that rate, but a 5 kHz recording only contains energy up to its 2.5 kHz Nyquist limit, so upsampling adds no information and detection sensitivity for higher-frequency calls (e.g. orca clicks) will be reduced. Low-frequency species (blue, fin) should be largely unaffected; humpback calls that extend above 2.5 kHz may lose their upper harmonics.
  • SanctSound at 48-96 kHz: downsampling to 24 kHz is straightforward and, with the band-limiting applied by librosa's resampler, preserves everything below 12 kHz, which is the Nyquist limit of the model's input.
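
The arithmetic behind both points is the same: after resampling, the usable bandwidth is the Nyquist frequency of whichever rate is lower. A minimal sketch, where `usable_bandwidth_hz` is a hypothetical helper and the 24 kHz default is the model input rate stated above:

```python
def usable_bandwidth_hz(source_sr: float, target_sr: float = 24_000.0) -> float:
    """Highest frequency that survives resampling from source_sr to target_sr.

    Upsampling cannot add content above the source Nyquist limit, and
    downsampling discards content above the target Nyquist limit.
    """
    return min(source_sr, target_sr) / 2.0


# Rates quoted in this issue: NRS (5 kHz) and SanctSound (48 and 96 kHz).
for name, sr in [("NRS", 5_000), ("SanctSound", 48_000), ("SanctSound", 96_000)]:
    print(f"{name} @ {sr} Hz -> usable up to {usable_bandwidth_hz(sr):.0f} Hz")
```

This gives 2,500 Hz for NRS and 12,000 Hz for both SanctSound rates.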

Timestamp parsing

NRS files use NRS01_YYYYMMDD_HHMMSS.flac and SanctSound files end in _YYYYMMDDTHHMMSSZ.flac; both differ from MBARI's MARS-YYYYMMDDTHHMMSSZ-16kHz.wav. add_timestamps() in analysis.py will need to handle these patterns (or source names should be normalised at ingest time).
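
A possible shape for the parsing step, written against the two example file names in this issue. `parse_noaa_timestamp` is a hypothetical helper rather than part of analysis.py, and the regular expressions are inferred from those two names only. Treating NRS timestamps as UTC is an assumption that should be checked against the deployment metadata.

```python
import re
from datetime import datetime, timezone

# Patterns inferred from the example names in this issue (not an official spec).
_NRS = re.compile(r"^NRS\d{2}_(\d{8})_(\d{6})\.flac$")
_SANCTSOUND = re.compile(r"^SanctSound_.+_(\d{8})T(\d{6})Z\.flac$")


def parse_noaa_timestamp(blob_name: str) -> datetime:
    """Extract the recording start time from an NRS or SanctSound file name."""
    filename = blob_name.rsplit("/", 1)[-1]  # drop any GCS path prefix
    for pattern in (_NRS, _SANCTSOUND):
        match = pattern.match(filename)
        if match:
            stamp = datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
            # Assumption: NRS names are in UTC, like the Z-suffixed SanctSound names.
            return stamp.replace(tzinfo=timezone.utc)
    raise ValueError(f"unrecognised NOAA file name: {filename!r}")
```

Raising on unknown names, rather than returning `None`, keeps unparseable files from silently dropping out of a time series.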

CLI additions

# List available NRS deployments
uv run whalu info noaa-nrs

# Scan a specific NRS station
uv run whalu scan noaa --program nrs --site 05 --start 2023-01 --output-dir data/detections/noaa

# Scan SanctSound Monterey Bay
uv run whalu scan noaa --program sanctsound --site mb01 --output-dir data/detections/noaa

sources.py

Add NOAA_NRS and NOAA_SANCTSOUND entries to SOURCE_REGISTRY.

Use Cases

  • Multi-region comparison: Pacific coast vs Atlantic vs Hawaii species distributions
  • Stellwagen Bank NRS09 for North Atlantic right whale (critically endangered) detection
  • Long time series (2014-present NRS) for seasonal and inter-annual trends
  • SanctSound labeled detection data (available via ERDDAP) as ground truth for model validation

Component Impact

  • Core functionality (whalu/data/noaa.py, whalu/sources.py)
  • CLI (whalu/cli/scan.py — new scan noaa subcommand)
  • Documentation
  • API
  • Docker/Infrastructure

Additional Context

NOAA also exposes species presence/absence detections (no audio processing needed) via ERDDAP for SanctSound sites:

https://coastwatch.pfeg.noaa.gov/erddap/griddap/noaaSanctSound_MB01_01_bluewhale_1d

This could be a fast path to validated ground-truth data for benchmarking the Perch model against human annotators.

Metadata JSON per deployment is available at:
gs://noaa-passive-bioacoustic/{program}/audio/{site}/{deployment}/metadata/*.json
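
The audio and metadata locations share one layout, so both prefixes could be built from the same three parts. The sketch below only formats the path templates quoted in this issue; the function names are hypothetical, and the exact spelling of `site` and `deployment` values in the bucket (for example `MB01` versus `mb01`) is not confirmed here.

```python
BUCKET = "noaa-passive-bioacoustic"  # bucket name taken from this issue


def audio_prefix(program: str, site: str, deployment: str) -> str:
    """Blob prefix for a deployment's FLAC files."""
    return f"{program}/audio/{site}/{deployment}/audio/"


def metadata_prefix(program: str, site: str, deployment: str) -> str:
    """Blob prefix for a deployment's metadata JSON files."""
    return f"{program}/audio/{site}/{deployment}/metadata/"


def gs_uri(blob_name: str) -> str:
    """Full gs:// URI for a blob name or prefix in the NOAA bucket."""
    return f"gs://{BUCKET}/{blob_name}"
```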

Priority

  • Important for my use case


Labels: data-source (new or updated audio data source integration), enhancement (new feature or request)
