Problem Statement
whalu currently supports MBARI Pacific Sound (California) and Orcasound (Puget Sound). Adding NOAA's passive acoustic archive would expand coverage to 12 US ocean regions, including the Atlantic, Gulf of Mexico, Alaska, Hawaii, and National Marine Sanctuaries — unlocking multi-year, multi-site whale detection at national scale.
Proposed Solution
Implement whalu/data/noaa.py as a new data source module, analogous to mbari.py, backed by NOAA's public GCS bucket.
Bucket: gs://noaa-passive-bioacoustic (public, no auth required)
Two highest-priority sub-datasets:
1. NRS (Ocean Noise Reference Station Network)
12 fixed moorings, 2014-present, continuous long-term monitoring.
| Field | Value |
| --- | --- |
| Path | nrs/audio/{station_id}/{deployment}/audio/ |
| File format | FLAC, ~4h recordings |
| Sample rate | 5 kHz (optimised for 20 Hz-2 kHz low-frequency whales) |
| Naming | NRS01_20141014_234015.flac |
| Stations | NRS01 (Bering Sea), NRS02 (Gulf of Alaska), NRS03 (Olympic Coast), NRS04 (Hawaii), NRS05 (Channel Islands), NRS06 (Gulf of Mexico), NRS07-08 (Atlantic), NRS09 (Stellwagen Bank, right whales), NRS10 (American Samoa), NRS11 (Cordell Bank), NRS12 (US Virgin Islands) |
2. SanctSound (National Marine Sanctuaries)
30 sites across 8 sanctuaries, 2018-2021, higher sample rates.
| Field | Value |
| --- | --- |
| Path | sanctsound/audio/{site}/{deployment}/audio/ |
| File format | FLAC, 15-30 min recordings |
| Sample rate | 48-96 kHz (SoundTrap instruments) |
| Naming | SanctSound_MB01_01_671399971_20181115T000002Z.flac |
| Sites | mb=Monterey Bay, hi=Hawaiian Islands, sb=Stellwagen Bank, ci=Channel Islands, fk=Florida Keys, oc=Olympic Coast, gr=Gray's Reef, pm=Papahanaumokuakea |
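The site prefixes listed above could be resolved to sanctuary names with a small lookup. This is only a sketch: `sanctuary_for_site` is a hypothetical helper, and it assumes site IDs are always a two-letter prefix followed by a two-digit number, as in `mb01` and `MB01` elsewhere in this issue.

```python
import re

# Two-letter SanctSound site prefixes, copied from the table above.
SANCTUARIES = {
    "mb": "Monterey Bay",
    "hi": "Hawaiian Islands",
    "sb": "Stellwagen Bank",
    "ci": "Channel Islands",
    "fk": "Florida Keys",
    "oc": "Olympic Coast",
    "gr": "Gray's Reef",
    "pm": "Papahanaumokuakea",
}


def sanctuary_for_site(site: str) -> str:
    """Map a site ID such as "mb01" to its sanctuary name (hypothetical helper)."""
    # Assumption: site IDs are <two letters><two digits>, case-insensitive.
    match = re.fullmatch(r"([a-z]{2})\d{2}", site.lower())
    if match is None or match.group(1) not in SANCTUARIES:
        raise ValueError(f"unrecognised SanctSound site: {site!r}")
    return SANCTUARIES[match.group(1)]
```

Validating the site this way before touching the bucket would let the CLI reject typos early instead of returning an empty listing.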
Implementation Ideas
Data access (GCS, not AWS S3)
# google-cloud-storage with anonymous credentials
from google.cloud import storage
client = storage.Client.create_anonymous_client()
bucket = client.bucket("noaa-passive-bioacoustic")
New dependency: google-cloud-storage (to add to pyproject.toml).
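As a rough sketch of how listing might work on top of that client, the helpers below build the audio prefix from the path layout described earlier and filter a listing down to FLAC files. `audio_prefix` and `list_flac_blobs` are hypothetical names, and the exact form of the station and deployment identifiers in the bucket is not confirmed here. The client is passed in, so the same function works with `storage.Client.create_anonymous_client()` or with a test double.

```python
from typing import Any

BUCKET = "noaa-passive-bioacoustic"


def audio_prefix(program: str, site: str, deployment: str) -> str:
    """Build the blob prefix for one deployment's audio folder."""
    return f"{program}/audio/{site}/{deployment}/audio/"


def list_flac_blobs(client: Any, prefix: str) -> list[str]:
    """Return sorted blob names under `prefix` that end in .flac.

    `client` is expected to behave like google.cloud.storage.Client,
    i.e. to provide list_blobs(bucket_name, prefix=...).
    """
    blobs = client.list_blobs(BUCKET, prefix=prefix)
    return sorted(b.name for b in blobs if b.name.lower().endswith(".flac"))


# Possible usage against the real bucket (requires google-cloud-storage):
#   from google.cloud import storage
#   client = storage.Client.create_anonymous_client()
#   names = list_flac_blobs(client, audio_prefix("nrs", "<station_id>", "<deployment>"))
```

Injecting the client keeps the listing logic testable offline, which may matter if CI has no network access.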
whalu/data/noaa.py — key functions
from collections.abc import Iterator
import numpy as np

def list_deployments(program: str, site: str) -> list[str]:
    """Return deployment names, e.g. list_deployments("nrs", "01") -> ["nrs_01_2014-2015", ...]."""

def list_files(program: str, site: str, deployment: str) -> list[str]:
    """Return sorted GCS blob names for all FLAC files in a deployment."""

def download_audio(blob_name: str, target_sr: int, limit_s: float | None) -> tuple[np.ndarray, float]:
    """Download a FLAC blob to a tempfile and load it with librosa (handles FLAC natively)."""

def stream_chunks(blob_name: str, target_sr: int, chunk_s: float = 3600.0) -> Iterator[...]:
    """Stream long recordings (NRS files are ~4h) in chunks of chunk_s seconds."""
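One way `stream_chunks` might split a long recording is to compute (offset, duration) windows up front and load each window separately. The sketch below covers only that arithmetic; `chunk_windows` is a hypothetical helper, not part of whalu.

```python
from collections.abc import Iterator


def chunk_windows(duration_s: float, chunk_s: float = 3600.0) -> Iterator[tuple[float, float]]:
    """Yield (offset_s, length_s) windows that cover a recording of duration_s."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    offset = 0.0
    while offset < duration_s:
        # The final window is shortened so it never runs past the end of the file.
        yield offset, min(chunk_s, duration_s - offset)
        offset += chunk_s
```

Each pair could then be passed to `librosa.load(path, sr=target_sr, offset=offset, duration=length)`, so only one chunk is held in memory at a time.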
Sample rate considerations
- NRS at 5 kHz: the Perch multispecies_whale model expects 24 kHz input. librosa resampling handles the conversion, but 5 kHz recordings only carry energy up to 2.5 kHz (Nyquist), so detection sensitivity for higher-frequency calls (e.g. orca clicks) will be reduced. Low-frequency species (blue, fin, humpback) should be unaffected.
- SanctSound at 48-96 kHz: downsampling to 24 kHz is straightforward and keeps everything below the model's 12 kHz Nyquist limit, so nothing the model can use is lost.
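The Nyquist reasoning in both bullets reduces to one line of arithmetic: after resampling, the usable band is capped by whichever sample rate is lower. A minimal sketch, with `usable_band_hz` as a hypothetical helper name:

```python
def usable_band_hz(source_sr: float, model_sr: float = 24_000.0) -> float:
    """Highest frequency (Hz) that survives resampling from source_sr to model_sr.

    Upsampling cannot add content above the source Nyquist limit, and
    downsampling discards content above the model's Nyquist limit.
    """
    return min(source_sr, model_sr) / 2.0
```

For NRS, `usable_band_hz(5_000)` gives 2500.0 Hz. For SanctSound at either 48 kHz or 96 kHz the result is 12000.0 Hz, the model's own limit.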
Timestamp parsing
NRS files use NRS01_YYYYMMDD_HHMMSS.flac — a different naming scheme from MBARI's MARS-YYYYMMDDTHHMMSSZ-16kHz.wav. add_timestamps() in analysis.py will need to handle this pattern (or source names should be normalised at ingest time).
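A minimal sketch of the parsing side, covering the two NOAA naming schemes shown in this issue. `parse_start_time` is a hypothetical helper, not the existing `add_timestamps()`. Treating NRS timestamps as UTC is an assumption, since only the SanctSound names carry an explicit `Z` suffix.

```python
import re
from datetime import datetime, timezone

# NRS01_20141014_234015.flac
_NRS = re.compile(r"^NRS\d{2}_(\d{8})_(\d{6})\.flac$")
# SanctSound_MB01_01_671399971_20181115T000002Z.flac
_SANCTSOUND = re.compile(r"^SanctSound_[A-Z]{2}\d{2}_\d{2}_\d+_(\d{8})T(\d{6})Z\.flac$")


def parse_start_time(filename: str) -> datetime:
    """Extract the recording start time from an NRS or SanctSound file name."""
    name = filename.rsplit("/", 1)[-1]  # accept full blob names as well
    for pattern in (_NRS, _SANCTSOUND):
        match = pattern.match(name)
        if match:
            stamp = match.group(1) + match.group(2)
            # Assumption: NRS times are UTC, like the Z-suffixed SanctSound times.
            return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    raise ValueError(f"unrecognised NOAA file name: {filename!r}")
```

Normalising to a timezone-aware `datetime` at ingest time would let downstream analysis ignore which archive a file came from.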
CLI additions
# List available NRS deployments
uv run whalu info noaa-nrs
# Scan a specific NRS station
uv run whalu scan noaa --program nrs --site 05 --start 2023-01 --output-dir data/detections/noaa
# Scan SanctSound Monterey Bay
uv run whalu scan noaa --program sanctsound --site mb01 --output-dir data/detections/noaa
sources.py
Add NOAA_NRS and NOAA_SANCTSOUND entries to SOURCE_REGISTRY.
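The real shape of `SOURCE_REGISTRY` is not shown in this issue, so the following is purely illustrative. `SourceSpec`, its fields, and the `noaa-sanctsound` key are hypothetical; the bucket name, path templates, sample rates, and the `noaa-nrs` key come from the sections above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpec:
    """Hypothetical registry entry; the actual whalu structure may differ."""

    bucket: str
    prefix_template: str
    sample_rate_hz: tuple[int, int]  # (min, max) native sample rate


NOAA_NRS = SourceSpec(
    bucket="noaa-passive-bioacoustic",
    prefix_template="nrs/audio/{station_id}/{deployment}/audio/",
    sample_rate_hz=(5_000, 5_000),
)

NOAA_SANCTSOUND = SourceSpec(
    bucket="noaa-passive-bioacoustic",
    prefix_template="sanctsound/audio/{site}/{deployment}/audio/",
    sample_rate_hz=(48_000, 96_000),
)

# Stand-in for the existing registry in sources.py.
SOURCE_REGISTRY: dict[str, SourceSpec] = {
    "noaa-nrs": NOAA_NRS,
    "noaa-sanctsound": NOAA_SANCTSOUND,
}
```

Keeping the path template in the entry would let `noaa.py` stay program-agnostic, although the existing MBARI entries should dictate the final shape.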
Use Cases
- Multi-region comparison: Pacific coast vs Atlantic vs Hawaii species distributions
- Stellwagen Bank NRS09 for North Atlantic right whale (critically endangered) detection
- Long time series (2014-present NRS) for seasonal and inter-annual trends
- SanctSound labeled detection data (available via ERDDAP) as ground truth for model validation
Component Impact
- whalu/data/noaa.py, whalu/sources.py
- whalu/cli/scan.py (new scan noaa subcommand)
Additional Context
NOAA also exposes species presence/absence detections (no audio processing needed) via ERDDAP for SanctSound sites:
https://coastwatch.pfeg.noaa.gov/erddap/griddap/noaaSanctSound_MB01_01_bluewhale_1d
This could be a fast path to validated ground-truth data for benchmarking the Perch model against human annotators.
Metadata JSON per deployment is available at:
gs://noaa-passive-bioacoustic/{program}/audio/{site}/{deployment}/metadata/*.json
Priority