diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..0cfe58b
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,96 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Common Development Commands
+
+### Testing
+```bash
+pytest                             # Run all tests
+pytest tests/test_process_rodb.py  # Run specific test file
+pytest tests/test_mooring_rodb.py  # Run mooring RODB test file
+pytest -v                          # Verbose output
+pytest --cov=oceanarray            # With coverage report
+```
+
+### Code Quality and Linting
+```bash
+black .                     # Format code with black
+ruff check .                # Run ruff linter
+ruff check . --fix          # Auto-fix issues where possible
+pre-commit run --all-files  # Run all pre-commit hooks
+codespell                   # Check for spelling errors
+```
+
+### Documentation
+```bash
+cd docs
+make html        # Build documentation locally
+make clean html  # Clean build and rebuild
+```
+
+### Environment Setup
+```bash
+pip install -r requirements-dev.txt  # Install development dependencies
+pip install -e .                     # Install package in development mode
+```
+
+### Jupyter Notebooks
+```bash
+jupyter nbconvert --clear-output --inplace notebooks/*.ipynb  # Clear notebook outputs
+```
+
+## High-Level Architecture
+
+### Core Processing Stages
+The codebase implements a multi-stage processing pipeline for oceanographic mooring data:
+
+1. **Stage 1** (`stage1.py`): Raw data conversion and initial processing using ctd_tools readers
+2. **Stage 2** (`stage2.py`): Clock-offset corrections and trimming to the deployment period
+3. **Time Gridding** (`time_gridding.py`): Multi-instrument coordination, filtering, and interpolation onto common time grids (supersedes `mooring_rodb.py`)
+4. **Array Level** (`transports.py`): Cross-mooring calculations and transport computations (work in progress)
+
+### Key Components
+
+- **Data Readers** (`readers.py`, `rodb.py`): Handle various oceanographic data formats
+- **Data Writers** (`writers.py`): Output processed data in standardized formats
+- **Processing Tools** (`tools.py`, `utilities.py`): Core algorithms for data manipulation
+- **Time Operations** (`time_gridding.py`, `clock_offset.py`, `find_deployment.py`): Temporal processing
+- **Visualization** (`plotters.py`): Data visualization and quality assessment
+- **Logging** (`logger.py`): Configurable logging system
+
+### Data Flow Architecture
+1. Raw instrument files → Stage1 → CF-NetCDF standardized format
+2. Stage1 outputs → Stage2 → Clock corrections and deployment trimming
+3. Multiple instruments → Time Gridding → Common time grid with optional filtering
+4. Multiple moorings → Array-level transport calculations (in development)
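+
+A minimal sketch of driving stages 2 and 3 from Python, based on the calls used in the demo notebooks (`process_multiple_moorings_stage2` and `time_gridding_mooring`; the import locations and exact signatures are assumptions and may differ):
+
+```python
+from oceanarray.stage2 import process_multiple_moorings_stage2
+from oceanarray.time_gridding import time_gridding_mooring
+
+basedir = '/path/to/data/'   # hypothetical data root containing moor/proc/...
+moorlist = ['dsE_1_2018']    # mooring names matching the YAML configs
+
+# Stage 2: apply clock offsets and trim to deployment (writes *_use.nc)
+results = process_multiple_moorings_stage2(moorlist, basedir)
+
+# Time gridding: combine instruments, optionally low-pass filtering first
+ok = time_gridding_mooring(
+    'dsE_1_2018', basedir,
+    file_suffix='_use',
+    filter_type='lowpass',
+    filter_params={'cutoff_days': 2.0, 'order': 6},
+)
+```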
+
+### File Type Support
+Supports multiple instrument formats via ctd_tools:
+- SeaBird CNV/ASC files (`sbe-cnv`, `sbe-asc`)
+- RBR RSK/DAT files (`rbr-rsk`, `rbr-dat`)
+- Nortek AQD files (`nortek-aqd`)
+
+### Key Design Patterns
+- Uses xarray.Dataset as the primary data structure throughout the pipeline
+- Implements CF conventions for metadata and naming
+- Modular processing stages that can be run independently
+- Configurable logging with different verbosity levels
+- YAML-based configuration for processing parameters
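+
+Example of the YAML-driven configuration pattern (a sketch; the config path and the `instrument`, `serial`, and `depth` fields follow the demo notebooks, but the full schema may contain additional keys):
+
+```python
+import yaml
+from pathlib import Path
+
+basedir = '/path/to/data/'  # hypothetical data root
+mooring_name = 'dsE_1_2018'
+proc_dir = Path(basedir) / 'moor' / 'proc' / mooring_name
+config = yaml.safe_load((proc_dir / f"{mooring_name}.mooring.yaml").read_text())
+
+for inst in config.get("instruments", []):
+    print(inst.get("instrument"), inst.get("serial"), inst.get("depth"))
+```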
+
+### Legacy Modules
+- `process_rodb.py`: Legacy RODB-format processing functions (for RAPID-style workflows)
+- `mooring_rodb.py`: Legacy RODB mooring-level processing (superseded by time_gridding.py)
+
+### Testing Structure
+- Comprehensive test coverage with pytest
+- Tests organized by module (`test_*.py` files)
+- Uses sample data for integration testing
+- Pre-commit hooks ensure code quality
+
+### Dependencies
+- **Core**: numpy, pandas, xarray, netcdf4, scipy
+- **Oceanographic**: gsw (seawater calculations), ioos_qc (quality control), ctd_tools
+- **Development**: pytest, black, ruff, pre-commit, sphinx
+
+The codebase emphasizes reproducible scientific data processing with clear documentation and methodological transparency.
\ No newline at end of file
diff --git a/docs/source/project_structure.md b/docs/source/project_structure.md
index bacc873..5400c70 100644
--- a/docs/source/project_structure.md
+++ b/docs/source/project_structure.md
@@ -1,156 +1,191 @@
-# What’s in This Template Project?
+# OceanArray Project Structure
 
-> 🐍 This project is designed for a **Python-based code repository**. It includes features to help you manage, test, document, and share your code.
-
-Below is an overview of the files and folders you’ll find in the `template-project`, along with what they do and why they’re useful. If you're new to GitHub or Python packaging, this is your orientation.
+This document provides an overview of the oceanarray codebase structure and organization.
 
 ---
 
 ## 🔍 Project Structure Overview
 
-📷 *This is what the template looks like when you clone or fork it:*
-# 📁 `template-project` File Structure
-
-A minimal, modular Python project structure for collaborative research and reproducible workflows.
-
 ```
-template-project/
-├── template_project             # [core] Main Python package with scientific code
-│   ├── __init__.py              # [core] Makes this a Python package
-│   ├── plotters.py              # [core] Functions to plot data
-│   ├── readers.py               # [core] Functions to read raw data into xarray datasets
-│   ├── read_rapid.py            # [core] Example for a separate module for a specific dataset
-│   ├── writers.py               # [core] Functions to write data (e.g., to NetCDF)
-│   ├── tools.py                 # [core] Utilities for unit conversion, calculations, etc.
-│   ├── logger.py                # [core] Structured logging configuration for reproducible runs
-│   ├── template_project.mplstyle # [core] Default plotting parameters
-│   └── utilities.py             # [core] Helper functions (e.g., file download or parsing)
+oceanarray/
+├── oceanarray/                   # [core] Main Python package for oceanographic processing
+│   ├── __init__.py               # [core] Makes this a Python package
+│   ├── stage1.py                 # [core] Stage1: Raw data conversion to NetCDF (modern workflow)
+│   ├── stage2.py                 # [core] Stage2: Clock corrections and trimming (modern workflow)
+│   ├── time_gridding.py          # [core] Time gridding and mooring-level processing (modern workflow)
+│   ├── clock_offset.py           # [core] Clock offset detection and correction analysis
+│   ├── find_deployment.py        # [core] Deployment detection from temperature profiles
+│   ├── readers.py                # [core] Functions to read various oceanographic data formats
+│   ├── writers.py                # [core] Functions to write processed data to NetCDF
+│   ├── rodb.py                   # [core] RODB format reader for legacy RAPID data
+│   ├── process_rodb.py           # [legacy] Legacy RODB instrument processing functions
+│   ├── mooring_rodb.py           # [legacy] Legacy RODB mooring-level processing functions
+│   ├── tools.py                  # [core] Core utilities (lag correlation, QC functions)
+│   ├── convertOS.py              # [format] OceanSites format conversion utilities
+│   ├── plotters.py               # [viz] Data visualization and plotting functions
+│   ├── rapid_interp.py           # [interp] Physics-based vertical interpolation
+│   ├── transports.py             # [analysis] Transport calculations (work in progress)
+│   ├── logger.py                 # [core] Structured logging configuration
+│   ├── utilities.py              # [core] General helper functions
+│   └── config/                   # [config] Configuration files for processing
+│       ├── OS1_var_names.yaml    # [config] OceanSites variable name mappings
+│       ├── OS1_vocab_attrs.yaml  # [config] OceanSites vocabulary attributes
+│       ├── OS1_sensor_attrs.yaml # [config] OceanSites sensor attributes
+│       └── project_RAPID.yaml    # [config] RAPID project configuration
 │
-├── tests/                       # [test] Unit tests using pytest
-│   ├── test_readers.py          # [test] Test functions in readers.py
-│   ├── test_tools.py            # [test] Test functions in tools.py
-│   ├── test_utilities.py        # [test] Test functions in utilities.py
+├── tests/                        # [test] Unit tests using pytest
+│   ├── test_stage1.py            # [test] Test Stage1 processing
+│   ├── test_stage2.py            # [test] Test Stage2 processing
+│   ├── test_rodb.py              # [test] Test RODB data reading
+│   ├── test_process_rodb.py      # [test] Test legacy RODB processing functions
+│   ├── test_mooring_rodb.py      # [test] Test legacy RODB mooring functions
+│   ├── test_tools.py             # [test] Test core utility functions
+│   ├── test_convertOS.py         # [test] Test OceanSites conversion
 │   └── ...
 │
-├── docs/                        # [docs]
-│   ├── source/                  # [docs] Sphinx documentation source files
-│   │   ├── conf.py              # [docs] Setup for documentation
-│   │   ├── index.rst            # [docs] Main page with menus in *.rst
-│   │   ├── setup.md             # [docs] One of the documentation pages in *.md
-│   │   ├── template_project.rst # [docs] The file to create the API based on docstrings
-│   │   ├── ...                  # [docs] More *.md or *.rst linked in index.rst
-│   │   └── _static              # [docs] Figures
-│   │       ├── css/custom.css   # [docs, style] Custom style sheet for docs
-│   │       └── logo.png         # [docs] logo for top left of docs/
-│   └── Makefile                 # [docs] Build the docs
-│
-├── notebooks/                   # [demo] Example notebooks
-│   ├── demo.ipynb               # [demo] Also run in docs.yml to appear in docs
-│   └── ...
+├── notebooks/                    # [demo] Processing demonstration notebooks
+│   ├── demo_stage1.ipynb         # [demo] Stage1 processing demo
+│   ├── demo_stage2.ipynb         # [demo] Stage2 processing demo
+│   ├── demo_step1.ipynb          # [demo] Time gridding (mooring-level) demo
+│   ├── demo_instrument.ipynb     # [demo] Compact instrument processing workflow
+│   ├── demo_clock_offset.ipynb   # [demo] Clock offset analysis (refactored)
+│   ├── demo_check_clock.ipynb    # [demo] Clock offset analysis (original)
+│   ├── demo_instrument_rdb.ipynb # [demo] Legacy RODB instrument processing
+│   ├── demo_mooring_rdb.ipynb    # [demo] Legacy RODB mooring processing
+│   ├── demo_batch_instrument.ipynb # [demo] Batch processing and QC analysis
+│   └── demo_climatology.ipynb    # [demo] Climatological processing
 │
-├── data/                        # [data]
-│   └── moc_transports.nc        # [data] Example data file used for the template.
+├── docs/                         # [docs] Sphinx documentation
+│   ├── source/                   # [docs] Documentation source files
+│   │   ├── conf.py               # [docs] Sphinx configuration
+│   │   ├── index.rst             # [docs] Main documentation page
+│   │   ├── processing_framework.rst # [docs] Processing workflow documentation
+│   │   ├── roadmap.rst           # [docs] Development roadmap
+│   │   ├── methods/              # [docs] Method documentation
+│   │   │   ├── standardisation.rst # [docs] Stage1 standardization
+│   │   │   ├── trimming.rst      # [docs] Stage2 trimming
+│   │   │   ├── time_gridding.rst # [docs] Time gridding methods
+│   │   │   ├── clock_offset.rst  # [docs] Clock offset analysis
+│   │   │   └── ...
+│   │   └── _static/              # [docs] Static files (images, CSS)
+│   └── Makefile                  # [docs] Build documentation
 │
-├── logs/                        # [core] Log output from structured logging
-│   └── amocarray_*.log          # [core]
+├── data/                         # [data] Sample and test data
+│   ├── moor/                     # [data] Mooring data directory structure
+│   │   ├── proc/                 # [data] Processed data
+│   │   └── raw/                  # [data] Raw instrument files
+│   └── climatology/              # [data] Climatological reference data
 │
-├── .github/                     # [ci] GitHub-specific workflows (e.g., Actions)
+├── .github/                      # [ci] GitHub-specific workflows
 │   ├── workflows/
-│   │   ├── docs.yml             # [ci] Test build documents on *pull-request*
-│   │   ├── docs_deploy.yml      # [ci] Build and deploy documents on "merge"
-│   │   ├── pypi.yml             # [ci] Package and release on GitHub.com "release"
-│   │   └── test.yml             # [ci] Run pytest on tests/test_.py on *pull-request*
-│   ├── ISSUE_TEMPLATE.md        # [ci, meta] Template for issues on Github
-│   └── PULL_REQUEST_TEMPLATE.md # [ci, meta] Template for pull requests on Github
+│   │   ├── tests.yml             # [ci] Run pytest on pull requests
+│   │   └── docs.yml              # [ci] Build documentation
+│   └── ...
 │
-├── .gitignore                   # [meta] Exclude build files, logs, data, etc.
-├── requirements.txt             # [meta] Pip requirements
-├── requirements-dev.txt         # [meta] Pip requirements for development (docs, tests, linting)
-├── .pre-commit-config.yaml      # [style] Instructions for pre-commits to run (linting)
-├── pyproject.toml               # [ci, meta, style] Build system and config linters
-├── CITATION.cff                 # [meta] So Github can populate the "cite" button
-├── README.md                    # [meta] Project overview and getting started
-└── LICENSE                      # [meta] Open source license (e.g., MIT as default)
+├── CLAUDE.md                     # [meta] Claude Code guidance file
+├── .gitignore                    # [meta] Git ignore patterns
+├── requirements.txt              # [meta] Core dependencies
+├── requirements-dev.txt          # [meta] Development dependencies
+├── .pre-commit-config.yaml       # [style] Pre-commit hooks configuration
+├── pyproject.toml                # [meta] Build system and project metadata
+├── README.md                     # [meta] Project overview
+└── LICENSE                       # [meta] MIT License
 ```
 
-The tags above give an indication of what parts of this template project are used for what purposes, where:
-- `# [core]` – Scientific core logic or core functions used across the project.
-
-- `# [docs]` – Documentation sources, configs, and assets for building project docs.
-- `# [test]` – Automated tests for validating functionality.
-- `# [demo]` – Notebooks and minimal working examples for demos or tutorials.
-- `# [data]` – Sample or test data files.
-- `# [ci]` – Continuous integration setup (GitHub Actions).
-- `# [style]` – Configuration for code style, linting, and formatting.
-- `# [meta]` – Project metadata (e.g., citation info, license, README).
-
-**Note:** There are also files that you may end up generating but which don't necessarily appear in the project on GitHub.com (due to being ignored by your `.gitignore`). These may include your environment (`venv/`, if you use pip and virtual environments), distribution files `dist/` for building packages to deploy on http://pypi.org, `htmlcov/` for coverage reports for tests, `template_project_efw.egg-info` for editable installs (e.g., `pip install -e .`).
-
-## 🔍 Notes
-
-- **Modularity**: Code is split by function (reading, writing, tools).
-- **Logging**: All major functions support structured logging to `logs/`.
-- **Tests**: Pytest-compatible tests are in `tests/`, with one file per module.
-- **Docs**: Sphinx documentation is in `docs/`.
-
+## 🔍 Architecture Overview
+
+### Modern Processing Workflow
+The current recommended workflow uses:
+1. **Stage1** (`stage1.py`) - Format conversion from raw instrument files to CF-NetCDF
+2. **Stage2** (`stage2.py`) - Clock corrections and deployment period trimming
+3. **Time Gridding** (`time_gridding.py`) - Multi-instrument coordination and filtering
+4. **Clock Offset Analysis** (`clock_offset.py`) - Inter-instrument timing validation
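+
+As data moves through these stages, files follow a predictable naming pattern. A sketch using the paths from the demo notebooks (`basedir` and the instrument subdirectory are illustrative):
+
+```python
+from pathlib import Path
+
+basedir = '/path/to/data/'  # hypothetical data root
+mooring, serial = 'dsE_1_2018', 7518
+proc = Path(basedir) / 'moor' / 'proc' / mooring
+
+raw_nc  = proc / 'microcat' / f'{mooring}_{serial}_raw.nc'  # Stage1 output
+use_nc  = proc / 'microcat' / f'{mooring}_{serial}_use.nc'  # Stage2 output
+grid_nc = proc / f'{mooring}_mooring_use.nc'                # time-gridded mooring file
+```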
+
+### Legacy RODB Workflow
+For RAPID/RODB format compatibility:
+- **`process_rodb.py`** - Individual instrument processing functions
+- **`mooring_rodb.py`** - Mooring-level stacking and filtering functions
+- **`rodb.py`** - RODB format data reader
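+
+A minimal sketch of the legacy path, using calls from the RODB demo notebooks (`ds` and `ds_list` stand for datasets loaded earlier, e.g. via `readers.load_dataset`):
+
+```python
+from oceanarray import process_rodb, mooring_rodb
+
+# Instrument level: trim a single record to its deployment period
+ds2, dstart, dend = process_rodb.stage2_trim(ds)  # ds: one instrument's dataset
+
+# Mooring level: stack instruments, low-pass filter, then 12-hour grid
+ds_stack = mooring_rodb.combine_mooring_OS(ds_list)  # ds_list: per-instrument datasets
+ds_filt = mooring_rodb.filter_all_time_vars(ds_stack)
+ds_12h = mooring_rodb.interp_to_12hour_grid(ds_filt)
+```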
+
+### Key Design Principles
+- **CF-Compliant**: Uses CF conventions for metadata and variable naming
+- **xarray-Based**: Primary data structure throughout the pipeline
+- **Modular**: Independent processing stages that can be run separately
+- **Configurable**: YAML-driven configuration for processing parameters
+- **Reproducible**: Comprehensive logging and processing history tracking
+
+### File Organization Tags
+- `[core]` - Essential processing functionality and utilities
+- `[legacy]` - RODB/RAPID legacy format compatibility functions
+- `[demo]` - Example notebooks demonstrating workflows
+- `[test]` - Automated tests for functionality validation
+- `[docs]` - Documentation sources and configuration
+- `[config]` - Processing configuration and parameter files
+- `[data]` - Sample data and directory structure examples
+- `[ci]` - Continuous integration and automation
+- `[meta]` - Project metadata and development configuration
 
 ---
 
-## 🔰 The Basics (Always Included)
+## 🔧 Processing Stages
 
-- **`README.md`** – The first thing people see when they visit your GitHub repo. Use this to explain what your project is, how to install it, and how to get started.
-- **`LICENSE`** – Explains what others are allowed to do with your code. This template uses the **MIT License**:
-  - ✅ Very permissive — allows commercial and private use, modification, and distribution.
-  - 🔗 More license info: [choosealicense.com](https://choosealicense.com/)
-- **`.gitignore`** – Tells Git which files/folders to ignore (e.g., system files, data outputs).
-- **`requirements.txt`** – Lists the Python packages your project needs to run.
+### Stage 1: Standardization
+- **Purpose**: Convert raw instrument files to standardized NetCDF format
+- **Input**: Raw files (`.cnv`, `.rsk`, `.dat`, `.mat`)
+- **Output**: CF-compliant NetCDF files (`*_raw.nc`)
+- **Module**: `stage1.py`
 
----
+### Stage 2: Temporal Corrections
+- **Purpose**: Apply clock corrections and trim to deployment periods
+- **Input**: Stage1 files + YAML with clock offsets
+- **Output**: Time-corrected files (`*_use.nc`)
+- **Module**: `stage2.py`
 
-## 🧰 Python Packaging and Development
+### Time Gridding: Mooring Coordination
+- **Purpose**: Combine instruments onto common time grids with optional filtering
+- **Input**: Stage2 files from multiple instruments
+- **Output**: Mooring-level combined datasets
+- **Module**: `time_gridding.py`
 
-- **`pyproject.toml`** – A modern configuration file for building, installing, and describing your package (e.g. name, author, dependencies).
-- **`requirements-dev.txt`** – Additional tools for developers (testing, linting, formatting, etc.).
-- **`template_project/`** – Your main code lives here. Python will treat this as an importable module.
-- **`pip install -e .`** – Lets you install your project locally in a way that updates as you edit files.
+### Clock Offset Analysis
+- **Purpose**: Detect timing errors between instruments on same mooring
+- **Input**: Stage1 files from multiple instruments
+- **Output**: Recommended clock offset corrections for YAML
+- **Module**: `clock_offset.py`
 
 ---
 
-## 🧪 Testing and Continuous Integration
+## 📊 Data Flow
+
+```
+Raw Files → Stage1 → Stage2 → Time Gridding → Array Analysis
+    ↓          ↓         ↓           ↓               ↓
+ Various    *_raw.nc  *_use.nc   Combined       Transports
+ Formats                         Mooring        & Products
+                                 Datasets
+```
 
-- **`tests/`** – Folder for test files. Use these to make sure your code works as expected.
-- **`.github/workflows/`** – GitHub Actions automation:
-  - `tests.yml` – Runs your tests automatically when you push changes.
-  - `docs.yml` – Builds your documentation to check for errors.
-  - `docs_deploy.yml` – Publishes documentation to GitHub Pages.
-  - `pypi.yml` – Builds and uploads a release to PyPI when you tag a new version.
+**Clock Offset Loop**: Stage1 → Clock Analysis → Update YAML → Stage2
 
 ---
 
-## 📝 Documentation
+## 🧪 Testing Structure
 
-- **`docs/`** – Contains Sphinx and Markdown files to build your documentation site.
-  - Run `make html` or use GitHub Actions to generate a website.
-- **`.vscode/`** – Optional settings for Visual Studio Code (e.g., interpreter paths).
-- **`notebooks/`** – A place to keep example Jupyter notebooks.
+Tests are organized by module with comprehensive coverage:
+- **Core workflow tests**: `test_stage*.py`
+- **Legacy format tests**: `test_*_rodb.py`
+- **Utility tests**: `test_tools.py`, `test_convertOS.py`
+- **Integration tests**: Via demo notebooks in CI
 
 ---
 
-## 🧾 Metadata and Community
+## 📚 Documentation Structure
 
-- **`CITATION.cff`** – Machine-readable citation info. Lets GitHub generate a "Cite this repository" button.
-- **`CONTRIBUTING.md`** – Guidelines for contributing to the project. Useful if you welcome outside help.
-- **`.pre-commit-config.yaml`** – Configuration for running automated checks (e.g., code formatting) before each commit.
+- **Methods documentation**: Detailed processing methodology
+- **API documentation**: Auto-generated from docstrings
+- **Demo notebooks**: Interactive examples and tutorials
+- **Development guides**: Roadmap and contribution guidelines
 
 ---
 
-## ✅ Summary
-
-This template is a starting point for research or open-source Python projects. It supports:
-- Clean project structure
-- Reproducible environments
-- Easy testing
-- Auto-publishing documentation
-- Optional packaging for PyPI
-
-> 💡 Use what you need. Delete what you don’t. This is your scaffold for doing good, shareable science/code.
+This structure supports both modern CF-compliant processing workflows and legacy RAPID/RODB format compatibility, providing a flexible framework for oceanographic mooring data processing.
\ No newline at end of file
diff --git a/docs/source/roadmap.rst b/docs/source/roadmap.rst
index 19ab123..0482ec8 100644
--- a/docs/source/roadmap.rst
+++ b/docs/source/roadmap.rst
@@ -26,7 +26,7 @@ The OceanArray framework currently provides a solid foundation for oceanographic
 
 🟡 **Partially Implemented**
   - Stage 3: Auto QC - basic QARTOD functions exist (``tools.py``)
-  - Stage 4: Calibration - microcat calibration exists (``instrument.py``)
+  - Stage 4: Calibration - microcat calibration exists (``process_rodb.py``)
   - Step 2: Vertical Gridding - physics-based interpolation exists (``rapid_interp.py``)
 
 ❌ **Documented but Not Implemented**
@@ -235,7 +235,7 @@ Priority 3: Enhanced Calibration System
 
 **Documentation**: ``docs/source/methods/calibration.rst``
 
-**Current State**: Basic microcat calibration exists in ``instrument.py``.
+**Current State**: Basic microcat calibration exists in ``process_rodb.py``.
 
 **Missing Implementation**:
 - Multi-instrument calibration support (not just microcat)
@@ -247,7 +247,7 @@ Priority 3: Enhanced Calibration System
 **Estimated Effort**: 2-3 weeks
 
 **Implementation Plan**:
-  1. Expand ``instrument.py`` calibration functions
+  1. Expand ``process_rodb.py`` calibration functions
   2. Create calibration configuration system
   3. Add uncertainty propagation
   4. Design calibration workflow automation
diff --git a/notebooks/demo_batch_instrument.ipynb b/notebooks/demo_batch_instrument.ipynb
index 80c7c7c..ebf172d 100644
--- a/notebooks/demo_batch_instrument.ipynb
+++ b/notebooks/demo_batch_instrument.ipynb
@@ -16,18 +16,7 @@
    "id": "6a1920f3",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "from pathlib import Path\n",
-    "import numpy as np\n",
-    "import xarray as xr\n",
-    "import numpy as np\n",
-    "from oceanarray import readers, mooring, plotters, tools\n",
-    "from oceanarray import writers, convertOS\n",
-    "from ioos_qc import qartod\n",
-    "from ioos_qc.config import Config\n",
-    "import numpy as np\n",
-    "import gsw\n"
-   ]
+   "source": "from pathlib import Path\nimport numpy as np\nimport xarray as xr\nfrom oceanarray import readers, mooring_rodb, plotters, tools, process_rodb\nimport pandas as pd"
  },
  {
   "cell_type": "markdown",
@@ -84,20 +73,7 @@
    "id": "c782730e",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "import importlib\n",
-    "importlib.reload(mooring)\n",
-    "# Flag bad data to convert from P to D\n",
-    "data_dir = Path(\"..\", \"data\")\n",
-    "files = list(data_dir.glob(\"OS_wb2_9_201114_*_P.nc\"))\n",
-    "\n",
-    "ds_list_OS1 = readers.load_dataset(files)\n",
-    "\n",
-    "ds_stack = mooring.combine_mooring_OS(ds_list_OS)\n",
-    "ds_stack = tools.calc_psal(ds_stack)\n",
-    "\n",
-    "ds_stack\n"
-   ]
+   "source": "import importlib\nimportlib.reload(mooring_rodb)\n# Flag bad data to convert from P to D\ndata_dir = Path(\"..\", \"data\")\nfiles = list(data_dir.glob(\"OS_wb2_9_201114_*_P.nc\"))\n\nds_list_OS = readers.load_dataset(files)\n\nds_stack = mooring_rodb.combine_mooring_OS(ds_list_OS)\nds_stack = tools.calc_psal(ds_stack)\n\nds_stack"
  },
  {
   "cell_type": "code",
@@ -529,33 +505,7 @@
    "id": "b16cbb1d",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "import matplotlib.pyplot as plt\n",
-    "\n",
-    "for i, (idx_leq, *rest) in enumerate(depth_indices):\n",
-    "    plt.figure(figsize=(10, 4))\n",
-    "    # Main index (black)\n",
-    "    main_data = ds_stack.CNDC[:, i].values\n",
-    "    plt.plot(ds_stack.TIME, tools.normalize_by_middle_percent(main_data, percent=95), color='k', label=f'DEPTH={depths[i]}m (main)')\n",
-    "\n",
-    "    # Next shallower (red)\n",
-    "    if idx_leq is not None:\n",
-    "        shallower_data = ds_stack.CNDC[:, idx_leq].values\n",
-    "        plt.plot(ds_stack.TIME, tools.normalize_by_middle_percent(shallower_data, percent=95), color='r', label=f'DEPTH={depths[idx_leq]}m (shallower)')\n",
-    "\n",
-    "    # Next deeper (blue)\n",
-    "    if rest and rest[0] is not None:\n",
-    "        idx_gt = rest[0]\n",
-    "        deeper_data = ds_stack.CNDC[:, idx_gt].values\n",
-    "        plt.plot(ds_stack.TIME, tools.normalize_by_middle_percent(deeper_data, percent=95), color='b', label=f'DEPTH={depths[idx_gt]}m (deeper)')\n",
-    "\n",
-    "    plt.title(f'CNDC at DEPTH={depths[i]}m and neighbors (normalized)')\n",
-    "    plt.xlabel('Time')\n",
-    "    plt.ylabel('Normalized CNDC')\n",
-    "    plt.legend()\n",
-    "    plt.tight_layout()\n",
-    "    plt.show()\n"
-   ]
+   "source": "import matplotlib.pyplot as plt\n\nfor i, (idx_leq, *rest) in enumerate(depth_indices):\n    plt.figure(figsize=(10, 4))\n    # Main index (black)\n    main_data = ds_stack.CNDC[:, i].values\n    plt.plot(ds_stack.TIME, process_rodb.normalize_by_middle_percent(main_data, percent=95), color='k', label=f'DEPTH={depths[i]}m (main)')\n\n    # Next shallower (red)\n    if idx_leq is not None:\n        shallower_data = ds_stack.CNDC[:, idx_leq].values\n        plt.plot(ds_stack.TIME, process_rodb.normalize_by_middle_percent(shallower_data, percent=95), color='r', label=f'DEPTH={depths[idx_leq]}m (shallower)')\n\n    # Next deeper (blue)\n    if rest and rest[0] is not None:\n        idx_gt = rest[0]\n        deeper_data = ds_stack.CNDC[:, idx_gt].values\n        plt.plot(ds_stack.TIME, process_rodb.normalize_by_middle_percent(deeper_data, percent=95), color='b', label=f'DEPTH={depths[idx_gt]}m (deeper)')\n\n    plt.title(f'CNDC at DEPTH={depths[i]}m and neighbors (normalized)')\n    plt.xlabel('Time')\n    plt.ylabel('Normalized CNDC')\n    plt.legend()\n    plt.tight_layout()\n    plt.show()"
  },
  {
   "cell_type": "code",
diff --git a/notebooks/demo_check_clock.ipynb b/notebooks/demo_check_clock.ipynb
index dccb1ad..3cd243e 100644
--- a/notebooks/demo_check_clock.ipynb
+++ b/notebooks/demo_check_clock.ipynb
@@ -4,35 +4,7 @@
   "cell_type": "markdown",
   "id": "71edb016",
   "metadata": {},
-  "source": [
-   "## Demo: Clock check - for offsets in instrument clocks\n",
-   "\n",
-   "This is an intermediate step between stage1 and stage 2. We are trying to determine whether the timestamps for any of the instruments on the same mooring are incorrect. This is slightly faulty because they could *all* be wrong, unless we are comparing against UTC or have more exact timing knowledge. For more exact timing knowledge, the deployment time and recovery time (anchor release, either dropping from the ship or release from the seabed) have been added to the yaml file in UTC. This can be compared against the times estimated through lag correlations.\n",
-   "\n",
-   "### This notebook \n",
-   "\n",
-   "**It does not change anything in the data files.** You run this notebook in order to update the field `clock_offset` (in seconds) in the YAML file for each instrument on a mooring. This is normally due to the instruments being set up incorrectly (i.e., with a clock time that did not match UTC).\n",
-   "\n",
-   "After determining the appropriate clock offset, then run the stage2 processing to apply the clock offset to the netCDF files for each instrument.\n",
-   "\n",
-   "Then, running this notebook again using the stage2 files (`*_use.nc`) should predict no additional clock offsets.\n",
-   "\n",
-   "Clock offset is in integer seconds ADDED to the original instrument time. I.e., shifts the record later.\n",
-   "\n",
-   "### Main check\n",
-   "\n",
-   "- We look at when--according to the instrument clocks--the `temperature` values are cold. This assumes that in the middle of the record, the temperatures are colder than the near-surface temperatures (may fail for polar deployments). Cold is within the mean +- 3 * std of the deep values.\n",
-   "\n",
-   "- Then check when the instrument first reads a temperature within those bounds: `start_time`\n",
-   "- And check when the instrument last reads a temperature within those bounds: `end_time`\n",
-   "\n",
-   "Check whether the first timestamp within the cold water for that instrument is similar in time to the first timestamp for another instrument. This should be reasonably good at getting large offsets in clocks.\n",
-   "\n",
-   "### Secondary check\n",
-   "\n",
-   "- We interpolate data onto a common time grid (rough and ready)\n",
-   "- Check for lag correlation between instruments, and use this to estimate an offset"
-  ]
+  "source": "## Demo: Clock check - for offsets in instrument clocks\n\n**Note: This is the original file by the user. See `demo_clock_offset.ipynb` for a refactored version by Claude.**\n\nThis is an intermediate step between Stage 1 and Stage 2. We are trying to determine whether the timestamps for any of the instruments on the same mooring are incorrect. This is slightly faulty because they could *all* be wrong, unless we are comparing against UTC or have more exact timing knowledge. For more exact timing knowledge, the deployment time and recovery time (anchor release, either dropping from the ship or release from the seabed) have been added to the yaml file in UTC. This can be compared against the times estimated through lag correlations.\n\n### This notebook\n\n**It does not change anything in the data files.** You run this notebook in order to update the field `clock_offset` (in seconds) in the YAML file for each instrument on a mooring. This is normally due to the instruments being set up incorrectly (i.e., with a clock time that did not match UTC).\n\nAfter determining the appropriate clock offset, run the stage2 processing to apply the clock offset to the netCDF files for each instrument.\n\nThen, running this notebook again using the stage2 files (`*_use.nc`) should predict no additional clock offsets.\n\nThe clock offset is in integer seconds ADDED to the original instrument time, i.e., it shifts the record later.\n\n### Main check\n\n- We look at when, according to the instrument clocks, the `temperature` values are cold. This assumes that in the middle of the record, the temperatures are colder than the near-surface temperatures (this may fail for polar deployments). Cold is within the mean ± 3 * std of the deep values.\n\n- Then check when the instrument first reads a temperature within those bounds: `start_time`\n- And check when the instrument last reads a temperature within those bounds: `end_time`\n\nCheck whether the first timestamp within the cold water for that instrument is similar in time to the first timestamp for another instrument. This should be reasonably good at catching large offsets in clocks.\n\n### Secondary check\n\n- We interpolate data onto a common time grid (rough and ready)\n- Check for lag correlation between instruments, and use this to estimate an offset"
 },
 {
  "cell_type": "code",
diff --git a/notebooks/demo_climatology.ipynb b/notebooks/demo_climatology.ipynb
index af15e56..913f24a 100644
--- a/notebooks/demo_climatology.ipynb
+++ b/notebooks/demo_climatology.ipynb
@@ -16,14 +16,7 @@
    "id": "6a1920f3",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "from pathlib import Path\n",
-    "import numpy as np\n",
-    "import xarray as xr\n",
-    "import numpy as np\n",
-    "import matplotlib.pyplot as plt\n",
-    "from oceanarray import readers, plotters, tools, convertOS, writers, mooring, rapid_interp\n"
-   ]
+   "source": "from pathlib import Path\nimport numpy as np\nimport xarray as xr\nfrom oceanarray import readers, plotters, tools, convertOS, writers, mooring_rodb, rapid_interp\nimport pandas as pd"
  },
  {
   "cell_type": "markdown",
diff --git a/notebooks/demo_clock_offset.ipynb b/notebooks/demo_clock_offset.ipynb
index 861d2f4..ca620b1 100644
--- a/notebooks/demo_clock_offset.ipynb
+++ b/notebooks/demo_clock_offset.ipynb
@@ -4,21 +4,7 @@
   "cell_type": "markdown",
   "id": "streamlined-demo",
   "metadata": {},
-  "source": [
-   "# Demo: Streamlined Clock Offset Analysis\n",
-   "\n",
-   "This notebook provides a streamlined version of clock offset analysis for oceanographic instruments.\n",
-   "It uses the new `oceanarray.clock_offset` module for cleaner, more maintainable code.\n",
-   "\n",
-   "## Purpose\n",
-   "\n",
-   "This notebook helps determine whether instrument timestamps are incorrect by:\n",
-   "1. Analyzing deployment timing based on temperature profiles\n",
-   "2. Performing lag correlation analysis between instruments\n",
-   "3. Calculating recommended clock offset corrections\n",
-   "\n",
-   "**Note:** This notebook does not modify data files. It only analyzes and suggests clock_offset values for the YAML configuration.\n"
-  ]
+  "source": "# Demo: Streamlined Clock Offset Analysis\n\n**Note: This is a refactored version by Claude. See `demo_check_clock.ipynb` for the original user file.**\n\nThis notebook provides a streamlined version of clock offset analysis for oceanographic instruments.\nIt uses the new `oceanarray.clock_offset` module for cleaner, more maintainable code.\n\n## Purpose\n\nThis notebook helps determine whether instrument timestamps are incorrect by:\n1. Analyzing deployment timing based on temperature profiles\n2. Performing lag correlation analysis between instruments\n3. Calculating recommended clock offset corrections\n\n**Note:** This notebook does not modify data files. It only analyzes and suggests clock_offset values for the YAML configuration."
 },
 {
  "cell_type": "code",
diff --git a/notebooks/demo_instrument.ipynb b/notebooks/demo_instrument.ipynb
index bf66bd2..b8d3d7a 100644
--- a/notebooks/demo_instrument.ipynb
+++ b/notebooks/demo_instrument.ipynb
@@ -4,11 +4,7 @@
   "cell_type": "markdown",
   "id": "c6a29764-f39c-431c-8e77-fbc6bfe20f01",
   "metadata": {},
-  "source": [
-   "# Demo: instrument-level processing (Stage 1 and Stage 2)\n",
-   "\n",
-   "This notebook walks through the instrument-level processing in the oceanarray code.\n"
-  ]
+  "source": "# Demo: Instrument-Level Processing (Compact Workflow)\n\nThis notebook walks through the complete instrument-level processing pipeline in the oceanarray codebase, from raw files to science-ready datasets. It demonstrates the same processing steps as `demo_stage1.ipynb` and `demo_stage2.ipynb` but in a more compact, streamlined format.\n\n## Processing Overview\n\n### Stage 1: Format Conversion (`*_raw.nc`)\n- **Purpose**: Convert raw instrument files to standardized NetCDF format\n- **Input**: Raw instrument files (`.cnv`, `.rsk`, `.dat`, `.mat`)\n- **Output**: Standardized NetCDF files (`*_raw.nc`)\n- **Processing**: Uses `oceanarray.stage1.MooringProcessor` - same as `demo_stage1.ipynb`\n\n### Stage 2: Temporal Corrections & Trimming (`*_use.nc`)\n- **Purpose**: Apply clock corrections and trim to deployment periods\n- **Input**: Stage1 files (`*_raw.nc`) + updated YAML with clock offsets\n- **Output**: Time-corrected files (`*_use.nc`)\n- **Processing**: Uses `oceanarray.stage2.process_multiple_moorings_stage2` - same as `demo_stage2.ipynb`\n\n### Stage 3: Calibrations & Corrections (Optional)\n- **Purpose**: Apply sensor-specific calibrations and corrections\n- **Status**: Commented-out sections showing how to apply additional calibrations\n\n### Stage 4: Format Conversion (Optional)\n- **Purpose**: Convert to OceanSites or other standardized formats\n- **Status**: Commented-out sections for format conversion\n\n## Key Features\n\n- **Compact Format**: Covers the same ground as the separate stage notebooks in one place\n- **Instrument-Level Processing**: Each instrument processed independently before mooring-level coordination\n- **Multiple Instrument Types**: Handles various instrument types with analysis functions\n- **Visualization**: Includes plotting and analysis of processed results\n- **Metadata Management**: YAML configuration files drive processing parameters\n\n## Comparison with Other Notebooks\n\n- **vs demo_stage1.ipynb**: Same Stage1 processing but more concise\n- **vs demo_stage2.ipynb**: Same Stage2 processing but integrated workflow\n- **vs demo_step1.ipynb**: Focuses on individual instruments rather than mooring-level time gridding\n\nChoose this notebook if you want a complete instrument processing workflow in one place, or use the separate stage notebooks for more detailed exploration of each processing step.\n\nVersion: 1.0\nDate: 2025-01-15"
 },
 {
  "cell_type": "code",
@@ -227,10 +223,7 @@
   "id": "60fab40c",
   "metadata": {},
   "outputs": [],
-  "source": [
-   "#ds_cal = instrument.apply_microcat_calibration_from_txt(data_dir / 'wb1_12_2015_005.microcat.txt', data_dir / 'wb1_12_2015_6123.use')\n",
-   "#ds_cal\n"
-  ]
+  "source": "#ds_cal = process_rodb.apply_microcat_calibration_from_txt(data_dir / 'wb1_12_2015_005.microcat.txt', data_dir / 'wb1_12_2015_6123.use')\n#ds_cal"
 },
 {
  "cell_type": "code",
diff --git a/notebooks/demo_instrument_rdb.ipynb b/notebooks/demo_instrument_rdb.ipynb
index 3a68f30..1337011 100644
--- a/notebooks/demo_instrument_rdb.ipynb
+++ b/notebooks/demo_instrument_rdb.ipynb
@@ -7,7 +7,7 @@
   "source": [
    "# Demo: instrument-level processing\n",
    "\n",
-   "This notebook walks through the instrument-level processing in the oceanarray code.\n"
+   "This notebook walks through the instrument-level processing in the oceanarray code **based on RAPID data formats**. This is a test to verify the initial steps. The current workflow instead uses demo_stage1.ipynb (load data) and demo_stage2.ipynb (raw to use).\n"
   ]
 },
 {
@@ -16,14 +16,7 @@
   "id": "6a1920f3",
   "metadata": {},
   "outputs": [],
-  "source": [
-   "from pathlib import Path\n",
-   "import numpy as np\n",
-   "import xarray as xr\n",
-   "import numpy as np\n",
-   "from oceanarray import readers, rodb, instrument, plotters, tools, writers, convertOS\n",
-   "import pandas as pd\n"
-  ]
+  "source": "from pathlib import Path\nimport numpy as np\nimport xarray as xr\nfrom oceanarray import readers, rodb, process_rodb, plotters, tools, writers, convertOS\nimport pandas as pd"
 },
 {
  "cell_type": "markdown",
@@ -65,16 +58,7 @@
   "id": "ed615f11",
   "metadata": {},
   "outputs": [],
-  "source": [
-   "\n",
-   "ds2, dstart, dend = instrument.stage2_trim(ds)\n",
-   "\n",
-   "print(\"Deployment start:\", dstart)\n",
-   "print(\"Deployment end:\", dend)\n",
-   "\n",
-   "fig = plotters.plot_microcat(ds2)\n",
-   "\n"
-  ]
+  "source": "ds2, dstart, dend = process_rodb.stage2_trim(ds)\n\nprint(\"Deployment start:\", dstart)\nprint(\"Deployment end:\", dend)\n\nfig = plotters.plot_microcat(ds2)\n"
 },
 {
  "cell_type": "code",
@@ -82,10 +66,7 @@
   "id": "6b1ecfad",
   "metadata": {},
   "outputs": [],
-  "source": [
-   "dstart, dend = instrument.trim_suggestion(ds)\n",
-   "fig, ax = plotters.plot_trim_windows(ds, dstart, dend)\n"
-  ]
+  "source": "dstart, dend = process_rodb.trim_suggestion(ds)\nfig, ax = plotters.plot_trim_windows(ds, dstart, dend)"
 },
 {
  "cell_type": "code",
@@ -93,17 +74,7 @@
   "id": "660769c4",
   "metadata": {},
   "outputs": [],
-  "source": [
-   "dstart = np.datetime64('2015-11-30T19:00:00')\n",
-   "dend = np.datetime64('2017-03-28T14:00:00')\n",
-   "\n",
-   "ds2, dstart, dend = instrument.stage2_trim(ds, deployment_start=dstart, deployment_end=dend)\n",
-   "\n",
-   "print(\"Deployment start:\", dstart)\n",
-   "print(\"Deployment end:\", dend)\n",
-   "\n",
-   "fig = plotters.plot_microcat(ds2)"
-  ]
+  "source": "dstart = np.datetime64('2015-11-30T19:00:00')\ndend = np.datetime64('2017-03-28T14:00:00')\n\nds2, dstart, dend = process_rodb.stage2_trim(ds, deployment_start=dstart, deployment_end=dend)\n\nprint(\"Deployment start:\", dstart)\nprint(\"Deployment end:\", dend)\n\nfig = plotters.plot_microcat(ds2)"
 },
 {
  "cell_type": "markdown",
@@ -119,10 +90,7 @@
   "id": "60fab40c",
   "metadata": {},
   "outputs": [],
-  "source": [
-   "ds_cal = instrument.apply_microcat_calibration_from_txt(data_dir / 'wb1_12_2015_005.microcat.txt', data_dir / 'wb1_12_2015_6123.use')\n",
-   "ds_cal\n"
-  ]
+  "source": "ds_cal = process_rodb.apply_microcat_calibration_from_txt(data_dir / 'wb1_12_2015_005.microcat.txt', data_dir / 'wb1_12_2015_6123.use')\nds_cal"
 },
 {
  "cell_type": "code",
diff --git a/notebooks/demo_mooring_rdb.ipynb b/notebooks/demo_mooring_rdb.ipynb
index 5f5eb5e..e58957d 100644
--- a/notebooks/demo_mooring_rdb.ipynb
+++ b/notebooks/demo_mooring_rdb.ipynb
@@ -16,16 +16,7 @@
   "id": "6a1920f3",
   "metadata": {},
   "outputs": [],
-  "source": [
-   "from pathlib import Path\n",
-   "import numpy as np\n",
-   "import gsw\n",
-   "import xarray as xr\n",
-   "import numpy as np\n",
-   "import matplotlib.pyplot as plt\n",
-   "from oceanarray import readers, plotters, tools, convertOS, writers, mooring\n",
-   "from oceanarray import rapid_interp\n"
-  ]
+  "source": "from pathlib import Path\nimport numpy as np\nimport xarray as xr\nfrom oceanarray import readers, plotters, tools, convertOS, writers, mooring_rodb\nimport pandas as pd"
 },
 {
  "cell_type": "markdown",
@@ -45,17 +36,7 @@
   "id": "ce860d75",
   "metadata": {},
   "outputs": [],
-  "source": [
-   "data_dir = Path(\"..\", \"data\")\n",
-   "files = list(data_dir.glob(\"OS_wb2_9_201114_P.nc\"))\n",
-   "print(files)\n",
-   "ds_stack = xr.open_dataset(files[0])\n",
-   "\n",
-   "#ds_list_OS = readers.load_dataset(files)\n",
-   "\n",
-   "#ds_stack = mooring.combine_mooring_OS(ds_list_OS)\n",
-   "#ds_stack"
-  ]
+  "source": "data_dir = Path(\"..\", \"data\")\nfiles = list(data_dir.glob(\"OS_wb2_9_201114_P.nc\"))\nprint(files)\nds_stack = xr.open_dataset(files[0])\n\n#ds_list_OS = readers.load_dataset(files)\n\n#ds_stack = mooring_rodb.combine_mooring_OS(ds_list_OS)\n#ds_stack"
 },
 {
  "cell_type": "markdown",
@@ -77,11 +58,7 @@
   "id": "b824bf8c",
   "metadata": {},
   "outputs": [],
-  "source": [
-   "ds_filt = mooring.filter_all_time_vars(ds_stack)\n",
-   "ds_12h = mooring.interp_to_12hour_grid(ds_filt)\n",
-   "ds_12h"
-  ]
+  "source": "ds_filt = mooring_rodb.filter_all_time_vars(ds_stack)\nds_12h = mooring_rodb.interp_to_12hour_grid(ds_filt)\nds_12h"
 },
 {
  "cell_type": "code",
diff --git a/notebooks/demo_stage1.ipynb b/notebooks/demo_stage1.ipynb
index ee2288c..6053357 100644
--- a/notebooks/demo_stage1.ipynb
+++ b/notebooks/demo_stage1.ipynb
@@ -4,13 +4,7 @@
   "cell_type": "markdown",
   "id": "71edb016",
   "metadata": {},
-  "source": [
-   "## Demo: Stage1 processing for mooring data\n",
-   "\n",
-   "Read the original raw files and convert to netCDF. None to minimal additional processing.\n",
-   "\n",
-   "This notebook demonstrates the usage of the refactored `oceanarray.stage1_mooring` module."
-  ]
+  "source": "## Demo: Stage1 processing for mooring data\n\n**Stage 1 Overview**: This is the first processing stage that converts raw instrument files into standardized CF-compliant NetCDF format. It handles multiple instrument types using the `ctd_tools` library for reading and the oceanarray framework for metadata management.\n\n### What Stage1 Does:\n- **File Conversion**: Reads raw instrument files (SeaBird, RBR, Nortek, etc.) and converts to NetCDF\n- **Standardization**: Applies CF conventions for variable names, units, and metadata\n- **Format Preservation**: Preserves original data values without modification or filtering\n- **Metadata Enrichment**: Adds deployment information from YAML configuration files\n- **Organization**: Outputs files organized by instrument type in the processed directory\n\n### Input Files:\n- Raw instrument data files (various formats: `.cnv`, `.rsk`, `.dat`, `.mat`)\n- YAML configuration files specifying mooring and instrument metadata\n\n### Output Files:\n- Standardized NetCDF files: `{mooring}_{serial}_raw.nc`\n- Processing log files for debugging and quality assurance\n\n### Processing Flow:\n1. Raw files → ctd_tools readers → xarray.Dataset\n2. Apply CF conventions and metadata from YAML\n3. Preserve all original data values unchanged\n4. Save as NetCDF with standardized naming\n\n**Note**: Stage1 focuses purely on format conversion with no data modification. All processing (filtering, clock corrections, quality control, trimming) happens in Stage2 and subsequent stages.\n\nThis notebook demonstrates the usage of the refactored `oceanarray.stage1` module."
 },
 {
  "cell_type": "code",
diff --git a/notebooks/demo_stage2.ipynb b/notebooks/demo_stage2.ipynb
index afb8836..39f771b 100644
--- a/notebooks/demo_stage2.ipynb
+++ b/notebooks/demo_stage2.ipynb
@@ -4,12 +4,7 @@
   "cell_type": "markdown",
   "id": "71edb016",
   "metadata": {},
-  "source": [
-   "## Demo: Stage2 processing for mooring data\n",
-   "\n",
-   "- Apply clock offsets\n",
-   "- Trim to deployment period"
-  ]
+  "source": "## Demo: Stage2 processing for mooring data\n\n**Stage 2 Overview**: This is the second processing stage that applies temporal corrections and deployment trimming to Stage1 NetCDF files. It focuses solely on time coordinate modifications and data trimming without changing the actual data values.\n\n### What Stage2 Does:\n- **Clock Offset Corrections**: Applies time corrections specified in YAML configuration files to fix instrument clock errors\n- **Deployment Trimming**: Crops data to the actual deployment period (removes pre/post deployment data)\n- **Time Coordinate Only**: Only modifies the time coordinate - all data values remain unchanged\n- **Metadata Updates**: Updates processing history and adds Stage2-specific attributes\n- **File Renaming**: Converts `*_raw.nc` files to `*_use.nc` files indicating they're ready for analysis\n\n### Input Files:\n- Stage1 NetCDF files: `{mooring}_{serial}_raw.nc`\n- Updated YAML configuration files with `clock_offset` values (determined from clock offset analysis)\n- Deployment timing information in YAML files\n\n### Output Files:\n- Time-corrected NetCDF files: `{mooring}_{serial}_use.nc`\n- Processing logs documenting applied corrections\n\n### Processing Flow:\n1. Load Stage1 (`*_raw.nc`) files\n2. Read clock offset values from YAML configuration\n3. Apply temporal corrections to time coordinate only\n4. Trim data to deployment period (between deployment and recovery times)\n5. Update metadata with processing history\n6. Save as Stage2 (`*_use.nc`) files\n\n### Clock Offset Workflow:\n1. **Analyze**: Use `demo_clock_offset.ipynb` (which uses `oceanarray.clock_offset.py`) to determine the timing corrections needed\n2. **Update YAML**: Manually add `clock_offset` values (in seconds) to instrument configurations\n3. **Process Stage2**: Apply corrections and create `*_use.nc` files\n4. **Verify**: Re-run clock offset analysis on `*_use.nc` files to confirm corrections\n\n**Note**: Stage2 only modifies time coordinates and trims data - no quality control or data value modifications are performed. The `*_use.nc` files represent individual instruments with corrected timing, trimmed to deployment periods.\n\nThis notebook demonstrates the usage of the `oceanarray.stage2` module."
 },
 {
  "cell_type": "code",
@@ -52,6 +47,47 @@
   "basedir = '/Users/eddifying/Dropbox/data/ifmro_mixsed/ds_data_eleanor/'\n",
   "results = process_multiple_moorings_stage2(moorlist, basedir)"
  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "id": "5bc08fb9",
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "dir1 = '/Users/eddifying/Dropbox/data/ifmro_mixsed/ds_data_eleanor/moor/proc/dsE_1_2018/microcat'\n",
+   "fname = 'dsE_1_2018_7518_use.nc'\n",
+   "from pathlib import Path\n",
+   "data1 = xr.open_dataset(Path(dir1) / fname)"
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "id": "61f8cb88",
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "data1"
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "id": "0c024c1c",
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "data1.serial_number.values"
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "id": "3ca0c0bc",
+  "metadata": {},
+  "outputs": [],
+  "source": []
+ }
 ],
 "metadata": {
diff --git a/notebooks/demo_step1.ipynb b/notebooks/demo_step1.ipynb
index 0344576..928cc59 100644
--- a/notebooks/demo_step1.ipynb
+++ b/notebooks/demo_step1.ipynb
@@ -58,19 +58,7 @@
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-  "source": [
-   "# Set your data paths here\n",
-   "basedir = '/Users/eddifying/Dropbox/data/ifmro_mixsed/ds_data_eleanor/'\n",
-   "mooring_name = 'dsE_1_2018'\n",
-   "\n",
-   "# Construct paths\n",
-   "proc_dir = Path(basedir) / 'moor' / 'proc' / mooring_name\n",
-   "config_file = proc_dir / f\"{mooring_name}.mooring.yaml\"\n",
-   "\n",
-   "print(f\"Processing directory: {proc_dir}\")\n",
-   "print(f\"Configuration file: {config_file}\")\n",
-   "print(f\"Config exists: {config_file.exists()}\")"
-  ]
+  "source": "# Set your data paths here\nbasedir = '/Users/eddifying/Dropbox/data/ifmro_mixsed/ds_data_eleanor/'\nmooring_name = 'dsE_1_2018'\n\n# Set file suffix for processing\nfile_suffix = '_use'  # Change from '_raw' to '_use'\n\n# Construct paths\nproc_dir = Path(basedir) / 'moor' / 'proc' / mooring_name\nconfig_file = proc_dir / f\"{mooring_name}.mooring.yaml\"\n\nprint(f\"Processing directory: {proc_dir}\")\nprint(f\"Configuration file: {config_file}\")\nprint(f\"File suffix: {file_suffix}\")\nprint(f\"Config exists: {config_file.exists()}\")"
 },
 {
  "cell_type": "code",
@@ -111,76 +99,7 @@
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-  "source": [
-   "# Find and examine individual instrument files\n",
-   "file_suffix = \"_use\"\n",
-   "instrument_files = []\n",
-   "instrument_datasets = []\n",
-   "rows = []\n",
-   "\n",
-   "if config_file.exists():\n",
-   "    for inst_config in config.get(\"instruments\", []):\n",
-   "        instrument_type = inst_config.get(\"instrument\", \"unknown\")\n",
-   "        serial = inst_config.get(\"serial\", 0)\n",
-   "        depth = inst_config.get(\"depth\", 0)\n",
-   "\n",
-   "        # Look for the file\n",
-   "        filename = f\"{mooring_name}_{serial}{file_suffix}.nc\"\n",
-   "        filepath = proc_dir / instrument_type / filename\n",
-   "\n",
-   "        if filepath.exists():\n",
-   "            ds = xr.open_dataset(filepath)\n",
-   "            instrument_files.append(filepath)\n",
-   "            instrument_datasets.append(ds)\n",
-   "\n",
-   "            # Time coverage\n",
-   "            t0, t1 = ds.time.values[0], ds.time.values[-1]\n",
-   "            npoints = len(ds.time)\n",
-   "\n",
-   "            # Median sampling interval\n",
-   "            time_diff = np.diff(ds.time.values) / np.timedelta64(1, \"m\")  # in minutes\n",
-   "            median_interval = np.nanmedian(time_diff)\n",
-   "            if median_interval > 1:\n",
-   "                sampling = f\"{median_interval:.1f} min\"\n",
-   "            else:\n",
-   "                sampling = f\"{median_interval*60:.1f} sec\"\n",
-   "\n",
-   "            # Collect a row for the table\n",
-   "            rows.append(\n",
-   "                {\n",
-   "                    \"Instrument\": instrument_type,\n",
-   "                    \"Serial\": serial,\n",
-   "                    \"Depth [m]\": depth,\n",
-   "                    \"File\": filepath.name,\n",
-   "                    \"Start\": str(t0)[:19],\n",
-   "                    \"End\": str(t1)[:19],\n",
-   "                    \"Points\": npoints,\n",
-   "                    \"Sampling\": sampling,\n",
-   "                    \"Variables\": \", \".join(list(ds.data_vars)),\n",
-   "                }\n",
-   "            )\n",
-   "        else:\n",
-   "            rows.append(\n",
-   "                {\n",
-   "                    \"Instrument\": instrument_type,\n",
-   "                    \"Serial\": serial,\n",
-   "                    \"Depth [m]\": depth,\n",
-   "                    \"File\": \"MISSING\",\n",
-   "                    \"Start\": \"\",\n",
-   "                    \"End\": \"\",\n",
-   "                    \"Points\": 0,\n",
-   "                    \"Sampling\": \"\",\n",
-   "                    \"Variables\": \"\",\n",
-   "                }\n",
-   "            )\n",
-   "\n",
-   "    # Make a DataFrame summary\n",
-   "    summary = pd.DataFrame(rows)\n",
-   "    pd.set_option(\"display.max_colwidth\", 80)  # allow long var lists\n",
-   "    print(summary.to_markdown(index=False))\n",
-   "\n",
-   "    print(f\"\\nFound {len(instrument_datasets)} instrument datasets\")\n"
-  ]
+  "source": "# Find and examine individual instrument files\ninstrument_files = []\ninstrument_datasets = []\nrows = []\n\nif config_file.exists():\n    for inst_config in config.get(\"instruments\", []):\n        instrument_type = inst_config.get(\"instrument\", \"unknown\")\n        serial = inst_config.get(\"serial\", 0)\n        depth = inst_config.get(\"depth\", 0)\n\n        # Look for the file\n        filename = f\"{mooring_name}_{serial}{file_suffix}.nc\"\n        filepath = proc_dir / instrument_type / filename\n\n        if filepath.exists():\n            ds = xr.open_dataset(filepath)\n            instrument_files.append(filepath)\n            instrument_datasets.append(ds)\n\n            # Time coverage\n            t0, t1 = ds.time.values[0], ds.time.values[-1]\n            npoints = len(ds.time)\n\n            # Median sampling interval\n            time_diff = np.diff(ds.time.values) / np.timedelta64(1, \"m\")  # in minutes\n            median_interval = np.nanmedian(time_diff)\n            if median_interval > 1:\n                sampling = f\"{median_interval:.1f} min\"\n            else:\n                sampling = f\"{median_interval*60:.1f} sec\"\n\n            # Collect a row for the table\n            rows.append(\n                {\n                    \"Instrument\": instrument_type,\n                    \"Serial\": serial,\n                    \"Depth [m]\": depth,\n                    \"File\": filepath.name,\n                    \"Start\": str(t0)[:19],\n                    \"End\": str(t1)[:19],\n                    \"Points\": npoints,\n                    \"Sampling\": sampling,\n                    \"Variables\": \", \".join(list(ds.data_vars)),\n                }\n            )\n        else:\n            rows.append(\n                {\n                    \"Instrument\": instrument_type,\n                    \"Serial\": serial,\n                    \"Depth [m]\": depth,\n                    \"File\": \"MISSING\",\n                    \"Start\": \"\",\n                    \"End\": \"\",\n                    \"Points\": 0,\n                    \"Sampling\": \"\",\n                    \"Variables\": \"\",\n                }\n            )\n\n    # Make a DataFrame summary\n    summary = pd.DataFrame(rows)\n    pd.set_option(\"display.max_colwidth\", 80)  # allow long var lists\n    print(summary.to_markdown(index=False))\n\n    print(f\"\\nFound {len(instrument_datasets)} instrument datasets\")"
 },
 {
  "cell_type": "markdown",
@@ -196,33 +115,14 @@
  "execution_count": null,
  "metadata": {},
  "outputs": [],
-  "source": [
-   "# Process without filtering\n",
-   "print(\"Processing mooring with time gridding only (no filtering)...\")\n",
-   "print(\"=\"*60)\n",
-   "\n",
-   "result = time_gridding_mooring(mooring_name, basedir, file_suffix='_use')\n",
-   "\n",
-   "print(f\"\\nProcessing result: {'SUCCESS' if result else 'FAILED'}\")"
-  ]
+  "source": "# Process without filtering\nprint(\"Processing mooring with time gridding only (no filtering)...\")\nprint(\"=\"*60)\n\nresult = time_gridding_mooring(mooring_name, basedir, file_suffix=file_suffix)\n\nprint(f\"\\nProcessing result: {'SUCCESS' if result else 'FAILED'}\")"
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
-  "source": [
-   "# Load and examine the combined dataset\n",
-   "output_file = proc_dir / f\"{mooring_name}_mooring_use.nc\"\n",
-   "\n",
-   "if output_file.exists():\n",
-   "    print(f\"Output file exists: {output_file}\")\n",
-   "\n",
-   "    # Load the combined dataset\n",
-   "    combined_ds = xr.open_dataset(output_file)\n",
-   "else:\n",
-   "    print(\"Output file not found - processing may have failed\")"
-  ]
+  "source": "# Load and examine the combined dataset\noutput_file = proc_dir / f\"{mooring_name}_mooring{file_suffix}.nc\"\n\nif output_file.exists():\n    print(f\"Output file exists: {output_file}\")\n\n    # Load the combined dataset\n    combined_ds = xr.open_dataset(output_file)\nelse:\n    print(\"Output file not found - processing may have failed\")"
 },
 {
  "cell_type": "markdown",
@@ -360,55 +260,14 @@
  "execution_count": null,
  "metadata": {},
  "outputs": [],
-  "source": [
-   "# Process with RAPID-style low-pass filtering\n",
-   "print(\"Processing mooring with 2-day low-pass filtering (RAPID-style)...\")\n",
-   "print(\"=\"*60)\n",
-   "print(\"IMPORTANT: Filtering is applied to each instrument on its native time grid\")\n",
-   "print(\"BEFORE interpolation to preserve data integrity.\")\n",
-   "print()\n",
-   "\n",
-   "filter_params = {\n",
-   "    'cutoff_days': 2.0,  # 2-day cutoff\n",
-   "    'order': 6           # 6th order Butterworth\n",
-   "}\n",
-   "\n",
-   "result_filtered = time_gridding_mooring(\n",
-   "    mooring_name, basedir,\n",
-   "    file_suffix='_use',\n",
-   "    filter_type='lowpass',\n",
-   "    filter_params=filter_params\n",
-   ")\n",
-   "\n",
-   "print(f\"\\nFiltered processing result: {'SUCCESS' if result_filtered else 'FAILED'}\")"
-  ]
+  "source": "# Process with RAPID-style low-pass filtering\nprint(\"Processing mooring with 2-day low-pass filtering (RAPID-style)...\")\nprint(\"=\"*60)\nprint(\"IMPORTANT: Filtering is applied to each instrument on its native time grid\")\nprint(\"BEFORE interpolation to preserve data integrity.\")\nprint()\n\nfilter_params = {\n    'cutoff_days': 2.0,  # 2-day cutoff\n    'order': 6           # 6th order Butterworth\n}\n\nresult_filtered = time_gridding_mooring(\n    mooring_name, basedir,\n    file_suffix=file_suffix,\n    filter_type='lowpass',\n    filter_params=filter_params\n)\n\nprint(f\"\\nFiltered processing result: {'SUCCESS' if result_filtered else 'FAILED'}\")"
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
-  "source": [
-   "# Load the filtered dataset\n",
-   "filtered_output_file = proc_dir / f\"{mooring_name}_mooring_use_lowpass.nc\"\n",
-   "\n",
-   "if filtered_output_file.exists():\n",
-   "    print(f\"Filtered output file created: {filtered_output_file}\")\n",
-   "\n",
-   "    # Load the filtered dataset\n",
-   "    filtered_ds = xr.open_dataset(filtered_output_file)\n",
-   "\n",
-   "    print(\"\\nFiltered Dataset Attributes:\")\n",
-   "    filter_attrs = {k: v for k, v in filtered_ds.attrs.items()\n",
-   "                    if 'filter' in k.lower()}\n",
-   "    for key, value in filter_attrs.items():\n",
-   "        print(f\"  {key}: {value}\")\n",
-   "\n",
-   "    print(f\"\\nDataset shape: {dict(filtered_ds.dims)}\")\n",
-   "else:\n",
-   "    print(\"Filtered output file not found\")\n",
-   "    filtered_ds = None"
-  ]
print(\"\\nFiltered Dataset Attributes:\")\n filter_attrs = {k: v for k, v in filtered_ds.attrs.items()\n if 'filter' in k.lower()}\n for key, value in filter_attrs.items():\n print(f\" {key}: {value}\")\n\n print(f\"\\nDataset shape: {dict(filtered_ds.dims)}\")\nelse:\n print(\"Filtered output file not found\")\n filtered_ds = None" }, { "cell_type": "markdown", diff --git a/notebooks/dsE_1_2018_microcat_7518_temperature.png b/notebooks/dsE_1_2018_microcat_7518_temperature.png new file mode 100644 index 0000000..b6fcd1f Binary files /dev/null and b/notebooks/dsE_1_2018_microcat_7518_temperature.png differ diff --git a/oceanarray/mooring.py b/oceanarray/mooring_rodb.py similarity index 92% rename from oceanarray/mooring.py rename to oceanarray/mooring_rodb.py index 880a100..9ee725b 100644 --- a/oceanarray/mooring.py +++ b/oceanarray/mooring_rodb.py @@ -2,8 +2,9 @@ import pandas as pd import xarray as xr from scipy.interpolate import interp1d +from scipy.signal import butter, filtfilt -from oceanarray import tools, utilities +from oceanarray import utilities def find_time_vars(ds_list, time_key="TIME"): @@ -226,6 +227,40 @@ def combine_mooring_OS(ds_list): return ds_combined +def auto_filt(y, sr, co, typ="low", fo=6): + """ + Apply a Butterworth digital filter to a data array. + + Parameters + ---------- + y : array_like + Input data array (1D). + sr : float + Sampling rate (Hz or 1/time units of your data). + co : float or tuple of float + Cutoff frequency/frequencies. A scalar for 'low' or 'high', a 2-tuple for 'bandstop'. + typ : str, optional + Filter type: 'low', 'high', or 'bandstop'. Default is 'low'. + fo : int, optional + Filter order. Default is 6. + + Returns + ------- + yf : ndarray + Filtered data array. + """ + # Normalize cutoff frequency to the Nyquist rate + nyquist = 0.5 * sr + if isinstance(co, (list, tuple, np.ndarray)): + wh = [c / nyquist for c in co] + else: + wh = co / nyquist + + b, a = butter(fo, wh, btype=typ) + yf = filtfilt(b, a, y) + return yf + + def filter_all_time_vars(ds, cutoff_days=2, fo=6): """ Apply a lowpass Butterworth filter to all data variables that depend on TIME. @@ -270,7 +305,7 @@ def filter_all_time_vars(ds, cutoff_days=2, fo=6): y_filt[:, i] = np.nan else: try: - y_filt[:, i] = tools.auto_filt(y1d, sr, co, typ="low", fo=fo) + y_filt[:, i] = auto_filt(y1d, sr, co, typ="low", fo=fo) except ValueError: y_filt[:, i] = np.nan # fallback for rare failures diff --git a/oceanarray/instrument.py b/oceanarray/process_rodb.py similarity index 62% rename from oceanarray/instrument.py rename to oceanarray/process_rodb.py index c0c56c4..71d1f12 100644 --- a/oceanarray/instrument.py +++ b/oceanarray/process_rodb.py @@ -5,12 +5,160 @@ import numpy as np import xarray as xr -from oceanarray import rodb, tools +from oceanarray import rodb from oceanarray.logger import log_debug, log_info, log_warning DUMMY_VALUE = -9.99e-29 # adjust if needed +def middle_percent(values, percent=95): + """ + Return the lower and upper bounds for the central `percent` of the data. + + Parameters + ---------- + values : array-like + Input data (1D array). NaNs will be ignored. + percent : float + Desired central percentage (e.g., 95 for middle 95%). 
+ + Returns + ------- + tuple + (lower_bound, upper_bound) + """ + values = np.asarray(values) + values = values[~np.isnan(values)] + + if not 0 < percent < 100: + raise ValueError("percent must be between 0 and 100 (exclusive)") + + tail = (100 - percent) / 2 + lower = np.nanpercentile(values, tail) + upper = np.nanpercentile(values, 100 - tail) + return lower, upper + + +def mean_of_middle_percent(values, percent=95): + """ + Compute the mean of values within the central `percent` of the data. + + Parameters + ---------- + values : array-like + Input data (1D array). NaNs will be ignored. + percent : float + Desired central percentage (e.g., 95 for middle 95%). + + Returns + ------- + float + Mean of values within the specified middle percentage. + """ + values = np.asarray(values) + values = values[~np.isnan(values)] + lower, upper = middle_percent(values, percent) + filtered = values[(values >= lower) & (values <= upper)] + return np.mean(filtered) + + +def std_of_middle_percent(values, percent=95): + """ + Compute the standard deviation of values within the central `percent` of the data. + + Parameters + ---------- + values : array-like + Input data (1D array). NaNs will be ignored. + percent : float + Desired central percentage (e.g., 95 for middle 95%). + + Returns + ------- + float + Standard deviation of values within the specified middle percentage. + """ + values = np.asarray(values) + values = values[~np.isnan(values)] + lower, upper = middle_percent(values, percent) + filtered = values[(values >= lower) & (values <= upper)] + return np.std(filtered) + + +def normalize_by_middle_percent(values, percent=95): + """ + Normalize a data series by the mean and standard deviation + of its central `percent` range. + + Parameters + ---------- + values : array-like + Input data (1D array). NaNs are ignored. + percent : float + Central percentage to define the 'middle' of the distribution (e.g., 95). + + Returns + ------- + array + Normalized array with the same shape as input. + """ + values = np.asarray(values) + mask = ~np.isnan(values) + valid_values = values[mask] + + if valid_values.size == 0: + return values # return original if all NaNs + + lower, upper = middle_percent(valid_values, percent) + middle_vals = valid_values[(valid_values >= lower) & (valid_values <= upper)] + + mean_mid = np.mean(middle_vals) + std_mid = np.std(middle_vals) + + if std_mid == 0: + raise ValueError( + "Standard deviation of middle percent is zero — normalization not possible." + ) + + normalized = (values - mean_mid) / std_mid + return normalized + + +def normalize_dataset_by_middle_percent(ds, percent=95): + """ + Normalize all 1D data variables in an xarray Dataset that match the length of TIME, + using the mean and std over the central `percent` of each variable. + + Parameters + ---------- + ds : xarray.Dataset + Input dataset with a 'TIME' coordinate. + percent : float + Percentage of central values to define the middle (e.g., 95 for middle 95%). + + Returns + ------- + xarray.Dataset + New dataset with normalized data variables. 
+ """ + ds_norm = xr.Dataset(attrs=ds.attrs) + time_shape = ds["TIME"].shape + + for var in ds.data_vars: + if ds[var].shape == time_shape: + norm_values = normalize_by_middle_percent(ds[var].values, percent) + ds_norm[var] = xr.DataArray( + norm_values, + coords=ds[var].coords, + dims=ds[var].dims, + attrs=ds[var].attrs, + ) + + # Retain TIME coordinate + ds_norm = ds_norm.assign_coords({"TIME": ds["TIME"]}) + return ds_norm + + def trim_suggestion(ds, percent=95, threshold=6, vars_to_check=["T", "C", "P"]): """ Normalize dataset variables using the middle percentile and determine suggested @@ -34,7 +182,7 @@ def trim_suggestion(ds, percent=95, threshold=6, vars_to_check=["T", "C", "P"]): end_time : np.datetime64 or None Suggested deployment end time. """ - ds_norm = tools.normalize_dataset_by_middle_percent(ds, percent=percent) + ds_norm = normalize_dataset_by_middle_percent(ds, percent=percent) start_candidates = [] end_candidates = [] diff --git a/oceanarray/read_rapid.py b/oceanarray/read_rapid.py deleted file mode 100644 index 1c7cc59..0000000 --- a/oceanarray/read_rapid.py +++ /dev/null @@ -1,133 +0,0 @@ -from pathlib import Path -from typing import Union - -import xarray as xr - -# Import the modules used -from oceanarray import logger, utilities -from oceanarray.logger import log_error, log_info, log_warning -from oceanarray.utilities import apply_defaults - -log = logger.log # Use the global logger - -# Default list of RAPID data files -RAPID_DEFAULT_SOURCE = "https://rapid.ac.uk/sites/default/files/rapid_data/" -RAPID_TRANSPORT_FILES = ["moc_transports.nc"] -RAPID_DEFAULT_FILES = ["moc_transports.nc"] - -# Inline metadata dictionary -RAPID_METADATA = { - "description": "RAPID 26N transport estimates dataset", - "project": "RAPID-AMOC 26°N array", - "web_link": "https://rapid.ac.uk/rapidmoc", - "note": "Dataset accessed and processed via xarray", -} - -# File-specific metadata placeholder -RAPID_FILE_METADATA = { - "moc_transports.nc": { - "data_product": "RAPID layer transport time series", - }, -} - - -@apply_defaults(RAPID_DEFAULT_SOURCE, RAPID_DEFAULT_FILES) -def read_rapid( - source: Union[str, Path, None], - file_list: Union[str, list[str]], - transport_only: bool = True, - data_dir: Union[str, Path, None] = None, - redownload: bool = False, -) -> list[xr.Dataset]: - """Load the RAPID transport dataset from a URL or local file path into an xarray.Dataset. - - Parameters - ---------- - source : str, optional - URL or local path to the NetCDF file(s). - Defaults to the RAPID data repository URL. - file_list : str or list of str, optional - Filename or list of filenames to process. - If None, will attempt to list files in the source directory. - transport_only : bool, optional - If True, restrict to transport files only. - data_dir : str, Path or None, optional - Optional local data directory. - redownload : bool, optional - If True, force redownload of the data. - - Returns - ------- - xr.Dataset - The loaded xarray dataset with basic inline metadata. - - Raises - ------ - ValueError - If the source is neither a valid URL nor a directory path. - FileNotFoundError - If no valid NetCDF files are found in the provided file list. 
- - """ - log_info("Starting to read RAPID dataset") - - if file_list is None: - file_list = RAPID_DEFAULT_FILES - if transport_only: - file_list = RAPID_TRANSPORT_FILES - if isinstance(file_list, str): - file_list = [file_list] - - local_data_dir = Path(data_dir) if data_dir else utilities.get_default_data_dir() - local_data_dir.mkdir(parents=True, exist_ok=True) - - datasets = [] - - for file in file_list: - if not file.lower().endswith(".nc"): - log_warning("Skipping non-NetCDF file: %s", file) - continue - - download_url = ( - f"{source.rstrip('/')}/{file}" if utilities._is_valid_url(source) else None - ) - - file_path = utilities.resolve_file_path( - file_name=file, - source=source, - download_url=download_url, - local_data_dir=local_data_dir, - redownload=redownload, - ) - - try: - log_info("Opening RAPID dataset: %s", file_path) - ds = xr.open_dataset(file_path) - except Exception as e: - log_error("Failed to open NetCDF file: %s: %s", file_path, e) - raise FileNotFoundError(f"Failed to open NetCDF file: {file_path}: {e}") - - file_metadata = RAPID_FILE_METADATA.get(file, {}) - log_info("Attaching metadata to RAPID dataset from file: %s", file) - utilities.safe_update_attrs( - ds, - { - "source_file": file, - "source_path": str(file_path), - **RAPID_METADATA, - **file_metadata, - }, - ) - if "time" in ds.dims or "time" in ds.coords: - log_info("Renaming 'time' dimension/coordinate to 'TIME'") - ds = ds.rename({"time": "TIME"}) - - datasets.append(ds) - - if not datasets: - log_error("No valid RAPID NetCDF files found in %s", file_list) - raise FileNotFoundError(f"No valid RAPID NetCDF files found in {file_list}") - - log_info("Successfully loaded %d RAPID dataset(s)", len(datasets)) - - return datasets diff --git a/oceanarray/readers.py b/oceanarray/readers.py index f6a32b0..0137660 100644 --- a/oceanarray/readers.py +++ b/oceanarray/readers.py @@ -4,7 +4,6 @@ import xarray as xr from oceanarray.logger import log_info -from oceanarray.read_rapid import read_rapid DUMMY_VALUES = [1e32, -9.0, -9.9] @@ -123,33 +122,3 @@ def rodbload_old(filepath: Path, variables: list[str]) -> xr.Dataset: ds = xr.Dataset(data_vars, coords=coords) ds.attrs["source_file"] = str(filepath) return ds - - -def _get_reader(array_name: str): - """Return the reader function for the given array name. - - Parameters - ---------- - array_name : str - The name of the observing array. - - Returns - ------- - function - Reader function corresponding to the given array name. - - Raises - ------ - ValueError - If an unknown array name is provided. - - """ - readers = { - "rapid": read_rapid, - } - try: - return readers[array_name.lower()] - except KeyError: - raise ValueError( - f"Unknown array name: {array_name}. Valid options are: {list(readers.keys())}", - ) diff --git a/oceanarray/stage2.py b/oceanarray/stage2.py index 3757b1d..20bf4fd 100644 --- a/oceanarray/stage2.py +++ b/oceanarray/stage2.py @@ -102,21 +102,116 @@ def _trim_to_deployment_window( return dataset + def _extract_metadata_from_filepath( + self, filepath: Path, mooring_name: str + ) -> Dict[str, Any]: + """Extract metadata from filepath when not available in YAML or dataset. 
+
+        Expected pattern: {instrument_type}/{mooring_name}_{serial}_raw.nc
+        """
+        fallback_metadata = {}
+
+        # Extract instrument type from parent directory
+        instrument_type = filepath.parent.name
+        fallback_metadata["instrument"] = instrument_type
+
+        # Extract serial number from filename
+        filename = filepath.stem  # Remove .nc extension
+        if filename.endswith("_raw"):
+            filename = filename[:-4]  # Remove _raw suffix
+
+        # Pattern: mooring_name_serial
+        if filename.startswith(f"{mooring_name}_"):
+            serial_str = filename[len(f"{mooring_name}_") :]
+            try:
+                serial = int(serial_str)
+                fallback_metadata["serial"] = serial
+                self._log_print(
+                    f"Extracted from filename - instrument: {instrument_type}, serial: {serial}"
+                )
+            except ValueError:
+                self._log_print(
+                    f"WARNING: Could not parse serial number from filename: {filename}"
+                )
+
+        return fallback_metadata
+
+    def _get_figure_naming_info(
+        self, dataset: xr.Dataset, mooring_name: str
+    ) -> Dict[str, str]:
+        """Get information needed for figure naming convention.
+
+        Returns dict with mooring_name, instrument, serial for creating
+        figure names like: dsE_1_2018_microcat_7518_ctd.png
+        """
+        # xr.Dataset.get() returns the plain default when a variable is absent,
+        # and a bare str/int has no .values attribute, so guard before access.
+        instrument_var = dataset.get("instrument")
+        instrument = (
+            str(instrument_var.values) if instrument_var is not None else "unknown"
+        )
+        serial_var = dataset.get("serial_number")
+        serial = str(int(serial_var.values)) if serial_var is not None else "0"
+
+        return {
+            "mooring_name": mooring_name,
+            "instrument": instrument,
+            "serial": serial,
+        }
+
     def _add_missing_metadata(
-        self, dataset: xr.Dataset, instrument_config: Dict[str, Any]
+        self,
+        dataset: xr.Dataset,
+        instrument_config: Dict[str, Any],
+        filepath: Path,
+        mooring_name: str,
     ) -> xr.Dataset:
-        """Add any missing metadata variables to dataset."""
-        # Add instrument depth if missing
-        if "InstrDepth" not in dataset.variables and "depth" in instrument_config:
-            dataset["InstrDepth"] = instrument_config["depth"]
+        """Add any missing metadata variables to dataset with fallback extraction."""
+
+        # Get metadata from YAML config (highest priority)
+        yaml_instrument = instrument_config.get("instrument")
+        yaml_serial = instrument_config.get("serial")
+        yaml_depth = instrument_config.get("depth", 0)
+
+        # Check if we need fallback for any missing fields
+        needs_instrument_fallback = yaml_instrument is None
+        needs_serial_fallback = yaml_serial is None
+
+        fallback_used = False
+        final_instrument = yaml_instrument
+        final_serial = yaml_serial
+
+        if needs_instrument_fallback or needs_serial_fallback:
+            self._log_print(
+                "Some metadata missing from YAML, attempting extraction from filepath..."
+ ) + fallback_metadata = self._extract_metadata_from_filepath( + filepath, mooring_name + ) + + # Use fallback only for the missing fields + if needs_instrument_fallback and "instrument" in fallback_metadata: + final_instrument = fallback_metadata["instrument"] + self._log_print(f"Using fallback instrument type: {final_instrument}") + fallback_used = True + + if needs_serial_fallback and "serial" in fallback_metadata: + final_serial = fallback_metadata["serial"] + self._log_print(f"Using fallback serial number: {final_serial}") + fallback_used = True - # Add instrument type if missing - if "instrument" not in dataset.variables and "instrument" in instrument_config: - dataset["instrument"] = instrument_config["instrument"] + # Add metadata to dataset if missing + if "InstrDepth" not in dataset.variables: + dataset["InstrDepth"] = yaml_depth - # Add serial number if missing - if "serial_number" not in dataset.variables and "serial" in instrument_config: - dataset["serial_number"] = instrument_config["serial"] + if "instrument" not in dataset.variables and final_instrument is not None: + dataset["instrument"] = final_instrument + + if "serial_number" not in dataset.variables and final_serial is not None: + dataset["serial_number"] = final_serial + + # Add history note if fallback was used + if fallback_used: + history_note = "non-standard enrichment of metadata from filename patterns" + if "history" in dataset.attrs: + dataset.attrs["history"] += f"; {history_note}" + else: + dataset.attrs["history"] = history_note + self._log_print(f"Added history note: {history_note}") return dataset @@ -193,8 +288,10 @@ def _process_instrument( # Create a copy to modify dataset = ds.load() - # Add missing metadata - dataset = self._add_missing_metadata(dataset, instrument_config) + # Add missing metadata with fallback extraction + dataset = self._add_missing_metadata( + dataset, instrument_config, raw_filepath, mooring_name + ) # Clean unnecessary variables dataset = self._clean_unnecessary_variables(dataset) diff --git a/oceanarray/tools.py b/oceanarray/tools.py index f5e10fc..e8c0005 100644 --- a/oceanarray/tools.py +++ b/oceanarray/tools.py @@ -4,15 +4,13 @@ import numpy as np import pandas as pd import xarray as xr -from scipy.signal import butter, filtfilt +from scipy.signal import find_peaks from oceanarray import utilities # Initialize logging _log = logging.getLogger(__name__) -from scipy.signal import find_peaks - def lag_correlation(x, y, max_lag, min_overlap=10): """Pearson correlation at integer lags in [-max_lag, max_lag].""" @@ -116,90 +114,6 @@ def find_cold_entry_exit( return time[s0], time[eL], thr -def find_deployment(ds, var_name="temperature"): - pre_deploy_before = [] - start_deployment = [] - end_deployment = [] - mooring_rising = [] - split_vals = [] - split_vals2 = [] - N_LEVELS = ds["N_LEVELS"] - for i in range(0, len(N_LEVELS)): - if var_name in ds and ds[var_name].dims == ("time", "N_LEVELS"): - data1 = ds[var_name][:, i] - - splitter = split_value(data1) - x, y, split2 = find_cold_entry_exit(ds["time"], data1) - - # Assume the deployment data are below the threshold - idx_less_than = np.where(data1 < splitter) - idx_more_than = np.where(data1 > splitter) - - # Find out whether idx_less_than or idx_more_than contains the first non-Nan value - if idx_less_than[0].size > 0 and ( - idx_more_than[0].size == 0 or idx_less_than[0][0] < idx_more_than[0][0] - ): - # idx_less_than starts sooner (i.e. 
contains the pre-deployment) - idx = idx_more_than - condition = ">" - else: - idx = idx_less_than - condition = "<" - - first_deep_time = ds["time"][idx].values[0] if idx[0].size > 0 else None - time_before = ds["time"][idx[0][0] - 1].values if idx[0].size > 0 else None - # End of deployment + one after - last_deep_time = ds["time"][idx].values[-1] if idx[0].size > 0 else None - time_after = ds["time"][idx[0][-1] + 1].values if idx[0].size > 0 else None - - pre_deploy_before.append(time_before) - start_deployment.append(first_deep_time) - end_deployment.append(last_deep_time) - mooring_rising.append(time_after) - split_vals.append(splitter) - split_vals2.append(split2) - - # Initialise new variable in dataset ds, called start_time with same size as ds[var_name] - if "start_time" not in ds: - # Use proper datetime64 unit specification - ds["start_time"] = ( - ("N_LEVELS"), - np.full(ds["N_LEVELS"].shape, np.datetime64("NaT", "ns")), - ) - ds["end_time"] = ( - ("N_LEVELS"), - np.full(ds["N_LEVELS"].shape, np.datetime64("NaT", "ns")), - ) - if "split_value" not in ds: - ds["split_value"] = ( - ("N_LEVELS"), - np.full(ds["N_LEVELS"].shape, np.nan), - ) - if "split_value2" not in ds: - ds["split_value2"] = ( - ("N_LEVELS"), - np.full(ds["N_LEVELS"].shape, np.nan), - ) - ds["start_time"][i] = first_deep_time - ds["end_time"][i] = last_deep_time - ds["split_value"][i] = splitter - ds["split_value2"][i] = split2 - - print( - f"{i}/{data1['serial_number'].values}:{data1['instrument'].values}: Split at {splitter:1.2f}. Start after {first_deep_time}. End with {last_deep_time}." - ) - - else: - pre_deploy_before.append(np.datetime64("NaT", "ns")) - start_deployment.append(np.datetime64("NaT", "ns")) - end_deployment.append(np.datetime64("NaT", "ns")) - mooring_rising.append(np.datetime64("NaT", "ns")) - split_vals.append(np.nan) - split_vals2.append(np.nan) - - return ds - - def calc_psal(ds): if "PSAL" not in ds: SP = gsw.SP_from_C(ds["CNDC"], ds["TEMP"], ds["PRES"]) @@ -434,188 +348,6 @@ def process_dataset( return ds_standard -def auto_filt(y, sr, co, typ="low", fo=6): - """ - Apply a Butterworth digital filter to a data array. - - Parameters - ---------- - y : array_like - Input data array (1D). - sr : float - Sampling rate (Hz or 1/time units of your data). - co : float or tuple of float - Cutoff frequency/frequencies. A scalar for 'low' or 'high', a 2-tuple for 'bandstop'. - typ : str, optional - Filter type: 'low', 'high', or 'bandstop'. Default is 'low'. - fo : int, optional - Filter order. Default is 6. - - Returns - ------- - yf : ndarray - Filtered data array. - """ - # Normalize cutoff frequency to the Nyquist rate - nyquist = 0.5 * sr - if isinstance(co, (list, tuple, np.ndarray)): - wh = [c / nyquist for c in co] - else: - wh = co / nyquist - - b, a = butter(fo, wh, btype=typ) - yf = filtfilt(b, a, y) - return yf - - -def normalize_dataset_by_middle_percent(ds, percent=95): - """ - Normalize all 1D data variables in an xarray Dataset that match the length of TIME, - using the mean and std over the central `percent` of each variable. - - Parameters - ---------- - ds : xarray.Dataset - Input dataset with a 'TIME' coordinate. - percent : float - Percentage of central values to define the middle (e.g., 95 for middle 95%). - - Returns - ------- - xarray.Dataset - New dataset with normalized data variables. 
- """ - ds_norm = xr.Dataset(attrs=ds.attrs) - time_shape = ds["TIME"].shape - - for var in ds.data_vars: - if ds[var].shape == time_shape: - norm_values = normalize_by_middle_percent(ds[var].values, percent) - ds_norm[var] = xr.DataArray( - norm_values, - coords=ds[var].coords, - dims=ds[var].dims, - attrs=ds[var].attrs, - ) - - # Retain TIME coordinate - ds_norm = ds_norm.assign_coords({"TIME": ds["TIME"]}) - return ds_norm - - -def normalize_by_middle_percent(values, percent=95): - """ - Normalize a data series by the mean and standard deviation - of its central `percent` range. - - Parameters - ---------- - values : array-like - Input data (1D array). NaNs are ignored. - percent : float - Central percentage to define the 'middle' of the distribution (e.g., 95). - - Returns - ------- - array - Normalized array with the same shape as input. - """ - values = np.asarray(values) - mask = ~np.isnan(values) - valid_values = values[mask] - - if valid_values.size == 0: - return values # return original if all NaNs - - lower, upper = middle_percent(valid_values, percent) - middle_vals = valid_values[(valid_values >= lower) & (valid_values <= upper)] - - mean_mid = np.mean(middle_vals) - std_mid = np.std(middle_vals) - - if std_mid == 0: - raise ValueError( - "Standard deviation of middle percent is zero — normalization not possible." - ) - - normalized = (values - mean_mid) / std_mid - return normalized - - -def std_of_middle_percent(values, percent=95): - """ - Compute the standard deviation of values within the central `percent` of the data. - - Parameters - ---------- - values : array-like - Input data (1D array). NaNs will be ignored. - percent : float - Desired central percentage (e.g., 95 for middle 95%). - - Returns - ------- - float - Standard deviation of values within the specified middle percentage. - """ - values = np.asarray(values) - values = values[~np.isnan(values)] - lower, upper = middle_percent(values, percent) - filtered = values[(values >= lower) & (values <= upper)] - return np.std(filtered) - - -def mean_of_middle_percent(values, percent=95): - """ - Compute the mean of values within the central `percent` of the data. - - Parameters - ---------- - values : array-like - Input data (1D array). NaNs will be ignored. - percent : float - Desired central percentage (e.g., 95 for middle 95%). - - Returns - ------- - float - Mean of values within the specified middle percentage. - """ - values = np.asarray(values) - values = values[~np.isnan(values)] - lower, upper = middle_percent(values, percent) - filtered = values[(values >= lower) & (values <= upper)] - return np.mean(filtered) - - -def middle_percent(values, percent=95): - """ - Return the lower and upper bounds for the central `percent` of the data. - - Parameters - ---------- - values : array-like - Input data (1D array). NaNs will be ignored. - percent : float - Desired central percentage (e.g., 95 for middle 95%). 
- - Returns - ------- - tuple - (lower_bound, upper_bound) - """ - values = np.asarray(values) - values = values[~np.isnan(values)] - - if not 0 < percent < 100: - raise ValueError("percent must be between 0 and 100 (exclusive)") - - tail = (100 - percent) / 2 - lower = np.nanpercentile(values, tail) - upper = np.nanpercentile(values, 100 - tail) - return lower, upper - - def calc_ds_difference(ds1, ds2): # Check that the time grids are the same if not np.array_equal(ds1["TIME"].values, ds2["TIME"].values): diff --git a/tests/test_mooring.py b/tests/test_mooring_rodb.py similarity index 98% rename from tests/test_mooring.py rename to tests/test_mooring_rodb.py index 9da6ecb..a2e0c9c 100644 --- a/tests/test_mooring.py +++ b/tests/test_mooring_rodb.py @@ -3,8 +3,8 @@ import pytest import xarray as xr -from oceanarray import mooring -from oceanarray.mooring import ( # Adjust import as needed +from oceanarray import mooring_rodb +from oceanarray.mooring_rodb import ( # Adjust import as needed filter_all_time_vars, find_common_attributes, find_time_vars, get_12hourly_time_grid, interp_to_12hour_grid, stack_instruments) @@ -39,7 +39,7 @@ def create_mock_os_dataset(depth, serial_number, source_file): def test_combine_mooring(): ds1 = create_mock_os_dataset(100.0, "1234", "file1.nc") ds2 = create_mock_os_dataset(200.0, "5678", "file2.nc") - combined = mooring.combine_mooring_OS([ds1, ds2]) + combined = mooring_rodb.combine_mooring_OS([ds1, ds2]) assert "TEMP" in combined assert "serial_number" not in combined.attrs diff --git a/tests/test_instrument.py b/tests/test_process_rodb.py similarity index 95% rename from tests/test_instrument.py rename to tests/test_process_rodb.py index 2cd6e5e..0d56832 100644 --- a/tests/test_instrument.py +++ b/tests/test_process_rodb.py @@ -5,8 +5,8 @@ import pandas as pd import xarray as xr -from oceanarray.instrument import (apply_microcat_calibration_from_txt, - stage2_trim, trim_suggestion) +from oceanarray.process_rodb import (apply_microcat_calibration_from_txt, + stage2_trim, trim_suggestion) from oceanarray.rodb import rodbload @@ -65,9 +65,9 @@ def test_apply_microcat_with_flags(tmp_path): ds.to_netcdf(use_path) # write as .nc, simulate reading in `rodbload` # Patch rodbload to return this dataset - from oceanarray import instrument + from oceanarray import process_rodb - instrument.rodb.rodbload = lambda _: ds + process_rodb.rodb.rodbload = lambda _: ds ds_cal = apply_microcat_calibration_from_txt(txt, use_path) assert "T" in ds_cal and np.allclose( diff --git a/tests/test_stage1.py b/tests/test_stage1.py index 19a230c..e03e6ad 100644 --- a/tests/test_stage1.py +++ b/tests/test_stage1.py @@ -145,6 +145,147 @@ def test_get_netcdf_writer_params(self, processor): assert params["chunk_time"] == 3600 assert params["complevel"] == 5 + def test_clean_dataset_variables_sbe_cnv(self, processor): + """Test cleaning dataset variables for SBE CNV files.""" + # Create a mock dataset with variables that should be removed + import numpy as np + + ds = xr.Dataset( + { + "temperature": (["time"], [20.0, 21.0, 22.0]), + "pressure": (["time"], [100.0, 101.0, 102.0]), + "potential_temperature": ( + ["time"], + [19.8, 20.8, 21.8], + ), # Should be removed + "density": (["time"], [1025.0, 1026.0, 1027.0]), # Should be removed + "julian_days_offset": (["time"], [1, 2, 3]), # Should be removed + "salinity": (["time"], [35.0, 35.1, 35.2]), # Should be kept + }, + coords={ + "time": (["time"], np.arange(3)), + "depth": (["time"], [100.0, 100.0, 100.0]), # Should be removed + 
"latitude": 60.0, # Should be removed + "longitude": -30.0, # Should be removed + }, + ) + + # Clean the dataset + cleaned_ds = processor._clean_dataset_variables(ds, "sbe-cnv") + + # Check that unwanted variables were removed + assert "potential_temperature" not in cleaned_ds.variables + assert "density" not in cleaned_ds.variables + assert "julian_days_offset" not in cleaned_ds.variables + + # Check that wanted variables were kept + assert "temperature" in cleaned_ds.variables + assert "pressure" in cleaned_ds.variables + assert "salinity" in cleaned_ds.variables + + # Check that unwanted coordinates were removed + assert "depth" not in cleaned_ds.coords + assert "latitude" not in cleaned_ds.coords + assert "longitude" not in cleaned_ds.coords + + # Check that time coordinate was kept + assert "time" in cleaned_ds.coords + + def test_clean_dataset_variables_unknown_type(self, processor): + """Test cleaning dataset variables for unknown file type.""" + import numpy as np + + ds = xr.Dataset( + { + "temperature": (["time"], [20.0, 21.0, 22.0]), + "unwanted_var": (["time"], [1.0, 2.0, 3.0]), + }, + coords={ + "time": (["time"], np.arange(3)), + "unwanted_coord": (["time"], [100.0, 100.0, 100.0]), + }, + ) + + # Clean with unknown file type (should not remove anything) + cleaned_ds = processor._clean_dataset_variables(ds, "unknown-type") + + # Check that all variables and coordinates are preserved + assert "temperature" in cleaned_ds.variables + assert "unwanted_var" in cleaned_ds.variables + assert "time" in cleaned_ds.coords + assert "unwanted_coord" in cleaned_ds.coords + + def test_add_global_attributes_complete(self, processor): + """Test adding global attributes with complete YAML data.""" + import numpy as np + + # Create a simple dataset + ds = xr.Dataset( + {"temperature": (["time"], [20.0, 21.0, 22.0])}, + coords={"time": (["time"], np.arange(3))}, + ) + + yaml_data = { + "name": "test_mooring", + "waterdepth": 1000, + "longitude": -30.0, + "latitude": 60.0, + "deployment_latitude": "60 00.000 N", + "deployment_longitude": "030 00.000 W", + "deployment_time": "2018-08-12T08:00:00", + "seabed_latitude": "59 59.500 N", + "seabed_longitude": "030 00.500 W", + "recovery_time": "2018-08-26T20:47:24", + } + + # Add global attributes + updated_ds = processor._add_global_attributes(ds, yaml_data) + + # Check that all attributes were added correctly + assert updated_ds.attrs["mooring_name"] == "test_mooring" + assert updated_ds.attrs["waterdepth"] == 1000 + assert updated_ds.attrs["longitude"] == -30.0 + assert updated_ds.attrs["latitude"] == 60.0 + assert updated_ds.attrs["deployment_latitude"] == "60 00.000 N" + assert updated_ds.attrs["deployment_longitude"] == "030 00.000 W" + assert updated_ds.attrs["deployment_time"] == "2018-08-12T08:00:00" + assert updated_ds.attrs["seabed_latitude"] == "59 59.500 N" + assert updated_ds.attrs["seabed_longitude"] == "030 00.500 W" + assert updated_ds.attrs["recovery_time"] == "2018-08-26T20:47:24" + + def test_add_global_attributes_minimal(self, processor): + """Test adding global attributes with minimal YAML data.""" + import numpy as np + + # Create a simple dataset + ds = xr.Dataset( + {"temperature": (["time"], [20.0, 21.0, 22.0])}, + coords={"time": (["time"], np.arange(3))}, + ) + + # Minimal YAML data (only required fields) + yaml_data = { + "name": "minimal_mooring", + "waterdepth": 500, + } + + # Add global attributes + updated_ds = processor._add_global_attributes(ds, yaml_data) + + # Check that required attributes were added + assert 
updated_ds.attrs["mooring_name"] == "minimal_mooring" + assert updated_ds.attrs["waterdepth"] == 500 + + # Check that default values were used for missing fields + assert updated_ds.attrs["longitude"] == 0.0 + assert updated_ds.attrs["latitude"] == 0.0 + assert updated_ds.attrs["deployment_latitude"] == "00 00.000 N" + assert updated_ds.attrs["deployment_longitude"] == "000 00.000 W" + assert updated_ds.attrs["deployment_time"] == "YYYY-mm-ddTHH:MM:ss" + assert updated_ds.attrs["seabed_latitude"] == "00 00.000 N" + assert updated_ds.attrs["seabed_longitude"] == "000 00.000 W" + assert updated_ds.attrs["recovery_time"] == "YYYY-mm-ddTHH:MM:ss" + class TestRealDataProcessing: """Integration tests using real data files.""" diff --git a/tests/test_stage2.py b/tests/test_stage2.py index 20ee482..2b52395 100644 --- a/tests/test_stage2.py +++ b/tests/test_stage2.py @@ -246,7 +246,7 @@ def test_trim_to_deployment_window_empty_result( # Should result in empty dataset assert len(result.time) == 0 - def test_add_missing_metadata(self, processor, sample_raw_dataset): + def test_add_missing_metadata(self, processor, sample_raw_dataset, tmp_path): """Test adding missing metadata variables.""" # Remove some metadata to test adding it back ds = sample_raw_dataset.copy() @@ -254,13 +254,22 @@ def test_add_missing_metadata(self, processor, sample_raw_dataset): instrument_config = {"depth": 150, "instrument": "new_microcat", "serial": 9999} - result = processor._add_missing_metadata(ds, instrument_config) + # Create a mock filepath for testing + mock_filepath = tmp_path / "microcat" / "test_mooring_7518_raw.nc" + mock_filepath.parent.mkdir(parents=True) + mock_filepath.touch() + + result = processor._add_missing_metadata( + ds, instrument_config, mock_filepath, "test_mooring" + ) assert result["InstrDepth"].values == 150 assert result["instrument"].values == "new_microcat" assert result["serial_number"].values == 9999 - def test_add_missing_metadata_no_overwrite(self, processor, sample_raw_dataset): + def test_add_missing_metadata_no_overwrite( + self, processor, sample_raw_dataset, tmp_path + ): """Test that existing metadata is not overwritten.""" instrument_config = { "depth": 999, @@ -268,13 +277,89 @@ def test_add_missing_metadata_no_overwrite(self, processor, sample_raw_dataset): "serial": 8888, } - result = processor._add_missing_metadata(sample_raw_dataset, instrument_config) + # Create a mock filepath for testing + mock_filepath = tmp_path / "microcat" / "test_mooring_7518_raw.nc" + mock_filepath.parent.mkdir(parents=True) + mock_filepath.touch() + + result = processor._add_missing_metadata( + sample_raw_dataset, instrument_config, mock_filepath, "test_mooring" + ) # Should keep original values assert result["InstrDepth"].values == 100 assert result["instrument"].values == "microcat" assert result["serial_number"].values == 7518 + def test_fallback_metadata_extraction(self, processor, tmp_path): + """Test fallback metadata extraction from filepath.""" + # Create dataset missing metadata + import numpy as np + import xarray as xr + + ds = xr.Dataset( + { + "temperature": (["time"], np.random.random(10)), + }, + coords={"time": pd.date_range("2018-01-01", periods=10, freq="h")}, + ) + + # YAML config is missing instrument and serial (None, not provided) + instrument_config = {"depth": 150} # Only depth provided + + # Create filepath that contains metadata + mock_filepath = tmp_path / "sbe56" / "dsE_1_2018_6363_raw.nc" + mock_filepath.parent.mkdir(parents=True) + mock_filepath.touch() + + result = 
processor._add_missing_metadata(
+            ds, instrument_config, mock_filepath, "dsE_1_2018"
+        )
+
+        # Should extract from filepath
+        assert result["instrument"].values == "sbe56"
+        assert result["serial_number"].values == 6363
+        assert result["InstrDepth"].values == 150  # From YAML
+
+        # Should add history note about non-standard enrichment
+        assert (
+            "non-standard enrichment of metadata from filename patterns"
+            in result.attrs["history"]
+        )
+
+    def test_no_fallback_when_yaml_complete(self, processor, tmp_path):
+        """Test that fallback is not used when YAML provides complete metadata."""
+        # Create dataset missing metadata
+        import numpy as np
+        import xarray as xr
+
+        ds = xr.Dataset(
+            {
+                "temperature": (["time"], np.random.random(10)),
+            },
+            coords={"time": pd.date_range("2018-01-01", periods=10, freq="h")},
+        )
+
+        # YAML config has complete metadata
+        instrument_config = {"depth": 150, "instrument": "microcat", "serial": 7518}
+
+        # Create filepath with different metadata
+        mock_filepath = tmp_path / "sbe56" / "dsE_1_2018_6363_raw.nc"
+        mock_filepath.parent.mkdir(parents=True)
+        mock_filepath.touch()
+
+        result = processor._add_missing_metadata(
+            ds, instrument_config, mock_filepath, "dsE_1_2018"
+        )
+
+        # Should use YAML metadata, not filepath
+        assert result["instrument"].values == "microcat"  # From YAML
+        assert result["serial_number"].values == 7518  # From YAML
+        assert result["InstrDepth"].values == 150  # From YAML
+
+        # Should NOT add history note since no fallback was used
+        assert "history" not in result.attrs
+
     def test_clean_unnecessary_variables(self, processor, sample_raw_dataset):
         """Test removal of unnecessary variables."""
         result = processor._clean_unnecessary_variables(sample_raw_dataset)
diff --git a/tests/test_tools.py b/tests/test_tools.py
index 9950598..3d7a301 100644
--- a/tests/test_tools.py
+++ b/tests/test_tools.py
@@ -3,6 +3,11 @@
 import xarray as xr
 
 from oceanarray import tools
+from oceanarray.mooring_rodb import auto_filt
+from oceanarray.process_rodb import (mean_of_middle_percent, middle_percent,
+                                     normalize_by_middle_percent,
+                                     normalize_dataset_by_middle_percent,
+                                     std_of_middle_percent)
 from oceanarray.tools import calc_ds_difference
 
 
@@ -67,33 +72,33 @@ def test_downsample_to_sparse_shapes():
 
 def test_middle_percent():
     data = np.linspace(0, 100, 100)
-    lo, hi = tools.middle_percent(data, 80)
+    lo, hi = middle_percent(data, 80)
     assert lo > 0 and hi < 100 and hi > lo
 
 
 def test_middle_percent_bounds():
     data = np.linspace(0, 100, 1000)
-    lower, upper = tools.middle_percent(data, 90)
+    lower, upper = middle_percent(data, 90)
     assert np.isclose(lower, 5)
     assert np.isclose(upper, 95)
 
 
 def test_mean_of_middle_percent():
     data = np.concatenate([np.random.normal(10, 1, 1000), np.array([1000, -1000])])
-    mean = tools.mean_of_middle_percent(data, 95)
+    mean = mean_of_middle_percent(data, 95)
    assert abs(mean - 10) < 0.2
 
 
 def test_std_of_middle_percent():
     data = np.concatenate([np.random.normal(5, 2, 1000), np.array([999, -999])])
-    std = tools.std_of_middle_percent(data, 95)
+    std = std_of_middle_percent(data, 95)
     assert 1.5 < std < 2.5
 
 
 def test_mean_std_middle_percent():
     data = np.random.normal(0, 1, 1000)
-    mean = tools.mean_of_middle_percent(data, 90)
-    std = tools.std_of_middle_percent(data, 90)
+    mean = mean_of_middle_percent(data, 90)
+    std = std_of_middle_percent(data, 90)
     assert np.isfinite(mean)
     assert np.isfinite(std)
     assert std > 0
 
 
@@ -101,15 +106,15 @@ def test_normalize_by_middle_percent():
     data = np.random.normal(0, 1, 1000)
-    norm = tools.normalize_by_middle_percent(data, 90)
-    mid_std = tools.std_of_middle_percent(norm, 90)
+    norm = normalize_by_middle_percent(data, 90)
+    mid_std = std_of_middle_percent(norm, 90)
     assert 0.9 < mid_std < 1.1
 
 
 def test_normalize_dataset_by_middle_percent():
     time = np.arange(10)
     ds = xr.Dataset({"TEMP": ("TIME", np.random.rand(10) + 20)}, coords={"TIME": time})
-    ds_norm = tools.normalize_dataset_by_middle_percent(ds)
+    ds_norm = normalize_dataset_by_middle_percent(ds)
     assert "TEMP" in ds_norm
     assert np.allclose(ds_norm.TIME, time)
 
 
@@ -118,7 +123,7 @@ def test_auto_filt_low():
     sr = 1.0  # Hz
     t = np.linspace(0, 10, 500)
     signal = np.sin(2 * np.pi * 0.1 * t) + 0.5 * np.sin(2 * np.pi * 2 * t)
-    filtered = tools.auto_filt(signal, sr, co=0.2, typ="low")
+    filtered = auto_filt(signal, sr, co=0.2, typ="low")
     assert len(filtered) == len(signal)
     assert np.std(filtered) < np.std(signal)
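
Usage note for the relocated helpers: this refactor changes import paths only, not function signatures. The sketch below is illustrative rather than part of the patch. It assumes an editable install of the package; the synthetic `t`/`signal` series mirrors the updated `test_auto_filt_low`, and the 1.0 Hz sampling rate and 0.2 Hz cutoff are example values, not project defaults.

```python
import numpy as np

from oceanarray.mooring_rodb import auto_filt
from oceanarray.process_rodb import (
    mean_of_middle_percent,
    middle_percent,
    normalize_by_middle_percent,
)

# Synthetic series: a slow 0.1 Hz oscillation plus a faster 2 Hz component
t = np.linspace(0, 10, 500)
signal = np.sin(2 * np.pi * 0.1 * t) + 0.5 * np.sin(2 * np.pi * 2 * t)

# 6th-order Butterworth low-pass with a 0.2 Hz cutoff at a 1.0 Hz sampling
# rate; filtfilt runs forward and backward, so the result is zero-phase
filtered = auto_filt(signal, sr=1.0, co=0.2, typ="low", fo=6)

# Robust statistics computed over the central 95% of the distribution
lower, upper = middle_percent(signal, percent=95)
mean_mid = mean_of_middle_percent(signal, percent=95)
normalized = normalize_by_middle_percent(signal, percent=95)

print(f"std before/after filtering: {np.std(signal):.3f} / {np.std(filtered):.3f}")
print(f"central 95% bounds: [{lower:.2f}, {upper:.2f}], trimmed mean {mean_mid:.2f}")
```

Keeping `auto_filt` next to `filter_all_time_vars` in `mooring_rodb.py`, and the middle-percent family next to `trim_suggestion` in `process_rodb.py`, removes the legacy RODB modules' dependency on `oceanarray.tools`, matching the deletions from `tools.py` shown above.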