
Add MultimodalUniverseDataset (HF datasets) support, config updates, docs and tests#846

Merged
mtauraso merged 11 commits into main from codex/create-dataset-class-for-mmu-access-wms89e
Apr 3, 2026
Conversation

@mtauraso
Collaborator

@mtauraso mtauraso commented Mar 27, 2026

Change Description

Enable loading Multimodal Universe (MMU) datasets in Hyrax through the Hugging Face datasets library, supporting images, spectra, and time-series data. Preserve URI-style data_location values (e.g. hf://... and https://...) instead of converting them to filesystem paths. Provide example notebooks and defaults to make the new dataset type easy to try.

Solution Description

  • Add MultimodalUniverseDataset implementation in src/hyrax/datasets/mmu_dataset.py that loads HF datasets via datasets.load_dataset, supports streaming, split handling, and per-column getter registration with sanitized aliases. Getters always return numpy arrays (PIL Images are converted via np.asarray).
  • Fix max_samples parsing in MultimodalUniverseDataset to treat False as None (Hyrax TOML sentinel convention where key = false means "not set"), preventing accidental coercion to 0 which would produce an empty dataset.
  • Update src/hyrax/datasets/__init__.py to export MultimodalUniverseDataset.
  • Modify src/hyrax/config_schemas/data_request.py to preserve URI schemes by detecting and returning URIs unchanged (added urlparse usage) when resolving data_location.
  • Add default dataset settings to src/hyrax/hyrax_default_config.toml for MultimodalUniverseDataset (split, max_samples, streaming).
  • Add datasets dependency to pyproject.toml with an explanatory comment.
  • Add three pre-executed example notebooks under docs/pre_executed/ demonstrating image, spectra, and time-series MMU usage with Hyrax. All three notebooks include "nbsphinx": {"execute": "never"} metadata to prevent Sphinx docs builds from attempting to execute them (they require Hugging Face network access).
  • Add unit tests: tests/hyrax/test_data_request_config.py extended to validate URI preservation, and tests/hyrax/test_mmu_dataset.py covering HF URI handling, max-sample limiting, and streaming validation. Tests include docstrings (D103 compliance), use pytest.raises idiomatically, and use _FakeMapDataset/_FakeIterableDataset fakes with with_format support to correctly mirror the production code path.
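The two config-handling behaviors above (URI preservation and the `max_samples = false` sentinel) can be sketched in a few lines. This is an illustrative sketch, not the actual Hyrax code; the function names `resolve_data_location` and `parse_max_samples` are invented for this example.

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_data_location(value):
    """Sketch of the validator behavior: URI-style values pass through
    unchanged; anything else is resolved to a filesystem path."""
    parsed = urlparse(str(value))
    if parsed.scheme in ("hf", "http", "https"):
        return value  # preserve hf://... and https://... as-is
    return str(Path(value).resolve())


def parse_max_samples(raw):
    """Treat the Hyrax TOML sentinel `max_samples = false` the same as
    "not set", avoiding int(False) == 0, which would yield an empty
    dataset. The identity check must run before any int() coercion."""
    if raw is None or raw is False:
        return None
    return int(raw)


print(resolve_data_location("hf://MultimodalUniverse/plasticc"))  # unchanged
print(parse_max_samples(False))  # None, not 0
```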

Code Quality

  • I have read the Contribution Guide and agree to the Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation


Copilot AI review requested due to automatic review settings March 27, 2026 22:32
@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@mtauraso mtauraso marked this pull request as draft March 27, 2026 22:37
@codecov

codecov Bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 88.17204% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.52%. Comparing base (0d6bc3f) to head (1c7b490).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/hyrax/datasets/mmu_dataset.py | 87.20% | 11 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #846      +/-   ##
==========================================
+ Coverage   66.21%   66.52%   +0.30%     
==========================================
  Files          62       63       +1     
  Lines        6412     6504      +92     
==========================================
+ Hits         4246     4327      +81     
- Misses       2166     2177      +11     

☔ View full report in Codecov by Sentry.

Contributor

Copilot AI left a comment


Pull request overview

Adds first-class support for loading Multimodal Universe (MMU) datasets via Hugging Face datasets, alongside config/schema changes to preserve URI-style data_location values and accompanying docs/tests.

Changes:

  • Introduces MultimodalUniverseDataset to load HF datasets (optionally streaming) and auto-register get_* accessors for columns (with sanitized aliases).
  • Updates DataRequestConfig validation to preserve URI data_location values (e.g. hf://, https://) instead of resolving them to filesystem paths.
  • Adds defaults, dependency, docs notebooks, and unit tests for the new dataset and URI preservation behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/hyrax/datasets/mmu_dataset.py New dataset implementation backed by HF datasets.load_dataset, with streaming + column getter registration.
src/hyrax/datasets/__init__.py Exports MultimodalUniverseDataset from the datasets package.
src/hyrax/config_schemas/data_request.py Preserves URI data_location values during validation via urlparse.
src/hyrax/hyrax_default_config.toml Adds default data_set.MultimodalUniverseDataset settings (split, max_samples, streaming).
pyproject.toml Adds datasets dependency for HF dataset loading.
tests/hyrax/test_mmu_dataset.py New unit tests for HF URI normalization, limiting, and streaming validation.
tests/hyrax/test_data_request_config.py Adds unit tests asserting URI data_location values are preserved.
docs/pre_executed/mmu_images_with_hyrax.ipynb Example notebook showing MMU images usage via Hyrax.
docs/pre_executed/mmu_spectra_with_hyrax.ipynb Example notebook showing MMU spectra usage via Hyrax.
docs/pre_executed/mmu_time_series_with_hyrax.ipynb Example notebook showing MMU time-series usage via Hyrax.

Comment thread tests/hyrax/test_mmu_dataset.py
Comment thread tests/hyrax/test_mmu_dataset.py
Comment thread tests/hyrax/test_mmu_dataset.py Outdated
Comment on lines +78 to +87
try:
    MultimodalUniverseDataset(
        config={"data_set": {"MultimodalUniverseDataset": {"split": "train", "streaming": True}}},
        data_location="hf://MultimodalUniverse/plasticc",
    )
    raised = False
except ValueError:
    raised = True

assert raised

Copilot AI Mar 27, 2026


Prefer using pytest.raises(ValueError) here instead of manual try/except + boolean flag; it’s more idiomatic in this test suite and produces clearer failure output.
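The suggested idiom looks roughly like the sketch below. The `MultimodalUniverseDataset` here is a hypothetical stand-in written only to make the example self-contained; it is not the real class, just enough behavior to show the `pytest.raises` pattern.

```python
import pytest


class MultimodalUniverseDataset:
    """Hypothetical stand-in for the real class, for illustration only:
    streaming without max_samples raises ValueError."""

    def __init__(self, config, data_location):
        settings = config["data_set"]["MultimodalUniverseDataset"]
        if settings.get("streaming") and settings.get("max_samples") in (None, False):
            raise ValueError("streaming requires max_samples")


def test_streaming_requires_max_samples():
    """With pytest.raises, an unexpected success fails the test with a
    clear message; no manual boolean flag is needed."""
    with pytest.raises(ValueError):
        MultimodalUniverseDataset(
            config={"data_set": {"MultimodalUniverseDataset": {"split": "train", "streaming": True}}},
            data_location="hf://MultimodalUniverse/plasticc",
        )


test_streaming_requires_max_samples()
```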

},
"language_info": {
"name": "python",
"version": "3.11"

Copilot AI Mar 27, 2026


Because Sphinx uses nbsphinx and nbsphinx_execute defaults to auto, a new notebook with empty outputs/execution_count will be executed during docs builds (including pre-commit sphinx-build). Since this notebook depends on Hugging Face network access, add notebook-level metadata { "nbsphinx": { "execute": "never" } } (or otherwise ensure outputs are pre-populated) to prevent execution.

Suggested change:
-    "version": "3.11"
+    "version": "3.11"
+  },
+  "nbsphinx": {
+    "execute": "never"

Comment thread src/hyrax/datasets/mmu_dataset.py Outdated
Comment thread tests/hyrax/test_mmu_dataset.py
@github-actions

github-actions Bot commented Mar 27, 2026

| Before [0d6bc3f] | After [17a10d0] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 15.266899234398545 | 16.237276137117483 | 1.06 | data_cache_benchmarks.DataCacheBenchmarks.track_cache_hsc1k_hyrax_size_undercount |
| 38.7±0.3ms | 40.6±0.6ms | 1.05 | benchmarks.time_nb_obj_construct |
| 1.57G | 1.63G | 1.04 | vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb') |
| 421±0.6ms | 431±3ms | 1.02 | vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'qdrant') |
| 6.87±0.03s | 6.95±0.03s | 1.01 | benchmarks.time_database_connection_help |
| 6.84±0s | 6.93±0s | 1.01 | benchmarks.time_rebuild_manifest_help |
| 7.89±0.01s | 7.97±0.05s | 1.01 | data_cache_benchmarks.DataCacheBenchmarks.time_preload_cache_hsc1k |
| 114±0.4μs | 116±2μs | 1.01 | data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data |
| 1.16G | 1.17G | 1.01 | vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(2048, 'chromadb') |
| 6.85±0.01s | 6.83±0.01s | 1 | benchmarks.time_download_help |

Click here to view all benchmarks.

- Always return numpy objects
- Demonstrate in examples how to access and filter data
- Fix Cache so numpy objects with non-numpy base data containers are accounted for correctly
- Demonstrate how PIL Images can be reconstructed from MMU numpy arrays in example notebook
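The PIL round-trip mentioned in the last commit can be sketched as below, assuming NumPy and Pillow are available. The array shape and dtype are hypothetical stand-ins for real MMU image data; the point is that `np.asarray` / `Image.fromarray` are lossless for uint8 image arrays.

```python
import numpy as np
from PIL import Image  # Pillow

# Stand-in for an MMU image column after the getters convert to numpy;
# a real MMU cutout would have survey-specific shape and scaling.
arr = (np.random.default_rng(0).random((32, 32, 3)) * 255).astype(np.uint8)

img = Image.fromarray(arr)    # reconstruct a PIL Image from the array
round_trip = np.asarray(img)  # and back, as the dataset getters do

print(img.size, round_trip.shape, np.array_equal(arr, round_trip))
```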
@mtauraso
Collaborator Author

@copilot Can you re-review at the tip of the branch and fixup the tests. The code should be doing correct behavior as demonstrated by the notebooks now.

Contributor

Copilot AI commented Mar 28, 2026

@copilot Can you re-review at the tip of the branch and fixup the tests. The code should be doing correct behavior as demonstrated by the notebooks now.

Done in d19172f. Fixed the following based on the review feedback:

  1. Test docstrings – Added docstrings to all three test functions (satisfies D103 lint rule).
  2. pytest.raises – Replaced the manual try/except+flag in test_mmu_dataset_streaming_requires_max_samples with with pytest.raises(ValueError).
  3. with_format support in fakes – Added with_format method to _FakeMapDataset and a new _FakeIterableDataset class so the tests properly mirror the production code path that calls dataset.with_format("numpy").
  4. max_samples=False sentinel – mmu_dataset.py now treats False the same as None (per Hyrax TOML convention where key = false means "not set"), preventing accidental coercion to 0.
  5. Notebook nbsphinx metadata – Added {"nbsphinx": {"execute": "never"}} to all three MMU notebooks so the Sphinx docs build won't try to execute them.

@mtauraso
Collaborator Author

@copilot Great job on the tests, please fix remaining ruff errors.

Contributor

Copilot AI commented Mar 28, 2026

@mtauraso The model claude-sonnet-4.6 is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 5ea7f22d-42b9-45e0-b879-b474f3c66531

5 similar comments

@mtauraso mtauraso marked this pull request as ready for review March 28, 2026 01:12
Collaborator

@drewoldag drewoldag left a comment


Just a couple minor comments about the dataset class. Looking at the notebooks now.

"primary_id_field": "object_id",
"dataset_config": {
"MultimodalUniverseDataset": {
"split": "train",
Collaborator


Minor: should this be something other than "train" so that it matches the "infer" data group?

Collaborator Author


No, actually. MMU has "splits" in their data as uploaded to HF, but all of their data is in the train "split."

AFAICT the convention is that when you don't want to define a split on HF-uploaded data, you do exactly as MMU has done.

Comment thread src/hyrax/datasets/mmu_dataset.py Outdated

self.data_location = str(data_location)
dataset_settings = (
config.get("data_set", {}).get("MultimodalUniverseDataset", {}) if config is not None else {}
Collaborator


Why use .get()s here?

Collaborator Author


Fixed. The robot didn't fully understand the path by which default data would be injected, and I missed this when I took a pass at all its config.get() calls.

Comment thread src/hyrax/datasets/mmu_dataset.py
Comment thread src/hyrax/datasets/mmu_dataset.py Outdated
Collaborator

@drewoldag drewoldag left a comment


The notebooks look good. A single plot in each of the spectra and time-series notebooks would be perfect.

@drewoldag
Collaborator

Overall this looks good to me. It does make me start to question when it would make sense to think about breaking things out into separate packages.

Could this be a hyrax-mmu package? Or more generally a hyrax-huggingface package?

@mtauraso
Collaborator Author

mtauraso commented Apr 2, 2026

Yeah we probably want to start making packages for this stuff.

That'll also give people working on like... Euclid/HSC/etc a place to put their data set classes.

@mtauraso mtauraso merged commit ce50083 into main Apr 3, 2026
9 checks passed
@mtauraso mtauraso deleted the codex/create-dataset-class-for-mmu-access-wms89e branch April 3, 2026 21:24