
Add MultimodalUniverseDataset (HF datasets) support, config updates, docs and tests#846

Merged
mtauraso merged 11 commits into main from codex/create-dataset-class-for-mmu-access-wms89e
Apr 3, 2026
Conversation

@mtauraso
Collaborator

@mtauraso mtauraso commented Mar 27, 2026

Change Description

Enable loading Multimodal Universe (MMU) datasets in Hyrax through the Hugging Face datasets library, supporting images, spectra, and time-series data. Preserve URI-style data_location values (e.g. hf://... and https://...) instead of converting them to filesystem paths. Provide example notebooks and defaults to make the new dataset type easy to try.

Solution Description

  • Add MultimodalUniverseDataset implementation in src/hyrax/datasets/mmu_dataset.py that loads HF datasets via datasets.load_dataset, supports streaming, split handling, and per-column getter registration with sanitized aliases. Getters always return numpy arrays (PIL Images are converted via np.asarray).
  • Fix max_samples parsing in MultimodalUniverseDataset to treat False as None (Hyrax TOML sentinel convention where key = false means "not set"), preventing accidental coercion to 0 which would produce an empty dataset.
  • Update src/hyrax/datasets/__init__.py to export MultimodalUniverseDataset.
  • Modify src/hyrax/config_schemas/data_request.py to preserve URI schemes by detecting and returning URIs unchanged (added urlparse usage) when resolving data_location.
  • Add default dataset settings to src/hyrax/hyrax_default_config.toml for MultimodalUniverseDataset (split, max_samples, streaming).
  • Add datasets dependency to pyproject.toml with an explanatory comment.
  • Add three pre-executed example notebooks under docs/pre_executed/ demonstrating image, spectra, and time-series MMU usage with Hyrax. All three notebooks include "nbsphinx": {"execute": "never"} metadata to prevent Sphinx docs builds from attempting to execute them (they require Hugging Face network access).
  • Add unit tests: tests/hyrax/test_data_request_config.py extended to validate URI preservation, and tests/hyrax/test_mmu_dataset.py covering HF URI handling, max-sample limiting, and streaming validation. Tests include docstrings (D103 compliance), use pytest.raises idiomatically, and use _FakeMapDataset/_FakeIterableDataset fakes with with_format support to correctly mirror the production code path.
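The two config-handling behaviors above (URI preservation and the `max_samples = false` sentinel) can be sketched in a few lines. This is an illustrative sketch, not the actual Hyrax code; the function names `resolve_data_location` and `parse_max_samples` are invented for this example.

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_data_location(value):
    """Sketch of the validator behavior: URI-style values pass through
    unchanged; anything else is resolved to a filesystem path."""
    parsed = urlparse(str(value))
    if parsed.scheme in ("hf", "http", "https"):
        return value  # preserve hf://... and https://... as-is
    return str(Path(value).resolve())


def parse_max_samples(raw):
    """Treat the Hyrax TOML sentinel `max_samples = false` the same as
    "not set", avoiding int(False) == 0, which would yield an empty
    dataset. The identity check must run before any int() coercion."""
    if raw is None or raw is False:
        return None
    return int(raw)


print(resolve_data_location("hf://MultimodalUniverse/plasticc"))  # unchanged
print(parse_max_samples(False))  # None, not 0
```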

Code Quality

  • I have read the Contribution Guide and agree to the Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation


Copilot AI review requested due to automatic review settings March 27, 2026 22:32
@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@mtauraso mtauraso marked this pull request as draft March 27, 2026 22:37
@codecov

codecov Bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 88.17204% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.52%. Comparing base (0d6bc3f) to head (1c7b490).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/hyrax/datasets/mmu_dataset.py | 87.20% | 11 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #846      +/-   ##
==========================================
+ Coverage   66.21%   66.52%   +0.30%     
==========================================
  Files          62       63       +1     
  Lines        6412     6504      +92     
==========================================
+ Hits         4246     4327      +81     
- Misses       2166     2177      +11     

☔ View full report in Codecov by Sentry.

Contributor

Copilot AI left a comment


Pull request overview

Adds first-class support for loading Multimodal Universe (MMU) datasets via Hugging Face datasets, alongside config/schema changes to preserve URI-style data_location values and accompanying docs/tests.

Changes:

  • Introduces MultimodalUniverseDataset to load HF datasets (optionally streaming) and auto-register get_* accessors for columns (with sanitized aliases).
  • Updates DataRequestConfig validation to preserve URI data_location values (e.g. hf://, https://) instead of resolving them to filesystem paths.
  • Adds defaults, dependency, docs notebooks, and unit tests for the new dataset and URI preservation behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/hyrax/datasets/mmu_dataset.py New dataset implementation backed by HF datasets.load_dataset, with streaming + column getter registration.
src/hyrax/datasets/__init__.py Exports MultimodalUniverseDataset from the datasets package.
src/hyrax/config_schemas/data_request.py Preserves URI data_location values during validation via urlparse.
src/hyrax/hyrax_default_config.toml Adds default data_set.MultimodalUniverseDataset settings (split, max_samples, streaming).
pyproject.toml Adds datasets dependency for HF dataset loading.
tests/hyrax/test_mmu_dataset.py New unit tests for HF URI normalization, limiting, and streaming validation.
tests/hyrax/test_data_request_config.py Adds unit tests asserting URI data_location values are preserved.
docs/pre_executed/mmu_images_with_hyrax.ipynb Example notebook showing MMU images usage via Hyrax.
docs/pre_executed/mmu_spectra_with_hyrax.ipynb Example notebook showing MMU spectra usage via Hyrax.
docs/pre_executed/mmu_time_series_with_hyrax.ipynb Example notebook showing MMU time-series usage via Hyrax.

Comment thread tests/hyrax/test_mmu_dataset.py
Comment thread tests/hyrax/test_mmu_dataset.py
Comment thread tests/hyrax/test_mmu_dataset.py Outdated
Comment on lines +78 to +87
try:
    MultimodalUniverseDataset(
        config={"data_set": {"MultimodalUniverseDataset": {"split": "train", "streaming": True}}},
        data_location="hf://MultimodalUniverse/plasticc",
    )
    raised = False
except ValueError:
    raised = True

assert raised

Copilot AI Mar 27, 2026


Prefer using pytest.raises(ValueError) here instead of manual try/except + boolean flag; it’s more idiomatic in this test suite and produces clearer failure output.
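The suggested idiom looks roughly like the sketch below. The `MultimodalUniverseDataset` here is a hypothetical stand-in written only to make the example self-contained; it is not the real class, just enough behavior to show the `pytest.raises` pattern.

```python
import pytest


class MultimodalUniverseDataset:
    """Hypothetical stand-in for the real class, for illustration only:
    streaming without max_samples raises ValueError."""

    def __init__(self, config, data_location):
        settings = config["data_set"]["MultimodalUniverseDataset"]
        if settings.get("streaming") and settings.get("max_samples") in (None, False):
            raise ValueError("streaming requires max_samples")


def test_streaming_requires_max_samples():
    """With pytest.raises, an unexpected success fails the test with a
    clear message; no manual boolean flag is needed."""
    with pytest.raises(ValueError):
        MultimodalUniverseDataset(
            config={"data_set": {"MultimodalUniverseDataset": {"split": "train", "streaming": True}}},
            data_location="hf://MultimodalUniverse/plasticc",
        )


test_streaming_requires_max_samples()
```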

},
"language_info": {
"name": "python",
"version": "3.11"

Copilot AI Mar 27, 2026


Because Sphinx uses nbsphinx and nbsphinx_execute defaults to auto, a new notebook with empty outputs/execution_count will be executed during docs builds (including pre-commit sphinx-build). Since this notebook depends on Hugging Face network access, add notebook-level metadata { "nbsphinx": { "execute": "never" } } (or otherwise ensure outputs are pre-populated) to prevent execution.

Suggested change:
-    "version": "3.11"
+    "version": "3.11"
+  },
+  "nbsphinx": {
+    "execute": "never"

Comment thread src/hyrax/datasets/mmu_dataset.py Outdated
Comment thread tests/hyrax/test_mmu_dataset.py
@github-actions

github-actions Bot commented Mar 27, 2026

| Before [0d6bc3f] | After [17a10d0] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 15.266899234398545 | 16.237276137117483 | 1.06 | data_cache_benchmarks.DataCacheBenchmarks.track_cache_hsc1k_hyrax_size_undercount |
| 38.7±0.3ms | 40.6±0.6ms | 1.05 | benchmarks.time_nb_obj_construct |
| 1.57G | 1.63G | 1.04 | vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb') |
| 421±0.6ms | 431±3ms | 1.02 | vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'qdrant') |
| 6.87±0.03s | 6.95±0.03s | 1.01 | benchmarks.time_database_connection_help |
| 6.84±0s | 6.93±0s | 1.01 | benchmarks.time_rebuild_manifest_help |
| 7.89±0.01s | 7.97±0.05s | 1.01 | data_cache_benchmarks.DataCacheBenchmarks.time_preload_cache_hsc1k |
| 114±0.4μs | 116±2μs | 1.01 | data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data |
| 1.16G | 1.17G | 1.01 | vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(2048, 'chromadb') |
| 6.85±0.01s | 6.83±0.01s | 1 | benchmarks.time_download_help |

Click here to view all benchmarks.

- Always return numpy objects
- Demonstrate in examples how to access and filter data
- Fix Cache so numpy objects with non-numpy base data containers are accounted for correctly
- Demonstrate how PIL Images can be reconstructed from MMU numpy arrays in example notebook
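The PIL round-trip mentioned in the last commit can be sketched as below, assuming NumPy and Pillow are available. The array shape and dtype are hypothetical stand-ins for real MMU image data; the point is that `np.asarray` / `Image.fromarray` are lossless for uint8 image arrays.

```python
import numpy as np
from PIL import Image  # Pillow

# Stand-in for an MMU image column after the getters convert to numpy;
# a real MMU cutout would have survey-specific shape and scaling.
arr = (np.random.default_rng(0).random((32, 32, 3)) * 255).astype(np.uint8)

img = Image.fromarray(arr)    # reconstruct a PIL Image from the array
round_trip = np.asarray(img)  # and back, as the dataset getters do

print(img.size, round_trip.shape, np.array_equal(arr, round_trip))
```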
@mtauraso
Collaborator Author

@copilot Can you re-review at the tip of the branch and fixup the tests. The code should be doing correct behavior as demonstrated by the notebooks now.

Contributor

Copilot AI commented Mar 28, 2026

@copilot Can you re-review at the tip of the branch and fixup the tests. The code should be doing correct behavior as demonstrated by the notebooks now.

Done in d19172f. Fixed the following based on the review feedback:

  1. Test docstrings – Added docstrings to all three test functions (satisfies D103 lint rule).
  2. pytest.raises – Replaced the manual try/except+flag in test_mmu_dataset_streaming_requires_max_samples with with pytest.raises(ValueError).
  3. with_format support in fakes – Added with_format method to _FakeMapDataset and a new _FakeIterableDataset class so the tests properly mirror the production code path that calls dataset.with_format("numpy").
  4. max_samples=False sentinel – mmu_dataset.py now treats False the same as None (per Hyrax TOML convention where key = false means "not set"), preventing accidental coercion to 0.
  5. Notebook nbsphinx metadata – Added {"nbsphinx": {"execute": "never"}} to all three MMU notebooks so the Sphinx docs build won't try to execute them.

@mtauraso
Collaborator Author

@copilot Great job on the tests, please fix remaining ruff errors.

Contributor

Copilot AI commented Mar 28, 2026

@mtauraso The model claude-sonnet-4.6 is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 5ea7f22d-42b9-45e0-b879-b474f3c66531

5 similar comments

@mtauraso mtauraso marked this pull request as ready for review March 28, 2026 01:12
Collaborator

@drewoldag drewoldag left a comment


Just a couple minor comments about the dataset class. Looking at the notebooks now.

"primary_id_field": "object_id",
"dataset_config": {
"MultimodalUniverseDataset": {
"split": "train",
Collaborator


Minor: should this be something other than "train" so that it matches the "infer" data group?

Collaborator Author


No, actually. MMU has "splits" in their data as uploaded to HF, but all of their data is in the train "split."

AFAICT the convention is that when you don't want to define a split on HF-uploaded data, you do exactly as MMU has done.

Comment thread src/hyrax/datasets/mmu_dataset.py Outdated

self.data_location = str(data_location)
dataset_settings = (
config.get("data_set", {}).get("MultimodalUniverseDataset", {}) if config is not None else {}
Collaborator


Why use .get()s here?

Collaborator Author


Fixed. The robot didn't fully understand the path by which default data would be injected, and I missed this when I took a pass at all its config.get() calls.

Comment thread src/hyrax/datasets/mmu_dataset.py
Comment thread src/hyrax/datasets/mmu_dataset.py Outdated
Collaborator

@drewoldag drewoldag left a comment


The notebooks look good. A single plot in each of the spectra and time-series notebooks would be perfect.

@drewoldag
Collaborator

Overall this looks good to me. It does make me start to question when it would make sense to think about breaking things out into separate packages.

Could this be a hyrax-mmu package? Or more generally a hyrax-huggingface package?

@mtauraso
Collaborator Author

mtauraso commented Apr 2, 2026

Yeah we probably want to start making packages for this stuff.

That'll also give people working on like... Euclid/HSC/etc a place to put their data set classes.

@mtauraso mtauraso merged commit ce50083 into main Apr 3, 2026
9 checks passed
@mtauraso mtauraso deleted the codex/create-dataset-class-for-mmu-access-wms89e branch April 3, 2026 21:24