Conversation
Codecov Report ❌ Patch coverage is

```
@@ Coverage Diff @@
##             main     #846      +/-   ##
==========================================
+ Coverage   66.21%   66.52%   +0.30%
==========================================
  Files          62       63       +1
  Lines        6412     6504      +92
==========================================
+ Hits         4246     4327      +81
- Misses       2166     2177      +11
==========================================
```

View full report in Codecov by Sentry.
Pull request overview
Adds first-class support for loading Multimodal Universe (MMU) datasets via Hugging Face datasets, alongside config/schema changes to preserve URI-style data_location values and accompanying docs/tests.
Changes:
- Introduces `MultimodalUniverseDataset` to load HF datasets (optionally streaming) and auto-register `get_*` accessors for columns (with sanitized aliases).
- Updates `DataRequestConfig` validation to preserve URI `data_location` values (e.g. `hf://`, `https://`) instead of resolving them to filesystem paths.
- Adds defaults, a dependency, docs notebooks, and unit tests for the new dataset and the URI preservation behavior.
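The URI-preservation change can be sketched roughly like this (a minimal sketch; the actual validator lives in `src/hyrax/config_schemas/data_request.py`, and the exact list of recognized schemes is an assumption):

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_data_location(value: str) -> str:
    # If the value carries a URI scheme (hf://, https://, ...), return it
    # unchanged instead of resolving it to a filesystem path.
    if urlparse(value).scheme in ("hf", "http", "https", "s3"):
        return value
    # Otherwise resolve to an absolute filesystem path as before.
    return str(Path(value).expanduser().resolve())
```

With this shape, `hf://MultimodalUniverse/plasticc` passes through untouched while a relative path like `./data` still resolves to an absolute path.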
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `src/hyrax/datasets/mmu_dataset.py` | New dataset implementation backed by HF `datasets.load_dataset`, with streaming + column getter registration. |
| `src/hyrax/datasets/__init__.py` | Exports `MultimodalUniverseDataset` from the datasets package. |
| `src/hyrax/config_schemas/data_request.py` | Preserves URI `data_location` values during validation via `urlparse`. |
| `src/hyrax/hyrax_default_config.toml` | Adds default `data_set.MultimodalUniverseDataset` settings (split, max_samples, streaming). |
| `pyproject.toml` | Adds `datasets` dependency for HF dataset loading. |
| `tests/hyrax/test_mmu_dataset.py` | New unit tests for HF URI normalization, limiting, and streaming validation. |
| `tests/hyrax/test_data_request_config.py` | Adds unit tests asserting URI `data_location` values are preserved. |
| `docs/pre_executed/mmu_images_with_hyrax.ipynb` | Example notebook showing MMU images usage via Hyrax. |
| `docs/pre_executed/mmu_spectra_with_hyrax.ipynb` | Example notebook showing MMU spectra usage via Hyrax. |
| `docs/pre_executed/mmu_time_series_with_hyrax.ipynb` | Example notebook showing MMU time-series usage via Hyrax. |
```python
try:
    MultimodalUniverseDataset(
        config={"data_set": {"MultimodalUniverseDataset": {"split": "train", "streaming": True}}},
        data_location="hf://MultimodalUniverse/plasticc",
    )
    raised = False
except ValueError:
    raised = True

assert raised
```
Prefer using pytest.raises(ValueError) here instead of manual try/except + boolean flag; it’s more idiomatic in this test suite and produces clearer failure output.
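A sketch of the suggested rewrite (the stand-in `build_dataset` is illustrative; the real test constructs `MultimodalUniverseDataset` with `streaming=True`):

```python
import pytest


def build_dataset(streaming: bool):
    # Stand-in for the constructor under test, which is assumed to
    # reject streaming=True for this configuration.
    if streaming:
        raise ValueError("streaming not supported for this configuration")
    return object()


def test_streaming_rejected():
    # pytest.raises replaces the manual try/except + boolean flag and
    # produces a clear failure message if no ValueError is raised.
    with pytest.raises(ValueError):
        build_dataset(streaming=True)
```

If the constructor stops raising, `pytest.raises` fails the test with a `DID NOT RAISE` message, which is more informative than `assert raised`.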
```json
  },
  "language_info": {
    "name": "python",
    "version": "3.11"
```

Because Sphinx uses nbsphinx and `nbsphinx_execute` defaults to `auto`, a new notebook with empty outputs/`execution_count` will be executed during docs builds (including pre-commit `sphinx-build`). Since this notebook depends on Hugging Face network access, add notebook-level metadata `{ "nbsphinx": { "execute": "never" } }` (or otherwise ensure outputs are pre-populated) to prevent execution.

Suggested change:

```diff
   "version": "3.11"
+  },
+  "nbsphinx": {
+    "execute": "never"
```
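Applied at the notebook's top level, the resulting metadata would look roughly like this (the `language_info` values come from the diff; the surrounding structure is the standard `.ipynb` layout, shown here as a sketch):

```json
{
  "metadata": {
    "language_info": {
      "name": "python",
      "version": "3.11"
    },
    "nbsphinx": {
      "execute": "never"
    }
  }
}
```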
The same comment and suggested change were left on the spectra and time-series notebooks.
- Always return numpy objects
- Demonstrate in examples how to access and filter data
- Fix Cache so numpy objects with non-numpy base data containers are accounted for correctly
- Demonstrate how PIL Images can be reconstructed from MMU numpy arrays in example notebook
@copilot Can you re-review at the tip of the branch and fixup the tests. The code should be doing correct behavior as demonstrated by the notebooks now. |
…s False sentinel, nbsphinx execute never

Agent-Logs-Url: https://github.com/lincc-frameworks/hyrax/sessions/bf603d2d-3c59-4d22-bdd1-0134542b2327
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Done in d19172f. Fixed the following based on the review feedback:
@copilot Great job on the tests, please fix remaining ruff errors. |
@mtauraso The model You can try again without specifying a model (just If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: |
5 similar comments
drewoldag left a comment:
Just a couple minor comments about the dataset class. Looking at the notebooks now.
```python
"primary_id_field": "object_id",
"dataset_config": {
    "MultimodalUniverseDataset": {
        "split": "train",
```
Minor: should this be something other than "train" so that it matches the "infer" data group?
No, actually. MMU has "splits" in their data as uploaded to HF, but all of their data is in the train "split."
AFAICT the convention is that when you don't want to define a split on HF-uploaded data, you do exactly as MMU has done.
```python
self.data_location = str(data_location)
dataset_settings = (
    config.get("data_set", {}).get("MultimodalUniverseDataset", {}) if config is not None else {}
)
```
Fixed. The robot didn't fully understand the path by which default data would be injected, and I missed this when I took a pass at all its `config.get()` calls.
drewoldag left a comment:
The notebooks look good. A single plot in the spectra and timeseries would be perfect.
Overall this looks good to me. It does make me start to question when it would make sense to think about breaking things out into separate packages. Could this be a hyrax-mmu package? Or more generally a hyrax-huggingface package?
Yeah we probably want to start making packages for this stuff. That'll also give people working on like... Euclid/HSC/etc a place to put their data set classes. |
Change Description

Enable loading Multimodal Universe (MMU) datasets through Hugging Face `datasets` within Hyrax to support images, spectra, and time-series data. Preserve URI-style `data_location` values (e.g. `hf://...` and `https://...`) instead of converting them to filesystem paths. Provide example notebooks and defaults to make the new dataset type easy to try.

Solution Description

- `MultimodalUniverseDataset` implementation in `src/hyrax/datasets/mmu_dataset.py` that loads HF datasets via `datasets.load_dataset`, supports streaming, split handling, and per-column getter registration with sanitized aliases. Getters always return numpy arrays (PIL Images are converted via `np.asarray`).
- `max_samples` parsing in `MultimodalUniverseDataset` treats `False` as `None` (Hyrax TOML sentinel convention where `key = false` means "not set"), preventing accidental coercion to `0`, which would produce an empty dataset.
- `src/hyrax/datasets/__init__.py` updated to export `MultimodalUniverseDataset`.
- `src/hyrax/config_schemas/data_request.py` updated to preserve URI schemes by detecting and returning URIs unchanged (added `urlparse` usage) when resolving `data_location`.
- Defaults in `src/hyrax/hyrax_default_config.toml` for `MultimodalUniverseDataset` (`split`, `max_samples`, `streaming`).
- `datasets` dependency added to `pyproject.toml` with an explanatory comment.
- Notebooks in `docs/pre_executed/` demonstrating image, spectra, and time-series MMU usage with Hyrax. All three notebooks include `"nbsphinx": {"execute": "never"}` metadata to prevent Sphinx docs builds from attempting to execute them (they require Hugging Face network access).
- `tests/hyrax/test_data_request_config.py` extended to validate URI preservation, and `tests/hyrax/test_mmu_dataset.py` covering HF URI handling, max-sample limiting, and streaming validation. Tests include docstrings (D103 compliance), use `pytest.raises` idiomatically, and use `_FakeMapDataset`/`_FakeIterableDataset` fakes with `with_format` support to correctly mirror the production code path.

Code Quality
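The `max_samples` sentinel handling described above can be sketched as follows (`parse_max_samples` is a hypothetical helper name; the real logic lives inside `MultimodalUniverseDataset`):

```python
def parse_max_samples(value):
    # In Hyrax TOML config, `max_samples = false` is the sentinel for
    # "not set". Checking `is False` before int() avoids coercing the
    # sentinel to 0, which would silently produce an empty dataset
    # (since int(False) == 0).
    if value is False or value is None:
        return None
    return int(value)
```

Note that `bool` is a subclass of `int` in Python, which is why a naive `int(value)` conversion would turn the `false` sentinel into a zero-sample limit.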