feat(elt-pipelines): Add initial project with example pipeline#368
feat(elt-pipelines): Add initial project with example pipeline#368WHTaylor wants to merge 7 commits into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughA new Changeselt-pipelines project setup and statusdisplay pipeline
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
elt-pipelines/.gitignore (1)
1-8: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winIgnore the local virtual environment directory.
The setup instructions create
.venv, but it is not ignored here, so it can be accidentally committed.Suggested patch
# ignore basic python artifacts .env +.venv/ **/__pycache__/ **/*.py[cod] **/*$py.class **/build/ **/*.egg-info/🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@elt-pipelines/.gitignore` around lines 1 - 8, The .gitignore entry set is missing the local virtual environment directory, so update the ignore rules to also exclude the project’s .venv folder alongside the existing Python artifacts. Keep the change in the same ignore list near the other environment/build entries so the setup-created virtual environment is not accidentally committed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@elt-pipelines/pipelines/ingest/accelerator/statusdisplay/statusdisplay.py`:
- Around line 33-38: The fetch() helper currently calls requests.get(CYCLES_URL)
without a timeout and only handles non-OK responses, so transport failures can
escape unhandled. Update fetch() to wrap the requests.get call in try/except
requests.RequestException, add a timeout to the request, and raise one
RuntimeError that includes the CYCLES_URL context for both request failures and
bad responses.
- Around line 26-30: The JSON loading path in statusdisplay’s fetch/read flow
needs an empty-input guard because pyarrow.json.read_json will fail on a
zero-length stream. Update the logic around clean(fetch()) and the yield
pyarrow.json.read_json(f) call to detect when no rows are returned and
immediately yield an empty table or return early instead of building and parsing
an empty buffer.
In `@elt-pipelines/README.md`:
- Around line 16-22: The setup instructions for the `uv sync` flow are
incomplete because `elt-pipelines` depends on the local sibling checkout of
`elt-common`. Update the README section that shows `uv venv`, `source
.venv/bin/activate`, and `uv sync` to explicitly tell users to clone or place
`elt-common` alongside `elt-pipelines` before running those commands, so the
editable local source can be resolved. Use the existing setup text in the README
and mention the dependency on `../elt-common` near the `uv sync` instructions.
---
Outside diff comments:
In `@elt-pipelines/.gitignore`:
- Around line 1-8: The .gitignore entry set is missing the local virtual
environment directory, so update the ignore rules to also exclude the project’s
.venv folder alongside the existing Python artifacts. Keep the change in the
same ignore list near the other environment/build entries so the setup-created
virtual environment is not accidentally committed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 91b01f08-3b41-40bf-bb15-c76f161ed326
⛔ Files ignored due to path filters (1)
elt-pipelines/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
elt-pipelines/.gitignoreelt-pipelines/README.mdelt-pipelines/pipelines/ingest/accelerator/statusdisplay/statusdisplay.pyelt-pipelines/pyproject.toml
|
Thanks for this. I'll take a look at the code shortly but to answer the questions:
I think for the pipelines here that's fine, at least for now. I'd been mostly hoping to avoid publishing to PyPI if possible as I wasn't really aiming to create a general purpose package for all of the world. In that case the naming is then more challenging. Given you can install with pip from a git url, including a versioned one, then I thought that was a fine way to go. It's pure Python so installing this way should be as easy as a package from PyPi.
Makes sense!
I was naming things after the system they came from and the source of the cycles is the system that supports the isis status display. Happy to change if it's found to be too confusing - I'm not particularly wedded to it. It's emphasizing the need for more documentation around this though! |
For some reason this didn't pick up when running the linter manually, but did fail in CI https://github.com/ISISNeutronMuon/analytics-data-platform/actions/runs/28166532207/job/83419815751. Making a purposefully incorrect change was enough to trigger the formatter
|
ref the last two commits, I was poking around the pre-commit set up (because the markdown-lint step is slightly annoyingly slow) and saw that the ruff linter/formatter needed to be specifically set up to include the new directory. It then didn't pick up the already existing incorrect formatting on commit because the file hadn't changed. |
| return data | ||
|
|
||
|
|
||
| if __name__ == "__main__": |
There was a problem hiding this comment.
I guess this facilitates easier debugging?
There was a problem hiding this comment.
Yeah, it was useful for quick checks whilst working on the pipeline. Me as of yesterday thought it'd be a good idea to leave it in as an example, but me as of today disagrees, so I've taken it out.
There was a problem hiding this comment.
My original thought here would be that the child directories of elt-pipelines would be named after the lakekeeper warehouse that the transformed models end up in, i.e. the cycles tables are associated with facility operationsso end up in thefacility_ops` warehouse.
For the FASE data our thinking was to have a separate warehouse given there are more access controls required for, e.g. who can access what. In the faciity_ops case the data can all be simply read only. It would also then be feasible to have separate repositories for each set of pipelines targeting a given warehouse.
What do you think about having:
elt-pipelines/
|-- facility_ops/
| |-- ingest/
| | |-- accelerator/
| |-- transform/
| | |-- # not here yet but would be...
| |-- pyproject.toml
| |-- .gitignore
| |-- ...
|-- other_warehouse
There was a problem hiding this comment.
child directories of elt-pipelines would be named after the lakekeeper warehouse that the transformed models end up in
For the FASE data our thinking was to have a separate warehouse
I think this makes sense, and it might be possible to also use the directories for configuring pyiceberg to control the destination warehouse (either using the directory name instead of getting the default catalog here, or putting some amount of the config into the directories).
It would also then be feasible to have separate repositories for each set of pipelines targeting a given warehouse
This feels like it'd fragment the project, especially given the use of the relative path for the elt-common dependency; what benefits do you see it having?
This is technically the only use of pydantic-settings in the project, so it could be removed as a dependency. However, any pipeline that wants to include custom configuration will need to extend BaseSettings, so I think we should leave the dependency as is.
ref #321
Creates an
elt-pipelinesproject with astatusdisplaypipeline, which useselt-commonto ingest data from the ISIS cycles endpoint. The pipeline can be run using the instructions from the README.elt-commonis currently included as a dependency inelt-pipelinesusing a relative path pointing at the package in the parent folder. This makes it easy to work on both locally, but means anything wanting to run pipelines needs both packages in its working directory; don't think it's a big deal, but maybe not ideal? Are we aiming to publishelt-commonto PyPI so we can use it as a 'normal' dependency?elt_cyclestable for testing purposes. Once we want to migrate to this pipeline in production, it should probably start ingesting intocyclesinstead, but because it's using a different schema from the current DLT pipeline we'll need to replace the table entirely, so there will be a bit of extra work needed at the timestatusdisplay? Should it change to something likecyclesorisiscycles?Summary by CodeRabbit
New Features
Documentation
Chores