feat(elt-pipelines): Add initial project with example pipeline#368

Open
WHTaylor wants to merge 7 commits into main from 321-pipelines-project

Conversation

@WHTaylor WHTaylor commented Jun 24, 2026

Contributor

ref #321

Creates an elt-pipelines project with a statusdisplay pipeline, which uses elt-common to ingest data from the ISIS cycles endpoint. The pipeline can be run using the instructions from the README.

  • elt-common is currently included as a dependency in elt-pipelines using a relative path pointing at the package in the parent folder. This makes it easy to work on both locally, but means anything wanting to run pipelines needs both packages in its working directory; don't think it's a big deal, but maybe not ideal? Are we aiming to publish elt-common to PyPI so we can use it as a 'normal' dependency?
  • The new pipeline is ingesting into an elt_cycles table for testing purposes. Once we want to migrate to this pipeline in production, it should probably start ingesting into cycles instead, but because it's using a different schema from the current DLT pipeline we'll need to replace the table entirely, so there will be a bit of extra work needed at the time
  • Something that only just occurred to me - why is it called statusdisplay? Should it change to something like cycles or isiscycles?

Summary by CodeRabbit

  • New Features

    • Added a new pipeline that fetches and formats status data for ingestion.
    • Introduced support for an optional extra with the required data-processing and HTTP libraries.
  • Documentation

    • Added a project README with setup steps, dependency installation, and example run instructions.
  • Chores

    • Added ignore rules for common Python, environment, and build artefacts.

@WHTaylor WHTaylor requested a review from a team as a code owner June 24, 2026 14:19
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6fa262d1-537b-4007-8f06-aee5d37a6c66

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

A new elt-pipelines Python package is introduced with pyproject.toml, .gitignore, and a README.md. The first pipeline implementation, statusdisplay.py, fetches accelerator cycle/phase data from a fixed API endpoint, reformats ISO date strings, and yields PyArrow tables via an Extract class.

Changes

elt-pipelines project setup and statusdisplay pipeline

| Layer | File(s) | Summary |
|---|---|---|
| Project scaffold: metadata, gitignore, and README | elt-pipelines/pyproject.toml, elt-pipelines/.gitignore, elt-pipelines/README.md | pyproject.toml declares package metadata, elt-common as an editable local dep, a statusdisplay optional extra (pyarrow, requests), and a dev group. .gitignore excludes Python caches and build artefacts. README.md documents uv-based setup and elt CLI usage. |
| statusdisplay ingestion pipeline | elt-pipelines/pipelines/ingest/accelerator/statusdisplay/statusdisplay.py | Adds Extract registering an elt_cycles replace-mode resource. fetch() performs HTTP GET with RuntimeError on failure. clean()/reformat() mutate phase start/end ISO strings to "%Y-%m-%d %H:%M:%S". Converted payload is read into a PyArrow table via pyarrow.json.read_json(). |
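
The date handling described in the summary can be sketched as follows. This is an illustration under assumptions, not the code in the PR: the payload shape (a list of cycles, each with a `phases` list holding `start` and `end` keys) is hypothetical, and only the target format string comes from the summary.

```python
from datetime import datetime

TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"  # format named in the summary


def reformat(iso_string):
    """Convert an ISO 8601 string such as 2026-06-24T14:19:00 to TARGET_FORMAT."""
    return datetime.fromisoformat(iso_string).strftime(TARGET_FORMAT)


def clean(cycles):
    """Rewrite each phase's start/end in place and return the same list."""
    for cycle in cycles:
        for phase in cycle.get("phases", []):  # hypothetical key names
            for key in ("start", "end"):
                if phase.get(key):  # skip missing or empty values
                    phase[key] = reformat(phase[key])
    return cycles
```

Note that this format string has no `%z`, so any UTC offset present in the input would be dropped from the output.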

Poem

A new pipeline burrows into the ground 🐇
With cycles and phases all neatly found,
Dates reformatted, arrows drawn tight,
The statusdisplay gleams in the night.
uv sync --extra statusdisplay — away we go,
Watch the elt-pipelines put on a show! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarises the main change: adding the initial elt-pipelines project with an example pipeline. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
elt-pipelines/.gitignore (1)

1-8: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Ignore the local virtual environment directory.

The setup instructions create .venv, but it is not ignored here, so it can be accidentally committed.

Suggested patch
 # ignore basic python artifacts
 .env
+.venv/
 **/__pycache__/
 **/*.py[cod]
 **/*$py.class
 **/build/
 **/*.egg-info/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@elt-pipelines/.gitignore` around lines 1 - 8, The .gitignore entry set is
missing the local virtual environment directory, so update the ignore rules to
also exclude the project’s .venv folder alongside the existing Python artifacts.
Keep the change in the same ignore list near the other environment/build entries
so the setup-created virtual environment is not accidentally committed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@elt-pipelines/pipelines/ingest/accelerator/statusdisplay/statusdisplay.py`:
- Around line 33-38: The fetch() helper currently calls requests.get(CYCLES_URL)
without a timeout and only handles non-OK responses, so transport failures can
escape unhandled. Update fetch() to wrap the requests.get call in try/except
requests.RequestException, add a timeout to the request, and raise one
RuntimeError that includes the CYCLES_URL context for both request failures and
bad responses.
- Around line 26-30: The JSON loading path in statusdisplay’s fetch/read flow
needs an empty-input guard because pyarrow.json.read_json will fail on a
zero-length stream. Update the logic around clean(fetch()) and the yield
pyarrow.json.read_json(f) call to detect when no rows are returned and
immediately yield an empty table or return early instead of building and parsing
an empty buffer.

In `@elt-pipelines/README.md`:
- Around line 16-22: The setup instructions for the `uv sync` flow are
incomplete because `elt-pipelines` depends on the local sibling checkout of
`elt-common`. Update the README section that shows `uv venv`, `source
.venv/bin/activate`, and `uv sync` to explicitly tell users to clone or place
`elt-common` alongside `elt-pipelines` before running those commands, so the
editable local source can be resolved. Use the existing setup text in the README
and mention the dependency on `../elt-common` near the `uv sync` instructions.

---

Outside diff comments:
In `@elt-pipelines/.gitignore`:
- Around line 1-8: The .gitignore entry set is missing the local virtual
environment directory, so update the ignore rules to also exclude the project’s
.venv folder alongside the existing Python artifacts. Keep the change in the
same ignore list near the other environment/build entries so the setup-created
virtual environment is not accidentally committed.
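
A minimal sketch of the two inline findings above, not the PR's actual code. The PR uses `requests`; this version swaps in the standard library's `urllib` so that it runs without third-party packages (with `requests`, the same shape would be `requests.get(url, timeout=...)` inside `try/except requests.RequestException`). The placeholder URL, the 10-second timeout, the `opener` parameter, and the `to_ndjson` helper are assumptions made for illustration.

```python
import io
import json
import urllib.error
import urllib.request

# Placeholder only: the real CYCLES_URL is defined in statusdisplay.py.
CYCLES_URL = "https://example.invalid/cycles"


def fetch(url=CYCLES_URL, timeout=10.0, opener=urllib.request.urlopen):
    """Return the decoded JSON payload, raising one RuntimeError on any failure."""
    try:
        with opener(url, timeout=timeout) as response:
            return json.load(response)
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError and timeouts; ValueError covers bad JSON.
        raise RuntimeError(f"Failed to fetch cycles from {url}: {exc}") from exc


def to_ndjson(rows):
    """Serialise rows as newline-delimited JSON, or return None when there are none."""
    if not rows:
        return None  # caller can return early or yield an empty table instead
    payload = "\n".join(json.dumps(row) for row in rows)
    return io.BytesIO(payload.encode("utf-8"))
```

The buffer returned by `to_ndjson` is the kind of file-like object that `pyarrow.json.read_json` accepts; checking for `None` first keeps an empty buffer from ever reaching the parser.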
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 91b01f08-3b41-40bf-bb15-c76f161ed326

📥 Commits

Reviewing files that changed from the base of the PR and between 857eba7 and 781e1d1.

⛔ Files ignored due to path filters (1)
  • elt-pipelines/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • elt-pipelines/.gitignore
  • elt-pipelines/README.md
  • elt-pipelines/pipelines/ingest/accelerator/statusdisplay/statusdisplay.py
  • elt-pipelines/pyproject.toml

Comment thread elt-pipelines/README.md
@martyngigg martyngigg self-assigned this Jun 25, 2026
@martyngigg

Member

Thanks for this. I'll take a look at the code shortly but to answer the questions:

* `elt-common` is currently included as a dependency in `elt-pipelines` using a relative path pointing at the package in the parent folder. This makes it easy to work on both locally, but means anything wanting to run pipelines needs both packages in its working directory; don't think it's a big deal, but maybe not ideal? Are we aiming to publish `elt-common` to PyPI so we can use it as a 'normal' dependency?

I think for the pipelines here that's fine, at least for now. I'd been mostly hoping to avoid publishing to PyPI if possible as I wasn't really aiming to create a general purpose package for all of the world. In that case the naming is then more challenging. Given you can install with pip from a git URL, including a versioned one, I thought that was a fine way to go. It's pure Python so installing this way should be as easy as a package from PyPI.

* The new pipeline is ingesting into an `elt_cycles` table for testing purposes. Once we want to migrate to this pipeline in production, it should probably start ingesting into `cycles` instead, but because it's using a different schema from the current DLT pipeline we'll need to replace the table entirely, so there will be a bit of extra work needed at the time

Makes sense!

* Something that only just occurred to me - why is it called `statusdisplay`? Should it change to something like `cycles` or `isiscycles`?

I was naming things after the system they came from and the source of the cycles is the system that supports the isis status display. Happy to change if it's found to be too confusing - I'm not particularly wedded to it. It's emphasizing the need for more documentation around this though!
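
To illustrate the git-URL approach mentioned above, here is a small sketch that builds a pip-style direct reference for a package living in a subdirectory of a repository. The repository URL is taken from the CI link in this thread; the `v0.1.0` tag and the `elt-common` subdirectory name are assumptions, not confirmed values.

```python
def git_requirement(name, repo_url, ref, subdirectory=None):
    """Build a direct-reference requirement pinned to a git tag, branch, or commit."""
    requirement = f"{name} @ git+{repo_url}@{ref}"
    if subdirectory:
        # Needed when the package is not at the repository root.
        requirement += f"#subdirectory={subdirectory}"
    return requirement


print(
    git_requirement(
        "elt-common",
        "https://github.com/ISISNeutronMuon/analytics-data-platform.git",
        "v0.1.0",  # hypothetical tag
        subdirectory="elt-common",  # assumed location of the package in the repo
    )
)
```

A string of this form can be passed to `pip install` or listed as a dependency, which pins consumers to a specific revision without publishing to PyPI.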

WHTaylor added 2 commits June 25, 2026 12:20
For some reason this didn't pick up when running the linter manually, but did fail in CI https://github.com/ISISNeutronMuon/analytics-data-platform/actions/runs/28166532207/job/83419815751. Making a purposefully incorrect change was enough to trigger the formatter
@WHTaylor

Contributor Author

ref the last two commits, I was poking around the pre-commit setup (because the markdown-lint step is slightly annoyingly slow) and saw that the ruff linter/formatter needed to be specifically set up to include the new directory. It then didn't pick up the already existing incorrect formatting on commit because the file hadn't changed.

return data


if __name__ == "__main__":

Member

I guess this facilitates easier debugging?

Contributor Author

Yeah, it was useful for quick checks whilst working on the pipeline. Me as of yesterday thought it'd be a good idea to leave it in as an example, but me as of today disagrees, so I've taken it out.

Member

My original thought here was that the child directories of elt-pipelines would be named after the lakekeeper warehouse that the transformed models end up in, i.e. the cycles tables are associated with facility operations, so end up in the `facility_ops` warehouse.

For the FASE data our thinking was to have a separate warehouse given there are more access controls required, e.g. for who can access what. In the `facility_ops` case the data can all be simply read-only. It would also then be feasible to have separate repositories for each set of pipelines targeting a given warehouse.

What do you think about having:

elt-pipelines/
|-- facility_ops/
|    |-- ingest/
|    |    |-- accelerator/
|    |-- transform/
|    |    |-- # not here yet but would be...
|    |-- pyproject.toml
|    |-- .gitignore
|    |-- ...
|-- other_warehouse

Contributor Author

child directories of elt-pipelines would be named after the lakekeeper warehouse that the transformed models end up in

For the FASE data our thinking was to have a separate warehouse

I think this makes sense, and it might be possible to also use the directories for configuring pyiceberg to control the destination warehouse (either using the directory name instead of getting the default catalog here, or putting some amount of the config into the directories).

It would also then be feasible to have separate repositories for each set of pipelines targeting a given warehouse

This feels like it'd fragment the project, especially given the use of the relative path for the elt-common dependency; what benefits do you see it having?
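
One way to read the directory-name idea above, as a sketch only: derive the warehouse from the first directory below `elt-pipelines/`. The `warehouse_for` helper and the example path are hypothetical, and how the name would then be handed to pyiceberg is left out because the thread does not specify it.

```python
from pathlib import PurePosixPath


def warehouse_for(pipeline_path, root="elt-pipelines"):
    """Return the directory directly below `root`, taken as the warehouse name."""
    parts = PurePosixPath(pipeline_path).parts
    if root not in parts:
        raise ValueError(f"{pipeline_path!r} is not under {root!r}")
    index = parts.index(root)
    # Require at least one more component after the warehouse directory,
    # so a file sitting directly in `root` is not mistaken for a warehouse.
    if index + 2 >= len(parts):
        raise ValueError(f"{pipeline_path!r} has no warehouse directory below {root!r}")
    return parts[index + 1]


# Hypothetical path following the layout proposed in this thread.
print(warehouse_for("elt-pipelines/facility_ops/ingest/accelerator/statusdisplay.py"))
```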

This is technically the only use of pydantic-settings in the project, so it could be removed as a dependency. However, any pipeline that wants to include custom configuration will need to extend BaseSettings, so I think we should leave the dependency as is.