[DataLoader] Remove pyiceberg fork dependency and ArrivalOrder API #504

Merged
cbb330 merged 1 commit into linkedin:main from cbb330:remove-arrival-order-dep
Mar 18, 2026
cbb330 commented Mar 18, 2026

Summary

Remove the temporary dependency on the sumedhsakdeo/iceberg-python fork and all APIs it exposed (ArrivalOrder, ScanOrder, TaskOrder, batch_size). Reverts pyiceberg to the stable pyiceberg~=0.11.0 release from PyPI.

The PEP 508 direct reference (pyiceberg @ git+https://github.com/sumedhsakdeo/iceberg-python@<sha>) gets baked into the published wheel's Requires-Dist metadata, so any downstream consumer that installs openhouse-dataloader is forced to resolve pyiceberg from a personal GitHub fork. This fails ELR in downstream consumer environments, which only permit dependencies sourced from approved registries (PyPI, internal Artifactory). The fork pin cannot be approved through ELR because it is not a published release of a recognized OSS project.
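A minimal sketch of how a consumer might spot this kind of pin before installing, assuming only the standard library. The helper names `find_direct_references` and `direct_references_of` are hypothetical and not part of openhouse-dataloader; the regex is a simplified approximation of the PEP 508 `name @ url` form, not a full parser.

```python
import re
from importlib import metadata

# Simplified PEP 508 direct reference: "<name>[extras] @ <url>".
# This is an approximation for illustration, not a complete PEP 508 grammar.
_DIRECT_REF = re.compile(
    r"^\s*[A-Za-z0-9][A-Za-z0-9._-]*\s*(\[[^\]]*\])?\s*@\s*\S+"
)


def find_direct_references(requires_dist):
    """Return the Requires-Dist entries that point at a URL instead of an index."""
    return [req for req in requires_dist if _DIRECT_REF.match(req)]


def direct_references_of(dist_name):
    """Inspect an installed distribution; returns [] if it is not installed."""
    try:
        requires = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []
    return find_direct_references(requires)


if __name__ == "__main__":
    # Illustrative entries only; the URL below is made up for the example.
    sample = [
        "pyiceberg @ git+https://github.com/example/iceberg-python@abc123",
        "pyiceberg~=0.11.0",
    ]
    print(find_direct_references(sample))
```

Running `direct_references_of("openhouse-dataloader")` in an environment where the package is installed should return an empty list once the dependency is a plain registry pin.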

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

pyproject.toml — Drop [tool.hatch.metadata] allow-direct-references, revert pyiceberg from the fork SHA back to pyiceberg~=0.11.0.

data_loader_split.py — Remove ArrivalOrder import and order=ArrivalOrder(...) kwarg from to_record_batches(). Remove batch_size parameter from DataLoaderSplit.__init__.

data_loader.py — Remove batch_size parameter from OpenHouseDataLoader.__init__.

uv.lock — Regenerated to resolve pyiceberg from PyPI.

Tests — Delete test_arrival_order.py (entire file tested fork-only APIs). Remove batch_size tests from test_data_loader.py, test_data_loader_split.py, and integration_tests.py.
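For readers unfamiliar with the `~=` operator used in the new pin: under PEP 440, `pyiceberg~=0.11.0` is a compatible-release specifier, equivalent to `>=0.11.0, ==0.11.*`. The sketch below approximates that rule for plain numeric versions only (no pre-releases, post-releases, or epochs). The function names are hypothetical, and a real project would normally rely on the `packaging` library rather than this hand-rolled check.

```python
def parse_release(version):
    """Parse a plain numeric version such as "0.11.0" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate, pinned):
    """Approximate PEP 440 `~=pinned` for plain numeric versions.

    `~=0.11.0` means `>=0.11.0` and `==0.11.*`: every component of the pin
    except the last must match, and the candidate must not be older.
    """
    cand, pin = parse_release(candidate), parse_release(pinned)
    if len(pin) < 2:
        raise ValueError("~= needs at least two release components")
    prefix = pin[:-1]
    # Pad with zeros so that "0.11" compares equal to "0.11.0".
    width = max(len(cand), len(pin))
    cand_padded = cand + (0,) * (width - len(cand))
    pin_padded = pin + (0,) * (width - len(pin))
    return cand_padded[: len(prefix)] == prefix and cand_padded >= pin_padded


if __name__ == "__main__":
    for version in ("0.10.9", "0.11.0", "0.11.3", "0.12.0"):
        print(version, satisfies_compatible_release(version, "0.11.0"))
```

In practice this means patch releases in the 0.11 series are accepted, while 0.12.0 would require an explicit pin bump.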

Testing Done

  • Updated existing tests to reflect the changes made.

make verify passes: all 135 tests pass, and the lint, format, and mypy checks are green.

Additional Information

  • Breaking Changes

batch_size is removed from OpenHouseDataLoader and DataLoaderSplit. Any callers passing batch_size will need to remove that argument.
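For callers that build constructor arguments dynamically (for example from a config file), one possible migration aid is to filter out the removed argument before calling the constructor. This is only a sketch: `strip_removed_kwargs` is a hypothetical helper that does not exist in openhouse-dataloader, and the `table` key in the usage example is a made-up placeholder, since the full constructor signatures are not shown in this PR.

```python
# Keyword arguments removed by this change (per the PR description).
REMOVED_KWARGS = frozenset({"batch_size"})


def strip_removed_kwargs(kwargs):
    """Return (cleaned kwargs, sorted names of the arguments that were dropped)."""
    dropped = sorted(kwargs.keys() & REMOVED_KWARGS)
    cleaned = {key: value for key, value in kwargs.items() if key not in REMOVED_KWARGS}
    return cleaned, dropped


if __name__ == "__main__":
    # "table" is a hypothetical key used only for illustration.
    legacy_kwargs = {"table": "db.tbl", "batch_size": 1024}
    cleaned, dropped = strip_removed_kwargs(legacy_kwargs)
    if dropped:
        print(f"Ignoring removed arguments: {dropped}")
    # The cleaned dict can then be passed on, e.g. OpenHouseDataLoader(**cleaned).
    print(cleaned)
```

Callers that pass `batch_size` as a literal keyword argument are better served by deleting it at the call site; a filter like this mainly helps while configs shared across versions are being updated.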

Revert pyiceberg from the sumedhsakdeo/iceberg-python fork back to the
stable pyiceberg~=0.11.0 release from PyPI. Remove all usage of the
fork-only APIs: ArrivalOrder, ScanOrder, TaskOrder, and the batch_size
parameter on DataLoaderSplit and OpenHouseDataLoader.

- pyproject.toml: drop allow-direct-references, pin pyiceberg~=0.11.0
- data_loader_split.py: remove ArrivalOrder import and order= kwarg,
  remove batch_size param
- data_loader.py: remove batch_size param
- Delete test_arrival_order.py
- Remove batch_size tests from test_data_loader.py,
  test_data_loader_split.py, and integration_tests.py
- Regenerate uv.lock

cbb330 merged commit aea66d1 into linkedin:main on Mar 18, 2026
2 checks passed