feat(elt-common): Extend schema generation#367
Conversation
Handle schemas with nested fields (structs and lists), and improves validation when evolving schemas. ref #321
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe ChangesIceberg Schema Field ID and Evolution
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
elt-common/tests/unit_tests/iceberg/test_schema.py (2)
129-142: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winAvoid set-based expectations for evolved schema field names.
Using sets here hides ordering regressions; order is part of the compatibility semantics you’re testing elsewhere. Prefer ordered expectations and list comparison.
Suggested adjustment
`@pytest.mark.parametrize`( - ["iceberg_field_idxs", "expected_new_field_names"], + ["iceberg_field_idxs", "expected_field_names"], [ - ([], {"row_id", "entry_name", "entry_timestamp", "entry_weight"}), + ([], ["row_id", "entry_name", "entry_timestamp", "entry_weight"]), ( [0, 1, 2], - {"row_id", "entry_name", "entry_timestamp", "entry_weight"}, + ["row_id", "entry_name", "entry_timestamp", "entry_weight"], ), - ([0, 1, 2, 3], {}), + ([0, 1, 2, 3], None), ], ) def test_evolve_schema( - arrow_schema: pa.Schema, iceberg_field_idxs: list[int], expected_new_field_names + arrow_schema: pa.Schema, iceberg_field_idxs: list[int], expected_field_names ): existing_fields = [iceberg_fields[i] for i in iceberg_field_idxs] existing_schema = Schema(*existing_fields) @@ - if expected_new_field_names: + if expected_field_names is not None: assert schema_with_new_fields is not None - assert {f.name for f in schema_with_new_fields.fields} == expected_new_field_names + assert [f.name for f in schema_with_new_fields.fields] == expected_field_names else: assert schema_with_new_fields is None🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@elt-common/tests/unit_tests/iceberg/test_schema.py` around lines 129 - 142, The test_evolve_schema test is using set literals for expected_new_field_names parameter values, which masks ordering regressions since sets are unordered. Replace all set literals (using curly braces) with list literals (using square brackets) for the expected_new_field_names values in the parametrize decorator. This ensures the test validates that evolved schema fields maintain the correct order, which is part of the compatibility semantics being tested.
68-81: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winStrengthen nested mapping assertions beyond root
StructType/ListType.These new cases only verify the outer class, so nested-field mapping regressions (including inner shape/IDs) could still pass. Please add at least one deep assertion per complex case.
Proposed test hardening
`@pytest.mark.parametrize`( ["arrow_type", "expected_type"], [ ... ], ) def test_returns_expected_iceberg_type(arrow_type, expected_type): result = arrow_type_to_iceberg(arrow_type) assert isinstance(result, expected_type) + + +def test_nested_struct_mapping_preserves_inner_fields(): + arrow_type = pa.struct([("nested", pa.struct([("test", pa.int32())]))]) + result = arrow_type_to_iceberg(arrow_type) + + assert isinstance(result, StructType) + assert result.fields[0].name == "nested" + assert isinstance(result.fields[0].field_type, StructType) + assert result.fields[0].field_type.fields[0].name == "test"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@elt-common/tests/unit_tests/iceberg/test_schema.py` around lines 68 - 81, The parametrized test cases for complex nested structures (including pa.list_ with nested pa.struct containing additional fields, and deeply nested pa.struct cases) currently only verify that the mapped result is an instance of the outer type class (StructType or ListType), but they do not validate the properties of the nested fields themselves. To fix this, add deeper assertions for each complex test case that verify not only the outer type but also the structure and properties of nested fields, including field names, field types, and any structural identifiers, so that regressions in nested field mapping cannot slip through the tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@elt-common/src/elt_common/iceberg/schema.py`:
- Around line 125-156: The evolve_schema function creates new_iceberg_schema
without preserving the identifier_field_ids from the original iceberg_schema,
causing the schema equality check to always fail and the returned schema to lose
identifier field metadata. After creating new_iceberg_schema via
create_schema(new_arrow_schema), explicitly copy the identifier_field_ids from
the original iceberg_schema to the new_iceberg_schema before performing the
equality comparison on line 127 and before returning the evolved schema on line
156. This ensures backwards compatibility is properly detected and identifier
semantics are preserved.
- Around line 168-172: The get_max_field_id function will crash with a
ValueError when processing a StructType with empty struct_fields because calling
max() on an empty sequence is invalid. To fix this, add a check before calling
max() in the StructType handling block to detect when struct_fields is empty,
and in that case return f.field_id as a fallback value instead of attempting to
compute the maximum from subfields.
In `@elt-common/tests/unit_tests/iceberg/test_schema.py`:
- Around line 179-180: The test for evolve_schema with incompatible schemas uses
pytest.raises(ValueError) without validating the specific error message, which
can mask regressions if the function raises ValueError for the wrong reason. Add
a match parameter to the pytest.raises call to verify that the ValueError
contains the expected incompatibility error message when evolve_schema is called
with incompatible schemas, ensuring the test only passes when the correct
validation error is raised.
---
Nitpick comments:
In `@elt-common/tests/unit_tests/iceberg/test_schema.py`:
- Around line 129-142: The test_evolve_schema test is using set literals for
expected_new_field_names parameter values, which masks ordering regressions
since sets are unordered. Replace all set literals (using curly braces) with
list literals (using square brackets) for the expected_new_field_names values in
the parametrize decorator. This ensures the test validates that evolved schema
fields maintain the correct order, which is part of the compatibility semantics
being tested.
- Around line 68-81: The parametrized test cases for complex nested structures
(including pa.list_ with nested pa.struct containing additional fields, and
deeply nested pa.struct cases) currently only verify that the mapped result is
an instance of the outer type class (StructType or ListType), but they do not
validate the properties of the nested fields themselves. To fix this, add deeper
assertions for each complex test case that verify not only the outer type but
also the structure and properties of nested fields, including field names, field
types, and any structural identifiers, so that regressions in nested field
mapping cannot slip through the tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1d290c87-6575-4ea9-85ba-117b63eb420e
📒 Files selected for processing (2)
elt-common/src/elt_common/iceberg/schema.pyelt-common/tests/unit_tests/iceberg/test_schema.py
martyngigg
left a comment
There was a problem hiding this comment.
A nice set of changes for schema evolution and list/struct iceberg support. Just a couple of quick questions to ensure I understand what's happening.
ref #321
Handle schemas with nested fields (structs and lists), and improve validation when evolving schemas.
This came out of looking at porting the
statusdisplaypipeline to ELT. The data it ingests includes a nested field - 'cycles' have a field which is a list of 'phases'. DLT has been importing those as two separate tables (cyclesandcycles__phases), and attaching ID fields which are used for joining them back together when running DBT transforms. After a small discussion we decided it would be simplest to ingest it all into a single table, which this PR enables.Extract class
This is the extract class I've been using to guide this, based on the data returned for the
statusdisplayingest;Summary by CodeRabbit
Release Notes
Bug Fixes
Tests