feat(elt-common): Extend schema generation#367

Merged
WHTaylor merged 4 commits into main from 321-nested-fields
Jun 24, 2026

Conversation

@WHTaylor WHTaylor commented Jun 23, 2026

Contributor

ref #321

Handle schemas with nested fields (structs and lists), and improve validation when evolving schemas.

This came out of looking at porting the statusdisplay pipeline to ELT. The data it ingests includes a nested field: each 'cycle' has a field containing a list of 'phases'. DLT has been importing those as two separate tables (cycles and cycles__phases) and attaching ID fields, which are used to join them back together when running DBT transforms. After a brief discussion we decided it would be simplest to ingest everything into a single table, which this PR enables.
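For readers unfamiliar with the two layouts, here is a minimal plain-Python sketch of the difference. It does not use DLT; `split_into_tables`, `parent_id` and `list_idx` are hypothetical names chosen for illustration, and the sample rows are a cut-down version of the statusdisplay data.

```python
# Cut-down sample shaped like the 'cycles' records described above.
cycles = [
    {"id": "2025-1", "phases": [{"type": "user-time", "target": 1}]},
    {
        "id": "2025-2",
        "phases": [
            {"type": "user-time", "target": 1},
            {"type": "user-time", "target": 2},
        ],
    },
]


def split_into_tables(rows, child_key="phases"):
    """Normalise nested rows into a parent table and a child table.

    Each child row carries `parent_id` and `list_idx` columns (illustrative
    names) so the two tables can be joined back together later.
    """
    parents, children = [], []
    for row in rows:
        # The parent table keeps every column except the nested list.
        parents.append({k: v for k, v in row.items() if k != child_key})
        for idx, child in enumerate(row.get(child_key, [])):
            children.append({"parent_id": row["id"], "list_idx": idx, **child})
    return parents, children


# Two-table layout: a join on parent_id is needed to reassemble a cycle.
parents, children = split_into_tables(cycles)

# Single-table layout: the rows are kept as they are, with 'phases' stored
# as a list-of-struct column, so no join keys are required.
single_table = cycles
```

With the single-table layout the nested column has to be representable in the target schema, which is why nested struct and list support is needed.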

Extract class

This is the extract class I've been using to guide this, based on the data returned for the statusdisplay ingest:

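import io
import json
from collections.abc import Iterator

import pyarrow.json

# BaseExtract, ResourceProperties and ResourceWriteProperties are provided by
# elt-common (import path not shown).


class Extract(BaseExtract):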
    def extract_resource_properties(self) -> Iterator[tuple[str, ResourceProperties]]:
        yield (
            "testing",
            ResourceProperties(
                extractor=some_test_values,
                write_properties=ResourceWriteProperties(write_mode="replace"),
                watermark_column=None
            )
        )


def some_test_values(_):
    # pretend we've fetched data, rather than hardcoding it
    newline_delimited = "\n".join(json.dumps(row) for row in data)

    with io.BytesIO(newline_delimited.encode()) as f:
        yield pyarrow.json.read_json(f)


data = [
    {
        "id": "2025-1",
        "label": "2025/1",
        "status": "completed",
        "phases": [
            {
                "type": "user-time",
                "target": 1,
                "start": "1992-10-07 07:30:00",
                "end": "1992-11-08 00:00:00"
            }
        ]
    },
    {
        "id": "2025-2",
        "label": "2025/2",
        "status": "completed",
        "phases": [
            {
                "type": "user-time",
                "target": 1,
                "start": "2025-10-07 07:30:00",
                "end": "2025-11-08 00:00:00"
            },
            {
                "type": "user-time",
                "target": 2,
                "start": "2025-10-07 07:30:00",
                "end": "2025-11-08 00:00:00"
            }
        ]
    }
]

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved consistency of Iceberg field ID assignment across nested data types (lists, structs).
    • Updated schema evolution handling to validate backwards compatibility and reject incompatible changes (e.g., removals, renames, type/requiredness changes).
  • Tests

    • Expanded coverage for nested Arrow→Iceberg schema conversions (including nested list/struct combinations).
    • Strengthened schema evolution tests with explicit compatibility and incompatibility scenarios.

Handle schemas with nested fields (structs and lists), and improve
validation when evolving schemas.

ref #321
@WHTaylor WHTaylor requested a review from a team as a code owner June 23, 2026 16:49
coderabbitai Bot commented Jun 23, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0cc42556-bef5-42c5-8cf0-bae08eaa1938

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

The arrow_type_to_iceberg function gains a field_id parameter and recursive handling for list and struct Arrow types. A new get_max_field_id helper computes the maximum field ID in a nested tree. create_schema uses a running ID counter, and evolve_schema is rewritten to validate backwards compatibility rather than appending columns. Tests are expanded accordingly.
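The ID bookkeeping described here can be sketched with plain dataclasses. This is a simplified model under stated assumptions, not the code in schema.py and not the pyiceberg API: `Field`, `Struct`, `ListOf`, `convert_type`, `max_field_id` and this `create_schema` are hypothetical stand-ins, and type specs are written as strings (primitives), dicts (structs) and one-element lists (lists).

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Field:
    field_id: int
    name: str
    type: "FieldType"


@dataclass
class Struct:
    fields: list[Field]


@dataclass
class ListOf:
    element_id: int
    element: "FieldType"


FieldType = Union[str, Struct, ListOf]


def convert_type(spec, field_id):
    """Convert a type spec, giving nested fields IDs greater than `field_id`.

    Returns the converted type and the highest ID handed out so far.
    """
    if isinstance(spec, dict):  # struct: one ID per member, depth first
        fields, last = [], field_id
        for name, sub_spec in spec.items():
            member_id = last + 1
            sub_type, last = convert_type(sub_spec, member_id)
            fields.append(Field(member_id, name, sub_type))
        return Struct(fields), last
    if isinstance(spec, list):  # list: the element gets its own ID
        element_id = field_id + 1
        element, last = convert_type(spec[0], element_id)
        return ListOf(element_id, element), last
    return spec, field_id  # primitive: no extra IDs needed


def max_field_id(field: Field) -> int:
    """Highest ID used anywhere in the tree rooted at `field`."""
    if isinstance(field.type, Struct):
        return max(
            (max_field_id(f) for f in field.type.fields), default=field.field_id
        )
    if isinstance(field.type, ListOf):
        return max_field_id(Field(field.type.element_id, "element", field.type.element))
    return field.field_id


def create_schema(columns: dict) -> list[Field]:
    """Assign top-level IDs with a running counter that skips nested IDs."""
    fields, next_id = [], 1
    for name, spec in columns.items():
        converted, _ = convert_type(spec, next_id)
        field = Field(next_id, name, converted)
        fields.append(field)
        next_id = max_field_id(field) + 1
    return fields


schema = create_schema(
    {"id": "string", "phases": [{"type": "string", "target": "long"}], "status": "string"}
)
```

In this model `phases` takes ID 2, its list element 3, and the struct members 4 and 5, so the next top-level column starts at 6 rather than 3. Assuming sequential top-level IDs would have collided with the nested ones.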

Changes

Iceberg Schema Field ID and Evolution

| Layer / File(s) | Summary |
|---|---|
| arrow_type_to_iceberg signature, nested type handling, and get_max_field_id (elt-common/src/elt_common/iceberg/schema.py) | Signature updated to accept field_id: int = 1 and return IcebergType. List and struct Arrow types are now recursively mapped with propagated/incremented field IDs. arrow_field_to_iceberg passes column_id + 1 into the mapper. get_max_field_id added to compute the deepest field ID in a NestedField tree (primitive, struct, list). itertools import removed. |
| create_schema ID counter and evolve_schema compatibility validation (elt-common/src/elt_common/iceberg/schema.py) | create_schema advances a running col_id by get_max_field_id + 1 per field instead of assuming sequential top-level IDs. evolve_schema replaced: generates a full candidate schema, returns None when unchanged, raises ValueError on removals/renames/type changes/requiredness changes, and returns the new schema only for pure additions. |
| Test fixtures, complex type cases, and incompatible evolution tests (elt-common/tests/unit_tests/iceberg/test_schema.py) | Shared arrow_fields and iceberg_fields module-level constants introduced. test_returns_expected_iceberg_type extended with struct, nested struct, and list-of-struct parametrised cases. test_evolve_schema updated to select fields by index and assert expected new field names. New test_evolve_schema_incompatible parametrised test asserts ValueError for removals, reordering, and type/property changes. |
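The evolve_schema contract summarised above (None when unchanged, ValueError on incompatible changes, new schema for pure additions) can be illustrated with a small stand-alone sketch. It assumes a flat schema modelled as a list of tuples; `Column` and `evolve` are hypothetical names, and the real function works on Arrow and Iceberg schema objects instead.

```python
from typing import NamedTuple, Optional


class Column(NamedTuple):
    name: str
    type: str
    required: bool = False


def evolve(existing: list[Column], candidate: list[Column]) -> Optional[list[Column]]:
    """Return `candidate` only if it purely appends columns to `existing`.

    Returns None when nothing changed and raises ValueError for removals,
    renames, reordering, and type or requiredness changes.
    """
    if candidate == existing:
        return None
    if len(candidate) < len(existing):
        raise ValueError("incompatible schema change: columns were removed")
    # Every existing column must appear unchanged at the same position.
    for old, new in zip(existing, candidate):
        if old != new:
            raise ValueError(f"incompatible schema change: {old} became {new}")
    return candidate


existing = [Column("row_id", "long", True), Column("entry_name", "string")]
with_addition = existing + [Column("entry_weight", "double")]

unchanged_result = evolve(existing, list(existing))
added_result = evolve(existing, with_addition)
```

Comparing position by position is a deliberate simplification: it treats a rename and a reorder the same way, which matches the behaviour described here of rejecting both.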

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 Hoppity-hop through fields nested deep,
IDs now flow where structs and lists sleep.
No more appending when schemas collide—
A ValueError warns: incompatible stride!
The iceberg grows safely, one layer at a time,
And the rabbit checks schemas with reason and rhyme. ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'feat(elt-common): Extend schema generation' is vague and overly broad, failing to specify the key improvement (nested field support and schema validation). | Consider a more specific title such as 'feat(elt-common): Support nested fields in Iceberg schema generation' to clearly convey the main technical advancement. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🧹 Nitpick comments (2)
elt-common/tests/unit_tests/iceberg/test_schema.py (2)

129-142: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Avoid set-based expectations for evolved schema field names.

Using sets here hides ordering regressions; order is part of the compatibility semantics you’re testing elsewhere. Prefer ordered expectations and list comparison.
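A quick stand-alone illustration of why the set comparison is weaker. The field names are taken from the test parameters; the reordered list is made up for the example.

```python
expected_order = ["row_id", "entry_name", "entry_timestamp", "entry_weight"]
# Hypothetical regression: the first two fields have swapped places.
reordered = ["entry_name", "row_id", "entry_timestamp", "entry_weight"]

# A set-based assertion still passes, because sets ignore ordering.
set_check_passes = set(reordered) == set(expected_order)

# A list-based assertion fails, which is what the test should do here.
list_check_passes = reordered == expected_order
```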

Suggested adjustment
 @pytest.mark.parametrize(
-    ["iceberg_field_idxs", "expected_new_field_names"],
+    ["iceberg_field_idxs", "expected_field_names"],
     [
-        ([], {"row_id", "entry_name", "entry_timestamp", "entry_weight"}),
+        ([], ["row_id", "entry_name", "entry_timestamp", "entry_weight"]),
         (
             [0, 1, 2],
-            {"row_id", "entry_name", "entry_timestamp", "entry_weight"},
+            ["row_id", "entry_name", "entry_timestamp", "entry_weight"],
         ),
-        ([0, 1, 2, 3], {}),
+        ([0, 1, 2, 3], None),
     ],
 )
 def test_evolve_schema(
-    arrow_schema: pa.Schema, iceberg_field_idxs: list[int], expected_new_field_names
+    arrow_schema: pa.Schema, iceberg_field_idxs: list[int], expected_field_names
 ):
     existing_fields = [iceberg_fields[i] for i in iceberg_field_idxs]
     existing_schema = Schema(*existing_fields)
@@
-    if expected_new_field_names:
+    if expected_field_names is not None:
         assert schema_with_new_fields is not None
-        assert {f.name for f in schema_with_new_fields.fields} == expected_new_field_names
+        assert [f.name for f in schema_with_new_fields.fields] == expected_field_names
     else:
         assert schema_with_new_fields is None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@elt-common/tests/unit_tests/iceberg/test_schema.py` around lines 129 - 142,
The test_evolve_schema test is using set literals for expected_new_field_names
parameter values, which masks ordering regressions since sets are unordered.
Replace all set literals (using curly braces) with list literals (using square
brackets) for the expected_new_field_names values in the parametrize decorator.
This ensures the test validates that evolved schema fields maintain the correct
order, which is part of the compatibility semantics being tested.

68-81: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Strengthen nested mapping assertions beyond root StructType/ListType.

These new cases only verify the outer class, so nested-field mapping regressions (including inner shape/IDs) could still pass. Please add at least one deep assertion per complex case.

Proposed test hardening
 @pytest.mark.parametrize(
     ["arrow_type", "expected_type"],
     [
         ...
     ],
 )
 def test_returns_expected_iceberg_type(arrow_type, expected_type):
     result = arrow_type_to_iceberg(arrow_type)
     assert isinstance(result, expected_type)
+
+
+def test_nested_struct_mapping_preserves_inner_fields():
+    arrow_type = pa.struct([("nested", pa.struct([("test", pa.int32())]))])
+    result = arrow_type_to_iceberg(arrow_type)
+
+    assert isinstance(result, StructType)
+    assert result.fields[0].name == "nested"
+    assert isinstance(result.fields[0].field_type, StructType)
+    assert result.fields[0].field_type.fields[0].name == "test"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@elt-common/tests/unit_tests/iceberg/test_schema.py` around lines 68 - 81, The
parametrized test cases for complex nested structures (including pa.list_ with
nested pa.struct containing additional fields, and deeply nested pa.struct
cases) currently only verify that the mapped result is an instance of the outer
type class (StructType or ListType), but they do not validate the properties of
the nested fields themselves. To fix this, add deeper assertions for each
complex test case that verify not only the outer type but also the structure and
properties of nested fields, including field names, field types, and any
structural identifiers, so that regressions in nested field mapping cannot slip
through the tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@elt-common/src/elt_common/iceberg/schema.py`:
- Around line 125-156: The evolve_schema function creates new_iceberg_schema
without preserving the identifier_field_ids from the original iceberg_schema,
causing the schema equality check to always fail and the returned schema to lose
identifier field metadata. After creating new_iceberg_schema via
create_schema(new_arrow_schema), explicitly copy the identifier_field_ids from
the original iceberg_schema to the new_iceberg_schema before performing the
equality comparison on line 127 and before returning the evolved schema on line
156. This ensures backwards compatibility is properly detected and identifier
semantics are preserved.
- Around line 168-172: The get_max_field_id function will crash with a
ValueError when processing a StructType with empty struct_fields because calling
max() on an empty sequence is invalid. To fix this, add a check before calling
max() in the StructType handling block to detect when struct_fields is empty,
and in that case return f.field_id as a fallback value instead of attempting to
compute the maximum from subfields.

In `@elt-common/tests/unit_tests/iceberg/test_schema.py`:
- Around line 179-180: The test for evolve_schema with incompatible schemas uses
pytest.raises(ValueError) without validating the specific error message, which
can mask regressions if the function raises ValueError for the wrong reason. Add
a match parameter to the pytest.raises call to verify that the ValueError
contains the expected incompatibility error message when evolve_schema is called
with incompatible schemas, ensuring the test only passes when the correct
validation error is raised.
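The empty-struct failure mode flagged for get_max_field_id can be reproduced without any Iceberg types. The sketch below assumes the function reduces child IDs with max(); `unguarded_max` and `guarded_max` are hypothetical helpers, not functions from schema.py.

```python
def unguarded_max(child_ids: list[int]) -> int:
    # Mirrors calling max() directly on the sub-field IDs.
    return max(child_ids)


def guarded_max(own_id: int, child_ids: list[int]) -> int:
    # Falls back to the field's own ID when there are no sub-fields.
    return max(child_ids, default=own_id)


try:
    unguarded_max([])
    raised = False
except ValueError:
    # max() raises ValueError when given an empty sequence and no default.
    raised = True

empty_struct_id = guarded_max(7, [])
populated_struct_id = guarded_max(7, [8, 9])
```

Using the `default` argument is one way to apply the suggested fallback; an explicit emptiness check before calling max() would behave the same.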


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1d290c87-6575-4ea9-85ba-117b63eb420e

📥 Commits

Reviewing files that changed from the base of the PR and between 4b27a39 and ecccad8.

📒 Files selected for processing (2)
  • elt-common/src/elt_common/iceberg/schema.py
  • elt-common/tests/unit_tests/iceberg/test_schema.py

Comment thread elt-common/src/elt_common/iceberg/schema.py Outdated
Comment thread elt-common/src/elt_common/iceberg/schema.py
Comment thread elt-common/tests/unit_tests/iceberg/test_schema.py
@martyngigg martyngigg self-assigned this Jun 24, 2026

@martyngigg martyngigg left a comment

Member

A nice set of changes for schema evolution and list/struct iceberg support. Just a couple of quick questions to ensure I understand what's happening.

Comment thread elt-common/src/elt_common/iceberg/schema.py
Comment thread elt-common/src/elt_common/iceberg/schema.py
@WHTaylor WHTaylor merged commit 857eba7 into main Jun 24, 2026
4 checks passed
@WHTaylor WHTaylor deleted the 321-nested-fields branch June 24, 2026 10:44

2 participants