feat(elt-common): Extend schema generation by WHTaylor · Pull Request #367 · ISISNeutronMuon/analytics-data-platform

WHTaylor · 2026-06-23T16:49:29Z

Handle schemas with nested fields (structs and lists), and improve validation when evolving schemas.

This came out of looking at porting the statusdisplay pipeline to ELT. The data it ingests includes a nested field: each 'cycle' has a field which is a list of 'phases'. DLT has been importing those as two separate tables (cycles and cycles__phases) and attaching ID fields that are used to join them back together when running DBT transforms. After a short discussion we decided it would be simplest to ingest it all into a single table, which this PR enables.

Extract class

This is the extract class I've been using to guide this, based on the data returned for the statusdisplay ingest:

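import io
import json
from typing import Iterator

import pyarrow.json

# BaseExtract, ResourceProperties and ResourceWriteProperties are assumed to come
# from elt-common; their import path is not shown in this description.


class Extract(BaseExtract):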
    def extract_resource_properties(self) -> Iterator[tuple[str, ResourceProperties]]:
        yield (
            "testing",
            ResourceProperties(
                extractor=some_test_values,
                write_properties=ResourceWriteProperties(write_mode="replace"),
                watermark_column=None
            )
        )


def some_test_values(_):
    # pretend we've fetched data rather than hardcoding it
    newline_delimited = "\n".join(json.dumps(row) for row in data)

    with io.BytesIO(newline_delimited.encode()) as f:
        yield pyarrow.json.read_json(f)


data = [
    {
        "id": "2025-1",
        "label": "2025/1",
        "status": "completed",
        "phases": [
            {
                "type": "user-time",
                "target": 1,
                "start": "1992-10-07 07:30:00",
                "end": "1992-11-08 00:00:00"
            }
        ]
    },
    {
        "id": "2025-2",
        "label": "2025/2",
        "status": "completed",
        "phases": [
            {
                "type": "user-time",
                "target": 1,
                "start": "2025-10-07 07:30:00",
                "end": "2025-11-08 00:00:00"
            },
            {
                "type": "user-time",
                "target": 2,
                "start": "2025-10-07 07:30:00",
                "end": "2025-11-08 00:00:00"
            }
        ]
    }
]

Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved consistency of Iceberg field ID assignment across nested data types (lists, structs).
- Updated schema evolution handling to validate backwards compatibility and reject incompatible changes (e.g., removals, renames, type/requiredness changes).
Tests
- Expanded coverage for nested Arrow→Iceberg schema conversions (including nested list/struct combinations).
- Strengthened schema evolution tests with explicit compatibility and incompatibility scenarios.

Handle schemas with nested fields (structs and lists), and improve validation when evolving schemas. ref #321

coderabbitai · 2026-06-23T16:54:42Z

Walkthrough

The arrow_type_to_iceberg function gains a field_id parameter and recursive handling for list and struct Arrow types. A new get_max_field_id helper computes the maximum field ID in a nested tree. create_schema uses a running ID counter, and evolve_schema is rewritten to validate backwards compatibility rather than appending columns. Tests are expanded accordingly.
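As a rough illustration of the ID bookkeeping described in the walkthrough, the sketch below uses plain dataclasses as simplified stand-ins for Iceberg types. It is not the code from this PR and does not use pyiceberg: `Field`, `Struct`, `ListOf`, `to_type` and this version of `create_schema` are hypothetical names, and the input format (a `str` for a primitive, a `dict` for a struct, a one-element `list` for a list) is an assumption made only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Primitive:
    name: str  # e.g. "string" or "long"


@dataclass
class Struct:
    fields: list["Field"] = field(default_factory=list)


@dataclass
class ListOf:
    element_id: int
    element: "FieldType"


FieldType = Union[Primitive, Struct, ListOf]


@dataclass
class Field:
    field_id: int
    name: str
    field_type: FieldType


def _max_id_in_type(t: FieldType) -> int:
    """Largest ID used inside a type, or 0 if the type carries no IDs."""
    if isinstance(t, Struct):
        # default=0 keeps an empty struct from raising on max()
        return max((get_max_field_id(c) for c in t.fields), default=0)
    if isinstance(t, ListOf):
        return max(t.element_id, _max_id_in_type(t.element))
    return 0


def get_max_field_id(f: Field) -> int:
    """Largest ID used by a field or anything nested beneath it."""
    return max(f.field_id, _max_id_in_type(f.field_type))


def to_type(spec, next_id: int = 1) -> FieldType:
    """Map a type spec, handing out nested IDs starting at next_id (assumed scheme)."""
    if isinstance(spec, dict):  # struct: name -> spec
        fields = []
        for name, sub in spec.items():
            child = Field(next_id, name, to_type(sub, next_id + 1))
            fields.append(child)
            next_id = get_max_field_id(child) + 1
        return Struct(fields)
    if isinstance(spec, list):  # list: [element spec]
        return ListOf(next_id, to_type(spec[0], next_id + 1))
    return Primitive(spec)


def create_schema(columns: dict) -> list[Field]:
    """Assign top-level IDs with a running counter that skips nested IDs."""
    fields, col_id = [], 1
    for name, spec in columns.items():
        f = Field(col_id, name, to_type(spec, col_id + 1))
        fields.append(f)
        col_id = get_max_field_id(f) + 1
    return fields


# Shape loosely based on the 'cycles' data in the PR description,
# plus a hypothetical trailing column to show the counter skipping ahead.
cycles = {
    "id": "string",
    "label": "string",
    "status": "string",
    "phases": [{"type": "string", "target": "long", "start": "string", "end": "string"}],
    "notes": "string",
}
schema = create_schema(cycles)
print([f.field_id for f in schema])  # [1, 2, 3, 4, 10]
```

In this sketch the `phases` column takes ID 4, its list element 5 and the four struct members 6 to 9, so the next top-level column starts at 10 rather than 5.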

Changes

Iceberg Schema Field ID and Evolution

Layer / File(s)	Summary
`arrow_type_to_iceberg` signature, nested type handling, and `get_max_field_id` `elt-common/src/elt_common/iceberg/schema.py`	Signature updated to accept `field_id: int = 1` and return `IcebergType`. List and struct Arrow types are now recursively mapped with propagated/incremented field IDs. `arrow_field_to_iceberg` passes `column_id + 1` into the mapper. `get_max_field_id` added to compute the maximum field ID in a `NestedField` tree (primitive, struct, list). `itertools` import removed.
`create_schema` ID counter and `evolve_schema` compatibility validation `elt-common/src/elt_common/iceberg/schema.py`	`create_schema` advances a running `col_id` by `get_max_field_id + 1` per field instead of assuming sequential top-level IDs. `evolve_schema` replaced: generates a full candidate schema, returns `None` when unchanged, raises `ValueError` on removals/renames/type changes/requiredness changes, and returns the new schema only for pure additions.
Test fixtures, complex type cases, and incompatible evolution tests `elt-common/tests/unit_tests/iceberg/test_schema.py`	Shared `arrow_fields` and `iceberg_fields` module-level constants introduced. `test_returns_expected_iceberg_type` extended with struct, nested struct, and list-of-struct parametrised cases. `test_evolve_schema` updated to select fields by index and assert expected new field names. New `test_evolve_schema_incompatible` parametrised test asserts `ValueError` for removals, reordering, and type/property changes.
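The `evolve_schema` contract summarised in the table (return `None` when unchanged, raise `ValueError` on incompatible changes, return the new schema for pure additions) can be sketched as follows. This is a minimal stand-in, not the function from this PR: `Column` and `evolve` are hypothetical names, and comparing columns position by position is an assumption about how reordering and renames end up being rejected.

```python
from typing import NamedTuple, Optional


class Column(NamedTuple):
    name: str
    type: str
    required: bool = False


def evolve(existing: list[Column], candidate: list[Column]) -> Optional[list[Column]]:
    """Return None if unchanged, the candidate if it only appends columns,
    and raise ValueError for any other difference."""
    if candidate == existing:
        return None
    if len(candidate) < len(existing):
        raise ValueError("incompatible schema change: columns were removed")
    # Every existing column must appear unchanged, in the same position.
    for old, new in zip(existing, candidate):
        if old != new:
            raise ValueError(f"incompatible schema change: {old} -> {new}")
    return candidate


base = [Column("row_id", "long", True), Column("entry_name", "string")]
added = base + [Column("entry_weight", "double")]
renamed = [Column("row_id", "long", True), Column("name", "string")]

print(evolve(base, list(base)))      # None
print(evolve(base, added) == added)  # True
try:
    evolve(base, renamed)
except ValueError as exc:
    print(exc)
```

Treating anything other than an append as an error keeps the check simple, at the cost of rejecting changes that a more permissive policy might allow.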

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 Hoppity-hop through fields nested deep,
IDs now flow where structs and lists sleep.
No more appending when schemas collide—
A ValueError warns: incompatible stride!
The iceberg grows safely, one layer at a time,
And the rabbit checks schemas with reason and rhyme. ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'feat(elt-common): Extend schema generation' is vague and overly broad, failing to specify the key improvement (nested field support and schema validation).	Consider a more specific title such as 'feat(elt-common): Support nested fields in Iceberg schema generation' to clearly convey the main technical advancement.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

elt-common/tests/unit_tests/iceberg/test_schema.py (2)

129-142: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Avoid set-based expectations for evolved schema field names.

Using sets here hides ordering regressions; order is part of the compatibility semantics you’re testing elsewhere. Prefer ordered expectations and list comparison.

Suggested adjustment

 @pytest.mark.parametrize(
-    ["iceberg_field_idxs", "expected_new_field_names"],
+    ["iceberg_field_idxs", "expected_field_names"],
     [
-        ([], {"row_id", "entry_name", "entry_timestamp", "entry_weight"}),
+        ([], ["row_id", "entry_name", "entry_timestamp", "entry_weight"]),
         (
             [0, 1, 2],
-            {"row_id", "entry_name", "entry_timestamp", "entry_weight"},
+            ["row_id", "entry_name", "entry_timestamp", "entry_weight"],
         ),
-        ([0, 1, 2, 3], {}),
+        ([0, 1, 2, 3], None),
     ],
 )
 def test_evolve_schema(
-    arrow_schema: pa.Schema, iceberg_field_idxs: list[int], expected_new_field_names
+    arrow_schema: pa.Schema, iceberg_field_idxs: list[int], expected_field_names
 ):
     existing_fields = [iceberg_fields[i] for i in iceberg_field_idxs]
     existing_schema = Schema(*existing_fields)
@@
-    if expected_new_field_names:
+    if expected_field_names is not None:
         assert schema_with_new_fields is not None
-        assert {f.name for f in schema_with_new_fields.fields} == expected_new_field_names
+        assert [f.name for f in schema_with_new_fields.fields] == expected_field_names
     else:
         assert schema_with_new_fields is None
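To make the ordering point concrete, here is a small standalone check, separate from the test file, using the field names from the suggestion above. The `reordered` list is a made-up regression, not output from `evolve_schema`.

```python
expected = ["row_id", "entry_name", "entry_timestamp", "entry_weight"]
# Hypothetical regression: the first two fields come back swapped.
reordered = ["entry_name", "row_id", "entry_timestamp", "entry_weight"]

set_check_passes = set(reordered) == set(expected)
list_check_passes = reordered == expected

print(set_check_passes)   # True: the set comparison misses the swap
print(list_check_passes)  # False: the list comparison catches it

# Side note: `{}` is an empty dict, not an empty set, which is one more
# reason an explicit None reads more clearly for the "no change" case.
print(type({}).__name__)  # dict
```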

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@elt-common/tests/unit_tests/iceberg/test_schema.py` around lines 129 - 142,
The test_evolve_schema test is using set literals for expected_new_field_names
parameter values, which masks ordering regressions since sets are unordered.
Replace all set literals (using curly braces) with list literals (using square
brackets) for the expected_new_field_names values in the parametrize decorator.
This ensures the test validates that evolved schema fields maintain the correct
order, which is part of the compatibility semantics being tested.

68-81: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Strengthen nested mapping assertions beyond root StructType/ListType.

These new cases only verify the outer class, so nested-field mapping regressions (including inner shape/IDs) could still pass. Please add at least one deep assertion per complex case.

Proposed test hardening

 @pytest.mark.parametrize(
     ["arrow_type", "expected_type"],
     [
         ...
     ],
 )
 def test_returns_expected_iceberg_type(arrow_type, expected_type):
     result = arrow_type_to_iceberg(arrow_type)
     assert isinstance(result, expected_type)
+
+
+def test_nested_struct_mapping_preserves_inner_fields():
+    arrow_type = pa.struct([("nested", pa.struct([("test", pa.int32())]))])
+    result = arrow_type_to_iceberg(arrow_type)
+
+    assert isinstance(result, StructType)
+    assert result.fields[0].name == "nested"
+    assert isinstance(result.fields[0].field_type, StructType)
+    assert result.fields[0].field_type.fields[0].name == "test"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@elt-common/tests/unit_tests/iceberg/test_schema.py` around lines 68 - 81, The
parametrized test cases for complex nested structures (including pa.list_ with
nested pa.struct containing additional fields, and deeply nested pa.struct
cases) currently only verify that the mapped result is an instance of the outer
type class (StructType or ListType), but they do not validate the properties of
the nested fields themselves. To fix this, add deeper assertions for each
complex test case that verify not only the outer type but also the structure and
properties of nested fields, including field names, field types, and any
structural identifiers, so that regressions in nested field mapping cannot slip
through the tests.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@elt-common/src/elt_common/iceberg/schema.py`:
- Around line 125-156: The evolve_schema function creates new_iceberg_schema
without preserving the identifier_field_ids from the original iceberg_schema,
causing the schema equality check to always fail and the returned schema to lose
identifier field metadata. After creating new_iceberg_schema via
create_schema(new_arrow_schema), explicitly copy the identifier_field_ids from
the original iceberg_schema to the new_iceberg_schema before performing the
equality comparison on line 127 and before returning the evolved schema on line
156. This ensures backwards compatibility is properly detected and identifier
semantics are preserved.
- Around line 168-172: The get_max_field_id function will crash with a
ValueError when processing a StructType with empty struct_fields because calling
max() on an empty sequence is invalid. To fix this, add a check before calling
max() in the StructType handling block to detect when struct_fields is empty,
and in that case return f.field_id as a fallback value instead of attempting to
compute the maximum from subfields.

In `@elt-common/tests/unit_tests/iceberg/test_schema.py`:
- Around line 179-180: The test for evolve_schema with incompatible schemas uses
pytest.raises(ValueError) without validating the specific error message, which
can mask regressions if the function raises ValueError for the wrong reason. Add
a match parameter to the pytest.raises call to verify that the ValueError
contains the expected incompatibility error message when evolve_schema is called
with incompatible schemas, ensuring the test only passes when the correct
validation error is raised.
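The empty-struct failure mode in the second inline comment comes from Python's built-in `max()`, which raises `ValueError` on an empty sequence unless a `default` is given. The sketch below shows the behaviour in isolation; `max_id_with_fallback` is a hypothetical helper, not the fix that was committed to `get_max_field_id`.

```python
def max_id_with_fallback(own_id: int, child_ids: list[int]) -> int:
    """Return the largest child ID, or the field's own ID when there are no children."""
    return max(child_ids, default=own_id)


try:
    max([])  # what an unguarded max() over an empty struct's subfields would do
    raised = False
except ValueError:
    raised = True

print(raised)                              # True
print(max_id_with_fallback(7, []))         # 7
print(max_id_with_fallback(7, [8, 9, 10])) # 10
```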

📥 Commits

Reviewing files that changed from the base of the PR and between 4b27a39 and ecccad8.

📒 Files selected for processing (2)

elt-common/src/elt_common/iceberg/schema.py
elt-common/tests/unit_tests/iceberg/test_schema.py

martyngigg

A nice set of changes for schema evolution and list/struct iceberg support. Just a couple of quick questions to ensure I understand what's happening.

feat(elt-common): Extend schema generation

ecccad8

Handle schemas with nested fields (structs and lists), and improve validation when evolving schemas. ref #321

WHTaylor requested a review from a team as a code owner June 23, 2026 16:49

coderabbitai Bot reviewed Jun 23, 2026

Comment thread elt-common/src/elt_common/iceberg/schema.py Outdated

Comment thread elt-common/src/elt_common/iceberg/schema.py

Comment thread elt-common/tests/unit_tests/iceberg/test_schema.py

martyngigg self-assigned this Jun 24, 2026

martyngigg reviewed Jun 24, 2026

Comment thread elt-common/src/elt_common/iceberg/schema.py

Comment thread elt-common/src/elt_common/iceberg/schema.py

martyngigg approved these changes Jun 24, 2026

WHTaylor added 3 commits June 24, 2026 11:13

Handle ids for empty structs

ca4dccb

Carry over id fields when evolving schema

5238b76

Improve schema tests

e38925e

WHTaylor merged commit 857eba7 into main Jun 24, 2026
4 checks passed

WHTaylor deleted the 321-nested-fields branch June 24, 2026 10:44
