Skip to content

LIHADOOP-86205: Add view dependency tracking to capture full view lineage chain#577

Open
thejdeep wants to merge 4 commits intolinkedin:masterfrom
thejdeep:LIHADOOP-86205
Open

LIHADOOP-86205: Add view dependency tracking to capture full view lineage chain#577
thejdeep wants to merge 4 commits intolinkedin:masterfrom
thejdeep:LIHADOOP-86205

Conversation

@thejdeep
Copy link

What changes are proposed in this pull request, and why are they necessary?

Adds a ViewDependencyTracker that records the view dependency chain during Calcite's recursive view expansion. This enables downstream consumers to emit the full view-to-view-to-table dependency graph instead of just flattened base tables.

  • Add ViewDependency data class and ViewDependencyTracker in coral-common
  • Instrument HiveViewExpander.expandView() to record enter/exit of views
  • Instrument HiveDbSchema and CoralDatabaseSchema to record base table deps
  • Add getViewDependencies() API to CoralSpark
  • Add tests for nested, simple, and base table view dependency scenarios

How was this patch tested?

Added unit tests in CoralSparkTest

…eage chain

Adds a ThreadLocal ViewDependencyTracker that records the view dependency
chain during Calcite's recursive view expansion. This enables downstream
consumers (e.g., Spark lineage) to emit the full view-to-view-to-table
dependency graph instead of just flattened base tables.

- Add ViewDependency data class and ViewDependencyTracker in coral-common
- Instrument ToRelConverter.convertView() to record top-level view entry
- Instrument HiveViewExpander.expandView() to record nested view enter/exit
- Instrument HiveDbSchema and CoralDatabaseSchema to record base table deps
- Add getViewDependencies() API to CoralSpark
- Add tests for nested, simple, and base table view dependency scenarios
- Add __pycache__/ to .gitignore
@aastha25
Copy link
Contributor

reviewing offline if this is the best place to record lineage (which will be done real time) or if it's possible for the usecase to do this analysis offline post hoc.

Copy link

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick pass

/**
* Called when a base table (non-view) is encountered during view expansion.
*/
public void recordBaseDependency(String dbName, String tableName) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Base -> BaseTable or some such ?

SqlNode sqlNode = processView(hiveDbName, hiveViewName);
return toRel(sqlNode);
} finally {
ViewDependencyTracker.get().exitView();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return and validate it is the same view ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check for validating the return of exitView


public static void reset() {
INSTANCE.remove();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After each resolution, the state needs to be cleared - not just in tests, but also for 'regular' use

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this method will be invoked from the caller who invokes the rel converter of Coral

Copy link

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.
Please do have it reviewed by someone more familiar with coral as well :-)

@wmoustafa
Copy link
Contributor

Discussed offline. This PR requires a deep refactor of the view expansion APIs since they are not friendly to tracking lineage in the first place.

@aastha25
Copy link
Contributor

I’m not aligned with this PR. The proposed design introduces lineage tracking at an awkward point in the translation path (Coral IR ↔ Calcite connector), which makes it difficult to justify without a broader redesign. At this stage, we shouldn’t need to emit lineage. All referenced tables and views should already be accessible via the catalog client’s getTable API, which is the defined contract between the caller and Coral. And the lineage should be extracted from there.
Another cleaner solution would be to implement it at the parser layer—when Coral parses the view definition and has direct visibility into the underlying datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants