LIHADOOP-86205: Add view dependency tracking to capture full view lineage chain#577
thejdeep wants to merge 4 commits into linkedin:master
…eage chain

Adds a ThreadLocal ViewDependencyTracker that records the view dependency chain during Calcite's recursive view expansion. This enables downstream consumers (e.g., Spark lineage) to emit the full view-to-view-to-table dependency graph instead of just flattened base tables.

- Add ViewDependency data class and ViewDependencyTracker in coral-common
- Instrument ToRelConverter.convertView() to record top-level view entry
- Instrument HiveViewExpander.expandView() to record nested view enter/exit
- Instrument HiveDbSchema and CoralDatabaseSchema to record base table deps
- Add getViewDependencies() API to CoralSpark
- Add tests for nested, simple, and base table view dependency scenarios
- Add __pycache__/ to .gitignore
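To make the enter/exit bookkeeping described above concrete, here is a minimal sketch of a thread-local tracker that keeps a stack of views under expansion and records each dependency against the innermost one. It is not the code from this PR: the class name `LineageTrackerSketch`, the `enterView(dbName, viewName)` signature, and the `db.name` string keys are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a per-thread view dependency tracker; not the PR's implementation. */
final class LineageTrackerSketch {
  private static final ThreadLocal<LineageTrackerSketch> INSTANCE =
      ThreadLocal.withInitial(LineageTrackerSketch::new);

  // Views currently being expanded, innermost on top.
  private final Deque<String> viewStack = new ArrayDeque<>();
  // View name -> direct dependencies (views or base tables), in discovery order.
  private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

  static LineageTrackerSketch get() {
    return INSTANCE.get();
  }

  static void reset() {
    INSTANCE.remove();
  }

  /** Records the view as a dependency of its parent (if any), then makes it the current view. */
  void enterView(String dbName, String viewName) {
    String name = dbName + "." + viewName;
    recordDependency(name);
    dependencies.computeIfAbsent(name, k -> new ArrayList<>());
    viewStack.push(name);
  }

  /** Pops and returns the innermost view; throws NoSuchElementException if none is open. */
  String exitView() {
    return viewStack.pop();
  }

  /** Records a base table against the view currently being expanded. */
  void recordBaseDependency(String dbName, String tableName) {
    recordDependency(dbName + "." + tableName);
  }

  Map<String, List<String>> getDependencies() {
    return Collections.unmodifiableMap(dependencies);
  }

  private void recordDependency(String name) {
    String parent = viewStack.peek();
    if (parent != null) { // nothing to attach to outside any view expansion
      List<String> deps = dependencies.get(parent);
      if (!deps.contains(name)) {
        deps.add(name);
      }
    }
  }
}
```

In this sketch the result is a map from each view to its direct dependencies, so a consumer can walk it to recover the view-to-view-to-table chain rather than only the leaf tables.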
Reviewing offline whether this is the best place to record lineage (which would be done in real time), or whether the use case allows this analysis to be done offline, post hoc.
/**
 * Called when a base table (non-view) is encountered during view expansion.
 */
public void recordBaseDependency(String dbName, String tableName) {
    SqlNode sqlNode = processView(hiveDbName, hiveViewName);
    return toRel(sqlNode);
} finally {
    ViewDependencyTracker.get().exitView();
Return the view from exitView() and validate that it is the same view?
Added a check that validates the return value of exitView.
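One way such a check could look, as a rough sketch only: the exit call pops the innermost view and fails fast when it differs from the view the caller believes it is leaving. The class `ViewStackSketch` and the `exitView(expected)` signature are hypothetical and are not taken from this PR.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a view-expansion stack; not the PR's code. */
final class ViewStackSketch {
  private final Deque<String> stack = new ArrayDeque<>();

  void enterView(String qualifiedName) {
    stack.push(qualifiedName);
  }

  /** Pops the innermost view and fails fast if it is not the one the caller expected. */
  String exitView(String expected) {
    String actual = stack.poll(); // null when no view is open
    if (!expected.equals(actual)) {
      throw new IllegalStateException(
          "Unbalanced view expansion: expected " + expected + " but exited " + actual);
    }
    return actual;
  }

  int depth() {
    return stack.size();
  }
}
```

A caveat with this design: when the check runs inside a `finally` block, an exception thrown by it replaces any exception already propagating from the `try` body, so logging the mismatch instead of throwing may be preferable there.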
public static void reset() {
  INSTANCE.remove();
}
After each resolution, the state needs to be cleared, not just in tests but also in regular use.
I agree. This method will be invoked by the caller that invokes Coral's rel converter.
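A caller-side pattern for this could be sketched as follows, assuming the caller wraps each conversion in a helper that clears the thread-local state before and after. `PerThreadStateSketch` and `withCleanState` are hypothetical names; the real tracker and converter are replaced here by a plain list and a `Supplier`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical per-thread state holder illustrating reset-in-finally; not Coral's API. */
final class PerThreadStateSketch {
  private static final ThreadLocal<List<String>> SEEN = ThreadLocal.withInitial(ArrayList::new);

  static void record(String name) {
    SEEN.get().add(name);
  }

  /** Copies the current state so it survives the reset that follows the conversion. */
  static List<String> snapshot() {
    return new ArrayList<>(SEEN.get());
  }

  static void reset() {
    SEEN.remove();
  }

  /** Runs one conversion and clears the per-thread state afterwards, even on failure. */
  static <T> T withCleanState(Supplier<T> conversion) {
    reset(); // defend against state leaked by an earlier caller on this thread
    try {
      return conversion.get();
    } finally {
      reset();
    }
  }
}
```

The reset matters most on pooled threads, where a later, unrelated conversion would otherwise see dependencies left over from an earlier one. Because the state is gone once the helper returns, the conversion has to hand back a snapshot as part of its result.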
mridulm left a comment
Looks reasonable to me.
Please do have it reviewed by someone more familiar with coral as well :-)
Discussed offline. This PR requires a deep refactor of the view expansion APIs since they are not friendly to tracking lineage in the first place.
I’m not aligned with this PR. The proposed design introduces lineage tracking at an awkward point in the translation path (Coral IR ↔ Calcite connector), which makes it difficult to justify without a broader redesign. At this stage, we shouldn’t need to emit lineage. All referenced tables and views should already be accessible via the catalog client’s getTable API, which is the defined contract between the caller and Coral, and the lineage should be extracted from there.
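As a rough illustration of that alternative, the caller could wrap whatever catalog client it passes to Coral in a decorator that records every lookup. The interface `TableLookupSketch` below is a hypothetical stand-in, not Coral's actual catalog client type, and `Object` stands in for the real table type.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for a catalog client exposing getTable; not Coral's interface. */
interface TableLookupSketch {
  Object getTable(String dbName, String tableName);
}

/** Decorator that remembers every table or view requested through it. */
final class RecordingTableLookup implements TableLookupSketch {
  private final TableLookupSketch delegate;
  private final Set<String> requested = new LinkedHashSet<>();

  RecordingTableLookup(TableLookupSketch delegate) {
    this.delegate = delegate;
  }

  @Override
  public Object getTable(String dbName, String tableName) {
    requested.add(dbName + "." + tableName); // record before delegating
    return delegate.getTable(dbName, tableName);
  }

  /** Names requested so far, in first-seen order. */
  Set<String> requestedTables() {
    return Collections.unmodifiableSet(requested);
  }
}
```

Note that this sketch yields only a flat set of referenced names. Recovering which view referenced which object would take a further step on the caller's side, such as inspecting each returned view's definition.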
What changes are proposed in this pull request, and why are they necessary?
Adds a ViewDependencyTracker that records the view dependency chain during Calcite's recursive view expansion. This enables downstream consumers to emit the full view-to-view-to-table dependency graph instead of just flattened base tables.
How was this patch tested?
Added unit tests in CoralSparkTest