Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Refactors Analyzer.set_layer_metrics to reduce repeated per-column work and mitigate DataFrame fragmentation during derived-metric column generation, and adds a dedicated test module to validate expected behavior.
Changes:
- Precomputes column classification and numeric coercions, and evaluates each derived-metric condition once per metric.
- Builds all derived columns in-memory and appends them to
hlmin a single concat. - Adds
tests/test_set_layer_metrics.pycovering correctness plus a basic performance smoke loop.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
python/dftracer/analyzer/analyzer.py |
Refactors set_layer_metrics to batch derived-column creation and reduce repeated evaluation/coercion. |
tests/test_set_layer_metrics.py |
Introduces tests for derived-column correctness and a repeat-call perf smoke test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request refactors the
set_layer_metricsmethod inanalyzer.pyto improve performance and correctness when generating derived metric columns, and adds a new test suite to verify its behavior. The main changes include precomputing column types and numeric representations, building derived columns in-memory before appending, and introducing comprehensive tests for correctness and performance.Refactor and performance improvements in metric computation
is_size_col,is_string_col) and numeric representations (numeric_cols) once per source column to avoid repeated computation and improve efficiency inset_layer_metrics.Correctness and logic changes
size_derived_metrics, fixing previous logic that could produce unintended columns.Nonefor non-matching rows so that downstream processing (e.g.,unique_set_flatten) correctly skips missing values.New tests for correctness and performance
tests/test_set_layer_metrics.pywith correctness tests to verify column creation, value propagation, and missing value handling, as well as a performance smoke test to ensure efficient repeated calls.