[data] feat: add datachecker #27
Summary of Changes
This pull request significantly refactors the data handling and validation mechanisms within the RL-Insight project.
Code Review
The pull request refactors the data validation and handling mechanism by introducing a DataChecker class and moving DataEnum and DataValidationError to rl_insight/data/data_checker.py and rl_insight/data/rules.py respectively. The BaseData class and MultiJsonData class are removed, and the PathExistsRule is updated to check for directory existence. The OfflineInsightPipeline and BaseClusterParser are updated to use the new DataChecker for input validation and to remove hardcoded input types, making the system more modular. A hardcoded local path in rl_insight/main.py was updated, and a generated HTML file was added to the repository. Review comments highlight critical issues such as non-portable hardcoded paths, shared mutable class attributes in DataChecker and PathExistsRule, and global logging configuration. Other improvements suggested include renaming a shadowed parameter, safely accessing dictionary keys, aligning type hints with implementation, clarifying path existence checks, and making input types more configurable in BaseClusterParser and RLTimelineVisualizer. Additionally, the generated rl_timeline.html file should be added to .gitignore.
rl_insight/main.py (32)
The default value for --input-path is a hardcoded path specific to a user's local machine (C:\Users\Tardis\Documents\profile_data\...). This is not portable and will cause issues for other developers or in different environments. It should be changed to a generic, relative path or a path that is expected to exist in a typical development setup, or removed if no sensible default can be provided.
"--input-path", default="./data", help="Raw path of profiling data"
rl_insight/data/data_checker.py (41-46)
The rules attribute is defined as a class attribute, meaning it's shared across all instances of DataChecker. If an instance modifies this dictionary, it will affect all other instances, which is likely unintended. It should be an instance attribute initialized within the __init__ method to ensure each DataChecker instance has its own set of rules.
class DataChecker():
    """Base data class for RL-Insight."""
    def __init__(self, type: DataEnum, data: str|dict):
        self.type = type
        self.data = data
        self.rules: dict[DataEnum, List[ValidationRule]] = {
            DataEnum.MULTI_JSON: [PathExistsRule()],
            DataEnum.SUMMARY_EVENT: [],
        }
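To make the sharing problem concrete, here is a minimal sketch. `SharedRules` and `OwnRules` are hypothetical stand-ins, not classes from the PR, and the rule lists hold plain strings instead of `ValidationRule` objects.

```python
from enum import Enum


class DataEnum(Enum):  # stand-in for the project's enum, values assumed
    MULTI_JSON = "multi_json"
    SUMMARY_EVENT = "summary_event"


class SharedRules:
    # Class attribute: a single dict object shared by every instance.
    rules: dict = {DataEnum.MULTI_JSON: ["path_exists"]}


class OwnRules:
    def __init__(self) -> None:
        # Instance attribute: a fresh dict is built for each instance.
        self.rules: dict = {DataEnum.MULTI_JSON: ["path_exists"]}


a, b = SharedRules(), SharedRules()
a.rules[DataEnum.MULTI_JSON].append("extra_rule")
# b observes the change made through a, because both use the same dict.
shared_leak = "extra_rule" in b.rules[DataEnum.MULTI_JSON]

c, d = OwnRules(), OwnRules()
c.rules[DataEnum.MULTI_JSON].append("extra_rule")
# d is unaffected, because each instance owns its dict.
own_leak = "extra_rule" in d.rules[DataEnum.MULTI_JSON]
```

With the class attribute, a change made through one instance is visible through the other; with the instance attribute it is not.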
rl_insight/data/rules.py (52)
The _error_message attribute is defined as a class attribute, making it shared across all instances of PathExistsRule. This means if one instance sets an error message, it will overwrite the message for all other instances. It should be an instance attribute initialized in the __init__ method. Additionally, the type hint str for _error_message conflicts with the error_message property's return type List[str] (line 68).
class PathExistsRule(ValidationRule):
    def __init__(self):
        self._error_message: List[str] = []
    def check(self, data: str|dict|pd.DataFrame) -> bool:
rl_insight/data/data_checker.py (22-26)
Configuring logging globally within a module can lead to unexpected behavior or conflicts if other parts of the application require different logging setups. It's generally best practice to configure logging at the application's entry point or use a more flexible configuration mechanism.
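One common arrangement, sketched below under assumptions: the library module only obtains a named logger, and the entry point is the single place that calls `logging.basicConfig`. The logger name, the `check_data` helper, and the format string are illustrative, not taken from the PR.

```python
import logging

# In a library module such as data_checker.py: obtain a named logger only,
# and leave handler and level configuration to the application.
logger = logging.getLogger("rl_insight.data.data_checker")


def check_data(path: str) -> bool:
    """Hypothetical helper that logs without configuring any handlers."""
    logger.info("Checking %s", path)
    return bool(path)


def main() -> None:
    """Application entry point: the one place that configures logging."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    check_data("./data")


if __name__ == "__main__":
    main()
```

Importing the module then has no side effects on the root logger, so other applications embedding it keep their own logging setup.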
rl_insight/data/data_checker.py (48)
Using type as a parameter name shadows the built-in type() function in Python. This can lead to confusion and potential issues. Consider renaming this parameter to something more descriptive, like data_type.
def __init__(self, data_type: DataEnum, data: str|dict):
rl_insight/data/data_checker.py (53)
Accessing self.rules[self.type] directly can raise a KeyError if self.type is not present in the rules dictionary. It's safer to use self.rules.get(self.type, []) to gracefully handle cases where no rules are defined for a specific DataEnum, or explicitly raise a more informative error.
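Both options can be sketched as follows. `RULES`, `rules_or_empty`, and `rules_or_raise` are hypothetical names used only for illustration, and the rules are plain strings rather than `ValidationRule` objects.

```python
from enum import Enum
from typing import Dict, List


class DataEnum(Enum):  # stand-in for the project's enum, values assumed
    MULTI_JSON = "multi_json"
    SUMMARY_EVENT = "summary_event"


# Deliberately has no entry for SUMMARY_EVENT.
RULES: Dict[DataEnum, List[str]] = {DataEnum.MULTI_JSON: ["path_exists"]}


def rules_or_empty(data_type: DataEnum) -> List[str]:
    # Option 1: treat a missing entry as "no rules to run".
    return RULES.get(data_type, [])


def rules_or_raise(data_type: DataEnum) -> List[str]:
    # Option 2: fail loudly with a message that names the missing type.
    try:
        return RULES[data_type]
    except KeyError:
        raise ValueError(
            f"No validation rules registered for {data_type.name}"
        ) from None
```

The first option silently skips validation for unregistered types; the second surfaces a configuration mistake early. Which one fits depends on whether an unregistered type is considered valid.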
rules = self.rules.get(self.type, [])
rl_insight/data/rules.py (53)
The type hint for data includes pd.DataFrame, but the current implementation of the check method only handles str input. If pd.DataFrame is a valid input type for this rule, the logic within the check method needs to be updated to correctly process it. Otherwise, the type hint should be narrowed to str.
rl_insight/data/rules.py (59)
The path.is_dir() check is specific to directories. If the PathExistsRule is intended to validate the existence of any path (file or directory), path.exists() would be more appropriate. If it's strictly for directories, the error message should clearly state that a directory is expected.
if not path.exists():
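If the rule is meant strictly for directories instead, a sketch along these lines keeps `is_dir()` but reports the two failure modes separately and narrows the parameter type to `str`. `DirectoryExistsRule` is a hypothetical name, and the class omits the `ValidationRule` base so that it runs on its own.

```python
from pathlib import Path
from typing import List


class DirectoryExistsRule:
    """Hypothetical variant of PathExistsRule that only accepts directories."""

    def __init__(self) -> None:
        # Instance attribute whose type matches the property's return type.
        self._error_message: List[str] = []

    @property
    def error_message(self) -> List[str]:
        return self._error_message

    def check(self, data: str) -> bool:
        # Reset so repeated calls do not accumulate stale messages.
        self._error_message = []
        path = Path(data)
        if not path.exists():
            self._error_message.append(f"Path does not exist: {data}")
            return False
        if not path.is_dir():
            self._error_message.append(f"Expected a directory: {data}")
            return False
        return True
```

A caller can then tell a missing path apart from a path that exists but is a file.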
rl_insight/parser/parser.py (36)
Hardcoding self.input_type = DataEnum.MULTI_JSON in the BaseClusterParser makes the base class less flexible. If different parsers need to handle different input types, this should either be an abstract property that subclasses must implement, or passed in during initialization, rather than being fixed in the base class.
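The abstract-property option could look roughly like this. The class bodies are reduced to the one property, and `MultiJsonParser` is a hypothetical subclass name, not one from the PR.

```python
from abc import ABC, abstractmethod
from enum import Enum


class DataEnum(Enum):  # stand-in for the project's enum, values assumed
    MULTI_JSON = "multi_json"
    SUMMARY_EVENT = "summary_event"


class BaseClusterParser(ABC):
    @property
    @abstractmethod
    def input_type(self) -> DataEnum:
        """Each concrete parser declares the input type it accepts."""


class MultiJsonParser(BaseClusterParser):  # hypothetical subclass
    @property
    def input_type(self) -> DataEnum:
        return DataEnum.MULTI_JSON
```

With this shape the base class cannot be instantiated directly, and a subclass that forgets to declare its input type fails at construction time rather than at validation time.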
rl_insight/visualizer/visualizer.py (83)
Similar to the BaseClusterParser, hardcoding self.input_type = DataEnum.SUMMARY_EVENT in RLTimelineVisualizer limits its flexibility. If other visualizers or future extensions need to handle different input types, this should be a more dynamic property or configurable.
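For the visualizer, one lightweight alternative is a constructor parameter that defaults to the current value. This is a sketch only; the real `RLTimelineVisualizer` constructor signature is not shown in this review, so the one below is an assumption.

```python
from enum import Enum


class DataEnum(Enum):  # stand-in for the project's enum, values assumed
    MULTI_JSON = "multi_json"
    SUMMARY_EVENT = "summary_event"


class RLTimelineVisualizer:
    def __init__(self, input_type: DataEnum = DataEnum.SUMMARY_EVENT) -> None:
        # Existing callers keep the current behaviour through the default;
        # new callers can pass a different input type.
        self.input_type = input_type


default_vis = RLTimelineVisualizer()
custom_vis = RLTimelineVisualizer(input_type=DataEnum.MULTI_JSON)
```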
test/rl_timeline.html (1-7)
This rl_timeline.html file appears to be a generated output from the visualization process. Generated files should generally not be committed to the repository. They can cause unnecessary diffs, bloat the repository size, and lead to merge conflicts. Please add this file to .gitignore.
        return False
    return True
except Exception as e:
    self._error_message = f"Source path does not exist: {data}"
The exception variable 'e' is unused; recommend including it in self._error_message.
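A minimal sketch of that suggestion follows. The surrounding `try` block is reconstructed from the few visible diff lines, so its exact shape, the string-typed `_error_message`, and the wording of the new message are assumptions.

```python
from pathlib import Path


class PathExistsRule:
    def __init__(self) -> None:
        self._error_message = ""

    def check(self, data) -> bool:
        try:
            if not Path(data).is_dir():
                self._error_message = f"Source path does not exist: {data}"
                return False
            return True
        except Exception as e:
            # Keep the original cause in the message so the failure
            # can be diagnosed from the validation error alone.
            self._error_message = f"Could not check source path {data!r}: {e}"
            return False
```

For example, passing `None` makes `Path(None)` raise a `TypeError`, and the message now carries that cause instead of claiming the path does not exist.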
* add datachecker * Delete rl_timeline.html
lgtm
* [data] feat: add base data class framework with validation support (#26)
* [data] feat: add base data class framework with validation support
Signed-off-by: Debonex <debonexx@gmail.com>
* [refactor] move data module to rl_insight package and fix imports
Signed-off-by: Debonex <debonexx@gmail.com>
* [data] refactor: simplify validation system with class method approach
Signed-off-by: Debonex <debonexx@gmail.com>
* [data] feat: update data related interfaces and tests
Signed-off-by: Debonex <debonexx@gmail.com>
---------
Signed-off-by: Debonex <debonexx@gmail.com>
* [data] feat: add datachecker (#27)
* add datachecker
* Delete rl_timeline.html
* Unify logging and address other code review comments
* update requirement
* address code review comments from gemini
* adjust ut
* pre-commit
* pre-commit
* format uworkflow yml
* Address review comments
---------
Signed-off-by: Debonex <debonexx@gmail.com>
Co-authored-by: Debonet <37174444+Debonex@users.noreply.github.com>
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description} (This will be checked by the CI)
{modules} include mstx, mvtx, torch_profile, deployment, perf, algo, env, doc, data, cfg, ci, misc
Separate multiple modules with a comma, like [mstx, ci]
{type} is in feat, fix, refactor, chore, test
For breaking changes, add [BREAKING] to the beginning of the title.
Example: [BREAKING][mstx, torch_profile] feat: support timeline parsing
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always