[data] feat: add datachecker #27

Merged
tardis-key merged 2 commits into verl-project:dev from tardis-key:dev
Mar 23, 2026

Conversation

@tardis-key
Collaborator

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include mstx, mvtx, torch_profile, deployment, perf, algo, env, doc, data, cfg, ci, misc
    • If this PR involves multiple modules, separate them with , like [mstx, ci]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][mstx, torch_profile] feat: support timeline parsing

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the data handling and validation mechanisms within the RL-Insight project. By introducing a dedicated DataChecker class, the changes aim to centralize and enforce data integrity throughout the data processing pipeline. This consolidation simplifies the overall architecture, making data validation more consistent and easier to manage, ultimately improving the reliability of data flow from parsing to visualization.

Highlights

  • Centralized Data Validation: Introduced a new DataChecker class to centralize and standardize data validation logic across the RL-Insight pipeline, replacing previous ad-hoc validation mechanisms.
  • Refactored Data Type Handling: Consolidated the DataEnum definition into the new data_checker.py module and removed several deprecated data-related files, streamlining the data module's structure.
  • Simplified Pipeline Data Flow: Updated the OfflineInsightPipeline to utilize the DataChecker for validating both input and intermediate data, removing direct dependencies on BaseData and simplifying data type checks within the pipeline.
  • Improved Validation Rules: Refactored ValidationRule and DataValidationError into rules.py and enhanced the PathExistsRule to be more robust, handling various input types and providing clearer error messages.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

tardis-key changed the title from "add datachecker" to "[data] feat: add datachecker" on Mar 20, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The pull request refactors the data validation and handling mechanism by introducing a DataChecker class and moving DataEnum and DataValidationError to rl_insight/data/data_checker.py and rl_insight/data/rules.py respectively. The BaseData class and MultiJsonData class are removed, and the PathExistsRule is updated to check for directory existence. The OfflineInsightPipeline and BaseClusterParser are updated to use the new DataChecker for input validation and to remove hardcoded input types, making the system more modular. A hardcoded local path in rl_insight/main.py was updated, and a generated HTML file was added to the repository. Review comments highlight critical issues such as non-portable hardcoded paths, shared mutable class attributes in DataChecker and PathExistsRule, and global logging configuration. Other improvements suggested include renaming a shadowed parameter, safely accessing dictionary keys, aligning type hints with implementation, clarifying path existence checks, and making input types more configurable in BaseClusterParser and RLTimelineVisualizer. Additionally, the generated rl_timeline.html file should be added to .gitignore.

I am having trouble creating individual review comments, so my feedback is listed here instead.

rl_insight/main.py (32)

critical

The default value for --input-path is a hardcoded path specific to a user's local machine (C:\Users\Tardis\Documents\profile_data\...). This is not portable and will cause issues for other developers or in different environments. It should be changed to a generic, relative path or a path that is expected to exist in a typical development setup, or removed if no sensible default can be provided.
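
A minimal sketch of a portable default, assuming the CLI is built with argparse; the `build_parser` helper and the parser description are hypothetical and not taken from rl_insight/main.py:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser whose default input path is relative, not machine-specific."""
    parser = argparse.ArgumentParser(description="hypothetical RL-Insight CLI sketch")
    parser.add_argument(
        "--input-path", default="./data", help="Raw path of profiling data"
    )
    return parser


# No flag given: the relative default is used.
args_default = build_parser().parse_args([])
# Explicit flag: the caller's path wins over the default.
args_custom = build_parser().parse_args(["--input-path", "/tmp/profile"])
```

A relative default keeps the command usable from any checkout, while anyone with data elsewhere can still pass the flag explicitly.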

        "--input-path", default="./data", help="Raw path of profiling data"

rl_insight/data/data_checker.py (41-46)

high

The rules attribute is defined as a class attribute, meaning it's shared across all instances of DataChecker. If an instance modifies this dictionary, it will affect all other instances, which is likely unintended. It should be an instance attribute initialized within the __init__ method to ensure each DataChecker instance has its own set of rules.

class DataChecker():
    """Base data class for RL-Insight."""

    def __init__(self, type: DataEnum, data: str|dict):
        self.type = type
        self.data = data
        self.rules: dict[DataEnum, List[ValidationRule]] = {
            DataEnum.MULTI_JSON: [PathExistsRule()],
            DataEnum.SUMMARY_EVENT: [],
        }
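
To see why the class-level dictionary matters, here is a small self-contained sketch. `SharedRules` and `OwnRules` are hypothetical stand-ins, and the rule names are plain strings rather than the project's ValidationRule objects:

```python
class SharedRules:
    # Class attribute: a single dict object shared by every instance.
    rules = {"multi_json": ["path_exists"]}


class OwnRules:
    def __init__(self) -> None:
        # Instance attribute: each object builds its own dict.
        self.rules = {"multi_json": ["path_exists"]}


a, b = SharedRules(), SharedRules()
a.rules["multi_json"].append("extra")
# The mutation made through `a` is visible through `b`.
shared_leaks = "extra" in b.rules["multi_json"]

c, d = OwnRules(), OwnRules()
c.rules["multi_json"].append("extra")
# The mutation made through `c` stays local to `c`.
own_leaks = "extra" in d.rules["multi_json"]
```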

rl_insight/data/rules.py (52)

high

The _error_message attribute is defined as a class attribute, making it shared across all instances of PathExistsRule. This means if one instance sets an error message, it will overwrite the message for all other instances. It should be an instance attribute initialized in the __init__ method. Additionally, the type hint str for _error_message conflicts with the error_message property's return type List[str] (line 68).

class PathExistsRule(ValidationRule):
    def __init__(self):
        self._error_message: List[str] = []

    def check(self, data: str|dict|pd.DataFrame) -> bool:
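
A sketch of how the stored attribute and the property can agree on one type. `PathRuleSketch` and its `fail` helper are hypothetical and only illustrate the per-instance list plus a matching `List[str]` property, not the actual PathExistsRule:

```python
from typing import List


class PathRuleSketch:
    def __init__(self) -> None:
        # Per-instance storage; the annotation matches the property below.
        self._error_message: List[str] = []

    @property
    def error_message(self) -> List[str]:
        # Return a copy so callers cannot mutate internal state.
        return list(self._error_message)

    def fail(self, message: str) -> None:
        """Record one failure message (hypothetical helper)."""
        self._error_message.append(message)


first, second = PathRuleSketch(), PathRuleSketch()
first.fail("Source path does not exist: ./missing")
```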

rl_insight/data/data_checker.py (22-26)

medium

Configuring logging globally within a module can lead to unexpected behavior or conflicts if other parts of the application require different logging setups. It's generally best practice to configure logging at the application's entry point or use a more flexible configuration mechanism.
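
One common arrangement, sketched with the standard logging module: the library module only obtains a named logger, and the entry point decides on handlers and levels. The logger name and the `configure_logging` helper are assumptions for illustration:

```python
import logging

# Library side: get a named logger and attach a NullHandler so that importing
# the module never changes the application's logging configuration.
logger = logging.getLogger("rl_insight_sketch.data_checker")
logger.addHandler(logging.NullHandler())


def configure_logging(level: int = logging.INFO) -> None:
    """Application side: call once from the entry point (for example main())."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
```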

rl_insight/data/data_checker.py (48)

medium

Using type as a parameter name shadows the built-in type() function in Python. This can lead to confusion and potential issues. Consider renaming this parameter to something more descriptive, like data_type.

    def __init__(self, data_type: DataEnum, data: str|dict):
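
A short illustration of what the shadowing can cost. Both functions are hypothetical and exist only to show the effect of the parameter name:

```python
def describe_shadowed(type, data):
    # Here `type` is the argument, so the built-in is no longer reachable by name.
    try:
        return type(data).__name__
    except TypeError:
        return "built-in type() is shadowed"


def describe_renamed(data_type, data):
    # With a distinct parameter name, the built-in type() works as usual.
    return f"{data_type}:{type(data).__name__}"


shadowed = describe_shadowed("multi_json", {})
renamed = describe_renamed("multi_json", {})
```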

rl_insight/data/data_checker.py (53)

medium

Accessing self.rules[self.type] directly can raise a KeyError if self.type is not present in the rules dictionary. It's safer to use self.rules.get(self.type, []) to gracefully handle cases where no rules are defined for a specific DataEnum, or explicitly raise a more informative error.

        rules = self.rules.get(self.type, [])
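
Both options from this comment can be sketched side by side. `DataKind`, its member values, and the two lookup helpers are hypothetical; they only mirror the shape of a rules dictionary keyed by an enum:

```python
from enum import Enum


class DataKind(Enum):
    MULTI_JSON = "multi_json"
    SUMMARY_EVENT = "summary_event"
    UNREGISTERED = "unregistered"  # hypothetical member with no rules entry


RULES = {DataKind.MULTI_JSON: ["path_exists"], DataKind.SUMMARY_EVENT: []}


def rules_or_empty(kind: DataKind) -> list:
    # Option 1: treat a missing entry as "no rules to run".
    return RULES.get(kind, [])


def rules_or_raise(kind: DataKind) -> list:
    # Option 2: fail loudly with a message that names the offending type.
    if kind not in RULES:
        raise ValueError(f"No validation rules registered for {kind.name}")
    return RULES[kind]
```

The first option is quieter; the second surfaces a forgotten registration early, which may be preferable in a validation layer.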

rl_insight/data/rules.py (53)

medium

The type hint for data includes pd.DataFrame, but the current implementation of the check method only handles str input. If pd.DataFrame is a valid input type for this rule, the logic within the check method needs to be updated to correctly process it. Otherwise, the type hint should be narrowed to str.

rl_insight/data/rules.py (59)

medium

The path.is_dir() check is specific to directories. If the PathExistsRule is intended to validate the existence of any path (file or directory), path.exists() would be more appropriate. If it's strictly for directories, the error message should clearly state that a directory is expected.

            if not path.exists():
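
The difference between the two checks is easy to demonstrate with pathlib and a temporary directory. The file name below is arbitrary:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    file_path = directory / "trace.json"
    file_path.write_text("{}")
    missing = directory / "missing"

    # Each entry is (exists(), is_dir()) for one kind of path.
    results = {
        "dir": (directory.exists(), directory.is_dir()),
        "file": (file_path.exists(), file_path.is_dir()),
        "missing": (missing.exists(), missing.is_dir()),
    }
```

An existing file passes `exists()` but fails `is_dir()`, so a directory-only rule would report it with a misleading "does not exist" message.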

rl_insight/parser/parser.py (36)

medium

Hardcoding self.input_type = DataEnum.MULTI_JSON in the BaseClusterParser makes the base class less flexible. If different parsers need to handle different input types, this should either be an abstract property that subclasses must implement, or passed in during initialization, rather than being fixed in the base class.
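
The abstract-property option could look roughly like this. `ParserSketch` and `MultiJsonParserSketch` are hypothetical, and the input type is a plain string here instead of a DataEnum member:

```python
from abc import ABC, abstractmethod


class ParserSketch(ABC):
    @property
    @abstractmethod
    def input_type(self) -> str:
        """Each concrete parser declares the input type it accepts."""


class MultiJsonParserSketch(ParserSketch):
    @property
    def input_type(self) -> str:
        return "multi_json"


concrete = MultiJsonParserSketch()

# The base class cannot be instantiated until input_type is implemented.
try:
    ParserSketch()
    base_is_abstract = False
except TypeError:
    base_is_abstract = True
```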

rl_insight/visualizer/visualizer.py (83)

medium

Similar to the BaseClusterParser, hardcoding self.input_type = DataEnum.SUMMARY_EVENT in RLTimelineVisualizer limits its flexibility. If other visualizers or future extensions need to handle different input types, this should be a more dynamic property or configurable.
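
For a concrete class, passing the type at construction time with a sensible default is one lightweight way to make it configurable. `VisualizerSketch` is a hypothetical stand-in, not the project's RLTimelineVisualizer:

```python
class VisualizerSketch:
    """Hypothetical visualizer whose accepted input type is configurable."""

    def __init__(self, input_type: str = "summary_event") -> None:
        # The default preserves the existing behaviour; callers may override it.
        self.input_type = input_type


default_viz = VisualizerSketch()
custom_viz = VisualizerSketch(input_type="multi_json")
```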

test/rl_timeline.html (1-7)

medium

This rl_timeline.html file appears to be a generated output from the visualization process. Generated files should generally not be committed to the repository. They can cause unnecessary diffs, bloat the repository size, and lead to merge conflicts. Please add this file to .gitignore.

Comment thread rl_insight/data/rules.py
return False
return True
except Exception as e:
self._error_message = f"Source path does not exist: {data}"
Contributor

The exception variable 'e' is unused; recommend adding it to self._error_message.
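
A sketch of what that could look like, written as a standalone function rather than the actual rule class. `check_path` and its return shape are assumptions for illustration:

```python
from pathlib import Path


def check_path(data):
    """Return (ok, message); the message carries the exception text on failure."""
    try:
        path = Path(data)
        if not path.exists():
            return False, f"Source path does not exist: {data}"
        return True, ""
    except Exception as e:
        # Keep the original cause so the report distinguishes a bad argument
        # from a path that is simply missing.
        return False, f"Could not check source path {data!r}: {e}"


# Path(None) raises TypeError, which ends up in the message.
ok_none, msg_none = check_path(None)
```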

tardis-key merged commit 379af63 into verl-project:dev on Mar 23, 2026
0 of 3 checks passed
tardis-key added a commit to tardis-key/rl-insight that referenced this pull request on Mar 23, 2026
* add datachecker

* Delete rl_timeline.html

@ZLiao097
Contributor

lgtm

tardis-key added a commit that referenced this pull request on Mar 26, 2026
* [data] feat: add base data class framework with validation support (#26)

* [data] feat: add base data class framework with validation support

Signed-off-by: Debonex <debonexx@gmail.com>

* [refactor] move data module to rl_insight package and fix imports

Signed-off-by: Debonex <debonexx@gmail.com>

* [data] refactor: simplify validation system with class method approach

Signed-off-by: Debonex <debonexx@gmail.com>

* [data] feat: update data related interfaces and tests

Signed-off-by: Debonex <debonexx@gmail.com>

---------

Signed-off-by: Debonex <debonexx@gmail.com>

* [data] feat: add datachecker (#27)

* add datachecker

* Delete rl_timeline.html

* Unify logging and address other code review comments

* update  requirement

* address code review comments from gemini

* adjust ut

* pre-commit

* pre-commit

* format uworkflow yml

* Address review comments

---------

Signed-off-by: Debonex <debonexx@gmail.com>
Co-authored-by: Debonet <37174444+Debonex@users.noreply.github.com>