[data] feat: add datachecker by tardis-key · Pull Request #27 · verl-project/rl-insight

tardis-key · 2026-03-20T09:34:42Z

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include mstx, mvtx, torch_profile, deployment, perf, algo, env, doc, data, cfg, ci, misc
- If this PR involves multiple modules, separate them with , like [mstx, ci]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][mstx, torch_profile] feat: support timeline parsing

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...

gemini-code-assist · 2026-03-20T09:35:05Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the data handling and validation mechanisms within the RL-Insight project. By introducing a dedicated DataChecker class, the changes aim to centralize and enforce data integrity throughout the data processing pipeline. This consolidation simplifies the overall architecture, making data validation more consistent and easier to manage, ultimately improving the reliability of data flow from parsing to visualization.

Highlights

Centralized Data Validation: Introduced a new DataChecker class to centralize and standardize data validation logic across the RL-Insight pipeline, replacing previous ad-hoc validation mechanisms.
Refactored Data Type Handling: Consolidated the DataEnum definition into the new data_checker.py module and removed several deprecated data-related files, streamlining the data module's structure.
Simplified Pipeline Data Flow: Updated the OfflineInsightPipeline to utilize the DataChecker for validating both input and intermediate data, removing direct dependencies on BaseData and simplifying data type checks within the pipeline.
Improved Validation Rules: Refactored ValidationRule and DataValidationError into rules.py and enhanced the PathExistsRule to be more robust, handling various input types and providing clearer error messages.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

The pull request refactors the data validation and handling mechanism by introducing a DataChecker class and moving DataEnum and DataValidationError to rl_insight/data/data_checker.py and rl_insight/data/rules.py respectively. The BaseData class and MultiJsonData class are removed, and the PathExistsRule is updated to check for directory existence. The OfflineInsightPipeline and BaseClusterParser are updated to use the new DataChecker for input validation and to remove hardcoded input types, making the system more modular. A hardcoded local path in rl_insight/main.py was updated, and a generated HTML file was added to the repository. Review comments highlight critical issues such as non-portable hardcoded paths, shared mutable class attributes in DataChecker and PathExistsRule, and global logging configuration. Other improvements suggested include renaming a shadowed parameter, safely accessing dictionary keys, aligning type hints with implementation, clarifying path existence checks, and making input types more configurable in BaseClusterParser and RLTimelineVisualizer. Additionally, the generated rl_timeline.html file should be added to .gitignore.

I am having trouble creating individual review comments, so my feedback is collected here.

rl_insight/main.py (32)

The default value for --input-path is a hardcoded path specific to a user's local machine (C:\Users\Tardis\Documents\profile_data\...). This is not portable and will cause issues for other developers or in different environments. It should be changed to a generic, relative path or a path that is expected to exist in a typical development setup, or removed if no sensible default can be provided.

        "--input-path", default="./data", help="Raw path of profiling data"

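A minimal sketch of how the suggested default might sit inside an argparse setup. The `build_parser` function and the parser description are hypothetical names used for illustration; only the `--input-path` argument, its `./data` default, and its help text come from the suggestion above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper; the real main.py may structure this differently.
    parser = argparse.ArgumentParser(description="RL-Insight offline pipeline (sketch)")
    parser.add_argument(
        "--input-path", default="./data", help="Raw path of profiling data"
    )
    return parser


# With no arguments, the portable relative default is used.
default_args = build_parser().parse_args([])

# An explicit value still overrides the default.
custom_args = build_parser().parse_args(["--input-path", "/tmp/profile_data"])
```

A relative default keeps the command usable on any machine, while anyone with data elsewhere can still pass the location explicitly.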
rl_insight/data/data_checker.py (41-46)

The rules attribute is defined as a class attribute, meaning it's shared across all instances of DataChecker. If an instance modifies this dictionary, it will affect all other instances, which is likely unintended. It should be an instance attribute initialized within the __init__ method to ensure each DataChecker instance has its own set of rules.

class DataChecker():
    """Base data class for RL-Insight."""

    def __init__(self, type: DataEnum, data: str|dict):
        self.type = type
        self.data = data
        self.rules: dict[DataEnum, List[ValidationRule]] = {
            DataEnum.MULTI_JSON: [PathExistsRule()],
            DataEnum.SUMMARY_EVENT: [],
        }
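The effect of the per-instance `rules` dictionary can be sketched as follows. `CheckerSketch` is a hypothetical stand-in rather than the PR's `DataChecker`, the `DataEnum` member values are assumed, and rules are represented by plain strings to keep the example self-contained.

```python
from enum import Enum
from typing import Dict, List


class DataEnum(Enum):
    # Stand-in for the PR's DataEnum; the member values are assumptions.
    MULTI_JSON = "multi_json"
    SUMMARY_EVENT = "summary_event"


class CheckerSketch:
    def __init__(self, data_type: DataEnum, data: object) -> None:
        self.data_type = data_type
        self.data = data
        # Built inside __init__, so every instance gets a fresh dict and fresh lists.
        self.rules: Dict[DataEnum, List[str]] = {
            DataEnum.MULTI_JSON: ["path_exists"],
            DataEnum.SUMMARY_EVENT: [],
        }


a = CheckerSketch(DataEnum.MULTI_JSON, "./data")
b = CheckerSketch(DataEnum.MULTI_JSON, "./other")

# Customizing one checker's rules leaves the other untouched.
a.rules[DataEnum.SUMMARY_EVENT].append("non_empty")
```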

rl_insight/data/rules.py (52)

The _error_message attribute is defined as a class attribute, making it shared across all instances of PathExistsRule. This means if one instance sets an error message, it will overwrite the message for all other instances. It should be an instance attribute initialized in the __init__ method. Additionally, the type hint str for _error_message conflicts with the error_message property's return type List[str] (line 68).

class PathExistsRule(ValidationRule):
    def __init__(self):
        self._error_message: List[str] = []

    def check(self, data: str|dict|pd.DataFrame) -> bool:
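To illustrate the attribute semantics behind this comment, here is a small sketch with two hypothetical classes (neither is the PR's `PathExistsRule`). It contrasts a class-level list that is mutated in place with a list created in `__init__` and exposed through a property whose return type matches the stored type.

```python
from typing import List


class SharedMessages:
    # Class-level list: a single object visible through every instance.
    messages: List[str] = []

    def fail(self, text: str) -> None:
        self.messages.append(text)  # in-place mutation of the shared list


class OwnMessages:
    def __init__(self) -> None:
        # Instance-level list; the type matches the property below.
        self._error_message: List[str] = []

    @property
    def error_message(self) -> List[str]:
        return self._error_message

    def fail(self, text: str) -> None:
        self._error_message.append(text)


first, second = SharedMessages(), SharedMessages()
first.fail("missing ./a")
leaked = list(second.messages)  # the message recorded on `first` shows up here

third, fourth = OwnMessages(), OwnMessages()
third.fail("missing ./a")
isolated = list(fourth.error_message)  # stays empty
```

Plain assignment through `self` (for example `self._error_message = "..."`) rebinds the attribute on that one instance only; the sharing appears when a class-level mutable object is modified in place. Initializing in `__init__` avoids depending on that distinction.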

rl_insight/data/data_checker.py (22-26)

Configuring logging globally within a module can lead to unexpected behavior or conflicts if other parts of the application require different logging setups. It's generally best practice to configure logging at the application's entry point or use a more flexible configuration mechanism.
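One common way to follow this advice is sketched below, assuming the module only needs a named logger and that the application has a single entry point. The logger name, the `configure_logging` helper, and the format string are illustrative choices, not taken from the PR.

```python
import logging

# Library module: obtain a named logger, but attach no handlers and set no levels.
logger = logging.getLogger("rl_insight.data.data_checker")


def validate(path: str) -> bool:
    # Hypothetical function that only emits log records.
    logger.info("validating %s", path)
    return bool(path)


def configure_logging(level: int = logging.INFO) -> None:
    # Application entry point: the one place where global configuration happens.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )


if __name__ == "__main__":
    configure_logging()
    validate("./data")
```

With this split, importing the module has no side effects on the root logger, and an embedding application remains free to choose its own handlers and levels.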

rl_insight/data/data_checker.py (48)

Using type as a parameter name shadows the built-in type() function in Python. This can lead to confusion and potential issues. Consider renaming this parameter to something more descriptive, like data_type.

    def __init__(self, data_type: DataEnum, data: str|dict):
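A short sketch of what the shadowing can look like in practice. Both functions are hypothetical; they only show that a parameter named `type` hides the built-in inside the function body.

```python
def kind_shadowed(type: str, data: object) -> str:
    # `type` now refers to the string argument, so calling it fails.
    return type(data).__name__


def kind_renamed(data_type: str, data: object) -> str:
    # With the parameter renamed, the built-in type() is reachable again.
    return f"{data_type}:{type(data).__name__}"


try:
    kind_shadowed("multi_json", "./data")
    shadowed_failed = False
except TypeError:
    # 'str' object is not callable
    shadowed_failed = True

label = kind_renamed("multi_json", "./data")
```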

rl_insight/data/data_checker.py (53)

Accessing self.rules[self.type] directly can raise a KeyError if self.type is not present in the rules dictionary. It's safer to use self.rules.get(self.type, []) to gracefully handle cases where no rules are defined for a specific DataEnum, or explicitly raise a more informative error.

        rules = self.rules.get(self.type, [])
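The two options mentioned in this comment can be sketched side by side. The `DataEnum` below is a stand-in with assumed values, `UNREGISTERED` is a hypothetical member added only to trigger the missing-key case, and rules are represented as strings.

```python
from enum import Enum
from typing import Dict, List


class DataEnum(Enum):
    # Stand-in enum; values are assumptions, UNREGISTERED is hypothetical.
    MULTI_JSON = "multi_json"
    SUMMARY_EVENT = "summary_event"
    UNREGISTERED = "unregistered"


RULES: Dict[DataEnum, List[str]] = {
    DataEnum.MULTI_JSON: ["path_exists"],
    DataEnum.SUMMARY_EVENT: [],
}


def rules_or_empty(data_type: DataEnum) -> List[str]:
    # Option 1: treat an unknown type as "no rules to run".
    return RULES.get(data_type, [])


def rules_or_raise(data_type: DataEnum) -> List[str]:
    # Option 2: fail loudly with a message that names the offending type.
    try:
        return RULES[data_type]
    except KeyError:
        raise ValueError(f"No validation rules registered for {data_type!r}") from None
```

Falling back to an empty list means an unregistered type passes validation silently, so the explicit error may be the safer choice when every type is expected to have an entry.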

rl_insight/data/rules.py (53)

The type hint for data includes pd.DataFrame, but the current implementation of the check method only handles str input. If pd.DataFrame is a valid input type for this rule, the logic within the check method needs to be updated to correctly process it. Otherwise, the type hint should be narrowed to str.

rl_insight/data/rules.py (59)

The path.is_dir() check is specific to directories. If the PathExistsRule is intended to validate the existence of any path (file or directory), path.exists() would be more appropriate. If it's strictly for directories, the error message should clearly state that a directory is expected.

            if not path.exists():
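The difference between the two `pathlib` checks can be seen in a small self-contained sketch. The file name `trace.json` is arbitrary, and the temporary directory exists only for the duration of the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    file_path = directory / "trace.json"
    file_path.write_text("{}")
    missing = directory / "missing"

    # Each entry is (exists(), is_dir()) for that kind of path.
    results = {
        "directory": (directory.exists(), directory.is_dir()),
        "file": (file_path.exists(), file_path.is_dir()),
        "missing": (missing.exists(), missing.is_dir()),
    }
```

A regular file passes `exists()` but fails `is_dir()`, so a rule built on `is_dir()` rejects files even though they exist. If only directories are acceptable, the error message should say so.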

rl_insight/parser/parser.py (36)

Hardcoding self.input_type = DataEnum.MULTI_JSON in the BaseClusterParser makes the base class less flexible. If different parsers need to handle different input types, this should either be an abstract property that subclasses must implement, or passed in during initialization, rather than being fixed in the base class.
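A sketch of the abstract-property option, assuming a stand-in `DataEnum` with assumed values. `BaseParserSketch` and `MultiJsonParserSketch` are hypothetical names and do not reflect the actual `BaseClusterParser` interface.

```python
from abc import ABC, abstractmethod
from enum import Enum


class DataEnum(Enum):
    # Stand-in enum; member values are assumptions.
    MULTI_JSON = "multi_json"
    SUMMARY_EVENT = "summary_event"


class BaseParserSketch(ABC):
    @property
    @abstractmethod
    def input_type(self) -> DataEnum:
        """Each concrete parser declares the input type it accepts."""


class MultiJsonParserSketch(BaseParserSketch):
    @property
    def input_type(self) -> DataEnum:
        return DataEnum.MULTI_JSON


parser = MultiJsonParserSketch()

try:
    BaseParserSketch()  # abstract: cannot be instantiated directly
    base_instantiable = True
except TypeError:
    base_instantiable = False
```

With this shape, a subclass that forgets to declare its input type fails at instantiation time rather than silently inheriting a default.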

rl_insight/visualizer/visualizer.py (83)

Similar to the BaseClusterParser, hardcoding self.input_type = DataEnum.SUMMARY_EVENT in RLTimelineVisualizer limits its flexibility. If other visualizers or future extensions need to handle different input types, this should be a more dynamic property or configurable.
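One way to make the input type configurable while keeping the current behavior as the default is a constructor parameter. This is only a sketch: `TimelineVisualizerSketch` is a hypothetical class, and the `DataEnum` stand-in uses assumed values.

```python
from enum import Enum


class DataEnum(Enum):
    # Stand-in enum; member values are assumptions.
    MULTI_JSON = "multi_json"
    SUMMARY_EVENT = "summary_event"


class TimelineVisualizerSketch:
    def __init__(self, input_type: DataEnum = DataEnum.SUMMARY_EVENT) -> None:
        # The default preserves today's behavior; callers can override it.
        self.input_type = input_type


default_viz = TimelineVisualizerSketch()
custom_viz = TimelineVisualizerSketch(input_type=DataEnum.MULTI_JSON)
```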

test/rl_timeline.html (1-7)

This rl_timeline.html file appears to be a generated output from the visualization process. Generated files should generally not be committed to the repository. They can cause unnecessary diffs, bloat the repository size, and lead to merge conflicts. Please add this file to .gitignore.

ZLiao097 · 2026-03-23T03:16:28Z

+                return False
+            return True
+        except Exception as e:
+            self._error_message = f"Source path does not exist: {data}"


The exception variable 'e' is unused; I recommend including it in self._error_message.
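A minimal sketch of the idea, written as a standalone function rather than the PR's `check` method. `check_path` and its `(ok, message)` return shape are hypothetical; the point is that the caught exception is folded into the message instead of being discarded.

```python
from pathlib import Path
from typing import Tuple


def check_path(data: object) -> Tuple[bool, str]:
    """Return (ok, message); message is empty on success."""
    try:
        path = Path(data)  # raises TypeError for non-path inputs such as int
        if not path.is_dir():
            return False, f"Source path does not exist: {data}"
        return True, ""
    except Exception as e:
        # Keep the original cause so the failure is diagnosable.
        return False, f"Failed to check source path {data!r}: {type(e).__name__}: {e}"


ok_bad_type, msg_bad_type = check_path(123)
ok_missing, msg_missing = check_path("no/such/dir/for/this/sketch")
```

Including the exception type and text separates "the path is missing" from "the input could not be interpreted as a path at all", which the original fixed message would conflate.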

* add datachecker
* Delete rl_timeline.html

ZLiao097 · 2026-03-25T02:41:19Z

lgtm

* [data] feat: add base data class framework with validation support (#26)
* [data] feat: add base data class framework with validation support
Signed-off-by: Debonex <debonexx@gmail.com>
* [refactor] move data module to rl_insight package and fix imports
Signed-off-by: Debonex <debonexx@gmail.com>
* [data] refactor: simplify validation system with class method approach
Signed-off-by: Debonex <debonexx@gmail.com>
* [data] feat: update data related interfaces and tests
Signed-off-by: Debonex <debonexx@gmail.com>
---------
Signed-off-by: Debonex <debonexx@gmail.com>
* [data] feat: add datachecker (#27)
* add datachecker
* Delete rl_timeline.html
* Unify logging and address other code review comments
* update requirement
* address code review comments from gemini
* adjust ut
* pre-commit
* pre-commit
* format uworkflow yml
* Address review comments
---------
Signed-off-by: Debonex <debonexx@gmail.com>
Co-authored-by: Debonet <37174444+Debonex@users.noreply.github.com>

add datachecker

140d7de

Delete rl_timeline.html

899ca75

tardis-key changed the title ~~add datachecker~~ [data] feat: add datachecker Mar 20, 2026

gemini-code-assist Bot reviewed Mar 20, 2026


ZLiao097 reviewed Mar 23, 2026


tardis-key merged commit 379af63 into verl-project:dev Mar 23, 2026
0 of 3 checks passed

tardis-key added a commit to tardis-key/rl-insight that referenced this pull request Mar 23, 2026

[data] feat: add datachecker (verl-project#27)

6470387

* add datachecker
* Delete rl_timeline.html

tardis-key merged 2 commits into verl-project:dev from tardis-key:dev
