[data] feat: add DataChecker #28

Merged
tardis-key merged 11 commits into verl-project:main from tardis-key:main
Mar 26, 2026

Conversation

@tardis-key
Collaborator

@tardis-key tardis-key commented Mar 23, 2026

What does this PR do?

  1. Added data type validation to the pipeline via DataChecker, with custom rules implemented through ValidationRule.
  2. Unified the logger.
  3. Adjusted the unit tests (UT) and renamed them to be clearer and more readable.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include mstx, mvtx, torch_profile, deployment, perf, algo, env, doc, data, cfg, ci, misc,
    • If this PR involves multiple modules, separate them with , like [mstx, ci]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][mstx, torch_profile] feat: support timeline parsing

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Debonex and others added 2 commits March 23, 2026 16:40
…erl-project#26)

* [data] feat: add base data class framework with validation support

Signed-off-by: Debonex <debonexx@gmail.com>

* [refactor] move data module to rl_insight package and fix imports

Signed-off-by: Debonex <debonexx@gmail.com>

* [data] refactor: simplify validation system with class method approach

Signed-off-by: Debonex <debonexx@gmail.com>

* [data] feat: update data related interfaces and tests

Signed-off-by: Debonex <debonexx@gmail.com>

---------

Signed-off-by: Debonex <debonexx@gmail.com>
* add datachecker

* Delete rl_timeline.html
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data handling capabilities of the RL-Insight project by introducing a dedicated data validation system. The new framework allows for explicit definition and checking of data types and conditions at critical stages of the processing pipeline, from initial input to intermediate results. This change aims to improve the overall robustness and maintainability of the application by catching data-related issues early and providing clearer error feedback.

Highlights

  • Data Validation Framework: Introduced a new DataChecker class along with ValidationRule and PathExistsRule to provide a robust and extensible data validation framework. This includes a DataValidationError for consistent error handling.
  • Pipeline Integration: Integrated the DataChecker into the OfflineInsightPipeline to perform validation on both input data before parsing and output data before visualization, ensuring data integrity throughout the workflow.
  • Data Type Enumeration: Added a DataEnum to standardize the representation of various data types (e.g., MULTI_JSON, VERL_LOG, SUMMARY_EVENT) across the application, improving clarity and type safety.
  • Refactoring and Cleanup: Removed several old data-related files (data/base.py, data/multi_json.py, data/summary_event.py, data/verl_log.py) and reorganized the data module under rl_insight/data.
  • Argument and Type Handling Updates: Modified rl_insight/main.py to include an --input-type argument and updated BaseClusterParser and RLTimelineVisualizer to leverage the new DataEnum for explicit type declarations.
  • New Unit Tests: Added comprehensive unit tests for the new DataChecker and validation rules to ensure their correctness and reliability.
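
For readers skimming this thread, the pieces named in the highlights above might fit together roughly as follows. This is a minimal sketch, not the code merged in this PR: the class names come from the summary, while the method names (`check`, `run`), the constructor signature, and the enum string values are assumptions made for illustration.

```python
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List


class DataValidationError(Exception):
    """Raised when a piece of data fails a validation rule."""


class DataEnum(Enum):
    # Member names are taken from the summary; the string values are assumed.
    MULTI_JSON = "multi_json"
    VERL_LOG = "verl_log"
    SUMMARY_EVENT = "summary_event"


class ValidationRule:
    """Base class for a single check; subclasses override `check` (assumed name)."""

    def check(self, data: Any) -> None:
        raise NotImplementedError


class PathExistsRule(ValidationRule):
    """Accepts a str or Path and requires that it exists on disk."""

    def check(self, data: Any) -> None:
        if not isinstance(data, (str, Path)):
            raise DataValidationError(
                f"expected a path, got {type(data).__name__}"
            )
        if not Path(data).exists():
            raise DataValidationError(f"path does not exist: {data}")


class DataChecker:
    """Maps each data type to the rules that must pass for it (assumed design)."""

    def __init__(self, rules: Dict[DataEnum, List[ValidationRule]]) -> None:
        self.rules = rules

    def run(self, data_type: DataEnum, data: Any) -> None:
        if data_type not in self.rules:
            raise ValueError(f"no rules registered for {data_type}")
        for rule in self.rules[data_type]:
            rule.check(data)


checker = DataChecker({DataEnum.MULTI_JSON: [PathExistsRule()]})
checker.run(DataEnum.MULTI_JSON, ".")  # the current directory exists, so no error
```

In a pipeline like the one described, a checker of this shape would presumably be called once on the input before parsing and once on the parsed result before visualization.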

Comment thread rl_insight/data/rules.py Outdated
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a DataChecker to validate data types and structures within the pipeline, which is a great step towards making the tool more robust. The changes involve refactoring the data flow, removing obsolete files, and adding new validation rules and tests. My review focuses on improving the implementation of this new data validation feature, ensuring API consistency, and addressing some potential usability issues. Key feedback includes avoiding module-level logging configuration, fixing a type violation in a validation rule, removing a hardcoded path from the CLI arguments, and improving the consistency of the parser and visualizer APIs.
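
The point about module-level logging configuration can be illustrated with a small sketch. It uses the standard library `logging` module only to stay dependency-free; the same idea would apply to calling loguru's `logger.add` from the entry point rather than at import time. The logger name and function names are hypothetical.

```python
import logging

# Creating a logger at import time is fine; configuring handlers is not,
# because it would change logging for every program that imports this module.
logger = logging.getLogger("rl_insight.example")  # hypothetical logger name


def configure_logging(level: int = logging.INFO) -> None:
    """Set up handlers once, from the entry point rather than at import time."""
    logging.basicConfig(
        level=level,
        format="%(levelname)s %(name)s: %(message)s",
    )


def main() -> int:
    configure_logging()
    logger.info("pipeline started")
    return 0
```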

Comment thread rl_insight/main.py Outdated
Comment thread rl_insight/data/data_checker.py Outdated
Comment thread rl_insight/data/rules.py Outdated
Comment thread rl_insight/data/data_checker.py Outdated
Comment thread rl_insight/data/rules.py Outdated
Comment thread rl_insight/main.py Outdated
Comment thread rl_insight/parser/parser.py Outdated
Comment thread rl_insight/pipeline/offline_insight_pipeline.py
Comment thread rl_insight/visualizer/visualizer.py Outdated
@tardis-key
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a DataChecker to validate data within the pipeline, which is a great step towards making the system more robust. The refactoring to use loguru for logging and cleaning up the pipeline class is also a good improvement. I've identified a few areas with potential bugs and opportunities for improvement, particularly in the new data validation logic and the associated tests. My comments focus on improving correctness, robustness, and clarity.

Comment thread rl_insight/data/rules.py Outdated
Comment thread rl_insight/data/data_checker.py Outdated
Comment thread rl_insight/data/rules.py Outdated
Comment thread tests/parser/test_cluster_analysis.py Outdated
Comment thread tests/parser/test_cluster_analysis.py Outdated
Comment thread rl_insight/data/data_checker.py Outdated
Comment thread rl_insight/parser/parser.py Outdated
Comment thread rl_insight/visualizer/visualizer.py Outdated
@tardis-key
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a robust data validation mechanism using a new DataChecker class and DataEnum for data types. It refactors the data processing pipeline to incorporate these checks, ensuring data integrity at various stages. The logging infrastructure has been updated to use loguru for consistency. Several old data-related files have been removed, streamlining the project structure. The changes significantly improve the maintainability and reliability of the data handling within the RL-Insight project. The comments suggesting type hint changes have been dropped as modifying sys.path is permissible when standard import resolution methods are not sufficient.

@tardis-key tardis-key force-pushed the main branch 2 times, most recently from 0ee62d3 to 3038e7b Compare March 25, 2026 03:16
@tardis-key
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new data validation system with DataChecker, DataEnum, and ValidationRule classes, including a PathExistsRule. It integrates loguru for logging across various modules, replacing the standard logging setup. The BaseClusterParser and BaseVisualizer classes are refactored to use an input_type class attribute instead of get_input_type and get_output_type methods, and the OfflineInsightPipeline is updated to leverage the new data validation. The review highlights a Python version compatibility issue with type hint syntax in rl_insight/data/rules.py and suggests using Union for broader compatibility. Additionally, it points out that some test mocks in tests/parser/test_cluster_analysis.py are now redundant due to the removal of get_input_type and get_output_type methods, suggesting their removal for improved test clarity.
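
Two of the points in this review can be sketched briefly. The `X | Y` annotation syntax (PEP 604) is only evaluated successfully on Python 3.10 and later, whereas `typing.Union` also works on older interpreters. The second half shows the class-attribute style for `input_type`; the class names and string values below are hypothetical stand-ins, not the project's actual parser classes.

```python
from pathlib import Path
from typing import Union

# Union[str, Path] is accepted by older interpreters; writing `str | Path`
# in an evaluated annotation requires Python 3.10+.
PathLike = Union[str, Path]


def to_path(value: PathLike) -> Path:
    """Normalize a str or Path argument to a Path."""
    return Path(value)


class BaseParser:
    # Hypothetical stand-in for a parser base class: the expected input type
    # is declared as a class attribute instead of a get_input_type() method.
    input_type: str = "unset"


class JsonParser(BaseParser):
    input_type = "multi_json"  # assumed value, for illustration only
```

With the attribute style, callers read `JsonParser.input_type` directly, which is also why test mocks of the removed getter methods become unnecessary.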

Comment thread rl_insight/data/rules.py Outdated
Comment thread tests/parser/test_cluster_analysis.py Outdated
Comment thread tests/parser/test_cluster_analysis.py Outdated
@tardis-key tardis-key force-pushed the main branch 3 times, most recently from f00e91c to 7ecc7b2 Compare March 25, 2026 03:36
@tardis-key
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a robust data validation system using DataChecker, DataEnum, and ValidationRules, which is integrated into the OfflineInsightPipeline to ensure data integrity. It also migrates the logging infrastructure from Python's built-in logging module to loguru across various files. The --input-path argument in main.py should be made required to prevent potential TypeErrors, and the DataChecker's rules dictionary should explicitly include all DataEnum members to avoid ValueErrors for unhandled types. Additionally, the input_type attribute in RLTimelineVisualizer is redundantly declared and can be removed.
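
The first two suggestions in this review might look like the sketch below. It assumes an `argparse`-based CLI; the program name, help text, `--input-type` default, and the `RULES` table are hypothetical, and only the flag names and enum member names come from this thread.

```python
import argparse
from enum import Enum
from typing import Dict, List


class DataEnum(Enum):
    # Member names from the PR summary; string values are assumed.
    MULTI_JSON = "multi_json"
    VERL_LOG = "verl_log"
    SUMMARY_EVENT = "summary_event"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="rl-insight")  # hypothetical prog name
    # required=True makes argparse exit with a usage error when the flag is
    # missing, instead of passing None on to code that expects a path.
    parser.add_argument("--input-path", required=True, help="input data path")
    parser.add_argument(
        "--input-type",
        choices=[member.value for member in DataEnum],
        default=DataEnum.MULTI_JSON.value,  # assumed default
    )
    return parser


# Hypothetical rules table with an explicit (possibly empty) entry per member,
# so a lookup never fails for a type that simply has no rules yet.
RULES: Dict[DataEnum, List[object]] = {member: [] for member in DataEnum}


def missing_rule_types(rules: Dict[DataEnum, List[object]]) -> List[DataEnum]:
    """Return the DataEnum members that have no entry in the rules table."""
    return [member for member in DataEnum if member not in rules]


args = build_parser().parse_args(["--input-path", "./logs"])
```

A check like `missing_rule_types` could run in a unit test so that adding a new enum member without a rules entry fails early.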

Comment thread rl_insight/data/data_checker.py
Comment thread rl_insight/main.py Outdated
Comment thread rl_insight/visualizer/visualizer.py Outdated
@tardis-key tardis-key mentioned this pull request Mar 25, 2026
19 tasks
"""Base data definitions for RL-Insight."""

from typing import Any, List
from .rules import ValidationRule, PathExistsRule, DataValidationError
Collaborator

Can the current import sequence pass the pre-commit check?

Collaborator Author

I have run pre-commit again, and it appears to be acceptable.

@tardis-key tardis-key merged commit 259d665 into verl-project:main Mar 26, 2026
5 checks passed
@Rhetee
Collaborator

Rhetee commented Mar 26, 2026

/lgtm

3 participants