
Conversation

@usernameisnull
Contributor

  1. Added support for DATABASE and HADOOP.
  2. Moved the shared secret-key definitions into the pkg directory.
  3. Updated the corresponding .proto file and Dockerfile.

@gemini-code-assist

Summary of Changes

Hello @usernameisnull, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly.

This pull request significantly enhances the data loading capabilities by introducing support for DATABASE and HADOOP data sources. It also provides more granular control over resource allocation during data warm-up phases and improves code organization by centralizing secret key definitions. These changes aim to make the system more versatile and robust in handling diverse data integration scenarios.

Highlights

  • New Dataset Types: Added support for 'DATABASE' and 'HADOOP' as new dataset source types, expanding the range of data sources that can be integrated.
  • Resource Specification for Data Warm-Up: Introduced a new field, DataWarmUpResources, within the DatasetSpec to allow users to specify resource requirements (CPU, memory) for data warm-up operations.
  • Refactored Secret Key Management: Moved the definition of common secret keys (e.g., username, password, private key) from an internal package to a shared pkg/utils directory, improving accessibility and reusability.
  • Dockerfile Enhancements: Updated the data-loader.Dockerfile to include multi-stage builds for MySQL and Hadoop clients, along with the necessary Java Runtime Environment (JRE), to support the new data source types.
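As a sketch, a Dataset manifest exercising the new source types and warm-up resources might look like the following. The serialized field names (`dataWarmUpResources`, `source.type`, `source.uri`) are assumptions inferred from this summary, not verified against the actual CRD schema:

```yaml
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: demo-hdfs
spec:
  source:
    type: HADOOP                # new enum value; DATABASE is the other addition
    uri: hdfs://10.0.0.1:8020   # URI format per the comments in dataset_types.go
  dataWarmUpResources:          # assumed JSON name of the new DataWarmUpResources field
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
```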


Changelog
  • Dockerfile
    • Standardized FROM syntax to use AS for builder stage.
  • api/dataset/v1alpha1/dataset_types.go
    • Added DatasetTypeDatabase and DatasetTypeHadoop to the DatasetType enum.
    • Updated kubebuilder:validation:Enum to include DATABASE and HADOOP types.
    • Included URI format examples for DATABASE (database://<ip>:<port>) and HADOOP (hdfs://<ip>:<port>) in comments.
    • Added DataWarmUpResources of type v1.ResourceRequirements to DatasetSpec.
  • config/crd/bases/dataset.baizeai.io_datasets.yaml
    • Added schema definitions for resources (DataWarmUpResources) including claims, limits, and requests.
    • Updated the enum for DatasetSource.Type to include DATABASE and HADOOP.
    • Updated the description for DatasetSource.URI and DatasetSource.Options to reflect the new DATABASE and HADOOP types.
  • data-loader.Dockerfile
    • Introduced multi-stage builds for Hadoop and MySQL clients.
    • Added environment variables for JAVA_HOME and HADOOP_HOME.
    • Installed ca-certificates and Temurin JRE 11.
    • Copied mysql client and Hadoop client binaries/configuration from their respective build stages.
    • Configured hadoop-env.sh and performed cleanup of unnecessary Hadoop files.
  • internal/cmd/dataloader/root.go
    • Integrated NewModelDatabaseLoader and NewModelHadoopLoader into the execCopy function to handle the new data source types.
  • internal/controller/dataset/dataset_controller.go
    • Reordered imports for consistency.
    • Updated supportPreload function to enable preload support for DatasetTypeDatabase and DatasetTypeHadoop.
    • Added logging for unsupported preload types in reconcileJob and reconcileJobStatus.
    • Implemented changeDefinitionForHadoop function to dynamically modify JobSpec for Hadoop datasets, including volume mounts for configuration and environment variables.
  • internal/pkg/datasources/credentials.go
    • Removed local SecretKey type and constants, now importing them from pkg/utils.
    • Updated ReadAndParseSecrets to use the utils.SecretKey type.
  • internal/pkg/datasources/datasource_database.go
    • Added new file implementing ModelDatabaseLoader for DATABASE type.
    • Provides functionality to export data from MySQL databases to CSV, including batch processing and row counting.
  • internal/pkg/datasources/datasource_hadoop.go
    • Added new file implementing ModelHadoopLoader for HADOOP type.
    • Provides functionality to fetch data from HDFS using the hdfs dfs -get command.
  • internal/pkg/datasources/types.go
    • Added TypeDatabase and TypeHadoop to the Type constants and SupportedTypes lists.
  • pkg/utils/secret_key.go
    • Added new file defining SecretKey type and constants for common secret keys, moved from internal/pkg/datasources/credentials.go.
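The relocated secret-key definitions might look roughly like the sketch below. The exact constant set is an assumption (the summary only names username, password, and private key), and in the repository this would live in `pkg/utils` as `package utils`; `package main` is used here only so the sketch runs standalone.

```go
package main

import "fmt"

// SecretKey identifies a well-known key inside a credentials Secret,
// shared between the controller and the data loader.
type SecretKey string

// Constant names and values are assumptions based on the PR summary.
const (
	SecretKeyUsername   SecretKey = "username"
	SecretKeyPassword   SecretKey = "password"
	SecretKeyPrivateKey SecretKey = "privateKey"
)

func main() {
	fmt.Println(SecretKeyUsername, SecretKeyPassword, SecretKeyPrivateKey)
}
```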
Activity
  • The pull request introduces new features and refactors existing code. No specific comments or reviews have been recorded yet.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for DATABASE and HADOOP as new dataset source types, which is a significant enhancement. The changes are extensive, touching API definitions, controller logic, the data loader, and Dockerfiles. I appreciate the refactoring of secret key definitions into a shared pkg directory. However, my review has uncovered several critical and high-severity issues that must be addressed. These include a critical SQL injection vulnerability, a bug in the data-loader.Dockerfile that will break the build, and a lack of validation for required options that could lead to runtime panics. I've provided detailed comments and suggestions for each issue.
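The review above flags a critical SQL injection vulnerability; the detailed comments are not shown here, but the usual mitigation in Go is to whitelist identifiers (which cannot be bound as `?` placeholders) and pass all values as query arguments rather than concatenating them. A minimal illustration with hypothetical names, not the PR's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// identRe accepts only plain identifiers; table names cannot be passed
// as ? placeholders, so they must be validated separately.
var identRe = regexp.MustCompile(`^[A-Za-z0-9_]+$`)

// buildExportQuery returns a parameterized query plus its arguments.
// Values (limit/offset) travel as args, never via string concatenation.
func buildExportQuery(table string, offset, limit int) (string, []any, error) {
	if !identRe.MatchString(table) {
		return "", nil, fmt.Errorf("invalid table name %q", table)
	}
	q := fmt.Sprintf("SELECT * FROM `%s` LIMIT ? OFFSET ?", table)
	return q, []any{limit, offset}, nil
}

func main() {
	q, args, err := buildExportQuery("orders", 0, 1000)
	fmt.Println(q, args, err)
	_, _, err = buildExportQuery("orders; DROP TABLE x", 0, 1000)
	fmt.Println(err)
}
```

The query and args would then be handed to `database/sql`'s `Query(q, args...)`, letting the driver handle value escaping.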


Copilot AI left a comment


Pull request overview

This PR adds support for two new datasource types (DATABASE and HADOOP) to the dataset management system. The changes include moving SecretKey constants from the internal package to pkg/utils for external accessibility, implementing new data loaders for MySQL databases and HDFS, updating the controller to handle these new types, and modifying the Dockerfile to include necessary dependencies (MySQL client and Hadoop client).

Changes:

  • Added DATABASE and HADOOP as new supported datasource types across API definitions, CRDs, and internal type systems
  • Moved SecretKey type and constants from internal/pkg/datasources to pkg/utils for external code access
  • Implemented ModelDatabaseLoader for MySQL database exports with batch processing and CSV output
  • Implemented ModelHadoopLoader for HDFS data retrieval
  • Updated data-loader Dockerfile to include MySQL 8.4 client, Hadoop 3.4.2 client, and Java runtime dependencies
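The HDFS loader described above shells out to `hdfs dfs -get`. A stdlib-only sketch of that shape (the function names are assumptions, not the PR's actual code; running it for real requires the Hadoop client and JRE that the Dockerfile installs):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// hadoopGetArgs builds the argument vector for fetching an HDFS path
// into a local directory via the Hadoop CLI.
func hadoopGetArgs(src, dst string) []string {
	return []string{"dfs", "-get", src, dst}
}

// fetchFromHDFS runs `hdfs dfs -get <src> <dst>`; it depends on the hdfs
// binary (plus JAVA_HOME/HADOOP_HOME) being present, as set up by
// data-loader.Dockerfile.
func fetchFromHDFS(src, dst string) error {
	cmd := exec.Command("hdfs", hadoopGetArgs(src, dst)...)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("hdfs get failed: %v: %s", err, strings.TrimSpace(string(out)))
	}
	return nil
}

func main() {
	fmt.Println(hadoopGetArgs("hdfs://10.0.0.1:8020/data", "/workspace"))
}
```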

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
pkg/utils/secret_key.go New file defining SecretKey type and constants, moved from internal package for external accessibility
internal/pkg/datasources/types.go Added TypeDatabase and TypeHadoop to supported datasource types
internal/pkg/datasources/datasource_hadoop.go New Hadoop datasource loader implementation for HDFS file retrieval
internal/pkg/datasources/datasource_database.go New MySQL database loader with batch export to CSV functionality
internal/pkg/datasources/credentials.go Updated to use SecretKey from pkg/utils instead of local definition
internal/controller/dataset/dataset_controller.go Added DATABASE and HADOOP to preload support, implemented changeDefinitionForHadoop function for Hadoop-specific job configuration
internal/cmd/dataloader/root.go Added case handlers for DATABASE and HADOOP datasource types
data-loader.Dockerfile Added MySQL client and Hadoop client dependencies, included Java runtime for Hadoop
config/crd/bases/dataset.baizeai.io_datasets.yaml Added DATABASE and HADOOP to CRD enum values and documentation
api/dataset/v1alpha1/dataset_types.go Added DatasetTypeDatabase and DatasetTypeHadoop constants, updated validation enums
Dockerfile Minor formatting change (AS keyword capitalization)


@yyzxw
Contributor

yyzxw commented Feb 5, 2026

Looks basically fine; the main thing is to add more validation in the places the AI flagged.

@usernameisnull
Contributor Author

Looks basically fine; the main thing is to add more validation in the places the AI flagged.

ok

@kebe7jun changed the title from "Add some type" to "Add mysql and hdfs sources support" on Feb 6, 2026
Contributor

@kebe7jun left a comment


lgtm, @yyzxw PTAL.

@yyzxw
Copy link
Contributor

yyzxw commented Feb 6, 2026

/lgtm

@yyzxw merged commit b52fb72 into BaizeAI:main on Feb 9, 2026
3 checks passed
