
Conversation

@usernameisnull
Contributor

  1. Added support for DATABASE and HADOOP.
  2. Moved the shared secret-key definitions into the pkg directory.
  3. Updated the corresponding .proto file and Dockerfile.

@gemini-code-assist

Summary of Changes

Hello @usernameisnull, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly.

This pull request significantly enhances the data loading capabilities by introducing support for DATABASE and HADOOP data sources. It also provides more granular control over resource allocation during data warm-up phases and improves code organization by centralizing secret key definitions. These changes aim to make the system more versatile and robust in handling diverse data integration scenarios.

Highlights

  • New Dataset Types: Added support for 'DATABASE' and 'HADOOP' as new dataset source types, expanding the range of data sources that can be integrated.
  • Resource Specification for Data Warm-Up: Introduced a new field, DataWarmUpResources, within the DatasetSpec to allow users to specify resource requirements (CPU, memory) for data warm-up operations.
  • Refactored Secret Key Management: Moved the definition of common secret keys (e.g., username, password, private key) from an internal package to a shared pkg/utils directory, improving accessibility and reusability.
  • Dockerfile Enhancements: Updated the data-loader.Dockerfile to include multi-stage builds for MySQL and Hadoop clients, along with the necessary Java Runtime Environment (JRE), to support the new data source types.
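As a sketch, a Dataset manifest exercising the new source types and warm-up resources might look like the following. The serialized field names (`dataWarmUpResources`, `source.type`, `source.uri`) are assumptions inferred from this summary, not verified against the actual CRD schema:

```yaml
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: demo-hdfs
spec:
  source:
    type: HADOOP                # new enum value; DATABASE is the other addition
    uri: hdfs://10.0.0.1:8020   # URI format per the comments in dataset_types.go
  dataWarmUpResources:          # assumed JSON name of the new DataWarmUpResources field
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
```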


Changelog
  • Dockerfile
    • Standardized FROM syntax to use AS for builder stage.
  • api/dataset/v1alpha1/dataset_types.go
    • Added DatasetTypeDatabase and DatasetTypeHadoop to the DatasetType enum.
    • Updated kubebuilder:validation:Enum to include DATABASE and HADOOP types.
    • Included URI format examples for DATABASE (database://<ip>:<port>) and HADOOP (hdfs://<ip>:<port>) in comments.
    • Added DataWarmUpResources of type v1.ResourceRequirements to DatasetSpec.
  • config/crd/bases/dataset.baizeai.io_datasets.yaml
    • Added schema definitions for resources (DataWarmUpResources) including claims, limits, and requests.
    • Updated the enum for DatasetSource.Type to include DATABASE and HADOOP.
    • Updated the description for DatasetSource.URI and DatasetSource.Options to reflect the new DATABASE and HADOOP types.
  • data-loader.Dockerfile
    • Introduced multi-stage builds for Hadoop and MySQL clients.
    • Added environment variables for JAVA_HOME and HADOOP_HOME.
    • Installed ca-certificates and Temurin JRE 11.
    • Copied mysql client and Hadoop client binaries/configuration from their respective build stages.
    • Configured hadoop-env.sh and performed cleanup of unnecessary Hadoop files.
  • internal/cmd/dataloader/root.go
    • Integrated NewModelDatabaseLoader and NewModelHadoopLoader into the execCopy function to handle the new data source types.
  • internal/controller/dataset/dataset_controller.go
    • Reordered imports for consistency.
    • Updated supportPreload function to enable preload support for DatasetTypeDatabase and DatasetTypeHadoop.
    • Added logging for unsupported preload types in reconcileJob and reconcileJobStatus.
    • Implemented changeDefinitionForHadoop function to dynamically modify JobSpec for Hadoop datasets, including volume mounts for configuration and environment variables.
  • internal/pkg/datasources/credentials.go
    • Removed local SecretKey type and constants, now importing them from pkg/utils.
    • Updated ReadAndParseSecrets to use the utils.SecretKey type.
  • internal/pkg/datasources/datasource_database.go
    • Added new file implementing ModelDatabaseLoader for DATABASE type.
    • Provides functionality to export data from MySQL databases to CSV, including batch processing and row counting.
  • internal/pkg/datasources/datasource_hadoop.go
    • Added new file implementing ModelHadoopLoader for HADOOP type.
    • Provides functionality to fetch data from HDFS using the hdfs dfs -get command.
  • internal/pkg/datasources/types.go
    • Added TypeDatabase and TypeHadoop to the Type constants and SupportedTypes lists.
  • pkg/utils/secret_key.go
    • Added new file defining SecretKey type and constants for common secret keys, moved from internal/pkg/datasources/credentials.go.
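The relocated secret-key definitions might look roughly like the sketch below. The exact constant set is an assumption (the summary only names username, password, and private key), and in the repository this would live in `pkg/utils` as `package utils`; `package main` is used here only so the sketch runs standalone.

```go
package main

import "fmt"

// SecretKey identifies a well-known key inside a credentials Secret,
// shared between the controller and the data loader.
type SecretKey string

// Constant names and values are assumptions based on the PR summary.
const (
	SecretKeyUsername   SecretKey = "username"
	SecretKeyPassword   SecretKey = "password"
	SecretKeyPrivateKey SecretKey = "privateKey"
)

func main() {
	fmt.Println(SecretKeyUsername, SecretKeyPassword, SecretKeyPrivateKey)
}
```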
Activity
  • The pull request introduces new features and refactors existing code. No specific comments or reviews have been recorded yet.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for DATABASE and HADOOP as new dataset source types, which is a significant enhancement. The changes are extensive, touching API definitions, controller logic, the data loader, and Dockerfiles. I appreciate the refactoring of secret key definitions into a shared pkg directory. However, my review has uncovered several critical and high-severity issues that must be addressed. These include a critical SQL injection vulnerability, a bug in the data-loader.Dockerfile that will break the build, and a lack of validation for required options that could lead to runtime panics. I've provided detailed comments and suggestions for each issue.
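The review above flags a critical SQL injection vulnerability; the detailed comments are not shown here, but the usual mitigation in Go is to whitelist identifiers (which cannot be bound as `?` placeholders) and pass all values as query arguments rather than concatenating them. A minimal illustration with hypothetical names, not the PR's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// identRe accepts only plain identifiers; table names cannot be passed
// as ? placeholders, so they must be validated separately.
var identRe = regexp.MustCompile(`^[A-Za-z0-9_]+$`)

// buildExportQuery returns a parameterized query plus its arguments.
// Values (limit/offset) travel as args, never via string concatenation.
func buildExportQuery(table string, offset, limit int) (string, []any, error) {
	if !identRe.MatchString(table) {
		return "", nil, fmt.Errorf("invalid table name %q", table)
	}
	q := fmt.Sprintf("SELECT * FROM `%s` LIMIT ? OFFSET ?", table)
	return q, []any{limit, offset}, nil
}

func main() {
	q, args, err := buildExportQuery("orders", 0, 1000)
	fmt.Println(q, args, err)
	_, _, err = buildExportQuery("orders; DROP TABLE x", 0, 1000)
	fmt.Println(err)
}
```

The query and args would then be handed to `database/sql`'s `Query(q, args...)`, letting the driver handle value escaping.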


Copilot AI left a comment


Pull request overview

This PR adds support for two new datasource types (DATABASE and HADOOP) to the dataset management system. The changes include moving SecretKey constants from the internal package to pkg/utils for external accessibility, implementing new data loaders for MySQL databases and HDFS, updating the controller to handle these new types, and modifying the Dockerfile to include necessary dependencies (MySQL client and Hadoop client).

Changes:

  • Added DATABASE and HADOOP as new supported datasource types across API definitions, CRDs, and internal type systems
  • Moved SecretKey type and constants from internal/pkg/datasources to pkg/utils for external code access
  • Implemented ModelDatabaseLoader for MySQL database exports with batch processing and CSV output
  • Implemented ModelHadoopLoader for HDFS data retrieval
  • Updated data-loader Dockerfile to include MySQL 8.4 client, Hadoop 3.4.2 client, and Java runtime dependencies
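The HDFS loader described above shells out to `hdfs dfs -get`. A stdlib-only sketch of that shape (the function names are assumptions, not the PR's actual code; running it for real requires the Hadoop client and JRE that the Dockerfile installs):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// hadoopGetArgs builds the argument vector for fetching an HDFS path
// into a local directory via the Hadoop CLI.
func hadoopGetArgs(src, dst string) []string {
	return []string{"dfs", "-get", src, dst}
}

// fetchFromHDFS runs `hdfs dfs -get <src> <dst>`; it depends on the hdfs
// binary (plus JAVA_HOME/HADOOP_HOME) being present, as set up by
// data-loader.Dockerfile.
func fetchFromHDFS(src, dst string) error {
	cmd := exec.Command("hdfs", hadoopGetArgs(src, dst)...)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("hdfs get failed: %v: %s", err, strings.TrimSpace(string(out)))
	}
	return nil
}

func main() {
	fmt.Println(hadoopGetArgs("hdfs://10.0.0.1:8020/data", "/workspace"))
}
```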

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
pkg/utils/secret_key.go New file defining SecretKey type and constants, moved from internal package for external accessibility
internal/pkg/datasources/types.go Added TypeDatabase and TypeHadoop to supported datasource types
internal/pkg/datasources/datasource_hadoop.go New Hadoop datasource loader implementation for HDFS file retrieval
internal/pkg/datasources/datasource_database.go New MySQL database loader with batch export to CSV functionality
internal/pkg/datasources/credentials.go Updated to use SecretKey from pkg/utils instead of local definition
internal/controller/dataset/dataset_controller.go Added DATABASE and HADOOP to preload support, implemented changeDefinitionForHadoop function for Hadoop-specific job configuration
internal/cmd/dataloader/root.go Added case handlers for DATABASE and HADOOP datasource types
data-loader.Dockerfile Added MySQL client and Hadoop client dependencies, included Java runtime for Hadoop
config/crd/bases/dataset.baizeai.io_datasets.yaml Added DATABASE and HADOOP to CRD enum values and documentation
api/dataset/v1alpha1/dataset_types.go Added DatasetTypeDatabase and DatasetTypeHadoop constants, updated validation enums
Dockerfile Minor formatting change (AS keyword capitalization)


@yyzxw
Contributor

yyzxw commented Feb 5, 2026

Looks basically fine; the main thing is to add more validation in the places the AI flagged.

@usernameisnull
Contributor Author

Looks basically fine; the main thing is to add more validation in the places the AI flagged.

ok

@kebe7jun changed the title from "Add some type" to "Add mysql and hdfs sources support" on Feb 6, 2026
Contributor

@kebe7jun left a comment


lgtm, @yyzxw PTAL.

@yyzxw
Copy link
Contributor

yyzxw commented Feb 6, 2026

/lgtm

@yyzxw merged commit b52fb72 into BaizeAI:main on Feb 9, 2026
3 checks passed
