🚀 Cloud2Stm: ML Training and Benchmarking Pipeline on Databricks

📚 Resources

🧠 Description

This project provides a pipeline for training, benchmarking, and deploying machine learning models, with a focus on Databricks integration. It supports multi-target modeling and time series forecasting, and it converts and quantizes trained models for optimized deployment. It runs on both Windows and Windows Subsystem for Linux (WSL).

⭐ Key Features

🧩 Multi-target Modeling: Train models for multiple target variables efficiently in a single run.
📈 Time Series Forecasting: Specialized functionalities for advanced time series prediction tasks.
🏆 Comprehensive Model Benchmarking: Leverages PyCaret (optional) for automatic comparison and selection of top-performing models.
📤 Multi-format Model Export: Save trained models in various formats including pickle, ONNX, TFLite, and Keras.
⚡ Quantization Support: Automatically extracts test data to facilitate model quantization for optimized inference.
📊 Performance Visualization: Generates insightful charts and visualizations of model performance.
⚙️ Feature Engineering Integration: Seamlessly integrates with a feature engineering pipeline to use engineered features for modeling.
🎮 GPU Acceleration: Supports GPU usage (configurable) for faster model training, leveraging technologies like CUDA and cuDNN.
🖥️ Cross-Platform Compatibility: Full support for Windows and Windows Subsystem for Linux (WSL).
☁️ Databricks Integration: Designed to interact with Databricks environments through the databricks_connector_linux.py API connector.

💼 Use Cases

Multi-target modeling: Train models for multiple target variables in one run
Time series forecasting: Advanced features for time series prediction
Model benchmarking: Automatic comparison and selection of top-performing models
Model format conversion: Save models in multiple formats (pickle, ONNX, TFLite, Keras)
Quantization support: Automatic test data extraction for model quantization
WSL compatibility: Full support for Windows Subsystem for Linux
Result visualization: Generate performance visualizations for trained models
Feature engineering integration: Seamless integration with feature engineering pipeline
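As a rough illustration of the multi-target and quantization-data use cases, the sketch below fits one stand-in model per target on a shared train/test split and keeps the held-out rows for later quantization. Every function name here is hypothetical, and the mean-baseline "model" only stands in for whatever estimator the real pipeline trains.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_mean_baseline(train_rows, target):
    """Stand-in 'model' (assumption): predicts the training mean of the target."""
    values = [row[target] for row in train_rows]
    return sum(values) / len(values)

def train_all_targets(rows, target_columns):
    """Fit one model per target on a shared split; return models and test rows."""
    train_rows, test_rows = split_rows(rows)
    models = {t: fit_mean_baseline(train_rows, t) for t in target_columns}
    return models, test_rows

if __name__ == "__main__":
    data = [{"a": i, "b": 2 * i} for i in range(10)]
    models, held_out = train_all_targets(data, ["a", "b"])
    print(sorted(models), len(held_out))
```

For time series targets a chronological split (last rows held out) would usually be preferable to the shuffled split shown here.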

🧱 Technical Stack Highlights

This project utilizes a range of powerful libraries and technologies, including:

Python 3.9+
PyCaret: For automated machine learning and model benchmarking.
TensorFlow/Keras: For building and training deep learning models.
ONNX (Open Neural Network Exchange): For model interoperability and deployment.
Pandas & NumPy: For data manipulation and numerical operations.
Matplotlib & Seaborn: For data visualization.
Protobuf: For efficient data serialization (used internally by TensorFlow).
CUDA/cuDNN: For GPU acceleration in TensorFlow.

📂 Directory Structure

databricks_apis/                         
├── databricks_api_endpoints/
│   ├── ml_training_benchmarking.py   # Main ML pipeline
│   ├── feature_engineering_linux.py  # Feature engineering pipeline for Linux
│   └── ...
├── results/
│   ├── feature_artifacts/            # Feature engineering outputs
│   │   ├── selected_features_data_*  # Feature datasets
│   │   └── ...
│   ├── ml_models/                    # ML model outputs
│   │   ├── HP_CompE21EnergyIn_*/     # Models for specific target variables
│   │   ├── quantization_data/        # Data for model quantization
│   │   └── visualizations/           # Performance charts and visualizations
│   └── feature_metadata.json         # Feature engineering metadata
└── wsl_venv/                         # Python virtual environment for WSL

⚙️ Setup Instructions

🪟 Windows Setup

Clone the repository:

git clone https://github.com/yourusername/databricks_apis.git
cd databricks_apis

Create a virtual environment and install dependencies:

python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt

Run the pipeline:

python databricks_api_endpoints/ml_training_benchmarking.py

🐧 WSL (Windows Subsystem for Linux) Setup

Install WSL if not already installed:

# On Windows PowerShell as Administrator
wsl --install

Launch WSL and navigate to your project directory:

# If your project is on D: drive
cd /mnt/d/College/databricks_apis
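
By default WSL mounts Windows drives under /mnt/<drive letter>, which is why the D: drive appears as /mnt/d above. If you need the same mapping in a script, a minimal sketch could look like the following; the helper name windows_to_wsl_path is hypothetical and is not part of this repository.

```python
from pathlib import PureWindowsPath

def windows_to_wsl_path(windows_path):
    """Translate a drive-letter Windows path to its default WSL mount under /mnt."""
    path = PureWindowsPath(windows_path)
    drive = path.drive.rstrip(":").lower()
    if len(drive) != 1:
        raise ValueError(f"expected a drive-letter path, got {windows_path!r}")
    # path.parts[0] is the anchor (for example 'D:\\'); the rest are directory names.
    return "/".join(["/mnt", drive, *path.parts[1:]])

if __name__ == "__main__":
    print(windows_to_wsl_path(r"D:\College\databricks_apis"))
```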

Create a Python virtual environment in WSL:

python3 -m venv wsl_venv
source wsl_venv/bin/activate
pip install -r requirements.txt
# Install additional packages for full functionality
pip install "pycaret[full]" tensorflow onnx onnxruntime onnxmltools skl2onnx tf2onnx matplotlib seaborn

Run the ML training pipeline in WSL:

python databricks_api_endpoints/ml_training_benchmarking.py

⚙️ Configuration

The pipeline is configured through the ml_config dictionary in ml_training_benchmarking.py. Key options include:

use_case: Type of ML problem (regression, classification, time_series, etc.)
use_pycaret: Toggle PyCaret benchmarking (set to False to save time during development)
target_columns: List of target variables to model
save_quantization_data: Save test data for later model quantization
gpu_enabled: Enable GPU acceleration if available
time_series: Configuration for time series models

Example configuration adjustment:

ml_config["use_pycaret"] = True  # Enable PyCaret for comprehensive model comparison
ml_config["gpu_enabled"] = True  # Enable GPU acceleration if available
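
To make the options concrete, here is a hedged sketch of what a complete ml_config might look like, plus a small sanity check. Only the key names come from this README. The values, the target name (guessed from the results directory prefix), and the validate_config helper are assumptions for illustration.

```python
# Hypothetical shape of ml_config; only the key names are documented above.
ml_config = {
    "use_case": "regression",          # or "classification", "time_series", ...
    "use_pycaret": False,              # skip PyCaret benchmarking during development
    "target_columns": ["HP_CompE21EnergyIn"],  # assumed name, based on the results/ tree
    "save_quantization_data": True,
    "gpu_enabled": False,
    "time_series": {},                 # nested options; their keys are not documented here
}

def validate_config(config):
    """Return a list of problems found in a config dict (empty list means OK)."""
    problems = []
    if not config.get("target_columns"):
        problems.append("target_columns must list at least one target")
    for flag in ("use_pycaret", "save_quantization_data", "gpu_enabled"):
        if not isinstance(config.get(flag), bool):
            problems.append(f"{flag} must be True or False")
    if config.get("use_case") == "time_series" and not config.get("time_series"):
        problems.append("time_series options are required for use_case 'time_series'")
    return problems

if __name__ == "__main__":
    print(validate_config(ml_config))
```

Checking the dictionary before a long training run is a cheap way to catch a missing target list or a mistyped flag early.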

🐧 Linux-specific Files

The repository includes Linux-specific implementations of key functionality:

feature_engineering_linux.py: Feature engineering pipeline optimized for Linux/WSL
databricks_connector_linux.py: Databricks API connector for Linux environments

These files handle platform-specific operations like filesystem interactions and parallel processing optimizations for Linux environments.
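
How the repository itself chooses between platform-specific code paths is not shown here. One common approach, sketched below under that assumption, is to inspect the kernel release string, which contains "microsoft" on WSL. The function name detect_environment is hypothetical.

```python
import platform

def detect_environment(system=None, release=None):
    """Classify the host as 'windows', 'wsl', or 'linux' (else the raw system name)."""
    system = (system or platform.system()).lower()
    release = (release or platform.release()).lower()
    if system == "windows":
        return "windows"
    if system == "linux":
        # WSL kernels report a release string containing "microsoft".
        return "wsl" if "microsoft" in release else "linux"
    return system

if __name__ == "__main__":
    print(detect_environment())
```

The optional arguments exist only so the check can be exercised without actually running on each platform.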

📊 Results

After running the pipeline, results are organized in the results directory:

Model files in various formats (pickle, ONNX, TFLite, Keras)
Performance metrics in JSON format
Visualizations of model performance
Quantization test data for deployment optimization
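
Since metrics are stored as JSON, they can be gathered with a few lines of standard-library Python. The sketch below assumes the metrics live somewhere under results/ml_models; the exact file names and JSON layout are not documented here, so it simply loads every .json file it finds. The function name collect_metrics is hypothetical.

```python
import json
from pathlib import Path

def collect_metrics(results_dir="results/ml_models"):
    """Map each JSON file under results_dir (by relative path) to its parsed contents."""
    root = Path(results_dir)
    metrics = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            metrics[path.relative_to(root).as_posix()] = json.load(handle)
    return metrics

if __name__ == "__main__":
    for name, content in collect_metrics().items():
        print(name, content)
```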

📦 Dependencies

Python 3.9+
TensorFlow/Keras
PyCaret
ONNX Runtime
Pandas/NumPy
Matplotlib/Seaborn

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

▶️ Run Command

pm2 start ecosystem.config.js

or, for local demos:

autogenstudio ui
