Pratham-Jain-3903/BGSW_no_code_tinyml_tool

🚀 Cloud2Stm: ML Training and Benchmarking Pipeline on Databricks

📚 Resources

  1. Proposed Wireframes for Cloud Deployment
  2. Video Demo
  3. Project Presentation

🧠 Description

This project provides a robust pipeline for machine learning model training, benchmarking, and deployment, with a focus on integration with Databricks. It supports various ML use cases, including multi-target modeling and time series forecasting, and facilitates model conversion and quantization for optimized deployment. The system is designed for compatibility with both Windows and Windows Subsystem for Linux (WSL) environments.

⭐ Key Features

  • 🧩 Multi-target Modeling: Train models for multiple target variables in a single run.
  • 📈 Time Series Forecasting: Dedicated functionality for time series prediction tasks.
  • 🏆 Model Benchmarking: Optionally uses PyCaret to compare models automatically and select the top performers.
  • 📤 Multi-format Model Export: Save trained models as pickle, ONNX, TFLite, or Keras files.
  • ⚡ Quantization Support: Automatically extracts test data for use in model quantization, which optimizes inference.
  • 📊 Performance Visualization: Generates charts of model performance.
  • ⚙️ Feature Engineering Integration: Consumes the output of the feature engineering pipeline, so engineered features can be used directly for modeling.
  • 🎮 GPU Acceleration: Configurable GPU support for faster model training through CUDA and cuDNN.
  • 🖥️ Cross-Platform Compatibility: Supports both Windows and Windows Subsystem for Linux (WSL).
  • ☁️ Databricks Integration: Designed to interact with Databricks environments through the databricks_connector_linux.py connector.
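To make the multi-target idea concrete, here is a minimal sketch of fitting one independent model per target column in a single call. It is not taken from ml_training_benchmarking.py: the function names, the single-feature least-squares model, and the column names are all assumptions chosen to keep the example dependency-free.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def train_all_targets(rows, feature, target_columns):
    """Fit one model per target column, reusing the same feature values."""
    xs = [row[feature] for row in rows]
    return {t: fit_line(xs, [row[t] for row in rows]) for t in target_columns}


# Hypothetical data: one feature "x" and two targets.
rows = [
    {"x": 0.0, "energy": 1.0, "temp": 10.0},
    {"x": 1.0, "energy": 3.0, "temp": 9.0},
    {"x": 2.0, "energy": 5.0, "temp": 8.0},
]
models = train_all_targets(rows, "x", ["energy", "temp"])
for target, (slope, intercept) in models.items():
    print(f"{target}: y = {slope:.2f} * x + {intercept:.2f}")
```

Each target gets its own fitted parameters, so results can be stored and compared per target, which matches the per-target model directories described later in this README.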

💼 Use Cases

  • Multi-target modeling: Train models for multiple target variables in one run
  • Time series forecasting: Advanced features for time series prediction
  • Model benchmarking: Automatic comparison and selection of top-performing models
  • Model format conversion: Save models in multiple formats (pickle, ONNX, TFLite, Keras)
  • Quantization support: Automatic test data extraction for model quantization
  • WSL compatibility: Full support for Windows Subsystem for Linux
  • Result visualization: Generate performance visualizations for trained models
  • Feature engineering integration: Seamless integration with feature engineering pipeline
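The reason test data is saved for quantization is that integer quantization needs representative values to choose its value range. The sketch below shows generic 8-bit affine quantization arithmetic in plain Python. It is an illustration of the concept only, under the assumption of a simple min/max calibration; it is not the project's code and does not reproduce the exact scheme any particular converter uses.

```python
def calibrate(samples, num_bits=8):
    """Derive (scale, zero_point, qmin, qmax) from calibration samples."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The representable range is widened to include 0.0 so zero maps exactly.
    lo, hi = min(min(samples), 0.0), max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for constant data
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def quantize(x, scale, zero_point, qmin, qmax):
    """Map a float to a clamped integer code."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical calibration values standing in for saved test data.
calibration = [-1.0, 0.0, 0.5, 1.0]
scale, zero_point, qmin, qmax = calibrate(calibration)
round_trip = [
    dequantize(quantize(x, scale, zero_point, qmin, qmax), scale, zero_point)
    for x in calibration
]
```

If the calibration data does not cover the values seen at inference time, inputs outside the calibrated range are clamped, which is why representative test data matters.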

🧱 Technical Stack Highlights

This project utilizes a range of powerful libraries and technologies, including:

  • Python 3.9+
  • PyCaret: For automated machine learning and model benchmarking.
  • TensorFlow/Keras: For building and training deep learning models.
  • ONNX (Open Neural Network Exchange): For model interoperability and deployment.
  • Pandas & NumPy: For data manipulation and numerical operations.
  • Matplotlib & Seaborn: For data visualization.
  • Protobuf: For efficient data serialization (used internally by TensorFlow).
  • CUDA/cuDNN: For GPU acceleration in TensorFlow.

📂 Directory Structure

databricks_apis/
├── databricks_api_endpoints/
│   ├── ml_training_benchmarking.py   # Main ML pipeline
│   ├── feature_engineering_linux.py  # Feature engineering pipeline for Linux
│   └── ...
├── results/
│   ├── feature_artifacts/            # Feature engineering outputs
│   │   ├── selected_features_data_*  # Feature datasets
│   │   └── ...
│   ├── ml_models/                    # ML model outputs
│   │   ├── HP_CompE21EnergyIn_*/     # Models for specific target variables
│   │   ├── quantization_data/        # Data for model quantization
│   │   └── visualizations/           # Performance charts and visualizations
│   └── feature_metadata.json         # Feature engineering metadata
└── wsl_venv/                         # Python virtual environment for WSL

⚙️ Setup Instructions

🪟 Windows Setup

  1. Clone the repository:

    git clone https://github.com/yourusername/databricks_apis.git
    cd databricks_apis
  2. Create a virtual environment and install dependencies:

    python -m venv venv
    venv\Scripts\activate
    pip install -r requirements.txt
  3. Run the pipeline:

    python databricks_api_endpoints/ml_training_benchmarking.py

🐧 WSL (Windows Subsystem for Linux) Setup

  1. Install WSL if not already installed:

    # On Windows PowerShell as Administrator
    wsl --install
  2. Launch WSL and navigate to your project directory:

    # If your project is on D: drive
    cd /mnt/d/College/databricks_apis
  3. Create a Python virtual environment in WSL:

    python3 -m venv wsl_venv
    source wsl_venv/bin/activate
    pip install -r requirements.txt
    # Install additional packages for full functionality
    pip install "pycaret[full]" tensorflow onnx onnxruntime onnxmltools skl2onnx tf2onnx matplotlib seaborn
  4. Run the ML training pipeline in WSL:

    python databricks_api_endpoints/ml_training_benchmarking.py
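Before running the pipeline, it can help to confirm that the optional packages from the install step are importable in the active environment. The following is a small sketch using only the standard library; the helper name missing_modules is hypothetical, and the import names are assumed to match the pip package names listed above.

```python
from importlib.util import find_spec

# Import names assumed to match the packages installed in the WSL setup step.
OPTIONAL_MODULES = [
    "pycaret", "tensorflow", "onnx", "onnxruntime", "onnxmltools",
    "skl2onnx", "tf2onnx", "matplotlib", "seaborn",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(OPTIONAL_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All optional packages are importable.")
```

The check only locates each module without importing it, so it runs quickly even when heavy packages such as TensorFlow are installed.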

⚙️ Configuration

The pipeline is configured through the ml_config dictionary in ml_training_benchmarking.py. Key options include:

  • use_case: Type of ML problem (regression, classification, time_series, etc.)
  • use_pycaret: Toggle PyCaret integration (set to False to save time during development)
  • target_columns: List of target variables to model
  • save_quantization_data: Save test data for model quantization
  • time_series: Configuration for time series models

Example configuration adjustment:

ml_config["use_pycaret"] = True  # Enable PyCaret for comprehensive model comparison
ml_config["gpu_enabled"] = True  # Enable GPU acceleration if available
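For orientation, a fuller configuration might look like the sketch below. Only the key names come from this README; every value, the example column names, the empty time_series block, and the validate_config helper are assumptions for illustration, so check ml_training_benchmarking.py for the actual defaults and accepted values.

```python
# Key names follow the options listed above; all values are illustrative.
ml_config = {
    "use_case": "regression",
    "use_pycaret": False,            # off to save time during development
    "target_columns": ["target_a", "target_b"],  # hypothetical column names
    "save_quantization_data": True,
    "gpu_enabled": False,
    "time_series": {},               # sub-keys are not documented here
}

# Expected type for each key (an assumption based on the descriptions above).
EXPECTED_TYPES = {
    "use_case": str,
    "use_pycaret": bool,
    "target_columns": list,
    "save_quantization_data": bool,
    "time_series": dict,
}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    if isinstance(config.get("target_columns"), list) and not config["target_columns"]:
        problems.append("target_columns must not be empty")
    return problems


problems = validate_config(ml_config)
```

Validating the dictionary up front surfaces typos in key names before a long training run starts.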

🐧 Linux-specific Files

The repository includes Linux-specific implementations of key functionality:

  • feature_engineering_linux.py: Feature engineering pipeline optimized for Linux/WSL
  • databricks_connector_linux.py: Databricks API connector for Linux environments

These files handle platform-specific operations like filesystem interactions and parallel processing optimizations for Linux environments.
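When a script has to choose between Windows and Linux code paths, one common heuristic is to inspect the kernel release string, which contains "microsoft" under WSL. The sketch below shows that heuristic; the function name detect_platform is hypothetical, and this is not necessarily how the repository's files decide which implementation to use.

```python
import platform


def detect_platform():
    """Return "wsl", or the lowercased OS name such as "windows" or "linux"."""
    system = platform.system()
    # WSL reports itself as Linux, but its kernel release mentions Microsoft.
    if system == "Linux" and "microsoft" in platform.uname().release.lower():
        return "wsl"
    return system.lower() or "unknown"


if __name__ == "__main__":
    print(f"Detected platform: {detect_platform()}")
```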

📊 Results

After running the pipeline, results are organized in the results directory:

  • Model files in various formats (pickle, ONNX, TFLite, Keras)
  • Performance metrics in JSON format
  • Visualizations of model performance
  • Quantization test data for deployment optimization
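Since metrics are written as JSON, they can be gathered with a short script. The sketch below walks a results directory and loads every JSON file it finds. The exact metric file names and their contents are not documented here, so the metrics.json file, its rmse field, and the example_target directory in the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def collect_metrics(results_dir):
    """Load every readable JSON file under results_dir, keyed by relative path."""
    root = Path(results_dir)
    metrics = {}
    for path in sorted(root.rglob("*.json")):
        try:
            content = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # skip files that are not valid JSON
        metrics[path.relative_to(root).as_posix()] = content
    return metrics


# Demo against a throwaway directory that mimics the results layout.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "ml_models" / "example_target"
    model_dir.mkdir(parents=True)
    (model_dir / "metrics.json").write_text(json.dumps({"rmse": 0.5}), encoding="utf-8")
    found = collect_metrics(tmp)
```

Pointed at the real results directory, this would also pick up feature_metadata.json, because it matches the same file pattern.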

📦 Dependencies

  • Python 3.9+
  • TensorFlow/Keras
  • PyCaret
  • ONNX Runtime
  • Pandas/NumPy
  • Matplotlib/Seaborn

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

▶️ Run Command

pm2 start ecosystem.config.js

or

autogenstudio ui

(currently used for local demos)
