This project provides a robust pipeline for machine learning model training, benchmarking, and deployment, with a focus on integration with Databricks. It supports various ML use cases, including multi-target modeling and time series forecasting, and facilitates model conversion and quantization for optimized deployment. The system is designed for compatibility with both Windows and Windows Subsystem for Linux (WSL) environments.
- Multi-target Modeling: Train models for multiple target variables efficiently in a single run.
- Time Series Forecasting: Specialized functionalities for advanced time series prediction tasks.
- Comprehensive Model Benchmarking: Leverages PyCaret (optional) for automatic comparison and selection of top-performing models.
- Multi-format Model Export: Save trained models in various formats including pickle, ONNX, TFLite, and Keras.
- Quantization Support: Automatically extracts test data to facilitate model quantization for optimized inference.
- Performance Visualization: Generates insightful charts and visualizations of model performance.
- Feature Engineering Integration: Seamlessly integrates with a feature engineering pipeline to use engineered features for modeling.
- GPU Acceleration: Supports GPU usage (configurable) for faster model training, leveraging technologies like CUDA and cuDNN.
- Cross-Platform Compatibility: Full support for Windows and Windows Subsystem for Linux (WSL).
- Databricks Integration: Connects to Databricks environments through the API connector in databricks_connector_linux.py.
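
As a rough illustration of the multi-target idea in the feature list, the sketch below loops over a list of target columns and fits one model per target in a single call. It is not the pipeline's actual training code: the "model" is only a mean-value baseline standing in for a real estimator, and the function name, the 80/20 split, and the column names `target_a` and `target_b` are all assumptions made for the example.

```python
from statistics import mean


def train_per_target(rows, target_columns):
    """Fit one stand-in model per target and report its mean absolute error.

    `rows` is a list of dicts, one per sample. The mean-value baseline is a
    placeholder for whatever estimator the real pipeline trains.
    """
    results = {}
    for target in target_columns:
        values = [row[target] for row in rows]
        split = int(len(values) * 0.8)        # simple 80/20 holdout (assumption)
        train, test = values[:split], values[split:]
        prediction = mean(train)              # baseline "model"
        mae = mean(abs(v - prediction) for v in test)
        results[target] = {"prediction": prediction, "mae": mae}
    return results


# Hypothetical data with two made-up target columns.
rows = [{"target_a": float(i), "target_b": 2.0 * i} for i in range(10)]
results = train_per_target(rows, ["target_a", "target_b"])
```

Keeping the per-target results in one dictionary makes it straightforward to write each target's outputs to its own folder afterwards.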
This project uses the following libraries and technologies:
- Python 3.9+
- PyCaret: For automated machine learning and model benchmarking.
- TensorFlow/Keras: For building and training deep learning models.
- ONNX (Open Neural Network Exchange): For model interoperability and deployment.
- Pandas & NumPy: For data manipulation and numerical operations.
- Matplotlib & Seaborn: For data visualization.
- Protobuf: For efficient data serialization (used internally by TensorFlow).
- CUDA/cuDNN: For GPU acceleration in TensorFlow.
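
Before enabling GPU training, it can help to confirm that TensorFlow actually sees a GPU. The sketch below is one possible check, not part of the project: the helper name `gpu_available` is made up, and the one-key `ml_config` is a stand-in for the real configuration. It relies only on `tf.config.list_physical_devices`, and falls back to `False` when TensorFlow is not installed.

```python
def gpu_available():
    """Return True if TensorFlow is installed and reports at least one GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        return False
    return len(tf.config.list_physical_devices("GPU")) > 0


# Hypothetical minimal config; the real ml_config has more keys.
ml_config = {"gpu_enabled": True}

# Only keep GPU training switched on when a GPU is actually visible.
ml_config["gpu_enabled"] = ml_config["gpu_enabled"] and gpu_available()
```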
databricks_apis/
├── databricks_api_endpoints/
│   ├── ml_training_benchmarking.py     # Main ML pipeline
│   ├── feature_engineering_linux.py    # Feature engineering pipeline for Linux
│   └── ...
├── results/
│   ├── feature_artifacts/              # Feature engineering outputs
│   │   ├── selected_features_data_*    # Feature datasets
│   │   └── ...
│   ├── ml_models/                      # ML model outputs
│   │   ├── HP_CompE21EnergyIn_*/       # Models for specific target variables
│   │   ├── quantization_data/          # Data for model quantization
│   │   └── visualizations/             # Performance charts and visualizations
│   └── feature_metadata.json           # Feature engineering metadata
└── wsl_venv/                           # Python virtual environment for WSL

- Clone the repository:

      git clone https://github.com/yourusername/databricks_apis.git
      cd databricks_apis

- Create a virtual environment and install dependencies:

      python -m venv venv
      venv\Scripts\activate
      pip install -r requirements.txt

- Run the pipeline:

      python databricks_api_endpoints/ml_training_benchmarking.py

- Install WSL if not already installed:

      # On Windows PowerShell as Administrator
      wsl --install

- Launch WSL and navigate to your project directory:

      # If your project is on D: drive
      cd /mnt/d/College/databricks_apis

- Create a Python virtual environment in WSL:

      python3 -m venv wsl_venv
      source wsl_venv/bin/activate
      pip install -r requirements.txt
      # Install additional packages for full functionality
      pip install "pycaret[full]" tensorflow onnx onnxruntime onnxmltools skl2onnx tf2onnx matplotlib seaborn

- Run the ML training pipeline in WSL:

      python databricks_api_endpoints/ml_training_benchmarking.py

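
The `/mnt/d/...` path in the steps above follows WSL's default convention of mounting each Windows drive under `/mnt/<drive letter>`. A minimal sketch of that mapping is shown below; the function name is hypothetical, and it assumes the default mount point has not been changed in the WSL configuration.

```python
from pathlib import PureWindowsPath


def windows_to_wsl_path(windows_path):
    """Map a drive-letter Windows path to its default WSL mount location."""
    path = PureWindowsPath(windows_path)
    drive = path.drive.rstrip(":").lower()
    if len(drive) != 1 or not drive.isalpha():
        raise ValueError(f"expected a drive-letter path, got {windows_path!r}")
    # path.parts[0] is the anchor (for example "D:\\"); the rest are folder names.
    return "/".join(["/mnt", drive, *path.parts[1:]])


print(windows_to_wsl_path(r"D:\College\databricks_apis"))
```

Inside a WSL shell, the built-in `wslpath` utility performs the same kind of conversion.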
The pipeline is configured through the ml_config dictionary in ml_training_benchmarking.py. Key options include:
- use_case: Type of ML problem (regression, classification, time_series, etc.)
- use_pycaret: Toggle PyCaret integration (set to False to save time during development)
- target_columns: List of target variables to model
- save_quantization_data: Save test data for model quantization
- time_series: Configuration for time series models
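
To make the options above concrete, here is a sketch of what such a dictionary might look like, together with a small sanity check. Only the key names come from this README; every value, the `REQUIRED_KEYS` set, and the `check_config` helper are illustrative assumptions, and the real `ml_config` in ml_training_benchmarking.py may contain more keys or different defaults.

```python
ml_config = {
    "use_case": "regression",                    # or "classification", "time_series"
    "use_pycaret": False,                        # skip PyCaret benchmarking while developing
    "target_columns": ["target_a", "target_b"],  # made-up column names
    "save_quantization_data": True,              # keep test data for later quantization
    "time_series": {},                           # time series settings; contents not shown here
}

# Hypothetical choice of which keys must always be present.
REQUIRED_KEYS = {"use_case", "use_pycaret", "target_columns"}


def check_config(config):
    """Raise ValueError if required keys are missing or no target is listed."""
    missing = REQUIRED_KEYS.difference(config)
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    if not config["target_columns"]:
        raise ValueError("target_columns must list at least one target")
    return config


check_config(ml_config)
```

Failing early on a malformed configuration is cheaper than discovering the problem partway through a long training run.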
Example configuration adjustment:
    ml_config["use_pycaret"] = True  # Enable PyCaret for comprehensive model comparison
    ml_config["gpu_enabled"] = True  # Enable GPU acceleration if available

The repository includes Linux-specific implementations of key functionality:
- feature_engineering_linux.py: Feature engineering pipeline optimized for Linux/WSL
- databricks_connector_linux.py: Databricks API connector for Linux environments
These files handle platform-specific operations like filesystem interactions and parallel processing optimizations for Linux environments.
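
One way a script could decide whether the Linux-specific modules apply is to inspect the runtime platform. The sketch below is an assumption about how that might be done, not the repository's own logic: it uses the common heuristic that WSL kernels include "microsoft" in their release string, and the names `detect_runtime` and `use_linux_modules` are made up for the example.

```python
import platform


def detect_runtime():
    """Return "wsl" under WSL, otherwise the lowercase OS name (e.g. "windows", "linux")."""
    system = platform.system().lower()
    # WSL reports itself as Linux, but its kernel release string mentions Microsoft.
    if system == "linux" and "microsoft" in platform.uname().release.lower():
        return "wsl"
    return system


runtime = detect_runtime()
use_linux_modules = runtime in {"wsl", "linux"}
```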
After running the pipeline, results are organized in the results directory:
- Model files in various formats (pickle, ONNX, TFLite, Keras)
- Performance metrics in JSON format
- Visualizations of model performance
- Quantization test data for deployment optimization
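
Since the metrics are stored as JSON, they can be gathered for a quick overview after a run. The sketch below is a hedged example: it assumes the metrics live somewhere under `results/ml_models` and simply loads every `.json` file it finds there, because the actual file names are not documented here. The helper name `collect_metrics` is hypothetical.

```python
import json
from pathlib import Path


def collect_metrics(models_dir):
    """Load every JSON file below models_dir, keyed by its relative path."""
    models_dir = Path(models_dir)
    if not models_dir.is_dir():
        return {}
    metrics = {}
    for json_file in sorted(models_dir.rglob("*.json")):
        with json_file.open(encoding="utf-8") as handle:
            key = json_file.relative_to(models_dir).as_posix()
            metrics[key] = json.load(handle)
    return metrics


if __name__ == "__main__":
    # Directory taken from the project layout; adjust if results live elsewhere.
    for name, values in collect_metrics("results/ml_models").items():
        print(name, values)
```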

Requirements:
- Python 3.9+
- TensorFlow/Keras
- PyCaret
- ONNX Runtime
- Pandas/NumPy
- Matplotlib/Seaborn

This project is licensed under the MIT License - see the LICENSE file for details.
pm2 start ecosystem.config.js
or
autogenstudio ui (for local demos today)