This project is a comprehensive desktop application developed in Python with Tkinter/PyQt5, designed for data exploration and comparison of various Machine Learning algorithms. It provides an interactive platform for analyzing datasets, particularly medical data such as the Pima Indians Diabetes dataset.
The main application menu lets users load data files and choose between supervised and unsupervised classification; the analysis interface then lets them select an algorithm and visualize the results.
The application offers a complete pipeline from data preprocessing to model evaluation:
- Loading data files (ARFF format)
- Exploratory data analysis: descriptive statistics, distribution visualization (boxplots, scatter plots)
- Preprocessing: handling missing values, data normalization (Min-Max, Z-score), as sketched below
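A minimal sketch of the loading and normalization steps listed above, using SciPy's ARFF reader and pandas directly. The application's own `data_loader.py` and `normalizer.py` modules presumably encapsulate similar logic; the code here is illustrative, not the project's actual interface.

```python
import pandas as pd
from scipy.io import arff

# Load an ARFF file into a DataFrame (byte strings are decoded for convenience)
raw, meta = arff.loadarff("data/pima_diabetes.arff")
df = pd.DataFrame(raw)
df = df.apply(lambda col: col.str.decode("utf-8") if col.dtype == object else col)

numeric = df.select_dtypes("number")

# Min-Max scaling: rescale each feature to [0, 1]
min_max = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# Z-score normalization: zero mean, unit variance
z_score = (numeric - numeric.mean()) / numeric.std()

print(min_max.describe())
```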
The application implements and compares the performance of the following algorithms:
- K-Nearest Neighbors (KNN)
- Gaussian Naive Bayes
- Decision Tree (with Gain Ratio and Gini Index)
- Neural Network (MLP)
- Support Vector Machine (SVM) with different kernels (RBF, Linear, Polynomial)
- Linear Regression adapted for classification
- Apriori Algorithm for association rule discovery and risk factor identification in data
For model evaluation, the application provides:
- Comprehensive metrics: Precision, Recall, F1-Score, and Accuracy (see the sketch below)
- Visualization: confusion matrices, metric evolution curves (for KNN), and decision boundary plots (for Linear Regression and SVM)
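The following sketch illustrates that kind of comparison with scikit-learn directly: it trains several of the listed classifiers and reports the four metrics on a held-out split. It assumes the last ARFF column is the class label and is not the application's actual code path.

```python
import pandas as pd
from scipy.io import arff
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Load the bundled dataset; the last column is assumed to be the class label
raw, _ = arff.loadarff("data/pima_diabetes.arff")
df = pd.DataFrame(raw)
X = df.iloc[:, :-1].to_numpy(dtype=float)
y, _ = pd.factorize(df.iloc[:, -1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "MLP": MLPClassifier(max_iter=1000, random_state=42),
    "SVM (RBF)": SVC(kernel="rbf"),
}

for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(f"{name:>13}: acc={accuracy_score(y_test, y_pred):.3f}  "
          f"prec={precision_score(y_test, y_pred):.3f}  "
          f"rec={recall_score(y_test, y_pred):.3f}  "
          f"f1={f1_score(y_test, y_pred):.3f}")
```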
The toolkit is built with:
- Python 3
- Tkinter / PyQt5 for the graphical interface
- pandas & NumPy for data manipulation
- scikit-learn for models and metrics
- Matplotlib & Seaborn for visualizations
- SciPy for ARFF file loading
- PyDotPlus & Graphviz for decision tree visualization
Ensure you have Python 3.7+ installed on your system.

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/ML-Analysis-Toolkit-GUI.git
  cd ML-Analysis-Toolkit-GUI
  ```

- (Recommended) Create and activate a virtual environment:

  ```bash
  python -m venv venv
  # On Windows:
  .\venv\Scripts\activate
  # On macOS/Linux:
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Launch the GUI application:

  ```bash
  python main_app.py
  ```
```
ML-Analysis-Toolkit-GUI/
├── main_app.py              # Main application entry point
├── requirements.txt         # Python dependencies
├── src/
│   ├── algorithms/          # ML algorithm implementations
│   │   ├── knn.py
│   │   ├── naive_bayes.py
│   │   ├── decision_tree.py
│   │   ├── neural_network.py
│   │   ├── svm.py
│   │   └── linear_regression.py
│   ├── preprocessing/       # Data preprocessing modules
│   │   ├── data_loader.py
│   │   ├── data_cleaner.py
│   │   └── normalizer.py
│   ├── visualization/       # Visualization modules
│   │   ├── plots.py
│   │   └── metrics_viz.py
│   └── gui/                 # GUI components
│       ├── main_window.py
│       ├── analysis_window.py
│       └── results_window.py
├── data/                    # Sample datasets
│   └── pima_diabetes.arff
├── docs/                    # Documentation and screenshots
│   ├── images/
│   └── user_guide.md
├── tests/                   # Unit tests
├── Rapport_Projet_ML.pdf    # Complete project report
└── README.md                # This file
```
- Launch the application with `python main_app.py`
- Click "Load File" to import your ARFF dataset
- The application will automatically perform an initial data analysis
- Choose from various preprocessing options:
  - Handle missing values (mean, median, or mode imputation)
  - Normalize data (Min-Max scaling, Z-score normalization)
  - Feature selection and engineering
- Navigate to the analysis window
- Select from the available algorithms:
  - Supervised learning: KNN, Naive Bayes, Decision Tree, Neural Network, SVM, Linear Regression
  - Unsupervised learning: Apriori algorithm for association rules
- Configure algorithm parameters through the GUI
- Train the selected model on your dataset
- View comprehensive evaluation metrics and visualizations (see the sketch below):
  - Confusion matrices for classification performance
  - ROC curves and precision-recall curves
  - Decision boundary plots (where applicable)
  - Feature importance analysis
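The confusion matrix and ROC curve listed above can be produced with scikit-learn's display helpers, as in this sketch. It uses a toy dataset and a stand-in classifier rather than the application's own plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.model_selection import train_test_split

# Toy binary dataset standing in for the loaded ARFF data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, ax=axes[0])
RocCurveDisplay.from_estimator(clf, X_test, y_test, ax=axes[1])
plt.tight_layout()
plt.show()
```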
- K-Nearest Neighbors (KNN):
  - Configurable K values
  - Distance metrics (Euclidean, Manhattan, Minkowski)
  - Performance visualization across varying K values
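A minimal sketch of the "performance across K values" curve, plotted with scikit-learn and Matplotlib. The toy dataset stands in for the preprocessed ARFF data; the application's own KNN module may compute this differently.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy binary dataset standing in for the preprocessed ARFF data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

k_values = range(1, 31)
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k, metric="euclidean"),
                          X, y, cv=5).mean()
          for k in k_values]

plt.plot(list(k_values), scores, marker="o")
plt.xlabel("K (number of neighbors)")
plt.ylabel("Mean 5-fold CV accuracy")
plt.title("KNN performance across K values")
plt.show()
```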
- Support Vector Machine (SVM):
  - Multiple kernel support (RBF, Linear, Polynomial)
  - Hyperparameter tuning interface
  - Decision boundary visualization
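A sketch of the decision boundary visualization idea, using an RBF-kernel SVC on two toy features (two dimensions are needed to draw the boundary). This is illustrative only; the GUI's own plotting module handles the real datasets.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Two features only, so the boundary can be drawn in 2-D
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, random_state=1)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# Evaluate the classifier on a grid covering the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.title("SVM (RBF kernel) decision boundary")
plt.show()
```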
- Neural Network (MLP):
  - Customizable network architecture
  - Training progress monitoring
  - Activation function selection
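As a rough illustration of those MLP options with scikit-learn: the hidden-layer sizes and activation are configurable, and `verbose=True` prints the loss per iteration as a simple form of progress monitoring. The architecture shown is arbitrary, not the application's default.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers (16 and 8 units) with ReLU activation
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 8), activation="relu",
                  max_iter=500, verbose=True, random_state=0),
)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))
```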
- Decision Tree:
  - Splitting criteria (Gini Index, Gain Ratio)
  - Tree pruning options
  - Visual tree representation
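A sketch of training and drawing a tree with scikit-learn. Note that scikit-learn's `DecisionTreeClassifier` exposes `gini` and `entropy` criteria but not gain ratio, so this example only shows the built-in options; it is not the project's own decision tree code.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# max_depth acts as a simple form of pre-pruning
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              random_state=0).fit(X, y)

plt.figure(figsize=(12, 6))
plot_tree(tree, filled=True, feature_names=[f"f{i}" for i in range(6)])
plt.show()
```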
- Apriori:
  - Association rule mining
  - Configurable support and confidence thresholds
  - Risk factor identification in medical data
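To show what the support and confidence thresholds mean, here is a toy computation over one-hot "transactions" with made-up risk-factor columns. It only checks item pairs, so it is not the full Apriori algorithm the application implements; it just illustrates the rule-filtering idea.

```python
from itertools import combinations
import pandas as pd

# One-hot "transactions": each row is a patient, each column a binary risk factor
# (purely illustrative values and column names)
data = pd.DataFrame({
    "high_glucose": [1, 1, 0, 1, 1, 0, 1, 1],
    "high_bmi":     [1, 1, 1, 0, 1, 0, 1, 0],
    "diabetes":     [1, 1, 0, 0, 1, 0, 1, 1],
}, dtype=bool)

min_support, min_confidence = 0.4, 0.7

def support(cols):
    """Fraction of rows where every item in `cols` is present."""
    return data[list(cols)].all(axis=1).mean()

# Frequent pairs and the confidence of the rule (a -> b)
for a, b in combinations(data.columns, 2):
    s = support((a, b))
    if s >= min_support:
        conf = s / support((a,))
        if conf >= min_confidence:
            print(f"{a} -> {b}: support={s:.2f}, confidence={conf:.2f}")
```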
The toolkit provides comprehensive model evaluation:

- Classification Metrics:
  - Accuracy
  - Precision
  - Recall
  - F1-Score
  - Specificity
  - Sensitivity
- Visualization Tools:
  - Confusion Matrix
  - ROC Curve
  - Precision-Recall Curve
  - Learning Curves
  - Feature Importance Plots
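Since scikit-learn has no direct specificity/sensitivity functions, they are typically derived from the confusion matrix, as in this small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Illustrative binary labels; positive class = 1
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```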
- Data exploration:
  - Statistical summaries
  - Distribution analysis
  - Correlation matrices
  - Missing value detection
  - Outlier identification
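A compact sketch of those exploration steps with pandas (summary statistics, missing-value counts, correlation matrix, and a simple IQR-based outlier count); the application's `plots.py` presumably produces the graphical versions.

```python
import pandas as pd
from scipy.io import arff

raw, _ = arff.loadarff("data/pima_diabetes.arff")
df = pd.DataFrame(raw).select_dtypes("number")

print(df.describe())        # statistical summary
print(df.isna().sum())      # missing values per column
print(df.corr())            # correlation matrix

# Simple IQR-based outlier count per feature
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outliers = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
print(outliers)
```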
- Preprocessing:
  - Missing value imputation
  - Feature scaling and normalization
  - Categorical variable encoding
  - Feature selection techniques
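One way to combine those preprocessing steps is a scikit-learn pipeline, sketched below on a toy frame with hypothetical column names; the application exposes the same operations through the GUI rather than through this exact code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with hypothetical column names; adapt to the loaded dataset
df = pd.DataFrame({
    "glucose": [148, 85, None, 89, 137],
    "bmi":     [33.6, 26.6, 23.3, None, 43.1],
    "age":     [50, 31, 32, 21, 33],
    "group":   ["a", "b", None, "a", "b"],
})
y = [1, 0, 1, 0, 1]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), ["glucose", "bmi", "age"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), ["group"]),
], sparse_threshold=0)

# Keep the 3 features that score best against the target
pipeline = Pipeline([("prep", preprocess),
                     ("select", SelectKBest(f_classif, k=3))])
print(pipeline.fit_transform(df, y))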
- User-Friendly Interface: Intuitive design with clear navigation
- Real-Time Updates: Live visualization of training progress
- Interactive Plots: Clickable charts and graphs
- Export Functionality: Save results and visualizations
- Theme Support: Light and dark mode options
- Model Comparison: Side-by-side algorithm performance comparison
- Cross-Validation: K-fold cross-validation implementation
- Hyperparameter Tuning: Grid search and random search options (see the sketch below)
- Batch Processing: Analyze multiple datasets simultaneously
- Export Reports: Generate comprehensive analysis reports
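For reference, K-fold cross-validation and grid search look roughly like this with scikit-learn (toy data, arbitrary parameter grid); the toolkit wires equivalent options into the GUI.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# 5-fold cross-validation of a single model
print(cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())

# Grid search over kernel and regularization strength
grid = GridSearchCV(SVC(), {"kernel": ["rbf", "linear", "poly"],
                            "C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```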
Run the test suite to ensure everything is working correctly:

```bash
# Run all tests
python -m pytest tests/

# Run specific test modules
python -m pytest tests/test_algorithms.py
python -m pytest tests/test_preprocessing.py
```

To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request

Contribution guidelines:

- Follow PEP 8 style guidelines
- Add unit tests for new features
- Update documentation for API changes
- Use descriptive commit messages
System requirements:

- Python 3.7 or higher
- Windows 10 / macOS 10.14 / Ubuntu 18.04 or later
- Minimum 4 GB RAM recommended
- 1 GB free disk space
Python dependencies:

```text
tkinter>=8.6
pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=1.0.0
matplotlib>=3.4.0
seaborn>=0.11.0
scipy>=1.7.0
pydotplus>=2.0.2
graphviz>=0.17
```

Note that Tkinter ships with the standard Python installer rather than via pip; the `tkinter>=8.6` line refers to the bundled Tcl/Tk version.
Troubleshooting:

- Graphviz Installation:

  ```bash
  # Windows (using conda)
  conda install graphviz
  # macOS
  brew install graphviz
  # Ubuntu
  sudo apt-get install graphviz
  ```

- ARFF File Loading Issues:
  - Ensure your ARFF file follows the standard format
  - Check for special characters in attribute names
  - Verify data type consistency

- Memory Issues with Large Datasets:
  - Consider data sampling for initial analysis
  - Use chunked processing for very large files
  - Monitor memory usage during processing
If you use this project, please cite it as:
Ayoub Laib (2025), Machine Learning Analysis Toolkit, GitHub repository: https://github.com/aylaib/ML-Analysis-Toolkit-GUI
If you find this project helpful, please consider giving it a star! ⭐
This comprehensive Machine Learning Analysis Toolkit demonstrates advanced skills in data science, software engineering, and GUI development, making it an excellent addition to any data science portfolio.