PRGminer is a bioinformatics tool that uses deep learning to predict and classify plant resistance genes (R-genes). It implements a two-phase prediction approach:
- Phase 1: Binary classification of sequences as R-genes or non-R-genes
- Phase 2: Detailed classification of R-genes into eight distinct categories:
  - CNL (Coiled-coil NBS-LRR)
  - KIN (Kinase)
  - LYK (Lysin Motif Kinase)
  - LECRK (Lectin Receptor Kinase)
  - RLK (Receptor-like Kinase)
  - RLP (Receptor-like Protein)
  - TIR (Toll/Interleukin-1 Receptor)
  - TNL (TIR-NBS-LRR)
- 🧬 Advanced deep learning models for accurate R-gene prediction
- 🔄 Two-phase prediction pipeline
- 📊 Detailed probability scores for each prediction
- 🚀 Fast and efficient processing
- 💻 User-friendly command-line interface
- 📝 Comprehensive output reports
- Linux or macOS operating system
- Python 3.7 or higher
- 4GB RAM (minimum)
- 2GB free disk space
- TensorFlow 2.x
- Keras
- NumPy
- Pandas
- Biopython
- Scikit-learn
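
As a quick sanity check of the requirements listed above, the following sketch tests the Python version and looks for each dependency. The import names are assumptions (Biopython is normally imported as `Bio` and scikit-learn as `sklearn`), and the helper functions are illustrative, not part of PRGminer.

```python
import importlib.util
import sys

# Import names are assumptions: Biopython is imported as "Bio" and
# scikit-learn as "sklearn"; the others match their package names.
REQUIRED_MODULES = ["tensorflow", "keras", "numpy", "pandas", "Bio", "sklearn"]


def python_is_supported(version_info=sys.version_info, minimum=(3, 7)):
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


def missing_modules(names=REQUIRED_MODULES):
    """Return the modules from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version supported:", python_is_supported())
print("Missing modules:", missing_modules() or "none")
```

Running this inside the activated environment should report no missing modules; anything listed points to an incomplete installation.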
Choose one of the following installation methods:
# Clone the repository
git clone https://github.com/navduhan/PRGminer.git
cd PRGminer
# Initialize Git LFS and pull large model files
git lfs install
git lfs pull
# Create and activate conda environment
conda env create -f environment.yml
conda activate PRGminer
# Install the package
pip install .

# Download PRGminer
wget https://kaabil.net/PRGminer/download/PRGminer.tar.gz
tar -xvzf PRGminer.tar.gz
cd PRGminer
# Create and activate environment
conda env create -f environment.yml
conda activate PRGminer
pip install .

# Download and extract
wget https://kaabil.net/PRGminer/download/PRGminer.tar.gz
tar -xvzf PRGminer.tar.gz
cd PRGminer
# Install
pip install .

PRGminer -i <input.fasta> -od <output_directory> -l <prediction_level>

| Argument | Description | Default |
|---|---|---|
| -i, --fasta_file | Input protein sequences in FASTA format | Required |
| -od, --output_dir | Output directory for results | PRGminer_results |
| -l, --level | Prediction level (Phase1 or Phase2) | Phase2 |
| -o, --output_file | Output file name | PRGminer_results.txt |
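
When PRGminer is called from a script, the arguments in the table above can be assembled into a command list. The sketch below uses only the flags documented here; the function name `build_prgminer_command` is hypothetical and not part of the package.

```python
import shlex

VALID_LEVELS = ("Phase1", "Phase2")


def build_prgminer_command(fasta_file, output_dir=None, level=None, output_file=None):
    """Build an argument list for the PRGminer CLI from the documented flags."""
    if level is not None and level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    command = ["PRGminer", "-i", fasta_file]
    if output_dir is not None:
        command += ["-od", output_dir]
    if level is not None:
        command += ["-l", level]
    if output_file is not None:
        command += ["-o", output_file]
    return command


cmd = build_prgminer_command("proteins.fasta", "results_phase1", "Phase1")
# Show the command as it would be typed in a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once PRGminer is installed and on the `PATH`.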
# Phase 1 prediction (R-gene vs non-R-gene)
PRGminer -i proteins.fasta -od results_phase1 -l Phase1
# Phase 2 prediction (detailed classification)
PRGminer -i proteins.fasta -od results_phase2 -l Phase2

PRGminer accepts protein sequences in FASTA format only. Example:
>protein1
MAEGEQVQSGEDLGSPVAQVLQKAREQGAQAAVLVVPPGEEQVQSAEDLGSPVAQVLQKA
>protein2
MTKFTILLFFLSVALASNAQPGCNQSQTLSPNWQNVFGASAASSCP
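
Input files can be checked before a run with a small validator. The sketch below is a minimal example, assuming that only the 20 standard amino acid letters are acceptable; PRGminer's own rules for ambiguous residues (such as `X`) are not documented here, and the function names are illustrative.

```python
# Assumption: only the 20 standard amino acid codes are treated as valid.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def invalid_residues(sequence):
    """Return the sorted characters in `sequence` that are not standard residues."""
    return sorted(set(sequence.upper()) - VALID_RESIDUES)


example = ">protein1\nMAEGEQVQSGEDLGSPVAQ\n>protein2\nMTKFTILLFFLSVALASNAQ\n"
for name, seq in parse_fasta(example):
    problems = invalid_residues(seq)
    print(name, "OK" if not problems else f"invalid characters: {problems}")
```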
PRGminer generates a comprehensive output directory with the following structure:
output_directory/
├── PRGminer.log # Detailed execution log
├── prediction_summary.txt # Summary of prediction results
├── PRGminer_results.txt # Final consolidated results
└── intermediate_files/
    ├── Phase1_predictions.tsv # Phase 1 detailed predictions
    ├── phase2_input.fasta # R-genes identified for Phase 2
    └── Phase2_predictions.tsv # Phase 2 detailed predictions
SampleID Prediction Probability Additional_Info
seq1 Rgene 0.9234 CNL
seq2 Non-Rgene 0.8567 -
seq3 Rgene 0.9876 RLK
PRGminer Prediction Summary
========================
Prediction Level: Phase1
Total sequences analyzed: 100
Class Distribution:
-------------------
Rgene: 35 sequences (35.0000%)
Non-Rgene: 65 sequences (65.0000%)
=================================================
Prediction Level: Phase2
Total sequences analyzed: 35
Class Distribution:
-------------------
CNL: 10 sequences (28.5714%)
RLK: 8 sequences (22.8571%)
TNL: 6 sequences (17.1429%)
...
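
The percentages in the summary follow directly from the class counts. The sketch below reproduces that arithmetic for a list of predicted labels; it illustrates the calculation and is not PRGminer's own reporting code.

```python
from collections import Counter


def class_distribution(predictions):
    """Return (label, count, percentage string) tuples, most frequent first."""
    counts = Counter(predictions)
    total = len(predictions)
    return [
        (label, n, f"{100 * n / total:.4f}%")
        for label, n in counts.most_common()
    ]


# Same class sizes as the Phase1 summary above: 35 R-genes out of 100.
labels = ["Rgene"] * 35 + ["Non-Rgene"] * 65
for label, n, pct in class_distribution(labels):
    print(f"{label}: {n} sequences ({pct})")
```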
SampleID Prediction Rgene Non-Rgene
seq1 Rgene 0.9234 0.0766
seq2 Non-Rgene 0.1433 0.8567
seq3 Rgene 0.9876 0.0124
SampleID Prediction CNL KIN LYK LECRK RLK RLP TIR TNL
seq1 CNL 0.8234 0.0234 0.0156 0.0145 0.0567 0.0234 0.0230 0.0200
seq3 RLK 0.0234 0.0567 0.0145 0.0234 0.7234 0.0890 0.0456 0.0240
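
The Phase 2 table can be read back for downstream analysis. The sketch below assumes the `.tsv` file is tab-separated with the header row shown above; it parses the two example rows from an in-memory string so that it runs without a PRGminer output directory.

```python
import csv
import io

CLASSES = ["CNL", "KIN", "LYK", "LECRK", "RLK", "RLP", "TIR", "TNL"]


def read_phase2_predictions(handle):
    """Return (sample_id, prediction, {class: probability}) tuples from a TSV handle."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        probabilities = {name: float(row[name]) for name in CLASSES}
        rows.append((row["SampleID"], row["Prediction"], probabilities))
    return rows


# Example rows copied from the table above, joined with tabs (an assumption).
example = "\n".join([
    "\t".join(["SampleID", "Prediction"] + CLASSES),
    "\t".join("seq1 CNL 0.8234 0.0234 0.0156 0.0145 0.0567 0.0234 0.0230 0.0200".split()),
    "\t".join("seq3 RLK 0.0234 0.0567 0.0145 0.0234 0.7234 0.0890 0.0456 0.0240".split()),
]) + "\n"

for sample_id, prediction, probabilities in read_phase2_predictions(io.StringIO(example)):
    top_class = max(probabilities, key=probabilities.get)
    print(sample_id, prediction, top_class, probabilities[top_class])
```

For a real run, replace the `io.StringIO` object with `open("output_directory/intermediate_files/Phase2_predictions.tsv", newline="")`.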
- Probability Scores
  - All probabilities are reported with 4 decimal places
  - Values range from 0.0000 to 1.0000 (or 0% to 100%)
  - Higher values indicate stronger predictions
- Prediction Confidence
  - High confidence: > 0.8000 (80%)
  - Medium confidence: 0.6000-0.8000 (60-80%)
  - Low confidence: < 0.6000 (60%)
- Phase-specific Information
  - Phase1: Binary classification (Rgene vs Non-Rgene)
  - Phase2: Multi-class classification into 8 R-gene categories
  - Each phase includes detailed probability distributions
- Log File Details
  - Timestamp for each prediction
  - Processing parameters used
  - Any warnings or errors encountered
  - Performance metrics
- For Phase1:
  - Sequences with Rgene probability > 0.5000 are classified as R-genes
  - Higher probabilities indicate stronger R-gene characteristics
- For Phase2:
  - The highest probability among the 8 classes determines the final prediction
  - Probability distribution shows relative confidence for each class
  - Close probabilities may indicate hybrid or novel R-gene types
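
The confidence bands and decision rules above can be written as a few small functions. This is a sketch for post-processing results, not PRGminer's internal code: the function names are illustrative, the treatment of values exactly at 0.6 and 0.8 as "Medium" is one reading of the stated ranges, and the `margin` used to flag close probabilities is a hypothetical value chosen for the example.

```python
def confidence_level(probability):
    """Map a probability to the High / Medium / Low bands described above."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability > 0.8:
        return "High"
    if probability >= 0.6:
        return "Medium"
    return "Low"


def classify_phase1(rgene_probability, threshold=0.5):
    """Phase 1 rule: an Rgene probability above the threshold means R-gene."""
    return "Rgene" if rgene_probability > threshold else "Non-Rgene"


def classify_phase2(probabilities, margin=0.1):
    """Phase 2 rule: pick the most probable class.

    Also returns True when the top two classes differ by less than `margin`
    (a hypothetical cutoff), which may point to a hybrid or novel type.
    Expects at least two classes in `probabilities`.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best_class, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_class, (best_p - second_p) < margin


print(classify_phase1(0.9234), confidence_level(0.9234))
print(classify_phase2({"CNL": 0.45, "TNL": 0.41, "RLK": 0.14}))
```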
- Processing time depends on:
  - Number of input sequences
  - Sequence lengths
  - Available computational resources
- Recommended batch size: < 1000 sequences
- For large datasets, consider splitting into smaller batches
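
One way to follow the batching advice is to split the input into files of fewer than 1000 sequences each and run PRGminer on every file. The sketch below assumes sequences are already available as `(header, sequence)` pairs; the function names and the `batch_001.fasta` naming scheme are hypothetical.

```python
def split_into_batches(records, batch_size=999):
    """Yield lists containing at most `batch_size` records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def write_fasta(records, path):
    """Write (header, sequence) pairs to `path` in FASTA format."""
    with open(path, "w") as handle:
        for header, sequence in records:
            handle.write(f">{header}\n{sequence}\n")


# Example: 2500 placeholder records are split into three batches.
records = [(f"protein{i}", "MAEGEQVQ") for i in range(2500)]
for index, batch in enumerate(split_into_batches(records), start=1):
    print(f"batch_{index:03d}.fasta would hold {len(batch)} sequences")
    # write_fasta(batch, f"batch_{index:03d}.fasta")  # uncomment to write files
```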
Common issues and solutions:
- Invalid sequence format
  - Ensure sequences are in proper FASTA format
  - Verify sequences contain valid amino acids only
- Memory errors
  - Reduce batch size
  - Close unnecessary applications
  - Increase system swap space
- Installation issues
  - Verify Python version compatibility
  - Check for conflicting dependencies
  - Ensure proper environment activation
If you use PRGminer in your research, please cite:
For bugs and technical issues:
- Create an issue on GitHub
- Email: naveen.duhan@usu.edu
For questions about the methodology:
- Dr. Rakesh Kaundal: rkaundal@usu.edu
- Naveen Duhan: naveen.duhan@usu.edu
PRGminer is released under the GNU General Public License v3.
This work was supported by the Kaundal Bioinformatics Lab at Utah State University.
© 2023 Kaundal Bioinformatics Lab, Utah State University