Skip to content

kren-ai-lab/plastic_degrading_enzymes

Repository files navigation

Discovering potential plastic degrading enzymes using machine learning strategies

This repository contains the source files and supplementary information for the implementations and use cases presented in the work:

David Medina-Ortiz1,2, Diego Alvares-Saravia2,3, Nicole Soto-García2, Diego Sandoval-Vargas1,4, Jacqueline Aldridge2, Sebastián Rodríguez1,4, Bárbara Andrews1,4, Juan A. Asenjo1,4 Anamaría Daza1,4∗
Discovering potential plastic degrading enzymes using machine learning strategies.
https://doi.org/10.1101/2025.02.09.637306

1Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Beauchef 851, Santiago, Chile
2Departamento de Ingeniería En Computación, Universidad de Magallanes, Avenida Bulnes 01855, Punta Arenas, Chile.
2Centro Asistencial de Docencia e Investigación, CADI, Universidad de Magallanes. Av. Los Flamencos 01364, Punta Arenas, Chile.
4Departamento de Ingeniería Química, Biotecnología y Materiales, Universidad de Chile, Beauchef 851, Santiago, Chile
*Corresponding author


Table of Contents


Discovering potential plastic degrading enzymes using machine learning strategies

Plastic pollution presents a critical environmental challenge, necessitating innovative and sustainable solutions. In this context, biodegradation using microorganisms and enzymes offers an environmentally friendly alternative. This work introduces an AI-driven framework that integrates machine learning (ML) and generative models to accelerate the discovery and design of plastic-degrading enzymes. By leveraging pre-trained protein language models and curated datasets, we developed seven ML-based binary classification models to identify enzymes targeting specific plastic substrates, achieving an average accuracy of 89%. The framework was applied to over 6,000 enzyme sequences from the RemeDB to classify enzymes targeting diverse plastics, including PET, PLA, and Nylon. Besides, generative learning strategies combined with trained classification models in this work were applied for \textit{de novo} generation of PET-degrading enzymes. Structural bioinformatics validated potential candidates through \textit{in-silico} analysis, highlighting differences in physicochemical properties between generated and experimentally validated enzymes. Moreover, generated sequences exhibited lower molecular weights and higher aliphatic indices, features that may enhance interactions with hydrophobic plastic substrates. These findings highlight the utility of AI-based approaches in enzyme discovery, providing a scalable and efficient tool for addressing plastic pollution. Future work will focus on experimental validation of promising candidates and further refinement of generative strategies to optimize enzymatic performance.

Requirements

This work was implemented on Python v3.12. To facilitate the deployment and the replication of the work, an environment file was constructed

To create the environment, please run the following command:

conda create -f environment.yml

Moreover, if you want to check all generated embedding and numerical representation strategies, you can find it as a tar.gz files in the shared Google Drive folder.

Implemented pipeline to train classification models

The implemented pipeline to train the classification models is based on our previous work [2]

alt text

Raw data and processing strategies

The folder raw_data contains:

  • enzymes_plastics: pivoted csv file with each plastic-substrate target detected
  • data_sequences: csv file with unique sequences collected from the pivoted dataset
  • reme_db_sequences: csv file with plastic-degrading enzymes extracted from RemeDB [3]

Numerical representation strategies explored in this work

This work explores different numerical representation strategies to process the input enzyme sequences. See the notebooks in folder encoding_approaches for details about the execution.

The encoding strategies including:

  1. Feature engineering
  2. Frequency encoders
  3. k-mers encoders
  4. One-hot encoders
  5. Ordinal encoders
  6. Physicochemical-based encoders
  7. FFT-based encoders
  8. Embeddings throug pre-trained protein language models.
  • In the case of physicochemical-based and FFT-based the physicochemical properties encoders used were extracted from [4].

  • In the case of embedding, the bioembedding library was employed [5].

The numerical representation strategies take the input data input_data, apply the encoder strategy, and generate the outputs with the encoder sequences. The folder processed_dataset contains the results of all encoder strategies explored in this work.

Training strategies applied to develop classification models

This work explores different supervised learning strategies to train binary classification models. The folder train_models has the script implemented for the development of classification models, including:

  • classification_models.py
  • training_model.py
  • run_training_process.py
  • merging_performances_by_plastic.ipynb

The script training_model.py facilitate the training during exploration step for all type of plastic explored in this work.

Besides, in the folder selecting_models you will find scripts for analyzing the results of the exploration performances and for selecting the best combinations of supervised learning algorithms and numerical representation strategies.

Finally, the folder checking_and_characterizing_models facilitates the processing, voting, and genering the trained models to use to classify new sequences.

These models were used for analyzing:

  • Generated sequences throug the pre-trained model ZymCTRL [6]. The folder generating_sequences contain the scripts implemented for the enzyme sequences generation
  • Plastic-degrading enzymes with unknown plastic-substrate target collected from RemeDB [7]

See folder exploring_generated_sequences for more details. These scripts facilitate:

  • Loading enzyme sequences previously encoded
  • Loading models
  • Using the models for predicting with classification and probabilities
  • Filter the classification based on a threshold (default threshold=0.5)

References

  • [1] Dallago, C., Schütze, K., Heinzinger, M., Olenyi, T., Littmann, M., Lu, A. X., ... & Rost, B. (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1(5), e113.
  • [2] Medina-Ortiz, D., Contreras, S., Fernández, D., Soto-García, N., Moya, I., Cabas-Mora, G., & Olivera-Nappa, Á. (2024). Protein language models and machine learning facilitate the identification of antimicrobial peptides. International Journal of Molecular Sciences, 25(16), 8851.
  • [3] Sankara Subramanian, S. H., Balachandran, K. R. S., Rangamaran, V. R., & Gopal, D. (2020). RemeDB: tool for rapid prediction of enzymes involved in bioremediation from high-throughput metagenome data sets. Journal of Computational Biology, 27(7), 1020-1029.
  • [4] Medina-Ortiz, D., Contreras, S., Amado-Hinojosa, J., Torres-Almonacid, J., Asenjo, J. A., Navarrete, M., & Olivera-Nappa, Á. (2022). Generalized property-based encoders and digital signal processing facilitate predictive tasks in protein engineering. Frontiers in Molecular Biosciences, 9, 898627.
  • [5] Dallago, C., Schütze, K., Heinzinger, M., Olenyi, T., Littmann, M., Lu, A. X., ... & Rost, B. (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1(5), e113.
  • [6] Munsamy, G., Lindner, S., Lorenz, P., & Ferruz, N. (2022, December). ZymCTRL: a conditional language model for the controllable generation of artificial enzymes. In NeurIPS Machine Learning in Structural Biology Workshop.
  • [7] Sankara Subramanian, S. H., Balachandran, K. R. S., Rangamaran, V. R., & Gopal, D. (2020). RemeDB: tool for rapid prediction of enzymes involved in bioremediation from high-throughput metagenome data sets. Journal of Computational Biology, 27(7), 1020-1029.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors