This project implements a text-guided diffusion model for generating novel molecules, adapted from the original tgm-dlm repository. It has been significantly modified to operate using the SELFIES (SELF-referencIng Embedded Strings) molecular representation instead of SMILES.
- 98% syntactic validity of generated molecules (SELFIES guarantees valid molecular strings)
- SciBERT-based text guidance for molecule-property alignment
- Automated SELFIES data preprocessing pipeline
- Comprehensive evaluation metrics for quality and diversity
python3 -m venv venv
source venv/bin/activate
Ensure you have requirements.txt in the root directory, then run:
pip install -r requirements.txt
pip install selfies matplotlib rdkit-pypi
Note: Ensure your PyTorch version is compatible with your hardware (CPU/CUDA).
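A quick way to check which device PyTorch will see is a small script like the one below. It is only a sketch: the function name `describe_torch_device` is hypothetical and not part of this repository, and it uses only `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
import importlib.util


def describe_torch_device() -> str:
    """Return a short description of the device PyTorch would use."""
    # find_spec avoids an ImportError when PyTorch is not installed yet.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        return f"torch {torch.__version__} with CUDA ({name})"
    return f"torch {torch.__version__} on CPU only"


if __name__ == "__main__":
    print(describe_torch_device())
```

If this reports CPU only on a machine with a GPU, the installed PyTorch build likely does not match the local CUDA setup.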
This process converts SMILES data into SELFIES, builds the vocabulary, and generates SciBERT embeddings.
Place your data inside:
datasets/SMILES/
Each file should be tab-separated (TSV) with CID, SMILES, and description columns:
train.txt
validation.txt
test.txt
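The expected layout can be checked before conversion with a short reader such as the sketch below. It assumes three tab-separated fields per row and tolerates an optional header row; whether the real files carry a header is not stated here. The `Record` type, the `read_records` helper, and the sample row are illustrative only.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Record(NamedTuple):
    cid: str
    smiles: str
    description: str


def read_records(handle) -> Iterator[Record]:
    """Yield one Record per data row of a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, malformed rows, and a header row if present.
        if len(row) != 3 or row[0].strip().upper() == "CID":
            continue
        yield Record(*(field.strip() for field in row))


# Illustrative in-memory sample; real data lives in datasets/SMILES/.
sample = io.StringIO(
    "CID\tSMILES\tdescription\n"
    "702\tCCO\tEthanol is a primary alcohol.\n"
)
records = list(read_records(sample))
print(records[0].smiles)
```

To read a real file, pass an open handle instead, for example `open("datasets/SMILES/train.txt", newline="", encoding="utf-8")`.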
cd improved-diffusion/scripts/
python convert_smiles_to_selfies.py
Output:
datasets/SELFIES/
├── train.txt
├── test.txt
├── validation.txt
└── selfies_vocab.txt
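A SELFIES string is a sequence of bracketed symbols, so a vocabulary can be built by splitting on brackets and counting symbols. The sketch below shows one way to do that with the standard library only. It is not the code in convert_smiles_to_selfies.py, and the ordering of the vocabulary (most frequent first) is an assumption, as is the actual format of selfies_vocab.txt.

```python
import re
from collections import Counter

# A SELFIES string is a sequence of bracketed symbols such as [C] or [=O];
# "." separates disconnected fragments.
TOKEN_PATTERN = re.compile(r"\[[^\[\]]+\]|\.")


def tokenize(selfies: str) -> list[str]:
    """Split a SELFIES string into its symbols."""
    tokens = TOKEN_PATTERN.findall(selfies)
    # Reject strings with characters outside any bracketed symbol.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens


def build_vocab(selfies_strings) -> list[str]:
    """Return all symbols seen, most frequent first."""
    counts = Counter(tok for s in selfies_strings for tok in tokenize(s))
    # Ties are broken alphabetically so the result is reproducible.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ordered]


vocab = build_vocab(["[C][C][O]", "[C][C][=O]"])
print(vocab)
```

A tokenizer would typically add special symbols (padding, start, end) on top of this list; which ones this project uses is not specified here.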
python clean_selfies_dataset.py
Output: Cleaned .txt files in datasets/SELFIES/.
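The exact checks performed by clean_selfies_dataset.py are not described here. One plausible validity check is a round trip through the `selfies` library: encode the SMILES, decode it back, and drop rows where either step fails. The sketch below assumes the `selfies` package is installed; the helper name `roundtrip` is hypothetical.

```python
from typing import Optional, Tuple


def roundtrip(smiles: str) -> Optional[Tuple[str, str]]:
    """Return (selfies, decoded_smiles), or None if conversion fails."""
    import selfies as sf  # third-party: pip install selfies

    try:
        encoded = sf.encoder(smiles)
        decoded = sf.decoder(encoded) if encoded else None
    except (sf.EncoderError, sf.DecoderError):
        # The molecule cannot be represented or decoded; treat it as unclean.
        return None
    if not encoded or not decoded:
        return None
    return encoded, decoded
```

For example, `roundtrip("CCO")` yields the SELFIES string for ethanol together with the SMILES decoded from it, and a row for which it returns `None` would be a candidate for removal.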
python process_text_selfies.py -i train_val_256
python process_text_selfies.py -i validation_256
python process_text_selfies.py -i test
Output: .pt embedding files in datasets/SELFIES/.
cd improved-diffusion/scripts/
python train_selfies.py
Checkpoints and logs are stored in checkpoints/.
python train_correct_withmask_selfies.py
Used for post-sample repair.
python text_sample_selfies.py \
--model_path ../../checkpoints/<your_model_checkpoint.pt> \
--num_samples 10000 \
--output_file ../../selfies_generation_results.txt
Arguments:
- --model_path: Path to trained model (e.g., ema_0.9999_200000.pt)
- --num_samples: Number of molecules to generate (10k–30k recommended)
- --output_file: Destination for generated SELFIES & SMILES
python evaluate_selfies_generation.py \
--generated ../../selfies_generation_results.txt \
--ground_truth ../../datasets/SELFIES/test.txt \
--output_dir ./evaluation_results
Outputs:
- Console report of all metrics
- evaluation_results/evaluation_metrics.json (Validity, Uniqueness, Novelty, Lipinski Ro5, etc.)
- evaluation_results/plots/ with property_distributions.png and metrics_summary.png
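Definitions of these metrics vary between papers, so the sketch below shows one common convention rather than what evaluate_selfies_generation.py necessarily computes. Validity is taken over all samples, uniqueness over valid samples, and novelty over distinct valid samples relative to a reference set. The function names, the `is_valid` callback (in practice this might wrap an RDKit parse), and the strict form of Lipinski's rule of five are assumptions.

```python
from typing import Callable, Iterable


def generation_metrics(
    generated: Iterable[str],
    reference: Iterable[str],
    is_valid: Callable[[str], bool],
) -> dict[str, float]:
    """Compute validity, uniqueness, and novelty as fractions in [0, 1]."""
    generated = list(generated)
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(reference)
    return {
        # Fraction of all samples that pass the validity check.
        "validity": len(valid) / len(generated) if generated else 0.0,
        # Fraction of valid samples that are distinct.
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        # Fraction of distinct valid samples absent from the reference set.
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def passes_lipinski(mol_weight: float, logp: float,
                    h_donors: int, h_acceptors: int) -> bool:
    """Lipinski's rule of five, applied strictly (no violations allowed)."""
    return (mol_weight <= 500 and logp <= 5
            and h_donors <= 5 and h_acceptors <= 10)


# Toy input: the empty string stands in for an invalid sample.
samples = ["CCO", "CCO", "CCN", ""]
metrics = generation_metrics(samples, reference=["CCN"], is_valid=bool)
print(metrics)
```

String comparison only gives meaningful uniqueness and novelty if molecules are written in a canonical form first; otherwise two spellings of the same molecule count as different.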
If trained with corruption, fix invalid SELFIES using:
python post_sample_selfies.py \
--model_path ../../correction_checkpoints/<your_correction_model.pt> \
--input_file ../../selfies_generation_results_invalid.txt \
--output_file ../../selfies_repaired_results.txt
Key files:
- mytokenizers_selfies.py – SELFIES tokenizer (SELFIESTokenizer)
- mydatasets_selfies.py – Custom ChEBI dataset loader
- convert_smiles_to_selfies.py – SMILES → SELFIES conversion
- clean_selfies_dataset.py – SELFIES validation and cleanup
- process_text_selfies.py – SciBERT embedding generator
- train_selfies.py – Training script
- train_correct_withmask_selfies.py – Corruption/masking variant
- text_sample_selfies.py – Sampling script
- evaluate_selfies_generation.py – Evaluation metrics
- post_sample_selfies.py – Invalid sample repair
This project is based on TGM-DLM (Text-Guided Molecule Generation with Diffusion Language Model), available at https://github.com/Deno-V/tgm-dlm.git. It has been extended to SELFIES-based generation, ensuring 98%+ valid molecules and robust text-conditioned diffusion modeling.