GeneLM: Gene Language Model for Translation Initiation Site Prediction in Bacteria

This repository contains the implementation of GeneLM, a genomic language model for predicting coding sequences (CDS) and refining Translation Initiation Sites (TIS) in bacterial genomes through a two-stage genomic language model pipeline. The package provides the source code of the GeneLM model, usage examples, pre-trained models, fine-tuned models, and a web-based visualization tool.

Training of GeneLM consists of general-purpose pre-training followed by task-specific fine-tuning. The model-training code can be found in the finetune subfolder, and the web tool we developed in the webtool subfolder. Our implementation extends existing transformer-based models and adapts them for genomic sequence analysis.
We evaluated our approach against the widely used gene annotation tool, Prodigal, on an experimentally verified bacterial dataset. The results of this comparison are presented in the image below. You can access our paper here.
NEW: GeneLM can also be executed directly from the command line. This mode provides flexible scripts for running the annotation pipeline on single or multiple FASTA files, with options for CPU, GPU, and HPC (SLURM) environments.
For details, see README.CommandLine.md.
NEW: We are pleased to announce that GeneLM is now available online: http://bioinformatics.um6p.ma/GeneLM. No installation is required: simply upload a FASTA genome or paste a sequence to receive the annotated file (GFF or CSV).
⚠️ Note: The online platform currently runs on CPU only, and submissions are processed through a queue, which may lead to longer processing times for large bacterial genomes.
To streamline gene annotation after model training, we developed a post-processing pipeline that integrates an interactive web interface and an API-based system.
Our web-based annotation tool allows users to submit genome sequences for automatic annotation. It supports two input modes:
- Direct input – Users can paste a genome sequence into the provided text area.
- File upload – Users can upload a FASTA file for processing.
After providing input, users can specify the desired output format (GFF or CSV). Once submitted, the system processes the annotation and generates structured output files. A preview of the interface is shown below:
To speed up the setup process, you can simply run the
webtool/setup-and-run.shscript. This script will automatically create the Python environment, install the necessary dependencies, and start both the API and the web tool services for you. Please make sure that ports 8501 (for the web UI) and 8000 (for the API) are available on your machine. To proceed, make the script executable (chmod +x setup-and-run.sh) and run it (./setup-and-run.sh). If any errors occur during execution, you can still perform the setup manually by following the detailed steps described below.A complete Docker setup is now provided to simplify deployment across any environment with NVIDIA GPU support. Check out the Docker setup and usage instructions in
README.Docker.mdto build from Docker. Or you can get the pre-build image of GeneLM from hub and use it direclty doing:docker pull 13365920/genelm-webtool:latest docker run --gpus all -p 8501:8501 -p 8000:8000 13365920/genelm-webtool:latest
```bash
git clone https://github.com/Bioinformatics-UM6P/GeneLM
cd webtool

# Create and activate a Python virtual environment, then install dependencies
python -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt

# Start the web UI (port 8501) and, in a separate terminal, the API (port 8000)
streamlit run ui/app.py
uvicorn --app-dir api api:app --host 127.0.0.1 --port 8000 --reload
```

Navigate to the web tool and submit a FASTA/FNA file containing your full genome sequence. The results should look like this:
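Once the API service is running, annotations can also be requested programmatically. The sketch below is only an illustration: the endpoint name (/annotate) and the form fields (file, format) are assumptions, not the documented routes; check the code in webtool/api for the actual API.

```python
# Minimal sketch of calling the local GeneLM API with `requests`.
# NOTE: the "/annotate" endpoint and the "file"/"format" fields are
# hypothetical; see webtool/api for the real routes and parameters.
import requests

with open("genome.fasta", "rb") as fasta:
    response = requests.post(
        "http://127.0.0.1:8000/annotate",          # hypothetical endpoint
        files={"file": ("genome.fasta", fasta)},   # FASTA upload
        data={"format": "gff"},                    # requested output format
    )

response.raise_for_status()
with open("genome.gff", "wb") as out:
    out.write(response.content)  # save the annotated output
```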
Depending on whether you want to classify CDS or TIS, you can download the corresponding model from Hugging Face and use it through the Hugging Face API. To load our GeneLM CDS-CLASSIFIER model, you can use the transformers library:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned CDS classifier and its tokenizer from Hugging Face
model_checkpoint = "Genereux-akotenou/BacteriaCDS-DNABERT-K6-89M"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```

Inference Example: This model works with 6-mer tokenized sequences, so you need to convert raw DNA sequences into k-mer format:
```python
def generate_kmer(sequence: str, k: int, overlap: int = 1):
    # Slide a window of size k across the sequence, advancing by `overlap`
    # bases each step (k=6, overlap=3 gives 6-mers overlapping by 3 bases),
    # and join the k-mers with spaces as expected by the tokenizer.
    return " ".join([sequence[j:j+k] for j in range(0, len(sequence) - k + 1, overlap)])

sequence = "ATGAGAACCAGCCGGAGACCTCCTGCTCGTACATGAAAGGCTCGAGCAGCCGGGCGAGGGCGGTAG"
seq_kmer = generate_kmer(sequence, k=6, overlap=3)

# Run inference
inputs = tokenizer(
    seq_kmer,
    return_tensors="pt",
    max_length=tokenizer.model_max_length,
    padding="max_length",
    truncation=True,
)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=-1).item()
```

This gives the first-stage classification output, which you can refine using the second-stage classifier. See instructions here: Loading model for second stage.
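As a rough illustration of that second stage, the sketch below scores candidate start codons of a sequence flagged as coding in the first stage. It reuses torch, generate_kmer, and sequence from the snippets above; the checkpoint name, the candidate scan, and the context window are all assumptions for illustration, not the pipeline's actual scheme, so follow the linked instructions for the real model and preprocessing.

```python
# Sketch only: refine the TIS of a first-stage positive prediction.
# The checkpoint name below is hypothetical; see the linked instructions
# for the actual second-stage model and its expected input format.
tis_checkpoint = "Genereux-akotenou/BacteriaTIS-DNABERT-K6-89M"  # assumed name
tis_model = AutoModelForSequenceClassification.from_pretrained(tis_checkpoint)
tis_tokenizer = AutoTokenizer.from_pretrained(tis_checkpoint)

# Scan in-frame candidate start codons (illustrative, not the exact scheme)
candidates = [i for i in range(0, len(sequence) - 2, 3)
              if sequence[i:i+3] in ("ATG", "GTG", "TTG")]
for start in candidates:
    window = sequence[max(0, start - 30):start + 33]  # context around the start
    enc = tis_tokenizer(generate_kmer(window, k=6, overlap=3),
                        return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        # assumes a binary classifier where label 1 means "true TIS"
        score = torch.softmax(tis_model(**enc).logits, dim=-1)[0, 1].item()
    print(f"candidate start at {start}: TIS probability {score:.3f}")
```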
If you are interested in the fine-tuning code pipeline or the data used in this process, all relevant materials can be found in the finetune folder of this repository.
Within the finetune/ directory, you will find two key subfolders:

- data-pipeline/ – scripts and preprocessing workflows for preparing training data. It includes:
  - Data collection and formatting procedures
  - Preprocessing scripts to clean and structure genomic sequences
- train-pipeline/ – all necessary scripts for training and fine-tuning the model. It includes:
  - Model configuration files
  - Training scripts for executing fine-tuning using preprocessed data
  - Hyperparameter settings and training logs for reproducibility
The fine-tuning process involves:
- Data Preparation – Formatting raw genomic sequences into a structured dataset.
- Model Training – Using the preprocessed data to fine-tune a pre-trained model.
- Evaluation – Assessing performance using validation datasets and benchmark comparisons.
- Result Analysis – Generating reports and metrics to analyze model effectiveness.
By following the resources in this directory, users can replicate or extend the fine-tuning process for their specific use cases; a minimal sketch of the training step is shown below.
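As an illustration of the Model Training step, here is a minimal fine-tuning sketch using the Hugging Face Trainer. The CSV file names, column layout, and hyperparameters are assumptions for illustration; the actual scripts and settings live in finetune/train-pipeline/.

```python
# Minimal fine-tuning sketch (assumed file names and hyperparameters;
# see finetune/train-pipeline/ for the actual configuration).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "Genereux-akotenou/BacteriaCDS-DNABERT-K6-89M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Assumed CSV layout: a "sequence" column of space-separated 6-mers
# (see generate_kmer above) and an integer "label" column.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

def tokenize(batch):
    return tokenizer(batch["sequence"], padding="max_length", truncation=True)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="genelm-finetuned",
        num_train_epochs=3,              # illustrative values
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```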
If you have used GeneLM in your research, please cite the following publication:
```bibtex
@article{10.1093/bib/bbaf311,
  author  = {Akotenou, Genereux and El Allali, Achraf},
  title   = {Genomic language models (gLMs) decode bacterial genomes for improved gene prediction and translation initiation site identification},
  journal = {Briefings in Bioinformatics},
  volume  = {26},
  number  = {4},
  pages   = {bbaf311},
  year    = {2025},
  month   = {07},
  issn    = {1477-4054},
  doi     = {10.1093/bib/bbaf311},
  url     = {https://doi.org/10.1093/bib/bbaf311},
  eprint  = {https://academic.oup.com/bib/article-pdf/26/4/bbaf311/63649237/bbaf311.pdf},
}
```