Commit 516d29b

Add citation information
1 parent efdab3d commit 516d29b

File tree

3 files changed

+379
-0
lines changed

CITATION.bib

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
1+
@article{202511.2140,
2+
doi = {10.20944/preprints202511.2140.v1},
3+
url = {https://doi.org/10.20944/preprints202511.2140.v1},
4+
year = 2025,
5+
month = {November},
6+
publisher = {Preprints},
7+
author = {Alexandr Serdiukov and Vitaliy Dragvelis and Daniil Smutin and Amir Taldaev and Sergey Muravyov},
8+
title = {Efficient and Verified Extraction of the Research Data Using LLM},
9+
journal = {Preprints}
10+
}

CITATION.cff

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
cff-version: 1.2.0
2+
message: "If you use this software, please cite it using the metadata from this file."
3+
title: "PROBEst"
4+
version: "0.1.4"
5+
doi: "10.20944/preprints202511.2140.v1"
6+
date-released: "2025-11-01"
7+
authors:
8+
- family-names: "Serdiukov"
9+
given-names: "Alexandr"
10+
- family-names: "Dragvelis"
11+
given-names: "Vitaliy"
12+
- family-names: "Smutin"
13+
given-names: "Daniil"
14+
- family-names: "Taldaev"
15+
given-names: "Amir"
16+
- family-names: "Muravyov"
17+
given-names: "Sergey"
18+
keywords:
19+
- "nucleotide probes"
20+
- "bioinformatics"
21+
- "probe generation"
22+
- "BLASTn"
23+
- "evolutionary algorithm"
24+
license: MIT
25+
repository-code: "https://github.com/CTLab-ITMO/PROBEst"
26+
abstract: "PROBEst is a tool designed for generating nucleotide probes with specified properties, leveraging advanced algorithms and AI-driven techniques to ensure high-quality results."
27+
preferred-citation:
28+
type: article
29+
title: "Efficient and Verified Extraction of the Research Data Using LLM"
30+
authors:
31+
- family-names: "Serdiukov"
32+
given-names: "Alexandr"
33+
- family-names: "Dragvelis"
34+
given-names: "Vitaliy"
35+
- family-names: "Smutin"
36+
given-names: "Daniil"
37+
- family-names: "Taldaev"
38+
given-names: "Amir"
39+
- family-names: "Muravyov"
40+
given-names: "Sergey"
41+
journal: "Preprints"
42+
doi: "10.20944/preprints202511.2140.v1"
43+
year: 2025
44+
month: 11

README.md

Lines changed: 325 additions & 0 deletions
@@ -285,6 +285,331 @@ graph LR
285285
- For developers: use `pytest`
286286

287287

288+
# PROBEst v.0.1.4. <a href=""><img src="img/probest_logo.jpg" align="right" width="150" ></a>
289+
### St. Petersburg tool for generating nucleotide probes with specified properties
290+
291+
[![python package](https://github.com/CTLab-ITMO/PROBEst/actions/workflows/python-package.yml/badge.svg)](https://github.com/CTLab-ITMO/PROBEst/actions/workflows/python-package.yaml)
292+
293+
<span style="color: red">**Warning**:</span> the tool is under active development.
294+
295+
**PROBEst** is a tool designed for generating nucleotide probes with specified properties, leveraging advanced algorithms and AI-driven techniques to ensure high-quality results. The tool is particularly useful for researchers and bioinformaticians who require probes with tailored universality and specificity for applications such as PCR, hybridization, and sequencing. By integrating a wrapped evolutionary algorithm, PROBEst optimizes probe generation through iterative refinement, ensuring that the final probes meet stringent biological and computational criteria.
296+
297+
At the core of PROBEst is an AI-enhanced workflow that combines Primer3 for initial oligonucleotide generation, BLASTn for specificity and universality checks, and a mutation module for probe optimization. The tool allows users to input target sequences, select reference files for universality and specificity validation, and customize layouts for probe design. The evolutionary algorithm iteratively refines the probes by introducing mutations and evaluating their performance, ensuring that the final output is both specific to the target and universally applicable across related sequences.
298+
299+
300+
# Download and installation
301+
302+
## Installation
303+
304+
```bash
305+
git clone https://github.com/CTLab-ITMO/PROBEst.git
306+
cd PROBEst
307+
conda env create -f environment.yml
308+
conda activate probest
309+
python setup.py install
310+
```
311+
312+
### Validate installation
313+
314+
```bash
315+
bash test_generator.sh
316+
```
317+
318+
## Usage
319+
320+
### Preparation
321+
322+
`pipeline.py` relies on pre-prepared BLASTn databases. To create the required `true_base`, `false_base`, and `contig_table`, you can use the following script:
323+
324+
```bash
325+
bash scripts/generator/prep_db.sh \
326+
-n {DATABASE_NAME} \
327+
-c {CONTIG_TABLE} \
328+
-t {TMP_DIR} \
329+
[FASTA]
330+
```
331+
332+
#### Arguments:
333+
- `-n DATABASE_NAME`: Name of the output BLAST database (required).
334+
- `-c CONTIG_TABLE`: Output file to store contig names and their corresponding sequence headers (required).
335+
- `-t TMP_DIR`: Temporary directory for intermediate files (optional, defaults to `./.tmp`).
336+
- `FASTA`: List of input FASTA files (gzipped or uncompressed).
337+
338+
### Generation
339+
340+
PROBEst can be run using the following command:
341+
342+
```bash
343+
python pipeline.py \
344+
-i {INPUT} \
345+
-tb {TRUE_BASE} \
346+
-fb [FALSE_BASE ...] \
347+
-c {CONTIG_TABLE} \
348+
-o {OUTPUT}
349+
```
350+
351+
The **BLASTn databases** and the **contig table** are produced by `prep_db.sh`.
352+
353+
#### Key arguments:
354+
- `-i INPUT`: Input FASTA file for probe generation.
355+
- `-tb TRUE_BASE`: Path to the input BLASTn database used for primer adjustment.
356+
- `-fb FALSE_BASE`: Input BLASTn database path for non-specific testing.
357+
- `-c CONTIG_TABLE`: .tsv table with BLAST database information.
358+
- `-o OUTPUT`: Output path for results.
359+
- `-t THREADS`: Number of threads to use.
360+
- `-a ALGORITHM`: Algorithm for probe generation (`FISH` or `primer`).
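
When `pipeline.py` is launched from another script, it can help to assemble the command from the flags listed above. The following is a minimal sketch, not part of PROBEst: the helper name `build_pipeline_command` and all file paths are hypothetical, and only the flags documented in this section are used.

```python
import shlex


def build_pipeline_command(input_fasta, true_base, false_bases, contig_table,
                           output, threads=None, algorithm=None):
    """Assemble an argv list for pipeline.py from the documented flags."""
    cmd = [
        "python", "pipeline.py",
        "-i", input_fasta,
        "-tb", true_base,
        "-fb", *false_bases,   # -fb accepts one or more database paths
        "-c", contig_table,
        "-o", output,
    ]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if algorithm is not None:
        # The README lists exactly these two values for -a.
        if algorithm not in ("FISH", "primer"):
            raise ValueError(f"unknown algorithm: {algorithm!r}")
        cmd += ["-a", algorithm]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_pipeline_command(
    "target.fasta", "db/true_base", ["db/false_1", "db/false_2"],
    "db/contigs.tsv", "results", threads=4, algorithm="FISH",
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting issues with paths that contain spaces.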
361+
362+
For a full list of arguments, run:
363+
364+
```bash
365+
python pipeline.py --help
366+
```
367+
368+
A grid search is implemented for parameter selection. Specify the parameters in a JSON file (for example, `data/test/general/param_grid_light.json`) and run:
369+
370+
```bash
371+
python test_parameters.py \
372+
-p {JSON}
373+
```
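
To illustrate what a grid search enumerates, here is a minimal sketch that expands a parameter grid into every combination of values. It assumes a flat `{name: [values]}` layout; the actual schema of `param_grid_light.json` is not shown here, and the parameter names `iterations` and `mutation_rate` are hypothetical.

```python
import itertools
import json


def expand_grid(grid):
    """Yield one dict per combination of values in a {name: [values]} grid."""
    names = sorted(grid)  # fixed order keeps the output reproducible
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical grid; real parameter names may differ.
grid = json.loads('{"iterations": [3, 5], "mutation_rate": [0.05, 0.1, 0.2]}')
combos = list(expand_grid(grid))
print(len(combos), "parameter combinations")
for combo in combos:
    print(combo)
```

The number of runs is the product of the list lengths, so a light grid keeps the search affordable.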
374+
375+
376+
# Algorithm
377+
378+
## Algorithm Steps
379+
380+
0. **Prepare BLASTn databases**
381+
382+
1. **Select File for Probe Generation** (`INPUT`)
383+
384+
2. **Select Files for Universality Check** (`TRUE_BASE`)
385+
386+
3. **Select Files for Specificity Check** (`FALSE_BASE`)
387+
388+
4. **Select Layouts and Run Wrapped Evolutionary Algorithm** (`pipeline.py`)
389+
390+
a. **Primer3 Generation**
391+
392+
b. **BLASTn Check**
393+
394+
c. **Parsing**
395+
396+
d. **Mutation in Probe**
397+
398+
e. **AI corrections**
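
The mutate, evaluate, and select cycle in step 4 can be sketched as a toy loop. This is an illustration under stated assumptions, not the PROBEst implementation: the fitness function is a placeholder (GC content close to 50%) standing in for the BLASTn-based checks, only substitutions are modeled, and the names `mutate`, `evolve`, and `gc_score` are hypothetical.

```python
import random

BASES = "ACGT"


def mutate(probe, rate, rng):
    """Substitute each base with probability `rate` (substitutions only)."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < rate else base
        for base in probe
    )


def evolve(probes, score, generations=5, offspring=4, rate=0.1, keep=3, seed=0):
    """Toy evolutionary loop: mutate, score, keep the best `keep` probes."""
    rng = random.Random(seed)
    population = list(probes)
    for _ in range(generations):
        children = [mutate(p, rate, rng) for p in population for _ in range(offspring)]
        # Parents stay in the pool, so the best score never decreases.
        pool = set(population) | set(children)
        # Ties are broken by sequence to keep the result deterministic.
        population = sorted(pool, key=lambda p: (score(p), p), reverse=True)[:keep]
    return population


def gc_score(probe):
    """Placeholder fitness: higher when the GC fraction is closer to 0.5."""
    gc = sum(base in "GC" for base in probe) / len(probe)
    return -abs(gc - 0.5)


result = evolve(["AAAAAAAAAA", "ATATATATAT"], gc_score)
print(result)
```

In the real pipeline the score would come from the universality and specificity checks against `TRUE_BASE` and `FALSE_BASE` rather than from sequence composition.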
399+
400+
```mermaid
401+
---
402+
config:
403+
layout: elk
404+
look: classic
405+
---
406+
%%{init: {
407+
'theme': 'base',
408+
'themeVariables': {
409+
'fontFamily': 'arial',
410+
'fontSize': '16px',
411+
'primaryColor': '#fff',
412+
'primaryBorderColor': '#FFAC1C',
413+
'primaryTextColor': '#000',
414+
'lineColor': '#000',
415+
'secondaryColor': 'white',
416+
'tertiaryColor': '#fff',
417+
'subgraphBorderStyle': 'dotted'
418+
},
419+
'flowchart': {
420+
'curve': 'monotoneY',
421+
'padding': 15
422+
}
423+
}}%%
424+
425+
graph LR
426+
subgraph inputs
427+
A
428+
A1
429+
T1
430+
T3
431+
end
432+
433+
A([Initial probe generation]):::input -- primer3 --> B2(initial probe set):::probe
434+
A -- oligominer --> B2
435+
A1([Custom probes]):::input --> B2
436+
437+
T1([Target sequences]):::input -- blastn-db --> T2[(target)]
438+
T3([Offtarget sequences]):::input -- blastn-db --> T4[(offtarget)]
439+
440+
subgraph database
441+
T2
442+
T4
443+
end
444+
445+
T2 --> EA
446+
T4 --> EA
447+
B2 --> EA
448+
449+
EA[evolutionary algorithm] --> T11(results):::probe
450+
451+
classDef empty width:0px,height:0px;
452+
classDef input fill:#90EE9020,stroke:#fff,stroke-width:2px,shape:ellipse;
453+
classDef probe fill:#FFAC1C20,stroke:#fff,stroke-width:2px;
454+
```
455+
456+
```mermaid
457+
---
458+
config:
459+
layout: elk
460+
look: classic
461+
---
462+
%%{init: {
463+
'layout': 'elk',
464+
'theme': 'base',
465+
'themeVariables': {
466+
'fontFamily': 'arial',
467+
'fontSize': '16px',
468+
'primaryColor': '#fff',
469+
'primaryBorderColor': '#FFAC1C',
470+
'primaryTextColor': '#000',
471+
'lineColor': '#000',
472+
'secondaryColor': 'white',
473+
'tertiaryColor': '#fff',
474+
'subgraphBorderStyle': 'dotted'
475+
},
476+
'flowchart': {
477+
'curve': 'monotoneY',
478+
'padding': 15
479+
}
480+
}}%%
481+
482+
graph LR
483+
subgraph evolutionary algorithm
484+
subgraph hits
485+
TP
486+
TN
487+
end
488+
489+
B(probe set):::probe --> TP[target]
490+
B --> TN[offtarget]
491+
B1 -- mutations --> B
492+
493+
TP -- coverage --> T6[universality]
494+
TP -- duplications --> T7[multimapping]
495+
TN ---> T8[specificity]
496+
497+
subgraph check
498+
T6
499+
T7
500+
T8
501+
M1
502+
end
503+
504+
B --- E6[ ]:::empty --> M1[modeling]
505+
TP --- E6
506+
507+
M1 --- E3[ ]:::empty
508+
T6 --- E3
509+
T7 --- E3
510+
T8 --- E3
511+
E3 -- quality prediction --> B1(filtered probe set):::probe
512+
end
513+
B1 --> T11(results):::probe
514+
515+
classDef empty width:0px,height:0px;
516+
classDef input fill:#90EE9020,stroke:#fff,stroke-width:2px,shape:ellipse;
517+
classDef probe fill:#FFAC1C20,stroke:#fff,stroke-width:2px;
518+
```
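
The three checks in the diagram can be expressed as simple counts over BLASTn hits. The definitions below are assumptions made for illustration, not taken from the PROBEst source: universality is the fraction of target sequences hit at least once, multimapping is the number of targets hit more than once, and specificity is the fraction of off-target sequences left unhit.

```python
from collections import Counter


def probe_metrics(target_hits, offtarget_hits, n_targets, n_offtargets):
    """Summarise hits for one probe.

    target_hits / offtarget_hits: subject sequence IDs, one entry per hit.
    n_targets / n_offtargets: number of sequences in each database.
    """
    per_target = Counter(target_hits)
    universality = len(per_target) / n_targets if n_targets else 0.0
    multimapping = sum(1 for count in per_target.values() if count > 1)
    hit_offtargets = len(set(offtarget_hits))
    specificity = 1.0 - hit_offtargets / n_offtargets if n_offtargets else 1.0
    return {
        "universality": universality,
        "multimapping": multimapping,
        "specificity": specificity,
    }


# Hypothetical hit lists for a single probe.
metrics = probe_metrics(
    target_hits=["g1", "g2", "g2", "g3"],
    offtarget_hits=["x1"],
    n_targets=4,
    n_offtargets=4,
)
print(metrics)
```

A probe that covers every target once and no off-target sequence would score 1.0, 0, and 1.0 under these definitions.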
519+
520+
521+
## Project Structure
522+
523+
```mermaid
524+
---
525+
config:
526+
theme: neutral
527+
look: classic
528+
---
529+
%%{init: {
530+
'theme': 'base',
531+
'themeVariables': {
532+
'fontFamily': 'arial',
533+
'fontSize': '16px',
534+
'primaryColor': '#fff',
535+
'primaryBorderColor': '#FFAC1C',
536+
'primaryTextColor': '#000',
537+
'lineColor': '#000',
538+
'secondaryColor': '#90EE90',
539+
'tertiaryColor': '#fff',
540+
'subgraphBorderStyle': 'dotted'
541+
},
542+
'flowchart': {
543+
'curve': 'monotoneY',
544+
'padding': 15
545+
}
546+
}}%%
547+
548+
graph LR
549+
PROBEst([PROBEst]) --> src[src/]
550+
PROBEst --> scripts[scripts/]
551+
PROBEst --> tests[tests/]
552+
553+
subgraph folders
554+
src
555+
scripts
556+
tests
557+
end
558+
559+
src --> C[benchmarking]
560+
src --> A[generation]
561+
tests --> A
562+
563+
scripts --> D[preprocessing]
564+
scripts --> B[database parsing]
565+
D --> A
566+
```
567+
568+
# Testing
569+
570+
- To check the installation: `bash test_generator.sh`
571+
572+
- For developers: use `pytest`
573+
574+
575+
# Citation
576+
577+
If you use the PROBEst LLM pipeline for research data extraction, please cite:
578+
579+
**BibTeX:**
580+
```bibtex
581+
@article{202511.2140,
582+
doi = {10.20944/preprints202511.2140.v1},
583+
url = {https://doi.org/10.20944/preprints202511.2140.v1},
584+
year = 2025,
585+
month = {November},
586+
publisher = {Preprints},
587+
author = {Alexandr Serdiukov and Vitaliy Dragvelis and Daniil Smutin and Amir Taldaev and Sergey Muravyov},
588+
title = {Efficient and Verified Extraction of the Research Data Using LLM},
589+
journal = {Preprints}
590+
}
591+
```
592+
593+
**Plain text:**
594+
Serdiukov, A., Dragvelis, V., Smutin, D., Taldaev, A., & Muravyov, S. (2025). Efficient and Verified Extraction of the Research Data Using LLM. *Preprints*. https://doi.org/10.20944/preprints202511.2140.v1
595+
596+
597+
# License
598+
599+
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
600+
601+
# Contribution
602+
603+
We welcome contributions from the community!
604+
605+
606+
Please read the [Contribution Guidelines](CONTRIBUTING.md) for more details.
607+
608+
# Wiki
609+
610+
The tool has its own <a href = "https://github.com/CTLab-ITMO/PROBEst/wiki">Wiki</a> with detailed information on use cases, data descriptions, and other necessary information.
611+
612+
288613
# License
289614

290615
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
