Commit 3e678b2

Update README with improved ESGI pattern descriptions
Clarified the explanation of pattern elements and stagger handling in ESGI.
1 parent 44ac807 commit 3e678b2

1 file changed

Lines changed: 8 additions & 4 deletions

README.md

@@ -125,9 +125,9 @@ The file can contain one or several patterns, every additional pattern must be w
125125
This file lists all pattern elements (barcodes, UMIs, etc.); each element is enclosed in square brackets.
126126
The pattern can have a name (e.g., PATTERN_NAME:). This is not required, but can be handy if several patterns are provided, as ESGI creates one file of demultiplexed reads for every pattern.
127127
Possible elements include:
128-
- a constant nucleotide sequence given as string of A,G,C,Ts, e.g. [GCATTACG]
128+
- a constant nucleotide sequence given as string of A,G,C,Ts, e.g. [GCATTACG].
129129
- barcode sequences, given by the path to a txt file. This file contains a comma-separated list of all possible barcodes at this position.
130-
- UMI stated as <number>X, e.g. [15X]
130+
- UMI stated as <number>X, e.g. [15X]. A <number>X element can also stand in for a constant sequence you do not care about. For example, given the pattern [barcodes.txt][AAAA][barcodes.txt], if the [AAAA] is irrelevant you can write [barcodes.txt][4X][barcodes.txt] instead; the [4X] element does not have to be used as a UMI. When running ESGI or count, you state the index of the pattern element to use as UMI. That element must be a <number>X, but not every <number>X has to be used as UMI. You can also have several <number>X elements and use them all as UMI; ESGI/count then concatenates them into one long UMI.
131131
- genomic sequences like RNA/DNA, that need to be aligned to a reference genome with STAR, are listed as [DNA].
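
As a rough illustration of the `<number>X` behaviour described in the list above, the following Python sketch cuts a read into fixed-length pattern elements and concatenates the `<number>X` elements selected as UMI. It is not ESGI's implementation: the function names `split_read` and `concatenate_umi`, the 3-nt barcode length, and the 0-based indexing of pattern elements are assumptions made only for this example.

```python
# Hypothetical sketch, not ESGI code: split a read by fixed element lengths
# and join the wildcard (<number>X) elements that were selected as UMI.

def split_read(read, element_lengths):
    """Cut `read` into consecutive pieces of the given lengths."""
    pieces, pos = [], 0
    for length in element_lengths:
        pieces.append(read[pos:pos + length])
        pos += length
    return pieces

def concatenate_umi(pieces, umi_indices):
    """Join the pieces at `umi_indices` (0-based, in pattern order) into one UMI."""
    return "".join(pieces[i] for i in umi_indices)

# Assumed pattern: [4X][barcodes.txt][4X][6X] with 3-nt barcodes (assumption).
element_lengths = [4, 3, 4, 6]
read = "ACGT" + "TTT" + "AAAA" + "GATTAC"
pieces = split_read(read, element_lengths)

# Use elements 0 and 3 as UMI; element 2 ([4X]) is only a placeholder
# for a constant sequence we do not care about.
umi = concatenate_umi(pieces, [0, 3])
print(umi)
```

Here the middle `[4X]` is skipped entirely, while the first and last wildcard elements form one combined UMI.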
132132
ESGI makes use of two additional elements for special barcoding cases:
133133
- [-] strictly separates the forward and reverse read. The pattern generally covers the forward and reverse read (assuming reverse complements of the reverse read).
@@ -171,7 +171,10 @@ CONTROL,EGFRi,CONTROL,EGFRi
171171
```
172172
# Running ESGI with staggers
173173

174-
ESGI can demultiplex patterns with staggers, where for a abrcode at a certain position the length of the barcode can vary. One example pattern would be [A|AC|ACG|ACGT][GGGG] where we expect first a barcode of length 1 to 4 followed by a constant element GGGG. ESGI has two features that makes it possible to match staggers. 1.) Barcodes in barcode-elements (elements described by a txt file that contains all possible barcodes) can have variable length and 2.) ESGI can demultiplex several patterns simultaneously. In this scenario we would recommend to set ESGI up in ether of two ways:
174+
ESGI can demultiplex patterns with staggers (barcodes of variable length). One example pattern would be [A|AC|ACG|ACGT][GGGG], where we first expect a barcode of length 1 to 4, followed by the constant element GGGG. The problem with staggers is that several barcodes might map equally well. Imagine we have the read ACGGGG: if we map the stagger barcode first, A, AC, and ACG all map equally well (when mapping barcodes of different lengths, ESGI does not penalize deletions at the end of the barcode). Therefore, we would either have to map the whole pattern first to see that AC and GGGG is actually the best split, or at least map the stagger sequence together with the element that follows the stagger.
175+
176+
ESGI has two features that make it possible to match staggers: (1) barcodes in barcode elements (elements described by a txt file that contains all possible barcodes) can have variable length, and (2) ESGI can demultiplex several patterns simultaneously.
177+
175178
1.) Use a single pattern and merge the stagger with the constant sequence. If we did not merge them and instead kept the stagger and the constant element as separate pattern elements, many reads would be discarded because of ambiguous mapping: if a read contains barcode 'ACG', barcodes 'A' and 'AC' would also map, and ESGI would discard the read (ESGI does not look ahead; at every barcode position it tries to find the best match, and if there are several matches the read is discarded). If we merge the stagger with the constant sequence, we can even allow for a mismatch, and ESGI would still find the merged barcode that matches best.
176179

177180
pattern.txt
@@ -182,7 +185,8 @@ stagger_barcodes.txt
182185
```txt
183186
AGGGG,ACGGGG,ACGGGGG,ACGTGGGG
184187
```
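
The effect of merging the stagger with the constant sequence can be sketched in Python. This is only an illustration of the ambiguity argument, not ESGI's matching algorithm: it assumes simple prefix matching with Hamming distance, and the helper names `hamming_prefix` and `best_matches` are hypothetical.

```python
# Hypothetical sketch, not ESGI code: prefix matching with Hamming distance.

def hamming_prefix(read, barcode):
    """Mismatches between `barcode` and the read prefix of equal length,
    or None if the read is shorter than the barcode."""
    if len(barcode) > len(read):
        return None
    return sum(a != b for a, b in zip(read, barcode))

def best_matches(read, barcodes):
    """Return (distance, [barcodes]) for the lowest-distance candidates."""
    scored = [(hamming_prefix(read, bc), bc) for bc in barcodes]
    scored = [(d, bc) for d, bc in scored if d is not None]
    best = min(d for d, _ in scored)
    return best, [bc for d, bc in scored if d == best]

read = "ACGGGG"

# Stagger alone: several barcodes match the read prefix exactly.
stagger_only = best_matches(read, ["A", "AC", "ACG", "ACGT"])

# Stagger merged with the constant GGGG: a single best match remains.
merged = best_matches(read, ["AGGGG", "ACGGGG", "ACGGGGG", "ACGTGGGG"])

print(stagger_only)
print(merged)
```

With the stagger alone, `A`, `AC`, and `ACG` tie at distance 0, which is the ambiguous case in which a read would be discarded. With the merged barcodes, only `ACGGGG` matches exactly.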
185-
2.) The second option would be to describe an individual pattern for every stagger, and allow only for no or very little mismatches in the staggers with very few nucleotides. This way we prevent to map a wrong barcode with insertions/deletions to a stagger. Additionally, you might want to map with hamming distance only.
188+
189+
2.) The second option is to describe an individual pattern for every stagger and to allow no or very few mismatches in the staggers with very few nucleotides. This way we avoid mapping a wrong barcode with insertions/deletions to a stagger. Additionally, you could map the barcodes with Hamming distance only, using the -H flag.
186190

187191
pattern.txt
188192
```txt
