Open
58 commits
177430b
added gitignore
serine Mar 6, 2018
2ab43e7
made barcode to be appended to the fastq header
serine Mar 6, 2018
10c1f47
added utils.c for general utils, right now it holds _mkdir function t…
serine Mar 7, 2018
c0b280e
refactored help menu
serine Mar 12, 2018
de51ec2
updated README, included note about fork
serine Mar 12, 2018
5ac6256
very annoying commit, could be a deal breaker here...
serine Mar 12, 2018
35bf35b
Moved barcode.c code into utils and removed that file
serine Mar 12, 2018
d984204
added exclusion of vim swap files into gitignore
serine Mar 12, 2018
c49b118
updated Makefile to reflect removal of barcode.c file
serine Mar 12, 2018
fbf68a0
fixed the bug in new strncp_with_mismatch and changed input parameter…
serine Mar 14, 2018
1790715
removed -std=c99 from make file
serine Mar 14, 2018
7ba39b9
Changed layout of output stats file into a tab separated table
serine Mar 14, 2018
844ab17
updated error message about barcode length being greater than read le…
serine Apr 8, 2018
d4e96a9
made new Makefile and added updated kseq.h file
serine Apr 12, 2018
b132954
Setting myself up for mode feature. Planing to simplify sabre to be
serine Apr 30, 2018
0b4d59e
Added some docs, more like ideas at this stage
serine May 15, 2018
5bce3cb
Kind of forgotten what I was doing here, left in staging for a couple
serine May 30, 2018
cb1563f
changed strncmp_with_mismatch to chk_bc_mtch function.
serine May 30, 2018
cd0435e
updated Makefile that doesn't look at demulti_single.c file
serine May 30, 2018
e94e9ed
making umis of uniform length based on max-5prime-crop.
serine Jun 22, 2018
321c2ff
attempting to fix memory leak
serine Jun 22, 2018
e7e1865
fixed bug in getting quality string length when using combine mode
serine Jun 28, 2018
73eb266
fixed bug in skipping umis that are too short.
serine Aug 9, 2018
204dfe3
started working on metrics collection script.
serine Aug 9, 2018
db72db1
wrote functional metrics.c script
serine Aug 10, 2018
fe1c1c6
cleaned metrics.c code a little
serine Aug 10, 2018
4f4f154
worked on metrics util, not it produce sorted list
serine Aug 20, 2018
ac8e1d2
changed umi trimming. If min-umi-len is set then all umis
serine Aug 20, 2018
22c994f
updated gitignore and makefile
serine Aug 20, 2018
ded2d90
fixed bug in making umi reads of a particular length
serine Aug 20, 2018
0a77e8a
Added another mode to metrics, now one can either get metrics on
serine Aug 20, 2018
1a07501
Updated makefile
serine Aug 20, 2018
9e56e4a
far out.. major revamp of sabre, just sabre code left and right
serine Dec 11, 2018
0233441
milestone, got all headers in order?
serine Dec 12, 2018
a9af68a
work in progress, just another commit
serine Dec 12, 2018
a1af555
individual c file compiles error free, check
serine Dec 12, 2018
af518d7
milestone majore.
serine Dec 16, 2018
98db795
yet another milestone
serine Dec 16, 2018
a17892b
milestone, compiles and runs
serine Dec 17, 2018
cfc3bfa
check
serine Dec 17, 2018
82bcba9
huzzah!
serine Dec 17, 2018
3782b8b
tweak
serine Dec 17, 2018
02a9c25
check2
serine Feb 21, 2019
4429c00
removed gzWrite because it appears that it was very slow, although
serine Feb 21, 2019
850b924
Polished off threads features, fixed all issues reaised by
serine Feb 22, 2019
8cf7501
Stupid but necessary, reindented all C files to 4 spaces and no tabs!
serine Feb 22, 2019
f96b234
update makefile to include commit hash into binary build
serine Oct 13, 2019
5c071b8
fixed warning message and added git commit hash to the version print
serine Oct 13, 2019
6d4172e
updated readme
serine Oct 14, 2019
01eda92
removed symlinking sabre binary after build from make file.
serine Oct 14, 2019
db0ee82
simplified the versioning to be just a hash of the last commit
serine Oct 14, 2019
c529c29
Update sabre.c
drpowell Jun 2, 2020
1250dd2
Merge pull request #1 from drpowell/patch-1
serine Jun 3, 2020
a93381d
Find best matching barcode
drpowell Jun 5, 2020
379b3c7
Add option to compress (gzip) output files
drpowell Jun 5, 2020
c1b10b5
Merge pull request #2 from drpowell/best-match
serine Jun 8, 2020
5678529
Merge pull request #3 from drpowell/pigz
serine Jun 26, 2020
663365f
fixed printing of help when no args are given and also drop what appears
serine Jul 1, 2020
10 changes: 10 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# ignore C object files
*.o
# ignore executables
sabre
sabre-dev
metrics
# ignore vim swap files
*.swp
*.gz
/tmp
41 changes: 0 additions & 41 deletions Makefile

This file was deleted.

129 changes: 41 additions & 88 deletions README.md
@@ -1,105 +1,58 @@
# sabre - A barcode demultiplexing and trimming tool for FastQ files
> This is a fork of the [original repo](https://github.com/najoshi/sabre). I might be taking this tool in a different direction from what was originally intended.

## About
# sabre

Next-generation sequencing can currently produce hundreds of millions of reads
per lane of sample and that number increases at a dizzying rate. Barcoding
individual sequences for multiple lines or multiple species is a cost-efficient
method to sequence and analyze a broad range of data.
> A cellular barcode demultiplexing tool for FASTQ files

Sabre is a tool that will demultiplex barcoded reads into separate files.
It will work on both single-end and paired-end data in fastq format.
It simply compares the provided barcodes with each read and separates
the read into its appropriate barcode file, after stripping the barcode from
the read (and also stripping the quality values of the barcode bases). If
a read does not have a recognized barcode, then it is put into the unknown file.
Sabre also has an option (-m) to allow mismatches of the barcodes.
## Content

Sabre also supports gzipped file inputs. Also, since sabre does not use the
quality values in any way, it can be used on fasta data that is converted to
fastq by creating fake quality values.
- [Install](#install)
- [Quick start](#quick-start)
- [Usage](#usage)

Finally, after demultiplexing, sabre outputs a summary of how many records
went into each barcode file.
## Install

## Requirements
```BASH
git clone https://github.com/serine/sabre
cd sabre/src
make
```

Sabre requires a C compiler; GCC or clang are recommended. Sabre
relies on Heng Li's kseq.h, which is bundled with the source.
## Quick start

Sabre also requires Zlib, which can be obtained at
<http://www.zlib.net/>.

## Building and Installing Sabre

To build Sabre, enter:

make

Then, copy or move "sabre" to a directory in your $PATH.
```BASH
sabre -f MultiplexRNASeq_S1_R1_001.fastq.gz \
-r MultiplexRNASeq_S1_R2_001.fastq.gz \
-b barcodes.txt \
-c \
-u \
-m 2 \
-l 10 \
-a 1 \
-s sabre.txt \
-t 12
```

## Usage

Sabre has two modes to work with both paired-end and single-end
reads: `sabre se` and `sabre pe`.

Running sabre by itself will print the help:

sabre

Running sabre with either the "se" or "pe" commands will give help
specific to those commands:

sabre se
sabre pe

### Sabre Single End (`sabre se`)

`sabre se` takes an input fastq file and an input barcode data file and outputs
the reads demultiplexed into separate files using the file names from the data file.
The barcodes will be stripped from the reads and the quality values of the barcode
bases will also be removed. Any reads with unknown barcodes get put into the "unknown"
file specified on the command line. The -m option allows for mismatches in the barcodes.

#### Barcode data file format for single end

barcode1 barcode1_output_file.fastq
barcode2 barcode2_output_file.fastq
etc...

Be aware that if you do not format the barcode data file correctly, sabre will not work properly.

#### Example

sabre se -f input_file.fastq -b barcode_data.txt -u unknown_barcode.fastq
sabre se -m 1 -f input_file.fastq -b barcode_data.txt -u unknown_barcode.fastq

### Sabre Paired End (`sabre pe`)

`sabre pe` takes two paired-end files and a barcode data file as input and outputs
the reads demultiplexed into separate paired-end files using the file names from the
data file. The barcodes will be stripped from the reads and the quality values of the barcode
bases will also be removed. Any reads with unknown barcodes get put into the "unknown" files
specified on the command line. It also has an option (-c) to remove barcodes from both files.
Using this option means that if sabre finds a barcode in the first file, it assumes the paired
read in the other file has the same barcode and will strip it (along with the quality values).
The -m option allows for mismatches in the barcodes.

#### Barcode data file format for paired end
> This tool is under development and this is very much an alpha version.
> In its current form the tool is highly customised to a particular multiplexing protocol.

barcode1 barcode1_output_file1.fastq barcode1_output_file2.fastq
barcode2 barcode2_output_file1.fastq barcode2_output_file2.fastq
etc...
### Cellular barcodes

Be aware that if you do not format the barcode data file correctly, sabre will not work properly.
To demultiplex, the user needs to provide a `barcodes.txt` file, which is a three-column, tab-delimited file:

#### Examples
```
sample_name group barcode
```

sabre pe -f input_file1.fastq -r input_file2.fastq -b barcode_data.txt \
-u unknown_barcode1.fastq -w unknown_barcode1.fastq
Currently `group` is a semi-redundant column; it is there for a feature that is still in development. For most use cases `group` can be equal to `barcode`.

sabre pe -c -f input_file1.fastq -r input_file2.fastq -b barcode_data.txt \
-u unknown_barcode1.fastq -w unknown_barcode1.fastq
e.g.

sabre pe -m 2 -f input_file1.fastq -r input_file2.fastq -b barcode_data.txt \
-u unknown_barcode1.fastq -w unknown_barcode1.fastq
```
cntr_rep1 TAAGGCGA TAAGGCGA
cntr_rep2 CGTACTAG CGTACTAG
treat_rep1 AGGCAGAA AGGCAGAA
treat_rep2 TCCTGAGC TCCTGAGC
```
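
The three-column layout described above could be read with something like the following. This is only a sketch under stated assumptions, not the parser sabre actually uses: `barcode_row`, `parse_barcode_line` and `FIELD_MAX` are hypothetical names, and the fields are assumed to be separated by single tab characters.

```c
#include <assert.h>
#include <string.h>

#define FIELD_MAX 128 /* hypothetical upper bound on a field, including the NUL */

/* One row of barcodes.txt: sample_name <TAB> group <TAB> barcode. */
typedef struct {
    char sample[FIELD_MAX];
    char group[FIELD_MAX];
    char barcode[FIELD_MAX];
} barcode_row;

/*
 * Parse one tab-delimited line into `row`.
 * Returns 0 on success, -1 if the line does not hold exactly three
 * non-empty fields or a field does not fit in FIELD_MAX.
 */
int parse_barcode_line(const char *line, barcode_row *row)
{
    assert(line != NULL && row != NULL);

    char *dest[3] = { row->sample, row->group, row->barcode };
    const char *p = line;

    for (int i = 0; i < 3; i++) {
        /* A field ends at the next tab or at the end of the line. */
        size_t len = strcspn(p, "\t\r\n");
        if (len == 0 || len >= FIELD_MAX)
            return -1;
        memcpy(dest[i], p, len);
        dest[i][len] = '\0';
        p += len;

        if (i < 2) {
            if (*p != '\t')
                return -1; /* fewer than three fields */
            p++;
        }
    }
    /* Anything other than a line ending here means extra fields. */
    return (*p == '\0' || *p == '\r' || *p == '\n') ? 0 : -1;
}
```

Rejecting malformed rows up front keeps a badly formatted barcode file from silently producing wrong output files.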
37 changes: 37 additions & 0 deletions docs/definitions.md
@@ -0,0 +1,37 @@
define blocks along the read

BARCODE
UMI
READ

then set a length for each block

BARCODE = 8
UMI = 10
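
A rough sketch of how fixed-length blocks could be sliced out of a read, assuming the BARCODE+UMI+READ order and the example lengths above. `read_blocks` and `split_read` are hypothetical names, not functions from the sabre source.

```c
#include <assert.h>
#include <string.h>

/* Block lengths taken from the note above; treat them as example values. */
enum { BARCODE_LEN = 8, UMI_LEN = 10 };

/* Views into the original sequence; nothing is copied. */
typedef struct {
    const char *barcode;  /* BARCODE_LEN bases */
    const char *umi;      /* UMI_LEN bases */
    const char *read;     /* remaining bases, NUL-terminated */
    size_t read_len;
} read_blocks;

/* Returns 0 on success, -1 if `seq` is too short to leave any READ bases. */
int split_read(const char *seq, read_blocks *out)
{
    assert(seq != NULL && out != NULL);

    const size_t prefix = BARCODE_LEN + UMI_LEN;
    size_t n = strlen(seq);
    if (n <= prefix)
        return -1;

    out->barcode = seq;
    out->umi = seq + BARCODE_LEN;
    out->read = seq + prefix;
    out->read_len = n - prefix;
    return 0;
}
```

Because the blocks are pointers into `seq`, the barcode and UMI are not NUL-terminated on their own; callers compare them with an explicit length.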

## Andrew's suggestion

```
--input sample_A_R1.fastq.gz:i8{index1},r151{read1},i8{index2}
```

```
--fq1 sample_A_R1.fastq.gz:i8{index1},r151{READ1},i8{index2}

--fq2 sample_A_R2.fastq.gz:i8{index1},r151{READ1},i8{index2}
```

We need to check that BARCODE == index1 in both fq1 and fq2 but also check that index1_fq1 == index1_fq2
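
The check described above might look roughly like this. It is a sketch only: `barcodes_match` and `pair_is_consistent` are hypothetical names, the mismatch tolerance is an assumption borrowed from the `-m` option, and the two index1 sequences are assumed to need an exact match with each other.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if the first `len` characters of `a` and `b` differ in at
 * most `max_mismatch` positions, 0 otherwise (including when either
 * string is shorter than `len`).
 */
int barcodes_match(const char *a, const char *b, size_t len, int max_mismatch)
{
    assert(a != NULL && b != NULL && max_mismatch >= 0);

    int mismatches = 0;
    for (size_t i = 0; i < len; i++) {
        if (a[i] == '\0' || b[i] == '\0')
            return 0; /* one sequence is too short */
        if (a[i] != b[i] && ++mismatches > max_mismatch)
            return 0;
    }
    return 1;
}

/*
 * Both conditions from the note: the expected barcode matches index1 in
 * each file, and the two index1 sequences agree with each other exactly.
 */
int pair_is_consistent(const char *barcode,
                       const char *index1_fq1, const char *index1_fq2,
                       size_t len, int max_mismatch)
{
    return barcodes_match(barcode, index1_fq1, len, max_mismatch)
        && barcodes_match(barcode, index1_fq2, len, max_mismatch)
        && barcodes_match(index1_fq1, index1_fq2, len, 0);
}
```

The exact comparison between the two files matters: with a non-zero tolerance, both indexes could sit within range of the barcode and still differ from each other.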

```
--merge 12 merge R1 into R2
--merge 21 merge R2 into R1
```

either way, the resulting read is R1

```
--fq1 sample_A_R1.fastq.gz:8index1,*index2

--fq2 sample_A_R2.fastq.gz:i8{index1},r151{read2},i8{index2}
```
72 changes: 72 additions & 0 deletions docs/modes.md
@@ -0,0 +1,72 @@
# Sabre

## Different running modes

DOCS: In each case BARCODE and/or UMI are trimmed off and
put into the FASTQ header:

Not sure if I should have:

BARCODE always takes precedence, i.e. BARCODE:UMI
OR
It follows the same structure as the experiment, i.e.
if BARCODE+UMI then BARCODE:UMI
else if UMI+BARCODE then UMI:BARCODE
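
Either option can be covered by one small helper that takes the order as a parameter. This is a sketch, not the format sabre writes today: `tag_header` and `tag_order` are hypothetical names, and joining the tag to the read name with `:` is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { BARCODE_FIRST, UMI_FIRST } tag_order;

/*
 * Writes "<name>:<first>:<second>" into `out`, where the order of the
 * barcode and UMI is chosen by `order`.
 * Returns 0 on success, -1 if `out_size` is too small.
 */
int tag_header(char *out, size_t out_size, const char *name,
               const char *barcode, const char *umi, tag_order order)
{
    assert(out != NULL && name != NULL && barcode != NULL && umi != NULL);

    const char *first = (order == BARCODE_FIRST) ? barcode : umi;
    const char *second = (order == BARCODE_FIRST) ? umi : barcode;

    /* Two ':' separators plus the terminating NUL. */
    size_t need = strlen(name) + strlen(barcode) + strlen(umi) + 3;
    if (need > out_size)
        return -1;

    snprintf(out, out_size, "%s:%s:%s", name, first, second);
    return 0;
}
```

Keeping the order as an argument means the "BARCODE always first" and "follow the experiment" choices differ only in what the caller passes.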

All modes that begin with 3 will return a single R1 file, merging
the R1 read into the R2 header and renaming R2 to R1

10 = single-end where R1 has the following structure:

R1 -->
BARCODE+READ

20 = paired-end where R1 and R2 have the following structure:

R1 --> <--R2
BARCODE+READ----READ+BARCODE

this mode returns a single file (R1) with the barcode appended to the R1 header

30 = paired-end where R1 and R2 have the following structure:

R1 --> <-R2
BARCODE----READ

40 = paired-end where

11 = single-end where R1 has the following structure:

R1 -->
BARCODE+UMI+READ

21 = paired-end where R1 and R2 have the following structure:

R1 --> <--R2
BARCODE+UMI+READ----READ+UMI+BARCODE

this mode returns a single file (R1) with the barcode appended to the R1 header

31 = paired-end where R1 and R2 have the following structure:

R1 --> <-R2
BARCODE+UMI----READ

NOTE this gives me room for yet another mode e.g 12, 22, 32

12 = single-end where R1 has the following structure:

R1 -->
UMI+READ

22 = paired-end where R1 and R2 have the following structure:

R1 --> <--R2
UMI+READ----READ+UMI

this mode returns a single file (R1) with the barcode appended to the R1 header

32 = paired-end where R1 and R2 have the following structure:

R1 --> <-R2
UMI----READ
50 changes: 50 additions & 0 deletions src/Makefile
@@ -0,0 +1,50 @@
# Source, Executable, Includes, Library Defines
GIT_VERSION := "$(shell git describe --abbrev=7 --always --tags)"
CC = gcc
INCL = fastq.h sabre.h
#SRC = demulti_paired.c demulti_single.c sabre.c utils.c
SRC = sabre.c usage.c demultiplex.c utils.c fastq.c
OBJ = $(SRC:.c=.o)
DSRC=src

#CFLAGS = -Wall -O2 -std=c99 -pedantic -DVERSION=$(VERSION)
# need to quote GIT_VERSION so that the value gets passed as a string
CFLAGS = -Wall -O2 -std=gnu99 -pedantic -DVERSION=\"$(GIT_VERSION)\"
CFLAGSDEV = -Wall -O0 -g -std=gnu99 -DVERSION=\"$(GIT_VERSION)-dev\"

LDFLAGS = -lz -lpthread
GPROF = -pg
EXE = sabre

.PHONY: default build dev gprof clean clean-all

default: build
# a smarter way to have an if statement here instead of explicit gprof target
# have a look at gcc -M

# compile each object from its own source file only
%.o: %.c
	$(CC) -c $(CFLAGS) $< -o $@

usage.o: usage.h sabre.h fastq.h
utils.o: utils.h sabre.h fastq.h
demultiplex.o: demultiplex.h utils.h sabre.h fastq.h
sabre.o: sabre.h

build: $(OBJ)
	$(CC) $(CFLAGS) $(OBJ) -o $(EXE) $(LDFLAGS)
#ln -sf $(DSRC)/$(EXE) ../sabre

# OBJDEV was never defined, so dev builds straight from the sources
dev:
	$(CC) $(CFLAGSDEV) $(SRC) -o $(EXE)-dev $(LDFLAGS)

metrics:
	$(CC) $(CFLAGSDEV) -o metrics metrics.c $(LDFLAGS)

gprof:
	$(CC) $(CFLAGS) $(GPROF) $(SRC) -o $(EXE).gprof $(LDFLAGS)

clean:
	$(RM) $(OBJ) $(EXE) core

clean-all:
	$(RM) $(OBJ) $(EXE) $(EXE)-dev $(EXE).gprof core gmon.out metrics
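
On the C side, the `VERSION` macro passed via `-DVERSION=\"$(GIT_VERSION)\"` can be consumed roughly as follows. This is a sketch, not the code in `sabre.c`: `sabre_version` and `print_version` are hypothetical names, and the `"unknown"` fallback for builds without the flag is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Fallback for builds that do not pass -DVERSION on the command line. */
#ifndef VERSION
#define VERSION "unknown"
#endif

/* The macro expands to a string literal, so it can be returned as-is. */
const char *sabre_version(void)
{
    return VERSION;
}

/* Print the version line to the given stream, e.g. stdout. */
void print_version(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "sabre %s\n", sabre_version());
}
```

Without the escaped quotes in `CFLAGS`, the macro would expand to a bare token such as `663365f` instead of a string literal, and a function like this would not compile.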
24 changes: 0 additions & 24 deletions src/barcode.c

This file was deleted.
