najoshi · stu2 · Mar 6, 2018 · Mar 6, 2018 · Mar 7, 2018 · Mar 12, 2018
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,10 @@
+# ignore C object files
+*.o
+# ignore executables
+sabre
+sabre-dev
+metrics
+# ignore vim swap files
+*.swp
+*.gz
+/tmp
diff --git a/Makefile b/Makefile
diff --git a/README.md b/README.md
@@ -1,105 +1,58 @@
-# sabre - A barcode demultiplexing and trimming tool for FastQ files
+> This is a fork of the [original repo](https://github.com/najoshi/sabre). I might take this tool in a different direction from what was originally intended.
 
-## About
+# sabre
 
-Next-generation sequencing can currently produce hundreds of millions of reads
-per lane of sample and that number increases at a dizzying rate.  Barcoding
-individual sequences for multiple lines or multiple species is a cost-efficient
-method to sequence and analyze a broad range of data.
+> A cellular barcode demultiplexing tool for FASTQ files
 
-Sabre is a tool that will demultiplex barcoded reads into separate files. 
-It will work on both single-end and paired-end data in fastq format.
-It simply compares the provided barcodes with each read and separates
-the read into its appropriate barcode file, after stripping the barcode from
-the read (and also stripping the quality values of the barcode bases).  If
-a read does not have a recognized barcode, then it is put into the unknown file.  
-Sabre also has an option (-m) to allow mismatches of the barcodes.
+## Content
 
-Sabre also supports gzipped file inputs.  Also, since sabre does not use the 
-quality values in any way, it can be used on fasta data that is converted to
-fastq by creating fake quality values.
+- [Install](#install)
+- [Quick start](#quick-start)
+- [Usage](#usage)
 
-Finally, after demultiplexing, sabre outputs a summary of how many records
-went into each barcode file.
+## Install
 
-## Requirements 
+```BASH
+git clone https://github.com/serine/sabre
+cd sabre/src
+make
+```
 
-Sabre requires a C compiler; GCC or clang are recommended.  Sabre
-relies on Heng Li's kseq.h, which is bundled with the source.
+## Quick start
 
-Sabre also requires Zlib, which can be obtained at
-<http://www.zlib.net/>.
-
-## Building and Installing Sabre
-
-To build Sabre, enter:
-
-    make
-
-Then, copy or move "sabre" to a directory in your $PATH.
+```BASH
+sabre -f MultiplexRNASeq_S1_R1_001.fastq.gz \
+      -r MultiplexRNASeq_S1_R2_001.fastq.gz \
+      -b barcodes.txt \
+      -c \
+      -u \
+      -m 2 \
+      -l 10 \
+      -a 1 \
+      -s sabre.txt \
+      -t 12
+```
 
 ## Usage
 
-Sabre has two modes to work with both paired-end and single-end
-reads: `sabre se` and `sabre pe`.
-
-Running sabre by itself will print the help:
-
-    sabre
-
-Running sabre with either the "se" or "pe" commands will give help
-specific to those commands:
-
-    sabre se
-    sabre pe
-
-### Sabre Single End (`sabre se`)
-
-`sabre se` takes an input fastq file and an input barcode data file and outputs 
-the reads demultiplexed into separate files using the file names from the data file.
-The barcodes will be stripped from the reads and the quality values of the barcode
-bases will also be removed.  Any reads with unknown barcodes get put into the "unknown" 
-file specified on the command line.  The -m option allows for mismatches in the barcodes.
-
-#### Barcode data file format for single end
-
-    barcode1 barcode1_output_file.fastq
-    barcode2 barcode2_output_file.fastq
-    etc...
-
-Be aware that if you do not format the barcode data file correctly, sabre will not work properly.
-
-#### Example
-
-    sabre se -f input_file.fastq -b barcode_data.txt -u unknown_barcode.fastq
-    sabre se -m 1 -f input_file.fastq -b barcode_data.txt -u unknown_barcode.fastq
-
-### Sabre Paired End (`sabre pe`)
-
-`sabre pe` takes two paired-end files and a barcode data file as input and outputs
-the reads demultiplexed into separate paired-end files using the file names from the 
-data file.  The barcodes will be stripped from the reads and the quality values of the barcode 
-bases will also be removed.  Any reads with unknown barcodes get put into the "unknown" files 
-specified on the command line.  It also has an option (-c) to remove barcodes from both files.  
-Using this option means that if sabre finds a barcode in the first file, it assumes the paired 
-read in the other file has the same barcode and will strip it (along with the quality values).  
-The -m option allows for mismatches in the barcodes.
-
-#### Barcode data file format for paired end
+> This tool is under development and is very much an alpha version.
+> In its current form the tool is highly customised for a particular multiplexing protocol.
 
-    barcode1 barcode1_output_file1.fastq barcode1_output_file2.fastq
-    barcode2 barcode2_output_file1.fastq barcode2_output_file2.fastq
-    etc...
+### Cellular barcodes
 
-Be aware that if you do not format the barcode data file correctly, sabre will not work properly.
+To demultiplex, the user needs to provide a `barcodes.txt` file, which is a three-column, tab-delimited file:
 
-#### Examples
+```
+sample_name group barcode
+```
 
-    sabre pe -f input_file1.fastq -r input_file2.fastq -b barcode_data.txt \
-    -u unknown_barcode1.fastq -w unknown_barcode1.fastq
+Currently `group` is a semi-redundant column; it is there for a feature that is still in development. For most use cases `group` can be equal to `barcode`.
 
-    sabre pe -c -f input_file1.fastq -r input_file2.fastq -b barcode_data.txt \
-    -u unknown_barcode1.fastq -w unknown_barcode1.fastq
+e.g.
 
-    sabre pe -m 2 -f input_file1.fastq -r input_file2.fastq -b barcode_data.txt \
-    -u unknown_barcode1.fastq -w unknown_barcode1.fastq
+```
+cntr_rep1    TAAGGCGA        TAAGGCGA
+cntr_rep2    CGTACTAG        CGTACTAG
+treat_rep1   AGGCAGAA        AGGCAGAA
+treat_rep2   TCCTGAGC        TCCTGAGC
+```
diff --git a/docs/definitions.md b/docs/definitions.md
@@ -0,0 +1,37 @@
+define blocks along the read
+
+BARCODE 
+UMI
+READ
+
+then set values for the different blocks
+
+BARCODE = 8
+UMI = 10
+
+## Andrew's suggestion
+
+```
+--input sample_A_R1.fastq.gz:i8{index1},r151{read1},i8{index2}
+```
+
+```
+--fq1 sample_A_R1.fastq.gz:i8{index1},r151{READ1},i8{index2}
+
+--fq2 sample_A_R2.fastq.gz:i8{index1},r151{READ1},i8{index2}
+```
+
+We need to check that BARCODE == index1 in both fq1 and fq2 but also check that index1_fq1 == index1_fq2
+
+```
+--merge 12 merge R1 into R2
+--merge 21 merge R2 into R1
+```
+
+either way the resulting read is R1
+
+```
+--fq1 sample_A_R1.fastq.gz:8index1,*index2
+
+--fq2 sample_A_R2.fastq.gz:i8{index1},r151{read2},i8{index2}
+```
diff --git a/docs/modes.md b/docs/modes.md
@@ -0,0 +1,72 @@
+# Sabre
+
+## Different running modes
+
+DOCS: In each case BARCODE and/or UMI are trimmed off and
+put into FASTQ header:
+
+Not sure if I should have:
+
+       BARCODE always takes precedence, i.e. BARCODE:UMI
+       OR
+       It follows the same structure as per experiment i.e
+       if BARCODE+UMI then BARCODE:UMI
+       else if UMI+BARCODE then UMI:BARCODE
+
+All modes that begin with 3 will return a single R1 file, merging the
+R1 read into the R2 header and renaming R2 to R1
+
+10 = single-end where R1 has the following structure:
+
+         R1 -->
+         BARCODE+READ
+
+20 = paired-end where R1 and R2 have the following structure:
+
+         R1 -->                 <--R2
+         BARCODE+READ----READ+BARCODE
+
+this mode returns a single file (R1) with the barcode appended to the R1 header
+
+30 = paired-end where R1 and R2 have the following structure:
+
+         R1 -->     <-R2
+         BARCODE----READ
+
+40 = paired-end where
+
+11 = single-end where R1 has the following structure:
+
+         R1 -->
+         BARCODE+UMI+READ
+
+21 = paired-end where R1 and R2 have the following structure:
+
+         R1 -->                         <--R2
+         BARCODE+UMI+READ----READ+UMI+BARCODE
+
+this mode returns a single file (R1) with the barcode appended to the R1 header
+
+31 = paired-end where R1 and R2 have the following structure:
+
+         R1 -->         <-R2
+         BARCODE+UMI----READ
+
+NOTE: this gives me room for yet another set of modes, e.g. 12, 22, 32
+
+12 = single-end where R1 has the following structure:
+
+         R1 -->
+         UMI+READ
+
+22 = paired-end where R1 and R2 have the following structure:
+
+         R1 -->         <--R2
+         UMI+READ----READ+UMI
+
+this mode returns a single file (R1) with the barcode appended to the R1 header
+
+32 = paired-end where R1 and R2 have the following structure:
+
+         R1 --> <-R2
+         UMI----READ
diff --git a/src/Makefile b/src/Makefile
@@ -0,0 +1,50 @@
+# Source, Executable, Includes, Library Defines
+GIT_VERSION := "$(shell git describe --abbrev=7 --always --tags)"
+CC = gcc
+INCL = fastq.h sabre.h
+#SRC = demulti_paired.c demulti_single.c sabre.c utils.c
+SRC = sabre.c usage.c demultiplex.c utils.c fastq.c
+OBJ = $(SRC:.c=.o)
+DSRC=src
+
+#CFLAGS = -Wall -O2 -std=c99 -pedantic -DVERSION=$(VERSION)
+# need to quote GIT_VERSION so that the value gets passed as a string
+CFLAGS = -Wall -O2 -std=gnu99 -pedantic -DVERSION=\"$(GIT_VERSION)\"
+CFLAGSDEV = -Wall -O0 -g -std=gnu99 -DVERSION=\"$(GIT_VERSION)-dev\"
+
+LDFLAGS = -lz -lpthread
+GPROF = -pg
+EXE = sabre
+
+.PHONY: default build dev gprof clean clean-all
+
+default: build
+# a smarter way would be to have an if statement here instead of an explicit gprof target
+# have a look at gcc -M
+
+%.o: %.c
+	$(CC) -c $(CFLAGS) $< -o $@
+
+usage.o: usage.h sabre.h fastq.h
+utils.o: utils.h sabre.h fastq.h
+demultiplex.o: demultiplex.h utils.h sabre.h fastq.h
+sabre.o: sabre.h
+
+build: $(OBJ)
+	$(CC) $(CFLAGS) $(OBJ) -o $(EXE) $(LDFLAGS)
+	#ln -sf $(DSRC)/$(EXE) ../sabre
+
+dev: $(OBJDEV)
+	$(CC) $(CFLAGSDEV) $(SRC) -o $(EXE)-dev $(LDFLAGS)
+
+metrics:
+	$(CC) $(CFLAGSDEV) -o metrics metrics.c $(LDFLAGS)
+
+gprof:
+	$(CC) $(CFLAGS) $(GPROF) $(SRC) -o $(EXE).gprof $(LDFLAGS)
+
+clean:
+	$(RM) $(OBJ) $(EXE) core
+
+clean-all:
+	$(RM) $(OBJ) $(EXE) $(EXE)-dev $(EXE).gprof core gmon.out metrics
diff --git a/src/barcode.c b/src/barcode.c