master_thesis/implementation.tex at master · guyduche/master_thesis · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
\chapter{Implementation}
To conform to the expectations, the program was built in several parts.
We will now see how it was done.

All the sources, tests and manual can be found on GitHub: \url{https://github.com/guyduche/Blast2Bam}.


\section{\fastqtofasta{}}\label{sec:fastq2fasta}
As we have seen in Subsection~\ref{subsec:blastinput}, \gls{blast} cannot process fastQ files.
FastQs must be transformed into fasta files.

This can be done with bash commands but we thought it would be more user-friendly to integrate it in a simple tool.
That is why we created \fastqtofasta{}.
\fastqtofasta{} reads the fastQ file record by record (four lines at a time).
It then extracts the first two lines of each record (read name and sequence) and writes it in the fasta file.
In the process, the `@' at the beginning of the first line is replaced with a `\textgreater' to match fasta format.


\section{\blastobam{}}
\blastobam{} is the name we gave to the program we created to go from a \gls{blast} \gls{xml} output to a fully compatible \gls{sam} file.
As we have seen in Section~\ref{sec:samoutput}, a \gls{sam} file is composed of two parts: the header section and the alignment section.
We will now see how \blastobam{} is able to create them.


\subsection{Header section}\label{subsec:header}
As we have seen previously (Section~\ref{sec:samoutput}), a \gls{sam} header is composed of mainly four types of line: \emph{@HD, @SQ, @RG and @PG}.
\begin{itemize}
    \item \emph{@HD} is created automatically by \href{http://www.htslib.org/}{SAMtools} when the \gls{sam} file is transformed into BAM (binary version of \gls{sam}).
    Thus, there is no need for \blastobam{} to create this line.
    \item \emph{@SQ} contains the reference name and its length. There must be an \emph{@SQ} line for each reference on which the reads are tested for alignment.
    As we can see in figure~\ref{fig:xmlExample}, the reference name and its length are available respectively under tags \texttt{Hit\_def} and \texttt{Hit\_len} of \gls{blast} \gls{xml} output.
    However, those tags will only exist if the reads are mapped on the reference.
    A reference must have an \emph{@SQ} line even if no read is mapped on it.
    Therefore, the information must come from an external source.
    \blastobam{} extracts the names and calculates the lengths of the sequences from the reference file.
    It then outputs a formatted \emph{@SQ} line for each sequence.
    \item \emph{@RG}: The read group line is optional and must come from user input.
    This can be passed to \blastobam{} as a string via the \texttt{-R} option.
    This string must follow a specific syntax in order to be accepted: `\texttt{@RG\textbackslash{}tID:foo\textbackslash{}t\ldots}'.
    \item \emph{@PG} contains the program identification, name, version and the command-line used (CL).
    The CL tag is obtained by perusing through the arguments of the \href{https://github.com/guyduche/Blast2Bam/blob/master/src/main.c}{\emph{main()} function} of \blastobam{}.
\end{itemize}
We will now see how the alignment section was done.


\subsection{XML parsing}
The first challenge was to parse \gls{blast} \gls{xml} output.
Parsing is the process of reading data stored in a file (in our case an XML file) and storing it into structures we can easily manipulate.
To help us, we chose to use \href{http://www.xmlsoft.org/}{libxml2}, a software library for \gls{xml} parsing.
Three ways of parsing are available in this \acrfull{api}:
\begin{itemize}
    \item \gls{dom}: Tree-based parsing. The whole file is loaded in memory as a tree.
    \item \gls{sax}: Event-based push parsing. The parsing is done in streaming using callback functions.
    \item \gls{stax}: Event-based pull parsing. Simpler event-based parsing.
\end{itemize}
The size of \gls{blast} \gls{xml} output files can be huge so \gls{dom} approach cannot be used in this application.
Besides, event-based parsers can extract data from `stdin', which means that they would enable us to integrate \gls{blast} and \blastobam{} in a pipeline.
Two event-based parsers, \gls{sax} and \gls{stax} (xmlreader), are available in \href{http://www.xmlsoft.org/}{libxml2}.
The pull parsing option is simpler and allows more control over the parsing.
\href{xml@gnome.org}{Daniel \textsc{Veillard}}, the conceptor of \href{http://www.xmlsoft.org/}{libxml2}, wrote on xmlreader webpage~\cite{website:libxml:xmlreader}:
\begin{quote}
``In a nutshell the XmlTextReader API provides a simpler, more standard and more extensible interface to handle large documents than the existing SAX version.''
\end{quote}
Consequently, we chose to use the xmlreader (\gls{stax}) approach to parse the \gls{xml} \gls{blast} output.

In order to avoid writing redundant pieces of code, we created an \href{https://github.com/guyduche/Blast2Bam/blob/master/src/schema2c.xsl}{\acrshort{xslt} transformation file}.
\gls{xslt} is an \gls{xml} specification used to transform an \gls{xml} file into another type of document.
In our case, we use it to write a text file that will contain the C source code necessary to parse the \gls{blast} \gls{xml} output.

First, we created an \href{https://github.com/guyduche/Blast2Bam/blob/master/src/schema.xml}{\gls{xml} document} that contains all the different tags that can be found in \gls{blast} output. This document is based on the \href{http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.mod.dtd}{\gls{blast} output \gls{dtd}} issued by the \gls{ncbi}.
The \gls{xslt} code looks into the \gls{xml} document and, for each tag, writes the C code that creates the structures and substructures corresponding to the tags and then the code that will be used to extract the data from the \gls{blast} output and put it into the structures.
This manner of implementation will allow the parser to be more adaptable in case the \gls{ncbi} issued an update of the \gls{xml} output.

For each iteration of \gls{blast}, the parser creates data structures adapted to the information contained in the \gls{xml} file (number of hits, number of \glspl{hsp}) and records it for further use by the rest of the program.


\subsection{FastQ parsing}
If we compare figure~\ref{fig:xmlExample} and figure~\ref{fig:samBWA}, we can see that \gls{blast} \gls{xml} output is lacking some information needed to write the \gls{sam} file.
\begin{itemize}
    \item \gls{blast} cannot process fastQ files and so its output does not display the sequencing quality of the reads.
    \item The read's sequence needed in column 10 of the \gls{sam} file should be the whole sequence of the read. Sequences in \gls{blast} output are modified. They are hard-clipped (only the part of the sequence needed for the alignment is displayed) and `-' are added in the query sequence to represent deletions.
\end{itemize}

This information can be extracted from the original fastQ(s). Consequently, \blastobam{} parses the fastQ and writes its contents into the \gls{sam} file.
This parsing is done essentially the same way it was done by \fastqtofasta{} (Section~\ref{sec:fastq2fasta}), apart from the fact that, in this implementation, the fourth line (sequencing quality) is kept and there is no fasta file writing.


\subsection{Read pairing}
% Always two there are, no more, no less, a read and its mate.
\gls{blast} does not recognize paired-end data.
It maps a single sequence with the reference and outputs the alignment results.
As we can see in Figure~\ref{fig:samBWA}, the read and its mate should be paired in \gls{sam} output.
It is essential to obtain the FLAG, RNEXT, PNEXT and TLEN columns.

If the sequencing data is single-end, no pairing is possible. In that case, FLAG does not display any pairing component, RNEXT shows the indefinite character `*' and PNEXT and TLEN are nullified.

The pairing must be done at several levels:
\begin{itemize}
    \item \gls{blast} should process the sequence of the mate just after its corresponding read.
    We have seen in Section~\ref{sec:shortReads} that paired-end data can come in two flavors: two fastQs or a single interleaved one.
    In case of interleaved data, the read and its mate will already be processed one after the other.
    In case of two fastQs, \fastqtofasta{} is instructed to read alternatively from the first and second file, thus writing an interleaved fasta file.
    \item The \gls{blast} output is parsed two iterations at a time (\gls{blast} does one iteration for each read).
    \item \blastobam{} parses the fastQ files or interleaved fastQ similarly to \fastqtofasta{}.
    It keeps reads' records two at a time to match \gls{blast} iterations.
    \item After \gls{xml} and fastQ parsing, \blastobam{} pairs the data by recording it into a series of structures.
    \begin{itemize}
        \item \emph{IterationSam}: global structure containing all the information concerning the alignments of the two reads. It contains an indefinite number of \emph{SamHits}.
        There is always only one \emph{IterationSam} at a time as it is deleted and reconstituted for each double iteration in \gls{blast} output.
        \item \emph{SamHit}: structure containing the alignments of the two reads on a single reference. It contains an indefinite number of \emph{RecordSam}.
        There is as much \emph{SamHits} in \emph{IterationSam} as there are references on which the two reads are tested for mapping.
        \item \emph{RecordSam}: structure containing a paired alignment record. In \gls{blast} output, there is an \gls{hsp} for each alignment found.
        \blastobam{} creates a paired record for each combination of \gls{hsp} possible between the two reads.
        For example, if three possible alignments are found for a read and two for its mate, six records will be created.
        As the two reads of a pair should not be mapped to different references, only the \glspl{hsp} concerning the same reference are considered for pairing.
        A \emph{RecordSam} structure contains two \emph{SamOutput} structures, one for each read.
        \item \emph{SamOutput}: structure containing the read informations from the fastQ, the name of the reference on which it is mapped and the corresponding \gls{hsp} from \gls{blast} output.
    \end{itemize}
    \item As we will see in the next subsection, the pairing process continues with the filtering of the records and then with the creation of the FLAG, RNEXT, PNEXT and TLEN columns of the \gls{sam} file.
\end{itemize}


\subsection{Record filtering}\label{subsec:recFilter}
As we have seen in the last subsection, a record (\emph{RecordSam}) is created for each possible combination of \glspl{hsp} between the two reads.
Some of the combinations created may be good pairs but others may not.
\blastobam{} needs to sort them out and output only the relevant ones in the \gls{sam} file.
This filtering is done based on the distance between the two reads' alignments on the reference.
This distance is called template length and will be displayed in the \gls{sam} file under TLEN field (9\textsuperscript{th} column).
In \gls{sam} format specification~\cite{samspec}, TLEN is partially defined as such:
\begin{quote}
``If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base.''
\end{quote}
Considering how paired-end sequencing works (see figure~\ref{fig:pairedEndSeq}), in a pair, the reads (segments) should be `facing' each others; the leftmost mapped base should be the 5'~end of one of the read and the rightmost, the 5'~end of its mate.
In other words, unsigned TLEN should be the distance on the reference between the 5'~ends of the reads.
In \gls{blast} output, the 5'~end of a read's alignment is available under \texttt{hsp->\allowbreak{}hsp\_hit\_from} tag.
And so, unsigned TLEN equals to 1 + the absolute value of the difference between the position of the two 5'~ends.

To filter the records, we arbitrarily decided that TLEN should not be greater than three times the size on the reference of the longest alignment between the two reads.
If a record does not agree with that rule, it will be marked and so will not be printed in the \gls{sam} file.
If in an \emph{IterationSam}, no record passes that test, the two reads will be considered unmapped.


\subsection{Selection of the best record}\label{subsec:thereCanBeOnlyOne}
Along with the filtering, the best record must be chosen. This is necessary for the FLAG creation as we will see in the next subsection.
To determine which paired record is the best, a score must be calculated.
Each alignment possesses an e-value (\texttt{hsp->\allowbreak{}hsp\_evalue}); the smaller, the better.
The score of the record is the sum of the e-values of its two alignments.
If one of the two reads is unmapped, an arbitrary penalty of \texttt{+}0.5 is applied.
The paired record with the smallest score is considered the best and marked as such.


\subsection{Sequence Alignment\slash{}Map format (SAM) alignment section creation}\label{subsec:samCreation}
After input parsing, read pairing, filtering and best record selection, all the necessary elements of the \gls{sam} file are recorded, organised and ready to be printed.

As we have seen in Section~\ref{sec:samoutput}, a \gls{sam} record is composed of 11 mandatory columns and an optional one.
We will now see how \blastobam{} creates them.
\begin{enumerate}
    \item QNAME\@: The name of the read is taken from the \emph{SamOutput} structure: \texttt{samOutput->\allowbreak{}query->\allowbreak{}name}.
    \item FLAG\@:
    \begin{itemize}
        \item \texttt{0x1} is set when the sequencing data is paired-end. If it is unset, \texttt{0x2}, \texttt{0x8}, \texttt{0x20}, \texttt{0x40} and \texttt{0x80} cannot be set.
        \item \texttt{0x2} is set when the pair of reads is considered properly aligned.
        As the paired records have already been filtered (see Subsection~\ref{subsec:recFilter}), every mapped reads coming from paired-end data is flagged \texttt{0x2}.
        \item \texttt{0x4} is set when the read is unmapped (no \gls{hsp} in the record).
        \item \texttt{0x8} is set when the read's mate is unmapped.
        \item \texttt{0x10} is set when the read is mapped on the reverse strand. For the read's \gls{hsp}, it means that \texttt{hsp->\allowbreak{}hsp\_hit\_to < \allowbreak{}hsp->\allowbreak{}hsp\_hit\_from}.
        \item \texttt{0x20} is set when the read's mate is mapped on the reverse strand.
        \item \texttt{0x40} is set if the read is the first of its pair. If there are two fastQs, it means the read comes from the first file.
        \item \texttt{0x80} is set if the read is the second if its pair.
        \item \texttt{0x100} is set if the record is a secondary alignment.
        If after the alignment filtering, the read still possesses more than one record; the best one is selected; the others are flagged secondary (\texttt{0x100})
        (see subsections~\ref{subsec:recFilter} and~\ref{subsec:thereCanBeOnlyOne}).
        \item \texttt{0x200} and \texttt{0x400} cannot be set by \blastobam{}.
        \item \texttt{0x800} is set if the record is part of a chimeric alignment.
        For the time being, chimeric alignments cannot be processed by \blastobam{}.
    \end{itemize}
    \item RNAME\@: The name of the reference on which the read is mapped is taken from the \emph{SamOutput} structure: \texttt{samOutput->\allowbreak{}rname}.
    \item POS\@: The position at which the read is mapped is taken from the \emph{SamOutput} structure.
    The \gls{sam} format specification~\cite{samspec} stipulates that POS should be the leftmost position of the alignment on the reference.
    Depending on which is the leftmost, POS is chosen between \texttt{samOutput->\allowbreak{}hsp->\allowbreak{}hsp\_hit\_from} and \texttt{samOutput->\allowbreak{}hsp->\allowbreak{}hsp\_hit\_to}.
    \item MAPQ\@: In order to be compatible with downstream software, the mapping quality is set to 60 for mapped and 0 for unmapped reads.
    The score of the alignment calculated by \gls{blast} can be found in the metadata tags.
    \item CIGAR\@: The \gls{cigar} string is done by comparison between the read and reference sequences displayed in \gls{blast} output (\texttt{hsp->\allowbreak{}hsp\_qseq} and \texttt{hsp->\allowbreak{}hsp\_hseq}).
    Unlike other mappers such as \gls{bwa}, we chose to differentiate matches and mismatches and so display `=' and `X' and no `M'.
    \item RNEXT\@: The name of the mate's reference is taken from its \emph{SamOutput} structure in the same manner RNAME was taken for the read. If RNAME and RNEXT are the same, RNEXT is noted `='.
    \item PNEXT\@: The position at which the mate is mapped is taken from its \emph{SamOutput} structure in the same manner POS was taken for the read.
    \item TLEN\@: The template length is taken from the \emph{RecordSam} structure: \texttt{recordSam->\allowbreak{}tlen}.
    If the read is mapped on the reverse strand, a minus sign is added.
    \item SEQ\@: The Sequence of the read is taken from the \emph{SamOutput} structure: \texttt{samOutput->\allowbreak{}query->\allowbreak{}seq}.
    As specified~\cite{samspec}, if the read is mapped on the reverse strand, its sequence is displayed reverse-complemented.
    \item QUAL\@: The sequencing quality of the read is taken from the \emph{SamOutput} structure: \texttt{samOutput->\allowbreak{}query->\allowbreak{}qual}.
    Similarly to the SEQ column, if the read is mapped on the reverse strand, its sequencing quality string will be reversed.
    \item Metadata: The data displayed in this column is optional. This is what \blastobam{} can show:
    \begin{itemize}
        \item NM corresponds to the number of differences between the read sequence and the reference in the alignment.
        Its calculation is based on the \gls{cigar} string, one point for every mismatch and every base inserted or deleted (X, I, D or N).
        \item RG corresponds to the read group identifier as described in the \emph{@RG} line of the header section.
        \item AS is the alignment score. It is the \gls{hsp} score of the read: \texttt{samOutput->\allowbreak{}hsp->\allowbreak{}hsp\_score}.
        \item XB is the alignment bitscore of the \gls{hsp}: \texttt{samOutput->\allowbreak{}hsp->\allowbreak{}hsp\_bit\_score}.
        \item XE is the e-value of the alignment: \texttt{samOutput->\allowbreak{}hsp->\allowbreak{}hsp\_evalue}.
    \end{itemize}
\end{enumerate}


\section{Upgrades}
% We don't need roads.
In order to improve \blastobam{} and make it more user-friendly, some upgrades were made.


\subsection{Options}\label{subsec:options}
An option system was created to allow the user to tune \blastobam{} to his needs.
\begin{itemize}
    \item \textbf{-o} \emph{file}: modification of the program output.
    To enable the use of \blastobam{} in a pipeline, its default output is `stdout' (standard output).
    This option allows the user to write the \gls{sam} output into a file.
    This option is also available for \fastqtofasta{}.
    \item \textbf{-p}: interleaved option.
    If two fastQ files are given to \blastobam{}, the program will enter automatically in paired-end mode.
    But if there is only one fastQ, it cannot know if the data is single-end or interleaved paired-end. By default, it will consider it to be single-end.
    If this option is activated, the program will consider that the given fastQ file is interleaved.
    \item \textbf{-R} \emph{str}: read group. This option associates a read group to all the reads analyzed.
    The read group will appear in the header section of the \gls{sam} file and its ID tag will be written under the RG tag of the metadata field in the alignment section (see~\ref{subsec:header}).
    \item \textbf{-W} \emph{int}: minimum alignment length.
    Depending upon its seed length, \gls{blast} can generate small alignments.
    The user may consider that those alignments are not relevant but still want to keep a small seed length to have the best accuracy possible.
    This option allows the user to impose a cut-off on the alignment length.
    If the alignment length is inferior to the given number, the alignment will not be displayed in the \gls{sam} file.
    If, for a given read, no alignment conforms to this rule, the read will be considered unmapped and printed as such.
    This option has been improved further with a default action of which we will discuss in Subsection~\ref{subsec:autofilter}.
    \item \textbf{-z}: adjusted position.
    By default, the position of the alignment (POS) is its absolute position in the reference, 1 being the first position.
    If the reference is part of a bigger sequence, the user may want POS to show the position in the whole sequence rather than in the smaller fragment.
    In that case, the position of the reference in the bigger sequence is often displayed in its name (example: chr2:12752\textendash{}24782).
    This option makes use of that and adjusts the position displayed to the position of the reference.
    Example: an alignment beginning at position 5 of the previous reference will have a POS noted 12757 with this option set.
\end{itemize}


\subsection{Automatic filtering}\label{subsec:autofilter}
As we will see in Chapter~\ref{ch:results}, in the same experiment, the reads can have very different lengths.
In that case, a static cut-off to filter the records, like the one we have seen in Subsection~\ref{subsec:options}, may not be the best solution.
For example, a threshold of 40 does not mean much for a read 10000 bases long but it can be very stringent if the read is only 50 bases long.
Although this example is a bit extreme, it would be interesting to adjust the filter to the reads' length.
We estimated that the size of the alignments should be equal to at least 20\% of the length of the read, with a minimum of 20 bases.
As this threshold is arbitrary, the user may choose a different one via option \texttt{-W} if he sees fit.


\subsection{Fasta}
\href{http://www.personal.psu.edu/iua1/}{Istvan \textsc{Albert}} of \href{http://www.psu.edu/}{Pennsylvania State University} commented on Biostars (\href{https://www.biostars.org/p/53434/}{thread 53434}):
\begin{quote}
``There are so many BAM file visualizers out there and none (that I know of) for BLAST output.''
\end{quote}
This made us think that some people may have a use of \blastobam{} outside of \gls{ngs}.
And so, it would maybe be interesting for the user to be able to use fasta files instead of fastQs as an input
(see figures~\ref{fig:fastqExample} and~\ref{fig:fastaExample} for a reminder of the differences between the two formats).

To allow the use of fasta files, two modifications were necessary:
\begin{itemize}
    \item The fastQ parser has been rewritten to be able to detect whether the input record is in fastQ or fasta format.
    If it is in fastQ format, the third and fourth lines will be analyzed. Otherwise, the next line is considered to be a new sequence.
    As fasta sequences are often written on multiple lines, the parser had to be modified further to accommodate to that characteristic.
    \item The \gls{sam} alignment section creation has also been modified to allow the possibility that there may not be a sequencing quality string to write in the QUAL field.
    In that case, the indefinite character `*' is used instead.
\end{itemize}