\chapter{Introduction}\label{intro}
\index{Introduction@\emph{Introduction}}%
\epigraph{Nothing in biology makes sense except in the light of evolution}{Theodosius Dobzhansky}

The theory of evolution is the cornerstone of modern biology.  Through evolution, we have answered questions in areas such as hominid and human origins~\cite{Martin1990,Takahata1997}, vaccine development~\cite{Wilder-Smith2010,Fitch1993}, and even environmental bioremediation~\cite{Liu1993}.  Crucial to answering many of these questions is the ability to estimate the evolutionary relationships among different biomolecular sequences.

A multiple sequence alignment (MSA) is a hypothesis of the evolutionary relationships between different characters in a set of biomolecular sequences.  MSAs have been used in many bioinformatics analyses, including phylogeny estimation~\cite{Holder2003}, protein folding prediction~\cite{Karplus2009}, and functional annotation of proteins~\cite{Finn2010}.  However, MSA estimation is computationally challenging: optimizing many of the standard objective functions is NP-hard~\cite{Wang1994,Bonizzoni2001}, and the running time of most heuristic methods for MSA estimation grows faster than linearly with the number of sequences~\cite{Notredame2002}.
%\textbf{Nam:  say something less specific about optimization problem}

One approach to addressing this problem is the use of profile Hidden Markov Models (HMMs)~\cite{Eddy1998}.  A profile HMM is a statistical model that represents an MSA.  Profile HMMs can be used to insert new sequences into an existing MSA independently of one another~\cite{Eddy1998}, and thus exhibit linear scaling in running time with respect to the number of new sequences to insert.  Profile HMMs are used for more than just MSA estimation, however; other uses include remote homology detection~\cite{Finn2010}, sequence database searching~\cite{Punta2012}, and classification of short environmental reads~\cite{Gerlach2011}.
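
As a rough sketch of why this scaling is linear, assume a profile HMM with $k$ match states and $m$ query sequences, where the $i$th query has length $\ell_i \le \ell$ (the symbols $k$, $m$, $\ell_i$, and $\ell$ are introduced here only for illustration).  Because each state in a profile HMM has a constant number of predecessor states, the standard Viterbi algorithm aligns one query sequence to the model in $O(k\ell_i)$ time.  Since each query is aligned independently of the others, the total running time is
\[
  \sum_{i=1}^{m} O(k\,\ell_i) \;=\; O(m\,k\,\ell),
\]
which is linear in $m$ when $k$ and $\ell$ are fixed.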

The ability of profile HMMs to accurately insert sequences into an MSA degrades, however, on datasets containing evolutionarily divergent sequences~\cite{Moriyama2006,Finn2010}.  My investigation into this problem led to the development of a new statistical model, which I call the family of Hidden Markov Models (fHMM).  The fHMM is a statistical model that represents an MSA using multiple HMMs.  I show how the fHMM can be used to accurately align a sequence to an existing MSA.  As sequence alignment is a vital step in many bioinformatics analyses, the fHMM can be used across a wide range of problems, such as inserting sequences into a tree, taxonomically classifying short fragments, and aligning ultra-large datasets.

In Chapter~\ref{background}, I formally introduce key concepts in phylogenetics, such as MSA estimation and tree estimation.  I also introduce the use of HMMs for aligning query sequences to an MSA.  I then introduce three problems that will be addressed using the fHMM: phylogenetic placement, taxonomic identification and profiling, and MSA estimation.  In Chapter~\ref{hmmfamily}, I describe the fHMM technique and show how it can be used in sequence alignment.  In Chapter~\ref{sepp_chapter}, I present SEPP~\cite{todo}, a method for phylogenetic placement using the fHMM, together with a simulation study comparing SEPP to other placement methods.  I show that SEPP results in more accurate placements than using a single HMM, and that SEPP can accurately place sequences that are very evolutionarily divergent.  In Chapter~\ref{tipp_chapter}, I introduce TIPP, a method for taxonomic identification and profiling using the fHMM and statistical support measures.  Incorporating statistical support within the fHMM alignment technique greatly improves the precision of taxonomically classifying novel sequences.  In addition, I show that the fHMM results in better estimation of the species abundance profile of simulated microbial communities.  In Chapter~\ref{upp_chapter}, I present UPP, a ``de novo'' MSA estimation technique using the fHMM.  I show how to use the fHMM to align ultra-large datasets (large in the number of sequences) without the need for an initial backbone alignment and tree.  I also show how UPP can align datasets containing both short and full-length sequences, and that it can accurately align a dataset of 1,000,000 sequences in less than 2 days without the need for a supercomputer.  Finally, in Chapter~\ref{conclusion}, I summarize the contributions of this dissertation and discuss future work.

%\emph{Phylogenetics}, the study of the evolutionary relationships between different organisms, is vital in answering many of these questions.
%
% Multiple Sequence Alignment (MSA) is a fundamental step in many bioinformatic pipelines.
%
% Profile Hidden Markov Models (HMM) are a statistical representation of a Multiple Sequence Alignment (MSA).
%
%
% One of the most ambitious projects in phylogenetics is the Assembling the Tree of Life (AToL) project~\cite{atol-website}.  The goal of the AToL project is a tree representing the evolutionary relationship between all species on Earth.  As there are an estimated 9 million species on Earth~\cite{Mora2011}, so one can imagine that determining the relationship between such a huge number of diverse species can be extremely challenging, and that efficient and accurate methods for inferring the tree are needed for this task.
%
% A typical pipeline for inferring the relationships between different species is to first collect biomolecular sequences from the species (DNA, RNA, or amino acid sequences) of interest.  Next, estimate a \emph{Multiple Sequence Alignment} (MSA) containing the sequences.  The MSA is a hypothesis of the evolutionary history between the different characters in the sequences.  From the MSA, a tree representing the relationship between the species can be inferred.  The quality of the MSA greatly impact the quality of the estimated tree, and thus, there is great need for accurate alignment methods.
%
% MSAs answer more questions than just tree estimation.  Examples also include protein function prediction~\cite{Pei2008}, protein struction prediction~\cite{todo}, drug target identification~ \cite{Abadio2011}, and identification of new viruses~\cite{todo}.  Many biological questions that attempt to draw inferences between the relationships of different organisms using biomolecular sequences will often use some form of MSA estimation. Thus, developing an accurate MSA alignment method would lead to a large impact across many different areas in biology.
%
% The main core of my dissertation work came from the investigation of a specific problem in MSA estimation, how does one align a sequence to an existing MSA?  Imagine that for a particular set of species, an MSA and a tree have already been constructed from the sequences of those species, and a new sequence has now been collected from a novel species.  Instead of re-building the MSA and tree from scratch, an alternative approach is to insert the new sequence into the existing MSA.  This produces an alignment of the original sequences plus the new sequence (called ``extended alignment'').  The extended alignment can then be used to place the sequence into the existing tree.  The placement location can then be used to infer the relationship between the new species and the original species.
%
% The problem I described is known as the \emph{phylogenetic placement problem}.  Given an existing MSA (called the ``backbone alignment''), an existing tree (called the ``backbone tree''), and a new sequence (called the ``query sequence''), how does one align the query sequence to the backbone alignment, and then use that alignment to insert the query sequence into the backbone tree.
%
% My investigation into this problem has lead to the development of a new machine learning technique which I call the families of Hidden Markov Models (fHMMs).  The fHMMs is a technique that allows accurate alignment of the query sequence to an existing backbone alignment.  The utility of the fHMMs goes beyond phylogenetic placement.  I will show that this technique can be applied to a wide range of problems, such including determining the evolutionary relationship between different species, taxonomically classifying short DNA fragments from environmental samples, and estimating alignments on extremely large datasets.
%
% In Chapter~\ref{background}, I formally introduce key concepts in phylogenetics such as MSA estimation and tree estimation.  I also introduce the use of HMMs for aligning query sequences to an MSA.  Finally, I introduce 3 problems that will be addressed using the fHMMs technique: phylogenetic placement, taxonomic profiling and taxonomic identification, and MSA estimation.  In Chapter~\ref{hmmfamily}, I describe the fHMM technique.  In Chapter~\ref{sepp_chapter}, I present a simulation study on using fHMMs to address the phylogenetic placement problem.  I show that the fHMMs technique results in more accurate placements than using a single HMM, and can accurately place sequences that are very evolutionarily divergent.  In Chapter~\ref{tipp_chapter}, I show a modification of the fHMM technique for the problems of taxonomic identification and profiling.  By incorporating statistical support within the fHMM technique, the precision in taxonomically classifying novel sequences is greatly improved.  In addition, I show that fHMM technique results in better estimation of the species abundance profile of simulated microbial communities.  In Chapter ~\ref{upp_chapter}, I show a simple modification of the fHMMs that allows for ``de novo'' MSA estimation of ultra-large datasets (large in the number of sequences) without the need of an initial backbone alignment and tree.  I show that this method is a fast and efficient MSA estimation method, and results in more accurate MSAs compared to other MSA methods.  I show that this new technique can accurate align a dataset of 1,000,000 sequences in less than 2 days without the need of a supercomputer.  Finally, in Chapter~\ref{conclusion}, I summarize the contributions of this dissertation and discuss future work.