Mathematical characterizations of biological sequences are one of the main elements of bioinformatics. In this work, a class of DNA sequence characterizations, namely computational genomics signatures, which capture global features of these sequences, is used to address emerging computational biology challenges. Because genome signatures are species-specific and pervasive, they can be used to characterize and identify a genome or a taxonomic unit from a short genome fragment of that source. However, identification accuracy is generally poor when the sequence model and the sequence distance measure are not selected carefully. We show that using relative distance measures instead of absolute metrics yields better detection accuracy. Furthermore, relative metrics make it practical to build genome signatures from more complex models, which cannot be used efficiently with conventional distance measures.
Using a relative distance measure and a model based on the relative abundance of oligonucleotides in a genome fragment, a novel genome signature was defined. This signature was employed to address a class of metagenomics problems. The metagenomics approach enables sampling and sequencing of a microbial community without isolating and culturing single species. Determining the taxonomic classification of the bacterial species within the microbial community from the mixture of short DNA fragments is a difficult computational challenge. We present supervised and unsupervised algorithms for taxonomic classification of metagenomics data and demonstrate their effectiveness on simulated and real-world data. The supervised algorithm, RAIphy, classifies metagenome fragments of unknown origin by assigning them to taxa defined in a signature database of previously sequenced microbial genomes. The signatures in the database are updated iteratively during the classification process. Most metagenomics samples include unidentified species and therefore require clustering. In the unsupervised setting, pseudo-assembly of fragments, followed by clustering of taxa, is employed. The signatures developed in this work are more species-specific and pervasive than any signatures currently available in the literature, and they demonstrate the potential and viability of using genome signatures to solve various metagenomics problems as well as other challenges in computational biology.
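As an illustration of the general idea of an oligonucleotide relative-abundance signature, the following Python sketch scores a fragment against per-taxon signatures and assigns it to the best-scoring taxon. It is a minimal sketch under stated assumptions, not the definition used in this work: the signature here is the log-ratio of an observed k-mer frequency to the frequency expected from single-base composition, and the score is the mean signature value over the fragment's k-mers. The function names, the choice of k = 2, and the pseudocount are all hypothetical, and the iterative signature update performed by RAIphy is not shown.

```python
from collections import Counter
from itertools import product
from math import log


def kmer_counts(seq, k):
    """Count overlapping k-mers that consist only of A, C, G and T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def relative_abundance_signature(seq, k=2, pseudocount=1.0):
    """Map every k-mer to log(observed frequency / expected frequency).

    Assumption: the expected frequency is the product of the single-base
    frequencies. Pseudocounts keep unseen k-mers and bases finite.
    """
    bases = kmer_counts(seq, 1)
    total_bases = sum(bases.values())
    kmers = kmer_counts(seq, k)
    total_kmers = sum(kmers.values())
    signature = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        observed = (kmers[kmer] + pseudocount) / (total_kmers + pseudocount * 4 ** k)
        expected = 1.0
        for b in kmer:
            expected *= (bases[b] + pseudocount) / (total_bases + 4 * pseudocount)
        signature[kmer] = log(observed / expected)
    return signature


def score_fragment(fragment, signature, k=2):
    """Mean signature value over the fragment's k-mers (higher is a better match)."""
    counts = kmer_counts(fragment, k)
    total = sum(counts.values())
    if total == 0:
        return float("-inf")
    return sum(n * signature[kmer] for kmer, n in counts.items()) / total


def classify(fragment, signatures, k=2):
    """Return the taxon whose signature gives the fragment the highest score."""
    return max(signatures, key=lambda taxon: score_fragment(fragment, signatures[taxon], k))


if __name__ == "__main__":
    # Toy "genomes" with identical base composition but different dinucleotide usage.
    signatures = {
        "taxon_a": relative_abundance_signature("AACCGGTT" * 20),
        "taxon_b": relative_abundance_signature("ACGT" * 40),
    }
    print(classify("AACCGGTTAACC", signatures))
    print(classify("ACGTACGTACGT", signatures))
```

Because the score depends on how a fragment's k-mers compare with what base composition alone would predict, two sources with the same base composition can still be told apart. With very skewed or very short reference sequences, the pseudocounts dominate and this simple score becomes unreliable.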