Date of this Version
G. P. Newcomb, "Genome Annotation Using Average Mutual Information," M.S. thesis, Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, 2021.
Advances in high-throughput DNA sequencing technologies, together with ambitious goals for their use, are producing a deluge of unannotated sequenced genomes. This makes computational tools that can aid in annotation increasingly valuable.
Here, we provide a detailed exploration of the utility as well as the limitations of average mutual information (AMI) in several steps of genome annotation. For a genomic sequence, AMI is a measure of the information a base contains about the base a fixed number of positions (the lag) away. A profile is constructed by calculating AMI at multiple lags. In addition to traditional AMI, we employ two AMI variants: expanded AMI and expanded-adjusted AMI, both of which preserve some granular detail discarded by AMI.
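As an illustration of how an AMI profile might be computed, the following is a minimal sketch and not the thesis's implementation. It assumes the standard mutual-information form, I(k) = Σ p_k(x, y) log2[ p_k(x, y) / (p(x) p(y)) ], with the marginals estimated from the same base pairs used for the joint distribution at lag k. The function names `ami` and `ami_profile` are hypothetical, and the expanded and expanded-adjusted variants are not covered.

```python
from collections import Counter
from math import log2


def ami(seq, lag):
    """Average mutual information (in bits) between bases `lag` positions apart.

    Assumption: marginals are taken from the lagged pairs themselves,
    which keeps the estimate non-negative.
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    n = len(pairs)
    if n == 0:
        raise ValueError("lag must be smaller than the sequence length")
    joint = Counter(pairs)                  # counts of (base, lagged base)
    left = Counter(x for x, _ in pairs)     # counts of the first base
    right = Counter(y for _, y in pairs)    # counts of the lagged base
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written in terms of counts
    return sum(
        (c / n) * log2(c * n / (left[x] * right[y]))
        for (x, y), c in joint.items()
    )


def ami_profile(seq, max_lag):
    """AMI evaluated at lags 1..max_lag, returned as a list."""
    return [ami(seq, k) for k in range(1, max_lag + 1)]
```

For example, `ami_profile(sequence, 50)` would give a 50-element vector for one sequence. A perfectly periodic sequence such as a repeated `ACGT` yields 2 bits at lag 4, because each base fully determines the base four positions later.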
First, we demonstrate AMI’s capacity to assess evolutionary similarity by constructing phylogenetic trees similar to those currently accepted. The remainder of this work focuses on applications involving binary classification. We use support vector machines trained using the AMI profiles to classify sequences and evaluate predictive performance. These classification problems include predicting whether sequences come from protein-coding regions, identifying essential genes, and making functional predictions about the proteins genes produce. We conclude that AMI is particularly adept at identifying coding regions, and this behavior is consistent for species across all of life’s diversity.
Adviser: Khalid Sayood