Electrical & Computer Engineering, Department of


Date of this Version

Summer 8-2012


S. F. Way. Classification of Genomic Sequences by Latent Semantic Analysis. Master's thesis, Department of Electrical Engineering, University of Nebraska-Lincoln, 2012.


A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Electrical Engineering, Under the Supervision of Professor Khalid Sayood and Professor Ozkan Ufuk Nalbantoglu. Lincoln, Nebraska: August, 2012

Copyright (c) 2012 Samuel F. Way


Evolutionary distance measures provide a means of identifying and organizing related organisms by comparing their genomic sequences. As such, techniques that quantify the level of similarity between DNA sequences are essential in our efforts to decipher the genetic code in which they are written.

Traditional methods for estimating the evolutionary distance separating two genomic sequences often require that the sequences be aligned before they are compared. Unfortunately, this preliminary step imposes a substantial computational burden, making this class of techniques impractical for applications involving a large number of sequences. Instead, we desire new methods for differentiating genomic sequences that eliminate the need for sequence alignment.

Here, we present a novel approach for identifying evolutionarily similar DNA sequences using a theory and collection of techniques called latent semantic analysis (LSA). In the field of information retrieval, LSA techniques are used to identify text documents with related underlying concepts, or "latent semantics". These techniques capture the inherent structure within a collection of documents, and in much the same way, we extend these approaches to infer the biological structure through which a collection of organisms is related. In doing so, we develop a computationally efficient means of identifying evolutionarily similar DNA sequences that is especially well-suited for partitioning large sets of biological data.
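As a rough illustration of the idea, and not the thesis's actual pipeline, the sketch below treats each DNA sequence as a "document" and its overlapping k-mers as "words", builds a k-mer-by-sequence frequency matrix, and projects the sequences into a low-rank space with a truncated singular value decomposition, the standard core of LSA. No alignment is performed. The choice of k-mers as terms, the values k=3 and rank=2, the frequency normalization, and the function names `kmer_counts`, `lsa_embed`, and `cosine` are all assumptions made for this example; the abstract does not specify them.

```python
from itertools import product

import numpy as np


def kmer_counts(seq, k=3):
    """Count overlapping k-mers over the alphabet ACGT (vector of length 4**k)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            counts[j] += 1
    return counts


def lsa_embed(seqs, k=3, rank=2):
    """Return one low-rank coordinate vector per sequence (hypothetical helper)."""
    # Term-document matrix: rows are k-mers, columns are sequences.
    A = np.column_stack([kmer_counts(s, k) for s in seqs])
    # Assumption: normalize each column to relative frequencies so that
    # sequences of different lengths are comparable.
    A = A / np.maximum(A.sum(axis=0, keepdims=True), 1)
    # Truncated SVD: keep the `rank` largest singular values and vectors.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (s[:rank, None] * Vt[:rank]).T


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


if __name__ == "__main__":
    # Toy sequences invented for illustration: two AT-rich, two GC-rich.
    seqs = ["ATATATATATAT", "ATATATATTTAT", "GCGCGCGCGCGC", "GCGCGCCGCGCG"]
    emb = lsa_embed(seqs)
    print("AT vs AT:", cosine(emb[0], emb[1]))
    print("AT vs GC:", cosine(emb[0], emb[2]))
```

In this toy setting, sequences with similar k-mer composition land close together in the reduced space, so a cosine similarity between embeddings can stand in for a pairwise comparison. Real genomic data would call for larger k, more retained dimensions, and a weighting scheme chosen for the data at hand.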

Advisers: Khalid Sayood and Ozkan Ufuk Nalbantoglu