Characterization and use of structure and complexity of DNA sequences
Abstract
In this dissertation we analyze biological sequences using two proposed methods of characterization. The first method uses the Average Mutual Information (AMI) profile of a sequence, which captures the statistical properties of the string in a concise representation. The second method uses the notion of “complexity”: using the Lempel-Ziv (LZ) complexity measure, we define a distance metric for sequences. We use AMI profiles to solve the fragment assembly problem, which is to reconstruct a target DNA sequence from randomly sampled fragments. Most existing fragment assembly techniques follow the overlap-layout-consensus approach, which requires extensive computation in each phase and becomes inefficient as the number of fragments increases. We propose a new algorithm that jointly solves the overlap, layout, and consensus problems. The fragments are clustered with respect to their AMI profiles using the k-means algorithm. This removes the unnecessary requirement that the collection of fragments be considered as a whole; instead, orientation and overlap detection are solved efficiently within each cluster. We apply the second method of characterization to phylogeny construction. Most existing approaches to phylogenetic inference use multiple alignment of sequences and assume some form of evolutionary model. The multiple alignment strategy does not work for all types of data (for example, whole-genome phylogeny), and the evolutionary models may not always be correct. We propose a new sequence distance measure based on the relative information between sequences, computed using LZ complexity. The resulting distance matrix can be used to construct phylogenetic trees. The proposed approach does not require sequence alignment and is fully automatic. The proposed methods are not limited to the applications studied in this dissertation. They capture universal properties of the sequences and can be used to tackle other problems in computational biology.
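As a rough illustration of the two characterizations named in the abstract, the following Python sketch computes an AMI profile and an LZ-complexity-based distance. The abstract does not give formulas, so everything specific here is an assumption made for illustration: the function names, the base-2 logarithm, the simple plug-in estimator of mutual information at each lag, the LZ (1976-style) exhaustive-history parsing used to count components, and the particular normalized distance. It should be read as one plausible instance of the ideas, not as the dissertation's actual implementation.

```python
import math
from collections import Counter


def ami_profile(seq, max_lag):
    """AMI at lags 1..max_lag (assumed plug-in estimator, in bits).

    Assumes max_lag < len(seq). Marginal probabilities are estimated
    from the whole sequence, joint probabilities from symbol pairs
    that are k positions apart.
    """
    n = len(seq)
    single = Counter(seq)
    profile = []
    for k in range(1, max_lag + 1):
        pairs = Counter(zip(seq, seq[k:]))
        total = n - k
        info = 0.0
        for (x, y), count in pairs.items():
            p_xy = count / total
            p_x = single[x] / n
            p_y = single[y] / n
            info += p_xy * math.log2(p_xy / (p_x * p_y))
        profile.append(info)
    return profile


def lz_complexity(s):
    """Number of components in an exhaustive-history LZ parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the component while it can be copied from earlier text
        # (the copy source may overlap the component itself).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance(s, q):
    """Assumed normalized distance built from LZ complexity.

    Measures how many new components each sequence adds when appended
    to the other, scaled by the larger individual complexity.
    """
    c_s, c_q = lz_complexity(s), lz_complexity(q)
    extra = max(lz_complexity(s + q) - c_s, lz_complexity(q + s) - c_q)
    return extra / max(c_s, c_q)
```

A fixed-length vector from `ami_profile` could be passed to any k-means implementation to group fragments, and pairwise `lz_distance` values could fill a distance matrix for a tree-building method. Both uses are only outlined here and are not shown in the sketch.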
Subject Area
Electrical engineering, Genetics, Molecular biology
Recommended Citation
Otu, Hasan Huseyin, "Characterization and use of structure and complexity of DNA sequences" (2002). ETD collection for University of Nebraska-Lincoln. AAI3064567.
https://digitalcommons.unl.edu/dissertations/AAI3064567