Electrical & Computer Engineering, Department of
First Advisor
Khalid Sayood
Date of this Version
5-2024
Document Type
Article
Citation
A thesis presented to the faculty of the Graduate College at the University of Nebraska in partial fulfillment of requirements for the degree of Master of Science
Major: Electrical Engineering
Under the supervision of Professor Khalid Sayood
Lincoln, Nebraska, May 2024
Abstract
It has long been known that the DNA molecule carries information. One tool for quantifying the relationships within the information contained in DNA sequences is a quantity from information theory known as the average mutual information (AMI) profile. This investigation used the AMI profile as its principal tool, along with a few other metrics, to explore the structure of the information contained in DNA sequences.
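As an illustration only, the AMI profile at lag k is commonly defined as the mutual information, in bits, between the base at position i and the base at position i + k. The Python sketch below estimates it from pair counts. The function name `ami_profile` and the choice to take the marginal probabilities from the lagged pairs are assumptions made for this example, not details taken from the thesis.

```python
from collections import Counter
from math import log2


def ami_profile(seq, max_lag):
    """Return [I(1), ..., I(max_lag)] in bits for a sequence of symbols.

    I(k) = sum over (x, y) of p_k(x, y) * log2(p_k(x, y) / (p(x) * p(y))),
    where p_k(x, y) is the relative frequency of x followed by y at distance k.
    """
    profile = []
    for k in range(1, max_lag + 1):
        # Count every pair (seq[i], seq[i + k]).
        pairs = Counter(zip(seq, seq[k:]))
        total = sum(pairs.values())
        if total == 0:
            # Lag is at least as long as the sequence: nothing to measure.
            profile.append(0.0)
            continue
        # Marginal counts taken from the same set of pairs (an assumption).
        left = Counter()
        right = Counter()
        for (x, y), n in pairs.items():
            left[x] += n
            right[y] += n
        ami = 0.0
        for (x, y), n in pairs.items():
            p_xy = n / total
            # p_xy / (p_x * p_y) simplifies to n * total / (left * right).
            ami += p_xy * log2(n * total / (left[x] * right[y]))
        profile.append(ami)
    return profile
```

For a strictly periodic sequence such as `"ACGT" * 50`, each base determines the base k positions later, so the estimate at lag 4 reaches the 2-bit maximum for a four-letter alphabet; for a constant sequence it is zero at every lag.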
Treating DNA sequences as an information source, several computational methods were employed to model their information structure. Maximum likelihood and maximum a posteriori estimators were used to predict missing bases in DNA sequences. Other novel prediction methods based upon the AMI profile and its ability to evaluate the predictability of DNA bases were also developed and tested for accuracy. The AMI profile was also adjusted to account for the triplet-code nature of DNA sequences. Additionally, machine-learning techniques such as neural networks, support vector machines, and principal component analysis were used to classify different regions of DNA sequences using the AMI profile and to compare coding versus noncoding regions.
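To make the distinction between the two estimators concrete, here is a minimal Python sketch that predicts a single missing base from its immediate left and right neighbours. The one-neighbour context, the names `train` and `predict_missing`, and the use of raw counts without smoothing are all assumptions chosen for brevity; the abstract does not say which contexts or priors the thesis used.

```python
from collections import Counter


def train(seq):
    """Count (left, middle, right) triples and middle-base occurrences."""
    triples = Counter(zip(seq, seq[1:], seq[2:]))
    middles = Counter(seq[1:-1])
    return triples, middles


def predict_missing(left, right, triples, middles, rule="map"):
    """Predict the base between `left` and `right`.

    rule="ml"  maximises P(left, right | x).
    rule="map" maximises P(left, right | x) * P(x).
    Returns None if no training data is available.
    """
    total = sum(middles.values())
    best, best_score = None, -1.0
    for x in "ACGT":
        if middles[x] == 0:
            continue  # base never seen as a middle symbol
        likelihood = triples[(left, x, right)] / middles[x]
        prior = middles[x] / total
        score = likelihood * prior if rule == "map" else likelihood
        if score > best_score:
            best, best_score = x, score
    return best
```

The two rules differ only in the prior term, so they agree whenever the bases are equally frequent and can disagree when base composition is skewed.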
Finally, the analysis considered the relative frequencies of groups of bases (known as k-mers) in DNA sequences. Arithmetic coding was explored as a way to compress DNA sequences using a model based on the relative frequencies of k-mers. The investigation concluded that the biological information stored in DNA is complex, and it provided methods that elucidate some of the character of the information structure of DNA sequences.
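As a rough illustration of how k-mer frequencies relate to compression, the Python sketch below tabulates relative k-mer frequencies and computes the ideal code length per base that an arithmetic coder driven by a static k-mer model could approach. It is not an arithmetic coder and not the thesis's model: the function names, the non-overlapping block split, and the omission of the cost of transmitting the model are assumptions made for this example.

```python
from collections import Counter
from math import log2


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def ideal_bits_per_base(seq, k):
    """Ideal code length per base for non-overlapping k-mer blocks.

    Each block costs -log2(p(block)) bits, where p is the block's relative
    frequency in seq. Trailing bases that do not fill a block are ignored.
    """
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    blocks = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    counts = Counter(blocks)
    total = len(blocks)
    bits = -sum(n * log2(n / total) for n in counts.values())
    return bits / (total * k)
```

With k = 1 and equally frequent bases the result is 2 bits per base, the same as a fixed two-bit encoding; values below 2 indicate structure that a frequency-based model can exploit.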
Advisor: Khalid Sayood
Included in
Bioinformatics Commons, Computer Engineering Commons, Genetics and Genomics Commons, Other Electrical and Computer Engineering Commons
Comments
Copyright 2024, Joel Mohrmann. Used by permission