Computer Science and Engineering, Department of


First Advisor

Dr. Stephen Scott

Second Advisor

Dr. Jitender Deogun

Date of this Version



A DISSERTATION Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Doctor of Philosophy, Major: Computer Science, Under the Supervision of Professors Stephen Scott and Jitender Deogun. Lincoln, Nebraska: December, 2016

Copyright (c) 2016 Abhishek Majumdar


We address the problem of de novo motif identification: given a set of DNA sequences, we try to identify motifs in the dataset without any prior knowledge of whether motifs exist in it. We propose a method based on Probabilistic Suffix Trees (PSTs) to identify fixed-length motifs from a given set of DNA sequences. Our experiments show that our approach successfully discovers true motifs. We compared our method with the popular MEME algorithm and observed that it detects a larger number of correct and statistically significant motifs than MEME. Our method is also considerably more efficient than MEME at finding motifs in datasets of 1000 or more sequences. We applied our method to sequences of mutant strains of Exophiala dermatitidis and successfully identified motifs that revealed several transcription factor binding sites. This information is important to biologists designing experiments to understand the role of these binding sites in the regulatory pathways affected by cdc42. We also show that our PST approach to de novo motif discovery can be used successfully to identify motifs in ChIP-Seq datasets. These motifs in turn identify binding sites for proteins in the sequences.
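
For readers unfamiliar with PSTs, the core idea of predicting each nucleotide from a variable-length preceding context can be sketched as below. This is a minimal illustration under stated assumptions, not the dissertation's algorithm: the function names, the add-one smoothing, the `min_count` back-off threshold, and the use of a flat dictionary in place of an explicit pruned tree are all hypothetical choices made for brevity.

```python
from collections import defaultdict

ALPHABET = "ACGT"  # assumed DNA alphabet


def build_context_counts(sequences, max_depth):
    """Count, for every context of length 0..max_depth, which symbol follows it.

    A real PST stores these contexts as tree nodes and prunes them;
    a dictionary keyed by context string is used here for simplicity.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            # d = 0 gives the empty context, i.e. plain symbol frequencies.
            for d in range(min(max_depth, i) + 1):
                counts[seq[i - d:i]][symbol] += 1
    return counts


def next_symbol_prob(counts, context, symbol, max_depth, min_count=2):
    """Estimate P(symbol | context) from the longest sufficiently frequent suffix.

    Backs off to shorter suffixes until one was observed at least
    min_count times (hypothetical threshold), then applies add-one smoothing.
    """
    context = context[-max_depth:] if max_depth > 0 else ""
    while context and sum(counts.get(context, {}).values()) < min_count:
        context = context[1:]  # drop the oldest symbol
    seen = counts.get(context, {})
    total = sum(seen.values())
    return (seen.get(symbol, 0) + 1) / (total + len(ALPHABET))


# Toy input, invented for illustration only.
toy_sequences = ["ACGTACGTACGT", "TTACGTAA"]
toy_counts = build_context_counts(toy_sequences, max_depth=3)
p_t_given_acg = next_symbol_prob(toy_counts, "ACG", "T", max_depth=3)
```

In the toy input every occurrence of the context ACG is followed by T, so the estimate for T is much higher than for the other three nucleotides. Windows in which many positions are predicted this strongly are the kind of signal a PST-based motif finder can exploit.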