Computational approaches to protein classification and multiple whole genome alignment

Jingyi Yang, University of Nebraska - Lincoln


Protein classification is an important problem in automated protein functional and structural annotation. The performance of alignment-based methods degrades when the sequence identity among proteins is low. The first part of this dissertation therefore develops alignment-free methods for protein classification. G protein-coupled receptors (GPCRs) and beta-barrel outer membrane proteins are chosen as test cases because of their low overall sequence identity and their importance in biological research.

We propose to use the probabilistic suffix tree (PST), a generative classification model that can be constructed from unaligned protein sequences, to model protein families and to classify proteins. We then investigate how different reduced amino acid alphabets affect protein classification when the support vector machine (SVM), a discriminative classification model, classifies proteins using simple sequential features. Compared with the full alphabet of 20 amino acids, reduced alphabets of as few as 2 letters can provide similar performance. In a comparative study using 2-fold cross-validation on 19 level I GPCR subfamilies, the PST achieves a classification accuracy of 92.62% and the SVM using sequential features on a four-letter reduced alphabet achieves 92.41%. Both compare favorably with a variety of other classification methods, the best of which achieves 91.15%.

We further use amino acid indices of physicochemical properties to convert protein sequences into feature vectors for the SVM, examining a wide range of physicochemical properties for their usefulness in protein classification. Hydrophobicity-related indices provide the best performance for both GPCR classification and beta-barrel outer membrane protein discrimination.

We have previously proposed the EMAGEN (Efficient Multiple Alignment algorithm for whole GENomes) system for multiple whole genome alignment. In the second part of this dissertation, we propose an improved method, EMAGEN-R, which incorporates genome rearrangement events into the alignment. In experiments, EMAGEN-R successfully detects rearrangement events and provides improved alignments.
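
To illustrate the reduced-alphabet idea from the first part of the abstract, the sketch below maps a protein sequence onto a four-letter alphabet and turns it into a fixed-length vector of k-mer frequencies, which is the kind of input an SVM could consume. The abstract does not specify the groupings or the exact sequential features used in the dissertation, so the `GROUPS` table, the function names, and the choice of k-mer frequencies are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# Hypothetical four-letter grouping of the 20 standard amino acids.
# This is NOT the grouping from the dissertation; it is an illustrative choice.
GROUPS = {
    "a": "AVLIMC",  # aliphatic / hydrophobic
    "r": "FWYH",    # aromatic
    "p": "STNQGP",  # polar or small
    "c": "DEKR",    # charged
}

# Lookup table: amino acid letter -> reduced-alphabet letter.
TO_REDUCED = {aa: letter for letter, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet.

    Assumes the sequence contains only the 20 standard residues;
    any other character raises KeyError.
    """
    return "".join(TO_REDUCED[aa] for aa in seq.upper())


def kmer_features(seq, k=2):
    """Return k-mer frequencies over the reduced alphabet as a list.

    The vector has len(GROUPS) ** k entries in a fixed, sorted k-mer order,
    so sequences of different lengths map to vectors of the same size.
    """
    reduced = reduce_sequence(seq)
    alphabet = sorted(GROUPS)
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    windows = max(len(reduced) - k + 1, 0)
    counts = Counter(reduced[i:i + k] for i in range(windows))
    total = max(windows, 1)  # avoid division by zero for very short input
    return [counts[kmer] / total for kmer in kmers]


if __name__ == "__main__":
    example = "MKTAYIAK"  # made-up peptide, not real data
    print(reduce_sequence(example))
    print(kmer_features(example, k=2))
```

With a four-letter alphabet and k = 2 the feature vector has 16 entries, compared with 400 for the full 20-letter alphabet, which is one reason a reduced alphabet can keep simple sequential features compact.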

Subject Area

Biology, Bioinformatics; Computer Science

Recommended Citation

Yang, Jingyi, "Computational approaches to protein classification and multiple whole genome alignment" (2008). ETD collection for University of Nebraska - Lincoln. AAI3319848.