Protein family classification using multivariate methods
Abstract
The number of protein sequences from agriculturally important crops is rapidly increasing in databases. Identifying their functions efficiently and accurately requires good computational methods. Commonly used methods search databases using alignments. Some proteins, however, lack sufficient sequence similarity even though they share similar structures and biochemical functions. In such cases, alignment-based methods fail to identify proteins correctly. Classifying these difficult proteins requires alignment-free approaches, such as those based on multivariate methods. I examined the application of two multivariate methods: principal component analysis (PCA) and partial least squares (PLS). Their performance was compared against profile hidden Markov models (HMMs) and PSI-BLAST. G-protein coupled receptors (GPCRs), cyclophilins, cytochrome b561 (Cyt b561), and immunoglobulin protein families were included in this study. Using physico-chemical properties as descriptors, I examined how the training dataset affects the performance of the methods, how well the methods identify short fragmented sequences, and how well they identify proteins when only remotely similar samples are included in the training sets. The PLS methods outperformed profile HMMs and PSI-BLAST when only a small number of positive samples (5 or 10) were included in the training dataset. PLS methods also performed better than profile HMMs and PSI-BLAST in identifying short fragmented sequences and Cyt b561 expressed sequence tags from the Arabidopsis genome. Combining the results of PLS with other alignment-free methods, 342 proteins were identified as GPCR candidates, including 20 of the 22 known Arabidopsis GPCRs. Profile HMMs identified only 15 of them. The PLS method with descriptors selected by the t-test outperformed the PLS method with descriptors from auto- and cross-covariance in identifying cyclophilins from the Arabidopsis and rice genomes.
Finally, I developed a simple statistics method (ST-method) that is sensitive to proteins with weak sequence similarity and generates few false positives. The ST-method outperformed PLS methods, profile HMMs, and PSI-BLAST in the classification of GPCRs and the immunoglobulin superfamily. It identified 579, 717, and 382 GPCR candidates from the Arabidopsis, rice, and maize genomes, respectively.
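As an illustration of what an alignment-free descriptor can look like, the sketch below computes auto-covariance terms of a single physico-chemical property along a sequence, which yields a fixed-length vector regardless of sequence length. It is a minimal sketch and not the dissertation's implementation: the Kyte-Doolittle hydropathy scale, the function name `auto_covariance`, and the `max_lag` default are assumptions chosen for illustration, since the abstract does not state which properties or lags were used.

```python
# Kyte-Doolittle hydropathy index, used here only as a stand-in property.
# ASSUMPTION: the abstract does not name the descriptors actually used.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def auto_covariance(sequence, max_lag=5, scale=HYDROPATHY):
    """Return auto-covariance terms for lags 1..max_lag (hypothetical helper).

    Residues missing from `scale` (e.g. 'X') are skipped.
    """
    values = [scale[aa] for aa in sequence.upper() if aa in scale]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")

    # Center the property profile on its mean for this sequence.
    mean = sum(values) / n
    centered = [v - mean for v in values]

    # One term per lag: average product of centered values `lag` residues apart.
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


if __name__ == "__main__":
    # Two made-up sequences of different lengths give vectors of equal length.
    short_seq = "MKTLLVAGGAIL"
    long_seq = "MKTLLVAGGAILWYFNQDERHKSTPCMV"
    print(len(auto_covariance(short_seq)), len(auto_covariance(long_seq)))
```

Because every sequence maps to a vector of the same length, vectors like these can serve as input rows for methods such as PCA or PLS without aligning the sequences first.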
Subject Area
Agronomy|Bioinformatics
Recommended Citation
Opiyo, Stephen O, "Protein family classification using multivariate methods" (2007). ETD collection for University of Nebraska-Lincoln. AAI3263482.
https://digitalcommons.unl.edu/dissertations/AAI3263482