Computer Science and Engineering, Department of


Date of this Version



University of Nebraska-Lincoln, Computer Science and Engineering
Technical Report TR-UNL-CSE-2004-0003 01/10/2004.


We present several algorithms for identifying new proteins in superfamilies with low primary sequence conservation. In superfamilies such as Thioredoxin-fold (Trx-fold), this low conservation makes conventional methods, such as hidden Markov models (HMMs) built on primary sequence, difficult to use. We therefore use structural properties to build our classifiers, including secondary structure patterns and various properties of the residues in the protein sequences. We use this information to model proteins via hidden Markov models, support vector machines, and algorithms in the multiple-instance learning model. In 20-fold jackknife tests, some of our models achieved relatively high true positive and true negative rates: we can identify 75% of the Trx-fold proteins (compared to only 5% for HMMs on primary sequence) while maintaining a 75% true negative rate. Since our techniques are general, they should be applicable to other superfamilies with low primary sequence conservation.
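
The evaluation protocol named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the folds were built, so this assumes "20-fold jackknife" means splitting the data into 20 disjoint folds and holding each one out once, with true positive and true negative counts pooled over all folds. The names `jackknife_rates` and `train_threshold` are hypothetical, and the threshold classifier is only a stand-in for the HMM, SVM, and multiple-instance learners, which are not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[float], bool]


def jackknife_rates(
    examples: Sequence[float],
    labels: Sequence[bool],
    train: Callable[[List[float], List[bool]], Classifier],
    k: int = 20,
) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate) pooled over k folds."""
    n = len(examples)
    tp = fn = tn = fp = 0
    for fold in range(k):
        # Assumption: fold membership is every k-th example, offset by `fold`.
        held_out = set(range(fold, n, k))
        train_x = [examples[i] for i in range(n) if i not in held_out]
        train_y = [labels[i] for i in range(n) if i not in held_out]
        classify = train(train_x, train_y)
        # Score only the examples the classifier has not seen.
        for i in held_out:
            predicted = classify(examples[i])
            if labels[i]:
                tp += predicted
                fn += not predicted
            else:
                fp += predicted
                tn += not predicted
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr


def train_threshold(xs: List[float], ys: List[bool]) -> Classifier:
    """Toy stand-in learner: cut halfway between the two class means.

    Assumes both classes are present in the training data.
    """
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x > cut
```

Calling `jackknife_rates(scores, is_member, train_threshold)` on a list of per-protein scores and membership labels returns the two rates as fractions between 0 and 1. Pooling counts across folds, rather than averaging per-fold rates, is a design choice of this sketch and keeps the result well defined when a fold contains no positive examples.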