We present several algorithms for identifying new proteins in superfamilies with low primary sequence conservation. In superfamilies such as Thioredoxin-fold (Trx-fold), this low conservation makes conventional methods such as hidden Markov models (HMMs) built on primary sequence difficult to use. We therefore build our classifiers on structural properties, including secondary structure patterns and various properties of the residues in the protein sequences. We use this information to model proteins with hidden Markov models, support vector machines, and algorithms in the multiple-instance learning model. In 20-fold jack-knife tests, some of our models performed well: we can identify 75% of the Trx-fold proteins (compared to only 5% for HMMs on primary sequence) while maintaining a 75% true negative rate. Since our techniques are general, they should be applicable to other superfamilies with low primary sequence conservation.
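For readers unfamiliar with the evaluation protocol, the following minimal sketch shows one way the true positive and true negative rates of a 20-fold hold-out test could be computed. It assumes that "jack-knife" here means repeatedly holding out one of k folds, training on the rest, and pooling the predictions. The function names (`kfold_indices`, `jackknife_rates`, `train_threshold`) are hypothetical, and the one-feature threshold classifier and integer data are toy stand-ins, not the HMM, SVM, or multiple-instance models, nor the protein data, described above.

```python
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[float], bool]
Trainer = Callable[[Sequence[float], Sequence[bool]], Predictor]


def kfold_indices(n: int, k: int) -> List[List[int]]:
    """Split indices 0..n-1 into k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def jackknife_rates(
    features: Sequence[float],
    labels: Sequence[bool],
    train: Trainer,
    k: int = 20,
) -> Tuple[float, float]:
    """Hold out each fold once and return the pooled (TPR, TNR)."""
    tp = tn = fp = fn = 0
    for held_out in kfold_indices(len(features), k):
        held = set(held_out)
        # Train only on examples outside the held-out fold.
        train_x = [x for i, x in enumerate(features) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        predict = train(train_x, train_y)
        # Score the held-out examples and tally the confusion counts.
        for i in held_out:
            guess = predict(features[i])
            if labels[i] and guess:
                tp += 1
            elif labels[i]:
                fn += 1
            elif guess:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # fraction of positives found
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # fraction of negatives rejected
    return tpr, tnr


def train_threshold(xs: Sequence[float], ys: Sequence[bool]) -> Predictor:
    """Toy stand-in classifier: cut halfway between the two class means.

    Assumes the training data contains both classes.
    """
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x > cut


# Toy, well-separated data: 20 negatives followed by 20 positives.
toy_x = list(range(20)) + list(range(100, 120))
toy_y = [False] * 20 + [True] * 20
tpr, tnr = jackknife_rates(toy_x, toy_y, train_threshold, k=20)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")
```

In this reading, a 75% true positive rate alongside a 75% true negative rate means that three quarters of the held-out superfamily members are recognized while three quarters of the held-out non-members are correctly rejected.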