Computer Science and Engineering, Department of


Date of this Version



University of Nebraska-Lincoln, Computer Science and Engineering
Technical Report # TR-UNL-CSE-2003-0004, 01/05/2003


We present several algorithms for identifying thioredoxin (Trx)-fold proteins containing a conserved CxxC motif (two cysteines separated by two residues). Because primary sequence is poorly conserved across this protein superfamily, conventional sequence-based methods are difficult to apply. We therefore build our classifiers from structural properties, including secondary structure patterns and various properties of the residues in the protein sequences. We use this information to model Trx-fold proteins via hidden Markov models, decision trees, and algorithms in the multiple-instance learning model. In 9-fold and 12-fold jack-knife tests, some of our models performed quite well, with high true positive and true negative rates. In addition, by combining a small number of our classifiers, we can identify 100% of the Trx-fold proteins in these jack-knife tests with moderate false positive rates. We also identified several candidate Trx-fold proteins in the C. jejuni, M. jannaschii, E. coli, and S. cerevisiae genomes. Because our techniques are very general, they should be applicable to other superfamilies with low primary sequence conservation.
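As a small illustration of the CxxC motif described above (two cysteines separated by any two residues), the following Python sketch locates every occurrence in a one-letter amino acid sequence. It is not one of the report's classifiers: a motif scan like this could at most pre-select candidate sequences. The function name `find_cxxc` and the example sequence are hypothetical, chosen only for this illustration.

```python
import re

# "C", any two residues, "C". The lookahead lets overlapping motifs
# (for example in "CAACAAC") all be reported.
CXXC_PATTERN = re.compile(r"(?=(C[A-Z]{2}C))")


def find_cxxc(sequence):
    """Return (0-based start, motif) pairs for every CxxC occurrence.

    `sequence` is assumed to be a protein sequence in one-letter code;
    lowercase input is accepted and upper-cased before matching.
    """
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in CXXC_PATTERN.finditer(seq)]


if __name__ == "__main__":
    # Toy sequence, invented for this example.
    for start, motif in find_cxxc("MKWCGPCKAA"):
        print(start, motif)
```

Because the motif alone is short, many unrelated proteins will contain it by chance, which is consistent with the abstract's point that structural properties are needed to separate Trx-fold proteins from other sequences.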