Transmembrane proteins and multi-domain proteins together make up more than 80% of all proteins in any eukaryotic proteome. Therefore, accurately classifying such proteins into functional classes is an important task. Furthermore, understanding the molecular evolution of multi-domain proteins is important because it shows how domains fuse to form more complex proteins and acquire new functions, possibly affecting evolution at the organismal level. In this thesis, I first investigated the performance of several protein classifiers using one of the most divergent transmembrane protein families, the G-protein-coupled receptor (GPCR) superfamily, as an example. Alignment-free classifiers based on support vector machines using simple amino acid compositions were effective in remote-similarity detection, even from short fragmented sequences. While a support vector machine using local pairwise-alignment scores showed very well-balanced performance, profile hidden Markov models were generally highly specific and well suited for classifying well-established protein family members. We suggested that different types of protein classifiers should be applied to achieve optimal mining power. Combinations of multiple protein classification methods, including some of these, were applied to identify highly divergent plant GPCRs (or seven-transmembrane receptors) in the Arabidopsis thaliana genome. We identified 394 candidate proteins and provided a prioritized list of 54 proteins for further investigation.
For multi-domain protein families, the distribution of urea amidolyase, urea carboxylase, and sterol-sensing domain (SSD) proteins across kingdoms was investigated. Molecular evolutionary analysis showed that the urea amidolyase genes, currently found only in fungi among eukaryotes, are the result of a horizontal gene transfer event from proteobacteria. Urea carboxylase genes, currently found in fungi and a limited number of other organisms, were also likely derived from another ancestral gene in bacteria. Finally, we showed that the eukaryotic SSD-containing proteins may have a bacterial origin and that these ancestral sequences evolved into four different SSD-containing proteins, each acquiring specific functions. Two groups of SSD-containing proteins appear to have been formed by domain acquisition before the divergence of the fungal and metazoan lineages.
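As a rough illustration of the alignment-free features mentioned above, the following minimal sketch computes a 20-dimensional amino acid composition vector from a protein sequence. It is not the thesis's implementation: the function name `amino_acid_composition` and the choice to ignore non-standard residue codes are assumptions made for illustration. A vector like this could be passed as input to any support vector machine implementation.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid.

    Hypothetical helper: characters outside the 20 standard one-letter
    codes (e.g. X, gaps) are ignored, which is an assumption of this sketch.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No recognizable residues: return an all-zero feature vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Because the vector has the same length regardless of sequence length and needs no alignment, it can be computed for short fragments as well as for full-length sequences.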
Advisor: Etsuko Moriyama