Computer Science and Engineering, Department of



Neethu Shah, Clustering and Classification of Multi-domain Proteins, MS Thesis, University of Nebraska, Lincoln, 2013.


A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Computer Science, Under the Supervision of Professors Stephen D. Scott and Etsuko N. Moriyama. Lincoln, Nebraska: December, 2013

Copyright (c) 2013 Neethu Shah


Rapid development of next-generation sequencing technology has led to unprecedented growth in protein sequence data repositories over the last decade. The majority of these proteins lack structural and functional characterization. This necessitates the design and development of fast, efficient, and sensitive computational tools and algorithms that can classify these proteins into functionally coherent groups.

Domains are fundamental units of protein structure and function. Multi-domain proteins are considerably more complex than proteins that have a single domain or none. Their evolution involves network-like events such as domain shuffling, domain loss, and domain gain. These events therefore cannot be represented by conventional protein clustering methods such as phylogenetic reconstruction and Markov clustering. In this thesis, a multi-domain protein classification system is developed, based primarily on the domain composition of protein sequences. Using the principle of co-clustering (biclustering), proteins and domains are clustered simultaneously, so that each bicluster contains a subset of proteins and a subset of domains forming a complete bipartite graph. These biclusters are then connected into a network based on the domains they share, thereby classifying the proteins into families of similar proteins.
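To make the idea of a bicluster network concrete, the following is a minimal sketch, not the algorithm developed in the thesis. It assumes each protein is described only by a set of domain names, forms one bicluster per distinct domain composition (all proteins that contain every domain in that composition, which guarantees a complete bipartite graph), and links two biclusters whenever they share a domain. The function names, protein labels, and domain labels are hypothetical.

```python
from itertools import combinations


def build_biclusters(protein_domains):
    """Return a list of (proteins, domains) pairs.

    Simplifying assumption: one bicluster per distinct domain composition.
    Every protein in a bicluster contains every domain in it, so the pair
    forms a complete bipartite graph.
    """
    biclusters = {}
    for domains in protein_domains.values():
        key = frozenset(domains)
        if not key or key in biclusters:
            continue  # skip domain-free proteins and compositions already seen
        members = frozenset(
            p for p, d in protein_domains.items() if key <= set(d)
        )
        biclusters[key] = members
    return [(members, key) for key, members in biclusters.items()]


def link_biclusters(biclusters):
    """Connect every pair of biclusters that shares at least one domain."""
    edges = []
    for (i, (_, d1)), (j, (_, d2)) in combinations(enumerate(biclusters), 2):
        shared = d1 & d2
        if shared:
            edges.append((i, j, shared))
    return edges


# Toy input with hypothetical proteins (p1..p4) and domains (A, B, C).
toy = {"p1": {"A", "B"}, "p2": {"A", "B"}, "p3": {"A"}, "p4": {"C"}}
toy_biclusters = build_biclusters(toy)
toy_edges = link_biclusters(toy_biclusters)
```

In this toy input, the proteins with domains A and B form one bicluster, all proteins containing A form another, and the two are linked through the shared domain A, while the protein with only domain C stays unconnected.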

We applied our biclustering network approach to a multi-domain protein family, the Regulator of G-protein Signalling (RGS) proteins, in which domain composition is heterogeneous among subfamilies. The resulting clusters were mostly consistent with the existing RGS subfamilies. The average maximum Jaccard Index scores for the clusters obtained by Markov clustering and phylogenetic clustering against the biclusters were 0.64 and 0.60, respectively. Unlike these other clustering methods, our approach uses auxiliary domain information for each protein and therefore generates more functionally coherent protein clusters and differentiates the protein subfamilies from one another. Biclustered networks built on nine complete proteomes showed that the proportion of multi-domain proteins included in connected biclusters increased rapidly with genome complexity, from 48.5% in bacteria to 80% in eukaryotes.
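As an illustration of the comparison score mentioned above, the sketch below computes an average maximum Jaccard Index between two clusterings. It assumes one plausible reading of the measure: each cluster from the comparison method is matched to the bicluster with which it has the highest Jaccard Index, and these best scores are averaged. The exact procedure used in the thesis may differ, and the function names and toy clusters are hypothetical.

```python
def jaccard(a, b):
    """Jaccard Index |A ∩ B| / |A ∪ B| of two collections of protein IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_max_jaccard(clusters, biclusters):
    """Average, over clusters, of the best Jaccard Index against any bicluster."""
    if not clusters or not biclusters:
        raise ValueError("both clusterings must be non-empty")
    best = [max(jaccard(c, b) for b in biclusters) for c in clusters]
    return sum(best) / len(best)


# Toy clusterings with hypothetical protein IDs.
method_clusters = [{"p1", "p2", "p3"}, {"p4", "p5"}]
protein_biclusters = [{"p1", "p2"}, {"p4", "p5"}, {"p3"}]
score = average_max_jaccard(method_clusters, protein_biclusters)
```

A score of 1.0 would mean every cluster has an identical bicluster; lower values indicate that the two methods partition the proteins differently.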

Protein clustering and classification that incorporates this wealth of additional domain information into protein networks has wide applications and would impact the functional analysis and characterization of novel proteins.

Advisers: Stephen D. Scott and Etsuko N. Moriyama