Computer Science and Engineering, Department of

Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

Date of this Version

Summer 7-29-2016

Document Type

Article

Citation

Rodene, Eric (2016) "Use of Clustering Techniques for Protein Domain Analysis", Computer Science and Engineering: Theses, Dissertations, and Student Research, Department of Computer Science and Engineering, University of Nebraska-Lincoln.

Comments

A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Computer Science, Under the Supervision of Professors Stephen D. Scott and Etsuko N. Moriyama. Lincoln, Nebraska: July 2016

Abstract

Next-generation sequencing has allowed many new protein sequences to be identified. However, this rapid expansion of sequence data has outpaced the ability to determine the structure and function of most of these newly identified proteins. Traditional alignment-based phylogeny can be used to infer the functions of proteins and the relationships between them, but it requires that the sequences share at least one subsequence. Without such a subsequence, no meaningful alignment between the protein sequences is possible. The entire protein set (or proteome) of an organism contains many unrelated proteins, so the necessary similarity does not occur at this level. Therefore, an alternative method of understanding relationships within diverse sets of proteins is needed.

Related proteins generally share key subsequences. These conserved subsequences are called domains. Proteins that share several common domains can be inferred to have similar function. We refer to the set of all domains that a protein has as the protein’s domain architecture.

We present a technique that clusters proteins sharing identical domain architecture. Each match of a domain to a protein sequence is assigned a confidence estimate (e.g., the E-value), and this confidence varies widely from match to match. By setting a threshold for what is considered an acceptable match, domains with weak similarities can be ignored. By changing this E-value threshold, the clustering patterns and relationships between proteins can be analyzed: clusters may merge or split as their domain architectures shift with the threshold. By studying the relationships between clusters from one iteration to the next as the threshold is made more stringent, phylogeny-like networks can be constructed. The technique thus clusters together proteins with identical domain architecture and also illustrates relationships among clusters with similar architectures.
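The idea described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the hit table, the protein and domain labels, the threshold values, and the function names (`architectures`, `cluster`, `links`) are all assumptions made up for this example. The sketch treats a domain architecture as the set of domains whose E-value is at or below the threshold (a lower E-value indicates a stronger match), groups proteins with identical architectures, and then links clusters found at a loose threshold to clusters found at a stricter one whenever they share a protein.

```python
from collections import defaultdict

# Hypothetical domain hits: (protein, domain, E-value). Invented for illustration only.
HITS = [
    ("P1", "RGS", 1e-40), ("P1", "DEP", 1e-12),
    ("P2", "RGS", 1e-35), ("P2", "DEP", 1e-3),
    ("P3", "RGS", 1e-30),
    ("P4", "RGS", 1e-25), ("P4", "PDZ", 1e-8),
]


def architectures(hits, threshold):
    """Map each protein to the set of domains matched with E-value <= threshold."""
    arch = defaultdict(set)
    for protein, domain, evalue in hits:
        arch[protein]  # make sure every protein appears, even with no accepted domain
        if evalue <= threshold:
            arch[protein].add(domain)
    return {protein: frozenset(domains) for protein, domains in arch.items()}


def cluster(hits, threshold):
    """Group proteins whose accepted domain architectures are identical."""
    clusters = defaultdict(set)
    for protein, arch in architectures(hits, threshold).items():
        clusters[arch].add(protein)
    return dict(clusters)


def links(hits, loose, strict):
    """Connect clusters at two thresholds that have at least one protein in common."""
    before = cluster(hits, loose)
    after = cluster(hits, strict)
    return {
        (arch_a, arch_b)
        for arch_a, members_a in before.items()
        for arch_b, members_b in after.items()
        if members_a & members_b
    }


if __name__ == "__main__":
    for threshold in (1e-2, 1e-10):
        print(threshold, cluster(HITS, threshold))
    print(links(HITS, loose=1e-2, strict=1e-10))
```

In this toy data, tightening the threshold from 1e-2 to 1e-10 drops the weak DEP hit on P2 and the PDZ hit on P4, so both proteins join the RGS-only cluster. The edges returned by `links` record which earlier clusters feed each later one, which is the kind of relationship a phylogeny-like network would be built from.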

This technique was tested on the multi-domain Regulator of G-protein Signaling family. The output is consistent with the known functional subdivisions of this protein family. This technique is also considerably faster than typical alignment-based phylogenetic reconstruction on this family. Use of the technique at the proteome level was also tested using bacterial proteome data from Bacillus subtilis.

Advisors: Stephen Scott, Etsuko Moriyama

Included in

Biochemistry, Biophysics, and Structural Biology Commons, Bioinformatics Commons, Computer Engineering Commons, Computer Sciences Commons

DigitalCommons@University of Nebraska - Lincoln


USE OF CLUSTERING TECHNIQUES FOR PROTEIN DOMAIN ANALYSIS
