Electrical & Computer Engineering, Department of


Date of this Version

Fall 12-2-2011


A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Computer Science, Under the Supervision of Professor Stephen Scott. Lincoln, Nebraska: December, 2011

Copyright (c) 2011 Yuji Mo


When using the Gene Ontology (GO), nucleotide and amino acid sequences are annotated by terms in a structured and controlled vocabulary organized into relational graphs. The usage of the vocabulary (GO terms) in the annotation of these sequences may diverge from the relations defined in the ontology. We measure the consistency of the use of GO terms by comparing GO's defined structure to the terms' application. To do this, we first use synthetic data with different characteristics to understand how these characteristics influence the correlation values determined by various similarity measures. Using these results as a baseline, we found that the correlation between GO's definition and its application to real data is relatively low, suggesting that GO annotations might not be applied in a manner consistent with its definition. In contrast, we found a sub-ontology of GO that correlates well with its usage in UniProtKB. We also study how terms from different ontologies in GO relate to each other. Such relationships can be helpful in refining term definitions. To identify such "cross-terms", we propose a generalized semantic measure that can be used to identify related terms across GO ontologies. Results based on the Saccharomyces Genome Database show that the measure is correlated with the degree of co-occurrence for term pairs. By thresholding the level of similarity, we found a list of highly correlated cross-ontology term pairs. These term pairs show a high level of biological correlation.

Adviser: Stephen Scott