Computer Science and Engineering, Department of

 

Date of this Version

Fall 11-25-2009

Document Type

Article

Comments

A Dissertation Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Doctor of Philosophy, Major: Computer Science (Bioinformatics). Under the Supervision of Professor Peter Revesz
Lincoln, Nebraska: November, 2009
Copyright (c) 2009 Thomas Triplet.

Abstract

The proliferation of biological databases and the easy access enabled by the Internet are having a beneficial impact on the biological sciences and transforming the way research is conducted. There are currently over 1100 molecular biology databases dispersed throughout the Internet. However, very few of them integrate data from multiple sources. To assist in the functional and evolutionary analysis of the large number of novel proteins, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database, which integrates data from various biological sources. PROFESS is freely available at http://cse.unl.edu/~profess/. Our database is designed to be versatile and expandable and does not confine analysis to a pre-existing set of data relationships. Using PROFESS, we were able to quantify homologous protein evolution and determine whether bacterial protein structures are subject to random drift after divergence from a common ancestor. After relevant data have been mined, they may be classified or clustered for further analysis. Data classification is usually achieved using machine-learning techniques. However, in many problems the raw data are already classified according to a set of features but need to be reclassified. Data reclassification is usually achieved using data integration methods that require the raw data, which may not be available or sharable because of privacy and legal concerns. We introduce general classification integration and reclassification methods that create new classes by flexibly combining the existing classes without requiring access to the raw data. The flexibility is achieved by representing any linear classification in a constraint database. We also consider temporal data classification, where the input is a temporal database that describes measurements over a past period of time, while the predicted class is expected to occur in the future.
We evaluated the proposed classification methods on five datasets covering the automobile, meteorological and medical areas and showed significant improvements over existing methods.
