Computing, School of

First Advisor

Peter Z. Revesz

Date of this Version

Fall 11-25-2009

Document Type

Dissertation

Citation

A dissertation presented to the faculty of the Graduate College at the University of Nebraska in partial fulfillment of requirements for the degree of Doctor of Philosophy

Major: Computer Science (Bioinformatics)

Under the supervision of Professor Peter Z. Revesz

Lincoln, Nebraska, November 2009

Abstract

The proliferation of biological databases and the easy access enabled by the Internet are having a beneficial impact on the biological sciences and transforming the way research is conducted. There are currently over 1,100 molecular biology databases dispersed throughout the Internet. However, very few of them integrate data from multiple sources. To assist in the functional and evolutionary analysis of the large number of novel proteins, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database, which integrates data from various biological sources. PROFESS is freely available at http://cse.unl.edu/~profess/. Our database is designed to be versatile and expandable, and it does not confine analysis to a pre-existing set of data relationships. Using PROFESS, we were able to quantify homologous protein evolution and determine whether bacterial protein structures are subject to random drift after divergence from a common ancestor.

After relevant data have been mined, they may be classified or clustered for further analysis. Data classification is usually achieved using machine-learning techniques. However, in many problems the raw data are already classified according to a set of features but need to be reclassified. Data reclassification is usually achieved using data integration methods that require the raw data, which may not be available or sharable because of privacy and legal concerns. We introduce general classification integration and reclassification methods that create new classes by flexibly combining the existing classes without requiring access to the raw data. The flexibility is achieved by representing any linear classification in a constraint database. We also considered temporal data classification, where the input is a temporal database that describes measurements over a past period of time while the predicted class is expected to occur in the future. We tested the proposed classification methods on five datasets covering the automobile, meteorological and medical areas and showed significant improvements over existing methods.

Advisor: Peter Z. Revesz

Included in

Bioinformatics Commons, Computer Engineering Commons, Computer Sciences Commons

School of Computing: Dissertations, Theses, and Student Research

Classification, Clustering and Data-mining of Biological Data
