Electrical and Computer Engineering, Department of

Department of Electrical and Computer Engineering: Faculty Publications

A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

David J. Russell, University of Nebraska-Lincoln
Samuel F. Way, University of Nebraska-Lincoln
Andrew K. Benson, University of Nebraska-Lincoln
Khalid Sayood, University of Nebraska-Lincoln

Document Type

Article

Date of this Version

2010

Citation

BMC Bioinformatics 2010, 11:601

Abstract

Background: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created.
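The membership rule described above (compare each new sequence with the cluster representatives, and open a new cluster when none is suitable) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `greedy_cluster` and `kmer_distance`, the `threshold` parameter, and the choice of the first member as cluster representative are all hypothetical. In particular, `kmer_distance` is a simple k-mer Jaccard distance used only as a stand-in; it is not the grammar-based metric of the paper, whose details are not given in this abstract.

```python
from typing import Callable, List, Sequence


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Stand-in distance (NOT the paper's grammar-based metric):
    Jaccard distance between the k-mer sets of two sequences."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def greedy_cluster(
    seqs: Sequence[str],
    distance: Callable[[str, str], float],
    threshold: float,
) -> List[List[int]]:
    """Assign each sequence to the closest cluster representative,
    or start a new cluster if no representative is within `threshold`.

    Assumption: the first sequence placed in a cluster acts as its
    representative. Returns clusters as lists of input indices.
    """
    representatives: List[str] = []
    clusters: List[List[int]] = []
    for idx, seq in enumerate(seqs):
        best_cluster = None
        best_dist = None
        # Compare the new sequence with every existing representative.
        for c, rep in enumerate(representatives):
            d = distance(seq, rep)
            if best_dist is None or d < best_dist:
                best_cluster, best_dist = c, d
        if best_cluster is not None and best_dist <= threshold:
            clusters[best_cluster].append(idx)
        else:
            # No suitable cluster found: the sequence founds a new one.
            representatives.append(seq)
            clusters.append([idx])
    return clusters


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
    print(greedy_cluster(toy, kmer_distance, threshold=0.5))
    # prints [[0, 1], [2, 3]]
```

Passing the distance as a callable keeps the clustering loop independent of the metric, so a grammar-based distance could be substituted without changing the membership logic.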

Results: The performance of the proposed algorithm is validated by comparison with the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and with the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm has a CPU execution time comparable to that of CD-HIT-EST, which is much slower than UCLUST, and it generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets.

Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.

Included in

Electrical and Computer Engineering Commons
