Electrical and Computer Engineering, Department of
Department of Electrical and Computer Engineering: Faculty Publications
Accessibility Remediation
If you are unable to use this item in its current form due to accessibility barriers, you may request remediation through our remediation request form.
Document Type
Article
Date of this Version
2010
Citation
BMC Bioinformatics 2010, 11:601
Abstract
Background: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created.
Results: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a comparable CPU execution time with that of CD-HIT-EST which is much slower than UCLUST, and has successfully generated clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets.
Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
Comments
Copyright 2010 Russell et al; licensee BioMed Central Ltd