Statistics, Department of

Department of Statistics: Faculty Publications

EnsCat: clustering of categorical data via ensembling

Bertrand S. Clarke, University of Nebraska-Lincoln
Saeid Amiri, University of Wisconsin-Madison
Jennifer L. Clarke, University of Nebraska-Lincoln

ORCID IDs

Jennifer L. Clarke

Document Type

Article

Date of this Version

2016

Citation

Clarke et al. BMC Bioinformatics (2016) 17:380
DOI 10.1186/s12859-016-1245-9

Abstract

Background: Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides. This makes inference challenging: the distance metric is often not well defined on categorical data; running time for computations using high dimensional data can be considerable; and the Curse of Dimensionality often impedes the interpretation of the results. To date, the literature and software addressing clustering for categorical data have not led to a standard approach.

Results: We present software for an ensemble method that performs well in comparison with other methods regardless of the dimensionality of the data. In an ensemble method, a variety of instantiations of a statistical object are found and then combined into a consensus value. It has been known for decades that, in many settings, an ensemble generally outperforms its individual components. Here, we apply this ensembling principle to clustering. We begin by generating many hierarchical clusterings with different clustering sizes. When the dimension of the data is high, we also randomly select subspaces of variable size in which to generate clusterings. Then, we combine these clusterings into a single membership matrix and use this to obtain a new, ensembled dissimilarity matrix using Hamming distance.
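
The steps described in the Results can be illustrated with a minimal Python sketch. It is not the EnsCat R package or its API. All function names (hamming, dissimilarity, hclust_labels, ensemble_dissimilarity) and parameters are hypothetical. The sketch also assumes several things the abstract does not state: that "clustering sizes" means the number of clusters, that average linkage is used, that subspaces contain between half and all of the variables, that the cluster counts are cycled through a fixed list, and that a final hierarchical clustering is run on the ensembled dissimilarity matrix.

```python
import random

def hamming(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def dissimilarity(rows):
    """Pairwise Hamming dissimilarity matrix for a list of categorical rows."""
    n = len(rows)
    return [[hamming(rows[i], rows[j]) for j in range(n)] for i in range(n)]

def hclust_labels(dist, k):
    """Average-linkage agglomerative clustering, cut at k clusters (assumed linkage)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two candidate clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def ensemble_dissimilarity(rows, n_clusterings=30, k_values=(2, 3, 4), seed=0):
    """Build a membership matrix from many clusterings, then take Hamming distances on it."""
    rng = random.Random(seed)
    p = len(rows[0])
    columns = []
    for b in range(n_clusterings):
        # Random subspace of variable size (assumed range: half to all variables).
        size = rng.randint(max(1, p // 2), p)
        subspace = rng.sample(range(p), size)
        projected = [[row[j] for j in subspace] for row in rows]
        # Vary the number of clusters across ensemble members.
        k = k_values[b % len(k_values)]
        columns.append(hclust_labels(dissimilarity(projected), k))
    # Membership matrix: one row per observation, one column per clustering.
    membership = [list(m) for m in zip(*columns)]
    return dissimilarity(membership)

# Toy data: two groups of short nucleotide-like sequences (illustrative only).
data = [list(s) for s in ("AAAAAAAA", "AAAAAAAC", "AAAAACAA",
                          "GGGGGGGG", "GGGGGGGT", "GTGGGGGG")]
ens = ensemble_dissimilarity(data)
final = hclust_labels(ens, 2)
print(final)
```

In this toy example every pair of sequences from different groups receives different labels in every clustering, so their ensembled dissimilarity is 1, while pairs within a group share a label whenever two clusters are requested. The final cut therefore recovers the two groups.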

Conclusions: Ensemble clustering, as implemented in R and called EnsCat, gives more clearly separated clusters than other clustering techniques for categorical data. The latest version, with manual and examples, is available at https://github.com/jlp2duke/EnsCat.

Included in

Genetics and Genomics Commons, Other Statistics and Probability Commons
