Computing, School of

School of Computing: Conference and Workshop Papers

Clustering header categories extracted from web tables

Date of this Version

2-2015

Document Type

Article

Citation

D.W. Embley, S. Seth, M. Krishnamoorthy, G. Nagy, Clustering header categories extracted from web tables, Procs. SPIE/IST Document Recognition and Retrieval, Feb. 2015.

Abstract

Revealing related content among heterogeneous web tables is part of our long-term objective of formulating queries over multiple sources of information. Two hundred HTML tables from institutional web sites are segmented, and each table cell is classified according to the fundamental indexing property of row and column headers. The categories that correspond to the multi-dimensional data cube view of a table are extracted by factoring the (often multi-row or multi-column) headers. To reveal commonalities between tables from diverse sources, the Jaccard distances between pairs of category headers (and also between table titles) are computed. We show how about one-third of our heterogeneous collection can be clustered into a dozen groups that exhibit table-title and header similarities that can be exploited for queries.
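As an illustration of the distance measure named in the abstract, the following is a minimal sketch of a Jaccard distance between two header strings. It assumes headers are compared as sets of lowercase word tokens; the tokenization rule, the function names `tokens` and `jaccard_distance`, and the sample headers are hypothetical and are not taken from the paper.

```python
import re


def tokens(label):
    """Lowercase a header or title string and return its set of word tokens.

    Splitting on runs of letters and digits is an assumption made for this
    sketch, not the paper's documented preprocessing.
    """
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets.

    Two empty sets are treated as identical (distance 0.0) to avoid
    dividing by zero.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical category headers from two tables (illustrative only).
headers_a = tokens("Year Region Population")
headers_b = tokens("Year Region Income")

# The sets share 2 of 4 distinct tokens, so the distance is 1 - 2/4.
d = jaccard_distance(headers_a, headers_b)
print(d)
```

A distance of 0 means the two token sets are identical and 1 means they share nothing, so smaller values suggest tables that may belong in the same group.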

Included in

Computer Engineering Commons, Electrical and Computer Engineering Commons, Other Computer Sciences Commons
