Computer Science and Engineering, Department of

Computer Science, Computer Engineering, and Bioinformatics: Faculty Publications

Converting Heterogeneous Statistical Tables on the Web to Searchable Databases

David W. Embley, Brigham Young University
Mukkai Krishnamoorthy, Rensselaer Polytechnic Institute
George Nagy, Rensselaer Polytechnic Institute
Sharad C. Seth, University of Nebraska-Lincoln

Document Type

Article

Date of this Version

2016

Citation

Published in International Journal on Document Analysis and Recognition (IJDAR) (2016) 19:119–138

DOI 10.1007/s10032-016-0259-1

Abstract

Much of the world’s quantitative data resides in scattered web tables. For a meaningful role in Big Data analytics, the facts reported in these tables must be brought into a uniform framework. Based on a formalization of header-indexed tables, we proffer an algorithmic solution to end-to-end table processing for a large class of human-readable tables. The proposed algorithms transform header-indexed tables to a category table format that maps easily to a variety of industry-standard data stores for query processing. The algorithms segment table regions based on the unique indexing of the data region by header paths, classify table cells, and factor header category structures of two-dimensional as well as the less common multi-dimensional tables. Experimental evaluations substantiate the algorithmic approach to processing heterogeneous tables. As demonstrable results, the algorithms generate queryable relational database tables and semantic-web triple stores. Application of our algorithms to 400 web tables randomly selected from diverse sources shows that the algorithmic solution automates end-to-end table processing.
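
The transformation to a category table format described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that segmentation and header factoring have already produced one header path per data row and per data column, and the function name `to_category_table` and the sample values are hypothetical. It only shows the idea that each data cell, uniquely indexed by a row path and a column path, becomes one record.

```python
from itertools import product


def to_category_table(row_paths, col_paths, data):
    """Flatten a header-indexed table into one record per data cell.

    row_paths[i] and col_paths[j] are tuples of header labels (root to leaf);
    data[i][j] is the cell that the two paths jointly index.
    This is an illustrative sketch, not the published algorithm.
    """
    # Unique indexing: no two rows (or columns) may share a header path.
    if len(set(row_paths)) != len(row_paths):
        raise ValueError("row header paths must be unique")
    if len(set(col_paths)) != len(col_paths):
        raise ValueError("column header paths must be unique")

    records = []
    for (i, rp), (j, cp) in product(enumerate(row_paths), enumerate(col_paths)):
        # One record: row-path labels, column-path labels, then the value.
        records.append((*rp, *cp, data[i][j]))
    return records


# Hypothetical 2 x 2 table with a two-level column header.
row_paths = [("2010",), ("2011",)]
col_paths = [("Population", "Male"), ("Population", "Female")]
data = [[10, 12], [11, 13]]

records = to_category_table(row_paths, col_paths, data)
for record in records:
    print(record)
```

Each resulting tuple has the shape of a row in a relational table, which is why a format of this kind maps readily onto a database or a triple store.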

DigitalCommons@University of Nebraska - Lincoln
