Computing, School of

School of Computing: Conference and Workshop Papers

Data Extraction from Web Tables: the Devil is in the Details

George Nagy, Rensselaer Polytechnic Institute
Sharad C. Seth, University of Nebraska-Lincoln
Dongpu Jin, University of Nebraska-Lincoln
David W. Embley, Brigham Young University
Spencer Machado, Brigham Young University
Mukkai Krishnamoorthy

Document Type

Article

Citation

Proceedings of the International Conference on Document Analysis and Recognition (ICDAR'11), Beijing, September 2011

Abstract

We present a method based on header paths for efficient and complete extraction of labeled data from tables meant for humans. Although many table configurations yield to the proposed syntactic analysis, some require access to semantic knowledge. Clicking on one or two critical cells per table, through a simple interface, is sufficient to resolve most of these problem tables. Header paths, a purely syntactic representation of visual tables, can be transformed (“factored”) into existing representations of structured data such as category trees, relational tables, and RDF triples. From a random sample of 200 web tables from ten large statistical web sites, we generated 376 relational tables and 34,110 subject-predicate-object RDF triples.
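As a rough illustration of the header-path idea described in the abstract, the sketch below labels each data cell of a small table with its row header and its column header path, then emits subject-predicate-object triples. It is a minimal sketch under stated assumptions, not the authors' algorithm: the function names (`column_paths`, `cell_triples`), the predicate names (`rowHeader`, `colHeaderPath`, `value`), the `cell_i_j` identifiers, the convention that an empty string marks the continuation of a spanning header, and the toy data are all invented for this example.

```python
from typing import List, Tuple

Path = Tuple[str, ...]
Triple = Tuple[str, str, str]


def column_paths(header_rows: List[List[str]]) -> List[Path]:
    """Return one header path per data column.

    Assumption: an empty string means the cell continues a spanning
    header that started further left in the same header row.
    """
    filled = []
    for row in header_rows:
        last = ""
        filled_row = []
        for cell in row:
            if cell:
                last = cell  # a new header starts here
            filled_row.append(last)  # otherwise inherit the spanning header
        filled.append(filled_row)
    # Reading the filled rows top to bottom gives each column's path.
    return [tuple(column) for column in zip(*filled)]


def cell_triples(
    row_labels: List[str],
    header_rows: List[List[str]],
    data: List[List[str]],
) -> List[Triple]:
    """Emit three hypothetical triples per data cell: row, column path, value."""
    paths = column_paths(header_rows)
    triples = []
    for i, row_label in enumerate(row_labels):
        for j, path in enumerate(paths):
            cell_id = f"cell_{i}_{j}"  # hypothetical subject identifier
            triples.append((cell_id, "rowHeader", row_label))
            triples.append((cell_id, "colHeaderPath", "/".join(path)))
            triples.append((cell_id, "value", data[i][j]))
    return triples


# Toy table (invented values): a two-level column header over two data rows.
header_rows = [["2010", "", "2011", ""], ["Male", "Female", "Male", "Female"]]
row_labels = ["Norway", "Sweden"]
data = [["12", "15", "13", "16"], ["20", "22", "21", "23"]]

triples = cell_triples(row_labels, header_rows, data)
for triple in triples[:3]:
    print(triple)
```

This sketch covers only the purely syntactic case, where header rows sit cleanly above the data. The tables that the abstract says need semantic knowledge, or a click on a critical cell, are outside its scope.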

Included in

Computer Engineering Commons, Electrical and Computer Engineering Commons, Other Computer Sciences Commons
