Computing, School of

Date of this Version

Summer 8-2013

Document Type

Article

Citation

International Conference on Document Analysis and Recognition (ICDAR), Washington, D.C., August 25-28, 2013

Abstract

Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only “logical layout analysis” without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the 2-dimensional structure and contents of the original source table (e.g., an HTML table) but not font size, font weight, and color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.
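
The indexing property described in the abstract can be illustrated with a small sketch. This is not the paper's guided search: it is a plain exhaustive search, offered only to show what "header paths uniquely index each data cell" and "four-quadrant partitioning" might look like in code. The function names (`find_index_point`, `segment`), the smallest-header-area ordering of candidates, and the assumption that spanning header cells have already been repeated across their span are all illustrative choices, not details taken from the paper.

```python
from itertools import product


def paths_unique(paths):
    """Return True if no header path occurs twice."""
    paths = list(paths)
    return len(set(paths)) == len(paths)


def find_index_point(grid):
    """Exhaustively search for a split point (r, c).

    Rows 0..r-1 are treated as column headers and columns 0..c-1 as
    row headers. A split is accepted when every data column has a
    distinct column-header path and every data row has a distinct
    row-header path. Candidates are tried in order of increasing
    header area r*c (an assumption made for this sketch).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    candidates = sorted(
        product(range(1, n_rows), range(1, n_cols)),
        key=lambda rc: (rc[0] * rc[1], rc),
    )
    for r, c in candidates:
        col_paths = (tuple(grid[i][j] for i in range(r)) for j in range(c, n_cols))
        row_paths = (tuple(grid[i][j] for j in range(c)) for i in range(r, n_rows))
        if paths_unique(col_paths) and paths_unique(row_paths):
            return r, c
    return None  # no split satisfies the indexing property


def segment(grid, r, c):
    """Cut the grid into four quadrants about the split point (r, c)."""
    return {
        "stub": [row[:c] for row in grid[:r]],
        "column_header": [row[c:] for row in grid[:r]],
        "row_header": [row[:c] for row in grid[r:]],
        "data": [row[c:] for row in grid[r:]],
    }


# A toy CSV-like table with a two-level column header.
table = [
    ["",     "2011", "2011", "2012", "2012"],
    ["",     "Q1",   "Q2",   "Q1",   "Q2"],
    ["Ohio", "1",    "2",    "3",    "4"],
    ["Iowa", "5",    "6",    "7",    "8"],
]

point = find_index_point(table)  # (2, 1)
regions = segment(table, *point)
```

In the toy table, one header row is not enough because "2011" and "2012" each appear twice, so the search settles on two header rows and one header column. A real table would also need handling for titles, footnotes, and blank spanning cells, which this sketch leaves out.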

Included in

Databases and Information Systems Commons, Other Computer Sciences Commons

School of Computing: Conference and Workshop Papers

Segmenting Tables via Indexing of Value Cells by Table Headers
