Computer Science and Engineering, Department of

Date of this Version

Summer 8-2013

Citation

International Conference on Document Analysis and Recognition (ICDAR), Washington, D.C., August 25-28, 2013

Comments

Copyright (c) 2013 Sharad Seth & George Nagy

Abstract

Correct segmentation of a web table into its component regions is the essential first step toward understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that the strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only “logical layout analysis,” without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the two-dimensional structure and contents of the original source table (e.g., an HTML table) but not its font size, font weight, or color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.