Computer Science and Engineering, Department of


Date of this Version



2013 12th International Conference on Document Analysis and Recognition, Pages: 887 - 891, DOI: 10.1109/ICDAR.2013.181


Copyright © 2013 IEEE. Used by permission.


Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only “logical layout analysis” without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the 2- dimensional structure and contents of the original source table (e.g., an HTML table) but not font size, font weight, and color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.