TITLE:
Automatic Table Recognition and Extraction from Heterogeneous Documents
AUTHORS:
Florence Folake Babatunde, Bolanle Adefowoke Ojokoh, Samuel Adebayo Oluwadare
KEYWORDS:
Hidden Markov Model, Table Recognition and Extraction, Hypertext Markup Language, Heterogeneous Documents
JOURNAL NAME:
Journal of Computer and Communications, Vol.3 No.12, December 30, 2015
ABSTRACT: This paper examines the automatic recognition
and extraction of tables from a large collection of heterogeneous documents.
The documents are first pre-processed and converted to HTML
code, after which an algorithm recognises the table portions of the documents.
A Hidden Markov Model (HMM) is then applied to the HTML code in order to extract
the tables. The model was trained and tested with 526
self-generated tables (321 tables for training and
205 tables for testing). The Viterbi algorithm was
implemented for the testing phase. The system was evaluated in terms of
accuracy, precision, recall and F-measure. The overall evaluation results show
88.8% accuracy, 96.8% precision, 91.7% recall and 88.8% F-measure, indicating
that the method is effective for the problem of table extraction.
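
To illustrate the decoding step mentioned in the abstract, the following is a minimal sketch of Viterbi decoding over a sequence of HTML tag tokens. It is not the authors' implementation: the two states ("outside", "table"), the tag vocabulary, and every probability below are hypothetical values chosen for illustration, since the abstract does not give the model's state set or trained parameters.

```python
import math

# Hypothetical two-state HMM; the paper's real states and parameters are not
# given in the abstract, so all values here are illustrative assumptions.
STATES = ["outside", "table"]
START_P = {"outside": 0.9, "table": 0.1}
TRANS_P = {
    "outside": {"outside": 0.8, "table": 0.2},
    "table": {"outside": 0.2, "table": 0.8},
}
# Emission probabilities over a toy vocabulary of HTML tag tokens.
EMIT_P = {
    "outside": {"p": 0.8, "table": 0.05, "tr": 0.05, "td": 0.05, "/table": 0.05},
    "table": {"p": 0.05, "table": 0.2, "tr": 0.25, "td": 0.3, "/table": 0.2},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations.

    Scores are kept in log space to avoid underflow on long sequences.
    """
    first = observations[0]
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][first]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximises the score of reaching s.
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace the best path backwards from the highest-scoring final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    tokens = ["p", "p", "table", "tr", "td", "td", "/table", "p"]
    for token, label in zip(tokens, viterbi(tokens, STATES, START_P, TRANS_P, EMIT_P)):
        print(f"{token:8s} {label}")
```

With these toy parameters, the tokens from "table" through "/table" are labelled as the table state and the surrounding "p" tokens as outside, which is the kind of segmentation a table extractor would then act on. In a trained system the probabilities would be estimated from the training tables rather than set by hand.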