• DocumentCode
    3023919
  • Title
    Table structure analysis based on cell classification and cell modification for XML document transformation
  • Author
    Ishitani, Yasuto ; Fume, Kosei ; Sumita, Kazuo
  • Author_Institution
    Corporate R&D Center, Toshiba Corp., Kawasaki, Japan
  • fYear
    2005
  • fDate
    29 Aug.-1 Sept. 2005
  • Firstpage
    1247
  • Abstract
    This paper proposes a new method of table structure analysis based on cell classification and cell modification, intended as the basis of an OCR system that can convert a variety of printed tables into XML documents in accordance with a specified XML schema. The method proceeds as follows. First, cell features defined by ruled lines, which correspond to data fields, are extracted from the input image of a table. Each cell is then classified in order to identify irregular tables, whose ruled lines do not form a grid, and is modified so that the cells form a regular arrangement. Next, the hierarchical table structure, consisting of a regular row structure of cells, is extracted from the modified regular table and described as a DOM tree. At this stage, logical objects within a cell are extracted and converted into a sub-tree of the DOM tree. Finally, the DOM tree is transformed into a target XML document by an XML parser with an information extraction process. Experimental results show that the method is effective in transforming various printed tables into various XML documents.
  • Keywords
    XML; grammars; DOM tree; OCR; XML document transformation; XML documents; XML parser; cell classification; cell modification; hierarchical table structure; information extraction; printed tables; table structure analysis; Books; Data mining; Documentation; Drugs; Image converters; Knowledge management; Optical character recognition software; Pharmaceutical technology; Technology management
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
  • ISSN
    1520-5263
  • Print_ISBN
    0-7695-2420-6
  • Type
    conf
  • DOI
    10.1109/ICDAR.2005.225
  • Filename
    1575742
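
The pipeline outlined in the abstract (classify cells, modify irregular cells into a regular arrangement, build a DOM tree of rows, serialize to XML) can be illustrated with a small sketch. This is not the authors' implementation: the `Cell` record, the span-based classification rule, the choice to split a spanning cell by repeating its text, and the `table`/`row`/`cell` element names are all assumptions made for illustration. Image processing, ruled-line extraction, and schema-driven information extraction are omitted entirely.

```python
from dataclasses import dataclass
from xml.dom import minidom


@dataclass
class Cell:
    """Hypothetical cell record; positions are indices on the finest grid."""
    row: int
    col: int
    rowspan: int
    colspan: int
    text: str


def classify(cell):
    """Assumed rule: a cell covering more than one grid slot is 'spanning'."""
    return "regular" if cell.rowspan == 1 and cell.colspan == 1 else "spanning"


def regularize(cells):
    """Split each spanning cell into unit cells that repeat its text."""
    unit_cells = []
    for c in cells:
        for r in range(c.row, c.row + c.rowspan):
            for k in range(c.col, c.col + c.colspan):
                unit_cells.append(Cell(r, k, 1, 1, c.text))
    # Order by row, then column, so rows can be emitted sequentially.
    return sorted(unit_cells, key=lambda c: (c.row, c.col))


def to_dom(cells):
    """Build a DOM tree with one <row> element per regularized grid row."""
    doc = minidom.Document()
    table = doc.createElement("table")
    doc.appendChild(table)
    rows = {}
    for c in regularize(cells):
        if c.row not in rows:
            rows[c.row] = doc.createElement("row")
            table.appendChild(rows[c.row])
        cell_el = doc.createElement("cell")
        cell_el.appendChild(doc.createTextNode(c.text))
        rows[c.row].appendChild(cell_el)
    return table


# Irregular input: "Drug" spans two rows, so the ruled lines are not gridded.
cells = [
    Cell(0, 0, 2, 1, "Drug"),
    Cell(0, 1, 1, 1, "Dose"),
    Cell(1, 1, 1, 1, "10 mg"),
]
xml_text = to_dom(cells).toxml()
```

Repeating the text of a spanning cell in each unit cell is only one possible modification strategy; it is used here because it keeps every row the same length, which makes the row structure trivially regular.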