• DocumentCode
    1572503
  • Title

    Text data compression ratio as a text attribute for a language-independent text art extraction method

  • Author

    Suzuki, Tetsuya ; Hayashi, Kazuyuki

  • Author_Institution
    Dept. of Electron. Inf. Syst., Shibaura Inst. of Technol., Saitama, Japan
  • fYear
    2010
  • Firstpage
    513
  • Lastpage
    518
  • Abstract
    Text-based pictures called text art are often used in Web pages, email text, and so on. They enrich expression in text data, but they can act as noise when the text data is processed. For example, they can be an obstacle for text-to-speech software and natural language processing. Text art extraction methods, which detect the area of text art in given text data, help to solve such problems. Previously proposed text art extraction methods, however, do not work well for text data containing more than one natural language, because they assume that a specific natural language is used in the text data. In a previous paper, we proposed a text art extraction method for multiple natural languages. That method uses an attribute based on successive occurrences of the same two characters. The attribute captures the characteristic that the same characters often appear successively in text art. In this paper, we use two data compression ratios of the text data in place of that attribute in our extraction method, namely the compression ratio by Run Length Encoding (RLE) and that by LZ77. Our experiments show that our extraction method with the RLE compression ratio works better than both the method with the LZ77 compression ratio and our previous extraction method.
  • Keywords
    data compression; encoding; natural language processing; text analysis; language-independent text art extraction; run length encoding; text attribute; text data compression ratio; text-to-speech software; Art; Character recognition; Data compression; Data mining; Dictionaries; Text recognition; Training data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Information Management (ICDIM), 2010 Fifth International Conference on
  • Conference_Location
    Thunder Bay, ON
  • Print_ISBN
    978-1-4244-7572-8
  • Type

    conf

  • DOI
    10.1109/ICDIM.2010.5664648
  • Filename
    5664648
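
The attribute described in the abstract, a compression ratio computed per piece of text, can be illustrated with a minimal Python sketch. The abstract does not define the ratio or the encoding cost model, so everything here is an assumption: the function names are hypothetical, the RLE ratio is taken as encoded size over original size with each run costing two units (character plus count), and zlib's DEFLATE (LZ77 combined with Huffman coding) is used only as a stand-in for a pure LZ77 coder. It is not the authors' implementation.

```python
import zlib


def rle_encode(text):
    """Run-length encode text into a list of (character, run_length) pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([ch, 1])      # start a new run
    return [(c, n) for c, n in runs]


def rle_compression_ratio(text):
    """Encoded size / original size under an assumed cost model:
    each run costs 2 units (character + count), each input character costs 1.
    Lower values mean more repeated adjacent characters."""
    if not text:
        return 1.0                    # assumption: empty text counts as incompressible
    return 2 * len(rle_encode(text)) / len(text)


def deflate_compression_ratio(text):
    """Stand-in for an LZ77-based ratio: compressed bytes / original bytes,
    using zlib (DEFLATE = LZ77 + Huffman coding), not pure LZ77."""
    data = text.encode("utf-8")
    if not data:
        return 1.0
    return len(zlib.compress(data)) / len(data)


if __name__ == "__main__":
    for line in ["==========", "Hello world"]:
        print(repr(line), rle_compression_ratio(line))
```

Under this cost model a line made of one repeated character scores far lower than ordinary prose, which is the property the abstract attributes to text art. A real extractor would feed such a ratio, possibly with other attributes, into a trained classifier; that step is outside this sketch.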