• DocumentCode
    2534522
  • Title
    Categorizing and extracting information from multilingual HTML documents
  • Author
    Lim, SeungJin; Ng, Yiu-Kai
  • Author_Institution
    Dept. of Comput. Sci., Utah State Univ., Logan, UT, USA
  • fYear
    2005
  • fDate
    25-27 July 2005
  • Firstpage
    415
  • Lastpage
    422
  • Abstract
    The amount of online information written in different natural languages and the number of non-English speaking Internet users have been increasing tremendously during the past decade. In order to provide high-performance access to multilingual information on the Internet, we have developed a data analysis and querying system (DatAQs) that: (i) analyzes, identifies, and categorizes languages used in HTML documents; (ii) extracts information from HTML documents of interest written in different languages; (iii) allows the user to submit queries for retrieving extracted information in the same natural language provided by the query engine of DatAQs using a menu-driven user interface; and (iv) processes the user's queries (as Boolean expressions) to generate the results. DatAQs extracts information from HTML documents that belong to various data-rich, narrow-in-breadth application domains, such as car ads, house rentals, job ads, stocks, university catalogs, etc. The average F-measure for correctly identifying HTML documents written in a particular natural language is 89%, whereas the F-measure for categorizing HTML documents belonging to the car-ads application domain is 94%.
  • Keywords
    Internet; data analysis; hypermedia markup languages; information retrieval; natural languages; Boolean expressions; DatAQs; HTML document identification; information categorization; information extraction; menu-driven user interface; multilingual HTML documents; multilingual information; narrow-in-breadth application domains; non-English speaking Internet users; online information; querying system; Catalogs; Data mining; HTML; Information analysis; Search engines; User interfaces
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database Engineering and Application Symposium, 2005. IDEAS 2005. 9th International
  • ISSN
    1098-8068
  • Print_ISBN
    0-7695-2404-4
  • Type
    conf
  • DOI
    10.1109/IDEAS.2005.15
  • Filename
    1540932