• DocumentCode
    464194
  • Title

    ClusTex: Information Extraction from HTML Pages

  • Author

    Ashraf, Fatima ; Alhajj, Reda

  • Author_Institution
    Dept. of Comput. Sci., Calgary Univ., Calgary, AB
  • Volume
    1
  • fYear
    2007
  • fDate
    21-23 May 2007
  • Firstpage
    355
  • Lastpage
    360
  • Abstract
    This paper proposes ClusTex, a system that employs clustering techniques for automatic information extraction from HTML documents containing semi-structured data. Using domain-specific information provided by the user, ClusTex parses and tokenizes the data from an HTML document, partitions it into clusters containing similar elements, and estimates an extraction rule based on the pattern of occurrence of data tokens. The extraction rule is then used to refine the clusters, and finally the output is reported. To demonstrate its effectiveness, the approach is tested in experiments on the University of Calgary Web site; the results prove comparable to those reported in the literature.
  • Keywords
    Internet; grammars; hypermedia markup languages; information retrieval; pattern clustering; text analysis; ClusTex; HTML Web document; Web page; automatic information extraction; data parsing; data token; domain-specific information; semi structured data; Automation; Clustering algorithms; Computer science; Conferences; Data mining; HTML; Markup languages; Societies; Testing; Web pages; HTML documents; clustering; information extraction; web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Information Networking and Applications Workshops, 2007, AINAW '07. 21st International Conference on
  • Conference_Location
    Niagara Falls, Ont.
  • Print_ISBN
    978-0-7695-2847-2
  • Type

    conf

  • DOI
    10.1109/AINAW.2007.119
  • Filename
    4221085