• DocumentCode
    828753
  • Title
    Employing Clustering Techniques for Automatic Information Extraction From HTML Documents
  • Author
    Ashraf, Fatima ; Ozyer, Tansel ; Alhajj, Reda
  • Author_Institution
    Dept. of Comput. Sci., Calgary Univ., Calgary, AB
  • Volume
    38
  • Issue
    5
  • fYear
    2008
  • Firstpage
    660
  • Lastpage
    673
  • Abstract
    In the past few years, there has been an exponential increase in the amount of information available on the World Wide Web. This plethora of information can be extremely beneficial for users. However, the amount of human intervention that is currently required for this is inconvenient. Information extraction (IE) systems try to solve this problem by making the task as automatic as possible. Most of the existing approaches, however, require user feedback in one form or another during the extraction. This paper proposes a system that employs clustering techniques for automatic IE from HTML documents containing semistructured data. Using domain-specific information provided by the user, the proposed system parses and tokenizes the data from an HTML document, partitions it into clusters containing similar elements, and estimates an extraction rule based on the pattern of occurrence of data tokens. The extraction rule is then used to refine clusters, and finally, the output is reported. We employed a multiobjective genetic-algorithm-based clustering approach in the process; it is capable of finding the number of clusters and the most natural clustering. The proposed approach is tested by conducting experiments on a number of Web sites from different domains. To demonstrate the effectiveness of this approach, the results of the experiments are tested against those reported in the literature, and prove comparable.
  • Keywords
    Web sites; genetic algorithms; hypermedia markup languages; information retrieval; pattern clustering; HTML document; World Wide Web; automatic information extraction; multiobjective genetic-algorithm; pattern clustering technique; user feedback; Clustering; Hypertext Markup Language (HTML) documents; Web pages; information extraction (IE)
  • fLanguage
    English
  • Journal_Title
    Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1094-6977
  • Type
    jour
  • DOI
    10.1109/TSMCC.2008.923882
  • Filename
    4591416