• DocumentCode
    441867
  • Title
    Algorithms of mining intact record from isomorphic Web page
  • Author
    Qiu, Yong ; Lan, Yong-Jie
  • Author_Institution
    Sch. of Inf. & Electron. Eng., Shanghai Inst. of Bus. & Technol., China
  • Volume
    4
  • fYear
    2005
  • fDate
    18-21 Aug. 2005
  • Firstpage
    2373
  • Abstract
    The huge amount of information available on the Web has attracted many research efforts into developing tools to extract data from Web pages. Many Web pages are generated automatically from an underlying database; therefore, the HTML structure of these pages is fairly specific and regular. Some existing algorithms, such as OMINI and MDR, can extract information from multi-record Web pages; their main idea is to identify the repetitive record structure automatically. However, Web pages that contain multiple records are actually directory pages, and the information in a directory page is not intact; the intact information resides in a lower-level Web page, called a detailed page. A detailed page holds the information of only one record, so it cannot be extracted using a duplicated-record-finding algorithm. To solve this problem of extracting intact information from the Web, the concept of the isomorphic Web page is introduced, and two algorithms are proposed: one for finding directories that have isomorphic Web pages, and the other for mining record information from isomorphic Web pages.
  • Keywords
    Internet; data mining; hypermedia markup languages; information retrieval; HTML; detailed page; directory page; duplicated record finding algorithm; isomorphic Web page; Data engineering; Data mining; Databases; Electronic mail; HTML; Local area networks; Machine learning; Software systems; Web mining; Web pages; Information Extracting; WEB; WEB mining; isomorphic webpage; webpage;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on
  • Conference_Location
    Guangzhou, China
  • Print_ISBN
    0-7803-9091-1
  • Type
    conf
  • DOI
    10.1109/ICMLC.2005.1527341
  • Filename
    1527341