• DocumentCode
    3334968
  • Title

    Visual Segmentation-Based Data Record Extraction from Web Documents

  • Author

    Longzhuang Li ; Yonghuai Liu ; Obregon, A.

  • Author_Institution
    Texas A&M Univ., Corpus Christi
  • fYear
    2007
  • fDate
    13-15 Aug. 2007
  • Firstpage
    502
  • Lastpage
    507
  • Abstract
    Semi-structured data records contained in Web pages provide useful information for shopping agents and metasearch engines. In this paper, we present a visual segmentation-based data record extraction (VSDR) method to extract data records from such Web pages. The VSDR method first segments a Web page into semantic blocks using the spatial closeness and visual resemblance of data records; neighboring and non-neighboring data records are then extracted using a compress-and-collapse technique. Experimental results show that, unlike existing methods, which generate good results only on their own test domains, VSDR is a general data record extraction method that produces stable and good results on a wide range of Web pages.
  • Keywords
    Internet; document image processing; information retrieval; Web documents; metasearch engines; semi-structured data records; visual resemblance; visual segmentation-based data record extraction; Computer science; Data mining; Databases; Engines; HTML; Humans; Metasearch; Navigation; Partitioning algorithms; Web pages;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Information Reuse and Integration, 2007. IRI 2007. IEEE International Conference on
  • Conference_Location
    Las Vegas, NV
  • Print_ISBN
    1-4244-1500-4
  • Electronic_ISBN
    1-4244-1500-4
  • Type

    conf

  • DOI
    10.1109/IRI.2007.4296670
  • Filename
    4296670