• DocumentCode
    3064794
  • Title

    A fully automated object extraction system for the World Wide Web

  • Author

    Buttler, David ; Liu, Ling ; Pu, Calton

  • Author_Institution
    Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
  • fYear
    2001
  • fDate
    Apr 2001
  • Firstpage
    361
  • Lastpage
    370
  • Abstract
    This paper presents a fully automated object extraction system, Omini. A distinct feature of Omini is the suite of algorithms and the automatically learned information extraction rules for discovering and extracting objects from dynamic Web pages or static Web pages that contain multiple object instances. We evaluated the system using more than 2,000 Web pages over 40 sites. It achieves 100% precision (returns only correct objects) and excellent recall (between 93% and 98%, with very few significant objects left out). The object boundary identification algorithms are fast, about 0.1 second per page with a simple optimization.
  • Keywords
    Internet; information resources; information retrieval; search engines; Omini; World Wide Web; dynamic Web pages; information extraction rules; object boundary identification algorithms; object extraction system; optimization; static Web pages; system evaluation; Automation; Data mining; Educational institutions; Explosives; HTML; Programming profession; Search engines; Web pages; Web sites; Writing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Distributed Computing Systems, 2001. 21st International Conference on.
  • Conference_Location
    Mesa, AZ
  • Print_ISBN
    0-7695-1077-9
  • Type

    conf

  • DOI
    10.1109/ICDSC.2001.918966
  • Filename
    918966