DocumentCode
3064794
Title
A fully automated object extraction system for the World Wide Web
Author
Buttler, David ; Liu, Ling ; Pu, Calton
Author_Institution
Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
fYear
2001
fDate
36982
Firstpage
361
Lastpage
370
Abstract
This paper presents Omini, a fully automated object extraction system. A distinct feature of Omini is its suite of algorithms and automatically learned information extraction rules for discovering and extracting objects from dynamic or static Web pages that contain multiple object instances. We evaluated the system using more than 2,000 Web pages over 40 sites. It achieves 100% precision (returns only correct objects) and excellent recall (between 98% and 99%, with very few significant objects left out). The object boundary identification algorithms are fast, taking about 0.1 second per page with a simple optimization.
Keywords
Internet; information resources; information retrieval; search engines; Internet; Omini; World Wide Web; dynamic Web pages; information extraction rules; object boundary identification algorithms; object extraction system; optimization; static Web pages; system evaluation; Automation; Data mining; Educational institutions; Explosives; HTML; Programming profession; Search engines; Web pages; Web sites; Writing;
fLanguage
English
Publisher
IEEE
Conference_Titel
Distributed Computing Systems, 2001. 21st International Conference on.
Conference_Location
Mesa, AZ
Print_ISBN
0-7695-1077-9
Type
conf
DOI
10.1109/ICDSC.2001.918966
Filename
918966