DocumentCode :
3143239
Title :
Web-scale information extraction with Vertex
Author :
Gulhane, Pankaj ; Madaan, Amit ; Mehta, Rupesh ; Ramamirtham, Jeyashankher ; Rastogi, Rajeev ; Satpal, Sandeep ; Sengamedu, Srinivasan H. ; Tengli, Ashwin ; Tiwari, Charu
Author_Institution :
Yahoo! Labs, Bangalore, India
fYear :
2011
fDate :
11-16 April 2011
Firstpage :
1209
Lastpage :
1220
Abstract :
Vertex is a wrapper induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for, among other tasks, (1) grouping similar structured pages in a Web site, (2) picking the appropriate sample pages for wrapper inference, (3) learning XPath-based extraction rules that are robust to variations in site structure, (4) detecting site changes by monitoring sample pages, and (5) optimizing editorial costs by reusing rules. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
Keywords :
Internet; information retrieval; Web-scale information extraction; XPath-based extraction rules; structured record extraction; template-based Web pages; Vertex wrapper induction system; wrapper inference; Clustering algorithms; Data mining; Humans; Monitoring; Noise measurement; Web pages
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering (ICDE), 2011 IEEE 27th International Conference on
Conference_Location :
Hannover
ISSN :
1063-6382
Print_ISBN :
978-1-4244-8959-6
Electronic_ISBN :
1063-6382
Type :
conf
DOI :
10.1109/ICDE.2011.5767842
Filename :
5767842