DocumentCode
2773690
Title
Scalable Attribute-Value Extraction from Semi-structured Text
Author
Wong, Yuk Wah ; Widdows, Dominic ; Lokovic, Tom ; Nigam, Kamal
Author_Institution
Google Inc., Pittsburgh, PA, USA
fYear
2009
fDate
6-6 Dec. 2009
Firstpage
302
Lastpage
307
Abstract
This paper describes a general methodology for extracting attribute-value pairs from Web pages. It consists of two phases: candidate generation, in which syntactically likely attribute-value pairs are annotated; and candidate filtering, in which semantically improbable annotations are removed. We describe three types of candidate generators and two types of candidate filters, all of which are designed to be massively parallelizable. Our methods can handle 1 billion Web pages in less than 6 hours with 1,000 machines. The best generator and filter combination achieves 70% F-measure compared to a hand-annotated corpus.
Keywords
data mining; information resources; F-measure; Web pages; candidate filtering; candidate generation; scalable attribute-value extraction; semistructured text; Cloud computing; Clustering algorithms; Computer networks; Conferences; Costs; Data mining; Data processing; Decision trees; Machine learning algorithms; Training data
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining Workshops, 2009. ICDMW '09. IEEE International Conference on
Conference_Location
Miami, FL
Print_ISBN
978-1-4244-5384-9
Electronic_ISBN
978-0-7695-3902-7
Type
conf
DOI
10.1109/ICDMW.2009.81
Filename
5360422