Scalable Attribute-Value Extraction from Semi-structured Text

Author

Wong, Yuk Wah ; Widdows, Dominic ; Lokovic, Tom ; Nigam, Kamal

Author_Institution

Google Inc., Pittsburgh, PA, USA

fYear

2009

fDate

6 Dec. 2009

Firstpage

302

Lastpage

307

Abstract

This paper describes a general methodology for extracting attribute-value pairs from Web pages. It consists of two phases: candidate generation, in which syntactically likely attribute-value pairs are annotated; and candidate filtering, in which semantically improbable annotations are removed. We describe three types of candidate generators and two types of candidate filters, all of which are designed to be massively parallelizable. Our methods can handle 1 billion Web pages in less than 6 hours with 1,000 machines. The best generator and filter combination achieves 70% F-measure compared to a hand-annotated corpus.
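
The two-phase structure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's generators or filters, so both functions below are hypothetical stand-ins: `generate_candidates` treats any "Attribute: value" line as a syntactic candidate, and `filter_candidates` keeps only attributes seen on at least `min_pages` pages as a crude proxy for semantic plausibility. Generation runs independently per page, which is the property that makes this kind of pipeline easy to parallelize.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]

# Assumption: a simple "Attribute: value" line pattern stands in for a
# candidate generator. The paper's actual generators are not given here.
_PAIR_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]{0,30}?)\s*:\s*(\S.*?)\s*$")


def generate_candidates(page_text: str) -> List[Pair]:
    """Phase 1: annotate syntactically likely attribute-value pairs on one page."""
    pairs = []
    for line in page_text.splitlines():
        match = _PAIR_RE.match(line)
        if match:
            pairs.append((match.group(1).lower(), match.group(2)))
    return pairs


def filter_candidates(per_page: Iterable[List[Pair]], min_pages: int = 2) -> List[Pair]:
    """Phase 2: drop pairs whose attribute appears on fewer than min_pages pages.

    Assumption: cross-page support is a hypothetical filter criterion chosen
    for illustration, not the paper's method.
    """
    per_page = list(per_page)
    support = Counter()
    for pairs in per_page:
        # Count each attribute once per page.
        for attr in {attr for attr, _ in pairs}:
            support[attr] += 1
    return [
        (attr, value)
        for pairs in per_page
        for attr, value in pairs
        if support[attr] >= min_pages
    ]


pages = [
    "Color: red\nWeight: 2 kg\nPosted: yesterday",
    "Color: blue\nWeight: 3 kg",
]
candidates = [generate_candidates(page) for page in pages]  # per-page, parallelizable
kept = filter_candidates(candidates)
```

In this toy input, "posted" occurs on only one page and is filtered out, while "color" and "weight" survive. A real system would replace both stand-ins with the generators and filters the paper describes.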

Keywords

data mining; information resources; F-measure; Web pages; candidate filtering; candidate generation; scalable attribute-value extraction; semistructured text; Cloud computing; Clustering algorithms; Computer networks; Conferences; Costs; Data processing; Decision trees; Machine learning algorithms; Training data

fLanguage

English

Publisher

IEEE

Conference_Title

Data Mining Workshops, 2009. ICDMW '09. IEEE International Conference on

Conference_Location

Miami, FL

Print_ISBN

978-1-4244-5384-9

Electronic_ISBN

978-0-7695-3902-7

Type

conf

DOI

10.1109/ICDMW.2009.81

Filename

5360422

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2773690