• DocumentCode
    2773690
  • Title
    Scalable Attribute-Value Extraction from Semi-structured Text
  • Author
    Wong, Yuk Wah ; Widdows, Dominic ; Lokovic, Tom ; Nigam, Kamal
  • Author_Institution
    Google Inc., Pittsburgh, PA, USA
  • fYear
    2009
  • fDate
    6 Dec. 2009
  • Firstpage
    302
  • Lastpage
    307
  • Abstract
    This paper describes a general methodology for extracting attribute-value pairs from Web pages. It consists of two phases: candidate generation, in which syntactically likely attribute-value pairs are annotated, and candidate filtering, in which semantically improbable annotations are removed. We describe three types of candidate generators and two types of candidate filters, all of which are designed to be massively parallelizable. Our methods can handle 1 billion Web pages in less than 6 hours with 1,000 machines. The best generator and filter combination achieves an F-measure of 70% when evaluated against a hand-annotated corpus.
  • Keywords
    data mining; information resources; F-measure; Web pages; candidate filtering; candidate generation; scalable attribute-value extraction; semistructured text; Cloud computing; Clustering algorithms; Computer networks; Conferences; Costs; Data mining; Data processing; Decision trees; Machine learning algorithms; Training data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshops, 2009. ICDMW '09. IEEE International Conference on
  • Conference_Location
    Miami, FL
  • Print_ISBN
    978-1-4244-5384-9
  • Electronic_ISBN
    978-0-7695-3902-7
  • Type
    conf
  • DOI
    10.1109/ICDMW.2009.81
  • Filename
    5360422