  • DocumentCode
    3143239
  • Title
    Web-scale information extraction with Vertex
  • Author
    Gulhane, Pankaj ; Madaan, Amit ; Mehta, Rupesh ; Ramamirtham, Jeyashankher ; Rastogi, Rajeev ; Satpal, Sandeep ; Sengamedu, Srinivasan H. ; Tengli, Ashwin ; Tiwari, Charu
  • Author_Institution
    Yahoo! Labs, Bangalore, India
  • fYear
    2011
  • fDate
    11-16 April 2011
  • Firstpage
    1209
  • Lastpage
    1220
  • Abstract
    Vertex is a wrapper induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for tasks including (1) grouping similar structured pages in a Web site, (2) picking the appropriate sample pages for wrapper inference, (3) learning XPath-based extraction rules that are robust to variations in site structure, (4) detecting site changes by monitoring sample pages, and (5) optimizing editorial costs by reusing rules. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
  • Keywords
    Internet; information retrieval; Web-scale information extraction; XPath-based extraction rules; structured record extraction; template-based Web pages; Vertex wrapper induction system; wrapper inference; Clustering algorithms; Data mining; Humans; Monitoring; Noise measurement; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2011 IEEE 27th International Conference on
  • Conference_Location
    Hannover
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4244-8959-6
  • Electronic_ISBN
    1063-6382
  • Type
    conf
  • DOI
    10.1109/ICDE.2011.5767842
  • Filename
    5767842