  • DocumentCode
    3079205
  • Title
    FastWrap: An efficient wrapper for tabular data extraction from the web
  • Author
    Amin, Mohammad Shafkat; Jamil, Hasan
  • Author_Institution
    Dept. of Comput. Sci., Wayne State Univ., Detroit, MI, USA
  • fYear
    2009
  • fDate
    10-12 Aug. 2009
  • Firstpage
    354
  • Lastpage
    359
  • Abstract
    In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database, which in turn can facilitate Web data integration and various other domain-specific applications. In this paper, we propose a novel table extraction technique that works on Web pages generated dynamically from a back-end database. The proposed system can automatically and efficiently discover table structure by mining relevant patterns from Web pages, and can generate a regular expression for the extraction process. This approach requires no human intervention, and experimental results have shown its accuracy to be promising. Moreover, the algorithm generates the wrapper in linear time.
  • Keywords
    Internet; data mining; information retrieval; tree data structures; FastWrap wrapper generation; Web data integration; Web page; automatic table structure discovery; back-end database; linear time algorithm; pattern mining; regular expression generation; suffix tree-based technique; table extraction technique; tabular data extraction process; traditional database; Computer science; Data mining; Databases; Government; Humans; Query processing; Search engines; USA Councils; Uniform resource locators; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Reuse & Integration, 2009. IRI '09. IEEE International Conference on
  • Conference_Location
    Las Vegas, NV
  • Print_ISBN
    978-1-4244-4114-3
  • Electronic_ISBN
    978-1-4244-4116-7
  • Type
    conf
  • DOI
    10.1109/IRI.2009.5211578
  • Filename
    5211578